Scikit Learn Docs PDF
Release 0.23.dev0
scikit-learn developers
1 Welcome to scikit-learn
1.1 Installing scikit-learn
1.2 Frequently Asked Questions
1.3 Support
1.4 Related Projects
1.5 About us
1.6 Who is using scikit-learn?
1.7 Release History
1.8 Roadmap
1.9 Scikit-learn governance and decision-making
5.4 Methods
5.5 Parameters
5.6 Attributes
5.7 Data and sample properties
6 Examples
6.1 Miscellaneous examples
6.2 Biclustering
6.3 Calibration
6.4 Classification
6.5 Clustering
6.6 Covariance estimation
6.7 Cross decomposition
6.8 Dataset examples
6.9 Decision Trees
6.10 Decomposition
6.11 Ensemble methods
6.12 Examples based on real world datasets
6.13 Feature Selection
6.14 Gaussian Mixture Models
6.15 Gaussian Process for Machine Learning
6.16 Generalized Linear Models
6.17 Inspection
6.18 Manifold learning
6.19 Missing Value Imputation
6.20 Model Selection
6.21 Multioutput methods
6.22 Nearest Neighbors
6.23 Neural Networks
6.24 Pipelines and composite estimators
6.25 Preprocessing
6.26 Release Highlights
6.27 Semi Supervised Classification
6.28 Support Vector Machines
6.29 Tutorial exercises
6.30 Working with text documents
7.17 sklearn.impute: Impute
7.18 sklearn.inspection: Inspection
7.19 sklearn.isotonic: Isotonic regression
7.20 sklearn.kernel_approximation: Kernel Approximation
7.21 sklearn.kernel_ridge: Kernel Ridge Regression
7.22 sklearn.linear_model: Linear Models
7.23 sklearn.manifold: Manifold Learning
7.24 sklearn.metrics: Metrics
7.25 sklearn.mixture: Gaussian Mixture Models
7.26 sklearn.model_selection: Model Selection
7.27 sklearn.multiclass: Multiclass and multilabel classification
7.28 sklearn.multioutput: Multioutput regression and classification
7.29 sklearn.naive_bayes: Naive Bayes
7.30 sklearn.neighbors: Nearest Neighbors
7.31 sklearn.neural_network: Neural network models
7.32 sklearn.pipeline: Pipeline
7.33 sklearn.preprocessing: Preprocessing and Normalization
7.34 sklearn.random_projection: Random projection
7.35 sklearn.semi_supervised: Semi-Supervised Learning
7.36 sklearn.svm: Support Vector Machines
7.37 sklearn.tree: Decision Trees
7.38 sklearn.utils: Utilities
7.39 Recently deprecated
Bibliography
Index
CHAPTER ONE
WELCOME TO SCIKIT-LEARN
1.1 Installing scikit-learn
Then run the installation command for your platform. In order to check your installation, you can use the commands shown below.
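A minimal sketch of the typical commands, assuming a standard pip-based setup (exact commands may differ by platform and distribution channel):

python -m pip install -U scikit-learn                  # install or upgrade scikit-learn
python -m pip show scikit-learn                        # show which version is installed and where
python -m pip freeze                                   # list all packages installed in the active environment
python -c "import sklearn; sklearn.show_versions()"    # print version details from within Python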
Note that in order to avoid potential conflicts with other packages it is strongly recommended to use a virtual environment, e.g. a python3 virtualenv (see the python3 virtualenv documentation) or a conda environment.
Using an isolated environment makes it possible to install a specific version of scikit-learn and its dependencies independently of any previously installed Python packages. In particular, under Linux it is discouraged to install pip packages alongside the packages managed by the package manager of the distribution (apt, dnf, pacman, ...).
Note that you should always remember to activate the environment of your choice prior to running any Python com-
mand whenever you start a new terminal session.
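For example, a minimal sketch using the built-in venv module (conda users would instead use conda create and conda activate); the environment name sklearn-env is only illustrative:

python3 -m venv sklearn-env        # create an isolated environment
source sklearn-env/bin/activate    # activate it (on Windows: sklearn-env\Scripts\activate)
pip install -U scikit-learn        # install scikit-learn inside the environment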
If you have not installed NumPy or SciPy yet, you can also install these using conda or pip. When using pip, please
ensure that binary wheels are used, and NumPy and SciPy are not recompiled from source, which can happen when
using particular configurations of operating system and hardware (such as Linux on a Raspberry Pi).
If you must install scikit-learn and its dependencies with pip, you can install it as scikit-learn[alldeps].
Scikit-learn plotting capabilities (i.e., functions starting with “plot_” and classes ending with “Display”) require Matplotlib (>= 1.5.1). Running the examples also requires Matplotlib >= 1.5.1. A few examples require scikit-image >= 0.12.3, and a few require pandas >= 0.18.0.
Warning: Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. Scikit-learn now requires
Python 3.5 or newer.
Note: For installing on PyPy, PyPy3-v5.10+, Numpy 1.14.0+, and scipy 1.1.0+ are required.
1.1.2 Third party distributions of scikit-learn
Some third-party distributions provide versions of scikit-learn integrated with their package-management systems.
These can make installation and upgrading much easier for users since the integration includes the ability to automat-
ically install dependencies (numpy, scipy) that scikit-learn requires.
The following is an incomplete list of OS and python distributions that provide their own version of scikit-learn.
Arch Linux
Arch Linux’s package is provided through the official repositories as python-scikit-learn for Python. It can
be installed by typing the following command:
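A sketch of the usual pacman invocation for the package named above (assuming the standard Arch repositories):

sudo pacman -S python-scikit-learn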
Debian/Ubuntu
The Debian/Ubuntu package is split into three packages: python3-sklearn (Python modules), python3-sklearn-lib (low-level implementations and bindings), and python3-sklearn-doc (documentation). Only the Python 3 version is available in Debian Buster (the most recent Debian distribution). Packages can be installed using apt-get:
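For example, using the package names listed above:

sudo apt-get install python3-sklearn python3-sklearn-lib python3-sklearn-doc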
Fedora
The Fedora package is called python3-scikit-learn for the Python 3 version, the only one available in Fedora 30. It can be installed using dnf:
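For example:

sudo dnf install python3-scikit-learn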
MacPorts for Mac OSX
The MacPorts package is named py<XY>-scikits-learn, where XY denotes the Python version. It can be installed by typing the following command:
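For instance, for Python 3.7 (substitute your own Python version for 37), the command would look something like:

sudo port install py37-scikits-learn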
Canopy and Anaconda both ship a recent version of scikit-learn, in addition to a large set of scientific Python libraries for Windows, Mac OSX and Linux.
Anaconda offers scikit-learn as part of its free distribution.
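With Anaconda or Miniconda, a minimal sketch of the install command is:

conda install scikit-learn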
This version of scikit-learn comes with alternative solvers for some common estimators. Those solvers come from the
DAAL C++ library and are optimized for multi-core Intel CPUs.
Note that those solvers are not enabled by default, please refer to the daal4py documentation for more details.
Compatibility with the standard scikit-learn solvers is checked by running the full scikit-learn test suite via automated
continuous integration as reported on https://github.com/IntelPython/daal4py.
1.1.3 Troubleshooting
It can happen that pip fails to install packages when reaching the default path size limit of Windows if Python is
installed in a nested location such as the AppData folder structure under the user home directory, for instance:
C:\Users\username>C:\Users\username\AppData\Local\Microsoft\WindowsApps\python.exe -m pip install scikit-learn
Collecting scikit-learn
...
Installing collected packages: scikit-learn
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: 'C:\\Users\\username\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python37\\site-packages\\sklearn\\datasets\\tests\\data\\openml\\292\\api-v1-json-data-list-data_name-australian-limit-2-data_version-1-status-deactivated.json.gz'
In this case it is possible to lift that limit in the Windows registry by using the regedit tool:
1. Type “regedit” in the Windows start menu to launch regedit.
2. Go to the Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
key.
3. Edit the value of the LongPathsEnabled property of that key and set it to 1.
4. Reinstall scikit-learn (ignoring the previous broken installation):
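A sketch of the reinstall step; the --exists-action flag tells pip to ignore the previous, broken installation rather than reuse it:

pip install --exists-action=i scikit-learn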
1.2 Frequently Asked Questions
Here we try to give some answers to questions that regularly pop up on the mailing list.
The project name is scikit-learn, not scikit or SciKit nor sci-kit learn. It is also not scikits.learn or scikits-learn, which were previously used.
There are multiple scikits, which are scientific toolboxes built around SciPy. You can find a list at https://scikits.
appspot.com/scikits. Apart from scikit-learn, another popular one is scikit-image.
See Contributing. Before adding a new algorithm, which is usually a major and lengthy undertaking, it is recommended to start with known issues. Please do not contact the contributors of scikit-learn directly regarding contributing to scikit-learn.
For general machine learning questions, please use Cross Validated with the [machine-learning] tag.
For scikit-learn usage questions, please use Stack Overflow with the [scikit-learn] and [python] tags. You
can alternatively use the mailing list.
Please make sure to include a minimal reproduction code snippet (ideally shorter than 10 lines) that highlights your
problem on a toy dataset (for instance from sklearn.datasets or randomly generated with functions of numpy.
random with a fixed random seed). Please remove any line of code that is not necessary to reproduce your problem.
The problem should be reproducible by simply copy-pasting your code snippet in a Python shell with scikit-learn
installed. Do not forget to include the import statements.
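For illustration, a minimal sketch of such a reproduction script (the estimator and dataset generator here are arbitrary placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data generated with a fixed random seed so the behaviour is reproducible
X, y = make_classification(n_samples=20, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))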
More guidance to write good reproduction code snippets can be found at:
https://stackoverflow.com/help/mcve
If your problem raises an exception that you do not understand (even after googling it), please make sure to include
the full traceback that you obtain when running the reproduction script.
For bug reports or feature requests, please make use of the issue tracker on GitHub.
There is also a scikit-learn Gitter channel where some users and developers might be found.
Please do not email any authors directly to ask for assistance, report bugs, or for any other issue related to
scikit-learn.
Don’t make a Bunch object! Bunch objects are not part of the scikit-learn API; they are just a way to package some numpy arrays. As a scikit-learn user you only ever need numpy arrays to feed your model with data.
For instance, to train a classifier, all you need is a 2D array X for the input variables and a 1D array y for the target variables. The array X holds the features as columns and the samples as rows. The array y contains integer values to encode the class membership of each sample in X.
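A minimal sketch (the choice of classifier and the data values are purely illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]])  # 4 samples (rows) x 2 features (columns)
y = np.array([0, 1, 0, 1])                                      # one integer class label per sample
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict([[0.2, 0.8]]))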
1.2.8 How can I load my own datasets into a format usable by scikit-learn?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that
are convertible to numeric arrays such as pandas DataFrame are also acceptable.
For more information on loading your data files into these usable data structures, please refer to loading external
datasets.
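For example, a sketch of loading a CSV file with pandas and fitting an estimator on its numeric columns; the file name and column names are placeholders:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("my_data.csv")             # placeholder file name
X = df[["feature_1", "feature_2"]]          # numeric feature columns (placeholders)
y = df["target"]                            # numeric target column (placeholder)
reg = LinearRegression().fit(X, y)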
1.2.9 What are the inclusion criteria for new algorithms?
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+
citations and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data
structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those which fit well within the current API of
scikit-learn, that is a fit, predict/transform interface and ordinarily having input/output that is a numpy array
or sparse matrix, are accepted.
The contributor should support the importance of the proposed addition with research papers and/or implementations
in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the
methods that are already implemented in scikit-learn at least in some areas.
Inclusion of a new algorithm speeding up an existing model is easier if:
• it does not introduce new hyper-parameters (as it makes the library more future-proof),
• it is easy to document clearly when the contribution improves the speed and when it does not, for instance “when
n_features >> n_samples”,
• benchmarks clearly show a speed up.
Also note that your implementation need not be in scikit-learn to be used together with scikit-learn tools. You can
implement your favorite algorithm in a scikit-learn compatible way, upload it to GitHub and let us know. We will be
happy to list it under Related Projects. If you already have a package on GitHub following the scikit-learn API, you
may also be interested to look at scikit-learn-contrib.
1.2.10 Why are you so selective on what algorithms you include in scikit-learn?
Code is maintenance cost, and we need to balance the amount of code we have with the size of the team (and add to this the fact that complexity scales non-linearly with the number of features). The package relies on core developers using their free time to fix bugs, maintain code and review contributions. Any algorithm that is added needs future attention by the developers, at which point the original author might long have lost interest. See also What are the inclusion criteria for new algorithms?. For a great read about long-term maintenance issues in open-source software, look at the Executive Summary of Roads and Bridges.
Not in the foreseeable future. scikit-learn tries to provide a unified API for the basic tasks in machine learning, with
pipelines and meta-algorithms like grid search to tie everything together. The required concepts, APIs, algorithms
and expertise required for structured learning are different from what scikit-learn has to offer. If we started doing
arbitrary structured learning, we’d need to redesign the whole package and the project would likely collapse under its
own weight.
There are two projects with APIs similar to scikit-learn that do structured prediction:
• pystruct handles general structured learning (focuses on SSVMs on arbitrary graph structures with approximate
inference; defines the notion of sample as an instance of the graph structure)
• seqlearn handles sequences only (focuses on exact inference; has HMMs, but mostly for the sake of complete-
ness; treats a feature vector as a sample and uses an offset encoding for the dependencies between feature
vectors)
No, or at least not in the near future. The main reason is that GPU support would introduce many software dependencies and platform-specific issues. scikit-learn is designed to be easy to install on a wide variety of platforms.
Outside of neural networks, GPUs don’t play a large role in machine learning today, and much larger gains in speed
can often be achieved by a careful choice of algorithms.
In case you didn’t know, PyPy is an alternative Python implementation with a built-in just-in-time compiler. Experi-
mental support for PyPy3-v5.10+ has been added, which requires Numpy 1.14.0+, and scipy 1.1.0+.
scikit-learn estimators assume you’ll feed them real-valued feature vectors. This assumption is hard-coded in pretty
much all of the library. However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see Text feature extraction for the built-in text vectorizers. For more general feature extraction from any kind of data, see Loading features from dicts and Feature hashing.
Another common case is when you have non-numerical data and a custom distance (or similarity) metric on these data.
Examples include strings with edit distance (aka. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with distance metrics on arbitrary data can be
done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if the dataset is not too large, you can
compute distances for all pairs of inputs. If the dataset is large, you can use feature vectors with only one “feature”,
which is an index into a separate data structure, and supply a custom metric function that looks up the actual data in
this data structure. E.g., to use DBSCAN with Levenshtein distances:
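A minimal sketch; the levenshtein function is assumed to come from a third-party package (here the leven package), and the sequences are toy data:

import numpy as np
from leven import levenshtein   # third-party edit-distance implementation (assumption)
from sklearn.cluster import DBSCAN

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])          # each "feature vector" is just an index into `data`
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)  # one "feature" per sample: its index
DBSCAN(metric=lev_metric, eps=5, min_samples=2).fit(X)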
1.2.16 Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
Several scikit-learn tools such as GridSearchCV and cross_val_score rely internally on Python’s
multiprocessing module to parallelize execution onto several Python processes by passing n_jobs > 1 as
argument.
The problem is that Python multiprocessing does a fork system call without following it with an exec system
call for performance reasons. Many libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, NVIDIA’s CUDA (and probably many others), manage their own internal thread
pool. Upon a call to fork, the thread pool state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to change the libraries to make them detect
when a fork happens and reinitialize the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a patch to GCC’s OpenMP runtime (not yet reviewed).
But in the end the real culprit is Python’s multiprocessing that does fork without exec to reduce the overhead
of starting and using new Python processes for parallel computing. Unfortunately this is a violation of the POSIX
standard and therefore some software editors like Apple refuse to consider the lack of fork-safety in Accelerate /
vecLib as a bug.
In Python 3.4+ it is now possible to configure multiprocessing to use the ‘forkserver’ or ‘spawn’ start methods
(instead of the default ‘fork’) to manage the process pools. To work around this issue when using scikit-learn, you
can set the JOBLIB_START_METHOD environment variable to ‘forkserver’. However, the user should be aware that using the ‘forkserver’ method prevents joblib.Parallel from calling functions interactively defined in a shell session.
If you have custom code that uses multiprocessing directly instead of using it via joblib, you can enable the ‘forkserver’ mode globally for your program by inserting the following instructions in your main script:
import multiprocessing

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
You can find more details on the new start methods in the multiprocessing documentation.
1.2.17 Why does my job use more cores than specified with n_jobs?
This is because n_jobs only controls the number of jobs for routines that are parallelized with joblib, but parallel
code can come from other sources:
• some routines may be parallelized with OpenMP (for code written in C or Cython).
• scikit-learn relies a lot on numpy, which in turn may rely on numerical libraries like MKL, OpenBLAS or BLIS
which can provide parallel implementations.
For more details, please refer to our Parallelism notes.
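For instance, one common way to cap that extra, non-joblib parallelism is to set the threading environment variables of the underlying runtimes before starting Python (a sketch; which variable matters depends on which BLAS/OpenMP runtime is actually installed):

export OMP_NUM_THREADS=1         # threads used by OpenMP code (e.g. C/Cython routines)
export MKL_NUM_THREADS=1         # threads used by MKL
export OPENBLAS_NUM_THREADS=1    # threads used by OpenBLAS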
1.2.18 Why is there no support for deep or reinforcement learning / Will there be
support for deep or reinforcement learning in scikit-learn?
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning
additionally requiring GPUs for efficient computing. However, neither of these fit within the design constraints of
scikit-learn; as a result, deep learning and reinforcement learning are currently out of scope for what scikit-learn seeks
to achieve.
You can find more information about addition of gpu support at Will you add GPU support?.
The scikit-learn review process takes a significant amount of time, and contributors should not be discouraged by a
lack of activity or review on their pull request. We care a lot about getting things right the first time, as maintenance
and later change comes at a high cost. We rarely release any “experimental” code, so all of our contributions will be
subject to high use immediately and should be of the highest quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the reviewers and core developers are working
on scikit-learn on their own time. If a review of your pull request comes slowly, it is likely because the reviewers are
busy. We ask for your understanding and request that you not close your pull request or discontinue your work solely
because of this reason.
For testing and replicability, it is often important to have the entire execution controlled by a single seed for the pseudo-
random number generator used in algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed is not provided as an argument, it
relies on the numpy global random state, which can be set using numpy.random.seed. For example, to set an
execution’s numpy global random state to 42, one could execute the following in their script:
import numpy as np
np.random.seed(42)
However, a global random state is prone to modification by other code during execution. Thus, the only way to ensure
replicability is to pass RandomState instances everywhere and ensure that both estimators and cross-validation
splitters have their random_state parameter set.
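A minimal sketch of passing a single RandomState instance to the data generator, the estimator and the cross-validation splitter (the particular estimator and splitter are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(42)                              # one RandomState instance...
X, y = make_classification(n_samples=100, random_state=rng)  # ...passed to the data generator,
clf = RandomForestClassifier(random_state=rng)               # ...to the estimator,
cv = KFold(n_splits=5, shuffle=True, random_state=rng)       # ...and to the CV splitter
print(cross_val_score(clf, X, y, cv=cv))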
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices of a single numeric dtype. These do
not explicitly represent categorical variables at present. Thus, unlike R’s data.frames or pandas.DataFrame, we require
explicit conversion of categorical features to numeric values, as discussed in Encoding categorical features. See also
Column Transformer with Mixed Types for an example of working with heterogeneous (e.g. categorical and numeric)
data.
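For illustration, a sketch of encoding a categorical column before fitting; the column names and values are made up:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Paris", "London", "Paris"],   # categorical feature
                   "temperature": [21.0, 18.5, 23.1]})     # numeric feature
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["city"])],
    remainder="passthrough",    # keep the numeric column unchanged
)
X = ct.fit_transform(df)        # X is now purely numeric and can be passed to any estimator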
1.2.22 Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
The homogeneous NumPy and SciPy data objects currently expected are most efficient to process for most operations.
Extensive work would also be needed to support Pandas categorical types. Restricting input to homogeneous types
therefore reduces maintenance cost and encourages usage of efficient data structures.
Currently transform only works for features X in a pipeline. There’s a long-standing discussion about not being able to transform y in a pipeline; follow GitHub issue #4143. Meanwhile, check out sklearn.compose.TransformedTargetRegressor, pipegraph, and imbalanced-learn. Note that scikit-learn handles the case where y has an invertible transformation applied before training and inverted after prediction. Scikit-learn intends to support use cases where y should be transformed at training time and not at test time, for resampling and similar uses, as in imbalanced-learn. In general, these use cases can be solved with a custom meta-estimator rather than a Pipeline.
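For the invertible-transformation case, a minimal sketch using TransformedTargetRegressor (the data and the log/exp transform pair are illustrative):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(X.ravel())              # a target that is easier to model on a log scale
reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,                   # applied to y before fitting
    inverse_func=np.exp,           # applied to predictions afterwards
)
reg.fit(X, y)
print(reg.predict(X[:3]))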
1.3 Support
• Some scikit-learn developers support users on StackOverflow using the [scikit-learn] tag.
• For general theoretical or methodological Machine Learning questions stack exchange is probably a more suit-
able venue.
In both cases please use a descriptive question in the title field (e.g. no “Please help with scikit-learn!” as this is not a
question) and put details on what you tried to achieve, what were the expected results and what you observed instead
in the details field.
Code and data snippets are welcome. A minimalistic (up to ~20 lines long) reproduction script is very helpful.
Please describe the nature of your data and how you preprocessed it: what is the number of samples, what is the number and type of features (i.e. categorical or numerical), and for supervised learning tasks, what target are you trying to predict: binary, multiclass (1 out of n_classes) or multilabel (k out of n_classes) classification, or continuous variable regression.
If you think you’ve encountered a bug, please report it to the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues
Don’t forget to include:
• steps (or better script) to reproduce,
• expected outcome,
• observed outcome or python (or gdb) tracebacks
To help developers fix your bug faster, please link to a https://gist.github.com holding a standalone minimalistic python
script that reproduces your bug and optionally a minimalistic subsample of your dataset (for instance exported as CSV
files using numpy.savetxt).
Note: gists are git cloneable repositories and thus you can use git to push datafiles to them.
1.3.4 IRC
This documentation is relative to 0.23.dev0. Documentation for other versions can be found here.
Printable pdf documentation for old versions can be found here.
1.4 Related Projects
Projects implementing the scikit-learn estimator API are encouraged to use the scikit-learn-contrib template which
facilitates best practices for testing and documenting estimators. The scikit-learn-contrib GitHub organisation also
accepts high-quality contributions of repositories conforming to this template.
Below is a list of sister-projects, extensions and domain specific packages.
These tools adapt scikit-learn for use with other technologies or otherwise enhance the functionality of scikit-learn’s
estimators.
Data formats
• sklearn_pandas bridge for scikit-learn pipelines and pandas data frame with dedicated transformers.
• sklearn_xarray provides compatibility of scikit-learn estimators with xarray data structures.
Auto-ML
• auto_ml Automated machine learning for production and analytics, built on scikit-learn and related projects. Trains a pipeline with all the standard machine learning steps. Tuned for prediction speed and ease of transfer to production environments.
• auto-sklearn An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
• TPOT An automated machine learning toolkit that optimizes a series of scikit-learn operators to design a ma-
chine learning pipeline, including data and feature preprocessors as well as the estimators. Works as a drop-in
replacement for a scikit-learn estimator.
• scikit-optimize A library to minimize (very) expensive and noisy black-box functions. It implements sev-
eral methods for sequential model-based optimization, and includes a replacement for GridSearchCV or
RandomizedSearchCV to do cross-validated parameter search using any of these strategies.
Experimentation frameworks
• REP Environment for conducting data-driven research in a consistent and reproducible way
• ML Frontend provides dataset management and SVM fitting/prediction through web-based and programmatic
interfaces.
• Scikit-Learn Laboratory A command-line wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
• Xcessiv is a notebook-like application for quick, scalable, and automated hyperparameter tuning and stacked
ensembling. Provides a framework for keeping track of model-hyperparameter combinations.
Model inspection and visualisation
• eli5 A library for debugging/inspecting machine learning models and explaining their predictions.
• mlxtend Includes model visualization utilities.
• scikit-plot A visualization library for quick and easy generation of common plots in data analysis and machine
learning.
• yellowbrick A suite of custom matplotlib visualizers for scikit-learn estimators to support visual feature analysis,
model selection, evaluation, and diagnostics.
Model export for production
• onnxmltools Serializes many Scikit-learn pipelines to ONNX for interchange and prediction.
• sklearn2pmml Serialization of a wide variety of scikit-learn estimators and transformers into PMML with the
help of JPMML-SkLearn library.
• sklearn-porter Transpile trained scikit-learn models to C, Java, Javascript and others.
• sklearn-compiledtrees Generate a C++ implementation of the predict function for decision trees (and ensembles)
trained by sklearn. Useful for latency-sensitive production environments.
Not everything belongs or is mature enough for the central scikit-learn project. The following are projects providing
interfaces similar to scikit-learn for additional learning algorithms, infrastructures and tasks.
Structured learning
• sktime A scikit-learn compatible toolbox for machine learning with time series including time series classifica-
tion/regression and (supervised/panel) forecasting.
• Seqlearn Sequence classification using HMMs or structured perceptron.
• HMMLearn Implementation of hidden Markov models that was previously part of scikit-learn.
• PyStruct General conditional random fields and structured prediction.
• pomegranate Probabilistic modelling for Python, with an emphasis on hidden Markov models.
• sklearn-crfsuite Linear-chain conditional random fields (CRFsuite wrapper with sklearn-like API).
Deep neural networks etc.
• pylearn2 A deep learning and neural network library built on Theano with a scikit-learn-like interface.
• sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally
• nolearn A number of wrappers and abstractions around existing neural network libraries
• keras Deep Learning library capable of running on top of either TensorFlow or Theano.
• lasagne A lightweight library to build and train neural networks in Theano.
• skorch A scikit-learn compatible neural network library that wraps PyTorch.
Broad scope
• mlxtend Includes a number of additional estimators as well as model visualization utilities.
• sparkit-learn Scikit-learn API and functionality for PySpark’s distributed modelling.
Other regression and classification
• xgboost Optimised gradient boosted decision tree library.
• ML-Ensemble Generalized ensemble learning (stacking, blending, subsemble, deep ensembles, etc.).
• lightning Fast state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc. . . ).
• py-earth Multivariate adaptive regression splines
• Kernel Regression Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection
• gplearn Genetic Programming for symbolic regression tasks.
• multiisotonic Isotonic regression on multidimensional features.
• scikit-multilearn Multi-label classification with focus on label space manipulation.
• seglearn Time series and sequence learning using sliding window segmentation.
Decomposition and clustering
• lda: Fast implementation of latent Dirichlet allocation in Cython which uses Gibbs sampling
to sample from the true posterior distribution. (scikit-learn’s sklearn.decomposition.
LatentDirichletAllocation implementation uses variational inference to sample from a tractable
approximation of a topic model’s posterior distribution.)
• Sparse Filtering Unsupervised feature learning based on sparse-filtering
• kmodes k-modes clustering algorithm for categorical data, and several of its variations.
• hdbscan HDBSCAN and Robust Single Linkage clustering algorithms for robust variable density clustering.
• spherecluster Spherical K-means and mixture of von Mises Fisher clustering routines for data on the unit hyper-
sphere.
Pre-processing
• categorical-encoding A library of sklearn compatible categorical variable encoders.
• imbalanced-learn Various methods to under- and over-sample datasets.
Recommendation
• GraphLab Implementation of classical recommendation techniques (in C++, with Python bindings).
• implicit, Library for implicit feedback datasets.
• lightfm A Python/Cython implementation of a hybrid recommender system.
• OpenRec TensorFlow-based neural-network inspired recommendation algorithms.
• Spotlight Pytorch-based implementation of deep recommender models.
• Surprise Lib Library for explicit feedback datasets.
1.5 About us
1.5.1 History
This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu
Brucher started work on this project as part of his thesis.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release on February 1st, 2010. Since then, several releases have appeared following a ~3-month cycle, and a thriving international community has been leading the development.
1.5.2 Governance
The decision making process and governance structure of scikit-learn is laid out in the governance document.
1.5.3 Authors
The following people are currently core contributors to scikit-learn’s development and maintenance:
Please do not email the authors directly to ask for assistance or report issues. Instead, please see What’s the best way
to ask questions about scikit-learn in the FAQ.
See also:
How you can contribute to the project
The following people have been active contributors in the past, but are no longer active in the project:
• Mathieu Blondel
• Matthieu Brucher
• Lars Buitinck
• David Cournapeau
• Noel Dawe
• Shiqiao Du
• Vincent Dubourg
• Edouard Duchesnay
• Alexander Fabisch
• Virgile Fritsch
• Satrajit Ghosh
• Angel Soler Gollonet
• Chris Gorgolewski
• Jaques Grobler
• Brian Holt
• Arnaud Joly
• Thouis (Ray) Jones
• Kyle Kastner
• manoj kumar
• Robert Layton
• Wei Li
• Paolo Losi
• Gilles Louppe
• Vincent Michel
• Jarrod Millman
• Alexandre Passos
• Fabian Pedregosa
• Peter Prettenhofer
• (Venkat) Raghav, Rajagopalan
• Jacob Schreiber
• Jake Vanderplas
• David Warde-Farley
• Ron Weiss
If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Bibtex entry:
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
If you want to cite scikit-learn for its API or design, you may also want to consider the following paper:
API design for machine learning software: experiences from the scikit-learn project, Buitinck et al., 2013.
Bibtex entry:
@inproceedings{sklearn_api,
  author    = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
               Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
               Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort and
               Jaques Grobler and Robert Layton and Jake VanderPlas and
               Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
  title     = {{API} design for machine learning software: experiences from
               the scikit-learn project},
  booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning},
  year      = {2013},
  pages     = {108--122},
}
1.5.6 Artwork
High quality PNG and SVG logos are available in the doc/logos/ source directory.
1.5.7 Funding
Scikit-learn is a community-driven project; however, institutional and private grants help to assure its sustainability.
The project would like to thank the following funders.
The Members of the Scikit-Learn Consortium at Inria Foundation fund Olivier Grisel, Guillaume Lemaitre, Jérémie
du Boisberranger and Chiara Marmo.
Andreas Müller received a grant to improve scikit-learn from the Alfred P. Sloan Foundation . This grant supports the
position of Nicolas Hug and Thomas J. Fan.
Past Sponsors
INRIA actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2017) to work on this project full-time. It also hosts coding sprints and other
events.
Paris-Saclay Center for Data Science funded one year for a developer to work on the project full-time (2014-2015),
50% of the time of Guillaume Lemaitre (2016-2017) and 50% of the time of Joris van den Bossche (2017-2018).
NYU Moore-Sloan Data Science Environment funded Andreas Mueller (2014-2016) to work on this project. The
Moore-Sloan Data Science Environment also funds several students to work on the project part-time.
Télécom Paristech funded Manoj Kumar (2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry Guille-
mot (2016-2017) and Albert Thomas (2017) to work on scikit-learn.
The Labex DigiCosme funded Nicolas Goix (2015-2016), Tom Dupré la Tour (2015-2016 and 2017-2018), Mathurin
Massias (2018-2019) to work part time on scikit-learn during their PhDs. It also funded a scikit-learn coding sprint in
2015.
The following students were sponsored by Google to work on scikit-learn through the Google Summer of Code
program.
• 2007 - David Cournapeau
• 2011 - Vlad Niculae
• 2012 - Vlad Niculae, Immanuel Bayer.
• 2013 - Kemal Eren, Nicolas Trésegnie
• 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar.
• 2015 - Raghav RV, Wei Xue
The NeuroDebian project providing Debian packaging and contributions is supported by Dr. James V. Haxby (Dart-
mouth College).
1.5.8 Sprints
The International 2019 Paris sprint was kindly hosted by AXA. Also some participants could attend thanks to the
support of the Alfred P. Sloan Foundation, the Python Software Foundation (PSF) and the DATAIA Institute.
The 2013 International Paris Sprint was made possible thanks to the support of Télécom Paristech, tinyclues, the
French Python Association and the Fonds de la Recherche Scientifique.
The 2011 International Granada sprint was made possible thanks to the support of the PSF and tinyclues.
If you are interested in donating to the project or to one of our code-sprints, you can use the Paypal button below or the
NumFOCUS Donations Page (if you use the latter, please indicate that you are donating for the scikit-learn project).
All donations will be handled by NumFOCUS, a non-profit-organization which is managed by a board of Scipy
community members. NumFOCUS’s mission is to foster scientific computing software, in particular in Python. As
a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available
while in compliance with tax regulations.
The received donations for the scikit-learn project will mostly go towards covering travel expenses for code sprints, as well as towards the organization budget of the project. (Regarding the organization budget in particular, some of the donated funds might be used to pay for other project expenses such as DNS, hosting or continuous integration services.)
Notes
• We would like to thank Rackspace for providing us with a free Rackspace Cloud account to automatically build the documentation and the example gallery for the development version of scikit-learn using this tool.
• We would also like to thank Microsoft Azure, Travis CI and CircleCI for free CPU time on their Continuous Integration servers.
1.6 Who is using scikit-learn?
1.6.1 J.P.Morgan
Scikit-learn is an indispensable part of the Python machine learning toolkit at JPMorgan. It is very widely used
across all parts of the bank for classification, predictive analytics, and very many other machine learning tasks. Its
straightforward API, its breadth of algorithms, and the quality of its documentation combine to make scikit-learn
simultaneously very approachable and very powerful.
Stephen Simmons, VP, Athena Research, JPMorgan
1.6.2 Spotify
Scikit-learn provides a toolbox with solid implementations of a bunch of state-of-the-art models and makes it easy to
plug them into existing applications. We’ve been using it quite a lot for music recommendations at Spotify and I think
it’s the most well-designed ML package I’ve seen so far.
Erik Bernhardsson, Engineering Manager Music Discovery & Machine Learning, Spotify
1.6.3 Inria
At INRIA, we use scikit-learn to support leading-edge basic research in many teams: Parietal for neuroimaging, Lear
for computer vision, Visages for medical image analysis, Privatics for security. The project is a fantastic tool to
address difficult applications of machine learning in an academic environment as it is performant and versatile, but also easy-to-use and well documented, which makes it well suited to grad students.
Gaël Varoquaux, research at Parietal
1.6.4 betaworks
Betaworks is a NYC-based startup studio that builds new products, grows companies, and invests in others. Over
the past 8 years we’ve launched a handful of social data analytics-driven services, such as Bitly, Chartbeat, digg and
Scale Model. Consistently the betaworks data science team uses Scikit-learn for a variety of tasks. From exploratory
analysis, to product development, it is an essential part of our toolkit. Recent uses are included in digg’s new video
recommender system, and Poncho’s dynamic heuristic subspace clustering.
Gilad Lotan, Chief Data Scientist
1.6.5 Hugging Face
At Hugging Face we’re using NLP and probabilistic models to generate conversational Artificial intelligences that are
fun to chat with. Despite using deep neural nets for a few of our NLP tasks, scikit-learn is still the bread-and-butter of
our daily machine learning routine. The ease of use and predictability of the interface, as well as the straightforward
mathematical explanations that are here when you need them, is the killer feature. We use a variety of scikit-learn
models in production and they are also operationally very pleasant to work with.
Julien Chaumond, Chief Technology Officer
1.6.6 Evernote
Building a classifier is typically an iterative process of exploring the data, selecting the features (the attributes of the
data believed to be predictive in some way), training the models, and finally evaluating them. For many of these tasks,
we relied on the excellent scikit-learn package for Python.
Read more
Mark Ayzenshtat, VP, Augmented Intelligence
1.6.7 Télécom ParisTech
At Telecom ParisTech, scikit-learn is used for hands-on sessions and home assignments in introductory and advanced
machine learning courses. The classes are for undergrads and masters students. The great benefit of scikit-learn is its
fast learning curve that allows students to quickly start working on interesting and motivating problems.
Alexandre Gramfort, Assistant Professor
1.6.8 Booking.com
At Booking.com, we use machine learning algorithms for many different applications, such as recommending ho-
tels and destinations to our customers, detecting fraudulent reservations, or scheduling our customer service agents.
Scikit-learn is one of the tools we use when implementing standard algorithms for prediction tasks. Its API and doc-
umentations are excellent and make it easy to use. The scikit-learn developers do a great job of incorporating state of
the art implementations and new algorithms into the package. Thus, scikit-learn provides convenient access to a wide
spectrum of algorithms, and allows us to readily find the right tool for the right job.
Melanie Mueller, Data Scientist
1.6.9 AWeber
The scikit-learn toolkit is indispensable for the Data Analysis and Management team at AWeber. It allows us to do
AWesome stuff we would not otherwise have the time or resources to accomplish. The documentation is excellent,
allowing new engineers to quickly evaluate and apply many different algorithms to our data. The text feature extraction
utilities are useful when working with the large volume of email content we have at AWeber. The RandomizedPCA
implementation, along with Pipelining and FeatureUnions, allows us to develop complex machine learning algorithms
efficiently and reliably.
Anyone interested in learning more about how AWeber deploys scikit-learn in a production environment should check
out talks from PyData Boston by AWeber’s Michael Becker available at https://github.com/mdbecker/pydata_2013
Michael Becker, Software Engineer, Data Analysis and Management Ninjas
1.6.10 Yhat
The combination of consistent APIs, thorough documentation, and top notch implementation make scikit-learn our
favorite machine learning package in Python. scikit-learn makes doing advanced analysis in Python accessible to
anyone. At Yhat, we make it easy to integrate these models into your production applications. Thus eliminating the
unnecessary dev time encountered productionizing analytical work.
Greg Lamp, Co-founder Yhat
1.6.11 Rangespan
The Python scikit-learn toolkit is a core tool in the data science group at Rangespan. Its large collection of well
documented models and algorithms allow our team of data scientists to prototype fast and quickly iterate to find the
right solution to our learning problems. We find that scikit-learn is not only the right tool for prototyping, but its
careful and well tested implementation give us the confidence to run scikit-learn models in production.
Jurgen Van Gael, Data Science Director at Rangespan Ltd
1.6.12 Birchbox
At Birchbox, we face a range of machine learning problems typical to E-commerce: product recommendation, user
clustering, inventory prediction, trends detection, etc. Scikit-learn lets us experiment with many models, especially in
the exploration phase of a new project: the data can be passed around in a consistent way; models are easy to save and
reuse; updates keep us informed of new developments from the pattern discovery research community. Scikit-learn is
an important tool for our team, built the right way in the right language.
Thierry Bertin-Mahieux, Birchbox, Data Scientist
1.6.13 Bestofmedia Group
Scikit-learn is our #1 toolkit for all things machine learning at Bestofmedia. We use it for a variety of tasks (e.g. spam
fighting, ad click prediction, various ranking models) thanks to the varied, state-of-the-art algorithm implementations
packaged into it. In the lab it accelerates prototyping of complex pipelines. In production I can say it has proven to be
robust and efficient enough to be deployed for business critical components.
Eustache Diemert, Lead Scientist Bestofmedia Group
1.6.14 Change.org
At change.org we automate the use of scikit-learn’s RandomForestClassifier in our production systems to drive email
targeting that reaches millions of users across the world each week. In the lab, scikit-learn’s ease-of-use, performance,
and overall variety of algorithms implemented has proved invaluable in giving us a single reliable source to turn to for
our machine-learning needs.
Vijay Ramesh, Software Engineer in Data/science at Change.org
1.6.15 PHIMECA Engineering
At PHIMECA Engineering, we use scikit-learn estimators as surrogates for expensive-to-evaluate numerical models
(mostly but not exclusively finite-element mechanical models) for speeding up the intensive post-processing operations
involved in our simulation-based decision making framework. Scikit-learn’s fit/predict API together with its efficient
cross-validation tools considerably eases the task of selecting the best-fit estimator. We are also using scikit-learn for
illustrating concepts in our training sessions. Trainees are always impressed by the ease-of-use of scikit-learn despite
the apparent theoretical complexity of machine learning.
Vincent Dubourg, PHIMECA Engineering, PhD Engineer
1.6.16 HowAboutWe
At HowAboutWe, scikit-learn lets us implement a wide array of machine learning techniques in analysis and in pro-
duction, despite having a small team. We use scikit-learn’s classification algorithms to predict user behavior, enabling
us to (for example) estimate the value of leads from a given traffic source early in the lead’s tenure on our site. Also, our
users’ profiles consist of primarily unstructured data (answers to open-ended questions), so we use scikit-learn’s fea-
ture extraction and dimensionality reduction tools to translate these unstructured data into inputs for our matchmaking
system.
Daniel Weitzenfeld, Senior Data Scientist at HowAboutWe
1.6.17 PeerIndex
At PeerIndex we use scientific methodology to build the Influence Graph - a unique dataset that allows us to identify
who’s really influential and in which context. To do this, we have to tackle a range of machine learning and predic-
tive modeling problems. Scikit-learn has emerged as our primary tool for developing prototypes and making quick
progress. From predicting missing data and classifying tweets to clustering communities of social media users, scikit-
learn proved useful in a variety of applications. Its very intuitive interface and excellent compatibility with other
python tools makes it an indispensable tool in our daily research efforts.
Ferenc Huszar - Senior Data Scientist at Peerindex
1.6.18 DataRobot
DataRobot is building next generation predictive analytics software to make data scientists more productive, and
scikit-learn is an integral part of our system. The variety of machine learning techniques in combination with the
solid implementations that scikit-learn offers makes it a one-stop-shopping library for machine learning in Python.
Moreover, its consistent API, well-tested code and permissive licensing allow us to use it in a production environment.
Scikit-learn has literally saved us years of work we would have had to do ourselves to bring our product to market.
Jeremy Achin, CEO & Co-founder DataRobot Inc.
1.6.19 OkCupid
We’re using scikit-learn at OkCupid to evaluate and improve our matchmaking system. The range of features it has,
especially preprocessing utilities, means we can use it for a wide variety of projects, and it’s performant enough to
handle the volume of data that we need to sort through. The documentation is really thorough, as well, which makes
the library quite easy to use.
David Koh - Senior Data Scientist at OkCupid
1.6.20 Lovely
At Lovely, we strive to deliver the best apartment marketplace, with respect to our users and our listings. From
understanding user behavior, improving data quality, and detecting fraud, scikit-learn is a regular tool for gathering
insights, predictive modeling and improving our product. The easy-to-read documentation and intuitive architecture of
the API makes machine learning both explorable and accessible to a wide range of python developers. I’m constantly
recommending that more developers and scientists try scikit-learn.
Simon Frid - Data Scientist, Lead at Lovely
1.6.21 Data Publica
Data Publica builds a new predictive sales tool for commercial and marketing teams called C-Radar. We extensively
use scikit-learn to build segmentations of customers through clustering, and to predict future customers based on past
partnerships success or failure. We also categorize companies using their website communication thanks to scikit-learn
and its machine learning algorithm implementations. Eventually, machine learning makes it possible to detect weak
signals that traditional tools cannot see. All these complex tasks are performed in an easy and straightforward way
thanks to the great quality of the scikit-learn framework.
Guillaume Lebourgeois & Samuel Charron - Data Scientists at Data Publica
1.6.22 Machinalis
Scikit-learn is the cornerstone of all the machine learning projects carried at Machinalis. It has a consistent API, a
wide selection of algorithms and lots of auxiliary tools to deal with the boilerplate. We have used it in production en-
vironments on a variety of projects including click-through rate prediction, information extraction, and even counting
sheep!
In fact, we use it so much that we’ve started to freeze our common use cases into Python packages, some of them
open-sourced, like FeatureForge . Scikit-learn in one word: Awesome.
Rafael Carrascosa, Lead developer
1.6.23 solido
Scikit-learn is helping to drive Moore’s Law, via Solido. Solido creates computer-aided design tools used by the
majority of top-20 semiconductor companies and fabs, to design the bleeding-edge chips inside smartphones, auto-
mobiles, and more. Scikit-learn helps to power Solido’s algorithms for rare-event estimation, worst-case verification,
optimization, and more. At Solido, we are particularly fond of scikit-learn’s libraries for Gaussian Process models,
large-scale regularized linear regression, and classification. Scikit-learn has increased our productivity, because for
many ML problems we no longer need to “roll our own” code. This PyData 2014 talk has details.
Trent McConaghy, founder, Solido Design Automation Inc.
1.6.24 INFONEA
We employ scikit-learn for rapid prototyping and custom-made Data Science solutions within our in-memory based
Business Intelligence Software INFONEA®. As a well-documented and comprehensive collection of state-of-the-art
algorithms and pipelining methods, scikit-learn enables us to provide flexible and scalable scientific analysis solutions.
Thus, scikit-learn is immensely valuable in realizing a powerful integration of Data Science technology within self-
service business analytics.
Thorsten Kranz, Data Scientist, Coma Soft AG.
1.6.25 Dataiku
Our software, Data Science Studio (DSS), enables users to create data services that combine ETL with Machine
Learning. Our Machine Learning module integrates many scikit-learn algorithms. The scikit-learn library is a perfect
integration with DSS because it offers algorithms for virtually all business cases. Our goal is to offer a transparent and
flexible tool that makes it easier to optimize time consuming aspects of building a data service, preparing data, and
training machine learning algorithms on all types of data.
Florian Douetteau, CEO, Dataiku
1.6.26 Otto Group
Here at Otto Group, one of the global Big Five B2C online retailers, we are using scikit-learn in all aspects of our daily
work, from data exploration to the development of machine learning applications and the productive deployment of those
services. It helps us to tackle machine learning problems ranging from e-commerce to logistics. Its consistent APIs
enabled us to build the Palladium REST-API framework around it and continuously deliver scikit-learn based services.
Christian Rammig, Head of Data Science, Otto Group
1.6.27 Zopa
At Zopa, the first ever Peer-to-Peer lending platform, we extensively use scikit-learn to run the business and optimize
our users’ experience. It powers our Machine Learning models involved in credit risk, fraud risk, marketing, and
pricing, and has been used for originating at least 1 billion GBP worth of Zopa loans. It is very well documented,
powerful, and simple to use. We are grateful for the capabilities it has provided, and for allowing us to deliver on our
mission of making money simple and fair.
Vlasios Vasileiou, Head of Data Science, Zopa
1.6.28 MARS
Scikit-Learn is integral to the Machine Learning Ecosystem at Mars. Whether we’re designing better recipes for
petfood or closely analysing our cocoa supply chain, Scikit-Learn is used as a tool for rapidly prototyping ideas and
taking them to production. This allows us to better understand and meet the needs of our consumers worldwide.
Scikit-Learn’s feature-rich toolset is easy to use and equips our associates with the capabilities they need to solve the
business challenges they face every day.
Michael Fitzke Next Generation Technologies Sr Leader, Mars Inc.
Release notes for all scikit-learn releases are linked on this page.
Tip: Subscribe to scikit-learn releases on libraries.io to be notified when new versions are released.
In Development
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• models come here
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.cluster
sklearn.datasets
sklearn.feature_extraction
sklearn.gaussian_process
sklearn.linear_model
• [FIX] Fixed a bug where a sample_weight parameter passed to the fit method of linear_model.RANSACRegressor
was not passed to the wrapped base_estimator during the fitting of the final model. #15573 by Jeremy Alexandre.
• [EFFICIENCY] linear_model.RidgeCV and linear_model.RidgeClassifierCV no longer allocate a potentially large
array to store dual coefficients for all hyperparameters during fit, nor an array to store all error or LOO predictions,
unless store_cv_values is True. #15652 by Jérôme Dockès.
• [FIX] Added a best_score_ attribute to linear_model.RidgeCV and linear_model.RidgeClassifierCV.
#15653 by Jérôme Dockès.
sklearn.model_selection
sklearn.preprocessing
sklearn.tree
• [FIX] The rotate parameter of tree.plot_tree was unused and has been deprecated. #15806 by Chiara Marmo.
sklearn.utils
Version 0.22.1 (January 2, 2020)
This is a bug-fix release primarily to resolve some packaging issues in version 0.22.0. It also includes minor
documentation improvements and some bug fixes.
Changelog
sklearn.cluster
• [FIX] cluster.KMeans with algorithm="elkan" now uses the same stopping criterion as with the
default algorithm="full". #15930 by @inder128.
sklearn.inspection
sklearn.metrics
sklearn.model_selection
sklearn.naive_bayes
sklearn.preprocessing
sklearn.semi_supervised
sklearn.utils
• [FIX] utils.check_array now correctly converts pandas DataFrames with boolean columns to floats.
#15797 by Thomas Fan.
• [FIX] utils.check_is_fitted again accepts an explicit attributes argument to check for specific
attributes as explicit markers of a fitted estimator. When no explicit attributes are provided, only the
attributes that end with an underscore and do not start with a double underscore are used as “fitted” markers.
The all_or_any argument is also no longer deprecated. This change restores some backward
compatibility with the behavior of this utility in version 0.21. #15947 by Thomas Fan.
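A rough sketch of the restored behavior (the fitted LinearRegression below is purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.utils.validation import check_is_fitted

    est = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]), [0.0, 1.0, 2.0])

    # Without attributes, any attribute ending with "_" (and not starting with
    # "__") is taken as a marker that the estimator is fitted.
    check_is_fitted(est)

    # An explicit attributes argument checks for exactly those attributes.
    check_is_fitted(est, attributes=["coef_", "intercept_"])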
Version 0.22.0 (December 3, 2019)
For a short description of the main highlights of the release, please refer to Release Highlights for scikit-learn 0.22.
Website update
Our website was revamped and given a fresh new look. #14849 by Thomas Fan.
A function or object is public if it is documented in the API Reference and if it can be imported with an import path
without leading underscores. For example sklearn.pipeline.make_pipeline is public, while
sklearn.pipeline._name_estimators is private. sklearn.ensemble._gb.BaseEnsemble is private too because
the whole _gb module is private.
Up to 0.22, some tools were de-facto public (no leading underscore), while they should have been private in the first
place. In version 0.22, these tools have been made properly private, and the public API space has been cleaned. In
addition, importing from most sub-modules is now deprecated: you should for example use
from sklearn.cluster import Birch instead of from sklearn.cluster.birch import Birch (in practice, birch.py has
been moved to _birch.py).
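For instance, the recommended and the deprecated import paths look as follows:

    # Recommended: import from the public subpackage.
    from sklearn.cluster import Birch

    # Deprecated in 0.22: importing from the (now private) submodule.
    # from sklearn.cluster.birch import Birch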
Note: All the tools in the public API should be documented in the API Reference. If you find a public tool (without
leading underscore) that isn’t in the API reference, that means it should either be private or documented. Please let us
know by opening an issue!
When deprecating a feature, previous versions of scikit-learn used to raise a DeprecationWarning. Since
DeprecationWarnings aren’t shown by default by Python, scikit-learn needed to resort to a custom warning
filter to always show the warnings. That filter would sometimes interfere with users’ custom warning filters.
Starting from version 0.22, scikit-learn shows FutureWarnings for deprecations, as recommended by the
Python documentation. FutureWarnings are always shown by default by Python, so the custom filter has been
removed and scikit-learn no longer interferes with user filters. #15080 by Nicolas Hug.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans when n_jobs=1. [FIX]
• decomposition.SparseCoder, decomposition.DictionaryLearning, and
decomposition.MiniBatchDictionaryLearning [FIX]
• decomposition.SparseCoder with algorithm='lasso_lars' [FIX]
• decomposition.SparsePCA where normalize_components has no effect due to deprecation.
• ensemble.HistGradientBoostingClassifier and
ensemble.HistGradientBoostingRegressor [FIX], [FEATURE], [ENHANCEMENT].
• impute.IterativeImputer when X has features with no missing values. [FEATURE]
• linear_model.Ridge when X is sparse. [FIX]
• model_selection.StratifiedKFold and any use of cv=int with a classifier. [FIX]
• cross_decomposition.CCA when using scipy >= 1.3 [FIX]
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.base
• [API CHANGE] From version 0.24 base.BaseEstimator.get_params will raise an AttributeError rather
than return None for parameters that are in the estimator’s constructor but not stored as attributes on the instance.
#14464 by Joel Nothman.
sklearn.calibration
sklearn.cluster
sklearn.compose
sklearn.cross_decomposition
sklearn.datasets
sklearn.decomposition
sklearn.dummy
• [FIX] dummy.DummyClassifier now handles checking the existence of the provided constant in multioutput
cases. #14908 by Martina G. Vilas.
• [API CHANGE] The default value of the strategy parameter in dummy.DummyClassifier will change
from 'stratified' in version 0.22 to 'prior' in 0.24. A FutureWarning is raised when the default value
is used. #15382 by Thomas Fan.
• [API CHANGE] The outputs_2d_ attribute is deprecated in dummy.DummyClassifier and
dummy.DummyRegressor. It is equivalent to n_outputs > 1. #14933 by Nicolas Hug.
sklearn.ensemble
sklearn.feature_extraction
• [ENHANCEMENT] A warning will now be raised if a parameter choice means that another parameter will
be unused when calling the fit() method for feature_extraction.text.HashingVectorizer,
feature_extraction.text.CountVectorizer and feature_extraction.text.TfidfVectorizer. #14602 by Gaurav Chawla.
• [FIX] Functions created by build_preprocessor and build_analyzer of
feature_extraction.text.VectorizerMixin can now be pickled. #14430 by Dillon Niederhut.
• [FIX] feature_extraction.text.strip_accents_unicode now correctly removes accents from
strings that are in NFKD normalized form. #15100 by Daniel Grady.
• [FIX] Fixed a bug that caused feature_extraction.DictVectorizer to raise an OverflowError
during the transform operation when producing a scipy.sparse matrix on large input data. #15463 by
Norvan Sahiner.
• [API CHANGE] Deprecated the unused copy parameter of feature_extraction.text.TfidfVectorizer.transform;
it will be removed in v0.24. #14520 by Guillem G. Subies.
sklearn.feature_selection
sklearn.gaussian_process
sklearn.impute
• [MAJOR FEATURE] Added impute.KNNImputer, to impute missing values using k-Nearest Neighbors
(see the example after this list). #12852 by Ashim Bhattarai and Thomas Fan and #15010 by Guillaume Lemaitre.
• [FEATURE] impute.IterativeImputer has a new skip_compute flag that is False by default, which,
when True, will skip computation on features that have no missing values during the fit phase. #13773 by
Sergey Feldman.
• [EFFICIENCY] impute.MissingIndicator.fit_transform now avoids repeated computation of the
masked matrix. #14356 by Harsh Soni.
• [FIX] impute.IterativeImputer now works when there is only one feature. By Sergey Feldman.
• [FIX] Fixed a bug in impute.IterativeImputer where features were imputed in the reverse of the desired
order with imputation_order either "ascending" or "descending". #15393 by Venkatachalam N.
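A minimal sketch of the new impute.KNNImputer mentioned above (the tiny array is purely illustrative):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = [[1, 2, np.nan],
         [3, 4, 3],
         [np.nan, 6, 5],
         [8, 8, 7]]
    # Each missing entry is replaced by the mean of that feature over the
    # 2 nearest neighbours, using nan-aware euclidean distances.
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))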
sklearn.inspection
sklearn.kernel_approximation
sklearn.linear_model
• [EFFICIENCY] The ‘liblinear’ logistic regression solver is now faster and requires less memory. #14108, #14170,
#14296 by Alex Henrie.
• [ENHANCEMENT] linear_model.BayesianRidge now accepts the hyperparameters alpha_init and
lambda_init, which can be used to set the initial values of the maximization procedure in fit (see the example
after this list). #13618 by Yoshihiro Uchida.
• [FIX] linear_model.Ridge now correctly fits an intercept when X is sparse, solver="auto" and
fit_intercept=True, because the default solver in this configuration has changed to sparse_cg, which
can fit an intercept with sparse data. #13995 by Jérôme Dockès.
• [FIX] linear_model.Ridge with solver='sag' now accepts F-ordered and non-contiguous arrays and
makes a conversion instead of failing. #14458 by Guillaume Lemaitre.
• [FIX] linear_model.LassoCV no longer forces precompute=False when fitting the final model.
#14591 by Andreas Müller.
• [FIX] linear_model.RidgeCV and linear_model.RidgeClassifierCV now score correctly
when cv=None. #14864 by Venkatachalam N.
• [FIX] Fixed a bug in linear_model.LogisticRegressionCV where the scores_, n_iter_ and
coefs_paths_ attributes would have a wrong ordering with penalty='elastic-net'. #15044 by Nicolas Hug.
• [FIX] linear_model.MultiTaskLassoCV and linear_model.MultiTaskElasticNetCV with
X of dtype int and fit_intercept=True. #15086 by Alex Gramfort.
• [FIX] The liblinear solver now supports sample_weight. #15038 by Guillaume Lemaitre.
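A small sketch of the new BayesianRidge initial values mentioned above (the synthetic data and the chosen
initial values are illustrative only):

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(50)

    # alpha_init / lambda_init seed the iterative maximization performed in fit.
    reg = BayesianRidge(alpha_init=1.0, lambda_init=1e-3).fit(X, y)
    print(reg.alpha_, reg.lambda_)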
sklearn.manifold
sklearn.metrics
• [MAJOR FEATURE] metrics.plot_roc_curve has been added to plot ROC curves. This function introduces
the visualization API described in the User Guide. #14357 by Thomas Fan.
• [FEATURE] Added a new parameter zero_division to multiple classification
metrics: precision_score, recall_score, f1_score, fbeta_score,
precision_recall_fscore_support, classification_report. This allows setting the returned
value for ill-defined metrics. #14900 by Marc Torrellas Socastro.
• [FEATURE] Added the metrics.pairwise.nan_euclidean_distances metric, which calculates
euclidean distances in the presence of missing values. #12852 by Ashim Bhattarai and Thomas Fan.
• [FEATURE] New ranking metrics metrics.ndcg_score and metrics.dcg_score have been added to
compute Discounted Cumulative Gain and Normalized Discounted Cumulative Gain. #9951 by Jérôme Dockès.
• [FEATURE] metrics.plot_precision_recall_curve has been added to plot precision-recall curves.
#14936 by Thomas Fan.
• [FEATURE] metrics.plot_confusion_matrix has been added to plot confusion matrices. #15083 by
Thomas Fan.
• [FEATURE] Added multiclass support to metrics.roc_auc_score with corresponding
scorers 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', and
'roc_auc_ovo_weighted'. #12789 and #15274 by Kathy Chen, Mohamed Maskani, and Thomas Fan.
• [FEATURE] Add metrics.mean_tweedie_deviance measuring the Tweedie deviance for a given power
parameter. Also add mean Poisson deviance metrics.mean_poisson_deviance and mean Gamma
deviance metrics.mean_gamma_deviance that are special cases of the Tweedie deviance for power=1
and power=2 respectively. #13938 by Christian Lorentzen and Roman Yurchak.
• [EFFICIENCY] Improved performance of metrics.pairwise.manhattan_distances in the case of
sparse matrices. #15049 by Paolo Toccaceli <ptocca>.
• [ENHANCEMENT] The parameter beta in metrics.fbeta_score now accepts zero and
float('+inf') as values. #13231 by Dong-hee Na.
• [ENHANCEMENT] Added a squared parameter to metrics.mean_squared_error to optionally return the root mean
squared error (see the example after this list). #13467 by Urvang Patel.
• [ENHANCEMENT] Allow computing averaged metrics in the case of no true positives. #14595 by Andreas Müller.
• [ENHANCEMENT] Multilabel metrics now support lists of lists as input. #14865 by Srivatsan Ramesh, Herilalaina
Rakotoarison, Léonard Binet.
• [ENHANCEMENT] metrics.median_absolute_error now supports the multioutput parameter. #14732
by Agamemnon Krasoulis.
• [ENHANCEMENT] ‘roc_auc_ovr_weighted’ and ‘roc_auc_ovo_weighted’ can now be used as the scoring parameter
of model-selection tools. #14417 by Thomas Fan.
• [ENHANCEMENT] metrics.confusion_matrix accepts a normalize parameter, allowing the confusion
matrix to be normalized by column, row, or overall. #15625 by Guillaume Lemaitre <glemaitre>.
• [FIX] Raise a ValueError in metrics.silhouette_score when a precomputed distance matrix contains
non-zero diagonal entries. #12258 by Stephen Tierney.
• [API CHANGE] scoring="neg_brier_score" should be used instead of
scoring="brier_score_loss", which is now deprecated. #14898 by Stefan Matcovici.
sklearn.model_selection
sklearn.multioutput
sklearn.naive_bayes
• [MAJOR FEATURE] Added naive_bayes.CategoricalNB that implements the Categorical Naive Bayes
classifier. #12569 by Tim Bicker and Florian Wilhelm.
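A minimal sketch of the new classifier (features must be encoded as non-negative category indices; the random
data below is purely illustrative):

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    rng = np.random.RandomState(1)
    X = rng.randint(5, size=(6, 10))   # 6 samples, 10 categorical features
    y = np.array([1, 2, 3, 4, 5, 6])
    clf = CategoricalNB().fit(X, y)
    print(clf.predict(X[:2]))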
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [ENHANCEMENT] pipeline.Pipeline now supports score_samples if the final estimator does. #13806 by
Anaël Beaugnon.
• [FIX] The fit in FeatureUnion now accepts fit_params to pass to the underlying transformers. #15119
by Adrin Jalali.
• [API CHANGE] None as a transformer is now deprecated in pipeline.FeatureUnion. Please use 'drop'
instead. #15053 by Thomas Fan.
sklearn.preprocessing
sklearn.model_selection
sklearn.svm
• [ENHANCEMENT] svm.SVC and svm.NuSVC now accept a break_ties parameter. This parameter
results in predict breaking ties according to the confidence values of decision_function, if
decision_function_shape='ovr' and the number of target classes > 2 (see the sketch after this list).
#12557 by Adrin Jalali.
• [ENHANCEMENT] SVM estimators now throw a more specific error when kernel='precomputed' and fit
on non-square data. #14336 by Gregory Dexter.
• [FIX] svm.SVC, svm.SVR, svm.NuSVR and svm.OneClassSVM generated an invalid model when negative or zero
values were passed for the sample_weight parameter of the fit() method. This behavior occurred only in
some border cases. In these cases, fit() now fails with an exception. #14286 by Alex Shacked.
• [FIX] The n_support_ attribute of svm.SVR and svm.OneClassSVM was previously uninitialized and
had size 2. It now has size 1 with the correct value. #15099 by Nicolas Hug.
• [FIX] Fixed a bug in BaseLibSVM._sparse_fit where n_SV=0 raised a ZeroDivisionError. #14894 by
Danna Naser.
• [FIX] The liblinear solver now supports sample_weight. #15038 by Guillaume Lemaitre.
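A rough sketch of the break_ties option described above (the synthetic three-class problem is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_classes=3, n_informative=4,
                               random_state=0)
    # With break_ties=True, decision_function_shape="ovr" and more than two
    # classes, predict breaks ties using decision_function values instead of
    # simply returning the first of the tied classes.
    clf = SVC(decision_function_shape="ovr", break_ties=True,
              random_state=0).fit(X, y)
    print(clf.predict(X[:5]))

Note that break_ties=True makes predict somewhat more expensive, since it has to evaluate decision_function.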
sklearn.tree
sklearn.utils
• [API CHANGE] The following utils have been deprecated and are now private:
– mocking.CheckingClassifier
– optimize.newton_cg
– random.random_choice_csc
– utils.choose_check_classifiers_labels
– utils.enforce_estimator_tags_y
– utils.optimize.newton_cg
– utils.random.random_choice_csc
– utils.safe_indexing
– utils.mocking
– utils.fast_dict
– utils.seq_dataset
– utils.weight_vector
– utils.fixes.parallel_helper (removed)
– All of utils.testing except for all_estimators which is now in utils.
sklearn.isotonic
Miscellaneous
• [FIX] Port lobpcg from SciPy, which implements some bug fixes but is only available in SciPy 1.3+. #13609 and #14971
by Guillaume Lemaitre.
• [API CHANGE] Scikit-learn now converts any input data structure implementing a duck array to a numpy array
(using __array__) to ensure consistent behavior instead of relying on __array_function__ (see NEP
18). #14702 by Andreas Müller.
• [API CHANGE] Replace manual checks with check_is_fitted. Errors thrown when using a non-fitted
estimator are now more uniform. #13013 by Agamemnon Krasoulis.
• Added a check that pairwise estimators raise an error on non-square data. #14336 by Gregory Dexter.
• Added two common multioutput estimator tests, check_classifier_multioutput and
check_regressor_multioutput. #13392 by Rok Mihevc.
• [FIX] Added check_transformer_data_not_an_array to checks where it was missing.
• [FIX] The estimator tags resolution now follows the regular MRO. They used to be overridable only once.
#14884 by Andreas Müller.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, in-
cluding:
Aaron Alphonsus, Abbie Popa, Abdur-Rahmaan Janhangeer, abenbihi, Abhinav Sagar, Abhishek Jana, Abraham K.
Lagat, Adam J. Stewart, Aditya Vyas, Adrin Jalali, Agamemnon Krasoulis, Alec Peters, Alessandro Surace, Alexan-
dre de Siqueira, Alexandre Gramfort, alexgoryainov, Alex Henrie, Alex Itkes, alexshacked, Allen Akinkunle, Anaël
Beaugnon, Anders Kaseorg, Andrea Maldonado, Andrea Navarrete, Andreas Mueller, Andreas Schuderer, Andrew
Nystrom, Angela Ambroz, Anisha Keshavan, Ankit Jha, Antonio Gutierrez, Anuja Kelkar, Archana Alva, arnaud-
stiegler, arpanchowdhry, ashimb9, Ayomide Bamidele, Baran Buluttekin, barrycg, Bharat Raghunathan, Bill Mill,
Biswadip Mandal, blackd0t, Brian G. Barkley, Brian Wignall, Bryan Yang, c56pony, camilaagw, cartman_nabana,
catajara, Cat Chenal, Cathy, cgsavard, Charles Vesteghem, Chiara Marmo, Chris Gregory, Christian Lorentzen, Chris-
tos Aridas, Dakota Grusak, Daniel Grady, Daniel Perry, Danna Naser, DatenBergwerk, David Dormagen, deeplook,
Dillon Niederhut, Dong-hee Na, Dougal J. Sutherland, DrGFreeman, Dylan Cashman, edvardlindelof, Eric Larson,
Eric Ndirangu, Eunseop Jeong, Fanny, federicopisanu, Felix Divo, flaviomorelli, FranciDona, Franco M. Luque, Frank
Hoang, Frederic Haase, g0g0gadget, Gabriel Altay, Gabriel do Vale Rios, Gael Varoquaux, ganevgv, gdex1, getgau-
rav2, Gideon Sonoiya, Gordon Chen, gpapadok, Greg Mogavero, Grzegorz Szpak, Guillaume Lemaitre, Guillem Gar-
cía Subies, H4dr1en, hadshirt, Hailey Nguyen, Hanmin Qin, Hannah Bruce Macdonald, Harsh Mahajan, Harsh Soni,
Honglu Zhang, Hossein Pourbozorg, Ian Sanders, Ingrid Spielman, J-A16, jaehong park, Jaime Ferrando Huertas,
James Hill, James Myatt, Jay, jeremiedbb, Jérémie du Boisberranger, jeromedockes, Jesper Dramsch, Joan Massich,
Joanna Zhang, Joel Nothman, Johann Faouzi, Jonathan Rahn, Jon Cusick, Jose Ortiz, Kanika Sabharwal, Katarina
Slama, kellycarmody, Kennedy Kang’ethe, Kensuke Arai, Kesshi Jordan, Kevad, Kevin Loftis, Kevin Winata, Kevin
Yu-Sheng Li, Kirill Dolmatov, Kirthi Shankar Sivamani, krishna katyal, Lakshmi Krishnan, Lakshya KD, LalliAcqua,
lbfin, Leland McInnes, Léonard Binet, Loic Esteve, loopyme, lostcoaster, Louis Huynh, lrjball, Luca Ionescu, Lutz
Roeder, MaggieChege, Maithreyi Venkatesh, Maltimore, Maocx, Marc Torrellas, Marie Douriez, Markus, Markus
Frey, Martina G. Vilas, Martin Oywa, Martin Thoma, Masashi SHIBATA, Maxwell Aladago, mbillingr, m-clare,
Meghann Agarwal, m.fab, Micah Smith, miguelbarao, Miguel Cabrera, Mina Naghshhnejad, Ming Li, motmoti,
mschaffenroth, mthorrell, Natasha Borders, nezar-a, Nicolas Hug, Nidhin Pattaniyil, Nikita Titov, Nishan Singh Mann,
Nitya Mandyam, norvan, notmatthancock, novaya, nxorable, Oleg Stikhin, Oleksandr Pavlyk, Olivier Grisel, Omar
Saleem, Owen Flanagan, panpiort8, Paolo, Paolo Toccaceli, Paresh Mathur, Paula, Peng Yu, Peter Marko, pierre-
tallotte, poorna-kumar, pspachtholz, qdeffense, Rajat Garg, Raphaël Bournhonesque, Ray, Ray Bell, Rebekah Kim,
Reza Gharibi, Richard Payne, Richard W, rlms, Robert Juergens, Rok Mihevc, Roman Feldbauer, Roman Yurchak,
R Sanjabi, RuchitaGarde, Ruth Waithera, Sackey, Sam Dixon, Samesh Lakhotia, Samuel Taylor, Sarra Habchi, Scott
Gigante, Scott Sievert, Scott White, Sebastian Pölsterl, Sergey Feldman, SeWook Oh, she-dares, Shreya V, Shub-
ham Mehta, Shuzhe Xiao, SimonCW, smarie, smujjiga, Sönke Behrends, Soumirai, Sourav Singh, stefan-matcovici,
steinfurt, Stéphane Couvreur, Stephan Tulkens, Stephen Cowley, Stephen Tierney, SylvainLan, th0rwas, theoptips,
theotheo, Thierno Ibrahima DIOP, Thomas Edwards, Thomas J Fan, Thomas Moreau, Thomas Schmitt, Tilen Kusterle,
Tim Bicker, Timsaur, Tim Staley, Tirth Patel, Tola A, Tom Augspurger, Tom Dupré la Tour, topisan, Trevor Stephens,
ttang131, Urvang Patel, Vathsala Achar, veerlosar, Venkatachalam N, Victor Luzgin, Vincent Jeanselme, Vincent
Lostanlen, Vladimir Korolev, vnherdeiro, Wenbo Zhao, Wendy Hu, willdarnell, William de Vazelhes, wolframalpha,
xavier dupré, xcjason, x-martian, xsat, xun-tang, Yinglr, yokasre, Yu-Hang “Maxin” Tang, Yulia Zamriy, Zhao Feng
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• The v0.20.0 release notes failed to mention a backwards incompatibility in metrics.make_scorer
when needs_proba=True and y_true is binary. Now, the scorer function is supposed to accept a 1D
y_pred (i.e., probability of the positive class, shape (n_samples,)), instead of a 2D y_pred (i.e., shape
(n_samples, 2)).
Changelog
sklearn.cluster
• [FIX] Fixed a bug in cluster.KMeans where computation with init='random' was single threaded for
n_jobs > 1 or n_jobs = -1. #12955 by Prabakaran Kumaresshan.
• [FIX] Fixed a bug in cluster.OPTICS where users were unable to pass float min_samples and
min_cluster_size. #14496 by Fabian Klopfer and Hanmin Qin.
• [FIX] Fixed a bug in cluster.KMeans where KMeans++ initialisation could rarely result in an IndexError.
#11756 by Joel Nothman.
sklearn.compose
sklearn.datasets
sklearn.ensemble
sklearn.impute
sklearn.inspection
sklearn.linear_model
sklearn.neighbors
sklearn.tree
• [FIX] Fixed a bug in tree.export_text when the tree has one feature and a single feature name is passed in.
#14053 by Thomas Fan.
• [FIX] Fixed an issue with plot_tree where it displayed entropy calculations even for the gini criterion in
DecisionTreeClassifiers. #13947 by Frank Hoang.
24 May 2019
Changelog
sklearn.decomposition
sklearn.metrics
sklearn.preprocessing
• [FIX] Fixed a bug in preprocessing.OneHotEncoder where the new drop parameter was not reflected
in get_feature_names. #13894 by James Myatt.
sklearn.utils.sparsefuncs
• [FIX] Fixed a bug where min_max_axis would fail on 32-bit systems for certain large inputs. This
affects preprocessing.MaxAbsScaler, preprocessing.normalize and
preprocessing.LabelBinarizer. #13741 by Roddy MacSween.
Version 0.21.1 (May 17, 2019)
This is a bug-fix release primarily to resolve some packaging issues in version 0.21.0. It also includes minor
documentation improvements and some bug fixes.
Changelog
sklearn.inspection
• [FIX] Fixed a bug in inspection.partial_dependence to only check classifier and not regressor for
the multiclass-multioutput case. #14309 by Guillaume Lemaitre.
sklearn.metrics
sklearn.neighbors
May 2019
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• discriminant_analysis.LinearDiscriminantAnalysis for multiclass classification. [FIX]
• discriminant_analysis.LinearDiscriminantAnalysis with ‘eigen’ solver. [FIX]
• linear_model.BayesianRidge [FIX]
• Decision trees and derived ensembles when both max_depth and max_leaf_nodes are set. [FIX]
• linear_model.LogisticRegression and linear_model.LogisticRegressionCV with
‘saga’ solver. [FIX]
• ensemble.GradientBoostingClassifier [FIX]
• sklearn.feature_extraction.text.HashingVectorizer,
sklearn.feature_extraction.text.TfidfVectorizer, and
sklearn.feature_extraction.text.CountVectorizer [FIX]
• neural_network.MLPClassifier [FIX]
• svm.SVC.decision_function and multiclass.OneVsOneClassifier.decision_function. [FIX]
• linear_model.SGDClassifier and any derived classifiers. [FIX]
• Any model using the linear_model._sag.sag_solver function with a 0 seed, including
linear_model.LogisticRegression, linear_model.LogisticRegressionCV,
linear_model.Ridge, and linear_model.RidgeCV with ‘sag’ solver. [FIX]
• linear_model.RidgeCV when using generalized cross-validation with sparse inputs. [FIX]
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
• The default max_iter for linear_model.LogisticRegression is too small for many solvers given
the default tol. In particular, we accidentally changed the default max_iter for the liblinear solver from
1000 to 100 iterations in #3591 released in version 0.16. In a future release we hope to choose better default
max_iter and tol heuristically depending on the solver (see #13317).
Changelog
Support for Python 3.4 and below has been officially dropped.
sklearn.base
• [API CHANGE] The R2 score used when calling score on a regressor will use
multioutput='uniform_average' from version 0.23 to keep consistent with metrics.r2_score.
This will influence the score method of all the multioutput regressors (except for
multioutput.MultiOutputRegressor). #13157 by Hanmin Qin.
sklearn.calibration
• [ENHANCEMENT] Added support for binning the data passed into calibration.calibration_curve by
quantiles instead of uniformly between 0 and 1 (see the example after this list). #13086 by Scott Cole.
• [ENHANCEMENT] Allow n-dimensional arrays as input for calibration.CalibratedClassifierCV.
#13485 by William de Vazelhes.
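A small sketch of quantile binning in calibration_curve (the probabilities below are illustrative, and the
strategy="quantile" spelling is the parameter assumed here):

    import numpy as np
    from sklearn.calibration import calibration_curve

    y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
    y_prob = np.array([0.1, 0.2, 0.3, 0.45, 0.5, 0.65, 0.7, 0.8, 0.9])
    # strategy="quantile" places the bin edges at quantiles of y_prob instead
    # of spacing them uniformly between 0 and 1.
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=3,
                                             strategy="quantile")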
sklearn.cluster
sklearn.compose
sklearn.datasets
• [FIX] Added support for 64-bit group IDs and pointers in SVMLight files. #10727 by Bryan K Woods.
• [FIX] datasets.load_sample_images returns images with a deterministic order. #13250 by Thomas Fan.
sklearn.decomposition
sklearn.discriminant_analysis
sklearn.dummy
• [FIX] Fixed a bug in dummy.DummyClassifier where the predict_proba method was returning an int32
array instead of float64 for the stratified strategy. #13266 by Christos Aridas.
• [FIX] Fixed a bug in dummy.DummyClassifier where it was throwing a dimension mismatch error at
prediction time if a column vector y with shape=(n, 1) was given at fit time. #13545 by Nick Sorros and
Adrin Jalali.
sklearn.ensemble
• [MAJOR FEATURE] Added two new implementations of gradient boosting trees:
ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor.
The implementation of these estimators is inspired by LightGBM and can be orders of magnitude faster than
ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier
when the number of samples is larger than tens of thousands. The API of these new estimators
is slightly different, and some of the features of ensemble.GradientBoostingClassifier and
ensemble.GradientBoostingRegressor are not yet supported.
These new estimators are experimental, which means that their results or their API might change without any
deprecation cycle. To use them, you need to explicitly import enable_hist_gradient_boosting:
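A sketch of the explicit import referred to above:

    # Explicitly enable the experimental feature before the estimator import.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    # Now the estimators can be imported as usual from sklearn.ensemble.
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.ensemble import HistGradientBoostingRegressor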
sklearn.externals
• [API CHANGE] Deprecated externals.six since we have dropped support for Python 2.7. #12916 by
Hanmin Qin.
sklearn.feature_extraction
sklearn.impute
• [MAJOR FEATURE] Added impute.IterativeImputer, which is a strategy for imputing missing values
by modeling each feature with missing values as a function of other features in a round-robin fashion. #8478
and #12177 by Sergey Feldman and Ben Lawson.
The API of IterativeImputer is experimental and subject to change without any deprecation cycle. To use it,
you need to explicitly import enable_iterative_imputer:
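A sketch of the explicit import referred to above:

    # Explicitly enable the experimental feature before the estimator import.
    from sklearn.experimental import enable_iterative_imputer  # noqa
    # Now IterativeImputer can be imported as usual from sklearn.impute.
    from sklearn.impute import IterativeImputer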
• [FEATURE] impute.SimpleImputer and impute.IterativeImputer have a new add_indicator parameter,
which simply stacks an impute.MissingIndicator transform into the output of the imputer’s transform. That
allows a predictive estimator to account for missingness. #12583, #13601 by Danylo Baibak.
• [FIX] In impute.MissingIndicator, avoid implicit densification by raising an exception if the input is sparse
and the missing_values property is set to 0. #13240 by Bartosz Telenczuk.
• [FIX] Fixed two bugs in impute.MissingIndicator. First, when X is sparse, all the non-zero non-missing
values used to become explicit False in the transformed data. Second, when features='missing-only',
all features used to be kept if there were no missing values at all. #13562 by Jérémie du Boisberranger.
sklearn.inspection
(new subpackage)
• [FEATURE] Partial dependence plots (inspection.plot_partial_dependence) are now supported for
any regressor or classifier (provided that they have a predict_proba method). #12599 by Trevor Stephens
and Nicolas Hug.
sklearn.isotonic
sklearn.linear_model
• [ENHANCEMENT] linear_model.Ridge now preserves float32 and float64 dtypes. #8769 and
#11000 by Guillaume Lemaitre and Joan Massich.
• [FEATURE] linear_model.LogisticRegression and linear_model.LogisticRegressionCV now support the
Elastic-Net penalty with the ‘saga’ solver (see the example after this list). #11646 by Nicolas Hug.
• [FEATURE] Added linear_model.lars_path_gram, which is linear_model.lars_path in the
sufficient stats mode, allowing users to compute linear_model.lars_path without providing X and y.
#11699 by Kuai Yu.
• [EFFICIENCY] linear_model.make_dataset now preserves float32 and float64 dtypes, reducing
memory consumption in stochastic gradient, SAG and SAGA solvers. #8769 and #11000 by Nelle Varoquaux,
Arthur Imbert, Guillaume Lemaitre, and Joan Massich.
• [ENHANCEMENT] linear_model.LogisticRegression now supports an unregularized objective when
penalty='none' is passed. This is equivalent to setting C=np.inf with l2 regularization. Not supported
by the liblinear solver. #12860 by Nicolas Hug.
• [ENHANCEMENT] The sparse_cg solver in linear_model.Ridge now supports fitting the intercept (i.e.
fit_intercept=True) when inputs are sparse. #13336 by Bartosz Telenczuk.
• [ENHANCEMENT] The coordinate descent solver used in Lasso, ElasticNet, etc. now issues a
ConvergenceWarning when it completes without meeting the desired tolerance. #11754 and #13397
by Brent Fagan and Adrin Jalali.
• [FIX] Fixed a bug in linear_model.LogisticRegression and linear_model.LogisticRegressionCV with the ‘saga’
solver, where the weights would not be correctly updated in some cases. #11646 by Tom Dupre la Tour.
• [FIX] Fixed the posterior mean, posterior covariance and returned regularization parameters in
linear_model.BayesianRidge. The posterior mean and the posterior covariance were not the ones
computed with the last update of the regularization parameters, and the returned regularization parameters were
not the final ones. Also fixed the formula of the log marginal likelihood used to compute the score when
compute_score=True. #12174 by Albert Thomas.
• [FIX] Fixed a bug in linear_model.LassoLarsIC, where the user input copy_X=False at instance creation
would be overridden by the default parameter value copy_X=True in fit. #12972 by Lucio Fernandez-Arjona.
• [FIX] Fixed a bug in linear_model.LinearRegression that was not returning the same coefficients
and intercepts with fit_intercept=True in the sparse and dense cases. #13279 by Alexandre Gramfort.
• [FIX] Fixed a bug in linear_model.HuberRegressor that was broken when X was of dtype bool. #13328
by Alexandre Gramfort.
• [FIX] Fixed a performance issue of the saga and sag solvers when called in a joblib.Parallel setting with
n_jobs > 1 and backend="threading", causing them to perform worse than in the sequential case.
#13389 by Pierre Glaser.
• [FIX] Fixed a bug in linear_model.stochastic_gradient.BaseSGDClassifier that was not
deterministic when trained in a multi-class setting on several threads. #13422 by Clément Doumouro.
• [FIX] Fixed a bug in linear_model.ridge_regression, linear_model.Ridge
and linear_model.RidgeClassifier that caused an unhandled exception for the arguments
return_intercept=True and solver=auto (default) or any other solver different from sag.
#13363 by Bartosz Telenczuk.
• [FIX] linear_model.ridge_regression will now raise an exception if return_intercept=True
and solver is different from sag. Previously, only a warning was issued. #13363 by Bartosz Telenczuk.
• [FIX] linear_model.ridge_regression will choose the sparse_cg solver for sparse inputs when
solver=auto and sample_weight is provided (previously the cholesky solver was selected). #13363 by
Bartosz Telenczuk.
• [API CHANGE] The use of linear_model.lars_path with X=None while passing Gram is deprecated in
version 0.21 and will be removed in version 0.23. Use linear_model.lars_path_gram instead. #11699
by Kuai Yu.
• [API CHANGE] linear_model.logistic_regression_path is deprecated in version 0.21 and will be
removed in version 0.23. #12821 by Nicolas Hug.
• [FIX] linear_model.RidgeCV with generalized cross-validation now correctly fits an intercept when
fit_intercept=True and the design matrix is sparse. #13350 by Jérôme Dockès.
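A minimal sketch of the new Elastic-Net penalty mentioned above (the synthetic data, l1_ratio and max_iter
values are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    # penalty="elasticnet" requires the "saga" solver and an l1_ratio between
    # 0 (pure l2) and 1 (pure l1).
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, max_iter=5000).fit(X, y)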
sklearn.manifold
sklearn.metrics
• [FEATURE] Added the metrics.max_error metric and a corresponding 'max_error' scorer for single
output regression. #12232 by Krishna Sangeeth.
sklearn.mixture
• [FIX] Fixed a bug in mixture.BaseMixture and therefore in estimators based on it, i.e.
mixture.GaussianMixture and mixture.BayesianGaussianMixture, where fit_predict and fit.predict
were not equivalent. #13142 by Jérémie du Boisberranger.
sklearn.model_selection
• [FEATURE] Classes GridSearchCV and RandomizedSearchCV now allow for refit=callable to add flexibility
in identifying the best estimator. See Balance model complexity and cross-validated score. #11354 by
sklearn.multiclass
sklearn.multioutput
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [FEATURE] pipeline.Pipeline can now use indexing notation (e.g. my_pipeline[0:-1]) to extract a
subsequence of steps as another Pipeline instance. A Pipeline can also be indexed directly to extract a particular
step (e.g. my_pipeline['svc']), rather than accessing named_steps (see the sketch after this list).
#2568 by Joel Nothman.
• [FEATURE] Added an optional parameter verbose in pipeline.Pipeline, compose.ColumnTransformer and
pipeline.FeatureUnion and the corresponding make_ helpers for showing
progress and timing of each step. #11364 by Baze Petrushev, Karan Desai, Joel Nothman, and Thomas Fan.
• [ENHANCEMENT] pipeline.Pipeline now supports using 'passthrough' as a transformer, with the
same effect as None. #11144 by Thomas Fan.
• [ENHANCEMENT] pipeline.Pipeline implements __len__ and therefore len(pipeline) returns the
number of steps in the pipeline. #13439 by Lakshya KD.
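A short sketch of pipeline indexing and the other conveniences above (the two-step pipeline is illustrative):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())], verbose=True)

    sub_pipe = pipe[:-1]    # a new Pipeline with every step except the last
    svc_step = pipe["svc"]  # a single step, like pipe.named_steps["svc"]
    n_steps = len(pipe)     # __len__ returns the number of steps (2 here)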
sklearn.preprocessing
• [FEATURE] preprocessing.OneHotEncoder now supports dropping one feature per category with a new
drop parameter (see the example after this list). #12908 by Drew Johnston.
• [EFFICIENCY] preprocessing.OneHotEncoder and preprocessing.OrdinalEncoder now handle
pandas DataFrames more efficiently. #13253 by @maikia.
• [EFFICIENCY] Make preprocessing.MultiLabelBinarizer cache class mappings instead of calculating
them every time on the fly. #12116 by Ekaterina Krivich and Joel Nothman.
• [EFFICIENCY] preprocessing.PolynomialFeatures now supports compressed sparse row (CSR) matrices
as input for degrees 2 and 3. This is typically much faster than the dense case as it scales with matrix
density and expansion degree (on the order of density^degree), and is much, much faster than the compressed
sparse column (CSC) case. #12197 by Andrew Nystrom.
• [EFFICIENCY] Speed improvement in preprocessing.PolynomialFeatures in the dense case. Also
added a new parameter order which controls output order for further speed performance. #12251 by Tom
Dupre la Tour.
• [FIX] Fixed the calculation overflow when using a float16 dtype with preprocessing.StandardScaler.
#13007 by Raffaello Baluyot.
• [FIX] Fixed a bug in preprocessing.QuantileTransformer and preprocessing.quantile_transform
to force n_quantiles to be at most equal to n_samples. Values of n_quantiles
larger than n_samples were either useless or resulted in a wrong approximation of the cumulative distribution
function estimator. #13333 by Albert Thomas.
• [API CHANGE] The default value of copy in preprocessing.quantile_transform will change from
False to True in 0.23 in order to make it more consistent with the default copy values of other functions in
preprocessing and prevent unexpected side effects by modifying the value of X inplace. #13459 by Hunter
McGushion.
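A minimal sketch of the new drop parameter (the toy column of colours is purely illustrative):

    from sklearn.preprocessing import OneHotEncoder

    X = [["red"], ["green"], ["blue"], ["green"]]
    # drop="first" removes one dummy column per feature, which avoids the
    # perfect collinearity of a full one-hot encoding.
    enc = OneHotEncoder(drop="first", sparse=False)
    print(enc.fit_transform(X))
    print(enc.drop_idx_)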
sklearn.svm
sklearn.tree
• [FEATURE] Decision trees can now be plotted with matplotlib using tree.plot_tree without relying on the
dot library, removing a hard-to-install dependency (see the sketch after this list). #8508 by Andreas Müller.
• [FEATURE] Decision trees can now be exported in a human readable textual format using
tree.export_text. #6261 by Giuseppe Vettigli <JustGlowing>.
• [FEATURE] get_n_leaves() and get_depth() have been added to tree.BaseDecisionTree
and consequently all estimators based on it, including tree.DecisionTreeClassifier,
tree.DecisionTreeRegressor, tree.ExtraTreeClassifier, and tree.ExtraTreeRegressor.
#12300 by Adrin Jalali.
• [FIX] Trees and forests did not previously predict multi-output classification targets with string labels, despite
accepting them in fit. #11458 by Mitar Milutinovic.
• [FIX] Fixed an issue with tree.BaseDecisionTree and consequently all estimators based
on it, including tree.DecisionTreeClassifier, tree.DecisionTreeRegressor,
tree.ExtraTreeClassifier, and tree.ExtraTreeRegressor, where they used to exceed the given
max_depth by 1 while expanding the tree if max_leaf_nodes and max_depth were both specified by
the user. Please note that this also affects all ensemble methods using decision trees. #12344 by Adrin Jalali.
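A small sketch of the two new export helpers (iris and the shallow tree are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # Human readable text export, no plotting backend needed.
    print(export_text(clf, feature_names=["sepal length", "sepal width",
                                          "petal length", "petal width"]))

    # Matplotlib-based plot of the same tree, no graphviz/dot required.
    plot_tree(clf, filled=True)
    plt.show()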
sklearn.utils
• [FEATURE] utils.resample now accepts a stratify parameter for sampling according to class distributions.
#13549 by Nicolas Hug.
• [API CHANGE] Deprecated the warn_on_dtype parameter from utils.check_array and utils.check_X_y.
Added an explicit warning for dtype conversion in check_pairwise_arrays if the metric
being passed is a pairwise boolean metric. #13382 by Prathmesh Savale.
Multiple modules
• [MAJOR FEATURE] The __repr__() method of all estimators (used when calling print(estimator))
has been entirely re-written, building on Python’s pretty printing standard library. All parameters are printed
by default, but this can be altered with the print_changed_only option in sklearn.set_config
(see the sketch after this list). #11705 by Nicolas Hug.
• [MAJOR FEATURE] Added estimator tags: these are annotations of estimators that allow programmatic inspection
of their capabilities, such as sparse matrix support, supported output types and supported methods. Estimator
tags also determine the tests that are run on an estimator when check_estimator is called. Read more in
the User Guide. #8022 by Andreas Müller.
• [EFFICIENCY] Memory copies are avoided when casting arrays to a different dtype in multiple estimators. #11973
by Roman Yurchak.
• [FIX] Fixed a bug in the implementation of the our_rand_r helper function that was not behaving consistently
across platforms. #13422 by Madhura Parikh and Clément Doumouro.
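A tiny sketch of the print_changed_only option mentioned in the first item above:

    from sklearn import set_config
    from sklearn.linear_model import LogisticRegression

    set_config(print_changed_only=True)
    # Only parameters that differ from their defaults are printed.
    print(LogisticRegression(C=0.5))   # -> LogisticRegression(C=0.5)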
Miscellaneous
• [ENHANCEMENT] Joblib is no longer vendored in scikit-learn, and becomes a dependency. The minimal supported
version is joblib 0.11, however using version >= 0.13 is strongly recommended. #13531 by Roman Yurchak.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.20, in-
cluding:
adanhawth, Aditya Vyas, Adrin Jalali, Agamemnon Krasoulis, Albert Thomas, Alberto Torres, Alexandre Gramfort,
amourav, Andrea Navarrete, Andreas Mueller, Andrew Nystrom, assiaben, Aurélien Bellet, Bartosz Michałowski,
Bartosz Telenczuk, bauks, BenjaStudio, bertrandhaut, Bharat Raghunathan, brentfagan, Bryan Woods, Cat Chenal,
Cheuk Ting Ho, Chris Choe, Christos Aridas, Clément Doumouro, Cole Smith, Connossor, Corey Levinson, Dan
Ellis, Dan Stine, Danylo Baibak, daten-kieker, Denis Kataev, Didi Bar-Zev, Dillon Gardner, Dmitry Mottl, Dmitry
Vukolov, Dougal J. Sutherland, Dowon, drewmjohnston, Dror Atariah, Edward J Brown, Ekaterina Krivich, Eliza-
beth Sander, Emmanuel Arias, Eric Chang, Eric Larson, Erich Schubert, esvhd, Falak, Feda Curic, Federico Caselli,
Frank Hoang, Fibinse Xavier‘, Finn O’Shea, Gabriel Marzinotto, Gabriel Vacaliuc, Gabriele Calvo, Gael Varoquaux,
GauravAhlawat, Giuseppe Vettigli, Greg Gandenberger, Guillaume Fournier, Guillaume Lemaitre, Gustavo De Mari
Pereira, Hanmin Qin, haroldfox, hhu-luqi, Hunter McGushion, Ian Sanders, JackLangerman, Jacopo Notarstefano,
jakirkham, James Bourbeau, Jan Koch, Jan S, janvanrijn, Jarrod Millman, jdethurens, jeremiedbb, JF, joaak, Joan
Massich, Joel Nothman, Jonathan Ohayon, Joris Van den Bossche, josephsalmon, Jérémie Méhault, Katrin Leinwe-
ber, ken, kms15, Koen, Kossori Aruku, Krishna Sangeeth, Kuai Yu, Kulbear, Kushal Chauhan, Kyle Jackson, Lakshya
KD, Leandro Hermida, Lee Yi Jie Joel, Lily Xiong, Lisa Sarah Thomas, Loic Esteve, louib, luk-f-a, maikia, mail-liam,
Manimaran, Manuel López-Ibáñez, Marc Torrellas, Marco Gaido, Marco Gorelli, MarcoGorelli, marineLM, Mark
Hannel, Martin Gubri, Masstran, mathurinm, Matthew Roeschke, Max Copeland, melsyt, mferrari3, Mickaël Schoent-
gen, Ming Li, Mitar, Mohammad Aftab, Mohammed AbdelAal, Mohammed Ibraheem, Muhammad Hassaan Rafique,
mwestt, Naoya Iijima, Nicholas Smith, Nicolas Goix, Nicolas Hug, Nikolay Shebanov, Oleksandr Pavlyk, Oliver
Rausch, Olivier Grisel, Orestis, Osman, Owen Flanagan, Paul Paczuski, Pavel Soriano, pavlos kallis, Pawel Sendyk,
peay, Peter, Peter Cock, Peter Hausamann, Peter Marko, Pierre Glaser, pierretallotte, Pim de Haan, Piotr Szymański,
Prabakaran Kumaresshan, Pradeep Reddy Raamana, Prathmesh Savale, Pulkit Maloo, Quentin Batista, Radostin Stoy-
anov, Raf Baluyot, Rajdeep Dua, Ramil Nugmanov, Raúl García Calvo, Rebekah Kim, Reshama Shaikh, Rohan
Lekhwani, Rohan Singh, Rohan Varma, Rohit Kapoor, Roman Feldbauer, Roman Yurchak, Romuald M, Roopam
Sharma, Ryan, Rüdiger Busche, Sam Waterbury, Samuel O. Ronsin, SandroCasagrande, Scott Cole, Scott Lowe, Se-
bastian Raschka, Shangwu Yao, Shivam Kotwalia, Shiyu Duan, smarie, Sriharsha Hatwar, Stephen Hoover, Stephen
Tierney, Stéphane Couvreur, surgan12, SylvainLan, TakingItCasual, Tashay Green, thibsej, Thomas Fan, Thomas
J Fan, Thomas Moreau, Tom Dupré la Tour, Tommy, Tulio Casagrande, Umar Farouk Umar, Utkarsh Upadhyay,
Vinayak Mehta, Vishaal Kapoor, Vivek Kumar, Vlad Niculae, vqean3, Wenhao Zhang, William de Vazelhes, xhan,
Xing Han Lu, xinyuliu12, Yaroslav Halchenko, Zach Griffith, Zach Miller, Zayd Hammoudeh, Zhuyi Xue, Zijie (ZJ)
Poh, ^__^
Changelog
sklearn.cluster
• [FIX] Fixed a bug in cluster.KMeans where KMeans++ initialisation could rarely result in an IndexError.
#11756 by Joel Nothman.
sklearn.compose
sklearn.decomposition
sklearn.model_selection
• [FIX] Fixed a bug where model_selection.StratifiedKFold shuffled each class’s samples with the
same random_state, making shuffle=True ineffective. #13124 by Hanmin Qin.
sklearn.neighbors
March 1, 2019
This is a bug-fix release with some minor documentation improvements and enhancements to features released in
0.20.0.
Changelog
sklearn.cluster
• [FIX] Fixed a bug in cluster.KMeans where computation was single threaded when n_jobs > 1 or
n_jobs = -1. #12949 by Prabakaran Kumaresshan.
sklearn.compose
• [FIX] Fixed a bug in compose.ColumnTransformer to handle negative indexes in the columns list of the
transformers. #12946 by Pierre Tallotte.
sklearn.covariance
sklearn.decomposition
sklearn.datasets
sklearn.feature_extraction
sklearn.impute
• [FIX] Added support for non-numeric data in sklearn.impute.MissingIndicator, which was not supported,
while sklearn.impute.SimpleImputer supported this for some imputation strategies.
#13046 by Guillaume Lemaitre.
sklearn.linear_model
sklearn.preprocessing
sklearn.svm
• [FIX] Fixed a bug in svm.SVC, svm.NuSVC, svm.SVR, svm.NuSVR and svm.OneClassSVM where the
scale option of the parameter gamma was erroneously defined as 1 / (n_features * X.std()). It is now
defined as 1 / (n_features * X.var()). #13221 by Hanmin Qin.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• sklearn.neighbors when metric=='jaccard' (bug fix)
• use of 'seuclidean' or 'mahalanobis' metrics in some cases (bug fix)
Changelog
sklearn.compose
sklearn.metrics
sklearn.neighbors
sklearn.utils
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• decomposition.IncrementalPCA (bug fix)
Changelog
sklearn.cluster
• [EFFICIENCY] Make cluster.MeanShift no longer try to do nested parallelism as the overhead would hurt
performance significantly when n_jobs > 1. #12159 by Olivier Grisel.
• [FIX] Fixed a bug in cluster.DBSCAN with a precomputed sparse neighbors graph, which would add explicit
zeros on the diagonal even when already present. #12105 by Tom Dupre la Tour.
sklearn.compose
• [FIX] Fixed an issue in compose.ColumnTransformer when stacking columns with types not convertible
to a numeric type. #11912 by Adrin Jalali.
• [API CHANGE] compose.ColumnTransformer now applies the sparse_threshold even if all transformation
results are sparse. #12304 by Andreas Müller.
sklearn.datasets
• [FIX] datasets.fetch_openml to correctly use the local cache. #12246 by Jan N. van Rijn.
• [FIX] datasets.fetch_openml to correctly handle ignore attributes and row id attributes. #12330 by Jan
N. van Rijn.
• [FIX] Fixed integer overflow in datasets.make_classification for values of the n_informative parameter
larger than 64. #10811 by Roman Feldbauer.
• [FIX] Fixed the olivetti faces dataset DESCR attribute to point to the right location in
datasets.fetch_olivetti_faces. #12441 by Jérémie du Boisberranger.
• [FIX] datasets.fetch_openml to retry downloading when reading from local cache fails. #12517 by
Thomas Fan.
sklearn.decomposition
sklearn.ensemble
sklearn.feature_extraction
sklearn.linear_model
sklearn.metrics
sklearn.mixture
sklearn.neighbors
sklearn.preprocessing
sklearn.utils
• [FIX] Use float64 for the mean accumulator to avoid floating point precision issues in
preprocessing.StandardScaler and decomposition.IncrementalPCA when using float32 datasets. #12338 by
bauks.
• [FIX] Calling utils.check_array on pandas.Series, which raised an error in 0.20.0, now returns the
expected output again. #12625 by Andreas Müller.
Miscellaneous
• [FIX] When using site joblib by setting the environment variable SKLEARN_SITE_JOBLIB, added compatibility
with joblib 0.11 in addition to 0.12+. #12350 by Joel Nothman and Roman Yurchak.
• [FIX] Make sure to avoid raising FutureWarning when calling np.vstack with numpy 1.16 and later (use
list comprehensions instead of generator expressions in many locations of the scikit-learn code base). #12467
by Olivier Grisel.
• [API CHANGE] Removed all mentions of sklearn.externals.joblib, and deprecated joblib
methods exposed in sklearn.utils, except for utils.parallel_backend and
utils.register_parallel_backend, which allow users to configure parallel computation in scikit-learn. Other
functionalities are part of the joblib package and should be used directly, by installing it. The goal of this change
is to prepare for unvendoring joblib in a future version of scikit-learn. #12345 by Thomas Moreau.
Warning: Version 0.20 is the last version of scikit-learn to support Python 2.7 and Python 3.4. Scikit-learn 0.21
will require Python 3.5 or higher.
Highlights
We have tried to improve our support for common data-science use-cases including missing values, categorical vari-
ables, heterogeneous data, and features/targets with unusual distributions. Missing values in features, represented by
NaNs, are now accepted in column-wise preprocessing such as scalers. Each feature is fitted disregarding NaNs, and
data containing NaNs can be transformed. The new impute module provides estimators for learning despite missing
data.
ColumnTransformer handles the case where different features or columns of a pandas.DataFrame need dif-
ferent preprocessing. String or pandas Categorical columns can now be encoded with OneHotEncoder or
OrdinalEncoder.
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.MeanShift (bug fix)
• decomposition.IncrementalPCA in Python 2 (bug fix)
• decomposition.SparsePCA (bug fix)
• ensemble.GradientBoostingClassifier (bug fix affecting feature importances)
• isotonic.IsotonicRegression (bug fix)
• linear_model.ARDRegression (bug fix)
• linear_model.LogisticRegressionCV (bug fix)
• linear_model.OrthogonalMatchingPursuit (bug fix)
• linear_model.PassiveAggressiveClassifier (bug fix)
• linear_model.PassiveAggressiveRegressor (bug fix)
• linear_model.Perceptron (bug fix)
• linear_model.SGDClassifier (bug fix)
• linear_model.SGDRegressor (bug fix)
• metrics.roc_auc_score (bug fix)
• metrics.roc_curve (bug fix)
• neural_network.BaseMultilayerPerceptron (bug fix)
• neural_network.MLPClassifier (bug fix)
• neural_network.MLPRegressor (bug fix)
• The v0.19.0 release notes failed to mention a backwards incompatibility with model_selection.
StratifiedKFold when shuffle=True due to #7823.
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
sklearn.cluster
sklearn.compose
• New module.
sklearn.covariance
sklearn.datasets
• [MAJOR FEATURE] Added datasets.fetch_openml to fetch datasets from OpenML. OpenML is a free,
open data sharing platform and will be used instead of mldata as it provides better service availability. #9908 by
Andreas Müller and Jan N. van Rijn.
• [FEATURE] In datasets.make_blobs, one can now pass a list to the n_samples parameter to indicate
the number of samples to generate per cluster (see the example after this list). #8617 by Maskani Filali Mohamed
and Konstantinos Katrioplas.
• [FEATURE] Added a filename attribute to datasets that have a CSV file. #9101 by alex-33 and Maskani Filali
Mohamed.
• [FEATURE] The return_X_y parameter has been added to several dataset loaders. #10774 by Chris Catalfo.
• [FIX] Fixed a bug in datasets.load_boston which had a wrong data point. #10795 by Takeshi Yoshizawa.
• [FIX] Fixed a bug in datasets.load_iris which had two wrong data points. #11082 by Sadhana Srinivasan
and Hanmin Qin.
• [FIX] Fixed a bug in datasets.fetch_kddcup99, where data were not properly shuffled. #9731 by Nicolas Goix.
• [FIX] Fixed a bug in datasets.make_circles, where no odd number of data points could be generated.
#10045 by Christian Braune.
• [API CHANGE] Deprecated sklearn.datasets.fetch_mldata to be removed in version 0.22.
mldata.org is no longer operational. Until removal it will remain possible to load cached datasets. #11466 by
Joel Nothman.
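A minimal sketch of per-cluster sample counts in make_blobs mentioned above (the cluster sizes are illustrative):

    from sklearn.datasets import make_blobs

    # A list for n_samples gives the number of samples per cluster:
    # three clusters of sizes 10, 20 and 5.
    X, y = make_blobs(n_samples=[10, 20, 5], n_features=2, random_state=0)
    print(X.shape)            # (35, 2)
    print(list(y).count(2))   # 5 samples in the third cluster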
sklearn.decomposition
sklearn.discriminant_analysis
sklearn.dummy
• [FEATURE] dummy.DummyRegressor now has a return_std option in its predict method. The returned
standard deviations will be zeros.
• [FEATURE] dummy.DummyClassifier and dummy.DummyRegressor now only require X to be an object
with finite length or shape. #9832 by Vrishank Bhardwaj.
• [FEATURE] dummy.DummyClassifier and dummy.DummyRegressor can now be scored without supplying
test samples. #11951 by Rüdiger Busche.
sklearn.ensemble
sklearn.feature_extraction
sklearn.feature_selection
sklearn.gaussian_process
when using return_std=True in particular more when called several times in a row. #9234 by andrewww
and Minghui Liu.
sklearn.impute
sklearn.isotonic
sklearn.linear_model
sklearn.manifold
• [EFFICIENCY] Speed improvements for both ‘exact’ and ‘barnes_hut’ methods in manifold.TSNE. #10593
and #10610 by Tom Dupre la Tour.
• [FEATURE] Support sparse input in manifold.Isomap.fit. #8554 by Leland McInnes.
• [FEATURE] manifold.t_sne.trustworthiness accepts metrics other than Euclidean. #9775 by
William de Vazelhes.
• [FIX] Fixed a bug in manifold.spectral_embedding where the normalization of the spectrum was
using a division instead of a multiplication. #8129 by Jan Margeta, Guillaume Lemaitre, and Devansh D.
• [API CHANGE] [FEATURE] Deprecated the precomputed parameter in function manifold.t_sne.trustworthiness.
Instead, the new parameter metric should be used with any compatible metric including
‘precomputed’, in which case the input matrix X should be a matrix of pairwise distances or squared
distances. #9775 by William de Vazelhes.
sklearn.metrics
• [MAJOR FEATURE] Added the metrics.davies_bouldin_score metric for evaluation of clustering models
without a ground truth. #10827 by Luis Osa.
• [MAJOR FEATURE] Added the metrics.balanced_accuracy_score metric and a corresponding
'balanced_accuracy' scorer for binary and multiclass classification. #8066 by @xyguo and Aman
Dalmia, and #10587 by Joel Nothman.
• [FEATURE] Partial AUC is available via the max_fpr parameter in metrics.roc_auc_score. #3840 by
Alexander Niederbühl.
• [FEATURE] A scorer based on metrics.brier_score_loss is also available. #9521 by Hanmin Qin.
• [FEATURE] Added control over the normalization in metrics.normalized_mutual_info_score and
metrics.adjusted_mutual_info_score via the average_method parameter. In version 0.22, the
default normalizer for each will become the arithmetic mean of the entropies of each clustering. #11124 by
Arya McCarthy.
• [FEATURE] Added an output_dict parameter in metrics.classification_report to return classification
statistics as a dictionary. #11160 by Dan Barkhorn.
• [FEATURE] metrics.classification_report now reports all applicable averages on the given data, including
micro, macro and weighted average as well as samples average for multilabel data. #11679 by Alexander Pacha.
• [FEATURE] metrics.average_precision_score now supports binary y_true other than {0, 1} or
{-1, 1} through the pos_label parameter. #9980 by Hanmin Qin.
• [FEATURE] metrics.label_ranking_average_precision_score now supports
sample_weight. #10845 by Jose Perez-Parras Toledano.
• [FEATURE] Added a dense_output parameter to metrics.pairwise.linear_kernel. When False and
both inputs are sparse, it will return a sparse matrix. #10999 by Taylor G Smith.
• [E FFICIENCY ] metrics.silhouette_score and metrics.silhouette_samples are more mem-
ory efficient and run faster. This avoids some reported freezes and MemoryErrors. #11135 by Joel Nothman.
• [F IX ] Fixed a bug in metrics.precision_recall_fscore_support when truncated
range(n_labels) is passed as value for labels. #10377 by Gaurav Dhingra.
• [F IX ] Fixed a bug due to floating point error in metrics.roc_auc_score with non-integer sample weights.
#9786 by Hanmin Qin.
• [F IX ] Fixed a bug where metrics.roc_curve sometimes starts on y-axis instead of (0, 0), which is in-
consistent with the document and other implementations. Note that this will not influence the result from
metrics.roc_auc_score #10093 by alexryndin and Hanmin Qin.
• [F IX ] Fixed a bug to avoid integer overflow. Casted product to 64 bits integer in metrics.
mutual_info_score. #9772 by Kumar Ashutosh.
• [F IX ] Fixed a bug where metrics.average_precision_score will sometimes return nan when
sample_weight contains 0. #9980 by Hanmin Qin.
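A minimal sketch of a few of the new metrics options above (assuming scikit-learn >= 0.20; the labels and scores are made up for illustration):

    from sklearn.metrics import (balanced_accuracy_score, classification_report,
                                 roc_auc_score)

    y_true = [0, 0, 0, 1, 1, 1, 1, 1]
    y_pred = [0, 1, 0, 1, 1, 1, 0, 1]
    y_score = [0.1, 0.6, 0.2, 0.8, 0.9, 0.7, 0.4, 0.95]

    print(balanced_accuracy_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score, max_fpr=0.5))   # partial AUC up to FPR 0.5
    report = classification_report(y_true, y_pred, output_dict=True)
    print(report["macro avg"]["f1-score"])               # statistics as a dictionary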
sklearn.mixture
sklearn.model_selection
sklearn.multioutput
sklearn.naive_bayes
• [Major Feature] Added naive_bayes.ComplementNB, which implements the Complement Naive Bayes classifier described in Rennie et al. (2003) (see the sketch after this list). #8190 by Michael A. Alcorn.
• [Feature] Added a var_smoothing parameter in naive_bayes.GaussianNB to give precise control over the variance calculation. #9681 by Dmitry Mottl.
• [Fix] Fixed a bug in naive_bayes.GaussianNB which incorrectly raised an error for a prior list which summed to 1. #10005 by Gaurav Dhingra.
• [Fix] Fixed a bug in naive_bayes.MultinomialNB which did not accept vector-valued pseudocounts (alpha). #10346 by Tobias Madsen.
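A minimal sketch of the naive_bayes additions (assuming scikit-learn >= 0.20; the random data is illustrative only):

    import numpy as np
    from sklearn.naive_bayes import ComplementNB, GaussianNB

    # ComplementNB expects non-negative (e.g. count) features.
    X_counts = np.random.RandomState(0).randint(0, 5, size=(20, 6))
    y = np.array([0] * 15 + [1] * 5)  # imbalanced classes, where CNB is designed to help
    print(ComplementNB().fit(X_counts, y).predict(X_counts[:3]))

    # var_smoothing sets the portion of the largest variance added to all
    # variances for numerical stability.
    X_cont = np.random.RandomState(1).randn(20, 3)
    print(GaussianNB(var_smoothing=1e-8).fit(X_cont, y).predict(X_cont[:3]))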
sklearn.neighbors
sklearn.neural_network
sklearn.pipeline
• [Feature] The predict method of pipeline.Pipeline now passes keyword arguments on to the pipeline’s last estimator, enabling the use of parameters such as return_std in a pipeline, with caution (see the sketch after this list). #9304 by Breno Freitas.
• [API Change] pipeline.FeatureUnion now supports 'drop' as a transformer to drop features. #11144 by Thomas Fan.
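A minimal sketch of the keyword pass-through (assuming scikit-learn >= 0.20), here combined with the DummyRegressor return_std option described earlier; the data are illustrative only:

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X = np.arange(20, dtype=float).reshape(-1, 1)
    y = X.ravel() * 2.0

    pipe = Pipeline([("scale", StandardScaler()), ("reg", DummyRegressor())])
    pipe.fit(X, y)
    # return_std is forwarded to the last step, DummyRegressor.predict
    y_pred, y_std = pipe.predict(X, return_std=True)
    print(y_pred[:3], y_std[:3])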
sklearn.preprocessing
• [API Change] The NaN marker for missing values has been changed between preprocessing.Imputer and impute.SimpleImputer: missing_values='NaN' should now be missing_values=np.nan (see the sketch after this list). #11211 by Jeremie du Boisberranger.
• [API Change] In preprocessing.FunctionTransformer, the default of validate will change from True to False in 0.22. #10655 by Guillaume Lemaitre.
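A minimal sketch of the new NaN marker convention with impute.SimpleImputer (assuming scikit-learn >= 0.20):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
    imp = SimpleImputer(missing_values=np.nan, strategy="mean")
    print(imp.fit_transform(X))  # NaNs replaced by the column means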
sklearn.svm
• [Fix] Fixed a bug in svm.SVC where, when the kernel argument is a unicode string in Python 2, the predict_proba method was raising an unexpected TypeError given dense inputs. #10412 by Jiongyan Zhang.
• [API Change] Deprecate the random_state parameter in svm.OneClassSVM as the underlying implementation is not random. #9497 by Albert Thomas.
• [API Change] The default value of the gamma parameter of svm.SVC, NuSVC, SVR, NuSVR and OneClassSVM will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. #8361 by Gaurav Dhingra and Ting Neo.
sklearn.tree
• [Enhancement] Although private (and hence not assured API stability), tree._criterion.ClassificationCriterion and tree._criterion.RegressionCriterion may now be cimported and extended. #10325 by Camil Staps.
• [Fix] Fixed a bug in tree.BaseDecisionTree with splitter="best" where split threshold could become infinite when values in X were near infinite. #10536 by Jonathan Ohayon.
• [Fix] Fixed a bug in tree.MAE to ensure sample weights are being used during the calculation of tree MAE impurity. Previous behaviour could cause suboptimal splits to be chosen since the impurity calculation considered all samples to be of equal weight importance. #11464 by John Stott.
sklearn.utils
Multiple modules
• [Feature] [API Change] More consistent outlier detection API: add a score_samples method in svm.OneClassSVM, ensemble.IsolationForest, neighbors.LocalOutlierFactor and covariance.EllipticEnvelope. It allows access to the raw score functions from the original papers. A new offset_ attribute allows linking the score_samples and decision_function methods. The contamination parameter of the ensemble.IsolationForest and neighbors.LocalOutlierFactor decision_function methods is used to define this offset_ such that outliers (resp. inliers) have negative (resp. positive) decision_function values. By default, contamination is kept unchanged at 0.1 for a deprecation period. In 0.22, it will be set to “auto”, thus using method-specific score offsets. In the covariance.EllipticEnvelope decision_function method, the raw_values parameter is deprecated as the shifted Mahalanobis distance will always be returned in 0.22. #9015 by Nicolas Goix.
• [Feature] [API Change] A behaviour parameter has been introduced in ensemble.IsolationForest to ensure backward compatibility. In the old behaviour, the decision_function is independent of the contamination parameter; a threshold attribute depending on the contamination parameter is thus used. In the new behaviour the decision_function is dependent on the contamination parameter, in such a way that 0 becomes its natural threshold to detect outliers (see the sketch after this list). Setting behaviour to “old” is deprecated and will not be possible in version 0.22. Besides, the behaviour parameter will be removed in 0.24. #11553 by Nicolas Goix.
• [API Change] Added a convergence warning to svm.LinearSVC and linear_model.LogisticRegression when verbose is set to 0. #10881 by Alexandre Sevin.
• [API Change] Changed the warning type from UserWarning to exceptions.ConvergenceWarning for failing convergence in linear_model.logistic_regression_path, linear_model.RANSACRegressor, linear_model.ridge_regression, gaussian_process.GaussianProcessRegressor, gaussian_process.GaussianProcessClassifier, decomposition.fastica, cross_decomposition.PLSCanonical, cluster.AffinityPropagation, and cluster.Birch. #10306 by Jonathan Siebert.
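A minimal sketch of the unified outlier-detection API (assuming scikit-learn >= 0.22, where the new behaviour described above is the default; the generated data are illustrative only):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = np.r_[rng.randn(100, 2), rng.uniform(low=-6, high=6, size=(10, 2))]

    clf = IsolationForest(contamination="auto", random_state=0).fit(X)
    raw = clf.score_samples(X)          # raw scores from the original paper
    shifted = clf.decision_function(X)  # raw scores minus the learned offset_
    print(np.allclose(shifted, raw - clf.offset_))  # True in recent releases
    print((clf.predict(X) == -1).sum())              # number of flagged outliers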
Miscellaneous
• [Major Feature] A new configuration parameter, working_memory, was added to control memory consumption limits in chunked operations, such as the new metrics.pairwise_distances_chunked (see the sketch after this list). See Limiting Working Memory. #10280 by Joel Nothman and Aman Dalmia.
• [Feature] The version of joblib bundled with Scikit-learn is now 0.12. This uses a new default multiprocessing implementation, named loky. While this may incur some memory and communication overhead, it should provide greater cross-platform stability than relying on the Python standard library multiprocessing. #11741 by the Joblib developers, especially Thomas Moreau and Olivier Grisel.
• [Feature] An environment variable to use the site joblib instead of the vendored one was added (Environment variables). The main API of joblib is now exposed in sklearn.utils. #11166 by Gael Varoquaux.
• [Feature] Add almost complete PyPy 3 support. Known unsupported functionalities are datasets.load_svmlight_file, feature_extraction.FeatureHasher and feature_extraction.text.HashingVectorizer. For running on PyPy, PyPy3-v5.10+, NumPy 1.14.0+, and SciPy 1.1.0+ are required. #11010 by Ronan Lamy and Roman Yurchak.
• [Feature] A utility method sklearn.show_versions was added to print out information relevant for debugging. It includes the user system, the Python executable, the versions of the main libraries and BLAS binding information. #11596 by Alexandre Boucaud.
• [Fix] Fixed a bug when setting parameters on a meta-estimator, involving both a wrapped estimator and its parameter. #9999 by Marcus Voss and Joel Nothman.
• [Fix] Fixed a bug where calling sklearn.base.clone was not thread safe and could result in a “pop from empty list” error. #9569 by Andreas Müller.
• [API Change] The default value of n_jobs is changed from 1 to None in all related functions and classes. n_jobs=None means unset. It will generally be interpreted as n_jobs=1, unless the current joblib.Parallel backend context specifies otherwise (see the Glossary for additional information). Note that this change happens immediately (i.e., without a deprecation cycle). #11741 by Olivier Grisel.
• [Fix] Fixed a bug in validation helpers where passing a Dask DataFrame results in an error. #12462 by Zachariah Miller.
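A minimal sketch of the working_memory setting together with the chunked distance helper, plus the new debugging utility (assuming scikit-learn >= 0.20; the array size and the 64 MiB limit are illustrative only):

    import numpy as np
    import sklearn
    from sklearn.metrics import pairwise_distances_chunked

    X = np.random.RandomState(0).randn(1000, 20)

    with sklearn.config_context(working_memory=64):  # cap each chunk at ~64 MiB
        n_chunks = sum(1 for _ in pairwise_distances_chunked(X))
    print(n_chunks)

    sklearn.show_versions()  # prints system, Python, library and BLAS information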
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.19, in-
cluding:
211217613, Aarshay Jain, absolutelyNoWarranty, Adam Greenhall, Adam Kleczewski, Adam Richie-Halford, adelr,
AdityaDaflapurkar, Adrin Jalali, Aidan Fitzgerald, aishgrt1, Akash Shivram, Alan Liddell, Alan Yee, Albert Thomas,
Alexander Lenail, Alexander-N, Alexandre Boucaud, Alexandre Gramfort, Alexandre Sevin, Alex Egg, Alvaro Perez-
Diaz, Amanda, Aman Dalmia, Andreas Bjerre-Nielsen, Andreas Mueller, Andrew Peng, Angus Williams, Aniruddha
Dave, annaayzenshtat, Anthony Gitter, Antonio Quinonez, Anubhav Marwaha, Arik Pamnani, Arthur Ozga, Artiem
K, Arunava, Arya McCarthy, Attractadore, Aurélien Bellet, Aurélien Geron, Ayush Gupta, Balakumaran Manoha-
ran, Bangda Sun, Barry Hart, Bastian Venthur, Ben Lawson, Benn Roth, Breno Freitas, Brent Yi, brett koonce,
Caio Oliveira, Camil Staps, cclauss, Chady Kamar, Charlie Brummitt, Charlie Newey, chris, Chris, Chris Catalfo,
Chris Foster, Chris Holdgraf, Christian Braune, Christian Hirsch, Christian Hogan, Christopher Jenness, Clement
Joudet, cnx, cwitte, Dallas Card, Dan Barkhorn, Daniel, Daniel Ferreira, Daniel Gomez, Daniel Klevebring, Danielle
Shwed, Daniel Mohns, Danil Baibak, Darius Morawiec, David Beach, David Burns, David Kirkby, David Nichol-
son, David Pickup, Derek, Didi Bar-Zev, diegodlh, Dillon Gardner, Dillon Niederhut, dilutedsauce, dlovell, Dmitry
Mottl, Dmitry Petrov, Dor Cohen, Douglas Duhaime, Ekaterina Tuzova, Eric Chang, Eric Dean Sanchez, Erich Schu-
bert, Eunji, Fang-Chieh Chou, FarahSaeed, felix, Félix Raimundo, fenx, filipj8, FrankHui, Franz Wompner, Freija
Descamps, frsi, Gabriele Calvo, Gael Varoquaux, Gaurav Dhingra, Georgi Peev, Gil Forsyth, Giovanni Giuseppe
Costa, gkevinyen5418, goncalo-rodrigues, Gryllos Prokopis, Guillaume Lemaitre, Guillaume “Vermeille” Sanchez,
Gustavo De Mari Pereira, hakaa1, Hanmin Qin, Henry Lin, Hong, Honghe, Hossein Pourbozorg, Hristo, Hunan Ros-
tomyan, iampat, Ivan PANICO, Jaewon Chung, Jake VanderPlas, jakirkham, James Bourbeau, James Malcolm, Jamie
Cox, Jan Koch, Jan Margeta, Jan Schlüter, janvanrijn, Jason Wolosonovich, JC Liu, Jeb Bearer, jeremiedbb, Jimmy
Wan, Jinkun Wang, Jiongyan Zhang, jjabl, jkleint, Joan Massich, Joël Billaud, Joel Nothman, Johannes Hansen,
JohnStott, Jonatan Samoocha, Jonathan Ohayon, Jörg Döpfert, Joris Van den Bossche, Jose Perez-Parras Toledano,
josephsalmon, jotasi, jschendel, Julian Kuhlmann, Julien Chaumond, julietcl, Justin Shenk, Karl F, Kasper Primdal
Lauritzen, Katrin Leinweber, Kirill, ksemb, Kuai Yu, Kumar Ashutosh, Kyeongpil Kang, Kye Taylor, kyledrogo,
Leland McInnes, Léo DS, Liam Geron, Liutong Zhou, Lizao Li, lkjcalc, Loic Esteve, louib, Luciano Viola, Lucija
Gregov, Luis Osa, Luis Pedro Coelho, Luke M Craig, Luke Persola, Mabel, Mabel Villalba, Maniteja Nandana, MarkI-
wanchyshyn, Mark Roth, Markus Müller, MarsGuy, Martin Gubri, martin-hahn, martin-kokos, mathurinm, Matthias
Feurer, Max Copeland, Mayur Kulkarni, Meghann Agarwal, Melanie Goetz, Michael A. Alcorn, Minghui Liu, Ming
Li, Minh Le, Mohamed Ali Jamaoui, Mohamed Maskani, Mohammad Shahebaz, Muayyad Alsadi, Nabarun Pal, Na-
garjuna Kumar, Naoya Kanai, Narendran Santhanam, NarineK, Nathaniel Saul, Nathan Suh, Nicholas Nadeau, P.Eng.,
AVS, Nick Hoh, Nicolas Goix, Nicolas Hug, Nicolau Werneck, nielsenmarkus11, Nihar Sheth, Nikita Titov, Nilesh
Kevlani, Nirvan Anjirbag, notmatthancock, nzw, Oleksandr Pavlyk, oliblum90, Oliver Rausch, Olivier Grisel, Oren
Milman, Osaid Rehman Nasir, pasbi, Patrick Fernandes, Patrick Olden, Paul Paczuski, Pedro Morales, Peter, Peter St.
John, pierreablin, pietruh, Pinaki Nath Chowdhury, Piotr Szymański, Pradeep Reddy Raamana, Pravar D Mahajan,
pravarmahajan, QingYing Chen, Raghav RV, Rajendra arora, RAKOTOARISON Herilalaina, Rameshwar Bhaskaran,
RankyLau, Rasul Kerimov, Reiichiro Nakano, Rob, Roman Kosobrodov, Roman Yurchak, Ronan Lamy, rragundez,
Rüdiger Busche, Ryan, Sachin Kelkar, Sagnik Bhattacharya, Sailesh Choyal, Sam Radhakrishnan, Sam Steingold,
Samuel Bell, Samuel O. Ronsin, Saqib Nizam Shamsi, SATISH J, Saurabh Gupta, Scott Gigante, Sebastian Flen-
nerhag, Sebastian Raschka, Sebastien Dubois, Sébastien Lerique, Sebastin Santy, Sergey Feldman, Sergey Melderis,
Sergul Aydore, Shahebaz, Shalil Awaley, Shangwu Yao, Sharad Vijalapuram, Sharan Yalburgi, shenhanc78, Shivam
Rastogi, Shu Haoran, siftikha, Sinclert Pérez, SolutusImmensus, Somya Anand, srajan paliwal, Sriharsha Hatwar, Sri
Krishna, Stefan van der Walt, Stephen McDowell, Steven Brown, syonekura, Taehoon Lee, Takanori Hayashi, tarcusx,
Taylor G Smith, theriley106, Thomas, Thomas Fan, Thomas Heavey, Tobias Madsen, tobycheese, Tom Augspurger,
Tom Dupré la Tour, Tommy, Trevor Stephens, Trishnendu Ghorai, Tulio Casagrande, twosigmajab, Umar Farouk
Umar, Urvang Patel, Utkarsh Upadhyay, Vadim Markovtsev, Varun Agrawal, Vathsala Achar, Vilhelm von Ehren-
heim, Vinayak Mehta, Vinit, Vinod Kumar L, Viraj Mavani, Viraj Navkal, Vivek Kumar, Vlad Niculae, vqean3, Vris-
hank Bhardwaj, vufg, wallygauze, Warut Vijitbenjaronk, wdevazelhes, Wenhao Zhang, Wes Barnett, Will, William de
Vazelhes, Will Rosenfeld, Xin Xiong, Yiming (Paul) Li, ymazari, Yufeng, Zach Griffith, Zé Vinícius, Zhenqing Hu,
Zhiqing Xiao, Zijie (ZJ) Poh
July 2018
This release exists exclusively to support Python 3.7.
Related changes
Note there may be minor differences in TSNE output in this release (due to #9623), in the case where multiple samples
have equal distance to some sample.
Changelog
API changes
• Reverted the addition of metrics.ndcg_score and metrics.dcg_score, which had been merged into version 0.19.0 in error. The implementations were broken and undocumented.
• return_train_score, which was added to model_selection.GridSearchCV, model_selection.RandomizedSearchCV and model_selection.cross_validate in version 0.19.0, will change its default value from True to False in version 0.21. We found that calculating training scores could have a great effect on cross-validation runtime in some cases. Users should explicitly set return_train_score to False if prediction or scoring functions are slow, resulting in a deleterious effect on CV runtime, or to True if they wish to use the calculated scores (see the sketch after this list). #9677 by Kumar Ashutosh and Joel Nothman.
• correlation_models and regression_models from the legacy gaussian processes implementation have been belatedly deprecated. #9717 by Kumar Ashutosh.
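A minimal sketch of opting out of training scores to keep a search fast (mirroring the future default described above; the estimator and grid are illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.1, 1.0, 10.0]},
                          return_train_score=False, cv=3)
    search.fit(X, y)
    print("mean_train_score" in search.cv_results_)  # False: training scores skipped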
Bug fixes
• Dataset fetchers make sure temporary files are closed before removing them, which caused errors on Windows.
#9847 by Joan Massich.
• Fixed a regression in manifold.TSNE where it no longer supported metrics other than ‘euclidean’ and ‘pre-
computed’. #9623 by Oli Blum.
Enhancements
• Our test suite and utils.estimator_checks.check_estimators can now be run without Nose installed. #9697 by Joan Massich.
• To improve usability of version 0.19’s pipeline.Pipeline caching, memory now allows joblib.Memory instances (see the sketch after this list). This makes use of the new utils.validation.check_memory helper. #9584 by Kumar Ashutosh.
• Some fixes to examples: #9750, #9788, #9815.
• Made a FutureWarning in SGD-based estimators less verbose. #9802 by Vrishank Bhardwaj.
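A minimal sketch of passing a joblib.Memory instance to Pipeline(memory=...); note the keyword is location in joblib >= 0.12 (older joblib releases used cachedir), and the temporary directory here is illustrative only:

    import tempfile
    from joblib import Memory
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)
    memory = Memory(location=tempfile.mkdtemp(), verbose=0)
    pipe = Pipeline([("pca", PCA(n_components=2)), ("clf", LogisticRegression())],
                    memory=memory)
    pipe.fit(X, y)  # the fitted PCA transformer is cached for reuse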
Highlights
Changed models
The following estimators and functions, when fit with the same data and parameters, may produce different models
from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.
• cluster.KMeans with sparse X and initial centroids given (bug fix)
• cross_decomposition.PLSRegression with scale=True (bug fix)
• ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor
where min_impurity_split is used (bug fix)
• gradient boosting loss='quantile' (bug fix)
• ensemble.IsolationForest (bug fix)
• feature_selection.SelectFdr (bug fix)
• linear_model.RANSACRegressor (bug fix)
• linear_model.LassoLars (bug fix)
• linear_model.LassoLarsIC (bug fix)
• manifold.TSNE (bug fix)
• neighbors.NearestCentroid (bug fix)
• semi_supervised.LabelSpreading (bug fix)
• semi_supervised.LabelPropagation (bug fix)
• tree based models where min_weight_fraction_leaf is used (enhancement)
• model_selection.StratifiedKFold with shuffle=True (this change, due to #7823, was not mentioned in the release notes at the time)
Details are listed in the changelog below.
(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)
Changelog
New features
• The new solver 'mu' implements a Multiplicative Update in decomposition.NMF, allowing the optimization of all beta-divergences, including the Frobenius norm, the generalized Kullback-Leibler divergence and the Itakura-Saito divergence (see the sketch after this list). #5295 by Tom Dupre la Tour.
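A minimal sketch of the 'mu' solver with a non-Frobenius beta divergence (only 'mu' supports beta_loss values other than 'frobenius'; the data and sizes are illustrative only):

    import numpy as np
    from sklearn.decomposition import NMF

    X = np.abs(np.random.RandomState(0).randn(50, 10))  # NMF needs non-negative data
    nmf = NMF(n_components=5, solver="mu", beta_loss="kullback-leibler",
              max_iter=500, random_state=0)
    W = nmf.fit_transform(X)
    print(W.shape, nmf.components_.shape)  # (50, 5) (5, 10)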
Model selection and evaluation
• model_selection.GridSearchCV and model_selection.RandomizedSearchCV now support simultaneous evaluation of multiple metrics. Refer to the Specifying multiple metrics for evaluation section of the user guide for more information. #7388 by Raghav RV.
• Added model_selection.cross_validate, which allows evaluation of multiple metrics (see the sketch after this list). This function returns a dict with more useful information from cross-validation such as the train scores, fit times and score times. Refer to The cross_validate function and multiple metric evaluation section of the user guide for more information. #7388 by Raghav RV.
• Added metrics.mean_squared_log_error, which computes the mean square error of the logarithmic
transformation of targets, particularly useful for targets with an exponential trend. #7655 by Karan Desai.
• Added metrics.dcg_score and metrics.ndcg_score, which compute Discounted cumulative gain
(DCG) and Normalized discounted cumulative gain (NDCG). #7739 by David Gasquez.
• Added the model_selection.RepeatedKFold and model_selection.
RepeatedStratifiedKFold. #8120 by Neeraj Gangwar.
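A minimal sketch of multi-metric evaluation with cross_validate (the estimator and metrics are illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                            scoring=("accuracy", "f1_macro"), cv=5,
                            return_train_score=True)
    print(sorted(scores))  # fit_time, score_time, test_*/train_* per metric
    print(scores["test_accuracy"].mean())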
Miscellaneous
• Validation that input data contains no NaN or inf can now be suppressed using config_context, at your own risk (see the sketch after this list). This will save on runtime, and may be particularly useful for prediction time. #7548 by Joel Nothman.
• Added a test to ensure parameter listing in docstrings match the function/class signature. #9206 by Alexandre
Gramfort and Raghav RV.
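A minimal sketch of suppressing finiteness validation during prediction (the model and data are illustrative only):

    import numpy as np
    import sklearn
    from sklearn.linear_model import LinearRegression

    X = np.random.RandomState(0).randn(1000, 10)
    y = X @ np.ones(10)
    model = LinearRegression().fit(X, y)

    with sklearn.config_context(assume_finite=True):  # at your own risk
        model.predict(X)  # NaN/inf input checks are skipped inside this block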
Enhancements
Bug fixes
• Fixed a bug in svm.OneClassSVM where it returned floats instead of integer classes. #8676 by Vathsala
Achar.
• Fix AIC/BIC criterion computation in linear_model.LassoLarsIC. #9022 by Alexandre Gramfort and
Mehmet Basbug.
• Fixed a memory leak in our LibLinear implementation. #9024 by Sergei Lebedev
• Fix bug where stratified CV splitters did not work with linear_model.LassoCV . #8973 by Paulo Haddad.
• Fixed a bug in gaussian_process.GaussianProcessRegressor where predicting the standard deviation and covariance without a prior fit would fail with an unmeaningful error by default. #6573 by Quazi Marufur Rahman and Manoj Kumar.
Other predictors
• Fix semi_supervised.BaseLabelPropagation to correctly implement LabelPropagation and
LabelSpreading as done in the referenced papers. #9239 by Andre Ambrosio Boechat, Utkarsh Upadhyay,
and Joel Nothman.
Decomposition, manifold learning and clustering
• Fixed the implementation of manifold.TSNE:
– The early_exaggeration parameter had no effect and is now used for the first 250 optimization iterations.
– Fixed the AssertionError: Tree consistency failed exception reported in #8992.
– Improved the learning schedule to match the one from the reference implementation lvdmaaten/bhtsne. By Thomas Moreau and Olivier Grisel.
• Fix a bug in decomposition.LatentDirichletAllocation where the perplexity method was
returning incorrect results because the transform method returns normalized document topic distributions as
of version 0.18. #7954 by Gary Foreman.
• Fix output shape and bugs with n_jobs > 1 in decomposition.SparseCoder transform and
decomposition.sparse_encode for one-dimensional data and one component. This also impacts the
output shape of decomposition.DictionaryLearning. #8086 by Andreas Müller.
• Fixed the implementation of explained_variance_ in decomposition.PCA, decomposition.
RandomizedPCA and decomposition.IncrementalPCA. #9105 by Hanmin Qin.
• Fixed the implementation of noise_variance_ in decomposition.PCA. #9108 by Hanmin Qin.
• Fixed a bug where cluster.DBSCAN gives incorrect result when input is a precomputed sparse matrix with
initial rows all zero. #8306 by Akshay Gupta
• Fix a bug regarding fitting cluster.KMeans with a sparse array X and initial centroids, where X’s means
were unnecessarily being subtracted from the centroids. #7872 by Josh Karnofsky.
• Fixes to the input validation in covariance.EllipticEnvelope. #8086 by Andreas Müller.
• Fixed a bug in covariance.MinCovDet where inputting data that produced a singular covariance matrix
would cause the helper method _c_step to throw an exception. #3367 by Jeremy Steward
• Fixed a bug in manifold.TSNE affecting convergence of the gradient descent. #8768 by David DeTomaso.
• Fixed a bug in manifold.TSNE where it stored the incorrect kl_divergence_. #6507 by Sebastian
Saeger.
• Fixed improper scaling in cross_decomposition.PLSRegression with scale=True. #7819 by
jayzed82.
• Deprecate the y parameter in transform and inverse_transform. These methods should not accept a y parameter, as they are used at prediction time. #8174 by Tahar Zanouda, Alexandre Gramfort and Raghav RV.
• SciPy >= 0.13.3 and NumPy >= 1.8.2 are now the minimum supported versions for scikit-learn. The following backported functions in utils have been removed or deprecated accordingly. #8854 and #8874 by Naoya Kanai.
• The store_covariances and covariances_ parameters of discriminant_analysis.QuadraticDiscriminantAnalysis have been renamed to store_covariance and covariance_ to be consistent with the corresponding parameter names of discriminant_analysis.LinearDiscriminantAnalysis. They will be removed in version 0.21. #7998 by Jiacheng
Removed in 0.19:
– utils.fixes.argpartition
– utils.fixes.array_equal
– utils.fixes.astype
– utils.fixes.bincount
– utils.fixes.expit
– utils.fixes.frombuffer_empty
– utils.fixes.in1d
– utils.fixes.norm
– utils.fixes.rankdata
– utils.fixes.safe_copy
Deprecated in 0.19, to be removed in 0.21:
– utils.arpack.eigs
– utils.arpack.eigsh
– utils.arpack.svds
– utils.extmath.fast_dot
– utils.extmath.logsumexp
– utils.extmath.norm
– utils.extmath.pinvh
– utils.graph.graph_laplacian
– utils.random.choice
– utils.sparsetools.connected_components
– utils.stats.rankdata
• Estimators with both methods decision_function and predict_proba are now required to have a
monotonic relation between them. The method check_decision_proba_consistency has been added
in utils.estimator_checks to check their consistency. #7578 by Shubham Bhardwaj
• All checks in utils.estimator_checks, in particular utils.estimator_checks.
check_estimator now accept estimator instances. Most other checks do not accept estimator classes any
more. #9019 by Andreas Müller.
• Ensure that estimators’ attributes ending with _ are not set in the constructor but only in the fit method.
Most notably, ensemble estimators (deriving from ensemble.BaseEnsemble) now only have self.
estimators_ available after fit. #7464 by Lars Buitinck and Loic Esteve.
Thanks to everyone who has contributed to the maintenance and improvement of the project since version 0.18, in-
cluding:
Joel Nothman, Loic Esteve, Andreas Mueller, Guillaume Lemaitre, Olivier Grisel, Hanmin Qin, Raghav RV, Alexandre
Gramfort, themrmax, Aman Dalmia, Gael Varoquaux, Naoya Kanai, Tom Dupré la Tour, Rishikesh, Nelson Liu, Tae-
hoon Lee, Nelle Varoquaux, Aashil, Mikhail Korobov, Sebastin Santy, Joan Massich, Roman Yurchak, RAKOTOARI-
SON Herilalaina, Thierry Guillemot, Alexandre Abadie, Carol Willing, Balakumaran Manoharan, Josh Karnofsky,
Vlad Niculae, Utkarsh Upadhyay, Dmitry Petrov, Minghui Liu, Srivatsan, Vincent Pham, Albert Thomas, Jake Van-
derPlas, Attractadore, JC Liu, alexandercbooth, chkoar, Óscar Nájera, Aarshay Jain, Kyle Gilliam, Ramana Subra-
manyam, CJ Carey, Clement Joudet, David Robles, He Chen, Joris Van den Bossche, Karan Desai, Katie Luangkote,
Leland McInnes, Maniteja Nandana, Michele Lacchia, Sergei Lebedev, Shubham Bhardwaj, akshay0724, omtcyfz,
rickiepark, waterponey, Vathsala Achar, jbDelafosse, Ralf Gommers, Ekaterina Krivich, Vivek Kumar, Ishank Gulati,
Dave Elliott, ldirer, Reiichiro Nakano, Levi John Wolf, Mathieu Blondel, Sid Kapur, Dougal J. Sutherland, midinas,
mikebenfield, Sourav Singh, Aseem Bansal, Ibraim Ganiev, Stephen Hoover, AishwaryaRK, Steven C. Howell, Gary
Foreman, Neeraj Gangwar, Tahar, Jon Crall, dokato, Kathy Chen, ferria, Thomas Moreau, Charlie Brummitt, Nicolas
Goix, Adam Kleczewski, Sam Shleifer, Nikita Singh, Basil Beirouti, Giorgio Patrini, Manoj Kumar, Rafael Possas,
James Bourbeau, James A. Bednar, Janine Harper, Jaye, Jean Helie, Jeremy Steward, Artsiom, John Wei, Jonathan
LIgo, Jonathan Rahn, seanpwilliams, Arthur Mensch, Josh Levy, Julian Kuhlmann, Julien Aubert, Jörn Hees, Kai,
shivamgargsya, Kat Hempstalk, Kaushik Lakshmikanth, Kennedy, Kenneth Lyons, Kenneth Myers, Kevin Yap, Kir-
ill Bobyrev, Konstantin Podshumok, Arthur Imbert, Lee Murray, toastedcornflakes, Lera, Li Li, Arthur Douillard,
Mainak Jas, tobycheese, Manraj Singh, Manvendra Singh, Marc Meketon, MarcoFalke, Matthew Brett, Matthias
Gilch, Mehul Ahuja, Melanie Goetz, Meng, Peng, Michael Dezube, Michal Baumgartner, vibrantabhi19, Artem Golu-
bin, Milen Paskov, Antonin Carette, Morikko, MrMjauh, NALEPA Emmanuel, Namiya, Antoine Wendlinger, Narine
Kokhlikyan, NarineK, Nate Guerin, Angus Williams, Ang Lu, Nicole Vavrova, Nitish Pandey, Okhlopkov Daniil
Olegovich, Andy Craze, Om Prakash, Parminder Singh, Patrick Carlson, Patrick Pei, Paul Ganssle, Paulo Haddad,
Paweł Lorek, Peng Yu, Pete Bachant, Peter Bull, Peter Csizsek, Peter Wang, Pieter Arthur de Jong, Ping-Yao, Chang,
Preston Parry, Puneet Mathur, Quentin Hibon, Andrew Smith, Andrew Jackson, 1kastner, Rameshwar Bhaskaran, Re-
becca Bilbro, Remi Rampin, Andrea Esuli, Rob Hall, Robert Bradshaw, Romain Brault, Aman Pratik, Ruifeng Zheng,
Russell Smith, Sachin Agarwal, Sailesh Choyal, Samson Tan, Samuël Weber, Sarah Brown, Sebastian Pölsterl, Se-
bastian Raschka, Sebastian Saeger, Alyssa Batula, Abhyuday Pratap Singh, Sergey Feldman, Sergul Aydore, Sharan
Yalburgi, willduan, Siddharth Gupta, Sri Krishna, Almer, Stijn Tonk, Allen Riddell, Theofilos Papapanagiotou, Alison,
Alexis Mignon, Tommy Boucher, Tommy Löfstedt, Toshihiro Kamishima, Tyler Folkman, Tyler Lanigan, Alexander
Junge, Varun Shenoy, Victor Poughon, Vilhelm von Ehrenheim, Aleksandr Sandrovskii, Alan Yee, Vlasios Vasileiou,
Warut Vijitbenjaronk, Yang Zhang, Yaroslav Halchenko, Yichuan Liu, Yuichi Fujikawa, affanv14, aivision2020, xor,
andreh7, brady salz, campustrampus, Agamemnon Krasoulis, ditenberg, elena-sharova, filipj8, fukatani, gedeck, guin-
iol, guoci, hakaa1, hongkahjun, i-am-xhy, jakirkham, jaroslaw-weber, jayzed82, jeroko, jmontoyam, jonathan.striebel,
josephsalmon, jschendel, leereeves, martin-hahn, mathurinm, mehak-sachdeva, mlewis1729, mlliou112, mthorrell,
ndingwall, nuffe, yangarbiter, plagree, pldtc325, Breno Freitas, Brett Olsen, Brian A. Alfano, Brian Burns, polmauri,
Brandon Carter, Charlton Austin, Chayant T15h, Chinmaya Pancholi, Christian Danielsen, Chung Yen, Chyi-Kwei
Yau, pravarmahajan, DOHMATOB Elvis, Daniel LeJeune, Daniel Hnyk, Darius Morawiec, David DeTomaso, David
Gasquez, David Haberthür, David Heryanto, David Kirkby, David Nicholson, rashchedrin, Deborah Gertrude Digges,
Denis Engemann, Devansh D, Dickson, Bob Baxley, Don86, E. Lynch-Klarup, Ed Rogers, Elizabeth Ferriss, Ellen-
Co2, Fabian Egli, Fang-Chieh Chou, Bing Tian Dai, Greg Stupp, Grzegorz Szpak, Bertrand Thirion, Hadrien Bertrand,
Harizo Rajaona, zxcvbnius, Henry Lin, Holger Peters, Icyblade Dai, Igor Andriushchenko, Ilya, Isaac Laughlin, Iván
Vallés, Aurélien Bellet, JPFrancoia, Jacob Schreiber, Asish Mahapatra
Scikit-learn 0.18 is the last major release of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
Changelog
• Fixes for compatibility with NumPy 1.13.0: #7946 #8355 by Loic Esteve.
• Minor compatibility changes in the examples #9010 #8040 #9149.
Code Contributors
Changelog
Enhancements
Bug fixes
• Fix issue where min_grad_norm and n_iter_without_progress parameters were not being utilised
by manifold.TSNE. #6497 by Sebastian Säger
• Fix bug for svm’s decision values when decision_function_shape is ovr in svm.SVC. svm.SVC’s
decision_function was incorrect from versions 0.17.0 through 0.18.0. #7724 by Bing Tian Dai
• Attribute explained_variance_ratio of discriminant_analysis.
LinearDiscriminantAnalysis calculated with SVD and Eigen solver are now of the same length.
#7632 by JPFrancoia
• Fixes issue in Univariate feature selection where score functions were not accepting multi-label targets. #7676
by Mohammed Affan
• Fixed setting parameters when calling fit multiple times on feature_selection.SelectFromModel.
#7756 by Andreas Müller
• Fixes issue in partial_fit method of multiclass.OneVsRestClassifier when number of classes
used in partial_fit was less than the total number of classes in the data. #7786 by Srivatsan Ramesh
• Fixes an issue in calibration.CalibratedClassifierCV where the probabilities of each class for a sample did not sum to 1, and CalibratedClassifierCV now handles the case where the training set has fewer classes than the full data. #7799 by Srivatsan Ramesh
• Fix a bug where sklearn.feature_selection.SelectFdr did not exactly implement Benjamini-
Hochberg procedure. It formerly may have selected fewer features than it should. #7490 by Peng Meng.
• sklearn.manifold.LocallyLinearEmbedding now correctly handles integer inputs. #6282 by Jake
Vanderplas.
• The min_weight_fraction_leaf parameter of tree-based classifiers and regressors now assumes uniform
sample weights by default if the sample_weight argument is not passed to the fit function. Previously, the
parameter was silently ignored. #7301 by Nelson Liu.
• Numerical issue with linear_model.RidgeCV on centered data when n_features > n_samples.
#6178 by Bertrand Thirion
• Tree splitting criterion classes’ cloning/pickling is now memory safe #7680 by Ibraim Ganiev.
• Fixed a bug where decomposition.NMF sets its n_iters_ attribute in transform(). #7553 by Ekate-
rina Krivich.
• sklearn.linear_model.LogisticRegressionCV now correctly handles string labels. #5874 by
Raghav RV.
• Fixed a bug where sklearn.model_selection.train_test_split raised an error when
stratify is a list of string labels. #7593 by Raghav RV.
• Fixed a bug where sklearn.model_selection.GridSearchCV and sklearn.
model_selection.RandomizedSearchCV were not pickleable because of a pickling bug in np.
ma.MaskedArray. #7594 by Raghav RV.
• All cross-validation utilities in sklearn.model_selection now permit one time cross-validation splitters
for the cv parameter. Also non-deterministic cross-validation splitters (where multiple calls to split produce
dissimilar splits) can be used as cv parameter. The sklearn.model_selection.GridSearchCV will
cross-validate each parameter setting on the split produced by the first split call to the cross-validation splitter.
#7660 by Raghav RV.
• Fix bug where preprocessing.MultiLabelBinarizer.fit_transform returned an invalid CSR
matrix. #7750 by CJ Carey.
• Fixed a bug where metrics.pairwise.cosine_distances could return a small negative distance.
#7732 by Artsion.
Scikit-learn 0.18 will be the last version of scikit-learn to support Python 2.6. Later versions of scikit-learn will
require Python 2.7 or above.
Changelog
New features
Enhancements
Bug fixes
• The old mixture.VBGMM is deprecated in favor of the new mixture.BayesianGaussianMixture. The new class solves the computational problems of the old class and computes the Variational Bayesian Gaussian mixture faster than before. #6651 by Wei Xue and Thierry Guillemot.
• The old mixture.GMM is deprecated in favor of the new mixture.GaussianMixture. The new class computes the Gaussian mixture faster than before and some of the computational problems have been solved. #6666 by Wei Xue and Thierry Guillemot.
Model evaluation and meta-estimators
• The sklearn.cross_validation, sklearn.grid_search and sklearn.learning_curve
have been deprecated and the classes and functions have been reorganized into the sklearn.
model_selection module. Ref Model Selection Enhancements and API Changes for more information.
#4294 by Raghav RV.
• The grid_scores_ attribute of model_selection.GridSearchCV and model_selection.
RandomizedSearchCV is deprecated in favor of the attribute cv_results_. Ref Model Selection En-
hancements and API Changes for more information. #6697 by Raghav RV.
• The parameters n_iter or n_folds in old CV splitters are replaced by the new parameter n_splits since
it can provide a consistent and unambiguous interface to represent the number of train-test splits. #7187 by
YenChen Lin.
• classes parameter was renamed to labels in metrics.hamming_loss. #7260 by Sebastián Vanrell.
• The splitter classes LabelKFold, LabelShuffleSplit, LeaveOneLabelOut and
LeavePLabelsOut are renamed to model_selection.GroupKFold, model_selection.
GroupShuffleSplit, model_selection.LeaveOneGroupOut and model_selection.
LeavePGroupsOut respectively. Also the parameter labels in the split method of the newly renamed
splitters model_selection.LeaveOneGroupOut and model_selection.LeavePGroupsOut
is renamed to groups. Additionally in model_selection.LeavePGroupsOut, the parameter
n_labels is renamed to n_groups. #6660 by Raghav RV.
• Error and loss names for scoring parameters are now prefixed by 'neg_', such as neg_mean_squared_error (see the sketch after this list). The unprefixed versions are deprecated and will be removed in version 0.20. #7261 by Tim Head.
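A minimal sketch of the 'neg_'-prefixed loss names, which keep the convention that greater scores are better (the estimator and dataset are illustrative only):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    scores = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=5)
    print(scores.mean())  # negative: values closer to zero are better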
Code Contributors
Aditya Joshi, Alejandro, Alexander Fabisch, Alexander Loginov, Alexander Minyushkin, Alexander Rudy, Alexan-
dre Abadie, Alexandre Abraham, Alexandre Gramfort, Alexandre Saint, alexfields, Alvaro Ulloa, alyssaq, Amlan
Kar, Andreas Mueller, andrew giessel, Andrew Jackson, Andrew McCulloh, Andrew Murray, Anish Shah, Arafat,
Archit Sharma, Ariel Rokem, Arnaud Joly, Arnaud Rachez, Arthur Mensch, Ash Hoover, asnt, b0noI, Behzad Tabib-
ian, Bernardo, Bernhard Kratzwald, Bhargav Mangipudi, blakeflei, Boyuan Deng, Brandon Carter, Brett Naul, Brian
McFee, Caio Oliveira, Camilo Lamus, Carol Willing, Cass, CeShine Lee, Charles Truong, Chyi-Kwei Yau, CJ Carey,
codevig, Colin Ni, Dan Shiebler, Daniel, Daniel Hnyk, David Ellis, David Nicholson, David Staub, David Thaler,
David Warshaw, Davide Lasagna, Deborah, definitelyuncertain, Didi Bar-Zev, djipey, dsquareindia, edwinENSAE,
Elias Kuthe, Elvis DOHMATOB, Ethan White, Fabian Pedregosa, Fabio Ticconi, fisache, Florian Wilhelm, Francis,
Francis O’Donovan, Gael Varoquaux, Ganiev Ibraim, ghg, Gilles Louppe, Giorgio Patrini, Giovanni Cherubin, Gio-
vanni Lanzani, Glenn Qian, Gordon Mohr, govin-vatsan, Graham Clenaghan, Greg Reda, Greg Stupp, Guillaume
Lemaitre, Gustav Mörtberg, halwai, Harizo Rajaona, Harry Mavroforakis, hashcode55, hdmetor, Henry Lin, Hob-
son Lane, Hugo Bowne-Anderson, Igor Andriushchenko, Imaculate, Inki Hwang, Isaac Sijaranamual, Ishank Gulati,
Issam Laradji, Iver Jordal, jackmartin, Jacob Schreiber, Jake Vanderplas, James Fiedler, James Routley, Jan Zikes,
Janna Brettingen, jarfa, Jason Laska, jblackburne, jeff levesque, Jeffrey Blackburne, Jeffrey04, Jeremy Hintz, jere-
mynixon, Jeroen, Jessica Yung, Jill-Jênn Vie, Jimmy Jia, Jiyuan Qian, Joel Nothman, johannah, John, John Boersma,
John Kirkham, John Moeller, jonathan.striebel, joncrall, Jordi, Joseph Munoz, Joshua Cook, JPFrancoia, jrfiedler,
JulianKahnert, juliathebrave, kaichogami, KamalakerDadi, Kenneth Lyons, Kevin Wang, kingjr, kjell, Konstantin
Podshumok, Kornel Kielczewski, Krishna Kalyan, krishnakalyan3, Kvle Putnam, Kyle Jackson, Lars Buitinck, ldavid,
LeiG, LeightonZhang, Leland McInnes, Liang-Chi Hsieh, Lilian Besson, lizsz, Loic Esteve, Louis Tiao, Léonie Borne,
Mads Jensen, Maniteja Nandana, Manoj Kumar, Manvendra Singh, Marco, Mario Krell, Mark Bao, Mark Szepieniec,
Martin Madsen, MartinBpr, MaryanMorel, Massil, Matheus, Mathieu Blondel, Mathieu Dubois, Matteo, Matthias Ek-
man, Max Moroz, Michael Scherer, michiaki ariga, Mikhail Korobov, Moussa Taifi, mrandrewandrade, Mridul Seth,
nadya-p, Naoya Kanai, Nate George, Nelle Varoquaux, Nelson Liu, Nick James, NickleDave, Nico, Nicolas Goix,
Nikolay Mayorov, ningchi, nlathia, okbalefthanded, Okhlopkov, Olivier Grisel, Panos Louridas, Paul Strickland, Per-
rine Letellier, pestrickland, Peter Fischer, Pieter, Ping-Yao, Chang, practicalswift, Preston Parry, Qimu Zheng, Rachit
Kansal, Raghav RV, Ralf Gommers, Ramana.S, Rammig, Randy Olson, Rob Alexander, Robert Lutz, Robin Schucker,
Rohan Jain, Ruifeng Zheng, Ryan Yu, Rémy Léone, saihttam, Saiwing Yeung, Sam Shleifer, Samuel St-Jean, Sar-
taj Singh, Sasank Chilamkurthy, saurabh.bansod, Scott Andrews, Scott Lowe, seales, Sebastian Raschka, Sebastian
Saeger, Sebastián Vanrell, Sergei Lebedev, shagun Sodhani, shanmuga cv, Shashank Shekhar, shawpan, shengxid-
uan, Shota, shuckle16, Skipper Seabold, sklearn-ci, SmedbergM, srvanrell, Sébastien Lerique, Taranjeet, themrmax,
Thierry, Thierry Guillemot, Thomas, Thomas Hallock, Thomas Moreau, Tim Head, tKammy, toastedcornflakes, Tom,
TomDLT, Toshihiro Kamishima, tracer0tong, Trent Hauck, trevorstephens, Tue Vo, Varun, Varun Jewalikar, Viach-
eslav, Vighnesh Birodkar, Vikram, Villu Ruusmann, Vinayak Mehta, walter, waterponey, Wenhua Yang, Wenjian
Huang, Will Welch, wyseguy7, xyguo, yanlend, Yaroslav Halchenko, yelite, Yen, YenChenLin, Yichuan Liu, Yoav
Ram, Yoshiki, Zheng RuiFeng, zivori, Óscar Nájera
Changelog
Bug fixes
• Upgrade vendored joblib to version 0.9.4 that fixes an important bug in joblib.Parallel that can silently
yield to wrong results when working on datasets larger than 1MB: https://github.com/joblib/joblib/blob/0.9.4/
CHANGES.rst
• Fixed reading of Bunch pickles generated with scikit-learn version <= 0.16. This can affect users who have
already downloaded a dataset with scikit-learn 0.16 and are loading it with scikit-learn 0.17. See #6196 for how
this affected datasets.fetch_20newsgroups. By Loic Esteve.
• Fixed a bug that prevented using ROC AUC score to perform grid search on several CPU / cores on large arrays.
See #6147 By Olivier Grisel.
• Fixed a bug that prevented to properly set the presort parameter in ensemble.
GradientBoostingRegressor. See #5857 By Andrew McCulloh.
• Fixed a joblib error when evaluating the perplexity of a decomposition.
LatentDirichletAllocation model. See #6258 By Chyi-Kwei Yau.
November 5, 2015
Changelog
New features
• All the Scaler classes except preprocessing.RobustScaler can be fitted online by calling partial_fit. By Giorgio Patrini.
• The new class ensemble.VotingClassifier implements a “majority rule” / “soft voting” ensemble classifier to combine estimators for classification (see the sketch after this list). By Sebastian Raschka.
• The new class preprocessing.RobustScaler provides an alternative to preprocessing.
StandardScaler for feature-wise centering and range normalization that is robust to outliers. By Thomas
Unterthiner.
• The new class preprocessing.MaxAbsScaler provides an alternative to preprocessing.
MinMaxScaler for feature-wise range normalization when the data is already centered or sparse. By Thomas
Unterthiner.
• The new class preprocessing.FunctionTransformer turns a Python function into a Pipeline-
compatible transformer object. By Joe Jevnik.
• The new classes cross_validation.LabelKFold and cross_validation.
LabelShuffleSplit generate train-test folds, respectively similar to cross_validation.KFold and
cross_validation.ShuffleSplit, except that the folds are conditioned on a label array. By Brian
McFee, Jean Kossaifi and Gilles Louppe.
• decomposition.LatentDirichletAllocation implements the Latent Dirichlet Allocation topic
model with online variational inference. By Chyi-Kwei Yau, with code based on an implementation by Matt
Hoffman. (#3659)
• The new solver sag implements a Stochastic Average Gradient descent and is available in both
linear_model.LogisticRegression and linear_model.Ridge. This solver is very efficient for
large datasets. By Danny Sullivan and Tom Dupre la Tour. (#4738)
• The new solver cd implements a Coordinate Descent in decomposition.NMF. The previous solver based on Projected Gradient is still available by setting the new parameter solver to pg, but it is deprecated and will be removed in 0.19, along with decomposition.ProjectedGradientNMF and the parameters sparseness, eta, beta and nls_max_iter. New parameters alpha and l1_ratio control L1 and L2 regularization, and shuffle adds a shuffling step in the cd solver. By Tom Dupre la Tour and Mathieu Blondel.
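A minimal sketch of the soft-voting ensemble introduced above (the base estimators are illustrative only; soft voting requires estimators that implement predict_proba):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    vote = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0))],
        voting="soft")  # averages the predicted class probabilities
    print(vote.fit(X, y).predict(X[:5]))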
Enhancements
• manifold.TSNE now supports approximate optimization via the Barnes-Hut method, leading to much faster
fitting. By Christopher Erick Moody. (#4025)
• cluster.mean_shift_.MeanShift now supports parallel execution, as implemented in the
mean_shift function. By Martino Sorbaro.
• naive_bayes.GaussianNB now supports fitting with sample_weight. By Jan Hendrik Metzen.
• dummy.DummyClassifier now supports a prior fitting strategy. By Arnaud Joly.
• Added a fit_predict method for mixture.GMM and subclasses. By Cory Lorenz.
• Added the metrics.label_ranking_loss metric. By Arnaud Joly.
• Added the metrics.cohen_kappa_score metric.
• Added a warm_start constructor parameter to the bagging ensemble models to increase the size of the en-
semble. By Tim Head.
• Added option to use multi-output regression metrics without averaging. By Konstantin Shmelkov and Michael
Eickenberg.
• Speed up tree based methods by reducing the number of computations needed when computing the impurity
measure taking into account linear relationship of the computed statistics. The effect is particularly visible with
extra trees and on datasets with categorical or sparse features. By Arnaud Joly.
• ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier now expose an apply method for retrieving the leaf indices each sample ends up in under each tree. By Jacob Schreiber.
• Add sample_weight support to linear_model.LinearRegression. By Sonny Hu. (#4881)
• Add n_iter_without_progress to manifold.TSNE to control the stopping criterion. By Santi Vil-
lalba. (#5186)
• Added optional parameter random_state in linear_model.Ridge , to set the seed of the pseudo random
generator used in sag solver. By Tom Dupre la Tour.
• Added optional parameter warm_start in linear_model.LogisticRegression. If set to True, the
solvers lbfgs, newton-cg and sag will be initialized with the coefficients computed in the previous fit. By
Tom Dupre la Tour.
• Added sample_weight support to linear_model.LogisticRegression for the lbfgs,
newton-cg, and sag solvers. By Valentin Stolbunov. Support added to the liblinear solver. By Manoj
Kumar.
• Added optional parameter presort to ensemble.GradientBoostingRegressor and ensemble.
GradientBoostingClassifier, keeping default behavior the same. This allows gradient boosters to
turn off presorting when building deep trees or using sparse data. By Jacob Schreiber.
• Altered metrics.roc_curve to drop unnecessary thresholds by default. By Graham Clenaghan.
• Added the feature_selection.SelectFromModel meta-transformer which can be used along with estimators that have a coef_ or feature_importances_ attribute to select important features of the input data (see the sketch after this list). By Maheshakya Wijewardena, Joel Nothman and Manoj Kumar.
• Added metrics.pairwise.laplacian_kernel. By Clyde Fare.
• covariance.GraphLasso allows separate control of the convergence criterion for the Elastic-Net subprob-
lem via the enet_tol parameter.
• Improved verbosity in decomposition.DictionaryLearning.
• ensemble.RandomForestClassifier and ensemble.RandomForestRegressor no longer ex-
plicitly store the samples used in bagging, resulting in a much reduced memory footprint for storing random
forest models.
• Added positive option to linear_model.Lars and linear_model.lars_path to force coeffi-
cients to be positive. (#5131)
• Added the X_norm_squared parameter to metrics.pairwise.euclidean_distances to provide
precomputed squared norms for X.
• Added the fit_predict method to pipeline.Pipeline.
• Added the preprocessing.min_max_scale function.
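A minimal sketch of the SelectFromModel meta-transformer (the estimator and threshold are illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_iris(return_X_y=True)
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median")
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)  # features with importance below the median are dropped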
Bug fixes
• Fixed bug where grid_search.RandomizedSearchCV could consume a lot of memory for large discrete
grids. By Joel Nothman.
• Fixed bug in linear_model.LogisticRegressionCV where penalty was ignored in the final fit. By
Manoj Kumar.
• Fixed bug in ensemble.forest.ForestClassifier while computing oob_score and X is a
sparse.csc_matrix. By Ankur Ankan.
• All regressors now consistently handle and warn when given y that is of shape (n_samples, 1). By Andreas
Müller and Henry Lin. (#5431)
• Fix in cluster.KMeans cluster reassignment for sparse input by Lars Buitinck.
• Fixed a bug in lda.LDA that could cause asymmetric covariance matrices when using shrinkage. By Martin
Billinger.
• Fixed cross_validation.cross_val_predict for estimators with sparse predictions. By Buddha
Prakash.
• Fixed the predict_proba method of linear_model.LogisticRegression to use soft-max instead
of one-vs-rest normalization. By Manoj Kumar. (#5182)
• Fixed the partial_fit method of linear_model.SGDClassifier when called with
average=True. By Andrew Lamb. (#5282)
• Dataset fetchers use different filenames under Python 2 and Python 3 to avoid pickling compatibility issues. By
Olivier Grisel. (#5355)
• Fixed a bug in naive_bayes.GaussianNB which caused classification results to depend on scale. By Jake
Vanderplas.
• Temporarily fixed linear_model.Ridge, which was incorrect when fitting the intercept in the case of sparse data. The fix automatically changes the solver to ‘sag’ in this case. #5360 by Tom Dupre la Tour.
• Fixed a performance bug in decomposition.RandomizedPCA on data with a large number of features
and fewer samples. (#4478) By Andreas Müller, Loic Esteve and Giorgio Patrini.
• Fixed bug in cross_decomposition.PLS that yielded unstable and platform dependent output, and failed
on fit_transform. By Arthur Mensch.
• Fixes to the Bunch class used to store datasets.
• Fixed ensemble.plot_partial_dependence ignoring the percentiles parameter.
• Providing a set as vocabulary in CountVectorizer no longer leads to inconsistent results when pickling.
• Fixed the conditions on when a precomputed Gram matrix needs to be recomputed in linear_model.
LinearRegression, linear_model.OrthogonalMatchingPursuit, linear_model.Lasso
and linear_model.ElasticNet.
• Fixed inconsistent memory layout in the coordinate descent solver that affected linear_model.
DictionaryLearning and covariance.GraphLasso. (#5337) By Olivier Grisel.
• manifold.LocallyLinearEmbedding no longer ignores the reg parameter.
• Nearest Neighbor estimators with custom distance metrics can now be pickled. (#4362)
• Fixed a bug in pipeline.FeatureUnion where transformer_weights were not properly handled
when performing grid-searches.
• Fixed a bug in linear_model.LogisticRegression and linear_model.
LogisticRegressionCV when using class_weight='balanced' or class_weight='auto'.
By Tom Dupre la Tour.
Code Contributors
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando
Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly,
Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee,
Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat,
Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard,
Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry
Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric
Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles
Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Im-
manuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen,
Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph,
Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler
Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Ku-
mar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada,
Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson,
Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli
Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert
Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka, Sebas-
tian Saeger, Shivan Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De
Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head, Timo-
thy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta,
Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro,
Zichen Wang
Changelog
Bug fixes
Highlights
• Speed improvements (notably in cluster.DBSCAN ), reduced memory requirements, bug-fixes and better
default settings.
• Multinomial Logistic regression and a path algorithm in linear_model.LogisticRegressionCV .
• Out-of core learning of PCA via decomposition.IncrementalPCA.
• Probability calibration of classifiers using calibration.CalibratedClassifierCV.
• cluster.Birch clustering method for large-scale datasets.
• Scalable approximate nearest neighbors search with Locality-sensitive hashing forests in neighbors.
LSHForest.
• Improved error messages and better validation when using malformed input data.
• More robust integration with pandas dataframes.
Changelog
New features
• The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors
search. By Maheshakya Wijewardena.
• Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression which is
much faster for large sample sizes than svm.SVR with linear kernel. By Fabian Pedregosa and Qiang Luo.
• Incremental fit for GaussianNB.
• Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By
Arnaud Joly.
• Added the metrics.label_ranking_average_precision_score metrics. By Arnaud Joly.
• Add the metrics.coverage_error metrics. By Arnaud Joly.
• Added linear_model.LogisticRegressionCV . By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux
and Alexandre Gramfort.
• Added a warm_start constructor parameter to make it possible for any trained forest model to grow additional trees incrementally (see the sketch after this list). By Laurent Direr.
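A minimal sketch of growing an existing forest with warm_start (the sizes are illustrative only):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=50, warm_start=True,
                                    random_state=0).fit(X, y)
    forest.set_params(n_estimators=100)  # 50 additional trees are fitted on the next call
    forest.fit(X, y)
    print(len(forest.estimators_))  # 100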
Enhancements
Documentation improvements
Bug fixes
• RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of
gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-
validated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
• Pipeline objects delegate the classes_ attribute to the underlying estimator. This allows, for instance, bagging of a pipeline object. By Arnaud Joly
• neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan.
It was using the mean before. By Manoj Kumar
• Fix numerical stability issues in linear_model.SGDClassifier and linear_model.
SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for
large l2 regularization and large learning rate values). By Olivier Grisel
• When compute_full_tree was set to “auto”, the full tree was built when n_clusters was high and building was stopped early when n_clusters was low, while the behavior should be vice-versa in cluster.AgglomerativeClustering (and friends). This has been fixed by Manoj Kumar
• Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was
centered around one. It has been changed to be centered around the origin. By Manoj Kumar
• Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using
connectivity constraints. By Cathy Deng
• Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and
sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
• Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the
multi-label setting. By Andreas Müller.
• Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors,
kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.
NearestNeighbors and family, when the query data is not the same as fit data. By Manoj Kumar.
• Fix log-density calculation in the mixture.GMM with tied covariance. By Will Dawson
• Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By
Andrew Tulloch
• Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weight-
ing and having identical data points. By Garret-R.
• Fixed round off errors with non positive-definite covariance matrices in GMM. By Alexis Mignon.
• Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
• Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying
on the boundary for algorithm='brute'. By Yan Yi.
• Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and
decision_function. By Artem Sobolev.
• Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets
(secondary method). By Andreas Müller and Michael Bommarito.
• GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into
arrays any more, allowing DataFrame specific operations in custom estimators.
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander
Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, An-
drew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu,
Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng,
Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brun-
ner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Suther-
land, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R,
Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil
Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque,
isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff
Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle
Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michel-
bacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz
Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael
Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas,
Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap
Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert
Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan
Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Un-
terthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta,
Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan
Meng, Yan Yi, Yu-Chin
September 4, 2014
Bug fixes
• Fixed handling of the p parameter of the Minkowski distance that was previously ignored in nearest neighbors
models. By Nikolay Mayorov.
• Fixed duplicated alphas in linear_model.LassoLars with early stopping on 32 bit Python. By Olivier
Grisel and Fabian Pedregosa.
• Fixed the build under Windows when scikit-learn is built with MSVC while NumPy is built with MinGW. By
Olivier Grisel and Federico Vaggi.
• Fixed an array index overflow bug in the coordinate descent solver. By Gael Varoquaux.
• Better handling of numpy 1.9 deprecation warnings. By Gael Varoquaux.
• Removed unnecessary data copy in cluster.KMeans. By Gael Varoquaux.
• Explicitly close open files to avoid ResourceWarnings under Python 3. By Calvin Giles.
• The transform of discriminant_analysis.LinearDiscriminantAnalysis now projects the
input on the most discriminant directions. By Martin Billinger.
• Fixed potential overflow in _tree.safe_realloc by Lars Buitinck.
August 1, 2014
Bug fixes
Highlights
Changelog
New features
Enhancements
• Reduce memory usage and overhead when fitting and predicting with forests of randomized trees in parallel
with n_jobs != 1 by leveraging new threading backend of joblib 0.8 and releasing the GIL in the tree fitting
Cython code. By Olivier Grisel and Gilles Louppe.
• Speed improvement of the sklearn.ensemble.gradient_boosting module. By Gilles Louppe and
Peter Prettenhofer.
• Various enhancements to the sklearn.ensemble.gradient_boosting module: a warm_start argument to fit additional trees, a max_leaf_nodes argument to fit GBM-style trees, a monitor argument to fit to inspect the estimator during training, and refactoring of the verbose code. By Peter Prettenhofer.
• Faster sklearn.ensemble.ExtraTrees by caching feature values. By Arnaud Joly.
• Faster depth-based tree building for decision trees, random forests, extra trees and gradient tree boosting (with the depth-based growing strategy) by not attempting to split on features found to be constant in the sample subset. By Arnaud Joly.
• Add min_weight_fraction_leaf pre-pruning parameter to tree-based methods: the minimum weighted
fraction of the input samples required to be at a leaf node. By Noel Dawe.
• Added metrics.pairwise_distances_argmin_min, by Philippe Gervais.
• Added predict method to cluster.AffinityPropagation and cluster.MeanShift, by Mathieu
Blondel.
• Vector and matrix multiplications have been optimised throughout the library by Denis Engemann, and Alexan-
dre Gramfort. In particular, they should take less memory with older NumPy versions (prior to 1.7.2).
• Precision-recall and ROC examples now use train_test_split, and have more explanation of why these metrics
are useful. By Kyle Kastner
• The training algorithm for decomposition.NMF is faster for sparse matrices and has much lower memory
complexity, meaning it will scale up gracefully to large datasets. By Lars Buitinck.
• Added an svd_method option, with default value “randomized”, to decomposition.FactorAnalysis to save memory and significantly speed up computation, by Denis Engemann and Alexandre Gramfort.
• Changed cross_validation.StratifiedKFold to try to preserve as much of the original ordering of samples as possible so as not to hide overfitting on datasets with a non-negligible level of sample dependency. By Daniel Nouri and Olivier Grisel.
• Add multi-output support to gaussian_process.GaussianProcess by John Novak.
• Support for precomputed distance matrices in nearest neighbor estimators by Robert Layton and Joel Nothman.
• Norm computations optimized for NumPy 1.6 and later versions by Lars Buitinck. In particular, the k-means
algorithm no longer needs a temporary data structure the size of its input.
• dummy.DummyClassifier can now be used to predict a constant output value. By Manoj Kumar.
• dummy.DummyRegressor now has a strategy parameter which allows predicting the mean, the median of the training set, or a constant output value. By Maheshakya Wijewardena.
• Multi-label classification output in multilabel indicator format is now supported by metrics.
roc_auc_score and metrics.average_precision_score by Arnaud Joly.
• Significant performance improvements (more than 100x speedup for large problems) in isotonic.
IsotonicRegression by Andrew Tulloch.
• Speed and memory usage improvements to the SGD algorithm for linear models: it now uses threads, not
separate processes, when n_jobs>1. By Lars Buitinck.
• Grid search and cross validation allow NaNs in the input arrays so that preprocessors such as
preprocessing.Imputer can be trained within the cross validation loop, avoiding potentially skewed
results.
• Ridge regression can now deal with sample weights in feature space (previously only in sample space). By Michael Eickenberg. Both solutions are provided by the Cholesky solver.
• Several classification and regression metrics now support weighted samples with the new sample_weight argument: metrics.accuracy_score, metrics.zero_one_loss, metrics.precision_score, metrics.average_precision_score, metrics.f1_score, metrics.fbeta_score, metrics.recall_score, metrics.roc_auc_score, metrics.explained_variance_score, metrics.mean_squared_error, metrics.mean_absolute_error, metrics.r2_score. By Noel Dawe. (A short usage sketch follows after this list.)
• Speed up of the sample generator datasets.make_multilabel_classification. By Joel Nothman.
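As a quick illustration of the sample_weight support mentioned in the list above, here is a minimal sketch with made-up labels and weights (the other metrics listed accept the same keyword):
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0, 1, 0, 0, 1])
>>> weights = np.array([1.0, 1.0, 0.5, 2.0, 1.0])  # hypothetical per-sample weights
>>> accuracy_score(y_true, y_pred, sample_weight=weights)
0.9090909090909091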
Documentation improvements
• The Working With Text Data tutorial has now been worked into the main documentation’s tutorial section. Includes exercises and skeletons for tutorial presentation. Original tutorial created by several authors including Olivier Grisel, Lars Buitinck and many others. Tutorial integration into the scikit-learn documentation by Jaques Grobler.
• Added Computational Performance documentation. Discussion and examples of prediction latency / throughput and the different factors that influence speed. Additional tips for building faster models and choosing a relevant compromise between speed and predictive power. By Eustache Diemert.
Bug fixes
People
• 6 Daniel Nouri
• 6 Chen Liu
• 6 Michael Eickenberg
• 6 ugurthemaster
• 5 Aaron Schumacher
• 5 Baptiste Lagarde
• 5 Rajat Khanduja
• 5 Robert McGibbon
• 5 Sergio Pascual
• 4 Alexis Metaireau
• 4 Ignacio Rossi
• 4 Virgile Fritsch
• 4 Sebastian Säger
• 4 Ilambharathi Kanniah
• 4 sdenton4
• 4 Robert Layton
• 4 Alyssa
• 4 Amos Waterland
• 3 Andrew Tulloch
• 3 murad
• 3 Steven Maude
• 3 Karol Pysniak
• 3 Jacques Kvam
• 3 cgohlke
• 3 cjlin
• 3 Michael Becker
• 3 hamzeh
• 3 Eric Jacobsen
• 3 john collins
• 3 kaushik94
• 3 Erwin Marsi
• 2 csytracy
• 2 LK
• 2 Vlad Niculae
• 2 Laurent Direr
• 2 Erik Shilts
• 2 Raul Garreta
• 2 Yoshiki Vázquez Baeza
• 2 Yung Siang Liau
• 2 abhishek thakur
• 2 James Yu
• 2 Rohit Sivaprasad
• 2 Roland Szabo
• 2 amormachine
• 2 Alexis Mignon
• 2 Oscar Carlsson
• 2 Nantas Nardelli
• 2 jess010
• 2 kowalski87
• 2 Andrew Clegg
• 2 Federico Vaggi
• 2 Simon Frid
• 2 Félix-Antoine Fortin
• 1 Ralf Gommers
• 1 t-aft
• 1 Ronan Amicel
• 1 Rupesh Kumar Srivastava
• 1 Ryan Wang
• 1 Samuel Charron
• 1 Samuel St-Jean
• 1 Fabian Pedregosa
• 1 Skipper Seabold
• 1 Stefan Walk
• 1 Stefan van der Walt
• 1 Stephan Hoyer
• 1 Allen Riddell
• 1 Valentin Haenel
• 1 Vijay Ramesh
• 1 Will Myers
• 1 Yaroslav Halchenko
• 1 Yoni Ben-Meshulam
• 1 Yury V. Zaytsev
• 1 adrinjalali
• 1 ai8rahim
• 1 alemagnani
• 1 alex
• 1 benjamin wilson
• 1 chalmerlowe
• 1 dzikie drożdże
• 1 jamestwebber
• 1 matrixorz
• 1 popo
• 1 samuela
• 1 François Boulogne
• 1 Alexander Measure
• 1 Ethan White
• 1 Guilherme Trein
• 1 Hendrik Heuer
• 1 IvicaJovic
• 1 Jan Hendrik Metzen
• 1 Jean Michel Rouly
• 1 Eduardo Ariño de la Rubia
• 1 Jelle Zijlstra
• 1 Eddy L O Jansson
• 1 Denis
• 1 John
• 1 John Schmidt
• 1 Jorge Cañardo Alastuey
• 1 Joseph Perla
• 1 Joshua Vredevoogd
• 1 José Ricardo
• 1 Julien Miotte
• 1 Kemal Eren
• 1 Kenta Sato
• 1 David Cournapeau
• 1 Kyle Kelley
• 1 Daniele Medri
• 1 Laurent Luce
• 1 Laurent Pierron
• 1 Luis Pedro Coelho
• 1 DanielWeitzenfeld
• 1 Craig Thompson
• 1 Chyi-Kwei Yau
• 1 Matthew Brett
• 1 Matthias Feurer
• 1 Max Linke
• 1 Chris Filo Gorgolewski
• 1 Charles Earl
• 1 Michael Hanke
• 1 Michele Orrù
• 1 Bryan Lunt
• 1 Brian Kearns
• 1 Paul Butler
• 1 Paweł Mandera
• 1 Peter
• 1 Andrew Ash
• 1 Pietro Zambelli
• 1 staubda
August 7, 2013
Changelog
• Missing values with sparse and dense matrices can be imputed with the transformer preprocessing.
Imputer by Nicolas Trésegnie.
• The core implementation of decision trees has been rewritten from scratch, allowing for faster tree induction and lower memory consumption in all tree-based estimators. By Gilles Louppe.
• Added ensemble.AdaBoostClassifier and ensemble.AdaBoostRegressor, by Noel Dawe and
Gilles Louppe. See the AdaBoost section of the user guide for details and examples.
• Added grid_search.RandomizedSearchCV and grid_search.ParameterSampler for random-
ized hyperparameter optimization. By Andreas Müller.
• Added biclustering algorithms (sklearn.cluster.bicluster.SpectralCoclustering and
sklearn.cluster.bicluster.SpectralBiclustering), data generation methods (sklearn.
datasets.make_biclusters and sklearn.datasets.make_checkerboard), and scoring met-
rics (sklearn.metrics.consensus_score). By Kemal Eren.
• Added Restricted Boltzmann Machines (neural_network.BernoulliRBM ). By Yann Dauphin.
• Python 3 support by Justin Vincent, Lars Buitinck, Subhodeep Moitra and Olivier Grisel. All tests now pass
under Python 3.3.
• Ability to pass one penalty (alpha value) per target in linear_model.Ridge, by @eickenberg and Mathieu
Blondel.
• Fixed an L2 regularization issue in sklearn.linear_model.stochastic_gradient.py (of minor practical significance). By Norbert Crombach and Mathieu Blondel.
• Added an interactive version of Andreas Müller’s Machine Learning Cheat Sheet (for scikit-learn) to the docu-
mentation. See Choosing the right estimator. By Jaques Grobler.
• grid_search.GridSearchCV and cross_validation.cross_val_score now support the use
of advanced scoring function such as area under the ROC curve and f-beta scores. See The scoring parameter:
defining model evaluation rules for details. By Andreas Müller and Lars Buitinck. Passing a function from
sklearn.metrics as score_func is deprecated.
• Multi-label classification output is now supported by metrics.accuracy_score,
metrics.zero_one_loss, metrics.f1_score, metrics.fbeta_score, metrics.
classification_report, metrics.precision_score and metrics.recall_score by
Arnaud Joly.
• Two new metrics metrics.hamming_loss and metrics.jaccard_similarity_score are added
with multi-label support by Arnaud Joly.
• Speed and memory usage improvements in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, by Jochen Wersdörfer and Roman Sinayev.
• The min_df parameter in feature_extraction.text.CountVectorizer and
feature_extraction.text.TfidfVectorizer, which used to be 2, has been reset to 1 to
avoid unpleasant surprises (empty vocabularies) for novice users who try it out on tiny document collections. A
value of at least 2 is still recommended for practical use.
• svm.LinearSVC, linear_model.SGDClassifier and linear_model.SGDRegressor now have a sparsify method that converts their coef_ into a sparse matrix, meaning stored models trained using these estimators can be made much more compact (a short sketch follows after this list).
• linear_model.SGDClassifier now produces multiclass probability estimates when trained under log
loss or modified Huber loss.
• Hyperlinks to documentation in example code on the website by Martin Luessi.
• Fixed bug in preprocessing.MinMaxScaler causing incorrect scaling of the features for non-default
feature_range settings. By Andreas Müller.
• max_features in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor and all
derived ensemble estimators now supports percentage values. By Gilles Louppe.
• Performance improvements in isotonic.IsotonicRegression by Nelle Varoquaux.
• metrics.accuracy_score has an option normalize to return the fraction or the number of correctly classified samples by Arnaud Joly.
• Added metrics.log_loss that computes log loss, aka cross-entropy loss. By Jochen Wersdörfer and Lars
Buitinck.
• A bug that caused ensemble.AdaBoostClassifier to output incorrect probabilities has been fixed.
• Feature selectors now share a mixin providing consistent transform, inverse_transform and
get_support methods. By Joel Nothman.
• A fitted grid_search.GridSearchCV or grid_search.RandomizedSearchCV can now generally
be pickled. By Joel Nothman.
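To illustrate the sparsify method mentioned in the list above, a minimal sketch on tiny made-up data (the memory savings in practice depend on how sparse coef_ actually is):
>>> import numpy as np
>>> from scipy import sparse
>>> from sklearn.linear_model import SGDClassifier
>>> X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
>>> y = np.array([0, 1, 1, 0])
>>> clf = SGDClassifier(penalty="l1", random_state=0).fit(X, y)
>>> clf = clf.sparsify()  # coef_ is converted to a scipy.sparse matrix
>>> sparse.issparse(clf.coef_)
True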
People
• 7 Hrishikesh Huilgolkar
• 6 Kyle Kastner
• 6 Martin Luessi
• 6 Rob Speer
• 5 Federico Vaggi
• 5 Raul Garreta
• 5 Rob Zinkov
• 4 Ken Geis
• 3 A. Flaxman
• 3 Denton Cockburn
• 3 Dougal Sutherland
• 3 Ian Ozsvald
• 3 Johannes Schönberger
• 3 Robert McGibbon
• 3 Roman Sinayev
• 3 Szabo Roland
• 2 Diego Molla
• 2 Imran Haque
• 2 Jochen Wersdörfer
• 2 Sergey Karayev
• 2 Yannick Schwartz
• 2 jamestwebber
• 1 Abhijeet Kolhe
• 1 Alexander Fabisch
• 1 Bastiaan van den Berg
• 1 Benjamin Peterson
• 1 Daniel Velkov
• 1 Fazlul Shahriar
• 1 Felix Brockherde
• 1 Félix-Antoine Fortin
• 1 Harikrishnan S
• 1 Jack Hale
• 1 JakeMick
• 1 James McDermott
• 1 John Benediktsson
• 1 John Zwinck
• 1 Joshua Vredevoogd
• 1 Justin Pati
• 1 Kevin Hughes
• 1 Kyle Kelley
• 1 Matthias Ekman
• 1 Miroslav Shubernetskiy
• 1 Naoki Orii
• 1 Norbert Crombach
• 1 Rafael Cunha de Almeida
• 1 Rolando Espinoza La fuente
• 1 Seamus Abshere
• 1 Sergey Feldman
• 1 Sergio Medina
• 1 Stefano Lattarini
• 1 Steve Koch
• 1 Sturla Molden
• 1 Thomas Jarosch
• 1 Yaroslav Halchenko
Changelog
People
Changelog
• metrics.zero_one_loss (formerly metrics.zero_one) now has option for normalized output that
reports the fraction of misclassifications, rather than the raw number of misclassifications. By Kyle Beauchamp.
• tree.DecisionTreeClassifier and all derived ensemble models now support sample weighting, by
Noel Dawe and Gilles Louppe.
• Speedup improvement when using bootstrap samples in forests of randomized trees, by Peter Prettenhofer and
Gilles Louppe.
• Partial dependence plots for Gradient Tree Boosting in ensemble.partial_dependence.
partial_dependence by Peter Prettenhofer. See Partial Dependence Plots for an example.
• The table of contents on the website has now been made expandable by Jaques Grobler.
• feature_selection.SelectPercentile now breaks ties deterministically instead of returning all
equally ranked features.
• feature_selection.SelectKBest and feature_selection.SelectPercentile are more
numerically stable since they use scores, rather than p-values, to rank results. This means that they might
sometimes select different features than they did previously.
• Ridge regression and ridge classification fitting with sparse_cg solver no longer has quadratic memory com-
plexity, by Lars Buitinck and Fabian Pedregosa.
• Ridge regression and ridge classification now support a new fast solver called lsqr, by Mathieu Blondel.
• Speed up of metrics.precision_recall_curve by Conrad Lee.
• Added support for reading/writing svmlight files with pairwise preference attribute (qid in svmlight file format)
in datasets.dump_svmlight_file and datasets.load_svmlight_file by Fabian Pedregosa.
• Faster and more robust metrics.confusion_matrix and Clustering performance evaluation by Wei Li.
• cross_validation.cross_val_score now works with precomputed kernels and affinity matrices, by
Andreas Müller.
• LARS algorithm made more numerically stable with heuristics to drop regressors too correlated as well as to
stop the path when numerical noise becomes predominant, by Gael Varoquaux.
• Faster implementation of metrics.precision_recall_curve by Conrad Lee.
• New kernel metrics.chi2_kernel by Andreas Müller, often used in computer vision applications (a short usage sketch follows after this list).
• Fix of longstanding bug in naive_bayes.BernoulliNB fixed by Shaun Jackman.
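A brief, illustrative sketch of the chi-squared kernel mentioned above (the toy vectors stand in for, e.g., normalized histograms, since the kernel assumes non-negative features):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import chi2_kernel
>>> X = np.array([[0.3, 0.7], [0.6, 0.4]])
>>> K = chi2_kernel(X, gamma=1.0)
>>> K.shape
(2, 2)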
People
• 106 Wei Li
• 101 Olivier Grisel
• 65 Vlad Niculae
• 54 Gilles Louppe
• 40 Jaques Grobler
• 38 Alexandre Gramfort
• 30 Rob Zinkov
• 19 Aymeric Masurelle
• 18 Andrew Winterman
• 17 Fabian Pedregosa
• 17 Nelle Varoquaux
• 16 Christian Osendorfer
• 14 Daniel Nouri
• 13 Virgile Fritsch
• 13 syhw
• 12 Satrajit Ghosh
• 10 Corey Lynch
• 10 Kyle Beauchamp
• 9 Brian Cheung
• 9 Immanuel Bayer
• 9 mr.Shu
• 8 Conrad Lee
• 8 James Bergstra
• 7 Tadej Janež
• 6 Brian Cajes
• 6 Jake Vanderplas
• 6 Michael
• 6 Noel Dawe
• 6 Tiago Nunes
• 6 cow
• 5 Anze
• 5 Shiqiao Du
• 4 Christian Jauvin
• 4 Jacques Kvam
• 4 Richard T. Guy
• 4 Robert Layton
• 3 Alexandre Abraham
• 3 Doug Coleman
• 3 Scott Dickerson
• 2 ApproximateIdentity
• 2 John Benediktsson
• 2 Mark Veronda
• 2 Matti Lyra
• 2 Mikhail Korobov
• 2 Xinfan Meng
• 1 Alejandro Weinstein
• 1 Alexandre Passos
• 1 Christoph Deil
• 1 Eugene Nizhibitsky
• 1 Kenneth C. Arnold
• 1 Luis Pedro Coelho
• 1 Miroslav Batchkarov
• 1 Pavel
• 1 Sebastian Berg
• 1 Shaun Jackman
• 1 Subhodeep Moitra
• 1 bob
• 1 dengemann
• 1 emanuele
• 1 x006
October 8, 2012
The 0.12.1 release is a bug-fix release with no additional features; it consists solely of bug fixes.
Changelog
People
• 14 Peter Prettenhofer
• 12 Gael Varoquaux
• 10 Andreas Müller
• 5 Lars Buitinck
• 3 Virgile Fritsch
• 1 Alexandre Gramfort
• 1 Gilles Louppe
• 1 Mathieu Blondel
September 4, 2012
Changelog
• Add MultiTaskLasso and MultiTaskElasticNet for joint feature selection, by Alexandre Gramfort.
• Added metrics.auc_score and metrics.average_precision_score convenience functions by
Andreas Müller.
• Improved sparse matrix support in the Feature selection module by Andreas Müller.
• New word boundaries-aware character n-gram analyzer for the Text feature extraction module by @kernc.
• Fixed bug in spectral clustering that led to single point clusters by Andreas Müller.
• In feature_extraction.text.CountVectorizer, added an option to ignore infrequent words,
min_df by Andreas Müller.
• Add support for multiple targets in some linear models (ElasticNet, Lasso and OrthogonalMatchingPursuit) by
Vlad Niculae and Alexandre Gramfort.
• Fixes in decomposition.ProbabilisticPCA score function by Wei Li.
• Fixed feature importance computation in Gradient Tree Boosting.
• The old scikits.learn package has disappeared; all code should import from sklearn instead, which
was introduced in 0.9.
• In metrics.roc_curve, the thresholds array is now returned with its order reversed, in order to keep it consistent with the order of the returned fpr and tpr.
• In hmm objects, like hmm.GaussianHMM, hmm.MultinomialHMM, etc., all parameters must be passed to
the object when initialising it and not through fit. Now fit will only accept the data as an input parameter.
• For all SVM classes, a faulty behavior of gamma was fixed. Previously, the default gamma value was only
computed the first time fit was called and then stored. It is now recalculated on every call to fit.
• All Base classes are now abstract metaclasses so that they cannot be instantiated.
• cluster.ward_tree now also returns the parent array. This is necessary for early-stopping in which case
the tree is not completely built.
• In feature_extraction.text.CountVectorizer the parameters min_n and max_n were joined into the parameter ngram_range to enable grid-searching both at once.
• In feature_extraction.text.CountVectorizer, words that appear only in one document are now
ignored by default. To reproduce the previous behavior, set min_df=1.
• Fixed API inconsistency: linear_model.SGDClassifier.predict_proba now returns 2d array
when fit on two classes.
• Fixed API inconsistency: discriminant_analysis.QuadraticDiscriminantAnalysis.
decision_function and discriminant_analysis.LinearDiscriminantAnalysis.
decision_function now return 1d arrays when fit on two classes.
• Grid of alphas used for fitting linear_model.LassoCV and linear_model.ElasticNetCV is now
stored in the attribute alphas_ rather than overriding the init parameter alphas.
• Linear models when alpha is estimated by cross-validation store the estimated value in the alpha_ attribute
rather than just alpha or best_alpha.
• ensemble.GradientBoostingClassifier now supports ensemble.
GradientBoostingClassifier.staged_predict_proba, and ensemble.
GradientBoostingClassifier.staged_predict.
• svm.sparse.SVC and other sparse SVM classes are now deprecated. All classes in the Support Vector Machines module now automatically select the sparse or dense representation based on the input.
• All clustering algorithms now interpret the array X given to fit as input data, in particular cluster.
SpectralClustering and cluster.AffinityPropagation which previously expected affinity ma-
trices.
• For clustering algorithms that take the desired number of clusters as a parameter, this parameter is now called
n_clusters.
People
• 3 flyingimmidev
• 2 Francois Savard
• 2 Hannes Schulz
• 2 Peter Welinder
• 2 Yaroslav Halchenko
• 2 Wei Li
• 1 Alex Companioni
• 1 Brandyn A. White
• 1 Bussonnier Matthias
• 1 Charles-Pierre Astolfi
• 1 Dan O’Huiginn
• 1 David Cournapeau
• 1 Keith Goodman
• 1 Ludwig Schwardt
• 1 Olivier Hervieu
• 1 Sergio Medina
• 1 Shiqiao Du
• 1 Tim Sheerman-Chase
• 1 buguen
May 7, 2012
Changelog
Highlights
• Gradient boosted regression trees (Gradient Tree Boosting) for classification and regression by Peter Prettenhofer and Scott White.
• Simple dict-based feature loader with support for categorical variables (feature_extraction.
DictVectorizer) by Lars Buitinck.
• Added Matthews correlation coefficient (metrics.matthews_corrcoef) and added macro and micro av-
erage options to metrics.precision_score, metrics.recall_score and metrics.f1_score
by Satrajit Ghosh.
• Out of Bag Estimates of generalization error for Ensemble methods by Andreas Müller.
• Randomized sparse linear models for feature selection, by Alexandre Gramfort and Gael Varoquaux
• Label Propagation for semi-supervised learning, by Clay Woolam. Note the semi-supervised API is still work
in progress, and may change.
• Added BIC/AIC model selection to classical Gaussian mixture models and unified the API with the remainder
of scikit-learn, by Bertrand Thirion
• Added sklearn.cross_validation.StratifiedShuffleSplit, which is a sklearn.
cross_validation.ShuffleSplit with balanced splits, by Yannick Schwartz.
• sklearn.neighbors.NearestCentroid classifier added, along with a shrink_threshold param-
eter, which implements shrunken centroid classification, by Robert Layton.
Other changes
• Merged dense and sparse implementations of Stochastic Gradient Descent module and exposed utility extension
types for sequential datasets seq_dataset and weight vectors weight_vector by Peter Prettenhofer.
• Added partial_fit (support for online/minibatch learning) and warm_start to the Stochastic Gradient De-
scent module by Mathieu Blondel.
• Dense and sparse implementations of Support Vector Machines classes and linear_model.
LogisticRegression merged by Lars Buitinck.
• Regressors can now be used as base estimator in the Multiclass and multilabel algorithms module by Mathieu
Blondel.
• Added n_jobs option to metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels for parallel computation, by Mathieu Blondel.
• K-means can now be run in parallel, using the n_jobs argument to either the k_means function or the KMeans estimator, by Robert Layton.
• Improved Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estima-
tor documentation and introduced the new cross_validation.train_test_split helper function by
Olivier Grisel
• svm.SVC members coef_ and intercept_ changed sign for consistency with decision_function;
for kernel==linear, coef_ was fixed in the one-vs-one case, by Andreas Müller.
• Performance improvements to efficient leave-one-out cross-validated Ridge regression, esp. for the
n_samples > n_features case, in linear_model.RidgeCV , by Reuben Fletcher-Costin.
• Refactoring and simplification of the Text feature extraction API and fixed a bug that caused possible negative
IDF, by Olivier Grisel.
• Beam pruning option in _BaseHMM module has been removed since it is difficult to Cythonize. If you are
interested in contributing a Cython version, you can use the python version in the git history as a reference.
• Classes in Nearest Neighbors now support arbitrary Minkowski metric for nearest neighbors searches. The
metric can be specified by argument p.
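For illustration, a minimal sketch of the Minkowski parameter on made-up points (p=1 corresponds to the Manhattan distance; p=2, the default, is the Euclidean distance):
>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> X = np.array([[0., 0.], [1., 0.], [0., 2.]])
>>> nn = NearestNeighbors(n_neighbors=2, p=1).fit(X)
>>> dist, ind = nn.kneighbors([[0., 0.]])
>>> dist
array([[0., 1.]])
>>> ind
array([[0, 1]])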
People
• 1 Claire Revillet
• 1 Conrad Lee
• 1 Edouard Duchesnay
• 1 Jan Hendrik Metzen
• 1 Meng Xinfan
• 1 Rob Zinkov
• 1 Shiqiao
• 1 Udi Weinsberg
• 1 Virgile Fritsch
• 1 Xinfan Meng
• 1 Yaroslav Halchenko
• 1 jansoe
• 1 Leon Palafox
Changelog
• Python 2.5 compatibility was dropped; the minimum Python version needed to use scikit-learn is now 2.6.
• Sparse inverse covariance estimation using the graph Lasso, with associated cross-validated estimator, by Gael
Varoquaux
• New Tree module by Brian Holt, Peter Prettenhofer, Satrajit Ghosh and Gilles Louppe. The module comes with
complete documentation and examples.
• Fixed a bug in the RFE module by Gilles Louppe (issue #378).
• Fixed a memory leak in Support Vector Machines module by Brian Holt (issue #367).
• Faster tests by Fabian Pedregosa and others.
• Silhouette Coefficient cluster analysis evaluation metric added as sklearn.metrics.
silhouette_score by Robert Layton.
• Fixed a bug in K-means in the handling of the n_init parameter: the clustering algorithm used to be run
n_init times but the last solution was retained instead of the best solution by Olivier Grisel.
• Minor refactoring in Stochastic Gradient Descent module; consolidated dense and sparse predict methods; En-
hanced test time performance by converting model parameters to fortran-style arrays after fitting (only multi-
class).
• Adjusted Mutual Information metric added as sklearn.metrics.adjusted_mutual_info_score
by Robert Layton.
• Models like SVC/SVR/LinearSVC/LogisticRegression from libsvm/liblinear now support scaling of C regular-
ization parameter by the number of samples by Alexandre Gramfort.
• New Ensemble Methods module by Gilles Louppe and Brian Holt. The module comes with the random forest
algorithm and the extra-trees method, along with documentation and examples.
• Novelty and Outlier Detection: outlier and novelty detection, by Virgile Fritsch.
• Kernel Approximation: a transform implementing kernel approximation for fast SGD on non-linear kernels by
Andreas Müller.
• Fixed a bug due to atom swapping in Orthogonal Matching Pursuit (OMP) by Vlad Niculae.
• Sparse coding with a precomputed dictionary by Vlad Niculae.
• Mini Batch K-Means performance improvements by Olivier Grisel.
• K-means support for sparse matrices by Mathieu Blondel.
• Improved documentation for developers and for the sklearn.utils module, by Jake Vanderplas.
• Vectorized 20newsgroups dataset loader (sklearn.datasets.fetch_20newsgroups_vectorized)
by Mathieu Blondel.
• Multiclass and multilabel algorithms by Lars Buitinck.
• Utilities for fast computation of mean and variance for sparse matrices by Mathieu Blondel.
• Make sklearn.preprocessing.scale and sklearn.preprocessing.Scaler work on sparse
matrices by Olivier Grisel
• Feature importances using decision trees and/or forest of trees, by Gilles Louppe.
• Parallel implementation of forests of randomized trees by Gilles Louppe.
• sklearn.cross_validation.ShuffleSplit can subsample the train sets as well as the test sets by
Olivier Grisel.
• Errors in the build of the documentation fixed by Andreas Müller.
Here are the code migration instructions when upgrading from scikit-learn version 0.9:
• Some estimators that may overwrite their inputs to save memory previously had overwrite_ parameters;
these have been replaced with copy_ parameters with exactly the opposite meaning.
This particularly affects some of the estimators in linear_model. The default behavior is still to copy
everything passed in.
• The SVMlight dataset loader sklearn.datasets.load_svmlight_file no longer supports loading
two files at once; use load_svmlight_files instead. Also, the (unused) buffer_mb parameter is gone.
• Sparse estimators in the Stochastic Gradient Descent module use dense parameter vector coef_ instead of
sparse_coef_. This significantly improves test time performance.
• The Covariance estimation module now has a robust estimator of covariance, the Minimum Covariance Deter-
minant estimator.
• Cluster evaluation metrics in metrics.cluster have been refactored but the changes are backwards compat-
ible. They have been moved to the metrics.cluster.supervised, along with metrics.cluster.
unsupervised which contains the Silhouette Coefficient.
• The permutation_test_score function now behaves the same way as cross_val_score (i.e. uses
the mean score across the folds.)
• Cross Validation generators now use integer indices (indices=True) by default instead of boolean masks. This makes it more intuitive to use with sparse matrix data.
• The functions used for sparse coding, sparse_encode and sparse_encode_parallel have been com-
bined into sklearn.decomposition.sparse_encode, and the shapes of the arrays have been trans-
posed for consistency with the matrix factorization setting, as opposed to the regression setting.
• Fixed an off-by-one error in the SVMlight/LibSVM file format handling; files generated using sklearn.
datasets.dump_svmlight_file should be re-generated. (They should continue to work, but acciden-
tally had one extra column of zeros prepended.)
• BaseDictionaryLearning class replaced by SparseCodingMixin.
• sklearn.utils.extmath.fast_svd has been renamed sklearn.utils.extmath.
randomized_svd and the default oversampling is now fixed to 10 additional random vectors instead
of doubling the number of components to extract. The new behavior follows the reference paper.
People
• 1 Félix-Antoine Fortin
• 1 Juan Manuel Caicedo Carvajal
• 1 Nelle Varoquaux
• 1 Nicolas Pinto
• 1 Tiziano Zito
• 1 Xinfan Meng
Changelog
• Nearest Neighbors module refactoring by Jake Vanderplas : general refactoring, support for sparse matrices in
input, speed and documentation improvements. See the next section for a full list of API changes.
• Improvements on the Feature selection module by Gilles Louppe : refactoring of the RFE classes, documenta-
tion rewrite, increased efficiency and minor API changes.
• Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA) by Vlad Niculae, Gael Varo-
quaux and Alexandre Gramfort
• Printing an estimator now behaves independently of architectures and Python version thanks to Jean Kossaifi.
• Loader for libsvm/svmlight format by Mathieu Blondel and Lars Buitinck
• Documentation improvements: thumbnails in example gallery by Fabian Pedregosa.
• Important bugfixes in Support Vector Machines module (segfaults, bad performance) by Fabian Pedregosa.
• Added Multinomial Naive Bayes and Bernoulli Naive Bayes by Lars Buitinck
• Text feature extraction optimizations by Lars Buitinck
• Chi-Square feature selection (feature_selection.univariate_selection.chi2) by Lars Buit-
inck.
• Generated datasets module refactoring by Gilles Louppe
• Multiclass and multilabel algorithms by Mathieu Blondel
• Ball tree rewrite by Jake Vanderplas
• Implementation of DBSCAN algorithm by Robert Layton
• Kmeans predict and transform by Robert Layton
• Preprocessing module refactoring by Olivier Grisel
• Faster mean shift by Conrad Lee
• New Bootstrap, Random permutations cross-validation a.k.a. Shuffle & Split and various other improve-
ments in cross validation schemes by Olivier Grisel and Gael Varoquaux
• Adjusted Rand index and V-Measure clustering evaluation metrics by Olivier Grisel
• Added Orthogonal Matching Pursuit by Vlad Niculae
• Added 2D-patch extractor utilities in the Feature extraction module by Vlad Niculae
• Implementation of linear_model.LassoLarsCV (cross-validated Lasso solver using the Lars algorithm)
and linear_model.LassoLarsIC (BIC/AIC model selection in Lars) by Gael Varoquaux and Alexandre
Gramfort
• Scalability improvements to metrics.roc_curve by Olivier Hervieu
• Distance helper functions metrics.pairwise.pairwise_distances and metrics.pairwise.
pairwise_kernels by Robert Layton
• Mini-Batch K-Means by Nelle Varoquaux and Peter Prettenhofer.
• mldata utilities by Pietro Berkes.
• The Olivetti faces dataset by David Warde-Farley.
Here are the code migration instructions when upgrading from scikit-learn version 0.8:
• The scikits.learn package was renamed sklearn. There is still a scikits.learn package alias for
backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase. For instance, under
Linux / MacOSX just run (make a backup first!):
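The shell one-liner shown at this point in the original release notes is not preserved in this extraction. As a rough, hypothetical equivalent in Python (back up your files first), one could rewrite the old package name in place:

import pathlib
import re

# Replace old 'scikits.learn' imports with 'sklearn' in all Python sources
# below the current directory (illustrative sketch only).
for path in pathlib.Path(".").rglob("*.py"):
    text = path.read_text()
    path.write_text(re.sub(r"\bscikits\.learn\b", "sklearn", text))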
• Estimators no longer accept model parameters as fit arguments: instead all parameters must be passed as constructor arguments or using the now public set_params method inherited from base.BaseEstimator.
Some estimators can still accept keyword arguments on fit, but this is restricted to data-dependent values (e.g. a Gram matrix or an affinity matrix that are precomputed from the X data matrix).
• The cross_val package has been renamed to cross_validation although there is also a cross_val
package alias in place for backward compatibility.
Third-party projects with a dependency on scikit-learn 0.9+ should upgrade their codebase. For instance, under
Linux / MacOSX just run (make a backup first!):
People
Changelog
People
• 96 Gael Varoquaux
• 96 Vlad Niculae
• 94 Fabian Pedregosa
• 36 Alexandre Gramfort
• 32 Paolo Losi
• 31 Edouard Duchesnay
• 30 Mathieu Blondel
• 25 Peter Prettenhofer
• 22 Nicolas Pinto
• 11 Virgile Fritsch
• 7 Lars Buitinck
• 6 Vincent Michel
• 5 Bertrand Thirion
• 4 Thouis (Ray) Jones
• 4 Vincent Schut
• 3 Jan Schlüter
• 2 Julien Miotte
• 2 Matthieu Perrot
• 2 Yann Malet
• 2 Yaroslav Halchenko
• 1 Amit Aides
• 1 Andreas Müller
• 1 Feth Arezki
• 1 Meng Xinfan
March 2, 2011
scikit-learn 0.7 was released in March 2011, roughly three months after the 0.6 release. This release is marked by speed improvements in existing algorithms like k-Nearest Neighbors and K-Means, and by the inclusion of an efficient algorithm for computing the Ridge Generalized Cross Validation solution. Unlike the preceding release, no new modules were added to this release.
Changelog
• Better handling of collinearity and early stopping in linear_model.lars_path [Alexandre Gramfort and
Fabian Pedregosa].
• Fixes for liblinear ordering of labels and sign of coefficients [Dan Yamins, Paolo Losi, Mathieu Blondel and
Fabian Pedregosa].
• Performance improvements for Nearest Neighbors algorithm in high-dimensional spaces [Fabian Pedregosa].
• Performance improvements for cluster.KMeans [Gael Varoquaux and James Bergstra].
• Sanity checks for SVM-based classes [Mathieu Blondel].
• Refactoring of neighbors.NeighborsClassifier and neighbors.kneighbors_graph: added
different algorithms for the k-Nearest Neighbor Search and implemented a more stable algorithm for finding
barycenter weights. Also added some developer documentation for this module, see notes_neighbors for more
information [Fabian Pedregosa].
• Documentation improvements: Added pca.RandomizedPCA and linear_model.
LogisticRegression to the class reference. Also added references of matrices used for clustering
and other fixes [Gael Varoquaux, Fabian Pedregosa, Mathieu Blondel, Olivier Grisel, Virgile Fritsch ,
Emmanuelle Gouillart]
• Bound decision_function in classes that make use of liblinear, in both dense and sparse variants, like svm.LinearSVC or linear_model.LogisticRegression [Fabian Pedregosa].
• Performance and API improvements to metrics.euclidean_distances and to pca.
RandomizedPCA [James Bergstra].
• Fix compilation issues under NetBSD [Kamel Ibn Hassen Derouiche]
• Allow input sequences of different lengths in hmm.GaussianHMM [Ron Weiss].
• Fix bug in affinity propagation caused by incorrect indexing [Xinfan Meng]
People
• 1 VirgileFritsch
• 1 Yaroslav Halchenko
• 1 Xinfan Meng
Changelog
• New stochastic gradient descent module by Peter Prettenhofer. The module comes with complete documentation
and examples.
• Improved svm module: memory consumption has been reduced by 50%, heuristic to automatically set class
weights, possibility to assign weights to samples (see SVM: Weighted samples for an example).
• New Gaussian Processes module by Vincent Dubourg. This module also has great documenta-
tion and some very neat examples. See example_gaussian_process_plot_gp_regression.py or exam-
ple_gaussian_process_plot_gp_probabilistic_classification_after_regression.py for a taste of what can be done.
• It is now possible to use liblinear’s Multi-class SVC (option multi_class in svm.LinearSVC)
• New features and performance improvements of text feature extraction.
• Improved sparse matrix support, both in main classes (grid_search.GridSearchCV) and in modules sklearn.svm.sparse and sklearn.linear_model.sparse.
• Lots of cool new examples and a new section that uses real-world datasets was created. These include: Faces
recognition example using eigenfaces and SVMs, Species distribution modeling, Libsvm GUI, Wikipedia princi-
pal eigenvector and others.
• Faster Least Angle Regression algorithm. It is now 2x faster than the R version in the worst case and up to 10x faster in some cases.
• Faster coordinate descent algorithm. In particular, the full path version of lasso (linear_model.lasso_path) is more than 200x faster than before.
• It is now possible to get probability estimates from a linear_model.LogisticRegression model.
• Module renaming: the glm module has been renamed to linear_model, the gmm module has been included in the more general mixture module, and the sgd module has been included in linear_model.
• Lots of bug fixes and documentation improvements.
People
• 59 Mathieu Blondel
• 55 Gael Varoquaux
• 33 Vincent Dubourg
• 21 Ron Weiss
• 9 Bertrand Thirion
• 3 Alexandre Passos
• 3 Anne-Laure Fouque
• 2 Ronan Amicel
• 1 Christian Osendorfer
Changelog
New classes
• Support for sparse matrices in some classifiers of modules svm and linear_model (see svm.
sparse.SVC, svm.sparse.SVR, svm.sparse.LinearSVC, linear_model.sparse.Lasso,
linear_model.sparse.ElasticNet)
• New pipeline.Pipeline object to compose different estimators.
• Recursive Feature Elimination routines in module Feature selection.
• Addition of various classes capable of cross validation in the linear_model module (linear_model.
LassoCV , linear_model.ElasticNetCV , etc.).
• New, more efficient LARS algorithm implementation. The Lasso variant of the algorithm is also implemented.
See linear_model.lars_path, linear_model.Lars and linear_model.LassoLars.
• New Hidden Markov Models module (see classes hmm.GaussianHMM, hmm.MultinomialHMM, hmm.
GMMHMM)
• New module feature_extraction (see class reference)
• New FastICA algorithm in module sklearn.fastica
Documentation
• Improved documentation for many modules, now separating narrative documentation from the class reference.
As an example, see documentation for the SVM module and the complete class reference.
Fixes
• API changes: variable names now adhere to PEP-8 and are more meaningful.
• Fixes for svm module to run on a shared memory context (multiprocessing).
• It is again possible to generate latex (and thus PDF) from the sphinx docs.
Examples
External dependencies
Removed modules
• Module ann (Artificial Neural Networks) has been removed from the distribution. Users wanting this sort of algorithm should take a look at pybrain.
Misc
Authors
The following is a list of authors for this release, preceded by number of commits:
• 262 Fabian Pedregosa
• 240 Gael Varoquaux
• 149 Alexandre Gramfort
• 116 Olivier Grisel
• 40 Vincent Michel
• 38 Ron Weiss
• 23 Matthieu Perrot
• 10 Bertrand Thirion
• 7 Yaroslav Halchenko
• 9 VirgileFritsch
• 6 Edouard Duchesnay
• 4 Mathieu Blondel
• 1 Ariel Rokem
• 1 Matthieu Brucher
Changelog
Authors
The committer list for this release is the following (preceded by number of commits):
• 143 Fabian Pedregosa
• 35 Alexandre Gramfort
• 34 Olivier Grisel
• 11 Gael Varoquaux
• 5 Yaroslav Halchenko
• 2 Vincent Michel
• 1 Chris Filo Gorgolewski
Earlier versions included contributions by Fred Mailhot, David Cooke, David Huard, Dave Morrill, Ed Schofield,
Travis Oliphant, Pearu Peterson.
1.8 Roadmap
This document lists general directions that core contributors are interested in seeing developed in scikit-learn. The fact that an item is listed here is in no way a promise that it will happen, as resources are limited. Rather, it is an indication that help is welcomed on this topic.
Eleven years after the inception of Scikit-learn, much has changed in the world of machine learning. Key changes
include:
• Computational tools: The exploitation of GPUs, distributed programming frameworks like Scala/Spark, etc.
• High-level Python libraries for experimentation, processing and data management: Jupyter notebook, Cython,
Pandas, Dask, Numba. . .
• Changes in the focus of machine learning research: artificial intelligence applications (where input structure is
key) with deep learning, representation learning, reinforcement learning, domain transfer, etc.
A more subtle change over the last decade is that, due to changing interests in ML, PhD students in machine learning are more likely to contribute to PyTorch, Dask, etc. than to Scikit-learn, so our contributor pool is very different from what it was a decade ago.
Scikit-learn remains very popular in practice for trying out canonical machine learning techniques, particularly for
applications in experimental science and in data science. A lot of what we provide is now very mature. But it can be
costly to maintain, and we cannot therefore include arbitrary new implementations. Yet Scikit-learn is also essential
in defining an API framework for the development of interoperable machine learning components external to the core
library.
Thus our main goals in this era are to:
• continue maintaining a high-quality, well-documented collection of canonical tools for data processing and
machine learning within the current scope (i.e. rectangular data largely invariant to column and row order;
predicting targets with simple structure)
• improve the ease for users to develop and publish external components
• improve inter-operability with modern data science tools (e.g. Pandas, Dask) and infrastructures (e.g. distributed
processing)
Many of the more fine-grained goals can be found under the API tag on the issue tracker.
The list is numbered not as an indication of the order of priority, but to make referring to specific points easier. Please
add new entries only at the bottom. Note that the crossed out entries are already done, and we try to keep the document
up to date as we work on these issues.
1. Improved handling of Pandas DataFrames
• document current handling
• column reordering issue #7242
• avoiding unnecessary conversion to ndarray #12147
• returning DataFrames from transformers #5523
20. Everything in Scikit-learn should probably conform to our API contract. We are still in the process of making
decisions on some of these related issues.
• Pipeline and FeatureUnion modify their input parameters in fit. Fixing this requires making sure we have a good grasp of their use cases to make sure all current functionality is maintained. #8157 #7382
21. (Optional) Improve the scikit-learn common tests suite to make sure that (at least for frequently used) models have stable predictions across versions (to be discussed);
• Extend documentation to mention how to deploy models in Python-free environments, for instance with ONNX, and use the above best practices to assess predictive consistency between scikit-learn and ONNX prediction functions on a validation set.
• Document good practices to detect temporal distribution drift for deployed models and good practices for re-training on fresh data without causing catastrophic predictive performance regressions.
sklearn.ensemble
• a stacking implementation, #11047
sklearn.cluster
• kmeans variants for non-Euclidean distances, if we can show these have benefits beyond hierarchical clustering.
sklearn.model_selection
• multi-metric scoring is slow #9326
• perhaps we want to be able to get back more than multiple metrics
• the handling of random states in CV splitters is a poor design and contradicts the validation of similar parameters
in estimators, #15177
• exploit warm-starting and path algorithms so the benefits of EstimatorCV objects can be accessed via
GridSearchCV and used in Pipelines. #1626
• Cross-validation should be able to be replaced by OOB estimates whenever a cross-validation iterator is used.
• Redundant computations in pipelines should be avoided (related to point above) cf daskml
sklearn.neighbors
• Ability to substitute a custom/approximate/precomputed nearest neighbors implementation for ours in all/most
contexts that nearest neighbors are used for learning. #10463
sklearn.pipeline
• Performance issues with Pipeline.memory
• see “Everything in Scikit-learn should conform to our API contract” above
1.9 Scikit-learn governance and decision-making
The purpose of this document is to formalize the governance process used by the scikit-learn project, to clarify how
decisions are made and how the various elements of our community interact. This document establishes a decision-
making structure that takes into account feedback from all members of the community and strives to find consensus,
while avoiding any deadlocks.
This is a meritocratic, consensus-based community project. Anyone with an interest in the project can join the com-
munity, contribute to the project design and participate in the decision making process. This document describes how
that participation takes place and how to set about earning merit within the project community.
Contributors
Contributors are community members who contribute in concrete ways to the project. Anyone can become a contrib-
utor, and contributions can take many forms – not only code – as detailed in the contributors guide.
Core developers
Core developers are community members who have shown that they are dedicated to the continued development of
the project through ongoing engagement with the community. They have shown they can be trusted to maintain scikit-
learn with care. Being a core developer allows contributors to more easily carry on with their project related activities
by giving them direct access to the project’s repository and is represented as being an organization member on the
scikit-learn GitHub organization. Core developers are expected to review code contributions, can merge approved pull
requests, can cast votes for and against merging a pull-request, and can be involved in deciding major changes to the
API.
New core developers can be nominated by any existing core developers. Once they have been nominated, there will
be a vote by the current core developers. Voting on new core developers is one of the few activities that takes place on
the project’s private management list. While it is expected that most votes will be unanimous, a two-thirds majority of
the cast votes is enough. The vote needs to be open for at least 1 week.
Core developers that have not contributed to the project (commits or GitHub comments) in the past 12 months will be
asked if they want to become emeritus core developers and recant their commit and voting rights until they become
active again. The list of core developers, active and emeritus (with dates at which they became active) is public on the
scikit-learn website.
Technical Committee
The Technical Committee (TC) members are core developers who have additional responsibilities to ensure the smooth
running of the project. TC members are expected to participate in strategic planning, and approve changes to the
governance model. The purpose of the TC is to ensure a smooth progress from the big-picture perspective. Indeed
changes that impact the full project require a synthetic analysis and a consensus that is both explicit and informed.
In cases that the core developer community (which includes the TC members) fails to reach such a consensus in the
required time frame, the TC is the entity to resolve the issue. Membership of the TC is by nomination by a core
developer. A nomination will result in discussion which cannot take more than a month and then a vote by the core
developers which will stay open for a week. TC membership votes are subject to a two-third majority of all cast votes
as well as a simple majority approval of all the current TC members. TC members who do not actively engage with
the TC duties are expected to resign.
The initial Technical Committee of scikit-learn consists of Alexandre Gramfort, Olivier Grisel, Andreas Müller, Joel
Nothman, Hanmin Qin, Gaël Varoquaux, and Roman Yurchak.
Decisions about the future of the project are made through discussion with all members of the community. All non-
sensitive project management discussion takes place on the project contributors’ mailing list and the issue tracker.
Occasionally, sensitive discussion occurs on a private list.
Scikit-learn uses a “consensus seeking” process for making decisions. The group tries to find a resolution that has no
open objections among core developers. At any point during the discussion, any core-developer can call for a vote,
which will conclude one month from the call for the vote. Any vote must be backed by a SLEP. If no
option can gather two thirds of the votes cast, the decision is escalated to the TC, which in turn will use consensus
seeking with the fallback option of a simple majority vote if no consensus can be found within a month. This is what
we hereafter may refer to as “the decision making process”.
Decisions (in addition to adding core developers and TC membership as above) are made according to the following
rules:
• Minor Documentation changes, such as typo fixes, or addition / correction of a sentence, but no change of the
scikit-learn.org landing page or the “about” page: Requires +1 by a core developer, no -1 by a core developer
(lazy consensus), happens on the issue or pull request page. Core developers are expected to give “reasonable
time” to others to give their opinion on the pull request if they’re not confident others would agree.
• Code changes and major documentation changes require +1 by two core developers, no -1 by a core developer (lazy consensus), happens on the issue or pull-request page.
• Changes to the API principles and changes to dependencies or supported versions happen via Enhancement proposals (SLEPs) and follow the decision-making process outlined above.
• Changes to the governance model use the same decision process outlined above.
If a veto -1 vote is cast on a lazy consensus, the proposer can appeal to the community and core developers and the
change can be approved or rejected using the decision making procedure outlined above.
For all votes, a proposal must have been made public and discussed before the vote. Such proposal must be a consol-
idated document, in the form of a ‘Scikit-Learn Enhancement Proposal’ (SLEP), rather than a long discussion on an
issue. A SLEP must be submitted as a pull-request to enhancement proposals using the SLEP template.
2. scikit-learn Tutorials
Section contents
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn and give a simple
learning example.
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data.
If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is
said to have several attributes or features.
Learning problems fall into a few categories:
• supervised learning, in which the data comes with additional attributes that we want to predict (Click here to go to the scikit-learn supervised learning page). This problem can be either:
– classification: samples belong to two or more classes and we want to learn from already labeled data how
to predict the class of unlabeled data. An example of a classification problem would be handwritten digit
recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories.
Another way to think of classification is as a discrete (as opposed to continuous) form of supervised
learning where one has a limited number of categories and for each of the n samples provided, one is to
try to label them with the correct category or class.
– regression: if the desired output consists of one or more continuous variables, then the task is called
regression. An example of a regression problem would be the prediction of the length of a salmon as a
function of its age and weight.
• unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding
target values. The goal in such problems may be to discover groups of similar examples within the data, where
it is called clustering, or to determine the distribution of data within the input space, known as density estima-
tion, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of
visualization (Click here to go to the Scikit-Learn unsupervised learning page).
Machine learning is about learning some properties of a data set and then testing those properties against another
data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We
call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on
which we test the learned properties.
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the
boston house prices dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits datasets. Our
notational convention is that $ denotes the shell prompt while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more response variables are stored in the .target member. More details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be used to classify
the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to each digit
image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
The data is always a 2D array, shape (n_samples, n_features), although the original data may have had a
different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed
using:
>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
The simple example on this dataset illustrates how starting from the original problem one can shape the data for
consumption in scikit-learn.
In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given samples
of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the
classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and
predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The
estimator’s constructor takes as arguments the model’s parameters.
For now, we will consider the estimator as a black box:
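A minimal sketch of that construction (the gamma and C values are illustrative, not tuned):
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)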
In this example, we set the value of gamma manually. To find good values for these parameters, we can use tools
such as grid search and cross validation.
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is
done by passing our training set to the fit method. For the training set, we’ll use all the images from our dataset,
except for the last image, which we’ll reserve for our predicting. We select the training set with the [:-1] Python
syntax, which produces a new array that contains all but the last item from digits.data:
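A sketch of that call (the printed repr is abbreviated and may vary between versions):
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, gamma=0.001)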
Now you can predict new values. In this case, you’ll predict using the last image from digits.data. By predicting,
you’ll determine the image from the training set that best matches the last image.
>>> clf.predict(digits.data[-1:])
array([8])
As you can see, it is a challenging task: after all, the images are of poor resolution. Do you agree with the classifier?
A complete example of this classification problem is available as an example that you can run and study: Recognizing
hand-written digits.
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf.fit(X, y)
SVC()
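A sketch of pickling and restoring the fitted classifier:
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0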
In the specific case of scikit-learn, it may be more interesting to use joblib’s replacement for pickle (joblib.dump
& joblib.load), which is more efficient on big data but can only pickle to the disk and not to a string:
>>> from joblib import dump, load
>>> dump(clf, 'filename.joblib')
Later, you can reload the pickled model (possibly in another Python process) with:
>>> clf = load('filename.joblib')
Note: the joblib.dump and joblib.load functions also accept file-like objects instead of filenames. More infor-
mation on data persistence with Joblib is available here.
Note that pickle has some security and maintainability issues. Please refer to section Model persistence for more
detailed information about model persistence with scikit-learn.
2.1.5 Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable. These are described in more detail
in the Glossary of Common Terms and API Elements.
Type casting
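A plausible setup for the snippet below (assumed: an SVC fitted on the integer iris targets, as described in the explanation that follows):
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC()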
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
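>>> # Assumed intermediate step: refit on the string target names, which is
>>> # what makes the next predict() call return strings
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC()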
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fit. The
second predict() returns a string array, since iris.target_names was used for fitting.
Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling
fit() more than once will overwrite what was learned by any previous fit():
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
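A sketch of the calls this paragraph describes (the predictions shown are illustrative):
>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(kernel='linear')
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])
>>> clf.set_params(kernel='rbf').fit(X, y)
SVC()
>>> clf.predict(X[:5])
array([0, 0, 0, 0, 0])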
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the estimator has been
constructed, and changed back to rbf to refit the estimator and to make a second prediction.
When using multiclass classifiers, the learning and prediction task that is performed is dependent on the
format of the target data fit upon:
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
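A plausible sketch of the classifier used here (assumed: a OneVsRestClassifier wrapping an SVC, consistent with the LabelBinarizer example below):
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>> classif = OneVsRestClassifier(estimator=SVC(random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])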
In the above case, the classifier is fit on a 1d array of multiclass labels and the predict() method therefore provides
corresponding multiclass predictions. It is also possible to fit upon a 2d array of binary label indicators:
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer. In this case
predict() returns a 2d array representing the corresponding multilabel predictions.
Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of the three labels fit
upon. With multilabel outputs, it is similarly possible for an instance to be assigned multiple labels:
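A sketch of such a fit, reusing classif from above with an illustrative multilabel target:
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)  # returns a 2d array of binary label indicators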
In this case, the classifier is fit upon instances each assigned multiple labels. The MultiLabelBinarizer is
used to binarize the 2d array of multilabels to fit upon. As a result, predict() returns a 2d array with multiple
predicted labels for each instance.
Statistical learning
Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences are fac-
ing is rapidly growing. The problems it tackles range from building a prediction function linking different observations,
to classifying observations, or to learning the structure of an unlabeled dataset.
This tutorial will explore statistical learning, the use of machine learning techniques with the goal of statistical
inference: drawing conclusions on the data at hand.
Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific
Python packages (NumPy, SciPy, matplotlib).
2.2.1 Statistical learning: the setting and the estimator object in scikit-learn
Datasets
Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be
understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis,
while the second is the features axis.
The iris dataset, for example, is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as
detailed in iris.DESCR.
When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to
be used by scikit-learn.
To use this dataset with scikit-learn, we transform each 8x8 image into a feature vector of length 64:
>>> data = digits.images.reshape((digits.images.shape[0], -1))
Estimator objects
Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that
learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters
useful features from raw data.
All estimator objects expose a fit method that takes a dataset (usually a 2-d array):
>>> estimator.fit(data)
Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the
corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the
estimated parameters are attributes of the estimator object ending by an underscore:
>>> estimator.estimated_param_
Supervised learning consists in learning the link between two datasets: the observed data X and an external variable
y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X)
method that, given unlabeled observations X, returns the predicted labels y.
If the prediction task is to classify the observations in a set of finite labels, in other words to “name” the objects
observed, the task is said to be a classification task. On the other hand, if the goal is to predict a continuous target
variable, it is said to be a regression task.
Classifying irises:
The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa, Versicolour, and
Virginica) from their petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris_X, iris_y = datasets.load_iris(return_X_y=True)
>>> np.unique(iris_y)
array([0, 1, 2])
The simplest possible classifier is the nearest neighbor: given a new observation X_test, find in the training set (i.e.
the data used to train the estimator) the observation with the closest feature vector. (Please see the Nearest Neighbors
section of the online Scikit-learn documentation for more information about this type of classifier.)
While experimenting with any learning algorithm, it is important not to test the prediction of an estimator on the
data used to fit the estimator as this would not be evaluating the performance of the estimator on new data. This is
why datasets are often split into train and test data.
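A minimal sketch of such a split and a k-nearest-neighbors fit on iris (predicted values omitted):
>>> # Split iris data in train and test data: a random permutation
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> # Create and fit a nearest-neighbor classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier()
>>> knn.predict(iris_X_test)   # compare with the held-out labels iris_y_test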
For an estimator to be effective, you need the distance between neighboring points to be less than some value 𝑑, which
depends on the problem. In one dimension, this requires on average 𝑛 ∼ 1/𝑑 points. In the context of the above 𝑘-NN
example, if the data is described by just one feature with values ranging from 0 to 1 and with 𝑛 training observations,
then new data will be no further away than 1/𝑛. Therefore, the nearest neighbor decision rule will be efficient as soon
as 1/𝑛 is small compared to the scale of between-class feature variations.
Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442
patients, and an indication of disease progression after one year:
>>> diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes_y[:-20]
>>> diabetes_y_test = diabetes_y[-20:]
Linear regression
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order
to make the sum of the squared residuals of the model as small as possible.
Linear models: 𝑦 = 𝑋𝛽 + 𝜖
• 𝑋: data
• 𝑦: target variable
• 𝛽: Coefficients
• 𝜖: Observation noise
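A minimal sketch of fitting LinearRegression on the diabetes split defined above (numerical outputs omitted):
>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression()
>>> print(regr.coef_)                     # estimated coefficients
>>> # Mean squared error on the held-out data
>>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
>>> # Explained variance score: 1 is perfect prediction
>>> regr.score(diabetes_X_test, diabetes_y_test)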
Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:
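The loop below relies on a tiny synthetic dataset and a linear regressor; a plausible setup, assumed to match the shapes used in the loop, is:
>>> import matplotlib.pyplot as plt
>>> X = np.c_[.5, 1].T            # 2 observations, 1 feature
>>> y = [.5, 1]
>>> test = np.c_[0, 2].T
>>> regr = linear_model.LinearRegression()
>>> plt.figure()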
>>> np.random.seed(0)
>>> for _ in range(6):
... this_X = .1 * np.random.normal(size=(2, 1)) + X
... regr.fit(this_X, y)
... plt.plot(test, regr.predict(test))
... plt.scatter(this_X, y, s=3)
A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly
chosen sets of observations are likely to be uncorrelated. This is called Ridge regression:
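A sketch of the ridge estimator used in the loop below (the alpha value is illustrative):
>>> regr = linear_model.Ridge(alpha=.1)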
>>> plt.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
... this_X = .1 * np.random.normal(size=(2, 1)) + X
... regr.fit(this_X, y)
... plt.plot(test, regr.predict(test))
... plt.scatter(this_X, y, s=3)
This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower
the variance.
We can choose alpha to minimize left out error, this time using the diabetes dataset rather than our synthetic data:
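A sketch of such a search over a small grid of alpha values (scores omitted):
>>> alphas = np.logspace(-4, -1, 6)
>>> print([regr.set_params(alpha=alpha)
...            .fit(diabetes_X_train, diabetes_y_train)
...            .score(diabetes_X_test, diabetes_y_test)
...        for alpha in alphas])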
Note: Capturing in the fitted parameters noise that prevents the model from generalizing to new data is called overfitting.
The bias introduced by the ridge regression is called a regularization.
Sparsity
Note: A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions and one of
the target variable). It is hard to develop an intuition on such representation, but it may be useful to keep in mind that
it would be a fairly empty space.
We can see that, although feature 2 has a strong coefficient on the full model, it conveys little information on y when
considered with feature 1.
To improve the conditioning of the problem (i.e. to mitigate the curse of dimensionality), it would be interesting
to select only the informative features and set non-informative ones, like feature 2, to 0. Ridge regression will decrease
their contribution, but not set them to zero. Another penalization approach, called Lasso (least absolute shrinkage and
selection operator), can set some coefficients to zero. Such methods are called sparse methods, and sparsity can be
seen as an application of Occam’s razor: prefer simpler models.
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso object in scikit-
learn solves the lasso regression problem using a coordinate descent method, that is efficient on large datasets.
However, scikit-learn also provides the LassoLars object using the LARS algorithm, which is very efficient for
problems in which the weight vector estimated is very sparse (i.e. problems with very few observations).
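A sketch of fitting a Lasso on the diabetes split, picking alpha from the same grid as above (outputs omitted):
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha)
...               .fit(diabetes_X_train, diabetes_y_train)
...               .score(diabetes_X_test, diabetes_y_test)
...           for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)   # refit with the best alpha
>>> print(regr.coef_)                              # some coefficients are now exactly 0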
Classification
For classification, as in the labeling iris task, linear regression is not the right approach as it will give too much weight
to data far from the decision frontier. A linear approach is to fit a sigmoid function or logistic function:
𝑦 = sigmoid(𝑋𝛽 − offset) + 𝜖 = 1 / (1 + exp(−𝑋𝛽 + offset)) + 𝜖
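A sketch of a logistic regression fit, reusing the iris training split from the nearest-neighbor sketch above (the C value is illustrative):
>>> log = linear_model.LogisticRegression(C=1e5)
>>> log.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0)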
Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and then use a voting
heuristic for the final decision.
The C parameter controls the amount of regularization in the LogisticRegression object: a large value
for C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while
penalty="l1" gives Sparsity.
Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last 10% and test
prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model
Solution: ../../auto_examples/exercises/plot_digits_classification_exercise.py
Linear SVMs
Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build
a plane maximizing the margin between the two classes. Regularization is set by the C parameter: a small value for C
means the margin is calculated using many or all of the observations around the separating line (more regularization);
a large value for C means the margin is calculated on observations close to the separating line (less regularization).
Example:
SVMs can be used in regression (SVR, Support Vector Regression) or in classification (SVC, Support Vector Classification).
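A sketch of a linear support vector classifier fit on the same iris training split (kernel and data choices are illustrative):
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(kernel='linear')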
Using kernels
Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear
but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by
positioning kernels on observations:
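A sketch of the corresponding constructor calls (the polynomial degree is illustrative):
>>> svc = svm.SVC(kernel='linear')            # linear kernel
>>> svc = svm.SVC(kernel='poly', degree=3)    # polynomial kernel of degree 3
>>> svc = svm.SVC(kernel='rbf')               # radial basis function kernel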
Interactive example
See the SVM GUI to download svm_gui.py; add data points of both classes with right and left button, fit the
model and change parameters and data.
Exercise
Try classifying classes 1 and 2 from the iris dataset with SVMs, with the 2 first features. Leave out 10% of each
class and test prediction performance on these observations.
Warning: the classes are ordered, do not leave out the last 10%, you would be testing on only one class.
Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
Solution: ../../auto_examples/exercises/plot_iris_exercise.py
As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on
new data. Bigger is better.
>>> from sklearn import datasets, svm
>>> X_digits, y_digits = datasets.load_digits(return_X_y=True)
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.98
To get a better measure of prediction accuracy (which we can use as a proxy for goodness of fit of the model), we can
successively split the data in folds that we use for training and testing:
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
... # We use 'list' to copy, in order to 'pop' later on
... X_train = list(X_folds)
... X_test = X_train.pop(k)
... X_train = np.concatenate(X_train)
... y_train = list(y_folds)
... y_test = y_train.pop(k)
... y_train = np.concatenate(y_train)
... scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.934..., 0.956..., 0.939...]
Cross-validation generators
Scikit-learn has a collection of classes which can be used to generate lists of train/test indices for popular cross-
validation strategies.
They expose a split method which accepts the input dataset to be split and yields the train/test set indices for each
iteration of the chosen cross-validation strategy.
The following example shows the usage of the split method.
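A sketch of such a split with KFold on a toy dataset:
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
>>> k_fold = KFold(n_splits=5)
>>> for train_indices, test_indices in k_fold.split(X):
...      print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]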
The cross-validation score can be directly calculated using the cross_val_score helper. Given an estimator, the
cross-validation object and the input dataset, the cross_val_score splits the data repeatedly into a training and a
testing set, trains the estimator using the training set and computes the scores based on the testing set for each iteration
of cross-validation.
By default the estimator’s score method is used to compute the individual scores.
Refer to the metrics module to learn more about the available scoring methods.
n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
Alternatively, the scoring argument can be provided to specify an alternative scoring method.
Exercise
On the digits dataset, plot the cross-validation score of an SVC estimator with a linear kernel as a function of
the parameter C (use a logarithmic grid of points, from 1 to 10).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
X, y = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores = list()
Grid-search
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and
chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction
and exposes an estimator API:
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
By default, GridSearchCV uses a 5-fold cross-validation. However, if it detects that a classifier is passed, rather
than a regressor, it uses a stratified 5-fold. (The default changed from 3-fold to 5-fold in version 0.22.)
Nested cross-validation
Two cross-validation loops are performed in parallel: one by the GridSearchCV estimator to set gamma and the
other one by cross_val_score to measure the prediction performance of the estimator. The resulting scores
are unbiased estimates of the prediction score on new data.
Warning: You cannot nest objects with parallel computing (n_jobs different than 1).
Cross-validated estimators
Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. This is why, for
certain estimators, scikit-learn exposes cross-validation estimators that set their parameter automatically by cross-
validation (see Cross-validation: evaluating estimator performance):
>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV()
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.00375...
These estimators are called similarly to their counterparts, with ‘CV’ appended to their name.
Exercise
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
Given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label
them, we could try a clustering task: split the observations into well-separated groups called clusters.
K-means clustering
Note that there exist a lot of different clustering criteria and associated algorithms. The simplest clustering algorithm
is K-means.
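A minimal sketch of running K-means on the iris data (the cluster labels are omitted, since they depend on the initialization):
>>> from sklearn import cluster, datasets
>>> X_iris, y_iris = datasets.load_iris(return_X_y=True)
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(n_clusters=3)
>>> print(k_means.labels_[::10])   # compare with the ground truth y_iris[::10]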
Warning: There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of
clusters is hard. Second, the algorithm is sensitive to initialization, and can fall into local minima, although scikit-
learn employs several tricks to mitigate this issue.
Clustering in general and KMeans, in particular, can be seen as a way of choosing a small number of exemplars to
compress the information. The problem is sometimes known as vector quantization. For instance, this can be used
to posterize an image:
>>> import scipy as sp
>>> try:
... face = sp.face(gray=True)
... except AttributeError:
... from scipy import misc
... face = misc.face(gray=True)
>>> X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X)
KMeans(n_clusters=5, n_init=1)
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> face_compressed = np.choose(labels, values)
>>> face_compressed.shape = face.shape
A Hierarchical clustering method is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the
various approaches of this technique are either:
• Agglomerative - bottom-up approaches: each observation starts in its own cluster, and clusters are iteratively
merged in such a way to minimize a linkage criterion. This approach is particularly interesting when the clus-
ters of interest are made of only a few observations. When the number of clusters is large, it is much more
computationally efficient than k-means.
• Divisive - top-down approaches: all observations start in one cluster, which is iteratively split as one moves
down the hierarchy. For estimating large numbers of clusters, this approach is both slow (due to all observations
starting as one cluster, which it splits recursively) and statistically ill-posed.
Connectivity-constrained clustering
With agglomerative clustering, it is possible to specify which samples can be clustered together by giving a connec-
tivity graph. Graphs in scikit-learn are represented by their adjacency matrix. Often, a sparse matrix is used. This
can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when
clustering an image:
import skimage
from skimage.data import coins
from skimage.transform import rescale
# #############################################################################
# Generate data
orig_coins = coins()
Feature agglomeration
We have seen that sparsity could be used to mitigate the curse of dimensionality, i.e. an insufficient amount of ob-
servations compared to the number of features. Another approach is to merge together similar features: feature
agglomeration. This approach can be implemented by clustering in the feature direction, in other words clustering
the transposed data.
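A sketch of feature agglomeration on the digits images, using a grid connectivity over the pixels (the number of clusters is illustrative):
>>> import numpy as np
>>> from sklearn import datasets, cluster
>>> from sklearn.feature_extraction.image import grid_to_graph
>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)   # 8x8 pixel grid
>>> agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
>>> agglo.fit(X)
FeatureAgglomeration(connectivity=..., n_clusters=32)
>>> X_reduced = agglo.transform(X)
>>> X_approx = agglo.inverse_transform(X_reduced)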
Some estimators expose a transform method, for instance to reduce the dimensionality of the dataset.
If X is our multivariate data, then the problem that we are trying to solve is to rewrite it on a different observational
basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist to choose
the components
Principal component analysis (PCA) selects the successive components that explain the maximum variance in the
signal.
The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features
can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
When used to transform data, PCA can reduce the dimensionality of the data by projecting on a principal subspace.
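A sketch on a synthetic signal where only two of the three features carry information:
>>> import numpy as np
>>> from sklearn import decomposition
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA()
>>> print(pca.explained_variance_)   # only the first 2 components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)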
Independent component analysis (ICA) selects components so that the distribution of their loadings carries a maximum
amount of independent information. It is able to recover non-Gaussian independent signals:
Pipelining
We have seen that some estimators can transform data and that some estimators can predict variables. We can also
create combined estimators:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild”, also known as LFW
(http://vis-www.cs.umass.edu/lfw/): http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB).
The full script, “Faces recognition example using eigenfaces and SVMs”, proceeds in the following steps:
• download the data, if not already on disk, and load it as numpy arrays; the 2D data is used directly, as relative
pixel position information is ignored by this model (X = lfw_people.data, n_features = X.shape[1]);
• split the data into a training set and a test set using a stratified k fold;
• compute a PCA (eigenfaces) on the face dataset, treated as an unlabeled dataset, for unsupervised feature
extraction / dimensionality reduction (n_components = 150);
• train a SVM classification model;
• evaluate the model quality quantitatively on the test set, and qualitatively by plotting a gallery of the predictions
and of the eigenfaces with matplotlib (plot_gallery(X_test, prediction_titles, h, w); plt.show()).
Expected results for the top 5 most represented people in the dataset are reported on the example page.
Can we predict the variation in stock prices for Google over a given time frame?
Learning a graph structure
If you encounter a bug with scikit-learn or something that needs clarification in the docstring or the online
documentation, please feel free to ask on the Mailing List
Quora.com Quora has a topic for Machine Learning related questions that also features some
interesting discussions: https://www.quora.com/topic/Machine-Learning
Stack Exchange The Stack Exchange family of sites hosts multiple subdomains for Machine
Learning questions.
• An excellent free online course for Machine Learning taught by Professor Andrew Ng of Stanford: https://www.coursera.org/learn/machine-learning
• Another excellent free online course that takes a more general approach to Artificial Intelligence: https://www.udacity.com/course/intro-to-artificial-intelligence–cs271
The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a
collection of text documents (newsgroups posts) on twenty different topics.
In this section we will see how to:
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
To get started with this tutorial, you must first install scikit-learn and all of its required dependencies.
Please refer to the installation instructions page for more information and for system-specific instructions.
The source of this tutorial can be found within your scikit-learn folder:
scikit-learn/doc/tutorial/text_analytics/
Copy the skeletons into a working directory of your own, where you can edit the files for the exercises while keeping
the originals intact:
% cp -r skeletons work_directory/sklearn_tut_workspace
Machine learning algorithms need data. Go to each $TUTORIAL_HOME/data sub-folder and run the
fetch_data.py script from there (after having read them first).
For instance:
% cd $TUTORIAL_HOME/data/languages
% less fetch_data.py
% python fetch_data.py
The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected
by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explic-
itly mention this collection. The 20 newsgroups collection has become a popular data set for experiments
in text applications of machine learning techniques, such as text classification and text clustering.
In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible
to download the dataset manually from the website and use the sklearn.datasets.load_files function by
pointing it to the 20news-bydate-train sub-folder of the uncompressed archive folder.
In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out
of the 20 available in the dataset:
>>> categories = ['alt.atheism', 'soc.religion.christian',
... 'comp.graphics', 'sci.med']
We can now load the list of files matching those categories as follows:
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty_train = fetch_20newsgroups(subset='train',
... categories=categories, shuffle=True, random_state=42)
The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be accessed both as
python dict keys and as object attributes for convenience; for instance, the target_names holds the list of the
requested category names:
>>> twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:
>>> len(twenty_train.data)
2257
>>> len(twenty_train.filenames)
2257
>>> print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics
Supervised learning algorithms will require a category label for each document in the training set. In this case the cat-
egory is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.
For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corre-
sponds to the index of the category name in the target_names list. The category integer id of each sample is stored
in the target attribute:
>>> twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
You might have noticed that the samples were shuffled randomly when we called fetch_20newsgroups(...,
shuffle=True, random_state=42): this is useful if you wish to select only a subset of samples to quickly
train a model and get a first idea of the results before re-training on the complete dataset later.
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature
vectors.
Bags of words
The most intuitive way to do so is a bags of words representation: assign a fixed integer id to each word occurring in
any document of the training set (for instance by building a dictionary from words to integer indices); then, for each
document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where
j is the index of word w in the dictionary. This implies that n_features is the number of distinct words in the corpus,
typically larger than 100,000.
Fortunately, most values in X will be zeros since for a given document less than a few thousand distinct words will
be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot
of memory by only storing the non-zero parts of the feature vectors in memory.
scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these
structures.
Text preprocessing, tokenizing and filtering of stopwords are all included in CountVectorizer, which builds a
dictionary of features and transforms documents to feature vectors:
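A sketch of tokenizing the training documents (the exact vocabulary size depends on the data; the shape shown is indicative):
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
>>> X_train_counts.shape
(2257, 35788)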
CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has
built a dictionary of feature indices:
>>> count_vect.vocabulary_.get(u'algorithm')
4690
The index value of a word in the vocabulary is linked to its frequency in the whole training corpus.
Occurrence count is a good start but there is an issue: longer documents will have higher average count values than
shorter documents, even though they might talk about the same topics.
To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by
the total number of words in the document: these new features are called tf for Term Frequencies.
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are
therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
Both tf and tf–idf can be computed as follows using TfidfTransformer:
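A sketch of computing tf features first (with use_idf=False), starting from the count matrix built above:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 35788)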
In the above example code, we first use the fit(..) method to fit our estimator to the data, and then the
transform(..) method to transform our count-matrix to a tf-idf representation. These two steps can be com-
bined to achieve the same end result faster by skipping redundant processing. This is done by using the
fit_transform(..) method as shown below, and as mentioned in the note in the previous section:
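A sketch of the combined call, this time keeping the idf downscaling:
>>> tfidf_transformer = TfidfTransformer()
>>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 35788)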
Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a
naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this
classifier; the one most suitable for word counts is the multinomial variant:
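A sketch of training such a classifier on the tf-idf features:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)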
To try to predict the outcome on a new document we need to extract the features using almost the same feature extract-
ing chain as before. The difference is that we call transform instead of fit_transform on the transformers,
since they have already been fit to the training set:
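A sketch of classifying two new toy documents (the predicted categories shown are the ones such a model typically produces):
>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)
>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics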
In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a
Pipeline class that behaves like a compound classifier:
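A sketch of such a pipeline (the step names are the ones referred to in the next paragraph):
>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([
...     ('vect', CountVectorizer()),
...     ('tfidf', TfidfTransformer()),
...     ('clf', MultinomialNB()),
... ])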
The names vect, tfidf and clf (classifier) are arbitrary. We will use them to perform grid search for suitable
hyperparameters below. We can now train the model with a single command:
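A sketch of fitting the pipeline and measuring its accuracy on the held-out test subset (the quoted score was obtained with a setup along these lines):
>>> text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(...)
>>> import numpy as np
>>> twenty_test = fetch_20newsgroups(subset='test',
...     categories=categories, shuffle=True, random_state=42)
>>> docs_test = twenty_test.data
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.8348...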
We achieved 83.5% accuracy. Let’s see if we can do better with a linear support vector machine (SVM), which is
widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We
can change the learner by simply plugging a different classifier object into our pipeline:
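A sketch of swapping in an SGDClassifier with a hinge loss (the hyper-parameter values are illustrative):
>>> from sklearn.linear_model import SGDClassifier
>>> text_clf = Pipeline([
...     ('vect', CountVectorizer()),
...     ('tfidf', TfidfTransformer()),
...     ('clf', SGDClassifier(loss='hinge', penalty='l2',
...                           alpha=1e-3, random_state=42,
...                           max_iter=5, tol=None)),
... ])
>>> text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(...)
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
0.9127...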
We achieved 91.3% accuracy using the SVM. scikit-learn provides further utilities for more detailed perfor-
mance analysis of the results:
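For instance, a per-class report and a confusion matrix can be obtained as follows (output omitted):
>>> from sklearn import metrics
>>> print(metrics.classification_report(twenty_test.target, predicted,
...     target_names=twenty_test.target_names))
>>> metrics.confusion_matrix(twenty_test.target, predicted)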
As expected the confusion matrix shows that posts from the newsgroups on atheism and Christianity are more often
confused for one another than with computer graphics.
We’ve already encountered some parameters such as use_idf in the TfidfTransformer. Classifiers tend to have
many parameters as well; e.g., MultinomialNB includes a smoothing parameter alpha and SGDClassifier
has a penalty parameter alpha and configurable loss and penalty terms in the objective function (see the module
documentation, or use the Python help function to get a description of these).
Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of
the best parameters on a grid of possible values. We try out all classifiers on either words or bigrams, with or without
idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:
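A sketch of such a parameter grid (the keys use the step names defined for the pipeline above):
>>> from sklearn.model_selection import GridSearchCV
>>> parameters = {
...     'vect__ngram_range': [(1, 1), (1, 2)],
...     'tfidf__use_idf': (True, False),
...     'clf__alpha': (1e-2, 1e-3),
... }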
Obviously, such an exhaustive search can be expensive. If we have multiple CPU cores at our disposal, we can tell
the grid searcher to try these eight parameter combinations in parallel with the n_jobs parameter. If we give this
parameter a value of -1, grid search will detect how many cores are installed and use them all:
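For example:
>>> gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)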
The grid search instance behaves like a normal scikit-learn model. Let’s perform the search on a smaller subset
of the training data to speed up the computation:
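For instance, fitting on the first 400 training documents only:
>>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])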
The result of calling fit on a GridSearchCV object is a classifier that we can use to predict:
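For example (the predicted category shown is illustrative):
>>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'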
The object’s best_score_ and best_params_ attributes store the best mean score and the parameters setting
corresponding to that score:
>>> gs_clf.best_score_
0.9...
>>> for param_name in sorted(parameters.keys()):
... print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
Exercises
To do the exercises, copy the content of the ‘skeletons’ folder as a new folder named ‘workspace’:
% cp -r skeletons workspace
You can then edit the content of the workspace without fear of losing the original exercise instructions.
Then fire an ipython shell and run the work-in-progress script with:
• Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer using data from
Wikipedia articles as training set.
• Evaluate the performance on some held out test set.
ipython command line:
• Write a text classification pipeline to classify movie reviews as either positive or negative.
• Find a good set of parameters using grid search.
• Evaluate the performance on a held out test set.
ipython command line:
Using the results of the previous exercises and the cPickle module of the standard library, write a command line
utility that detects the language of some text provided on stdin and estimates the polarity (positive or negative) if the
text is written in English.
Bonus point if the utility is able to give a confidence level for its predictions.
Here are a few suggestions to help further your scikit-learn intuition upon the completion of this tutorial:
• Try playing around with the analyzer and token normalisation under CountVectorizer.
• If you don’t have labels, try using Clustering on your problem.
• If you have multiple labels per document, e.g categories, have a look at the Multiclass and multilabel section.
• Try using Truncated SVD for latent semantic analysis.
• Have a look at using Out-of-core Classification to learn from data that would not fit into the computer main
memory.
• Have a look at the Hashing Vectorizer as a memory efficient alternative to CountVectorizer.
Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.
Different estimators are better suited for different types of data and different problems.
The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which
estimators to try on your data.
For those that are still new to the scientific Python ecosystem, we highly recommend the Python Scientific Lecture
Notes. This will help you find your footing a bit and will definitely improve your scikit-learn experience. A basic
understanding of NumPy arrays is recommended to make the most of scikit-learn.
There are several online tutorials available which are geared toward specific subject areas:
• Machine Learning for NeuroImaging in Python
• Machine Learning for Astronomical Data Analysis
2.5.3 Videos
• An introduction to scikit-learn Part I and Part II at Scipy 2013 by Gael Varoquaux, Jake Vanderplas and Olivier
Grisel. Notebooks on github.
• Introduction to scikit-learn by Gael Varoquaux at ICML 2010
A three minute video from a very early stage of scikit-learn, explaining the basic idea and approach
we are following.
• Introduction to statistical learning with scikit-learn by Gael Varoquaux at SciPy 2011
An extensive tutorial, consisting of four sessions of one hour. The tutorial covers the basics of ma-
chine learning, many algorithms and how to apply them using scikit-learn. The corresponding material is now in
the scikit-learn documentation section A tutorial on statistical-learning for scientific data processing.
• Statistical Learning for Text Classification with scikit-learn and NLTK (and slides) by Olivier Grisel at PyCon
2011
Thirty minute introduction to text classification. Explains how to use NLTK and scikit-learn to solve
real-world text classification tasks and compares against cloud-based solutions.
• Introduction to Interactive Predictive Analytics in Python with scikit-learn by Olivier Grisel at PyCon 2012
3-hour-long introduction to prediction tasks using scikit-learn.
• scikit-learn - Machine Learning in Python by Jake Vanderplas at the 2012 PyData workshop at Google
Interactive demonstration of some scikit-learn features. 75 minutes.
• scikit-learn tutorial by Jake Vanderplas at PyData NYC 2012
Presentation using the online tutorial, 45 minutes.
The code examples in the above tutorials are written in a Python-console format. If you wish to easily execute these
examples in IPython, use
%doctest_mode
in the IPython console. You can then simply copy and paste the examples directly into IPython without having to
worry about removing the >>> manually.
THREE
GETTING STARTED
The purpose of this guide is to illustrate some of the main features that scikit-learn provides. It assumes a very
basic working knowledge of machine learning practices (model fitting, predicting, cross-validation, etc.). Please refer
to our installation instructions for installing scikit-learn.
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It
also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other
utilities.
Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators. Each esti-
mator can be fitted to some data using its fit method.
Here is a simple example where we fit a RandomForestClassifier to some very basic data:
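A minimal sketch of such a fit and the corresponding predictions:
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1,  2,  3],    # 2 samples, 3 features
...      [11, 12, 13]]
>>> y = [0, 1]            # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)
>>> clf.predict(X)                           # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]])   # predict classes of new data
array([0, 1])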
Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step
that transforms or imputes the data, and a final predictor that predicts target values.
In scikit-learn, pre-processors and transformers follow the same API as the estimator objects (they actually all
inherit from the same BaseEstimator class). The transformer objects don’t have a predict method but rather a
transform method that outputs a newly transformed sample matrix X:
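A sketch with StandardScaler on a tiny dataset:
>>> from sklearn.preprocessing import StandardScaler
>>> X = [[0, 15],
...      [1, -10]]
>>> StandardScaler().fit(X).transform(X)   # scale data according to the computed scaling values
array([[-1.,  1.],
       [ 1., -1.]])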
Sometimes, you want to apply different transformations to different features: the ColumnTransformer is designed for
these use-cases.
Transformers and estimators (predictors) can be combined together into a single unifying object: a Pipeline. The
pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict.
As we will see later, using a pipeline will also protect you from data leakage, i.e. from disclosing some testing data in
your training data.
In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a
pipeline on the test data:
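A sketch of such a pipeline (the accuracy value shown is approximate):
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> # create a pipeline object
>>> pipe = make_pipeline(StandardScaler(), LogisticRegression())
>>> # load the iris dataset and split it into train and test sets
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> # fit the whole pipeline and evaluate it on the test data
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
>>> accuracy_score(pipe.predict(X_test), y_test)
0.97...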
Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated.
We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn
provides many other tools for model evaluation, in particular for cross-validation.
We here briefly show how to perform a 5-fold cross-validation procedure, using the cross_validate helper. Note
that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring
functions. Please refer to our User Guide for more details:
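A sketch on a synthetic regression dataset (the perfect scores are expected here because the generated data is noise-free):
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_validate
>>> X, y = make_regression(n_samples=1000, random_state=0)
>>> lr = LinearRegression()
>>> result = cross_validate(lr, X, y)   # defaults to 5-fold cross-validation
>>> result['test_score']                # r_squared score is high because the dataset is easy
array([1., 1., 1., 1., 1.])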
All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization
power of an estimator often critically depends on a few parameters. For example a RandomForestRegressor
has a n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter
that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the
following example, we randomly search over the parameter space of a random forest with a RandomizedSearchCV
object. When the search is over, the RandomizedSearchCV behaves as a RandomForestRegressor that has
been fitted with the best set of parameters. Read more in the User Guide:
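A sketch of such a search (the dataset and parameter distributions are illustrative; the best parameters found here are the ones referred to in the comment below):
>>> from scipy.stats import randint
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV, train_test_split
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> # define the parameter space that will be searched over
>>> param_distributions = {'n_estimators': randint(1, 5),
...                        'max_depth': randint(5, 10)}
>>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
...                             n_iter=5,
...                             param_distributions=param_distributions,
...                             random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(...)
>>> search.best_params_
{'max_depth': 9, 'n_estimators': 4}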
>>> # the search object now acts like a normal random forest estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
0.73...
Note: In practice, you almost always want to search over a pipeline, instead of a single estimator. One of the main
reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any
kind of cross-validation, you would be breaking the fundamental assumption of independence between training and
testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets
is available to the train sets. This will lead to over-estimating the generalization power of the estimator (you can read
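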
more in this kaggle post).
Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
We have briefly covered estimator fitting and predicting, pre-processing steps, pipelines, cross-validation tools and
automatic hyper-parameter searches. This guide should give you an overview of some of the main features of the
library, but there is much more to scikit-learn!
Please refer to our User Guide for details on all the tools that we provide. You can also find an exhaustive list of the
public API in the API Reference.
You can also look at our numerous examples that illustrate the use of scikit-learn in many different contexts.
The tutorials also contain additional learning resources.
FOUR
USER GUIDE
The following are a set of methods intended for regression in which the target value is expected to be a linear combi-
nation of the features. In mathematical notation, if ŷ is the predicted value:
ŷ(𝑤, 𝑥) = 𝑤0 + 𝑤1 𝑥1 + ... + 𝑤𝑝 𝑥𝑝
Across the module, we designate the vector 𝑤 = (𝑤1 , ..., 𝑤𝑝 ) as coef_ and 𝑤0 as intercept_.
To perform classification with generalized linear models, see Logistic regression.
LinearRegression fits a linear model with coefficients 𝑤 = (𝑤1 , ..., 𝑤𝑝 ) to minimize the residual sum of squares
between the observed targets in the dataset, and the targets predicted by the linear approximation. Mathematically it
solves a problem of the form:
LinearRegression will take in its fit method arrays X, y and will store the coefficients 𝑤 of the linear model
in its coef_ member:
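For example:
>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression()
>>> reg.coef_
array([0.5, 0.5])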
The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are
correlated and the columns of the design matrix 𝑋 have an approximate linear dependence, the design matrix becomes
close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed
target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected
without an experimental design.
Examples:
The least squares solution is computed using the singular value decomposition of X. If X is a matrix of shape
(n_samples, n_features) this method has a cost of 𝑂(𝑛samples 𝑛2features ), assuming that 𝑛samples ≥ 𝑛features .
Regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the
coefficients. The ridge coefficients minimize a penalized residual sum of squares:
min_𝑤 ||𝑋𝑤 − 𝑦||²₂ + 𝛼||𝑤||²₂
The complexity parameter 𝛼 ≥ 0 controls the amount of shrinkage: the larger the value of 𝛼, the greater the amount
of shrinkage and thus the coefficients become more robust to collinearity.
As with other linear models, Ridge will take in its fit method arrays X, y and will store the coefficients 𝑤 of the
linear model in its coef_ member:
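For example (the numerical values shown are approximate):
>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
Ridge(alpha=0.5)
>>> reg.coef_
array([0.34545455, 0.34545455])
>>> reg.intercept_
0.13636...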
Classification
The Ridge regressor has a classifier variant: RidgeClassifier. This classifier first converts binary targets to
{-1, 1} and then treats the problem as a regression task, optimizing the same objective as above. The predicted
class corresponds to the sign of the regressor’s prediction. For multiclass classification, the problem is treated as
multi-output regression, and the predicted class corresponds to the output with the highest value.
It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more
traditional logistic or hinge losses. However in practice all those models can lead to similar cross-validation scores in
terms of accuracy or precision/recall, while the penalized least squares loss used by the RidgeClassifier allows
for a very different choice of the numerical solvers with distinct computational performance profiles.
The RidgeClassifier can be significantly faster than e.g. LogisticRegression with a high number of
classes, because it is able to compute the projection matrix (𝑋 𝑇 𝑋)−1 𝑋 𝑇 only once.
This classifier is sometimes referred to as a Least Squares Support Vector Machines with a linear kernel.
Examples:
Ridge Complexity
This method has the same order of complexity as Ordinary Least Squares.
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in
the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of
leave-one-out cross-validation:
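For example (the selected alpha is illustrative of such a run):
>>> import numpy as np
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=array([1.e-06, 1.e-05, ...]))
>>> reg.alpha_
0.01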
Specifying the value of the cv attribute will trigger the use of cross-validation with GridSearchCV , for example
cv=10 for 10-fold cross-validation, rather than Generalized Cross-Validation.
References
• “Notes on Regularized Least Squares”, Rifkin & Lippert (technical report, course slides).
Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to
prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given
solution is dependent. For this reason Lasso and its variants are fundamental to the field of compressed sensing.
Under certain conditions, it can recover the exact set of non-zero coefficients (see Compressive sensing: tomography
reconstruction with L1 prior (Lasso)).
Mathematically, it consists of a linear model with an added regularization term. The objective function to minimize is:
min_𝑤 (1 / (2𝑛samples)) ||𝑋𝑤 − 𝑦||²₂ + 𝛼||𝑤||₁
The lasso estimate thus solves the minimization of the least-squares penalty with 𝛼||𝑤||1 added, where 𝛼 is a constant
and ||𝑤||1 is the ℓ1 -norm of the coefficient vector.
The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least
Angle Regression for another implementation:
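For example:
>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha=0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1)
>>> reg.predict([[1, 1]])
array([0.8])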
The function lasso_path is useful for lower-level tasks, as it computes the coefficients along the full path of
possible values.
Examples:
The following two references explain the iterations used in the coordinate descent solver of scikit-learn, as well as the
duality gap computation used for convergence control.
References
• “Regularization Path For Generalized linear Models by Coordinate Descent”, Friedman, Hastie & Tibshirani,
J Stat Softw, 2010 (Paper).
• “An Interior-Point Method for Large-Scale L1-Regularized Least Squares,” S. J. Kim, K. Koh, M. Lustig, S.
Boyd and D. Gorinevsky, in IEEE Journal of Selected Topics in Signal Processing, 2007 (Paper)
The alpha parameter controls the degree of sparsity of the estimated coefficients.
Using cross-validation
scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV .
LassoLarsCV is based on the Least Angle Regression algorithm explained below.
For high-dimensional datasets with many collinear features, LassoCV is most often preferable. However,
LassoLarsCV has the advantage of exploring more relevant values of alpha parameter, and if the number of
samples is very small compared to the number of features, it is often faster than LassoCV .
Alternatively, the estimator LassoLarsIC proposes to use the Akaike information criterion (AIC) and the Bayes
Information criterion (BIC). It is a computationally cheaper alternative to find the optimal value of alpha as the regu-
larization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria
need a proper estimation of the degrees of freedom of the solution; they are derived for large samples (asymptotic
results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to
break when the problem is badly conditioned (more features than samples).
Examples:
The equivalence between alpha and the regularization parameter of SVM, C is given by alpha = 1 / C or
alpha = 1 / (n_samples * C), depending on the estimator and the exact objective function optimized by
the model.
Multi-task Lasso
The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly:
y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all
the regression problems, also called tasks.
The following figure compares the location of the non-zero entries in the coefficient matrix W obtained with a simple
Lasso or a MultiTaskLasso. The Lasso estimates yield scattered non-zeros while the non-zeros of the MultiTaskLasso
are full columns.
Fitting a time-series model, imposing that any active feature be active at all times.
Examples:
Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 -norm for regularization. The objective function
to minimize is:
min_𝑊 (1 / (2𝑛samples)) ||𝑋𝑊 − 𝑌||²Fro + 𝛼||𝑊||₂₁
and ℓ1 ℓ2 reads
||𝐴||₂₁ = Σ𝑖 √(Σ𝑗 𝑎²𝑖𝑗).
The implementation in the class MultiTaskLasso uses coordinate descent as the algorithm to fit the coefficients.
Elastic-Net
ElasticNet is a linear regression model trained with both ℓ1 and ℓ2 -norm regularization of the coefficients. This
combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still main-
taining the regularization properties of Ridge. We control the convex combination of ℓ1 and ℓ2 using the l1_ratio
parameter.
Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one
of these at random, while elastic-net is likely to pick both.
A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s
stability under rotation.
The objective function to minimize is in this case
min_𝑤 (1 / (2𝑛samples)) ||𝑋𝑤 − 𝑦||²₂ + 𝛼𝜌||𝑤||₁ + (𝛼(1 − 𝜌)/2) ||𝑤||²₂
The class ElasticNetCV can be used to set the parameters alpha (𝛼) and l1_ratio (𝜌) by cross-validation.
Examples:
The following two references explain the iterations used in the coordinate descent solver of scikit-learn, as well as the
duality gap computation used for convergence control.
References
• “Regularization Path For Generalized linear Models by Coordinate Descent”, Friedman, Hastie & Tibshirani,
J Stat Softw, 2010 (Paper).
• “An Interior-Point Method for Large-Scale L1-Regularized Least Squares,” S. J. Kim, K. Koh, M. Lustig, S.
Boyd and D. Gorinevsky, in IEEE Journal of Selected Topics in Signal Processing, 2007 (Paper)
Multi-task Elastic-Net
The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression prob-
lems jointly: Y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are
the same for all the regression problems, also called tasks.
Mathematically, it consists of a linear model trained with a mixed ℓ1 ℓ2 -norm and ℓ2 -norm for regularization. The
objective function to minimize is:
min_𝑊 (1 / (2𝑛samples)) ||𝑋𝑊 − 𝑌||²Fro + 𝛼𝜌||𝑊||₂₁ + (𝛼(1 − 𝜌)/2) ||𝑊||²Fro
The implementation in the class MultiTaskElasticNet uses coordinate descent as the algorithm to fit the coef-
ficients.
The class MultiTaskElasticNetCV can be used to set the parameters alpha (𝛼) and l1_ratio (𝜌) by cross-
validation.
Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron,
Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it
finds the feature most correlated with the target. When there are multiple features having equal correlation, instead of
continuing along the same feature, it proceeds in a direction equiangular between the features.
The advantages of LARS are:
• It is numerically efficient in contexts where the number of features is significantly greater than the number of
samples.
• It is computationally just as fast as forward selection and has the same order of complexity as ordinary least
squares.
• It produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune
the model.
• If two features are almost equally correlated with the target, then their coefficients should increase at approxi-
mately the same rate. The algorithm thus behaves as intuition would expect, and also is more stable.
• It is easily modified to produce solutions for other estimators, like the Lasso.
The disadvantages of the LARS method include:
• Because LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to
the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al.
(2004) Annals of Statistics article.
The LARS model can be used using estimator Lars, or its low-level implementation lars_path or
lars_path_gram.
LARS Lasso
LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on
coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.
Examples:
The Lars algorithm provides the full path of the coefficients along the regularization parameter almost for free, thus a
common operation is to retrieve the path with one of the functions lars_path or lars_path_gram.
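A hedged sketch of retrieving the path (the diabetes dataset is used only as an example):
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import lars_path
>>> X, y = load_diabetes(return_X_y=True)
>>> # alphas: the regularization values at which the active set changes;
>>> # coefs: the coefficients at each of those knots, shape (n_features, n_alphas)
>>> alphas, active, coefs = lars_path(X, y, method='lasso')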
Mathematical formulation
The algorithm is similar to forward stepwise regression, but instead of including features at each step, the estimated
coefficients are increased in a direction equiangular to each one’s correlations with the residual.
Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the
ℓ1 norm of the parameter vector. The full coefficients path is stored in the array coef_path_, which has size
(n_features, max_features+1). The first column is always zero.
References:
• Original Algorithm is detailed in the paper Least Angle Regression by Hastie et al.
OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the ℓ0 pseudo-norm).
Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements:

\arg\min_{\gamma} ||y - X\gamma||_2^2 \text{ subject to } ||\gamma||_0 \leq n_{\text{nonzero\_coefs}}

Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as:

\arg\min_{\gamma} ||\gamma||_0 \text{ subject to } ||y - X\gamma||_2^2 \leq \text{tol}
OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current
residual. It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is
recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.
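A hedged sketch of OMP recovering a sparse signal (the toy data and the number of non-zero coefficients are arbitrary):
>>> import numpy as np
>>> from sklearn.linear_model import OrthogonalMatchingPursuit
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(60, 20)
>>> true_coef = np.zeros(20)
>>> true_coef[[2, 7, 11]] = [1.5, -2.0, 3.0]         # a sparse ground-truth signal
>>> y = X @ true_coef + 0.01 * rng.randn(60)
>>> omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
>>> selected = np.flatnonzero(omp.coef_)             # indices of the recovered non-zero coefficients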
Examples:
References:
• https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
• Matching pursuits with time-frequency dictionaries, S. G. Mallat, Z. Zhang,
Bayesian Regression
Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the
regularization parameter is not set in a hard sense but tuned to the data at hand.
This can be done by introducing uninformative priors over the hyper parameters of the model. The ℓ2 regularization
used in Ridge regression and classification is equivalent to finding a maximum a posteriori estimation under a Gaussian
prior over the coefficients 𝑤 with precision 𝜆−1 . Instead of setting lambda manually, it is possible to treat it as a
random variable to be estimated from the data.
To obtain a fully probabilistic model, the output 𝑦 is assumed to be Gaussian distributed around 𝑋𝑤:
𝑝(𝑦|𝑋, 𝑤, 𝛼) = 𝒩 (𝑦|𝑋𝑤, 𝛼)
where 𝛼 is again treated as a random variable that is to be estimated from the data.
The advantages of Bayesian Regression are:
• It adapts to the data at hand.
• It can be used to include regularization parameters in the estimation procedure.
The disadvantages of Bayesian regression include:
• Inference of the model can be time consuming.
References
• A good introduction to Bayesian methods is given in C. Bishop: Pattern Recognition and Machine learning
• Original Algorithm is detailed in the book Bayesian learning for neural networks by Rad-
ford M. Neal
BayesianRidge estimates a probabilistic model of the regression problem as described above. The prior for the coefficient 𝑤 is given by a spherical Gaussian:

p(w|\lambda) = \mathcal{N}(w|0, \lambda^{-1} \mathbf{I}_p)

The priors over 𝛼 and 𝜆 are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian.
The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge.
The resulting model is called Bayesian Ridge Regression, and is similar to the classical Ridge.
The parameters 𝑤, 𝛼 and 𝜆 are estimated jointly during the fit of the model, the regularization parameters 𝛼 and 𝜆
being estimated by maximizing the log marginal likelihood. The scikit-learn implementation is based on the algorithm
described in Appendix A of (Tipping, 2001) where the update of the parameters 𝛼 and 𝜆 is done as suggested in
(MacKay, 1992). The initial value of the maximization procedure can be set with the hyperparameters alpha_init
and lambda_init.
There are four more hyperparameters, 𝛼1 , 𝛼2 , 𝜆1 and 𝜆2 of the gamma prior distributions over 𝛼 and 𝜆. These are
usually chosen to be non-informative. By default 𝛼1 = 𝛼2 = 𝜆1 = 𝜆2 = 10−6 .
Bayesian Ridge Regression is used for regression:
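For instance (a minimal sketch on a small toy problem with two identical informative features; the settings are illustrative):
>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)
BayesianRidge()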
After being fitted, the model can then be used to predict new values. The estimated coefficients 𝑤 can also be inspected:
>>> reg.coef_
array([0.49999993, 0.49999993])
Due to the Bayesian framework, the weights found are slightly different to the ones found by Ordinary Least Squares.
However, Bayesian Ridge Regression is more robust to ill-posed problems.
Examples:
References:
• Section 3.3 in Christopher M. Bishop: Pattern Recognition and Machine Learning, 2006
• David J. C. MacKay, Bayesian Interpolation, 1992.
• Michael E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, 2001.
ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser coefficients 𝑤 [1][2].
ARDRegression poses a different prior over 𝑤, by dropping the assumption of the Gaussian being spherical.
Instead, the distribution over 𝑤 is assumed to be an axis-parallel, elliptical Gaussian distribution.
This means each coefficient 𝑤𝑖 is drawn from a Gaussian distribution, centered on zero and with a precision 𝜆𝑖 .
ARD is also known in the literature as Sparse Bayesian Learning and the Relevance Vector Machine [3][4].
Examples:
References:
Logistic regression
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is
also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
Logistic regression is implemented in LogisticRegression. This implementation can fit binary, One-vs-Rest, or
multinomial logistic regression with optional ℓ1 , ℓ2 or Elastic-Net regularization.
1 Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1
2 David Wipf and Srikantan Nagarajan: A new view of automatic relevance determination
3 Michael E. Tipping: Sparse Bayesian Learning and the Relevance Vector Machine
4 Tristan Fletcher: Relevance Vector Machines explained
Note: Regularization is applied by default, which is common in machine learning but not in statistics. Another
advantage of regularization is that it improves numerical stability. No regularization amounts to setting C to a very
high value.
As an optimization problem, binary class ℓ2 penalized logistic regression minimizes the following cost function:
\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp(-y_i (X_i^T w + c)) + 1\right).
Elastic-Net regularization is a combination of ℓ1 and ℓ2 , and minimizes the following cost function:
\min_{w, c} \frac{1 - \rho}{2} w^T w + \rho ||w||_1 + C \sum_{i=1}^{n} \log\left(\exp(-y_i (X_i^T w + c)) + 1\right),
where 𝜌 controls the strength of ℓ1 regularization vs. ℓ2 regularization (it corresponds to the l1_ratio parameter).
Note that, in this notation, it’s assumed that the target 𝑦𝑖 takes values in the set {−1, 1} at trial 𝑖. We can also see that Elastic-Net is equivalent to ℓ1 when 𝜌 = 1 and equivalent to ℓ2 when 𝜌 = 0.
The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and
“saga”:
The solver “liblinear” uses a coordinate descent (CD) algorithm, and relies on the excellent C++ LIBLINEAR library,
which is shipped with scikit-learn. However, the CD algorithm implemented in liblinear cannot learn a true multi-
nomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion so separate
binary classifiers are trained for all classes. This happens under the hood, so LogisticRegression instances us-
ing this solver behave as multiclass classifiers. For ℓ1 regularization, sklearn.svm.l1_min_c allows calculating the lower bound for C in order to get a non “null” (all feature weights set to zero) model.
The “lbfgs”, “sag” and “newton-cg” solvers only support ℓ2 regularization or no regularization, and are found to
converge faster for some high-dimensional data. Setting multi_class to “multinomial” with these solvers learns
a true multinomial logistic regression model5 , which means that its probability estimates should be better calibrated
than the default “one-vs-rest” setting.
The “sag” solver uses Stochastic Average Gradient descent6 . It is faster than other solvers for large datasets, when
both the number of samples and the number of features are large.
The “saga” solver7 is a variant of “sag” that also supports the non-smooth penalty="l1". This is there-
fore the solver of choice for sparse multinomial logistic regression. It is also the only solver that supports
penalty="elasticnet".
The “lbfgs” is an optimization algorithm that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm8 ,
which belongs to quasi-Newton methods. The “lbfgs” solver is recommended for use for small data-sets but for
larger datasets its performance suffers.9
The following table summarizes the penalties supported by each solver:
5Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4
6Mark Schmidt, Nicolas Le Roux, and Francis Bach: Minimizing Finite Sums with the Stochastic Average Gradient.
7 Aaron Defazio, Francis Bach, Simon Lacoste-Julien: SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex
Composite Objectives.
8 https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm
9 “Performance Evaluation of Lbfgs vs other solvers”
Solvers and supported penalties:

Penalties                       'liblinear'  'lbfgs'  'newton-cg'  'sag'  'saga'
Multinomial + L2 penalty        no           yes      yes          yes    yes
OVR + L2 penalty                yes          yes      yes          yes    yes
Multinomial + L1 penalty        no           no       no           no     yes
OVR + L1 penalty                yes          no       no           no     yes
Elastic-Net                     no           no       no           no     yes
No penalty ('none')             no           yes      yes          yes    yes

Behaviors                       'liblinear'  'lbfgs'  'newton-cg'  'sag'  'saga'
Penalize the intercept (bad)    yes          no       no           no     no
Faster for large datasets       no           no       no           yes    yes
Robust to unscaled datasets     yes          yes      yes          no     no
The “lbfgs” solver is used by default for its robustness. For large datasets the “saga” solver is usually faster. For large datasets, you may also consider using SGDClassifier with the ‘log’ loss, which might be even faster but requires more tuning.
Examples:
There might be a difference in the scores obtained between LogisticRegression with solver=liblinear or LinearSVC and the external liblinear library directly, when fit_intercept=False and the fit coef_ (or) the data to be predicted are zeroes. This is because, for the sample(s) with decision_function zero, LogisticRegression and LinearSVC predict the negative class, while liblinear predicts the positive class. Note that a model with fit_intercept=False and having many samples with decision_function zero is likely to be an underfit, bad model; you are advised to set fit_intercept=True and increase intercept_scaling.
LogisticRegressionCV implements Logistic Regression with built-in cross-validation support, to find the opti-
mal C and l1_ratio parameters according to the scoring attribute. The “newton-cg”, “sag”, “saga” and “lbfgs”
solvers are found to be faster for high-dimensional dense data, due to warm-starting (see Glossary).
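A hedged sketch of such a cross-validated fit (the dataset and grid size are arbitrary):
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegressionCV
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegressionCV(Cs=10, cv=5, solver='lbfgs', max_iter=1000).fit(X, y)
>>> best_C = clf.C_                                  # one selected value of C per class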
References:
Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the
number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core
learning.
The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classifica-
tion and regression using different (convex) loss functions and different penalties. E.g., with loss="log",
SGDClassifier fits a logistic regression model, while with loss="hinge" it fits a linear support vector ma-
chine (SVM).
References
Perceptron
The Perceptron is another simple classification algorithm suitable for large scale learning. By default:
• It does not require a learning rate.
• It is not regularized (penalized).
• It updates its model only on mistakes.
The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the
resulting models are sparser.
The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Per-
ceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization
parameter C.
For classification, PassiveAggressiveClassifier can be used with loss='hinge' (PA-I) or
loss='squared_hinge' (PA-II). For regression, PassiveAggressiveRegressor can be used with
loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).
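A hedged sketch (the dataset and C value are arbitrary):
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import PassiveAggressiveClassifier
>>> X, y = load_digits(return_X_y=True)
>>> clf = PassiveAggressiveClassifier(C=1.0, loss='hinge', random_state=0)   # loss='hinge' is PA-I
>>> clf = clf.fit(X, y)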
References:
Robust regression aims to fit a regression model in the presence of corrupt data: either outliers, or error in the model.
There are different things to keep in mind when dealing with data corrupted by outliers:
• Outliers in X or in y?
An important notion of robust fitting is that of breakdown point: the fraction of data that can be outlying for the fit to
start missing the inlying data.
Note that in general, robust fitting in high-dimensional setting (large n_features) is very hard. The robust models
here will probably not work in these settings.
Scikit-learn provides 3 robust regression estimators: RANSAC, Theil Sen and HuberRegressor.
• HuberRegressor should be faster than RANSAC and Theil Sen unless the number of samples is very large, i.e. n_samples >> n_features. This is because RANSAC and Theil Sen fit on smaller subsets of the data. However, both Theil Sen and RANSAC are unlikely to be as robust as HuberRegressor for the default parameters.
• RANSAC is faster than Theil Sen and scales much better with the number of samples.
• RANSAC will deal better with large outliers in the y direction (most common situation).
• Theil Sen will cope better with medium-size outliers in the X direction, but this property will
disappear in high-dimensional settings.
When in doubt, use RANSAC.
RANSAC (RANdom SAmple Consensus) fits a model from random subsets of inliers from the complete data set.
RANSAC is a non-deterministic algorithm producing only a reasonable result with a certain probability, which is
dependent on the number of iterations (see max_trials parameter). It is typically used for linear and non-linear
regression problems and is especially popular in the field of photogrammetric computer vision.
The algorithm splits the complete input sample data into a set of inliers, which may be subject to noise, and outliers,
which are e.g. caused by erroneous measurements or invalid hypotheses about the data. The resulting model is then
estimated only from the determined inliers.
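A hedged sketch of a RANSAC fit on data with a few corrupted targets (the dataset and corruption are arbitrary):
>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import RANSACRegressor
>>> X, y = make_regression(n_samples=200, n_features=2, noise=4.0, random_state=0)
>>> y[:20] += 500                                    # corrupt a few targets to act as outliers
>>> ransac = RANSACRegressor(random_state=0).fit(X, y)
>>> inliers = ransac.inlier_mask_                    # boolean mask of the consensus set
>>> y_pred = ransac.predict(X)                       # predictions from the model fitted on inliers only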
Examples:
References:
• https://en.wikipedia.org/wiki/RANSAC
• “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Auto-
mated Cartography” Martin A. Fischler and Robert C. Bolles - SRI International (1981)
• “Performance Evaluation of RANSAC Family” Sunglok Choi, Taemin Kim and Wonpil Yu - BMVC (2009)
The TheilSenRegressor estimator uses a generalization of the median in multiple dimensions. It is thus robust
to multivariate outliers. Note however that the robustness of the estimator decreases quickly with the dimensionality of
the problem. It loses its robustness properties and becomes no better than an ordinary least squares in high dimension.
Examples:
• Theil-Sen Regression
• Robust linear estimator fitting
References:
• https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator
Theoretical considerations
TheilSenRegressor is comparable to the Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as
an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method which means it makes no assumption
about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against
corrupted data aka outliers. In univariate setting, Theil-Sen has a breakdown point of about 29.3% in case of a simple
linear regression which means that it can tolerate arbitrary corrupted data of up to 29.3%.
T. Kärkkäinen and S. Äyrämö: On Computation of Spatial Median for Robust Data Mining.
In terms of time and space complexity, Theil-Sen scales with the number of possible subsample combinations, which makes it infeasible to be applied exhaustively to problems with a large number of samples and features. Therefore, the magnitude of a subpopulation can be chosen to limit the time and space complexity by considering only a random subset of all possible combinations.
Examples:
• Theil-Sen Regression
References:
Huber Regression
The HuberRegressor is different from Ridge because it applies a linear loss to samples that are classified as outliers. A sample is classified as an inlier if the absolute error of that sample is less than a certain threshold. It differs from TheilSenRegressor and RANSACRegressor because it does not ignore the effect of the outliers but gives them a lesser weight.
The loss function minimized by HuberRegressor is

\min_{w, \sigma} \sum_{i=1}^{n} \left( \sigma + H_{\epsilon}\left( \frac{X_i w - y_i}{\sigma} \right) \sigma \right) + \alpha ||w||_2^2

where

H_{\epsilon}(z) = \begin{cases} z^2, & \text{if } |z| < \epsilon \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases}
It is advised to set the parameter epsilon to 1.35 to achieve 95% statistical efficiency.
Notes
The HuberRegressor differs from using SGDRegressor with loss set to huber in the following ways.
• HuberRegressor is scaling invariant. Once epsilon is set, scaling X and y down or up by different values would produce the same robustness to outliers as before, whereas with SGDRegressor, epsilon has to be set again when X and y are scaled.
• HuberRegressor should be more efficient to use on data with a small number of samples, while SGDRegressor needs a number of passes on the training data to produce the same robustness.
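A hedged sketch of a HuberRegressor fit on data with a few gross outliers in y (the dataset and parameter values are arbitrary):
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import HuberRegressor
>>> X, y = make_regression(n_samples=100, n_features=2, noise=4.0, random_state=0)
>>> y[:4] += 300                                     # a few gross outliers in y
>>> huber = HuberRegressor(epsilon=1.35, alpha=0.0001).fit(X, y)
>>> outlier_mask = huber.outliers_                   # samples whose absolute residual exceeded the threshold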
Examples:
References:
• Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
Note that this estimator is different from the R implementation of Robust Regression (http://www.ats.ucla.edu/stat/r/
dae/rreg.htm) because the R implementation does a weighted least squares implementation with weights given to each
sample on the basis of how much the residual is greater than a certain threshold.
One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This
approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range
of data.
For example, a simple linear regression can be extended by constructing polynomial features from the data.
In the standard linear regression case, you might have a model that looks like this for two-dimensional data:

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2

If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-order polynomials, so that the model looks like this:

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2

The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating a new set of features

z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2]

With this re-labeling of the data, our problem can be written

\hat{y}(w, z) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5
We see that the resulting polynomial regression is in the same class of linear models we considered above (i.e. the
model is linear in 𝑤) and can be solved by the same techniques. By considering linear fits within a higher-dimensional
space built with these basis functions, the model has the flexibility to fit a much broader range of data.
Here is an example of applying this idea to one-dimensional data, using polynomial features of varying degrees:
This figure is created using the PolynomialFeatures transformer, which transforms an input data matrix into a
new data matrix of a given degree. It can be used as follows:
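For instance (a minimal sketch on a small 3x2 integer matrix; degree 2 generates the bias, linear, interaction and squared terms):
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])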
The features of X have been transformed from [𝑥1 , 𝑥2 ] to [1, 𝑥1 , 𝑥2 , 𝑥21 , 𝑥1 𝑥2 , 𝑥22 ], and can now be used within any
linear model.
This sort of preprocessing can be streamlined with the Pipeline tools. A single object representing a simple polynomial
regression can be created and used as follows:
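A minimal sketch (the degree-3 toy polynomial below is illustrative):
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import PolynomialFeatures
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to data generated from an order-3 polynomial
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
array([ 3., -2.,  1., -1.])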
The linear model trained on polynomial features is able to exactly recover the input polynomial coefficients.
In some cases it’s not necessary to include higher powers of any single feature, but only the so-called interaction
features that multiply together at most 𝑑 distinct features. These can be obtained from PolynomialFeatures with the setting interaction_only=True.
For example, when dealing with boolean features, $x_i^n = x_i$ for all $n$ and is therefore useless; but $x_i x_j$ represents the conjunction of two booleans. This way, we can solve the XOR problem with a linear classifier; a minimal sketch of the setup (the exact estimator settings below are illustrative) is:
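>>> import numpy as np
>>> from sklearn.linear_model import Perceptron
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> y = X[:, 0] ^ X[:, 1]
>>> X = PolynomialFeatures(interaction_only=True).fit_transform(X).astype(int)
>>> clf = Perceptron(fit_intercept=False, max_iter=10, tol=None,
...                  shuffle=False).fit(X, y)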
>>> clf.predict(X)
array([0, 1, 1, 0])
>>> clf.score(X, y)
1.0
The plot shows decision boundaries for Linear Discriminant Analysis and Quadratic Discriminant Analysis. The
bottom row demonstrates that Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Dis-
criminant Analysis can learn quadratic boundaries and is therefore more flexible.
Examples:
Linear and Quadratic Discriminant Analysis with covariance ellipsoid: Comparison of LDA and QDA on synthetic
data.
LinearDiscriminantAnalysis.predict.
Examples:
Comparison of LDA and PCA 2D projection of Iris dataset: Comparison of LDA and PCA for dimensionality
reduction of the Iris dataset
Both LDA and QDA can be derived from simple probabilistic models which model the class conditional distribution
of the data 𝑃 (𝑋|𝑦 = 𝑘) for each class 𝑘. Predictions can then be obtained by using Bayes’ rule:
To understand the use of LDA in dimensionality reduction, it is useful to start with a geometric reformulation of the
LDA classification rule explained above. We write 𝐾 for the total number of target classes. Since in LDA we assume that all classes have the same estimated covariance Σ, we can rescale the data so that this covariance is the identity:

X^* = D^{-1/2} U^t X \quad \text{with} \quad \Sigma = U D U^t
Then one can show that to classify a data point after scaling is equivalent to finding the estimated class mean 𝜇*𝑘 which
is closest to the data point in the Euclidean distance. But this can be done just as well after projecting on the 𝐾 − 1
affine subspace 𝐻𝐾 generated by all the 𝜇*𝑘 for all classes. This shows that, implicit in the LDA classifier, there is a
dimensionality reduction by linear projection onto a 𝐾 − 1 dimensional space.
We can reduce the dimension even more, to a chosen 𝐿, by projecting onto the linear subspace 𝐻𝐿 which max-
imizes the variance of the 𝜇*𝑘 after projection (in effect, we are doing a form of PCA for the transformed class
means 𝜇*𝑘 ). This 𝐿 corresponds to the n_components parameter used in the discriminant_analysis.
LinearDiscriminantAnalysis.transform method. See [3] for more details.
3 “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., Section 4.3, p.106-119, 2008.
Shrinkage
Shrinkage is a tool to improve estimation of covariance matrices in situations where the number of training sam-
ples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor es-
timator. Shrinkage LDA can be used by setting the shrinkage parameter of the discriminant_analysis.
LinearDiscriminantAnalysis class to ‘auto’. This automatically determines the optimal shrinkage parameter
in an analytic way following the lemma introduced by Ledoit and Wolf4 . Note that currently shrinkage only works
when setting the solver parameter to ‘lsqr’ or ‘eigen’.
The shrinkage parameter can also be manually set between 0 and 1. In particular, a value of 0 corresponds to
no shrinkage (which means the empirical covariance matrix will be used) and a value of 1 corresponds to complete
shrinkage (which means that the diagonal matrix of variances will be used as an estimate for the covariance matrix).
Setting this parameter to a value between these two extrema will estimate a shrunk version of the covariance matrix.
4 Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio Management 30(4), 110-119, 2004.
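A hedged sketch of shrinkage LDA (the toy dataset and the manual shrinkage value are arbitrary):
>>> from sklearn.datasets import make_classification
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> # few samples relative to the number of features, where shrinkage typically helps
>>> X, y = make_classification(n_samples=40, n_features=30, n_informative=5, random_state=0)
>>> lda_auto = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto').fit(X, y)
>>> lda_manual = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.5).fit(X, y)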
Estimation algorithms
The default solver is ‘svd’. It can perform both classification and transform, and it does not rely on the calculation
of the covariance matrix. This can be an advantage in situations where the number of features is large. However, the
‘svd’ solver cannot be used with shrinkage.
The ‘lsqr’ solver is an efficient algorithm that only works for classification. It supports shrinkage.
The ‘eigen’ solver is based on the optimization of the between class scatter to within class scatter ratio. It can be used
for both classification and transform, and it supports shrinkage. However, the ‘eigen’ solver needs to compute the
covariance matrix, so it might not be suitable for situations with a high number of features.
Examples:
Normal and Shrinkage Linear Discriminant Analysis for classification: Comparison of LDA classifiers with and
without shrinkage.
References:
Kernel ridge regression (KRR) [M2012] combines Ridge regression and classification (linear least squares with l2-
norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel
and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different
loss functions are used: KRR uses squared error loss while support vector regression uses 𝜖-insensitive loss, both
combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed-form and is typically
faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR, which
learns a sparse model for 𝜖 > 0, at prediction-time.
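A hedged sketch of a kernel ridge fit on noisy sinusoidal data (the dataset and the alpha/gamma values are arbitrary):
>>> import numpy as np
>>> from sklearn.kernel_ridge import KernelRidge
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(0, 5, size=(100, 1))
>>> y = np.sin(X).ravel() + 0.1 * rng.randn(100)
>>> # alpha is the ridge regularization strength, gamma the RBF kernel bandwidth
>>> krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.5).fit(X, y)
>>> y_pred = krr.predict(X)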
The following figure compares KernelRidge and SVR on an artificial dataset, which consists of a sinusoidal target
function and strong noise added to every fifth datapoint. The learned model of KernelRidge and SVR is plotted,
where both complexity/regularization and bandwidth of the RBF kernel have been optimized using grid-search. The
learned functions are very similar; however, fitting KernelRidge is approx. seven times faster than fitting SVR
(both with grid-search). However, prediction of 100000 target values is more than three times faster with SVR since it
has learned a sparse model using only approx. 1/3 of the 100 training datapoints as support vectors.
The next figure compares the time for fitting and prediction of KernelRidge and SVR for different sizes of the
training set. Fitting KernelRidge is faster than SVR for medium-sized training sets (less than 1000 samples);
however, for larger training sets SVR scales better. With regard to prediction time, SVR is faster than KernelRidge
for all sizes of the training set because of the learned sparse solution. Note that the degree of sparsity and thus the
prediction time depends on the parameters 𝜖 and 𝐶 of the SVR; 𝜖 = 0 would correspond to a dense model.
References:
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and
outliers detection.
Classification
SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.
SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical
formulations (see section Mathematical formulation). On the other hand, LinearSVC is another implementation
of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept keyword
kernel, as this is assumed to be linear. It also lacks some of the members of SVC and NuSVC, like support_.
As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples,
n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC()
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
SVMs decision function depends on some subset of the training data, called the support vectors. Some properties of
these support vectors can be found in members support_vectors_, support_ and n_support:
>>> # get support vectors
>>> clf.support_vectors_
array([[0., 0.],
[1., 1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
Multi-class classification
SVC and NuSVC implement the “one-against-one” approach (Knerr et al., 1990) for multi- class classifica-
tion. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are con-
structed and each one trains data from two classes. To provide a consistent interface with other classifiers, the
decision_function_shape option allows monotonically transforming the results of the “one-against-one” classifiers to a decision function of shape (n_samples, n_classes).
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(decision_function_shape='ovo')
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
On the other hand, LinearSVC implements “one-vs-the-rest” multi-class strategy, thus training n_class models. If
there are only two classes, only one model is trained:
For the “one-vs-one” case of SVC and NuSVC, the layout of the dual coefficients of the support vectors is illustrated below for a three-class problem, where $\alpha^{j}_{i,k}$ denotes the dual coefficient of support vector $j$ of class $i$ in the classifier between classes $i$ and $k$:

\alpha^{0}_{0,1}  \alpha^{0}_{0,2}    Coefficients for SVs of class 0
\alpha^{1}_{0,1}  \alpha^{1}_{0,2}
\alpha^{2}_{0,1}  \alpha^{2}_{0,2}
\alpha^{0}_{1,0}  \alpha^{0}_{1,2}    Coefficients for SVs of class 1
\alpha^{1}_{1,0}  \alpha^{1}_{1,2}
\alpha^{0}_{2,0}  \alpha^{0}_{2,1}    Coefficients for SVs of class 2
\alpha^{1}_{2,0}  \alpha^{1}_{2,1}
The decision_function method of SVC and NuSVC gives per-class scores for each sample (or a single score
per sample in the binary case). When the constructor option probability is set to True, class membership
probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary
case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM’s scores, fit by an additional
cross-validation on the training data. In the multiclass case, this is extended as per Wu et al. (2004).
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition,
the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be
the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging
to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoret-
ical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set
probability=False and use decision_function instead of predict_proba.
Please note that when decision_function_shape='ovr' and n_classes > 2, unlike decision_function, the predict method does not try to break ties by default. You can set break_ties=True for the output of predict to be the same as np.argmax(clf.decision_function(...), axis=1); otherwise the first class among the tied classes will always be returned. Keep in mind that breaking ties comes with a computational cost.
References:
• Wu, Lin and Weng, “Probability estimates for multi-class classification by pairwise coupling”, JMLR 5:975-
1005, 2004.
• Platt “Probabilistic outputs for SVMs and comparisons to regularized likelihood methods”.
Unbalanced problems
In problems where it is desired to give more importance to certain classes or certain individual samples keywords
class_weight and sample_weight can be used.
SVC (but not NuSVC) implements the keyword class_weight in the fit method. It’s a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value.
SVC, NuSVC, SVR, NuSVR, LinearSVC, LinearSVR and OneClassSVM also implement weights for individual samples in the fit method through the keyword sample_weight. Similar to class_weight, these set the parameter C for the i-th example to C * sample_weight[i].
Examples:
Regression
The method of Support Vector Classification can be extended to solve regression problems. This method is called
Support Vector Regression.
The model produced by support vector classification (as described above) depends only on a subset of the training
data, because the cost function for building the model does not care about training points that lie beyond the margin.
Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because
the cost function for building the model ignores any training data close to the model prediction.
There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR.
LinearSVR provides a faster implementation than SVR but only considers linear kernels, while NuSVR implements
a slightly different formulation than SVR and LinearSVR. See Implementation details for further details.
As with classification classes, the fit method will take as argument vectors X, y, only that in this case y is expected to
have floating point values instead of integer values:
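For instance (a minimal sketch with two training points and floating point targets):
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> regr = svm.SVR()
>>> regr.fit(X, y)
SVR()
>>> pred = regr.predict([[1, 1]])                    # a value roughly midway between the two targets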
Examples:
The class OneClassSVM implements a One-Class SVM which is used in outlier detection.
See Novelty and Outlier Detection for the description and usage of OneClassSVM.
Complexity
Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the
number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support
vectors from the rest of the training data. The QP solver used by this libsvm-based implementation scales between
$O(n_{\text{features}} \times n_{\text{samples}}^2)$ and $O(n_{\text{features}} \times n_{\text{samples}}^3)$ depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, $n_{\text{features}}$ should be replaced by the average number of non-zero features in a sample vector.
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more
efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
• Avoiding data copy: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered
contiguous, and double precision, it will be copied before calling the underlying C implementation. You can
check whether a given numpy array is C-contiguous by inspecting its flags attribute.
For LinearSVC (and LogisticRegression) any input passed as a numpy array will be copied and con-
verted to the liblinear internal sparse data representation (double precision floats and int32 indices of non-zero
components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous
double precision array as input we suggest to use the SGDClassifier class instead. The objective function
can be configured to be almost the same as the LinearSVC model.
• Kernel cache size: For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run
times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a
higher value than the default of 200(MB), such as 500(MB) or 1000(MB).
• Setting C: C is 1 by default and it’s a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop im-
proving after a certain threshold. Meanwhile, larger C values will take more time to train, sometimes up to 10
times longer, as shown by Fan et al. (2008)
• Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.
For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0
and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. See
section Preprocessing data for more details on scaling and normalization.
• Parameter nu in NuSVC/OneClassSVM /NuSVR approximates the fraction of training errors and support vec-
tors.
• In SVC, if data for classification are unbalanced (e.g. many positive and few negative), set
class_weight='balanced' and/or try different penalty parameters C.
• Randomness of the underlying implementations: The underlying implementations of SVC and NuSVC use
a random number generator only to shuffle the data for probability estimation (when probability is set to
True). This randomness can be controlled with the random_state parameter. If probability is set
to False these estimators are not random and random_state has no effect on the results. The underlying
OneClassSVM implementation is similar to the ones of SVC and NuSVC. As no probability estimation is
provided for OneClassSVM , it is not random.
The underlying LinearSVC implementation uses a random number generator to select features when fitting the model with a dual coordinate descent (i.e. when dual is set to True). It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter. This randomness can also be controlled with the random_state parameter. When dual is set to False the underlying implementation of LinearSVC is not random and random_state has no effect on the results.
• Using L1 penalization as provided by LinearSVC(penalty='l1', loss='squared_hinge', dual=False) yields a sparse solution, i.e. only a subset of feature weights is different from zero and contributes to the decision function. Increasing C yields a more complex model (more features are selected). The C value that yields a “null” model (all weights equal to zero) can be calculated using l1_min_c.
References:
• Fan, Rong-En, et al., “LIBLINEAR: A library for large linear classification.”, Journal of machine learning
research 9.Aug (2008): 1871-1874.
Kernel functions
Custom Kernels
You can define your own kernels by either giving the kernel as a python function or by precomputing the Gram matrix.
Classifiers with custom kernels behave the same way as any other classifiers, except that:
• Field support_vectors_ is now empty, only indices of support vectors are stored in support_
• A reference (and not a copy) of the first argument in the fit() method is stored for future reference. If that
array changes between the use of fit() and predict() you will have unexpected results.
You can also use your own defined kernels by passing a function to the keyword kernel in the constructor.
Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2,
n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2).
The following code defines a linear kernel and creates a classifier instance that will use that kernel:
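A minimal version could look like this (the toy data is illustrative):
>>> import numpy as np
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> def my_kernel(X, Y):
...     # a plain linear kernel: the matrix of dot products between the two sets of samples
...     return np.dot(X, np.transpose(Y))
...
>>> clf = svm.SVC(kernel=my_kernel).fit(X, y)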
Examples:
Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel
values between all training vectors and the test vectors must be provided.
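For instance (a minimal sketch; the linear Gram matrices below are illustrative):
>>> import numpy as np
>>> from sklearn import svm
>>> X = np.array([[0., 0.], [1., 1.]])
>>> y = [0, 1]
>>> clf = svm.SVC(kernel='precomputed')
>>> gram_train = np.dot(X, X.T)                      # kernel values between all pairs of training vectors
>>> clf = clf.fit(gram_train, y)
>>> X_test = np.array([[2., 2.]])
>>> gram_test = np.dot(X_test, X.T)                  # kernel values between test and training vectors
>>> pred = clf.predict(gram_test)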
When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and
gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against
simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all
training examples correctly. gamma defines how much influence a single training example has. The larger gamma is,
the closer other examples must be to be affected.
Proper choice of C and gamma is critical to the SVM’s performance. One is advised to use sklearn.
model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
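A hedged sketch of such a grid search (the dataset and the grid values are arbitrary):
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1, 1]}
>>> search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
>>> best = search.best_params_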
Examples:
Mathematical formulation
A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which
can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane
that has the largest distance to the nearest training data points of any class (so-called functional margin), since in
general the larger the margin the lower the generalization error of the classifier.
SVC
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, ..., n$, in two classes, and a vector $y \in \{1, -1\}^n$, SVC solves the following primal problem:

\min_{w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i

\text{subject to } y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0, \quad i = 1, ..., n
Its dual is
\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha

\text{subject to } y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, ..., n
where 𝑒 is the vector of all ones, 𝐶 > 0 is the upper bound, 𝑄 is an 𝑛 by 𝑛 positive semidefinite matrix, 𝑄𝑖𝑗 ≡
𝑦𝑖 𝑦𝑗 𝐾(𝑥𝑖 , 𝑥𝑗 ), where 𝐾(𝑥𝑖 , 𝑥𝑗 ) = 𝜑(𝑥𝑖 )𝑇 𝜑(𝑥𝑗 ) is the kernel. Here training vectors are implicitly mapped into a
higher (maybe infinite) dimensional space by the function 𝜑.
The decision function is:
\operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + \rho \right)
Note: While SVM models derived from libsvm and liblinear use C as regularization parameter, most other estimators
use alpha. The exact equivalence between the amount of regularization of two models depends on the exact objective
function optimized by the model. For example, when the estimator used is sklearn.linear_model.Ridge regression, the relation between them is given as C = \frac{1}{alpha}.
These parameters can be accessed through the members dual_coef_ which holds the product 𝑦𝑖 𝛼𝑖 , support_vectors_ which holds the support vectors, and intercept_ which holds the independent term 𝜌:
References:
• “Automatic Capacity Tuning of Very Large VC-dimension Classifiers”, I. Guyon, B. Boser, V. Vapnik -
Advances in neural information processing 1993.
• “Support-vector networks”, C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).
NuSVC
We introduce a new parameter 𝜈 which controls the number of support vectors and training errors. The parameter
𝜈 ∈ (0, 1] is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.
It can be shown that the 𝜈-SVC formulation is a reparameterization of the 𝐶-SVC and therefore mathematically
equivalent.
SVR
Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, ..., n$, and a vector $y \in \mathbb{R}^n$, $\varepsilon$-SVR solves the following primal problem:

\min_{w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)

\text{subject to } y_i - w^T \phi(x_i) - b \leq \varepsilon + \zeta_i, \quad w^T \phi(x_i) + b - y_i \leq \varepsilon + \zeta_i^*, \quad \zeta_i, \zeta_i^* \geq 0, \quad i = 1, ..., n
Its dual is
\min_{\alpha, \alpha^*} \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon e^T (\alpha + \alpha^*) - y^T (\alpha - \alpha^*)

\text{subject to } e^T (\alpha - \alpha^*) = 0, \quad 0 \leq \alpha_i, \alpha_i^* \leq C, \quad i = 1, ..., n
where 𝑒 is the vector of all ones, 𝐶 > 0 is the upper bound, 𝑄 is an 𝑛 by 𝑛 positive semidefinite matrix, 𝑄𝑖𝑗 ≡
𝐾(𝑥𝑖 , 𝑥𝑗 ) = 𝜑(𝑥𝑖 )𝑇 𝜑(𝑥𝑗 ) is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite)
dimensional space by the function 𝜑.
The decision function is:
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho
These parameters can be accessed through the members dual_coef_ which holds the difference 𝛼𝑖 − 𝛼𝑖* ,
support_vectors_ which holds the support vectors, and intercept_ which holds the independent term 𝜌
References:
• “A Tutorial on Support Vector Regression”, Alex J. Smola, Bernhard Schölkopf - Statistics and Computing
archive Volume 14 Issue 3, August 2004, p. 199-222.
Implementation details
Internally, we use libsvm and liblinear to handle all computations. These libraries are wrapped using C and Cython.
References:
For a description of the implementation and details of the algorithms used, please refer to
• LIBSVM: A Library for Support Vector Machines.
• LIBLINEAR – A Library for Large Linear Classification.
Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear clas-
sifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though
SGD has been around in the machine learning community for a long time, it has received a considerable amount of
attention just recently in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text
classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale
to problems with more than 10^5 training examples and more than 10^5 features.
The advantages of Stochastic Gradient Descent are:
• Efficiency.
• Ease of implementation (lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include:
• SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
• SGD is sensitive to feature scaling.
Classification
Warning: Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True
to shuffle after each iteration.
The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different
loss functions and penalties for classification.
As other classifiers, SGD has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the
training samples, and an array Y of size [n_samples] holding the target values (class labels) for the training samples:
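For instance (a minimal sketch on a two-point toy problem; the settings are illustrative):
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
>>> clf.fit(X, y)
SGDClassifier(max_iter=5)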
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
SGD fits a linear model to the training data. The member coef_ holds the model parameters:
>>> clf.coef_
array([[9.9..., 9.9...]])
Whether or not the model should use an intercept, i.e. a biased hyperplane, is controlled by the parameter
fit_intercept.
To get the signed distance to the hyperplane use SGDClassifier.decision_function:
>>> clf.decision_function([[2., 2.]])
array([29.6...])
The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss func-
tions:
• loss="hinge": (soft-margin) linear Support Vector Machine,
• loss="modified_huber": smoothed hinge loss,
• loss="log": logistic regression,
• and all regression losses below.
The first two loss functions are lazy: they only update the model parameters if an example violates the margin constraint, which makes training very efficient and may result in sparser models, even when the L2 penalty is used.
Using loss="log" or loss="modified_huber" enables the predict_proba method, which gives a vector
of probability estimates 𝑃 (𝑦|𝑥) per sample 𝑥:
>>> clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([[0.00..., 0.99...]])
The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:
• penalty="l2": L2 norm penalty on coef_.
• penalty="l1": L1 norm penalty on coef_.
• penalty="elasticnet": Convex combination of L2 and L1; (1 - l1_ratio) * L2 +
l1_ratio * L1.
The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero.
The Elastic Net solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The param-
eter l1_ratio controls the convex combination of L1 and L2 penalty.
SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all”
(OVA) scheme. For each of the 𝐾 classes, a binary classifier is learned that discriminates between that and all other
𝐾 − 1 classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each
classifier and choose the class with the highest confidence. The Figure below illustrates the OVA approach on the iris
dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced
by the three classifiers.
Examples:
SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting `average=True`.
ASGD works by averaging the coefficients of plain SGD over the iterations. When using ASGD the learning rate can be larger and even constant, leading on some datasets to a speed up in training time.
For classification with a logistic loss, another variant of SGD with an averaging strategy is available with Stochastic
Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.
Regression
The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different
loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.
The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:
• loss="squared_loss": Ordinary least squares,
• loss="huber": Huber loss for robust regression,
• loss="epsilon_insensitive": linear Support Vector Regression.
The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region
has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
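A hedged sketch of a robust SGDRegressor fit with the Huber loss (the dataset and epsilon value are arbitrary):
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import SGDRegressor
>>> X, y = make_regression(n_samples=20000, n_features=50, noise=3.0, random_state=0)
>>> reg = SGDRegressor(loss="huber", epsilon=1.0, penalty="l2", max_iter=1000, tol=1e-3)
>>> reg = reg.fit(X, y)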
SGDRegressor supports averaged SGD as SGDClassifier. Averaging can be enabled by setting
`average=True`.
For regression with a squared loss and a l2 penalty, another variant of SGD with an averaging strategy is available with
Stochastic Average Gradient (SAG) algorithm, available as a solver in Ridge.
Note: The sparse implementation produces slightly different results than the dense implementation due to a shrunk
learning rate for the intercept.
There is built-in support for sparse data given in any matrix in a format supported by scipy.sparse. For maximum
efficiency, however, use the CSR matrix format as defined in scipy.sparse.csr_matrix.
Examples:
Complexity
The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of $O(k n \bar{p})$, where $k$ is the number of iterations (epochs) and $\bar{p}$ is the average number of non-zero attributes per sample.
Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase
as the training set size increases.
Stopping criterion
The classes SGDClassifier and SGDRegressor provide two criteria to stop the algorithm when a given level of
convergence is reached:
• With early_stopping=True, the input data is split into a training set and a validation set. The model
is then fitted on the training set, and the stopping criterion is based on the prediction score computed on the
validation set. The size of the validation set can be changed with the parameter validation_fraction.
• With early_stopping=False, the model is fitted on the entire input data and the stopping criterion is
based on the objective function computed on the input data.
In both cases, the criterion is evaluated once by epoch, and the algorithm stops when the criterion does not improve
n_iter_no_change times in a row. The improvement is evaluated with a tolerance tol, and the algorithm stops
in any case after a maximum number of iteration max_iter.
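A hedged sketch of enabling validation-based early stopping (the parameter values are arbitrary):
>>> from sklearn.linear_model import SGDClassifier
>>> # hold out 20% of the training data as a validation set for the stopping criterion
>>> clf = SGDClassifier(early_stopping=True, validation_fraction=0.2,
...                     n_iter_no_change=5, tol=1e-3, max_iter=1000)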
• Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For
example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and
variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can
be easily done using StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Don't cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # apply same transformation to test data
If your attributes have an intrinsic scale (e.g. word frequencies or indicator features) scaling is not needed.
• Finding a reasonable regularization term 𝛼 is best done using GridSearchCV, usually in the range 10.0**-np.arange(1,7).
• Empirically, we found that SGD converges after observing approx. 10^6 training samples. Thus, a reasonable
first guess for the number of iterations is max_iter = np.ceil(10**6 / n), where n is the size of the
training set.
• If you apply SGD to features extracted using PCA we found that it is often wise to scale the feature values by
some constant c such that the average L2 norm of the training data equals one.
• We found that Averaged SGD works best with a larger number of features and a higher eta0.
References:
• “Efficient BackProp” Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
Mathematical formulation
Given a set of training examples (𝑥1 , 𝑦1 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) where 𝑥𝑖 ∈ R𝑚 and 𝑦𝑖 ∈ {−1, 1}, our goal is to learn a linear
scoring function 𝑓 (𝑥) = 𝑤𝑇 𝑥 + 𝑏 with model parameters 𝑤 ∈ R𝑚 and intercept 𝑏 ∈ R. In order to make predictions,
we simply look at the sign of 𝑓 (𝑥). A common choice to find the model parameters is by minimizing the regularized
training error given by
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)
where 𝐿 is a loss function that measures model (mis)fit and 𝑅 is a regularization term (aka penalty) that penalizes
model complexity; 𝛼 > 0 is a non-negative hyperparameter.
Different choices for 𝐿 entail different classifiers, such as the hinge loss (soft-margin SVM), the log loss (logistic regression) or the modified Huber loss described above.
SGD
Stochastic gradient descent is an optimization method for unconstrained optimization problems. In contrast to (batch)
gradient descent, SGD approximates the true gradient of 𝐸(𝑤, 𝑏) by considering a single training example at a time.
The class SGDClassifier implements a first-order SGD learning routine. The algorithm iterates over the training
examples and for each example updates the model parameters according to the update rule given by
w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(w^T x_i + b, y_i)}{\partial w} \right)
where 𝜂 is the learning rate which controls the step-size in the parameter space. The intercept 𝑏 is updated similarly
but without regularization.
The learning rate 𝜂 can be either constant or gradually decaying. For classification, the default learning rate schedule
(learning_rate='optimal') is given by
\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}
where 𝑡 is the time step (there are a total of n_samples * n_iter time steps), 𝑡0 is determined based on a heuristic
proposed by Léon Bottou such that the expected initial updates are comparable with the expected size of the weights
(this assuming that the norm of the training samples is approx. 1). The exact definition can be found in _init_t in
BaseSGD.
For regression the default learning rate schedule is inverse scaling (learning_rate='invscaling'), given by
\eta^{(t)} = \frac{eta0}{t^{power\_t}}
where 𝑒𝑡𝑎0 and 𝑝𝑜𝑤𝑒𝑟_𝑡 are hyperparameters chosen by the user via eta0 and power_t, resp.
For a constant learning rate use learning_rate='constant' and use eta0 to specify the learning rate.
For an adaptively decreasing learning rate, use learning_rate='adaptive' and use eta0 to specify the start-
ing learning rate. When the stopping criterion is reached, the learning rate is divided by 5, and the algorithm does not
stop. The algorithm stops when the learning rate goes below 1e-6.
The model parameters can be accessed through the members coef_ and intercept_:
• Member coef_ holds the weights 𝑤
• Member intercept_ holds 𝑏
References:
• “Solving large scale linear prediction problems using stochastic gradient descent algorithms” T. Zhang - In
Proceedings of ICML ‘04.
• “Regularization and variable selection via the elastic net” H. Zou, T. Hastie - Journal of the Royal Statistical
Society Series B, 67 (2), 301-320.
• “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent” Xu, Wei
Implementation details
The implementation of SGD is influenced by the Stochastic Gradient SVM of Léon Bottou. Similar to SvmSGD,
the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in
the case of L2 regularization. In the case of sparse feature vectors, the intercept is updated with a smaller learning
rate (multiplied by 0.01) to account for the fact that it is updated more frequently. Training examples are picked up
sequentially and the learning rate is lowered after each observed example. We adopted the learning rate schedule from
Shalev-Shwartz et al. 2007. For multi-class classification, a “one versus all” approach is used. We use the truncated
gradient algorithm proposed by Tsuruoka et al. 2009 for L1 regularization (and the Elastic Net). The code is written
in Cython.
References:
• “Pegasos: Primal estimated sub-gradient solver for svm” S. Shalev-Shwartz, Y. Singer, N. Srebro - In Pro-
ceedings of ICML ‘07.
• “Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty” Y. Tsu-
ruoka, J. Tsujii, S. Ananiadou - In Proceedings of the AFNLP/ACL ‘09.
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods.
Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and
spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete
labels, and regression for data with continuous labels.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance
to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest
neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can,
in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based meth-
ods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data
(possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
Despite its simplicity, nearest neighbors has been successful in a large number of classification and regression prob-
lems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in
classification situations where the decision boundary is very irregular.
The classes in sklearn.neighbors can handle either NumPy arrays or scipy.sparse matrices as input. For
dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski
metrics are supported for searches.
There are many learning routines which rely on nearest neighbors at their core. One example is kernel density estima-
tion, discussed in the density estimation section.
NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three
different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in
sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword
'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the de-
fault value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. For a
discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.
Warning: Regarding the Nearest Neighbors algorithms, if two neighbors 𝑘 + 1 and 𝑘 have identical
distances but different labels, the result will depend on the ordering of the training data.
For the simple task of finding the nearest neighbors between two sets of data, the unsupervised algorithms within
sklearn.neighbors can be used:
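The code snippet that originally accompanied this passage is not reproduced here; a minimal sketch consistent with the six-point graph output shown below (the dataset X is an assumption) might look like:
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)   # each point's two nearest neighbors (including itself)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)
>>> distances
array([[0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356]])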
Because the query set matches the training set, the nearest neighbor of each point is the point itself, at a distance of
zero.
It is also possible to efficiently produce a sparse graph showing the connections between neighboring points:
>>> nbrs.kneighbors_graph(X).toarray()
array([[1., 1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0.],
       [0., 1., 1., 0., 0., 0.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1.]])
The dataset is structured such that points nearby in index order are nearby in parameter space, leading to an ap-
proximately block-diagonal matrix of K-nearest neighbors. Such a sparse graph is useful in a variety of cir-
cumstances which make use of spatial relationships between points for unsupervised learning: in particular,
see sklearn.manifold.Isomap, sklearn.manifold.LocallyLinearEmbedding, and sklearn.
cluster.SpectralClustering.
Alternatively, one can use the KDTree or BallTree classes directly to find nearest neighbors. This is the function-
ality wrapped by the NearestNeighbors class used above. The Ball Tree and KD Tree have the same interface;
we’ll show an example of using the KD Tree here:
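A minimal sketch of such a KDTree query (reusing the small toy dataset from above; the data in the original example may differ):
>>> from sklearn.neighbors import KDTree
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kdt = KDTree(X, leaf_size=30, metric='euclidean')
>>> kdt.query(X, k=2, return_distance=False)   # indices of the two nearest neighbors of each point
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)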
Refer to the KDTree and BallTree class documentation for more information on the options available for nearest
neighbors searches, including specification of query strategies, distance metrics, etc. For a list of available metrics,
see the documentation of the DistanceMetric class.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt
to construct a general internal model, but simply stores instances of the training data. Classification is computed from
a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the
most representatives within the nearest neighbors of the point.
scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learn-
ing based on the 𝑘 nearest neighbors of each query point, where 𝑘 is an integer value specified by the user.
RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius
𝑟 of each training point, where 𝑟 is a floating-point value specified by the user.
The 𝑘-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal
choice of the value 𝑘 is highly data-dependent: in general a larger 𝑘 suppresses the effects of noise, but makes the
classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in
RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius 𝑟, such that
points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter
spaces, this method becomes less effective due to the so-called “curse of dimensionality”.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed
from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors
such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The
default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance'
assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function
of the distance can be supplied to compute the weights.
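As an illustrative sketch of the weights keyword (the one-dimensional toy data below is an assumption, not taken from the original examples):
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> neigh = KNeighborsClassifier(n_neighbors=3, weights='distance')
>>> neigh = neigh.fit(X, y)
>>> print(neigh.predict([[1.1]]))   # the two closest neighbors (both labeled 0) dominate the weighted vote
[0]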
Examples:
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables.
The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
scikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning
based on the 𝑘 nearest neighbors of each query point, where 𝑘 is an integer value specified by the user.
RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius 𝑟 of the query
point, where 𝑟 is a floating-point value specified by the user.
The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes
uniformly to the classification of a query point. Under some circumstances, it can be advantageous to weight points
such that nearby points contribute more to the regression than faraway points. This can be accomplished through the
weights keyword. The default value, weights = 'uniform', assigns equal weights to all points. weights
= 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a
user-defined function of the distance can be supplied, which will be used to compute the weights.
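A minimal sketch of the regression analogue (again with assumed toy data):
>>> from sklearn.neighbors import KNeighborsRegressor
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> neigh = KNeighborsRegressor(n_neighbors=2)
>>> neigh = neigh.fit(X, y)
>>> print(neigh.predict([[1.5]]))   # mean of the labels of the two nearest neighbors: (0 + 1) / 2
[0.5]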
The use of multi-output nearest neighbors for regression is demonstrated in Face completion with a multi-output
estimators. In this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of
the lower half of those faces.
Examples:
Brute Force
Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor
search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for
𝑁 samples in 𝐷 dimensions, this approach scales as 𝑂[𝐷𝑁 2 ]. Efficient brute-force neighbors searches can be very
competitive for small data samples. However, as the number of samples 𝑁 grows, the brute-force approach quickly
becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using
the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.
pairwise.
K-D Tree
To address the computational inefficiencies of the brute-force approach, a variety of tree-based data structures have
been invented. In general, these structures attempt to reduce the required number of distance calculations by efficiently
encoding aggregate distance information for the sample. The basic idea is that if point 𝐴 is very distant from point
𝐵, and point 𝐵 is very close to point 𝐶, then we know that points 𝐴 and 𝐶 are very distant, without having to
explicitly calculate their distance. In this way, the computational cost of a nearest neighbors search can be reduced to
𝑂[𝐷𝑁 log(𝑁 )] or better. This is a significant improvement over brute-force for large 𝑁 .
An early approach to taking advantage of this aggregate information was the KD tree data structure (short for K-
dimensional tree), which generalizes two-dimensional Quad-trees and 3-dimensional Oct-trees to an arbitrary number
of dimensions. The KD tree is a binary tree structure which recursively partitions the parameter space along the data
axes, dividing it into nested orthotropic regions into which data points are filed. The construction of a KD tree is very
fast: because partitioning is performed only along the data axes, no 𝐷-dimensional distances need to be computed.
Once constructed, the nearest neighbor of a query point can be determined with only 𝑂[log(𝑁 )] distance computations.
Though the KD tree approach is very fast for low-dimensional (𝐷 < 20) neighbors searches, it becomes inefficient
as 𝐷 grows very large: this is one manifestation of the so-called “curse of dimensionality”. In scikit-learn, KD tree
neighbors searches are specified using the keyword algorithm = 'kd_tree', and are computed using the class
KDTree.
References:
• “Multidimensional binary search trees used for associative searching”, Bentley, J.L., Communications of the
ACM (1975)
Ball Tree
To address the inefficiencies of KD Trees in higher dimensions, the ball tree data structure was developed. Where KD
trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres. This makes tree
construction more costly than that of the KD tree, but results in a data structure which can be very efficient on highly
structured data, even in very high dimensions.
A ball tree recursively divides the data into nodes defined by a centroid 𝐶 and radius 𝑟, such that each point in the
node lies within the hyper-sphere defined by 𝑟 and 𝐶. The number of candidate points for a neighbor search is reduced
through use of the triangle inequality:
|𝑥 + 𝑦| ≤ |𝑥| + |𝑦|
With this setup, a single distance calculation between a test point and the centroid is sufficient to determine a lower
and upper bound on the distance to all points within the node. Because of the spherical geometry of the ball tree nodes,
it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure
of the training data. In scikit-learn, ball-tree-based neighbors searches are specified using the keyword algorithm
= 'ball_tree', and are computed using the class sklearn.neighbors.BallTree. Alternatively, the user
can work with the BallTree class directly.
References:
• “Five balltree construction algorithms”, Omohundro, S.M., International Computer Science Institute Techni-
cal Report (1989)
The optimal algorithm for a given dataset is a complicated choice, and depends on a number of factors:
• number of samples 𝑁 (i.e. n_samples) and dimensionality 𝐷 (i.e. n_features).
– Brute force query time grows as 𝑂[𝐷𝑁 ]
– Ball tree query time grows as approximately 𝑂[𝐷 log(𝑁 )]
– KD tree query time changes with 𝐷 in a way that is difficult to precisely characterise. For small 𝐷 (less
than 20 or so) the cost is approximately 𝑂[𝐷 log(𝑁 )], and the KD tree query can be very efficient. For
larger 𝐷, the cost increases to nearly 𝑂[𝐷𝑁 ], and the overhead due to the tree structure can lead to queries
which are slower than brute force.
For small data sets (𝑁 less than 30 or so), log(𝑁 ) is comparable to 𝑁 , and brute force algorithms can be more
efficient than a tree-based approach. Both KDTree and BallTree address this through providing a leaf size
parameter: this controls the number of samples at which a query switches to brute-force. This allows both
algorithms to approach the efficiency of a brute-force computation for small 𝑁 .
• data structure: intrinsic dimensionality of the data and/or sparsity of the data. Intrinsic dimensionality refers
to the dimension 𝑑 ≤ 𝐷 of a manifold on which the data lies, which can be linearly or non-linearly embedded
in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is to be
distinguished from the concept as used in “sparse” matrices. The data matrix may have no zero entries, but the
structure can still be “sparse” in this sense).
– Brute force query time is unchanged by data structure.
– Ball tree and KD tree query times can be greatly influenced by data structure. In general, sparser data with a
smaller intrinsic dimensionality leads to faster query times. Because the KD tree internal representation is
aligned with the parameter axes, it will not generally show as much improvement as ball tree for arbitrarily
structured data.
Datasets used in machine learning tend to be very structured, and are very well-suited for tree-based queries.
• number of neighbors 𝑘 requested for a query point.
– Brute force query time is largely unaffected by the value of 𝑘
– Ball tree and KD tree query time will become slower as 𝑘 increases. This is due to two effects: first, a
larger 𝑘 leads to the necessity to search a larger portion of the parameter space. Second, using 𝑘 > 1
requires internal queueing of results as the tree is traversed.
As 𝑘 becomes large compared to 𝑁 , the ability to prune branches in a tree-based query is reduced. In this
situation, brute force queries can be more efficient.
• number of query points. Both the ball tree and the KD Tree require a construction phase. The cost of this
construction becomes negligible when amortized over many queries. If only a small number of queries will
be performed, however, the construction can make up a significant fraction of the total cost. If very few query
points will be required, brute force is better than a tree-based method.
Currently, algorithm = 'auto' selects 'brute' if 𝑘 >= 𝑁/2, the input data is sparse, or
effective_metric_ isn’t in the VALID_METRICS list for either 'kd_tree' or 'ball_tree'. Otherwise,
it selects the first out of 'kd_tree' and 'ball_tree' that has effective_metric_ in its VALID_METRICS
list. This choice is based on the assumption that the number of query points is at least the same order as the number of
training points, and that leaf_size is close to its default value of 30.
Effect of leaf_size
As noted above, for small sample sizes a brute force search can be more efficient than a tree-based query. This fact is
accounted for in the ball tree and KD tree by internally switching to brute force searches within leaf nodes. The level
of this switch can be specified with the parameter leaf_size. This parameter choice has many effects:
construction time: A larger leaf_size leads to a faster tree construction time, because fewer nodes need to be created.
query time: Both a very large and a very small leaf_size can lead to suboptimal query cost. For leaf_size approaching
1, the overhead involved in traversing nodes can significantly slow query times. For leaf_size approaching
the size of the training set, queries become essentially brute force. A good compromise between these is
leaf_size = 30, the default value of the parameter.
memory: As leaf_size increases, the memory required to store a tree structure decreases. This is especially
important in the case of ball tree, which stores a 𝐷-dimensional centroid for each node. The required storage
space for BallTree is approximately 1 / leaf_size times the size of the training set.
leaf_size is not referenced for brute force queries.
The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In
effect, this makes it similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It also has
no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as
when classes have drastically different variances, as equal variance in all dimensions is assumed. See Linear Discrim-
inant Analysis (sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Quadratic
Discriminant Analysis (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) for
more complex methods that do not make this assumption. Usage of the default NearestCentroid is simple:
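The snippet that followed in the original text is not reproduced here; a minimal sketch (with an assumed toy dataset) could be:
>>> from sklearn.neighbors import NearestCentroid
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = NearestCentroid()
>>> clf = clf.fit(X, y)
>>> print(clf.predict([[-0.8, -1]]))   # closest to the centroid of class 1
[1]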
The NearestCentroid classifier has a shrink_threshold parameter, which implements the nearest shrunken
centroid classifier. In effect, the value of each feature for each centroid is divided by the within-class variance of that
feature. The feature values are then reduced by shrink_threshold. Most notably, if a particular feature value
crosses zero, it is set to zero. In effect, this removes the feature from affecting the classification. This is useful, for
example, for removing noisy features.
In the example below, using a small shrink threshold increases the accuracy of the model from 0.81 to 0.82.
Examples:
• Nearest Centroid Classification: an example of classification using nearest centroid with different shrink
thresholds.
Many scikit-learn estimators rely on nearest neighbors: Several classifiers and regressors such as
KNeighborsClassifier and KNeighborsRegressor, but also some clustering methods such as DBSCAN
and SpectralClustering, and some manifold embeddings such as TSNE and Isomap.
All these estimators can compute internally the nearest neighbors, but most of them also accept precomputed near-
est neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph. With
mode='connectivity', these functions return a binary adjacency sparse graph as required, for instance, in
SpectralClustering, whereas with mode='distance', they return a distance sparse graph as required,
for instance, in DBSCAN. To include these functions in a scikit-learn pipeline, one can also use the corresponding
classes KNeighborsTransformer and RadiusNeighborsTransformer. The benefits of this sparse graph
API are multiple.
First, the precomputed graph can be re-used multiple times, for instance while varying a parameter of the estimator.
This can be done manually by the user, or using the caching properties of the scikit-learn pipeline:
>>> from sklearn.manifold import Isomap
>>> from sklearn.neighbors import KNeighborsTransformer
>>> from sklearn.pipeline import make_pipeline
>>> estimator = make_pipeline(
... KNeighborsTransformer(n_neighbors=5, mode='distance'),
... Isomap(neighbors_algorithm='precomputed'),
... memory='/path/to/cache')
Second, precomputing the graph can give finer control on the nearest neighbors estimation, for instance enabling
multiprocessing through the parameter n_jobs, which might not be available in all estimators.
Finally, the precomputation can be performed by custom estimators to use different implementations, such as
approximate nearest neighbors methods, or implementations with special data types. The precomputed neighbors
sparse graph needs to be formatted as in the output of radius_neighbors_graph.
Note: When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of
n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them.
Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and
testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X),
which is against the scikit-learn API. In KNeighborsTransformer we use the definition which includes each training
point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators
which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise
compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors
estimator, since unnecessary neighbors will be filtered by following estimators.
Examples:
Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm
which aims to improve the accuracy of nearest neighbors classification compared to the standard Euclidean distance.
In the illustrative figure for this example, we consider some points from a randomly generated dataset. We focus on the stochastic
KNN classification of point no. 3. The thickness of a link between sample 3 and another point is proportional to their
distance, and can be seen as the relative weight (or probability) that a stochastic nearest neighbor prediction rule would
assign to this point. In the original space, sample 3 has many stochastic neighbors from various classes, so the right
class is not very likely. However, in the projected space learned by NCA, the only stochastic neighbors with non-
negligible weight are from the same class as sample 3, guaranteeing that the latter will be well classified. See the
mathematical formulation for more details.
Classification
Combined with a nearest neighbors classifier (KNeighborsClassifier), NCA is attractive for classification be-
cause it can naturally handle multi-class problems without any increase in the model size, and does not introduce
additional parameters that require fine-tuning by the user.
NCA classification has been shown to work well in practice for data sets of varying size and difficulty. In contrast to
related methods such as Linear Discriminant Analysis, NCA does not make any assumptions about the class distribu-
tions. The nearest neighbor classification can naturally produce highly irregular decision boundaries.
To use this model for classification, one needs to combine a NeighborhoodComponentsAnalysis instance that
learns the optimal transformation with a KNeighborsClassifier instance that performs the classification in the
projected space. Here is an example using the two classes:
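A minimal sketch of such a pipeline (the iris data and the split parameters below are assumptions used for illustration):
>>> from sklearn.neighbors import (NeighborhoodComponentsAnalysis,
...                                KNeighborsClassifier)
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
...                                                     test_size=0.7, random_state=42)
>>> nca = NeighborhoodComponentsAnalysis(random_state=42)
>>> knn = KNeighborsClassifier(n_neighbors=3)
>>> nca_pipe = Pipeline([('nca', nca), ('knn', knn)])
>>> nca_pipe = nca_pipe.fit(X_train, y_train)
>>> acc = nca_pipe.score(X_test, y_test)   # accuracy of 3-NN in the NCA-projected space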
The plot shows decision boundaries for Nearest Neighbor Classification and Neighborhood Components Analysis
classification on the iris dataset, when training and scoring on only two features, for visualisation purposes.
Dimensionality reduction
NCA can be used to perform supervised dimensionality reduction. The input data are projected onto a linear sub-
space consisting of the directions which minimize the NCA objective. The desired dimensionality can be set us-
ing the parameter n_components. For instance, the following figure shows a comparison of dimensionality re-
duction with Principal Component Analysis (sklearn.decomposition.PCA), Linear Discriminant Analysis
(sklearn.discriminant_analysis.LinearDiscriminantAnalysis) and Neighborhood Component
Analysis (NeighborhoodComponentsAnalysis) on the Digits dataset, a dataset with size n_samples = 1797
and n_features = 64. The data set is split into a training and a test set of equal size, then standardized. For evalua-
tion the 3-nearest neighbor classification accuracy is computed on the 2-dimensional projected points found by each
method. Each data sample belongs to one of 10 classes.
Examples:
Mathematical formulation
The goal of NCA is to learn an optimal linear transformation matrix of size (n_components, n_features),
which maximises the sum over all samples 𝑖 of the probability 𝑝𝑖 that 𝑖 is correctly classified, i.e.:
\arg\max_{L} \sum_{i=0}^{N-1} p_i
with 𝑁 = n_samples and 𝑝𝑖 the probability of sample 𝑖 being correctly classified according to a stochastic nearest
neighbors rule in the learned embedded space:
p_i = \sum_{j \in C_i} p_{ij}
where 𝐶𝑖 is the set of points in the same class as sample 𝑖, and 𝑝𝑖𝑗 is the softmax over Euclidean distances in the
embedded space:

p_{ij} = \frac{\exp(-\|L x_i - L x_j\|^2)}{\sum_{k \ne i} \exp(-\|L x_i - L x_k\|^2)}, \qquad p_{ii} = 0
Mahalanobis distance
NCA can be seen as learning a (squared) Mahalanobis distance metric: \|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j), where M = L^T L is a symmetric positive semi-definite matrix of size (n_features, n_features).
Implementation
This implementation follows what is explained in the original paper1 . For the optimisation method, it currently uses
scipy's L-BFGS-B with a full gradient computation at each iteration, to avoid tuning the learning rate and to provide
stable learning.
See the examples below and the docstring of NeighborhoodComponentsAnalysis.fit for further informa-
tion.
Complexity
Training
NCA stores a matrix of pairwise distances, taking n_samples ** 2 memory. Time complexity depends on the
number of iterations done by the optimisation algorithm. However, one can set the maximum number of itera-
tions with the argument max_iter. For each iteration, time complexity is O(n_components x n_samples
x min(n_samples, n_features)).
1 “Neighbourhood Components Analysis”, J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Advances in Neural Information Processing Systems
Transform
Here the transform operation returns L X^T, therefore its time complexity equals n_components *
n_features * n_samples_test. There is no added space complexity in the operation.
References:
Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic
classification problems.
The advantages of Gaussian processes are:
• The prediction interpolates the observations (at least for regular kernels).
• The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide
based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
• Versatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify
custom kernels.
The disadvantages of Gaussian processes include:
• They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
• They lose efficiency in high dimensional spaces – namely when the number of features exceeds a few dozens.
The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the
prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for normalize_y=False)
or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel object.
The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-
marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the
optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted
starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter
values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept
fixed, None can be passed as optimizer.
The noise level in the targets can be specified by passing it via the parameter alpha, either globally as a scalar or
per datapoint. Note that a moderate noise level can also be helpful for dealing with numeric issues during fitting as
it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An
alternative to specifying the noise level explicitly is to include a WhiteKernel component into the kernel, which can
estimate the global noise level from the data (see example below).
The implementation is based on Algorithm 2.1 of [RW2006]. In addition to the API of standard scikit-learn estimators,
GaussianProcessRegressor:
• allows prediction without prior fitting (based on the GP prior)
• provides an additional method sample_y(X), which evaluates samples drawn from the GPR (prior or poste-
rior) at given inputs
• exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of
selecting hyperparameters, e.g., via Markov chain Monte Carlo.
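As a minimal sketch of this API (the synthetic sinusoidal data below is an assumption, chosen only to exercise the calls described above):
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import RBF, WhiteKernel
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(0, 5, 20)[:, np.newaxis]
>>> y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])
>>> kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
>>> gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
>>> y_mean, y_std = gpr.predict(X, return_std=True)        # posterior mean and standard deviation
>>> samples = gpr.sample_y(X, n_samples=3)                 # draws from the (posterior) GP
>>> lml = gpr.log_marginal_likelihood(gpr.kernel_.theta)   # LML at the fitted hyperparameters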
GPR examples
This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data. An
illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of LML.
The first corresponds to a model with a high noise level and a large length scale, which explains all variations in the
data by noise.
The second one has a smaller noise level and shorter length scale, which explains most of the variation by the noise-
free functional relationship. The second model has a higher likelihood; however, depending on the initial value for the
hyperparameters, the gradient-based optimization might also converge to the high-noise solution. It is thus important
to repeat the optimization several times for different initializations.
Both kernel ridge regression (KRR) and GPR learn a target function by employing internally the “kernel trick”. KRR
learns a linear function in the space induced by the respective kernel which corresponds to a non-linear function in
the original space. The linear function in the kernel space is chosen based on the mean-squared error loss with ridge
regularization. GPR uses the kernel to define the covariance of a prior distribution over the target functions and uses
the observed training data to define a likelihood function. Based on Bayes theorem, a (Gaussian) posterior distribution
over target functions is defined, whose mean is used for prediction.
A major difference is that GPR can choose the kernel’s hyperparameters based on gradient-ascent on the marginal
likelihood function while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error
loss). A further difference is that GPR learns a generative, probabilistic model of the target function and can thus
provide meaningful confidence intervals and posterior samples along with the predictions while KRR only provides
predictions.
The following figure illustrates both methods on an artificial dataset, which consists of a sinusoidal target function
and strong noise. The figure compares the learned model of KRR and GPR based on a ExpSineSquared kernel,
which is suited for learning periodic functions. The kernel’s hyperparameters control the smoothness (length_scale)
and periodicity of the kernel (periodicity). Moreover, the noise level of the data is learned explicitly by GPR by an
additional WhiteKernel component in the kernel and by the regularization parameter alpha of KRR.
The figure shows that both methods learn reasonable models of the target function. GPR correctly identifies the peri-
odicity of the function to be roughly 2*π (6.28), while KRR chooses the doubled periodicity 4*π. Besides that, GPR
provides reasonable confidence bounds on the prediction which are not available for KRR. A major difference between
the two methods is the time required for fitting and predicting: while fitting KRR is fast in principle, the grid-search
for hyperparameter optimization scales exponentially with the number of hyperparameters (“curse of dimensional-
ity”). The gradient-based optimization of the parameters in GPR does not suffer from this exponential scaling and is
thus considerably faster on this example with a 3-dimensional hyperparameter space. The time for predicting is similar;
however, generating the variance of the predictive distribution of GPR takes considerably longer than just predicting
the mean.
This example is based on Section 5.4.3 of [RW2006]. It illustrates an example of complex kernel engineering and
hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly
average atmospheric CO2 concentrations (in parts per million by volume (ppmv)) collected at the Mauna Loa Obser-
vatory in Hawaii, between 1958 and 1997. The objective is to model the CO2 concentration as a function of the time
t.
The kernel is composed of several terms that are responsible for explaining different properties of the signal:
• a long term, smooth rising trend is to be explained by an RBF kernel. The RBF kernel with a large length-scale
enforces this component to be smooth; it is not enforced that the trend is rising which leaves this choice to the
GP. The specific length-scale and the amplitude are free hyperparameters.
• a seasonal component, which is to be explained by the periodic ExpSineSquared kernel with a fixed periodicity
of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order
to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this
RBF component controls the decay time and is a further free parameter.
• smaller, medium term irregularities are to be explained by a RationalQuadratic kernel component, whose length-
scale and alpha parameter, which determines the diffuseness of the length-scales, are to be determined. Ac-
cording to [RW2006], these irregularities can better be explained by a RationalQuadratic than an RBF kernel
component, probably because it can accommodate several length-scales.
• a “noise” term, consisting of an RBF kernel contribution, which shall explain the correlated noise components
such as local weather phenomena, and a WhiteKernel contribution for the white noise. The relative amplitudes
and the RBF’s length scale are further free parameters.
Maximizing the log-marginal-likelihood after subtracting the target’s mean yields the following kernel with an LML
of -83.214:
34.4**2 * RBF(length_scale=41.8)
+ 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44, periodicity=1)
+ 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
+ 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)
Thus, most of the target signal (34.4ppm) is explained by a long-term rising trend (length-scale 41.8 years). The
periodic component has an amplitude of 3.27ppm, a decay time of 180 years and a length-scale of 1.44. The long
decay time indicates that we have a locally very close to periodic seasonal component. The correlated noise has an
amplitude of 0.197ppm with a length scale of 0.138 years and a white-noise contribution of 0.197ppm. Thus, the
overall noise level is very small, indicating that the data can be very well explained by the model. The figure also
shows that the model makes very confident predictions until around 2015.
The GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more
specifically for probabilistic classification, where test predictions take the form of class probabilities. GaussianPro-
cessClassifier places a GP prior on a latent function 𝑓 , which is then squashed through a link function to obtain the
probabilistic classification. The latent function 𝑓 is a so-called nuisance function, whose values are not observed and
are not relevant by themselves. Its purpose is to allow a convenient formulation of the model, and 𝑓 is removed (inte-
grated out) during prediction. GaussianProcessClassifier implements the logistic link function, for which the integral
cannot be computed analytically but is easily approximated in the binary case.
In contrast to the regression setting, the posterior of the latent function 𝑓 is not Gaussian even for a GP prior since
a Gaussian likelihood is inappropriate for discrete class labels. Rather, a non-Gaussian likelihood corresponding to
the logistic link function (logit) is used. GaussianProcessClassifier approximates the non-Gaussian posterior with a
Gaussian based on the Laplace approximation. More details can be found in Chapter 3 of [RW2006].
The GP prior mean is assumed to be zero. The prior’s covariance is specified by passing a kernel object. The hyper-
parameters of the kernel are optimized during fitting of GaussianProcessClassifier by maximizing the log-marginal-
likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can
be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the
initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been
chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be
passed as optimizer.
GaussianProcessClassifier supports multi-class classification by performing either one-versus-rest or one-
versus-one based training and prediction. In one-versus-rest, one binary Gaussian process classifier is fitted for each
class, which is trained to separate this class from the rest. In “one_vs_one”, one binary Gaussian process classifier is
fitted for each pair of classes, which is trained to separate these two classes. The predictions of these binary predictors
are combined into multi-class predictions. See the section on multi-class classification for more details.
In the case of Gaussian process classification, “one_vs_one” might be computationally cheaper since it has to solve
many problems involving only a subset of the whole training set rather than fewer problems on the whole dataset. Since
Gaussian process classification scales cubically with the size of the dataset, this might be considerably faster. How-
ever, note that “one_vs_one” does not support predicting probability estimates but only plain predictions. Moreover,
note that GaussianProcessClassifier does not (yet) implement a true multi-class Laplace approximation in-
ternally, but as discussed above is based on solving several binary classification tasks internally, which are combined
using one-versus-rest or one-versus-one.
GPC examples
This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparam-
eters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the
hyperparameters corresponding to the maximum log-marginal-likelihood (LML).
While the hyperparameters chosen by optimizing LML have a considerably larger LML, they perform slightly worse
according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class
probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the
class boundaries (which is bad). This undesirable effect is caused by the Laplace approximation used internally by
GPC.
The second figure shows the log-marginal-likelihood for different choices of the kernel’s hyperparameters, highlighting
the two choices of the hyperparameters used in the first figure by black dots.
This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary
kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results be-
cause the class-boundaries are linear and coincide with the coordinate axes. In practice, however, stationary kernels
such as RBF often obtain better results.
This example illustrates the predicted probability of GPC for an isotropic and anisotropic RBF kernel on a two-
dimensional version for the iris-dataset. This illustrates the applicability of GPC to non-binary classification. The
anisotropic RBF kernel obtains slightly higher log-marginal-likelihood by assigning different length-scales to the two
feature dimensions.
Kernels (also called “covariance functions” in the context of GPs) are a crucial ingredient of GPs which determine
the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the
“similarity” of two datapoints combined with the assumption that similar datapoints should have similar target values.
Two categories of kernels can be distinguished: stationary kernels depend only on the distance of two datapoints
and not on their absolute values 𝑘(𝑥𝑖 , 𝑥𝑗 ) = 𝑘(𝑑(𝑥𝑖 , 𝑥𝑗 )) and are thus invariant to translations in the input space,
while non-stationary kernels depend also on the specific values of the datapoints. Stationary kernels can further be
subdivided into isotropic and anisotropic kernels, where isotropic kernels are also invariant to rotations in the input
space. For more details, we refer to Chapter 4 of [RW2006].
The main usage of a Kernel is to compute the GP’s covariance between datapoints. For this, the method __call__
of the kernel can be called. This method can either be used to compute the “auto-covariance” of all pairs of datapoints
in a 2d array X, or the “cross-covariance” of all combinations of datapoints of a 2d array X with datapoints in a 2d
array Y. The following identity holds true for all kernels k (except for the WhiteKernel): k(X) == k(X, Y=X).
If only the diagonal of the auto-covariance is being used, the method diag() of a kernel can be called, which is more
computationally efficient than the equivalent call to __call__: np.diag(k(X, X)) == k.diag(X)
Kernels are parameterized by a vector 𝜃 of hyperparameters. These hyperparameters can for instance control length-
scales or periodicity of a kernel (see below). All kernels support computing analytic gradients of the kernel’s auto-
covariance with respect to 𝜃 via setting eval_gradient=True in the __call__ method. This gradient is used by
the Gaussian process (both regressor and classifier) in computing the gradient of the log-marginal-likelihood, which
in turn is used to determine the value of 𝜃, which maximizes the log-marginal-likelihood, via gradient ascent. For
each hyperparameter, the initial value and the bounds need to be specified when creating an instance of the kernel.
The current value of 𝜃 can be retrieved and set via the property theta of the kernel object. Moreover, the bounds of the
hyperparameters can be accessed by the property bounds of the kernel. Note that both properties (theta and bounds)
return log-transformed values of the internally used values since those are typically more amenable to gradient-based
optimization. The specification of each hyperparameter is stored in the form of an instance of Hyperparameter
in the respective kernel. Note that a kernel using a hyperparameter with name “x” must have the attributes self.x and
self.x_bounds.
The abstract base class for all kernels is Kernel. Kernel implements a similar interface as Estimator, providing
the methods get_params(), set_params(), and clone(). This allows setting kernel values also via meta-
estimators such as Pipeline or GridSearch. Note that due to the nested structure of kernels (by applying kernel
operators, see below), the names of kernel parameters might become relatively complicated. In general, for a binary
kernel operator, parameters of the left operand are prefixed with k1__ and parameters of the right operand with k2__.
An additional convenience method is clone_with_theta(theta), which returns a cloned version of the kernel
but with the hyperparameters set to theta. An illustrative example:
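The example that followed in the original text is only partially preserved; a minimal sketch in its spirit (the particular kernel composition and bounds below are assumptions) is:
>>> from sklearn.gaussian_process.kernels import ConstantKernel, RBF
>>> kernel = (ConstantKernel(constant_value=1.0, constant_value_bounds=(0.0, 10.0))
...           * RBF(length_scale=0.5, length_scale_bounds=(0.0, 10.0))
...           + RBF(length_scale=2.0, length_scale_bounds=(0.0, 10.0)))
>>> params = kernel.get_params()                  # nested names such as 'k1__k2__length_scale'
>>> theta = kernel.theta                          # log-transformed hyperparameter values
>>> new_kernel = kernel.clone_with_theta(theta)   # clone with hyperparameters set to theta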
All Gaussian process kernels are interoperable with sklearn.metrics.pairwise and vice versa: instances
of subclasses of Kernel can be passed as metric to pairwise_kernels from sklearn.metrics.
pairwise. Moreover, kernel functions from pairwise can be used as GP kernels by using the wrapper class
PairwiseKernel. The only caveat is that the gradient of the hyperparameters is not analytic but numeric and
all those kernels support only isotropic distances. The parameter gamma is considered to be a hyperparameter and
may be optimized. The other kernel parameters are set directly at initialization and are kept fixed.
Basic kernels
The ConstantKernel kernel can be used as part of a Product kernel where it scales the magnitude of the other
factor (kernel) or as part of a Sum kernel, where it modifies the mean of the Gaussian process. It depends on a
parameter 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡_𝑣𝑎𝑙𝑢𝑒. It is defined as:
k(x_i, x_j) = constant\_value \quad \forall\, x_i, x_j
The main use-case of the WhiteKernel kernel is as part of a sum-kernel where it explains the noise component of
the signal. Tuning its parameter noise_level corresponds to estimating the noise level. It is defined as:

k(x_i, x_j) = noise\_level \text{ if } x_i == x_j \text{ else } 0
Kernel operators
Kernel operators take one or two base kernels and combine them into a new kernel. The Sum kernel takes two kernels
𝑘1 and 𝑘2 and combines them via 𝑘𝑠𝑢𝑚 (𝑋, 𝑌 ) = 𝑘1(𝑋, 𝑌 ) + 𝑘2(𝑋, 𝑌 ). The Product kernel takes two kernels 𝑘1
and 𝑘2 and combines them via 𝑘𝑝𝑟𝑜𝑑𝑢𝑐𝑡 (𝑋, 𝑌 ) = 𝑘1(𝑋, 𝑌 ) * 𝑘2(𝑋, 𝑌 ). The Exponentiation kernel takes one
base kernel and a scalar parameter 𝑒𝑥𝑝𝑜𝑛𝑒𝑛𝑡 and combines them via 𝑘𝑒𝑥𝑝 (𝑋, 𝑌 ) = 𝑘(𝑋, 𝑌 )exponent .
The RBF kernel is a stationary kernel. It is also known as the “squared exponential” kernel. It is parameterized by a
length-scale parameter 𝑙 > 0, which can either be a scalar (isotropic variant of the kernel) or a vector with the same
number of dimensions as the inputs 𝑥 (anisotropic variant of the kernel). The kernel is given by:
k(x_i, x_j) = \exp\left( -\frac{1}{2} d(x_i / l, x_j / l)^2 \right)
This kernel is infinitely differentiable, which implies that GPs with this kernel as covariance function have mean square
derivatives of all orders, and are thus very smooth. The prior and posterior of a GP resulting from an RBF kernel are
shown in the following figure:
Matérn kernel
The Matern kernel is a stationary kernel and a generalization of the RBF kernel. It has an additional parameter 𝜈
which controls the smoothness of the resulting function. It is parameterized by a length-scale parameter 𝑙 > 0, which
can either be a scalar (isotropic variant of the kernel) or a vector with the same number of dimensions as the inputs 𝑥
(anisotropic variant of the kernel). The kernel is given by:
k(x_i, x_j) = \sigma^2 \frac{1}{\Gamma(\nu) 2^{\nu-1}} \left( \gamma \sqrt{2\nu}\, d(x_i / l, x_j / l) \right)^{\nu} K_{\nu}\left( \gamma \sqrt{2\nu}\, d(x_i / l, x_j / l) \right),

where d(\cdot, \cdot) is the Euclidean distance, K_{\nu}(\cdot) is a modified Bessel function and \Gamma(\cdot) is the gamma function.
As 𝜈 → ∞, the Matérn kernel converges to the RBF kernel. When 𝜈 = 1/2, the Matérn kernel becomes identical to
the absolute exponential kernel, i.e.,
k(x_i, x_j) = \sigma^2 \exp\left( -\gamma\, d(x_i / l, x_j / l) \right) \qquad \nu = \tfrac{1}{2}
In particular, 𝜈 = 3/2:

k(x_i, x_j) = \sigma^2 \left( 1 + \gamma \sqrt{3}\, d(x_i / l, x_j / l) \right) \exp\left( -\gamma \sqrt{3}\, d(x_i / l, x_j / l) \right) \qquad \nu = \tfrac{3}{2}

and 𝜈 = 5/2:

k(x_i, x_j) = \sigma^2 \left( 1 + \gamma \sqrt{5}\, d(x_i / l, x_j / l) + \tfrac{5}{3} \gamma^2 d(x_i / l, x_j / l)^2 \right) \exp\left( -\gamma \sqrt{5}\, d(x_i / l, x_j / l) \right) \qquad \nu = \tfrac{5}{2}

are popular choices for learning functions that are not infinitely differentiable (as assumed by the RBF kernel) but at
least once (𝜈 = 3/2) or twice differentiable (𝜈 = 5/2).
The flexibility of controlling the smoothness of the learned function via 𝜈 allows adapting to the properties of the
true underlying functional relation. The prior and posterior of a GP resulting from a Matérn kernel are shown in the
following figure:
See [RW2006], pp84 for further details regarding the different variants of the Matérn kernel.
The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different
characteristic length-scales. It is parameterized by a length-scale parameter 𝑙 > 0 and a scale mixture parameter 𝛼 > 0.
Only the isotropic variant where 𝑙 is a scalar is supported at the moment. The kernel is given by:
k(x_i, x_j) = \left( 1 + \frac{d(x_i, x_j)^2}{2 \alpha l^2} \right)^{-\alpha}
The prior and posterior of a GP resulting from a RationalQuadratic kernel are shown in the following figure:
Exp-Sine-Squared kernel
The ExpSineSquared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter
𝑙 > 0 and a periodicity parameter 𝑝 > 0. Only the isotropic variant where 𝑙 is a scalar is supported at the moment.
The kernel is given by:
k(x_i, x_j) = \exp\left( -2 \left( \sin(\pi / p \cdot d(x_i, x_j)) / l \right)^2 \right)
The prior and posterior of a GP resulting from an ExpSineSquared kernel are shown in the following figure:
Dot-Product kernel
The DotProduct kernel is non-stationary and can be obtained from linear regression by putting 𝑁 (0, 1) priors on
the coefficients of 𝑥𝑑 (𝑑 = 1, ..., 𝐷) and a prior of 𝑁 (0, 𝜎02 ) on the bias. The DotProduct kernel is invariant to a
rotation of the coordinates about the origin, but not translations. It is parameterized by a parameter 𝜎02 . For 𝜎02 = 0,
the kernel is called the homogeneous linear kernel, otherwise it is inhomogeneous. The kernel is given by
𝑘(𝑥𝑖 , 𝑥𝑗 ) = 𝜎02 + 𝑥𝑖 · 𝑥𝑗
The DotProduct kernel is commonly combined with exponentiation. An example with exponent 2 is shown in the
following figure:
References:
• [RW2006] “Gaussian Processes for Machine Learning” Carl Edward Rasmussen and Christopher K.I. Williams - MIT Press, 2006.
The cross decomposition module contains two main families of algorithms: the partial least squares (PLS) and the
canonical correlation analysis (CCA).
These families of algorithms are useful to find linear relations between two multivariate datasets: the X and Y argu-
ments of the fit method are 2D arrays.
Cross decomposition algorithms find the fundamental relations between two matrices (X and Y). They are latent
variable approaches to modeling the covariance structures in these two spaces. They will try to find the multidi-
mensional direction in the X space that explains the maximum multidimensional variance direction in the Y space.
PLS-regression is particularly suited when the matrix of predictors has more variables than observations, and when
there is multicollinearity among X values. By contrast, standard regression will fail in these cases.
Classes included in this module are PLSRegression, PLSCanonical, CCA and PLSSVD.
Reference:
• JA Wegelin A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case
Examples:
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive”
assumption of conditional independence between every pair of features given the value of the class variable. Bayes’
theorem states the following relationship, given class variable 𝑦 and dependent feature vector 𝑥1 through 𝑥𝑛:

P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
Using the naive conditional independence assumption that
P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y),
for all 𝑖, this relationship is simplified to
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
Since 𝑃 (𝑥1 , . . . , 𝑥𝑛 ) is constant given the input, we can use the following classification rule:
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\Downarrow

\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),
and we can use Maximum A Posteriori (MAP) estimation to estimate 𝑃 (𝑦) and 𝑃 (𝑥𝑖 | 𝑦); the former is then the
relative frequency of class 𝑦 in the training set.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of 𝑃 (𝑥𝑖 |
𝑦).
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-
world situations, famously document classification and spam filtering. They require a small amount of training data to
estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data
it does, see the references below.)
Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling
of the class conditional feature distributions means that each distribution can be independently estimated as a one
dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
References:
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is
assumed to be Gaussian:
P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right)

The parameters \sigma_y and \mu_y are estimated using maximum likelihood.
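A minimal sketch of its usage (the iris data and split below are assumptions used only for illustration):
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.naive_bayes import GaussianNB
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
...                                                     random_state=0)
>>> gnb = GaussianNB()
>>> y_pred = gnb.fit(X_train, y_train).predict(X_test)
>>> n_errors = (y_test != y_pred).sum()   # number of mislabeled test points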
MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two
classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts,
although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors 𝜃𝑦 =
(𝜃𝑦1 , . . . , 𝜃𝑦𝑛 ) for each class 𝑦, where 𝑛 is the number of features (in text classification, the size of the vocabulary)
and 𝜃𝑦𝑖 is the probability 𝑃 (𝑥𝑖 | 𝑦) of feature 𝑖 appearing in a sample belonging to class 𝑦.
The parameters 𝜃𝑦 are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}

where N_{yi} = \sum_{x \in T} x_i is the number of times feature 𝑖 appears in a sample of class 𝑦 in the training set 𝑇,
and N_y = \sum_{i=1}^{n} N_{yi} is the total count of all features for class 𝑦.
The smoothing priors 𝛼 ≥ 0 account for features not present in the learning samples and prevent zero probabilities
in further computations. Setting 𝛼 = 1 is called Laplace smoothing, while 𝛼 < 1 is called Lidstone smoothing.
ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard
multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses
statistics from the complement of each class to compute the model’s weights. The inventors of CNB show empirically
that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB
(often by a considerable margin) on text classification tasks. The procedure for calculating the weights is as follows:
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \ne c} d_{ij}}{\alpha + \sum_{j:y_j \ne c} \sum_{k} d_{kj}}

w_{ci} = \log \hat{\theta}_{ci}

w_{ci} = \frac{w_{ci}}{\sum_{j} |w_{cj}|}
where the summations are over all documents 𝑗 not in class 𝑐, 𝑑𝑖𝑗 is either the count or tf-idf value of term 𝑖 in
document 𝑗, 𝛼𝑖 is a smoothing hyperparameter like that found in MNB, and \alpha = \sum_i \alpha_i. The second normalization
addresses the tendency for longer documents to dominate parameter estimates in MNB. The classification rule is:
\hat{c} = \arg\min_c \sum_i t_i w_{ci}
i.e., a document is assigned to the class that is the poorest complement match.
References:
• Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive bayes text
classifiers. In ICML (Vol. 3, pp. 616-623).
BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed ac-
cording to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a
binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued
feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the
binarize parameter).
The decision rule for Bernoulli naive Bayes is based on

P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y))(1 - x_i)
which differs from multinomial NB’s rule in that it explicitly penalizes the non-occurrence of a feature 𝑖 that is an
indicator for class 𝑦, where the multinomial variant would simply ignore a non-occurring feature.
In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and
use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents.
It is advisable to evaluate both models, if time permits.
References:
• C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge Uni-
versity Press, pp. 234-265.
• A. McCallum and K. Nigam (1998). A comparison of event models for Naive Bayes text classification. Proc.
AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
• V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with Naive Bayes – Which Naive
Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes
that each feature, which is described by the index 𝑖, has its own categorical distribution.
For each feature 𝑖 in the training set 𝑋, CategoricalNB estimates a categorical distribution for each feature i of
X conditioned on the class y. The index set of the samples is defined as 𝐽 = {1, . . . , 𝑚}, with 𝑚 as the number of
samples.
The probability of category 𝑡 in feature 𝑖 given class 𝑐 is estimated as:
$$P(x_i = t \mid y = c \,;\, \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha n_i},$$
where 𝑁𝑡𝑖𝑐 = |{𝑗 ∈ 𝐽 | 𝑥𝑖𝑗 = 𝑡, 𝑦𝑗 = 𝑐}| is the number of times category 𝑡 appears in the samples 𝑥𝑖 , which belong
to class 𝑐, 𝑁𝑐 = |{𝑗 ∈ 𝐽 | 𝑦𝑗 = 𝑐}| is the number of samples with class c, 𝛼 is a smoothing parameter and 𝑛𝑖 is the
number of available categories of feature 𝑖.
CategoricalNB assumes that the sample matrix 𝑋 is encoded (for instance with the help of OrdinalEncoder)
such that all categories for each feature 𝑖 are represented with numbers 0, ..., 𝑛𝑖 −1 where 𝑛𝑖 is the number of available
categories of feature 𝑖.
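A small sketch, assuming the categories have already been encoded as integers (the data is synthetic and only for illustration):
>>> import numpy as np
>>> from sklearn.naive_bayes import CategoricalNB
>>> rng = np.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 10))     # each feature takes integer categories 0..4
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> clf = CategoricalNB().fit(X, y)
>>> clf.predict(X[2:3])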
Naive Bayes models can be used to tackle large scale classification problems for which the full training set might not fit
in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit
method that can be used incrementally as done with other classifiers as demonstrated in Out-of-core classification of
text documents. All naive Bayes classifiers support sample weighting.
Contrary to the fit method, the first call to partial_fit needs to be passed the list of all the expected class labels.
For an overview of available strategies in scikit-learn, see also the out-of-core learning documentation.
Note: The partial_fit method call of naive Bayes models introduces some computational overhead. It is recommended to use data chunk sizes that are as large as possible, that is, as large as the available RAM allows.
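The following sketch illustrates incremental fitting with partial_fit on a stream of chunks; the random chunks and the three-class label set are assumptions made for the example:
>>> import numpy as np
>>> from sklearn.naive_bayes import MultinomialNB
>>> rng = np.random.RandomState(0)
>>> clf = MultinomialNB()
>>> all_classes = np.array([0, 1, 2])        # every expected label, needed on the first call
>>> for i in range(3):                       # stand-in for chunks streamed from disk
...     X_chunk = rng.randint(5, size=(20, 10))
...     y_chunk = rng.randint(3, size=20)
...     clf = clf.partial_fit(X_chunk, y_chunk,
...                           classes=all_classes if i == 0 else None)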
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The
goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the
data features.
For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else
decision rules. The deeper the tree, the more complex the decision rules and the fitter the model.
Some advantages of decision trees are:
• Requires little data preparation. Other techniques often require data normalisation, dummy variables to be created and blank values to be removed. Note however that this module does not support missing values.
• The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
• Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing
datasets that have only one type of variable. See algorithms for more information.
• Able to handle multi-output problems.
• Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily
explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may
be more difficult to interpret.
• Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the
model.
• Performs well even if its assumptions are somewhat violated by the true model from which the data were
generated.
The disadvantages of decision trees include:
• Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfit-
ting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required
at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree
being generated. This problem is mitigated by using decision trees within an ensemble.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality
and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic
algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms
cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in
an ensemble learner, where the features and samples are randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity
or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the
dataset prior to fitting with the decision tree.
Classification
After being fitted, the model can then be used to predict the class of samples:
Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class
in a leaf:
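A minimal sketch on a two-sample toy dataset illustrating fit, predict and predict_proba (the data is made up):
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])        # predicted class
array([1])
>>> clf.predict_proba([[2., 2.]])  # per-class probabilities
array([[0., 1.]])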
DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass
(where the labels are [0, . . . , K-1]) classification.
Using the Iris dataset, we can construct a tree as follows:
Once trained, you can plot the tree with the plot_tree function:
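A short sketch of fitting on the iris dataset and plotting the resulting tree (the plot_tree call returns matplotlib artists, which are not shown here):
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> X, y = load_iris(return_X_y=True)
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)
>>> tree.plot_tree(clf)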
We can also export the tree in Graphviz format using the export_graphviz exporter. If you use the conda package
manager, the graphviz binaries
and the python package can be installed with
conda install python-graphviz
Alternatively binaries for graphviz can be downloaded from the graphviz project homepage, and the Python wrapper
installed from pypi with pip install graphviz.
Below is an example graphviz export of the above tree trained on the entire iris dataset; the results are saved in an
output file iris.pdf:
The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class
(or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these
plots inline automatically:
>>> dot_data = tree.export_graphviz(clf, out_file=None,
... feature_names=iris.feature_names,
... class_names=iris.target_names,
... filled=True, rounded=True,
... special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph
Alternatively, the tree can also be exported in textual format with the function export_text. This method doesn’t
require the installation of external libraries and is more compact:
>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier, export_text
>>> iris = load_iris()
>>> clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
>>> print(export_text(clf, feature_names=iris['feature_names']))
Examples:
Regression
Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.
As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected
to have floating point values instead of integer values:
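A minimal sketch with a two-sample toy dataset (the values are made up):
>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])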
Examples:
Multi-output problems
A multi-output problem is a supervised learning problem with several outputs to predict, that is when Y is a 2d array
of size [n_samples, n_outputs].
When there is no correlation between the outputs, a very simple way to solve this kind of problem is to build n
independent models, i.e. one for each output, and then to use those models to independently predict each one of the
n outputs. However, because it is likely that the output values related to the same input are themselves correlated, an
often better way is to build a single model capable of predicting simultaneously all n outputs. First, it requires lower
training time since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may
often be increased.
With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the
following changes:
• Store n output values in leaves, instead of 1;
• Use splitting criteria that compute the average reduction across all n outputs.
This module offers support for multi-output problems by implementing this strategy in both
DecisionTreeClassifier and DecisionTreeRegressor. If a decision tree is fit on an output
array Y of size [n_samples, n_outputs] then the resulting estimator will:
• Output n_output values upon predict;
• Output a list of n_output arrays of class probabilities upon predict_proba.
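As a rough sketch, assuming a synthetic dataset with two correlated outputs (the sine and cosine of the same input), a single DecisionTreeRegressor can be fitted on the 2d output array:
>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> rng = np.random.RandomState(0)
>>> X = rng.uniform(-1, 1, size=(100, 1))
>>> Y = np.hstack([np.sin(X), np.cos(X)])   # shape (n_samples, n_outputs)
>>> reg = DecisionTreeRegressor(max_depth=4).fit(X, Y)
>>> reg.predict(X[:3]).shape                # n_output values per sample
(3, 2)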
The use of multi-output trees for regression is demonstrated in Multi-output Decision Tree Regression. In this example,
the input X is a single real value and the outputs Y are the sine and cosine of X.
The use of multi-output trees for classification is demonstrated in Face completion with a multi-output estimators. In
this example, the inputs X are the pixels of the upper half of faces and the outputs Y are the pixels of the lower half of
those faces.
Examples:
References:
• M. Dumont et al, Fast multi-class image annotation with random subwindows and multiple output randomized
trees, International Conference on Computer Vision Theory and Applications 2009
Complexity
In general, the run time cost to construct a balanced binary tree is $O(n_{samples} n_{features} \log(n_{samples}))$ and query time $O(\log(n_{samples}))$. Although the tree construction algorithm attempts to generate balanced trees, they will not always be balanced. Assuming that the subtrees remain approximately balanced, the cost at each node consists of searching through $O(n_{features})$ to find the feature that offers the largest reduction in entropy. This has a cost of $O(n_{features} n_{samples} \log(n_{samples}))$ at each node, leading to a total cost over the entire tree (by summing the cost at each node) of $O(n_{features} n_{samples}^2 \log(n_{samples}))$.
Tips on practical use
• Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.
• Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a
better chance of finding features that are discriminative.
• Understanding the decision tree structure will help in gaining more insights about how the decision tree makes
predictions, which is important for understanding the important features in the data.
• Visualise your tree as you are training by using the export function. Use max_depth=3 as an initial tree
depth to get a feel for how the tree is fitting to your data, and then increase the depth.
• Remember that the number of samples required to populate the tree doubles for each additional level the tree
grows to. Use max_depth to control the size of the tree to prevent overfitting.
• Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision
in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will
overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an
initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters.
While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each
leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification
with few classes, min_samples_leaf=1 is often the best choice.
• Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant.
Class balancing can be done by sampling an equal number of samples from each class, or preferably by nor-
malizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that
weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward
dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
• If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning
criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of
the overall sum of the sample weights.
• All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset
will be made.
• If the input matrix X is very sparse, it is recommended to convert it to a sparse csc_matrix before calling fit and to a sparse csr_matrix before calling predict, as sketched below. Training time can be orders of magnitude faster for sparse matrix input compared to a dense matrix when features have zero values in most of the samples.
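A minimal sketch of the sparse conversion mentioned above, using a synthetic mostly-zero matrix (the sizes and sparsity level are arbitrary):
>>> import numpy as np
>>> from scipy.sparse import csc_matrix, csr_matrix
>>> from sklearn.tree import DecisionTreeClassifier
>>> rng = np.random.RandomState(0)
>>> X_dense = rng.binomial(1, 0.01, size=(200, 500)).astype(np.float32)  # mostly zeros
>>> y = rng.randint(2, size=200)
>>> clf = DecisionTreeClassifier().fit(csc_matrix(X_dense), y)  # csc_matrix for fit
>>> pred = clf.predict(csr_matrix(X_dense))                     # csr_matrix for predict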
What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented
in scikit-learn?
ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding
for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical
targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the
tree to generalise to unseen data.
C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining
a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of
intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule's precondition if the accuracy of the rule improves without it.
C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds smaller rulesets
than C4.5 while being more accurate.
CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target
variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold
that yield the largest information gain at each node.
scikit-learn uses an optimised version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now.
Mathematical formulation
Given training vectors $x_i \in R^n$, $i = 1, \ldots, l$, and a label vector $y \in R^l$, a decision tree recursively partitions the space such that the samples with the same labels are grouped together.
Let the data at node $m$ be represented by $Q$. For each candidate split $\theta = (j, t_m)$ consisting of a feature $j$ and threshold $t_m$, partition the data into $Q_{left}(\theta)$ and $Q_{right}(\theta)$ subsets.
The impurity at $m$ is computed using an impurity function $H()$, the choice of which depends on the task being solved (classification or regression):
$$G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta))$$
Select the parameters that minimise the impurity:
$$\theta^* = \operatorname{argmin}_\theta\, G(Q, \theta)$$
Recurse for subsets $Q_{left}(\theta^*)$ and $Q_{right}(\theta^*)$ until the maximum allowable depth is reached, $N_m < \min_{samples}$ or $N_m = 1$.
Classification criteria
If a target is a classification outcome taking on values $0, 1, \ldots, K-1$, for node $m$, representing a region $R_m$ with $N_m$ observations, let
$$p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$$
be the proportion of class $k$ observations in node $m$. Common measures of impurity include Entropy
$$H(X_m) = - \sum_k p_{mk} \log(p_{mk})$$
and Misclassification
$$H(X_m) = 1 - \max_k(p_{mk})$$
Regression criteria
If the target is a continuous value, then for node 𝑚, representing a region 𝑅𝑚 with 𝑁𝑚 observations, common criteria
to minimise as for determining locations for future splits are Mean Squared Error, which minimizes the L2 error
using mean values at terminal nodes, and Mean Absolute Error, which minimizes the L1 error using median values at
terminal nodes.
Mean Squared Error:
$$\bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i$$
$$H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2$$
Minimal cost-complexity pruning is an algorithm used to prune a tree to avoid over-fitting, described in Chapter 3 of
[BRE]. This algorithm is parameterized by 𝛼 ≥ 0 known as the complexity parameter. The complexity parameter is
used to define the cost-complexity measure, 𝑅𝛼 (𝑇 ) of a given tree 𝑇 :
𝑅𝛼 (𝑇 ) = 𝑅(𝑇 ) + 𝛼|𝑇 |
where |𝑇 | is the number of terminal nodes in 𝑇 and 𝑅(𝑇 ) is traditionally defined as the total misclassification rate of
the terminal nodes. Alternatively, scikit-learn uses the total sample weighted impurity of the terminal nodes for 𝑅(𝑇 ).
As shown above, the impurity of a node depends on the criterion. Minimal cost-complexity pruning finds the subtree
of 𝑇 that minimizes 𝑅𝛼 (𝑇 ).
The cost complexity measure of a single node is 𝑅𝛼 (𝑡) = 𝑅(𝑡)+𝛼. The branch, 𝑇𝑡 , is defined to be a tree where node 𝑡
is its root. In general, the impurity of a node is greater than the sum of impurities of its terminal nodes, 𝑅(𝑇𝑡 ) < 𝑅(𝑡).
However, the cost complexity measure of a node, $t$, and its branch, $T_t$, can be equal depending on $\alpha$. We define the effective $\alpha$ of a node to be the value where they are equal, $R_\alpha(T_t) = R_\alpha(t)$, or $\alpha_{eff}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$. A non-terminal node with the smallest value of $\alpha_{eff}$ is the weakest link and will be pruned. This process stops when the pruned tree's minimal $\alpha_{eff}$ is greater than the ccp_alpha parameter.
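As an illustration, cost_complexity_pruning_path can be used to compute the candidate effective alphas of a tree, and ccp_alpha then selects the amount of pruning; the dataset and the particular alpha chosen below are arbitrary:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_breast_cancer(return_X_y=True)
>>> clf = DecisionTreeClassifier(random_state=0)
>>> path = clf.cost_complexity_pruning_path(X, y)    # candidate effective alphas
>>> ccp_alphas = path.ccp_alphas
>>> pruned = DecisionTreeClassifier(random_state=0,
...                                 ccp_alpha=ccp_alphas[-2]).fit(X, y)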
Examples:
References:
• https://en.wikipedia.org/wiki/Decision_tree_learning
• https://en.wikipedia.org/wiki/Predictive_analytics
• J.R. Quinlan. C4. 5: programs for machine learning. Morgan Kaufmann, 1993.
• T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning, Springer, 2009.
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning
algorithm in order to improve generalizability / robustness over a single estimator.
Two families of ensemble methods are usually distinguished:
• In averaging methods, the driving principle is to build several estimators independently and then to average
their predictions. On average, the combined estimator is usually better than any single base estimator
because its variance is reduced.
Examples: Bagging methods, Forests of randomized trees, . . .
• By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the
combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
Examples: AdaBoost, Gradient Tree Boosting, . . .
Bagging meta-estimator
In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box
estimator on random subsets of the original training set and then aggregate their individual predictions to form a final
prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by
introducing randomization into its construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary
to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with
strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually
work best with weak models (e.g., shallow decision trees).
Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of
the training set:
• When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known
as Pasting [B1999].
• When samples are drawn with replacement, then the method is known as Bagging [B1996].
• When random subsets of the dataset are drawn as random subsets of the features, then the method is known as
Random Subspaces [H1998].
• Finally, when base estimators are built on subsets of both samples and features, then the method is known as
Random Patches [LG2012].
In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator (resp.
BaggingRegressor), taking as input a user-specified base estimator along with parameters specifying the strategy
to draw random subsets. In particular, max_samples and max_features control the size of the subsets (in terms
of samples and features), while bootstrap and bootstrap_features control whether samples and features
are drawn with or without replacement. When using a subset of the available samples the generalization accuracy can
be estimated with the out-of-bag samples by setting oob_score=True. As an example, the snippet below illustrates
how to instantiate a bagging ensemble of KNeighborsClassifier base estimators, each built on random subsets
of 50% of the samples and 50% of the features.
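A minimal sketch of such an ensemble (the base estimator and the subset fractions follow the description above):
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)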
Examples:
References
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the Ran-
domForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998]
specifically designed for trees. This means a diverse set of classifiers is created by introducing randomness in the
classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
As with other classifiers, forest classifiers have to be fitted with two arrays: a sparse or dense array X of size [n_samples,
n_features] holding the training samples, and an array Y of size [n_samples] holding the target values (class
labels) for the training samples:
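A minimal sketch of such a fit on a two-sample toy dataset (the data is made up):
>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)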
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples,
n_outputs]).
Random Forests
In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the
ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input
features or a random subset of size max_features. (See the parameter tuning guidelines for more details).
The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual
decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision
trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel
out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase
in bias. In practice the variance reduction is often significant hence yielding an overall better model.
In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their
probabilistic prediction, instead of letting each classifier vote for a single class.
Parameters
The main parameters to adjust when using these methods are n_estimators and max_features. The former
is the number of trees in the forest. The larger the better, but also the longer it will take to compute. In addition,
note that results will stop getting significantly better beyond a critical number of trees. The latter is the size of the
random subsets of features to consider when splitting a node. The lower the greater the reduction of variance, but also
the greater the increase in bias. Empirical good default values are max_features=None (always considering all
features instead of a random subset) for regression problems, and max_features="sqrt" (using a random subset
of size sqrt(n_features)) for classification tasks (where n_features is the number of features in the data).
Good results are often achieved when setting max_depth=None in combination with min_samples_split=2
(i.e., when fully developing the trees). Bear in mind though that these values are usually not optimal, and might result
in models that consume a lot of RAM. The best parameter values should always be cross-validated. In addition, note
that in random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for
extra-trees is to use the whole dataset (bootstrap=False). When using bootstrap sampling the generalization
accuracy can be estimated on the left out or out-of-bag samples. This can be enabled by setting oob_score=True.
Note: The size of the model with the default parameters is $O(M * N * \log(N))$, where $M$ is the number of trees and $N$ is the number of samples. In order to reduce the size of the model, you can change these parameters: min_samples_split, max_leaf_nodes, max_depth and min_samples_leaf.
Parallelization
Finally, this module also features the parallel construction of the trees and the parallel computation of the predictions
through the n_jobs parameter. If n_jobs=k then computations are partitioned into k jobs, and run on k cores of
the machine. If n_jobs=-1 then all cores available on the machine are used. Note that because of inter-process
communication overhead, the speedup might not be linear (i.e., using k jobs will unfortunately not be k times as fast).
Significant speedup can still be achieved though when building a large number of trees, or when building a single tree
requires a fair amount of time (e.g., on large datasets).
Examples:
References
• P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance
of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute
to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they
contribute to can thus be used as an estimate of the relative importance of the features. In scikit-learn, the fraction of
samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalized
estimate of the predictive power of that feature.
By averaging the estimates of predictive ability over several randomized trees one can reduce the variance of such
an estimate and use it for feature selection. This is known as the mean decrease in impurity, or MDI. Refer to [L2014]
for more information on MDI and feature importance evaluation with Random Forests.
The following example shows a color-coded representation of the relative importances of each individual pixel for a face recognition task using an ExtraTreesClassifier model.
In practice those estimates are stored as an attribute named feature_importances_ on the fitted model. This
is an array with shape (n_features,) whose values are positive and sum to 1.0. The higher the value, the more
important is the contribution of the matching feature to the prediction function.
Examples:
References
Examples:
See also:
Manifold learning techniques can also be useful to derive non-linear representations of feature space; these approaches focus on dimensionality reduction as well.
AdaBoost
The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced in 1995 by Freund
and Schapire [FS1995].
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than
random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from
all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data
modifications at each so-called boosting iteration consist of applying weights 𝑤1 , 𝑤2 , . . . , 𝑤𝑁 to each of the training
samples. Initially, those weights are all set to 𝑤𝑖 = 1/𝑁 , so that the first step simply trains a weak learner on the
original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is
reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted
model induced at the previous step have their weights increased, whereas the weights are decreased for those that were
predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each
subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the
sequence [HTF].
AdaBoost can be used both for classification and regression problems:
• For multi-class classification, AdaBoostClassifier implements AdaBoost-SAMME and AdaBoost-
SAMME.R [ZZRH2009].
• For regression, AdaBoostRegressor implements AdaBoost.R2 [D1997].
Usage
The following example shows how to fit an AdaBoost classifier with 100 weak learners:
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import AdaBoostClassifier
>>> X, y = load_iris(return_X_y=True)
>>> clf = AdaBoostClassifier(n_estimators=100)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores.mean()
0.9...
The number of weak learners is controlled by the parameter n_estimators. The learning_rate parameter
controls the contribution of the weak learners in the final combination. By default, weak learners are decision stumps.
Different weak learners can be specified through the base_estimator parameter. The main parameters to tune to
obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or
minimum required number of samples to consider a split min_samples_split).
Examples:
• Discrete versus Real AdaBoost compares the classification error of a decision stump, decision tree, and a
boosted decision stump using AdaBoost-SAMME and AdaBoost-SAMME.R.
• Multi-class AdaBoosted Decision Trees shows the performance of AdaBoost-SAMME and AdaBoost-
SAMME.R on a multi-class problem.
• Two-class AdaBoost shows the decision boundary and decision function values for a non-linearly separable
two-class problem using AdaBoost-SAMME.
• Decision Tree Regression with AdaBoost demonstrates regression with the AdaBoost.R2 algorithm.
References
Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differ-
entiable loss functions. GBDT is an accurate and effective off-the-shelf procedure that can be used for both regression
and classification problems in a variety of areas including Web search ranking and ecology.
The module sklearn.ensemble provides methods for both classification and regression via gradient boosted
decision trees.
Note: Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely
HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM
(See [LightGBM]).
These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier
and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.
They also have built-in support for missing values, which avoids the need for an imputer.
These estimators are described in more detail below in Histogram-Based Gradient Boosting.
The following guide focuses on GradientBoostingClassifier and GradientBoostingRegressor,
which might be preferred for small sample sizes since binning may lead to split points that are too approximate in
this setting.
Classification
GradientBoostingClassifier supports both binary and multi-class classification. The following example
shows how to fit a gradient boosting classifier with 100 decision stumps as weak learners:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
The number of weak learners (i.e. regression trees) is controlled by the parameter n_estimators; The size of each
tree can be controlled either by setting the tree depth via max_depth or by setting the number of leaf nodes via
max_leaf_nodes. The learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting via shrinkage.
Note: Classification with more than 2 classes requires the induction of n_classes regression trees at each iter-
ation, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large
number of classes we strongly recommend to use HistGradientBoostingClassifier as an alternative to
GradientBoostingClassifier .
Regression
GradientBoostingRegressor supports a number of different loss functions for regression which can be speci-
fied via the argument loss; the default loss function for regression is least squares ('ls').
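A rough sketch of fitting a regressor with the least squares loss; the synthetic Friedman #1 dataset and the particular hyper-parameters are assumptions made for the example:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.metrics import mean_squared_error
>>> X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
>>> X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]
>>> est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
...     max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
>>> mean_squared_error(y_test, est.predict(X_test))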
The figure below shows the results of applying GradientBoostingRegressor with least squares loss and 500
base learners to the Boston house price dataset (sklearn.datasets.load_boston). The plot on the left shows
the train and test error at each iteration. The train error at each iteration is stored in the train_score_ attribute
of the gradient boosting model. The test error at each iterations can be obtained via the staged_predict method
which returns a generator that yields the predictions at each stage. Plots like these can be used to determine the optimal
number of trees (i.e. n_estimators) by early stopping. The plot on the right shows the feature importances which
can be obtained via the feature_importances_ property.
Examples:
The size of the regression tree base learners defines the level of variable interactions that can be captured by the
gradient boosting model. In general, a tree of depth h can capture interactions of order h . There are two ways in
which the size of the individual regression trees can be controlled.
If you specify max_depth=h then complete binary trees of depth h will be grown. Such trees will have (at most)
2**h leaf nodes and 2**h - 1 split nodes.
Alternatively, you can control the tree size by specifying the number of leaf nodes via the parameter
max_leaf_nodes. In this case, trees will be grown using best-first search where nodes with the highest improve-
ment in impurity will be expanded first. A tree with max_leaf_nodes=k has k - 1 split nodes and thus can
model interactions of up to order max_leaf_nodes - 1 .
We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to
train at the expense of a slightly higher training error. The parameter max_leaf_nodes corresponds to the variable
J in the chapter on gradient boosting in [F2001] and is related to the parameter interaction.depth in R’s gbm
package where max_leaf_nodes == interaction.depth + 1 .
Mathematical formulation
GBRT considers additive models of the following form:
$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$
where ℎ𝑚 (𝑥) are the basis functions which are usually called weak learners in the context of boosting. Gradient Tree
Boosting uses decision trees of fixed size as weak learners. Decision trees have a number of abilities that make them
valuable for boosting, namely the ability to handle data of mixed type and the ability to model complex functions.
Similar to other boosting algorithms, GBRT builds the additive model in a greedy fashion, where the newly added tree $h_m$ tries to minimize the loss $L$, given the previous ensemble $F_{m-1}$:
$$h_m = \arg\min_{h} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x_i)).$$
The initial model 𝐹0 is problem specific, for least-squares regression one usually chooses the mean of the target values.
Note: The initial model can also be specified via the init argument. The passed object has to implement fit and
predict.
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: the steepest descent direction is the negative gradient of the loss function evaluated at the current model $F_{m-1}$, which can be calculated for any differentiable loss function:
$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))$$
where the step length $\gamma_m$ is chosen using line search:
$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\!\left(y_i,\; F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)$$
The algorithms for regression and classification only differ in the concrete loss function used.
Loss Functions
The following loss functions are supported and can be specified using the parameter loss:
• Regression
– Least squares ('ls'): The natural choice for regression due to its superior computational properties. The
initial model is given by the mean of the target values.
– Least absolute deviation ('lad'): A robust loss function for regression. The initial model is given by the
median of the target values.
– Huber ('huber'): Another robust loss function that combines least squares and least absolute deviation;
use alpha to control the sensitivity with regards to outliers (see [F2001] for more details).
– Quantile ('quantile'): A loss function for quantile regression. Use 0 < alpha < 1 to specify the
quantile. This loss function can be used to create prediction intervals (see Prediction Intervals for Gradient
Boosting Regression).
• Classification
– Binomial deviance ('deviance'): The negative binomial log-likelihood loss function for binary classi-
fication (provides probability estimates). The initial model is given by the log odds-ratio.
– Multinomial deviance ('deviance'): The negative multinomial log-likelihood loss function for multi-
class classification with n_classes mutually exclusive classes. It provides probability estimates. The
initial model is given by the prior probability of each class. At each iteration n_classes regression trees
have to be constructed which makes GBRT rather inefficient for data sets with a large number of classes.
– Exponential loss ('exponential'): The same loss function as AdaBoostClassifier. Less robust
to mislabeled examples than 'deviance'; can only be used for binary classification.
Regularization
Shrinkage
[F2001] proposed a simple regularization strategy that scales the contribution of each weak learner by a factor $\nu$, so that the update becomes $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$.
The parameter $\nu$ is also called the learning rate because it scales the step length of the gradient descent procedure; it can be set via the learning_rate parameter.
The parameter learning_rate strongly interacts with the parameter n_estimators, the number of weak learn-
ers to fit. Smaller values of learning_rate require larger numbers of weak learners to maintain a constant training
error. Empirical evidence suggests that small values of learning_rate favor better test error. [HTF] recommend
to set the learning rate to a small constant (e.g. learning_rate <= 0.1) and choose n_estimators by early
stopping. For a more detailed discussion of the interaction between learning_rate and n_estimators see
[R2007].
Subsampling
[F1999] proposed stochastic gradient boosting, which combines gradient boosting with bootstrap averaging (bagging).
At each iteration the base classifier is trained on a fraction subsample of the available training data. The subsample
is drawn without replacement. A typical value of subsample is 0.5.
The figure below illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can
clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of
the model. Subsampling without shrinkage, on the other hand, does poorly.
Another strategy to reduce the variance is by subsampling the features analogous to the random splits in
RandomForestClassifier . The number of subsampled features can be controlled via the max_features
parameter.
Note: Using a small max_features value can significantly decrease the runtime.
Stochastic gradient boosting allows to compute out-of-bag estimates of the test deviance by computing the improve-
ment in deviance on the examples that are not included in the bootstrap sample (i.e. the out-of-bag examples). The
improvements are stored in the attribute oob_improvement_. oob_improvement_[i] holds the improvement
in terms of the loss on the OOB samples if you add the i-th stage to the current predictions. Out-of-bag estimates can
be used for model selection, for example to determine the optimal number of iterations. OOB estimates are usually
very pessimistic thus we recommend to use cross-validation instead and only use OOB if cross-validation is too time
consuming.
Examples:
Interpretation
Individual decision trees can be interpreted easily by simply visualizing the tree structure. Gradient boosting models,
however, comprise hundreds of regression trees thus they cannot be easily interpreted by visual inspection of the
individual trees. Fortunately, a number of techniques have been proposed to summarize and interpret gradient boosting
models.
Feature importance
Often features do not contribute equally to predict the target response; in many situations the majority of the features
are in fact irrelevant. When interpreting a model, the first question usually is: what are those important features and how do they contribute in predicting the target response?
Individual decision trees intrinsically perform feature selection by selecting appropriate split points. This information
can be used to measure the importance of each feature; the basic idea is: the more often a feature is used in the split
points of a tree the more important that feature is. This notion of importance can be extended to decision tree ensembles
by simply averaging the feature importance of each tree (see Feature importance evaluation for more details).
The feature importance scores of a fit gradient boosting model can be accessed via the feature_importances_
property:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> clf.feature_importances_
array([0.10..., 0.10..., 0.11..., ...
Examples:
Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely
HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM
(See [LightGBM]).
These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier
and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.
They also have built-in support for missing values, which avoids the need for an imputer.
These fast estimators first bin the input samples X into integer-valued bins (typically 256 bins) which tremen-
dously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based
data structures (histograms) instead of relying on sorted continuous values when building the trees. The API
of these estimators is slightly different, and some of the features from GradientBoostingClassifier and
GradientBoostingRegressor are not yet supported: in particular sample weights, and some loss functions.
These estimators are still experimental: their predictions and their API might change without any deprecation cycle.
To use them, you need to explicitly import enable_hist_gradient_boosting:
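For instance (the # noqa comment only silences unused-import warnings):
>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> # now the histogram-based estimators can be imported as usual
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.ensemble import HistGradientBoostingRegressor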
Examples:
Usage
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = HistGradientBoostingClassifier(max_iter=100).fit(X_train, y_train)
Available losses for regression are ‘least_squares’ and ‘least_absolute_deviation’, which is less sensitive to outliers.
For classification, ‘binary_crossentropy’ is used for binary classification and ‘categorical_crossentropy’ is used for
multiclass classification. By default the loss is ‘auto’ and will select the appropriate loss depending on y passed to fit.
The size of the trees can be controlled through the max_leaf_nodes, max_depth, and min_samples_leaf
parameters.
The number of bins used to bin the data is controlled with the max_bins parameter. Using less bins acts as a form
of regularization. It is generally recommended to use as many bins as possible, which is the default.
The l2_regularization parameter is a regularizer on the loss function and corresponds to 𝜆 in equation (2) of
[XGBoost].
The early-stopping behaviour is controlled via the scoring, validation_fraction, n_iter_no_change,
and tol parameters. It is possible to early-stop using an arbitrary scorer, or just the training or validation loss. By
default, early-stopping is performed using the default scorer of the estimator on a validation set but it is also possible
to perform early-stopping based on the loss value, which is significantly faster.
When the missingness pattern is predictive, the splits can be done on whether the feature value is missing or not:
If no missing values were encountered for a given feature during training, then samples with missing values are mapped
to whichever child has the most samples.
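A minimal sketch of the built-in missing-value support, with a tiny made-up dataset containing a NaN entry:
>>> import numpy as np
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
>>> y = [0, 0, 1, 1]
>>> gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
>>> gbdt.predict(X)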
Low-level parallelism
The bottleneck of a gradient boosting procedure is building the decision trees. Building a traditional decision tree (as in
the other GBDTs GradientBoostingClassifier and GradientBoostingRegressor) requires sorting
the samples at each node (for each feature). Sorting is needed so that the potential gain of a split point can be computed
efficiently. Splitting a single node has thus a complexity of 𝒪(𝑛features × 𝑛 log(𝑛)) where 𝑛 is the number of samples
at the node.
HistGradientBoostingClassifier and HistGradientBoostingRegressor, in contrast, do not re-
quire sorting the feature values and instead use a data-structure called a histogram, where the samples are implicitly
ordered. Building a histogram has a 𝒪(𝑛) complexity, so the node splitting procedure has a 𝒪(𝑛features × 𝑛) com-
plexity, much smaller than the previous one. In addition, instead of considering 𝑛 split points, we here consider only
max_bins split points, which is much smaller.
In order to build histograms, the input data X needs to be binned into integer-valued bins. This binning procedure does
require sorting the feature values, but it only happens once at the very beginning of the boosting process (not at each
node, like in GradientBoostingClassifier and GradientBoostingRegressor).
Finally, many parts of the implementation of HistGradientBoostingClassifier and
HistGradientBoostingRegressor are parallelized.
References
Voting Classifier
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use
a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be
useful for a set of equally well performing model in order to balance out their individual weaknesses.
In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode)
of the class labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
• classifier 1 -> class 1
• classifier 2 -> class 1
• classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.
In the case of a tie, the VotingClassifier will select the class based on the ascending sort order. E.g., in the following scenario
• classifier 1 -> class 2
• classifier 2 -> class 1
the class label 1 will be assigned to the sample.
Usage
The following example shows how to fit the majority rule classifier:
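A minimal setup for the snippet below, assuming three individual classifiers (logistic regression, random forest and Gaussian naive Bayes) trained on two iris features; the particular estimators and hyper-parameters are assumptions made for the example:
>>> from sklearn import datasets
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, 1:3], iris.target
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> eclf = VotingClassifier(
...     estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...     voting='hard')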
>>> for clf, label in zip([clf1, clf2, clf3, eclf],
...                       ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
...     scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)
...     print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
In contrast to majority voting (hard voting), soft voting returns the class label as argmax of the sum of predicted
probabilities.
Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the
predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The
final class label is then derived from the class label with the highest average probability.
To illustrate this with a simple example, let's assume we have 3 classifiers and a 3-class classification problem where we assign equal weights to all classifiers: w1=1, w2=1, w3=1.
The weighted average probabilities for a sample would then be calculated as follows:
Here, the predicted class label is 2, since it has the highest average probability.
The following example illustrates how the decision regions may change when a soft VotingClassifier is used based on a linear Support Vector Machine, a Decision Tree, and a K-nearest neighbor classifier:
The VotingClassifier can also be used together with GridSearchCV in order to tune the hyperparameters of
the individual estimators:
Usage
In order to predict the class labels based on the predicted class-probabilities, the scikit-learn estimators in the VotingClassifier must support the predict_proba method:
>>> eclf = VotingClassifier(
... estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
... voting='soft'
... )
Voting Regressor
The idea behind the VotingRegressor is to combine conceptually different machine learning regressors and return
the average predicted values. Such a regressor can be useful for a set of equally well performing models in order to
balance out their individual weaknesses.
Usage
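A minimal sketch, assuming a linear regression and a random forest as the combined regressors and the diabetes dataset as input:
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor, VotingRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> reg1 = LinearRegression()
>>> reg2 = RandomForestRegressor(n_estimators=10, random_state=1)
>>> ereg = VotingRegressor(estimators=[('lr', reg1), ('rf', reg2)])
>>> ereg = ereg.fit(X, y)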
Examples:
Stacked generalization
Stacked generalization is a method for combining estimators to reduce their biases [W1992] [HTF]. More precisely,
the predictions of each individual estimator are stacked together and used as input to a final estimator to compute the
prediction. This final estimator is trained through cross-validation.
The StackingClassifier and StackingRegressor provide such strategies which can be applied to classi-
fication and regression problems.
The estimators parameter corresponds to the list of the estimators which are stacked together in parallel on the
input data. It should be given as a list of names and estimators:
The final_estimator will use the predictions of the estimators as input. It needs to be a classifier or a
regressor when using StackingClassifier or StackingRegressor, respectively:
To train the estimators and final_estimator, the fit method needs to be called on the training data:
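A rough sketch of the whole setup; the choice of base estimators (ridge, lasso, linear SVR), the random forest used as final estimator and the Boston housing data are assumptions made so that the transform output shown below, with one column per base estimator, makes sense:
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> from sklearn.svm import LinearSVR
>>> from sklearn.ensemble import RandomForestRegressor, StackingRegressor
>>> X, y = load_boston(return_X_y=True)
>>> estimators = [('ridge', RidgeCV()),
...               ('lasso', LassoCV(random_state=42)),
...               ('svr', LinearSVR(random_state=42))]
>>> reg = StackingRegressor(
...     estimators=estimators,
...     final_estimator=RandomForestRegressor(n_estimators=10, random_state=42))
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> reg = reg.fit(X_train, y_train)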
During training, the estimators are fitted on the whole training data X_train. They will be used when calling predict or predict_proba. To generalize and avoid over-fitting, the final_estimator is trained on out-of-sample predictions obtained internally via cross-validation.
Note that it is also possible to get the output of the stacked outputs of the estimators using the transform
method:
>>> reg.transform(X_test[:5])
array([[28.78..., 28.43... , 22.62...],
[35.96..., 32.58..., 23.68...],
[14.97..., 14.05..., 16.45...],
[25.19..., 25.54..., 22.92...],
[18.93..., 19.26..., 17.03... ]])
In practice, a stacking predictor predicts as well as the best predictor of the base layer and sometimes even outperforms it by combining the different strengths of these predictors. However, training a stacking predictor is computationally expensive.
References
Warning: All classifiers in scikit-learn do multiclass classification out-of-the-box. You don’t need to use the
sklearn.multiclass module unless you want to experiment with different multiclass strategies.
The sklearn.multiclass module implements meta-estimators to solve multiclass and multilabel classification problems by decomposing such problems into binary classification problems. Multioutput regression is also supported.
• Multiclass classification: classification task with more than two classes. Each sample can only be labelled as
one class.
For example, classification using features extracted from a set of images of fruit, where each image may either
be of an orange, an apple, or a pear. Each image is one sample and is labelled as one of the 3 possible classes.
Multiclass classification makes the assumption that each sample is assigned to one and only one label - one
sample cannot, for example, be both a pear and an apple.
Valid multiclass representations for type_of_target (y) are:
– 1d or column vector containing more than two discrete values. An example of a vector y for 3 samples:
– sparse binary matrix of shape (n_samples, n_classes) with a single element per row, where each
column represents one class. An example of a sparse binary matrix y for 3 samples, where the columns,
in order, are orange, apple and pear:
• Multilabel classification: classification task labelling each sample with x labels from n_classes possible
classes, where x can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample
that are not mutually exclusive. Formally, a binary output is assigned to each class, for every sample. Positive
classes are indicated with 1 and negative classes with 0 or -1. It is thus comparable to running n_classes bi-
nary classification tasks, for example with sklearn.multioutput.MultiOutputClassifier. This
approach treats each label independently whereas multilabel classifiers may treat the multiple classes simultaneously, accounting for correlated behaviour among them.
For example, prediction of the topics relevant to a text document or video. The document or video may be about
one of ‘religion’, ‘politics’, ‘finance’ or ‘education’, several of the topic classes or all of the topic classes.
Valid representation of multilabel y is either a dense or sparse binary matrix of shape (n_samples, n_classes). Each column represents a class. The 1's in each row denote the positive classes a sample has been labelled with.
• Multioutput regression: predicts multiple numerical properties for each sample. Each property is a numerical
variable and the number of properties to be predicted for each sample is greater than or equal to 2. Some
estimators that support multioutput regression are faster than just running n_output estimators.
For example, prediction of both wind speed and wind direction, in degrees, using data obtained at a certain
location. Each sample would be data obtained at one location and both wind speed and direction would be output for each sample.
Valid representation of multioutput y is a dense matrix of shape (n_samples, n_outputs) of floats, a column-wise concatenation of continuous variables. An example of y for 3 samples:
• Multioutput-multiclass classification (also known as multitask classification): classification task which la-
bels each sample with a set of non-binary properties. Both the number of properties and the number of classes
per property is greater than 2. A single estimator thus handles several joint classification tasks. This is both a
generalization of the multilabel classification task, which only considers binary attributes, as well as a general-
ization of the multiclass classification task, where only one property is considered.
For example, classification of the properties “type of fruit” and “colour” for a set of images of fruit. The property
“type of fruit” has the possible classes: “apple”, “pear” and “orange”. The property “colour” has the possible
classes: “green”, “red”, “yellow” and “orange”. Each sample is an image of a fruit, a label is output for both
properties and each label is one of the possible classes of the corresponding property.
Valid representation of multioutput y is a dense matrix of shape (n_samples, n_outputs) of class labels, a column-wise concatenation of 1d multiclass variables. An example of y for 3 samples:
Note that any classifiers handling multioutput-multiclass (also known as multitask classification) tasks, support
the multilabel classification task as a special case. Multitask classification is similar to the multioutput classifi-
cation task with different model formulations. For more information, see the relevant estimator documentation.
All scikit-learn classifiers are capable of multiclass classification, but the meta-estimators offered by sklearn.
multiclass permit changing the way they handle more than two classes because this may have an effect on classifier
performance (either in terms of generalization error or required computational resources).
Summary
Below is a summary of the classifiers supported by scikit-learn grouped by strategy; you don't need the meta-estimators in this module if you're using one of these, unless you want custom multiclass behavior:
• Inherently multiclass:
– sklearn.naive_bayes.BernoulliNB
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.naive_bayes.GaussianNB
– sklearn.neighbors.KNeighborsClassifier
– sklearn.semi_supervised.LabelPropagation
– sklearn.semi_supervised.LabelSpreading
– sklearn.discriminant_analysis.LinearDiscriminantAnalysis
– sklearn.svm.LinearSVC (setting multi_class=”crammer_singer”)
– sklearn.linear_model.LogisticRegression (setting multi_class=”multinomial”)
– sklearn.linear_model.LogisticRegressionCV (setting multi_class=”multinomial”)
– sklearn.neural_network.MLPClassifier
– sklearn.neighbors.NearestCentroid
– sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
– sklearn.linear_model.RidgeClassifier
– sklearn.linear_model.RidgeClassifierCV
• Multiclass as One-Vs-One:
– sklearn.svm.NuSVC
– sklearn.svm.SVC.
– sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class =
“one_vs_one”)
• Multiclass as One-Vs-The-Rest:
– sklearn.ensemble.GradientBoostingClassifier
– sklearn.gaussian_process.GaussianProcessClassifier (setting multi_class =
“one_vs_rest”)
– sklearn.svm.LinearSVC (setting multi_class=”ovr”)
– sklearn.linear_model.LogisticRegression (setting multi_class=”ovr”)
– sklearn.linear_model.LogisticRegressionCV (setting multi_class=”ovr”)
– sklearn.linear_model.SGDClassifier
– sklearn.linear_model.Perceptron
– sklearn.linear_model.PassiveAggressiveClassifier
• Support multilabel:
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.neighbors.KNeighborsClassifier
– sklearn.neural_network.MLPClassifier
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
– sklearn.linear_model.RidgeClassifierCV
• Support multiclass-multioutput:
– sklearn.tree.DecisionTreeClassifier
– sklearn.tree.ExtraTreeClassifier
– sklearn.ensemble.ExtraTreesClassifier
– sklearn.neighbors.KNeighborsClassifier
– sklearn.neighbors.RadiusNeighborsClassifier
– sklearn.ensemble.RandomForestClassifier
In multilabel learning, the joint set of binary classification tasks is expressed with a label binary indicator array: each
sample is one row of a 2d array of shape (n_samples, n_classes) with binary values, where the ones, i.e. the non-zero
elements, correspond to the subset of labels for that sample. An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.
Producing multilabel data as a list of sets of labels may be more intuitive. The MultiLabelBinarizer transformer
can be used to convert between a collection of collections of labels and the indicator format.
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
>>> MultiLabelBinarizer().fit_transform(y)
array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])
One-Vs-The-Rest
This strategy, also known as one-vs-all, is implemented in OneVsRestClassifier. The strategy consists in fitting
one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computa-
tional efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since
each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting
its corresponding classifier. This is the most commonly used strategy and is a fair default choice.
Multiclass learning
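A minimal sketch of one-vs-the-rest multiclass learning on the iris data, assuming LinearSVC as the base estimator (any classifier would do):
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)  # one binary classifier per class
>>> pred = ovr.predict(X)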
Multilabel learning
OneVsRestClassifier also supports multilabel classification. To use this feature, feed the classifier an indicator
matrix, in which cell [i, j] indicates the presence of label j in sample i.
Examples:
• Multilabel classification
One-Vs-One
OneVsOneClassifier constructs one classifier per pair of classes. At prediction time, the class which received
the most votes is selected. In the event of a tie (among two classes with an equal number of votes), it selects the class
with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels
computed by the underlying binary classifiers.
Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than
one-vs-the-rest, due to its O(n_classes^2) complexity. However, this method may be advantageous for algorithms
such as kernel algorithms which don’t scale well with n_samples. This is because each individual learning problem
only involves a small subset of the data whereas, with one-vs-the-rest, the complete dataset is used n_classes times.
The decision function is the result of a monotonic transformation of the one-versus-one classification.
Multiclass learning
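A minimal sketch of one-vs-one multiclass learning on the same iris data, again assuming LinearSVC as the base estimator:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OneVsOneClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)  # one classifier per pair of classes
>>> pred = ovo.predict(X)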
References:
• “Pattern Recognition and Machine Learning. Springer”, Christopher M. Bishop, page 183, (First Edition)
Error-Correcting Output-Codes
Output-code based strategies are fairly different from one-vs-the-rest and one-vs-one. With these strategies, each class
is represented in a Euclidean space, where each dimension can only be 0 or 1. Another way to put it is that each class
is represented by a binary code (an array of 0 and 1). The matrix which keeps track of the location/code of each class
is called the code book. The code size is the dimensionality of the aforementioned space. Intuitively, each class should
be represented by a code as unique as possible and a good code book should be designed to optimize classification
accuracy. In this implementation, we simply use a randomly-generated code book, as advocated in [3], although more
elaborate methods may be added in the future.
At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to
project new points in the class space and the class closest to the points is chosen.
In OutputCodeClassifier, the code_size attribute allows the user to control the number of classifiers which
will be used. It is a percentage of the total number of classes.
A number between 0 and 1 will require fewer classifiers than one-vs-the-rest. In theory, log2(n_classes) /
n_classes is sufficient to represent each class unambiguously. However, in practice, it may not lead to good
accuracy since log2(n_classes) is much smaller than n_classes.
A number greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in theory
correct for the mistakes made by other classifiers, hence the name “error-correcting”. In practice, however, this may
not happen as classifier mistakes will typically be correlated. The error-correcting output codes have a similar effect
to bagging.
Multiclass learning
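A minimal sketch using OutputCodeClassifier on the iris data, with LinearSVC as an illustrative base estimator; code_size=2 uses twice as many classifiers as there are classes:
>>> from sklearn import datasets
>>> from sklearn.multiclass import OutputCodeClassifier
>>> from sklearn.svm import LinearSVC
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = OutputCodeClassifier(LinearSVC(random_state=0), code_size=2, random_state=0)
>>> pred = clf.fit(X, y).predict(X)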
References:
• “Solving multiclass learning problems via error-correcting output codes”, Dietterich T., Bakiri G., Journal of
Artificial Intelligence Research 2, 1995.
• “The Elements of Statistical Learning”, Hastie T., Tibshirani R., Friedman J., page 606 (second-edition) 2008.
Multioutput regression
Multioutput regression support can be added to any regressor with MultiOutputRegressor. This strategy con-
sists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to
gain knowledge about the target by inspecting its corresponding regressor. As MultiOutputRegressor fits one
regressor per target it can not take advantage of correlations between targets.
Below is an example of multioutput regression:
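A minimal sketch, assuming make_regression to generate a toy three-target problem and GradientBoostingRegressor as the base estimator:
>>> from sklearn.datasets import make_regression
>>> from sklearn.multioutput import MultiOutputRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> X, y = make_regression(n_samples=10, n_targets=3, random_state=1)
>>> regr = MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, y)  # one regressor per target
>>> regr.predict(X).shape
(10, 3)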
Multioutput classification
Multioutput classification support can be added to any classifier with MultiOutputClassifier. This strategy
consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class
is to extend estimators to be able to estimate a series of target functions (f1,f2,f3. . . ,fn) that are trained on a single X
predictor matrix to predict a series of responses (y1,y2,y3. . . ,yn).
Below is an example of multioutput classification:
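A minimal sketch: a toy problem with two multiclass targets, built here by permuting a single generated target (the construction is illustrative), fitted with a random forest as the base estimator:
>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y1 = make_classification(n_samples=10, n_features=100,
...                             n_informative=30, n_classes=3, random_state=1)
>>> y2 = np.random.RandomState(1).permutation(y1)   # a second, permuted target
>>> Y = np.vstack((y1, y2)).T                       # shape (n_samples, n_outputs)
>>> clf = MultiOutputClassifier(RandomForestClassifier(random_state=1)).fit(X, Y)
>>> clf.predict(X).shape
(10, 2)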
Classifier Chain
Classifier chains (see ClassifierChain) are a way of combining a number of binary classifiers into a single
multi-label model that is capable of exploiting correlations among targets.
For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1.
These integers define the order of models in the chain. Each classifier is then fit on the available training data plus the
true labels of the classes whose models were assigned a lower number.
When predicting, the true labels will not be available. Instead the predictions of each model are passed on to the
subsequent models in the chain to be used as features.
Clearly the order of the chain is important. The first model in the chain has no information about the other labels while
the last model in the chain has features indicating the presence of all of the other labels. In general one does not know
the optimal ordering of the models in the chain so typically many randomly ordered chains are fit and their predictions
are averaged together.
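A minimal sketch of a single randomly ordered chain on a toy multilabel problem (in practice an ensemble of such chains would be fitted and averaged, as described above):
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.multioutput import ClassifierChain
>>> X, Y = make_multilabel_classification(n_samples=50, n_classes=3, random_state=0)
>>> chain = ClassifierChain(LogisticRegression(), order='random', random_state=0)
>>> Y_pred = chain.fit(X, Y).predict(X)
>>> Y_pred.shape
(50, 3)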
References:
Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank, “Classifier Chains for Multi-label Classifica-
tion”, 2009.
Regressor Chain
Regressor chains (see RegressorChain) are analogous to classifier chains as a way of combining a number of re-
gressions into a single multi-target model that is capable of exploiting correlations among targets.
The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality re-
duction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-
dimensional datasets.
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance
doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value
in all samples.
As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are
either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and
the variance of such variables is given by
Var[𝑋] = 𝑝(1 − 𝑝)
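so we can select using a threshold of .8 * (1 - .8). A minimal sketch on a tiny boolean dataset (chosen for illustration):
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])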
As expected, VarianceThreshold has removed the first column, which has a probability 𝑝 = 5/6 > .8 of
containing a zero.
Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen
as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the
transform method:
• SelectKBest removes all but the 𝑘 highest scoring features
• SelectPercentile removes all but a user-specified highest scoring percentage of features
• using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate
SelectFdr, or family wise error SelectFwe.
• GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy.
This allows selecting the best univariate selection strategy with a hyper-parameter search estimator.
For instance, we can perform a χ² test on the samples to retrieve only the two best features as follows:
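A minimal sketch on the iris data:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_iris(return_X_y=True)
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)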
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for
SelectKBest and SelectPercentile): for regression, f_regression and mutual_info_regression; for classification,
chi2, f_classif and mutual_info_classif.
If you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression and
mutual_info_classif will deal with the data without making it dense.
Warning: Beware not to use a regression scoring function with a classification problem; you will get useless
results.
Examples:
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of
recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of
features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained
either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features
are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the
desired number of features to select is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features.
Examples:
• Recursive feature elimination: A recursive feature elimination example showing the relevance of pixels in a
digit classification task.
• Recursive feature elimination with cross-validation: A recursive feature elimination example with automatic
tuning of the number of features selected with cross-validation.
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or
feature_importances_ attribute after fitting. Features are considered unimportant and removed if the
corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart
from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument.
Available heuristics are "mean", "median" and float multiples of these like "0.1*mean".
For examples on how it is to be used refer to the sections below.
Examples
• Feature selection using SelectFromModel and LassoCV: Selecting the two most important features from the
Boston dataset without knowing the threshold beforehand.
Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero.
When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along
with feature_selection.SelectFromModel to select the non-zero coefficients. In particular, sparse
estimators useful for this purpose are linear_model.Lasso for regression, and linear_model.
LogisticRegression and svm.LinearSVC for classification:
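A minimal sketch on the iris data (the number of retained features depends on the chosen C):
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> X, y = load_iris(return_X_y=True)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)   # keep features with non-zero coefficients
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)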
With SVMs and logistic-regression, the parameter C controls the sparsity: the smaller C the fewer features selected.
With Lasso, the higher the alpha parameter, the fewer features selected.
Examples:
• Classification of text documents using sparse features: Comparison of different algorithms for document
classification including L1-based feature selection.
For a good choice of alpha, the Lasso can fully recover the exact set of non-zero variables using only few obser-
vations, provided certain specific conditions are met. In particular, the number of samples should be “sufficiently
large”, or L1 models will perform at random, where “sufficiently large” depends on the number of non-zero co-
efficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of non-zero
coefficients, and the structure of the design matrix X. In addition, the design matrix must display certain specific
properties, such as not being too correlated.
There is no general rule to select an alpha parameter for recovery of non-zero coefficients. It can be set by cross-
validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a small
number of non-relevant variables is not detrimental to prediction score. BIC (LassoLarsIC) tends, on the contrary,
to set high values of alpha.
Reference: Richard G. Baraniuk, "Compressive Sensing", IEEE Signal Processing Magazine [120], July 2007.
http://users.isr.ist.utl.pt/~aguiar/CS_notes.pdf
Tree-based estimators (see the sklearn.tree module and forest of trees in the sklearn.ensemble module)
can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled
with the sklearn.feature_selection.SelectFromModel meta-transformer):
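A minimal sketch with an extra-trees classifier on the iris data; with the default "mean" threshold, the two most important features are typically kept:
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> X, y = load_iris(return_X_y=True)
>>> clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
>>> model = SelectFromModel(clf, prefit=True)   # threshold defaults to the mean importance
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)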
Examples:
• Feature importances with forests of trees: example on synthetic data showing the recovery of the actually
meaningful features.
• Pixel importances with a parallel forest of trees: example on face recognition data.
Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to
do this in scikit-learn is to use a sklearn.pipeline.Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)
4.1.14 Semi-Supervised
Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-
supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to
better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms
can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.
Unlabeled entries in y
It is important to assign an identifier to unlabeled points along with the labeled data when training the model with
the fit method. The identifier that this implementation uses is the integer value −1.
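A minimal sketch on toy one-dimensional data (chosen for illustration), where unlabeled entries of y are marked with -1 before fitting:
>>> import numpy as np
>>> from sklearn.semi_supervised import LabelPropagation
>>> X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
>>> y = np.array([0, -1, -1, 1, -1, -1])      # -1 marks the unlabeled samples
>>> model = LabelPropagation(kernel='rbf').fit(X, y)
>>> labels = model.transduction_              # labels inferred for all training samples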
Label Propagation
Fig. 1: An illustration of label-propagation: the structure of unlabeled observations is consistent with the class
structure, and thus the class label can be propagated to the unlabeled observations of the training set.
LabelPropagation and LabelSpreading differ in the modifications made to the similarity matrix of the graph and in the
clamping effect on the label distributions. Clamping allows the algorithm to change the weight of the ground-truth
labeled data to some degree. The LabelPropagation algorithm performs hard clamping of input labels, which
means α = 0. This clamping factor can be relaxed, to say α = 0.2, which means that we will always retain 80 percent
of our original label distribution, but the algorithm gets to change its confidence of the distribution within 20 percent.
LabelPropagation uses the raw similarity matrix constructed from the data with no modifications. In contrast,
LabelSpreading minimizes a loss function that has regularization properties, as such it is often more robust to
noise. The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing
the normalized graph Laplacian matrix. This procedure is also used in Spectral clustering.
Label propagation models have two built-in kernel methods. The choice of kernel affects both the scalability and the
performance of the algorithms. The following are available:
• rbf (exp(−𝛾|𝑥 − 𝑦|2 ), 𝛾 > 0). 𝛾 is specified by keyword gamma.
• knn (1[𝑥′ ∈ 𝑘𝑁 𝑁 (𝑥)]). 𝑘 is specified by keyword n_neighbors.
The RBF kernel will produce a fully connected graph which is represented in memory by a dense matrix. This matrix
may be very large and combined with the cost of performing a full matrix multiplication calculation for each iteration
of the algorithm can lead to prohibitively long running times. On the other hand, the KNN kernel will produce a much
more memory-friendly sparse matrix which can drastically reduce running times.
Examples
References
[1] Yoshua Bengio, Olivier Delalleau, Nicolas Le Roux. In Semi-Supervised Learning (2006), pp. 193-216
[2] Olivier Delalleau, Yoshua Bengio, Nicolas Le Roux. Efficient Non-Parametric Function Induction in Semi-
Supervised Learning. AISTAT 2005 https://research.microsoft.com/en-us/people/nicolasl/efficient_ssl.pdf
The class IsotonicRegression fits a non-decreasing function to data. It solves the following problem:
minimize Σ_i w_i (y_i − ŷ_i)²
subject to ŷ_i ≤ ŷ_j whenever X_i ≤ X_j, where the weights w_i are strictly positive.
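A minimal usage sketch (toy data chosen for illustration):
>>> import numpy as np
>>> from sklearn.isotonic import IsotonicRegression
>>> x = np.arange(10)
>>> y = np.array([1.0, 2.1, 1.8, 3.5, 3.2, 4.8, 5.1, 6.3, 6.0, 7.2])
>>> y_iso = IsotonicRegression().fit_transform(x, y)   # non-decreasing fit to y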
When performing classification you often want not only to predict the class label, but also obtain a probability of the
respective label. This probability gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support probability prediction. The calibration module
allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly
interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such
that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the
positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:
LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast,
the other methods return biased probabilities; with different biases per method:
• GaussianNB tends to push probabilities to 0 or 1 (note the counts in the histograms). This is mainly because
it makes the assumption that features are conditionally independent given the class, which is not the case in this
dataset which contains 2 redundant features.
• RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately
0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by
Niculescu-Mizil and Caruana [4]: “Methods such as bagging and random forests that average predictions from a
base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base
models will bias predictions that should be near zero or one away from these values. Because predictions are
restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example,
if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict
zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict
values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We
observe this effect most strongly with random forests because the base-level trees trained with random forests
have relatively high variance due to feature subsetting.” As a result, the calibration curve, also referred to as the
reliability diagram (Wilks 1995 [5]), shows a characteristic sigmoid shape, indicating that the classifier could trust
its “intuition” more and return probabilities closer to 0 or 1 typically.
• Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve than the RandomForest-
Classifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which
focus on hard samples that are close to the decision boundary (the support vectors).
Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on
Platt’s sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Proba-
bility calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV
uses a cross-validation generator and estimates for each split the model parameter on the train samples and the cali-
bration of the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can
be calibrated by CalibratedClassifierCV via the parameter cv=”prefit”. In this case, the user has to take care
manually that data for model fitting and calibration are disjoint.
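A minimal sketch of the prefit workflow, assuming a Gaussian naive Bayes base estimator and a held-out calibration split:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.calibration import CalibratedClassifierCV
>>> X, y = make_classification(n_samples=1000, random_state=42)
>>> X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=42)
>>> base = GaussianNB().fit(X_train, y_train)               # fitted on the training split only
>>> calibrated = CalibratedClassifierCV(base, cv="prefit",
...                                     method="sigmoid").fit(X_calib, y_calib)
>>> proba = calibrated.predict_proba(X_calib)               # calibrated probabilities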
The following images demonstrate the benefit of probability calibration. The first image presents a dataset with 2
classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the
samples in this blob should be 0.5.
The following image shows, for the data above, the estimated probability using a Gaussian naive Bayes classifier without
calibration, with a sigmoid calibration and with a non-parametric isotonic calibration. One can observe that the non-
parametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.
The following experiment is performed on an artificial dataset for binary classification with 100,000 samples (1,000
of them are used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant.
The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC),
and linear SVC with both isotonic calibration and sigmoid calibration. The Brier score (brier_score_loss, reported
in the legend; the smaller the better) is a metric that combines calibration loss and refinement loss.
Calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC
segments. Refinement loss can be defined as the expected optimal loss as measured by the area under the optimal cost
curve.
One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC’s calibration
curve or reliability diagram has a sigmoid curve, which is typical for an under-confident classifier. In the case of
LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that
are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly
identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both
kinds of calibration and also without calibration.
One can see that Gaussian naive Bayes performs very badly but does so in another way than linear SVC: While linear
SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes’ calibration curve has a transposed-sigmoid shape.
This is typical for an over-confident classifier. In this case, the classifier’s overconfidence is caused by the redundant
features which violate the naive Bayes assumption of feature-independence.
Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from
the nearly diagonal calibration curve. Sigmoid calibration also improves the brier score slightly, albeit not as strongly
as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form
assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however,
makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In
general, sigmoid calibration is preferable in cases where the calibration curve is sigmoid and where there is limited
calibration data, while isotonic calibration is preferable for non-sigmoid calibration curves and in situations where
large amounts of data are available for calibration.
[4] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005
[5] On the combination of forecast probabilities for consecutive precipitation periods. Wea. Forecasting, 5, 640–650., Wilks, D. S., 1990a
CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base
estimator can do so. In this case, the classifier is calibrated first for each class separately in a one-vs-rest fashion.
When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As
those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them.
The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem.
Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the
probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier
after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green:
class 2, blue: class 3).
The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800
training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical
classifier, which was trained on 600 datapoints, with method=’sigmoid’ on the remaining 200 datapoints reduces the
confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:
This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base
estimators which would have resulted in a similar decrease in log-loss.
References:
• Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, B. Zadrozny &
C. Elkan, ICML 2001
• Transforming Classifier Scores into Accurate Multiclass Probability Estimates, B. Zadrozny & C. Elkan,
(KDD 2002)
• Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, J.
Platt, (1999)
Warning: This implementation is not intended for large-scale applications. In particular, scikit-learn offers no
GPU support. For much faster, GPU-based implementations, as well as frameworks offering much more flexibility
to build deep learning architectures, see Related Projects.
Multi-layer Perceptron
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function 𝑓 (·) : 𝑅𝑚 → 𝑅𝑜 by training
on a dataset, where 𝑚 is the number of dimensions for input and 𝑜 is the number of dimensions for output. Given a set
of features 𝑋 = 𝑥1 , 𝑥2 , ..., 𝑥𝑚 and a target 𝑦, it can learn a non-linear function approximator for either classification
or regression. It is different from logistic regression, in that between the input and the output layer, there can be one
or more non-linear layers, called hidden layers. Figure 1 shows a one hidden layer MLP with scalar output.
The leftmost layer, known as the input layer, consists of a set of neurons {𝑥𝑖 |𝑥1 , 𝑥2 , ..., 𝑥𝑚 } representing the input
features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear sum-
mation 𝑤1 𝑥1 + 𝑤2 𝑥2 + ... + 𝑤𝑚 𝑥𝑚 , followed by a non-linear activation function 𝑔(·) : 𝑅 → 𝑅 - like the hyperbolic
tan function. The output layer receives the values from the last hidden layer and transforms them into output values.
The module contains the public attributes coefs_ and intercepts_. coefs_ is a list of weight matrices, where
weight matrix at index 𝑖 represents the weights between layer 𝑖 and layer 𝑖+1. intercepts_ is a list of bias vectors,
where the vector at index 𝑖 represents the bias values added to layer 𝑖 + 1.
Classification
Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using Backpropagation.
MLP trains on two arrays: array X of size (n_samples, n_features), which holds the training samples represented as
floating point feature vectors; and array y of size (n_samples,), which holds the target values (class labels) for the
training samples:
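A minimal sketch (toy data chosen for illustration):
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(5, 2), random_state=1)
>>> clf = clf.fit(X, y)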
After fitting (training), the model can predict labels for new samples:
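Continuing the sketch above:
>>> clf.predict([[2., 2.], [-1., -2.]])
array([1, 0])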
MLP can fit a non-linear model to the training data. clf.coefs_ contains the weight matrices that constitute the
model parameters:
Currently, MLPClassifier supports only the Cross-Entropy loss function, which allows probability estimates by
running the predict_proba method.
MLP trains using Backpropagation. More precisely, it trains using some form of gradient descent and the gradients
are calculated using Backpropagation. For classification, it minimizes the Cross-Entropy loss function, giving a vector
of probability estimates 𝑃 (𝑦|𝑥) per sample 𝑥:
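Continuing the sketch above, predict_proba returns one row of class probabilities per sample:
>>> proba = clf.predict_proba([[2., 2.], [1., 2.]])
>>> proba.shape
(2, 2)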
Further, the model supports multi-label classification in which a sample can belong to more than one class. For each
class, the raw output passes through the logistic function. Values larger or equal to 0.5 are rounded to 1, otherwise to
0. For a predicted output of a sample, the indices where the value is 1 represents the assigned classes of that sample:
See the examples below and the docstring of MLPClassifier.fit for further information.
Examples:
Regression
Class MLPRegressor implements a multi-layer perceptron (MLP) that trains using backpropagation with no activa-
tion function in the output layer, which can also be seen as using the identity function as activation function. Therefore,
it uses the square error as the loss function, and the output is a set of continuous values.
MLPRegressor also supports multi-output regression, in which a sample can have more than one target.
Regularization
Both MLPRegressor and MLPClassifier use the parameter alpha for the L2 regularization term, which
helps avoid overfitting by penalizing weights with large magnitudes. The following plot displays a decision
function that varies with the value of alpha.
See the examples below for further information.
Examples:
Algorithms
MLP trains using Stochastic Gradient Descent, Adam, or L-BFGS. Stochastic Gradient Descent (SGD) updates pa-
rameters using the gradient of the loss function with respect to a parameter that needs adaptation, i.e.
w ← w − η (α ∂R(w)/∂w + ∂Loss/∂w)
where 𝜂 is the learning rate which controls the step-size in the parameter space search. 𝐿𝑜𝑠𝑠 is the loss function used
for the network.
More details can be found in the documentation of SGD
Adam is similar to SGD in a sense that it is a stochastic optimizer, but it can automatically adjust the amount to update
parameters based on adaptive estimates of lower-order moments.
With SGD or Adam, training supports online and mini-batch learning.
L-BFGS is a solver that approximates the Hessian matrix which represents the second-order partial derivative of a
function. Further it approximates the inverse of the Hessian matrix to perform parameter updates. The implementation
uses the Scipy version of L-BFGS.
If the selected solver is ‘L-BFGS’, training does not support online nor mini-batch learning.
Complexity
Suppose there are n training samples, m features, k hidden layers each containing h neurons (for simplicity), and o
output neurons. The time complexity of backpropagation is O(n · m · h^k · o · i), where i is the number of iterations.
Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and
few hidden layers for training.
Mathematical formulation
Given a set of training examples (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 ) where 𝑥𝑖 ∈ R𝑛 and 𝑦𝑖 ∈ {0, 1}, a one hidden layer
one hidden neuron MLP learns the function 𝑓 (𝑥) = 𝑊2 𝑔(𝑊1𝑇 𝑥 + 𝑏1 ) + 𝑏2 where 𝑊1 ∈ R𝑚 and 𝑊2 , 𝑏1 , 𝑏2 ∈ R are
model parameters. 𝑊1 , 𝑊2 represent the weights of the input layer and hidden layer, respectively; and 𝑏1 , 𝑏2 represent
the bias added to the hidden layer and the output layer, respectively. 𝑔(·) : 𝑅 → 𝑅 is the activation function, set by
default as the hyperbolic tan. It is given as,
g(z) = (e^z − e^{−z}) / (e^z + e^{−z})
For binary classification, 𝑓 (𝑥) passes through the logistic function 𝑔(𝑧) = 1/(1+𝑒−𝑧 ) to obtain output values between
zero and one. A threshold, set to 0.5, would assign samples of outputs larger or equal 0.5 to the positive class, and the
rest to the negative class.
If there are more than two classes, 𝑓 (𝑥) itself would be a vector of size (n_classes,). Instead of passing through logistic
function, it passes through the softmax function, which is written as,
softmax(z)_i = exp(z_i) / Σ_{l=1}^{k} exp(z_l)
where z_i represents the i-th element of the input to softmax, which corresponds to class i, and k is the number of
classes. The result is a vector containing the probabilities that sample x belongs to each class. The output is the class
with the highest probability.
In regression, the output remains as 𝑓 (𝑥); therefore, output activation function is just the identity function.
MLP uses different loss functions depending on the problem type. The loss function for classification is Cross-Entropy,
which in the binary case is given as
Loss(ŷ, y, W) = −y ln ŷ − (1 − y) ln(1 − ŷ) + α ||W||₂²
where α ||W||₂² is an L2-regularization term (aka penalty) that penalizes complex models; and α > 0 is a non-negative
hyperparameter that controls the magnitude of the penalty.
For regression, MLP uses the Square Error loss function; written as,
Loss(ŷ, y, W) = (1/2) ||ŷ − y||₂² + (α/2) ||W||₂²
Starting from initial random weights, multi-layer perceptron (MLP) minimizes the loss function by repeatedly updating
these weights. After computing the loss, a backward pass propagates it from the output layer to the previous layers,
providing each weight parameter with an update value meant to decrease the loss.
In gradient descent, the gradient ∇𝐿𝑜𝑠𝑠𝑊 of the loss with respect to the weights is computed and deducted from 𝑊 .
More formally, this is expressed as,
𝑊 𝑖+1 = 𝑊 𝑖 − 𝜖∇𝐿𝑜𝑠𝑠𝑖𝑊
where 𝑖 is the iteration step, and 𝜖 is the learning rate with a value larger than 0.
The algorithm stops when it reaches a preset maximum number of iterations; or when the improvement in loss is below
a certain, small number.
• Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. For
example, scale each attribute on the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and
variance 1. Note that you must apply the same scaling to the test set for meaningful results. You can use
StandardScaler for standardization.
>>> from sklearn.preprocessing import StandardScaler # doctest: +SKIP
>>> scaler = StandardScaler() # doctest: +SKIP
>>> # Don't cheat - fit only on training data
>>> scaler.fit(X_train) # doctest: +SKIP
>>> X_train = scaler.transform(X_train) # doctest: +SKIP
>>> # apply same transformation to test data
>>> X_test = scaler.transform(X_test) # doctest: +SKIP
• Finding a reasonable regularization parameter 𝛼 is best done using GridSearchCV, usually in the range 10.0
** -np.arange(1, 7).
• Empirically, we observed that L-BFGS converges faster and with better solutions on small datasets. For rela-
tively large datasets, however, Adam is very robust. It usually converges quickly and gives pretty good perfor-
mance. SGD with momentum or Nesterov's momentum, on the other hand, can perform better than those two
algorithms if the learning rate is correctly tuned.
If you want more control over stopping criteria or learning rate in SGD, or want to do additional monitoring, using
warm_start=True and max_iter=1 and iterating yourself can be helpful:
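A minimal sketch of manual iteration (toy data as above):
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(hidden_layer_sizes=(15,), random_state=1, max_iter=1,
...                     warm_start=True)
>>> for i in range(10):
...     _ = clf.fit(X, y)          # each call continues training for one more iteration
...     # custom monitoring of clf.loss_ or a custom stopping criterion can go here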
References:
• “Learning representations by back-propagating errors.” Rumelhart, David E., Geoffrey E. Hinton, and Ronald
J. Williams.
• “Stochastic Gradient Descent” L. Bottou - Website, 2010.
• “Backpropagation” Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen - Website, 2011.
• “Efficient BackProp” Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.
• “Adam: A method for stochastic optimization.” Kingma, Diederik, and Jimmy Ba. arXiv preprint
arXiv:1412.6980 (2014).
sklearn.mixture is a package which enables one to learn Gaussian Mixture Models (diagonal, spherical, tied
and full covariance matrices supported), sample them, and estimate them from data. Facilities to help determine the
appropriate number of components are also provided.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a
finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing
k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the
latent Gaussians.
Scikit-learn implements different classes to estimate Gaussian mixture models, that correspond to different estimation
strategies, detailed below.
Fig. 3: Two-component Gaussian mixture model: data points, and equi-probability surfaces of the model.
Gaussian Mixture
The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-
Gaussian models. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Infor-
mation Criterion to assess the number of clusters in the data. A GaussianMixture.fit method is provided that
learns a Gaussian Mixture Model from training data. Given test data, it can assign to each sample the Gaussian it most
probably belongs to using the GaussianMixture.predict method.
The GaussianMixture comes with different options to constrain the covariance of the different classes estimated:
spherical, diagonal, tied or full covariance.
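A minimal sketch on toy two-blob data (chosen for illustration):
>>> import numpy as np
>>> from sklearn.mixture import GaussianMixture
>>> rng = np.random.RandomState(0)
>>> X = np.concatenate([rng.randn(100, 2), rng.randn(100, 2) + [5, 5]])
>>> gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
>>> labels = gmm.predict(X)     # component assignment for each sample
>>> bic = gmm.bic(X)            # Bayesian Information Criterion for model selection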
Examples:
• See GMM covariances for an example of using the Gaussian mixture as clustering on the iris dataset.
• See Density Estimation for a Gaussian mixture for an example on plotting the density estimation.
Pros
Cons
Singularities: When one has insufficiently many points per mixture, estimating the covariance matrices
becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood
unless one regularizes the covariances artificially.
Number of components: This algorithm will always use all the components it has access to, needing
held-out data or information theoretical criteria to decide how many components to use in the ab-
sence of external cues.
The BIC criterion can be used to select the number of components in a Gaussian Mixture in an efficient way. In theory,
it recovers the true number of components only in the asymptotic regime (i.e. if much data is available and assuming
that the data was actually generated i.i.d. from a mixture of Gaussian distributions). Note that using a Variational
Bayesian Gaussian mixture avoids the specification of the number of components for a Gaussian mixture model.
Examples:
• See Gaussian Mixture Model Selection for an example of model selection performed with classical Gaussian
mixture.
The main difficulty in learning Gaussian mixture models from unlabeled data is that one usually doesn't know
which points came from which latent component (if one has access to this information it gets very easy to fit a separate
Gaussian distribution to each set of points). Expectation-maximization is a well-founded statistical algorithm to get
around this problem by an iterative process. First one assumes random components (randomly centered on data points,
learned from k-means, or even just normally distributed around the origin) and computes for each point a probability
of being generated by each component of the model. Then, one tweaks the parameters to maximize the likelihood of
the data given those assignments. Repeating this process is guaranteed to always converge to a local optimum.
The BayesianGaussianMixture object implements a variant of the Gaussian mixture model with variational
inference algorithms. The API is similar to the one defined by GaussianMixture.
Variational inference is an extension of expectation-maximization that maximizes a lower bound on model evidence
(including priors) instead of data likelihood. The principle behind variational methods is the same as expectation-
maximization (that is both are iterative algorithms that alternate between finding the probabilities for each point to
be generated by each mixture and fitting the mixture to these assigned points), but variational methods add regular-
ization by integrating information from prior distributions. This avoids the singularities often found in expectation-
maximization solutions but introduces some subtle biases to the model. Inference is often notably slower, but not
usually as much so as to render usage impractical.
Due to its Bayesian nature, the variational algorithm needs more hyperparameters than expectation-maximization,
the most important of these being the concentration parameter weight_concentration_prior. Specifying a
low value for the concentration prior will make the model put most of the weight on a few components and set the
remaining components' weights very close to zero. High values of the concentration prior will allow a larger number of
components to be active in the mixture.
The implementation of the BayesianGaussianMixture class proposes two types of prior for the
weights distribution: a finite mixture model with Dirichlet distribution and an infinite mixture model with the Dirichlet
Process. In practice, the Dirichlet Process inference algorithm is approximated and uses a truncated distribution with a fixed
maximum number of components (called the Stick-breaking representation). The number of components actually used
almost always depends on the data.
The next figure compares the results obtained for different types of the weight concentration prior (parameter
weight_concentration_prior_type) for different values of weight_concentration_prior. Here,
we can see the value of the weight_concentration_prior parameter has a strong impact on the effective
number of active components obtained. We can also notice that large values for the concentration weight prior lead
to more uniform weights when the type of prior is ‘dirichlet_distribution’ while this is not necessarily the case for the
‘dirichlet_process’ type (used by default).
The examples below compare Gaussian mixture models with a fixed number of components, to the variational Gaus-
sian mixture models with a Dirichlet process prior. Here, a classical Gaussian mixture is fitted with 5 components on
a dataset composed of 2 clusters. We can see that the variational Gaussian mixture with a Dirichlet process prior is
able to limit itself to only 2 components whereas the Gaussian mixture fits the data with a fixed number of components
that has to be set a priori by the user. In this case the user has selected n_components=5 which does not match the
true generative distribution of this toy dataset. Note that with very few observations, the variational Gaussian mixture
models with a Dirichlet process prior can take a conservative stand, and fit only one component.
On the following figure we are fitting a dataset not well-depicted by a Gaussian mixture. Adjusting the
weight_concentration_prior parameter of the BayesianGaussianMixture controls the number of
components used to fit this data. We also present on the last two plots a random sampling generated from the two
resulting mixtures.
Examples:
• See Gaussian Mixture Model Ellipsoids for an example on plotting the confidence ellipsoids for both
GaussianMixture and BayesianGaussianMixture.
• Gaussian Mixture Model Sine Curve shows using GaussianMixture and
BayesianGaussianMixture to fit a sine wave.
• See Concentration Prior Type Analysis of Variation Bayesian Gaussian Mixture for an ex-
ample plotting the confidence ellipsoids for the BayesianGaussianMixture with dif-
ferent weight_concentration_prior_type for different values of the parameter
weight_concentration_prior.
Pros
Less sensitivity to the number of parameters: unlike finite models, which will almost always use
all components as much as they can, and hence will produce wildly different solutions for
different numbers of components, the variational inference with a Dirichlet process prior
(weight_concentration_prior_type='dirichlet_process') won’t change much
with changes to the parameters, leading to more stability and less tuning.
Regularization: due to the incorporation of prior information, variational solutions have less pathological
special cases than expectation-maximization solutions.
Cons
Speed: the extra parametrization necessary for variational inference makes inference slower, although not
by much.
Hyperparameters: this algorithm needs an extra hyperparameter that might need experimental tuning via
cross-validation.
Bias: there are many implicit biases in the inference algorithms (and also in the Dirichlet process if used),
and whenever there is a mismatch between these biases and the data it might be possible to fit better
models using a finite mixture.
Here we describe variational inference algorithms on Dirichlet process mixture. The Dirichlet process is a prior
probability distribution on clusterings with an infinite, unbounded, number of partitions. Variational techniques let us
incorporate this prior structure on Gaussian mixture models at almost no penalty in inference time, comparing with a
finite Gaussian mixture model.
An important question is how the Dirichlet process can use an infinite, unbounded number of clusters and still be
consistent. While a full explanation doesn't fit this manual, one can think of its stick-breaking process analogy to help
in understanding it. The stick breaking process is a generative story for the Dirichlet process. We start with a unit-length
stick and in each step we break off a portion of the remaining stick. Each time, we associate the length of the piece of
the stick to the proportion of points that falls into a group of the mixture. At the end, to represent the infinite mixture,
we associate the last remaining piece of the stick to the proportion of points that don’t fall into all the other groups. The
length of each piece is a random variable with probability proportional to the concentration parameter. Smaller values
of the concentration parameter will divide the unit length into larger pieces of the stick (defining a more concentrated
distribution). Larger concentration values will create smaller pieces of the stick (increasing the number of components
with non-zero weights).
Variational inference techniques for the Dirichlet process still work with a finite approximation to this infinite mixture
model, but instead of having to specify a priori how many components one wants to use, one just specifies the concen-
tration parameter and an upper bound on the number of mixture components (this upper bound, assuming it is higher
than the “true” number of components, affects only algorithmic complexity, not the actual number of components
used).
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the
idea that the dimensionality of many data sets is only artificially high.
Introduction
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to
show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization
of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though
this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired.
In a random projection, it is likely that the more interesting structure within the data will be lost.
To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have
been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant
Analysis, and others. These algorithms define specific rubrics to choose an “interesting” linear projection of the data.
These methods can be powerful, but often miss important non-linear structure in the data.
Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-
linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it
learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
Examples:
• See Manifold learning on handwritten digits: Locally Linear Embedding, Isomap. . . for an example of
dimensionality reduction on handwritten digits.
• See Comparison of Manifold Learning methods for an example of dimensionality reduction on a toy “S-
curve” dataset.
Isomap
One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can
be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional
embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.
Complexity
cost can often be improved using the ARPACK solver. The eigensolver can be specified by the user with the
eigen_solver keyword of Isomap, and the shortest-path algorithm with the path_method keyword. If unspecified,
the code attempts to choose the best option for the input data.
The overall complexity of Isomap is 𝑂[𝐷 log(𝑘)𝑁 log(𝑁 )] + 𝑂[𝑁 2 (𝑘 + log(𝑁 ))] + 𝑂[𝑑𝑁 2 ].
• 𝑁 : number of training data points
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “A global geometric framework for nonlinear dimensionality reduction” Tenenbaum, J.B.; De Silva, V.; &
Langford, J.C. Science 290 (5500)
Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within
local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally
compared to find the best non-linear embedding.
Locally linear embedding can be performed with function locally_linear_embedding or its object-oriented
counterpart LocallyLinearEmbedding.
Complexity
References:
• “Nonlinear dimensionality reduction by locally linear embedding” Roweis, S. & Saul, L. Science 290:2323
(2000)
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the
number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard
LLE applies an arbitrary regularization parameter 𝑟, which is chosen relative to the trace of the local weight matrix.
Though it can be shown formally that as 𝑟 → 0, the solution converges to the desired embedding, there is no guarantee
that the optimal solution will be found for 𝑟 > 0. This problem manifests itself in embeddings which distort the
underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood.
This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the key-
word method = 'modified'. It requires n_neighbors > n_components.
Complexity
References:
• “MLLE: Modified Locally Linear Embedding Using Multiple Weights” Zhang, Z. & Wang, J.
Hessian Eigenmapping
Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization
problem of LLE. It revolves around a hessian-based quadratic form at each neighborhood which is used to recover
the locally linear structure. Though other implementations note its poor scaling with data size, sklearn imple-
ments some algorithmic improvements which make its cost comparable to that of other LLE variants for small output
dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counter-
part LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires n_neighbors >
n_components * (n_components + 3) / 2.
Complexity
• 𝐷 : input dimension
• 𝑘 : number of nearest neighbors
• 𝑑 : output dimension
References:
• “Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data” Donoho, D. &
Grimes, C. Proc Natl Acad Sci USA. 100:5591 (2003)
Spectral Embedding
Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigen-
maps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian.
The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimen-
sional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold
are mapped close to each other in the low dimensional space, preserving local distances. Spectral embedding can be
performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
Complexity
• 𝑑 : output dimension
References:
• “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation” M. Belkin, P. Niyogi, Neural
Computation, June 2003; 15 (6):1373-1396
Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough
to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE,
LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global
optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function
locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the key-
word method = 'ltsa'.
Complexity
References:
• “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment” Zhang, Z. & Zha,
H. Journal of Shanghai Univ. 8:406 (2004)
Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well
the distances in the original high-dimensional space.
In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or
dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction
frequencies of molecules, or trade indices between countries.
There exist two types of MDS algorithm: metric and non-metric. In scikit-learn, the class MDS implements
both. In Metric MDS, the input similarity matrix arises from a metric (and thus respects the triangular inequality), and the
distances between the two output points are then set to be as close as possible to the similarity or dissimilarity data. In
the non-metric version, the algorithms will try to preserve the order of the distances, and hence seek a monotonic
relationship between the distances in the embedded space and the similarities/dissimilarities.
Let S be the similarity matrix, and X the coordinates of the n input points. Disparities d̂_ij are a transformation of the
similarities chosen in some optimal way. The objective, called the stress, is then defined by Σ_{i<j} d_ij(X) − d̂_ij(X).
Metric MDS
In the simplest metric MDS model, called absolute MDS, disparities are defined by d̂_ij = S_ij. With absolute MDS, the
value S_ij should then correspond exactly to the distance between point i and point j in the embedding space.
Most commonly, disparities are set to 𝑑ˆ𝑖𝑗 = 𝑏𝑆𝑖𝑗 .
Nonmetric MDS
Non metric MDS focuses on the ordination of the data. If 𝑆𝑖𝑗 < 𝑆𝑗𝑘 , then the embedding should enforce 𝑑𝑖𝑗 < 𝑑𝑗𝑘 .
A simple algorithm to enforce that is to use a monotonic regression of 𝑑𝑖𝑗 on 𝑆𝑖𝑗 , yielding disparities 𝑑ˆ𝑖𝑗 in the same
order as 𝑆𝑖𝑗 .
A trivial solution to this problem is to place all the points at the origin. In order to avoid that, the disparities $\hat{d}_{ij}$ are normalized.
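A minimal sketch of both flavours (the digits subset and parameter values here are arbitrary; metric=True is the default and metric=False requests the non-metric variant):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import MDS
>>> X, _ = load_digits(return_X_y=True)
>>> mds = MDS(n_components=2, random_state=0)                  # metric MDS (default)
>>> X_metric = mds.fit_transform(X[:100])
>>> nmds = MDS(n_components=2, metric=False, random_state=0)   # non-metric MDS
>>> X_nonmetric = nmds.fit_transform(X[:100])
>>> X_metric.shape
(100, 2)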
References:
• “Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statis-
tics (1997)
• “Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
• “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychome-
trika, 29, (1964)
t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by
Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This
allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
• Revealing the structure at many scales on a single map
• Revealing data that lie in multiple, different, manifolds or clusters
• Reducing the tendency to crowd points together at the center
While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will
focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the
S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle
a dataset that comprises several manifolds at once as is the case in the digits dataset.
The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will
be minimized by gradient descent. Note that the KL divergence is not convex: multiple restarts with different initializations will end up in different local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
The disadvantages to using t-SNE are roughly:
• t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will
finish in seconds or minutes
• The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
• The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However,
it is perfectly legitimate to pick the embedding with the least error.
• Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using
init='pca').
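A minimal usage sketch (the digits subset, perplexity value and PCA initialization below are illustrative choices, not recommendations):
>>> from sklearn.datasets import load_digits
>>> from sklearn.manifold import TSNE
>>> X, _ = load_digits(return_X_y=True)
>>> tsne = TSNE(n_components=2, init='pca', perplexity=30, random_state=0)
>>> X_embedded = tsne.fit_transform(X[:200])
>>> X_embedded.shape
(200, 2)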
Optimizing t-SNE
The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be
embedded on two or three dimensions.
Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimiza-
tion of t-SNE and therefore possibly the quality of the resulting embedding:
• perplexity
• early exaggeration factor
• learning rate
• maximum number of iterations
• angle (not used in the exact method)
The perplexity is defined as $k = 2^S$ where $S$ is the Shannon entropy of the conditional probability distribution. The perplexity of a $k$-sided die is $k$, so that $k$ is effectively the number of nearest neighbors t-SNE considers when generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and are less sensitive to small structure. Conversely a lower perplexity considers a smaller number of neighbors, and thus ignores more global information in favour of the local neighborhood. As dataset sizes get larger more points will be required to get a reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly noisier datasets will require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of
two phases: the early exaggeration phase and the final optimization. During early exaggeration the joint probabilities
in the original space will be artificially increased by multiplication with a given factor. Larger factors result in larger
gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase.
Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low gradient descent will get
stuck in a bad local minimum. If it is too high the KL divergence will increase during optimization. More tips can be
found in Laurens van der Maaten’s FAQ (see references). The last parameter, angle, is a tradeoff between performance
and accuracy. Larger angles imply that we can approximate larger regions by a single point, leading to better speed
but less accurate results.
“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as
interactive plots to explore the effects of different parameters.
Barnes-Hut t-SNE
The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algo-
rithms. The optimization is quite difficult and the computation of the gradient is 𝑂[𝑑𝑁 𝑙𝑜𝑔(𝑁 )], where 𝑑 is the number
of output dimensions and 𝑁 is the number of samples. The Barnes-Hut method improves on the exact method where
t-SNE complexity is 𝑂[𝑑𝑁 2 ], but has several other notable differences:
• The Barnes-Hut implementation only works when the target dimensionality is 3 or less. The 2D case is typical
when building visualizations.
• Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method
or can be approximated by a dense low rank projection for instance using sklearn.decomposition.
TruncatedSVD
• Barnes-Hut is an approximation of the exact method. The approximation is parameterized with the angle parameter, therefore the angle parameter is unused when method='exact'.
• Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed hundreds of thousands of data points while the exact method can handle thousands of samples before becoming computationally intractable.
For visualization purposes (which is the main use case of t-SNE), using the Barnes-Hut method is strongly recommended. The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in higher dimensional space, but is limited to small datasets due to computational constraints.
Also note that the digits labels roughly match the natural grouping found by t-SNE while the linear 2D projection of
the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be
well separated by non linear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel).
However, failing to visualize well separated homogeneously labeled groups with t-SNE in 2D does not necessarily imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are not enough to accurately represent the internal structure of the data.
References:
• “Visualizing High-Dimensional Data Using t-SNE” van der Maaten, L.J.P.; Hinton, G. Journal of Machine
Learning Research (2008)
• “t-Distributed Stochastic Neighbor Embedding” van der Maaten, L.J.P.
• “Accelerating t-SNE using Tree-Based Algorithms.” L.J.P. van der Maaten. Journal of Machine Learning
Research 15(Oct):3221-3245, 2014.
• Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-
neighbor search, the algorithm may perform poorly otherwise. See StandardScaler for convenient ways of
scaling heterogeneous data.
• The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a
𝑑-dimensional manifold embedded in a 𝐷-dimensional parameter space, the reconstruction error will decrease
as n_components is increased until n_components == d.
• Note that noisy data can “short-circuit” the manifold, in essence acting as a bridge between parts of the manifold
that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of
research.
• Certain input configurations can lead to singular weight matrices, for example when more than two points in the
dataset are identical, or when the data is split into disjointed groups. In this case, solver='arpack' will
fail to find the null space. The easiest way to address this is to use solver='dense' which will work on a
singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can
attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may
help. If it is due to identical points in the dataset, removing these points may help.
See also:
Totally Random Trees Embedding can also be useful to derive non-linear representations of feature space, though it does not perform dimensionality reduction.
4.2.3 Clustering
Input data
One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as
input. All the methods accept standard data matrices of shape [n_samples, n_features]. These can be ob-
tained from the classes in the sklearn.feature_extraction module. For AffinityPropagation,
SpectralClustering and DBSCAN one can also input similarity matrices of shape [n_samples,
n_samples]. These can be obtained from the functions in the sklearn.metrics.pairwise module.
Non-flat geometry clustering is useful when the clusters have a specific shape, i.e. a non-flat manifold, and the standard
euclidean distance is not the right metric. This case arises in the two top rows of the figure above.
Gaussian mixture models, useful for clustering, are described in another chapter of the documentation dedicated
to mixture models. KMeans can be seen as a special case of Gaussian mixture model with equal covariance per
component.
K-means
The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion
known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to
be specified. It scales well to large number of samples and has been used across a large range of application areas in
many different fields.
The k-means algorithm divides a set of 𝑁 samples 𝑋 into 𝐾 disjoint clusters 𝐶, each described by the mean 𝜇𝑗 of
the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general,
points from 𝑋, although they live in the same space.
The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares crite-
rion:
\sum_{i=0}^{n} \min_{\mu_j \in C} (\|x_i - \mu_j\|^2)
Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:
• Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds
poorly to elongated clusters, or manifolds with irregular shapes.
• Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very
high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse
of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA)
prior to k-means clustering can alleviate this problem and speed up the computations.
K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps. The first step chooses
the initial centroids, with the most basic method being to choose 𝑘 samples from the dataset 𝑋. After initialization, K-
means consists of looping between the two other steps. The first step assigns each sample to its nearest centroid. The
second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid.
The difference between the old and the new centroids are computed and the algorithm repeats these last two steps until
this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.
K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix.
The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of the points
is calculated using the current centroids. Each segment in the Voronoi diagram becomes a separate cluster. Secondly,
the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is
fulfilled. Usually, the algorithm stops when the relative decrease in the objective function between iterations is less
than the given tolerance value. This is not the case in this implementation: iteration stops when centroids move less
than the tolerance.
Given enough time, K-means will always converge, however this may be to a local minimum. This is highly depen-
dent on the initialization of the centroids. As a result, the computation is often done several times, with different
initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which
has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to
be (generally) distant from each other, leading to provably better results than random initialization, as shown in the
reference.
The algorithm supports sample weights, which can be given by a parameter sample_weight. This makes it possible to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight
of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset 𝑋.
A parameter can be given to allow K-means to be run in parallel, called n_jobs. Giving this parameter a positive
value uses that many processors (default: 1). A value of -1 uses all available processors, with -2 using one less, and so
on. Parallelization generally speeds up computation at the cost of memory (in this case, multiple copies of centroids
need to be stored, one for each job).
Warning: The parallel version of K-Means is broken on OS X when numpy uses the Accelerate Framework.
This is expected behavior: Accelerate can be called after a fork but you need to execv the subprocess with the
Python binary (which multiprocessing does not do under posix).
K-means can be used for vector quantization. This is achieved using the transform method of a trained KMeans model.
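For instance, a small sketch with hand-made data (the toy points below are purely illustrative): transform returns the distance of each sample to every centroid, while predict assigns new samples to the nearest centroid.
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> labels = kmeans.predict([[0, 0], [12, 3]])   # nearest-centroid assignment of new samples
>>> distances = kmeans.transform(X)              # one distance per sample and per centroid
>>> distances.shape
(6, 2)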
Examples:
• Demonstration of k-means assumptions: Demonstrating when k-means performs intuitively and when it does
not
• A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits
References:
• “k-means++: The advantages of careful seeding” Arthur, David, and Sergei Vassilvitskii, Proceedings of
the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied
Mathematics (2007)
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation
time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, ran-
domly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required
to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch
k-means produces results that are generally only slightly worse than the standard algorithm.
The algorithm iterates between two major steps, similar to vanilla k-means. In the first step, 𝑏 samples are drawn
randomly from the dataset, to form a mini-batch. These are then assigned to the nearest centroid. In the second step,
the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch,
the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to
that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed
until convergence or a predetermined number of iterations is reached.
MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this differ-
ence in quality can be quite small, as shown in the example and cited reference.
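A minimal sketch (random toy data; the number of clusters and batch size are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import MiniBatchKMeans
>>> X = np.random.RandomState(0).rand(1000, 2)
>>> mbk = MiniBatchKMeans(n_clusters=8, batch_size=100, random_state=0).fit(X)
>>> mbk.cluster_centers_.shape
(8, 2)
>>> labels = mbk.predict(X)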
Examples:
• Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of KMeans and
MiniBatchKMeans
• Clustering text documents using k-means: Document clustering using sparse MiniBatchKMeans
• Online learning of a dictionary of parts of faces
References:
• “Web Scale K-Means clustering” D. Sculley, Proceedings of the 19th international conference on World wide
web (2010)
Affinity Propagation
AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A
dataset is then described using a small number of exemplars, which are identified as those most representative of other
samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other,
which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at
which point the final exemplars are chosen, and hence the final clustering is given.
Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this pur-
pose, the two important parameters are the preference, which controls how many exemplars are used, and the damping
factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these
messages.
The main drawback of Affinity Propagation is its complexity. The algorithm has a time complexity of the order
𝑂(𝑁 2 𝑇 ), where 𝑁 is the number of samples and 𝑇 is the number of iterations until convergence. Further, the memory
complexity is of the order 𝑂(𝑁 2 ) if a dense similarity matrix is used, but reducible if a sparse similarity matrix is
used. This makes Affinity Propagation most appropriate for small to medium sized datasets.
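For illustration, a minimal sketch on synthetic blobs (the preference and damping values below are arbitrary starting points, not recommendations):
>>> from sklearn.cluster import AffinityPropagation
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
>>> af = AffinityPropagation(preference=-50, damping=0.5).fit(X)
>>> exemplars = af.cluster_centers_indices_     # indices of the chosen exemplars
>>> n_clusters_found = len(exemplars)           # number of clusters chosen from the data
>>> labels = af.labels_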
Examples:
• Demo of affinity propagation clustering algorithm: Affinity Propagation on a synthetic 2D datasets with 3
classes.
• Visualizing the stock market structure Affinity Propagation on Financial time series to find groups of compa-
nies
Algorithm description: The messages sent between points belong to one of two categories. The first is the responsi-
bility 𝑟(𝑖, 𝑘), which is the accumulated evidence that sample 𝑘 should be the exemplar for sample 𝑖. The second is the
availability 𝑎(𝑖, 𝑘) which is the accumulated evidence that sample 𝑖 should choose sample 𝑘 to be its exemplar, and
considers the values for all other samples that 𝑘 should be an exemplar. In this way, exemplars are chosen by samples
if they are (1) similar enough to many samples and (2) chosen by many samples to be representative of themselves.
More formally, the responsibility of a sample $k$ to be the exemplar of sample $i$ is given by:

r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \left[ a(i, k') + s(i, k') \right]

where $s(i, k)$ is the similarity between samples $i$ and $k$. The availability of sample $k$ to be the exemplar of sample $i$ is given by:

a(i, k) \leftarrow \min \left[ 0, r(k, k) + \sum_{i' \notin \{i, k\}} r(i', k) \right]
To begin with, all values for $r$ and $a$ are set to zero, and the calculation of each iterates until convergence. As discussed above, in order to avoid numerical oscillations when updating the messages, the damping factor $\lambda$ is introduced into the iteration process: each message is set to $\lambda$ times its value from the previous iteration plus $(1 - \lambda)$ times its newly computed value.
Mean Shift
MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which
works by updating candidates for centroids to be the mean of the points within a given region. These candidates are
then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
Given a candidate centroid 𝑥𝑖 for iteration 𝑡, the candidate is updated according to the following equation:
x_i^{t+1} = m(x_i^t)
Where 𝑁 (𝑥𝑖 ) is the neighborhood of samples within a given distance around 𝑥𝑖 and 𝑚 is the mean shift vector that
is computed for each centroid that points towards a region of the maximum increase in the density of points. This
is computed using the following equation, effectively updating a centroid to be the mean of the samples within its
neighborhood:
m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i) x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)}
The algorithm automatically sets the number of clusters, instead of relying on a parameter bandwidth, which dictates
the size of the region to search through. This parameter can be set manually, but can be estimated using the provided
estimate_bandwidth function, which is called if the bandwidth is not set.
The algorithm is not highly scalable, as it requires multiple nearest neighbor searches during the execution of the
algorithm. The algorithm is guaranteed to converge, however the algorithm will stop iterating when the change in
centroids is small.
Labelling a new sample is performed by finding the nearest centroid for a given sample.
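A minimal sketch on synthetic blobs (the quantile used to estimate the bandwidth is an illustrative choice):
>>> from sklearn.cluster import MeanShift, estimate_bandwidth
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
>>> bandwidth = estimate_bandwidth(X, quantile=0.2)   # size of the region to search through
>>> ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
>>> centers = ms.cluster_centers_
>>> labels = ms.predict([[0.0, 0.0]])                 # nearest-centroid labelling of a new sample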
Examples:
• A demo of the mean-shift clustering algorithm: Mean Shift clustering on a synthetic 2D datasets with 3
classes.
References:
• “Mean shift: A robust approach toward feature space analysis.” D. Comaniciu and P. Meer, IEEE Transactions
on Pattern Analysis and Machine Intelligence (2002)
Spectral clustering
SpectralClustering performs a low-dimension embedding of the affinity matrix between samples, followed
by clustering, e.g., by KMeans, of the components of the eigenvectors in the low dimensional space. It is especially
computationally efficient if the affinity matrix is sparse and the amg solver is used for the eigenvalue problem (Note,
the amg solver requires that the pyamg module is installed.)
The present version of SpectralClustering requires the number of clusters to be specified in advance. It works well for
a small number of clusters, but is not advised for many clusters.
For two clusters, SpectralClustering solves a convex relaxation of the normalised cuts problem on the similarity graph:
cutting the graph in two so that the weight of the edges cut is small compared to the weights of the edges inside each
cluster. This criterion is especially interesting when working on images, where graph vertices are pixels, and weights
of the edges of the similarity graph are computed using a function of a gradient of the image.
Examples:
• Spectral clustering for image segmentation: Segmenting objects from a noisy background using spectral
clustering.
• Segmenting the picture of greek coins in regions: Spectral clustering to split the image of coins in regions.
Different label assignment strategies can be used, corresponding to the assign_labels parameter of
SpectralClustering. "kmeans" strategy can match finer details, but can be unstable. In particular, unless
you control the random_state, it may not be reproducible from run-to-run, as it depends on random initializa-
tion. The alternative "discretize" strategy is 100% reproducible, but tends to create parcels of fairly even and
geometrical shape.
Spectral Clustering can also be used to partition graphs via their spectral embeddings. In this case, the affinity matrix
is the adjacency matrix of the graph, and SpectralClustering is initialized with affinity='precomputed':
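For instance, a minimal sketch on a hand-made toy graph (the adjacency matrix below, two triangles joined by one edge, is purely illustrative):
>>> import numpy as np
>>> from sklearn.cluster import SpectralClustering
>>> adjacency = np.array([[0, 1, 1, 0, 0, 0],
...                       [1, 0, 1, 0, 0, 0],
...                       [1, 1, 0, 1, 0, 0],
...                       [0, 0, 1, 0, 1, 1],
...                       [0, 0, 0, 1, 0, 1],
...                       [0, 0, 0, 1, 1, 0]])
>>> sc = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0)
>>> labels = sc.fit_predict(adjacency)   # one community label per graph vertex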
Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting
them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique
cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for
more details.
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each
observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the
metric used for the merge strategy:
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in
this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
• Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of clusters.
• Single linkage minimizes the distance between the closest observations of pairs of clusters.
AgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity
matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at
each step all the possible merges.
FeatureAgglomeration
The FeatureAgglomeration uses agglomerative clustering to group together features that look very similar,
thus decreasing the number of features. It is a dimensionality reduction tool, see Unsupervised dimensionality
reduction.
Agglomerative clustering has a “rich get richer” behavior that leads to uneven cluster sizes. In this regard, single linkage
is the worst strategy, and Ward gives the most regular sizes. However, the affinity (or distance used in clustering)
cannot be varied with Ward, thus for non Euclidean metrics, average linkage is a good alternative. Single linkage,
while not robust to noisy data, can be computed very efficiently and can therefore be useful to provide hierarchical
clustering of larger datasets. Single linkage can also perform well on non-globular data.
Examples:
• Various Agglomerative Clustering on a 2D embedding of digits: exploration of the different linkage strategies
in a real dataset.
It’s possible to visualize the tree representing the hierarchical merging of clusters as a dendrogram. Visual inspection
can often be useful for understanding the structure of the data, though more so in the case of small sample sizes.
An interesting aspect of AgglomerativeClustering is that connectivity constraints can be added to this al-
gorithm (only adjacent clusters can be merged together), through a connectivity matrix that defines for each sample
the neighboring samples following a given structure of the data. For instance, in the swiss-roll example below, the
connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming
clusters that extend across overlapping folds of the roll.
These constraints are useful to impose a certain local structure, but they also make the algorithm faster, especially when the number of samples is high.
The connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements only
at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can
be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages
with a link pointing from one to another. It can also be learned from the data, for instance using sklearn.
neighbors.kneighbors_graph to restrict merging to nearest neighbors as in this example, or using sklearn.
feature_extraction.image.grid_to_graph to enable only merging of neighboring pixels on an image,
as in the coin example.
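A minimal sketch of the k-nearest-neighbors variant (the swiss-roll data, number of neighbors and number of clusters are illustrative choices):
>>> from sklearn.cluster import AgglomerativeClustering
>>> from sklearn.neighbors import kneighbors_graph
>>> from sklearn.datasets import make_swiss_roll
>>> X, _ = make_swiss_roll(n_samples=500, random_state=0)
>>> connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
>>> ward = AgglomerativeClustering(n_clusters=6, linkage='ward',
...                                connectivity=connectivity).fit(X)
>>> labels = ward.labels_   # merges are restricted to neighboring samples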
Examples:
• A demo of structured Ward hierarchical clustering on an image of coins: Ward clustering to split the image
of coins in regions.
• Hierarchical clustering: structured vs unstructured ward: Example of Ward algorithm on a swiss-roll, com-
parison of structured approaches versus unstructured approaches.
• Feature agglomeration vs. univariate selection: Example of dimensionality reduction with feature agglomer-
ation based on Ward hierarchical clustering.
• Agglomerative clustering with and without structure
Single, average and complete linkage can be used with a variety of distances (or affinities), in particular Euclidean
distance (l2), Manhattan distance (or Cityblock, or l1), cosine distance, or any precomputed affinity matrix.
• l1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining
using occurrences of rare words.
• cosine distance is interesting because it is invariant to global scalings of the signal.
The guideline for choosing a metric is to use one that maximizes the distance between samples in different classes, and minimizes the distance within each class.
DBSCAN
The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather
generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are
convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in
areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance
measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There
are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say
dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other
samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample
is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core
sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so
on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster
but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.
Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance
from any core sample, is considered an outlier by the algorithm.
While the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and
large data sets it may be desirable to increase this parameter), the parameter eps is crucial to choose appropriately for
the data set and distance function and usually cannot be left at the default value. It controls the local neighborhood of
the points. When chosen too small, most data will not be clustered at all (and labeled as -1 for “noise”). When chosen
too large, it causes close clusters to be merged into one cluster, and eventually the entire data set to be returned as a
single cluster. Some heuristics for choosing this parameter have been discussed in the literature, for example based on
a knee in the nearest neighbor distances plot (as discussed in the references below).
In the figure below, the color indicates cluster membership, with large circles indicating core samples found by the
algorithm. Smaller circles are non-core samples that are still part of a cluster. Moreover, the outliers are indicated by
black points below.
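A minimal sketch on synthetic blobs (the eps and min_samples values below are arbitrary and would normally be tuned to the data):
>>> from sklearn.cluster import DBSCAN
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
>>> db = DBSCAN(eps=0.3, min_samples=10).fit(X)
>>> labels = db.labels_                          # -1 marks samples considered outliers
>>> core_samples = db.core_sample_indices_       # indices of the core samples
>>> n_clusters_found = len(set(labels)) - (1 if -1 in labels else 0)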
Implementation
The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the
same order. However, the results can differ when data is provided in a different order. First, even though the core
samples will always be assigned to the same clusters, the labels of those clusters will depend on the order in which
those samples are encountered in the data. Second and more importantly, the clusters to which non-core samples
are assigned can differ depending on the data order. This would happen when a non-core sample has a distance
lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must
be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned
to whichever cluster is generated first in a pass through the data, and so the results will depend on the data ordering.
The current implementation uses ball trees and kd-trees to determine the neighborhood of points, which avoids
calculating the full distance matrix (as was done in scikit-learn versions before 0.14). The possibility to use custom
metrics is retained; for details, see NearestNeighbors.
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the
case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:
• Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calcu-
lates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n).
• A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precom-
puted in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See
sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
• The dataset can be compressed, either by removing exact duplicates if these occur in your data, or by using
BIRCH. Then you only have a relatively small number of representatives for a large number of points. You
can then provide a sample_weight when fitting DBSCAN.
References:
• “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” Ester, M., H. P.
Kriegel, J. Sander, and X. Xu, In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
• “DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.” Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). In ACM Transactions on Database Systems (TODS), 42(3), 19.
OPTICS
The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization
of DBSCAN that relaxes the eps requirement from a single value to a value range. The key difference between
DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both
a reachability_ distance, and a spot within the cluster ordering_ attribute; these two attributes are assigned
when the model is fitted, and are used to determine cluster membership. If OPTICS is run with the default value of inf
set for max_eps, then DBSCAN style cluster extraction can be performed repeatedly in linear time for any given eps
value using the cluster_optics_dbscan method. Setting max_eps to a lower value will result in shorter run
times, and can be thought of as the maximum neighborhood radius from each point to find other potential reachable
points.
The reachability distances generated by OPTICS allow for variable density extraction of clusters within a single data
set. As shown in the above plot, combining reachability distances and data set ordering_ produces a reachability
plot, where point density is represented on the Y-axis, and points are ordered such that nearby points are adjacent.
‘Cutting’ the reachability plot at a single value produces DBSCAN like results; all points above the ‘cut’ are classified
as noise, and each time that there is a break when reading from left to right signifies a new cluster. The default
cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define
what counts as a steep slope using the parameter xi. There are also other possibilities for analysis on the graph itself,
such as generating hierarchical representations of the data through reachability-plot dendrograms, and the hierarchy
of clusters detected by the algorithm can be accessed through the cluster_hierarchy_ parameter. The plot
above has been color-coded so that cluster colors in planar space match the linear segment clusters of the reachability
plot. Note that the blue and red clusters are adjacent in the reachability plot, and can be hierarchically represented as
children of a larger parent cluster.
The results from OPTICS' cluster_optics_dbscan method and DBSCAN are very similar, but not always identical; specifically, labeling of periphery and noise points. This is in part because the first samples of each dense area processed by OPTICS have a large reachability value while being close to other points in their area, and will thus sometimes be marked as noise rather than periphery. This affects adjacent points when they are considered as candidates for being marked as either periphery or noise.
Computational Complexity
Spatial indexing trees are used to avoid calculating the full distance matrix, and allow for efficient memory usage
on large sets of samples. Different distance metrics can be supplied via the metric keyword.
For large datasets, similar (but not identical) results can be obtained via HDBSCAN. The HDBSCAN implemen-
tation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory
scaling. For extremely large datasets that exhaust system memory using HDBSCAN, OPTICS will maintain n (as
opposed to n^2) memory scaling; however, tuning of the max_eps parameter will likely need to be used to give a
solution in a reasonable amount of wall time.
References:
• “OPTICS: ordering points to identify the clustering structure.” Ankerst, Mihael, Markus M. Breunig, Hans-
Peter Kriegel, and Jörg Sander. In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
Birch
The Birch builds a tree called the Clustering Feature Tree (CFT) for the given data. The data is essentially lossy
compressed to a set of Clustering Feature nodes (CF Nodes). The CF Nodes have a number of subclusters called
Clustering Feature subclusters (CF Subclusters) and these CF Subclusters located in the non-terminal CF Nodes can
have CF Nodes as children.
The CF Subclusters hold the necessary information for clustering which prevents the need to hold the entire input data
in memory. This information includes:
• Number of samples in a subcluster.
• Linear Sum - An n-dimensional vector holding the sum of all samples.
• Squared Sum - Sum of the squared L2 norm of all samples.
• Centroids - To avoid recalculation: linear sum / n_samples.
• Squared norm of the centroids.
The Birch algorithm has two parameters, the threshold and the branching factor. The branching factor limits the
number of subclusters in a node and the threshold limits the distance between the entering sample and the existing
subclusters.
This algorithm can be viewed as an instance or data reduction method, since it reduces the input data to a set of
subclusters which are obtained directly from the leaves of the CFT. This reduced data can be further processed by
feeding it into a global clusterer. This global clusterer can be set by n_clusters. If n_clusters is set to None,
the subclusters from the leaves are directly read off, otherwise a global clustering step labels these subclusters into
global clusters (labels) and the samples are mapped to the global label of the nearest subcluster.
Algorithm description:
• A new sample is inserted into the root of the CF Tree which is a CF Node. It is then merged with the subcluster of
the root, that has the smallest radius after merging, constrained by the threshold and branching factor conditions.
If the subcluster has any child node, then this is done repeatedly till it reaches a leaf. After finding the nearest
subcluster in the leaf, the properties of this subcluster and the parent subclusters are recursively updated.
• If the radius of the subcluster obtained by merging the new sample and the nearest subcluster is greater than
the square of the threshold and if the number of subclusters is greater than the branching factor, then a space is
temporarily allocated to this new sample. The two farthest subclusters are taken and the subclusters are divided
into two groups on the basis of the distance between these subclusters.
• If this split node has a parent subcluster and there is room for a new subcluster, then the parent is split into two.
If there is no room, then this node is again split into two and the process is continued recursively, till it reaches
the root.
Birch or MiniBatchKMeans?
• Birch does not scale very well to high dimensional data. As a rule of thumb if n_features is greater than
twenty, it is generally better to use MiniBatchKMeans.
• If the number of instances of data needs to be reduced, or if one wants a large number of subclusters either as a
preprocessing step or otherwise, Birch is more useful than MiniBatchKMeans.
How to use partial_fit?
To avoid the computation of global clustering, for every call of partial_fit the user is advised
1. To set n_clusters=None initially
2. Train all data by multiple calls to partial_fit.
3. Set n_clusters to a required value using brc.set_params(n_clusters=n_clusters).
4. Call partial_fit finally with no arguments, i.e. brc.partial_fit() which performs the global clus-
tering.
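A minimal sketch of that sequence (random toy data; the chunking and the final number of clusters are arbitrary):
>>> import numpy as np
>>> from sklearn.cluster import Birch
>>> X = np.random.RandomState(0).rand(1000, 2)
>>> brc = Birch(n_clusters=None)                 # 1. no global clustering yet
>>> for chunk in np.array_split(X, 4):           # 2. several calls to partial_fit
...     brc = brc.partial_fit(chunk)
>>> brc = brc.set_params(n_clusters=10)          # 3. choose the number of global clusters
>>> brc = brc.partial_fit()                      # 4. global clustering of the subclusters
>>> labels = brc.predict(X)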
References:
• Tian Zhang, Raghu Ramakrishnan, Miron Livny BIRCH: An efficient data clustering method for large databases. https://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
• Roberto Perdisci JBirch - Java implementation of BIRCH clustering algorithm https://code.google.com/
archive/p/jbirch
Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account, but rather whether this clustering defines separations of the data similar to some ground truth set of classes, or satisfies some assumption such that members belonging to the same class are more similar than members of different classes according to some similarity metric.
Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments
of the same samples labels_pred, the adjusted Rand index is a function that measures the similarity of the two
assignments, ignoring permutations and with chance normalization:
One can permute 0 and 1 in the predicted labels, rename 2 to 3, and get the same score:
Furthermore, adjusted_rand_score is symmetric: swapping the argument does not change the score. It can
thus be used as a consensus measure:
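A doctest-style sketch using a small hand-made labeling (not an example from this guide) illustrating these properties:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> score = metrics.adjusted_rand_score(labels_true, labels_pred)
>>> # permuting 0 and 1 and renaming 2 to 3 in the predicted labels gives the same score
>>> metrics.adjusted_rand_score(labels_true, [1, 1, 0, 0, 3, 3]) == score
True
>>> # the measure is symmetric in its arguments
>>> metrics.adjusted_rand_score(labels_pred, labels_true) == score
True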
Advantages
• Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure for instance).
• Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI,
1.0 is the perfect match score.
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as k-
means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster
with “folded” shapes.
Drawbacks
• Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that
can be used for clustering model selection (TODO).
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments.
Mathematical formulation
If C is a ground truth class assignment and K the clustering, let us define 𝑎 and 𝑏 as:
• 𝑎, the number of pairs of elements that are in the same set in C and in the same set in K
• 𝑏, the number of pairs of elements that are in different sets in C and in different sets in K
The raw (unadjusted) Rand index is then given by:
RI = \frac{a + b}{C_2^{n_{samples}}}

where $C_2^{n_{samples}}$ is the total number of possible pairs in the dataset (without ordering).
However the RI score does not guarantee that random label assignments will get a value close to zero (esp. if the
number of clusters is in the same order of magnitude as the number of samples).
To counter this effect we can discount the expected RI 𝐸[RI] of random labelings by defining the adjusted Rand index
as follows:
ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}
Given the knowledge of the ground truth class assignments labels_true and our clustering algorithm assignments
of the same samples labels_pred, the Mutual Information is a function that measures the agreement of the two
assignments, ignoring permutations. Two different normalized versions of this measure are available, Normalized
Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while
AMI was proposed more recently and is normalized against chance:
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
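A small hand-made sketch, assuming the adjusted_mutual_info_score and normalized_mutual_info_score functions of sklearn.metrics:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
>>> nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)
>>> # relabelling the predicted clusters leaves the score unchanged
>>> metrics.adjusted_mutual_info_score(labels_true, [1, 1, 0, 0, 3, 3]) == ami
True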
Advantages
• Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while
values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label
assignments are equal (with or without permutation).
Drawbacks
• Contrary to inertia, MI-based measures require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
• NMI and MI are not adjusted against chance.
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments. This example also includes the Adjusted Rand Index.
Mathematical formulation
Assume two label assignments (of the same N objects), 𝑈 and 𝑉 . Their entropy is the amount of uncertainty for a
partition set, defined by:
H(U) = - \sum_{i=1}^{|U|} P(i) \log(P(i))
where 𝑃 (𝑖) = |𝑈𝑖 |/𝑁 is the probability that an object picked at random from 𝑈 falls into class 𝑈𝑖 . Likewise for 𝑉 :
H(V) = - \sum_{j=1}^{|V|} P'(j) \log(P'(j))

with $P'(j) = |V_j| / N$. The mutual information (MI) between $U$ and $V$ is calculated by:
MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log\left(\frac{P(i, j)}{P(i) P'(j)}\right)
where 𝑃 (𝑖, 𝑗) = |𝑈𝑖 ∩ 𝑉𝑗 |/𝑁 is the probability that an object picked at random falls into both classes 𝑈𝑖 and 𝑉𝑗 .
It also can be expressed in set cardinality formulation:
MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log\left(\frac{N |U_i \cap V_j|}{|U_i| |V_j|}\right)
Using the expected value, the adjusted mutual information can then be calculated using a similar form to that of the
adjusted Rand index:
AMI = \frac{MI - E[MI]}{\mathrm{mean}(H(U), H(V)) - E[MI]}
For normalized mutual information and adjusted mutual information, the normalizing value is typically some gener-
alized mean of the entropies of each clustering. Various generalized means exist, and no firm rules exist for preferring
one over the others. The decision is made largely on a field-by-field basis; for instance, in community detection, the arithmetic mean is most common. Each normalizing method provides “qualitatively similar behaviours” [YAT2016]. In our
implementation, this is controlled by the average_method parameter.
Vinh et al. (2010) named variants of NMI and AMI by their averaging method [VEB2010]. Their ‘sqrt’ and ‘sum’
averages are the geometric and arithmetic means; we use these more broadly common names.
References
• Strehl, Alexander, and Joydeep Ghosh (2002). “Cluster ensembles – a knowledge reuse frame-
work for combining multiple partitions”. Journal of Machine Learning Research 3: 583–617.
doi:10.1162/153244303321897735.
• Wikipedia entry for the (normalized) Mutual Information
• Wikipedia entry for the Adjusted Mutual Information
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric
using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assign-
ment:
• homogeneity: each cluster contains only members of a single class.
• completeness: all members of a given class are assigned to the same cluster.
We can turn those concepts into scores homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better). Their harmonic mean, called V-measure, is computed by v_measure_score. beta defaults to a value of 1.0; using a value less than 1 for beta, more weight will be attributed to homogeneity, while using a value greater than 1, more weight will be attributed to completeness.
The following clustering assignment is slightly better, since it is homogeneous but not complete:
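For instance, the following hand-made sketch shows an assignment that is perfectly homogeneous but not complete (class 1 is split over two clusters); the beta keyword of v_measure_score is assumed to behave as described above:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 0, 1, 2, 2]      # each predicted cluster holds a single class
>>> h = metrics.homogeneity_score(labels_true, labels_pred)
>>> c = metrics.completeness_score(labels_true, labels_pred)   # below 1.0: class 1 is split
>>> v = metrics.v_measure_score(labels_true, labels_pred)
>>> v_h = metrics.v_measure_score(labels_true, labels_pred, beta=0.6)   # beta < 1 favours homogeneity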
Note: v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments
on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:
homogeneity_score(a, b) == completeness_score(b, a)
Advantages
Drawbacks
• The previously introduced metrics are not normalized with regards to random labeling: this means that
depending on the number of samples, clusters and ground truth classes, a completely random labeling will
not always yield the same values for homogeneity, completeness and hence v-measure. In particular random
labeling won’t yield zero scores especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of
clusters is less than 10. For smaller sample sizes or larger number of clusters it is safer to use an adjusted
index such as the Adjusted Rand Index (ARI).
• These metrics require the knowledge of the ground truth classes while almost never available in practice or
requires manual assignment by human annotators (as in the supervised learning setting).
Examples:
• Adjustment for chance in clustering performance evaluation: Analysis of the impact of the dataset size on the
value of clustering measures for random assignments.
Mathematical formulation
Homogeneity and completeness scores are formally given by:

h = 1 - \frac{H(C|K)}{H(C)}

c = 1 - \frac{H(K|C)}{H(K)}

where $H(C|K)$ is the conditional entropy of the classes given the cluster assignments:

H(C|K) = - \sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \cdot \log\left(\frac{n_{c,k}}{n_k}\right)

and $H(C)$ is the entropy of the classes:

H(C) = - \sum_{c=1}^{|C|} \frac{n_c}{n} \cdot \log\left(\frac{n_c}{n}\right)

with $n$ the total number of samples, $n_c$ and $n_k$ the number of samples respectively belonging to class $c$ and cluster $k$, and finally $n_{c,k}$ the number of samples from class $c$ assigned to cluster $k$.
The conditional entropy of clusters given class 𝐻(𝐾|𝐶) and the entropy of clusters 𝐻(𝐾) are defined in a sym-
metric manner.
Rosenberg and Hirschberg further define V-measure as the harmonic mean of homogeneity and completeness:
v = 2 \cdot \frac{h \cdot c}{h + c}
References
• V-Measure: A conditional entropy-based external cluster evaluation measure Andrew Rosenberg and Julia
Hirschberg, 2007
Fowlkes-Mallows scores
The Fowlkes-Mallows index (sklearn.metrics.fowlkes_mallows_score) can be used when the ground truth class assignments of the samples are known. The Fowlkes-Mallows score FMI is defined as the geometric mean of the pairwise precision and recall:

FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}

where TP is the number of True Positives (i.e. the number of pairs of points that belong to the same cluster in both the true labels and the predicted labels), FP is the number of False Positives (i.e. the number of pairs of points that belong to the same cluster in the predicted labels but not in the true labels) and FN is the number of False Negatives (i.e. the number of pairs of points that belong to the same cluster in the true labels but not in the predicted labels).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
One can permute 0 and 1 in the predicted labels, rename 2 to 3 and get the same score:
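A small hand-made sketch, assuming the fowlkes_mallows_score function of sklearn.metrics:
>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> fmi = metrics.fowlkes_mallows_score(labels_true, labels_pred)
>>> # relabelling the predicted clusters leaves the score unchanged
>>> metrics.fowlkes_mallows_score(labels_true, [1, 1, 0, 0, 3, 3]) == fmi
True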
Advantages
• Random (uniform) label assignments have a FMI score close to 0.0 for any value of n_clusters and
n_samples (which is not the case for raw Mutual Information or the V-measure for instance).
• Upper-bounded at 1: Values close to zero indicate two label assignments that are largely independent, while
values close to one indicate significant agreement. Further, values of exactly 0 indicate purely independent
label assignments and a FMI of exactly 1 indicates that the two label assignments are equal (with or without
permutation).
• No assumption is made on the cluster structure: can be used to compare clustering algorithms such as k-
means which assumes isotropic blob shapes with results of spectral clustering algorithms which can find cluster
with “folded” shapes.
Drawbacks
• Contrary to inertia, FMI-based measures require the knowledge of the ground truth classes while almost
never available in practice or requires manual assignment by human annotators (as in the supervised learning
setting).
References
• E. B. Fowlkes and C. L. Mallows, 1983. “A method for comparing two hierarchical clusterings”. Journal of the American Statistical Association. http://wildfire.stat.ucla.edu/pdflibrary/fowlkes.pdf
• Wikipedia entry for the Fowlkes-Mallows Index
Silhouette Coefficient
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coeffi-
cient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette
Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample
and is composed of two scores:
• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as:
s = \frac{b - a}{\max(a, b)}
The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
In normal usage, the Silhouette Coefficient is applied to the results of a cluster analysis.
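For instance, a sketch in the same spirit as the Calinski-Harabasz example below (the iris data and KMeans model are only illustrative):
>>> from sklearn import metrics
>>> from sklearn.cluster import KMeans
>>> from sklearn import datasets
>>> X, _ = datasets.load_iris(return_X_y=True)
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> score = metrics.silhouette_score(X, labels, metric='euclidean')   # higher is better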
References
• Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster
Analysis”. Computational and Applied Mathematics 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
Advantages
• The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero
indicate overlapping clusters.
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
• The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density
based clusters like those obtained through DBSCAN.
Examples:
• Selecting the number of clusters with silhouette analysis on KMeans clustering : In this example the silhouette
analysis is used to choose an optimal value for n_clusters.
Calinski-Harabasz Index
If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.
calinski_harabasz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the
model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared):
>>> from sklearn import metrics
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
In normal usage, the Calinski-Harabasz index is applied to the results of a cluster analysis:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
561.62...
Advantages
• The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
• The score is fast to compute.
Drawbacks
• The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as
density based clusters like those obtained through DBSCAN.
Mathematical formulation
For a set of data 𝐸 of size 𝑛𝐸 which has been clustered into 𝑘 clusters, the Calinski-Harabasz score 𝑠 is defined as the
ratio of the between-clusters dispersion mean and the within-cluster dispersion:
s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}
where tr(𝐵𝑘 ) is trace of the between group dispersion matrix and tr(𝑊𝑘 ) is the trace of the within-cluster dispersion
matrix defined by:
W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T

B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^T
with 𝐶𝑞 the set of points in cluster 𝑞, 𝑐𝑞 the center of cluster 𝑞, 𝑐𝐸 the center of 𝐸, and 𝑛𝑞 the number of points in
cluster 𝑞.
References
• Caliński, T., & Harabasz, J. (1974). “A Dendrite Method for Cluster Analysis”. Communications in Statistics-
theory and Methods 3: 1-27. doi:10.1080/03610927408827101.
Davies-Bouldin Index
If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.
davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates
to a model with better separation between the clusters.
This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the
distance between clusters with the size of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better partition.
In normal usage, the Davies-Bouldin index is applied to the results of a cluster analysis as follows:
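For illustration, a minimal sketch on the iris dataset (the exact score shown is indicative):
>>> from sklearn import datasets
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_score
>>> X, _ = datasets.load_iris(return_X_y=True)
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans.labels_
>>> davies_bouldin_score(X, labels)
0.6...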
Advantages
Drawbacks
• The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as
density based clusters like those obtained from DBSCAN.
• The usage of centroid distance limits the distance metric to Euclidean space.
Mathematical formulation
The index is defined as the average similarity between each cluster 𝐶𝑖 for 𝑖 = 1, ..., 𝑘 and its most similar one 𝐶𝑗 . In
the context of this index, similarity is defined as a measure 𝑅𝑖𝑗 that trades off:
• 𝑠𝑖 , the average distance between each point of cluster 𝑖 and the centroid of that cluster – also known as the
cluster diameter.
• 𝑑𝑖𝑗 , the distance between cluster centroids 𝑖 and 𝑗.
A simple choice to construct R_{ij} so that it is nonnegative and symmetric is:

R_{ij} = \frac{s_i + s_j}{d_{ij}}
Then the Davies-Bouldin index is defined as:
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}
References
• Davies, David L.; Bouldin, Donald W. (1979). “A Cluster Separation Measure” IEEE Transactions on Pattern
Analysis and Machine Intelligence. PAMI-1 (2): 224-227. doi:10.1109/TPAMI.1979.4766909.
• Halkidi, Maria; Batistakis, Yannis; Vazirgiannis, Michalis (2001). “On Clustering Validation Techniques”
Journal of Intelligent Information Systems, 17(2-3), 107-145. doi:10.1023/A:1012801612483.
• Wikipedia entry for Davies-Bouldin index.
Contingency Matrix
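As a sketch of how such a table can be computed with sklearn.metrics.cluster.contingency_matrix, consider six samples with true clusters “a” and “b” and predicted clusters 0, 1 and 2; the description below refers to this output:
>>> from sklearn.metrics.cluster import contingency_matrix
>>> x = ["a", "a", "a", "b", "b", "b"]
>>> y = [0, 0, 1, 1, 2, 2]
>>> contingency_matrix(x, y)
array([[2, 1, 0],
       [0, 1, 2]])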
The first row of the output array indicates that there are three samples whose true cluster is “a”. Of them, two are in
predicted cluster 0, one is in 1, and none is in 2. And the second row indicates that there are three samples whose true
cluster is “b”. Of them, none is in predicted cluster 0, one is in 1 and two are in 2.
A confusion matrix for classification is a square contingency matrix where the order of rows and columns correspond
to a list of classes.
Advantages
• Allows examining the spread of each true cluster across predicted clusters and vice versa.
• The contingency table calculated is typically utilized in the calculation of a similarity statistic (like the others
listed in this document) between the two clusterings.
Drawbacks
• Contingency matrix is easy to interpret for a small number of clusters, but becomes very hard to interpret for a
large number of clusters.
• It doesn’t give a single metric to use as an objective for clustering optimisation.
References
4.2.4 Biclustering
Biclustering can be performed with the module sklearn.cluster.bicluster. Biclustering algorithms simul-
taneously cluster rows and columns of a data matrix. These clusters of rows and columns are known as biclusters.
Each determines a submatrix of the original data matrix with some desired properties.
For instance, given a matrix of shape (10, 10), one possible bicluster with three rows and two columns induces a
submatrix of shape (3, 2):
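For illustration, a minimal NumPy sketch of such a bicluster (the row and column indices are hypothetical):
>>> import numpy as np
>>> data = np.arange(100).reshape(10, 10)
>>> rows = np.array([0, 2, 3])[:, np.newaxis]   # three row indices
>>> columns = np.array([1, 2])                  # two column indices
>>> data[rows, columns]                         # the induced (3, 2) submatrix
array([[ 1,  2],
       [21, 22],
       [31, 32]])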
For visualization purposes, given a bicluster, the rows and columns of the data matrix may be rearranged to make the
bicluster contiguous.
Algorithms differ in how they define biclusters. Some of the common types include:
• constant values, constant rows, or constant columns
• unusually high or low values
• submatrices with low variance
• correlated rows or columns
Algorithms also differ in how rows and columns may be assigned to biclusters, which leads to different bicluster
structures. Block diagonal or checkerboard structures occur when rows and columns are divided into partitions.
If each row and each column belongs to exactly one bicluster, then rearranging the rows and columns of the data matrix
reveals the biclusters on the diagonal. Here is an example of this structure where biclusters have higher average values
than the other rows and columns:
In the checkerboard case, each row belongs to all column clusters, and each column belongs to all row clusters. Here
is an example of this structure where the variance of the values within each bicluster is small:
After fitting a model, row and column cluster membership can be found in the rows_ and columns_ attributes.
rows_[i] is a binary vector with nonzero entries corresponding to rows that belong to bicluster i. Similarly,
columns_[i] indicates which columns belong to bicluster i.
Some models also have row_labels_ and column_labels_ attributes. These models partition the rows and
columns, such as in the block diagonal and checkerboard bicluster structures.
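For illustration, a minimal sketch of these attributes (assuming SpectralCoclustering is importable from sklearn.cluster and using synthetic data from make_biclusters):
>>> from sklearn.cluster import SpectralCoclustering
>>> from sklearn.datasets import make_biclusters
>>> data, rows, columns = make_biclusters(shape=(30, 30), n_clusters=3, random_state=0)
>>> model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)
>>> model.rows_.shape         # one boolean row-membership vector per bicluster
(3, 30)
>>> model.row_labels_.shape   # this model also partitions the rows
(30,)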
Note: Biclustering has many other names in different fields including co-clustering, two-mode clustering, two-way
clustering, block clustering, coupled two-way clustering, etc. The names of some algorithms, such as the Spectral
Co-Clustering algorithm, reflect these alternate names.
Spectral Co-Clustering
The SpectralCoclustering algorithm finds biclusters with values higher than those in the corresponding other
rows and columns. Each row and each column belongs to exactly one bicluster, so rearranging the rows and columns
to make partitions contiguous reveals these high values along the diagonal:
Note: The algorithm treats the input data matrix as a bipartite graph: the rows and columns of the matrix correspond
to the two sets of vertices, and each entry corresponds to an edge between a row and a column. The algorithm
approximates the normalized cut of this graph to find heavy subgraphs.
Mathematical formulation
An approximate solution to the optimal normalized cut may be found via the generalized eigenvalue decomposition of
the Laplacian of the graph. Usually this would mean working directly with the Laplacian matrix. If the original data
matrix 𝐴 has shape 𝑚 × 𝑛, the Laplacian matrix for the corresponding bipartite graph has shape (𝑚 + 𝑛) × (𝑚 + 𝑛).
However, in this case it is possible to work directly with 𝐴, which is smaller and more efficient.
The input matrix 𝐴 is preprocessed as follows:
A_n = R^{-1/2} A C^{-1/2}

where R is the diagonal matrix with entry i equal to \sum_j A_{ij} and C is the diagonal matrix with entry j equal to
\sum_i A_{ij}.
The singular value decomposition, 𝐴𝑛 = 𝑈 Σ𝑉 ⊤ , provides the partitions of the rows and columns of 𝐴. A subset of
the left singular vectors gives the row partitions, and a subset of the right singular vectors gives the column partitions.
The ℓ = ⌈log2 𝑘⌉ singular vectors, starting from the second, provide the desired partitioning information. They are
used to form the matrix 𝑍:
Z = \begin{bmatrix} R^{-1/2} U \\ C^{-1/2} V \end{bmatrix}
Examples:
• A demo of the Spectral Co-Clustering algorithm: A simple example showing how to generate a data matrix
with biclusters and apply this method to it.
• Biclustering documents with the Spectral Co-clustering algorithm: An example of finding biclusters in the
twenty newsgroup dataset.
References:
• Dhillon, Inderjit S, 2001. Co-clustering documents and words using bipartite spectral graph partitioning.
Spectral Biclustering
The SpectralBiclustering algorithm assumes that the input data matrix has a hidden checkerboard structure.
The rows and columns of a matrix with this structure may be partitioned so that the entries of any bicluster in the
Cartesian product of row clusters and column clusters are approximately constant. For instance, if there are two row
partitions and three column partitions, each row will belong to three biclusters, and each column will belong to two
biclusters.
The algorithm partitions the rows and columns of a matrix so that a corresponding blockwise-constant checkerboard
matrix provides a good approximation to the original matrix.
Mathematical formulation
The input matrix 𝐴 is first normalized to make the checkerboard pattern more obvious. There are three possible
methods:
1. Independent row and column normalization, as in Spectral Co-Clustering. This method makes the rows sum to
a constant and the columns sum to a different constant.
2. Bistochastization: repeated row and column normalization until convergence. This method makes both rows
and columns sum to the same constant.
3. Log normalization: the log of the data matrix is computed: 𝐿 = log 𝐴. Then the column mean 𝐿𝑖· , row mean
𝐿·𝑗 , and overall mean 𝐿·· of 𝐿 are computed. The final matrix is computed according to the formula
𝐾𝑖𝑗 = 𝐿𝑖𝑗 − 𝐿𝑖· − 𝐿·𝑗 + 𝐿··
After normalizing, the first few singular vectors are computed, just as in the Spectral Co-Clustering algorithm.
If log normalization was used, all the singular vectors are meaningful. However, if independent normalization or
bistochastization were used, the first singular vectors, 𝑢1 and 𝑣1 , are discarded. From now on, the “first” singular
vectors refer to 𝑢2 . . . 𝑢𝑝+1 and 𝑣2 . . . 𝑣𝑝+1 , except in the case of log normalization.
The singular vectors are then ranked according to how well they can be approximated by a piecewise-constant
vector. The approximations for each vector are found using one-dimensional k-means and scored using the Euclidean
distance. Some subset of the best left and right singular vectors is selected. Next, the data is projected to this best
subset of singular vectors and clustered.
For instance, if 𝑝 singular vectors were calculated, the 𝑞 best are found as described, where 𝑞 < 𝑝. Let 𝑈 be the matrix
with columns the 𝑞 best left singular vectors, and similarly 𝑉 for the right. To partition the rows, the rows of 𝐴 are
projected to a 𝑞 dimensional space: 𝐴 * 𝑉 . Treating the 𝑚 rows of this 𝑚 × 𝑞 matrix as samples and clustering using
k-means yields the row labels. Similarly, projecting the columns to 𝐴⊤ * 𝑈 and clustering this 𝑛 × 𝑞 matrix yields the
column labels.
Examples:
• A demo of the Spectral Biclustering algorithm: a simple example showing how to generate a checkerboard
matrix and bicluster it.
References:
• Kluger, Yuval, et. al., 2003. Spectral biclustering of microarray data: coclustering genes and conditions.
Biclustering evaluation
There are two ways of evaluating a biclustering result: internal and external. Internal measures, such as cluster
stability, rely only on the data and the result themselves. Currently there are no internal bicluster measures in scikit-
learn. External measures refer to an external source of information, such as the true solution. When working with
real data the true solution is usually unknown, but biclustering artificial data may be useful for evaluating algorithms
precisely because the true solution is known.
To compare a set of found biclusters to the set of true biclusters, two similarity measures are needed: a similarity
measure for individual biclusters, and a way to combine these individual similarities into an overall score.
To compare individual biclusters, several measures have been used. For now, only the Jaccard index is implemented:
J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

where 𝐴 and 𝐵 are biclusters, and |𝐴 ∩ 𝐵| is the number of elements in their intersection. The Jaccard index achieves its
minimum of 0 when the biclusters do not overlap at all and its maximum of 1 when they are identical.
Several methods have been developed to compare two sets of biclusters. For now, only consensus_score (Hochre-
iter et. al., 2010) is available:
1. Compute bicluster similarities for pairs of biclusters, one in each set, using the Jaccard index or a similar
measure.
2. Assign biclusters from one set to another in a one-to-one fashion to maximize the sum of their similarities. This
step is performed using the Hungarian algorithm.
3. The final sum of similarities is divided by the size of the larger set.
The minimum consensus score, 0, occurs when all pairs of biclusters are totally dissimilar. The maximum score, 1,
occurs when both sets are identical.
References:
• Hochreiter, Bodenhofer, et. al., 2010. FABIA: factor analysis for bicluster acquisition.
PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum
amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns 𝑛 components in its
fit method, and can be used on new data to project it on these components.
PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter
whiten=True makes it possible to project the data onto the singular space while scaling each component to unit
variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this
is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm.
Below is an example of the iris dataset, which is comprised of 4 features, projected on the 2 dimensions that explain
most variance:
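A minimal sketch of this projection (the figure itself is not reproduced here; the explained variance ratios shown are indicative):
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA
>>> X, _ = load_iris(return_X_y=True)
>>> pca = PCA(n_components=2)
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(150, 2)
>>> pca.explained_variance_ratio_
array([0.92..., 0.05...])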
The PCA object also provides a probabilistic interpretation of the PCA that can give a likelihood of data based on the
amount of variance it explains. As such it implements a score method that can be used in cross-validation:
Examples:
Incremental PCA
The PCA object is very useful, but has certain limitations for large datasets. The biggest limitation is that PCA only sup-
ports batch processing, which means all of the data to be processed must fit in main memory. The IncrementalPCA
object uses a different form of processing and allows for partial computations which almost exactly match the results
of PCA while processing the data in a minibatch fashion. IncrementalPCA makes it possible to implement out-of-
core Principal Component Analysis either by:
• Using its partial_fit method on chunks of data fetched sequentially from the local hard drive or a network
database.
• Calling its fit method on a sparse matrix or a memory mapped file using numpy.memmap.
IncrementalPCA only stores estimates of component and noise variances, in order to update
explained_variance_ratio_ incrementally. This is why memory usage depends on the number of
samples per batch, rather than the number of samples to be processed in the dataset.
As in PCA, IncrementalPCA centers but does not scale the input data for each feature before applying the SVD.
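For illustration, a minimal out-of-core sketch using partial_fit on successive chunks (here the chunks are simply slices of an in-memory array):
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import IncrementalPCA
>>> X, _ = load_iris(return_X_y=True)
>>> ipca = IncrementalPCA(n_components=2)
>>> for batch in np.array_split(X, 15):
...     ipca = ipca.partial_fit(batch)   # feed the data chunk by chunk
>>> ipca.transform(X).shape
(150, 2)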
Examples:
• Incremental PCA
It is often interesting to project data to a lower-dimensional space that preserves most of the variance, by dropping the
singular vectors of components associated with lower singular values.
For instance, if we work with 64x64 pixel gray-level pictures for face recognition, the dimensionality of the data is
4096 and it is slow to train an RBF support vector machine on such wide data. Furthermore we know that the intrinsic
dimensionality of the data is much lower than 4096 since all pictures of human faces look somewhat alike. The
samples lie on a manifold of much lower dimension (say around 200 for instance). The PCA algorithm can be used to
linearly transform the data while both reducing the dimensionality and preserving most of the explained variance at
the same time.
The class PCA used with the optional parameter svd_solver='randomized' is very useful in that case: since
we are going to drop most of the singular vectors it is much more efficient to limit the computation to an approximated
estimate of the singular vectors we will keep to actually perform the transform.
For instance, the following shows 16 sample portraits (centered around 0.0) from the Olivetti dataset. On the right
hand side are the first 16 singular vectors reshaped as portraits. Since we only require the top 16 singular vectors of a
dataset with size n_samples = 400 and n_features = 64 × 64 = 4096, the computation time is less than 1s:
If we note n_max = max(n_samples, n_features) and n_min = min(n_samples, n_features), the time complexity of the random-
ized PCA is O(n_max^2 · n_components) instead of O(n_max^2 · n_min) for the exact method implemented in PCA.
The memory footprint of randomized PCA is also proportional to 2 · n_max · n_components instead of n_max · n_min for the
exact method.
Note: the implementation of inverse_transform in PCA with svd_solver='randomized' is not the exact
inverse transform of transform even when whiten=False (default).
Examples:
References:
• “Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decomposi-
tions” Halko, et al., 2009
Kernel PCA
KernelPCA is an extension of PCA which achieves non-linear dimensionality reduction through the use of
kernels (see Pairwise metrics, Affinities and Kernels). It has many applications including denoising, compres-
sion and structured prediction (kernel dependency estimation). KernelPCA supports both transform and
inverse_transform.
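For illustration, a minimal sketch on a toy dataset of concentric circles (the kernel and gamma values are illustrative choices):
>>> from sklearn.datasets import make_circles
>>> from sklearn.decomposition import KernelPCA
>>> X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
>>> kpca = KernelPCA(kernel="rbf", gamma=10, fit_inverse_transform=True)
>>> X_kpca = kpca.fit_transform(X)
>>> X_back = kpca.inverse_transform(X_kpca)
>>> X_back.shape
(400, 2)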
Examples:
• Kernel PCA
SparsePCA is a variant of PCA, with the goal of extracting the set of sparse components that best reconstruct the
data.
Mini-batch sparse PCA (MiniBatchSparsePCA) is a variant of SparsePCA that is faster but less accurate. The
increased speed is reached by iterating over small chunks of the set of features, for a given number of iterations.
Principal component analysis (PCA) has the disadvantage that the components extracted by this method have exclu-
sively dense expressions, i.e. they have non-zero coefficients when expressed as linear combinations of the original
variables. This can make interpretation difficult. In many cases, the real underlying components can be more naturally
imagined as sparse vectors; for example in face recognition, components might naturally map to parts of faces.
Sparse principal components yield a more parsimonious, interpretable representation, clearly emphasizing which of
the original features contribute to the differences between samples.
The following example illustrates 16 components extracted using sparse PCA from the Olivetti faces dataset. It can
be seen how the regularization term induces many zeros. Furthermore, the natural structure of the data causes the
non-zero coefficients to be vertically adjacent. The model does not enforce this mathematically: each component is
a vector ℎ ∈ R4096 , and there is no notion of vertical adjacency except during the human-friendly visualization as
64x64 pixel images. The fact that the components shown below appear local is the effect of the inherent structure of
the data, which makes such local patterns minimize reconstruction error. There exist sparsity-inducing norms that take
into account adjacency and different kinds of structure; see [Jen09] for a review of such methods. For more details on
how to use Sparse PCA, see the Examples section, below.
Note that there are many different formulations for the Sparse PCA problem. The one implemented here is based
on [Mrl09] . The optimization problem solved is a PCA problem (dictionary learning) with an ℓ1 penalty on the
components:
(U^*, V^*) = \underset{U, V}{\arg\min\,} \frac{1}{2} ||X - UV||_2^2 + \alpha ||V||_1

subject to ||U_k||_2 = 1 for all 0 ≤ k < n_components
The sparsity-inducing ℓ1 norm also prevents learning components from noise when few training samples are available.
The degree of penalization (and thus sparsity) can be adjusted through the hyperparameter alpha. Small values lead
to a gently regularized factorization, while larger values shrink many coefficients to zero.
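For illustration, a minimal sketch on synthetic data (the fraction of exactly-zero loadings shown is indicative):
>>> import numpy as np
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.decomposition import SparsePCA
>>> X, _ = make_friedman1(n_samples=200, n_features=30, random_state=0)
>>> spca = SparsePCA(n_components=5, alpha=1, random_state=0)
>>> spca = spca.fit(X)
>>> spca.components_.shape
(5, 30)
>>> np.mean(spca.components_ == 0)   # a large fraction of the loadings is exactly zero
0.9...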
Note: While in the spirit of an online algorithm, the class MiniBatchSparsePCA does not implement
partial_fit because the algorithm is online along the features direction, not the samples direction.
Examples:
References:
TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the 𝑘 largest
singular values, where 𝑘 is a user-specified parameter.
When truncated SVD is applied to term-document matrices (as returned by CountVectorizer or
TfidfVectorizer), this transformation is known as latent semantic analysis (LSA), because it transforms such
matrices to a “semantic” space of low dimensionality. In particular, LSA is known to combat the effects of synonymy
and polysemy (both of which roughly mean there are multiple meanings per word), which cause term-document ma-
trices to be overly sparse and exhibit poor similarity under measures such as cosine similarity.
Note: LSA is also known as latent semantic indexing, LSI, though strictly that refers to its use in persistent indexes
for information retrieval purposes.
𝑋 ≈ 𝑋𝑘 = 𝑈𝑘 Σ𝑘 𝑉𝑘⊤
𝑋 ′ = 𝑋𝑉𝑘
Note: Most treatments of LSA in the natural language processing (NLP) and information retrieval (IR) literature
swap the axes of the matrix 𝑋 so that it has shape n_features × n_samples. We present LSA in a different way
that matches the scikit-learn API better, but the singular values found are the same.
TruncatedSVD is very similar to PCA, but differs in that it works on sample matrices 𝑋 directly instead of their
covariance matrices. When the columnwise (per-feature) means of 𝑋 are subtracted from the feature values, truncated
SVD on the resulting matrix is equivalent to PCA. In practical terms, this means that the TruncatedSVD transformer
accepts scipy.sparse matrices without the need to densify them, as densifying may fill up memory even for
medium-sized document collections.
While the TruncatedSVD transformer works with any (sparse) feature matrix, using it on tf–idf matrices is recom-
mended over raw frequency counts in an LSA/document processing setting. In particular, sublinear scaling and inverse
document frequency should be turned on (sublinear_tf=True, use_idf=True) to bring the feature values
closer to a Gaussian distribution, compensating for LSA’s erroneous assumptions about textual data.
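For illustration, a minimal LSA sketch on a toy corpus (the corpus and pipeline are illustrative; the Normalizer step is a common, optional post-processing choice):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import Normalizer
>>> corpus = ["the cat sat on the mat",
...           "the dog sat on the log",
...           "cats and dogs are great"]
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
>>> lsa = make_pipeline(vectorizer, TruncatedSVD(n_components=2), Normalizer(copy=False))
>>> lsa.fit_transform(corpus).shape
(3, 2)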
Examples:
References:
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008), Introduction to Information Re-
trieval, Cambridge University Press, chapter 18: Matrix decompositions & latent semantic indexing
Dictionary Learning
The SparseCoder object is an estimator that can be used to transform signals into sparse linear combination of
atoms from a fixed, precomputed dictionary such as a discrete wavelet basis. This object therefore does not implement
a fit method. The transformation amounts to a sparse coding problem: finding a representation of the data as a linear
combination of as few dictionary atoms as possible. All variations of dictionary learning implement the following
transform methods, controllable via the transform_method initialization parameter:
• Orthogonal matching pursuit (Orthogonal Matching Pursuit (OMP))
• Least-angle regression (Least Angle Regression)
• Lasso computed by least-angle regression
• Lasso using coordinate descent (Lasso)
• Thresholding
Thresholding is very fast but it does not yield accurate reconstructions. Thresholded codes have nevertheless been
shown in the literature to be useful for classification tasks. For image reconstruction tasks, orthogonal matching
pursuit yields the most accurate, unbiased reconstruction.
The dictionary learning objects offer, via the split_code parameter, the possibility to separate the positive and
negative values in the results of sparse coding. This is useful when dictionary learning is used for extracting features
that will be used for supervised learning, because it allows the learning algorithm to assign different weights to negative
loadings of a particular atom than to the corresponding positive loading.
The split code for a single sample has length 2 * n_components and is constructed using the following rule:
First, the regular code of length n_components is computed. Then, the first n_components entries of the
split_code are filled with the positive part of the regular code vector. The second half of the split code is filled
with the negative part of the code vector, only with a positive sign. Therefore, the split_code is non-negative.
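For illustration, a minimal sketch of sparse coding against a fixed random dictionary; the split behaviour described above is assumed here to be exposed through SparseCoder's split_sign constructor argument:
>>> import numpy as np
>>> from sklearn.decomposition import SparseCoder
>>> rng = np.random.RandomState(0)
>>> D = rng.randn(4, 6)                              # 4 atoms in a 6-dimensional space
>>> D /= np.linalg.norm(D, axis=1, keepdims=True)    # atoms are usually unit-normed
>>> X = rng.randn(2, 6)
>>> coder = SparseCoder(dictionary=D, transform_algorithm='omp',
...                     transform_n_nonzero_coefs=2, split_sign=True)
>>> coder.transform(X).shape       # split code: 2 * n_components entries per sample
(2, 8)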
Examples:
Dictionary learning (DictionaryLearning) is a matrix factorization problem that amounts to finding a (usually
overcomplete) dictionary that will perform well at sparsely encoding the fitted data.
Representing data as sparse combinations of atoms from an overcomplete dictionary is suggested to be the way the
mammalian primary visual cortex works. Consequently, dictionary learning applied on image patches has been shown
to give good results in image processing tasks such as image completion, inpainting and denoising, as well as for
supervised recognition tasks.
Dictionary learning is an optimization problem solved by alternatively updating the sparse code, as a solution to
multiple Lasso problems, considering the dictionary fixed, and then updating the dictionary to best fit the sparse code.
(U^*, V^*) = \underset{U, V}{\arg\min\,} \frac{1}{2} ||X - UV||_2^2 + \alpha ||U||_1

subject to ||V_k||_2 = 1 for all 0 ≤ k < n_atoms
After using such a procedure to fit the dictionary, the transform is simply a sparse coding step that shares the same
implementation with all dictionary learning objects (see Sparse coding with a precomputed dictionary).
It is also possible to constrain the dictionary and/or code to be positive to match constraints that may be present in the
data. Below are the faces with different positivity constraints applied. Red indicates negative values, blue indicates
positive values.
The following image shows what a dictionary learned from 4x4 pixel image patches, extracted from part of the image
of a raccoon face, looks like.
Examples:
References:
• “Online dictionary learning for sparse coding” J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009
MiniBatchDictionaryLearning implements a faster, but less accurate version of the dictionary learning algo-
rithm that is better suited for large datasets.
By default, MiniBatchDictionaryLearning divides the data into mini-batches and optimizes in an online
manner by cycling over the mini-batches for the specified number of iterations. However, at the moment it does not
implement a stopping condition.
The estimator also implements partial_fit, which updates the dictionary by iterating only once over a mini-batch.
This can be used for online learning when the data is not readily available from the start, or for when the data does not
fit into the memory.
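For illustration, a minimal sketch of such online usage on synthetic mini-batches (the data here is random and purely illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import MiniBatchDictionaryLearning
>>> rng = np.random.RandomState(0)
>>> dico = MiniBatchDictionaryLearning(n_components=10, alpha=1, random_state=0)
>>> for _ in range(5):                      # stream mini-batches as they arrive
...     X_batch = rng.randn(50, 20)
...     dico = dico.partial_fit(X_batch)
>>> dico.components_.shape
(10, 20)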
Note that when using dictionary learning to extract a representation (e.g. for sparse coding) clustering can be a
good proxy to learn the dictionary. For instance the MiniBatchKMeans estimator is computationally efficient
and implements on-line learning with a partial_fit method.
Example: Online learning of a dictionary of parts of faces
Factor Analysis
In unsupervised learning we only have a dataset 𝑋 = {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 }. How can this dataset be described mathemat-
ically? A very simple continuous latent variable model for 𝑋 is
𝑥𝑖 = 𝑊 ℎ𝑖 + 𝜇 + 𝜖
The vector ℎ𝑖 is called “latent” because it is unobserved. 𝜖 is considered a noise term distributed according to a
Gaussian with mean 0 and covariance Ψ (i.e. 𝜖 ∼ 𝒩 (0, Ψ)), 𝜇 is some arbitrary offset vector. Such a model is called
“generative” as it describes how 𝑥𝑖 is generated from ℎ𝑖 . If we use all the 𝑥𝑖 ’s as columns to form a matrix X and all
the ℎ𝑖 ’s as columns of a matrix H then we can write (with suitably defined M and E):
X = 𝑊H + M + E
𝑝(𝑥𝑖 |ℎ𝑖 ) = 𝒩 (𝑊 ℎ𝑖 + 𝜇, Ψ)
For a complete probabilistic model we also need a prior distribution for the latent variable ℎ. The most straightforward
assumption (based on the nice properties of the Gaussian distribution) is ℎ ∼ 𝒩 (0, I). This yields a Gaussian as the
marginal distribution of 𝑥:
𝑝(𝑥) = 𝒩 (𝜇, 𝑊 𝑊 𝑇 + Ψ)
Now, without any further assumptions the idea of having a latent variable ℎ would be superfluous – 𝑥 can be com-
pletely modelled with a mean and a covariance. We need to impose some more specific structure on one of these two
parameters. A simple additional assumption regards the structure of the error covariance Ψ:
The main advantage for Factor Analysis over PCA is that it can model the variance in every direction of the input space
independently (heteroscedastic noise):
This allows better model selection than probabilistic PCA in the presence of heteroscedastic noise:
Examples:
Independent component analysis separates a multivariate signal into additive subcomponents that are maximally in-
dependent. It is implemented in scikit-learn using the Fast ICA algorithm. Typically, ICA is not used for reducing
dimensionality but for separating superimposed signals. Since the ICA model does not include a noise term, for the
model to be correct, whitening must be applied. This can be done internally using the whiten argument or manually
using one of the PCA variants.
It is classically used to separate mixed signals (a problem known as blind source separation), as in the example below:
ICA can also be used as yet another non linear decomposition that finds components with some sparsity:
Examples:
NMF 1 is an alternative approach to decomposition that assumes that the data and the components are non-negative.
NMF can be plugged in instead of PCA or its variants, in the cases where the data matrix does not contain negative
values. It finds a decomposition of samples 𝑋 into two matrices 𝑊 and 𝐻 of non-negative elements, by optimizing the
distance 𝑑 between 𝑋 and the matrix product 𝑊 𝐻. The most widely used distance function is the squared Frobenius
norm, which is an obvious extension of the Euclidean norm to matrices:
d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} ||X - Y||_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - Y_{ij})^2
Unlike PCA, the representation of a vector is obtained in an additive fashion, by superimposing the components,
without subtracting. Such additive models are efficient for representing images and text.
It has been observed in [Hoyer, 2004]2 that, when carefully constrained, NMF can produce a parts-based representation
of the dataset, resulting in interpretable models. The following example displays 16 sparse components found by NMF
from the images in the Olivetti faces dataset, in comparison with the PCA eigenfaces.
1 “Learning the parts of objects by non-negative matrix factorization” D. Lee, S. Seung, 1999
2 “Non-negative Matrix Factorization with Sparseness Constraints” P. Hoyer, 2004
The init attribute determines the initialization method applied, which has a great impact on the performance of the
method. NMF implements the method Nonnegative Double Singular Value Decomposition. NNDSVD4 is based on
two SVD processes, one approximating the data matrix, the other approximating positive sections of the resulting
partial SVD factors utilizing an algebraic property of unit rank matrices. The basic NNDSVD algorithm is better fit
for sparse factorization. Its variants NNDSVDa (in which all zeros are set equal to the mean of all elements of the
data), and NNDSVDar (in which the zeros are set to random perturbations less than the mean of the data divided by
100) are recommended in the dense case.
Note that the Multiplicative Update (‘mu’) solver cannot update zeros present in the initialization, so it leads to poorer
results when used jointly with the basic NNDSVD algorithm which introduces a lot of zeros; in this case, NNDSVDa
or NNDSVDar should be preferred.
NMF can also be initialized with correctly scaled random non-negative matrices by setting init="random". An
integer seed or a RandomState can also be passed to random_state to control reproducibility.
In NMF, L1 and L2 priors can be added to the loss function in order to regularize the model. The L2 prior uses the
Frobenius norm, while the L1 prior uses an elementwise L1 norm. As in ElasticNet, we control the combination
of L1 and L2 with the l1_ratio (𝜌) parameter, and the intensity of the regularization with the alpha (𝛼) parameter.
Then the priors terms are:
\alpha \rho ||W||_1 + \alpha \rho ||H||_1 + \frac{\alpha(1 - \rho)}{2} ||W||_{\mathrm{Fro}}^2 + \frac{\alpha(1 - \rho)}{2} ||H||_{\mathrm{Fro}}^2

and the regularized objective function is:

d_{\mathrm{Fro}}(X, WH) + \alpha \rho ||W||_1 + \alpha \rho ||H||_1 + \frac{\alpha(1 - \rho)}{2} ||W||_{\mathrm{Fro}}^2 + \frac{\alpha(1 - \rho)}{2} ||H||_{\mathrm{Fro}}^2

4 “SVD based initialization: A head start for nonnegative matrix factorization” C. Boutsidis, E. Gallopoulos, 2008
NMF regularizes both W and H. The public function non_negative_factorization allows a finer control
through the regularization attribute, and may regularize only W, only H, or both.
As described previously, the most widely used distance function is the squared Frobenius norm, which is an obvious
extension of the Euclidean norm to matrices:
d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} ||X - Y||_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - Y_{ij})^2
Other distance functions can be used in NMF as, for example, the (generalized) Kullback-Leibler (KL) divergence,
also referred as I-divergence:
d_{KL}(X, Y) = \sum_{i,j} \left( X_{ij} \log\left(\frac{X_{ij}}{Y_{ij}}\right) - X_{ij} + Y_{ij} \right)

Or the Itakura-Saito (IS) divergence:

d_{IS}(X, Y) = \sum_{i,j} \left( \frac{X_{ij}}{Y_{ij}} - \log\left(\frac{X_{ij}}{Y_{ij}}\right) - 1 \right)

These three distances are special cases of the beta-divergence family, with 𝛽 = 2, 1, 0 respectively6 . The beta-divergence is defined by:
d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)} \left( X_{ij}^{\beta} + (\beta - 1) Y_{ij}^{\beta} - \beta X_{ij} Y_{ij}^{\beta - 1} \right)
Note that this definition is not valid if 𝛽 ∈ (0; 1), yet it can be continuously extended to the definitions of 𝑑𝐾𝐿 and
𝑑𝐼𝑆 respectively.
NMF implements two solvers, using Coordinate Descent (‘cd’)5 , and Multiplicative Update (‘mu’)6 . The ‘mu’ solver
can optimize every beta-divergence, including of course the Frobenius norm (𝛽 = 2), the (generalized) Kullback-
Leibler divergence (𝛽 = 1) and the Itakura-Saito divergence (𝛽 = 0). Note that for 𝛽 ∈ (1; 2), the ‘mu’ solver is
significantly faster than for other values of 𝛽. Note also that with a negative (or 0, i.e. ‘itakura-saito’) 𝛽, the input
matrix cannot contain zero values.
The ‘cd’ solver can only optimize the Frobenius norm. Due to the underlying non-convexity of NMF, the different
solvers may converge to different minima, even when optimizing the same distance function.
NMF is best used with the fit_transform method, which returns the matrix W. The matrix H is stored into the
fitted model in the components_ attribute; the method transform will decompose a new matrix X_new based on
these stored components:
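A minimal sketch of this workflow on a small non-negative matrix (the values are illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import NMF
>>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> model = NMF(n_components=2, init='random', random_state=0)
>>> W = model.fit_transform(X)       # the transformed data
>>> H = model.components_            # the stored factor H
>>> X_new = np.array([[1, 0], [1, 6.1], [1, 0], [1, 4], [3.2, 1], [0, 4]])
>>> model.transform(X_new).shape     # decompose new data on the stored components
(6, 2)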
Examples:
References:
Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete dataset such as text corpora.
It is also a topic model that is used for discovering abstract topics from a collection of documents.
The graphical model of LDA is a three-level generative model:
Note on notations presented in the graphical model above, which can be found in Hoffman et al. (2013):
• The corpus is a collection of 𝐷 documents.
• A document is a sequence of 𝑁 words.
• There are 𝐾 topics in the corpus.
• The boxes represent repeated sampling.
In the graphical model, each node is a random variable and has a role in the generative process. A shaded node
indicates an observed variable and an unshaded node indicates a hidden (latent) variable. In this case, words in the
corpus are the only data that we observe. The latent variables determine the random mixture of topics in the corpus
and the distribution of words in the documents. The goal of LDA is to use the observed words to infer the hidden topic
structure.
When modeling text corpora, the model assumes the following generative process for a corpus with 𝐷 documents and
𝐾 topics, with 𝐾 corresponding to n_components in the API:
1. For each topic 𝑘 ∈ 𝐾, draw 𝛽𝑘 ∼ Dirichlet(𝜂). This provides a distribution over the words, i.e. the
probability of a word appearing in topic 𝑘. 𝜂 corresponds to topic_word_prior.
2. For each document 𝑑 ∈ 𝐷, draw the topic proportions 𝜃𝑑 ∼ Dirichlet(𝛼). 𝛼 corresponds to
doc_topic_prior.
3. For each word 𝑖 in document 𝑑:
a. Draw the topic assignment 𝑧𝑑𝑖 ∼ Multinomial(𝜃𝑑 )
b. Draw the observed word 𝑤𝑖𝑗 ∼ Multinomial(𝛽𝑧𝑑𝑖 )
For parameter estimation, the posterior distribution is:
p(z, \theta, \beta | w, \alpha, \eta) = \frac{p(z, \theta, \beta, w | \alpha, \eta)}{p(w | \alpha, \eta)}
Since the posterior is intractable, variational Bayesian method uses a simpler distribution 𝑞(𝑧, 𝜃, 𝛽|𝜆, 𝜑, 𝛾) to approx-
imate it, and those variational parameters 𝜆, 𝜑, 𝛾 are optimized to maximize the Evidence Lower Bound (ELBO):
\log P(w | \alpha, \eta) \geq L(w, \phi, \gamma, \lambda) \triangleq E_q[\log p(w, z, \theta, \beta | \alpha, \eta)] - E_q[\log q(z, \theta, \beta)]
Maximizing ELBO is equivalent to minimizing the Kullback-Leibler(KL) divergence between 𝑞(𝑧, 𝜃, 𝛽) and the true
posterior 𝑝(𝑧, 𝜃, 𝛽|𝑤, 𝛼, 𝜂).
LatentDirichletAllocation implements the online variational Bayes algorithm and supports both online and
batch update methods. While the batch method updates variational variables after each full pass through the data, the
online method updates variational variables from mini-batch data points.
Note: Although the online method is guaranteed to converge to a local optimum point, the quality of the optimum
point and the speed of convergence may depend on mini-batch size and attributes related to learning rate setting.
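For illustration, a minimal sketch using synthetic count data (make_multilabel_classification is used here only because it produces non-negative integer counts):
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> X, _ = make_multilabel_classification(n_samples=100, n_features=20, random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5, learning_method='online',
...                                 random_state=0)
>>> doc_topic = lda.fit_transform(X)
>>> doc_topic.shape          # per-document topic distribution
(100, 5)
>>> lda.components_.shape    # per-topic word weights
(5, 20)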
Examples:
• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
References:
See also Dimensionality reduction for dimensionality reduction with Neighborhood Components Analysis.
Many statistical problems require the estimation of a population’s covariance matrix, which can be seen as an estima-
tion of data set scatter plot shape. Most of the time, such an estimation has to be done on a sample whose properties
(size, structure, homogeneity) have a large influence on the estimation’s quality. The sklearn.covariance pack-
age provides tools for accurately estimating a population’s covariance matrix under various settings.
We assume that the observations are independent and identically distributed (i.i.d.).
Empirical covariance
The covariance matrix of a data set is known to be well approximated by the classical maximum likelihood estimator
(or “empirical covariance”), provided the number of observations is large enough compared to the number of features
(the variables describing the observations). More precisely, the Maximum Likelihood Estimator of a sample is an
unbiased estimator of the corresponding population’s covariance matrix.
The empirical covariance matrix of a sample can be computed using the empirical_covariance function of the
package, or by fitting an EmpiricalCovariance object to the data sample with the EmpiricalCovariance.
fit method. Be careful that results depend on whether the data are centered, so one may want to use the
assume_centered parameter accurately. More precisely, if assume_centered=False, then the test set
is supposed to have the same mean vector as the training set. If not, both should be centered by the user, and
assume_centered=True should be used.
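For illustration, a minimal sketch on synthetic Gaussian data showing that the object and the function agree:
>>> import numpy as np
>>> from sklearn.covariance import EmpiricalCovariance, empirical_covariance
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[1, .5], [.5, 1]], size=500)
>>> cov = EmpiricalCovariance().fit(X)
>>> cov.covariance_.shape
(2, 2)
>>> np.allclose(cov.covariance_, empirical_covariance(X))
True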
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an
EmpiricalCovariance object to data.
Shrunk Covariance
Basic shrinkage
Despite being an unbiased estimator of the covariance matrix, the Maximum Likelihood Estimator is not a good esti-
mator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate.
Sometimes, it even occurs that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid
such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage.
In scikit-learn, this transformation (with a user-defined shrinkage coefficient) can be directly applied to a pre-computed
covariance with the shrunk_covariance method. Also, a shrunk estimator of the covariance can be fitted to data
with a ShrunkCovariance object and its ShrunkCovariance.fit method. Again, results depend on whether
the data are centered, so one may want to use the assume_centered parameter accurately.
Mathematically, this shrinkage consists in reducing the ratio between the smallest and the largest eigenvalues of the
empirical covariance matrix. It can be done by simply shifting every eigenvalue according to a given offset, which is
equivalent to finding the l2-penalized Maximum Likelihood Estimator of the covariance matrix. In practice, shrinkage
boils down to a simple convex transformation:

\Sigma_{\mathrm{shrunk}} = (1 - \alpha) \hat{\Sigma} + \alpha \frac{\mathrm{Tr}(\hat{\Sigma})}{p} \mathrm{Id}

Choosing the amount of shrinkage, 𝛼, amounts to setting a bias/variance trade-off, and is discussed below.
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a
ShrunkCovariance object to data.
Ledoit-Wolf shrinkage
In their 2004 paper1 , O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient 𝛼 that
minimizes the Mean Squared Error between the estimated and the real covariance matrix.
The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the ledoit_wolf function
of the sklearn.covariance package, or it can be otherwise obtained by fitting a LedoitWolf object to the
same sample.
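For illustration, a minimal sketch on synthetic data showing that the function and the estimator object agree:
>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf, ledoit_wolf
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=50)
>>> lw = LedoitWolf().fit(X)
>>> lw.covariance_.shape
(3, 3)
>>> shrunk_cov, shrinkage = ledoit_wolf(X)
>>> np.allclose(shrunk_cov, lw.covariance_)
True
>>> 0.0 <= shrinkage <= 1.0     # the estimated optimal shrinkage coefficient
True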
1 O. Ledoit and M. Wolf, “A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Journal of Multivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit a
LedoitWolf object to data and for visualizing the performances of the Ledoit-Wolf estimator in terms of
likelihood.
References:
Under the assumption that the data are Gaussian distributed, Chen et al.2 derived a formula aimed at choosing a
shrinkage coefficient that yields a smaller Mean Squared Error than the one given by Ledoit and Wolf’s formula. The
resulting estimator is known as the Oracle Approximating Shrinkage (OAS) estimator of the covariance.
The OAS estimator of the covariance matrix can be computed on a sample with the oas function of the sklearn.
covariance package, or it can be otherwise obtained by fitting an OAS object to the same sample.
Fig. 7: Bias-variance trade-off when setting the shrinkage: comparing the choices of Ledoit-Wolf and OAS estimators
References:
Examples:
• See Shrinkage covariance estimation: LedoitWolf vs OAS and max-likelihood for an example on how to fit an
OAS object to data.
2 Chen et al., “Shrinkage Algorithms for MMSE Covariance Estimation”, IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
• See Ledoit-Wolf vs OAS estimation to visualize the Mean Squared Error difference between a LedoitWolf
and an OAS estimator of the covariance.
The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation
matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on
the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate
a sparse precision matrix: the estimation of the covariance matrix is better conditioned by learning independence
relations from the data. This is known as covariance selection.
In the small-samples situation, in which n_samples is on the order of n_features or smaller, sparse inverse
covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for
very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are
able to recover off-diagonal structure.
The GraphicalLasso estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its
alpha parameter, the more sparse the precision matrix. The corresponding GraphicalLassoCV object uses
cross-validation to automatically set the alpha parameter.
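For illustration, a minimal sketch on synthetic data drawn from a known sparse model (the covariance matrix below is illustrative):
>>> import numpy as np
>>> from sklearn.covariance import GraphicalLassoCV
>>> rng = np.random.RandomState(0)
>>> true_cov = np.array([[0.8, 0.0, 0.2, 0.0],
...                      [0.0, 0.4, 0.0, 0.0],
...                      [0.2, 0.0, 0.3, 0.1],
...                      [0.0, 0.0, 0.1, 0.7]])
>>> X = rng.multivariate_normal(mean=[0, 0, 0, 0], cov=true_cov, size=200)
>>> model = GraphicalLassoCV().fit(X)
>>> model.precision_.shape     # the estimated sparse precision matrix
(4, 4)
>>> model.alpha_ > 0           # penalty selected by cross-validation
True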
Fig. 8: A comparison of maximum likelihood, shrinkage and sparse estimates of the covariance and precision matrix
in the very small samples settings.
• If your number of observations is not large compared to the number of edges in your underlying graph, you will
not recover it.
• Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the
GraphicalLassoCV object) will lead to selecting too many edges. However, the relevant edges will have
heavier weights than the irrelevant ones.
The estimator solves the following l1-penalized maximum likelihood problem:

\hat{K} = \underset{K}{\arg\min\,} \big( \mathrm{tr}(S K) - \log \det(K) + \alpha ||K||_1 \big)

where 𝐾 is the precision matrix to be estimated, and 𝑆 is the sample covariance matrix. ||𝐾||1 is the sum of the abso-
lute values of off-diagonal coefficients of 𝐾. The algorithm employed to solve this problem is the GLasso algorithm,
from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R glasso package.
Examples:
• Sparse inverse covariance estimation: example on synthetic data showing some recovery of a structure, and
comparing to other covariance estimators.
• Visualizing the stock market structure: example on real stock market data, finding which symbols are most
linked.
References:
• Friedman et al, “Sparse inverse covariance estimation with the graphical lasso”, Biostatistics 9, pp 432, 2008
Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also
appear for a variety of reasons. Observations which are very uncommon are called outliers. The empirical covariance
estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outliers in the data.
Therefore, one should use robust covariance estimators to estimate the covariance of real data sets. Alternatively,
robust covariance estimators can be used to perform outlier detection and to discard or downweight some observations
according to further processing of the data.
The sklearn.covariance package implements a robust estimator of covariance, the Minimum Covariance De-
terminant3 .
The Minimum Covariance Determinant estimator is a robust estimator of a data set’s covariance introduced by P.J.
Rousseeuw in3 . The idea is to find a given proportion (h) of “good” observations which are not outliers and compute
their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate the performed
selection of observations (“consistency step”). Having computed the Minimum Covariance Determinant estimator,
one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the
covariance matrix of the data set (“reweighting step”).
Rousseeuw and Van Driessen4 developed the FastMCD algorithm in order to compute the Minimum Covariance
Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also
computes a robust estimate of the data set location at the same time.
Raw estimates can be accessed as raw_location_ and raw_covariance_ attributes of a MinCovDet robust
covariance estimator object.
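For illustration, a minimal sketch on contaminated synthetic data comparing the robust and the empirical location estimates:
>>> import numpy as np
>>> from sklearn.covariance import EmpiricalCovariance, MinCovDet
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[1, .3], [.3, 1]], size=100)
>>> X[-5:] += 10                          # contaminate the sample with a few outliers
>>> robust_cov = MinCovDet(random_state=0).fit(X)
>>> emp_cov = EmpiricalCovariance().fit(X)
>>> # the robust location stays much closer to the true mean [0, 0]
>>> np.linalg.norm(robust_cov.location_) < np.linalg.norm(emp_cov.location_)
True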
References:
Examples:
• See Robust vs Empirical covariance estimate for an example on how to fit a MinCovDet object to data and
see how the estimate remains accurate despite the presence of outliers.
• See Robust covariance estimation and Mahalanobis distances relevance to visualize the difference between
EmpiricalCovariance and MinCovDet covariance estimators in terms of Mahalanobis distance (so
we get a better estimate of the precision matrix too).
3 P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
4 A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS.
• Influence of outliers on location and covariance estimates
• Separating inliers from outliers using a Mahalanobis distance
Many applications require being able to decide whether a new observation belongs to the same distribution as existing
observations (it is an inlier), or should be considered as different (it is an outlier). Often, this ability is used to clean
real data sets. Two important distinctions must be made:
outlier detection The training data contains outliers which are defined as observations that are far from
the others. Outlier detection estimators thus try to fit the regions where the training data is the most
concentrated, ignoring the deviant observations.
novelty detection The training data is not polluted by outliers and we are interested in detecting whether
a new observation is an outlier. In this context an outlier is also called a novelty.
Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting
abnormal or unusual observations. Outlier detection is then also known as unsupervised anomaly detection and novelty
detection as semi-supervised anomaly detection. In the context of outlier detection, the outliers/anomalies cannot form
a dense cluster as available estimators assume that the outliers/anomalies are located in low density regions. On the
contrary, in the context of novelty detection, novelties/anomalies can form a dense cluster as long as they are in a low
density region of the training data, considered as normal in this context.
The scikit-learn project provides a set of machine learning tools that can be used both for novelty or outlier detection.
This strategy is implemented with objects learning in an unsupervised way from the data:
estimator.fit(X_train)
new observations can then be sorted as inliers or outliers with a predict method:
estimator.predict(X_test)
Inliers are labeled 1, while outliers are labeled -1. The predict method makes use of a threshold on the raw scoring
function computed by the estimator. This scoring function is accessible through the score_samples method, while
the threshold can be controlled by the contamination parameter.
The decision_function method is also defined from the scoring function, in such a way that negative values are
outliers and non-negative ones are inliers:
estimator.decision_function(X_test)
A comparison of the outlier detection algorithms in scikit-learn. Local Outlier Factor (LOF) does not show a decision
boundary in black as it has no predict method to be applied on new data when it is used for outlier detection.
ensemble.IsolationForest and neighbors.LocalOutlierFactor perform reasonably well on the
data sets considered here. The svm.OneClassSVM is known to be sensitive to outliers and thus does not perform
very well for outlier detection. Finally, covariance.EllipticEnvelope assumes the data is Gaussian and
learns an ellipse. For more details on the different estimators refer to the example Comparing anomaly detection
algorithms for outlier detection on toy datasets and the sections hereunder.
Examples:
• See Comparing anomaly detection algorithms for outlier detection on toy datasets for a com-
parison of the svm.OneClassSVM , the ensemble.IsolationForest, the neighbors.
LocalOutlierFactor and covariance.EllipticEnvelope.
Novelty Detection
Consider a data set of 𝑛 observations from the same distribution described by 𝑝 features. Consider now that we add one
more observation to that data set. Is the new observation so different from the others that we can doubt it is regular?
(i.e. does it come from the same distribution?) Or on the contrary, is it so similar to the other that we cannot distinguish
it from the original observations? This is the question addressed by the novelty detection tools and methods.
In general, the goal is to learn a rough, close frontier delimiting the contour of the initial observations’ distribution,
plotted in an embedding 𝑝-dimensional space. Then, if further observations lie within the frontier-delimited subspace,
they are considered as coming from the same population as the initial observations. Otherwise, if they lie outside the
frontier, we can say that they are abnormal with a given confidence in our assessment.
The One-Class SVM has been introduced by Schölkopf et al. for that purpose and implemented in the Support Vector
Machines module in the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to
define a frontier. The RBF kernel is usually chosen although there exists no exact formula or algorithm to set its
bandwidth parameter. This is the default in the scikit-learn implementation. The 𝜈 parameter, also known as the
margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the
frontier.
References:
• Estimating the support of a high-dimensional distribution Schölkopf, Bernhard, et al. Neural computation
13.7 (2001): 1443-1471.
Examples:
• See One-class SVM with non-linear kernel (RBF) for visualizing the frontier learned around some data by a
svm.OneClassSVM object.
• Species distribution modeling
Outlier Detection
Outlier detection is similar to novelty detection in the sense that the goal is to separate a core of regular observa-
tions from some polluting ones, called outliers. Yet, in the case of outlier detection, we don’t have a clean data set
representing the population of regular observations that can be used to train any tool.
One common way of performing outlier detection is to assume that the regular data come from a known distribution
(e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can
define outlying observations as observations which stand far enough from the fit shape.
The scikit-learn provides an object covariance.EllipticEnvelope that fits a robust covariance estimate to
the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
For instance, assuming that the inlier data are Gaussian distributed, it will estimate the inlier location and covariance
in a robust way (i.e. without being influenced by outliers). The Mahalanobis distances obtained from this estimate are
used to derive a measure of outlyingness. This strategy is illustrated below.
Examples:
• See Robust covariance estimation and Mahalanobis distances relevance for an illustration of the dif-
ference between using a standard (covariance.EmpiricalCovariance) or a robust estimate
(covariance.MinCovDet) of location and covariance to assess the degree of outlyingness of an ob-
servation.
References:
• Rousseeuw, P.J., Van Driessen, K. “A fast algorithm for the minimum covariance determinant estimator”
Technometrics 41(3), 212 (1999)
Isolation Forest
One efficient way of performing outlier detection in high-dimensional datasets is to use random forests. The
ensemble.IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly se-
lecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample
is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collec-
tively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
The implementation of ensemble.IsolationForest is based on an ensemble of tree.
ExtraTreeRegressor. Following the original Isolation Forest paper, the maximum depth of each tree is set
to ⌈log2 (𝑛)⌉ where 𝑛 is the number of samples used to build the tree (see (Liu et al., 2008) for more details).
This algorithm is illustrated below.
The ensemble.IsolationForest supports warm_start=True which allows you to add more trees to an
already fitted model:
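For illustration, a minimal sketch of growing an already fitted forest (the dataset and tree counts are illustrative):
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import IsolationForest
>>> X, _ = make_blobs(n_samples=300, centers=1, random_state=0)
>>> clf = IsolationForest(n_estimators=100, warm_start=True, random_state=0).fit(X)
>>> clf = clf.set_params(n_estimators=120).fit(X)   # 20 extra trees are added and fitted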
Examples:
References:
• Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. “Isolation forest.” Data Mining, 2008. ICDM’08. Eighth
IEEE International Conference on.
Another efficient way to perform outlier detection on moderately high dimensional datasets is to use the Local Outlier
Factor (LOF) algorithm.
The neighbors.LocalOutlierFactor (LOF) algorithm computes a score (called local outlier factor) reflect-
ing the degree of abnormality of the observations. It measures the local density deviation of a given data point with
respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.
In practice the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the
ratio of the average local density of its k-nearest neighbors to its own local density: a normal instance is expected
to have a local density similar to that of its neighbors, while abnormal data are expected to have a much smaller local
density.
The number k of neighbors considered (alias parameter n_neighbors) is typically chosen 1) greater than the minimum
number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2)
smaller than the maximum number of close by objects that can potentially be local outliers. In practice, such informa-
tion is generally not available, and taking n_neighbors=20 appears to work well in general. When the proportion of
outliers is high (i.e. greater than 10 %, as in the example below), n_neighbors should be greater (n_neighbors=35 in
the example below).
The strength of the LOF algorithm is that it takes both local and global properties of datasets into consideration: it can
perform well even in datasets where abnormal samples have different underlying densities. The question is not, how
isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.
When applying LOF for outlier detection, there are no predict, decision_function and score_samples
methods but only a fit_predict method. The scores of abnormality of the training samples are accessi-
ble through the negative_outlier_factor_ attribute. Note that predict, decision_function and
score_samples can be used on new unseen data when LOF is applied for novelty detection, i.e. when the
novelty parameter is set to True. See Novelty detection with Local Outlier Factor.
This strategy is illustrated below.
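As a rough sketch (the tiny dataset below is made up for illustration), fit_predict labels outliers with -1:
>>> import numpy as np
>>> from sklearn.neighbors import LocalOutlierFactor
>>> X = np.array([[-1.1], [0.2], [101.1], [0.3]])
>>> clf = LocalOutlierFactor(n_neighbors=2)
>>> clf.fit_predict(X)   # the isolated third sample is flagged as an outlier
array([ 1,  1, -1,  1])
The abnormality scores of the training samples are then available through clf.negative_outlier_factor_.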
Examples:
• See Outlier detection with Local Outlier Factor (LOF) for an illustration of the use of neighbors.
LocalOutlierFactor.
• See Comparing anomaly detection algorithms for outlier detection on toy datasets for a comparison with
other anomaly detection methods.
References:
• Breunig, Kriegel, Ng, and Sander (2000) LOF: identifying density-based local outliers. Proc. ACM SIGMOD
Novelty detection with Local Outlier Factor
To use neighbors.LocalOutlierFactor for novelty detection, i.e. to predict labels or compute the score of abnormality of new unseen data, you need to instantiate the estimator with the novelty parameter set to True before fitting the estimator:
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(novelty=True)
lof.fit(X_train)  # X_train is assumed to contain only inliers
Density Estimation
Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of
the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (sklearn.
mixture.GaussianMixture), and neighbor-based approaches such as the kernel density estimate (sklearn.
neighbors.KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because
the technique is also useful as an unsupervised clustering scheme.
Density estimation is a very simple concept, and most people are already familiar with one common density estimation
technique: the histogram.
A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is
tallied. An example of a histogram can be seen in the upper-left panel of the following figure:
A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the
resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data,
with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different
interpretations of the data.
Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the
appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we
center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left
visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations means that it is a much better representation of the underlying data.
This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block
at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a
Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth
density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution
of points.
It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density
estimator can be used as follows:
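For instance, a minimal sketch on a toy dataset (the data and bandwidth are illustrative):
from sklearn.neighbors import KernelDensity
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
log_density = kde.score_samples(X)  # log-density evaluated at the training points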
Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function 𝐾(𝑥; ℎ)
which is controlled by the bandwidth parameter ℎ. Given this kernel form, the density estimate at a point 𝑦 within a
group of points 𝑥𝑖 ; 𝑖 = 1 · · · 𝑁 is given by:
\rho_K(y) = \sum_{i=1}^{N} K((y - x_i) / h)
The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A
large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth
(i.e. high-variance) density distribution.
sklearn.neighbors.KernelDensity implements several common kernel forms, which are shown in the
following figure:
The kernel density estimator can be used with any of the valid distance metrics (see sklearn.neighbors.
DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean
metric. One particularly useful metric is the Haversine distance which measures the angular distance between points
on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case
the distribution of observations of two different species on the South American continent:
One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in
order to efficiently draw new samples from this generative model. Here is an example of using this process to create a
new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:
The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE
model.
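Schematically, the sampling step itself reduces to the sample method of a fitted KernelDensity model (available for the Gaussian and tophat kernels); the digits example additionally projects the data with PCA beforehand and inverts the projection afterwards. Reusing the kde estimator fitted above:
new_points = kde.sample(n_samples=5, random_state=0)  # draw 5 new points from the estimated density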
Examples:
• Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.
• Kernel Density Estimation: an example of using Kernel Density estimation to learn a generative model of the
hand-written digits data, and drawing new samples from this model.
• Kernel Density Estimate of Species Distributions: an example of Kernel Density estimation using the Haver-
sine distance metric to visualize geospatial data
Restricted Boltzmann machines
Restricted Boltzmann machines (RBM) are unsupervised nonlinear feature learners based on a probabilistic model.
The features extracted by an RBM or a hierarchy of RBMs often give good results when fed into a linear classifier
such as a linear SVM or a perceptron.
The model makes assumptions regarding the distribution of inputs. At the moment, scikit-learn only provides
BernoulliRBM , which assumes the inputs are either binary values or values between 0 and 1, each encoding the
probability that the specific feature would be turned on.
The RBM tries to maximize the likelihood of the data using a particular graphical model. The parameter learning
algorithm used (Stochastic Maximum Likelihood) prevents the representations from straying far from the input data,
which makes them capture interesting regularities, but makes the model less useful for small datasets, and usually not
useful for density estimation.
The method gained popularity for initializing deep neural networks with the weights of independent RBMs. This
method is known as unsupervised pre-training.
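A minimal fitting sketch (with made-up binary data):
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = BernoulliRBM(n_components=2)
model.fit(X)
hidden = model.transform(X)  # latent representation that can feed a linear classifier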
Examples:
The nodes are random variables whose states depend on the state of the other nodes they are connected to. The model
is therefore parameterized by the weights of the connections, as well as one intercept (bias) term for each visible and
hidden unit, omitted from the image for simplicity.
The energy function measures the quality of a joint assignment:
E(\mathbf{v}, \mathbf{h}) = -\sum_i \sum_j w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j c_j h_j
In the formula above, b and c are the intercept vectors for the visible and hidden layers, respectively. The joint probability of the model is defined in terms of the energy:
P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}
The word restricted refers to the bipartite structure of the model, which prohibits direct interaction between hidden
units, or between visible units. This means that the following conditional independencies are assumed:
h_i \perp h_j \mid \mathbf{v}
v_i \perp v_j \mid \mathbf{h}
The bipartite structure allows for the use of efficient block Gibbs sampling for inference.
In the BernoulliRBM , all units are binary stochastic units. This means that the input data should either be binary, or
real-valued between 0 and 1 signifying the probability that the visible unit would turn on or off. This is a good model
for character recognition, where the interest is on which pixels are active and which aren’t. For images of natural
scenes it no longer fits because of background, depth and the tendency of neighbouring pixels to take the same values.
The conditional probability distribution of each unit is given by the logistic sigmoid activation function of the input it
receives:
P(v_i = 1 \mid \mathbf{h}) = \sigma\left(\sum_j w_{ij} h_j + b_i\right)
P(h_j = 1 \mid \mathbf{v}) = \sigma\left(\sum_i w_{ij} v_i + c_j\right)
The training algorithm implemented in BernoulliRBM is known as Stochastic Maximum Likelihood (SML) or
Persistent Contrastive Divergence (PCD). Optimizing maximum likelihood directly is infeasible because of the form
of the data likelihood:
\log P(v) = \log \sum_h e^{-E(v, h)} - \log \sum_{x, y} e^{-E(x, y)}
For simplicity the equation above is written for a single training example. The gradient with respect to the weights is
formed of two terms corresponding to the ones above. They are usually known as the positive gradient and the negative
gradient, because of their respective signs. In this implementation, the gradients are estimated over mini-batches of
samples.
In maximizing the log-likelihood, the positive gradient makes the model prefer hidden states that are compatible with
the observed training data. Because of the bipartite structure of RBMs, it can be computed efficiently. The negative
gradient, however, is intractable. Its goal is to lower the energy of joint states that the model prefers, therefore making
it stay true to the data. It can be approximated by Markov chain Monte Carlo using block Gibbs sampling by iteratively
sampling each of 𝑣 and ℎ given the other, until the chain mixes. Samples generated in this way are sometimes referred to as fantasy particles. This is inefficient and it is difficult to determine whether the Markov chain mixes.
The Contrastive Divergence method suggests to stop the chain after a small number of iterations, 𝑘, usually even 1.
This method is fast and has low variance, but the samples are far from the model distribution.
Persistent Contrastive Divergence addresses this. Instead of starting a new chain each time the gradient is needed, and
performing only one Gibbs sampling step, in PCD we keep a number of chains (fantasy particles) that are updated 𝑘
Gibbs steps after each weight update. This allows the particles to explore the space more thoroughly.
References:
• “A fast learning algorithm for deep belief nets” G. Hinton, S. Osindero, Y.-W. Teh, 2006
• “Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient” T. Tieleman,
2008
Cross-validation: evaluating estimator performance
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model
that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict
anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when
performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test,
y_test. Note that the word “experiment” is not intended to denote academic use only, because even in commercial
settings machine learning usually starts out experimentally. Here is a flowchart of typical cross validation workflow in
model training. The best parameters can be determined by grid search techniques.
In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split
helper function. Let’s load the iris data set to fit a linear support vector machine on it:
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets, svm
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X.shape, y.shape
((150, 4), (150,))
We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:
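A sketch of such a split, following the convention above (the reported score is indicative):
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...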
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set
for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator
performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer
report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-
called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and
when the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be
used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation)
sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for
final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the
training set is split into k smaller sets (other approaches are described below, but generally follow the same principles).
The following procedure is followed for each of the k “folds”:
• A model is trained using 𝑘 − 1 of the folds as training data;
• the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a
performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an
arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of
samples is very small.
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the
dataset.
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the
iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each
time):
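A sketch of this call (the scores shown are indicative):
>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])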
The mean score and the 95% confidence interval of the score estimate are hence given by:
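For example (values indicative for the scores above):
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)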
By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change
this by using the scoring parameter:
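For instance, to use macro-averaged F1 instead (values indicative):
>>> scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])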
See The scoring parameter: defining model evaluation rules for details. In the case of the Iris dataset, the samples are
balanced across target classes hence the accuracy and the F1-score are almost equal.
When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by
default, the latter being used if the estimator derives from ClassifierMixin.
It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:
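For instance, a ShuffleSplit iterator can be passed directly as cv (a sketch; the number of splits, test size and seed are arbitrary):
>>> from sklearn.model_selection import ShuffleSplit
>>> cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
>>> scores = cross_val_score(clf, X, y, cv=cv)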
Another option is to use an iterable yielding (train, test) splits as arrays of indices, for example:
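A sketch of such a generator (here a toy 2-fold splitter that tests on the same indices it trains on, purely for illustration):
>>> def custom_cv_2folds(X):
...     n = X.shape[0]
...     i = 1
...     while i <= 2:
...         idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
...         yield idx, idx
...         i += 1
...
>>> custom_cv = custom_cv_2folds(X)
>>> scores = cross_val_score(clf, X, y, cv=custom_cv)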
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations should likewise be learnt from a training set and applied to held-out data for prediction:
>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
0.9333...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.933..., 0.955..., 0.933..., 0.977...])
The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element
in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation
strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).
Examples
The following sections list utilities to generate indices that can be used to generate dataset splits according to different
cross validation strategies.
Assuming that some data is Independent and Identically Distributed (i.i.d.) is making the assumption that all samples
stem from the same generative process and that the generative process is assumed to have no memory of past generated
samples.
The following cross-validators can be used in such cases.
NOTE
While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If one knows that the samples have been generated using a time-dependent process, it is safer to use a time-series aware cross-validation scheme. Similarly, if we know that the generative process has a group structure (samples collected from different subjects, experiments, measurement devices), it is safer to use group-wise cross-validation.
K-fold
KFold divides all the samples in 𝑘 groups of samples, called folds (if 𝑘 = 𝑛, this is equivalent to the Leave One Out
strategy), of equal sizes (if possible). The prediction function is learned using 𝑘 − 1 folds, and the fold left out is used
for test.
Example of 2-fold cross-validation on a dataset with 4 samples:
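A sketch of this case:
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]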
Here is a visualization of the cross-validation behavior. Note that KFold is not affected by classes or groups.
Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set.
Thus, one can create the training/test sets using numpy indexing:
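For example, with the train and test index arrays from the last split above:
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]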
Repeated K-Fold
RepeatedKFold repeats K-Fold n times. It can be used when one requires to run KFold n times, producing
different splits in each repetition.
Example of 2-fold K-Fold repeated 2 times:
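A sketch of the call (the exact index permutations depend on the random_state):
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=12883823)
>>> for train, test in rkf.split(X):
...     print("%s %s" % (train, test))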
Similarly, RepeatedStratifiedKFold repeats Stratified K-Fold n times with different randomization in each
repetition.
Leave One Out (LOO)
LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for 𝑛 samples, we have 𝑛 different training sets and 𝑛 different test sets. This cross-validation procedure does not waste much data as only one sample is removed from the training set:
>>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
... print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
Potential users of LOO for model selection should weigh a few known caveats. When compared with 𝑘-fold cross
validation, one builds 𝑛 models from 𝑛 samples instead of 𝑘 models, where 𝑛 > 𝑘. Moreover, each is trained on 𝑛 − 1
samples rather than (𝑘 − 1)𝑛/𝑘. In both ways, assuming 𝑘 is not too large and 𝑘 < 𝑛, LOO is more computationally
expensive than 𝑘-fold cross validation.
In terms of accuracy, LOO often results in high variance as an estimator for the test error. Intuitively, since 𝑛 − 1 of
the 𝑛 samples are used to build each model, models constructed from folds are virtually identical to each other and to
the model built from the entire training set.
However, if the learning curve is steep for the training size in question, then 5- or 10- fold cross validation can
overestimate the generalization error.
As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred
to LOO.
References:
• http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html;
• T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer 2009
• L. Breiman, P. Spector Submodel selection and evaluation in regression: The X-random case, International
Statistical Review 1992;
• R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Intl.
Jnt. Conf. AI
• R. Bharat Rao, G. Fung, R. Rosales, On the Dangers of Cross-Validation. An Experimental Evaluation, SIAM
2008;
• G. James, D. Witten, T. Hastie, R Tibshirani, An Introduction to Statistical Learning, Springer 2013.
Leave P Out (LPO)
LeavePOut is very similar to LeaveOneOut as it creates all the possible training/test sets by removing 𝑝 samples from the complete set. For 𝑛 samples, this produces $\binom{n}{p}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets will overlap for 𝑝 > 1.
Example of Leave-2-Out on a dataset with 4 samples:
>>> import numpy as np
>>> from sklearn.model_selection import LeavePOut
>>> X = np.ones(4)
>>> lpo = LeavePOut(p=2)
>>> for train, test in lpo.split(X):
... print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
ShuffleSplit
The ShuffleSplit iterator will generate a user defined number of independent train / test dataset splits. Samples
are first shuffled and then split into a pair of train and test sets.
It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state
pseudo random number generator.
Here is a usage example:
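A sketch (the number of splits, test size and seed are arbitrary, and the printed indices depend on the seed):
>>> import numpy as np
>>> from sklearn.model_selection import ShuffleSplit
>>> X = np.arange(10)
>>> ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
>>> for train_index, test_index in ss.split(X):
...     print("%s %s" % (train_index, test_index))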
Here is a visualization of the cross-validation behavior. Note that ShuffleSplit is not affected by classes or
groups.
ShuffleSplit is thus a good alternative to KFold cross validation that allows a finer control on the number of
iterations and the proportion of samples on each side of the train / test split.
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there
could be several times more negative samples than positive samples. In such cases it is recommended to use stratified
sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold.
Stratified k-fold
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same
percentage of samples of each target class as the complete set.
Here is an example of stratified 3-fold cross-validation on a dataset with 50 samples from two unbalanced classes. We
show the number of samples in each class and compare with KFold.
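A sketch of such a comparison, with 45 samples in one class and 5 in the other (the per-class counts shown, obtained with np.bincount, are indicative):
>>> from sklearn.model_selection import StratifiedKFold, KFold
>>> import numpy as np
>>> X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
>>> skf = StratifiedKFold(n_splits=3)
>>> for train, test in skf.split(X, y):
...     print('train -  {}   |   test -  {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train -  [30  3]   |   test -  [15  2]
train -  [30  3]   |   test -  [15  2]
train -  [30  4]   |   test -  [15  1]
>>> kf = KFold(n_splits=3)
>>> for train, test in kf.split(X, y):
...     print('train -  {}   |   test -  {}'.format(
...         np.bincount(y[train]), np.bincount(y[test])))
train -  [28  5]   |   test -  [17]
train -  [28  5]   |   test -  [17]
train -  [34]   |   test -  [11  5]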
We can see that StratifiedKFold preserves the class ratios (approximately 1 / 10) in both train and test datasets.
Here is a visualization of the cross-validation behavior.
RepeatedStratifiedKFold can be used to repeat Stratified K-Fold n times with different randomization in each
repetition.
Stratified Shuffle Split
StratifiedShuffleSplit is a variation of ShuffleSplit, which returns stratified splits, i.e. which creates splits by preserving the same percentage for each target class as in the complete set.
Here is a visualization of the cross-validation behavior.
The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.
Such a grouping of data is domain specific. An example would be when there is medical data collected from multiple patients, with multiple samples taken from each patient. Such data is likely to be dependent on the individual group. In our example, the patient id for each sample will be its group identifier.
In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen
groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not
represented at all in the paired training fold.
The following cross-validation splitters can be used to do that. The grouping identifier for the samples is specified via
the groups parameter.
Group k-fold
GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training
sets. For example if the data is obtained from different subjects with several samples per-subject and if the model is
flexible enough to learn from highly person specific features it could fail to generalize to new subjects. GroupKFold
makes it possible to detect this kind of overfitting situations.
Imagine you have three subjects, each with an associated number from 1 to 3:
>>> X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
>>> y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
>>> groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
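Applying GroupKFold to these data might look as follows (the fold order shown is indicative):
>>> from sklearn.model_selection import GroupKFold
>>> gkf = GroupKFold(n_splits=3)
>>> for train, test in gkf.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]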
Each subject is in a different testing fold, and the same subject is never in both testing and training. Notice that the
folds do not have exactly the same size due to the imbalance in the data.
Here is a visualization of the cross-validation behavior.
Leave One Group Out
LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds.
Each training set is thus constituted by all the samples except the ones related to a specific group.
For example, in the cases of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation
based on the different experiments: we create a training set using the samples of all the experiments except one:
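A sketch with three hypothetical experiments encoded in the groups array:
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = [1, 5, 10, 50, 60, 70, 80]
>>> y = [0, 1, 1, 2, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3, 3]
>>> logo = LeaveOneGroupOut()
>>> for train, test in logo.split(X, y, groups=groups):
...     print("%s %s" % (train, test))
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]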
Another common application is to use time information: for instance the groups could be the year of collection of the
samples and thus allow for cross-validation against time-based splits.
Leave P Groups Out
LeavePGroupsOut is similar to LeaveOneGroupOut, but removes samples related to 𝑃 groups for each training/test set.
Example of Leave-2-Group Out:
>>> import numpy as np
>>> from sklearn.model_selection import LeavePGroupsOut
>>> X = np.arange(6)
>>> y = [1, 1, 1, 2, 2, 2]
>>> groups = [1, 1, 2, 2, 3, 3]
>>> lpgo = LeavePGroupsOut(n_groups=2)
>>> for train, test in lpgo.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
Group Shuffle Split
GroupShuffleSplit is useful when the behavior of LeavePGroupsOut is desired, but the number of groups is large enough that generating all possible partitions with 𝑃 groups withheld would be prohibitively expensive. In such a scenario, GroupShuffleSplit provides a random sample (with replacement) of the train / test splits generated by LeavePGroupsOut.
Predefined Fold-Splits / Validation-Sets
For some datasets, a pre-defined split of the data into training and validation folds or into several cross-validation folds already exists. Using PredefinedSplit it is possible to use these folds e.g. when searching for hyperparameters.
For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
Cross validation of time series data
Time series data is characterised by the correlation between observations that are near in time (autocorrelation). How-
ever, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent
and identically distributed, and would result in unreasonable correlation between training and testing instances (yield-
ing poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model
for time series data on the “future” observations least like those that are used to train the model. To achieve this, one
solution is provided by TimeSeriesSplit.
TimeSeriesSplit is a variation of k-fold which returns the first 𝑘 folds as train set and the (𝑘 + 1)-th fold as test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.
This class can be used to cross-validate time series data samples that are observed at fixed time intervals.
Example of 3-split time series cross-validation on a dataset with 6 samples:
>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)
TimeSeriesSplit(max_train_size=None, n_splits=3)
>>> for train, test in tscv.split(X):
... print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
A note on shuffling
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not
independently and identically distributed. For example, if samples correspond to news articles, and are ordered by
their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation
score: it will be tested on samples that are artificially similar (close in time) to training samples.
Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them.
Note that:
• This consumes less memory than shuffling the data directly.
• By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying
cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split
still returns a random split.
• The random_state parameter defaults to None, meaning that the shuffling will be different every time
KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for
each set of parameters validated by a single call to its fit method.
• To get identical results for each split, set random_state to an integer.
Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal
hyperparameters of the model. This is the topic of the next section: Tuning the hyper-parameters of an estimator.
Tuning the hyper-parameters of an estimator
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
It is possible and recommended to search the hyper-parameter space for the best cross validation score.
Any parameter provided when constructing an estimator may be optimized in this manner. Specifically, to find the
names and current values for all parameters for a given estimator, use:
estimator.get_params()
Exhaustive Grid Search
The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid:
param_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
specifies that two grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second
one with an RBF kernel, and the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001,
0.0001].
The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible
combinations of parameter values are evaluated and the best combination is retained.
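For instance, a sketch of fitting such a search on the iris data (using the param_grid defined above; output omitted):
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets

X, y = datasets.load_iris(return_X_y=True)
search = GridSearchCV(svm.SVC(), param_grid)
search.fit(X, y)              # evaluates every parameter combination with cross-validation
print(search.best_params_)    # the retained combination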
Examples:
• See Parameter estimation using grid search with cross-validation for an example of Grid Search computation
on the digits dataset.
• See Sample pipeline for text feature extraction and evaluation for an example of Grid Search coupling pa-
rameters from a text documents feature extractor (n-gram count vectorizer and TF-IDF transformer) with a
classifier (here a linear SVM trained with SGD with either elastic net or L2 penalty) using a pipeline.Pipeline instance.
• See Nested versus non-nested cross-validation for an example of Grid Search within a cross validation loop
on the iris dataset. This is the best practice for evaluating the performance of a model with grid search.
• See Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV for an example of
GridSearchCV being used to evaluate multiple metrics simultaneously.
• See Balance model complexity and cross-validated score for an example of using refit=callable in-
terface in GridSearchCV. The example shows how this interface adds a certain amount of flexibility in identifying the “best” estimator. This interface can also be used in multiple metrics evaluation.
Randomized Parameter Optimization
While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
• A budget can be chosen independent of the number of parameters and possible values.
• Adding parameters that do not influence the performance does not decrease efficiency.
Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for
GridSearchCV . Additionally, a computation budget, being the number of sampled candidates or sampling itera-
tions, is specified using the n_iter parameter. For each parameter, either a distribution over possible values or a list
of discrete choices (which will be sampled uniformly) can be specified:
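Such a specification might look like the following (a sketch using the distributions mentioned below):
import scipy.stats

param_distributions = {'C': scipy.stats.expon(scale=100),
                       'gamma': scipy.stats.expon(scale=.1),
                       'kernel': ['rbf'],
                       'class_weight': ['balanced', None]}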
This example uses the scipy.stats module, which contains many useful distributions for sampling parameters,
such as expon, gamma, uniform or randint.
In principle, any function can be passed that provides a rvs (random variate sample) method to sample a value. A
call to the rvs function should provide independent random samples from possible parameter values on consecutive
calls.
Warning: The distributions in scipy.stats prior to version scipy 0.16 do not allow specifying a random state. Instead, they use the global numpy random state, which can be seeded via np.random.seed or set using np.random.set_state. However, beginning with scikit-learn 0.18, the sklearn.model_selection module sets the random state provided by the user if scipy >= 0.16 is also available.
For continuous parameters, such as C above, it is important to specify a continuous distribution to take full advantage
of the randomization. This way, increasing n_iter will always lead to a finer search.
A continuous log-uniform random variable is available through loguniform. This is a continuous version of log-
spaced parameters. For example to specify C above, loguniform(1, 100) can be used instead of [1, 10,
100] or np.logspace(0, 2, num=1000). This is an alias to SciPy’s stats.reciprocal.
Mirroring the example above in grid search, we can specify a continuous random variable that is log-uniformly dis-
tributed between 1e0 and 1e3:
from sklearn.utils.fixes import loguniform
{'C': loguniform(1e0, 1e3),
'gamma': loguniform(1e-4, 1e-3),
'kernel': ['rbf'],
'class_weight':['balanced', None]}
Examples:
• Comparing randomized search and grid search for hyperparameter estimation compares the usage and effi-
ciency of randomized search and grid search.
References:
• Bergstra, J. and Bengio, Y., Random search for hyper-parameter optimization, The Journal of Machine Learn-
ing Research (2012)
By default, parameter search uses the score function of the estimator to evaluate a parameter setting. These are
the sklearn.metrics.accuracy_score for classification and sklearn.metrics.r2_score for regres-
sion. For some applications, other scoring functions are better suited (for example in unbalanced classification, the
accuracy score is often uninformative). An alternative scoring function can be specified via the scoring parameter
to GridSearchCV , RandomizedSearchCV and many of the specialized cross-validation tools described below.
See The scoring parameter: defining model evaluation rules for more details.
GridSearchCV and RandomizedSearchCV allow specifying multiple metrics for the scoring parameter.
Multimetric scoring can either be specified as a list of strings of predefined scores names or a dict mapping the scorer
name to the scorer function and/or the predefined scorer name(s). See Using multiple metric evaluation for more
details.
When specifying multiple metrics, the refit parameter must be set to the metric (string) for which the
best_params_ will be found and used to build the best_estimator_ on the whole dataset. If the search
should not be refit, set refit=False. Leaving refit to the default value None will result in an error when using
multiple metrics.
See Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV for an example usage.
GridSearchCV and RandomizedSearchCV allow searching over parameters of composite or nested estimators
such as Pipeline, ColumnTransformer, VotingClassifier or CalibratedClassifierCV using a
dedicated <estimator>__<parameter> syntax:
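A sketch of this syntax (a calibrated random forest on a toy dataset; base_estimator is the nested estimator's own parameter name):
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons

X, y = make_moons()
calibrated_forest = CalibratedClassifierCV(
    base_estimator=RandomForestClassifier(n_estimators=10))
param_grid = {'base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(calibrated_forest, param_grid, cv=5)
search.fit(X, y)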
Here, <estimator> is the parameter name of the nested estimator, in this case base_estimator. If the meta-
estimator is constructed as a collection of estimators as in pipeline.Pipeline, then <estimator> refers to
the name of the estimator, see Nested parameters. In practice, there can be several levels of nesting:
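For instance, continuing the sketch above with a pipeline wrapped around the calibrated forest:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest

pipe = Pipeline([
    ('select', SelectKBest()),
    ('model', calibrated_forest)])
param_grid = {'select__k': [1, 2],
              'model__base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)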
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the
parameters of the grid.
When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid
search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance)
and an evaluation set to compute performance metrics.
This can be done by using the train_test_split utility function.
Parallelism
GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently. Computations can
be run in parallel if your OS supports it, by using the keyword n_jobs=-1. See function signature for more details.
Robustness to failure
Some parameter settings may result in a failure to fit one or more folds of the data. By default, this will cause the
entire search to fail, even if some parameter settings could be fully evaluated. Setting error_score=0 (or =np.NaN) will make the procedure robust to such failure, issuing a warning and setting the score for that fold to 0 (or NaN), but completing the search.
Some models can fit data for a range of values of some parameter almost as efficiently as fitting the estimator for
a single value of the parameter. This feature can be leveraged to perform a more efficient cross-validation used for
model selection of this parameter.
The most common parameter amenable to this strategy is the parameter encoding the strength of the regularizer. In
this case we say that we compute the regularization path of the estimator.
Here is the list of such models:
linear_model.ElasticNetCV([l1_ratio, eps, ...]): Elastic Net model with iterative fitting along a regularization path.
linear_model.LarsCV([fit_intercept, ...]): Cross-validated Least Angle Regression model.
linear_model.LassoCV([eps, n_alphas, ...]): Lasso linear model with iterative fitting along a regularization path.
linear_model.LassoLarsCV([fit_intercept, ...]): Cross-validated Lasso, using the LARS algorithm.
linear_model.LogisticRegressionCV([Cs, ...]): Logistic Regression CV (aka logit, MaxEnt) classifier.
linear_model.MultiTaskElasticNetCV([...]): Multi-task L1/L2 ElasticNet with built-in cross-validation.
linear_model.MultiTaskLassoCV([eps, ...]): Multi-task Lasso model trained with L1/L2 mixed-norm as regularizer.
linear_model.OrthogonalMatchingPursuitCV([...]): Cross-validated Orthogonal Matching Pursuit model (OMP).
linear_model.RidgeCV([alphas, ...]): Ridge regression with built-in cross-validation.
linear_model.RidgeClassifierCV([alphas, ...]): Ridge classifier with built-in cross-validation.
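As a brief sketch of how such an estimator is used (synthetic regression data):
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(noise=4, random_state=0)
reg = LassoCV(cv=5, random_state=0).fit(X, y)
print(reg.alpha_)  # regularization strength selected along the path by cross-validation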
sklearn.linear_model.ElasticNetCV
enet_path
ElasticNet
Notes
If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2
for:
alpha = a + b and l1_ratio = a / (a + b)
Examples
Methods
Where:
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.LarsCV
Examples
Methods
Notes
sklearn.linear_model.LassoCV
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’.
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
alpha_ [float] The amount of penalization chosen by cross validation
coef_ [array, shape (n_features,) | (n_targets, n_features)] parameter vector (w in the cost func-
tion formula)
intercept_ [float | array, shape (n_targets,)] independent term in decision function.
mse_path_ [array, shape (n_alphas, n_folds)] mean square error for the test set on each fold,
varying alpha
alphas_ [numpy array, shape (n_alphas,)] The grid of alphas used for fitting
dual_gap_ [ndarray, shape ()] The dual gap at the end of the optimization for the optimal alpha
(alpha_).
n_iter_ [int] number of iterations run by the coordinate descent solver to reach the specified
tolerance for the optimal alpha.
See also:
lars_path
lasso_path
LassoLars
Lasso
LassoLarsCV
Notes
Examples
Methods
Where:
lars_path
Lasso
LassoLars
LassoCV
LassoLarsCV
sklearn.decomposition.sparse_encode
Notes
Examples
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
Notes
sklearn.linear_model.LassoLarsCV
cv_alphas_ [array-like of shape (n_cv_alphas,)] all the values of alpha along the path for the
different folds
mse_path_ [array-like of shape (n_folds, n_cv_alphas)] the mean square error on left-out for
each fold along the path (alpha values given by cv_alphas)
n_iter_ [array-like or int] the number of iterations run by Lars with the optimal alpha.
See also:
Notes
The object solves the same problem as the LassoCV object. However, unlike the LassoCV, it finds the relevant alpha values by itself. In general, because of this property, it will be more stable. However, it is more fragile to heavily multicollinear datasets.
It is more efficient than the LassoCV if only a small number of features are selected compared to the total
number, for instance if there are very few samples compared to the number of features.
Examples
Methods
Notes
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.linear_model.LogisticRegressionCV
penalty [{‘l1’, ‘l2’, ‘elasticnet’}, default=’l2’] Used to specify the norm used in the penaliza-
tion. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is
only supported by the ‘saga’ solver.
scoring [str or callable, default=None] A string (see model evaluation documentation) or a
scorer callable object / function with signature scorer(estimator, X, y). For a list
of scoring functions that can be used, look at sklearn.metrics. The default scoring
option used is ‘accuracy’.
solver [{‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’] Algorithm to use in
the optimization problem.
• For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for
large ones.
• For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial
loss; ‘liblinear’ is limited to one-versus-rest schemes.
• ‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’
handle L1 penalty.
• ‘liblinear’ might be slower in LogisticRegressionCV because it does not handle warm-
starting.
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approxi-
mately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
tol [float, default=1e-4] Tolerance for stopping criteria.
max_iter [int, default=100] Maximum number of iterations of the optimization algorithm.
class_weight [dict or ‘balanced’, default=None] Weights associated with classes in the form
{class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
New in version 0.17: class_weight == ‘balanced’
n_jobs [int, default=None] Number of CPU cores used during the cross-validation loop. None
means 1 unless in a joblib.parallel_backend context. -1 means using all proces-
sors. See Glossary for more details.
verbose [int, default=0] For the ‘liblinear’, ‘sag’ and ‘lbfgs’ solvers set verbose to any positive
number for verbosity.
refit [bool, default=True] If set to True, the scores are averaged across all folds, and the coefs
and the C that corresponds to the best score is taken, and a final refit is done using these
parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across
folds are averaged.
intercept_scaling [float, default=1] Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector.
scores_ [dict] dict with classes as the keys, and the values as the grid of scores obtained during cross-validating each fold, after doing an OvR for the corresponding class. If the ‘multi_class’ option given is ‘multinomial’ then the same scores are repeated across all classes, since this is the multinomial class. Each dict value has shape (n_folds, n_cs) or (n_folds, n_cs, n_l1_ratios) if penalty='elasticnet'.
C_ [ndarray of shape (n_classes,) or (n_classes - 1,)] Array of C that maps to the best scores across every class. If refit is set to False, then for each class, the best C is the average of the C’s that correspond to the best scores for each fold. C_ is of shape (n_classes,) when the problem is binary.
l1_ratio_ [ndarray of shape (n_classes,) or (n_classes - 1,)] Array of l1_ratio that maps to the best scores across every class. If refit is set to False, then for each class, the best l1_ratio is the average of the l1_ratio’s that correspond to the best scores for each fold. l1_ratio_ is of shape (n_classes,) when the problem is binary.
n_iter_ [ndarray of shape (n_classes, n_folds, n_cs) or (1, n_folds, n_cs)] Actual number of
iterations for all classes, folds and Cs. In the binary or multinomial cases, the first dimension
is equal to 1. If penalty='elasticnet', the shape is (n_classes, n_folds,
n_cs, n_l1_ratios) or (1, n_folds, n_cs, n_l1_ratios).
See also:
LogisticRegression
Examples
Methods
Returns
C [array, shape [n_samples]] Predicted class label per sample.
predict_log_proba(self, X)
Predict logarithm of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters
X [array-like of shape (n_samples, n_features)] Vector to be scored, where n_samples is
the number of samples and n_features is the number of features.
Returns
T [array-like of shape (n_samples, n_classes)] Returns the log-probability of the sample for
each class in the model, where classes are ordered as they are in self.classes_.
predict_proba(self, X)
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class. Else use a one-vs-rest approach, i.e. calculate the probability of each class assuming it to be positive using the logistic function, and normalize these values across all the classes.
Parameters
X [array-like of shape (n_samples, n_features)] Vector to be scored, where n_samples is
the number of samples and n_features is the number of features.
Returns
T [array-like of shape (n_samples, n_classes)] Returns the probability of the sample for each
class in the model, where classes are ordered as they are in self.classes_.
score(self, X, y, sample_weight=None)
Returns the score using the scoring option on the given test data and labels.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Score of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
sklearn.linear_model.MultiTaskElasticNetCV
Where:
to put more values close to 1 (i.e. Lasso) and less close to 0 (i.e. Ridge), as in [.1, .5,
.7, .9, .95, .99, 1]
eps [float, optional] Length of the path. eps=1e-3 means that alpha_min / alpha_max
= 1e-3.
n_alphas [int, optional] Number of alphas along the regularization path
alphas [array-like, optional] List of alphas where to compute the models. If not provided, set
automatically.
fit_intercept [boolean] whether to calculate the intercept for this model. If set to false, no
intercept will be used in calculations (i.e. data is expected to be centered).
normalize [boolean, optional, default False] This parameter is ignored when
fit_intercept is set to False. If True, the regressors X will be normalized be-
fore regression by subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
max_iter [int, optional] The maximum number of iterations
tol [float, optional] The tolerance for the optimization: if the updates are smaller than tol, the
optimization code checks the dual gap for optimality and continues until it is smaller than
tol.
cv [int, cross-validation generator or an iterable, optional] Determines the cross-validation split-
ting strategy. Possible inputs for cv are:
• None, to use the default 5-fold cross-validation,
• integer, to specify the number of folds.
• CV splitter,
• An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
copy_X [boolean, optional, default True] If True, X will be copied; else, it may be overwritten.
verbose [bool or integer] Amount of verbosity.
n_jobs [int or None, optional (default=None)] Number of CPUs to use during the cross vali-
dation. Note that this is used only if multiple values for l1_ratio are given. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator that selects a random feature to update. If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’.
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
intercept_ [array, shape (n_tasks,)] Independent term in decision function.
coef_ [array, shape (n_tasks, n_features)] Parameter vector (W in the cost function formula).
Note that coef_ stores the transpose of W, W.T.
alpha_ [float] The amount of penalization chosen by cross validation
mse_path_ [array, shape (n_alphas, n_folds) or (n_l1_ratio, n_alphas, n_folds)] mean square
error for the test set on each fold, varying alpha
alphas_ [numpy array, shape (n_alphas,) or (n_l1_ratio, n_alphas)] The grid of alphas used for
fitting, for each l1_ratio
l1_ratio_ [float] best l1_ratio obtained by cross-validation.
n_iter_ [int] number of iterations run by the coordinate descent solver to reach the specified
tolerance for the optimal alpha.
See also:
MultiTaskElasticNet
ElasticNetCV
MultiTaskLassoCV
Notes
Examples
Methods
Where:
l1_ratio [float, optional] Number between 0 and 1 passed to elastic net (scaling between l1
and l2 penalties). l1_ratio=1 corresponds to the Lasso.
eps [float] Length of the path. eps=1e-3 means that alpha_min / alpha_max =
1e-3.
n_alphas [int, optional] Number of alphas along the regularization path.
alphas [ndarray, optional] List of alphas where to compute the models. If None alphas are
set automatically.
precompute [True | False | ‘auto’ | array-like] Whether to use a precomputed Gram matrix
to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be
passed as argument.
Xy [array-like, optional] Xy = np.dot(X.T, y) that can be precomputed. It is useful only
when the Gram matrix is precomputed.
copy_X [bool, optional, default True] If True, X will be copied; else, it may be overwritten.
coef_init [array, shape (n_features, ) | None] The initial values of the coefficients.
verbose [bool or int] Amount of verbosity.
return_n_iter [bool] Whether to return the number of iterations or not.
positive [bool, default False] If set to True, forces coefficients to be positive. (Only allowed
when y.ndim == 1).
check_input [bool, default True] Skip input validation checks, including the Gram matrix when provided, assuming they are handled by the caller when check_input=False.
**params [kwargs] Keyword arguments passed to the coordinate descent solver.
Returns
alphas [array, shape (n_alphas,)] The alphas along the path where models are computed.
coefs [array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)] Coefficients
along the path.
dual_gaps [array, shape (n_alphas,)] The dual gaps at the end of the optimization for each
alpha.
n_iters [array-like, shape (n_alphas,)] The number of iterations taken by the coordinate
descent optimizer to reach the specified tolerance for each alpha. (Is returned when
return_n_iter is set to True).
See also:
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
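As a quick sanity check of the definition above, score returns the same quantity as sklearn.metrics.r2_score (a sketch on synthetic data with a generic regressor, not specific to this class):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(30)

model = LinearRegression().fit(X, y)
u = ((y - model.predict(X)) ** 2).sum()   # residual sum of squares
v = ((y - y.mean()) ** 2).sum()           # total sum of squares
assert np.isclose(model.score(X, y), 1 - u / v)
assert np.isclose(model.score(X, y), r2_score(y, model.predict(X)))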
Notes
sklearn.linear_model.MultiTaskLassoCV
The optimization objective for MultiTaskLasso is:
(1 / (2 * n_samples)) * ||Y - XW||^2_Fro + alpha * ||W||_21
where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of the norms of each row.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator that selects a random feature to update. If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
intercept_ [array, shape (n_tasks,)] Independent term in decision function.
coef_ [array, shape (n_tasks, n_features)] Parameter vector (W in the cost function formula).
Note that coef_ stores the transpose of W, W.T.
alpha_ [float] The amount of penalization chosen by cross validation
mse_path_ [array, shape (n_alphas, n_folds)] mean square error for the test set on each fold,
varying alpha
alphas_ [numpy array, shape (n_alphas,)] The grid of alphas used for fitting.
n_iter_ [int] number of iterations run by the coordinate descent solver to reach the specified
tolerance for the optimal alpha.
See also:
MultiTaskElasticNet
ElasticNetCV
MultiTaskElasticNetCV
Notes
Examples
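A minimal sketch (synthetic data; settings illustrative only):

import numpy as np
from sklearn.linear_model import MultiTaskLassoCV

rng = np.random.RandomState(0)
X = rng.randn(60, 12)
W = np.zeros((12, 3))
W[:4] = rng.randn(4, 3)            # shared sparse support across the three tasks
Y = X @ W + 0.1 * rng.randn(60, 3)

reg = MultiTaskLassoCV(cv=5, n_alphas=30).fit(X, Y)
print(reg.alpha_)                  # penalty selected by cross-validation
print(reg.coef_.shape)             # (n_tasks, n_features) = (3, 12)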
Methods
See also:
lars_path
Lasso
LassoLars
LassoCV
LassoLarsCV
sklearn.decomposition.sparse_encode
Notes
Examples
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
Notes
sklearn.linear_model.OrthogonalMatchingPursuitCV
class sklearn.linear_model.OrthogonalMatchingPursuitCV(copy=True, fit_intercept=True, normalize=True, max_iter=None, cv=None, n_jobs=None, verbose=False)
Cross-validated Orthogonal Matching Pursuit model (OMP).
See glossary entry for cross-validation estimator.
Read more in the User Guide.
Parameters
copy [bool, optional] Whether the design matrix X must be copied by the algorithm. A false
value is only helpful if X is already Fortran-ordered, otherwise a copy is made anyway.
fit_intercept [boolean, optional] whether to calculate the intercept for this model. If set to false,
no intercept will be used in calculations (i.e. data is expected to be centered).
normalize [boolean, optional, default True] This parameter is ignored when fit_intercept
is set to False. If True, the regressors X will be normalized before regression by sub-
tracting the mean and dividing by the l2-norm. If you wish to standardize, please use
sklearn.preprocessing.StandardScaler before calling fit on an estimator
with normalize=False.
max_iter [integer, optional] Maximum number of iterations to perform, therefore maximum
features to include. 10% of n_features but at least 5 if available.
cv [int, cross-validation generator or an iterable, optional] Determines the cross-validation split-
ting strategy. Possible inputs for cv are:
orthogonal_mp
orthogonal_mp_gram
lars_path
Lars
LassoLars
OrthogonalMatchingPursuit
LarsCV
LassoLarsCV
decomposition.sparse_encode
Examples
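A minimal sketch (synthetic sparse problem; settings illustrative):

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuitCV

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, 2] + 2 * X[:, 7] + 0.05 * rng.randn(100)   # only two active features

omp_cv = OrthogonalMatchingPursuitCV(cv=5).fit(X, y)
print(omp_cv.n_nonzero_coefs_)     # number of non-zero coefficients chosen by CV
print(omp_cv.score(X, y))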
Methods
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.RidgeCV
the problem and reduces the variance of the estimates. Larger values specify stronger reg-
ularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression
or LinearSVC. If using generalized cross-validation, alphas must be positive.
fit_intercept [bool, default=True] Whether to calculate the intercept for this model. If set to
false, no intercept will be used in calculations (i.e. data is expected to be centered).
normalize [bool, default=False] This parameter is ignored when fit_intercept is set
to False. If True, the regressors X will be normalized before regression by subtract-
ing the mean and dividing by the l2-norm. If you wish to standardize, please use
sklearn.preprocessing.StandardScaler before calling fit on an estimator
with normalize=False.
scoring [string, callable, default=None] A string (see model evaluation documentation) or a
scorer callable object / function with signature scorer(estimator, X, y). If None,
the negative mean squared error if cv is ‘auto’ or None (i.e. when using generalized cross-
validation), and r2 score otherwise.
cv [int, cross-validation generator or an iterable, default=None] Determines the cross-validation
splitting strategy. Possible inputs for cv are:
• None, to use the efficient Leave-One-Out cross-validation (also known as Generalized
Cross-Validation).
• integer, to specify the number of folds.
• CV splitter,
• An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if y is binary or multiclass, sklearn.model_selection.
StratifiedKFold is used, else, sklearn.model_selection.KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
gcv_mode [{‘auto’, ‘svd’, ‘eigen’}, default=’auto’] Flag indicating which strategy to use when
performing Generalized Cross-Validation. Options are:
The ‘auto’ mode is the default and is intended to pick the cheaper option of the two depend-
ing on the shape of the training data.
store_cv_values [bool, default=False] Flag indicating if the cross-validation values correspond-
ing to each alpha should be stored in the cv_values_ attribute (see below). This flag is
only compatible with cv=None (i.e. using Generalized Cross-Validation).
Attributes
cv_values_ [ndarray of shape (n_samples, n_alphas) or shape (n_samples, n_targets, n_alphas),
optional] Cross-validation values for each alpha (if store_cv_values=True and
cv=None). After fit() has been called, this attribute will contain the mean squared
errors (by default) or the values of the {loss,score}_func function (if provided in the
constructor).
coef_ [ndarray of shape (n_features) or (n_targets, n_features)] Weight vector(s).
intercept_ [float or ndarray of shape (n_targets,)] Independent term in decision function. Set to
0.0 if fit_intercept = False.
Examples
Methods
Notes
When sample_weight is provided, the selected hyperparameter may depend on whether we use generalized
cross-validation (cv=None or cv=’auto’) or another form of cross-validation, because only generalized
cross-validation takes the sample weights into account when computing the validation score.
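A minimal sketch contrasting the two modes mentioned above (synthetic data, illustrative alpha grid):

import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(80, 5)
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + 0.1 * rng.randn(80)
w = rng.uniform(0.5, 2.0, size=80)                   # per-sample weights

gcv = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y, sample_weight=w)          # cv=None: generalized CV
kfold = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5).fit(X, y, sample_weight=w)  # ordinary 5-fold CV
print(gcv.alpha_, kfold.alpha_)    # may differ: only GCV weighs the validation score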
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
Returns
self [object] Estimator instance.
sklearn.linear_model.RidgeClassifierCV
class_weight [dict or ‘balanced’, default=None] Weights associated with classes in the form
{class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y))
store_cv_values [bool, default=False] Flag indicating if the cross-validation values correspond-
ing to each alpha should be stored in the cv_values_ attribute (see below). This flag is
only compatible with cv=None (i.e. using Generalized Cross-Validation).
Attributes
cv_values_ [ndarray of shape (n_samples, n_targets, n_alphas), optional] Cross-validation val-
ues for each alpha (if store_cv_values=True and cv=None). After fit() has been
called, this attribute will contain the mean squared errors (by default) or the values of the
{loss,score}_func function (if provided in the constructor). This attribute exists only
when store_cv_values is True.
coef_ [ndarray of shape (1, n_features) or (n_targets, n_features)] Coefficient of the features in
the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ [float or ndarray of shape (n_targets,)] Independent term in decision function. Set to
0.0 if fit_intercept = False.
alpha_ [float] Estimated regularization parameter.
best_score_ [float] Score of base estimator with best alpha.
classes_ [ndarray of shape (n_classes,)] The classes labels.
See also:
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
Examples
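A minimal sketch on a built-in multi-class dataset (alpha grid illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeClassifierCV

X, y = load_iris(return_X_y=True)
clf = RidgeClassifierCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
print(clf.alpha_)          # regularization strength selected by cross-validation
print(clf.coef_.shape)     # (n_classes, n_features) for this 3-class problem
print(clf.score(X, y))     # mean accuracy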
Methods
Returns
C [array, shape [n_samples]] Predicted class label per sample.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
Information Criterion
Some models can offer an information-theoretic closed-form formula of the optimal estimate of the regularization
parameter by computing a single regularization path (instead of several when using cross-validation).
Here is the list of models benefiting from the Akaike Information Criterion (AIC) or the Bayesian Information Criterion
(BIC) for automated model selection:
linear_model.LassoLarsIC([criterion, . . . ]) Lasso model fit with Lars using BIC or AIC for model
selection
sklearn.linear_model.LassoLarsIC
AIC is the Akaike information criterion and BIC is the Bayes Information criterion. Such criteria are useful
to select the value of the regularization parameter by making a trade-off between the goodness of fit and the
complexity of the model. A good model should explain well the data while being simple.
Read more in the User Guide.
Parameters
criterion [{‘bic’, ‘aic’}, default=’aic’] The type of criterion to use.
fit_intercept [bool, default=True] whether to calculate the intercept for this model. If set to
false, no intercept will be used in calculations (i.e. data is expected to be centered).
verbose [bool or int, default=False] Sets the verbosity amount
normalize [bool, default=True] This parameter is ignored when fit_intercept is set
to False. If True, the regressors X will be normalized before regression by subtract-
ing the mean and dividing by the l2-norm. If you wish to standardize, please use
sklearn.preprocessing.StandardScaler before calling fit on an estimator
with normalize=False.
precompute [bool, ‘auto’ or array-like, default=’auto’] Whether to use a precomputed Gram
matrix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also
be passed as argument.
max_iter [int, default=500] Maximum number of iterations to perform. Can be used for early
stopping.
eps [float, optional] The machine-precision regularization in the computation of the Cholesky
diagonal factors. Increase this for very ill-conditioned systems. Unlike the tol parameter in
some iterative optimization-based algorithms, this parameter does not control the tolerance
of the optimization. By default, np.finfo(np.float).eps is used
copy_X [bool, default=True] If True, X will be copied; else, it may be overwritten.
positive [bool, default=False] Restrict coefficients to be >= 0. Be aware that you might want to
remove fit_intercept which is set True by default. Under the positive restriction the model
coefficients do not converge to the ordinary-least-squares solution for small values of alpha.
Only coefficients up to the smallest alpha value (alphas_[alphas_ > 0.].min()
when fit_path=True) reached by the stepwise Lars-Lasso algorithm are typically in congru-
ence with the solution of the coordinate descent Lasso estimator. As a consequence using
LassoLarsIC only makes sense for problems where a sparse solution is expected and/or
reached.
Attributes
coef_ [array-like of shape (n_features,)] parameter vector (w in the formulation formula)
intercept_ [float] independent term in decision function.
alpha_ [float] the alpha parameter chosen by the information criterion
n_iter_ [int] number of iterations run by lars_path to find the grid of alphas.
criterion_ [array-like of shape (n_alphas,)] The value of the information criteria (‘aic’, ‘bic’)
across all alphas. The alpha which has the smallest information criterion is chosen. This
value is larger by a factor of n_samples compared to Eqns. 2.15 and 2.16 in (Zou et al,
2007).
See also:
Notes
Examples
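A minimal sketch selecting alpha with BIC on synthetic data (settings illustrative):

import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.randn(100)

reg = LassoLarsIC(criterion='bic').fit(X, y)
print(reg.alpha_)                  # alpha minimizing the information criterion
print(np.flatnonzero(reg.coef_))   # indices of the features kept in the model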
Methods
Notes
When using ensemble methods based upon bagging, i.e. generating new training sets using sampling with replacement,
part of the training set remains unused. For each classifier in the ensemble, a different part of the training set is left
out.
This left out portion can be used to estimate the generalization error without having to rely on a separate validation
set. This estimate comes “for free” as no additional data is needed and can be used for model selection.
This is currently implemented in the following classes:
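For instance, with RandomForestClassifier (documented below) the estimate is exposed through the oob_score_ attribute; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# bootstrap=True (the default) is required for out-of-bag scoring.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print(clf.oob_score_)   # generalization accuracy estimated without a separate validation set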
sklearn.ensemble.RandomForestClassifier
max_depth [int, default=None] The maximum depth of the tree. If None, then nodes are ex-
panded until all leaves are pure or until all leaves contain less than min_samples_split sam-
ples.
min_samples_split [int or float, default=2] The minimum number of samples required to split
an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a fraction and ceil(min_samples_split
* n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
min_samples_leaf [int or float, default=1] The minimum number of samples required to be
at a leaf node. A split point at any depth will only be considered if it leaves at least
min_samples_leaf training samples in each of the left and right branches. This may
have the effect of smoothing the model, especially in regression.
• If int, then consider min_samples_leaf as the minimum number.
• If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf *
n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
min_weight_fraction_leaf [float, default=0.] The minimum weighted fraction of the sum total
of weights (of all the input samples) required to be at a leaf node. Samples have equal weight
when sample_weight is not provided.
max_features [{“auto”, “sqrt”, “log2”}, int or float, default=”auto”] The number of features to
consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features *
n_features) features are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples
is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes [int, default=None] Grow trees with max_leaf_nodes in best-first fashion.
Best nodes are defined as relative reduction in impurity. If None then unlimited number of
leaf nodes.
min_impurity_decrease [float, default=0.] A node will be split if this split induces a decrease
of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, default=0] Threshold for early stopping in tree growth. A node will
split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
bootstrap [bool, default=True] Whether bootstrap samples are used when building trees. If
False, the whole dataset is used to build each tree.
oob_score [bool, default=False] Whether to use out-of-bag samples to estimate the generaliza-
tion accuracy.
n_jobs [int, default=None] The number of jobs to run in parallel. fit, predict,
decision_path and apply are all parallelized over the trees. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors. See Glos-
sary for more details.
random_state [int, RandomState instance, default=None] Controls both the randomness of
the bootstrapping of the samples used when building trees (if bootstrap=True) and
the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features). See Glossary for details.
verbose [int, default=0] Controls the verbosity when fitting and predicting.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the
Glossary.
class_weight [{“balanced”, “balanced_subsample”}, dict or list of dicts, default=None]
Weights associated with classes in the form {class_label: weight}. If not given,
all classes are supposed to have weight one. For multi-output problems, a list of dicts can
be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of
every column in its own dict. For example, for four-class multilabel classification weights
should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5},
{3:1}, {4:1}].
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y))
The “balanced_subsample” mode is the same as “balanced” except that weights are com-
puted based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
ccp_alpha [non-negative float, default=0.0] Complexity parameter used for Minimal Cost-
Complexity Pruning. The subtree with the largest cost complexity that is smaller than
ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-
Complexity Pruning for details.
New in version 0.22.
max_samples [int or float, default=None] If bootstrap is True, the number of samples to draw
from X to train each base estimator.
• If None (default), then draw X.shape[0] samples.
• If int, then draw max_samples samples.
• If float, then draw max_samples * X.shape[0] samples. Thus, max_samples
should be in the interval (0, 1).
New in version 0.22.
Attributes
base_estimator_ [DecisionTreeClassifier] The child estimator template used to create the col-
lection of fitted sub-estimators.
estimators_ [list of DecisionTreeClassifier] The collection of fitted sub-estimators.
classes_ [array of shape (n_classes,) or a list of such arrays] The classes labels (single output
problem), or a list of arrays of class labels (multi-output problem).
n_classes_ [int or list] The number of classes (single output problem), or a list containing the
number of classes for each output (multi-output problem).
n_features_ [int] The number of features when fit is performed.
n_outputs_ [int] The number of outputs when fit is performed.
feature_importances_ [ndarray of shape (n_features,)] Return the feature importances
(the higher, the more important the feature).
oob_score_ [float] Score of the training dataset obtained using an out-of-bag estimate. This
attribute exists only when oob_score is True.
oob_decision_function_ [ndarray of shape (n_samples, n_classes)] Decision function com-
puted with out-of-bag estimate on the training set. If n_estimators is small it might
be possible that a data point was never left out during the bootstrap. In this case,
oob_decision_function_ might contain NaN. This attribute exists only when
oob_score is True.
See also:
DecisionTreeClassifier, ExtraTreesClassifier
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data, max_features=n_features and bootstrap=False, if the improvement of the
criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic
behaviour during fitting, random_state has to be fixed.
References
[R45f14345c000-1]
Examples
Methods
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
indicator [sparse matrix of shape (n_samples, n_nodes)] Return a node indicator matrix
where non-zero elements indicate that the sample goes through the corresponding nodes.
The matrix is in CSR format.
n_nodes_ptr [ndarray of size (n_estimators + 1,)] The columns from indica-
tor[n_nodes_ptr[i]:n_nodes_ptr[i+1]] gives the indicator value for the i-th estimator.
property feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns
feature_importances_ [ndarray of shape (n_features,)] The values of this array sum to 1,
unless all trees are single node trees consisting of only the root node, in which case it will
be an array of zeros.
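A brief usage sketch (synthetic data; settings illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = clf.feature_importances_
print(importances.sum())                   # sums to 1.0 when the trees have splits
print(np.argsort(importances)[::-1][:3])   # indices of the three most important features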
fit(self, X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is
provided, it will be converted into a sparse csc_matrix.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] The target values (class labels
in classification, real numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Splits that would create child nodes with net zero
or negative weight are ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any single class carrying a
negative weight in either child node.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict class for X.
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability
estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
y [ndarray of shape (n_samples,) or (n_samples, n_outputs)] The predicted classes.
predict_log_proba(self, X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class
probabilities of the trees in the forest.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
p [ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs
> 1] The class probabilities of the input samples. The order of the classes corresponds to
that in the attribute classes_.
predict_proba(self, X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class
in a leaf.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
p [ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs
> 1] The class probabilities of the input samples. The order of the classes corresponds to
that in the attribute classes_.
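As described above, the forest probabilities are the average of the per-tree probabilities; a sketch checking this on synthetic data (the float32 conversion is made explicit to match the forest's internal handling):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

Xq = X[:5].astype(np.float32)
proba = clf.predict_proba(Xq)
print(proba.shape)            # (5, n_classes)
print(proba.sum(axis=1))      # each row sums to 1
per_tree = np.mean([tree.predict_proba(Xq) for tree in clf.estimators_], axis=0)
print(np.allclose(proba, per_tree))   # forest output equals the per-tree mean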
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.ensemble.RandomForestRegressor
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, default=0] Threshold for early stopping in tree growth. A node will
split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
bootstrap [bool, default=True] Whether bootstrap samples are used when building trees. If
False, the whole dataset is used to build each tree.
oob_score [bool, default=False] whether to use out-of-bag samples to estimate the R^2 on un-
seen data.
n_jobs [int, default=None] The number of jobs to run in parallel. fit, predict,
decision_path and apply are all parallelized over the trees. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors. See Glos-
sary for more details.
random_state [int, RandomState instance, default=None] Controls both the randomness of
the bootstrapping of the samples used when building trees (if bootstrap=True) and
the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features). See Glossary for details.
verbose [int, default=0] Controls the verbosity when fitting and predicting.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the
Glossary.
ccp_alpha [non-negative float, default=0.0] Complexity parameter used for Minimal Cost-
Complexity Pruning. The subtree with the largest cost complexity that is smaller than
ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-
Complexity Pruning for details.
New in version 0.22.
max_samples [int or float, default=None] If bootstrap is True, the number of samples to draw
from X to train each base estimator.
• If None (default), then draw X.shape[0] samples.
• If int, then draw max_samples samples.
DecisionTreeRegressor, ExtraTreesRegressor
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data, max_features=n_features and bootstrap=False, if the improvement of the
criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic
behaviour during fitting, random_state has to be fixed.
The default value max_features="auto" uses n_features rather than n_features / 3. The latter
was originally suggested in [1], whereas the former was more recently justified empirically in [2].
References
[Rf91cab2dc427-1], [Rf91cab2dc427-2]
Examples
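A minimal sketch on synthetic regression data (settings illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, n_informative=4, random_state=0)

# max_features="auto" (the default) uses all n_features, as noted above.
reg = RandomForestRegressor(n_estimators=100, max_features="auto", random_state=0).fit(X, y)
print(reg.score(X, y))        # R^2 on the training data
print(reg.predict(X[:3]))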
Methods
property feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns
feature_importances_ [ndarray of shape (n_features,)] The values of this array sum to 1,
unless all trees are single node trees consisting of only the root node, in which case it will
be an array of zeros.
fit(self, X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is
provided, it will be converted into a sparse csc_matrix.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] The target values (class labels
in classification, real numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Splits that would create child nodes with net zero
or negative weight are ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any single class carrying a
negative weight in either child node.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
y [ndarray of shape (n_samples,) or (n_samples, n_outputs)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.ensemble.ExtraTreesClassifier
max_features [{“auto”, “sqrt”, “log2”}, int or float, default=”auto”] The number of features to
consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features *
n_features) features are considered at each split.
• If “auto”, then max_features=sqrt(n_features).
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples
is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes [int, default=None] Grow trees with max_leaf_nodes in best-first fashion.
Best nodes are defined as relative reduction in impurity. If None then unlimited number of
leaf nodes.
min_impurity_decrease [float, default=0.] A node will be split if this split induces a decrease
of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, default=0] Threshold for early stopping in tree growth. A node will
split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
bootstrap [bool, default=False] Whether bootstrap samples are used when building trees. If
False, the whole dataset is used to build each tree.
oob_score [bool, default=False] Whether to use out-of-bag samples to estimate the generaliza-
tion accuracy.
n_jobs [int, default=None] The number of jobs to run in parallel. fit, predict,
decision_path and apply are all parallelized over the trees. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors. See Glos-
sary for more details.
random_state [int, RandomState instance, default=None] Controls 3 sources of randomness:
• the bootstrapping of the samples used when building trees (if bootstrap=True)
• the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features)
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
References
[Rc8f28bfad63f-1]
Examples
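A minimal sketch (synthetic data; note that bootstrap defaults to False here, unlike RandomForestClassifier):

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
print(clf.feature_importances_[:5])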
Methods
Returns
feature_importances_ [ndarray of shape (n_features,)] The values of this array sum to 1,
unless all trees are single node trees consisting of only the root node, in which case it will
be an array of zeros.
fit(self, X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is
provided, it will be converted into a sparse csc_matrix.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] The target values (class labels
in classification, real numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Splits that would create child nodes with net zero
or negative weight are ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any single class carrying a
negative weight in either child node.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict class for X.
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability
estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
y [ndarray of shape (n_samples,) or (n_samples, n_outputs)] The predicted classes.
predict_log_proba(self, X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample is computed as the log of the mean predicted class
probabilities of the trees in the forest.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
p [ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs
> 1] The class probabilities of the input samples. The order of the classes corresponds to
that in the attribute classes_.
predict_proba(self, X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class
in a leaf.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
p [ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs
> 1] The class probabilities of the input samples. The order of the classes corresponds to
that in the attribute classes_.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.ensemble.ExtraTreesRegressor
max_features [{“auto”, “sqrt”, “log2”} int or float, default=”auto”] The number of features to
consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features *
n_features) features are considered at each split.
• If “auto”, then max_features=n_features.
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples
is found, even if it requires to effectively inspect more than max_features features.
max_leaf_nodes [int, default=None] Grow trees with max_leaf_nodes in best-first fashion.
Best nodes are defined as relative reduction in impurity. If None then unlimited number of
leaf nodes.
min_impurity_decrease [float, default=0.] A node will be split if this split induces a decrease
of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, default=0] Threshold for early stopping in tree growth. A node will
split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
bootstrap [bool, default=False] Whether bootstrap samples are used when building trees. If
False, the whole dataset is used to build each tree.
oob_score [bool, default=False] Whether to use out-of-bag samples to estimate the R^2 on
unseen data.
n_jobs [int, default=None] The number of jobs to run in parallel. fit, predict,
decision_path and apply are all parallelized over the trees. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors. See Glos-
sary for more details.
random_state [int, RandomState instance, default=None] Controls 3 sources of randomness:
• the bootstrapping of the samples used when building trees (if bootstrap=True)
• the sampling of the features to consider when looking for the best split at each node (if
max_features < n_features)
Notes
The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
References
[Ra7d0c8995fbc-1]
Methods
Returns
feature_importances_ [ndarray of shape (n_features,)] The values of this array sum to 1,
unless all trees are single node trees consisting of only the root node, in which case it will
be an array of zeros.
fit(self, X, y, sample_weight=None)
Build a forest of trees from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is
provided, it will be converted into a sparse csc_matrix.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] The target values (class labels
in classification, real numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Splits that would create child nodes with net zero
or negative weight are ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any single class carrying a
negative weight in either child node.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the trees in the forest.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples. Inter-
nally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided,
it will be converted into a sparse csr_matrix.
Returns
y [ndarray of shape (n_samples,) or (n_samples, n_outputs)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.ensemble.GradientBoostingClassifier
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, (default=0)] Threshold for early stopping in tree growth. A node
will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
init [estimator or ‘zero’, optional (default=None)] An estimator object that is used to compute
the initial predictions. init has to provide fit and predict_proba. If ‘zero’, the
initial raw predictions are set to zero. By default, a DummyEstimator predicting the
classes priors is used.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
max_features [int, float, string or None, optional (default=None)] The number of features to
consider when looking for the best split:
• If int, then consider max_features features at each split.
sklearn.ensemble.HistGradientBoostingClassifier
sklearn.tree.DecisionTreeClassifier, RandomForestClassifier
AdaBoostClassifier
Notes
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data and max_features=n_features, if the improvement of the criterion is identical for
several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting,
random_state has to be fixed.
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Methods
Returns
feature_importances_ [array, shape (n_features,)] The values of this array sum to 1, unless
all trees are single node trees consisting of only the root node, in which case it will be an
array of zeros.
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
p [array, shape (n_samples, n_classes)] The class log-probabilities of the input samples. The
order of the classes corresponds to that in the attribute classes_.
Raises
AttributeError If the loss does not support probabilities.
predict_proba(self, X)
Predict class probabilities for X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
p [array, shape (n_samples, n_classes)] The class probabilities of the input samples. The
order of the classes corresponds to that in the attribute classes_.
Raises
AttributeError If the loss does not support probabilities.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
staged_decision_function(self, X)
Compute decision function of X for each iteration.
This method allows monitoring (i.e. determining the error on the testing set) after each stage.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
score [generator of array, shape (n_samples, k)] The decision function of the input sam-
ples, which corresponds to the raw values predicted from the trees of the ensemble. The
classes correspond to those in the attribute classes_. Regression and binary classification
are special cases with k == 1, otherwise k == n_classes.
staged_predict(self, X)
Predict class at each stage for X.
This method allows monitoring (i.e. determining the error on the testing set) after each stage.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
y [generator of array of shape (n_samples,)] The predicted value of the input samples.
staged_predict_proba(self, X)
Predict class probabilities at each stage for X.
This method allows monitoring (i.e. determining the error on the testing set) after each stage.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
y [generator of array of shape (n_samples,)] The predicted value of the input samples.
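The staged_* methods above make it straightforward to monitor held-out performance as boosting proceeds, without refitting; a minimal sketch (synthetic data, illustrative settings):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Test-set error after each boosting stage.
errors = [np.mean(y_pred != y_test) for y_pred in clf.staged_predict(X_test)]
print(len(errors))                   # 200, one entry per stage
print(int(np.argmin(errors)) + 1)    # number of stages with the lowest test error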
sklearn.ensemble.GradientBoostingRegressor
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, (default=0)] Threshold for early stopping in tree growth. A node
will split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
init [estimator or ‘zero’, optional (default=None)] An estimator object that is used to compute
the initial predictions. init has to provide fit and predict. If ‘zero’, the initial raw predic-
tions are set to zero. By default a DummyEstimator is used, predicting either the average
target value (for loss=’ls’), or a quantile for the other losses.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
max_features [int, float, string or None, optional (default=None)] The number of features to
consider when looking for the best split:
• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features *
n_features) features are considered at each split.
See also: sklearn.ensemble.HistGradientBoostingRegressor, sklearn.tree.DecisionTreeRegressor, RandomForestRegressor
Notes
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data and max_features=n_features, if the improvement of the criterion is identical for
several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting,
random_state has to be fixed.
References
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29,
No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
Methods
property feature_importances_
The impurity-based feature importances: the higher, the more important the feature.
Returns
feature_importances_ [array, shape (n_features,)] The values of this array sum to 1, unless all trees are single node trees consisting of only the root node, in which case it will be an array of zeros.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict regression target for X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
y [array, shape (n_samples,)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
Returns
self [object] Estimator instance.
staged_predict(self, X)
Predict regression target at each stage for X.
This method allows monitoring (i.e. determining the error on a testing set) after each stage.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples. Inter-
nally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to
a sparse csr_matrix.
Returns
y [generator of array of shape (n_samples,)] The predicted value of the input samples.
There are 3 different APIs for evaluating the quality of a model’s predictions:
• Estimator score method: Estimators have a score method providing a default evaluation criterion for the
problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.
• Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.
cross_val_score and model_selection.GridSearchCV ) rely on an internal scoring strategy. This
is discussed in the section The scoring parameter: defining model evaluation rules.
• Metric functions: The metrics module implements functions assessing prediction error for specific purposes.
These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics
and Clustering metrics.
Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.
See also:
For “pairwise” metrics, between samples and not estimators or predictions, see the Pairwise metrics, Affinities and
Kernels section.
For the most common use cases, you can designate a scorer object with the scoring parameter; the table below
shows all possible values. All scorer objects follow the convention that higher return values are better than
lower return values. Thus metrics which measure the distance between the model and the data, like metrics.
mean_squared_error, are available as neg_mean_squared_error which return the negated value of the metric.
Usage examples:
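For instance, a sketch of passing a predefined scorer name to cross_val_score (the model and dataset are illustrative choices):
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = svm.SVC(random_state=0)
>>> scores = cross_val_score(clf, X, y, cv=5, scoring='recall_macro')
>>> scores.shape
(5,)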
Note: The values listed by the ValueError exception correspond to the functions measuring prediction accuracy
described in the following sections. The scorer objects for those functions are stored in the dictionary sklearn.
metrics.SCORERS.
The module sklearn.metrics also exposes a set of simple functions measuring a prediction error given ground
truth and prediction:
• functions ending with _score return a value to maximize, the higher the better.
• functions ending with _error or _loss return a value to minimize, the lower the better. When converting into
a scorer object using make_scorer, set the greater_is_better parameter to False (True by default; see
the parameter description below).
Metrics available for various machine learning tasks are detailed in sections below.
Many metrics are not given names to be used as scoring values, sometimes because they require additional param-
eters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object. The simplest way
to generate a callable object for scoring is by using make_scorer. That function converts metrics into callables that
can be used for model evaluation.
One typical use case is to wrap an existing metric function from the library with non-default values for its parameters,
such as the beta parameter for the fbeta_score function:
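For example, a sketch that wraps fbeta_score with beta=2 and passes the resulting scorer to a grid search (the estimator and parameter grid are illustrative):
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
...                     scoring=ftwo_scorer, cv=5)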
The second use case is to build a completely custom scorer object from a simple python function using
make_scorer, which can take several parameters:
• the python function you want to use (my_custom_loss_func in the example below)
• whether the python function returns a score (greater_is_better=True, the default) or a loss
(greater_is_better=False). If a loss, the output of the python function is negated by the scorer ob-
ject, conforming to the cross validation convention that scorers return higher values for better models.
• for classification metrics only: whether the python function you provided requires continuous decision certain-
ties (needs_threshold=True). The default value is False.
• any additional parameters, such as beta or labels in f1_score.
Here is an example of building custom scorers, and of using the greater_is_better parameter:
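A sketch, with an illustrative loss function and a DummyClassifier standing in for a real model:
>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import make_scorer
>>> def my_custom_loss_func(y_true, y_pred):
...     diff = np.abs(y_true - y_pred).max()
...     return np.log1p(diff)
...
>>> # greater_is_better=False makes the scorer negate the returned loss
>>> score = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> X = [[1], [1]]
>>> y = [0, 1]
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0).fit(X, y)
>>> my_custom_loss_func(np.array(y), clf.predict(X))
0.69...
>>> score(clf, X, y)
-0.69...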
You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using
the make_scorer factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two
rules:
• It can be called with parameters (estimator, X, y), where estimator is the model that should be
evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the
unsupervised case).
• It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y.
Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
Note that the dict values can either be scorer functions or one of the predefined metric strings.
Currently only those scorer functions that return a single score can be passed inside the dict. Scorer functions that
return multiple values are not permitted and will require a wrapper to return a single metric:
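For instance, a sketch of multimetric evaluation with cross_validate, mixing a predefined string and a make_scorer-wrapped metric (the data and estimator are illustrative):
>>> from sklearn.datasets import make_classification
>>> from sklearn.metrics import make_scorer, recall_score
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.svm import SVC
>>> X, y = make_classification(random_state=0)
>>> scoring = {'accuracy': 'accuracy', 'recall': make_scorer(recall_score)}
>>> cv_results = cross_validate(SVC(), X, y, cv=5, scoring=scoring)
>>> sorted(k for k in cv_results if k.startswith('test_'))
['test_accuracy', 'test_recall']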
Classification metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure classification per-
formance. Some metrics might require probability estimates of the positive class, confidence values, or binary deci-
sions values. Most implementations allow each sample to provide a weighted contribution to the overall score, through
the sample_weight parameter.
Some of these are restricted to the binary classification case:
And some work with binary and multilabel (but not multiclass) problems:
In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and
metric definition.
Some metrics are essentially defined for binary classification tasks (e.g. f1_score, roc_auc_score). In these
cases, by default only the positive label is evaluated, assuming by default that the positive class is labelled 1 (though
this may be configurable through the pos_label parameter).
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems,
one for each class. There are then a number of ways to average binary metric calculations across the set of classes,
each of which may be useful in some scenario. Where available, you should select among these using the average
parameter.
• "macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems
where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their
performance. On the other hand, the assumption that all classes are equally important is often untrue, such that
macro-averaging will over-emphasize the typically low performance on an infrequent class.
• "weighted" accounts for class imbalance by computing the average of binary metrics in which each class’s
score is weighted by its presence in the true data sample.
• "micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-
weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the
per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings,
including multiclass classification where a majority class is to be ignored.
• "samples" applies only to multilabel problems. It does not calculate a per-class measure, instead calculat-
ing the metric over the true and predicted classes for each sample in the evaluation data, and returning their
(sample_weight-weighted) average.
• Selecting average=None will return an array with the score for each class.
While multiclass data is provided to the metric, like binary targets, as an array of class labels, multilabel data is
specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.
Accuracy score
The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False)
of correct predictions.
In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the fraction of correct
predictions over 𝑛samples is defined as
\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)
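For example (illustrative labels):
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2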
Example:
• See Test with permutations the significance of a classification score for an example of accuracy score usage
using permutations of the dataset.
The balanced_accuracy_score function computes the balanced accuracy, which avoids inflated performance
estimates on imbalanced datasets. It is the macro-average of recall scores per class or, equivalently, raw accuracy
where each sample is weighted according to the inverse prevalence of its true class. Thus for balanced datasets, the
score is equal to accuracy.
In the binary case, balanced accuracy is equal to the arithmetic mean of sensitivity (true positive rate) and specificity
(true negative rate), or the area under the ROC curve with binary predictions rather than scores.
If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number
of correct predictions divided by the total number of predictions).
In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to \frac{1}{n\_classes}.
The score ranges from 0 to 1, or when adjusted=True is used, it is rescaled to the range \frac{1}{1 - n\_classes} to 1, inclusive, with performance at random scoring 0.
If y_i is the true value of the i-th sample, and w_i is the corresponding sample weight, then we adjust the sample weight to:
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i) w_j}
where 1(x) is the indicator function. Given predicted \hat{y}_i for sample i, balanced accuracy is defined as:
\texttt{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i
With adjusted=True, balanced accuracy reports the relative increase from \texttt{balanced-accuracy}(y, \mathbf{0}, w) = \frac{1}{n\_classes}. In the binary case, this is also known as Youden's J statistic, or informedness.
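For example (illustrative labels):
>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625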
Note: The multiclass definition here seems the most reasonable extension of the metric used in binary classification,
though there is no certain consensus in the literature:
• Our definition: [Mosley2013], [Kelleher2015] and [Guyon2015], where [Guyon2015] adopt the adjusted version to ensure that random predictions have a score of 0 and perfect predictions have a score of 1.
• Class balanced accuracy as described in [Mosley2013]: the minimum between the precision and the recall for
each class is computed. Those values are then averaged over the total number of classes to get the balanced
accuracy.
• Balanced Accuracy as described in [Urbanowicz2015]: the average of sensitivity and specificity is computed
for each class and then averaged over total number of classes.
References:
Cohen’s kappa
The function cohen_kappa_score computes Cohen’s kappa statistic. This measure is intended to compare label-
ings by different human annotators, not a classifier versus a ground truth.
The kappa score (see docstring) is a number between -1 and 1. Scores above .8 are generally considered good agree-
ment; zero or lower means no agreement (practically random labels).
Kappa scores can be computed for binary or multiclass problems, but not for multilabel problems (except by manually
computing a per-label score) and not for more than two annotators.
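A small sketch with two illustrative annotations:
>>> from sklearn.metrics import cohen_kappa_score
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> cohen_kappa_score(y_true, y_pred)
0.4285...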
Confusion matrix
The confusion_matrix function evaluates classification accuracy by computing the confusion matrix, with each row corresponding to the true class (https://en.wikipedia.org/wiki/Confusion_matrix). Wikipedia and other references may use a different convention for the axes.
By definition, entry 𝑖, 𝑗 in a confusion matrix is the number of observations actually in group 𝑖, but predicted to be in
group 𝑗. Here is an example:
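(The label vectors below are illustrative.)
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])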
plot_confusion_matrix can be used to visually represent a confusion matrix as shown in the Confusion matrix
example, which creates the following figure:
The parameter normalize allows reporting ratios instead of counts. The confusion matrix can be normalized in 3 different ways: 'pred', 'true', and 'all', which will divide the counts by the sum of each column, each row, or the entire matrix, respectively.
For binary problems, we can get counts of true negatives, false positives, false negatives and true positives as follows:
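(sketch with illustrative binary labels)
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [0, 0, 0, 1, 1, 1, 1, 1]
>>> y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
>>> tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
>>> (tn, fp, fn, tp)
(2, 1, 2, 3)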
Example:
• See Confusion matrix for an example of using a confusion matrix to evaluate classifier output quality.
• See Recognizing hand-written digits for an example of using a confusion matrix to classify hand-written
digits.
• See Classification of text documents using sparse features for an example of using a confusion matrix to
classify text documents.
Classification report
The classification_report function builds a text report showing the main classification metrics. Here is a
small example with custom target_names and inferred labels:
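A sketch of inputs consistent with the summary rows shown below (the exact label arrays are an assumption):
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.67      1.00      0.80         2
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.50      0.67         2
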
    accuracy                           0.60         5
   macro avg       0.56      0.50      0.49         5
weighted avg       0.67      0.60      0.59         5
Example:
• See Recognizing hand-written digits for an example of classification report usage for hand-written digits.
• See Classification of text documents using sparse features for an example of classification report usage for
text documents.
• See Parameter estimation using grid search with cross-validation for an example of classification report usage
for grid search with nested cross-validation.
Hamming loss
The hamming_loss computes the average Hamming loss or Hamming distance between two sets of samples.
If 𝑦ˆ𝑗 is the predicted value for the 𝑗-th label of a given sample, 𝑦𝑗 is the corresponding true value, and 𝑛labels is the
number of classes or labels, then the Hamming loss 𝐿𝐻𝑎𝑚𝑚𝑖𝑛𝑔 between two samples is defined as:
L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels}-1} 1(\hat{y}_j \neq y_j)
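For example (illustrative labels):
>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> hamming_loss(y_true, y_pred)
0.25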
Note: In multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and
y_pred which is similar to the Zero one loss function. However, while zero-one loss penalizes prediction sets that
do not strictly match true sets, the Hamming loss penalizes individual labels. Thus the Hamming loss, upper bounded
by the zero-one loss, is always between zero and one, inclusive; and predicting a proper subset or superset of the true
labels will give a Hamming loss between zero and one, exclusive.
Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the
ability of the classifier to find all the positive samples.
The F-measure (𝐹𝛽 and 𝐹1 measures) can be interpreted as a weighted harmonic mean of the precision and recall. A
𝐹𝛽 measure reaches its best value at 1 and its worst score at 0. With 𝛽 = 1, 𝐹𝛽 and 𝐹1 are equivalent, and the recall
and the precision are equally important.
The precision_recall_curve computes a precision-recall curve from the ground truth label and a score given
by the classifier by varying a decision threshold.
The average_precision_score function computes the average precision (AP) from prediction scores. The
value is between 0 and 1 and higher is better. AP is defined as
∑︁
AP = (𝑅𝑛 − 𝑅𝑛−1 )𝑃𝑛
𝑛
where 𝑃𝑛 and 𝑅𝑛 are the precision and recall at the nth threshold. With random predictions, the AP is the fraction of
positive samples.
References [Manning2008] and [Everingham2010] present alternative variants of AP that interpolate the precision-
recall curve. Currently, average_precision_score does not implement any interpolated variant. References
[Davis2006] and [Flach2015] describe why a linear interpolation of points on the precision-recall curve provides an
overly-optimistic measure of classifier performance. This linear interpolation is used when computing area under the
curve with the trapezoidal rule in auc.
Several functions allow you to analyze the precision, recall and F-measures score:
Note that the precision_recall_curve function is restricted to the binary case. The
average_precision_score function works only in binary classification and multilabel indicator format.
The plot_precision_recall_curve function plots the precision recall as follows.
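A small sketch of precision_recall_curve with illustrative scores:
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
>>> precision
array([0.66..., 0.5, 1. , 1. ])
>>> recall
array([1. , 0.5, 0.5, 0. ])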
Examples:
• See Classification of text documents using sparse features for an example of f1_score usage to classify
text documents.
• See Parameter estimation using grid search with cross-validation for an example of precision_score
and recall_score usage to estimate parameters using grid search with nested cross-validation.
• See Precision-Recall for an example of precision_recall_curve usage to evaluate classifier output
quality.
References:
Binary classification
In a binary classification task, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). Given these definitions, we can formulate the following table:
In this context, we can define the notions of precision, recall and F-measure:
\text{precision} = \frac{tp}{tp + fp},
\text{recall} = \frac{tp}{tp + fn},
F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2\, \text{precision} + \text{recall}}.
Here are some small examples in binary classification:
>>> from sklearn import metrics
>>> y_pred = [0, 1, 0, 0]
>>> y_true = [0, 1, 0, 1]
>>> metrics.precision_score(y_true, y_pred)
1.0
>>> metrics.recall_score(y_true, y_pred)
0.5
>>> metrics.f1_score(y_true, y_pred)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=0.5)
0.83...
>>> metrics.fbeta_score(y_true, y_pred, beta=1)
0.66...
>>> metrics.fbeta_score(y_true, y_pred, beta=2)
0.55...
>>> metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5)
(array([0.66..., 1. ]), array([1. , 0.5]), array([0.71..., 0.83...]), array([2, 2]))
In multiclass and multilabel classification task, the notions of precision, recall, and F-measures can be ap-
plied to each label independently. There are a few ways to combine results across labels, specified by the
average argument to the average_precision_score (multilabel only), f1_score, fbeta_score,
precision_recall_fscore_support, precision_score and recall_score functions, as described
above. Note that if all labels are included, “micro”-averaging in a multiclass setting will produce precision, recall and
𝐹 that are all identical to accuracy. Also note that “weighted” averaging may produce an F-score that is not between
precision and recall.
To make this more explicit, consider the following notation:
• 𝑦 the set of predicted (𝑠𝑎𝑚𝑝𝑙𝑒, 𝑙𝑎𝑏𝑒𝑙) pairs
• 𝑦ˆ the set of true (𝑠𝑎𝑚𝑝𝑙𝑒, 𝑙𝑎𝑏𝑒𝑙) pairs
• 𝐿 the set of labels
• 𝑆 the set of samples
• 𝑦𝑠 the subset of 𝑦 with sample 𝑠, i.e. 𝑦𝑠 := {(𝑠′ , 𝑙) ∈ 𝑦|𝑠′ = 𝑠}
• 𝑦𝑙 the subset of 𝑦 with label 𝑙
• similarly, 𝑦ˆ𝑠 and 𝑦ˆ𝑙 are subsets of 𝑦ˆ
• P(A, B) := \frac{|A \cap B|}{|A|} for some sets A and B
• R(A, B) := \frac{|A \cap B|}{|B|} (Conventions vary on handling B = \emptyset; this implementation uses R(A, B) := 0, and similarly for P.)
• F_\beta(A, B) := (1 + \beta^2) \frac{P(A, B) \times R(A, B)}{\beta^2 P(A, B) + R(A, B)}
For multiclass classification with a “negative class”, it is possible to exclude some labels:
Similarly, labels not present in the data sample may be accounted for in macro-averaging.
The jaccard_score function computes the average of Jaccard similarity coefficients, also called the Jaccard index,
between pairs of label sets.
The Jaccard similarity coefficient of the 𝑖-th samples, with a ground truth label set 𝑦𝑖 and predicted label set 𝑦ˆ𝑖 , is
defined as
J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}.
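For instance, in the binary case (illustrative indicator rows):
>>> import numpy as np
>>> from sklearn.metrics import jaccard_score
>>> y_true = np.array([[0, 1, 1], [1, 1, 0]])
>>> y_pred = np.array([[1, 1, 1], [1, 0, 0]])
>>> jaccard_score(y_true[0], y_pred[0])
0.66...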
Multiclass problems are binarized and treated like the corresponding multilabel problem:
Hinge loss
The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-
sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support
vector machines.)
If the labels are encoded with +1 and -1, y is the true value, and w is the predicted decision as output by decision_function, then the hinge loss is defined as:
L_\text{Hinge}(y, w) = \max\{1 - wy, 0\} = |1 - wy|_+
If there are more than two labels, hinge_loss uses a multiclass variant due to Crammer & Singer. Here is the paper
describing it.
If y_w is the predicted decision for the true label and y_t is the maximum of the predicted decisions for all other labels, where predicted decisions are output by decision_function, then the multiclass hinge loss is defined by:
L_\text{Hinge}(y_w, y_t) = \max\{1 + y_t - y_w, 0\}
Here is a small example demonstrating the use of the hinge_loss function with an SVM classifier in a binary class problem:
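(sketch; the toy data and the fitted decision values are illustrative)
>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = [[0], [1]]
>>> y = [-1, 1]
>>> est = svm.LinearSVC(random_state=0).fit(X, y)
>>> pred_decision = est.decision_function([[-2], [3], [0.5]])
>>> loss = hinge_loss([-1, 1, 1], pred_decision)  # roughly 0.3 for this toy model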
Here is an example demonstrating the use of the hinge_loss function with an SVM classifier in a multiclass problem:
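(sketch; the labels and toy data are illustrative)
>>> import numpy as np
>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = np.array([[0], [1], [2], [3]])
>>> Y = np.array([0, 1, 2, 3])
>>> labels = np.array([0, 1, 2, 3])
>>> est = svm.LinearSVC(random_state=0).fit(X, Y)
>>> pred_decision = est.decision_function([[-1], [2], [3]])
>>> y_true = [0, 2, 3]
>>> loss = hinge_loss(y_true, pred_decision, labels=labels)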
Log loss
Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly
used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization,
and can be used to evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predic-
tions.
For binary classification with a true label y \in \{0, 1\} and a probability estimate p = \Pr(y = 1), the log loss per sample is the negative log-likelihood of the classifier given the true label:
L_{\log}(y, p) = -\log \Pr(y|p) = -(y \log(p) + (1 - y) \log(1 - p))
This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary
indicator matrix 𝑌 , i.e., 𝑦𝑖,𝑘 = 1 if sample 𝑖 has label 𝑘 taken from a set of 𝐾 labels. Let 𝑃 be a matrix of probability
estimates, with 𝑝𝑖,𝑘 = Pr(𝑡𝑖,𝑘 = 1). Then the log loss of the whole set is
L_{\log}(Y, P) = -\log \Pr(Y|P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}
To see how this generalizes the binary log loss given above, note that in the binary case, 𝑝𝑖,0 = 1 − 𝑝𝑖,1 and 𝑦𝑖,0 =
1 − 𝑦𝑖,1 , so expanding the inner sum over 𝑦𝑖,𝑘 ∈ {0, 1} gives the binary log loss.
The log_loss function computes log loss given a list of ground-truth labels and a probability matrix, as returned by
an estimator’s predict_proba method.
>>> from sklearn.metrics import log_loss
>>> y_true = [0, 0, 1, 1]
>>> y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
>>> log_loss(y_true, y_pred)
0.1738...
The first [.9, .1] in y_pred denotes 90% probability that the first sample has label 0. The log loss is non-negative.
The matthews_corrcoef function computes the Matthews correlation coefficient (MCC) for binary classes.
Quoting Wikipedia:
“The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary
(two-class) classifications. It takes into account true and false positives and negatives and is generally
regarded as a balanced measure which can be used even if the classes are of very different sizes. The
MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents
a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also
known as the phi coefficient.”
In the binary (two-class) case, tp, tn, fp and fn are respectively the number of true positives, true negatives, false positives and false negatives, and the MCC is defined as
MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.
In the multiclass case, the Matthews correlation coefficient can be defined in terms of a confusion_matrix C for K classes. To simplify the definition consider the following intermediate variables:
• t_k = \sum_i^K C_{ik} the number of times class k truly occurred,
• p_k = \sum_i^K C_{ki} the number of times class k was predicted,
• c = \sum_k^K C_{kk} the total number of samples correctly predicted,
• s = \sum_i^K \sum_j^K C_{ij} the total number of samples.
Then the multiclass MCC is defined as:
MCC = \frac{c \times s - \sum_k^K p_k \times t_k}{\sqrt{(s^2 - \sum_k^K p_k^2) \times (s^2 - \sum_k^K t_k^2)}}
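A small usage sketch (illustrative labels):
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...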
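Multilabel confusion matrix
The multilabel_confusion_matrix function computes one 2x2 confusion matrix per class (or per sample with samplewise=True), which is useful for deriving per-class statistics. A sketch with multilabel indicator input (the arrays are illustrative):
>>> import numpy as np
>>> from sklearn.metrics import multilabel_confusion_matrix
>>> y_true = np.array([[1, 0, 1], [0, 1, 0]])
>>> y_pred = np.array([[1, 0, 0], [0, 1, 1]])
>>> multilabel_confusion_matrix(y_true, y_pred)
array([[[1, 0],
        [0, 1]],

       [[1, 0],
        [0, 1]],

       [[0, 1],
        [1, 0]]])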
Here is an example demonstrating the use of the multilabel_confusion_matrix function with multiclass
input:
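(The label lists below are illustrative.)
>>> from sklearn.metrics import multilabel_confusion_matrix
>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
>>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
>>> multilabel_confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[[3, 1],
        [0, 2]],

       [[5, 0],
        [1, 0]],

       [[2, 1],
        [1, 2]]])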
Here are some examples demonstrating the use of the multilabel_confusion_matrix function to calculate
recall (or sensitivity), specificity, fall out and miss rate for each class in a problem with multilabel indicator matrix
input.
Calculating recall (also called the true positive rate or the sensitivity) for each class:
Calculating specificity (also called the true negative rate) for each class:
Calculating fall out (also called the false positive rate) for each class:
Calculating miss rate (also called the false negative rate) for each class:
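A combined sketch for the four quantities above (the indicator matrices are illustrative):
>>> import numpy as np
>>> from sklearn.metrics import multilabel_confusion_matrix
>>> y_true = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 0]])
>>> y_pred = np.array([[0, 1, 0], [0, 0, 1], [1, 1, 0]])
>>> mcm = multilabel_confusion_matrix(y_true, y_pred)
>>> tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]
>>> tp / (tp + fn)   # recall per class
array([1. , 0.5, 0. ])
>>> # similarly: tn / (tn + fp) is the specificity, fp / (fp + tn) the fall out,
>>> # and fn / (fn + tp) the miss rate, each computed per class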
The function roc_curve computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia :
“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates
the performance of a binary classifier system as its discrimination threshold is varied. It is created by
plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false
positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known
as sensitivity, and FPR is one minus the specificity or true negative rate.”
This function requires the true binary value and the target scores, which can either be probability estimates of the
positive class, confidence values, or binary decisions. Here is a small example of how to use the roc_curve function:
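(the binary targets and scores below are illustrative)
>>> import numpy as np
>>> from sklearn.metrics import roc_curve
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])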
The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is
also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in
one number. For more information see the Wikipedia article on AUC.
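For instance, in the binary case (illustrative scores):
>>> from sklearn.metrics import roc_auc_score
>>> y_true = [0, 0, 1, 1]
>>> y_scores = [0.1, 0.4, 0.35, 0.8]
>>> roc_auc_score(y_true, y_scores)
0.75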
In multi-label classification, the roc_auc_score function is extended by averaging over the labels as above.
Compared to metrics such as the subset accuracy, the Hamming loss, or the F1 score, ROC doesn't require optimizing a threshold for each label.
One-vs-one Algorithm: Computes the average AUC over all pairwise combinations of classes, where c is the number of classes and AUC(j|k) is the AUC with class j as the positive class and class k as the negative class. In general, AUC(j|k) \neq AUC(k|j) in the multiclass case. This algorithm is used by setting the keyword argument multiclass to 'ovo' and average to 'macro'.
The [HT2001] multiclass AUC metric can be extended to be weighted by the prevalence:
\frac{2}{c(c-1)} \sum_{j=1}^{c} \sum_{k > j} p(j \cup k)\,(\text{AUC}(j|k) + \text{AUC}(k|j))
where 𝑐 is the number of classes. This algorithm is used by setting the keyword argument multiclass to 'ovo'
and average to 'weighted'. The 'weighted' option returns a prevalence-weighted average as described in
[FC2009].
One-vs-rest Algorithm: Computes the AUC of each class against the rest [PD2000]. The algorithm is functionally
the same as the multilabel case. To enable this algorithm set the keyword argument multiclass to 'ovr'. Like
OvO, OvR supports two types of averaging: 'macro' [F2006] and 'weighted' [F2001].
In applications where a high false positive rate is not tolerable the parameter max_fpr of roc_auc_score can be
used to summarize the ROC curve up to the given limit.
Examples:
• See Receiver Operating Characteristic (ROC) for an example of using ROC to evaluate the quality of the
output of a classifier.
• See Receiver Operating Characteristic (ROC) with cross validation for an example of using ROC to evaluate
classifier output quality, using cross-validation.
• See Species distribution modeling for an example of using ROC to model species distribution.
References:
The zero_one_loss function computes the sum or the average of the 0-1 classification loss (𝐿0−1 ) over 𝑛samples .
By default, the function normalizes over the sample. To get the sum of the 𝐿0−1 , set normalize to False.
In multilabel classification, the zero_one_loss scores a subset as one if its labels strictly match the predictions, and as zero if there are any errors. By default, the function returns the percentage of imperfectly predicted subsets. To get the count of such subsets instead, set normalize to False.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample and 𝑦𝑖 is the corresponding true value, then the 0-1 loss 𝐿0−1 is defined
as:
L_{0-1}(y_i, \hat{y}_i) = 1(\hat{y}_i \neq y_i)
In the multilabel case with binary label indicators, where the first label set [0,1] has an error:
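(sketch; the indicator arrays are illustrative)
>>> import numpy as np
>>> from sklearn.metrics import zero_one_loss
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
>>> zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)), normalize=False)
1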
Example:
• See Recursive feature elimination with cross-validation for an example of zero one loss usage to perform
recursive feature elimination with cross-validation.
The brier_score_loss function computes the Brier score for binary classes. Quoting Wikipedia:
“The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is
applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete
outcomes.”
This function returns a score of the mean square difference between the actual outcome and the predicted probability
of the possible outcome. The actual outcome has to be 1 or 0 (true or false), while the predicted probability of the
actual outcome can be a value between 0 and 1.
The Brier score loss is also between 0 and 1, and the lower the score (the smaller the mean square difference), the more accurate the prediction is. It can be thought of as a measure of the "calibration" of a set of probabilistic predictions.
BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2
where N is the total number of predictions, and f_t is the predicted probability of the actual outcome o_t.
Here is a small example of usage of this function:
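(illustrative outcomes and predicted probabilities)
>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.4])
>>> brier_score_loss(y_true, y_prob)
0.055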
Example:
• See Probability calibration of classifiers for an example of Brier score loss usage to perform probability
calibration of classifiers.
References:
• G. Brier, Verification of forecasts expressed in terms of probability, Monthly weather review 78.1 (1950)
In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give
high scores and better rank to the ground truth labels.
Coverage error
The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted. This is useful if you want to know how many top-scored labels you have to predict on average without missing any true one. The best value of this metric is thus the average number of true labels.
Note: Our implementation’s score is 1 greater than the one given in Tsoumakas et al., 2010. This extends it to handle
the degenerate case in which an instance has 0 true labels.
Formally, given a binary indicator matrix of the ground truth labels y \in \{0, 1\}^{n_\text{samples} \times n_\text{labels}} and the score associated with each label \hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}, the coverage is defined as
coverage(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j : y_{ij} = 1} \text{rank}_{ij}
with \text{rank}_{ij} = |\{k : \hat{f}_{ik} \ge \hat{f}_{ij}\}|. Given the rank definition, ties in y_scores are broken by giving the maximal rank that would have been assigned to all tied values.
Here is a small example of usage of this function:
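(the indicator matrix and scores are illustrative)
>>> import numpy as np
>>> from sklearn.metrics import coverage_error
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> coverage_error(y_true, y_score)
2.5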
where \mathcal{L}_{ij} = \{k : y_{ik} = 1, \hat{f}_{ik} \ge \hat{f}_{ij}\}, \text{rank}_{ij} = |\{k : \hat{f}_{ik} \ge \hat{f}_{ij}\}|, |\cdot| computes the cardinality of the set (i.e., the number of elements in the set), and \|\cdot\|_0 is the \ell_0 "norm" (which computes the number of nonzero elements in a vector).
Here is a small example of usage of this function:
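(This passage refers to the label_ranking_average_precision_score function; the arrays below are illustrative.)
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_average_precision_score
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_average_precision_score(y_true, y_score)
0.416...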
Ranking loss
The label_ranking_loss function computes the ranking loss which averages over the samples the number of
label pairs that are incorrectly ordered, i.e. true labels have a lower score than false labels, weighted by the inverse of
the number of ordered pairs of false and true labels. The lowest achievable ranking loss is zero.
Formally, given a binary indicator matrix of the ground truth labels y \in \{0, 1\}^{n_\text{samples} \times n_\text{labels}} and the score associated with each label \hat{f} \in \mathbb{R}^{n_\text{samples} \times n_\text{labels}}, the ranking loss is defined as
ranking\_loss(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \frac{1}{\|y_i\|_0 (n_\text{labels} - \|y_i\|_0)} \left|\{(k, l) : \hat{f}_{ik} \le \hat{f}_{il},\ y_{ik} = 1,\ y_{il} = 0\}\right|
where | · | computes the cardinality of the set (i.e., the number of elements in the set) and || · ||0 is the ℓ0 “norm” (which
computes the number of nonzero elements in a vector).
Here is a small example of usage of this function:
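(the indicator matrix and scores are illustrative)
>>> import numpy as np
>>> from sklearn.metrics import label_ranking_loss
>>> y_true = np.array([[1, 0, 0], [0, 0, 1]])
>>> y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
>>> label_ranking_loss(y_true, y_score)
0.75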
References:
• Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Data mining and knowledge
discovery handbook (pp. 667-685). Springer US.
Discounted Cumulative Gain (DCG) and Normalized Discounted Cumulative Gain (NDCG) are ranking metrics; they
compare a predicted order to ground-truth scores, such as the relevance of answers to a query.
From the Wikipedia page for Discounted Cumulative Gain:
“Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to
measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of
documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position
in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result
discounted at lower ranks”
DCG orders the true targets (e.g. relevance of query answers) in the predicted order, then multiplies them by a
logarithmic decay and sums the result. The sum can be truncated after the first 𝐾 results, in which case we call it
DCG@K. NDCG, or NDCG@K is DCG divided by the DCG obtained by a perfect prediction, so that it is always
between 0 and 1. Usually, NDCG is preferred to DCG.
Compared with the ranking loss, NDCG can take into account relevance scores, rather than a ground-truth ranking. So
if the ground-truth consists only of an ordering, the ranking loss should be preferred; if the ground-truth consists of
actual usefulness scores (e.g. 0 for irrelevant, 1 for relevant, 2 for very relevant), NDCG can be used.
For one sample, given the vector of continuous ground-truth values for each target 𝑦 ∈ R𝑀 , where 𝑀 is the number
of outputs, and the prediction 𝑦ˆ, which induces the ranking function 𝑓 , the DCG score is
\text{DCG@K} = \sum_{r=1}^{\min(K, M)} \frac{y_{f(r)}}{\log(1 + r)}
and the NDCG score is the DCG score divided by the DCG score obtained for 𝑦.
References:
Regression metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure regression
performance. Some of those have been enhanced to handle the multioutput case: mean_squared_error,
mean_absolute_error, explained_variance_score and r2_score.
These functions have a multioutput keyword argument which specifies the way the scores or losses for each individual target should be averaged. The default is 'uniform_average', which specifies a uniformly weighted mean over outputs. If an ndarray of shape (n_outputs,) is passed, then its entries are interpreted as weights and a corresponding weighted average is returned. If multioutput='raw_values' is specified, then all unaltered individual scores or losses will be returned in an array of shape (n_outputs,).
The r2_score and explained_variance_score accept an additional value 'variance_weighted' for
the multioutput parameter. This option leads to a weighting of each individual score by the variance of the
corresponding target variable. This setting quantifies the globally captured unscaled variance. If the target vari-
ables are of different scale, then this score puts more importance on well explaining the higher variance variables.
multioutput='variance_weighted' is the default value for r2_score for backward compatibility. This
will be changed to uniform_average in the future.
If \hat{y} is the estimated target output, y the corresponding (correct) target output, and Var is the variance, the square of the standard deviation, then the explained variance is estimated as follows:
explained\_variance(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}
The best possible score is 1.0, lower values are worse.
Here is a small example of usage of the explained_variance_score function:
>>> from sklearn.metrics import explained_variance_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> explained_variance_score(y_true, y_pred)
0.957...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> explained_variance_score(y_true, y_pred, multioutput='raw_values')
array([0.967..., 1. ])
>>> explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7])
0.990...
Max error
The max_error function computes the maximum residual error, a metric that captures the worst-case error between the predicted value and the true value. In a perfectly fitted single-output regression model, max_error would be 0 on the training set; although this would be highly unlikely in the real world, this metric shows the extent of error that the model had when it was fitted.
If \hat{y}_i is the predicted value of the i-th sample, and y_i is the corresponding true value, then the max error is defined as
\text{Max Error}(y, \hat{y}) = \max_i(|y_i - \hat{y}_i|)
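For example (illustrative values):
>>> from sklearn.metrics import max_error
>>> y_true = [3, 2, 7, 1]
>>> y_pred = [9, 2, 7, 1]
>>> max_error(y_true, y_pred)
6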
The mean_absolute_error function computes mean absolute error, a risk metric corresponding to the expected
value of the absolute error loss or 𝑙1-norm loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean absolute error
(MAE) estimated over 𝑛samples is defined as
\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} |y_i - \hat{y}_i|.
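For example (illustrative values):
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_absolute_error(y_true, y_pred)
0.5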
The mean_squared_error function computes mean square error, a risk metric corresponding to the expected
value of the squared (quadratic) error or loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean squared error
(MSE) estimated over 𝑛samples is defined as
\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2.
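For example (illustrative values):
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375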
Examples:
• See Gradient Boosting regression for an example of mean squared error usage to evaluate gradient boosting
regression.
The mean_squared_log_error function computes a risk metric corresponding to the expected value of the
squared logarithmic (quadratic) error or loss.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean squared logarithmic
error (MSLE) estimated over 𝑛samples is defined as
\text{MSLE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (\log_e(1 + y_i) - \log_e(1 + \hat{y}_i))^2,
where \log_e(x) means the natural logarithm of x. This metric is best used when targets have exponential growth, such as population counts or average sales of a commodity over a span of years. Note that this metric penalizes an under-predicted estimate more than an over-predicted estimate.
Here is a small example of usage of the mean_squared_log_error function:
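(illustrative values)
>>> from sklearn.metrics import mean_squared_log_error
>>> y_true = [3, 5, 2.5, 7]
>>> y_pred = [2.5, 5, 4, 8]
>>> mean_squared_log_error(y_true, y_pred)
0.039...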
The median_absolute_error is particularly interesting because it is robust to outliers. The loss is calculated by
taking the median of all absolute differences between the target and the prediction.
If \hat{y}_i is the predicted value of the i-th sample and y_i is the corresponding true value, then the median absolute error (MedAE) estimated over n_\text{samples} is defined as
\text{MedAE}(y, \hat{y}) = \text{median}(|y_1 - \hat{y}_1|, \ldots, |y_n - \hat{y}_n|).
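For example (illustrative values):
>>> from sklearn.metrics import median_absolute_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> median_absolute_error(y_true, y_pred)
0.5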
Example:
• See Lasso and Elastic Net for Sparse Signals for an example of R2 score usage to evaluate Lasso and Elastic
Net on sparse signals.
The mean_tweedie_deviance function computes the mean Tweedie deviance error with a power parameter (𝑝).
This is a metric that elicits predicted expectation values of regression targets.
Following special cases exist,
• when power=0 it is equivalent to mean_squared_error.
• when power=1 it is equivalent to mean_poisson_deviance.
• when power=2 it is equivalent to mean_gamma_deviance.
If 𝑦ˆ𝑖 is the predicted value of the 𝑖-th sample, and 𝑦𝑖 is the corresponding true value, then the mean Tweedie deviance
error (D) for power 𝑝, estimated over 𝑛samples is defined as
D(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1}
\begin{cases}
(y_i - \hat{y}_i)^2, & \text{for } p = 0 \text{ (Normal)}\\
2(y_i \log(y_i/\hat{y}_i) + \hat{y}_i - y_i), & \text{for } p = 1 \text{ (Poisson)}\\
2(\log(\hat{y}_i/y_i) + y_i/\hat{y}_i - 1), & \text{for } p = 2 \text{ (Gamma)}\\
2\left(\dfrac{\max(y_i, 0)^{2-p}}{(1-p)(2-p)} - \dfrac{y_i\,\hat{y}_i^{1-p}}{1-p} + \dfrac{\hat{y}_i^{2-p}}{2-p}\right), & \text{otherwise}
\end{cases}
Tweedie deviance is a homogeneous function of degree 2 - power. Thus, a Gamma distribution with power=2 means that simultaneously scaling y_true and y_pred has no effect on the deviance. For the Poisson distribution (power=1) the deviance scales linearly, and for the Normal distribution (power=0), quadratically. In general, the higher the power, the less weight is given to extreme deviations between true and predicted targets.
For instance, let's compare the two predictions 1.0 and 100 that are both 50% of their corresponding true value. The mean squared error (power=0) is very sensitive to the prediction difference of the second point:
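A sketch using true values 2.0 and 200, so that both predictions are 50% of their targets (the exact arrays are an assumption):
>>> from sklearn.metrics import mean_squared_error, mean_tweedie_deviance
>>> mean_squared_error([2.0], [1.0])
1.0
>>> mean_squared_error([200.], [100.])
10000.0
>>> # With the Poisson deviance (power=1) the gap between the two errors shrinks,
>>> mean_tweedie_deviance([2.0], [1.0], power=1)
0.77...
>>> mean_tweedie_deviance([200.], [100.], power=1)
77.2...
>>> # and with the Gamma deviance (power=2)
>>> mean_tweedie_deviance([2.0], [1.0], power=2)
0.61...
>>> mean_tweedie_deviance([200.], [100.], power=2)
0.61...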
With power=2, by contrast, we would get identical errors for both pairs; the deviance when power=2 is thus only sensitive to relative errors.
Clustering metrics
The sklearn.metrics module implements several loss, score, and utility functions. For more information see the
Clustering performance evaluation section for instance clustering, and Biclustering evaluation for biclustering.
Dummy estimators
When doing supervised learning, a simple sanity check consists of comparing one’s estimator against simple rules of
thumb. DummyClassifier implements several such simple strategies for classification:
• stratified generates random predictions by respecting the training set class distribution.
• most_frequent always predicts the most frequent label in the training set.
• prior always predicts the class that maximizes the class prior (like most_frequent) and
predict_proba returns the class prior.
• uniform generates predictions uniformly at random.
• constant always predicts a constant label that is provided by the user. A major motivation of this
method is F1-scoring, when the positive class is in the minority.
Note that with all these strategies, the predict method completely ignores the input data!
To illustrate DummyClassifier, first let’s create an imbalanced dataset:
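A sketch (the iris-based setup and the linear-kernel SVC are illustrative choices):
>>> from sklearn.datasets import load_iris
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> y[y != 1] = -1                              # collapse two classes to create imbalance
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> svc = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
>>> svc_score = svc.score(X_test, y_test)       # comparable to...
>>> dummy_score = dummy.score(X_test, y_test)   # ...the dummy baseline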
We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
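Continuing the sketch above with an RBF kernel:
>>> svc_rbf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> rbf_score = svc_rbf.score(X_test, y_test)   # much closer to 1.0 on this split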
We see that the accuracy was boosted to almost 100%. A cross validation strategy is recommended for a better
estimate of the accuracy, if it is not too CPU costly. For more information see the Cross-validation: evaluating
estimator performance section. Moreover if you want to optimize over the parameter space, it is highly recommended
to use an appropriate methodology; see the Tuning the hyper-parameters of an estimator section for details.
More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc.
DummyRegressor also implements four simple rules of thumb for regression:
• mean always predicts the mean of the training targets.
• median always predicts the median of the training targets.
• quantile always predicts a user provided quantile of the training targets.
• constant always predicts a constant value that is provided by the user.
In all these strategies, the predict method completely ignores the input data.
After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to
retrain. The following section gives you an example of how to persist a model with pickle. We’ll also review a few
security and maintainability issues when working with pickle serialization.
An alternative to pickling is to export the model to another format using one of the model export tools listed under
Related Projects. Unlike pickling, once exported you cannot recover the full Scikit-learn estimator object, but you can
deploy the model for prediction, usually by using tools supporting open model interchange formats such as ONNX or
PMML.
Persistence example
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, namely pickle:
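For instance (the SVC-on-iris model is an illustrative choice):
>>> import pickle
>>> from sklearn import datasets, svm
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = svm.SVC().fit(X, y)
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])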
In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is
more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators,
but can only pickle to the disk and not to a string:
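(the file name below is illustrative)
>>> from joblib import dump
>>> saved = dump(clf, 'filename.joblib')   # writes the fitted model to disk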
Later you can load back the pickled model (possibly in another Python process) with:
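(again with the illustrative file name)
>>> from joblib import load
>>> clf = load('filename.joblib')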
Note: dump and load functions also accept file-like object instead of filenames. More information on data persis-
tence with Joblib is available here.
pickle (and joblib by extension) has some issues regarding maintainability and security. Because of this,
• Never unpickle untrusted data as it could lead to malicious code being executed upon loading.
• While models saved using one version of scikit-learn might load in other versions, this is entirely unsupported
and inadvisable. It should also be kept in mind that operations performed on such data could give different and
unexpected results.
In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the
pickled model:
• The training data, e.g. a reference to an immutable snapshot
• The python source code used to generate the model
• The versions of scikit-learn and its dependencies
• The cross validation score obtained on the training data
This should make it possible to check that the cross-validation score is in the same range as before.
Since a model internal representation may be different on two different architectures, dumping a model on one archi-
tecture and loading it on another architecture is not supported.
If you want to know more about these issues and explore other possible serialization methods, please refer to this talk
by Alex Gaynor.
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias,
variance and noise. The bias of an estimator is its average error for different training sets. The variance of an
estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.
In the following plot, we see a function f(x) = \cos(\frac{3}{2}\pi x) and some noisy samples from that function. We use three
different estimators to fit the function: linear regression with polynomial features of degree 1, 4 and 15. We see that
the first estimator can at best provide only a poor fit to the samples and the true function because it is too simple
(high bias), the second estimator approximates it almost perfectly and the last estimator approximates the training data
perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).
Bias and variance are inherent properties of estimators and we usually have to select learning algorithms and hyper-
parameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce
the variance of a model is to use more training data. However, you should only collect more training data if the true
function is too complex to be approximated by an estimator with a lower variance.
In the simple one-dimensional problem that we have seen in the example it is easy to see whether the estimator suffers
from bias or variance. However, in high-dimensional spaces, models can become very difficult to visualize. For this
reason, it is often helpful to use the tools described below.
Examples:
Validation curve
To validate a model we need a scoring function (see Metrics and scoring: quantifying the quality of predictions), for
example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator are of course
grid search or similar methods (see Tuning the hyper-parameters of an estimator) that select the hyperparameter with
the maximum score on a validation set or multiple validation sets. Note that if we optimized the hyperparameters
based on a validation score the validation score is biased and not a good estimate of the generalization any longer. To
get a proper estimate of the generalization we have to compute the score on another test set.
However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the valida-
tion score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> np.random.seed(0)
>>> X, y = load_iris(return_X_y=True)
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
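For instance, a sketch with validation_curve (the Ridge estimator and the alpha range are illustrative choices):
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import validation_curve
>>> train_scores, valid_scores = validation_curve(
...     Ridge(), X, y, param_name="alpha",
...     param_range=np.logspace(-7, 3, 3), cv=5)
>>> train_scores.shape, valid_scores.shape
((3, 5), (3, 5))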
If the training score and the validation score are both low, the estimator will be underfitting. If the training score is
high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training
score and a high validation score is usually not possible. All three cases can be found in the plot below where we vary
the parameter 𝛾 of an SVM on the digits dataset.
Learning curve
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It
is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from
a variance error or a bias error. Consider the following example where we plot the learning curve of a naive Bayes
classifier and an SVM.
For the naive Bayes, both the validation score and the training score converge to a value that is quite low with increasing
size of the training set. Thus, we will probably not benefit much from more training data.
In contrast, for small amounts of data, the training score of the SVM is much greater than the validation score. Adding
more training samples will most likely increase generalization.
We can use the function learning_curve to generate the values that are required to plot such a learning curve
(number of samples that have been used, the average scores on the training sets and the average scores on the validation
sets):
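A sketch (the linear SVC on iris and the chosen training sizes are illustrative):
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> train_sizes, train_scores, valid_scores = learning_curve(
...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50,  80, 110])
>>> train_scores.shape, valid_scores.shape
((3, 5), (3, 5))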
4.4 Inspection
Predictive performance is often the main goal of developing machine learning models. Yet summarising performance
with an evaluation metric is often insufficient: it assumes that the evaluation metric and test dataset perfectly reflect
the target domain, which is rarely true. In certain domains, a model needs a certain level of interpretability before
it can be deployed. A model that is exhibiting performance issues needs to be debugged for one to understand the
model’s underlying issue. The sklearn.inspection module provides tools to help understand the predictions
from a model and what affects them. This can be used to evaluate assumptions and biases of a model, design a better
model, or to diagnose issues with model performance.
Partial dependence plots (PDP) show the dependence between the target response1 and a set of ‘target’ features,
marginalizing over the values of all other features (the ‘complement’ features). Intuitively, we can interpret the partial
dependence as the expected target response as a function of the ‘target’ features.
Due to the limits of human perception the size of the target feature set must be small (usually, one or two) thus the
target features are usually chosen among the most important features.
The figure below shows four one-way and one two-way partial dependence plots for the California housing dataset,
with a GradientBoostingRegressor:
One-way PDPs tell us about the interaction between the target response and the target feature (e.g. linear, non-linear).
The upper left plot in the above figure shows the effect of the median income in a district on the median house price;
we can clearly see a linear relationship among them. Note that PDPs assume that the target features are independent
from the complement features, and this assumption is often violated in practice.
PDPs with two target features show the interactions among the two features. For example, the two-variable PDP in the above figure shows the dependence of median house price on joint values of house age and average occupants per household. We can clearly see an interaction between the two features: for an average occupancy greater than two, the house price is nearly independent of the house age, whereas for values less than 2 there is a strong dependence on age.
1 For classification, the target response may be the probability of a class (the positive class for binary classification), or the decision function.
The sklearn.inspection module provides a convenience function plot_partial_dependence to cre-
ate one-way and two-way partial dependence plots. In the below example we show how to create a grid of partial
dependence plots: two one-way PDPs for the features 0 and 1 and a two-way PDP between the two features:
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.inspection import plot_partial_dependence
>>> X, y = make_hastie_10_2(random_state=0)
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X, y)
>>> features = [0, 1, (0, 1)]
>>> plot_partial_dependence(clf, X, features)
You can access the newly created figure and Axes objects using plt.gcf() and plt.gca().
For multi-class classification, you need to set the class label for which the PDPs should be created via the target
argument:
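For example, a sketch on iris (the classifier settings and the target class are illustrative):
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.inspection import plot_partial_dependence
>>> iris = load_iris()
>>> mc_clf = GradientBoostingClassifier(n_estimators=10,
...     max_depth=1).fit(iris.data, iris.target)
>>> features = [3, 2, (3, 2)]
>>> plot_partial_dependence(mc_clf, iris.data, features, target=0)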
The same parameter target is used to specify the target in multi-output regression settings.
If you need the raw values of the partial dependence function rather than the plots, you can use the sklearn.
inspection.partial_dependence function:
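A sketch reusing the clf and X from the snippet above:
>>> from sklearn.inspection import partial_dependence
>>> pdp, axes = partial_dependence(clf, X, [0])
>>> # pdp holds the averaged predictions; axes holds the grid values used for feature 0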
The values at which the partial dependence should be evaluated are directly generated from X. For 2-way par-
tial dependence, a 2D-grid of values is generated. The values field returned by sklearn.inspection.
partial_dependence gives the actual values used in the grid for each target feature. They also correspond
to the axis of the plots.
For each value of the ‘target’ features in the grid the partial dependence function needs to marginalize the predictions
of the estimator over all possible values of the ‘complement’ features. With the 'brute' method, this is done by
replacing every target feature value of X by those in the grid, and computing the average prediction.
In decision trees this can be evaluated efficiently without reference to the training data ('recursion' method). For
each grid point a weighted tree traversal is performed: if a split node involves a ‘target’ feature, the corresponding
left or right branch is followed, otherwise both branches are followed, each branch is weighted by the fraction of
training samples that entered that branch. Finally, the partial dependence is given by a weighted average of all visited
leaves. Note that with the 'recursion' method, X is only used to generate the grid, not to compute the averaged
predictions. The averaged predictions will always be computed on the data with which the trees were trained.
Examples:
References
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Second Edition, Section 10.13.2,
Springer, 2009.
C. Molnar, Interpretable Machine Learning, Section 5.1, 2019.
Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data
is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is
defined to be the decrease in a model score when a single feature value is randomly shuffled1. This procedure breaks
the relationship between the feature and the target, thus the drop in the model score is indicative of how much the
model depends on the feature. This technique benefits from being model agnostic and can be calculated many times
with different permutations of the feature.
The permutation_importance function calculates the feature importance of estimators for a given dataset.
The n_repeats parameter sets the number of times a feature is randomly shuffled and returns a sample of feature
importances. Permutation importances can be computed either on the training set or on a held-out testing or validation
set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power
of the inspected model. Features that are important on the training set but not on the held-out set might cause the
model to overfit.
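As a minimal, hedged sketch (the toy data and the choice of LogisticRegression are illustrative, not part of the original example):

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.inspection import permutation_importance

>>> X = [[1, 9, 9], [1, 9, 9], [1, 9, 9],
...      [0, 9, 9], [0, 9, 9], [0, 9, 9]]
>>> y = [1, 1, 1, 0, 0, 0]
>>> clf = LogisticRegression().fit(X, y)
>>> result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
>>> result.importances_mean.shape
(3,)

The returned object exposes importances_mean, importances_std and the raw importances for each repetition.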
1 L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001. https://doi.org/10.1023/A:1010933404324
Note that features that are deemed non-important for some model with a low predictive performance could be highly
predictive for a model that generalizes better. The conclusions should always be drawn in the context of the specific
model under inspection and cannot be automatically generalized to the intrinsic predictive value of the features by
themselves. Therefore it is always important to evaluate the predictive power of a model using a held-out set (or
better with cross-validation) prior to computing importances.
Tree-based models provide a different measure of feature importances based on the mean decrease in impurity (MDI,
the splitting criterion). This gives importance to features that may not be predictive on unseen data. The permutation
feature importance avoids this issue, since it can be applied to unseen data. Furthermore, impurity-based feature
importance for trees is strongly biased and favors high-cardinality features (typically numerical features). Permutation-
based feature importances do not exhibit such a bias. Additionally, the permutation feature importance may use
an arbitrary metric on the tree’s predictions. These two methods of obtaining feature importance are explored in:
Permutation Importance vs Random Forest Feature Importance (MDI).
When two features are correlated and one of the features is permuted, the model will still have access to the feature
through its correlated feature. This will result in a lower importance for both features, even though they might actually be
important. One way to handle this is to cluster features that are correlated and only keep one feature from each cluster.
This use case is explored in: Permutation Importance with Multicollinear or Correlated Features.
Examples:
References:
4.5 Visualizations
Scikit-learn defines a simple API for creating visualizations for machine learning. The key feature of this API is to
allow for quick plotting and visual adjustments without recalculation. In the following example, we plot a ROC curve
for a fitted support vector machine:
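A minimal sketch of such a first plot could look like the following; the use of the wine dataset and the train/test split
are assumptions made for illustration:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import plot_roc_curve
import matplotlib.pyplot as plt

X, y = load_wine(return_X_y=True)
y = y == 2  # binarize the problem for a ROC curve
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
svc_disp = plot_roc_curve(svc, X_test, y_test)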
The returned svc_disp object allows us to continue using the already computed ROC curve for SVC in future plots.
In this case, the svc_disp is a RocCurveDisplay that stores the computed values as attributes called roc_auc,
fpr, and tpr. Next, we train a random forest classifier and plot the previously computed roc curve again by using
the plot method of the Display object.
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
ax = plt.gca()
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=ax, alpha=0.8)
svc_disp.plot(ax=ax, alpha=0.8)
Notice that we pass alpha=0.8 to the plot functions to adjust the alpha values of the curves.
Examples:
Functions
Display Objects
scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised
dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representa-
tions.
Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g.
mean and standard deviation for normalization) from a training set, and a transform method which applies this
transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and
transforming the training data simultaneously.
Combining such transformers, either in parallel or series is covered in Pipelines and composite estimators. Pair-
wise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while Transforming the
prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in scikit-learn.
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last
estimator may be any type (transformer, classifier, etc.).
Usage
Construction
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you
want to give this step and value is an estimator object:
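A sketch consistent with the step names used throughout this section ('reduce_dim' and 'clf' are just identifiers):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])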
The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estima-
tors and returns a pipeline, filling in the names automatically:
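For instance, a minimal sketch (the choice of Binarizer and MultinomialNB is purely illustrative):

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])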
Accessing steps
The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by
indexing (with [idx]) the Pipeline:
>>> pipe.steps[0]
('reduce_dim', PCA())
>>> pipe[0]
PCA()
>>> pipe['reduce_dim']
PCA()
Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments:
A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or
strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations
(or their inverse):
>>> pipe[:1]
Pipeline(steps=[('reduce_dim', PCA())])
>>> pipe[-1:]
Pipeline(steps=[('clf', SVC())])
Nested parameters
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to
'passthrough':
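A sketch of such a parameter grid, e.g. for a grid search over the pipe defined above (the candidate values are illustrative):

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
...                   clf=[SVC(), LogisticRegression()],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)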
The estimators of the pipeline can also be retrieved directly, by index:
>>> pipe[0]
PCA()
or by name:
>>> pipe['reduce_dim']
PCA()
Examples:
See also:
Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it
on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator
is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each
transformer after calling fit. This feature is used to avoid re-fitting the transformers within a pipeline when the
parameters and input data are identical. A typical example is the case of a grid search in which the transformers can
be fitted only once and reused for each configuration.
The parameter memory is needed in order to cache the transformers. memory can be either a string containing the
directory where to cache the transformers or a joblib.Memory object:
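For example, a sketch using a temporary directory as the cache location (the directory handling is illustrative):

>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> # ... fit and use the pipeline, then clear the cache directory when done
>>> rmtree(cachedir)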
Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to
the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an
AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to
inspect estimators within the pipeline:
Examples:
TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are
mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for
prediction, and the transformer that will be applied to the target variable:
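A minimal sketch of this pattern; the choice of the Boston housing data, a QuantileTransformer target transformer and
a LinearRegression regressor is illustrative only:

>>> from sklearn.compose import TransformedTargetRegressor
>>> from sklearn.preprocessing import QuantileTransformer
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> transformer = QuantileTransformer(output_distribution='normal')
>>> regressor = LinearRegression()
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   transformer=transformer)
>>> regr = regr.fit(X, y)
>>> y_pred = regr.predict(X)  # predictions are mapped back to the original space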
For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transfor-
mation and its inverse mapping:
By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to
bypass this checking by setting check_inverse to False:
Note: The transformation can be triggered by setting either transformer or the pair of functions func and
inverse_func. However, setting both options will raise an error.
Examples:
FeatureUnion combines several transformer objects into a new transformer that combines their output. A
FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently.
The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a
larger matrix.
When you want to apply different transformations to each field of the data, see the related class sklearn.compose.
ColumnTransformer (see user guide).
FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and valida-
tion.
FeatureUnion and Pipeline can be combined to create complex models.
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only
produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
Usage
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a
given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
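A sketch consistent with the transformer names used below ('linear_pca' and 'kernel_pca'):

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA, KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', KernelPCA())])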
Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming
of the components.
Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to 'drop':
>>> combined.set_params(kernel_pca='drop')
FeatureUnion(transformer_list=[('linear_pca', PCA()),
('kernel_pca', 'drop')])
Examples:
Warning: The compose.ColumnTransformer class is experimental and the API is subject to change.
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires
separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn
methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for
one of the following reasons:
1. Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as
data leakage), for example in the case of scalers or imputing missing values.
2. You may want to include the parameters of the preprocessors in a parameter search.
The ColumnTransformer helps perform different transformations for different columns of the data, within a
Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays,
sparse matrices, and pandas DataFrames.
To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction
method:
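A sketch of such data, reconstructed to be consistent with the outputs shown below; the rating column names
('expert_rating', 'user_rating') are assumptions:

>>> import pandas as pd
>>> X = pd.DataFrame(
...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
...      'title': ["His Last Bow", "How Watson Learned the Trick",
...                "A Moveable Feast", "The Grapes of Wrath"],
...      'expert_rating': [5, 3, 4, 5],
...      'user_rating': [4, 5, 4, 3]})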
For this data, we might want to encode the 'city' column as a categorical variable using preprocessing.
OneHotEncoder but apply a feature_extraction.text.CountVectorizer to the 'title' column.
As we might use multiple feature extraction methods on the same column, we give each transformer a unique
name, say 'city_category' and 'title_bow'. By default, the remaining rating columns are ignored
(remainder='drop'):
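A definition of column_trans consistent with the fitted transformer shown below:

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='drop')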
>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
['city']),
('title_bow', CountVectorizer(), 'title')])
>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)
In the above example, the CountVectorizer expects a 1D array as input and therefore the columns were specified
as a string ('title'). However, preprocessing.OneHotEncoder, like most other transformers, expects 2D
data; therefore, in that case you need to specify the column as a list of strings (['city']).
Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer
array, a slice, a boolean mask, or with a make_column_selector. The make_column_selector is used to
select columns based on data type or column name:
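For example, a sketch selecting numeric columns for scaling and the 'city' column for one-hot encoding (the exact
transformers are illustrative):

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = ColumnTransformer([
...       ('scale', StandardScaler(),
...        make_column_selector(dtype_include=np.number)),
...       ('onehot', OneHotEncoder(),
...        make_column_selector(pattern='city', dtype_include=object))])
>>> X_trans = ct.fit_transform(X)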
Strings can reference columns if the input is a DataFrame, integers are always interpreted as the positional columns.
We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to
the end of the transformation:
>>> column_trans.fit_transform(X)
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]]...)
The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed
values are appended to the end of the transformation:
Examples:
The sklearn.feature_extraction module can be used to extract features in a format supported by machine
learning algorithms from datasets consisting of formats such as text and image.
Note: Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data,
such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique
applied on these features.
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict
objects to the NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python’s dict has the advantages of being convenient to use, being sparse
(absent features need not be stored) and storing feature names in addition to values.
DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete)
features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities
without ordering (e.g. topic identifiers, types of objects, tags, names. . . ).
In the following, “city” is a categorical attribute while “temperature” is a traditional numerical feature:
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Lan-
guage Processing models that typically work by extracting feature windows around a particular word of interest.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as
complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of
features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe
after being piped into a text.TfidfTransformer for normalization):
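A sketch of this vectorization, reusing a DictVectorizer as above (output formatting abbreviated):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']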
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting
matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to
make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix
by default instead of a numpy.ndarray.
Feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing,
or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers
do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample
matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher
does not remember what the input features looked like and has no inverse_transform method.
Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the
sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions
are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero.
This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table
sizes (n_features < 10000). For large hash table sizes, it can be disabled, to allow the output to be passed
to estimators like sklearn.naive_bayes.MultinomialNB or sklearn.feature_selection.chi2
feature selectors that expect non-negative inputs.
FeatureHasher accepts either mappings (like Python’s dict and its variants in the collections module),
(feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated
as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2',
'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a single feature occurs
multiple times in a sample, the associated values will be summed (so ('feat', 2) and ('feat', 3.5) become
('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.
Feature hashing can be employed in document classification, but unlike text.CountVectorizer,
FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see
Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.
As an example, consider a word-level natural language processing task that needs features extracted from (token,
part_of_speech) pairs. One could use a Python generator function to extract features:
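A sketch of such a generator; the feature templates below, as well as the pos_tagger callable and the corpus iterable,
are placeholders rather than part of any library API:

def token_features(token, part_of_speech):
    # emit a handful of simple string features for each (token, PoS) pair
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)

The resulting raw_X can then be fed to a FeatureHasher (from sklearn.feature_extraction):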
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)
Implementation details
FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.
sparse), the maximum number of features supported is currently 2³¹ − 1.
The original formulation of the hashing trick by Weinberger et al. used two separate hash functions ℎ and 𝜉 to deter-
mine the column index and sign of a feature, respectively. The present implementation works under the assumption
that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two
as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
References:
• Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hash-
ing for large scale multitask learning. Proc. ICML.
• MurmurHash3.
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of
symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a
fixed size rather than the raw text documents with variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from
text content, namely:
• tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and
punctuation as token separators.
• counting the occurrences of tokens in each document.
• normalizing and weighting with diminishing importance tokens that occur in the majority of samples / docu-
ments.
In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency (normalized or not) is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token
(e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This
specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” represen-
tation. Documents are described by word occurrences while completely ignoring the relative position information of
the words in the document.
Sparsity
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will
have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order
of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, imple-
mentations will typically use a sparse representation such as the implementations available in the scipy.sparse
package.
CountVectorizer implements both tokenization and occurrence counting in a single class. This model has many
parameters; however, the default values are quite reasonable (please see the reference documentation for the details):
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer()
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does
this step can be requested explicitly:
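For instance, with the default analyzer (the sample sentence is illustrative):

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True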
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the
resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform
method:
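For example, with the vectorizer fitted above, a document made only of unseen words maps to the all-zero vector:

>>> vectorizer.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)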
Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in
equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some
of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):
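For example, a sketch of such a bigram-aware vectorizer (the token_pattern value keeps single-character tokens and is
illustrative):

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True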
The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local
positioning patterns:
In particular the interrogative form “Is this” is only present in the last document:
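This can be checked, for instance, by looking up the column of the 'is this' bigram in the vectorized corpus:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)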
Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content
of a text, and which may be removed to avoid them being construed as signal for prediction. Sometimes, however,
similar words are useful for prediction, such as in classifying writing style or personality.
There are several known issues in our provided ‘english’ stop word list. It does not aim to be a general, ‘one-size-fits-
all’ solution as some tasks may require a more custom solution. See [NQY18] for more details.
Please take care in choosing a stop word list. Popular stop word lists may include words that are highly informative to
some tasks, such as computer.
You should also make sure that the stop word list has had the same preprocessing and tokenization applied as the one
used in the vectorizer. The word we’ve is split into we and ve by CountVectorizer’s default tokenizer, so if we’ve is in
stop_words, but ve is not, ve will be retained from we’ve in transformed text. Our vectorizers will try to identify
and warn about some kinds of inconsistencies.
References
In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little
meaningful information about the actual contents of the document. If we were to feed the direct count data directly to
a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common
to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: tf-idf(t,d) =
tf(t,d) × idf(t).
Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True,
smooth_idf=True, sublinear_tf=False) the term frequency, the number of times a term occurs in a
given document, is multiplied with idf component, which is computed as
$\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1$,
where 𝑛 is the total number of documents in the document set, and df(𝑡) is the number of documents in the document
set that contain term 𝑡. The resulting tf-idf vectors are then normalized by the Euclidean norm:
$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$.
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search
engines results) that has also found good use in document classification and clustering.
The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly
and how the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from
the standard textbook notation that defines the idf as
$\text{idf}(t) = \log{\frac{n}{1 + \text{df}(t)}}$.
In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the “1” count is added to the
idf instead of the idf’s denominator:
$\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1$
This normalization is implemented by the TfidfTransformer class:
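For instance, the non-smoothed variant discussed here can be instantiated as:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(smooth_idf=False)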
Again please see the reference documentation for the details on all the parameters.
Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting.
The two other features are present in less than 50% of the documents, hence probably more representative of the content
of the documents:
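The counts matrix below is reconstructed to be consistent with the tf-idf values shown next (the exact literal is an
assumption):

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>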
>>> tfidf.toarray()
array([[0.81940995, 0. , 0.57320793],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[0.47330339, 0.88089948, 0. ],
[0.58149261, 0. , 0.81355169]])
For example, we can compute the tf-idf of the first term in the first document in the counts array as follows:
$n = 6$

$\text{df}(t)_{\text{term1}} = 6$

$\text{idf}(t)_{\text{term1}} = \log{\frac{n}{\text{df}(t)}} + 1 = \log(1) + 1 = 1$

$\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3$

Now, if we repeat this computation for the remaining 2 terms in the document, we get

$\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1) + 1) = 0$

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2) + 1) \approx 2.0986$

and the vector of raw tf-idfs:

$\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986]$.

Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:

$\frac{[3, 0, 2.0986]}{\sqrt{3^2 + 0^2 + 2.0986^2}} = [0.819, 0, 0.573]$.
Furthermore, the default parameter smooth_idf=True adds “1” to the numerator and denominator as if an extra
document was seen containing every term in the collection exactly once, which prevents zero divisions:
$\text{idf}(t) = \log{\frac{1 + n}{1 + \text{df}(t)}} + 1$
Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:
$\text{tf-idf}_{\text{term3}} = 1 \times (\log(7/3) + 1) \approx 1.8473$
And the L2-normalized tf-idf changes to

$\frac{[3, 0, 1.8473]}{\sqrt{3^2 + 0^2 + 1.8473^2}} = [0.8515, 0, 0.5243]$:
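Re-fitting with the default parameters (smooth_idf=True) gives, as a sketch consistent with the weights reported below:

>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[0.85..., 0.  ..., 0.52...],
       [1.  ..., 0.  ..., 0.  ...],
       [1.  ..., 0.  ..., 0.  ...],
       [1.  ..., 0.  ..., 0.  ...],
       [0.55..., 0.83..., 0.  ...],
       [0.63..., 0.  ..., 0.77...]])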
The weights of each feature computed by the fit method call are stored in a model attribute:
>>> transformer.idf_
array([1. ..., 2.25..., 1.84...])
As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all
the options of CountVectorizer and TfidfTransformer in a single model:
While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might
offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular,
some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short
texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by
pipelining the feature extractor with a classifier:
• Sample pipeline for text feature extraction and evaluation
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding.
To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings
are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many
others exist.
Note: An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for
a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the
files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the
correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError.
The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either
"ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type
help(bytes.decode) at the Python prompt).
If you are having trouble decoding text, here are some things to try:
• Find out what the actual encoding of the text is. The file might come with a header or README that tells you
the encoding, or there might be some standard encoding you can assume based on where the text comes from.
• You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python
chardet module comes with a script called chardetect.py that will guess the specific encoding, though
you cannot rely on its guess being correct.
• You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.
decode(errors='replace') to replace all decoding errors with a meaningless character, or set
decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
• Real text may come from a variety of sources that may have used different encodings, or even be sloppily
decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the
Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try
decoding the unknown text as latin-1 and then using ftfy to fix errors.
• If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20
Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may
display incorrectly, but at least the same sequence of bytes will always represent the same feature.
For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to
figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not
shown here.
(Depending on the version of chardet, it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every
Software Developer Must Know About Unicode.
The bag of words representation is quite simplistic but surprisingly useful in practice.
In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train
document classifiers, for instance:
• Classification of text documents using sparse features
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such
as K-means:
• Clustering text documents using k-means
Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering,
for instance by using Non-negative matrix factorization (NMF or NNMF):
• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disre-
garding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings
or word derivations.
N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of
bigrams (n=2), where occurrences of pairs of consecutive words are counted.
One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and
derivations.
For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second docu-
ment contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as
very distinct documents, differing in both of the two possible features. A character 2-gram representation, however,
would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
In the above example, char_wb analyzer is used, which creates n-grams only from characters inside word boundaries
(padded with space on each side). The char analyzer, alternatively, creates n-grams that span across words:
The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word
separation as it generates significantly less noisy features than the raw char variant in that case. For such languages
it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while
retaining the robustness with regards to misspellings and word derivations.
While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of
words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried
by that internal structure.
In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs
should thus be taken into account. Many such models will thus be cast as “Structured output” problems which are
currently outside of the scope of scikit-learn.
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens
to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large
datasets:
• the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
• fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
• building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a
strictly online manner.
• pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than
pickling / un-pickling flat data structures such as a NumPy array of the same size),
• it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute
would have to be a shared state with a fine grained synchronization barrier: the mapping from token string
to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared,
potentially harming the concurrent workers’ performance to the point of making them slower than the sequential
variant.
It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the
sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features
of the CountVectorizer.
This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with
CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:
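For example, a sketch with a deliberately small hash table, consistent with the collision count discussed next:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>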
You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros
extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function
collisions because of the low value of the n_features parameter.
In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million
possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might
help without introducing too many additional collisions on typical text classification tasks.
Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices
(LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algo-
rithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).
Let’s try again with the default setting:
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of
course, other terms than the 19 used here might still collide with each other.
The HashingVectorizer also comes with the following limitations:
• it is not possible to invert the model (no inverse_transform method), nor to access the original string
representation of the features, because of the one-way nature of the hash function that performs the mapping.
• it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer
can be appended to it in a pipeline if required.
An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This
means that we can learn from data that does not fit into the computer’s main memory.
A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is
vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same
dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is
no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning
time is often limited by the CPU time one wants to spend on the task.
For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text
documents.
The behavior of the text vectorizer classes can be customized by passing a callable for several of their processing steps.
In particular we name:
• preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly
transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase
the entire document, etc.
• tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list
of these.
• analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the prepro-
cessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place
at the analyzer level, so a custom analyzer may have to reproduce these steps.
(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto
Lucene concepts.)
To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the class
and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead
of passing custom functions.
Some tips and tricks:
• If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens
separated by whitespace and pass analyzer=str.split
• Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-
speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer
or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
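A sketch of such a vectorizer, assuming NLTK (and its WordNet data) is installed:

>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer:
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())

(Note that this will not filter out punctuation.) The following example transforms, for instance, some British spelling
to American spelling with a custom tokenizer: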
>>> import re
>>> def to_british(tokens):
... for t in tokens:
... t = re.sub(r"(...)our$", r"\1or", t)
... t = re.sub(r"([bt])re$", r"\1er", t)
... t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
... t = re.sub(r"ogue$", "og", t)
... yield t
...
>>> class CustomVectorizer(CountVectorizer):
... def build_tokenizer(self):
... tokenize = super().build_tokenizer()
... return lambda doc: list(to_british(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour"))
[...'color', ...'color']
Customizing the tokenizer or the analyzer in this way can also be used for other styles of preprocessing; examples
include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:
– Biclustering documents with the Spectral Co-clustering algorithm
Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator
such as whitespace.
Patch extraction
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or
three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use
reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g.
in RGB format):
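A setup sketch consistent with the patch values shown below:

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])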
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
The PatchExtractor class works in the same way as extract_patches_2d, only it supports multiple images
as input. It is implemented as an estimator, so it can be used in pipelines. For example:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
Several estimators in scikit-learn can use connectivity information between features or samples. For instance Ward
clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous
patches:
For this purpose, the estimators use a ‘connectivity’ matrix, giving which samples are connected.
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a
connectivity matrix for images given the shape of these images.
These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward
clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
Note: Examples
• A demo of structured Ward hierarchical clustering on an image of coins
• Spectral clustering for image segmentation
• Feature agglomeration vs. univariate selection
The sklearn.preprocessing package provides several common utility functions and transformer classes to
change raw feature vectors into a representation that is more suitable for the downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust
scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on
a dataset containing marginal outliers is highlighted in Compare the effect of different scalers on data with outliers.
Standardization of datasets is a common requirement for many machine learning estimators implemented in
scikit-learn; they might behave badly if the individual features do not more or less look like standard normally dis-
tributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean
value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support
Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and
have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might
dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)

>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])
The preprocessing module further provides a utility class StandardScaler that implements the
Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply
the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.
pipeline.Pipeline:
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler()

>>> scaler.mean_
array([1. ..., 0. ..., 0.33...])
>>> scaler.scale_
array([0.81..., 0.81..., 1.24...])
>>> scaler.transform(X_train)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
The scaler instance can then be used on new data to transform it the same way it did on the training set:
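For instance, a sketch with a single new sample (the values are illustrative):

>>> X_test = [[-1., 1., 0.]]
>>> scaler.transform(X_test)
array([[-2.44...,  1.22..., -0.26...]])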
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between
zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using
MinMaxScaler or MaxAbsScaler, respectively.
The motivations to use this scaling include robustness to very small standard deviations of features and preserving zero
entries in sparse data.
Here is an example to scale a toy data matrix to the [0, 1] range:
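A sketch reusing the same toy matrix as above:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])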
The same instance of the transformer can then be applied to some new test data unseen during the fit call: the same
scaling and shifting operations will be applied to be consistent with the transformation performed on the train data:
It is possible to introspect the scaler attributes to find about the exact nature of the transformation learned on the
training data:
>>> min_max_scaler.scale_
array([0.5 , 0.5 , 0.33...])
>>> min_max_scaler.min_
array([0. , 0.5 , 0.33...])
MaxAbsScaler works in a very similar fashion, but scales the data so that the training values lie within the range
[-1, 1], by dividing by the maximum absolute value of each feature. It is meant for data that is already centered at zero
or sparse data.
Here is how to use the toy data from the previous example with this scaler:
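A sketch with the same X_train as before:

>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])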
As with scale, the module further provides convenience functions minmax_scale and maxabs_scale if you
don’t want to create an object.
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do.
However, it can make sense to scale sparse inputs, especially if features are on different scales.
MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended
way to go about this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as
long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised as
silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of
memory unintentionally. RobustScaler cannot be fitted to sparse inputs, but you can use the transform method
on sparse inputs.
Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.
sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the
Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the
CSR or CSC representation upstream.
Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the
toarray method of sparse matrices is another option.
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well.
In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more
robust estimates for the center and range of your data.
References:
Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normal-
ize/standardize/rescale the data?
Scaling vs Whitening
It is sometimes not enough to center and scale the features independently, since a downstream model can further
make some assumption on the linear independence of the features.
To address this issue you can use sklearn.decomposition.PCA with whiten=True to further remove the
linear correlation across features.
Scaling a 1D array
All above functions (i.e. scale, minmax_scale, maxabs_scale, and robust_scale) accept a 1D array,
which can be useful in some specific cases.
If you have a kernel matrix of a kernel 𝐾 that computes a dot product in a feature space defined by function 𝜑, a
KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by
𝜑 followed by removal of the mean in that space.
Non-linear transformation
Two types of transformations are available: quantile transforms and power transforms. Both quantile and power
transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each
feature.
Quantile transforms put all features into the same desired distribution based on the formula 𝐺−1 (𝐹 (𝑋)) where 𝐹 is
the cumulative distribution function of the feature and 𝐺−1 the quantile function of the desired output distribution 𝐺.
This formula is using the two following facts: (i) if 𝑋 is a random variable with a continuous cumulative distribution
function 𝐹 then 𝐹 (𝑋) is uniformly distributed on [0, 1]; (ii) if 𝑈 is a random variable with uniform distribution on
[0, 1] then 𝐺−1 (𝑈 ) has distribution 𝐺. By performing a rank transformation, a quantile transform smooths out unusual
distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances
within and across features.
QuantileTransformer provides a non-parametric transformation to map the data to a uniform distribution with
values between 0 and 1:
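A sketch applying this to the first feature of the iris dataset; the split and the random_state are illustrative, and the
percentile call computes the landmarks referred to below:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> landmarks = np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])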
This feature corresponds to the sepal length in cm. Once the quantile transformation is applied, those landmarks closely
approach the previously defined percentiles:
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of
parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution
as possible in order to stabilize variance and minimize skewness.
PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-
Cox transform.
The Yeo-Johnson transform is given by:
$$x_i^{(\lambda)} =
\begin{cases}
[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0, \\
\ln(x_i + 1) & \text{if } \lambda = 0, x_i \geq 0, \\
-[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0, \\
-\ln(-x_i + 1) & \text{if } \lambda = 2, x_i < 0
\end{cases}$$
Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by 𝜆,
which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples
drawn from a lognormal distribution to a normal distribution:
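A sketch of this mapping (the random seed and matrix shape are illustrative):

>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_gaussian = pt.fit_transform(X_lognormal)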
While the above example sets the standardize option to False, PowerTransformer will apply zero-mean,
unit-variance normalization to the transformed output by default.
Below are examples of Box-Cox and Yeo-Johnson applied to various probability distributions. Note that when applied
to certain distributions, the power transforms achieve very Gaussian-like results, but with others, they are ineffective.
This highlights the importance of visualizing the data before and after transformation.
It is also possible to map data to a normal distribution using QuantileTransformer by setting
output_distribution='normal'. Using the earlier example with the iris dataset:
Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the
input’s minimum and maximum — corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively — do not become
infinite under the transformation.
Normalization
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan
to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset,
either using the l1 or l2 norms:
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
The preprocessing module further provides a utility class Normalizer that implements the same operation
using the Transformer API (even though the fit method is useless in this case: the class is stateless as this
operation treats samples independently).
This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline:
The normalizer instance can then be used on sample vectors as any transformer:
>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer()

>>> normalizer.transform(X)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
Sparse input
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.
csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recom-
mended to choose the CSR representation upstream.
Encoding categorical features
Often features are not given as continuous values but are categorical. For example a person could have features ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently coded as integers, for instance ["male", "from US", "uses Internet Explorer"] could be expressed as [0, 1, 3] while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1].
To convert categorical features to such integer codes, we can use the OrdinalEncoder. This estimator transforms
each categorical feature to one new feature of integers (0 to n_categories - 1):
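For instance (a minimal sketch using the feature values named above):
>>> from sklearn import preprocessing
>>> enc = preprocessing.OrdinalEncoder()
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari']])
array([[0., 1., 1.]])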
Such an integer representation cannot, however, be used directly with all scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).
Another possibility to convert categorical features to features that can be used with scikit-learn estimators is
to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with
the OneHotEncoder, which transforms each categorical feature with n_categories possible values into
n_categories binary features, with one of them 1, and all others 0.
Continuing the example above:
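(A sketch of the encoder and data the snippet below assumes:)
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]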
>>> enc.fit(X)
OneHotEncoder()
>>> enc.transform([['female', 'from US', 'uses Safari'],
... ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
[0., 1., 1., 0., 0., 1.]])
By default, the values each feature can take is inferred automatically from the dataset and can be found in the
categories_ attribute:
>>> enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
It is possible to specify this explicitly using the parameter categories. There are two genders, four possible
continents and four web browsers in our dataset:
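(A sketch of the construction assumed by the snippet below; the explicit category lists follow the values named in the text:)
>>> genders = ['female', 'male']
>>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
>>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
>>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]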
>>> enc.fit(X)
OneHotEncoder(categories=[['female', 'male'],
['from Africa', 'from Asia', 'from Europe',
'from US'],
['uses Chrome', 'uses Firefox', 'uses IE',
'uses Safari']])
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
If there is a possibility that the training data might have missing categorical features, it can often be
better to specify handle_unknown='ignore' instead of setting the categories manually as above.
When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised but the resulting one-hot encoded columns for this feature will be all zeros:
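(A sketch of the encoder and training data assumed by the snippet below:)
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]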
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
It is also possible to encode each column into n_categories - 1 columns instead of n_categories columns by using the drop parameter. This parameter allows the user to specify a category for each feature to be dropped. This is useful to avoid co-linearity in the input matrix in some classifiers. Such functionality is useful, for example, when using non-regularized regression (LinearRegression), since co-linearity would cause the covariance matrix to be non-invertible. When this parameter is not None, handle_unknown must be set to 'error':
>>> X = [['male', 'from US', 'uses Safari'],
...      ['female', 'from Europe', 'uses Firefox']]
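>>> # assumed construction for this snippet: dropping the first category of each
>>> # feature leaves a single binary column per feature here
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)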
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
[0., 0., 0.]])
See Loading features from dicts for categorical features that are represented as a dict, not as scalars.
Discretization
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into dis-
crete values. Certain datasets with continuous features may benefit from discretization, because discretization can
transform the dataset of continuous attributes to one with only nominal attributes.
One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For
instance, pre-processing with a discretizer can introduce nonlinearity to linear models.
K-bins discretization
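KBinsDiscretizer discretizes features into k bins. A minimal sketch consistent with the intervals and transform shown below (the data values are assumed):
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X = np.array([[ -3., 5., 15 ],
...               [  0., 6., 14 ],
...               [  6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)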
By default the output is one-hot encoded into a sparse matrix (See Encoding categorical features) and this can be
configured with the encode parameter. For each feature, the bin edges are computed during fit and together with
the number of bins, they will define the intervals. Therefore, for the current example, these intervals are defined as:
• feature 1: [−∞, −1), [−1, 2), [2, ∞)
• feature 2: [−∞, 5), [5, ∞)
• feature 3: [−∞, 14), [14, ∞)
Based on these bin intervals, X is transformed as follows:
>>> est.transform(X)
array([[ 0., 1., 1.],
[ 1., 1., 1.],
[ 2., 0., 0.]])
The resulting dataset contains ordinal attributes which can be further used in a sklearn.pipeline.Pipeline.
Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting
features which fall into particular bins, whereas discretization focuses on assigning feature values to these bins.
KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parame-
ter. The ‘uniform’ strategy uses constant-width bins. The ‘quantile’ strategy uses quantile values to have equally
populated bins in each feature. The ‘kmeans’ strategy defines bins based on a k-means clustering procedure performed
on each feature independently.
Examples:
Feature binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. For instance, this is the case for sklearn.neural_network.BernoulliRBM.
It is also common among the text processing community to use binary feature values (probably to simplify the proba-
bilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly
better in practice.
As for the Normalizer, the utility class Binarizer is meant to be used in the early stages of sklearn.
pipeline.Pipeline. The fit method does nothing as each sample is treated independently of others:
>>> X = [[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]]
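>>> # assumed construction: fit is a no-op because samples are treated independently
>>> binarizer = preprocessing.Binarizer().fit(X)
>>> binarizer
Binarizer()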
>>> binarizer.transform(X)
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
As for the StandardScaler and Normalizer classes, the preprocessing module provides a companion function
binarize to be used when the transformer API is not necessary.
Note that the Binarizer is similar to the KBinsDiscretizer when k = 2, and when the bin edge is at the
value threshold.
Sparse input
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.
csr_matrix). To avoid unnecessary memory copies, it is recommended to choose the CSR representation up-
stream.
Tools for imputing missing values are discussed at Imputation of missing values.
Often it’s useful to add complexity to the model by considering nonlinear features of the input data. A simple and common method is to use polynomial features, which can obtain the features’ higher-order and interaction terms. Polynomial features are implemented in PolynomialFeatures:
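For instance (a minimal degree-2 sketch):
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])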
The features of X have been transformed from $(X_1, X_2)$ to $(1, X_1, X_2, X_1^2, X_1 X_2, X_2^2)$.
In some cases, only interaction terms among features are required, and they can be obtained by setting interaction_only=True:
>>> X = np.arange(9).reshape(3, 3)
>>> X
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
[ 1., 3., 4., 5., 12., 15., 20., 60.],
[ 1., 6., 7., 8., 42., 48., 56., 336.]])
Note that polynomial features are used implicitly in kernel methods (e.g., sklearn.svm.SVC, sklearn.
decomposition.KernelPCA) when using polynomial Kernel functions.
See Polynomial interpolation for Ridge regression using created polynomial features.
Custom transformers
Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing.
You can implement a transformer from an arbitrary function with FunctionTransformer. For example, to build
a transformer that applies a log transformation in a pipeline, do:
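For instance (a minimal sketch applying a log1p transform):
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p, validate=True)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])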
You can ensure that func and inverse_func are the inverse of each other by setting check_inverse=True
and calling fit before transform. Please note that a warning is raised and can be turned into an error with a
filterwarnings:
For a full code example that demonstrates using a FunctionTransformer to do custom feature selection, see
Using FunctionTransformer to select columns
For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other place-
holders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array
are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows
and/or columns containing missing values. However, this comes at the price of losing data which may be valuable
(even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of
the data. See the Glossary of Common Terms and API Elements entry on imputation.
One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-
missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputa-
tion algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.
IterativeImputer).
The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed
with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the
missing values are located. This class also allows for different missing values encodings.
The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the
columns (axis 0) that contain the missing values:
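A minimal sketch, fitting on one small array and transforming another:
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
SimpleImputer()
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[4.          2.        ]
 [6.          3.666...]
 [7.          6.        ]]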
SimpleImputer also supports sparse matrices; note, however, that the sparse format is not meant to be used to implicitly store missing values in the matrix because it would densify it at transform time. Missing values encoded by 0 must be used with dense input.
The SimpleImputer class also supports categorical data represented as string values or pandas categoricals when
using the 'most_frequent' or 'constant' strategy:
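For instance (a minimal sketch with a small categorical frame):
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.impute import SimpleImputer
>>> df = pd.DataFrame([["a", "x"],
...                    [np.nan, "y"],
...                    ["a", np.nan],
...                    ["b", "y"]], dtype="category")
>>> imp = SimpleImputer(strategy="most_frequent")
>>> print(imp.fit_transform(df))
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]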
A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing
values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin
fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X.
A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done
for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final
imputation round are returned.
Note: This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_iterative_imputer.
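A minimal sketch (the exact imputed values can depend on the underlying estimator and version):
>>> import numpy as np
>>> from sklearn.experimental import enable_iterative_imputer  # noqa
>>> from sklearn.impute import IterativeImputer
>>> imp = IterativeImputer(max_iter=10, random_state=0)
>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
IterativeImputer(random_state=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is roughly double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]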
Both SimpleImputer and IterativeImputer can be used in a Pipeline as a way to build a composite estimator
that supports imputation. See Imputing missing values before building an estimator.
Flexibility of IterativeImputer
There are many well-established imputation packages in the R data science ecosystem: Amelia, mi, mice, missForest,
etc. missForest is popular, and turns out to be a particular instance of different sequential imputation algorithms
that can all be implemented with IterativeImputer by passing in different regressors to be used for predicting
missing feature values. In the case of missForest, this regressor is a Random Forest. See Imputing missing values with
variants of IterativeImputer.
In the statistics community, it is common practice to perform multiple imputations, generating, for example, m separate
imputations for a single feature matrix. Each of these m imputations is then put through the subsequent analysis
pipeline (e.g. feature engineering, clustering, regression, classification). The m final analysis results (e.g. held-out
validation errors) allow the data scientist to obtain understanding of how analytic results may differ as a consequence
of the inherent uncertainty caused by the missing values. The above practice is called multiple imputation.
Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True. See [2], chapter 4 for more discussion on multiple vs. single imputations.
It is still an open problem as to how useful single vs. multiple imputation is in the context of prediction and classifica-
tion when the user is not interested in measuring uncertainty due to missing values.
Note that a call to the transform method of IterativeImputer is not allowed to change the number of samples.
Therefore multiple imputations cannot be achieved by a single call to transform.
References
[1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). “mice: Multivariate Imputation by Chained Equations in R”. Journal of Statistical Software 45: 1-67.
[2] Roderick J A Little and Donald B Rubin (1986). “Statistical Analysis with Missing Data”. John Wiley & Sons, Inc., New York, NY, USA.
The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach.
By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find
the nearest neighbors. Each missing feature is imputed using values from n_neighbors nearest neighbors that have
a value for the feature. The features of the neighbors are averaged uniformly or weighted by distance to each neighbor.
If a sample has more than one feature missing, then the neighbors for that sample can be different depending on the
particular feature being imputed. When the number of available neighbors is less than n_neighbors and there are
no defined distances to the training set, the training set average for that feature is used during imputation. If there is
at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will
be used during imputation. If a feature is always missing in training, it is removed during transform. For more
information on the methodology, see ref. [OL2001].
The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean feature value
of the two nearest neighbors of samples with missing values:
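A minimal sketch consistent with that description (averages are taken over the two nearest samples that have a value for each feature):
>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])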
The MissingIndicator transformer is useful to transform a dataset into a corresponding binary matrix indicating
the presence of missing values in the dataset. This transformation is useful in conjunction with imputation. When
using imputation, preserving the information about which values had been missing can be informative. Note that
both the SimpleImputer and IterativeImputer have the boolean parameter add_indicator (False by
default) which when set to True provides a convenient way of stacking the output of the MissingIndicator
transformer with the output of the imputer.
NaN is usually used as the placeholder for missing values. However, it enforces the data type to be float. The parameter missing_values allows specifying other placeholders, such as an integer. In the following example, we will use -1 as the missing value:
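A minimal sketch (the data values are illustrative):
>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X = np.array([[-1, -1,  1,  3],
...               [ 4, -1,  0, -1],
...               [ 8, -1,  1,  0]])
>>> indicator = MissingIndicator(missing_values=-1)
>>> mask_missing_values_only = indicator.fit_transform(X)
>>> mask_missing_values_only
array([[ True,  True, False],
       [False,  True,  True],
       [False,  True, False]])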
The features parameter is used to choose the features for which the mask is constructed. By default, it is
'missing-only' which returns the imputer mask of the features containing missing values at fit time:
>>> indicator.features_
array([0, 1, 3])
The features parameter can be set to 'all' to return all features whether or not they contain missing values:
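Continuing the sketch above:
>>> indicator = MissingIndicator(missing_values=-1, features="all")
>>> mask_all = indicator.fit_transform(X)
>>> mask_all
array([[ True,  True, False, False],
       [False,  True, False,  True],
       [False,  True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])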
Now we create a FeatureUnion. All features will be imputed using SimpleImputer, in order to enable classifiers to work with this data. Additionally, it adds the indicator variables from MissingIndicator, as shown in the sketch below.
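A minimal sketch of such a union (a small array with np.nan placeholders stands in for the dataset used in the original example):
>>> import numpy as np
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.impute import SimpleImputer, MissingIndicator
>>> X = np.array([[1., 2.], [np.nan, 3.], [7., np.nan]])
>>> transformer = FeatureUnion(
...     transformer_list=[
...         ('features', SimpleImputer(strategy='mean')),
...         ('indicators', MissingIndicator())])
>>> transformer.fit_transform(X).shape
(3, 4)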
Of course, we cannot use the transformer to make any predictions. We should wrap this in a Pipeline with a
classifier (e.g., a DecisionTreeClassifier) to be able to make predictions.
If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps.
Many of the Unsupervised learning methods implement a transform method that can be used to reduce the dimen-
sionality. Below we discuss two specific examples of this pattern that are heavily used.
Pipelining
The unsupervised data reduction and the supervised estimator can be chained in one step. See Pipeline: chaining
estimators.
decomposition.PCA looks for a combination of features that capture well the variance of the original features.
See Decomposing signals in components (matrix factorization problems).
Examples
Random projections
The module: random_projection provides several tools for data reduction by random projections. See the
relevant section of the documentation: Random Projection.
Examples
Feature agglomeration
cluster.FeatureAgglomeration applies Hierarchical clustering to group together features that behave sim-
ilarly.
Examples
Feature scaling
Note that if features have very different scaling or statistical properties, cluster.FeatureAgglomeration
may not be able to capture the links between related features. Using a preprocessing.StandardScaler
can be useful in these settings.
The sklearn.random_projection module implements a simple and computationally efficient way to reduce
the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing
times and smaller model sizes. This module implements two types of unstructured random matrix: Gaussian random
matrix and sparse random matrix.
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances
between any two samples of the dataset. Thus random projection is a suitable approximation technique for distance-based methods.
References:
• Sanjoy Dasgupta. 2000. Experiments with random projection. In Proceedings of the Sixteenth conference
on Uncertainty in artificial intelligence (UAI’00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.
• Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to
image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining (KDD ‘01). ACM, New York, NY, USA, 245-250.
The main theoretical result behind the efficiency of random projection is the Johnson-Lindenstrauss lemma (quoting
Wikipedia):
In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of
points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set
of points in a high-dimensional space can be embedded into a space of much lower dimension in such a
way that distances between the points are nearly preserved. The map used for the embedding is at least
Lipschitz, and can even be taken to be an orthogonal projection.
Knowing only the number of samples, the sklearn.random_projection.
johnson_lindenstrauss_min_dim estimates conservatively the minimal size of the random subspace
to guarantee a bounded distortion introduced by the random projection:
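For instance (a sketch; the values follow from the lemma’s bound):
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])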
Example:
• See The Johnson-Lindenstrauss bound for embedding with random projections for a theoretical explication
on the Johnson-Lindenstrauss lemma and an empirical validation using sparse random matrices.
References:
• Sanjoy Dasgupta and Anupam Gupta, 1999. An elementary proof of the Johnson-Lindenstrauss Lemma.
where $n_{\text{components}}$ is the size of the projected subspace. By default the density of non-zero elements is set to the minimum density as recommended by Ping Li et al.: $1 / \sqrt{n_{\text{features}}}$.
Here is a small excerpt illustrating how to use the sparse random projection transformer:
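(A sketch; the projected dimensionality is chosen automatically from the Johnson-Lindenstrauss bound:)
>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)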
References:
• D. Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Jour-
nal of Computer and System Sciences 66 (2003) 671–687
• Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘06).
ACM, New York, NY, USA, 287-296.
This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they
are used for example in support vector machines (see Support Vector Machines). The following feature functions
perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms.
The advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature
maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the
cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an
approximate kernel map it is possible to use much more efficient linear SVMs. In particular, the combination of kernel
map approximations with SGDClassifier can make non-linear learning on large datasets possible.
Since there has not been much empirical work using approximate embeddings, it is advisable to compare results
against exact kernel methods when possible.
See also:
Polynomial regression: extending linear models with basis functions for an exact polynomial transformation.
The Nystroem method, as implemented in Nystroem is a general method for low-rank approximations of kernels.
It achieves this by essentially subsampling the data on which the kernel is evaluated. By default Nystroem uses the
rbf kernel, but it can use any kernel function or a precomputed kernel matrix. The number of samples used - which
is also the dimensionality of the features computed - is given by the parameter n_components.
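A minimal sketch (random data and illustrative parameters):
>>> import numpy as np
>>> from sklearn.kernel_approximation import Nystroem
>>> X = np.random.RandomState(0).rand(50, 5)
>>> feature_map = Nystroem(kernel='rbf', gamma=0.2, n_components=10, random_state=0)
>>> X_features = feature_map.fit_transform(X)
>>> X_features.shape
(50, 10)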
The RBFSampler constructs an approximate mapping for the radial basis function kernel, also known as Random
Kitchen Sinks [RR2007]. This transformation can be used to explicitly model a kernel map, prior to applying a linear
algorithm, for example a linear SVM:
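For instance, on a tiny toy problem (a sketch; the perfect score here is specific to this data):
>>> from sklearn.kernel_approximation import RBFSampler
>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
>>> y = [0, 0, 1, 1]
>>> rbf_feature = RBFSampler(gamma=1, random_state=1)
>>> X_features = rbf_feature.fit_transform(X)
>>> clf = SGDClassifier(max_iter=5)
>>> clf.fit(X_features, y)
SGDClassifier(max_iter=5)
>>> clf.score(X_features, y)
1.0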
The mapping relies on a Monte Carlo approximation to the kernel values. The fit function performs the Monte Carlo
sampling, whereas the transform method performs the mapping of the data. Because of the inherent randomness
of the process, results may vary between different calls to the fit function.
The fit function takes two arguments: n_components, which is the target dimensionality of the feature transform,
and gamma, the parameter of the RBF-kernel. A higher n_components will result in a better approximation of the
kernel and will yield results more similar to those produced by a kernel SVM. Note that “fitting” the feature function
does not actually depend on the data given to the fit function. Only the dimensionality of the data is used. Details
on the method can be found in [RR2007].
For a given value of n_components RBFSampler is often less accurate than Nystroem. RBFSampler is cheaper to compute, though, making the use of larger feature spaces more efficient.
Fig. 9: Comparing an exact RBF kernel (left) with the approximation (right)
Examples:
The additive chi squared kernel is a kernel on histograms, often used in computer vision.
The additive chi squared kernel as used here is given by
k(x, y) = \sum_i \frac{2 x_i y_i}{x_i + y_i}
This is not exactly the same as sklearn.metrics.additive_chi2_kernel. The authors of [VZ2010] prefer
the version above as it is always positive definite. Since the kernel is additive, it is possible to treat all components
𝑥𝑖 separately for embedding. This makes it possible to sample the Fourier transform in regular intervals, instead of
approximating using Monte Carlo sampling.
The class AdditiveChi2Sampler implements this component wise deterministic sampling. Each component
is sampled 𝑛 times, yielding 2𝑛 + 1 dimensions per input dimension (the multiple of two stems from the real and
complex part of the Fourier transform). In the literature, 𝑛 is usually chosen to be 1 or 2, transforming the dataset to
size n_samples * 5 * n_features (in the case of 𝑛 = 2).
The approximate feature map provided by AdditiveChi2Sampler can be combined with the approximate feature
map provided by RBFSampler to yield an approximate feature map for the exponentiated chi squared kernel. See
[VZ2010] for details and [VVZ2010] for the combination with the RBFSampler.
Skewed chi squared kernel
The skewed chi squared kernel has properties similar to the exponentiated chi squared kernel often used in computer vision, but allows for a simple Monte Carlo approximation of the feature map.
The usage of the SkewedChi2Sampler is the same as the usage described above for the RBFSampler. The only
difference is in the free parameter, that is called 𝑐. For a motivation for this mapping and the mathematical details see
[LS2010].
Mathematical Details
Kernel methods like support vector machines or kernelized PCA rely on a property of reproducing kernel Hilbert
spaces. For any positive definite kernel function $k$ (a so-called Mercer kernel), it is guaranteed that there exists a mapping $\varphi$ into a Hilbert space $\mathcal{H}$, such that $\langle \varphi(x), \varphi(y) \rangle = k(x, y)$ for all $x$, $y$.
References:
• [RR2007] “Random features for large-scale kernel machines”, Rahimi, A. and Recht, B., Advances in Neural Information Processing Systems, 2007.
• [LS2010] “Random Fourier approximations for skewed multiplicative histogram kernels”, Li, F., Ionescu, C. and Sminchisescu, C., Pattern Recognition (DAGM), Lecture Notes in Computer Science, 2010.
• [VZ2010] “Efficient additive kernels via explicit feature maps”, Vedaldi, A. and Zisserman, A., Computer Vision and Pattern Recognition, 2010.
• [VVZ2010] “Generalized RBF feature maps for Efficient Detection”, Sreekanth, V., Vedaldi, A., Zisserman, A. and Jawahar, C.V., British Machine Vision Conference, 2010.
Kernels are measures of similarity, i.e. s(a, b) > s(a, c) if objects a and b are considered “more similar” than
objects a and c. A kernel must also be positive semi-definite.
There are a number of ways to convert between a distance metric and a similarity measure, such as a kernel. Let D be
the distance, and S be the kernel:
1. S = np.exp(-D * gamma), where one heuristic for choosing gamma is 1 / num_features
2. S = 1. / (D / np.max(D))
The distances between the row vectors of X and the row vectors of Y can be evaluated using pairwise_distances.
If Y is omitted the pairwise distances of the row vectors of X are calculated. Similarly, pairwise.
pairwise_kernels can be used to calculate the kernel between X and Y using different kernel functions. See
the API reference for more details.
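For instance, a small sketch with a Manhattan distance and a linear kernel:
>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.array([[2, 3], [3, 5], [5, 8]])
>>> Y = np.array([[1, 0], [2, 1]])
>>> pairwise_distances(X, Y, metric='manhattan')
array([[ 4.,  2.],
       [ 7.,  5.],
       [12., 10.]])
>>> pairwise_distances(X, metric='manhattan')
array([[0., 3., 8.],
       [3., 0., 5.],
       [8., 5., 0.]])
>>> pairwise_kernels(X, Y, metric='linear')
array([[ 2.,  7.],
       [ 3., 11.],
       [ 5., 18.]])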
Cosine similarity
cosine_similarity computes the L2-normalized dot product of vectors. That is, if 𝑥 and 𝑦 are row vectors, their
cosine similarity 𝑘 is defined as:
k(x, y) = \frac{x y^\top}{\|x\| \|y\|}
This is called cosine similarity, because Euclidean (L2) normalization projects the vectors onto the unit sphere, and
their dot product is then the cosine of the angle between the points denoted by the vectors.
This kernel is a popular choice for computing the similarity of documents represented as tf-idf vectors.
cosine_similarity accepts scipy.sparse matrices. (Note that the tf-idf functionality in sklearn.
feature_extraction.text can produce normalized vectors, in which case cosine_similarity is equiv-
alent to linear_kernel, only slower.)
References:
• C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge Uni-
versity Press. https://nlp.stanford.edu/IR-book/html/htmledition/the-vector-space-model-for-scoring-1.html
Linear kernel
The function linear_kernel computes the linear kernel, that is, a special case of polynomial_kernel with
degree=1 and coef0=0 (homogeneous). If x and y are column vectors, their linear kernel is:
𝑘(𝑥, 𝑦) = 𝑥⊤ 𝑦
Polynomial kernel
The function polynomial_kernel computes the degree-d polynomial kernel between two vectors. The polyno-
mial kernel represents the similarity between two vectors. Conceptually, the polynomial kernel considers not only the similarity between vectors under the same dimension, but also across dimensions. When used in machine learning algorithms, this allows accounting for feature interaction.
The polynomial kernel is defined as:
k(x, y) = (\gamma x^\top y + c_0)^d
where:
• x, y are the input vectors
• d is the kernel degree
If 𝑐0 = 0 the kernel is said to be homogeneous.
Sigmoid kernel
The function sigmoid_kernel computes the sigmoid kernel between two vectors. The sigmoid kernel is also
known as hyperbolic tangent, or Multilayer Perceptron (because, in the neural network field, it is often used as neuron
activation function). It is defined as:
𝑘(𝑥, 𝑦) = tanh(𝛾𝑥⊤ 𝑦 + 𝑐0 )
where:
• x, y are the input vectors
• 𝛾 is known as slope
• 𝑐0 is known as intercept
RBF kernel
The function rbf_kernel computes the radial basis function (RBF) kernel between two vectors. This kernel is defined as:
k(x, y) = \exp(-\gamma \|x - y\|^2)
where $x$ and $y$ are the input vectors. If $\gamma = \sigma^{-2}$ the kernel is known as the Gaussian kernel of variance $\sigma^2$.
Laplacian kernel
The function laplacian_kernel is a variant on the radial basis function kernel defined as:
k(x, y) = \exp(-\gamma \|x - y\|_1)
where $x$ and $y$ are the input vectors and $\|x - y\|_1$ is the Manhattan distance between the input vectors.
It has proven useful in ML applied to noiseless data. See e.g. Machine learning for quantum mechanics in a nutshell.
Chi-squared kernel
The chi-squared kernel is a very popular choice for training non-linear SVMs in computer vision applications. It can
be computed using chi2_kernel and then passed to an sklearn.svm.SVC with kernel="precomputed":
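For instance, a sketch on a tiny toy problem:
>>> from sklearn.svm import SVC
>>> from sklearn.metrics.pairwise import chi2_kernel
>>> X = [[0, 1], [1, 0], [.2, .8], [.7, .3]]
>>> y = [0, 1, 0, 1]
>>> K = chi2_kernel(X, gamma=.5)
>>> K
array([[1.        , 0.36787944, 0.89483932, 0.58364548],
       [0.36787944, 1.        , 0.51341712, 0.83822343],
       [0.89483932, 0.51341712, 1.        , 0.7768366 ],
       [0.58364548, 0.83822343, 0.7768366 , 1.        ]])
>>> svm = SVC(kernel='precomputed').fit(K, y)
>>> svm.predict(K)
array([0, 1, 0, 1])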
The data is assumed to be non-negative, and is often normalized to have an L1-norm of one. The normalization is
rationalized with the connection to the chi squared distance, which is a distance between discrete probability distribu-
tions.
The chi squared kernel is most commonly used on histograms (bags) of visual words.
References:
• Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification
of texture and object categories: A comprehensive study International Journal of Computer Vision 2007
https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf
These are transformers that are not intended to be used on features, only on supervised learning targets. See also
Transforming target in regression if you want to transform the prediction target for learning, but evaluate the model in
the original (untransformed) space.
Label binarization
LabelBinarizer is a utility class to help create a label indicator matrix from a list of multi-class labels:
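For instance, a minimal sketch with integer class labels:
>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer()
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])
For multiple labels per instance, use MultiLabelBinarizer: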
>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
[0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])
Label encoding
LabelEncoder is a utility class to help normalize labels such that they contain only values between 0 and n_classes-
1. This is sometimes useful for writing efficient Cython routines. LabelEncoder can be used as follows:
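For instance, a minimal sketch with integer labels:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])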
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical
labels:
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.
This package also features helpers to fetch larger datasets commonly used by the machine learning community to
benchmark algorithms on data that comes from the ‘real world’.
To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical
properties of the data (typically the correlation and informativeness of the features), it is also possible to generate
synthetic data.
There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of
dataset.
The dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.
The dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets
section.
Both loaders and fetchers functions return a dictionary-like object holding at least two items: an array of shape
n_samples * n_features with key data (except for 20newsgroups) and a numpy array of length n_samples,
containing the target values, with key target.
It’s also possible for almost all of these functions to constrain the output to be a tuple containing only the data and the
target, by setting the return_X_y parameter to True.
The datasets also contain a full description in their DESCR attribute and some contain feature_names and
target_names. See the dataset descriptions below for details.
The dataset generation functions. They can be used to generate controlled synthetic datasets, described in the
Generated datasets section.
These functions return a tuple (X, y) consisting of a n_samples * n_features numpy array X and an array of
length n_samples containing the targets y.
In addition, there are also miscellaneous tools to load datasets of other formats or from other locations, described in
the Loading other datasets section.
scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.
They can be loaded using the following functions:
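The toy dataset loaders include (parameter lists abbreviated):
load_boston(...)           Load and return the Boston house-prices dataset (regression).
load_iris(...)             Load and return the iris dataset (classification).
load_diabetes(...)         Load and return the diabetes dataset (regression).
load_digits(...)           Load and return the digits dataset (classification).
load_linnerud(...)         Load and return the linnerud dataset (multivariate regression).
load_wine(...)             Load and return the wine dataset (classification).
load_breast_cancer(...)    Load and return the breast cancer wisconsin dataset (classification).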
These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in scikit-learn. They
are however often too small to be representative of real world machine learning tasks.
References
• Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’,
Wiley, 1980. 244-261.
• Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth
International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan
Kaufmann.
References
• Fisher, R.A. “The use of multiple measurements in taxonomic problems” Annual Eugenics, 7, Part II, 179-188
(1936); also in “Contributions to Mathematical Statistics” (John Wiley, NY, 1950).
• Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons.
ISBN 0-471-22361-1. See page 218.
• Dasarathy, B.V. (1980) “Nosing Around the Neighborhood: A New System Structure and Classification Rule
for Recognition in Partially Exposed Environments”. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
• Gates, G.W. (1972) “The Reduced Nearest Neighbor Rule”. IEEE Transactions on Information Theory, May
1972, 431-433.
• See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.’s AUTOCLASS II conceptual clustering system
finds 3 classes in the data.
• Many, many more . . .
Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were
obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease
progression one year after baseline.
Data Set Characteristics:
Number of Instances 442
Number of Attributes First 10 columns are numeric predictive values
Target Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information
• Age
• Sex
• Body mass index
• Average blood pressure
• S1
• S2
• S3
• S4
• S5
• S6
Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times n_samples (i.e. the sum of squares of each column totals 1).
Source URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least An-
gle Regression,” Annals of Statistics (with discussion), 407-499. (https://web.stanford.edu/~hastie/Papers/LARS/
LeastAngle_2002.pdf)
References
• C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit
Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.
• E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
• Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin. Linear dimensionality reduction using
relevance weighted LDA. School of Electrical and Electronic Engineering Nanyang Technological University.
2005.
• Claudio Gentile. A New Approximate Maximal Margin Classification Algorithm. NIPS. 2000.
Linnerrud dataset
References
(1) S. Aeberhard, D. Coomans and O. de Vel, Comparison of Classifiers in High Dimensional Settings, Tech. Rep.
no. 92-02, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of
North Queensland. (Also submitted to Technometrics).
The data was used with many others for comparing various classifiers. The classes are separable, though only RDA
has achieved 100% correct classification. (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed
data)) (All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel, “THE CLASSIFICATION PERFORMANCE OF RDA” Tech. Rep.
no. 92-01, (1992), Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of
North Queensland. (Also submitted to Journal of Chemometrics).
The mean, standard error, and “worst” or largest (mean of the three largest values) of these
features were computed for each image, resulting in 30 features. For instance, field 3 is
Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
• class:
– WDBC-Malignant
– WDBC-Benign
Summary Statistics
Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision
Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive
Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision
tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.
The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K.
P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”,
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/
References
• W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis.
IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905,
pages 861-870, San Jose, CA, 1993.
• O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear program-
ming. Operations Research, 43(4), pages 570-577, July-August 1995.
• W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer
from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
fetch_olivetti_faces([data_home, shuffle, ...])     Load the Olivetti faces data-set from AT&T (classification).
fetch_20newsgroups([data_home, subset, ...])        Load the filenames and data from the 20 newsgroups dataset (classification).
fetch_20newsgroups_vectorized([subset, ...])        Load the 20 newsgroups dataset and vectorize it into token counts (classification).
fetch_lfw_people([data_home, funneled, ...])        Load the Labeled Faces in the Wild (LFW) people dataset (classification).
fetch_lfw_pairs([subset, data_home, ...])           Load the Labeled Faces in the Wild (LFW) pairs dataset (classification).
fetch_covtype([data_home, ...])                     Load the covertype dataset (classification).
fetch_rcv1([data_home, subset, ...])                Load the RCV1 multilabel dataset (classification).
fetch_kddcup99([subset, data_home, shuffle, ...])   Load the kddcup99 dataset (classification).
fetch_california_housing([data_home, ...])          Load the California housing dataset (regression).
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cam-
bridge. The sklearn.datasets.fetch_olivetti_faces function is the data fetching / caching function
that downloads the data archive from AT&T.
As described on the original website:
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken
at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and
facial details (glasses / no glasses). All the images were taken against a dark homogeneous background
with the subjects in an upright, frontal position (with tolerance for some side movement).
Data Set Characteristics:
Classes 40
Samples total 400
Dimensionality 4096
Features real, between 0 and 1
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating
point values on the interval [0, 1], which are easier to work with for many algorithms.
The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with
only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised
perspective.
The original dataset consisted of 92 x 112 images, while the version available here consists of 64x64 images.
When using these images, please give credit to AT&T Laboratories Cambridge.
The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for
training (or development) and the other one for testing (or for performance evaluation). The split between the train
and test set is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list
of the raw texts that can be fed to text feature extractors such as sklearn.feature_extraction.text.
CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.
datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use
a feature extractor.
Data Set Characteristics:
Classes 20
Samples total 18846
Dimensionality 1
Features text
Usage
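A sketch of the fetch call assumed by the snippets below:
>>> from sklearn.datasets import fetch_20newsgroups
>>> newsgroups_train = fetch_20newsgroups(subset='train')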
The real data lies in the filenames and target attributes. The target attribute is the integer index of the category:
>>> newsgroups_train.filenames.shape
(11314,)
>>> newsgroups_train.target.shape
(11314,)
>>> newsgroups_train.target[:10]
array([ 7, 4, 4, 1, 14, 16, 13, 3, 2, 4])
It is possible to load only a sub-selection of the categories by passing the list of the categories to load to the sklearn.
datasets.fetch_20newsgroups function:
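For example:
>>> cats = ['alt.atheism', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)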
>>> list(newsgroups_train.target_names)
['alt.atheism', 'sci.space']
>>> newsgroups_train.filenames.shape
(1073,)
>>> newsgroups_train.target.shape
(1073,)
>>> newsgroups_train.target[:10]
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
In order to feed predictive or clustering models with the text data, one first needs to turn the text into vectors of numerical values suitable for statistical analysis. This can be achieved with the utilities of sklearn.feature_extraction.text as demonstrated in the following example that extracts TF-IDF vectors of unigram tokens from a subset of 20news:
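A sketch of such an extraction (the shapes shown are indicative and may vary slightly across versions):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> categories = ['alt.atheism', 'talk.religion.misc',
...               'comp.graphics', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
>>> vectorizer = TfidfVectorizer()
>>> vectors = vectorizer.fit_transform(newsgroups_train.data)
>>> vectors.shape
(2034, 34118)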
The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero components per sample in a space of more than 30000 dimensions (less than .5% non-zero features):
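Continuing the sketch above:
>>> vectors.nnz / float(vectors.shape[0])
159.01...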
It is easy for a classifier to overfit on particular things that appear in the 20 Newsgroups data, such as newsgroup
headers. Many classifiers achieve very high F-scores, but their results would not generalize to other documents that
aren’t from this window of time.
For example, let’s look at the results of a multinomial Naive Bayes classifier, which is fast to train and achieves a
decent F-score:
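Continuing the sketch above (the exact score is not reproduced here):
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import metrics
>>> newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
>>> vectors_test = vectorizer.transform(newsgroups_test.data)
>>> clf = MultinomialNB(alpha=.01)
>>> clf.fit(vectors, newsgroups_train.target)
MultinomialNB(alpha=0.01)
>>> pred = clf.predict(vectors_test)
>>> score = metrics.f1_score(newsgroups_test.target, pred, average='macro')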
(The example Classification of text documents using sparse features shuffles the training and test data, instead of
segmenting by time, and in that case multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
yet of what’s going on inside this classifier?)
Let’s take a look at what the most informative features are:
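One way to do so is with a small helper (a hypothetical sketch; its printed output is not reproduced here):
>>> import numpy as np
>>> def show_top10(classifier, vectorizer, categories):
...     feature_names = np.asarray(vectorizer.get_feature_names())
...     for i, category in enumerate(categories):
...         top10 = np.argsort(classifier.coef_[i])[-10:]
...         print("%s: %s" % (category, " ".join(feature_names[top10])))
...
>>> show_top10(clf, vectorizer, newsgroups_train.target_names)  # doctest: +SKIP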
You can now see many things that these features have overfit to:
• Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and
Distribution: appear more or less often.
• Another significant feature involves whether the sender is affiliated with a university, as indicated either by their
headers or their signature.
• The word “article” is a significant feature, based on how often people quote previous posts like this: “In article
[article ID], [name] <[e-mail address]> wrote:”
• Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from text at
all, and they all perform at the same high level.
For this reason, the functions that load 20 Newsgroups data provide a parameter called remove, telling it what
kinds of information to strip out of each file. remove should be a tuple containing any subset of ('headers',
'footers', 'quotes'), telling it to remove headers, signature blocks, and quotation blocks respectively.
This classifier loses much of its F-score, just because we removed metadata that has little to do with topic classification. It loses even more if we also strip this metadata from the training data:
Some other classifiers cope better with this harder version of the task. Try running Sample pipeline for text feature
extraction and evaluation with and without the --filter option to compare the results.
Recommendation
When evaluating text classifiers on the 20 Newsgroups data, you should strip newsgroup-related metadata. In
scikit-learn, you can do this by setting remove=('headers', 'footers', 'quotes'). The F-score will
be lower because it is more realistic.
Examples
This dataset is a collection of JPEG pictures of famous people collected over the internet, all details are available on
the official website:
http://vis-www.cs.umass.edu/lfw/
Each picture is centered on a single face. The typical task is called Face Verification: given a pair of two pictures, a
binary classifier must predict whether the two images are from the same person.
An alternative task, Face Recognition or Face Identification is: given the picture of the face of an unknown person,
identify the name of the person by referring to a gallery of previously seen pictures of identified persons.
Both Face Verification and Face Recognition are tasks that are typically performed on the output of a model trained to
perform Face Detection. The most popular model for Face Detection is called Viola-Jones and is implemented in the
OpenCV library. The LFW faces were extracted by this face detector from various online websites.
Data Set Characteristics:
Classes 5749
Samples total 13233
Dimensionality 5828
Features real, between 0 and 255
Usage
scikit-learn provides two loaders that will automatically download, cache, parse the metadata files, decode the
jpeg and convert the interesting slices into memmapped numpy arrays. This dataset size is more than 200 MB. The first
load typically takes more than a couple of minutes to fully decode the relevant part of the JPEG files into numpy arrays.
Once the dataset has been loaded, subsequent loads take less than 200ms, by using a memmapped version memoized on the disk in the ~/scikit_learn_data/lfw_home/ folder using joblib.
The first loader is used for the Face Identification task: a multi-class classification task (hence supervised learning):
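A sketch of the fetch call assumed by the snippets below; the seven names correspond to people with at least 70 images:
>>> from sklearn.datasets import fetch_lfw_people
>>> lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
>>> for name in lfw_people.target_names:
...     print(name)
...
Ariel Sharon
Colin Powell
Donald Rumsfeld
George W Bush
Gerhard Schroeder
Hugo Chavez
Tony Blair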
The default slice is a rectangular shape around the face, removing most of the background:
>>> lfw_people.data.dtype
dtype('float32')
>>> lfw_people.data.shape
(1288, 1850)
>>> lfw_people.images.shape
(1288, 50, 37)
Each of the 1288 faces is assigned to a single person id in the target array:
>>> lfw_people.target.shape
(1288,)
>>> list(lfw_people.target[:10])
[5, 6, 3, 1, 0, 1, 3, 4, 3, 0]
The second loader is typically used for the face verification task: each sample is a pair of two pictures belonging or not to the same person:
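A sketch of the fetch call assumed by the snippet below:
>>> from sklearn.datasets import fetch_lfw_pairs
>>> lfw_pairs_train = fetch_lfw_pairs(subset='train')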
>>> list(lfw_pairs_train.target_names)
['Different persons', 'Same person']
>>> lfw_pairs_train.pairs.shape
(2200, 2, 62, 47)
>>> lfw_pairs_train.data.shape
(2200, 5828)
>>> lfw_pairs_train.target.shape
(2200,)
References:
• Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Gary
B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. University of Massachusetts, Amherst,
Technical Report 07-49, October, 2007.
Examples
Forest covertypes
The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting
each patch’s cover type, i.e. the dominant species of tree. There are seven covertypes, making this a multiclass
classification problem. Each sample has 54 features, described on the dataset’s homepage. Some of the features are
boolean indicators, while others are discrete or continuous measurements.
Data Set Characteristics:
Classes 7
Samples total 581012
Dimensionality 54
Features int
sklearn.datasets.fetch_covtype will load the covertype dataset; it returns a dictionary-like object with
the feature matrix in the data member and the target values in target. The dataset will be downloaded from the
web if necessary.
RCV1 dataset
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories made available
by Reuters, Ltd. for research purposes. The dataset is extensively described in [1].
Data Set Characteristics:
Classes 103
Samples total 804414
Dimensionality 47236
Features real, between 0 and 1
sklearn.datasets.fetch_rcv1 will load the following version: RCV1-v2, vectors, full sets, topics multilabels:
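A sketch of the fetch call assumed by the snippets below:
>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()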
>>> rcv1.data.shape
(804414, 47236)
target: The target values are stored in a scipy CSR sparse matrix, with 804414 samples and 103 categories. Each
sample has a value of 1 in its categories, and 0 in others. The array has 3.15% of non zero values:
>>> rcv1.target.shape
(804414, 103)
[1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361-397.
sample_id: Each sample can be identified by its ID, ranging (with gaps) from 2286 to 810596:
>>> rcv1.sample_id[:3]
array([2286, 2287, 2288], dtype=uint32)
target_names: The target values are the topics of each sample. Each sample belongs to at least one topic, and
to up to 17 topics. There are 103 topics, each represented by a string. Their corpus frequencies span five orders of
magnitude, from 5 occurrences for ‘GMIL’, to 381327 for ‘CCAT’:
>>> rcv1.target_names[:3].tolist()
['E11', 'ECAT', 'M11']
The dataset will be downloaded from the rcv1 homepage if necessary. The compressed size is about 656 MB.
Kddcup 99 dataset
The KDD Cup ‘99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection
System (IDS) Evaluation dataset, created by MIT Lincoln Lab [1]. The artificial data (described on the dataset’s
homepage) was generated using a closed network and hand-injected attacks to produce a large number of different
types of attack with normal activity in the background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of abnormal data, which is unrealistic in the real world and inappropriate for unsupervised anomaly detection, which aims at detecting ‘abnormal’ data, i.e. data that is:
1) qualitatively different from normal data
2) in large minority among the observations.
We thus transform the KDD Data set into two different data sets: SA and SF.
• SA is obtained by simply selecting all the normal data, and a small proportion of abnormal data to give an anomaly proportion of 1%.
• SF is obtained as in [2] by simply picking up the data whose attribute logged_in is positive, thus focusing on the intrusion attack, which gives a proportion of 0.3% of attack.
• http and smtp are two subsets of SF corresponding to the third feature being equal to ‘http’ (resp. to ‘smtp’).
sklearn.datasets.fetch_kddcup99 will load the kddcup99 dataset; it returns a dictionary-like object with
the feature matrix in the data member and the target values in target. The dataset will be downloaded from the
web if necessary.
References
• Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33
(1997) 291-297
In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of
controlled size and complexity.
Single label
Both make_blobs and make_classification create multiclass datasets by allocating each class one or more
normally-distributed clusters of points. make_blobs provides greater control regarding the centers and standard de-
viations of each cluster, and is used to demonstrate clustering. make_classification specialises in introducing
noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear
transformations of the feature space.
make_gaussian_quantiles divides a single Gaussian cluster into near-equal-size classes separated by concen-
tric hyperspheres. make_hastie_10_2 generates a similar binary, 10-dimensional problem.
make_circles and make_moons generate 2d binary classification datasets that are challenging to certain algo-
rithms (e.g. centroid-based clustering or linear classification), including optional Gaussian noise. They are useful for
visualisation. make_circles produces Gaussian data with a spherical decision boundary for binary classification,
while make_moons produces two interleaving half circles.
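For instance, a small sketch of both generators:
>>> from sklearn.datasets import make_blobs, make_classification
>>> X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=0)
>>> X.shape
(10, 2)
>>> X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
...                            n_redundant=2, random_state=0)
>>> X.shape
(100, 20)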
Multilabel
make_multilabel_classification generates random samples with multiple labels, reflecting a bag of words
drawn from a mixture of topics. The number of topics for each document is drawn from a Poisson distribution, and the
topics themselves are drawn from a fixed random distribution. Similarly, the number of words is drawn from Poisson,
with words drawn from a multinomial, where each topic defines a probability distribution over words. Simplifications
with respect to true bag-of-words mixtures include:
• Per-topic word distributions are independently drawn, where in reality all would be affected by a sparse base
distribution, and would be correlated.
• For a document generated from multiple topics, all topics are weighted equally in generating its bag of words.
• Documents without labels have their words drawn at random, rather than from a base distribution.
Biclustering
make_biclusters(shape, n_clusters[, noise, ...])    Generate an array with constant block diagonal structure for biclustering.
make_checkerboard(shape, n_clusters[, ...])         Generate an array with block checkerboard structure for biclustering.
make_regression produces regression targets as an optionally-sparse random linear combination of random fea-
tures, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the
variance).
Other regression generators generate functions deterministically from randomized features.
make_sparse_uncorrelated produces a target as a linear combination of four features with fixed coef-
ficients. Others encode explicitly non-linear relations: make_friedman1 is related by polynomial and sine
transforms; make_friedman2 includes feature multiplication and reciprocation; and make_friedman3 is
similar with an arctan transformation on the target.
Sample images
Scikit-learn also embeds a couple of sample JPEG images published under a Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.
Warning: The default coding of images is based on the uint8 dtype to spare memory. Often machine learning algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use matplotlib.pyplot.imshow, don’t forget to scale to the range 0 - 1 as done in the following example.
Examples:
scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line
takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ..
.. This format is especially suitable for sparse datasets. In this module, scipy sparse CSR matrices are used for X and
numpy arrays are used for y.
You may load a dataset as follows:
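(A sketch, with hypothetical file paths:)
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
To load two related datasets (e.g. a train and a test set) at once, load_svmlight_files can be used:
>>> from sklearn.datasets import load_svmlight_files
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
...     ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))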
In this case, X_train and X_test are guaranteed to have the same number of features. Another way to achieve the
same result is to fix the number of features:
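(Again with hypothetical paths:)
>>> X_test, y_test = load_svmlight_file(
...     "/path/to/test_dataset.txt", n_features=X_train.shape[1])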
Related links:
openml.org is a public repository for machine learning data and experiments that allows everybody to upload open datasets.
The sklearn.datasets package is able to download datasets from the repository using the function sklearn.
datasets.fetch_openml.
For example, to download a dataset of gene expressions in mice brains:
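A sketch of the call (the dataset name and version match the output shown below):

from sklearn.datasets import fetch_openml

mice = fetch_openml(name='miceprotein', version=4)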
To fully specify a dataset, you need to provide a name and a version, though the version is optional; see Dataset Versions below. The dataset contains a total of 1080 examples belonging to 8 different classes:
>>> mice.data.shape
(1080, 77)
>>> mice.target.shape
(1080,)
>>> np.unique(mice.target)
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m',
       't-SC-s'], dtype=object)
You can get more information on the dataset by looking at the DESCR and details attributes:
>>> print(mice.DESCR)
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
>>> mice.details
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}
The DESCR contains a free-text description of the data, while details contains a dictionary of meta-data stored by openml, like the dataset id. For more details, see the OpenML documentation. The data_id of the mice protein dataset is 40966, and you can use this (or the name) to get more information on the dataset on the openml website:
>>> mice.url
'https://www.openml.org/d/40966'
Dataset Versions
A dataset is uniquely specified by its data_id, but not necessarily by its name. Several different “versions” of a
dataset with the same name can exist which can contain entirely different datasets. If a particular version of a dataset
has been found to contain significant issues, it might be deactivated. Using a name to specify a dataset will yield the
earliest version of a dataset that is still active. That means that fetch_openml(name="miceprotein") can
yield different results at different times if earlier versions become inactive. You can see that the dataset with data_id
40966 that we fetched above is the version 1 of the “miceprotein” dataset:
>>> mice.details['version']
'1'
In fact, this dataset only has one version. The iris dataset on the other hand has multiple versions:
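A sketch of the calls discussed below (the data_id values are taken from the surrounding text):

from sklearn.datasets import fetch_openml

iris = fetch_openml(name="iris")        # resolves to the lowest active version
iris_969 = fetch_openml(data_id=969)    # a specific (binarized) version, by data_id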
Specifying the dataset by the name “iris” yields the lowest version, version 1, with the data_id 61. To make sure
you always get this exact dataset, it is safest to specify it by the dataset data_id. The other dataset, with data_id
969, is version 3 (version 2 has become inactive), and contains a binarized version of the data:
>>> np.unique(iris_969.target)
array(['N', 'P'], dtype=object)
You can also specify both the name and the version, which also uniquely identifies the dataset:
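For example (a sketch; version 3 is the binarized iris version mentioned above):

iris_version_3 = fetch_openml(name="iris", version=3)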
References:
• Vanschoren, van Rijn, Bischl and Torgo “OpenML: networked science in machine learning”, ACM SIGKDD
Explorations Newsletter, 15(2), 49-60, 2014.
scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible
to numeric arrays such as pandas DataFrame are also acceptable.
Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:
• pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames
may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides
tools for manipulation and conversion into a numeric array suitable for scikit-learn.
• scipy.io specializes in binary formats often used in scientific computing context such as .mat and .arff
• numpy/routines.io for standard loading of columnar data into numpy arrays
• scikit-learn’s datasets.load_svmlight_file for the svmlight or libSVM sparse format
• scikit-learn’s datasets.load_files for directories of text files where the name of each directory is the
name of each category and each file inside of each directory corresponds to one sample from that category
For some miscellaneous data such as images, videos, and audio, you may wish to refer to:
• skimage.io or Imageio for loading images and videos into numpy arrays
• scipy.io.wavfile.read for reading WAV files into a numpy array
Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to
numerical features using sklearn.preprocessing.OneHotEncoder or sklearn.preprocessing.
OrdinalEncoder or similar. See Preprocessing data.
Note: if you manage your own numerical data it is recommended to use an optimized file format such as HDF5 to reduce data load times. Various libraries such as H5Py, PyTables and pandas provide a Python interface for reading and writing data in that format.
For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed
are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to
make your system scale.
Out-of-core (or “external memory”) learning is a technique used to learn from data that cannot fit in a computer’s main
memory (RAM).
Here is a sketch of a system designed to achieve this goal:
1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm
Streaming instances
Basically, 1. may be a reader that yields instances from files on a hard drive, a database, a network stream, etc. However, details on how to achieve this are beyond the scope of this documentation.
Extracting features
2. could be any relevant way to extract features among the different feature extraction methods supported by scikit-
learn. However, when working with data that needs vectorization and where the set of features or values is not
known in advance one should take explicit care. A good example is text classification where unknown terms are
likely to be found during training. It is possible to use a stateful vectorizer if making multiple passes over the data
is reasonable from an application point of view. Otherwise, one can turn up the difficulty by using a stateless feature
extractor. Currently the preferred way to do this is to use the so-called hashing trick as implemented by sklearn.
feature_extraction.FeatureHasher for datasets with categorical variables represented as list of Python
dicts or sklearn.feature_extraction.text.HashingVectorizer for text documents.
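A minimal sketch of the stateless, hashing-based approach for text (parameter values are illustrative):

from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no fit needed, so any mini-batch can be vectorized independently.
vectorizer = HashingVectorizer(n_features=2 ** 18)
X_batch = vectorizer.transform(["first document of the batch",
                                "second document of the batch"])
print(X_batch.shape)   # (2, 262144), sparse CSR matrix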
Incremental learning
Finally, for 3. we have a number of options inside scikit-learn. Although not all algorithms can learn incrementally
(i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates.
Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key
to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the
main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve
some tuning [1].
Here is a list of incremental estimators for different tasks:
• Classification
– sklearn.naive_bayes.MultinomialNB
– sklearn.naive_bayes.BernoulliNB
– sklearn.linear_model.Perceptron
– sklearn.linear_model.SGDClassifier
– sklearn.linear_model.PassiveAggressiveClassifier
– sklearn.neural_network.MLPClassifier
• Regression
– sklearn.linear_model.SGDRegressor
– sklearn.linear_model.PassiveAggressiveRegressor
– sklearn.neural_network.MLPRegressor
• Clustering
– sklearn.cluster.MiniBatchKMeans
– sklearn.cluster.Birch
• Decomposition / feature Extraction
– sklearn.decomposition.MiniBatchDictionaryLearning
– sklearn.decomposition.IncrementalPCA
– sklearn.decomposition.LatentDirichletAllocation
• Preprocessing
– sklearn.preprocessing.StandardScaler
– sklearn.preprocessing.MinMaxScaler
– sklearn.preprocessing.MaxAbsScaler
For classification, a somewhat important thing to note is that although a stateless feature extraction routine may be able to cope with new/unseen attributes, the incremental learner itself may be unable to cope with new/unseen target classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter.
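A minimal sketch of this pattern (data and class values are placeholders):

import numpy as np
from sklearn.linear_model import SGDClassifier

all_classes = np.array([0, 1, 2])          # every class that may ever appear
clf = SGDClassifier()

# The first call must receive the full list of classes, even if this batch
# only contains a subset of them; later partial_fit calls can omit `classes`.
X_batch = np.array([[0.0, 1.0], [1.0, 0.0]])
y_batch = np.array([0, 1])
clf.partial_fit(X_batch, y_batch, classes=all_classes)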
Another aspect to consider when choosing a proper algorithm is that not all of them put the same importance on each example over time. Namely, the Perceptron is still sensitive to badly labeled examples even after many examples, whereas the SGD* and PassiveAggressive* families are more robust to this kind of artifact. Conversely, the latter also tend to give less importance to remarkably different, yet properly labeled, examples when they come late in the stream as their learning rate decreases over time.
Examples
Finally, we have a full-fledged example of Out-of-core classification of text documents. It is aimed at providing a
starting point for people wanting to build out-of-core learning systems and demonstrates most of the notions discussed
above.
Furthermore, it also shows the evolution of the performance of different algorithms with the number of processed
examples.
Now looking at the computation time of the different parts, we see that the vectorization is much more expensive
than learning itself. From the different algorithms, MultinomialNB is the most expensive, but its overhead can be
mitigated by increasing the size of the mini-batches (exercise: change minibatch_size to 100 and 10000 in the
program and compare).
Notes
[1] Depending on the algorithm, the mini-batch size can influence results or not. SGD*, PassiveAggressive*, and discrete NaiveBayes are truly online and are not affected by batch size. Conversely, the MiniBatchKMeans convergence rate is affected by the batch size. Also, its memory footprint can vary dramatically with batch size.
For some applications the performance (mainly latency and throughput at prediction time) of estimators is crucial. It
may also be of interest to consider the training throughput but this is often less important in a production setup (where
it often takes place offline).
We will review here the orders of magnitude you can expect from a number of scikit-learn estimators in different
contexts and provide some tips and tricks for overcoming performance bottlenecks.
Prediction latency is measured as the elapsed time necessary to make a prediction (e.g. in micro-seconds). Latency is often viewed as a distribution and operations engineers often focus on the latency at a given percentile of this distribution (e.g. the 90th percentile).
Prediction throughput is defined as the number of predictions the software can deliver in a given amount of time (e.g.
in predictions per second).
An important aspect of performance optimization is also that it can hurt prediction accuracy. Indeed, simpler models
(e.g. linear instead of non-linear, or with fewer parameters) often run faster but are not always able to take into account
the same exact properties of the data as more complex ones.
Prediction Latency
One of the most straight-forward concerns one may have when using/choosing a machine learning toolkit is the latency
at which predictions can be made in a production environment.
The main factors that influence the prediction latency are:
1. Number of features
2. Input data representation and sparsity
3. Model complexity
4. Feature extraction
A last major parameter is also the possibility to do predictions in bulk or one-at-a-time mode.
In general doing predictions in bulk (many instances at the same time) is more efficient for a number of reasons
(branching predictability, CPU cache, linear algebra libraries optimizations etc.). Here we see on a setting with few
features that independently of estimator choice the bulk mode is always faster, and for some of them by 1 to 2 orders
of magnitude:
To benchmark different estimators for your case you can simply change the n_features parameter in this example:
Prediction Latency. This should give you an estimate of the order of magnitude of the prediction latency.
Scikit-learn does some validation on data that increases the overhead per call to predict and similar functions.
In particular, checking that features are finite (not NaN or infinite) involves a full pass over the data. If you en-
sure that your data is acceptable, you may suppress checking for finiteness by setting the environment variable
SKLEARN_ASSUME_FINITE to a non-empty string before importing scikit-learn, or configure it in Python with
sklearn.set_config. For more control than these global settings, a config_context allows you to set this
configuration within a specified context:
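A minimal sketch of the context manager (the fitted model is a stand-in for any estimator):

import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

X = np.random.RandomState(0).rand(100, 10)
y = X.sum(axis=1)
model = LinearRegression().fit(X, y)

with sklearn.config_context(assume_finite=True):
    # Validation for NaN / inf is skipped for every call made inside this block.
    predictions = model.predict(X)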
Note that this will affect all uses of sklearn.utils.assert_all_finite within the context.
Obviously when the number of features increases so does the memory consumption of each example. Indeed, for a matrix of M instances with N features, the space complexity is in O(N M). From a computing perspective it also means that the number of basic operations (e.g., multiplications for vector-matrix products in linear models) increases too. Here is a graph of the evolution of the prediction latency with the number of features:
Overall you can expect the prediction time to increase at least linearly with the number of features (non-linear cases
can happen depending on the global memory footprint and estimator).
Scipy provides sparse matrix data structures which are optimized for storing sparse data. The main feature of sparse
formats is that you don’t store zeros so if your data is sparse then you use much less memory. A non-zero value in
a sparse (CSR or CSC) representation will only take on average one 32bit integer position + the 64 bit floating point
value + an additional 32bit per row or column in the matrix. Using sparse input on a dense (or sparse) linear model can speed up prediction by quite a bit as only the non-zero valued features impact the dot product and thus the model predictions. Hence if you have 100 non-zeros in a 1e6-dimensional space, you only need 100 multiply-and-add operations instead of 1e6.
Calculation over a dense representation, however, may leverage highly optimised vector operations and multithreading
in BLAS, and tends to result in fewer CPU cache misses. So the sparsity should typically be quite high (10% non-zeros
max, to be checked depending on the hardware) for the sparse input representation to be faster than the dense input
representation on a machine with many CPUs and an optimized BLAS implementation.
Here is sample code to test the sparsity of your input:
import numpy as np

def sparsity_ratio(X):
    # Fraction of entries in the 2d array X that are exactly zero.
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])

print("input sparsity ratio:", sparsity_ratio(X))
As a rule of thumb you can consider that if the sparsity ratio is greater than 90% you can probably benefit from sparse
formats. Check Scipy’s sparse matrix formats documentation for more information on how to build (or convert your
data to) sparse matrix formats. Most of the time the CSR and CSC formats work best.
Generally speaking, when model complexity increases, predictive power and latency are supposed to increase. Increasing predictive power is usually interesting, but for many applications we would rather not increase prediction latency too much. We will now review this idea for different families of supervised models.
For sklearn.linear_model (e.g. Lasso, ElasticNet, SGDClassifier/Regressor, Ridge & RidgeClassifier, Pas-
siveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression. . . ) the decision function that is applied at pre-
diction time is the same (a dot product) , so latency should be equivalent.
Here is an example using sklearn.linear_model.SGDClassifier with the elasticnet penalty. The
regularization strength is globally controlled by the alpha parameter. With a sufficiently high alpha, one can then
increase the l1_ratio parameter of elasticnet to enforce various levels of sparsity in the model coefficients.
Higher sparsity here is interpreted as less model complexity as we need fewer coefficients to describe it fully. Of
course sparsity influences in turn the prediction time as the sparse dot-product takes time roughly proportional to the
number of non-zero coefficients.
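A sketch of the kind of model this paragraph describes (the data and parameter values are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

clf = SGDClassifier(penalty='elasticnet', alpha=0.01, l1_ratio=0.9,
                    max_iter=1000, random_state=0).fit(X, y)

# Fraction of coefficients driven exactly to zero by the elasticnet penalty.
print("coef sparsity:", np.mean(clf.coef_ == 0))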
For the sklearn.svm family of algorithms with a non-linear kernel, the latency is tied to the number of support
vectors (the fewer the faster). Latency and throughput should (asymptotically) grow linearly with the number of
support vectors in a SVC or SVR model. The kernel will also influence the latency as it is used to compute the
projection of the input vector once per support vector. In the following graph the nu parameter of sklearn.svm.
NuSVR was used to influence the number of support vectors.
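A hedged sketch of the relationship between nu and the number of support vectors (values are illustrative):

from sklearn.datasets import make_regression
from sklearn.svm import NuSVR

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# A smaller nu tends to yield fewer support vectors, and hence lower latency.
model = NuSVR(nu=0.25, kernel='rbf').fit(X, y)
print("number of support vectors:", model.support_vectors_.shape[0])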
For sklearn.ensemble of trees (e.g. RandomForest, GBT, ExtraTrees, etc.) the number of trees and their depth play the most important role. Latency and throughput should scale linearly with the number of trees. In this case we used directly the n_estimators parameter of sklearn.ensemble.GradientBoostingRegressor.
In any case be warned that decreasing model complexity can hurt accuracy as mentioned above. For instance a non-
linearly separable problem can be handled with a speedy linear model but prediction power will very likely suffer in
the process.
Most scikit-learn models are usually pretty fast as they are implemented either with compiled Cython extensions or
optimized computing libraries. On the other hand, in many real world applications the feature extraction process (i.e.
turning raw data like database rows or network packets into numpy arrays) governs the overall prediction time. For
example on the Reuters text classification task the whole preparation (reading and parsing SGML files, tokenizing the text and hashing it into a common vector space) takes 100 to 500 times more time than the actual prediction code, depending on the chosen model.
In many cases it is thus recommended to carefully time and profile your feature extraction code as it may be a good place to start optimizing when your overall latency is too high for your application.
Prediction Throughput
Another important metric to care about when sizing production systems is the throughput i.e. the number of predictions
you can make in a given amount of time. Here is a benchmark from the Prediction Latency example that measures this
quantity for a number of estimators on synthetic data:
These throughputs are achieved on a single process. An obvious way to increase the throughput of your application
is to spawn additional instances (usually processes in Python because of the GIL) that share the same model. One
might also add machines to spread the load. A detailed explanation on how to achieve this is beyond the scope of this
documentation though.
As scikit-learn relies heavily on Numpy/Scipy and linear algebra in general it makes sense to take explicit care of the
versions of these libraries. Basically, you ought to make sure that Numpy is built using an optimized BLAS / LAPACK
library.
Not all models benefit from optimized BLAS and Lapack implementations. For instance models based on (random-
ized) decision trees typically do not rely on BLAS calls in their inner loops, nor do kernel SVMs (SVC, SVR, NuSVC,
NuSVR). On the other hand a linear model implemented with a BLAS DGEMM call (via numpy.dot) will typically
benefit hugely from a tuned BLAS implementation and lead to orders of magnitude speedup over a non-optimized
BLAS.
You can display the BLAS / LAPACK implementation used by your NumPy / SciPy / scikit-learn install with the
following commands:
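One way to do this from Python (a sketch; these calls exist in NumPy, SciPy and scikit-learn, though the exact output format varies between versions):

import numpy
import scipy
import sklearn

numpy.show_config()      # BLAS / LAPACK used by NumPy
scipy.show_config()      # BLAS / LAPACK used by SciPy
sklearn.show_versions()  # versions and build information of the whole stack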
Optimized BLAS / LAPACK implementations include:
• MKL
• Apple Accelerate and vecLib frameworks (OSX only)
More information can be found on the Scipy install page and in this blog post from Daniel Nouri which has some nice
step by step install instructions for Debian / Ubuntu.
Some calculations when implemented using standard numpy vectorized operations involve using a large amount of
temporary memory. This may potentially exhaust system memory. Where computations can be performed in fixed-
memory chunks, we attempt to do so, and allow the user to hint at the maximum size of this working memory (defaulting to 1GB) using sklearn.set_config or config_context. The following suggests limiting temporary working memory to 128 MiB:
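A minimal sketch of both the global and the scoped configuration:

import sklearn

# Global setting for the whole session.
sklearn.set_config(working_memory=128)

# Or scoped to a block of code only.
with sklearn.config_context(working_memory=128):
    pass  # chunked computations inside this block use at most ~128 MiB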
Model Compression
Model compression in scikit-learn only concerns linear models for the moment. In this context it means that we want
to control the model sparsity (i.e. the number of non-zero coordinates in the model vectors). It is generally a good
idea to combine model sparsity with sparse input data representation.
Here is sample code that illustrates the use of the sparsify() method:
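Sample code along these lines (data and parameter values are placeholders):

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=500, n_features=100, random_state=0)

clf = SGDRegressor(penalty='elasticnet', alpha=0.01, l1_ratio=0.25,
                   max_iter=1000, random_state=0).fit(X, y)

# Convert the (mostly zero) coefficient vector to a scipy sparse representation.
clf.sparsify()
predictions = clf.predict(X)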
In this example we prefer the elasticnet penalty as it is often a good compromise between model compactness
and prediction power. One can also further tune the l1_ratio parameter (in combination with the regularization
strength alpha) to control this tradeoff.
A typical benchmark on synthetic data yields a >30% decrease in latency when both the model and input are sparse
(with 0.000024 and 0.027400 non-zero coefficients ratio respectively). Your mileage may vary depending on the
sparsity and size of your data and model. Furthermore, sparsifying can be very useful to reduce the memory usage of
predictive models deployed on production servers.
Model Reshaping
Model reshaping consists of selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input. This has several benefits. Firstly it reduces the memory (and therefore time) overhead of the model itself. It also allows discarding explicit feature selection components in a pipeline once we know which features to keep from a previous run. Finally, it can help
reduce processing time and I/O usage upstream in the data access and feature extraction layers by not collecting and
building features that are discarded by the model. For instance if the raw data come from a database, it can make it
possible to write simpler and faster queries or reduce I/O usage by making the queries return lighter records. At the
moment, reshaping needs to be performed manually in scikit-learn. In the case of sparse input (particularly in CSR
format), it is generally sufficient to not generate the relevant features, leaving their columns empty.
Parallelism
Some scikit-learn estimators and utilities can parallelize costly operations using multiple CPU cores, thanks to the
following components:
• via the joblib library. In this case the number of threads or processes can be controlled with the n_jobs
parameter.
• via OpenMP, used in C or Cython code.
In addition, some of the numpy routines that are used internally by scikit-learn may also be parallelized if numpy is
installed with specific numerical libraries such as MKL, OpenBLAS, or BLIS.
We describe these 3 scenarios in the following subsections.
Joblib-based parallelism
When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in
parallel can be controlled via the n_jobs parameter.
Note: Where (and how) parallelization happens in the estimators is currently poorly documented. Please help us by improving our docs and tackling issue 14228!
Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a
process depends on the backend that it’s using.
Scikit-learn generally relies on the loky backend, which is joblib’s default backend. Loky is a multi-processing back-
end. When doing multi-processing, in order to avoid duplicating the memory in each process (which isn’t reasonable
with big datasets), joblib will create a memmap that all processes can share, when the data is bigger than 1MB.
In some specific cases (when the code that is run in parallel releases the GIL), scikit-learn will indicate to joblib
that a multi-threading backend is preferable.
As a user, you may control the backend that joblib will use (regardless of what scikit-learn recommends) by using a
context manager:
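A minimal sketch using joblib's context manager (the backend and n_jobs values are illustrative):

from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

with parallel_backend('threading', n_jobs=2):
    # Estimators fitted inside this block use the requested joblib backend.
    RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0).fit(X, y)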
OpenMP-based parallelism
OpenMP is used to parallelize code written in Cython or C, relying on multi-threading exclusively. By default (and
unless joblib is trying to avoid oversubscription), the implementation will use as many threads as possible.
You can control the exact number of threads that are used via the OMP_NUM_THREADS environment variable:
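The canonical way is to set the variable in the shell (e.g. OMP_NUM_THREADS=4 python my_script.py, where my_script.py is a placeholder name); as a sketch, it can also be set from Python, provided this happens before the OpenMP runtime is initialized:

import os

# Set before importing scikit-learn so the compiled extensions pick it up.
os.environ["OMP_NUM_THREADS"] = "4"

import sklearn  # noqa: E402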
Scikit-learn relies heavily on NumPy and SciPy, which internally call multi-threaded linear algebra routines imple-
mented in libraries such as MKL, OpenBLAS or BLIS.
The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set via the MKL_NUM_THREADS,
OPENBLAS_NUM_THREADS, and BLIS_NUM_THREADS environment variables.
Please note that scikit-learn has no direct control over these implementations. Scikit-learn solely relies on Numpy and
Scipy.
Note: At the time of writing (2019), NumPy and SciPy packages distributed on pypi.org (used by pip) and on
the conda-forge channel are linked with OpenBLAS, while conda packages shipped on the “defaults” channel from
anaconda.org are linked by default with MKL.
It is generally recommended to avoid using significantly more processes or threads than the number of CPUs on a
machine. Over-subscription happens when a program is running too many threads at the same time.
Suppose you have a machine with 8 CPUs. Consider a case where you’re running a GridSearchCV (parallelized
with joblib) with n_jobs=8 over a HistGradientBoostingClassifier (parallelized with OpenMP). Each
instance of HistGradientBoostingClassifier will spawn 8 threads (since you have 8 CPUs). That’s a total
of 8 * 8 = 64 threads, which leads to oversubscription of physical CPU resources and to scheduling overhead.
Oversubscription can arise in the exact same fashion with parallelized routines from MKL, OpenBLAS or BLIS that
are nested in joblib calls.
Starting from joblib >= 0.14, when the loky backend is used (which is the default), joblib will tell its child
processes to limit the number of threads they can use, so as to avoid oversubscription. In practice the heuristic
that joblib uses is to tell the processes to use max_threads = n_cpus // n_jobs, via their corresponding
environment variable. Back to our example from above, since the joblib backend of GridSearchCV is loky, each
process will only be able to use 1 thread instead of 8, thus mitigating the oversubscription issue.
Note that:
• Manually setting one of the environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS,
OPENBLAS_NUM_THREADS, or BLIS_NUM_THREADS) will take precedence over what joblib tries to do.
The total number of threads will be n_jobs * <LIB>_NUM_THREADS. Note that setting this limit will also
impact your computations in the main process, which will only use <LIB>_NUM_THREADS. Joblib exposes a
context manager for finer control over the number of threads in its workers (see joblib docs linked below).
• Joblib is currently unable to avoid oversubscription in a multi-threading context. It can only do so with the
loky backend (which spawns processes).
You will find additional details about joblib mitigation of oversubscription in joblib documentation.
Configuration switches
Python runtime
Environment variables
Glossary of Common Terms and API Elements
This glossary hopes to definitively represent the tacit and explicit conventions applied in Scikit-learn and its API, while
providing a reference for users and contributors. It aims to describe the concepts and either detail their corresponding
API or link to other relevant parts of the documentation which do so. By linking to glossary entries from the API
Reference and User Guide, we may minimize redundancy and inconsistency.
We begin by listing general concepts (and any that didn’t fit elsewhere), but more specific sets of related terms are listed
below: Class APIs and Estimator Types, Target Types, Methods, Parameters, Attributes, Data and sample properties.
1d
1d array One-dimensional array. A NumPy array whose .shape has length 1. A vector.
2d
2d array Two-dimensional array. A NumPy array whose .shape has length 2. Often represents a matrix.
API Refers to both the specific interfaces for estimators implemented in Scikit-learn and the generalized conventions
across types of estimators as described in this glossary and overviewed in the contributor documentation.
The specific interfaces that constitute Scikit-learn’s public API are largely documented in API Reference. How-
ever we less formally consider anything as public API if none of the identifiers required to access it begins with
_. We generally try to maintain backwards compatibility for all objects in the public API.
Private API, including functions, modules and methods beginning with _, are not assured to be stable.
array-like The most common data format for input to Scikit-learn estimators and functions, array-like is any type
object for which numpy.asarray will produce an array of appropriate shape (usually 1 or 2-dimensional) of
appropriate dtype (usually numeric).
This includes:
• a numpy array
• a list of numbers
• a list of length-k lists of numbers for some fixed length k
• a pandas.DataFrame with all columns numeric
• a numeric pandas.Series
It excludes:
• a sparse matrix
• an iterator
• a generator
Note that output from scikit-learn estimators and functions (e.g. predictions) should generally be arrays or sparse
matrices, or lists thereof (as in multi-output tree.DecisionTreeClassifier’s predict_proba). An
estimator where predict() returns a list or a pandas.Series is not valid.
attribute
attributes We mostly use attribute to refer to how model information is stored on an estimator during fitting. Any pub-
lic attribute stored on an estimator instance is required to begin with an alphabetic character and end in a single
underscore if it is set in fit or partial_fit. These are what is documented under an estimator’s Attributes documen-
tation. The information stored in attributes is usually either: sufficient statistics used for prediction or transfor-
mation; transductive outputs such as labels_ or embedding_; or diagnostic data, such as feature_importances_.
Common attributes are listed below.
A public attribute may have the same name as a constructor parameter, with a _ appended. This is used to
store a validated or estimated version of the user’s input. For example, decomposition.PCA is constructed
with an n_components parameter. From this, together with other parameters and the data, PCA estimates the
attribute n_components_.
Further private attributes used in prediction/transformation/etc. may also be set when fitting. These begin with
a single underscore and are not assured to be stable for public access.
A public attribute on an estimator instance that does not end in an underscore should be the stored, unmodified
value of an __init__ parameter of the same name. Because of this equivalence, these are documented under
an estimator’s Parameters documentation.
backwards compatibility We generally try to maintain backwards compatibility (i.e. interfaces and behaviors may
be extended but not changed or removed) from release to release but this comes with some exceptions:
Public API only The behaviour of objects accessed through private identifiers (those beginning _) may be
changed arbitrarily between versions.
As documented We will generally assume that the users have adhered to the documented parameter types and
ranges. If the documentation asks for a list and the user gives a tuple, we do not assure consistent behavior
from version to version.
Deprecation Behaviors may change following a deprecation period (usually two releases long). Warnings are
issued using Python’s warnings module.
Keyword arguments We may sometimes assume that all optional parameters (other than X and y to fit and
similar methods) are passed as keyword arguments only and may be positionally reordered.
Bug fixes and enhancements Bug fixes and – less often – enhancements may change the behavior of estima-
tors, including the predictions of an estimator trained on the same data and random_state. When this
happens, we attempt to note it clearly in the changelog.
Serialization We make no assurances that pickling an estimator in one version will allow it to be unpickled
to an equivalent model in the subsequent version. (For estimators in the sklearn package, we issue a
warning when this unpickling is attempted, even if it may happen to work.) See Security & maintainability
limitations.
utils.estimator_checks.check_estimator We provide limited backwards compatibility assur-
ances for the estimator checks: we may add extra requirements on estimators tested with this function,
usually when these were informally assumed but not formally tested.
Despite this informal contract with our users, the software is provided as is, as stated in the licence. When
a release inadvertently introduces changes that are not backwards compatible, these are known as software
regressions.
callable A function, class or an object which implements the __call__ method; anything that returns True when passed as the argument of callable().
categorical feature A categorical or nominal feature is one that has a finite set of discrete values across the popu-
lation of data. These are commonly represented as columns of integers or strings. Strings will be rejected by
most scikit-learn estimators, and integers will be treated as ordinal or count-valued. For the use with most es-
timators, categorical variables should be one-hot encoded. Notable exceptions include tree-based models such
as random forests and gradient boosting models that often work better and faster with integer-coded categor-
ical variables. OrdinalEncoder helps encoding string-valued categorical features as ordinal integers, and
OneHotEncoder can be used to one-hot encode categorical features. See also Encoding categorical features
and the categorical-encoding package for tools related to encoding categorical features.
clone
cloned To copy an estimator instance and create a new one with identical parameters, but without any fitted attributes,
using clone.
When fit is called, a meta-estimator usually clones a wrapped estimator instance before fitting the cloned
instance. (Exceptions, for legacy reasons, include Pipeline and FeatureUnion.)
common tests This refers to the tests run on almost every estimator class in Scikit-learn to check they comply
with basic API conventions. They are available for external use through utils.estimator_checks.
check_estimator, with most of the implementation in sklearn/utils/estimator_checks.py.
Note: Some exceptions to the common testing regime are currently hard-coded into the library, but we hope to
replace this by marking exceptional behaviours on the estimator using semantic estimator tags.
deprecation We use deprecation to slowly violate our backwards compatibility assurances, usually in order to:
• change the default value of a parameter; or
• remove a parameter, attribute, method, class, etc.
We will ordinarily issue a warning when a deprecated element is used, although there may be limitations to this.
For instance, we will raise a warning when someone sets a parameter that has been deprecated, but may not
when they access that parameter’s attribute on the estimator instance.
See the Contributors’ Guide.
dimensionality May be used to refer to the number of features (i.e. n_features), or columns in a 2d feature matrix.
Dimensions are, however, also used to refer to the length of a NumPy array’s shape, distinguishing a 1d array
from a 2d matrix.
docstring The embedded documentation for a module, class, function, etc., usually in code as a string at the beginning
of the object’s definition, and accessible as the object’s __doc__ attribute.
We try to adhere to PEP257, and follow NumpyDoc conventions.
double underscore
double underscore notation When specifying parameter names for nested estimators, __ may be used to separate
between parent and child in some contexts. The most common use is when setting parameters through a meta-
estimator with set_params and hence in specifying a search grid in parameter search. See parameter. It is also
used in pipeline.Pipeline.fit for passing sample properties to the fit methods of estimators in the
pipeline.
dtype
data type NumPy arrays assume a homogeneous data type throughout, available in the .dtype attribute of an array
(or sparse matrix). We generally assume simple data types for scikit-learn data: float or integer. We may support
object or string data types for arrays before encoding or vectorizing. Our estimators do not work with struct
arrays, for instance.
TODO: Mention efficiency and precision issues; casting policy.
duck typing We try to apply duck typing to determine how to handle some input values (e.g. checking whether a given
estimator is a classifier). That is, we avoid using isinstance where possible, and rely on the presence or
absence of attributes to determine an object’s behaviour. Some nuance is required when following this approach:
• For some estimators, an attribute may only be available once it is fitted. For instance, we cannot a priori determine if predict_proba is available in a grid search where the grid includes alternating between a probabilistic and a non-probabilistic predictor in the final step of the pipeline. We can only determine if clf is probabilistic after fitting it on some data (see the sketch after this list).
This means that we can only check for duck-typed attributes after fitting, and that we must be careful to
make meta-estimators only present attributes according to the state of the underlying estimator after fitting.
• Checking if an attribute is present (using hasattr) is in general just as expensive as getting the attribute
(getattr or dot notation). In some cases, getting the attribute may indeed be expensive (e.g. for some
implementations of feature_importances_, which may suggest this is an API design flaw). So code which
does hasattr followed by getattr should be avoided; getattr within a try-except block is pre-
ferred.
• For determining some aspects of an estimator’s expectations or support for some feature, we use estimator
tags instead of duck typing.
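A sketch of the situation described in the first bullet above (the parameter grid is illustrative):

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(SGDClassifier(),
                   param_grid={'loss': ['log', 'hinge']})
# Whether `clf.predict_proba` exists depends on which `loss` wins the search,
# so it can only be checked after `clf.fit(X, y)` has been called.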
early stopping This consists in stopping an iterative optimization method before the convergence of the training loss,
to avoid over-fitting. This is generally done by monitoring the generalization score on a validation set. When
available, it is activated through the parameter early_stopping or by setting a positive n_iter_no_change.
estimator instance We sometimes use this terminology to distinguish an estimator class from a constructed instance.
For example, in the following, cls is an estimator class, while est1 and est2 are instances:
cls = RandomForestClassifier
est1 = cls()
est2 = RandomForestClassifier()
examples We try to give examples of basic usage for most functions and classes in the API:
• as doctests in their docstrings (i.e. within the sklearn/ library code itself).
• as examples in the example gallery rendered (using sphinx-gallery) from scripts in the examples/ direc-
tory, exemplifying key features or parameters of the estimator/function. These should also be referenced
from the User Guide.
• sometimes in the User Guide (built from doc/) alongside a technical description of the estimator.
evaluation metric
evaluation metrics Evaluation metrics give a measure of how well a model performs. We may use this term specif-
ically to refer to the functions in metrics (disregarding metrics.pairwise), as distinct from the score
method and the scoring API used in cross validation. See Metrics and scoring: quantifying the quality of
predictions.
These functions usually accept a ground truth (or the raw data where the metric evaluates clustering without
a ground truth) and a prediction, be it the output of predict (y_pred), of predict_proba (y_proba), or of
an arbitrary score function including decision_function (y_score). Functions are usually named to end with
_score if a greater score indicates a better model, and _loss if a lesser score indicates a better model. This
diversity of interface motivates the scoring API.
Note that some estimators can calculate metrics that are not included in metrics and are estimator-specific,
notably model likelihoods.
estimator tags A proposed feature (e.g. #8022) by which the capabilities of an estimator are described through a set
of semantic tags. This would enable some runtime behaviors based on estimator inspection, but it also allows
each estimator to be tested for appropriate invariances while being excepted from other common tests.
Some aspects of estimator tags are currently determined through the duck typing of methods like
predict_proba and through some special attributes on estimator objects:
_estimator_type This string-valued attribute identifies an estimator as being a classifier, regressor, etc. It
is set by mixins such as base.ClassifierMixin, but needs to be more explicitly adopted on a meta-
estimator. Its value should usually be checked by way of a helper such as base.is_classifier.
_pairwise This boolean attribute indicates whether the data (X) passed to fit and similar methods consists
of pairwise measures over samples rather than a feature representation for each sample. It is usually
True where an estimator has a metric or affinity or kernel parameter with value ‘precomputed’.
Its primary purpose is that when a meta-estimator extracts a sub-sample of data intended for a pairwise
estimator, the data needs to be indexed on both axes, while other data is indexed only on the first axis.
feature
features
feature vector In the abstract, a feature is a function (in its mathematical sense) mapping a sampled object to a nu-
meric or categorical quantity. “Feature” is also commonly used to refer to these quantities, being the individual
elements of a vector representing a sample. In a data matrix, features are represented as columns: each column
contains the result of applying a feature function to a set of samples.
Elsewhere features are known as attributes, predictors, regressors, or independent variables.
Nearly all estimators in scikit-learn assume that features are numeric, finite and not missing, even when they
have semantically distinct domains and distributions (categorical, ordinal, count-valued, real-valued, interval).
See also categorical feature and missing values.
n_features indicates the number of features in a dataset.
fitting Calling fit (or fit_transform, fit_predict, etc.) on an estimator.
fitted The state of an estimator after fitting.
There is no conventional procedure for checking if an estimator is fitted. However, an estimator that is not fitted:
• should raise exceptions.NotFittedError when a prediction method (predict, transform, etc.) is
called. (utils.validation.check_is_fitted is used internally for this purpose.)
• should not have any attributes beginning with an alphabetic character and ending with an underscore.
(Note that a descriptor for the attribute may still be present on the class, but hasattr should return False)
function We provide ad hoc function interfaces for many algorithms, while estimator classes provide a more consis-
tent interface.
In particular, Scikit-learn may provide a function interface that fits a model to some data and returns the learnt
model parameters, as in linear_model.enet_path. For transductive models, this also returns the em-
bedding or cluster labels, as in manifold.spectral_embedding or cluster.dbscan. Many prepro-
cessing transformers also provide a function interface, akin to calling fit_transform, as in preprocessing.
maxabs_scale. Users should be careful to avoid data leakage when making use of these fit_transform-
equivalent functions.
We do not have a strict policy about when to or when not to provide function forms of estimators, but maintainers
should consider consistency with existing interfaces, and whether providing a function would lead users astray
from best practices (as regards data leakage, etc.)
narrative documentation An alias for User Guide, i.e. documentation written in doc/modules/. Unlike the API
reference provided through docstrings, the User Guide aims to:
• group tools provided by Scikit-learn together thematically or in terms of usage;
• motivate why someone would use each particular tool, often through comparison;
• provide both intuitive and technical descriptions of tools;
• provide or link to examples of using key features of a tool.
np A shorthand for Numpy due to the conventional import statement:
import numpy as np
online learning Where a model is iteratively updated by receiving each batch of ground truth targets soon after
making predictions on corresponding batch of data. Intrinsically, the model must be usable for prediction after
each batch. See partial_fit.
out-of-core An efficiency strategy where not all the data is stored in main memory at once, usually by performing
learning on batches of data. See partial_fit.
outputs Individual scalar/categorical variables per sample in the target. For example, in multilabel classification each
possible label corresponds to a binary output. Also called responses, tasks or targets. See multiclass multioutput
and continuous multioutput.
pair A tuple of length two.
parameter
parameters
param
params We mostly use parameter to refer to the aspects of an estimator that can be specified in its construction. For
example, max_depth and random_state are parameters of RandomForestClassifier. Parameters
to an estimator’s constructor are stored unmodified as attributes on the estimator instance, and conventionally
start with an alphabetic character and end with an alphanumeric character. Each estimator’s constructor param-
eters are described in the estimator’s docstring.
We do not use parameters in the statistical sense, where parameters are values that specify a model and can be
estimated from data. What we call parameters might be what statisticians call hyperparameters to the model:
aspects for configuring model structure that are often not directly learnt from data. However, our parameters
are also used to prescribe modeling operations that do not affect the learnt model, such as n_jobs for controlling
parallelism.
When talking about the parameters of a meta-estimator, we may also be including the parameters of
the estimators wrapped by the meta-estimator. Ordinarily, these nested parameters are denoted by using
a double underscore (__) to separate between the estimator-as-parameter and its parameter. Thus clf
= BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=3)) has
a deep parameter base_estimator__max_depth with value 3, which is accessible with clf.
base_estimator.max_depth or clf.get_params()['base_estimator__max_depth'].
The list of parameters and their current values can be retrieved from an estimator instance using its get_params
method.
Between construction and fitting, parameters may be modified using set_params. To enable this, parameters are
not ordinarily validated or altered when the estimator is constructed, or when each parameter is set. Parameter
validation is performed when fit is called.
Common parameters are listed below.
pairwise metric
pairwise metrics In its broad sense, a pairwise metric defines a function for measuring similarity or dissimilarity
between two samples (with each ordinarily represented as a feature vector). We particularly provide im-
plementations of distance metrics (as well as improper metrics like Cosine Distance) through metrics.
pairwise_distances, and of kernel functions (a constrained class of similarity functions) in metrics.
pairwise_kernels. These can compute pairwise distance matrices that are symmetric and hence store data
redundantly.
See also precomputed and metric.
Note that for most distance metrics, we rely on implementations from scipy.spatial.distance, but may
reimplement for efficiency in our context. The neighbors module also duplicates some metric implementa-
tions for integration with efficient binary tree search data structures.
pd A shorthand for Pandas due to the conventional import statement:
import pandas as pd
precomputed Where algorithms rely on pairwise metrics, and can be computed from pairwise metrics alone, we
often allow the user to specify that the X provided is already in the pairwise (dis)similarity space, rather than
in a feature space. That is, when passed to fit, it is a square, symmetric matrix, with each vector indicating
(dis)similarity to every sample, and when passed to prediction/transformation methods, each row corresponds
to a testing sample and each column to a training sample.
Use of precomputed X is usually indicated by setting a metric, affinity or kernel parameter to the
string ‘precomputed’. An estimator should mark itself as being _pairwise if this is the case.
rectangular Data that can be represented as a matrix with samples on the first axis and a fixed, finite set of features
on the second is called rectangular.
This term excludes samples with non-vectorial structure, such as text, an image of arbitrary size, a time series of
arbitrary length, a set of vectors, etc. The purpose of a vectorizer is to produce rectangular forms of such data.
sample
samples We usually use this term as a noun to indicate a single feature vector. Elsewhere a sample is called an
instance, data point, or observation. n_samples indicates the number of samples in a dataset, being the
number of rows in a data array X.
sample property
sample properties A sample property is data for each sample (e.g. an array of length n_samples) passed to an
estimator method or a similar function, alongside but distinct from the features (X) and target (y). The most
prominent example is sample_weight; see others at Data and sample properties.
As of version 0.19 we do not have a consistent approach to handling sample properties and their routing in
meta-estimators, though a fit_params parameter is often used.
scikit-learn-contrib A venue for publishing Scikit-learn-compatible libraries that are broadly authorized by the core developers and the contrib community, but not maintained by the core developer team. See https://scikit-learn-contrib.github.io.
scikit-learn enhancement proposals
SLEP
SLEPs Changes to the API principles and changes to dependencies or supported versions happen via a SLEP and follow the decision-making process outlined in Scikit-learn governance and decision-making. For all votes,
a proposal must have been made public and discussed before the vote. Such proposal must be a consolidated
document, in the form of a ‘Scikit-Learn Enhancement Proposal’ (SLEP), rather than a long discussion on an
issue. A SLEP must be submitted as a pull-request to enhancement proposals using the SLEP template.
semi-supervised
semi-supervised learning
semisupervised Learning where the expected prediction (label or ground truth) is only available for some samples
provided as training data when fitting the model. We conventionally apply the label -1 to unlabeled samples in
semi-supervised classification.
sparse matrix
sparse graph A representation of two-dimensional numeric data that is more memory efficient than the corresponding dense numpy array where almost all elements are zero. We use the scipy.sparse framework, which provides
several underlying sparse data representations, or formats. Some formats are more efficient than others for
particular tasks, and when a particular format provides especial benefit, we try to document this fact in Scikit-
learn parameter descriptions.
Some sparse matrix formats (notably CSR, CSC, COO and LIL) distinguish between implicit and explicit zeros.
Explicit zeros are stored (i.e. they consume memory in a data array) in the data structure, while implicit zeros
correspond to every element not otherwise defined in explicit storage.
Two semantics for sparse matrices are used in Scikit-learn:
matrix semantics The sparse matrix is interpreted as an array with implicit and explicit zeros being interpreted
as the number 0. This is the interpretation most often adopted, e.g. when sparse matrices are used for
feature matrices or multilabel indicator matrices.
graph semantics As with scipy.sparse.csgraph, explicit zeros are interpreted as the number 0, but
implicit zeros indicate a masked or absent value, such as the absence of an edge between two ver-
tices of a graph, where an explicit value indicates an edge’s weight. This interpretation is adopted to
represent connectivity in clustering, in representations of nearest neighborhoods (e.g. neighbors.
kneighbors_graph), and for precomputed distance representation where only distances in the neigh-
borhood of each point are required.
When working with sparse matrices, we assume that it is sparse for a good reason, and avoid writing code that
densifies a user-provided sparse matrix, instead maintaining sparsity or raising an error if not possible (i.e. if an
estimator does not / cannot support sparse matrices).
supervised
supervised learning Learning where the expected prediction (label or ground truth) is available for each sample when
fitting the model, provided as y. This is the approach taken in a classifier or regressor among other estimators.
target
targets The dependent variable in supervised (and semisupervised) learning, passed as y to an estimator’s fit method.
Also known as dependent variable, outcome variable, response variable, ground truth or label. Scikit-learn
works with targets that have minimal structure: a class from a finite set, a finite real-valued number, multiple
classes, or multiple numbers. See Target Types.
transduction
transductive A transductive (contrasted with inductive) machine learning method is designed to model a specific
dataset, but not to apply that model to unseen data. Examples include manifold.TSNE, cluster.
AgglomerativeClustering and neighbors.LocalOutlierFactor.
unlabeled
unlabeled data Samples with an unknown ground truth when fitting; equivalently, missing values in the target. See
also semisupervised and unsupervised learning.
unsupervised
unsupervised learning Learning where the expected prediction (label or ground truth) is not available for each sam-
ple when fitting the model, as in clusterers and outlier detectors. Unsupervised estimators ignore any y passed
to fit.
classifier
classifiers A supervised (or semi-supervised) predictor with a finite set of discrete possible output values.
A classifier supports modeling some of binary, multiclass, multilabel, or multiclass multioutput targets. Within
scikit-learn, all classifiers support multi-class classification, defaulting to using a one-vs-rest strategy over the
binary classification problem.
Classifiers must store a classes_ attribute after fitting, and usually inherit from base.ClassifierMixin,
which sets their _estimator_type attribute.
A classifier can be distinguished from other estimators with is_classifier.
A classifier must implement:
• fit
• predict
• score
It may also be appropriate to implement decision_function, predict_proba and predict_log_proba.
clusterer
clusterers An unsupervised predictor with a finite set of discrete output values.
A clusterer usually stores labels_ after fitting, and must do so if it is transductive.
A clusterer must implement:
• fit
• fit_predict if transductive
• predict if inductive
density estimator TODO
estimator
estimators An object which manages the estimation and decoding of a model. The model is estimated as a determin-
istic function of:
• parameters provided in object construction or with set_params;
• the global numpy.random random state if the estimator’s random_state parameter is set to None; and
• any data or sample properties passed to the most recent call to fit, fit_transform or fit_predict, or data
similarly passed in a sequence of calls to partial_fit.
The estimated model is stored in public and private attributes on the estimator instance, facilitating decoding
through prediction and transformation methods.
Estimators must provide a fit method, and should provide set_params and get_params, although these are usually
provided by inheritance from base.BaseEstimator.
The core functionality of some estimators may also be available as a function.
feature extractor
feature extractors A transformer which takes input where each sample is not represented as an array-like object of
fixed length, and produces an array-like object of features for each sample (and thus a 2-dimensional array-like
for a set of samples). In other words, it (lossily) maps a non-rectangular data representation into rectangular
data.
binary A classification problem consisting of two classes. A binary target may be represented as for a multiclass problem but with only two labels. A binary decision function is represented as a 1d array.
Semantically, one class is often considered the “positive” class. Unless otherwise specified (e.g. using pos_label
in evaluation metrics), we consider the class label with the greater value (numerically or lexicographically) as
the positive class: of labels [0, 1], 1 is the positive class; of [1, 2], 2 is the positive class; of [‘no’, ‘yes’], ‘yes’
is the positive class; of [‘no’, ‘YES’], ‘no’ is the positive class. This affects the output of decision_function, for
instance.
Note that a dataset sampled from a multiclass y or a continuous y may appear to be binary.
type_of_target will return ‘binary’ for binary input, or a similar array with only a single class present.
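A brief illustration of these conventions, with the expected strings shown as comments (they follow from the definitions above):

from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 1, 1, 0]))          # 'binary'
print(type_of_target(['no', 'yes', 'no']))   # 'binary'; 'yes' is the positive class
print(type_of_target([1, 1, 1]))             # 'binary' (single class present)
print(type_of_target([0, 1, 2, 2]))          # 'multiclass'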
continuous A regression problem where each sample’s target is a finite floating point number, represented as a 1-
dimensional array of floats (or sometimes ints).
type_of_target will return ‘continuous’ for continuous input, but if the data is all integers, it will be
identified as ‘multiclass’.
continuous multioutput
multioutput continuous A regression problem where each sample’s target consists of n_outputs outputs, each
one a finite floating point number, for a fixed int n_outputs > 1 in a particular dataset.
Continuous multioutput targets are represented as multiple continuous targets, horizontally stacked into an array
of shape (n_samples, n_outputs).
type_of_target will return ‘continuous-multioutput’ for continuous multioutput input, but if the data is all
integers, it will be identified as ‘multiclass-multioutput’.
multiclass A classification problem consisting of more than two classes. A multiclass target may be represented as
a 1-dimensional array of strings or integers. A 2d column vector of integers (i.e. a single output in multioutput
terms) is also accepted.
We do not officially support other orderable, hashable objects as class labels, even if estimators may happen to
work when given classification targets of such type.
For semi-supervised classification, unlabeled samples should have the special label -1 in y.
Within scikit-learn, all estimators supporting binary classification also support multiclass classification, using
One-vs-Rest by default.
A preprocessing.LabelEncoder helps to canonicalize multiclass targets as integers.
type_of_target will return ‘multiclass’ for multiclass input. The user may also want to handle ‘binary’
input identically to ‘multiclass’.
multiclass multioutput
multioutput multiclass A classification problem where each sample’s target consists of n_outputs outputs, each
a class label, for a fixed int n_outputs > 1 in a particular dataset. Each output has a fixed set of available
classes, and each sample is labelled with a class for each output. An output may be binary or multiclass, and in
the case where all outputs are binary, the target is multilabel.
Multiclass multioutput targets are represented as multiple multiclass targets, horizontally stacked into an array
of shape (n_samples, n_outputs).
XXX: For simplicity, we may not always support string class labels for multiclass multioutput, and integer class
labels should be used.
multioutput provides estimators which estimate multi-output problems using multiple single-output estima-
tors. This may not fully account for dependencies among the different outputs, which methods natively handling
the multioutput case (e.g. decision trees, nearest neighbors, neural networks) may do better.
type_of_target will return ‘multiclass-multioutput’ for multiclass multioutput input.
multilabel A multiclass multioutput target where each output is binary. This may be represented as a 2d (dense) array
or sparse matrix of integers, such that each column is a separate binary target, where positive labels are indicated
with 1 and negative labels are usually -1 or 0. Sparse multilabel targets are not supported everywhere that dense
multilabel targets are supported.
Semantically, a multilabel target can be thought of as a set of labels for each sample. While not used inter-
nally, preprocessing.MultiLabelBinarizer is provided as a utility to convert from a list of sets
representation to a 2d array or sparse matrix. One-hot encoding a multiclass target with preprocessing.
LabelBinarizer turns it into a multilabel problem.
type_of_target will return ‘multilabel-indicator’ for multilabel input, whether sparse or dense.
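A small sketch of the list-of-sets conversion mentioned above (the label names are arbitrary):

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

y = [{'action', 'comedy'}, {'drama'}, {'action', 'drama'}]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)
print(mlb.classes_)        # ['action' 'comedy' 'drama']
print(Y)                   # [[1 1 0]
                           #  [0 0 1]
                           #  [1 0 1]]
print(type_of_target(Y))   # 'multilabel-indicator'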
multioutput
multi-output A target where each sample has multiple classification/regression labels. See multiclass multioutput
and continuous multioutput. We do not currently support modelling mixed classification and regression targets.
5.4 Methods
decision_function In a fitted classifier or outlier detector, predicts a “soft” score for each sample in relation
to each class, rather than the “hard” categorical prediction produced by predict. Its input is usually only some
observed data, X.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
Output conventions:
binary classification A 1-dimensional array, where values strictly greater than zero indicate the positive class (i.e. the last class in classes_); see the sketch after this list.
multiclass classification A 2-dimensional array, where the row-wise arg-maximum is the predicted class.
Columns are ordered according to classes_.
multilabel classification Scikit-learn is inconsistent in its representation of multilabel decision functions.
Some estimators represent it like multiclass multioutput, i.e. a list of 2d arrays, each with two columns.
Others represent it with a single 2d array, whose columns correspond to the individual binary classification
decisions. The latter representation is ambiguously identical to the multiclass classification format, though
its semantics differ: it should be interpreted, like in the binary case, by thresholding at 0.
TODO: This gist highlights the use of the different formats for multilabel.
multioutput classification A list of 2d arrays, corresponding to each multiclass decision function.
outlier detection A 1-dimensional array, where a value greater than or equal to zero indicates an inlier.
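A short sketch of the binary convention above (the dataset and estimator are arbitrary choices for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X[:5])    # shape (5,): one soft score per sample
preds = clf.predict(X[:5])
# Scores strictly greater than zero correspond to the positive class,
# i.e. clf.classes_[1]; the rest map to clf.classes_[0].
assert np.array_equal(preds,
                      np.where(scores > 0, clf.classes_[1], clf.classes_[0]))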
fit The fit method is provided on every estimator. It usually takes some samples X, targets y if the model is
supervised, and potentially other sample properties such as sample_weight. It should:
• clear any prior attributes stored on the estimator, unless warm_start is used;
• validate and interpret any parameters, ideally raising an error if invalid;
• validate the input data;
• estimate and store model attributes from the estimated parameters and provided data; and
• return the now fitted estimator to facilitate method chaining.
Target Types describes possible formats for y.
fit_predict Used especially for unsupervised, transductive estimators, this fits the model and returns the pre-
dictions (similar to predict) on the training data. In clusterers, these predictions are also stored in the labels_
attribute, and the output of .fit_predict(X) is usually equivalent to .fit(X).predict(X). The pa-
rameters to fit_predict are the same as those to fit.
fit_transform A method on transformers which fits the estimator and returns the transformed training data.
It takes parameters as in fit and its output should have the same shape as calling .fit(X, ...).
transform(X). There are nonetheless rare cases where .fit_transform(X, ...) and .fit(X, ...).transform(X) do not return the same value, wherein training data needs to be handled differently (due
to model blending in stacked ensembles, for instance; such cases should be clearly documented). Transductive
transformers may also provide fit_transform but not transform.
One reason to implement fit_transform is that performing fit and transform separately would be less
efficient than together. base.TransformerMixin provides a default implementation, providing a consis-
tent interface across transformers where fit_transform is or is not specialised.
In inductive learning – where the goal is to learn a generalised model that can be applied to new data – users
should be careful not to apply fit_transform to the entirety of a dataset (i.e. training and test data together)
before further modelling, as this results in data leakage.
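A hedged sketch of the safe pattern: wrapping the transformer in a Pipeline ensures fit_transform is only ever applied to the training portion of the data (the estimators chosen here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only; the held-out split is
# merely transformed when the pipeline predicts or scores.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))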
get_feature_names Primarily for feature extractors, but also used for other transformers to provide string names
for each column in the output of the estimator’s transform method. It outputs a list of strings, and may take a
list of strings as input, corresponding to the names of input columns from which output column names can be
generated. By default input features are named x0, x1, ....
get_n_splits On a CV splitter (not an estimator), returns the number of elements one would get if iterating
through the return value of split given the same parameters. Takes the same parameters as split.
get_params Gets all parameters, and their values, that can be set using set_params. A deep parameter can be set to False to return only those parameters not including __, i.e. those not due to indirection via contained estimators.
Most estimators adopt the definition from base.BaseEstimator, which simply adopts the parameters de-
fined for __init__. pipeline.Pipeline, among others, reimplements get_params to declare the
estimators named in its steps parameters as themselves being parameters.
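A short illustration of nested parameters as exposed by Pipeline (the step names here are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

print(sorted(pipe.get_params(deep=False)))   # only the Pipeline's own parameters
print('svc__C' in pipe.get_params())         # True: parameter of the contained SVC
pipe.set_params(svc__C=10.0)                 # reaches into the contained estimator
print(pipe.get_params()['svc__C'])           # 10.0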
partial_fit Facilitates fitting an estimator in an online fashion. Unlike fit, repeatedly calling partial_fit
does not clear the model, but updates it with respect to the data provided. The portion of data provided to
partial_fit may be called a mini-batch. Each mini-batch must be of consistent shape, etc. In iterative
estimators, partial_fit often only performs a single iteration.
partial_fit may also be used for out-of-core learning, although usually limited to the case where learning
can be performed online, i.e. the model is usable after each partial_fit and there is no separate processing
needed to finalize the model. cluster.Birch introduces the convention that calling partial_fit(X)
will produce a model that is not finalized, but the model can be finalized by calling partial_fit() i.e.
without passing a further mini-batch.
Generally, estimator parameters should not be modified between calls to partial_fit, although
partial_fit should validate them as well as the new mini-batch of data. In contrast, warm_start is
used to repeatedly fit the same estimator with the same data but varying parameters.
Like fit, partial_fit should return the estimator object.
To clear the model, a new estimator should be constructed, for instance with base.clone.
NOTE: Using partial_fit after fit results in undefined behavior.
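A minimal mini-batch sketch using linear_model.SGDClassifier (the batch sizes and data are arbitrary):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])      # all classes must be declared on the first call

for _ in range(5):              # five mini-batches of consistent shape
    X_batch = rng.rand(20, 3)
    y_batch = rng.randint(0, 2, size=20)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.rand(2, 3)))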
predict Makes a prediction for each sample, usually only taking X as input (but see under regressor output con-
ventions below). In a classifier or regressor, this prediction is in the same target space used in fitting (e.g. one
of {‘red’, ‘amber’, ‘green’} if the y in fitting consisted of these strings). Despite this, even when y passed to fit
is a list or other array-like, the output of predict should always be an array or sparse matrix. In a clusterer or
outlier detector the prediction is an integer.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
Output conventions:
classifier An array of shape (n_samples,) or (n_samples, n_outputs). Multilabel data may be rep-
resented as a sparse matrix if a sparse matrix was used in fitting. Each element should be one of the values
in the classifier’s classes_ attribute.
clusterer An array of shape (n_samples,) where each value is from 0 to n_clusters - 1 if the corre-
sponding sample is clustered, and -1 if the sample is not clustered, as in cluster.dbscan.
outlier detector An array of shape (n_samples,) where each value is -1 for an outlier and 1 otherwise.
regressor A numeric array of shape (n_samples,), usually float64. Some regressors have extra options in
their predict method, allowing them to return standard deviation (return_std=True) or covariance
(return_cov=True) relative to the predicted value. In this case, the return value is a tuple of arrays
corresponding to (prediction mean, std, cov) as required.
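A brief illustration of the return_std option using a Gaussian process regressor (the data are arbitrary):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(random_state=0).fit(X, y)
y_mean, y_std = gpr.predict(X[:3], return_std=True)
print(y_mean.shape, y_std.shape)    # (3,) (3,)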
predict_log_proba The natural logarithm of the output of predict_proba, provided to facilitate numerical sta-
bility.
predict_proba A method in classifiers and clusterers that are able to return probability estimates for each
class/cluster. Its input is usually only some observed data, X.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
Output conventions are like those for decision_function except in the binary classification case, where one
column is output for each class (while decision_function outputs a 1d array). For binary and multiclass
predictions, each row should add to 1.
Like other methods, predict_proba should only be present when the estimator can make probabilistic pre-
dictions (see duck typing). This means that the presence of the method may depend on estimator parameters (e.g.
in linear_model.SGDClassifier) or training data (e.g. in model_selection.GridSearchCV )
and may only appear after fitting.
score A method on an estimator, usually a predictor, which evaluates its predictions on a given dataset, and returns a
single numerical score. A greater return value should indicate better predictions; accuracy is used for classifiers
and R^2 for regressors by default.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
Some estimators implement a custom, estimator-specific score function, often the likelihood of the data under
the model.
score_samples TODO
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
set_params Available in any estimator, takes keyword arguments corresponding to keys in get_params. Each is
provided a new value to assign such that calling get_params after set_params will reflect the changed pa-
rameters. Most estimators use the implementation in base.BaseEstimator, which handles nested parame-
ters and otherwise sets the parameter as an attribute on the estimator. The method is overridden in pipeline.
Pipeline and related estimators.
split On a CV splitter (not an estimator), this method accepts parameters (X, y, groups), where all may be op-
tional, and returns an iterator over (train_idx, test_idx) pairs. Each of {train,test}_idx is a 1d integer
array, with values from 0 to X.shape[0] - 1, of any length, such that no values appear in both some
train_idx and its corresponding test_idx.
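An illustration of this contract with model_selection.KFold (the choice of splitter is arbitrary):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)
cv = KFold(n_splits=5)
print(cv.get_n_splits(X))                    # 5
for train_idx, test_idx in cv.split(X):
    # disjoint 1d integer arrays indexing rows of X
    assert not set(train_idx) & set(test_idx)
    print(train_idx, test_idx)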
transform In a transformer, transforms the input, usually only X, into some transformed space (conventionally
notated as Xt). Output is an array or sparse matrix of length n_samples and with number of columns fixed after
fitting.
If the estimator was not already fitted, calling this method should raise an exceptions.NotFittedError.
5.5 Parameters
These common parameter names, specifically used in estimator construction (see concept parameter), sometimes also
appear as parameters of functions or non-estimator constructors.
class_weight Used to specify sample weights when fitting classifiers as a function of the target class. Where
sample_weight is also supported and given, it is multiplied by the class_weight contribution. Similarly,
where class_weight is used in multioutput (including multilabel) tasks, the weights are multiplied across
outputs (i.e. columns of y).
By default all samples have equal weight such that classes are effectively weighted by their prevalence in
the training data. This could be achieved explicitly with class_weight={label1: 1, label2: 1,
...} for all class labels.
More generally, class_weight is specified as a dict mapping class labels to weights ({class_label:
weight}), such that each sample of the named class is given that weight.
class_weight='balanced' can be used to give all classes equal weight by giving each sample a
weight inversely related to its class’s prevalence in the training data: n_samples / (n_classes * np.bincount(y)). Class weights will be used differently depending on the algorithm: for linear models (such
as linear SVM or logistic regression), the class weights will alter the loss function by weighting the loss of each
sample by its class weight. For tree-based algorithms, the class weights will be used for reweighting the splitting
criterion. Note however that this rebalancing does not take the weight of samples in each class into account.
For multioutput classification, a list of dicts is used to specify weights for each output. For example, for four-
class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1,
1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The class_weight parameter is validated and interpreted with utils.compute_class_weight.
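A small sketch of the 'balanced' heuristic via utils.class_weight.compute_class_weight (the values in the comment follow from the formula above):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])    # class 0 is three times as frequent as class 1
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)                # approximately [0.67, 2.0]: n_samples / (n_classes * np.bincount(y))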
cv Determines a cross validation splitting strategy, as used in cross-validation based routines. cv
is also available in estimators such as multioutput.ClassifierChain or calibration.
CalibratedClassifierCV which use the predictions of one estimator as training data for another, to
not overfit the training supervision.
Possible inputs for cv are usually:
• An integer, specifying the number of folds in K-fold cross validation. K-fold will be stratified over classes
if the estimator is a classifier (determined by base.is_classifier) and the targets may represent a
binary or multiclass (but not multioutput) classification problem (determined by utils.multiclass.
type_of_target).
• A cross-validation splitter instance. Refer to the User Guide for splitters available within Scikit-learn.
• An iterable yielding train/test splits.
With some exceptions (especially where not using cross validation at all is an option), the default is 5-fold.
cv values are validated and interpreted with utils.check_cv.
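A hedged sketch of the usual ways to pass cv to a cross-validation routine (the dataset and estimator are arbitrary):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

scores_int = cross_val_score(clf, X, y, cv=5)                   # 5-fold, stratified for classifiers
scores_obj = cross_val_score(clf, X, y, cv=StratifiedKFold(3))  # an explicit splitter instance
print(scores_int.mean(), scores_obj.mean())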
kernel TODO
max_iter For estimators involving iterative optimization, this determines the maximum number of itera-
tions to be performed in fit. If max_iter iterations are run without convergence, an exceptions.ConvergenceWarning should be raised. Note that the interpretation of “a single iteration” is inconsistent
across estimators: some, but not all, use it to mean a single epoch (i.e. a pass over every sample in the data).
FIXME perhaps we should have some common tests about the relationship between ConvergenceWarning and
max_iter.
memory Some estimators make use of joblib.Memory to store partial solutions during fitting. Thus when fit is
called again, those partial solutions have been memoized and can be reused.
A memory parameter can be specified as a string with a path to a directory, or a joblib.Memory instance
(or an object with a similar interface, i.e. a cache method) can be used.
memory values are validated and interpreted with utils.validation.check_memory.
metric As a parameter, this is the scheme for determining the distance between two data points. See metrics.
pairwise_distances. In practice, for some algorithms, an improper distance metric (one that does not
obey the triangle inequality, such as Cosine Distance) may be used.
XXX: hierarchical clustering uses affinity with this meaning.
We also use metric to refer to evaluation metrics, but avoid using this sense as a parameter name.
n_components The number of features which a transformer should transform the input into. See components_ for
the special case of affine projection.
n_iter_no_change Number of iterations with no improvement to wait before stopping the iterative procedure.
This is also known as a patience parameter. It is typically used with early stopping to avoid stopping too early.
n_jobs This parameter is used to specify how many concurrent processes or threads should be used for routines that
are parallelized with joblib.
n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib
parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1,
(n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.
n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1, unless the
current joblib.Parallel backend context specifies otherwise.
For more details on the use of joblib and its interactions with scikit-learn, please refer to our parallelism
notes.
pos_label Value with which positive labels must be encoded in binary classification problems in which the positive
class is not assumed. This value is typically required to compute asymmetric evaluation metrics such as precision
and recall.
random_state Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may
be provided to control the random number generator used. Note that the mere presence of random_state
doesn’t mean that randomization is always used, as it may be dependent on another parameter, e.g. shuffle,
being set.
random_state’s value may be:
None (default) Use the global random state from numpy.random.
An integer Use a new random number generator seeded by the given integer. To make a randomized al-
gorithm deterministic (i.e. running it multiple times will produce the same result), an arbitrary integer
random_state can be used. However, it may be worthwhile checking that your results are stable across
a number of different distinct random seeds. Popular integer random seeds are 0 and 42.
A numpy.random.RandomState instance Use the provided random state, only affecting other users of
the same random state instance. Calling fit multiple times will reuse the same instance, and will produce
different results.
utils.check_random_state is used internally to validate the input random_state and return a
RandomState instance.
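A small illustration of the accepted values, passed through utils.check_random_state as described above:

import numpy as np
from sklearn.utils import check_random_state

print(check_random_state(None))           # the global numpy.random RandomState
print(check_random_state(42))             # a new RandomState seeded with 42
rng = np.random.RandomState(0)
print(check_random_state(rng) is rng)     # True: instances are passed through unchanged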
scoring Specifies the score function to be maximized (usually by cross validation), or – in some cases – multiple
score functions to be reported. The score function can be a string accepted by metrics.get_scorer or a
callable scorer, not to be confused with an evaluation metric, as the latter have a more diverse API. scoring
may also be set to None, in which case the estimator’s score method is used. See The scoring parameter:
defining model evaluation rules in the User Guide.
Where multiple metrics can be evaluated, scoring may be given either as a list of unique strings or a dict with
names as keys and callables as values. Note that this does not specify which score function is to be maximised,
and another parameter such as refit may be used for this purpose.
The scoring parameter is validated and interpreted using metrics.check_scoring.
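A sketch of single- and multi-metric scoring, with refit naming the metric that drives best_estimator_ (the estimator and grid are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10]}

# Single metric: any string accepted by metrics.get_scorer.
single = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=3).fit(X, y)

# Multiple metrics: refit selects which one to maximise.
multi = GridSearchCV(SVC(), param_grid, cv=3,
                     scoring=['accuracy', 'balanced_accuracy'],
                     refit='accuracy').fit(X, y)
print(single.best_params_, multi.best_params_)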
verbose Logging is not handled very consistently in Scikit-learn at present, but when it is provided as an option,
the verbose parameter is usually available to choose no logging (set to False). Any True value should enable
some logging, but larger integers (e.g. above 10) may be needed for full verbosity. Verbose logs are usually
printed to Standard Output. Estimators should not produce any output on Standard Output with the default
verbose setting.
warm_start When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as
to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model
learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model
attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter
values. For example, warm_start may be used when building random forests to add more trees to the forest
(increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the
data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and
model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For exam-
ple, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For
classification, all data in a sequence of warm_start calls to fit must include samples from each class.
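A sketch of the random forest use case mentioned above, where warm_start reuses the trees already grown when n_estimators is increased:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)
forest = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))    # 50

forest.set_params(n_estimators=100)
forest.fit(X, y)                  # only the 50 additional trees are grown
print(len(forest.estimators_))    # 100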
5.6 Attributes
labels_ A vector containing a cluster label for each sample of the training data in clusterers, identical to the output
of fit_predict. See also embedding_.
6 Examples
6.1 Miscellaneous examples
Default representation:
LogisticRegression(penalty='l1')
print(__doc__)

from sklearn import set_config
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1')
print('Default representation:')
print(lr)
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
# intercept_scaling=1, l1_ratio=None, max_iter=100,
# multi_class='auto', n_jobs=None, penalty='l1',
# random_state=None, solver='warn', tol=0.0001, verbose=0,
# warm_start=False)
set_config(print_changed_only=True)
print('\nWith changed_only option:')
print(lr)
# LogisticRegression(penalty='l1')
Scikit-learn defines a simple API for creating visualizations for machine learning. The key feature of this API is to allow for quick plotting and visual adjustments without recomputation. In this example, we will demonstrate how to use the visualization API by comparing ROC curves.
print(__doc__)
First, we load the wine dataset and convert it to a binary classification problem. Then, we train a support vector
classifier on a training dataset.
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
y = y == 2
Out:
SVC(random_state=42)
Next, we plot the ROC curve with a single call to sklearn.metrics.plot_roc_curve. The returned
svc_disp object allows us to continue using the already computed ROC curve for the SVC in future plots.
We train a random forest classifier and create a plot comparing it to the SVC ROC curve. Notice how svc_disp
uses plot to plot the SVC ROC curve without recomputing the values of the roc curve itself. Furthermore, we pass
alpha=0.8 to the plot functions to adjust the alpha values of the curves.
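A minimal sketch of the workflow just described; the train/test split and the variable names (svc, rfc, svc_disp, rfc_disp) are illustrative assumptions rather than a verbatim excerpt:

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
y = y == 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC(random_state=42).fit(X_train, y_train)
svc_disp = plot_roc_curve(svc, X_test, y_test)            # computes and plots the SVC ROC curve

rfc = RandomForestClassifier(random_state=42).fit(X_train, y_train)
ax = plt.gca()
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=ax, alpha=0.8)
svc_disp.plot(ax=ax, alpha=0.8)                           # reuses the stored SVC curve, no recomputation
plt.show()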
An illustration of the isotonic regression on generated data. The isotonic regression finds a non-decreasing approx-
imation of a function while minimizing the mean squared error on the training data. The benefit of such a model is
that it does not assume any form for the target function such as linearity. For comparison a linear regression is also
presented.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

from sklearn.linear_model import LinearRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.utils import check_random_state
n = 100
x = np.arange(n)
rs = check_random_state(0)
y = rs.randint(-50, 50, size=(n,)) + 50. * np.log1p(np.arange(n))
# #############################################################################
# Fit IsotonicRegression and LinearRegression models
ir = IsotonicRegression()
y_ = ir.fit_transform(x, y)
lr = LinearRegression()
lr.fit(x[:, np.newaxis], y) # x needs to be 2d for LinearRegression
# #############################################################################
# Plot result
fig = plt.figure()
plt.plot(x, y, 'r.', markersize=12)
plt.plot(x, y_, 'b.-', markersize=12)
plt.plot(x, lr.predict(x[:, np.newaxis]), 'b-')
# Construct the vertical segments joining the data and the isotonic fit
# (this construction of `lc` is reconstructed from the full example).
segments = [[[i, y[i]], [i, y_[i]]] for i in range(n)]
lc = LineCollection(segments, zorder=0)
lc.set_array(np.ones(len(y)))
lc.set_linewidths(np.full(n, 0.5))
plt.gca().add_collection(lc)
plt.legend(('Data', 'Isotonic Fit', 'Linear Fit'), loc='lower right')
plt.title('Isotonic regression')
plt.show()
print(__doc__)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import plot_partial_dependence
First, we train a decision tree and a multi-layer perceptron on the Boston housing price dataset.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
tree = DecisionTreeRegressor()
mlp = make_pipeline(StandardScaler(),
MLPRegressor(hidden_layer_sizes=(100, 100),
tol=1e-2, max_iter=500, random_state=0))
tree.fit(X, y)
mlp.fit(X, y)
Out:
Pipeline(steps=[('standardscaler', StandardScaler()),
('mlpregressor',
MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=500,
random_state=0, tol=0.01))])
We plot partial dependence curves for features “LSTAT” and “RM” for the decision tree. With two features, plot_partial_dependence expects to plot two curves. Here the plot function places a grid of two plots using the space defined by ax.
The partial dependence curves can be plotted for the multi-layer perceptron. In this case, line_kw is passed to plot_partial_dependence to change the color of the curve.
The tree_disp and mlp_disp PartialDependenceDisplay objects contain all the computed information
needed to recreate the partial dependence curves. This means we can easily create additional plots without needing to
recompute the curves.
One way to plot the curves is to place them in the same figure, with the curves of each model on each row. First, we create a figure with two axes within two rows and one column. The two axes are passed to the plot functions of tree_disp and mlp_disp. The given axes will be used by the plotting function to draw the partial dependence. The resulting plot places the decision tree partial dependence curves in the first row and the multi-layer perceptron curves in the second row, as sketched below.
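A sketch of that two-row layout, assuming the tree_disp and mlp_disp display objects computed earlier in this example:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
tree_disp.plot(ax=ax1, line_kw={"label": "Decision Tree"})
mlp_disp.plot(ax=ax2, line_kw={"label": "Multi-layer Perceptron", "color": "red"})
plt.show()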
Another way to compare the curves is to plot them on top of each other. Here, we create a figure with one row and two
columns. The axes are passed into the plot function as a list, which will plot the partial dependence curves of each
model on the same axes. The length of the axes list must be equal to the number of plots drawn.
tree_disp.axes_ is a numpy array containing the axes used to draw the partial dependence plots. This can be passed to mlp_disp to have the same effect of drawing the plots on top of each other. Furthermore, mlp_disp.figure_ stores the figure, which allows for resizing the figure after calling plot. In this case tree_disp.axes_ has two dimensions, thus plot will only show the y label and y ticks on the leftmost plot.
Here, we plot the partial dependence curves for a single feature, “LSTAT”, on the same axes. In this case,
tree_disp.axes_ is passed into the second plot function.
This example shows the use of multi-output estimators to complete images. The goal is to predict the lower half of a
face given its upper half.
The first column of images shows true faces. The next columns illustrate how extremely randomized trees, k nearest
neighbors, linear regression and ridge regression complete the lower half of those faces.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, RidgeCV

# (Fetching the Olivetti faces and splitting them into the `train` and `test`
# arrays used below is omitted from this excerpt.)
n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, :(n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2:]
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]
# Fit estimators
ESTIMATORS = {
"Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
random_state=0),
"K-nn": KNeighborsRegressor(),
"Linear regression": LinearRegression(),
"Ridge": RidgeCV(),
}
y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
estimator.fit(X_train, y_train)
y_test_predict[name] = estimator.predict(X_test)
n_cols = 1 + len(ESTIMATORS)
# (n_faces and image_shape are defined in the omitted data-loading step.)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)
for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))
    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")
    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")
    # (inner loop over the fitted estimators, reconstructed from the full example)
    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))
        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)
        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)
        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")
plt.show()
This example simulates a multi-label document classification problem. The dataset is generated randomly based on
the following process:
• pick the number of labels: n ~ Poisson(n_labels)
• n times, choose a class c: c ~ Multinomial(theta)
• pick the document length: k ~ Poisson(length)
• k times, choose a word: w ~ Multinomial(theta_c)
In the above process, rejection sampling is used to make sure that n is more than 2, and that the document length is
never zero. Likewise, we reject classes which have already been chosen. The documents that are assigned to both
classes are plotted surrounded by two colored circles.
The classification is performed by projecting to the first two principal components found by PCA and CCA for visual-
isation purposes, followed by using the sklearn.multiclass.OneVsRestClassifier metaclassifier using
two SVCs with linear kernels to learn a discriminative model for each class. Note that PCA is used to perform an
unsupervised dimensionality reduction, while CCA is used to perform a supervised one.
Note: in the plot, “unlabeled samples” does not mean that we don’t know the labels (as in semi-supervised learning)
but that the samples simply do not have a label.
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# (Excerpt from the plotting helper of the full example; X, Y, subplot and
# title are arguments of that helper.)
classif = OneVsRestClassifier(SVC(kernel='linear'))
classif.fit(X, Y)
plt.subplot(2, 2, subplot)
plt.title(title)
plt.figure(figsize=(8, 6))
X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
allow_unlabeled=True,
random_state=1)
X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
allow_unlabeled=False,
random_state=1)
6.1.7 Comparing anomaly detection algorithms for outlier detection on toy datasets
This example shows characteristics of different anomaly detection algorithms on 2D datasets. Datasets contain one or
two modes (regions of high density) to illustrate the ability of algorithms to cope with multimodal data.
For each dataset, 15% of samples are generated as random uniform noise. This proportion is the value given to the nu
parameter of the OneClassSVM and the contamination parameter of the other outlier detection algorithms. Decision
boundaries between inliers and outliers are displayed in black except for Local Outlier Factor (LOF) as it has no
predict method to be applied on new data when it is used for outlier detection.
The sklearn.svm.OneClassSVM is known to be sensitive to outliers and thus does not perform very well for
outlier detection. This estimator is best suited for novelty detection when the training set is not contaminated by
outliers. That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying
data is very challenging, and a One-class SVM might give useful results in these situations depending on the value of
its hyperparameters.
sklearn.covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. It thus de-
grades when the data is not unimodal. Notice however that this estimator is robust to outliers.
sklearn.ensemble.IsolationForest and sklearn.neighbors.LocalOutlierFactor seem
to perform reasonably well for multi-modal data sets. The advantage of sklearn.neighbors.
LocalOutlierFactor over the other estimators is shown for the third data set, where the two modes have different
densities. This advantage is explained by the local aspect of LOF, meaning that it only compares the score of abnor-
mality of one sample with the scores of its neighbors.
Finally, for the last data set, it is hard to say that one sample is more abnormal than another sample as they are
uniformly distributed in a hypercube. Except for the sklearn.svm.OneClassSVM which overfits a little, all
estimators present decent solutions for this situation. In such a case, it would be wise to look more closely at the
scores of abnormality of the samples as a good estimator should assign similar scores to all the samples.
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional
data.
Finally, note that the parameters of the models have been hand-picked here, but in practice they need to be adjusted. In the absence of labelled data, the problem is completely unsupervised, so model selection can be a challenge.
import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs, make_moons

print(__doc__)

matplotlib.rcParams['contour.negative_linestyle'] = 'solid'
# Example settings
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
# Define datasets
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5,
**blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5],
**blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, .3],
**blobs_params)[0],
4. * (make_moons(n_samples=n_samples, noise=.05, random_state=0)[0] -
np.array([0.5, 0.25])),
14. * (np.random.RandomState(42).rand(n_samples, 2) - 0.5)]
# (The `anomaly_algorithms` list and the fit/predict loop over the datasets
# are omitted from this excerpt.)
plt.figure(figsize=(len(anomaly_algorithms) * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)
plot_num = 1
rng = np.random.RandomState(42)
plt.xlim(-7, 7)
plt.ylim(-7, 7)
plt.xticks(())
plt.yticks(())
plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
transform=plt.gca().transAxes, size=15,
horizontalalignment='right')
plot_num += 1
plt.show()
The Johnson-Lindenstrauss lemma states that any high dimensional dataset can be randomly projected into a lower
dimensional Euclidean space while controlling the distortion in the pairwise distances.
print(__doc__)

import sys
from time import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from distutils.version import LooseVersion

from sklearn.random_projection import johnson_lindenstrauss_min_dim
from sklearn.random_projection import SparseRandomProjection
from sklearn.datasets import fetch_20newsgroups_vectorized, load_digits
from sklearn.metrics.pairwise import euclidean_distances
Theoretical bounds
The distortion introduced by a random projection p is asserted by the fact that p defines an eps-embedding with good probability, as defined by:

(1 - eps) * ||u - v||^2 < ||p(u) - p(v)||^2 < (1 + eps) * ||u - v||^2

where u and v are any rows taken from a dataset of shape (n_samples, n_features) and p is a projection by a random Gaussian N(0, 1) matrix of shape (n_components, n_features) (or a sparse Achlioptas matrix).

The minimum number of components required to guarantee the eps-embedding is given by:

n_components >= 4 * log(n_samples) / (eps^2 / 2 - eps^3 / 3)

The first plot shows that with an increasing number of samples n_samples, the minimal number of dimensions n_components increases logarithmically in order to guarantee an eps-embedding.
# (eps_range, colors and n_samples_range are defined on the omitted page of
# the full example: a handful of eps values, matching plot colors and a
# log-spaced range of sample counts.)
plt.figure()
for eps, color in zip(eps_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples_range, eps=eps)
    plt.loglog(n_samples_range, min_n_components, color=color)
The second plot shows that increasing the admissible distortion eps allows one to drastically reduce the minimal number of dimensions n_components for a given number of samples n_samples.

plt.figure()
for n_samples, color in zip(n_samples_range, colors):
    min_n_components = johnson_lindenstrauss_min_dim(n_samples, eps=eps_range)
    plt.semilogy(eps_range, min_n_components, color=color)
Empirical validation
We validate the above bounds on the 20 newsgroups text document (TF-IDF word frequencies) dataset or on the digits
dataset:
• for the 20 newsgroups dataset some 500 documents with 100k features in total are projected using a
sparse random matrix to smaller euclidean spaces with various values for the target number of dimensions
n_components.
• for the digits dataset, some 8x8 gray level pixels data for 500 handwritten digits pictures are randomly projected
to spaces for various larger number of dimensions n_components.
The default dataset is the 20 newsgroups dataset. To run the example on the digits dataset, pass the
--use-digits-dataset command line argument to this script.
if '--use-digits-dataset' in sys.argv:
    data = load_digits().data[:500]
else:
    data = fetch_20newsgroups_vectorized().data[:500]

# (The following lines come from a loop over several values of n_components;
# the projection itself and the `dists` / `nonzero` variables are defined in
# the omitted portion of the full example.)
projected_dists = euclidean_distances(
    projected_data, squared=True).ravel()[nonzero]
plt.figure()
min_dist = min(projected_dists.min(), dists.min())
max_dist = max(projected_dists.max(), dists.max())
plt.hexbin(dists, projected_dists, gridsize=100, cmap=plt.cm.PuBu,
extent=[min_dist, max_dist, min_dist, max_dist])
plt.xlabel("Pairwise squared distances in original space")
plt.ylabel("Pairwise squared distances in projected space")
plt.title("Pairwise distances distribution for n_components=%d" %
n_components)
cb = plt.colorbar()
cb.set_label('Sample pairs counts')
plt.figure()
plt.hist(rates, bins=50, range=(0., 2.), edgecolor='k', **density_param)
plt.xlabel("Squared distances rate: projected / original")
plt.ylabel("Distribution of samples pairs")
plt.title("Histogram of pairwise distance rates for n_components=%d" %
n_components)
# TODO: compute the expected value of eps and add them to the previous plot
# as vertical lines / region
plt.show()
Out:
Embedding 500 samples with dim 130107 using various random projections
Projected 500 samples from 130107 to 300 in 0.732s
Random matrix with size: 1.297MB
Mean distances rate: 1.01 (0.18)
Projected 500 samples from 130107 to 1000 in 2.619s
Random matrix with size: 4.325MB
Mean distances rate: 0.92 (0.10)
Projected 500 samples from 130107 to 10000 in 25.322s
Random matrix with size: 43.289MB
Mean distances rate: 0.98 (0.03)
We can see that for low values of n_components the distribution is wide, with many distorted pairs and a skewed distribution (due to the hard limit of zero ratio on the left, as distances are always positive), while for larger values of n_components the distortion is controlled and the distances are well preserved by the random projection.
Remarks
According to the JL lemma, projecting 500 samples without too much distortion will require at least several thousand dimensions, irrespective of the number of features of the original dataset.
Hence using random projections on the digits dataset, which only has 64 features in the input space, does not make sense: it does not allow for dimensionality reduction in this case.
On the twenty newsgroups dataset, on the other hand, the dimensionality can be decreased from 56436 down to 10000 while reasonably preserving pairwise distances.
Both kernel ridge regression (KRR) and SVR learn a non-linear function by employing the kernel trick, i.e., they
learn a linear function in the space induced by the respective kernel which corresponds to a non-linear function in the
original space. They differ in the loss functions (ridge versus epsilon-insensitive loss). In contrast to SVR, fitting a
KRR can be done in closed-form and is typically faster for medium-sized datasets. On the other hand, the learned
model is non-sparse and thus slower than SVR at prediction-time.
This example illustrates both methods on an artificial dataset, which consists of a sinusoidal target function and strong
noise added to every fifth datapoint. The first figure compares the learned model of KRR and SVR when both com-
plexity/regularization and bandwidth of the RBF kernel are optimized using grid-search. The learned functions are
very similar; however, fitting KRR is approx. seven times faster than fitting SVR (both with grid-search). However, prediction of 100000 target values is more than three times faster with SVR since it has learned a sparse model using only approx. 1/3 of the 100 training datapoints as support vectors.
The next figure compares the time for fitting and prediction of KRR and SVR for different sizes of the training set.
Fitting KRR is faster than SVR for medium-sized training sets (less than 1000 samples); however, for larger training
sets SVR scales better. With regard to prediction time, SVR is faster than KRR for all sizes of the training set because
of the learned sparse solution. Note that the degree of sparsity and thus the prediction time depends on the parameters
epsilon and C of the SVR.
Out:
SVR complexity and bandwidth selected and model fitted in 0.412 s
KRR complexity and bandwidth selected and model fitted in 0.147 s
Support vector ratio: 0.320
SVR prediction for 100000 inputs in 0.074 s
KRR prediction for 100000 inputs in 0.085 s
import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)

# #############################################################################
# Generate sample data
X = 5 * rng.rand(10000, 1)
y = np.sin(X).ravel()

# Add noise to every fifth target and define the prediction grid
# (both reconstructed from the omitted portion of the full example).
y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))
X_plot = np.linspace(0, 5, 100000)[:, None]
# #############################################################################
# Fit regression model
train_size = 100
svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1),
param_grid={"C": [1e0, 1e1, 1e2, 1e3],
"gamma": np.logspace(-2, 2, 5)})
kr = GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1),
param_grid={"alpha": [1e0, 0.1, 1e-2, 1e-3],
"gamma": np.logspace(-2, 2, 5)})
t0 = time.time()
svr.fit(X[:train_size], y[:train_size])
svr_fit = time.time() - t0
print("SVR complexity and bandwidth selected and model fitted in %.3f s"
% svr_fit)
t0 = time.time()
kr.fit(X[:train_size], y[:train_size])
kr_fit = time.time() - t0
print("KRR complexity and bandwidth selected and model fitted in %.3f s"
% kr_fit)
t0 = time.time()
y_svr = svr.predict(X_plot)
svr_predict = time.time() - t0
print("SVR prediction for %d inputs in %.3f s"
% (X_plot.shape[0], svr_predict))
t0 = time.time()
y_kr = kr.predict(X_plot)
kr_predict = time.time() - t0
print("KRR prediction for %d inputs in %.3f s"
% (X_plot.shape[0], kr_predict))
# #############################################################################
# Look at the results
sv_ind = svr.best_estimator_.support_
plt.scatter(X[sv_ind], y[sv_ind], c='r', s=50, label='SVR support vectors',
zorder=2, edgecolors=(0, 0, 0))
plt.scatter(X[:100], y[:100], c='k', label='data', zorder=1,
            edgecolors=(0, 0, 0))
# (Excerpt from the timing loop over training-set sizes in the full example;
# `estimator` and `test_time` are defined in the omitted code.)
t0 = time.time()
estimator.predict(X_plot[:1000])
test_time.append(time.time() - t0)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Train size")
plt.ylabel("Time (seconds)")
plt.title('Execution Time')
plt.legend(loc="best")
plt.show()
print(__doc__)
To apply a classifier on this data, we need to flatten the images, to turn the data into a (samples, features) matrix:
n_samples = len(digits.data)
data = digits.data / 16.
data -= data.mean(axis=0)
kernel_svm_time = time()
kernel_svm.fit(data_train, targets_train)
kernel_svm_score = kernel_svm.score(data_test, targets_test)
kernel_svm_time = time() - kernel_svm_time
linear_svm_time = time()
linear_svm.fit(data_train, targets_train)
linear_svm_score = linear_svm.score(data_test, targets_test)
linear_svm_time = time() - linear_svm_time
for D in sample_sizes:
fourier_approx_svm.set_params(feature_map__n_components=D)
(continues on next page)
start = time()
fourier_approx_svm.fit(data_train, targets_train)
fourier_times.append(time() - start)
accuracy.plot([sample_sizes[0], sample_sizes[-1]],
[kernel_svm_score, kernel_svm_score], label="rbf svm")
timescale.plot([sample_sizes[0], sample_sizes[-1]],
[kernel_svm_time, kernel_svm_time], '--', label='rbf svm')
The second plot visualizes the decision surfaces of the RBF kernel SVM and the linear SVM with approximate kernel maps. The plot shows decision surfaces of the classifiers projected onto the first two principal components of the data. This visualization should be taken with a grain of salt since it is just an interesting slice through the decision surface in 64 dimensions. In particular, note that a datapoint (represented as a dot) is not necessarily classified into the region it lies in, since it will not lie on the plane that the first two principal components span. The usage of RBFSampler and Nystroem is described in detail in Kernel Approximation.
# visualize the decision surface, projected down to the first
# two principal components of the dataset
pca = PCA(n_components=8).fit(data_train)
X = pca.transform(data_train)
plt.figure(figsize=(18, 7.5))
plt.rcParams.update({'font.size': 14})
# predict and plot
for i, clf in enumerate((kernel_svm, nystroem_approx_svm,
fourier_approx_svm)):
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
plt.subplot(1, 3, i + 1)
Z = clf.predict(flat_grid)
plt.title(titles[i])
plt.tight_layout()
plt.show()
6.2 Biclustering
This example demonstrates how to generate a dataset and bicluster it using the Spectral Co-Clustering algorithm.
The dataset is generated using the make_biclusters function, which creates a matrix of small values and implants biclusters with large values. The rows and columns are then shuffled and passed to the Spectral Co-Clustering
algorithm. Rearranging the shuffled matrix to make biclusters contiguous shows how accurately the algorithm found
the biclusters.
print(__doc__)

import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# (Generating the data and fitting the `model` used below are omitted from
# this excerpt.)
# shuffle clusters
rng = np.random.RandomState(0)
row_idx = rng.permutation(data.shape[0])
col_idx = rng.permutation(data.shape[1])
data = data[row_idx][:, col_idx]
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Shuffled dataset")
fit_data = data[np.argsort(model.row_labels_)]
fit_data = fit_data[:, np.argsort(model.column_labels_)]
plt.matshow(fit_data, cmap=plt.cm.Blues)
plt.title("After biclustering; rearranged to show biclusters")
plt.show()
This example demonstrates how to generate a checkerboard dataset and bicluster it using the Spectral Biclustering
algorithm.
The data is generated with the make_checkerboard function, then shuffled and passed to the Spectral Biclustering
algorithm. The rows and columns of the shuffled matrix are rearranged to show the biclusters found by the algorithm.
The outer product of the row and column label vectors shows a representation of the checkerboard structure.
print(__doc__)

import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_checkerboard

n_clusters = (4, 3)
data, rows, columns = make_checkerboard(
    shape=(300, 300), n_clusters=n_clusters, noise=10,
    shuffle=False, random_state=0)  # trailing arguments reconstructed from the full example
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Original dataset")
# shuffle clusters
rng = np.random.RandomState(0)
row_idx = rng.permutation(data.shape[0])
col_idx = rng.permutation(data.shape[1])
data = data[row_idx][:, col_idx]
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Shuffled dataset")
# (Fitting the SpectralBiclustering `model` is omitted from this excerpt.)
fit_data = data[np.argsort(model.row_labels_)]
fit_data = fit_data[:, np.argsort(model.column_labels_)]
plt.matshow(fit_data, cmap=plt.cm.Blues)
plt.title("After biclustering; rearranged to show biclusters")
plt.matshow(np.outer(np.sort(model.row_labels_) + 1,
np.sort(model.column_labels_) + 1),
cmap=plt.cm.Blues)
plt.title("Checkerboard structure of rearranged data")
plt.show()
This example demonstrates the Spectral Co-clustering algorithm on the twenty newsgroups dataset. The ‘comp.os.ms-
windows.misc’ category is excluded because it contains many posts containing nothing but data.
The TF-IDF vectorized posts form a word frequency matrix, which is then biclustered using Dhillon’s Spectral Co-Clustering algorithm. The resulting document-word biclusters indicate subsets of words used more often in those subsets of documents.
For a few of the best biclusters, their most common document categories and their ten most important words are printed. The best biclusters are determined by their normalized cut. The best words are determined by comparing their sums
The best biclusters are determined by their normalized cut. The best words are determined by comparing their sums
inside and outside the bicluster.
For comparison, the documents are also clustered using MiniBatchKMeans. The document clusters derived from the
biclusters achieve a better V-measure than clusters found by MiniBatchKMeans.
Out:
Vectorizing...
Coclustering...
Done in 2.58s. V-measure: 0.4387
MiniBatchKMeans...
Done in 4.94s. V-measure: 0.3344
Best biclusters:
----------------
bicluster 0 : 1829 documents, 2524 words
categories : 22% comp.sys.ibm.pc.hardware, 19% comp.sys.mac.hardware, 18% comp.graphics
words : card, pc, ram, drive, bus, mac, motherboard, port, windows, floppy
from collections import defaultdict
from time import time

import numpy as np

from sklearn.cluster import SpectralCoclustering, MiniBatchKMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score

print(__doc__)
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant. By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)
class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))
# exclude 'comp.os.ms-windows.misc'
categories = ['alt.atheism', 'comp.graphics',
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
'comp.windows.x', 'misc.forsale', 'rec.autos',
'rec.motorcycles', 'rec.sport.baseball',
'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
'sci.med', 'sci.space', 'soc.religion.christian',
'talk.politics.guns', 'talk.politics.mideast',
'talk.politics.misc', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target
print("Vectorizing...")
X = vectorizer.fit_transform(newsgroups.data)
print("Coclustering...")
start_time = time()
cocluster.fit(X)
y_cocluster = cocluster.row_labels_
print("Done in {:.2f}s. V-measure: {:.4f}".format(
time() - start_time,
v_measure_score(y_cocluster, y_true)))
print("MiniBatchKMeans...")
start_time = time()
y_kmeans = kmeans.fit_predict(X)
print("Done in {:.2f}s. V-measure: {:.4f}".format(
time() - start_time,
v_measure_score(y_kmeans, y_true)))
feature_names = vectorizer.get_feature_names()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)
def bicluster_ncut(i):
    rows, cols = cocluster.get_indices(i)
    if not (np.any(rows) and np.any(cols)):
        ...  # (the remainder of this helper is omitted from this excerpt)
def most_common(d):
    """Items of a defaultdict(int) with the highest values."""
    # (body reconstructed: sort the items by count, largest first)
    return sorted(d.items(), key=lambda kv: kv[1], reverse=True)
bicluster_ncuts = list(bicluster_ncut(i)
                       for i in range(len(newsgroups.target_names)))
best_idx = np.argsort(bicluster_ncuts)[:5]
print()
print("Best biclusters:")
print("----------------")
for idx, cluster in enumerate(best_idx):
    n_rows, n_cols = cocluster.get_shape(cluster)
    cluster_docs, cluster_words = cocluster.get_indices(cluster)
    if not len(cluster_docs) or not len(cluster_words):
        continue

    # categories
    counter = defaultdict(int)
    for i in cluster_docs:
        counter[document_names[i]] += 1
    cat_string = ", ".join("{:.0f}% {}".format(float(c) / n_rows * 100, name)
                           for name, c in most_common(counter)[:3])

    # words
    out_of_cluster_docs = cocluster.row_labels_ != cluster
    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
    word_col = X[:, cluster_words]
    word_scores = np.array(word_col[cluster_docs, :].sum(axis=0) -
                           word_col[out_of_cluster_docs, :].sum(axis=0))
    word_scores = word_scores.ravel()
    important_words = list(feature_names[cluster_words[i]]
                           for i in word_scores.argsort()[:-11:-1])
    # (printing of the bicluster summary is omitted from this excerpt)
6.3 Calibration
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly
interpreted as a confidence level. For instance a well calibrated (binary) classifier should classify the samples such that
among the samples to which it gave a predict_proba value close to 0.8, approx. 80% actually belong to the positive
class.
LogisticRegression returns well calibrated predictions as it directly optimizes log-loss. In contrast, the other methods
return biased probabilities, with different biases per method:
• GaussianNaiveBayes tends to push probabilities to 0 or 1 (note the counts in the histograms). This is mainly
because it makes the assumption that features are conditionally independent given the class, which is not the
case for this dataset, which contains 2 redundant features.
• RandomForestClassifier shows the opposite behavior: the histograms show peaks at approx. 0.2 and 0.9 proba-
bility, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and
Caruana [1]: “Methods such as bagging and random forests that average predictions from a base set of models can
have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predic-
tions that should be near zero or one away from these values. Because predictions are restricted to the interval
[0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict
p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the
trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case,
thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly
with random forests because the base-level trees trained with random forests have relatively high variance due
to feature subsetting.” As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the
classifier could trust its “intuition” more and typically return probabilities closer to 0 or 1.
• Support Vector Classification (SVC) shows an even more sigmoid curve than the RandomForestClassifier, which
is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]), which focus on hard samples
that are close to the decision boundary (the support vectors).
References:
[1] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005
print(__doc__)
import numpy as np
np.random.seed(0)
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)
X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]
# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = LinearSVC(C=1.0)
rfc = RandomForestClassifier()
# #############################################################################
# Plot calibration plots
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')
Note: Click here to download the full example code or to run this example in your browser via Binder
When performing classification one often wants to predict not only the class label, but also the associated probability.
This probability gives some kind of confidence on the prediction. This example demonstrates how to display how well
calibrated the predicted probabilities are and how to calibrate an uncalibrated classifier.
The experiment is performed on an artificial dataset for binary classification with 100,000 samples (1,000 of them are
used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant. The first
figure shows the estimated probabilities obtained with logistic regression, Gaussian naive Bayes, and Gaussian naive
Bayes with both isotonic calibration and sigmoid calibration. The calibration performance is evaluated with Brier
score, reported in the legend (the smaller the better). One can observe here that logistic regression is well calibrated
while raw Gaussian naive Bayes performs very badly. This is because of the redundant features which violate the
assumption of feature-independence and result in an overly confident classifier, which is indicated by the typical
transposed-sigmoid curve.
Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from
the nearly diagonal calibration curve. Sigmoid calibration also improves the Brier score slightly, albeit not as strongly
as the non-parametric isotonic regression. This can be attributed to the fact that we have plenty of calibration data such
that the greater flexibility of the non-parametric model can be exploited.
The second figure shows the calibration curve of a linear support-vector classifier (LinearSVC). LinearSVC shows
the opposite behavior to Gaussian naive Bayes: the calibration curve has a sigmoid shape, which is typical for an
under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which
lets the model focus on hard samples that are close to the decision boundary (the support vectors).
Both kinds of calibration can fix this issue and yield nearly identical results. This shows that sigmoid calibration can
deal with situations where the calibration curve of the base classifier is sigmoid (e.g., for LinearSVC) but not where it
is transposed-sigmoid (e.g., Gaussian naive Bayes).
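A compact sketch of the calibration step discussed above, wrapping a LinearSVC in CalibratedClassifierCV with both methods; the synthetic dataset and parameters here are assumptions for illustration, not the example's exact setup.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
X, y = make_classification(n_samples=10000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000,
                                                    random_state=42)
for method in ("isotonic", "sigmoid"):
    # Wrap the base estimator; the calibrator is fitted on internal CV folds
    calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000),
                                        method=method, cv=3)
    calibrated.fit(X_train, y_train)
    prob_pos = calibrated.predict_proba(X_test)[:, 1]
    print(method, "Brier:", round(brier_score_loss(y_test, prob_pos), 3))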
Out:
Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862
Naive Bayes:
Brier: 0.118
Precision: 0.857
Recall: 0.876
F1: 0.867
Logistic:
Brier: 0.099
Precision: 0.872
Recall: 0.851
F1: 0.862
SVC:
Brier: 0.163
Precision: 0.872
Recall: 0.852
F1: 0.862
SVC + Isotonic:
Brier: 0.100
Precision: 0.853
Recall: 0.878
F1: 0.865
SVC + Sigmoid:
Brier: 0.099
Precision: 0.874
Recall: 0.849
F1: 0.861
print(__doc__)
fraction_of_positives, mean_predicted_value = \
calibration_curve(y_test, prob_pos, n_bins=10)
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
When performing classification you often want to predict not only the class label, but also the associated probability.
This probability gives you some kind of confidence on the prediction. However, not all classifiers provide well-
calibrated probabilities, some being over-confident while others being under-confident. Thus, a separate calibration
of predicted probabilities is often desirable as a postprocessing step. This example illustrates two different methods for
this calibration and evaluates the quality of the returned probabilities using the Brier score (see
https://en.wikipedia.org/wiki/Brier_score).
Compared are the estimated probability using a Gaussian naive Bayes classifier without calibration, with a sigmoid
calibration, and with a non-parametric isotonic calibration. One can observe that only the non-parametric model is
able to provide a probability calibration that returns probabilities close to the expected 0.5 for most of the samples
belonging to the middle cluster with heterogeneous labels. This results in a significantly improved Brier score.
Out:
Brier scores: (the smaller the better)
No calibration: 0.104
With isotonic calibration: 0.084
With sigmoid calibration: 0.109
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
n_samples = 50000
n_bins = 3 # use 3 bins for calibration_curve as we have 3 clusters here
y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])
# #############################################################################
# Plot the data and the predicted probabilities
plt.figure()
y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
this_X = X_train[y_train == this_y]
this_sw = sw_train[y_train == this_y]
plt.scatter(this_X[:, 0], this_X[:, 1], s=this_sw * 50,
plt.figure()
order = np.lexsort((prob_pos_clf, ))
plt.plot(prob_pos_clf[order], 'r', label='No calibration (%1.3f)' % clf_score)
plt.plot(prob_pos_isotonic[order], 'g', linewidth=3,
label='Isotonic calibration (%1.3f)' % clf_isotonic_score)
plt.plot(prob_pos_sigmoid[order], 'b', linewidth=3,
label='Sigmoid calibration (%1.3f)' % clf_sigmoid_score)
plt.plot(np.linspace(0, y_test.size, 51)[1::2],
y_test[order].reshape(25, -1).mean(1),
'k', linewidth=3, label=r'Empirical')
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability "
"(uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem.
Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the
probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier
after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green:
class 2, blue: class 3).
The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800
training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical
classifier, which was trained on 600 datapoints, with method=’sigmoid’ on the remaining 200 datapoints reduces the
confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center.
This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base
estimators which would have resulted in a similar decrease in log-loss.
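The hold-out calibration described above can be sketched as follows (a simplified assumption of the setup, not the example's exact code): a prefit random forest is calibrated with method='sigmoid' on a separate validation slice.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
X, y = make_blobs(n_samples=1000, random_state=42, cluster_std=5.0)
# Train on 600 points, calibrate on the next 200, evaluate on the last 200
clf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X[:600], y[:600])
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X[600:800], y[600:800])
print("calibrated log-loss:",
      round(log_loss(y[800:], sig_clf.predict_proba(X[800:])), 3))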
Out:
Log-loss of
* uncalibrated classifier trained on 800 datapoints: 1.280
* classifier trained on 600 datapoints and calibrated on 200 datapoint: 0.534
print(__doc__)
import numpy as np
np.random.seed(0)
# Generate data
X, y = make_blobs(n_samples=1000, random_state=42, cluster_std=5.0)
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:800], y[600:800]
X_train_valid, y_train_valid = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]
print("Log-loss of")
print(" * uncalibrated classifier trained on 800 datapoints: %.3f "
% score)
print(" * classifier trained on 600 datapoints and calibrated on "
"200 datapoint: %.3f" % sig_score)
# Illustrate calibrator
plt.figure()
# generate grid over 2-simplex
p1d = np.linspace(0, 1, 20)
p0, p1 = np.meshgrid(p1d, p1d)
p2 = 1 - p0 - p1
p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]
p = p[p[:, 2] >= 0]
calibrated_classifier = sig_clf.calibrated_classifiers_[0]
prediction = np.vstack([calibrator.predict(this_p)
for calibrator, this_p in
zip(calibrated_classifier.calibrators_, p.T)]).T
prediction /= prediction.sum(axis=1)[:, None]
plt.grid(False)
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
plt.plot([0, x], [x, 0], 'k', alpha=0.2)
plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)
plt.show()
6.4 Classification
Note: Click here to download the full example code or to run this example in your browser via Binder
import numpy as np
import matplotlib.pyplot as plt
X, y = generate_data(n_test, n_features)
score_clf1 += clf1.score(X, y)
score_clf2 += clf2.score(X, y)
acc_clf1.append(score_clf1 / n_averages)
acc_clf2.append(score_clf2 / n_averages)
plt.xlabel('n_features / n_samples')
plt.ylabel('Classification accuracy')
Note: Click here to download the full example code or to run this example in your browser via Binder
An example showing how scikit-learn can be used to recognize images of hand-written digits.
This example is commented in the tutorial section of the user manual.
Out:
Confusion matrix:
[[87 0 0 0 1 0 0 0 0 0]
[ 0 88 1 0 0 0 0 0 1 1]
[ 0 0 85 1 0 0 0 0 0 0]
[ 0 0 0 79 0 3 0 4 5 0]
[ 0 0 0 0 88 0 0 0 0 4]
print(__doc__)
# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# matplotlib.pyplot.imread. Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
_, axes = plt.subplots(2, 4)
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes[0, :], images_and_labels[:4]):
ax.set_axis_off()
ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
ax.set_title('Training: %i' % label)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the classification probability for different classifiers. We use a 3 class dataset, and we classify it with a Sup-
port Vector classifier, L1 and L2 penalized logistic regression with either a One-Vs-Rest or multinomial setting, and
Gaussian process classification.
Linear SVC is not a probabilistic classifier by default but it has a built-in calibration option enabled in this example
(probability=True).
The logistic regression with One-Vs-Rest is not a multiclass classifier out of the box. As a result it has more trouble
separating classes 2 and 3 than the other estimators.
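For reference, a small sketch (using the iris data, as an assumption consistent with the code below) contrasting the probabilities returned by the One-Vs-Rest and multinomial settings of logistic regression:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
# One-vs-rest fits one binary model per class; multinomial fits a joint softmax model
ovr = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X, y)
multinomial = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)
print("OvR:        ", ovr.predict_proba(X[:1]).round(3))
print("Multinomial:", multinomial.predict_proba(X[:1]).round(3))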
Out:
Accuracy (train) for L1 logistic: 82.7%
Accuracy (train) for L2 logistic (Multinomial): 82.7%
Accuracy (train) for L2 logistic (OvR): 79.3%
Accuracy (train) for Linear SVC: 82.0%
Accuracy (train) for GPC: 82.7%
print(__doc__)
iris = datasets.load_iris()
X = iris.data[:, 0:2] # we only take the first two features for visualization
y = iris.target
n_features = X.shape[1]
C = 10
kernel = 1.0 * RBF([1.0, 1.0]) # for GPC
n_classifiers = len(classifiers)
xx = np.linspace(3, 9, 100)
yy = np.linspace(1, 5, 100).T
xx, yy = np.meshgrid(xx, yy)
Xfull = np.c_[xx.ravel(), yy.ravel()]
y_pred = classifier.predict(X)
accuracy = accuracy_score(y, y_pred)
print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
# View probabilities:
probas = classifier.predict_proba(Xfull)
n_classes = np.unique(y_pred).size
for k in range(n_classes):
plt.subplot(n_classifiers, n_classes, index * n_classes + k + 1)
plt.title("Class %d" % k)
if k == 0:
plt.ylabel(name)
imshow_handle = plt.imshow(probas[:, k].reshape((100, 100)),
extent=(3, 9, 1, 5), origin='lower')
plt.xticks(())
plt.yticks(())
idx = (y_pred == k)
if idx.any():
plt.scatter(X[idx, 0], X[idx, 1], marker='o', c='w', edgecolor='k')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A comparison of several classifiers in scikit-learn on synthetic datasets. The point of this example is to illustrate
the nature of decision boundaries of different classifiers. This should be taken with a grain of salt, as the intuition
conveyed by these examples does not necessarily carry over to real datasets.
Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers
such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.
The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classifi-
cation accuracy on the test set.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1, max_iter=1000),
AdaBoostClassifier(),
GaussianNB(),
QuadraticDiscriminantAnalysis()]
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
if ds_cnt == 0:
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
size=15, horizontalalignment='right')
i += 1
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example plots the covariance ellipsoids of each class and decision boundary learned by LDA and QDA. The
ellipsoids display the double standard deviation for each class. With LDA, the standard deviation is the same for all
the classes, while each class has its own standard deviation with QDA.
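A brief sketch (on assumed toy data, not the example's generators) of the two estimators and the fitted attributes from which the ellipsoids are drawn:
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
rng = np.random.RandomState(0)
X = np.r_[rng.randn(100, 2), rng.randn(100, 2) + [3, 3]]
y = np.r_[np.zeros(100), np.ones(100)]
lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
# LDA estimates a single covariance shared by all classes ...
print(lda.means_.shape, lda.covariance_.shape)
# ... while QDA estimates one covariance matrix per class
print(qda.means_.shape, len(qda.covariance_))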
print(__doc__)
# #############################################################################
# Colormap
cmap = colors.LinearSegmentedColormap(
'red_blue_classes',
{'red': [(0, 1, 1), (1, 0.7, 0.7)],
'green': [(0, 0.7, 0.7), (1, 0.7, 0.7)],
'blue': [(0, 0.7, 0.7), (1, 1, 1)]})
plt.cm.register_cmap(cmap=cmap)
# #############################################################################
# Generate datasets
def dataset_cov():
'''Generate 2 Gaussians samples with different covariance matrices'''
n, dim = 300, 2
np.random.seed(0)
C = np.array([[0., -1.], [2.5, .7]]) * 2.
X = np.r_[np.dot(np.random.randn(n, dim), C),
np.dot(np.random.randn(n, dim), C.T) + np.array([1, 4])]
y = np.hstack((np.zeros(n), np.ones(n)))
return X, y
# #############################################################################
# Plot functions
def plot_data(lda, X, y, y_pred, fig_index):
splot = plt.subplot(2, 2, fig_index)
if fig_index == 1:
plt.title('Linear Discriminant Analysis')
plt.ylabel('Data with\n fixed covariance')
elif fig_index == 2:
plt.title('Quadratic Discriminant Analysis')
elif fig_index == 3:
plt.ylabel('Data with\n varying covariances')
# class 0: dots
plt.scatter(X0_tp[:, 0], X0_tp[:, 1], marker='.', color='red')
plt.scatter(X0_fp[:, 0], X0_fp[:, 1], marker='x',
s=20, color='#990000') # dark red
# class 1: dots
plt.scatter(X1_tp[:, 0], X1_tp[:, 1], marker='.', color='blue')
plt.scatter(X1_fp[:, 0], X1_fp[:, 1], marker='x',
s=20, color='#000099') # dark blue
# means
plt.plot(lda.means_[0][0], lda.means_[0][1],
'*', color='yellow', markersize=15, markeredgecolor='grey')
plt.plot(lda.means_[1][0], lda.means_[1][1],
'*', color='yellow', markersize=15, markeredgecolor='grey')
return splot
6.5 Clustering
Note: Click here to download the full example code or to run this example in your browser via Binder
This example plots the corresponding dendrogram of a hierarchical clustering using AgglomerativeClustering and the
dendrogram method available in scipy.
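The key step of this example is turning the fitted model's children_ and distances_ into a scipy linkage matrix; the sketch below is a condensed version of that idea, using the iris data as in the example.
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
X = load_iris().data
# distance_threshold=0 makes the estimator build the full tree and expose distances_
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    # count the original samples under each merged node
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)
linkage_matrix = np.column_stack(
    [model.children_, model.distances_, counts]).astype(float)
dendrogram(linkage_matrix, truncate_mode='level', p=3)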
import numpy as np
iris = load_iris()
X = iris.data
model = model.fit(X)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
These images show how similar features are merged together using feature agglomeration.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
n_clusters=32)
agglo.fit(X)
X_reduced = agglo.transform(X)
X_restored = agglo.inverse_transform(X_reduced)
images_restored = np.reshape(X_restored, images.shape)
plt.figure(1, figsize=(4, 3.5))
plt.clf()
plt.subplots_adjust(left=.01, right=.99, bottom=.01, top=.91)
for i in range(4):
plt.subplot(3, 4, i + 1)
plt.imshow(images[i], cmap=plt.cm.gray, vmax=16, interpolation='nearest')
plt.xticks(())
plt.yticks(())
if i == 1:
plt.subplot(3, 4, 10)
plt.imshow(np.reshape(agglo.labels_, images[0].shape),
interpolation='nearest', cmap=plt.cm.nipy_spectral)
plt.xticks(())
plt.yticks(())
plt.title('Labels')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Reference:
Dorin Comaniciu and Peter Meer, “Mean Shift: A robust approach toward feature space analysis”. IEEE Transactions
on Pattern Analysis and Machine Intelligence. 2002. pp. 603-619.
print(__doc__)
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)
# #############################################################################
# Compute clustering with MeanShift
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
my_members = labels == k
cluster_center = cluster_centers[k]
plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example is meant to illustrate situations where k-means will produce unintuitive and possibly unexpected clusters.
In the first three plots, the input data does not conform to some implicit assumption that k-means makes and undesirable
clusters are produced as a result. In the last plot, k-means returns intuitive clusters despite unevenly sized blobs.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
n_samples = 1500
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Incorrect Number of Blobs")
plt.subplot(222)
plt.scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_pred)
plt.title("Anisotropicly Distributed Blobs")
# Different variance
X_varied, y_varied = make_blobs(n_samples=n_samples,
cluster_std=[1.0, 2.5, 0.5],
random_state=random_state)
y_pred = KMeans(n_clusters=3, random_state=random_state).fit_predict(X_varied)
plt.subplot(223)
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_pred)
plt.title("Unequal Variance")
plt.subplot(224)
plt.scatter(X_filtered[:, 0], X_filtered[:, 1], c=y_pred)
plt.title("Unevenly Sized Blobs")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example uses a large dataset of faces to learn a set of 20 x 20 image patches that constitute faces.
From the programming standpoint, it is interesting because it shows how to use the online API of scikit-learn
to process a very large dataset by chunks. The way we proceed is that we load an image at a time and extract
50 patches at random from this image. Once we have accumulated 500 of these patches (using 10 images), we run the
partial_fit method of the online KMeans object, MiniBatchKMeans.
The verbose setting on the MiniBatchKMeans enables us to see that some clusters are reassigned during the successive
calls to partial_fit. This is because the number of patches that they represent has become too low, and it is better to
choose a random new cluster.
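A stripped-down sketch of the online API used here (random data stands in for the image patches; the shapes are assumptions): partial_fit is simply called once per accumulated buffer.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
rng = np.random.RandomState(0)
kmeans = MiniBatchKMeans(n_clusters=8, random_state=0)
for _ in range(10):
    buffer = rng.randn(500, 16)   # one chunk of 500 "patches" of 16 features
    kmeans.partial_fit(buffer)    # updates the cluster centers incrementally
print(kmeans.cluster_centers_.shape)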
print(__doc__)
import time
faces = datasets.fetch_olivetti_faces()
# #############################################################################
# Learn the dictionary of images
buffer = []
t0 = time.time()
# The online learning part: cycle over the whole dataset 6 times
index = 0
for _ in range(6):
for img in faces.images:
data = extract_patches_2d(img, patch_size, max_patches=50,
random_state=rng)
data = np.reshape(data, (len(data), -1))
buffer.append(data)
index += 1
if index % 10 == 0:
data = np.concatenate(buffer, axis=0)
data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)
kmeans.partial_fit(data)
buffer = []
if index % 100 == 0:
print('Partial fit of %4i out of %i'
% (index, 6 * len(faces.images)))
dt = time.time() - t0
print('done in %.2fs.' % dt)
# #############################################################################
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Face, a 1024 x 768 image of a raccoon face, is used here to illustrate how k-means can be used for vector quantization.
print(__doc__)
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
n_clusters = 5
np.random.seed(0)
vmin = face.min()
vmax = face.max()
# original face
plt.figure(1, figsize=(3, 2.2))
plt.imshow(face, cmap=plt.cm.gray, vmin=vmin, vmax=256)
# compressed face
plt.figure(2, figsize=(3, 2.2))
plt.imshow(face_compressed, cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
# histogram
plt.figure(4, figsize=(3, 2.2))
plt.clf()
plt.axes([.01, .01, .98, .98])
plt.hist(X, bins=256, color='.5', edgecolor='.5')
plt.yticks(())
plt.xticks(regular_values)
values = np.sort(values)
for center_1, center_2 in zip(values[:-1], values[1:]):
plt.axvline(.5 * (center_1 + center_2), color='b')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Reference: Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science Feb.
2007
print(__doc__)
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
random_state=0)
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
class_members = labels == k
cluster_center = X[cluster_centers_indices[k]]
plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
for x in X[class_members]:
plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows the effect of imposing a connectivity graph to capture local structure in the data. The graph is
simply the graph of 20 nearest neighbors.
Two consequences of imposing a connectivity can be seen. First, clustering with a connectivity matrix is much faster.
Second, when using a connectivity matrix, single, average and complete linkage are unstable and tend to create a few
clusters that grow very quickly. Indeed, average and complete linkage fight this percolation behavior by considering all
the distances between two clusters when merging them (while single linkage exaggerates the behaviour by considering
only the shortest distance between clusters). The connectivity graph breaks this mechanism for average and complete
linkage, making them resemble the more brittle single linkage. This effect is more pronounced for very sparse graphs
(try decreasing the number of neighbors in kneighbors_graph) and with complete linkage. In particular, having a very
small number of neighbors in the graph imposes a geometry that is close to that of single linkage, which is well known
to have this percolation instability.
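A minimal sketch of the construction this example relies on (toy random data as an assumption): build the k-nearest-neighbors graph and pass it as the connectivity constraint.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
rng = np.random.RandomState(0)
X = rng.rand(1500, 2)
# Connectivity restricted to the 20 nearest neighbors of each sample
connectivity = kneighbors_graph(X, n_neighbors=20, include_self=False)
model = AgglomerativeClustering(n_clusters=4, linkage='average',
                                connectivity=connectivity).fit(X)
print(np.bincount(model.labels_))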
import time
import matplotlib.pyplot as plt
import numpy as np
X = np.concatenate((x, y))
X += .7 * np.random.randn(2, n_samples)
X = X.T
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example uses Spectral clustering on a graph created from voxel-to-voxel difference on an image to break this
image into multiple partly-homogeneous regions.
This procedure (spectral clustering on an image) is an efficient approximate solution for finding normalized graph cuts.
There are two options to assign labels (a minimal sketch follows the list):
• with ‘kmeans’ spectral clustering will cluster samples in the embedding space using a kmeans algorithm
• whereas ‘discretize’ will iteratively search for the closest partition space to the embedding space.
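A minimal sketch of both strategies (a small random image is used here as an assumption, not the coins image of the example):
import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image
rng = np.random.RandomState(0)
img = rng.rand(30, 30)
graph = image.img_to_graph(img)
# Turn gradients into affinities, as spectral clustering expects similarities
graph.data = np.exp(-graph.data / graph.data.std())
for assign_labels in ('kmeans', 'discretize'):
    labels = spectral_clustering(graph, n_clusters=4,
                                 assign_labels=assign_labels, random_state=0)
    print(assign_labels, np.bincount(labels))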
print(__doc__)
import time
import numpy as np
from distutils.version import LooseVersion
from scipy.ndimage.filters import gaussian_filter
import matplotlib.pyplot as plt
import skimage
from skimage.data import coins
from skimage.transform import rescale
# Convert the image into a graph with the value of the gradient on the
# edges.
graph = image.img_to_graph(rescaled_coins)
# Apply spectral clustering (this step goes much faster if you have pyamg
# installed)
N_REGIONS = 25
plt.figure(figsize=(5, 5))
Note: Click here to download the full example code or to run this example in your browser via Binder
An illustration of various linkage options for agglomerative clustering on a 2D embedding of the digits dataset.
The goal of this example is to show intuitively how the metrics behave, and not to find good clusters for the digits.
This is why the example works on a 2D embedding.
What this example shows us is the “rich get richer” behavior of agglomerative clustering that tends to create uneven
cluster sizes. This behavior is pronounced for the average linkage strategy, which ends up with a couple of singleton
clusters, while in the case of single linkage we get a single central cluster with all other clusters being drawn from
noise points around the fringes.
Out:
Computing embedding
Done.
print(__doc__)
from time import time
import numpy as np
from scipy import ndimage
from matplotlib import pyplot as plt
X, y = datasets.load_digits(return_X_y=True)
n_samples, n_features = X.shape
np.random.seed(0)
X, y = nudge_images(X, y)
#----------------------------------------------------------------------
# Visualize the clustering
def plot_clustering(X_red, labels, title=None):
x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
X_red = (X_red - x_min) / (x_max - x_min)
plt.figure(figsize=(6, 4))
for i in range(X_red.shape[0]):
plt.text(X_red[i, 0], X_red[i, 1], str(y[i]),
color=plt.cm.nipy_spectral(labels[i] / 10.),
fontdict={'weight': 'bold', 'size': 9})
#----------------------------------------------------------------------
# 2D embedding of the digits dataset
print("Computing embedding")
X_red = manifold.SpectralEmbedding(n_components=2).fit_transform(X)
print("Done.")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The plots display firstly what a K-means algorithm would yield using three clusters. It is then shown what the effect
of a bad initialization is on the classification process: By setting n_init to only 1 (default is 10), the number of times
that the algorithm will be run with different centroid seeds is reduced. The next plot displays what using eight clusters
would deliver and finally the ground truth.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data
y = iris.target
fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
est.fit(X)
labels = est.labels_
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(titles[fignum - 1])
ax.dist = 12
fignum = fignum + 1
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12
fig.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In this example, an image with connected circles is generated and spectral clustering is used to separate the circles.
In these settings, the Spectral clustering approach solves the problem known as ‘normalized graph cuts’: the image is
seen as a graph of connected voxels, and the spectral clustering algorithm amounts to choosing graph cuts defining
regions while minimizing the ratio of the gradient along the cut, and the volume of the region.
As the algorithm tries to balance the volume (i.e. balance the region sizes), if we take circles with different sizes, the
segmentation fails.
In addition, as there is no useful information in the intensity of the image, or its gradient, we choose to perform the
spectral clustering on a graph that is only weakly informed by the gradient. This is close to performing a Voronoi
partition of the graph.
In addition, we use the mask of the objects to restrict the graph to the outline of the objects. In this example, we are
interested in separating the objects one from the other, and not from the background.
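A compact sketch of that masked setup (the circle geometry below is an assumption chosen for brevity): the graph is built only on the foreground pixels, then clustered.
import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image
l = 100
x, y = np.indices((l, l))
circle1 = (x - 28) ** 2 + (y - 24) ** 2 < 16 ** 2
circle2 = (x - 40) ** 2 + (y - 44) ** 2 < 15 ** 2   # overlaps circle1
img = (circle1 + circle2).astype(float)
mask = img.astype(bool)                              # restrict the graph to the circles
img += 1 + 0.2 * np.random.RandomState(0).randn(*img.shape)
graph = image.img_to_graph(img, mask=mask)
graph.data = np.exp(-graph.data / graph.data.std())
labels = spectral_clustering(graph, n_clusters=2, eigen_solver='arpack',
                             random_state=0)
print(np.bincount(labels))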
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
l = 100
x, y = np.indices((l, l))
# #############################################################################
# 4 circles
img = circle1 + circle2 + circle3 + circle4
# We use a mask that limits to the foreground: the problem that we are
# interested in here is not separating the objects from the background,
# but separating them one from the other.
mask = img.astype(bool)
img = img.astype(float)
img += 1 + 0.2 * np.random.randn(*img.shape)
# Convert the image into a graph with the value of the gradient on the
# edges.
graph = image.img_to_graph(img, mask=mask)
plt.matshow(img)
plt.matshow(label_im)
# #############################################################################
# 2 circles
img = circle1 + circle2
mask = img.astype(bool)
img = img.astype(float)
plt.matshow(img)
plt.matshow(label_im)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Compute the segmentation of a 2D image with Ward hierarchical clustering. The clustering is spatially constrained in
order for each segmented region to be in one piece.
print(__doc__)
import numpy as np
from distutils.version import LooseVersion
from scipy.ndimage.filters import gaussian_filter
import skimage
from skimage.data import coins
from skimage.transform import rescale
# #############################################################################
# Generate data
orig_coins = coins()
# #############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*rescaled_coins.shape)
# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 27 # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, rescaled_coins.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
# #############################################################################
# Plot the results on an image
plt.figure(figsize=(5, 5))
plt.imshow(rescaled_coins, cmap=plt.cm.gray)
for l in range(n_clusters):
plt.contour(label == l,
colors=[plt.cm.nipy_spectral(l / float(n_clusters)), ])
Note: Click here to download the full example code or to run this example in your browser via Binder
Finds core samples of high density and expands clusters from them.
Out:
Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
print(__doc__)
import numpy as np
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
random_state=0)
X = StandardScaler().fit_transform(X)
# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
class_member_mask = (labels == k)
Note: Click here to download the full example code or to run this example in your browser via Binder
Performs a pixel-wise Vector Quantization (VQ) of an image of the summer palace (China), reducing the number of
colors required to show the image from 96,615 unique colors to 64, while preserving the overall appearance quality.
In this example, pixels are represented in a 3D-space and K-means is used to find 64 color clusters. In the image
processing literature, the codebook obtained from K-means (the cluster centers) is called the color palette. Using a
single byte, up to 256 colors can be addressed, whereas an RGB encoding requires 3 bytes per pixel. The GIF file
format, for example, uses such a palette.
For comparison, a quantized image using a random codebook (colors picked up randomly) is also shown.
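A compressed sketch of the procedure (using one of the bundled sample images; the sub-sample size is an assumption): fit the 64-color codebook on a random subset of pixels, then map every pixel to its nearest center.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
china = load_sample_image("china.jpg")
pixels = np.array(china, dtype=np.float64).reshape(-1, 3) / 255
sample = shuffle(pixels, random_state=0)[:1000]
kmeans = KMeans(n_clusters=64, random_state=0).fit(sample)   # the color palette
labels = kmeans.predict(pixels)                              # codebook index per pixel
quantized = kmeans.cluster_centers_[labels].reshape(china.shape)
print(quantized.shape)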
Out:
Fitting model on a small sub-sample of the data
done in 0.369s.
Predicting color indices on the full image (k-means)
done in 1.659s.
Predicting color indices on the full image (random)
done in 1.100s.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.datasets import load_sample_image
from sklearn.utils import shuffle
n_colors = 64
plt.figure(3)
plt.clf()
plt.axis('off')
plt.title('Quantized image (64 colors, Random)')
plt.imshow(recreate_image(codebook_random, labels_random, w, h))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example builds a swiss roll dataset and runs hierarchical clustering on its position.
For more information, see Hierarchical clustering.
In a first step, the hierarchical clustering is performed without connectivity constraints on the structure and is solely
based on distance, whereas in a second step the clustering is restricted to the k-Nearest Neighbors graph: it’s a
hierarchical clustering with structure prior.
Some of the clusters learned without connectivity constraints do not respect the structure of the swiss roll and extend
across different folds of the manifold. In contrast, when imposing connectivity constraints, the clusters form a
nice parcellation of the swiss roll.
Out:
Compute unstructured hierarchical clustering...
Elapsed time: 0.06s
Number of points: 1500
Compute structured hierarchical clustering...
Elapsed time: 0.09s
Number of points: 1500
print(__doc__)
# #############################################################################
# Generate data (swiss roll dataset)
n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise)
# Make it thinner
X[:, 1] *= .5
# #############################################################################
# Compute clustering
print("Compute unstructured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)
# #############################################################################
# Plot result
fig = plt.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
color=plt.cm.jet(float(l) / np.max(label + 1)),
s=20, edgecolor='k')
plt.title('Without connectivity constraints (time %.2fs)' % elapsed_time)
# #############################################################################
# Define the structure A of the data. Here a 10 nearest neighbors
from sklearn.neighbors import kneighbors_graph
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
linkage='ward').fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)
# #############################################################################
# Plot result
fig = plt.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
color=plt.cm.jet(float(l) / np.max(label + 1)),
s=20, edgecolor='k')
plt.title('With connectivity constraints (time %.2fs)' % elapsed_time)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
# Author: Gael Varoquaux
# License: BSD 3-Clause or CC-0
np.random.seed(0)
def sqr(x):
return np.sign(np.cos(x))
X = list()
y = list()
for i, (phi, a) in enumerate([(.5, .15), (.5, .6), (.3, .2)]):
for _ in range(30):
phase_noise = .01 * np.random.normal()
amplitude_noise = .04 * np.random.normal()
additional_noise = 1 - 2 * np.random.rand(n_features)
# Make the noise sparse
X = np.array(X)
y = np.array(y)
n_clusters = 3
plt.legend(loc='best')
plt.axis('tight')
plt.axis('off')
plt.suptitle("Ground truth", size=20)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Clustering can be expensive, especially when our dataset contains millions of datapoints. Many clustering algorithms
are not inductive and so cannot be directly applied to new data samples without recomputing the clustering, which
may be intractable. Instead, we can use clustering to then learn an inductive model with a classifier, which has several
benefits:
• it allows the clusters to scale and apply to new data
• unlike re-fitting the clusters to new samples, it makes sure the labelling procedure is consistent over time
• it allows us to use the inferential capabilities of the classifier to describe or explain the clusters
This example illustrates a generic implementation of a meta-estimator which extends clustering by inducing a classifier
from the cluster labels.
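The sketch below is a simplified stand-in for the meta-estimator described above (not the example's exact class): the clusterer produces labels, a classifier is fit on them, and new samples are then assigned inductively.
from sklearn.base import BaseEstimator, clone
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
class SimpleInductiveClusterer(BaseEstimator):
    def __init__(self, clusterer, classifier):
        self.clusterer = clusterer
        self.classifier = classifier
    def fit(self, X, y=None):
        self.clusterer_ = clone(self.clusterer)
        self.classifier_ = clone(self.classifier)
        labels = self.clusterer_.fit_predict(X)      # cluster the training data
        self.classifier_.fit(X, labels)              # learn an inductive model
        return self
    def predict(self, X):
        return self.classifier_.predict(X)           # works on unseen samples
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
model = SimpleInductiveClusterer(AgglomerativeClustering(n_clusters=3),
                                 RandomForestClassifier(random_state=42)).fit(X)
print(model.predict(X[:5]))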
import numpy as np
import matplotlib.pyplot as plt
N_SAMPLES = 5000
RANDOM_STATE = 42
class InductiveClusterer(BaseEstimator):
def __init__(self, clusterer, classifier):
self.clusterer = clusterer
self.classifier = classifier
@if_delegate_has_method(delegate='classifier_')
def predict(self, X):
return self.classifier_.predict(X)
@if_delegate_has_method(delegate='classifier_')
def decision_function(self, X):
return self.classifier_.decision_function(X)
# Train a clustering algorithm on the training data and get the cluster labels
clusterer = AgglomerativeClustering(n_clusters=3)
cluster_labels = clusterer.fit_predict(X)
plt.figure(figsize=(12, 4))
plt.subplot(131)
plot_scatter(X, cluster_labels)
plt.title("Ward Linkage")
# Generate new samples and plot them along with the original dataset
X_new, y_new = make_blobs(n_samples=10,
centers=[(-7, -1), (-2, 4), (3, 6)],
random_state=RANDOM_STATE)
plt.subplot(132)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, 'black', 1)
plt.title("Unknown instances")
probable_clusters = inductive_learner.predict(X_new)
plt.subplot(133)
plot_scatter(X, cluster_labels)
plot_scatter(X_new, probable_clusters)
Z = inductive_learner.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Finds core samples of high density and expands clusters from them. This example uses data that is generated so
that the clusters have different densities. sklearn.cluster.OPTICS is first used with its Xi cluster detection method,
and then with specific thresholds set on the reachability, which corresponds to sklearn.cluster.DBSCAN.
We can see that the different clusters of OPTICS’s Xi method can be recovered with different choices of
thresholds in DBSCAN.
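A minimal sketch of that workflow (toy blobs with unequal densities are assumed here): OPTICS is fitted once, then DBSCAN-like labelings are extracted at different eps cuts of the reachability.
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.3, 1.0, 2.0],
                  random_state=0)
clust = OPTICS(min_samples=20, xi=0.05, min_cluster_size=0.05).fit(X)
# Re-cut the same reachability at two different eps values
labels_eps05 = cluster_optics_dbscan(reachability=clust.reachability_,
                                     core_distances=clust.core_distances_,
                                     ordering=clust.ordering_, eps=0.5)
labels_eps20 = cluster_optics_dbscan(reachability=clust.reachability_,
                                     core_distances=clust.core_distances_,
                                     ordering=clust.ordering_, eps=2.0)
print(np.unique(clust.labels_), np.unique(labels_eps05), np.unique(labels_eps20))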
np.random.seed(0)
n_points_per_cluster = 250
labels_050 = cluster_optics_dbscan(reachability=clust.reachability_,
core_distances=clust.core_distances_,
ordering=clust.ordering_, eps=0.5)
labels_200 = cluster_optics_dbscan(reachability=clust.reachability_,
core_distances=clust.core_distances_,
ordering=clust.ordering_, eps=2)
space = np.arange(len(X))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]
plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])
# Reachability plot
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
Xk = space[labels == klass]
Rk = reachability[labels == klass]
ax1.plot(Xk, Rk, color, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)
ax1.plot(space, np.full_like(space, 2., dtype=float), 'k-', alpha=0.5)
ax1.plot(space, np.full_like(space, 0.5, dtype=float), 'k-.', alpha=0.5)
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')
# OPTICS
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
Xk = X[clust.labels_ == klass]
ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax2.plot(X[clust.labels_ == -1, 0], X[clust.labels_ == -1, 1], 'k+', alpha=0.1)
ax2.set_title('Automatic Clustering\nOPTICS')
# DBSCAN at 0.5
colors = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']
for klass, color in zip(range(0, 6), colors):
Xk = X[labels_050 == klass]
ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3, marker='.')
ax3.plot(X[labels_050 == -1, 0], X[labels_050 == -1, 1], 'k+', alpha=0.1)
ax3.set_title('Clustering at 0.5 epsilon cut\nDBSCAN')
# DBSCAN at 2.
colors = ['g.', 'm.', 'y.', 'c.']
for klass, color in zip(range(0, 4), colors):
Xk = X[labels_200 == klass]
ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax4.plot(X[labels_200 == -1, 0], X[labels_200 == -1, 1], 'k+', alpha=0.1)
ax4.set_title('Clustering at 2.0 epsilon cut\nDBSCAN')
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example compares the timing of Birch (with and without the global clustering step) and MiniBatchKMeans on a
synthetic dataset having 100,000 samples and 2 features generated using make_blobs.
If n_clusters is set to None, the data is reduced from 100,000 samples to a set of 158 clusters. This can be viewed
as a preprocessing step before the final (global) clustering step that further reduces these 158 clusters to 100 clusters.
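A small sketch of the contrast described above (synthetic blobs as an assumption): with n_clusters=None Birch only produces subclusters, while an integer value triggers the final global clustering step.
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=10000, centers=10, random_state=0)
for n_clusters in (None, 10):
    birch = Birch(threshold=1.7, n_clusters=n_clusters).fit(X)
    print("n_clusters=%s -> %d labels"
          % (n_clusters, np.unique(birch.labels_).size))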
Out:
Birch without global clustering as the final step took 3.72 seconds
n_clusters : 158
Birch with global clustering as the final step took 3.62 seconds
n_clusters : 100
Time taken to run MiniBatchKMeans 3.43 seconds
print(__doc__)
# Compute clustering with Birch with and without the final clustering step
# and plot.
birch_models = [Birch(threshold=1.7, n_clusters=None),
Birch(threshold=1.7, n_clusters=100)]
final_step = ['without global clustering', 'with global clustering']
# Plot result
labels = birch_model.labels_
centroids = birch_model.subcluster_centers_
n_clusters = np.unique(labels).size
print("n_clusters : %d" % n_clusters)
ax = fig.add_subplot(1, 3, ind + 1)
for this_centroid, k, col in zip(centroids, range(n_clusters), colors_):
mask = labels == k
ax.scatter(X[mask, 0], X[mask, 1],
c='w', edgecolor=col, marker='.', alpha=0.5)
if birch_model.n_clusters is None:
ax.scatter(this_centroid[0], this_centroid[1], marker='+',
c='k', s=25)
ax.set_ylim([-25, 25])
ax.set_xlim([-25, 25])
ax.set_autoscaley_on(False)
ax.set_title('Birch %s' % info)
ax = fig.add_subplot(1, 3, 3)
for this_centroid, k, col in zip(mbk.cluster_centers_,
range(n_clusters), colors_):
mask = mbk.labels_ == k
ax.scatter(X[mask, 0], X[mask, 1], marker='.',
c='w', edgecolor=col, alpha=0.5)
ax.scatter(this_centroid[0], this_centroid[1], marker='+',
c='k', s=25)
ax.set_xlim([-25, 25])
ax.set_ylim([-25, 25])
ax.set_title("MiniBatchKMeans")
ax.set_autoscaley_on(False)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Evaluate the ability of k-means initialization strategies to make the algorithm's convergence robust, as measured by
the relative standard deviation of the inertia of the clustering (i.e. the sum of squared distances to the nearest cluster
center).
The first plot shows the best inertia reached for each combination of the model (KMeans or MiniBatchKMeans)
and the init method (init="random" or init="kmeans++") for increasing values of the n_init parameter
that controls the number of initializations.
The second plot demonstrates a single run of the MiniBatchKMeans estimator using init="random" and
n_init=1. This run leads to a bad convergence (local optimum) with estimated centers stuck between ground truth
clusters.
The dataset used for evaluation is a 2D grid of isotropic Gaussian clusters widely spaced.
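A rough sketch of the kind of comparison made here (a much smaller synthetic dataset is assumed): inertia of KMeans for both init strategies across a few n_init values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, centers=16, cluster_std=0.5, random_state=0)
for init in ("k-means++", "random"):
    for n_init in (1, 5, 10):
        km = KMeans(n_clusters=16, init=init, n_init=n_init,
                    random_state=0).fit(X)
        print("%-10s n_init=%-2d inertia=%.1f" % (init, n_init, km.inertia_))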
Out:
Evaluation of KMeans with k-means++ init
Evaluation of KMeans with random init
Evaluation of MiniBatchKMeans with k-means++ init
Evaluation of MiniBatchKMeans with random init
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
random_state = np.random.RandomState(0)
noise = random_state.normal(
scale=scale, size=(n_samples_per_center, centers.shape[1]))
plt.figure()
plots = []
legends = []
cases = [
(KMeans, 'k-means++', {}),
(KMeans, 'random', {}),
(MiniBatchKMeans, 'k-means++', {'max_no_improvement': 3}),
(MiniBatchKMeans, 'random', {'max_no_improvement': 3, 'init_size': 500}),
]
plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Mean inertia for various k-means init across %d runs" % n_runs)
plt.figure()
for k in range(n_clusters):
my_members = km.labels_ == k
color = cm.nipy_spectral(float(k) / n_clusters, 1)
plt.plot(X[my_members, 0], X[my_members, 1], 'o', marker='.', c=color)
cluster_center = km.cluster_centers_[k]
plt.plot(cluster_center[0], cluster_center[1], 'o',
markerfacecolor=color, markeredgecolor='k', markersize=6)
plt.title("Example cluster allocation with a single random init\n"
"with MiniBatchKMeans")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The following plots demonstrate the impact of the number of clusters and number of samples on various clustering
performance evaluation metrics.
Non-adjusted measures such as the V-Measure show a dependency between the number of clusters and the number of
samples: the mean V-Measure of random labeling increases significantly as the number of clusters is closer to the total
number of samples used to compute the measure.
Adjusted-for-chance measures such as ARI display some random variations centered around a mean score of 0.0 for
any number of samples and clusters.
Only adjusted measures can hence safely be used as a consensus index to evaluate the average stability of clustering
algorithms for a given value of k on various overlapping sub-samples of the dataset.
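The point can be checked directly with a few lines (random labelings, as assumed below): the V-measure of two random labelings grows with the number of clusters, while ARI stays near zero.
import numpy as np
from sklearn import metrics
rng = np.random.RandomState(0)
n_samples = 100
for k in (10, 50, 90):
    labels_a = rng.randint(0, k, size=n_samples)
    labels_b = rng.randint(0, k, size=n_samples)
    print("k=%2d  V-measure=%.2f  ARI=%.2f"
          % (k, metrics.v_measure_score(labels_a, labels_b),
             metrics.adjusted_rand_score(labels_a, labels_b)))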
Out:
Computing adjusted_rand_score for 10 values of n_clusters and n_samples=100
done in 0.036s
Computing v_measure_score for 10 values of n_clusters and n_samples=100
done in 0.047s
Computing ami_score for 10 values of n_clusters and n_samples=100
done in 0.377s
Computing mutual_info_score for 10 values of n_clusters and n_samples=100
done in 0.040s
Computing adjusted_rand_score for 10 values of n_clusters and n_samples=1000
done in 0.048s
Computing v_measure_score for 10 values of n_clusters and n_samples=1000
done in 0.061s
Computing ami_score for 10 values of n_clusters and n_samples=1000
done in 0.207s
Computing mutual_info_score for 10 values of n_clusters and n_samples=1000
done in 0.048s
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from time import time
from sklearn import metrics
Both random labelings have the same number of clusters for each possible value in ``n_clusters_range``.
for i, k in enumerate(n_clusters_range):
for j in range(n_runs):
if fixed_n_classes is None:
labels_a = random_labels(low=0, high=k, size=n_samples)
labels_b = random_labels(low=0, high=k, size=n_samples)
scores[i, j] = score_func(labels_a, labels_b)
return scores
score_funcs = [
metrics.adjusted_rand_score,
metrics.v_measure_score,
ami_score,
metrics.mutual_info_score,
]
n_samples = 100
n_clusters_range = np.linspace(2, n_samples, 10).astype(np.int)
plt.figure(1)
plots = []
names = []
for score_func in score_funcs:
print("Computing %s for %d values of n_clusters and n_samples=%d"
% (score_func.__name__, len(n_clusters_range), n_samples))
t0 = time()
scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)
n_samples = 1000
n_clusters_range = np.linspace(2, 100, 10).astype(np.int)
n_classes = 10
plt.figure(2)
plots = []
names = []
for score_func in score_funcs:
print("Computing %s for %d values of n_clusters and n_samples=%d"
% (score_func.__name__, len(n_clusters_range), n_samples))
t0 = time()
scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range,
fixed_n_classes=n_classes)
print("done in %0.3fs" % (time() - t0))
plots.append(plt.errorbar(
n_clusters_range, scores.mean(axis=1), scores.std(axis=1))[0])
names.append(score_func.__name__)
Note: Click here to download the full example code or to run this example in your browser via Binder
We want to compare the performance of the MiniBatchKMeans and KMeans: the MiniBatchKMeans is faster, but
gives slightly different results (see Mini Batch K-Means).
We will cluster a set of data, first with KMeans and then with MiniBatchKMeans, and plot the results. We will also
plot the points that are labelled differently between the two algorithms.
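A condensed sketch of the comparison (assumed blobs, matching the data generated below): fit both estimators, match their centers, and count the points labelled differently.
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin
X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.7, random_state=0)
k_means = KMeans(n_clusters=3, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=45, random_state=0).fit(X)
# Pair each KMeans center with its closest MiniBatchKMeans center
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
k_labels = pairwise_distances_argmin(X, k_means.cluster_centers_)
mbk_labels = pairwise_distances_argmin(X, mbk.cluster_centers_[order])
print("points labelled differently:", np.sum(k_labels != mbk_labels))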
print(__doc__)
import time
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Generate sample data
np.random.seed(0)
batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7)
# #############################################################################
# Compute clustering with KMeans
# #############################################################################
# Compute clustering with MiniBatchKMeans
# #############################################################################
# We want to have the same colors for the same cluster from the
# MiniBatchKMeans and the KMeans algorithm. Let's pair the cluster centers per
# closest one.
k_means_cluster_centers = k_means.cluster_centers_
order = pairwise_distances_argmin(k_means.cluster_centers_,
mbk.cluster_centers_)
mbk_means_cluster_centers = mbk.cluster_centers_[order]
# KMeans
ax = fig.add_subplot(1, 3, 1)
for k, col in zip(range(n_clusters), colors):
my_members = k_means_labels == k
cluster_center = k_means_cluster_centers[k]
ax.plot(X[my_members, 0], X[my_members, 1], 'w',
markerfacecolor=col, marker='.')
ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' % (
t_batch, k_means.inertia_))
# MiniBatchKMeans
ax = fig.add_subplot(1, 3, 2)
for k, col in zip(range(n_clusters), colors):
my_members = mbk_means_labels == k
cluster_center = mbk_means_cluster_centers[k]
ax.plot(X[my_members, 0], X[my_members, 1], 'w',
markerfacecolor=col, marker='.')
ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=6)
ax.set_title('MiniBatchKMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.text(-3.5, 1.8, 'train time: %.2fs\ninertia: %f' %
(t_mini_batch, mbk.inertia_))
for k in range(n_clusters):
different += ((k_means_labels == k) != (mbk_means_labels == k))
identic = np.logical_not(different)
ax.plot(X[identic, 0], X[identic, 1], 'w',
markerfacecolor='#bbbbbb', marker='.')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Out:
________________________________________________________________________________
[Memory] Calling sklearn.cluster._agglomerative.ward_tree...
ward_tree(array([[-0.451933, ..., -0.675318],
...,
[ 0.275706, ..., -1.085711]]),
<1600x1600 sparse matrix of type '<class 'numpy.int64'>'
with 7840 stored elements in COOrdinate format>, n_clusters=None, return_distance=False)
print(__doc__)
import shutil
import tempfile
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg, ndimage
from joblib import Memory
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold
# #############################################################################
# Generate data
n_samples = 200
size = 40 # image size
roi_size = 15
snr = 5.
np.random.seed(0)
mask = np.ones([size, size], dtype=np.bool)
X = np.random.randn(n_samples, size ** 2)
for x in X: # smooth data
x[:] = ndimage.gaussian_filter(x.reshape(size, size), sigma=1.0).ravel()
X -= X.mean(axis=0)
X /= X.std(axis=0)
y = np.dot(X, coef.ravel())
noise = np.random.randn(y.shape[0])
noise_coef = (linalg.norm(y, 2) / np.exp(snr / 20.)) / linalg.norm(noise, 2)
y += noise_coef * noise # add noise
# #############################################################################
# Compute the coefs of a Bayesian Ridge with GridSearch
cv = KFold(2) # cross-validation generator for model selection
ridge = BayesianRidge()
cachedir = tempfile.mkdtemp()
mem = Memory(location=cachedir, verbose=1)
# #############################################################################
# Inverse the transformation to plot the results on an image
plt.close('all')
plt.figure(figsize=(7.3, 2.7))
plt.subplot(1, 3, 1)
plt.imshow(coef, interpolation="nearest", cmap=plt.cm.RdBu_r)
plt.title("True weights")
plt.subplot(1, 3, 2)
plt.imshow(coef_selection_, interpolation="nearest", cmap=plt.cm.RdBu_r)
plt.title("Feature Selection")
plt.subplot(1, 3, 3)
plt.imshow(coef_agglomeration_, interpolation="nearest", cmap=plt.cm.RdBu_r)
plt.title("Feature Agglomeration")
plt.subplots_adjust(0.04, 0.0, 0.98, 0.94, 0.16, 0.26)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In this example we compare the various initialization strategies for K-means in terms of runtime and quality of the
results.
As the ground truth is known here, we also apply different cluster quality metrics to judge the goodness of fit of the
cluster labels to the ground truth.
Cluster quality metrics evaluated (see Clustering performance evaluation for definitions and discussions of the metrics): homogeneity, completeness, V-measure, adjusted Rand index (ARI), adjusted mutual information (AMI) and the silhouette coefficient.
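A minimal sketch of this kind of benchmark (assumed set-up, reduced to two init strategies and a few of the metrics; the example's own bench_k_means helper builds the full table below):

from time import time
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

X, y = load_digits(return_X_y=True)
data = scale(X)
n_digits = len(set(y))

for init in ('k-means++', 'random'):
    t0 = time()
    km = KMeans(init=init, n_clusters=n_digits, n_init=10).fit(data)
    print("%-10s %.2fs inertia=%.0f homo=%.3f ARI=%.3f silhouette=%.3f"
          % (init, time() - t0, km.inertia_,
             metrics.homogeneity_score(y, km.labels_),
             metrics.adjusted_rand_score(y, km.labels_),
             metrics.silhouette_score(data, km.labels_, sample_size=300)))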
print(__doc__)
np.random.seed(42)
sample_size = 300
print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')
# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
name="PCA-based",
data=data)
print(82 * '_')
# #############################################################################
# Visualize the results on PCA-reduced data
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)
# Step size of the mesh. Decrease to increase the quality of the VQ.
# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows characteristics of different linkage methods for hierarchical clustering on datasets that are “interesting” but still in 2D.
The main observations to make are:
• single linkage is fast, and can perform well on non-globular data, but it performs poorly in the presence of noise.
• average and complete linkage perform well on cleanly separated globular clusters, but have mixed results otherwise.
• Ward is the most effective method for noisy data.
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional
data.
print(__doc__)
import time
import warnings
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
Generate datasets. We choose a size big enough to see the scalability of the algorithms, but not so big as to cause overly long running times.
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
plot_num = 1
datasets = [
(noisy_circles, {'n_clusters': 2}),
(noisy_moons, {'n_clusters': 2}),
(varied, {'n_neighbors': 2}),
(aniso, {'n_neighbors': 2}),
(blobs, {}),
(no_structure, {})]
X, y = dataset
# ============
# Create cluster objects
# ============
ward = cluster.AgglomerativeClustering(
n_clusters=params['n_clusters'], linkage='ward')
complete = cluster.AgglomerativeClustering(
n_clusters=params['n_clusters'], linkage='complete')
average = cluster.AgglomerativeClustering(
n_clusters=params['n_clusters'], linkage='average')
single = cluster.AgglomerativeClustering(
n_clusters=params['n_clusters'], linkage='single')
clustering_algorithms = (
('Single Linkage', single),
('Average Linkage', average),
('Complete Linkage', complete),
('Ward Linkage', ward),
)
t1 = time.time()
if hasattr(algorithm, 'labels_'):
y_pred = algorithm.labels_.astype(np.int)
else:
y_pred = algorithm.predict(X)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
6.5.27 Selecting the number of clusters with silhouette analysis on KMeans clustering
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot
displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a
way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
In this example the silhouette analysis is used to choose an optimal value for n_clusters. The silhouette plot shows that n_clusters values of 3, 5 and 6 are bad picks for the given data, due to the presence of clusters with below-average silhouette scores and also due to wide fluctuations in the size of the silhouette plots. Silhouette analysis is more ambivalent in deciding between 2 and 4.
The thickness of the silhouette plot also gives a visual indication of cluster size. The silhouette plot for cluster 0 when n_clusters is equal to 2 is bigger, owing to the grouping of the 3 sub-clusters into one big cluster. However, when n_clusters is equal to 4, all the plots are more or less of similar thickness and hence of similar sizes, as can also be verified from the labelled scatter plot on the right.
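A minimal sketch of the selection loop described above (synthetic blobs chosen here for illustration; the full example below also draws the silhouette and scatter plots):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=1)
for n_clusters in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=n_clusters, random_state=10).fit_predict(X)
    print("n_clusters=%d average silhouette: %.3f"
          % (n_clusters, silhouette_score(X, labels)))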
print(__doc__)
range_n_clusters = [2, 3, 4, 5, 6]
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows characteristics of different clustering algorithms on datasets that are “interesting” but still in 2D.
With the exception of the last dataset, the parameters of each of these dataset-algorithm pairs have been tuned to produce good clustering results. Some algorithms are more sensitive to parameter values than others.
The last dataset is an example of a ‘null’ situation for clustering: the data is homogeneous, and there is no good
clustering. For this example, the null dataset uses the same parameters as the dataset in the row above it, which
represents a mismatch in the parameter values and the data structure.
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional
data.
print(__doc__)
import time
import warnings
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
# ============
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
# ============
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None
# ============
# Set up cluster parameters
# ============
plt.figure(figsize=(9 * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)
plot_num = 1
datasets = [
(noisy_circles, {'damping': .77, 'preference': -240,
'quantile': .2, 'n_clusters': 2,
'min_samples': 20, 'xi': 0.25}),
(noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
(varied, {'eps': .18, 'n_neighbors': 2,
'min_samples': 5, 'xi': 0.035, 'min_cluster_size': .2}),
(aniso, {'eps': .15, 'n_neighbors': 2,
'min_samples': 20, 'xi': 0.1, 'min_cluster_size': .2}),
(blobs, {}),
(no_structure, {})]
X, y = dataset
clustering_algorithms = (
('MiniBatchKMeans', two_means),
('AffinityPropagation', affinity_propagation),
('MeanShift', ms),
('SpectralClustering', spectral),
('Ward', ward),
('AgglomerativeClustering', average_linkage),
('DBSCAN', dbscan),
('OPTICS', optics),
('Birch', birch),
('GaussianMixture', gmm)
)
t1 = time.time()
if hasattr(algorithm, 'labels_'):
plt.xlim(-2.5, 2.5)
plt.ylim(-2.5, 2.5)
plt.xticks(())
plt.yticks(())
plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
transform=plt.gca().transAxes, size=15,
horizontalalignment='right')
plot_num += 1
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The usual covariance maximum likelihood estimate can be regularized using shrinkage. Ledoit and Wolf proposed a closed formula to compute the asymptotically optimal shrinkage parameter (minimizing an MSE criterion), yielding the Ledoit-Wolf covariance estimate.
Chen et al. proposed an improvement of the Ledoit-Wolf shrinkage parameter, the OAS coefficient, whose convergence
is significantly better under the assumption that the data are Gaussian.
This example, inspired by Chen’s publication [1], shows a comparison of the estimated MSE of the LW and OAS
methods, using Gaussian distributed data.
[1] “Shrinkage Algorithms for MMSE Covariance Estimation” Chen et al., IEEE Trans. on Sign. Proc., Volume 58,
Issue 10, October 2010.
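A minimal sketch of the two estimators on a single draw of data (the AR(1) covariance mirrors the snippet below; the full example sweeps n_samples and plots the MSE curves):

import numpy as np
from scipy.linalg import toeplitz, cholesky
from sklearn.covariance import LedoitWolf, OAS

np.random.seed(0)
n_features, n_samples = 100, 30
real_cov = toeplitz(0.1 ** np.arange(n_features))
X = np.dot(np.random.normal(size=(n_samples, n_features)), cholesky(real_cov))

lw = LedoitWolf(store_precision=False, assume_centered=True).fit(X)
oa = OAS(store_precision=False, assume_centered=True).fit(X)
print("LW: shrinkage=%.3f MSE=%.3f"
      % (lw.shrinkage_, lw.error_norm(real_cov, scaling=False)))
print("OAS: shrinkage=%.3f MSE=%.3f"
      % (oa.shrinkage_, oa.error_norm(real_cov, scaling=False)))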
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import toeplitz, cholesky
np.random.seed(0)
n_features = 100
# simulation covariance matrix (AR(1) process)
r = 0.1
real_cov = toeplitz(r ** np.arange(n_features))
coloring_matrix = cholesky(real_cov)
lw = LedoitWolf(store_precision=False, assume_centered=True)
lw.fit(X)
lw_mse[i, j] = lw.error_norm(real_cov, scaling=False)
lw_shrinkage[i, j] = lw.shrinkage_
oa = OAS(store_precision=False, assume_centered=True)
oa.fit(X)
oa_mse[i, j] = oa.error_norm(real_cov, scaling=False)
oa_shrinkage[i, j] = oa.shrinkage_
# plot MSE
plt.subplot(2, 1, 1)
plt.errorbar(n_samples_range, lw_mse.mean(1), yerr=lw_mse.std(1),
label='Ledoit-Wolf', color='navy', lw=2)
plt.errorbar(n_samples_range, oa_mse.mean(1), yerr=oa_mse.std(1),
label='OAS', color='darkorange', lw=2)
plt.ylabel("Squared error")
plt.legend(loc="upper right")
plt.title("Comparison of covariance estimators")
plt.xlim(5, 31)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Using the GraphicalLasso estimator to learn a covariance and sparse precision from a small number of samples.
To estimate a probabilistic model (e.g. a Gaussian model), estimating the precision matrix, that is, the inverse covariance matrix, is as important as estimating the covariance matrix. Indeed, a Gaussian model is parametrized by the precision matrix.
To be in favorable recovery conditions, we sample the data from a model with a sparse inverse covariance matrix. In addition, we ensure that the data are not too correlated (limiting the largest coefficient of the precision matrix) and that there are no small coefficients in the precision matrix that cannot be recovered. Moreover, with a small number of observations, it is easier to recover a correlation matrix than a covariance matrix, so we scale the time series.
Here, the number of samples is slightly larger than the number of dimensions, thus the empirical covariance is still
invertible. However, as the observations are strongly correlated, the empirical covariance matrix is ill-conditioned and
as a result its inverse –the empirical precision matrix– is very far from the ground truth.
If we use l2 shrinkage, as with the Ledoit-Wolf estimator, as the number of samples is small, we need to shrink a lot.
As a result, the Ledoit-Wolf precision is fairly close to the ground truth precision, that is not far from being diagonal,
but the off-diagonal structure is lost.
The l1-penalized estimator can recover part of this off-diagonal structure. It learns a sparse precision. It is not
able to recover the exact sparsity pattern: it detects too many non-zero coefficients. However, the highest non-zero
coefficients of the l1 estimate correspond to the non-zero coefficients in the ground truth. Finally, the coefficients of
the l1 precision estimate are biased toward zero: because of the penalty, they are all smaller than the corresponding
ground truth value, as can be seen on the figure.
Note that the color range of the precision matrices is tweaked to improve readability of the figure. The full range of
values of the empirical precision is not displayed.
The alpha parameter of the GraphicalLasso setting the sparsity of the model is set by internal cross-validation in the
GraphicalLassoCV. As can be seen on figure 2, the grid to compute the cross-validation score is iteratively refined in
the neighborhood of the maximum.
print(__doc__)
# author: Gael Varoquaux <gael.varoquaux@inria.fr>
# License: BSD 3 clause
# Copyright: INRIA
import numpy as np
from scipy import linalg
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLassoCV, ledoit_wolf
import matplotlib.pyplot as plt
# #############################################################################
# Generate the data
n_samples = 60
n_features = 20
prng = np.random.RandomState(1)
prec = make_sparse_spd_matrix(n_features, alpha=.98,
smallest_coef=.4,
largest_coef=.7,
random_state=prng)
cov = linalg.inv(prec)
d = np.sqrt(np.diag(cov))
cov /= d
cov /= d[:, np.newaxis]
prec *= d
prec *= d[:, np.newaxis]
X = prng.multivariate_normal(np.zeros(n_features), cov, size=n_samples)
X -= X.mean(axis=0)
X /= X.std(axis=0)
# #############################################################################
# Estimate the covariance
emp_cov = np.dot(X.T, X) / n_samples
model = GraphicalLassoCV()
model.fit(X)
lw_cov_, _ = ledoit_wolf(X)
lw_prec_ = linalg.inv(lw_cov_)
# #############################################################################
# Plot the results
plt.figure(figsize=(10, 6))
plt.subplots_adjust(left=0.02, right=0.98)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
When working with covariance estimation, the usual approach is to use a maximum likelihood estimator, such as
the sklearn.covariance.EmpiricalCovariance. It is unbiased, i.e. it converges to the true (population)
covariance when given many observations. However, it can also be beneficial to regularize it, in order to reduce
its variance; this, in turn, introduces some bias. This example illustrates the simple regularization used in Shrunk
Covariance estimators. In particular, it focuses on how to set the amount of regularization, i.e. how to choose the
bias-variance trade-off.
Here we compare 3 approaches:
• Setting the parameter by cross-validating the likelihood on three folds according to a grid of potential shrinkage
parameters.
• A closed formula proposed by Ledoit and Wolf to compute the asymptotically optimal regularization parameter (minimizing an MSE criterion), yielding the sklearn.covariance.LedoitWolf covariance estimate.
• An improvement of the Ledoit-Wolf shrinkage, the sklearn.covariance.OAS, proposed by Chen et al.
Its convergence is significantly better under the assumption that the data are Gaussian, in particular for small
samples.
To quantify estimation error, we plot the likelihood of unseen data for different values of the shrinkage parameter. We
also show the choices by cross-validation, or with the LedoitWolf and OAS estimates.
Note that the maximum likelihood estimate corresponds to no shrinkage, and thus performs poorly. The Ledoit-Wolf estimate performs really well, as it is close to optimal and not computationally costly. In this example, the OAS estimate is a bit further from the optimum. Interestingly, both approaches outperform cross-validation, which is significantly more computationally costly.
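A minimal sketch of the three approaches listed above (data generation mirrors the snippet below; settings such as the cv=3 grid are assumptions for illustration):

import numpy as np
from sklearn.covariance import LedoitWolf, OAS, ShrunkCovariance
from sklearn.model_selection import GridSearchCV

np.random.seed(42)
n_features, n_samples = 40, 20
coloring_matrix = np.random.normal(size=(n_features, n_features))
X_train = np.dot(np.random.normal(size=(n_samples, n_features)), coloring_matrix)
X_test = np.dot(np.random.normal(size=(n_samples, n_features)), coloring_matrix)

# 1. shrinkage chosen by cross-validating the likelihood over a grid
shrinkages = np.logspace(-2, 0, 30)
cv = GridSearchCV(ShrunkCovariance(), {'shrinkage': shrinkages}, cv=3).fit(X_train)
# 2. Ledoit-Wolf closed formula and 3. OAS
lw = LedoitWolf().fit(X_train)
oa = OAS().fit(X_train)
print("CV shrinkage: %.3f" % cv.best_estimator_.shrinkage)
print("LW shrinkage: %.3f test log-lik: %.1f" % (lw.shrinkage_, lw.score(X_test)))
print("OAS shrinkage: %.3f test log-lik: %.1f" % (oa.shrinkage_, oa.score(X_test)))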
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg
# #############################################################################
# Generate sample data
n_features, n_samples = 40, 20
np.random.seed(42)
base_X_train = np.random.normal(size=(n_samples, n_features))
base_X_test = np.random.normal(size=(n_samples, n_features))
# Color samples
coloring_matrix = np.random.normal(size=(n_features, n_features))
X_train = np.dot(base_X_train, coloring_matrix)
X_test = np.dot(base_X_test, coloring_matrix)
# #############################################################################
# Compute the likelihood on test data
# under the ground-truth model, which we would not have access to in real
# settings
real_cov = np.dot(coloring_matrix.T, coloring_matrix)
emp_cov = empirical_covariance(X_train)
loglik_real = -log_likelihood(emp_cov, linalg.inv(real_cov))
# #############################################################################
# Compare different approaches to setting the parameter
# #############################################################################
# Plot results
fig = plt.figure()
plt.title("Regularized covariance: likelihood and shrinkage coefficient")
plt.xlabel('Regularization parameter: shrinkage coefficient')
plt.ylabel('Error: negative log-likelihood on test data')
# range shrinkage curve
plt.loglog(shrinkages, negative_logliks, label="Negative log-likelihood")
# adjust view
lik_max = np.amax(negative_logliks)
lik_min = np.amin(negative_logliks)
ymin = lik_min - 6. * np.log((plt.ylim()[1] - plt.ylim()[0]))
ymax = lik_max + 10. * np.log(lik_max - lik_min)
xmin = shrinkages[0]
xmax = shrinkages[-1]
# LW likelihood
plt.vlines(lw.shrinkage_, ymin, -loglik_lw, color='magenta',
linewidth=3, label='Ledoit-Wolf estimate')
# OAS likelihood
plt.vlines(oa.shrinkage_, ymin, -loglik_oa, color='purple',
linewidth=3, label='OAS estimate')
# best CV estimator likelihood
plt.vlines(cv.best_estimator_.shrinkage, ymin,
-cv.best_estimator_.score(X_test), color='cyan',
linewidth=3, label='Cross-validation best estimate')
plt.ylim(ymin, ymax)
plt.xlim(xmin, xmax)
plt.legend()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An example to show covariance estimation with the Mahalanobis distances on Gaussian distributed data.
For Gaussian distributed data, the distance of an observation xᵢ to the mode of the distribution can be computed using its Mahalanobis distance: d(μ, Σ)(xᵢ)² = (xᵢ − μ)′ Σ⁻¹ (xᵢ − μ), where μ and Σ are the location and the covariance of the underlying Gaussian distribution.
In practice, μ and Σ are replaced by some estimates. The usual maximum likelihood estimate of the covariance is very sensitive to the presence of outliers in the data set, and therefore so are the corresponding Mahalanobis distances. It is better to use a robust estimator of covariance to guarantee that the estimation is resistant to “erroneous” observations in the data set and that the associated Mahalanobis distances accurately reflect the true organisation of the observations.
The Minimum Covariance Determinant estimator is a robust, high-breakdown point estimator of covariance (i.e. it can be used to estimate the covariance matrix of highly contaminated datasets, up to (n_samples − n_features − 1) / 2 outliers). The idea is to find (n_samples + n_features + 1) / 2 observations whose empirical covariance has the smallest determinant, yielding a “pure” subset of observations from which to compute standard estimates of location and covariance.
The Minimum Covariance Determinant estimator (MCD) was introduced by P. J. Rousseeuw in [1].
This example illustrates how the Mahalanobis distances are affected by outlying data: observations drawn from a
contaminating distribution are not distinguishable from the observations coming from the real, Gaussian distribution
that one may want to work with. Using MCD-based Mahalanobis distances, the two populations become distinguishable. Associated applications include outlier detection, observation ranking and clustering. For visualization purposes, the cube root of the Mahalanobis distances is represented in the boxplot, as Wilson and Hilferty suggest [2].
[1] P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
[2] Wilson, E. B., & Hilferty, M. M. (1931). The distribution of chi-square. Proceedings of the National
Academy of Sciences of the United States of America, 17, 684-688.
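A minimal sketch of the contrast described above (contamination scheme assumed for illustration; the full example below draws the contours and boxplots):

import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

np.random.seed(0)
n_samples, n_outliers, n_features = 125, 25, 2
X = np.random.randn(n_samples, n_features)
X[-n_outliers:] = 7. * np.random.randn(n_outliers, n_features)  # contaminate

emp_cov = EmpiricalCovariance().fit(X)
robust_cov = MinCovDet().fit(X)
# mean squared Mahalanobis distance of the outlying points under each fit:
# the robust (MCD) fit pushes the outliers much further away
print("MLE: %.1f" % emp_cov.mahalanobis(X[-n_outliers:]).mean())
print("MCD: %.1f" % robust_cov.mahalanobis(X[-n_outliers:]).mean())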
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EmpiricalCovariance, MinCovDet
n_samples = 125
n_outliers = 25
n_features = 2
# generate data
gen_cov = np.eye(n_features)
gen_cov[0, 0] = 2.
X = np.dot(np.random.randn(n_samples, n_features), gen_cov)
# add some outliers
outliers_cov = np.eye(n_features)
outliers_cov[np.arange(1, n_features), np.arange(1, n_features)] = 7.
X[-n_outliers:] = np.dot(np.random.randn(n_outliers, n_features), outliers_cov)
# compare estimators learnt from the full data set with true parameters
emp_cov = EmpiricalCovariance().fit(X)
# #############################################################################
# Display results
fig = plt.figure()
plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)
mahal_emp_cov = emp_cov.mahalanobis(zz)
mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
cmap=plt.cm.PuBu_r,
linestyles='dashed')
mahal_robust_cov = robust_cov.mahalanobis(zz)
mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
cmap=plt.cm.YlOrBr_r, linestyles='dotted')
subfig1.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
inlier_plot, outlier_plot],
['MLE dist', 'robust dist', 'inliers', 'outliers'],
loc="upper right", borderaxespad=0)
plt.xticks(())
plt.yticks(())
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The usual covariance maximum likelihood estimate is very sensitive to the presence of outliers in the data set. In
such a case, it would be better to use a robust estimator of covariance to guarantee that the estimation is resistant to
“erroneous” observations in the data set [1], [2].
The Minimum Covariance Determinant estimator is a robust, high-breakdown point estimator of covariance (i.e. it can be used to estimate the covariance matrix of highly contaminated datasets, up to (n_samples − n_features − 1) / 2 outliers). The idea is to find (n_samples + n_features + 1) / 2 observations whose empirical covariance has the smallest determinant, yielding a “pure” subset of observations from which to compute standard estimates of location and covariance. After a correction step aiming at compensating for the fact that the estimates were learned from only a portion of the initial data, we end up with robust estimates of the data set location and covariance.
The Minimum Covariance Determinant estimator (MCD) was introduced by P. J. Rousseeuw in [3].
Evaluation
In this example, we compare the estimation errors that are made when using various types of location and covariance
estimates on contaminated Gaussian distributed data sets:
• The mean and the empirical covariance of the full dataset, which break down as soon as there are outliers in the
data set
• The robust MCD, which has a low error provided n_samples > 5 × n_features
• The mean and the empirical covariance of the observations that are known to be good ones. This can be considered a “perfect” MCD estimate, so one can trust our implementation by comparing to this case.
References
[1] Johanna Hardin, David M Rocke. The distribution of robust distances. Journal of Computational and Graphical Statistics. December 1, 2005, 14(4): 928-946.
[2] Zoubir A., Koivunen V., Chakhchoukh Y. and Muma M. (2012). Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine.
[3] P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
# example settings
n_samples = 80
n_features = 5
repeat = 10
range_n_outliers = np.concatenate(
(np.linspace(0, n_samples / 8, 5),
np.linspace(n_samples / 8, n_samples / 2, 5)[1:-1])).astype(np.int)
# computation
for i, n_outliers in enumerate(range_n_outliers):
for j in range(repeat):
rng = np.random.RandomState(i * j)
# generate data
X = rng.randn(n_samples, n_features)
# add some outliers
outliers_index = rng.permutation(n_samples)[:n_outliers]
outliers_offset = 10. * \
(np.random.randint(2, size=(n_outliers, n_features)) - 0.5)
X[outliers_index] += outliers_offset
inliers_mask = np.ones(n_samples).astype(bool)
inliers_mask[outliers_index] = False
# compare estimators learned from the full data set with true
# parameters
err_loc_emp_full[i, j] = np.sum(X.mean(0) ** 2)
err_cov_emp_full[i, j] = EmpiricalCovariance().fit(X).error_norm(
np.eye(n_features))
# Display results
font_prop = matplotlib.font_manager.FontProperties(size=11)
plt.subplot(2, 1, 1)
lw = 2
plt.errorbar(range_n_outliers, err_loc_mcd.mean(1),
yerr=err_loc_mcd.std(1) / np.sqrt(repeat),
label="Robust location", lw=lw, color='m')
plt.errorbar(range_n_outliers, err_loc_emp_full.mean(1),
yerr=err_loc_emp_full.std(1) / np.sqrt(repeat),
label="Full data set mean", lw=lw, color='green')
plt.errorbar(range_n_outliers, err_loc_emp_pure.mean(1),
yerr=err_loc_emp_pure.std(1) / np.sqrt(repeat),
label="Pure data set mean", lw=lw, color='black')
plt.title("Influence of outliers on the location estimation")
plt.ylabel(r"Error ($||\mu - \hat{\mu}||_2^2$)")
plt.legend(loc="upper left", prop=font_prop)
plt.subplot(2, 1, 2)
x_size = range_n_outliers.size
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Simple usage of various cross decomposition algorithms:
• PLSCanonical
• PLSRegression, with multivariate response, a.k.a. PLS2
• PLSRegression, with univariate response, a.k.a. PLS1
• CCA
Given 2 multivariate covarying two-dimensional datasets, X and Y, PLS extracts the ‘directions of covariance’, i.e. the components of each dataset that explain the most shared variance between both datasets. This is apparent on the scatterplot matrix display: components 1 in dataset X and dataset Y are maximally correlated (points lie around the first diagonal). This is also true for components 2 in both datasets; however, the correlation across datasets for different components is weak: the point cloud is very spherical.
Out:
Corr(X)
[[ 1. 0.51 0.07 -0.05]
[ 0.51 1. 0.11 -0.01]
[ 0.07 0.11 1. 0.49]
[-0.05 -0.01 0.49 1. ]]
Corr(Y)
[[1. 0.48 0.05 0.03]
[0.48 1. 0.04 0.12]
[0.05 0.04 1. 0.51]
[0.03 0.12 0.51 1. ]]
True B (such that: Y = XB + Err)
[[1 1 1]
[2 2 2]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]
[0 0 0]]
Estimated B
[[ 1. 1. 1. ]
[ 2. 2. 2. ]
[-0. -0. 0. ]
[ 0. 0. 0. ]
[ 0. 0. 0. ]
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSCanonical, PLSRegression, CCA
# #############################################################################
# Dataset based latent variables model
n = 500
# 2 latent variables:
l1 = np.random.normal(size=n)
l2 = np.random.normal(size=n)
X_train = X[:n // 2]
Y_train = Y[:n // 2]
X_test = X[n // 2:]
Y_test = Y[n // 2:]
print("Corr(X)")
print(np.round(np.corrcoef(X.T), 2))
print("Corr(Y)")
print(np.round(np.corrcoef(Y.T), 2))
# #############################################################################
# Canonical (symmetric) PLS
# Transform data
# ~~~~~~~~~~~~~~
plsca = PLSCanonical(n_components=2)
plt.subplot(224)
plt.scatter(X_train_r[:, 1], Y_train_r[:, 1], label="train",
marker="o", c="b", s=25)
plt.scatter(X_test_r[:, 1], Y_test_r[:, 1], label="test",
marker="o", c="r", s=25)
plt.xlabel("x scores")
plt.ylabel("y scores")
plt.title('Comp. 2: X vs Y (test corr = %.2f)' %
np.corrcoef(X_test_r[:, 1], Y_test_r[:, 1])[0, 1])
plt.xticks(())
plt.yticks(())
plt.legend(loc="best")
plt.subplot(223)
plt.scatter(Y_train_r[:, 0], Y_train_r[:, 1], label="train",
marker="*", c="b", s=50)
plt.scatter(Y_test_r[:, 0], Y_test_r[:, 1], label="test",
marker="*", c="r", s=50)
plt.xlabel("Y comp. 1")
plt.ylabel("Y comp. 2")
plt.title('Y comp. 1 vs Y comp. 2 , (test corr = %.2f)'
% np.corrcoef(Y_test_r[:, 0], Y_test_r[:, 1])[0, 1])
# #############################################################################
# PLS regression, with multivariate response, a.k.a. PLS2
n = 1000
q = 3
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
B = np.array([[1, 2] + [0] * (p - 2)] * q).T
# each Yj = 1*X1 + 2*X2 + noise
Y = np.dot(X, B) + np.random.normal(size=n * q).reshape((n, q)) + 5
pls2 = PLSRegression(n_components=3)
pls2.fit(X, Y)
print("True B (such that: Y = XB + Err)")
print(B)
# compare pls2.coef_ with B
print("Estimated B")
print(np.round(pls2.coef_, 1))
pls2.predict(X)
n = 1000
p = 10
X = np.random.normal(size=n * p).reshape((n, p))
y = X[:, 0] + 2 * X[:, 1] + np.random.normal(size=n * 1) + 5
pls1 = PLSRegression(n_components=3)
pls1.fit(X, y)
# note that the number of components exceeds 1 (the dimension of y)
print("Estimated betas")
print(np.round(pls1.coef_, 1))
# #############################################################################
# CCA (PLS mode B with symmetric deflation)
cca = CCA(n_components=2)
cca.fit(X_train, Y_train)
X_train_r, Y_train_r = cca.transform(X_train, Y_train)
X_test_r, Y_test_r = cca.transform(X_test, Y_test)
Note: Click here to download the full example code or to run this example in your browser via Binder
This dataset is made up of 1797 8x8 images. Each image, like the one shown below, is of a hand-written digit. In order
to utilize an 8x8 figure like this, we’d have to first transform it into a feature vector with length 64.
See here for more information about this dataset.
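A minimal sketch of that transformation (not part of the original snippet): flatten each 8x8 image into a 64-dimensional feature vector before handing it to an estimator.

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8)
data = digits.images.reshape((len(digits.images), -1))
print(data.shape)  # (1797, 64)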
print(__doc__)
Note: Click here to download the full example code or to run this example in your browser via Binder
This data set consists of 3 different types of irises’ (Setosa, Versicolour, and Virginica) petal and sepal length, stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.
The below plot uses the first two features. See here for more information on this dataset.
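A minimal sketch of such a plot using only the first two features (sepal length and sepal width); the colouring choices here are illustrative:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # first two features only
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()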
print(__doc__)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot several randomly generated 2D classification datasets. This example illustrates the datasets.make_classification, datasets.make_blobs and datasets.make_gaussian_quantiles functions.
For make_classification, three binary and two multi-class classification datasets are generated, with different
numbers of informative features and clusters per class.
print(__doc__)
plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)
plt.subplot(321)
plt.title("One informative feature, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=1,
plt.subplot(322)
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.subplot(323)
plt.title("Two informative features, two clusters per class",
fontsize='small')
X2, Y2 = make_classification(n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2,
s=25, edgecolor='k')
plt.subplot(324)
plt.title("Multi-class, two informative features, one cluster",
fontsize='small')
X1, Y1 = make_classification(n_features=2, n_redundant=0, n_informative=2,
n_clusters_per_class=1, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.subplot(325)
plt.title("Three blobs", fontsize='small')
X1, Y1 = make_blobs(n_features=2, centers=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.subplot(326)
plt.title("Gaussian divided into three quantiles", fontsize='small')
X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=3)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
s=25, edgecolor='k')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This illustrates the make_multilabel_classification dataset generator. Each sample consists of counts of
two features (up to 50 in total), which are differently distributed in each of two classes.
Points are labeled as follows, where Y means the class is present:
1  2  3  Color
Y  N  N  Red
N  Y  N  Blue
N  N  Y  Yellow
Y  Y  N  Purple
Y  N  Y  Orange
N  Y  Y  Green
Y  Y  Y  Brown
A star marks the expected sample for each class; its size reflects the probability of selecting that class label.
The left and right examples highlight the n_labels parameter: more of the samples in the right plot have 2 or 3
labels.
Note that this two-dimensional example is very degenerate: generally the number of features would be much greater
than the “document length”, while here we have much larger documents than vocabulary. Similarly, with n_classes
> n_features, it is much less likely that a feature distinguishes a particular class.
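A minimal sketch of the generator itself (parameter values here are illustrative; the full example below draws both panels):

from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(n_samples=100, n_features=2, n_classes=3,
                                      n_labels=1, allow_unlabeled=False,
                                      random_state=42)
print(X.shape, Y.shape)  # (100, 2) (100, 3)
print(X[:3])  # per-sample counts of the two features
print(Y[:3])  # binary indicator matrix of the three labels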
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
COLORS = np.array(['!',
'#FF3333', # red
'#0198E1', # blue
'#BF5FFF', # purple
'#FCD116', # yellow
'#FF7216', # orange
'#4DBD33', # green
'#87421F' # brown
])
plot_2d(ax2, n_labels=3)
ax2.set_title('n_labels=3, length=50')
ax2.set_xlim(left=0, auto=True)
ax2.set_ylim(bottom=0, auto=True)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# Predict
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)
Note: Click here to download the full example code or to run this example in your browser via Binder
6.9.3 Plot the decision surface of a decision tree on the iris dataset
Plot the decision surface of a decision tree trained on pairs of features of the iris dataset.
See decision tree for more information on the estimator.
For each pair of iris features, the decision tree learns decision boundaries made of combinations of simple thresholding
rules inferred from the training samples.
We also show the tree structure of a model built on all of the features.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02
# Load data
iris = load_iris()
# Train
clf = DecisionTreeClassifier().fit(X, y)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
plt.xlabel(iris.feature_names[pair[0]])
plt.ylabel(iris.feature_names[pair[1]])
plt.figure()
clf = DecisionTreeClassifier().fit(iris.data, iris.target)
plot_tree(clf, filled=True)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an
idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.
cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities
at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity
of its leaves.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In the following plot, the maximum effective alpha value is removed, because it is the trivial tree with only one node.
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes
the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
For the remainder of this example, we remove the last element in clfs and ccp_alphas, because it is the trivial
tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are kept at their defaults, the tree overfits, leading to 100% training accuracy and 88% testing accuracy. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better. In this example, setting ccp_alpha=0.015 maximizes the testing accuracy.
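The plotting code below expects the training and testing accuracy of each pruned tree; a minimal reconstruction of those two lines (not shown in this extract), using the clfs list fitted above:

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]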
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The decision tree structure can be analysed to gain further insight into the relation between the features and the target to predict. In this example, we show how to retrieve:
• the binary tree structure;
• the depth of each node and whether or not it’s a leaf;
• the nodes that were reached by a sample using the decision_path method;
• the leaf that was reached by a sample using the apply method;
• the rules that were used to predict a sample;
• the decision path shared by a group of samples.
Out:
The binary tree structure has 5 nodes and has the following tree structure:
node=0 test node: go to node 1 if X[:, 3] <= 0.800000011920929 else to node 2.
node=1 leaf node.
node=2 test node: go to node 3 if X[:, 2] <= 4.950000047683716 else to node 4.
node=3 leaf node.
node=4 leaf node.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
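# The fitted classifier is not shown in this extract; a minimal stand-in with
# assumed parameters so that `estimator` is defined for the inspection below.
from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)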
# The decision estimator has an attribute called tree_ which stores the entire
# tree structure and allows access to low level attributes. The binary tree
# tree_ is represented as a number of parallel arrays. The i-th element of each
# array holds information about the node `i`. Node 0 is the tree's root. NOTE:
# Some of the arrays only apply to either leaves or split nodes, resp. In this
# case the values of nodes of the other type are arbitrary!
#
# Among those arrays, we have:
# - left_child, id of the left child of the node
# - right_child, id of the right child of the node
# - feature, feature used for splitting the node
# - threshold, threshold value at the node
#
n_nodes = estimator.tree_.node_count
children_left = estimator.tree_.children_left
children_right = estimator.tree_.children_right
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold
# First let's retrieve the decision path of each sample. The decision_path
# method allows to retrieve the node indicator functions. A non zero element of
# indicator matrix at the position (i, j) indicates that the sample i goes
# through the node j.
node_indicator = estimator.decision_path(X_test)
# Similarly, we can also have the leaves ids reached by each sample.
leave_id = estimator.apply(X_test)
# Now, it's possible to get the tests that were used to predict a sample or
# a group of samples. First, let's make it for the sample.
sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
node_indicator.indptr[sample_id + 1]]
common_node_id = np.arange(n_nodes)[common_nodes]
6.10 Decomposition
Note: Click here to download the full example code or to run this example in your browser via Binder
A plot that compares the various Beta-divergence loss functions supported by the Multiplicative-Update (‘mu’) solver
in sklearn.decomposition.NMF.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition._nmf import _beta_divergence
print(__doc__)
x = np.linspace(0.001, 4, 1000)
y = np.zeros(x.shape)
colors = 'mbgyr'
for j, beta in enumerate((0., 0.5, 1., 1.5, 2.)):
for i, xi in enumerate(x):
y[i] = _beta_divergence(1, xi, 1, beta)
name = "beta = %1.1f" % beta
plt.plot(x, y, label=name, color=colors[j])
plt.xlabel("x")
plt.title("beta-divergence(1, x)")
plt.legend(loc=0)
plt.axis([0, 4, 0, 3])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
np.random.seed(5)
plt.cla()
pca = decomposition.PCA(n_components=3)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Incremental principal component analysis (IPCA) is typically used as a replacement for principal component analysis
(PCA) when the dataset to be decomposed is too large to fit in memory. IPCA builds a low-rank approximation for the
input data using an amount of memory which is independent of the number of input data samples. It is still dependent on the number of input features, but changing the batch size allows for control of memory usage.
This example serves as a visual check that IPCA is able to find a projection of the data similar to PCA (up to a sign flip), while only processing a few samples at a time. This can be considered a “toy example”, as IPCA is intended for large
datasets which do not fit in main memory, requiring incremental approaches.
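A minimal sketch of that check (batch size assumed for illustration; the full example below also plots both projections):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

X = load_iris().data
ipca = IncrementalPCA(n_components=2, batch_size=10)
X_ipca = ipca.fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)
# the two projections agree up to a sign flip of the components
print("mean absolute unsigned difference: %.6f"
      % np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean())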
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
y = iris.target
n_components = 2
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
if "Incremental" in title:
err = np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean()
plt.title(title + " of iris dataset\nMean absolute unsigned error "
"%.6f" % err)
else:
plt.title(title + " of iris dataset")
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.axis([-4, 4, -1.5, 1.5])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The Iris dataset represents 3 kinds of Iris flowers (Setosa, Versicolour and Virginica) with 4 attributes: sepal length, sepal width, petal length and petal width.
Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In
particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
Out:
explained variance ratio (first two components): [0.92461872 0.05306648]
print(__doc__)
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
lda = LinearDiscriminantAnalysis(n_components=2)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
# #############################################################################
# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)
# Compute ICA
ica = FastICA(n_components=3)
S_ = ica.fit_transform(X) # Reconstruct signals
A_ = ica.mixing_ # Get estimated mixing matrix
# We can `prove` that the ICA model applies by reverting the unmixing.
assert np.allclose(X, np.dot(S_, A_.T) + ica.mean_)
# #############################################################################
# Plot results
plt.figure()
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
These figures aid in illustrating how a point cloud can be very flat in one direction–which is where PCA comes in to
choose a direction that is not flat.
print(__doc__)
# #############################################################################
# Create the data
e = np.exp(1)
np.random.seed(4)
def pdf(x):
return 0.5 * (stats.norm(scale=0.25 / e).pdf(x)
+ stats.norm(scale=4 / e).pdf(x))
y = np.random.normal(scale=0.5, size=(30000))
x = np.random.normal(scale=0.5, size=(30000))
z = np.random.normal(scale=0.1, size=len(x))
density *= pdf_z
a = x + y
b = 2 * y
c = a - b + z
# #############################################################################
# Plot the figures
def plot_figs(fig_num, elev, azim):
fig = plt.figure(fig_num, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=elev, azim=azim)
pca = PCA(n_components=3)
pca.fit(Y)
pca_score = pca.explained_variance_ratio_
V = pca.components_
elev = -40
azim = -80
plot_figs(1, elev, azim)
elev = 30
azim = 20
plot_figs(2, elev, azim)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example gives a visual comparison, in the feature space, of the results of two different component analysis techniques: Independent Component Analysis (ICA) vs Principal Component Analysis (PCA).
Representing ICA in the feature space gives the view of ‘geometric ICA’: ICA is an algorithm that finds directions in
the feature space corresponding to projections with high non-Gaussianity. These directions need not be orthogonal in
the original feature space, but they are orthogonal in the whitened feature space, in which all directions correspond to
the same variance.
PCA, on the other hand, finds orthogonal directions in the raw feature space that correspond to directions accounting
for maximum variance.
Here we simulate independent sources using a highly non-Gaussian process: two Student’s t distributions with a low number of degrees of freedom (top left figure). We mix them to create observations (top right figure). In this raw observation space,
directions identified by PCA are represented by orange vectors. We represent the signal in the PCA space, after
whitening by the variance corresponding to the PCA vectors (lower left). Running ICA corresponds to finding a
rotation in this space to identify the directions of largest non-Gaussianity (lower right).
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Generate sample data
rng = np.random.RandomState(42)
S = rng.standard_t(1.5, size=(20000, 2))
S[:, 0] *= 2.
# Mix data
A = np.array([[1, 1], [0, 2]]) # Mixing matrix
pca = PCA()
S_pca_ = pca.fit(X).transform(X)
ica = FastICA(random_state=rng)
S_ica_ /= S_ica_.std(axis=0)
# #############################################################################
# Plot results
plt.hlines(0, -3, 3)
plt.vlines(0, -3, 3)
plt.xlim(-3, 3)
plt.ylim(-3, 3)
plt.xlabel('x')
plt.ylabel('y')
plt.figure()
plt.subplot(2, 2, 1)
plot_samples(S / S.std())
plt.title('True Independent Sources')
plt.title('Observations')
plt.subplot(2, 2, 3)
plot_samples(S_pca_ / np.std(S_pca_, axis=0))
plt.title('PCA recovered signals')
plt.subplot(2, 2, 4)
plot_samples(S_ica_ / np.std(S_ica_))
plt.title('ICA recovered signals')
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows that Kernel PCA is able to find a projection of the data that makes data linearly separable.
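A minimal sketch of the idea (kernel and gamma chosen here for illustration): two concentric circles are not linearly separable in the original space, but become so along the first component of an RBF kernel PCA projection.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

np.random.seed(0)
X, y = make_circles(n_samples=400, factor=.3, noise=.05)
kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=10)
X_kpca = kpca.fit_transform(X)           # projection where the classes separate
X_back = kpca.inverse_transform(X_kpca)  # map back to the original space
print(X_kpca.shape, X_back.shape)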
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
# Plot results
plt.figure()
plt.subplot(2, 2, 1, aspect='equal')
plt.title("Original space")
reds = y == 0
blues = y == 1
plt.subplot(2, 2, 2, aspect='equal')
plt.scatter(X_pca[reds, 0], X_pca[reds, 1], c="red",
s=20, edgecolor='k')
plt.scatter(X_pca[blues, 0], X_pca[blues, 1], c="blue",
s=20, edgecolor='k')
plt.title("Projection by PCA")
plt.xlabel("1st principal component")
plt.ylabel("2nd component")
plt.subplot(2, 2, 3, aspect='equal')
plt.scatter(X_kpca[reds, 0], X_kpca[reds, 1], c="red",
s=20, edgecolor='k')
plt.scatter(X_kpca[blues, 0], X_kpca[blues, 1], c="blue",
s=20, edgecolor='k')
plt.title("Projection by KPCA")
plt.xlabel(r"1st principal component in space induced by $\phi$")
plt.ylabel("2nd component")
plt.subplot(2, 2, 4, aspect='equal')
plt.scatter(X_back[reds, 0], X_back[reds, 1], c="red",
s=20, edgecolor='k')
plt.scatter(X_back[blues, 0], X_back[blues, 1], c="blue",
s=20, edgecolor='k')
plt.title("Original space after inverse transform")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
6.10.9 Model selection with Probabilistic PCA and Factor Analysis (FA)
Probabilistic PCA and Factor Analysis are probabilistic models. The consequence is that the likelihood of new data
can be used for model selection and covariance estimation. Here we compare PCA and FA with cross-validation on
low rank data corrupted with homoscedastic noise (noise variance is the same for each feature) or heteroscedastic noise (noise variance is different for each feature). In a second step we compare the model likelihood to the likelihoods
obtained from shrinkage covariance estimators.
One can observe that with homoscedastic noise both FA and PCA succeed in recovering the size of the low rank
subspace. The likelihood with PCA is higher than FA in this case. However PCA fails and overestimates the rank
when heteroscedastic noise is present. Under appropriate circumstances the low rank models are more likely than
shrinkage models.
The automatic estimation from Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604 by Thomas P.
Minka is also compared.
Out:
best n_components by PCA CV = 10
best n_components by FactorAnalysis CV = 10
best n_components by PCA MLE = 10
best n_components by PCA CV = 35
best n_components by FactorAnalysis CV = 10
best n_components by PCA MLE = 38
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg
# #############################################################################
# Create the data
# #############################################################################
# Fit the models
def compute_scores(X):
pca = PCA(svd_solver='full')
fa = FactorAnalysis()
def shrunk_cov_score(X):
    shrinkages = np.logspace(-2, 0, 30)
    cv = GridSearchCV(ShrunkCovariance(), {'shrinkage': shrinkages})
    return np.mean(cross_val_score(cv.fit(X).best_estimator_, X))


def lw_score(X):
    return np.mean(cross_val_score(LedoitWolf(), X))
plt.figure()
plt.plot(n_components, pca_scores, 'b', label='PCA scores')
plt.plot(n_components, fa_scores, 'r', label='FA scores')
plt.axvline(rank, color='g', label='TRUTH: %d' % rank, linestyle='-')
plt.axvline(n_components_pca, color='b',
label='PCA CV: %d' % n_components_pca, linestyle='--')
plt.axvline(n_components_fa, color='r',
label='FactorAnalysis CV: %d' % n_components_fa,
linestyle='--')
plt.axvline(n_components_pca_mle, color='k',
label='PCA MLE: %d' % n_components_pca_mle, linestyle='--')
plt.xlabel('nb of components')
plt.ylabel('CV scores')
plt.legend(loc='lower right')
plt.title(title)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Transform a signal as a sparse combination of Ricker wavelets. This example visually compares different sparse coding
methods using the sklearn.decomposition.SparseCoder estimator. The Ricker (also known as Mexican
hat or the second derivative of a Gaussian) is not a particularly good kernel to represent piecewise constant signals
like this one. It can therefore be seen how much adding atoms of different widths matters, which motivates
learning the dictionary to best fit your type of signals.
The richer dictionary on the right is not larger in size; heavier subsampling is performed in order to stay within the same
order of magnitude.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
resolution = 1024
subsampling = 3 # subsampling factor
width = 100
n_components = resolution // subsampling
# Generate a signal
y = np.linspace(0, resolution - 1, resolution)
first_quarter = y < resolution / 4
y[first_quarter] = 3.
y[np.logical_not(first_quarter)] = -1.
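# Hedged reconstruction of the omitted dictionary construction and estimator
# list; the wavelet widths, estimator settings and colours are illustrative.
from sklearn.decomposition import SparseCoder


def ricker_function(resolution, center, width):
    """Discrete, sub-sampled Ricker (Mexican hat) wavelet."""
    x = np.linspace(0, resolution - 1, resolution)
    x = ((2 / (np.sqrt(3 * width) * np.pi ** .25))
         * (1 - (x - center) ** 2 / width ** 2)
         * np.exp(-(x - center) ** 2 / (2 * width ** 2)))
    return x


def ricker_matrix(width, resolution, n_components):
    """Dictionary of Ricker wavelets of a single width, one row per atom."""
    centers = np.linspace(0, resolution - 1, n_components)
    D = np.empty((n_components, resolution))
    for i, center in enumerate(centers):
        D[i] = ricker_function(resolution, center, width)
    D /= np.sqrt(np.sum(D ** 2, axis=1))[:, np.newaxis]
    return D


D_fixed = ricker_matrix(width=width, resolution=resolution,
                        n_components=n_components)
D_multi = np.r_[tuple(ricker_matrix(width=w, resolution=resolution,
                                    n_components=n_components // 5)
                      for w in (10, 50, 100, 500, 1000))]

# (title, transform_algorithm, transform_alpha, transform_n_nonzero_coefs, color)
estimators = [('OMP', 'omp', None, 15, 'navy'),
              ('Lasso', 'lasso_lars', 2, None, 'turquoise')]
lw = 2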
plt.figure(figsize=(13, 6))
for subplot, (D, title) in enumerate(zip((D_fixed, D_multi),
                                         ('fixed width', 'multiple widths'))):
    plt.subplot(1, 2, subplot + 1)
    plt.title('Sparse coding against %s dictionary' % title)
    plt.plot(y, lw=lw, linestyle='--', label='Original signal')
    # Do a wavelet approximation
    for title, algo, alpha, n_nonzero, color in estimators:
        coder = SparseCoder(dictionary=D, transform_n_nonzero_coefs=n_nonzero,
                            transform_alpha=alpha, transform_algorithm=algo)
        x = coder.transform(y.reshape(1, -1))
        density = len(np.flatnonzero(x))
        x = np.ravel(np.dot(x, D))
        squared_error = np.sum((y - x) ** 2)
        plt.plot(x, color=color, lw=lw,
                 label='%s: %s nonzero coefs,\n%.2f error'
                 % (title, density, squared_error))
Note: Click here to download the full example code or to run this example in your browser via Binder
An example comparing the reconstruction of noisy fragments of a raccoon face image, first using online
Dictionary Learning and then various transform methods.
The dictionary is fitted on the distorted left half of the image, and subsequently used to reconstruct the right half. Note
that even better performance could be achieved by fitting to an undistorted (i.e. noiseless) image, but here we start
from the assumption that it is not available.
A common practice for evaluating the results of image denoising is by looking at the difference between the recon-
struction and the original image. If the reconstruction is perfect this will look like Gaussian noise.
It can be seen from the plots that the result of Orthogonal Matching Pursuit (OMP) with two non-zero coefficients is
a bit less biased than when keeping only one (the edges look less prominent). It is in addition closer to the ground
truth in Frobenius norm.
The result of Least Angle Regression is much more strongly biased: the difference is reminiscent of the local intensity
value of the original image.
Thresholding is clearly not useful for denoising, but it is here to show that it can produce a suggestive output with
very high speed, and thus be useful for other tasks such as object classification, where performance is not necessarily
related to visualisation.
Out:
Distorting image...
Extracting reference patches...
done in 0.03s.
Learning the dictionary...
done in 2.82s.
Extracting noisy patches...
done in 0.00s.
Orthogonal Matching Pursuit
1 atom...
done in 0.99s.
Orthogonal Matching Pursuit
2 atoms...
done in 2.17s.
Least-angle regression
5 atoms...
done in 17.40s.
Thresholding
alpha=0.1...
done in 0.12s.
print(__doc__)
# Extract all reference patches from the left half of the image
print('Extracting reference patches...')
t0 = time()
patch_size = (7, 7)
data = extract_patches_2d(distorted[:, :width // 2], patch_size)
data = data.reshape(data.shape[0], -1)
data -= np.mean(data, axis=0)
data /= np.std(data, axis=0)
print('done in %.2fs.' % (time() - t0))
# #############################################################################
# Learn the dictionary from reference patches
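# Hedged reconstruction of the omitted fitting step; the n_components, alpha and
# n_iter values are illustrative.
from sklearn.decomposition import MiniBatchDictionaryLearning

print('Learning the dictionary...')
t0 = time()
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1, n_iter=500)
V = dico.fit(data).components_
dt = time() - t0
print('done in %.2fs.' % dt)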
plt.figure(figsize=(4.2, 4))
for i, comp in enumerate(V[:100]):
    plt.subplot(10, 10, i + 1)
    plt.imshow(comp.reshape(patch_size), cmap=plt.cm.gray_r,
               interpolation='nearest')
    plt.xticks(())
    plt.yticks(())
plt.suptitle('Dictionary learned from face patches\n' +
             'Train time %.1fs on %d patches' % (dt, len(data)),
             fontsize=16)
plt.subplots_adjust(0.08, 0.02, 0.92, 0.85, 0.08, 0.23)
# #############################################################################
# Display the distorted image
# #############################################################################
# Extract noisy patches and reconstruct them using the dictionary
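# Hedged reconstruction of the omitted patch extraction from the noisy right half
# (`distorted`, `width` and `patch_size` come from the omitted setup above):
print('Extracting noisy patches...')
t0 = time()
data = extract_patches_2d(distorted[:, width // 2:], patch_size)
data = data.reshape(data.shape[0], -1)
intercept = np.mean(data, axis=0)
data -= intercept
print('done in %.2fs.' % (time() - t0))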
transform_algorithms = [
('Orthogonal Matching Pursuit\n1 atom', 'omp',
{'transform_n_nonzero_coefs': 1}),
('Orthogonal Matching Pursuit\n2 atoms', 'omp',
{'transform_n_nonzero_coefs': 2}),
('Least-angle regression\n5 atoms', 'lars',
{'transform_n_nonzero_coefs': 5}),
('Thresholding\n alpha=0.1', 'threshold', {'transform_alpha': .1})]
reconstructions = {}
for title, transform_algorithm, kwargs in transform_algorithms:
    print(title + '...')
    reconstructions[title] = face.copy()
    t0 = time()
    dico.set_params(transform_algorithm=transform_algorithm, **kwargs)
    code = dico.transform(data)
    patches = np.dot(code, V)
    patches += intercept
    patches = patches.reshape(len(data), *patch_size)
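    # Hedged reconstruction of the omitted end of the loop: stitch the processed
    # patches back into the right half of the image and report the elapsed time.
    # `height` is the image height from the omitted setup; reconstruct_from_patches_2d
    # comes from sklearn.feature_extraction.image, imported in the omitted setup.
    if transform_algorithm == 'threshold':
        patches -= patches.min()
        patches /= patches.max()
    reconstructions[title][:, width // 2:] = reconstruct_from_patches_2d(
        patches, (height, width // 2))
    dt = time() - t0
    print('done in %.2fs.' % dt)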
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example applies to The Olivetti faces dataset different unsupervised matrix decomposition (dimension reduction)
methods from the module sklearn.decomposition (see the documentation chapter Decomposing signals in
components (matrix factorization problems)) .
Out:
Dataset consists of 400 faces
Extracting the top 6 Eigenfaces - PCA using randomized SVD...
done in 0.033s
Extracting the top 6 Non-negative components - NMF...
done in 0.132s
Extracting the top 6 Independent components - FastICA...
/home/circleci/project/sklearn/decomposition/_fastica.py:119: ConvergenceWarning: FastICA did not converge. Consider increasing tolerance or the maximum number of iterations.
  ConvergenceWarning)
done in 0.130s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 0.666s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 0.428s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.133s
Extracting the top 6 Factor Analysis components - FA...
done in 0.133s
Extracting the top 6 Dictionary learning...
done in 0.415s
Extracting the top 6 Dictionary learning - positive dictionary...
done in 0.449s
Extracting the top 6 Dictionary learning - positive code...
done in 0.130s
Extracting the top 6 Dictionary learning - positive dictionary & code...
print(__doc__)
import logging
from time import time
# #############################################################################
# Load faces data
faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True,
random_state=rng)
n_samples, n_features = faces.shape
# global centering
faces_centered = faces - faces.mean(axis=0)
# local centering
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)
# #############################################################################
# List of the different estimators, whether to center and transpose the
# problem, and whether the transformer uses the clustering API.
estimators = [
('Eigenfaces - PCA using randomized SVD',
decomposition.PCA(n_components=n_components, svd_solver='randomized',
whiten=True),
True),
('MiniBatchDictionaryLearning',
decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,
n_iter=50, batch_size=3,
random_state=rng),
True),
# #############################################################################
# Plot a sample of the input data
# #############################################################################
# Do the estimation and plot it
plt.show()
# #############################################################################
# Various positivity constraints applied to dictionary learning.
estimators = [
('Dictionary learning',
decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,
n_iter=50, batch_size=3,
random_state=rng),
True),
('Dictionary learning - positive dictionary',
decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,
n_iter=50, batch_size=3,
random_state=rng,
positive_dict=True),
True),
('Dictionary learning - positive code',
decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,
n_iter=50, batch_size=3,
fit_algorithm='cd',
random_state=rng,
positive_code=True),
True),
('Dictionary learning - positive dictionary & code',
decomposition.MiniBatchDictionaryLearning(n_components=15, alpha=0.1,
n_iter=50, batch_size=3,
fit_algorithm='cd',
random_state=rng,
positive_dict=True,
positive_code=True),
True),
]
# #############################################################################
# Plot a sample of the input data
# #############################################################################
# Do the estimation and plot it
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows the use of forests of trees to evaluate the importance of the pixels in an image classification task
(faces). The hotter the pixel, the more important.
The code below also illustrates how the construction and the computation of the predictions can be parallelized within
multiple jobs.
print(__doc__)
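# Hedged reconstruction of the omitted setup: the dataset, the number of jobs and
# the forest parameters below are illustrative.
from time import time
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

data = fetch_olivetti_faces()
X, y = data.data, data.target

forest = ExtraTreesClassifier(n_estimators=1000, max_features=128,
                              n_jobs=-1, random_state=0)
t0 = time()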
forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)
Note: Click here to download the full example code or to run this example in your browser via Binder
A decision tree is boosted using the AdaBoost.R2 algorithm on a 1D sinusoidal dataset with a small amount of
Gaussian noise. 299 boosts (300 decision trees) are compared with a single decision tree regressor. As the number of
boosts is increased the regressor can fit more detail.
print(__doc__)
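# Hedged reconstruction of the omitted imports, toy data and first regressor;
# the sinusoidal target and noise level are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])

regr_1 = DecisionTreeRegressor(max_depth=4)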
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)
regr_1.fit(X, y)
regr_2.fit(X, y)
# Predict
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
# Training classifiers
reg1 = GradientBoostingRegressor(random_state=1, n_estimators=10)
reg2 = RandomForestRegressor(random_state=1, n_estimators=10)
reg3 = LinearRegression()
ereg = VotingRegressor([('gb', reg1), ('rf', reg2), ('lr', reg3)])
reg1.fit(X, y)
reg2.fit(X, y)
reg3.fit(X, y)
ereg.fit(X, y)
xt = X[:20]
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows the use of forests of trees to evaluate the importance of features on an artificial classification
task. The red bars are the feature importances of the forest, along with their inter-tree variability.
As expected, the plot suggests that 3 features are informative, while the remaining ones are not.
Out:
Feature ranking:
1. feature 1 (0.295902)
2. feature 2 (0.208351)
3. feature 0 (0.177632)
4. feature 3 (0.047121)
5. feature 6 (0.046303)
6. feature 8 (0.046013)
7. feature 7 (0.045575)
8. feature 4 (0.044614)
9. feature 9 (0.044577)
10. feature 5 (0.043912)
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
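# Hedged reconstruction of the omitted data and forest; parameter values are
# illustrative. The dataset has 10 features of which only 3 are informative,
# matching the description above.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)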
forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Note: Click here to download the full example code or to run this example in your browser via Binder
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively
produces shorter path lengths for particular samples, they are highly likely to be anomalies.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
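# Hedged reconstruction of the omitted training data and model fit; the sample
# sizes and max_samples setting are illustrative.
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)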
# plot the line, the samples, and the nearest vectors to the plane
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the decision boundaries of a VotingClassifier for two features of the Iris dataset.
Plot the class probabilities of the first sample in a toy dataset predicted by three different classifiers and averaged by
the VotingClassifier.
First, three exemplary classifiers are initialized (DecisionTreeClassifier, KNeighborsClassifier, and
SVC) and used to initialize a soft-voting VotingClassifier with weights [2, 1, 2], which means that the
predicted probabilities of the DecisionTreeClassifier and SVC each count 2 times as much as those of
the KNeighborsClassifier when the averaged probability is calculated.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
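# Hedged reconstruction of the omitted imports and data loading: two features of
# the iris dataset (the chosen feature columns are illustrative).
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target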
# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(gamma=.1, kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2),
                                    ('svc', clf3)],
                        voting='soft', weights=[2, 1, 2])
clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An example to compare multi-output regression with random forest and the multioutput.MultiOutputRegressor meta-
estimator.
This example illustrates the use of the multioutput.MultiOutputRegressor meta-estimator to perform multi-output re-
gression. A random forest regressor is used, which supports multi-output regression natively, so the results can be
compared.
The random forest regressor will only ever predict values within the range of observations or closer to zero for each of
the targets. As a result the predictions are biased towards the centre of the circle.
Using a single underlying feature the model learns both the x and y coordinate as output.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
max_depth = 30
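# Hedged reconstruction of the omitted toy data: a single input feature mapped to
# two outputs lying on a circle (constants and split sizes are illustrative).
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(600, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y += (0.5 - rng.rand(*y.shape))
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=400,
                                                    test_size=200,
                                                    random_state=4)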
regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators=100,
                                                          max_depth=max_depth,
                                                          random_state=0))
regr_multirf.fit(X_train, y_train)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows how quantile regression can be used to create prediction intervals.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
def f(x):
"""The function to predict."""
return x * np.sin(x)
#----------------------------------------------------------------------
# First the noiseless case
X = np.atleast_2d(np.random.uniform(0, 10.0, size=100)).T
X = X.astype(np.float32)
# Observations
y = f(X).ravel()
alpha = 0.95
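# Hedged reconstruction of the omitted estimator and evaluation grid; the
# gradient boosting hyper-parameters are illustrative.
from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)
xx = np.atleast_2d(np.linspace(0, 10, 1000)).T
xx = xx.astype(np.float32)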
clf.fit(X, y)
y_upper = clf.predict(xx)
clf.set_params(alpha=1.0 - alpha)
clf.fit(X, y)
y_lower = clf.predict(xx)
clf.set_params(loss='ls')
clf.fit(X, y)
y_pred = clf.predict(xx)
# Plot the function, the prediction and the 90% confidence interval based on
# the MSE
fig = plt.figure()
plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=10, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Prediction')
plt.plot(xx, y_upper, 'k-')
plt.plot(xx, y_lower, 'k-')
plt.fill(np.concatenate([xx, xx[::-1]]),
np.concatenate([y_upper, y_lower[::-1]]),
alpha=.5, fc='b', ec='None', label='90% prediction interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Illustration of the effect of different regularization strategies for Gradient Boosting. The example is taken from Hastie
et al. 2009.
The loss function used is binomial deviance. Regularization via shrinkage (learning_rate < 1.0) improves
performance considerably. In combination with shrinkage, stochastic gradient boosting (subsample < 1.0) can
produce more accurate models by reducing the variance via bagging. Subsampling without shrinkage usually does
poorly. Another strategy to reduce the variance is by subsampling the features analogous to the random splits in
Random Forests (via the max_features parameter).
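The settings compared in the example can be sketched as a list of (label, parameter overrides) pairs; the values below are illustrative and, in the full example, each entry is merged into a shared parameter dict before fitting one GradientBoostingClassifier per curve.
# Hedged sketch of the compared regularization settings:
settings = [
    ('No shrinkage', {'learning_rate': 1.0, 'subsample': 1.0}),
    ('learning_rate=0.1', {'learning_rate': 0.1, 'subsample': 1.0}),
    ('subsample=0.5', {'learning_rate': 1.0, 'subsample': 0.5}),
    ('learning_rate=0.1, subsample=0.5', {'learning_rate': 0.1,
                                          'subsample': 0.5}),
    ('learning_rate=0.1, max_features=2', {'learning_rate': 0.1,
                                           'max_features': 2}),
]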
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
'min_samples_split': 5}
plt.figure()
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)
plt.legend(loc='upper left')
plt.xlabel('Boosting Iterations')
plt.ylabel('Test Set Deviance')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the class probabilities of the first sample in a toy dataset predicted by three different classifiers and averaged by
the VotingClassifier.
First, three exemplary classifiers are initialized (LogisticRegression, GaussianNB, and
RandomForestClassifier) and used to initialize a soft-voting VotingClassifier with weights [1, 1,
5], which means that the predicted probabilities of the RandomForestClassifier count 5 times as much as
those of the other classifiers when the averaged probability is calculated.
To visualize the probability weighting, we fit each classifier on the training set and plot the predicted class probabilities
for the first sample in this example dataset.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
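# Hedged reconstruction of the omitted classifiers, toy data and fitted
# probabilities; the data values and random seeds are illustrative, while the
# weights [1, 1, 5] follow the description above.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(max_iter=1000, random_state=123)
clf2 = GaussianNB()
clf3 = RandomForestClassifier(n_estimators=100, random_state=123)
X = np.array([[-1.0, -1.0], [-1.2, -1.4], [-3.4, -2.2], [1.1, 1.2]])
y = np.array([1, 1, 2, 2])
eclf = VotingClassifier(estimators=[('lr', clf1), ('gnb', clf2), ('rf', clf3)],
                        voting='soft', weights=[1, 1, 5])

# predicted class probabilities for the first sample, per classifier
probas = [c.fit(X, y).predict_proba(X) for c in (clf1, clf2, clf3, eclf)]
class1_1 = [pr[0, 0] for pr in probas]
class2_1 = [pr[0, 1] for pr in probas]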
# plotting
N = 4 # number of groups
ind = np.arange(N) # group positions
width = 0.35 # bar width
fig, ax = plt.subplots()
# plot annotations
plt.axvline(2.8, color='k', linestyle='dashed')
ax.set_xticks(ind + width)
ax.set_xticklabels(['LogisticRegression\nweight 1',
'GaussianNB\nweight 1',
'RandomForestClassifier\nweight 5',
'VotingClassifier\n(average probabilities)'],
rotation=40,
ha='right')
plt.ylim([0, 1])
plt.title('Class probabilities for sample 1 by different classifiers')
plt.legend([p1[0], p2[0]], ['class 1', 'class 2'], loc='upper left')
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Out:
MSE: 6.4961
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Load data
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# #############################################################################
# Fit regression model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)
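# Hedged reconstruction of the omitted per-stage test deviance used by the plot below:
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)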
# #############################################################################
# Plot training deviance
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')
# #############################################################################
# Plot feature importance
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The RandomForestClassifier is trained using bootstrap aggregation, where each new tree is fit from a boot-
strap sample of the training observations 𝑧𝑖 = (𝑥𝑖 , 𝑦𝑖 ). The out-of-bag (OOB) error is the average error for each 𝑧𝑖
calculated using predictions from the trees that do not contain 𝑧𝑖 in their respective bootstrap sample. This allows the
RandomForestClassifier to be fit and validated whilst being trained [1].
The example below demonstrates how the OOB error can be measured at the addition of each new tree during train-
ing. The resulting plot allows a practitioner to approximate a suitable value of n_estimators at which the error
stabilizes.
print(__doc__)
[1] T. Hastie, R. Tibshirani and J. Friedman, "Elements of Statistical Learning Ed. 2", p. 592-593, Springer, 2009.
RANDOM_STATE = 123
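# Hedged sketch of the omitted OOB-error computation: classifiers are grown with
# warm_start so trees are added incrementally, and oob_score=True makes the OOB
# error available after each fit. The dataset and parameter grids are illustrative.
from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, n_informative=15,
                           n_clusters_per_class=1, random_state=RANDOM_STATE)

ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
     RandomForestClassifier(warm_start=True, oob_score=True,
                            max_features="sqrt", random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
     RandomForestClassifier(warm_start=True, oob_score=True,
                            max_features=None, random_state=RANDOM_STATE)),
]
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

min_estimators = 15
max_estimators = 175
for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)
        error_rate[label].append((i, 1 - clf.oob_score_))

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)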
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example fits an AdaBoosted decision stump on a non-linearly separable classification dataset composed of two
“Gaussian quantiles” clusters (see sklearn.datasets.make_gaussian_quantiles) and plots the decision
boundary and decision scores. The distributions of decision scores are shown separately for samples of class A and B.
The predicted class label for each sample is determined by the sign of the decision score. Samples with decision scores
greater than zero are classified as B, and are otherwise classified as A. The magnitude of a decision score determines
the degree of likeness with the predicted class label. Additionally, a new dataset could be constructed containing a
desired purity of class B, for example, by only selecting samples with a decision score above some value.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
n_samples=200, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
                                 n_samples=300, n_features=2,
                                 n_classes=2, random_state=1)
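# Hedged reconstruction of the omitted dataset assembly and boosted stump; the
# second blob is relabelled so the two clusters overlap, and the AdaBoost
# parameters are illustrative.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))

bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)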
bdt.fit(X, y)
plot_colors = "br"
plot_step = 0.02
class_names = "AB"
plt.figure(figsize=(10, 5))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")
plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
RandomTreesEmbedding provides a way to map data to a very high-dimensional, sparse representation, which might
be beneficial for classification. The mapping is completely unsupervised and very efficient.
This example visualizes the partitions given by several trees and shows how the transformation can also be used for
non-linear dimensionality reduction or non-linear classification.
Neighboring points often share the same leaf of a tree and therefore share large parts of their hashed representation.
This makes it possible to separate two concentric circles simply based on the principal components of the transformed
data with truncated SVD.
In high-dimensional spaces, linear classifiers often achieve excellent accuracy. For sparse binary data, BernoulliNB is
particularly well-suited. The bottom row compares the decision boundary obtained by BernoulliNB in the transformed
space with an ExtraTreesClassifier forest learned on the original data.
import numpy as np
import matplotlib.pyplot as plt
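# Hedged reconstruction of the omitted transformation pipeline; the numbers of
# estimators, depths and components are illustrative.
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB

# make a synthetic dataset
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# use RandomTreesEmbedding to transform data
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# visualize the result after dimensionality reduction using truncated SVD
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)

# learn a Naive Bayes classifier on the transformed data
nb = BernoulliNB()
nb.fit(X_transformed, y)

# learn an ExtraTreesClassifier on the original data for comparison
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)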
ax = plt.subplot(221)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Original Data (2d)")
ax.set_xticks(())
ax.set_yticks(())
ax = plt.subplot(222)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=50, edgecolor='k')
ax.set_title("Truncated SVD reduction (2d) of transformed data (%dd)" %
X_transformed.shape[1])
ax.set_xticks(())
ax.set_yticks(())
# Plot the decision in original space. For that, we will assign a color
# to each point in the mesh [x_min, x_max]x[y_min, y_max].
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
ax = plt.subplot(223)
ax.set_title("Naive Bayes on Transformed data")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())
ax = plt.subplot(224)
ax.set_title("ExtraTrees predictions")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='k')
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example reproduces Figure 1 of Zhu et al. and shows how boosting can improve prediction accuracy on a multi-
class problem. The classification dataset is constructed by taking a ten-dimensional standard normal distribution and
defining three classes separated by nested concentric ten-dimensional spheres such that roughly equal numbers of
samples are in each class (quantiles of the 𝜒2 distribution).
The performance of the SAMME and SAMME.R algorithms is compared. SAMME.R uses the probability estimates
to update the additive model, while SAMME uses the classifications only. As the example illustrates, the SAMME.R
algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations. The
error of each algorithm on the test set after each boosting iteration is shown on the left, the classification error on the
test set of each tree is shown in the middle, and the boost weight of each tree is shown on the right. All trees have a
weight of one in the SAMME.R algorithm and therefore are not shown.
print(__doc__)
X, y = make_gaussian_quantiles(n_samples=13000, n_features=10,
n_classes=3, random_state=1)
n_split = 3000
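# Hedged reconstruction of the omitted split: the first n_split samples are used
# for training, the rest for testing.
X_train, y_train = X[:n_split], y[:n_split]
X_test, y_test = X[n_split:], y[n_split:]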
bdt_real = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=600,
learning_rate=1)
bdt_discrete = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=600,
learning_rate=1.5,
algorithm="SAMME")
bdt_real.fit(X_train, y_train)
bdt_discrete.fit(X_train, y_train)
real_test_errors = []
discrete_test_errors = []
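# Hedged reconstruction of the omitted staged evaluation on the test set:
from sklearn.metrics import accuracy_score

for discrete_pred, real_pred in zip(bdt_discrete.staged_predict(X_test),
                                    bdt_real.staged_predict(X_test)):
    discrete_test_errors.append(1. - accuracy_score(discrete_pred, y_test))
    real_test_errors.append(1. - accuracy_score(real_pred, y_test))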
n_trees_discrete = len(bdt_discrete)
n_trees_real = len(bdt_real)
# Boosting might terminate early, but the following arrays are always
# n_estimators long. We crop them to the actual number of trees here:
discrete_estimator_errors = bdt_discrete.estimator_errors_[:n_trees_discrete]
real_estimator_errors = bdt_real.estimator_errors_[:n_trees_real]
discrete_estimator_weights = bdt_discrete.estimator_weights_[:n_trees_discrete]
plt.figure(figsize=(15, 5))
plt.subplot(131)
plt.plot(range(1, n_trees_discrete + 1),
discrete_test_errors, c='black', label='SAMME')
plt.plot(range(1, n_trees_real + 1),
real_test_errors, c='black',
linestyle='dashed', label='SAMME.R')
plt.legend()
plt.ylim(0.18, 0.62)
plt.ylabel('Test Error')
plt.xlabel('Number of Trees')
plt.subplot(133)
plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_weights,
"b", label='SAMME')
plt.legend()
plt.ylabel('Weight')
plt.xlabel('Number of Trees')
plt.ylim((0, discrete_estimator_weights.max() * 1.2))
plt.xlim((-20, n_trees_discrete + 20))
Note: Click here to download the full example code or to run this example in your browser via Binder
This example is based on Figure 10.2 from Hastie et al. 2009 [1] and illustrates the difference in performance between
the discrete SAMME boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are evaluated on
a binary classification task where the target Y is a non-linear function of 10 input features.
Discrete SAMME AdaBoost adapts based on errors in predicted class labels whereas real SAMME.R uses the pre-
dicted class probabilities.
[1] T. Hastie, R. Tibshirani and J. Friedman, "Elements of Statistical Learning Ed. 2", Springer, 2009.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
n_estimators = 400
# A learning rate of 1. may not be optimal for both SAMME and SAMME.R
learning_rate = 1.
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
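# Hedged reconstruction of the omitted split and of the decision stump used as
# the AdaBoost base estimator (split sizes are illustrative):
from sklearn.tree import DecisionTreeClassifier

X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]

dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)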
dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)
ada_discrete = AdaBoostClassifier(
base_estimator=dt_stump,
learning_rate=learning_rate,
n_estimators=n_estimators,
algorithm="SAMME")
ada_discrete.fit(X_train, y_train)
ada_real = AdaBoostClassifier(
base_estimator=dt_stump,
learning_rate=learning_rate,
n_estimators=n_estimators,
algorithm="SAMME.R")
ada_real.fit(X_train, y_train)
fig = plt.figure()
ax = fig.add_subplot(111)
ada_discrete_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_discrete.staged_predict(X_test)):
ada_discrete_err[i] = zero_one_loss(y_pred, y_test)
ada_discrete_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_discrete.staged_predict(X_train)):
ada_discrete_err_train[i] = zero_one_loss(y_pred, y_train)
ada_real_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_real.staged_predict(X_test)):
ada_real_err[i] = zero_one_loss(y_pred, y_test)
ada_real_err_train = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_real.staged_predict(X_train)):
ada_real_err_train[i] = zero_one_loss(y_pred, y_train)
ax.plot(np.arange(n_estimators) + 1, ada_discrete_err,
label='Discrete AdaBoost Test Error',
color='red')
ax.plot(np.arange(n_estimators) + 1, ada_discrete_err_train,
label='Discrete AdaBoost Train Error',
color='blue')
ax.plot(np.arange(n_estimators) + 1, ada_real_err,
label='Real AdaBoost Test Error',
color='orange')
ax.plot(np.arange(n_estimators) + 1, ada_real_err_train,
        label='Real AdaBoost Train Error',
        color='green')
ax.set_ylim((0.0, 0.5))
ax.set_xlabel('n_estimators')
ax.set_ylabel('error rate')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Stacking refers to a method to blend estimators. In this strategy, some estimators are individually fitted on some
training data while a final estimator is trained using the stacked predictions of these base estimators.
In this example, we illustrate the use case in which different regressors are stacked together and a final linear penalized
regressor is used to output the prediction. We compare the performance of each individual regressor with the stacking
strategy. Stacking slightly improves the overall performance.
print(__doc__)
The function plot_regression_results is used to plot the predicted and true targets.
import matplotlib.pyplot as plt
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax.spines['left'].set_position(('outward', 10))
ax.spines['bottom'].set_position(('outward', 10))
ax.set_xlim([y_true.min(), y_true.max()])
ax.set_ylim([y_true.min(), y_true.max()])
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
extra = plt.Rectangle((0, 0), 0, 0, fc="w", fill=False,
                      edgecolor='none', linewidth=0)
It is sometimes tedious to find the model which will best perform on a given dataset. Stacking provides an
alternative by combining the outputs of several learners, without the need to choose a model specifically.
The performance of stacking is usually close to the best model and sometimes it can outperform the
prediction performance of each individual model.
Here, we combine 3 learners (linear and non-linear) and use a ridge regressor to combine their outputs
together.
estimators = [
('Random Forest', RandomForestRegressor(random_state=42)),
('Lasso', LassoCV()),
('Gradient Boosting', HistGradientBoostingRegressor(random_state=0))
]
stacking_regressor = StackingRegressor(
estimators=estimators, final_estimator=RidgeCV()
)
We used the Boston data set (prediction of house prices). We check the performance of each individual predictor as
well as the stack of the regressors.
import time
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_validate, cross_val_predict
X, y = load_boston(return_X_y=True)
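# Hedged sketch of how each estimator and the stacked ensemble can be scored with
# cross-validation; the full example additionally plots predicted vs. true targets
# with the plot_regression_results helper mentioned above.
for name, est in estimators + [('Stacking Regressor', stacking_regressor)]:
    start_time = time.time()
    scores = cross_validate(est, X, y,
                            scoring=['r2', 'neg_mean_absolute_error'],
                            n_jobs=-1)
    elapsed_time = time.time() - start_time
    print('%s: R2 = %.3f (+/- %.3f), elapsed time = %.1fs'
          % (name, np.mean(scores['test_r2']),
             np.std(scores['test_r2']), elapsed_time))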
The stacked regressor will combine the strengths of the different regressors. However, we also see that training the
stacked regressor is much more computationally expensive.
Total running time of the script: ( 0 minutes 12.051 seconds)
Estimated memory usage: 9 MB
Note: Click here to download the full example code or to run this example in your browser via Binder
Gradient boosting is an ensembling technique where several weak learners (regression trees) are combined to yield a
powerful single model, in an iterative fashion.
Early stopping support in Gradient Boosting enables us to find the least number of iterations which is sufficient to
build a model that generalizes well to unseen data.
The concept of early stopping is simple. We specify a validation_fraction which denotes the fraction of the
whole dataset that will be kept aside from training to assess the validation loss of the model. The gradient boosting
model is trained using the training set and evaluated using the validation set. Each time an additional stage of regression
trees is added, the validation set is used to score the model. This is continued until the scores of the model in the last
n_iter_no_change stages do not improve by at least tol. After that the model is considered to have converged
and further addition of stages is "stopped early".
The number of stages of the final model is available at the attribute n_estimators_.
This example illustrates how early stopping can be used in the sklearn.ensemble.
GradientBoostingClassifier model to achieve almost the same accuracy as a model built without early
stopping, while using many fewer estimators. This can significantly reduce training time, memory usage
and prediction latency.
import time
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
n_gb = []
score_gb = []
time_gb = []
n_gbes = []
score_gbes = []
time_gbes = []
n_estimators = 500
for X, y in data_list:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)
# We specify that if the scores don't improve by at least 0.01 for the last
# 10 stages, stop fitting additional stages
# (the validation_fraction value below is illustrative)
gbes = ensemble.GradientBoostingClassifier(n_estimators=n_estimators,
                                           validation_fraction=0.2,
                                           n_iter_no_change=10, tol=0.01,
                                           random_state=0)
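# Hedged reconstruction of the omitted baseline without early stopping:
gb = ensemble.GradientBoostingClassifier(n_estimators=n_estimators,
                                         random_state=0)
start = time.time()
gb.fit(X_train, y_train)
time_gb.append(time.time() - start)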
start = time.time()
gbes.fit(X_train, y_train)
time_gbes.append(time.time() - start)
score_gb.append(gb.score(X_test, y_test))
score_gbes.append(gbes.score(X_test, y_test))
n_gb.append(gb.n_estimators_)
n_gbes.append(gbes.n_estimators_)
bar_width = 0.2
n = len(data_list)
index = np.arange(0, n * bar_width, bar_width) * 2.5
index = index[0:n]
plt.figure(figsize=(9, 5))
autolabel(bar1, n_gb)
autolabel(bar2, n_gbes)
plt.ylim([0, 1.3])
plt.legend(loc='best')
plt.grid(True)
plt.xlabel('Datasets')
plt.show()
plt.figure(figsize=(9, 5))
autolabel(bar1, n_gb)
autolabel(bar2, n_gbes)
plt.xlabel('Datasets')
plt.ylabel('Fit Time')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Transform your features into a higher dimensional, sparse space. Then train a linear model on these features.
First fit an ensemble of trees (totally random trees, a random forest, or gradient boosted trees) on the training set. Then
each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf
indices are then encoded in a one-hot fashion.
Each sample goes through the decisions of each tree of the ensemble and ends up in one leaf per tree. The sample is
encoded by setting feature values for these leaves to 1 and the other feature values to 0.
The resulting transformer has then learned a supervised, sparse, high-dimensional categorical embedding of the data.
# Author: Tim Head <betatim@gmail.com>
#
# License: BSD 3 clause
import numpy as np
np.random.seed(10)
n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
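# Hedged reconstruction of the omitted unsupervised transformer used in the first
# pipeline below; the full example builds analogous pipelines for a random forest
# and for gradient boosted trees (rf, rf_enc, rf_lm and grd, grd_enc, grd_lm) by
# one-hot encoding the leaf indices returned by apply().
from sklearn.ensemble import RandomTreesEmbedding

rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)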
rt_lm = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)
y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)
y_pred_grd_lm = grd_lm.predict_proba(
grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
plt.figure(2)
Note: Click here to download the full example code or to run this example in your browser via Binder
Out-of-bag (OOB) estimates can be a useful heuristic to estimate the “optimal” number of boosting iterations. OOB
estimates are almost identical to cross-validation estimates but they can be computed on-the-fly without the need for
repeated model fitting. OOB estimates are only available for Stochastic Gradient Boosting (i.e. subsample < 1.0);
the estimates are derived from the improvement in loss based on the examples not included in the bootstrap sample
(the so-called out-of-bag examples). The OOB estimator is a pessimistic estimator of the true test loss, but remains a
fairly good approximation for a small number of trees.
The figure shows the cumulative sum of the negative OOB improvements as a function of the boosting iteration. As
you can see, it tracks the test loss for the first hundred iterations but then diverges in a pessimistic way. The figure
also shows the performance of 3-fold cross validation which usually gives a better estimate of the test loss but is
computationally more demanding.
Out:
Accuracy: 0.6840
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
random_state=9)
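# Hedged reconstruction of the omitted model setup; subsample < 1.0 is what makes
# the OOB estimates available, the other values are illustrative.
from sklearn import ensemble

params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)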
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))
n_estimators = params['n_estimators']
x = np.arange(n_estimators) + 1
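def heldout_score(clf, X_test, y_test):
    """Hedged reconstruction of the omitted helper used by cv_estimate below:
    deviance on held-out data at each boosting stage."""
    score = np.zeros((n_estimators,), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
        score[i] = clf.loss_(y_test, y_pred)
    return score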
def cv_estimate(n_splits=None):
cv = KFold(n_splits=n_splits)
cv_clf = ensemble.GradientBoostingClassifier(**params)
val_scores = np.zeros((n_estimators,), dtype=np.float64)
for train, test in cv.split(X_train, y_train):
cv_clf.fit(X_train[train], y_train[train])
val_scores += heldout_score(cv_clf, X_train[test], y_train[test])
val_scores /= n_splits
return val_scores
plt.legend(loc='upper right')
plt.ylabel('normalized loss')
plt.xlabel('number of iterations')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates and compares the bias-variance decomposition of the expected mean squared error of a single
estimator against a bagging ensemble.
In regression, the expected mean squared error of an estimator can be decomposed in terms of bias, variance and
noise. On average over datasets of the regression problem, the bias term measures the average amount by which the
predictions of the estimator differ from the predictions of the best possible estimator for the problem (i.e., the Bayes
model). The variance term measures the variability of the predictions of the estimator when fit over different instances
LS of the problem. Finally, the noise measures the irreducible part of the error which is due to the variability in the data.
The upper left figure illustrates the predictions (in dark red) of a single decision tree trained over a random dataset LS
(the blue dots) of a toy 1d regression problem. It also illustrates the predictions (in light red) of other single decision
trees trained over other (and different) randomly drawn instances LS of the problem. Intuitively, the variance term
here corresponds to the width of the beam of predictions (in light red) of the individual estimators. The larger the
variance, the more sensitive are the predictions for x to small changes in the training set. The bias term corresponds
to the difference between the average prediction of the estimator (in cyan) and the best possible model (in dark blue).
On this problem, we can thus observe that the bias is quite low (both the cyan and the blue curves are close to each
other) while the variance is large (the red beam is rather wide).
The lower left figure plots the pointwise decomposition of the expected mean squared error of a single decision tree.
It confirms that the bias term (in blue) is low while the variance is large (in green). It also illustrates the noise part of
the error which, as expected, appears to be constant and around 0.01.
The right figures correspond to the same plots but using instead a bagging ensemble of decision trees. In both figures,
we can observe that the bias term is larger than in the previous case. In the upper right figure, the difference between
the average prediction (in cyan) and the best possible model is larger (e.g., notice the offset around x=2). In the lower
right figure, the bias curve is also slightly higher than in the lower left figure. In terms of variance however, the beam
of predictions is narrower, which suggests that the variance is lower. Indeed, as the lower right figure confirms, the
variance term (in green) is lower than for single decision trees. Overall, the bias-variance decomposition is therefore
no longer the same. The tradeoff is better for bagging: averaging several decision trees fit on bootstrap copies of the
dataset slightly increases the bias term but allows for a larger reduction of the variance, which results in a lower overall
mean squared error (compare the red curves in the lower figures). The script output also confirms this intuition. The
total error of the bagging ensemble is lower than the total error of a single decision tree, and this difference indeed
mainly stems from a reduced variance.
For further details on bias-variance decomposition, see section 7.3 of Hastie, Tibshirani and Friedman, "Elements of Statistical Learning Ed. 2", Springer, 2009.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Settings
n_repeat = 50 # Number of iterations for computing expectations
n_train = 50 # Size of the training set
n_test = 1000 # Size of the test set
noise = 0.1 # Standard deviation of the noise
np.random.seed(0)
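# Hedged reconstruction of the omitted list of compared estimators: a single
# decision tree and a bagging ensemble of such trees.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]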
n_estimators = len(estimators)
# Generate data
def f(x):
    x = x.ravel()
    # the target function below is a hedged reconstruction of the omitted return
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)


def generate(n_samples, noise, n_repeat=1):
    # the signature and the draw of X are a hedged reconstruction of the omitted
    # lines; the rest of the body is from the original example
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y
X_train = []
y_train = []
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)
plt.figure(figsize=(10, 8))
for i in range(n_repeat):
estimator.fit(X_train[i], y_train[i])
y_predict[:, i] = estimator.predict(X_test)
for i in range(n_repeat):
for j in range(n_repeat):
y_error += (y_test[:, j] - y_predict[:, i]) ** 2
# Plot figures
plt.subplot(2, n_estimators, n + 1)
plt.plot(X_test, f(X_test), "b", label="$f(x)$")
plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
for i in range(n_repeat):
if i == 0:
plt.plot(X_test, y_predict[:, i], "r", label=r"$\^y(x)$")
else:
plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
plt.xlim([-5, 5])
plt.title(name)
if n == n_estimators - 1:
plt.legend(loc=(1.1, .5))
plt.xlim([-5, 5])
plt.ylim([0, 0.1])
if n == n_estimators - 1:
plt.subplots_adjust(right=.75)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
6.11.22 Plot the decision surfaces of ensembles of trees on the iris dataset
Plot the decision surfaces of forests of randomized trees trained on pairs of features of the iris dataset.
This plot compares the decision surfaces learned by a decision tree classifier (first column), by a random forest
classifier (second column), by an extra-trees classifier (third column) and by an AdaBoost classifier (fourth column).
In the first row, the classifiers are built using the sepal width and the sepal length features only, on the second row
using the petal length and sepal length only, and on the third row using the petal width and the petal length only.
In descending order of quality, when trained (outside of this example) on all 4 features using 30 estimators and scored
using 10-fold cross-validation, the models rank as reported in the console output below.
Increasing max_depth for AdaBoost lowers the standard deviation of the scores (but the average score does not
improve).
See the console’s output for further details about each model.
In this example you might try to:
1) vary the max_depth for the DecisionTreeClassifier and AdaBoostClassifier,
perhaps try max_depth=3 for the DecisionTreeClassifier or max_depth=None for
AdaBoostClassifier
2) vary n_estimators
It is worth noting that RandomForests and ExtraTrees can be fitted in parallel on many cores as each tree is built
independently of the others. AdaBoost’s samples are built sequentially and so do not use multiple cores.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Parameters
n_classes = 3
n_estimators = 30
cmap = plt.cm.RdYlBu
plot_step = 0.02 # fine step width for decision surface contours
plot_step_coarser = 0.5 # step widths for coarse classifier guesses
RANDOM_SEED = 13 # fix the seed on each iteration
# Load data
iris = load_iris()
plot_idx = 1
models = [DecisionTreeClassifier(max_depth=None),
RandomForestClassifier(n_estimators=n_estimators),
ExtraTreesClassifier(n_estimators=n_estimators),
AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
n_estimators=n_estimators)]
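# Hedged reconstruction of the omitted loop over feature pairs and models; the
# lines that follow ("# Shuffle", "# Standardize", "# Train", ...) form its body.
for pair in ([0, 1], [0, 2], [2, 3]):
    for model in models:
        # we only take the two corresponding features
        X = iris.data[:, pair]
        y = iris.target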
# Shuffle
idx = np.arange(X.shape[0])
np.random.seed(RANDOM_SEED)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]
# Standardize
mean = X.mean(axis=0)
std = X.std(axis=0)
X = (X - mean) / std
# Train
model.fit(X, y)
scores = model.score(X, y)
# Create a title for each column and the console by using str() and
# slicing away useless parts of the string
model_title = str(type(model)).split(
".")[-1][:-2][:-len("Classifier")]
model_details = model_title
if hasattr(model, "estimators_"):
model_details += " with {} estimators".format(
len(model.estimators_))
print(model_details + " with features", pair,
      "has a score of", scores)
plt.subplot(3, 4, plot_idx)
if plot_idx <= len(models):
# Add a title at the top of each column
plt.title(model_title, fontsize=9)
# Plot the training points, these are clustered together and have a
# black outline
plt.scatter(X[:, 0], X[:, 1], c=y,
cmap=ListedColormap(['r', 'y', 'b']),
edgecolor='k', s=20)
plot_idx += 1 # move on to the next plot in sequence
Applications to real world problems with some medium sized datasets or interactive user interfaces.
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates the need for robust covariance estimation on a real data set. It is useful both for outlier
detection and for a better understanding of the data structure.
We selected two sets of two variables from the Boston housing data set as an illustration of what kind of analysis can
be done with several outlier detection tools. For the purpose of visualization, we are working with two-dimensional
examples, but one should be aware that things are not so trivial in high-dimension, as it will be pointed out.
In both examples below, the main result is that the empirical covariance estimate, as a non-robust one, is highly
influenced by the heterogeneous structure of the observations. Although the robust covariance estimate is able to
focus on the main mode of the data distribution, it sticks to the assumption that the data should be Gaussian distributed,
yielding some biased estimation of the data structure, but yet accurate to some extent. The One-Class SVM does not
assume any parametric form of the data distribution and can therefore model the complex shape of the data much
better.
First example
The first example illustrates how robust covariance estimation can help concentrate on a relevant cluster when another
one exists. Here, many observations are confounded into one and break down the empirical covariance estima-
tion. Of course, some screening tools would have pointed out the presence of two clusters (Support Vector Machines,
Gaussian Mixture Models, univariate outlier detection, . . . ). But had it been a high-dimensional example, none of
these could be applied that easily.
Second example
The second example shows the ability of the Minimum Covariance Determinant robust estimator of covariance to
concentrate on the main mode of the data distribution: the location seems to be well estimated, although the covariance
is hard to estimate due to the banana-shaped distribution. Anyway, we can get rid of some outlying observations. The
One-Class SVM is able to capture the real data structure, but the difficulty is to adjust its kernel bandwidth parameter
so as to obtain a good compromise between the shape of the data scatter matrix and the risk of over-fitting the data.
print(__doc__)
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.datasets import load_boston
# Get data
X1 = load_boston()['data'][:, [8, 10]] # two clusters
X2 = load_boston()['data'][:, [5, 12]] # "banana"-shaped
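# Hedged reconstruction of the omitted estimators being compared; the
# contamination and kernel-bandwidth values are illustrative.
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.25),
    "Robust Covariance (Minimum Covariance Determinant)":
        EllipticEnvelope(contamination=0.25),
    "OCSVM": OneClassSVM(nu=0.25, gamma=0.35),
}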
legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())
legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows the reconstruction of an image from a set of parallel projections, acquired along different angles.
Such a dataset is acquired in computed tomography (CT).
Without any prior information on the sample, the number of projections required to reconstruct the image is of the
order of the linear size l of the image (in pixels). For simplicity we consider here a sparse image, where only pixels
on the boundary of objects have a non-zero value. Such data could correspond for example to a cellular material.
Note however that most images are sparse in a different basis, such as the Haar wavelets. Only l/7 projections are
acquired, therefore it is necessary to use prior information available on the sample (its sparsity): this is an example of
compressive sensing.
The tomography projection operation is a linear transformation. In addition to the data-fidelity term corresponding
to a linear regression, we penalize the L1 norm of the image to account for its sparsity. The resulting optimization
problem is called the Lasso. We use the class sklearn.linear_model.Lasso, which uses the coordinate descent
algorithm. Importantly, this implementation is more computationally efficient on a sparse matrix than the projection
operator used here.
The reconstruction with L1 penalization gives a result with zero error (all pixels are successfully labeled with 0 or 1),
even if noise was added to the projections. In comparison, an L2 penalization (sklearn.linear_model.Ridge)
produces a large number of labeling errors for the pixels. Important artifacts are observed on the reconstructed image,
contrary to the L1 penalization. Note in particular the circular artifact separating the pixels in the corners, which have
contributed to fewer projections than the central disk.
print(__doc__)
import numpy as np
from scipy import sparse
from scipy import ndimage
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt
def _generate_center_coordinates(l_x):
X, Y = np.mgrid[:l_x, :l_x].astype(np.float64)
center = l_x / 2.
X += 0.5 - center
Y += 0.5 - center
return X, Y
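def _weights(x, dx=1, orig=0):
    # Hedged reconstruction of the omitted helper used below: linear-interpolation
    # weights of the rotated coordinates onto the detector grid.
    x = np.ravel(x)
    floor_x = np.floor((x - orig) / dx).astype(np.int64)
    alpha = (x - orig - floor_x * dx) / dx
    return np.hstack((floor_x, floor_x + 1)), np.hstack((1 - alpha, alpha))


def build_projection_operator(l_x, n_dir):
    """Compute the tomography design matrix (the signature and this docstring
    opening are a hedged reconstruction of the omitted lines).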
Parameters
----------
l_x : int
linear size of image array
n_dir : int
number of angles at which projections are acquired.
Returns
-------
p : sparse matrix of shape (n_dir l_x, l_x**2)
"""
X, Y = _generate_center_coordinates(l_x)
angles = np.linspace(0, np.pi, n_dir, endpoint=False)
data_inds, weights, camera_inds = [], [], []
data_unravel_indices = np.arange(l_x ** 2)
data_unravel_indices = np.hstack((data_unravel_indices,
data_unravel_indices))
for i, angle in enumerate(angles):
Xrot = np.cos(angle) * X - np.sin(angle) * Y
inds, w = _weights(Xrot, dx=1, orig=X.min())
mask = np.logical_and(inds >= 0, inds < l_x)
weights += list(w[mask])
camera_inds += list(inds[mask] + i * l_x)
data_inds += list(data_unravel_indices[mask])
proj_operator = sparse.coo_matrix((weights, (camera_inds, data_inds)))
return proj_operator
def generate_synthetic_data():
""" Synthetic binary data """
rs = np.random.RandomState(0)
n_pts = 36
x, y = np.ogrid[0:l, 0:l]
mask_outer = (x - l / 2.) ** 2 + (y - l / 2.) ** 2 < (l / 2.) ** 2
mask = np.zeros((l, l))
points = l * rs.rand(2, n_pts)
mask[(points[0]).astype(int), (points[1]).astype(int)] = 1
mask = ndimage.gaussian_filter(mask, sigma=l / n_pts)
res = np.logical_and(mask > mask.mean(), mask_outer)
return np.logical_xor(res, ndimage.binary_erosion(res))
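# Hedged reconstruction of the omitted projection and reconstruction steps; the
# image size, noise level and regularization strengths are illustrative.
l = 128
proj_operator = build_projection_operator(l, l // 7)
data = generate_synthetic_data()
proj = proj_operator * data.ravel()[:, np.newaxis]
proj += 0.15 * np.random.randn(*proj.shape)

# Reconstruction with L2 (Ridge) penalization
rgr_ridge = Ridge(alpha=0.2)
rgr_ridge.fit(proj_operator, proj.ravel())
rec_l2 = rgr_ridge.coef_.reshape(l, l)

# Reconstruction with L1 (Lasso) penalization
rgr_lasso = Lasso(alpha=0.001)
rgr_lasso.fit(proj_operator, proj.ravel())
rec_l1 = rgr_lasso.coef_.reshape(l, l)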
plt.figure(figsize=(8, 3.3))
plt.subplot(131)
plt.imshow(data, cmap=plt.cm.gray, interpolation='nearest')
plt.axis('off')
plt.title('original image')
plt.subplot(132)
plt.imshow(rec_l2, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L2 penalization')
plt.axis('off')
plt.subplot(133)
plt.imshow(rec_l1, cmap=plt.cm.gray, interpolation='nearest')
plt.title('L1 penalization')
plt.axis('off')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
6.12.3 Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet
Allocation
Loading dataset...
done in 1.237s.
Extracting tf-idf features for NMF...
done in 0.263s.
Extracting tf features for LDA...
done in 0.260s.
Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.293s.
Topic #1: windows use dos using window program os application drivers help software pc running ms screen files version work code mode
Topic #2: god jesus bible faith christian christ christians does sin heaven believe lord life mary church atheism love belief human religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information list send video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil speed power good 000 brake year models used bought
Topic #5: edu soon send com university internet mit ftp mail cc pub article information hope email mac home program blood contact
Topic #6: file files problem format win sound ftp pub read save site image help available create copy running memory self version
Topic #7: game team games year win play season players nhl runs goal toronto hockey division flyers player defense leafs bad won
Topic #8: drive drives hard disk floppy software card mac computer power scsi controller apple 00 mb pc rom sale problem monitor
Topic #9: key chip clipper keys encryption government public use secure enforcement phone nsa law communications security clinton used standard legal data
Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.732s.
Topic #1: windows thanks help hi using looking does info software video use dos pc advance anybody mail appreciated card need know
Topic #2: god does jesus true book christian bible christians religion faith church believe read life christ says people lord exist say
Topic #3: thanks know bike interested car mail new like price edu heard list hear want cars email contact just com mark
Topic #5: space government 00 nasa public security states earth phone 1993 research technology university subject information science data internet provide blood
Topic #6: edu file com program try problem files soon window remember sun win send library mike article just mit oh code
Topic #7: game team year games play world season won case division players win nhl flyers second toronto points cubs ll al
Topic #8: drive think hard drives disk mac apple need number software scsi computer don card floppy bus cable actually controller memory
Topic #9: just use good like key chip got way don doesn sure clipper better going keys ll want speed encryption thought
Topic #1: drive car disk hard drives game power speed card just like good controller new year bios rom better team got
Topic #2: edu com mail windows file send graphics use version ftp pc thanks available program help files using software time know
Topic #3: vs gm thanks win interested copies john email text st mail copy hi new book division edu buying advance know
Topic #4: performance wanted robert speed couldn math ok change address include organization mr science major university internet edu computer driver kept
Topic #5: space scsi earth moon surface probe lunar orbit mission nasa launch science mars energy bit printer spacecraft probes sci solar
Topic #6: israel 000 section turkish military armenian greek killed state armenians people population attacks women israeli men weapon division dangerous jews
Topic #9: just don people like think know time say god good way make does did want right really going said things
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.
print("Loading dataset...")
t0 = time()
data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'),
return_X_y=True)
data_samples = data[:n_samples]
print("done in %0.3fs." % (time() - t0))
Note: Click here to download the full example code or to run this example in your browser via Binder
The dataset used in this example is a preprocessed excerpt of the “Labeled Faces in the Wild”, aka LFW:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
Expected results for the top 5 most represented people in the dataset:
Out:
[[ 6 3 0 4 0 0 0]
[ 1 52 0 7 0 0 0]
[ 1 2 17 7 0 0 0]
[ 0 3 0 143 0 0 0]
[ 0 1 0 3 20 0 1]
[ 0 4 0 3 1 7 0]
[ 0 1 2 5 0 0 28]]
print(__doc__)
# #############################################################################
# Download the data, if not already on disk and load it as numpy arrays
# for machine learning we use the data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
# #############################################################################
# Split into a training set and a test set using a stratified k fold
# #############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
# #############################################################################
# Train a SVM classification model
# #############################################################################
# Quantitative evaluation of the model quality on the test set
# #############################################################################
# Qualitative evaluation of the predictions using matplotlib
plot_gallery(X_test, prediction_titles, h, w)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Demonstrate how model complexity influences both prediction accuracy and computational performance.
The dataset is the Boston Housing dataset (resp. 20 Newsgroups) for regression (resp. classification).
For each class of models we vary the model complexity through the choice of the relevant model parameters and
measure the influence on both computational performance (latency) and predictive power (MSE or Hamming loss).
Out:
Benchmarking GradientBoostingRegressor(n_estimators=50)
Complexity: 50 | MSE: 8.6545 | Pred. Time: 0.000187s
Benchmarking GradientBoostingRegressor()
Complexity: 100 | MSE: 7.7179 | Pred. Time: 0.000262s
Benchmarking GradientBoostingRegressor(n_estimators=200)
Complexity: 200 | MSE: 6.7507 | Pred. Time: 0.000426s
Benchmarking GradientBoostingRegressor(n_estimators=500)
Complexity: 500 | MSE: 7.1471 | Pred. Time: 0.000851s
print(__doc__)
import time
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.parasite_axes import host_subplot
from mpl_toolkits.axisartist.axislines import Axes
from scipy.sparse.csr import csr_matrix
# #############################################################################
# Routines
def benchmark_influence(conf):
    """
    Benchmark influence of :changing_param: on both MSE and latency.
    """
    prediction_times = []
    prediction_powers = []
    complexities = []
    for param_value in conf['changing_param_values']:
        conf['tuned_params'][conf['changing_param']] = param_value
        estimator = conf['estimator'](**conf['tuned_params'])
        print("Benchmarking %s" % estimator)
        estimator.fit(conf['data']['X_train'], conf['data']['y_train'])
        conf['postfit_hook'](estimator)
        complexity = conf['complexity_computer'](estimator)
        complexities.append(complexity)
        start_time = time.time()
        for _ in range(conf['n_samples']):
            y_pred = estimator.predict(conf['data']['X_test'])
        elapsed_time = (time.time() - start_time) / float(conf['n_samples'])
        prediction_times.append(elapsed_time)
        pred_score = conf['prediction_performance_computer'](
            conf['data']['y_test'], y_pred)
        prediction_powers.append(pred_score)
        print("Complexity: %d | %s: %.4f | Pred. Time: %fs\n" % (
            complexity, conf['prediction_performance_label'], pred_score,
            elapsed_time))
    return prediction_powers, prediction_times, complexities


def _count_nonzero_coefficients(estimator):
    a = estimator.coef_.toarray()
    return np.count_nonzero(a)
# #############################################################################
# Main code
regression_data = generate_data('regression')
classification_data = generate_data('classification', sparse=True)
configurations = [
{'estimator': SGDClassifier,
'tuned_params': {'penalty': 'elasticnet', 'alpha': 0.001, 'loss':
'modified_huber', 'fit_intercept': True, 'tol': 1e-3},
'changing_param': 'l1_ratio',
'changing_param_values': [0.25, 0.5, 0.75, 0.9],
'complexity_label': 'non_zero coefficients',
'complexity_computer': _count_nonzero_coefficients,
'prediction_performance_computer': hamming_loss,
'prediction_performance_label': 'Hamming Loss (Misclassification Ratio)',
'postfit_hook': lambda x: x.sparsify(),
'data': classification_data,
'n_samples': 30},
{'estimator': NuSVR,
'tuned_params': {'C': 1e3, 'gamma': 2 ** -15},
'changing_param': 'nu',
'changing_param_values': [0.1, 0.25, 0.5, 0.75, 0.9],
'complexity_label': 'n_support_vectors',
'complexity_computer': lambda x: len(x.support_vectors_),
'data': regression_data,
'postfit_hook': lambda x: x,
'prediction_performance_computer': mean_squared_error,
'prediction_performance_label': 'MSE',
'n_samples': 30},
{'estimator': GradientBoostingRegressor,
'tuned_params': {'loss': 'ls'},
'changing_param': 'n_estimators',
'changing_param_values': [10, 50, 100, 200, 500],
'complexity_label': 'n_trees',
'complexity_computer': lambda x: x.n_estimators,
'data': regression_data,
'postfit_hook': lambda x: x,
'prediction_performance_computer': mean_squared_error,
'prediction_performance_label': 'MSE',
'n_samples': 30},
]
for conf in configurations:
    prediction_performances, prediction_times, complexities = \
        benchmark_influence(conf)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example employs several unsupervised learning techniques to extract the stock market structure from variations
in historical quotes.
The quantity that we use is the daily variation in quote price: quotes that are linked tend to fluctuate together during a day.
We use sparse inverse covariance estimation to find which quotes are correlated conditionally on the others. Specifically,
sparse inverse covariance gives us a graph, that is, a list of connections. For each symbol, the symbols that it is
connected to are those useful to explain its fluctuations.
Clustering
We use clustering to group together quotes that behave similarly. Here, amongst the various clustering techniques
available in scikit-learn, we use Affinity Propagation as it does not enforce equal-size clusters, and it can automatically
choose the number of clusters from the data.
Note that this gives us a different indication than the graph, as the graph reflects conditional relations between variables,
while the clustering reflects marginal properties: variables clustered together can be considered as having a similar
impact at the level of the full stock market.
Embedding in 2D space
For visualization purposes, we need to lay out the different symbols on a 2D canvas. For this we use manifold learning
techniques to retrieve a 2D embedding.
Visualization
The output of the 3 models is combined in a 2D graph where nodes represent the stocks and edges the links between them:
• cluster labels are used to define the color of the nodes
• the sparse covariance model is used to display the strength of the edges
• the 2D embedding is used to position the nodes in the plane
This example has a fair amount of visualization-related code, as visualization is crucial here to display the graph. One
of the challenges is to position the labels while minimizing overlap. For this we use a heuristic based on the direction of the
nearest neighbor along each axis.
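The three steps can be summarized with the following minimal sketch. It assumes that variation is an (n_stocks, n_days) array of daily close-minus-open price moves and that names is a NumPy array of matching stock names; both are built by the full listing below, so this fragment is illustrative only.
import numpy as np
from sklearn import cluster, covariance, manifold

X = variation.copy().T          # samples are days, features are stocks
X /= X.std(axis=0)              # normalize the variance of each stock

# 1) Sparse inverse covariance: conditional dependencies between stocks
edge_model = covariance.GraphicalLassoCV()
edge_model.fit(X)

# 2) Affinity propagation on the learned covariance: marginal grouping
_, labels = cluster.affinity_propagation(edge_model.covariance_)
for i in range(labels.max() + 1):
    print('Cluster %i: %s' % (i + 1, ', '.join(names[labels == i])))

# 3) 2D embedding of the stocks for the final plot
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)
embedding = node_position_model.fit_transform(X.T).T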
Out:
Cluster 7: McDonald's
Cluster 8: GlaxoSmithKline, Novartis, Pfizer, Sanofi-Aventis, Unilever
Cluster 9: Kellogg, Coca Cola, Pepsi
Cluster 10: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 11: Canon, Honda, Navistar, Sony, Toyota, Xerox
import sys
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
import pandas as pd
print(__doc__)
# #############################################################################
# Retrieve the data from Internet
# The data is from 2003 - 2008. This is reasonably calm (not too long ago so
# that we get high-tech firms, and before the 2008 crash). This kind of
# historical data can be obtained from APIs like the quandl.com and
# alphavantage.co ones.
symbol_dict = {
'TOT': 'Total',
'XOM': 'Exxon',
'CVX': 'Chevron',
'COP': 'ConocoPhillips',
'VLO': 'Valero Energy',
'MSFT': 'Microsoft',
'IBM': 'IBM',
'TWX': 'Time Warner',
'CMCSA': 'Comcast',
'CVC': 'Cablevision',
'YHOO': 'Yahoo',
'DELL': 'Dell',
'HPQ': 'HP',
'AMZN': 'Amazon',
'TM': 'Toyota',
'CAJ': 'Canon',
'SNE': 'Sony',
'F': 'Ford',
'HMC': 'Honda',
'NAV': 'Navistar',
'NOC': 'Northrop Grumman',
'BA': 'Boeing',
'KO': 'Coca Cola',
'MMM': '3M',
'MCD': 'McDonald\'s',
'PEP': 'Pepsi',
'K': 'Kellogg',
'UN': 'Unilever',
'MAR': 'Marriott',
'PG': 'Procter Gamble',
'CL': 'Colgate-Palmolive',
'GE': 'General Electrics',
'WFC': 'Wells Fargo',
'JPM': 'JPMorgan Chase',
quotes = []
# The daily variations of the quotes are what carry most information
variation = close_prices - open_prices
# #############################################################################
# Learn a graphical structure from the correlations
edge_model = covariance.GraphicalLassoCV()
# #############################################################################
# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
# #############################################################################
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane
embedding = node_position_model.fit_transform(X.T).T
# #############################################################################
# Visualization
plt.figure(1, facecolor='w', figsize=(10, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])
plt.axis('off')
dx = x - embedding[0]
dx[index] = 1
dy = y - embedding[1]
dy[index] = 1
this_dx = dx[np.argmin(np.abs(dy))]
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A classical way to assert the relative importance of vertices in a graph is to compute the principal eigenvector of the
adjacency matrix so as to assign to each vertex the values of the components of the first eigenvector as a centrality
score:
https://en.wikipedia.org/wiki/Eigenvector_centrality
On the graph of webpages and links those values are called the PageRank scores by Google.
The goal of this example is to analyze the graph of links inside wikipedia articles to rank articles by relative importance
according to this eigenvector centrality.
The traditional way to compute the principal eigenvector is to use the power iteration method:
https://en.wikipedia.org/wiki/Power_iteration
Here the computation is achieved thanks to Martinsson’s Randomized SVD algorithm implemented in scikit-learn.
The graph data is fetched from the DBpedia dumps. DBpedia is an extraction of the latent structured data of the
Wikipedia content.
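Once the adjacency matrix is built, the centrality computation itself is short. The sketch below assumes X is the scipy.sparse adjacency matrix of the link graph and names maps a row index back to an article name (both are produced by the loading code that follows); it is an illustration, not the full listing.
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Compute a truncated randomized SVD of the (sparse) adjacency matrix
U, s, V = randomized_svd(X, n_components=5, n_iter=3)

# The components of the first singular vectors act as centrality scores
print("Top pages according to principal singular vectors")
print([names[i] for i in np.abs(U.T[0]).argsort()[-10:]])
print([names[i] for i in np.abs(V[0]).argsort()[-10:]])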
import numpy as np
print(__doc__)
# #############################################################################
# Where to download the data, if not already on disk
redirects_url = "http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2"
redirects_filename = redirects_url.rsplit("/", 1)[1]
page_links_url = "http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2"
page_links_filename = page_links_url.rsplit("/", 1)[1]
resources = [
(redirects_url, redirects_filename),
(page_links_url, page_links_filename),
]
# #############################################################################
# Loading the redirect files
memory = Memory(cachedir=".")
DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SHORTNAME_SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)
def short_name(nt_uri):
    """Remove the < and > URI markers and the common URI prefix"""
    return nt_uri[SHORTNAME_SLICE]


def get_redirects(redirects_filename):
    """Parse the redirections and build a transitively closed map out of it"""
    redirects = {}
    print("Parsing the NT redirect file")
    for l, line in enumerate(BZ2File(redirects_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        redirects[short_name(split[0])] = short_name(split[2])
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))
    return redirects
# disabling joblib as the pickling of large dicts seems much too slow
#@memory.cache
def get_adjacency_matrix(redirects_filename, page_links_filename, limit=None):
"""Extract the adjacency graph as a scipy sparse matrix
return scores
Note: Click here to download the full example code or to run this example in your browser via Binder
Modeling species’ geographic distributions is an important problem in conservation biology. In this example we model
the geographic distribution of two South American mammals given past observations and 14 environmental variables.
Since we have only positive examples (there are no unsuccessful observations), we cast this problem as a density
estimation problem and use the sklearn.svm.OneClassSVM as our modeling tool. The dataset is provided by
Phillips et al. (2006). If available, the example uses basemap to plot the coast lines and national boundaries of South
America.
The two species are:
• “Bradypus variegatus”, the Brown-throated Sloth.
• “Microryzomys minutus”, also known as the Forest Small Rice Rat, a rodent that lives in Peru, Colombia,
Ecuador, and Venezuela.
References
Out:
________________________________________________________________________________
Modeling distribution of species 'bradypus variegatus'
- fit OneClassSVM ... done.
- plot coastlines from coverage
- predict species distribution
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
def construct_grids(batch):
    """Construct the map grid from the batch object

    Parameters
    ----------
    batch : Batch object
        The object returned by :func:`fetch_species_distributions`

    Returns
    -------
    (xgrid, ygrid) : 1-D arrays
        The grid corresponding to the values in batch.coverages
    """
    # x,y coordinates for corner cells
    xmin = batch.x_left_lower_corner + batch.grid_size
    xmax = xmin + (batch.Nx * batch.grid_size)
    ymin = batch.y_left_lower_corner + batch.grid_size
    ymax = ymin + (batch.Ny * batch.grid_size)
# determine coverage values for each of the training & testing points
ix = np.searchsorted(xgrid, pts['dd long'])
iy = np.searchsorted(ygrid, pts['dd lat'])
bunch['cov_%s' % label] = coverages[:, -iy, ix].T
return bunch
def plot_species_distribution(species=("bradypus_variegatus_0",
                                       "microryzomys_minutus_0")):
    """
    Plot the species distribution.
    """
    if len(species) > 2:
        print("Note: when more than two species are provided,"
              " only the first two will be used")

    t0 = time()

    # We'll make use of the fact that coverages[6] has measurements at all
    # land points. This will help us decide between land and water.
    land_reference = data.coverages[6]

    # Standardize features
    mean = species.cov_train.mean(axis=0)
    std = species.cov_train.std(axis=0)
    train_cover_std = (species.cov_train - mean) / std

    # Fit OneClassSVM
    print(" - fit OneClassSVM ... ", end='')
    clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5)
    clf.fit(train_cover_std)
    print("done.")
plot_species_distribution()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A simple graphical frontend for Libsvm, mainly intended for didactic purposes. You can create data points by pointing
and clicking and visualize the decision region induced by different kernels and parameter settings.
To create positive examples click the left mouse button; to create negative examples click the right button.
If all examples are from the same class, a one-class SVM is used.
print(__doc__)
import matplotlib
matplotlib.use('TkAgg')
import sys
import numpy as np
import tkinter as Tk
class Model:
    """The Model which holds the data. It implements the
    observable in the observer pattern and notifies the
    registered observers on change event.
    """

    def __init__(self):
        self.observers = []
        self.surface = None
        self.data = []
        self.cls = None
        self.surface_type = 0
class Controller:
    def __init__(self, model):
        self.model = model
        self.kernel = Tk.IntVar()
        self.surface_type = Tk.IntVar()
        # Whether or not a model has been fitted
        self.fitted = False

    def fit(self):
        print("fit the model")
        train = np.array(self.model.data)
        X = train[:, 0:2]
        y = train[:, 2]

        C = float(self.complexity.get())
        gamma = float(self.gamma.get())
        coef0 = float(self.coef0.get())
    def clear_data(self):
        self.model.data = []
        self.fitted = False
        self.model.changed("clear")

    def refit(self):
        """Refit the model if already fitted. """
        if self.fitted:
            self.fit()
class View:
    """Test docstring. """

    def __init__(self, root, controller):
        f = Figure()
        ax = f.add_subplot(111)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_xlim((x_min, x_max))
        ax.set_ylim((y_min, y_max))
        canvas = FigureCanvasTkAgg(f, master=root)
        canvas.show()
    def plot_kernels(self):
        self.ax.text(-50, -60, "Linear: $u^T v$")
        self.ax.text(-20, -60, r"RBF: $\exp (-\gamma \| u-v \|^2)$")
        self.ax.text(10, -60, r"Poly: $(\gamma \, u^T v + r)^d$")

    def update(self, event, model):
        if event == "example_added":
            self.update_example(model, -1)
        if event == "clear":
            self.ax.clear()
            self.ax.set_xticks([])
            self.ax.set_yticks([])
            self.contours = []
            self.c_labels = None
            self.plot_kernels()
        if event == "surface":
            self.remove_surface()
            self.plot_support_vectors(model.clf.support_vectors_)
            self.plot_decision_surface(model.surface, model.surface_type)
        self.canvas.draw()
class ControllBar:
    def __init__(self, root, controller):
        fm = Tk.Frame(root)
        kernel_group = Tk.Frame(fm)
        Tk.Radiobutton(kernel_group, text="Linear", variable=controller.kernel,
                       value=0, command=controller.refit).pack(anchor=Tk.W)
        Tk.Radiobutton(kernel_group, text="RBF", variable=controller.kernel,
                       value=1, command=controller.refit).pack(anchor=Tk.W)
        Tk.Radiobutton(kernel_group, text="Poly", variable=controller.kernel,
                       value=2, command=controller.refit).pack(anchor=Tk.W)
        kernel_group.pack(side=Tk.LEFT)

        valbox = Tk.Frame(fm)
        controller.complexity = Tk.StringVar()
        controller.complexity.set("1.0")
        c = Tk.Frame(valbox)
        Tk.Label(c, text="C:", anchor="e", width=7).pack(side=Tk.LEFT)
        Tk.Entry(c, width=6, textvariable=controller.complexity).pack(
            side=Tk.LEFT)
        controller.gamma = Tk.StringVar()
        controller.gamma.set("0.01")
        g = Tk.Frame(valbox)
        Tk.Label(g, text="gamma:", anchor="e", width=7).pack(side=Tk.LEFT)
        Tk.Entry(g, width=6, textvariable=controller.gamma).pack(side=Tk.LEFT)
        g.pack()

        controller.degree = Tk.StringVar()
        controller.degree.set("3")
        d = Tk.Frame(valbox)
        Tk.Label(d, text="degree:", anchor="e", width=7).pack(side=Tk.LEFT)
        Tk.Entry(d, width=6, textvariable=controller.degree).pack(side=Tk.LEFT)
        d.pack()

        controller.coef0 = Tk.StringVar()
        controller.coef0.set("0")
        r = Tk.Frame(valbox)
        Tk.Label(r, text="coef0:", anchor="e", width=7).pack(side=Tk.LEFT)
        Tk.Entry(r, width=6, textvariable=controller.coef0).pack(side=Tk.LEFT)
        r.pack()
        valbox.pack(side=Tk.LEFT)

        cmap_group = Tk.Frame(fm)
        Tk.Radiobutton(cmap_group, text="Hyperplanes",
                       variable=controller.surface_type, value=0,
                       command=controller.refit).pack(anchor=Tk.W)
        Tk.Radiobutton(cmap_group, text="Surface",
                       variable=controller.surface_type, value=1,
                       command=controller.refit).pack(anchor=Tk.W)
        cmap_group.pack(side=Tk.LEFT)
def get_parser():
    from optparse import OptionParser
    op = OptionParser()
    op.add_option("--output",
                  action="store", type="str", dest="output",
                  help="Path where to dump data.")
    return op
def main(argv):
    op = get_parser()
    opts, args = op.parse_args(argv[1:])
    root = Tk.Tk()
    model = Model()
    controller = Controller(model)

    if opts.output:
        model.dump_svmlight_file(opts.output)


if __name__ == "__main__":
    main(sys.argv)
Note: Click here to download the full example code or to run this example in your browser via Binder
Out:
Benchmarking SGDRegressor(alpha=0.01, l1_ratio=0.25, penalty='elasticnet', tol=0.0001)
Benchmarking RandomForestRegressor()
Benchmarking SVR()
benchmarking with 100 features
benchmarking with 250 features
benchmarking with 500 features
example run in 10.35s
import time
import gc
import numpy as np
import matplotlib.pyplot as plt
def _not_in_sphinx():
    # Hack to detect whether we are running by the sphinx builder
    return '__file__' in globals()
def benchmark_estimator(estimator, X_test, n_bulk_repeats=30, verbose=False):
    """Measure runtimes of prediction in both atomic and bulk mode.

    Parameters
    ----------
    estimator : already trained estimator supporting `predict()`
    X_test : test input
    n_bulk_repeats : how many times to repeat when evaluating bulk mode

    Returns
    -------
    atomic_runtimes, bulk_runtimes : a pair of `np.array` which contain the
        runtimes in seconds.

    """
    atomic_runtimes = atomic_benchmark_estimator(estimator, X_test, verbose)
    bulk_runtimes = bulk_benchmark_estimator(estimator, X_test, n_bulk_repeats,
                                             verbose)
    return atomic_runtimes, bulk_runtimes
random_seed = 13
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=n_train, test_size=n_test, random_state=random_seed)
X_train, y_train = shuffle(X_train, y_train, random_state=random_seed)
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(y_train[:, None])[:, 0]
y_test = y_scaler.transform(y_test[:, None])[:, 0]
gc.collect()
if verbose:
print("ok")
return X_train, y_train, X_test, y_test
Parameters
----------
runtimes : list of `np.array` of latencies in micro-seconds
cls_names : list of estimator class names that generated the runtimes
pred_type : 'bulk' or 'atomic'
"""
plt.show()
def benchmark(configuration):
    """Run the whole benchmark."""
    X_train, y_train, X_test, y_test = generate_dataset(
        configuration['n_train'], configuration['n_test'],
        configuration['n_features'])

    stats = {}
    for estimator_conf in configuration['estimators']:
        print("Benchmarking", estimator_conf['instance'])
        estimator_conf['instance'].fit(X_train, y_train)
        gc.collect()
        a, b = benchmark_estimator(estimator_conf['instance'], X_test)
        stats[estimator_conf['name']] = {'atomic': a, 'bulk': b}
Parameters
----------
Returns:
--------
percentiles : dict(estimator_name,
dict(n_features, percentile_perf_in_us))
"""
percentiles = defaultdict(defaultdict)
for n in n_features:
print("benchmarking with %d features" % n)
X_train, y_train, X_test, y_test = generate_dataset(n_train, n_test, n)
for cls_name, estimator in estimators.items():
# #############################################################################
# Main code
start_time = time.time()
# #############################################################################
# Benchmark bulk/atomic prediction speed for various regressors
configuration = {
'n_train': int(1e3),
'n_test': int(1e2),
'n_features': int(1e2),
'estimators': [
{'name': 'Linear Model',
'instance': SGDRegressor(penalty='elasticnet', alpha=0.01,
l1_ratio=0.25, tol=1e-4),
'complexity_label': 'non-zero coefficients',
'complexity_computer': lambda clf: np.count_nonzero(clf.coef_)},
{'name': 'RandomForest',
'instance': RandomForestRegressor(),
'complexity_label': 'estimators',
'complexity_computer': lambda clf: clf.n_estimators},
{'name': 'SVR',
'instance': SVR(kernel='rbf'),
'complexity_label': 'support vectors',
'complexity_computer': lambda clf: len(clf.support_vectors_)},
]
}
benchmark(configuration)
# benchmark throughput
throughputs = benchmark_throughputs(configuration)
plot_benchmark_throughput(throughputs, configuration)
stop_time = time.time()
print("example run in %.2fs" % (stop_time - start_time))
Note: Click here to download the full example code or to run this example in your browser via Binder
This is an example showing how scikit-learn can be used for classification using an out-of-core approach: learning
from data that doesn’t fit into main memory. We make use of an online classifier, i.e., one that supports the partial_fit
method, which will be fed with batches of examples. To guarantee that the feature space remains the same over time
we leverage a HashingVectorizer that will project each example into the same feature space. This is especially useful
in the case of text classification where new features (words) may appear in each batch.
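The core of the approach fits in a few lines. In the sketch below, iter_minibatches is a hypothetical stand-in for any generator yielding (texts, labels) batches (the real example builds such a generator from the Reuters stream further down); everything else relies on the stateless HashingVectorizer together with partial_fit.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)
clf = SGDClassifier()
all_classes = np.array([0, 1])   # classes must be known up front for partial_fit

for X_text, y in iter_minibatches(1000):   # hypothetical helper yielding batches
    # The hashing trick maps every batch into the same fixed feature space
    X = vectorizer.transform(X_text)
    clf.partial_fit(X, y, classes=all_classes)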
# Authors: Eustache Diemert <eustache@diemert.fr>
# @FedericoV <https://github.com/FedericoV/>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
def _not_in_sphinx():
    # Hack to detect whether we are running by the sphinx builder
    return '__file__' in globals()
The dataset used in this example is Reuters-21578 as provided by the UCI ML repository. It will be automatically
downloaded and uncompressed on first run.
class ReutersParser(HTMLParser):
    """Utility class to parse a SGML file and yield documents one at a time."""

    def _reset(self):
        self.in_title = 0
        self.in_body = 0
        self.in_topics = 0
        self.in_topic_d = 0
        self.title = ""
        self.body = ""
        self.topics = []
        self.topic_d = ""

    def end_reuters(self):
        self.body = re.sub(r'\s+', r' ', self.body)
        self.docs.append({'title': self.title,
                          'body': self.body,
                          'topics': self.topics})
        self._reset()

    def end_title(self):
        self.in_title = 0

    def end_body(self):
        self.in_body = 0

    def end_topics(self):
        self.in_topics = 0

    def end_d(self):
        self.in_topic_d = 0
        self.topics.append(self.topic_d)
        self.topic_d = ""
def stream_reuters_documents(data_path=None):
    """Iterate over documents of the Reuters dataset.
    """
    DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                    'reuters21578-mld/reuters21578.tar.gz')
    ARCHIVE_FILENAME = 'reuters21578.tar.gz'

    if data_path is None:
        data_path = os.path.join(get_data_home(), "reuters")
    if not os.path.exists(data_path):
        # Download the dataset.
        print("downloading dataset (once and for all) into %s" %
              data_path)
        os.mkdir(data_path)

    parser = ReutersParser()
    for filename in glob(os.path.join(data_path, "*.sgm")):
        for doc in parser.parse(open(filename, 'rb')):
            yield doc
Main
Create the vectorizer and limit the number of features to a reasonable maximum
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
alternate_sign=False)
# We learn a binary classification between the "acq" class and all the others.
# "acq" was chosen as it is more or less evenly distributed in the Reuters
# files. For other datasets, one should take care of creating a test set with
# a realistic portion of positive instances.
all_classes = np.array([0, 1])
positive_class = 'acq'
"""
data = [('{title}\n\n{body}'.format(**doc), pos_class in doc['topics'])
for doc in itertools.islice(doc_iter, size)
if doc['topics']]
if not len(data):
return np.asarray([], dtype=int), np.asarray([], dtype=int)
X_text, y = zip(*data)
return X_text, np.asarray(y, dtype=int)
cls_stats = {}
get_minibatch(data_stream, n_test_documents)
# Discard test set
# We will feed the classifier with mini-batches of 1000 documents; this means
# we have at most 1000 docs in memory at any time. The smaller the document
# batch, the bigger the relative overhead of the partial fit methods.
minibatch_size = 1000
# Create the data_stream that parses Reuters SGML files and iterates on
# documents as a stream.
minibatch_iterators = iter_minibatches(data_stream, minibatch_size)
total_vect_time = 0.0
tick = time.time()
X_train = vectorizer.transform(X_train_text)
total_vect_time += time.time() - tick
if i % 3 == 0:
print(progress(cls_name, cls_stats[cls_name]))
if i % 3 == 0:
print('\n')
Out:
Test set is 994 documents (121 positive)
SGD classifier : 878 train docs ( 108 positive) 994 test docs ( 121 positive) accuracy: 0.920 in 0.81s ( 1090 docs/s)
Perceptron classifier : 878 train docs ( 108 positive) 994 test docs ( 121 positive) accuracy: 0.905 in 0.81s ( 1082 docs/s)
NB Multinomial classifier : 878 train docs ( 108 positive) 994 test docs ( 121 positive) accuracy: 0.878 in 0.84s ( 1048 docs/s)
Passive-Aggressive classifier : 878 train docs ( 108 positive) 994 test docs ( 121 positive) accuracy: 0.934 in 0.84s ( 1045 docs/s)
Plot results
The plot represents the learning curve of the classifier: the evolution of classification accuracy over the course of the
mini-batches. Accuracy is measured on the first 1000 samples, held out as a validation set.
To limit the memory consumption, we queue examples up to a fixed amount before feeding them to the learner.
rcParams['legend.fontsize'] = 10
cls_names = list(sorted(cls_stats.keys()))
plt.figure()
for _, stats in sorted(cls_stats.items()):
    # Plot accuracy evolution with runtime
    accuracy, runtime = zip(*stats['runtime_history'])
    plot_accuracy(runtime, accuracy, 'runtime (s)')
ax = plt.gca()
ax.set_ylim((0.8, 1))
plt.legend(cls_names, loc='best')
cls_runtime.append(total_vect_time)
cls_names.append('Vectorization')
bar_colors = ['b', 'g', 'r', 'c', 'm', 'y']
ax = plt.subplot(111)
rectangles = plt.bar(range(len(cls_names)), cls_runtime, width=0.5,
color=bar_colors)
def autolabel(rectangles):
    """Attach some text via autolabel on rectangles."""
    for rect in rectangles:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width() / 2.,
                1.05 * height, '%.4f' % height,
                ha='center', va='bottom')
plt.setp(plt.xticks()[1], rotation=30)
autolabel(rectangles)
plt.tight_layout()
plt.show()
ax = plt.subplot(111)
rectangles = plt.bar(range(len(cls_names)), cls_runtime, width=0.5,
color=bar_colors)
Total running time of the script: ( 0 minutes 10.339 seconds)
Estimated memory usage: 8 MB
Note: Click here to download the full example code or to run this example in your browser via Binder
A recursive feature elimination example showing the relevance of pixels in a digit classification task.
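The code listing is truncated here; a minimal sketch of the idea, ranking each of the 64 pixel features of the digits data with RFE wrapped around a linear SVM, could look like this.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Recursively eliminate pixels until a single one is left, recording the
# elimination order as a ranking of pixel relevance
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()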
print(__doc__)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates the differences between univariate F-test statistics and mutual information.
We consider 3 features x_1, x_2, x_3 distributed uniformly over [0, 1]; the target depends on them as follows:
y = x_1 + sin(6 * pi * x_2) + 0.1 * N(0, 1), that is, the third feature is completely irrelevant.
The code below plots the dependency of y against individual x_i and the normalized values of the univariate F-test statistics
and mutual information.
As the F-test captures only linear dependency, it rates x_1 as the most discriminative feature. On the other hand, mutual
information can capture any kind of dependency between variables and it rates x_2 as the most discriminative feature,
which probably agrees better with our intuitive perception for this example. Both methods correctly mark x_3 as
irrelevant.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression, mutual_info_regression
np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)
f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)
mi = mutual_info_regression(X, y)
mi /= np.max(mi)
plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y, edgecolor='black', s=20)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
Note: Click here to download the full example code or to run this example in your browser via Binder
Simple usage of Pipeline that successively runs a univariate feature selection with ANOVA and then an SVM on the
selected features.
Using a sub-pipeline, the fitted coefficients can be mapped back into the original feature space.
Out:
accuracy 0.76 25
macro avg 0.77 0.76 0.75 25
weighted avg 0.79 0.76 0.76 25
[[-0.23912131 0. 0. 0. -0.3236911 0.
0. 0. 0. 0. 0. 0.
0.10836648 0. 0. 0. 0. 0.
0. 0. ]
[ 0.43878747 0. 0. 0. -0.51415652 0.
0. 0. 0. 0. 0. 0.
0.04845652 0. 0. 0. 0. 0.
0. 0. ]
[-0.65382998 0. 0. 0. 0.57962856 0.
0. 0. 0. 0. 0. 0.
-0.04736524 0. 0. 0. 0. 0.
0. 0. ]
[ 0.54403412 0. 0. 0. 0.58478491 0.
0. 0. 0. 0. 0. 0.
-0.11344659 0. 0. 0. 0. 0.
0. 0. ]]
print(__doc__)
# ANOVA SVM-C
# 1) anova filter, take 3 best ranked features
anova_filter = SelectKBest(f_regression, k=3)
# 2) svm
clf = svm.LinearSVC()
coef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)
print(coef)
Note: Click here to download the full example code or to run this example in your browser via Binder
A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.
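The listing is truncated below; a minimal sketch of the idea, using RFECV with a linear SVM on a synthetic classification problem, could look like this (the dataset parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Build a classification task where only 3 features are informative
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Cross-validated recursive feature elimination picks the number of features
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)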
print(__doc__)
Note: Click here to download the full example code or to run this example in your browser via Binder
Use SelectFromModel meta-transformer along with Lasso to select the best couple of features from the Boston dataset.
print(__doc__)
# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()
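The rest of the listing is truncated; a minimal sketch of the selection step, keeping the two features with the largest Lasso coefficients, is shown below. The use of max_features with a -inf threshold is one possible way (an assumption of this sketch, not the full example's exact procedure) to keep exactly two features.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = load_boston(return_X_y=True)

# Keep the two features whose LassoCV coefficients are largest in magnitude
sfm = SelectFromModel(LassoCV(), max_features=2, threshold=-np.inf)
X_selected = sfm.fit_transform(X, y)
print(X_selected.shape)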
Note: Click here to download the full example code or to run this example in your browser via Binder
In order to test whether a classification score is significant, one technique is to repeat the classification procedure after
randomizing (permuting) the labels. The p-value is then given by the percentage of runs for which the score obtained is
greater than the classification score obtained in the first place.
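A minimal sketch of this procedure uses permutation_test_score with a linear SVM on the iris data, mirroring the fragments shown below; the number of permutations is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svm = SVC(kernel='linear')
cv = StratifiedKFold(2)

# The score on the true labels is compared with scores on permuted labels
score, permutation_scores, pvalue = permutation_test_score(
    svm, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score, pvalue))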
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Loading a dataset
# Add noisy data to the informative features to make the task harder
X = np.c_[X, E]
svm = SVC(kernel='linear')
cv = StratifiedKFold(2)
# #############################################################################
# View histogram of permutation scores
plt.hist(permutation_scores, 20, label='Permutation scores',
edgecolor='black')
ylim = plt.ylim()
# BUG: vlines(..., linestyle='--') fails on older versions of matplotlib
# plt.vlines(score, ylim[0], ylim[1], linestyle='--',
# color='g', linewidth=3, label='Classification Score'
# ' (pvalue %s)' % pvalue)
# plt.vlines(1.0 / n_classes, ylim[0], ylim[1], linestyle='--',
# color='k', linewidth=3, label='Luck')
plt.plot(2 * [score], ylim, '--g', linewidth=3,
label='Classification Score'
' (pvalue %s)' % pvalue)
plt.plot(2 * [1. / n_classes], ylim, '--k', linewidth=3, label='Luck')
plt.ylim(ylim)
plt.legend()
plt.xlabel('Score')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In the total set of features, only the first 4 are significant. We can see that they have the highest score with
univariate feature selection. The SVM assigns a large weight to one of these features, but also selects many of the
non-informative features. Applying univariate feature selection before the SVM increases the SVM weight attributed
to the significant features, and will thus improve classification.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Import some data to play with
plt.figure(1)
plt.clf()
X_indices = np.arange(X.shape[-1])
# #############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function to select the four
# most significant features
selector = SelectKBest(f_classif, k=4)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
label=r'Univariate score ($-Log(p_{value})$)', color='darkorange',
edgecolor='black')
# #############################################################################
# Compare to the weights of an SVM
clf = make_pipeline(MinMaxScaler(), LinearSVC())
clf.fit(X_train, y_train)
print('Classification accuracy without selecting features: {:.3f}'
.format(clf.score(X_test, y_test)))
svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
svm_weights /= svm_weights.sum()
clf_selected = make_pipeline(
SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC()
)
clf_selected.fit(X_train, y_train)
print('Classification accuracy after univariate feature selection: {:.3f}'
.format(clf_selected.score(X_test, y_test)))
svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the density estimation of a mixture of two Gaussians. Data is generated from two Gaussians with different centers
and covariance matrices.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import mixture
n_samples = 300
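The remainder of the listing is truncated; a minimal sketch of the density estimation itself, reusing the imports and n_samples above, fits a two-component GaussianMixture and evaluates its log-density on a grid (the two generating Gaussians are illustrative).
# Generate data from two Gaussians: one shifted, one stretched
np.random.seed(0)
shifted_gaussian = np.random.randn(n_samples, 2) + np.array([20, 20])
C = np.array([[0., -0.7], [3.5, .7]])
stretched_gaussian = np.dot(np.random.randn(n_samples, 2), C)
X_train = np.vstack([shifted_gaussian, stretched_gaussian])

# Fit a two-component Gaussian mixture with full covariance matrices
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(X_train)

# Evaluate the negative log-density on a grid for a contour plot
x = np.linspace(-20., 30.)
y = np.linspace(-20., 40.)
X, Y = np.meshgrid(x, y)
XX = np.array([X.ravel(), Y.ravel()]).T
Z = -clf.score_samples(XX).reshape(X.shape)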
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the confidence ellipsoids of a mixture of two Gaussians obtained with Expectation Maximisation
(GaussianMixture class) and Variational Inference (BayesianGaussianMixture class models with a
Dirichlet process prior).
Both models have access to five components with which to fit the data. Note that the Expectation Maximisation
model will necessarily use all five components while the Variational Inference model will effectively only use as many
as are needed for a good fit. Here we can see that the Expectation Maximisation model splits some components
arbitrarily, because it is trying to fit too many components, while the Dirichlet Process model adapts its number of states
automatically.
This example doesn’t show it, as we’re in a low-dimensional space, but another advantage of the Dirichlet process
model is that it can fit full covariance matrices effectively even when there are fewer examples per cluster than there are
dimensions in the data, due to the regularization properties of the inference algorithm.
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.xlim(-9., 5.)
plt.ylim(-3., 6.)
plt.xticks(())
plt.yticks(())
plt.title(title)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows that model selection can be performed with Gaussian Mixture Models using information-theoretic
criteria (BIC). Model selection concerns both the covariance type and the number of components in the model. In that
case, AIC also provides the right result (not shown to save time), but BIC is better suited if the problem is to identify
the right model. Unlike Bayesian procedures, such inferences are prior-free.
In that case, the model with 2 components and full covariance (which corresponds to the true generative model) is
selected.
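The selection loop in the listing below is truncated; a minimal sketch of the BIC-based search, assuming X holds the two-dimensional samples generated by the example, could look like this.
import numpy as np
from sklearn import mixture

lowest_bic = np.infty
bic = []
best_gmm = None
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    for n_components in range(1, 7):
        # Fit a Gaussian mixture with EM for this covariance type and size
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            # Keep the model with the lowest BIC
            lowest_bic = bic[-1]
            best_gmm = gmm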
import numpy as np
import itertools
print(__doc__)
lowest_bic = np.infty
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
'darkorange'])
clf = best_gmm
bars = []
plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
import numpy as np
print(__doc__)
iris = datasets.load_iris()
X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]
n_classes = len(np.unique(y_train))
n_estimators = len(estimators)
y_train_pred = estimator.predict(X_train)
train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
plt.text(0.05, 0.9, 'Train accuracy: %.1f' % train_accuracy,
transform=h.transAxes)
y_test_pred = estimator.predict(X_test)
test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
plt.text(0.05, 0.8, 'Test accuracy: %.1f' % test_accuracy,
transform=h.transAxes)
plt.xticks(())
plt.yticks(())
plt.title(name)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example demonstrates the behavior of Gaussian mixture models fit on data that was not sampled from a mixture
of Gaussian random variables. The dataset is formed by 100 points loosely spaced following a noisy sine curve. There
is therefore no ground truth value for the number of Gaussian components.
The first model is a classical Gaussian Mixture Model with 10 components fit with the Expectation-Maximization
algorithm.
The second model is a Bayesian Gaussian Mixture Model with a Dirichlet process prior fit with variational inference.
The low value of the concentration prior makes the model favor a lower number of active components. This model
“decides” to focus its modeling power on the big picture of the structure of the dataset: groups of points with alternating
directions modeled by non-diagonal covariance matrices. Those alternating directions roughly capture the alternating
nature of the original sine signal.
The third model is also a Bayesian Gaussian mixture model with a Dirichlet process prior but this time the value of the
concentration prior is higher giving the model more liberty to model the fine-grained structure of the data. The result
is a mixture with a larger number of active components that is similar to the first model where we arbitrarily decided
to fix the number of components to 10.
Which model is the best is a matter of subjective judgement: do we want to favor models that only capture the big
picture to summarize and explain most of the structure of the data while ignoring the details or do we prefer models
that closely follow the high density regions of the signal?
The last two panels show how we can sample from the last two models. The resulting samples distributions do not
look exactly like the original data distribution. The difference primarily stems from the approximation error we made
by using a model that assumes that the data was generated by a finite number of Gaussian components instead of a
continuous noisy sine curve.
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
print(__doc__)
# Parameters
n_samples = 100
for i in range(X.shape[0]):
    x = i * step - 6.
    X[i, 0] = x + np.random.normal(0, 0.1)
    X[i, 1] = 3. * (np.sin(x) + np.random.normal(0, .2))
plt.figure(figsize=(10, 10))
plt.subplots_adjust(bottom=.04, top=0.95, hspace=.2, wspace=.05,
left=.03, right=.97)
dpgmm = mixture.BayesianGaussianMixture(
n_components=10, covariance_type='full', weight_concentration_prior=1e-2,
weight_concentration_prior_type='dirichlet_process',
mean_precision_prior=1e-2, covariance_prior=1e0 * np.eye(2),
init_params="random", max_iter=100, random_state=2).fit(X)
plot_results(X, dpgmm.predict(X), dpgmm.means_, dpgmm.covariances_, 1,
"Bayesian Gaussian mixture models with a Dirichlet process prior "
r"for $\gamma_0=0.01$.")
dpgmm = mixture.BayesianGaussianMixture(
n_components=10, covariance_type='full', weight_concentration_prior=1e+2,
weight_concentration_prior_type='dirichlet_process',
mean_precision_prior=1e-2, covariance_prior=1e0 * np.eye(2),
init_params="kmeans", max_iter=100, random_state=2).fit(X)
plot_results(X, dpgmm.predict(X), dpgmm.means_, dpgmm.covariances_, 2,
"Bayesian Gaussian mixture models with a Dirichlet process prior "
r"for $\gamma_0=100$")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example plots the ellipsoids obtained from a toy dataset (mixture of three Gaussians) fit-
ted by the BayesianGaussianMixture class models with a Dirichlet distribution prior
(weight_concentration_prior_type='dirichlet_distribution') and a Dirichlet process
prior (weight_concentration_prior_type='dirichlet_process'). On each figure, we plot the
results for three different values of the weight concentration prior.
The BayesianGaussianMixture class can adapt its number of mixture components automatically. The parameter
weight_concentration_prior has a direct link with the resulting number of components with non-zero
weights. Specifying a low value for the concentration prior will make the model put most of the weight on a few
components and set the remaining components' weights very close to zero. High values of the concentration prior will allow a
larger number of components to be active in the mixture.
The Dirichlet process prior makes it possible to define an infinite number of components and automatically selects the correct
number of components: it activates a component only if it is necessary.
On the contrary, the classical finite mixture model with a Dirichlet distribution prior will favor more uniformly weighted
components and therefore tends to divide natural clusters into unnecessary sub-components.
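The two estimators being compared can be written down compactly; the sketch below shows the two prior types side by side (the number of components and the concentration value are illustrative, the full listing varies the concentration across columns).
from sklearn.mixture import BayesianGaussianMixture

# Finite mixture with a Dirichlet distribution prior on the weights
bgmm_dirichlet_distribution = BayesianGaussianMixture(
    n_components=6, covariance_type='full',
    weight_concentration_prior_type='dirichlet_distribution',
    weight_concentration_prior=0.01, max_iter=1500, random_state=2)

# "Infinite" mixture with a Dirichlet process (stick-breaking) prior
bgmm_dirichlet_process = BayesianGaussianMixture(
    n_components=6, covariance_type='full',
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.01, max_iter=1500, random_state=2)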
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
print(__doc__)
ax2.get_xaxis().set_tick_params(direction='out')
ax2.yaxis.grid(True, alpha=0.7)
for k, w in enumerate(estimator.weights_):
    ax2.bar(k, w, width=0.9, color='#56B4E9', zorder=3,
            align='center', edgecolor='black')
    ax2.text(k, w + 0.007, "%.1f%%" % (w * 100.),
             horizontalalignment='center')
ax2.set_xlim(-.6, 2 * n_components - .4)
ax2.set_ylim(0., 1.1)
ax2.tick_params(axis='y', which='both', left=False,
                right=False, labelleft=False)
ax2.tick_params(axis='x', which='both', top=False)
if plot_title:
    ax1.set_ylabel('Estimated Mixtures')
    ax2.set_ylabel('Weight of each component')
# Generate data
rng = np.random.RandomState(random_state)
X = np.vstack([
rng.multivariate_normal(means[j], covars[j], samples[j])
for j in range(n_components)])
y = np.concatenate([np.full(samples[j], j, dtype=int)
for j in range(n_components)])
gs = gridspec.GridSpec(3, len(concentrations_prior))
for k, concentration in enumerate(concentrations_prior):
    estimator.weight_concentration_prior = concentration
    estimator.fit(X)
    plot_results(plt.subplot(gs[0:2, k]), plt.subplot(gs[2, k]), estimator,
                 X, y, r"%s$%.1e$" % (title, concentration),
                 plot_title=k == 0)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary
kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results because the
class-boundaries are linear and coincide with the coordinate axes. In general, stationary kernels often obtain better
results.
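A minimal sketch of the comparison (the plotting code below is truncated): XOR-labelled data fitted with the stationary RBF kernel and with the non-stationary DotProduct kernel.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, DotProduct

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

for kernel in [1.0 * RBF(length_scale=1.0),
               1.0 * DotProduct(sigma_0=1.0) ** 2]:
    clf = GaussianProcessClassifier(kernel=kernel).fit(X, y)
    print("%s: log-marginal-likelihood = %.3f"
          % (clf.kernel_, clf.log_marginal_likelihood(clf.kernel_.theta)))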
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
plt.subplot(1, 2, i + 1)
image = plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
aspect='auto', origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0.5], linewidths=2,
colors=['k'])
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates the predicted probability of GPC for an isotropic and anisotropic RBF kernel on a two-
dimensional version for the iris-dataset. The anisotropic RBF kernel obtains slightly higher log-marginal-likelihood
by assigning different length-scales to the two feature dimensions.
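The two classifiers compared here can be set up as in the following sketch on the first two iris features (the plotting code below is truncated); the isotropic kernel shares one length-scale, the anisotropic one learns a length-scale per feature.
import numpy as np
from sklearn import datasets
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

iris = datasets.load_iris()
X = iris.data[:, :2]           # only two features, so the result can be plotted
y = np.array(iris.target, dtype=int)

kernel_isotropic = 1.0 * RBF([1.0])          # one shared length-scale
kernel_anisotropic = 1.0 * RBF([1.0, 1.0])   # one length-scale per dimension

for kernel in [kernel_isotropic, kernel_anisotropic]:
    clf = GaussianProcessClassifier(kernel=kernel).fit(X, y)
    print("%s, LML: %.3f"
          % (clf.kernel_, clf.log_marginal_likelihood(clf.kernel_.theta)))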
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Both kernel ridge regression (KRR) and Gaussian process regression (GPR) learn a target function by employing
internally the “kernel trick”. KRR learns a linear function in the space induced by the respective kernel which corre-
sponds to a non-linear function in the original space. The linear function in the kernel space is chosen based on the
mean-squared error loss with ridge regularization. GPR uses the kernel to define the covariance of a prior distribution
over the target functions and uses the observed training data to define a likelihood function. Based on Bayes theorem,
a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction.
A major difference is that GPR can choose the kernel’s hyperparameters based on gradient-ascent on the marginal
likelihood function while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error
loss). A further difference is that GPR learns a generative, probabilistic model of the target function and can thus
provide meaningful confidence intervals and posterior samples along with the predictions while KRR only provides
predictions.
This example illustrates both methods on an artificial dataset, which consists of a sinusoidal target function and strong
noise. The figure compares the learned models of KRR and GPR based on an ExpSineSquared kernel, which is suited
for learning periodic functions. The kernel’s hyperparameters control the smoothness (l) and periodicity (p) of the kernel.
Moreover, the noise level of the data is learned explicitly by GPR through an additional WhiteKernel component in the
kernel, and by the regularization parameter alpha of KRR.
The figure shows that both methods learn reasonable models of the target function. GPR correctly identifies the peri-
odicity of the function to be roughly 2*pi (6.28), while KRR chooses the doubled periodicity 4*pi. Besides that, GPR
provides reasonable confidence bounds on the prediction which are not available for KRR. A major difference between
the two methods is the time required for fitting and predicting: while fitting KRR is fast in principle, the grid search
for hyperparameter optimization scales exponentially with the number of hyperparameters (“curse of dimensional-
ity”). The gradient-based optimization of the parameters in GPR does not suffer from this exponential scaling and is
thus considerably faster on this example with a 3-dimensional hyperparameter space. The time for predicting is similar;
however, generating the variance of the predictive distribution of GPR takes considerably longer than just predicting
the mean.
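The two estimators can be set up as in the following sketch on a noisy sinusoid; the hyperparameter grid is kept small here for brevity, the full listing uses a larger one.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = 15 * rng.rand(100, 1)
y = np.sin(X).ravel() + 3 * (0.5 - rng.rand(X.shape[0]))   # strong noise

# KRR: grid search over the regularization and the kernel hyperparameters
param_grid = {"alpha": [1e0, 1e-1, 1e-2, 1e-3],
              "kernel": [ExpSineSquared(l, p)
                         for l in np.logspace(-2, 2, 5)
                         for p in np.logspace(0, 2, 5)]}
kr = GridSearchCV(KernelRidge(), param_grid=param_grid).fit(X, y)

# GPR: hyperparameters (including the noise level via WhiteKernel) are
# tuned by maximizing the log-marginal-likelihood with gradient ascent
gp_kernel = ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1)) \
    + WhiteKernel(1e-1)
gpr = GaussianProcessRegressor(kernel=gp_kernel).fit(X, y)

print("KRR: %s" % kr.best_params_)
print("GPR kernel: %s" % gpr.kernel_)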
print(__doc__)
import time
import numpy as np
rng = np.random.RandomState(0)
stime = time.time()
y_gpr, y_std = gpr.predict(X_plot, return_std=True)
print("Time for GPR prediction with standard-deviation: %.3f"
% (time.time() - stime))
# Plot results
plt.figure(figsize=(10, 5))
lw = 2
plt.scatter(X, y, c='k', label='data')
plt.plot(X_plot, np.sin(X_plot), color='navy', lw=lw, label='True')
plt.plot(X_plot, y_kr, color='turquoise', lw=lw,
label='KRR (%s)' % kr.best_params_)
plt.plot(X_plot, y_gpr, color='darkorange', lw=lw,
label='GPR (%s)' % gpr.kernel_)
plt.fill_between(X_plot[:, 0], y_gpr - y_std, y_gpr + y_std, color='darkorange',
alpha=0.2)
plt.xlabel('data')
plt.ylabel('target')
plt.xlim(0, 20)
plt.ylim(-4, 4)
plt.title('GPR versus Kernel Ridge')
plt.legend(loc="best", scatterpoints=1, prop={'size': 8})
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
6.15.4 Illustration of prior and posterior Gaussian process for different kernels
This example illustrates the prior and posterior of a GPR with different kernels. Mean, standard deviation, and 10
samples are shown for both prior and posterior.
Out:
/home/circleci/project/sklearn/gaussian_process/_gpr.py:494: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
_check_optimize_result("lbfgs", opt_res)
/home/circleci/project/sklearn/gaussian_process/_gpr.py:362: UserWarning: Predicted variances smaller than 0. Setting those variances to 0.
print(__doc__)
import numpy as np
# Plot prior
plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
X_ = np.linspace(0, 5, 100)
y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
alpha=0.2, color='k')
y_samples = gp.sample_y(X_[:, np.newaxis], 10)
plt.plot(X_, y_samples, lw=1)
plt.xlim(0, 5)
plt.ylim(-3, 3)
plt.title("Prior (kernel: %s)" % kernel, fontsize=12)
# Plot posterior
plt.subplot(2, 1, 2)
X_ = np.linspace(0, 5, 100)
y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - y_std, y_mean + y_std,
alpha=0.2, color='k')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A two-dimensional classification example showing iso-probability lines for the predicted probabilities.
Out:
Learned kernel: 0.0256**2 * DotProduct(sigma_0=5.72) ** 2
print(__doc__)
import numpy as np
# A few constants
lim = 8
def g(x):
    """The function to predict (classification will then consist in predicting
    whether g(x) <= 0 or not)"""
    return 5. - x[:, 1] - .5 * x[:, 0] ** 2.
# Design of experiments
X = np.array([[-4.61611719, -6.00099547],
[4.10469096, 5.32782448],
[0.00000000, -0.50000000],
[-6.17289014, -4.6984743],
[1.3109306, -6.93271427],
[-5.03823144, 3.10584743],
[-2.87600388, 6.74310541],
[5.21301203, 4.26386883]])
# Observations
y = np.array(g(X) > 0, dtype=int)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparam-
eters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the
hyperparameters corresponding to the maximum log-marginal-likelihood (LML).
While the hyperparameters chosen by optimizing LML have a considerably larger LML, they perform slightly worse
according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class
probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the
class boundaries (which is bad). This undesirable effect is caused by the Laplace approximation used internally by
GPC.
The second figure shows the log-marginal-likelihood for different choices of the kernel’s hyperparameters, highlighting
the two choices of the hyperparameters used in the first figure by black dots.
Out:
Log Marginal Likelihood (initial): -17.598
Log Marginal Likelihood (optimized): -3.875
Accuracy: 1.000 (initial) 1.000 (optimized)
Log-loss: 0.214 (initial) 0.319 (optimized)
print(__doc__)
import numpy as np
# Plot posteriors
plt.figure()
plt.scatter(X[:train_size, 0], y[:train_size], c='k', label="Train data",
edgecolors=(0, 0, 0))
plt.scatter(X[train_size:, 0], y[train_size:], c='g', label="Test data",
edgecolors=(0, 0, 0))
X_ = np.linspace(0, 5, 100)
plt.plot(X_, gp_fix.predict_proba(X_[:, np.newaxis])[:, 1], 'r',
label="Initial kernel: %s" % gp_fix.kernel_)
plt.plot(X_, gp_opt.predict_proba(X_[:, np.newaxis])[:, 1], 'b',
label="Optimized kernel: %s" % gp_opt.kernel_)
plt.xlabel("Feature")
plt.ylabel("Class 1 probability")
plt.xlim(0, 5)
plt.ylim(-0.25, 1.5)
plt.legend(loc="best")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates that GPR with a sum-kernel including a WhiteKernel can estimate the noise level of data.
An illustration of the log-marginal-likelihood (LML) landscape shows that there exist two local maxima of LML. The
first corresponds to a model with a high noise level and a large length scale, which explains all variations in the data
by noise. The second one has a smaller noise level and shorter length scale, which explains most of the variation by
the noise-free functional relationship. The second model has a higher likelihood; however, depending on the initial
value for the hyperparameters, the gradient-based optimization might also converge to the high-noise solution. It is
thus important to repeat the optimization several times for different initializations.
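One way to do this, sketched below with the same data generation and kernel bounds as in this example, is the n_restarts_optimizer parameter of GaussianProcessRegressor, which re-runs the optimizer from randomly drawn initial hyperparameter values; the number of restarts here is an arbitrary choice.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])

kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
    + WhiteKernel(noise_level=1.0, noise_level_bounds=(1e-10, 1e1))
# Restart the hyperparameter optimization from 10 random initial values so the
# search is less likely to end in the high-noise local maximum
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0,
                              n_restarts_optimizer=10).fit(X, y)
print(gp.kernel_)                                    # fitted kernel
print(gp.log_marginal_likelihood(gp.kernel_.theta))  # LML at the optimum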
print(__doc__)
import numpy as np
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20)[:, np.newaxis]
y = 0.5 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.5, X.shape[0])
# First run
plt.figure()
kernel = 1.0 * RBF(length_scale=100.0, length_scale_bounds=(1e-2, 1e3)) \
+ WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))
gp = GaussianProcessRegressor(kernel=kernel,
alpha=0.0).fit(X, y)
X_ = np.linspace(0, 5, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
# Second run
plt.figure()
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
+ WhiteKernel(noise_level=1e-5, noise_level_bounds=(1e-10, 1e+1))
gp = GaussianProcessRegressor(kernel=kernel,
alpha=0.0).fit(X, y)
X_ = np.linspace(0, 5, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
plt.plot(X_, y_mean, 'k', lw=3, zorder=9)
plt.fill_between(X_, y_mean - np.sqrt(np.diag(y_cov)),
y_mean + np.sqrt(np.diag(y_cov)),
alpha=0.5, color='k')
plt.plot(X_, 0.5*np.sin(3*X_), 'r', lw=3, zorder=9)
plt.scatter(X[:, 0], y, c='r', s=50, zorder=10, edgecolors=(0, 0, 0))
plt.title("Initial: %s\nOptimum: %s\nLog-Marginal-Likelihood: %s"
% (kernel, gp.kernel_,
gp.log_marginal_likelihood(gp.kernel_.theta)))
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(1)
def f(x):
    """The function to predict."""
    return x * np.sin(x)
# ----------------------------------------------------------------------
# First the noiseless case
X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
# Observations
y = f(X).ravel()
# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)
# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x, f(x), 'r:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'r.', markersize=10, label='Observations')
plt.plot(x, y_pred, 'b-', label='Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
np.concatenate([y_pred - 1.9600 * sigma,
(y_pred + 1.9600 * sigma)[::-1]]),
alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')
# ----------------------------------------------------------------------
# now the noisy case
X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T
# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred, sigma = gp.predict(x, return_std=True)
# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x, f(x), 'r:', label=r'$f(x) = x\,\sin(x)$')
plt.errorbar(X.ravel(), y, dy, fmt='r.', markersize=10, label='Observations')
plt.plot(x, y_pred, 'b-', label='Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example is based on Section 5.4.3 of “Gaussian Processes for Machine Learning” [RW2006]. It illustrates an
example of complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-
likelihood. The data consists of the monthly average atmospheric CO2 concentrations (in parts per million by volume
(ppmv)) collected at the Mauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to model the
CO2 concentration as a function of the time t.
The kernel is composed of several terms that are responsible for explaining different properties of the signal:
• a long term, smooth rising trend is to be explained by an RBF kernel. The RBF kernel with a large length-scale
enforces this component to be smooth; it is not enforced that the trend is rising which leaves this choice to the
GP. The specific length-scale and the amplitude are free hyperparameters.
• a seasonal component, which is to be explained by the periodic ExpSineSquared kernel with a fixed periodicity
of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order
to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this
RBF component controls the decay time and is a further free parameter.
• smaller, medium term irregularities are to be explained by a RationalQuadratic kernel component, whose length-
scale and alpha parameter, which determines the diffuseness of the length-scales, are to be determined. Ac-
cording to [RW2006], these irregularities can better be explained by a RationalQuadratic than an RBF kernel
component, probably because it can accommodate several length-scales.
• a “noise” term, consisting of an RBF kernel contribution, which shall explain the correlated noise components
such as local weather phenomena, and a WhiteKernel contribution for the white noise. The relative amplitudes
and the RBF’s length scale are further free parameters.
Maximizing the log-marginal-likelihood after subtracting the target’s mean yields the following kernel with an LML
of -83.214:
34.4**2 * RBF(length_scale=41.8)
+ 3.27**2 * RBF(length_scale=180) * ExpSineSquared(length_scale=1.44,
periodicity=1)
+ 0.446**2 * RationalQuadratic(alpha=17.7, length_scale=0.957)
+ 0.197**2 * RBF(length_scale=0.138) + WhiteKernel(noise_level=0.0336)
Thus, most of the target signal (34.4ppm) is explained by a long-term rising trend (length-scale 41.8 years). The
periodic component has an amplitude of 3.27ppm, a decay time of 180 years and a length-scale of 1.44. The long
decay time indicates that we have a locally very close to periodic seasonal component. The correlated noise has an
amplitude of 0.197ppm with a length scale of 0.138 years and a white-noise contribution of 0.197ppm. Thus, the
overall noise level is very small, indicating that the data can be very well explained by the model. The figure also
shows that the model makes very confident predictions until around 2015.
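The composite kernel (kernel_gpml in the code further below) can be assembled roughly as follows; the initial hyperparameter values in this sketch are illustrative starting points, not the exact ones used in the example, and the fitted values are those reported above.
from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                              RationalQuadratic, WhiteKernel)

k1 = 50.0**2 * RBF(length_scale=50.0)                  # long-term smooth rising trend
k2 = 2.0**2 * RBF(length_scale=100.0) \
    * ExpSineSquared(length_scale=1.0, periodicity=1.0,
                     periodicity_bounds="fixed")       # seasonal component, decaying away from periodicity
k3 = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)  # medium-term irregularities
k4 = 0.1**2 * RBF(length_scale=0.1) \
    + WhiteKernel(noise_level=0.1**2)                  # correlated noise + white noise
kernel_gpml = k1 + k2 + k3 + k4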
Out:
... + WhiteKernel(noise_level=0.0361)
Log-marginal-likelihood: -117.023
... + WhiteKernel(noise_level=0.0367)
Log-marginal-likelihood: -115.050
import numpy as np
print(__doc__)
def load_mauna_loa_atmospheric_co2():
    ml_data = fetch_openml(data_id=41187)
    months = []
    ppmv_sums = []
    counts = []
    y = ml_data.data[:, 0]
    m = ml_data.data[:, 1]
    month_float = y + (m - 1) / 12
    ppmvs = ml_data.target
    months = np.asarray(months).reshape(-1, 1)
    avg_ppmvs = np.asarray(ppmv_sums) / counts
    return months, avg_ppmvs
X, y = load_mauna_loa_atmospheric_co2()
gp = GaussianProcessRegressor(kernel=kernel_gpml, alpha=0,
optimizer=None, normalize_y=True)
gp.fit(X, y)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0,
normalize_y=True)
gp.fit(X, y)
# Illustration
plt.scatter(X, y, c='k')
plt.plot(X_, y_pred)
plt.fill_between(X_[:, 0], y_pred - y_std, y_pred + y_std,
alpha=0.5, color='k')
plt.xlim(X_.min(), X_.max())
plt.xlabel("Year")
plt.ylabel(r"CO$_2$ in ppm")
plt.title(r"Atmospheric CO$_2$ concentration at Mauna Loa")
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates the use of Gaussian processes for regression and classification tasks on data that are not in
fixed-length feature vector form. This is achieved through the use of kernel functions that operate directly on discrete
structures such as variable-length sequences, trees, and graphs.
Specifically, here the input variables are some gene sequences stored as variable-length strings consisting of letters
‘A’, ‘T’, ‘C’, and ‘G’, while the output variables are floating point numbers and True/False labels in the regression and
classification tasks, respectively.
A kernel between the gene sequences is defined using R-convolution [1] by integrating a binary letter-wise kernel over
all pairs of letters among a pair of strings.
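The sketch below shows, in simplified form and under assumed naming, how such an R-convolution kernel can be written as a custom scikit-learn kernel: the kernel value of two sequences is the sum of a letter-wise kernel over all pairs of letters (1 for identical letters, a baseline_similarity otherwise). The class used in the example itself (SequenceKernel, partially shown further below) additionally exposes the baseline similarity as a tunable hyperparameter and provides its gradient; those pieces are omitted here.
import numpy as np
from sklearn.gaussian_process.kernels import GenericKernelMixin, Kernel

class SimpleSequenceKernel(GenericKernelMixin, Kernel):
    def __init__(self, baseline_similarity=0.5,
                 baseline_similarity_bounds=(1e-5, 1)):
        self.baseline_similarity = baseline_similarity
        self.baseline_similarity_bounds = baseline_similarity_bounds

    def _f(self, s1, s2):
        # letter-wise kernel integrated over all pairs of letters
        return sum(1.0 if c1 == c2 else self.baseline_similarity
                   for c1 in s1 for c2 in s2)

    def __call__(self, X, Y=None, eval_gradient=False):
        Y = X if Y is None else Y
        K = np.array([[self._f(x, y) for y in Y] for x in X])
        # no tunable hyperparameters are registered here, so the gradient is empty
        return (K, np.empty((len(X), len(Y), 0))) if eval_gradient else K

    def diag(self, X):
        return np.array([self._f(x, x) for x in X])

    def is_stationary(self):
        return False
With a kernel of this kind, GaussianProcessRegressor and GaussianProcessClassifier can be fitted directly on lists of strings, as is done in the example code below.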
This example will generate three figures.
In the first figure, we visualize the value of the kernel, i.e. the similarity of the sequences, using a colormap. Brighter
color here indicates higher similarity.
In the second figure, we show some regression results on a dataset of 6 sequences. Here we use the 1st, 2nd, 4th, and
5th sequences as the training set to make predictions on the 3rd and 6th sequences.
In the third figure, we demonstrate a classification model by training on 6 sequences and making predictions on another
5 sequences. The ground truth here is simply whether there is at least one ‘A’ in the sequence. Here the model makes
four correct classifications and fails on one.
[1] Haussler, D. (1999). Convolution kernels on discrete structures (Vol. 646). Technical report, Department of Computer Science, University of California at Santa Cruz.
Out:
/home/circleci/project/sklearn/gaussian_process/_gpr.py:494: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process.kernels import Kernel, Hyperparameter
from sklearn.gaussian_process.kernels import GenericKernelMixin
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.base import clone
    @property
    def hyperparameter_baseline_similarity(self):
        return Hyperparameter("baseline_similarity",
                              "numeric",
                              self.baseline_similarity_bounds)

    def is_stationary(self):
        return False
kernel = SequenceKernel()
'''
Sequence similarity matrix under the kernel
===========================================
'''
K = kernel(X)
D = kernel.diag(X)
plt.figure(figsize=(8, 5))
plt.imshow(np.diag(D**-0.5).dot(K).dot(np.diag(D**-0.5)))
plt.xticks(np.arange(len(X)), X)
plt.yticks(np.arange(len(X)), X)
plt.title('Sequence similarity under the kernel')
'''
Regression
==========
'''
training_idx = [0, 1, 3, 4]
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X[training_idx], Y[training_idx])
plt.figure(figsize=(8, 5))
plt.bar(np.arange(len(X)), gp.predict(X), color='b', label='prediction')
plt.bar(training_idx, Y[training_idx], width=0.2, color='r',
alpha=1, label='training')
plt.xticks(np.arange(len(X)), X)
plt.title('Regression on sequences')
plt.legend()
'''
gp = GaussianProcessClassifier(kernel)
gp.fit(X_train, Y_train)
plt.figure(figsize=(8, 5))
plt.scatter(np.arange(len(X_train)), [1.0 if c else -1.0 for c in Y_train],
s=100, marker='o', edgecolor='none', facecolor=(1, 0.75, 0),
label='training')
plt.scatter(len(X_train) + np.arange(len(X_test)),
[1.0 if c else -1.0 for c in Y_test],
s=100, marker='o', edgecolor='none', facecolor='r', label='truth')
plt.scatter(len(X_train) + np.arange(len(X_test)),
[1.0 if c else -1.0 for c in gp.predict(X_test)],
s=100, marker='x', edgecolor=(0, 1.0, 0.3), linewidth=2,
label='prediction')
plt.xticks(np.arange(len(X_train) + len(X_test)),
np.concatenate((X_train, X_test)))
plt.yticks([-1, 1], [False, True])
plt.title('Classification on sequences')
plt.legend()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Computes Lasso Path along the regularization parameter using the LARS algorithm on the diabetes dataset. Each
color represents a different feature of the coefficient vector, and this is displayed as a function of the regularization
parameter.
Out:
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
X, y = datasets.load_diabetes(return_X_y=True)
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]
plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the maximum margin separating hyperplane within a two-class separable dataset using a linear Support Vector
Machine classifier trained using SGD.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
clf.fit(X, Y)
# plot the line, the points, and the nearest vectors to the plane
xx = np.linspace(-1, 5, 10)
yy = np.linspace(-1, 5, 10)
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
# #############################################################################
# Compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A plot that compares the various convex loss functions supported by sklearn.linear_model.SGDClassifier.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
Note: Click here to download the full example code or to run this example in your browser via Binder
Due to the few points in each dimension and the straight line that linear regression uses to follow these points as well
as it can, noise on the observations will cause great variance as shown in the first plot. Every line’s slope can vary
quite a bit for each prediction due to the noise induced in the observations.
Ridge regression is basically minimizing a penalised version of the least-squares function. The penalising shrinks
the value of the regression coefficients. Despite the few data points in each dimension, the slope of the prediction is
much more stable and the variance in the line itself is greatly reduced, in comparison to that of the standard linear
regression.
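A minimal sketch of this contrast, assuming two training points and an illustrative alpha: across noisy replications, the slope of the OLS fit varies far more than the ridge slope.
import numpy as np
from sklearn import linear_model

X_train = np.c_[.5, 1].T          # two training points
y_train = [.5, 1]

np.random.seed(0)
for name, clf in [('ols', linear_model.LinearRegression()),
                  ('ridge', linear_model.Ridge(alpha=.1))]:
    slopes = []
    for _ in range(6):
        # perturb the training points slightly and refit
        this_X = .1 * np.random.normal(size=(2, 1)) + X_train
        clf.fit(this_X, y_train)
        slopes.append(clf.coef_[0])
    print(name, np.std(slopes))   # the ridge slopes vary much less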
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
classifiers = dict(ols=linear_model.LinearRegression(),
ridge=linear_model.Ridge(alpha=.1))
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X_train
    clf.fit(this_X, y_train)
clf.fit(X_train, y_train)
ax.plot(X_test, clf.predict(X_test), linewidth=2, color='blue')
ax.scatter(X_train, y_train, s=30, c='red', marker='+', zorder=10)
ax.set_title(name)
fig.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Ridge Regression is the estimator used in this example. Each color in the left plot represents one different dimension
of the coefficient vector, and this is displayed as a function of the regularization parameter. The right plot shows
how exact the solution is. This example illustrates how a well defined solution is found by Ridge regression and
how regularization affects the coefficients and their values. The plot on the right shows how the difference of the
coefficients from the estimator changes as a function of regularization.
In this example the dependent variable Y is set as a function of the input features: y = X*w + c. The coefficient vector
w is randomly sampled from a normal distribution, whereas the bias term c is set to a constant.
As alpha tends toward zero, the coefficients found by Ridge regression stabilize towards the randomly sampled vector
w. For large alpha (strong regularisation), the coefficients are smaller (eventually converging to 0), leading to a simpler
and biased solution. These dependencies can be observed on the left plot.
The right plot shows the mean squared error between the coefficients found by the model and the chosen vector w.
Less regularised models recover the exact coefficients (the error equals 0), while more strongly regularised models increase the
error.
Please note that in this example the data is noise-free, hence it is possible to recover the exact coefficients.
print(__doc__)
clf = Ridge()
coefs = []
errors = []
# Display results
plt.figure(figsize=(20, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.subplot(122)
ax = plt.gca()
ax.plot(alphas, errors)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('error')
plt.title('Coefficient error as a function of the regularization')
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Contours of where the penalty is equal to 1 for the three penalties L1, L2 and elastic-net.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
l1_color = "navy"
l2_color = "c"
elastic_net_color = "darkorange"
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example demonstrates how to approximate a function with a polynomial of degree n_degree by using ridge
regression. Concretely, from n_samples 1d points, it suffices to build the Vandermonde matrix, which is n_samples x
n_degree+1 and has the following form:
[[1, x_1, x_1 ** 2, x_1 ** 3, ...], [1, x_2, x_2 ** 2, x_2 ** 3, ...], ...]
Intuitively, this matrix can be interpreted as a matrix of pseudo features (the points raised to some power). The matrix
is akin to (but different from) the matrix induced by a polynomial kernel.
This example shows that you can do non-linear regression with a linear model, using a pipeline to add non-linear
features. Kernel methods extend this idea and can induce very high (even infinite) dimensional feature spaces.
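A minimal sketch of this idea using PolynomialFeatures inside a pipeline (the degree and alpha below are illustrative; the example itself builds the Vandermonde matrix explicitly):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

x = np.linspace(0, 10, 100)
y = x * np.sin(x)                              # function to approximate
X = x[:, np.newaxis]

# Polynomial pseudo-features followed by a linear (ridge) model
model = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=1e-3))
model.fit(X, y)
y_pred = model.predict(X)                      # non-linear fit from a linear model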
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
def f(x):
    """ function to approximate by polynomial interpolation"""
    return x * np.sin(x)
plt.legend(loc='lower left')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Shown in the plot is how the logistic regression would, in this synthetic dataset, classify values as either 0 or 1, i.e.
class one or two, using the logistic curve.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Generate a toy dataset: it's just a straight line with some Gaussian noise:
xmin, xmax = -5, 5
n_samples = 100
np.random.seed(0)
X = np.random.normal(size=n_samples)
y = (X > 0).astype(float)
X[X > 0] *= 4
X += .3 * np.random.normal(size=n_samples)
X = X[:, np.newaxis]
ols = linear_model.LinearRegression()
plt.ylabel('y')
plt.xlabel('X')
plt.xticks(range(-5, 10))
plt.yticks([0, 0.5, 1])
plt.ylim(-.25, 1.25)
plt.xlim(-4, 10)
plt.legend(('Logistic Regression Model', 'Linear Regression Model'),
loc="lower right", fontsize='small')
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Train l1-penalized logistic regression models on a binary classification problem derived from the Iris dataset.
The models are ordered from strongest regularized to least regularized. The 4 coefficients of the models are collected
and plotted as a “regularization path”: on the left-hand side of the figure (strong regularizers), all the coefficients are
exactly 0. When regularization gets progressively looser, coefficients can get non-zero values one after the other.
Here we choose the liblinear solver because it can efficiently optimize for the Logistic Regression loss with a non-
smooth, sparsity-inducing l1 penalty.
Also note that we set a low value for the tolerance to make sure that the model has converged before collecting the
coefficients.
We also use warm_start=True, which means that the coefficients of one model are reused to initialize the next model
fit, speeding up the computation of the full path.
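A sketch of how such a regularization path could be computed is shown below; the grid of C values (scaled up from l1_min_c) and the tolerance are illustrative.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.svm import l1_min_c

iris = datasets.load_iris()
X, y = iris.data[iris.target != 2], iris.target[iris.target != 2]

cs = l1_min_c(X, y, loss='log') * np.logspace(0, 7, 16)
clf = LogisticRegression(penalty='l1', solver='liblinear', tol=1e-6,
                         max_iter=int(1e6), warm_start=True)
coefs_ = []
for c in cs:
    clf.set_params(C=c)
    clf.fit(X, y)                 # warm_start reuses the previous coefficients
    coefs_.append(clf.coef_.ravel().copy())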
Out:
print(__doc__)
iris = datasets.load_iris()
X = iris.data
X = X[y != 2]
y = y[y != 2]
# #############################################################################
# Demo path functions
coefs_ = np.array(coefs_)
plt.plot(np.log10(cs), coefs_, marker='o')
ymin, ymax = plt.ylim()
plt.xlabel('log(C)')
plt.ylabel('Coefficients')
plt.title('Logistic Regression Path')
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Shown below are the decision boundaries of a logistic regression classifier on the first two dimensions (sepal length and width)
of the iris dataset. The data points are colored according to their labels.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
logreg = LogisticRegression(C=1e5)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02 # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot the decision function of a weighted dataset, where the size of the points is proportional to their weight.
print(__doc__)
# we create 20 points
np.random.seed(0)
X = np.r_[np.random.randn(10, 2) + [1, 1], np.random.randn(10, 2)]
y = [1] * 10 + [-1] * 10
sample_weight = 100 * np.abs(np.random.randn(20))
# and assign a bigger weight to the last 10 samples
sample_weight[:10] *= 10
plt.legend([no_weights.collections[0], samples_weights.collections[0]],
["no weights", "with weights"], loc="lower left")
plt.xticks(())
plt.yticks(())
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example uses only the first feature of the diabetes dataset, in order to illustrate a two-dimensional plot of
this regression technique. The straight line can be seen in the plot, showing how linear regression attempts to draw a
straight line that will best minimize the residual sum of squares between the observed responses in the dataset and the
responses predicted by the linear approximation.
The coefficients, the residual sum of squares and the coefficient of determination are also calculated.
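The code shown below picks up after the data preparation; a sketch of that missing setup might look as follows (the chosen feature column and the size of the held-out test set are assumptions for illustration):
import numpy as np
from sklearn import datasets, linear_model

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 0]        # keep a single feature (assumed column)

# Hold out the last 20 samples for testing
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]

regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)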
Out:
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
print(__doc__)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
% r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In this example we see how to robustly fit a linear model to faulty data using the RANSAC algorithm.
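A sketch of the main steps, with illustrative data sizes and corruption: fit a RANSACRegressor, read off the inlier mask, and predict along a line of x values.
import numpy as np
from sklearn import linear_model, datasets

X, y, coef = datasets.make_regression(n_samples=1000, n_features=1, noise=10,
                                      coef=True, random_state=0)
# Corrupt a fraction of the samples so that they act as outliers
np.random.seed(0)
X[:50] = 3 + 0.5 * np.random.normal(size=(50, 1))
y[:50] = -3 + 10 * np.random.normal(size=50)

ransac = linear_model.RANSACRegressor()
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_                # samples classified as inliers
outlier_mask = np.logical_not(inlier_mask)

line_X = np.arange(X.min(), X.max())[:, np.newaxis]
line_y_ransac = ransac.predict(line_X)           # robust fit, ignoring the outliers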
Out:
import numpy as np
from matplotlib import pyplot as plt
n_samples = 1000
n_outliers = 50
lw = 2
plt.scatter(X[inlier_mask], y[inlier_mask], color='yellowgreen', marker='.',
label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask], color='gold', marker='.',
label='Outliers')
plt.plot(line_X, line_y, color='navy', linewidth=lw, label='Linear regressor')
plt.plot(line_X, line_y_ransac, color='cornflowerblue', linewidth=lw,
label='RANSAC regressor')
plt.legend(loc='lower right')
plt.xlabel("Input")
plt.ylabel("Response")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Features 1 and 2 of the diabetes dataset are fitted and plotted below. It illustrates that although feature 2 has a strong
coefficient on the full model, it does not give us much information about y when compared to just feature 1.
print(__doc__)
X, y = datasets.load_diabetes(return_X_y=True)
indices = (0, 1)
ols = linear_model.LinearRegression()
ols.fit(X_train, y_train)
# #############################################################################
# Plot the figure
def plot_figs(fig_num, elev, azim, X_train, clf):
    fig = plt.figure(fig_num, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, elev=elev, azim=azim)
elev = -.5
azim = 0
plot_figs(2, elev, azim, X_train, ols)
elev = -.5
azim = 90
plot_figs(3, elev, azim, X_train, ols)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(X.min(), X.max(), 7)
epsilon_values = [1.35, 1.5, 1.75, 1.9]
for k, epsilon in enumerate(epsilon_values):
    huber = HuberRegressor(alpha=0.0, epsilon=epsilon)
    huber.fit(X, y)
    coef_ = huber.coef_ * x + huber.intercept_
    plt.plot(x, coef_, colors[k], label="huber loss, %s" % epsilon)
Note: Click here to download the full example code or to run this example in your browser via Binder
We show that linear_model.Lasso provides the same results for dense and sparse data and that in the case of sparse
data the speed is improved.
Out:
print(__doc__)
# #############################################################################
# The two Lasso implementations on Dense data
print("--- Dense matrices")
alpha = 1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
t0 = time()
sparse_lasso.fit(X_sp, y)
print("Sparse Lasso done in %fs" % (time() - t0))
t0 = time()
dense_lasso.fit(X, y)
print("Dense Lasso done in %fs" % (time() - t0))
# #############################################################################
# The two Lasso implementations on Sparse data
print("--- Sparse matrices")
Xs = X.copy()
Xs[Xs < 2.5] = 0.0
Xs = sparse.coo_matrix(Xs)
Xs = Xs.tocsc()
alpha = 0.1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
t0 = time()
sparse_lasso.fit(Xs, y)
print("Sparse Lasso done in %fs" % (time() - t0))
t0 = time()
dense_lasso.fit(Xs.toarray(), y)
print("Dense Lasso done in %fs" % (time() - t0))
Note: Click here to download the full example code or to run this example in your browser via Binder
An example showing how different online solvers perform on the hand-written digits dataset.
Out:
training SGD
training ASGD
training Perceptron
training Passive-Aggressive I
training Passive-Aggressive II
training SAG
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
classifiers = [
("SGD", SGDClassifier(max_iter=100)),
("ASGD", SGDClassifier(average=True)),
("Perceptron", Perceptron()),
("Passive-Aggressive I", PassiveAggressiveClassifier(loss='hinge',
C=1.0, tol=1e-4)),
("Passive-Aggressive II", PassiveAggressiveClassifier(loss='squared_hinge',
C=1.0, tol=1e-4)),
("SAG", LogisticRegression(solver='sag', tol=1e-1, C=1.e4 / X.shape[0]))
]
xx = 1. - np.array(heldout)
plt.legend(loc="upper right")
plt.xlabel("Proportion train")
plt.ylabel("Test Error Rate")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The multi-task lasso allows fitting multiple regression problems jointly, enforcing the selected features to be the same
across tasks. This example simulates sequential measurements: each task is a time instant, and the relevant features
vary in amplitude over time while remaining the same. The multi-task lasso imposes that features that are selected at one
time point are selected for all time points. This makes feature selection by the Lasso more stable.
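A compact sketch of the two estimators being contrasted (the alpha values and the toy data below are illustrative): a Lasso fitted independently per task versus a joint MultiTaskLasso.
import numpy as np
from sklearn.linear_model import Lasso, MultiTaskLasso

rng = np.random.RandomState(42)
n_samples, n_features, n_tasks = 100, 30, 40
X = rng.randn(n_samples, n_features)
coef = np.zeros((n_tasks, n_features))
coef[:, :5] = rng.randn(n_tasks, 5)            # 5 relevant features shared by all tasks
Y = X.dot(coef.T) + rng.randn(n_samples, n_tasks)

# Independent Lasso per task: each task may select a different support
coef_lasso_ = np.array([Lasso(alpha=0.5).fit(X, Y[:, i]).coef_
                        for i in range(n_tasks)])
# MultiTaskLasso: a feature is either selected for all tasks or for none
coef_multi_task_lasso_ = MultiTaskLasso(alpha=1.).fit(X, Y).coef_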
print(__doc__)
rng = np.random.RandomState(42)
# Generate some 2D coefficients with sine waves with random frequency and phase
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2 * np.pi, n_tasks)
for k in range(n_relevant_features):
    coef[:, k] = np.sin((1. + rng.randn(1)) * times + 3 * rng.randn(1))
X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)
# #############################################################################
feature_to_plot = 0
plt.figure()
lw = 2
plt.plot(coef[:, feature_to_plot], color='seagreen', linewidth=lw,
label='Ground truth')
plt.plot(coef_lasso_[:, feature_to_plot], color='cornflowerblue', linewidth=lw,
label='Lasso')
plt.plot(coef_multi_task_lasso_[:, feature_to_plot], color='gold', linewidth=lw,
label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Here we fit a multinomial logistic regression with L1 penalty on a subset of the MNIST digits classification task. We
use the SAGA algorithm for this purpose: this is a solver that is fast when the number of samples is significantly larger
than the number of features and is able to finely optimize non-smooth objective functions, which is the case with the
l1 penalty. Test accuracy reaches > 0.8, while weight vectors remain sparse and therefore more easily interpretable.
Note that the accuracy of this l1-penalized linear model is significantly below what can be reached by an l2-penalized
linear model or a non-linear multi-layer perceptron model on this dataset.
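A sketch of the estimator used here, run on the small built-in digits dataset as a stand-in for MNIST (C and tol are illustrative):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(C=0.1, penalty='l1', solver='saga',
                         multi_class='multinomial', tol=0.1)
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100       # percentage of zero weights
print("Sparsity: %.2f%%, test score: %.4f"
      % (sparsity, clf.score(X_test, y_test)))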
Out:
Sparsity with L1 penalty: 79.95%
Test score with L1 penalty: 0.8322
Example run in 32.402 s
import time
import matplotlib.pyplot as plt
import numpy as np
print(__doc__)
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
coef = clf.coef_.copy()
plt.figure(figsize=(10, 5))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
                   cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l1_plot.set_xlabel('Class %i' % i)
plt.suptitle('Classification vector for...')
run_time = time.time() - t0
print('Example run in %.3f s' % run_time)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot decision surface of multi-class SGD on iris dataset. The hyperplanes corresponding to the three one-versus-all
(OVA) classifiers are represented by the dashed lines.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
# shuffle
idx = np.arange(X.shape[0])
np.random.seed(13)
np.random.shuffle(idx)
X = X[idx]
y = y[idx]
# standardize
mean = X.mean(axis=0)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis('tight')
Note: Click here to download the full example code or to run this example in your browser via Binder
Using orthogonal matching pursuit for recovering a sparse signal from a noisy measurement encoded with a dictionary.
print(__doc__)
# y = Xw
# |x|_0 = n_nonzero_coefs
y, X, w = make_sparse_coded_signal(n_samples=1,
n_components=n_components,
n_features=n_features,
n_nonzero_coefs=n_nonzero_coefs,
random_state=0)
idx, = w.nonzero()
Note: Click here to download the full example code or to run this example in your browser via Binder
Estimates Lasso and Elastic-Net regression models on a manually generated sparse signal corrupted with an additive
noise. Estimated coefficients are compared with the ground-truth.
Out:
Lasso(alpha=0.1)
r^2 on test data : 0.658064
ElasticNet(alpha=0.1, l1_ratio=0.7)
r^2 on test data : 0.642515
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Generate some sparse data to play with
np.random.seed(42)
# Add noise
y += 0.01 * np.random.normal(size=n_samples)
# #############################################################################
# Lasso
from sklearn.linear_model import Lasso
alpha = 0.1
lasso = Lasso(alpha=alpha)
# #############################################################################
# ElasticNet
from sklearn.linear_model import ElasticNet
plt.legend(loc='best')
plt.title("Lasso $R^2$: %.3f, Elastic Net $R^2$: %.3f"
% (r2_score_lasso, r2_score_enet))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0., 1., size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0., 1., 100)
# #############################################################################
# Fit by cubic polynomial
n_order = 3
X_train = np.vander(x_train, n_order + 1, increasing=True)
X_test = np.vander(x_test, n_order + 1, increasing=True)
# #############################################################################
# Plot the true and predicted curves with log marginal likelihood (L)
reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    # Bayesian ridge regression with different initial value pairs
    if i == 0:
        init = [1 / np.var(y_train), 1.]  # Default values
    elif i == 1:
        init = [1., 1e-3]
    reg.set_params(alpha_init=init[0], lambda_init=init[1])
    reg.fit(X_train, y_train)
    ymean, ystd = reg.predict(X_test, return_std=True)
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
# Author: Florian Wilhelm -- <florian.wilhelm@gmail.com>
# License: BSD 3 clause
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, TheilSenRegressor
from sklearn.linear_model import RANSACRegressor
print(__doc__)
# #############################################################################
# Outliers only in the y direction
np.random.seed(0)
n_samples = 200
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
w = 3.
c = 2.
noise = 0.1 * np.random.randn(n_samples)
plt.axis('tight')
plt.legend(loc='upper left')
plt.title("Corrupt y")
# #############################################################################
# Outliers in the X direction
np.random.seed(0)
# Linear model y = 3*x + N(2, 0.1**2)
x = np.random.randn(n_samples)
noise = 0.1 * np.random.randn(n_samples)
y = 3 * x + 2 + noise
# 10% outliers
x[-20:] = 9.9
y[-20:] += 22
X = x[:, np.newaxis]
plt.figure()
plt.scatter(x, y, color='indigo', marker='x', s=40)
plt.axis('tight')
plt.legend(loc='upper left')
plt.title("Corrupt x")
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Plot decision surface of multinomial and One-vs-Rest Logistic Regression. The hyperplanes corresponding to the
three One-vs-Rest (OVR) classifiers are represented by the dashed lines.
Out:
training score : 0.995 (multinomial)
training score : 0.976 (ovr)
print(__doc__)
# Authors: Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.title("Decision surface of LogisticRegression (%s)" % multi_class)
plt.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Here a sine function is fit with a polynomial of order 3, for values close to zero.
from matplotlib import pyplot as plt
import numpy as np
np.random.seed(42)
X = np.random.normal(size=400)
y = np.sin(X)
# Make sure that X is 2D
X = X[:, np.newaxis]
X_test = np.random.normal(size=200)
y_test = np.sin(X_test)
X_test = X_test[:, np.newaxis]
y_errors = y.copy()
y_errors[::3] = 3
X_errors = X.copy()
X_errors[::3] = 3
y_errors_large = y.copy()
y_errors_large[::3] = 10
X_errors_large = X.copy()
X_errors_large[::3] = 10
Note: Click here to download the full example code or to run this example in your browser via Binder
Comparison of the sparsity (percentage of zero coefficients) of solutions when L1, L2 and Elastic-Net penalty are used
for different values of C. We can see that large values of C give more freedom to the model. Conversely, smaller values
of C constrain the model more. In the L1 penalty case, this leads to sparser solutions. As expected, the Elastic-Net
penalty sparsity is between that of L1 and L2.
We classify 8x8 images of digits into two classes: 0-4 against 5-9. The visualization shows coefficients of the models
for varying C.
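A sketch of the three penalized models being compared for a single value of C (the solver, tolerance, and l1_ratio are illustrative; the elasticnet penalty requires the saga solver):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
y = (y > 4).astype(int)                # two classes: digits 0-4 vs 5-9

C, l1_ratio = 1.0, 0.5
clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01, solver='saga')
clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01, solver='saga')
clf_en_LR = LogisticRegression(C=C, penalty='elasticnet', l1_ratio=l1_ratio,
                               tol=0.01, solver='saga')
for clf in (clf_l1_LR, clf_en_LR, clf_l2_LR):
    clf.fit(X, y)
    sparsity = np.mean(clf.coef_ == 0) * 100   # percentage of zero coefficients
    print("%s sparsity: %.2f%%, score: %.2f"
          % (clf.penalty, sparsity, clf.score(X, y)))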
Out:
C=1.00
Sparsity with L1 penalty: 6.25%
Sparsity with Elastic-Net penalty: 4.69%
Sparsity with L2 penalty: 4.69%
Score with L1 penalty: 0.90
Score with Elastic-Net penalty: 0.90
Score with L2 penalty: 0.90
C=0.10
Sparsity with L1 penalty: 29.69%
Sparsity with Elastic-Net penalty: 12.50%
Sparsity with L2 penalty: 4.69%
Score with L1 penalty: 0.90
Score with Elastic-Net penalty: 0.90
Score with L2 penalty: 0.90
C=0.01
Sparsity with L1 penalty: 84.38%
Sparsity with Elastic-Net penalty: 68.75%
Sparsity with L2 penalty: 4.69%
Score with L1 penalty: 0.86
Score with Elastic-Net penalty: 0.88
Score with L2 penalty: 0.89
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
X, y = datasets.load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
coef_l1_LR = clf_l1_LR.coef_.ravel()
coef_l2_LR = clf_l2_LR.coef_.ravel()
coef_en_LR = clf_en_LR.coef_.ravel()
print("C=%.2f" % C)
print("{:<40} {:.2f}%".format("Sparsity with L1 penalty:", sparsity_l1_LR))
print("{:<40} {:.2f}%".format("Sparsity with Elastic-Net penalty:",
sparsity_en_LR))
print("{:<40} {:.2f}%".format("Sparsity with L2 penalty:", sparsity_l2_LR))
print("{:<40} {:.2f}".format("Score with L1 penalty:",
clf_l1_LR.score(X, y)))
print("{:<40} {:.2f}".format("Score with Elastic-Net penalty:",
clf_en_LR.score(X, y)))
print("{:<40} {:.2f}".format("Score with L2 penalty:",
clf_l2_LR.score(X, y)))
if i == 0:
    axes_row[0].set_title("L1 penalty")
    axes_row[1].set_title("Elastic-Net\nl1_ratio = %s" % l1_ratio)
    axes_row[2].set_title("L2 penalty")
axes_row[0].set_ylabel('C = %s' % C)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Lasso and elastic net (L1 and L2 penalisation) implemented using coordinate descent.
The coefficients can be forced to be positive.
Out:
Computing regularization path using the lasso...
Computing regularization path using the positive lasso...
Computing regularization path using the elastic net...
Computing regularization path using the positive elastic net...
print(__doc__)
X, y = datasets.load_diabetes(return_X_y=True)
# Compute paths
# Display results
plt.figure(1)
colors = cycle(['b', 'r', 'g', 'c', 'k'])
neg_log_alphas_lasso = -np.log10(alphas_lasso)
neg_log_alphas_enet = -np.log10(alphas_enet)
for coef_l, coef_e, c in zip(coefs_lasso, coefs_enet, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
    l2 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)
plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and Elastic-Net Paths')
plt.legend((l1[-1], l2[-1]), ('Lasso', 'Elastic-Net'), loc='lower left')
plt.axis('tight')
plt.figure(2)
neg_log_alphas_positive_lasso = -np.log10(alphas_positive_lasso)
for coef_l, coef_pl, c in zip(coefs_lasso, coefs_positive_lasso, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)
    l2 = plt.plot(neg_log_alphas_positive_lasso, coef_pl, linestyle='--', c=c)
plt.xlabel('-Log(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and positive Lasso')
plt.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')
plt.axis('tight')
plt.figure(3)
neg_log_alphas_positive_enet = -np.log10(alphas_positive_enet)
for (coef_e, coef_pe, c) in zip(coefs_enet, coefs_positive_enet, colors):
    l1 = plt.plot(neg_log_alphas_enet, coef_e, c=c)
    l2 = plt.plot(neg_log_alphas_positive_enet, coef_pe, linestyle='--', c=c)
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# #############################################################################
# Generating simulated data with Gaussian weights
# #############################################################################
# Fit the ARD Regression
clf = ARDRegression(compute_score=True)
clf.fit(X, y)
ols = LinearRegression()
ols.fit(X, y)
# #############################################################################
# Plot the true weights, the estimated weights, the histogram of the
# weights, and predictions with standard deviations
plt.figure(figsize=(6, 5))
plt.title("Weights of the model")
plt.plot(clf.coef_, color='darkblue', linestyle='-', linewidth=2,
label="ARD estimate")
plt.plot(ols.coef_, color='yellowgreen', linestyle=':', linewidth=2,
label="OLS estimate")
plt.plot(w, color='orange', linestyle='-', linewidth=2, label="Ground truth")
plt.xlabel("Features")
plt.ylabel("Values of the weights")
plt.legend(loc=1)
plt.figure(figsize=(6, 5))
plt.title("Histogram of the weights")
plt.hist(clf.coef_, bins=n_features, color='navy', log=True)
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.),
color='gold', marker='o', label="Relevant features")
plt.ylabel("Features")
plt.xlabel("Values of the weights")
plt.legend(loc=1)
plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=2)
plt.ylabel("Score")
plt.xlabel("Iterations")
degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=1)
clf_poly = ARDRegression(threshold_lambda=1e5)
clf_poly.fit(np.vander(X, degree), y)
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# #############################################################################
# Generating simulated data with Gaussian weights
np.random.seed(0)
n_samples, n_features = 100, 100
X = np.random.randn(n_samples, n_features) # Create Gaussian data
# Create weights with a precision lambda_ of 4.
lambda_ = 4.
w = np.zeros(n_features)
# Only keep 10 weights of interest
relevant_features = np.random.randint(0, n_features, 10)
for i in relevant_features:
    w[i] = stats.norm.rvs(loc=0, scale=1. / np.sqrt(lambda_))
# Create noise with a precision alpha of 50.
alpha_ = 50.
noise = stats.norm.rvs(loc=0, scale=1. / np.sqrt(alpha_), size=n_samples)
# Create the target
# #############################################################################
# Fit the Bayesian Ridge Regression and an OLS for comparison
clf = BayesianRidge(compute_score=True)
clf.fit(X, y)
ols = LinearRegression()
ols.fit(X, y)
# #############################################################################
# Plot true weights, estimated weights, histogram of the weights, and
# predictions with standard deviations
lw = 2
plt.figure(figsize=(6, 5))
plt.title("Weights of the model")
plt.plot(clf.coef_, color='lightgreen', linewidth=lw,
label="Bayesian Ridge estimate")
plt.plot(w, color='gold', linewidth=lw, label="Ground truth")
plt.plot(ols.coef_, color='navy', linestyle='--', label="OLS estimate")
plt.xlabel("Features")
plt.ylabel("Values of the weights")
plt.legend(loc="best", prop=dict(size=12))
plt.figure(figsize=(6, 5))
plt.title("Histogram of the weights")
plt.hist(clf.coef_, bins=n_features, color='gold', log=True,
edgecolor='black')
plt.scatter(clf.coef_[relevant_features], np.full(len(relevant_features), 5.),
color='navy', label="Relevant features")
plt.ylabel("Features")
plt.xlabel("Values of the weights")
plt.legend(loc="upper left")
plt.figure(figsize=(6, 5))
plt.title("Marginal log-likelihood")
plt.plot(clf.scores_, color='navy', linewidth=lw)
plt.ylabel("Score")
plt.xlabel("Iterations")
degree = 10
X = np.linspace(0, 10, 100)
y = f(X, noise_amount=0.1)
clf_poly = BayesianRidge()
clf_poly.fit(np.vander(X, degree), y)
Note: Click here to download the full example code or to run this example in your browser via Binder
Comparison of multinomial logistic L1 vs one-versus-rest L1 logistic regression to classify documents from the 20
newsgroups dataset. Multinomial logistic regression yields more accurate results and is faster to train on the larger scale
dataset.
Here we use the l1 sparsity that trims the weights of non-informative features to zero. This is good if the goal is to
extract the strongly discriminative vocabulary of each class. If the goal is to get the best predictive accuracy, it is better
to use the non-sparsity-inducing l2 penalty instead.
A more traditional (and possibly better) way to predict on a sparse subset of input features would be to use univariate
feature selection followed by a traditional (l2-penalised) logistic regression model.
Out:
import timeit
import warnings
print(__doc__)
# Author: Arthur Mensch
warnings.filterwarnings("ignore", category=ConvergenceWarning,
module="sklearn")
t0 = timeit.default_timer()
X, y = fetch_20newsgroups_vectorized('all', return_X_y=True)
X = X[:n_samples]
y = y[:n_samples]
model_params = models[model]
y_pred = lr.predict(X_test)
accuracy = np.sum(y_pred == y_test) / y_test.shape[0]
density = np.mean(lr.coef_ != 0, axis=1) * 100
accuracies.append(accuracy)
densities.append(density)
times.append(train_time)
models[model]['times'] = times
models[model]['densities'] = densities
models[model]['accuracies'] = accuracies
print('Test accuracy for model %s: %.4f' % (model, accuracies[-1]))
print('%% non-zero coefficients for model %s, '
'per class:\n %s' % (model, densities[-1]))
print('Run time (%i epochs) for model %s:'
'%.2f' % (model_params['iters'][-1], model, times[-1]))
fig = plt.figure()
ax = fig.add_subplot(111)
Note: Click here to download the full example code or to run this example in your browser via Binder
Use the Akaike information criterion (AIC), the Bayes Information criterion (BIC) and cross-validation to select an
optimal value of the regularization parameter alpha of the Lasso estimator.
Out:
Computing regularization path using the coordinate descent lasso...
Computing regularization path using the Lars lasso...
print(__doc__)
import time
import numpy as np
import matplotlib.pyplot as plt
X, y = datasets.load_diabetes(return_X_y=True)
rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)] # add some bad features
# #############################################################################
# LassoLarsIC: least angle regression with BIC/AIC criterion
model_bic = LassoLarsIC(criterion='bic')
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_
model_aic = LassoLarsIC(criterion='aic')
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_
plt.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
plt.legend()
plt.title('Information-criterion for model selection (training time %.3fs)'
% t_bic)
# #############################################################################
# LassoCV: coordinate descent
# Compute paths
print("Computing regularization path using the coordinate descent lasso...")
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)
t_lasso_cv = time.time() - t1
# Display results
m_log_alphas = -np.log10(model.alphas_ + EPSILON)
plt.figure()
ymin, ymax = 2300, 3800
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: coordinate descent '
'(train time: %.2fs)' % t_lasso_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
# #############################################################################
# LassoLarsCV: least angle regression
# Compute paths
print("Computing regularization path using the Lars lasso...")
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1
# Display results
m_log_alphas = -np.log10(model.cv_alphas_ + EPSILON)
plt.figure()
plt.plot(m_log_alphas, model.mse_path_, ':')
plt.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean square error')
plt.title('Mean square error on each fold: Lars (train time: %.2fs)'
% t_lasso_lars_cv)
plt.axis('tight')
plt.ylim(ymin, ymax)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Stochastic Gradient Descent is an optimization technique which minimizes a loss function in a stochastic fashion,
performing a gradient descent step sample by sample. In particular, it is a very efficient method to fit linear models.
As a stochastic method, the loss function is not necessarily decreasing at each iteration, and convergence is only
guaranteed in expectation. For this reason, monitoring the convergence on the loss function can be difficult.
Another approach is to monitor convergence on a validation score. In this case, the input data is split into a training set
and a validation set. The model is then fitted on the training set and the stopping criterion is based on the prediction
score computed on the validation set. This enables us to find the least number of iterations which is sufficient to build
a model that generalizes well to unseen data and reduces the chance of over-fitting the training data.
This early stopping strategy is activated if early_stopping=True; otherwise the stopping criterion only uses
the training loss on the entire input data. To better control the early stopping strategy, we can specify a parameter
validation_fraction which sets the fraction of the input dataset that we keep aside to compute the validation
score. The optimization will continue until the validation score has not improved by at least tol during the last
n_iter_no_change iterations. The actual number of iterations is available in the attribute n_iter_.
This example illustrates how early stopping can be used in the sklearn.linear_model.SGDClassifier
model to achieve almost the same accuracy as a model built without early stopping, which can significantly
reduce training time. Note that scores differ between the stopping criteria even from early iterations because some of
the training data is held out with the validation stopping criterion.
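A minimal sketch of the two stopping strategies on the digits dataset (the parameter values below are illustrative):
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)

# Stopping based on the training loss over the full input data
sgd_train_loss = SGDClassifier(tol=1e-3, n_iter_no_change=5,
                               early_stopping=False, random_state=0)
# Stopping based on the score on a held-out validation fraction
sgd_validation = SGDClassifier(tol=1e-3, n_iter_no_change=5,
                               early_stopping=True, validation_fraction=0.2,
                               random_state=0)

for clf in (sgd_train_loss, sgd_validation):
    clf.fit(X, y)
    print(clf.n_iter_)            # actual number of iterations performed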
Out:
No stopping criterion: .................................................
Training loss: .................................................
Validation score: .................................................
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
@ignore_warnings(category=ConvergenceWarning)
def fit_and_score(estimator, max_iter, X_train, X_test, y_train, y_test):
    """Fit the estimator on the train set and score it on both sets"""
    estimator.set_params(max_iter=max_iter)
    estimator.set_params(random_state=0)
    start = time.time()
    estimator.fit(X_train, y_train)
results = []
for estimator_name, estimator in estimator_dict.items():
    print(estimator_name + ': ', end='')
    for max_iter in range(1, 50):
        print('.', end='')
        sys.stdout.flush()
nrows = 2
ncols = int(np.ceil(len(plot_list) / 2.))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(6 * ncols,
4 * nrows))
axes[0, 0].get_shared_y_axes().join(axes[0, 0], axes[0, 1])
fig.tight_layout()
plt.show()
6.17 Inspection
Note: Click here to download the full example code or to run this example in your browser via Binder
In this example, we compute the permutation importance on the Wisconsin breast cancer dataset using
permutation_importance. The RandomForestClassifier can easily get about 97% accuracy on a test
dataset. Because this dataset contains multicollinear features, the permutation importance will show that none of the
features are important. One approach to handling multicollinearity is by performing hierarchical clustering on the
features’ Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster.
Note: See also Permutation Importance vs Random Forest Feature Importance (MDI)
print(__doc__)
from collections import defaultdict
First, we train a random forest on the breast cancer dataset and evaluate its accuracy on a test set:
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Next, we plot the tree based feature importance and the permutation importance. The permutation importance plot
shows that permuting a feature drops the accuracy by at most 0.012, which would suggest that none of the features
are important. This is in contradiction with the high test accuracy computed above: some feature must be important.
The permutation importance is calculated on the training set to show how much the model relies on each feature during
training.
tree_importance_sorted_idx = np.argsort(clf.feature_importances_)
tree_indices = np.arange(0, len(clf.feature_importances_)) + 0.5
When features are collinear, permuting one feature will have little effect on the model's performance because the model
can get the same information from a correlated feature. One way to handle multicollinear features is by performing
hierarchical clustering on the Spearman rank-order correlations, picking a threshold, and keeping a single feature from
each cluster. First, we plot a heatmap of the correlated features:
Next, we manually pick a threshold by visual inspection of the dendrogram to group our features into clusters and
choose a feature from each cluster to keep, select those features from our dataset, and train a new random forest. The
test accuracy of the new random forest did not change much compared to the random forest trained on the complete
dataset.
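A minimal sketch of that procedure (the distance threshold of 1 and the use of Ward linkage are illustrative choices, not necessarily those of the example):

from collections import defaultdict

from scipy.cluster import hierarchy
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hierarchical clustering on the Spearman rank-order correlations of the features.
corr, _ = spearmanr(X_train)
corr_linkage = hierarchy.ward(corr)

# Pick a distance threshold (chosen by eye from the dendrogram) and keep a
# single feature per cluster.
cluster_ids = hierarchy.fcluster(corr_linkage, 1, criterion='distance')
cluster_to_features = defaultdict(list)
for idx, cluster_id in enumerate(cluster_ids):
    cluster_to_features[cluster_id].append(idx)
selected_features = [members[0] for members in cluster_to_features.values()]

clf_sel = RandomForestClassifier(n_estimators=100, random_state=42)
clf_sel.fit(X_train[:, selected_features], y_train)
print("Test accuracy on selected features: %.2f"
      % clf_sel.score(X_test[:, selected_features], y_test))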
Note: Click here to download the full example code or to run this example in your browser via Binder
In this example, we will compare the impurity-based feature importance of RandomForestClassifier with the
permutation importance on the titanic dataset using permutation_importance. We will show that the impurity-
based feature importance can inflate the importance of numerical features.
Furthermore, the impurity-based feature importance of random forests suffers from being computed on statistics de-
rived from the training dataset: the importances can be high even for features that are not predictive of the target
variable, as long as the model has the capacity to use them to overfit.
This example shows how to use Permutation Importances as an alternative that can mitigate those limitations.
References:
print(__doc__)
import matplotlib.pyplot as plt
import numpy as np
Let’s use pandas to load a copy of the titanic dataset. The following shows how to apply separate preprocessing on
numerical and categorical features.
We further include two random variables that are not correlated in any way with the target variable (survived):
• random_num is a high cardinality numerical variable (as many unique values as records).
• random_cat is a low cardinality categorical variable (3 possible values).
X = X[categorical_columns + numerical_columns]
categorical_pipe = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_pipe = Pipeline([
('imputer', SimpleImputer(strategy='mean'))
])
preprocessing = ColumnTransformer(
[('cat', categorical_pipe, categorical_columns),
('num', numerical_pipe, numerical_columns)])
rf = Pipeline([
('preprocess', preprocessing),
('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)
Out:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['pclass', 'sex', 'embarked',
                                                   'random_cat']),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer())]),
                                                  ['age', 'sibsp', 'parch',
                                                   'fare', 'random_num'])])),
                ('classifier', RandomForestClassifier(random_state=42))])
Prior to inspecting the feature importances, it is important to check that the model predictive performance is high
enough. Indeed, there would be little interest in inspecting the important features of a non-predictive model.
Here one can observe that the train accuracy is very high (the forest model has enough capacity to completely memorize
the training set) but it can still generalize well enough to the test set thanks to the built-in bagging of random forests.
It might be possible to trade some accuracy on the training set for a slightly better accuracy on the test set by limiting
the capacity of the trees (for instance by setting min_samples_leaf=5 or min_samples_leaf=10) so as to
limit overfitting while not introducing too much underfitting.
However let’s keep our high capacity random forest model for now so as to illustrate some pitfalls with feature impor-
tance on variables with many unique values.
Out:
The impurity-based feature importance ranks the numerical features to be the most important features. As a result, the
non-predictive random_num variable is ranked the most important!
This problem stems from two limitations of impurity-based feature importances:
• impurity-based importances are biased towards high cardinality features;
• impurity-based importances are computed on training set statistics and therefore do not reflect the ability of
a feature to be useful to make predictions that generalize to the test set (when the model has enough capacity).
ohe = (rf.named_steps['preprocess']
.named_transformers_['cat']
.named_steps['onehot'])
feature_names = ohe.get_feature_names(input_features=categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]
tree_feature_importances = (
rf.named_steps['classifier'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()
As an alternative, the permutation importances of rf are computed on a held out test set. This shows that the
low-cardinality categorical feature, sex, is the most important feature.
Also note that both random features have very low importances (close to 0) as expected.
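A sketch of how the result and sorted_idx used below might be obtained (the number of repeats is illustrative; rf, X_test and y_test are the fitted pipeline and held-out data from above):

from sklearn.inspection import permutation_importance

# Permute each column of the held-out set several times and measure the
# resulting drop in the pipeline's accuracy.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()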
fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
vert=False, labels=X_test.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
It is also possible to compute the permutation importances on the training set. This reveals that random_num gets a
significantly higher importance ranking than when computed on the test set. The difference between those two plots is
a confirmation that the RF model has enough capacity to use that random numerical feature to overfit. You can further
confirm this by re-running this example with constrained RF with min_samples_leaf=10.
fig, ax = plt.subplots()
ax.boxplot(result.importances[sorted_idx].T,
vert=False, labels=X_train.columns[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Partial dependence plots show the dependence between the target function [2] and a set of ‘target’ features, marginalizing
over the values of all other features (the complement features). Due to the limits of human perception, the size of the
target feature set must be small (usually, one or two) thus the target features are usually chosen among the most
important features.
This example shows how to obtain partial dependence plots from a MLPRegressor and a
HistGradientBoostingRegressor trained on the California housing dataset. The example is taken from [1].
The plots show four one-way and one two-way partial dependence plot (the two-way plot is omitted for MLPRegressor
due to computation time). The target features for the one-way PDPs are: median income (MedInc), average occupants
per household (AveOccup), median house age (HouseAge), and average rooms per household (AveRooms).
[2] For classification you can think of it as the regression score before the link function.
[1] T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning Ed. 2”, Springer, 2009.
print(__doc__)
Center target to avoid gradient boosting init bias: gradient boosting with the ‘recursion’ method does not account for
the initial estimator (here the average target, by default)
cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
y = cal_housing.target
y -= y.mean()
print("Training MLPRegressor...")
tic = time()
est = make_pipeline(QuantileTransformer(),
MLPRegressor(hidden_layer_sizes=(50, 50),
learning_rate_init=0.01,
early_stopping=True))
est.fit(X_train, y_train)
print("done in {:.3f}s".format(time() - tic))
print("Test R2 score: {:.2f}".format(est.score(X_test, y_test)))
Out:
Training MLPRegressor...
done in 4.388s
Test R2 score: 0.81
We configured a pipeline to scale the numerical input features and tuned the neural network size and learning rate to
get a reasonable compromise between training time and predictive performance on a test set.
Importantly, this tabular dataset has very different dynamic ranges for its features. Neural networks tend to be very
sensitive to features with varying scales, and forgetting to preprocess the numeric features would lead to a very poor
model.
It would be possible to get even higher predictive performance with a larger neural network but the training would also
be significantly more expensive.
Note that it is important to check that the model is accurate enough on a test set before plotting the partial dependence
since there would be little use in explaining the impact of a given feature on the prediction function of a poor model.
Let’s now compute the partial dependence plots for this neural network using the model-agnostic (brute-force) method:
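A sketch of that computation (the feature names follow the California housing columns; grid_resolution and n_jobs are illustrative), assuming the fitted pipeline est and the training data X_train from above:

from sklearn.inspection import plot_partial_dependence

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms']
# The 'brute' method evaluates the fitted pipeline on a grid of values for
# each target feature, averaging the predictions over the training samples.
plot_partial_dependence(est, X_train, features, method='brute',
                        grid_resolution=20, n_jobs=3)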
Let's now fit a GradientBoostingRegressor and compute the partial dependence plots for either one or two variables at
a time.
print("Training GradientBoostingRegressor...")
tic = time()
est = HistGradientBoostingRegressor()
est.fit(X_train, y_train)
print("done in {:.3f}s".format(time() - tic))
print("Test R2 score: {:.2f}".format(est.score(X_test, y_test)))
Out:
Training GradientBoostingRegressor...
done in 0.409s
Test R2 score: 0.85
Here, we used the default hyperparameters for the gradient boosting model without any preprocessing as tree-based
models are naturally robust to monotonic transformations of numerical features.
Note that on this tabular dataset, Gradient Boosting Machines are both significantly faster to train and more accurate
than neural networks. It is also significantly cheaper to tune their hyperparameters (the defaults tend to work well,
which is not often the case for neural networks).
Finally, as we will see next, computing partial dependence plots for tree-based models is also orders of magnitude
faster, making it cheap to compute partial dependence plots for pairs of interacting features:
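A sketch of such a call for the gradient boosting model fitted above (the feature pair and grid resolution are illustrative):

from sklearn.inspection import plot_partial_dependence

# Tree-based models support the fast 'recursion' method, which also makes
# the two-way (interaction) plot cheap to compute.
features = ['MedInc', 'AveOccup', 'HouseAge', ('AveOccup', 'HouseAge')]
plot_partial_dependence(est, X_train, features, method='recursion',
                        grid_resolution=20)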
We can clearly see that the median house price shows a linear relationship with the median income (top left) and that
the house price drops when the average occupants per household increases (top middle). The top right plot shows that
the house age in a district does not have a strong influence on the (median) house price; neither does the average number
of rooms per household. The tick marks on the x-axis represent the deciles of the feature values in the training data.
We also observe that MLPRegressor has much smoother predictions than
HistGradientBoostingRegressor. For the plots to be comparable, it is necessary to subtract the av-
erage value of the target y: The ‘recursion’ method, used by default for HistGradientBoostingRegressor,
does not account for the initial predictor (in our case the average target). Setting the target average to 0 avoids this
bias.
Partial dependence plots with two target features enable us to visualize interactions among them. The two-way partial
dependence plot shows the dependence of median house price on joint values of house age and average occupants per
household. We can clearly see an interaction between the two features: for an average occupancy greater than two, the
house price is nearly independent of the house age, whereas for values less than two there is a strong dependence on
age.
3D interaction plots
Let’s make the same partial dependence plot for the 2 features interaction, this time in 3 dimensions.
fig = plt.figure()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
print(__doc__)
#----------------------------------------------------------------------
# Locally linear embedding of the swiss roll
#----------------------------------------------------------------------
# Plot result
fig = plt.figure()
ax = fig.add_subplot(211, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.set_title("Original data")
ax = fig.add_subplot(212)
ax.scatter(X_r[:, 0], X_r[:, 1], c=color, cmap=plt.cm.Spectral)
plt.axis('tight')
plt.xticks([]), plt.yticks([])
plt.title('Projected data')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An illustration of dimensionality reduction on the S-curve dataset with various manifold learning methods.
For a discussion and comparison of these algorithms, see the manifold module page
For a similar example, where the methods are applied to a sphere dataset, see Manifold Learning methods on a severed
sphere
Note that the purpose of MDS is to find a low-dimensional representation of the data (here 2D) in which the
distances respect well the distances in the original high-dimensional space; unlike other manifold-learning algorithms,
it does not seek an isotropic representation of the data in the low-dimensional space.
print(__doc__)
n_points = 1000
# Create figure
fig = plt.figure(figsize=(15, 8))
fig.suptitle("Manifold Learning with %i points, %i neighbors"
% (1000, n_neighbors), fontsize=14)
methods = OrderedDict()
methods['LLE'] = LLE(method='standard')
methods['LTSA'] = LLE(method='ltsa')
methods['Hessian LLE'] = LLE(method='hessian')
methods['Modified LLE'] = LLE(method='modified')
methods['Isomap'] = manifold.Isomap(n_neighbors, n_components)
methods['MDS'] = manifold.MDS(n_components, max_iter=100, n_init=1)
methods['SE'] = manifold.SpectralEmbedding(n_components=n_components,
n_neighbors=n_neighbors)
methods['t-SNE'] = manifold.TSNE(n_components=n_components, init='pca',
random_state=0)
# Plot results
for i, (label, method) in enumerate(methods.items()):
t0 = time()
Y = method.fit_transform(X)
t1 = time()
print("%s: %.2g sec" % (label, t1 - t0))
ax = fig.add_subplot(2, 5, 2 + i + (i > 3))
ax.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
ax.set_title("%s (%.2g sec)" % (label, t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The reconstructed points using metric MDS and non-metric MDS are slightly shifted to avoid overlapping.
print(__doc__)
import numpy as np
EPSILON = np.finfo(np.float32).eps
n_samples = 20
seed = np.random.RandomState(seed=3)
X_true = seed.randint(0, 20, 2 * n_samples).astype(np.float)
X_true = X_true.reshape((n_samples, 2))
# Center the data
X_true -= X_true.mean()
similarities = euclidean_distances(X_true)
pos = clf.fit_transform(pos)
npos = clf.fit_transform(npos)
fig = plt.figure(1)
ax = plt.axes([0., 0., 1., 1.])
s = 100
plt.scatter(X_true[:, 0], X_true[:, 1], color='navy', s=s, lw=0,
label='True Position')
plt.scatter(pos[:, 0], pos[:, 1], color='turquoise', s=s, lw=0, label='MDS')
plt.scatter(npos[:, 0], npos[:, 1], color='darkorange', s=s, lw=0, label='NMDS')
plt.legend(scatterpoints=1, loc='best', shadow=False)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An illustration of t-SNE on the two concentric circles and the S-curve datasets for different perplexity values.
We observe a tendency towards clearer shapes as the perplexity value increases.
The size, the distance and the shape of clusters may vary with the initialization and the perplexity value, and do not
always convey a meaning.
As shown below, t-SNE for higher perplexities finds meaningful topology of two concentric circles, however the size
and the distance of the circles varies slightly from the original. Contrary to the two circles dataset, the shapes visually
diverge from S-curve topology on the S-curve dataset even for larger perplexity values.
For further details, “How to Use t-SNE Effectively” https://distill.pub/2016/misread-tsne/ provides a good discussion
of the effects of various parameters, as well as interactive plots to explore those effects.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
n_samples = 300
n_components = 2
(fig, subplots) = plt.subplots(3, 5, figsize=(15, 8))
perplexities = [5, 30, 50, 100]
red = y == 0
green = y == 1
ax = subplots[0][0]
ax.scatter(X[red, 0], X[red, 1], c="r")
ax.scatter(X[green, 0], X[green, 1], c="g")
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')
t0 = time()
tsne = manifold.TSNE(n_components=n_components, init='random',
random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)
t1 = time()
print("circles, perplexity=%d in %.2g sec" % (perplexity, t1 - t0))
ax.set_title("Perplexity=%d" % perplexity)
ax.scatter(Y[red, 0], Y[red, 1], c="r")
ax.scatter(Y[green, 0], Y[green, 1], c="g")
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.axis('tight')
ax = subplots[1][0]
ax.scatter(X[:, 0], X[:, 2], c=color)
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
t0 = time()
tsne = manifold.TSNE(n_components=n_components, init='random',
random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)
t1 = time()
print("S-curve, perplexity=%d in %.2g sec" % (perplexity, t1 - t0))
ax.set_title("Perplexity=%d" % perplexity)
ax.scatter(Y[:, 0], Y[:, 1], c=color)
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.axis('tight')
t0 = time()
tsne = manifold.TSNE(n_components=n_components, init='random',
random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)
t1 = time()
print("uniform grid, perplexity=%d in %.2g sec" % (perplexity, t1 - t0))
ax.set_title("Perplexity=%d" % perplexity)
ax.scatter(Y[:, 0], Y[:, 1], c=color)
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.axis('tight')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An application of the different Manifold learning techniques on a spherical data-set. Here one can see the use of
dimensionality reduction in order to gain some intuition regarding the manifold learning methods. Regarding the
dataset, the poles are cut from the sphere, as well as a thin slice down its side. This enables the manifold learning
techniques to ‘spread it open’ whilst projecting it onto two dimensions.
For a similar example, where the methods are applied to the S-curve dataset, see Comparison of Manifold Learning
methods
Note that the purpose of MDS is to find a low-dimensional representation of the data (here 2D) in which the
distances respect well the distances in the original high-dimensional space; unlike other manifold-learning algorithms,
it does not seek an isotropic representation of the data in the low-dimensional space. Here the manifold problem
fairly closely matches that of representing a flat map of the Earth, as in map projection.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter
ax = fig.add_subplot(251, projection='3d')
ax.scatter(x, y, z, c=p[indices], cmap=plt.cm.rainbow)
ax.view_init(40, -10)
ax = fig.add_subplot(252 + i)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("%s (%.2g sec)" % (labels[i], t1 - t0))
ax = fig.add_subplot(257)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("%s (%.2g sec)" % ('Isomap', t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')
ax = fig.add_subplot(258)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("MDS (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')
ax = fig.add_subplot(259)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("Spectral Embedding (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')
ax = fig.add_subplot(2, 5, 10)
plt.scatter(trans_data[0], trans_data[1], c=colors, cmap=plt.cm.rainbow)
plt.title("t-SNE (%.2g sec)" % (t1 - t0))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
# ----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
x_min, x_max = np.min(X, 0), np.max(X, 0)
X = (X - x_min) / (x_max - x_min)
plt.figure()
ax = plt.subplot(111)
for i in range(X.shape[0]):
plt.text(X[i, 0], X[i, 1], str(y[i]),
color=plt.cm.Set1(y[i] / 10.),
fontdict={'weight': 'bold', 'size': 9})
if hasattr(offsetbox, 'AnnotationBbox'):
# only print thumbnails with matplotlib > 1.0
shown_images = np.array([[1., 1.]]) # just something big
for i in range(X.shape[0]):
dist = np.sum((X[i] - shown_images) ** 2, 1)
if np.min(dist) < 4e-3:
# don't show points that are too close
continue
shown_images = np.r_[shown_images, [X[i]]]
imagebox = offsetbox.AnnotationBbox(
offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
X[i])
ax.add_artist(imagebox)
plt.xticks([]), plt.yticks([])
if title is not None:
plt.title(title)
# ----------------------------------------------------------------------
# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
# (excerpt: the loop body that tiles the digit images into img is elided)
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')
# ----------------------------------------------------------------------
# Random 2D projection using a random unitary matrix
print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits")
# ----------------------------------------------------------------------
# Projection on to the first 2 principal components
# ----------------------------------------------------------------------
# Projection on to the first 2 linear discriminant components
# ----------------------------------------------------------------------
# Isomap projection of the digits dataset
print("Computing Isomap projection")
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding(X_iso,
"Isomap projection of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Locally linear embedding of the digits dataset
print("Computing LLE embedding")
# ----------------------------------------------------------------------
# Modified Locally linear embedding of the digits dataset
print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_mlle,
"Modified Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# HLLE embedding of the digits dataset
print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_hlle,
"Hessian Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# LTSA embedding of the digits dataset
print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_ltsa,
"Local Tangent Space Alignment of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# MDS embedding of the digits dataset
print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
"MDS embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Random Trees embedding of the digits dataset
print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0,
max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)
plot_embedding(X_reduced,
"Random forest embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Spectral embedding of the digits dataset
print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0,
eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)
plot_embedding(X_se,
"Spectral embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
"t-SNE embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# NCA projection of the digits dataset
print("Computing NCA projection")
nca = neighbors.NeighborhoodComponentsAnalysis(init='random',
n_components=2, random_state=0)
t0 = time()
X_nca = nca.fit_transform(X, y)
plot_embedding(X_nca,
"NCA embedding of the digits (time %.2fs)" %
(time() - t0))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The sklearn.impute.IterativeImputer class is very flexible - it can be used with a variety of estimators
to do round-robin regression, treating every variable as an output in turn.
In this example we compare some estimators for the purpose of missing feature imputation with sklearn.impute.
IterativeImputer:
• BayesianRidge: regularized linear regression
• DecisionTreeRegressor: non-linear regression
• ExtraTreesRegressor: similar to missForest in R
• KNeighborsRegressor: comparable to other KNN imputation approaches
Of particular interest is the ability of sklearn.impute.IterativeImputer to mimic the behavior of miss-
Forest, a popular imputation package for R. In this example, we have chosen to use sklearn.ensemble.
ExtraTreesRegressor instead of sklearn.ensemble.RandomForestRegressor (as in missForest)
due to its increased speed.
Note that sklearn.neighbors.KNeighborsRegressor is different from KNN imputation, which learns
from samples with missing values by using a distance metric that accounts for missing values, rather than imput-
ing them.
The goal is to compare different estimators to see which one is best for the sklearn.impute.
IterativeImputer when using a sklearn.linear_model.BayesianRidge estimator on the California
housing dataset with a single value randomly removed from each row.
For this particular pattern of missing values we see that sklearn.ensemble.ExtraTreesRegressor and
sklearn.linear_model.BayesianRidge give the best results.
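A condensed sketch of that comparison (the subsample size, the missing-value pattern and the estimator parameters below are illustrative, not those of the example):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:500], y[:500]  # subsample to keep the sketch fast

# Remove a single value at random from each row.
X_missing = X.copy()
missing_col = rng.choice(X.shape[1], size=X.shape[0])
X_missing[np.arange(X.shape[0]), missing_col] = np.nan

impute_estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, random_state=0),
    KNeighborsRegressor(n_neighbors=15),
]
for impute_estimator in impute_estimators:
    pipe = make_pipeline(
        IterativeImputer(estimator=impute_estimator, random_state=0),
        BayesianRidge(),
    )
    score = cross_val_score(pipe, X_missing, y, cv=3,
                            scoring='neg_mean_squared_error').mean()
    print(impute_estimator.__class__.__name__, score)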
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N_SPLITS = 5
rng = np.random.RandomState(0)
scores = pd.concat(
[score_full_data, score_simple_imputer, score_iterative_imputer],
keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)
Note: Click here to download the full example code or to run this example in your browser via Binder
Missing values can be replaced by the mean, the median or the most frequent value using the basic sklearn.
impute.SimpleImputer. The median is a more robust estimator for data with high magnitude variables which
could dominate results (otherwise known as a ‘long tail’).
With KNNImputer, missing values can be imputed using the weighted or unweighted mean of the desired number of
nearest neighbors.
Another option is the sklearn.impute.IterativeImputer. This uses round-robin linear regression, treating
every variable as an output in turn. The version implemented assumes Gaussian (output) variables. If your features are
obviously non-Normal, consider transforming them to look more Normal so as to potentially improve performance.
In addition to using an imputation method, we can also keep an indication of which values were missing using sklearn.
impute.MissingIndicator, which might carry some information.
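A toy sketch of the imputers mentioned above (the tiny array exists only for illustration):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import (IterativeImputer, KNNImputer, MissingIndicator,
                            SimpleImputer)

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

print(SimpleImputer(strategy='median').fit_transform(X))
print(KNNImputer(n_neighbors=2).fit_transform(X))
print(IterativeImputer(random_state=0).fit_transform(X))
# MissingIndicator only records where the values were missing.
print(MissingIndicator().fit_transform(X))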
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)
N_SPLITS = 5
REGRESSOR = RandomForestRegressor(random_state=0)
def get_results(dataset):
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
# Estimate the score after imputation (mean strategy) of the missing values
imputer = SimpleImputer(missing_values=0, strategy="mean")
mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
results_diabetes = np.array(get_results(load_diabetes()))
mses_diabetes = results_diabetes[:, 0] * -1
stds_diabetes = results_diabetes[:, 1]
results_boston = np.array(get_results(load_boston()))
mses_boston = results_boston[:, 0] * -1
stds_boston = results_boston[:, 1]
n_bars = len(mses_diabetes)
xval = np.arange(n_bars)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
lr = linear_model.LinearRegression()
X, y = datasets.load_boston(return_X_y=True)
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, X, y, cv=10)
Note: Click here to download the full example code or to run this example in your browser via Binder
Example of confusion matrix usage to evaluate the quality of the output of a classifier on the iris data set. The diagonal
elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal
elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the
better, indicating many correct predictions.
The figures show the confusion matrix with and without normalization by class support size (number of elements in
each class). This kind of normalization can be interesting in case of class imbalance to have a more visual interpretation
of which class is being misclassified.
Here the results are not as good as they could be as our choice for the regularization parameter C was not the best. In
real life applications this parameter is usually chosen using Tuning the hyper-parameters of an estimator.
Out:
Confusion matrix, without normalization
[[13 0 0]
[ 0 10 6]
[ 0 0 9]]
Normalized confusion matrix
[[1. 0. 0. ]
[0. 0.62 0.38]
[0. 0. 1. ]]
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# Run classifier, using a model that is too regularized (C too low) to see
# the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01).fit(X_train, y_train)
np.set_printoptions(precision=2)
# Excerpt: in the full example, plot_confusion_matrix is called in a loop,
# once without and once with normalization; it returns `disp`, whose
# confusion matrix is printed together with the corresponding title.
print(title)
print(disp.confusion_matrix)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In this plot you can see the training scores and validation scores of an SVM for different values of the kernel parameter
gamma. For very low values of gamma, you can see that both the training score and the validation score are low.
This is called underfitting. Medium values of gamma will result in high values for both scores, i.e. the classifier is
performing fairly well. If gamma is too high, the classifier will overfit, which means that the training score is good but
the validation score is poor.
print(__doc__)
X, y = load_digits(return_X_y=True)
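The underlying computation can be sketched as follows, using the X and y loaded above (the gamma range and scoring are illustrative):

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

param_range = np.logspace(-6, -1, 5)
# Training and cross-validation scores for each value of gamma.
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range,
    scoring="accuracy")
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))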
Note: Click here to download the full example code or to run this example in your browser via Binder
This example demonstrates the problems of underfitting and overfitting and how we can use linear regression with
polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate,
which is a part of the cosine function. In addition, the samples from the real function and the approximations of
different models are displayed. The models have polynomial features of different degrees. We can see that a linear
function (polynomial with degree 1) is not sufficient to fit the training samples. This is called underfitting. A
polynomial of degree 4 approximates the true function almost perfectly. However, for higher degrees the model
will overfit the training data, i.e. it learns the noise of the training data. We evaluate overfitting / underfitting
quantitatively by using cross-validation: we calculate the mean squared error (MSE) on the validation set; the higher
it is, the less likely the model generalizes correctly from the training data.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
def true_fun(X):
return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
ax = plt.subplot(1, len(degrees), i + 1)
plt.setp(ax, xticks=(), yticks=())
polynomial_features = PolynomialFeatures(degree=degrees[i],
include_bias=False)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynomial_features", polynomial_features),
("linear_regression", linear_regression)])
pipeline.fit(X[:, np.newaxis], y)
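The cross-validated MSE mentioned above can be sketched as follows for one of the pipelines built in the loop (the number of folds is illustrative):

from sklearn.model_selection import cross_val_score

# Scores are negated MSE values, since scikit-learn always maximizes scores.
scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                         scoring="neg_mean_squared_error", cv=10)
print("MSE = {:.2e} (+/- {:.2e})".format(-scores.mean(), scores.std()))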
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows how a classifier is optimized by cross-validation, which is done using the sklearn.
model_selection.GridSearchCV object on a development set that comprises only half of the available labeled
data.
The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set
that was not used during the model selection step.
More details on tools available for model selection can be found in the sections on Cross-validation: evaluating
estimator performance and Tuning the hyper-parameters of an estimator.
print(__doc__)
clf = GridSearchCV(
SVC(), tuned_parameters, scoring='%s_macro' % score
)
clf.fit(X_train, y_train)
# Note the problem is too easy: the hyperparameter plateau is too flat and the
# output model is the same for precision and recall with ties in quality.
Note: Click here to download the full example code or to run this example in your browser via Binder
6.20.6 Comparing randomized search and grid search for hyperparameter estimation
Compare randomized search and grid search for optimizing hyperparameters of a random forest. All parameters that
influence the learning are searched simultaneously (except for the number of estimators, which poses a time / quality
tradeoff).
The randomized search and the grid search explore exactly the same space of parameters. The resulting parameter
settings are quite similar, while the run time for randomized search is drastically lower.
The performance may be slightly worse for the randomized search; this is likely due to a noise effect and would not
carry over to a held-out test set.
Note that in practice, one would not search over this many different parameters simultaneously using grid search, but
pick only the ones deemed most important.
print(__doc__)
import numpy as np
# build a classifier
clf = SGDClassifier(loss='hinge', penalty='elasticnet',
fit_intercept=True)
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)
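For reference, a search like the one above could be set up along these lines (the parameter distributions and grid shown here are illustrative, not necessarily those of the example):

from time import time

from scipy.stats import uniform
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_digits(return_X_y=True)
clf = SGDClassifier(loss='hinge', penalty='elasticnet', fit_intercept=True)

# Randomized search samples candidates from (possibly continuous) distributions.
param_dist = {
    'alpha': uniform(1e-4, 1e-2),
    'l1_ratio': uniform(0, 1),
    'average': [True, False],
}
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2fs for %d candidates"
      % (time() - start, n_iter_search))

# Grid search enumerates every combination of a fixed grid.
param_grid = {
    'alpha': [1e-4, 1e-3, 1e-2],
    'l1_ratio': [0.0, 0.5, 1.0],
    'average': [True, False],
}
grid_search = GridSearchCV(clf, param_grid=param_grid)
grid_search.fit(X, y)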
Note: Click here to download the full example code or to run this example in your browser via Binder
Illustration of how the performance of an estimator on unseen data (test data) is not the same as the performance on
training data. As the regularization increases, the performance on the training set decreases while the performance on
the test set is optimal within a range of values of the regularization parameter. The example uses an Elastic-Net
regression model, and the performance is measured using the explained variance, a.k.a. R^2.
print(__doc__)
import numpy as np
from sklearn import linear_model
# #############################################################################
# Generate sample data
n_samples_train, n_samples_test, n_features = 75, 150, 500
np.random.seed(0)
coef = np.random.randn(n_features)
coef[50:] = 0.0 # only the first 50 features are impacting the model
X = np.random.randn(n_samples_train + n_samples_test, n_features)
y = np.dot(X, coef)
# Split into training and test sets (the split itself is elided in the original excerpt)
X_train, X_test = X[:n_samples_train], X[n_samples_train:]
y_train, y_test = y[:n_samples_train], y[n_samples_train:]
# #############################################################################
# Compute train and test errors
alphas = np.logspace(-5, 1, 60)
enet = linear_model.ElasticNet(l1_ratio=0.7, max_iter=10000)
train_errors = list()
test_errors = list()
for alpha in alphas:
enet.set_params(alpha=alpha)
enet.fit(X_train, y_train)
train_errors.append(enet.score(X_train, y_train))
test_errors.append(enet.score(X_test, y_test))
i_alpha_optim = np.argmax(test_errors)
alpha_optim = alphas[i_alpha_optim]
print("Optimal regularization parameter : %s" % alpha_optim)
# #############################################################################
# Plot results functions
Note: Click here to download the full example code or to run this example in your browser via Binder
Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality using cross-validation.
ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the
top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not
very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing
the false positive rate.
This example shows the ROC response of different datasets, created from K-fold cross-validation. Taking all of these
curves, it is possible to calculate the mean area under curve, and see the variance of the curve when the training set
is split into different subsets. This roughly shows how the classifier output is affected by changes in the training data,
and how different the splits generated by K-fold cross-validation are from one another.
Note:
See also sklearn.metrics.roc_auc_score, sklearn.model_selection.cross_val_score,
Receiver Operating Characteristic (ROC),
print(__doc__)
import numpy as np
from scipy import interp
import matplotlib.pyplot as plt
# #############################################################################
# Data IO and generation
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
fig, ax = plt.subplots()
for i, (train, test) in enumerate(cv.split(X, y)):
classifier.fit(X[train], y[train])
viz = plot_roc_curve(classifier, X[test], y[test],
name='ROC fold {}'.format(i),
alpha=0.3, lw=1, ax=ax)
interp_tpr = interp(mean_fpr, viz.fpr, viz.tpr)
interp_tpr[0] = 0.0
tprs.append(interp_tpr)
aucs.append(viz.roc_auc)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example compares non-nested and nested cross-validation strategies on a classifier of the iris data set. Nested
cross-validation (CV) is often used to train a model in which hyperparameters also need to be optimized. Nested CV
estimates the generalization error of the underlying model and its (hyper)parameter search. Choosing the parameters
that maximize non-nested CV biases the model to the dataset, yielding an overly-optimistic score.
Model selection without nested CV uses the same data to tune model parameters and evaluate model performance.
Information may thus “leak” into the model and overfit the data. The magnitude of this effect is primarily dependent
on the size of the dataset and the stability of the model. See Cawley and Talbot [1] for an analysis of these issues.
To avoid this problem, nested CV effectively uses a series of train/validation/test set splits. In the inner loop
(here executed by GridSearchCV ), the score is approximately maximized by fitting a model to each training
set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in
cross_val_score), generalization error is estimated by averaging test set scores over several dataset splits.
The example below uses a support vector classifier with a non-linear kernel to build a model with optimized hyperpa-
rameters by grid search. We compare the performance of non-nested and nested CV strategies by taking the difference
between their scores.
See Also:
References:
[1] Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079-2107.
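A compact sketch of the nested scheme described above (the parameter grid and the numbers of splits are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Inner loop: hyperparameter search.  Outer loop: unbiased estimate of the
# generalization error of the whole (search + fit) procedure.
clf = GridSearchCV(SVC(kernel='rbf'), param_grid=param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()

# Non-nested score for comparison: the same data tunes and evaluates the model.
clf.fit(X, y)
non_nested_score = clf.best_score_
print(nested_score, non_nested_score)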
print(__doc__)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Multiple metric parameter search can be done by setting the scoring parameter to a list of metric scorer names or a
dict mapping the scorer names to the scorer callables.
The scores of all the scorers are available in the cv_results_ dict at keys ending in '_<scorer_name>'
('mean_test_precision', 'rank_test_precision', etc.)
The best_estimator_, best_index_, best_score_ and best_params_ correspond to the scorer (key)
that is set to the refit attribute.
import numpy as np
from matplotlib import pyplot as plt
print(__doc__)
X, y = make_hastie_10_2(n_samples=8000, random_state=42)
# The scorers can be either be one of the predefined metric strings or a scorer
# callable, like the one returned by make_scorer
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
plt.figure(figsize=(13, 13))
plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
fontsize=16)
plt.xlabel("min_samples_split")
plt.ylabel("Score")
ax = plt.gca()
ax.set_xlim(0, 402)
ax.set_ylim(0.73, 1)
# Plot a dotted vertical line at the best score for that scorer marked by x
ax.plot([X_axis[best_index], ] * 2, [0, best_score],
linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)
plt.legend(loc="best")
plt.grid(False)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The dataset used in this example is the 20 newsgroups dataset which will be automatically downloaded and then cached
and reused for the document classification example.
You can adjust the number of categories by giving their names to the dataset loader or setting them to None to get
all 20 of them.
print(__doc__)
# #############################################################################
# Load some categories from the training set
categories = [
'alt.atheism',
'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
# 'vect__max_features': (None, 5000, 10000, 50000),
'vect__ngram_range': ((1, 1), (1, 2)), # unigrams or bigrams
# 'tfidf__use_idf': (True, False),
# 'tfidf__norm': ('l1', 'l2'),
'clf__max_iter': (20,),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
# 'clf__max_iter': (10, 50, 80),
}
if __name__ == "__main__":
# multiprocessing requires the fork to happen in a __main__ protected
# block
# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example balances model complexity and cross-validated score by finding a decent accuracy within 1 standard
deviation of the best accuracy score while minimising the number of PCA components [1].
The figure shows the trade-off between cross-validated score and the number of PCA components. The balanced
case is when n_components=10 and accuracy=0.88, which falls into the range within 1 standard deviation of the best
accuracy score.
[1] Hastie, T., Tibshirani, R., Friedman, J. (2001). Model Assessment and Selection. The Elements of Statistical Learning (pp. 219-260). New York, NY, USA: Springer New York Inc.
Out:
The best_index_ is 2
The n_components selected is 10
The corresponding accuracy score is 0.88
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
def lower_bound(cv_results):
"""
Calculate the lower bound within 1 standard deviation
of the best `mean_test_scores`.
Parameters
----------
cv_results : dict of numpy(masked) ndarrays
See attribute cv_results_ of `GridSearchCV`
Returns
-------
float
Lower bound within 1 standard deviation of the
best `mean_test_score`.
"""
best_score_idx = np.argmax(cv_results['mean_test_score'])
return (cv_results['mean_test_score'][best_score_idx]
- cv_results['std_test_score'][best_score_idx])
def best_low_complexity(cv_results):
"""
Balance model complexity with cross-validated score.
Parameters
----------
cv_results : dict of numpy(masked) ndarrays
See attribute cv_results_ of `GridSearchCV`.
Returns
-------
int
Index of a model that has the fewest PCA components
while its test score is within 1 standard deviation of the best
`mean_test_score`.
"""
threshold = lower_bound(cv_results)
candidate_idx = np.flatnonzero(cv_results['mean_test_score'] >= threshold)
best_idx = candidate_idx[cv_results['param_reduce_dim__n_components']
[candidate_idx].argmin()]
return best_idx
pipe = Pipeline([
('reduce_dim', PCA(random_state=42)),
('classify', LinearSVC(random_state=42, C=0.01)),
])
param_grid = {
'reduce_dim__n_components': [6, 8, 10, 12, 14]
}
n_components = grid.cv_results_['param_reduce_dim__n_components']
test_scores = grid.cv_results_['mean_test_score']
plt.figure()
plt.bar(n_components, test_scores, width=1.3, color='b')
lower = lower_bound(grid.cv_results_)
plt.axhline(np.max(test_scores), linestyle='--', color='y',
label='Best score')
plt.axhline(lower, linestyle='--', color='.5', label='Best score - 1 std')
best_index_ = grid.best_index_
Note: Click here to download the full example code or to run this example in your browser via Binder
Choosing the right cross-validation object is a crucial part of fitting a model properly. There are many ways to split
data into training and test sets in order to avoid model overfitting, to standardize the number of groups in test sets, etc.
This example visualizes the behavior of several common scikit-learn objects for comparison.
First, we must understand the structure of our data. It has 100 randomly generated input datapoints, 3 classes split
unevenly across datapoints, and 10 “groups” split evenly across datapoints.
As we’ll see, some cross-validation objects do specific things with labeled data, others behave differently with grouped
data, and others do not use this information.
To begin, we’ll visualize our data.
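For instance, data with that structure could be generated along these lines (a sketch; the example's own generation code may differ in its details):

import numpy as np

rng = np.random.RandomState(1338)
n_points = 100

# 100 random datapoints, 3 classes split unevenly, 10 groups of 10 points each.
X = rng.randn(n_points, 10)
percentiles_classes = [0.1, 0.3, 0.6]
y = np.hstack([[i] * int(n_points * perc)
               for i, perc in enumerate(percentiles_classes)])
groups = np.hstack([[i] * 10 for i in range(10)])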
We’ll define a function that lets us visualize the behavior of each cross-validation object. We’ll perform 4 splits of the
data. On each split, we’ll visualize the indices chosen for the training set (in blue) and the test set (in red).
# Formatting
yticklabels = list(range(n_splits)) + ['class', 'group']
ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
xlabel='Sample index', ylabel="CV iteration",
ylim=[n_splits+2.2, -.2], xlim=[0, 100])
ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
return ax
fig, ax = plt.subplots()
cv = KFold(n_splits)
plot_cv_indices(cv, X, y, groups, ax, n_splits)
As you can see, by default the KFold cross-validation iterator does not take either datapoint class or group into
consideration. We can change this by using the StratifiedKFold like so.
fig, ax = plt.subplots()
cv = StratifiedKFold(n_splits)
plot_cv_indices(cv, X, y, groups, ax, n_splits)
In this case, the cross-validation retained the same ratio of classes across each CV split. Next we’ll visualize this
behavior for a number of CV iterators.
Let’s visually compare the cross validation behavior for many scikit-learn cross-validation objects. Below we will
loop through several common cross-validation objects, visualizing the behavior of each.
Note how some use the group/class information while others do not.
for cv in cvs:
this_cv = cv(n_splits=n_splits)
ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
['Testing set', 'Training set'], loc=(1.02, .8))
# Make the legend fit
plt.tight_layout()
fig.subplots_adjust(right=.7)
plt.show()
Total running time of the script: ( 0 minutes 2.743 seconds)
Estimated memory usage: 8 MB
Note: Click here to download the full example code or to run this example in your browser via Binder
Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality.
ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the
top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not
very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing
the false positive rate.
ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC
curve and ROC area to multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn
per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary
prediction (micro-averaging).
Another evaluation measure for multi-label classification is macro-averaging, which gives equal weight to the classi-
fication of each label.
Note:
See also sklearn.metrics.roc_auc_score, Receiver Operating Characteristic (ROC) with cross validation
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='navy', linestyle=':', linewidth=4)
The sklearn.metrics.roc_auc_score function can be used for multi-class classification. The multi-class
One-vs-One scheme compares every unique pairwise combination of classes. In this section, we calculate the AUC
using the OvR and OvO schemes. We report a macro average, and a prevalence-weighted average.
y_prob = classifier.predict_proba(X_test)
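A sketch of the four averages described above, assuming the y_test labels and the y_prob probability matrix (one column per class) from the example:

from sklearn.metrics import roc_auc_score

# One-vs-One and One-vs-Rest AUC, each with a macro and a
# prevalence-weighted ('weighted') average.
for multi_class in ('ovo', 'ovr'):
    for average in ('macro', 'weighted'):
        auc = roc_auc_score(y_test, y_prob, multi_class=multi_class,
                            average=average)
        print(multi_class, average, "{:.6f}".format(auc))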
Note: Click here to download the full example code or to run this example in your browser via Binder
6.20.15 Precision-Recall
Precision (𝑃) is defined as the number of true positives (𝑇𝑝) over the number of true positives plus the number of false
positives (𝐹𝑝):
𝑃 = 𝑇𝑝 / (𝑇𝑝 + 𝐹𝑝)
Recall (𝑅) is defined as the number of true positives (𝑇𝑝) over the number of true positives plus the number of false
negatives (𝐹𝑛):
𝑅 = 𝑇𝑝 / (𝑇𝑝 + 𝐹𝑛)
These quantities are also related to the 𝐹1 score, which is defined as the harmonic mean of precision and recall:
𝐹1 = 2 × 𝑃 × 𝑅 / (𝑃 + 𝑅)
Note that the precision may not decrease with recall. The definition of precision (𝑇𝑝 / (𝑇𝑝 + 𝐹𝑝)) shows that lowering the
threshold of a classifier may increase the denominator, by increasing the number of results returned. If the threshold
was previously set too high, the new results may all be true positives, which will increase precision. If the previous
threshold was about right or too low, further lowering the threshold will introduce false positives, decreasing precision.
Recall is defined as 𝑇𝑝 / (𝑇𝑝 + 𝐹𝑛), where 𝑇𝑝 + 𝐹𝑛 does not depend on the classifier threshold. This means that lowering
the classifier threshold may increase recall, by increasing the number of true positive results. It is also possible that
lowering the threshold may leave recall unchanged, while the precision fluctuates.
The relationship between recall and precision can be observed in the stairstep area of the plot - at the edges of these
steps a small change in the threshold considerably reduces precision, with only a minor gain in recall.
Average precision (AP) summarizes such a plot as the weighted mean of precisions achieved at each threshold, with
the increase in recall from the previous threshold used as the weight:
AP = Σ𝑛 (𝑅𝑛 − 𝑅𝑛−1) 𝑃𝑛
where 𝑃𝑛 and 𝑅𝑛 are the precision and recall at the nth threshold. A pair (𝑅𝑘 , 𝑃𝑘 ) is referred to as an operating point.
AP and the trapezoidal area under the operating points (sklearn.metrics.auc) are common ways to summarize
a precision-recall curve that lead to different results. Read more in the User Guide.
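A minimal sketch of computing these quantities for a binary problem (the dataset and classifier below are placeholders, not those of the example):

from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(random_state=0).fit(X_train, y_train)
y_score = clf.decision_function(X_test)

# Precision and recall at every threshold, plus the AP summary defined above.
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
ap = average_precision_score(y_test, y_score)
print("AP = {:.2f}".format(ap))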
Precision-recall curves are typically used in binary classification to study the output of a classifier. In order to extend
the precision-recall curve and average precision to multi-class or multi-label classification, it is necessary to binarize
the output. One curve can be drawn per label, but one can also draw a precision-recall curve by considering each
element of the label indicator matrix as a binary prediction (micro-averaging).
Note:
See also sklearn.metrics.average_precision_score, sklearn.metrics.recall_score,
sklearn.metrics.precision_score, sklearn.metrics.f1_score
iris = datasets.load_iris()
# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(X[y < 2], y[y < 2],
test_size=.5,
random_state=random_state)
Out:
Text(0.5, 1.0, '2-class Precision-Recall curve: AP=0.88')
In multi-label settings
# Run classifier
plt.figure()
plt.step(recall['micro'], precision['micro'], where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(
'Average precision score, micro-averaged over all classes: AP={0:0.2f}'
.format(average_precision["micro"]))
Out:
Text(0.5, 1.0, 'Average precision score, micro-averaged over all classes: AP=0.43')
plt.figure(figsize=(7, 8))
f_scores = np.linspace(0.2, 0.8, num=4)
lines = []
labels = []
for f_score in f_scores:
x = np.linspace(0.01, 1)
y = f_score * x / (2 * x - f_score)
l, = plt.plot(x[y >= 0], y[y >= 0], color='gray', alpha=0.2)
plt.annotate('f1={0:0.1f}'.format(f_score), xy=(0.9, y[45] + 0.02))
lines.append(l)
labels.append('iso-f1 curves')
l, = plt.plot(recall["micro"], precision["micro"], color='gold', lw=2)
lines.append(l)
fig = plt.gcf()
fig.subplots_adjust(bottom=0.25)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Extension of Precision-Recall curve to multi-class')
plt.legend(lines, labels, loc=(0, -.38), prop=dict(size=14))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
In the first column, first row, the learning curve of a naive Bayes classifier is shown for the digits dataset. Note that the
training score and the cross-validation score are both not very good at the end. However, this shape of curve is found
very often with more complex datasets: the training score is very high at the beginning and decreases, while the
cross-validation score is very low at the beginning and increases. In the second column, first row, we see the learning
curve of an SVM with RBF kernel. We can see clearly that the training score is still around the maximum and the
validation score could be increased with more training samples. The plots in the second row show the times required
by the models to train with various sizes of the training dataset. The plots in the third row show how much time was
required to train the models for each training size.
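For reference, the underlying call is sklearn.model_selection.learning_curve; the following minimal sketch shows its inputs and outputs, with an illustrative cross-validation setup that is not necessarily the one used by the plotting helper below.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

# learning_curve fits the estimator on increasingly large training subsets and
# returns the train/test scores (and fit times) for each subset size.
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    GaussianNB(), X, y, cv=cv, n_jobs=4,
    train_sizes=np.linspace(0.1, 1.0, 5), return_times=True)

print("mean CV score per training size:", test_scores.mean(axis=1))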
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
return plt
fig, axes = plt.subplots(3, 2, figsize=(10, 15))
X, y = load_digits(return_X_y=True)
title = "Learning Curves (Naive Bayes)"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = GaussianNB()
plot_learning_curve(estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01),
                    cv=cv, n_jobs=4)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Next we create 10 classifier chains. Each classifier chain contains a logistic regression model for each of the 14 labels.
The models in each chain are ordered randomly. In addition to the 103 features in the dataset, each model gets the
predictions of the preceding models in the chain as features (note that by default at training time each model gets the
true labels as features). These additional features allow each chain to exploit correlations among the classes. The
Jaccard similarity score for each chain tends to be greater than that of the set of independent logistic models.
Because the models in each chain are arranged randomly, there is significant variation in performance among the
chains. Presumably there is an optimal ordering of the classes in a chain that will yield the best performance. However,
we do not know that ordering a priori. Instead we can construct a voting ensemble of classifier chains by averaging
the binary predictions of the chains and applying a threshold of 0.5. The Jaccard similarity score of the ensemble is
greater than that of the independent models and tends to exceed the score of each chain in the ensemble (although this
is not guaranteed with randomly ordered chains).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.multioutput import ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import jaccard_score
from sklearn.linear_model import LogisticRegression
print(__doc__)
# Fit an independent logistic regression model for each class using the
# OneVsRestClassifier wrapper.
base_lr = LogisticRegression()
ovr = OneVsRestClassifier(base_lr)
ovr.fit(X_train, Y_train)
Y_pred_ovr = ovr.predict(X_test)
ovr_jaccard_score = jaccard_score(Y_test, Y_pred_ovr, average='samples')
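The chains themselves, which produce the Y_pred_chains array used below, are not shown in this excerpt; a minimal sketch of how they might be built and scored, assuming the X_train/Y_train split defined earlier in the example:

# Fit an ensemble of 10 classifier chains, each with a random label ordering.
chains = [ClassifierChain(base_lr, order='random', random_state=i)
          for i in range(10)]
for chain in chains:
    chain.fit(X_train, Y_train)

# Stack the binary predictions of every chain into a single array.
Y_pred_chains = np.array([chain.predict(X_test) for chain in chains])
chain_jaccard_scores = [jaccard_score(Y_test, Y_pred >= .5, average='samples')
                        for Y_pred in Y_pred_chains]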
Y_pred_ensemble = Y_pred_chains.mean(axis=0)
ensemble_jaccard_score = jaccard_score(Y_test,
Y_pred_ensemble >= .5,
average='samples')
model_names = ('Independent',
'Chain 1',
'Chain 2',
'Chain 3',
'Chain 4',
'Chain 5',
'Chain 6',
'Chain 7',
'Chain 8',
'Chain 9',
'Chain 10',
'Ensemble')
x_pos = np.arange(len(model_names))
# Plot the Jaccard similarity scores for the independent model, each of the
# chains, and the ensemble (note that the vertical axis on this plot does
# not begin at 0).
Note: Click here to download the full example code or to run this example in your browser via Binder
Demonstrates the resolution of a regression problem using k-Nearest Neighbors and the interpolation of the target
using both barycenter and constant weights.
print(__doc__)
# #############################################################################
# Generate sample data
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# #############################################################################
# Fit regression model
n_neighbors = 5

for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, color='darkorange', label='data')
    plt.plot(T, y_, color='navy', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,
                                                                weights))
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local
density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a
substantially lower density than their neighbors. This example shows how to use LOF for outlier detection, which is
the default use case of this estimator in scikit-learn. Note that when LOF is used for outlier detection it has no predict,
decision_function and score_samples methods. See the User Guide for details on the difference between outlier detection
and novelty detection and how to use LOF for novelty detection.
The number of neighbors considered (parameter n_neighbors) is typically set 1) greater than the minimum number of
samples a cluster has to contain, so that other samples can be local outliers relative to this cluster, and 2) smaller than
the maximum number of close-by samples that can potentially be local outliers. In practice, such information is
generally not available, and taking n_neighbors=20 appears to work well in general.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
print(__doc__)
np.random.seed(42)
n_outliers = len(X_outliers)
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-n_outliers:] = -1
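The data generation and model fit that the lines above rely on are elided in this excerpt; a minimal sketch, with an illustrative inlier/outlier layout, might look like:

# Generate a cloud of inliers plus some uniformly sampled outliers.
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

# fit_predict labels inliers as 1 and outliers as -1;
# negative_outlier_factor_ holds the (negated) LOF scores.
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(X)
X_scores = clf.negative_outlier_factor_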
Note: Click here to download the full example code or to run this example in your browser via Binder
Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Sample usage of Nearest Centroid classification. It will plot the decision boundaries for each class.
Out:
None 0.8133333333333334
0.2 0.82
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import NearestCentroid
n_neighbors = 15
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example shows how kernel density estimation (KDE), a powerful non-parametric density estimation technique,
can be used to learn a generative model for a dataset. With this generative model in place, new samples can be drawn.
These new samples reflect the underlying model of the data.
import numpy as np
import matplotlib.pyplot as plt
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example demonstrates how to precompute the k nearest neighbors before using them in KNeighborsClassifier.
KNeighborsClassifier can compute the nearest neighbors internally, but precomputing them can have several benefits,
such as finer parameter control, caching for multiple uses, or custom implementations.
Here we use the caching property of pipelines to cache the nearest neighbors graph between multiple fits of
KNeighborsClassifier. The first call is slow since it computes the neighbors graph, while subsequent calls are faster as they
do not need to recompute the graph. Here the durations are small since the dataset is small, but the gain can be more
substantial when the dataset grows larger, or when the grid of parameters to search is large.
print(__doc__)
X, y = load_digits(return_X_y=True)
n_neighbors_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The transformer computes the nearest neighbors graph using the maximum number
# of neighbors necessary in the grid search. The classifier model filters the
# nearest neighbors graph as required by its own n_neighbors parameter.
graph_model = KNeighborsTransformer(n_neighbors=max(n_neighbors_list),
mode='distance')
classifier_model = KNeighborsClassifier(metric='precomputed')
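The caching described above could be wired up roughly as follows; this is a minimal sketch, and the temporary cache directory and grid-search settings are illustrative rather than the example's exact choices.

from tempfile import TemporaryDirectory
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

with TemporaryDirectory(prefix='sklearn_graph_cache_') as tmpdir:
    # The transformer's output (the neighbors graph) is cached on disk, so the
    # grid search over n_neighbors only refits the cheap classifier step.
    full_model = Pipeline(
        steps=[('graph', graph_model), ('classifier', classifier_model)],
        memory=tmpdir)
    param_grid = {'classifier__n_neighbors': n_neighbors_list}
    grid_model = GridSearchCV(full_model, param_grid)
    grid_model.fit(X, y)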
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates a learned distance metric that maximizes the nearest neighbors classification accuracy. It
provides a visual representation of this metric compared to the original point space. Please refer to the User Guide for
more information.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from matplotlib import cm
from sklearn.utils.fixes import logsumexp
print(__doc__)
Original points
First we create a data set of 9 samples from 3 classes, and plot the points in the original space. For this example, we
focus on the classification of point no. 3. The thickness of a link between point no. 3 and another point is proportional
to their distance.
plt.figure(1)
ax = plt.gca()
for i in range(X.shape[0]):
ax.text(X[i, 0], X[i, 1], str(i), va='center', ha='center')
ax.scatter(X[i, 0], X[i, 1], s=300, c=cm.Set1(y[[i]]), alpha=0.4)
ax.set_title("Original points")
ax.axes.get_xaxis().set_visible(False)
i = 3
relate_point(X, i, ax)
plt.show()
Learning an embedding
We use NeighborhoodComponentsAnalysis to learn an embedding and plot the points after the transforma-
tion. We then take the embedding and find the nearest neighbors.
plt.figure(2)
ax2 = plt.gca()
X_embedded = nca.transform(X)
relate_point(X_embedded, i, ax2)
for i in range(len(X)):
ax2.text(X_embedded[i, 0], X_embedded[i, 1], str(i),
va='center', ha='center')
ax2.scatter(X_embedded[i, 0], X_embedded[i, 1], s=300, c=cm.Set1(y[[i]]),
alpha=0.4)
ax2.set_title("NCA embedding")
ax2.axes.get_xaxis().set_visible(False)
ax2.axes.get_yaxis().set_visible(False)
ax2.axis('equal')
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local
density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a
substantially lower density than their neighbors. This example shows how to use LOF for novelty detection. Note
that when LOF is used for novelty detection you MUST not use predict, decision_function and score_samples on the
training set, as this would lead to wrong results. These methods must only be used on new unseen data (which are not
in the training set). See the User Guide for details on the difference between outlier detection and novelty detection and
how to use LOF for outlier detection.
The number of neighbors considered (parameter n_neighbors) is typically set 1) greater than the minimum number
of samples a cluster has to contain, so that other samples can be local outliers relative to this cluster, and 2) smaller
than the maximum number of close-by samples that can potentially be local outliers. In practice, such information
is generally not available, and taking n_neighbors=20 appears to work well in general.
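A minimal sketch of the novelty-detection workflow follows; the training and test data here are placeholders. The key points are passing novelty=True at construction and calling predict only on data not used for fitting.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2) + 2      # regular training observations
X_new = np.r_[0.3 * rng.randn(10, 2) + 2,  # new regular observations
              rng.uniform(low=-4, high=4, size=(10, 2))]  # new abnormal ones

clf = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.1)
clf.fit(X_train)
# predict/decision_function are only valid on data NOT seen during fit.
y_new = clf.predict(X_new)   # +1 for inliers, -1 for outliers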
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
print(__doc__)
np.random.seed(42)
# plot the learned frontier, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s,
edgecolors='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s,
edgecolors='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([a.collections[0], b1, b2, c],
["learned frontier", "training observations",
"new regular observations", "new abnormal observations"],
loc="upper left",
prop=matplotlib.font_manager.FontProperties(size=11))
plt.xlabel(
"errors novel regular: %d/40 ; errors novel abnormal: %d/40"
% (n_error_test, n_error_outliers))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
An example comparing nearest neighbors classification with and without Neighborhood Components Analysis.
It will plot the class decision boundaries given by a Nearest Neighbors classifier when using the Euclidean distance
on the original features, versus using the Euclidean distance after the transformation learned by Neighborhood Com-
ponents Analysis. The latter aims to find a linear transformation that maximises the (stochastic) nearest neighbor
classification accuracy on the training set.
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
from sklearn.pipeline import Pipeline
print(__doc__)
n_neighbors = 1
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example compares different (linear) dimensionality reduction methods applied on the Digits data set. The data
set contains images of digits from 0 to 9 with approximately 180 samples of each class. Each image is of dimension
8x8 = 64, and is reduced to a two-dimensional data point.
Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components,
or directions in the feature space) that account for the most variance in the data. Here we plot the different
samples on the first two principal components.
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In
particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
Neighborhood Components Analysis (NCA) tries to find a feature space such that a stochastic nearest neighbor algo-
rithm will give the best accuracy. Like LDA, it is a supervised method.
One can see that NCA enforces a clustering of the data that is visually meaningful despite the large reduction in
dimension.
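The three reductions might be set up and compared along the following lines; this is a minimal sketch assuming a standard train/test split of the digits data, not the example's exact code.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Reduce to 2 dimensions with each method, then score a 3-NN classifier
# in the reduced space.
dim_reduction_methods = [
    ('PCA', make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0))),
    ('LDA', make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))),
    ('NCA', make_pipeline(StandardScaler(),
                          NeighborhoodComponentsAnalysis(n_components=2, random_state=0))),
]
knn = KNeighborsClassifier(n_neighbors=3)
for name, model in dim_reduction_methods:
    model.fit(X_train, y_train)
    knn.fit(model.transform(X_train), y_train)
    print(name, knn.score(model.transform(X_test), y_test))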
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
print(__doc__)
n_neighbors = 3
random_state = 0
dim = len(X[0])
# plt.figure()
for i, (name, model) in enumerate(dim_reduction_methods):
plt.figure()
# plt.subplot(1, 3, i + 1, aspect=1)
Note: Click here to download the full example code or to run this example in your browser via Binder
This shows an example of a neighbors-based query (in particular a kernel density estimate) on geospatial data, using
a Ball Tree built upon the Haversine distance metric – i.e. distances over points in latitude/longitude. The dataset
is provided by Phillips et al. (2006). If available, the example uses basemap to plot the coastlines and national
boundaries of South America.
This example does not perform any learning over the data (see Species distribution modeling for an example of classi-
fication based on the attributes in this dataset). It simply shows the kernel density estimate of observed data points in
geospatial coordinates.
The two species are:
• “Bradypus variegatus” , the Brown-throated Sloth.
• “Microryzomys minutus”, also known as the Forest Small Rice Rat, a rodent that lives in Peru, Colombia,
Ecuador, and Venezuela.
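A kernel density estimate over latitude/longitude with the Haversine metric, as described above, could be computed roughly as follows; the coordinates here are made-up placeholders and the bandwidth is illustrative.

import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical latitude/longitude observations, in degrees.
latlon = np.array([[-5.0, -60.0], [-6.2, -61.5], [-4.8, -59.7]])

# The haversine metric expects (latitude, longitude) in radians; a Ball Tree
# supports this metric directly.
kde = KernelDensity(bandwidth=0.04, metric='haversine',
                    kernel='gaussian', algorithm='ball_tree')
kde.fit(np.radians(latlon))

# Evaluate the log-density on a (tiny) grid of query points, also in radians.
query = np.radians(np.array([[-5.5, -60.5], [0.0, -70.0]]))
log_dens = kde.score_samples(query)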
References
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_species_distributions
from sklearn.neighbors import KernelDensity
def construct_grids(batch):
"""Construct the map grid from the batch object
Parameters
----------
batch : Batch object
The object returned by :func:`fetch_species_distributions`
Returns
-------
(xgrid, ygrid) : 1-D arrays
The grid corresponding to the values in batch.coverages
"""
# x,y coordinates for corner cells
xmin = batch.x_left_lower_corner + batch.grid_size
xmax = xmin + (batch.Nx * batch.grid_size)
ymin = batch.y_left_lower_corner + batch.grid_size
ymax = ymin + (batch.Ny * batch.grid_size)
xy = np.vstack([Y.ravel(), X.ravel()]).T
xy = xy[land_mask]
xy *= np.pi / 180.
for i in range(2):
plt.subplot(1, 2, i + 1)
if basemap:
print(" - plot coastlines using basemap")
m = Basemap(projection='cyl', llcrnrlat=Y.min(),
urcrnrlat=Y.max(), llcrnrlon=X.min(),
urcrnrlon=X.max(), resolution='c')
m.drawcoastlines()
m.drawcountries()
else:
print(" - plot coastlines from coverage")
plt.contour(X, Y, land_reference,
levels=[-9998], colors="k",
linestyles="solid")
plt.xticks([])
plt.yticks([])
plt.title(species_names[i])
Note: Click here to download the full example code or to run this example in your browser via Binder
This example uses the sklearn.neighbors.KernelDensity class to demonstrate the principles of Kernel
Density Estimation in one dimension.
The first plot shows one of the problems with using histograms to visualize the density of points in 1D. Intuitively, a
histogram can be thought of as a scheme in which a unit “block” is stacked above each point on a regular grid. As
the top two panels show, however, the choice of gridding for these blocks can lead to wildly divergent ideas about
the underlying shape of the density distribution. If we instead center each block on the point it represents, we get the
estimate shown in the bottom left panel. This is a kernel density estimation with a “top hat” kernel. This idea can be
generalized to other kernel shapes: the bottom-right panel of the first figure shows a Gaussian kernel density estimate
over the same distribution.
Scikit-learn implements efficient kernel density estimation using either a Ball Tree or KD Tree structure, through the
sklearn.neighbors.KernelDensity estimator. The available kernels are shown in the second figure of this
example.
The third figure compares kernel density estimates for a distribution of 100 samples in 1 dimension. Though this
example uses 1D distributions, kernel density estimation is easily and efficiently extensible to higher dimensions as
well.
# Author: Jake Vanderplas <jakevdp@cs.washington.edu>
#
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from distutils.version import LooseVersion
from scipy.stats import norm
from sklearn.neighbors import KernelDensity
# ----------------------------------------------------------------------
# Plot the progression of histograms to kernels
np.random.seed(1)
N = 20
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]
X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
bins = np.linspace(-5, 10, 10)
# histogram 2
ax[0, 1].hist(X[:, 0], bins=bins + 0.75, fc='#AAAAFF', **density_param)
ax[0, 1].text(-3.5, 0.31, "Histogram, bins shifted")
# tophat KDE
kde = KernelDensity(kernel='tophat', bandwidth=0.75).fit(X)
log_dens = kde.score_samples(X_plot)
ax[1, 0].fill(X_plot[:, 0], np.exp(log_dens), fc='#AAAAFF')
ax[1, 0].text(-3.5, 0.31, "Tophat Kernel Density")
# Gaussian KDE
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)
log_dens = kde.score_samples(X_plot)
ax[1, 1].fill(X_plot[:, 0], np.exp(log_dens), fc='#AAAAFF')
ax[1, 1].text(-3.5, 0.31, "Gaussian Kernel Density")
# ----------------------------------------------------------------------
# Plot all available kernels
X_plot = np.linspace(-6, 6, 1000)[:, None]
X_src = np.zeros((1, 1))
axi.set_ylim(0, 1.05)
axi.set_xlim(-2.9, 2.9)
# ----------------------------------------------------------------------
# Plot a 1D density example
N = 100
np.random.seed(1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]
fig, ax = plt.subplots()
ax.fill(X_plot[:, 0], true_dens, fc='black', alpha=0.2,
label='input distribution')
colors = ['navy', 'cornflowerblue', 'darkorange']
kernels = ['gaussian', 'tophat', 'epanechnikov']
lw = 2
ax.legend(loc='upper left')
ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')
ax.set_xlim(-4, 9)
ax.set_ylim(-0.02, 0.4)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example presents how to chain KNeighborsTransformer and TSNE in a pipeline. It also shows how to wrap the
packages annoy and nmslib to replace KNeighborsTransformer and perform approximate nearest neighbors. These
packages can be installed with pip install annoy nmslib.
Note: Currently TSNE(metric='precomputed') does not modify the precomputed distances, and thus assumes
that precomputed euclidean distances are squared. In future versions, a parameter in TSNE will control the optional
squaring of precomputed distances (see #12401).
Note: In KNeighborsTransformer we use the definition which includes each training point as its own neighbor
in the count of n_neighbors, and for compatibility reasons, one extra neighbor is computed when mode ==
'distance'. Please note that we do the same in the proposed wrappers.
Sample output:
Benchmarking on MNIST_2000:
---------------------------
AnnoyTransformer: 0.583 sec
NMSlibTransformer: 0.321 sec
KNeighborsTransformer: 1.225 sec
TSNE with AnnoyTransformer: 4.903 sec
TSNE with NMSlibTransformer: 5.009 sec
TSNE with KNeighborsTransformer: 6.210 sec
TSNE with internal NearestNeighbors: 6.365 sec
Benchmarking on MNIST_10000:
----------------------------
AnnoyTransformer: 4.457 sec
NMSlibTransformer: 2.080 sec
KNeighborsTransformer: 30.680 sec
TSNE with AnnoyTransformer: 30.225 sec
TSNE with NMSlibTransformer: 43.295 sec
TSNE with KNeighborsTransformer: 64.845 sec
TSNE with internal NearestNeighbors: 64.984 sec
try:
import annoy
except ImportError:
print("The package 'annoy' is required to run this example.")
sys.exit()
try:
import nmslib
except ImportError:
print("The package 'nmslib' is required to run this example.")
sys.exit()
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
from scipy.sparse import csr_matrix
print(__doc__)
if self.metric == 'sqeuclidean':
distances **= 2
return kneighbors_graph
if X is None:
for i in range(self.annoy_.get_n_items()):
ind, dist = self.annoy_.get_nns_by_item(
i, n_neighbors, self.search_k, include_distances=True)
if self.metric == 'sqeuclidean':
distances **= 2
return kneighbors_graph
def test_transformers():
"""Test that AnnoyTransformer and KNeighborsTransformer give same results
"""
X = np.random.RandomState(42).randn(10, 2)
knn = KNeighborsTransformer()
Xt0 = knn.fit_transform(X)
ann = AnnoyTransformer()
Xt1 = ann.fit_transform(X)
nms = NMSlibTransformer()
Xt2 = nms.fit_transform(X)
def load_mnist(n_samples):
"""Load MNIST, shuffle the data, and return only n_samples."""
mnist = fetch_openml(data_id=41063)
X, y = shuffle(mnist.data, mnist.target, random_state=42)
return X[:n_samples], y[:n_samples]
def run_benchmark():
datasets = [
('MNIST_2000', load_mnist(n_samples=2000)),
('MNIST_10000', load_mnist(n_samples=10000)),
]
n_iter = 500
perplexity = 30
# TSNE requires a certain number of neighbors which depends on the
# perplexity parameter.
# Add one since we include each sample as its own neighbor.
n_neighbors = int(3. * perplexity + 1) + 1
transformers = [
('AnnoyTransformer', AnnoyTransformer(n_neighbors=n_neighbors,
metric='sqeuclidean')),
('NMSlibTransformer', NMSlibTransformer(n_neighbors=n_neighbors,
metric='sqeuclidean')),
('KNeighborsTransformer', KNeighborsTransformer(
n_neighbors=n_neighbors, mode='distance', metric='sqeuclidean')),
('TSNE with AnnoyTransformer', make_pipeline(
AnnoyTransformer(n_neighbors=n_neighbors, metric='sqeuclidean'),
TSNE(metric='precomputed', perplexity=perplexity,
method="barnes_hut", random_state=42, n_iter=n_iter), )),
('TSNE with NMSlibTransformer', make_pipeline(
NMSlibTransformer(n_neighbors=n_neighbors, metric='sqeuclidean'),
fig.tight_layout()
plt.show()
if __name__ == '__main__':
test_transformers()
run_benchmark()
Note: Click here to download the full example code or to run this example in your browser via Binder
Sometimes looking at the learned coefficients of a neural network can provide insight into the learning behavior. For
example if weights look unstructured, maybe some were not used at all, or if very large coefficients exist, maybe
regularization was too low or the learning rate too high.
This example shows how to plot some of the first layer weights in a MLPClassifier trained on the MNIST dataset.
The input data consists of 28x28 pixel handwritten digits, leading to 784 features in the dataset. Therefore the first
layer weight matrix has the shape (784, hidden_layer_sizes[0]). We can therefore visualize a single column of the
weight matrix as a 28x28 pixel image.
To make the example run faster, we use very few hidden units, and train only for a very short time. Training longer
would result in weights with a much smoother spatial appearance.
print(__doc__)
mlp.fit(X_train, y_train)
print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
For greyscale image data where pixel values can be interpreted as degrees of blackness on a white background, like
handwritten digit recognition, the Bernoulli Restricted Boltzmann machine model (BernoulliRBM ) can perform
effective non-linear feature extraction.
In order to learn good latent representations from a small dataset, we artificially generate more labeled data by per-
turbing the training data with linear shifts of 1 pixel in each direction.
This example shows how to build a classification pipeline with a BernoulliRBM feature extractor and a
LogisticRegression classifier. The hyperparameters of the entire model (learning rate, hidden layer size, regu-
larization) were optimized by grid search, but the search is not reproduced here because of runtime constraints.
Logistic regression on raw pixel values is presented for comparison. The example shows that the features extracted by
the BernoulliRBM help improve the classification accuracy.
Out:
[BernoulliRBM] Iteration 1, pseudo-likelihood = -25.39, time = 0.15s
[BernoulliRBM] Iteration 2, pseudo-likelihood = -23.72, time = 0.22s
[BernoulliRBM] Iteration 3, pseudo-likelihood = -22.72, time = 0.20s
[BernoulliRBM] Iteration 4, pseudo-likelihood = -21.86, time = 0.20s
[BernoulliRBM] Iteration 5, pseudo-likelihood = -21.66, time = 0.20s
[BernoulliRBM] Iteration 6, pseudo-likelihood = -21.00, time = 0.20s
[BernoulliRBM] Iteration 7, pseudo-likelihood = -20.75, time = 0.20s
[BernoulliRBM] Iteration 8, pseudo-likelihood = -20.52, time = 0.19s
[BernoulliRBM] Iteration 9, pseudo-likelihood = -20.38, time = 0.20s
[BernoulliRBM] Iteration 10, pseudo-likelihood = -20.23, time = 0.19s
Logistic regression using RBM features:
precision recall f1-score support
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
# #############################################################################
# Setting up
from scipy.ndimage import convolve

def nudge_dataset(X, Y):
    """Produce a dataset 5 times bigger than the original one, by moving
    the 8x8 images in X around by 1px to left, right, down, up."""
    direction_vectors = [
        [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]],
        [[0, 0, 0],
         [1, 0, 0],
         [0, 0, 0]],
        [[0, 0, 0],
         [0, 0, 1],
         [0, 0, 0]],
        [[0, 0, 0],
         [0, 0, 0],
         [0, 1, 0]]]

    def shift(x, w):
        return convolve(x.reshape((8, 8)), mode='constant', weights=w).ravel()

    X = np.concatenate([X] +
                       [np.apply_along_axis(shift, 1, X, vector)
                        for vector in direction_vectors])
    Y = np.concatenate([Y for _ in range(5)], axis=0)
    return X, Y
# Load Data
X, y = datasets.load_digits(return_X_y=True)
X = np.asarray(X, 'float32')
X, Y = nudge_dataset(X, y)
X = (X - np.min(X, 0)) / (np.max(X, 0) + 0.0001) # 0-1 scaling
rbm_features_classifier = Pipeline(
steps=[('rbm', rbm), ('logistic', logistic)])
# #############################################################################
# Training
# #############################################################################
# Evaluation
Y_pred = rbm_features_classifier.predict(X_test)
print("Logistic regression using RBM features:\n%s\n" % (
metrics.classification_report(Y_test, Y_pred)))
Y_pred = raw_pixel_classifier.predict(X_test)
print("Logistic regression using raw pixel features:\n%s\n" % (
metrics.classification_report(Y_test, Y_pred)))
# #############################################################################
# Plotting
plt.figure(figsize=(4.2, 4))
for i, comp in enumerate(rbm.components_):
plt.subplot(10, 10, i + 1)
plt.imshow(comp.reshape((8, 8)), cmap=plt.cm.gray_r,
interpolation='nearest')
plt.xticks(())
plt.yticks(())
plt.suptitle('100 components extracted by RBM', fontsize=16)
plt.subplots_adjust(0.08, 0.02, 0.92, 0.85, 0.08, 0.23)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A comparison of different values for the regularization parameter ‘alpha’ on synthetic datasets. The plot shows that
different alphas yield different decision functions.
Alpha is a parameter for the regularization term, also known as the penalty term, that combats overfitting by constraining
the size of the weights. Increasing alpha may fix high variance (a sign of overfitting) by encouraging smaller weights,
resulting in a decision boundary plot that appears with less curvature. Similarly, decreasing alpha may fix high bias (a
sign of underfitting) by encouraging larger weights, potentially resulting in a more complicated decision boundary.
print(__doc__)
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
alphas = np.logspace(-5, 3, 5)
names = ['alpha ' + str(i) for i in alphas]
classifiers = []
for i in alphas:
classifiers.append(MLPClassifier(solver='lbfgs', alpha=i, random_state=1,
hidden_layer_sizes=[100, 100]))
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(name)
ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
figure.subplots_adjust(left=.02, right=.98)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example visualizes some training loss curves for different stochastic learning strategies, including SGD and
Adam. Because of time constraints, we use several small datasets, for which L-BFGS might be more suitable. The
general trend shown in these examples seems to carry over to larger datasets, however.
Note that those results can be highly dependent on the value of learning_rate_init.
print(__doc__)
import warnings
X = MinMaxScaler().fit_transform(X)
mlps = []
if name == "digits":
# digits is larger but converges fairly quickly
max_iter = 15
else:
max_iter = 400
mlps.append(mlp)
print("Training set score: %f" % mlp.score(X, y))
print("Training set loss: %f" % mlp.loss_)
for mlp, label, args in zip(mlps, labels, plot_args):
ax.plot(mlp.loss_curve_, label=label, **args)
Examples of how to compose transformers and pipelines from other estimators. See the User Guide.
Note: Click here to download the full example code or to run this example in your browser via Binder
In many real-world examples, there are many ways to extract features from a dataset. Often it is beneficial to combine
several methods to obtain good performance. This example shows how to use FeatureUnion to combine features
obtained by PCA and univariate selection.
Combining features using this transformer has the benefit that it allows cross validation and grid searches over the
whole process.
The combination used in this example is not particularly helpful on this dataset and is only used to illustrate the usage
of FeatureUnion.
iris = load_iris()
X, y = iris.data, iris.target
svm = SVC(kernel="linear")
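A minimal sketch of the combination described above, reusing the iris data and svm defined just before; the particular component and feature counts in the grid are illustrative.

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Combine PCA components and univariately selected features into one matrix.
combined_features = FeatureUnion([("pca", PCA(n_components=2)),
                                  ("univ_select", SelectKBest(k=1))])

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

# Grid-search over the whole composite estimator at once.
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print(grid_search.best_estimator_)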
Note: Click here to download the full example code or to run this example in your browser via Binder
The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
We use a GridSearchCV to set the dimensionality of the PCA.
Out:
Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
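The pipeline and grid search behind this result could look roughly like the following minimal sketch; the parameter ranges are illustrative rather than the example's exact grid.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)

pipe = Pipeline(steps=[('pca', PCA()),
                       ('logistic', LogisticRegression(max_iter=10000, tol=0.1))])

# Parameters of the pipeline steps are set with '<step>__<parameter>' names.
param_grid = {
    'pca__n_components': [5, 15, 30, 45, 64],
    'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)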
print(__doc__)
ax0.axvline(search.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
ax0.legend(prop=dict(size=12))
plt.xlim(-1, 70)
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of
features, using sklearn.compose.ColumnTransformer. This is particularly handy for the case of datasets
that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the cate-
gorical ones.
In this example, the numeric data is standard-scaled after mean-imputation, while the categorical data is one-hot
encoded after imputing missing values with a new category ('missing').
Finally, the preprocessing pipeline is integrated in a full prediction pipeline using sklearn.pipeline.
Pipeline, together with a simple classification model.
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause
import numpy as np
np.random.seed(0)
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append the classifier to the preprocessing pipeline to get a full
# prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Grid search can also be performed on the different preprocessing steps defined in the
ColumnTransformer object, together with the classifier’s hyperparameters as part of the Pipeline.
We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter
of the logistic regression using sklearn.model_selection.GridSearchCV .
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)
Note: Click here to download the full example code or to run this example in your browser via Binder
This example constructs a pipeline that does dimensionality reduction followed by prediction with a support vector
classifier. It demonstrates the use of GridSearchCV and Pipeline to optimize over different classes of estimators
in a single CV run – unsupervised PCA and NMF dimensionality reductions are compared to univariate feature selection
during the grid search.
Additionally, Pipeline can be instantiated with the memory argument to memoize the transformers within the
pipeline, avoiding fitting the same transformers over and over again.
Note that the use of memory to enable caching becomes interesting when the fitting of a transformer is costly.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
print(__doc__)
pipe = Pipeline([
# the reduce_dim stage is populated by the param_grid
('reduce_dim', 'passthrough'),
('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
{
'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
mean_scores = np.array(grid.cv_results_['mean_test_score'])
plt.figure()
COLORS = 'bgrcmyk'
for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])
plt.show()
It is sometimes worthwhile to store the state of a specific transformer since it could be used again. Using a
pipeline in GridSearchCV triggers such situations. Therefore, we use the argument memory to enable
caching.
Warning: Note that this example is, however, only an illustration since for this specific case fitting
PCA is not necessarily slower than loading the cache. Hence, use the memory constructor parameter
when the fitting of a transformer is costly.
# This time, a cached pipeline will be used within the grid search
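A minimal sketch of such a cached pipeline, assuming the X, y digits data, param_grid, and estimators defined earlier in this example; the temporary cache directory is an illustrative choice.

from shutil import rmtree
from tempfile import mkdtemp
from joblib import Memory

# Create a temporary folder to store the fitted transformers of the pipeline.
location = mkdtemp()
memory = Memory(location=location, verbose=10)
cached_pipe = Pipeline([('reduce_dim', PCA()),
                        ('classify', LinearSVC(dual=False, max_iter=10000))],
                       memory=memory)

grid = GridSearchCV(cached_pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)

# Delete the temporary cache before exiting.
memory.clear(warn=False)
rmtree(location)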
The PCA fitting is only computed at the evaluation of the first configuration of the C parameter of the LinearSVC
classifier. The other configurations of C will trigger the loading of the cached PCA estimator data, saving processing
time. Therefore, caching the pipeline with memory is highly beneficial when fitting a transformer is costly.
Total running time of the script: ( 0 minutes 5.127 seconds)
Estimated memory usage: 8 MB
Note: Click here to download the full example code or to run this example in your browser via Binder
Datasets can often contain components that require different feature extraction and processing pipelines. This
scenario might occur when:
1. Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
2. Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.
This example demonstrates how to use sklearn.compose.ColumnTransformer on a dataset containing dif-
ferent types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject
line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a
ColumnTransformer and finally train a classifier on the combined set of features.
The choice of features is not particularly helpful, but serves to illustrate the technique.
Out:
[Pipeline] ....... (step 1 of 3) Processing subjectbody, total= 0.0s
[Pipeline] ............. (step 2 of 3) Processing union, total= 0.3s
[Pipeline] ............... (step 3 of 3) Processing svc, total= 0.0s
precision recall f1-score support
import numpy as np
prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features[i, 0] = sub
return features
pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
Note: Click here to download the full example code or to run this example in your browser via Binder
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from distutils.version import LooseVersion
print(__doc__)
Synthetic example
A synthetic random regression problem is generated. The targets y are modified by: (i) translating all targets such
that all entries are non-negative and (ii) applying an exponential function to obtain non-linear targets which cannot be
fitted using a simple linear model.
Therefore, a logarithmic (np.log1p) and an exponential function (np.expm1) will be used to transform the targets
before training a linear regression model and using it for prediction.
The following illustrates the probability density functions of the target before and after applying the logarithmic
functions.
At first, a linear model will be applied on the original targets. Due to the non-linearity, the model trained will not
be precise during prediction. Subsequently, a logarithmic function is used to linearize the targets, allowing better
prediction even with a similar linear model, as reported by the median absolute error (MAE).
f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)
regr = RidgeCV()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
ax0.scatter(y_test, y_pred)
ax0.plot([0, 2000], [0, 2000], '--k')
ax0.set_ylabel('Target predicted')
ax0.set_xlabel('True Target')
ax0.set_title('Ridge regression \n without target transformation')
ax0.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (
r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
ax0.set_xlim([0, 2000])
ax0.set_ylim([0, 2000])
regr_trans = TransformedTargetRegressor(regressor=RidgeCV(),
func=np.log1p,
inverse_func=np.expm1)
regr_trans.fit(X_train, y_train)
y_pred = regr_trans.predict(X_test)
In a similar manner, the Boston housing dataset is used to show the impact of transforming the targets before learning a
model. In this example, the target to be predicted corresponds to the weighted distances to the five Boston employment
centers.
from sklearn.datasets import load_boston
from sklearn.preprocessing import QuantileTransformer, quantile_transform
dataset = load_boston()
The effect of the transformer is weaker than on the synthetic data. However, the transform induces a decrease of the
MAE.
f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)
regr = RidgeCV()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
ax0.scatter(y_test, y_pred)
ax0.plot([0, 10], [0, 10], '--k')
ax0.set_ylabel('Target predicted')
ax0.set_xlabel('True Target')
ax0.set_title('Ridge regression \n without target transformation')
ax0.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (
r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))
ax0.set_xlim([0, 10])
ax0.set_ylim([0, 10])
regr_trans = TransformedTargetRegressor(
regressor=RidgeCV(),
transformer=QuantileTransformer(n_quantiles=300,
output_distribution='normal'))
regr_trans.fit(X_train, y_train)
y_pred = regr_trans.predict(X_test)
plt.show()
6.25 Preprocessing
Note: Click here to download the full example code or to run this example in your browser via Binder
Shows how to use a function transformer in a pipeline. If you know your dataset’s first principal component is irrelevant
for a classification task, you can use the FunctionTransformer to select all but the first column of the PCA transformed
data.
import matplotlib.pyplot as plt
import numpy as np
def generate_dataset():
"""
This dataset is two lines with a slope ~ 1, where one has
a y offset of ~100
"""
return np.vstack((
np.vstack((
_generate_vector(),
_generate_vector() + 100,
)).T,
np.vstack((
_generate_vector(),
_generate_vector(),
)).T,
def all_but_first_column(X):
return X[:, 1:]
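The drop_first_component helper called below is not shown in this excerpt; a minimal sketch of how it might combine FunctionTransformer with PCA (the train/test handling here is simplified relative to the full example):

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_first_component(X, y):
    # Project with PCA, then drop the first component with a FunctionTransformer.
    pipeline = make_pipeline(PCA(), FunctionTransformer(all_but_first_column))
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipeline.fit(X_train, y_train)
    return pipeline.transform(X_test), y_test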
if __name__ == '__main__':
X, y = generate_dataset()
lw = 0
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y, lw=lw)
plt.figure()
X_transformed, y_transformed = drop_first_component(*generate_dataset())
plt.scatter(
X_transformed[:, 0],
np.zeros(len(X_transformed)),
c=y_transformed,
lw=lw,
s=60
)
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
The example compares the prediction results of linear regression (a linear model) and a decision tree (a tree-based model),
with and without discretization of real-valued features.
As shown in the result before discretization, a linear model is fast to build and relatively straightforward to interpret,
but can only model linear relationships, while a decision tree can build a much more complex model of the data. One
way to make a linear model more powerful on continuous data is to use discretization (also known as binning). In the
example, we discretize the feature and one-hot encode the transformed data. Note that if the bins are not reasonably
wide, there would appear to be a substantially increased risk of overfitting, so the discretizer parameters should usually
be tuned under cross-validation.
After discretization, linear regression and decision tree make exactly the same prediction. As features are constant
within each bin, any model must predict the same value for all points within a bin. Compared with the result before
discretization, the linear model becomes much more flexible while the decision tree becomes much less flexible. Note that
binning features generally has no beneficial effect for tree-based models, as these models can learn to split up the data
anywhere.
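A minimal sketch of this before/after comparison on made-up one-dimensional data; the bin count and noise level are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 10 - 5
y = np.sin(X).ravel() + rng.normal(size=100) * 0.3

# Without binning: the linear model can only fit a straight line.
lr = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X, y)
print("R^2 without binning: linear %.2f, tree %.2f"
      % (lr.score(X, y), tree.score(X, y)))

# With binning: discretize into 10 bins, one-hot encode, then refit.
binned_lr = make_pipeline(KBinsDiscretizer(n_bins=10, encode='onehot'),
                          LinearRegression()).fit(X, y)
print("R^2 with binning: linear %.2f" % binned_lr.score(X, y))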
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
import numpy as np
import matplotlib.pyplot as plt
print(__doc__)
n_samples = 200
centers_0 = np.array([[0, 0], [0, 5], [2, 4], [8, 8]])
centers_1 = np.array([[0, 0], [3, 1]])
ax = plt.subplot(len(X_list), len(strategies) + 1, i)
ax.scatter(X[:, 0], X[:, 1], edgecolors='k')
if ds_cnt == 0:
ax.set_title("Input data", size=14)
xx, yy = np.meshgrid(
np.linspace(X[:, 0].min(), X[:, 0].max(), 300),
np.linspace(X[:, 1].min(), X[:, 1].max(), 300))
grid = np.c_[xx.ravel(), yy.ravel()]
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())
i += 1
# transform the dataset with KBinsDiscretizer
for strategy in strategies:
enc = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy=strategy)
enc.fit(X)
grid_encoded = enc.transform(grid)
ax = plt.subplot(len(X_list), len(strategies) + 1, i)
# horizontal stripes
horizontal = grid_encoded[:, 0].reshape(xx.shape)
ax.contourf(xx, yy, horizontal, alpha=.5)
# vertical stripes
vertical = grid_encoded[:, 1].reshape(xx.shape)
ax.contourf(xx, yy, vertical, alpha=.5)
i += 1
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many
machine learning algorithms. Standardization involves rescaling the features such that they have the properties of a
standard normal distribution with a mean of zero and a standard deviation of one.
While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized,
intuitively we can think of Principal Component Analysis (PCA) as being a prime example of when normalization is
important. In PCA we are interested in the components that maximize the variance. If one component (e.g. human
height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine
that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled.
As a change in height of one meter can be considered much more important than a change in weight of one kilogram,
this is clearly incorrect.
To illustrate this, PCA is performed comparing the use of data with StandardScaler applied, to unscaled data.
The results are visualized and a clear difference noted. In the unscaled set, it can be seen that feature #13 dominates the
direction of the first principal component, being a whole two orders of magnitude above the other features.
This is contrasted when observing the principal component for the scaled version of the data. In the scaled version,
the orders of magnitude are roughly the same across all the features.
The dataset used is the Wine Dataset available at UCI. This dataset has continuous features that are heterogeneous in
scale due to the differing properties that they measure (e.g. alcohol content and malic acid).
The transformed data is then used to train a naive Bayes classifier, and a clear difference in prediction accuracies is
observed, wherein the dataset which is scaled before PCA vastly outperforms the unscaled version.
Out:
PC 1 without scaling:
[ 1.76e-03 -8.36e-04 1.55e-04 -5.31e-03 2.02e-02 1.02e-03 1.53e-03
-1.12e-04 6.31e-04 2.33e-03 1.54e-04 7.43e-04 1.00e+00]
PC 1 with scaling:
[ 0.13 -0.26 -0.01 -0.23 0.16 0.39 0.42 -0.28 0.33 -0.11 0.3 0.38
0.28]
RANDOM_STATE = 42
FIG_SIZE = (10, 7)
# Fit to data and predict using pipelined scaling, GNB and PCA.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)
# Use PCA without and with scale on X_train data for visualization.
X_train_transformed = pca.transform(X_train)
scaler = std_clf.named_steps['standardscaler']
X_train_std_transformed = pca_std.transform(scaler.transform(X_train))
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
This example demonstrates the use of the Box-Cox and Yeo-Johnson transforms through PowerTransformer to
map data from various distributions to a normal distribution.
The power transform is useful as a transformation in modeling problems where homoscedasticity and normality are
desired. Below are examples of Box-Cox and Yeo-Johnson applied to six different probability distributions: Lognormal,
Chi-squared, Weibull, Gaussian, Uniform, and Bimodal.
Note that the transformations successfully map the data to a normal distribution when applied to certain datasets, but
are ineffective with others. This highlights the importance of visualizing the data before and after transformation.
Also note that even though Box-Cox seems to perform better than Yeo-Johnson for lognormal and chi-squared distri-
butions, keep in mind that Box-Cox does not support inputs with negative values.
For comparison, we also add the output from QuantileTransformer. It can force any arbitrary distribution into
a gaussian, provided that there are enough training samples (thousands). Because it is a non-parametric method, it is
harder to interpret than the parametric ones (Box-Cox and Yeo-Johnson).
On “small” datasets (less than a few hundred points), the quantile transformer is prone to overfitting. The use of the
power transform is then recommended.
print(__doc__)
N_SAMPLES = 1000
FONT_SIZE = 6
BINS = 30
rng = np.random.RandomState(304)
bc = PowerTransformer(method='box-cox')
yj = PowerTransformer(method='yeo-johnson')
# n_quantiles is set to the training set size rather than the default value
# to avoid a warning being raised by this example
qt = QuantileTransformer(n_quantiles=500, output_distribution='normal',
random_state=rng)
size = (N_SAMPLES, 1)
# lognormal distribution
X_lognormal = rng.lognormal(size=size)
# chi-squared distribution
df = 3
X_chisq = rng.chisquare(df=df, size=size)
# weibull distribution
a = 50
X_weibull = rng.weibull(a=a, size=size)
# gaussian distribution
loc = 100
X_gaussian = rng.normal(loc=loc, size=size)
# uniform distribution
X_uniform = rng.uniform(low=0, high=1, size=size)
# bimodal distribution
loc_a, loc_b = 100, 105
X_a, X_b = rng.normal(loc=loc_a, size=size), rng.normal(loc=loc_b, size=size)
X_bimodal = np.concatenate([X_a, X_b], axis=0)
# create plots
distributions = [
('Lognormal', X_lognormal),
('Chi-squared', X_chisq),
('Weibull', X_weibull),
('Gaussian', X_gaussian),
('Uniform', X_uniform),
('Bimodal', X_bimodal)
plt.tight_layout()
plt.show()
Note: Click here to download the full example code or to run this example in your browser via Binder
A demonstration of feature discretization on synthetic classification datasets. Feature discretization decomposes each
feature into a set of bins, here equally distributed in width. The discrete values are then one-hot encoded, and given to
a linear classifier. This preprocessing enables a non-linear behavior even though the classifier is linear.
In this example, the first two rows represent linearly non-separable datasets (moons and concentric circles) while the
third is approximately linearly separable. On the two linearly non-separable datasets, feature discretization largely
increases the performance of linear classifiers. On the linearly separable dataset, feature discretization decreases the
performance of linear classifiers. Two non-linear classifiers are also shown for comparison.
This example should be taken with a grain of salt, as the intuition conveyed does not necessarily carry over to real
datasets. Particularly in high-dimensional spaces, data can more easily be separated linearly. Moreover, using feature
discretization and one-hot encoding increases the number of features, which can easily lead to overfitting when the
number of samples is small.
The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classifi-
cation accuracy on the test set.
Out:
dataset 0
---------
LogisticRegression: 0.86
LinearSVC: 0.86
KBinsDiscretizer + LogisticRegression: 0.86
KBinsDiscretizer + LinearSVC: 0.92
GradientBoostingClassifier: 0.90
SVC: 0.94
dataset 1
---------
LogisticRegression: 0.40
LinearSVC: 0.40
KBinsDiscretizer + LogisticRegression: 0.88
KBinsDiscretizer + LinearSVC: 0.86
GradientBoostingClassifier: 0.80
SVC: 0.84
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
print(__doc__)
def get_name(estimator):
name = estimator.__class__.__name__
if name == 'Pipeline':
name = [get_name(est[1]) for est in estimator.steps]
name = ' + '.join(name)
return name
n_samples = 100
datasets = [
make_moons(n_samples=n_samples, noise=0.2, random_state=0),
make_circles(n_samples=n_samples, noise=0.2, factor=0.5, random_state=1),
make_classification(n_samples=n_samples, n_features=2, n_redundant=0,
n_informative=2, random_state=2,
n_clusters_per_class=1)
]
cm = plt.cm.PiYG
cm_bright = ListedColormap(['#b30065', '#178000'])
# plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]*[y_min, y_max].
if hasattr(clf, "decision_function"):
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
if ds_cnt == 0:
ax.set_title(name.replace(' + ', '\n'))
ax.text(0.95, 0.06, ('%.2f' % score).lstrip('0'), size=15,
bbox=dict(boxstyle='round', alpha=0.8, facecolor='white'),
transform=ax.transAxes, horizontalalignment='right')
plt.tight_layout()
Feature 0 (median income in a block) and feature 5 (number of households) of the California housing dataset have
very different scales and contain some very large outliers. These two characteristics make the data hard to visualize
and, more importantly, they can degrade the predictive performance of many machine learning algorithms.
Unscaled data can also slow down or even prevent the convergence of many gradient-based estimators.
Indeed, many estimators are designed with the assumption that each feature takes values close to zero or, more
importantly, that all features vary on comparable scales. In particular, metric-based and gradient-based estimators often
assume approximately standardized data (centered features with unit variances). A notable exception is decision-tree-based
estimators, which are robust to arbitrary scaling of the data.
This example uses different scalers, transformers, and normalizers to bring the data within a pre-defined range.
Scalers are linear (or more precisely affine) transformers and differ from each other in the way they estimate the
parameters used to shift and scale each feature.
QuantileTransformer provides non-linear transformations in which distances between marginal outliers and
inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal
distribution to stabilize variance and minimize skewness.
Unlike the previous transformations, normalization refers to a per sample transformation instead of a per feature
transformation.
The following code is a bit verbose; feel free to jump directly to the analysis of the results.
import numpy as np
from sklearn.datasets import fetch_california_housing
print(__doc__)
dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target
# keep the two features discussed above: median income (0) and number of
# households (5); this selection line was reconstructed, it was lost in extraction
X, y = X_full[:, [0, 5]], y_full
distributions = [
('Unscaled data', X),
('Data after standard scaling',
StandardScaler().fit_transform(X)),
('Data after min-max scaling',
MinMaxScaler().fit_transform(X)),
('Data after max-abs scaling',
MaxAbsScaler().fit_transform(X)),
('Data after robust scaling',
RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
('Data after power transformation (Yeo-Johnson)',
PowerTransformer(method='yeo-johnson').fit_transform(X)),
('Data after power transformation (Box-Cox)',
PowerTransformer(method='box-cox').fit_transform(X)),
('Data after quantile transformation (gaussian pdf)',
QuantileTransformer(output_distribution='normal')
.fit_transform(X)),
('Data after quantile transformation (uniform pdf)',
QuantileTransformer(output_distribution='uniform')
.fit_transform(X)),
('Data after sample-wise L2 normalizing',
Normalizer().fit_transform(X)),
]
ax_scatter = plt.axes(rect_scatter)
ax_histx = plt.axes(rect_histx)
ax_histy = plt.axes(rect_histy)
ax_scatter_zoom = plt.axes(rect_scatter)
ax_histx_zoom = plt.axes(rect_histx)
ax_histy_zoom = plt.axes(rect_histy)
ax.set_title(title)
ax.set_xlabel(x0_label)
ax.set_ylabel(x1_label)
Two plots will be shown for each scaler/normalizer/transformer. The left figure will show a scatter plot of the full data
set while the right figure will exclude the extreme values, considering only 99% of the data set and excluding the
marginal outliers. In addition, the marginal distributions for each feature will be shown on the sides of the scatter plot.
def make_plot(item_idx):
title, X = distributions[item_idx]
ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)
axarr = (ax_zoom_out, ax_zoom_in)
plot_distribution(axarr[0], X, y, hist_nbins=200,
x0_label="Median Income",
x1_label="Number of households",
title="Full data")
# zoom-in
zoom_in_percentile_range = (0, 99)
cutoffs_X0 = np.percentile(X[:, 0], zoom_in_percentile_range)
cutoffs_X1 = np.percentile(X[:, 1], zoom_in_percentile_range)
non_outliers_mask = (
np.all(X > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) &
np.all(X < [cutoffs_X0[1], cutoffs_X1[1]], axis=1))
plot_distribution(axarr[1], X[non_outliers_mask], y[non_outliers_mask],
hist_nbins=50,
x0_label="Median Income",
x1_label="Number of households",
title="Zoom-in")
Original data
Each transformation is plotted showing two transformed features, with the left plot showing the entire dataset, and
the right zoomed-in to show the dataset without the marginal outliers. A large majority of the samples are compacted
to a specific range, [0, 10] for the median income and [0, 6] for the number of households. Note that there are
some marginal outliers (some blocks have more than 1200 households). Therefore, a specific pre-processing can
be very beneficial depending on the application. In the following, we present some insights and behaviors of those
pre-processing methods in the presence of marginal outliers.
make_plot(0)
StandardScaler
StandardScaler removes the mean and scales the data to unit variance. However, the outliers have an influence
when computing the empirical mean and standard deviation, which shrinks the range of the feature values, as shown in
the left figure below. Note in particular that because the outliers on each feature have different magnitudes, the spread
of the transformed data on each feature is very different: most of the data lie in the [-2, 4] range for the transformed
median income feature while the same data is squeezed in the smaller [-0.2, 0.2] range for the transformed number of
households.
StandardScaler therefore cannot guarantee balanced feature scales in the presence of outliers.
make_plot(1)
MinMaxScaler
MinMaxScaler rescales the data set such that all feature values are in the range [0, 1] as shown in the right panel
below. However, this scaling compresses all inliers into the narrow range [0, 0.005] for the transformed number of
households.
Like StandardScaler, MinMaxScaler is very sensitive to the presence of outliers.
make_plot(2)
MaxAbsScaler
MaxAbsScaler differs from the previous scaler in that the absolute values are mapped to the range [0, 1]. On
positive-only data, this scaler behaves similarly to MinMaxScaler and therefore also suffers from the presence of
large outliers.
make_plot(3)
RobustScaler
Unlike the previous scalers, the centering and scaling statistics of this scaler are based on percentiles and are therefore
not influenced by a small number of very large marginal outliers. Consequently, the resulting range of the transformed
feature values is larger than for the previous scalers and, more importantly, approximately the same for both features:
most of the transformed values lie in a [-2, 3] range as seen in the zoomed-in figure. Note that the outliers themselves
are still present in the transformed data. If a separate outlier clipping is desirable, a non-linear transformation is
required (see below).
make_plot(4)
PowerTransformer
PowerTransformer applies a power transformation to each feature to make the data more Gaussian-like. Currently,
PowerTransformer implements the Yeo-Johnson and Box-Cox transforms. The power transform finds
the optimal scaling factor to stabilize variance and minimize skewness through maximum likelihood estimation. By
default, PowerTransformer also applies zero-mean, unit-variance normalization to the transformed output. Note
that Box-Cox can only be applied to strictly positive data. Income and number of households happen to be strictly
positive, but if negative values are present the Yeo-Johnson transform is to be preferred.
make_plot(5)
make_plot(6)
make_plot(7)
QuantileTransformer (uniform output)
QuantileTransformer applies a non-linear transformation such that the probability density function of each
feature will be mapped to a uniform distribution. In this case, all the data will be mapped in the range [0, 1], even the
outliers, which can no longer be distinguished from the inliers.
Like RobustScaler, QuantileTransformer is robust to outliers in the sense that adding or removing outliers in
the training set will yield approximately the same transformation on held-out data. But contrary to RobustScaler,
QuantileTransformer will also automatically collapse any outliers by setting them to the a priori defined range
boundaries (0 and 1).
make_plot(8)
Normalizer
The Normalizer rescales the vector for each sample to have unit norm, independently of the distribution of the
samples. This can be seen in both figures below, where all samples are mapped onto the unit circle. In our example the
two selected features have only positive values; therefore the transformed data only lie in the positive quadrant. This
would not be the case if some original features had a mix of positive and negative values.
make_plot(9)
plt.show()
We are pleased to announce the release of scikit-learn 0.22, which comes with many bug fixes and new features! We
detail below a few of the major features of this release. For an exhaustive list of all the changes, please refer to the
release notes.
To install the latest version (with pip):
pip install --upgrade scikit-learn
or with conda:
conda install scikit-learn
A new plotting API is available for creating visualizations. This new API allows for quickly adjusting the visuals of a
plot without involving any recomputation. It is also possible to add different plots to the same figure. The following
example illustrates plot_roc_curve, but other plotting utilities are supported, such as plot_partial_dependence,
plot_precision_recall_curve, and plot_confusion_matrix. Read more about this new API in the
User Guide.
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
# (reconstructed lines) plot the ROC curve of the SVC, then add the random
# forest curve to the same axes so both can be compared in one figure
svc_disp = plot_roc_curve(svc, X_test, y_test)
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
plt.show()
StackingClassifier and StackingRegressor allow you to have a stack of estimators with a final classifier
or a regressor. Stacked generalization consists in stacking the output of individual estimators and using a classifier to
compute the final prediction. Stacking allows one to use the strength of each individual estimator by using their output
as input of a final estimator. Base estimators are fitted on the full X while the final estimator is trained using
cross-validated predictions of the base estimators obtained with cross_val_predict.
Read more in the User Guide.
X, y = load_iris(return_X_y=True)
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('svr', make_pipeline(StandardScaler(),
LinearSVC(random_state=42)))
]
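The page break cut this listing short; a minimal completion, assuming the standard StackingClassifier API with a LogisticRegression as the final estimator, would look like:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# stack the base estimators defined above and score the result on a held-out split
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train).score(X_test, y_test)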
Out:
0.9473684210526315
The inspection.permutation_importance can be used to get an estimate of the importance of each feature,
for any fitted estimator:
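The plotting code below expects a result object returned by permutation_importance; a minimal sketch producing one (the RandomForestClassifier and the synthetic dataset are assumptions, not the original listing) is:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=5, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
# permute each feature 10 times and record the resulting drop in score
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0, n_jobs=-1)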
fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
vert=False, labels=range(X.shape[1]))
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
# (reconstructed imports) HistGradientBoosting estimators natively handle missing values (NaNs)
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np
X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))
Out:
[0 0 1 1]
Most estimators based on nearest neighbors graphs now accept precomputed sparse graphs as input, to reuse the
same graph for multiple estimator fits. To use this feature in a pipeline, one can use the memory parameter, along
with one of the two new transformers, neighbors.KNeighborsTransformer and neighbors.RadiusNeighborsTransformer.
The precomputation can also be performed by custom estimators to use alternative implementations, such as
approximate nearest neighbors methods. See more details in the User Guide.
X, y = make_classification(random_state=0)
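The construction of the cached pipeline fell victim to the page layout; a rough sketch of what it could look like (the KNeighborsTransformer/Isomap combination and the temporary cache directory are assumptions) is:
from tempfile import TemporaryDirectory
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

cache = TemporaryDirectory(prefix="sklearn_graph_cache_")
# precompute a sparse distance graph once and cache it with joblib Memory
estimator = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode='distance'),
    Isomap(n_neighbors=10, metric='precomputed'),
    memory=cache.name)
estimator.fit(X)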
# We can decrease the number of neighbors and the graph will not be
# recomputed.
estimator.set_params(isomap__n_neighbors=5)
estimator.fit(X)
We now support imputation for completing missing values using k-Nearest Neighbors.
Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the
training set. Two samples are close if the features that neither is missing are close. By default, a euclidean distance
metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors.
Read more in the User Guide.
import numpy as np
from sklearn.impute import KNNImputer
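The call that produced the output below was dropped by the page layout; an input matrix and imputer consistent with that output (reconstructed, not copied from the listing) would be:
# each missing entry is replaced by the mean of that column over the 2 nearest rows
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))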
Out:
[[1. 2. 4. ]
[3. 4. 3. ]
[5.5 6. 5. ]
[8. 8. 7. ]]
Tree pruning
It is now possible to prune most tree-based estimators once the trees are built. The pruning is based on minimal
cost-complexity. Read more in the User Guide for details.
X, y = make_classification(random_state=0)
rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print("Average number of nodes without pruning {:.1f}".format(
np.mean([e.tree_.node_count for e in rf.estimators_])))
rf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print("Average number of nodes with pruning {:.1f}".format(
np.mean([e.tree_.node_count for e in rf.estimators_])))
datasets.fetch_openml can now return a pandas dataframe and thus properly handle datasets with heterogeneous data:
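A short sketch of such a call (the titanic dataset and the selected columns are chosen to match the output shown below, not copied from the original listing):
from sklearn.datasets import fetch_openml

# as_frame=True returns the data as a pandas DataFrame with mixed dtypes
titanic = fetch_openml('titanic', version=1, as_frame=True)
print(titanic.data.head()[['pclass', 'embarked']])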
Out:
pclass embarked
0 1.0 S
1 1.0 S
2 1.0 S
3 1.0 S
4 1.0 S
Developers can check the compatibility of their scikit-learn compatible estimators using check_estimator. For
instance, check_estimator(LinearSVC) passes.
We now provide a pytest-specific decorator which allows pytest to run all checks independently and report the
checks that are failing.
from sklearn.utils.estimator_checks import parametrize_with_checks
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
@parametrize_with_checks([LogisticRegression, DecisionTreeRegressor])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
The roc_auc_score function can also be used in multi-class classification. Two averaging strategies are currently
supported: the one-vs-one algorithm computes the average of the pairwise ROC AUC scores, and the one-vs-rest
algorithm computes the average of the ROC AUC scores for each class against all other classes. In both cases, the
multiclass ROC AUC scores are computed from the probability estimates that a sample belongs to a particular class
according to the model. The OvO and OvR algorithms support weighting uniformly (average='macro') and
weighting by the prevalence (average='weighted').
Read more in the User Guide.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
X, y = make_classification(n_classes=4, n_informative=16)
clf = SVC(decision_function_shape='ovo', probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovo'))
Out:
0.9957333333333332
6.27.1 Decision boundary of label propagation versus SVM on the Iris dataset
Comparison of the decision boundaries generated on the iris dataset by Label Propagation and SVM.
This demonstrates Label Propagation learning a good boundary even with a small amount of labeled data.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
from sklearn.semi_supervised import LabelSpreading
rng = np.random.RandomState(0)
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -1
y_50 = np.copy(y)
y_50[rng.rand(len(y)) < 0.5] = -1
color_map = {-1: (1, 1, 1), 0: (0, 0, .9), 1: (1, 0, 0), 2: (.8, .6, 0)}
plt.title(titles[i])
Example of LabelPropagation learning a complex internal structure to demonstrate “manifold learning”. The outer
circle should be labeled “red” and the inner circle “blue”. Because both label groups lie inside their own distinct
shape, we can see that the labels propagate correctly around the circle.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.semi_supervised import LabelSpreading
from sklearn.datasets import make_circles
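The data-generation step was lost in extraction; a minimal reconstruction (the sample count is an assumption) that gives one labeled point per circle and marks the rest as unlabeled (-1) is:
# generate the two concentric circles and label a single point on each one
n_samples = 200
X, y = make_circles(n_samples=n_samples, shuffle=False)
outer, inner = 0, 1
labels = np.full(n_samples, -1.)
labels[0] = outer
labels[-1] = inner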
# #############################################################################
# Learn with LabelSpreading
label_spread = LabelSpreading(kernel='knn', alpha=0.8)
label_spread.fit(X, labels)
# #############################################################################
# Plot output labels
output_labels = label_spread.transduction_
plt.figure(figsize=(8.5, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[labels == outer, 0], X[labels == outer, 1], color='navy',
marker='s', lw=0, label="outer labeled", s=10)
plt.scatter(X[labels == inner, 0], X[labels == inner, 1], color='c',
marker='s', lw=0, label='inner labeled', s=10)
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], color='darkorange',
marker='.', label='unlabeled')
plt.legend(scatterpoints=1, shadow=False, loc='upper right')
plt.title("Raw data (2 classes=outer and inner)")
plt.subplot(1, 2, 2)
output_label_array = np.asarray(output_labels)
outer_numbers = np.where(output_label_array == outer)[0]
inner_numbers = np.where(output_label_array == inner)[0]
plt.scatter(X[outer_numbers, 0], X[outer_numbers, 1], color='navy',
marker='s', lw=0, s=10, label="outer learned")
plt.scatter(X[inner_numbers, 0], X[inner_numbers, 1], color='c',
marker='s', lw=0, s=10, label="inner learned")
plt.legend(scatterpoints=1, shadow=False, loc='upper right')
plt.title("Labels learned with Label Spreading (KNN)")
This example demonstrates the power of semi-supervised learning by training a Label Spreading model to classify
handwritten digits with very few labels.
The handwritten digit dataset has 1797 total points. The model is trained on 340 of these points, of which only 40 are
labeled. The results, in the form of a confusion matrix and a series of metrics over each class, will be very good.
At the end, the top 10 most uncertain predictions will be shown.
Out:
Label Spreading model: 40 labeled & 300 unlabeled points (340 total)
precision recall f1-score support
Confusion matrix
[[27 0 0 0 0 0 0 0 0 0]
[ 0 37 0 0 0 0 0 0 0 0]
[ 0 1 24 0 0 0 2 1 0 0]
[ 0 0 0 28 0 5 0 1 0 1]
[ 0 0 0 0 24 0 0 0 0 0]
[ 0 0 0 0 0 32 0 0 0 2]
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
digits = datasets.load_digits()
rng = np.random.RandomState(2)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:340]]
y = digits.target[indices[:340]]
images = digits.images[indices[:340]]
n_total_samples = len(y)
n_labeled_points = 40
indices = np.arange(n_total_samples)
unlabeled_set = indices[n_labeled_points:]
# #############################################################################
# Shuffle everything around
y_train = np.copy(y)
y_train[unlabeled_set] = -1
# #############################################################################
# Learn with LabelSpreading
lp_model = LabelSpreading(gamma=.25, max_iter=20)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_set]
true_labels = y[unlabeled_set]
print(classification_report(true_labels, predicted_labels))
cm = confusion_matrix(true_labels, predicted_labels, labels=lp_model.classes_)
print("Confusion matrix")
print(cm)
# #############################################################################
# Calculate uncertainty values for each transduced distribution
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)
# #############################################################################
# Pick the top 10 most uncertain labels
uncertainty_index = np.argsort(pred_entropies)[-10:]
# #############################################################################
# Plot
f = plt.figure(figsize=(7, 5))
for index, image_index in enumerate(uncertainty_index):
image = images[image_index]
Demonstrates an active learning technique to learn handwritten digits using label propagation.
We start by training a label propagation model with only 40 labeled points, then we select the top five most uncertain
points to label. Next, we train with 45 labeled points (the original 40 plus 5 new ones). We repeat this process four times
to end up with a model trained on 60 labeled examples. Note you can label more points by changing
max_iterations. Labeling more points can be useful to get a sense for the speed of convergence of this active
learning technique.
A plot will appear showing the top 5 most uncertain digits for each iteration of training. These may or may not contain
mistakes, but we will train the next model with their true labels.
Out:
Iteration 0 ______________________________________________________________________
Label Spreading model: 40 labeled & 290 unlabeled (330 total)
precision recall f1-score support
Confusion matrix
[[22 0 0 0 0 0 0 0 0 0]
[ 0 18 2 0 0 0 1 0 5 0]
[ 0 0 27 0 0 0 0 0 2 0]
[ 0 0 0 24 0 0 0 0 3 0]
Confusion matrix
[[22 0 0 0 0 0 0 0 0 0]
[ 0 22 0 0 0 0 0 0 0 0]
[ 0 0 27 0 0 0 0 0 2 0]
[ 0 0 0 26 0 0 0 0 0 0]
[ 0 1 0 0 22 0 0 0 0 0]
[ 0 0 0 0 0 23 0 0 0 10]
[ 0 1 0 0 0 0 34 0 0 0]
[ 0 0 0 0 0 0 0 30 3 0]
[ 0 4 0 0 0 0 0 0 24 0]
[ 0 0 0 0 2 1 0 2 2 27]]
Iteration 2 ______________________________________________________________________
Label Spreading model: 50 labeled & 280 unlabeled (330 total)
precision recall f1-score support
Confusion matrix
[[22 0 0 0 0 0 0 0 0 0]
Confusion matrix
[[22 0 0 0 0 0 0 0 0 0]
[ 0 22 0 0 0 0 0 0 0 0]
[ 0 0 27 0 0 0 0 0 0 0]
[ 0 0 0 26 0 0 0 0 0 0]
[ 0 0 0 0 20 0 0 0 0 0]
[ 0 0 0 0 0 27 0 0 0 4]
[ 0 1 0 0 0 0 34 0 0 0]
[ 0 0 0 0 0 0 0 31 0 0]
[ 0 3 0 0 1 0 0 0 24 0]
[ 0 0 0 0 2 1 0 0 2 28]]
Iteration 4 ______________________________________________________________________
Label Spreading model: 60 labeled & 270 unlabeled (330 total)
precision recall f1-score support
Confusion matrix
[[22 0 0 0 0 0 0 0 0 0]
[ 0 22 0 0 0 0 0 0 0 0]
[ 0 0 26 1 0 0 0 0 0 0]
[ 0 0 0 25 0 0 0 0 0 0]
[ 0 0 0 0 19 0 0 0 0 0]
[ 0 0 0 0 0 27 0 0 0 4]
[ 0 1 0 0 0 0 34 0 0 0]
[ 0 0 0 0 0 0 0 31 0 0]
[ 0 0 0 0 1 0 0 0 24 0]
[ 0 0 0 0 2 1 0 0 2 28]]
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]
n_total_samples = len(y)
n_labeled_points = 40
max_iterations = 5
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
f = plt.figure()
for i in range(max_iterations):
if len(unlabeled_indices) == 0:
print("No unlabeled items left to label.")
break
y_train = np.copy(y)
y_train[unlabeled_indices] = -1
# fit the semi-supervised model on the current labeled/unlabeled split
lp_model = LabelSpreading(gamma=.25, max_iter=20)
lp_model.fit(X, y_train)
predicted_labels = lp_model.transduction_[unlabeled_indices]
true_labels = y[unlabeled_indices]
cm = confusion_matrix(true_labels, predicted_labels,
labels=lp_model.classes_)
print(classification_report(true_labels, predicted_labels))
print("Confusion matrix")
print(cm)
# for more than 5 iterations, visualize the gain only on the first 5
if i < 5:
f.text(.05, (1 - (i + 1) * .183),
"model %d\n\nfit with\n%d labels" %
((i + 1), i * 5 + 10), size=10)
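# (reconstruction; the selection step was lost to the page layout)
# compute the entropy of each transduced label distribution and keep the five
# most uncertain digits among the points that are still unlabeled
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[np.in1d(uncertainty_index, unlabeled_indices)][:5]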
for index, image_index in enumerate(uncertainty_index):
image = images[image_index]
# for more than 5 iterations, visualize the gain only on the first 5
if i < 5:
sub = f.add_subplot(5, 5, index + 1 + (5 * i))
sub.imshow(image, cmap=plt.cm.gray_r, interpolation='none')
sub.set_title("predict: %i\ntrue: %i" % (
lp_model.transduction_[image_index], y[image_index]), size=10)
sub.axis('off')
Perform binary classification using non-linear SVC with RBF kernel. The target to predict is an XOR of the inputs.
The color map illustrates the decision function learned by the SVC.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
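The data generation and model fitting were dropped by the page break; a minimal reconstruction (grid resolution and the NuSVC choice are assumptions) is:
xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))
np.random.seed(0)
X = np.random.randn(300, 2)
Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)
# fit a non-linear SVC with an RBF kernel on the XOR targets
clf = svm.NuSVC(gamma='auto')
clf.fit(X, Y)
# evaluate the decision function on the grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)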
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',
origin='lower', cmap=plt.cm.PuOr_r)
contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,
linestyles='dashed')
plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,
edgecolors='k')
plt.xticks(())
plt.yticks(())
plt.axis([-3, 3, -3, 3])
plt.show()
Plot the maximum margin separating hyperplane within a two-class separable dataset using a Support Vector Machine
classifier with linear kernel.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
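The body of this example did not survive extraction; a compact sketch along the same lines (blob parameters and the large C value are assumptions) is:
# two well-separated blobs of 40 points
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
# a large C gives a hard-margin-like fit
clf = svm.SVC(kernel='linear', C=1000)
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
# highlight the support vectors that define the maximum-margin hyperplane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
            facecolors='none', edgecolors='k')
plt.show()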
Simple usage of Support Vector Machines to classify a sample. It will plot the decision surface and the support vectors.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
def my_kernel(X, Y):  # function name reconstructed; the original def line was lost
    """Custom kernel: k(X, Y) = X M Y.T, with M = [[2, 0], [0, 1]]."""
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Unlike SVC (based on LIBSVM), LinearSVC (based on LIBLINEAR) does not provide the support vectors. This
example demonstrates how to obtain the support vectors in LinearSVC.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
# two separable blobs (data-generation line reconstructed; parameters assumed)
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
plt.figure(figsize=(10, 5))
for i, C in enumerate([1, 100]):
# "hinge" is the standard SVM loss
clf = LinearSVC(C=C, loss="hinge", random_state=42).fit(X, y)
# obtain the support vectors through the decision function
decision_function = clf.decision_function(X)
# we can also calculate the decision function manually
# decision_function = np.dot(X, clf.coef_[0]) + clf.intercept_[0]
support_vector_indices = np.where((2 * y - 1) * decision_function <= 1)[0]
support_vectors = X[support_vector_indices]
plt.subplot(1, 2, i + 1)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
np.linspace(ylim[0], ylim[1], 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
linestyles=['--', '-', '--'])
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=100,
linewidth=1, facecolors='none', edgecolors='k')
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
X, y = make_blobs(random_state=27)
plt.show()
Plot the decision function of a weighted dataset, where the size of the points is proportional to their weights.
The sample weighting rescales the C parameter, which means that the classifier puts more emphasis on getting these
points right. The effect might often be subtle. To emphasize the effect here, we particularly weight outliers, making
the deformation of the decision boundary very visible.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Z = classifier.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# plot the line, the points, and the nearest vectors to the plane
axis.contourf(xx, yy, Z, alpha=0.75, cmap=plt.cm.bone)
axis.scatter(X[:, 0], X[:, 1], c=y, s=100 * sample_weight, alpha=0.9,
cmap=plt.cm.bone, edgecolors='black')
axis.axis('off')
axis.set_title(title)
# we create 20 points
np.random.seed(0)
X = np.r_[np.random.randn(10, 2) + [1, 1], np.random.randn(10, 2)]
y = [1] * 10 + [-1] * 10
sample_weight_last_ten = abs(np.random.randn(len(X)))
sample_weight_constant = np.ones(len(X))
# and bigger weights to some outliers
sample_weight_last_ten[15:] *= 5
clf_no_weights = svm.SVC(gamma=1)
clf_no_weights.fit(X, y)
plt.show()
Find the optimal separating hyperplane using an SVC for classes that are unbalanced.
We first find the separating plane with a plain SVC and then plot (dashed) the separating hyperplane with automatic
correction for the unbalanced classes.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
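The data creation and the unweighted fit were lost to the page layout; a reconstruction with assumed cluster sizes and standard deviations is:
# create two clusters of very different sizes
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2], centers=centers,
                  cluster_std=clusters_std, random_state=0, shuffle=False)
# fit the model without class weighting to get the plain separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)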
# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)
6.28.8 SVM-Kernels
Three different types of SVM-Kernels are displayed below. The polynomial and RBF are especially useful when the
data-points are not linearly separable.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
# figure number
fignum = 1
# plot the line, the points, and the nearest vectors to the plane
plt.figure(fignum, figsize=(4, 3))
plt.clf()
plt.axis('tight')
x_min = -3
x_max = 3
y_min = -3
y_max = 3
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
fignum = fignum + 1
plt.show()
This example shows how to perform univariate feature selection before running an SVC (support vector classifier) to
improve the classification scores. We use the iris dataset (4 features) and add 36 non-informative features. We find
that our model achieves its best performance when around 10% of the features are selected.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# #############################################################################
# Import some data to play with
X, y = load_iris(return_X_y=True)
# Add non-informative features
np.random.seed(0)
X = np.hstack((X, 2 * np.random.random((X.shape[0], 36))))
# #############################################################################
# Create a feature-selection transform, a scaler and an instance of SVM that we
# combine together to have a full-blown estimator
clf = Pipeline([('anova', SelectPercentile(chi2)),
('scaler', StandardScaler()),
('svc', SVC(gamma="auto"))])
# #############################################################################
# Plot the cross-validation score as a function of percentile of features
score_means = list()
score_stds = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)
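The cross-validation loop itself was cut off by the page break; a reconstruction of it, using the pipeline defined above, is:
# evaluate the pipeline for each candidate percentile of selected features
for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    this_scores = cross_val_score(clf, X, y)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())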
6.28.10 Support Vector Regression (SVR) using linear and non-linear kernels
print(__doc__)
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
# #############################################################################
# Generate sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# #############################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
# #############################################################################
# Fit regression model
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_lin = SVR(kernel='linear', C=100, gamma='auto')
svr_poly = SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1,
coef0=1)
# #############################################################################
# Look at the results
lw = 2
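The fitting and plotting loop was dropped in extraction; a compact sketch (colors and labels are arbitrary choices) is:
# fit each SVR on the noisy sine data and draw its prediction
for svr, label, color in zip((svr_rbf, svr_lin, svr_poly),
                             ('RBF', 'Linear', 'Polynomial'),
                             ('m', 'c', 'g')):
    plt.plot(X.ravel(), svr.fit(X, y).predict(X), color=color, lw=lw,
             label='{} model'.format(label))
plt.scatter(X.ravel(), y, color='darkorange', label='data')
plt.legend()
plt.show()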
The plots below illustrate the effect the parameter C has on the separation line. A large value of C basically tells our
model that we do not have that much faith in our data's distribution, and will only consider points close to the line of
separation.
A small value of C includes more/all the observations, allowing the margins to be calculated using all the data in the
area.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
# figure number
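# (reconstruction; the data creation and fit were lost to the page layout --
# the example compares several values of C, only one fit is sketched here)
fignum = 1
# 40 separable points in two clusters
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
Y = [0] * 20 + [1] * 20
# fit a linear SVC and derive the separating line y = a * x - b
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, Y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]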
# plot the parallels to the separating hyperplane that pass through the
# support vectors (margin away from hyperplane in direction
# perpendicular to hyperplane). This is sqrt(1+a^2) away vertically in
# 2-d.
margin = 1 / np.sqrt(np.sum(clf.coef_ ** 2))
yy_down = yy - np.sqrt(1 + a ** 2) * margin
yy_up = yy + np.sqrt(1 + a ** 2) * margin
# plot the line, the points, and the nearest vectors to the plane
plt.figure(fignum, figsize=(4, 3))
plt.clf()
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')
plt.axis('tight')
x_min = -4.8
x_max = 4.2
y_min = -6
y_max = 6
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
fignum = fignum + 1
plt.show()
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm
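The set-up of this novelty-detection example was lost in extraction; a sketch consistent with the error counts printed below (sample sizes and OneClassSVM parameters are assumptions) is:
# grid on which the learned frontier is evaluated
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# 200 regular training observations, 40 new regular and 40 abnormal ones
np.random.seed(0)
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(40, 2))
# fit a one-class SVM and count the errors on each set
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
n_error_train = (clf.predict(X_train) == -1).sum()
n_error_test = (clf.predict(X_test) == -1).sum()
n_error_outliers = (clf.predict(X_outliers) == 1).sum()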
# plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.title("Novelty Detection")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')
s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s,
edgecolors='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s,
edgecolors='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([a.collections[0], b1, b2, c],
["learned frontier", "training observations",
"new regular observations", "new abnormal observations"],
loc="upper left",
prop=matplotlib.font_manager.FontProperties(size=11))
plt.xlabel(
"error train: %d/200 ; errors novel regular: %d/40 ; "
"errors novel abnormal: %d/40"
% (n_error_train, n_error_test, n_error_outliers))
plt.show()
Comparison of different linear SVM classifiers on a 2D projection of the iris dataset. We only consider the first 2
features of this dataset:
• Sepal length
• Sepal width
This example shows how to plot the decision surface for four SVM classifiers with different kernels.
The linear models LinearSVC() and SVC(kernel='linear') yield slightly different decision boundaries.
This can be a consequence of the following differences:
• LinearSVC minimizes the squared hinge loss while SVC minimizes the regular hinge loss.
• LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction while SVC uses the One-vs-One multiclass reduction.
Both linear models have linear decision boundaries (intersecting hyperplanes) while the non-linear kernel models
(polynomial or Gaussian RBF) have more flexible non-linear decision boundaries with shapes that depend on the kind
of kernel and its parameters.
Note: while plotting the decision function of classifiers for toy 2D datasets can help get an intuitive understanding
of their respective expressive power, be aware that those intuitions don't always generalize to more realistic
high-dimensional problems.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
def make_meshgrid(x, y, h=.02):  # signature reconstructed from the docstring below
    """Create a mesh of points to plot in.
    Parameters
    ----------
x: data to base x-axis meshgrid on
y: data to base y-axis meshgrid on
h: stepsize for meshgrid, optional
Returns
-------
xx, yy : ndarray
"""
x_min, x_max = x.min() - 1, x.max() + 1
y_min, y_max = y.min() - 1, y.max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
return xx, yy
def plot_contours(ax, clf, xx, yy, **params):  # signature reconstructed from the docstring below
    """Plot the decision boundaries for a classifier.
    Parameters
    ----------
ax: matplotlib axes object
clf: a classifier
xx: meshgrid ndarray
yy: meshgrid ndarray
params: dictionary of params to pass to contourf, optional
"""
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
out = ax.contourf(xx, yy, Z, **params)
return out
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
models = (svm.SVC(kernel='linear', C=C),
svm.LinearSVC(C=C, max_iter=10000),
svm.SVC(kernel='rbf', gamma=0.7, C=C),
svm.SVC(kernel='poly', degree=3, gamma='auto', C=C))
plt.show()
The following example illustrates the effect of scaling the regularization parameter when using Support Vector
Machines for classification. For SVC classification, we are interested in a risk minimization for the equation:

$$C \sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i) + \Omega(w)$$

where
• $C$ is used to set the amount of regularization
• $\mathcal{L}$ is a loss function of our samples and our model parameters
• $\Omega$ is a penalty function of our model parameters
If we consider the loss function to be the individual error per sample, then the data-fit term, or the sum of the error for
each sample, will increase as we add more samples. The penalization term, however, will not increase.
When using, for example, cross-validation to set the amount of regularization with C, there will be a different number
of samples between the main problem and the smaller problems within the folds of the cross-validation.
Since our loss function is dependent on the number of samples, the latter will influence the selected value of C.
The question that arises is: how do we optimally adjust C to account for the different numbers of training samples?
The figures below are used to illustrate the effect of scaling our C to compensate for the change in the number of
samples, in the case of using an l1 penalty, as well as the l2 penalty.
l1-penalty case
In the l1 case, theory says that prediction consistency (i.e. that under a given hypothesis the estimator learned predicts
as well as a model knowing the true distribution) is not possible because of the bias of the l1 penalty. It does say,
however, that model consistency, in terms of finding the right set of non-zero parameters as well as their signs, can be
achieved by scaling C.
l2-penalty case
The theory says that in order to achieve prediction consistency, the penalty parameter should be kept constant as the
number of samples grow.
Simulations
The two figures below plot the values of C on the x-axis and the corresponding cross-validation scores on the
y-axis, for several different fractions of a generated data-set.
In the l1 penalty case, the cross-validation-error correlates best with the test-error, when scaling our C with the number
of samples, n, which can be seen in the first figure.
For the l2 penalty case, the best result comes from the case where C is not scaled.
Note:
Two separate datasets are used for the two different plots. The reason behind this is that the l1 case works better on
sparse data, while l2 is better suited to the non-sparse case.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import check_random_state
rnd = check_random_state(1)
# set up dataset
n_samples = 100
n_features = 300
plt.legend(loc="best")
plt.show()
This example illustrates the effect of the parameters gamma and C of the Radial Basis Function (RBF) kernel SVM.
Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values
meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of
influence of samples selected by the model as support vectors.
The C parameter trades off correct classification of training examples against maximization of the decision function’s
margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all
training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost
of training accuracy. In other words, C behaves as a regularization parameter in the SVM.
The first plot is a visualization of the decision function for a variety of parameter values on a simplified classification
problem involving only 2 input features and 2 possible target classes (binary classification). Note that this kind of plot
is not possible to do for problems with more features or target classes.
The second plot is a heatmap of the classifier’s cross-validation accuracy as a function of C and gamma. For this
example we explore a relatively large grid for illustration purposes. In practice, a logarithmic grid from $10^{-3}$ to
$10^{3}$ is usually sufficient. If the best parameters lie on the boundaries of the grid, it can be extended in that
direction in a subsequent search.
Note that the heat map plot has a special colorbar with a midpoint value close to the score values of the best performing
models so as to make it easy to tell them apart in the blink of an eye.
The behavior of the model is very sensitive to the gamma parameter. If gamma is too large, the radius of the area of
influence of the support vectors only includes the support vector itself and no amount of regularization with C will be
able to prevent overfitting.
When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data.
The region of influence of any selected support vector would include the whole training set. The resulting model will
behave similarly to a linear model with a set of hyperplanes that separate the centers of high density of any pair of two
classes.
For intermediate values, we can see on the second plot that good models can be found on a diagonal of C and gamma.
Smooth models (lower gamma values) can be made more complex by increasing the importance of classifying each
point correctly (larger C values) hence the diagonal of good performing models.
Finally one can also observe that for some intermediate values of gamma we get equally performing models when C
becomes very large: it is not necessary to regularize by enforcing a larger margin. The radius of the RBF kernel alone
acts as a good structural regularizer. In practice though it might still be interesting to simplify the decision function
with a lower value of C so as to favor models that use less memory and that are faster to predict.
We should also note that small differences in scores result from the random splits of the cross-validation procedure.
Those spurious variations can be smoothed out by increasing the number of CV iterations n_splits at the expense
of compute time. Increasing the number of C_range and gamma_range steps will increase the resolution of
the hyper-parameter heat map.
Out:
The best parameters are {'C': 1.0, 'gamma': 0.1} with a score of 0.97
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
class MidpointNormalize(Normalize):
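    # (reconstruction; the class body was dropped in extraction) map the data
    # range so that `midpoint` lands at the centre of the colormap, making the
    # scores of the best-performing models stand out in the heatmap
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))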
# #############################################################################
# Load and prepare data set
#
# dataset for grid search
iris = load_iris()
X = iris.data
y = iris.target
# Dataset for decision function visualization: we only keep the first two
# features in X and sub-sample the dataset to keep only 2 classes and
# make it a binary classification problem.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_2d = scaler.fit_transform(X_2d)
# #############################################################################
# Train classifiers
#
# For an initial search, a logarithmic grid with basis
# 10 is often helpful. Using a basis of 2, a finer
# tuning can be achieved but at a much higher cost.
# #############################################################################
# Visualization
#
# draw visualization of parameter effects
plt.figure(figsize=(8, 6))
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
for (k, (C, gamma, clf)) in enumerate(classifiers):
# evaluate decision function in a grid
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
scores = grid.cv_results_['mean_test_score'].reshape(len(C_range),
len(gamma_range))
plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.title('Validation accuracy')
plt.show()
A tutorial exercise regarding the use of classification techniques on the Digits dataset.
This exercise is used in the Classification part of the Supervised learning: predicting an output variable from
high-dimensional observations section of the A tutorial on statistical-learning for scientific data processing.
print(__doc__)
from sklearn import datasets, neighbors, linear_model
# load and rescale the digits data (reconstructed; the loading lines were dropped)
X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_digits = X_digits / X_digits.max()
n_samples = len(X_digits)
knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=1000)
print(__doc__)
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
X, y = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores = list()
scores_std = list()
for C in C_s:
svc.C = C
this_scores = cross_val_score(svc, X, y, n_jobs=1)
scores.append(np.mean(this_scores))
scores_std.append(np.std(this_scores))
# Do the plotting
import matplotlib.pyplot as plt
plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
n_sample = len(X)
np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(np.float)
plt.figure()
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired,
edgecolor='k', s=20)
plt.axis('tight')
x_min = X[:, 0].min()
x_max = X[:, 0].max()
y_min = X[:, 1].min()
y_max = X[:, 1].max()
plt.title(kernel)
plt.show()
Out:
Answer to the bonus question: how much can you trust the selection of alpha?
Answer: Not very much since we obtained different alphas for different
subsets of the data and moreover, the scores for these alphas differ
quite substantially.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]
# candidate regularization strengths (grid assumed; the original line was dropped)
alphas = np.logspace(-4, -0.5, 30)
# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)
plt.show()
DictVectorizer
done in 1.040226s at 6.003MB/s
Found 47928 unique terms
import re
import sys
from collections import defaultdict
from time import time
import numpy as np
from sklearn.feature_extraction import DictVectorizer

def n_nonzero_columns(X):
    """Returns the number of non-zero columns in a CSR matrix X."""
    return len(np.unique(X.nonzero()[1]))

def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens. For a more
    principled approach, see CountVectorizer or TfidfVectorizer.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
def token_freqs(doc):
"""Extract a dict mapping tokens from doc to their frequencies."""
freq = defaultdict(int)
for tok in tokens(doc):
freq[tok] += 1
return freq
categories = [
'alt.atheism',
'comp.graphics',
'comp.sys.ibm.pc.hardware',
'misc.forsale',
'rec.autos',
'sci.space',
'talk.religion.misc',
]
# Uncomment the following line to use a larger set (11k+ documents)
# categories = None
print(__doc__)
print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
print(" The default number of features is 2**18.")
print()
try:
n_features = int(sys.argv[1])
except IndexError:
n_features = 2 ** 18
except ValueError:
print("DictVectorizer")
t0 = time()
vectorizer = DictVectorizer()
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % len(vectorizer.get_feature_names()))
print()
This is an example showing how scikit-learn can be used to cluster documents by topics using a bag-of-words
approach. This example uses a scipy.sparse matrix to store the features instead of standard numpy arrays.
Two feature extraction methods can be used in this example:
• TfidfVectorizer uses an in-memory vocabulary (a python dict) to map the most frequent words to feature indices
and hence compute a word occurrence frequency (sparse) matrix. The word frequencies are then reweighted
using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
• HashingVectorizer hashes word occurrences to a fixed dimensional space, possibly with collisions. The word
count vectors are then normalized to each have l2-norm equal to one (projected onto the euclidean unit ball), which
seems to be important for k-means to work in high dimensional space.
HashingVectorizer does not provide IDF weighting as this is a stateless model (the fit method does nothing).
When IDF weighting is needed it can be added by pipelining its output to a TfidfTransformer instance.
Two algorithms are demoed: ordinary k-means and its more scalable cousin minibatch k-means.
Additionally, latent semantic analysis can also be used to reduce dimensionality and discover latent patterns in the
data.
It can be noted that k-means (and minibatch k-means) are very sensitive to feature scaling and that in this case the IDF
weighting helps improve the quality of the clustering by quite a lot as measured against the “ground truth” provided
by the class label assignments of the 20 newsgroups dataset.
This improvement is not visible in the Silhouette Coefficient, which is small for both, as this measure seems to suffer
from the phenomenon called “Concentration of Measure” or “Curse of Dimensionality” for high dimensional datasets
such as text data. Other measures such as V-measure and Adjusted Rand Index are information-theoretic evaluation
scores: since they are only based on cluster assignments rather than distances, they are not affected by the curse of
dimensionality.
Note: as k-means is optimizing a non-convex objective function, it will likely end up in a local optimum. Several runs
with independent random init might be necessary to get a good convergence.
Out:
Usage: plot_document_clustering.py [options]
Options:
-h, --help show this help message and exit
--lsa=N_COMPONENTS Preprocess documents with latent semantic analysis.
--no-minibatch Use ordinary k-means algorithm (in batch mode).
--no-idf Disable Inverse Document Frequency feature weighting.
--use-hashing Use a hashing feature vectorizer
--n-features=N_FEATURES
Maximum number of features (dimensions) to extract
from text.
--verbose Print progress reports inside k-means algorithm.
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 documents
4 categories
verbose=False)
done in 0.056s
Homogeneity: 0.412
Completeness: 0.491
V-measure: 0.448
Adjusted Rand-Index: 0.289
Silhouette Coefficient: 0.006
import logging
from optparse import OptionParser
import sys
from time import time
import numpy as np
def is_interactive():
return not hasattr(sys.modules['__main__'], '__file__')
# #############################################################################
# Load some categories from the training set
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None
labels = dataset.target
true_k = np.unique(labels).shape[0]
if opts.n_components:
print("Performing dimensionality reduction using LSA")
t0 = time()
# Vectorizer results are normalized, which makes KMeans behave as
# spherical k-means for better results. Since LSA/SVD results are
# not normalized, we have to redo the normalization.
svd = TruncatedSVD(opts.n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(
int(explained_variance * 100)))
print()
# #############################################################################
# Do the actual clustering
if opts.minibatch:
km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
verbose=opts.verbose)
print()
if not opts.use_hashing:
print("Top terms per cluster:")
if opts.n_components:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
This is an example showing how scikit-learn can be used to classify documents by topics using a bag-of-words
approach. This example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can
efficiently handle sparse matrices.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.
op = OptionParser()
op.add_option("--report",
action="store_true", dest="print_report",
help="Print a detailed classification report.")
op.add_option("--chi2_select",
action="store", type="int", dest="select_chi2",
help="Select some number of features using a chi-squared test")
op.add_option("--confusion_matrix",
action="store_true", dest="print_cm",
help="Print the confusion matrix.")
op.add_option("--top10",
action="store_true", dest="print_top10",
help="Print ten most discriminative terms per class"
" for every classifier.")
op.add_option("--all_categories",
action="store_true", dest="all_categories",
help="Whether to use all categories or not.")
op.add_option("--use_hashing",
action="store_true",
help="Use a hashing vectorizer.")
op.add_option("--n_features",
action="store", type=int, default=2 ** 16,
help="n_features when using the hashing vectorizer.")
op.add_option("--filtered",
action="store_true",
help="Remove newsgroup information that is easily overfit: "
"headers, signatures, and quoting.")
def is_interactive():
return not hasattr(sys.modules['__main__'], '__file__')
print(__doc__)
op.print_help()
print()
Out:
Usage: plot_document_classification_20newsgroups.py [options]
Let’s load data from the newsgroups dataset, which comprises around 18000 newsgroup posts on 20 topics split into two subsets: one for training (or development) and the other for testing (or for performance evaluation).
if opts.all_categories:
categories = None
else:
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
if opts.filtered:
remove = ('headers', 'footers', 'quotes')
else:
remove = ()
def size_mb(docs):
return sum(len(s.encode('utf-8')) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
if opts.select_chi2:
print("Extracting %d best features by a chi-squared test" %
opts.select_chi2)
t0 = time()
ch2 = SelectKBest(chi2, k=opts.select_chi2)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
if feature_names:
# keep selected feature names
feature_names = [feature_names[i] for i
in ch2.get_support(indices=True)]
print("done in %fs" % (time() - t0))
print()
def trim(s):
"""Trim string to fit on terminal (assuming 80-column display)"""
return s if len(s) <= 80 else s[:77] + "..."
Out:
Extracting features from the test data using the same vectorizer
done in 0.264800s at 10.829MB/s
n_samples: 1353, n_features: 33809
Benchmark classifiers
We train and test the datasets with 15 different classification models and get performance results for each model.
def benchmark(clf):
print('_' * 80)
print("Training: ")
print(clf)
t0 = time()
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)
if hasattr(clf, 'coef_'):
print("dimensionality: %d" % clf.coef_.shape[1])
print("density: %f" % density(clf.coef_))
if opts.print_report:
print("classification report:")
print(metrics.classification_report(y_test, pred,
target_names=target_names))
if opts.print_cm:
print("confusion matrix:")
print(metrics.confusion_matrix(y_test, pred))
print()
clf_descr = str(clf).split('(')[0]
return clf_descr, score, train_time, test_time
results = []
for clf, name in (
(RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
(Perceptron(max_iter=50), "Perceptron"),
(PassiveAggressiveClassifier(max_iter=50),
"Passive-Aggressive"),
(KNeighborsClassifier(n_neighbors=10), "kNN"),
(RandomForestClassifier(), "Random forest")):
print('=' * 80)
print(name)
results.append(benchmark(clf))
Out:
================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:558: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the
================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.016s
test time: 0.002s
accuracy: 0.888
dimensionality: 33809
density: 0.255302
================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.028s
test time: 0.003s
accuracy: 0.902
dimensionality: 33809
density: 0.692841
================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.316s
test time: 0.074s
accuracy: 0.837
================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.075s
test time: 0.001s
accuracy: 0.900
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.022s
test time: 0.002s
accuracy: 0.899
dimensionality: 33809
density: 0.569944
================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.214s
test time: 0.001s
accuracy: 0.873
dimensionality: 33809
density: 0.005553
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time: 0.002s
accuracy: 0.888
dimensionality: 33809
density: 0.022982
================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.007s
test time: 0.002s
accuracy: 0.855
================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time: 0.001s
accuracy: 0.899
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.004s
test time: 0.003s
accuracy: 0.884
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time: 0.001s
accuracy: 0.911
dimensionality: 33809
density: 1.000000
================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Add plots
The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.
import matplotlib.pyplot as plt
indices = np.arange(len(results))
# unpack the benchmark results and normalize the timings for plotting
results = [[x[i] for x in results] for i in range(4)]
clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)
plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, .2, label="score", color='navy')
plt.barh(indices + .3, training_time, .2, label="training time",
color='c')
plt.barh(indices + .6, test_time, .2, label="test time", color='darkorange')
plt.yticks(())
plt.legend(loc='best')
plt.subplots_adjust(left=.25)
plt.subplots_adjust(top=.95)
plt.subplots_adjust(bottom=.05)
plt.show()
SEVEN
API REFERENCE
This is the class and function reference of scikit-learn. Please refer to the full user guide for further details, as the class
and function raw specifications may not be enough to give full guidelines on their uses. For reference on concepts
repeated across the API, see Glossary of Common Terms and API Elements.
sklearn.base.BaseEstimator
class sklearn.base.BaseEstimator
Base class for all estimators in scikit-learn
Notes
All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit
keyword arguments (no *args or **kwargs).
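For illustration, a minimal sketch of an estimator following this convention (the class and its parameters are hypothetical, not part of scikit-learn):
from sklearn.base import BaseEstimator

class TemplateEstimator(BaseEstimator):
    # Every constructor parameter is an explicit keyword argument and is
    # stored unmodified, so get_params/set_params work out of the box.
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

est = TemplateEstimator(alpha=0.5)
est.get_params()        # {'alpha': 0.5, 'fit_intercept': True}
est.set_params(alpha=2.0)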
Methods
sklearn.base.BiclusterMixin
class sklearn.base.BiclusterMixin
Mixin class for all bicluster estimators in scikit-learn
Attributes
biclusters_ Convenient way to get row and column indicators together.
Methods
Notes
Works with sparse matrices. Only works if rows_ and columns_ attributes exist.
sklearn.base.ClassifierMixin
class sklearn.base.ClassifierMixin
Mixin class for all classifiers in scikit-learn.
Methods
score(self, X, y[, sample_weight]) Return the mean accuracy on the given test data and labels.
sklearn.base.ClusterMixin
class sklearn.base.ClusterMixin
Mixin class for all cluster estimators in scikit-learn.
Methods
sklearn.base.DensityMixin
class sklearn.base.DensityMixin
Mixin class for all density estimators in scikit-learn.
Methods
score(self, X[, y]) Return the score of the model on the data X
sklearn.base.RegressorMixin
class sklearn.base.RegressorMixin
Mixin class for all regression estimators in scikit-learn.
Methods
Notes
sklearn.base.TransformerMixin
class sklearn.base.TransformerMixin
Mixin class for all transformers in scikit-learn.
Methods
7.1.2 Functions
sklearn.base.clone
sklearn.base.clone(estimator, safe=True)
Constructs a new estimator with the same parameters.
Clone does a deep copy of the model in an estimator without actually copying attached data. It yields a new
estimator with the same parameters that has not been fit on any data.
Parameters
estimator [estimator object, or list, tuple or set of objects] The estimator or group of estimators
to be cloned
safe [boolean, optional] If safe is false, clone will fall back to a deep copy on objects that are
not estimators.
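A minimal usage sketch (the estimator chosen here is arbitrary):
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

original = LogisticRegression(C=10.0)
copy = clone(original)           # a new, unfitted estimator
copy.get_params()['C']           # 10.0 -- same parameters as the original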
sklearn.base.is_classifier
sklearn.base.is_classifier(estimator)
Return True if the given estimator is (probably) a classifier.
Parameters
estimator [object] Estimator object to test.
Returns
out [bool] True if estimator is a classifier and False otherwise.
sklearn.base.is_regressor
sklearn.base.is_regressor(estimator)
Return True if the given estimator is (probably) a regressor.
Parameters
estimator [object] Estimator object to test.
Returns
out [bool] True if estimator is a regressor and False otherwise.
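A small sketch of both helpers (estimators chosen arbitrarily):
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import LinearRegression, LogisticRegression

is_classifier(LogisticRegression())   # True
is_regressor(LogisticRegression())    # False
is_regressor(LinearRegression())      # True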
sklearn.config_context
sklearn.config_context(**new_config)
Context manager for global scikit-learn configuration
Parameters
assume_finite [bool, optional] If True, validation for finiteness will be skipped, saving time, but
leading to potential crashes. If False, validation for finiteness will be performed, avoiding
error. Global default: False.
working_memory [int, optional] If set, scikit-learn will attempt to limit the size of temporary
arrays to this number of MiB (per job when parallelised), often saving both computation
time and memory on expensive operations that can be performed in chunks. Global default:
1024.
print_changed_only [bool, optional] If True, only the parameters that were set to non-default
values will be printed when printing an estimator. For example, print(SVC()) while
True will only print ‘SVC()’ while the default behaviour would be to print ‘SVC(C=1.0,
cache_size=200, . . . )’ with all the non-changed parameters.
See also:
Notes
All settings, not just those presently modified, will be returned to their previous values when the context manager
is exited. This is not thread-safe.
Examples
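A minimal sketch of temporarily changing the configuration; all settings revert when the block exits:
import sklearn

with sklearn.config_context(assume_finite=True):
    # finiteness validation is skipped inside this block
    pass
sklearn.get_config()['assume_finite']   # back to the global default (False)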
sklearn.get_config
sklearn.get_config()
Retrieve current values for configuration set by set_config
Returns
config [dict] Keys are parameter names that can be passed to set_config.
See also:
sklearn.set_config
sklearn.show_versions
sklearn.show_versions()
Print useful debugging information
7.2.1 sklearn.calibration.CalibratedClassifierCV
class sklearn.calibration.CalibratedClassifierCV(base_estimator=None,
method=’sigmoid’, cv=None)
Probability calibration with isotonic regression or sigmoid.
See glossary entry for cross-validation estimator.
With this class, the base_estimator is fit on the train set of the cross-validation generator and the test set is used
for calibration. The probabilities for each of the folds are then averaged for prediction. In case that cv=”prefit”
is passed to __init__, it is assumed that base_estimator has been fitted already and all data is used for calibration.
Note that data for fitting the classifier and for calibrating it must be disjoint.
Read more in the User Guide.
Parameters
base_estimator [instance BaseEstimator] The classifier whose output decision function needs
to be calibrated to offer more accurate predict_proba outputs. If cv=prefit, the classifier must
have been fit already on data.
method [‘sigmoid’ or ‘isotonic’] The method to use for calibration. Can be ‘sigmoid’ which
corresponds to Platt’s method or ‘isotonic’ which is a non-parametric approach. It is not
advised to use isotonic calibration with too few calibration samples (<<1000) since it
tends to overfit. Use sigmoids (Platt’s calibration) in this case.
cv [integer, cross-validation generator, iterable or “prefit”, optional] Determines the cross-
validation splitting strategy. Possible inputs for cv are:
• None, to use the default 5-fold cross-validation,
• integer, to specify the number of folds.
• CV splitter,
• An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if y is binary or multiclass, sklearn.model_selection.
StratifiedKFold is used. If y is neither binary nor multiclass, sklearn.
model_selection.KFold is used.
Refer User Guide for the various cross-validation strategies that can be used here.
If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is
used for calibration.
Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
Attributes
classes_ [array, shape (n_classes)] The class labels.
calibrated_classifiers_ [list (len() equal to cv or 1 if cv == "prefit")] The list of calibrated classifiers, one for each cross-validation fold, each fitted on all but the validation fold and calibrated on the validation fold.
References
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict the target of new samples. Can be different from the prediction of the uncalibrated classifier.
Parameters
X [array-like, shape (n_samples, n_features)] The samples.
Returns
C [array, shape (n_samples,)] The predicted class.
predict_proba(self, X)
Posterior probabilities of classification
This function returns posterior probabilities of classification according to each class on an array of test
vectors X.
Parameters
X [array-like, shape (n_samples, n_features)] The samples.
Returns
C [array, shape (n_samples, n_classes)] The predicted probas.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
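A minimal usage sketch for CalibratedClassifierCV, assuming a synthetic dataset from make_classification:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)
base = LinearSVC(C=1.0)                       # provides decision_function only
calibrated = CalibratedClassifierCV(base_estimator=base, method='sigmoid', cv=5)
calibrated.fit(X, y)
calibrated.predict_proba(X[:3])               # calibrated class probabilities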
7.2.2 sklearn.calibration.calibration_curve
References
Alexandru Niculescu-Mizil and Rich Caruana (2005) Predicting Good Probabilities With Supervised Learning,
in Proceedings of the 22nd International Conference on Machine Learning (ICML). See section 4 (Qualitative
Analysis of Predictions).
7.3.1 Classes
sklearn.cluster.AffinityPropagation
Notes
References
Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science Feb. 2007
Examples
Methods
fit(self, X[, y]) Fit the clustering from features, or affinity matrix.
fit_predict(self, X[, y]) Fit the clustering from features or affinity matrix, and return cluster labels.
get_params(self[, deep]) Get parameters for this estimator.
predict(self, X) Predict the closest cluster each sample in X belongs to.
set_params(self, \*\*params) Set the parameters of this estimator.
sklearn.cluster.AgglomerativeClustering
connectivity [array-like or callable, default=None] Connectivity matrix. Defines for each sam-
ple the neighboring samples following a given structure of the data. This can be a connec-
tivity matrix itself or a callable that transforms the data into a connectivity matrix, such as
derived from kneighbors_graph. Default is None, i.e, the hierarchical clustering algorithm
is unstructured.
compute_full_tree [‘auto’ or bool, default=’auto’] Stop early the construction of the tree at
n_clusters. This is useful to decrease computation time if the number of clusters is not small
compared to the number of samples. This option is useful only when specifying a connectiv-
ity matrix. Note also that when varying the number of clusters and using caching, it may be
advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None, or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
linkage [{“ward”, “complete”, “average”, “single”}, default=”ward”] Which linkage criterion
to use. The linkage criterion determines which distance to use between sets of observation.
The algorithm will merge the pairs of cluster that minimize this criterion.
• ward minimizes the variance of the clusters being merged.
• average uses the average of the distances of each observation of the two sets.
• complete or maximum linkage uses the maximum distances between all observations of
the two sets.
• single uses the minimum of the distances between all observations of the two sets.
distance_threshold [float, default=None] The linkage distance threshold above which,
clusters will not be merged. If not None, n_clusters must be None and
compute_full_tree must be True.
New in version 0.21.
Attributes
n_clusters_ [int] The number of clusters found by the algorithm. If
distance_threshold=None, it will be equal to the given n_clusters.
labels_ [ndarray of shape (n_samples)] cluster labels for each point
n_leaves_ [int] Number of leaves in the hierarchical tree.
n_connected_components_ [int] The estimated number of connected components in the graph.
children_ [array-like of shape (n_samples-1, 2)] The children of each non-leaf node. Val-
ues less than n_samples correspond to leaves of the tree which are the original sam-
ples. A node i greater than or equal to n_samples is a non-leaf node and has children
children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and
children[i][1] are merged to form node n_samples + i
Examples
Methods
fit(self, X[, y]) Fit the hierarchical clustering from features, or distance matrix.
fit_predict(self, X[, y]) Fit the hierarchical clustering from features or distance matrix, and return cluster labels.
get_params(self[, deep]) Get parameters for this estimator.
set_params(self, \*\*params) Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
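A minimal sketch of AgglomerativeClustering on toy data (values chosen for illustration only):
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
clustering.labels_        # cluster label for each of the six samples
clustering.n_leaves_      # 6 leaves in the hierarchical tree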
sklearn.cluster.Birch
MiniBatchKMeans Alternative implementation that does incremental updates of the centers’ positions using
mini-batches.
Notes
The tree data structure consists of nodes with each node consisting of a number of subclusters. The maximum
number of subclusters in a node is determined by the branching factor. Each subcluster maintains a linear sum,
squared sum and the number of samples in that subcluster. In addition, each subcluster can also have a node as
its child, if the subcluster is not a member of a leaf node.
For a new point entering the root, it is merged with the subcluster closest to it and the linear sum, squared sum
and the number of samples of that subcluster are updated. This is done recursively till the properties of the leaf
node are updated.
References
• Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for large databases. https://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
• Roberto Perdisci. JBirch - Java implementation of BIRCH clustering algorithm. https://code.google.com/archive/p/jbirch
Examples
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
partial_fit(self, X=None, y=None)
Online learning. Prevents rebuilding of CFTree from scratch.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features), None] Input data. If X is not
provided, only the global clustering step is done.
y [Ignored] Not used, present here for API consistency by convention.
Returns
self Fitted estimator.
predict(self, X)
Predict data using the centroids_ of subclusters.
Avoid computation of the row norms of X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Input data.
Returns
labels [ndarray, shape(n_samples)] Labelled data.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform X into subcluster centroids dimension.
Each dimension represents the distance from the sample point to each cluster centroid.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Input data.
Returns
X_trans [{array-like, sparse matrix}, shape (n_samples, n_clusters)] Transformed data.
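A minimal sketch of incremental fitting with partial_fit; the random mini-batches are used only for illustration:
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
brc = Birch(n_clusters=3)
for batch in np.array_split(rng.rand(300, 2), 10):
    brc.partial_fit(batch)        # grow/update the CF tree batch by batch
brc.predict(rng.rand(5, 2))       # assign new points using the subcluster centroids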
sklearn.cluster.DBSCAN
labels_ [array, shape = [n_samples]] Cluster labels for each point in the dataset given to fit().
Noisy samples are given the label -1.
See also:
OPTICS A similar clustering at multiple values of eps. Our implementation is optimized for memory usage.
Notes
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large
Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how
you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
Examples
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
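A minimal sketch of DBSCAN on toy data (exact labels depend on the data; -1 marks noise):
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
db = DBSCAN(eps=3, min_samples=2).fit(X)
db.labels_        # e.g. array([0, 0, 0, 1, 1, -1])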
sklearn.cluster.FeatureAgglomeration
Examples
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, Xred)
Inverse the transformation. Return a vector of size nb_features with the values of Xred assigned to each
group of features
Parameters
Xred [array-like of shape (n_samples, n_clusters) or (n_clusters,)] The values to be assigned to each cluster of features.
Returns
X [array, shape=[n_samples, n_features] or [n_features]] An array in the original feature space, with the value of each cluster in Xred assigned to every feature of that cluster.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform a new matrix using the built clustering
Parameters
X [array-like of shape (n_samples, n_features) or (n_samples,)] A M by N array of M ob-
servations in N dimensions or a length M array of M one-dimensional observations.
Returns
Y [array, shape = [n_samples, n_clusters] or [n_clusters]] The pooled values for each feature
cluster.
• Feature agglomeration
• Feature agglomeration vs. univariate selection
sklearn.cluster.KMeans
MiniBatchKMeans Alternative online implementation that does incremental updates of the centers positions
using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much
faster than the default batch implementation.
Notes
Examples
Methods
Returns
X_new [array, shape [n_samples, k]] X transformed in the new space.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X, sample_weight=None)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value re-
turned by predict is the index of the closest code in the code book.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] New data to predict.
sample_weight [array-like, shape (n_samples,), optional] The weights for each observation
in X. If None, all observations are assigned equal weight (default: None).
Returns
labels [array, shape [n_samples,]] Index of the cluster each sample belongs to.
score(self, X, y=None, sample_weight=None)
Opposite of the value of X on the K-means objective.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] New data.
y [Ignored] Not used, present here for API consistency by convention.
sample_weight [array-like, shape (n_samples,), optional] The weights for each observation
in X. If None, all observations are assigned equal weight (default: None).
Returns
score [float] Opposite of the value of X on the K-means objective.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the
array returned by transform will typically be dense.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] New data to transform.
Returns
X_new [array, shape [n_samples, k]] X transformed in the new space.
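A minimal sketch of KMeans on toy data with two obvious groups:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, random_state=0).fit(X)
km.labels_                      # cluster index per training sample
km.cluster_centers_             # the "code book" used by predict
km.predict([[0, 0], [12, 3]])   # closest center for new samples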
sklearn.cluster.MiniBatchKMeans
KMeans The classic implementation of the clustering method based on the Lloyd’s algorithm. It consumes the
whole set of input data at each iteration.
Notes
See https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Examples
Methods
fit(self, X, y=None, sample_weight=None)
Compute the centroids on X by chunking it into mini-batches.
Parameters
X [array-like or sparse matrix, shape=(n_samples, n_features)] Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
y [Ignored] Not used, present here for API consistency by convention.
sample_weight [array-like, shape (n_samples,), optional] The weights for each observation
in X. If None, all observations are assigned equal weight (default: None).
Returns
self
fit_predict(self, X, y=None, sample_weight=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] New data to transform.
y [Ignored] Not used, present here for API consistency by convention.
sample_weight [array-like, shape (n_samples,), optional] The weights for each observation
in X. If None, all observations are assigned equal weight (default: None).
Returns
labels [array, shape [n_samples,]] Index of the cluster each sample belongs to.
fit_transform(self, X, y=None, sample_weight=None)
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] New data to transform.
y [Ignored] Not used, present here for API consistency by convention.
sample_weight [array-like, shape (n_samples,), optional] The weights for each observation
in X. If None, all observations are assigned equal weight (default: None).
Returns
X_new [array, shape [n_samples, k]] X transformed in the new space.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
partial_fit(self, X, y=None, sample_weight=None)
Update k means estimate on a single mini-batch X.
Parameters
X [array-like of shape (n_samples, n_features)] Coordinates of the data points to cluster. It
must be noted that X will be copied if it is not C-contiguous.
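A minimal sketch of updating the estimate mini-batch by mini-batch (random data used only for illustration):
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0)
for _ in range(10):
    mbk.partial_fit(rng.rand(100, 2))   # one centroid update per mini-batch
mbk.cluster_centers_
mbk.predict(rng.rand(5, 2))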
sklearn.cluster.MeanShift
n_jobs [int or None, optional (default=None)] The number of jobs to use for the computation.
This works by computing each of the n_init runs in parallel.
None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.
max_iter [int, default=300] Maximum number of iterations per seed point before the clustering operation terminates (for that seed point), if it has not converged yet.
New in version 0.22.
Attributes
cluster_centers_ [array, [n_clusters, n_features]] Coordinates of cluster centers.
labels_ : Labels of each point.
n_iter_ [int] Maximum number of iterations performed on each seed.
New in version 0.22.
Notes
Scalability:
Because this implementation uses a flat kernel and a Ball Tree to look up members of each kernel, the complexity
will tend towards O(T*n*log(n)) in lower dimensions, with n the number of samples and T the number of points.
In higher dimensions the complexity will tend towards O(T*n^2).
Scalability can be boosted by using fewer seeds, for example by using a higher value of min_bin_freq in the
get_bin_seeds function.
Note that the estimate_bandwidth function is much less scalable than the mean shift algorithm and will be the
bottleneck if it is used.
References
Dorin Comaniciu and Peter Meer, “Mean Shift: A robust approach toward feature space analysis”. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence. 2002. pp. 603-619.
Examples
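A minimal sketch of MeanShift on toy data; the bandwidth is set by hand here, though estimate_bandwidth could be used instead:
import numpy as np
from sklearn.cluster import MeanShift

X = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]])
ms = MeanShift(bandwidth=2).fit(X)
ms.labels_
ms.cluster_centers_
ms.predict([[0, 0], [5, 5]])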
Methods
sklearn.cluster.OPTICS
reachability_ [array, shape (n_samples,)] Reachability distances per sample, indexed by object
order. Use clust.reachability_[clust.ordering_] to access in cluster order.
ordering_ [array, shape (n_samples,)] The cluster ordered list of sample indices.
core_distances_ [array, shape (n_samples,)] Distance at which each sample becomes a core
point, indexed by object order. Points which will never be core have a distance of inf. Use
clust.core_distances_[clust.ordering_] to access in cluster order.
predecessor_ [array, shape (n_samples,)] Point that a sample was reached from, indexed by
object order. Seed points have a predecessor of -1.
cluster_hierarchy_ [array, shape (n_clusters, 2)] The list of clusters in the form of [start,
end] in each row, with all indices inclusive. The clusters are ordered according
to (end, -start) (ascending) so that larger clusters encompassing smaller clusters
come after those smaller ones. Since labels_ does not reflect the hierarchy, usually
len(cluster_hierarchy_) > np.unique(optics.labels_). Please also
note that these indices are of the ordering_, i.e. X[ordering_][start:end +
1] form a cluster. Only available when cluster_method='xi'.
See also:
DBSCAN A similar clustering for a specified neighborhood radius (eps). Our implementation is optimized for
runtime.
References
[R2c55e37003fe-1], [R2c55e37003fe-2]
Examples
Methods
fit(self, X, y=None)
Perform OPTICS clustering.
Extracts an ordered list of points and reachability distances, and performs initial clustering using the max_eps distance specified at OPTICS object instantiation.
Parameters
X [array, shape (n_samples, n_features), or (n_samples, n_samples) if met-
ric=’precomputed’] A feature array, or array of distances between samples if met-
ric=’precomputed’.
y [ignored] Ignored.
Returns
self [instance of OPTICS] The instance.
fit_predict(self, X, y=None)
Perform clustering on X and returns cluster labels.
Parameters
X [ndarray, shape (n_samples, n_features)] Input data.
y [Ignored] Not used, present for API consistency by convention.
Returns
labels [ndarray, shape (n_samples,)] Cluster labels.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
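A minimal sketch of OPTICS on toy data (min_samples kept small because the data set is tiny):
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 5], [3, 6], [8, 7], [8, 8], [7, 3]])
clust = OPTICS(min_samples=2).fit(X)
clust.labels_
clust.reachability_[clust.ordering_]   # reachability distances in cluster order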
sklearn.cluster.SpectralClustering
np.exp(-gamma * d(X,X) ** 2)
Notes
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements and high values mean very dissimilar elements, it can be transformed into a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- dist_matrix ** 2 / (2. * delta ** 2))
where delta is a free parameter representing the width of the Gaussian kernel.
Another alternative is to take a symmetric version of the k nearest neighbors connectivity matrix of the points.
If the pyamg package is installed, it is used: this greatly speeds up computation.
References
• Normalized cuts and image segmentation, 2000. Jianbo Shi, Jitendra Malik. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
• A Tutorial on Spectral Clustering, 2007. Ulrike von Luxburg. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
Examples
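A minimal sketch of the transformation described in the Notes: a distance matrix is turned into an affinity matrix with a Gaussian kernel (delta is an assumed free parameter) and then clustered with affinity='precomputed':
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.RandomState(0)
X = rng.rand(20, 2)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
delta = 1.0                                                      # kernel width
affinity = np.exp(-dist ** 2 / (2. * delta ** 2))
sc = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0)
labels = sc.fit_predict(affinity)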
Methods
sklearn.cluster.SpectralBiclustering
References
• Kluger, Yuval, et. al., 2003. Spectral biclustering of microarray data: coclustering genes and conditions.
Examples
Methods
row_ind [np.array, dtype=np.intp] Indices of rows in the dataset that belong to the bicluster.
col_ind [np.array, dtype=np.intp] Indices of columns in the dataset that belong to the biclus-
ter.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_shape(self, i)
Shape of the i’th bicluster.
Parameters
i [int] The index of the cluster.
Returns
shape [(int, int)] Number of rows and columns (resp.) in the bicluster.
get_submatrix(self, i, data)
Return the submatrix corresponding to bicluster i.
Parameters
i [int] The index of the cluster.
data [array] The data.
Returns
submatrix [array] The submatrix corresponding to bicluster i.
Notes
Works with sparse matrices. Only works if rows_ and columns_ attributes exist.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.cluster.SpectralCoclustering
References
• Dhillon, Inderjit S, 2001. Co-clustering documents and words using bipartite spectral graph partitioning.
Examples
Methods
Returns
row_ind [np.array, dtype=np.intp] Indices of rows in the dataset that belong to the bicluster.
col_ind [np.array, dtype=np.intp] Indices of columns in the dataset that belong to the biclus-
ter.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_shape(self, i)
Shape of the i’th bicluster.
Parameters
i [int] The index of the cluster.
Returns
shape [(int, int)] Number of rows and columns (resp.) in the bicluster.
get_submatrix(self, i, data)
Return the submatrix corresponding to bicluster i.
Parameters
i [int] The index of the cluster.
data [array] The data.
Returns
submatrix [array] The submatrix corresponding to bicluster i.
Notes
Works with sparse matrices. Only works if rows_ and columns_ attributes exist.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.3.2 Functions
sklearn.cluster.affinity_propagation
return_n_iter [bool, default False] Whether or not to return the number of iterations.
Returns
cluster_centers_indices [array, shape (n_clusters,)] index of clusters centers
labels [array, shape (n_samples,)] cluster labels for each point
n_iter [int] number of iterations run. Returned only if return_n_iter is set to True.
Notes
References
Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science Feb. 2007
sklearn.cluster.cluster_optics_dbscan
sklearn.cluster.cluster_optics_xi
sklearn.cluster.compute_optics_graph
min_samples [int > 1 or float between 0 and 1] The number of samples in a neighborhood for a
point to be considered as a core point. Expressed as an absolute number or a fraction of the
number of samples (rounded to be at least 2).
max_eps [float, optional (default=np.inf)] The maximum distance between two samples for one
to be considered as in the neighborhood of the other. Default value of np.inf will identify
clusters across all scales; reducing max_eps will result in shorter run times.
metric [string or callable, optional (default=’minkowski’)] Metric to use for distance computa-
tion. Any metric from scikit-learn or scipy.spatial.distance can be used.
If metric is a callable function, it is called on each pair of instances (rows) and the resulting
value recorded. The callable should take two arrays as input and return one value indicating
the distance between them. This works for Scipy’s metrics, but is less efficient than passing
the metric name as a string. If metric is “precomputed”, X is assumed to be a distance matrix
and must be square.
Valid values for metric are:
• from scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
• from scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’,
‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘rus-
sellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for scipy.spatial.distance for details on these metrics.
p [integer, optional (default=2)] Parameter for the Minkowski metric from sklearn.
metrics.pairwise_distances. When p = 1, this is equivalent to using manhat-
tan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance
(l_p) is used.
metric_params [dict, optional (default=None)] Additional keyword arguments for the metric
function.
algorithm [{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional] Algorithm used to compute the
nearest neighbors:
• ‘ball_tree’ will use BallTree
• ‘kd_tree’ will use KDTree
• ‘brute’ will use a brute-force search.
• ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed
to fit method. (default)
Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size [int, optional (default=30)] Leaf size passed to BallTree or KDTree. This can
affect the speed of the construction and query, as well as the memory required to store the
tree. The optimal value depends on the nature of the problem.
n_jobs [int or None, optional (default=None)] The number of parallel jobs to run for neighbors
search. None means 1 unless in a joblib.parallel_backend context. -1 means
using all processors. See Glossary for more details.
Returns
ordering_ [array, shape (n_samples,)] The cluster ordered list of sample indices.
core_distances_ [array, shape (n_samples,)] Distance at which each sample becomes a core
point, indexed by object order. Points which will never be core have a distance of inf. Use
clust.core_distances_[clust.ordering_] to access in cluster order.
reachability_ [array, shape (n_samples,)] Reachability distances per sample, indexed by object
order. Use clust.reachability_[clust.ordering_] to access in cluster order.
predecessor_ [array, shape (n_samples,)] Point that a sample was reached from, indexed by
object order. Seed points have a predecessor of -1.
References
[1]
sklearn.cluster.dbscan
n_jobs [int or None, optional (default=None)] The number of parallel jobs to run for neighbors
search. None means 1 unless in a joblib.parallel_backend context. -1 means
using all processors. See Glossary for more details.
Returns
core_samples [array [n_core_samples]] Indices of core samples.
labels [array [n_samples]] Cluster labels for each point. Noisy samples are given the label -1.
See also:
Notes
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large
Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how
you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
sklearn.cluster.estimate_bandwidth
n_samples [int, optional] The number of samples to use. If not given, all samples are used.
random_state [int, RandomState instance or None (default)] The generator used to randomly
select the samples from input points for bandwidth estimation. Use an int to make the
randomness deterministic. See Glossary.
n_jobs [int or None, optional (default=None)] The number of parallel jobs to run for neighbors
search. None means 1 unless in a joblib.parallel_backend context. -1 means
using all processors. See Glossary for more details.
Returns
bandwidth [float] The bandwidth parameter.
sklearn.cluster.k_means
sklearn.cluster.mean_shift
Notes
sklearn.cluster.spectral_clustering
Parameters
affinity [array-like or sparse matrix, shape: (n_samples, n_samples)] The affinity matrix de-
scribing the relationship of the samples to embed. Must be symmetric.
Possible examples:
• adjacency matrix of a graph,
• heat kernel of the pairwise distance matrix of the samples,
• symmetric k-nearest neighbours connectivity matrix of the samples.
n_clusters [integer, optional] Number of clusters to extract.
n_components [integer, optional, default is n_clusters] Number of eigen vectors to use for the
spectral embedding
eigen_solver [{None, ‘arpack’, ‘lobpcg’, or ‘amg’}] The eigenvalue decomposition strategy to
use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems,
but may also lead to instabilities
random_state [int, RandomState instance or None (default)] A pseudo random number gener-
ator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver
== ‘amg’ and by the K-Means initialization. Use an int to make the randomness determin-
istic. See Glossary.
n_init [int, optional, default: 10] Number of time the k-means algorithm will be run with dif-
ferent centroid seeds. The final results will be the best output of n_init consecutive runs in
terms of inertia.
eigen_tol [float, optional, default: 0.0] Stopping criterion for eigendecomposition of the Lapla-
cian matrix when using arpack eigen_solver.
assign_labels [{‘kmeans’, ‘discretize’}, default: ‘kmeans’] The strategy to use to assign labels
in the embedding space. There are two ways to assign labels after the laplacian embedding.
k-means can be applied and is a popular choice. But it can also be sensitive to initialization.
Discretization is another approach which is less sensitive to random initialization. See the
‘Multiclass spectral clustering’ paper referenced below for more details on the discretization
approach.
Returns
labels [array of integers, shape: n_samples] The labels of the clusters.
Notes
The graph should contain only one connected component; otherwise the results make little sense.
This algorithm solves the normalized cut for k=2: it is a normalized spectral clustering.
References
• Normalized cuts and image segmentation, 2000. Jianbo Shi, Jitendra Malik. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
• A Tutorial on Spectral Clustering, 2007. Ulrike von Luxburg. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
• Multiclass spectral clustering, 2003. Stella X. Yu, Jianbo Shi. https://www1.icsi.berkeley.edu/~stellayu/publication/doc/2003kwayICCV.pdf
sklearn.cluster.ward_tree
7.4.1 sklearn.compose.ColumnTransformer
return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
remainder [{‘drop’, ‘passthrough’} or estimator, default ‘drop’] By default, only the
specified columns in transformers are transformed and combined in the output,
and the non-specified columns are dropped. (default of 'drop'). By specify-
ing remainder='passthrough', all remaining columns that were not specified in
transformers will be automatically passed through. This subset of columns is con-
catenated with the output of the transformers. By setting remainder to be an estimator,
the remaining non-specified columns will use the remainder estimator. The estimator
must support fit and transform. Note that using this feature requires that the DataFrame
columns input at fit and transform have identical order.
sparse_threshold [float, default = 0.3] If the output of the different transformers contains sparse
matrices, these will be stacked as a sparse matrix if the overall density is lower than this
value. Use sparse_threshold=0 to always return dense. When the transformed output
consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
n_jobs [int or None, optional (default=None)] Number of jobs to run in parallel. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
transformer_weights [dict, optional] Multiplicative weights for features per transformer. The
output of the transformer is multiplied by these weights. Keys are transformer names, values
the weights.
verbose [boolean, optional(default=False)] If True, the time elapsed while fitting each trans-
former will be printed as it is completed.
Attributes
transformers_ [list] The collection of fitted transformers as tuples of (name, fitted_transformer,
column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. In
case there were no columns selected, this will be the unfitted transformer. If there
are remaining columns, the final element is a tuple of the form: (‘remainder’, trans-
former, remaining_columns) corresponding to the remainder parameter. If there are
remaining columns, then len(transformers_)==len(transformers)+1, other-
wise len(transformers_)==len(transformers).
named_transformers_ [Bunch object, a dictionary with attribute access] Access the fitted
transformer by name.
sparse_output_ [boolean] Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.
See also:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified
in the transformers list. Columns of the original feature matrix that are not specified are dropped from the
resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified
with passthrough are added at the right to the output of the transformers.
Examples
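A minimal sketch on a hypothetical DataFrame with one numeric and one categorical column:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "city": ["a", "b", "a"]})
ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(), ["city"]),
    ],
    remainder="drop",
)
Xt = ct.fit_transform(df)   # scaled age column followed by one-hot city columns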
Methods
7.4.2 sklearn.compose.TransformedTargetRegressor
The computation during fit is:
regressor.fit(X, func(y))
or:
regressor.fit(X, transformer.transform(y))
The computation during predict is:
inverse_func(regressor.predict(X))
or:
transformer.inverse_transform(regressor.predict(X))
Notes
Internally, the target y is always converted into a 2-dimensional array to be used by scikit-learn transformers. At the time of prediction, the output will be reshaped to have the same number of dimensions as y.
See examples/compose/plot_transformed_target.py.
Examples
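A minimal sketch of log-transforming the target (synthetic data assumed for illustration):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(40).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X, y)            # internally fits on log(y)
ttr.predict(X[:3])       # predictions mapped back through exp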
Methods
fit(self, X, y, \*\*fit_params) Fit the model according to the given training data.
get_params(self[, deep]) Get parameters for this estimator.
predict(self, X) Predict using the base regressor, applying inverse.
score(self, X, y[, sample_weight]) Return the coefficient of determination R^2 of the prediction.
set_params(self, \*\*params) Set the parameters of this estimator.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the base regressor, applying inverse.
The regressor is used to predict and the inverse_func or inverse_transform is applied before
returning the prediction.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Samples.
Returns
y_hat [array, shape = (n_samples,)] Predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
7.4.3 sklearn.compose.make_column_transformer
sklearn.compose.make_column_transformer(*transformers, **kwargs)
Construct a ColumnTransformer from the given transformers.
This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming
the transformers. Instead, they will be given names automatically based on their types. It also does not allow
weighting with transformer_weights.
Read more in the User Guide.
Parameters
*transformers [tuples] Tuples of the form (transformer, column(s)) specifying the transformer
objects to be applied to subsets of the data.
transformer [estimator or {‘passthrough’, ‘drop’}] Estimator must support fit and trans-
form. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to
drop the columns or to pass them through untransformed, respectively.
column(s) [string or int, array-like of string or int, slice, boolean mask array or callable]
Indexes the data on its second axis. Integers are interpreted as positional columns, while
strings can reference DataFrame columns by name. A scalar string or int should be used
where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will
be passed to the transformer. A callable is passed the input data X and can return any of
the above.
remainder [{‘drop’, ‘passthrough’} or estimator, default ‘drop’] By default, only the
specified columns in transformers are transformed and combined in the output,
and the non-specified columns are dropped. (default of 'drop'). By specify-
ing remainder='passthrough', all remaining columns that were not specified in
transformers will be automatically passed through. This subset of columns is con-
catenated with the output of the transformers. By setting remainder to be an estimator,
the remaining non-specified columns will use the remainder estimator. The estimator
must support fit and transform.
sparse_threshold [float, default = 0.3] If the transformed output consists of a mix of sparse and
dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use
sparse_threshold=0 to always return dense. When the transformed output consists
of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and
this keyword will be ignored.
n_jobs [int or None, optional (default=None)] Number of jobs to run in parallel. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
verbose [boolean, optional(default=False)] If True, the time elapsed while fitting each trans-
former will be printed as it is completed.
Returns
ct [ColumnTransformer]
See also:
Examples
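A minimal sketch; the transformer names are generated automatically from the transformer types:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["city"]),
    remainder="passthrough",
)
# the generated names are 'standardscaler' and 'onehotencoder'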
7.4.4 sklearn.compose.make_column_selector
sklearn.compose.make_column_selector(pattern=None, dtype_include=None,
dtype_exclude=None)
Create a callable to select columns to be used with ColumnTransformer.
make_column_selector can select columns based on datatype or the columns name with a regex. When
using multiple selection criteria, all criteria must match for a column to be selected.
Parameters
pattern [str, default=None] Names of columns containing this regex pattern will be included. If None, column selection will not be based on the pattern.
dtype_include [column dtype or list of column dtypes, default=None] A selection of dtypes to
include. For more details, see pandas.DataFrame.select_dtypes.
dtype_exclude [column dtype or list of column dtypes, default=None] A selection of dtypes to
exclude. For more details, see pandas.DataFrame.select_dtypes.
Returns
selector [callable] Callable for column selection to be used by a ColumnTransformer.
See also:
Examples
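A minimal sketch using the selector inside a ColumnTransformer (hypothetical DataFrame):
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25.0, 32.0], "city": ["a", "b"]})
ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(), make_column_selector(dtype_include=object)),
)
ct.fit_transform(df)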
The sklearn.covariance module includes methods and algorithms to robustly estimate the covariance of fea-
tures given a set of points. The precision matrix defined as the inverse of the covariance is also estimated. Covariance
estimation is closely related to the theory of Gaussian Graphical Models.
User guide: See the Covariance estimation section for further details.
7.5.1 sklearn.covariance.EmpiricalCovariance
Examples
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ]) Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y]) Fits the Maximum Likelihood Estimator covariance model according to the given training data and parameters.
get_params(self[, deep]) Get parameters for this estimator.
get_precision(self) Getter for the precision matrix.
mahalanobis(self, X) Computes the squared Mahalanobis distances of given observations.
score(self, X_test[, y]) Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(self, \*\*params) Set the parameters of this estimator.
scaling [bool] If True (default), the squared error norm is divided by n_features. If False,
the squared error norm is not rescaled.
squared [bool] Whether to compute the squared error norm or the error norm. If True
(default), the squared error norm is returned. If False, the error norm is returned.
Returns
The Mean Squared Error (in the sense of the Frobenius norm) between
self and comp_cov covariance estimators.
fit(self, X, y=None)
Fits the Maximum Likelihood Estimator covariance model according to the given training data and param-
eters.
Parameters
X [array-like of shape (n_samples, n_features)] Training data, where n_samples is the num-
ber of samples and n_features is the number of features.
y Not used, present for API consistency.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_precision(self )
Getter for the precision matrix.
Returns
precision_ [array-like] The precision matrix associated to the current covariance object.
mahalanobis(self, X)
Computes the squared Mahalanobis distances of given observations.
Parameters
X [array-like of shape (n_samples, n_features)] The observations whose Mahalanobis distances are computed. Observations are assumed to be drawn from the same distribution as the data used in fit.
Returns
dist [array, shape = [n_samples,]] Squared Mahalanobis distances of the observations.
score(self, X_test, y=None)
Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its
covariance matrix.
Parameters
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
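As a sketch of the <component>__<parameter> convention mentioned above (the pipeline and step names below are made up for illustration):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> pipe = Pipeline([('scaler', StandardScaler()),
...                  ('clf', LogisticRegression())])
>>> pipe.set_params(clf__C=10).get_params()['clf__C']
10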
7.5.2 sklearn.covariance.EllipticEnvelope
random_state [int, RandomState instance or None, optional (default=None)] The seed of the
pseudo random number generator to use when shuffling the data. If int, random_state is
the seed used by the random number generator; If RandomState instance, random_state is
the random number generator; If None, the random number generator is the RandomState
instance used by np.random.
Attributes
location_ [array-like, shape (n_features,)] Estimated robust location
covariance_ [array-like, shape (n_features, n_features)] Estimated robust covariance matrix
precision_ [array-like, shape (n_features, n_features)] Estimated pseudo inverse matrix. (stored
only if store_precision is True)
support_ [array-like, shape (n_samples,)] A mask of the observations that have been used to
compute the robust estimates of location and shape.
offset_ [float] Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. The offset depends on the contamination parameter and is defined in such a way that we obtain the expected number of outliers (samples with decision function < 0) in training.
See also:
EmpiricalCovariance, MinCovDet
Notes
Outlier detection from covariance estimation may break or not perform well in high-dimensional settings. In particular, one should always take care to work with n_samples > n_features ** 2.
References
[R68ae096da0e4-1]
Examples
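A minimal sketch of outlier detection with EllipticEnvelope on synthetic data (the contamination value and the planted outliers are illustrative assumptions):

>>> import numpy as np
>>> from sklearn.covariance import EllipticEnvelope
>>> rng = np.random.RandomState(0)
>>> X = np.r_[rng.randn(100, 2), rng.uniform(low=6, high=8, size=(10, 2))]
>>> ee = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)
>>> is_inlier = ee.predict(X)            # +1 for inliers, -1 for outliers
>>> scores = ee.decision_function(X)     # negative values flag outliers
>>> is_inlier.shape
(110,)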
Methods
References
[RVD]
decision_function(self, X)
Compute the decision function of the given observations.
Parameters
X [array-like, shape (n_samples, n_features)]
Returns
get_precision(self)
Getter for the precision matrix.
Returns
precision_ [array-like] The precision matrix associated to the current covariance object.
mahalanobis(self, X)
Computes the squared Mahalanobis distances of given observations.
Parameters
X [array-like of shape (n_samples, n_features)] The observations whose Mahalanobis distances are computed. Observations are assumed to be drawn from the same distribution as the data used in fit.
Returns
dist [array, shape = [n_samples,]] Squared Mahalanobis distances of the observations.
predict(self, X)
Predict the labels (1 inlier, -1 outlier) of X according to the fitted model.
Parameters
X [array-like, shape (n_samples, n_features)]
Returns
is_inlier [array, shape (n_samples,)] Returns -1 for anomalies/outliers and +1 for inliers.
reweight_covariance(self, data)
Re-weight raw Minimum Covariance Determinant estimates.
Re-weight observations using Rousseeuw’s method (equivalent to deleting outlying observations from the
data set before computing location and covariance estimates) described in [RVDriessen].
Parameters
data [array-like, shape (n_samples, n_features)] The data matrix, with p features and n sam-
ples. The data set must be the one which was used to compute the raw estimates.
Returns
location_reweighted [array-like, shape (n_features, )] Re-weighted robust location esti-
mate.
covariance_reweighted [array-like, shape (n_features, n_features)] Re-weighted robust co-
variance estimate.
support_reweighted [array-like, type boolean, shape (n_samples,)] A mask of the obser-
vations that have been used to compute the re-weighted robust location and covariance
estimates.
References
[RVDriessen]
score(self, X, y, sample_weight=None)
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like, shape (n_samples, n_features)] Test samples.
7.5.3 sklearn.covariance.GraphicalLasso
enet_tol [positive float, optional] The tolerance for the elastic net solver used to calculate the
descent direction. This parameter controls the accuracy of the search direction for a given
column update, not of the overall parameter estimate. Only used for mode=’cd’.
max_iter [integer, default 100] The maximum number of iterations.
verbose [boolean, default False] If verbose is True, the objective function and dual gap are
plotted at each iteration.
assume_centered [boolean, default False] If True, data are not centered before computation.
Useful when working with data whose mean is almost, but not exactly zero. If False, data
are centered before computation.
Attributes
location_ [array-like, shape (n_features,)] Estimated location, i.e. the estimated mean.
covariance_ [array-like, shape (n_features, n_features)] Estimated covariance matrix
precision_ [array-like, shape (n_features, n_features)] Estimated pseudo inverse matrix.
n_iter_ [int] Number of iterations run.
See also:
graphical_lasso, GraphicalLassoCV
Examples
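A minimal sketch (synthetic data and an illustrative alpha value) of fitting a sparse inverse covariance with GraphicalLasso:

>>> import numpy as np
>>> from sklearn.covariance import GraphicalLasso
>>> rng = np.random.RandomState(0)
>>> true_cov = np.array([[0.8, 0.0, 0.2],
...                      [0.0, 0.4, 0.0],
...                      [0.2, 0.0, 0.6]])
>>> X = rng.multivariate_normal(mean=np.zeros(3), cov=true_cov, size=200)
>>> gl = GraphicalLasso(alpha=0.01).fit(X)
>>> gl.covariance_.shape
(3, 3)
>>> gl.precision_.shape
(3, 3)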
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ])  Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y])  Fits the GraphicalLasso model to X.
get_params(self[, deep])  Get parameters for this estimator.
get_precision(self)  Getter for the precision matrix.
mahalanobis(self, X)
Computes the squared Mahalanobis distances of given observations.
Parameters
X [array-like of shape (n_samples, n_features)] The observations whose Mahalanobis distances are computed. Observations are assumed to be drawn from the same distribution as the data used in fit.
Returns
dist [array, shape = [n_samples,]] Squared Mahalanobis distances of the observations.
score(self, X_test, y=None)
Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its
covariance matrix.
Parameters
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.5.4 sklearn.covariance.GraphicalLassoCV
n_refinements [strictly positive integer] The number of times the grid is refined. Not used if
explicit values of alphas are passed.
cv [int, cross-validation generator or an iterable, optional] Determines the cross-validation split-
ting strategy. Possible inputs for cv are:
• None, to use the default 5-fold cross-validation,
• integer, to specify the number of folds.
• CV splitter,
• An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
Changed in version 0.20: cv default value if None changed from 3-fold to 5-fold.
tol [positive float, optional] The tolerance to declare convergence: if the dual gap goes below
this value, iterations are stopped.
enet_tol [positive float, optional] The tolerance for the elastic net solver used to calculate the
descent direction. This parameter controls the accuracy of the search direction for a given
column update, not of the overall parameter estimate. Only used for mode=’cd’.
max_iter [integer, optional] Maximum number of iterations.
mode [{‘cd’, ‘lars’}] The Lasso solver to use: coordinate descent or LARS. Use LARS for
very sparse underlying graphs, where number of features is greater than number of samples.
Elsewhere prefer cd which is more numerically stable.
n_jobs [int or None, optional (default=None)] number of jobs to run in parallel. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
verbose [boolean, optional] If verbose is True, the objective function and duality gap are printed
at each iteration.
assume_centered [boolean] If True, data are not centered before computation. Useful when
working with data whose mean is almost, but not exactly zero. If False, data are centered
before computation.
Attributes
location_ [array-like, shape (n_features,)] Estimated location, i.e. the estimated mean.
covariance_ [numpy.ndarray, shape (n_features, n_features)] Estimated covariance matrix.
precision_ [numpy.ndarray, shape (n_features, n_features)] Estimated precision matrix (inverse
covariance).
alpha_ [float] Penalization parameter selected.
cv_alphas_ [list of float] All penalization parameters explored.
grid_scores_ [2D numpy.ndarray (n_alphas, n_folds)] Log-likelihood score on left-out data
across folds.
n_iter_ [int] Number of iterations run for the optimal alpha.
See also:
graphical_lasso, GraphicalLasso
Notes
The search for the optimal penalization parameter (alpha) is done on an iteratively refined grid: first the cross-
validated scores on a grid are computed, then a new refined grid is centered around the maximum, and so on.
One of the challenges which is faced here is that the solvers can fail to converge to a well-conditioned estimate.
The corresponding values of alpha then come out as missing values, but the optimum may be close to these
missing values.
Examples
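A minimal sketch of the cross-validated variant on synthetic data (the covariance below is made up; fitting refines the alpha grid as described in the Notes):

>>> import numpy as np
>>> from sklearn.covariance import GraphicalLassoCV
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0, 0],
...                             cov=[[1.0, 0.3, 0.0],
...                                  [0.3, 1.0, 0.0],
...                                  [0.0, 0.0, 1.0]], size=300)
>>> model = GraphicalLassoCV(cv=5).fit(X)
>>> model.precision_.shape
(3, 3)
>>> model.alpha_ > 0
True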
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ])  Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y])  Fits the GraphicalLasso covariance model to X.
get_params(self[, deep])  Get parameters for this estimator.
get_precision(self)  Getter for the precision matrix.
mahalanobis(self, X)  Computes the squared Mahalanobis distances of given observations.
score(self, X_test[, y])  Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(self, \*\*params)  Set the parameters of this estimator.
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.5.5 sklearn.covariance.LedoitWolf
precision_ [array-like, shape (n_features, n_features)] Estimated pseudo inverse matrix. (stored
only if store_precision is True)
shrinkage_ [float, 0 <= shrinkage <= 1] Coefficient in the convex combination used for the
computation of the shrunk estimate.
Notes
References
“A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices”, Ledoit and Wolf, Journal of Mul-
tivariate Analysis, Volume 88, Issue 2, February 2004, pages 365-411.
Examples
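A minimal sketch (synthetic data) showing the fitted shrinkage coefficient staying within [0, 1]:

>>> import numpy as np
>>> from sklearn.covariance import LedoitWolf
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[0.8, 0.3], [0.3, 0.4]], size=50)
>>> lw = LedoitWolf().fit(X)
>>> lw.covariance_.shape
(2, 2)
>>> 0.0 <= lw.shrinkage_ <= 1.0
True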
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ])  Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y])  Fits the Ledoit-Wolf shrunk covariance model according to the given training data and parameters.
get_params(self[, deep])  Get parameters for this estimator.
get_precision(self)  Getter for the precision matrix.
mahalanobis(self, X)  Computes the squared Mahalanobis distances of given observations.
score(self, X_test[, y])  Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its covariance matrix.
set_params(self, \*\*params)  Set the parameters of this estimator.
7.5.6 sklearn.covariance.MinCovDet
Parameters
store_precision [bool] Specify if the estimated precision is stored.
assume_centered [bool] If True, the support of the robust location and the covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful to work with data whose mean is close to, but not exactly, zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.
support_fraction [float, 0 < support_fraction < 1] The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: [n_samples + n_features + 1] / 2.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
Attributes
raw_location_ [array-like, shape (n_features,)] The raw robust estimated location before cor-
rection and re-weighting.
raw_covariance_ [array-like, shape (n_features, n_features)] The raw robust estimated covari-
ance before correction and re-weighting.
raw_support_ [array-like, shape (n_samples,)] A mask of the observations that have been
used to compute the raw robust estimates of location and shape, before correction and re-
weighting.
location_ [array-like, shape (n_features,)] Estimated robust location
covariance_ [array-like, shape (n_features, n_features)] Estimated robust covariance matrix
precision_ [array-like, shape (n_features, n_features)] Estimated pseudo inverse matrix. (stored
only if store_precision is True)
support_ [array-like, shape (n_samples,)] A mask of the observations that have been used to
compute the robust estimates of location and shape.
dist_ [array-like, shape (n_samples,)] Mahalanobis distances of the training set (on which fit
is called) observations.
References
Examples
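A minimal sketch (synthetic data with a few planted outliers, chosen only for illustration) of a robust MinCovDet fit:

>>> import numpy as np
>>> from sklearn.covariance import MinCovDet
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.3], [0.3, 0.7]], size=500)
>>> X[:25] = rng.uniform(low=5, high=10, size=(25, 2))   # contaminate 5% of the rows
>>> mcd = MinCovDet(random_state=0).fit(X)
>>> mcd.covariance_.shape
(2, 2)
>>> mcd.dist_.shape
(500,)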
Methods
References
[RVD]
error_norm(self, comp_cov, norm=’frobenius’, scaling=True, squared=True)
Computes the Mean Squared Error between two covariance estimators. (In the sense of the Frobenius
norm).
Parameters
reweight_covariance(self, data)
Re-weight raw Minimum Covariance Determinant estimates.
Re-weight observations using Rousseeuw’s method (equivalent to deleting outlying observations from the data set before computing location and covariance estimates) described in [RVDriessen].
Parameters
data [array-like, shape (n_samples, n_features)] The data matrix, with p features and n sam-
ples. The data set must be the one which was used to compute the raw estimates.
Returns
location_reweighted [array-like, shape (n_features, )] Re-weighted robust location esti-
mate.
covariance_reweighted [array-like, shape (n_features, n_features)] Re-weighted robust co-
variance estimate.
support_reweighted [array-like, type boolean, shape (n_samples,)] A mask of the obser-
vations that have been used to compute the re-weighted robust location and covariance
estimates.
References
[RVDriessen]
score(self, X_test, y=None)
Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its
covariance matrix.
Parameters
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.5.7 sklearn.covariance.OAS
Notes
References
“Shrinkage Algorithms for MMSE Covariance Estimation” Chen et al., IEEE Trans. on Sign. Proc., Volume 58,
Issue 10, October 2010.
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ])  Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y])  Fits the Oracle Approximating Shrinkage covariance model according to the given training data and parameters.
get_params(self[, deep])  Get parameters for this estimator.
get_precision(self)  Getter for the precision matrix.
mahalanobis(self, X)  Computes the squared Mahalanobis distances of given observations.
precision_ [array-like] The precision matrix associated to the current covariance object.
mahalanobis(self, X)
Computes the squared Mahalanobis distances of given observations.
Parameters
X [array-like of shape (n_samples, n_features)] The observations whose Mahalanobis distances are computed. Observations are assumed to be drawn from the same distribution as the data used in fit.
Returns
dist [array, shape = [n_samples,]] Squared Mahalanobis distances of the observations.
score(self, X_test, y=None)
Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its
covariance matrix.
Parameters
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.5.8 sklearn.covariance.ShrunkCovariance
Notes
Examples
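A minimal sketch (synthetic data, illustrative shrinkage value):

>>> import numpy as np
>>> from sklearn.covariance import ShrunkCovariance
>>> rng = np.random.RandomState(0)
>>> X = rng.multivariate_normal(mean=[0, 0], cov=[[0.8, 0.3], [0.3, 0.4]], size=50)
>>> sc = ShrunkCovariance(shrinkage=0.5).fit(X)
>>> sc.covariance_.shape
(2, 2)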
Methods
error_norm(self, comp_cov[, norm, scaling, . . . ])  Computes the Mean Squared Error between two covariance estimators.
fit(self, X[, y])  Fits the shrunk covariance model according to the given training data and parameters.
get_params(self[, deep])  Get parameters for this estimator.
get_precision(self)  Getter for the precision matrix.
mahalanobis(self, X)  Computes the squared Mahalanobis distances of given observations.
mahalanobis(self, X)
Computes the squared Mahalanobis distances of given observations.
Parameters
X [array-like of shape (n_samples, n_features)] The observations whose Mahalanobis distances are computed. Observations are assumed to be drawn from the same distribution as the data used in fit.
Returns
dist [array, shape = [n_samples,]] Squared Mahalanobis distances of the observations.
score(self, X_test, y=None)
Computes the log-likelihood of a Gaussian data set with self.covariance_ as an estimator of its
covariance matrix.
Parameters
X_test [array-like of shape (n_samples, n_features)] Test data of which we compute the likelihood, where n_samples is the number of samples and n_features is the number of features. X_test is assumed to be drawn from the same distribution as the data used in fit (including centering).
y Not used, present for API consistency.
Returns
res [float] The likelihood of the data set with self.covariance_ as an estimator of its
covariance matrix.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.5.9 sklearn.covariance.empirical_covariance
sklearn.covariance.empirical_covariance(X, assume_centered=False)
Computes the Maximum likelihood covariance estimator
Parameters
X [ndarray, shape (n_samples, n_features)] Data from which to compute the covariance estimate
assume_centered [boolean] If True, data will not be centered before computation. Useful when
working with data whose mean is almost, but not exactly zero. If False, data will be centered
before computation.
Returns
covariance [2D ndarray, shape (n_features, n_features)] Empirical covariance (Maximum Like-
lihood Estimator).
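A quick sketch with a tiny hand-made matrix (values chosen so the result is easy to verify by hand):

>>> import numpy as np
>>> from sklearn.covariance import empirical_covariance
>>> X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
>>> np.allclose(empirical_covariance(X), [[0.25, 0.0], [0.0, 0.25]])
True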
7.5.10 sklearn.covariance.graphical_lasso
verbose [boolean, optional] If verbose is True, the objective function and dual gap are printed
at each iteration.
return_costs [boolean, optional] If return_costs is True, the objective function and dual gap at
each iteration are returned.
eps [float, optional] The machine-precision regularization in the computation of the Cholesky
diagonal factors. Increase this for very ill-conditioned systems.
return_n_iter [bool, optional] Whether or not to return the number of iterations.
Returns
covariance [2D ndarray, shape (n_features, n_features)] The estimated covariance matrix.
precision [2D ndarray, shape (n_features, n_features)] The estimated (sparse) precision matrix.
costs [list of (objective, dual_gap) pairs] The list of values of the objective function and the dual
gap at each iteration. Returned only if return_costs is True.
n_iter [int] Number of iterations. Returned only if return_n_iter is set to True.
See also:
GraphicalLasso, GraphicalLassoCV
Notes
The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics
paper. It is the same algorithm as in the R glasso package.
One possible difference with the glasso R package is that the diagonal coefficients are not penalized.
7.5.11 sklearn.covariance.ledoit_wolf
Notes
7.5.12 sklearn.covariance.oas
sklearn.covariance.oas(X, assume_centered=False)
Estimate covariance with the Oracle Approximating Shrinkage algorithm.
Parameters
X [array-like, shape (n_samples, n_features)] Data from which to compute the covariance esti-
mate.
assume_centered [boolean] If True, data will not be centered before computation. Useful to work with data whose mean is close to, but not exactly, zero. If False, data will be centered before computation.
Returns
shrunk_cov [array-like, shape (n_features, n_features)] Shrunk covariance.
shrinkage [float] Coefficient in the convex combination used for the computation of the shrunk
estimate.
Notes
7.5.13 sklearn.covariance.shrunk_covariance
sklearn.covariance.shrunk_covariance(emp_cov, shrinkage=0.1)
Calculates a covariance matrix shrunk on the diagonal
Read more in the User Guide.
Parameters
emp_cov [array-like, shape (n_features, n_features)] Covariance matrix to be shrunk
shrinkage [float, 0 <= shrinkage <= 1] Coefficient in the convex combination used for the
computation of the shrunk estimate.
Returns
Notes
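As an illustration of the shrinkage, assuming the standard convex combination (1 - shrinkage) * emp_cov + shrinkage * mu * identity with mu = trace(emp_cov) / n_features (a sketch, not from the original docs):

>>> import numpy as np
>>> from sklearn.covariance import shrunk_covariance
>>> emp_cov = np.array([[0.8, 0.3], [0.3, 0.4]])
>>> shrunk = shrunk_covariance(emp_cov, shrinkage=0.5)
>>> mu = np.trace(emp_cov) / 2
>>> np.allclose(shrunk, 0.5 * emp_cov + 0.5 * mu * np.eye(2))
True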
User guide: See the Cross decomposition section for further details.
7.6.1 sklearn.cross_decomposition.CCA
PLSCanonical
PLSSVD
Notes
For each component k, find the weights u, v that maximize max corr(Xk u, Yk v), such that |u| = |v| = 1.
Note that it maximizes only the correlations between the scores.
The residual matrix of X (Xk+1) block is obtained by the deflation on the current X score: x_score.
The residual matrix of Y (Yk+1) block is obtained by deflation on the current Y score.
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
In French but still a reference: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Examples
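A minimal sketch with toy X/Y blocks (made up for illustration):

>>> from sklearn.cross_decomposition import CCA
>>> X = [[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [3., 5., 4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> cca = CCA(n_components=1)
>>> X_c, Y_c = cca.fit_transform(X, Y)
>>> X_c.shape, Y_c.shape
((4, 1), (4, 1))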
Methods
Notes
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
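For instance, checking the formula directly on made-up numbers (not from the document):

>>> import numpy as np
>>> y_true = np.array([3.0, -0.5, 2.0, 7.0])
>>> y_pred = np.array([2.5, 0.0, 2.0, 8.0])
>>> u = ((y_true - y_pred) ** 2).sum()
>>> v = ((y_true - y_true.mean()) ** 2).sum()
>>> round(1 - u / v, 3)
0.949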
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
• Multilabel classification
• Compare cross decomposition methods
7.6.2 sklearn.cross_decomposition.PLSCanonical
CCA
PLSSVD
Notes
Matrices:
T: x_scores_
U: y_scores_
W: x_weights_
C: y_weights_
P: x_loadings_
Q: y_loadings_
For each component k, find the weights u, v that maximize max corr(Xk u, Yk v) * std(Xk u) std(Yk v), such that |u| = |v| = 1.
Note that it maximizes both the correlations between the scores and the intra-block variances.
The residual matrix of X (Xk+1) block is obtained by the deflation on the current X score: x_score.
The residual matrix of Y (Yk+1) block is obtained by deflation on the current Y score. This performs a canonical symmetric version of the PLS regression, but is slightly different from CCA. This is mostly used for modeling.
This implementation provides the same results as the “plspm” package in the R language (R-project), using the function plsca(X, Y). Results are equal or collinear with the function pls(..., mode = "canonical") of the “mixOmics” package. The difference lies in the fact that the mixOmics implementation does not exactly implement the Wold algorithm since it does not normalize y_weights to one.
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
Examples
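A minimal sketch with toy two-block data (made up for illustration):

>>> from sklearn.cross_decomposition import PLSCanonical
>>> X = [[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> plsca = PLSCanonical(n_components=2)
>>> X_c, Y_c = plsca.fit_transform(X, Y)
>>> X_c.shape, Y_c.shape
((4, 2), (4, 2))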
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, X)
Transform data back to its original space.
Parameters
X [array-like of shape (n_samples, n_components)] New data, where n_samples is the num-
ber of samples and n_components is the number of pls components.
Returns
x_reconstructed [array-like of shape (n_samples, n_features)]
Notes
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
Notes
7.6.3 sklearn.cross_decomposition.PLSRegression
Notes
Matrices:
T: x_scores_
U: y_scores_
W: x_weights_
C: y_weights_
P: x_loadings_
Q: y_loadings_
This implementation provides the same results as three PLS packages in the R language (R-project):
• “mixOmics” with function pls(X, Y, mode = “regression”)
• “plspm” with function plsreg2(X, Y)
• “pls” with function oscorespls.fit(X, Y)
References
Jacob A. Wegelin. A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case.
Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
In French but still a reference: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.
Examples
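A minimal sketch with toy data (made up for illustration):

>>> from sklearn.cross_decomposition import PLSRegression
>>> X = [[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]]
>>> Y = [[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]]
>>> pls2 = PLSRegression(n_components=2).fit(X, Y)
>>> pls2.predict(X).shape
(4, 2)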
Methods
fit_transform(self, X, y=None)
Learn and apply the dimension reduction on the train data.
Parameters
X [array-like of shape (n_samples, n_features)] Training vectors, where n_samples is the
number of samples and n_features is the number of predictors.
y [array-like of shape (n_samples, n_targets)] Target vectors, where n_samples is the number
of samples and n_targets is the number of response variables.
Returns
x_scores if Y is not given, (x_scores, y_scores) otherwise.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, X)
Transform data back to its original space.
Parameters
X [array-like of shape (n_samples, n_components)] New data, where n_samples is the num-
ber of samples and n_components is the number of pls components.
Returns
x_reconstructed [array-like of shape (n_samples, n_features)]
Notes
Notes
This call requires the estimation of a p x q matrix, which may be an issue in high dimensional space.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
7.6.4 sklearn.cross_decomposition.PLSSVD
PLSCanonical
CCA
Examples
Methods
transform(self, X, Y=None)
Apply the dimension reduction learned on the train data.
Parameters
X [array-like of shape (n_samples, n_features)] Training vectors, where n_samples is the
number of samples and n_features is the number of predictors.
Y [array-like of shape (n_samples, n_targets)] Target vectors, where n_samples is the num-
ber of samples and n_targets is the number of response variables.
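A minimal sketch of fitting PLSSVD and applying the learned reduction to both blocks (toy data, made up for illustration):

>>> import numpy as np
>>> from sklearn.cross_decomposition import PLSSVD
>>> X = np.array([[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]])
>>> Y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
>>> plssvd = PLSSVD(n_components=2).fit(X, Y)
>>> X_c, Y_c = plssvd.transform(X, Y)
>>> X_c.shape, Y_c.shape
((4, 2), (4, 2))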
The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular
reference datasets. It also features some artificial data generators.
User guide: See the Dataset loading utilities section for further details.
7.7.1 Loaders
sklearn.datasets.clear_data_home
sklearn.datasets.clear_data_home(data_home=None)
Delete all the content of the data home cache.
Parameters
data_home [str | None] The path to scikit-learn data dir.
sklearn.datasets.dump_svmlight_file
• Libsvm GUI
sklearn.datasets.fetch_20newsgroups
Classes 20
Samples total 18846
Dimensionality 1
Features text
• target_names: a list of categories of the returned data, length [n_classes]. This depends
on the categories parameter.
(data, target) [tuple if return_X_y=True] New in version 0.22.
sklearn.datasets.fetch_20newsgroups_vectorized
sklearn.datasets.fetch_20newsgroups_vectorized(subset=’train’, remove=(), data_home=None, download_if_missing=True, return_X_y=False, normalize=True)
Load the 20 newsgroups dataset and vectorize it into token counts (classification).
Download it if necessary.
This is a convenience function; the transformation is done using the default settings for sklearn.
feature_extraction.text.CountVectorizer. For more advanced usage (stopword
filtering, n-gram extraction, etc.), combine fetch_20newsgroups with a custom sklearn.
feature_extraction.text.CountVectorizer, sklearn.feature_extraction.text.
HashingVectorizer, sklearn.feature_extraction.text.TfidfTransformer or
sklearn.feature_extraction.text.TfidfVectorizer.
The resulting counts are normalized using sklearn.preprocessing.normalize unless normalize is
set to False.
Classes 20
Samples total 18846
Dimensionality 130107
Features real
normalize [bool, default=True] If True, normalizes each document’s feature vector to unit norm
using sklearn.preprocessing.normalize.
New in version 0.22.
Returns
bunch [Bunch object with the following attributes:]
• bunch.data: sparse matrix, shape [n_samples, n_features]
• bunch.target: array, shape [n_samples]
• bunch.target_names: a list of categories of the returned data, length [n_classes].
• bunch.DESCR: a description of the dataset.
(data, target) [tuple if return_X_y is True] New in version 0.20.
sklearn.datasets.fetch_california_housing
sklearn.datasets.fetch_california_housing(data_home=None, download_if_missing=True,
return_X_y=False, as_frame=False)
Load the California housing dataset (regression).
Notes
sklearn.datasets.fetch_covtype
Classes 7
Samples total 581012
Dimensionality 54
Features int
sklearn.datasets.fetch_kddcup99
Classes 23
Samples total 4898431
Dimensionality 41
Features discrete (int) or continuous (float)
Returns
data [Bunch]
Dictionary-like object, the interesting attributes are:
• ‘data’, the data to learn.
• ‘target’, the regression target for each sample.
• ‘DESCR’, a description of the dataset.
(data, target) [tuple if return_X_y is True] New in version 0.20.
sklearn.datasets.fetch_lfw_pairs
Classes 5749
Samples total 13233
Dimensionality 5828
Features real, between 0 and 255
In the official README.txt this task is described as the “Restricted” task. Since it is unclear how to implement the “Unrestricted” variant correctly, it is left unsupported for now.
The original images are 250 x 250 pixels, but the default slice and resize arguments reduce them to 62 x 47.
Read more in the User Guide.
Parameters
subset [optional, default: ‘train’] Select the dataset to load: ‘train’ for the development training
set, ‘test’ for the development test set, and ‘10_folds’ for the official evaluation set that is
meant to be used with a 10-folds cross validation.
data_home [optional, default: None] Specify another download and cache folder for the
datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
funneled [boolean, optional, default: True] Download and use the funneled variant of the
dataset.
resize [float, optional, default 0.5] Ratio used to resize each face picture.
color [boolean, optional, default False] Keep the 3 RGB channels instead of averaging them to
a single gray level channel. If color is True the shape of the data has one more dimension
than the shape with color = False.
slice_ [optional] Provide a custom 2D slice (height, width) to extract the ‘interesting’ part of the jpeg files and avoid statistical correlation from the background.
download_if_missing [optional, True by default] If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
Returns
sklearn.datasets.fetch_lfw_people
Classes 5749
Samples total 13233
Dimensionality 5828
Features real, between 0 and 255
sklearn.datasets.fetch_olivetti_faces
Classes 40
Samples total 400
Dimensionality 4096
Features real, between 0 and 1
Returns
bunch [Bunch object with the following attributes:]
• data: ndarray, shape (400, 4096). Each row corresponds to a ravelled face image of
original size 64 x 64 pixels.
• images : ndarray, shape (400, 64, 64). Each row is a face image corresponding to one of
the 40 subjects of the dataset.
• target : ndarray, shape (400,). Labels associated to each face image. Those labels are
ranging from 0-39 and correspond to the Subject IDs.
• DESCR : string. Description of the modified Olivetti Faces Dataset.
(data, target) [tuple if return_X_y=True] New in version 0.22.
sklearn.datasets.fetch_openml
Note: EXPERIMENTAL
The API is experimental (particularly the return value structure), and might have small backward-incompatible
changes in future releases.
Parameters
name [str or None] String identifier of the dataset. Note that OpenML can have multiple
datasets with the same name.
version [integer or ‘active’, default=’active’] Version of the dataset. Can only be provided if
also name is given. If ‘active’ the oldest version that’s still active is used. Since there
may be more than one active version of a dataset, and those versions may fundamentally be
different from one another, setting an exact version is highly recommended.
data_id [int or None] OpenML ID of the dataset. The most specific way of retrieving a dataset.
If data_id is not given, name (and potential version) are used to obtain a dataset.
data_home [string or None, default None] Specify another download and cache folder for the
data sets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
target_column [string, list or None, default ‘default-target’] Specify the column name in the data to use as target. If ‘default-target’, the standard target column stored on the server is used. If None, all columns are returned as data and the target is None. If list (of strings), all columns with these names are returned as multi-target (Note: not all scikit-learn classifiers can handle all types of multi-output combinations).
cache [boolean, default=True] Whether to cache downloaded datasets using joblib.
return_X_y [boolean, default=False.] If True, returns (data, target) instead of a Bunch
object. See below for more information about the data and target objects.
as_frame [boolean, default=False] If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). The target is a pandas DataFrame or Series depending on the number of target_columns. The Bunch will contain a frame attribute with the target and the data. If return_X_y is True, then (data, target) will be pandas DataFrames or Series as described above.
Returns
data [Bunch] Dictionary-like object, with attributes:
data [np.array, scipy.sparse.csr_matrix of floats, or pandas DataFrame] The feature matrix.
Categorical features are encoded as ordinals.
target [np.array, pandas Series or DataFrame] The regression target or classification labels,
if applicable. Dtype is float if numeric, and object if categorical. If as_frame is True,
target is a pandas object.
DESCR [str] The full description of the dataset
feature_names [list] The names of the dataset columns
target_names [list] The names of the target columns
New in version 0.22.
categories [dict or None] Maps each categorical feature name to a list of values, such that
the value encoded as i is ith in the list. If as_frame is True, this is None.
details [dict] More metadata from OpenML
frame [pandas DataFrame] Only present when as_frame=True. DataFrame with data
and target.
(data, target) [tuple if return_X_y is True]
Note: EXPERIMENTAL
This interface is experimental and subsequent releases may change attributes without notice
(although there should only be minor changes to data and target).
Missing values in the ‘data’ are represented as NaN’s. Missing values in ‘target’ are repre-
sented as NaN’s (numerical target) or None (categorical target)
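A hedged sketch (requires network access on first use; the dataset name ‘iris’ and the shapes shown are assumptions about what is hosted on OpenML, not part of this document):

>>> from sklearn.datasets import fetch_openml
>>> X, y = fetch_openml(name='iris', version=1, return_X_y=True)  # doctest: +SKIP
>>> X.shape, y.shape                                              # doctest: +SKIP
((150, 4), (150,))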
sklearn.datasets.fetch_rcv1
Classes 103
Samples total 804414
Dimensionality 47236
Features real, between 0 and 1
sklearn.datasets.fetch_species_distributions
sklearn.datasets.fetch_species_distributions(data_home=None, download_if_missing=True)
Loader for the species distribution dataset from Phillips et al. (2006).
Read more in the User Guide.
Parameters
data_home [optional, default: None] Specify another download and cache folder for the
datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
download_if_missing [optional, True by default] If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
Returns
The data is returned as a Bunch object with the following attributes:
coverages [array, shape = [14, 1592, 1212]] These represent the 14 features measured at each
point of the map grid. The latitude/longitude values for the grid are discussed below. Miss-
ing data is represented by the value -9999.
train [record array, shape = (1624,)] The training points for the data. Each point has three fields:
• train[‘species’] is the species name
• train[‘dd long’] is the longitude, in degrees
• train[‘dd lat’] is the latitude, in degrees
test [record array, shape = (620,)] The test points for the data. Same format as the training data.
Nx, Ny [integers] The number of longitudes (x) and latitudes (y) in the grid
x_left_lower_corner, y_left_lower_corner [floats] The (x,y) position of the lower-left corner,
in degrees
grid_size [float] The spacing between points of the grid, in degrees
Notes
This dataset represents the geographic distribution of species. The dataset is provided by Phillips et al. (2006).
The two species are:
• “Bradypus variegatus”, the Brown-throated Sloth.
• “Microryzomys minutus”, also known as the Forest Small Rice Rat, a rodent that lives in Colombia, Ecuador, Peru, and Venezuela.
• For an example of using this dataset with scikit-learn, see exam-
ples/applications/plot_species_distribution_modeling.py.
References
sklearn.datasets.get_data_home
sklearn.datasets.get_data_home(data_home=None)
Return the path of the scikit-learn data dir.
This folder is used by some large dataset loaders to avoid downloading the data several times.
By default the data dir is set to a folder named ‘scikit_learn_data’ in the user home folder.
Alternatively, it can be set by the ‘SCIKIT_LEARN_DATA’ environment variable or programmatically by giving
an explicit folder path. The ‘~’ symbol is expanded to the user home folder.
If the folder does not already exist, it is automatically created.
Parameters
data_home [str | None] The path to scikit-learn data dir.
sklearn.datasets.load_boston
sklearn.datasets.load_boston(return_X_y=False)
Load and return the boston house-prices dataset (regression).
Notes
Examples
sklearn.datasets.load_breast_cancer
sklearn.datasets.load_breast_cancer(return_X_y=False)
Load and return the breast cancer wisconsin dataset (classification).
The breast cancer dataset is a classic and very easy binary classification dataset.
Classes 2
Samples per class 212(M),357(B)
Samples total 569
Dimensionality 30
Features real, positive
data [Bunch] Dictionary-like object, the interesting attributes are: ‘data’, the data to learn,
‘target’, the classification labels, ‘target_names’, the meaning of the labels, ‘feature_names’,
the meaning of the features, and ‘DESCR’, the full description of the dataset, ‘filename’, the
physical location of breast cancer csv dataset (added in version 0.20).
(data, target) [tuple if return_X_y is True] New in version 0.18.
The copy of UCI ML Breast Cancer Wisconsin (Diagnostic) dataset is
downloaded from:
https://goo.gl/U2Uwz2
Examples
Let’s say you are interested in the samples 10, 50, and 85, and want to know their class name.
>>> from sklearn.datasets import load_breast_cancer
>>> data = load_breast_cancer()
>>> data.target[[10, 50, 85]]
array([0, 1, 0])
>>> list(data.target_names)
['malignant', 'benign']
sklearn.datasets.load_diabetes
sklearn.datasets.load_diabetes(return_X_y=False)
Load and return the diabetes dataset (regression).
sklearn.datasets.load_digits
sklearn.datasets.load_digits(n_class=10, return_X_y=False)
Load and return the digits dataset (classification).
Each datapoint is a 8x8 image of a digit.
Classes 10
Samples per class ~180
Samples total 1797
Dimensionality 64
Features integers 0-16
Examples
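A minimal sketch:

>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> digits.data.shape
(1797, 64)
>>> digits.images[0].shape
(8, 8)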
sklearn.datasets.load_files
data [Bunch] Dictionary-like object, the interesting attributes are: either data, the raw text data
to learn, or ‘filenames’, the files holding it, ‘target’, the classification labels (integer index),
‘target_names’, the meaning of the labels, and ‘DESCR’, the full description of the dataset.
sklearn.datasets.load_iris
sklearn.datasets.load_iris(return_X_y=False)
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification dataset.
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
Notes
Changed in version 0.20: Fixed two wrong data points according to Fisher’s paper. The new version is the same
as in R, but not as in the UCI Machine Learning Repository.
Examples
Let’s say you are interested in the samples 10, 25, and 50, and want to know their class name.
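A sketch of that lookup (the labels shown follow the standard ordering of the iris data):

>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']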
• SVM Exercise
sklearn.datasets.load_linnerud
sklearn.datasets.load_linnerud(return_X_y=False)
Load and return the linnerud dataset (multivariate regression).
Samples total 20
Dimensionality 3 (for both data and target)
Features integer
Targets integer
sklearn.datasets.load_sample_image
sklearn.datasets.load_sample_image(image_name)
Load the numpy array of a single sample image
Read more in the User Guide.
Parameters
image_name [{china.jpg, flower.jpg}] The name of the sample image loaded
Returns
img [3D array] The image as a numpy array: height x width x color
Examples
sklearn.datasets.load_sample_images
sklearn.datasets.load_sample_images()
Load sample images for image manipulation.
Loads both the china and the flower sample images.
Read more in the User Guide.
Returns
data [Bunch] Dictionary-like object with the following attributes : ‘images’, the two sample
images, ‘filenames’, the file names for the images, and ‘DESCR’ the full description of the
dataset.
Examples
sklearn.datasets.load_svmlight_file
Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call, and benefit from the near instantaneous loading of memmapped structures for the subsequent calls.
In case the file contains a pairwise preference constraint (known as “qid” in the svmlight format) these are ignored unless the query_id parameter is set to True. These pairwise preference constraints can be used to constrain the combination of samples when using pairwise loss functions (as is the case in some learning to rank problems) so that only pairs with the same query_id value are considered.
This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is
also available at:
https://github.com/mblondel/svmlight-loader
Parameters
f [{str, file-like, int}] (Path to) a file to load. If a path ends in “.gz” or “.bz2”, it will be uncom-
pressed on the fly. If an integer is passed, it is assumed to be a file descriptor. A file-like
or file descriptor will not be closed by this function. A file-like object must be opened in
binary mode.
n_features [int or None] The number of features to use. If None, it will be inferred. This
argument is useful to load several files that are subsets of a bigger sliced dataset: each
subset might not have examples of every feature, hence the inferred shape might vary from
one slice to another. n_features is only required if offset or length are passed a non-
default value.
dtype [numpy data type, default np.float64] Data type of dataset to be loaded. This will be the
data type of the output numpy arrays X and y.
multilabel [boolean, optional, default False] Samples may have several labels each (see https:
//www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html)
zero_based [boolean or “auto”, optional, default “auto”] Whether column indices in f are zero-
based (True) or one-based (False). If column indices are one-based, they are transformed
to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is
applied to determine this from the file contents. Both kinds of files occur “in the wild”, but
they are unfortunately not self-identifying. Using “auto” or True should always be safe when
no offset or length is passed. If offset or length are passed, the “auto” mode falls
back to zero_based=True to avoid having the heuristic check yield inconsistent results
on different segments of the file.
query_id [boolean, default False] If True, will return the query_id array for each file.
offset [integer, optional, default 0] Ignore the offset first bytes by seeking forward, then dis-
carding the following bytes up until the next new line character.
length [integer, optional, default -1] If strictly positive, stop reading any new line of data once
the position in the file has reached the (offset + length) bytes threshold.
Returns
X [scipy.sparse matrix of shape (n_samples, n_features)]
y [ndarray of shape (n_samples,), or, in the multilabel case, a list of tuples of length n_samples]
query_id [array of shape (n_samples,)] query_id for each sample. Only returned when query_id
is set to True.
See also:
load_svmlight_files similar function for loading multiple files in this format, enforcing the same num-
ber of features/columns on all of them.
Examples
from joblib import Memory
from sklearn.datasets import load_svmlight_file
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("mysvmlightfile")
    return data[0], data[1]

X, y = get_data()
sklearn.datasets.load_svmlight_files
is passed. If offset or length are passed, the “auto” mode falls back to zero_based=True to
avoid having the heuristic check yield inconsistent results on different segments of the file.
query_id [boolean, defaults to False] If True, will return the query_id array for each file.
offset [integer, optional, default 0] Ignore the offset first bytes by seeking forward, then dis-
carding the following bytes up until the next new line character.
length [integer, optional, default -1] If strictly positive, stop reading any new line of data once
the position in the file has reached the (offset + length) bytes threshold.
Returns
[X1, y1, . . . , Xn, yn]
where each (Xi, yi) pair is the result from load_svmlight_file(files[i]).
If query_id is set to True, this will return instead [X1, y1, q1,
. . . , Xn, yn, qn] where (Xi, yi, qi) is the result from
load_svmlight_file(files[i])
See also:
load_svmlight_file
Notes
When fitting a model to a matrix X_train and evaluating it against a matrix X_test, it is essential that X_train
and X_test have the same number of features (X_train.shape[1] == X_test.shape[1]). This may not be the case
if you load the files individually with load_svmlight_file.
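A minimal sketch of loading a train/test pair with a shared feature space (the file names below are hypothetical and would need to exist on disk):

>>> from sklearn.datasets import load_svmlight_files
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
...     ("train.svmlight", "test.svmlight"))      # doctest: +SKIP
>>> X_train.shape[1] == X_test.shape[1]           # doctest: +SKIP
True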
sklearn.datasets.load_wine
sklearn.datasets.load_wine(return_X_y=False)
Load and return the wine dataset (classification).
New in version 0.18.
The wine dataset is a classic and very easy multi-class classification dataset.
Classes 3
Samples per class [59,71,48]
Samples total 178
Dimensionality 13
Features real, positive
Examples
Let’s say you are interested in the samples 10, 80, and 140, and want to know their class name.
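A sketch of that lookup (the outputs reflect the per-class sample counts listed above: 59, 71 and 48):

>>> from sklearn.datasets import load_wine
>>> data = load_wine()
>>> data.target[[10, 80, 140]]
array([0, 1, 2])
>>> list(data.target_names)
['class_0', 'class_1', 'class_2']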
sklearn.datasets.make_biclusters
make_checkerboard
References
[1]
sklearn.datasets.make_blobs
Examples
sklearn.datasets.make_checkerboard
make_biclusters
References
[1]
sklearn.datasets.make_circles
factor [0 < double < 1 (default=.8)] Scale factor between inner and outer circle.
Returns
X [array of shape [n_samples, 2]] The generated samples.
y [array of shape [n_samples]] The integer labels (0 or 1) for class membership of each sample.
• Classifier comparison
• Comparing different hierarchical linkage methods on toy datasets
• Comparing different clustering algorithms on toy datasets
• Kernel PCA
• Hashing feature transformation using Totally Random Trees
• t-SNE: The effect of various perplexity values on the shape
• Varying regularization in Multi-layer Perceptron
• Compare Stochastic learning strategies for MLPClassifier
• Feature discretization
• Label Propagation learning a complex structure
sklearn.datasets.make_classification
n_informative [int, optional (default=2)] The number of informative features. Each class is
composed of a number of gaussian clusters each located around the vertices of a hypercube
in a subspace of dimension n_informative. For each cluster, informative features are
drawn independently from N(0, 1) and then randomly linearly combined within each cluster
in order to add covariance. The clusters are then placed on the vertices of the hypercube.
n_redundant [int, optional (default=2)] The number of redundant features. These features are
generated as random linear combinations of the informative features.
n_repeated [int, optional (default=0)] The number of duplicated features, drawn randomly
from the informative and the redundant features.
n_classes [int, optional (default=2)] The number of classes (or labels) of the classification prob-
lem.
n_clusters_per_class [int, optional (default=2)] The number of clusters per class.
weights [array-like of shape (n_classes,) or (n_classes - 1,), (default=None)] The proportions
of samples assigned to each class. If None, then classes are balanced. Note that if
len(weights) == n_classes - 1, then the last class weight is automatically in-
ferred. More than n_samples samples may be returned if the sum of weights exceeds
1.
flip_y [float, optional (default=0.01)] The fraction of samples whose class is assigned randomly.
Larger values introduce noise in the labels and make the classification task harder.
class_sep [float, optional (default=1.0)] The factor multiplying the hypercube size. Larger val-
ues spread out the clusters/classes and make the classification task easier.
hypercube [boolean, optional (default=True)] If True, the clusters are put on the vertices of a
hypercube. If False, the clusters are put on the vertices of a random polytope.
shift [float, array of shape [n_features] or None, optional (default=0.0)] Shift features by the
specified value. If None, then features are shifted by a random value drawn in [-class_sep,
class_sep].
scale [float, array of shape [n_features] or None, optional (default=1.0)] Multiply features by
the specified value. If None, then features are scaled by a random value drawn in [1, 100].
Note that scaling happens after shifting.
shuffle [boolean, optional (default=True)] Shuffle the samples and the features.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset creation. Pass an int for reproducible output across multiple function
calls. See Glossary.
Returns
X [array of shape [n_samples, n_features]] The generated samples.
y [array of shape [n_samples]] The integer labels for class membership of each sample.
See also:
Notes
The algorithm is adapted from Guyon [1] and was designed to generate the “Madelon” dataset.
References
[1]
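A minimal sketch generating an imbalanced two-class problem with the parameters described above
(the parameter values are illustrative):
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=20,
...                            n_informative=2, n_redundant=2,
...                            weights=[0.9], flip_y=0, random_state=0)
>>> X.shape, y.shape
((100, 20), (100,))
>>> int((y == 0).sum())   # class 0 receives ~90% of the samples
90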
sklearn.datasets.make_friedman1
Out of the n_features features, only 5 are actually used to compute y. The remaining features are indepen-
dent of y.
The number of features has to be >= 5.
Read more in the User Guide.
Parameters
n_samples [int, optional (default=100)] The number of samples.
n_features [int, optional (default=10)] The number of features. Should be at least 5.
noise [float, optional (default=0.0)] The standard deviation of the gaussian noise applied to the
output.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset noise. Pass an int for reproducible output across multiple function calls.
See Glossary.
Returns
X [array of shape [n_samples, n_features]] The input samples.
y [array of shape [n_samples]] The output values.
References
[1], [2]
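A minimal sketch; only the first five columns of X contribute to y, the remaining features are noise
(the parameter values are illustrative):
>>> from sklearn.datasets import make_friedman1
>>> X, y = make_friedman1(n_samples=100, n_features=10, noise=0.1,
...                       random_state=0)
>>> X.shape, y.shape
((100, 10), (100,))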
sklearn.datasets.make_friedman2
References
[1], [2]
sklearn.datasets.make_friedman3
References
[1], [2]
sklearn.datasets.make_gaussian_quantiles
cov [float, optional (default=1.)] The covariance matrix will be this value times the unit matrix.
This dataset only produces symmetric normal distributions.
n_samples [int, optional (default=100)] The total number of points equally divided among
classes.
n_features [int, optional (default=2)] The number of features for each sample.
n_classes [int, optional (default=3)] The number of classes
shuffle [boolean, optional (default=True)] Shuffle the samples.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset creation. Pass an int for reproducible output across multiple function
calls. See Glossary.
Returns
X [array of shape [n_samples, n_features]] The generated samples.
y [array of shape [n_samples]] The integer labels for quantile membership of each sample.
Notes
References
[1]
sklearn.datasets.make_hastie_10_2
sklearn.datasets.make_hastie_10_2(n_samples=12000, random_state=None)
Generates data for binary classification used in Hastie et al. 2009, Example 10.2.
The ten features are standard independent Gaussian and the target y is defined by thresholding the
squared norm of the features (see the sketch below).
References
[1]
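A minimal sketch; the labeling rule noted in the comment is the one described in Hastie et al. (2009),
Example 10.2:
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=12000, random_state=0)
>>> # per Hastie et al., Ex. 10.2: y = 1 if sum(x**2) > 9.34 else -1
>>> X.shape
(12000, 10)
>>> set(y) == {-1.0, 1.0}
True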
sklearn.datasets.make_low_rank_matrix
The low rank part of the profile can be considered the structured signal part of the data while the tail can be
considered the noisy part of the data that cannot be summarized by a low number of linear components (singular
vectors).
This kind of singular profile is often seen in practice, for instance:
• gray level pictures of faces
• TF-IDF vectors of text documents crawled from the web
Read more in the User Guide.
Parameters
n_samples [int, optional (default=100)] The number of samples.
n_features [int, optional (default=100)] The number of features.
effective_rank [int, optional (default=10)] The approximate number of singular vectors re-
quired to explain most of the data by linear combinations.
tail_strength [float between 0.0 and 1.0, optional (default=0.5)] The relative importance of the
fat noisy tail of the singular values profile.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset creation. Pass an int for reproducible output across multiple function
calls. See Glossary.
Returns
X [array of shape [n_samples, n_features]] The matrix.
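A minimal sketch generating such a matrix and inspecting its singular value profile (the parameter
values are illustrative):
>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=100, n_features=100,
...                          effective_rank=10, tail_strength=0.5,
...                          random_state=0)
>>> s = np.linalg.svd(X, compute_uv=False)
>>> s.shape   # the first ~10 singular values dominate, the rest form a fat tail
(100,)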
sklearn.datasets.make_moons
sklearn.datasets.make_multilabel_classification
sklearn.datasets.make_multilabel_classification(n_samples=100, n_features=20, n_classes=5,
    n_labels=2, length=50, allow_unlabeled=True, sparse=False, return_indicator='dense',
    return_distributions=False, random_state=None)
Generate a random multilabel classification problem.
For each sample, the generative process is:
• Multilabel classification
• Plot randomly generated multilabel dataset
sklearn.datasets.make_regression
• Prediction Latency
• Plot Ridge coefficients as a function of the L2 regularization
• Robust linear model estimation using RANSAC
• HuberRegressor vs Ridge on dataset with strong outliers
• Lasso on dense and sparse data
• Effect of transforming the targets in regression model
sklearn.datasets.make_s_curve
sklearn.datasets.make_sparse_coded_signal
sklearn.datasets.make_sparse_spd_matrix
make_spd_matrix
Notes
The sparsity is actually imposed on the Cholesky factor of the matrix. Thus alpha does not translate directly into
the filling fraction of the matrix itself.
sklearn.datasets.make_sparse_uncorrelated
X ~ N(0, 1)
y(X) = X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3]
Only the first 4 features are informative. The remaining features are useless.
Read more in the User Guide.
Parameters
n_samples [int, optional (default=100)] The number of samples.
n_features [int, optional (default=10)] The number of features.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset creation. Pass an int for reproducible output across multiple function
calls. See Glossary.
Returns
X [array of shape [n_samples, n_features]] The input samples.
y [array of shape [n_samples]] The output values.
References
[1]
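A minimal sketch; only the first four columns of X carry signal about y (the parameter values are
illustrative):
>>> from sklearn.datasets import make_sparse_uncorrelated
>>> X, y = make_sparse_uncorrelated(n_samples=100, n_features=10,
...                                 random_state=0)
>>> X.shape, y.shape
((100, 10), (100,))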
sklearn.datasets.make_spd_matrix
sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
Generate a random symmetric, positive-definite matrix.
Read more in the User Guide.
Parameters
n_dim [int] The matrix dimension.
random_state [int, RandomState instance or None (default)] Determines random number gen-
eration for dataset creation. Pass an int for reproducible output across multiple function
calls. See Glossary.
Returns
X [array of shape [n_dim, n_dim]] The random symmetric, positive-definite matrix.
See also:
make_sparse_spd_matrix
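A minimal sketch checking the advertised properties of the generated matrix:
>>> import numpy as np
>>> from sklearn.datasets import make_spd_matrix
>>> A = make_spd_matrix(n_dim=4, random_state=0)
>>> np.allclose(A, A.T)                         # symmetric
True
>>> bool(np.all(np.linalg.eigvalsh(A) > 0))     # positive-definite
True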
sklearn.datasets.make_swiss_roll
Notes
References
[1]
The sklearn.decomposition module includes matrix decomposition algorithms, including among others PCA,
NMF or ICA. Most of the algorithms of this module can be regarded as dimensionality reduction techniques.
User guide: See the Decomposing signals in components (matrix factorization problems) section for further details.
7.8.1 sklearn.decomposition.DictionaryLearning
fit_algorithm [{‘lars’, ‘cd’}, default=’lars’] ‘lars’: uses the least angle regression method to
solve the lasso problem (linear_model.lars_path); ‘cd’: uses the coordinate descent method
to compute the Lasso solution (linear_model.Lasso). Lars will be faster if the estimated
components are sparse.
New in version 0.17: cd coordinate descent method to improve speed.
transform_algorithm [{‘lasso_lars’, ‘lasso_cd’, ‘lars’, ‘omp’, ‘threshold’}, default=’omp’]
Algorithm used to transform the data. ‘lars’: uses the least angle regression method
(linear_model.lars_path); ‘lasso_lars’: uses Lars to compute the Lasso solution; ‘lasso_cd’:
uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso),
lasso_lars will be faster if the estimated components are sparse; ‘omp’: uses orthogonal
matching pursuit to estimate the sparse solution; ‘threshold’: squashes to zero all coefficients
less than alpha from the projection dictionary * X'
New in version 0.17: lasso_cd coordinate descent method to improve speed.
transform_n_nonzero_coefs [int, default=0.1*n_features] Number of nonzero coefficients to
target in each column of the solution. This is only used by algorithm='lars' and
algorithm='omp' and is overridden by alpha in the omp case.
transform_alpha [float, default=1.0] If algorithm='lasso_lars' or
algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If
algorithm='threshold', alpha is the absolute value of the threshold below
which coefficients will be squashed to zero. If algorithm='omp', alpha is the
tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides
n_nonzero_coefs.
n_jobs [int or None, default=None] Number of parallel jobs to run. None means 1 unless in
a joblib.parallel_backend context. -1 means using all processors. See Glossary
for more details.
code_init [array of shape (n_samples, n_components), default=None] initial value for the code,
for warm restart
dict_init [array of shape (n_components, n_features), default=None] initial values for the dic-
tionary, for warm restart
verbose [bool, default=False] To control the verbosity of the procedure.
split_sign [bool, default=False] Whether to split the sparse feature vector into the concatenation
of its negative part and its positive part. This can improve the performance of downstream
classifiers.
random_state [int, RandomState instance or None, default=None] If int, random_state is the
seed used by the random number generator; If RandomState instance, random_state is the
random number generator; If None, the random number generator is the RandomState in-
stance used by np.random.
positive_code [bool, default=False] Whether to enforce positivity when finding the code.
New in version 0.20.
positive_dict [bool, default=False] Whether to enforce positivity when finding the dictionary
New in version 0.20.
transform_max_iter [int, default=1000] Maximum number of iterations to perform if
algorithm='lasso_cd' or lasso_lars.
New in version 0.22.
Attributes
components_ [array, [n_components, n_features]] dictionary atoms extracted from the data
error_ [array] vector of errors at each iteration
n_iter_ [int] Number of iterations run.
See also:
SparseCoder
MiniBatchDictionaryLearning
SparsePCA
MiniBatchSparsePCA
Notes
References:
J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009: Online dictionary learning for sparse coding (https://www.di.ens.
fr/sierra/pdfs/icml09.pdf)
Methods
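A minimal usage sketch of DictionaryLearning on toy data (the data and parameter values are
illustrative, not from the original text):
>>> import numpy as np
>>> from sklearn.decomposition import DictionaryLearning
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(20, 10)                       # 20 samples, 10 features
>>> dico = DictionaryLearning(n_components=5,
...                           transform_algorithm='lasso_lars',
...                           transform_alpha=0.1, random_state=0)
>>> code = dico.fit_transform(X)                # sparse codes
>>> code.shape, dico.components_.shape
((20, 5), (5, 10))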
7.8.2 sklearn.decomposition.FactorAnalysis
FactorAnalysis performs a maximum likelihood estimate of the so-called loading matrix, the transformation
from the latent variables to the observed ones, using an SVD-based approach.
Read more in the User Guide.
New in version 0.13.
Parameters
n_components [int | None] Dimensionality of latent space, the number of components of X that
are obtained after transform. If None, n_components is set to the number of features.
tol [float] Stopping tolerance for log-likelihood increase.
copy [bool] Whether to make a copy of X. If False, the input X gets overwritten during fitting.
max_iter [int] Maximum number of iterations.
noise_variance_init [None | array, shape=(n_features,)] The initial guess of the noise variance
for each feature. If None, it defaults to np.ones(n_features)
svd_method [{‘lapack’, ‘randomized’}] Which SVD method to use. If ‘lapack’ use standard
SVD from scipy.linalg, if ‘randomized’ use fast randomized_svd function. Defaults to
‘randomized’. For most applications ‘randomized’ will be sufficiently precise while pro-
viding significant speed gains. Accuracy can also be improved by setting higher values for
iterated_power. If this is not sufficient, for maximum precision you should choose
‘lapack’.
iterated_power [int, optional] Number of iterations for the power method. 3 by default. Only
used if svd_method equals ‘randomized’
random_state [int, RandomState instance or None, optional (default=0)] If int, random_state
is the seed used by the random number generator; If RandomState instance, random_state is
the random number generator; If None, the random number generator is the RandomState
instance used by np.random. Only used when svd_method equals ‘randomized’.
Attributes
components_ [array, [n_components, n_features]] Components with maximum variance.
loglike_ [list, [n_iterations]] The log likelihood at each iteration.
noise_variance_ [array, shape=(n_features,)] The estimated noise variance for each feature.
n_iter_ [int] Number of iterations run.
mean_ [array, shape (n_features,)] Per-feature empirical mean, estimated from the training set.
See also:
PCA Principal component analysis is also a latent linear variable model which however assumes equal noise
variance for each feature. This extra assumption makes probabilistic PCA faster as it can be computed in
closed form.
FastICA Independent component analysis, a latent variable model with non-Gaussian latent variables.
References
Examples
Methods
fit(self, X[, y]) Fit the FactorAnalysis model to X using SVD based
approach
fit_transform(self, X[, y]) Fit to data, then transform it.
get_covariance(self) Compute data covariance with the FactorAnalysis
model.
get_params(self[, deep]) Get parameters for this estimator.
get_precision(self) Compute data precision matrix with the FactorAnal-
ysis model.
score(self, X[, y]) Compute the average log-likelihood of the samples
score_samples(self, X) Compute the log-likelihood of each sample
set_params(self, \*\*params) Set the parameters of this estimator.
transform(self, X) Apply dimensionality reduction to X using the
model.
get_covariance(self )
Compute data covariance with the FactorAnalysis model.
cov = components_.T * components_ + diag(noise_variance)
Returns
cov [array, shape (n_features, n_features)] Estimated covariance of data.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_precision(self )
Compute data precision matrix with the FactorAnalysis model.
Returns
precision [array, shape (n_features, n_features)] Estimated precision of data.
score(self, X, y=None)
Compute the average log-likelihood of the samples
Parameters
X [array, shape (n_samples, n_features)] The data
y [Ignored]
Returns
ll [float] Average log-likelihood of the samples under the current model
score_samples(self, X)
Compute the log-likelihood of each sample
Parameters
X [array, shape (n_samples, n_features)] The data
Returns
ll [array, shape (n_samples,)] Log-likelihood of each sample under the current model
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Apply dimensionality reduction to X using the model.
Compute the expected mean of the latent variables. See Barber, 21.2.33 (or Bishop, 12.66).
Parameters
X [array-like, shape (n_samples, n_features)] Training data.
Returns
X_new [array-like, shape (n_samples, n_components)] The latent variables of X.
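A minimal usage sketch on toy data with one shared latent factor (the data and parameter values are
illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import FactorAnalysis
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(100, 6) + rng.randn(100, 1)   # one common factor
>>> fa = FactorAnalysis(n_components=2, random_state=0)
>>> Z = fa.fit_transform(X)
>>> Z.shape, fa.noise_variance_.shape
((100, 2), (6,))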
7.8.3 sklearn.decomposition.FastICA
components_ [2D array, shape (n_components, n_features)] The linear operator to apply to
the data to get the independent sources. This is equal to the unmixing matrix when
whiten is False, and equal to np.dot(unmixing_matrix, self.whitening_)
when whiten is True.
mixing_ [array, shape (n_features, n_components)] The pseudo-inverse of components_. It
is the linear operator that maps independent sources to the data.
mean_ [array, shape (n_features,)] The mean over features. Only set if self.whiten is True.
n_iter_ [int] If the algorithm is “deflation”, n_iter is the maximum number of iterations run
across all components. Otherwise it is the number of iterations taken to converge.
whitening_ [array, shape (n_components, n_features)] Only set if whiten is ‘True’. This is the
pre-whitening matrix that projects data onto the first n_components principal compo-
nents.
Notes
Implementation based on A. Hyvarinen and E. Oja, Independent Component Analysis: Algorithms and Appli-
cations, Neural Networks, 13(4-5), 2000, pp. 411-430
Examples
Methods
7.8.4 sklearn.decomposition.IncrementalPCA
batch_size [int or None, (default=None)] The number of samples to use for each batch. Only
used when calling fit. If batch_size is None, then batch_size is inferred from the
data and set to 5 * n_features, to provide a balance between approximation accuracy
and memory consumption.
Attributes
components_ [array, shape (n_components, n_features)] Components with maximum variance.
explained_variance_ [array, shape (n_components,)] Variance explained by each of the se-
lected components.
explained_variance_ratio_ [array, shape (n_components,)] Percentage of variance explained
by each of the selected components. If all components are stored, the sum of explained
variances is equal to 1.0.
singular_values_ [array, shape (n_components,)] The singular values corresponding to each
of the selected components. The singular values are equal to the 2-norms of the
n_components variables in the lower-dimensional space.
mean_ [array, shape (n_features,)] Per-feature empirical mean, aggregate over calls to
partial_fit.
var_ [array, shape (n_features,)] Per-feature empirical variance, aggregate over calls to
partial_fit.
noise_variance_ [float] The estimated noise covariance following the Probabilistic PCA model
from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C.
Bishop, 12.2.1 p. 574 or http://www.miketipping.com/papers/met-mppca.pdf.
n_components_ [int] The estimated number of components. Relevant when
n_components=None.
n_samples_seen_ [int] The number of samples processed by the estimator. Will be reset on
new calls to fit, but increments across partial_fit calls.
See also:
PCA
KernelPCA
SparsePCA
TruncatedSVD
Notes
Implements the incremental PCA model from: D. Ross, J. Lim, R. Lin, M. Yang, Incremental Learning for
Robust Visual Tracking, International Journal of Computer Vision, Volume 77, Issue 1-3, pp. 125-141, May
2008. See https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf
This model is an extension of the Sequential Karhunen-Loeve Transform from: A. Levy and M. Lindenbaum, Se-
quential Karhunen-Loeve Basis Extraction and its Application to Images, IEEE Transactions on Image Process-
ing, Volume 9, Number 8, pp. 1371-1374, August 2000. See https://www.cs.technion.ac.il/~mic/doc/skl-ip.pdf
We have specifically abstained from an optimization used by authors of both papers, a QR decomposition used
in specific situations to reduce the algorithmic complexity of the SVD. The source for this technique is Matrix
Computations, Third Edition, G. Golub and C. Van Loan, Chapter 5, Section 5.4.4, pp. 252-253. This technique
has been omitted because it is advantageous only when decomposing a matrix with n_samples (rows) >=
5/3 * n_features (columns), and hurts the readability of the implemented algorithm. This would be a good
opportunity for future optimization, if it is deemed necessary.
References
D. Ross, J. Lim, R. Lin, M. Yang. Incremental Learning for Robust Visual Tracking, International Journal of
Computer Vision, Volume 77, Issue 1-3, pp. 125-141, May 2008.
G. Golub and C. Van Loan. Matrix Computations, Third Edition, Chapter 5, Section 5.4.4, pp. 252-253.
Examples
Methods
fit(self, X[, y]) Fit the model with X, using minibatches of size
batch_size.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_covariance(self) Compute data covariance with the generative model.
get_params(self[, deep]) Get parameters for this estimator.
get_precision(self) Compute data precision matrix with the generative
model.
inverse_transform(self, X) Transform data back to its original space.
partial_fit(self, X[, y, check_input]) Incremental fit with X.
set_params(self, \*\*params) Set the parameters of this estimator.
transform(self, X) Apply dimensionality reduction to X.
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes re-
versing whitening.
partial_fit(self, X, y=None, check_input=True)
Incremental fit with X. All of X is processed as a single batch.
Parameters
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
check_input [bool] Run check_array on X.
y [Ignored]
Returns
self [object] Returns the instance itself.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set, using minibatches
of size batch_size if X is sparse.
Parameters
X [array-like, shape (n_samples, n_features)] New data, where n_samples is the number of
samples and n_features is the number of features.
Returns
X_new [array-like, shape (n_samples, n_components)]
Examples
• Incremental PCA
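A minimal usage sketch feeding the data in mini-batches via partial_fit (toy data; parameter values
are illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import IncrementalPCA
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(1000, 20)
>>> ipca = IncrementalPCA(n_components=5)
>>> for batch in np.array_split(X, 10):     # 10 mini-batches of 100 samples
...     _ = ipca.partial_fit(batch)
>>> ipca.transform(X).shape
(1000, 5)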
7.8.5 sklearn.decomposition.KernelPCA
References
Kernel PCA was introduced in: Bernhard Schoelkopf, Alexander J. Smola, and Klaus-Robert Mueller. 1999.
Kernel principal component analysis. In Advances in kernel methods, MIT Press, Cambridge, MA, USA
327-352.
Examples
Methods
References
• Kernel PCA
7.8.6 sklearn.decomposition.LatentDirichletAllocation
class sklearn.decomposition.LatentDirichletAllocation(n_components=10, doc_topic_prior=None,
    topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0,
    max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1,
    mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)
Latent Dirichlet Allocation with online variational Bayes algorithm
New in version 0.17.
Read more in the User Guide.
Parameters
n_components [int, optional (default=10)] Number of topics.
doc_topic_prior [float, optional (default=None)] Prior of document topic distribution theta.
If the value is None, defaults to 1 / n_components. In [1], this is
called alpha.
topic_word_prior [float, optional (default=None)] Prior of topic word distribution beta. If
the value is None, defaults to 1 / n_components. In [1], this is called
eta.
learning_method [‘batch’ | ‘online’, default=’batch’] Method used to update _component.
Only used in fit method. In general, if the data size is large, the online update will be
much faster than the batch update.
Valid options:
References
[2] “Stochastic Variational Inference”, Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley,
2013
[3] Matthew D. Hoffman’s onlineldavb code. Link: https://github.com/blei-lab/onlineldavb
[1] “Online Learning for Latent Dirichlet Allocation”, Matthew D. Hoffman, David M. Blei, Francis Bach, 2010
Examples
Methods
fit(self, X[, y]) Learn model for the data X with variational Bayes
method.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
partial_fit(self, X[, y]) Online VB with Mini-Batch update.
perplexity(self, X[, sub_sampling]) Calculate approximate perplexity for data X.
score(self, X[, y]) Calculate approximate log-likelihood as score.
set_params(self, \*\*params) Set the parameters of this estimator.
transform(self, X) Transform data X according to the fitted model.
Changed in version 0.19: doc_topic_distr argument has been deprecated and is ignored because user no
longer has access to unnormalized distribution
Parameters
X [array-like or sparse matrix, [n_samples, n_features]] Document word matrix.
sub_sampling [bool] Do sub-sampling or not.
Returns
score [float] Perplexity score.
score(self, X, y=None)
Calculate approximate log-likelihood as score.
Parameters
X [array-like or sparse matrix, shape=(n_samples, n_features)] Document word matrix.
y [Ignored]
Returns
score [float] Use approximate bound as score.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform data X according to the fitted model.
Changed in version 0.18: doc_topic_distr is now normalized
Parameters
X [array-like or sparse matrix, shape=(n_samples, n_features)] Document word matrix.
Returns
doc_topic_distr [shape=(n_samples, n_components)] Document topic distribution for X.
• Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
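A minimal usage sketch on a synthetic count matrix (the data generator and parameter values are
illustrative):
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> X, _ = make_multilabel_classification(n_samples=50, n_features=25,
...                                       random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5, random_state=0)
>>> doc_topic = lda.fit_transform(X)      # rows sum to 1
>>> doc_topic.shape, lda.components_.shape
((50, 5), (5, 25))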
7.8.7 sklearn.decomposition.MiniBatchDictionaryLearning
class sklearn.decomposition.MiniBatchDictionaryLearning(n_components=None, alpha=1,
    n_iter=1000, fit_algorithm='lars', n_jobs=None, batch_size=3, shuffle=True, dict_init=None,
    transform_algorithm='omp', transform_n_nonzero_coefs=None, transform_alpha=None,
    verbose=False, split_sign=False, random_state=None, positive_code=False,
    positive_dict=False, transform_max_iter=1000)
Mini-batch dictionary learning
Finds a dictionary (a set of atoms) that can best be used to represent data using a sparse code.
Solves the optimization problem:
the sparse solution threshold: squashes to zero all coefficients less than alpha from the
projection dictionary * X’
transform_n_nonzero_coefs [int, 0.1 * n_features by default] Number of nonzero
coefficients to target in each column of the solution. This is only used by
algorithm='lars' and algorithm='omp' and is overridden by alpha in the omp
case.
transform_alpha [float, 1. by default] If algorithm='lasso_lars' or
algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If
algorithm='threshold', alpha is the absolute value of the threshold below
which coefficients will be squashed to zero. If algorithm='omp', alpha is the
tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides
n_nonzero_coefs.
verbose [bool, optional (default: False)] To control the verbosity of the procedure.
split_sign [bool, False by default] Whether to split the sparse feature vector into the concate-
nation of its negative part and its positive part. This can improve the performance of down-
stream classifiers.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
positive_code [bool] Whether to enforce positivity when finding the code.
New in version 0.20.
positive_dict [bool] Whether to enforce positivity when finding the dictionary.
New in version 0.20.
transform_max_iter [int, optional (default=1000)] Maximum number of iterations to perform
if algorithm='lasso_cd' or lasso_lars.
New in version 0.22.
Attributes
components_ [array, [n_components, n_features]] components extracted from the data
inner_stats_ [tuple of (A, B) ndarrays] Internal sufficient statistics that are kept by the algo-
rithm. Keeping them is useful in online settings, to avoid losing the history of the evolution,
but they shouldn’t have any use for the end user. A (n_components, n_components) is the
dictionary covariance matrix. B (n_features, n_components) is the data approximation ma-
trix
n_iter_ [int] Number of iterations run.
iter_offset_ [int] The number of iterations on data batches that have been performed before.
random_state_ [RandomState] RandomState instance that is generated either from a seed, the
random number generator or by np.random.
See also:
SparseCoder
DictionaryLearning
SparsePCA
MiniBatchSparsePCA
Notes
References:
J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009: Online dictionary learning for sparse coding (https://www.di.ens.
fr/sierra/pdfs/icml09.pdf)
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
partial_fit(self, X, y=None, iter_offset=None)
Updates the model using the data in X as a mini-batch.
Parameters
X [array-like, shape (n_samples, n_features)] Training vector, where n_samples is the num-
ber of samples and n_features is the number of features.
y [Ignored]
iter_offset [integer, optional] The number of iterations on data batches that have been per-
formed before this call to partial_fit. This is optional: if no number is passed, the memory
of the object is used.
Returns
self [object] Returns the instance itself.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Encode the data as a sparse combination of the dictionary atoms.
Coding method is determined by the object parameter transform_algorithm.
Parameters
X [array of shape (n_samples, n_features)] Test data to be transformed, must have the same
number of features as the data used to train the model.
Returns
X_new [array, shape (n_samples, n_components)] Transformed data
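A minimal sketch of incremental fitting over mini-batches (toy data; parameter values are
illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import MiniBatchDictionaryLearning
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(30, 10)
>>> dico = MiniBatchDictionaryLearning(n_components=5, batch_size=5,
...                                    random_state=0)
>>> for batch in np.array_split(X, 6):      # feed the data batch by batch
...     _ = dico.partial_fit(batch)
>>> dico.components_.shape
(5, 10)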
7.8.8 sklearn.decomposition.MiniBatchSparsePCA
See also:
PCA
SparsePCA
DictionaryLearning
Examples
Methods
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Least Squares projection of the data onto the sparse components.
To avoid instability issues in case the system is under-determined, regularization can be applied (Ridge
regression) via the ridge_alpha parameter.
Note that Sparse PCA components orthogonality is not enforced as in PCA hence one cannot use a simple
linear projection.
Parameters
X [array of shape (n_samples, n_features)] Test data to be transformed, must have the same
number of features as the data used to train the model.
Returns
X_new [array, shape (n_samples, n_components)] Transformed data.
7.8.9 sklearn.decomposition.NMF
Where:
For multiplicative-update (‘mu’) solver, the Frobenius norm (0.5 * ||X - WH||_Fro^2) can be changed into
another beta-divergence loss, by changing the beta_loss parameter.
The objective function is minimized with an alternating minimization of W and H.
Read more in the User Guide.
Parameters
n_components [int or None] Number of components, if n_components is not set all features
are kept.
init [None | ‘random’ | ‘nndsvd’ | ‘nndsvda’ | ‘nndsvdar’ | ‘custom’] Method used to initialize
the procedure. Default: None. Valid options:
• None: ‘nndsvd’ if n_components <= min(n_samples, n_features), otherwise random.
• ‘random’: non-negative random matrices, scaled with: sqrt(X.mean() /
n_components)
• ‘nndsvd’: Nonnegative Double Singular Value Decomposition (NNDSVD)
initialization (better for sparseness)
• ‘nndsvda’: NNDSVD with zeros filled with the average of X (better when sparsity is
not desired)
• ‘nndsvdar’: NNDSVD with zeros filled with small random values (generally faster,
less accurate alternative to NNDSVDa for when sparsity is not desired)
• ‘custom’: use custom matrices W and H
solver [‘cd’ | ‘mu’] Numerical solver to use: ‘cd’ is a Coordinate Descent solver. ‘mu’ is a
Multiplicative Update solver.
New in version 0.17: Coordinate Descent solver.
New in version 0.19: Multiplicative Update solver.
References
Cichocki, Andrzej, and P. H. A. N. Anh-Huy. “Fast local algorithms for large scale nonnegative matrix and ten-
sor factorizations.” IEICE transactions on fundamentals of electronics, communications and computer sciences
92.3: 708-721, 2009.
Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. Neural
Computation, 23(9).
Examples
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, W)
Transform data back to its original space.
Parameters
W [{array-like, sparse matrix}, shape (n_samples, n_components)] Transformed data matrix
Returns
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Data matrix of original shape
New in version 0.18.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform the data X according to the fitted NMF model
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Data matrix to be trans-
formed by the model
Returns
W [array, shape (n_samples, n_components)] Transformed data
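A minimal usage sketch on a small non-negative matrix (toy data; parameter values are illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import NMF
>>> X = np.abs(np.random.RandomState(0).randn(6, 4))   # non-negative data
>>> model = NMF(n_components=2, init='random', random_state=0)
>>> W = model.fit_transform(X)
>>> H = model.components_
>>> W.shape, H.shape
((6, 2), (2, 4))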
7.8.10 sklearn.decomposition.PCA
copy [bool, default=True] If False, data passed to fit are overwritten and running
fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
whiten [bool, optional (default False)] When True (False by default) the components_ vec-
tors are multiplied by the square root of n_samples and then divided by the singular values
to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance
scales of the components) but can sometime improve the predictive accuracy of the down-
stream estimators by making their data respect some hard-wired assumptions.
svd_solver [str {‘auto’, ‘full’, ‘arpack’, ‘randomized’}]
If auto : The solver is selected by a default policy based on X.shape and
n_components: if the input data is larger than 500x500 and the number of compo-
nents to extract is lower than 80% of the smallest dimension of the data, then the more
efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and
optionally truncated afterwards.
If full : run exact full SVD calling the standard LAPACK solver via scipy.linalg.
svd and select the components by postprocessing
If arpack : run SVD truncated to n_components calling ARPACK solver via scipy.
sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
If randomized : run randomized SVD by the method of Halko et al.
New in version 0.18.0.
tol [float >= 0, optional (default .0)] Tolerance for singular values computed by svd_solver ==
‘arpack’.
New in version 0.18.0.
iterated_power [int >= 0, or ‘auto’, (default ‘auto’)] Number of iterations for the power method
computed by svd_solver == ‘randomized’.
New in version 0.18.0.
random_state [int, RandomState instance or None, optional (default None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or
‘randomized’.
New in version 0.18.0.
Attributes
components_ [array, shape (n_components, n_features)] Principal axes in feature space, rep-
resenting the directions of maximum variance in the data. The components are sorted by
explained_variance_.
explained_variance_ [array, shape (n_components,)] The amount of variance explained by
each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
New in version 0.18.
explained_variance_ratio_ [array, shape (n_components,)] Percentage of variance explained
by each of the selected components.
If n_components is not set then all components are stored and the sum of the ratios is
equal to 1.0.
singular_values_ [array, shape (n_components,)] The singular values corresponding to each
of the selected components. The singular values are equal to the 2-norms of the
n_components variables in the lower-dimensional space.
New in version 0.19.
mean_ [array, shape (n_features,)] Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
n_components_ [int] The estimated number of components. When n_components is set to
‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’) this number is estimated
from input data. Otherwise it equals the parameter n_components, or the lesser value of
n_features and n_samples if n_components is None.
n_features_ [int] Number of features in the training data.
n_samples_ [int] Number of samples in the training data.
noise_variance_ [float] The estimated noise covariance following the Probabilistic PCA model
from Tipping and Bishop 1999. See “Pattern Recognition and Machine Learning” by C.
References
For n_components == ‘mle’, this class uses the method of Minka, T. P. “Automatic choice of dimensionality for
PCA”. In NIPS, pp. 598-604
Implements the probabilistic PCA model from: Tipping, M. E., and Bishop, C. M. (1999). “Probabilistic
principal component analysis”. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
61(3), 611-622. via the score and score_samples methods. See http://www.miketipping.com/papers/met-mppca.
pdf
For svd_solver == ‘arpack’, refer to scipy.sparse.linalg.svds.
For svd_solver == ‘randomized’, see: Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). “Finding structure
with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”. SIAM review,
53(2), 217-288. and also Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). “A randomized algorithm for
the decomposition of matrices”. Applied and Computational Harmonic Analysis, 30(1), 47-68.
Examples
Methods
Notes
This method returns a Fortran-ordered array. To convert it to a C-ordered array, use ‘np.ascontiguousarray’.
get_covariance(self )
Compute data covariance with the generative model.
cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)
where S**2 contains the explained variances, and sigma2 contains the noise variances.
Returns
cov [array, shape=(n_features, n_features)] Estimated covariance of data.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_precision(self )
Compute data precision matrix with the generative model.
Equals the inverse of the covariance but computed with the matrix inversion lemma for efficiency.
Returns
precision [array, shape=(n_features, n_features)] Estimated precision of data.
inverse_transform(self, X)
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
Parameters
X [array-like, shape (n_samples, n_components)] New data, where n_samples is the number
of samples and n_components is the number of components.
Returns
X_original [array-like, shape (n_samples, n_features)]
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes re-
versing whitening.
score(self, X, y=None)
Return the average log-likelihood of all samples.
See. “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.
com/papers/met-mppca.pdf
Parameters
X [array, shape(n_samples, n_features)] The data.
y [None] Ignored variable.
Returns
ll [float] Average log-likelihood of the samples under the current model.
score_samples(self, X)
Return the log-likelihood of each sample.
See. “Pattern Recognition and Machine Learning” by C. Bishop, 12.2.1 p. 574 or http://www.miketipping.
com/papers/met-mppca.pdf
Parameters
X [array, shape(n_samples, n_features)] The data.
Returns
ll [array, shape (n_samples,)] Log-likelihood of each sample under the current model.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
Parameters
X [array-like, shape (n_samples, n_features)] New data, where n_samples is the number of
samples and n_features is the number of features.
Returns
X_new [array-like, shape (n_samples, n_components)]
Examples
• Multilabel classification
• Explicit feature map approximation for RBF kernels
• A demo of K-Means clustering on the handwritten digits data
• The Iris Dataset
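A minimal usage sketch illustrating the attributes described above (toy data; parameter values are
illustrative):
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(200, 10)
>>> pca = PCA(n_components=3, svd_solver='full', random_state=0)
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape, pca.components_.shape
((200, 3), (3, 10))
>>> pca.explained_variance_ratio_.shape     # fraction of variance per component
(3,)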
7.8.11 sklearn.decomposition.SparsePCA
method [{‘lars’, ‘cd’}] lars: uses the least angle regression method to solve the lasso prob-
lem (linear_model.lars_path) cd: uses the coordinate descent method to compute the Lasso
solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
n_jobs [int or None, optional (default=None)] Number of parallel jobs to run. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
U_init [array of shape (n_samples, n_components),] Initial values for the loadings for warm
restart scenarios.
V_init [array of shape (n_components, n_features),] Initial values for the components for warm
restart scenarios.
verbose [int] Controls the verbosity; the higher, the more messages. Defaults to 0.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
normalize_components [‘deprecated’] This parameter does not have any effect. The compo-
nents are always normalized.
New in version 0.20.
Deprecated since version 0.22: normalize_components is deprecated in 0.22 and will
be removed in 0.24.
Attributes
components_ [array, [n_components, n_features]] Sparse components extracted from the data.
error_ [array] Vector of errors at each iteration.
n_iter_ [int] Number of iterations run.
mean_ [array, shape (n_features,)] Per-feature empirical mean, estimated from the training set.
Equal to X.mean(axis=0).
See also:
PCA
MiniBatchSparsePCA
DictionaryLearning
Examples
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Least Squares projection of the data onto the sparse components.
To avoid instability issues in case the system is under-determined, regularization can be applied (Ridge
regression) via the ridge_alpha parameter.
Note that Sparse PCA components orthogonality is not enforced as in PCA hence one cannot use a simple
linear projection.
Parameters
X [array of shape (n_samples, n_features)] Test data to be transformed, must have the same
number of features as the data used to train the model.
Returns
X_new [array, shape (n_samples, n_components)] Transformed data.
7.8.12 sklearn.decomposition.SparseCoder
X ~= code * dictionary
will be faster if the estimated components are sparse. omp: uses orthogonal matching pur-
suit to estimate the sparse solution threshold: squashes to zero all coefficients less than alpha
from the projection dictionary * X'
transform_n_nonzero_coefs [int, default=0.1*n_features] Number of nonzero coefficients to
target in each column of the solution. This is only used by algorithm='lars' and
algorithm='omp' and is overridden by alpha in the omp case.
transform_alpha [float, default=1.] If algorithm='lasso_lars' or
algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If
algorithm='threshold', alpha is the absolute value of the threshold below
which coefficients will be squashed to zero. If algorithm='omp', alpha is the
tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides
n_nonzero_coefs.
split_sign [bool, default=False] Whether to split the sparse feature vector into the concatenation
of its negative part and its positive part. This can improve the performance of downstream
classifiers.
n_jobs [int or None, default=None] Number of parallel jobs to run. None means 1 unless in
a joblib.parallel_backend context. -1 means using all processors. See Glossary
for more details.
positive_code [bool, default=False] Whether to enforce positivity when finding the code.
New in version 0.20.
transform_max_iter [int, default=1000] Maximum number of iterations to perform if
algorithm='lasso_cd' or lasso_lars.
New in version 0.22.
Attributes
components_ [array, [n_components, n_features]] The unchanged dictionary atoms
See also:
DictionaryLearning
MiniBatchDictionaryLearning
SparsePCA
MiniBatchSparsePCA
sparse_encode
Methods
Parameters
X [array of shape (n_samples, n_features)] Test data to be transformed, must have the same
number of features as the data used to train the model.
Returns
X_new [array, shape (n_samples, n_components)] Transformed data
7.8.13 sklearn.decomposition.TruncatedSVD
PCA
Notes
SVD suffers from a problem called “sign indeterminacy”, which means the sign of the components_ and the
output from transform depend on the algorithm and random state. To work around this, fit instances of this class
to data once, then keep the instance around to do transformations.
References
Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions,
Halko et al., 2009 (arXiv:0909.4061) https://arxiv.org/pdf/0909.4061.pdf
Examples
Methods
fit(self, X, y=None)
Fit LSI model on training data X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Training data.
y [Ignored]
Returns
self [object] Returns the transformer object.
fit_transform(self, X, y=None)
Fit LSI model to X and perform dimensionality reduction on X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Training data.
y [Ignored]
Returns
X_new [array, shape (n_samples, n_components)] Reduced version of X. This will always
be a dense array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, X)
Transform X back to its original space.
Returns an array X_original whose transform would be X.
Parameters
X [array-like, shape (n_samples, n_components)] New data.
Returns
X_original [array, shape (n_samples, n_features)] Note that this is always a dense array.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Perform dimensionality reduction on X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] New data.
Returns
X_new [array, shape (n_samples, n_components)] Reduced version of X. This will always
be a dense array.
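A minimal usage sketch on a sparse input matrix (toy data; since TruncatedSVD does not center the
data, it can work with scipy.sparse matrices directly):
>>> import scipy.sparse as sp
>>> from sklearn.decomposition import TruncatedSVD
>>> X = sp.random(100, 50, density=0.05, random_state=0)
>>> svd = TruncatedSVD(n_components=5, random_state=0)
>>> X_reduced = svd.fit_transform(X)
>>> X_reduced.shape
(100, 5)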
7.8.14 sklearn.decomposition.dict_learning
dict_learning_online
DictionaryLearning
MiniBatchDictionaryLearning
SparsePCA
MiniBatchSparsePCA
7.8.15 sklearn.decomposition.dict_learning_online
where V is the dictionary and U is the sparse code. This is accomplished by repeatedly iterating over mini-
batches by slicing the input data.
Read more in the User Guide.
Parameters
X [array of shape (n_samples, n_features)] Data matrix.
n_components [int,] Number of dictionary atoms to extract.
alpha [float,] Sparsity controlling parameter.
n_iter [int,] Number of mini-batch iterations to perform.
return_code [boolean,] Whether to also return the code U or just the dictionary V.
dict_init [array of shape (n_components, n_features),] Initial value for the dictionary for warm
restart scenarios.
callback [callable or None, optional (default: None)] callable that gets invoked every five iter-
ations
batch_size [int,] The number of samples to take in each batch.
verbose [bool, optional (default: False)] To control the verbosity of the procedure.
shuffle [boolean,] Whether to shuffle the data before splitting it in batches.
n_jobs [int or None, optional (default=None)] Number of parallel jobs to run. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
method [{‘lars’, ‘cd’}] lars: uses the least angle regression method to solve the lasso prob-
lem (linear_model.lars_path) cd: uses the coordinate descent method to compute the Lasso
solution (linear_model.Lasso). Lars will be faster if the estimated components are sparse.
iter_offset [int, default 0] Number of previous iterations completed on the dictionary used for
initialization.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
dict_learning
DictionaryLearning
MiniBatchDictionaryLearning
SparsePCA
MiniBatchSparsePCA
7.8.16 sklearn.decomposition.fastica
Notes
The data matrix X is considered to be a linear combination of non-Gaussian (independent) components i.e. X
= AS where columns of S contain the independent components and A is a linear mixing matrix. In short ICA
7.8.17 sklearn.decomposition.non_negative_factorization
Where:
For multiplicative-update (‘mu’) solver, the Frobenius norm (0.5 * ||X - WH||_Fro^2) can be changed into
another beta-divergence loss, by changing the beta_loss parameter.
The objective function is minimized with an alternating minimization of W and H. If H is given and up-
date_H=False, it solves for W only.
Parameters
X [array-like, shape (n_samples, n_features)] Constant matrix.
W [array-like, shape (n_samples, n_components)] If init=’custom’, it is used as initial guess for
the solution.
H [array-like, shape (n_components, n_features)] If init=’custom’, it is used as initial guess for
the solution. If update_H=False, it is used as a constant, to solve for W only.
n_components [integer] Number of components, if n_components is not set all features are
kept.
init [None | ‘random’ | ‘nndsvd’ | ‘nndsvda’ | ‘nndsvdar’ | ‘custom’] Method used to initialize
the procedure. Default: None.
Valid options:
• None: ‘nndsvd’ if n_components < n_features, otherwise ‘random’.
• ‘random’: non-negative random matrices, scaled with: sqrt(X.mean() /
n_components)
• ‘nndsvd’: Nonnegative Double Singular Value Decomposition (NNDSVD)
initialization (better for sparseness)
• ‘nndsvda’: NNDSVD with zeros filled with the average of X (better when sparsity is
not desired)
• ‘nndsvdar’: NNDSVD with zeros filled with small random values (generally faster,
less accurate alternative to NNDSVDa for when sparsity is not desired)
• ‘custom’: use custom matrices W and H
Changed in version 0.23: The default value of init changed from ‘random’ to None in
0.23.
update_H [boolean, default: True] If set to True, both W and H will be estimated from initial
guesses. If set to False, only W will be estimated.
solver [‘cd’ | ‘mu’] Numerical solver to use:
• ‘cd’ is a Coordinate Descent solver that uses Fast Hierarchical Alternating Least
Squares (Fast HALS).
• ‘mu’ is a Multiplicative Update solver.
New in version 0.17: Coordinate Descent solver.
New in version 0.19: Multiplicative Update solver.
beta_loss [float or string, default ‘frobenius’] String must be in {‘frobenius’, ‘kullback-leibler’,
‘itakura-saito’}. Beta divergence to be minimized, measuring the distance between X and
the dot product WH. Note that values different from ‘frobenius’ (or 2) and ‘kullback-leibler’
(or 1) lead to significantly slower fits. Note that for beta_loss <= 0 (or ‘itakura-saito’), the
input matrix X cannot contain zeros. Used only in ‘mu’ solver.
New in version 0.19.
tol [float, default: 1e-4] Tolerance of the stopping condition.
max_iter [integer, default: 200] Maximum number of iterations before timing out.
alpha [double, default: 0.] Constant that multiplies the regularization terms.
l1_ratio [double, default: 0.] The regularization mixing parameter, with 0 <= l1_ratio <= 1. For
l1_ratio = 0 the penalty is an elementwise L2 penalty (aka Frobenius Norm). For l1_ratio =
1 it is an elementwise L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1
and L2.
regularization [‘both’ | ‘components’ | ‘transformation’ | None] Select whether the regulariza-
tion affects the components (H), the transformation (W), both or none of them.
random_state [int, RandomState instance or None, optional, default: None] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
verbose [integer, default: 0] The verbosity level.
shuffle [boolean, default: False] If true, randomize the order of coordinates in the CD solver.
Returns
W [array-like, shape (n_samples, n_components)] Solution to the non-negative least squares
problem.
H [array-like, shape (n_components, n_features)] Solution to the non-negative least squares
problem.
n_iter [int] Actual number of iterations.
References
Cichocki, Andrzej, and P. H. A. N. Anh-Huy. “Fast local algorithms for large scale nonnegative matrix and ten-
sor factorizations.” IEICE transactions on fundamentals of electronics, communications and computer sciences
92.3: 708-721, 2009.
Fevotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. Neural
Computation, 23(9).
Examples
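A minimal usage sketch on a toy matrix; the data values and settings below are illustrative only:
>>> import numpy as np
>>> from sklearn.decomposition import non_negative_factorization
>>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> W, H, n_iter = non_negative_factorization(X, n_components=2,
...                                           init='random', random_state=0)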
7.8.18 sklearn.decomposition.sparse_encode
Sparse coding: each row of the result is the solution to a sparse coding problem. The goal is to find a sparse
array code such that:
X ~= code * dictionary
algorithm [{‘lasso_lars’, ‘lasso_cd’, ‘lars’, ‘omp’, ‘threshold’}]
• ‘lars’: uses the least angle regression method (linear_model.lars_path)
• ‘lasso_lars’: uses Lars to compute the Lasso solution
• ‘lasso_cd’: uses the coordinate descent method to compute the Lasso solution (linear_model.Lasso).
lasso_lars will be faster if the estimated components are sparse.
• ‘omp’: uses orthogonal matching pursuit to estimate the sparse solution
• ‘threshold’: squashes to zero all coefficients less than alpha from the projection dictionary * X’
n_nonzero_coefs [int, 0.1 * n_features by default] Number of nonzero coefficients to tar-
get in each column of the solution. This is only used by algorithm='lars' and
algorithm='omp' and is overridden by alpha in the omp case.
alpha [float, 1. by default] If algorithm='lasso_lars' or
algorithm='lasso_cd', alpha is the penalty applied to the L1 norm. If
algorithm='threshold', alpha is the absolute value of the threshold below
which coefficients will be squashed to zero. If algorithm='omp', alpha is the
tolerance parameter: the value of the reconstruction error targeted. In this case, it overrides
n_nonzero_coefs.
copy_cov [boolean, optional] Whether to copy the precomputed covariance matrix; if False, it
may be overwritten.
init [array of shape (n_samples, n_components)] Initialization value of the sparse codes. Only
used if algorithm='lasso_cd'.
max_iter [int, 1000 by default] Maximum number of iterations to perform if
algorithm='lasso_cd' or lasso_lars.
n_jobs [int or None, optional (default=None)] Number of parallel jobs to run. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
check_input [boolean, optional] If False, the input arrays X and dictionary will not be checked.
verbose [int, optional] Controls the verbosity; the higher, the more messages. Defaults to 0.
positive [boolean, optional] Whether to enforce positivity when finding the encoding.
New in version 0.20.
Returns
code [array of shape (n_samples, n_components)] The sparse codes
See also:
sklearn.linear_model.lars_path
sklearn.linear_model.orthogonal_mp
sklearn.linear_model.Lasso
SparseCoder
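A minimal sketch of a typical call; the random signal matrix and dictionary below are illustrative assumptions only:
>>> import numpy as np
>>> from sklearn.decomposition import sparse_encode
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(4, 8)             # 4 samples, 8 features
>>> dictionary = rng.randn(3, 8)    # 3 dictionary atoms
>>> code = sparse_encode(X, dictionary, algorithm='omp', n_nonzero_coefs=2)
>>> code.shape                      # (n_samples, n_components)
(4, 3)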
7.9.1 sklearn.discriminant_analysis.LinearDiscriminantAnalysis
class sklearn.discriminant_analysis.LinearDiscriminantAnalysis(solver=’svd’, shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001)
Linear Discriminant Analysis
A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using
Bayes’ rule.
The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discrim-
inative directions.
New in version 0.17: LinearDiscriminantAnalysis.
Read more in the User Guide.
Parameters
solver [string, optional]
Solver to use, possible values:
• ‘svd’: Singular value decomposition (default). Does not compute the covariance ma-
trix, therefore this solver is recommended for data with a large number of features.
• ‘lsqr’: Least squares solution, can be combined with shrinkage.
• ‘eigen’: Eigenvalue decomposition, can be combined with shrinkage.
shrinkage [string or float, optional]
Shrinkage parameter, possible values:
• None: no shrinkage (default).
• ‘auto’: automatic shrinkage using the Ledoit-Wolf lemma.
• float between 0 and 1: fixed shrinkage parameter.
Note that shrinkage works only with ‘lsqr’ and ‘eigen’ solvers.
priors [array, optional, shape (n_classes,)] Class priors.
n_components [int, optional (default=None)] Number of components (<= min(n_classes -
1, n_features)) for dimensionality reduction. If None, will be set to min(n_classes - 1,
n_features).
store_covariance [bool, optional] Additionally compute class covariance matrix (default
False), used only in ‘svd’ solver.
New in version 0.17.
tol [float, optional, (default 1.0e-4)] Threshold used for rank estimation in SVD solver.
New in version 0.17.
Attributes
coef_ [array, shape (n_features,) or (n_classes, n_features)] Weight vector(s).
intercept_ [array, shape (n_classes,)] Intercept term.
covariance_ [array-like, shape (n_features, n_features)] Covariance matrix (shared by all
classes).
explained_variance_ratio_ [array, shape (n_components,)] Percentage of variance explained
by each of the selected components. If n_components is not set then all components are
stored and the sum of explained variances is equal to 1.0. Only available when eigen or svd
solver is used.
means_ [array-like, shape (n_classes, n_features)] Class means.
priors_ [array-like, shape (n_classes,)] Class priors (sum to 1).
scalings_ [array-like, shape (rank, n_classes - 1)] Scaling of the features in the space spanned
by the class centroids.
xbar_ [array-like, shape (n_features,)] Overall mean.
classes_ [array-like, shape (n_classes,)] Unique class labels.
See also:
Notes
The default solver is ‘svd’. It can perform both classification and transform, and it does not rely on the calcu-
lation of the covariance matrix. This can be an advantage in situations where the number of features is large.
However, the ‘svd’ solver cannot be used with shrinkage.
The ‘lsqr’ solver is an efficient algorithm that only works for classification. It supports shrinkage.
The ‘eigen’ solver is based on the optimization of the between class scatter to within class scatter ratio. It can
be used for both classification and transform, and it supports shrinkage. However, the ‘eigen’ solver needs to
compute the covariance matrix, so it might not be suitable for situations with a high number of features.
Examples
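A minimal usage sketch on a toy dataset; data values are illustrative and the printed output is indicative:
>>> import numpy as np
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LinearDiscriminantAnalysis()
>>> clf.fit(X, y)
LinearDiscriminantAnalysis()
>>> print(clf.predict([[-0.8, -1]]))
[1]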
Methods
Parameters
X [array-like, shape (n_samples, n_features)] Training data.
y [array, shape (n_samples,)] Target values.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Project data to maximize class separation.
Parameters
X [array-like, shape (n_samples, n_features)] Input data.
Returns
X_new [array, shape (n_samples, n_components)] Transformed data.
7.9.2 sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
class sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0, store_covariance=False, tol=0.0001)
Quadratic Discriminant Analysis
A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and
using Bayes’ rule.
The model fits a Gaussian density to each class.
New in version 0.17: QuadraticDiscriminantAnalysis
Read more in the User Guide.
Parameters
priors [array, optional, shape = [n_classes]] Priors on classes
reg_param [float, optional] Regularizes the covariance estimate as
(1-reg_param)*Sigma + reg_param*np.eye(n_features)
store_covariance [boolean] If True the covariance matrices are computed and stored in the
self.covariance_ attribute.
New in version 0.17.
tol [float, optional, default 1.0e-4] Threshold used for rank estimation.
New in version 0.17.
Attributes
covariance_ [list of array-like of shape (n_features, n_features)] Covariance matrices of each
class.
means_ [array-like of shape (n_classes, n_features)] Class means.
priors_ [array-like of shape (n_classes)] Class priors (sum to 1).
rotations_ [list of arrays] For each class k an array of shape [n_features, n_k], with n_k =
min(n_features, number of elements in class k) It is the rotation of the
Gaussian distribution, i.e. its principal axis.
scalings_ [list of arrays] For each class k an array of shape [n_k]. It contains the scaling of the
Gaussian distributions along its principal axes, i.e. the variance in the rotated coordinate
system.
classes_ [array-like, shape (n_classes,)] Unique class labels.
See also:
Examples
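A minimal usage sketch on a toy dataset; data values are illustrative and the printed output is indicative:
>>> import numpy as np
>>> from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = QuadraticDiscriminantAnalysis()
>>> clf.fit(X, y)
QuadraticDiscriminantAnalysis()
>>> print(clf.predict([[-0.8, -1]]))
[1]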
Methods
decision_function(self, X)
Apply decision function to an array of samples.
Parameters
X [array-like of shape (n_samples, n_features)] Array of samples (test vectors).
Returns
C [ndarray of shape (n_samples,) or (n_samples, n_classes)] Decision function values re-
lated to each class, per sample. In the two-class case, the shape is [n_samples,], giving the
log likelihood ratio of the positive class.
fit(self, X, y)
Fit the model according to the given training data and parameters.
Changed in version 0.19: store_covariances has been moved to main constructor as
store_covariance
Changed in version 0.19: tol has been moved to main constructor.
Parameters
X [array-like of shape (n_samples, n_features)] Training vector, where n_samples is the
number of samples and n_features is the number of features.
y [array, shape = [n_samples]] Target values (integers)
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Perform classification on an array of test vectors X.
The predicted class C for each sample in X is returned.
Parameters
X [array-like of shape (n_samples, n_features)]
Returns
C [ndarray of shape (n_samples,)]
predict_log_proba(self, X)
Return posterior probabilities of classification.
Parameters
X [array-like of shape (n_samples, n_features)] Array of samples/test vectors.
Returns
C [ndarray of shape (n_samples, n_classes)] Posterior log-probabilities of classification per
class.
predict_proba(self, X)
Return posterior probabilities of classification.
Parameters
X [array-like of shape (n_samples, n_features)] Array of samples/test vectors.
Returns
C [ndarray of shape (n_samples, n_classes)] Posterior probabilities of classification per
class.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
• Classifier comparison
• Linear and Quadratic Discriminant Analysis with covariance ellipsoid
User guide: See the Metrics and scoring: quantifying the quality of predictions section for further details.
7.10.1 sklearn.dummy.DummyClassifier
Examples
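A minimal usage sketch; the toy data below is illustrative only:
>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> X = np.array([-1, 1, 1, 1]).reshape(-1, 1)
>>> y = np.array([0, 1, 1, 1])
>>> dummy_clf = DummyClassifier(strategy="most_frequent")
>>> dummy_clf.fit(X, y)
DummyClassifier(strategy='most_frequent')
>>> dummy_clf.predict(X)
array([1, 1, 1, 1])
>>> dummy_clf.score(X, y)
0.75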
Methods
X [{array-like, object with finite length or shape}] Training data, requires length =
n_samples
Returns
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] Predicted target values for X.
predict_log_proba(self, X)
Return log probability estimates for the test vectors X.
Parameters
X [{array-like, object with finite length or shape}] Training data, requires length =
n_samples
Returns
P [array-like or list of array-like of shape (n_samples, n_classes)] Returns the log probability
of the sample for each class in the model, where classes are ordered arithmetically for each
output.
predict_proba(self, X)
Return probability estimates for the test vectors X.
Parameters
X [{array-like, object with finite length or shape}] Training data, requires length =
n_samples
Returns
P [array-like or list of array-like of shape (n_samples, n_classes)] Returns the probability of
the sample for each class in the model, where classes are ordered arithmetically, for each
output.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [{array-like, None}] Test samples with shape = (n_samples, n_features) or None. Passing
None as test samples gives the same result as passing real test samples, since DummyClas-
sifier operates independently of the sampled observations.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.10.2 sklearn.dummy.DummyRegressor
Examples
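A minimal usage sketch; the toy data below is illustrative only:
>>> import numpy as np
>>> from sklearn.dummy import DummyRegressor
>>> X = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
>>> y = np.array([2.0, 3.0, 5.0, 10.0])
>>> dummy_regr = DummyRegressor(strategy="mean")
>>> dummy_regr.fit(X, y)
DummyRegressor()
>>> dummy_regr.predict(X)
array([5., 5., 5., 5.])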
Methods
Parameters
X [{array-like, None}] Test samples with shape = (n_samples, n_features) or None. For
some estimators this may be a precomputed kernel matrix instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting
for the estimator. Passing None as test samples gives the same result as passing real test
samples, since DummyRegressor operates independently of the sampled observations.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
The sklearn.ensemble module includes ensemble-based methods for classification, regression and anomaly de-
tection.
User guide: See the Ensemble methods section for further details.
7.11.1 sklearn.ensemble.AdaBoostClassifier
AdaBoostRegressor An AdaBoost regressor that begins by fitting a regressor on the original dataset and
then fits additional copies of the regressor on the same dataset but where the weights of instances are
adjusted according to the error of the current prediction.
GradientBoostingClassifier GB builds an additive model in a forward stage-wise fashion. Regres-
sion trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary
classification is a special case where only a single regression tree is induced.
sklearn.tree.DecisionTreeClassifier A non-parametric supervised learning method used for
classification. Creates a model that predicts the value of a target variable by learning simple decision
rules inferred from the data features.
References
[R33e4ec8c4ad5-1], [R33e4ec8c4ad5-2]
Examples
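A minimal usage sketch on a synthetic dataset; the reported numbers are indicative:
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
>>> clf.score(X, y)
0.98...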
Methods
Returns
feature_importances_ [ndarray of shape (n_features,)] The feature importances.
fit(self, X, y, sample_weight=None)
Build a boosted classifier from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
y [array-like of shape (n_samples,)] The target values (class labels).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
the sample weights are initialized to 1 / n_samples.
Returns
self [object] Fitted estimator.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict classes for X.
The predicted class of an input sample is computed as the weighted mean prediction of the classifiers in
the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
Returns
y [ndarray of shape (n_samples,)] The predicted classes.
predict_log_proba(self, X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the weighted mean predicted class
log-probabilities of the classifiers in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
Returns
p [ndarray of shape (n_samples, n_classes)] The class probabilities of the input samples.
The order of outputs is the same as that of the classes_ attribute.
predict_proba(self, X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the weighted mean predicted class
probabilities of the classifiers in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
Returns
p [ndarray of shape (n_samples, n_classes)] The class probabilities of the input samples.
The order of outputs is the same as that of the classes_ attribute.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
This generator method yields the ensemble predicted class probabilities after each iteration of boosting
and therefore allows monitoring, such as to determine the predicted class probabilities on a test set after
each boost.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
Yields
p [generator of ndarray of shape (n_samples,)] The class probabilities of the input samples.
The order of outputs is the same as that of the classes_ attribute.
staged_score(self, X, y, sample_weight=None)
Return staged scores for X, y.
This generator method yields the ensemble score after each iteration of boosting and therefore allows
monitoring, such as to determine the score on a test set after each boost.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
y [array-like of shape (n_samples,)] Labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Yields
z [float]
• Classifier comparison
• Two-class AdaBoost
• Multi-class AdaBoosted Decision Trees
• Discrete versus Real AdaBoost
• Plot the decision surfaces of ensembles of trees on the iris dataset
7.11.2 sklearn.ensemble.AdaBoostRegressor
Parameters
base_estimator [object, default=None] The base estimator from which
the boosted ensemble is built. If None, then the base estimator is
DecisionTreeRegressor(max_depth=3).
n_estimators [int, default=50] The maximum number of estimators at which boosting is termi-
nated. In case of perfect fit, the learning procedure is stopped early.
learning_rate [float, default=1.] Learning rate shrinks the contribution of each regres-
sor by learning_rate. There is a trade-off between learning_rate and
n_estimators.
loss [{‘linear’, ‘square’, ‘exponential’}, default=’linear’] The loss function to use when updat-
ing the weights after each boosting iteration.
random_state [int, RandomState instance, default=None] If int, random_state is the seed used
by the random number generator; If RandomState instance, random_state is the random
number generator; If None, the random number generator is the RandomState instance used
by np.random.
Attributes
base_estimator_ [estimator] The base estimator from which the ensemble is grown.
estimators_ [list of classifiers] The collection of fitted sub-estimators.
estimator_weights_ [array of floats] Weights for each estimator in the boosted ensemble.
estimator_errors_ [array of floats] Regression error for each estimator in the boosted ensem-
ble.
feature_importances_ [ndarray of shape (n_features,)] Return the feature importances
(the higher, the more important the feature).
See also:
AdaBoostClassifier, GradientBoostingRegressor
sklearn.tree.DecisionTreeRegressor
References
[R0c261b7dee9d-1], [R0c261b7dee9d-2]
Examples
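A minimal usage sketch on a synthetic regression dataset; the reported numbers are indicative:
>>> from sklearn.ensemble import AdaBoostRegressor
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_features=4, n_informative=2,
...                        random_state=0, shuffle=False)
>>> regr = AdaBoostRegressor(random_state=0, n_estimators=100)
>>> regr.fit(X, y)
AdaBoostRegressor(n_estimators=100, random_state=0)
>>> regr.predict([[0, 0, 0, 0]])
array([4.7...])
>>> regr.score(X, y)
0.97...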
Methods
fit(self, X, y[, sample_weight]) Build a boosted regressor from the training set (X, y).
get_params(self[, deep]) Get parameters for this estimator.
predict(self, X) Predict regression value for X.
score(self, X, y[, sample_weight]) Return the coefficient of determination R^2 of the prediction.
set_params(self, \*\*params) Set the parameters of this estimator.
staged_predict(self, X) Return staged predictions for X.
staged_score(self, X, y[, sample_weight]) Return staged scores for X, y.
Returns
feature_importances_ [ndarray of shape (n_features,)] The feature importances.
fit(self, X, y, sample_weight=None)
Build a boosted regressor from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
y [array-like of shape (n_samples,)] The target values (real numbers).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
the sample weights are initialized to 1 / n_samples.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict regression value for X.
The predicted regression value of an input sample is computed as the weighted median prediction of the
regressors in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
Returns
y [ndarray of shape (n_samples,)] The predicted regression values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
The predicted regression value of an input sample is computed as the weighted median prediction of the
regressors in the ensemble.
This generator method yields the ensemble prediction after each iteration of boosting and therefore allows
monitoring, such as to determine the prediction on a test set after each boost.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Yields
y [generator of ndarray of shape (n_samples,)] The predicted regression values.
staged_score(self, X, y, sample_weight=None)
Return staged scores for X, y.
This generator method yields the ensemble score after each iteration of boosting and therefore allows
monitoring, such as to determine the score on a test set after each boost.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrix can be CSC, CSR, COO, DOK, or LIL. COO, DOK, and LIL are converted
to CSR.
y [array-like of shape (n_samples,)] Labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Yields
z [float]
7.11.3 sklearn.ensemble.BaggingClassifier
References
Examples
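A minimal usage sketch using an SVC base estimator on synthetic data; values are illustrative:
>>> from sklearn.svm import SVC
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = BaggingClassifier(base_estimator=SVC(),
...                         n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict([[0, 0, 0, 0]])
array([1])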
Methods
The columns correspond to the classes in sorted order, as they appear in the attribute
classes_. Regression and binary classification are special cases with k == 1, other-
wise k==n_classes.
property estimators_samples_
The subset of drawn samples for each base estimator.
Returns a dynamically generated list of indices identifying the samples used for fitting each member of the
ensemble, i.e., the in-bag samples.
Note: the list is re-created at each call to the property in order to reduce the object memory footprint by
not storing the sampling data. Thus fetching the property may be slower than expected.
fit(self, X, y, sample_weight=None)
Build a Bagging ensemble of estimators from the training set (X, y).
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
y [array-like of shape (n_samples,)] The target values (class labels in classification, real
numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Note that this is supported only if the base estimator
supports sample weighting.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict class for X.
The predicted class of an input sample is computed as the class with the highest mean predicted probability.
If base estimators do not implement a predict_proba method, then it resorts to voting.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
Returns
y [ndarray of shape (n_samples,)] The predicted classes.
predict_log_proba(self, X)
Predict class log-probabilities for X.
The predicted class log-probabilities of an input sample are computed as the log of the mean predicted class
probabilities of the base estimators in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
Returns
p [ndarray of shape (n_samples, n_classes)] The class log-probabilities of the input samples.
The order of the classes corresponds to that in the attribute classes_.
predict_proba(self, X)
Predict class probabilities for X.
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities
of the base estimators in the ensemble. If base estimators do not implement a predict_proba method,
then it resorts to voting and the predicted class probabilities of an input sample represent the proportion
of estimators predicting each class.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
Returns
p [ndarray of shape (n_samples, n_classes)] The class probabilities of the input samples.
The order of the classes corresponds to that in the attribute classes_.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.11.4 sklearn.ensemble.BaggingRegressor
random_state [int, RandomState instance, default=None] If int, random_state is the seed used
by the random number generator; If RandomState instance, random_state is the random
number generator; If None, the random number generator is the RandomState instance used
by np.random.
verbose [int, default=0] Controls the verbosity when fitting and predicting.
Attributes
base_estimator_ [estimator] The base estimator from which the ensemble is grown.
n_features_ [int] The number of features when fit is performed.
estimators_ [list of estimators] The collection of fitted sub-estimators.
estimators_samples_ [list of arrays] The subset of drawn samples for each base estima-
tor.
estimators_features_ [list of arrays] The subset of drawn features for each base estimator.
oob_score_ [float] Score of the training dataset obtained using an out-of-bag estimate. This
attribute exists only when oob_score is True.
oob_prediction_ [ndarray of shape (n_samples,)] Prediction computed with out-of-bag esti-
mate on the training set. If n_estimators is small it might be possible that a data point was
never left out during the bootstrap. In this case, oob_prediction_ might contain NaN.
This attribute exists only when oob_score is True.
References
Examples
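A minimal usage sketch using an SVR base estimator on synthetic data; values are illustrative:
>>> from sklearn.svm import SVR
>>> from sklearn.ensemble import BaggingRegressor
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_samples=100, n_features=4,
...                        n_informative=2, random_state=0, shuffle=False)
>>> regr = BaggingRegressor(base_estimator=SVR(),
...                         n_estimators=10, random_state=0).fit(X, y)
>>> y_pred = regr.predict([[0, 0, 0, 0]])
>>> y_pred.shape
(1,)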
Methods
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
y [array-like of shape (n_samples,)] The target values (class labels in classification, real
numbers in regression).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Note that this is supported only if the base estimator
supports sample weighting.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the estimators in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
Sparse matrices are accepted only if they are supported by the base estimator.
Returns
y [ndarray of shape (n_samples,)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
7.11.5 sklearn.ensemble.IsolationForest
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a
sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision
function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees
collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
Read more in the User Guide.
New in version 0.18.
Parameters
n_estimators [int, default=100] The number of base estimators in the ensemble.
max_samples [“auto”, int or float, default=”auto”]
The number of samples to draw from X to train each base estimator.
• If int, then draw max_samples samples.
• If float, then draw max_samples * X.shape[0] samples.
• If “auto”, then max_samples=min(256, n_samples).
If max_samples is larger than the number of samples provided, all samples will be used for
all trees (no sampling).
contamination [‘auto’ or float, default=’auto’] The amount of contamination of the data set,
i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on
the scores of the samples.
• If ‘auto’, the threshold is determined as in the original paper.
• If float, the contamination should be in the range [0, 0.5].
Changed in version 0.22: The default value of contamination changed from 0.1 to
'auto'.
max_features [int or float, default=1.0] The number of features to draw from X to train each
base estimator.
• If int, then draw max_features features.
• If float, then draw max_features * X.shape[1] features.
bootstrap [bool, default=False] If True, individual trees are fit on random subsets of the training
data sampled with replacement. If False, sampling without replacement is performed.
n_jobs [int, default=None] The number of jobs to run in parallel for both fit and predict.
None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.
behaviour [str, default=’deprecated’] This parameter has no effect, is deprecated, and will be
removed.
New in version 0.20: behaviour is added in 0.20 for back-compatibility purposes.
Deprecated since version 0.20: behaviour='old' is deprecated in 0.20 and will not be
possible in 0.22.
Deprecated since version 0.22: behaviour parameter is deprecated in 0.22 and will be removed
in 0.24.
random_state [int, RandomState instance, default=None] If int, random_state is the seed used
by the random number generator; If RandomState instance, random_state is the random
number generator; If None, the random number generator is the RandomState instance used
by np.random.
verbose [int, default=0] Controls the verbosity of the tree building process.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the
Glossary.
New in version 0.21.
Attributes
estimators_ [list of DecisionTreeClassifier] The collection of fitted sub-estimators.
estimators_samples_ [list of arrays] The subset of drawn samples for each base estima-
tor.
max_samples_ [int] The actual number of samples.
offset_ [float] Offset used to define the decision function from the raw scores. We have the re-
lation: decision_function = score_samples - offset_. offset_ is de-
fined as follows. When the contamination parameter is set to “auto”, the offset is equal to
-0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When
a contamination parameter different than “auto” is provided, the offset is defined in such
a way we obtain the expected number of outliers (samples with decision function < 0) in
training.
See also:
Notes
The implementation is based on an ensemble of ExtraTreeRegressor. The maximum depth of each tree is set to
ceil(log_2(n)) where n is the number of samples used to build the tree (see (Liu et al., 2008) for more
details).
References
[Rd7ae0a2ae688-1], [Rd7ae0a2ae688-2]
Examples
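A minimal usage sketch on a toy 1-D dataset with one obvious outlier; the output is indicative:
>>> from sklearn.ensemble import IsolationForest
>>> X = [[-1.1], [0.3], [0.5], [100]]
>>> clf = IsolationForest(random_state=0).fit(X)
>>> clf.predict([[0.1], [0], [90]])   # 1 for inliers, -1 for outliers
array([ 1,  1, -1])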
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.11.6 sklearn.ensemble.RandomTreesEmbedding
where N is the total number of samples, N_t is the number of samples at the current node,
N_t_L is the number of samples in the left child, and N_t_R is the number of samples in
the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
min_impurity_split [float, default=0] Threshold for early stopping in tree growth. A node will
split if its impurity is above the threshold, otherwise it is a leaf.
Deprecated since version 0.19: min_impurity_split has been deprecated in favor of
min_impurity_decrease in 0.19. The default value of min_impurity_split
has changed from 1e-7 to 0 in 0.23 and it will be removed in 0.25. Use
min_impurity_decrease instead.
sparse_output [bool, default=True] Whether or not to return a sparse CSR matrix, as default
behavior, or to return a dense array compatible with dense pipeline operators.
n_jobs [int, default=None] The number of jobs to run in parallel. fit, transform,
decision_path and apply are all parallelized over the trees. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors. See Glos-
sary for more details.
random_state [int, RandomState instance, default=None] Controls the generation of the ran-
dom y used to fit the trees and the draw of the splits for each feature at the trees’ nodes. See
Glossary for details.
verbose [int, default=0] Controls the verbosity when fitting and predicting.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the
Glossary.
Attributes
estimators_ [list of DecisionTreeClassifier] The collection of fitted sub-estimators.
References
[R6e47e53bacbd-1], [R6e47e53bacbd-2]
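Examples
A minimal usage sketch on toy data; n_estimators and max_depth below are illustrative choices, and the width of the embedding depends on the fitted trees:
>>> from sklearn.ensemble import RandomTreesEmbedding
>>> X = [[0, 0], [1, 0], [0, 1], [1, 1]]
>>> embedder = RandomTreesEmbedding(n_estimators=5, max_depth=1,
...                                 random_state=0).fit(X)
>>> X_transformed = embedder.transform(X)   # sparse CSR matrix by default
>>> X_transformed.shape[0]                  # one row per input sample
4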
Methods
property feature_importances_
Return the feature importances (the higher, the more important the feature).
Returns
feature_importances_ [ndarray of shape (n_features,)] The values of this array sum to 1,
unless all trees are single node trees consisting of only the root node, in which case it will
be an array of zeros.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform dataset.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Input data to be trans-
formed. Use dtype=np.float32 for maximum efficiency. Sparse matrices are also
supported, use sparse csr_matrix for maximum efficiency.
Returns
X_transformed [sparse matrix of shape (n_samples, n_out)] Transformed dataset.
7.11.7 sklearn.ensemble.StackingClassifier
Note: A larger number of splits will provide no benefit if the number of training samples
is large enough; it will only increase the training time. cv is not used for model evaluation
but for prediction.
Notes
When predict_proba is used by each estimator (i.e. most of the time for stack_method='auto' or
specifically for stack_method='predict_proba'), the first column predicted by each estimator will be
dropped in the case of a binary classification problem, since the two columns would be perfectly collinear.
References
[Rb91ed47a817e-1]
Examples
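A short sketch of the usual stacking workflow on the iris data; the estimator choices and the reported accuracy are indicative:
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier, StackingClassifier
>>> from sklearn.svm import LinearSVC
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> estimators = [
...     ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
...     ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))
... ]
>>> clf = StackingClassifier(estimators=estimators,
...                          final_estimator=LogisticRegression())
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, stratify=y, random_state=42)
>>> clf.fit(X_train, y_train).score(X_test, y_test)
0.9...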
Methods
Returns
decisions [ndarray of shape (n_samples,), (n_samples, n_classes), or (n_samples, n_classes
* (n_classes-1) / 2)] The decision function computed by the final estimator.
fit(self, X, y, sample_weight=None)
Fit the estimators.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vectors, where
n_samples is the number of samples and n_features is the number of features.
y [array-like of shape (n_samples,)] Target values.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights. If None,
then samples are equally weighted. Note that this is supported only if all underlying esti-
mators support sample weights.
Returns
self [object]
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get the parameters of an estimator from the ensemble.
Parameters
deep [bool, default=True] Setting it to True gets the various classifiers and the parameters
of the classifiers as well.
predict(self, X, **predict_params)
Predict target for X.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vectors, where
n_samples is the number of samples and n_features is the number of features.
**predict_params [dict of str -> obj] Parameters to the predict called by the
final_estimator. Note that this may be used to return uncertainties from some es-
timators with return_std or return_cov. Be aware that it will only account for
uncertainty in the final estimator.
Returns
y_pred [ndarray of shape (n_samples,) or (n_samples, n_output)] Predicted targets.
predict_proba(self, X)
Predict class probabilities for X using final_estimator_.predict_proba.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vectors, where
n_samples is the number of samples and n_features is the number of features.
Returns
probabilities [ndarray of shape (n_samples, n_classes) or list of ndarray of shape
(n_output,)] The class probabilities of the input samples.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of an estimator from the ensemble.
Valid parameter keys can be listed with get_params().
Parameters
**params [keyword arguments] Specific parameters using e.g.
set_params(parameter_name=new_value). In addition to setting the
parameters of the stacking estimator, the individual estimators of the stacking estimator
can also be set, or removed by setting them to ‘drop’.
transform(self, X)
Return class labels or probabilities for X for each estimator.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vectors, where
n_samples is the number of samples and n_features is the number of features.
Returns
y_preds [ndarray of shape (n_samples, n_estimators) or (n_samples, n_classes *
n_estimators)] Prediction outputs for each estimator.
7.11.8 sklearn.ensemble.StackingRegressor
Stacked generalization consists of stacking the outputs of the individual estimators and using a regressor to
compute the final prediction. Stacking allows the strength of each individual estimator to be used by taking
their outputs as input to a final estimator.
Note that estimators_ are fitted on the full X while final_estimator_ is trained using cross-validated
predictions of the base estimators using cross_val_predict.
New in version 0.22.
Read more in the User Guide.
Parameters
estimators [list of (str, estimator)] Base estimators which will be stacked together. Each ele-
ment of the list is defined as a tuple of string (i.e. name) and an estimator instance. An
estimator can be set to ‘drop’ using set_params.
final_estimator [estimator, default=None] A regressor which will be used to combine the base
estimators. The default regressor is a RidgeCV.
cv [int, cross-validation generator or an iterable, default=None] Determines the cross-validation
splitting strategy used in cross_val_predict to train final_estimator. Possible
inputs for cv are:
• None, to use the default 5-fold cross validation,
• integer, to specify the number of folds in a (Stratified) KFold,
• An object to be used as a cross-validation generator,
• An iterable yielding train, test splits.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass,
StratifiedKFold is used. In all other cases, KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
Note: A larger number of splits will provide no benefit if the number of training samples
is large enough; it will only increase the training time. cv is not used for model evaluation
but for prediction.
n_jobs [int, default=None] The number of jobs to run in parallel for fit of all estimators.
None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.
passthrough [bool, default=False] When False, only the predictions of estimators will be used
as training data for final_estimator. When True, the final_estimator is trained
on the predictions as well as the original training data.
Attributes
estimators_ [list of estimator] The elements of the estimators parameter, having been fit-
ted on the training data. If an estimator has been set to 'drop', it will not appear in
estimators_.
named_estimators_ [Bunch] Attribute to access any fitted sub-estimators by name.
final_estimator_ [estimator] The regressor which stacks the fitted base estimators.
References
[R606df7ffad02-1]
Examples
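A short sketch of a stacked regression on the diabetes data; the estimator choices and the reported score are indicative:
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.svm import LinearSVR
>>> from sklearn.ensemble import RandomForestRegressor, StackingRegressor
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_diabetes(return_X_y=True)
>>> estimators = [
...     ('lr', RidgeCV()),
...     ('svr', LinearSVR(random_state=42))
... ]
>>> reg = StackingRegressor(
...     estimators=estimators,
...     final_estimator=RandomForestRegressor(n_estimators=10,
...                                           random_state=42))
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> reg.fit(X_train, y_train).score(X_test, y_test)
0.3...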
Methods
Notes
7.11.9 sklearn.ensemble.VotingClassifier
voting [{‘hard’, ‘soft’}, default=’hard’] If ‘hard’, uses predicted class labels for majority rule
voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the pre-
dicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
weights [array-like of shape (n_classifiers,), default=None] Sequence of weights (float or
int) to weight the occurrences of predicted class labels (hard voting) or class probabilities
before averaging (soft voting). Uses uniform weights if None.
n_jobs [int, default=None] The number of jobs to run in parallel for fit. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
flatten_transform [bool, default=True] Affects the shape of the transform output only when
voting=’soft’. If voting=’soft’ and flatten_transform=True, transform returns a matrix
with shape (n_samples, n_classifiers * n_classes). If flatten_transform=False, it returns
(n_classifiers, n_samples, n_classes).
Attributes
estimators_ [list of classifiers] The collection of fitted sub-estimators as defined in
estimators that are not ‘drop’.
named_estimators_ [Bunch object, a dictionary with attribute access] Attribute to access any
fitted sub-estimators by name.
New in version 0.20.
classes_ [array-like of shape (n_predictions,)] The class labels.
See also:
Examples
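A minimal hard-voting sketch on a toy dataset; the estimator choices are illustrative and the printed output is indicative:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf = VotingClassifier(estimators=[
...     ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
>>> eclf = eclf.fit(X, y)
>>> print(eclf.predict(X))
[1 1 1 2 2 2]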
Methods
7.11.10 sklearn.ensemble.VotingRegressor
Examples
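A minimal usage sketch averaging two regressors on toy data; the estimator choices are illustrative:
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor, VotingRegressor
>>> r1 = LinearRegression()
>>> r2 = RandomForestRegressor(n_estimators=10, random_state=1)
>>> X = np.array([[1, 1], [2, 4], [3, 9], [4, 16], [5, 25], [6, 36]])
>>> y = np.array([2, 6, 12, 20, 30, 42])
>>> er = VotingRegressor([('lr', r1), ('rf', r2)]).fit(X, y)
>>> er.predict(X).shape   # one averaged prediction per sample
(6,)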
Methods
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get the parameters of an estimator from the ensemble.
Parameters
deep [bool, default=True] Setting it to True gets the various classifiers and the parameters
of the classifiers as well.
predict(self, X)
Predict regression target for X.
The predicted regression target of an input sample is computed as the mean predicted regression targets of
the estimators in the ensemble.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The input samples.
Returns
y [ndarray of shape (n_samples,)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
set_params(self, **params)
Set the parameters of an estimator from the ensemble.
Valid parameter keys can be listed with get_params().
Parameters
**params [keyword arguments] Specific parameters using e.g.
set_params(parameter_name=new_value). In addition to setting the
parameters of the stacking estimator, the individual estimators of the stacking estimator
can also be set, or removed by setting them to ‘drop’.
transform(self, X)
Return predictions for X for each estimator.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input samples.
Returns
predictions [ndarray of shape (n_samples, n_regressors)] Values predicted by each regressor.
7.11.11 sklearn.ensemble.HistGradientBoostingRegressor
Note: This estimator is still experimental for now: the predictions and the API might change without any
deprecation cycle. To use it, you need to explicitly import enable_hist_gradient_boosting:
tol [float or None, optional (default=1e-7)] The absolute tolerance to use when comparing
scores during early stopping. The higher the tolerance, the more likely we are to early
stop: higher tolerance means that it will be harder for subsequent iterations to be considered
an improvement upon the reference score.
verbose [int, optional (default=0)] The verbosity level. If not zero, print some information
about the fitting process.
random_state [int, np.random.RandomState instance or None, optional (default=None)]
Pseudo-random number generator to control the subsampling in the binning process, and
the train/validation data split if early stopping is enabled. See random_state.
Attributes
n_iter_ [int] The number of iterations as selected by early stopping (if n_iter_no_change is not
None). Otherwise it corresponds to max_iter.
n_trees_per_iteration_ [int] The number of trees that are built at each iteration. For regressors,
this is always 1.
train_score_ [ndarray, shape (n_iter_+1,)] The scores at each iteration on the training data.
The first entry is the score of the ensemble before the first iteration. Scores are computed
according to the scoring parameter. If scoring is not ‘loss’, scores are computed on a
subset of at most 10 000 samples. Empty if no early stopping.
validation_score_ [ndarray, shape (n_iter_+1,)] The scores at each iteration on the held-out
validation data. The first entry is the score of the ensemble before the first iteration. Scores
are computed according to the scoring parameter. Empty if no early stopping or if
validation_fraction is None.
Examples
>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> est = HistGradientBoostingRegressor().fit(X, y)
>>> est.score(X, y)
0.98...
Methods
fit(self, X, y)
Fit the gradient boosting model.
Parameters
X [array-like of shape (n_samples, n_features)] The input samples.
y [array-like of shape (n_samples,)] Target values.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict values for X.
Parameters
X [array-like, shape (n_samples, n_features)] The input samples.
Returns
y [ndarray, shape (n_samples,)] The predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.11.12 sklearn.ensemble.HistGradientBoostingClassifier
Note: This estimator is still experimental for now: the predictions and the API might change without any
deprecation cycle. To use it, you need to explicitly import enable_hist_gradient_boosting:
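>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingClassifier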
tol [float or None, optional (default=1e-7)] The absolute tolerance to use when comparing
scores during early stopping. The higher the tolerance, the more likely we are to early
stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon
the reference score.
verbose [int, optional (default=0)] The verbosity level. If not zero, print some information
about the fitting process.
random_state [int, np.random.RandomState instance or None, optional (default=None)]
Pseudo-random number generator to control the subsampling in the binning process, and
the train/validation data split if early stopping is enabled. See random_state.
Attributes
n_iter_ [int] The number of estimators as selected by early stopping (if n_iter_no_change is not
None). Otherwise it corresponds to max_iter.
n_trees_per_iteration_ [int] The number of trees that are built at each iteration. This is equal to
1 for binary classification, and to n_classes for multiclass classification.
train_score_ [ndarray, shape (n_iter_+1,)] The scores at each iteration on the training data.
The first entry is the score of the ensemble before the first iteration. Scores are computed
according to the scoring parameter. If scoring is not ‘loss’, scores are computed on a
subset of at most 10 000 samples. Empty if no early stopping.
validation_score_ [ndarray, shape (n_iter_+1,)] The scores at each iteration on the held-out
validation data. The first entry is the score of the ensemble before the first iteration. Scores
are computed according to the scoring parameter. Empty if no early stopping or if
validation_fraction is None.
Examples
>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting # noqa
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0
Methods
decision_function(self, X)
Compute the decision function of X.
Parameters
X [array-like, shape (n_samples, n_features)] The input samples.
Returns
decision [ndarray, shape (n_samples,) or (n_samples, n_trees_per_iteration)] The raw predicted values (i.e. the sum of the trees leaves) for each sample. n_trees_per_iteration is equal to the number of classes in multiclass classification.
fit(self, X, y)
Fit the gradient boosting model.
Parameters
X [array-like of shape (n_samples, n_features)] The input samples.
y [array-like of shape (n_samples,)] Target values.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict classes for X.
Parameters
X [array-like, shape (n_samples, n_features)] The input samples.
Returns
y [ndarray, shape (n_samples,)] The predicted classes.
predict_proba(self, X)
Predict class probabilities for X.
Parameters
X [array-like, shape (n_samples, n_features)] The input samples.
Returns
p [ndarray, shape (n_samples, n_classes)] The class probabilities of the input samples.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
The sklearn.exceptions module includes all custom warnings and error classes used across scikit-learn.
7.12.1 sklearn.exceptions.ChangedBehaviorWarning
class sklearn.exceptions.ChangedBehaviorWarning
Warning class used to notify the user of any change in the behavior.
Changed in version 0.18: Moved from sklearn.base.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.2 sklearn.exceptions.ConvergenceWarning
class sklearn.exceptions.ConvergenceWarning
Custom warning to capture convergence problems
Attributes
args
Examples
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.3 sklearn.exceptions.DataConversionWarning
class sklearn.exceptions.DataConversionWarning
Warning used to notify implicit data conversions happening in the code.
This warning occurs when some input data needs to be converted or interpreted in a way that may not match the
user’s expectations.
For example, this warning may occur when the user
• passes an integer array to a function which expects float input and will convert the input
• requests a non-copying operation, but a copy is required to meet the implementation’s data-type ex-
pectations;
• passes an input whose shape can be interpreted ambiguously.
Changed in version 0.18: Moved from sklearn.utils.validation.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.4 sklearn.exceptions.DataDimensionalityWarning
class sklearn.exceptions.DataDimensionalityWarning
Custom warning to notify potential issues with data dimensionality.
For example, in random projection, this warning is raised when the number of components, which quantifies
the dimensionality of the target projection space, is higher than the number of features, which quantifies the
dimensionality of the original source space, to imply that the dimensionality of the problem will not be reduced.
Changed in version 0.18: Moved from sklearn.utils.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.5 sklearn.exceptions.EfficiencyWarning
class sklearn.exceptions.EfficiencyWarning
Warning used to notify the user of inefficient computation.
This warning notifies the user that the efficiency may not be optimal due to some reason which may be included
as a part of the warning message. This may be subclassed into a more specific Warning class.
New in version 0.18.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.6 sklearn.exceptions.FitFailedWarning
class sklearn.exceptions.FitFailedWarning
Warning class used if there is an error while fitting the estimator.
This Warning is used in meta estimators GridSearchCV and RandomizedSearchCV and the cross-validation
helper function cross_val_score to warn when there is an error while fitting the estimator.
Attributes
args
Examples
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.7 sklearn.exceptions.NotFittedError
class sklearn.exceptions.NotFittedError
Exception class to raise if estimator is used before fitting.
This class inherits from both ValueError and AttributeError to help with exception handling and backward
compatibility.
Attributes
args
Examples
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.8 sklearn.exceptions.NonBLASDotWarning
class sklearn.exceptions.NonBLASDotWarning
Warning used when the dot operation does not use BLAS.
This warning is used to notify the user that BLAS was not used for dot operation and hence the efficiency may
be affected.
Changed in version 0.18: Moved from sklearn.utils.validation, extends EfficiencyWarning.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
7.12.9 sklearn.exceptions.UndefinedMetricWarning
class sklearn.exceptions.UndefinedMetricWarning
Warning used when the metric is invalid
Changed in version 0.18: Moved from sklearn.base.
Attributes
args
Methods
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
The sklearn.experimental module provides importable modules that enable the use of experimental features
or estimators.
The features and estimators that are experimental aren’t subject to deprecation cycles. Use them at your own risks!
7.13.1 sklearn.experimental.enable_hist_gradient_boosting
Enables histogram-based gradient boosting estimators: HistGradientBoostingClassifier and HistGradientBoostingRegressor.
The # noqa comment can be removed: it just tells linters like flake8 to ignore the import, which appears as unused.
7.13.2 sklearn.experimental.enable_iterative_imputer
Enables IterativeImputer
The API and results of this estimator might change without any deprecation cycle.
Importing this file dynamically sets sklearn.impute.IterativeImputer as an attribute of the impute module:
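A minimal sketch of the intended usage:
>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_iterative_imputer  # noqa
>>> # now you can import normally from sklearn.impute
>>> from sklearn.impute import IterativeImputer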
The sklearn.feature_extraction module deals with feature extraction from raw data. It currently includes
methods to extract features from text and images.
User guide: See the Feature extraction section for further details.
7.14.1 sklearn.feature_extraction.DictVectorizer
Examples
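A minimal usage sketch (toy data chosen for illustration):
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.get_feature_names()
['bar', 'baz', 'foo']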
Methods
fit(self, X[, y]) Learn a list of feature name -> indices mappings.
fit_transform(self, X[, y]) Learn a list of feature name -> indices mappings and transform X.
get_feature_names(self) Returns a list of feature names, ordered by their indices.
get_params(self[, deep]) Get parameters for this estimator.
inverse_transform(self, X[, dict_type]) Transform array or sparse matrix X back to feature mappings.
restrict(self, support[, indices]) Restrict the features to those in support using feature selection.
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Transform feature->value dicts to array or sparse matrix.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, X, dict_type=<class ’dict’>)
Transform array or sparse matrix X back to feature mappings.
X must have been produced by this DictVectorizer’s transform or fit_transform method; it may only have
passed through transformers that preserve the number of features and their order.
In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than
the original ones.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Sample matrix.
dict_type [type, default=dict] Constructor for feature mappings. Must conform to the collections.Mapping API.
Returns
D [list of dict_type objects of shape (n_samples,)] Feature mappings for the samples in X.
restrict(self, support, indices=False)
Restrict the features to those in support using feature selection.
This function modifies the estimator in-place.
Parameters
support [array-like] Boolean mask or list of indices (as returned by the get_support member
of feature selectors).
indices [bool, default=False] Whether support is a list of indices.
Returns
self
Examples
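A sketch of restrict combined with a univariate selector (toy data; the retained columns follow from the chi2 scores on this particular input):
>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> v = DictVectorizer()
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> support = SelectKBest(chi2, k=2).fit(X, [0, 1])
>>> v.get_feature_names()
['bar', 'baz', 'foo']
>>> _ = v.restrict(support.get_support())
>>> v.get_feature_names()
['bar', 'foo']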
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform feature->value dicts to array or sparse matrix.
Named features not encountered during fit or fit_transform will be silently ignored.
Parameters
X [Mapping or iterable over Mappings of shape (n_samples,)] Dict(s) or Mapping(s) from
feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).
Returns
Xa [{array, sparse matrix}] Feature vectors; always 2-d.
7.14.2 sklearn.feature_extraction.FeatureHasher
input_type [{“dict”, “pair”, “string”}, default=”dict”] Either “dict” (the default) to accept dictionaries over (feature_name, value); “pair” to accept pairs of (feature_name, value); or “string” to accept single strings. feature_name should be a string, while value should be a number.
In the case of “string”, a value of 1 is implied. The feature_name is hashed to find the
appropriate column for the feature. The value’s sign might be flipped in the output (but see
non_negative, below).
dtype [numpy dtype, default=np.float64] The type of feature values. Passed to scipy.sparse ma-
trix constructors as the dtype argument. Do not set this to bool, np.boolean or any unsigned
integer type.
alternate_sign [bool, default=True] When True, an alternating sign is added to the features as
to approximately conserve the inner product in the hashed space even for small n_features.
This approach is similar to sparse random projection.
See also:
Examples
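A minimal sketch (only the shape is shown, since the hashed column positions are not meaningful by themselves):
>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.shape
(2, 10)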
Methods
sklearn.feature_extraction.image.extract_patches_2d
sklearn.feature_extraction.image.extract_patches_2d(image, patch_size, max_patches=None, random_state=None)
Reshape a 2D image into a collection of patches
The resulting patches are allocated in a dedicated array.
Read more in the User Guide.
Parameters
image [ndarray of shape (image_height, image_width) or (image_height, image_width,
n_channels)] The original image data. For color images, the last dimension specifies the
channel: a RGB image would have n_channels=3.
patch_size [tuple of int (patch_height, patch_width)] The dimensions of one patch.
max_patches [int or float, default=None] The maximum number of patches to extract. If
max_patches is a float between 0 and 1, it is taken to be a proportion of the total number
of patches.
random_state [int, RandomState instance, default=None] Determines the random number gen-
erator used for random sampling when max_patches is not None. Use an int to make the
randomness deterministic. See Glossary.
Returns
patches [array of shape (n_patches, patch_height, patch_width) or (n_patches, patch_height,
patch_width, n_channels)] The collection of patches extracted from the image, where
n_patches is either max_patches or the total number of patches that can be extracted.
Examples
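A small self-contained sketch on a synthetic 4x4 image:
>>> import numpy as np
>>> from sklearn.feature_extraction.image import extract_patches_2d
>>> one_image = np.arange(16).reshape((4, 4))
>>> patches = extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2)
>>> patches[0]
array([[0, 1],
       [4, 5]])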
sklearn.feature_extraction.image.grid_to_graph
Notes
For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix
instance. Going forward, np.ndarray returns an np.ndarray, as expected.
For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
sklearn.feature_extraction.image.img_to_graph
Notes
For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix
instance. Going forward, np.ndarray returns an np.ndarray, as expected.
For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
sklearn.feature_extraction.image.reconstruct_from_patches_2d
sklearn.feature_extraction.image.reconstruct_from_patches_2d(patches, image_size)
Reconstruct the image from all of its patches.
Patches are assumed to overlap and the image is constructed by filling in the patches from left to right, top to
bottom, averaging the overlapping regions.
Read more in the User Guide.
Parameters
patches [ndarray of shape (n_patches, patch_height, patch_width) or (n_patches, patch_height,
patch_width, n_channels)] The complete set of patches. If the patches contain colour
information, channels are indexed along the last dimension: RGB patches would have
n_channels=3.
image_size [tuple of int (image_height, image_width) or (image_height, image_width,
n_channels)] The size of the image that will be reconstructed.
Returns
image [ndarray of shape image_size] The reconstructed image.
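Because every pixel of the source image is covered by at least one patch, extracting all patches and then reconstructing recovers the original image; a sketch:
>>> import numpy as np
>>> from sklearn.feature_extraction.image import (
...     extract_patches_2d, reconstruct_from_patches_2d)
>>> img = np.arange(16, dtype=float).reshape((4, 4))
>>> patches = extract_patches_2d(img, (2, 2))
>>> rebuilt = reconstruct_from_patches_2d(patches, (4, 4))
>>> np.allclose(img, rebuilt)
True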
sklearn.feature_extraction.image.PatchExtractor
class sklearn.feature_extraction.image.PatchExtractor(patch_size=None, max_patches=None, random_state=None)
Extracts patches from a collection of images
Read more in the User Guide.
New in version 0.9.
Parameters
patch_size [tuple of int (patch_height, patch_width)] The dimensions of one patch.
max_patches [int or float, default=None] The maximum number of patches per image to ex-
tract. If max_patches is a float in (0, 1), it is taken to mean a proportion of the total number
of patches.
random_state [int, RandomState instance, default=None] Determines the random number gen-
erator used for random sampling when max_patches is not None. Use an int to make the
randomness deterministic. See Glossary.
Examples
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transforms the image samples in X into a matrix of patch data.
Parameters
X [ndarray of shape (n_samples, image_height, image_width) or (n_samples, image_height,
image_width, n_channels)] Array of images from which to extract patches. For
color images, the last dimension specifies the channel: a RGB image would have
n_channels=3.
Returns
patches [array of shape (n_patches, patch_height, patch_width) or (n_patches,
patch_height, patch_width, n_channels)] The collection of patches extracted from
the images, where n_patches is either n_samples * max_patches or the total
number of patches that can be extracted.
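A sketch on two synthetic 4x4 images:
>>> import numpy as np
>>> from sklearn.feature_extraction.image import PatchExtractor
>>> X = np.arange(2 * 4 * 4, dtype=float).reshape((2, 4, 4))
>>> pe = PatchExtractor(patch_size=(2, 2))
>>> patches = pe.transform(X)
>>> patches.shape
(18, 2, 2)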
The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text doc-
uments.
sklearn.feature_extraction.text.CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature
selection then the number of features will be equal to the vocabulary size found by analyzing the data.
Read more in the User Guide.
Parameters
input [string {‘filename’, ‘file’, ‘content’}, default=’content’] If ‘filename’, the sequence
passed as an argument to fit is expected to be a list of filenames that need reading to fetch
the raw content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch
the bytes in memory.
Otherwise the input is expected to be a sequence of items that can be of type string or byte.
encoding [string, default=’utf-8’] If bytes or files are given to analyze, this encoding is used to
decode.
decode_error [{‘strict’, ‘ignore’, ‘replace’}, default=’strict’] Instruction on what to do if a byte
sequence is given to analyze that contains characters not of the given encoding. By
default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are
‘ignore’ and ‘replace’.
strip_accents [{‘ascii’, ‘unicode’}, default=None] Remove accents and perform other charac-
ter normalization during the preprocessing step. ‘ascii’ is a fast method that only works on
characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that
works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.
lowercase [bool, default=True] Convert all characters to lowercase before tokenizing.
preprocessor [callable, default=None] Override the preprocessing (string transformation) stage
while preserving the tokenizing and n-grams generation steps. Only applies if analyzer
is not callable.
tokenizer [callable, default=None] Override the string tokenization step while preserving the
preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
stop_words [string {‘english’}, list, default=None] If ‘english’, a built-in stop word list for
English is used. There are several known issues with ‘english’ and you should consider an
alternative (see Using stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the
resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0)
to automatically detect and filter stop words based on intra corpus document frequency of
terms.
token_pattern [string] Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric
characters (punctuation is completely ignored and always treated as a token separator).
ngram_range [tuple (min_n, max_n), default=(1, 1)] The lower and upper boundary of the
range of n-values for different word n-grams or char n-grams to be extracted. All values
of n such that min_n <= n <= max_n will be used. For example an ngram_range
of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2)
means only bigrams. Only applies if analyzer is not callable.
analyzer [string, {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’] Whether the feature
should be made of word n-gram or character n-grams. Option ‘char_wb’ creates character
n-grams only from text inside word boundaries; n-grams at the edges of words are padded
with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed
input.
Changed in version 0.21.
Since v0.21, if input is filename or file, the data is first read from the file and then
passed to the given callable analyzer.
max_df [float in range [0.0, 1.0] or int, default=1.0] When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
min_df [float in range [0.0, 1.0] or int, default=1] When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
max_features [int, default=None] If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
vocabulary [Mapping or iterable, default=None] Either a Mapping (e.g., a dict) where keys are
terms and values are indices in the feature matrix, or an iterable over terms. If not given, a
vocabulary is determined from the input documents. Indices in the mapping should not be
repeated and should not have any gap between 0 and the largest index.
binary [bool, default=False] If True, all non zero counts are set to 1. This is useful for discrete
probabilistic models that model binary events rather than integer counts.
dtype [type, default=np.int64] Type of the matrix returned by fit_transform() or transform().
Attributes
vocabulary_ [dict] A mapping of terms to feature indices.
See also:
HashingVectorizer, TfidfVectorizer
Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided
only for introspection and can be safely removed using delattr or set to None before pickling.
Examples
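A typical usage sketch on a tiny corpus:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]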
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, raw_documents)
Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to
the constructor.
Parameters
raw_documents [iterable] An iterable which yields either str, unicode or file objects.
Returns
X [sparse matrix of shape (n_samples, n_features)] Document-term matrix.
sklearn.feature_extraction.text.HashingVectorizer
class sklearn.feature_extraction.text.HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)
Convert a collection of text documents to a matrix of token occurrences
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary
occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean
unit sphere if norm=’l2’.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index
mapping.
This strategy has several advantages:
• it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in
memory
• it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
• it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
• there is no way to compute the inverse transform (from feature indices to string feature names) which can
be a problem when trying to introspect which features are most important to a model.
• there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this
is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
• no IDF weighting as this would render the transformer stateful.
The hash function employed is the signed 32-bit version of Murmurhash3.
Read more in the User Guide.
Parameters
input [string {‘filename’, ‘file’, ‘content’}, default=’content’] If ‘filename’, the sequence
passed as an argument to fit is expected to be a list of filenames that need reading to fetch
the raw content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch
the bytes in memory.
Otherwise the input is expected to be a sequence of items that can be of type string or byte.
encoding [string, default=’utf-8’] If bytes or files are given to analyze, this encoding is used to
decode.
decode_error [{‘strict’, ‘ignore’, ‘replace’}, default=’strict’] Instruction on what to do if a byte
sequence is given to analyze that contains characters not of the given encoding. By
default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are
‘ignore’ and ‘replace’.
strip_accents [{‘ascii’, ‘unicode’}, default=None] Remove accents and perform other charac-
ter normalization during the preprocessing step. ‘ascii’ is a fast method that only works on
characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that
works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.
lowercase [bool, default=True] Convert all characters to lowercase before tokenizing.
preprocessor [callable, default=None] Override the preprocessing (string transformation) stage
while preserving the tokenizing and n-grams generation steps. Only applies if analyzer
is not callable.
tokenizer [callable, default=None] Override the string tokenization step while preserving the
preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
stop_words [string {‘english’}, list, default=None] If ‘english’, a built-in stop word list for
English is used. There are several known issues with ‘english’ and you should consider an
alternative (see Using stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the
resulting tokens. Only applies if analyzer == 'word'.
token_pattern [string] Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric
characters (punctuation is completely ignored and always treated as a token separator).
ngram_range [tuple (min_n, max_n), default=(1, 1)] The lower and upper boundary of the
range of n-values for different n-grams to be extracted. All values of n such that min_n <= n
<= max_n will be used. For example an ngram_range of (1, 1) means only unigrams,
(1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if
analyzer is not callable.
analyzer [string, {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’] Whether the feature
should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams
only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed
input.
Changed in version 0.21.
Since v0.21, if input is filename or file, the data is first read from the file and then
passed to the given callable analyzer.
n_features [int, default=(2 ** 20)] The number of features (columns) in the output matrices.
Small numbers of features are likely to cause hash collisions, but large numbers will cause
larger coefficient dimensions in linear learners.
binary [bool, default=False.] If True, all non zero counts are set to 1. This is useful for discrete
probabilistic models that model binary events rather than integer counts.
norm [{‘l1’, ‘l2’}, default=’l2’] Norm used to normalize term vectors. None for no normaliza-
tion.
alternate_sign [bool, default=True] When True, an alternating sign is added to the features as
to approximately conserve the inner product in the hashed space even for small n_features.
This approach is similar to sparse random projection.
New in version 0.19.
dtype [type, default=np.float64] Type of the matrix returned by fit_transform() or transform().
See also:
CountVectorizer, TfidfVectorizer
Examples
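A typical usage sketch on a tiny corpus:
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)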
Methods
fit(self, X, y=None)
Does nothing: this transformer is stateless.
Parameters
X [ndarray of shape [n_samples, n_features]] Training data.
fit_transform(self, X, y=None)
Transform a sequence of documents to a document-term matrix.
Parameters
X [iterable over raw text documents, length = n_samples] Samples. Each sample must be a
text document (either bytes or unicode strings, file name or file object depending on the
constructor argument) which will be tokenized and hashed.
y [any] Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns
X [sparse matrix of shape (n_samples, n_features)] Document-term matrix.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_stop_words(self )
Build or fetch the effective stop words list.
Returns
stop_words: list or None A list of stop words.
partial_fit(self, X, y=None)
Does nothing: this transformer is stateless.
This method is just there to mark the fact that this transformer can work in a streaming setup.
Parameters
X [ndarray of shape [n_samples, n_features]] Training data.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform a sequence of documents to a document-term matrix.
Parameters
X [iterable over raw text documents, length = n_samples] Samples. Each sample must be a
text document (either bytes or unicode strings, file name or file object depending on the
constructor argument) which will be tokenized and hashed.
Returns
X [sparse matrix of shape (n_samples, n_features)] Document-term matrix.
sklearn.feature_extraction.text.TfidfTransformer
smooth_idf [bool, default=True] Smooth idf weights by adding one to document frequencies, as
if an extra document was seen containing every term in the collection exactly once. Prevents
zero divisions.
sublinear_tf [bool, default=False] Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
Attributes
idf_ [array of shape (n_features)] The inverse document frequency (IDF) vector; only defined
if use_idf is True.
References
[R1b90ac3ca370-Yates2011], [R1b90ac3ca370-MRS2008]
Examples
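A minimal sketch chaining CountVectorizer with TfidfTransformer on a toy corpus (only shapes are shown):
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> corpus = ['the cat sat', 'the cat sat on the mat']
>>> counts = CountVectorizer().fit_transform(corpus)
>>> tfidf = TfidfTransformer().fit_transform(counts)
>>> tfidf.shape
(2, 5)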
Methods
fit(self, X[, y]) Learn the idf vector (global term weights).
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
set_params(self, **params) Set the parameters of this estimator.
transform(self, X[, copy]) Transform a count matrix to a tf or tf-idf representation
fit(self, X, y=None)
Learn the idf vector (global term weights).
Parameters
X [sparse matrix of shape (n_samples, n_features)] A matrix of term/token counts.
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X, copy=True)
Transform a count matrix to a tf or tf-idf representation
Parameters
X [sparse matrix of (n_samples, n_features)] a matrix of term/token counts
copy [bool, default=True] Whether to copy X and operate on the copy or perform in-place
operations.
Returns
vectors [sparse matrix of shape (n_samples, n_features)]
sklearn.feature_extraction.text.TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
Read more in the User Guide.
Parameters
input [{‘filename’, ‘file’, ‘content’}, default=’content’] If ‘filename’, the sequence passed as
an argument to fit is expected to be a list of filenames that need reading to fetch the raw
content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch
the bytes in memory.
Otherwise the input is expected to be a sequence of items that can be of type string or byte.
encoding [str, default=’utf-8’] If bytes or files are given to analyze, this encoding is used to
decode.
decode_error [{‘strict’, ‘ignore’, ‘replace’}, default=’strict’] Instruction on what to do if a byte
sequence is given to analyze that contains characters not of the given encoding. By
default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are
‘ignore’ and ‘replace’.
strip_accents [{‘ascii’, ‘unicode’}, default=None] Remove accents and perform other charac-
ter normalization during the preprocessing step. ‘ascii’ is a fast method that only works on
characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that
works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.
lowercase [bool, default=True] Convert all characters to lowercase before tokenizing.
preprocessor [callable, default=None] Override the preprocessing (string transformation) stage
while preserving the tokenizing and n-grams generation steps. Only applies if analyzer
is not callable.
tokenizer [callable, default=None] Override the string tokenization step while preserving the
preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
analyzer [{‘word’, ‘char’, ‘char_wb’} or callable, default=’word’] Whether the feature should
be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only
from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed
input.
Changed in version 0.21.
Since v0.21, if input is filename or file, the data is first read from the file and then
passed to the given callable analyzer.
stop_words [{‘english’}, list, default=None] If a string, it is passed to _check_stop_list and the
appropriate stop list is returned. ‘english’ is currently the only supported string value. There
are several known issues with ‘english’ and you should consider an alternative (see Using
stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the
resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0)
to automatically detect and filter stop words based on intra corpus document frequency of
terms.
token_pattern [str] Regular expression denoting what constitutes a “token”, only used if
analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric
characters (punctuation is completely ignored and always treated as a token separator).
ngram_range [tuple (min_n, max_n), default=(1, 1)] The lower and upper boundary of the
range of n-values for different n-grams to be extracted. All values of n such that min_n <= n
<= max_n will be used. For example an ngram_range of (1, 1) means only unigrams,
(1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if
analyzer is not callable.
max_df [float or int, default=1.0] When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
min_df [float or int, default=1] When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float in range [0.0, 1.0], the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
max_features [int, default=None] If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
vocabulary [Mapping or iterable, default=None] Either a Mapping (e.g., a dict) where keys are
terms and values are indices in the feature matrix, or an iterable over terms. If not given, a
vocabulary is determined from the input documents.
binary [bool, default=False] If True, all non-zero term counts are set to 1. This does not mean
outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and
normalization to False to get 0/1 outputs).
dtype [dtype, default=float64] Type of the matrix returned by fit_transform() or transform().
norm [{‘l1’, ‘l2’}, default=’l2’] Each output row will have unit norm, either:
• ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
• ‘l1’: Sum of absolute values of vector elements is 1. See preprocessing.normalize.
use_idf [bool, default=True] Enable inverse-document-frequency reweighting.
smooth_idf [bool, default=True] Smooth idf weights by adding one to document frequencies, as
if an extra document was seen containing every term in the collection exactly once. Prevents
zero divisions.
sublinear_tf [bool, default=False] Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
Attributes
vocabulary_ [dict] A mapping of terms to feature indices.
fixed_vocabulary_: bool True if a fixed vocabulary of term to indices mapping is provided by
the user
idf_ [array of shape (n_features,)] The inverse document frequency (IDF) vector; only defined
if use_idf is True.
stop_words_ [set] Terms that were ignored because they either:
• occurred in too many documents (max_df)
• occurred in too few documents (min_df)
• were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
See also:
Notes
The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided
only for introspection and can be safely removed using delattr or set to None before pickling.
Examples
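A typical usage sketch on a tiny corpus:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)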
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, raw_documents, copy=’deprecated’)
Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
Parameters
raw_documents [iterable] An iterable which yields either str, unicode or file objects.
copy [bool, default=True] Whether to copy X and operate on the copy or perform in-place
operations.
Deprecated since version 0.22: The copy parameter is unused and was deprecated in
version 0.22 and will be removed in 0.24. This parameter will be ignored.
Returns
X [sparse matrix of (n_samples, n_features)] Tf-idf-weighted document-term matrix.
The sklearn.feature_selection module implements feature selection algorithms. It currently includes uni-
variate filter selection methods and the recursive feature elimination algorithm.
User guide: See the Feature selection section for further details.
7.15.1 sklearn.feature_selection.GenericUnivariateSelect
class sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
Univariate feature selector with configurable strategy.
Read more in the User Guide.
Parameters
score_func [callable] Function taking two arrays X and y, and returning a pair of arrays (scores,
pvalues). For modes ‘percentile’ or ‘kbest’ it can return a single array scores.
mode [{‘percentile’, ‘k_best’, ‘fpr’, ‘fdr’, ‘fwe’}] Feature selection mode.
param [float or int depending on the feature selection mode] Parameter of the corresponding
mode.
Attributes
scores_ [array-like of shape (n_features,)] Scores of features.
pvalues_ [array-like of shape (n_features,)] p-values of feature scores, None if score_func
returned scores only.
See also:
Examples
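A usage sketch selecting the 20 best features of the breast cancer dataset with chi2:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import GenericUnivariateSelect, chi2
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X.shape
(569, 30)
>>> transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)
>>> X_new = transformer.fit_transform(X, y)
>>> X_new.shape
(569, 20)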
Methods
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.2 sklearn.feature_selection.SelectPercentile
Notes
Ties between features with equal scores will be broken in an unspecified way.
Examples
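A usage sketch keeping the top 10% of features of the digits dataset, scored by chi2:
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectPercentile, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectPercentile(chi2, percentile=10).fit_transform(X, y)
>>> X_new.shape
(1797, 7)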
Methods
get_support(self, indices=False)
Get a mask, or integer index, of the features selected
Parameters
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.3 sklearn.feature_selection.SelectKBest
Notes
Ties between features with equal scores will be broken in an unspecified way.
Examples
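A usage sketch keeping the 20 highest-scoring features of the digits dataset, scored by chi2:
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)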
Methods
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.4 sklearn.feature_selection.SelectFpr
Examples
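A usage sketch on the breast cancer dataset; the exact number of retained features depends on the chi2 p-values, so only the fact that some features are dropped is asserted here:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import SelectFpr, chi2
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X.shape
(569, 30)
>>> X_new = SelectFpr(chi2, alpha=0.01).fit_transform(X, y)
>>> X_new.shape[1] < X.shape[1]
True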
Methods
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.5 sklearn.feature_selection.SelectFdr
References
https://en.wikipedia.org/wiki/False_discovery_rate
Examples
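A usage sketch on the breast cancer dataset; as with SelectFpr, only the fact that some features are dropped is asserted here, since the exact count depends on the chi2 p-values:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import SelectFdr, chi2
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X.shape
(569, 30)
>>> X_new = SelectFdr(chi2, alpha=0.01).fit_transform(X, y)
>>> X_new.shape[1] < X.shape[1]
True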
Methods
Parameters
X [array-like of shape (n_samples, n_features)] The training input samples.
y [array-like of shape (n_samples,)] The target values (class labels in classification, real
numbers in regression).
Returns
self [object]
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_support(self, indices=False)
Get a mask, or integer index, of the features selected
Parameters
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.6 sklearn.feature_selection.SelectFromModel
Notes
Examples
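A sketch on toy data; passing threshold=-np.inf together with max_features=2 keeps exactly the two features with the largest absolute coefficients of the fitted estimator:
>>> import numpy as np
>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.linear_model import LogisticRegression
>>> X = [[ 0.87, -1.34,  0.31],
...      [-2.79, -0.02, -0.85],
...      [-1.34, -0.48, -2.55],
...      [ 1.92,  1.48,  0.65]]
>>> y = [0, 1, 0, 1]
>>> selector = SelectFromModel(LogisticRegression(), threshold=-np.inf,
...                            max_features=2).fit(X, y)
>>> selector.transform(X).shape
(4, 2)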
Methods
Parameters
X [array-like of shape (n_samples, n_features)] The training input samples.
y [array-like, shape (n_samples,)] The target values (integers that correspond to classes in
classification, real numbers in regression).
**fit_params [Other estimator specific parameters]
Returns
self [object]
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_support(self, indices=False)
Get a mask, or integer index, of the features selected
Parameters
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
7.15.7 sklearn.feature_selection.SelectFwe
score_func [callable] Function taking two arrays X and y, and returning a pair of arrays (scores,
pvalues). Default is f_classif (see below “See also”). The default function only works with
classification tasks.
alpha [float, optional] The highest uncorrected p-value for features to keep.
Attributes
scores_ [array-like of shape (n_features,)] Scores of features.
pvalues_ [array-like of shape (n_features,)] p-values of feature scores.
See also:
Examples
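A usage sketch on the breast cancer dataset; only the fact that some features are dropped is asserted here, since the exact count depends on the chi2 p-values:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import SelectFwe, chi2
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X.shape
(569, 30)
>>> X_new = SelectFwe(chi2, alpha=0.01).fit_transform(X, y)
>>> X_new.shape[1] < X.shape[1]
True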
Methods
Parameters
X [array-like of shape (n_samples, n_features)] The training input samples.
y [array-like of shape (n_samples,)] The target values (class labels in classification, real
numbers in regression).
Returns
self [object]
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_support(self, indices=False)
Get a mask, or integer index, of the features selected
Parameters
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.8 sklearn.feature_selection.RFE
ranking_ [array of shape [n_features]] The feature ranking, such that ranking_[i] corre-
sponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are
assigned rank 1.
estimator_ [object] The external estimator fit on the reduced dataset.
See also:
RFECV Recursive feature elimination with built-in cross-validated selection of the best number of features
Notes
References
[Re310f679c81e-1]
Examples
The following example shows how to retrieve the 5 most informative features in the Friedman #1 dataset.
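A sketch of such a run (the five informative features of make_friedman1 end up in support_):
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFE(estimator, n_features_to_select=5, step=1).fit(X, y)
>>> selector.support_
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])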
Methods
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The training input samples.
y [array-like of shape (n_samples,)] The target values.
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
predict(self, X)
Reduce X to the selected features and then predict using the underlying estimator.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
y [array of shape [n_samples]] The predicted target values.
predict_log_proba(self, X)
Predict class log-probabilities for X.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
p [array of shape (n_samples, n_classes)] The class log-probabilities of the input samples.
The order of the classes corresponds to that in the attribute classes_.
predict_proba(self, X)
Predict class probabilities for X.
Parameters
X [{array-like or sparse matrix} of shape (n_samples, n_features)] The input samples. In-
ternally, it will be converted to dtype=np.float32 and if a sparse matrix is provided
to a sparse csr_matrix.
Returns
p [array of shape (n_samples, n_classes)] The class probabilities of the input samples. The
order of the classes corresponds to that in the attribute classes_.
score(self, X, y)
Reduce X to the selected features and then return the score of the underlying estimator.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
y [array of shape [n_samples]] The target values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.9 sklearn.feature_selection.RFECV
Notes
References
[R6f4d61ceb411-1]
Examples
The following example shows how to recover the 5 informative features (not known a priori) in the Friedman #1
dataset.
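A minimal sketch of such an example (the sample size, the linear SVR estimator and cv=5 below are illustrative choices):
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> # Cross-validation chooses the number of features to keep automatically
>>> selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)
>>> n_kept = selector.n_features_   # number of features selected by cross-validation
>>> mask = selector.support_        # boolean mask of the retained features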
Methods
Fit the RFE model and automatically tune the number of selected features.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vector, where
n_samples is the number of samples and n_features is the total number of features.
y [array-like of shape (n_samples,)] Target values (integers for classification, real numbers
for regression).
groups [array-like of shape (n_samples,) or None] Group labels for the samples used while
splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance
(e.g., GroupKFold).
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
predict(self, X)
Reduce X to the selected features and then predict using the underlying estimator.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
y [array of shape [n_samples]] The predicted target values.
predict_log_proba(self, X)
Predict class log-probabilities for X.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
p [array of shape (n_samples, n_classes)] The class log-probabilities of the input samples.
The order of the classes corresponds to that in the attribute classes_.
predict_proba(self, X)
Predict class probabilities for X.
Parameters
X [{array-like or sparse matrix} of shape (n_samples, n_features)] The input samples. In-
ternally, it will be converted to dtype=np.float32 and if a sparse matrix is provided
to a sparse csr_matrix.
Returns
p [array of shape (n_samples, n_classes)] The class probabilities of the input samples. The
order of the classes corresponds to that in the attribute classes_.
score(self, X, y)
Reduce X to the selected features and then return the score of the underlying estimator.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
y [array of shape [n_samples]] The target values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
7.15.10 sklearn.feature_selection.VarianceThreshold
class sklearn.feature_selection.VarianceThreshold(threshold=0.0)
Feature selector that removes all low-variance features.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used
for unsupervised learning.
Read more in the User Guide.
Parameters
threshold [float, optional] Features with a training-set variance lower than this threshold will be
removed. The default is to keep all features with non-zero variance, i.e. remove the features
that have the same value in all samples.
Attributes
variances_ [array, shape (n_features,)] Variances of individual features.
Notes
Examples
The following dataset has integer features, two of which are the same in every sample. These are removed with
the default setting for threshold:
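A minimal sketch of that example (the toy matrix below is illustrative):
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 0, 3],
...      [0, 1, 4, 3],
...      [0, 1, 1, 3]]
>>> # The first and last columns are constant (zero variance) and are dropped
>>> VarianceThreshold().fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])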
Methods
__init__(self, threshold=0.0)
Initialize self. See help(type(self)) for accurate signature.
fit(self, X, y=None)
Learn empirical variances from X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] Sample vectors from which
to compute variances.
y [any] Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns
self
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
get_support(self, indices=False)
Get a mask, or integer index, of the features selected
Parameters
indices [boolean (default False)] If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
support [array] An index that selects the retained features from a feature vector. If
indices is False, this is a boolean array of shape [# input features], in which an ele-
ment is True iff its corresponding feature is selected for retention. If indices is True,
this is an integer array of shape [# output features] whose values are indices into the input
feature vector.
inverse_transform(self, X)
Reverse the transformation operation
Parameters
X [array of shape [n_samples, n_selected_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_original_features]] X with columns of zeros inserted
where features would have been removed by transform.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Reduce X to the selected features.
Parameters
X [array of shape [n_samples, n_features]] The input samples.
Returns
X_r [array of shape [n_samples, n_selected_features]] The input samples with only the se-
lected features.
7.15.11 sklearn.feature_selection.chi2
sklearn.feature_selection.chi2(X, y)
Compute chi-squared stats between each non-negative feature and class.
This score can be used to select the n_features features with the highest values for the test chi-squared statistic
from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in
document classification), relative to the classes.
Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds
out” the features that are the most likely to be independent of class and therefore irrelevant for classification.
Read more in the User Guide.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Sample vectors.
y [array-like of shape (n_samples,)] Target vector (class labels).
Returns
chi2 [array, shape = (n_features,)] chi2 statistics of each feature.
pval [array, shape = (n_features,)] p-values of each feature.
See also:
Notes
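As a sketch of typical usage (the iris data and k=2 below are illustrative choices), the scores can be passed to SelectKBest to keep the highest-scoring features:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_iris(return_X_y=True)    # all features are non-negative
>>> chi2_stats, p_values = chi2(X, y)    # one statistic and one p-value per feature
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)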
7.15.12 sklearn.feature_selection.f_classif
sklearn.feature_selection.f_classif(X, y)
Compute the ANOVA F-value for the provided sample.
Read more in the User Guide.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The set of regressors that will
be tested sequentially.
y [array of shape (n_samples,)] The target vector.
Returns
F [array, shape = [n_features,]] The set of F values.
pval [array, shape = [n_features,]] The set of p-values.
See also:
7.15.13 sklearn.feature_selection.f_regression
sklearn.feature_selection.f_regression(X, y, center=True)
Univariate linear regression tests.
Linear model for testing the individual effect of each of many regressors. This is a scoring function to be used
in a feature selection procedure, not a free standing feature selection procedure.
This is done in 2 steps:
1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y -
mean_y)) / (std(X[:, i]) * std(y)).
2. It is converted to an F score then to a p-value.
For more on usage see the User Guide.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] The set of regressors that will
be tested sequentially.
y [array of shape (n_samples,)] The target vector.
center [bool, default=True] If True, X and y will be centered.
Returns
F [array, shape=(n_features,)] F values of features.
pval [array, shape=(n_features,)] p-values of F-scores.
See also:
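A brief usage sketch following the same pattern (the synthetic regression data below is illustrative):
>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import SelectKBest, f_regression
>>> X, y = make_regression(n_samples=100, n_features=10, n_informative=2, random_state=0)
>>> F, pval = f_regression(X, y)   # one F statistic and one p-value per feature
>>> X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
>>> X_new.shape
(100, 2)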
7.15.14 sklearn.feature_selection.mutual_info_classif
sklearn.feature_selection.mutual_info_classif(X, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)
Estimate mutual information for a discrete target variable.
Mutual information (MI) [1] between two random variables is a non-negative value, which measures the depen-
dency between the variables. It is equal to zero if and only if two random variables are independent, and higher
values mean higher dependency.
The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances
as described in [2] and [3]. Both methods are based on the idea originally proposed in [4].
It can be used for univariate features selection, read more in the User Guide.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Feature matrix.
y [array_like, shape (n_samples,)] Target vector.
discrete_features [{‘auto’, bool, array_like}, default ‘auto’] If bool, then determines whether
to consider all features discrete or continuous. If array, then it should be either a boolean
mask with shape (n_features,) or array with indices of discrete features. If ‘auto’, it is
assigned to False for dense X and to True for sparse X.
n_neighbors [int, default 3] Number of neighbors to use for MI estimation for continuous vari-
ables, see [2] and [3]. Higher values reduce variance of the estimation, but could introduce
a bias.
copy [bool, default True] Whether to make a copy of the given data. If set to False, the initial
data will be overwritten.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator for adding small noise to continuous variables in order
to remove repeated values. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
Returns
mi [ndarray, shape (n_features,)] Estimated mutual information between each feature and the
target.
Notes
1. The term “discrete features” is used instead of naming them “categorical”, because it describes the essence
more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical)
and you will get better results if you mark them as such. Also note that treating a continuous variable as
discrete, and vice versa, will usually give incorrect results, so be attentive about that.
2. True mutual information can’t be negative. If its estimate turns out to be negative, it is replaced by zero.
References
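A minimal usage sketch (the iris data is an illustrative choice):
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import mutual_info_classif
>>> X, y = load_iris(return_X_y=True)
>>> mi = mutual_info_classif(X, y, random_state=0)
>>> mi.shape   # one MI estimate per feature; larger values indicate stronger dependency with y
(4,)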
7.15.15 sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_regression(X, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)
Estimate mutual information for a continuous target variable.
Mutual information (MI) [1] between two random variables is a non-negative value, which measures the depen-
dency between the variables. It is equal to zero if and only if two random variables are independent, and higher
values mean higher dependency.
The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances
as described in [2] and [3]. Both methods are based on the idea originally proposed in [4].
It can be used for univariate features selection, read more in the User Guide.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Feature matrix.
y [array_like, shape (n_samples,)] Target vector.
discrete_features [{‘auto’, bool, array_like}, default ‘auto’] If bool, then determines whether
to consider all features discrete or continuous. If array, then it should be either a boolean
mask with shape (n_features,) or array with indices of discrete features. If ‘auto’, it is
assigned to False for dense X and to True for sparse X.
n_neighbors [int, default 3] Number of neighbors to use for MI estimation for continuous vari-
ables, see [2] and [3]. Higher values reduce variance of the estimation, but could introduce
a bias.
copy [bool, default True] Whether to make a copy of the given data. If set to False, the initial
data will be overwritten.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator for adding small noise to continuous variables in order
to remove repeated values. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
Returns
mi [ndarray, shape (n_features,)] Estimated mutual information between each feature and the
target.
Notes
1. The term “discrete features” is used instead of naming them “categorical”, because it describes the essence
more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical)
and you will get better results if you mark them as such. Also note that treating a continuous variable as
discrete, and vice versa, will usually give incorrect results, so be attentive about that.
2. True mutual information can’t be negative. If its estimate turns out to be negative, it is replaced by zero.
References
The sklearn.gaussian_process module implements Gaussian Process based regression and classification.
User guide: See the Gaussian Processes section for further details.
7.16.1 sklearn.gaussian_process.GaussianProcessClassifier
optimizer [string or callable, optional (default: 'fmin_l_bfgs_b')] The optimizer used for optimizing the
kernel's parameters which maximize the log-marginal likelihood.
n_restarts_optimizer [int, optional (default: 0)] The number of restarts of the optimizer for
finding the kernel’s parameters which maximize the log-marginal likelihood. The first run
of the optimizer is performed from the kernel’s initial parameters, the remaining ones (if
any) from thetas sampled log-uniform randomly from the space of allowed theta-values. If
greater than 0, all bounds must be finite. Note that n_restarts_optimizer=0 implies that one
run is performed.
max_iter_predict [int, optional (default: 100)] The maximum number of iterations in New-
ton’s method for approximating the posterior during predict. Smaller values will reduce
computation time at the cost of worse results.
warm_start [bool, optional (default: False)] If warm-starts are enabled, the solution of the last
Newton iteration on the Laplace approximation of the posterior mode is used as initializa-
tion for the next call of _posterior_mode(). This can speed up convergence when _poste-
rior_mode is called several times on similar problems as in hyperparameter optimization.
See the Glossary.
copy_X_train [bool, optional (default: True)] If True, a persistent copy of the training data is
stored in the object. Otherwise, just a reference to the training data is stored, which might
cause predictions to change if the data is modified externally.
random_state [int, RandomState instance or None, optional (default: None)] The generator
used to initialize the centers. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
multi_class [string, default: 'one_vs_rest'] Specifies how multi-class classification problems are handled.
Supported are “one_vs_rest” and “one_vs_one”. In “one_vs_rest”, one binary Gaussian
process classifier is fitted for each class, which is trained to separate this class from the rest.
In “one_vs_one”, one binary Gaussian process classifier is fitted for each pair of classes,
which is trained to separate these two classes. The predictions of these binary predictors are
combined into multi-class predictions. Note that “one_vs_one” does not support predicting
probability estimates.
n_jobs [int or None, optional (default=None)] The number of jobs to use for the computation.
None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.
Attributes
kernel_ [kernel object] The kernel used for prediction. In case of binary classification, the
structure of the kernel is the same as the one passed as parameter but with optimized hy-
perparameters. In case of multi-class classification, a CompoundKernel is returned which
consists of the different kernels used in the one-versus-rest classifiers.
Examples
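A minimal sketch in the spirit of the library's examples (the iris data and the 1.0 * RBF(1.0) kernel are illustrative choices):
>>> from sklearn.datasets import load_iris
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X, y = load_iris(return_X_y=True)
>>> kernel = 1.0 * RBF(1.0)
>>> gpc = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X, y)
>>> proba = gpc.predict_proba(X[:2])   # class probabilities, columns ordered as in classes_
>>> accuracy = gpc.score(X, y)         # mean accuracy on the training data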
Methods
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
log_marginal_likelihood(self, theta=None, eval_gradient=False, clone_kernel=True)
Returns log-marginal likelihood of theta for training data.
In the case of multi-class classification, the mean log-marginal likelihood of the one-versus-rest classifiers
is returned.
Parameters
theta [array-like of shape (n_kernel_params,) or None] Kernel hyperparameters for which
the log-marginal likelihood is evaluated. In the case of multi-class classification, theta may
be the hyperparameters of the compound kernel or of an individual kernel. In the latter
case, all individual kernels get assigned the same theta values. If None, the precomputed
log_marginal_likelihood of self.kernel_.theta is returned.
eval_gradient [bool, default: False] If True, the gradient of the log-marginal likelihood with
respect to the kernel hyperparameters at position theta is returned additionally. Note that
gradient computation is not supported for non-binary classification. If True, theta must not
be None.
clone_kernel [bool, default=True] If True, the kernel attribute is copied. If False, the kernel
attribute is modified in place, which may result in a performance improvement.
Returns
log_likelihood [float] Log-marginal likelihood of theta for training data.
log_likelihood_gradient [array, shape = (n_kernel_params,), optional] Gradient of the log-
marginal likelihood with respect to the kernel hyperparameters at position theta. Only
returned when eval_gradient is True.
predict(self, X)
Perform classification on an array of test vectors X.
Parameters
X [sequence of length n_samples] Query points where the GP is evaluated for classification.
Could either be array-like with shape = (n_samples, n_features) or a list of objects.
Returns
C [ndarray of shape (n_samples,)] Predicted target values for X, values are from classes_
predict_proba(self, X)
Return probability estimates for the test vector X.
Parameters
X [sequence of length n_samples] Query points where the GP is evaluated for classification.
Could either be array-like with shape = (n_samples, n_features) or a list of objects.
Returns
C [array-like of shape (n_samples, n_classes)] Returns the probability of the samples for
each class in the model. The columns correspond to the classes in sorted order, as they
appear in the attribute classes_.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.16.2 sklearn.gaussian_process.GaussianProcessRegressor
class sklearn.gaussian_process.GaussianProcessRegressor(kernel=None, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, random_state=None)
Gaussian process regression (GPR).
The implementation is based on Algorithm 2.1 of Gaussian Processes for Machine Learning (GPML) by Ras-
mussen and Williams.
optimizer [string or callable, optional (default: 'fmin_l_bfgs_b')] The optimizer used for optimizing the
kernel's parameters which maximize the log-marginal likelihood.
n_restarts_optimizer [int, optional (default: 0)] The number of restarts of the optimizer for
finding the kernel’s parameters which maximize the log-marginal likelihood. The first run
of the optimizer is performed from the kernel’s initial parameters, the remaining ones (if
any) from thetas sampled log-uniform randomly from the space of allowed theta-values. If
greater than 0, all bounds must be finite. Note that n_restarts_optimizer == 0 implies that
one run is performed.
normalize_y [boolean, optional (default: False)] Whether the target values y are normalized,
i.e., the mean of the observed target values becomes zero. This parameter should be set to
True if the target values' mean is expected to differ considerably from zero. When enabled,
the normalization effectively modifies the GP's prior based on the data, which contradicts
the likelihood principle; normalization is thus disabled by default.
copy_X_train [bool, optional (default: True)] If True, a persistent copy of the training data is
stored in the object. Otherwise, just a reference to the training data is stored, which might
cause predictions to change if the data is modified externally.
random_state [int, RandomState instance or None, optional (default: None)] The generator
used to initialize the centers. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
Attributes
X_train_ [sequence of length n_samples] Feature vectors or other representations of training
data (also required for prediction). Could either be array-like with shape = (n_samples,
n_features) or a list of objects.
y_train_ [array-like of shape (n_samples,) or (n_samples, n_targets)] Target values in training
data (also required for prediction)
kernel_ [kernel object] The kernel used for prediction. The structure of the kernel is the same
as the one passed as parameter but with optimized hyperparameters
L_ [array-like of shape (n_samples, n_samples)] Lower-triangular Cholesky decomposition of
the kernel in X_train_
alpha_ [array-like of shape (n_samples,)] Dual coefficients of training data points in kernel
space
log_marginal_likelihood_value_ [float] The log-marginal-likelihood of self.kernel_.
theta
Examples
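A minimal sketch in the spirit of the library's examples (the make_friedman2 data and the DotProduct() + WhiteKernel() kernel are illustrative choices):
>>> from sklearn.datasets import make_friedman2
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
>>> X, y = make_friedman2(n_samples=500, noise=0, random_state=0)
>>> kernel = DotProduct() + WhiteKernel()
>>> gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
>>> y_mean, y_std = gpr.predict(X[:2], return_std=True)   # predictive mean and std at query points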
Methods
We can also predict based on an unfitted model by using the GP prior. In addition to the mean of the
predictive distribution, its standard deviation (return_std=True) or covariance (return_cov=True) can also
be returned. Note that at most one of the two can be requested.
Parameters
X [sequence of length n_samples] Query points where the GP is evaluated. Could either be
array-like with shape = (n_samples, n_features) or a list of objects.
return_std [bool, default: False] If True, the standard-deviation of the predictive distribu-
tion at the query points is returned along with the mean.
return_cov [bool, default: False] If True, the covariance of the joint predictive distribution
at the query points is returned along with the mean
Returns
y_mean [array, shape = (n_samples, [n_output_dims])] Mean of predictive distribution at
query points
y_std [array, shape = (n_samples,), optional] Standard deviation of predictive distribution at
query points. Only returned when return_std is True.
y_cov [array, shape = (n_samples, n_samples), optional] Covariance of joint predictive distribution
at query points. Only returned when return_cov is True.
sample_y(self, X, n_samples=1, random_state=0)
Draw samples from Gaussian process and evaluate at X.
Parameters
X [sequence of length n_samples] Query points where the GP is evaluated. Could either be
array-like with shape = (n_samples, n_features) or a list of objects.
n_samples [int, default: 1] The number of samples drawn from the Gaussian process
random_state [int, RandomState instance or None, optional (default=0)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
Returns
y_samples [array, shape = (n_samples_X, [n_output_dims], n_samples)] Values of
n_samples samples drawn from Gaussian process and evaluated at query points.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
7.16.3 sklearn.gaussian_process.kernels.CompoundKernel
class sklearn.gaussian_process.kernels.CompoundKernel(kernels)
Kernel which is composed of a set of other kernels.
New in version 0.18.
Parameters
kernels [list of Kernel objects] The other kernels
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
__init__(self, kernels)
Initialize self. See help(type(self)) for accurate signature.
__call__(self, X, Y=None, eval_gradient=False)
Return the kernel k(X, Y) and optionally its gradient.
Note that this compound kernel returns the results of all simple kernels stacked along an additional axis.
Parameters
X [sequence of length n_samples_X] Left argument of the returned kernel k(X, Y) Could
either be array-like with shape = (n_samples_X, n_features) or a list of objects.
Y [sequence of length n_samples_Y] Right argument of the returned kernel k(X, Y).
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.4 sklearn.gaussian_process.kernels.ConstantKernel
class sklearn.gaussian_process.kernels.ConstantKernel(constant_value=1.0, constant_value_bounds=(1e-05, 100000.0))
Constant kernel.
Can be used as part of a product-kernel where it scales the magnitude of the other factor (kernel) or as part of a
sum-kernel, where it modifies the mean of the Gaussian process.
k(x_1, x_2) = constant_value for all x_1, x_2
New in version 0.18.
Parameters
constant_value [float, default: 1.0] The constant value which defines the covariance: k(x_1,
x_2) = constant_value
constant_value_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound
on constant_value
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_constant_value
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Whether the kernel works only on fixed-length feature vectors.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Whether the kernel works only on fixed-length feature vectors.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.5 sklearn.gaussian_process.kernels.DotProduct
class sklearn.gaussian_process.kernels.DotProduct(sigma_0=1.0, sigma_0_bounds=(1e-05, 100000.0))
Dot-Product kernel.
The DotProduct kernel is non-stationary and can be obtained from linear regression by putting N(0, 1) priors
on the coefficients of x_d (d = 1, . . . , D) and a prior of N(0, sigma_0^2) on the bias. The DotProduct
kernel is invariant to a rotation of the coordinates about the origin, but not translations. It is parameterized by
a parameter sigma_0^2. For sigma_0^2 = 0, the kernel is called the homogeneous linear kernel, otherwise it is
inhomogeneous. The kernel is given by
k(x_i, x_j) = sigma_0^2 + x_i · x_j
The DotProduct kernel is commonly combined with exponentiation.
New in version 0.18.
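A short sketch of how the kernel can be evaluated directly (the toy inputs below are illustrative):
>>> import numpy as np
>>> from sklearn.gaussian_process.kernels import DotProduct
>>> X = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> kernel = DotProduct(sigma_0=1.0)
>>> K = kernel(X)            # kernel matrix k(X, X), shape (2, 2)
>>> K_diag = kernel.diag(X)  # same values as np.diag(K), but computed more cheaply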
Parameters
sigma_0 [float >= 0, default: 1.0] Parameter controlling the inhomogeneity of the kernel. If
sigma_0=0, the kernel is homogeneous.
sigma_0_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on sigma_0
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_sigma_0
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature
vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
K_gradient [array (opt.), shape (n_samples_X, n_samples_X, n_dims)] The gradient of the
kernel k(X, X) with respect to the hyperparameter of the kernel. Only returned when
eval_gradient is True.
property bounds
Returns the log-transformed bounds on the theta.
Returns
bounds [array, shape (n_dims, 2)] The log-transformed bounds on the kernel’s hyperparam-
eters theta
clone_with_theta(self, theta)
Returns a clone of self with given hyperparameters theta.
Parameters
theta [array, shape (n_dims,)] The hyperparameters
diag(self, X)
Returns the diagonal of the kernel k(X, X).
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [array, shape (n_samples_X, n_features)] Left argument of the returned kernel k(X, Y)
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects. Defaults to True
for backward compatibility.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.6 sklearn.gaussian_process.kernels.ExpSineSquared
class sklearn.gaussian_process.kernels.ExpSineSquared(length_scale=1.0, periodicity=1.0, length_scale_bounds=(1e-05, 100000.0), periodicity_bounds=(1e-05, 100000.0))
Exp-Sine-Squared kernel.
The ExpSineSquared kernel allows modeling periodic functions. It is parameterized by a length-scale parameter
length_scale>0 and a periodicity parameter periodicity>0. Only the isotropic variant where length_scale is a
scalar is supported at the moment. The kernel is given by:
k(x_i, x_j) = exp(-2 (sin(pi / periodicity * d(x_i, x_j)) / length_scale) ^ 2)
New in version 0.18.
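A short sketch of fitting a GP with this periodic kernel (the sine data, the parameter values, and the alpha jitter are illustrative assumptions):
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import ExpSineSquared
>>> X = np.linspace(0, 10, 30).reshape(-1, 1)
>>> y = np.sin(X).ravel()                      # periodic toy target
>>> kernel = ExpSineSquared(length_scale=1.0, periodicity=2 * np.pi)
>>> gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)  # small jitter for stability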
Parameters
length_scale [float > 0, default: 1.0] The length scale of the kernel.
periodicity [float > 0, default: 1.0] The periodicity of the kernel.
length_scale_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
length_scale
periodicity_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
periodicity
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_length_scale
hyperparameter_periodicity
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
Parameters
X [sequence of length n_samples] Left argument of the returned kernel k(X, Y)
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects. Defaults to True
for backward compatibility.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.7 sklearn.gaussian_process.kernels.Exponentiation
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
property bounds
Returns the log-transformed bounds on the theta.
Returns
bounds [array, shape (n_dims, 2)] The log-transformed bounds on the kernel’s hyperparam-
eters theta
clone_with_theta(self, theta)
Returns a clone of self with given hyperparameters theta.
Parameters
theta [array, shape (n_dims,)] The hyperparameters
diag(self, X)
Returns the diagonal of the kernel k(X, X).
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [sequence of length n_samples_X] Argument to the kernel. Could either be array-like
with shape = (n_samples_X, n_features) or a list of objects.
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.8 sklearn.gaussian_process.kernels.Hyperparameter
class sklearn.gaussian_process.kernels.Hyperparameter
A kernel hyperparameter’s specification in the form of a namedtuple.
New in version 0.18.
Attributes
name [string] Alias for field number 0
value_type [string] Alias for field number 1
bounds [pair of floats >= 0 or “fixed”] Alias for field number 2
n_elements [int, default=1] Alias for field number 3
fixed [bool, default: None] Alias for field number 4
Methods
property value_type
Alias for field number 1
7.16.9 sklearn.gaussian_process.kernels.Kernel
class sklearn.gaussian_process.kernels.Kernel
Base class for all kernels.
New in version 0.18.
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature
vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
abstract diag(self, X)
Returns the diagonal of the kernel k(X, X).
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [sequence of length n_samples] Left argument of the returned kernel k(X, Y)
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
abstract is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects. Defaults to True
for backward compatibility.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.10 sklearn.gaussian_process.kernels.Matern
class sklearn.gaussian_process.kernels.Matern(length_scale=1.0, length_scale_bounds=(1e-05, 100000.0), nu=1.5)
Matern kernel.
The class of Matern kernels is a generalization of the RBF and the absolute exponential kernel parameterized by
an additional parameter nu. The smaller nu, the less smooth the approximated function is. For nu=inf, the kernel
becomes equivalent to the RBF kernel and for nu=0.5 to the absolute exponential kernel. Important intermediate
values are nu=1.5 (once differentiable functions) and nu=2.5 (twice differentiable functions).
See Rasmussen and Williams 2006, pp84 for details regarding the different variants of the Matern kernel.
New in version 0.18.
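A short sketch of using the kernel (the iris data and nu=1.5 are illustrative choices):
>>> from sklearn.datasets import load_iris
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import Matern
>>> X, y = load_iris(return_X_y=True)
>>> kernel = Matern(length_scale=1.0, nu=1.5)   # nu=1.5 gives a once-differentiable prior
>>> gpc = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X, y)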
Parameters
length_scale [float or array with shape (n_features,), default: 1.0] The length scale of the kernel.
If a float, an isotropic kernel is used. If an array, an anisotropic kernel is used where each
dimension of l defines the length-scale of the respective feature dimension.
length_scale_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
length_scale
nu [float, default: 1.5] The parameter nu controlling the smoothness of the learned function.
The smaller nu, the less smooth the approximated function is. For nu=inf, the kernel be-
comes equivalent to the RBF kernel and for nu=0.5 to the absolute exponential kernel. Im-
portant intermediate values are nu=1.5 (once differentiable functions) and nu=2.5 (twice
differentiable functions). Note that values of nu not in [0.5, 1.5, 2.5, inf] incur a considerably
higher computational cost (approx. 10 times higher) since they require evaluating the modified
Bessel function. Furthermore, in contrast to length_scale, nu is kept fixed to its initial value
and not optimized.
Attributes
anisotropic
bounds Returns the log-transformed bounds on the theta.
hyperparameter_length_scale
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature
vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
Continued on next page
7.16.11 sklearn.gaussian_process.kernels.PairwiseKernel
class sklearn.gaussian_process.kernels.PairwiseKernel(gamma=1.0, gamma_bounds=(1e-05, 100000.0), metric='linear', pairwise_kernels_kwargs=None)
Wrapper for kernels in sklearn.metrics.pairwise.
A thin wrapper around the functionality of the kernels in sklearn.metrics.pairwise.
Note: Evaluation of eval_gradient is not analytic but numeric and all kernels support only isotropic dis-
tances. The parameter gamma is considered to be a hyperparameter and may be optimized. The other
kernel parameters are set directly at initialization and are kept fixed.
New in version 0.18.
Parameters
gamma [float >= 0, default: 1.0] Parameter gamma of the pairwise kernel specified by metric
gamma_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
gamma
metric [string, or callable, default: “linear”] The metric to use when calculating kernel between
instances in a feature array. If metric is a string, it must be one of the metrics in pair-
wise.PAIRWISE_KERNEL_FUNCTIONS. If metric is “precomputed”, X is assumed to be
a kernel matrix. Alternatively, if metric is a callable function, it is called on each pair of
instances (rows) and the resulting value recorded. The callable should take two arrays from
X as input and return a value indicating the distance between them.
pairwise_kernels_kwargs [dict, default: None] All entries of this dict (if any) are passed as
keyword arguments to the pairwise kernel function.
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_gamma
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature
vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
property bounds
Returns the log-transformed bounds on the theta.
Returns
bounds [array, shape (n_dims, 2)] The log-transformed bounds on the kernel’s hyperparam-
eters theta
clone_with_theta(self, theta)
Returns a clone of self with given hyperparameters theta.
Parameters
theta [array, shape (n_dims,)] The hyperparameters
diag(self, X)
Returns the diagonal of the kernel k(X, X).
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [array, shape (n_samples_X, n_features)] Left argument of the returned kernel k(X, Y)
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects. Defaults to True
for backward compatibility.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.12 sklearn.gaussian_process.kernels.Product
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.13 sklearn.gaussian_process.kernels.RBF
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, \*\*params) Set the parameters of this kernel.
7.16.14 sklearn.gaussian_process.kernels.RationalQuadratic
class sklearn.gaussian_process.kernels.RationalQuadratic(length_scale=1.0, alpha=1.0, length_scale_bounds=(1e-05, 100000.0), alpha_bounds=(1e-05, 100000.0))
Rational Quadratic kernel.
The RationalQuadratic kernel can be seen as a scale mixture (an infinite sum) of RBF kernels with different
characteristic length-scales. It is parameterized by a length-scale parameter length_scale>0 and a scale mixture
parameter alpha>0. Only the isotropic variant where length_scale is a scalar is supported at the moment. The
kernel is given by:
k(x_i, x_j) = (1 + d(x_i, x_j)^2 / (2*alpha * length_scale^2))^-alpha
New in version 0.18.
Parameters
length_scale [float > 0, default: 1.0] The length scale of the kernel.
alpha [float > 0, default: 1.0] Scale mixture parameter
length_scale_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
length_scale
alpha_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on alpha
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_alpha
hyperparameter_length_scale
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Returns whether the kernel is defined on fixed-length feature
vectors or generic objects.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
Continued on next page
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects. Defaults to True
for backward compatibility.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.15 sklearn.gaussian_process.kernels.Sum
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, **params) Set the parameters of this kernel.
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [sequence of length n_samples_X] Argument to the kernel. Could either be array-like
with shape = (n_samples_X, n_features) or a list of objects.
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Returns whether the kernel is defined on fixed-length feature vectors or generic objects.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.16.16 sklearn.gaussian_process.kernels.WhiteKernel
class sklearn.gaussian_process.kernels.WhiteKernel(noise_level=1.0, noise_level_bounds=(1e-05, 100000.0))
White kernel.
The main use-case of this kernel is as part of a sum-kernel where it explains the noise of the signal as independently and identically normally-distributed. The parameter noise_level equals the variance of this noise.
k(x_1, x_2) = noise_level if x_1 == x_2 else 0
New in version 0.18.
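A minimal sketch of the sum-kernel use-case described above (not from the original docs; data and parameter values are illustrative):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.RandomState(0)
X = np.linspace(0, 5, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(30)     # signal plus i.i.d. Gaussian noise

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
print(gpr.kernel_)   # the fitted noise_level term reflects the estimated noise variance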
Parameters
noise_level [float, default: 1.0] Parameter controlling the noise level (variance)
noise_level_bounds [pair of floats >= 0, default: (1e-5, 1e5)] The lower and upper bound on
noise_level
Attributes
bounds Returns the log-transformed bounds on the theta.
hyperparameter_noise_level
hyperparameters Returns a list of all hyperparameter specifications.
n_dims Returns the number of non-fixed hyperparameters of the kernel.
requires_vector_input Whether the kernel works only on fixed-length feature vectors.
theta Returns the (flattened, log-transformed) non-fixed hyperparameters.
Methods
__call__(self, X[, Y, eval_gradient]) Return the kernel k(X, Y) and optionally its gradient.
clone_with_theta(self, theta) Returns a clone of self with given hyperparameters
theta.
diag(self, X) Returns the diagonal of the kernel k(X, X).
get_params(self[, deep]) Get parameters of this kernel.
is_stationary(self) Returns whether the kernel is stationary.
set_params(self, **params) Set the parameters of this kernel.
eval_gradient is True.
property bounds
Returns the log-transformed bounds on the theta.
Returns
bounds [array, shape (n_dims, 2)] The log-transformed bounds on the kernel’s hyperparam-
eters theta
clone_with_theta(self, theta)
Returns a clone of self with given hyperparameters theta.
Parameters
theta [array, shape (n_dims,)] The hyperparameters
diag(self, X)
Returns the diagonal of the kernel k(X, X).
The result of this method is identical to np.diag(self(X)); however, it can be evaluated more efficiently
since only the diagonal is evaluated.
Parameters
X [sequence of length n_samples_X] Argument to the kernel. Could either be array-like
with shape = (n_samples_X, n_features) or a list of objects.
Returns
K_diag [array, shape (n_samples_X,)] Diagonal of kernel k(X, X)
get_params(self, deep=True)
Get parameters of this kernel.
Parameters
deep [boolean, optional] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
property hyperparameters
Returns a list of all hyperparameter specifications.
is_stationary(self )
Returns whether the kernel is stationary.
property n_dims
Returns the number of non-fixed hyperparameters of the kernel.
property requires_vector_input
Whether the kernel works only on fixed-length feature vectors.
set_params(self, **params)
Set the parameters of this kernel.
The method works on simple kernels as well as on nested kernels. The latter have parameters of the form
<component>__<parameter> so that it’s possible to update each component of a nested object.
Returns
self
property theta
Returns the (flattened, log-transformed) non-fixed hyperparameters.
Note that theta are typically the log-transformed values of the kernel’s hyperparameters as this representa-
tion of the search space is more amenable for hyperparameter search, as hyperparameters like length-scales
naturally live on a log-scale.
Returns
theta [array, shape (n_dims,)] The non-fixed, log-transformed hyperparameters of the kernel
7.17.1 sklearn.impute.SimpleImputer
• If “constant”, then replace missing values with fill_value. Can be used with strings or
numeric data.
New in version 0.20: strategy=”constant” for fixed value imputation.
fill_value [string or numerical value, default=None] When strategy == “constant”, fill_value is
used to replace all occurrences of missing_values. If left to the default, fill_value will be 0
when imputing numerical data and “missing_value” for strings or object data types.
verbose [integer, default=0] Controls the verbosity of the imputer.
copy [boolean, default=True] If True, a copy of X will be created. If False, imputation will be
done in-place whenever possible. Note that, in the following cases, a new copy will always
be made, even if copy=False:
• If X is not an array of floating values;
• If X is encoded as a CSR matrix;
• If add_indicator=True.
add_indicator [boolean, default=False] If True, a MissingIndicator transform will stack
onto output of the imputer’s transform. This allows a predictive estimator to account for
missingness despite imputation. If a feature has no missing values at fit/train time, the fea-
ture won’t appear on the missing indicator even if there are missing values at transform/test
time.
Attributes
statistics_ [array of shape (n_features,)] The imputation fill value for each feature. Computing
statistics can result in np.nan values. During transform, features corresponding to
np.nan statistics will be discarded.
indicator_ [sklearn.impute.MissingIndicator] Indicator used to add binary indi-
cators for missing values. None if add_indicator is False.
See also:
Notes
Columns which only contained missing values at fit are discarded upon transform if strategy is not “con-
stant”.
Examples
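A minimal sketch (not from the original examples; the array and the choice of strategy are illustrative) of the “constant” strategy and add_indicator option described above:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imp = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
X_t = imp.fit_transform(X)
# X_t contains the two imputed columns followed by two binary missing-indicator columns
print(imp.statistics_)
print(X_t)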
Methods
Returns
self [object] Estimator instance.
transform(self, X)
Impute all missing values in X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input data to complete.
7.17.2 sklearn.impute.IterativeImputer
Note: This estimator is still experimental for now: the predictions and the API might change without any
deprecation cycle. To use it, you need to explicitly import enable_iterative_imputer:
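For example:

# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer  # noqa
# now IterativeImputer can be imported as usual
from sklearn.impute import IterativeImputer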
Parameters
estimator [estimator object, default=BayesianRidge()] The estimator to use at each step of
the round-robin imputation. If sample_posterior is True, the estimator must support
return_std in its predict method.
missing_values [int, np.nan, default=np.nan] The placeholder for the missing values. All oc-
currences of missing_values will be imputed.
sample_posterior [boolean, default=False] Whether to sample from the (Gaussian) predictive
posterior of the fitted estimator for each imputation. Estimator must support return_std
in its predict method if set to True. Set to True if using IterativeImputer for
multiple imputations.
See also:
Notes
To support imputation in inductive mode we store each feature’s estimator during the fit phase, and predict
without refitting (in order) during the transform phase.
Features which contain all missing values at fit are discarded upon transform.
Examples
Methods
7.17.3 sklearn.impute.MissingIndicator
Examples
Methods
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Generate missing values indicator for X.
Parameters
X [{array-like, sparse matrix}, shape (n_samples, n_features)] The input data to complete.
Returns
Xt [{ndarray or sparse matrix}, shape (n_samples, n_features) or (n_samples,
n_features_with_missing)] The missing indicator for input data. The data type of Xt will
be boolean.
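A minimal sketch of the transformer (not from the original examples; the array is illustrative):

import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1.0, 3.0], [4.0, 0.0, np.nan], [8.0, 1.0, 0.0]])
indicator = MissingIndicator()
mask = indicator.fit_transform(X)   # boolean array, one column per feature that had missing values
print(indicator.features_)          # indices of the features with missing values at fit time
print(mask)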
7.17.4 sklearn.impute.KNNImputer
References
• Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David
Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMAT-
ICS Vol. 17 no. 6, 2001 Pages 520-525.
Examples
Methods
Returns
self [object] Estimator instance.
transform(self, X)
Impute all missing values in X.
Parameters
X [array-like of shape (n_samples, n_features)] The input data to complete.
Returns
X [array-like of shape (n_samples, n_output_features)] The imputed dataset.
n_output_features is the number of features that is not always missing dur-
ing fit.
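A minimal sketch (not from the original examples; the array and n_neighbors value are illustrative) of nearest-neighbour imputation:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan], [3.0, 4.0, 3.0], [np.nan, 6.0, 5.0], [8.0, 8.0, 7.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))     # each missing entry is the mean of its 2 nearest neighbours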
7.18.1 sklearn.inspection.partial_dependence
Warning: The ‘recursion’ method only works for gradient boosting estimators, and unlike the ‘brute’
method, it does not account for the init predictor of the boosting process. In practice this will produce
the same values as ‘brute’ up to a constant offset in the target response, provided that init is a constant
estimator (which is the default). However, as soon as init is not a constant estimator, the partial dependence
values are incorrect for ‘recursion’. This is not relevant for HistGradientBoostingClassifier and
HistGradientBoostingRegressor, which do not have an init parameter.
See also:
Examples
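A minimal sketch (not from the original examples; the synthetic data and grid_resolution are illustrative) of the values returned by the function:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
est = GradientBoostingRegressor(random_state=0).fit(X, y)
# averaged predictions over the grid for feature 0, plus the grid values themselves
averaged_predictions, grid = partial_dependence(est, X, features=[0], grid_resolution=20)
print(averaged_predictions.shape, grid[0].shape)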
7.18.2 sklearn.inspection.permutation_importance
References
L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
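A minimal sketch (not from the original examples; dataset, model and n_repeats are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)   # mean importance per feature over the permutation repeats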
7.18.3 Plotting
sklearn.inspection.PartialDependenceDisplay
Parameters
pd_results [list of (ndarray, ndarray)] Results of partial_dependence for features.
Each tuple corresponds to a (averaged_predictions, grid).
features [list of (int,) or list of (int, int)] Indices of features for a given plot. A tuple of one
integer will plot a partial dependence curve of one feature. A tuple of two integers will plot
a two-way partial dependence curve as a contour plot.
feature_names [list of str] Feature names corresponding to the indices in features.
target_idx [int]
• In a multiclass setting, specifies the class for which the PDPs should be computed. Note
that for binary classification, the positive class (index 1) is always used.
• In a multioutput setting, specifies the task for which the PDPs should be computed.
Ignored in binary classification or classical regression settings.
pdp_lim [dict] Global min and max average predictions, such that all plots will have the same
scale and y limits. pdp_lim[1] is the global min and max for single partial dependence
curves. pdp_lim[2] is the global min and max for two-way partial dependence curves.
deciles [dict] Deciles for feature indices in features.
Attributes
bounding_ax_ [matplotlib Axes or None] If ax is an axes or None, the bounding_ax_ is
the axes where the grid of partial dependence plots are drawn. If ax is a list of axes or a
numpy array of axes, bounding_ax_ is None.
axes_ [ndarray of matplotlib Axes] If ax is an axes or None, axes_[i, j] is the axes on the i-th row and j-th column. If ax is a list of axes, axes_[i] is the i-th item in ax. Elements that are None correspond to a nonexistent axes in that position.
lines_ [ndarray of matplotlib Artists] If ax is an axes or None, lines_[i, j] is the partial dependence curve on the i-th row and j-th column. If ax is a list of axes, lines_[i] is the partial dependence curve corresponding to the i-th item in ax. Elements that are None correspond to a nonexistent axes or an axes that does not include a line plot.
contours_ [ndarray of matplotlib Artists] If ax is an axes or None, contours_[i, j] is the partial dependence plot on the i-th row and j-th column. If ax is a list of axes, contours_[i] is the partial dependence plot corresponding to the i-th item in ax. Elements that are None correspond to a nonexistent axes or an axes that does not include a contour plot.
figure_ [matplotlib Figure] Figure containing partial dependence plots.
Methods
sklearn.inspection.plot_partial_dependence
Note: plot_partial_dependence does not support using the same axes with multiple calls. To plot the partial dependence for multiple estimators, please pass the axes created by the first call to the second call:
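For instance (est1 and est2 stand for two already-fitted estimators, X for the data, and the feature index is illustrative):

disp1 = plot_partial_dependence(est1, X, [0])
disp2 = plot_partial_dependence(est2, X, [0], ax=disp1.axes_)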
• In a multiclass setting, specifies the class for which the PDPs should be computed. Note
that for binary classification, the positive class (index 1) is always used.
• In a multioutput setting, specifies the task for which the PDPs should be computed.
Ignored in binary classification or classical regression settings.
response_method [‘auto’, ‘predict_proba’ or ‘decision_function’, optional (default=’auto’)]
Specifies whether to use predict_proba or decision_function as the target response. For
regressors this parameter is ignored and the response is always the output of predict. By
default, predict_proba is tried first and we revert to decision_function if it doesn’t exist. If
method is ‘recursion’, the response is always the output of decision_function.
n_cols [int, optional (default=3)] The maximum number of columns in the grid plot. Only
active when ax is a single axis or None.
grid_resolution [int, optional (default=100)] The number of equally spaced points on the axes
of the plots, for each target feature.
percentiles [tuple of float, optional (default=(0.05, 0.95))] The lower and upper percentile used
to create the extreme values for the PDP axes. Must be in [0, 1].
method [str, optional (default=’auto’)] The method to use to calculate the partial dependence
predictions:
• ‘recursion’ is only supported for gradient boosting estimators (namely GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier, HistGradientBoostingRegressor) but is more efficient in terms of speed. With this method, X is optional and is only used to build the grid and the partial dependences are computed using the training data. This method does not account for the init predictor of the boosting process, which may lead to incorrect values (see the warning below). With this method, the target response of a classifier is always the decision function, not the predicted probabilities.
• ‘brute’ is supported for any estimator, but is more computationally intensive.
• ‘auto’: ‘recursion’ is used for estimators that support it, and ‘brute’ is used for all other estimators.
n_jobs [int, optional (default=None)] The number of CPUs to use to compute the partial depen-
dences. None means 1 unless in a joblib.parallel_backend context. -1 means
using all processors. See Glossary for more details.
verbose [int, optional (default=0)] Verbose output during PD computations.
fig [Matplotlib figure object, optional (default=None)] A figure object onto which the plots will
be drawn, after the figure has been cleared. By default, a new one is created.
Deprecated since version 0.22: fig will be removed in 0.24.
line_kw [dict, optional] Dict with keywords passed to the matplotlib.pyplot.plot call.
For one-way partial dependence plots.
contour_kw [dict, optional] Dict with keywords passed to the matplotlib.pyplot.
contourf call. For two-way partial dependence plots.
ax [Matplotlib axes or array-like of Matplotlib axes, default=None]
• If a single axis is passed in, it is treated as a bounding axes and a grid of partial de-
pendence plots will be drawn within these bounds. The n_cols parameter controls
the number of columns in the grid.
• If an array-like of axes are passed in, the partial dependence plots will be drawn di-
rectly into these axes.
• If None, a figure and a bounding axes is created and treated as the single axes case.
New in version 0.22.
Returns
display: PartialDependenceDisplay
Warning: The ‘recursion’ method only works for gradient boosting estimators, and unlike the ‘brute’
method, it does not account for the init predictor of the boosting process. In practice this will produce
the same values as ‘brute’ up to a constant offset in the target response, provided that init is a constant
estimator (which is the default). However, as soon as init is not a constant estimator, the partial dependence
values are incorrect for ‘recursion’. This is not relevant for HistGradientBoostingClassifier and
HistGradientBoostingRegressor, which do not have an init parameter.
See also:
Examples
User guide: See the Isotonic regression section for further details.
7.19.1 sklearn.isotonic.IsotonicRegression
The isotonic regression fit minimizes sum w[i] * (y[i] - y_[i])^2 subject to y_[i] <= y_[j] whenever X[i] <= X[j], where:
• y[i] are inputs (real numbers)
• y_[i] are fitted
• X specifies the order. If X is non-decreasing then y_ is non-decreasing.
• w[i] are optional strictly positive weights (default to 1.0)
Notes
Ties are broken using the secondary method from Leeuw, 1977.
References
Isotonic Median Regression: A Linear Programming Approach Nilotpal Chakravarti Mathematics of Operations
Research Vol. 14, No. 2 (May, 1989), pp. 303-308
Isotone Optimization in R : Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods Leeuw, Hornik,
Mair Journal of Statistical Software 2009
Correctness of Kruskal’s algorithms for monotone regression with ties Leeuw, Psychometrica, 1977
Examples
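A minimal sketch (not from the original examples; the noisy data are illustrative) showing that the fitted values are non-decreasing:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X = np.arange(20, dtype=float)
y = X + 3.0 * rng.randn(20)          # noisy but globally increasing target

iso = IsotonicRegression().fit(X, y)
y_fit = iso.predict(X)               # non-decreasing fitted values
print(np.all(np.diff(y_fit) >= 0))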
Methods
Notes
X is stored for future use, as transform needs X to interpolate new input data.
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
Notes
Returns
self [object] Estimator instance.
transform(self, T)
Transform new data by linear interpolation
Parameters
T [array-like of shape (n_samples,)] Data to transform.
Returns
T_ [array, shape=(n_samples,)] The transformed data
• Isotonic Regression
7.19.2 sklearn.isotonic.check_increasing
sklearn.isotonic.check_increasing(x, y)
Determine whether y is monotonically correlated with x.
y is found increasing or decreasing with respect to x based on a Spearman correlation test.
Parameters
x [array-like of shape (n_samples,)] Training data.
y [array-like of shape (n_samples,)] Training target.
Returns
increasing_bool [boolean] Whether the relationship is increasing or decreasing.
Notes
The Spearman correlation coefficient is estimated from the data, and the sign of the resulting estimate is used as
the result.
In the event that the 95% confidence interval based on Fisher transform spans zero, a warning is raised.
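For illustration, a small sketch (not part of the original documentation; the noisy trends are illustrative):

import numpy as np
from sklearn.isotonic import check_increasing

rng = np.random.RandomState(0)
x = np.arange(50)
y_up = x + rng.randn(50)            # noisy increasing trend
y_down = -x + rng.randn(50)         # noisy decreasing trend
print(check_increasing(x, y_up))    # True
print(check_increasing(x, y_down))  # False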
References
7.19.3 sklearn.isotonic.isotonic_regression
The isotonic regression fit minimizes sum w[i] * (y[i] - y_[i])^2 subject to the fitted values y_ being non-decreasing, where:
• y[i] are inputs (real numbers)
• y_[i] are fitted
• w[i] are optional strictly positive weights (default to 1.0)
References
“Active set algorithms for isotonic regression; A unifying framework” by Michael J. Best and Nilotpal
Chakravarti, section 3.
The sklearn.kernel_approximation module implements several approximate kernel feature maps based on Fourier transforms.
User guide: See the Kernel Approximation section for further details.
7.20.1 sklearn.kernel_approximation.AdditiveChi2Sampler
Notes
This estimator approximates a slightly different version of the additive chi squared kernel than metric.additive_chi2 computes.
References
See “Efficient additive kernels via explicit feature maps” A. Vedaldi and A. Zisserman, Pattern Analysis and
Machine Intelligence, 2011
Examples
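A minimal sketch (not from the original examples; the digits dataset and the linear classifier are illustrative) of mapping data through the approximate feature map and training a linear model on it:

from sklearn.datasets import load_digits
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)        # pixel values are non-negative, as chi2 kernels require
chi2_sampler = AdditiveChi2Sampler(sample_steps=2)
X_transformed = chi2_sampler.fit_transform(X, y)
clf = SGDClassifier(max_iter=10, tol=1e-3, random_state=0).fit(X_transformed, y)
print(clf.score(X_transformed, y))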
Methods
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Apply approximate feature map to X.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)]
Returns
X_new [{array, sparse matrix}, shape = (n_samples, n_features * (2*sample_steps + 1))] Whether the return value is an array or a sparse matrix depends on the type of the input X.
7.20.2 sklearn.kernel_approximation.Nystroem
components_ [array, shape (n_components, n_features)] Subset of training points used to con-
struct the feature map.
component_indices_ [array, shape (n_components)] Indices of components_ in the training
set.
normalization_ [array, shape (n_components, n_components)] Normalization matrix needed
for embedding. Square root of the kernel matrix on components_.
See also:
References
• Williams, C.K.I. and Seeger, M. “Using the Nystroem method to speed up kernel machines”, Advances in
neural information processing systems 2001
• T. Yang, Y. Li, M. Mahdavi, R. Jin and Z. Zhou “Nystroem Method vs Random Fourier Features: A
Theoretical and Empirical Comparison”, Advances in Neural Information Processing Systems 2012
Examples
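A minimal sketch (not from the original examples; the digits dataset, scaling, and parameter values are illustrative) of the Nystroem map feeding a linear SVM:

from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

X, y = load_digits(n_class=9, return_X_y=True)
data = X / 16.                                       # scale pixel values to [0, 1]
feature_map = Nystroem(gamma=0.2, n_components=300, random_state=1)
data_transformed = feature_map.fit_transform(data)   # approximate RBF feature space
clf = LinearSVC(random_state=0, max_iter=5000).fit(data_transformed, y)
print(clf.score(data_transformed, y))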
Methods
Samples a subset of training points, computes kernel on these and computes normalization matrix.
Parameters
X [array-like of shape (n_samples, n_features)] Training data.
fit_transform(self, X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X [numpy array of shape [n_samples, n_features]] Training set.
y [numpy array of shape [n_samples]] Target values.
**fit_params [dict] Additional fit parameters.
Returns
X_new [numpy array of shape [n_samples, n_features_new]] Transformed array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Apply feature map to X.
Computes an approximate feature map using the kernel between some training points and X.
Parameters
X [array-like of shape (n_samples, n_features)] Data to transform.
Returns
X_transformed [array, shape=(n_samples, n_components)] Transformed data.
7.20.3 sklearn.kernel_approximation.RBFSampler
Notes
See “Random Features for Large-Scale Kernel Machines” by A. Rahimi and Benjamin Recht.
[1] “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning” by A.
Rahimi and Benjamin Recht. (https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf)
Examples
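A minimal sketch (not from the original examples; the toy data and parameter values are illustrative) of random Fourier features feeding a linear classifier:

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]
rbf_feature = RBFSampler(gamma=1, n_components=100, random_state=1)
X_features = rbf_feature.fit_transform(X)
clf = SGDClassifier(max_iter=100, tol=1e-3).fit(X_features, y)
print(clf.score(X_features, y))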
Methods
7.20.4 sklearn.kernel_approximation.SkewedChi2Sampler
class sklearn.kernel_approximation.SkewedChi2Sampler(skewedness=1.0, n_components=100, random_state=None)
Approximates feature map of the “skewed chi-squared” kernel by Monte Carlo approximation of its Fourier
transform.
Read more in the User Guide.
Parameters
skewedness [float] “skewedness” parameter of the kernel. Needs to be cross-validated.
n_components [int] number of Monte Carlo samples per original feature. Equals the dimen-
sionality of the computed feature space.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
See also:
AdditiveChi2Sampler A different approach for approximating an additive variant of the chi squared ker-
nel.
sklearn.metrics.pairwise.chi2_kernel The exact chi squared kernel.
References
See “Random Fourier Approximations for Skewed Multiplicative Histogram Kernels” by Fuxin Li, Catalin
Ionescu and Cristian Sminchisescu.
Examples
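A minimal sketch (not from the original examples; the toy data and parameter values are illustrative):

from sklearn.kernel_approximation import SkewedChi2Sampler
from sklearn.linear_model import SGDClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]                 # all values > -skewedness
y = [0, 0, 1, 1]
chi2_feature = SkewedChi2Sampler(skewedness=0.01, n_components=10, random_state=0)
X_features = chi2_feature.fit_transform(X, y)
clf = SGDClassifier(max_iter=10, tol=1e-3).fit(X_features, y)
print(clf.score(X_features, y))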
Methods
Returns
self [object] Estimator instance.
transform(self, X)
Apply the approximate feature map to X.
Parameters
X [array-like, shape (n_samples, n_features)] New data, where n_samples is the number of samples and n_features is the number of features. All values of X must be strictly greater than “-skewedness”.
Returns
X_new [array-like, shape (n_samples, n_components)]
7.21.1 sklearn.kernel_ridge.KernelRidge
and should return a floating point number. Set to “precomputed” in order to pass a precom-
puted kernel matrix to the estimator methods instead of samples.
gamma [float, default=None] Gamma parameter for the RBF, laplacian, polynomial, exponen-
tial chi2 and sigmoid kernels. Interpretation of the default value is left to the kernel; see the
documentation for sklearn.metrics.pairwise. Ignored by other kernels.
degree [float, default=3] Degree of the polynomial kernel. Ignored by other kernels.
coef0 [float, default=1] Zero coefficient for polynomial and sigmoid kernels. Ignored by other
kernels.
kernel_params [mapping of string to any, optional] Additional parameters (keyword argu-
ments) for kernel function passed as callable object.
Attributes
dual_coef_ [array, shape = [n_samples] or [n_samples, n_targets]] Representation of weight
vector(s) in kernel space
X_fit_ [{array-like, sparse matrix} of shape (n_samples, n_features)] Training data, which is
also required for prediction. If kernel == “precomputed” this is instead the precomputed
training matrix, shape = [n_samples, n_samples].
See also:
References
• Kevin P. Murphy “Machine Learning: A Probabilistic Perspective”, The MIT Press chapter 14.4.3, pp.
492-493
Examples
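A minimal sketch (not from the original examples; the random data and the RBF kernel choice are illustrative):

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randn(100)
krr = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.1)
krr.fit(X, y)
print(krr.predict(X[:3]))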
Methods
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.LogisticRegression
class_weight [dict or ‘balanced’, default=None] Weights associated with classes in the form
{class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit
method) if sample_weight is specified.
New in version 0.17: class_weight=’balanced’
random_state [int, RandomState instance, default=None] The seed of the pseudo random num-
ber generator to use when shuffling the data. If int, random_state is the seed used by the
random number generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState instance used by np.
random. Used when solver == ‘sag’ or ‘liblinear’.
solver [{‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’] Algorithm to use in
the optimization problem.
• For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for
large ones.
• For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial
loss; ‘liblinear’ is limited to one-versus-rest schemes.
• ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
• ‘liblinear’ and ‘saga’ also handle L1 penalty
• ‘saga’ also supports ‘elasticnet’ penalty
• ‘liblinear’ does not support setting penalty='none'
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approxi-
mately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in 0.22.
max_iter [int, default=100] Maximum number of iterations taken for the solvers to converge.
multi_class [{‘auto’, ‘ovr’, ‘multinomial’}, default=’auto’] If the option chosen is ‘ovr’, then a
binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial
loss fit across the entire probability distribution, even when the data is binary. ‘multino-
mial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if
solver=’liblinear’, and otherwise selects ‘multinomial’.
New in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.
Changed in version 0.22: Default changed from ‘ovr’ to ‘auto’ in 0.22.
verbose [int, default=0] For the liblinear and lbfgs solvers set verbose to any positive number
for verbosity.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver.
See the Glossary.
New in version 0.17: warm_start to support lbfgs, newton-cg, sag, saga solvers.
n_jobs [int, default=None] Number of CPU cores used when parallelizing over classes if multi_class=’ovr’. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
l1_ratio [float, default=None] The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty=’elasticnet’. Setting l1_ratio=0 is equivalent to using penalty=’l2’, while setting l1_ratio=1 is equivalent to using penalty=’l1’. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
Attributes
classes_ [ndarray of shape (n_classes, )] A list of class labels known to the classifier.
coef_ [ndarray of shape (1, n_features) or (n_classes, n_features)] Coefficient of the features in
the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular,
when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and
-coef_ corresponds to outcome 0 (False).
intercept_ [ndarray of shape (1,) or (n_classes,)] Intercept (a.k.a. bias) added to the decision
function.
If fit_intercept is set to False, the intercept is set to zero. intercept_
is of shape (1,) when the given problem is binary. In particular, when
multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and
-intercept_ corresponds to outcome 0 (False).
n_iter_ [ndarray of shape (n_classes,) or (1, )] Actual number of iterations for all classes. If
binary or multinomial, it returns only 1 element. For liblinear solver, only the maximum
number of iteration across all classes is given.
Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed
max_iter. n_iter_ will now report at most max_iter.
See also:
SGDClassifier Incrementally trained logistic regression (when given the parameter loss="log").
LogisticRegressionCV Logistic regression with built-in cross validation.
Notes
The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.
Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the
narrative documentation.
References
L-BFGS-B – Software for Large-scale Bound-constrained Optimization Ciyou Zhu, Richard Byrd, Jorge
Nocedal and Jose Luis Morales. http://users.iems.northwestern.edu/~nocedal/lbfgsb.html
LIBLINEAR – A Library for Large Linear Classification https://www.csie.ntu.edu.tw/~cjlin/liblinear/
SAG – Mark Schmidt, Nicolas Le Roux, and Francis Bach Minimizing Finite Sums with the Stochastic Av-
erage Gradient https://hal.inria.fr/hal-00860051/document
SAGA – Defazio, A., Bach F. & Lacoste-Julien S. (2014). SAGA: A Fast Incremental Gradient Method With
Support for Non-Strongly Convex Composite Objectives https://arxiv.org/abs/1407.0202
Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning 85(1-2):41-75. https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
Examples
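A minimal sketch (not from the original examples; the iris dataset and max_iter value are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=200).fit(X, y)
print(clf.predict(X[:2]))
print(clf.predict_proba(X[:2]).round(3))
print(clf.score(X, y))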
Methods
scores per (sample, class) combination. In the binary case, confidence score for
self.classes_[1] where >0 means this class would be predicted.
densify(self )
Convert coefficient matrix to dense array format.
Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is
required for fitting, so calling this method is only required on models that have previously been sparsified;
otherwise, it is a no-op.
Returns
self Fitted estimator.
fit(self, X, y, sample_weight=None)
Fit the model according to the given training data.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training vector, where
n_samples is the number of samples and n_features is the number of features.
y [array-like of shape (n_samples,)] Target vector relative to X.
sample_weight [array-like of shape (n_samples,) default=None] Array of weights that are
assigned to individual samples. If not provided, then each sample is given unit weight.
New in version 0.17: sample_weight support to LogisticRegression.
Returns
self Fitted estimator.
Notes
The SAGA solver supports both float64 and float32 bit arrays.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict class labels for samples in X.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape [n_samples]] Predicted class label per sample.
predict_log_proba(self, X)
Predict logarithm of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
sklearn.linear_model.PassiveAggressiveClassifier
parallel_backend context. -1 means using all processors. See Glossary for more
details.
random_state [int, RandomState instance or None, optional, default=None] The seed of the
pseudo random number generator to use when shuffling the data. If int, random_state is
the seed used by the random number generator; If RandomState instance, random_state is
the random number generator; If None, the random number generator is the RandomState
instance used by np.random.
warm_start [bool, optional] When set to True, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution. See the Glossary.
Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution
than when calling fit a single time because of the way the data is shuffled.
class_weight [dict, {class_label: weight} or “balanced” or None, optional] Preset for the
class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y))
New in version 0.17: parameter class_weight to automatically weight samples.
average [bool or int, optional] When set to True, computes the averaged SGD weights and
stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin
once the total number of samples seen reaches average. So average=10 will begin averaging
after seeing 10 samples.
New in version 0.19: parameter average to use weights averaging in SGD
Attributes
coef_ [array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]] Weights
assigned to the features.
intercept_ [array, shape = [1] if n_classes == 2 else [n_classes]] Constants in decision function.
n_iter_ [int] The actual number of iterations to reach the stopping criterion. For multiclass fits,
it is the maximum over every binary fit.
classes_ [array of shape (n_classes,)] The unique classes labels.
t_ [int] Number of weight updates performed during training. Same as (n_iter_ *
n_samples).
See also:
SGDClassifier
Perceptron
References
Examples
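A minimal sketch (not from the original examples; the synthetic dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_features=4, random_state=0)
clf = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.score(X, y))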
Methods
otherwise, it is a no-op.
Returns
self Fitted estimator.
fit(self, X, y, coef_init=None, intercept_init=None)
Fit linear model with Passive Aggressive algorithm.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Training data
y [numpy array of shape [n_samples]] Target values
coef_init [array, shape = [n_classes,n_features]] The initial coefficients to warm-start the
optimization.
intercept_init [array, shape = [n_classes]] The initial intercept to warm-start the optimiza-
tion.
Returns
self [returns an instance of self.]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
partial_fit(self, X, y, classes=None)
Fit linear model with Passive Aggressive algorithm.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Subset of the training data
y [numpy array of shape [n_samples]] Subset of the target values
classes [array, shape = [n_classes]] Classes across all calls to partial_fit. Can be obtained via np.unique(y_all), where y_all is the target vector of the entire dataset. This argument is required for the first call to partial_fit and can be omitted in the subsequent calls. Note that y doesn’t need to contain all labels in classes.
Returns
self [returns an instance of self.]
predict(self, X)
Predict class labels for samples in X.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape [n_samples]] Predicted class label per sample.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **kwargs)
Set and validate the parameters of estimator.
Parameters
**kwargs [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sparsify(self )
Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more
memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
Returns
self Fitted estimator.
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
sklearn.linear_model.Perceptron
SGDClassifier
Notes
Perceptron is a classification algorithm which shares the same underlying implementation with
SGDClassifier. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron",
eta0=1, learning_rate="constant", penalty=None).
References
Examples
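A minimal sketch (not from the original examples; the digits dataset and tol value are illustrative):

from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

X, y = load_digits(return_X_y=True)
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))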
Methods
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
sklearn.linear_model.RidgeClassifier
alpha [float, default=1.0] Regularization strength; must be a positive float. Regularization im-
proves the conditioning of the problem and reduces the variance of the estimates. Larger
values specify stronger regularization. Alpha corresponds to C^-1 in other linear models
such as LogisticRegression or LinearSVC.
fit_intercept [bool, default=True] Whether to calculate the intercept for this model. If set to
false, no intercept will be used in calculations (e.g. data is expected to be already centered).
normalize [bool, default=False] This parameter is ignored when fit_intercept is set
to False. If True, the regressors X will be normalized before regression by subtract-
ing the mean and dividing by the l2-norm. If you wish to standardize, please use
sklearn.preprocessing.StandardScaler before calling fit on an estimator
with normalize=False.
copy_X [bool, default=True] If True, X will be copied; else, it may be overwritten.
max_iter [int, default=None] Maximum number of iterations for conjugate gradient solver. The
default value is determined by scipy.sparse.linalg.
tol [float, default=1e-3] Precision of the solution.
class_weight [dict or ‘balanced’, default=None] Weights associated with classes in the form
{class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y)).
solver [{‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}, default=’auto’] Solver to
use in the computational routines:
• ‘auto’ chooses the solver automatically based on the type of data.
• ‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More
stable for singular matrices than ‘cholesky’.
• ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.
• ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an
iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data
(possibility to set tol and max_iter).
• ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the
fastest and uses an iterative procedure.
• ‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its unbiased and more
flexible version named SAGA. Both methods use an iterative procedure, and are often
faster than other solvers when both n_samples and n_features are large. Note that ‘sag’
and ‘saga’ fast convergence is only guaranteed on features with approximately the same
scale. You can preprocess the data with a scaler from sklearn.preprocessing.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
random_state [int, RandomState instance, default=None] The seed of the pseudo random num-
ber generator to use when shuffling the data. If int, random_state is the seed used by the
random number generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState instance used by np.
random. Used when solver == ‘sag’.
Attributes
coef_ [ndarray of shape (1, n_features) or (n_classes, n_features)] Coefficient of the features in
the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ [float or ndarray of shape (n_targets,)] Independent term in decision function. Set to
0.0 if fit_intercept = False.
n_iter_ [None or ndarray of shape (n_targets,)] Actual number of iterations for each target.
Available only for sag and lsqr solvers. Other solvers will return None.
classes_ [ndarray of shape (n_classes,)] The classes labels.
See also:
Notes
For multi-class classification, n_class classifiers are trained in a one-versus-all approach. Concretely, this is
implemented by taking advantage of the multi-variate response support in Ridge.
Examples
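A minimal sketch (not from the original examples; the breast cancer dataset and alpha value are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(clf.score(X, y))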
Methods
Returns
array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes) Confidence
scores per (sample, class) combination. In the binary case, confidence score for
self.classes_[1] where >0 means this class would be predicted.
fit(self, X, y, sample_weight=None)
Fit Ridge classifier model.
Parameters
X [{ndarray, sparse matrix} of shape (n_samples, n_features)] Training data.
y [ndarray of shape (n_samples,)] Target values.
sample_weight [float or ndarray of shape (n_samples,), default=None] Individual weights
for each sample. If given a float, every sample will have the same weight.
New in version 0.17: sample_weight support to Classifier.
Returns
self [object] Instance of the estimator.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict class labels for samples in X.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape [n_samples]] Predicted class label per sample.
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.linear_model.SGDClassifier
used by the perceptron algorithm. The other losses are designed for regression but can be
useful in classification as well; see SGDRegressor for a description.
penalty [{‘l2’, ‘l1’, ‘elasticnet’}, default=’l2’] The penalty (aka regularization term) to be used.
Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’
might bring sparsity to the model (feature selection) not achievable with ‘l2’.
alpha [float, default=0.0001] Constant that multiplies the regularization term. Defaults to
0.0001. Also used to compute learning_rate when set to ‘optimal’.
l1_ratio [float, default=0.15] The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1.
l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
fit_intercept [bool, default=True] Whether the intercept should be estimated or not. If False,
the data is assumed to be already centered. Defaults to True.
max_iter [int, default=1000] The maximum number of passes over the training data (aka
epochs). It only impacts the behavior in the fit method, and not the partial_fit
method.
New in version 0.19.
tol [float, default=1e-3] The stopping criterion. If it is not None, the iterations will stop when
(loss > best_loss - tol) for n_iter_no_change consecutive epochs.
New in version 0.19.
shuffle [bool, default=True] Whether or not the training data should be shuffled after each
epoch.
verbose [int, default=0] The verbosity level.
epsilon [float, default=0.1] Epsilon in the epsilon-insensitive loss functions; only if loss is
‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines
the threshold at which it becomes less important to get the prediction exactly right. For
epsilon-insensitive, any differences between the current prediction and the correct label are
ignored if they are less than this threshold.
n_jobs [int, default=None] The number of CPUs to use to do the OVA (One Versus
All, for multi-class problems) computation. None means 1 unless in a joblib.
parallel_backend context. -1 means using all processors. See Glossary for more
details.
random_state [int, RandomState instance, default=None] The seed of the pseudo random num-
ber generator to use when shuffling the data. If int, random_state is the seed used by the
random number generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState instance used by np.
random.
learning_rate [str, default=’optimal’] The learning rate schedule:
‘constant’: eta = eta0
‘optimal’: [default] eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed
by Leon Bottou.
‘invscaling’: eta = eta0 / pow(t, power_t)
‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
eta0 [double, default=0.0] The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’
schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.
power_t [double, default=0.5] The exponent for inverse scaling learning rate [default 0.5].
early_stopping [bool, default=False] Whether to use early stopping to terminate training when
validation score is not improving. If set to True, it will automatically set aside a stratified
fraction of training data as validation and terminate training when validation score is not
improving by at least tol for n_iter_no_change consecutive epochs.
New in version 0.20.
validation_fraction [float, default=0.1] The proportion of training data to set aside as validation
set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
New in version 0.20.
n_iter_no_change [int, default=5] Number of iterations with no improvement to wait before
early stopping.
New in version 0.20.
class_weight [dict, {class_label: weight} or “balanced”, default=None] Preset for the
class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely pro-
portional to class frequencies in the input data as n_samples / (n_classes * np.
bincount(y)).
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit as initialization, otherwise, just erase the previous solution. See the Glossary.
Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution
than when calling fit a single time because of the way the data is shuffled. If a dynamic
learning rate is used, the learning rate is adapted depending on the number of samples al-
ready seen. Calling fit resets this counter, while partial_fit will result in increasing
the existing counter.
average [bool or int, default=False] When set to True, computes the averaged SGD weights
and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will
begin once the total number of samples seen reaches average. So average=10 will begin
averaging after seeing 10 samples.
Attributes
coef_ [ndarray of shape (1, n_features) if n_classes == 2 else (n_classes, n_features)] Weights
assigned to the features.
intercept_ [ndarray of shape (1,) if n_classes == 2 else (n_classes,)] Constants in decision
function.
n_iter_ [int] The actual number of iterations to reach the stopping criterion. For multiclass fits,
it is the maximum over every binary fit.
loss_function_ [concrete LossFunction]
classes_ [array of shape (n_classes,)]
t_ [int] Number of weight updates performed during training. Same as (n_iter_ *
n_samples).
See also:
Examples
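A minimal sketch (not from the original examples; the toy data are illustrative) that pairs SGDClassifier with feature scaling, since SGD convergence depends on features being on comparable scales:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(X, y)
print(clf.predict([[2., 2.]]))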
Methods
References
Zadrozny and Elkan, “Transforming classifier scores into multiclass probability estimates”, SIGKDD’02,
http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
The justification for the formula in the loss=”modified_huber” case is in appendix B of: http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf
score(self, X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each
sample that each label set be correctly predicted.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True labels for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] Mean accuracy of self.predict(X) wrt. y.
set_params(self, **kwargs)
Set and validate the parameters of estimator.
Parameters
**kwargs [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sparsify(self )
Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more
memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
Returns
self Fitted estimator.
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
sklearn.linear_model.LinearRegression
Notes
From the implementation point of view, this is just plain Ordinary Least Squares (scipy.linalg.lstsq) wrapped as
a predictor object.
Examples
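A minimal sketch (not from the original examples; the toy data constructed as y = 1*x_0 + 2*x_1 + 3 are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)     # approximately [1. 2.] and 3.0
print(reg.score(X, y))               # R^2 is 1.0 on this exactly linear data
print(reg.predict(np.array([[3, 5]])))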
Methods
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
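The definition above can be checked directly; a small sketch with synthetic data (LinearRegression is used only for illustration), comparing the estimator's score, sklearn.metrics.r2_score, and the 1 - u/v formula:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(100)

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

u = ((y - y_pred) ** 2).sum()        # residual sum of squares
v = ((y - y.mean()) ** 2).sum()      # total sum of squares
print(reg.score(X, y), r2_score(y, y_pred), 1 - u / v)  # all three agree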
Notes
• Isotonic Regression
• Face completion with a multi-output estimators
• Plot individual and voting regression predictions
• Ordinary Least Squares and Ridge Regression Variance
• Logistic function
• Linear Regression Example
• Robust linear model estimation using RANSAC
• Sparsity Example: Fitting only features 1 and 2
• Theil-Sen Regression
• Robust linear estimator fitting
• Automatic Relevance Determination Regression (ARD)
• Bayesian Ridge Regression
• Plotting Cross-Validated Predictions
• Underfitting vs. Overfitting
• Using KBinsDiscretizer to discretize continuous features
sklearn.linear_model.Ridge
This model solves a regression model where the loss function is the linear least squares function and regulariza-
tion is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has
built-in support for multi-variate regression (i.e., when y is a 2d-array of shape (n_samples, n_targets)).
Read more in the User Guide.
Parameters
alpha [{float, ndarray of shape (n_targets,)}, default=1.0] Regularization strength; must be a
positive float. Regularization improves the conditioning of the problem and reduces the
variance of the estimates. Larger values specify stronger regularization. Alpha corresponds
to C^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is
passed, penalties are assumed to be specific to the targets. Hence they must correspond in
number.
fit_intercept [bool, default=True] Whether to calculate the intercept for this model. If set to
false, no intercept will be used in calculations (i.e. data is expected to be centered).
See also:
Examples
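A minimal usage sketch (synthetic data; alpha values chosen only for illustration) showing how stronger regularization shrinks the coefficient norm:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ np.array([3.0, 0.0, 1.0, 0.0, -2.0]) + rng.randn(50)

for alpha in (0.1, 1.0, 100.0):
    reg = Ridge(alpha=alpha).fit(X, y)
    # stronger regularization -> smaller coefficient norm
    print(alpha, np.linalg.norm(reg.coef_))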
Methods
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.SGDRegressor
max_iter [int, default=1000] The maximum number of passes over the training data (aka
epochs). It only impacts the behavior in the fit method, and not the partial_fit
method.
New in version 0.19.
tol [float, default=1e-3] The stopping criterion. If it is not None, the iterations will stop when
(loss > best_loss - tol) for n_iter_no_change consecutive epochs.
New in version 0.19.
shuffle [bool, default=True] Whether or not the training data should be shuffled after each
epoch.
verbose [int, default=0] The verbosity level.
epsilon [float, default=0.1] Epsilon in the epsilon-insensitive loss functions; only if loss is
‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines
the threshold at which it becomes less important to get the prediction exactly right. For
epsilon-insensitive, any differences between the current prediction and the correct label are
ignored if they are less than this threshold.
random_state [int, RandomState instance, default=None] The seed of the pseudo random num-
ber generator to use when shuffling the data. If int, random_state is the seed used by the
random number generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState instance used by np.
random.
learning_rate [string, default=’invscaling’] The learning rate schedule:
‘constant’: eta = eta0
‘optimal’: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon
Bottou.
‘invscaling’: [default] eta = eta0 / pow(t, power_t)
‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time
n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to
increase validation score by tol if early_stopping is True, the current learning rate is di-
vided by 5.
eta0 [double, default=0.01] The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’
schedules. The default value is 0.01.
power_t [double, default=0.25] The exponent for inverse scaling learning rate.
early_stopping [bool, default=False] Whether to use early stopping to terminate training when
validation score is not improving. If set to True, it will automatically set aside a fraction of
training data as validation and terminate training when validation score is not improving by
at least tol for n_iter_no_change consecutive epochs.
New in version 0.20.
validation_fraction [float, default=0.1] The proportion of training data to set aside as validation
set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
New in version 0.20.
n_iter_no_change [int, default=5] Number of iterations with no improvement to wait before
early stopping.
New in version 0.20.
warm_start [bool, default=False] When set to True, reuse the solution of the previous call to
fit as initialization, otherwise, just erase the previous solution. See the Glossary.
Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution
than when calling fit a single time because of the way the data is shuffled. If a dynamic
learning rate is used, the learning rate is adapted depending on the number of samples al-
ready seen. Calling fit resets this counter, while partial_fit will result in increasing
the existing counter.
average [bool or int, default=False] When set to True, computes the averaged SGD weights
and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will
begin once the total number of samples seen reaches average. So average=10 will begin
averaging after seeing 10 samples.
Attributes
coef_ [ndarray of shape (n_features,)] Weights assigned to the features.
intercept_ [ndarray of shape (1,)] The intercept term.
average_coef_ [ndarray of shape (n_features,)] Averaged weights assigned to the features.
average_intercept_ [ndarray of shape (1,)] The averaged intercept term.
n_iter_ [int] The actual number of iterations to reach the stopping criterion.
t_ [int] Number of weight updates performed during training. Same as (n_iter_ *
n_samples).
See also:
Examples
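A minimal sketch (synthetic data): because SGD is sensitive to feature scale, a StandardScaler is typically placed in front, and average=10 starts averaging after the first 10 samples as described above:
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.randn(500)

# scale features before SGD, then fit with coefficient averaging enabled
model = make_pipeline(StandardScaler(),
                      SGDRegressor(max_iter=1000, tol=1e-3, average=10))
model.fit(X, y)
print(model.score(X, y))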
Methods
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory
usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be
computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call
densify.
• Prediction Latency
The following estimators have built-in variable selection fitting procedures, but any estimator using an L1 or elastic-
net penalty also performs variable selection: typically SGDRegressor or SGDClassifier with an appropriate
penalty.
sklearn.linear_model.ElasticNet
If you are interested in controlling the L1 and L2 penalty separately, keep in mind that this is equivalent to:
a * L1 + b * L2
where:
alpha = a + b and l1_ratio = a / (a + b)
The parameter l1_ratio corresponds to alpha in the glmnet R package while alpha corresponds to the lambda
parameter in glmnet. Specifically, l1_ratio = 1 is the lasso penalty. Currently, l1_ratio <= 0.01 is not reliable,
unless you supply your own sequence of alpha.
Read more in the User Guide.
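For illustration, a short sketch of the parameterization just described, translating example penalty strengths a (L1) and b (L2) into the estimator's arguments:
from sklearn.linear_model import ElasticNet

a, b = 0.5, 0.1                  # desired L1 and L2 penalty strengths (example values)
alpha = a + b                    # alpha = a + b
l1_ratio = a / (a + b)           # l1_ratio = a / (a + b)
reg = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
print(alpha, l1_ratio)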
Parameters
alpha [float, optional] Constant that multiplies the penalty terms. Defaults to 1.0. See the notes
for the exact mathematical meaning of this parameter. alpha = 0 is equivalent to an
ordinary least square, solved by the LinearRegression object. For numerical reasons,
using alpha = 0 with the Lasso object is not advised. Given this, you should use the
LinearRegression object.
l1_ratio [float] The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For
l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty.
For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
fit_intercept [bool] Whether the intercept should be estimated or not. If False, the data is
assumed to be already centered.
normalize [boolean, optional, default False] This parameter is ignored when
fit_intercept is set to False. If True, the regressors X will be normalized be-
fore regression by subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
precompute [True | False | array-like] Whether to use a precomputed Gram matrix to speed up
calculations. The Gram matrix can also be passed as argument. For sparse input this option
is always True to preserve sparsity.
max_iter [int, optional] The maximum number of iterations
copy_X [boolean, optional, default True] If True, X will be copied; else, it may be overwritten.
tol [float, optional] The tolerance for the optimization: if the updates are smaller than tol, the
optimization code checks the dual gap for optimality and continues until it is smaller than
tol.
warm_start [bool, optional] When set to True, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution. See the Glossary.
positive [bool, optional] When set to True, forces the coefficients to be positive.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator that selects a random feature to update. If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’.
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
coef_ [array, shape (n_features,) | (n_targets, n_features)] parameter vector (w in the cost func-
tion formula)
sparse_coef_ [scipy.sparse matrix, shape (n_features, 1) | (n_targets, n_features)] sparse
representation of the fitted coef_
intercept_ [float | array, shape (n_targets,)] independent term in decision function.
n_iter_ [array-like, shape (n_targets,)] number of iterations run by the coordinate descent solver
to reach the specified tolerance.
See also:
Notes
To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a
Fortran-contiguous numpy array.
Examples
Methods
Notes
Coordinate descent is an algorithm that considers each column of data at a time hence it will automatically
convert the X input as a Fortran-contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent.
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
n_iters [array-like, shape (n_alphas,)] The number of iterations taken by the coordinate
descent optimizer to reach the specified tolerance for each alpha. (Is returned when
return_n_iter is set to True).
See also:
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
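A brief sketch of the <component>__<parameter> syntax on a pipeline (the step names used here are arbitrary):
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("enet", ElasticNet())])
# update the alpha of the nested ElasticNet step via <component>__<parameter>
pipe.set_params(enet__alpha=0.5)
print(pipe.get_params()["enet__alpha"])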
property sparse_coef_
sparse representation of the fitted coef_
sklearn.linear_model.Lars
Attributes
alphas_ [array-like of shape (n_alphas + 1,) | list of n_targets such arrays] Maximum of covari-
ances (in absolute value) at each iteration. n_alphas is either n_nonzero_coefs or
n_features, whichever is smaller.
active_ [list, length = n_alphas | list of n_targets such lists] Indices of active variables at the end
of the path.
coef_path_ [array-like of shape (n_features, n_alphas + 1) | list of n_targets such arrays] The
varying values of the coefficients along the path. It is not present if the fit_path param-
eter is False.
coef_ [array-like of shape (n_features,) or (n_targets, n_features)] Parameter vector (w in the
formulation formula).
intercept_ [float or array-like of shape (n_targets,)] Independent term in decision function.
n_iter_ [array-like or int] The number of iterations taken by lars_path to find the grid of alphas
for each target.
See also:
lars_path, LarsCV
sklearn.decomposition.sparse_encode
Examples
Methods
Notes
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.linear_model.Lasso
Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0
(no L2 penalty).
Read more in the User Guide.
Parameters
alpha [float, optional] Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is
equivalent to an ordinary least square, solved by the LinearRegression object. For
numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this,
you should use the LinearRegression object.
fit_intercept [boolean, optional, default True] Whether to calculate the intercept for this model.
If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
normalize [boolean, optional, default False] This parameter is ignored when
fit_intercept is set to False. If True, the regressors X will be normalized be-
fore regression by subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
precompute [True | False | array-like, default=False] Whether to use a precomputed Gram ma-
trix to speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be
passed as argument. For sparse input this option is always True to preserve sparsity.
copy_X [boolean, optional, default True] If True, X will be copied; else, it may be overwritten.
max_iter [int, optional] The maximum number of iterations
tol [float, optional] The tolerance for the optimization: if the updates are smaller than tol, the
optimization code checks the dual gap for optimality and continues until it is smaller than
tol.
warm_start [bool, optional] When set to True, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution. See the Glossary.
positive [bool, optional] When set to True, forces the coefficients to be positive.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator that selects a random feature to update. If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’.
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
coef_ [array, shape (n_features,) | (n_targets, n_features)] parameter vector (w in the cost func-
tion formula)
sparse_coef_ [scipy.sparse matrix, shape (n_features, 1) | (n_targets, n_features)] sparse
representation of the fitted coef_
intercept_ [float | array, shape (n_targets,)] independent term in decision function.
n_iter_ [int | array-like, shape (n_targets,)] number of iterations run by the coordinate descent
solver to reach the specified tolerance.
See also:
lars_path
lasso_path
LassoLars
LassoCV
LassoLarsCV
sklearn.decomposition.sparse_encode
Notes
Examples
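A short sketch (synthetic data; alpha chosen only for illustration) showing the sparsity induced by the L1 penalty and the sparse_coef_ attribute:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] * 3.0 - X[:, 3] * 2.0 + 0.1 * rng.randn(100)

reg = Lasso(alpha=0.1).fit(X, y)
print((reg.coef_ != 0).sum(), "non-zero coefficients")
print(reg.sparse_coef_)   # same coefficients, stored as a scipy.sparse matrix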
Methods
Notes
Coordinate descent is an algorithm that considers each column of data at a time hence it will automatically
convert the X input as a Fortran-contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent.
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
n_iters [array-like, shape (n_alphas,)] The number of iterations taken by the coordinate
descent optimizer to reach the specified tolerance for each alpha. (Is returned when
return_n_iter is set to True).
See also:
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
property sparse_coef_
sparse representation of the fitted coef_
sklearn.linear_model.LassoLars
lars_path
lasso_path
Lasso
LassoCV
LassoLarsCV
LassoLarsIC
sklearn.decomposition.sparse_encode
Examples
Methods
Notes
sklearn.linear_model.OrthogonalMatchingPursuit
class sklearn.linear_model.OrthogonalMatchingPursuit(n_nonzero_coefs=None, tol=None, fit_intercept=True, normalize=True, precompute='auto')
Orthogonal Matching Pursuit model (OMP)
Read more in the User Guide.
Parameters
n_nonzero_coefs [int, optional] Desired number of non-zero entries in the solution. If None
(by default) this value is set to 10% of n_features.
tol [float, optional] Maximum norm of the residual. If not None, overrides n_nonzero_coefs.
fit_intercept [boolean, optional] whether to calculate the intercept for this model. If set to false,
no intercept will be used in calculations (i.e. data is expected to be centered).
normalize [boolean, optional, default True] This parameter is ignored when fit_intercept
is set to False. If True, the regressors X will be normalized before regression by sub-
tracting the mean and dividing by the l2-norm. If you wish to standardize, please use
sklearn.preprocessing.StandardScaler before calling fit on an estimator
with normalize=False.
precompute [{True, False, ‘auto’}, default ‘auto’] Whether to use a precomputed Gram and
Xy matrix to speed up calculations. Improves performance when n_targets or n_samples is
very large. Note that if you already have such matrices, you can pass them directly to the fit
method.
Attributes
coef_ [array, shape (n_features,) or (n_targets, n_features)] parameter vector (w in the formula)
intercept_ [float or array, shape (n_targets,)] independent term in decision function.
n_iter_ [int or array-like] Number of active features across every target.
See also:
orthogonal_mp
orthogonal_mp_gram
lars_path
Lars
LassoLars
decomposition.sparse_encode
OrthogonalMatchingPursuitCV
Notes
Orthogonal matching pursuit was introduced in S. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of
the K-SVD Algorithm using Batch Orthogonal Matching Pursuit, Technical Report - CS Technion, April 2008.
https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
Examples
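A short sketch (synthetic data) recovering a sparse signal with a fixed number of non-zero coefficients:
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
w = np.zeros(20)
w[[2, 7, 11]] = [1.5, -2.0, 3.0]          # true sparse coefficients
y = X @ w + 0.05 * rng.randn(100)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
print(np.flatnonzero(omp.coef_))          # indices of the selected features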
Methods
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.ARDRegression
Notes
References
D. J. C. MacKay, Bayesian nonlinear modeling for the prediction competition, ASHRAE Transactions, 1994.
R. Salakhutdinov, Lecture notes on Statistical Machine Learning, http://www.utstat.toronto.edu/~rsalakhu/sta4273/notes/Lecture2.pdf#page=15
Their beta is our self.alpha_; their alpha is our self.lambda_.
ARD is a little different from the slides: only dimensions/features for which self.lambda_ < self.threshold_lambda are kept, and the rest are discarded.
Examples
Methods
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X, return_std=False)
Predict using the linear model.
In addition to the mean of the predictive distribution, also its standard deviation can be returned.
Parameters
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Samples.
return_std [bool, default=False] Whether to return the standard deviation of posterior pre-
diction.
Returns
y_mean [array-like of shape (n_samples,)] Mean of predictive distribution of query points.
y_std [array-like of shape (n_samples,)] Standard deviation of predictive distribution of
query points.
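A brief sketch (synthetic data) of requesting the predictive standard deviation alongside the mean:
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(80, 5)
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0]) + 0.1 * rng.randn(80)

ard = ARDRegression().fit(X, y)
y_mean, y_std = ard.predict(X[:3], return_std=True)
print(y_mean, y_std)     # mean and std of the predictive distribution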
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.BayesianRidge
Notes
There exist several strategies to perform Bayesian ridge regression. This implementation is based on the algo-
rithm described in Appendix A of (Tipping, 2001) where updates of the regularization parameters are done as
suggested in (MacKay, 1992). Note that according to A New View of Automatic Relevance Determination (Wipf
and Nagarajan, 2008) these update rules do not guarantee that the marginal likelihood is increasing between two
consecutive iterations of the optimization.
References
D. J. C. MacKay, Bayesian Interpolation, Computation and Neural Systems, Vol. 4, No. 3, 1992.
M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning
Research, Vol. 1, 2001.
Examples
Methods
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
These estimators fit multiple regression problems (or tasks) jointly, while inducing sparse coefficients. While the
inferred coefficients may differ between the tasks, they are constrained to agree on the features that are selected (non-
zero coefficients).
sklearn.linear_model.MultiTaskElasticNet
The optimization objective for MultiTaskElasticNet is:
(1 / (2 * n_samples)) * ||Y - XW||_Fro^2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
warm_start [bool, optional] When set to True, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution. See the Glossary.
random_state [int, RandomState instance or None, optional, default None] The seed of the
pseudo random number generator that selects a random feature to update. If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Used when selection == ‘random’.
selection [str, default ‘cyclic’] If set to ‘random’, a random coefficient is updated every iteration
rather than looping over features sequentially by default. This (setting to ‘random’) often
leads to significantly faster convergence especially when tol is higher than 1e-4.
Attributes
intercept_ [array, shape (n_tasks,)] Independent term in decision function.
coef_ [array, shape (n_tasks, n_features)] Parameter vector (W in the cost function formula). If
a 1D y is passed in at fit (non multi-task usage), coef_ is then a 1D array. Note that coef_
stores the transpose of W, W.T.
n_iter_ [int] number of iterations run by the coordinate descent solver to reach the specified
tolerance.
See also:
Notes
Examples
Methods
Notes
Coordinate descent is an algorithm that considers each column of data at a time hence it will automatically
convert the X input as a Fortran-contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent.
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
sklearn.linear_model.MultiTaskLasso
The optimization objective for MultiTaskLasso is:
(1 / (2 * n_samples)) * ||Y - XW||^2_Fro + alpha * ||W||_21
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
coef_ [array, shape (n_tasks, n_features)] Parameter vector (W in the cost function formula).
Note that coef_ stores the transpose of W, W.T.
intercept_ [array, shape (n_tasks,)] independent term in decision function.
n_iter_ [int] number of iterations run by the coordinate descent solver to reach the specified
tolerance.
See also:
Notes
Examples
Methods
Notes
Coordinate descent is an algorithm that considers each column of data at a time hence it will automatically
convert the X input as a Fortran-contiguous numpy array if necessary.
To avoid memory re-allocation it is advised to allocate the initial data in memory directly using that format.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
static path(X, y, l1_ratio=0.5, eps=0.001, n_alphas=100, alphas=None, precompute='auto', Xy=None, copy_X=True, coef_init=None, verbose=False, return_n_iter=False, positive=False, check_input=True, **params)
Compute elastic net path with coordinate descent.
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
Any estimator using the Huber loss would also be robust to outliers, e.g. SGDRegressor with loss='huber'.
sklearn.linear_model.HuberRegressor
outliers_ [array, shape (n_samples,)] A boolean mask which is set to True where the samples
are identified as outliers.
References
Examples
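A short sketch (synthetic data with a few injected outliers) showing the outliers_ mask described above:
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 2.0 * X.ravel() + rng.randn(100)
y[:5] += 30.0                       # inject a few large outliers

huber = HuberRegressor().fit(X, y)
print(huber.coef_)                  # slope estimated despite the outliers
print(huber.outliers_[:10])         # boolean mask marks the outlying samples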
Methods
fit(self, X, y[, sample_weight]) Fit the model according to the given training data.
get_params(self[, deep]) Get parameters for this estimator.
predict(self, X) Predict using the linear model.
score(self, X, y[, sample_weight]) Return the coefficient of determination R^2 of the prediction.
set_params(self, **params) Set the parameters of this estimator.
Returns
self [object]
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the linear model.
Parameters
X [array_like or sparse matrix, shape (n_samples, n_features)] Samples.
Returns
C [array, shape (n_samples,)] Returns predicted values.
score(self, X, y, sample_weight=None)
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) **
2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters
X [array-like of shape (n_samples, n_features)] Test samples. For some estimators this may
be a precomputed kernel matrix or a list of generic objects instead, shape = (n_samples,
n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for
the estimator.
y [array-like of shape (n_samples,) or (n_samples, n_outputs)] True values for X.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float] R^2 of self.predict(X) wrt. y.
Notes
sklearn.linear_model.RANSACRegressor
is_data_valid [callable, optional] This function is called with the randomly selected data before
the model is fitted to it: is_data_valid(X, y). If its return value is False the current
randomly chosen sub-sample is skipped.
is_model_valid [callable, optional] This function is called with the estimated model and the
randomly selected data: is_model_valid(model, X, y). If its return value is False
the current randomly chosen sub-sample is skipped. Rejecting samples with this function is
computationally costlier than with is_data_valid. is_model_valid should there-
fore only be used if the estimated model is needed for making the rejection decision.
max_trials [int, optional] Maximum number of iterations for random sample selection.
max_skips [int, optional] Maximum number of iterations that can be skipped due to finding
zero inliers or invalid data defined by is_data_valid or invalid models defined by
is_model_valid.
New in version 0.19.
stop_n_inliers [int, optional] Stop iteration if at least this number of inliers are found.
stop_score [float, optional] Stop iteration if the score is greater than or equal to this threshold.
stop_probability [float in range [0, 1], optional] RANSAC iteration stops if at least one outlier-free
set of the training data is sampled in RANSAC. This requires generating at least N samples (iterations):
N >= log(1 - probability) / log(1 - e**m)
where the probability (confidence) is typically set to a high value such as 0.99 (the default)
and e is the current fraction of inliers w.r.t. the total number of samples.
loss [string, callable, optional, default “absolute_loss”] String inputs, “absolute_loss” and
“squared_loss” are supported which find the absolute loss and squared loss per sample re-
spectively.
If loss is a callable, then it should be a function that takes two arrays as inputs, the true
and predicted value and returns a 1-D array with the i-th value of the array corresponding to
the loss on X[i].
If the loss on a sample is greater than the residual_threshold, then this sample is
classified as an outlier.
random_state [int, RandomState instance or None, optional, default None] The generator used
to initialize the centers. If int, random_state is the seed used by the random number gener-
ator; If RandomState instance, random_state is the random number generator; If None, the
random number generator is the RandomState instance used by np.random.
Attributes
estimator_ [object] Best fitted model (copy of the base_estimator object).
n_trials_ [int] Number of random selection trials until one of the stop criteria is met. It is
always <= max_trials.
inlier_mask_ [bool array of shape [n_samples]] Boolean mask of inliers classified as True.
n_skips_no_inliers_ [int] Number of iterations skipped due to finding zero inliers.
New in version 0.19.
n_skips_invalid_data_ [int] Number of iterations skipped due to invalid data defined by
is_data_valid.
New in version 0.19.
References
Examples
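A short sketch (synthetic data with some corrupted samples) showing the fitted estimator_ and the inlier_mask_:
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 1)
y = 3.0 * X.ravel() + 0.1 * rng.randn(200)
y[:20] += 50.0                                  # corrupt a fraction of samples

ransac = RANSACRegressor(random_state=0).fit(X, y)
print(ransac.estimator_.coef_)                  # model fitted on the inliers only
print(ransac.inlier_mask_.sum(), "inliers")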
Methods
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict using the estimated model.
This is a wrapper for estimator_.predict(X).
Parameters
X [numpy array of shape [n_samples, n_features]]
Returns
y [array, shape = [n_samples] or [n_samples, n_targets]] Returns predicted values.
score(self, X, y)
Returns the score of the prediction.
This is a wrapper for estimator_.score(X, y).
Parameters
X [numpy array or sparse matrix of shape [n_samples, n_features]] Training data.
y [array, shape = [n_samples] or [n_samples, n_targets]] Target values.
Returns
z [float] Score of the prediction.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
sklearn.linear_model.TheilSenRegressor
n_subpopulation_ [int] Number of combinations taken into account from ‘n choose k’, where
n is the number of samples and k is the number of subsamples.
References
• Theil-Sen Estimators in a Multiple Linear Regression Model, 2009 Xin Dang, Hanxiang Peng, Xueqin
Wang and Heping Zhang, http://home.olemiss.edu/~xdang/papers/MTSE.pdf
Examples
Methods
Notes
• Theil-Sen Regression
• Robust linear estimator fitting
7.22.7 Miscellaneous
sklearn.linear_model.PassiveAggressiveRegressor
sklearn.linear_model.PassiveAggressiveRegressor(C=1.0, fit_intercept=True,
max_iter=1000, tol=0.001,
early_stopping=False,
validation_fraction=0.1,
n_iter_no_change=5, shuffle=True,
verbose=0, loss=’epsilon_insensitive’,
epsilon=0.1, random_state=None,
warm_start=False, average=False)
Passive Aggressive Regressor
Read more in the User Guide.
Parameters
C [float] Maximum step size (regularization). Defaults to 1.0.
fit_intercept [bool] Whether the intercept should be estimated or not. If False, the data is
assumed to be already centered. Defaults to True.
max_iter [int, optional (default=1000)] The maximum number of passes over the training data
(aka epochs). It only impacts the behavior in the fit method, and not the partial_fit
method.
New in version 0.19.
tol [float or None, optional (default=1e-3)] The stopping criterion. If it is not None, the itera-
tions will stop when (loss > previous_loss - tol).
New in version 0.19.
early_stopping [bool, default=False] Whether to use early stopping to terminate training when
validation score is not improving. If set to True, it will automatically set aside a fraction of
training data as validation and terminate training when validation score is not improving by
at least tol for n_iter_no_change consecutive epochs.
New in version 0.20.
validation_fraction [float, default=0.1] The proportion of training data to set aside as validation
set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
New in version 0.20.
SGDRegressor
References
Examples
sklearn.linear_model.enet_path
The elastic net optimization function varies for mono and multi-outputs.
For mono-output tasks it is:
1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^Fro_2 + alpha * l1_ratio * ||W||_21 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
precompute [True | False | ‘auto’ | array-like] Whether to use a precomputed Gram matrix to
speed up calculations. If set to 'auto' let us decide. The Gram matrix can also be passed
as argument.
Xy [array-like, optional] Xy = np.dot(X.T, y) that can be precomputed. It is useful only when
the Gram matrix is precomputed.
copy_X [bool, optional, default True] If True, X will be copied; else, it may be overwritten.
coef_init [array, shape (n_features, ) | None] The initial values of the coefficients.
verbose [bool or int] Amount of verbosity.
return_n_iter [bool] Whether to return the number of iterations or not.
positive [bool, default False] If set to True, forces coefficients to be positive. (Only allowed
when y.ndim == 1).
check_input [bool, default True] Skip input validation checks, including the Gram matrix when
provided, assuming they are handled by the caller when check_input=False.
**params [kwargs] Keyword arguments passed to the coordinate descent solver.
Returns
alphas [array, shape (n_alphas,)] The alphas along the path where models are computed.
coefs [array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)] Coefficients
along the path.
dual_gaps [array, shape (n_alphas,)] The dual gaps at the end of the optimization for each
alpha.
n_iters [array-like, shape (n_alphas,)] The number of iterations taken by the coordinate
descent optimizer to reach the specified tolerance for each alpha. (Is returned when
return_n_iter is set to True).
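A minimal usage sketch (synthetic data; n_alphas reduced only for the example) showing the shapes of the returned path:
import numpy as np
from sklearn.linear_model import enet_path

rng = np.random.RandomState(0)
X = rng.randn(60, 8)
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.randn(60)

alphas, coefs, dual_gaps = enet_path(X, y, l1_ratio=0.5, n_alphas=20)
print(alphas.shape, coefs.shape)   # (20,) and (8, 20): one coefficient column per alpha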
See also:
MultiTaskElasticNet
MultiTaskElasticNetCV
ElasticNet
ElasticNetCV
Notes
sklearn.linear_model.lars_path
in the case of method=’lars’, the objective function is only known in the form of an implicit equation (see
discussion in [1])
Read more in the User Guide.
Parameters
X [None or array-like of shape (n_samples, n_features)] Input data. Note that if X is None then
the Gram matrix must be specified, i.e., cannot be None or False.
Deprecated since version 0.21: Passing X=None in combination with a precomputed Gram matrix
will be removed in v0.23. Use lars_path_gram instead.
y [None or array-like of shape (n_samples,)] Input targets.
Xy [array-like of shape (n_samples,) or (n_samples, n_targets), default=None] Xy =
np.dot(X.T, y) that can be precomputed. It is useful only when the Gram matrix is pre-
computed.
Gram [None, ‘auto’, array-like of shape (n_features, n_features), default=None] Precomputed
Gram matrix (X’ * X), if 'auto', the Gram matrix is precomputed from the given X, if
there are more samples than features.
Deprecated since version 0.21: Passing X=None in combination with a precomputed Gram matrix
will be removed in v0.23. Use lars_path_gram instead.
max_iter [int, default=500] Maximum number of iterations to perform, set to infinity for no
limit.
alpha_min [float, default=0] Minimum correlation along the path. It corresponds to the regu-
larization parameter alpha parameter in the Lasso.
method [{‘lar’, ‘lasso’}, default=’lar’] Specifies the returned model. Select 'lar' for Least
Angle Regression, 'lasso' for the Lasso.
copy_X [bool, default=True] If False, X is overwritten.
eps [float, optional] The machine-precision regularization in the computation of the Cholesky
diagonal factors. Increase this for very ill-conditioned systems. By default,
np.finfo(np.float).eps is used.
copy_Gram [bool, default=True] If False, Gram is overwritten.
verbose [int, default=0] Controls output verbosity.
return_path [bool, default=True] If return_path==True returns the entire path, else re-
turns only the last point of the path.
return_n_iter [bool, default=False] Whether to return the number of iterations.
positive [bool, default=False] Restrict coefficients to be >= 0. This option is only allowed
with method ‘lasso’. Note that the model coefficients will not converge to the ordinary-
least-squares solution for small values of alpha. Only coefficients up to the smallest alpha
value (alphas_[alphas_ > 0.].min() when fit_path=True) reached by the step-
wise Lars-Lasso algorithm are typically in congruence with the solution of the coordinate
descent lasso_path function.
Returns
alphas [array-like of shape (n_alphas + 1,)] Maximum of covariances (in absolute value) at
each iteration. n_alphas is either max_iter, n_features or the number of nodes in
the path with alpha >= alpha_min, whichever is smaller.
active [array-like of shape (n_alphas,)] Indices of active variables at the end of the path.
coefs [array-like of shape (n_features, n_alphas + 1)] Coefficients along the path
n_iter [int] Number of iterations run. Returned only if return_n_iter is set to True.
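A minimal usage sketch (synthetic data) of the returned path:
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
X = rng.randn(60, 8)
y = X[:, 1] + 2.0 * X[:, 5] + 0.05 * rng.randn(60)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)          # indices of the active variables at the end of the path
print(coefs.shape)     # (n_features, n_alphas + 1)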
See also:
lars_path_gram
lasso_path
lasso_path_gram
LassoLars
Lars
LassoLarsCV
LarsCV
sklearn.decomposition.sparse_encode
References
sklearn.linear_model.lars_path_gram
in the case of method=’lars’, the objective function is only known in the form of an implicit equation (see
discussion in [1])
Read more in the User Guide.
Parameters
Xy [array-like of shape (n_samples,) or (n_samples, n_targets)] Xy = np.dot(X.T, y).
Gram [array-like of shape (n_features, n_features)] Gram = np.dot(X.T, X).
n_samples [int or float] Equivalent size of sample.
max_iter [int, default=500] Maximum number of iterations to perform, set to infinity for no
limit.
alpha_min [float, default=0] Minimum correlation along the path. It corresponds to the regu-
larization parameter alpha parameter in the Lasso.
method [{‘lar’, ‘lasso’}, default=’lar’] Specifies the returned model. Select 'lar' for Least
Angle Regression, 'lasso' for the Lasso.
copy_X [bool, default=True] If False, X is overwritten.
eps [float, optional] The machine-precision regularization in the computation of the Cholesky
diagonal factors. Increase this for very ill-conditioned systems. By default,
np.finfo(np.float).eps is used.
copy_Gram [bool, default=True] If False, Gram is overwritten.
verbose [int, default=0] Controls output verbosity.
return_path [bool, default=True] If return_path==True returns the entire path, else re-
turns only the last point of the path.
return_n_iter [bool, default=False] Whether to return the number of iterations.
positive [bool, default=False] Restrict coefficients to be >= 0. This option is only allowed
with method ‘lasso’. Note that the model coefficients will not converge to the ordinary-
least-squares solution for small values of alpha. Only coefficients up to the smallest alpha
value (alphas_[alphas_ > 0.].min() when fit_path=True) reached by the step-
wise Lars-Lasso algorithm are typically in congruence with the solution of the coordinate
descent lasso_path function.
Returns
alphas [array-like of shape (n_alphas + 1,)] Maximum of covariances (in absolute value) at
each iteration. n_alphas is either max_iter, n_features or the number of nodes in
the path with alpha >= alpha_min, whichever is smaller.
active [array-like of shape (n_alphas,)] Indices of active variables at the end of the path.
coefs [array-like of shape (n_features, n_alphas + 1)] Coefficients along the path
n_iter [int] Number of iterations run. Returned only if return_n_iter is set to True.
See also:
lars_path
lasso_path
lasso_path_gram
LassoLars
Lars
LassoLarsCV
LarsCV
sklearn.decomposition.sparse_encode
References
sklearn.linear_model.lasso_path
For the single-output Lasso the optimization objective is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
For multi-output tasks it is:
(1 / (2 * n_samples)) * ||Y - XW||^2_Fro + alpha * ||W||_21
Where:
||W||_21 = \sum_i \sqrt{\sum_j w_{ij}^2}
i.e. the sum of norms of each row.
coef_init [array, shape (n_features, ) | None] The initial values of the coefficients.
verbose [bool or integer] Amount of verbosity.
return_n_iter [bool] whether to return the number of iterations or not.
positive [bool, default False] If set to True, forces coefficients to be positive. (Only allowed
when y.ndim == 1).
**params [kwargs] keyword arguments passed to the coordinate descent solver.
Returns
alphas [array, shape (n_alphas,)] The alphas along the path where models are computed.
coefs [array, shape (n_features, n_alphas) or (n_outputs, n_features, n_alphas)] Coefficients
along the path.
dual_gaps [array, shape (n_alphas,)] The dual gaps at the end of the optimization for each
alpha.
n_iters [array-like, shape (n_alphas,)] The number of iterations taken by the coordinate descent
optimizer to reach the specified tolerance for each alpha.
See also:
lars_path
Lasso
LassoLars
LassoCV
LassoLarsCV
sklearn.decomposition.sparse_encode
Notes
Examples
sklearn.linear_model.orthogonal_mp
n_iters [array-like or int] Number of active features across every target. Returned only if
return_n_iter is set to True.
See also:
OrthogonalMatchingPursuit
orthogonal_mp_gram
lars_path
decomposition.sparse_encode
Notes
Orthogonal matching pursuit was introduced in S. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of
the K-SVD Algorithm using Batch Orthogonal Matching Pursuit Technical Report - CS Technion, April 2008.
https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
sklearn.linear_model.orthogonal_mp_gram
coef [array, shape (n_features,) or (n_features, n_targets)] Coefficients of the OMP solution. If
return_path=True, this contains the whole coefficient path. In this case its shape is
(n_features, n_features) or (n_features, n_targets, n_features) and iterating over the last axis
yields coefficients in increasing order of active features.
n_iters [array-like or int] Number of active features across every target. Returned only if
return_n_iter is set to True.
See also:
OrthogonalMatchingPursuit
orthogonal_mp
lars_path
decomposition.sparse_encode
Notes
Orthogonal matching pursuit was introduced in S. Mallat, Z. Zhang, Matching pursuits with time-frequency
dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12. (December 1993), pp. 3397-3415.
(http://blanche.polytechnique.fr/~mallat/papiers/MallatPursuit93.pdf)
This implementation is based on Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of
the K-SVD Algorithm using Batch Orthogonal Matching Pursuit, Technical Report - CS Technion, April 2008.
https://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf
sklearn.linear_model.ridge_regression
• ‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. More
stable for singular matrices than ‘cholesky’.
• ‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution
via a Cholesky decomposition of dot(X.T, X)
• ‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an
iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data
(possibility to set tol and max_iter).
• ‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the
fastest and uses an iterative procedure.
• ‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased
version named SAGA. Both methods also use an iterative procedure, and are often faster
than other solvers when both n_samples and n_features are large. Note that ‘sag’ and
‘saga’ fast convergence is only guaranteed on features with approximately the same scale.
You can preprocess the data with a scaler from sklearn.preprocessing.
All solvers except 'svd' support both dense and sparse data. However, only 'sag' and 'sparse_cg'
support sparse input when fit_intercept is True.
New in version 0.17: Stochastic Average Gradient descent solver.
New in version 0.19: SAGA solver.
max_iter [int, default=None] Maximum number of iterations for the conjugate gradient solver. For
the 'sparse_cg' and 'lsqr' solvers, the default value is determined by scipy.sparse.linalg. For
the 'sag' and 'saga' solvers, the default value is 1000.
tol [float, default=1e-3] Precision of the solution.
verbose [int, default=0] Verbosity level. Setting verbose > 0 will display additional information
depending on the solver used.
random_state [int, RandomState instance, default=None] The seed of the pseudo random num-
ber generator to use when shuffling the data. If int, random_state is the seed used by the
random number generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState instance used by np.
random. Used when solver == ‘sag’.
return_n_iter [bool, default=False] If True, the method also returns n_iter, the actual number
of iterations performed by the solver.
New in version 0.17.
return_intercept [bool, default=False] If True and if X is sparse, the method also re-
turns the intercept, and the solver is automatically changed to ‘sag’. This is only
a temporary fix for fitting the intercept with sparse data. For dense data, use
sklearn.linear_model._preprocess_data before your regression.
New in version 0.17.
check_input [bool, default=True] If False, the input arrays X and y will not be checked.
New in version 0.21.
Returns
coef [ndarray of shape (n_features,) or (n_targets, n_features)] Weight vector(s).
n_iter [int, optional] The actual number of iterations performed by the solver. Only returned if
return_n_iter is True.
intercept [float or ndarray of shape (n_targets,)] The intercept of the model. Only returned if
return_intercept is True and if X is a scipy sparse array.
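A minimal usage sketch (synthetic data; the solver is chosen only for illustration). By default ridge_regression returns only the weight vector, and the intercept is returned only when return_intercept is True and X is sparse, as described above:
import numpy as np
from sklearn.linear_model import ridge_regression

rng = np.random.RandomState(0)
X = rng.randn(50, 4)
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.1 * rng.randn(50)

coef = ridge_regression(X, y, alpha=1.0, solver="cholesky")
print(coef)            # weight vector only; no intercept is fitted here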
Notes
7.23.1 sklearn.manifold.Isomap
References
Examples
Methods
Notes
7.23.2 sklearn.manifold.LocallyLinearEmbedding
n_jobs [int or None, optional (default=None)] The number of parallel jobs to run. None means
1 unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
Attributes
embedding_ [array-like, shape [n_samples, n_components]] Stores the embedding vectors
reconstruction_error_ [float] Reconstruction error associated with embedding_
nbrs_ [NearestNeighbors object] Stores nearest neighbors instance, including BallTree or
KDtree if applicable.
References
Examples
Methods
fit_transform(self, X, y=None)
Compute the embedding vectors for data X and transform X.
Parameters
X [array-like of shape [n_samples, n_features]] training set.
y [Ignored]
Returns
X_new [array-like, shape (n_samples, n_components)]
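A brief usage sketch (the swiss-roll dataset and the parameter values are illustrative only):
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=500, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_new = lle.fit_transform(X)            # embed the 3-D roll into 2 dimensions
print(X_new.shape, lle.reconstruction_error_)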
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Transform new points into embedding space.
Parameters
X [array-like of shape (n_samples, n_features)]
Returns
X_new [array, shape = [n_samples, n_components]]
Notes
Because of the scaling performed by this method, it is discouraged to use it together with methods that are not
scale-invariant (like SVMs).
7.23.3 sklearn.manifold.MDS
References
“Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics
(1997)
“Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
“Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika,
29, (1964)
Examples
Methods
fit(self, X[, y, init]) Computes the position of the points in the embedding space.
fit_transform(self, X[, y, init]) Fit the data from X, and return the embedded coordinates.
get_params(self[, deep]) Get parameters for this estimator.
set_params(self, **params) Set the parameters of this estimator.
y [Ignored]
init [ndarray, shape (n_samples,), optional, default: None] Starting configuration of the em-
bedding to initialize the SMACOF algorithm. By default, the algorithm is initialized with
a randomly chosen array.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.23.4 sklearn.manifold.SpectralEmbedding
References
Examples
Methods
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.23.5 sklearn.manifold.TSNE
learning_rate [float, optional (default: 200.0)] The learning rate for t-SNE is usually in the
range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any
point approximately equidistant from its nearest neighbours. If the learning rate is too low,
most points may look compressed in a dense cloud with few outliers. If the cost function
gets stuck in a bad local minimum increasing the learning rate may help.
n_iter [int, optional (default: 1000)] Maximum number of iterations for the optimization.
Should be at least 250.
n_iter_without_progress [int, optional (default: 300)] Maximum number of iterations without
progress before we abort the optimization, used after 250 initial iterations with early exag-
geration. Note that progress is only checked every 50 iterations so this value is rounded to
the next multiple of 50.
New in version 0.17: parameter n_iter_without_progress to control stopping criteria.
min_grad_norm [float, optional (default: 1e-7)] If the gradient norm is below this threshold,
the optimization will be stopped.
metric [string or callable, optional] The metric to use when calculating distance between
instances in a feature array. If metric is a string, it must be one of the options al-
lowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pair-
wise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to
be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair
of instances (rows) and the resulting value recorded. The callable should take two arrays
from X as input and return a value indicating the distance between them. The default is
“euclidean” which is interpreted as squared euclidean distance.
init [string or numpy array, optional (default: “random”)] Initialization of embedding. Possible
options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, n_components). PCA
initialization cannot be used with precomputed distances and is usually more globally stable
than random initialization.
verbose [int, optional (default: 0)] Verbosity level.
random_state [int, RandomState instance or None, optional (default: None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random. Note that different initializations might
result in different local minima of the cost function.
method [string (default: ‘barnes_hut’)] By default the gradient calculation algorithm uses
Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the
slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when
nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale
to millions of examples.
New in version 0.17: Approximate optimization method via the Barnes-Hut.
angle [float (default: 0.5)] Only used if method=’barnes_hut’ This is the trade-off between
speed and accuracy for Barnes-Hut T-SNE. ‘angle’ is the angular size (referred to as theta
in [3]) of a distant node as measured from a point. If this size is below ‘angle’ then it is used
as a summary node of all points contained within it. This method is not very sensitive to
changes in this parameter in the range of 0.2 - 0.8. Angle less than 0.2 has quickly increasing
computation time and angle greater 0.8 has quickly increasing error.
n_jobs [int or None, optional (default=None)] The number of parallel jobs to run for
neighbors search. This parameter has no impact when metric="precomputed"
or (metric="euclidean" and method="exact"). None means 1 unless in a
joblib.parallel_backend context. -1 means using all processors. See Glossary
for more details.
References
[1] van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Ma-
chine Learning Research 9:2579-2605, 2008.
[2] van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding https://lvdmaaten.github.io/tsne/
[3] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learn-
ing Research 15(Oct):3221-3245, 2014. https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf
Examples
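A minimal usage sketch (illustrative only; the toy array below is chosen here rather than taken from this reference):
>>> import numpy as np
>>> from sklearn.manifold import TSNE
>>> X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
>>> X_embedded = TSNE(n_components=2).fit_transform(X)
>>> X_embedded.shape
(4, 2)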
Methods
y [Ignored]
fit_transform(self, X, y=None)
Fit X into an embedded space and return that transformed output.
Parameters
X [array, shape (n_samples, n_features) or (n_samples, n_samples)] If the metric is ‘pre-
computed’ X must be a square distance matrix. Otherwise it contains a sample per row. If
the method is ‘exact’, X may be a sparse matrix of type ‘csr’, ‘csc’ or ‘coo’. If the method
is ‘barnes_hut’ and the metric is ‘precomputed’, X may be a precomputed sparse graph.
y [Ignored]
Returns
X_new [array, shape (n_samples, n_components)] Embedding of the training data in low-
dimensional space.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.23.6 sklearn.manifold.locally_linear_embedding
References
7.23.7 sklearn.manifold.smacof
n_jobs [int or None, optional (default=None)] The number of jobs to use for the computation.
If multiple initializations are used (n_init), each run of the algorithm is computed in
parallel.
None means 1 unless in a joblib.parallel_backend context. -1 means using all
processors. See Glossary for more details.
max_iter [int, optional, default: 300] Maximum number of iterations of the SMACOF algo-
rithm for a single run.
verbose [int, optional, default: 0] Level of verbosity.
eps [float, optional, default: 1e-3] Relative tolerance with respect to stress at which to declare
convergence.
random_state [int, RandomState instance or None, optional, default: None] The generator
used to initialize the centers. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
return_n_iter [bool, optional, default: False] Whether or not to return the number of iterations.
Returns
X [ndarray, shape (n_samples, n_components)] Coordinates of the points in a
n_components-space.
stress [float] The final value of the stress (sum of squared distance of the disparities and the
distances for all constrained points).
n_iter [int] The number of iterations corresponding to the best stress. Returned only if
return_n_iter is set to True.
Notes
“Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics
(1997)
“Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
“Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika,
29, (1964)
7.23.8 sklearn.manifold.spectral_embedding
Notes
Spectral Embedding (Laplacian Eigenmaps) is most useful when the graph has one connected component. If
the graph has many components, the first few eigenvectors will simply uncover the connected components of
the graph.
References
• https://en.wikipedia.org/wiki/LOBPCG
• Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gra-
dient Method Andrew V. Knyazev https://doi.org/10.1137%2FS1064827500366124
7.23.9 sklearn.manifold.trustworthiness
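The expression that the following paragraph refers to was lost in extraction; a hedged reconstruction of the standard trustworthiness definition is:

T(k) = 1 - \frac{2}{n k (2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i^k} \max\bigl(0, (r(i, j) - k)\bigr)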
where for each sample i, \mathcal{N}_i^k are its k nearest neighbors in the output space, and every sample j is its r(i, j)-th
nearest neighbor in the input space. In other words, any unexpected nearest neighbors in the output space are
penalised in proportion to their rank in the input space.
• “Neighborhood Preservation in Nonlinear Projection Methods: An Experimental Study” J. Venna, S. Kaski
• “Learning a Parametric Embedding by Preserving Local Structure” L.J.P. van der Maaten
Parameters
X [array, shape (n_samples, n_features) or (n_samples, n_samples)] If the metric is ‘precom-
puted’ X must be a square distance matrix. Otherwise it contains a sample per row.
X_embedded [array, shape (n_samples, n_components)] Embedding of the training data in
low-dimensional space.
n_neighbors [int, optional (default: 5)] Number of neighbors k that will be considered.
metric [string, or callable, optional, default ‘euclidean’] Which metric to use for computing
pairwise distances between samples from the original input space. If metric is ‘precom-
puted’, X must be a matrix of pairwise distances or squared distances. Otherwise, see the
documentation of argument metric in sklearn.pairwise.pairwise_distances for a list of avail-
able metrics.
Returns
trustworthiness [float] Trustworthiness of the low-dimensional embedding.
See the Metrics and scoring: quantifying the quality of predictions section and the Pairwise metrics, Affinities and
Kernels section of the user guide for further details.
The sklearn.metrics module includes score functions, performance metrics and pairwise metrics and distance
computations.
See the The scoring parameter: defining model evaluation rules section of the user guide for further details.
sklearn.metrics.check_scoring
scoring [string, callable or None, optional, default: None] A string (see model evaluation doc-
umentation) or a scorer callable object / function with signature scorer(estimator,
X, y).
allow_none [boolean, optional, default: False] If no scoring is specified and the estimator has
no score function, we can either return None or raise an exception.
Returns
scoring [callable] A scorer callable object / function with signature scorer(estimator,
X, y).
sklearn.metrics.get_scorer
sklearn.metrics.get_scorer(scoring)
Get a scorer from string.
Read more in the User Guide.
Parameters
scoring [str | callable] scoring method as string. If callable it is returned as is.
Returns
scorer [callable] The scorer.
sklearn.metrics.make_scorer
For example, average_precision or the area under the ROC curve cannot be computed
using discrete predictions alone.
**kwargs [additional arguments] Additional parameters to be passed to score_func.
Returns
scorer [callable] Callable object that returns a scalar score; greater is better.
Notes
Examples
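As an illustrative sketch of wrapping a metric into a scorer for model selection (the toy parameter grid is chosen here):
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]},
...                     scoring=ftwo_scorer)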
See the Classification metrics section of the user guide for further details.
sklearn.metrics.accuracy_score
See also:
Notes
In binary and multiclass classification, this function is equal to the jaccard_score function.
Examples
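For orientation, a small sketch with toy labels (chosen here for illustration):
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2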
sklearn.metrics.auc
sklearn.metrics.auc(x, y)
Compute Area Under the Curve (AUC) using the trapezoidal rule
This is a general function, given points on a curve. For computing the area under the ROC-
curve, see roc_auc_score. For an alternative way to summarize a precision-recall curve, see
average_precision_score.
Parameters
x [array, shape = [n]] x coordinates. These must be either monotonic increasing or monotonic
decreasing.
y [array, shape = [n]] y coordinates.
Returns
auc [float]
See also:
Examples
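A minimal sketch pairing auc with roc_curve on toy scores (inputs chosen here for illustration):
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75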
sklearn.metrics.average_precision_score
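The summation that the paragraph below refers to did not survive extraction; a hedged reconstruction of the usual definition is:

\text{AP} = \sum_n (R_n - R_{n-1}) P_n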
where P_n and R_n are the precision and recall at the nth threshold [1]. This implementation is not interpolated
and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses
linear interpolation and can be too optimistic.
Note: this implementation is restricted to the binary classification task or multilabel classification task.
Read more in the User Guide.
Parameters
y_true [array, shape = [n_samples] or [n_samples, n_classes]] True binary labels or binary label
indicators.
y_score [array, shape = [n_samples] or [n_samples, n_classes]] Target scores, can either be
probability estimates of the positive class, confidence values, or non-thresholded measure
of decisions (as returned by “decision_function” on some classifiers).
average [string, [None, ‘micro’, ‘macro’ (default), ‘samples’, ‘weighted’]] If None, the scores
for each class are returned. Otherwise, this determines the type of averaging performed on
the data:
'micro': Calculate metrics globally by considering each element of the label indicator
matrix as a label.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not
take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by sup-
port (the number of true instances for each label).
'samples': Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
pos_label [int or str (default=1)] The label of the positive class. Only applied to binary
y_true. For multilabel-indicator y_true, pos_label is fixed to 1.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
average_precision [float]
See also:
Notes
Changed in version 0.19: Instead of linearly interpolating between operating points, precisions are weighted by
the change in recall since the last operating point.
References
[1]
Examples
• Precision-Recall
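An illustrative sketch with toy scores (chosen here, not from the original page):
>>> import numpy as np
>>> from sklearn.metrics import average_precision_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> average_precision_score(y_true, y_scores)
0.83...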
sklearn.metrics.balanced_accuracy_score
recall_score, roc_auc_score
Notes
Some literature promotes alternative definitions of balanced accuracy. Our definition is equivalent to
accuracy_score with class-balanced sample weights, and shares desirable properties with the binary case.
See the User Guide.
References
[1], [2]
Examples
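A small hedged sketch on toy binary labels (chosen here):
>>> from sklearn.metrics import balanced_accuracy_score
>>> y_true = [0, 1, 0, 0, 1, 0]
>>> y_pred = [0, 1, 0, 0, 0, 1]
>>> balanced_accuracy_score(y_true, y_pred)
0.625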
sklearn.metrics.brier_score_loss
The smaller the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score
always takes on a value between zero and one, since this is the largest possible difference between a predicted
probability (which must be between zero and one) and the actual outcome (which can take on values of only 0
and 1). The Brier loss is composed of refinement loss and calibration loss. The Brier score is appropriate for
binary and categorical outcomes that can be structured as true or false, but is inappropriate for ordinal variables
which can take on three or more values (this is because the Brier score assumes that all possible outcomes are
equivalently “distant” from one another). Which label is considered to be the positive label is controlled via the
parameter pos_label, which defaults to 1. Read more in the User Guide.
Parameters
y_true [array, shape (n_samples,)] True targets.
y_prob [array, shape (n_samples,)] Probabilities of the positive class.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
pos_label [int or str, default=None] Label of the positive class. Defaults to the greater label
unless y_true is all 0 or all -1 in which case pos_label defaults to 1.
Returns
score [float] Brier score
References
[1]
Examples
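For orientation, a toy example (inputs chosen here):
>>> import numpy as np
>>> from sklearn.metrics import brier_score_loss
>>> y_true = np.array([0, 1, 1, 0])
>>> y_prob = np.array([0.1, 0.9, 0.8, 0.3])
>>> brier_score_loss(y_true, y_prob)
0.037...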
sklearn.metrics.classification_report
The reported averages include macro average (averaging the unweighted mean per la-
bel), weighted average (averaging the support-weighted mean per label), and sample
average (only for multilabel classification). Micro average (averaging the total true
positives, false negatives and false positives) is only shown for multi-label or multi-
class with a subset of classes, because it corresponds to accuracy otherwise. See also
precision_recall_fscore_support for more details on averages.
Note that in binary classification, recall of the positive class is also known as “sensitivity”;
recall of the negative class is “specificity”.
See also:
precision_recall_fscore_support, confusion_matrix
multilabel_confusion_matrix
Examples
sklearn.metrics.cohen_kappa_score
\kappa = (p_o - p_e) / (1 - p_e)
where p_o is the empirical probability of agreement on the label assigned to any sample (the observed agreement
ratio), and p_e is the expected agreement when both annotators assign labels randomly. p_e is estimated using a
per-annotator empirical prior over the class labels [2].
References
sklearn.metrics.confusion_matrix
References
[1]
Examples
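An illustrative multiclass sketch (toy labels chosen here):
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])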
sklearn.metrics.dcg_score
k [int, optional (default=None)] Only consider the highest k scores in the ranking. If None, use
all outputs.
log_base [float, optional (default=2)] Base of the logarithm used for the discount. A low value
means a sharper discount (top results are more important).
sample_weight [ndarray, shape (n_samples,), optional (default=None)] Sample weights. If
None, all samples are given the same weight.
ignore_ties [bool, optional (default=False)] Assume that there are no ties in y_score (which is
likely to be the case if y_score is continuous) for efficiency gains.
Returns
discounted_cumulative_gain [float] The averaged sample DCG scores.
See also:
ndcg_score The Discounted Cumulative Gain divided by the Ideal Discounted Cumulative Gain (the DCG
obtained for a perfect ranking), in order to have a score between 0 and 1.
References
Examples
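A minimal sketch with one toy query (relevances and scores chosen here):
>>> import numpy as np
>>> from sklearn.metrics import dcg_score
>>> true_relevance = np.asarray([[10, 0, 0, 1, 5]])
>>> scores = np.asarray([[.1, .2, .3, 4, 70]])
>>> dcg_score(true_relevance, scores)
9.49...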
sklearn.metrics.f1_score
In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending
on the average parameter.
Read more in the User Guide.
Parameters
y_true [1d array-like, or label indicator array / sparse matrix] Ground truth (correct) target
values.
y_pred [1d array-like, or label indicator array / sparse matrix] Estimated targets as returned by
a classifier.
labels [list, optional] The set of labels to include when average != 'binary', and their
order if average is None. Labels present in the data can be excluded, for example to
calculate a multiclass average ignoring a majority negative class, while labels not present in
the data will result in 0 components in a macro average. For multilabel targets, labels are
column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: parameter labels improved for multiclass problem.
pos_label [str or int, 1 by default] The class to report if average='binary' and the
data is binary. If the data are multiclass or multilabel, this will be ignored; setting
labels=[pos_label] and average != 'binary' will report scores for that la-
bel only.
average [string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]] This pa-
rameter is required for multiclass/multilabel targets. If None, the scores for each class are
returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable
only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives
and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not
take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support
(the number of true instances for each label). This alters ‘macro’ to account for label
imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful
for multilabel classification where this differs from accuracy_score).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
zero_division [“warn”, 0 or 1, default=”warn”] Sets the value to return when there is a zero
division, i.e. when all predictions and labels are negative. If set to “warn”, this acts as 0, but
warnings are also raised.
Returns
f1_score [float or array of float, shape = [n_unique_labels]] F1 score of the positive class in
binary classification or weighted average of the F1 scores of each class for the multiclass
task.
See also:
Notes
When true positive + false positive == 0, precision is undefined; When true positive
+ false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0,
as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with
zero_division.
References
[1]
Examples
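For orientation, a toy multiclass sketch (labels chosen here):
>>> from sklearn.metrics import f1_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> f1_score(y_true, y_pred, average='macro')
0.26...
>>> f1_score(y_true, y_pred, average='micro')
0.33...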
sklearn.metrics.fbeta_score
Returns
fbeta_score [float (if average is not None) or array of float, shape = [n_unique_labels]] F-beta
score of the positive class in binary classification or weighted average of the F-beta score of
each class for the multiclass task.
See also:
precision_recall_fscore_support, multilabel_confusion_matrix
Notes
References
[1], [2]
Examples
sklearn.metrics.hamming_loss
See also:
Notes
In multiclass classification, the Hamming loss corresponds to the Hamming distance between y_true and
y_pred which is equivalent to the subset zero_one_loss function, when normalize parameter is set to
True.
In multilabel classification, the Hamming loss is different from the subset zero-one loss. The zero-one loss
considers the entire set of labels for a given sample incorrect if it does not entirely match the true set of labels.
Hamming loss is more forgiving in that it penalizes only the individual labels.
The Hamming loss is upperbounded by the subset zero-one loss, when normalize parameter is set to True. It
is always between 0 and 1, lower being better.
References
[1], [2]
Examples
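A small illustrative sketch:
>>> from sklearn.metrics import hamming_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> hamming_loss(y_true, y_pred)
0.25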
sklearn.metrics.hinge_loss
References
Examples
sklearn.metrics.jaccard_score
The Jaccard index [1], or Jaccard similarity coefficient, defined as the size of the intersection divided by the size
of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of
labels in y_true.
Read more in the User Guide.
Parameters
y_true [1d array-like, or label indicator array / sparse matrix] Ground truth (correct) labels.
y_pred [1d array-like, or label indicator array / sparse matrix] Predicted labels, as returned by
a classifier.
labels [list, optional] The set of labels to include when average != 'binary', and their
order if average is None. Labels present in the data can be excluded, for example to
calculate a multiclass average ignoring a majority negative class, while labels not present in
the data will result in 0 components in a macro average. For multilabel targets, labels are
column indices. By default, all labels in y_true and y_pred are used in sorted order.
pos_label [str or int, 1 by default] The class to report if average='binary' and the
data is binary. If the data are multiclass or multilabel, this will be ignored; setting
labels=[pos_label] and average != 'binary' will report scores for that la-
bel only.
average [string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]] If None,
the scores for each class are returned. Otherwise, this determines the type of averaging
performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable
only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives
and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not
take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average, weighted by sup-
port (the number of true instances for each label). This alters ‘macro’ to account for label
imbalance.
'samples': Calculate metrics for each instance, and find their average (only meaningful
for multilabel classification).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float (if average is not None) or array of floats, shape = [n_unique_labels]]
See also:
Notes
jaccard_score may be a poor metric if there are no positives for some samples or classes. Jaccard is
undefined if there are no true or predicted labels, and our implementation will return a score of 0 with a warning.
References
[1]
Examples
• Classifier Chain
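An illustrative binary sketch (toy labels chosen here):
>>> from sklearn.metrics import jaccard_score
>>> y_true = [0, 1, 1]
>>> y_pred = [1, 1, 1]
>>> jaccard_score(y_true, y_pred)
0.66...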
sklearn.metrics.log_loss
y_true [array-like or label indicator matrix] Ground truth (correct) labels for n_samples sam-
ples.
y_pred [array-like of float, shape = (n_samples, n_classes) or (n_samples,)] Predicted prob-
abilities, as returned by a classifier’s predict_proba method. If y_pred.shape =
(n_samples,) the probabilities provided are assumed to be that of the positive class. The
labels in y_pred are assumed to be ordered alphabetically, as done by preprocessing.
LabelBinarizer.
eps [float] Log loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1
- eps, p)).
normalize [bool, optional (default=True)] If true, return the mean loss per sample. Otherwise,
return the sum of the per-sample losses.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
labels [array-like, optional (default=None)] If not provided, labels will be inferred from y_true.
If labels is None and y_pred has shape (n_samples,) the labels are assumed to be
binary and are inferred from y_true.
New in version 0.18.
Returns
loss [float]
Notes
References
C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer, p. 209.
Examples
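For orientation, a toy sketch with string labels (which are ordered alphabetically, as noted above):
>>> from sklearn.metrics import log_loss
>>> log_loss(["spam", "ham", "ham", "spam"],
...          [[.1, .9], [.9, .1], [.8, .2], [.35, .65]])
0.21616...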
sklearn.metrics.matthews_corrcoef
The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average
random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source:
Wikipedia]
Binary and multiclass labels are supported. Only in the binary case does this relate to information about true
and false positives and negatives. See references below.
Read more in the User Guide.
Parameters
y_true [array, shape = [n_samples]] Ground truth (correct) target values.
y_pred [array, shape = [n_samples]] Estimated targets as returned by a classifier.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
mcc [float] The Matthews correlation coefficient (+1 represents a perfect prediction, 0 an aver-
age random prediction and -1 an inverse prediction).
References
Examples
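A small illustrative sketch on toy binary labels (chosen here):
>>> from sklearn.metrics import matthews_corrcoef
>>> y_true = [+1, +1, +1, -1]
>>> y_pred = [+1, -1, +1, +1]
>>> matthews_corrcoef(y_true, y_pred)
-0.33...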
sklearn.metrics.multilabel_confusion_matrix
confusion_matrix
Notes
Examples
Multilabel-indicator case:
>>> import numpy as np
>>> from sklearn.metrics import multilabel_confusion_matrix
>>> y_true = np.array([[1, 0, 1],
... [0, 1, 0]])
>>> y_pred = np.array([[1, 0, 0],
... [0, 1, 1]])
>>> multilabel_confusion_matrix(y_true, y_pred)
array([[[1, 0],
[0, 1]],
<BLANKLINE>
[[1, 0],
[0, 1]],
<BLANKLINE>
[[0, 1],
[1, 0]]])
Multiclass case:
>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
>>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
>>> multilabel_confusion_matrix(y_true, y_pred,
... labels=["ant", "bird", "cat"])
array([[[3, 1],
[0, 2]],
<BLANKLINE>
[[5, 0],
[1, 0]],
<BLANKLINE>
[[2, 1],
[1, 2]]])
sklearn.metrics.ndcg_score
References
Examples
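A minimal sketch with one toy query (inputs chosen here):
>>> import numpy as np
>>> from sklearn.metrics import ndcg_score
>>> true_relevance = np.asarray([[10, 0, 0, 1, 5]])
>>> scores = np.asarray([[.1, .2, .3, 4, 70]])
>>> ndcg_score(true_relevance, scores)
0.69...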
sklearn.metrics.precision_recall_curve
pos_label [int or str, default=None] The label of the positive class. When pos_label=None,
if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
precision [array, shape = [n_thresholds + 1]] Precision values such that element i is the preci-
sion of predictions with score >= thresholds[i] and the last element is 1.
recall [array, shape = [n_thresholds + 1]] Decreasing recall values such that element i is the
recall of predictions with score >= thresholds[i] and the last element is 0.
thresholds [array, shape = [n_thresholds <= len(np.unique(probas_pred))]] Increasing thresh-
olds on the decision function used to compute precision and recall.
See also:
Examples
• Precision-Recall
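An illustrative sketch on toy scores (chosen here; array formatting may differ slightly by NumPy version):
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
>>> precision
array([0.66666667, 0.5       , 1.        , 1.        ])
>>> recall
array([1. , 0.5, 0.5, 0. ])
>>> thresholds
array([0.35, 0.4 , 0.8 ])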
sklearn.metrics.precision_recall_fscore_support
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta
score reaches its best value at 1 and worst score at 0.
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and
precision are equally important.
The support is the number of occurrences of each class in y_true.
If pos_label is None and in binary classification, this function returns the average precision, recall and
F-measure if average is one of 'micro', 'macro', 'weighted' or 'samples'.
Read more in the User Guide.
Parameters
y_true [1d array-like, or label indicator array / sparse matrix] Ground truth (correct) target
values.
y_pred [1d array-like, or label indicator array / sparse matrix] Estimated targets as returned by
a classifier.
beta [float, 1.0 by default] The strength of recall versus precision in the F-score.
labels [list, optional] The set of labels to include when average != 'binary', and their
order if average is None. Labels present in the data can be excluded, for example to
calculate a multiclass average ignoring a majority negative class, while labels not present in
the data will result in 0 components in a macro average. For multilabel targets, labels are
column indices. By default, all labels in y_true and y_pred are used in sorted order.
pos_label [str or int, 1 by default] The class to report if average='binary' and the
data is binary. If the data are multiclass or multilabel, this will be ignored; setting
labels=[pos_label] and average != 'binary' will report scores for that la-
bel only.
average [string, [None (default), ‘binary’, ‘micro’, ‘macro’, ‘samples’, ‘weighted’]] If None,
the scores for each class are returned. Otherwise, this determines the type of averaging
performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable
only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives
and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not
take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support
(the number of true instances for each label). This alters ‘macro’ to account for label
imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful
for multilabel classification where this differs from accuracy_score).
warn_for [tuple or set, for internal use] This determines which warnings will be made in the
case that this function is being used to return only one of its metrics.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
zero_division [“warn”, 0 or 1, default=”warn”]
Sets the value to return when there is a zero division:
• recall: when there are no positive labels
Notes
When true positive + false positive == 0, precision is undefined; When true positive
+ false negative == 0, recall is undefined. In such cases, by default the metric will be set to 0,
as will f-score, and UndefinedMetricWarning will be raised. This behavior can be modified with
zero_division.
References
Examples
It is possible to compute per-label precisions, recalls, F1-scores and supports instead of averaging:
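A hedged sketch of the per-label form (toy labels chosen here; the exact display formatting may differ by NumPy version):
>>> import numpy as np
>>> from sklearn.metrics import precision_recall_fscore_support
>>> y_true = np.array(['cat', 'dog', 'pig', 'cat', 'dog', 'pig'])
>>> y_pred = np.array(['cat', 'pig', 'dog', 'cat', 'cat', 'dog'])
>>> precision_recall_fscore_support(y_true, y_pred, average=None,
...                                 labels=['pig', 'dog', 'cat'])
(array([0.        , 0.        , 0.66666667]),
 array([0., 0., 1.]), array([0. , 0. , 0.8]),
 array([2, 2, 2]))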
sklearn.metrics.precision_score
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of
false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is
negative.
The best value is 1 and the worst value is 0.
Read more in the User Guide.
Parameters
y_true [1d array-like, or label indicator array / sparse matrix] Ground truth (correct) target
values.
y_pred [1d array-like, or label indicator array / sparse matrix] Estimated targets as returned by
a classifier.
labels [list, optional] The set of labels to include when average != 'binary', and their
order if average is None. Labels present in the data can be excluded, for example to
calculate a multiclass average ignoring a majority negative class, while labels not present in
the data will result in 0 components in a macro average. For multilabel targets, labels are
column indices. By default, all labels in y_true and y_pred are used in sorted order.
Changed in version 0.17: parameter labels improved for multiclass problem.
pos_label [str or int, 1 by default] The class to report if average='binary' and the
data is binary. If the data are multiclass or multilabel, this will be ignored; setting
labels=[pos_label] and average != 'binary' will report scores for that la-
bel only.
average [string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]] This pa-
rameter is required for multiclass/multilabel targets. If None, the scores for each class are
returned. Otherwise, this determines the type of averaging performed on the data:
'binary': Only report results for the class specified by pos_label. This is applicable
only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives
and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not
take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support
(the number of true instances for each label). This alters ‘macro’ to account for label
imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful
for multilabel classification where this differs from accuracy_score).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
zero_division [“warn”, 0 or 1, default=”warn”] Sets the value to return when there is a zero
division. If set to “warn”, this acts as 0, but warnings are also raised.
Returns
precision [float (if average is not None) or array of float, shape = [n_unique_labels]] Precision
of the positive class in binary classification or weighted average of the precision of each
class for the multiclass task.
See also:
precision_recall_fscore_support, multilabel_confusion_matrix
Notes
Examples
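For orientation, a toy multiclass sketch (labels chosen here):
>>> from sklearn.metrics import precision_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> precision_score(y_true, y_pred, average='macro')
0.22...
>>> precision_score(y_true, y_pred, average='micro')
0.33...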
sklearn.metrics.recall_score
precision_recall_fscore_support, balanced_accuracy_score
multilabel_confusion_matrix
Notes
Examples
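A small illustrative sketch on toy multiclass labels (chosen here):
>>> from sklearn.metrics import recall_score
>>> y_true = [0, 1, 2, 0, 1, 2]
>>> y_pred = [0, 2, 1, 0, 0, 1]
>>> recall_score(y_true, y_pred, average='macro')
0.33...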
sklearn.metrics.roc_auc_score
References
Examples
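An illustrative binary sketch (toy scores chosen here):
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75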
sklearn.metrics.roc_curve
Notes
Since the thresholds are sorted from low to high values, they are reversed upon returning them to ensure they
correspond to both fpr and tpr, which are sorted in reversed order during their calculation.
References
[1], [2]
Examples
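A minimal sketch showing the three returned arrays on toy scores (chosen here; array formatting may differ slightly by NumPy version):
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> fpr
array([0. , 0. , 0.5, 0.5, 1. ])
>>> tpr
array([0. , 0.5, 0.5, 1. , 1. ])
>>> thresholds
array([1.8 , 0.8 , 0.4 , 0.35, 0.1 ])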
sklearn.metrics.zero_one_loss
Notes
In multilabel classification, the zero_one_loss function corresponds to the subset zero-one loss: for each sample,
the entire set of labels must be correctly predicted, otherwise the loss for that sample is equal to one.
Examples
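An illustrative sketch:
>>> from sklearn.metrics import zero_one_loss
>>> y_pred = [1, 2, 3, 4]
>>> y_true = [2, 2, 3, 4]
>>> zero_one_loss(y_true, y_pred)
0.25
>>> zero_one_loss(y_true, y_pred, normalize=False)
1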
See the Regression metrics section of the user guide for further details.
sklearn.metrics.explained_variance_score
Notes
Examples
sklearn.metrics.max_error
sklearn.metrics.max_error(y_true, y_pred)
max_error metric calculates the maximum residual error.
Read more in the User Guide.
Parameters
y_true [array-like of shape (n_samples,)] Ground truth (correct) target values.
y_pred [array-like of shape (n_samples,)] Estimated target values.
Returns
max_error [float] A positive floating point value (the best value is 0.0).
Examples
sklearn.metrics.mean_absolute_error
Examples
sklearn.metrics.mean_squared_error
Examples
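For orientation, a toy regression sketch (targets chosen here):
>>> from sklearn.metrics import mean_squared_error
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> mean_squared_error(y_true, y_pred)
0.375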
sklearn.metrics.mean_squared_log_error
Examples
sklearn.metrics.median_absolute_error
Parameters
y_true [array-like of shape = (n_samples) or (n_samples, n_outputs)] Ground truth (correct)
target values.
y_pred [array-like of shape = (n_samples) or (n_samples, n_outputs)] Estimated target values.
multioutput [{‘raw_values’, ‘uniform_average’} or array-like of shape (n_outputs,)] Defines
aggregating of multiple output values. Array-like value defines weights used to average
errors.
‘raw_values’ : Returns a full set of errors in case of multioutput input.
‘uniform_average’ : Errors of all outputs are averaged with uniform weight.
Returns
loss [float or ndarray of floats] If multioutput is ‘raw_values’, then mean absolute error is re-
turned for each output separately. If multioutput is ‘uniform_average’ or an ndarray of
weights, then the weighted average of all output errors is returned.
Examples
sklearn.metrics.r2_score
Notes
References
[1]
Examples
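A small illustrative sketch on toy targets (chosen here):
>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...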
sklearn.metrics.mean_poisson_deviance
Examples
sklearn.metrics.mean_gamma_deviance
Examples
sklearn.metrics.mean_tweedie_deviance
Examples
See the Multilabel ranking metrics section of the user guide for further details.
sklearn.metrics.coverage_error
References
[1]
sklearn.metrics.label_ranking_average_precision_score
y_score [array, shape = [n_samples, n_labels]] Target scores, can either be probability esti-
mates of the positive class, confidence values, or non-thresholded measure of decisions (as
returned by “decision_function” on some classifiers).
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
Returns
score [float]
Examples
sklearn.metrics.label_ranking_loss
References
[1]
See the Clustering performance evaluation section of the user guide for further details.
The sklearn.metrics.cluster submodule contains evaluation metrics for cluster analysis results. There are
two forms of evaluation:
• supervised, which uses ground truth class values for each sample.
• unsupervised, which does not and measures the ‘quality’ of the model itself.
sklearn.metrics.adjusted_mutual_info_score
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won’t change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score
value. This can be useful to measure the agreement of two independent label assignments strategies on the same
dataset when the real ground truth is not known.
References
[1], [2]
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
If classes members are completely split across different clusters, the assignment is totally incomplete, hence
the AMI is null:
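A hedged sketch of the two situations described above (toy labelings chosen here; results rounded for display):
>>> from sklearn.metrics import adjusted_mutual_info_score
>>> round(adjusted_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1]), 6)
1.0
>>> round(adjusted_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3]), 6)
0.0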
sklearn.metrics.adjusted_rand_score
sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)
Rand index adjusted for chance.
The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and
counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.
The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:
The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the
number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).
ARI is a symmetric measure:
adjusted_rand_score(a, b) == adjusted_rand_score(b, a)
References
[Hubert1985], [wk]
Examples
Labelings that assign all classes members to the same clusters are complete but not always pure, hence penalized:
ARI is symmetric, so labelings that have pure clusters with members coming from the same classes but unnec-
essary splits are penalized:
If classes members are completely split across different clusters, the assignment is totally incomplete, hence the
ARI is very low:
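For orientation, a couple of toy labelings (chosen here):
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])
0.57...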
sklearn.metrics.calinski_harabasz_score
sklearn.metrics.calinski_harabasz_score(X, labels)
Compute the Calinski and Harabasz score.
It is also known as the Variance Ratio Criterion.
The score is defined as ratio between the within-cluster dispersion and the between-cluster dispersion.
Read more in the User Guide.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data
points. Each row corresponds to a single data point.
labels [array-like, shape (n_samples,)] Predicted labels for each sample.
Returns
score [float] The resulting Calinski-Harabasz score.
References
[1]
sklearn.metrics.davies_bouldin_score
sklearn.metrics.davies_bouldin_score(X, labels)
Computes the Davies-Bouldin score.
The score is defined as the average similarity measure of each cluster with its most similar cluster, where
similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters which are farther
apart and less dispersed will result in a better score.
The minimum score is zero, with lower values indicating better clustering.
Read more in the User Guide.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data
points. Each row corresponds to a single data point.
labels [array-like, shape (n_samples,)] Predicted labels for each sample.
Returns
score [float] The resulting Davies-Bouldin score.
References
[1]
sklearn.metrics.completeness_score
sklearn.metrics.completeness_score(labels_true, labels_pred)
Completeness metric of a cluster labeling given a ground truth.
A clustering result satisfies completeness if all the data points that are members of a given class are elements of
the same cluster.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won’t change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the
homogeneity_score which will be different in general.
Read more in the User Guide.
Parameters
labels_true [int array, shape = [n_samples]] ground truth class labels to be used as a reference
labels_pred [array-like of shape (n_samples,)] cluster labels to evaluate
Returns
completeness [float] score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See also:
homogeneity_score
v_measure_score
References
[1]
Examples
Non-perfect labelings that assign all classes members to the same clusters are still complete:
>>> print(completeness_score([0, 0, 1, 1], [0, 0, 0, 0]))
1.0
>>> print(completeness_score([0, 1, 2, 3], [0, 0, 1, 1]))
0.999...
If classes members are split across different clusters, the assignment cannot be complete:
>>> print(completeness_score([0, 0, 1, 1], [0, 1, 0, 1]))
0.0
>>> print(completeness_score([0, 0, 0, 0], [0, 1, 2, 3]))
0.0
sklearn.metrics.cluster.contingency_matrix
Returns
contingency [{array-like, sparse}, shape=[n_classes_true, n_classes_pred]] Matrix 𝐶 such that
𝐶𝑖,𝑗 is the number of samples in true class 𝑖 and in predicted class 𝑗. If eps is None,
the dtype of this array will be integer. If eps is given, the dtype will be float. Will be a
scipy.sparse.csr_matrix if sparse=True.
sklearn.metrics.fowlkes_mallows_score
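The counting expression referenced in the next paragraph was lost in extraction; a hedged reconstruction of the standard Fowlkes-Mallows index is:

\text{FMI} = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}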
Where TP is the number of True Positives (i.e. the number of pairs of points that belong to the same clusters
in both labels_true and labels_pred), FP is the number of False Positives (i.e. the number of pairs of
points that belong to the same clusters in labels_true and not in labels_pred) and FN is the number
of False Negatives (i.e. the number of pairs of points that belong to the same clusters in labels_pred and not
in labels_true).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
Read more in the User Guide.
Parameters
labels_true [int array, shape = (n_samples,)] A clustering of the data into disjoint subsets.
labels_pred [array, shape = (n_samples, )] A clustering of the data into disjoint subsets.
sparse [bool] Compute contingency matrix internally with sparse matrix.
Returns
score [float] The resulting Fowlkes-Mallows score.
References
[1], [2]
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
If classes members are completely split across different clusters, the assignment is totally random, hence the
FMI is null:
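A hedged sketch of the two situations described above (toy labelings chosen here):
>>> from sklearn.metrics import fowlkes_mallows_score
>>> fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> fowlkes_mallows_score([0, 0, 1, 1], [0, 1, 0, 1])
0.0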
sklearn.metrics.homogeneity_completeness_v_measure
sklearn.metrics.homogeneity_completeness_v_measure(labels_true, labels_pred,
beta=1.0)
Compute the homogeneity and completeness and V-Measure scores at once.
Those metrics are based on normalized conditional entropy measures of the clustering labeling to evaluate given
the knowledge of a Ground Truth class labels of the same samples.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a
single class.
A clustering result satisfies completeness if all the data points that are members of a given class are elements of
the same cluster.
Both scores have positive values between 0.0 and 1.0, larger values being desirable.
Those 3 metrics are independent of the absolute values of the labels: a permutation of the class or cluster label
values won’t change the score values in any way.
V-Measure is furthermore symmetric: swapping labels_true and label_pred will give the
same score. This does not hold for homogeneity and completeness. V-Measure is identical to
normalized_mutual_info_score with the arithmetic averaging method.
Read more in the User Guide.
Parameters
labels_true [int array, shape = [n_samples]] ground truth class labels to be used as a reference
labels_pred [array-like of shape (n_samples,)] cluster labels to evaluate
beta [float] Ratio of weight attributed to homogeneity vs completeness. If beta is
greater than 1, completeness is weighted more strongly in the calculation. If beta is
less than 1, homogeneity is weighted more strongly.
Returns
homogeneity [float] score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling
completeness [float] score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
v_measure [float] harmonic mean of the first two
See also:
homogeneity_score
completeness_score
v_measure_score
sklearn.metrics.homogeneity_score
sklearn.metrics.homogeneity_score(labels_true, labels_pred)
Homogeneity metric of a cluster labeling given a ground truth.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a
single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won’t change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the
completeness_score which will be different in general.
Read more in the User Guide.
Parameters
labels_true [int array, shape = [n_samples]] ground truth class labels to be used as a reference
labels_pred [array-like of shape (n_samples,)] cluster labels to evaluate
Returns
homogeneity [float] score between 0.0 and 1.0. 1.0 stands for perfectly homogeneous labeling
See also:
completeness_score
v_measure_score
References
[1]
Examples
Non-perfect labelings that further split classes into more clusters can be perfectly homogeneous:
Clusters that include samples from different classes do not make for a homogeneous labeling:
sklearn.metrics.mutual_info_score
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won’t change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score
value. This can be useful to measure the agreement of two independent label assignments strategies on the same
dataset when the real ground truth is not known.
Read more in the User Guide.
Parameters
labels_true [int array, shape = [n_samples]] A clustering of the data into disjoint subsets.
labels_pred [int array-like of shape (n_samples,)] A clustering of the data into disjoint subsets.
contingency [{None, array, sparse matrix}, shape = [n_classes_true, n_classes_pred]] A con-
tingency matrix given by the contingency_matrix function. If value is None, it will
be computed, otherwise the given value is used, with labels_true and labels_pred
ignored.
Returns
mi [float] Mutual information, a non-negative value
See also:
Notes
sklearn.metrics.normalized_mutual_info_score
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
If classes members are completely split across different clusters, the assignment is totally incomplete, hence
the NMI is null:
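A hedged sketch matching the description above (toy labelings chosen here; results rounded for display):
>>> from sklearn.metrics import normalized_mutual_info_score
>>> round(normalized_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1]), 6)
1.0
>>> round(normalized_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3]), 6)
0.0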
sklearn.metrics.silhouette_score
References
[1], [2]
sklearn.metrics.silhouette_samples
References
[1], [2]
sklearn.metrics.v_measure_score
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values
won’t change the score value in any way.
This metric is furthermore symmetric: switching label_true with label_pred will return the same score
value. This can be useful to measure the agreement of two independent label assignments strategies on the same
dataset when the real ground truth is not known.
Read more in the User Guide.
Parameters
labels_true [int array, shape = [n_samples]] ground truth class labels to be used as a reference
labels_pred [array-like of shape (n_samples,)] cluster labels to evaluate
beta [float] Ratio of weight attributed to homogeneity vs completeness. If beta is
greater than 1, completeness is weighted more strongly in the calculation. If beta is
less than 1, homogeneity is weighted more strongly.
Returns
v_measure [float] score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
See also:
homogeneity_score
completeness_score
normalized_mutual_info_score
References
[1]
Examples
Perfect labelings are both homogeneous and complete, hence have score 1.0:
Labelings that assign all classes members to the same clusters are complete but not homogeneous, hence penal-
ized:
Labelings that have pure clusters with members coming from the same classes are homogeneous but unnecessary
splits harm completeness and thus penalize V-measure as well:
If classes members are completely split across different clusters, the assignment is totally incomplete, hence the
V-Measure is null:
Clusters that include samples from totally different classes totally destroy the homogeneity of the labeling,
hence:
See the Biclustering evaluation section of the user guide for further details.
sklearn.metrics.consensus_score
sklearn.metrics.consensus_score(a, b, similarity=’jaccard’)
The similarity of two sets of biclusters.
Similarity between individual biclusters is computed. Then the best matching between sets is found using the
Hungarian algorithm. The final score is the sum of similarities divided by the size of the larger set.
Read more in the User Guide.
Parameters
a [(rows, columns)] Tuple of row and column indicators for a set of biclusters.
b [(rows, columns)] Another set of biclusters like a.
similarity [string or function, optional, default: “jaccard”] May be the string “jaccard” to use
the Jaccard coefficient, or any function that takes four arguments, each of which is a 1d
indicator vector: (a_rows, a_columns, b_rows, b_columns).
References
• Hochreiter, Bodenhofer, et. al., 2010. FABIA: factor analysis for bicluster acquisition.
See the Pairwise metrics, Affinities and Kernels section of the user guide for further details.
sklearn.metrics.pairwise.additive_chi2_kernel
sklearn.metrics.pairwise.additive_chi2_kernel(X, Y=None)
Computes the additive chi-squared kernel between observations in X and Y
The chi-squared kernel is computed between each pair of rows in X and Y. X and Y have to be non-negative.
This kernel is most commonly applied to histograms.
The chi-squared kernel is given by:
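The kernel expression itself was lost in extraction; a hedged reconstruction of the standard additive chi-squared kernel is:

k(x, y) = -\sum_i \frac{(x_i - y_i)^2}{x_i + y_i}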
Notes
References
• Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification
of texture and object categories: A comprehensive study International Journal of Computer Vision 2007
https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf
sklearn.metrics.pairwise.chi2_kernel
References
• Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification
of texture and object categories: A comprehensive study International Journal of Computer Vision 2007
https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf
sklearn.metrics.pairwise.cosine_similarity
Parameters
X [ndarray or sparse array, shape: (n_samples_X, n_features)] Input data.
Y [ndarray or sparse array, shape: (n_samples_Y, n_features)] Input data. If None, the output
will be the pairwise similarities between all samples in X.
dense_output [boolean (optional), default True] Whether to return dense output even when the
input is sparse. If False, the output is sparse if both input arrays are sparse.
New in version 0.17: parameter dense_output for dense output.
Returns
kernel matrix [array] An array with shape (n_samples_X, n_samples_Y).
sklearn.metrics.pairwise.cosine_distances
sklearn.metrics.pairwise.cosine_distances(X, Y=None)
Compute cosine distance between samples in X and Y.
Cosine distance is defined as 1.0 minus the cosine similarity.
Read more in the User Guide.
Parameters
X [array_like, sparse matrix] with shape (n_samples_X, n_features).
Y [array_like, sparse matrix (optional)] with shape (n_samples_Y, n_features).
Returns
distance matrix [array] An array with shape (n_samples_X, n_samples_Y).
See also:
sklearn.metrics.pairwise.cosine_similarity
scipy.spatial.distance.cosine dense matrices only
sklearn.metrics.pairwise.distance_metrics
sklearn.metrics.pairwise.distance_metrics()
Valid metrics for pairwise_distances.
This function simply returns the valid pairwise distance metrics. It exists to allow for a description of the
mapping for each of the valid strings.
The valid distance metrics, and the function they map to, are:
metric Function
‘cityblock’ metrics.pairwise.manhattan_distances
‘cosine’ metrics.pairwise.cosine_distances
‘euclidean’ metrics.pairwise.euclidean_distances
‘haversine’ metrics.pairwise.haversine_distances
‘l1’ metrics.pairwise.manhattan_distances
‘l2’ metrics.pairwise.euclidean_distances
‘manhattan’ metrics.pairwise.manhattan_distances
‘nan_euclidean’ metrics.pairwise.nan_euclidean_distances
sklearn.metrics.pairwise.euclidean_distances
This formulation has two advantages over other ways of computing distances. First, it is computationally ef-
ficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then
dot(x, x) and/or dot(y, y) can be pre-computed.
However, this is not the most precise way of doing this computation, and the distance matrix returned by this
function may not be exactly symmetric as required by, e.g., scipy.spatial.distance functions.
Read more in the User Guide.
Parameters
X [{array-like, sparse matrix}, shape (n_samples_1, n_features)]
Y [{array-like, sparse matrix}, shape (n_samples_2, n_features)]
Y_norm_squared [array-like, shape (n_samples_2, ), optional] Pre-computed dot-products of
vectors in Y (e.g., (Y**2).sum(axis=1)) May be ignored in some cases, see the note
below.
squared [boolean, optional] Return squared Euclidean distances.
X_norm_squared [array-like of shape (n_samples,), optional] Pre-computed dot-products of
vectors in X (e.g., (X**2).sum(axis=1)) May be ignored in some cases, see the note
below.
Returns
distances [array, shape (n_samples_1, n_samples_2)]
See also:
Notes
To achieve better accuracy, X_norm_squared and Y_norm_squared may be unused if they are passed as
float32.
Examples
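An illustrative sketch on a tiny input (chosen here):
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = [[0, 1], [1, 1]]
>>> euclidean_distances(X, X)
array([[0., 1.],
       [1., 0.]])
>>> euclidean_distances(X, [[0, 0]])
array([[1.        ],
       [1.41421356]])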
sklearn.metrics.pairwise.haversine_distances
sklearn.metrics.pairwise.haversine_distances(X, Y=None)
Compute the Haversine distance between samples in X and Y
The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere.
The first coordinate of each point is assumed to be the latitude, the second is the longitude, given in radians. The
dimension of the data must be 2.
D(x, y) = 2 \arcsin\left(\sqrt{\sin^2\left(\frac{x_1 - y_1}{2}\right) + \cos(x_1)\cos(y_1)\sin^2\left(\frac{x_2 - y_2}{2}\right)}\right)
Parameters
X [array_like, shape (n_samples_1, 2)]
Y [array_like, shape (n_samples_2, 2), optional]
Returns
distance [{array}, shape (n_samples_1, n_samples_2)]
Notes
As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between
two points of the Earth surface, with a less than 1% error on average.
Examples
We want to calculate the distance between the Ezeiza Airport (Buenos Aires, Argentina) and the Charles de
Gaulle Airport (Paris, France)
>>> from sklearn.metrics.pairwise import haversine_distances
>>> from math import radians
>>> bsas = [-34.83333, -58.5166646]
>>> paris = [49.0083899664, 2.53844117956]
>>> bsas_in_radians = [radians(_) for _ in bsas]
>>> paris_in_radians = [radians(_) for _ in paris]
>>> result = haversine_distances([bsas_in_radians, paris_in_radians])
>>> result * 6371000/1000 # multiply by Earth radius to get kilometers
array([[ 0. , 11099.54035582],
[11099.54035582, 0. ]])
sklearn.metrics.pairwise.kernel_metrics
sklearn.metrics.pairwise.kernel_metrics()
Valid metrics for pairwise_kernels
This function simply returns the valid pairwise kernel metrics. It exists, however, to allow for a verbose
description of the mapping for each of the valid strings.
The valid kernel metrics, and the function they map to, are:
metric Function
‘additive_chi2’ sklearn.pairwise.additive_chi2_kernel
‘chi2’ sklearn.pairwise.chi2_kernel
‘linear’ sklearn.pairwise.linear_kernel
‘poly’ sklearn.pairwise.polynomial_kernel
‘polynomial’ sklearn.pairwise.polynomial_kernel
‘rbf’ sklearn.pairwise.rbf_kernel
‘laplacian’ sklearn.pairwise.laplacian_kernel
‘sigmoid’ sklearn.pairwise.sigmoid_kernel
‘cosine’ sklearn.pairwise.cosine_similarity
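As a brief sketch of how the mapping can be inspected (the names are those listed in the table above):
>>> from sklearn.metrics.pairwise import kernel_metrics
>>> kernels = kernel_metrics()  # dict mapping each valid string to its kernel function
>>> sorted(kernels)
['additive_chi2', 'chi2', 'cosine', 'laplacian', 'linear', 'poly', 'polynomial', 'rbf', 'sigmoid']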
sklearn.metrics.pairwise.laplacian_kernel
sklearn.metrics.pairwise.laplacian_kernel(X, Y=None, gamma=None)
Compute the laplacian kernel between X and Y.
The laplacian kernel is defined as:
K(x, y) = exp(-gamma ||x - y||_1)
for each pair of rows x in X and y in Y. Read more in the User Guide.
New in version 0.17.
Parameters
X [array of shape (n_samples_X, n_features)]
Y [array of shape (n_samples_Y, n_features)]
gamma [float, default None] If None, defaults to 1.0 / n_features
Returns
kernel_matrix [array of shape (n_samples_X, n_samples_Y)]
sklearn.metrics.pairwise.linear_kernel
sklearn.metrics.pairwise.manhattan_distances
Notes
When X and/or Y are CSR sparse matrices and they are not already in canonical format, this function modifies
them in-place to make them canonical.
Examples
sklearn.metrics.pairwise.nan_euclidean_distances
Compute the euclidean distance between each pair of samples in X and Y, where Y=X is assumed if Y=None.
When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a
missing value in either sample and scales up the weight of the remaining coordinates:
dist(x, y) = sqrt(weight * squared distance from present coordinates), where
weight = (total number of coordinates) / (number of present coordinates)
For example, the distance between [3, na, na, 6] and [1, na, 4, 5] is:
\sqrt{\frac{4}{2}\left((3 - 1)^2 + (6 - 5)^2\right)}
If all the coordinates are missing or if there are no common present coordinates then NaN is returned for that
pair.
Read more in the User Guide.
New in version 0.22.
Parameters
X [array-like, shape=(n_samples_1, n_features)]
Y [array-like, shape=(n_samples_2, n_features)]
squared [bool, default=False] Return squared Euclidean distances.
missing_values [np.nan or int, default=np.nan] Representation of missing value
copy [boolean, default=True] Make and use a deep copy of X and Y (if Y exists)
Returns
distances [array, shape (n_samples_1, n_samples_2)]
See also:
References
• John K. Dixon, “Pattern Recognition with Partly Missing Data”, IEEE Transactions on Systems, Man, and
Cybernetics, Volume: 9, Issue: 10, pp. 617 - 621, Oct. 1979. http://ieeexplore.ieee.org/abstract/document/
4310090/
Examples
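A small sketch reproducing the worked example above with one sample in X and one in Y (the value is
sqrt(10) from the weighting formula):
>>> import numpy as np
>>> from sklearn.metrics.pairwise import nan_euclidean_distances
>>> X = [[3, np.nan, np.nan, 6]]
>>> Y = [[1, np.nan, 4, 5]]
>>> nan_euclidean_distances(X, Y)  # sqrt(4/2 * ((3 - 1)**2 + (6 - 5)**2))
array([[3.16227766]])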
sklearn.metrics.pairwise.pairwise_kernels
Notes
sklearn.metrics.pairwise.polynomial_kernel
sklearn.metrics.pairwise.rbf_kernel
sklearn.metrics.pairwise.sigmoid_kernel
sklearn.metrics.pairwise.paired_euclidean_distances
sklearn.metrics.pairwise.paired_euclidean_distances(X, Y)
Computes the paired euclidean distances between X and Y
Read more in the User Guide.
Parameters
X [array-like, shape (n_samples, n_features)]
Y [array-like, shape (n_samples, n_features)]
Returns
distances [ndarray (n_samples, )]
sklearn.metrics.pairwise.paired_manhattan_distances
sklearn.metrics.pairwise.paired_manhattan_distances(X, Y)
Compute the L1 distances between the vectors in X and Y.
Read more in the User Guide.
Parameters
X [array-like, shape (n_samples, n_features)]
Y [array-like, shape (n_samples, n_features)]
Returns
distances [ndarray (n_samples, )]
sklearn.metrics.pairwise.paired_cosine_distances
sklearn.metrics.pairwise.paired_cosine_distances(X, Y)
Computes the paired cosine distances between X and Y
Read more in the User Guide.
Parameters
X [array-like, shape (n_samples, n_features)]
Y [array-like, shape (n_samples, n_features)]
Returns
distances [ndarray, shape (n_samples, )]
Notes
The cosine distance is equivalent to half the squared euclidean distance if each sample is normalized to unit
norm.
sklearn.metrics.pairwise.paired_distances
Examples
sklearn.metrics.pairwise_distances
pairwise_distances_chunked performs the same calculation as this function, but returns a generator
of chunks of the distance matrix, in order to limit memory usage.
paired_distances Computes the distances between corresponding elements of two arrays
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances
sklearn.metrics.pairwise_distances_argmin_min
sklearn.metrics.pairwise_distances_argmin_min
See also:
sklearn.metrics.pairwise_distances
sklearn.metrics.pairwise_distances_argmin
sklearn.metrics.pairwise_distances_chunked
D_chunk [array or sparse matrix] A contiguous slice of distance matrix, optionally processed
by reduce_func.
Examples
Retrieve all neighbors and the average distance within radius r. X is assumed here to be a small random
matrix, e.g. X = np.random.RandomState(0).rand(5, 3); the exact output depends on that choice:
>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances_chunked
>>> X = np.random.RandomState(0).rand(5, 3)
>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist
array([0.039..., 0. , 0. , 0.039..., 0. ])
7.24.8 Plotting
See the Visualizations section of the user guide for further details.
sklearn.metrics.plot_confusion_matrix
• Confusion matrix
sklearn.metrics.plot_precision_recall_curve
sklearn.metrics.plot_precision_recall_curve(estimator, X, y, sample_weight=None, response_method='auto', name=None, ax=None, **kwargs)
Plot Precision Recall Curve for binary classifiers.
Extra keyword arguments will be passed to matplotlib’s plot.
Read more in the User Guide.
Parameters
estimator [estimator instance] Trained classifier.
X [{array-like, sparse matrix} of shape (n_samples, n_features)] Input values.
y [array-like of shape (n_samples,)] Binary target values.
sample_weight [array-like of shape (n_samples,), default=None] Sample weights.
response_method [{‘predict_proba’, ‘decision_function’, ‘auto’}, default=’auto’] Specifies
whether to use predict_proba or decision_function as the target response. If set to ‘auto’,
predict_proba is tried first and if it does not exist decision_function is tried next.
name [str, default=None] Name for labeling curve. If None, the name of the estimator is used.
ax [matplotlib axes, default=None] Axes object to plot on. If None, a new figure and axes is
created.
**kwargs [dict] Keyword arguments to be passed to matplotlib’s plot.
Returns
display [PrecisionRecallDisplay] Object that stores computed values.
• Precision-Recall
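A minimal usage sketch; the synthetic dataset and logistic regression classifier below are chosen purely for
illustration (plotting requires matplotlib):
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import plot_precision_recall_curve
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> clf = LogisticRegression(random_state=0).fit(X_train, y_train)
>>> disp = plot_precision_recall_curve(clf, X_test, y_test)  # doctest: +SKIP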
sklearn.metrics.plot_roc_curve
Examples
sklearn.metrics.ConfusionMatrixDisplay
Methods
sklearn.metrics.PrecisionRecallDisplay
Methods
sklearn.metrics.RocCurveDisplay
Examples
Methods
7.25.1 sklearn.mixture.BayesianGaussianMixture
tol [float, defaults to 1e-3.] The convergence threshold. EM iterations will stop when the lower
bound average gain on the likelihood (of the training data with respect to the model) is below
this threshold.
reg_covar [float, defaults to 1e-6.] Non-negative regularization added to the diagonal of covariance.
Ensures that the covariance matrices are all positive.
max_iter [int, defaults to 100.] The number of EM iterations to perform.
n_init [int, defaults to 1.] The number of initializations to perform. The result with the highest
lower bound value on the likelihood is kept.
init_params [{‘kmeans’, ‘random’}, defaults to ‘kmeans’.] The method used to initialize the
weights, the means and the covariances. Must be one of:
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
precisions_ [array-like] The precision matrices for each component in the mixture. A preci-
sion matrix is the inverse of a covariance matrix. A covariance matrix is symmetric posi-
tive definite so the mixture of Gaussian can be equivalently parameterized by the precision
matrices. Storing the precision matrices instead of the covariance matrices makes it more
efficient to compute the log-likelihood of new samples at test time. The shape depends on
covariance_type:
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
converged_ [bool] True when convergence was reached in fit(), False otherwise.
n_iter_ [int] Number of steps used by the best fit of inference to reach convergence.
lower_bound_ [float] Lower bound value on the likelihood (of the training data with respect to
the model) of the best fit of inference.
weight_concentration_prior_ [tuple or float] The dirichlet concentration of each
component on the weight distribution (Dirichlet). The type depends on
weight_concentration_prior_type:
The higher concentration puts more mass in the center and will lead to more components
being active, while a lower concentration parameter will lead to more mass at the edge of
the simplex.
weight_concentration_ [array-like, shape (n_components,)] The dirichlet concentration of
each component on the weight distribution (Dirichlet).
mean_precision_prior_ [float] The precision prior on the mean distribution (Gaussian).
Controls the extent of where means can be placed. Larger values concentrate
the cluster means around mean_prior. If mean_precision_prior is set to None,
mean_precision_prior_ is set to 1.
mean_precision_ [array-like, shape (n_components,)] The precision of each component on
the mean distribution (Gaussian).
mean_prior_ [array-like, shape (n_features,)] The prior on the mean distribution (Gaussian).
degrees_of_freedom_prior_ [float] The prior of the number of degrees of freedom on the co-
variance distributions (Wishart).
degrees_of_freedom_ [array-like, shape (n_components,)] The number of degrees of freedom
of each component in the model.
covariance_prior_ [float or array-like] The prior on the covariance distribution (Wishart). The
shape depends on covariance_type:
See also:
References
Methods
fit(self, X, y=None)
Estimate model parameters with the EM algorithm.
The method fits the model n_init times and sets the parameters with which the model has the largest
likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for
max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a
ConvergenceWarning is raised. If warm_start is True, then n_init is ignored and a single
initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
self
fit_predict(self, X, y=None)
Estimate model parameters using X and predict the labels for X.
The method fits the model n_init times and sets the parameters with which the model has the largest
likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for
max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a
ConvergenceWarning is raised. After fitting, it predicts the most probable label for the input data
points.
New in version 0.20.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
labels [array, shape (n_samples,)] Component labels.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict the labels for the data samples in X using trained model.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
labels [array, shape (n_samples,)] Component labels.
predict_proba(self, X)
Predict posterior probability of each component given the data.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
resp [array, shape (n_samples, n_components)] The probability of each Gaussian
(state) in the model given each sample.
sample(self, n_samples=1)
Generate random samples from the fitted Gaussian distribution.
Parameters
n_samples [int, optional] Number of samples to generate. Defaults to 1.
Returns
X [array, shape (n_samples, n_features)] Randomly generated sample
y [array, shape (n_samples,)] Component labels
score(self, X, y=None)
Compute the per-sample average log-likelihood of the given data X.
Parameters
X [array-like, shape (n_samples, n_dimensions)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
log_likelihood [float] Log likelihood of the Gaussian mixture given X.
score_samples(self, X)
Compute the weighted log probabilities for each sample.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
log_prob [array, shape (n_samples,)] Log probabilities of each data point in X.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
7.25.2 sklearn.mixture.GaussianMixture
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
precisions_ [array-like] The precision matrices for each component in the mixture. A preci-
sion matrix is the inverse of a covariance matrix. A covariance matrix is symmetric posi-
tive definite so the mixture of Gaussian can be equivalently parameterized by the precision
matrices. Storing the precision matrices instead of the covariance matrices makes it more
efficient to compute the log-likelihood of new samples at test time. The shape depends on
covariance_type:
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
converged_ [bool] True when convergence was reached in fit(), False otherwise.
n_iter_ [int] Number of steps used by the best fit of EM to reach convergence.
lower_bound_ [float] Lower bound value on the log-likelihood (of the training data with re-
spect to the model) of the best fit of EM.
See also:
Methods
fit(self, X, y=None)
Estimate model parameters with the EM algorithm.
The method fits the model n_init times and sets the parameters with which the model has the
largest likelihood or lower bound. Within each trial, the method iterates between E-step and M-step
for max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a
ConvergenceWarning is raised. If warm_start is True, then n_init is ignored and a single
initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
self
fit_predict(self, X, y=None)
Estimate model parameters using X and predict the labels for X.
The method fits the model n_init times and sets the parameters with which the model has the largest
likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for
max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a
ConvergenceWarning is raised. After fitting, it predicts the most probable label for the input data
points.
New in version 0.20.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
labels [array, shape (n_samples,)] Component labels.
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
predict(self, X)
Predict the labels for the data samples in X using trained model.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
labels [array, shape (n_samples,)] Component labels.
predict_proba(self, X)
Predict posterior probability of each component given the data.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
resp [array, shape (n_samples, n_components)] The probability of each Gaussian
(state) in the model given each sample.
sample(self, n_samples=1)
Generate random samples from the fitted Gaussian distribution.
Parameters
n_samples [int, optional] Number of samples to generate. Defaults to 1.
Returns
X [array, shape (n_samples, n_features)] Randomly generated sample
y [array, shape (n_samples,)] Component labels
score(self, X, y=None)
Compute the per-sample average log-likelihood of the given data X.
Parameters
X [array-like, shape (n_samples, n_dimensions)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
log_likelihood [float] Log likelihood of the Gaussian mixture given X.
score_samples(self, X)
Compute the weighted log probabilities for each sample.
Parameters
X [array-like, shape (n_samples, n_features)] List of n_features-dimensional data points.
Each row corresponds to a single data point.
Returns
log_prob [array, shape (n_samples,)] Log probabilities of each data point in X.
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
• GMM covariances
• Gaussian Mixture Model Sine Curve
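A brief usage sketch for GaussianMixture on toy two-cluster data (the data and the checks below are chosen
purely for illustration): fit two components, confirm convergence, inspect the recovered centres, and draw new
samples from the fitted mixture.
>>> import numpy as np
>>> from sklearn.mixture import GaussianMixture
>>> X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
>>> gm = GaussianMixture(n_components=2, random_state=0).fit(X)
>>> gm.converged_
True
>>> np.sort(gm.means_[:, 0]).round()  # one centre near x=1, the other near x=10
array([ 1., 10.])
>>> X_new, y_new = gm.sample(5)       # draw 5 points from the fitted mixture
>>> X_new.shape, y_new.shape
((5, 2), (5,))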
User guide: See the Cross-validation: evaluating estimator performance, Tuning the hyper-parameters of an estima-
tor and Learning curve sections for further details.
sklearn.model_selection.GroupKFold
class sklearn.model_selection.GroupKFold(n_splits=5)
K-fold iterator variant with non-overlapping groups.
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to
the number of folds).
The folds are approximately balanced in the sense that the number of distinct groups is approximately the same
in each fold.
Parameters
n_splits [int, default=5] Number of folds. Must be at least 2.
Changed in version 0.22: n_splits default value changed from 3 to 5.
See also:
LeaveOneGroupOut For splitting the data according to explicit domain-specific stratification of the dataset.
Examples
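A short sketch with two groups and two folds (toy data chosen for illustration); each fold's test set contains
exactly one group:
>>> import numpy as np
>>> from sklearn.model_selection import GroupKFold
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([1, 2, 3, 4])
>>> groups = np.array([0, 0, 2, 2])
>>> group_kfold = GroupKFold(n_splits=2)
>>> group_kfold.get_n_splits(X, y, groups)
2
>>> for train_index, test_index in group_kfold.split(X, y, groups):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1] TEST: [2 3]
TRAIN: [2 3] TEST: [0 1]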
Methods
__init__(self, n_splits=5)
Initialize self. See help(type(self)) for accurate signature.
get_n_splits(self, X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
Parameters
X [object] Always ignored, exists for compatibility.
y [object] Always ignored, exists for compatibility.
groups [object] Always ignored, exists for compatibility.
Returns
n_splits [int] Returns the number of splitting iterations in the cross-validator.
split(self, X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like, shape (n_samples,), optional] The target variable for supervised learning prob-
lems.
groups [array-like, with shape (n_samples,)] Group labels for the samples used while split-
ting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.GroupShuffleSplit
Examples
Methods
groups [array-like, with shape (n_samples,)] Group labels for the samples used while split-
ting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results
identical by setting random_state to an integer.
sklearn.model_selection.KFold
StratifiedKFold Takes class information into account to avoid building folds with imbalanced class
distributions (for binary or multiclass classification tasks).
GroupKFold K-fold iterator variant with non-overlapping groups.
RepeatedKFold Repeats K-Fold n times.
Notes
The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have
size n_samples // n_splits, where n_samples is the number of samples.
Randomized CV splitters may return different results for each call of split. You can make the results identical
by setting random_state to an integer.
Examples
Methods
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like, shape (n_samples,)] The target variable for supervised learning problems.
groups [array-like, with shape (n_samples,), optional] Group labels for the samples used
while splitting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.LeaveOneGroupOut
class sklearn.model_selection.LeaveOneGroupOut
Leave One Group Out cross-validator
Provides train/test indices to split data according to a third-party provided group. This group information can be
used to encode arbitrary domain specific stratifications of the samples as integers.
For instance the groups could be the year of collection of the samples and thus allow for cross-validation against
time-based splits.
Read more in the User Guide.
Examples
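A short sketch (toy data chosen for illustration): with groups [1, 1, 2], one split holds out group 1 and the other
holds out group 2.
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneGroupOut
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([1, 2, 1])
>>> groups = np.array([1, 1, 2])
>>> logo = LeaveOneGroupOut()
>>> logo.get_n_splits(X, y, groups)
2
>>> for train_index, test_index in logo.split(X, y, groups):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [2] TEST: [0 1]
TRAIN: [0 1] TEST: [2]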
Methods
sklearn.model_selection.LeavePGroupsOut
class sklearn.model_selection.LeavePGroupsOut(n_groups)
Leave P Group(s) Out cross-validator
Provides train/test indices to split data according to a third-party provided group. This group information can be
used to encode arbitrary domain specific stratifications of the samples as integers.
For instance the groups could be the year of collection of the samples and thus allow for cross-validation against
time-based splits.
The difference between LeavePGroupsOut and LeaveOneGroupOut is that the former builds the test sets with
all the samples assigned to p different values of the groups, while the latter uses samples all assigned to the
same group.
Read more in the User Guide.
Parameters
n_groups [int] Number of groups (p) to leave out in the test split.
See also:
Examples
Methods
__init__(self, n_groups)
Initialize self. See help(type(self)) for accurate signature.
get_n_splits(self, X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
Parameters
X [object] Always ignored, exists for compatibility.
y [object] Always ignored, exists for compatibility.
groups [array-like, with shape (n_samples,)] Group labels for the samples used while split-
ting the dataset into train/test set. This ‘groups’ parameter must always be specified to
calculate the number of splits, though the other parameters can be omitted.
Returns
n_splits [int] Returns the number of splitting iterations in the cross-validator.
split(self, X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like, of length n_samples, optional] The target variable for supervised learning
problems.
groups [array-like, with shape (n_samples,)] Group labels for the samples used while split-
ting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.LeaveOneOut
class sklearn.model_selection.LeaveOneOut
Leave-One-Out cross-validator
Provides train/test indices to split data in train/test sets. Each sample is used once as a test set (singleton) while
the remaining samples form the training set.
Note: LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1) where n is the
number of samples.
Due to the high number of test sets (which is the same as the number of samples) this cross-validation method
can be very costly. For large datasets one should favor KFold, ShuffleSplit or StratifiedKFold.
Read more in the User Guide.
See also:
LeaveOneGroupOut For splitting the data according to explicit, domain-specific stratification of the dataset.
GroupKFold K-fold iterator variant with non-overlapping groups.
Examples
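A minimal sketch (two toy samples, chosen for illustration): each sample is held out once as the test set.
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneOut
>>> X = np.array([[1, 2], [3, 4]])
>>> loo = LeaveOneOut()
>>> loo.get_n_splits(X)
2
>>> for train_index, test_index in loo.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1] TEST: [0]
TRAIN: [0] TEST: [1]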
Methods
y [array-like, of length n_samples] The target variable for supervised learning problems.
groups [array-like, with shape (n_samples,), optional] Group labels for the samples used
while splitting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.LeavePOut
class sklearn.model_selection.LeavePOut(p)
Leave-P-Out cross-validator
Provides train/test indices to split data in train/test sets. This results in testing on all distinct samples of size p,
while the remaining n - p samples form the training set in each iteration.
Note: LeavePOut(p) is NOT equivalent to KFold(n_splits=n_samples // p) which creates non-
overlapping test sets.
Due to the high number of iterations which grows combinatorically with the number of samples this cross-
validation method can be very costly. For large datasets one should favor KFold, StratifiedKFold or
ShuffleSplit.
Read more in the User Guide.
Parameters
p [int] Size of the test sets. Must be strictly less than the number of samples.
Examples
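A minimal sketch (three toy samples, chosen for illustration): with p=2 every pair of samples is used once as
the test set, giving C(3, 2) = 3 splits.
>>> import numpy as np
>>> from sklearn.model_selection import LeavePOut
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> lpo = LeavePOut(p=2)
>>> lpo.get_n_splits(X)
3
>>> for train_index, test_index in lpo.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [2] TEST: [0 1]
TRAIN: [1] TEST: [0 2]
TRAIN: [0] TEST: [1 2]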
Methods
__init__(self, p)
Initialize self. See help(type(self)) for accurate signature.
get_n_splits(self, X, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
Parameters
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [object] Always ignored, exists for compatibility.
groups [object] Always ignored, exists for compatibility.
split(self, X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like, of length n_samples] The target variable for supervised learning problems.
groups [array-like, with shape (n_samples,), optional] Group labels for the samples used
while splitting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.PredefinedSplit
class sklearn.model_selection.PredefinedSplit(test_fold)
Predefined split cross-validator
Provides train/test indices to split data into train/test sets using a predefined scheme specified by the user with
the test_fold parameter.
Read more in the User Guide.
New in version 0.16.
Parameters
test_fold [array-like, shape (n_samples,)] The entry test_fold[i] represents the index of
the test set that sample i belongs to. It is possible to exclude sample i from any test set (i.e.
include sample i in every training set) by setting test_fold[i] equal to -1.
Examples
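A short sketch (toy data chosen for illustration): samples with test_fold 0 and 1 form two test sets, and the
sample marked -1 is kept in every training set.
>>> import numpy as np
>>> from sklearn.model_selection import PredefinedSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([0, 0, 1, 1])
>>> test_fold = [0, 1, -1, 1]
>>> ps = PredefinedSplit(test_fold)
>>> ps.get_n_splits()
2
>>> for train_index, test_index in ps.split():
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2] TEST: [1 3]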
Methods
__init__(self, test_fold)
Initialize self. See help(type(self)) for accurate signature.
get_n_splits(self, X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator
Parameters
X [object] Always ignored, exists for compatibility.
y [object] Always ignored, exists for compatibility.
groups [object] Always ignored, exists for compatibility.
Returns
n_splits [int] Returns the number of splitting iterations in the cross-validator.
split(self, X=None, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
X [object] Always ignored, exists for compatibility.
y [object] Always ignored, exists for compatibility.
groups [object] Always ignored, exists for compatibility.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.RepeatedKFold
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical
by setting random_state to an integer.
Examples
Methods
sklearn.model_selection.RepeatedStratifiedKFold
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical
by setting random_state to an integer.
Examples
Methods
X [array-like, shape (n_samples, n_features)] Training data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like, of length n_samples] The target variable for supervised learning problems.
groups [array-like, with shape (n_samples,), optional] Group labels for the samples used
while splitting the dataset into train/test set.
Yields
train [ndarray] The training set indices for that split.
test [ndarray] The testing set indices for that split.
sklearn.model_selection.ShuffleSplit
Examples
Methods
Notes
Randomized CV splitters may return different results for each call of split. You can make the results
identical by setting random_state to an integer.
sklearn.model_selection.StratifiedKFold
Notes
Examples
Methods
Notes
Randomized CV splitters may return different results for each call of split. You can make the results
identical by setting random_state to an integer.
sklearn.model_selection.StratifiedShuffleSplit
Examples
Methods
Notes
Randomized CV splitters may return different results for each call of split. You can make the results
identical by setting random_state to an integer.
sklearn.model_selection.TimeSeriesSplit
Notes
Examples
Methods
sklearn.model_selection.check_cv
sklearn.model_selection.train_test_split
sklearn.model_selection.train_test_split(*arrays, **options)
Split arrays or matrices into random train and test subsets.
Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)), applying it to the
input data so that the data can be split (and optionally subsampled) in a single call.
Read more in the User Guide.
Parameters
*arrays [sequence of indexables with same length / shape[0]] Allowed inputs are lists, numpy
arrays, scipy-sparse matrices or pandas dataframes.
test_size [float, int or None, optional (default=None)] If float, should be between 0.0 and 1.0
and represent the proportion of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the complement of the train
size. If train_size is also None, it will be set to 0.25.
train_size [float, int, or None, (default=None)] If float, should be between 0.0 and 1.0 and
represent the proportion of the dataset to include in the train split. If int, represents the
absolute number of train samples. If None, the value is automatically set to the complement
of the test size.
random_state [int, RandomState instance or None, optional (default=None)] If int, ran-
dom_state is the seed used by the random number generator; If RandomState instance,
random_state is the random number generator; If None, the random number generator is
the RandomState instance used by np.random.
shuffle [boolean, optional (default=True)] Whether or not to shuffle the data before splitting. If
shuffle=False then stratify must be None.
stratify [array-like or None (default=None)] If not None, data is split in a stratified fashion,
using this as the class labels.
Returns
splitting [list, length=2 * len(arrays)] List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else,
output type is the same as the input type.
Examples
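A minimal sketch (toy arrays chosen for illustration): with 5 samples and test_size=0.33, the test set gets
ceil(5 * 0.33) = 2 samples and the training set gets the remaining 3.
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
>>> X_train.shape, X_test.shape
((3, 2), (2, 2))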
sklearn.model_selection.GridSearchCV
NOTE that when using custom scorers, each scorer should return a single value. Metric
functions returning a list/array of values can be wrapped into multiple scorers that return
one value each.
See Specifying multiple metrics for evaluation for an example.
If None, the estimator’s score method is used.
n_jobs [int or None, optional (default=None)] Number of jobs to run in parallel. None means 1
unless in a joblib.parallel_backend context. -1 means using all processors. See
Glossary for more details.
pre_dispatch [int, or string, optional] Controls the number of jobs that get dispatched during
parallel execution. Reducing this number can be useful to avoid an explosion of memory
consumption when more jobs get dispatched than CPUs can process. This parameter can
be:
• None, in which case all the jobs are immediately created and spawned. Use this for
lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
• An int, giving the exact number of total jobs that are spawned
• A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
iid [boolean, default=False] If True, return the average score across folds, weighted by the num-
ber of samples in each test set. In this case, the data is assumed to be identically distributed
across the folds, and the loss minimized is the total loss per sample, and not the mean loss
across the folds.
Deprecated since version 0.22: Parameter iid is deprecated in 0.22 and will be removed in
0.24
cv [int, cross-validation generator or an iterable, optional] Determines the cross-validation split-
ting strategy. Possible inputs for cv are:
• None, to use the default 5-fold cross validation,
• integer, to specify the number of folds in a (Stratified)KFold,
• CV splitter,
• An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass,
StratifiedKFold is used. In all other cases, KFold is used.
Refer to the User Guide for the various cross-validation strategies that can be used here.
Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
refit [boolean, string, or callable, default=True] Refit an estimator using the best found param-
eters on the whole dataset.
For multiple metric evaluation, this needs to be a string denoting the scorer that would be
used to find the best parameters for refitting the estimator at the end.
Where there are considerations other than maximum score in choosing a best estima-
tor, refit can be set to a function which returns the selected best_index_ given
cv_results_. In that case, the best_estimator_ and best_params_ will be set
according to the returned best_index_ while the best_score_ attribute will not be
available.
The refitted estimator is made available at the best_estimator_ attribute and permits
using predict directly on this GridSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_, best_score_ and
best_params_ will only be available if refit is set and all of them will be determined
w.r.t this specific scorer.
See scoring parameter to know more about multiple metric evaluation.
Changed in version 0.20: Support for callable added.
verbose [integer] Controls the verbosity: the higher, the more messages.
error_score [‘raise’ or numeric] Value to assign to the score if an error occurs in estimator
fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning
is raised. This parameter does not affect the refit step, which will always raise the error.
Default is np.nan.
return_train_score [boolean, default=False] If False, the cv_results_ attribute will not
include training scores. Computing training scores is used to get insights on how differ-
ent parameter settings impact the overfitting/underfitting trade-off. However computing the
scores on the training set can be computationally expensive and is not strictly required to
select the parameters that yield the best generalization performance.
Attributes
cv_results_ [dict of numpy (masked) ndarrays] A dict with keys as column headers and values
as columns, that can be imported into a pandas DataFrame.
For instance, the table below
{
'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
mask = [False False False False]...)
'param_gamma': masked_array(data = [-- -- 0.1 0.2],
mask = [ True True False False]...),
'param_degree': masked_array(data = [2.0 3.0 -- --],
mask = [False False True True]...),
'split0_test_score' : [0.80, 0.70, 0.80, 0.93],
'split1_test_score' : [0.82, 0.50, 0.70, 0.78],
'mean_test_score' : [0.81, 0.60, 0.75, 0.85],
'std_test_score' : [0.01, 0.10, 0.05, 0.08],
'rank_test_score' : [2, 4, 3, 1],
'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
'mean_train_score' : [0.81, 0.74, 0.70, 0.90],
'std_train_score' : [0.01, 0.19, 0.00, 0.03],
'mean_fit_time' : [0.73, 0.63, 0.43, 0.49],
'std_fit_time' : [0.01, 0.02, 0.01, 0.01],
'mean_score_time' : [0.01, 0.06, 0.04, 0.04],
'std_score_time' : [0.00, 0.00, 0.00, 0.01],
'params' : [{'kernel': 'poly', 'degree': 2}, ...],
}
NOTE
The key 'params' is used to store a list of parameter settings dicts for all the parameter
candidates.
The mean_fit_time, std_fit_time, mean_score_time and
std_score_time are all in seconds.
For multi-metric evaluation, the scores for all the scorers are available in the
cv_results_ dict at the keys ending with that scorer’s name ('_<scorer_name>')
instead of '_score' shown above. (‘split0_test_precision’, ‘mean_train_precision’ etc.)
best_estimator_ [estimator] Estimator that was chosen by the search, i.e. estimator which
gave highest score (or smallest loss if specified) on the left out data. Not available if
refit=False.
See refit parameter for more information on allowed values.
best_score_ [float] Mean cross-validated score of the best_estimator
For multi-metric evaluation, this is present only if refit is specified.
This attribute is not available if refit is a function.
best_params_ [dict] Parameter setting that gave the best results on the hold out data.
For multi-metric evaluation, this is present only if refit is specified.
best_index_ [int] The index (of the cv_results_ arrays) which corresponds to the best can-
didate parameter setting.
The dict at search.cv_results_['params'][search.best_index_] gives the parameter
setting for the best model, that gives the highest mean score (search.best_score_).
For multi-metric evaluation, this is present only if refit is specified.
scorer_ [function or a dict] Scorer function used on the held out data to choose the best param-
eters for the model.
For multi-metric evaluation, this attribute holds the validated scoring dict which maps
the scorer key to the scorer callable.
n_splits_ [int] The number of cross-validation splits (folds/iterations).
refit_time_ [float] Seconds used for refitting the best model on the whole dataset.
This is present only if refit is not False.
See also:
Notes
The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed
in which case it is used instead.
If n_jobs was set to a value higher than one, the data is copied for each point in the grid (and not n_jobs
times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the
dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch.
Then, the memory is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2
* n_jobs.
Examples
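A brief usage sketch (the iris dataset and SVC parameter grid below are chosen purely for illustration): fit the
grid search and inspect the keys of the best parameter setting.
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> search = clf.fit(iris.data, iris.target)
>>> sorted(search.best_params_)
['C', 'kernel']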
Methods
decision_function(self, X)
Call decision_function on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports decision_function.
Parameters
X [indexable, length n_samples] Must fulfill the input assumptions of the underlying estimator.
fit(self, X, y=None, groups=None, **fit_params)
Run fit with all sets of parameters.
Parameters
X [array-like of shape (n_samples, n_features)] Training vector, where n_samples is the
number of samples and n_features is the number of features.
y [array-like of shape (n_samples, n_output) or (n_samples,), optional] Target relative to X
for classification or regression; None for unsupervised learning.
groups [array-like, with shape (n_samples,), optional] Group labels for the samples used
while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv
instance (e.g., GroupKFold).
**fit_params [dict of string -> object] Parameters passed to the fit method of the estima-
tor
get_params(self, deep=True)
Get parameters for this estimator.
Parameters
deep [bool, default=True] If True, will return the parameters for this estimator and contained
subobjects that are estimators.
Returns
params [mapping of string to any] Parameter names mapped to their values.
inverse_transform(self, Xt)
Call inverse_transform on the estimator with the best found params.
Only available if the underlying estimator implements inverse_transform and refit=True.
Parameters
Xt [indexable, length n_samples] Must fulfill the input assumptions of the underlying esti-
mator.
predict(self, X)
Call predict on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports predict.
Parameters
X [indexable, length n_samples] Must fulfill the input assumptions of the underlying esti-
mator.
predict_log_proba(self, X)
Call predict_log_proba on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports predict_log_proba.
Parameters
X [indexable, length n_samples] Must fulfill the input assumptions of the underlying esti-
mator.
predict_proba(self, X)
Call predict_proba on the estimator with the best found parameters.
Only available if refit=True and the underlying estimator supports predict_proba.
Parameters
X [indexable, length n_samples] Must fulfill the input assumptions of the underlying esti-
mator.
score(self, X, y=None)
Returns the score on the given data, if the estimator has been refit.
This uses the score defined by scoring where provided, and the best_estimator_.score method
otherwise.
Parameters
X [array-like of shape (n_samples, n_features)] Input data, where n_samples is the number
of samples and n_features is the number of features.
y [array-like of shape (n_samples, n_output) or (n_samples,), optional] Target relative to X
for classification or regression; None for unsupervised learning.
Returns
score [float]
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have
parameters of the form <component>__<parameter> so that it’s possible to update each component
of a nested object.
Parameters
**params [dict] Estimator parameters.
Returns
self [object] Estimator instance.
transform(self, X)
Call transform on the estimator with the best found parameters.
Only available if the underlying estimator supports transform and refit=True.
Parameters
X [indexable, length n_samples] Must fulfill the input assumptions of the underlying esti-
mator.
sklearn.model_selection.ParameterGrid
class sklearn.model_selection.ParameterGrid(param_grid)
Grid of parameters with a discrete number of values for each.
Can be used to iterate over parameter value combinations with the Python built-in function iter.
Examples
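A small sketch (toy grid chosen for illustration): the grid iterates over all combinations of the parameter values,
with keys taken in sorted order.
>>> from sklearn.model_selection import ParameterGrid
>>> param_grid = {'a': [1, 2], 'b': [True, False]}
>>> list(ParameterGrid(param_grid)) == (
...     [{'a': 1, 'b': True}, {'a': 1, 'b': False},
...      {'a': 2, 'b': True}, {'a': 2, 'b': False}])
True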