
An introduction to Machine Learning with Scikit-Learn

Gilles Louppe (@glouppe)

University of Liège

Prerequisites
In [1]:
# This is a Jupyter notebook, with executable Python code inside
42 / 2

Out[1]:
21.0

Materials available on GitHub

Requires a Python distribution with scientific packages (NumPy, SciPy, Scikit-Learn, Pandas)

See installation instructions in the README

In [2]:
# Global imports and settings

# Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.max_open_warning"] = -1

# Print options
import numpy as np
np.set_printoptions(precision=3)

# Slideshow
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {'width': 1440, 'height': 768, 'scroll': True, 'theme': 'simple'})

# Silence warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)

In [3]:
%%javascript
Reveal.addEventListener("slidechanged", function(event){ window.location.hash = "header"; });

Outline
Scikit-Learn and the scientific ecosystem in Python
Classification
Model evaluation and selection
Transformers, pipelines and feature unions
Beyond building classifiers
Summary

Scikit-Learn
Overview
Machine learning library written in Python
Simple and efficient, for both experts and non-experts
Classical, well-established machine learning algorithms
Shipped with documentation and examples
BSD 3 license

Community driven development


~20 core developers (mostly researchers)
500-1000 occasional contributors
All working publicly together on GitHub
Emphasis on keeping the project maintainable
Style consistency
Unit-test coverage
Documentation and examples
Code review
Mature and stable
Join us!

Python stack for data analysis


The open source Python ecosystem provides a standalone, versatile and powerful scientific working
environment, including: NumPy, SciPy, IPython, Matplotlib, Pandas, and many others...

Scikit-Learn builds upon NumPy and SciPy and complements this scientific environment with machine learning
algorithms;
By design, Scikit-Learn is non-intrusive, easy to use and easy to combine with other libraries;
Core algorithms are implemented in low-level languages.

Algorithms
Supervised learning:

Linear models (Ridge, Lasso, Elastic Net, ...)


Support Vector Machines
Tree-based methods (Random Forests, Bagging, GBRT, ...)
Nearest neighbors
Neural networks (basics)
Gaussian Processes
Feature selection

Unsupervised learning:

Clustering (KMeans, Ward, ...)


Matrix decomposition (PCA, ICA, ...)
Density estimation
Outlier detection

Model selection and evaluation:

Cross-validation
Grid-search
Lots of metrics

... and many more! (See our Reference)

Classification
Framework
Data comes as a finite learning set L = (X, y) where

Input samples are given as an array X of shape n_samples × n_features, taking their values in 𝒳;
Output values are given as an array y, taking symbolic values in 𝒴.

The goal of supervised classification is to build an estimator φ : 𝒳 ↦ 𝒴 minimizing

Err(φ) = E_{X,Y}{ℓ(Y, φ(X))}

where ℓ is a loss function, e.g., the zero-one loss for classification, ℓ_{0-1}(Y, Ŷ) = 1(Y ≠ Ŷ).
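
As a quick sanity check on the zero-one loss, the short sketch below uses sklearn.metrics.zero_one_loss (also used later in this notebook) on a hand-made example; the labels are invented purely for illustration.

In [ ]:
from sklearn.metrics import zero_one_loss

y_true = ["b", "r", "r", "b"]  # true classes
y_pred = ["b", "r", "b", "b"]  # predicted classes, with one mistake
print(zero_one_loss(y_true, y_pred))  # 0.25, i.e. 1 misclassified sample out of 4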

Applications
Classifying signal from background events;
Diagnosing disease from symptoms;
Recognising cats in pictures;
Identifying body parts with Kinect cameras;
...

Data
Input data = NumPy arrays or SciPy sparse matrices;
Algorithms are expressed using high-level operations defined on matrices or vectors (similar to MATLAB);
Leverage efficient low-level implementations;
Keep code short and readable.

In [4]:
# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, centers=20, random_state=123)
labels = ["b", "r"]
y = np.take(labels, (y < 10))
print(X)
print(y[:5])

[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
...
[-0.231 -1.608]
[-0.603 6.873]
[ 2.284 4.874]]
['r' 'r' 'b' 'r' 'b']

In [5]:
# X is a 2 dimensional array, with 1000 rows and 2 columns
print(X.shape)

# y is a vector of 1000 elements


print(y.shape)

(1000, 2)
(1000,)

In [6]:
# Rows and columns can be accessed with lists, slices or masks
print(X[[1, 2, 3]]) # rows 1, 2 and 3
print(X[:5]) # 5 first rows
print(X[500:510, 0]) # values at column 0 for rows 500 to 509
print(X[y == "b"][:5]) # 5 first rows for which y is "b"

[[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]]
[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]
[ 4.516 -2.881]]
[-4.438 -2.46 4.331 -7.921 1.57 0.565 4.996 4.758 -1.604 1.101]
[[-5.184 -1.253]
[ 4.516 -2.881]
[ 1.708 2.624]
[-0.526 8.96 ]
[-1.076 9.787]]

In [7]:
# Plot
plt.figure()
for label in labels:
    mask = (y == label)
    plt.scatter(X[mask, 0], X[mask, 1], c=label)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()

Loading external data


Numpy provides some simple tools for loading data from files (CSV, binary, etc);

For structured data, Pandas provides more advanced tools (CSV, JSON, Excel, HDF5, SQL, etc);
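
A minimal sketch of both routes (the file names features.csv / data.csv and the "label" column are hypothetical, not part of the original material):

In [ ]:
import numpy as np
import pandas as pd

# NumPy: a purely numeric, comma-separated file straight into a 2D array
X_raw = np.loadtxt("features.csv", delimiter=",")  # hypothetical file

# Pandas: a structured file with named columns, then split into X / y
df = pd.read_csv("data.csv")                # hypothetical file
X = df.drop(columns=["label"]).to_numpy()   # feature matrix
y = df["label"].to_numpy()                  # target vector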

A simple and unified API


All learning algorithms in scikit-learn share a uniform and limited API consisting of complementary interfaces:

an estimator interface for building and fitting models;


a predictor interface for making predictions;
a transformer interface for converting data.

Goal: enforce a simple and consistent API to make it trivial to swap or plug algorithms.
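
To illustrate the payoff, the sketch below (not from the original slides) swaps two classifiers without changing any other code; it reuses the X, y arrays generated above with make_blobs.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same fit/predict calls, different algorithms
for Model in (KNeighborsClassifier, DecisionTreeClassifier):
    clf = Model()
    clf.fit(X, y)
    print(Model.__name__, clf.predict(X[:3]))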

Estimators
In [8]:
class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

In [9]:
# Import the nearest neighbor class
from sklearn.neighbors import KNeighborsClassifier  # Change this to try something else

# Set hyper-parameters, for controlling algorithm


clf = KNeighborsClassifier(n_neighbors=5)

# Learn a model from training data


clf.fit(X, y)

Out[9]:
KNeighborsClassifier()

In [10]:
# Estimator state is stored in instance attributes
clf._tree

Out[10]:
<sklearn.neighbors._kd_tree.KDTree at 0x560760a44520>

Predictors
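By analogy with the Estimator sketch above, the predictor interface can be outlined as follows (an illustrative sketch in the same pseudo-code style, not code from the original slides): it adds predict and, for many classifiers, predict_proba.

In [ ]:
class Predictor(object):
    def predict(self, X):
        """Predict a target value for each sample in X."""
        # apply the state set by ``fit`` to X
        return y_pred

    def predict_proba(self, X):
        """Return class probability estimates (classifiers only)."""
        return p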
In [11]:
# Make predictions
print(clf.predict(X[:5]))

['r' 'r' 'r' 'b' 'b']

In [12]:
# Compute (approximate) class probabilities
print(clf.predict_proba(X[:5]))

[[0. 1. ]
[0. 1. ]
[0.2 0.8]
[0.6 0.4]
[0.8 0.2]]

In [13]:
#Mount your google drive as follows:
from google.colab import drive
drive.mount('/content/mydir')

Drive already mounted at /content/mydir; to attempt to forcibly remount, call drive.mount("/content/mydir", force_remount=True).

In [14]:
# Query name of current folder
import os
folder_name=os.getcwd()
print(folder_name)

/content

In [15]:
# Goto the Colab folder
os.chdir('/content/mydir/MyDrive/Victor/MachineLearningCourseExercises')
folder_name=os.getcwd()
print(folder_name)

/content/mydir/MyDrive/Victor/MachineLearningCourseExercises

In [16]:
ls

1_An_introduction_to_Machine_Learning_with_Scikit_Learn.ipynb Colabtutorial/
Assignment1_HammettNeuralNetwork/ __pycache__/
Assignment2_RegressiveLinearModelsForChemistryPrediction/ robustness.py
Assignment3_UnsupervisedLearning/ tutorial.py

In [17]:
from tutorial import plot_surface
plot_surface(clf, X, y)

In [18]:
from tutorial import plot_histogram
plot_histogram(clf, X, y)

Classifier zoo
Decision trees
Idea: greedily build a partition of the input space using cuts orthogonal to feature axes.

In [19]:
from tutorial import plot_clf
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)

Random Forests
Idea: Build several decision trees with controlled randomness and average their decisions.

In [20]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=500)
# from sklearn.ensemble import ExtraTreesClassifier
# clf = ExtraTreesClassifier(n_estimators=500)
clf.fit(X, y)
plot_clf(clf, X, y)

Logistic regression
Idea: model the decision boundary as a hyperplane.

In [21]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
plot_clf(clf, X, y)

Support vector machines


Idea: Find the hyperplane which has the largest distance to the nearest training points of any class.

In [22]:
from sklearn.svm import SVC
clf = SVC(kernel="linear") # try kernel="rbf" instead
clf.fit(X, y)
plot_clf(clf, X, y)

Multi-layer perceptron
Idea: a multi-layer perceptron is a circuit of non-linear combinations of the data.

In [23]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100), activation="relu", learning_rate="invscaling")
clf.fit(X, y)
plot_clf(clf, X, y)

Gaussian Processes
Idea: a Gaussian process is a distribution over functions f, such that f(x), for any set x of points, is Gaussian distributed.

In [24]:
from sklearn.gaussian_process import GaussianProcessClassifier
clf = GaussianProcessClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)

Model evaluation and selection


Evaluation
Recall that we want to learn an estimator φ minimizing the generalization error Err(φ) = E_{X,Y}{ℓ(Y, φ(X))}.

Problem: Since P_{X,Y} is unknown, the generalization error Err(φ) cannot be evaluated.

Solution: Use a proxy to approximate Err(φ).

Training error
In [25]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
print("Training error =", zero_one_loss(y, clf.predict(X)))

Training error = 0.0

Test error
Issue: the training error is a biased estimate of the generalization error.

Solution: Divide L into two disjoint parts called training and test sets (usually using 70% for training and 30% for test).

Use the training set for fitting the model;


Use the test set for evaluation only, thereby yielding an unbiased estimate.

In [26]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Training error =", zero_one_loss(y_train, clf.predict(X_train)))
print("Test error =", zero_one_loss(y_test, clf.predict(X_test)))

Training error = 0.10399999999999998
Test error = 0.17600000000000005

Summary: Beware of bias when you estimate model performance:

Training score is often an optimistic estimate of the true performance;


The same data should not be used both for training and evaluation.

Cross-validation
Issue:

When L is small, training on 70% of the data may lead to a model that is significantly different from a model that would have been learned on the entire set L.
Yet, increasing the size of the training set (resp. decreasing the size of the test set) might lead to an inaccurate estimate of the generalization error.

Solution: K-Fold cross-validation.

Split L into K small disjoint folds.


Train on K-1 folds, evaluate the test error on the held-out fold.
Repeat for each of the K folds and average the K estimates of the generalization error.

![](img/cross-validation.png)

In [27]:
from sklearn.model_selection import KFold

scores = []

for train, test in KFold(n_splits=5, random_state=None).split(X):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    scores.append(zero_one_loss(y_test, clf.predict(X_test)))

print("CV error = %f +-%f" % (np.mean(scores), np.std(scores)))

CV error = 0.163000 +-0.010770

In [28]:
# Shortcut
from sklearn.model_selection import cross_val_score
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
cv=KFold(n_splits=5, random_state=None),
scoring="accuracy")
print("CV error = %f +-%f" % (1. - np.mean(scores), np.std(scores)))

CV error = 0.163000 +-0.010770

Metrics
Default score
Estimators come with a built-in default evaluation score

Accuracy for classification


R2 score for regression

In [29]:
y_train = (y_train == "r")
y_test = (y_test == "r")
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Default score =", clf.score(X_test, y_test))

Default score = 0.84

Accuracy
Definition: The accuracy is the proportion of correct predictions.

In [30]:
from sklearn.metrics import accuracy_score
print("Accuracy =", accuracy_score(y_test, clf.predict(X_test)))

Accuracy = 0.84

Precision, recall and F-measure


Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F = 2 · Precision · Recall / (Precision + Recall)

In [31]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
print("Precision =", precision_score(y_test, clf.predict(X_test)))
print("Recall =", recall_score(y_test, clf.predict(X_test)))
print("F =", fbeta_score(y_test, clf.predict(X_test), beta=1))

Precision = 0.8118811881188119
Recall = 0.8631578947368421
F = 0.8367346938775511

ROC AUC
Definition: Area under the curve of the true positive rate (TPR) plotted against the false positive rate (FPR) as the decision threshold of the classifier is varied.

In [32]:
from sklearn.metrics import get_scorer
roc_auc_scorer = get_scorer("roc_auc")
print("ROC AUC =", roc_auc_scorer(clf, X_test, y_test))
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()

ROC AUC = 0.9297744360902256

Confusion matrix
Definition: number of samples of class i predicted as class j.

In [33]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, clf.predict(X_test))

Out[33]:
array([[86, 19],
       [13, 82]])

Model selection
Finding good hyper-parameters is crucial to control under- and over-fitting, hence achieving better performance.
The estimated generalization error can be used to select the best model.

Under- and over-fitting


Under-fitting: the model is too simple and does not capture the true relation between X and Y.
Over-fitting: the model is too specific to the training set and does not generalize.

In [34]:
from sklearn.model_selection import validation_curve

# Evaluate parameter range in CV


param_range = range(2, 200)
param_name = "max_leaf_nodes"

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name=param_name,
    param_range=param_range, cv=5, n_jobs=-1)

train_scores_mean = np.mean(train_scores, axis=1)


train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# Plot parameter VS estimated error


plt.xlabel(param_name)
plt.ylabel("score")
plt.xlim(min(param_range), max(param_range))
plt.plot(param_range, 1. - train_scores_mean, color="red", label="Training error")
plt.fill_between(param_range,
1. - train_scores_mean + train_scores_std,
1. - train_scores_mean - train_scores_std,
alpha=0.2, color="red")
plt.plot(param_range, 1. - test_scores_mean, color="blue", label="CV error")
plt.fill_between(param_range,
1. - test_scores_mean + test_scores_std,
1. - test_scores_mean - test_scores_std,
alpha=0.2, color="blue")
plt.legend(loc="best")

Out[34]:
<matplotlib.legend.Legend at 0x7f866bd5e210>

In [35]:
# Best trade-off
print("%s = %d, CV error = %f" % (param_name,
param_range[np.argmax(test_scores_mean)],
1. - np.max(test_scores_mean)))

max_leaf_nodes = 63, CV error = 0.176000


Question: Where is the model under-fitting and over-fitting?

Question: What does it mean if the training error is different from the test error?

Hyper-parameter search
In [36]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KNeighborsClassifier(),
param_grid={"n_neighbors": list(range(1, 100))},
scoring="accuracy",
cv=5, n_jobs=-1)
grid.fit(X, y) # Note that GridSearchCV is itself an estimator

print("Best score = %f, Best parameters = %s" % (1. - grid.best_score_,


grid.best_params_))

Best score = 0.131000, Best parameters = {'n_neighbors': 34}


Question: Should you report the best score as an estimate of the generalization error of the model?

Selection and evaluation, simultaneously


grid.best_score_ is not independent of the best model, since its construction was guided by the optimization of this quantity.

As a result, the optimized grid.best_score_ may in fact be a biased, optimistic estimate of the true performance of the model.

Solution: Use nested cross-validation for correctly selecting the model and correctly evaluating its performance.

In [37]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

scores = cross_val_score(
GridSearchCV(KNeighborsClassifier(),
param_grid={"n_neighbors": list(range(1, 100))},
scoring="accuracy",
cv=5, n_jobs=-1),
X, y, cv=5, scoring="accuracy")

# Unbiased estimate of the accuracy


print("%f +-%f" % (1. - np.mean(scores), np.std(scores)))

0.144000 +-0.023958

Transformers, pipelines and feature unions


Transformers
Classification (or regression) is often only one step (typically the last) of a long and complicated process;
In most cases, input data needs to be cleaned, massaged or extended before being fed to a learning algorithm;
For this purpose, Scikit-Learn provides the transformer API.

In [38]:
class Transformer(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def transform(self, X):
        """Transform X into Xt."""
        # transform X in some way to produce Xt
        return Xt

    # Shortcut
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        Xt = self.transform(X)
        return Xt

Transformer zoo
In [39]:
# Load digits data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plot
sample_id = 42
plt.imshow(X[sample_id].reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.title("y = %d" % y[sample_id])
plt.show()

Scalers and other normalizers


In [40]:
from sklearn.preprocessing import StandardScaler
tf = StandardScaler()
tf.fit(X_train, y_train)
Xt_train = tf.transform(X_train)
print("Mean (before scaling) =", np.mean(X_train))
print("Mean (after scaling) =", np.mean(Xt_train))

# Shortcut: Xt = tf.fit_transform(X)
# See also Binarizer, MinMaxScaler, Normalizer, ...

Mean (before scaling) = 4.8921213808463255
Mean (after scaling) = -2.307813265739004e-18

In [41]:
# Scaling is critical for some algorithms
from sklearn.svm import SVC
clf = SVC()
print("Without scaling =", clf.fit(X_train, y_train).score(X_test, y_test))
print("With scaling =", clf.fit(tf.transform(X_train), y_train).score(tf.transform(X_test), y_test))

Without scaling = 0.9911111111111112
With scaling = 0.9844444444444445

Feature selection
In [42]:
# Select the 10 top features, as ranked using ANOVA F-score
from sklearn.feature_selection import SelectKBest, f_classif
tf = SelectKBest(score_func=f_classif, k=10)
Xt = tf.fit_transform(X_train, y_train)
print("Shape =", Xt.shape)

# Plot support
plt.imshow(tf.get_support().reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.show()

Shape = (1347, 10)

Feature selection (cont.)


In [43]:
# Feature selection using backward elimination
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
tf = RFE(RandomForestClassifier(), n_features_to_select=10, verbose=1)
Xt = tf.fit_transform(X_train, y_train)
print("Shape =", Xt.shape)

# Plot support
plt.imshow(tf.get_support().reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.show()

Fitting estimator with 64 features.


Fitting estimator with 63 features.
Fitting estimator with 62 features.
Fitting estimator with 61 features.
Fitting estimator with 60 features.
Fitting estimator with 59 features.
Fitting estimator with 58 features.
Fitting estimator with 57 features.
Fitting estimator with 56 features.
Fitting estimator with 55 features.
Fitting estimator with 54 features.
Fitting estimator with 53 features.
Fitting estimator with 52 features.
Fitting estimator with 51 features.
Fitting estimator with 50 features.
Fitting estimator with 49 features.
Fitting estimator with 48 features.
Fitting estimator with 47 features.
Fitting estimator with 46 features.
Fitting estimator with 45 features.
Fitting estimator with 44 features.
Fitting estimator with 43 features.
Fitting estimator with 42 features.
Fitting estimator with 41 features.
Fitting estimator with 40 features.
Fitting estimator with 39 features.
Fitting estimator with 38 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Shape = (1347, 10)

Decomposition, factorization or embeddings


In [44]:
# Compute decomposition
from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
tf = PCA(n_components=2)
Xt_train = tf.fit_transform(X_train)

# Plot
plt.scatter(Xt_train[:, 0], Xt_train[:, 1], c=y_train)
plt.show()

# See also: KernelPCA, NMF, FastICA, Kernel approximations,


# manifold learning, etc

Function transformer
In [45]:
from sklearn.preprocessing import FunctionTransformer

def increment(X):
    return X + 1

tf = FunctionTransformer(func=increment)
Xt = tf.fit_transform(X)
print(X[0])
print(Xt[0])

[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.
15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.
0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.
0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
[ 1. 1. 6. 14. 10. 2. 1. 1. 1. 1. 14. 16. 11. 16. 6. 1. 1. 4.
16. 3. 1. 12. 9. 1. 1. 5. 13. 1. 1. 9. 9. 1. 1. 6. 9. 1.
1. 10. 9. 1. 1. 5. 12. 1. 2. 13. 8. 1. 1. 3. 15. 6. 11. 13.
1. 1. 1. 1. 7. 14. 11. 1. 1. 1.]

Pipelines
Transformers can be chained in sequence to form a pipeline.

In [46]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Chain transformers to build a new transformer


tf = make_pipeline(StandardScaler(),
SelectKBest(score_func=f_classif, k=10))
tf.fit(X_train, y_train)

Out[46]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectkbest', SelectKBest())])

In [47]:
Xt_train = tf.transform(X_train)
print("Mean =", np.mean(Xt_train))
print("Shape =", Xt_train.shape)

Mean = -1.3715004550677509e-17
Shape = (1347, 10)

In [48]:
# Chain transformers + a classifier to build a new classifier
clf = make_pipeline(StandardScaler(),
SelectKBest(score_func=f_classif, k=10),
RandomForestClassifier())
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])

[[0. 0. 0.84 0. 0. 0.04 0. 0.1 0.02 0. ]
 [0. 0. 0.01 0. 0. 0. 0.02 0.02 0.94 0.01]
 [0. 0.07 0.8 0.04 0. 0. 0. 0.05 0.04 0. ]
 [0. 0. 0. 0. 0.03 0. 0.96 0. 0.01 0. ]
 [0. 0. 0. 0. 0. 0.01 0.99 0. 0. 0. ]]

In [49]:
# Hyper-parameters can be accessed using step names
print("K =", clf.get_params()["selectkbest__k"])

K = 10

In [50]:
clf.named_steps

Out[50]:
{'randomforestclassifier': RandomForestClassifier(),
 'selectkbest': SelectKBest(),
 'standardscaler': StandardScaler()}

In [51]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(clf,
param_grid={"selectkbest__k": [1, 10, 20, 30, 40, 50],
"randomforestclassifier__max_features": [0.1, 0.25, 0.5]})
grid.fit(X_train, y_train)

print("Best params =", grid.best_params_)

Best params = {'randomforestclassifier__max_features': 0.1, 'selectkbest__k': 50}

Feature unions
Similarly, transformers can be applied in parallel and their outputs concatenated into a single feature representation, as sketched below.
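
A minimal sketch on the digits data used above (the choice of 5 principal components and k=10 is illustrative): make_union applies both transformers to the same input and concatenates their outputs column-wise.

In [ ]:
from sklearn.pipeline import make_union
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Apply PCA and univariate selection in parallel and concatenate the results:
# 5 principal components + 10 selected pixels = 15 output columns
union = make_union(PCA(n_components=5),
                   SelectKBest(score_func=f_classif, k=10))
Xt_train = union.fit_transform(X_train, y_train)
print("Shape =", Xt_train.shape)  # (1347, 15)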

Nested composition
Since pipelines and unions are themselves estimators, they can be composed into nested structures.

In [52]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_union
from sklearn.decomposition import PCA

clf = make_pipeline(
    # Build features
    make_union(
        FunctionTransformer(func=lambda X: X),  # Identity
        PCA(),
    ),
    # Select the best features
    RFE(RandomForestClassifier(), n_features_to_select=10),
    # Train
    MLPClassifier()
)

clf.fit(X_train, y_train)

Out[52]:
Pipeline(steps=[('featureunion',
                 FeatureUnion(transformer_list=[('functiontransformer',
                                                 FunctionTransformer(func=<function <lambda> at 0x7f867180ba70>)),
                                                ('pca', PCA())])),
                ('rfe',
                 RFE(estimator=RandomForestClassifier(),
                     n_features_to_select=10)),
                ('mlpclassifier', MLPClassifier())])

Beyond building classifiers


(Quantile) Regression
Clustering
Density estimation
Feature learning
Outlier detection
...
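
As one example beyond classification, clustering follows the same fit/predict pattern; the sketch below (data generated only for this illustration) groups synthetic blobs with KMeans.

In [ ]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=500, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X_blobs)
print(km.cluster_centers_)      # one centroid per cluster
print(km.predict(X_blobs[:5]))  # cluster assignments for the first samples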

Example: Kernel Density estimation


In [53]:
# Load the data
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data

In [54]:
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

# Project the 64-dimensional data to a lower dimension


pca = PCA(n_components=15, whiten=False)
X = pca.fit_transform(X)

# Use grid search cross-validation to optimize the bandwidth


params = {'bandwidth': np.logspace(-1, 1, 100)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(X)

print("Best bandwidth: %.2f" % grid.best_estimator_.bandwidth)

Best bandwidth: 3.59

In [55]:
# Use the best estimator to compute the kernel density estimate
kde = grid.best_estimator_

# Sample 44 new points from the data


new_data = kde.sample(44, random_state=0)
new_data = pca.inverse_transform(new_data)

In [56]:
# Turn data into a 4x11 grid
new_data = new_data.reshape((4, 11, -1))
real_data = digits.data[:44].reshape((4, 11, -1))

# Plot real digits and resampled digits


fig, ax = plt.subplots(9, 11, subplot_kw=dict(xticks=[], yticks=[]))
for j in range(11):
    ax[4, j].set_visible(False)
    for i in range(4):
        im = ax[i, j].imshow(real_data[i, j].reshape((8, 8)),
                             cmap=plt.cm.binary, interpolation='nearest')
        im.set_clim(0, 16)
        im = ax[i + 5, j].imshow(new_data[i, j].reshape((8, 8)),
                                 cmap=plt.cm.binary, interpolation='nearest')
        im.set_clim(0, 16)

ax[0, 5].set_title('Selection from the input data')
ax[5, 5].set_title('"New" digits drawn from the kernel density model')
plt.show()

Summary
Scikit-Learn provides essential tools for machine learning.
It is more than training classifiers!
It integrates within a larger Python scientific ecosystem.
Try it for yourself!

In [57]:
questions?

Object `questions` not found.
