1 An Introduction To Machine Learning With Scikit Learn
University of Liège
Prerequisites
In [1]:
# This is a Jupyter notebook, with executable Python code inside
42 / 2
Out[1]:
21.0
Requires a Python distribution with scientific packages (NumPy, SciPy, Scikit-Learn, Pandas)
In [2]:
# Global imports and settings
# Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.max_open_warning"] = -1
# Print options
import numpy as np
np.set_printoptions(precision=3)
# Slideshow
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {'width': 1440, 'height': 768, 'scroll': True, 'theme': 'simple'})
# Silence warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
In [3]:
%%javascript
Reveal.addEventListener("slidechanged", function(event){ window.location.hash = "header"; });
Outline
Scikit-Learn and the scientific ecosystem in Python
Classification
Model evaluation and selection
Transformers, pipelines and feature unions
Beyond building classifiers
Summary
Scikit-Learn
Overview
Machine learning library written in Python
Simple and efficient, for both experts and non-experts
Classical, well-established machine learning algorithms
Shipped with documentation and examples
BSD 3 license
Scikit-Learn builds upon NumPy and SciPy and complements this scientific environment with machine learning algorithms;
By design, Scikit-Learn is non-intrusive, easy to use and easy to combine with other libraries (see the minimal example after this list);
Core algorithms are implemented in low-level languages.
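To make the consistent-API point concrete, here is a minimal sketch of the canonical fit/predict workflow; the dataset and estimator below are illustrative choices, not taken from the original slides.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Every estimator exposes the same fit / predict / score interface
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)        # learn a model from training data
print(clf.predict(X_test[:5]))   # predict on unseen data
print(clf.score(X_test, y_test)) # mean accuracy on the test set
Run it as a standalone snippet; the cells below construct their own X and y.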
Algorithms
Supervised learning:
Unsupervised learning:
Cross-validation
Grid-search
Lots of metrics
Classification
Framework
Data comes as a finite learning set L = (X, y) where
Input samples are given as an array X of shape n_samples × n_features, taking their values in 𝒳;
Output values are given as an array y, taking symbolic values in 𝒴.
The goal of classification is to learn an estimator φ : 𝒳 → 𝒴 minimizing the generalization error Err(φ) = E_{X,Y}[ℓ(Y, φ(X))], where ℓ is a loss function, e.g., the zero-one loss for classification ℓ_01(Y, Ŷ) = 1(Y ≠ Ŷ).
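For concreteness, a minimal sketch (not from the original slides) of the empirical zero-one loss on illustrative arrays:
import numpy as np
y_true = np.array(["b", "r", "r", "b"])
y_pred = np.array(["b", "b", "r", "b"])
# Empirical zero-one loss = fraction of misclassified samples
print(np.mean(y_true != y_pred))  # 0.25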
Applications
Classifying signal from background events;
Diagnosing disease from symptoms;
Recognising cats in pictures;
Identifying body parts with Kinect cameras;
...
Data
Input data = NumPy arrays or SciPy sparse matrices;
Algorithms are expressed using high-level operations defined on matrices or vectors (similar to MATLAB);
Leverage efficient low-level implementations;
Keep code short and readable.
In [4]:
# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, centers=20, random_state=123)
labels = ["b", "r"]
y = np.take(labels, (y < 10))
print(X)
print(y[:5])
[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
...
[-0.231 -1.608]
[-0.603 6.873]
[ 2.284 4.874]]
['r' 'r' 'b' 'r' 'b']
In [5]:
# X is a 2-dimensional array with 1000 rows and 2 columns;
# y is a 1-dimensional array with 1000 values
print(X.shape)
print(y.shape)
(1000, 2)
(1000,)
In [6]:
# Rows and columns can be accessed with lists, slices or masks
print(X[[1, 2, 3]]) # rows 1, 2 and 3
print(X[:5]) # 5 first rows
print(X[500:510, 0]) # values from rows 500 to 509 at column 0
print(X[y == "b"][:5]) # 5 first rows for which y is "b"
[[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]]
[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]
[ 4.516 -2.881]]
[-4.438 -2.46 4.331 -7.921 1.57 0.565 4.996 4.758 -1.604 1.101]
[[-5.184 -1.253]
[ 4.516 -2.881]
[ 1.708 2.624]
[-0.526 8.96 ]
[-1.076 9.787]]
In [7]:
# Plot
plt.figure()
for label in labels:
    mask = (y == label)
    plt.scatter(X[mask, 0], X[mask, 1], c=label)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
For structured data, Pandas provides more advanced tools (CSV, JSON, Excel, HDF5, SQL, etc);
Goal: enforce a simple and consistent API to make it trivial to swap or plug algorithms.
Estimators
In [8]:
class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self
In [9]:
# Import the nearest neighbor class
from sklearn.neighbors import KNeighborsClassifier  # Change this to try
                                                    # something else
# Set hyper-parameters, for controlling the algorithm
clf = KNeighborsClassifier(n_neighbors=5)
# Learn a model from training data
clf.fit(X, y)
Out[9]:
KNeighborsClassifier()
In [10]:
# Estimator state is stored in instance attributes
clf._tree
Out[10]:
<sklearn.neighbors._kd_tree.KDTree at 0x560760a44520>
Predictors
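Predictors extend the estimator interface with a predict method (classifiers usually also provide predict_proba). As a toy illustration that is not part of the original tutorial, here is a minimal custom estimator honouring that contract:
import numpy as np

class MajorityClassifier(object):
    """Toy estimator: always predicts the most frequent class seen in fit."""
    def fit(self, X, y):
        # Remember the most frequent class observed during fit
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self
    def predict(self, X):
        # Predict the majority class for every sample
        return np.full(len(X), self.majority_)
Subclassing sklearn.base.BaseEstimator additionally provides get_params/set_params, which utilities such as GridSearchCV rely on.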
In [11]:
# Make predictions
print(clf.predict(X[:5]))
In [12]:
# Compute (approximate) class probabilities
print(clf.predict_proba(X[:5]))
[[0. 1. ]
[0. 1. ]
[0.2 0.8]
[0.6 0.4]
[0.8 0.2]]
In [13]:
# Mount your Google Drive as follows:
from google.colab import drive
drive.mount('/content/mydir')
In [14]:
# Query name of current folder
import os
folder_name=os.getcwd()
print(folder_name)
/content
In [15]:
# Goto the Colab folder
os.chdir('/content/mydir/MyDrive/Victor/MachineLearningCourseExercises')
folder_name=os.getcwd()
print(folder_name)
/content/mydir/MyDrive/Victor/MachineLearningCourseExercises
In [16]:
ls
1_An_introduction_to_Machine_Learning_with_Scikit_Learn.ipynb Colabtutorial/
Assignment1_HammettNeuralNetwork/ __pycache__/
Assignment2_RegressiveLinearModelsForChemistryPrediction/ robustness.py
Assignment3_UnsupervisedLearning/ tutorial.py
In [17]:
from tutorial import plot_surface
plot_surface(clf, X, y)
In [18]:
from tutorial import plot_histogram
plot_histogram(clf, X, y)
Classifier zoo
Decision trees
Idea: greedily build a partition of the input space using cuts orthogonal to feature axes.
In [19]:
from tutorial import plot_clf
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)
Random Forests
Idea: Build several decision trees with controlled randomness and average their decisions.
In [20]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=500)
# from sklearn.ensemble import ExtraTreesClassifier
# clf = ExtraTreesClassifier(n_estimators=500)
clf.fit(X, y)
plot_clf(clf, X, y)
Logistic regression
Idea: model the decision boundary as a hyperplane.
In [21]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
plot_clf(clf, X, y)
Support vector machines
Idea: find the separating hyperplane with the largest margin; kernels allow for non-linear decision boundaries.
In [22]:
from sklearn.svm import SVC
clf = SVC(kernel="linear") # try kernel="rbf" instead
clf.fit(X, y)
plot_clf(clf, X, y)
Multi-layer perceptron
Idea: a multi-layer perceptron is a circuit of non-linear combinations of the data.
In [23]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100), activation="relu", learning_rate="invscaling")
clf.fit(X, y)
plot_clf(clf, X, y)
Gaussian Processes
Idea: a Gaussian process is a distribution over functions f, such that the values f(x), for any finite set x of points, are jointly Gaussian distributed.
In [24]:
from sklearn.gaussian_process import GaussianProcessClassifier
clf = GaussianProcessClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)
Model evaluation
Recall that the goal is to find a model φ minimizing the generalization error Err(φ) = E_{X,Y}[ℓ(Y, φ(X))].
Problem: since the data distribution P_{X,Y} is unknown, the generalization error Err(φ) cannot be evaluated directly.
Training error
In [25]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
print("Training error =", zero_one_loss(y, clf.predict(X)))
Test error
Issue: the training error is a biased estimate of the generalization error.
Solution: Divide L into two disjoint parts called training and test sets (usually using 70% for training and 30% for test).
In [26]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Training error =", zero_one_loss(y_train, clf.predict(X_train)))
print("Test error =", zero_one_loss(y_test, clf.predict(X_test)))
Cross-validation
Issue:
When L is small, training on 70% of the data may lead to a model that is significantly different from a model that would have been learned on the entire set L.
Yet, increasing the size of the training set (i.e., decreasing the size of the test set) leads to a noisier estimate of the generalization error.
Solution: K-fold cross-validation. Split L into K disjoint folds; for each fold, train on the remaining K-1 folds and evaluate the error on the held-out fold; average the K error estimates.
In [27]:
from sklearn.model_selection import KFold
scores = []
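# The loop body of this cell appears to have been cut off in the export.
# The lines below are a reconstruction sketch (fold count and classifier
# settings are assumptions) of the manual K-fold loop that the shortcut in
# the next cell replaces; KFold and scores come from the lines above.
from sklearn.metrics import zero_one_loss
from sklearn.neighbors import KNeighborsClassifier
for train, test in KFold(n_splits=5).split(X):
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X[train], y[train])
    scores.append(zero_one_loss(y[test], clf.predict(X[test])))
print("CV error = %f +-%f" % (np.mean(scores), np.std(scores)))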
In [28]:
# Shortcut
from sklearn.model_selection import cross_val_score
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
                         cv=KFold(n_splits=5, random_state=None),
                         scoring="accuracy")
print("CV error = %f +-%f" % (1. - np.mean(scores), np.std(scores)))
Metrics
Default score
Estimators come with a built-in default evaluation score
In [29]:
y_train = (y_train == "r")
y_test = (y_test == "r")
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Default score =", clf.score(X_test, y_test))
Accuracy
Definition: The accuracy is the proportion of correct predictions.
In [30]:
from sklearn.metrics import accuracy_score
print("Accuracy =", accuracy_score(y_test, clf.predict(X_test)))
Accuracy = 0.84
Precision, recall and F-measure
Definitions: precision = TP / (TP + FP); recall = TP / (TP + FN); the F-measure (beta=1) is the harmonic mean of precision and recall.
In [31]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
print("Precision =", precision_score(y_test, clf.predict(X_test)))
print("Recall =", recall_score(y_test, clf.predict(X_test)))
print("F =", fbeta_score(y_test, clf.predict(X_test), beta=1))
Precision = 0.8118811881188119
Recall = 0.8631578947368421
F = 0.8367346938775511
ROC AUC
Definition: Area under the curve of the true positive rate (TPR) plotted against the false positive rate (FPR) as the decision threshold of the classifier is varied.
In [32]:
from sklearn.metrics import get_scorer
roc_auc_scorer = get_scorer("roc_auc")
print("ROC AUC =", roc_auc_scorer(clf, X_test, y_test))
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
Confusion matrix
Definition: number of samples of class i predicted as class j.
In [33]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, clf.predict(X_test))
Out[33]:
array([[86, 19],
       [13, 82]])
Model selection
Finding good hyper-parameters is crucial to control under- and over-fitting, hence achieving better performance.
The estimated generalization error can be used to select the best model.
In [34]:
from sklearn.model_selection import validation_curve
Out[34]:
<matplotlib.legend.Legend at 0x7f866bd5e210>
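Only the import and the Legend output of this cell survive in the export; below is a minimal sketch of how validation_curve could produce the param_name, param_range and test_scores_mean used in the next cell. The parameter grid and scoring choices are assumptions; the final plt.legend() call is what returns the Legend object shown above.
param_name = "n_neighbors"
param_range = np.arange(1, 100)
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name=param_name, param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=-1)
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(param_range, train_scores_mean, label="Training score")
plt.plot(param_range, test_scores_mean, label="Cross-validation score")
plt.xlabel(param_name)
plt.ylabel("accuracy")
plt.legend()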
In [35]:
# Best trade-off
print("%s = %d, CV error = %f" % (param_name,
param_range[np.argmax(test_scores_mean)],
1. - np.max(test_scores_mean)))
Question: What does it mean if the training error is different from the test error?
Hyper-parameter search
In [36]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 100))},
                    scoring="accuracy",
                    cv=5, n_jobs=-1)
grid.fit(X, y) # Note that GridSearchCV is itself an estimator
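# Hypothetical follow-up lines (not in the export) to inspect the search result
print("Best score =", grid.best_score_)
print("Best parameters =", grid.best_params_)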
Problem: the hyper-parameters are tuned on the same data that is used to compute the score. As a result, the optimized grid.best_score_ estimate may in fact be a biased, optimistic estimate of the true performance of the model.
Solution: Use nested cross-validation for correctly selecting the model and correctly evaluating its performance.
In [37]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
scores = cross_val_score(
    GridSearchCV(KNeighborsClassifier(),
                 param_grid={"n_neighbors": list(range(1, 100))},
                 scoring="accuracy",
                 cv=5, n_jobs=-1),
    X, y, cv=5, scoring="accuracy")
print("%f +-%f" % (1. - np.mean(scores), np.std(scores)))
0.144000 +-0.023958
Transformers
In [38]:
class Transformer(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def transform(self, X):
        """Transforms X into Xt."""
        # compute Xt from X
        return Xt

    # Shortcut
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        Xt = self.transform(X)
        return Xt
Transformer zoo
In [39]:
# Load digits data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Plot
sample_id = 42
plt.imshow(X[sample_id].reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.title("y = %d" % y[sample_id])
plt.show()
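The next two comments and the In [41] cell assume a fitted scaler named tf, but the standardization cell itself (presumably In [40]) is missing from the export; a minimal sketch of what it likely contained, assuming a plain StandardScaler:
In [40]:
# Standardization: center each feature and scale it to unit variance
from sklearn.preprocessing import StandardScaler
tf = StandardScaler()
tf.fit(X_train, y_train)
print("Mean after scaling =", np.mean(tf.transform(X_train)))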
# Shortcut: Xt = tf.fit_transform(X)
# See also Binarizer, MinMaxScaler, Normalizer, ...
In [41]:
# Scaling is critical for some algorithms
from sklearn.svm import SVC
clf = SVC()
print("Without scaling =", clf.fit(X_train, y_train).score(X_test, y_test))
print("With scaling =", clf.fit(tf.transform(X_train), y_train).score(tf.transform(X_test), y_test))
Feature selection
In [42]:
# Select the 10 top features, as ranked using ANOVA F-score
from sklearn.feature_selection import SelectKBest, f_classif
tf = SelectKBest(score_func=f_classif, k=10)
Xt = tf.fit_transform(X_train, y_train)
print("Shape =", Xt.shape)
# Plot support
plt.imshow(tf.get_support().reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.show()
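# The decomposition cell that defines Xt_train is missing from the export
# (it jumps from In [42] to In [45]); the lines below are a reconstruction
# sketch assuming a two-component PCA of the training digits.
from sklearn.decomposition import PCA
tf = PCA(n_components=2)
Xt_train = tf.fit_transform(X_train)
print("Shape =", Xt_train.shape)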
# Plot
plt.scatter(Xt_train[:, 0], Xt_train[:, 1], c=y_train)
plt.show()
Function transformer
In [45]:
from sklearn.preprocessing import FunctionTransformer
def increment(X):
    return X + 1
tf = FunctionTransformer(func=increment)
Xt = tf.fit_transform(X)
print(X[0])
print(Xt[0])
Pipelines
Transformers can be chained in sequence to form a pipeline.
In [46]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
tf = make_pipeline(StandardScaler(), SelectKBest(score_func=f_classif, k=10))
tf.fit(X_train, y_train)
Out[46]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectkbest', SelectKBest())])
In [47]:
Xt_train = tf.transform(X_train)
print("Mean =", np.mean(Xt_train))
print("Shape =", Xt_train.shape)
Mean = -1.3715004550677509e-17
Shape = (1347, 10)
In [48]:
# Chain transformers + a classifier to build a new classifier
clf = make_pipeline(StandardScaler(),
                    SelectKBest(score_func=f_classif, k=10),
                    RandomForestClassifier())
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])
In [49]:
# Hyper-parameters can be accessed using step names
print("K =", clf.get_params()["selectkbest__k"])
K = 10
In [50]:
clf.named_steps
Out[50]:
{'randomforestclassifier': RandomForestClassifier(),
 'selectkbest': SelectKBest(),
 'standardscaler': StandardScaler()}
In [51]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(clf,
                    param_grid={"selectkbest__k": [1, 10, 20, 30, 40, 50],
                                "randomforestclassifier__max_features": [0.1, 0.25, 0.5]})
grid.fit(X_train, y_train)
Feature unions
Similarly, transformers can be applied in parallel and their outputs concatenated to form a feature union.
Nested composition
Since pipelines and unions are themselves estimators, they can be composed into nested structures.
In [52]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_union
clf = make_pipeline(
    # Build features
    make_union(
        FunctionTransformer(func=lambda X: X),  # Identity
        PCA(),
    ),
    # Select the best features
    RFE(RandomForestClassifier(), n_features_to_select=10),
    # Train
    MLPClassifier()
)
clf.fit(X_train, y_train)
Out[52]:
Pipeline(steps=[('featureunion',
                 FeatureUnion(transformer_list=[('functiontransformer',
                                                 FunctionTransformer(func=<function <lambda> at 0x7f867180ba70>)),
                                                ('pca', PCA())])),
                ('rfe',
                 RFE(estimator=RandomForestClassifier(),
                     n_features_to_select=10)),
                ('mlpclassifier', MLPClassifier())])
Beyond building classifiers
Scikit-Learn is not limited to classifiers: the example below uses KernelDensity to model the distribution of the digits data and to sample new digit-like images.
In [54]:
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
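# The rest of this cell appears truncated: the next cell reads
# grid.best_estimator_, so a grid search over the kernel bandwidth must have
# been fitted here. Reconstruction sketch modelled on the standard digits
# kernel-density example (PCA dimensionality and bandwidth grid are assumptions).
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
print("Best bandwidth =", grid.best_estimator_.bandwidth)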
In [55]:
# Use the best estimator to compute the kernel density estimate
kde = grid.best_estimator_
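# The sampling step that produces new_data is missing from the export; the
# 4x11 reshape in the next cell suggests 44 samples drawn from the density
# model and mapped back to 64-dimensional pixel space. Sketch under those
# assumptions (kde and pca as defined above; random_state is an assumption).
new_data = kde.sample(44, random_state=0)
new_data = pca.inverse_transform(new_data)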
In [56]:
# Turn data into a 4x11 grid
new_data = new_data.reshape((4, 11, -1))
real_data = digits.data[:44].reshape((4, 11, -1))
Summary
Scikit-Learn provides essential tools for machine learning.
It is more than training classifiers!
It integrates within a larger Python scientific ecosystem.
Try it for yourself!
Questions?