1 An Introduction To Machine Learning With Scikit Learn
University of Liège
Prerequisites
In [1]:
# This is a Jupyter notebook, with executable Python code inside
42 / 2
Out[1]:
21.0
Requires a Python distribution with scientific packages (NumPy, SciPy, Scikit-Learn, Pandas)
In [2]:
# Global imports and settings
# Matplotlib
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.max_open_warning"] = -1
# Print options
import numpy as np
np.set_printoptions(precision=3)
# Slideshow
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {'width': 1440, 'height': 768, 'scroll': True, 'theme': 'simple'})
# Silence warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=UserWarning)
warnings.simplefilter(action="ignore", category=RuntimeWarning)
In [3]:
%%javascript
Reveal.addEventListener("slidechanged", function(event){ window.location.hash = "header"; });
Outline
Scikit-Learn and the scientific ecosystem in Python
Classification
Model evaluation and selection
Transformers, pipelines and feature unions
Beyond building classifiers
Summary
Scikit-Learn
Overview
Machine learning library written in Python
Simple and efficient, for both experts and non-experts
Classical, well-established machine learning algorithms
Shipped with documentation and examples
BSD 3 license
Scikit-Learn builds upon NumPy and SciPy and complements this scientific environment with machine learning algorithms;
By design, Scikit-Learn is non-intrusive, easy to use and easy to combine with other libraries (see the minimal example after this list);
Core algorithms are implemented in low-level languages.
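To make the consistent-API point concrete, here is a minimal sketch of the canonical fit/predict workflow; the dataset and estimator below are illustrative choices, not taken from the original slides.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Every estimator exposes the same fit / predict / score interface
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)        # learn a model from training data
print(clf.predict(X_test[:5]))   # predict on unseen data
print(clf.score(X_test, y_test)) # mean accuracy on the test set
Run it as a standalone snippet; the cells below construct their own X and y.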
Algorithms
Supervised learning:
Unsupervised learning:
Cross-validation
Grid-search
Lots of metrics
Classification
Framework
Data comes as a finite learning set L = (X, y) where
Input samples are given as an array X of shape n_samples × n_features, taking their values in 𝒳;
Output values are given as an array y, taking symbolic values in 𝒴.
The goal of classification is to learn an estimator φ : 𝒳 → 𝒴 minimizing the generalization error Err(φ) = E_{X,Y}[ℓ(Y, φ(X))], where ℓ is a loss function, e.g., the zero-one loss for classification ℓ_01(Y, Ŷ) = 1(Y ≠ Ŷ).
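For concreteness, a minimal sketch (not from the original slides) of the empirical zero-one loss on illustrative arrays:
import numpy as np
y_true = np.array(["b", "r", "r", "b"])
y_pred = np.array(["b", "b", "r", "b"])
# Empirical zero-one loss = fraction of misclassified samples
print(np.mean(y_true != y_pred))  # 0.25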
Applications
Classifying signal from background events;
Diagnosing disease from symptoms;
Recognising cats in pictures;
Identifying body parts with Kinect cameras;
...
Data
Input data = NumPy arrays or SciPy sparse matrices;
Algorithms are expressed using high-level operations defined on matrices or vectors (similar to MATLAB);
Leverage efficient low-level implementations;
Keep code short and readable.
In [4]:
# Generate data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1000, centers=20, random_state=123)
labels = ["b", "r"]
y = np.take(labels, (y < 10))
print(X)
print(y[:5])
[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
...
[-0.231 -1.608]
[-0.603 6.873]
[ 2.284 4.874]]
['r' 'r' 'b' 'r' 'b']
In [5]:
# X is a 2-dimensional array with 1000 rows and 2 columns;
# y is a 1-dimensional array with 1000 values
print(X.shape)
print(y.shape)
(1000, 2)
(1000,)
In [6]:
# Rows and columns can be accessed with lists, slices or masks
print(X[[1, 2, 3]]) # rows 1, 2 and 3
print(X[:5]) # 5 first rows
print(X[500:510, 0]) # values from rows 500 to 509 at column 0
print(X[y == "b"][:5]) # 5 first rows for which y is "b"
[[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]]
[[-6.453 -8.764]
[ 0.29 0.147]
[-5.184 -1.253]
[-4.714 3.674]
[ 4.516 -2.881]]
[-4.438 -2.46 4.331 -7.921 1.57 0.565 4.996 4.758 -1.604 1.101]
[[-5.184 -1.253]
[ 4.516 -2.881]
[ 1.708 2.624]
[-0.526 8.96 ]
[-1.076 9.787]]
In [7]:
# Plot
plt.figure()
for label in labels:
    mask = (y == label)
    plt.scatter(X[mask, 0], X[mask, 1], c=label)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()
For structured data, Pandas provides more advanced tools (CSV, JSON, Excel, HDF5, SQL, etc);
Goal: enforce a simple and consistent API to make it trivial to swap or plug algorithms.
Estimators
In [8]:
class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self
In [9]:
# Import the nearest neighbor class
from sklearn.neighbors import KNeighborsClassifier  # Change this to try
                                                    # something else
# Set hyper-parameters, for controlling the algorithm
clf = KNeighborsClassifier(n_neighbors=5)
# Learn a model from training data
clf.fit(X, y)
Out[9]:
KNeighborsClassifier()
In [10]:
# Estimator state is stored in instance attributes
clf._tree
Out[10]:
<sklearn.neighbors._kd_tree.KDTree at 0x560760a44520>
Predictors
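Predictors extend the estimator interface with a predict method (classifiers usually also provide predict_proba). As a toy illustration that is not part of the original tutorial, here is a minimal custom estimator honouring that contract:
import numpy as np

class MajorityClassifier(object):
    """Toy estimator: always predicts the most frequent class seen in fit."""
    def fit(self, X, y):
        # Remember the most frequent class observed during fit
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self
    def predict(self, X):
        # Predict the majority class for every sample
        return np.full(len(X), self.majority_)
Subclassing sklearn.base.BaseEstimator additionally provides get_params/set_params, which utilities such as GridSearchCV rely on.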
In [11]:
# Make predictions
print(clf.predict(X[:5]))
In [12]:
# Compute (approximate) class probabilities
print(clf.predict_proba(X[:5]))
[[0. 1. ]
[0. 1. ]
[0.2 0.8]
[0.6 0.4]
[0.8 0.2]]
In [13]:
# Mount your Google Drive as follows:
from google.colab import drive
drive.mount('/content/mydir')
In [14]:
# Query name of current folder
import os
folder_name=os.getcwd()
print(folder_name)
/content
In [15]:
# Goto the Colab folder
os.chdir('/content/mydir/MyDrive/Victor/MachineLearningCourseExercises')
folder_name=os.getcwd()
print(folder_name)
/content/mydir/MyDrive/Victor/MachineLearningCourseExercises
In [16]:
ls
1_An_introduction_to_Machine_Learning_with_Scikit_Learn.ipynb Colabtutorial/
Assignment1_HammettNeuralNetwork/ __pycache__/
Assignment2_RegressiveLinearModelsForChemistryPrediction/ robustness.py
Assignment3_UnsupervisedLearning/ tutorial.py
In [17]:
from tutorial import plot_surface
plot_surface(clf, X, y)
In [18]:
from tutorial import plot_histogram
plot_histogram(clf, X, y)
Classifier zoo
Decision trees
Idea: greedily build a partition of the input space using cuts orthogonal to feature axes.
In [19]:
from tutorial import plot_clf
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)
Random Forests
Idea: Build several decision trees with controlled randomness and average their decisions.
In [20]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=500)
# from sklearn.ensemble import ExtraTreesClassifier
# clf = ExtraTreesClassifier(n_estimators=500)
clf.fit(X, y)
plot_clf(clf, X, y)
Logistic regression
Idea: model the decision boundary as a hyperplane.
In [21]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
plot_clf(clf, X, y)
Support vector machines
Idea: find the separating hyperplane with the largest margin; kernels allow for non-linear decision boundaries.
In [22]:
from sklearn.svm import SVC
clf = SVC(kernel="linear") # try kernel="rbf" instead
clf.fit(X, y)
plot_clf(clf, X, y)
Multi-layer perceptron
Idea: a multi-layer perceptron is a circuit of non-linear combinations of the data.
In [23]:
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(100, 100, 100), activation="relu", learning_rate="invscaling")
clf.fit(X, y)
plot_clf(clf, X, y)
Gaussian Processes
Idea: a Gaussian process is a distribution over functions f, such that the values f(x), for any finite set x of points, are jointly Gaussian distributed.
In [24]:
from sklearn.gaussian_process import GaussianProcessClassifier
clf = GaussianProcessClassifier()
clf.fit(X, y)
plot_clf(clf, X, y)
Model evaluation
Recall that the goal is to find a model φ minimizing the generalization error Err(φ) = E_{X,Y}[ℓ(Y, φ(X))].
Problem: since the data distribution P_{X,Y} is unknown, the generalization error Err(φ) cannot be evaluated directly.
Training error
In [25]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
print("Training error =", zero_one_loss(y, clf.predict(X)))
Test error
Issue: the training error is a biased estimate of the generalization error.
Solution: Divide L into two disjoint parts called training and test sets (usually using 70% for training and 30% for test).
In [26]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Training error =", zero_one_loss(y_train, clf.predict(X_train)))
print("Test error =", zero_one_loss(y_test, clf.predict(X_test)))
Cross-validation
Issue:
When L is small, training on 70% of the data may lead to a model that is significantly different from a model that would have been learned on the entire set L.
Yet, increasing the size of the training set (i.e., decreasing the size of the test set) leads to a noisier estimate of the generalization error.
Solution: K-fold cross-validation. Split L into K disjoint folds; for each fold, train on the remaining K-1 folds and evaluate the error on the held-out fold; average the K error estimates.
In [27]:
from sklearn.model_selection import KFold
scores = []
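# The loop body of this cell appears to have been cut off in the export.
# The lines below are a reconstruction sketch (fold count and classifier
# settings are assumptions) of the manual K-fold loop that the shortcut in
# the next cell replaces; KFold and scores come from the lines above.
from sklearn.metrics import zero_one_loss
from sklearn.neighbors import KNeighborsClassifier
for train, test in KFold(n_splits=5).split(X):
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X[train], y[train])
    scores.append(zero_one_loss(y[test], clf.predict(X[test])))
print("CV error = %f +-%f" % (np.mean(scores), np.std(scores)))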
In [28]:
# Shortcut
from sklearn.model_selection import cross_val_score
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
                         cv=KFold(n_splits=5, random_state=None),
                         scoring="accuracy")
print("CV error = %f +-%f" % (1. - np.mean(scores), np.std(scores)))
Metrics
Default score
Estimators come with a built-in default evaluation score
In [29]:
y_train = (y_train == "r")
y_test = (y_test == "r")
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Default score =", clf.score(X_test, y_test))
Accuracy
Definition: The accuracy is the proportion of correct predictions.
In [30]:
from sklearn.metrics import accuracy_score
print("Accuracy =", accuracy_score(y_test, clf.predict(X_test)))
Accuracy = 0.84
Precision, recall and F-measure
Definitions: precision = TP / (TP + FP); recall = TP / (TP + FN); the F-measure (beta=1) is the harmonic mean of precision and recall.
In [31]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
print("Precision =", precision_score(y_test, clf.predict(X_test)))
print("Recall =", recall_score(y_test, clf.predict(X_test)))
print("F =", fbeta_score(y_test, clf.predict(X_test), beta=1))
Precision = 0.8118811881188119
Recall = 0.8631578947368421
F = 0.8367346938775511
ROC AUC
Definition: Area under the curve of the true positive rate (TPR) plotted against the false positive rate (FPR) as the decision threshold of the classifier is varied.
In [32]:
from sklearn.metrics import get_scorer
roc_auc_scorer = get_scorer("roc_auc")
print("ROC AUC =", roc_auc_scorer(clf, X_test, y_test))
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.show()
Confusion matrix
Definition: number of samples of class i predicted as class j.
In [33]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, clf.predict(X_test))
Out[33]:
array([[86, 19],
       [13, 82]])
Model selection
Finding good hyper-parameters is crucial to control under- and over-fitting, hence achieving better performance.
The estimated generalization error can be used to select the best model.
In [34]:
from sklearn.model_selection import validation_curve
Out[34]:
<matplotlib.legend.Legend at 0x7f866bd5e210>
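Only the import and the Legend output of this cell survive in the export; below is a minimal sketch of how validation_curve could produce the param_name, param_range and test_scores_mean used in the next cell. The parameter grid and scoring choices are assumptions; the final plt.legend() call is what returns the Legend object shown above.
param_name = "n_neighbors"
param_range = np.arange(1, 100)
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name=param_name, param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=-1)
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(param_range, train_scores_mean, label="Training score")
plt.plot(param_range, test_scores_mean, label="Cross-validation score")
plt.xlabel(param_name)
plt.ylabel("accuracy")
plt.legend()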
In [35]:
# Best trade-off
print("%s = %d, CV error = %f" % (param_name,
param_range[np.argmax(test_scores_mean)],
1. - np.max(test_scores_mean)))
Question: What does it mean if the training error is different from the test error?
Hyper-parameter search
In [36]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 100))},
                    scoring="accuracy",
                    cv=5, n_jobs=-1)
grid.fit(X, y) # Note that GridSearchCV is itself an estimator
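# Hypothetical follow-up lines (not in the export) to inspect the search result
print("Best score =", grid.best_score_)
print("Best parameters =", grid.best_params_)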
Problem: the hyper-parameters are tuned on the same data that is used to compute the score. As a result, the optimized grid.best_score_ estimate may in fact be a biased, optimistic estimate of the true performance of the model.
Solution: Use nested cross-validation for correctly selecting the model and correctly evaluating its performance.
In [37]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
scores = cross_val_score(
    GridSearchCV(KNeighborsClassifier(),
                 param_grid={"n_neighbors": list(range(1, 100))},
                 scoring="accuracy",
                 cv=5, n_jobs=-1),
    X, y, cv=5, scoring="accuracy")
print("%f +-%f" % (1. - np.mean(scores), np.std(scores)))
0.144000 +-0.023958
Transformers
In [38]:
class Transformer(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of ``self``
        return self

    def transform(self, X):
        """Transforms X into Xt."""
        # compute Xt from X
        return Xt

    # Shortcut
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        Xt = self.transform(X)
        return Xt
Transformer zoo
In [39]:
# Load digits data
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Plot
sample_id = 42
plt.imshow(X[sample_id].reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.title("y = %d" % y[sample_id])
plt.show()
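The next two comments and the In [41] cell assume a fitted scaler named tf, but the standardization cell itself (presumably In [40]) is missing from the export; a minimal sketch of what it likely contained, assuming a plain StandardScaler:
In [40]:
# Standardization: center each feature and scale it to unit variance
from sklearn.preprocessing import StandardScaler
tf = StandardScaler()
tf.fit(X_train, y_train)
print("Mean after scaling =", np.mean(tf.transform(X_train)))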
# Shortcut: Xt = tf.fit_transform(X)
# See also Binarizer, MinMaxScaler, Normalizer, ...
In [41]:
# Scaling is critical for some algorithms
from sklearn.svm import SVC
clf = SVC()
print("Without scaling =", clf.fit(X_train, y_train).score(X_test, y_test))
print("With scaling =", clf.fit(tf.transform(X_train), y_train).score(tf.transform(X_test), y_test))
Feature selection
In [42]:
# Select the 10 top features, as ranked using ANOVA F-score
from sklearn.feature_selection import SelectKBest, f_classif
tf = SelectKBest(score_func=f_classif, k=10)
Xt = tf.fit_transform(X_train, y_train)
print("Shape =", Xt.shape)
# Plot support
plt.imshow(tf.get_support().reshape((8, 8)), interpolation="nearest", cmap=plt.cm.Blues)
plt.show()
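# The decomposition cell that defines Xt_train is missing from the export
# (it jumps from In [42] to In [45]); the lines below are a reconstruction
# sketch assuming a two-component PCA of the training digits.
from sklearn.decomposition import PCA
tf = PCA(n_components=2)
Xt_train = tf.fit_transform(X_train)
print("Shape =", Xt_train.shape)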
# Plot
plt.scatter(Xt_train[:, 0], Xt_train[:, 1], c=y_train)
plt.show()
Function transformer
In [45]:
from sklearn.preprocessing import FunctionTransformer
def increment(X):
    return X + 1
tf = FunctionTransformer(func=increment)
Xt = tf.fit_transform(X)
print(X[0])
print(Xt[0])
Pipelines
Transformers can be chained in sequence to form a pipeline.
In [46]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
tf = make_pipeline(StandardScaler(), SelectKBest(score_func=f_classif, k=10))
tf.fit(X_train, y_train)
Out[46]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectkbest', SelectKBest())])
In [47]:
Xt_train = tf.transform(X_train)
print("Mean =", np.mean(Xt_train))
print("Shape =", Xt_train.shape)
Mean = -1.3715004550677509e-17
Shape = (1347, 10)
In [48]:
# Chain transformers + a classifier to build a new classifier
clf = make_pipeline(StandardScaler(),
                    SelectKBest(score_func=f_classif, k=10),
                    RandomForestClassifier())
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])
In [49]:
# Hyper-parameters can be accessed using step names
print("K =", clf.get_params()["selectkbest__k"])
K = 10
In [50]:
clf.named_steps
Out[50]:
{'randomforestclassifier': RandomForestClassifier(),
 'selectkbest': SelectKBest(),
 'standardscaler': StandardScaler()}
In [51]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(clf,
                    param_grid={"selectkbest__k": [1, 10, 20, 30, 40, 50],
                                "randomforestclassifier__max_features": [0.1, 0.25, 0.5]})
grid.fit(X_train, y_train)
Feature unions
Similarly, transformers can be applied in parallel and their outputs concatenated to form a feature union.
Nested composition
Since pipelines and unions are themselves estimators, they can be composed into nested structures.
In [52]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_union
clf = make_pipeline(
    # Build features
    make_union(
        FunctionTransformer(func=lambda X: X),  # Identity
        PCA(),
    ),
    # Select the best features
    RFE(RandomForestClassifier(), n_features_to_select=10),
    # Train
    MLPClassifier()
)
clf.fit(X_train, y_train)
Out[52]:
Pipeline(steps=[('featureunion',
                 FeatureUnion(transformer_list=[('functiontransformer',
                                                 FunctionTransformer(func=<function <lambda> at 0x7f867180ba70>)),
                                                ('pca', PCA())])),
                ('rfe',
                 RFE(estimator=RandomForestClassifier(),
                     n_features_to_select=10)),
                ('mlpclassifier', MLPClassifier())])
Beyond building classifiers
Scikit-Learn is not limited to classifiers: the example below uses KernelDensity to model the distribution of the digits data and to sample new digit-like images.
In [54]:
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
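# The rest of this cell appears truncated: the next cell reads
# grid.best_estimator_, so a grid search over the kernel bandwidth must have
# been fitted here. Reconstruction sketch modelled on the standard digits
# kernel-density example (PCA dimensionality and bandwidth grid are assumptions).
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
print("Best bandwidth =", grid.best_estimator_.bandwidth)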
In [55]:
# Use the best estimator to compute the kernel density estimate
kde = grid.best_estimator_
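# The sampling step that produces new_data is missing from the export; the
# 4x11 reshape in the next cell suggests 44 samples drawn from the density
# model and mapped back to 64-dimensional pixel space. Sketch under those
# assumptions (kde and pca as defined above; random_state is an assumption).
new_data = kde.sample(44, random_state=0)
new_data = pca.inverse_transform(new_data)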
In [56]:
# Turn data into a 4x11 grid
new_data = new_data.reshape((4, 11, -1))
real_data = digits.data[:44].reshape((4, 11, -1))
Summary
Scikit-Learn provides essential tools for machine learning.
It is more than training classifiers!
It integrates within a larger Python scientific ecosystem.
Try it for yourself!
Questions?