Scikit-Learn Tutorial
Audience
This tutorial will be useful for graduates, postgraduates, and research students who either have an interest in Machine Learning or have it as a part of their curriculum. The reader can be a beginner or an advanced learner.
Prerequisites
The reader must have basic knowledge of Machine Learning. He/she should also be familiar with Python, NumPy, SciPy, and Matplotlib. If you are new to any of these concepts, we recommend you take up tutorials on these topics before you dig further into this tutorial.
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at [email protected]
Table of Contents
About the Tutorial
Audience
Prerequisites
Installation
Features
Binarisation
Mean Removal
Scaling
Normalisation
Elastic-Net
MultiTaskElasticNet
Introduction
SVC
NuSVC
LinearSVC
SVR
NuSVR
LinearSVR
Methods
One-Class SVM
KNeighborsClassifier
RadiusNeighborsClassifier
KNeighborsRegressor
1. Scikit-Learn — Introduction
Origin of Scikit-Learn
It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in Computer Science and Automation), took this project to another level and made the first public release (v0.1 beta) on 1st Feb. 2010.
Various organisations like Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and
many more are using Sklearn.
Prerequisites
Before we start using the latest scikit-learn release, we require the following:
Python (>=3.5)
NumPy (>= 1.11.0)
Scipy (>= 0.17.0)
Joblib (>= 0.11)
Matplotlib (>= 1.5.1) is required for Sklearn plotting capabilities.
Pandas (>= 0.18.0) is required for some of the scikit-learn examples using data
structure and analysis.
Installation
If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn:
Using pip
Using conda
On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them by using either pip or conda.
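For example, assuming pip or conda is already available on your system, the following commands would typically install scikit-learn along with its dependencies:
pip install -U scikit-learn
conda install scikit-learn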
Another option is to use Python distributions like Canopy and Anaconda, because they both ship the latest version of scikit-learn.
Features
Rather than focusing on loading, manipulating and summarising data, the Scikit-learn library is focused on modeling the data. Some of the most popular groups of models provided by Sklearn are as follows:
Supervised Learning algorithms: Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machines (SVM) and Decision Trees, are part of scikit-learn.
Unsupervised Learning algorithms: On the other hand, it also has all the popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Cross Validation: It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction: It is used for reducing the number of attributes in data which
can be further used for summarisation, visualisation and feature selection.
Ensemble methods: As the name suggests, it is used for combining the predictions of multiple supervised models.
Feature extraction: It is used to extract the features from data to define the attributes
in image and text data.
2. Scikit-Learn — Modelling Process
This chapter deals with the modelling process involved in Sklearn. Let us understand about
the same in detail and begin with dataset loading.
Dataset Loading
A collection of data is called a dataset. It has the following two components:
Features: The variables of the data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix: It is the collection of features, in case there is more than one.
Feature Names: It is the list of all the names of the features.
Response: It is the output variable that basically depends upon the feature variables. It is also known as the target, label or output.
Scikit-learn has a few example datasets like iris and digits for classification and the Boston house prices for regression.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("\nFirst 10 rows of X:\n", X[:10])
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
The following example will split the data into 70:30 ratio, i.e. 70% data will be used as
training data and 30% will be used as testing data. The dataset is iris dataset as in above
example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
# random_state is fixed only for reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output
(105, 4)
(45, 4)
(105,)
(45,)
X, y: Here, X is the feature matrix and y is the response vector, which need to be split.
test_size: This represents the ratio of test data to the total given data. As in the above example, we are setting test_size = 0.3 for 150 rows of X. It will produce test data of 150*0.3 = 45 rows.
random_state: It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results.
In the example below, we are going to use KNN (K nearest neighbors) classifier. Don’t go
into the details of KNN algorithms, as there will be a separate chapter for that. This
example is used to make you understand the implementation part only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data and the model will make prediction out of that data
preds = classifier_knn.predict([[5, 5, 3, 2], [2, 4, 3, 5]])
Output
Accuracy: 0.9833333333333333
Model Persistence
Once you train the model, it is desirable that the model should be persisted for future use so that we do not need to retrain it again and again. It can be done with the help of the dump and load features of the joblib package.
Consider the example below in which we will be saving the above trained model
(classifier_knn) for future use:
import joblib
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')
The above code will save the model into file named iris_classifier_knn.joblib. Now, the
object can be reloaded from the file with the help of following code:
joblib.load('iris_classifier_knn.joblib')
Binarisation
This preprocessing technique is used when we need to convert our numerical values into
Boolean values.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)
In the above example, we used a threshold value of 0.5, and that is why all the values above 0.5 are converted to 1, and all the values equal to or below 0.5 are converted to 0.
Output
Binarized data:
[[ 1. 0. 1.]
[ 0. 1. 1.]
[ 0. 0. 1.]
[ 1. 1. 0.]]
Mean Removal
This technique is used to eliminate the mean from the feature vector so that every feature is centered on zero.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
#displaying the mean and the standard deviation of the input data
print("Mean =", input_data.mean(axis=0))
print("Stddeviation = ", input_data.std(axis=0))
#Removing the mean and the standard deviation of the input data
data_scaled = preprocessing.scale(input_data)
print("Mean_removed =", data_scaled.mean(axis=0))
print("Stddeviation_removed =", data_scaled.std(axis=0))
Output
Scaling
We use this preprocessing technique for scaling the feature vectors. Scaling of feature
vectors is important, because the features should not be synthetically large or small.
Example
import numpy as np
from sklearn import preprocessing
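For example, min-max scaling (a common choice, shown here only as a sketch) rescales each feature of the input_data array used above to the range (0, 1):
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)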
Output
Normalisation
We use this preprocessing technique for modifying the feature vectors. Normalisation of feature vectors is necessary so that the feature vectors can be measured at a common scale. There are two types of normalisation, as follows:
L1 Normalisation
It is also called Least Absolute Deviations. It modifies the values in such a manner that the sum of the absolute values always remains 1 in each row. The following example shows the implementation of L1 normalisation on input data.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)
Output
L1 normalized data:
L2 Normalisation
It is also called Least Squares. It modifies the values in such a manner that the sum of the squares always remains 1 in each row. The following example shows the implementation of L2 normalisation on input data.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)
Output
L2 normalized data:
[[ 0.33946114 -0.30713151 0.88906489]
[-0.33325106 0.53320169 0.7775858 ]
[ 0.05156558 -0.81473612 0.57753446]
[ 0.68706914 0.26784051 -0.6754239 ]]
3. Scikit-Learn — Data Representation
As we know, machine learning is about creating models from data. For this purpose, the computer must understand the data first. Next, we are going to discuss various ways to represent the data so that it can be understood by the computer:
Data as table
The best way to represent data in Scikit-learn is in the form of tables. A table represents a 2-D grid of data where rows represent the individual elements of the dataset and the columns represent the quantities related to those individual elements.
Example
In the example given below, we download the iris dataset in the form of a Pandas DataFrame with the help of the Python seaborn library.
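A minimal sketch of this step, using seaborn's built-in copy of the iris dataset, could look like this:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()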
Output
From the above output, we can see that each row of the data represents a single observed flower and the number of rows represents the total number of flowers in the dataset. Generally, we refer to the rows of the matrix as samples.
On the other hand, each column of the data represents quantitative information describing each sample. Generally, we refer to the columns of the matrix as features.
Example
In the example below, from the iris dataset we predict the species of flower based on the other measurements. In this case, the Species column would be considered as the target, while the remaining columns form the features matrix.
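The corresponding code is sketched below; it assumes the iris DataFrame loaded above and simply splits off the features matrix and the target array, whose shapes appear in the output that follows:
X_iris = iris.drop('species', axis=1)
print(X_iris.shape)
y_iris = iris['species']
print(y_iris.shape)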
Output
(150,4)
(150,)
4. Scikit-Learn — Estimator API
In this chapter, we will learn about the Estimator API (application programming interface). Let us begin by understanding what an Estimator API is.
For fitting the data, all estimator objects expose a fit method that takes a dataset, as shown below:
estimator.fit(data)
All the parameters of an estimator can be set when it is instantiated, or later by modifying the corresponding attribute.
Once data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters will then be attributes of the estimator object ending with an underscore, as follows:
estimator.estimated_param_
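As a small sketch of these conventions, the snippet below sets a parameter at instantiation, fits the estimator, and then reads an estimated parameter whose name ends with an underscore:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])
estimator = LinearRegression(fit_intercept=True)   # parameter set at instantiation
estimator.fit(X, y)                                # parameters are estimated from the data
print(estimator.coef_)                             # estimated parameter, note the trailing underscore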
fit
fit_predict if transductive
predict if inductive
Guiding Principles
While designing the Scikit-Learn API, the following guiding principles were kept in mind:
Consistency
This principle states that all the objects should share a common interface drawn from a limited set of methods. The documentation should also be consistent.
Composition
Sensible defaults
According to this principle, the Scikit-learn library defines an appropriate default value
whenever ML models require user-specified parameters.
Inspection
As per this guiding principle, every specified parameter value is exposed as a public attribute.
To see these steps in action, let us first load the iris data and arrange it into a features matrix and a target vector:
import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis=1)
X_iris.shape
Output
(150, 4)
y_iris = iris['species']
y_iris.shape
Output
(150,)
Now, for this regression example, we are going to use the following sample data:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(35)
x = 10*rng.rand(40)
y = 2*x-1+rng.randn(40)
plt.scatter(x,y);
Output
So, we have the above data for our linear regression example.
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model
Output
X = x[:, np.newaxis]
X.shape
Output
(40, 1)
Model fitting
Once we arrange the data, it is time to fit the model, i.e. to apply our model to the data. This can be done with the help of the fit() method as follows:
model.fit(X, y)
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
For this example, the below parameter shows the slope of the simple linear fit of the data:
model.coef_
Output
array([1.99839352])
The below parameter represents the intercept of the simple linear fit to the data:
model.intercept_
Output
-0.9895459457775022
The following is the complete executable program for the above regression example:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis = 1)
X_iris.shape
y_iris = iris['species']
y_iris.shape
rng = np.random.RandomState(35)
x = 10*rng.rand(40)
y = 2*x-1+rng.randn(40)
plt.scatter(x,y);
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model
X = x[:, np.newaxis]
X.shape
model.fit(X, y)
model.coef_
model.intercept_
xfit = np.linspace(-1, 11)
Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
plt.scatter(x, y)
plt.plot(xfit, yfit);
Like in the above example, we can load the iris dataset and this time apply an unsupervised method, dimensionality reduction, to it, following the steps below:
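The model-construction step, sketched below, mirrors the complete code given further down: PCA from sklearn.decomposition is chosen as the model class and instantiated with two components:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model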
Output
Model fitting
model.fit(X_iris)
Output
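Next, the fitted model is used to transform the data to two dimensions; this step is again sketched from the complete code given below:
X_2D = model.transform(X_iris)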
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot("PCA1", "PCA2", hue='species', data=iris, fit_reg=False);
Output
The following is the complete executable program for the above unsupervised example:
import numpy as np
import seaborn as sns
iris = sns.load_dataset('iris')
X_iris = iris.drop('species', axis=1)
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model
model.fit(X_iris)
X_2D = model.transform(X_iris)
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot("PCA1", "PCA2", hue='species', data=iris, fit_reg=False);
5. Scikit-Learn — Conventions
Scikit-learn's objects share a uniform basic API that consists of the following three complementary interfaces:
Estimator interface: It is for building and fitting the models.
Predictor interface: It is for making predictions.
Transformer interface: It is for converting data.
The APIs adopt simple conventions and the design choices have been guided in a manner
to avoid the proliferation of framework code.
Purpose of Conventions
The purpose of conventions is to make sure that the API sticks to the following broad principles:
Consistency: All the objects, whether they are basic or composite, must share a consistent interface which is further composed of a limited set of methods.
Various Conventions
The conventions available in Sklearn are explained below:
Type casting
It states that the input should be cast to float64. The following example, in which the sklearn.random_projection module is used to reduce the dimensionality of the data, will explain it:
import numpy as np
from sklearn import random_projection
rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
X.dtype
transformer = random_projection.GaussianRandomProjection()
X_new = transformer.fit_transform(X)
X_new.dtype
Output
dtype('float32')
dtype('float64')
In the above example, we can see that X is float32 which is cast to float64 by
fit_transform(X).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
clf.predict(X[:5])
Output
array([0, 0, 0, 0, 0])
Once the estimator has been constructed, above code will change the default kernel rbf
to linear via SVC.set_params().
Now, the following code will change back the kernel to rbf to refit the estimator and to
make a second prediction.
clf.set_params(kernel='rbf', gamma='scale').fit(X, y)
clf.predict(X[:5])
Output
array([0, 0, 0, 0, 0])
Complete code
The following is the complete executable program:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
clf.predict(X[:5])
clf.set_params(kernel='rbf', gamma='scale').fit(X, y)
clf.predict(X[:5])
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator=SVC(gamma='scale', random_state=0))
classif.fit(X, y).predict(X)
Output
array([0, 0, 1, 1, 2])
In the above example, the classifier is fit on a one-dimensional array of multiclass labels and the predict() method hence provides corresponding multiclass predictions. On the other hand, it is also possible to fit upon a two-dimensional array of binary label indicators as follows:
X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
y = LabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
Output
array([[0, 0, 0],
[0, 0, 0],
[0, 1, 0],
[0, 1, 0],
[0, 0, 0]])
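In the same way, an instance can be assigned several labels at once by fitting on a 2-D array of multilabel indicators. The label sets below are only illustrative; MultiLabelBinarizer turns them into such an indicator array before fitting:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]  # illustrative multilabel sets
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)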
Output
array([[1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 1, 0],
       [1, 0, 1, 1, 0],
       [1, 0, 1, 0, 0]])
6. Scikit-Learn — Linear Modeling
This chapter will help you in learning about the linear modeling in Scikit-Learn. Let us
begin by understanding what is linear regression in Sklearn.
Various linear models provided by Scikit-Learn are described below:
Linear Regression
It is one of the best statistical models that studies the relationship between a dependent variable (Y) and a given set of independent variables (X). The relationship can be established with the help of fitting a best-fit line.
Parameters
The following are the parameters used by the Linear Regression module:
fit_intercept: Boolean, optional, default = True
Used to calculate the intercept for the model. No intercept will be used in the calculation if this is set to False.
normalize: Boolean, optional, default = False
If this parameter is set to True, the regressor X will be normalized before regression. The normalization will be done by subtracting the mean and dividing it by the L2 norm. If fit_intercept = False, this parameter will be ignored.
copy_X: Boolean, optional, default = True
By default, it is True, which means X will be copied. But if it is set to False, X may be overwritten.
n_jobs: int or None, optional, default = None
It represents the number of jobs to use for the computation.
Attributes
The following are the attributes used by the Linear Regression module:
coef_: array, shape (n_features,) or (n_targets, n_features)
It is used to estimate the coefficients for the linear regression problem. It would be a 2D array of shape (n_targets, n_features) if multiple targets are passed during fit (i.e. y is 2D). On the other hand, it would be a 1D array of length (n_features) if only one target is passed during fit.
Implementation Example
First, import the required packages:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
regr = LinearRegression().fit(X, y)
regr.predict(np.array([[3, 5]]))
Output
array([16.])
To get the coefficient of determination of the prediction, we can use the score() method as follows:
regr.score(X,y)
Output
1.0
regr.coef_
Output
array([1., 2.])
We can calculate the intercept, i.e. the expected mean value of Y when all X = 0, by using the attribute named intercept_ as follows:
regr.intercept_
Output
3.0000000000000018
Logistic Regression
Logistic regression, despite its name, is a classification algorithm rather than a regression algorithm. Based on a given set of independent variables, it is used to estimate a discrete value (0 or 1, yes/no, true/false). It is also called logit or the MaxEnt Classifier.
Basically, it measures the relationship between the categorical dependent variable and one or more independent variables by estimating the probability of occurrence of an event using its logistic function.
Parameters
The following are the parameters used by the Logistic Regression module:
penalty: str, 'L1', 'L2', 'elasticnet' or none, optional, default = 'L2'
This parameter is used to specify the norm (L1 or L2) used in penalization (regularization).
dual: Boolean, optional, default = False
It is used for dual or primal formulation, whereas the dual formulation is only implemented for the L2 penalty.
tol: float, optional, default = 1e-4
It represents the tolerance for stopping criteria.
fit_intercept: Boolean, optional, default = True
This parameter specifies that a constant (bias or intercept) should be added to the decision function.
class_weight: dict or 'balanced', optional, default = None
It represents the weights associated with classes. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight = 'balanced', it will use the values of y to automatically adjust weights.
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data.
solver: str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, optional, default = 'liblinear'
This parameter represents which algorithm to use in the optimization problem.
max_iter: int, optional, default = 100
As the name suggests, it represents the maximum number of iterations taken for the solvers to converge.
multi_class: str, {'ovr', 'multinomial', 'auto'}, optional, default = 'ovr'
ovr: For this option, a binary problem is fit for each label.
multinomial: For this option, the loss minimized is the multinomial loss fit across the entire probability distribution. We can't use this option if solver = 'liblinear'.
auto: This option will select 'ovr' if solver = 'liblinear' or the data is binary, else it will choose 'multinomial'.
verbose: int, optional, default = 0
By default, the value of this parameter is 0, but for the liblinear and lbfgs solvers we should set verbose to any positive number.
warm_start: bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.
n_jobs: int or None, optional, default = None
If multi_class = 'ovr', this parameter represents the number of CPU cores used when parallelizing over classes. It is ignored when solver = 'liblinear'.
l1_ratio: float or None, optional, default = None
It is used in case penalty = 'elasticnet'. It is basically the Elastic-Net mixing parameter with 0 <= l1_ratio <= 1.
Attributes
The following are the attributes used by the Logistic Regression module:
intercept_: array, shape (1,) or (n_classes,)
It represents the constant, also known as bias, added to the decision function.
classes_: array, shape (n_classes,)
It will provide a list of class labels known to the classifier.
n_iter_: array, shape (n_classes,) or (1,)
It returns the actual number of iterations for all the classes.
Implementation Example
The following Python script provides a simple example of implementing logistic regression on the iris dataset of scikit-learn:
from sklearn.datasets import load_iris
from sklearn import linear_model
X, y = load_iris(return_X_y=True)
LRG = linear_model.LogisticRegression(random_state=0, solver='liblinear', multi_class='auto').fit(X, y)
LRG.score(X, y)
Output
0.96
The output shows that the above Logistic Regression model gave the accuracy of 96
percent.
Ridge Regression
Ridge regression or Tikhonov regularization is the regularization technique that performs L2 regularization. It modifies the loss function by adding a penalty (shrinkage quantity) equivalent to the square of the magnitude of the coefficients, i.e. it minimises:
||Xw − y||₂² + α·||w||₂²
Parameters
The following are the parameters used by the Ridge module:
alpha: {float, array-like}, shape (n_targets,)
Alpha is the tuning parameter that decides how much we want to penalize the model.
normalize: Boolean, optional, default = False
If this parameter is set to True, the regressor X will be normalized before regression. The normalization will be done by subtracting the mean and dividing it by the L2 norm. If fit_intercept = False, this parameter will be ignored.
copy_X: Boolean, optional, default = True
By default, it is True, which means X will be copied. But if it is set to False, X may be overwritten.
solver: str, {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'}
This parameter represents which solver to use in the computational routines.
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data.
Attributes
The following are the attributes used by the Ridge module:
n_iter_: array or None, shape (n_targets,)
Available only for the 'sag' and 'lsqr' solvers; it returns the actual number of iterations for each target.
Implementation Example
The following Python script provides a simple example of implementing Ridge Regression. We are using 15 samples and 10 features. The value of alpha is 0.5 in our case. There are two methods, namely fit() and score(), used to fit this model and calculate the score respectively.
import numpy as np
from sklearn.linear_model import Ridge
n_samples, n_features = 15, 10
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
rdg = Ridge(alpha=0.5)
rdg.fit(X, y)
rdg.score(X, y)
Output
0.76294987
The output shows that the above Ridge Regression model gave the score of around 76
percent. For more accuracy, we can increase the number of samples and features.
For the above example, we can get the weight vector with the help of following python
script:
rdg.coef_
Output
Similarly, we can get the value of intercept with the help of following python script:
rdg.intercept_
Output
0.527486
𝑝(𝑦|𝑋, 𝑤, 𝛼) = 𝑁(𝑦|𝑋𝑤 , 𝛼)
One of the most useful types of Bayesian regression is Bayesian Ridge regression, which estimates a probabilistic model of the regression problem. Here the prior for the coefficient w is given by a spherical Gaussian as follows:
p(w|λ) = N(w|0, λ⁻¹·I_p)
Parameters
The following are the parameters used by the BayesianRidge module:
fit_intercept: Boolean, optional, default = True
It decides whether to calculate the intercept for this model or not. No intercept will be used in the calculation if it is set to False.
tol: float, optional, default = 1.e-3
It represents the precision of the solution and will stop the algorithm if w has converged.
copy_X: Boolean, optional, default = True
By default, it is True, which means X will be copied. But if it is set to False, X may be overwritten.
verbose: Boolean, optional, default = False
By default, it is False, but if set to True, verbose mode will be enabled while fitting the model.
Attributes
The following are the attributes used by the BayesianRidge module:
coef_: array, shape = (n_features,)
This attribute provides the weight vectors.
scores_: array, shape = (n_iter_ + 1,)
It provides the value of the log marginal likelihood at each iteration of the optimisation. In the resulting score, the array starts with the value of the log marginal likelihood obtained for the initial values of α and λ, and ends with the value obtained for the estimated α and λ.
Implementation Example
The following Python script provides a simple example of fitting a Bayesian Ridge Regression model using sklearn's BayesianRidge module.
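The fitting code is sketched below; it assumes the small toy dataset X = [[0, 0], [1, 1], [2, 2], [3, 3]], y = [0, 1, 2, 3], which is consistent with the prediction and coefficients shown in the outputs that follow:
from sklearn.linear_model import BayesianRidge
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 1, 2, 3]
BayReg = BayesianRidge()
BayReg.fit(X, y)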
Output
From the above output, we can check model’s parameters used in the calculation.
Now, once fitted, the model can predict new values as follows:
BayReg.predict([[1,1]])
Output
array([1.00000007])
BayReg.coef_
Output
array([0.49999993, 0.49999993])
Parameters
The following are the parameters used by the Lasso module:
alpha: float, optional, default = 1.0
Alpha, the constant that multiplies the L1 term, is the tuning parameter that decides how much we want to penalize the model. The default value is 1.0.
tol: float, optional
This parameter represents the tolerance for the optimization. The tol value and the updates are compared, and if the updates are found smaller than tol, the optimization checks the dual gap for optimality and continues until it is smaller than tol.
normalize: Boolean, optional, default = False
If this parameter is set to True, the regressor X will be normalized before regression. The normalization will be done by subtracting the mean and dividing it by the L2 norm. If fit_intercept = False, this parameter will be ignored.
copy_X: Boolean, optional, default = True
By default, it is True, which means X will be copied. But if it is set to False, X may be overwritten.
warm_start: bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data.
Attributes
The following are the attributes used by the Lasso module:
n_iter_: int or array-like, shape (n_targets,)
It gives the number of iterations run by the coordinate descent solver to reach the specified tolerance.
Implementation Example
The following Python script uses the Lasso model, which further uses coordinate descent as the algorithm to fit the coefficients:
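The fitting code is sketched below; it assumes the small dataset X = [[0, 0], [1, 1], [2, 2]], y = [0, 1, 2] with alpha = 0.5, which is consistent with the prediction, coefficients and intercept shown in the outputs that follow:
from sklearn import linear_model
Lreg = linear_model.Lasso(alpha=0.5)
Lreg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])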
Output
Now, once fitted, the model can predict new values as follows:
Lreg.predict([[0,1]])
Output
array([0.75])
For the above example, we can get the weight vector with the help of following python
script:
Lreg.coef_
Output
array([0.25, 0. ])
Similarly, we can get the value of intercept with the help of following python script:
Lreg.intercept_
Output
0.75
We can get the total number of iterations to get the specified tolerance with the help of
following python script:
Lreg.n_iter_
Output
We can change the values of parameters to get the desired output from the model.
Multi-task LASSO
It allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskLasso, trained with a mixed L1/L2-norm for regularisation, which estimates sparse coefficients for multiple regression problems jointly. In this, the response y is a 2D array of shape (n_samples, n_tasks).
The parameters and the attributes for MultiTaskLasso are like those of Lasso. The only difference is in the alpha parameter. In Lasso, the alpha parameter is a constant that multiplies the L1 norm, whereas in Multi-task Lasso it is a constant that multiplies the L1/L2 term.
Implementation Example
The following Python script uses the MultiTaskLasso linear model, which further uses coordinate descent as the algorithm to fit the coefficients:
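The fitting code is sketched below; it assumes the same toy data as in the Lasso example, given as two identical tasks, which is consistent with the outputs that follow:
from sklearn import linear_model
MTLReg = linear_model.MultiTaskLasso(alpha=0.5)
MTLReg.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])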
Output
Now, once fitted, the model can predict new values as follows:
MTLReg.predict([[0,1]])
Output
array([[0.53033009, 0.53033009]])
For the above example, we can get the weight vector with the help of following python
script:
MTLReg.coef_
Output
array([[0.46966991, 0. ],
[0.46966991, 0. ]])
Similarly, we can get the value of intercept with the help of following python script:
MTLReg.intercept_
Output
array([0.53033009, 0.53033009])
We can get the total number of iterations to get the specified tolerance with the help of
following python script:
MTLReg.n_iter_
Output
We can change the values of parameters to get the desired output from the model.
Elastic-Net
The Elastic-Net is a regularised regression method that linearly combines both penalties, i.e. L1 and L2, of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features. The difference between Lasso and Elastic-Net lies in the fact that Lasso is likely to pick one of these features at random, while Elastic-Net is likely to pick both at once.
Sklearn provides a linear model named ElasticNet which is trained with both L1 and L2-norm regularisation of the coefficients. The advantage of such a combination is that it allows for learning a sparse model where few of the weights are non-zero, like the Lasso regularisation method, while still maintaining the regularisation properties of the Ridge regularisation method. The objective function to minimise is:
min_w (1/(2·n_samples))·||Xw − y||₂² + α·ρ·||w||₁ + (α·(1 − ρ)/2)·||w||₂²
Parameters
The following are the parameters used by the ElasticNet module:
alpha: float, optional, default = 1.0
Alpha, the constant that multiplies the L1/L2 term, is the tuning parameter that decides how much we want to penalize the model. The default value is 1.0.
tol: float, optional
This parameter represents the tolerance for the optimization. The tol value and the updates are compared, and if the updates are found smaller than tol, the optimization checks the dual gap for optimality and continues until it is smaller than tol.
normalise: Boolean, optional, default = False
If this parameter is set to True, the regressor X will be normalised before regression. The normalisation will be done by subtracting the mean and dividing it by the L2 norm. If fit_intercept = False, this parameter will be ignored.
copy_X: Boolean, optional, default = True
By default, it is True, which means X will be copied. But if it is set to False, X may be overwritten.
warm_start: bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialisation. If we choose the default, i.e. False, it will erase the previous solution.
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data.
Attributes
The following are the attributes used by the ElasticNet module:
coef_: array, shape (n_tasks, n_features)
This attribute provides the weight vectors.
Implementation Example
The following Python script uses the ElasticNet linear model, which further uses coordinate descent as the algorithm to fit the coefficients:
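The fitting code is sketched below; it assumes the same toy data as before with alpha = 0.5, which is consistent with the prediction, coefficients, intercept and iteration count shown in the outputs that follow:
from sklearn import linear_model
ENreg = linear_model.ElasticNet(alpha=0.5, random_state=0)
ENreg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])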
Output
Now, once fitted, the model can predict new values as follows:
ENreg.predict([[0,1]])
Output
array([0.73686077])
For the above example, we can get the weight vector with the help of following python
script:
ENreg.coef_
Output
array([0.26318357, 0.26313923])
Similarly, we can get the value of intercept with the help of following python script:
ENreg.intercept_
Output
0.47367720941913904
We can get the total number of iterations to get the specified tolerance with the help of
following python script:
ENreg.n_iter_
Output
15
We can change the value of alpha (towards 1) to see how the outputs of the model change.
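A sketch of re-fitting with alpha = 1 on the same data, consistent with the repr and outputs below, would be:
ENreg = linear_model.ElasticNet(alpha=1, random_state=0)
ENreg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])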
Output
ElasticNet(alpha=1, copy_X=True, fit_intercept=True, l1_ratio=0.5,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
Output
array([0.90909216])
#weight vectors
ENreg.coef_
Output
array([0.09091128, 0.09090784])
#Calculating intercept
ENreg.intercept_
Output
0.818180878658411
ENreg.n_iter_
Output
10
From the above examples, we can see the difference in the outputs.
MultiTaskElasticNet
It is an Elastic-Net model that allows fitting multiple regression problems jointly, enforcing the selected features to be the same for all the regression problems, also called tasks. Sklearn provides a linear model named MultiTaskElasticNet, trained with a mixed L1/L2-norm and L2 for regularisation, which estimates sparse coefficients for multiple regression problems jointly. In this, the response y is a 2D array of shape (n_samples, n_tasks). The objective function to minimise is:
min_W (1/(2·n_samples))·||XW − Y||²_Fro + α·ρ·||W||₂₁ + (α·(1 − ρ)/2)·||W||²_Fro
where the Frobenius norm and the mixed L1/L2 norm are defined as:
||A||_Fro = sqrt(Σᵢⱼ aᵢⱼ²)  and  ||A||₂₁ = Σᵢ sqrt(Σⱼ aᵢⱼ²)
The parameters and the attributes for MultiTaskElasticNet are like those of ElasticNet. The only difference is in l1_ratio, i.e. the ElasticNet mixing parameter. In MultiTaskElasticNet its range is 0 < l1_ratio <= 1. If l1_ratio = 1, the penalty would be the L1/L2 penalty. If l1_ratio = 0, the penalty would be an L2 penalty. If the value of l1_ratio is between 0 and 1, the penalty would be a combination of L1/L2 and L2.
Implementation Example
To show the difference, we are implementing the same example as we did in Multi-task
Lasso:
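A sketch of the fitting code, again using the two identical toy tasks from the Multi-task Lasso example with alpha = 0.5, consistent with the outputs below, would be:
from sklearn import linear_model
MTENReg = linear_model.MultiTaskElasticNet(alpha=0.5)
MTENReg.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])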
Output
MultiTaskElasticNet(alpha=0.5, copy_X=True, fit_intercept=True, l1_ratio=0.5,
max_iter=1000, normalize=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
Output
array([[0.69056563, 0.69056563]])
#weight vectors
MTENReg.coef_
Output
array([[0.30943437, 0.30938224],
[0.30943437, 0.30938224]])
#Calculating intercept
MTENReg.intercept_
Output
array([0.38118338, 0.38118338])
MTENReg.n_iter_
Output
15
7. Scikit-Learn — Extended Linear Modeling
This chapter focusses on the polynomial features and pipelining tools in Sklearn.
One such example is that a simple linear regression can be extended by constructing polynomial features from the data.
Mathematically, suppose we have a standard linear regression model; then for 2-D data it would look like this:
Y = w0 + w1·x1 + w2·x2
Now, we can combine the features in second-order polynomials and our model will look as follows:
Y = w0 + w1·x1 + w2·x2 + w3·x1·x2 + w4·x1² + w5·x2²
The above is still a linear model, since it is linear in the coefficients w. Hence, the resulting polynomial regression is in the same class of linear models and can be solved similarly.
Parameters
The following are the parameters used by the PolynomialFeatures module:
interaction_only: Boolean, default = False
By default, it is False, but if set to True, only the features that are products of at most degree distinct input features are produced. Such features are called interaction features.
include_bias: Boolean, default = True
It includes a bias column, i.e. the feature in which all polynomial powers are zero.
order: str in {'C', 'F'}, default = 'C'
This parameter represents the order of the output array in the dense case. 'F' order means it is faster to compute but, on the other hand, it may slow down subsequent estimators.
Attributes
The following are the attributes used by the PolynomialFeatures module:
powers_: array, shape (n_output_features, n_input_features)
powers_[i, j] is the exponent of the jth input in the ith output.
Implementation Example
The following Python script uses the PolynomialFeatures transformer to generate polynomial features from an array of 8 elements reshaped to (4, 2):
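A sketch of this example, assuming degree-2 polynomial features on np.arange(8).reshape(4, 2), would be:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
Y = np.arange(8).reshape(4, 2)
poly = PolynomialFeatures(degree=2)
poly.fit_transform(Y)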
Output
Example
The below Python script uses Scikit-learn's Pipeline tools to streamline the preprocessing (it will fit to order-3 polynomial data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))])
#Provide the size of array and order of polynomial data to fit the model.
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
Stream_model = model.fit(x[:, np.newaxis], y)
Stream_model.named_steps['linear'].coef_
Output
array([ 3., -2.,  1., -1.])
The above output shows that the linear model trained on polynomial features is able to
recover the exact input polynomial coefficients.
8. Scikit-Learn — Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression. It has been successfully applied to large-scale datasets because the update to the coefficients is performed for each training instance, rather than at the end of all instances.
SGD Classifier
Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning
routine supporting various loss functions and penalties for classification. Scikit-learn
provides SGDClassifier module to implement SGD classification.
Parameters
The following are the parameters used by the SGDClassifier module:
loss: str, default = 'hinge'
It represents the loss function to be used while implementing. The default value is 'hinge', which will give us a linear SVM.
penalty: str, 'none', 'l2', 'l1', 'elasticnet'
It is the regularization term used in the model. By default, it is L2. We can use L1 or 'elasticnet' as well, but both might bring sparsity to the model, which is not achievable with L2.
alpha: float, default = 0.0001
Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.
l1_ratio: float, default = 0.15
This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.
tol: float or none, optional, default = 1.e-3
This parameter represents the stopping criterion for iterations. If it is not set to None, the iterations will stop when loss > best_loss − tol for n_iter_no_change successive epochs.
shuffle: Boolean, optional, default = True
This parameter represents whether we want our training data to be shuffled after each epoch or not.
epsilon: float, default = 0.1
This parameter specifies the width of the insensitive region. If loss = 'epsilon-insensitive', any difference between the current prediction and the correct label that is less than the threshold would be ignored.
max_iter: int, optional, default = 1000
As the name suggests, it represents the maximum number of passes over the epochs, i.e. the training data.
warm_start: bool, optional, default = False
With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose the default, i.e. False, it will erase the previous solution.
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options:
int: In this case, random_state is the seed used by the random number generator.
RandomState instance: In this case, random_state is the random number generator.
None: In this case, the random number generator is the RandomState instance used by np.random.
n_jobs: int or None, optional, default = None
It represents the number of CPUs to be used in OVA (One Versus All) computation for multi-class problems. The default value is None, which means 1.
eta0: double, default = 0.0
It represents the initial learning rate for the above-mentioned learning rate options, i.e. 'constant', 'invscaling' or 'adaptive'.
early_stopping: bool, default = False
This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of training data as validation and stops training when the validation score is not improving.
class_weight: dict, {class_label: weight} or 'balanced', or None, optional
This parameter represents the weights associated with classes. If not provided, the classes are supposed to have weight 1.
average: Boolean or int, optional, default = False
Its default value is False, but when set to True, it calculates the averaged Stochastic Gradient Descent weights and stores the result in the coef_ attribute. On the other hand, if its value is set to an integer greater than 1, the averaging will begin once the total number of samples seen reaches that number.
Attributes
The following are the attributes used by the SGDClassifier module:
coef_: array, shape (1, n_features) if n_classes == 2, else (n_classes, n_features)
This attribute provides the weight assigned to the features.
Implementation Example
Like other classifiers, Stochastic Gradient Descent (SGD) has to be fitted with the following two arrays: an array X holding the training samples and an array Y holding the target values (class labels) for the training samples.
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
SGDClf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet")
SGDClf.fit(X, Y)
Output
Now, once fitted, the model can predict new values as follows:
SGDClf.predict([[2.,2.]])
Output
array([2])
For the above example, we can get the weight vector with the help of following python
script:
SGDClf.coef_
Output
array([[19.54811198, 9.77200712]])
Similarly, we can get the value of intercept with the help of following python script:
SGDClf.intercept_
Output
array([10.])
SGDClf.decision_function([[2., 2.]])
Output
array([68.6402382])
SGD Regressor
The Stochastic Gradient Descent (SGD) regressor basically implements a plain SGD learning routine supporting various loss functions and penalties to fit linear regression models. Scikit-learn provides the SGDRegressor module to implement SGD regression.
Parameters
The parameters used by SGDRegressor are almost the same as those used in the SGDClassifier module. The difference lies in the 'loss' parameter. For the SGDRegressor module's loss parameter, the possible values are 'squared_loss', 'huber', 'epsilon_insensitive' and 'squared_epsilon_insensitive'.
Another difference is that the parameter named 'power_t' has a default value of 0.25 rather than 0.5 as in SGDClassifier. Furthermore, it doesn't have the 'class_weight' and 'n_jobs' parameters.
Attributes
The attributes of SGDRegressor are also the same as those of the SGDClassifier module. In addition, it has three extra attributes as follows:
average_coef_: array, shape (n_features,)
It provides the average weights assigned to the features.
average_intercept_: array, shape (1,)
It provides the averaged intercept term.
t_: int
It provides the number of weight updates performed during the training phase.
Note: The attributes average_coef_ and average_intercept_ will work only after setting the parameter 'average' to True.
Implementation Example
Following Python script uses SGDRegressor linear model:
import numpy as np
from sklearn import linear_model
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
SGDReg = linear_model.SGDRegressor(max_iter=1000, penalty="elasticnet", loss='huber', tol=1e-3, average=True)
SGDReg.fit(X, y)
Output
verbose=0, warm_start=False)
Now, once fitted, we can get the weight vector with the help of following python script:
SGDReg.coef_
Output
Similarly, we can get the value of intercept with the help of following python script:
SGDReg.intercept_
Output
array([0.00678258])
We can get the number of weight updates during training phase with the help of the
following python script:
SGDReg.t_
Output
61.0
9. Scikit-Learn — Support Vector Machines (SVMs)
This chapter deals with a machine learning method termed as Support Vector Machines
(SVMs).
Introduction
Support vector machines (SVMs) are powerful yet flexible supervised machine learning methods used for classification, regression and outlier detection. SVMs are very efficient in high dimensional spaces and are generally used in classification problems. SVMs are popular and memory efficient because they use a subset of training points in the decision function.
The main goal of SVMs is to divide the dataset into classes in order to find a maximum marginal hyperplane (MMH), which can be done in the following two steps:
Support Vector Machines will first generate hyperplanes iteratively that separate the classes in the best way.
After that, it will choose the hyperplane that segregates the classes correctly.
Support Vectors: They may be defined as the datapoints which are closest to the hyperplane. Support vectors help in deciding the separating line.
Hyperplane: The decision plane or space that divides a set of objects having different classes.
Margin: The gap between two lines on the closest data points of different classes is called the margin.
Following diagrams will give you an insight about these SVM concepts:
[Figure: Class A and Class B points plotted against the X and Y axes, separated by a hyperplane, with the margin and the support vectors marked.]
SVM in Scikit-learn supports both sparse and dense sample vectors as input.
Classification of SVM
Scikit-learn provides three classes, namely SVC, NuSVC and LinearSVC, which can perform multiclass classification.
SVC
It is C-support vector classification whose implementation is based on libsvm. The module
used by scikit-learn is sklearn.svm.SVC. This class handles the multiclass support
according to one-vs-one scheme.
Parameters
The following are the parameters used by the sklearn.svm.SVC class:
kernel: string, optional, default = 'rbf'
This parameter specifies the type of kernel to be used in the algorithm. We can choose any one among 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. The default value of kernel is 'rbf'.
degree: int, optional, default = 3
It represents the degree of the 'poly' kernel function and will be ignored by all other kernels.
gamma: {'scale', 'auto'} or float, optional, default = 'scale'
It is the kernel coefficient for the kernels 'rbf', 'poly' and 'sigmoid'. If you choose the default, i.e. gamma = 'scale', then the value of gamma to be used by SVC is 1/(n_features * X.var()).
coef0: float, optional, default = 0.0
An independent term in the kernel function which is only significant in 'poly' and 'sigmoid'.
tol: float, optional, default = 1.e-3
This parameter represents the stopping criterion for iterations.
shrinking: Boolean, optional, default = True
This parameter represents whether we want to use the shrinking heuristic or not.
verbose: Boolean, default = False
It enables or disables verbose output. Its default value is False.
max_iter: int, optional, default = -1
As the name suggests, it represents the maximum number of iterations within the solver. The value -1 means there is no limit on the number of iterations.
cache_size: float, optional
This parameter will specify the size of the kernel cache. The value is in MB (megabytes).
random_state: int, RandomState instance or None, optional, default = None
This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options:
int: In this case, random_state is the seed used by the random number generator.
RandomState instance: In this case, random_state is the random number generator.
None: In this case, the random number generator is the RandomState instance used by np.random.
class_weight: {dict, 'balanced'}, optional
This parameter will set the parameter C of class j to class_weight[j]*C for SVC. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight = 'balanced', it will use the values of y to automatically adjust weights.
decision_function_shape: 'ovo', 'ovr', default = 'ovr'
This parameter will decide whether the algorithm will return an 'ovr' (one-vs-rest) decision function of shape as all other classifiers, or the original 'ovo' (one-vs-one) decision function of libsvm.
break_ties: Boolean, optional, default = False
True: predict will break ties according to the confidence values of decision_function.
False: predict will return the first class among the tied classes.
Attributes
The following are some of the attributes used by the sklearn.svm.SVC class:
dual_coef_: array, shape = [n_class-1, n_SV]
These are the coefficients of the support vectors in the decision function.
coef_: array, shape = [n_class * (n_class-1)/2, n_features]
This attribute, only available in the case of a linear kernel, provides the weight assigned to the features.
Implementation Example
Like other classifiers, SVC also has to be fitted with following two arrays:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
SVCClf = SVC(kernel='linear',gamma='scale', shrinking=False,)
SVCClf.fit(X, y)
Output
Now, once fitted, we can get the weight vector with the help of following python script
SVCClf.coef_
Output
array([[0.5, 0.5]])
SVCClf.predict([[-0.5,-0.8]])
Output
array([1])
SVCClf.n_support_
Output
array([1, 1])
SVCClf.support_vectors_
Output
array([[-1., -1.],
[ 1., 1.]])
SVCClf.support_
Output
array([0, 2])
SVCClf.intercept_
Output
array([-0.])
SVCClf.fit_status_
Output
NuSVC
NuSVC is Nu Support Vector Classification. It is another class provided by scikit-learn
which can perform multi-class classification. It is like SVC but NuSVC accepts slightly
different sets of parameters. The parameter which is different from SVC is as follows:
nu (float, optional, default = 0.5): It represents an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Its value should be in the interval (0, 1].
Implementation Example
We can implement the same example using sklearn.svm.NuSVC class also.
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import NuSVC
NuSVCClf = NuSVC(kernel='linear', gamma='scale', shrinking=False)
NuSVCClf.fit(X, y)
Output
We can get the outputs of the rest of the attributes as we did in the case of SVC.
LinearSVC
It is Linear Support Vector Classification. It is similar to SVC with kernel = 'linear'. The difference between them is that LinearSVC is implemented in terms of liblinear while SVC is implemented in terms of libsvm. That is the reason LinearSVC has more flexibility in the choice of penalties and loss functions. It also scales better to a large number of samples.
If we talk about its parameters and attributes, then it does not support 'kernel' because it is assumed to be linear, and it also lacks some of the attributes like support_, support_vectors_, n_support_, fit_status_ and dual_coef_.
Implementation Example
The following Python script uses the sklearn.svm.LinearSVC class:
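The listing itself is not reproduced in this copy. A minimal sketch that fits a classifier named LSVCClf on a synthetic four-feature dataset (the dataset and the parameter choices are assumptions, chosen only to be consistent with the predict call shown below) could look like this:
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
# Synthetic four-feature data (assumed; not the tutorial's original data)
X, y = make_classification(n_features=4, random_state=0)
LSVCClf = LinearSVC(dual=False, random_state=0, penalty='l1', tol=1e-5)
LSVCClf.fit(X, y)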
Output
Now, once fitted, the model can predict new values as follows:
LSVCClf.predict([[0,0,0,0]])
Output
[1]
For the above example, we can get the weight vector with the help of the following Python script:
LSVCClf.coef_
Output
Similarly, we can get the value of the intercept with the help of the following Python script:
LSVCClf.intercept_
Output
[0.26860518]
Like SVC, the model produced by SVR (Support Vector Regression) also depends only on a subset of the training data, because the cost function for building the model ignores any training data points close to the model prediction.
Scikit-learn provides three classes namely SVR, NuSVR and LinearSVR as three
different implementations of SVR.
SVR
It is Epsilon-Support Vector Regression whose implementation is based on libsvm. In contrast to SVC, there are two free parameters in the model, namely 'C' and 'epsilon'.
epsilon (float, optional, default = 0.1): It represents the epsilon in the epsilon-SVR model and specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
Implementation Example
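The original script is not shown in this copy. A minimal sketch, assuming a tiny two-feature training set chosen to be consistent with the coef_ and predict values reported below, would be:
import numpy as np
from sklearn.svm import SVR
# Assumed toy data; the original listing is not reproduced here
X = np.array([[1, 1], [2, 2]])
y = np.array([1, 2])
SVRReg = SVR(kernel='linear', gamma='auto')
SVRReg.fit(X, y)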
Output
Now, once fitted, we can get the weight vector with the help of the following Python script:
SVRReg.coef_
Output
array([[0.4, 0.4]])
SVRReg.predict([[1,1]])
Output
array([1.1])
NuSVR
NuSVR is Nu Support Vector Regression. It is like NuSVC, but NuSVR uses the parameter nu to control the number of support vectors. Moreover, unlike NuSVC, where nu replaced the C parameter, here it replaces epsilon.
Implementation Example
The following Python script uses the sklearn.svm.NuSVR class:
import numpy as np
from sklearn.svm import NuSVR
n_samples, n_features = 20, 15  # assumed sizes for the random training data
X, y = np.random.randn(n_samples, n_features), np.random.randn(n_samples)
NuSVRReg = NuSVR(kernel='linear', gamma='auto', C=1.0, nu=0.1)
NuSVRReg.fit(X, y)
Output
Now, once fitted, we can get the weight vector with the help of the following Python script:
NuSVRReg.coef_
Output
LinearSVR
It is Linear Support Vector Regression. It is similar to SVR with kernel = 'linear'. The difference between them is that LinearSVR is implemented in terms of liblinear, while SVR is implemented in terms of libsvm. That is the reason LinearSVR has more flexibility in the choice of penalties and loss functions. It also scales better to a large number of samples.
If we talk about its parameters and attributes, then it does not support 'kernel' because it is assumed to be linear, and it also lacks some of the attributes like support_, support_vectors_, n_support_, fit_status_ and dual_coef_.
loss (string, optional, default = 'epsilon_insensitive'): It represents the loss function, where the epsilon_insensitive loss is the L1 loss and the squared epsilon-insensitive loss is the L2 loss.
Implementation Example
The following Python script uses the sklearn.svm.LinearSVR class:
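Only the fit call survives in this copy; the setup below is a sketch, assuming a synthetic four-feature regression problem (consistent with the predict([[0, 0, 0, 0]]) call further down) and assumed LinearSVR parameters:
from sklearn.svm import LinearSVR
from sklearn.datasets import make_regression
# Assumed synthetic data with four features
X, y = make_regression(n_features=4, random_state=0)
LSVRReg = LinearSVR(dual=False, random_state=0, loss='squared_epsilon_insensitive', tol=1e-5)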
LSVRReg.fit(X, y)
Output
Now, once fitted, the model can predict new values as follows:
LSVRReg.predict([[0,0,0,0]])
Output
array([-0.01041416])
For the above example, we can get the weight vector with the help of the following Python script:
LSVRReg.coef_
Output
Similarly, we can get the value of the intercept with the help of the following Python script:
LSVRReg.intercept_
Output
array([-0.01041416])
10. Scikit-Learn ― Anomaly Detection
Here, we will learn about what anomaly detection is in Sklearn and how it is used in the identification of data points.
Anomaly detection is a technique used to identify data points in a dataset that do not fit well with the rest of the data. It has many applications in business such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, which are also called outliers, can be divided into three categories: point anomalies, contextual anomalies, and collective anomalies.
Methods
Two methods namely outlier detection and novelty detection can be used for anomaly
detection. It’s necessary to see the distinction between them.
Outlier detection
The training data contains outliers, i.e. observations that are far from the rest of the data. Outlier detection estimators therefore try to fit the region having the most concentrated training data while ignoring the deviant observations. It is also known as unsupervised anomaly detection.
Novelty detection
It is concerned with detecting an unobserved pattern in new observations which is not
included in training data. Here, the training data is not polluted by the outliers. It is also
known as semi-supervised anomaly detection.
There is a set of ML tools, provided by scikit-learn, which can be used for both outlier detection as well as novelty detection. These tools first learn a model from the data in an unsupervised way by using the fit() method as follows:
estimator.fit(X_train)
Now, the new observations would be sorted as inliers (labelled 1) or outliers (labelled -1) by using the predict() method as follows:
estimator.predict(X_test)
The estimator will first compute the raw scoring function and then the predict method will make use of a threshold on that raw scoring function. We can access this raw scoring function with the help of the score_samples method and can control the threshold with the contamination parameter.
We can also use the decision_function method, which defines outliers as negative values and inliers as non-negative values.
estimator.decision_function(X_test)
Elliptic Envelope
This object fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points. It ignores the points outside the central mode.
Parameters
The following table consists of the parameters used by the sklearn.covariance.EllipticEnvelope method:
Parameter Description
support_fraction (float in (0., 1.), optional, default = None): This parameter tells the method what proportion of points is to be included in the support of the raw MCD estimates.
random_state (int, RandomState instance or None, optional, default = None): This parameter represents the seed of the pseudo random number generator used while shuffling the data. The options are:
int: In this case, random_state is the seed used by the random number generator.
RandomState instance: In this case, random_state is the random number generator.
None: In this case, the random number generator is the RandomState instance used by np.random.
Attributes
The following table consists of the attributes used by the sklearn.covariance.EllipticEnvelope method:
Attributes Description
Implementation Example
import numpy as np
from sklearn.covariance import EllipticEnvelope
true_cov = np.array([[.5, .6],[.6, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)
# Now we can use the predict method. It will return 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0],[2, 2]])
Output
array([ 1, -1])
Isolation Forest
In the case of a high-dimensional dataset, one efficient way for outlier detection is to use random forests. The scikit-learn provides the ensemble.IsolationForest method that isolates the observations by randomly selecting a feature. Afterwards, it randomly selects a value between the maximum and minimum values of the selected feature.
Here, the number of splittings needed to isolate a sample is equivalent to the path length from the root node to the terminating node.
Parameters
The following table consists of the parameters used by the sklearn.ensemble.IsolationForest method:
Parameter Description
contamination (auto or float, optional, default = auto): It provides the proportion of the outliers in the data set. If we set it to the default, i.e. 'auto', it will determine the threshold as in the original paper. If set to a float, the contamination will be in the range of [0, 0.5].
random_state (int, RandomState instance or None, optional, default = None): This parameter represents the seed of the pseudo random number generator used while shuffling the data. The options are:
int: In this case, random_state is the seed used by the random number generator.
RandomState instance: In this case, random_state is the random number generator.
None: In this case, the random number generator is the RandomState instance used by np.random.
bootstrap (Boolean, optional, default = False): Its default option is False, which means the sampling would be performed without replacement. On the other hand, if set to True, it means individual trees are fit on a random subset of the training data sampled with replacement.
n_jobs (int or None, optional, default = None): It represents the number of jobs to be run in parallel for both the fit() and predict() methods.
verbose (int, optional, default = 0): This parameter controls the verbosity of the tree building process.
warm_start (Bool, optional, default = False): If warm_start = True, we can reuse the previous call's solution to fit and can add more estimators to the ensemble. But if it is set to False, we need to fit a whole new forest.
Attributes
The following table consists of the attributes used by the sklearn.ensemble.IsolationForest method:
Attributes Description
Implementation Example
The Python script below will use the sklearn.ensemble.IsolationForest method to fit 10 trees on the given data:
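The original listing is not reproduced here. A minimal sketch, assuming a small hand-made 2-D data set and the estimator name IFclf (both assumptions), would be:
import numpy as np
from sklearn.ensemble import IsolationForest
# Hypothetical data containing one obvious outlier
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
IFclf = IsolationForest(n_estimators=10, random_state=0)
IFclf.fit(X)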
Output
Local Outlier Factor
Parameters
The following table consists of the parameters used by the sklearn.neighbors.LocalOutlierFactor method:
Parameter Description
contamination (auto or float, optional, default = auto): It provides the proportion of the outliers in the data set. If we set it to the default, i.e. 'auto', it will determine the threshold as in the original paper. If set to a float, the contamination will be in the range of [0, 0.5].
p (int, optional, default = 2): It is the parameter for the Minkowski metric. p = 1 is equivalent to using manhattan_distance, i.e. L1, whereas p = 2 is equivalent to using euclidean_distance, i.e. L2.
novelty (Boolean, default = False): By default, the LOF algorithm is used for outlier detection, but it can be used for novelty detection if we set novelty = True.
n_jobs (int or None, optional, default = None): It represents the number of jobs to be run in parallel for both the fit() and predict() methods.
Attributes
The following table consists of the attributes used by the sklearn.neighbors.LocalOutlierFactor method:
Attributes Description
Implementation Example
The Python script given below will use the sklearn.neighbors.LocalOutlierFactor method to construct a neighbors-based estimator from the arrays corresponding to our data set:
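The listing itself is missing from this copy. A sketch, assuming a tiny 3-D sample set and the estimator name LOFneigh (assumptions chosen to match the query in the next step), would be:
from sklearn.neighbors import LocalOutlierFactor
# Hypothetical samples; the original data is not shown
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = LocalOutlierFactor(n_neighbors=1, algorithm="ball_tree")
LOFneigh.fit(samples)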
Output
Now, we can ask this constructed estimator which is the closest point to [0.5, 1., 1.5] by using the following Python script:
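The query is not shown either; with the sketch above it could be issued through the kneighbors method, which LocalOutlierFactor inherits:
LOFneigh.kneighbors([[0.5, 1., 1.5]])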
Output
One-Class SVM
The One-Class SVM, introduced by Schölkopf et al., is used for unsupervised outlier detection. It is also very efficient with high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module, in the sklearn.svm.OneClassSVM object. For defining a frontier, it requires a kernel (the mostly used one is RBF) and a scalar parameter.
For a better understanding, let's fit our data with the svm.OneClassSVM object:
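The fitting code is missing from this copy. A minimal sketch (the one-dimensional data set and the parameters are assumptions) consistent with the score_samples call below would be:
from sklearn.svm import OneClassSVM
# Hypothetical one-dimensional training data
X = [[0], [0.44], [0.45], [0.46], [1]]
OSVMclf = OneClassSVM(gamma='auto').fit(X)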
OSVMclf.score_samples(X)
Output
11. Scikit-Learn — K-Nearest Neighbors (KNN)
This chapter will help you in understanding the nearest neighbor methods in Sklearn.
Neighbor-based learning methods are of both types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification as well as regression predictive problems, but it is mainly used for classification predictive problems in industry.
Neighbors-based learning methods do not have a specialised training phase and use all the data for training while classifying. They also do not assume anything about the underlying data. That is the reason they are lazy and non-parametric in nature.
The main principle behind nearest neighbor methods is:
To find a predefined number of training samples closest in distance to the new data point
To predict the label from these training samples
Here, the number of samples can be a user-defined constant, like in K-nearest neighbor learning, or can vary based on the local density of points, like in radius-based neighbor learning.
sklearn.neighbors Module
Scikit-learn has the sklearn.neighbors module that provides functionality for both
unsupervised and supervised neighbors-based learning methods. As input, the classes in
this module can handle either NumPy arrays or scipy.sparse matrices.
Types of algorithms
Different types of algorithms which can be used in neighbor-based methods’
implementation are as follows:
Brute Force
The brute-force computation of distances between all pairs of points in the dataset provides the most naïve neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[D * N^2].
For small data samples, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute force neighbor search can be enabled by writing the keyword algorithm='brute'.
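For instance, a minimal way to request the brute-force strategy (the data here is purely illustrative) is:
from sklearn.neighbors import NearestNeighbors
# Illustrative data; any small 2-D array works
nn = NearestNeighbors(n_neighbors=2, algorithm='brute').fit([[0, 0], [1, 1], [2, 2]])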
K-D Tree
One of the tree-based data structures that have been invented to address the computational inefficiencies of the brute-force approach is the KD tree data structure. Basically, the KD tree is a binary tree structure which is called a K-dimensional tree. It recursively partitions the parameter space along the data axes by dividing it into nested orthotropic regions into which the data points are filed.
Advantages
Following are some advantages of K-D tree algorithm:
Construction is fast: As the partitioning is performed only along the data axes, the K-D tree's construction is very fast.
Less distance computations: This algorithm takes very few distance computations to determine the nearest neighbor of a query point. It only takes O[log(N)] distance computations.
Disadvantages
Fast for only low-dimensional neighbor searches: It is very fast for low-dimensional (D < 20) neighbor searches, but as and when D grows it becomes inefficient, because the partitioning is performed only along the data axes.
K-D tree neighbor searches can be enabled by writing the keyword algorithm='kd_tree'.
Ball Tree
As we know that KD Tree is inefficient in higher dimensions, hence, to address this
inefficiency of KD Tree, Ball tree data structure was developed. Mathematically, it
recursively divides the data, into nodes defined by a centroid C and radius r, in such a way
that each point in the node lies within the hyper-sphere defined by centroid C and radius
r. It uses the triangle inequality, given below, which reduces the number of candidate points for a neighbor search:
|X + Y| <= |X| + |Y|
Advantages
Following are some advantages of Ball Tree algorithm:
Efficient on highly structured data: As the ball tree partitions the data in a series of nesting hyper-spheres, it is efficient on highly structured data.
Out-performs KD-tree: The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of the ball tree nodes.
Disadvantages
Costly: Partitioning the data in a series of nesting hyper-spheres makes its construction very costly.
Data Structure
Another factor that affects the performance of these algorithms is the intrinsic dimensionality of the data or the sparsity of the data. This is because the query times of the Ball tree and KD tree algorithms can be greatly influenced by it, whereas the query time of the Brute Force algorithm is unchanged by the data structure. Generally, the Ball tree and KD tree algorithms produce faster query times when applied to sparser data with smaller intrinsic dimensionality.
12. Scikit-Learn ― KNN Learning
k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-
parametric and lazy in nature. Non-parametric means that there is no assumption for the
underlying data distribution i.e. the model structure is determined from the dataset. Lazy
or instance-based learning means that for the purpose of model generation, it does not
require any training data points and whole training data is used in the testing phase.
Step 1
In this step, it computes and stores the k nearest neighbors for each sample in the training
set.
Step 2
In this step, for an unlabeled sample, it retrieves the k nearest neighbors from dataset.
Then among these k-nearest neighbors, it predicts the class through voting (class with
majority votes wins).
On the other hand, the supervised neighbors-based learning is used for classification as
well as regression.
Scikit-learn module
sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised
nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree,
KDTree or Brute Force. In other words, it acts as a uniform interface to these three
algorithms.
Parameters
The following table consists of the parameters used by the NearestNeighbors module:
Parameter Description
n_neighbors (int, optional): The number of neighbors to get. The default value is 5.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional): This parameter will take the algorithm (BallTree, KDTree or Brute-force) you want to use to compute the nearest neighbors. If you provide 'auto', it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.
leaf_size (int, optional): It can affect the speed of the construction and query, as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.
metric (string or callable): The distance metric to use for the tree. The metrics from scipy.spatial.distance include: ['braycurtis', 'canberra', 'chebyshev', 'dice', 'hamming', 'jaccard', 'correlation', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'seuclidean', 'sqeuclidean', 'yule'].
metric_params (dict, optional): This is the additional keyword arguments for the metric function. The default value is None.
n_jobs (int or None, optional): It represents the number of parallel jobs to run for the neighbor search. The default value is None.
Implementation Example
The example below will find the nearest neighbors between two sets of data by using the
sklearn.neighbors.NearestNeighbors module.
Now, after importing the packages, define the sets of data between which we want to find the nearest neighbors:
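The imports and the construction of the NearestNeighbors estimator are not shown in this copy; a sketch consistent with the nrst_neigh name and the three-neighbor output below would be:
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Assumed: three neighbors, matching the three columns of indices shown below
nrst_neigh = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')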
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4,
5]])
nrst_neigh.fit(Input_data)
Now, find the K-neighbors of the data set. It will return the indices of, and the distances to, the neighbors of each point.
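The call itself is not reproduced; with the sketch above it would be:
distances, indices = nrst_neigh.kneighbors(Input_data)
indices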
Output
array([[0, 1, 3],
[1, 2, 0],
[2, 1, 0],
[3, 4, 0],
[4, 5, 3],
[5, 6, 4],
[6, 5, 4]], dtype=int64)
distances
Output
The above output shows that the nearest neighbor of each point is the point itself i.e. at
zero. It is because the query set matches the training set.
We can also show a connection between neighboring points by producing a sparse graph
as follows:
nrst_neigh.kneighbors_graph(Input_data).toarray()
Output
Once we fit the unsupervised NearestNeighbors model, the data will be stored in a data
structure based on the value set for the argument ‘algorithm’. After that we can use this
unsupervised learner’s kneighbors in a model which requires neighbor searches.
Nearest neighbor classification is computed from a simple majority vote of the nearest neighbors of each point. It simply stores instances of the training data; that is why it is a type of non-generalizing learning.
Scikit-learn modules
Followings are the two different types of nearest neighbor classifiers used by scikit-learn:
KNeighborsClassifier
The K in the name of this classifier represents the k nearest neighbors, where k is an
integer value specified by the user. Hence as the name suggests, this classifier implements
learning based on the k nearest neighbors. The choice of the value of k is dependent on
data. Let's understand it more with the help of an implementation example:
Implementation Example
In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn KNeighborsClassifier.
This data set has 50 samples for each different species (setosa, versicolor, virginica) of iris flower, i.e. a total of 150 samples.
For each sample, we have 4 features, named sepal length, sepal width, petal length and petal width.
First, import the dataset and print the feature names as follows:
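The listing is not shown here; a minimal version would be:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.feature_names)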
Output
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width
(cm)']
Now we can print the target, i.e. the integers representing the different species. Here 0 = setosa, 1 = versicolor and 2 = virginica.
print(iris.target)
Output
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
print(iris.target_names)
Output
We can check the number of observations and features with the help of the following line of code (the iris data set has 150 observations and 4 features):
print(iris.data.shape)
Output
(150, 4)
Now, we need to split the data into training and testing data. We will be using Sklearn
train_test_split function to split the data into the ratio of 70 (training data) and 30
(testing data):
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
Next, we will be doing data scaling with the help of the Sklearn preprocessing module as follows:
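The scaling code itself is not reproduced in this copy; a standard sketch (matching the recap at the end of this chapter) would be:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)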
The following lines of code will give you the shape of the train and test objects:
print(X_train.shape)
print(X_test.shape)
Output
(105, 4)
(45, 4)
The following lines of code will give you the shape of the new y objects:
print(y_train.shape)
print(y_test.shape)
Output
(105,)
(45,)
Now, we will plot the relationship between the values of K and the corresponding testing accuracy. It will be done using the matplotlib library.
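The code that trains a classifier for each value of K is missing from this copy; a sketch consistent with the k_range and scores_list names used in the plotting code below (and with the recap at the end of this example) would be:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
k_range = range(1, 15)
scores = {}
scores_list = []
for k in k_range:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    scores[k] = metrics.accuracy_score(y_test, y_pred)
    scores_list.append(metrics.accuracy_score(y_test, y_pred))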
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(k_range,scores_list)
plt.xlabel("Value of K")
plt.ylabel("Accuracy")
Output
Confusion Matrix:
[[15 0 0]
[ 0 15 0]
[ 0 1 14]]
Classification Report:
precision recall f1-score support
For the above model, we can choose the optimal value of K (any value between 6 and 14, as the accuracy is highest for this range), say 8, and retrain the model as follows:
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
Output
classes = {0:'setosa',1:'versicolor',2:'virginicia'}
x_new = [[1,1,1,1],[4,3,1.3,0.2]]
y_predict = classifier.predict(x_new)
print(classes[y_predict[0]])
print(classes[y_predict[1]])
Output
virginicia
virginicia
print(X_train.shape)
print(X_test.shape)
k_range = range(1, 15)
scores = {}
scores_list = []
for k in k_range:
classifier = KNeighborsClassifier(n_neighbors=k)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
scores[k] = metrics.accuracy_score(y_test,y_pred)
scores_list.append(metrics.accuracy_score(y_test,y_pred))
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
classes = {0:'setosa',1:'versicolor',2:'virginicia'}
x_new = [[1,1,1,1],[4,3,1.3,0.2]]
y_predict = classifier.predict(x_new)
print(classes[y_predict[0]])
print(classes[y_predict[1]])
RadiusNeighborsClassifier
The Radius in the name of this classifier represents the nearest neighbors within a specified
radius r, where r is a floating-point value specified by the user. Hence as the name
suggests, this classifier implements learning based on the number of neighbors within a fixed radius r of each training point. Let's understand it more with the help of an implementation example:
Implementation Example
In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn RadiusNeighborsClassifier:
Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80 (training data) to 20 (testing data):
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
Next, we will be doing data scaling with the help of the Sklearn preprocessing module as follows:
Next, import the RadiusNeighborsClassifier class from Sklearn and provide the value of radius as follows:
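The construction itself is not reproduced here; a sketch consistent with the rnc name used below (the radius value is an assumption) would be:
from sklearn.neighbors import RadiusNeighborsClassifier
rnc = RadiusNeighborsClassifier(radius=5)
rnc.fit(X_train, y_train)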
classes = {0:'setosa',1:'versicolor',2:'virginicia'}
x_new = [[1,1,1,1]]
y_predict = rnc.predict(x_new)
print(classes[y_predict[0]])
Output
versicolor
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
classes = {0:'setosa',1:'versicolor',2:'virginicia'}
x_new = [[1,1,1,1]]
y_predict = rnc.predict(x_new)
print(classes[y_predict[0]])
Followings are the two different types of nearest neighbor regressors used by scikit-learn:
KNeighborsRegressor
The K in the name of this regressor represents the k nearest neighbors, where k is an
integer value specified by the user. Hence, as the name suggests, this regressor
implements learning based on the k nearest neighbors. The choice of the value of k is
dependent on data. Let’s understand it more with the help of an implementation example:
Implementation Example
In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn KNeighborsRegressor.
Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80 (training data) to 20 (testing data):
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
Next, we will be doing data scaling with the help of Sklearn preprocessing module as
follows:
Next, import the KNeighborsRegressor class from Sklearn and provide the value of
neighbors as follows:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=8)
knnr.fit(X_train, y_train)
Output
Output
Output
[0.66666667]
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=8)
knnr.fit(X_train, y_train)
RadiusNeighborsRegressor
The Radius in the name of this regressor represents the nearest neighbors within a
specified radius r, where r is a floating-point value specified by the user. Hence as the
name suggests, this regressor implements learning based on the number of neighbors within a fixed radius r of each training point. Let's understand it more with the help of an implementation example:
Implementation Example
In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn RadiusNeighborsRegressor:
Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80 (training data) to 20 (testing data):
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
Next, we will be doing data scaling with the help of Sklearn preprocessing module as
follows:
Next, import the RadiusneighborsRegressor class from Sklearn and provide the value
of radius as follows:
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)
Output
Output
[1.]
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)
print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))
13. Scikit-Learn ― Classification with Naïve Bayes
Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the strong assumption that all the predictors are independent of each other, i.e. the presence of a feature in a class is independent of the presence of any other feature in the same class. This is a naïve assumption, which is why these methods are called Naïve Bayes methods.
Bayes theorem states the following relationship in order to find the posterior probability of a class, i.e. the probability of a label given some observed features, P(Y | features):
P(Y | features) = P(Y) * P(features | Y) / P(features)
The Scikit-learn provides different naïve Bayes classifier models, namely Gaussian, Multinomial, Complement and Bernoulli. All of them differ mainly by the assumption they make regarding the distribution of P(features | Y), i.e. the probability of the predictors given a class.
Model Description
Parameter Description
priors (array-like, shape (n_classes,)): It represents the prior probabilities of the classes. If we specify this parameter while fitting the data, then the prior probabilities will not be adjusted according to the data.
var_smoothing (float, optional, default = 1e-9): This parameter gives the portion of the largest variance of the features that is added to the variance in order to stabilize the calculation.
Attributes
The following table consists of the attributes used by the sklearn.naive_bayes.GaussianNB method:
Attributes Description
theta_ (array, shape (n_classes, n_features)): It gives the mean of each feature per class.
sigma_ (array, shape (n_classes, n_features)): It gives the variance of each feature per class.
Methods
The following table consists of the methods used by the sklearn.naive_bayes.GaussianNB method:
Method Description
Implementation Example
The Python script below will use sklearn.naive_bayes.GaussianNB method to construct
Gaussian Naïve Bayes Classifier from our data set:
import numpy as np
X = np.array([[-1, -1], [-2, -4], [-4, -6], [1, 2]])
Y = np.array([1, 1, 2, 2])
from sklearn.naive_bayes import GaussianNB
GNBclf = GaussianNB()
GNBclf.fit(X, Y)
Output
GaussianNB(priors=None, var_smoothing=1e-09)
Now, once fitted we can predict the new value by using predict() method as follows:
print(GNBclf.predict([[-0.5, 2]]))
Output
[2]
Multinomial Naïve Bayes
Parameters
The following table consists of the parameters used by the sklearn.naive_bayes.MultinomialNB method:
Parameter Description
alpha (float, optional, default = 1.0): It represents the additive smoothing parameter. If you choose 0 as its value, then there will be no smoothing.
fit_prior (Boolean, optional, default = True): It tells the model whether to learn class prior probabilities or not. The default value is True, but if set to False, the algorithm will use a uniform prior.
class_prior (array-like, size (n_classes,), optional, default = None): This parameter represents the prior probabilities of each class.
Attributes
The following table consists of the attributes used by the sklearn.naive_bayes.MultinomialNB method:
Attributes Description
intercept_ (array, shape (n_classes,)): Mirrors class_log_prior_ for interpreting the MultinomialNB model as a linear model.
coef_ (array, shape (n_classes, n_features)): Mirrors feature_log_prior_ for interpreting the MultinomialNB model as a linear model.
feature_count_ (array, shape (n_classes, n_features)): It provides the actual number of training samples encountered for each (class, feature).
Implementation Example
The Python script below will use the sklearn.naive_bayes.MultinomialNB method to construct a Multinomial Naïve Bayes classifier from our data set:
import numpy as np
X = np.random.randint(8, size=(8, 100))
y = np.array([1, 2, 3, 4, 5, 6, 7, 8])
from sklearn.naive_bayes import MultinomialNB
MNBclf = MultinomialNB()
MNBclf.fit(X, y)
Output
Now, once fitted, we can predict the new value by using the predict() method as follows:
print(MNBclf.predict(X[4:5]))
Output
[5]
Bernoulli Naïve Bayes
Parameters
The following table consists of the parameters used by the sklearn.naive_bayes.BernoulliNB method:
Parameter Description
alpha (float, optional, default = 1.0): It represents the additive smoothing parameter. If you choose 0 as its value, then there will be no smoothing.
binarize (float or None, optional, default = 0.0): With this parameter we can set the threshold for binarizing the sample features. Binarization here means mapping to Booleans. If you choose its value to be None, it means the input consists of binary vectors.
fit_prior (Boolean, optional, default = True): It tells the model whether to learn class prior probabilities or not. The default value is True, but if set to False, the algorithm will use a uniform prior.
class_prior (array-like, size (n_classes,), optional, default = None): This parameter represents the prior probabilities of each class.
Attributes
The following table consists of the attributes used by the sklearn.naive_bayes.BernoulliNB method:
Attributes Description
feature_count_ (array, shape (n_classes, n_features)): It provides the actual number of training samples encountered for each (class, feature).
Implementation Example
The Python script below will use the sklearn.naive_bayes.BernoulliNB method to construct a Bernoulli Naïve Bayes classifier from our data set:
import numpy as np
X = np.random.randint(10, size=(10, 1000))
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
from sklearn.naive_bayes import BernoulliNB
BNBclf = BernoulliNB()
BNBclf.fit(X, y)
Output
Now, once fitted, we can predict the new values by using the predict() method as follows:
print(BNBclf.predict(X[0:5]))
Output
[1 2 3 4 5]
Complement Naïve Bayes
Parameters
The following table consists of the parameters used by the sklearn.naive_bayes.ComplementNB method:
Parameter Description
alpha (float, optional, default = 1.0): It represents the additive smoothing parameter. If you choose 0 as its value, then there will be no smoothing.
fit_prior (Boolean, optional, default = True): It tells the model whether to learn class prior probabilities or not. The default value is True, but if set to False, the algorithm will use a uniform prior. This parameter is only used in the edge case with a single class in the training data set.
class_prior (array-like, size (n_classes,), optional, default = None): This parameter represents the prior probabilities of each class.
norm (Boolean, optional, default = False): It tells the model whether to perform a second normalization of the weights or not.
Attributes
The following table consists of the attributes used by the sklearn.naive_bayes.ComplementNB method:
Attributes Description
feature_count_ (array, shape (n_classes, n_features)): It provides the actual number of training samples encountered for each (class, feature).
Implementation Example
The Python script below will use the sklearn.naive_bayes.ComplementNB method to construct a Complement Naïve Bayes classifier from our data set:
import numpy as np
X = np.random.randint(15, size=(15, 1000))
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
from sklearn.naive_bayes import ComplementNB
CNBclf = ComplementNB()
CNBclf.fit(X, y)
Output
Now, once fitted, we can predict the new values by using the predict() method as follows:
print(CNBclf.predict(X[10:15]))
Output
[11 12 13 14 15]
Import Sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])
train, test, train_labels, test_labels =
train_test_split(features,labels,test_size = 0.40, random_state = 42)
from sklearn.naive_bayes import GaussianNB
GNBclf = GaussianNB()
model = GNBclf.fit(train, train_labels)
preds = GNBclf.predict(test)
print(preds)
Output
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1
0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1
1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1
0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1
1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0
1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1]
The above output consists of a series of 0s and 1s which are basically the predicted values
from tumor classes namely malignant and benign.
14. Scikit-Learn ― Decision Trees
In this chapter, we will learn about the learning method in Sklearn which is termed decision trees.
Decision trees (DTs) are the most powerful non-parametric supervised learning method. They can be used for classification and regression tasks. The main goal of DTs is to create a model predicting the target variable value by learning simple decision rules deduced from the data features. Decision trees have two main entities: one is the root node, where the data splits, and the other is the decision nodes or leaves, where we get the final outputs.
ID3
It was developed by Ross Quinlan in 1986. It is also called Iterative Dichotomiser 3. The
main goal of this algorithm is to find those categorical features, for every node, that will
yield the largest information gain for categorical targets.
It lets the tree grow to its maximum size and then, to improve the tree's ability on unseen data, applies a pruning step. The output of this algorithm would be a multiway tree.
C4.5
It is the successor to ID3 and dynamically defines a discrete attribute that partition the
continuous attribute value into a discrete set of intervals. That’s the reason it removed the
restriction of categorical features. It converts the ID3 trained tree into sets of ‘IF-THEN’
rules.
In order to determine the sequence in which these rules should applied, the accuracy of
each rule will be evaluated first.
C5.0
It works similar as C4.5 but it uses less memory and build smaller rulesets. It is more
accurate than C4.5.
CART
It is called the Classification and Regression Trees algorithm. It basically generates binary splits by using the feature and threshold yielding the largest information gain at each node (measured, for classification, by the Gini index).
Homogeneity depends upon the Gini index: the lower the value of the Gini index, the higher the homogeneity. It is like the C4.5 algorithm, but the difference is that it does not compute rule sets and it does support numerical target variables (regression).
Parameters
The following table consists of the parameters used by the sklearn.tree.DecisionTreeClassifier module:
Parameter Description
criterion (string, optional, default = "gini"): It represents the function to measure the quality of a split. Supported criteria are "gini" and "entropy". The default is "gini", which is for Gini impurity, while "entropy" is for the information gain.
splitter (string, optional, default = "best"): It tells the model which strategy, "best" or "random", to choose the split at each node.
max_depth (int or None, optional, default = None): This parameter decides the maximum depth of the tree. The default value is None, which means the nodes will expand until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split (int, float, optional, default = 2): This parameter provides the minimum number of samples required to split an internal node.
min_samples_leaf (int, float, optional, default = 1): This parameter provides the minimum number of samples required to be at a leaf node.
min_weight_fraction_leaf (float, optional, default = 0.): With this parameter, the model will get the minimum weighted fraction of the sum of weights required to be at a leaf node.
max_features (int, float, string or None, optional, default = None): It gives the model the number of features to be considered when looking for the best split.
random_state (int, RandomState instance or None, optional, default = None): This parameter represents the seed of the pseudo random number generator used while shuffling the data. The options are:
int: In this case, random_state is the seed used by the random number generator.
RandomState instance: In this case, random_state is the random number generator.
None: In this case, the random number generator is the RandomState instance used by np.random.
max_leaf_nodes (int or None, optional, default = None): This parameter will let the tree grow with max_leaf_nodes in a best-first fashion. The default is None, which means there would be an unlimited number of leaf nodes.
min_impurity_decrease (float, optional, default = 0.): This value works as a criterion for a node to split, because the model will split a node if this split induces a decrease of the impurity greater than or equal to the min_impurity_decrease value.
min_impurity_split (float, default = 1e-7): It represents the threshold for early stopping in tree growth.
class_weight (dict, list of dicts, "balanced" or None, default = None): It represents the weights associated with classes. The form is {class_label: weight}. If we use the default option, it means all the classes are supposed to have weight one. On the other hand, if you choose class_weight = "balanced", it will use the values of y to automatically adjust weights.
presort (bool, optional, default = False): It tells the model whether to presort the data to speed up the finding of best splits in fitting. The default is False, but if set to True, it may slow down the training process.
Attributes
The following table consists of the attributes used by the sklearn.tree.DecisionTreeClassifier module:
Attributes Description
feature_importances_ (array of shape = [n_features]): This attribute will return the feature importances.
classes_ (array of shape = [n_classes] or a list of such arrays): It represents the class labels, i.e. the single output problem, or a list of arrays of class labels, i.e. the multi-output problem.
n_classes_ (int or list): It represents the number of classes, i.e. the single output problem, or a list of the number of classes for every output, i.e. the multi-output problem.
Methods
The following table consists of the methods used by the sklearn.tree.DecisionTreeClassifier module:
Method Description
apply(self, X[, check_input]): This method will return the index of the leaf.
decision_path(self, X[, check_input]): As the name suggests, this method will return the decision path in the tree.
Implementation Example
The Python script below will use the sklearn.tree.DecisionTreeClassifier module to construct a classifier for predicting male or female from our data set having 25 samples and two features, namely 'height' and 'length of hair':
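The imports and the X array of feature values are not reproduced in this copy. The sketch below supplies the imports and a purely hypothetical set of 25 [height, length of hair] rows (invented stand-in values, one per label in Y) so that the rest of the listing can run:
from sklearn import tree
from sklearn.model_selection import train_test_split
# Hypothetical stand-in values; the tutorial's original numbers are not shown
X = [[180, 15], [167, 42], [136, 35], [174, 15], [141, 28],
     [177, 15], [138, 30], [184, 20], [161, 33], [181, 15],
     [148, 35], [176, 18], [160, 32], [139, 33], [142, 31],
     [185, 17], [153, 29], [144, 34], [179, 16], [152, 30],
     [150, 36], [175, 20], [182, 18], [143, 32], [155, 35]]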
Y=['Man','Woman','Woman','Man','Woman','Man','Woman','Man','Woman','Man','Woman
','Man','Woman','Woman','Woman','Man','Woman','Woman','Man', 'Woman', 'Woman',
'Man', 'Man', 'Woman', 'Woman']
data_feature_names = ['height','length of hair']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
DTclf = tree.DecisionTreeClassifier()
DTclf = DTclf.fit(X, Y)
prediction = DTclf.predict([[135,29]])
print(prediction)
Output
['Woman']
We can also predict the probability of each class by using the predict_proba() method as follows:
prediction = DTclf.predict_proba([[135,29]])
print(prediction)
Output
[[0. 1.]]
Parameters
The parameters used by DecisionTreeRegressor are almost the same as those used in the DecisionTreeClassifier module. The difference lies in the 'criterion' parameter. For the DecisionTreeRegressor module, the 'criterion (string, optional, default = "mse")' parameter has the following values:
mse: It stands for the mean squared error. It is equal to variance reduction as the feature selection criterion. It minimises the L2 loss using the mean of each terminal node.
friedman_mse: It also uses mean squared error but with Friedman's improvement score.
mae: It stands for the mean absolute error. It minimizes the L1 loss using the median of each terminal node.
Attributes
The attributes of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'classes_' and 'n_classes_' attributes.
Methods
The methods of DecisionTreeRegressor are also the same as those of the DecisionTreeClassifier module. The difference is that it does not have the 'predict_log_proba()' and 'predict_proba()' methods.
Implementation Example
The fit() method in the decision tree regression model will take floating point values of y. Let's see a simple implementation example using sklearn.tree.DecisionTreeRegressor:
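The fitting code is not reproduced here. A minimal sketch (the tiny training set is a hypothetical stand-in, chosen only to be consistent with the prediction shown below) would be:
from sklearn.tree import DecisionTreeRegressor
# Hypothetical training data with floating point targets
X = [[1, 1], [5, 5]]
y = [0.1, 1.5]
DTreg = DecisionTreeRegressor(random_state=0)
DTreg.fit(X, y)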
Once fitted, we can use this regression model to make prediction as follows:
DTreg.predict([[4, 5]])
Output
array([1.5])
15. Scikit-Learn ― Randomized Decision Trees
This chapter will help you in understanding randomized decision trees in Sklearn.
Random Forest
Here, 'max_features' is the size of the random subsets of features to consider when splitting a node. If we choose this parameter's value to be None, then it will consider all the features rather than a random subset. On the other hand, n_estimators is the number of trees in the forest. The higher the number of trees, the better the result will be. But it will also take longer to compute.
Implementation example
In the following example, we are building a random forest classifier by using sklearn.ensemble.RandomForestClassifier and also checking its accuracy by using the cross_val_score module.
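The listing is not reproduced in this copy. A sketch along the lines of the scikit-learn ensemble example (the make_blobs dataset and parameter values are assumptions, chosen to be consistent with the accuracy shown below) would be:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
# Assumed synthetic dataset
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
RFclf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(RFclf, X, y, cv=5)
scores.mean()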
Output
0.9997
We can also use the sklearn dataset to build Random Forest classifier. As in the following
example we are using iris dataset. We will also find its accuracy score and confusion
matrix.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix,
accuracy_score
path = "https://archive.ics.uci.edu/ml/machine-learning-
databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width',
'Class']
dataset = pd.read_csv(path, names=headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
RFclf = RandomForestClassifier(n_estimators=50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[14 0 0]
[ 0 18 1]
[ 0 0 12]]
Classification Report:
precision recall f1-score support
Accuracy: 0.9777777777777777
Implementation example
In the following example, we are building a random forest regressor by using sklearn.ensemble.RandomForestRegressor and also predicting new values by using the predict() method.
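The construction of the regressor is not shown in this copy. A sketch, assuming a synthetic ten-feature regression problem (consistent with the ten-value predict call below), would be:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
# Assumed synthetic data with ten features
X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)
RFregr = RandomForestRegressor(max_depth=10, random_state=0, n_estimators=100)
RFregr.fit(X, y)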
Output
print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output
[98.47729198]
Extra-Tree Methods
For each feature under consideration, it selects a random value for the split. The benefit of using extra-tree methods is that they allow reducing the variance of the model a bit more. The disadvantage of using these methods is that they slightly increase the bias.
Implementation example
In the following example, we are building an extra-trees classifier by using sklearn.ensemble.ExtraTreesClassifier and also checking its accuracy by using the cross_val_score module.
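The listing is missing here; a sketch parallel to the random forest example above (same assumed make_blobs data) would be:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
# Assumed synthetic dataset, as in the random forest sketch above
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
ETclf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(ETclf, X, y, cv=5)
scores.mean()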
Output
1.0
We can also build a classifier using the Extra-Tree method on other datasets. In the following example, we are using the Pima-Indian dataset.
Output
0.7551435406698566
Implementation example
In the following example, we are applying sklearn.ensemble.ExtraTreesregressor and
on the same data as we used while creating random forest regressor. Let’s see the
difference in the Output
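Again the listing itself is missing; a sketch using the same assumed data as the random forest regressor sketch would be:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression
# Same assumed synthetic data as in the random forest regressor sketch
X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)
ETregr = ExtraTreesRegressor(n_estimators=100, random_state=0)
ETregr.fit(X, y)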
Output
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=0, verbose=0, warm_start=False)
print(ETregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output
[85.50955817]
16. Scikit-Learn ― Boosting Methods
In this chapter, we will learn about the boosting methods in Sklearn, which enables
building an ensemble model.
Boosting methods build an ensemble model in an incremental way. The main principle is to build the model incrementally by training each base model estimator sequentially. In order to build a powerful ensemble, these methods basically combine several weak learners which are sequentially trained over multiple iterations of the training data. The sklearn.ensemble module has the following two boosting methods.
AdaBoost
It is one of the most successful boosting ensemble methods, whose main key is in the way they give weights to the instances in the dataset: subsequent models pay less attention to the instances that are already classified correctly and more attention to the wrongly classified ones.
Implementation example
In the following example, we are building an AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier and also predicting and checking its score.
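The listing is missing from this copy. A sketch consistent with the fitted-estimator repr and the ten-feature predict call shown below (the make_classification setup is an assumption) would be:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
# Assumed synthetic ten-feature classification problem
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
ADBclf = AdaBoostClassifier(n_estimators=100, random_state=0)
ADBclf.fit(X, y)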
Output
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
learning_rate=1.0, n_estimators=100, random_state=0)
print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output
[1]
ADBclf.score(X, y)
Output
0.995
We can also build a classifier with the AdaBoost method on other datasets. For example, in the example given below, we are using the Pima-Indian dataset.
Output
0.7851435406698566
Implementation example
In the following example, we are building an AdaBoost regressor by using sklearn.ensemble.AdaBoostRegressor and also predicting new values by using the predict() method.
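The construction is not reproduced here. A sketch, assuming a synthetic ten-feature regression problem like the earlier regressor examples, would be:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
# Assumed synthetic data with ten features
X, y = make_regression(n_features=10, n_informative=2, random_state=0, shuffle=False)
ADBregr = AdaBoostRegressor(random_state=0, n_estimators=100)
ADBregr.fit(X, y)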
Output
print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
Output
[85.50955817]
Gradient Tree Boosting
It builds an additive model of decision trees by optimizing a differentiable loss function. If we choose the 'loss' parameter's value to be 'exponential', then it recovers the AdaBoost algorithm. The parameter n_estimators will control the number of weak learners. A hyper-parameter named learning_rate (in the range of (0.0, 1.0]) will control overfitting via shrinkage.
Implementation example
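Only the classifier construction survives in this copy; the data setup below is a sketch (the make_hastie_10_2 dataset and the split sizes are assumptions chosen to match the X_train/X_test names used next):
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
# Assumed synthetic dataset and train/test split
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]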
GDBclf = GradientBoostingClassifier(n_estimators=50,
learning_rate=1.0,max_depth=1, random_state=0).fit(X_train, y_train)
GDBclf.score(X_test, y_test)
Output
0.8724285714285714
We can also build a classifier using the Gradient Boosting Classifier on other datasets. As in the following example, we are using the Pima-Indian dataset.
Output
0.7946582356674234
Implementation example
In the following example, we are building a Gradient Boosting regressor by using sklearn.ensemble.GradientBoostingRegressor and also finding the mean squared error by using the mean_squared_error() method.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_friedman1(n_samples=2000, random_state=0, noise=1.0)
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]
GDBreg = GradientBoostingRegressor(n_estimators=80, learning_rate=0.1,
max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
mean_squared_error(y_test, GDBreg.predict(X_test))
Output
5.391246106657164
17. Scikit-Learn ― Clustering Methods
Here, we will study about the clustering methods in Sklearn which will help in identification
of any similarity in the data samples.
Clustering methods, one of the most useful unsupervised ML methods, are used to find similarity and relationship patterns among data samples. After that, they cluster those samples into groups having similarity based on features. Clustering determines the intrinsic grouping among the present unlabeled data; that is why it is important.
KMeans
This algorithm computes the centroids and iterates until it finds optimal centroid. It
requires the number of clusters to be specified that’s why it assumes that they are already
known. The main logic of this algorithm is to cluster the data, separating samples into n groups of equal variances, by minimizing a criterion known as the inertia. The number of clusters identified by the algorithm is represented by 'K'.
Affinity Propagation
This algorithm is based on the concept of 'message passing' between different pairs of samples until convergence. It does not require the number of clusters to be specified before running the algorithm. The algorithm has a time complexity of the order O(N^2 * T), which is its biggest disadvantage.
Mean Shift
This algorithm mainly discovers blobs in a smooth density of samples. It assigns the data points to the clusters iteratively by shifting points towards the highest density of data points. It automatically sets the number of clusters, relying instead on a parameter named bandwidth, which dictates the size of the region to search through.
Spectral Clustering
Before clustering, this algorithm basically uses the eigenvalues, i.e. the spectrum, of the similarity matrix of the data to perform dimensionality reduction into fewer dimensions. The use of this algorithm is not advisable when there is a large number of clusters.
Hierarchical Clustering
This algorithm builds nested clusters by merging or splitting the clusters successively. This cluster hierarchy is represented as a dendrogram, i.e. a tree. It falls into the following two categories:
Agglomerative hierarchical algorithms: In this kind of hierarchical algorithm, every data point is initially treated as a single cluster, and the clusters are then successively merged using a bottom-up approach.
Divisive hierarchical algorithms: In this hierarchical algorithm, all data points are treated as one big cluster. In this, the process of clustering involves dividing, by using a top-down approach, the one big cluster into various small clusters.
DBSCAN
It stands for "Density-based spatial clustering of applications with noise". This algorithm is based on the intuitive notion of 'clusters' and 'noise': clusters are dense regions in the data space, separated by regions of lower density of data points.
A higher value of the parameter min_samples or a lower value of the parameter eps gives an indication of the higher density of data points which is necessary to form a cluster.
OPTICS
It stands for "Ordering points to identify the clustering structure". This algorithm also finds density-based clusters in spatial data. Its basic working logic is like that of DBSCAN.
BIRCH
It stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is used to perform hierarchical clustering over large data sets. It builds a tree named CFT, i.e. a Characteristics Feature Tree, for the given data.
The advantage of the CFT is that the data nodes, called CF (Characteristics Feature) nodes, hold the necessary information for clustering, which further prevents the need to hold the entire input data in memory.
In this example, we will apply K-Means clustering to the digits dataset to group the images
of handwritten digits (0 to 9) without using the original label information.
%matplotlib inline
import matplotlib.pyplot as plt
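The loading step might look like the following minimal sketch; it assumes the built-in
scikit-learn digits dataset, which produces the shape reported below.
from sklearn.datasets import load_digits

digits = load_digits()
digits.data.shape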
Output
(1797, 64)
This output shows that the digits dataset has 1797 samples with 64 features.
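The clustering step could be sketched as follows; the choice of 10 clusters and a fixed
random_state are assumptions made for reproducibility.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape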
Output
(10, 64)
This output shows that K-Means clustering created 10 cluster centers, each with 64
features.
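One way to visualise the 10 learned centers as 8x8 images is sketched below; the figure
layout is illustrative, and the variable names continue the example above.
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)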
Output
The output below consists of images showing the cluster centers learned by K-Means
clustering.
Next, the Python script below will match the learned cluster labels (from K-Means) with
the true labels found in the data:
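One common way to do this, sketched below, is to assign to each learned cluster the most
frequent true digit among its members. The use of scipy.stats.mode is an assumption, and
the variable names continue the example above.
import numpy as np
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]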
We can also check the accuracy with the help of the command mentioned below.
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)
Output
0.7935447968836951
18. Scikit-Learn ― Clustering Performance Evaluation
There are various functions with the help of which we can evaluate the performance of
clustering algorithms.
Following are some important and commonly used functions provided by Scikit-learn for
evaluating clustering performance:
Adjusted Rand Index
The Adjusted Rand Index computes a similarity measure between two clusterings,
adjusted for chance. It has two parameters, namely labels_true, which is the ground-truth
class labels, and labels_pred, which are the cluster labels to evaluate.
Example
from sklearn.metrics.cluster import adjusted_rand_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_rand_score(labels_true, labels_pred)
Output
0.4444444444444445
Perfect labelling is scored 1, while bad or independent labelling is scored 0 or negative.
Mutual Information Based Score
It computes the agreement of the two assignments, ignoring permutations. Scikit-learn
provides two versions of it, the Normalized Mutual Information (NMI) and the Adjusted
Mutual Information (AMI).
Normalized Mutual Information (NMI)
Example
from sklearn.metrics.cluster import normalized_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
normalized_mutual_info_score(labels_true, labels_pred)
Output
0.7611702597222881
Adjusted Mutual Information (AMI)
Example
from sklearn.metrics.cluster import adjusted_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
adjusted_mutual_info_score(labels_true, labels_pred)
Output
0.4444444444444448
Fowlkes-Mallows Score
The Fowlkes-Mallows function measures the similarity of two clusterings of a set of points.
It may be defined as the geometric mean of the pairwise precision and recall.
Mathematically,
FMS = TP / √((TP + FP)(TP + FN))
Here, TP = True Positive: the number of pairs of points belonging to the same cluster in
both the true and the predicted labels.
FP = False Positive: the number of pairs of points belonging to the same cluster in the
true labels but not in the predicted labels.
FN = False Negative: the number of pairs of points belonging to the same cluster in the
predicted labels but not in the true labels.
Example
from sklearn.metrics.cluster import fowlkes_mallows_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
fowlkes_mallows_score(labels_true, labels_pred)
Output
0.6546536707079771
Silhouette Coefficient
The Silhouette function computes the mean Silhouette Coefficient of all samples, using
the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each
sample.
Mathematically,
S = (b − a) / max(a, b)
Example
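A minimal sketch of computing the Silhouette Coefficient, assuming a synthetic make_blobs
dataset clustered with KMeans; because the data and seed here are illustrative, the
resulting score will not necessarily match the value shown below.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data and clustering.
X, y = make_blobs(n_samples=500, centers=4, cluster_std=0.40, random_state=0)
kmeans_model = KMeans(n_clusters=4, random_state=1).fit(X)
labels = kmeans_model.labels_
silhouette_score(X, labels, metric='euclidean')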
Output
0.5528190123564091
Contingency Matrix
This matrix reports the intersection cardinality for every (true, predicted) cluster pair. The
confusion matrix for classification problems is a square contingency matrix.
Example
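A sketch consistent with the matrix and the explanation below, assuming for illustration
the true labels ["a", "a", "a", "b", "b", "b"] and the predicted labels [1, 1, 2, 0, 1, 2].
from sklearn.metrics.cluster import contingency_matrix

# Illustrative true and predicted labels.
x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]
contingency_matrix(x, y)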
Output
array([[0, 2, 1],
[1, 1, 1]])
The first row of the above output shows that, among the three samples whose true cluster
is “a”, none of them is in cluster 0, two of them are in cluster 1 and one is in cluster 2. The
second row shows that, among the three samples whose true cluster is “b”, one is in
cluster 0, one is in cluster 1 and one is in cluster 2.
19. Scikit-Learn ― Dimensionality Reduction using PCA
Exact PCA
Principal Component Analysis (PCA) is used for linear dimensionality reduction, using
Singular Value Decomposition (SVD) of the data to project it to a lower-dimensional
space. When decomposing with PCA, the input data is centered but not scaled for each
feature before the SVD is applied.
Example
The example below will use the sklearn.decomposition.PCA module to find the best 5
principal components of the Pima Indians Diabetes dataset.
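A sketch of such a script is given below; it assumes the Pima Indians Diabetes data has
been downloaded as a local CSV file, and the file name and column names used here are
illustrative.
from pandas import read_csv
from sklearn.decomposition import PCA

# Assumption: local copy of the Pima Indians Diabetes CSV; the file and column
# names are illustrative.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('pima-indians-diabetes.data.csv', names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
pca = PCA(n_components=5)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)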
Output
Incremental PCA
Incremental Principal Component Analysis (IPCA) is used to address the biggest
limitation of Principal Component Analysis (PCA), namely that PCA only supports batch
processing, which means all the input data to be processed must fit in memory.
As with PCA, when decomposing with IPCA, the input data is centered but not scaled for
each feature before the SVD is applied.
Example
The example below will use the sklearn.decomposition.IncrementalPCA module on the
scikit-learn digits dataset.
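A minimal sketch of such a script, assuming n_components=10 and a batch size of 100,
consistent with the shape reported below.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)
transformer = IncrementalPCA(n_components=10, batch_size=100)
X_transformed = transformer.fit_transform(X)
X_transformed.shape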
Output
(1797, 10)
Here, we can partially fit on smaller batches of data (as we did with 100 samples per
batch), or we can let the fit() function divide the data into batches.
Kernel PCA
Kernel Principal Component Analysis, an extension of PCA, achieves non-linear
dimensionality reduction using kernels. It supports both transform and
inverse_transform.
Example
The example below will use the sklearn.decomposition.KernelPCA module on the
scikit-learn digits dataset. We are using the sigmoid kernel.
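A minimal sketch of such a script, assuming n_components=10 with the sigmoid kernel,
consistent with the shape reported below.
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

X, _ = load_digits(return_X_y=True)
transformer = KernelPCA(n_components=10, kernel='sigmoid')
X_transformed = transformer.fit_transform(X)
X_transformed.shape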
Output
(1797, 10)
PCA using Randomized SVD
Example
The example below will use the sklearn.decomposition.PCA module with the optional
parameter svd_solver='randomized' to find the best 7 principal components of the Pima
Indians Diabetes dataset. The loading step assumes the dataset is available as a local CSV
file.
# Assumption: local copy of the Pima Indians Diabetes CSV; the file and column
# names are illustrative.
from pandas import read_csv
from sklearn.decomposition import PCA
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = read_csv('pima-indians-diabetes.data.csv', names=names).values
X = array[:, 0:8]
Y = array[:, 8]
pca = PCA(n_components=7, svd_solver='randomized')
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
Output