10
Large-Scale Machine Learning – Online Learning
In this chapter, we will see the following recipes:

- Using perceptron as an online learning algorithm
- Using stochastic gradient descent for regression
- Using stochastic gradient descent for classification

Introduction
In this chapter, we will concentrate on large-scale machine learning and the algorithms suited
to tackle such large-scale problems. Till now, when we trained all our models, we assumed
that our training set can fit into our computer's memory. In this chapter, we will see how to go
about building models when this assumption is no longer satisfied. Our training records are of
a huge size and so we cannot fit them completely into our memory. We may have to load them
piecewise and still produce a model with good accuracy. The argument of a training set not
fitting into our computer memory can be extrapolated to streaming data. With streaming data,
we don't see all the data at once. We should be able to make decisions based on whatever
data we are exposed to and also have a mechanism for continuously improving our model as
new data arrives.
We will introduce the framework of the stochastic gradient descent-based algorithms. This is
a versatile framework to handle very large-scale datasets that will not fit completely into our
memory. Several types of linear algorithms, including logistic regression, linear regression,
and linear SVM, can be accommodated using this framework. The kernel trick, which we
introduced in our previous chapter, can be included in this framework in order to deal with
datasets with nonlinear relationships.
We will begin our list of recipes with the perceptron algorithm, one of the oldest machine learning
algorithms. Perceptron is easy to understand and implement. However, perceptron is limited to
solving only linearly separable problems; a kernel-based perceptron can be used to handle nonlinear datasets.
In our second recipe, we will formally introduce the framework of gradient descent-based
methods and how it can be used to perform regression-based tasks. We will look at different
loss functions to see how different types of linear models can be built using these functions. We
will also see how perceptron belongs to the family of stochastic gradient descent.
In our final recipe, we will see how classification algorithms can be built using the stochastic
gradient descent framework.
Even though we don't have a direct example of streaming data, with our existing datasets, we
will see how the streaming data use cases can be addressed. Online learning algorithms are
not limited to streaming data; they can be applied to batch data as well, except that they process
only one instance at a time.
Using perceptron as an online learning algorithm

Let's revisit our definition of a classification problem. Each record or instance can be written
as a set (X, y), where X is a set of attributes and y is a corresponding class label.
Learning a target function, F, that maps each record's attribute set to one of the predefined
class labels, y, is the job of a classification algorithm.
The difference in our case is that we have a large-scale learning problem. All our data will not
fit into our main memory. So, we need to keep our data on a disk and use only a portion of it
at a time in order to build our perceptron model.
Let's consider the following scenario. We have the complete dataset on our disk. In a single
epoch, that is, in one full pass of the prediction and weight-update steps described next, all of
the data on our disk is processed. In an online learning scenario, a bunch of instances, chosen by
a windowing function, will be available to us at any point in time. We can update the weights as
many times as there are instances in our window in a single epoch.
Each instance in our dataset has m attributes:

$$X_i = \{x_1, x_2, x_3, \ldots, x_m\}, \quad i = 1 \text{ to } n$$

Our Y is as follows:

$$Y = \{+1, -1\}$$

The weight vector has one weight per attribute:

$$W = \{w_1, w_2, w_3, \ldots, w_m\}$$

The prediction for an instance is made as follows:

$$\hat{y}_i = \operatorname{sign}\left( w \cdot x_i \right)$$

The sign function returns +1 if the dot product of the weights and the attributes is positive,
and -1 if the product is negative.
Perceptron proceeds to compare the predicted y with the actual y. If the predicted y is correct,
it moves on to the next record. If the prediction is incorrect, there are two scenarios: if the
predicted y is +1 and the actual y is -1, the weights are decremented by the x values; if the
actual y is +1 and the predicted y is -1, the weights are incremented by the x values. Let's see
this as an equation for more clarity:
$$w_{t+1} = w_t + y_i x_i$$
Typically, a learning rate, alpha, is provided so that the weights are updated in a controlled
manner. With noise present in the data, a full increment or decrement would lead to the weights
not converging:
$$w_{t+1} = w_t + \alpha \left( y_i x_i \right)$$
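To make the update rule concrete, here is a minimal NumPy sketch of a single perceptron step on one made-up instance; the attribute values, label, and learning rate below are purely illustrative:

import numpy as np

# One made-up training instance with three attributes and its label.
x_i = np.array([0.5, -1.2, 0.3])
y_i = -1

w = np.zeros(3)      # current weight vector
alpha = 0.5          # learning rate

# Predict with the sign of the dot product between weights and attributes.
y_hat = np.sign(np.dot(w, x_i))

# Update the weights only when the prediction does not match the label.
if y_hat != y_i:
    w = w + alpha * (y_i * x_i)

print(w)             # [-0.25  0.6  -0.15]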
Getting ready
Let's generate data using make_classification in batches with a generator function to
simulate large-scale data and data streaming, and proceed to write the perceptron algorithm.
How to do it…
Let's load the necessary libraries. We will then write a function, get_data, which is a
generator:
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import scale
import numpy as np
def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)
    while b_size < batch_size:
        x,y = make_classification(n_samples=1000,n_features=no_features,flip_y=0.03,\
                n_informative = informative_features, n_redundant = redundant_features,\
                n_repeated = repeated_features, random_state=51)
        y_indx = y < 1
        y[y_indx] = -1
        x = scale(x)
        yield x,y
        b_size+=1
We will proceed to write two functions, one to build our perceptron model and the other one to
test the worthiness of our model:
def build_model(x,y,weights,epochs,alpha=0.5):
    """
    Simple Perceptron
    """
    for i in range(epochs):
        # Shuffle the dataset
        shuff_index = np.random.permutation(len(y))
        x_train = x[shuff_index,:]
        y_train = y[shuff_index]
        # Go through every record and update the weights on a wrong prediction
        for index in range(len(y)):
            prediction = np.sign(np.sum(x_train[index,:] * weights))
            if prediction != y_train[index]:
                weights = weights + alpha * (y_train[index] * x_train[index,:])
    return weights

def model_worth(x,y,weights):
    prediction = np.sign(np.sum(x * weights,axis=1))
    print classification_report(y,prediction)
Finally, we will write our main function to invoke all the preceding functions, to demonstrate
the perceptron algorithm:
if __name__ == "__main__":
    data = get_data(10)
    x,y = data.next()
    weights = np.zeros(x.shape[1])
    for i in range(10):
        epochs = 100
        weights = build_model(x,y,weights,epochs)
        print
        print "Model worth after receiving dataset batch %d"%(i+1)
        model_worth(x,y,weights)
        print
        if i < 9:
            x,y = data.next()
How it works…
Let's start with our main function. We will ask our generator to send us 10 sets of data:
data = get_data(10)
Here, we want to simulate both large-scale data and data streaming. While building our
model, we don't have access to all the data, just part of it:
x,y = data.next()
We will use the next() function in the generator in order to get the next set of data. In the
get_data function, we will use the make_classification function from scikit-learn:
x,y = make_classification(n_samples=1000,n_features=no_features,flip_y=0.03,\
        n_informative = informative_features, n_redundant = redundant_features,\
        n_repeated = repeated_features, random_state=51)
Let's look at the parameters passed to the make_classification method. The first
parameter is the number of instances required; in this case, we need 1,000 instances. The
second parameter is about how many attributes per instance are required; we will assume that
we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances.
This is done to introduce some noise in our data. The next parameter concerns how many of these
30 features should be informative enough to be used in our classification; our code specifies
int(0.8*no_features), that is, 24 out of 30 features. The next
parameter is about the redundant features. These are generated as a linear combination of
the informative features in order to introduce correlation among the features. Finally, repeated
features are duplicate features that are drawn randomly from both the informative features and
the redundant features.
When we call next(), we will get 1,000 instances of this data. This function returns a y label
as {0,1}; we want {-1,+1} and hence we will change all the zeros in y to -1:
y_indx = y < 1
y[y_indx] = -1
Finally, we will center and scale our data to zero mean and unit variance using the scale function from scikit-learn.
Let's proceed to build our model with the first batch of data. We will initialize our weights
matrix with zeros:
weights = np.zeros(x.shape[1])
As we need 10 batches of data to simulate large-scale learning and data streaming, we will do
the model building 10 times in the for loop:
for i in range(10):
epochs = 100
weights = build_model(x,y,weights,epochs)
Let's look inside our build_model function. We will start by shuffling the data:

# Shuffle the dataset
shuff_index = np.random.permutation(len(y))
x_train = x[shuff_index,:]
y_train = y[shuff_index]
We will then go through each record in our dataset and start updating our weights. For each
record, we will multiply the training data with the weights and add the products together; we will
use the np.sign function on this sum to get our prediction. Finally, based on the prediction, we
will update our weights:

weights = weights + alpha * (y_train[index] * x_train[index,:])
In our main function, we will invoke the model_worth function to print the goodness of the
model. Here, we will use the classification_report convenience function to print the
precision, recall, and F1-score of the model:
print
print "Model worth after receiving dataset batch %d"%(i+1)
model_worth(x,y,weights)
We will then proceed to update our model with the next batch of incoming data. Note that we do
not reinitialize the weights; the same weight vector is carried forward and updated with every
new batch of data.
There's more…
Scikit-learn provides us with an implementation of perceptron. Refer to the following URL for
more details:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html.
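As a rough sketch (not code from the book), the same batch-wise training loop can be reproduced with scikit-learn's Perceptron class by calling partial_fit on every incoming batch; the classes argument is required on the first call so that the model knows all possible labels up front:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

clf = Perceptron()
classes = np.array([-1, 1])

for batch in range(10):
    # Stand-in for one batch of streaming data.
    x, y = make_classification(n_samples=1000, n_features=30, random_state=batch)
    y[y < 1] = -1
    # partial_fit updates the existing weights instead of refitting from scratch.
    clf.partial_fit(x, y, classes=classes)

print(clf.score(x, y))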
Another improvement that can be made to the perceptron algorithm is to expand the existing
attributes into a richer set of features:

$$\hat{y} = \operatorname{sign}\left( w \cdot \phi(x_i) \right)$$

We have replaced the raw x values with a feature transformation, phi. Here, we can plug in a
feature generator; for example, a polynomial feature generator can be added to our get_data
function, as follows:
from sklearn.preprocessing import PolynomialFeatures

def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)
    poly = PolynomialFeatures(degree=2)
    while b_size < batch_size:
        x,y = make_classification(n_samples=1000,n_features=no_features,flip_y=0.03,\
                n_informative = informative_features, n_redundant = redundant_features,\
                n_repeated = repeated_features, random_state=51)
        y_indx = y < 1
        y[y_indx] = -1
        x = poly.fit_transform(x)
        yield x,y
        b_size+=1
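One practical point worth noting (an observation added here, not from the original text): PolynomialFeatures(degree=2) expands the 30 original attributes into 496 columns (the bias column, the 30 linear terms, and all degree-2 combinations), so the weight vector created from x.shape[1] in the main function grows accordingly:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.random.rand(5, 30)                          # five dummy instances with 30 attributes
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
print(x_poly.shape)                                # (5, 496)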
Finally, kernel-based perceptron algorithms are available to handle nonlinear datasets. Refer
to the Wikipedia article for more information about kernel-based perceptron:
https://en.wikipedia.org/wiki/Kernel_perceptron.
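For reference, a kernel perceptron (in its usual formulation, which is not shown in this book) keeps one coefficient per training instance instead of one weight per feature, and predicts as follows:

$$\hat{y} = \operatorname{sign}\left( \sum_{j} \alpha_j \, y_j \, K\left( x_j, x \right) \right)$$

Here, K is a kernel function, such as a polynomial or RBF kernel, and each alpha_j counts how many times instance j was misclassified during training.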
See also
- Learning and using Kernels recipe in Chapter 5, Data Mining - Finding a needle in a haystack
Using stochastic gradient descent for regression

Our set of predictors is as follows:

$$X = \{x_1, x_2, \ldots, x_n\}$$

The response variable, Y, is a vector of real-valued entries. The job of regression is to find a
function such that when X is provided as an input to this function, it should return Y:

$$F(X) = Y$$

The preceding function is parameterized by a weight vector; that is, a combination of the weight
vector and the input vector is used to predict Y. Rewriting the function with the weight vector,
we get the following:

$$F(X, W) = Y$$
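For the linear models used throughout this chapter, this parameterized function can be written out explicitly as a weighted sum of the attributes plus an intercept term (a standard form, spelled out here for clarity):

$$F(X, W) = w_0 + \sum_{j=1}^{m} w_j x_j$$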
So, the question now is: how do we know that we have the right weight vectors? We will use a
loss function, L, to get the right weight vectors. The loss function measures the cost of making
a wrong prediction; it empirically measures the cost of predicting ŷ when the actual value is y.
The regression problem now becomes the problem of finding the right weight vector that will
minimize the loss function. For our whole dataset of n elements, the overall loss function is
as follows:
$$\frac{1}{n} \sum_{i=1}^{n} L\left( f(x_i, w), y_i \right)$$
Our weight vectors should be those that minimize the preceding value.
Gradient descent is an optimization technique used to minimize the preceding equation. For
this equation, we will find the gradient, that is, the first-order derivative with respect to W.
Unlike other optimization techniques, such as batch gradient descent, stochastic gradient
descent operates on one instance at a time. The update performed for each instance is as follows.
Let us denote the gradient of the loss function with respect to w as follows:

$$\nabla_w L\left( f(x_i, w), y_i \right)$$

The weights are updated as follows:

$$w_{i+1} = w_i - \nabla_w L\left( f(x_i, w), y_i \right)$$
As you can see, the weights are moved in the direction opposite to the gradient, forcing a descent
that eventually yields weight vector values that reduce the objective cost function.
A squared loss is a typical loss function used with regression. The squared loss of an instance
is defined in the following way:
$$\left( \hat{y} - y \right)^2$$
The derivative of the preceding equation is substituted into the weight update equation.
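To make that substitution concrete (the algebra is implied rather than spelled out in the text), for a linear model $f(x_i, w) = w \cdot x_i$ the gradient of the squared loss for a single instance is as follows:

$$\nabla_w \left( f(x_i, w) - y_i \right)^2 = 2 \left( f(x_i, w) - y_i \right) x_i$$

Substituting this into the update rule, each stochastic step moves the weights by a multiple of the prediction error times the input vector:

$$w_{i+1} = w_i - 2 \left( \hat{y}_i - y_i \right) x_i$$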
With this background knowledge, let's proceed to our recipe for stochastic gradient descent regression.
As explained in the perceptron recipe, a learning rate, eta, is added to the weight update
equation in order to reduce the effect of noise:
$$w_{i+1} = w_i - \eta \, \nabla_w L\left( f(x_i, w), y_i \right)$$
Getting ready
We will be leveraging scikit-learn's implementation of SGD regression. As in some of the
previous recipes, we will use the make_regression function from scikit-learn to generate
data for our recipe in order to demonstrate stochastic gradient descent regression.
How to do it…
Let's start with a very simple example demonstrating how to build a stochastic gradient
descent regressor.
We will first load the required libraries. We will then write a function to generate predictors and
response variables to demonstrate regression:
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.cross_validation import train_test_split
def get_data():
    """
    Make a sample regression dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    x,y = make_regression(n_samples=1000,n_features=no_features,\
            random_state=51)
    return x,y
We will proceed to write the functions that will help us build, validate, and inspect our model:
def build_model(x,y):
    estimator = SGDRegressor(n_iter = 10, shuffle=True, loss = "squared_loss", \
            learning_rate='constant', eta0=0.01, fit_intercept=True, \
            penalty='none')
    estimator.fit(x,y)
    return estimator

def model_worth(model,x,y):
    predicted_y = model.predict(x)
    print "\nMean absolute error = %0.2f"%mean_absolute_error(y,predicted_y)
    print "Mean squared error = %0.2f"%mean_squared_error(y,predicted_y)
def inspect_model(model):
    print "\nModel Intercept {0}".format(model.intercept_)
    print
    for i,coef in enumerate(model.coef_):
        print "Coefficient {0} = {1:.3f}".format(i+1,coef)
Finally, we will write our main function to invoke all the preceding functions:
if __name__ == "__main__":
    x,y = get_data()

    # Divide the data into train, dev, and test sets
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

    model = build_model(x_train,y_train)
    inspect_model(model)

    print "Model worth on train data"
    model_worth(model,x_train,y_train)
    print "Model worth on dev data"
    model_worth(model,x_dev,y_dev)
How it works…
Let's start with our main function. We will invoke the get_data function to generate our
predictor, x, and response, y, variables. In the get_data function, we will leverage the
convenient make_regression function from
scikit-learn to generate a dataset for the regression problems:
no_features = 30
x,y = make_regression(n_samples=1000,n_features=no_features,\
random_state=51)
As you can see, we will generate a dataset with 1,000 instances specified by an n_samples
parameter, and 30 features defined by an n_features parameter.
Let's split the data into training and testing sets using train_test_split. We will reserve
30 percent of our data to test:
# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
Once again, we will leverage train_test_split to split our test data into dev and test sets:
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
With the data divided to build, evaluate, and test the model, we will proceed to build
our models.
In build_model, we will leverage scikit-learn's SGD regressor class to build our stochastic
gradient descent method:
estimator = SGDRegressor(n_iter = 10, shuffle=True, loss = "squared_loss", \
        learning_rate='constant', eta0=0.01, fit_intercept=True, \
        penalty='none')
estimator.fit(x,y)
The SGD regressor is a versatile class that can be used to fit a number of linear models by
varying its parameters. We will first explain the basic stochastic gradient descent method and
then proceed to explain the other details.
Let's look at the parameters that we used. The first parameter is the number of times that we
want to go through our dataset in order to update the weights. Here, we will say that we want
10 iterations. As in perceptron, after going through all the records once, we need to shuffle our
input records when we start the next iteration. The shuffle parameter is used for this. The
default value of shuffle is True; we have included it here for explanation purposes. Our loss
function is the squared loss, and since we want to do a linear regression, we will specify this
using the loss parameter.
Our learning rate, eta, is a constant that we will specify with the learning_rate parameter.
We will provide a value for our learning rate using the eta0 parameter. We will then say that
we need to fit the intercept as we have not centered our data by its mean. Finally, the penalty
parameter controls the type of shrinkage required. In our case, we don't want any shrinkage,
so we pass the 'none' string.
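If you are running a recent scikit-learn release, note that some of the names used above have since changed (n_iter became max_iter, the 'squared_loss' string became 'squared_error', and train_test_split moved to sklearn.model_selection); a roughly equivalent construction on a current version would look like this:

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split   # replaces sklearn.cross_validation

# Roughly equivalent settings on a recent scikit-learn release.
estimator = SGDRegressor(max_iter=10, tol=None, shuffle=True, loss="squared_error",
                         learning_rate='constant', eta0=0.01, fit_intercept=True,
                         penalty=None)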
We will proceed to build our model by invoking the fit function with our predictor and response
variables. Finally, we will return the model that we built to our calling function.
Let's now inspect our model and see the value of the intercept and coefficients:
inspect_model(model)
In the inspect_model function, we will print the values of the model intercept and coefficients.
Let's now look at how our model has performed on our training data:
print "Model worth on train data"
model_worth(model,x_train,y_train)
We will invoke the model_worth function to look at our model's performance. The model_worth
function prints the mean absolute error and mean squared error values.
The mean squared error is calculated as follows:

$$\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$

The mean absolute error is calculated as follows:

$$\frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
The mean squared error is sensitive to outliers; hence, the mean absolute error is a more robust
measure. We will look at the model's performance on the training data first and then on our dev data.
There's more…
We can include regularization in the stochastic gradient descent framework. Recall the
following cost function of ridge regression from the previous chapter:
$$\sum_{i=1}^{n} \left( y_i - w_0 - \sum_{j=1}^{m} x_{ij} w_j \right)^2 + \alpha \sum_{j=1}^{m} w_j^2$$
We included an expanded version of the squared loss function here and added the regularization
term, the sum of the squares of the weights. We can include this in our gradient descent
procedure. Let's say that we denote our regularization term as R(W). Our weight update is now
as follows:
$$w_{i+1} = w_i - \eta \left( \nabla_w L\left( f(x_i, w), y_i \right) + \alpha \, \nabla_w R(w) \right)$$
As you can see, the derivative of the regularization term with respect to the weights is now
added, along with the derivative of the loss function with respect to the weight vector, w, to
our weight update rule.
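For the L2 penalty used below, the regularization gradient is simple to write out (a standard result, added here for clarity): with $R(w) = \sum_j w_j^2$, we have $\nabla_w R(w) = 2w$, so the penalty term shrinks every weight toward zero at each update:

$$w_{i+1} = w_i - \eta \left( \nabla_w L\left( f(x_i, w), y_i \right) + 2 \alpha w_i \right)$$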
Let's see the new parameters that we passed, compared with our previous build_model method:

estimator = SGDRegressor(n_iter = 10, shuffle=True, loss = "squared_loss", \
        learning_rate='constant', eta0=0.01, fit_intercept=True, \
        penalty='l2', alpha=0.01)
estimator.fit(x,y)
return estimator
Earlier, we set our penalty to none. Now, you can see that we have asked for an L2 penalty to be
added to our model, and we have given an alpha value of 0.01 using the alpha parameter.
Inspecting the coefficients again shows the effect of the L2 regularization: many of the
coefficients have been shrunk toward zero. Similarly, the L1 regularization, and elastic net,
which combines both the L1 and L2 regularization, can be included using the penalty parameter.
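As a hedged illustration (not code from the book), switching the penalty is a one-line change; for elastic net, the additional l1_ratio parameter controls the mix between the L1 and L2 terms:

estimator = SGDRegressor(n_iter = 10, shuffle=True, loss = "squared_loss", \
        learning_rate='constant', eta0=0.01, fit_intercept=True, \
        penalty='elasticnet', alpha=0.01, l1_ratio=0.5)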
SVM regression models can be built using the epsilon-insensitive loss function. This loss
function is defined as follows:
$$L\left( f(x_i, w), y_i \right) = \begin{cases} 0 & \text{if } \left| y_i - f(x_i, w) \right| < \varepsilon \\ \left| y_i - f(x_i, w) \right| - \varepsilon & \text{otherwise} \end{cases}$$
Refer to the following URL for the various parameters that can be passed to the SGD regressor
in scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html.
See also
- Predicting real valued numbers using regression recipe in Chapter 7, Machine Learning II
- Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II
Using stochastic gradient descent for classification

The SGDClassifier class from scikit-learn provides us with a variety of loss functions; in this
recipe, we will use log loss, which gives us logistic regression. We will use the derivative of
this loss function in our weight update rule to arrive at our weight vectors.
A linear combination of the weights and the input attributes is computed as follows:

$$W^T X$$
We have given a generalized notation; the intercept is assumed to be the first dimension of our
weight vector. For a binary classification problem, a logistic (sigmoid) function is applied to
this quantity to get a prediction, as follows:

$$F(w, x_i) = \frac{1}{1 + e^{-w^T x_i}}$$
The preceding function is also called the sigmoid function. For very large positive values of
$w^T x_i$, this function returns a value close to one; for large negative values, it returns a
value close to zero. With this, we can define our log loss function as follows:

$$L(w, x_i) = -y_i \log\left( F(w, x_i) \right) - \left( 1 - y_i \right) \log\left( 1 - F(w, x_i) \right)$$
With the preceding loss function fitted into the weight update rule of the gradient descent, we
can arrive at the appropriate weight vectors.
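Working out the derivative (a standard result, stated here for completeness), the gradient of the log loss for a single instance has a particularly simple form, so the stochastic update is again the prediction error times the input vector:

$$\nabla_w L(w, x_i) = \left( F(w, x_i) - y_i \right) x_i$$

$$w_{i+1} = w_i - \eta \left( F(w, x_i) - y_i \right) x_i$$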
For the log loss function defined in scikit-learn, refer to the following URL:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html.
With this knowledge, let's jump into our recipe for stochastic gradient descent-based
classification.
Getting ready
We will leverage scikit-learn's implementation of the stochastic gradient descent classifier.
As we did in some of the previous recipes, we will use the make_classification function
from scikit-learn to generate data for our recipe in order to demonstrate the stochastic
gradient descent classification.
How to do it…
Let's start with a very simple example demonstrating how to build a stochastic gradient
descent classifier.
We will first load the required libraries. We will then write a function to generate the predictors
and response variables:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier
import numpy as np
def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=1000,n_features=no_features,flip_y=0.03,\
            n_informative = informative_features, n_redundant = redundant_features,\
            n_repeated = repeated_features, random_state=7)
    return x,y
We will proceed to write functions that will help us build and validate our model:
def build_model(x,y,x_dev,y_dev):
    estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log", \
            learning_rate = "constant", eta0=0.0001, fit_intercept=True, penalty="none")
    estimator.fit(x,y)
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)

    print
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)
Finally, we will write our main function to invoke all the preceding functions:
if __name__ == "__main__":
    x,y = get_data()

    # Divide the data into Train, dev and test
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

    build_model(x_train,y_train,x_dev,y_dev)
How it works…
Let's start with our main function. We will invoke get_data to get our x predictor attributes
and y response attributes. In get_data, we will leverage the make_classification function
in order to generate training data for our stochastic gradient descent classifier:
def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,\
            n_informative = informative_features, n_redundant = redundant_features,\
            n_repeated = repeated_features, random_state=7)
    return x,y
Let's look at the parameters passed to the make_classification method. The first
parameter is the number of instances required. In this case, we need 500 instances. The
second parameter is about how many attributes per instance are required. We say that we
need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances.
This is done to introduce noise in our data. The next parameter is about how many out
of those 30 features should be informative enough to be used in our classification. We
specified that 60 percent of our features, that is, 18 out of 30, should be informative. The
next parameter is about redundant features. These are generated as a linear combination
of the informative features in order to introduce correlation among the features. Finally, the
repeated features are duplicate features that are drawn randomly from both the informative
and redundant features.
Let's split the data into training and testing sets using train_test_split. We will reserve
30 percent of our data to test:
# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
With the data divided to build, evaluate, and test the model, we will proceed to build
our models:
build_model(x_train,y_train,x_dev,y_dev)
In build_model, we will create an SGDClassifier estimator. Let's look at the parameters that
we used. The first parameter is the number of times we want to go through our dataset to update
the weights. Here, we say that we want 50 iterations. As
in perceptron, after going through all the records once, we need to shuffle our input records
when we start the next iteration. The shuffle parameter is used for this. The default value of
shuffle is True; we have included it here for explanation purposes. Our loss function is log
loss: we want to do a logistic regression, and we will specify this using the loss parameter.
Our learning rate, eta, is a constant that we will specify with the learning_rate parameter.
We will provide the value for our learning rate using the eta0 parameter. We will then proceed
to say that we need to fit the intercept, as we have not centered our data by its mean. Finally,
the penalty parameter controls the type of shrinkage required. In our case, we don't want any
shrinkage, so we pass the 'none' string.
We will proceed to build our model by invoking the fit function with our predictor and response
variable, and evaluate our model with our training and dev dataset:
estimator.fit(x,y)
train_predicted = estimator.predict(x)
train_score = accuracy_score(y,train_predicted)
dev_predicted = estimator.predict(x_dev)
dev_score = accuracy_score(y_dev,dev_predicted)

print
print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)
There's more…
Regularization, L1, L2, or elastic net can be applied for SGD classification. The procedure is
the same as that of regression, and hence, we will not repeat it here. Refer to the previous
recipe for this.
The learning rate, eta, was constant in our example. This need not be the case: with every
iteration, the eta value can be reduced. The learning_rate parameter can be set to 'optimal'
or 'invscaling' to achieve this. Refer to the following scikit-learn documentation:

http://scikit-learn.org/stable/modules/sgd.html.
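For instance (a hedged sketch, not from the book), an inverse-scaling schedule decays the learning rate over updates as eta0 / t^power_t, which can be requested as follows:

estimator = SGDClassifier(n_iter=50, loss="log", learning_rate="invscaling",
                          eta0=0.01, power_t=0.25)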
We used the fit method to build our model. As mentioned previously, in large-scale machine
learning, we know that all the data will not be available to us at once. When we receive the
data in batches, we need to use the partial_fit method, instead of fit. Using the fit
method will reinitialize the weights and we will lose all the training information from the
previous batch of data. Refer to the following link for more information on partial_fit:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit.
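A minimal sketch of that batch-wise pattern, assuming the same get_data generator used in the perceptron recipe earlier in this chapter (this is not code from the book):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log", learning_rate="constant", eta0=0.0001)
classes = np.array([-1, 1])

data = get_data(10)        # the batch generator from the perceptron recipe
for i in range(10):
    x, y = next(data)
    # partial_fit keeps the weights learned from the earlier batches.
    clf.partial_fit(x, y, classes=classes)
    print("Accuracy on batch %d = %0.2f" % (i + 1, clf.score(x, y)))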
See also
- Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II
- Using stochastic gradient descent for regression recipe in Chapter 9, Machine Learning III