
Chapter 10
Large-Scale Machine Learning – Online Learning

In this chapter, we will see the following recipes:

- Using perceptron as an online learning algorithm
- Using stochastic gradient descent for regression
- Using stochastic gradient descent for classification

Introduction
In this chapter, we will concentrate on large-scale machine learning and the algorithms suited to tackling such problems. Until now, when we trained our models, we assumed that the training set could fit into our computer's memory. In this chapter, we will see how to go about building models when that assumption no longer holds: the training records are so large that we cannot fit them completely into memory, so we may have to load them piecewise and still produce a model with good accuracy. The argument of a training set not fitting into memory extends naturally to streaming data. With streaming data, we don't see all the data at once; we should be able to make decisions based on whatever data we have been exposed to so far, and we also need a mechanism for continuously improving the model as new data arrives.


We will introduce the framework of stochastic gradient descent-based algorithms. This is a versatile framework for handling very large-scale datasets that will not fit completely into memory. Several types of linear algorithms, including logistic regression, linear regression, and linear SVM, can be accommodated in this framework. The kernel trick, which we introduced in the previous chapter, can be included in this framework in order to deal with datasets that have nonlinear relationships.

We will begin our list of recipes with the perceptron algorithm, one of the oldest machine learning algorithms. The perceptron is easy to understand and implement, but it is limited to solving only linear problems; a kernel-based perceptron can be used on nonlinear datasets.

In our second recipe, we will formally introduce the framework of gradient descent-based methods and show how it can be used to perform regression tasks. We will look at different loss functions to see how different types of linear models can be built using them. We will also see how the perceptron belongs to the family of stochastic gradient descent methods.

In our final recipe, we will see how classification algorithms can be built using the stochastic gradient descent framework.

Even though we don't have a direct example of streaming data, we will use our existing datasets to see how streaming use cases can be addressed. Online learning algorithms are not limited to streaming data; they can also be applied to batch data, except that they process only one instance at a time.

Using perceptron as an online learning algorithm

As mentioned earlier, the perceptron is one of the oldest machine learning algorithms. It was first mentioned in a 1943 paper:

"A Logical Calculus of the Ideas Immanent in Nervous Activity", Warren S. McCulloch and Walter Pitts, University of Illinois, College of Medicine, Department of Psychiatry at the Illinois Neuropsychiatric Institute, University of Chicago, Chicago, U.S.A.

Let's revisit our definition of a classification problem. Each record or instance can be written as a pair (X, y), where X is a set of attributes and y is the corresponding class label.

Learning a target function, F, that maps each record's attribute set to one of the predefined class labels, y, is the job of a classification algorithm.

The difference in our case is that we have a large-scale learning problem. All our data will not fit into main memory, so we need to keep the data on disk and use only a portion of it at a time in order to build our perceptron model.


Let's proceed to outline the perceptron algorithm:

1. Initialize the weights of the model to small random numbers.
2. Center the input data, x, by subtracting its mean.
3. At each time step t (also called an epoch):
   - Shuffle the dataset
   - Pick a single instance and make a prediction
   - Observe the deviation of the prediction from the true label
   - Update the weights if the prediction differs from the true label

Let's consider the following scenario. We have the complete dataset on our disk. In a single epoch, that is, in step 3, all the steps mentioned are performed on all the data on our disk. In an online learning scenario, a bunch of instances based on a windowing function will be available to us at any point in time, and we can update the weights as many times as there are instances in our window in a single epoch.

Let's see how to go about updating our weights.

Let's say our input X is as follows:

X_i = {x_1, x_2, x_3, ..., x_m}, where i = 1 to n

Our Y is as follows:

Y = {+1, -1}

We will define our weights as the following equation:

W = {w_1, w_2, w_3, ..., w_m}

Our prediction after we see each record is defined as follows:

ŷ_i = sign(W · x_i)

The sign function returns +1 if the product of the weights and attributes is positive, and -1 if the product is negative.


The perceptron proceeds to compare the predicted y with the actual y. If the prediction is correct, it moves on to the next record. If the prediction is incorrect, there are two scenarios: if the predicted y is +1 and the actual y is -1, it decrements the weights by the x values, and vice versa; if the actual y is +1 and the predicted y is -1, it increments the weights. Let's see this as an equation for more clarity:

w_{t+1} = w_t + y_i x_i

Typically, a learning rate, alpha, is provided so that the weights are updated in a controlled manner. In the presence of noise in the data, a full increment or decrement would lead to the weights not converging:

w_{t+1} = w_t + α (y_i x_i)

Alpha is a very small value, typically between 0.1 and 0.4.
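To make the update rule concrete, here is a minimal sketch of a single update on made-up numbers; it is not part of the recipe's code, which follows below:

import numpy as np

w = np.array([0.1, -0.2])     # current weights
x_i = np.array([1.0, 2.0])    # one instance
y_i = 1                       # its true label
alpha = 0.1                   # learning rate

prediction = np.sign(np.sum(w * x_i))   # sign(0.1 - 0.4) = -1, a mistake
if prediction != y_i:
    w = w + alpha * (y_i * x_i)         # w becomes [0.2, 0.0]
print w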

Let's jump into our recipe now.

Getting ready
Let's generate data using make_classification in batches with a generator function to
simulate large-scale data and data streaming, and proceed to write the perceptron algorithm.

How to do it…
Let's load the necessary libraries. We will then write a function, get_data, which is a
generator:
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import scale
import numpy as np

def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)

    while b_size < batch_size:
        x, y = make_classification(n_samples=1000, n_features=no_features,
                flip_y=0.03, n_informative=informative_features,
                n_redundant=redundant_features, n_repeated=repeated_features,
                random_state=51)
        # Convert the {0,1} labels returned by make_classification to {-1,+1}
        y_indx = y < 1
        y[y_indx] = -1
        # Center and scale every feature
        x = scale(x, with_mean=True, with_std=True)

        yield x, y
        b_size += 1

We will proceed to write two functions, one to build our perceptron model and the other one to
test the worthiness of our model:
def build_model(x, y, weights, epochs, alpha=0.5):
    """
    Simple Perceptron
    """
    for i in range(epochs):
        # Shuffle the dataset; np.random.permutation returns a shuffled index array
        shuff_index = np.random.permutation(len(y))
        x_train = x[shuff_index, :]
        y_train = y[shuff_index]

        # Build weights one instance at a time
        for index in range(len(y)):
            prediction = np.sign(np.sum(x_train[index, :] * weights))
            if prediction != y_train[index]:
                weights = weights + alpha * (y_train[index] * x_train[index, :])

    return weights

def model_worth(x, y, weights):
    prediction = np.sign(np.sum(x * weights, axis=1))
    print classification_report(y, prediction)


Finally, we will write our main function to invoke all the preceding functions, to demonstrate
the perceptron algorithm:
if __name__ == "__main__":
data = get_data(10)
x,y = data.next()
weights = np.zeros(x.shape[1])
for i in range(10):
epochs = 100
weights = build_model(x,y,weights,epochs)
print
print "Model worth after receiving dataset batch %d"%(i+1)
model_worth(x,y,weights)
print
if i < 9:
x,y = data.next()

How it works…
Let's start with our main function. We will ask our generator to send us 10 sets of data:
data = get_data(10)

Here, we want to simulate both large-scale data and data streaming. While building our
model, we don't have access to all the data, just part of it:
x,y = data.next()

We will use the next() function in the generator in order to get the next set of data. In the
get_data function, we will use the make_classification function from scikit-learn:

x, y = make_classification(n_samples=1000, n_features=no_features,
        flip_y=0.03, n_informative=informative_features,
        n_redundant=redundant_features, n_repeated=repeated_features,
        random_state=51)

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we need 1,000 instances. The second parameter is how many attributes per instance are required; we will assume that we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification; we specified that 80 percent of our features, that is, 24 out of 30, should be informative. The next parameter is about the redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, repeated features are duplicate features that are drawn randomly from both the informative features and the redundant features.


When we call next(), we will get 1,000 instances of this data. This function returns a y label
as {0,1}; we want {-1,+1} and hence we will change all the zeros in y to -1:
y_indx = y < 1
y[y_indx] = -1

Finally, we will center and scale our data using the scale function from scikit-learn.

Let's proceed to build our model with the first batch of data. We will initialize our weights
matrix with zeros:
weights = np.zeros(x.shape[1])

As we need 10 batches of data to simulate large-scale learning and data streaming, we will do
the model building 10 times in the for loop:
for i in range(10):
    epochs = 100
    weights = build_model(x, y, weights, epochs)

Our perceptron algorithm is built in build_model. A predictor x, response variable y, the weights matrix, and the number of time steps or epochs are passed as parameters. In our case, we have set the number of epochs to 100. This function has one additional parameter, an alpha value:

def build_model(x, y, weights, epochs, alpha=0.5)

By default, we have set our alpha value to 0.5.

Let's now look inside build_model. We will start by shuffling the data:

# Shuffle the dataset
shuff_index = np.random.permutation(len(y))
x_train = x[shuff_index, :]
y_train = y[shuff_index]

We will go through each record in our dataset and start updating our weights:

# Build weights one instance at a time
for index in range(len(y)):
    prediction = np.sign(np.sum(x_train[index, :] * weights))
    if prediction != y_train[index]:
        weights = weights + alpha * (y_train[index] * x_train[index, :])

In the for loop, you can see that we make the prediction:

prediction = np.sign(np.sum(x_train[index, :] * weights))


We will multiply our training data with the weights and add them together. Finally, we will use the np.sign function to get our prediction. Now, based on the prediction, we will update our weights:

weights = weights + alpha * (y_train[index] * x_train[index, :])

That is all. We will return the weights to the calling function.

In our main function, we will invoke the model_worth function to print the goodness of the model. Here, we will use the classification_report convenience function to print the precision, recall, and F1-score of the model:

print
print "Model worth after receiving dataset batch %d" % (i+1)
model_worth(x, y, weights)

We will then proceed to update our model for the next batch of incoming data. Note that we do not reinitialize the weights parameter; it is carried over and updated with every new batch of data.

model_worth prints a classification report (precision, recall, and F1-score) after each batch of data is received.

There's more…
Scikit-learn provides us with an implementation of the perceptron. Refer to the following URL for more details:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
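As a rough sketch of how that class could replace our hand-rolled implementation, the snippet below (not part of the recipe; it assumes the get_data generator defined earlier) uses partial_fit to keep updating the same model batch after batch:

from sklearn.linear_model import Perceptron
import numpy as np

clf = Perceptron()
data = get_data(10)
for i in range(10):
    x, y = data.next()
    # partial_fit updates the existing weights instead of refitting from scratch;
    # the classes argument is required on the first call
    clf.partial_fit(x, y, classes=np.array([-1, 1]))
    print "Accuracy after batch %d = %0.2f" % (i+1, clf.score(x, y))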

Another improvement that can be made to the perceptron algorithm is to use more features.

Recall the prediction equation; we can rewrite it as follows:

ŷ = sign(W · φ(x_i))

We have replaced the x values with the output of a function, φ. Here, we can plug in a feature generator. For example, a polynomial feature generator can be added to our get_data function, as follows:
from sklearn.preprocessing import PolynomialFeatures

def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)
    poly = PolynomialFeatures(degree=2)

    while b_size < batch_size:
        x, y = make_classification(n_samples=1000, n_features=no_features,
                flip_y=0.03, n_informative=informative_features,
                n_redundant=redundant_features, n_repeated=repeated_features,
                random_state=51)
        y_indx = y < 1
        y[y_indx] = -1
        # Expand the original attributes with degree-2 polynomial features
        x = poly.fit_transform(x)
        yield x, y
        b_size += 1

Finally, kernel-based perceptron algorithms are available to handle nonlinear datasets. Refer to the Wikipedia article for more information about the kernel perceptron:

https://en.wikipedia.org/wiki/Kernel_perceptron


See also
- The Learning and using kernels recipe in Chapter 5, Data Mining – Finding a Needle in a Haystack

Using stochastic gradient descent for regression

In a typical regression setup, we have a set of predictors (instances), as follows:

X = {x_1, x_2, ..., x_n}

Each instance has m attributes, as follows:

x_i = {x_i1, x_i2, ..., x_im}, where i = 1 to n

The response variable, Y, is a vector of real-valued entries. The job of regression is to find a function such that when x is provided as an input to this function, it returns y:

F(X) = Y

The preceding function is parameterized by a weight vector; that is, a combination of the weight vector and the input vector is used to predict Y. Rewriting the function with the weight vector, we get the following:

F(X, W) = Y

So, the question now is: how do we know that we have the right weight vector? We will use a loss function, L, to get the right weight vector. The loss function measures the cost of making a wrong prediction; it empirically measures the cost of predicting ŷ when the actual value is y. The regression problem now becomes the problem of finding the weight vector that minimizes the loss function. For our whole dataset of n elements, the overall loss function is as follows:

(1/n) Σ_{i=1}^{n} L(f(x_i, w), y_i)


Our weight vector should be the one that minimizes the preceding value.

Gradient descent is an optimization technique used to minimize the preceding equation. For this equation, we will find the gradient, that is, the first-order derivative with respect to W.

Unlike other optimization techniques, such as batch gradient descent, stochastic gradient descent operates on one instance at a time. The steps involved in stochastic gradient descent are as follows:

1. For each epoch, shuffle the dataset.
2. Pick an instance and its response variable, y.
3. Calculate the loss function and its derivative with respect to the weights.
4. Update the weights.

Let ∇_w denote the derivative with respect to w. The weights are updated as follows:

w_{i+1} = w_i − ∇_w L(f(x_i, w), y_i)

As you can see, the weights are moved in the opposite direction to the gradient, forcing a descent that will eventually give weight vector values that reduce the objective cost function.

A squared loss is a typical loss function used with regression. The squared loss of an instance is defined in the following way:

(ŷ − y)^2

The derivative of the preceding equation is substituted into the weight update equation. With this background knowledge, let's proceed to our recipe for stochastic gradient descent regression.

As explained in the perceptron recipe, a learning rate, eta, is added to the weight update equation in order to avoid the effect of noise:

w_{i+1} = w_i − η ∇_w L(f(x_i, w), y_i)
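To make the update rule concrete, here is a minimal NumPy sketch of stochastic gradient descent with the squared loss; it is not part of the recipe, which uses scikit-learn's SGDRegressor, and the function name and default values are illustrative:

import numpy as np

def sgd_squared_loss(x, y, eta=0.01, epochs=10):
    n, m = x.shape
    w = np.zeros(m)
    for epoch in range(epochs):
        # Shuffle the dataset at every epoch
        for i in np.random.permutation(n):
            y_hat = np.dot(w, x[i])                # prediction f(x_i, w)
            gradient = 2 * (y_hat - y[i]) * x[i]   # derivative of (y_hat - y_i)^2 w.r.t. w
            w = w - eta * gradient                 # move against the gradient
    return w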


Getting ready
We will be leveraging scikit-learn's implementation of SGD regression. As in some of the previous recipes, we will use the make_regression function from scikit-learn to generate data for our recipe in order to demonstrate stochastic gradient descent regression.

How to do it…
Let's start with a very simple example demonstrating how to build a stochastic gradient
descent regressor.

We will first load the required libraries. We will then write a function to generate predictors and
response variables to demonstrate regression:
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.cross_validation import train_test_split

def get_data():
    """
    Make a sample regression dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30

    x, y = make_regression(n_samples=1000, n_features=no_features,
            random_state=51)
    return x, y

We will proceed to write the functions that will help us build, validate, and inspect our model:
def build_model(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
            learning_rate='constant', eta0=0.01, fit_intercept=True,
            penalty='none')
    estimator.fit(x, y)

    return estimator

def model_worth(model, x, y):
    predicted_y = model.predict(x)
    print "\nMean absolute error = %0.2f" % mean_absolute_error(y, predicted_y)
    print "Mean squared error = %0.2f" % mean_squared_error(y, predicted_y)

def inspect_model(model):
    print "\nModel Intercept {0}".format(model.intercept_)
    print
    for i, coef in enumerate(model.coef_):
        print "Coefficient {0} = {1:.3f}".format(i+1, coef)

Finally, we will write our main function to invoke all the preceding functions:
if __name__ == "__main__":
x,y = get_data()

# Divide the data into Train, dev and test


x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_
size = 0.3,random_state=9)
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_
all,test_size=0.3,random_state=9)

model = build_model(x_train,y_train)

inspect_model(model)

print "Model worth on train data"


model_worth(model,x_train,y_train)
print "Model worth on dev data"
model_worth(model,x_dev,y_dev)

# Building model with l2 regularization


model = build_model_regularized(x_train,y_train)
inspect_model(model)

How it works…
Let's start with our main function. We will invoke the get_data function to generate our predictor x and response variable y:

x, y = get_data()

In the get_data function, we will leverage the convenient make_regression function from scikit-learn to generate a dataset for the regression problem:

no_features = 30
x, y = make_regression(n_samples=1000, n_features=no_features,
        random_state=51)


As you can see, we will generate a dataset with 1,000 instances, specified by the n_samples parameter, and 30 features, defined by the n_features parameter.

Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data for testing:

# Divide the data into Train, dev and test
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
        test_size=0.3, random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
        test_size=0.3, random_state=9)

With the data divided into sets to build, evaluate, and test the model, we will proceed to build our model.

We will invoke the build_model function with our training dataset:

model = build_model(x_train, y_train)

In build_model, we will leverage scikit-learn's SGDRegressor class to build our stochastic gradient descent model:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='none')
estimator.fit(x, y)

The SGD regressor is a versatile class that can be used to fit a number of linear models with a wide range of parameters. We will first explain the basic stochastic gradient descent method and then proceed to the other details.

Let's look at the parameters that we used. The first parameter, n_iter, is the number of times we want to go through our dataset in order to update the weights; here, we ask for 10 iterations. As with the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration; the shuffle parameter controls this. The default value of shuffle is True; we have included it here for explanation purposes. Our loss function is the squared loss, since we want to do a linear regression, and we specify this using the loss parameter.

Our learning rate, eta, is a constant, which we specify with the learning_rate parameter. We provide the value of our learning rate using the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required; in our case, we don't want any shrinkage, so we pass the string 'none'.


We will proceed to build our model by invoking the fit function with our predictor and response variable. Finally, we will return the model that we built to the calling function.

Let's now inspect our model and see the values of the intercept and coefficients:

inspect_model(model)

In inspect_model, we print the values of the model intercept and coefficients.

Let's now look at how our model has performed on our training data:

print "Model worth on train data"
model_worth(model, x_train, y_train)

We will invoke the model_worth function to look at our model's performance. The model_worth function prints the mean absolute error and mean squared error values.


The mean squared error is defined as follows:

(1/n) Σ_{i=1}^{n} (ŷ_i − y_i)^2

The mean absolute error is defined in the following way:

(1/n) Σ_{i=1}^{n} |ŷ_i − y_i|

The mean squared error is sensitive to outliers; hence, the mean absolute error is a more robust measure. model_worth reports both metrics, first for the training data and then for our dev data.

There's more…
We can include regularization in the stochastic gradient descent framework. Recall the following cost function of ridge regression from the previous chapter:

Σ_{i=1}^{n} ( y_i − w_0 − Σ_{j=1}^{m} x_ij w_j )^2 + α Σ_{j=1}^{m} w_j^2

We included an expanded version of the squared loss function here and added the regularization term, the sum of the squares of the weights. We can include it in our gradient descent procedure. Let's denote our regularization term as R(W). Our weight update is now as follows:

w_{i+1} = w_i − η ( ∇_w L(f(x_i, w), y_i) + α ∇_w R(w) )

As you can see, in addition to the derivative of the loss function with respect to the weight vector, w, the derivative of the regularization term with respect to the weights is added to our weight update rule.

Let's write a new function to build our model to include regularization:

def build_model_regularized(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
            learning_rate='constant', eta0=0.01, fit_intercept=True,
            penalty='l2', alpha=0.01)
    estimator.fit(x, y)

    return estimator

We can invoke this function from our main function as follows:

model = build_model_regularized(x_train, y_train)
inspect_model(model)

Let's see the new parameters that we passed, compared with our previous build_model method:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='l2', alpha=0.01)


Earlier, we set the penalty to none. Now, we specify that we want an L2 penalty added to our model, and we give an alpha value of 0.01 using the alpha parameter. Looking at the coefficients printed by inspect_model, you can see the effect of the L2 regularization: many of the coefficients have shrunk towards zero. Similarly, L1 regularization and elastic net, which combines both the L1 and L2 penalties, can be included using the penalty parameter.
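For instance, a hedged sketch (not from the recipe) of an elastic net variant would only change the penalty-related arguments; the l1_ratio value below is illustrative and balances the L1 and L2 terms:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='elasticnet', alpha=0.01, l1_ratio=0.5)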

Remember, in our introduction we mentioned that stochastic gradient descent is more of a framework than a single method. Other linear models can be generated using this framework by changing the loss function.


SVM regression models can be built using the epsilon-insensitive loss function. This loss function is defined as follows:

L(f(x_i, w), y_i) = 0                          if |y_i − f(x_i, w)| < ε
                    |y_i − f(x_i, w)| − ε      otherwise
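A hedged sketch (not part of the recipe) of fitting such a model simply switches the loss in build_model; the epsilon value below is illustrative:

def build_model_svr(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="epsilon_insensitive",
            epsilon=0.1, learning_rate='constant', eta0=0.01,
            fit_intercept=True, penalty='none')
    estimator.fit(x, y)
    return estimator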

Refer to the following URL for the various parameters that can be passed to the SGD regressor in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

See also
- The Predicting real valued numbers using regression recipe in Chapter 7, Machine Learning II
- The Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II

Using stochastic gradient descent for classification
A classification problem setup is very similar to a regression setup except for the response variable. In a classification setup, the response is a categorical variable. Due to its nature, we have a different loss function to measure the cost of wrong predictions. Let's assume a binary classifier for our discussion and recipe, where our target variable, Y, can take the values {0,1}.

We will use the derivative of this loss function in our weight update rule to arrive at our weight vectors.

The SGDClassifier class from scikit-learn provides us with a variety of loss functions. However, in this recipe, we will use the log loss, which gives us logistic regression.

Logistic regression fits a linear model to data of the following form:

W^T X


We have given a generalized notation; the intercept is assumed to be the first dimension of our weight vector. For a binary classification problem, a logit function is applied to get a prediction, as follows:

F(w, x_i) = 1 / (1 + e^(−w^T x_i))

The preceding function is also called the sigmoid function. For very large positive values of w^T x_i, this function returns a value close to one, and for very large negative values, a value close to zero. With this, we can define our log loss function as follows:

L(w, x_i) = −y_i log(F(w, x_i)) − (1 − y_i) log(1 − F(w, x_i))

With the preceding loss function fitted into the weight update rule of gradient descent, we can arrive at the appropriate weight vectors.
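As a small illustration (not part of the recipe; the weight vector and instance below are made-up numbers), the sigmoid and the log loss can be evaluated directly with NumPy:

import numpy as np

def sigmoid(w, x_i):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x_i)))

def log_loss_single(w, x_i, y_i):
    p = sigmoid(w, x_i)
    return -y_i * np.log(p) - (1 - y_i) * np.log(1 - p)

w = np.array([0.5, -0.25])
x_i = np.array([1.0, 2.0])
print log_loss_single(w, x_i, 1)   # w.x_i = 0, so p = 0.5 and the loss is -log(0.5)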

For the log loss function defined in scikit-learn, refer to the following URL:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

With this knowledge, let's jump into our recipe for stochastic gradient descent-based
classification.

Getting ready
We will leverage scikit-learn's implementation of the stochastic gradient descent classifier. As we did in some of the previous recipes, we will use the make_classification function from scikit-learn to generate data for our recipe in order to demonstrate stochastic gradient descent classification.

How to do it…

Let's start with a very simple example demonstrating how to build a stochastic gradient descent classifier.


We will first load the required libraries. We will then write a function to generate the predictors and response variables:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier

import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x, y = make_classification(n_samples=1000, n_features=no_features,
            flip_y=0.03, n_informative=informative_features,
            n_redundant=redundant_features, n_repeated=repeated_features,
            random_state=7)
    return x, y

We will proceed to write functions that will help us build and validate our model:

def build_model(x, y, x_dev, y_dev):
    estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
            learning_rate="constant", eta0=0.0001, fit_intercept=True,
            penalty="none")
    estimator.fit(x, y)
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y, train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev, dev_predicted)

    print
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score)

Finally, we will write our main function to invoke all the preceding functions:

if __name__ == "__main__":
    x, y = get_data()

    # Divide the data into Train, dev and test
    x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
            test_size=0.3, random_state=9)
    x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
            test_size=0.3, random_state=9)

    build_model(x_train, y_train, x_dev, y_dev)

How it works…

Let's start with our main function. We will invoke get_data to get our x predictor attributes and y response attributes. In get_data, we will leverage the make_classification function in order to generate training data for the stochastic gradient descent classifier:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x, y = make_classification(n_samples=1000, n_features=no_features,
            flip_y=0.03, n_informative=informative_features,
            n_redundant=redundant_features, n_repeated=repeated_features,
            random_state=7)
    return x, y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we need 1,000 instances. The second parameter is how many attributes per instance are required; we say that we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances. This is done to introduce noise into our data. The next parameter specifies how many of those 30 features should be informative enough to be used in our classification; we specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features that are drawn randomly from both the informative and redundant features.


Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data for testing:

# Divide the data into Train, dev and test
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
        test_size=0.3, random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
        test_size=0.3, random_state=9)

With the data divided to build, evaluate, and test the model, we will proceed to build our model:

build_model(x_train, y_train, x_dev, y_dev)

In build_model, we will leverage scikit-learn's SGDClassifier class to build our stochastic gradient descent method:

estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
        learning_rate="constant", eta0=0.0001, fit_intercept=True,
        penalty="none")

Let's look at the parameters that we used. The first parameter, n_iter, is the number of times we want to go through our dataset to update the weights; here, we ask for 50 iterations. As with the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration; the shuffle parameter is used for this. The default value of shuffle is True; we have included it here for explanation purposes. Our loss function is the log loss: we want to do a logistic regression, and we specify this using the loss parameter. Our learning rate, eta, is a constant, which we specify with the learning_rate parameter. We provide the value of our learning rate using the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required; in our case, we don't want any shrinkage, so we pass the string 'none'.

We will proceed to build our model by invoking the fit function with our predictor and response variable, and then evaluate the model with our training and dev datasets:

estimator.fit(x, y)
train_predicted = estimator.predict(x)
train_score = accuracy_score(y, train_predicted)
dev_predicted = estimator.predict(x_dev)
dev_score = accuracy_score(y_dev, dev_predicted)

print
print "Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score)


build_model prints the training and dev accuracy scores for our model.

There's more…
Regularization (L1, L2, or elastic net) can be applied to SGD classification. The procedure is the same as for regression, and hence we will not repeat it here; refer to the previous recipe.

The learning rate, eta, was constant in our example. This need not be the case: with every iteration, the eta value can be reduced. The learning_rate parameter can be set to 'optimal' or 'invscaling'. Refer to the following scikit-learn documentation:

http://scikit-learn.org/stable/modules/sgd.html

The parameter is specified as follows:

estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
        learning_rate="invscaling", eta0=0.001, fit_intercept=True,
        penalty="none")

We used the fit method to build our model. As mentioned previously, in large-scale machine learning we know that all the data will not be available to us at once. When we receive the data in batches, we need to use the partial_fit method instead of fit. Using the fit method would reinitialize the weights, and we would lose all the training information from the previous batch of data. Refer to the following link for more information on partial_fit:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit
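The following is a hedged sketch (not from the recipe) of batch-wise training with partial_fit. It reuses the get_data batch generator from the perceptron recipe at the start of this chapter to simulate a stream, so the labels are {-1, +1}:

estimator = SGDClassifier(loss="log", learning_rate="constant", eta0=0.0001,
        fit_intercept=True, penalty="none")
data = get_data(10)
for i in range(10):
    x, y = data.next()
    # classes must be supplied on the first call to partial_fit
    estimator.partial_fit(x, y, classes=np.array([-1, 1]))
    print "Accuracy after batch %d = %0.2f" % (i+1, estimator.score(x, y))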

See also
- The Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II
- The Using stochastic gradient descent for regression recipe, earlier in this chapter
