
Chapter 10
Large-Scale Machine Learning – Online Learning

In this chapter, we will see the following recipes:

- Using perceptron as an online learning algorithm
- Using stochastic gradient descent for regression
- Using stochastic gradient descent for classification

Introduction
In this chapter, we will concentrate on large-scale machine learning and the algorithms suited to tackling such problems. Until now, when we trained our models, we assumed that the training set could fit into our computer's memory. In this chapter, we will see how to go about building models when that assumption no longer holds: the training records are so large that we cannot fit them completely into memory, so we may have to load them piecewise and still produce a model with good accuracy. The argument of a training set not fitting into memory extends naturally to streaming data. With streaming data, we don't see all the data at once; we should be able to make decisions based on whatever data we have been exposed to so far, and we also need a mechanism for continuously improving the model as new data arrives.


We will introduce the framework of stochastic gradient descent-based algorithms. This is a versatile framework for handling very large-scale datasets that will not fit completely into memory. Several types of linear algorithms, including logistic regression, linear regression, and linear SVM, can be accommodated in this framework. The kernel trick, which we introduced in the previous chapter, can be included in this framework in order to deal with datasets that have nonlinear relationships.

We will begin our list of recipes with the perceptron algorithm, one of the oldest machine learning algorithms. The perceptron is easy to understand and implement, but it is limited to solving only linear problems; a kernel-based perceptron can be used on nonlinear datasets.

In our second recipe, we will formally introduce the framework of gradient descent-based methods and show how it can be used to perform regression tasks. We will look at different loss functions to see how different types of linear models can be built using them. We will also see how the perceptron belongs to the family of stochastic gradient descent methods.

In our final recipe, we will see how classification algorithms can be built using the stochastic gradient descent framework.

Even though we don't have a direct example of streaming data, we will use our existing datasets to see how streaming use cases can be addressed. Online learning algorithms are not limited to streaming data; they can also be applied to batch data, except that they process only one instance at a time.

Using perceptron as an online learning algorithm

As mentioned earlier, the perceptron is one of the oldest machine learning algorithms. It was first mentioned in a 1943 paper:

"A Logical Calculus of the Ideas Immanent in Nervous Activity", Warren S. McCulloch and Walter Pitts, University of Illinois, College of Medicine, Department of Psychiatry at the Illinois Neuropsychiatric Institute, University of Chicago, Chicago, U.S.A.

Let's revisit our definition of a classification problem. Each record or instance can be written as a pair (X, y), where X is a set of attributes and y is the corresponding class label.

Learning a target function, F, that maps each record's attribute set to one of the predefined class labels, y, is the job of a classification algorithm.

The difference in our case is that we have a large-scale learning problem. All our data will not fit into main memory, so we need to keep the data on disk and use only a portion of it at a time in order to build our perceptron model.


Let's proceed to outline the perceptron algorithm:

1. Initialize the weights of the model to small random numbers.
2. Center the input data, x, by subtracting its mean.
3. At each time step t (also called an epoch):
   - Shuffle the dataset
   - Pick a single instance and make a prediction
   - Observe the deviation of the prediction from the true label
   - Update the weights if the prediction differs from the true label

Let's consider the following scenario. We have the complete dataset on our disk. In a single epoch, that is, in step 3, all the steps mentioned are performed on all the data on our disk. In an online learning scenario, a bunch of instances based on a windowing function will be available to us at any point in time, and we can update the weights as many times as there are instances in our window in a single epoch.

Let's see how to go about updating our weights.

Let's say our input X is as follows:

X_i = {x_1, x_2, x_3, ..., x_m}, where i = 1 to n

Our Y is as follows:

Y = {+1, -1}

We will define our weights as the following equation:

W = {w_1, w_2, w_3, ..., w_m}

Our prediction after we see each record is defined as follows:

ŷ_i = sign(W · x_i)

The sign function returns +1 if the product of the weights and attributes is positive, and -1 if the product is negative.


The perceptron proceeds to compare the predicted y with the actual y. If the prediction is correct, it moves on to the next record. If the prediction is incorrect, there are two scenarios: if the predicted y is +1 and the actual y is -1, it decrements the weights by the x values, and vice versa; if the actual y is +1 and the predicted y is -1, it increments the weights. Let's see this as an equation for more clarity:

w_{t+1} = w_t + y_i x_i

Typically, a learning rate, alpha, is provided so that the weights are updated in a controlled manner. In the presence of noise in the data, a full increment or decrement would lead to the weights not converging:

w_{t+1} = w_t + α (y_i x_i)

Alpha is a very small value, typically between 0.1 and 0.4.
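To make the update rule concrete, here is a minimal sketch of a single update on made-up numbers; it is not part of the recipe's code, which follows below:

import numpy as np

w = np.array([0.1, -0.2])     # current weights
x_i = np.array([1.0, 2.0])    # one instance
y_i = 1                       # its true label
alpha = 0.1                   # learning rate

prediction = np.sign(np.sum(w * x_i))   # sign(0.1 - 0.4) = -1, a mistake
if prediction != y_i:
    w = w + alpha * (y_i * x_i)         # w becomes [0.2, 0.0]
print w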

Let's jump into our recipe now.

Getting ready
Let's generate data using make_classification in batches with a generator function to
simulate large-scale data and data streaming, and proceed to write the perceptron algorithm.

How to do it…
Let's load the necessary libraries. We will then write a function, get_data, which is a
generator:
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.preprocessing import scale
import numpy as np

def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)

    while b_size < batch_size:
        x, y = make_classification(n_samples=1000, n_features=no_features,
                flip_y=0.03, n_informative=informative_features,
                n_redundant=redundant_features, n_repeated=repeated_features,
                random_state=51)
        # Convert the {0,1} labels returned by make_classification to {-1,+1}
        y_indx = y < 1
        y[y_indx] = -1
        # Center and scale every feature
        x = scale(x, with_mean=True, with_std=True)

        yield x, y
        b_size += 1

We will proceed to write two functions, one to build our perceptron model and the other one to
test the worthiness of our model:
def build_model(x, y, weights, epochs, alpha=0.5):
    """
    Simple Perceptron
    """
    for i in range(epochs):
        # Shuffle the dataset; np.random.permutation returns a shuffled index array
        shuff_index = np.random.permutation(len(y))
        x_train = x[shuff_index, :]
        y_train = y[shuff_index]

        # Build weights one instance at a time
        for index in range(len(y)):
            prediction = np.sign(np.sum(x_train[index, :] * weights))
            if prediction != y_train[index]:
                weights = weights + alpha * (y_train[index] * x_train[index, :])

    return weights

def model_worth(x, y, weights):
    prediction = np.sign(np.sum(x * weights, axis=1))
    print classification_report(y, prediction)


Finally, we will write our main function to invoke all the preceding functions, to demonstrate
the perceptron algorithm:
if __name__ == "__main__":
data = get_data(10)
x,y = data.next()
weights = np.zeros(x.shape[1])
for i in range(10):
epochs = 100
weights = build_model(x,y,weights,epochs)
print
print "Model worth after receiving dataset batch %d"%(i+1)
model_worth(x,y,weights)
print
if i < 9:
x,y = data.next()

How it works…
Let's start with our main function. We will ask our generator to send us 10 sets of data:
data = get_data(10)

Here, we want to simulate both large-scale data and data streaming. While building our
model, we don't have access to all the data, just part of it:
x,y = data.next()

We will use the next() function in the generator in order to get the next set of data. In the
get_data function, we will use the make_classification function from scikit-learn:

x, y = make_classification(n_samples=1000, n_features=no_features,
        flip_y=0.03, n_informative=informative_features,
        n_redundant=redundant_features, n_repeated=repeated_features,
        random_state=51)

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we need 1,000 instances. The second parameter is how many attributes per instance are required; we will assume that we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of these 30 features should be informative enough to be used in our classification; we specified that 80 percent of our features, that is, 24 out of 30, should be informative. The next parameter is about the redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, repeated features are duplicate features that are drawn randomly from both the informative features and the redundant features.


When we call next(), we will get 1,000 instances of this data. This function returns a y label
as {0,1}; we want {-1,+1} and hence we will change all the zeros in y to -1:
y_indx = y < 1
y[y_indx] = -1

Finally, we will center and scale our data using the scale function from scikit-learn.

Let's proceed to build our model with the first batch of data. We will initialize our weights
matrix with zeros:
weights = np.zeros(x.shape[1])

As we need 10 batches of data to simulate large-scale learning and data streaming, we will do
the model building 10 times in the for loop:
for i in range(10):
    epochs = 100
    weights = build_model(x, y, weights, epochs)

Our perceptron algorithm is built in build_model. A predictor x, response variable y, the weights matrix, and the number of time steps or epochs are passed as parameters. In our case, we have set the number of epochs to 100. This function has one additional parameter, an alpha value:

def build_model(x, y, weights, epochs, alpha=0.5)

By default, we have set our alpha value to 0.5.

Let's now look inside build_model. We will start by shuffling the data:

# Shuffle the dataset
shuff_index = np.random.permutation(len(y))
x_train = x[shuff_index, :]
y_train = y[shuff_index]

We will go through each record in our dataset and start updating our weights:

# Build weights one instance at a time
for index in range(len(y)):
    prediction = np.sign(np.sum(x_train[index, :] * weights))
    if prediction != y_train[index]:
        weights = weights + alpha * (y_train[index] * x_train[index, :])

In the for loop, you can see that we make the prediction:

prediction = np.sign(np.sum(x_train[index, :] * weights))


We will multiply our training data with the weights and add them together. Finally, we will use the np.sign function to get our prediction. Now, based on the prediction, we will update our weights:

weights = weights + alpha * (y_train[index] * x_train[index, :])

That is all. We will return the weights to the calling function.

In our main function, we will invoke the model_worth function to print the goodness of the model. Here, we will use the classification_report convenience function to print the precision, recall, and F1-score of the model:

print
print "Model worth after receiving dataset batch %d" % (i+1)
model_worth(x, y, weights)

We will then proceed to update our model for the next batch of incoming data. Note that we do not reinitialize the weights parameter; it is carried over and updated with every new batch of data.

model_worth prints a classification report (precision, recall, and F1-score) after each batch of data is received.

There's more…
Scikit-learn provides us with an implementation of the perceptron. Refer to the following URL for more details:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
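As a rough sketch of how that class could replace our hand-rolled implementation, the snippet below (not part of the recipe; it assumes the get_data generator defined earlier) uses partial_fit to keep updating the same model batch after batch:

from sklearn.linear_model import Perceptron
import numpy as np

clf = Perceptron()
data = get_data(10)
for i in range(10):
    x, y = data.next()
    # partial_fit updates the existing weights instead of refitting from scratch;
    # the classes argument is required on the first call
    clf.partial_fit(x, y, classes=np.array([-1, 1]))
    print "Accuracy after batch %d = %0.2f" % (i+1, clf.score(x, y))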

Another improvement that can be made to the perceptron algorithm is to use more features.

Recall the prediction equation; we can rewrite it as follows:

ŷ = sign(W · φ(x_i))

We have replaced the x values with the output of a function, φ. Here, we can plug in a feature generator. For example, a polynomial feature generator can be added to our get_data function, as follows:
from sklearn.preprocessing import PolynomialFeatures

def get_data(batch_size):
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    b_size = 0
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.8*no_features)
    repeated_features = int(0.1*no_features)
    poly = PolynomialFeatures(degree=2)

    while b_size < batch_size:
        x, y = make_classification(n_samples=1000, n_features=no_features,
                flip_y=0.03, n_informative=informative_features,
                n_redundant=redundant_features, n_repeated=repeated_features,
                random_state=51)
        y_indx = y < 1
        y[y_indx] = -1
        # Expand the original attributes with degree-2 polynomial features
        x = poly.fit_transform(x)
        yield x, y
        b_size += 1

Finally, kernel-based perceptron algorithms are available to handle nonlinear datasets. Refer to the Wikipedia article for more information about the kernel perceptron:

https://en.wikipedia.org/wiki/Kernel_perceptron


See also
- The Learning and using kernels recipe in Chapter 5, Data Mining – Finding a Needle in a Haystack

Using stochastic gradient descent for regression

In a typical regression setup, we have a set of predictors (instances), as follows:

X = {x_1, x_2, ..., x_n}

Each instance has m attributes, as follows:

x_i = {x_i1, x_i2, ..., x_im}, where i = 1 to n

The response variable, Y, is a vector of real-valued entries. The job of regression is to find a function such that when x is provided as an input to this function, it returns y:

F(X) = Y

The preceding function is parameterized by a weight vector; that is, a combination of the weight vector and the input vector is used to predict Y. Rewriting the function with the weight vector, we get the following:

F(X, W) = Y

So, the question now is: how do we know that we have the right weight vector? We will use a loss function, L, to get the right weight vector. The loss function measures the cost of making a wrong prediction; it empirically measures the cost of predicting ŷ when the actual value is y. The regression problem now becomes the problem of finding the weight vector that minimizes the loss function. For our whole dataset of n elements, the overall loss function is as follows:

(1/n) Σ_{i=1}^{n} L(f(x_i, w), y_i)


Our weight vector should be the one that minimizes the preceding value.

Gradient descent is an optimization technique used to minimize the preceding equation. For this equation, we will find the gradient, that is, the first-order derivative with respect to W.

Unlike other optimization techniques, such as batch gradient descent, stochastic gradient descent operates on one instance at a time. The steps involved in stochastic gradient descent are as follows:

1. For each epoch, shuffle the dataset.
2. Pick an instance and its response variable, y.
3. Calculate the loss function and its derivative with respect to the weights.
4. Update the weights.

Let ∇_w denote the derivative with respect to w. The weights are updated as follows:

w_{i+1} = w_i − ∇_w L(f(x_i, w), y_i)

As you can see, the weights are moved in the opposite direction to the gradient, forcing a descent that will eventually give weight vector values that reduce the objective cost function.

A squared loss is a typical loss function used with regression. The squared loss of an instance is defined in the following way:

(ŷ − y)^2

The derivative of the preceding equation is substituted into the weight update equation. With this background knowledge, let's proceed to our recipe for stochastic gradient descent regression.

As explained in the perceptron recipe, a learning rate, eta, is added to the weight update equation in order to avoid the effect of noise:

w_{i+1} = w_i − η ∇_w L(f(x_i, w), y_i)
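To make the update rule concrete, here is a minimal NumPy sketch of stochastic gradient descent with the squared loss; it is not part of the recipe, which uses scikit-learn's SGDRegressor, and the function name and default values are illustrative:

import numpy as np

def sgd_squared_loss(x, y, eta=0.01, epochs=10):
    n, m = x.shape
    w = np.zeros(m)
    for epoch in range(epochs):
        # Shuffle the dataset at every epoch
        for i in np.random.permutation(n):
            y_hat = np.dot(w, x[i])                # prediction f(x_i, w)
            gradient = 2 * (y_hat - y[i]) * x[i]   # derivative of (y_hat - y_i)^2 w.r.t. w
            w = w - eta * gradient                 # move against the gradient
    return w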


Getting ready
We will be leveraging scikit-learn's implementation of SGD regression. As in some of the previous recipes, we will use the make_regression function from scikit-learn to generate data for our recipe in order to demonstrate stochastic gradient descent regression.

How to do it…
Let's start with a very simple example demonstrating how to build a stochastic gradient
descent regressor.

We will first load the required libraries. We will then write a function to generate predictors and
response variables to demonstrate regression:
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.cross_validation import train_test_split

def get_data():
    """
    Make a sample regression dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30

    x, y = make_regression(n_samples=1000, n_features=no_features,
            random_state=51)
    return x, y

We will proceed to write the functions that will help us build, validate, and inspect our model:
def build_model(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
            learning_rate='constant', eta0=0.01, fit_intercept=True,
            penalty='none')
    estimator.fit(x, y)

    return estimator

def model_worth(model, x, y):
    predicted_y = model.predict(x)
    print "\nMean absolute error = %0.2f" % mean_absolute_error(y, predicted_y)
    print "Mean squared error = %0.2f" % mean_squared_error(y, predicted_y)

def inspect_model(model):
    print "\nModel Intercept {0}".format(model.intercept_)
    print
    for i, coef in enumerate(model.coef_):
        print "Coefficient {0} = {1:.3f}".format(i+1, coef)

Finally, we will write our main function to invoke all the preceding functions:
if __name__ == "__main__":
x,y = get_data()

# Divide the data into Train, dev and test


x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_
size = 0.3,random_state=9)
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_
all,test_size=0.3,random_state=9)

model = build_model(x_train,y_train)

inspect_model(model)

print "Model worth on train data"


model_worth(model,x_train,y_train)
print "Model worth on dev data"
model_worth(model,x_dev,y_dev)

# Building model with l2 regularization


model = build_model_regularized(x_train,y_train)
inspect_model(model)

How it works…
Let's start with our main function. We will invoke the get_data function to generate our predictor x and response variable y:

x, y = get_data()

In the get_data function, we will leverage the convenient make_regression function from scikit-learn to generate a dataset for the regression problem:

no_features = 30
x, y = make_regression(n_samples=1000, n_features=no_features,
        random_state=51)


As you can see, we will generate a dataset with 1,000 instances, specified by the n_samples parameter, and 30 features, defined by the n_features parameter.

Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data for testing:

# Divide the data into Train, dev and test
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
        test_size=0.3, random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
        test_size=0.3, random_state=9)

With the data divided into sets to build, evaluate, and test the model, we will proceed to build our model.

We will invoke the build_model function with our training dataset:

model = build_model(x_train, y_train)

In build_model, we will leverage scikit-learn's SGDRegressor class to build our stochastic gradient descent model:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='none')
estimator.fit(x, y)

The SGD regressor is a versatile class that can be used to fit a number of linear models with a wide range of parameters. We will first explain the basic stochastic gradient descent method and then proceed to the other details.

Let's look at the parameters that we used. The first parameter, n_iter, is the number of times we want to go through our dataset in order to update the weights; here, we ask for 10 iterations. As with the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration; the shuffle parameter controls this. The default value of shuffle is True; we have included it here for explanation purposes. Our loss function is the squared loss, since we want to do a linear regression, and we specify this using the loss parameter.

Our learning rate, eta, is a constant, which we specify with the learning_rate parameter. We provide the value of our learning rate using the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required; in our case, we don't want any shrinkage, so we pass the string 'none'.


We will proceed to build our model by invoking the fit function with our predictor and response variable. Finally, we will return the model that we built to the calling function.

Let's now inspect our model and see the values of the intercept and coefficients:

inspect_model(model)

In inspect_model, we print the values of the model intercept and coefficients.

Let's now look at how our model has performed on our training data:

print "Model worth on train data"
model_worth(model, x_train, y_train)

We will invoke the model_worth function to look at our model's performance. The model_worth function prints the mean absolute error and mean squared error values.


The mean squared error is defined as follows:

(1/n) Σ_{i=1}^{n} (ŷ_i − y_i)^2

The mean absolute error is defined in the following way:

(1/n) Σ_{i=1}^{n} |ŷ_i − y_i|

The mean squared error is sensitive to outliers; hence, the mean absolute error is a more robust measure. model_worth reports both metrics, first for the training data and then for our dev data.

There's more…
We can include regularization in the stochastic gradient descent framework. Recall the following cost function of ridge regression from the previous chapter:

Σ_{i=1}^{n} ( y_i − w_0 − Σ_{j=1}^{m} x_ij w_j )^2 + α Σ_{j=1}^{m} w_j^2

We included an expanded version of the squared loss function here and added the regularization term, the sum of the squares of the weights. We can include it in our gradient descent procedure. Let's denote our regularization term as R(W). Our weight update is now as follows:

w_{i+1} = w_i − η ( ∇_w L(f(x_i, w), y_i) + α ∇_w R(w) )

As you can see, in addition to the derivative of the loss function with respect to the weight vector, w, the derivative of the regularization term with respect to the weights is added to our weight update rule.

Let's write a new function to build our model to include regularization:

def build_model_regularized(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
            learning_rate='constant', eta0=0.01, fit_intercept=True,
            penalty='l2', alpha=0.01)
    estimator.fit(x, y)

    return estimator

We can invoke this function from our main function as follows:

model = build_model_regularized(x_train, y_train)
inspect_model(model)

Let's see the new parameters that we passed, compared with our previous build_model method:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='l2', alpha=0.01)


Earlier, we set the penalty to none. Now, we specify that we want an L2 penalty added to our model, and we give an alpha value of 0.01 using the alpha parameter. Looking at the coefficients printed by inspect_model, you can see the effect of the L2 regularization: many of the coefficients have shrunk towards zero. Similarly, L1 regularization and elastic net, which combines both the L1 and L2 penalties, can be included using the penalty parameter.
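For instance, a hedged sketch (not from the recipe) of an elastic net variant would only change the penalty-related arguments; the l1_ratio value below is illustrative and balances the L1 and L2 terms:

estimator = SGDRegressor(n_iter=10, shuffle=True, loss="squared_loss",
        learning_rate='constant', eta0=0.01, fit_intercept=True,
        penalty='elasticnet', alpha=0.01, l1_ratio=0.5)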

Remember, in our introduction we mentioned that stochastic gradient descent is more of a framework than a single method. Other linear models can be generated using this framework by changing the loss function.


SVM regression models can be built using the epsilon-insensitive loss function. This loss function is defined as follows:

L(f(x_i, w), y_i) = 0                          if |y_i − f(x_i, w)| < ε
                    |y_i − f(x_i, w)| − ε      otherwise
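A hedged sketch (not part of the recipe) of fitting such a model simply switches the loss in build_model; the epsilon value below is illustrative:

def build_model_svr(x, y):
    estimator = SGDRegressor(n_iter=10, shuffle=True, loss="epsilon_insensitive",
            epsilon=0.1, learning_rate='constant', eta0=0.01,
            fit_intercept=True, penalty='none')
    estimator.fit(x, y)
    return estimator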

Refer to the following URL for the various parameters that can be passed to the SGD regressor in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

See also
- The Predicting real valued numbers using regression recipe in Chapter 7, Machine Learning II
- The Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II

Using stochastic gradient descent for classification
A classification problem setup is very similar to a regression setup except for the response variable. In a classification setup, the response is a categorical variable. Due to its nature, we have a different loss function to measure the cost of wrong predictions. Let's assume a binary classifier for our discussion and recipe, where our target variable, Y, can take the values {0,1}.

We will use the derivative of this loss function in our weight update rule to arrive at our weight vectors.

The SGDClassifier class from scikit-learn provides us with a variety of loss functions. However, in this recipe, we will use the log loss, which gives us logistic regression.

Logistic regression fits a linear model to data of the following form:

W^T X


We have given a generalized notation; the intercept is assumed to be the first dimension of our weight vector. For a binary classification problem, a logit function is applied to get a prediction, as follows:

F(w, x_i) = 1 / (1 + e^(−w^T x_i))

The preceding function is also called the sigmoid function. For very large positive values of w^T x_i, this function returns a value close to one, and for very large negative values, a value close to zero. With this, we can define our log loss function as follows:

L(w, x_i) = −y_i log(F(w, x_i)) − (1 − y_i) log(1 − F(w, x_i))

With the preceding loss function fitted into the weight update rule of gradient descent, we can arrive at the appropriate weight vectors.
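As a small illustration (not part of the recipe; the weight vector and instance below are made-up numbers), the sigmoid and the log loss can be evaluated directly with NumPy:

import numpy as np

def sigmoid(w, x_i):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x_i)))

def log_loss_single(w, x_i, y_i):
    p = sigmoid(w, x_i)
    return -y_i * np.log(p) - (1 - y_i) * np.log(1 - p)

w = np.array([0.5, -0.25])
x_i = np.array([1.0, 2.0])
print log_loss_single(w, x_i, 1)   # w.x_i = 0, so p = 0.5 and the loss is -log(0.5)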

For the log loss function defined in scikit-learn, refer to the following URL:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

With this knowledge, let's jump into our recipe for stochastic gradient descent-based
classification.

Getting ready
We will leverage scikit-learn's implementation of the stochastic gradient descent classifier. As we did in some of the previous recipes, we will use the make_classification function from scikit-learn to generate data for our recipe in order to demonstrate stochastic gradient descent classification.

How to do it…

Let's start with a very simple example demonstrating how to build a stochastic gradient descent classifier.


We will first load the required libraries. We will then write a function to generate the predictors and response variables:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier

import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x, y = make_classification(n_samples=1000, n_features=no_features,
            flip_y=0.03, n_informative=informative_features,
            n_redundant=redundant_features, n_repeated=repeated_features,
            random_state=7)
    return x, y

We will proceed to write functions that will help us build and validate our model:

def build_model(x, y, x_dev, y_dev):
    estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
            learning_rate="constant", eta0=0.0001, fit_intercept=True,
            penalty="none")
    estimator.fit(x, y)
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y, train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev, dev_predicted)

    print
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score)

Finally, we will write our main function to invoke all the preceding functions:

if __name__ == "__main__":
    x, y = get_data()

    # Divide the data into Train, dev and test
    x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
            test_size=0.3, random_state=9)
    x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
            test_size=0.3, random_state=9)

    build_model(x_train, y_train, x_dev, y_dev)

How it works…

Let's start with our main function. We will invoke get_data to get our x predictor attributes and y response attributes. In get_data, we will leverage the make_classification function in order to generate training data for the stochastic gradient descent classifier:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x, y = make_classification(n_samples=1000, n_features=no_features,
            flip_y=0.03, n_informative=informative_features,
            n_redundant=redundant_features, n_repeated=repeated_features,
            random_state=7)
    return x, y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we need 1,000 instances. The second parameter is how many attributes per instance are required; we say that we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances. This is done to introduce noise into our data. The next parameter specifies how many of those 30 features should be informative enough to be used in our classification; we specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features that are drawn randomly from both the informative and redundant features.


Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data for testing:

# Divide the data into Train, dev and test
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y,
        test_size=0.3, random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
        test_size=0.3, random_state=9)

With the data divided to build, evaluate, and test the model, we will proceed to build our model:

build_model(x_train, y_train, x_dev, y_dev)

In build_model, we will leverage scikit-learn's SGDClassifier class to build our stochastic gradient descent method:

estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
        learning_rate="constant", eta0=0.0001, fit_intercept=True,
        penalty="none")

Let's look at the parameters that we used. The first parameter, n_iter, is the number of times we want to go through our dataset to update the weights; here, we ask for 50 iterations. As with the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration; the shuffle parameter is used for this. The default value of shuffle is True; we have included it here for explanation purposes. Our loss function is the log loss: we want to do a logistic regression, and we specify this using the loss parameter. Our learning rate, eta, is a constant, which we specify with the learning_rate parameter. We provide the value of our learning rate using the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required; in our case, we don't want any shrinkage, so we pass the string 'none'.

We will proceed to build our model by invoking the fit function with our predictor and response variable, and then evaluate the model with our training and dev datasets:

estimator.fit(x, y)
train_predicted = estimator.predict(x)
train_score = accuracy_score(y, train_predicted)
dev_predicted = estimator.predict(x_dev)
dev_score = accuracy_score(y_dev, dev_predicted)

print
print "Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score)


build_model prints the training and dev accuracy scores for our model.

There's more…
Regularization (L1, L2, or elastic net) can be applied to SGD classification. The procedure is the same as for regression, and hence we will not repeat it here; refer to the previous recipe.

The learning rate, eta, was constant in our example. This need not be the case: with every iteration, the eta value can be reduced. The learning_rate parameter can be set to 'optimal' or 'invscaling'. Refer to the following scikit-learn documentation:

http://scikit-learn.org/stable/modules/sgd.html

The parameter is specified as follows:

estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
        learning_rate="invscaling", eta0=0.001, fit_intercept=True,
        penalty="none")

We used the fit method to build our model. As mentioned previously, in large-scale machine learning we know that all the data will not be available to us at once. When we receive the data in batches, we need to use the partial_fit method instead of fit. Using the fit method would reinitialize the weights, and we would lose all the training information from the previous batch of data. Refer to the following link for more information on partial_fit:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit
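The following is a hedged sketch (not from the recipe) of batch-wise training with partial_fit. It reuses the get_data batch generator from the perceptron recipe at the start of this chapter to simulate a stream, so the labels are {-1, +1}:

estimator = SGDClassifier(loss="log", learning_rate="constant", eta0=0.0001,
        fit_intercept=True, penalty="none")
data = get_data(10)
for i in range(10):
    x, y = data.next()
    # classes must be supplied on the first call to partial_fit
    estimator.partial_fit(x, y, classes=np.array([-1, 1]))
    print "Accuracy after batch %d = %0.2f" % (i+1, estimator.score(x, y))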

See also
- The Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II
- The Using stochastic gradient descent for regression recipe, earlier in this chapter
