Vectorized Logistic Regression

The document describes the vectorized implementation of logistic regression. It discusses how the non-vectorized implementation computes the gradient descent update rule for each feature individually in a loop, and how the vectorized version reformulates this to compute the gradient for all features simultaneously using matrix and vector operations, avoiding explicit loops and speeding up computation significantly. Code examples compare the performance of the non-vectorized and vectorized versions on a sample classification dataset; the vectorized implementation runs much faster because it avoids redundant computation.


Vectorized Implementation of Logistic Regression
• Recall that, for the non-vectorized implementation of logistic regression, the batch
update rule is:

For each feature $j$ ($j = 0, 1, \dots, n$) {

    for each training example $i$ ($i = 1, \dots, m$), accumulate the sum:

    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

}

where $h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$ and $m$ is the number of training examples.
• Notice that, in the sum over training examples, $h_\theta(x^{(i)})$ must be computed for
each training example $i$:

$h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$

• Thus, the summation over the $m$ training examples produces $m$ terms $h_\theta(x^{(i)})$,
which can be written as an $m$-component vector $h$:

$h = \begin{bmatrix} h^{(1)} \\ h^{(2)} \\ \vdots \\ h^{(m)} \end{bmatrix}$
• Moreover, a vector $z$ is created, with $z^{(i)} = \theta^T x^{(i)}$, so that each component of $h$ is

$h^{(i)} = \frac{1}{1 + e^{-z^{(i)}}}$

• The dimension of $h$ (and of $z$) is $m \times 1$.
• Furthermore, $y^{(i)}$ is subtracted from the corresponding component $h^{(i)}$ of $h$:

$h - y = \begin{bmatrix} h^{(1)} - y^{(1)} \\ h^{(2)} - y^{(2)} \\ \vdots \\ h^{(m)} - y^{(m)} \end{bmatrix}$

• The dimension of the vector $(h - y)$ is $m \times 1$.
• Finally, $x_j^{(i)}$ multiplies each difference component $\left( h^{(i)} - y^{(i)} \right)$:

$\left( h^{(1)} - y^{(1)} \right) x_j^{(1)}, \quad \left( h^{(2)} - y^{(2)} \right) x_j^{(2)}, \quad \dots, \quad \left( h^{(m)} - y^{(m)} \right) x_j^{(m)}$

• This can be viewed as a component-by-component product of the vectors $x_j$ and $(h - y)$.

• The vector $x_j$ has $m$ components.

• The vector $(h - y)$ has $m$ components.
• Therefore,

$\text{slope}_j = \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_j^{(i)}$

where $x_j^{(i)}$ is training example $i$ of feature $j$. The sum is a vector (dot) product of
$x_j$ and $(h - y)$.
• Therefore, to compute the slope for feature $j$:

1. Compute the vector $h$.
2. Subtract the given vector $y$ from $h$.
3. Take the sum of the products obtained from the component-wise multiplication of the
vectors $x_j$ and $(h - y)$, as in the sketch below.
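• A minimal Octave sketch of these three steps for a single feature j (an illustration; it
assumes x is the m x (n+1) design matrix with an intercept column of ones, theta the current
parameter vector, and y the m x 1 label vector):

j = 2;                              % example feature index (assumed for illustration)
z = x * theta;                      % m x 1 vector, z(i) = theta' * x(i,:)'
h = 1.0 ./ (1.0 + exp(-z));         % step 1: m x 1 vector of hypotheses
d = h - y;                          % step 2: m x 1 vector of differences
slope_j = sum(x(:, j) .* d);        % step 3: component-wise product, then sum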
• Consider the dimensions of the vectors in $\sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_j^{(i)}$:

• $x_j$ has dimension $m \times 1$, i.e., $x_j = \begin{bmatrix} x_j^{(1)} \\ \vdots \\ x_j^{(m)} \end{bmatrix}$

• $(h - y)$ has dimension $m \times 1$, i.e., $h - y = \begin{bmatrix} h^{(1)} - y^{(1)} \\ \vdots \\ h^{(m)} - y^{(m)} \end{bmatrix}$

• To obtain conformant arguments for the sum of the products, we can transpose $x_j$, so that
$x_j^T$ is $1 \times m$, and then rearrange terms:

$\text{slope}_j = x_j^T (h - y)$
• Expanding:

$x_j^T (h - y) = \begin{bmatrix} x_j^{(1)} & x_j^{(2)} & \cdots & x_j^{(m)} \end{bmatrix} \begin{bmatrix} h^{(1)} - y^{(1)} \\ h^{(2)} - y^{(2)} \\ \vdots \\ h^{(m)} - y^{(m)} \end{bmatrix} = \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_j^{(i)}$
• This is the slope (partial derivative of the cost) for feature $j$.

• The list of slopes, formed by finding the slope for each feature, forms a gradient
(i.e., a vector of partial derivatives):

$\nabla = X^T (h - y) = \begin{bmatrix} x_0^T (h - y) \\ x_1^T (h - y) \\ \vdots \\ x_n^T (h - y) \end{bmatrix}$

• Here $X$ is the complete data matrix, i.e., all features of all training examples.
• Consider the dimensions of the arrays in the gradient:

• $X$ has dimension $m \times (n+1)$, i.e., one row per training example and one column per
feature (including the intercept feature $x_0 = 1$).

• $(h - y)$ has dimension $m \times 1$, i.e., one component per training example.

• To obtain conformant arguments in the product, we can transpose $X$, so that $X^T$ is
$(n+1) \times m$:

$\nabla = X^T (h - y)$
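• A small Octave check of these dimensions (a sketch; it loads the Example 2 data and adds
the intercept column, as in the code later in these slides):

x = load('ex4x.dat'); y = load('ex4y.dat');
m = size(x, 1);
x = [ones(m, 1), x];                % m x (n+1) design matrix
theta = zeros(size(x, 2), 1);
h = 1.0 ./ (1.0 + exp(-(x * theta)));
size(x')                            % (n+1) x m
size(h - y)                         % m x 1
grad = x' * (h - y);                % (n+1) x 1: one slope per feature
size(grad)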
• Expanding the vector product:

$X^T (h - y) = \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \cdots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ \vdots & \vdots & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} h^{(1)} - y^{(1)} \\ h^{(2)} - y^{(2)} \\ \vdots \\ h^{(m)} - y^{(m)} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_0^{(i)} \\ \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_n^{(i)} \end{bmatrix}$
• For the vectorized implementation, the batch update rule is:

For each feature (j=0; j<=n; j++) {

    $\theta_j := \theta_j - \alpha \frac{1}{m} \, x_j^T (h - y)$

}

• In vector format:

$\theta := \theta - \alpha \frac{1}{m} \, X^T (h - y)$
• Example 2 uses the data set of "Exercise 4: Logistic
Regression and Newton's Method" from:

http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

• Example 2 compares the code and performance of
the non-vectorized and the vectorized
implementations of basic logistic regression.
• This example uses a data set in which the first two columns (x1 and x2) are exam scores
and the 3rd column (y) is the pass/fail mark.

• 0 represents fail, and 1 represents pass.

 #     x1          x2          y
 1     5.55E+01    6.95E+01    1
 2     4.10E+01    8.15E+01    1
 …     …           …           1
 78    1.85E+01    7.45E+01    0
 79    1.60E+01    7.25E+01    0
 80    3.35E+01    6.80E+01    0
Opening and Plotting Data Files

clear all; close all; clc
x = load('ex4x.dat'); y = load('ex4y.dat');
figure;
hold on
set(0, 'defaultaxesfontname', 'Arial');
set(0, 'defaultaxesfontsize', 16);

% Plot positive examples as green '+' and negative examples as red 'o'
for i=1:length(y)
  if (y(i)==1)
    plot3(x(i,1),x(i,2),y(i),'+', 'color', 'g', 'markersize', 8);
  else
    plot3(x(i,1),x(i,2),y(i),'o', 'color', 'r', 'markersize', 8);
  endif
endfor
xlabel('Exam 1 Score', 'fontsize', 18, 'fontname', 'Arial');
ylabel('Exam 2 Score', 'fontsize', 18, 'fontname', 'Arial');
zlabel('Pass/Fail', 'fontsize', 18, 'fontname', 'Arial');
title('Exam Scores', 'fontsize', 20, 'fontname', 'Arial');
m = numTrainSam;                      % setup is the same as in Example 1
prevTheta = theta;
for t=1:maxIterations
  totError = 0;
  for j=1:numFeatures                 % update each feature's parameter separately
    totSlope = 0;
    for i=1:m                         % loop over all training examples
      z = 0;
      for jj=1:numFeatures            % compute z = theta' * x(i,:)' one term at a time
        z = z + prevTheta(jj)*x(i,jj);
      end
      h = 1.0/(1.0+exp(-z));          % sigmoid hypothesis
      totSlope = (totSlope + (h-y(i))*x(i,j));
      totError = (totError + -y(i)*log(h)-(1-y(i))*log(1-h));
    end
    totError = totError/numTrainSam;
    theta(j) = theta(j) - learningRate*(totSlope/numTrainSam);
  end
  prevTheta = theta;
  errorPerIteration(t) = totError/numFeatures;
end
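• The initialization from Example 1 is not reproduced on this slide; a minimal setup
consistent with the loop above might be the sketch below (learningRate and maxIterations
are assumed values, not stated on the slide):

x = load('ex4x.dat'); y = load('ex4y.dat');
numTrainSam = size(x, 1);
x = [ones(numTrainSam, 1), x];        % add the intercept feature x0 = 1
numFeatures = size(x, 2);
theta = zeros(numFeatures, 1);
learningRate = 0.001;                 % assumed value
maxIterations = 100000;
errorPerIteration = zeros(maxIterations, 1);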
for t = 1:MAX_ITR
  % Update theta: theta := theta - alpha * (1/m) * x' * (h - y), with h = g(x * theta)
  z = x * theta;
  h = g(z);
  grad = (1/m) .* x' * (h - y);
  theta = theta - alpha .* grad;
  % Calculate J (for testing convergence)
  J(t) = (1/m)*sum(-y.*log(h) - (1-y).*log(1-h));
end
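• The helper g is not defined on this slide; from its use as the hypothesis h = g(z), it is
presumably the sigmoid function. A minimal definition (an assumed anonymous-function form)
would be:

g = @(z) 1.0 ./ (1.0 + exp(-z));      % element-wise sigmoid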
[Figure: numerical illustration of the product x * theta = z for the exam data.]

• After 1,000 iterations, the resulting theta is still suboptimal. A much better theta is
achieved after 10,000,000 iterations.
[Figure: illustration of matrix dimensions (m columns, m rows).]
Comparison: Non-vectorized vs Vectorized

                    Non-vectorized    Vectorized
  Iterations        100,000           100,000
  CPU Time (s)      4805.7            24.430
  theta_0           -3.998494         -3.998494
  theta_1           0.065799          0.065799
  theta_2           0.024059          0.024059

  Iterations        10,000,000        10,000,000
  CPU Time (s)      ≈500,000          2532.5
  theta_0                             -16.37865
  theta_1                             0.14834
  theta_2                             0.15891
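• The slides do not state how the CPU times were measured; one way to reproduce a comparable
measurement for the vectorized loop in Octave is sketched below (alpha and the iteration
count are assumed values):

x = load('ex4x.dat'); y = load('ex4y.dat');
m = size(x, 1);
x = [ones(m, 1), x];                  % add intercept column
theta = zeros(size(x, 2), 1);
alpha = 0.001; MAX_ITR = 100000;      % assumed settings
t0 = cputime;
for t = 1:MAX_ITR
  h = 1.0 ./ (1.0 + exp(-(x * theta)));
  theta = theta - alpha .* ((1/m) .* x' * (h - y));
end
printf('CPU time: %.1f s\n', cputime - t0);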
Visualizing Classification Result

theta = [-16.380; 0.1483; 0.1589]

+  Positive training examples.
Δ  Predictions.
o  Negative training examples.

• Predictions map to [0, 1], and they represent the probability that the training example
is positive.
[Figure: predictions plotted against positive and negative training examples; annotations
mark examples labeled as Positive but predicted as Negative, and examples labeled as
Negative but predicted as Positive.]

• The model is unable to learn 15 of the training examples.
• Training accuracy = 65/80 = 81.25% (15 of the 80 examples are misclassified), which can
be checked as sketched below.
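• A minimal Octave sketch of the accuracy calculation (assuming x is the design matrix with
the intercept column, theta is the learned parameter vector, and 0.5 is the decision
threshold):

p = 1.0 ./ (1.0 + exp(-(x * theta)));   % predicted probability of the positive class
pred = (p >= 0.5);                      % predicted label: 1 if probability >= 0.5
accuracy = mean(pred == y);             % fraction of correctly classified examples
printf('Training accuracy: %.2f%%\n', 100 * accuracy);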
[Figure: a region of the plot where positive and negative examples overlap; these examples
are very difficult to differentiate or classify.]

• The model is unable to differentiate some examples, probably because their feature values
are very similar.