Stochastic Gradient Descent
SGD works by iteratively updating the model parameters in the direction of the negative
gradient of the cost function. However, instead of using the entire training dataset to
calculate the gradient, SGD uses a single training example or a small batch of training
examples. This makes each update much cheaper than a full batch gradient descent step, but it
can also make the optimization less stable.
The main advantage of SGD is that it is very computationally efficient. This makes it a
good choice for large datasets. Additionally, SGD can be used to train models with
non-convex cost functions, which can be difficult to train with other optimization
algorithms.
However, SGD can also be less stable than other optimization algorithms. This is
because the gradient of the cost function can be very noisy, especially for small batches.
This noise can make it difficult for the algorithm to converge to a minimum.
To address this issue, we can use a technique called momentum. Momentum keeps a running
(exponentially weighted) average of past gradients, which smooths out the noise in the gradient
and can help the algorithm converge more quickly.
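As an illustration, here is a minimal sketch of the momentum update, assuming a hypothetical helper gradient_fn(w) that returns the noisy gradient of the cost for one randomly chosen training example (neither the function name nor the default values come from this text):

import numpy as np

def sgd_with_momentum(w, gradient_fn, lr=0.01, beta=0.9, steps=1000):
    """Plain SGD update smoothed by a momentum (velocity) term."""
    velocity = np.zeros_like(w)
    for _ in range(steps):
        grad = gradient_fn(w)              # noisy per-example gradient
        velocity = beta * velocity + grad  # running average smooths the noise
        w = w - lr * velocity              # step along the smoothed direction
    return w

With beta = 0 this reduces to plain SGD; values around 0.9 are a common choice.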
In general, SGD is a good choice for large datasets. However, if the dataset is small or noisy,
then other optimization algorithms may be a better choice.
Benefits:
Computationally efficient
Can be used to train models with non-convex cost functions
Can be used to train models on large datasets
Drawbacks:
Noisy gradients can make convergence less stable than batch gradient descent
May fluctuate around the minimum rather than converging smoothly, especially on small or noisy datasets
Let's explore the steps involved in stochastic gradient descent in more detail (a compact code sketch of the full loop follows the list):
1. Initialize Parameters: Start by initializing the model's parameters, such as weights and
biases, with random values. These parameters will be iteratively updated during the
training process.
2. Define the Cost Function: Choose an appropriate cost function that measures the
discrepancy between the predicted values of the model and the actual values in the
training dataset. The choice of cost function depends on the specific problem at hand.
3. Choose a Learning Rate: Select a learning rate, which determines the step size taken in
the direction of the gradients during parameter updates. The learning rate is a
hyperparameter that needs to be carefully chosen. A large learning rate may cause
overshooting, while a small learning rate can result in slow convergence.
4. Iterate Over the Training Instances:
a. Randomly Shuffle the Training Dataset: Shuffle the training dataset randomly. This step is
crucial to introduce randomness in the training process and avoid potential biases due to the
order of the instances.
b. Update on One Training Instance: Perform forward propagation to obtain the predicted
output for the selected instance, then compute the loss by evaluating the cost function on the
predicted output and the actual target. This quantifies the model's performance on that specific
instance. Backpropagate the error to obtain the gradients and update the parameters
accordingly.
c. Repeat Step 4b for Each Training Instance: Iterate over all the training instances in the
shuffled dataset, applying forward propagation, computing loss, backpropagation, and
parameter updates for each instance.
5. Repeat Steps 4a-4c: Repeat the process of random shuffling the dataset and iterating
over the instances until a stopping criterion is met. The stopping criterion can be a
maximum number of iterations or a threshold on the improvement of the cost function.
6. Evaluate Model: Once the training process is complete, evaluate the trained model's
performance on a separate validation or test dataset. This step gives an estimate of how
well the model is likely to generalize to unseen data.
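To make the recipe concrete, here is an illustrative sketch of steps 1-5 for a linear model with a squared-error cost. The function name and the assumption that X is a NumPy array of shape (n_samples, n_features) and y of shape (n_samples,) are choices made for this example:

import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=100):
    n_samples, n_features = X.shape
    w = np.random.randn(n_features) * 0.01             # step 1: random initialization
    b = 0.0
    for _ in range(epochs):                             # step 5: repeat until the stopping criterion
        for idx in np.random.permutation(n_samples):    # step 4a: shuffle the order of instances
            y_hat = X[idx] @ w + b                      # forward pass for one instance
            error = y_hat - y[idx]                      # gradient of the squared-error cost (step 2)
            w -= learning_rate * error * X[idx]         # step 4b: updates scaled by the
            b -= learning_rate * error                  # learning rate chosen in step 3
    return w, b

Step 6 (evaluation) would then be carried out on a held-out test set, for example with r2_score.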
Stochastic gradient descent has several advantages, including computational efficiency due to
updates based on subsets of instances, the ability to escape local minima, and the potential
for faster convergence. However, it may exhibit more fluctuations in the optimization process
due to the noisy gradients computed on individual instances. To strike a balance between
computational efficiency and convergence stability, other variations of gradient descent, such
as mini-batch gradient descent, can be used.
In [1]: # code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
In [3]: print(X.shape)
print(y.shape)
(442, 10)
(442,)
Out[5]: LinearRegression()
Out[7]: 0.4399338661568968
In [8]: X_train.shape
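The cells that load the data and fit the baseline are not shown above. Given the (442, 10) shape, the data is presumably scikit-learn's diabetes dataset, and the hidden cells likely looked roughly like this (the split parameters are assumptions):

from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)            # X: (442, 10), y: (442,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
reg = LinearRegression()
reg.fit(X_train, y_train)                        # Out[5]: LinearRegression()
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)                         # Out[7]: roughly 0.44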
Explanation
The code in the focal cell defines a class called SGDRegressor, which represents a stochastic
gradient descent (SGD) regressor. This class allows you to perform linear regression using
stochastic gradient descent as the optimization algorithm.
The code starts by defining the SGDRegressor class. It has two parameters: learning_rate
(the learning rate for gradient descent, with a default value of 0.01) and epochs (the
number of epochs for training, with a default value of 100).
1. The init method is the constructor of the class. It initializes the instance variables coef_
and intercept_ as None, and assigns the provided learning_rate and epochs values to the
corresponding instance variables.
2. The fit method is used to train the SGDRegressor on the provided training data. It takes
X_train (the input training data) and y_train (the target training data) as parameters.
3. Inside the fit method, the coefficients (coef_) and intercept (intercept_) are initialized. The
intercept is set to 0, and the coefficients are set to an array of ones with the same shape
as the number of features in the input data.
4. The training process starts with two nested loops. The outer loop iterates epochs number
of times, and the inner loop iterates over each sample in the training data.
5. In each iteration of the inner loop, a random index (idx) is generated to select a random
training sample.
6. The predicted output (y_hat) for the selected sample is calculated using the dot product of
the input data (X_train[idx]) and the coefficients (self.coef_), plus the intercept
(self.intercept_).
7. The gradient of the loss with respect to the intercept and the coefficients is computed for the
selected sample.
8. The intercept and coefficients are then updated by subtracting the learning rate multiplied by
their respective gradients.
9. After the training process is complete, the final intercept and coefficients are printed using
print(self.intercept_, self.coef_).
10. The predict method takes the test data (X_test) as input and returns the predicted output
for the test data. It calculates the predicted output using the dot product of the test data
and the coefficients, plus the intercept. A sketch of the class, reconstructed from this
description, is shown below.
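The focal cell itself is not reproduced above; the following sketch reconstructs the class from the description. The exact gradient expressions for steps 7-8 are assumed to be those of the squared-error cost:

import numpy as np

class SGDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.learning_rate = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for _ in range(self.epochs):
            for _ in range(X_train.shape[0]):
                idx = np.random.randint(0, X_train.shape[0])           # pick a random sample
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * (y_train[idx] - y_hat)            # gradient w.r.t. intercept
                coef_der = -2 * (y_train[idx] - y_hat) * X_train[idx]  # gradient w.r.t. coefficients
                self.intercept_ -= self.learning_rate * intercept_der
                self.coef_ -= self.learning_rate * coef_der
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

Training it on X_train, y_train and predicting on X_test gives the r2_score evaluated in the next cell.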
In [13]: r2_score(y_test,y_pred)
Out[13]: 0.4202489926600458
Sklearn
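The cell that constructs reg is not shown; based on the repr in Out[16] below, it was presumably configured along these lines (the value of eta0, the constant learning rate, is an assumption):

from sklearn.linear_model import SGDRegressor

reg = SGDRegressor(max_iter=100, learning_rate='constant', eta0=0.01)

With only 100 iterations, the ConvergenceWarning shown below is expected.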
In [16]: reg.fit(X_train,y_train)
C:\Users\user\anaconda3\lib\site-packages\sklearn\linear_model\_stochastic_gradient.py:1548: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
  warnings.warn(
Out[16]: SGDRegressor(learning_rate='constant', max_iter=100)
Out[18]: 0.43300205052916463
Animation
In [19]: from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
In [22]: plt.scatter(X,y)
[Figure 1: scatter plot of X vs. y]
[27.82809103]
-2.29474455867698
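The two values above are presumably the fitted coefficient and intercept of a LinearRegression baseline on this data. The next cell is shown only partially; its missing head presumably set up a two-parameter SGD loop over the intercept b and slope m along these lines (every name and value not present in the visible tail is an assumption, and one random sample per step is assumed since 100 steps are recorded):

import time

# X, y: the single-feature data plotted above (the make_regression cell is not shown)
lr = 0.1                                         # arbitrary example learning rate
b, m = 0.0, 0.0                                  # intercept and slope of y = m*x + b
all_b, all_m, all_cost, all_lr = [], [], [], []

start = time.time()
for i in range(100):
    idx = np.random.randint(0, X.shape[0])       # pick one random training example
    y_hat = m * X[idx, 0] + b
    error = y[idx] - y_hat
    slope_b = -2 * error                         # gradient of squared error w.r.t. b
    slope_m = -2 * error * X[idx, 0]             # gradient of squared error w.r.t. m
    cost = error ** 2                            # per-sample squared error
    # ... continued in the visible lines below ...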
    b = b - (lr * slope_b)        # update the intercept against its gradient
    m = m - (lr * slope_m)        # update the slope against its gradient
    all_b.append(b)               # record the trajectory for the animation
    all_m.append(m)
    all_cost.append(cost)
    all_lr.append(lr)
print("Total time taken:", time.time() - start)
In [25]: len(all_cost)
Out[25]: 100
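The next cell is also only the tail of a longer one; the part that builds contour (and presumably the animation over the recorded steps) is not shown. One plausible reconstruction of the cost-surface contour, using the trajectory recorded above, is the following sketch (the animation machinery itself is omitted):

# Hypothetical reconstruction: mean-squared-error surface over a grid of (b, m) values,
# with the recorded SGD trajectory drawn on top.
b_range = np.linspace(min(all_b) - 5, max(all_b) + 5, 100)
m_range = np.linspace(min(all_m) - 5, max(all_m) + 5, 100)
B, M = np.meshgrid(b_range, m_range)

cost_grid = np.zeros_like(B)
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        cost_grid[i, j] = np.mean((y - (M[i, j] * X[:, 0] + B[i, j])) ** 2)

contour = plt.contourf(B, M, cost_grid, levels=50, cmap='viridis')
plt.plot(all_b, all_m, 'r.-', markersize=4)      # parameter trajectory taken by SGD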
# Display the animation
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stochastic Gradient Descent')
plt.colorbar(contour)
plt.tight_layout()
plt.show()
1. Fixed Learning Rate: In this simple approach, the learning rate remains constant
throughout the training process. It is set to a fixed value, typically determined through
hyperparameter tuning. However, a fixed learning rate may not be ideal for all scenarios,
as it may lead to slow convergence or overshooting.
2. Time-based Decay: This learning schedule reduces the learning rate over time by a fixed
factor at predetermined intervals or epochs. The formula for time-based decay is
lr_t = lr_0 / (1 + decay_rate * t), where lr_0 is the initial learning rate and t is the epoch number.
3. Step Decay: Step decay reduces the learning rate by a fixed factor at specific milestones
or steps during training. The formula for step decay is
lr_t = lr_0 * drop_factor ** floor(t / step_size), where drop_factor is applied every step_size epochs.
4. Exponential Decay: Exponential decay reduces the learning rate exponentially over time.
The formula for exponential decay is lr_t = lr_0 * exp(-decay_rate * t).
5. Piecewise Constant Decay: Piecewise constant decay allows you to define specific
learning rates for different epochs or ranges of epochs. It is often used to decrease the
learning rate more aggressively at the beginning of training and then reduce it more slowly
later. The learning rate is manually adjusted based on the desired schedule.
These are just a few examples of learning schedules used in SGD. The choice of learning
schedule depends on the problem at hand, the characteristics of the dataset, and empirical
experimentation to find the best settings. It is important to note that learning rate schedules
should be carefully tuned to achieve the desired balance between convergence speed and
avoiding overshooting or getting stuck in local minima.
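As a brief illustration, the schedules above can be written as small Python functions of the epoch number; the constants used here are arbitrary example values:

import numpy as np

initial_lr = 0.1

def time_based_decay(epoch, decay_rate=0.01):
    return initial_lr / (1 + decay_rate * epoch)

def step_decay(epoch, drop_factor=0.5, step_size=10):
    return initial_lr * drop_factor ** np.floor(epoch / step_size)

def exponential_decay(epoch, decay_rate=0.05):
    return initial_lr * np.exp(-decay_rate * epoch)

def piecewise_constant(epoch, boundaries=(20, 50), rates=(0.1, 0.01, 0.001)):
    # use rates[0] before the first boundary, rates[1] before the second, and so on
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    return rates[-1]

for epoch in (0, 10, 50):
    print(epoch, time_based_decay(epoch), step_decay(epoch), exponential_decay(epoch))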