Unit4 Notes

Calculus is essential in machine learning for gradient computation, backpropagation, and optimization algorithms, enabling effective model training and analysis. The document illustrates the application of calculus in analyzing a plant's growth using derivatives to determine growth rates and concavity. It also discusses convex sets and functions, providing examples and their relevance in optimization problems.


1. What is the importance of calculus in machine learning?

Imagine you are analyzing the growth of a plant over time. The height of the plant (in
centimeters) at any time 𝑥 (in weeks) is modeled by the function:
𝑓(𝑥) = 𝑥² + 3𝑥 + 5
Here:
 𝑓(𝑥) represents the height of the plant at time 𝑥.
 The function is quadratic, indicating that the plant's growth accelerates over
time.
We will now analyze the growth rate of the plant by computing the derivative of
𝑓(𝑥), finding the slope of the tangent at a specific time (𝑥 = 2 weeks), and
determining whether the plant's height is increasing or decreasing at that time.

Answer: Calculus is a cornerstone of machine learning, providing the mathematical foundation
for many key concepts and algorithms. Its importance can be broken down into several areas:
1. Gradient Computation:

o Calculus is used to compute gradients (partial derivatives) of the loss function
with respect to model parameters.
o Gradients indicate the direction of the steepest ascent or descent, which is
essential for optimizing model parameters.
o For example, in linear regression, the gradient of the Mean Squared Error (MSE)
loss function is used to update the weights.
2. Backpropagation:

o In neural networks, the chain rule of calculus is used to compute gradients layer
by layer during backpropagation.
o This allows the network to learn by adjusting weights to minimize the loss
function.
3. Convexity Analysis:

o Calculus helps determine whether a function is convex, which is crucial for
ensuring the existence of a global minimum in optimization problems.
o Convex functions are easier to optimize because every local minimum is also a global minimum.
4. Rate of Change:

o Derivatives measure how a function changes with respect to its inputs, enabling
the analysis of model behavior and sensitivity.
o For example, in regression models, the derivative of the loss function with respect
to the parameters determines how the parameters should be updated.
5. Optimization Algorithms:

o Gradient-based optimization algorithms like gradient descent, stochastic gradient
descent (SGD), and Adam rely on calculus to update model parameters iteratively.
6. Regularization:

o Calculus is used to compute gradients for regularization terms (e.g., L1 or L2
regularization), which help prevent overfitting.
Numerical Example:

Step 1: Compute the Derivative of 𝑓(𝑥)


The derivative 𝑓′(𝑥) represents the rate of change of the plant's height with respect to time. For
the given function:
𝑓(𝑥) = 𝑥² + 3𝑥 + 5
Using the power rule for differentiation:
𝑓′(𝑥) = d/d𝑥(𝑥²) + d/d𝑥(3𝑥) + d/d𝑥(5)
Applying the power rule:
𝑓′(𝑥) = 2𝑥 + 3 + 0
So, the derivative is:
𝑓′(𝑥) = 2𝑥 + 3
Interpretation:
 The derivative 𝑓′(𝑥) = 2𝑥 + 3 tells us how fast the plant's height is changing at any time
𝑥.
 For example, at 𝑥 = 1 week, the growth rate is 𝑓′(1) = 2(1) + 3 = 5 cm/week.

Step 2: Find the Slope of the Tangent at 𝑥 = 2


The slope of the tangent to the curve at a specific point 𝑥 = 𝑎 is given by the value of the
derivative at that point, 𝑓′(𝑎).
For 𝑥 = 2 weeks:
𝑓′(2) = 2(2) + 3 = 4 + 3 = 7
So, the slope of the tangent at 𝑥 = 2 is 7 cm/week.
Interpretation:
 At 𝑥 = 2 weeks, the plant's height is increasing at a rate of 7 cm per week.
 This means that, at this specific time, the plant is growing rapidly.

Step 3: Determine Whether the Function is Increasing or Decreasing at 𝑥 = 2


The sign of the derivative 𝑓′(𝑥) at a point determines whether the function is increasing or
decreasing at that point:
 If 𝑓′(𝑥) > 0, the function is increasing.
 If 𝑓′(𝑥) < 0, the function is decreasing.
From Step 2, we know:
𝑓′(2) = 7 > 0
Since 𝑓′(2) > 0, the function 𝑓(𝑥) is increasing at 𝑥 = 2.
Interpretation:
 At 𝑥 = 2 weeks, the plant's height is increasing.
 This aligns with our observation that the slope of the tangent is positive, indicating
growth.

Summary of Results:
1. Derivative of 𝑓(𝑥):
𝑓′(𝑥) = 2𝑥 + 3
o This represents the growth rate of the plant at any time 𝑥.
2. Slope of the Tangent at 𝑥 = 2:
𝑓′(2) = 7 cm/week
o At 𝑥 = 2 weeks, the plant is growing at a rate of 7 cm per week.
3. Behavior of the Function at 𝑥 = 2:
o The function is increasing at 𝑥 = 2, meaning the plant's height is growing at this
time.

Case Study Conclusion:


By analyzing the derivative of the function 𝑓(𝑥) = 𝑥² + 3𝑥 + 5, we determined:
 The plant's growth rate at any time 𝑥 is given by 𝑓′(𝑥) = 2𝑥 + 3.
 At 𝑥 = 2 weeks, the plant is growing at a rate of 7 cm/week.
 The plant's height is increasing at 𝑥 = 2 weeks, indicating healthy growth.
This type of analysis is useful in real-world applications, such as predicting growth trends,
optimizing resources, and making informed decisions based on rates of change.
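
As a quick check, the whole analysis can be reproduced in Python. The following is a minimal sketch, assuming the sympy library is available:

import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x + 5             # plant height (cm) at week x

f_prime = sp.diff(f, x)        # growth rate
print(f_prime)                 # 2*x + 3
print(f_prime.subs(x, 2))      # 7 -> slope of the tangent at x = 2
print(f_prime.subs(x, 2) > 0)  # True -> height is increasing at x = 2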

Q2. Consider the function 𝒇(𝒙) = 𝒙³ − 𝟔𝒙² + 𝟗𝒙 + 𝟐.


1. Compute the first derivative 𝑓′(𝑥).
2. Compute the second derivative 𝑓″(𝑥).
3. Evaluate the second derivative at 𝑥 = 2 and interpret its meaning.
4. Determine whether the function is concave upward, concave downward, or has an
inflection point at 𝑥 = 2. Justify your answer.

Detailed Answer:
1. Compute the First Derivative 𝑓′(𝑥)
The first derivative of 𝑓(𝑥) is calculated using the power rule:
𝑓(𝑥) = 𝑥³ − 6𝑥² + 9𝑥 + 2
𝑓′(𝑥) = d/d𝑥(𝑥³) − d/d𝑥(6𝑥²) + d/d𝑥(9𝑥) + d/d𝑥(2)
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9 + 0
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9
Answer:
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9

2. Compute the Second Derivative 𝑓″(𝑥)


The second derivative is the derivative of 𝑓′(𝑥):
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9
𝑓″(𝑥) = d/d𝑥(3𝑥²) − d/d𝑥(12𝑥) + d/d𝑥(9)
𝑓″(𝑥) = 6𝑥 − 12 + 0
𝑓″(𝑥) = 6𝑥 − 12
Answer:
𝑓″(𝑥) = 6𝑥 − 12

3. Evaluate the Second Derivative at 𝑥 = 2 and Interpret Its Meaning:


Substitute 𝑥 = 2 into 𝑓″(𝑥):
𝑓″(2) = 6(2) − 12 = 12 − 12 = 0
Interpretation:
 The second derivative 𝑓″(𝑥) measures the concavity of the function.
 At 𝑥 = 2, 𝑓″(2) = 0, which indicates that the function may have an inflection point at
𝑥 = 2.
Answer:
𝑓″(2) = 0
This suggests that 𝑥 = 2 is a possible inflection point.

4. Determine Concavity and Justify


To determine whether the function is concave upward, concave downward, or has an inflection
point at 𝑥 = 2, analyze the sign of 𝑓″(𝑥) around 𝑥 = 2:
 For 𝑥 < 2 (e.g., 𝑥 = 1):

𝑓″(1) = 6(1) − 12 = −6 < 0


The function is concave downward for 𝑥 < 2.

 For 𝑥 > 2 (e.g., 𝑥 = 3):

𝑓″(3) = 6(3) − 12 = 6 > 0


The function is concave upward for 𝑥 > 2.

Since the concavity changes from downward to upward at 𝑥 = 2, this confirms that 𝑥 = 2 is an
inflection point.
Answer:
 For 𝑥 < 2, the function is concave downward.
 For 𝑥 > 2, the function is concave upward.
 At 𝑥 = 2, the function has an inflection point.
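
A short symbolic check (again a sketch assuming sympy) verifies the derivatives and the concavity sign change around 𝑥 = 2:

import sympy as sp

x = sp.symbols('x')
f = x**3 - 6*x**2 + 9*x + 2

f1 = sp.diff(f, x)     # first derivative: 3*x**2 - 12*x + 9
f2 = sp.diff(f, x, 2)  # second derivative: 6*x - 12

print(f2.subs(x, 2))   # 0  -> candidate inflection point
print(f2.subs(x, 1))   # -6 -> concave downward for x < 2
print(f2.subs(x, 3))   # 6  -> concave upward for x > 2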
3. Define convex sets and convex functions. Provide examples.
Answer:
Convex Sets
A set 𝑆 is convex if, for any two points 𝑥, 𝑦 ∈ 𝑆, the line segment connecting them lies entirely
within 𝑆. Mathematically:
𝜆𝑥 + (1 − 𝜆)𝑦 ∈ 𝑆 ∀𝜆 ∈ [0,1]
Examples:
 A circle is a convex set because any line segment connecting two points within the circle
lies entirely inside it.
 A crescent moon is not a convex set because there exist points where the line segment
connecting them goes outside the set.
Convex Functions
A function 𝑓(𝑥) is convex if, for any two points 𝑥, 𝑦 in its domain and 𝜆 ∈ [0,1]:
𝑓(𝜆𝑥 + (1 − 𝜆)𝑦) ≤ 𝜆𝑓(𝑥) + (1 − 𝜆)𝑓(𝑦)
Examples:
 𝑓(𝑥) = 𝑥² is convex because its second derivative 𝑓″(𝑥) = 2 is always positive.
 𝑓(𝑥) = sin(𝑥) is not convex because its second derivative 𝑓″(𝑥) = −sin(𝑥) changes
sign.
Numerical Example: Consider the function 𝑓(𝑥) = 𝑥². For 𝑥 = 1 and 𝑦 = 2, and 𝜆 = 0.5:
𝑓(0.5 ⋅ 1 + 0.5 ⋅ 2) = 𝑓(1.5) = 2.25
0.5 ⋅ 𝑓(1) + 0.5 ⋅ 𝑓(2) = 0.5 ⋅ 1 + 0.5 ⋅ 4 = 2.5
Since 2.25 ≤ 2.5, the function is convex.
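The inequality can be spot-checked numerically. The helper below is a minimal sketch (the function name is ours) that tests the convexity condition for one pair of points and one 𝜆; passing for a few samples does not prove convexity, but failing disproves it:

import math

def convexity_holds(f, x, y, lam):
    # Does f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) hold?
    return f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y)

print(convexity_holds(lambda t: t**2, 1.0, 2.0, 0.5))  # True: 2.25 <= 2.5
print(convexity_holds(math.sin, 2.0, 4.0, 0.5))        # False: sin violates it here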
Practical Case Study: In support vector machines (SVM), the objective function is convex,
ensuring a unique global minimum. This convexity property makes SVM optimization efficient
and reliable.

Q4. A company is designing a new product and needs to optimize its production process.
The production cost 𝑪(𝒙) depends on the quantity 𝒙 of raw materials used. The company
has identified that the cost function 𝑪(𝒙) and the feasible set of raw material quantities 𝑺
play a crucial role in determining the optimal production strategy. The cost function is
given by:
𝑪(𝒙) = 𝒙² + 𝟏𝟎𝒙 + 𝟏𝟎𝟎,
and the feasible set of raw material quantities is 𝑺 = [𝟓, 𝟏𝟓].
Based on this information, answer the following questions:

1. Convex Set Verification

Show that the feasible set 𝑆 = [5,15] is a convex set by verifying the convexity
condition for 𝑥₁ = 5 and 𝑥₂ = 15 with 𝜆 = 0.3.

2. Convex Function Verification

Verify that the cost function 𝐶(𝑥) = 𝑥² + 10𝑥 + 100 is convex by checking the
convexity condition for 𝑥₁ = 5 and 𝑥₂ = 10 with 𝜆 = 0.5.

Ans:- 1. Convex Set Verification


We are given the set 𝑆 = [5,15]. To show that 𝑆 is convex, we verify the convexity condition for
𝑥₁ = 5, 𝑥₂ = 15, and 𝜆 = 0.3.
Step 1: Compute 𝜆𝑥₁ + (1 − 𝜆)𝑥₂:
𝜆𝑥₁ + (1 − 𝜆)𝑥₂ = 0.3(5) + (1 − 0.3)(15) = 1.5 + 0.7(15) = 1.5 + 10.5 = 12.
Step 2: Check if 12 ∈ 𝑆: Since 12 lies within the interval [5,15], the set 𝑆 satisfies the
convexity condition for 𝜆 = 0.3.
Conclusion: The set 𝑆 = [5,15] is convex.
2. Convex Function Verification
We are given the cost function 𝐶(𝑥) = 𝑥² + 10𝑥 + 100. To verify that 𝐶(𝑥) is convex, we
check the convexity condition for 𝑥₁ = 5, 𝑥₂ = 10, and 𝜆 = 0.5.
Step 1: Compute 𝐶(𝜆𝑥₁ + (1 − 𝜆)𝑥₂):
𝜆𝑥₁ + (1 − 𝜆)𝑥₂ = 0.5(5) + (1 − 0.5)(10) = 2.5 + 0.5(10) = 2.5 + 5 = 7.5.
𝐶(7.5) = (7.5)² + 10(7.5) + 100 = 56.25 + 75 + 100 = 231.25.
Step 2: Compute 𝜆𝐶(𝑥₁) + (1 − 𝜆)𝐶(𝑥₂):
𝐶(𝑥₁) = 𝐶(5) = (5)² + 10(5) + 100 = 25 + 50 + 100 = 175,
𝐶(𝑥₂) = 𝐶(10) = (10)² + 10(10) + 100 = 100 + 100 + 100 = 300,
𝜆𝐶(𝑥₁) + (1 − 𝜆)𝐶(𝑥₂) = 0.5(175) + 0.5(300) = 87.5 + 150 = 237.5.
Step 3: Compare the two results:
𝐶(𝜆𝑥₁ + (1 − 𝜆)𝑥₂) = 231.25 ≤ 237.5 = 𝜆𝐶(𝑥₁) + (1 − 𝜆)𝐶(𝑥₂).
Conclusion: The cost function 𝐶(𝑥) = 𝑥² + 10𝑥 + 100 satisfies the convexity condition for
𝑥₁ = 5, 𝑥₂ = 10, and 𝜆 = 0.5.
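
Both verifications translate directly into a few lines of Python; this sketch reproduces the arithmetic above:

def C(x):
    return x**2 + 10 * x + 100  # production cost

# Convex set check for S = [5, 15] with lam = 0.3
lam, x1, x2 = 0.3, 5.0, 15.0
z = lam * x1 + (1 - lam) * x2
print(z, 5 <= z <= 15)  # 12.0 True -> the combination stays inside S

# Convex function check for x1 = 5, x2 = 10, lam = 0.5
lam, x1, x2 = 0.5, 5.0, 10.0
lhs = C(lam * x1 + (1 - lam) * x2)     # C(7.5) = 231.25
rhs = lam * C(x1) + (1 - lam) * C(x2)  # 237.5
print(lhs, rhs, lhs <= rhs)            # 231.25 237.5 True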

5. Explain the concept of optimization in machine learning with a numerical case study.
Answer: Optimization in machine learning refers to the process of finding the best set of model
parameters that minimize (or maximize) a given objective function, typically a loss function.
Key aspects include:
1. Objective Function:

o The objective function quantifies the performance of a model. For example:


 In regression, the Mean Squared Error (MSE) is used.
 In classification, the cross-entropy loss is used.
2. Constraints:

o Constraints are conditions that the solution must satisfy. For example:
 Regularization terms like L1 or L2 penalty are added to the loss function
to prevent overfitting.
3. Gradient-Based Methods:

o Optimization algorithms like gradient descent use gradients (partial derivatives) to
iteratively update parameters and minimize the loss function.
4. Global vs Local Optima:

o In non-convex problems, optimization algorithms may converge to local minima,
but convex problems guarantee a global minimum.
5. Applications:

o Optimization is used in training models like linear regression, logistic regression,
neural networks, and support vector machines (SVM).
(a) Numerical Example: Minimize 𝑓(𝑥) = 𝑥² using gradient descent. Start at 𝑥₀ = 3, learning
rate 𝜂 = 0.1:
𝑥₁ = 𝑥₀ − 𝜂𝑓′(𝑥₀) = 3 − 0.1 ⋅ 6 = 2.4
Practical Case Study: In logistic regression, the log-loss function is minimized using gradient
descent to classify data points. The optimization process adjusts the model parameters to
maximize the likelihood of the observed data.
(b) Numerical Example:
Let’s consider a simple linear regression problem where we want to fit a line 𝑦 = 𝑚𝑥 + 𝑏 to a
set of data points. The goal is to find the best values of 𝑚 (slope) and 𝑏 (intercept) that minimize
the Mean Squared Error (MSE) loss function.

Step 1: Define the Loss Function


The MSE loss function is given by:

𝐿(𝑚, 𝑏) = (1/𝑁) Σᵢ (𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏))²,

where:
 𝑁 is the number of data points,
 (𝑥ᵢ, 𝑦ᵢ) are the data points,
 𝑚𝑥ᵢ + 𝑏 is the predicted value for input 𝑥ᵢ.

Step 2: Initialize Parameters


Let’s initialize the parameters 𝑚 and 𝑏 with some random values:
𝑚 = 1, 𝑏 = 0.

Step 3: Compute the Loss


Suppose we have the following data points:
(𝑥₁, 𝑦₁) = (1,3), (𝑥₂, 𝑦₂) = (2,5), (𝑥₃, 𝑦₃) = (3,7).
Compute the predicted values and the loss:
Predicted values: ŷ₁ = 1(1) + 0 = 1, ŷ₂ = 1(2) + 0 = 2, ŷ₃ = 1(3) + 0 = 3.
Loss: 𝐿(𝑚, 𝑏) = (1/3)[(3 − 1)² + (5 − 2)² + (7 − 3)²] = (1/3)(4 + 9 + 16) = 29/3 ≈ 9.67.

Step 4: Update Parameters Using Gradient Descent


Gradient Descent is an optimization algorithm that updates the parameters in the direction of the
negative gradient of the loss function. The update rules are:
𝑚new = 𝑚old − 𝛼 ∂𝐿/∂𝑚,
𝑏new = 𝑏old − 𝛼 ∂𝐿/∂𝑏,
where 𝛼 is the learning rate (e.g., 𝛼 = 0.1).
Compute the Gradients:

∂𝐿/∂𝑚 = −(2/𝑁) Σᵢ 𝑥ᵢ(𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)),

∂𝐿/∂𝑏 = −(2/𝑁) Σᵢ (𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)).

For our data:

∂𝐿/∂𝑚 = −(2/3)[1(3 − 1) + 2(5 − 2) + 3(7 − 3)] = −(2/3)(2 + 6 + 12) = −40/3 ≈ −13.33,
∂𝐿/∂𝑏 = −(2/3)[(3 − 1) + (5 − 2) + (7 − 3)] = −(2/3)(2 + 3 + 4) = −18/3 = −6.
Update the Parameters:
𝑚new = 1 − 0.1(−13.33) = 1 + 1.333 = 2.333,
𝑏new = 0 − 0.1(−6) = 0 + 0.6 = 0.6.
(Note the sign: since the current predictions underestimate 𝑦, the gradients are negative and both
parameters increase toward the best-fit line 𝑦 = 2𝑥 + 1.)

Step 5: Repeat Until Convergence


Repeat the process of computing the loss, gradients, and updating the parameters until the loss
function is minimized (or reaches a satisfactory value).
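
A compact NumPy sketch of Steps 1-5 (variable names are ours) runs the loop to convergence; since the data lie exactly on 𝑦 = 2𝑥 + 1, the parameters should approach 𝑚 = 2 and 𝑏 = 1:

import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])  # data lie on y = 2x + 1

m, b, alpha = 1.0, 0.0, 0.1
for _ in range(1000):
    y_hat = m * X + b
    # MSE gradients (note the leading minus sign)
    dm = -(2 / len(X)) * np.sum(X * (y - y_hat))
    db = -(2 / len(X)) * np.sum(y - y_hat)
    m -= alpha * dm
    b -= alpha * db

print(m, b)  # close to 2.0 and 1.0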

Summary of Optimization in Machine Learning:


1. Objective: Minimize the loss function 𝐿(𝑚, 𝑏).
2. Parameters: 𝑚 and 𝑏.
3. Optimization Algorithm: Gradient Descent.
4. Process:
o Compute the loss.
o Compute the gradients.
o Update the parameters.
o Repeat until convergence.
6. What is gradient descent? Explain its variants.
Answer:
Gradient Descent (GD)
Gradient descent is an iterative optimization algorithm used to minimize a function by moving in
the direction of the negative gradient. The update rule is:
𝑤ₜ₊₁ = 𝑤ₜ − 𝜂∇𝐿(𝑤ₜ)
where:
 𝑤ₜ: Current parameter value.
 𝜂: Learning rate (step size).
 ∇𝐿(𝑤ₜ): Gradient of the loss function with respect to 𝑤ₜ.
Problem Statement
 Function to minimize:

𝑓(𝑥) = 𝑥² + 4𝑥 + 4
o This is a quadratic function with a parabolic shape.
o The function has a global minimum at 𝑥 = −2.
 Initial guess: 𝑥₀ = 5

o We start the optimization process from 𝑥₀ = 5.


 Learning rate: 𝛼 = 0.1

o The learning rate controls the step size during each iteration.
o A smaller learning rate leads to slower but more stable convergence.
 Number of iterations: 5

o We will perform 5 updates to the value of 𝑥.


 Goal:

o Find the value of 𝑥 that minimizes 𝑓(𝑥).

Step 1 - Compute the Gradient


 The gradient (derivative) of 𝑓(𝑥) is:
𝑓′(𝑥) = d/d𝑥(𝑥² + 4𝑥 + 4) = 2𝑥 + 4
o The gradient measures the slope of the function at a given point.
o It indicates the direction of the steepest ascent.
 Why is the gradient important?

o To minimize 𝑓(𝑥), we move in the opposite direction of the gradient.


o The gradient tells us how to adjust 𝑥 to reduce the value of 𝑓(𝑥).

Step 2 - Gradient Descent Formula


 The update rule for gradient descent is:

𝑥new = 𝑥old − 𝛼 ⋅ 𝑓′(𝑥old)


o Explanation of terms:
 𝑥old: Current value of 𝑥.
 𝛼: Learning rate (step size).
 𝑓′(𝑥old): Gradient at 𝑥old.
 How it works:

o Compute the gradient at the current point.


o Multiply the gradient by the learning rate.
o Subtract this value from the current 𝑥 to get the updated 𝑥.

Step 3 - Iterations
 Perform 5 iterations of gradient descent starting from 𝑥₀ = 5.
 Iteration Table:
Iteration 𝑥old 𝑓′(𝑥old) 𝑥new = 𝑥old − 0.1 ⋅ 𝑓′(𝑥old)
0 5.0 14.0 5.0 − 0.1 ⋅ 14.0 = 3.6
1 3.6 11.2 3.6 − 0.1 ⋅ 11.2 = 2.48
2 2.48 8.96 2.48 − 0.1 ⋅ 8.96 = 1.584
3 1.584 7.168 1.584 − 0.1 ⋅ 7.168 = 0.8672
4 0.8672 5.7344 0.8672 − 0.1 ⋅ 5.7344 = 0.29376
5 0.29376 4.58752 0.29376 − 0.1 ⋅ 4.58752 = −0.164992
 Explanation of iterations:
o At each step, the gradient is computed using 𝑓′(𝑥) = 2𝑥 + 4.
o The new value of 𝑥 is updated using the gradient descent formula.
o Over time, 𝑥 moves closer to the minimum at 𝑥 = −2.
Final Result

 After 5 iterations, the value of 𝑥 is approximately:


𝑥 ≈ −0.165
 Explanation of the result:
o The function is moving toward the minimum at 𝑥 = −2.
o If more iterations are performed, 𝑥 will converge to 𝑥 = −2.
o The learning rate 𝛼 = 0.1 ensures stable but gradual progress.

Mathematical Insight
 The function 𝑓(𝑥) = 𝑥² + 4𝑥 + 4 can be rewritten as:
𝑓(𝑥) = (𝑥 + 2)²
 Explanation:
o This form shows that the function is a perfect square.
o The minimum value of 𝑓(𝑥) is 0, achieved at 𝑥 = −2.
o Gradient descent is converging toward this minimum.
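
The iteration table is easy to reproduce in plain Python; the sketch below runs the six updates labelled 0-5 above:

def grad(x):
    return 2 * x + 4  # f'(x) for f(x) = x**2 + 4x + 4

x, alpha = 5.0, 0.1
for i in range(6):  # rows 0-5 of the iteration table
    x -= alpha * grad(x)
    print(i, round(x, 6))
# 3.6, 2.48, 1.584, 0.8672, 0.29376, -0.164992 -> heading toward x = -2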

Key Takeaways
 Gradient descent is an iterative optimization algorithm:

o It updates the parameters step-by-step to minimize the loss function.


 The learning rate 𝛼 controls the step size:

o A smaller learning rate leads to slower but more stable convergence.


o A larger learning rate can cause overshooting or divergence.
 The gradient 𝑓′(𝑥) determines the direction of the update:

o The gradient points in the direction of the steepest ascent.


o Moving in the opposite direction reduces the value of 𝑓(𝑥).
 For quadratic functions, gradient descent converges to the global minimum:

o Quadratic functions have a single global minimum, making them ideal for
gradient descent.
 More iterations or a smaller learning rate can improve accuracy:

o Increasing the number of iterations or reducing the learning rate can lead to better
results.

Here’s a tabular comparison of Stochastic Gradient Descent (SGD), Momentum-based Gradient Descent, and Adam (Adaptive Moment Estimation) in terms of their mathematical computation and key characteristics:

| Aspect | Stochastic Gradient Descent (SGD) | Momentum-based Gradient Descent | Adam (Adaptive Moment Estimation) |
|---|---|---|---|
| Key Idea | Updates parameters using one data point (or a small batch) at a time. | Adds momentum to SGD by incorporating past gradients to accelerate convergence. | Combines momentum and adaptive learning rates for each parameter. |
| Gradient Computation | ∂𝐿/∂𝑚 = −2𝑥ᵢ(𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)), ∂𝐿/∂𝑏 = −2(𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)), computed on a single point 𝑖. | Same per-point gradients as SGD. | Same per-point gradients as SGD. |
| Parameter Update | 𝑚 = 𝑚 − 𝛼 ∂𝐿/∂𝑚, 𝑏 = 𝑏 − 𝛼 ∂𝐿/∂𝑏 | 𝑣ₜ = 𝛽𝑣ₜ₋₁ + (1 − 𝛽) ∂𝐿/∂𝑚; 𝑚 = 𝑚 − 𝛼𝑣ₜ | 𝑚ₜ = 𝛽₁𝑚ₜ₋₁ + (1 − 𝛽₁) ∂𝐿/∂𝑚; 𝑣ₜ = 𝛽₂𝑣ₜ₋₁ + (1 − 𝛽₂)(∂𝐿/∂𝑚)²; 𝑚corrected = 𝑚ₜ/(1 − 𝛽₁ᵗ), 𝑣corrected = 𝑣ₜ/(1 − 𝛽₂ᵗ); 𝑚 = 𝑚 − 𝛼 𝑚corrected/(√𝑣corrected + 𝜖) |
| Hyperparameters | Learning rate (𝛼). | Learning rate (𝛼), momentum (𝛽). | Learning rate (𝛼), 𝛽₁, 𝛽₂, 𝜖. |
| Advantages | Simple and fast for large datasets. | Faster convergence than SGD, reduces oscillations. | Combines momentum and adaptive learning rates; works well for sparse data and noisy gradients. |
| Disadvantages | Noisy updates, may get stuck in local minima. | Requires tuning of the momentum hyperparameter. | More complex, requires tuning of multiple hyperparameters. |
| Use Cases | Large datasets, simple models. | High-curvature or noisy loss landscapes. | Deep learning, large-scale optimization problems. |
Key Takeaways:
 SGD: Simple and fast but noisy updates.
 Momentum: Adds momentum to SGD for faster convergence and reduced oscillations.
 Adam: Combines momentum and adaptive learning rates, making it highly effective for
deep learning and large-scale problems.

Below is the detailed solution for each variant of gradient descent

Stochastic Gradient Descent (SGD)


Problem Setup
 Function to minimize:
𝑓(𝑥) = 𝑥² + 4𝑥 + 4
 Gradient:
𝑓′(𝑥) = 2𝑥 + 4
 Initial guess: 𝑥₀ = 5
 Learning rate: 𝛼 = 0.1
Update Rule
𝑥new = 𝑥old − 𝛼 ⋅ 𝑓′(𝑥old)
Iterations
Iteration 𝑥old 𝑓′(𝑥old) 𝑥new
0 5.0 2(5) + 4 = 14.0 5.0 − 0.1(14.0) = 3.6
1 3.6 2(3.6) + 4 = 11.2 3.6 − 0.1(11.2) = 2.48
2 2.48 2(2.48) + 4 = 8.96 2.48 − 0.1(8.96) = 1.584
3 1.584 2(1.584) + 4 = 7.168 1.584 − 0.1(7.168) = 0.8672
4 0.8672 2(0.8672) + 4 = 5.7344 0.8672 − 0.1(5.7344) = 0.29376
5 0.29376 2(0.29376) + 4 = 4.58752 0.29376 − 0.1(4.58752) = −0.164992
Explanation
 SGD updates the parameter 𝑥 using the gradient at each step.
 The updates are frequent but noisy, leading to fluctuations.

Momentum-based Gradient Descent


Problem Setup
 Function to minimize:
𝑓(𝑥) = 𝑥² + 4𝑥 + 4
 Gradient:
𝑓′(𝑥) = 2𝑥 + 4
 Initial guess: 𝑥₀ = 5
 Learning rate: 𝛼 = 0.1
 Momentum factor: 𝛽 = 0.9
Update Rule
𝑣ₜ = 𝛽𝑣ₜ₋₁ + (1 − 𝛽)𝑓′(𝑥old)
𝑥new = 𝑥old − 𝛼 ⋅ 𝑣ₜ
Iterations

| Iteration | 𝑥old | 𝑓′(𝑥old) = 2𝑥 + 4 | 𝑣ₜ = 𝛽𝑣ₜ₋₁ + (1 − 𝛽)𝑓′(𝑥old) | 𝑥new = 𝑥old − 𝛼𝑣ₜ |
|---|---|---|---|---|
| 0 | 5.0 | 14.0 | 0.9(0) + 0.1(14.0) = 1.4 | 5.0 − 0.1(1.4) = 4.86 |
| 1 | 4.86 | 13.72 | 0.9(1.4) + 0.1(13.72) = 2.632 | 4.86 − 0.1(2.632) = 4.5968 |
| 2 | 4.5968 | 13.1936 | 0.9(2.632) + 0.1(13.1936) = 3.68816 | 4.5968 − 0.1(3.68816) = 4.227984 |
| 3 | 4.227984 | 12.455968 | 0.9(3.68816) + 0.1(12.455968) = 4.5649408 | 4.227984 − 0.1(4.5649408) = 3.77148992 |
| 4 | 3.77148992 | 11.54297984 | 0.9(4.5649408) + 0.1(11.54297984) = 5.262744704 | 3.77148992 − 0.1(5.262744704) = 3.2452154496 |
| 5 | 3.2452154496 | 10.4904308992 | 0.9(5.262744704) + 0.1(10.4904308992) = 5.78551332352 | 3.2452154496 − 0.1(5.78551332352) = 2.666664117248 |

Explanation
 Momentum introduces a velocity term 𝑣ₜ that accumulates past gradients, reducing
oscillations.
 Momentum is particularly useful for functions with steep regions or noisy gradients.
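
The momentum iterations can be reproduced with the sketch below; note that it uses the (1 − 𝛽)-scaled velocity update shown above (some texts omit that factor):

def grad(x):
    return 2 * x + 4  # f'(x) for f(x) = x**2 + 4x + 4

x, alpha, beta = 5.0, 0.1, 0.9
v = 0.0
for i in range(6):
    v = beta * v + (1 - beta) * grad(x)  # velocity accumulates past gradients
    x -= alpha * v
    print(i, round(x, 6))
# 4.86, 4.5968, 4.227984, 3.77149, 3.245215, 2.666664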

Adam (Adaptive Moment Estimation)


Problem Setup
 Function to minimize:
𝑓(𝑥) = 𝑥² + 4𝑥 + 4
 Gradient:
𝑓′(𝑥) = 2𝑥 + 4
 Initial guess: 𝑥₀ = 5
 Learning rate: 𝛼 = 0.1
 Momentum factors: 𝛽₁ = 0.9, 𝛽₂ = 0.999
 Small constant: 𝜖 = 10⁻⁸
Update Rule
𝑚ₜ = 𝛽₁𝑚ₜ₋₁ + (1 − 𝛽₁)𝑓′(𝑥old)

𝑣ₜ = 𝛽₂𝑣ₜ₋₁ + (1 − 𝛽₂)[𝑓′(𝑥old)]²

𝑚corrected = 𝑚ₜ/(1 − 𝛽₁ᵗ), 𝑣corrected = 𝑣ₜ/(1 − 𝛽₂ᵗ)

𝑥new = 𝑥old − 𝛼 ⋅ 𝑚corrected/(√𝑣corrected + 𝜖)

Iterations

| t | 𝑥old | 𝑓′(𝑥old) | 𝑚ₜ | 𝑣ₜ | 𝑚corrected | 𝑣corrected | 𝑥new |
|---|---|---|---|---|---|---|---|
| 1 | 5.0 | 14.0 | 0.9(0) + 0.1(14.0) = 1.4 | 0.999(0) + 0.001(196) = 0.196 | 1.4/(1 − 0.9) = 14.0 | 0.196/(1 − 0.999) = 196 | 5.0 − 0.1(14.0/√196) = 4.9 |
| 2 | 4.9 | 13.8 | 0.9(1.4) + 0.1(13.8) = 2.64 | 0.999(0.196) + 0.001(190.44) ≈ 0.386 | 2.64/(1 − 0.9²) ≈ 13.8947 | 0.386/(1 − 0.999²) ≈ 193.2 | 4.9 − 0.1(13.8947/√193.2) ≈ 4.8 |
| 3 | 4.8 | 13.6 | 0.9(2.64) + 0.1(13.6) = 3.736 | 0.999(0.386) + 0.001(184.96) ≈ 0.57056 | 3.736/(1 − 0.9³) ≈ 13.78 | 0.57056/(1 − 0.999³) ≈ 190.4 | 4.8 − 0.1(13.78/√190.4) ≈ 4.7 |
Explanation
 Adam maintains two moving averages: 𝑚ₜ (first moment) and 𝑣ₜ (second moment).
 The corrected moments 𝑚corrected and 𝑣corrected account for bias in the initial steps.
 Adam is widely used in deep learning due to its adaptability and robustness.
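
A minimal Adam sketch reproducing the three iterations above (with 𝜖 = 10⁻⁸, as assumed in the setup):

import math

def grad(x):
    return 2 * x + 4  # f'(x) for f(x) = x**2 + 4x + 4

x, alpha = 5.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 4):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g     # first moment
    v = beta2 * v + (1 - beta2) * g**2  # second moment
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)
    x -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    print(t, round(x, 4))
# approximately 4.9, 4.8, 4.7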

Practical Case Study: In neural networks, stochastic gradient descent (SGD) is used to train the
model on large datasets. The noisy updates help the model escape local minima and converge to
a better solution.

7. Explain the role of the learning rate in gradient descent.


Answer: The learning rate (𝜂) controls the step size of parameter updates in gradient descent.
Its role includes:
1. Convergence Speed: A larger learning rate leads to faster convergence but may cause
overshooting.
2. Stability: A smaller learning rate ensures stable convergence but may be slow.
3. Trade-Off: Choosing an appropriate learning rate is crucial for balancing speed and
stability.
4. Adaptive Learning Rates: Variants like Adam dynamically adjust the learning rate for
better performance.
Case Study: Minimizing a Quadratic Function with Multiple Variables
Objective Function: Consider the following quadratic function:
𝑓(𝑥, 𝑦) = 𝑥² + 2𝑦²
This function has a global minimum at (𝑥, 𝑦) = (0,0).
Gradient: The gradient of the function is:
∇𝑓(𝑥, 𝑦) = [∂𝑓/∂𝑥, ∂𝑓/∂𝑦]ᵀ = [2𝑥, 4𝑦]ᵀ
Gradient Descent Update Rule: For each parameter 𝑥 and 𝑦, the update rule is:
𝑥new = 𝑥old − 𝛼 ⋅ ∂𝑓/∂𝑥
𝑦new = 𝑦old − 𝛼 ⋅ ∂𝑓/∂𝑦
Substituting the gradient:
𝑥new = 𝑥old − 𝛼 ⋅ 2𝑥old
𝑦new = 𝑦old − 𝛼 ⋅ 4𝑦old

Initialization
Let’s start with an initial guess:
(𝑥₀, 𝑦₀) = (2,3)

Case 1: Learning Rate 𝛼 = 0.1


Iteration 1:
𝑥₁ = 𝑥₀ − 𝛼 ⋅ 2𝑥₀ = 2 − 0.1 ⋅ 2 ⋅ 2 = 2 − 0.4 = 1.6
𝑦₁ = 𝑦₀ − 𝛼 ⋅ 4𝑦₀ = 3 − 0.1 ⋅ 4 ⋅ 3 = 3 − 1.2 = 1.8
Iteration 2:
𝑥₂ = 𝑥₁ − 𝛼 ⋅ 2𝑥₁ = 1.6 − 0.1 ⋅ 2 ⋅ 1.6 = 1.6 − 0.32 = 1.28
𝑦₂ = 𝑦₁ − 𝛼 ⋅ 4𝑦₁ = 1.8 − 0.1 ⋅ 4 ⋅ 1.8 = 1.8 − 0.72 = 1.08
Iteration 3:
𝑥₃ = 𝑥₂ − 𝛼 ⋅ 2𝑥₂ = 1.28 − 0.1 ⋅ 2 ⋅ 1.28 = 1.28 − 0.256 = 1.024
𝑦₃ = 𝑦₂ − 𝛼 ⋅ 4𝑦₂ = 1.08 − 0.1 ⋅ 4 ⋅ 1.08 = 1.08 − 0.432 = 0.648
Observation: The values of 𝑥 and 𝑦 are gradually decreasing towards the minimum at (0,0).
The learning rate 𝛼 = 0.1 is appropriate for this case.

Case 2: Learning Rate 𝛼 = 0.5


Iteration 1:
𝑥₁ = 𝑥₀ − 𝛼 ⋅ 2𝑥₀ = 2 − 0.5 ⋅ 2 ⋅ 2 = 2 − 2 = 0
𝑦₁ = 𝑦₀ − 𝛼 ⋅ 4𝑦₀ = 3 − 0.5 ⋅ 4 ⋅ 3 = 3 − 6 = −3
Iteration 2:
𝑥₂ = 𝑥₁ − 𝛼 ⋅ 2𝑥₁ = 0 − 0.5 ⋅ 2 ⋅ 0 = 0
𝑦₂ = 𝑦₁ − 𝛼 ⋅ 4𝑦₁ = −3 − 0.5 ⋅ 4 ⋅ (−3) = −3 + 6 = 3
Iteration 3:
𝑥₃ = 𝑥₂ − 𝛼 ⋅ 2𝑥₂ = 0 − 0.5 ⋅ 2 ⋅ 0 = 0
𝑦₃ = 𝑦₂ − 𝛼 ⋅ 4𝑦₂ = 3 − 0.5 ⋅ 4 ⋅ 3 = 3 − 6 = −3
Observation: The value of 𝑥 converges to the minimum (𝑥 = 0) in one step, but 𝑦 oscillates
between 3 and −3. This is because the learning rate is too high for the 𝑦-dimension, causing
overshooting.

Case 3: Learning Rate 𝛼 = 0.01


Iteration 1:
𝑥₁ = 𝑥₀ − 𝛼 ⋅ 2𝑥₀ = 2 − 0.01 ⋅ 2 ⋅ 2 = 2 − 0.04 = 1.96
𝑦₁ = 𝑦₀ − 𝛼 ⋅ 4𝑦₀ = 3 − 0.01 ⋅ 4 ⋅ 3 = 3 − 0.12 = 2.88
Iteration 2:
𝑥₂ = 𝑥₁ − 𝛼 ⋅ 2𝑥₁ = 1.96 − 0.01 ⋅ 2 ⋅ 1.96 = 1.96 − 0.0392 = 1.9208
𝑦₂ = 𝑦₁ − 𝛼 ⋅ 4𝑦₁ = 2.88 − 0.01 ⋅ 4 ⋅ 2.88 = 2.88 − 0.1152 = 2.7648
Iteration 3:
𝑥₃ = 𝑥₂ − 𝛼 ⋅ 2𝑥₂ = 1.9208 − 0.01 ⋅ 2 ⋅ 1.9208 = 1.9208 − 0.0384 = 1.8824
𝑦₃ = 𝑦₂ − 𝛼 ⋅ 4𝑦₂ = 2.7648 − 0.01 ⋅ 4 ⋅ 2.7648 = 2.7648 − 0.1106 = 2.6542
Observation: The values of 𝑥 and 𝑦 are decreasing very slowly. The learning rate 𝛼 = 0.01 is
too small, resulting in slow convergence.

Key Takeaways:
1. Learning Rate Matters: The choice of learning rate significantly impacts the
convergence of gradient descent.
2. Too High: Causes oscillations or divergence.
3. Too Low: Results in slow convergence.
4. Optimal Learning Rate: Depends on the problem and the function being optimized. In
practice, techniques like learning rate scheduling or adaptive learning rates (e.g.,
Adam, RMSprop) are used to dynamically adjust the learning rate.
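
The three cases can be compared side by side with a small sketch (the function and variable names are ours):

def run(alpha, steps=3):
    x, y = 2.0, 3.0         # initial guess (x0, y0)
    for _ in range(steps):
        x -= alpha * 2 * x  # df/dx = 2x
        y -= alpha * 4 * y  # df/dy = 4y
    return x, y

for alpha in (0.1, 0.5, 0.01):
    print(alpha, run(alpha))
# 0.1  -> (1.024, 0.648): steady progress toward (0, 0)
# 0.5  -> (0.0, -3.0): x converges in one step, y oscillates between 3 and -3
# 0.01 -> (~1.8824, ~2.6542): very slow convergence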

Practical Case Study: In deep learning, adaptive learning rate methods like Adam are used to
improve convergence.
8. What is the difference between batch gradient descent and stochastic
gradient descent?
Answer:
 Batch GD: Uses the entire dataset to compute the gradient.
 SGD: Uses a single data point (or mini-batch) to compute the gradient.
Let’s consider another numerical case study to compare Batch Gradient Descent (BGD) and
Stochastic Gradient Descent (SGD). This time, we’ll use a non-linear function to better
illustrate the differences in behavior between the two algorithms.

Case Study: Minimizing a Non-Linear Function


Objective Function: Consider the following non-linear function:
𝑓(𝑥) = 𝑥⁴ − 3𝑥³ + 2
This function has a global minimum at 𝑥 = 9/4 = 2.25 and a stationary point (a saddle, not a
minimum) at 𝑥 = 0.
Initial Guess:
𝑥₀ = 1.5
Learning Rate:

𝛼 = 0.01

Gradient: The gradient (derivative) of the function is:


𝑓′(𝑥) = 4𝑥³ − 9𝑥²
Dataset: Let’s assume we have the following dataset of points sampled from the function:

(𝑥₁, 𝑓(𝑥₁)) = (0,2), (𝑥₂, 𝑓(𝑥₂)) = (1,0), (𝑥₃, 𝑓(𝑥₃)) = (2, −6), (𝑥₄, 𝑓(𝑥₄)) = (3,2)



Batch Gradient Descent (BGD)


Compute the Gradient (using the entire dataset):
∂𝐿(𝑥)/∂𝑥 = (1/4) Σᵢ 𝑓′(𝑥ᵢ)

where 𝑓′(𝑥ᵢ) = 4𝑥ᵢ³ − 9𝑥ᵢ² is the pointwise gradient, averaged over the four sampled points.

Update Rule:

𝑥new = 𝑥old − 𝛼 ⋅ ∂𝐿(𝑥)/∂𝑥
Iteration 1:

o Average the pointwise gradients (the current iterate is 𝑥₀ = 1.5):


𝑓′(𝑥₁) = 4(0)³ − 9(0)² = 0
𝑓′(𝑥₂) = 4(1)³ − 9(1)² = 4 − 9 = −5
𝑓′(𝑥₃) = 4(2)³ − 9(2)² = 32 − 36 = −4
𝑓′(𝑥₄) = 4(3)³ − 9(3)² = 108 − 81 = 27
∂𝐿(𝑥)/∂𝑥 = (1/4)(0 − 5 − 4 + 27) = 18/4 = 4.5
o Update 𝑥:
𝑥₁ = 𝑥₀ − 𝛼 ⋅ 4.5 = 1.5 − 0.01 ⋅ 4.5 = 1.455
Iteration 2:

o The averaged gradient is unchanged, since it is evaluated at the fixed data points:


𝑓′(𝑥₁) = 0, 𝑓′(𝑥₂) = −5, 𝑓′(𝑥₃) = −4, 𝑓′(𝑥₄) = 27
∂𝐿(𝑥)/∂𝑥 = (1/4)(0 − 5 − 4 + 27) = 4.5
o Update 𝑥:
𝑥₂ = 𝑥₁ − 𝛼 ⋅ 4.5 = 1.455 − 0.01 ⋅ 4.5 = 1.41
Observation:

o The parameter 𝑥 is updated once per epoch using the entire dataset.
o The updates are smooth and deterministic.

Stochastic Gradient Descent (SGD)


Update Rule (using one data point at a time):

𝑥new = 𝑥old − 𝛼 ⋅ 𝑓′(𝑥ᵢ)

Iteration 1 (using (𝑥₁, 𝑓(𝑥₁)) = (0,2)):

o Compute the gradient:


𝑓′(𝑥₁) = 0
o Update 𝑥:
𝑥 = 1.5 − 0.01 ⋅ 0 = 1.5
Iteration 2 (using (𝑥₂, 𝑓(𝑥₂)) = (1,0)):

o Compute the gradient:


𝑓′(𝑥₂) = −5
o Update 𝑥:
𝑥 = 1.5 − 0.01 ⋅ (−5) = 1.5 + 0.05 = 1.55
Iteration 3 (using (𝑥₃, 𝑓(𝑥₃)) = (2, −6)):

o Compute the gradient:


𝑓′(𝑥₃) = −4
o Update 𝑥:
𝑥 = 1.55 − 0.01 ⋅ (−4) = 1.55 + 0.04 = 1.59
Iteration 4 (using (𝑥₄, 𝑓(𝑥₄)) = (3,2)):

o Compute the gradient:


𝑓′(𝑥₄) = 27
o Update 𝑥:
𝑥 = 1.59 − 0.01 ⋅ 27 = 1.59 − 0.27 = 1.32
Observation:

o The parameter 𝑥 is updated for each data point.


o The updates are noisy and less deterministic compared to BGD.

Key Differences:
Convergence Behavior:

o BGD converges smoothly but slowly.


o SGD converges faster but with oscillations.
Computational Cost:

o BGD is computationally expensive for large datasets.


o SGD is computationally efficient.
Noise in Updates:

o BGD updates are deterministic.


o SGD updates are noisy due to the use of individual data points.

When to Use:
 BGD: Suitable for small datasets or when stable convergence is required.
 SGD: Suitable for large datasets or when computational efficiency is critical. Mini-batch
SGD is often used as a compromise between BGD and SGD.
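
The one-epoch difference between the two schemes is visible in a short sketch that mirrors the computations above (averaging the pointwise gradients for BGD versus stepping per data point for SGD):

def fprime(x):
    return 4 * x**3 - 9 * x**2  # f'(x) for f(x) = x**4 - 3x**3 + 2

xs = [0.0, 1.0, 2.0, 3.0]  # x-values of the sampled dataset
alpha = 0.01

# Batch GD: one update per epoch using the average pointwise gradient
x_bgd = 1.5
g = sum(fprime(xi) for xi in xs) / len(xs)  # (0 - 5 - 4 + 27) / 4 = 4.5
x_bgd -= alpha * g
print("BGD after one epoch:", x_bgd)  # 1.455

# SGD: one update per data point
x_sgd = 1.5
for xi in xs:
    x_sgd -= alpha * fprime(xi)
print("SGD after one epoch:", x_sgd)  # ~1.32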
Practical Case Study: In training deep neural networks, SGD is preferred for large datasets due
to its scalability.

9. How is optimization implemented in Python? Provide an example.


Answer: Optimization in Python is implemented using libraries like scipy.optimize.
Numerical Example:
from scipy.optimize import minimize

def objective(x):
return x[0]**2 + x[1]**2

x0 = [1, 1]
result = minimize(objective, x0, method='BFGS')
print(result.x) # Optimal solution

Practical Case Study: In hyperparameter tuning, optimization algorithms are used to find the
best model parameters.

1. Gradient Descent for Minimizing a Function


Let’s minimize the function 𝑓(𝑥) = 𝑥² + 5𝑥 + 6 using gradient descent.
import numpy as np

# Define the function and its gradient


def f(x):
return x**2 + 5*x + 6

def grad_f(x):
return 2*x + 5

# Gradient Descent
def gradient_descent(starting_point, learning_rate, num_iterations):
x = starting_point
for i in range(num_iterations):
gradient = grad_f(x)
x = x - learning_rate * gradient
if i % 10 == 0:
print(f"Iteration {i}: x = {x}, f(x) = {f(x)}")
return x

# Parameters
starting_point = 0.0
learning_rate = 0.1
num_iterations = 50

# Run Gradient Descent


optimal_x = gradient_descent(starting_point, learning_rate, num_iterations)
print(f"Optimal x: {optimal_x}")

2. Case Study : Linear Regression with Gradient Descent


We’ll implement linear regression using gradient descent to fit a line to data.
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data


np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term (intercept)


X_b = np.c_[np.ones((100, 1)), X]

# Define the loss function (MSE) and its gradient


def compute_loss(X, y, theta):
m = len(y)
predictions = X.dot(theta)
loss = (1/(2*m)) * np.sum((predictions - y)**2)
return loss

def compute_gradient(X, y, theta):


m = len(y)
predictions = X.dot(theta)
gradient = (1/m) * X.T.dot(predictions - y)
return gradient

# Gradient Descent
def gradient_descent(X, y, theta, learning_rate, num_iterations):
loss_history = []
for i in range(num_iterations):
gradient = compute_gradient(X, y, theta)
theta = theta - learning_rate * gradient
loss = compute_loss(X, y, theta)
loss_history.append(loss)
return theta, loss_history

# Initialize parameters
theta = np.random.randn(2, 1)
learning_rate = 0.1
num_iterations = 1000

# Run Gradient Descent


theta_optimal, loss_history = gradient_descent(X_b, y, theta, learning_rate,
num_iterations)

# Plot the results


plt.plot(loss_history)
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Loss vs. Iterations")
plt.show()

print(f"Optimal parameters (theta): {theta_optimal}")

3. Case Study : Neural Network Training with Adam Optimizer


We’ll use TensorFlow/Keras to train a simple neural network on the MNIST dataset.
import tensorflow as tf
from tensorflow.keras import layers, models

# Load MNIST dataset


(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 28*28).astype("float32") / 255.0
X_test = X_test.reshape(-1, 28*28).astype("float32") / 255.0

# Build a simple neural network


model = models.Sequential([
layers.Dense(128, activation="relu", input_shape=(28*28,)),
layers.Dense(10, activation="softmax")
])

# Compile the model with Adam optimizer


model.compile(optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"])

# Train the model


history = model.fit(X_train, y_train, epochs=5, batch_size=32,
validation_split=0.2)

# Evaluate the model


test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_accuracy}")
Key Takeaways:
1. Basic Optimization: Gradient descent can be implemented from scratch for simple
functions.
2. Linear Regression: Gradient descent is used to fit a model to data by minimizing the
mean squared error (MSE).
3. Neural Networks: Frameworks like TensorFlow/Keras provide built-in optimizers (e.g.,
Adam) for training complex models.
