Unit4 Notes
o In neural networks, the chain rule of calculus is used to compute gradients layer
by layer during backpropagation.
o This allows the network to learn by adjusting weights to minimize the loss function, as sketched in the example below.
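To make this concrete, here is a minimal illustrative sketch (not from the notes) of the chain rule applied stage by stage, the way backpropagation does; the function choices are assumptions for illustration.
import math

# Toy two-stage computation: y = f(g(x)) with g(x) = x**2 and f(u) = sin(u).
# Chain rule: dy/dx = f'(g(x)) * g'(x), computed one stage at a time.
x = 1.5
u = x**2            # forward pass through the inner "layer"
y = math.sin(u)     # forward pass through the outer "layer"

dy_du = math.cos(u)    # local gradient of the outer function
du_dx = 2 * x          # local gradient of the inner function
dy_dx = dy_du * du_dx  # chain rule: multiply local gradients backwards
print(dy_dx)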
3. Convexity Analysis:
o Derivatives measure how a function changes with respect to its inputs, enabling
the analysis of model behavior and sensitivity.
o For example, in regression models, the derivative of the loss function with respect
to the parameters determines how the parameters should be updated.
5. Optimization Algorithms:
Summary of Results:
7. Derivative of 𝑓(𝑥):
𝑓′(𝑥) = 2𝑥 + 3
o This represents the growth rate of the plant at any time 𝑥.
8. Slope of the Tangent at 𝑥 = 2:
𝑓′(2) = 7 cm/week
o At 𝑥 = 2 weeks, the plant is growing at a rate of 7 cm per week.
9. Behavior of the Function at 𝑥 = 2:
o The function is increasing at 𝑥 = 2, meaning the plant's height is growing at this
time.
Detailed Answer:
1. Compute the First Derivative 𝑓′(𝑥)
The first derivative of 𝑓(𝑥) is calculated using the power rule:
𝑓(𝑥) = 𝑥³ − 6𝑥² + 9𝑥 + 2
𝑓′(𝑥) = (𝑑/𝑑𝑥)(𝑥³) − (𝑑/𝑑𝑥)(6𝑥²) + (𝑑/𝑑𝑥)(9𝑥) + (𝑑/𝑑𝑥)(2)
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9 + 0
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9
Answer:
𝑓′(𝑥) = 3𝑥² − 12𝑥 + 9
The second derivative is 𝑓″(𝑥) = 6𝑥 − 12, which is negative for 𝑥 < 2 and positive for 𝑥 > 2. Since the concavity changes from downward to upward at 𝑥 = 2, this confirms that 𝑥 = 2 is an inflection point.
Answer:
For 𝑥 < 2, the function is concave downward.
For 𝑥 > 2, the function is concave upward.
At 𝑥 = 2, the function has an inflection point.
3. Define convex sets and convex functions. Provide examples.
Answer:
Convex Sets
A set 𝑆 is convex if, for any two points 𝑥, 𝑦 ∈ 𝑆, the line segment connecting them lies entirely
within 𝑆. Mathematically:
𝜆𝑥 + (1 − 𝜆)𝑦 ∈ 𝑆 ∀𝜆 ∈ [0,1]
Examples:
A filled circle (a disk) is a convex set because any line segment connecting two points within it lies entirely inside it.
A crescent moon is not a convex set because there exist pairs of points whose connecting line segment leaves the set.
Convex Functions
A function 𝑓(𝑥) is convex if, for any two points 𝑥, 𝑦 in its domain and 𝜆 ∈ [0,1]:
𝑓(𝜆𝑥 + (1 − 𝜆)𝑦) ≤ 𝜆𝑓(𝑥) + (1 − 𝜆)𝑓(𝑦)
Examples:
𝑓(𝑥) = 𝑥² is convex because its second derivative 𝑓″(𝑥) = 2 is always positive.
𝑓(𝑥) = sin(𝑥) is not convex because its second derivative 𝑓″(𝑥) = −sin(𝑥) changes sign.
Numerical Example: Consider the function 𝑓(𝑥) = 𝑥². For 𝑥 = 1 and 𝑦 = 2, and 𝜆 = 0.5:
𝑓(0.5 ⋅ 1 + 0.5 ⋅ 2) = 𝑓(1.5) = 2.25
0.5 ⋅ 𝑓(1) + 0.5 ⋅ 𝑓(2) = 0.5 ⋅ 1 + 0.5 ⋅ 4 = 2.5
Since 2.25 ≤ 2.5, the convexity condition holds at these points, consistent with 𝑓(𝑥) = 𝑥² being convex everywhere (𝑓″(𝑥) = 2 > 0).
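The same check can be scripted. This small illustrative sketch tests the convexity inequality for 𝑓(𝑥) = 𝑥² at several random point pairs; note that passing checks are evidence, not a proof of convexity.
import random

# Convexity check: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
def f(x):
    return x**2

for _ in range(5):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.uniform(0, 1)
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    print(f"lhs={lhs:.4f} <= rhs={rhs:.4f}: {lhs <= rhs}")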
Practical Case Study: In support vector machines (SVM), the objective function is convex,
ensuring a unique global minimum. This convexity property makes SVM optimization efficient
and reliable.
Q4. A company is designing a new product and needs to optimize its production process.
The production cost 𝑪(𝒙) depends on the quantity 𝒙 of raw materials used. The company
has identified that the cost function 𝑪(𝒙) and the feasible set of raw material quantities 𝑺
play a crucial role in determining the optimal production strategy. The cost function is
given by:
𝑪(𝒙) = 𝒙² + 𝟏𝟎𝒙 + 𝟏𝟎𝟎,
and the feasible set of raw material quantities is 𝑺 = [𝟓, 𝟏𝟓].
Based on this information, answer the following questions:
Show that the feasible set 𝑆 = [5,15] is a convex set by verifying the convexity
condition for 𝑥 = 5 and 𝑥 = 15 with 𝜆 = 0.3.
Verify that the cost function 𝐶(𝑥) = 𝑥² + 10𝑥 + 100 is convex by checking the
convexity condition for 𝑥 = 5 and 𝑥 = 10 with 𝜆 = 0.5. Both checks are worked out below.
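Working both checks out with the given numbers:
For the set: 𝜆𝑥 + (1 − 𝜆)𝑦 = 0.3 ⋅ 5 + 0.7 ⋅ 15 = 1.5 + 10.5 = 12, and 12 ∈ [5, 15], so the segment point stays in 𝑆.
For the function: 𝐶(0.5 ⋅ 5 + 0.5 ⋅ 10) = 𝐶(7.5) = 56.25 + 75 + 100 = 231.25, while 0.5 ⋅ 𝐶(5) + 0.5 ⋅ 𝐶(10) = 0.5 ⋅ 175 + 0.5 ⋅ 300 = 237.5. Since 231.25 ≤ 237.5, the convexity condition holds (and 𝐶″(𝑥) = 2 > 0 confirms convexity on all of 𝑆).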
o Constraints are conditions that the solution must satisfy. For example, regularization terms such as an L1 or L2 penalty are added to the loss function to prevent overfitting.
3. Gradient-Based Methods:
𝐿(𝑚, 𝑏) = (1/𝑁) Σᵢ (𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏))²,
where:
𝑁 is the number of data points,
(𝑥ᵢ, 𝑦ᵢ) are the data points,
𝑚𝑥ᵢ + 𝑏 is the predicted value for input 𝑥ᵢ.
∂𝐿/∂𝑚 = −(2/𝑁) Σᵢ 𝑥ᵢ (𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)),
∂𝐿/∂𝑏 = −(2/𝑁) Σᵢ (𝑦ᵢ − (𝑚𝑥ᵢ + 𝑏)).
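As a sanity check, these gradient formulas translate directly into NumPy. The data below is an illustrative assumption, not from the notes:
import numpy as np

# Illustrative data: y roughly follows 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
m, b = 0.0, 0.0
N = len(x)

residual = y - (m * x + b)                # y_i - (m x_i + b)
dL_dm = -2.0 / N * np.sum(x * residual)   # matches the formula for dL/dm
dL_db = -2.0 / N * np.sum(residual)       # matches the formula for dL/db
print(dL_dm, dL_db)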
𝑓(𝑥) = 𝑥² + 4𝑥 + 4
o This is a quadratic function with a parabolic shape.
o The function has a global minimum at 𝑥 = −2.
Initial guess: 𝑥₀ = 5
Learning rate: 𝛼 = 0.1
o The learning rate controls the step size during each iteration.
o A smaller learning rate leads to slower but more stable convergence.
Number of iterations: 5
Step 3 - Iterations
Perform 5 iterations of gradient descent starting from 𝑥₀ = 5.
Iteration Table:
Iteration 𝑥old 𝑓′(𝑥old) 𝑥new = 𝑥old − 0.1 ⋅ 𝑓′(𝑥old)
0 5.0 14.0 5.0 − 0.1 ⋅ 14.0 = 3.6
1 3.6 11.2 3.6 − 0.1 ⋅ 11.2 = 2.48
2 2.48 8.96 2.48 − 0.1 ⋅ 8.96 = 1.584
3 1.584 7.168 1.584 − 0.1 ⋅ 7.168 = 0.8672
4 0.8672 5.7344 0.8672 − 0.1 ⋅ 5.7344 = 0.29376
5 0.29376 4.58752 0.29376 − 0.1 ⋅ 4.58752 = −0.164992
Explanation of iterations:
o At each step, the gradient is computed using 𝑓′(𝑥) = 2𝑥 + 4.
o The new value of 𝑥 is updated using the gradient descent formula.
o Over time, 𝑥 moves closer to the minimum at 𝑥 = −2.
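These steps translate into a short Python sketch (illustrative code reproducing the table above):
# Gradient descent on f(x) = x**2 + 4x + 4, whose gradient is f'(x) = 2x + 4
def f_prime(x):
    return 2 * x + 4

x = 5.0
alpha = 0.1  # learning rate
for i in range(6):  # table rows 0 through 5
    grad = f_prime(x)
    print(f"iter {i}: x_old = {x:.6f}, f'(x_old) = {grad:.6f}")
    x = x - alpha * grad  # update rule: x_new = x_old - alpha * f'(x_old)
print(f"final x = {x:.6f}")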
Final Result
After the tabulated iterations, 𝑥 ≈ −0.165, steadily approaching the minimum at 𝑥 = −2.
Mathematical Insight
The function 𝑓(𝑥) = 𝑥² + 4𝑥 + 4 can be rewritten as:
𝑓(𝑥) = (𝑥 + 2)²
Explanation:
o This form shows that the function is a perfect square.
o The minimum value of 𝑓(𝑥) is 0, achieved at 𝑥 = −2.
o Gradient descent is converging toward this minimum.
Key Takeaways
Gradient descent is an iterative optimization algorithm:
o Quadratic functions have a single global minimum, making them ideal for
gradient descent.
More iterations or a smaller learning rate can improve accuracy:
o Increasing the number of iterations or reducing the learning rate can lead to better
results.
Explanation
Momentum introduces a velocity term 𝑣 that accumulates past gradients, reducing
oscillations.
Momentum is particularly useful for functions with steep regions or noisy gradients.
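One common textbook form of the momentum update (stated here for reference, since the notes describe it only in words) is 𝑣 = 𝛽 ⋅ 𝑣 + 𝑓′(𝑥old) followed by 𝑥new = 𝑥old − 𝛼 ⋅ 𝑣. The equations below are the Adam variant, which adds a second moment and bias correction: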
𝑚 = 𝛽₁ ⋅ 𝑚 + (1 − 𝛽₁) ⋅ 𝑓′(𝑥old)
𝑣 = 𝛽₂ ⋅ 𝑣 + (1 − 𝛽₂) ⋅ 𝑓′(𝑥old)²
𝑚corrected = 𝑚 / (1 − 𝛽₁ᵗ), 𝑣corrected = 𝑣 / (1 − 𝛽₂ᵗ)
𝑥new = 𝑥old − 𝛼 ⋅ 𝑚corrected / (√𝑣corrected + 𝜖)
Iterations
Iteration | 𝑥old | 𝑓′(𝑥old) | 𝑚 | 𝑣 | 𝑚corrected | 𝑣corrected | 𝑥new
0 | 5.0 | 14.0 | 0.9×0 + 0.1×14 = 1.4 | 0.999×0 + 0.001×196 = 0.196 | 1.4/(1−0.9) = 14.0 | 0.196/(1−0.999) = 196 | 5.0 − 0.1×14.0/√196 = 4.9
1 | 4.9 | 13.8 | 0.9×1.4 + 0.1×13.8 = 2.64 | 0.999×0.196 + 0.001×190.44 = 0.386 | 2.64/(1−0.9²) = 13.8947 | 0.386/(1−0.999²) = 193.2 | 4.9 − 0.1×13.8947/√193.2 ≈ 4.8
2 | 4.8 | 13.6 | 0.9×2.64 + 0.1×13.6 = 3.736 | 0.999×0.386 + 0.001×184.96 = 0.57056 | 3.736/(1−0.9³) = 13.78 | 0.57056/(1−0.999³) ≈ 190.4 | 4.8 − 0.1×13.78/√190.4 ≈ 4.7
Explanation
Adam maintains two moving averages: 𝑚 (first moment) and 𝑣 (second moment).
The corrected moments 𝑚corrected and 𝑣 corrected account for bias in the initial steps.
Adam is widely used in deep learning due to its adaptability and robustness.
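A compact Python sketch of these Adam updates, using the same 𝑓′(𝑥) = 2𝑥 + 4 and the hyperparameters implied by the table (𝛼 = 0.1, 𝛽₁ = 0.9, 𝛽₂ = 0.999); the implementation itself is illustrative:
import math

def f_prime(x):
    return 2 * x + 4

x, m, v = 5.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 4):  # three steps, matching the table rows
    g = f_prime(x)
    m = beta1 * m + (1 - beta1) * g        # first moment
    v = beta2 * v + (1 - beta2) * g**2     # second moment
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    print(f"t={t}: x = {x:.4f}")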
Practical Case Study: In neural networks, stochastic gradient descent (SGD) is used to train the
model on large datasets. The noisy updates help the model escape local minima and converge to
a better solution.
Initialization
Let’s start with an initial guess:
(𝑥₀, 𝑦₀) = (2, 3)
Key Takeaways:
1. Learning Rate Matters: The choice of learning rate significantly impacts the convergence of gradient descent.
o Too High: Causes oscillations or divergence.
o Too Low: Results in slow convergence.
2. Optimal Learning Rate: Depends on the problem and the function being optimized. In practice, techniques like learning rate scheduling or adaptive learning rates (e.g., Adam, RMSprop) are used to dynamically adjust the learning rate.
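These effects are easy to demonstrate. A small illustrative sketch, using the assumed test function 𝑓(𝑥) = 𝑥² with 𝑓′(𝑥) = 2𝑥, compares three learning rates:
# Effect of learning rate on gradient descent for f(x) = x**2
def f_prime(x):
    return 2 * x

for lr in (0.01, 0.1, 1.1):  # too low, reasonable, too high
    x = 5.0
    for _ in range(20):
        x = x - lr * f_prime(x)
    print(f"lr={lr}: x after 20 steps = {x:.4f}")
# lr=0.01 barely moves, lr=0.1 approaches the minimum at 0, lr=1.1 diverges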
Practical Case Study: In deep learning, adaptive learning rate methods like Adam are used to
improve convergence.
8. What is the difference between batch gradient descent and stochastic
gradient descent?
Answer:
Batch GD: Uses the entire dataset to compute the gradient.
SGD: Uses a single data point (or mini-batch) to compute the gradient.
Let’s consider another numerical case study to compare Batch Gradient Descent (BGD) and
Stochastic Gradient Descent (SGD). This time, we’ll use a non-linear function to better
illustrate the differences in behavior between the two algorithms.
𝛼 = 0.01
where 𝑓′(𝑥) = 4𝑥³ − 9𝑥².
Update Rule:
𝑥new = 𝑥old − 𝛼 ⋅ ∂𝐿(𝑥)/∂𝑥
Iteration 1:
o The parameter 𝑥 is updated once per epoch using the entire dataset.
o The updates are smooth and deterministic.
𝑥new = 𝑥old − 𝛼 ⋅ 𝑓′(𝑥old)
Key Differences:
Convergence Behavior:
o BGD updates once per epoch using all data, so its trajectory is smooth and deterministic; SGD updates per sample, so its trajectory is noisy but can help escape local minima.
When to Use:
BGD: Suitable for small datasets or when stable convergence is required.
SGD: Suitable for large datasets or when computational efficiency is critical. Mini-batch
SGD is often used as a compromise between BGD and SGD.
Practical Case Study: In training deep neural networks, SGD is preferred for large datasets due
to its scalability.
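To make the contrast concrete, here is a minimal sketch of one epoch of each update style for a one-parameter linear model (the data and loss are illustrative assumptions):
import numpy as np

# Illustrative 1-D data: y roughly 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
lr = 0.01

# Batch GD: ONE update per epoch, using the gradient averaged over all points
w = 0.0
grad = -2 * np.mean(X * (y - w * X))
w = w - lr * grad
print("BGD after one epoch:", w)

# SGD: one update PER data point, so four updates in the same epoch
w = 0.0
for xi, yi in zip(X, y):
    grad = -2 * xi * (yi - w * xi)
    w = w - lr * grad
print("SGD after one epoch:", w)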
from scipy.optimize import minimize

def objective(x):
    return x[0]**2 + x[1]**2

x0 = [1, 1]  # initial guess
result = minimize(objective, x0, method='BFGS')
print(result.x)  # Optimal solution
Practical Case Study: In hyperparameter tuning, optimization algorithms are used to find the
best model parameters.
# Objective (assumed to match the gradient below: f(x) = x**2 + 5x) and its gradient
def f(x):
    return x**2 + 5*x

def grad_f(x):
    return 2*x + 5

# Gradient Descent
def gradient_descent(starting_point, learning_rate, num_iterations):
    x = starting_point
    for i in range(num_iterations):
        gradient = grad_f(x)
        x = x - learning_rate * gradient
        if i % 10 == 0:
            print(f"Iteration {i}: x = {x}, f(x) = {f(x)}")
    return x

# Parameters
starting_point = 0.0
learning_rate = 0.1
num_iterations = 50

# Run gradient descent (converges toward the minimum at x = -2.5)
x_min = gradient_descent(starting_point, learning_rate, num_iterations)
print(f"Result: x = {x_min}")
import numpy as np

# Helper definitions (assumed; not shown in the notes): MSE loss and its gradient
def compute_loss(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

def compute_gradient(X, y, theta):
    return 2 / len(y) * X.T @ (X @ theta - y)

# Gradient Descent
def gradient_descent(X, y, theta, learning_rate, num_iterations):
    loss_history = []
    for i in range(num_iterations):
        gradient = compute_gradient(X, y, theta)
        theta = theta - learning_rate * gradient
        loss_history.append(compute_loss(X, y, theta))
    return theta, loss_history

# Illustrative data: y ~ 4 + 3x with noise, bias column prepended to X
X = np.c_[np.ones((100, 1)), np.random.rand(100, 1)]
y = 4 + 3 * X[:, 1:2] + 0.1 * np.random.randn(100, 1)

# Initialize parameters
theta = np.random.randn(2, 1)
learning_rate = 0.1
num_iterations = 1000
theta, loss_history = gradient_descent(X, y, theta, learning_rate, num_iterations)
print(theta)  # should end up close to [[4], [3]]