1. Gradient Descent (GD)
Concept: Iteratively updates the parameters in the direction of the negative gradient of the cost function in order to minimize it.
Update Rule:
θ = θ − η ⋅ ∇J(θ)
Where:
θ: Parameters
η: Learning rate
∇J(θ): Gradient of the cost function.
Variants:
Batch GD: Uses the entire dataset for each update. Slow for large datasets.
Stochastic GD (SGD): Updates parameters using a single data point. Faster but noisy.
Pros:
Conceptually simple.
Cons:
Each update requires a full pass over the dataset, which is slow for large datasets.
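To make the update concrete, here is a minimal sketch of batch gradient descent on a least-squares cost using NumPy. The synthetic data, the quadratic cost, and the values of η and the iteration count are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Batch gradient descent on least squares: J(θ) = (1/2m) · ||Xθ − y||².
# Data, learning rate, and iteration count are assumptions for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # m = 100 samples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

theta = np.zeros(3)                          # θ: parameters
eta = 0.1                                    # η: learning rate
m = len(y)

for _ in range(500):
    grad = X.T @ (X @ theta - y) / m         # ∇J(θ): gradient over the entire dataset
    theta = theta - eta * grad               # θ = θ − η ⋅ ∇J(θ)

print(theta)                                 # approximately recovers true_theta
```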
2. Stochastic Gradient Descent (SGD)
Concept: A variant of GD where updates are made after evaluating a single data point.
Update Rule:
θ = θ − η ⋅ ∇J(θ; x⁽ⁱ⁾)
Pros:
Faster updates.
Cons:
Noisy updates; the loss can fluctuate from step to step.
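For comparison, a minimal sketch of SGD on the same kind of least-squares problem, applying the single-sample update θ ← θ − η ⋅ ∇J(θ; x⁽ⁱ⁾). The data and hyperparameters are again illustrative assumptions.

```python
import numpy as np

# Stochastic gradient descent: one randomly chosen data point per update.
# Data, learning rate, and epoch count are assumptions for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.05                                   # η: learning rate

for epoch in range(20):
    for i in rng.permutation(len(y)):        # visit samples in random order
        grad = (X[i] @ theta - y[i]) * X[i]  # ∇J(θ; x⁽ⁱ⁾): gradient for a single sample
        theta = theta - eta * grad           # cheap but noisy update

print(theta)                                 # close to true_theta, with some noise
```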
3. AdaGrad (Adaptive Gradient)
Concept: Adapts the learning rate of each parameter based on the history of its gradients.
Update Rule:
θₜ₊₁ = θₜ − (η / √(gₜ + ϵ)) ⋅ ∇J(θₜ)
Where:
gₜ: Accumulated sum of squared gradients up to step t.
ϵ: Small constant added for numerical stability.
Pros:
Automatically adjusts learning rate.
Cons:
Accumulated gradients can lead to very small updates over time (learning-rate decay).
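A minimal sketch of the AdaGrad rule above, assuming gₜ is the element-wise accumulated sum of squared gradients; the quadratic test function and the values of η and ϵ are illustrative assumptions.

```python
import numpy as np

# AdaGrad on a badly scaled quadratic J(θ) = 0.5 · θᵀ A θ (minimum at the origin).
# A, η, ϵ, and the step count are assumptions for this sketch.
A = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
eta, eps = 0.5, 1e-8                         # η: learning rate, ϵ: stability constant
g_acc = np.zeros_like(theta)                 # gₜ: accumulated sum of squared gradients

for _ in range(200):
    grad = A @ theta                                     # ∇J(θₜ)
    g_acc += grad ** 2                                   # accumulate element-wise
    theta = theta - eta / np.sqrt(g_acc + eps) * grad    # θₜ₊₁ = θₜ − η/√(gₜ + ϵ) ⋅ ∇J(θₜ)

print(theta)                                 # moves toward the minimizer at the origin
```

Because the accumulator only grows, the effective step η / √(gₜ + ϵ) keeps shrinking, which is exactly the decay listed under Cons.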
4. RMSProp
Concept: Keeps an exponentially decaying moving average of squared gradients, so the learning rate adapts without AdaGrad's aggressive decay.
Update Rule:
θₜ₊₁ = θₜ − (η / √(E[g²]ₜ + ϵ)) ⋅ ∇J(θₜ)
Where:
E[g²]ₜ: Exponentially decaying average of squared gradients, E[g²]ₜ = β ⋅ E[g²]ₜ₋₁ + (1 − β) ⋅ gₜ², with decay rate β and gₜ = ∇J(θₜ).
ϵ: Small constant added for numerical stability.
Pros:
Counteracts AdaGrad's learning-rate decay; adapts the learning rate dynamically (useful for RNNs).
Cons:
Introduces an extra hyperparameter (the decay rate β) that must be tuned alongside η.
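A minimal sketch of the RMSProp rule, assuming E[g²]ₜ is an exponential moving average of squared gradients with decay rate β = 0.9; the test function and hyperparameter values are illustrative assumptions.

```python
import numpy as np

# RMSProp on the same badly scaled quadratic J(θ) = 0.5 · θᵀ A θ.
# A, η, β, ϵ, and the step count are assumptions for this sketch.
A = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
eta, beta, eps = 0.01, 0.9, 1e-8             # η, β (decay rate), ϵ
avg_sq = np.zeros_like(theta)                # E[g²]ₜ: moving average of squared gradients

for _ in range(500):
    grad = A @ theta                                       # ∇J(θₜ)
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2        # E[g²]ₜ = β⋅E[g²]ₜ₋₁ + (1−β)⋅gₜ²
    theta = theta - eta / np.sqrt(avg_sq + eps) * grad     # θₜ₊₁ = θₜ − η/√(E[g²]ₜ + ϵ) ⋅ ∇J(θₜ)

print(theta)                                 # settles near the origin; steps stay roughly η in size
```

Replacing the growing sum with a moving average keeps the effective step from collapsing to zero, which is how RMSProp counteracts AdaGrad's decay.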
Exam Tips
Comparison: Be ready to compare techniques (e.g., AdaGrad adjusts rates based on past gradients; RMSProp adds a moving average to counteract decay).
Applications:
Use GD for theoretical understanding and when you can afford full-batch updates.
Use SGD for large datasets.
Use RMSProp for RNNs or when the learning rate needs to adapt dynamically.
Mathematical Details: Memorize update rules and key parameters (e.g., η, β, ϵ) for quick recall.
Practical Scenarios: Know which optimizer to use based on data size and feature sparsity.
AdaGrad: Adjusts learning rates based on gradient history. Pros: Good for sparse data. Cons: Learning rate decays too fast.