ML Concepts
(Assuming for the description below that we are building a linear regression model.)
When fitting a model to data (say a linear regression line), bias refers to the model's inability to capture the true relationship: if the data follows a curved pattern but the model fits a straight line, the model has high bias; if the model is changed to a curvy/zigzag line that passes through every point, the bias is reduced or becomes zero. Since we build the model using training data, and the model does not see the testing data until after fitting is completed, the trade-off is this: fitting a curvy line with low bias to the training data will likely cause overfitting, because it predicts the training data perfectly but performs poorly on the testing data, as the model has not generalised. Vice versa, a high-bias model that fits a straight line to curvy/zigzag data points may underfit the training data and so fail to properly capture the true relationship in the data.
The difference in fit between data sets (training vs testing) is called variance. Using the sum-of-squared-errors measure, the curvy/zigzag line has zero error on the training data (it fits the training points perfectly) but high error on the testing data, so it has very high variance (variability). Vice versa, the straight line fits the training data with some amount of error but does about the same on the testing data; the sum of squared errors is not very different between the two sets, so this model is generalising the data representation quite well, i.e. it has low variance (variability).
Ideally a model should have low bias and low variance, but since that is rarely achievable in the real world, the idea is to find a sweet spot between a simple model (say a straight line, y = mx + b) and a complex model (say a quadratic or higher-order polynomial), balancing bias and variance and ultimately striking the sweet spot between overfitting and underfitting too.
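A minimal sketch of this trade-off, assuming NumPy and scikit-learn are available (the sine-shaped data, noise level, and polynomial degrees below are illustrative choices, not taken from the notes above): a degree-1 fit underfits (high bias, similarly large train and test error), while a very high-degree fit overfits (near-zero training error, much larger testing error, i.e. high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Noisy curved (sine) data: a straight line cannot capture the true relationship.
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

for degree in (1, 15):  # degree 1 = simple straight line, degree 15 = very curvy fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The sweet spot is usually some intermediate complexity where the testing error is lowest, even though the training error keeps dropping as the model gets more flexible.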
__Entropy / KL Divergence__
Cross-Entropy, H(P, Q): the average number of total bits needed to represent an event from P when using an encoding optimised for Q instead of P.
Relative Entropy (KL Divergence), KL(P || Q): the average number of extra bits needed to represent an event from P when using an encoding optimised for Q instead of P, i.e. H(P, Q) - H(P).
https://machinelearningmastery.com/cross-entropy-for-machine-learning/
Cross-Entropy (total bits): in classification, the target (real) distribution for each example is one-hot, i.e. probability 1.0 for the true class label and 0.0 for the rest, and the entropy of such a distribution is 0. So when the predicted distribution is identical to the target distribution, the cross-entropy is 0.0. Recall that when evaluating a model using cross-entropy on a training dataset we average the cross-entropy across all examples in the dataset. Therefore, a cross-entropy of 0.0 when training a model indicates that the predicted class probabilities are identical to the probabilities in the training dataset, e.g. zero loss. In practice, a cross-entropy loss of 0.0 often indicates that the model has overfit the training dataset, but that is another story.
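A small NumPy sketch of these quantities for discrete distributions given as probability vectors (the example distributions P and Q below are made up purely for illustration):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum p * log2(p), skipping zero-probability events."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(P, Q) = -sum p * log2(q): total bits to encode events from P with a code built for Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_divergence(p, q):
    """KL(P || Q) = H(P, Q) - H(P): the extra bits from using Q's code instead of P's."""
    return cross_entropy(p, q) - entropy(p)

# Illustrative distributions over 3 events.
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

print(f"H(P)     = {entropy(p):.3f} bits")
print(f"H(P, Q)  = {cross_entropy(p, q):.3f} bits")
print(f"KL(P||Q) = {kl_divergence(p, q):.3f} bits")

# One-hot target vs an identical prediction: cross-entropy is 0.0 (zero loss).
print(f"one-hot match: {cross_entropy([0, 1, 0], [0, 1, 0]):.1f} bits")
```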
__Word Embeddings__
https://ai.stackexchange.com/questions/18634/what-are-the-main-differences-between-skip-gram-and-continuous-bag-of-words
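As a rough sketch of the two training objectives discussed at that link, assuming gensim 4.x is available (the toy corpus and parameter values are illustrative; in gensim's Word2Vec the `sg` flag switches between CBOW, `sg=0`, and skip-gram, `sg=1`):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (illustrative only).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# CBOW: predict the centre word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram: predict the surrounding context words from the centre word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                      # (50,) embedding vector
print(skipgram.wv.most_similar("cat", topn=3))   # nearest neighbours in embedding space
```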
__Types of Optimizers__
- Gradient Descent: Gradient Descent is a fundamental optimization algorithm used
in machine learning. It updates the model parameters in the opposite direction of
the gradient of the loss function with respect to the parameters. It iteratively
adjusts the parameters to find the minimum of the loss function, aiming to
optimize the model.
- Stochastic Gradient Descent (SGD): SGD is a variant of Gradient Descent that computes the gradient from a single randomly selected training example (or, in the common mini-batch variant, a small random subset) rather than the full dataset, and updates the parameters with it. It introduces randomness into the optimization process and is far more computationally efficient per step for large datasets.
- AdaGrad (Adaptive Gradient Algorithm): AdaGrad adapts the learning rate for each parameter based on its historical gradients, dividing by the square root of the accumulated squared gradients. Parameters tied to infrequent features therefore keep a relatively large effective learning rate, while frequently updated parameters get a smaller one. AdaGrad is effective for sparse data and has been widely used in natural language processing tasks.
- RMSprop (Root Mean Square Propagation): RMSprop is an optimizer that
addresses the limitations of AdaGrad by maintaining an exponentially decaying
average of past squared gradients. It divides the learning rate by the root mean
square (RMS) of the past gradients, allowing for more stable and adaptive updates.
- Adam (Adaptive Moment Estimation): Adam is an adaptive optimization algorithm
that combines the advantages of both AdaGrad and RMSprop. It adapts the
learning rate for each parameter by considering the first and second moments of
the gradients. Adam is widely used in deep learning due to its efficiency and
effectiveness in optimizing complex models.
- Adadelta: Adadelta is an extension of AdaGrad that further improves its
limitations by addressing the rapidly decreasing learning rate. It replaces the
accumulation of past gradients with an exponentially decaying average, which
allows for adaptive learning rates without explicitly setting a global learning rate.
- Momentum: Momentum is an optimizer that adds a fraction of the previous
update to the current update, effectively creating momentum in the optimization
process. It helps accelerate convergence by dampening oscillations and speeding
up convergence along shallow directions in the optimization landscape.
- Nesterov Accelerated Gradient (NAG): NAG is an extension of the momentum optimizer that evaluates the gradient at the look-ahead position the momentum term is about to move to, rather than at the current parameters. This reduces oscillations and allows for faster convergence, especially in scenarios with sparse gradients.
Adam, RMSprop, AdaGrad, and Stochastic Gradient Descent (SGD) are all optimizers; plain SGD typically converges the slowest among them, since it uses a single global learning rate with no per-parameter adaptation or momentum. A small sketch of several of the update rules is shown below.
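A minimal NumPy sketch of a few of these update rules on a toy quadratic loss (the loss function, learning rates, decay factors, and step counts are illustrative assumptions, not values from the notes above):

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w^T A w with an ill-conditioned A, so grad f(w) = A w.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w

def gd(w, lr=0.03, steps=200):
    # Plain (full-batch) gradient descent; SGD would use a noisy estimate of grad(w).
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.03, beta=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)                  # accumulate velocity from past gradients
        w = w - lr * v
    return w

def rmsprop(w, lr=0.03, beta=0.9, eps=1e-8, steps=200):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g**2        # decaying average of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)     # per-parameter scaled step
    return w

def adam(w, lr=0.03, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g               # first moment (mean of gradients)
        v = b2 * v + (1 - b2) * g**2            # second moment (uncentred variance)
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)  # bias correction
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([5.0, 5.0])
for name, opt in [("GD", gd), ("Momentum", momentum), ("RMSprop", rmsprop), ("Adam", adam)]:
    print(f"{name:8s} final w = {opt(w0.copy())}")
```

On an ill-conditioned objective like this, the adaptive methods rescale each parameter's step individually, which is exactly the per-parameter behaviour the descriptions above refer to.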