
ML Concepts

The document discusses key concepts in machine learning, including bias and variance in model fitting, the importance of finding a balance between overfitting and underfitting, and various optimization techniques. It also covers cross-entropy and KL divergence in relation to model evaluation, as well as different types of recommendation systems and weight initialization techniques. Additionally, it outlines various optimizers used in deep learning, highlighting their unique characteristics and applications.

Uploaded by

kjmfqkwj4v

__Bias & Variance__

(The description below assumes we are building a linear regression model.)

When fitting a model to data (say, a linear regression line), bias refers to the
model's inability to capture the true relationship. If the data follows a curve but
the model fits a straight line, the model has high bias. Switching to a curvy/zigzag
line reduces the bias, possibly to zero on the training data. The model is built
from training data only and does not see the testing data until fitting is complete,
so there is a trade-off: a curvy line with low bias on the training data will likely
overfit, predicting the training data perfectly but the testing data poorly because
the model has not generalised. Conversely, a high-bias straight line fitted to
curvy/zigzag data points may underfit the training data and fail to capture the
true relationship.

The difference in fit between data sets (training vs testing) is called variance.
Measured by the sum of squared errors, the curvy/zigzag line has zero error on the
training data (it passes through every training point) but high error on the testing
data, so it has very high variance (variability). Conversely, the straight line fits
the training data with some error and fits the testing data with about the same
error. Its sum of squared errors is similar on both sets, so the model generalises
well and has low variance.

Ideally a model has both low bias and low variance. In practice, reducing one usually
increases the other, so the idea is to find the sweet spot between a simple model
(say, a line y = mx + b) and a complex model (say, a quadratic), balancing bias and
variance and thereby landing between underfitting and overfitting.
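
A rough sketch of the trade-off described above, using synthetic data. The quadratic
"true relationship", the noise level, the sample sizes, and the helper names
(`fit_poly`, `fit_zigzag`, `sse`) are assumptions made for illustration. The zigzag
line is stood in for by a 1-nearest-neighbour predictor, which passes through every
training point.

```python
import random

rng = random.Random(0)

def sample(n):
    # Assumed true relationship: y = x^2 plus Gaussian noise
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 1.0)) for x in xs]

train, test = sample(30), sample(30)

def fit_poly(data, degree):
    """Least-squares polynomial fit via the normal equations."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x, _ in data) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in data) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / A[i][i]
    return lambda x: sum(c * x ** i for i, c in enumerate(coef))

def fit_zigzag(data):
    # 1-nearest-neighbour: returns the y of the closest training x
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def sse(model, data):
    # Sum of squared errors of the model on a data set
    return sum((y - model(x)) ** 2 for x, y in data)

models = {
    "line (high bias)": fit_poly(train, 1),
    "quadratic": fit_poly(train, 2),
    "zigzag (high variance)": fit_zigzag(train),
}
for name, model in models.items():
    print(f"{name:24s} train SSE = {sse(model, train):8.2f}   "
          f"test SSE = {sse(model, test):8.2f}")
```

The zigzag model has zero training error by construction, so the gap between its
training and testing SSE is the variance described above. The straight line shows
errors on both sets because it cannot follow the curve.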

__Entropy / KL Divergence__
Cross-Entropy: Average number of total bits to represent an event from Q instead
of P.
Relative Entropy (KL Divergence): Average number of extra bits to represent an
event from Q instead of P.

https://machinelearningmastery.com/cross-entropy-for-machine-learning/

Cross-entropy (total bits): when the real and predicted distributions for a class
label are identical, the cross-entropy equals the entropy of that distribution. With
one-hot class labels that entropy is 0, so the cross-entropy is also 0.0. Recall that
when evaluating a model using cross-entropy on a training dataset, we average the
cross-entropy across all examples in the dataset. A cross-entropy of 0.0 when
training a model therefore indicates that the predicted class probabilities are
identical to the (one-hot) probabilities in the training dataset, i.e. zero loss.
In practice, a cross-entropy loss of 0.0 often indicates that the model has overfit
the training dataset, but that is another story.
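
A small sketch of the quantities above, in bits (log base 2). The distributions `p`
and `q` are made-up numbers for illustration. It also checks the identity
H(P, Q) = H(P) + KL(P || Q), which is why KL divergence counts the "extra" bits.

```python
import math

def entropy(p):
    # H(P) = -sum p_i * log2(p_i); terms with p_i = 0 contribute nothing
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum p_i * log2(q_i): total bits when coding P with a code built for Q
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(P || Q) = sum p_i * log2(p_i / q_i): extra bits beyond H(P)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.10, 0.40, 0.50]   # assumed "real" distribution
q = [0.80, 0.15, 0.05]   # assumed "predicted" distribution

print("H(P)       =", entropy(p))
print("H(P, Q)    =", cross_entropy(p, q))
print("KL(P || Q) =", kl_divergence(p, q))

# Identical one-hot distributions: entropy is 0, so cross-entropy is 0 too
one_hot = [0.0, 1.0, 0.0]
print("H(one_hot, one_hot) =", abs(cross_entropy(one_hot, one_hot)))

# Identical but not one-hot: cross-entropy equals the entropy, which is not 0
print("H(P, P) =", cross_entropy(p, p))
```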

__Word Embeddings__
https://ai.stackexchange.com/questions/18634/what-are-the-main-differences-between-skip-gram-and-continuous-bag-of-words

__Hyperparameters : Deep Learning__


Dropout, early stopping, epochs, learning rate, batch size, number of layers,
activation function, optimizer, weight decay

__Types of Recommendation Systems__


Collaborative filtering: based on gathering and analyzing data on users' behavior.
This includes a user's online activities, and it predicts what they will like based
on their similarity with other users (user/user collaborative filtering).
Content-based filtering: based on the description of a product and a profile of the
user's preferred choices. In this recommendation system, products are described
using keywords, and a user profile is built to express the kind of items this user
likes.
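
A toy sketch contrasting the two approaches. The users, items, ratings, keywords,
and function names are all hypothetical. Cosine similarity and Jaccard overlap are
one possible choice of similarity measure, not the only one.

```python
import math

# Hypothetical data: user -> {item: rating}, item -> keywords
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "E": 4},
}
item_keywords = {
    "A": {"sci-fi", "space"},
    "B": {"sci-fi", "robots"},
    "C": {"romance", "drama"},
    "D": {"sci-fi", "space", "robots"},
    "E": {"drama", "history"},
}

def cosine(u, v):
    # Cosine similarity between two users' rating vectors
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend_collaborative(user):
    # User/user: find the most similar user, suggest their best unseen item
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    _, nearest = max(others)
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return max(unseen, key=unseen.get) if unseen else None

def recommend_content(user, liked_threshold=4):
    # Content-based: build a keyword profile from liked items,
    # then pick the unseen item whose keywords overlap it most (Jaccard)
    liked = [i for i, r in ratings[user].items() if r >= liked_threshold]
    profile = set().union(*(item_keywords[i] for i in liked))
    unseen = [i for i in item_keywords if i not in ratings[user]]
    def jaccard(item):
        kw = item_keywords[item]
        return len(kw & profile) / len(kw | profile)
    return max(unseen, key=jaccard) if unseen else None

print(recommend_collaborative("ann"))
print(recommend_content("ann"))
```

Collaborative filtering never looks at `item_keywords`, and content-based filtering
never looks at other users' ratings. That difference is the core distinction between
the two approaches.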

__Types of Optimizers__
- Gradient Descent: Gradient Descent is a fundamental optimization algorithm used
in machine learning. It updates the model parameters in the opposite direction of
the gradient of the loss function with respect to the parameters. It iteratively
adjusts the parameters to find the minimum of the loss function, aiming to
optimize the model.
- Stochastic Gradient Descent (SGD): SGD is a variant of Gradient Descent that
computes the gradient on a randomly selected training example, or more commonly in
practice on a small random subset (mini-batch), instead of the full dataset. Each
update is noisier, but far cheaper to compute for large datasets.
- AdaGrad (Adaptive Gradient Algorithm): AdaGrad adapts the learning rate for
each parameter by dividing it by the square root of that parameter's accumulated
squared gradients. Parameters tied to infrequent features therefore keep a
relatively large effective learning rate, while those tied to frequent features see
it shrink faster. AdaGrad is effective in handling sparse data and has been widely
used in natural language processing tasks.
- RMSprop (Root Mean Square Propagation): RMSprop is an optimizer that
addresses the limitations of AdaGrad by maintaining an exponentially decaying
average of past squared gradients. It divides the learning rate by the root mean
square (RMS) of the past gradients, allowing for more stable and adaptive updates.
- Adam (Adaptive Moment Estimation): Adam is an adaptive optimization algorithm
that combines momentum with RMSprop-style gradient scaling (its authors describe it
as combining the advantages of AdaGrad and RMSprop). It adapts the learning rate
for each parameter using bias-corrected running estimates of the first moment
(mean) and second moment (uncentered variance) of the gradients. Adam is widely
used in deep learning due to its efficiency and effectiveness in optimizing complex
models.
- Adadelta: Adadelta is an extension of AdaGrad that addresses its rapidly
decreasing learning rate. It replaces the accumulation of all past squared gradients
with an exponentially decaying average, which allows for adaptive learning rates
without explicitly setting a global learning rate.
- Momentum: Momentum is an optimizer that adds a fraction of the previous
update to the current update, effectively creating momentum in the optimization
process. It helps accelerate convergence by dampening oscillations and speeding
up convergence along shallow directions in the optimization landscape.
- Nesterov Accelerated Gradient (NAG): NAG is an extension of the momentum
optimizer that calculates the gradient at the look-ahead position the momentum term
is about to move the parameters to, rather than at the current position. This
correction reduces oscillations and overshooting, and often allows faster
convergence than plain momentum.
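
The update rules for several of these optimizers can be sketched on a toy
one-parameter loss. The loss f(w) = (w - 3)^2, the hyperparameter values, and the
function names are assumptions chosen for illustration. The gradient here is exact,
so the stochastic part of SGD (sampling a mini-batch) is not shown. Adadelta and NAG
are left out to keep the sketch short.

```python
import math

def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, minimised at w = 3
    return 2.0 * (w - 3.0)

def gradient_descent(w, g, s, lr=0.1):
    # Step against the gradient
    return w - lr * g

def momentum(w, g, s, lr=0.1, beta=0.9):
    # Keep a fraction beta of the previous update direction
    s["v"] = beta * s.get("v", 0.0) + g
    return w - lr * s["v"]

def adagrad(w, g, s, lr=0.5, eps=1e-8):
    # Accumulate ALL past squared gradients; effective step only shrinks
    s["G"] = s.get("G", 0.0) + g * g
    return w - lr * g / (math.sqrt(s["G"]) + eps)

def rmsprop(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients instead of a sum
    s["v"] = rho * s.get("v", 0.0) + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(s["v"]) + eps)

def adam(w, g, s, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # First moment (mean) and second moment (uncentered variance) estimates
    t = s["t"] = s.get("t", 0) + 1
    s["m"] = b1 * s.get("m", 0.0) + (1 - b1) * g
    s["v"] = b2 * s.get("v", 0.0) + (1 - b2) * g * g
    m_hat = s["m"] / (1 - b1 ** t)   # bias correction for the zero init
    v_hat = s["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def run(step, n_steps=1000, w=0.0):
    # Apply one optimizer's update rule repeatedly from the same start
    state = {}
    for _ in range(n_steps):
        w = step(w, grad(w), state)
    return w

optimizers = (gradient_descent, momentum, adagrad, rmsprop, adam)
results = {f.__name__: run(f) for f in optimizers}
for name, w in results.items():
    print(f"{name:17s} final w = {w:.4f}")
```

Each function takes the current parameter, its gradient, and a per-optimizer state
dictionary, so the only difference between the optimizers is how they turn the
gradient history into a step.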

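Adam, RMSprop, AdaGrad and Stochastic Gradient Descent (SGD) are all optimizers.
Plain SGD often converges the slowest of these because it uses a single fixed
learning rate for every parameter, although this depends on the problem and tuning.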

AdaBoost, despite the similar name, is an ensemble (boosting) algorithm, not an
optimizer.

__Weight Initialization Techniques__


- Zero initialization: This technique initializes all the weights to zero. However,
using this technique can lead to symmetry among the neurons, causing all the
neurons in a layer to update in the same way during training. Consequently, it is not
typically recommended for most scenarios.
- Random initialization: Random initialization involves assigning random values to
the weights from a specified distribution. The most common approach is to
sample the weights from a Gaussian distribution with zero mean and a small
standard deviation. This technique helps break the symmetry among the neurons
and promotes diverse updates during training.
- Xavier/Glorot initialization: Xavier initialization is designed to address the
vanishing/exploding gradient problem in deep neural networks. It initializes the
weights using a distribution with zero mean and a variance calculated based on the
number of inputs and outputs of a layer, 2 / (fan_in + fan_out). It provides a
balanced initialization for the weights and is commonly used with activation
functions like tanh or sigmoid.
- He initialization: He initialization is similar to Xavier initialization but is
specifically designed for ReLU (Rectified Linear Unit) and its variants, which zero
out about half of their inputs and so need a larger weight variance to keep the
signal from shrinking. It initializes the weights using a distribution with zero
mean and a variance calculated based on the number of inputs to the layer,
2 / fan_in.
- LeCun initialization: LeCun initialization was originally proposed for networks
that use tanh-like activation functions. It initializes the weights using a
zero-mean distribution (uniform or normal variants exist) with variance 1 / fan_in,
so the range takes into account the number of inputs to the layer.
- Orthogonal initialization: Orthogonal initialization initializes the weights as
orthogonal matrices. It ensures that the weight vectors are orthogonal to each
other, which can help prevent vanishing or exploding gradients and improve training
stability, especially in recurrent neural networks (RNNs).
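
A pure-Python sketch of the schemes listed above, using the normal-distribution
variants. The function names, layer sizes, and the 0.01 standard deviation for plain
random initialization are assumptions. Deep learning libraries ship their own
versions of these initializers, and the Gram-Schmidt routine here is only one way to
build an orthogonal matrix.

```python
import math
import random

rng = random.Random(0)

def _gaussian(fan_in, fan_out, std):
    # fan_out rows of fan_in weights drawn from N(0, std^2)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def zeros(fan_in, fan_out):
    # All neurons identical: they receive identical updates (symmetry)
    return [[0.0] * fan_in for _ in range(fan_out)]

def random_normal(fan_in, fan_out, std=0.01):
    return _gaussian(fan_in, fan_out, std)

def xavier_normal(fan_in, fan_out):
    return _gaussian(fan_in, fan_out, math.sqrt(2.0 / (fan_in + fan_out)))

def he_normal(fan_in, fan_out):
    return _gaussian(fan_in, fan_out, math.sqrt(2.0 / fan_in))

def lecun_normal(fan_in, fan_out):
    return _gaussian(fan_in, fan_out, math.sqrt(1.0 / fan_in))

def orthogonal(n):
    # Gram-Schmidt on random Gaussian rows gives an n x n orthogonal matrix
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # remove the component along each earlier row
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def variance(w):
    # Empirical variance over all entries of a weight matrix
    flat = [x for row in w for x in row]
    mean = sum(flat) / len(flat)
    return sum((x - mean) ** 2 for x in flat) / len(flat)

fan_in, fan_out = 256, 128
targets = {
    xavier_normal: 2.0 / (fan_in + fan_out),
    he_normal: 2.0 / fan_in,
    lecun_normal: 1.0 / fan_in,
}
for init, target in targets.items():
    measured = variance(init(fan_in, fan_out))
    print(f"{init.__name__:14s} target var = {target:.5f}  measured = {measured:.5f}")
```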
