MLT Notes
• Hence, one of the techniques used for optimization is (batch) gradient descent, where the weight
vector is updated over multiple short iterations. This can be mathematically represented as
w[t+1] = w[t] − α ∂J/∂w
In the above equation, [t+1] is the iteration that follows [t], α is the learning rate, and ∂J/∂w is the
partial derivative of the loss with respect to the weight vector.
• To speed up (batch) gradient descent, we use mini-batch gradient descent, where each update works
on a small set of examples instead of the entire dataset. Batch size is chosen to be a power of 2, to
make efficient use of memory and hardware during computations. In the case of stochastic gradient
descent (SGD), we use a batch size of 1. A sketch of mini-batch gradient descent is shown below.
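As a concrete illustration of the update above, here is a minimal NumPy sketch of mini-batch gradient
descent for linear regression with squared-error loss (the function name, batch size and learning rate
are illustrative assumptions, not from the notes):

    import numpy as np

    def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
        """Mini-batch gradient descent for linear regression (squared-error loss)."""
        n, m = X.shape
        w = np.zeros(m)                      # start from an arbitrary (zero) weight vector
        for _ in range(epochs):
            idx = np.random.permutation(n)   # shuffle the examples each epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                grad = Xb.T @ (Xb @ w - yb) / len(batch)   # dJ/dw on the mini-batch
                w -= lr * grad               # w[t+1] = w[t] - alpha * dJ/dw
        return w

With batch_size=1 this reduces to stochastic gradient descent; with batch_size=n it is plain batch
gradient descent.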
• Learning curves are used to measure the efficacy of the learning process. A learning curve plots
iterations on the X-axis and loss on the Y-axis, and we expect the loss to come down steeply. If the loss
isn't reducing (but increasing), reduce the learning rate α and retry. If the loss is reducing, but not
reducing enough in each iteration, increase the learning rate α and retry. Look at learning curves
plotted for different values of α in a linear regression problem.
Note that in the first case (α=0.01), the loss reduces but not as quickly as in the second case (α=0.1).
When α is too high, say 0.5, the loss increases instead.
• When training and validation loss are low, it’s the right fit. When training loss is low, and
validation loss is high, it’s an overfit. When the training and validation loss are high, it’s an underfit.
• Overfitting can be avoided by learning from more data, or by using regularization (a penalty).
• Underfitting can be avoided by increasing the model capacity, e.g. by including polynomial
features, or by reducing the regularization rate.
• Efficacy of the model can be measured by precision, recall, F1 score, AUC-ROC, AUC-PR.
Accuracy isn’t the best metric for measuring efficacy because it won’t work well when there’s class
imbalance.
• Confusion matrix is used to calculate the above-mentioned metrics
• Precision = TP / (TP + FP); Recall = TP / (TP + FN); Accuracy = (TP + TN) / (TP + TN + FN + FP)
• F1-Score = 2 * Precision * Recall/(Precision + Recall)
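These formulas can be sanity-checked against sklearn's implementations (the labels below are made up
purely for illustration):

    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # made-up ground-truth labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]    # made-up predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # the hand-computed values agree with sklearn's metric functions
    print(precision, precision_score(y_true, y_pred))   # 0.75 0.75
    print(recall, recall_score(y_true, y_pred))         # 0.75 0.75
    print(f1, f1_score(y_true, y_pred))                 # 0.75 0.75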
• PR curve can be produced by plotting the precision-recall values at different thresholds of
probability.
• A typical PR curve looks like this
• The PR curve for a no-skill/random classifier with no class imbalance looks like this (a horizontal
line at 0.5). The PR curve for a classifier with a 70-30 class imbalance has the line at 0.7 instead.
• AUC-PR is calculated by computing the total area under the PR curve. This is the preferred
metric, when there’s moderate to large class imbalance. In the ideal case, this area should be 1.
• ROC is plotted with False Positive Rate (FPR) on X-axis and True Positive Rate (TPR) on Y-axis, at
different thresholds. Here’s how it looks.
NOTE: TPR is equal to Recall.
• ROC curve for an ideal classifier looks like this
• ROC curve for a no-skills/random classifier looks like this, and has an area of 0.5. Anything
below this area is worse than the random classifier. AUC-ROC is preferred as a metric when there’s no
class imbalance.
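Both curves and the areas under them can be obtained from sklearn; a minimal sketch, assuming we
train a simple classifier on a synthetic imbalanced dataset (all names and parameters here are
illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (precision_recall_curve, roc_curve,
                                 average_precision_score, roc_auc_score)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)  # 70-30 imbalance
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    precision, recall, _ = precision_recall_curve(y_te, probs)   # PR points at different thresholds
    fpr, tpr, _ = roc_curve(y_te, probs)                         # ROC points at different thresholds
    print("AUC-PR :", average_precision_score(y_te, probs))
    print("AUC-ROC:", roc_auc_score(y_te, probs))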
Week2
• In the case of linear regression, the input is an (n x m) feature matrix comprising n examples, each
with m features. The label vector is (n x 1), where each label is a scalar.
• The linear regression model is represented mathematically as y = w0 + w1x1 + w2x2 + … + wmxm,
where w0, w1…wm are the weights, and x1, x2…xm are the features. Since no feature is associated with
w0, we add a dummy feature x0 with value 1. This lets us write the model compactly as y = wᵀx.
• The partial derivative of the loss is ∂J/∂w = Xᵀ(Xw − y). In the case of the normal equation (fit),
we equate this to zero, which gives w = (XᵀX)⁻¹Xᵀy.
• In the case of gradient descent, we start with an arbitrary weight vector, and keep updating the
weights over multiple iterations (epochs). The weight update rule is w := w − α Xᵀ(Xw − y). Both fits
are sketched below.
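A minimal NumPy sketch contrasting the two fits on the same synthetic data; the data and
hyperparameters are illustrative:

    import numpy as np

    X = np.hstack([np.ones((100, 1)), np.random.rand(100, 2)])   # n x (m+1), with dummy feature x0 = 1
    y = X @ np.array([3.0, 2.0, -1.0]) + 0.1 * np.random.randn(100)

    # Normal equation: w = (X^T X)^(-1) X^T y
    w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

    # Gradient descent: w := w - alpha * dJ/dw, with dJ/dw = X^T (Xw - y)
    w = np.zeros(X.shape[1])
    alpha = 0.5
    for _ in range(2000):
        w -= alpha * X.T @ (X @ w - y) / len(y)

    print(w_normal, w)   # the two estimates should be close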
Week3
• When you fit non-linear data, say, generated by sin(2πx), using a linear model, it underfits. In this
case, we need to make a polynomial transformation of the single input feature x up to the kth degree,
generating new features x², x³, …, xᵏ. Note that k is an arbitrary number. Once we have these k
features, we use them to fit a linear model. This closely approximates a polynomial fit on the single
original feature.
• Denoting these transformations of the input feature by φ, we can mathematically represent the
model as y = wᵀφ(x).
• In order to find the number of features after applying a 4th degree polynomial transformation to
n data samples with 5 features: the number of polynomial terms of degree at most d over m features is
C(m + d, d); here that is C(5 + 4, 4) = 126 features (including the bias term), as verified below.
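This count can be checked with sklearn's PolynomialFeatures (the toy array below is illustrative):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.random.rand(10, 5)                        # 10 samples, 5 features
    phi = PolynomialFeatures(degree=4).fit_transform(X)
    print(phi.shape[1])                              # 126 = C(5 + 4, 4), including the bias term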
• RMSE vs degree plots are typically used to identify the overfit/underfit tendency. Thus, in the
following graph, the model tends to overfit beyond degree 6.
• In the case of higher degree polynomials, the weights tend to be arbitrarily large for the
features. This is a symptom of overfit. This is also called high variance problem.
• In order to avoid overfit, use larger training sets, or use regularization to control the model
complexity.
• Regularization process adds a penalty to the loss, leading to a change in its derivative, and hence
the weight update.
• Regularization is controlled by two terms – the penalty function (function of weight vector) and
the regularization rate.
• 3 types of regularization are used in the industry – L1 (Lasso), L2 (Ridge), and a combination of
L1 and L2 (Elastic Net).
• Ridge regularization uses the second norm of the weight vector as the penalty, and the loss takes
the following vectorized form: J(w) = (1/2)(Xw − y)ᵀ(Xw − y) + (λ/2)wᵀw. As before, we start with an
arbitrary weight vector and keep updating the weights over multiple iterations (epochs). The weight
update rule is w := w − α(Xᵀ(Xw − y) + λw).
• Lasso regression uses the L1 penalty λ||w||₁ and assigns sparse weights: it assigns non-zero
weights only to important features and zero weights to unimportant features.
• Since the L1 penalty is not differentiable at zero, we use special methods (from the sklearn
library) to carry out this optimization, as sketched below.
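A minimal sklearn sketch of the regularized fits mentioned above (the synthetic data and regularization
rates are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = np.random.rand(50, 10)
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.05 * np.random.randn(50)   # only the first two features matter

    ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty: shrinks weights, none exactly zero
    lasso = Lasso(alpha=0.05).fit(X, y)        # L1 penalty: drives unimportant weights to exactly zero
    enet = ElasticNet(alpha=0.05).fit(X, y)    # combination of L1 and L2

    print(np.round(lasso.coef_, 3))            # expect (near-)zero weights for features 2..9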
Week4
• Class label is a discrete quantity, unlike a real number in the regression setup.
• Under single-label classification (aka binary/multi-class classification), each example is assigned
a single label, whereas under multi-label classification, each example can be assigned multiple labels.
Note that multi-class, multi-label problems are typically referred to simply as multi-label problems.
• Classification methods include
a. Discriminant functions that learn a direct mapping between the feature matrix x and labels.
b. Generative models that model the class priors and class-conditional distributions, use Bayes
theorem to obtain the conditional probability P(y|x), and assign labels based on that.
c. Discriminative models that directly model the conditional probability P(y|x) with parameters,
and assign labels based on that.
d. Instance-based models that compare training and test examples, and assign class labels
based on certain measures of similarity.
• Multi-class (single-label) and multi-label classification use one-hot encoding to represent the
classes assigned to an example. While in the former case exactly one entry in each row is 1, in the
latter multiple entries in each row can be 1. In the case of binary classification, the classes are
represented as -1 and +1 and no encoding is used.
• In the case of binary-class classification, feature matrix X is represented by (n x m) matrix,
weights by (m x 1) vector and labels by (n x 1) vector. In the case of multi-class (and multi-label)
classification (with k classes), feature matrix X is represented by (n x m) matrix, weights by (m x k) matrix
and labels by (n x k) matrix.
• Discriminant function has the same mathematical representation as the linear regression y = w0
+ wTx, where w0 is the bias. It has the geometry of a hyper-plane with (m-1) dimensions, where m is the
number of features.
• The decision boundary between class-0 and class-1 is y = 0. When y > 0, the point belongs to
class-1, else it belongs to class-0.
• The weight vector is orthogonal to every vector lying within the decision surface; hence it
determines the orientation of the decision surface. The bias determines the location of the decision
surface: its perpendicular distance from the origin is −w0/||w||, and y(x)/||w|| is the signed
perpendicular distance of the point x from the decision surface. This is represented in the figure below.
• When there are multiple classes (>2), we could use One-Vs-rest (where we build k-1
discriminant functions) or one-Vs-one (where we build kC2 discriminant functions), both of which have
their faults.
(One Vs Rest) (One Vs One)
• Hence, we develop a single k-class discriminant function that carries the form
yk(x) = wk0 + wkᵀx (note that subscript 0 represents the bias). In the case of the k-class discriminant
function, classification is done by picking the class with the largest discriminant value. Thus, label yk
is assigned to an example if yk > yj for all j ≠ k.
• To learn parameters of model, we may use Least Squares Classification or Perceptron.
• Calculation of the loss, optimization and the weight update rule in the case of Least Squares
Classification remain the same as in Linear Regression, which computes J(w) =
(1/2)(Xw − y)ᵀ(Xw − y). For the Perceptron, the losses of the misclassified examples are summed up to
get the cumulative loss; the perceptron loss function is not differentiable everywhere and can be
geometrically represented as shown in the figure.
• In the misclassified region, loss is a linear function of the weight vector, and can be reduced by
reducing the weight vector. In the correctly classified region, leave the weight vector unchanged.
• While linearly separable examples lead to zero loss ultimately and convergence of the algorithm,
non-separable examples lead to oscillating loss and hence don’t converge.
• Similar to the loss, the weight update rule for each example can be represented as
w := w + α(y(i) − ŷ(i))x(i), and the per-example updates must be summed up to get the cumulative
weight update. This is similar to the weight update rule in all models where weights are updated with
the product of (α, error, feature). A sketch is given below.
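A minimal NumPy sketch of this perceptron-style update, using 0/1 labels so that the error term (y − ŷ)
matches the (α, error, feature) pattern; this is an illustrative sketch, not the exact course code:

    import numpy as np

    def perceptron_fit(X, y, alpha=1.0, epochs=10):
        """X: (n, m) with dummy feature x0 = 1 in column 0; y: labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                y_hat = 1 if x_i @ w >= 0 else 0     # threshold the discriminant function
                w += alpha * (y_i - y_hat) * x_i     # update only when misclassified (error != 0)
        return w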
• Models learnt in week1-4 can be summarized using the following table:
Week5
• Logistic Regression is an important building block in the construction of Neural Networks.
• It is a discriminative classifier, and can be applied for binary/multi-class/multi-label
classifications.
• It obtains probability of sample belonging to a specific class by computing sigmoid (aka logistic
function) of linear combination of features.
• The training data can be mathematically represented as
D = {X, y} (binary classification),
where X is the (n x m) matrix containing training samples, y is an (n, ) size label vector, and y(i) is a scalar,
or
D = {X, Y} (multi-class/multi-label classification),
where X is the (n x m) matrix containing training samples, Y is the (n x k) label matrix and y(i) is a (k, ) size
label vector.
• In the case of binary classification, for each feature vector x, the probability of the label y
belonging to class-1 is given by P(y=1|x) = g(wᵀx) = 1/(1 + e^(−wᵀx)), where g represents the sigmoid
function.
• In the case of samples that are not linearly separable, we use P(y=1|x) = g(wᵀφ(x)) instead. Here
φ represents a transformation of the features, say polynomial. This leads to many more parameters,
and thus more #weights.
• Negative log likelihood (cross-entropy loss) is often used to represent the loss, and written as
J(w) = −Σ_i [y(i) log ŷ(i) + (1 − y(i)) log(1 − ŷ(i))], where ŷ(i) = g(wᵀx(i)).
• With L2 regularization, the binary cross-entropy loss is represented as
J(w) = −Σ_i [y(i) log ŷ(i) + (1 − y(i)) log(1 − ŷ(i))] + (λ/2)||w||².
Note that the weight vector combines the individual wjs (one for each feature).
• The gradient-descent update for this loss can be vectorized as w := w − α Xᵀ(g(Xw) − y).
NOTE: This is similar to the weight update rule in all models where weights are updated with the
product of (α, error, feature).
• Inference can be represented mathematically as ŷ = 1 if g(wᵀx) ≥ 0.5 (equivalently, wᵀx ≥ 0), and
ŷ = 0 otherwise. A sketch putting these pieces together is given below.
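Putting the pieces of this week together, a minimal NumPy sketch of binary logistic regression trained
by gradient descent (without regularization; all names and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_fit(X, y, alpha=0.1, epochs=1000):
        """X: (n, m) feature matrix (with a dummy feature for the bias); y: (n,) labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            y_hat = sigmoid(X @ w)                 # P(y = 1 | x) for every sample
            grad = X.T @ (y_hat - y) / len(y)      # gradient of the cross-entropy loss
            w -= alpha * grad                      # update with the product of (alpha, error, feature)
        return w

    def logistic_predict(X, w, threshold=0.5):
        return (sigmoid(X @ w) >= threshold).astype(int)   # inference: threshold the probability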
Week6
• Naïve Bayes classifier is a generative counter-part of logistic regression.
• It uses Bayes theorem to calculate the probability of a sample belonging to a class, while
assuming that the features are conditionally independent given the label. The training data is
mathematically represented as
D = {X, y} (binary classification),
where X is the (n x m) matrix containing training samples, y is an (n, ) size label vector, and y(i) is a scalar,
or
D = {X, Y} (multi-class/multi-label classification),
where X is the (n x m) matrix containing training samples, Y is the (n x k) label matrix and y(i) is a (k, ) size
label vector.
NOTE: This is the same as in the case with logistic regression
• The probability that the sample belongs to class yc is given by
P(yc|x) = [p(yc) Π_j p(xj|yc)] / [Σ_r p(yr) Π_j p(xj|yr)].
NOTE: xj represents each feature in the input sample x; yr represents each of the k classes. Also note
that there are k prior probabilities represented as p(yr), and m conditional densities per class (with k
such classes) represented as p(xj|yr). The numerator is one of the k terms appearing in the
denominator.
• Sum of all prior probabilities is equal to 1.
• The parameters of the model depend on the distribution used to model the features.
a. If the features are binary, we'll use the Bernoulli distribution, in which the per-class mean of
each feature is the parameter.
b. If each feature takes one of e discrete values, we'll use the Categorical distribution, in which
the per-class probability of each of the e values is the parameter.
c. If the features are continuous, we'll use the Normal distribution, in which the per-class mean
and variance of each feature are the parameters.
• In order to calculate the loss, first compute the likelihood of the observed data given the
parameters w: L(w) = Π_{i=1..n} p(x(i), y(i); w).
Taking the log on both sides, we have the log likelihood log L(w) = Σ_{i=1..n} log p(x(i), y(i); w).
• Thus, the parameters are estimated by maximizing this log likelihood (equivalently, minimizing
the negative log likelihood as the loss).
• The prior is the ratio of the number of training labels that match the given class to the total
number of training examples, represented mathematically as p(yc) = nc / n, where nc is the number of
examples with label c. The per-class, per-feature estimates are
a. Mean: μjc = (1/nc) Σ_{i: y(i)=c} xj(i)
b. Variance: σ²jc = (1/nc) Σ_{i: y(i)=c} (xj(i) − μjc)²
Bernoulli NB
Posterior = (Prior * Likelihood) / Evidence
Explanation:
The likelihood is given as p(x|y=c) = Π_j pjc^(xj) (1 − pjc)^(1 − xj), where pjc is the per-class mean of
feature j.
So, the posterior probability is p(y=c|x) ∝ p(y=c) · Π_j pjc^(xj) (1 − pjc)^(1 − xj).
Thus, the predicted class is the one that maximizes this product of prior and likelihood.
The first part of the equation is the fraction of examples with label c among all training examples.
In code, the first part is self.w_priors calculated by method fit; the second part of the equation is
calculated by method _calc_pdf.
Gaussian NB
Posterior = (Prior * Likelihood) / Evidence
Explanation:
The likelihood is given as p(xj|y=c) = (1/√(2πσ²jc)) exp(−(xj − μjc)² / (2σ²jc)), taken as a product over
the features j.
Thus, the posterior probability is p(y=c|x) ∝ p(y=c) · Π_j p(xj|y=c).
The first part of the equation is the fraction of examples with label c among all training examples.
In code, the first part is self._priors calculated by method fit; the second part of the equation is
calculated by method _calc_pdf. A sketch of such a class is shown below.
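A minimal sketch of how such a class might be organised. The method names fit and _calc_pdf and the
attribute self._priors follow the notes; the class name and everything else (the log-space computation,
the small variance-smoothing term) are illustrative assumptions:

    import numpy as np

    class SimpleGaussianNB:
        def fit(self, X, y):
            self._classes = np.unique(y)
            # prior p(y = c): fraction of training examples with label c
            self._priors = np.array([np.mean(y == c) for c in self._classes])
            # per-class, per-feature mean and variance
            self._mu = np.array([X[y == c].mean(axis=0) for c in self._classes])
            self._var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self._classes])
            return self

        def _calc_pdf(self, X, c):
            # log of the product over features of N(x_j; mu_jc, var_jc)
            mu, var = self._mu[c], self._var[c]
            log_pdf = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
            return log_pdf.sum(axis=1)

        def predict(self, X):
            # posterior (up to the evidence term) = prior * likelihood, computed in log space
            log_post = np.array([np.log(self._priors[c]) + self._calc_pdf(X, c)
                                 for c in range(len(self._classes))])
            return self._classes[np.argmax(log_post, axis=0)]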
Week7
• Regression and Classification models are special cases of a broader family of models called
Generalized Linear Models (GLMs), whose distributions can be represented mathematically as
p(y; η) = b(y) exp(ηᵀT(y) − a(η)).
In the above formula, η is the natural or canonical parameter of the distribution, T(y) is the sufficient
statistic (in most cases, it's equal to y) and a(η) is the log partition function. The quantity e^(−a(η))
essentially plays the role of a normalization constant, making sure that the distribution
sums/integrates over y to 1.
• Many distributions, including the Bernoulli, Gaussian, Multinomial, Poisson, Gamma, Exponential,
Beta and Dirichlet distributions, can be written in the above form with appropriate choices of the
parameters.
• To derive a GLM for the above distribution, we make the following assumptions or design choices:
a. y | x ~ ExponentialFamily(η);
b. the prediction is h(x) = E[T(y) | x], where E represents the expected value of the sufficient
statistic;
c. a linear relationship between η and x: η = wᵀx.
For example, in the case of ordinary least squares modelled as a Gaussian distribution,
h(x) = E[y | x] = μ, where μ = η = wᵀx.
• K-Nearest Neighbours (K-NN) is an instance-based model that predicts using the k closest training
points. In the above picture (with k = 3), for P1, 2 out of 3 neighbors are red, therefore it is predicted
to be in class Red. Similarly, for P2, 2 out of 3 neighbors are green, therefore it is predicted to be in
class Green.
• For the classification task, the neighbors take part in voting. The class that receives the highest
number of votes is the predicted class.
• For regression task, the output/prediction is calculated as average of the outputs/labels of k
neighbors.
• If #neighbors k is too small, the decision boundary will be jagged and will result in overfit. As k
increases, the decision boundary gets smoother.
• If #neighbors k is too large, the model will be biased and will result in underfit.
• As #neighbors k comes close to total number of points in the dataset, the model will predict
label of majority class for all samples.
• Advantages of KNN model:
a. Quite easy to understand and implement the algorithm.
b. The output of a prediction can be explained based on its neighbors. This adds to
interpretability of the K-NN model.
• Disadvantages of KNN model:
a. For large training set, K-NN can be time consuming, since all computations are performed at
runtime.
b. K-NN is sensitive to redundant or irrelevant features since all features are used to compute
distance between two points.
c. On significantly difficult tasks, it can be outperformed by other techniques such as SVMs and
Neural Networks.
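A minimal sklearn sketch of K-NN classification (the dataset and k = 3 are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=3)     # k = 3 neighbors vote on the class
    knn.fit(X_tr, y_tr)                           # "training" essentially just stores the data
    print(knn.score(X_te, y_te))                  # distance computations happen at prediction time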
Week8
• SVM is a supervised ML algorithm that can be used for both classification and regression tasks
• It’s a discriminative classifier (like perceptron and logistic regression) and works for both binary
and multi-class/multi-label classification.
• The model is represented as h(x) = sign(wᵀx + b); here the label y is assumed to be +1 or -1.
• The separating hyperplane (in red) is the classifier. It’s at equal distance from both classes and
separates them such that there’s maximum margin between the two classes.
Let’s look initially at hard-margin SVM, wherein the classes are assumed to be linearly separable.
• Bounding planes are parallel to the separating hyperplane on either side of it and pass through
the support vectors.
• Support vectors are subset of training points closer to the separating hyperplane and influence
its position and orientation. All support vectors are points on the bounding planes.
• The bounding planes can be represented mathematically as wᵀx + b = +1 and wᵀx + b = −1.
• In the case of soft-margin SVM, we introduce a slack variable ξi ≥ 0 for each training point, such
that y(i)(wᵀx(i) + b) ≥ 1 − ξi. The constraints are now less strict because each training point need only
be at a distance of (1 − ξi) from the separating hyperplane instead of a hard distance of 1. The slack
allows the input to be closer to the hyperplane or even be on the wrong side.
• In order to prevent the slack variables becoming too large, we penalize them in the objective
function, giving the primal problem min (1/2)||w||² + C Σ_i ξi subject to y(i)(wᵀx(i) + b) ≥ 1 − ξi, ξi ≥ 0
(usually optimized via its dual). Eliminating the slack variables, the objective can equivalently be
written as min (1/2)||w||² + C Σ_i max(0, 1 − y(i)(wᵀx(i) + b)). The second term in this formula is called
the hinge loss. Hinge loss is represented graphically as follows.
NOTE: The X-axis represents the margin y(wᵀx + b) and the Y-axis represents the misclassification cost
(loss).
• Thus, taking the sub-gradient of this objective, the update rules are
w := w − α(w − C Σ_{i∈M} y(i)x(i)) and b := b + α C Σ_{i∈M} y(i),
where M is the set of points with margin y(i)(wᵀx(i) + b) < 1.
• In the case of non-linearly separable data, kernel SVMs simplify computation: we don't have to
perform explicit transformations, but instead compute dot products between inputs in a higher
dimensional feature space implicitly, using special functions called kernel functions.
• The following kernel functions are typically used (see the sketch after this list):
a. Linear Kernel - K(x, x′) = xᵀx′
b. Polynomial Kernel - K(x, x′) = (xᵀx′ + c)^d
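A minimal sklearn sketch of these kernels on data that is not linearly separable (the dataset, C, degree
and coef0 values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)     # not linearly separable

    linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)              # K(x, x') = x^T x'
    poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)  # K(x, x') = (gamma x^T x' + coef0)^degree

    print(linear_svm.score(X, y), poly_svm.score(X, y))             # compare training accuracy of the two kernels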
Week-9
• Decision trees are non-parametric supervised learning methods that can be used for modelling
classification and regression problems.
• Decision trees partition feature space into a set of rectangles (or cuboids) and then fit a model
on each.
• The training data can be mathematically represented as D = {(x(i), y(i))}, i = 1…n, where y(i) is an
element from a discrete set in the case of a classification problem, and a real number in the case of a
regression problem.
• ID3 (Iterative Dichotomizer 3) is the most common algorithm to solve such problems. It uses a
top-down, greedy search (without backtracking) through a space of possible decision trees.
• This can be pictorially represented as follows.
• In the case of classification, the prediction is the majority label among the training points in the
leaf node, and in the case of regression, the prediction is the sample mean of the labels of the training
points in the leaf node.
• The proportion of data points in node i that belong to class k is represented mathematically as
p_ik = (1/N_i) Σ_{x in node i} I(y = k), and the impurity of the node (e.g. Gini index or entropy) is
computed from these proportions.
• Overfitting in decision trees happens typically when the trees grow bigger. This implies that tree
size is one of the hyperparameters in decision trees.
• How to avoid overfitting?
a. Pre-pruning
a. Stop if the number of samples is less than some user specified threshold.
b. Stop if expanding the current node does not improve impurity.
b. Post-pruning
a. Grow the tree until it reaches a minimum #nodes or #datapoints covered
b. Prune back the tree to reach a balance between the residual error and #nodes. The
pruning criterion is mathematically represented as
C_α(T) = Σ_{τ=1..|T|} Q_τ(T) + α|T|, where |T| is the number of leaf nodes and Q_τ(T) is the
residual error in leaf τ.
NOTE: The regularization parameter α determines the trade-off between the overall residual sum-of-
squares error and the complexity of the model as measured by the number |T| of leaf nodes, and its
value is chosen by cross-validation.
• Large values of α result in smaller trees, and conversely for smaller values of α; a pruning sketch
using sklearn is shown below.
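In sklearn, cost-complexity post-pruning is exposed through the ccp_alpha parameter of decision trees;
a minimal sketch (the dataset and the alpha value are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                     # grown until leaves are pure
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)   # larger alpha => smaller tree

    print(full.get_n_leaves(), pruned.get_n_leaves())       # the pruned tree has fewer leaf nodes |T|
    print(full.score(X_te, y_te), pruned.score(X_te, y_te))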
• Decision trees suffer from the problem of high variance, which can be mitigated by the ensemble
methods dealt with in Week-10.
Week-10
Ensemble methods
• Ensemble methods combine the predictions of multiple models, generally giving better
performance than any individual model.
• Methods include
• Majority Voting
• Bagging – Random Forest is a popular bagging algorithm.
• Boosting - Gradient Boosting, Adaboost, XGBoost
• Two categories of Ensemble learning
a. Voting (among high variance models to prevent outlier predictions and overfitting)
NOTE: In the case of a regression model, replace the classifiers by regressors, and voting among
predictions by average of all predictions.
Here's the bagging algorithm. Let's say there are n datapoints x1 … xn; we form q classifiers, one per
bootstrap. Each bootstrap training set is constructed by sampling with replacement from the original
dataset (keeping n datapoints in each), and the datapoints left out are set aside as the test set. Now,
for a new datapoint x, the prediction is the majority vote (or, for regression, the average) of the q
classifiers' predictions.
• Random forest is a bagging algorithm that uses decision trees. In each of the q classifiers, u out
of the m features are randomly selected, without replacement. Note that only the selected u features
are used for the splits.
• In Random forest you can generate hundreds of trees (say T1, T2, …, Tn) and then aggregate the
results of these trees. Each individual tree is built on a subset of features and observations
(datapoints).
• Random forest has two hyperparameters - #classifiers q and #features u. The algorithm is fairly
insensitive to its hyperparameters: set q as high as possible, and u to either √m (a common choice for
classification) or m/3 (a common choice for regression).
• Boosting is an iterative process, where we build a model on the training dataset, then another to
rectify the errors of the previous one, and so on until the errors are minimized. In particular, we start
with a weak model, and each new model is fit on a modified version of the original dataset.
• Adaptive and Gradient boosting techniques differ mainly regarding how the weights are
updated on the datapoints, and how the classifiers are combined.
• Here's the detailed algorithm for Gradient Boosting (the remaining steps are sketched in code below).
1. Make a first guess for y_train and y_test, using the average of y_train
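The remaining steps are sketched below for a regression setting with squared-error loss, where fitting
to the negative gradient amounts to fitting to the residuals. This is a minimal illustrative
implementation; the learning rate, number of rounds and tree depth are assumptions, not from the notes:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_fit(X_train, y_train, n_rounds=100, lr=0.1, max_depth=2):
        base = np.mean(y_train)                        # 1. first guess: the average of y_train
        pred = np.full(len(y_train), base)
        trees = []
        for _ in range(n_rounds):
            residuals = y_train - pred                 # 2. errors of the current ensemble
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X_train, residuals)  # 3. fit a weak model to the residuals
            pred += lr * tree.predict(X_train)         # 4. update predictions by a small step (learning rate)
            trees.append(tree)
        return base, trees

    def gradient_boost_predict(X, base, trees, lr=0.1):
        return base + lr * sum(tree.predict(X) for tree in trees)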
Week-11
• Clustering is the process of grouping similar samples in the training set into the same cluster.
It's an example of an unsupervised ML algorithm.
• Some of the applications of clustering can be found in customer profiling, anomaly detection,
image segmentation, image compression, geo-statistics, astronomy
• The training data can be mathematically represented as D = {x(1), x(2), …, x(n)} (note that there
are no labels).
• Clustering could follow either of
a. Hard-clustering, where each point in the training set is assigned to one of the k clusters
b. Soft clustering, where each point has a probability of membership to k clusters, such that
the sum of such probabilities is 1.
• k-means clustering is an example of hard-clustering. In this model, each cluster cr (1 <= r <= k) is
represented by its centroid, which is calculated as the average of the points in that cluster;
mathematically, μr = (1/|cr|) Σ_{x ∈ cr} x.
• Each datapoint is assigned to a cluster based on a distance measure (typically Euclidean
distance): x(i) is assigned to the cluster r that minimizes ||x(i) − μr||².
• The loss is represented mathematically as J = Σ_i ||x(i) − μ_{c(i)}||², the sum of squared distances
of the points from their assigned centroids.
• Here's the detailed algorithm used in k-means clustering (a code sketch follows the steps).
1. Start off with k initial (randomly assigned) cluster centers.
2. Assign each point to the closest cluster center.
3. For each cluster, recompute its center as the average of all its assigned points.
4. Repeat above two steps until centroids don't move or certain number of iterations have
been performed.
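A minimal NumPy sketch of these four steps (initialization here is simply k random points from the
data, and the sketch assumes no cluster becomes empty):

    import numpy as np

    def kmeans(X, k, n_iters=100):
        rng = np.random.default_rng(0)
        centers = X[rng.choice(len(X), size=k, replace=False)]        # 1. k initial cluster centers
        for _ in range(n_iters):
            # 2. assign each point to the closest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. recompute each center as the average of its assigned points
            new_centers = np.array([X[labels == r].mean(axis=0) for r in range(k)])
            # 4. stop when the centers no longer move
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels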
• Disadvantages of k-means clustering include
1. Poor performance due to incorrect initialization.
2. Poor clusters when the datapoints don't form spherical (globular) groups.
3. With a large number of data points, the algorithm can be very slow to converge.
4. k is unknown in the beginning; computing it (using the elbow method) could be expensive.
• In the elbow method of computing k, loss is calculated at different values of k. Beyond the
elbow point, the loss doesn’t reduce further significantly. In this specific case, the ideal #clusters k is 3.
• The silhouette coefficient for a sample i is s(i) = (b − a) / max(a, b), where a is the mean of the
distances between sample i and the other samples in its cluster, and b is the minimum over the other
clusters of the mean distance between sample i and the samples in that cluster. Note that this number
is always between -1 and 1 (including both). When the average silhouette coefficient is at its peak, the
corresponding X-value is the ideal #clusters k.
• If the features have a correlation coefficient of 1, the cluster centroids will be in a straight line.
• k-means is extremely sensitive to cluster center initialization.
• Bad initialization can lead to poor convergence speed and bad overall clustering.
Week-12
• A linear combination of the inputs followed by a non-linear activation makes an artificial neuron.
• The following diagram represents a forward pass through an ANN.
• The network has a sequence of layers, including the input layer, the output layer and any number
of hidden layers. The input to the network passes through these successive layers. Each layer
transforms the input from the previous layer with the help of weights and activation functions. The
output from the last layer is the network's prediction.
• Number of layers L and the choice of activation function are hyperparameters.
• At layer l (0 < l <= L), the total number of weights is equal to s_(l-1) * s_l (where s_l is the number
of units in layer l), which can be represented as a matrix W(l).
• In the following ANN with two layers, l − 1 and l, the pre-activations at layer l are represented as
z_k(l) = Σ_j W_jk(l) a_j(l−1) + b_k(l), simply written in vectorized form as z(l) = W(l)ᵀ a(l−1) + b(l); the
activations are then a(l) = g(z(l)). A forward-pass sketch follows.
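A minimal NumPy sketch of the forward pass through such layers (the sigmoid activation and the layer
sizes are illustrative choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_pass(x, weights, biases):
        """weights[l] has shape (s_{l-1}, s_l); biases[l] has shape (s_l,)."""
        a = x                                    # activations of the input layer (layer 0)
        for W, b in zip(weights, biases):
            z = a @ W + b                        # pre-activation: z(l) = W(l)^T a(l-1) + b(l)
            a = sigmoid(z)                       # activation:     a(l) = g(z(l))
        return a                                 # the output of the last layer is the prediction

    # Example: 3 inputs -> 4 hidden units -> 1 output
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 1))]
    biases = [np.zeros(4), np.zeros(1)]
    print(forward_pass(np.array([0.1, 0.2, 0.3]), weights, biases))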