ML 19.03 Sidenotes
Outlier: a data point that does not belong to the range of the rest of your data points.
Which loss function to use? “If the outliers represent anomalies that are
important for business and should be detected, then we should use MSE. On
the other hand, if we believe that the outliers just represent the corrupted
data, then we should choose MAE as loss”.
Quantile loss: models the uncertainty in our predictions, giving a range of predictions instead of “point estimates”, which can significantly improve decision-making processes for many business problems.
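A minimal sketch of the three losses in Python (NumPy assumed; the toy numbers and function names are only illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: large residuals (outliers) dominate the loss.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: more robust to outliers/corrupted points.
    return np.mean(np.abs(y_true - y_pred))

def quantile_loss(y_true, y_pred, q=0.9):
    # Pinball/quantile loss: penalizes under- and over-prediction asymmetrically,
    # so the model predicts the q-th quantile instead of a single point estimate.
    e = y_true - y_pred
    return np.mean(np.maximum(q * e, (q - 1) * e))

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # 100.0 acts as an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), quantile_loss(y_true, y_pred))
```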
Causes for outliers in data:
1. Data entry or measurement errors.
2. Sampling problems and unusual conditions.
3. Natural variation in the data distribution.
Bias-variance tradeoff
Expected answer:
“Bias can be seen as the number of assumptions made in the model and is therefore, to a degree, the inverse of model complexity. It can be estimated from the training error (beyond the irreducible error; linear models, for example, make many strong assumptions, like being asked to draw a 5 cm, 45 deg line on a sheet of paper without the help of a ruler).
Variance is the difference in performance between training and test error. A large variance indicates that the model behaves very differently on unseen data (it can be reduced to a certain extent).
Therefore, a low bias (high-complexity model) together with a high variance (large difference between training and test error) usually indicates overfitting.”
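A small illustration of the idea on assumed toy data: fitting polynomials of increasing degree shows high bias (both errors high) versus high variance (low training error, much higher test error):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20);  y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)
x_test  = np.linspace(0, 1, 200); y_test  = np.sin(2 * np.pi * x_test)  + rng.normal(0, 0.3, 200)

for degree in (1, 3, 15):                      # underfit, reasonable, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    # High bias: both errors high. High variance: low train error, much higher test error.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```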
Chapter 3
● Instances are not represented only by a conjunction of constraints; they can be modeled in other ways, e.g. as squares or rectangles. -- CE (candidate elimination)
● The version space with respect to hypothesis space H and data D contains the hypotheses that are consistent with D, i.e. make no training errors.
● Take ONE training example, check whether the VS accepts or rejects it, and define the error based on this.
● Hypothesis space is defined by the user or the concept learner.
● The Find-S algorithm considers only positive examples and only works on conjunctions of constraints; it does NOT work on anything else (see the sketch after this list).
● Candidate elimination (sensitive to noise) can work with ANY OTHER KIND of hypothesis representation (e.g. squares, rectangles), not only conjunctions of constraints.
● If the training data contains NO ERRORS, it converges consistently. With errors/missing values it can converge to an empty version space.
● Requires noise-free training data; if this condition is not met, convergence takes longer or fails altogether.
● A Boolean decision tree for concept learning -> every Boolean function is evaluated from left to right.
● Overfitting in DTs is caused by: more nodes than the class/target warrants, too many splits, the complexity of the underlying data distribution, and model complexity.
● Application of concept learning in DTs: check which h works best (as an evaluation), i.e. which hypothesis classifies the most test instances correctly compared to all the others. (See the figure below.)
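A minimal sketch of the Find-S idea on conjunctions of attribute constraints (the EnjoySport-style attribute values below are invented for illustration):

```python
# Find-S: start from the most specific hypothesis and generalize it
# just enough to cover each positive example (negatives are ignored).
def find_s(examples):
    hypothesis = None
    for attributes, label in examples:
        if label != "yes":            # Find-S only looks at positive examples
            continue
        if hypothesis is None:
            hypothesis = list(attributes)
        else:
            hypothesis = [h if h == a else "?"   # generalize mismatching constraints
                          for h, a in zip(hypothesis, attributes)]
    return hypothesis

# Toy data: (attributes, label)
data = [
    (("sunny", "warm", "normal", "strong"), "yes"),
    (("sunny", "warm", "high",   "strong"), "yes"),
    (("rainy", "cold", "high",   "strong"), "no"),
]
print(find_s(data))   # -> ['sunny', 'warm', '?', 'strong']
```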
LOOCV vs k-fold CV:
● LOOCV: a special case of k-fold CV with k = N; N-1 instances for training, 1 out of N for testing; convergence time is higher (use when the learning curve is steep).
● k-fold CV: the dataset is grouped into k chunks (k < N); k-1 folds for training, 1 fold for testing; better convergence in comparison to LOOCV.
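A small sketch of the two splitting schemes (only the index bookkeeping; the model fitting/evaluation step is left abstract):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices and split them into k roughly equal folds.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n = 10
for name, k in (("5-fold CV", 5), ("LOOCV (k = N)", n)):
    folds = k_fold_indices(n, k)
    print(name)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Train a model on train_idx, evaluate on test_idx, then average the scores.
        print(f"  fold {i}: train={sorted(train_idx.tolist())} test={sorted(test_idx.tolist())}")
```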
Chapter 4
Min entropy = 0; max entropy means you need to ask many questions to arrive at the right answer, i.e. to correctly distinguish among the classes (max entropy = 1 in the case of binary classification).
● Analogy: in a class where the teacher's questions are always answered by the top student, everyone expects next time that only he will answer, so the entropy is low.
● If most of the values are 6 (even though some 1's are present), the entropy is low.
● A “non-fair” (loaded) die that mostly lands on 6: entropy => low.
● A DT is built greedily; the algorithm does not search the complete solution space.
● The information gain when moving from a fair die to an unfair die is high.
● For multi-class problems, a single (yes/no) question is not enough (entropy > 1).
● Entropy is specific to a single subset; it does not take the complete picture into account.
● ID3 prefers a more general hypothesis. NO backtracking.
● Why do we have this weighted formula in the entropy term? To remove the bias introduced by subsets of different sizes.
● Why do we need IG when we already have entropy? Because we need to account for all possible values of an attribute: IG uses the weighted average entropy across all values of that attribute (see the sketch after this list).
● Why is giving maximum weightage to an attribute with many values a problem? It results in overfitting.
○ The bigger the weightage, the more memory is needed to build the tree.
○ A column with all unique values (majority class).
● When do you stop the ID3 algorithm?
○ Reach pure subset
○ Reach a leaf node (homogeneous node)
○ Design a threshold -> do not split any further
○ When attribute values are the same.
○ Depth of the tree (level 3-4).
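A minimal sketch of entropy and information gain, assuming the standard ID3 formulas and a toy attribute/label set:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels):
    # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)   (weighted average entropy)
    total, n, weighted = entropy(labels), len(labels), 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        weighted += len(subset) / n * entropy(subset)
    return total - weighted

labels  = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sun", "sun", "rain", "rain", "sun", "sun"]
print(entropy(labels))                    # 1.0 bit: maximally uncertain binary split
print(information_gain(outlook, labels))  # how much 'outlook' reduces that uncertainty
```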
The formal definition of overfitting in general:
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h' ∈ H, such that h
has a smaller error than h' over the training examples, but h' has a smaller
error than h over the entire distribution of instances.
Prevention:
● Don't try to fit all examples, stop before the training set is exhausted.
● Fit all examples then prune the resultant tree.
How does one tell if a given tree is one that has overfit the data?
● Extract a validation set not used for training from the training set and
use this to check for overfitting. Usually, the validation set consists of
one-third of the training set, chosen randomly.
● Then use statistical tests, eg. the chi-squared metric, to determine if
changing the tree improves its performance over the validation set.
● A variation of the above is to use MDL to check if modifying the tree
increases its MDL with respect to the validation set.
If we use the validation set to guide pruning, how do we tell that the
tree is not overfitting the validation set?
In this case, we need to extract yet another set called the test set from the
training set and use this for the final check.
The information gain measure has a bias that favors attributes with many values over those with only a few. An extreme case is an attribute with a unique value for every instance (e.g. a date or ID): it separates the training data perfectly, so obviously no other attribute can do better, and the result is a very broad tree of depth 1.
To guard against this, use GainRatio(S, A) instead of Gain(S, A).
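A possible sketch of GainRatio; it reuses information_gain from the entropy sketch above (so it is not standalone), and SplitInformation grows for attributes with many distinct values, which damps their gain:

```python
import numpy as np
from collections import Counter

def split_info(values):
    # SplitInformation(S, A) = -sum_v |S_v|/|S| * log2(|S_v|/|S|)
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(values, labels):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A); an ID-like attribute
    # gets a large SplitInformation, so its GainRatio shrinks.
    return information_gain(values, labels) / split_info(values)   # reuses the earlier sketch
```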
“An overfit DT encodes the exact values needed for a perfect split of the training examples, but it cannot be used for predictions because it satisfies only the training examples and thus captures little general information.”
● Overfitting in DTs is caused by: more nodes than the class/target warrants, too many splits, the complexity of the underlying data distribution, model complexity, and a “perfect split” of the training data.
● Building trees that adapt too much to the training examples may lead
to overfitting.
● Overfitting is more likely with non-parametric and non-linear models
(DT, kNN) that have more flexibility when learning a target function.
Chapter 6
● Perceptron does not overfit because it is not a network of (dense)
layers.
● Learning in perceptrons and neural nets means learning the weights, including the bias.
● A perceptron only learns linear decision boundaries, hence it cannot be used for non-linear problems.
● The weights keep being updated until we get error = 0; only then does learning stop.
● Not able to handle the XOR function (parity problem), because XOR is not linearly separable (see the sketch below).
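A minimal perceptron-rule sketch on toy Boolean data, showing that OR is learned while XOR is not (the learning rate and epoch count are arbitrary):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    # Perceptron rule: w <- w + lr * (target - prediction) * x   (bias folded into w)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0
            w += lr * (target - pred) * xi
    preds = (Xb @ w > 0).astype(int)
    return w, preds

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([0, 1, 1, 1]))[1])  # OR: linearly separable, learned
print(train_perceptron(X, np.array([0, 1, 1, 0]))[1])  # XOR: not linearly separable, fails
```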
Neural Networks:
● Primarily used for classification, but can also be used for regression
by changing the activation to produce a quantity.
● Basic assumption of error: error at output layer is due to the error at
hidden layers -> to minimize this we use the bp algorithm.
● Bp is not essentially part of nnets but a way to compute the
gradients.
● A fully connected nnet is not always necessary (e.g. skip connections).
Leaky relu:
● Helps solve the vanishing gradient problem, by allowing a small
gradient.
● Still a non-linear activation, but also differentiable below 0.
● Sparsity of relu is lost in this version.
● Still suffers from exploding gradient problems.
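A minimal NumPy sketch of leaky ReLU and its gradient (alpha = 0.01 is a common but arbitrary choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # A small slope alpha for x < 0 keeps a non-zero gradient (the "leak"),
    # at the cost of losing the exact-zero sparsity of plain ReLU.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x), leaky_relu_grad(x))
```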
Softmax activation:
● + Able to output more than just binary classification (like sign or
sigmoid) i.e. works with multi-class problems.
● + fully differentiable, but still has non-linear properties (therefore it is a non-linear multinomial classifier).
● + normalized output (between 0 & 1).
● - does not work for multi-label problems (each instance can belong to
more than one class).
● - again not zero-centered, but since it is mostly used at the output layer, this does not matter.
Why is softmax not preferred at the hidden layers but only at the
output layer?
Short answer: softmax at the hidden layer decreases the accuracy and
speed of learning.
Detailed answer: “Variable independence”: if softmax is used at a hidden layer, it keeps the nodes of that layer linearly dependent (the outputs must sum to 1), which leads to poor generalization (low test accuracy). A lot of regularization effort is then required to keep the variables independent, uncorrelated and reasonably sparse.
“Training issues”: the speed of training becomes very slow. With softmax at a hidden layer, all activations of that layer are squashed to small values, so the average activation passed to higher layers is lower, which can increase the error and harm the training phase.
● It reduces the expressive power of your model.
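A minimal NumPy softmax sketch (the logits are made up); it shows the normalized outputs and why they suit multi-class rather than multi-label problems:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; outputs lie in (0, 1) and sum to 1,
    # which also makes the components linearly dependent on each other.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())   # roughly [0.66 0.24 0.10], sums to 1 -> one class wins, not multi-label
```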
tanh activation:
● + very similar to sigmoid function in properties.
● + zero-centered, therefore neutral states can be shown.
● + Changes faster around zero than sigmoid, therefore it has stronger gradients.
● + Output values (range) lie between -1 and +1. Therefore, the mean output is around zero, whereas for sigmoid (more like a logistic/logit curve) the average output is some positive value.
● - computationally more expensive than sigmoid.
● - similar problems as with sigmoid, e.g. the error gradient becomes very small in the saturated regions, so initializations with very large weights are not a good idea; it also suffers from vanishing gradient problems.
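A small sketch comparing sigmoid and tanh gradients (values are illustrative): tanh's gradient peaks at 1.0 versus 0.25 for sigmoid, and both vanish for large |x|:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
sig, th = sigmoid(x), np.tanh(x)
# Gradients: sigmoid' = s(1-s) peaks at 0.25; tanh' = 1 - tanh^2 peaks at 1.0,
# so tanh has stronger gradients, but both saturate (vanish) for large |x|.
print(np.round(sig * (1 - sig), 3))
print(np.round(1 - th ** 2, 3))
```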
Why are sigmoid activations mostly preferred?
● They use exponential functions, and the gradient has a simple form, which helps iterate quickly.
● Note: sigmoid outputs are not zero-centered.
Backpropagation
Hypothesis space of bp: it is the space of all functions that can be
represented by assigning weights to the given, fixed network of
interconnected units.
Neural nets are flexible in how they are used (for explainability, the outputs of a neural net can be combined with a DT).
Why might the learning process get stuck at a low learning rate?
1. When using sigmoid units: gradients might get very small, and when you multiply such a small gradient with a low learning rate the updates become negligible, so the training cannot escape from that point and effectively stalls.
2. General problem: the error function might not have just one local minimum but several, and hence the training might get stuck in one of them.
To counteract learning rates that are too low or too large, we can use a learning rate schedule, or alternatively use momentum.
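A minimal sketch of SGD with momentum plus a simple decaying learning-rate schedule, on an assumed toy quadratic loss:

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=100):
    # Classical momentum: the velocity accumulates past gradients, which helps
    # the update keep moving through flat regions and shallow local minima.
    v = np.zeros_like(w)
    for t in range(steps):
        lr_t = lr / (1 + 0.01 * t)        # simple decaying learning-rate schedule
        v = beta * v - lr_t * grad_fn(w)
        w = w + v
    return w

# Toy quadratic loss L(w) = ||w||^2 with gradient 2w; the minimum is at 0.
print(sgd_momentum(lambda w: 2 * w, np.array([5.0, -3.0])))
```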
Adam optimizer (first and second momentum for the loss function):
● Ideally an optimizer would use both the first and the second derivative of the loss function, but the second derivative is very expensive to compute, so Adam approximates what it needs via moments (moving averages) of the gradient.
● Combines properties of RMSprop + momentum.
● Takes moving averages of the past gradients (first and second moments).
● Oscillates less and is less likely to get stuck.
● Little memory requirements.
● Key idea: provision of different lr for different weights.
● Appropriate for problems with noisy gradients. SGD maintains a
single lr for all weight updates and the lr does not change during
training.
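A minimal sketch of the Adam update rule (standard default hyperparameters; the toy quadratic loss is assumed for illustration):

```python
import numpy as np

def adam(grad_fn, w, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    # First moment m: moving average of gradients (momentum-like).
    # Second moment v: moving average of squared gradients (RMSprop-like);
    # dividing by sqrt(v) gives each weight its own effective learning rate.
    m = np.zeros_like(w); v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)         # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: 2 * w, np.array([5.0, -3.0])))   # minimizes ||w||^2, approaches 0
```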
RNN:
● Used for handling sequential data.
● Long-term dependencies are not captured well.
● Windowing might help by making the previous context available, especially for long-term dependencies.
● Bi-grams and trigrams will NOT help in long-term dependency.
● It works well for smaller sequences but problematic for longer
sequences.
● Training an RNN happens via unfolding the network over time and then applying regular backpropagation (see the sketch at the end of this chapter).
● This creates a very deep network therefore usually suffers from
vanishing gradient problems thus making the history unimportant in
training.
LSTM:
● LSTM tries to solve vanishing gradients by regulating the flow of information from past iterations through “gating”.
● Uses the concept of ‘cell state’ and hence is able to process longer
input sequences.
● Mostly uses tanh and sigmoid activations.
FFNN < RNN < LSTM (in ability to capture sequential / long-term dependencies).
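A minimal sketch of unfolding an RNN forward pass over a toy sequence (weights and sizes are arbitrary); backprop through this unrolled chain is where the vanishing-gradient issue for long sequences comes from:

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy):
    # Unfold the recurrence over time: the same weights are reused at every step,
    # and the hidden state h carries the history forward.
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in x_seq:                      # one step per element of the sequence
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        outputs.append(W_hy @ h)
    return outputs, h

rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
x_seq = [rng.normal(size=3) for _ in range(5)]       # a length-5 input sequence
outputs, h_last = rnn_forward(x_seq, W_xh, W_hh, W_hy)
print(len(outputs), h_last)
# Backprop through this unfolded chain multiplies many Jacobians, which is
# why gradients vanish (or explode) for long sequences.
```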
Chapter 8
Naive Bayes assumption: The features in the dataset are conditionally
independent given the class. (short answer: conditional independence).
● Independent and identically distributed (i.i.d).
● NB determines for each individual instance the most probable class (most similar hypothesis); P(D|h)·P(h) is collectively called the “potential”.
● MAP is computed for the whole dataset, i.e. how likely each hypothesis is given the whole dataset, and we take the maximum over hypotheses: the hypothesis that maximizes the product of prior and likelihood.
● MLE is a special case of MAP where the prior is uniform (all values are equally likely); see the sketch after this list.
○ Limitation of MLE:
■ It assumes that the dataset is complete or fully
observable. This does not mean that the model has
access to all data; instead it assumes that all variables
relevant to the problem are present.
● Handling missing values
○ Take the maximum of the class labels and assign the majority in
case of categorical (class imbalance).
○ Take the mean of the values in case of numeric (affected by
outliers).
○ Median values in case of numeric (better option).
● Bayes optimal classifiers can be expensive if there are many
hypotheses (solution: use Gibbs classifier).
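A small sketch of MLE vs MAP on a toy coin-flip example (the data and the prior are invented for illustration):

```python
import numpy as np

# Estimating the bias theta of a coin from data D = 7 heads out of 10 flips.
heads, flips = 7, 10
thetas = np.linspace(0.01, 0.99, 99)

likelihood = thetas ** heads * (1 - thetas) ** (flips - heads)   # P(D | h)
prior = thetas ** 4 * (1 - thetas) ** 4                          # assumed prior peaked at 0.5
prior /= prior.sum()

mle = thetas[np.argmax(likelihood)]            # MLE: maximize P(D | h) only (uniform prior)
map_ = thetas[np.argmax(likelihood * prior)]   # MAP: maximize P(D | h) * P(h)
print(f"MLE = {mle:.2f}, MAP = {map_:.2f}")    # MAP is pulled toward the prior (0.5)
```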
Why is size a parameter for probability estimation in spam classification?
Spam emails tend to be lengthy, so the average size of spam can be informative. Words that occur extremely frequently or extremely seldom do not contribute much to the prediction/classification.
Weighting scheme:
● Distance weighting: the class with the maximum distance-weighted vote is assigned (see the sketch after this list).
● Attribute weighting - to solve problem of ‘curse of dimensionality’
● Class-based weighting: takes into account the number of instances of each class. It penalizes misses on the minority class by setting a higher class weight while reducing the weight for the majority class. During training, we give more weight to the minority class in the cost function, so errors on the minority class incur a higher penalty and the algorithm focuses on reducing them.
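A minimal sketch of distance-weighted kNN classification (toy points; the 1/d^2 weighting is one common choice):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x, k=3):
    # Each of the k nearest neighbours votes with weight 1/d^2,
    # so closer neighbours influence the classification more.
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12)
    return max(votes, key=votes.get)       # class with the maximum weighted vote

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y = ["a", "a", "b", "b"]
print(weighted_knn_predict(X, y, np.array([0.2, 0.1])))   # -> "a"
```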
Properties and limitations of the kd-tree (efficient only when the dimensionality d is small):
● It runs in O(log n) average time per search in a reasonable model.
● Storage for the kd-tree is O(n).
● Assuming ‘d’ is small the preprocessing time is O(n log n).
Issues with CBR (e.g. issuing a visa to a resident at the aliens office):
● To develop methods for indexing cases.
● Syntactic similarity measures (in word sequences).
● Examples in recommendation systems.
Cover tree: a data structure used for partitioning metric spaces to speed up operations like nearest-neighbor or range searches.
Active learning:
● The oracle does not necessarily have to be a human; a machine is also possible.
● Case base in AL: Contains/holds the most informative instance to be
labelled by the experts (border-line cases or anything else).
● Instances near to the decision boundary may not be the most
informative instances because they do not represent the distribution
purely.
● Can be very expensive also but stores the complete set of data
(training + testing).
IB1:
● Has all the cases in the case base.
● Identical to the kNN algorithm.
● Also performs normalization of attribute ranges similar to kNN.
● Processes instances incrementally, unlike traditional NN algorithms.
● Practical issue: the case base grows quickly.
IB2:
● Stores only the misclassified instances from the training set (see the sketch after this list).
● Order-dependent: the result depends on the order in which instances arrive.
● Low noise tolerance, low memory demands.
● IB2’s classification accuracy decreases more quickly than IB1’s as
the noise level increases.
● Reason: noisy instances are more likely to be misclassified, and IB2 saves only those, which it then uses to generate classification decisions.
● Assumption: vast majority of misclassified instances are
near-boundary instances that are located in a small neighborhood of
the boundary (border-line cases).
● Classification performance of IB2 is only marginally worse than IB1.
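A minimal sketch of the IB2 storage rule with a 1-NN case base (toy one-dimensional data):

```python
import numpy as np

def ib2(X, y):
    # IB2 keeps an instance only if the current case base misclassifies it,
    # so mostly near-boundary (border-line) cases end up being stored.
    case_base_X, case_base_y = [X[0]], [y[0]]
    for xi, yi in zip(X[1:], y[1:]):
        d = [np.linalg.norm(xi - c) for c in case_base_X]
        predicted = case_base_y[int(np.argmin(d))]     # 1-NN on the current case base
        if predicted != yi:                            # store only misclassified instances
            case_base_X.append(xi)
            case_base_y.append(yi)
    return np.array(case_base_X), case_base_y

X = np.array([[0.0], [0.2], [1.0], [1.2], [0.55]])
y = ["a", "a", "b", "b", "a"]
print(ib2(X, y))       # the case base stays much smaller than the training set
```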
IB3:
● “Wait and see”
● Includes only misclassified cases in current CB and removes “bad
cases”.
● “Evidence gathering method” to determine which saved instances
are expected to perform well during classification.
● Maintains a classification record with each saved instance.
● A classification record saves an instance's classification performance
on subsequently presented training instances and suggests how it will
perform in future.
● Uses significance testing to determine which instances are good
classifiers and which ones are believed to be noisy. The former are
used to classify subsequently presented instances. The latter are
discarded from the concept description.
Chapter 10 (clustering)
k-means :
● Stopping criterion: “until the mapping of instances to cluster centers no longer changes” (see the sketch after this list).
● Flat clustering (Hard assignment of points to groups).
● Requirements: the value ‘k’ for the number of clusters and a distance
metric (euclidean).
● Non-deterministic approach: each time with different iterations and
centroids you get different clusters.
● It prefers “cloud-like” clusters. Suffers from local optima (elongated,
strangled clusters).
● A quick and easy way to find groups in the data. Assumes the clusters are independent.
● Favours clusters of the same density, size and shape.
● Affected by outliers.
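A minimal k-means sketch (Euclidean distance, random initial centroids, toy two-blob data), stopping when the assignments no longer change:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # random initial centroids
    for _ in range(iters):
        # Hard assignment: each point goes to its nearest center (Euclidean distance).
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):               # stop: assignments stable
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
               np.random.default_rng(2).normal(3, 0.2, (20, 2))])
print(kmeans(X, k=2)[1])        # two centroids, ideally roughly at (0, 0) and (3, 3)
```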
HAC (deterministic):
● No need for k, but a distance measure for points and for clusters is required. Can look for non-spherically shaped clusters.
● Stopping criterion: “stops when only one cluster is left, i.e. everything has been merged”.
● Visualization via a dendrogram (perform a cut to obtain clusters at any level; note: the cut need not be horizontal). See the HAC sketch after the comparison table below.
● Each point is initialized as one cluster.
○ MIN / SLINK
■ Produces chain-like structures: a pair forms, then an
object rejoins the pair and so on. The resulting
dendrogram does not clearly show the separated groups,
but can be used to identify gradients in the data.
■ The similarity to a cluster is the similarity to its most similar member, and it decreases monotonically from iteration to iteration.
■ It is optimal clustering.
■ Merging criteria: local
■ Can be used to identify outliers, but is also affected by outliers.
■ It can be used with similarity or dissimilarity measures.
■ It is continuous from weak clustering to the strong
clustering. Hence, this method is invariant to monotone
transformations of input data.
○ MAX / CLINK
■ Derives compact clusters.
■ Tends to produce small groups separately.
■ Good for looking for discontinuities in data.
■ Merging criteria: non-local, thus the entire structure of
the clustering can influence merge decisions.
■ Since it is sensitive to outliers, it can cause less-than-optimal merging; hence the clustering is not optimal.
○ Centroid linkage
■ It is non-monotonic. Similarity can increase during
clustering (inversion property).
■ Not affected by noise, but has a bias towards finding “global patterns”.
● EM algorithm:
○ Given a set of instances with a missing attribute, it looks for the most probable model for the data.
○ Based on that model, it determines the missing attribute.
● Complexity of HAC algorithms:
MIN < MAX < AVG < CEN < WARDS
k-means vs HAC:
● k-means: handles big data.
● HAC: requires working with concise (smaller) data.
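A small HAC sketch using SciPy's hierarchical clustering (assuming scipy is available); the linkage method selects the MIN/MAX/centroid/Ward behaviour, and fcluster performs the dendrogram cut:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (10, 2)),
               np.random.default_rng(1).normal(4, 0.3, (10, 2))])

# Agglomerative clustering: method="single" is MIN/SLINK (chaining),
# "complete" is MAX/CLINK (compact clusters); "average", "centroid"
# (can show inversions) and "ward" are the other options mentioned above.
Z = linkage(X, method="single")

# "Cutting" the dendrogram, here by the desired number of clusters:
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree
```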
FCM:
● Soft clustering technique.
● Stopping criterion: “stops when the membership values remain the same, or when the SSE drops below a certain threshold”.
● A point can belong to more than one cluster.
● Uses membership values: for each point they are non-negative and sum to 1 across the clusters (see the sketch after this list).
● Uses a hyperparameter ‘m’ i.e. the fuzzifier - it controls the amount
of overlap.
● K-means is a very special case of FCM; FCM can be converted back to k-means via hard assignment, putting 1 where the membership approximates 1 and 0 where it approximates 0.
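A minimal sketch of the FCM membership computation for fixed centers (toy points; m = 2 is a common fuzzifier value):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); for each point the memberships are
    # non-negative and sum to 1 across clusters. m is the fuzzifier (overlap control).
    d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
U = fcm_memberships(X, centers)
print(U, U.sum(axis=1))   # rows sum to 1; the middle point belongs ~equally to both clusters
```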
LVQ:
● Fixed lr might cause oscillations.
● Parameterized clustering.
● Unsupervised + semi-supervised.
● LVQ can work with a partial dataset (it processes instances incrementally), whereas k-means needs the full dataset.
● Uses codebook vectors (see the sketch below).
● Weak clustering; more problems due to its iterative nature.
● Optimization results for LVQ and k-means are the same.
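A minimal sketch of the LVQ1 codebook update (toy data; the decaying learning rate is one way to avoid the oscillation mentioned above):

```python
import numpy as np

def lvq1(X, y, codebooks, cb_labels, lr=0.1, epochs=30):
    # LVQ1 update: pull the nearest codebook vector towards a correctly
    # labelled instance, push it away otherwise. A decaying lr avoids oscillation.
    codebooks = codebooks.copy()
    for epoch in range(epochs):
        lr_t = lr * (1 - epoch / epochs)
        for xi, yi in zip(X, y):
            j = np.argmin(np.linalg.norm(codebooks - xi, axis=1))
            sign = 1.0 if cb_labels[j] == yi else -1.0
            codebooks[j] += sign * lr_t * (xi - codebooks[j])
    return codebooks

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y = ["a", "a", "b", "b"]
print(lvq1(X, y, codebooks=np.array([[0.5, 0.5], [0.6, 0.6]]), cb_labels=["a", "b"]))
```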