ML Revision
- PANDAS
- a Python library for data wrangling and analysis. It is built
around a data structure called the DataFrame (a table, similar to an Excel
spreadsheet). pandas provides a great range of methods to modify and
operate on this table (e.g. SQL-like queries and joins of tables).
- SCIKIT-LEARN
- Machine learning library for the Python programming language.
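A minimal sketch (my own, not from the lecture) of the two libraries together; the dog-breed values below are made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A small DataFrame: rows = examples, columns = features + label (made-up numbers).
df = pd.DataFrame({
    "height": [50, 55, 70, 72],
    "weight": [25, 28, 30, 33],
    "breed": ["labrador", "labrador", "greyhound", "greyhound"],
})

# SQL-like filtering is built into pandas.
tall_dogs = df[df["height"] > 60]

# Every scikit-learn estimator follows the same fit/predict interface.
X, y = df[["height", "weight"]], df["breed"]
model = LogisticRegression().fit(X, y)
print(model.predict([[68, 31]]))
```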
II. What is ML?
- Short answer
• Machine : Easy part
• Learning : Hard part
- To solve problems on a computer, algorithms (sequences of instructions that
transform input into output) are necessary
- Tasks without predefined algorithms (such as distinguishing spam from
legitimate emails) require ML
- ML involves automatically extracting algorithms from example data.
- Put simply, it's teaching a computer to do a task using 'data'
- Methods that extract knowledge from the data
- Closely related to statistics and optimization
- Machine learning is focused on prediction
II.2 An ML example
- We want a computer to distinguish between a Greyhound and a Labrador.
Thus, we create a dataset by collecting examples of those breeds (collecting
photos of them) and we describe their breed-specific features.
A) Association learning
- Association learning focuses on uncovering associations, correlations, or
patterns within a dataset.
- The primary goal is to identify relationships between variables or items that
frequently occur together.
- Association learning methods typically involve mining large datasets to find
rules or patterns that indicate associations.
- Association learning finds applications in various domains, including retail for
market basket analysis, recommendation systems, web usage mining,
bioinformatics, and more.
- In retail, it helps understand customer purchasing behaviour and optimize
product placement or cross-selling strategies.
B) Classification
- Identifying which category an object belongs to
- Example+applications : Credit scoring involves predicting the risk associated
with a loan application. Attributes like income, savings, profession, age, and
past financial history are considered. The task is to classify applicants as low-
risk or high-risk based on these attributes. Applications of Classification:
optical character recognition, face recognition, medical diagnosis, speech
recognition, spam detection, outlier detection, etc.
- Algorithms: SVM, nearest neighbours, random forests, etc.
- Discrete output (e.g. color, gender, yes/no, etc.)
- E.g. "Will you pass this course?"
- Supervised learning
- The model trained from the data defines a decision boundary that separates
the data
C) Regression
- In regression, the goal is to predict a continuous output (value of an attribute),
based on input attributes.
- Example: Predicting the price of a used car based on attributes like brand,
year, mileage, etc.
- Application: drug response, stock prices, navigation of a mobile robot,
optimization tasks, response surface design, etc.
- Algorithms: SVR, nearest neighbours, random forests, etc.
- Continuous output (e.g. temperature, age, distance, salary, etc.)
- E.g. "How many points will you get in the exam?"
- The model fits the data to describe the relation between 2 features or
between a feature (e.g., height) and the label (e.g., yes/no)
D) Clustering
- A method for unsupervised learning aimed at finding clusters or groupings
within input data without any predefined labels
- Helps identify outliers, customers who deviate significantly from others in their
group.
- Applications: customer segmentation, grouping experiment outcomes, etc.
- Algorithms: k-means, feature selection, non-negative matrix factorization, etc.
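A quick scikit-learn sketch of k-means clustering on unlabeled toy 2-D data (my own example, with made-up blobs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two blobs of unlabeled 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# k-means groups the points into k clusters without any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster assignment per point
print(km.cluster_centers_)    # the two centroids
```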
B) Unsupervised Learning
- Algorithms work with only input data, without explicit instructions on what to
do with said data, and with no known output data
- This means that there are no labels/categories for the algorithm to learn from
- Useful for discovering hidden relationships, patterns, grouping similar data,
reducing the dimensionality of the input space, etc.
- Examples: identifying topics in blog posts, discovering trending ‘X’ topics,
grouping data into clusters, outlier detection, segmenting customers into
groups based on preferences (clustering again), detecting abnormal access
patterns on a website, etc.
- Clustering
C) Reinforcement learning
B) Text
- Words/letters need to be converted into a format computers can understand
VI. Preprocessing
- Feature extraction and normalization
- Preparing the raw data and making it suitable for an ML model
B) Robust scaler
- Same as Standard Scaler, but with median instead of mean and
IQR instead of SD.
- Better for skewed data
- Deals better with outliers
C) MinMax scaler
- Shifts data to an interval set by xmin and xmax
D) Normalizer
- Doesn’t work by feature (column) but by row
- Each row of data is rescaled so that its norm becomes 1
- Compute the norm of the vector (square root of the sum of the squared elements)
- Divide each element by the norm
- Used only when the direction of the data matters
- Helpful for histograms
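A small comparison of the scalers above in scikit-learn (my own toy matrix; note the Normalizer works per row, the others per column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10000.0]])   # second feature contains an outlier

print(StandardScaler().fit_transform(X))  # per column: subtract mean, divide by SD
print(RobustScaler().fit_transform(X))    # per column: subtract median, divide by IQR
print(MinMaxScaler().fit_transform(X))    # per column: rescale into [0, 1]
print(Normalizer().fit_transform(X))      # per ROW: divide by the row's Euclidean norm
```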
IV.2 Ways to scale data
A) Univariate transformations
- Examples of univariate transformations: logarithmic, geometric, power, etc.
- Most ML models perform best with Gaussian distributed data (bell curve)
- Methods to transform data to Gaussian include Box-Cox transform and
Yeo-Johnson transform
- Parameters can be automatically estimated so that skewness is minimized
and variance is stabilized.
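A hedged sketch of the Box-Cox / Yeo-Johnson idea using scikit-learn's PowerTransformer, where the transform parameter (lambda) is estimated automatically; the log-normal data is made up:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))        # strongly right-skewed toy data

# Yeo-Johnson also handles zero/negative values; use method="box-cox" for
# strictly positive data. The lambda parameter is fitted by maximum likelihood.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_gaussian = pt.fit_transform(X)
print(pt.lambdas_)                        # estimated transform parameter(s)
```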
B) Binning
- Separate feature values into n categories (e.g., equally spaced over the
range of values)
- Replace all values within a category by a single value, e.g., mean.
- Effective for models with few parameters, such as regression, but not
effective for models with many parameters, such as decision trees.
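A sketch of binning with scikit-learn's KBinsDiscretizer (my own example); note it replaces each value with its bin index / one-hot code rather than the bin mean:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.linspace(-3, 3, 10).reshape(-1, 1)

# 4 equally spaced bins over the feature's range.
binner = KBinsDiscretizer(n_bins=4, strategy="uniform", encode="onehot-dense")
print(binner.fit_transform(X))   # one-hot column per bin
print(binner.bin_edges_)         # the bin boundaries
```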
VII. Guiding principles in ML
VII.1 Measuring classification success
- How “predictive” are the models we have learnt?
- New data is probably not exactly the same as the training data
-What happens if we overfit our data?
A) Cross-validation
- To evaluate (test) your model’s ability to predict new data
- Detect overfitting or selection bias
- Techniques:
- K-fold cross validation
- Leave one out (K-fold cross validation to the extreme)
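A quick sketch of both techniques with scikit-learn (dataset and classifier are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

# K-fold: each fold is the test set exactly once.
print(cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)))

# Leave-one-out: K-fold taken to the extreme (K = number of samples).
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```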
RFE
Advantages:
- Highly interpretable: each word is an independent feature.
- Simple method.
- Fairly effective approach for some applications.
Limitations:
- All structure is lost! Crucial info may be lost.
- Misspellings: "machine" and "machnie" will be counted as different words.
- Some expressions consist of several words, e.g. a product review with the word
"worth": what if the review said "not worth" vs "definitely worth"?
I. Classification
- The decision boundary can be a straight line (“stiff”) or a wiggly line (“flexible”)
- Model trained from the data defines a decision boundary that separates the
data
Classification vs Regression (k-NN):
- Test points are assigned the majority label of the k nearest neighbours
- Special cases:
- k = N: since all datapoints are considered, the predicted label for a test point
will always be the majority label of all datapoints. Equivalent to a majority
classifier.
- (In regression, the model is the line that fits the data)
- Smaller k leads to more complex decision boundaries
- k too low -> danger of overfitting, high complexity
- k too high -> danger of underfitting, low complexity
II.6 How to determine model complexity?
- Depends on complexity of the separation between the classes
- Start with the simplest model (large k in k-NN), and increase complexity
(smaller k)
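A sketch of the "start simple, then increase complexity" idea with k-NN (the dataset is my own choice; exact scores depend on the split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Large k = simple model (smooth boundary), small k = complex model (wiggly boundary).
for k in [101, 25, 5, 1]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(clf.score(X_tr, y_tr), 3), round(clf.score(X_te, y_te), 3))
```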
III.1 Nearest shrunken centroid
- After calculating the nearest centroids, the algorithm applies a shrinkage step.
Shrinking essentially means pulling the centroids towards the overall mean of all
points in the dataset (this way the centroids are regularized and prevented from
being overly influenced by outliers/noisy data points)
- "shrinks" each of the class centroids toward the overall centroid for all
classes by an amount we call the threshold . This shrinkage consists of
moving the centroid towards zero by threshold, setting it equal to zero if it
hits zero. For example if threshold was 2.0, a centroid of 3.2 would be
shrunk to 1.2, a centroid of -3.4 would be shrunk to -1.4, and a centroid of
1.2 would be shrunk to zero.
- After shrinking the centroids, the new sample is classified by the usual
nearest centroid rule, but using the shrunken class centroids.
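A hedged sketch using scikit-learn's NearestCentroid, whose shrink_threshold parameter implements the shrinkage described above (dataset is my own choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# shrink_threshold moves each centroid component towards the overall centroid,
# zeroing out components that hit zero (i.e., uninformative features).
for threshold in [None, 0.2, 2.0]:
    clf = NearestCentroid(shrink_threshold=threshold).fit(X_tr, y_tr)
    print(threshold, round(clf.score(X_te, y_te), 3))
```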
III.2 KNN vs NC
I. Regression
- In machine learning, supervised learning to predict continuous outputs
(we’ll call these y) is called regression.
- What do you need to predict outputs ?
- Continuous or categorical input features (we'll call these x);
- Training examples: many x for which y is known (e.g. many people of whom we know
the height, or houses for which we know the prices);
- A model, a function that represents the relationship between x and y;
- A cost function, which tells us how well our model approximates the training
examples;
- Optimization, a way of finding parameters for the model while minimizing the loss
function.
- Analytical method: gives one unique (closed-form) solution, but is not always
feasible; optimization does not give a closed-form solution (it is found iteratively)
I.1 Linear regression (aka ordinary least squares)
⁃ Simplest and classic linear method for regression.
⁃ Finds parameters w and b minimizing mean squared error
between predictions and true regression targets y on the training
set.
⁃ Mean squared error is the sum of squared differences between
predictions and true values.
⁃ No parameters to control model complexity.
⁃ Equation: y = m·x + e, where m is a parameter and e represents measurement or
other noise
- We don't want to model the error (something we don't want to account for, since
it is random)
- Goal: estimate m from training data for x and y
- Most common approach: minimize the least squares error (LSE)
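A small sketch of the least-squares estimate of m for the no-intercept model y = m·x + e (toy data generated with m = 3):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(scale=2.0, size=50)   # y = m*x + e with true m = 3

# Closed-form least squares for a line through the origin:
# minimize sum (y - m*x)^2  ->  m = sum(x*y) / sum(x*x)
m_hat = np.sum(x * y) / np.sum(x * x)

# Same idea with scikit-learn (intercept/bias discussed in the next section).
lr = LinearRegression(fit_intercept=False).fit(x.reshape(-1, 1), y)
print(m_hat, lr.coef_[0])
```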
I.3 Bias/Intercept
- If the line does not pass through the origin ...
I.11 Overfitting
- Machine learning is so effective in finding the best fit that it is likely to
construct a complex model that would never generalize to unseen data.
- However, a complex model that reduces prediction error and yields a
better fit also models noise.
Overfitting:
- The induced model is too complex: it fits the training data too closely and tries
to fit noise.
- Performs better on the training set than on the validation set.
Underfitting:
- The induced model is not complex (flexible) enough to model the data.
- Performs badly on both the training and validation set.
Why?
- Loss function: the error we minimize during training, e.g. the (mean) square error.
- Objective function: any function that you optimize during training (e.g. maximum
likelihood, divergence between classes).
- A loss function is part of a cost function, which is a type of objective function.
- Decision Boundary (for classification):
- Single line/contour which separates data points into regions
- What is the output label at the boundary?
I.1 How do we find the parameters w?
A) Gradient Descent
- To find optimal values for w: an iterative optimization algorithm that operates
over a loss landscape (the cost/objective function)
- Learning rate: the size of the steps taken in any direction
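A minimal gradient descent loop (my own toy example) for a one-parameter least-squares problem, showing the role of the learning rate:

```python
import numpy as np

# Toy 1-D model y ≈ w*x, loss = mean squared error.
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0
learning_rate = 0.5                        # the size of each step
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)    # d(MSE)/dw
    w -= learning_rate * grad              # step in the opposite direction of the gradient
print(w)                                   # ends up close to 2.0
```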
II.8 Entropy
- Entropy – level of uncertainty
- Exercise for computing entropy
- Cross entropy
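A short worked entropy example (my own, using log base 2):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum p_i * log2(p_i), with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0 bit: maximum uncertainty for two classes
print(entropy([0.9, 0.1]))   # ~0.47 bits: mostly certain
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty
```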
III. Regularization
- “Regularization is any modification to a learning algorithm that is intended to
reduce its generalization error but not its training error”
- Similar to other data estimation problems, we may not have enough samples
to learn good models for logistic regression classification
- One way to overcome this is to ‘regularize’ the model, impose additional
constraints on the parameters we are fitting
- By adding a prior on w
III.1 L1 vs L2 regularization
- L1 (Lasso): encourages sparsity
- Squared L2 (Ridge): encourages small weights
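A hedged sketch of the sparsity vs small-weights effect using regularized logistic regression in scikit-learn (dataset and C value are my own choices; C is the inverse regularization strength):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("non-zero weights with L1:", np.sum(l1.coef_ != 0))   # sparse
print("non-zero weights with L2:", np.sum(l2.coef_ != 0))   # small but dense
```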
Figure (network diagram): blue = input layer, orange = hidden layer, red = output layer.
I.9 Inspiration for NNs
- Occipital part: (computer) vision
- Object recognition: posterior, inferior temporal cortex
- PFC: cognitive decisions
- Motor cortex: actions
- Each neuron communicates with (is connected to) ~10^4 other neurons (the learning
capacity comes from those connections)
I.10 ANNS
- Neural networks define functions of the inputs (hidden features), computed by
neurons
- To optimize the weights: gradient descent
⁃ MLPs consist of an input layer, one or more hidden layers, and an output
layer.
⁃ Each layer contains perceptrons (or units), with connections (weights)
between them.
⁃ While theoretically, MLPs can have multiple hidden layers, in practice,
networks with one hidden layer are more common due to simplicity.
⁃ However, in certain cases, adding more hidden layers might improve the
network's ability to capture intricate patterns in the data.
⁃ In summary, multilayer perceptrons overcome the limitations of single-layer
perceptrons by introducing non-linear activation functions and multiple hidden layers,
enabling them to approximate non-linear functions and solve more complex machine
learning problems.
⁃ We optimize our hyperparameters through trial and error on the validation set.
- The capacity of the network increases with more hidden units and more
hidden layers
- Once optimization ends there is (almost) no error on the training set; too many
epochs can lead to overfitting.
II.1 Activation function
II.2 Backpropagation
- Propagating the error back and updating (adjusting) our weights and biases to
minimize loss.
- Compute the derivative of the loss function with respect to weights and biases
- Forward Pass:
o The algorithm begins with a forward pass, where input data is fed into
the network, and activations are computed layer by layer until the
output layer is reached.
o The output of each neuron is computed as a weighted sum of its inputs
passed through an activation function (e.g., sigmoid function for hidden
layers).
- Error Computation:
o After the forward pass, the error or loss is computed between the
predicted output and the actual target values.
o For example, in the case of nonlinear regression, the error is typically
computed using a loss function such as mean squared error (MSE).
- Backward Pass (Backpropagation):
o The error gradients with respect to the parameters (weights) of the
network are computed using the chain rule of calculus.
o The error is propagated backward through the network, starting from
the output layer towards the input layer.
o At each layer, the error gradients are computed for the weights
connecting that layer to the previous layer.
- Weight Update:
o The weights of the network are updated using an optimization
algorithm, typically gradient descent or one of its variants.
o The weights are adjusted in the opposite direction of the gradient to
minimize the error.
o The magnitude of the weight update is determined by the learning rate
(η) parameter, which controls the step size during optimization.
- Batch Learning:
o In batch learning, weight updates are accumulated over all patterns in
the training set, and the weights are updated once after a complete
pass over the training set (epoch).
o This process iterates over multiple epochs until convergence, where
the error is minimized to an acceptable level.
- Online Learning:
o Alternatively, online learning updates the weights after each individual
pattern, implementing stochastic gradient descent.
o Online learning can converge faster but requires careful tuning of the
learning rate parameter and randomization of the training data order.
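A hedged sketch of an MLP trained with (mini-batch) stochastic gradient descent in scikit-learn; backpropagation computes the gradients internally (dataset, layer size, and learning rate are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)

# One hidden layer of 50 units, trained with mini-batch SGD over many epochs.
mlp = MLPClassifier(hidden_layer_sizes=(50,), solver="sgd",
                    learning_rate_init=0.1, batch_size=32,
                    max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_tr), y_tr)
print(mlp.score(scaler.transform(X_te), y_te))
```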
MLP:
⁃ Structure: MLP consists of multiple layers of neurons, including an input layer,
one or more hidden layers, and an output layer. Each neuron in one layer is connected
to every neuron in the subsequent layer.
⁃ Fully Connected Layers: each neuron in a layer is connected to every neuron in the
next layer, creating a dense network.
⁃ Parameter Sharing: there is no parameter sharing between different parts of the
input data. Each weight parameter is unique to its connection.
⁃ Global Information Processing: MLP processes the entire input data globally,
without considering local patterns or structures. It treats each input feature
equally.
⁃ Applicability: MLPs are suitable for tasks where the input data exhibits simple
relationships and does not have spatial or temporal dependencies, such as tabular
data or basic pattern recognition tasks.
CNN:
⁃ Structure: CNN consists of alternating convolutional layers and pooling layers,
followed by one or more fully connected layers.
⁃ Convolutional Layers: convolutional layers apply convolution operations to the
input data, allowing them to capture local patterns or features. Each filter in a
convolutional layer learns to detect specific features.
⁃ Parameter Sharing: CNNs use parameter sharing, where a small set of learnable
filters (kernels) is applied across the entire input data to extract features. This
significantly reduces the number of parameters and improves model efficiency.
⁃ Local Information Processing: CNNs process input data in a local and hierarchical
manner, capturing local features and gradually combining them to learn global
representations. They exploit the spatial locality and hierarchical structure present
in the data.
⁃ Spatial Hierarchies: CNNs are particularly effective for tasks involving spatial
data such as images, where local patterns and spatial relationships are crucial for
understanding the content. They automatically learn hierarchical representations of
the input data, starting from low-level features (edges, textures) up to high-level
features (objects, scenes).
- The convolutional layer applies a set of filters to the input data, extracting
meaningful features and hierarchically learning representations of the input. This
makes it well suited for tasks involving spatial data like images, while efficiently
managing the model's parameters.
- The pooling layer downsamples the input feature maps, reducing their spatial
dimensions while preserving essential information, thereby improving computational
efficiency and promoting translation invariance in CNNs.
III.6 Max pooling layer
- Finds the maximum locally which indicates generally the maximum response
of the filtering from the previous layer.
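A plain NumPy sketch (my own) of 2x2 max pooling with stride 2, just to show the operation:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % 2, :w - w % 2]          # crop to an even size
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 1]])
print(max_pool_2x2(fm))   # [[4 2]
                          #  [2 7]]
```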
LECTURE 7- Support Vector Machines
I. What’s SVM
- Support Vector Machine (SVM) and later generalized kernel machines offer a
different approach for linear classification and regression.
- SVMs adhere to Vapnik’s principle, prioritizing simplicity and efficiency by
focusing on learning the discriminant rather than estimating complex
probabilities directly.
- After training, the parameter of the linear model, the weight vector, can be
expressed in terms of a subset of the training set known as support vectors.
- Support vectors are instances close to the decision boundary, aiding in
knowledge extraction and providing an estimate of generalization error.
- The output of the model is expressed as a sum of influences of support
vectors, determined by kernel functions that measure similarity between data
instances.
- Kernel functions allow representation beyond traditional vector-based
methods, accommodating various data types such as graphs.
- Kernel-based algorithms are formulated as convex optimization problems,
allowing for analytical solutions without the need for heuristics like learning
rates or convergence checks.
- Hyperparameters are still required for model selection, but the optimization
process is more straightforward.
- The discussion typically starts with classification and then extends to
regression, outlier detection, and dimensionality reduction, with a common
quadratic program template to maximize separability or margin of instances
while maintaining smoothness of solution.
- Supervised ML algorithm for regression and classification
- can generate linear decision boundaries for linearly separable data
- variants exist for non-linear decision boundaries
⁃ For the positive class (y=1), we want the decision boundary to be such that θ(x) is
greater than or equal to 1, ensuring that positive instances are correctly classified
and lie on the correct side of the decision boundary.
⁃ For the negative class (y=0), we want θ(x) to be less than or equal to -1, ensuring
that negative instances are correctly classified and lie on the correct side of the
decision boundary.
⁃ This optimization problem aims to find θ values that maximize the margin to the
closest points from each class while satisfying these constraints.
⁃ Labelling one class as +1 and the other class as -1 aligns with this formulation.
It signifies that positive instances should have a higher value of θ(x) (greater than
or equal to 1) and negative instances should have a lower value of θ(x) (less than or
equal to -1), ensuring correct classification and a large margin between the classes.
- Simpler explanation (from what I understood):
⁃ In SVM, we want to draw a line (or a plane) that separates different classes of
data.
⁃ We want this line to have as much space as possible between the classes,
which we call the margin.
⁃ For one class, we want the line to be at least 1 unit away. For the other class,
we want it to be at least 1 unit away but in the opposite direction.
⁃ Mathematically, we find the best line by adjusting some values (represented by θ)
to make this margin as big as possible while still correctly separating the classes.
⁃ We also want to minimize the changes to these values, so we sum up their
squares to keep them small.
⁃ When we label one class as 1 and the other as -1, it's like saying "be at least
1 unit away on this side for class 1 and on the other side for class -1." This helps in
finding the best line that separates the classes well.
II.3 Linear SVMs: binary classification problem
- Linear Support Vector Machines (SVMs) are used for binary
classification problems, where the goal is to separate a dataset into
two classes using a straight line (or hyperplane in higher
dimensions). The SVM algorithm finds the optimal hyperplane that
maximizes the margin, which is the distance between the
hyperplane and the closest data points from each class. By
maximizing this margin, linear SVMs effectively classify new data
points by determining which side of the hyperplane they belong to.
- The SVM’s objective is to find a hyperplane, which is a decision
boundary that separates the two classes in the feature space with
the largest margin. The margin refers to the distance between the
hyperplane and the closest data points from each class, also called
support vectors. In the image, the solid line represents the decision
boundary, and the dotted lines on either side of the solid line
represent the margin.
- Ideally, a larger margin translates to better generalization on
unseen data. This is because the SVM is less likely to be swayed by
slight variations or noise in the data points.
- Here are some key points to remember about SVMs for binary
classification:
- SVMs find a hyperplane that maximizes the margin between the two
classes.
- The data points closest to the hyperplane are the support vectors
and are critical for defining the decision boundary.
- A larger margin generally leads to better performance on unseen
data
- Imagine you have a classification problem, like separating obese (green) from
non-obese (red) mice based on weight. A perfect separator would be a line
that keeps all red on one side and all green on the other.
- But what if the data is messy? Maybe some obese mice have a weight closer
to non-obese mice. A strict line (hard margin) won't work here.
- A soft margin SVM allows for some mistakes. It introduces an imaginary
"buffer zone" around the separation line. Data points can fall within this zone
or even on the wrong side of the line, but they are penalized for being there.
- The more a data point deviates from the correct side, the bigger the
penalty. This helps the SVM find a balance between a good separation line
and allowing for some errors in messy data.
- Think of it as a more forgiving version of the regular SVM, allowing for some
outliers without completely failing.
II.5 Max margin classification- Hard margin SVM
- a Support Vector Machine (SVM) classification model might not perform well
using a hard margin. Here, hard margin refers to a strict separation between
two classes of data points.
- The data points show the mass of mice, with red dots representing non-obese
mice and light green dots representing obese mice. The decision boundary, a
solid line separating the two classes, is established based on the mass. A
new observation,represented by a question mark, is located above the
decision boundary but closer to the cluster of non-obese mice.
- In this instance, the threshold – the mass value on the decision line – isn’t a
strong predictor for classifying this new observation. The model might classify
the new observation as obese even though it appears closer to the non-obese
mice in mass.
- Here’s why the SVM might struggle here:
- Limited Features: The model is likely only considering mass as a single
feature to distinguish between obese and non-obese mice.
- Data Overlap: There might be inherent overlap in mass between obese and
non-obese mice. Even though the model separates the data points based on
mass, there will always be some ambiguity, especially for observations on the
borderline.
- Typically, for SVM classification to be more effective, more features that are
relevant to class distinction are used. For instance, alongside mass, features
like genetic markers or diet information could improve the model’s ability to
distinguish between the two classes.
- Basically hard margin SVM aims to find a hyperplane that perfectly separates
classes without allowing any misclassifications, which works well only when
the data is linearly separable; however, it's sensitive to outliers and doesn't
generalize well. On the other hand, soft margin SVM introduces a margin of
tolerance, allowing for some misclassifications to find a more robust
hyperplane, especially when the data is not perfectly separable. It
incorporates slack variables to handle misclassifications and balances
between maximizing the margin and minimizing errors using a regularization
parameter C
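A quick sketch of the effect of the regularization parameter C in scikit-learn's SVC; a very large C approximates a hard margin, a small C gives a softer margin (the blob data is made up):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

# Large C: few misclassifications tolerated (closer to a hard margin).
# Small C: more slack allowed, wider margin, more robust to outliers.
for C in [1000, 1.0, 0.01]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors per class:", clf.n_support_,
          "train accuracy:", round(clf.score(X, y), 3))
```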
- The maximal margin in SVM refers to the largest possible distance between the
decision boundary (hyperplane) and the nearest data point of any class. The goal of
SVM is to find the hyperplane that maximizes this margin, as it leads to better
generalization and robustness of the classifier, especially in cases of new or unseen
data.
- After training an SVM, a support vector is any instance located on the margin
- Computing an SVM's predictions only involves the support vectors, not the whole
training set.
- Imagine the decision function as a seesaw with a pivot point at the bias term
(θ₀). Each feature in the data point (x) contributes its weight (θₙ) to one side of
the seesaw. The decision function calculates the total weight on each side. If
the positive side outweighs the negative side by a certain threshold (θ'(x) ≥
1), the data point is classified as positive.Conversely, if the negative side
outweighs by a threshold (θ'(x) ≤ -1), it's classified as negative. A smaller
weight vector makes the seesaw more sensitive, requiring data points to be
further away from the center (decision boundary) to tip the balance decisively.
Advantages:
• Accuracy
• Works well on smaller, cleaner datasets
• Can be more efficient because it uses a subset of training points
Disadvantages:
• Not suited to larger datasets, as the training time with SVMs can be high
• Less effective on noisier datasets with overlapping classes
• Linear SVMs have a linear decision boundary
• Originally designed as a two-class classifier
- A hard-margin SVM will fail to find a solution when the data is not linearly
separable.
- The image depicts one of the shortcomings of hard margin SVMs. Hard
margin SVMs aim to identify a hyperplane that perfectly separates the data
points belonging to different classes, with the largest margin possible. This
margin is the distance between the hyperplane and the closest data points
from each class, also known as the support vectors [2].
- The left graph in the image shows a scenario where a hard margin SVM can
successfully classify the data. The data points are linearly separable, meaning
a clear dividing line (hyperplane) can be drawn between the two classes. The
hard margin SVM finds this hyperplane and maximizes the margin by placing
it equidistant from the closest data points of each class (support vectors).
- However, the right graph illustrates an issue with hard margin SVMs – they
are sensitive to outliers. In this case, a single outlier data point disrupts the
possibility of finding a perfect separation. The presence of this outlier forces
the hyperplane to be positioned closer to one class, reducing the overall
margin. In severe cases, outliers can entirely prevent a hard margin SVM from
finding a separating hyperplane [2].
- Soft margin SVMs address this issue by allowing for some misclassification
during training. This enables the model to handle outliers and data that is not
perfectly linearly separable [1].
- In short: Hard margin SVMs aim for perfect separation of data, which is great
when it works. But, they struggle with outliers that mess up the clean
separation and reduce the margin between classes. Soft margin SVMs are
more flexible and can handle these outliers by allowing some misclassification
during training.
- The figure shows two scatter plots of the petal length and petal width of Iris
flowers from two Iris species: Iris-Versicolor and Iris-Virginica. The labels C=100
and C=1 correspond to the cost parameter used in training a Support Vector Machine
(SVM) classifier.
- The goal of a linear SVM classifier is to find a hyperplane that separates the
data points belonging to different classes with the maximum margin. The
margin is the distance between the hyperplane and the closest data points
from each class, also known as the support vectors.
- The left graph (C=100) shows a scenario where a linear SVM can achieve a
good separation between the two Iris species.The data points are almost
linearly separable, meaning a clear dividing line (hyperplane) can be drawn
between the two classes. The SVM finds this hyperplane and maximizes the
margin by placing it equidistant from the closest data points of each class
(support vectors).
- The right graph (C=1) illustrates the impact of the cost parameter (C) on the
SVM classifier. A lower cost parameter allows for more misclassifications
during training. In this case, the cost parameter is set very low (C=1), which
makes the SVM classifier more flexible and allows it to tolerate some overlap
between the two Iris species data points. This increases the model's ability to
handle outliers but can also reduce the margin between the classes.
- Thus, we introduce what is known as the ‘kernel trick’. The kernel trick is a
technique used in machine learning, particularly with support vector machines
(SVMs), to handle non-linearly separable data. Instead of explicitly mapping
the data to a higher-dimensional space using basis functions, the kernel trick
allows us to compute the dot product of the mapped data points in the original
space. This is achieved by defining a kernel function that calculates the
similarity between pairs of data points directly in the original input space.
- Kernel SVMs
introduce a concept called the "kernel trick." This trick involves transforming
the data points from their original feature space into a higher-dimensional
space.
- In this higher-dimensional space, the data points might become linearly
separable, allowing the SVM to find a separating hyperplane.
- The kernel function essentially acts as a bridge between the original feature
space and the higher-dimensional space,without explicitly performing the
transformation itself. This keeps the computational cost lower.
In simpler terms, the kernel trick enables us to use a linear model in a
high-dimensional space without explicitly calculating the transformed features. This
is advantageous because it avoids the computational overhead of mapping the data to
higher dimensions. The kernel function computes the similarity between data points,
and by replacing the dot product with this kernel function, we can effectively work
in a high-dimensional space without explicitly transforming the data.
- The RBF kernel has only one parameter, gamma, which is the inverse of the width of
the Gaussian kernel. gamma and C both control the complexity of the model.
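A sketch of an RBF-kernel SVM on data that is not linearly separable (make_moons is my own choice of toy dataset), varying gamma and C:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# gamma: inverse width of the Gaussian kernel; C: penalty for margin violations.
for gamma, C in [(0.1, 1), (1, 1), (10, 100)]:
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_tr, y_tr)
    print(gamma, C, round(clf.score(X_te, y_te), 3))
```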
Advantages:
• Allow for complex decision boundaries, even if the data has only a few features.
• Work well on low-dimensional and high-dimensional data (i.e., few and many
features).
Disadvantages:
• Do not scale very well with the number of samples. Running an SVM on data with up
to 10,000 samples might work well, but working with datasets of size 100,000 or more
can become challenging in terms of runtime and memory usage.
• Require careful preprocessing of the data and tuning of the parameters.
- The figure shows a simplified illustration of a soft margin SVM with two features.
It highlights the concept of the threshold (decision boundary) separating the classes
and the margins that provide a buffer for potential misclassifications.
- The goal is to find a function f(x) that has at most ε deviation from the actually
obtained targets y_i for all the training data, and at the same time is as flat as
possible.
- In other words, we do not care about errors as long as they are less than ε, but
will not accept any deviation larger than this.
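A hedged sketch of support vector regression with scikit-learn's SVR; epsilon is the width of the tube inside which errors are ignored (the sine data is made up):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Larger epsilon -> wider tube -> flatter model with fewer support vectors.
for eps in [0.01, 0.1, 0.5]:
    svr = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(eps, "support vectors:", len(svr.support_))
```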
V. Multiclass classification
V.1 Reduction to Binary classification
⁃ Standard Approach: If we have 4 classes, we can treat each class
separately and train a classifier to distinguish it from all the others. This results in 4
binary classifiers.
⁃ One-vs-Rest : Here, we train one classifier for each class, considering it as
the positive class and all other classes as the negative class.
⁃ One-vs-One: We create binary classifiers for each pair of classes. So, for 4
classes, we'd have 6 binary classifiers: 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, and 3 vs
4.
V.2 Prediction with one vs rest
• Class with the highest score
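A sketch of the two reductions in scikit-learn (Iris with 3 classes is my own choice); prediction with one-vs-rest picks the class whose classifier gives the highest score:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))   # 3 binary classifiers (one per class)
print(len(ovo.estimators_))   # 3 pairs here; 6 pairs for 4 classes
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```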
I.3 Priors
- Degree of belief in an event in the absence of any other information
• E.g.
• P(rain tomorrow) = 0.8
• P(no rain tomorrow) = 0.2
⁃ Probability Model:
Assumes that features are binary-valued (0 or 1).
Estimates the probability of each feature being 1 (present) or 0
(absent) given each class.
Computes class conditional probabilities using the Bernoulli
distribution.
⁃ Assumptions:
Assumes features are conditionally independent given the class, just
like the standard Naïve Bayes algorithm.
This means that the presence or absence of one feature does not
provide any information about the presence or absence of any other
feature, given the class label.
⁃ Parameter Estimation:
Parameters (probabilities) are estimated from the training data using
maximum likelihood estimation (MLE) or smoothed estimates like
Laplace smoothing to handle unseen features.
⁃ Applications:
Commonly used in text classification tasks, especially when dealing
with binary feature representations, like bag-of-words or binary term
frequency-inverse document frequency (TF-IDF) vectors.
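A tiny Bernoulli Naive Bayes sketch on a made-up binary bag-of-words matrix; alpha is the Laplace smoothing parameter mentioned above:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Rows = documents, columns = word present (1) or absent (0); labels are made up.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array(["spam", "spam", "ham", "ham"])

clf = BernoulliNB(alpha=1.0).fit(X, y)
print(clf.predict([[1, 1, 0, 1]]))
print(clf.predict_proba([[1, 1, 0, 1]]))
```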
III. Decision Trees
⁃ Internal nodes correspond to attributes (features)
Strengths:
⁃ Decision trees can be easily visualized and understood, particularly for smaller
trees, making them interpretable by non-experts.
⁃ They are invariant to the scaling of the data, as each feature is processed
separately and splits don't depend on scaling. This means no preprocessing like
normalization or standardization of features is needed.
⁃ Decision trees can handle features that are on completely different scales, or a
mix of binary and continuous features, effectively.
Weaknesses:
⁃ Even with pre-pruning, decision trees tend to overfit the training data, leading to
poor generalization performance on unseen data.
⁃ They may not perform as well as ensemble methods, such as random forests or
gradient boosting, in many applications where improved generalization performance is
crucial.
III.4 Model complexity
⁃ The complexity of the model induced by a decision tree is determined by the
depth of the tree
⁃ Increasing the depth of the tree increases the number of decision boundaries
and may lead to overfitting
⁃ Pre-pruning and post-pruning
⁃ Limit tree size (pick one):
• max_depth
• max_leaf_nodes
• min_samples_split
Pre-pruning, a common strategy to prevent overfitting in decision
trees, involves stopping the growth of the tree before it perfectly fits
the training data. This is crucial because allowing the tree to grow
until all leaves are pure can lead to highly complex models that
overfit the training data.
There are two primary pre-pruning strategies:
⁃ Limiting tree depth: By setting a maximum depth for the
tree (max_depth), we restrict the number of consecutive questions
that can be asked during tree construction. This prevents the tree
from becoming arbitrarily deep and complex. For example, setting
max_depth=4 allows only four consecutive questions to be asked
during tree construction.
⁃ Limiting the number of leaves: Another approach is to limit
the maximum number of leaves in the tree (max_leaf_nodes). This
strategy prevents the tree from growing too many branches,
thereby controlling its complexity.
Implementing pre-pruning in scikit-learn's DecisionTreeClassifier
involves specifying these parameters during model instantiation. For
instance, setting max_depth=4 in the DecisionTreeClassifier
constructor limits the depth of the tree to four levels.
⁃ Let's look at the effect of pre-pruning on the Breast Cancer
dataset. Initially, without pre-pruning, the decision tree achieves
perfect accuracy (100%) on the training set but may not generalize
well to unseen data, as indicated by the slightly lower accuracy on
the test set (93.7%). However, by applying pre-pruning with
max_depth=4, we observe a decrease in training set accuracy
(98.8%), but an improvement in test set accuracy (95.1%). This
demonstrates how pre-pruning can help prevent overfitting and
improve the generalization performance of the decision tree model.
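A sketch reproducing the pre-pruning comparison described above on the Breast Cancer dataset; exact scores depend on the train/test split, so treat the numbers in the text as indicative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                 # grown until leaves are pure
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)  # pre-pruned

print("full tree  :", full.score(X_tr, y_tr), round(full.score(X_te, y_te), 3))
print("max_depth=4:", round(pruned.score(X_tr, y_tr), 3), round(pruned.score(X_te, y_te), 3))
```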
IV. Decision tree regression
Simple models used as building blocks for designing more complex models by
combining several of them.
⁃ Voting
⁃ Bagging: Train many models on bootstrapped data, then take the average (e.g.
Random Forests) -> less variance
II.1 Voting
⁃ Build different models
• Classifiers that are most “sure” will vote with more conviction
• Classifiers will be most “sure” about a particular part of the space
• Average the result
⁃ Scikit-learn: VotingClassifier
The passage delves into the concept of voting as a method to combine the outputs
of multiple classifiers in ensemble learning. Here are the key points discussed:
⁃ Combination via Voting:
⁃ Voting involves taking a linear combination of the outputs of different
classifiers
⁃ Combination Rules:
- There are various combination rules besides simple weighted averaging, such
as median, minimum, maximum, and product rules. Each rule has its characteristics,
like robustness to outliers or pessimistic/optimistic behavior.
⁃ Normalization:
- Combination rules often require the outputs of classifiers to be normalized to
the same scale before aggregation.
⁃ Voting in Classification:
- In classification, voting involves choosing the class with the maximum number
of votes (plurality voting) or more than half of the votes (majority voting).
- Weighted voting schemes can be used if classifiers provide additional
information such as posterior probabilities.
⁃ Voting in Regression:
- In regression, voting typically involves averaging or taking the median of the
outputs of base regressors. Median is more robust to noise than the average.
⁃ Determining Weights:
• can be determined based on the accuracies of classifiers on a
separate validation set or learned from the data.
- Bayesian Interpretation:
Voting schemes can be seen as approximations under a Bayesian framework,
with weights representing prior model probabilities and model decisions
approximating model-conditional likelihoods.
- Voting for Error Reduction:
Voting reduces variance and error by averaging over multiple noisy models,
assuming that the noise functions added by individual models are uncorrelated with
zero mean.
This smoothing effect in the functional space acts as a regularizer, decreasing
variance while potentially offsetting bias introduced by individual models.
In essence, voting provides a simple yet effective way to combine the outputs of
multiple classifiers to improve predictive performance in both classification and
regression tasks.
⁃ Classification problem:
• simple majority vote (hard voting)
• highest average probability (soft voting)
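A sketch of scikit-learn's VotingClassifier; voting="hard" is the majority vote and voting="soft" averages predicted probabilities (the base models are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=5000)),
                ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
                ("nb", GaussianNB())],
    voting="soft")            # average the class probabilities
vote.fit(X, y)
print(vote.predict(X[:5]))
```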
III.2 Boosting
⁃ Fits sequentially multiple weak learners adaptively
Stacking ensembles
- Fitting a stacking ensemble
- Steps:
o Split the training data in two folds
o Choose L weak learners and fit them to data of the first fold
o For each of the L weak learners, make predictions for observations in
the second fold
o Fit the meta-model on the second fold, using predictions made by the
weak learners as inputs
- Limitation: Only half of the data to train the base models and half of the data
to train the meta-model.
- Solution: “k-fold cross-training” approach (similar to what is done in k-fold
cross-validation) such that all the observations can be used to train the meta-
model.
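A hedged sketch with scikit-learn's StackingClassifier, whose cv argument implements the "k-fold cross-training" idea so all observations contribute to the meta-model (base models are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Out-of-fold predictions of the weak learners are used to fit the meta-model.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=5000),
    cv=5)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```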
IV.Random Forests
⁃ Trees can be shallow (few levels) or deep (many levels, if not fully grown).
⁃ Shallow trees: less variance but higher bias; a better choice for sequential
methods (boosting).
⁃ Deep trees: low bias but high variance; a better choice for bagging methods, which
are mainly focused on reducing variance.
⁃ The random forest approach is a bagging method where deep trees, fitted on
bootstrap samples, are combined to produce an output with lower variance.
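A minimal random forest sketch (dataset and number of trees are my own choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Many deep trees, each fit on a bootstrap sample with a random feature subset
# per split; averaging their votes reduces the variance of individual trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```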
I. Generalization performance
- A model should always be evaluated on independent test data.
• A model's performance on unseen data will give us the generalization performance
of the model.
Refers to how well an agent can extend its learned knowledge to new, unseen
situations. Here's a closer look at the factors influencing generalization performance:
- Quality of Function Approximation: The choice of function approximation
method (such as neural networks, decision trees, or radial basis functions)
greatly influences generalization performance. A good approximation method
should capture the underlying structure of the problem space, allowing for
accurate estimation of Q-values or state values across different states and
actions.
- Representation of States and Actions: The representation of states and
actions plays a crucial role in generalization. Effective feature representation
can highlight similarities between different states and actions, enabling the
agent to generalize its knowledge more effectively. Proper feature engineering
or representation learning techniques can enhance generalization
performance.
- Training Data: The quality and diversity of the training data also impact
generalization. A diverse training set that covers a wide range of states and
actions allows the agent to learn robust policies that generalize well to unseen
situations. Insufficient or biased training data may lead to poor generalization
performance.
- Regularization: Regularization techniques, such as weight decay or dropout
in neural networks, can help prevent overfitting and improve generalization
performance. By discouraging overly complex models, regularization
techniques promote simpler models that generalize better to unseen data.
- Exploration Strategy: The exploration strategy employed by the agent during
training can influence generalization. Balancing between exploration and
exploitation allows the agent to gather diverse experiences, which can help in
learning more robust policies that generalize well.
- Task Complexity: The complexity of the reinforcement learning task also
affects generalization performance. More complex tasks may require more
sophisticated function approximation methods and richer feature
representations to achieve good generalization.
- Hyperparameter Tuning: Proper tuning of hyperparameters, such as
learning rate, batch size, and network architecture, is essential for achieving
optimal generalization performance. Hyperparameters significantly impact the
training dynamics and the ability of the agent to generalize its learned
knowledge.
Bias: how far, on average, are the model’s predictions from the
correct value ?
High bias & High variance: we predict incorrectly and our prediction is noisy ← worst
of everything
Reducible error:
• Bias²: How much does the average of the estimate deviate from the true mean?
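The standard bias-variance decomposition behind these bullets, written out for squared-error loss (my addition):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```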
II.1 Cross-Validation
- Splitting the data in a training and test set multiple times
- Each partition serves as the test set once, while all other partitions serve as
training set
Benefits:
- Leaves less to luck: if we get a very good or bad training set by chance, this will
show in the results → the performance will be an outlier
Disadvantages:
- Increased computational cost
- Simple cross-validation can result in class imbalance between training and test
sets
- Can be useful to find out which items are regular and irregular from the point-
of-view of the dataset.
In summary, shuffle-split cross-validation offers flexibility and control over the cross-
validation process, allowing for experimentation with different data proportions and
subsampling strategies. It can be particularly useful for large datasets and when fine-
tuning model parameters.
- Simple grid search: try all possible combinations of chosen parameter values
This is a crucial insight into the importance of properly splitting the data into training,
validation, and test sets, especially when performing model selection and parameter
tuning. Here's a breakdown of the key points:
⁃ Overly optimistic evaluation: Selecting the best model based on
performance on the test set can lead to overly optimistic estimates of the model's
generalization performance. This is because the test set was used to adjust the
parameters, making it no longer independent for evaluation.
⁃ Threefold split(as shown in the figure): To address this issue, it's
recommended to split the data into three sets: the training set to build the model, the
validation set to select the parameters, and the test set to evaluate the performance
of the selected parameters.
⁃ Implementation: After splitting the data into training/validation and test sets,
a loop is used to iterate over different combinations of parameters (e.g., gamma and
C for an SVM model). For each combination, an SVM model is trained on the training
data and evaluated on the validation set. The parameters leading to the best
performance on the validation set are selected.
⁃ Final evaluation: Once the best parameters are determined using the
validation set, a final model is trained on the combined training and validation sets.
This model is then evaluated on the test set to obtain an unbiased estimate of its
performance on unseen data.
⁃ Importance of separate test set: Keeping a separate test set ensures that
the final evaluation is unbiased and provides a realistic estimate of how well the
model generalizes to new data. Using the test set for any exploratory analysis or
model selection can lead to information leakage and overly optimistic results.
By following this approach, practitioners can make more informed decisions about
model selection and parameter tuning while ensuring reliable estimates of model
performance on unseen data.
This figure illustrates the process of parameter selection and model evaluation using
grid search with cross-validation.
⁃ Grid Search with Cross-Validation Steps: The figure outlines the steps
involved in grid search with cross-validation. It begins with the definition of the
parameter grid, which specifies the hyperparameters and their respective values to
be evaluated.
⁃ Cross-Validation Evaluation: Each combination of hyperparameters is
evaluated using cross-validation. The dataset is divided into multiple folds, and the
model is trained and validated on different subsets of the data. The mean
performance across all folds is computed for each parameter combination.
⁃ Selection of Best Parameters: The parameter combination that yields the
highest mean performance score during cross-validation is selected as the optimal
set of hyperparameters.
⁃ Retraining with Optimal Parameters: Finally, a new model is trained using
the entire training dataset and the optimal hyperparameters. This model is then
evaluated on the test set to assess its generalization performance.
⁃ Visualization: The visualization likely includes representations of the
parameter grid, cross-validation process, and the selection of the best parameters
based on mean performance scores.
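A sketch of the whole procedure with scikit-learn's GridSearchCV (parameter values are my own choices); cross-validation on the training data replaces the explicit validation set, and the best model is refit on all training data before being scored once on the held-out test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.0001, 0.001, 0.01]}

grid = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination with 5-fold CV
grid.fit(X_trainval, y_trainval)
print(grid.best_params_, round(grid.best_score_, 3))
print("test set score:", round(grid.score(X_test, y_test), 3))
```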
The image shows the quantities used to compute binary classification metrics: True
Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
These terms are defined in a confusion matrix, which is a table that shows the
number of correct and incorrect predictions made by a classification model.
⁃ True Positive (TP) is the number of items correctly classified as positive.
⁃ True Negative (TN) is the number of items correctly classified as negative.
⁃ False Positive (FP) is the number of items incorrectly classified as positive.
⁃ False Negative (FN) is the number of items incorrectly classified as negative.
The text in the image shows how TPR, TNR, Precision, Recall, and F1 Score are
calculated using these values.
⁃ Accuracy is the most common metric used in binary classification, though it can be
misleading in some cases. It is simply the ratio of correct predictions to total
predictions. In the image, it is calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision measures the ratio of true positives to the total number of predicted
positives. In other words, it tells you how many of the items you predicted as
positive actually were positive. The image shows it being calculated as:
Precision = TP / (TP + FP)
⁃ Recall measures the ratio of true positives captured to the total number of actual
positives. In other words, it tells you what proportion of the actual positives were
identified by your model. The image shows it being calculated as:
Recall = TP / (TP + FN)
• F1 Score is the harmonic mean of precision and recall. It is a way of combining
these two metrics into a single score. A value of 1 means perfect precision and
recall, while a value of 0 means either precision or recall is zero. The image shows
it being calculated as:
F1 Score = 2 * Precision * Recall / (Precision + Recall)
These metrics can be used to evaluate the performance of a binary classification
model. A good model will have high values for all of these metrics. However, in some
cases, it may be more important to optimize for one metric over another.For
example, if you are trying to classify emails as spam or not spam, it may be more
important to have a high recall (so that you don't miss any spam emails) than a high
precision (since some non-spam emails may be accidentally classified as spam).
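A small worked example of the formulas above with scikit-learn's metrics (the labels are made up):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                    # TP=3, TN=5, FP=1, FN=1
print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN) = 0.8
print(precision_score(y_true, y_pred))   # TP/(TP+FP) = 0.75
print(recall_score(y_true, y_pred))      # TP/(TP+FN) = 0.75
print(f1_score(y_true, y_pred))          # 2PR/(P+R) = 0.75
```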
This image depicts problems with accuracy in binary classification tasks. It shows a
confusion matrix which compares the number of predicted labels to the number of
true labels.
In the ideal scenario, all the values would be on the diagonal, meaning the model
perfectly predicts all positive and negative labels. In the image, we can see that this
is not the case. There are values off the diagonal, which means the model is making
mistakes.
Specifically, the model is making both False Positive (FP) and False Negative (FN)
errors.
False Positives (FP) are instances where the model predicted a positive label, but
the true label was negative. In the image, the model predicted 10 positives that were
actually negative.
False Negatives (FN) are instances where the model predicted a negative label, but
the true label was positive. In the image, the model predicted 10 negatives that were
actually positive.
These errors will cause the model to have a lower accuracy than it potentially could.
There are ways to improve the accuracy of a binary classification model. Here are a
few:
Collect more data: Training a model on more data can help it to better learn
the patterns in the data and improve its ability to classify new data points.
Use a different model architecture: Some model architectures are better
suited for binary classification tasks than others. Experimenting with different
architectures can help you to find one that works well for your specific task.
Tune the hyperparameters of your model: The hyperparameters of a model
are the settings that control how it learns. Tuning these hyperparameters can
help you to improve the performance of your model.
Model y_pred_1 has the worst performance among the three models. It has
10 False Negatives (FN) and 0 True Positives (TP). This means the model did
not identify any of the actual positive labels, and incorrectly classified 10
positive labels as negative.
Model y_pred_2 also has some FN errors (10), but it also correctly classified
some of the positive labels (90). It also made some False Positive (FP) errors
(10).
Model y_pred_3 has the best performance among the three models. It has
the highest number of True Positives (TP) (98) and the lowest number of
errors (FN=2, FP=0).
Overall, the confusion matrix shows that all three models are making classification
errors. However, model y_pred_3 performs the best out of the three with the highest
accuracy.
- Used when developing a new model and it is not entirely clear what operating
point/threshold to use.
o Allows analysis of the trade-off between precision and
recall for different classification thresholds.
o Each point on the curve corresponds to a specific
threshold of the decision function.
o Starts at the top-left corner (low threshold, high recall,
low precision) and moves towards the bottom-right
corner (high threshold, low recall, high precision).
o Closer proximity to the upper-right corner indicates
better classifier performance.
- Operating Point:
o Refers to setting a specific threshold to meet a desired
precision or recall value.
o Helps in making performance guarantees and aligning
with business objectives.
- Average Precision:
o Computed as the area under the precision-recall curve.
o Provides a summary measure of classifier performance
across all possible thresholds.
o Ranges from 0 (worst) to 1 (best), with higher values
indicating better performance.
- Comparison:
o Precision-recall curves allow detailed comparison of
classifiers' performance across various thresholds.
o Average precision score provides a quantitative measure
of classifier performance, aiding in automatic model
comparison.
- Example:
o Comparison between SVM and random forest classifiers
using precision-recall curves revealed nuanced
differences in performance at different operating points.
o Average precision scores showed similar performance
between the classifiers, contrary to the results obtained
from the F1-score.
- The closer the curve is to the upper-right corner, the better the classifier
- For the ROC curve, the ideal curve is close to the top left: you want a
classifier that produces a high recall while keeping a low false positive rate.
⁃ Classification Report:
⁃ Computes precision, recall, and F1-score for each class.
⁃ Precision: Proportion of true positive predictions out of all positive predictions.
⁃ Recall: Proportion of true positive predictions out of all actual positives.
⁃ F1-score: Harmonic mean of precision and recall, provides a balance between
the two.
⁃ Commonly used to evaluate multiclass classification models.
III.2 Micro and macro F1
- Macro-average F1: Average F1 scores over classes (“all classes are equally
important”)
- Micro-average F1: Make one binary confusion matrix over all classes, then
compute recall, precision once (“all samples are equally important”)
o Derives binary F-scores per class, treating each class as the positive
class and others as negative.
o Can be averaged using "macro," "weighted," or "micro" averaging
strategies:
"Macro" averaging: Unweighted mean of per-class F-scores.
"Weighted" averaging: Mean of per-class F-scores, weighted by
class support.
"Micro" averaging: Computes precision, recall, and F-score
using total counts over all classes.
- Helps to assess model performance across all classes.
- Comparison:
o Accuracy, confusion matrix, and classification report offer insights into
overall and per-class performance.
o Multiclass F-score provides a comprehensive measure of model
performance across all classes, considering both precision and recall.
o Choice of evaluation metric depends on the specific characteristics of
the dataset and the goals of the classification task, particularly
considering class distribution and the importance of individual classes.
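A small sketch of the three averaging strategies using scikit-learn's f1_score (the label arrays are hypothetical):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1, 0]

# "macro": unweighted mean of per-class F1 ("all classes are equally important")
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
# "weighted": mean of per-class F1, weighted by class support
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
# "micro": one global confusion matrix over all classes ("all samples are equally important")
print("micro F1:   ", f1_score(y_true, y_pred, average="micro"))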
- Regression metrics:
o R2 (Coefficient of Determination) - easy-to-understand scale:
measures the proportion of the variance in the dependent
variable (target) that is predictable from the independent
variables (features).
o MSE (Mean Squared Error) and MAE (Mean Absolute Error):
average squared / absolute difference between predicted and
true values.
⁃ Choosing a Metric:
⁃ R2 is commonly used as it provides a clear indication of how
well the model explains the variance in the target variable.
⁃ MSE and MAE are useful for understanding the magnitude of
errors but may not provide as direct a measure of model fit as R2.
⁃ Business decisions may sometimes be based on MSE or MAE,
particularly if specific cost functions are involved.
⁃ The choice of metric ultimately depends on the specific
requirements of the problem and the preferences of stakeholders.
⁃ When using the scikit-learn “scoring” parameter, error metrics are negated
(e.g. “neg_mean_squared_error”, “neg_mean_absolute_error”) so that higher
scores are always better.
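A hedged sketch of the regression metrics and the negated scoring names in scikit-learn; the Ridge model and the synthetic data are placeholders, not taken from the notes:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical regression data, just to illustrate the metric calls
X, y = make_regression(n_samples=200, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge().fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print("R2: ", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))

# With the scoring interface, error metrics are negated so that higher is better
print(cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=5))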
IV. Imbalanced data
- Sources:
• Asymmetric cost
• Asymmetric data
- Approaches:
• Add samples (oversampling)
• Remove samples (undersampling)
- Imbalanced datasets occur when one class is significantly more frequent than
others (e.g., click-through prediction where most impressions don't result in
clicks).
- Accuracy can be misleading for imbalanced data. A simple model that always
predicts the majority class can achieve high accuracy without being
informative.
- Example: Classifying digit "9" vs all others in the digits dataset creates a 9:1
imbalance.
- A DummyClassifier predicting only the majority class ("not nine") achieves
nearly 90% accuracy, highlighting the limitations of accuracy.
- Other classifiers like DecisionTreeClassifier might not show significant
improvement over the dummy model based on accuracy alone.
- Alternative metrics are needed to evaluate models on imbalanced datasets
effectively. These metrics should penalize models that simply predict the
majority class.
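The digits example above can be reproduced roughly as follows (the max_depth=2 decision tree is an illustrative choice):

from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
# "Nine" vs "not nine" gives roughly a 1:9 class imbalance
y = digits.target == 9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, y, random_state=0)

# Always predicts the majority class ("not nine")
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy accuracy:", dummy.score(X_test, y_test))   # close to 0.90

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print("tree accuracy: ", tree.score(X_test, y_test))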
IV.1 Random undersampling
- Technique: Random undersampling involves removing data points from the
majority class randomly.
- Goal: This technique aims to balance the class distribution in a dataset with a
heavily represented majority class.
- Process: It typically involves removing data points from the majority class
until the desired balance is achieved (often aiming for a class ratio close to
1:1).
- Advantages:
o Speed: Random undersampling is a very fast technique as it simply
removes data points, reducing training time.
o Efficiency: In some cases, it can lead to efficient training, especially
when dealing with large datasets where the majority class dominates.
- Disadvantages:
o Data Loss: A major drawback is the loss of potentially valuable data
from the majority class. This discarded data might contain useful
information for the model.
o Information Loss: Removing data points can lead to a loss of
information about the majority class distribution, potentially affecting
the model's generalizability.
- In summary, random undersampling is a quick and easy approach to address
imbalanced datasets, but it comes at the cost of potentially discarding
valuable data.
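A minimal NumPy sketch of random undersampling; in practice a library such as imbalanced-learn (RandomUnderSampler) is typically used, but the core idea is just this:

import numpy as np

def random_undersample(X, y, random_state=0):
    """Randomly drop samples so every class keeps only as many samples
    as the minority class (a roughly 1:1 ratio)."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority_size = counts.min()

    keep_idx = []
    for c in classes:
        idx = np.where(y == c)[0]
        # Keep all minority samples, a random subset of the larger classes
        keep_idx.append(rng.choice(idx, size=minority_size, replace=False))
    keep_idx = np.concatenate(keep_idx)
    return X[keep_idx], y[keep_idx]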
Benefits of PCA:
- Reduced complexity: lower-dimensional data requires less computation and
storage.
- Improved interpretability: visualizing data in a lower-dimensional space can
reveal patterns and relationships more easily.
- Potential for better classification: by removing redundant features, PCA can
lead to more accurate classification models (especially when dealing with the
curse of dimensionality).
Limitations of PCA:
- PCA is sensitive to outliers, which can influence the calculation of
eigenvectors.
- It assumes linear relationships between features. Non-linear relationships
might not be captured effectively.
Overall, PCA is a powerful tool for dimensionality reduction in machine learning. By
understanding its concepts and applications, you can leverage it to improve the
performance and interpretability of your models.
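A short PCA sketch on scikit-learn's breast cancer dataset (the choice of dataset and n_components=2 are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
# Scaling first matters: PCA directions are driven by variance
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (569, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component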
- Benefits:
- Dimensionality Reduction: Working with a lower-dimensional latent space
allows for easier analysis, visualization, and manipulation of the data.
- Efficiency: Machine learning algorithms often perform better when dealing
with less data. Latent spaces can significantly reduce the computational cost
of processing complex datasets.
- Uncovering Hidden Structure: The process of finding the latent space can
reveal underlying patterns or relationships within the data that might not be
obvious in the original high-dimensional space.
- Applications:
- Recommendation Systems: Latent spaces can be used to model user
preferences and item characteristics, enabling recommendation systems to
suggest relevant items to users based on their past behavior.
- Image and Text Analysis: In image processing, latent spaces can be used
for tasks like image compression, denoising, and object
recognition. Similarly, in text analysis, latent spaces can help identify topics in
documents or group similar documents together.
- Anomaly Detection: Deviations from the expected patterns in the latent
space can indicate anomalies or outliers in the data.
- Things to Consider:
- Choosing the Right Technique: Different techniques like NMF and PCA
have their strengths and weaknesses depending on the data type and desired
outcome.
- Interpretability: While latent spaces capture important
information, interpreting the specific meaning of each dimension in the latent
space can be challenging.
- Understanding latent space is crucial for grasping how techniques like latent
factorization work and how they help us unlock the hidden potential within
complex datasets.
PCA:
- Works on any (numerical) data type.
- Components (principal components) can have negative and positive values,
making interpretation less straightforward.
- Focuses on capturing the most variance in the data, regardless of the
underlying additive structure.
- Broadly applicable for dimensionality reduction, feature extraction, and
anomaly detection.
- General dimensionality reduction: PCA is a good starting point.
NMF:
- Designed for non-negative data (e.g., image pixels, word frequencies).
- Components (basis vectors) are non-negative, offering easier interpretability
in terms of the original data.
- Aims to identify parts or basis elements that can be additively combined to
reconstruct the data.
- Particularly useful for tasks like image compression, topic modeling in
documents, music source separation, and analyzing recommender systems.
- Non-negative data with interpretable parts: NMF is a strong choice.
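A small sketch contrasting the two on the digits data (non-negative pixel intensities); the number of components is an arbitrary illustrative choice:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, NMF

X = load_digits().data   # pixel intensities are non-negative

pca = PCA(n_components=10).fit(X)
nmf = NMF(n_components=10, init="nndsvd", max_iter=500, random_state=0).fit(X)

# PCA components mix positive and negative values; NMF components do not
print(pca.components_.min() < 0)     # True
print((nmf.components_ >= 0).all())  # True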
- Manifold learning methods (e.g., t-SNE) can lead to more informative and
visually separable clusters in the lower-dimensional space.
- They allow for much more complex mappings than PCA and often provide
better visualizations.
- They learn the underlying “manifold” structure of the data and use it for
dimensionality reduction.
IV.2 Pros and cons
- For visualization only
- Axes don’t correspond to anything in the input space.
- Often can’t transform new data.
- Pretty pictures!
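A minimal t-SNE sketch; note that TSNE in scikit-learn only offers fit_transform, which matches the "often can't transform new data" point above:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# For visualization only: no transform() for new data, and the resulting
# axes have no meaning in the original input space.
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(digits.data)
print(X_embedded.shape)   # (1797, 2)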
I. Clustering
- An alternative to parametric methods for density estimation.
- Parametric methods assume a known distribution for the data, like Gaussian.
- Clustering relaxes the assumption of a single distribution and allows for a
mixture of distributions.
- This is useful when the data has multiple groups, like different writing styles
for the digit 7.
- Semiparametric approach assumes a parametric model for each group in the
data.
- Nonparametric approach: makes no assumptions about the data structure.
⁃ Parametric approach:
⁃ Assumes sample comes from a known distribution.
⁃ Typically assumes a parametric family, like Gaussian.
⁃ Problem reduces to estimating a small number of parameters.
⁃ Limitations of parametric approach:
⁃ Rigid parametric models can introduce bias.
⁃ Not suitable for applications where data doesn't fit assumptions.
⁃ Introduction to clustering:
⁃ Offers a semiparametric approach for situations where strict parametric
assumptions don't hold.
⁃ Allows for learning mixture parameters from data.
⁃ Discusses probabilistic modeling, vector quantization, and hierarchical
clustering.
⁃ However, clustering can be a hard problem for a number of reasons, even
though the concept itself seems straightforward.
- Data partitioning: divide the data into groups before further processing,
e.g., by assigning new samples to existing clusters.
I.6 Restriction of Cluster Shapes
- Clusters are the Voronoi diagram of the cluster centers, so k-means can only
produce convex cluster shapes.
I.8 MiniBatchKMeans
- Mini-batches are subsets of the input data, randomly sampled in
each training iteration.
- Algorithm (informally): in each iteration, draw a random mini-batch, assign
its points to the nearest centroid, and update only those centroids with a
per-centroid learning rate.
- Examples:
• partitioning low-dimensional space (similar to using basis functions)
• extracting features from high-dimensional spaces
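A short sketch comparing KMeans and MiniBatchKMeans on synthetic blobs (the batch size and number of clusters are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans

# Hypothetical data; batch_size controls the size of each random mini-batch
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                      random_state=0).fit(X)

# Mini-batch is faster but usually ends with slightly worse inertia
print(km.inertia_, mbk.inertia_)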
- Agglomerative (hierarchical) clustering: repeatedly find the two closest (most
similar) clusters and join them.
- First figure
- Data points are numbered 0 to 11 at the bottom.
- The dendrogram shows how these points are merged into clusters.
- For example, points 1 and 4 are merged first, followed by points 6 and 9, and
so on.
- The longest branches (marked by a dashed line) indicate that going from three
clusters to two would merge very dissimilar groups, so three clusters is a
natural choice here.
- Each row shows a dendrogram, a tree-like structure that depicts how data
points are hierarchically clustered. The leaves of the dendrogram are the
individual data points, and the branches show the merging steps that combine
clusters. The shorter the branch at which two clusters are joined, the more
similar the merged clusters are.
- The rightmost column of the dendrogram shows how many data points are in
each cluster.
- Second figure
- You can "cut" the dendrogram at any level to define a specific number of
clusters.
- The dendrogram itself doesn’t tell you the optimal number of clusters. It shows
all possible mergings of data points into clusters at different distances. You can
choose a cutoff point on the dendrogram based on your chosen criteria to
identify the desired number of clusters.
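A minimal sketch of building and cutting such a dendrogram with SciPy's ward linkage (the toy blob data is a placeholder for the 12 points in the figure):

from scipy.cluster.hierarchy import dendrogram, ward, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=12, random_state=0)

# Ward linkage: repeatedly join the two clusters whose merge increases
# the within-cluster variance the least
linkage_array = ward(X)
dendrogram(linkage_array)
plt.ylabel("Cluster distance")
plt.show()

# "Cutting" the dendrogram so that at most 3 flat clusters remain
labels = fcluster(linkage_array, t=3, criterion="maxclust")
print(labels)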
I.5 Pros and Cons
Pros:
⁃ Topology Flexibility: hierarchical clustering can accommodate various types of
input topologies, including those defined by graphs such as neighborhood
graphs. This flexibility allows for the clustering of data with complex
relationships and structures.
⁃ Efficiency with Sparse Connectivity: hierarchical clustering algorithms can be
efficient when dealing with datasets characterized by sparse connectivity, where
only a subset of data points are connected. This makes it suitable for data with
irregular or non-uniform distributions.
⁃ Holistic View: hierarchical clustering provides a holistic view of the data by
capturing the hierarchical relationships between clusters at different levels of
granularity. This comprehensive perspective can aid in understanding the
underlying structure of the data and can assist in making informed decisions
about the number of clusters.
Cons:
⁃ Imbalanced Cluster Sizes: depending on the chosen linkage criteria,
hierarchical clustering algorithms may lead to imbalanced cluster sizes. Some
linkage methods tend to favor the formation of clusters with uneven numbers of
data points, which can impact the interpretability and usability of the clustering
results.
⁃ Computational Complexity: hierarchical clustering algorithms can be
computationally intensive, especially for large datasets or when using certain
linkage criteria that involve pairwise distance computations. This complexity can
result in longer processing times and may limit scalability for very large
datasets.
⁃ Subjectivity in Interpretation: the hierarchical nature of clustering can
sometimes make it challenging to interpret the results objectively. Deciding on
the appropriate level of granularity and determining the optimal number of
clusters can be subjective and may require expert judgment, potentially leading
to biased interpretations.
Overall, while hierarchical clustering offers advantages such as flexibility in handling
various data topologies and providing a holistic view of the data, it also has
limitations related to computational complexity, potential for imbalanced cluster
sizes, and subjectivity in interpretation. These factors should be carefully considered
when applying hierarchical clustering methods in practice.
- DBSCAN (density-based clustering):
- Density: the number of sample points within a specified radius r (epsilon).
- Core point: a sample with more than a specified number of points
(min_samples) within epsilon (these are points in the interior of a cluster).
- Border point: a point that lies within epsilon of a core point but does not itself
have min_samples neighbours.
- Noise point: any point that is not a core point or a border point.
- Finds core samples of high density and expands clusters from them.
- Steps (informally): pick an unvisited point; if it has at least min_samples
neighbours within epsilon, it becomes a core point and starts a new cluster; all
points density-reachable from it (neighbours of core points, and their core
neighbours in turn) are added to that cluster; points reachable from no core
point are labelled noise.
- Limitations
o Varying densities
o High-dimensional data
Pros:
⁃ Automatic Determination of Clusters: DBSCAN does not require the user to
specify the number of clusters beforehand. Instead, it identifies clusters based
on the density of data points in the feature space. This makes it particularly
useful for datasets where the number of clusters is not known a priori.
⁃ Ability to Capture Complex Shapes: unlike some other clustering algorithms,
DBSCAN can identify clusters of arbitrary shapes and sizes. This flexibility
allows it to effectively handle datasets with non-linear boundaries and irregularly
shaped clusters.
⁃ Robust to Noise: DBSCAN can distinguish between points that belong to
clusters and outliers, or noise points. Points that do not belong to any cluster
are labeled as noise, providing a clear indication of data points that do not fit
well into any cluster.
⁃ Scalability: despite being slightly slower than some other clustering algorithms
like k-means, DBSCAN is still scalable and can handle relatively large datasets
efficiently.
Cons:
⁃ Sensitive to Parameters: the performance of DBSCAN can be sensitive to the
choice of its two main parameters: eps (epsilon) and min_samples. Selecting
appropriate values for these parameters can require some experimentation and
domain knowledge.
⁃ Difficulty with Varying Density: DBSCAN may struggle with datasets where the
density of data points varies significantly across the feature space. In such
cases, choosing a suitable value for the epsilon parameter can be challenging,
as it needs to capture the density variations appropriately.
⁃ Memory and Computational Requirements: while DBSCAN is generally
efficient, it can be memory-intensive for large datasets, especially when dealing
with high-dimensional data. Additionally, the computational complexity of
DBSCAN may increase significantly as the dataset size grows.
⁃ Handling High-Dimensional Data: like many clustering algorithms, DBSCAN's
performance can degrade in high-dimensional spaces due to the curse of
dimensionality. Preprocessing techniques such as dimensionality reduction may
be needed to mitigate this issue.
Despite these limitations, DBSCAN remains a popular choice for clustering tasks,
especially when dealing with datasets where the number of clusters is not known in
advance or when clusters exhibit complex shapes and densities.
- Types of points:
o Solid colors: Represent points that belong to clusters.
o White: These are noise points, which means they are not assigned to
any cluster.
o Large markers: These indicate core points.
o Small markers: These represent boundary points, which are points
that are within the eps radius of a core point but are not core points
themselves.
- The image shows four different plots, each representing the result of running
DBSCAN with a different combination of eps and min_samples values. By
looking at the way the points are clustered and colored in each plot, we can
see the impact of these different parameter settings.
- Here’s a more detailed explanation of each plot:
- Top left (eps=1.0, min_samples=2): In this plot, most of the data points are
classified as noise (white) because the eps value (1.0) is too small. This
means that the neighborhood radius around each point is very small, and
there aren’t enough points within that radius to satisfy the minimum samples
requirement (2) to be considered a core point.
- Top right (eps=1.5, min_samples=2): Here, we can see the formation of two
distinct clusters (solid colors), but there are still some noise points (white)
present. This is because while increasing the eps value (to 1.5) allows more
points to be considered neighbors, it’s not enough to capture all the dense
regions in the data.
- Bottom left (eps=2.0, min_samples=2): In this plot, all the data points are
clustered (solid colors), but some clusters appear to have merged
together. This is because the eps value (increased to 2.0) is now too
large, causing too many points to be considered neighbors, even those that
belong to separate dense regions.
- Bottom right (eps=3.0, min_samples=2): Here, all the data points are again
clustered, but this time into a single large cluster (solid color). This is because
the eps value (3.0) is very large, causing almost all points to be considered
neighbors, effectively ignoring the presence of multiple dense regions in the
data.
- By observing these plots, we can see how the choice of eps and min_samples
can significantly affect the clustering results in DBSCAN. It’s important to find
the right balance between these parameters to identify the desired number
and shapes of clusters within your data.
- Key points about the image:
o Points that are part of clusters are shown in solid colors.
o Noise points, which are points not assigned to any cluster, are shown
in white.
o Large markers represent core points, which are densely surrounded by
other points.
o Small markers represent boundary points, which are located close to
core points but aren’t considered core points themselves.
o As the value of eps increases (left to right), more points are included in
clusters, potentially causing multiple clusters to merge into one.
o As the value of min_samples increases (top to bottom), fewer points
are classified as core points, and more points are labeled as noise.
- Overall, the figure depicts how the selection of eps and min_samples can
influence the results of DBSCAN clustering
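A small sketch of how eps changes the DBSCAN result on a toy two-moons dataset (the eps values here are illustrative, not the ones from the figure):

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two half-moon shapes: non-convex clusters that k-means handles poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

for eps in [0.1, 0.3, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Label -1 marks noise points; the rest are cluster indices
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: clusters={n_clusters}, noise points={(labels == -1).sum()}")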
- Non-convex optimization: both k-means and GMM optimize non-convex
objectives, so the result depends on initialization.
GMM:
⁃ Assumption of Distribution: GMM assumes that the data is generated from a
mixture of several Gaussian distributions, each representing a cluster. It allows
for clusters of different shapes and sizes by modeling the data as a combination
of these Gaussian distributions.
⁃ Soft Clustering: GMM performs soft clustering, which means that it assigns to
each data point a probability of belonging to each cluster, rather than assigning
it to a single cluster. This makes GMM more flexible in capturing the uncertainty
and overlap between clusters.
⁃ Cluster Shape: GMM can model clusters with different shapes, including
elongated or elliptical shapes, by adjusting the covariance matrix of the
Gaussian distributions.
⁃ Complexity: GMM is more computationally complex compared to k-means,
especially when dealing with high-dimensional data or a large number of
clusters.
⁃ Number of Clusters: GMM does not require specifying the number of clusters a
priori, but it needs to be initialized with an initial guess of the number of
components. Techniques like the Bayesian Information Criterion (BIC) or Akaike
Information Criterion (AIC) can be used to estimate the optimal number of
components.
K-Means:
⁃ Assumption of Similarity: k-means assumes that the data can be partitioned
into k clusters, each represented by its centroid. It minimizes the sum of
squared distances between data points and their respective cluster centroids.
⁃ Hard Clustering: k-means performs hard clustering, meaning that each data
point is assigned to exactly one cluster, with no notion of uncertainty or
probability.
⁃ Cluster Shape: k-means assumes that clusters are spherical and isotropic,
which means that it may struggle with clusters of non-spherical shapes or
varying sizes.
⁃ Simplicity: k-means is computationally simpler and more efficient compared to
GMM, making it suitable for large datasets and high-dimensional data.
⁃ Number of Clusters: k-means requires specifying the number of clusters (k)
beforehand, which can be a drawback if the optimal number of clusters is not
known in advance. Techniques like the elbow method or silhouette score can
help in choosing an appropriate value for k.
⁃ Use GMM when the underlying distribution of the data is not well
approximated by spherical clusters, and when there is uncertainty or overlap
between clusters.
⁃ Use k-means when the data is well-separated and can be partitioned into
spherical clusters, and when computational efficiency is a concern.
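A brief sketch of both models on stretched (non-spherical) blobs, showing GMM's soft assignments next to k-means' hard labels; the data transformation is an illustrative assumption:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Stretch the blobs so the clusters become elongated (non-spherical)
X, _ = make_blobs(n_samples=500, centers=3, random_state=170)
X = np.dot(X, np.random.RandomState(74).normal(size=(2, 2)))

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X)
gmm_labels = gmm.predict(X)        # hard assignment
gmm_probs = gmm.predict_proba(X)   # soft assignment: per-cluster probabilities

print(gmm_probs[:3].round(3))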
- Two common heuristics for choosing the number of clusters: the elbow plot
and the silhouette coefficient.
A) Elbow plot
- The elbow point essentially represents the point where the additional benefit
of creating new clusters starts to diminish. It's a trade-off between:
- Capturing more variance: More clusters might capture more specific
patterns in the data.
- Model complexity: Having too many clusters can lead to overfitting and a
less generalizable model.
- Computes the sum of squared distance (SSE) between data points and their
assigned clusters’ centroids.
- Pick the desired number of clusters at the spot where the SSE starts to flatten
out, forming an elbow.
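A minimal elbow-plot sketch using KMeans.inertia_ as the SSE (the synthetic blobs are a placeholder dataset):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

# inertia_ is the sum of squared distances (SSE) to the closest centroid
sse = []
ks = range(1, 11)
for k in ks:
    sse.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()   # pick k where the curve bends ("elbow")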
B) Silhouette Coefficient
- Interpretation:
- A higher average silhouette coefficient across all data points indicates a better
clustering solution.
- It helps identify potential outliers or misclassified points that might have a low
silhouette coefficient.
- A function S that measures the separation between two clusters, c1 and c2.
- How can we measure the goodness of a clustering C = {c1, ..., cl} using the
separation function S?
- The coefficient takes values in the interval [-1, 1]:
• If ≈ 0: the sample is very close to the neighboring clusters.
• If ≈ 1: the sample is far away from the neighboring clusters.
• If ≈ -1: the sample is assigned to the wrong cluster.
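A short sketch computing the silhouette coefficient with scikit-learn for several values of k (toy blob data, illustrative k values):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean over samples of (b - a) / max(a, b), where a is the mean intra-cluster
    # distance and b is the mean distance to the nearest other cluster
    print(k, silhouette_score(X, labels))

# Per-sample values help spot outliers or misassigned points
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_samples(X, labels)[:5])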