
Introduction to Machine Learning, part III


August 2022
References
• Slides closely follow Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
• Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
• Official documentation and the user guide for Scikit-Learn are
also fantastic.

Prof. David R. Pugh


Dimensionality
Reduction



Dimensionality Reduction
Curse of Dimensionality: 3D intuition fails when applied to higher dimensions

• High-dimensional datasets are often very sparse: most training instances are likely to be far away from each other.
• A new instance will likely be far away from any training instance => harder to generalize well.
• High-dimensional datasets are prone to overfitting.
• More data? The data required to achieve a given density of coverage grows exponentially with the number of dimensions.
Projection
3D dataset lying "close" to 2D subspace Project down from 3D to 2D



Manifold Learning
Classic Swiss Roll Dataset Projection (left) vs Unrolling (right)



Manifold Learning
• With manifold learning you try to learn the lower dimensional
space that "best" represents the higher dimensional data.
• Relies on manifold hypothesis: most real-world high-
dimensional datasets lie close to a much lower-dimensional
manifold. Assumption is often observed empirically.
• Another implicit assumption: classification or regression task
will be "easier" if expressed in the lower-dimensional space of
the manifold. Does not always hold in practice.



Principal Components Analysis
(PCA)
What is PCA? Selecting the subspace to project on

• Find the lower-dimensional hyperplane "closest" to the higher-dimensional data.
• Project the data onto the lower-dimensional hyperplane.
• "Closest" means the projection preserves the most variation in the training data.
Principal Components Analysis
(PCA)
Principal components and choosing the "right" number of dimensions

• PCA identifies the axis that accounts for the largest amount of variance in the training set.
• PCA also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance.
• For an n-dimensional training set, PCA will find n principal components.
• Choosing d < n principal components allows you to project the n-dimensional training set down to a d-dimensional training set (see the sketch below).
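A minimal scikit-learn sketch of this workflow (the digits dataset and the 95% variance target are illustrative choices, not from the slides): passing a float to n_components asks PCA to keep just enough components to preserve that fraction of the variance.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X = load_digits().data  # 1,797 instances, 64 dimensions

    # Keep enough principal components to preserve ~95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape[1])                    # number of components kept (d < 64)
    print(pca.explained_variance_ratio_.sum())   # close to 0.95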



Kernel PCA (kPCA)
What is kPCA? Reducing Swiss Roll to 2D with various kernels

• Can combine the "kernel trick"


with PCA to perform complex
non-linear projections.
• Good at preserving clusters
of instances after projection, or
sometimes even unrolling
datasets that lie close to a
twisted manifold.
• Compute and
memory intensive; doesn't scale
to large datasets. Prof. David R. Pugh
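A minimal sketch of kPCA on the Swiss roll with scikit-learn (the RBF kernel and the gamma value are illustrative, not tuned):

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

    # Non-linear projection down to 2D via the RBF kernel
    rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
    X_reduced = rbf_pca.fit_transform(X)
    print(X_reduced.shape)  # (1000, 2)

Swapping in kernel="linear", "sigmoid", or "poly" reproduces the kind of comparison shown in the figure.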
Kernel PCA (kPCA)
Selecting kernels and tuning hyperparameters; kPCA and reconstruction pre-image error

• If using kPCA as a preprocessing step in a classification/regression pipeline, then choose the kernel and tune the kPCA hyperparameters to maximize the performance of the whole pipeline (see the sketch below).
• Alternatively, choose the kernel and tune the kPCA hyperparameters to minimize the reconstruction error.
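A sketch of the first approach, assuming a downstream logistic regression classifier and a binary target derived from the roll position purely for illustration (both are assumptions, not from the slides):

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
    y = (t > 6.9).astype(int)  # illustrative binary labels

    clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        "kpca__kernel": ["rbf", "sigmoid"],
        "kpca__gamma": [0.01, 0.03, 0.05],
    }
    # Tune the kernel and gamma to maximize the whole pipeline's CV accuracy
    search = GridSearchCV(clf, param_grid, cv=3).fit(X, y)
    print(search.best_params_)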
Locally-Linear Embedding (LLE)
What is LLE? Unrolling the Swiss Roll with LLE

• Non-linear dimensionality
reduction technique.
• Unlike previous algorithms, LLE
doesn't rely on projections.
• Measures how each training
instance linearly relates to its
nearest neighbors, then looks for a
low-dimensional representation of
the training set that preserves
those local relationships.
• Scales poorly to large datasets.
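A minimal LLE sketch on the Swiss roll (the neighbor count of 10 is an illustrative default, not a tuned value):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

    # Preserve local linear relationships among each instance's 10 nearest neighbors
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
    X_unrolled = lle.fit_transform(X)
    print(X_unrolled.shape)  # (1000, 2)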
Unsupervised
Learning



Clustering
• The task of identifying similar instances and assigning them
to clusters, or groups of similar instances.
• Just like in classification, each instance gets assigned to a
group. However, unlike classification, clustering is an
unsupervised learning task.



Clustering



Clustering
• Wide variety of use cases for clustering algorithms.
• Customer segmentation
• Basic data analysis
• Dimensionality reduction
• Feature engineering
• Anomaly detection
• Semi-supervised learning
• Image segmentation



Hard vs Soft Clustering
• Hard clustering: assigning each instance to a single cluster.
• Soft clustering: give each instance a score for each cluster.
• The score can be the distance between the instance and the centroid, or alternatively a similarity (affinity) score.



K-Means
Origins of K-Means algorithm Unlabeled dataset with 5 "clusters"
• The K-Means algorithm is a simple
algorithm capable of clustering
this kind of dataset very
efficiently.
• It was proposed by Stuart Lloyd at
Bell Labs in 1957.
• In 1965, Edward W. Forgy
published virtually the same
algorithm, so K-Means is
sometimes referred to as Lloyd–
Forgy.
K-Means
Deriving the K-Means algorithm A few iterations of the K-Means algorithm

• Suppose you were given the centroids. How would you label the instances?
• Suppose you were given all the instance labels. How would you compute the centroids?
• But you are given neither the labels nor the centroids. How can you proceed?
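The answer is to alternate between the two steps: assign each instance to its closest centroid, then recompute each centroid as the mean of its assigned instances, and repeat until the assignments stop changing. A minimal scikit-learn sketch (the blob dataset and k=5 are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    # n_init=10 runs the algorithm from 10 initializations and keeps the best solution
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)   # hard cluster assignment for each instance

    print(kmeans.cluster_centers_)   # one centroid per cluster
    print(labels[:10])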
K-Means
Limitations of K-Means algorithm K-Means decision boundaries

• K-Means algorithm does not behave well when blobs have very different diameters. Why?
• K-Means only cares about the distance to the centroid when assigning an instance to a cluster.



K-Means
• Algorithm is guaranteed to converge, but it may not converge to
the "best" solution. Compare solutions by comparing inertia.
• Inertia of a solution is the sum of squared distances between
the training instances and their closest centroid.
• The solution with lower inertia is better; the "best" solution will
minimize inertia.
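A quick check that the reported inertia matches its definition, using an illustrative blob dataset:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

    # Inertia as reported by scikit-learn ...
    print(kmeans.inertia_)
    # ... equals the sum of squared distances to each instance's closest centroid
    dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    print((dists ** 2).sum())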



K-Means
Centroid initialization methods Example: "unlucky" centroid initialization

• Randomly initialize centroids, run the algorithm n_init number of times and keep the "best" solution.
• K-Means++ algorithm initializes centroids that are distant from one another. Smarter initialization reduces computation.
K-Means
For large data use mini-batch K-Means Mini-batch: higher inertia but much faster!

• Instead of using the full dataset at each iteration, use mini-batches.
• Speeds up the algorithm by a factor of three to four.
• Possible to cluster huge datasets that do not fit in memory.
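A minimal MiniBatchKMeans sketch (the dataset size and batch size are illustrative):

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

    mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=42)
    mbk.fit(X)
    print(mbk.inertia_)  # typically a bit higher than full K-Means, but much faster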



K-Means
Finding optimal number of clusters Poor choice of k: too low (L) vs too high (R)

• Typically, you will not know k beforehand.
• Poor choice of k will lead to low-quality solutions.
• Inertia is not a good performance metric when trying to choose k. Why?



K-Means
Choose k to "minimize" inertia? Can use an "elbow" plot to choose k
• Increasing k will always
decrease inertia. Why?
• Can plot inertia as function of
k and look for "elbow".
• This method is crude but
doesn't require too much
computation.
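A sketch of building an elbow plot (the dataset and the range of k values are illustrative):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    ks = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, "o-")
    plt.xlabel("k")
    plt.ylabel("inertia")
    plt.show()  # look for the "elbow" where the curve stops dropping sharply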



K-Means
What is the silhouette coefficient? Choose k to maximize silhouette score
• Silhouette coefficient can vary
between –1 and +1.
• Coefficient close to +1 means that the
instance is well inside its own cluster
and far from other clusters.
• Coefficient close to 0 means that
instance is close to a cluster
boundary.
• Coefficient close to –1 means that the
instance may have been assigned to
the wrong cluster.
• Average silhouette coefficient across
all instances is the silhouette score.
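A sketch of choosing k via the silhouette score (the dataset and the k range are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    for k in range(2, 9):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))  # pick the k with the highest score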
K-Means
Analyze silhouette diagram to choose k Helps identify "balanced" clusters
• Silhouette diagram: plot
of silhouette coefficients sorted
by the cluster they are assigned
to and by the value of the
coefficient.
• Vertical dashed lines represent the mean silhouette score for each number of clusters.



K-Means
Limitations of K-Means K-Means fails with "non-circular" clusters
• Necessary to run the algorithm
several times to avoid suboptimal
solutions.
• Need to specify the number of
clusters.
• K-Means does not behave well
when the clusters have varying
sizes, different densities, or non-
spherical shapes.
• Important to scale the input
features before you run K-Means!
Clustering for Image Segmentation
• Image segmentation is the task of partitioning an image into
multiple segments.
• color segmentation: pixels with a similar color get assigned to
the same segment.
• semantic segmentation: all pixels that are part of the same
object type get assigned to the same segment.
• instance segmentation: all pixels that are part of the same
individual object are assigned to the same segment.



Clustering for Image Segmentation
Pixels get the mean color of their clusters K-means with various color "clusters"
• K-Means algorithm with 8 color clusters outputs the image shown in the upper right.
• With fewer than eight clusters, the ladybug's flashy red color fails to get a cluster of its own.
• K-Means prefers clusters of similar sizes. The ladybug is small, so even though its color is flashy, K-Means fails to give it a separate cluster.
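A minimal color-segmentation sketch: every pixel becomes a 3D (RGB) point, the pixels are clustered, and each pixel is replaced by the mean color of its cluster. A random array stands in for a real image so the sketch stays self-contained:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))      # placeholder for an (H, W, 3) RGB image

    pixels = image.reshape(-1, 3)        # one row per pixel
    kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(pixels)

    # Replace each pixel with the mean color of its cluster
    segmented = kmeans.cluster_centers_[kmeans.labels_]
    segmented_image = segmented.reshape(image.shape)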
Clustering for Semi-Supervised
Learning
Clustering can help label unlabeled data 50 representative digits (one per cluster)
• Suppose that we have plenty of
unlabeled instances but very few
labeled instances.
• Use clustering algorithm to
identify representative instances
and label these manually.
• Propagate manual labels to all
instances in the same cluster.
• Re-train your classification
algorithms on this larger data set.
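A sketch of this workflow on the digits dataset (k=50 representatives and logistic regression as the classifier are illustrative choices):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression

    X, y = load_digits(return_X_y=True)

    k = 50
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    dist = kmeans.fit_transform(X)                 # distance of each instance to each centroid
    representative_idx = np.argmin(dist, axis=0)   # the instance closest to each centroid

    # Pretend only these 50 representative digits get labeled manually ...
    y_representative = y[representative_idx]
    # ... then propagate each manual label to every instance in the same cluster
    y_propagated = y_representative[kmeans.labels_]

    clf = LogisticRegression(max_iter=5000).fit(X, y_propagated)
    print(clf.score(X, y))  # trained on propagated labels, scored against the true ones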
Active Learning
• Active learning: a human expert interacts with the learning algorithm, providing labels for specific instances when the algorithm requests them.
• Many different strategies for active learning, but one of the
most common ones is called uncertainty sampling.



Uncertainty Sampling
1. Model is trained on the labeled instances gathered so far;
model is used to make predictions on all the unlabeled
instances.
2. The instances for which the model is most uncertain are given
to the expert for labeling.
3. Iterate until the performance improvement stops being worth
the labeling effort.
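A sketch of one round of uncertainty sampling (the digits dataset, the initial 100 labels, and the batch of 10 queries are illustrative):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression

    X, y = load_digits(return_X_y=True)
    labeled = np.zeros(len(X), dtype=bool)
    labeled[:100] = True                  # pretend only 100 instances start out labeled

    # Step 1: train on the labeled instances, predict on the unlabeled ones
    clf = LogisticRegression(max_iter=5000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[~labeled])

    # Step 2: the lower the top-class probability, the more uncertain the model is
    uncertainty = 1 - proba.max(axis=1)
    query_idx = np.flatnonzero(~labeled)[np.argsort(uncertainty)[-10:]]
    print(query_idx)  # send these to the expert, add their labels, then repeat (step 3)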



DBSCAN
• Defines clusters as continuous regions of high density. Here is how it
works:
• For each instance, count how many instances are located within a
small distance ε from it. This region is called the instance’s ε-
neighborhood.
• If instance has at least min_samples instances in its ε-neighborhood,
then it is considered a core instance.
• All instances in the ε-neighborhood of a core instance belong to the
same cluster.
• Any instance that is not a core instance and does not have a core instance in its ε-neighborhood is considered an anomaly.
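A minimal DBSCAN sketch on the two-moons dataset (the eps and min_samples values are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

    dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(dbscan.labels_))               # cluster ids; label -1 marks anomalies
    print(len(dbscan.core_sample_indices_))  # number of core instances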



DBSCAN



DBSCAN
Pros:
• Capable of identifying any number of clusters of any shape.
• Robust to outliers.
• Only two hyperparameters.

Cons:
• If the density varies significantly across the clusters, or if there's no sufficiently low-density region around some clusters, DBSCAN can struggle to capture all the clusters.
• Algorithm does not scale well to large datasets.
Other clustering algorithms
• Agglomerative clustering
• BIRCH
• Mean-Shift
• Affinity propagation
• Spectral clustering



Gaussian Mixture Model (GMM)
• Probabilistic model that assumes that instances were
generated from a mixture of several Gaussian distributions
whose parameters are unknown.
• Many variants; simplest variant discussed here requires that
you know the number of clusters, k.



Gaussian Mixture Model (GMM)
• Dataset is assumed to be generated as follows.
• For each instance, a cluster is picked randomly from among the k clusters. The probability of choosing cluster j is the cluster weight ϕ(j). The index of the cluster chosen for instance i is z(i).
• If instance i was assigned to the cluster j (i.e., z(i) = j), then
location x(i) of this instance is sampled randomly from Gaussian
distribution with mean μ(j) and covariance matrix Σ(j).



Expectation Maximization (EM)
• Expectation-Maximization (EM) algorithm has many similarities
with the K-Means algorithm:
1. Initializes the cluster parameters randomly.
2. Expectation step: assign instances to clusters.
3. Maximization step: update the clusters.
4. Repeat steps 2 and 3 until convergence.



Expectation Maximization (EM)
• Expectation-Maximization (EM) can be thought of as a
generalization of the K-Means algorithm:
• Like K-Means algorithm, EM algorithm finds the cluster
centers.
• EM algorithm also finds cluster size, shape, and orientation (as
well as their relative weights).
• EM algorithm uses soft cluster assignments, unlike the K-Means algorithm, which uses hard assignments.



Gaussian Mixture Model (GMM)
GMM is a soft clustering model Clusters, decision boundaries, contours
• Expectation step: algorithm
estimates probability that each
instance belongs to each cluster.
• Maximization step: each cluster is
updated using all instances; each
instance is weighted by the
estimated probability that it
belongs to that cluster.
• Each cluster’s update will mostly
be impacted by the instances that
have high probability of being in
the cluster.
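A minimal GMM sketch showing both the hard and the soft assignments (the blob dataset and k=3 are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

    gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
    print(gm.weights_)               # cluster weights phi(j)
    print(gm.means_)                 # cluster means mu(j)
    print(gm.predict(X)[:5])         # hard assignments
    print(gm.predict_proba(X)[:5])   # soft assignments from the expectation step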
Gaussian Mixture Model (GMM)
How to restrict shape and orientation of clusters? Clusters using different covariance types

• When there are many dimensions, or many clusters, or few instances, EM can struggle to converge to the optimal solution.
• Can reduce problem difficulty by limiting the range of shapes and orientations that the clusters can have.
• Impose constraints on the covariance matrices. Options are full, spherical, diagonal, tied.
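In scikit-learn these constraints are exposed through the covariance_type parameter (note the spelling "diag"); a small sketch on an illustrative dataset:

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

    for cov_type in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                             n_init=10, random_state=42).fit(X)
        print(cov_type, gm.covariances_.shape)  # fewer free parameters as constraints tighten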
Anomaly Detection Using Gaussian
Mixtures
GMMs make anomaly detection simple! Anomalies are represented as stars
• Any instance located in a low-
density region can be
considered an anomaly.
• Must define what density
threshold to use.
• Too many false positives,
decrease the threshold; too
many false negatives, increase
the threshold.
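A sketch of GMM-based anomaly detection (the 2% density threshold is an illustrative choice):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
    gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)

    densities = gm.score_samples(X)           # log-density at each instance
    threshold = np.percentile(densities, 2)   # flag the lowest-density 2% as anomalies
    anomalies = X[densities < threshold]
    print(len(anomalies))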
Selecting the Number of Clusters
How to choose number of clusters? Use an "elbow" plot to select k
• Metrics like inertia or silhouette
score are not reliable when
clusters are not spherical or
have different sizes.
• Choose k to minimize some
information criterion, such as
Bayesian information criterion
(BIC) or Akaike information
criterion (AIC).
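A sketch of choosing the number of mixture components via BIC/AIC (the dataset and k range are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

    for k in range(1, 7):
        gm = GaussianMixture(n_components=k, n_init=10, random_state=42).fit(X)
        print(k, gm.bic(X), gm.aic(X))  # choose the k that minimizes BIC (or AIC)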
Bayesian GMM
• Rather than manually searching for the optimal number of
clusters, you can use Bayesian GMM.
• Capable of giving weights equal (or close) to zero to
unnecessary clusters.
• Set the number of clusters to a value that you have good
reason to believe is greater than the optimal number of
clusters.
• Algorithm will eliminate the unnecessary clusters automatically.



GMM Limitations



Algorithms for anomaly detection
• Fast-MCD
• Isolation Forest
• Local Outlier Factor (LOF)
• One-class SVM
• Invertible dimensionality reduction algorithms such as PCA



Introduction to
Neural Networks



Introduction to Neural Networks
• Artificial neural networks (ANNs).
• ML model inspired by the networks of biological neurons found
in our brains.
• ANNs have gradually become quite different from their
biological cousins.
• ANNs are at the very core of Deep Learning.



Introduction to Neural Networks
• Ideal for tackling large and highly complex tasks.
• Classifying billions of images (Google Images).
• Speech recognition services (Apple’s Siri).
• Recommending the best videos to watch (YouTube).
• Super-human gaming (DeepMind’s AlphaGo).



A Brief History of ANNs
• Interest in ANNs has come in waves.
• First wave of interest kicked off by McCulloch and Pitts (1943).
• Computational model of how biological neurons could
perform complex computations using propositional calculus.
• Early success led to lots of hype! But by the 1960s the hype had died down and ANNs fell into disuse as ANN research entered its first "winter" period.



A Brief History of ANNs
• A second wave of interest in ANNs kicked off in the early 1980s.
• In the 1980s new ANN architectures were invented and better training techniques were developed. Lots of hype!
• In 1990s other powerful ML techniques were invented, such as
SVMs, which seemed to offer better results than ANNs.
• Hype died down and ANN research entered its second "winter"
period.



A Brief History of ANNs
• Now experiencing a third wave of interest in ANNs. Will we see a
repeat of the past? Or is this time different?
• Huge quantities of data now available to train ANNs.
• Better training algorithms have been developed.
• Theoretical issues of ANNs have turned out to be benign in
practice.
• ANNs often outperform other ML techniques on large and
complex problems.
• Increase in computing power (GPUs, TPUs, etc.) makes it
possible to train large ANNs efficiently.



Biological Neurons
A single biological neuron Layers of biological neurons in a brain



Logical Computations with ANNs
• McCulloch and Pitts (1943) developed a simple model of a
biological neuron.
• Artificial neuron has one or more binary (on/off) inputs and one
binary output.
• Artificial neuron activates when more than certain number of
inputs are active.
• They proved that one can build ANNs to compute any logical proposition.



Logical Computations with Artificial
Neurons
ANNs performing logical computations ANNs can act as logical operators

• The first network is the identity function.
• The second network performs a logical AND.
• The third network performs a logical OR.
• The fourth network computes a slightly more complex logical proposition.
Threshold Logical Unit (TLU)
One of the simplest ANN architectures Linear transformation + step function

• Artificial neuron
called a threshold logic
unit (TLU).
• Inputs and output are numbers
(not binary on/off values).
• Each input connection is
associated with a weight.



TLUs as simple binary classifiers
• TLUs can perform linear binary classification (like logistic
regression, linear SVMs, etc.).
• TLU first computes a linear function of its inputs and then
applies a step function to the result.
• Most common step function used in TLU is the Heaviside step
function.
• If result exceeds a threshold, then output positive class (else
output the negative class).
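A minimal NumPy sketch of a TLU: a linear function of the inputs followed by a Heaviside step (the weights, bias, and inputs are made-up values):

    import numpy as np

    def tlu(x, w, b, threshold=0.0):
        # Weighted sum of the inputs, then the Heaviside step function
        z = x @ w + b
        return (z >= threshold).astype(int)

    X = np.array([[1.5, -0.5],
                  [0.2,  0.3],
                  [-2.0, 0.1]])
    w = np.array([1.0, 2.0])
    b = -0.5
    print(tlu(X, w, b))  # 1 = positive class, 0 = negative class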



The Perceptron
Perceptron is a single layer of TLUs Perceptron with 2 inputs and 3 outputs

• Composed of one or more TLUs organized in a single layer: every TLU is connected to every input.
• Such a layer is called a fully connected or dense layer.
• Inputs form an "input layer"; since the layer of TLUs also produces the final output, it is also the output layer.
Training a Perceptron
• Perceptron training rule reinforces weights that help reduce the
prediction error.
• “Cells that fire together, wire together”; connection weight
between two neurons tends to increase when they fire
simultaneously.
• Perceptron is trained using a variant of this rule that considers
the error made by the network when making a prediction.
• If classes are linearly separable, then the training rule
converges to a solution.



The Perceptron
Adding layers helps address limitations: a two-layer perceptron can perform XOR

• A single perceptron can't solve the simple XOR classification problem.
• However, a two-layer perceptron can solve the XOR classification problem (see the sketch below).
• Other limitations of the perceptron can be addressed by adding multiple layers.
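A quick check with scikit-learn's MLPClassifier: one small hidden layer is enough for XOR (the hidden layer size, activation, and solver are illustrative choices, and a different random seed may be needed for convergence):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])  # XOR labels

    mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                        solver="lbfgs", max_iter=5000, random_state=1)
    mlp.fit(X, y)
    print(mlp.predict(X))  # expected: [0 1 1 0]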
The Multi-layer Perceptron (MLP)
Key neural network terms MLP with one hidden layer
• An input layer, one or more layers
of TLUs, called hidden layers, one
final layer of TLUs called
the output layer.
• When an ANN contains a deep
stack of hidden layers, it is called
a deep neural network (DNN).
• A feedforward neural network (FNN) is one in which the signal flows in one direction, from the inputs to the outputs.
Training an MLP
• In 1970, master's student Seppo Linnainmaa developed an efficient algorithm called reverse-mode automatic differentiation.
• Computes gradients of the neural network’s error for every
single model parameter with just two passes through the
network!
• Combination of reverse-mode automatic differentiation and
stochastic gradient descent is called backpropagation (or
backprop).



Backpropagation
• Backpropagation algorithm in brief...

• Randomly initialize all the model parameters.


• Forward pass: make predictions for a mini-batch of training
samples.
• Measure the error between the current predictions and the targets using some loss function.
• Backward pass: go through each layer in reverse to measure the error contribution of each parameter.
• Gradient Descent: update model parameters to reduce the error.
• Repeat for some, typically large, number of mini-batches.
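A bare-bones NumPy sketch of these steps for a tiny one-hidden-layer network trained on XOR with a mean-squared-error loss (the architecture, learning rate, and loss are illustrative simplifications, not the slides' exact setup):

    import numpy as np

    rng = np.random.default_rng(42)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    # Randomly initialize the parameters of a 2-16-1 network
    W1, b1 = rng.normal(0, 1, (2, 16)), np.zeros(16)
    W2, b2 = rng.normal(0, 1, (16, 1)), np.zeros(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for step in range(5000):
        # Forward pass: predictions for the (tiny) batch
        h = sigmoid(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # Backward pass: error contribution of each parameter (MSE loss)
        dp = 2 * (p - y) / len(X)
        dz2 = dp * p * (1 - p)
        dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * h * (1 - h)
        dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
        # Gradient descent: update the parameters to reduce the error
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(p.round(2))  # should approach [[0], [1], [1], [0]]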



Activation Functions
Backprop requires non-linear activations Activation functions (L) vs derivatives (R)

• Without non-linear activation functions DNNs would just be linear functions! Why?
• Popular activation functions are sigmoid, ReLU, and hyperbolic tangent.
• A large enough DNN with non-linear activations can approximate any continuous function.
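The three activations named above, written out in NumPy for reference:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        return np.maximum(0.0, z)

    def tanh(z):
        return np.tanh(z)

    z = np.linspace(-3, 3, 7)
    print(sigmoid(z))
    print(relu(z))
    print(tanh(z))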
MLPs for Regression
• If you want to predict a single value, then you just need a single
output neuron: its output is the predicted value.
• To predict multiple values at once, you need one output
neuron per output dimension.
• Often does not use any activation function for the output layer,
so it’s free to output any value it wants.
• Typically use root mean squared error (RMSE) loss but other
loss functions are possible as well.



MLPs for Classification
• For a binary classification you need a single output neuron and
the sigmoid activation function.
• For multi-label binary classification tasks, you need one output
neuron per output dimension.
• For multi-class classification tasks, you need one output neuron
per class and the SoftMax activation function.



MLPs for Classification
Multi-class classification using MLPs SoftMax activation + Cross-entropy loss
• One output neuron per class;
use the SoftMax activation
function.
• SoftMax function ensures that
estimated probabilities are
between 0 and 1 (and that they
add up to 1 since the classes are
exclusive).
• Typically combine SoftMax
activation with cross-entropy
loss function.
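A small NumPy sketch of the SoftMax function and the cross-entropy loss it is usually paired with (the logits and labels are made-up values):

    import numpy as np

    def softmax(logits):
        # Subtract the row max for numerical stability; each row sums to 1
        z = logits - logits.max(axis=1, keepdims=True)
        exp = np.exp(z)
        return exp / exp.sum(axis=1, keepdims=True)

    def cross_entropy(probs, y_true):
        # Mean negative log-probability assigned to the true class
        n = len(y_true)
        return -np.log(probs[np.arange(n), y_true] + 1e-12).mean()

    logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2,  3.0]])
    probs = softmax(logits)
    print(probs.sum(axis=1))                      # [1. 1.]
    print(cross_entropy(probs, np.array([0, 2])))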
Building Intuition for MLPs



Check out TensorFlow Playground!

Challenges:
• Solve each of the classification tasks.
• Can you find a single neural network that will perform well on
all the classification tasks?
• Can you solve all the classification tasks without using any
hidden layers?



Where to go from here?



Additional Training
• After completing Introduction to Machine Learning you should
check out the following training courses.
• Introduction to Deep Learning
• Accelerated Machine Learning

