
Notes on Unsupervised Learning

Definition:

Unsupervised Learning is a type of machine learning where algorithms analyze and interpret data
without explicit labels or supervision. The goal is to identify patterns, structures, or relationships in
the data. Unlike supervised learning, there are no predefined outcomes or target variables.

Key Concepts:

1. Input Data:

o Consists of only features (no labels).

o Algorithms process raw data to discover hidden insights or groupings.

2. Objective:

o To find inherent structures or patterns in the dataset.

o Used for dimensionality reduction, clustering, and anomaly detection.

3. Techniques (a short code sketch follows this list):

o Clustering:
Groups similar data points into clusters based on certain metrics (e.g., distance).

  - Common algorithms: K-Means, DBSCAN, Hierarchical Clustering.

o Dimensionality Reduction:
Reduces the number of variables while retaining essential information.

  - Common methods: Principal Component Analysis (PCA), t-SNE.

o Anomaly Detection:
Identifies data points that deviate significantly from the norm.

  - Examples: Isolation Forest, Autoencoders.
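
As a quick illustration of these three technique families, here is a minimal sketch using scikit-learn on synthetic data (the dataset, parameter values, and model choices are illustrative assumptions, not part of the notes):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.ensemble import IsolationForest

    # Unlabeled data: 300 samples with 5 features (synthetic, for illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))

    # Clustering: group similar points into 3 clusters.
    cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Dimensionality reduction: keep 2 components that retain most of the variance.
    X_2d = PCA(n_components=2).fit_transform(X)

    # Anomaly detection: -1 marks points that deviate strongly from the rest.
    anomaly_flags = IsolationForest(random_state=0).fit_predict(X)

    print(cluster_labels[:10], X_2d.shape, (anomaly_flags == -1).sum())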

Advantages:

• Unlabeled Data Usage:
Utilizes large volumes of unlabeled data, which is often easier and cheaper to collect.

• Insight Discovery:
Uncovers hidden structures that may not be obvious through manual analysis.

Challenges:

1. Interpretability:
Results can be abstract and harder to explain compared to supervised methods.

2. Evaluation:
No predefined labels make it difficult to measure the accuracy or quality of results.

3. Scalability:
Processing large datasets can be computationally expensive.

Applications:

1. Customer Segmentation:
Grouping customers based on purchasing behavior for targeted marketing.

2. Fraud Detection:
Identifying unusual patterns in transactions to detect fraudulent activities.

3. Data Preprocessing:
Reducing dimensions to simplify complex datasets before further analysis.

Clustering: K-Means and Kernel K-Means

K-Means Clustering

Definition:
K-Means is a popular clustering algorithm that groups data into k clusters based on feature similarity. It minimizes the variance within clusters and is widely used due to its simplicity and efficiency.

Steps in K-Means (a minimal sketch follows these steps):

1. Initialization:

o Choose k, the number of clusters.

o Randomly initialize k cluster centroids (means).

2. Assignment Step:

o Assign each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).

3. Update Step:

o Compute the new centroids by averaging the data points assigned to each cluster.

4. Iterative Refinement:

o Repeat the assignment and update steps until centroids converge or a maximum
number of iterations is reached.
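
A minimal NumPy sketch of these four steps (initialization, assignment, update, iterative refinement); the synthetic data, value of k, and stopping rule are assumptions for illustration, not a production implementation:

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Plain K-Means: random initialization, nearest-centroid assignment, mean update."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]      # 1. initialization
        for _ in range(n_iters):
            # 2. assignment: each point goes to the nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. update: new centroid = mean of the points assigned to the cluster
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # 4. iterative refinement: stop when the centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Three well-separated blobs, the setting where K-Means works best.
    X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
    labels, centroids = kmeans(X, k=3)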

Advantages of K-Means:
• Simple and easy to implement.

• Scales well to large datasets.

• Works well for spherical and well-separated clusters.

Limitations:

• Requires k to be specified in advance.

• Sensitive to outliers and initial centroid placement.

• Assumes clusters are convex and isotropic (similar in size and shape).

Applications:

• Market segmentation, image compression, document clustering.

Kernel K-Means Clustering

Definition:
Kernel K-Means extends K-Means to handle non-linear data by using kernel functions to map the
data into a higher-dimensional space where linear separation is possible.

Key Differences from K-Means:

1. Distance Metric:

o Uses a kernel function (e.g., RBF, polynomial) to compute similarity between data
points.

o Allows for non-linear boundaries between clusters.

2. Feature Transformation:

o Implicitly transforms data into a higher-dimensional space using the kernel trick.

o Overcomes the limitation of K-Means for non-spherical clusters.

Steps in Kernel K-Means (a minimal sketch follows these steps):

1. Compute the kernel matrix K, which contains pairwise similarities between all data points.

2. Assign data points to clusters based on similarity in the transformed feature space.

3. Update the cluster assignments and repeat until convergence.
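
A rough NumPy sketch of these steps with an RBF kernel; the squared distance to a cluster in the implicit feature space is computed entirely from the kernel matrix (data, gamma, and initialization are illustrative assumptions):

    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        """Step 1: kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq_dists)

    def kernel_kmeans(X, k, gamma=1.0, n_iters=50, seed=0):
        K = rbf_kernel(X, gamma)
        labels = np.random.default_rng(seed).integers(k, size=len(X))   # random initial clusters
        for _ in range(n_iters):
            dist = np.full((len(X), k), np.inf)
            for c in range(k):
                members = labels == c
                m = members.sum()
                if m == 0:
                    continue
                # ||phi(x_i) - mean_c||^2 = K_ii - 2/|c| * sum_j K_ij + 1/|c|^2 * sum_{j,l} K_jl
                dist[:, c] = (np.diag(K)
                              - 2.0 * K[:, members].sum(axis=1) / m
                              + K[np.ix_(members, members)].sum() / m ** 2)
            new_labels = dist.argmin(axis=1)          # step 2: assign in the transformed space
            if np.array_equal(new_labels, labels):    # step 3: repeat until convergence
                break
            labels = new_labels
        return labels

    # Example data with a non-linear cluster structure: two concentric rings.
    theta = np.linspace(0, 2 * np.pi, 200)
    X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],
                   np.c_[3 * np.cos(theta), 3 * np.sin(theta)]])
    print(np.bincount(kernel_kmeans(X, k=2, gamma=2.0)))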

Advantages of Kernel K-Means:

• Handles non-linear cluster structures.

• More flexible than traditional K-Means.

Limitations:

• More computationally expensive due to kernel computation.

• Requires choosing the right kernel function and hyperparameters.

Applications:

• Image segmentation, gene expression analysis, and any scenario with complex, non-linear data patterns.

Comparison of K-Means and Kernel K-Means:

Aspect | K-Means | Kernel K-Means
Cluster Shape | Linear, spherical | Non-linear, flexible
Distance Metric | Euclidean | Kernel-based similarity
Efficiency | Fast, less expensive | Slower, more expensive
Applications | Simple data patterns | Complex, non-linear data

Matrix Factorization and Matrix Completion

Matrix factorization and matrix completion are techniques commonly used in machine learning,
especially in the context of recommendation systems, collaborative filtering, and data recovery from
incomplete datasets.

Matrix Factorization

Definition:
Matrix factorization is a mathematical technique where a matrix is decomposed into multiple lower-
rank matrices such that their product approximates the original matrix. This is often used to discover
latent features or hidden relationships within the data.

Mathematical Formulation:
Given a matrix R (e.g., a user-item interaction matrix in a recommendation system), matrix factorization seeks to approximate R as the product of two lower-dimensional matrices P and Q:

R ≈ P × Q^T

Where:

• P represents a matrix of user features (e.g., latent factors representing user preferences).

• Q represents a matrix of item features (e.g., latent factors representing item characteristics).
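
A minimal sketch of learning P and Q by gradient descent on the squared reconstruction error of R ≈ P × Q^T; the toy matrix, rank, learning rate, and regularization are assumptions for illustration:

    import numpy as np

    # Toy, fully observed user-item matrix R (rows = users, columns = items).
    R = np.array([[5.0, 3.0, 2.0, 1.0],
                  [4.0, 3.0, 2.0, 1.0],
                  [1.0, 1.0, 4.0, 5.0],
                  [2.0, 1.0, 5.0, 4.0]])

    n_users, n_items = R.shape
    rank = 2                                           # number of latent factors
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, rank))     # user factors
    Q = 0.1 * rng.standard_normal((n_items, rank))     # item factors

    lr, reg = 0.01, 0.02
    for _ in range(2000):
        E = R - P @ Q.T                                # reconstruction error
        P += lr * (E @ Q - reg * P)                    # gradient step on the user factors
        Q += lr * (E.T @ P - reg * Q)                  # gradient step on the item factors

    print(np.round(P @ Q.T, 2))                        # low-rank approximation of R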

Types of Matrix Factorization:

1. Singular Value Decomposition (SVD):

o Decomposes a matrix into three matrices: R = U Σ V^T, where U and V are orthogonal matrices and Σ is a diagonal matrix of singular values. SVD is often used in dimensionality reduction and is foundational in techniques like Principal Component Analysis (PCA).

2. Non-Negative Matrix Factorization (NMF):

o Similar to SVD but restricts the factorization to non-negative values in the matrices P and Q, which can be useful in applications like topic modeling or image processing.

3. Alternating Least Squares (ALS):

o A popular algorithm used in recommendation systems where the matrix is factorized iteratively by alternately fixing one matrix (e.g., P) and solving for the other (e.g., Q).

Applications of Matrix Factorization:

• Recommendation Systems:

o Used to predict user preferences (e.g., movie ratings, product recommendations) by learning latent features from historical data.

• Topic Modeling:

o In Natural Language Processing (NLP), matrix factorization techniques like Latent Semantic Analysis (LSA) can be used to discover hidden topics in text data.

• Data Compression and Image Processing:

o Decomposing large datasets into smaller, more manageable parts while retaining important information.

Advantages of Matrix Factorization:

• Dimensionality Reduction: Reduces the complexity of large datasets while preserving relevant patterns.

• Interpretability: Can reveal underlying structures or latent features in the data.

• Scalability: Efficient algorithms (like ALS) can be scaled to large datasets, such as in recommendation systems.

Limitations:

• Overfitting: If not properly regularized, matrix factorization can overfit to noisy or sparse data.

• Missing Data: While it works well on complete datasets, matrix factorization techniques may not perform well with missing or incomplete data unless adapted for matrix completion.

Matrix Completion

Definition:
Matrix completion is the process of filling in the missing entries of a partially observed matrix. It is
often used when the matrix is large and contains many missing values, such as in recommendation
systems (e.g., user-item ratings matrix with missing ratings).

Mathematical Formulation:
The goal of matrix completion is to estimate the missing entries of a matrix R, given that the matrix can be approximated by a low-rank matrix. Let R_obs denote the observed entries of the matrix and R_missing the missing entries. Matrix completion seeks a low-rank estimate R̂ that solves:

R̂ = argmin_R̂ ‖ R_obs − R̂_obs ‖_F^2

Where ‖·‖_F denotes the Frobenius norm, taken over the observed positions only, which measures the difference between the observed entries and the corresponding predicted entries.
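
The earlier factorization sketch adapted to this objective: the squared error is measured only over the observed entries, and the learned factors then predict the missing ones (the mask, rank, and step size are illustrative assumptions):

    import numpy as np

    # Partially observed rating matrix: np.nan marks the missing entries.
    R = np.array([[5.0, 3.0, np.nan, 1.0],
                  [4.0, np.nan, np.nan, 1.0],
                  [1.0, 1.0, np.nan, 5.0],
                  [np.nan, 1.0, 5.0, 4.0]])
    observed = ~np.isnan(R)                            # the set of observed positions
    R_filled = np.where(observed, R, 0.0)

    n_users, n_items = R.shape
    rank = 2
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, rank))
    Q = 0.1 * rng.standard_normal((n_items, rank))

    lr, reg = 0.02, 0.02
    for _ in range(5000):
        E = (R_filled - P @ Q.T) * observed            # error restricted to observed entries
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)

    R_hat = P @ Q.T                                    # completed matrix: predictions for missing entries
    print(np.round(R_hat, 2))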

Techniques for Matrix Completion:

1. Low-Rank Matrix Factorization:

o Matrix completion can be framed as a matrix factorization problem where we learn the low-rank factorization of the observed data and use it to predict the missing values.

2. Singular Value Thresholding (SVT):

o A method for matrix completion that uses the singular value decomposition (SVD) and thresholds the singular values to recover the low-rank matrix (a simplified sketch follows this list).

3. Alternating Least Squares (ALS):

o Used for matrix factorization and can be adapted for matrix completion by iteratively
filling in the missing values during the factorization process.

4. Nuclear Norm (Trace Norm) Minimization:

o This approach minimizes the nuclear norm (the sum of singular values) of the matrix as a convex proxy for rank minimization to find the best low-rank approximation.
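
A simplified singular value thresholding ("soft-impute"-style) sketch of the second technique: repeatedly take an SVD of the current estimate, shrink the singular values, and restore the observed entries (the threshold, iteration count, and toy matrix are illustrative assumptions):

    import numpy as np

    def svt_complete(R, observed, tau=2.0, n_iters=200):
        """Simplified singular value thresholding for matrix completion."""
        X = np.where(observed, R, 0.0)                 # start with missing entries set to 0
        for _ in range(n_iters):
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            s_shrunk = np.maximum(s - tau, 0.0)        # soft-threshold the singular values
            X = U @ np.diag(s_shrunk) @ Vt             # low-rank estimate of the full matrix
            X = np.where(observed, R, X)               # keep the observed entries fixed
        return X

    R = np.array([[5.0, 3.0, 0.0, 1.0],
                  [4.0, 0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 5.0],
                  [0.0, 1.0, 5.0, 4.0]])
    observed = R > 0                                   # in this toy matrix, zeros denote missing ratings
    print(np.round(svt_complete(R, observed), 2))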

Applications of Matrix Completion:

• Collaborative Filtering:

o Predicting missing ratings or preferences for users in recommendation systems (e.g., Netflix, Amazon).

• Image Inpainting:

o Completing missing or corrupted regions of an image by leveraging the structure of the data.

• Sensor Networks:

o Filling in missing data from sensors or measurements, especially when sensors fail or readings are missing.

Advantages of Matrix Completion:

• Handling Missing Data: Can effectively predict and recover missing information in large matrices.

• Improved Predictions: Used in recommendation systems to make accurate predictions by leveraging existing data patterns.

Limitations:

• Assumption of Low Rank: Matrix completion assumes the data can be approximated by a low-rank matrix, which may not always be true.

• Computational Complexity: Matrix completion algorithms can be computationally expensive, especially for large matrices with many missing entries.

Comparison: Matrix Factorization vs. Matrix Completion

Aspect Matrix Factorization Matrix Completion

Decompose a matrix into factors to Fill in missing values in a partially observed


Purpose
discover latent features matrix

Data Often applied to fully observed data Applied to data with missing entries

A completed matrix with predicted missing


Output Two low-rank matrices (e.g., PP and QQ)
values

Recommendation systems, topic Collaborative filtering, image inpainting,


Applications
modeling sensor networks

Key Data is well-structured (e.g., linear Matrix has a low-rank structure and
Assumption relationships) missing values are recoverable

In summary, matrix factorization is used to uncover latent features by decomposing a matrix into
lower-dimensional matrices, while matrix completion focuses on filling in missing values by
assuming the matrix has a low-rank structure. Both techniques are fundamental in applications like
recommendation systems and data recovery.

Generative Models: Mixture Models and Latent Factor Models



Generative models are a class of machine learning models that aim to model how data is generated
in order to learn the underlying distribution. These models can generate new data samples by
learning the underlying structure of the observed data. Two important types of generative models
are Mixture Models and Latent Factor Models.

Mixture Models

Definition:
A mixture model is a probabilistic model that assumes the data is generated from a mixture of
several different distributions. These distributions represent sub-populations within the overall
population, but the membership of each data point to these sub-populations is unknown.

Mathematical Formulation:
A mixture model expresses the probability density of a data point x as a weighted sum of component distributions:

p(x) = Σ_{k=1}^{K} π_k p_k(x)

Where:

• K is the number of components (clusters or sub-populations).

• π_k is the weight (or mixing coefficient) of the k-th component, such that Σ_{k=1}^{K} π_k = 1.

• p_k(x) is the probability density function of the k-th component.

Each component p_k(x) can be a simple distribution, such as a Gaussian (Normal) distribution, making Gaussian Mixture Models (GMMs) a common example.
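
As a small worked example of this formula, the density of a point under a two-component 1-D Gaussian mixture (the weights, means, and standard deviations are assumed values):

    import numpy as np
    from scipy.stats import norm

    # Two-component Gaussian mixture: pi = (0.6, 0.4), means (0, 5), standard deviations (1, 1.5).
    weights = np.array([0.6, 0.4])
    means = np.array([0.0, 5.0])
    stds = np.array([1.0, 1.5])

    x = 1.2
    p_x = np.sum(weights * norm.pdf(x, loc=means, scale=stds))   # p(x) = sum_k pi_k * p_k(x)
    print(p_x)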

Types of Mixture Models:

1. Gaussian Mixture Model (GMM):

o Assumes that each component follows a Gaussian (normal) distribution. GMM is used for clustering and density estimation.

2. Hidden Markov Models (HMM):

o A type of mixture model used for sequential data. In HMMs, the state transitions
between distributions are governed by a hidden, latent variable.

Inference in Mixture Models:

• Expectation-Maximization (EM) Algorithm:

o A commonly used method to estimate the parameters of a mixture model, especially when the component membership is unknown. The EM algorithm alternates between two steps (a short sketch follows this list):

  - E-step: Estimate the probabilities of the data points belonging to each component.

  - M-step: Update the model parameters based on these probabilities.
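
A short sketch of EM-based soft clustering with a Gaussian Mixture Model in scikit-learn, which runs the E- and M-steps internally (the synthetic data and K = 2 are assumptions for illustration):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic 1-D data drawn from two Gaussian sub-populations.
    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0)   # K = 2 components
    gmm.fit(X)                                              # parameters estimated with EM

    print(gmm.weights_)                            # mixing coefficients pi_k
    print(gmm.means_.ravel())                      # component means
    print(np.round(gmm.predict_proba(X[:5]), 3))   # soft cluster memberships (responsibilities)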

Applications of Mixture Models:

• Clustering:

o Grouping data into clusters based on their distributional properties (e.g., GMM for soft clustering).

• Density Estimation:

o Estimating the underlying probability distribution of a dataset, useful in anomaly detection.

• Speech and Image Recognition:

o Used in HMMs for modeling speech patterns or sequential image frames.

Advantages of Mixture Models:

• Flexible: Can model complex, multi-modal distributions.

• Soft Clustering: Unlike hard clustering methods, mixture models provide a probability of belonging to each cluster.

Limitations:

• Choice of the Number of Components: Selecting the number of components K can be challenging and requires model selection methods like cross-validation or the Akaike Information Criterion (AIC).

• Sensitive to Initialization: Similar to K-Means, the EM algorithm can be sensitive to the initial choice of parameters.

Latent Factor Models

Definition:
Latent factor models are generative models that assume that observed data is generated by some
underlying, unobserved (latent) factors. These models are often used to decompose complex data
into simpler, interpretable components.

Mathematical Formulation:
In latent factor models, the observed data X is assumed to be a function of the latent variables Z, which are typically lower-dimensional and unobserved:

X ≈ f(Z)


Where f represents some generative process, such as a linear combination or a probabilistic model.

Latent factor models are widely used in collaborative filtering, matrix factorization, and
recommendation systems. A popular example of a latent factor model is Matrix Factorization.
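
As a concrete sketch, a truncated SVD of a small user-item matrix recovers low-dimensional latent factors Z with a linear generative map, so that X ≈ f(Z) (the toy matrix and the number of factors are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # Toy user-item interaction matrix X (rows = users, columns = items).
    X = np.array([[5, 4, 0, 1, 0],
                  [4, 5, 1, 0, 0],
                  [1, 0, 5, 4, 4],
                  [0, 1, 4, 5, 5]], dtype=float)

    svd = TruncatedSVD(n_components=2, random_state=0)
    Z = svd.fit_transform(X)                   # latent user factors (one 2-D vector per user)
    item_factors = svd.components_             # latent item factors

    X_hat = Z @ item_factors                   # linear generative map: X is approximately Z times the item factors
    print(np.round(Z, 2))
    print(np.round(X_hat, 2))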

Types of Latent Factor Models:

1. Principal Component Analysis (PCA):

o A linear latent factor model used for dimensionality reduction by projecting the data
onto a smaller subspace defined by the principal components.

2. Matrix Factorization (e.g., Singular Value Decomposition (SVD)):

o Used in recommendation systems to factorize the user-item interaction matrix into lower-rank matrices, revealing latent factors representing user preferences and item characteristics.

3. Probabilistic Latent Variable Models:

o Latent Dirichlet Allocation (LDA): A generative probabilistic model for topic modeling, where each document is a mixture of latent topics, and each topic is a distribution over words (a short LDA sketch follows this list).

o Factor Analysis: Assumes the observed data is linearly related to a set of latent variables and aims to recover these latent variables.
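
A brief sketch of LDA topic modeling with scikit-learn (the tiny corpus and the choice of two topics are illustrative assumptions):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the cat sat on the mat", "dogs and cats are friendly pets",
            "stock markets fell sharply today", "investors traded stocks and bonds"]

    counts = CountVectorizer().fit_transform(docs)   # word-count matrix (documents x words)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    doc_topics = lda.transform(counts)               # each document as a mixture of the two latent topics
    print(doc_topics.round(2))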

Inference in Latent Factor Models:

• Expectation-Maximization (EM):

o Similar to mixture models, the EM algorithm can be used in latent factor models to estimate the latent factors and their associated parameters.

• Stochastic Gradient Descent (SGD):

o Used for models like matrix factorization in recommendation systems, where the objective is to minimize the reconstruction error between the original and the approximated data.

Applications of Latent Factor Models:

• Recommendation Systems:

o Used to predict user preferences in collaborative filtering by learning latent factors that explain the patterns in user-item interactions.

• Topic Modeling:

o Identifying topics within a collection of text documents by learning latent variables representing topics.

• Dimensionality Reduction:

o Reducing the number of features in high-dimensional data (e.g., images, text) while preserving important information.

Advantages of Latent Factor Models:

• Interpretability: Latent factors are often easier to interpret than the raw data, revealing underlying structures like preferences or topics.

• Flexibility: Can handle complex, high-dimensional data by modeling the underlying latent factors.

Limitations:

• Linear Assumptions: Many latent factor models (e.g., PCA, matrix factorization) assume linear relationships, which may not capture complex, non-linear patterns.

• Scalability: Latent factor models can be computationally expensive, especially with large datasets, and require efficient optimization algorithms.

Comparison of Mixture Models and Latent Factor Models

Aspect | Mixture Models | Latent Factor Models
Purpose | Model data as a mixture of distributions | Model observed data as a function of latent variables
Key Assumption | Data points are generated from multiple unknown distributions | Data points are generated by latent (unobserved) factors
Example Models | Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) | PCA, Matrix Factorization, Latent Dirichlet Allocation (LDA)
Type of Clustering | Hard or soft clustering based on probability | Can be used for soft clustering in collaborative filtering
Applications | Clustering, density estimation, anomaly detection | Recommendation systems, topic modeling, dimensionality reduction
Model Complexity | Moderate, depends on number of components | Moderate to high, depends on the number of latent factors
Flexibility | Can model multi-modal data distributions | Models underlying structures in high-dimensional data

In summary, mixture models are generative models that represent data as a combination of multiple distributions, often used for clustering and density estimation. Latent factor models, on the other hand, focus on uncovering hidden factors that generate observed data, with applications in recommendation systems, topic modeling, and dimensionality reduction. Both types of models are powerful tools for uncovering structure in complex datasets.
