Notes on Unsupervised Learning
Definition:
Unsupervised Learning is a type of machine learning where algorithms analyze and interpret data
without explicit labels or supervision. The goal is to identify patterns, structures, or relationships in
the data. Unlike supervised learning, there are no predefined outcomes or target variables.
Key Concepts:
1. Input Data:
Unlabeled data: only the raw features are available, with no target variable or predefined outcomes.
2. Objective:
Discover patterns, structure, or relationships hidden in the data.
3. Techniques (a minimal code sketch follows this list):
o Clustering:
Groups similar data points into clusters based on certain metrics (e.g., distance).
o Dimensionality Reduction:
Reduces the number of variables while retaining essential information.
o Anomaly Detection:
Identifies data points that deviate significantly from the norm.
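The three techniques can be exercised in a few lines of Python. Below is a minimal sketch; the use of NumPy/scikit-learn, the synthetic data, and the simple distance-based anomaly rule are illustrative assumptions, not part of these notes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # unlabeled data: 200 points, 5 features

# Clustering: group the points into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: keep the 2 most informative directions
X2 = PCA(n_components=2).fit_transform(X)

# Anomaly detection (a simple heuristic): flag points unusually far from the mean
dists = np.linalg.norm(X - X.mean(axis=0), axis=1)
anomalies = dists > dists.mean() + 2 * dists.std()
print(labels[:10], X2.shape, int(anomalies.sum()))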
Advantages:
Insight Discovery:
Uncovers hidden structures that may not be obvious through manual analysis.
Challenges:
1. Interpretability:
Results can be abstract and harder to explain compared to supervised methods.
2. Evaluation:
No predefined labels make it difficult to measure the accuracy or quality of results.
3. Scalability:
Processing large datasets can be computationally expensive.
Applications:
1. Customer Segmentation:
Grouping customers based on purchasing behavior for targeted marketing.
2. Fraud Detection:
Identifying unusual patterns in transactions to detect fraudulent activities.
3. Data Preprocessing:
Reducing dimensions to simplify complex datasets before further analysis.
K-Means Clustering
Definition:
K-Means is a popular clustering algorithm that groups data into k clusters based on feature
similarity. It minimizes the variance within clusters and is widely used due to its simplicity and
efficiency.
Steps in K-Means (a NumPy sketch follows these steps):
1. Initialization:
o Choose k initial centroids, e.g., by picking k random data points (k-means++ is a common refinement).
2. Assignment Step:
o Assign each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).
3. Update Step:
o Compute the new centroids by averaging the data points assigned to each cluster.
4. Iterative Refinement:
o Repeat the assignment and update steps until centroids converge or a maximum
number of iterations is reached.
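These four steps translate directly into NumPy. The sketch below is a minimal illustration: random initialization and the absence of empty-cluster handling are simplifications (real implementations typically use k-means++ and guard against empty clusters).

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # Minimal K-Means; returns (centroids, labels).
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Iterate until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels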
Advantages of K-Means:
Simple to implement and computationally efficient, which makes it practical for large datasets.
Limitations:
Assumes clusters are convex and isotropic (similar in size and shape); also requires k to be chosen in advance and is sensitive to the initial centroids.
Applications:
Customer segmentation, image compression, and document clustering.
Kernel K-Means
Definition:
Kernel K-Means extends K-Means to handle non-linear data by using kernel functions to map the
data into a higher-dimensional space where linear separation is possible.
Key Concepts:
1. Distance Metric:
o Uses a kernel function (e.g., RBF, polynomial) to compute similarity between data points.
2. Feature Transformation:
o Implicitly transforms data into a higher-dimensional space using the kernel trick.
Steps in Kernel K-Means (a sketch follows these steps):
1. Compute the kernel matrix K, which contains pairwise similarities between all data points.
2. Assign data points to clusters based on similarity in the transformed feature space.
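A minimal sketch of these two steps, assuming an RBF kernel. The point-to-cluster distance is computed entirely from kernel entries, via ∥φ(x_i) − μ_c∥^2 = K_ii − (2/|C|) ∑_{j∈C} K_ij + (1/|C|^2) ∑_{j,l∈C} K_jl, so the higher-dimensional space is never materialized; that is the kernel trick in action.

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, max_iters=100, seed=0):
    # Minimal kernel K-Means operating directly on a kernel matrix K.
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, size=n)
    for _ in range(max_iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue  # leave an empty cluster at infinite distance
            # squared feature-space distance from every point to centroid c
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

labels = kernel_kmeans(rbf_kernel(np.random.default_rng(1).normal(size=(60, 2))), k=2)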
Limitations:
Computing and storing the full n × n kernel matrix is expensive for large datasets, and results depend heavily on the choice of kernel and its parameters.
Applications:
Image segmentation, gene expression analysis, and any scenario with complex, non-linear
data patterns.
Matrix Factorization and Matrix Completion
Matrix factorization and matrix completion are techniques commonly used in machine learning, especially in the context of recommendation systems, collaborative filtering, and data recovery from incomplete datasets.
Matrix Factorization
Definition:
Matrix factorization is a mathematical technique where a matrix is decomposed into multiple lower-
rank matrices such that their product approximates the original matrix. This is often used to discover
latent features or hidden relationships within the data.
Mathematical Formulation:
Given a matrix R (e.g., a user-item interaction matrix in a recommendation system), matrix factorization seeks to approximate R as the product of two lower-dimensional matrices P and Q:
R ≈ P Q^T
Where:
P represents a matrix of user features (e.g., latent factors representing user preferences).
Q represents a matrix of item features (e.g., latent factors describing item characteristics).
Common Techniques (a code sketch follows the list):
1. Singular Value Decomposition (SVD):
o Decomposes a matrix into three matrices: R = U Σ V^T, where U and V are orthogonal matrices, and Σ is a diagonal matrix of singular values. SVD is often used in dimensionality reduction and is foundational in techniques like Principal Component Analysis (PCA).
2. Non-negative Matrix Factorization (NMF):
o Similar to SVD but restricts the factorization to non-negative values in the matrices P and Q, which can be useful in applications like topic modeling or image processing.
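As a concrete illustration, the sketch below fits R ≈ P Q^T by plain gradient descent on the squared reconstruction error with L2 regularization. In practice, stochastic updates over only the observed entries (or ALS) are more common; the dense updates, learning rate, and toy matrix here are assumptions made for brevity.

import numpy as np

def factorize(R, rank=2, lr=0.01, reg=0.1, epochs=2000, seed=0):
    # Minimal matrix factorization: find P, Q with R ~ P @ Q.T
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, rank))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))  # item latent factors
    for _ in range(epochs):
        E = R - P @ Q.T  # reconstruction error
        # gradient steps; the reg terms limit overfitting
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)
    return P, Q

R = np.array([[5., 3., 1.],
              [4., 2., 1.],
              [1., 1., 5.]])
P, Q = factorize(R)
print(np.round(P @ Q.T, 1))  # approximately reconstructs R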
Applications:
1. Recommendation Systems:
o Predicting unseen user-item interactions from the learned latent factors.
2. Topic Modeling:
o Decomposing document-term matrices into topics and per-document topic weights.
3. Dimensionality Reduction:
o Decomposing large datasets into smaller, more manageable parts while retaining important information.
Advantages:
Scalability: Efficient algorithms (like ALS) can be scaled to large datasets, such as in recommendation systems.
Limitations:
Overfitting: If not properly regularized, matrix factorization can overfit to noisy or sparse
data.
Missing Data: While it works well on complete datasets, matrix factorization techniques may
not perform well with missing or incomplete data unless adapted for matrix completion.
Matrix Completion
Definition:
Matrix completion is the process of filling in the missing entries of a partially observed matrix. It is
often used when the matrix is large and contains many missing values, such as in recommendation
systems (e.g., user-item ratings matrix with missing ratings).
Mathematical Formulation:
The goal of matrix completion is to estimate the missing entries of a matrix R, given that the matrix can be approximated by a low-rank matrix. Let R_obs be the observed entries of the matrix, and R_missing be the missing entries. Matrix completion seeks to solve:
min_X ∥ R_obs − X_obs ∥_F^2 subject to rank(X) ≤ r
Where ∥ ⋅ ∥_F denotes the Frobenius norm, which measures the difference between the observed entries and the predicted entries (X_obs denotes X restricted to the observed positions).
Common Techniques (a code sketch follows the list):
1. Singular Value Thresholding (SVT):
o A method for matrix completion that uses the singular value decomposition (SVD) and thresholds the singular values to recover the low-rank matrix.
2. Alternating Least Squares (ALS):
o Used for matrix factorization and can be adapted for matrix completion by iteratively filling in the missing values during the factorization process.
3. Trace Minimization:
o This approach minimizes the nuclear norm (sum of singular values) of the matrix as a proxy for rank minimization to find the best low-rank approximation.
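The sketch below illustrates the flavor of these methods with a hard-rank variant of the SVD approach: alternately refill the missing entries from the current low-rank estimate and re-truncate the SVD. True SVT soft-thresholds the singular values instead; the fixed rank and the "0 means missing" convention are assumptions of this example.

import numpy as np

def complete(R, mask, rank=2, iters=100):
    # Iterative low-rank completion: observed entries stay fixed,
    # missing entries are refilled from a truncated SVD each pass.
    X = np.where(mask, R, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[rank:] = 0.0                    # keep only the top `rank` singular values
        low_rank = (U * s) @ Vt
        X = np.where(mask, R, low_rank)   # re-impose the observed entries
    return low_rank

R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [0., 1., 5.]])
mask = R > 0                              # assumption: 0 marks a missing rating
print(np.round(complete(R, mask), 1))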
Applications:
1. Collaborative Filtering:
o Predicting missing ratings in user-item matrices for recommendation.
2. Image Inpainting:
o Reconstructing missing or corrupted pixels in images.
3. Sensor Networks:
o Filling in missing data from sensors or measurements, especially when sensors fail or are missing.
Advantages:
Handling Missing Data: Can effectively predict and recover missing information in large matrices.
Limitations:
Relies on the low-rank assumption: if the underlying matrix is not approximately low-rank, the recovered entries can be unreliable.
Matrix Factorization vs. Matrix Completion:
Data: matrix factorization is often applied to fully observed data, while matrix completion is applied to data with missing entries.
Key Assumption: matrix factorization assumes the data is well-structured (e.g., linear relationships), while matrix completion assumes the matrix has a low-rank structure and that the missing values are recoverable.
In summary, matrix factorization is used to uncover latent features by decomposing a matrix into
lower-dimensional matrices, while matrix completion focuses on filling in missing values by
assuming the matrix has a low-rank structure. Both techniques are fundamental in applications like
recommendation systems and data recovery.
Generative Models
Generative models are a class of machine learning models that aim to model how data is generated in order to learn the underlying distribution. Because they capture this structure, they can also generate new data samples that resemble the observed data. Two important types of generative models are Mixture Models and Latent Factor Models.
Mixture Models
Definition:
A mixture model is a probabilistic model that assumes the data is generated from a mixture of
several different distributions. These distributions represent sub-populations within the overall
population, but the membership of each data point to these sub-populations is unknown.
Mathematical Formulation:
A mixture model expresses the probability density of a data point x as a weighted sum of K component distributions:
p(x) = ∑_{k=1}^{K} π_k p_k(x)
Where:
π_k is the weight (or mixing coefficient) of the k-th component, such that ∑_{k=1}^{K} π_k = 1.
Each component p_k(x) can be a simple distribution, such as a Gaussian (Normal) distribution, making Gaussian Mixture Models (GMMs) a common example.
Hidden Markov Models (HMMs):
o A type of mixture model used for sequential data. In HMMs, the state transitions between distributions are governed by a hidden, latent variable.
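Mixture-model parameters are usually fit with the EM algorithm (referenced under Limitations below). A minimal one-dimensional GMM fit, with random-draw initialization as a simplifying assumption:

import numpy as np

def gmm_em_1d(x, k=2, iters=200, seed=0):
    # Minimal EM for a 1-D Gaussian mixture; returns (weights, means, stds).
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                   # mixing coefficients pi_k
    mu = rng.choice(x, size=k, replace=False)  # initial component means
    sigma = np.full(k, x.std())
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = (pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, sigma_k from the responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])
print(gmm_em_1d(x))  # weights near 0.5/0.5, means near 0 and 5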
Applications:
1. Clustering:
o Grouping data into clusters based on their distributional properties (e.g., GMM for soft clustering).
2. Density Estimation:
o Estimating the full probability density of the data as a sum of simpler component densities.
Advantages:
Soft Clustering: Unlike hard clustering methods, mixture models provide a probability of belonging to each cluster.
Limitations:
Sensitive to Initialization: Similar to K-Means, the EM algorithm can be sensitive to the initial
choice of parameters.
Latent Factor Models
Definition:
Latent factor models are generative models that assume that observed data is generated by some
underlying, unobserved (latent) factors. These models are often used to decompose complex data
into simpler, interpretable components.
Mathematical Formulation:
In latent factor models, the observed data X is assumed to be a function of the latent variables Z, which are typically lower-dimensional and unobserved:
X = f(Z)
Where f represents some generative process, such as a linear combination or a probabilistic model.
Latent factor models are widely used in collaborative filtering, matrix factorization, and
recommendation systems. A popular example of a latent factor model is Matrix Factorization.
Examples (a PCA sketch follows this list):
1. Principal Component Analysis (PCA):
o A linear latent factor model used for dimensionality reduction by projecting the data onto a smaller subspace defined by the principal components.
2. Factor Analysis:
o Assumes the observed data is linearly related to a set of latent variables and aims to recover these latent variables.
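A minimal PCA sketch via SVD, in which Z holds the per-point latent factors of X ≈ f(Z) with f linear; the centering step and the component count follow the usual conventions.

import numpy as np

def pca(X, n_components=2):
    # Minimal PCA: project centered data onto the top principal components.
    Xc = X - X.mean(axis=0)                # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]         # latent directions (one per row)
    Z = Xc @ components.T                  # latent factors for each point
    return Z, components

X = np.random.default_rng(0).normal(size=(100, 5))
Z, components = pca(X)
print(Z.shape, components.shape)  # (100, 2) (2, 5)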
Estimation Methods:
1. Expectation-Maximization (EM):
o Similar to mixture models, the EM algorithm can be used in latent factor models to estimate the latent factors and their associated parameters.
2. Gradient-Based Optimization:
o Used for models like matrix factorization in recommendation systems, where the objective is to minimize the reconstruction error between the original and the approximated data.
Applications:
1. Recommendation Systems:
o Learning user and item factors to predict preferences (e.g., matrix factorization).
2. Topic Modeling:
o Representing documents as mixtures of latent topics (e.g., Latent Dirichlet Allocation).
3. Dimensionality Reduction:
o Reducing the number of features in high-dimensional data (e.g., images, text) while preserving important information.
Advantages:
Interpretability: Latent factors are often easier to interpret than the raw data, revealing underlying structures like preferences or topics.
Flexibility: Can handle complex, high-dimensional data by modeling the underlying latent factors.
Limitations:
Linear Assumptions: Many latent factor models (e.g., PCA, matrix factorization) assume
linear relationships, which may not capture complex, non-linear patterns.
Scalability: Latent factor models can be computationally expensive, especially with large
datasets, and require efficient optimization algorithms.
Mixture Models vs. Latent Factor Models:
Key Assumption: mixture models assume data points are generated from multiple unknown distributions; latent factor models assume data points are generated by latent (unobserved) factors.
Example Models: Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) vs. PCA, Matrix Factorization, and Latent Dirichlet Allocation (LDA).
Type of Clustering: mixture models give hard or soft clustering based on probability; latent factor models can be used for soft clustering in collaborative filtering.
In summary, mixture models are generative models that represent data as a combination of multiple
distributions, often used for clustering and density estimation. Latent factor models, on the other
hand, focus on uncovering hidden factors that generate observed data, with applications in
recommendation systems, topic modeling, and dimensionality reduction. Both types of models are
powerful tools for uncovering structure in complex datasets.