
Cheat Sheet: Building Unsupervised Learning Models


Unsupervised learning models

UMAP
UMAP (Uniform Manifold Approximation and Projection) is used for dimensionality reduction.
Pros: High performance, preserves global structure.
Cons: Sensitive to parameters.
Applications: Data visualization, feature extraction.
Key hyperparameters:
n_neighbors: Controls the local neighborhood size (default = 15).
min_dist: Controls the minimum distance between points in the embedded space (default = 0.1).
n_components: The dimensionality of the embedding (default = 2).
Code syntax:
from umap.umap_ import UMAP
umap = UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
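
Example (a minimal usage sketch, assuming the umap-learn package is installed and toy data from make_blobs):

# Minimal usage sketch (assumes the umap-learn package is installed)
from sklearn.datasets import make_blobs
from umap.umap_ import UMAP

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=42)
umap_model = UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_embedded = umap_model.fit_transform(X)
print(X_embedded.shape)  # (500, 2)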

t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique.
Pros: Good for visualizing high-dimensional data.
Cons: Computationally expensive, prone to overfitting.
Applications: Data visualization, anomaly detection.
Key hyperparameters:
n_components: The number of dimensions for the output (default = 2).
perplexity: Balances attention between local and global aspects of the data (default = 30).
learning_rate: Controls the step size during optimization (default = 200).
Code syntax:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200)
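
Example (a minimal usage sketch, assuming toy data from make_blobs and matplotlib for plotting):

# Minimal usage sketch: 2-D t-SNE embedding of toy data, plotted with matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=500, centers=4, n_features=10, random_state=42)
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_embedded = tsne.fit_transform(X)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=10)
plt.show()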

PCA
PCA (Principal Component Analysis) is used for linear dimensionality reduction.
Pros: Easy to interpret, reduces noise.
Cons: Linear, may lose information in nonlinear data.
Applications: Feature extraction, compression.
Key hyperparameters:
n_components: Number of principal components to retain (default = 2).
whiten: Whether to scale the components (default = False).
svd_solver: The algorithm to compute the components (default = 'auto').
Code syntax:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
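
Example (a minimal usage sketch, assuming toy data from make_blobs; standardization added because PCA is scale-sensitive):

# Minimal usage sketch: PCA on standardized toy data
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, n_features=6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance captured by each component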

DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
Pros: Identifies outliers, does not require the number of clusters.
Cons: Difficult with varying density clusters.
Applications: Anomaly detection, spatial data clustering.
Key hyperparameters:
eps: The maximum distance between two points to be considered neighbors (default = 0.5).
min_samples: Minimum number of samples in a neighborhood to form a cluster (default = 5).
Code syntax:
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
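
Example (a minimal usage sketch, assuming toy data from make_blobs; the label -1 marks points treated as noise):

# Minimal usage sketch: DBSCAN on toy data; label -1 marks noise points
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", list(labels).count(-1), "noise points")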

HDBSCAN
HDBSCAN (Hierarchical DBSCAN) improves on DBSCAN by handling varying density clusters.
Pros: Better handling of varying densities.
Cons: Can be slower than DBSCAN.
Applications: Large datasets, complex clustering problems.
Key hyperparameters:
min_cluster_size: The minimum size of clusters (default = 5).
min_samples: Minimum number of samples to form a cluster (default = 10).
Code syntax:
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
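
Example (a minimal usage sketch, assuming the hdbscan package is installed; the toy blobs are given different spreads to mimic varying densities):

# Minimal usage sketch (assumes the hdbscan package is installed)
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.4, 1.0, 2.0], random_state=42)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(X)     # -1 marks noise, as in DBSCAN
print(clusterer.probabilities_[:5])   # per-point cluster-membership strength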

K-Means clustering
K-Means is a centroid-based clustering algorithm that groups data into k clusters.
Pros: Efficient, simple to implement.
Cons: Sensitive to initial cluster centroids.
Applications: Customer segmentation, pattern recognition.
Key hyperparameters:
n_clusters: Number of clusters (default = 8).
init: Method for initializing the centroids ('k-means++' or 'random', default = 'k-means++').
n_init: Number of times the algorithm will run with different centroid seeds (default = 10).
Code syntax:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
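
Example (a minimal usage sketch, assuming toy data from make_blobs): fit K-Means and inspect the centroids and the inertia.

# Minimal usage sketch: K-Means on toy data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # within-cluster sum of squared distances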

Associated functions used

make_blobs
Generates isotropic Gaussian blobs for clustering.
Code syntax:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

multivariate_normal
Generates samples from a multivariate normal distribution.
Code syntax:
from numpy.random import multivariate_normal
samples = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]], size=100)

plotly.express.scatter_3d
Creates a 3D scatter plot using Plotly Express.
Code syntax:
import plotly.express as px
fig = px.scatter_3d(df, x='x', y='y', z='z')
fig.show()
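
Example (a minimal self-contained sketch; the DataFrame df of 3-D samples is built here for illustration):

# Minimal usage sketch: 3-D scatter plot of multivariate-normal samples
import pandas as pd
import plotly.express as px
from numpy.random import multivariate_normal

samples = multivariate_normal(mean=[0, 0, 0], cov=[[1, 0, 0], [0, 1, 0], [0, 0, 1]], size=100)
df = pd.DataFrame(samples, columns=['x', 'y', 'z'])
fig = px.scatter_3d(df, x='x', y='y', z='z')
fig.show()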

geopandas.GeoDataFrame
Creates a GeoDataFrame from a Pandas DataFrame.
Code syntax:
import geopandas as gpd
gdf = gpd.GeoDataFrame(df, geometry='geometry')

geopandas.to_crs
Transforms the coordinate reference system of a GeoDataFrame.
Code syntax:
gdf = gdf.to_crs(epsg=3857)

contextily.add_basemap
Adds a basemap to a GeoDataFrame plot for context.
Code syntax:
import contextily as ctx
ax = gdf.plot(figsize=(10, 10))
ctx.add_basemap(ax)
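
Example (a minimal sketch combining the three geospatial helpers above; the toy longitude/latitude points are illustrative):

# Minimal usage sketch: build a GeoDataFrame, reproject it, and plot it over a basemap
import contextily as ctx
import geopandas as gpd
import pandas as pd

df = pd.DataFrame({'longitude': [-79.38, -73.57], 'latitude': [43.65, 45.50]})  # toy points
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['longitude'], df['latitude']), crs='EPSG:4326')
gdf = gdf.to_crs(epsg=3857)      # reproject to Web Mercator for web-tile basemaps
ax = gdf.plot(figsize=(10, 10))
ctx.add_basemap(ax)              # fetches basemap tiles (requires internet access)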


pca.explained_variance_ratio_
Returns the proportion of variance explained by each principal component.
Code syntax:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
variance_ratio = pca.explained_variance_ratio_
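
Example (a minimal sketch; the 95% threshold is an illustrative choice): use the cumulative explained-variance ratio to decide how many components to keep.

# Minimal usage sketch: pick the smallest number of components covering ~95% of the variance
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, centers=3, n_features=8, random_state=42)
pca = PCA().fit(X)  # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components, "components explain at least 95% of the variance")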

Authors
Jeff Grossman
Abhishek Gagneja

