A Methodological Comparison of Linear and Non-Linear Dimensionality Reduction Techniques

ML Team 13 B Div

Mrs. Pooja Chandargi
Assistant Professor,
School of Electrical and Electronics Engineering
KLE Technological University
Hubli, Karnataka, India
[email protected]

Mrs. Renuka M Ganiger
Assistant Professor,
School of Electrical and Electronics Engineering
KLE Technological University
Hubli, Karnataka, India
[email protected]
Abstract— Dimensionality reduction plays a pivotal role in handling high-dimensional datasets by mitigating computational inefficiency, redundancy, and overfitting while preserving critical data structure. This paper evaluates three prominent dimensionality reduction techniques on a breast cancer dataset sourced from the SEER program: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders. The study compares their performance across key metrics, including computational efficiency, reconstruction error, and structural preservation.

Keywords - dimensionality reduction, PCA, t-SNE, Autoencoders, computation cost, efficiency, comparison

I. INTRODUCTION
Any given dataset has inherent characteristics that define its structure, behaviour and relationships among its elements. Dimensionality is one such aspect that greatly affects the data representation. Dimensionality indicates the number of samples, features or variables available in the dataset. Each feature corresponds to a measurable entity or characteristic of the data.

Real-world datasets are massive in size and spread across various domains such as finance, social media, healthcare, academic literature, image processing, etc. The huge number of features within these datasets, often in the thousands or millions, makes them high-dimensional.

High-dimensional datasets bring with them concerns and constraints, as they are complicated to work with due to their sheer size, the speed of data generation and processing, as well as the variety of information. High-dimensional data increases computational costs for machine learning algorithms. In high-dimensional spaces, data points tend to become sparse, making it difficult to visualize and detect meaningful patterns or relationships. Datasets like these require more storage space and tend to have noise or redundant features that lead to overfitting in the models, making it necessary to address these problems by reducing the dimensions. Dimensionality reduction directly tackles these issues by transforming data into a lower-dimensional space without significant information loss.

These techniques can broadly be categorized into linear and non-linear methods, each with unique strengths and limitations. Classical methods like Principal Component Analysis (PCA) excel in handling linearly separable data, while modern algorithms such as t-distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are adept at uncovering complex, non-linear patterns in data.

This research paper aims to provide a comprehensive analysis of various dimensionality reduction techniques and to compare their performance on selected parameters. It investigates the trade-offs between computational efficiency, scalability, interpretability, and accuracy, with the objective of identifying optimal methods for specific use cases.

II. LITERATURE REVIEW

Dimensionality reduction addresses the challenges posed by high-dimensional datasets, such as computational inefficiency and the curse of dimensionality. Techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are commonly applied to improve model performance and data interpretability [1][2].

PCA is a widely used linear dimensionality reduction method that transforms high-dimensional data into a lower-dimensional space by identifying directions of maximum variance. Its effectiveness in medical datasets, such as liver disease detection, has been demonstrated. PCA provides computational efficiency and noise reduction, making it suitable for large-scale datasets. However, its limitations include the inability to capture non-linear relationships in data [1][3].
PCA's advantages include its unsupervised nature and its ability to reduce redundant features and noise, but it can suffer from high information loss in certain scenarios [4][5]. Extensions like Distributed Parallel PCA (DP-PCA) address these limitations by improving computational efficiency and scalability, making it suitable for big-data applications [4].

Unlike PCA, which is unsupervised, LDA is a supervised method that maximizes class separability. It projects data onto a subspace that optimally discriminates between classes. While effective for classification tasks, LDA struggles with datasets containing overlapping classes or limited samples [6][4].

t-SNE and Multidimensional Scaling (MDS): Both methods have been explored for their ability to preserve local and global structures in high-dimensional data. t-SNE is effective for visualization but is computationally expensive, while MDS focuses on preserving pairwise distances and is more suited for low-dimensional tasks. These methods have been applied to datasets such as Parkinson's and diabetes to enhance classification accuracy with algorithms like K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) [2][3].

t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are powerful for visualizing high-dimensional data. t-SNE excels in maintaining local structure, making it suitable for clustering and classification visualization, although it can be computationally intensive. UMAP, on the other hand, offers similar benefits with faster computation [4].

Autoencoders: Unlike PCA and t-SNE, autoencoders leverage deep learning to learn latent features and reduce dimensions through encoding-decoding mechanisms. Autoencoders outperform traditional techniques in terms of reconstruction error and robustness against noise. They have been effectively applied to large, feature-rich datasets, demonstrating superior performance on both linear and non-linear data structures [2].

RBMs (Restricted Boltzmann Machines) are another neural-network-based technique used for unsupervised DR. They transform high-dimensional input into low-dimensional latent variables, preserving essential features for tasks like image recognition [6].

Several comparative studies demonstrate the strengths and weaknesses of these techniques in specific contexts. For instance, in micro-expression recognition, KPCA, t-SNE, and RBMs were evaluated using CASMEII datasets. The results highlighted that t-SNE produced superior low-dimensional representations for visualization, while KPCA excelled in classification tasks [6]. Another study using PCA and LDA on medical datasets revealed PCA's superior performance for high-dimensional datasets, while LDA was more effective for datasets with fewer features [5].

III. DATASET DESCRIPTION

The dataset chosen was sourced from the Kaggle website [7]. This dataset of breast cancer patients was obtained from the 2017 November update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset involved female patients with infiltrating duct and lobular carcinoma breast cancer (SEER primary sites recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumour size, examined regional LNs, positive regional LNs, and patients whose survival months were less than 1 month were excluded; thus, 4024 patients were ultimately included.

The dataset provided a detailed overview of clinical, pathological, and demographic factors associated with cancer patients and their outcomes. Its primary focus appeared to be identifying relationships between various factors, such as tumor characteristics, receptor statuses, lymph node involvement, and staging, and patient survival.
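Before the techniques compared in the next section can be applied, the mixed numeric and categorical fields of this dataset have to be turned into a purely numeric, standardized matrix. The sketch below shows one way this could be done with pandas and scikit-learn; the file name and the "Status" label column are assumptions for illustration, not details taken from the paper.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the SEER breast cancer export (file name is an assumed placeholder).
df = pd.read_csv("seer_breast_cancer.csv")

# "Status" is assumed to be the survival-status label column; adjust it to
# match the actual Kaggle export. All remaining columns are treated as features.
y = df["Status"]
X = df.drop(columns=["Status"])

# One-hot encode categorical fields (grade, stage, receptor status, ...) and
# standardize every feature to zero mean and unit variance.
X_encoded = pd.get_dummies(X, drop_first=True)
X_scaled = StandardScaler().fit_transform(X_encoded)

print(X_scaled.shape)  # expected roughly (4024, number_of_encoded_features)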
IV. METHODOLOGY

For the analysis, the Breast Cancer dataset has been used to evaluate and compare the performance of the following techniques:

1. Principal Component Analysis (PCA): A linear DR technique that projects data onto a lower-dimensional subspace, maximizing variance.

2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear DR method optimized for visualizing high-dimensional data in two or three dimensions.

3. Autoencoders: A deep learning-based non-linear DR approach that encodes data into a compressed representation and decodes it back to reconstruct the original input (a minimal sketch of such an encoder-decoder follows this list).
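PCA and t-SNE are available directly in scikit-learn, while the autoencoder in item 3 has to be defined as a small neural network. The sketch below is a minimal Keras encoder-decoder operating on the standardized matrix X_scaled from the earlier preprocessing sketch; the layer sizes, the 2-dimensional bottleneck, and the training settings are illustrative assumptions rather than the architecture used in the study.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# X_scaled is the standardized matrix from the (assumed) preprocessing sketch.
n_features = X_scaled.shape[1]

# Encoder: compress the input into a 2-dimensional bottleneck (assumed size).
inputs = keras.Input(shape=(n_features,))
hidden = layers.Dense(16, activation="relu")(inputs)
bottleneck = layers.Dense(2, name="bottleneck")(hidden)

# Decoder: reconstruct the original features from the bottleneck.
hidden_dec = layers.Dense(16, activation="relu")(bottleneck)
outputs = layers.Dense(n_features)(hidden_dec)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=64, verbose=0)

# The trained encoder alone yields the reduced representation.
encoder = keras.Model(inputs, bottleneck)
Z_ae = encoder.predict(X_scaled)

# Mean squared reconstruction error, one of the comparison metrics in this paper.
recon_error = np.mean((X_scaled - autoencoder.predict(X_scaled)) ** 2)
print("autoencoder reconstruction error:", recon_error)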
1. Principal Component Analysis

Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction, aiming to transform a dataset into a lower-dimensional space while retaining as much variance as possible. It does this by identifying the principal components, which are the directions of maximum variance in the data.

In a study [12] that involved data compression, PCA minimized storage and computational demands by eliminating redundancy and focusing on the most informative components, making it ideal for handling big data in industrial contexts.

It also serves as a valuable tool in feature engineering, where it enhances machine learning model performance by removing correlated and less significant features, ensuring more robust predictions. According to another study [13], in exploratory data analysis PCA facilitated the visualization of high-dimensional datasets, helping researchers uncover patterns and relationships that might not have been immediately apparent.
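As an illustration of this exploratory use, the sketch below applies scikit-learn's PCA to the standardized matrix X_scaled from the earlier preprocessing sketch, inspects how much variance the leading components retain, and keeps a 2-D projection for visualization; the 95% variance threshold is an assumed, illustrative choice rather than a setting from the study.

import numpy as np
from sklearn.decomposition import PCA

# X_scaled is the standardized matrix from the (assumed) preprocessing sketch.
# Fit PCA on all components and inspect the variance profile.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components retaining (an assumed) 95% of the variance.
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{k} components retain {cumulative[k - 1]:.2%} of the total variance")

# 2-D projection commonly used for plotting and pattern discovery.
Z_pca = PCA(n_components=2).fit_transform(X_scaled)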
The PCA algorithm identifies the directions, or principal components, that maximize the variance in the dataset. It does so by solving the eigenvalue decomposition of the covariance matrix or by applying singular value decomposition (SVD). The main steps are listed below, followed by a short illustrative sketch.

1. Standardize the dataset.
   i. μ_j = (1/n) Σ_i x_ij
   ii. σ_j = √((1/n) Σ_i (x_ij - μ_j)^2)
   iii. z_ij = (x_ij - μ_j) / σ_j

2. Compute the covariance matrix of the standardized data.
   i. Σ = (1/(n-1)) Z^T Z

3. Perform eigenvalue decomposition of Σ or SVD of Z.

4. Select the top-k principal components (the eigenvectors with the largest eigenvalues) and project the data onto this subspace.
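A compact NumPy sketch that follows these four steps is shown below. It is written for clarity rather than efficiency, assumes the encoded feature matrix X_encoded from the earlier preprocessing sketch, and is an illustration of the procedure rather than the exact code used in the experiments.

import numpy as np

def pca_from_scratch(X, k):
    # 1. Standardize: z_ij = (x_ij - mu_j) / sigma_j.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant columns
    Z = (X - mu) / sigma

    # 2. Covariance matrix: Sigma = (1 / (n - 1)) * Z^T Z.
    n = Z.shape[0]
    cov = (Z.T @ Z) / (n - 1)

    # 3. Eigenvalue decomposition (eigh: the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort components by decreasing variance

    # 4. Keep the top-k eigenvectors and project the data onto them.
    W = eigvecs[:, order[:k]]
    return Z @ W

# X_encoded comes from the earlier (assumed) preprocessing sketch.
Z_2d = pca_from_scratch(np.asarray(X_encoded, dtype=float), k=2)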
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton in 2008, is a widely used dimensionality reduction technique designed to visualize high-dimensional data in two or three dimensions. It builds upon Stochastic Neighbor Embedding (SNE) by addressing optimization challenges and resolving the "crowding problem" through the use of a Student-t distribution for low-dimensional similarities [8].

Its utility lies in revealing intricate relationships in datasets with manifold structures, as commonly observed in high-dimensional data like image features or single-cell RNA sequencing data [8][9].

In paper [8], it excelled in exploring relationships in datasets with non-linear structures, revealing insights that traditional linear methods might miss; by addressing issues like the crowding problem, t-SNE provides meaningful 2D or 3D maps even for data with manifold properties. The study [9], however, stated that although t-SNE is highly effective, it can be slow and inefficient when working with very large datasets because the standard version does not scale well. Improved methods, like FFT-accelerated t-SNE, address this issue by making the process faster and more practical for widespread use.

The t-SNE algorithm models the high-dimensional pairwise similarities using a Gaussian distribution and maps them to a low-dimensional space using a Student-t distribution, minimizing the Kullback-Leibler divergence between the two distributions.
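A minimal scikit-learn sketch of this procedure is given below; it assumes the standardized matrix X_scaled from the earlier preprocessing sketch, and the perplexity value and 2-D target dimension are common illustrative settings rather than the exact configuration used in the experiments.

from sklearn.manifold import TSNE

# X_scaled is the standardized matrix from the (assumed) preprocessing sketch.
# Perplexity 30 and PCA initialization are illustrative defaults, not the
# paper's exact settings.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
Z_tsne = tsne.fit_transform(X_scaled)

# The optimization minimizes the KL divergence between the high- and
# low-dimensional similarity distributions; scikit-learn reports it here.
print("final KL divergence:", tsne.kl_divergence_)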
VII. REFERENCES