
A Methodological Comparison of Linear and Non-Linear Dimensionality Reduction Techniques


Soniya Patil, Parinitha Oruganti, Sonal Ullegaddi
Students, 3rd year, School of Electrical and Electronics Engineering
KLE Technological University, Hubli, Karnataka, India
[email protected], [email protected], [email protected]

Mrs. Pooja Chandargi
Assistant Professor, School of Electrical and Electronics Engineering
KLE Technological University, Hubli, Karnataka, India
[email protected]

Mrs. Renuka M Ganiger
Assistant Professor, School of Electrical and Electronics Engineering
KLE Technological University, Hubli, Karnataka, India
[email protected]

Abstract— Dimensionality reduction plays a pivotal role in handling high-dimensional datasets by mitigating computational inefficiency, redundancy, and overfitting while preserving critical data structure. This paper evaluates three prominent dimensionality reduction techniques—Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Autoencoders—on a breast cancer dataset sourced from the SEER program. The study compares their performance across key metrics, including computational efficiency, reconstruction error, and structural preservation.

Keywords: dimensionality reduction, PCA, t-SNE, Autoencoders, computation cost, efficiency, comparison

I. INTRODUCTION
Any given dataset has inherent characteristics that define its structure, behaviour and relationships among its elements. Dimensionality is one such aspect that greatly affects how the data is represented. Dimensionality indicates the number of samples, features or variables available in the dataset, where each feature corresponds to a measurable entity or characteristic of the data.
Real-world datasets are massive in size and spread across domains such as finance, social media, healthcare, academic literature and image processing. The huge number of features within these datasets, often in the thousands or millions, makes them high-dimensional.
High-dimensional datasets bring concerns and constraints: they are complicated to work with due to their sheer size, the speed at which data is generated and must be processed, and the variety of information they contain. High-dimensional data increases computational costs for machine learning algorithms, and in high-dimensional spaces data points tend to become sparse, making it difficult to visualize and detect meaningful patterns or relationships. Such datasets also require more storage space and tend to contain noise or redundant features that lead to overfitting in models, making it necessary to address these problems by reducing the number of dimensions. Dimensionality reduction directly tackles these issues by transforming data into a lower-dimensional space without significant information loss.
These techniques can broadly be categorized into linear and non-linear methods, each with unique strengths and limitations. Classical methods like Principal Component Analysis (PCA) excel in handling linearly separable data, while modern algorithms such as t-distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are adept at uncovering complex, non-linear patterns in data.
This paper aims to provide a comprehensive analysis of several dimensionality reduction techniques and to compare their performance on a common set of parameters. It investigates the trade-offs between computational efficiency, scalability, interpretability, and accuracy, with the objective of identifying optimal methods for specific use cases.

II. LITERATURE REVIEW
Dimensionality reduction addresses the challenges posed by high-dimensional datasets, such as computational inefficiency and the curse of dimensionality. Techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are commonly applied to improve model performance and data interpretability [1][2].
PCA is a widely used linear dimensionality reduction method that transforms high-dimensional data into a lower-dimensional space by identifying directions of maximum variance. Its effectiveness in medical datasets, such as liver disease detection, has been demonstrated. PCA provides computational efficiency and noise reduction, making it suitable for large-scale datasets. However, its limitations
include the inability to capture non-linear relationships in the data [1][3].
PCA's advantages include its unsupervised nature and its ability to reduce redundant features and noise, but it can suffer from high information loss in certain scenarios [4][5]. Extensions like Distributed Parallel PCA (DP-PCA) address these limitations by improving computational efficiency and scalability, making it suitable for big data applications [4].
Unlike PCA, which is unsupervised, LDA is a supervised method that maximizes class separability. It projects data onto a subspace that optimally discriminates between classes. While effective for classification tasks, LDA struggles with datasets containing overlapping classes or limited samples [6][4].
t-SNE and Multidimensional Scaling (MDS): both methods have been explored for their ability to preserve local and global structure in high-dimensional data. t-SNE is effective for visualization but is computationally expensive, while MDS focuses on preserving pairwise distances and is more suited to low-dimensional tasks. These methods have been applied to datasets such as Parkinson's and diabetes data to enhance classification accuracy with algorithms like K-Nearest Neighbors (KNN) and Support Vector Machines (SVM) [2][3].
t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are powerful for visualizing high-dimensional data. t-SNE excels at maintaining local structure, making it suitable for clustering and classification visualization, although it can be computationally intensive. UMAP, on the other hand, offers similar benefits with faster computation [4].
Autoencoders: unlike PCA and t-SNE, autoencoders leverage deep learning to learn latent features and reduce dimensions through an encoding-decoding mechanism. Autoencoders outperform traditional techniques in terms of reconstruction error and robustness against noise. They have been effectively applied to large, feature-rich datasets, demonstrating strong performance on both linear and non-linear data structures [2].
RBMs (Restricted Boltzmann Machines) are another neural-network-based technique used for unsupervised DR. They transform high-dimensional input into low-dimensional latent variables, preserving essential features for tasks like image recognition [6].
Several comparative studies demonstrate the strengths and weaknesses of these techniques in specific contexts. For instance, in micro-expression recognition, KPCA, t-SNE, and RBMs were evaluated on the CASMEII dataset; the results highlighted that t-SNE produced superior low-dimensional representations for visualization, while KPCA excelled in classification tasks [6]. Another study using PCA and LDA on medical datasets revealed PCA's superior performance for high-dimensional datasets, while LDA was more effective for datasets with fewer features [5].

III. DATASET DESCRIPTION
The dataset was sourced from the Kaggle website [7]. This dataset of breast cancer patients was obtained from the November 2017 update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset involved female patients with infiltrating duct and lobular carcinoma breast cancer (SEER primary cites recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumour size, examined regional LNs, positive regional LNs, and patients whose survival months were less than 1 month were excluded; thus, 4024 patients were ultimately included.
The dataset provides a detailed overview of clinical, pathological, and demographic factors associated with cancer patients and their outcomes. Its primary focus appears to be identifying relationships between various factors—such as tumor characteristics, receptor statuses, lymph node involvement, and staging—and patient survival.

IV. METHODOLOGY
For the analysis, the Breast Cancer dataset has been used to evaluate and compare the performance of the following techniques:
1. Principal Component Analysis (PCA): a linear DR technique that projects data onto a lower-dimensional subspace, maximizing variance.
2. t-Distributed Stochastic Neighbor Embedding (t-SNE): a non-linear DR method optimized for visualizing high-dimensional data in two or three dimensions.
3. Autoencoders: a deep-learning-based non-linear DR approach that encodes data into a compressed representation and decodes it back to reconstruct the original input.

1. Principal Component Analysis
Principal Component Analysis (PCA) is one of the most widely used techniques for dimensionality reduction, aiming to transform a dataset into a lower-dimensional space while retaining as much variance as possible. It does this by identifying the principal components, which are the directions of maximum variance in the data.
In a study [12] that involved data compression, PCA minimized storage and computational demands by eliminating redundancy and focusing on the most informative components, making it ideal for handling big data in industrial contexts.
It also serves as a valuable tool in feature engineering, where it enhances machine learning model performance by removing correlated and less significant features, ensuring more robust predictions. According to another study [13], in exploratory data analysis PCA facilitated the visualization of high-dimensional datasets, helping researchers uncover patterns and relationships that might not have been immediately apparent.

The PCA algorithm identifies the directions, or principal components, that maximize the variance in the dataset. It does so by solving the eigenvalue decomposition of the covariance matrix or by applying singular value decomposition (SVD):
1. Standardize the dataset.
   i. μ_j = (1/n) Σ x_ij
   ii. σ_j = √((1/n) Σ (x_ij - μ_j)^2)
   iii. z_ij = (x_ij - μ_j) / σ_j
2. Compute the covariance matrix of the standardized data.
   i. Σ = (1/(n-1)) Z^T Z
3. Perform eigenvalue decomposition or SVD of the covariance matrix.
4. Select the top-k principal components and project the data onto this subspace.

Figure 1.[14] Flowchart of PCA Algorithm
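As a hedged illustration (not the authors' exact code), these four steps map directly onto scikit-learn's StandardScaler and PCA; the feature matrix X and the 95% variance target are assumptions based on the results reported later:

```python
# Sketch only: PCA on a numeric feature matrix X of shape (n_samples, n_features).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def run_pca(X, n_components=0.95):
    """Standardize, decompose, and project; return the reduced data and diagnostics."""
    Z = StandardScaler().fit_transform(X)         # step 1: z_ij = (x_ij - mu_j) / sigma_j
    pca = PCA(n_components=n_components)          # steps 2-3: covariance + eigendecomposition (via SVD)
    Z_reduced = pca.fit_transform(Z)              # step 4: project onto the top-k components
    Z_reconstructed = pca.inverse_transform(Z_reduced)
    mse = np.mean((Z - Z_reconstructed) ** 2)     # reconstruction error in the standardized space
    return Z_reduced, pca.explained_variance_ratio_, mse
```

With a fractional n_components such as 0.95, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that threshold, which corresponds to the variance-retention criterion used in the results section.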

2. t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton in 2008, is a widely used dimensionality reduction technique designed to visualize high-dimensional data in two or three dimensions. It builds upon Stochastic Neighbor Embedding (SNE) by addressing optimization challenges and resolving the "crowding problem" through the use of a Student-t distribution for low-dimensional similarities [8].
Its utility lies in revealing intricate relationships in datasets with manifold structures, as commonly observed in high-dimensional data such as image features or single-cell RNA sequencing data [8][9].
In paper [8], t-SNE excelled at exploring relationships in datasets with non-linear structures, revealing insights that traditional linear methods might miss; by addressing issues like the crowding problem, it provides meaningful 2D or 3D maps even for data with manifold properties. The study [9], however, noted that although t-SNE is highly effective, it can be slow and inefficient on very large datasets because the standard version does not scale well. Improved methods, such as FFT-accelerated t-SNE, address this issue by making the process faster and more practical for widespread use.
The t-SNE algorithm models the high-dimensional pairwise similarities using a Gaussian distribution and maps them to a low-dimensional space using a Student-t distribution, minimizing the Kullback-Leibler divergence between the two distributions.

Figure 2.[15] t-SNE Algorithm
1. Compute pairwise similarities in the high-dimensional space.
2. Map these similarities to a low-dimensional space using t-distributions.
3. Iteratively optimize the embeddings using gradient descent to preserve neighborhood relationships.
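A minimal sketch of these three steps using scikit-learn's TSNE; the perplexity value and the reuse of the standardized matrix Z from the PCA sketch are illustrative assumptions, not the study's reported settings:

```python
# Sketch only: 2-D t-SNE embedding of a standardized feature matrix Z.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def run_tsne(Z, labels, perplexity=30, random_state=42):
    """Embed Z in two dimensions and plot class separability."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=random_state)        # gradient descent minimizes the KL divergence
    embedding = tsne.fit_transform(Z)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5)
    plt.title("t-SNE projection (2D)")
    plt.show()
    return embedding
```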
3. Autoencoders
The use of autoencoders for handling high-dimensional and complex datasets has demonstrated transformative potential across various domains. In the financial sector, one study introduced a novel framework for clustering financial time series. Financial data often presents challenges due to its high dimensionality and non-linear relationships. By utilizing autoencoders as a dimensionality reduction technique, this framework compresses multi-dimensional financial metrics into lower-dimensional representations while retaining critical informational features. This compression enabled traditional clustering algorithms, such as Agglomerative, BIRCH, and KNN, to achieve enhanced granularity and category isolation, which in turn improved predictive modeling, optimized investment strategies, and facilitated risk management in the financial sector [10].
In medical imaging, a dual-featured autoencoder (DFAE) has been proposed for the classification of histological images of human lung tissues. This approach combined spectral and spatial feature extraction using the Persistent Homology Algorithm Toolbox (PHAT) and an image-super-resolution autoencoder. When the compressed features were processed using a Triple Layered Convolutional Architecture (TLCA), the system achieved a classification accuracy of 96%. The method addressed key challenges in analyzing high-resolution histological images, offering advancements in diagnostic accuracy and efficiency [11].
Unlike traditional methods such as PCA, autoencoders excel at capturing complex non-linear patterns, as demonstrated in their use for clustering financial time series, where they improved granularity and predictive accuracy [10]. Similarly, in histological image classification, autoencoders retained essential structural and textural information, achieving a remarkable 96% classification accuracy while mitigating overfitting risks [11]. The flexibility of autoencoders across diverse domains, and their capability to extract meaningful, compressed representations of high-dimensional data, make them an efficient and adaptable solution for complex, high-dimensional tasks. The algorithm follows the steps below:
1. Design an encoder network to compress data into a latent space.
2. Design a decoder network to reconstruct the original data from the latent space.
3. Train the network using backpropagation to minimize reconstruction error.

Figure 3.[15] Autoencoders Algorithm
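The following Keras sketch shows one way to realize these three steps; the layer widths, the latent size (40% of the input dimension, echoing the reduction ratio reported in the results), and the training settings are assumptions, since the paper does not specify its exact architecture:

```python
# Sketch only: a fully connected autoencoder; widths and training settings are illustrative.
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, latent_dim):
    inputs = keras.Input(shape=(n_features,))
    hidden_enc = layers.Dense(64, activation="relu")(inputs)             # encoder
    latent = layers.Dense(latent_dim, activation="relu")(hidden_enc)     # compressed representation
    hidden_dec = layers.Dense(64, activation="relu")(latent)             # decoder
    outputs = layers.Dense(n_features, activation="linear")(hidden_dec)  # reconstruction
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, latent)
    autoencoder.compile(optimizer="adam", loss="mse")                    # backprop minimizes reconstruction MSE
    return autoencoder, encoder

# Example usage with standardized train/test splits Z_train, Z_test:
# autoencoder, encoder = build_autoencoder(Z_train.shape[1], latent_dim=max(1, int(0.4 * Z_train.shape[1])))
# history = autoencoder.fit(Z_train, Z_train, epochs=50, batch_size=32, validation_data=(Z_test, Z_test))
# latent_repr = encoder.predict(Z_test)
```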
Procedure for Analysis:
1. Dataset Preparation
   • Load the Breast Cancer dataset and split it into training and testing subsets.
   • Normalize or standardize the dataset to ensure compatibility with the DR techniques.
2. Application of DR Techniques
   1. PCA:
      a. Perform PCA and retain a predefined number of principal components.
      b. Record the explained variance ratio for each component.
   2. t-SNE:
      a. Apply t-SNE with optimized hyperparameters.
      b. Project the data into a 2D or 3D space for visualization.
   3. Autoencoders:
      a. Design and train an autoencoder with an appropriate architecture.
      b. Use the encoder to obtain the latent representation.
3. Visualization
   Visualize the reduced-dimensional data for each method using scatter plots in Python. Compare the spatial separability of the different classes in the dataset.
4. Computation of Performance Metrics
   Evaluate the techniques based on the following metrics (a short sketch of these computations is given after this list):
   1. Computational Efficiency: measure the execution time for training and transforming the data.
   2. Reconstruction Error: for PCA and autoencoders, compute the reconstruction error as the mean squared error between the original and reconstructed data.
   3. Preservation of Data Structure: for t-SNE, evaluate how well local neighbourhoods are preserved in the low-dimensional space.
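A hedged sketch of how these metrics could be computed: execution time via time.perf_counter, reconstruction error as a mean squared error, and neighbourhood preservation via scikit-learn's trustworthiness score. The paper does not state its exact measurement code, so the helper names below are illustrative:

```python
# Sketch only: helper functions for the three evaluation metrics.
import time
import numpy as np
from sklearn.manifold import trustworthiness

def timed(fn, *args, **kwargs):
    """Run any fit/transform call and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def reconstruction_mse(X_original, X_reconstructed):
    """Mean squared error between original and reconstructed data (PCA, autoencoders)."""
    return float(np.mean((X_original - X_reconstructed) ** 2))

def neighbourhood_preservation(X_high, X_low, n_neighbors=5):
    """Score in [0, 1] for how well local neighbourhoods survive the embedding (used for t-SNE)."""
    return trustworthiness(X_high, X_low, n_neighbors=n_neighbors)
```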
5. Analysis
   The study carried out a comparative analysis, evaluating PCA, t-SNE, and Autoencoders across performance metrics including computational efficiency, reconstruction error, and cluster separability. A tabular summary presents the key metrics, while bar plots and line graphs compare them visually. The trade-offs reveal PCA's superior computational efficiency, the stronger reconstruction accuracy of PCA and Autoencoders, and t-SNE's excellence in class separability
for 2D visualizations.

V. RESULTS AND ANALYSIS
1. The linear dimensionality reduction technique, PCA, performed exceptionally well on the dataset, reducing dimensionality by 60% while retaining 95% of the variance, demonstrating its efficiency. The reconstruction error was negligible (~5.498 × 10^-31), indicating minimal information loss. Computational efficiency was high, with a transformation time of only 0.0096 seconds.

Figure 4. Visualization of the data on a 2-dimensional scatter plot.

2. The non-linear technique t-SNE showed high reconstruction error, poor global distance preservation, and required significant computational time (43.57 seconds), making it less efficient for large datasets or tasks needing global structure retention.

Figure 5

3. Autoencoders demonstrated a balanced performance in dimensionality reduction. The reconstruction error was 0.122, indicating a reasonable ability to reconstruct the original data from its latent representation. The dimensionality reduction efficiency turned out to be 0.4, reducing the data to 40% of its original space.

Figure 6

Figure 7

The correlation plot shows some discrepancies in structure preservation, which could be optimized further. The training and validation loss curves show a smooth and consistent decline, indicating that the model was trained effectively without overfitting. The computational efficiency, with a training time of 17.55 seconds, reflected good performance for this dataset.

Table 1. Comparison of Dimensionality Reduction Techniques
Metric | PCA | t-SNE | Autoencoders
Reconstruction Error | 5.498 × 10⁻³¹ (near-zero) | 55.467 (high) | 0.122 (low-moderate)
Dimensionality Reduction Efficiency | 0.4 | 243.99356 (very high) | 0.4
Computational Efficiency (Time) | 0.0096 seconds (fast) | 43.57 seconds (slow) | 17.55 seconds (moderate)
Strengths | Efficient, retains global structure | Excellent for visualization, preserves local structure | Learns complex, non-linear patterns
Figure 8

VI. CONCLUSION
The evaluation of dimensionality reduction techniques highlights that the choice of method is highly dependent on the dataset's characteristics. For linear datasets, Principal Component Analysis (PCA) emerges as the most effective technique due to its computational efficiency and ability to retain significant variance. It demonstrated minimal reconstruction error and high dimensionality reduction efficiency, making it suitable for large-scale, structured datasets.
For non-linear datasets, autoencoders and t-Distributed Stochastic Neighbor Embedding (t-SNE) showed distinct strengths. Autoencoders excel in capturing complex, non-linear patterns, offering a balanced trade-off between reconstruction accuracy and computational cost. They are particularly effective for datasets requiring intricate feature learning. On the other hand, t-SNE is highly valuable for visualizing non-linear relationships, particularly in clustering tasks, despite its computational intensity.

VII. REFERENCES
[1] M. Rai, H. Parmar, S. Singh, "Comparative Analysis of Dimensionality Reduction Techniques in Machine Learning Models for Liver Disease Detection", Proceedings of the International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2024), IEEE Xplore Part Number: CFP24OSV-ART, ISBN: 979-8-3503-7642-5.
[2] S. Sakib, Md. A. B. Siddique, Md. A. Rahman, "Performance Evaluation of t-SNE and MDS Dimensionality Reduction Techniques with KNN, ENN and SVM Classifiers", 2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020, Dhaka, Bangladesh.
[3] Y. Liang, X. Li, X. Huang, Y. Yao, Z. Zhang, "An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction".
[4] F. R. Mulla, A. K. Gupta (2022), "A Review Paper on Dimensionality Reduction Techniques".
[5] G. T. Reddy, R. Kaluri, P. K. Reddy, K. Lakshmanna, D. Rajput, G. Srivastava, T. Baker (2020), "Analysis of Dimensionality Reduction Techniques on Big Data".
[6] V. Shtino, M. Muca, R. Kapçiu (2024), "Dimensionality Reduction: A Comparative Review using RBM, KPCA, and t-SNE for Micro-Expressions Recognition", IJACSA, Volume 15, Issue 1, 370-379, 10.14569/IJACSA.2024.0150135.
[7] https://www.kaggle.com/datasets/reihanenamdari/breast-cancer
[8] L. van der Maaten, G. Hinton (2008), "Visualizing Data using t-SNE".
[9] G. C. Linderman, M. Rachh, J. G. Hoskins, S. Steinerberger, Y. Kluger (2017), "Efficient Algorithms for t-Distributed Stochastic Neighborhood Embedding".
[10] D. G. Cortés, E. Onieva, I. P. López, L. Trincher, J. Wu (2024), "Autoencoder-Enhanced Clustering: A Dimensionality Reduction Approach to Financial Time Series".
[11] A. Ashraf, N. M. Nawi, T. Shahzad, M. Aamir, M. A. Khan, K. Ouahada (2024), "Dimension Reduction Using Dual-Featured Auto-Encoder for the Histological Classification of Human Lungs Tissues".
[12] N. Migenda, R. Möller, W. Schenck (2019), "Adaptive Dimensionality Adjustment for Online Principal Component Analysis", 10.1007/978-3-030-33607-3_9.
[13] B. G. Sarmina, Guo-Hua Sun, Shi-Hai Dong (2023), "PCA and t-SNE analysis in the study of QAOA entangled and non-entangled mixing operators".
[14] V. Kumar, S. Biswas, D. Rajput, H. Patel, B. Tiwari (2022), "PCA-Based Incremental Extreme Learning Machine (PCA-IELM) for COVID-19 Patient Diagnosis Using Chest X-Ray Images", Computational Intelligence and Neuroscience, 2022, 1-17, 10.1155/2022/9107430.
[15] J. Xue, Y. Chen, O. Li, F. Li (2020), "Classification and identification of unknown network protocols based on CNN and T-SNE", Journal of Physics: Conference Series, 1617, 012071, 10.1088/1742-6596/1617/1/012071.
[16] N. Yu, Y. Pan (2017), "A deep learning method for lincRNA detection using auto-encoder algorithm", BMC Bioinformatics, 18, 10.1186/s12859-017-1922-3.
