Principal Component Analysis For Noise Reduction and Fraudulent Activity Detection in Scientific Data
PROJECT 4
Abstract
In modern data analysis, the presence of noise poses significant challenges, often compromising the
accuracy of insights and concealing critical underlying signals. Principal Component Analysis (PCA)
has emerged as a potent technique for extracting valuable information from contaminated data, with
widespread applications in various domains, including the identification of fraudulent activities. This
research article examines the use of PCA for noise removal in a simulated data matrix, which is
constructed from well-structured matrix functions and then contaminated with noise. We employ a
threshold-based denoising strategy using Singular Value Decomposition and assess its effectiveness under
varying noise intensities. Our findings show that PCA mitigates noise and improves
the accuracy of data analysis. We also place our results in the context of real-world
applications and outline avenues for future research.
Contents
1 Introduction
4 Discussion
4.1 Simulating Noisy Data Matrices
4.2 Identifying Relevant PCA Components
4.3 Threshold-Based Denoising Using Singular Value Decomposition
4.4 Evaluating the Impact of Noise Intensity on PCA Performance
4.5 Limitations of the Study and Future Directions
4.6 Real-World Applications
5 Conclusion
Acknowledgement
References
2 Disant Upadhyay et al.
1. Introduction
In the field of data analysis, noise is an ever-present challenge that can obscure the underlying signal
and reduce the accuracy of the analysis. One powerful technique for extracting critical information
from contaminated data or identifying fraudulent activities is Principal Component Analysis (PCA)
(Elhaik 2022). This method can reduce noise and improve the accuracy of data analysis by
identifying and retaining only the most significant variations in the data (Kurita 2020).
In this research article, we investigate the application of PCA in removing noise from data and
improving the accuracy of data analysis, addressing the thesis question: how can PCA be applied
to remove noise from data and improve the accuracy of data analysis? To this end, we simulate a
noisy data matrix V using carefully designed matrix functions Φ, Σ, and Ψ, and apply PCA-based
thresholding criteria to extract the clean signal from the contaminated data ("Do principal component
analysis regression eliminate noise in the data set?"). Through our analysis, we demonstrate the effectiveness
of PCA in reducing noise and improving the accuracy of data analysis ("PCA based image denoising").
The remainder of this article is organized as follows. In the next section, we describe our
methods for simulating the noisy data matrix V and present our results in detail. We explain how
we construct V such that $U = \Phi \Sigma \Psi^T$ and $V = U + \epsilon \eta$, where η is a random
matrix with standard normal entries and ϵ is a scalar parameter controlling the noise level. We also
outline how we sample these functions for a set of values of x to simulate the noisy data V, and how we
remove the noise using PCA. The following sections examine the application
of PCA further, highlighting its power in extracting critical information from contaminated data (Zhang
et al. 2017).
where x ranges from –3 to 3. To simulate the noisy data matrix V, we sample these functions for a
set of n equally spaced values of x between –3 and 3.
Using the provided Python code in Appendix A, we generate the matrices U and V, and
subsequently plot them side-by-side to visualize the differences and the presence of noise in the data
(Figure 1). This visual comparison allows us to observe the impact of noise on the data and sets the
stage for applying PCA to remove the noise and improve the accuracy of data analysis.
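The construction described above can be illustrated with a minimal sketch. The mode shapes chosen for Φ and Ψ and the diagonal of Σ below are hypothetical placeholders, since the functions used in this study are not reproduced here; only n = 600, ϵ = 1, the seed 42, and the range of x follow the text and the Appendix A listing. The name `simulate` is also an assumption.

```python
import numpy as np

def simulate(n=600, epsilon=1.0, seed=42):
    """Return (U, V): a rank-2 clean matrix and its noisy counterpart."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-3, 3, n)  # n equally spaced samples of x in [-3, 3]

    # Hypothetical mode shapes (assumptions, not the functions used in the paper).
    Phi = np.column_stack([np.exp(-x**2), x * np.exp(-x**2)])  # n x 2
    Psi = np.column_stack([np.cos(x), np.sin(x)])              # n x 2
    Sigma = np.diag([2.0, 1.0])                                # 2 x 2 weights

    U = Phi @ Sigma @ Psi.T            # clean data, U = Phi Sigma Psi^T
    eta = rng.standard_normal((n, n))  # noise with standard normal entries
    V = U + epsilon * eta              # noisy data, V = U + epsilon * eta
    return U, V

U, V = simulate()
print(U.shape, V.shape)
```

Because U is built from two pairs of mode shapes, it has rank 2 by construction, which mirrors the two dominant singular values reported for U in this study.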
Index   U                V
0       1.51 × 10^2      1.55 × 10^2
1       1.25 × 10^2      1.30 × 10^2
2       5.74 × 10^-14    4.85 × 10^1
3       5.01 × 10^-14    4.80 × 10^1
4       4.76 × 10^-14    4.78 × 10^1
5       4.26 × 10^-14    4.74 × 10^1
6       4.13 × 10^-14    4.73 × 10^1
7       3.62 × 10^-14    4.71 × 10^1
8       3.24 × 10^-14    4.68 × 10^1
9       2.99 × 10^-14    4.65 × 10^1
This suggests that the first two PCA components capture the most significant variations in the data
and are most likely relevant to the clean data.
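A comparison of this kind can be sketched as follows. The clean matrix is a hypothetical rank-2 placeholder rather than the one used in this study, and the rule for counting relevant components (singular values above 2√n·ϵ, the approximate largest singular value of an n × n matrix of independent N(0, ϵ²) entries, times an assumed safety factor of 1.1) is one possible criterion, not necessarily the one applied here.

```python
import numpy as np

n, epsilon = 600, 1.0  # values taken from the Appendix A listing
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, n)

# Hypothetical rank-2 clean matrix (placeholder mode shapes, not the paper's).
Phi = np.column_stack([np.exp(-x**2), x * np.exp(-x**2)])
Psi = np.column_stack([np.cos(x), np.sin(x)])
U = Phi @ np.diag([2.0, 1.0]) @ Psi.T
V = U + epsilon * rng.standard_normal((n, n))

# Singular values only, sorted in descending order.
s_U = np.linalg.svd(U, compute_uv=False)
s_V = np.linalg.svd(V, compute_uv=False)

# Approximate top of the pure-noise singular value spectrum for large n.
noise_edge = 2 * np.sqrt(n) * epsilon
margin = 1.1  # assumed safety factor
k = int(np.sum(s_V > margin * noise_edge))

for i in range(5):
    print(f"{i}  {s_U[i]:.2e}  {s_V[i]:.2e}")
print("components above the noise edge:", k)
```

For n = 600 and ϵ = 1 the noise edge is about 49, which is consistent with the plateau of values near 4.8 × 10^1 in the V column of the table above.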
Having identified the relevant PCA components, we proceed to the next section, where
we apply PCA-based thresholding criteria to extract the clean signal from
the contaminated data, further demonstrating the effectiveness of PCA in reducing noise and
improving the accuracy of data analysis.
Building upon the identification of relevant PCA components in the previous section, we now
focus on applying a thresholding criterion to extract the clean data from the noisy data matrix V.
This thresholding criterion is based on setting all singular values smaller than a tolerance τ to zero.
We then compute $\tilde{U} = \Phi \tilde{\Sigma} \Psi^T$ as the clean data, where $\tilde{\Sigma}$ is the thresholded singular value matrix.
In the data science literature, the threshold τ is known to have an optimal value given by
$\tau = (4/\sqrt{3})\sqrt{n}\,\epsilon$. Using this optimal value, we apply the thresholding criterion to our data, as shown
in the Python code provided in Appendix C. The resulting thresholded matrices, Ũ and Ṽ, are
visualized in Figure 2.
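The thresholding step can be sketched as below. This is an illustration under stated assumptions rather than the Appendix C listing: the function name `svd_threshold` is hypothetical, the test matrix is a placeholder rank-2 construction, and the factors Φ, Σ, Ψ are taken from the SVD of the matrix being denoised.

```python
import numpy as np

def svd_threshold(V, epsilon):
    """Zero all singular values of V below tau = (4/sqrt(3)) * sqrt(n) * epsilon."""
    n = V.shape[0]
    tau = 4 / np.sqrt(3) * np.sqrt(n) * epsilon
    Phi, s, PsiT = np.linalg.svd(V, full_matrices=False)
    s_thr = np.where(s >= tau, s, 0.0)   # thresholded singular values
    V_tilde = (Phi * s_thr) @ PsiT       # scales column j of Phi by s_thr[j]
    return V_tilde, tau

# Hypothetical rank-2 test data (placeholder mode shapes, not the paper's).
n, epsilon = 600, 1.0
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, n)
Phi0 = np.column_stack([np.exp(-x**2), x * np.exp(-x**2)])
Psi0 = np.column_stack([np.cos(x), np.sin(x)])
U = Phi0 @ np.diag([2.0, 1.0]) @ Psi0.T
V = U + epsilon * rng.standard_normal((n, n))

V_tilde, tau = svd_threshold(V, epsilon)
print("tau:", tau)
print("error before:", np.linalg.norm(V - U))
print("error after: ", np.linalg.norm(V_tilde - U))
```

With n = 600 and ϵ = 1, the threshold evaluates to (4/√3)·√600 = 4·√200 ≈ 56.57. This lies between the two leading singular values of V in the table (1.55 × 10^2 and 1.30 × 10^2) and the remaining values near 4.8 × 10^1, so only the first two components survive.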
As shown in Figure 2, the thresholded matrices demonstrate the effectiveness of the PCA-based
thresholding criterion in reducing the noise present in the data. Comparing Ũ and Ṽ, we can observe
similarities, suggesting that the clean data has been successfully extracted from the noisy data.
In the next section, we will further validate the effectiveness of our PCA-based thresholding
criterion by analyzing the correlation between a specific row of the clean data matrix U and the
thresholded matrix Ṽ.
Figure 3. Plot of row 300 of U and Ṽ , showing the effects of changing ϵ on the denoising process and accuracy.
The Python code for this analysis can be found in Appendix D. As shown in Figure 3, we plot
row 300 of U and Ṽ, and calculate the correlation between these rows using the np.corrcoef
function. The resulting correlation value indicates the level of accuracy in the denoising process, and
any changes in ϵ can be visually and numerically assessed through the plot and correlation value,
respectively.
By analyzing the effects of ϵ on the denoising process, we can gain a deeper understanding of
the relationship between noise level and the performance of PCA-based denoising techniques.
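A sweep over ϵ of the kind described here might look like the following sketch. It is not the Appendix D listing: the helper name `denoised_row_correlation`, the placeholder rank-2 clean matrix, and the chosen ϵ values are assumptions made for illustration. Only the use of row 300, n = 600, and `np.corrcoef` follows the text.

```python
import numpy as np

def denoised_row_correlation(epsilon, n=600, row=300, seed=42):
    """Correlation between one row of U and the same row of the denoised matrix."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-3, 3, n)

    # Hypothetical rank-2 clean matrix (placeholder mode shapes, not the paper's).
    Phi = np.column_stack([np.exp(-x**2), x * np.exp(-x**2)])
    Psi = np.column_stack([np.cos(x), np.sin(x)])
    U = Phi @ np.diag([2.0, 1.0]) @ Psi.T
    V = U + epsilon * rng.standard_normal((n, n))

    # Hard-threshold the singular values of V at tau = (4/sqrt(3)) * sqrt(n) * epsilon.
    tau = 4 / np.sqrt(3) * np.sqrt(n) * epsilon
    L, s, Rt = np.linalg.svd(V, full_matrices=False)
    V_tilde = (L * np.where(s >= tau, s, 0.0)) @ Rt

    # np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the correlation.
    return np.corrcoef(U[row], V_tilde[row])[0, 1]

epsilons = [0.5, 1.0, 2.0]  # assumed noise levels for the sweep
correlations = [denoised_row_correlation(e) for e in epsilons]
for e, c in zip(epsilons, correlations):
    print(f"epsilon = {e}: correlation = {c:.4f}")
```

One caveat of this design: if ϵ is large enough that τ exceeds every singular value of V, the denoised matrix is identically zero and the correlation is undefined (`np.corrcoef` returns `nan`), so very large noise levels need separate handling.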
4. Discussion
In this research article, we have conducted an in-depth exploration of Principal Component Analysis
(PCA) as a powerful tool for reducing noise in data analysis. We examined various aspects of PCA and
assessed its performance in extracting critical information from contaminated data using a carefully
designed simulation study. The following subsections discuss our findings in detail, providing insight
into the strengths, limitations, and future directions for PCA as a noise reduction technique.
controlled by the scalar parameter ϵ. This simulation allowed us to systematically study the effects of
noise on PCA’s performance in a controlled environment. Our side-by-side visualization of matrices
U and V revealed the significant differences introduced by the noise term ϵη.
5. Conclusion
In this research article, we have thoroughly investigated the effectiveness of Principal Component
Analysis (PCA) as a noise reduction technique in data analysis. Our simulation-based approach
aimed to demonstrate PCA’s ability to extract critical information from contaminated data and its
potential to identify fraudulent activities. The thesis of our study posited that PCA would be effective
in removing noise from a data matrix and could potentially be employed in real-world applications.
Our findings support the thesis, showcasing that PCA is robust and capable of recovering clean
data from noisy data matrices. We observed the importance of focusing on the most dominant
components when using PCA for noise reduction, as these components have the most significant
impact on the clean data.
Our threshold-based denoising strategy effectively demonstrated that PCA could mitigate the
effects of noise, improving the accuracy of data analysis even when the noise level varies. This
observation aligns with our initial thesis, further substantiating the utility of PCA in real-world
applications, such as scientific measurements and fraud detection.
In conclusion, the results of our investigation support our thesis and show
the potential of PCA as a tool for extracting critical information from contaminated data.
Our simulation study indicates that PCA-based thresholding is effective and robust for noise reduction in
data analysis across the noise levels we considered. While there are limitations to our
approach, the overall outcomes emphasize the usefulness of PCA in dealing with noisy data and its
applicability in various real-world situations.
Acknowledgement
This article is the fourth and final capstone project for the course Technical Writing taught by
Jabrul Alam at the Memorial University of Newfoundland.
References
Do principal component analysis regression eliminate noise in the data set? https://stats.stackexchange.com/questions/304449/do-principal-component-analysis-regression-eliminate-noise-in-the-data-set.
Elhaik, Eran. 2022. Principal component analyses (PCA)-based findings in population genetic studies are highly biased and
must be reevaluated. Scientific Reports.
Kurita, Takio. 2020. Principal component analysis (PCA). In Computer vision. SpringerLink.
PCA based image denoising. https://aircconline.com/sipij/V3N2/3212sipij18.pdf.
Zhang, Xiaoming, Xuefeng Zhang, Xiaohong Zhang, and Xiaodong Zhang. 2017. Random noise suppression algorithm for
seismic signals based on modified principal component analysis. Wireless Personal Communications.
import numpy as np
import matplotlib.pyplot as plt

# ...
    return U, V

def plot_matrices(U, V):
    """Plot U and V side-by-side."""
    fig, ax = plt.subplots(1, 2)
    ax[0].imshow(U)
    ax[0].set_title('U')
    ax[1].imshow(V)
    ax[1].set_title('V')
    plt.show()

def main():
    # Define matrix size and noise level
    n = 600
    epsilon = 1
    # Set random seed for reproducibility
    np.random.seed(42)

    # ...

main()