
Principal Component Analysis For Noise Reduction and Fraudulent Activity Detection in Scientific Data

The research article examines Principal Component Analysis (PCA) for noise removal in data analysis and its application in fraud detection. Using a simulated data matrix, the study evaluates a threshold-based denoising strategy and confirms PCA's effectiveness in enhancing data accuracy, with potential real-world applications and future research opportunities discussed.


Disant Upadhyay (2023), 1, 1–10

PROJECT 4

Principal Component Analysis for Noise Reduction and Fraudulent Activity Detection in Scientific Data
Disant Upadhyay*
Memorial University of Newfoundland
*Corresponding author. Email: [email protected]

Abstract
In modern data analysis, the presence of noise poses significant challenges, often compromising the
accuracy of insights and concealing critical underlying signals. Principal Component Analysis (PCA)
has emerged as a potent technique for extracting valuable information from contaminated data, with
widespread applications in various domains, including the identification of fraudulent activities. This
research article delves into the utilization of PCA for noise removal in a simulated data matrix, which is
intentionally crafted using well-structured matrix functions to incorporate noise. We employ a threshold-
based denoising strategy using Singular Value Decomposition and rigorously assess its effectiveness under
varying noise intensities. Our findings underscore the prowess of PCA in mitigating noise and augmenting
the accuracy of data analysis. Moreover, we contextualize our results within the realm of real-world
applications and highlight promising avenues for future research in this dynamic field.

Contents

1 Introduction

2 Methodology and Data Simulation
2.1 Data Simulation
2.2 PCA for Noise Reduction
2.3 Singular Values Analysis and Identifying Relevant PCA Components
2.4 Thresholding Criteria and Extracting Clean Data

3 Effects of ϵ on Threshold-Based Denoising and Accuracy

4 Discussion
4.1 Simulating Noisy Data Matrices
4.2 Identifying Relevant PCA Components
4.3 Threshold-Based Denoising Using Singular Value Decomposition
4.4 Evaluating the Impact of Noise Intensity on PCA Performance
4.5 Limitations of the Study and Future Directions
4.6 Real-World Applications

5 Conclusion

Acknowledgement

References

A Visualizing Noisy and Original Transformations of Function-Based Matrices

B Singular Value Analysis for Identifying Relevant PCA Components

C Threshold-Based Denoising of Matrices Using Singular Value Decomposition

D Denoising and Correlation Analysis of Matrix Rows Using Thresholded SVD

1. Introduction
In the field of data analysis, noise is an ever-present challenge that can obscure the underlying signal
and hinder the accuracy of the analysis. One powerful technique for extracting critical information
from contaminated data or identifying fraudulent activities is Principal Component Analysis (PCA)
(Elhaik 2022). This method can reduce noise and improve the accuracy of data analysis by
identifying and retaining only the most significant variations in the data (Kurita 2020).
In this research article, we investigate the application of PCA in removing noise from data and
improving the accuracy of data analysis, addressing the thesis question: How can PCA be applied
to remove noise from data and improve the accuracy of data analysis? To this end, we simulate a
noisy data matrix V using carefully designed matrix functions Φ, Σ, and Ψ, and apply PCA-based
thresholding criteria to extract the clean signal from the contaminated data ("Do principal component
analysis regression eliminate noise in the data set?"). Through our analysis, we demonstrate the effectiveness
of PCA in reducing noise and improving the accuracy of data analysis ("Pca based image denoising").
The remainder of this article is organized as follows. In the subsequent section, we describe our
methods for simulating the noisy data matrix V and present our results in detail. We explain the
process of creating a noisy data matrix V such that U = ΦΣΨ^T and V = U + ϵη, where η is a random
matrix whose entries follow a standard normal distribution and ϵ is a scalar parameter. Additionally, we outline how
we sample these functions for a set of values of x to simulate noisy data V, and how we focus on
removing the noise using the PCA method. The following sections delve further into the application
of PCA, highlighting its power in extracting critical information from contaminated data (Zhang
et al. 2017).

2. Methodology and Data Simulation


In this section, we outline the methodology employed to investigate the effectiveness of PCA in noise
reduction and the accuracy improvement of data analysis. We will describe the process of simulating
the noisy data matrix V and the clean signal matrix U using carefully designed matrix functions Φ,
Σ, and Ψ.

2.1 Data Simulation


Our approach begins with the creation of a noisy data matrix V ∈ ℝ^(n×n), defined as V = U + ϵη,
where U = ΦΣΨ^T, and η ∈ ℝ^(n×n) is a random matrix with a standard normal distribution. The scalar
parameter ϵ controls the level of noise introduced into the data. In this study, we set n = 600 and
ϵ = 1.

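The matrix functions Φ(x), Σ, and Ψ(x) are defined as follows:

\[
\Phi(x) = \begin{bmatrix} \cos(17x)\, e^{-x^2} \\ \sin(11x) \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & \tfrac{1}{2} \end{bmatrix}, \qquad
\Psi(x) = \begin{bmatrix} \sin(5x)\, e^{-x^2} \\ \cos(13x) \end{bmatrix},
\]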

where x ranges from –3 to 3. To simulate the noisy data matrix V, we sample these functions for a
set of n equally spaced values of x between –3 and 3.
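
The sampling step can be sketched as follows. This is a minimal illustration rather than the code in Appendix A: the helper name `sample_factors` and the seed are assumptions made for this example. Φ and Ψ are sampled on the grid as n × 2 arrays, so U = ΦΣΨ^T is an n × n matrix whose rank is at most 2 by construction.

```python
import numpy as np

def sample_factors(n):
    """Sample Phi and Psi on n equally spaced points in [-3, 3] (hypothetical helper)."""
    x = np.linspace(-3, 3, n)
    # Each row holds the two component functions evaluated at one grid point
    Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
    Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
    Sigma = np.diag([2.0, 0.5])
    return Phi, Sigma, Psi

n, epsilon = 600, 1.0
Phi, Sigma, Psi = sample_factors(n)

U = Phi @ Sigma @ Psi.T                        # clean signal, shape (n, n)
rng = np.random.default_rng(42)                # seed chosen here for repeatability
V = U + epsilon * rng.standard_normal((n, n))  # noisy observation

print("shapes:", U.shape, V.shape)
print("numerical rank of U:", np.linalg.matrix_rank(U))
```

The low rank of U is what makes the later thresholding step meaningful: all of the clean signal lives in two components, while the noise is spread across all n.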

Figure 1. U and V visualized

Using the provided Python code in Appendix A, we generate the matrices U and V, and
subsequently plot them side-by-side to visualize the differences and the presence of noise in the data
(Figure 1). This visual comparison allows us to observe the impact of noise on the data and sets the
stage for applying PCA to remove the noise and improve the accuracy of data analysis.

2.2 PCA for Noise Reduction


In the following sections, we will discuss the application of PCA for noise reduction in the simulated
data. We will describe the PCA-based thresholding criteria used to extract the clean signal from
the contaminated data, and we will present our results in detail. Through our analysis, we aim to
demonstrate the effectiveness of PCA in reducing noise and improving the accuracy of data analysis.

2.3 Singular Values Analysis and Identifying Relevant PCA Components


In this section, we analyze the singular values of the clean signal matrix U and the noisy data matrix
V in relation to our thesis statement, which focuses on the application of PCA for noise reduction and
improved data analysis accuracy. By comparing their strongest singular values, we aim to identify
which PCA components are most likely relevant to the clean data. This analysis provides insights
into the effectiveness of PCA in isolating the significant variations in the data and facilitating noise
reduction.
We compute the singular values of U and V using Python code provided in Appendix B.
Subsequently, we compare the top 10 singular values of U and V in a table (Table 1).
Upon inspection of the table, we observe a clear distinction between the first two singular values
and the rest. The first two singular values of both U and V are considerably larger than the others.

Table 1. Comparison of the top 10 singular values of U and V

Index   U                 V
0       1.51 × 10^2       1.55 × 10^2
1       1.25 × 10^2       1.30 × 10^2
2       5.74 × 10^-14     4.85 × 10^1
3       5.01 × 10^-14     4.80 × 10^1
4       4.76 × 10^-14     4.78 × 10^1
5       4.26 × 10^-14     4.74 × 10^1
6       4.13 × 10^-14     4.73 × 10^1
7       3.62 × 10^-14     4.71 × 10^1
8       3.24 × 10^-14     4.68 × 10^1
9       2.99 × 10^-14     4.65 × 10^1

This suggests that the first two PCA components capture the most significant variations in the data
and are most likely relevant to the clean data.
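
The gap seen in Table 1 can be checked numerically. The sketch below is illustrative: it rebuilds U and V from the definitions in Section 2.1 with an arbitrary seed, so the digits will differ slightly from Table 1. It compares the singular values of V with 2√n ϵ, a standard random-matrix estimate of the largest singular value of an n × n matrix of independent N(0, ϵ²) entries. That reference level and the 20% margin are assumptions introduced for this example and are not part of the original analysis.

```python
import numpy as np

n, epsilon = 600, 1.0
x = np.linspace(-3, 3, n)
Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
U = Phi @ np.diag([2.0, 0.5]) @ Psi.T

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch
V = U + epsilon * rng.standard_normal((n, n))

# Singular values only; the singular vectors are not needed here
s_U = np.linalg.svd(U, compute_uv=False)
s_V = np.linalg.svd(V, compute_uv=False)

# Approximate top singular value of the pure-noise matrix (random-matrix estimate)
noise_edge = 2 * np.sqrt(n) * epsilon

print("top 4 singular values of U:", s_U[:4])
print("top 4 singular values of V:", s_V[:4])
print("approximate noise edge:", noise_edge)

# Count singular values of V that stand clear of the noise level (ad hoc 20% margin)
n_signal = int(np.sum(s_V > 1.2 * noise_edge))
print("components clearly above the noise edge:", n_signal)
```

Beyond the second component, the singular values of U are zero up to floating-point error, so everything else in the spectrum of V is attributable to the noise term ϵη.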
Having identified the relevant PCA components, we will proceed to the next section, where
we will explore the application of PCA-based thresholding criteria to extract the clean signal from
the contaminated data, further demonstrating the effectiveness of PCA in noise reduction and data
analysis accuracy improvement.

2.4 Thresholding Criteria and Extracting Clean Data

Figure 2. Visualization of the thresholded matrices Ũ and Ṽ

Building upon the identification of relevant PCA components in the previous section, we now
focus on applying a thresholding criterion to extract the clean data from the noisy data matrix V.
This thresholding criterion is based on setting all singular values smaller than a tolerance τ to zero.
We then compute Ũ = ΦΣ̃Ψ^T as the clean data, where Σ̃ is the thresholded singular value matrix.
In the data science literature, the threshold τ is known to have an optimal value given by
(4/√3)√n ϵ. Using this optimal value, we apply the thresholding criterion to our data, as shown
in the Python code provided in Appendix C. The resulting thresholded matrices, Ũ and Ṽ, are
visualized in Figure 2.
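
For the values used here, τ = (4/√3) · √600 · 1 ≈ 56.6, which lies between the second and third singular values of V in Table 1, so exactly two components survive the threshold. A minimal sketch of this step follows. The function name `hard_threshold` and the seed are assumptions of this example; Appendix C remains the code used for Figure 2.

```python
import numpy as np

def hard_threshold(M, tau):
    """Zero every singular value of M that does not exceed tau, then rebuild the matrix."""
    W, s, Zt = np.linalg.svd(M, full_matrices=False)
    s_kept = np.where(s > tau, s, 0.0)
    # Scaling the columns of W by s_kept is equivalent to W @ diag(s_kept)
    return (W * s_kept) @ Zt, int(np.count_nonzero(s_kept))

n, epsilon = 600, 1.0
x = np.linspace(-3, 3, n)
Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
U = Phi @ np.diag([2.0, 0.5]) @ Psi.T

rng = np.random.default_rng(42)  # seed is an assumption of this sketch
V = U + epsilon * rng.standard_normal((n, n))

tau = (4 / np.sqrt(3)) * np.sqrt(n) * epsilon
V_tilde, kept = hard_threshold(V, tau)

# Relative Frobenius-norm error with respect to the clean matrix U
err_noisy = np.linalg.norm(V - U) / np.linalg.norm(U)
err_denoised = np.linalg.norm(V_tilde - U) / np.linalg.norm(U)

print(f"tau = {tau:.2f}, components kept = {kept}")
print(f"relative error before denoising: {err_noisy:.3f}")
print(f"relative error after denoising:  {err_denoised:.3f}")
```

The relative error gives a single-number complement to the visual comparison in Figure 2.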
As shown in Figure 2, the thresholded matrices demonstrate the effectiveness of the PCA-based
thresholding criterion in reducing the noise present in the data. Comparing Ũ and Ṽ, we can observe
similarities, suggesting that the clean data has been successfully extracted from the noisy data.
In the next section, we will further validate the effectiveness of our PCA-based thresholding
criterion by analyzing the correlation between a specific row of the clean data matrix U and the
thresholded matrix Ṽ.

3. Effects of ϵ on Threshold-Based Denoising and Accuracy


In this section, we investigate the effects of changing the noise parameter ϵ on the threshold-based
denoising process and the resulting accuracy. We repeat the denoising process using ϵ = 0.5, and
compare a row of U with the corresponding row of Ṽ, visually using a plot and numerically using the correlation coefficient.

Figure 3. Plot of row 300 of U and Ṽ , showing the effects of changing ϵ on the denoising process and accuracy.

The Python code for this analysis can be found in Appendix D. As shown in Figure 3, we plot
row 300 of U and Ṽ, and calculate the correlation between these rows using the np.corrcoef
function. The resulting correlation value indicates the level of accuracy in the denoising process, and
any changes in ϵ can be visually and numerically assessed through the plot and correlation value,
respectively.
By analyzing the effects of ϵ on the denoising process, we can gain a deeper understanding of
the relationship between noise level and the performance of PCA-based denoising techniques.
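
One way to examine this relationship more systematically is to repeat the procedure over several noise levels. The sketch below is illustrative only: the grid of ϵ values, the seed, and the helper `denoise` are assumptions of this example, and only ϵ = 1 and ϵ = 0.5 are studied in this article.

```python
import numpy as np

def denoise(V, n, epsilon):
    """Hard-threshold the singular values of V at tau = (4/sqrt(3)) * sqrt(n) * epsilon."""
    tau = (4 / np.sqrt(3)) * np.sqrt(n) * epsilon
    W, s, Zt = np.linalg.svd(V, full_matrices=False)
    return (W * np.where(s > tau, s, 0.0)) @ Zt

n = 600
x = np.linspace(-3, 3, n)
Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
U = Phi @ np.diag([2.0, 0.5]) @ Psi.T

rng = np.random.default_rng(42)    # seed is an assumption of this sketch
eta = rng.standard_normal((n, n))  # one noise realization, rescaled for each epsilon

row = 300
correlations = {}
for epsilon in (0.25, 0.5, 1.0):   # hypothetical grid of noise levels
    V_tilde = denoise(U + epsilon * eta, n, epsilon)
    # Off-diagonal entry of the 2 x 2 correlation matrix
    correlations[epsilon] = np.corrcoef(U[row, :], V_tilde[row, :])[0, 1]
    print(f"epsilon = {epsilon}: correlation of row {row} = {correlations[epsilon]:.4f}")
```

Reusing the same η for every ϵ isolates the effect of the noise amplitude from the effect of drawing a different noise sample.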

4. Discussion
In this research article, we have conducted an in-depth exploration of Principal Component Analysis
(PCA) as a powerful tool for reducing noise in data analysis. We examined various aspects of PCA and
assessed its performance in extracting critical information from contaminated data using a carefully
designed simulation study. The following subsections discuss our findings in detail, providing insight
into the strengths, limitations, and future directions for PCA as a noise reduction technique.

4.1 Simulating Noisy Data Matrices


We began our investigation by simulating a noisy data matrix V using matrix functions Φ, Σ, and
Ψ. We created the matrices U and V such that U = ΦΣΨ^T and V = U + ϵη. The noise level was
controlled by the scalar parameter ϵ. This simulation allowed us to systematically study the effects of
noise on PCA’s performance in a controlled environment. Our side-by-side visualization of matrices
U and V revealed the significant differences introduced by the noise term ϵη.

4.2 Identifying Relevant PCA Components


After simulating the noisy data matrix, we proceeded to calculate the singular values of U and V.
We compared the strongest ten singular values to identify which PCA components were most likely
relevant to the clean data. Our analysis suggested that the first two PCA components had the most
significant impact on the clean data. This finding highlights the importance of focusing on the most
dominant components when using PCA for noise reduction in data analysis.

4.3 Threshold-Based Denoising Using Singular Value Decomposition


To further examine PCA’s noise reduction capabilities, we implemented a threshold-based denoising
strategy using Singular Value Decomposition (SVD). We computed Ũ and Ṽ as the clean data
matrices by setting all singular values smaller than a tolerance τ to zero. The threshold τ was selected
as (4/√3)√n ϵ, an optimal value known in the data science literature. Our visual comparison of the
thresholded matrices Ũ and Ṽ demonstrated the effectiveness of PCA in recovering the clean data
from the noisy data matrix.

4.4 Evaluating the Impact of Noise Intensity on PCA Performance


We further investigated the impact of noise intensity on PCA’s performance by repeating the
thresholding process with a lower noise level (ϵ = 0.5). Our comparison of row 300 of U and Ṽ, both
visually and numerically through the correlation coefficient, showed a high degree of similarity between
the original and denoised data. Within this simulation, the result indicates that the threshold-based
approach remains effective when the noise level changes, since the threshold τ scales with ϵ.

4.5 Limitations of the Study and Future Directions


While our simulation study has demonstrated the robustness of PCA as a noise reduction technique
in data analysis, there are some limitations to our approach. First, our study is based on a controlled
simulation environment, which may not fully capture the complexity and variability of real-world
data. Second, the choice of matrix functions Φ, Σ, and Ψ may affect the generalizability of our
findings. Moreover, we have used a specific thresholding criterion based on the literature; other
thresholding strategies might lead to different results.
To address these limitations, future research could investigate the performance of PCA in handling
more complex data structures or explore alternative denoising techniques. Additionally, the impact
of different thresholding criteria on the effectiveness of PCA for noise reduction could be examined.
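
As one concrete example of an alternative criterion, the sketch below picks the rank at the largest gap between consecutive singular values and compares it with the fixed threshold τ. The gap rule is a simple heuristic chosen here for illustration; it is not evaluated in this study, and the function names, the `max_rank` cutoff, and the seed are all assumptions.

```python
import numpy as np

def rank_by_threshold(s, tau):
    """Number of singular values strictly above tau."""
    return int(np.sum(s > tau))

def rank_by_largest_gap(s, max_rank=20):
    """Rank at the largest drop between consecutive leading singular values (heuristic)."""
    gaps = s[:max_rank] - s[1:max_rank + 1]
    return int(np.argmax(gaps)) + 1

n, epsilon = 600, 1.0
x = np.linspace(-3, 3, n)
Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
U = Phi @ np.diag([2.0, 0.5]) @ Psi.T

rng = np.random.default_rng(1)  # arbitrary seed, an assumption of this sketch
V = U + epsilon * rng.standard_normal((n, n))

s = np.linalg.svd(V, compute_uv=False)  # singular values in descending order
tau = (4 / np.sqrt(3)) * np.sqrt(n) * epsilon

r_tau = rank_by_threshold(s, tau)
r_gap = rank_by_largest_gap(s)
print(f"rank from fixed threshold tau: {r_tau}")
print(f"rank from largest-gap heuristic: {r_gap}")
```

Unlike τ, the gap rule does not require ϵ to be known, which is one reason a comparison of criteria could be informative on data where the noise level has to be estimated.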

4.6 Real-World Applications


Our simulation study has significant implications for various real-world applications, such as scientific
measurements and the identification of fraudulent activities. In many scientific fields, data often
contain noise due to measurement errors or other factors. Our results show that PCA can be a
valuable tool for extracting critical information from contaminated data, enabling researchers to
obtain more accurate and reliable insights.
In the context of fraud detection, PCA can be employed to identify anomalous patterns in large
datasets where fraudulent activities may be hidden. By reducing noise and focusing on the most
relevant components, PCA can help reveal underlying structures that may indicate potential fraud.
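
As a toy illustration of this idea, the sketch below alters one row of the simulated matrix and flags it by the size of its residual after projection onto the two dominant components. Everything about the tampering (the affected row, its magnitude, the sign pattern) is a hypothetical choice made for this example. It is not an experiment from this study and says nothing about detection performance on real fraud data.

```python
import numpy as np

n, epsilon = 600, 1.0
x = np.linspace(-3, 3, n)
Phi = np.column_stack([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
Psi = np.column_stack([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])
U = Phi @ np.diag([2.0, 0.5]) @ Psi.T

rng = np.random.default_rng(7)  # arbitrary seed, an assumption of this sketch
V = U + epsilon * rng.standard_normal((n, n))

# Hypothetical tampering: shift every entry of one row by +3 or -3
tampered_row = 123
V_tampered = V.copy()
V_tampered[tampered_row, :] += 3.0 * rng.choice([-1.0, 1.0], size=n)

# Keep only the two dominant components identified in Section 2.3
W, s, Zt = np.linalg.svd(V_tampered, full_matrices=False)
r = 2
low_rank = (W[:, :r] * s[:r]) @ Zt[:r, :]

# Rows that the low-rank model explains poorly have large residual norms
residual_norms = np.linalg.norm(V_tampered - low_rank, axis=1)
flagged = int(np.argmax(residual_norms))
print("row with the largest residual:", flagged)
print("tampered row:", tampered_row)
```

The rank is fixed at two here rather than chosen by τ, because a large enough alteration can produce a singular value above the threshold and would then be retained instead of being exposed in the residual.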

5. Conclusion
In this research article, we have thoroughly investigated the effectiveness of Principal Component
Analysis (PCA) as a noise reduction technique in data analysis. Our simulation-based approach
aimed to demonstrate PCA’s ability to extract critical information from contaminated data and its
potential to identify fraudulent activities. The thesis of our study posited that PCA would be effective
in removing noise from a data matrix and could potentially be employed in real-world applications.
Our findings support the thesis, showcasing that PCA is robust and capable of recovering clean
data from noisy data matrices. We observed the importance of focusing on the most dominant
components when using PCA for noise reduction, as these components have the most significant
impact on the clean data.
Our threshold-based denoising strategy effectively demonstrated that PCA could mitigate the
effects of noise, improving the accuracy of data analysis even when the noise level varies. This
observation aligns with our initial thesis, further substantiating the utility of PCA in real-world
applications, such as scientific measurements and fraud detection.
In conclusion, the results of our investigation confirm the correctness of our thesis and showcase
the potential of PCA as a valuable tool for extracting critical information from contaminated data.
Our study offers valuable insights into the effectiveness and robustness of PCA for noise reduction in
data analysis, making it applicable in a wide range of scenarios. While there are limitations to our
approach, the overall outcomes emphasize the importance of PCA in dealing with noisy data and its
applicability in various real-world situations.

Acknowledgement
This article is the fourth and final capstone project for the course Technical Writing taught by
Jabrul Alam at the Memorial University of Newfoundland.

References
Do principal component analysis regression eliminate noise in the data set? https://stats.stackexchange.com/questions/304449/do-principal-component-analysis-regression-eliminate-noise-in-the-data-set.
Elhaik, Eran. 2022. Principal component analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Scientific Reports.
Kurita, Takio. 2020. Principal component analysis (PCA). In Computer vision. SpringerLink.
Pca based image denoising. https://aircconline.com/sipij/V3N2/3212sipij18.pdf.
Zhang, Xiaoming, Xuefeng Zhang, Xiaohong Zhang, and Xiaodong Zhang. 2017. Random noise suppression algorithm for seismic signals based on modified principal component analysis. Wireless Personal Communications.

A. Visualizing Noisy and Original Transformations of Function-Based Matrices

import numpy as np
import matplotlib.pyplot as plt


def generate_matrices(n, epsilon):
    """Generate U and V matrices."""
    # Define x with n linearly spaced values between -3 and 3
    x = np.linspace(-3, 3, n)

    # Define phi, sigma, and psi matrices (phi and psi have shape (2, n))
    phi = np.array([np.cos(17 * x) * np.exp(-x**2), np.sin(11 * x)])
    sigma = np.array([[2, 0], [0, 1 / 2]])
    psi = np.array([np.sin(5 * x) * np.exp(-x**2), np.cos(13 * x)])

    # Generate random noise matrix eta with dimensions n x n
    eta = np.random.randn(n, n)

    # Compute U and V matrices
    U = phi.T @ sigma @ psi
    V = U + epsilon * eta

    return U, V


def plot_matrices(U, V):
    """Plot U and V side-by-side."""
    fig, ax = plt.subplots(1, 2)
    ax[0].imshow(U)
    ax[0].set_title('U')
    ax[1].imshow(V)
    ax[1].set_title('V')
    plt.show()


def main():
    # Define matrix size and noise level
    n = 600
    epsilon = 1
    # Set random seed for reproducibility
    np.random.seed(42)

    # Generate U and V matrices
    U, V = generate_matrices(n, epsilon)

    # Plot U and V matrices
    plot_matrices(U, V)

    # Return the matrices so that the later appendices can reuse them
    return U, V


U, V = main()

B. Singular Value Analysis for Identifying Relevant PCA Components


import numpy as np
import pandas as pd

# This listing assumes the matrices U and V from Appendix A are in scope.

# Compute singular values of U and V without computing the singular vectors
U_s = np.linalg.svd(U, compute_uv=False)
V_s = np.linalg.svd(V, compute_uv=False)

# Create a DataFrame containing the first 10 singular values of U and V
df = pd.DataFrame({'U': U_s[:10], 'V': V_s[:10]})
print(df)

# Set the number of PCA components to be considered
n_components = 2

# Inform the user about the number of relevant PCA components
print(f'The first {n_components} PCA components are most likely relevant to the clean data.')

C. Threshold-Based Denoising of Matrices Using Singular Value Decomposition


import numpy as np
import matplotlib.pyplot as plt

# This listing assumes the matrices U and V from Appendix A are in scope.
n = U.shape[0]
epsilon = 1

# Calculate the threshold value (tau) based on matrix size (n) and noise level (epsilon)
tau = (4 / np.sqrt(3)) * np.sqrt(n) * epsilon

# Compute singular value decompositions of U and V
U_d, U_s, U_v = np.linalg.svd(U)
V_d, V_s, V_v = np.linalg.svd(V)

# Set singular values that do not exceed tau to zero
U_s_tilde = np.where(U_s > tau, U_s, 0)
V_s_tilde = np.where(V_s > tau, V_s, 0)

# Reconstruct the denoised matrices U_tilde and V_tilde using the thresholded singular values
U_tilde = U_d @ np.diag(U_s_tilde) @ U_v
V_tilde = V_d @ np.diag(V_s_tilde) @ V_v

# Create subplots for U_tilde and V_tilde
fig, ax = plt.subplots(1, 2)

# Plot the denoised U_tilde matrix
ax[0].imshow(U_tilde)
ax[0].set_title('U_tilde')

# Plot the denoised V_tilde matrix
ax[1].imshow(V_tilde)
ax[1].set_title('V_tilde')

plt.show()

D. Denoising and Correlation Analysis of Matrix Rows Using Thresholded SVD


import numpy as np
import matplotlib.pyplot as plt

# This listing assumes the matrix U from Appendix A is in scope.
n = U.shape[0]

# Set a new noise level (epsilon)
epsilon = 0.5

# Regenerate the noise matrix eta with the seed used in Appendix A
np.random.seed(42)
eta = np.random.randn(n, n)

# Generate the noisy matrix V with the new noise level
V = U + epsilon * eta

# Calculate the threshold value (tau) based on matrix size (n) and the new noise level (epsilon)
tau = (4 / np.sqrt(3)) * np.sqrt(n) * epsilon

# Compute singular value decomposition of the noisy matrix V
V_d, V_s, V_v = np.linalg.svd(V)

# Set singular values of V that do not exceed tau to zero
V_s_tilde = np.where(V_s > tau, V_s, 0)

# Reconstruct the denoised matrix V_tilde using the thresholded singular values
V_tilde = V_d @ np.diag(V_s_tilde) @ V_v

# Plot row 300 of U and V_tilde
plt.plot(U[300, :], label='U')
plt.plot(V_tilde[300, :], label='V_tilde')

# Add legend to the plot
plt.legend()

# Calculate the correlation matrix between row 300 of U and V_tilde
corr = np.corrcoef(U[300, :], V_tilde[300, :])

# Print the correlation coefficient (off-diagonal entry)
print(f'Correlation between row 300 of U and V_tilde: {corr[0, 1]}')

plt.show()
