Exploring Anomaly Detection in Data Science: Applications, Methods, and Significance
Miyuki Takahashi
C-2022-0063
BSIT - 2
February 2024
TABLE OF CONTENTS
TITLE PAGE
TABLE OF CONTENTS
CHAPTER
1 INTRODUCTION
Background
Research Aim
Research Objectives
2 LITERATURE REVIEW
Definition and Importance of Data Science
LIST OF FIGURES
1 K-means clustering process.
2 Outline of CNN.
3 OCSVM model algorithm flow chart.
LIST OF TABLES
1 Pros and Cons of Anomaly Detection Algorithms
CHAPTER 1
INTRODUCTION
1.1 Background
The intersection of data science and anomaly detection has become a focus of research in
recent years because of the opportunities it opens across many fields. Data science is a
diverse field that aims to extract useful information from data using a variety of
techniques and algorithms. Anomaly detection, in turn, identifies patterns or events in a
dataset that deviate from expected norms. Its importance is felt in a variety of industries,
including manufacturing, banking, healthcare, and cybersecurity, where detecting anomalies
early can prevent attacks or inform strategic decisions. Robust anomaly detection algorithms
are becoming increasingly necessary as data-driven systems proliferate and as data volume
and complexity grow. Therefore, this research project investigates anomaly detection
techniques within the broader field of data science, with particular emphasis on their
applications, methodological foundations, and overall significance.
Students. This research benefits students by enhancing their analytical skills and critical
thinking abilities through an understanding of anomaly detection principles, thus preparing
them for future academic and professional pursuits in data science and related fields.
Educators. The study enriches curricula by integrating anomaly detection concepts into
classroom instruction, thereby improving the quality of education and better equipping
students for challenges in the digital age.
Future Researchers. By expanding knowledge in anomaly detection, this study provides
a foundation for future research to explore advanced methodologies and interdisciplinary
applications, fostering innovation and advancements in the field.
Practitioners and Industry Professionals. The findings of this research inform
practitioners and industry professionals about optimizing system design and improving
detection accuracy in real-world applications, thereby driving positive outcomes in
cybersecurity, finance, healthcare, manufacturing, and other sectors.
Society. This study promotes awareness of anomaly detection's role in addressing
complex challenges, contributing to a safer, more secure, and resilient society. By aligning
with ethical considerations for a sustainable future, the research supports the development
of responsible data-driven practices.
Limitations:
The comprehensiveness of the literature review may be limited by the availability of
relevant research articles and resources, potentially leading to gaps in the coverage of
certain topics or methodologies.
The generalizability of findings to specific contexts may be constrained by the scope of the
study and the diversity of application domains, necessitating caution in extrapolating
conclusions beyond the scope of the research.
The research may be restricted by the availability of relevant data and resources for
conducting empirical studies or case analyses, potentially limiting the depth of analysis or
the breadth of applications explored.
CHAPTER 2
LITERATURE REVIEW
This chapter reviews related work and the application contexts of data science,
particularly anomaly detection, to establish the basis and purpose of this study.
Finance:
In the financial sector, anomaly detection is crucial for detecting fraud in credit
card transactions, insurance claims, and trading activities. Isolation Forests and
Autoencoders are prevalent algorithms in this domain. Isolation Forests isolate anomalies
by recursively partitioning the data with random splits. Because anomalies are few and
differ from the rest of the data, they tend to be separated after fewer splits, which makes
the method effective at low computational cost. Autoencoders are neural networks trained to
reconstruct their input; anomalies tend to produce higher reconstruction errors, which
enables their detection (Phua, Lee, Smith, & Gayler, 2010).
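The isolation idea described above can be illustrated with a minimal sketch. It is not a
full Isolation Forest: it assumes no subsampling and no score normalization, and the
transaction data, the feature choice (amount, purchases per day), and the function names
isolation_depth and average_depth are hypothetical, introduced only for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))       # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                          # a constant feature cannot be split
        return depth
    split = rng.uniform(lo, hi)           # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=100, seed=0):
    """Average isolation depth over many random partitionings (assumed: 100)."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical transactions: (amount, purchases per day).
gen = random.Random(42)
typical = (50.0, 5.0)
suspicious = (900.0, 40.0)
transactions = [(gen.gauss(50, 10), gen.gauss(5, 1)) for _ in range(100)]
transactions += [typical, suspicious]

typical_depth = average_depth(typical, transactions)
suspicious_depth = average_depth(suspicious, transactions)
print("typical:", typical_depth, "suspicious:", suspicious_depth)
```

The suspicious transaction is separated after far fewer splits than the typical one, and
that shorter average path is the signal an Isolation Forest turns into an anomaly score.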
Healthcare:
In the healthcare sector, anomaly detection methods play a significant role in
medical image analysis for disease diagnosis. Gaussian Mixture Models (GMM) and
Convolutional Neural Networks (CNN) are commonly used algorithms. A GMM models the
probability distribution of normal data, so that points whose likelihood under the model
falls below a chosen threshold can be flagged as deviations. CNNs, with their ability to
extract hierarchical features from medical images, facilitate the identification of
anomalous patterns indicative of conditions such as tumors or fractures (Pimentel, Clifton,
Clifton, & Tarassenko, 2014).
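The GMM thresholding rule can be written compactly. The notation here is an assumption
introduced for illustration, not taken from the cited work: K is the number of mixture
components, \pi_k, \mu_k, and \Sigma_k are the weight, mean, and covariance of component k,
and \tau is a threshold chosen by the analyst.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0

\text{flag } x \text{ as anomalous if } \log p(x) < \tau
```

In practice \tau is often set from the data, for example as a low percentile of the
log-likelihoods observed on normal training samples.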
A Convolutional Neural Network (CNN) architecture comprises two main parts (Figure 2):
1. Feature Extraction: Convolutional layers extract features from the input image by
applying filters that detect patterns such as edges, textures, or shapes. These layers
are typically followed by pooling layers, which reduce the spatial dimensions of the
feature maps while retaining their essential information.
2. Classification: After feature extraction, fully connected layers predict the class of
the image from the extracted features. These layers take the output of the convolutional
and pooling layers and perform the classification task, such as identifying objects or
patterns within the image.
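The feature extraction stage described above can be sketched in plain Python. This is an
illustration under stated assumptions, not a trained network: the 6x6 image, the
hand-written vertical-edge filter, and the helper names conv2d, relu, and max_pool are all
hypothetical, and a real CNN would learn its filters from data.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), as CNN layers do."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 6x6 image: dark on the left, bright in the two rightmost columns.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

features = relu(conv2d(image, vertical_edge))   # 4x4 feature map
pooled = max_pool(features)                     # 2x2 summary
print(features[0])
print(pooled)
```

The pooled map is smaller than the feature map but still records that the edge lies on
the right side of the image, which is the information the classification layers would use.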
Manufacturing:
In the manufacturing industry, anomaly detection techniques are instrumental in
fault detection and predictive maintenance of critical machinery and equipment. Principal
Component Analysis (PCA) and Recurrent Neural Networks (RNN) are among the commonly
employed algorithms. PCA reduces the dimensionality of sensor data while preserving most of
its variance, so that readings which are poorly described by the principal components can
be detected as anomalies in multi-dimensional datasets. RNNs, with their ability to model
temporal dependencies in data, are effective for predicting equipment failures and
scheduling maintenance activities, thus minimizing downtime and improving operational
efficiency (Ding, Zhao, & Fu, 2019).
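One common way to use PCA for anomaly detection is to measure how far a reading lies from
the subspace spanned by the leading components, that is, its reconstruction error. The
sketch below assumes this approach, keeps a single component found by power iteration, and
uses hypothetical two-sensor readings; the function names are introduced here for
illustration only. The same reconstruction-error principle underlies autoencoder-based
detection.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance(rows, mu):
    """Sample covariance matrix of the rows around the mean `mu`."""
    d, n = len(mu), len(rows)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def top_component(cov, iterations=200):
    """Leading eigenvector of `cov`, found by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def reconstruction_error(x, mu, v):
    """Distance between `x` and its projection onto the principal axis."""
    centered = [xi - mi for xi, mi in zip(x, mu)]
    score = sum(ci * vi for ci, vi in zip(centered, v))
    return math.sqrt(sum((ci - score * vi) ** 2 for ci, vi in zip(centered, v)))

# Hypothetical normal readings from two correlated sensors (second is about 2x the first).
normal = [(float(t), 2.0 * t + (0.1 if t % 2 == 0 else -0.1)) for t in range(20)]

mu = mean_vector(normal)
axis = top_component(covariance(normal, mu))

ok_error = reconstruction_error((10.0, 20.1), mu, axis)     # follows the usual pattern
fault_error = reconstruction_error((10.0, 5.0), mu, axis)   # breaks the correlation
print(ok_error, fault_error)
```

A reading would be flagged when its error exceeds a threshold chosen from the errors seen
on normal data; the threshold itself is a design choice that this sketch leaves open.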
Telecommunications:
Anomaly detection is vital in telecommunications for identifying network intrusions,
unusual traffic patterns, and service disruptions. One commonly used algorithm is the
Random Cut Forest (RCF), which uses randomization to isolate anomalies in streaming data
efficiently (Laptev et al., 2015). RCF constructs a forest of random trees over the data
and assigns each data point an anomaly score based on how easily the point is isolated,
with points that are isolated after shorter paths being more likely to be anomalies.
Environmental Monitoring:
Environmental monitoring relies on anomaly detection to identify abnormal changes
in environmental parameters, such as pollution levels, weather patterns, and ecosystem
dynamics. Local Outlier Factor (LOF) is a widely adopted algorithm in this domain,
particularly for detecting spatial anomalies in sensor data (Breunig et al., 2000). LOF
compares the local density of each data point with the local densities of its nearest
neighbors and identifies points whose density is significantly lower as anomalies. Because
it considers the local context of each point, LOF can detect anomalies effectively in
datasets with varying densities.
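The density comparison that LOF performs can be sketched in a few lines. This is a
simplified illustration that assumes no duplicate points and ignores ties at the k-th
neighbor distance; the sensor coordinates and the choice k=3 are hypothetical.

```python
import math

def k_nearest(points, idx, k):
    """Return the k nearest neighbors of points[idx] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx
    )
    return dists[:k]

def local_outlier_factors(points, k=3):
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    k_dist = [nbrs[-1][0] for nbrs in neighbors]   # distance to the k-th neighbor

    def local_reachability_density(i):
        # Reachability distance smooths out very small distances inside dense areas.
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(len(points))]
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(len(points))
    ]

# Hypothetical sensor locations: a 3x3 grid of stations plus one isolated reading.
stations = [(float(i), float(j)) for i in range(3) for j in range(3)]
stations.append((10.0, 10.0))

scores = local_outlier_factors(stations, k=3)
for location, score in zip(stations, scores):
    print(location, round(score, 2))
```

Scores close to 1 mean a point is about as dense as its neighbors, while a score well
above 1, as for the isolated reading here, marks a point in a much sparser region.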
Social Media and Online Platforms:
Anomaly detection is critical for identifying fraudulent activities, fake accounts, and
abnormal user behavior on social media and online platforms. One-Class Support Vector
Machines (OCSVM) are commonly employed for this purpose, as they can effectively
distinguish between normal and abnormal instances in high-dimensional data (Schölkopf et
al., 2001). OCSVM learns a boundary around normal data in a high-dimensional feature space
and classifies instances that fall outside this boundary as anomalies. In a closely
related formulation, the boundary is a hypersphere that encloses the normal data points,
and points beyond the hypersphere are treated as outliers.
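The decision rule of a one-class SVM can be stated as follows. The notation is an
assumption introduced here for illustration: x_i are the training points, \alpha_i are the
learned coefficients (nonzero only for support vectors), \rho is the learned offset, and
k is a kernel, shown with the commonly used Gaussian form with width parameter \gamma.

```latex
f(x) = \operatorname{sgn}\!\left( \sum_{i} \alpha_i \, k(x_i, x) - \rho \right),
\qquad k(x, x') = \exp\!\left( -\gamma \, \lVert x - x' \rVert^{2} \right)

f(x) = +1 \;\Rightarrow\; x \text{ is treated as normal},
\qquad f(x) = -1 \;\Rightarrow\; x \text{ is treated as an anomaly}
```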
The flow chart of the algorithm is shown in Figure 3.
This table provides a comparative analysis of the pros and cons of different anomaly
detection algorithms.
CHAPTER 3
RESULTS AND DISCUSSION
This chapter presents the results and discussion from the exploration of the different…..
CHAPTER 4
SUMMARY AND CONCLUSION
Summary
In summary, this study has investigated how anomaly detection methods can be
integrated into data science across a range of disciplines. The study began with an overview
of the background and importance of anomaly detection in data science, then delved into
the theories, practices, and uses of anomaly detection techniques. Through a review of the
literature, the effectiveness and limitations of several anomaly detection algorithms were
examined in real-world scenarios encompassing cybersecurity, banking, healthcare,
manufacturing, and other sectors.
The goal of the research was to give scholars, practitioners, and other stakeholders
useful insights by clarifying the real-world applications of anomaly detection in
data-driven decision-making. Despite several limitations, such as the limited availability
of relevant research papers and data, the study analyzed the role of anomaly detection in
addressing important problems and producing favorable outcomes across a range of
industries.
Conclusion
In conclusion, the incorporation of anomaly detection methods into data science is a
noteworthy development with far-reaching consequences. Anomaly detection is essential for
protecting systems, improving operational effectiveness, and spurring innovation in a
variety of areas, including financial fraud detection, disease diagnosis, and industrial
process optimization.
Using a wide range of techniques and methodologies, anomaly detection identifies
patterns or occurrences in datasets that differ from expected behavior, which allows
stakeholders to make well-informed decisions and manage risks successfully. As research
and innovation in anomaly detection continue, data-driven practices may improve further,
making society safer, more secure, and more resilient while encouraging ethical
consideration of sustainable data usage.
REFERENCES
Aleskerov, E., Freisleben, B., & Rao, B. (1997). Cardwatch: A neural network based database mining
system for credit card fraud detection. In Proceedings of IEEE Computational Intelligence for
Financial Engineering. 220–226.
Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local
outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp.
93-104).
Chandola, V., Banerjee, A., & Kumar, V. (2007). Anomaly Detection: A Survey.
Ding, S., Zhao, X., & Fu, X. (2019). A survey on fault diagnosis and fault tolerance methods in
manufacturing systems. Journal of Manufacturing Systems, 53, 261-271.
https://doi.org/10.1016/j.jmsy.2019.02.006
Fujimaki, R., Yairi, T., & Machida, K. (2005). An approach to spacecraft anomaly detection problem
using kernel feature space. In Proceeding of the eleventh ACM SIGKDD international conference on
Knowledge discovery in data mining. ACM Press, New York, NY, USA, 401–410.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),
1735-1780.
Huang, G., Chen, J., & Liu, L. (2023). One-Class SVM Model-Based Tunnel Personnel Safety Detection
Technology.
https://www.mdpi.com/2076-3417/13/3/1734
Kumar, V. (2005). Parallel and distributed computing for cybersecurity. Distributed Systems Online,
IEEE 6, 10
Laptev, N., Gao, Y., Li, W., & Fujimaki, R. (2015). Time-series anomaly detection service at Microsoft.
In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (pp. 2259-2268). ACM.
Moustafa, N., & Slay, J. (2015). UNSW-NB15: A comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military
Communications and Information Systems Conference (MilCIS) (pp. 1-6). IEEE.
https://doi.org/10.1109/MilCIS.2015.7348949
Phua, C., Lee, V., Smith, K., & Gayler, R. (2010). A comprehensive survey of data mining-based fraud
detection research. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews), 41(6), 834-847.
https://doi.org/10.1109/TSMCC.2010.2041211
Pimentel, M. A. F., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection.
Signal Processing, 99, 215-249.
https://doi.org/10.1016/j.sigpro.2013.12.024
Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data
mining and data-analytic thinking. O'Reilly Media, Inc.
https://www.researchgate.net/publication/256438799_Data_Science_for_Business
Riad, A., Elhenawy, I., Hassan, A., & Awadallah, N. (2013). Visualize Network Anomaly Detection by
Using K-Means Clustering Algorithm.
https://airccse.org/journal/cnc/5513cnc14.pdf
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the
support of a high-dimensional distribution. Neural computation, 13(7), 1443-1471.