
Barclays Hack-O-Hire Hackathon

Team name - Team Blitz

Title
Ensuring Accurate Settlements: A Machine Learning Approach to Data Anomaly Detection

Introduction

The financial landscape relies heavily on efficient and accurate transaction settlements. These
settlements are based on trade data meticulously collected from various sources. This data, often
encompassing millions of records daily, forms the backbone for calculating and transferring funds.

However, the current scenario faces significant challenges due to inherent issues within the data itself.
These issues include:

Inflated Prices: Errors can occur where prices are recorded significantly higher than their actual value,
leading to overpayments and potential financial losses.
Missing Values: Crucial information may be absent from the data, hindering the ability to calculate
settlements accurately. This can lead to delays and disruptions in the entire process.
Bad Data: Inaccurate or nonsensical data entries, often referred to as "bad data" or "fat-finger" errors,
can further distort calculations and introduce inconsistencies.

These errors have a substantial impact on various aspects of the financial system:

Financial Losses: Inaccurate settlements due to inflated prices or missing data can result in significant
financial losses for both parties involved in the transaction.
Operational Delays: Identifying and rectifying errors can lead to delays in processing settlements,
hindering operational efficiency.
Reputational Risk: Frequent occurrences of erroneous settlements can damage the reputation of financial
institutions and erode trust within the system.

Addressing these data quality issues is crucial for ensuring the integrity and smooth functioning of
transaction settlements.

Proposed Solution: Anomaly Detection Framework for Trade Data

Stage 1: Data Acquisition and Preprocessing (ETL)

Extraction: Data will be gathered from various sources using methods like
Web Scraping: Publicly available trade data can be extracted from relevant websites.
API Integration: Established APIs provided by exchanges or data providers can be utilized for data
retrieval.
Transformation: The collected data will undergo cleaning processes to address:
Missing Values: Techniques like imputation or data deletion can be employed based on the specific
scenario.

Outliers: Potential outliers like inflated prices can be flagged for further investigation.
Inconsistencies: Data formatting issues will be rectified to ensure uniformity.

Load: The cleaned data will be loaded into a MongoDB database, chosen for its scalability and flexibility
in handling large datasets (a minimal ETL sketch follows this list).
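
To make the ETL flow concrete, below is a minimal Python sketch. The REST endpoint, database name, and column names (price, trade_id, trade_date) are illustrative assumptions; the real feeds and schema would come from the configured data sources.

# Minimal ETL sketch: pull trade records from a hypothetical REST endpoint,
# clean them with pandas, and load them into MongoDB.
import pandas as pd
import requests
from pymongo import MongoClient

API_URL = "https://example.com/api/trades"  # placeholder endpoint, not a real provider

def extract() -> pd.DataFrame:
    # Fetch raw trade records via API integration (assumes a JSON list of records).
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Handle missing values, flag potential outliers, and normalise formats.
    df = raw.copy()
    df["price"] = df["price"].fillna(df["price"].median())   # simple imputation
    df = df.dropna(subset=["trade_id"])                      # drop unusable rows
    z = (df["price"] - df["price"].mean()) / df["price"].std()
    df["price_outlier_flag"] = z.abs() > 3                   # flag suspected inflated prices
    df["trade_date"] = pd.to_datetime(df["trade_date"], errors="coerce")
    return df

def load(df: pd.DataFrame) -> None:
    # Insert the cleaned records into MongoDB.
    client = MongoClient("mongodb://localhost:27017")
    client["settlements"]["trades"].insert_many(df.to_dict("records"))

if __name__ == "__main__":
    load(transform(extract()))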

Stage 2: Data Storage and Management

A dedicated platform will be built to facilitate the addition of new data sources (a minimal sketch
follows this list). This platform can involve:
Web Interface: A user-friendly interface allowing for the configuration of new data feeds.
API Integration Library: A library simplifying the process of integrating with various data providers
through APIs.
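
A minimal sketch of such a data-feed registry is shown below, assuming MongoDB as the backing store; the feed fields, collection name, and example feed are illustrative, not a finalized design.

# Hypothetical data-feed registry so the ETL layer can discover newly added sources.
from dataclasses import dataclass, asdict
from pymongo import MongoClient

@dataclass
class DataFeedConfig:
    name: str           # e.g. "nse_equities" (illustrative)
    source_type: str    # "api" or "web_scrape"
    endpoint: str       # URL or API base path
    schedule_cron: str  # how often the feed is pulled, e.g. "0 * * * *"

class FeedRegistry:
    # Stores feed configurations added through the web interface or API library.
    def __init__(self, mongo_uri: str = "mongodb://localhost:27017"):
        self._feeds = MongoClient(mongo_uri)["settlements"]["data_feeds"]

    def register(self, config: DataFeedConfig) -> None:
        # Upsert so re-registering a feed simply updates its configuration.
        self._feeds.update_one({"name": config.name}, {"$set": asdict(config)}, upsert=True)

    def list_feeds(self) -> list:
        return list(self._feeds.find({}, {"_id": 0}))

# Example: FeedRegistry().register(DataFeedConfig("nse_equities", "api",
#                                                 "https://example.com/api/nse", "0 * * * *"))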

Stage 3: Data Security

Open-source encryption algorithms will be implemented (a brief sketch follows the list below):


Data at Rest: Encrypting data within the database using algorithms like AES-256 ensures protection even
in case of a security breach.
Data in Transit: Encryption during data transfer between sources and the database safeguards sensitive
information.
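
As an illustration, data at rest could be encrypted with the open-source cryptography package (AES-256 in GCM mode); key management is deliberately omitted here, and the in-memory key is for demonstration only.

# Sketch: AES-256-GCM encryption of records before they are written to the database.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # 256-bit key -> AES-256 (manage via a vault/KMS in practice)
aesgcm = AESGCM(key)

def encrypt_record(plaintext: bytes) -> bytes:
    # Prepend a unique 96-bit nonce so the record can be decrypted later.
    nonce = os.urandom(12)
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

def decrypt_record(blob: bytes) -> bytes:
    # Split off the nonce and recover the original record.
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None)

# Data in transit would typically be protected by enabling TLS on the MongoDB
# connection (e.g. MongoClient(uri, tls=True)) rather than in application code.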

Stage 4: Anomaly Detection with Isolation Forest

Utilizing Isolation Forest:


Training the Model: The algorithm is trained using the cleaned data obtained from Stage 1 (ETL). This
training establishes a baseline understanding of "normal" data patterns.
Real-time Anomaly Detection: As new trade data arrives, each data point is fed into the trained model,
which analyzes it based on various features (e.g., price, volume).
Anomaly Scoring:
A score is assigned based on the isolation process. Data points that deviate significantly from the norm
require fewer "splits" (decisions) to isolate within the model's internal structure.
Lower scores indicate a higher likelihood of being an anomaly (a minimal training-and-scoring sketch
follows this list).
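
A minimal sketch of this training-and-scoring workflow with scikit-learn's IsolationForest is given below; the feature list, contamination rate, and input file name are assumptions for illustration.

# Train on cleaned Stage 1 data, then score incoming trades in (near) real time.
import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["price", "volume"]                         # illustrative feature set

clean_df = pd.read_csv("clean_trades.csv")             # placeholder for the Stage 1 output
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(clean_df[FEATURES])                          # learn the "normal" baseline

def score_new_trades(new_df: pd.DataFrame) -> pd.DataFrame:
    # Lower (more negative) scores mean the point was easier to isolate,
    # i.e. more likely to be an anomaly.
    scored = new_df.copy()
    scored["anomaly_score"] = model.score_samples(new_df[FEATURES])
    scored["is_anomaly"] = model.predict(new_df[FEATURES]) == -1   # -1 marks anomalies
    return scored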

Effectiveness for Trade Data Anomalies:


Unsupervised Advantage: Isolation Forest works without pre-labeled data (known anomalies). This is
crucial as labeling every anomaly in financial data might be impractical.
Efficiency: The algorithm efficiently handles large datasets due to its linear time complexity.

Interpretability: Anomaly scores provide a relative indication of how much a data point deviates
from the norm, aiding in prioritizing investigations for potentially suspicious activity.

Integration with the Framework:

Identified anomalies whose scores cross a predefined threshold (derived from the score distribution) are
flagged for further scrutiny (see the thresholding sketch below).
This allows for timely investigation and potential rectification of errors or fraudulent activity.
Isolation Forest's benefit lies in its ability to efficiently uncover hidden patterns within the data. By
focusing on data points that deviate significantly from the established baseline, the model effectively
highlights potential anomalies that warrant closer examination.
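
The thresholding step could look like the sketch below, which routes trades whose scores fall in the lowest percentile of the score distribution to a review queue; the 1% cutoff and the reuse of the score_new_trades helper from the previous sketch are illustrative choices.

# Flag the most anomalous trades (lowest scores) for manual review.
import numpy as np
import pandas as pd

def flag_for_review(scored: pd.DataFrame, percentile: float = 1.0) -> pd.DataFrame:
    # Keep only the trades whose scores sit in the lowest `percentile` of the distribution.
    threshold = np.percentile(scored["anomaly_score"], percentile)
    return scored[scored["anomaly_score"] <= threshold]

# flagged = flag_for_review(score_new_trades(incoming_batch))
# The flagged records can then feed the Stage 5 dashboards and alerts.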

Fig: Anomaly Detection Framework Class Diagram

Stage 5: Data Visualization and Alerting

Tableau: This data visualization tool will be used to:


Create dashboards displaying key metrics related to data quality and identified anomalies.
Allow for interactive exploration of the data to gain further insights.
Alerting System: Triggers will be established to notify relevant personnel when anomalies are detected by
the Isolation Forest model.
This enables prompt investigation and rectification of potential errors.
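
One possible shape for such a trigger is sketched below, posting a short summary of flagged trades to a placeholder webhook; an email or ticketing-system integration would work equally well.

# Illustrative alert hook; the webhook URL and column names are placeholders.
import pandas as pd
import requests

ALERT_WEBHOOK = "https://example.com/hooks/settlements-alerts"

def send_anomaly_alert(flagged: pd.DataFrame) -> None:
    if flagged.empty:
        return
    payload = {
        "text": f"{len(flagged)} anomalous trades detected",
        "trade_ids": flagged["trade_id"].astype(str).tolist()[:20],  # cap the list size
    }
    requests.post(ALERT_WEBHOOK, json=payload, timeout=10)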
Additional Considerations:
Model Evaluation: The performance of the Isolation Forest model will be monitored and evaluated using
metrics like precision and recall. This allows for fine-tuning the model and ensuring its accuracy in
identifying anomalies.
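
A sketch of this evaluation is shown below; it assumes a small manually labeled validation file (names are illustrative) and reuses the trained model and FEATURES list from the Stage 4 sketch.

# Compare the model's predictions against manually confirmed anomalies.
import pandas as pd
from sklearn.metrics import precision_score, recall_score

labeled = pd.read_csv("labeled_validation_trades.csv")   # placeholder labeled sample
y_true = labeled["is_true_anomaly"]                      # 1 = confirmed anomaly, 0 = normal
y_pred = (model.predict(labeled[FEATURES]) == -1).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))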

Benefits:

Improved Data Quality: By addressing data inconsistencies and anomalies, the framework ensures the
accuracy of trade data used for transaction settlements.
Reduced Financial Losses: Early detection and rectification of errors prevent financial losses due to
inflated prices or missing information.
Enhanced Operational Efficiency: Timely identification of anomalies minimizes delays in processing
settlements.
Increased Trust and Transparency: A robust framework fosters trust within the financial system by
ensuring the integrity of transaction data.

This proposed solution provides a comprehensive approach to tackle the challenges of inaccurate trade
data. By utilizing a combination of data acquisition, preprocessing, machine learning, and data
visualization techniques, the framework aims to safeguard the financial system and ensure accurate
transaction settlements.
Fig: Anomaly Detection Framework System Architecture


Dataset References

● https://data.world/adamhelsinger/cfpb-credit-card-history
● https://www.nseindia.com/market-data/real-time-data-subscription
● https://github.com/NSEDownload/NSEDownload
