Technical Solution Document: Version Number: 0.0 Version Date: May 9, 2016
For
Course# DS-610
Big Data Analytics
Document History

Document Details:

Service Request #:
Project Title: Credit Card Fraud Detection Using Data Analytics
Professor: Prof. Gerardo Menegaz
Author: Sumit Sameriya
Date Prepared:
Completion Date:

Revision History

Revision Number (#) | Changes Marked (N)

Peer Review

This document requires the following approvals.

Name           | Title
Sumit Sameriya | Student

Distribution

This document has been distributed to:

Name                  | Title
Prof. Gerardo Menegaz | Professor
Contents

1. Introduction
   1.1 Credit Card Fraud Detection Using Data Analytics Overview
2. Architecture Overview
   2.1 IT System Level
       2.1.1 Diagram
3. Architectural Decisions
4. Statistical Methods
5. Solution Components
   5.1.1 Hardware
   5.1.2 Software
   5.1.3 Network
   5.1.4 Monitoring
6. Viability Assessment
   6.1 Functional Requirements
   6.2 Non-Functional Requirements
   6.3 Risks
   6.4 Assumptions
   6.5 Dependencies
1. Introduction
1.1 Credit Card Fraud Detection Using Data Analytics Overview
With the advent of modern communication techniques, e-commerce and online payment transactions
are increasing day by day. Financial frauds associated with these transactions are intensifying as well,
resulting in the loss of billions of dollars every year globally. Among the various financial frauds,
credit card fraud is the oldest, most common, and most dangerous, owing to its widespread usage and
the convenience it offers to customers. Benefits such as cash back, reward points, interest-free credit,
and discount offers on purchases made at selected stores tempt customers to use a credit card instead
of cash for their purchases. In 2013, 40% of total financial fraud was related to credit cards, and the
worldwide loss due to credit card fraud was $5.55 billion. Fraudsters get access to credit card
information in many ways. According to a recent report by CBC News
(http://www.huffingtonpost.ca/2013/04/24/smartphones-steal-credit-card-data_n_3148170.html),
smartphones can be used to skim credit card data easily with a free Google application. With
compromised credit cards and data breaches dominating the headlines in the past couple of years,
data breaches totaled 1,540 worldwide in 2014 -- up 46 percent from the year before -- and led to the
compromise of more than one billion data records. Twelve percent of breaches occurred in the
financial services sector; 11 percent happened in the retail sector. Malicious outsiders were the
culprits in 55 percent of data breaches, while malicious insiders accounted for 15 percent.

U.S. credit card fraud is on the rise. About 31.8 million U.S. consumers had their credit cards
breached in 2014, more than three times the number affected in 2013. That fraud isn't cheap. Nearly
90 percent of card breach victims in 2014 received replacement credit cards, costing issuers as much
as $12.75 per card (http://www.creditcards.com/credit-card-news/credit-card-security-id-theft-fraud-statistics-1276.php#ixzz467E1FXcl).
To accurately classify a credit card transaction as fraudulent or legitimate, I propose a fraud miner.
Using a frequent itemset mining technique, which we will discuss in detail in Section 2.1.2, the
legitimate transaction pattern and the fraud transaction pattern of each customer are created during
the training phase from their previous transactions in the database. During the testing phase, a
matching algorithm detects which pattern the incoming transaction matches more closely. If the
incoming transaction matches the legitimate pattern, the algorithm returns 0 (legitimate transaction);
if it matches the fraud pattern, the algorithm returns 1 (fraud transaction). The transaction record is
also maintained in the transaction database so that the algorithm can treat future transactions
properly. The model is proposed in Figure 2 below.
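To make the matching step concrete, the following is a minimal sketch in Python. It assumes the per-customer legitimate and fraud patterns have already been mined (Section 2.1.2) into sets of attribute values; the pattern representation and the overlap-based scoring shown here are illustrative simplifications, not the final fraud-miner implementation.

# Minimal sketch of the fraud-miner matching phase (illustrative only).
# Assumes each customer's legitimate and fraud patterns were already
# mined from past transactions into sets of attribute values.

def match_score(transaction_items, pattern):
    """Fraction of the mined pattern present in the incoming transaction."""
    if not pattern:
        return 0.0
    return len(transaction_items & pattern) / len(pattern)

def classify(transaction_items, legit_pattern, fraud_pattern):
    """Return 0 (legitimate) or 1 (fraud), whichever pattern matches more."""
    if match_score(transaction_items, legit_pattern) >= \
       match_score(transaction_items, fraud_pattern):
        return 0
    return 1

# Example with made-up attribute values for one customer:
legit_pattern = {"amount_band=low", "country=US", "category=grocery"}
fraud_pattern = {"amount_band=high", "country=other", "hour=night"}
incoming = {"amount_band=high", "country=other", "category=electronics"}
print(classify(incoming, legit_pattern, fraud_pattern))  # prints 1 (fraud)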
Data analysis can be used to analyze an organization's business data to gain insight into how well internal
controls are operating and to identify transactions that indicate a heightened risk of fraud. Data analysis can
be applied anywhere in the organization where electronic transactions are recorded and stored. In the future,
the algorithm can also be improved from captured transactions, as it is based on a supervised machine
learning technique; as the variety of data increases, the robustness of the algorithm will increase.
2. Architecture Overview
2.1 IT System Level
The application will reside in Amazon Web Services (AWS). The implemented architecture consists of six
subsystems: Front End Interface, Network Interface, Application/Web Server, Database Interface, Credit
Card Fraud Detection Engine, and Business Intelligence tool.

Front End Interface: A supporting device (POS, Point of Sale); input to the application will be
transferred through this interface in binary format when the user swipes a credit card.

Network Interface: Supporting devices (routers, switches); it will be responsible for routing
transaction information to the Issuing Bank, Acquirer Bank, Application Server, Business Intelligence
tool, and Database Interface.

Database Interface: The database interface subsystem is the entry point through which
transactions are read into the system.

BI Tool: A set of techniques and tools for the transformation of raw data into meaningful and useful
information for business intelligence purposes by management.
2.1.1 Diagram
2.1.1.1 Logical Design
Front End Interface: This component will be at the client end; it could be in the form of a POS (Point of Sale)
terminal and will allow the user to swipe the card. It reads the client information from the chip/magnetic stripe
and routes the transaction information in binary format to the issuing and acquirer banks for further processing.
Network Interface: After the information is processed from the POS, the merchant securely transfers the order
information to the proper payment gateway. The payment gateway receives the order information and
appropriately routes the transaction to the processor. The processor immediately submits the request to the
credit card interchange. The transaction is then routed to the issuing bank (purchaser's bank) to request
transaction authorization.
Application/Web Server: The fraud detection application will be developed using Java/Python. This
application will be scheduled on the web server, so the application server will handle the communication with
the database and other components to fetch data and perform analysis.
Transaction Database, DB Server: MySQL Workbench will be used to maintain the customer transaction
database, the legitimate pattern database, and the fraud pattern database. MySQL Workbench will reside on
the database server. The fraud application and BI tools will communicate with the database through this
database server.
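As an illustration of how these three databases might be laid out, the following sketch creates them with the mysql-connector-python driver; all table and column names here are assumptions for illustration, not a finalized schema.

import mysql.connector  # pip install mysql-connector-python

# Hypothetical schema; table and column names are illustrative assumptions.
DDL = [
    """CREATE TABLE IF NOT EXISTS transactions (
        txn_id BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL,
        amount DECIMAL(10, 2),
        merchant_category VARCHAR(40),
        txn_time DATETIME,
        label TINYINT
    )""",  # label: 0 = legitimate, 1 = fraud
    """CREATE TABLE IF NOT EXISTS legitimate_patterns (
        customer_id BIGINT,
        pattern_item VARCHAR(80),
        support FLOAT
    )""",
    """CREATE TABLE IF NOT EXISTS fraud_patterns (
        customer_id BIGINT,
        pattern_item VARCHAR(80),
        support FLOAT
    )""",
]

cnx = mysql.connector.connect(host="localhost", user="app",
                              password="secret", database="fraud_db")
cur = cnx.cursor()
for stmt in DDL:
    cur.execute(stmt)
cnx.commit()
cnx.close()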
Credit Card Fraud Detection Engine: In the credit card fraud detection subsystem, each transaction
entering the system is passed to the host server, where the corresponding transaction profile is
checked against the transaction business rules.
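To make the rule check concrete, here is a minimal sketch; the rule names and thresholds are hypothetical, not taken from the actual engine.

# Hypothetical business rules; names and thresholds are illustrative only.
RULES = [
    ("amount_over_profile", lambda t, p: t["amount"] > 5 * p["avg_amount"]),
    ("foreign_country",     lambda t, p: t["country"] != p["home_country"]),
    ("night_transaction",   lambda t, p: t["hour"] < 5),
]

def check_rules(txn, profile):
    """Return the names of all business rules the transaction violates."""
    return [name for name, rule in RULES if rule(txn, profile)]

profile = {"avg_amount": 80.0, "home_country": "US"}
txn = {"amount": 950.0, "country": "RO", "hour": 3}
print(check_rules(txn, profile))
# prints ['amount_over_profile', 'foreign_country', 'night_transaction']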
Business Intelligence (BI): BI is the set of techniques and tools for the transformation of raw data into
meaningful and useful information. BI technologies are capable of handling large amounts of unstructured
data to help identify, develop, and otherwise create new strategic business opportunities. The goal of BI is to
allow for the easy interpretation of these large volumes of data. Identifying new opportunities and
implementing an effective strategy based on insights can provide businesses with a competitive market
advantage and long-term stability. BI technologies provide historical, current, and predictive views of
business operations; a common function of these technologies is reporting.
Statistical Methods

Statistical model: see Section 4.1.1
Statistical tools to be employed (see Section 4.1):
Tableau Desktop
Python
Java
MySQL
Hadoop
Pig
Hive
3. Architectural Decisions
Ref: AD1
Topic: Customer transaction processing
Decision: Deploy the application on a cloud server to improve response time and maximize scalability.
Uploading the complete fraud detection application to a cloud server will improve transaction processing
and response time, and storing all customer transaction data files on remote servers will help in
maintaining data security and quick accessibility for visualization purposes.
Issue or Problem Statement: Credit card processing is a giant industry; each payment network receives
billions of transaction requests per second. Processing these requests, as well as applying analytics to
distinguish between legitimate and fraudulent transactions, is a challenging task that requires a trusted
application environment.
Assumptions: Access is required to the application and the transaction database 24 hours a day, 7 days a
week, with minimal disruption caused by any downtime of "legacy" systems. Response time needs to be
reasonable (that is, less than 3 seconds) for all users wherever they are placed.
Motivation: No time lag in analyzing incoming transactions as fraud or legitimate.
Alternatives: Option 1 - Deploy the application and transaction database to cloud servers distributed
throughout the network to help improve response time and maximize scalability.
Decision: Option 1
Justification: This is a viable option given the highly distributed and trusted nature of this application.
Implications: Requires cloud server technology to be identified and procured.
Derived Requirements: None
Related Decisions: None

Ref: AD2
Topic: Disaster recovery
Decision: Use RAID level 10 to handle disaster recovery (a RAID 10 disaster recovery mechanism).
Justification: With RAID 10, a replica of the data is created on an additional disk, which takes charge if the
primary disk goes down. This option is viable for the designed application.
Implications: Requires an additional disk for replicating the data from the primary disk.
Derived Requirements: None
Related Decisions: None
4. Statistical Methods

4.1.1 Statistical Model
Random Forest: A random forest is an ensemble of decision trees. The basic principle behind ensemble
methods is that a group of weak learners can come together to form a strong learner. Random forests
grow many decision trees: each individual decision tree is a weak learner, while all the decision trees
taken together form a strong learner. When a new object is to be classified, it is run down each of the trees
in the forest. Each tree gives a classification output, or vote, for a class. The forest classifies the new object
into the class with the maximum votes. Random forests are fast, and they can efficiently handle unbalanced
and large databases with thousands of features.

In my case, this algorithm comes into the picture if the incoming transaction is not from an existing customer.
In that case, the incoming transaction's attributes are analyzed against existing customers' transaction
attributes.
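For illustration, here is a minimal sketch of this approach using scikit-learn's RandomForestClassifier on made-up transaction features; the feature choices and data are assumptions for illustration, not the production model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: columns are [amount, hour_of_day, is_foreign];
# labels: 0 = legitimate, 1 = fraud.
rng = np.random.default_rng(0)
X_legit = np.column_stack([rng.normal(60, 20, 500),
                           rng.integers(8, 22, 500),
                           np.zeros(500)])
X_fraud = np.column_stack([rng.normal(900, 300, 50),
                           rng.integers(0, 6, 50),
                           np.ones(50)])
X = np.vstack([X_legit, X_fraud])
y = np.array([0] * 500 + [1] * 50)

# Each tree is a weak learner; the forest's majority vote is the output.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

new_txn = [[850.0, 3, 1]]  # large, night-time, foreign transaction
print(forest.predict(new_txn))  # prints [1], i.e. classified as fraud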
5. Solution Components
5.1.1 Hardware
My application will be cloud based. Every credit card transaction (geospatial data) performed through any
POS terminal will be pushed to the cloud, and the data will pass through each tier (database, application,
presentation). Below is the cloud description that clarifies the processing of geospatial data.
Cloud Description:
1. Apache Kafka (network data source) is chosen to feed credit card swipe messages into the
architecture. Real-time data is published by payment processing systems over Kafka queues. Each
transaction has hundreds of attributes that can be analyzed in real time to detect patterns of
usage. We leverage Kafka integration with Apache Storm to read one value at a time and persist
the data into an HBase cluster. Storm is a stream-processing framework that also does
micro-batching. (A consumer sketch in Python follows this list.)
2. Once the machine learning models are defined, incoming data received from the Storm/Spark tier will
be ingested into the models to predict outlier transactions or potential fraud.
3. Data that has business relevance and needs to be kept offline can be handled using a storage platform
based on the Hadoop Distributed File System (HDFS). Historical data can be fed into the machine
learning models to understand the fraud pattern.
4. Output data elements can be written out to HDFS and managed by HBase. From here, reports and
visualizations can easily be constructed.
5. Some data needs to be pulled in near real time, accessed in a low-latency pattern, and have
calculations performed on it. In-memory technology based on Spark is very suitable for this
use case, as it not only supports a very high write rate but also gives users the ability to store, access,
modify, and transfer extremely large amounts of distributed data.
6. The second data access pattern that needs to be supported is storage for older data. This is
typically large-scale historical data. This layer contains the immutable, constantly growing master
dataset stored on a distributed file system like HDFS. Besides being a storage mechanism, the data
stored in this layer can be formatted in a manner suitable for consumption by any tool within the
Apache Hadoop ecosystem, such as Hive, Pig, or MySQL.
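As referenced in item 1 above, here is a minimal consumer sketch using the kafka-python client; the topic name, broker address, message format, and the stand-in scoring rule are all assumptions for illustration (the real scoring happens in the Storm/Spark tier).

import json
from kafka import KafkaConsumer  # pip install kafka-python

# Topic name and broker address are illustrative assumptions.
consumer = KafkaConsumer(
    "card-swipes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

def score(txn):
    """Stand-in for the Storm/Spark scoring tier; the rule is illustrative."""
    return 1 if txn.get("amount", 0) > 1000 else 0

# Read one swipe message at a time, as described in item 1.
for message in consumer:
    txn = message.value
    txn["fraud_flag"] = score(txn)
    # ...persist txn to the HBase cluster here...
    print(txn)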
5.1.2 Software
5.1.2.1 Solution Specific Software
Manufacturer / Title | Procurement | Ownership | Installation | Support
Hive                 | Open Source | N/A       | DBA          | DBA
MySQL                | Open Source | N/A       | DBA          | DBA
Tableau              | License     | Self      | SA           | TABLEAU
Python               | Open Source | N/A       | SA           | SA
Apache Storm         | Open Source | N/A       | SA           | HORTONWORKS, MAPR
Spark                | Open Source | N/A       | SA           | CLOUDERA, HORTONWORKS, MAPR
5.1.3 Network
Description | Subnet Mask | Masking Bits | VLAN
            |             |              |
            |             |              |
            |             |              |

Description | IP Address
            |
            |
5.1.4 Monitoring
For monitoring of my application I will use Amazon CloudWatch, which monitors AWS cloud resources and
the applications that run on AWS. With CloudWatch, we can collect and track metrics, collect and monitor
log files, and set alarms. Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances,
Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by
applications and services in EC2 and any log files the applications generate.
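As a minimal sketch of this setup with boto3, the following publishes a custom metric and sets an alarm on it; the namespace, metric name, and threshold are assumptions for illustration.

import boto3  # pip install boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a custom metric each time the engine flags a transaction.
# Namespace and metric name are illustrative assumptions.
cloudwatch.put_metric_data(
    Namespace="FraudDetection",
    MetricData=[{"MetricName": "FlaggedTransactions",
                 "Value": 1, "Unit": "Count"}],
)

# Alarm if more than 100 transactions are flagged in a five-minute period.
cloudwatch.put_metric_alarm(
    AlarmName="HighFraudRate",
    Namespace="FraudDetection",
    MetricName="FlaggedTransactions",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
)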
6. Viability Assessment
6.1 Functional Requirements
Risk ID | Finding / Risk Description | Probability (H/M/L) | Effort / Cost | Impact (H/M/L) | Contingency / Mitigation Recommendation | Person Responsible | Review Date
        |                            | H                   |               |                |                                         | SA                 |
        |                            |                     |               |                |                                         | SA                 |
        |                            |                     |               |                |                                         | SA                 |

6.2 Non-Functional Requirements

Risk ID | Finding / Risk Description           | Probability (H/M/L) | Effort / Cost | Impact (H/M/L) | Contingency / Mitigation Recommendation | Person Responsible | Review Date
NFR01   | The system shall be accessible 24/7  |                     |               |                |                                         | Support Team       |
NFR02   | The system shall be highly available |                     |               |                |                                         | Support Team       |
NFR03   | The system shall be auditable        |                     |               |                |                                         | Project Manager    |
6.3 Risks
Risk ID | Finding / Risk Description     | Probability (H/M/L) | Effort / Cost | Impact (H/M/L) | Contingency / Mitigation Recommendation | Person Responsible | Review Date
R01     | AWS unavailability             |                     |               |                |                                         | Configuration and Networking Team |
R02     | Fraud detection model accuracy |                     |               |                | Model should be designed after properly analyzing attributes of transaction data. | Project Team |
R03     | Regulatory environment         |                     |               |                | Ensure that no identifiable personal data is stored if the law prohibits it in any state. | Project Team |
R04     | Message loss from Kafka        |                     |               |                |                                         | Configuration Team |
6.4 Assumptions
Assumption ID | Finding / Assumption Description | Confidence Level (H/M/L) | Impact (H/M/L) | Assumption Identified by | Review Date | Closed Date
A01           |                                  |                          |                | Self                     |             |
A02           | KNN is a good classification approach for tracking fraud transactions | | | Self | |
A03           |                                  |                          |                | Self                     |             |
6.5 Dependencies
Dependency ID | Finding / Dependency Description | Effect on Plan                              | Required by Date    | Owner  | Associated Risk ID | Closed Date
D01           | AWS availability                 | Application cannot be useful if unavailable | Starting of project | Amazon | R01                |
D02           | Proper transaction streaming     |                                             | Starting of project | Self   | R04                |
D03           | Regulatory environment           |                                             |                     |        | R01                |