Fraud Transaction Prediction

1. The document cleans fraud transaction data and explores outliers and multicollinearity. It removes outliers using IQR and encodes categorical variables. High correlation is found between oldbalanceDest and newbalanceDest but they are important variables.

Uploaded by

Devanshu Mishra

import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

data = pd.read_csv('Fraud.csv')

data.head()

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72
2     1  TRANSFER    181.00  C1305486145          181.0            0.00
3     1  CASH_OUT    181.00   C840083671          181.0            0.00
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
0  M1979787155             0.0             0.0        0               0
1  M2044282225             0.0             0.0        0               0
2   C553264065             0.0             0.0        1               0
3    C38997010         21182.0             0.0        1               0
4  M1230701703             0.0             0.0        0               0

1. Data cleaning including missing values, outliers and multi-collinearity.

data.isna().sum()

step 0
type 0
amount 0
nameOrig 0
oldbalanceOrg 0
newbalanceOrig 0
nameDest 0
oldbalanceDest 0
newbalanceDest 0
isFraud 0
isFlaggedFraud 0
dtype: int64

data = data.drop(['nameOrig','nameDest'],axis=1)

data.head()

   step      type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     1   PAYMENT   9839.64       170136.0       160296.36             0.0
1     1   PAYMENT   1864.28        21249.0        19384.72             0.0
2     1  TRANSFER    181.00          181.0            0.00             0.0
3     1  CASH_OUT    181.00          181.0            0.00         21182.0
4     1   PAYMENT  11668.14        41554.0        29885.86             0.0

   newbalanceDest  isFraud  isFlaggedFraud
0             0.0        0               0
1             0.0        0               0
2             0.0        1               0
3             0.0        1               0
4             0.0        0               0

from sklearn.preprocessing import LabelEncoder


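# Encode the transaction type as integers; the name type_encoder avoids
# shadowing Python's built-in ord()
type_encoder = LabelEncoder()
data['type'] = type_encoder.fit_transform(data['type'])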

data.head()

   step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     1     3   9839.64       170136.0       160296.36             0.0
1     1     3   1864.28        21249.0        19384.72             0.0
2     1     4    181.00          181.0            0.00             0.0
3     1     1    181.00          181.0            0.00         21182.0
4     1     3  11668.14        41554.0        29885.86             0.0

   newbalanceDest  isFraud  isFlaggedFraud
0             0.0        0               0
1             0.0        0               0
2             0.0        1               0
3             0.0        1               0
4             0.0        0               0
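
The integer codes in the preview follow the alphabetical order of the category labels. The sketch below reproduces that mapping with pandas and shows a one-hot alternative that avoids implying an order between transaction types. Only PAYMENT, TRANSFER and CASH_OUT appear in the preview, so the names CASH_IN and DEBIT are an assumption made to fill the remaining codes.

```python
import pandas as pd

# Hypothetical sample; CASH_IN and DEBIT are assumed category names.
types = pd.Series(["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "DEBIT", "PAYMENT"])

# sort=True assigns codes in alphabetical order of the labels.
codes, labels = pd.factorize(types, sort=True)
mapping = {label: code for code, label in enumerate(labels)}
print(mapping)

# One-hot encoding gives each type its own indicator column instead of a rank.
one_hot = pd.get_dummies(types, prefix="type")
print(one_hot.columns.tolist())
```

Whether the integer codes or the indicator columns work better depends on the model; tree-based models are usually less sensitive to the choice than linear ones.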

Checking the Outliers


for col in data.columns:
    plt.figure(figsize=(10, 10))
    # Pass the column as the keyword argument x; the original positional call
    # made seaborn emit a FutureWarning for each of the nine columns
    sns.boxplot(x=data[col], orient='h')
    plt.xlabel(col)

Removing the Outliers Using the Interquartile Range Method
columns_to_analyze = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
                      'oldbalanceDest', 'newbalanceDest']
# plt.figure(figsize=(30,30))

# Create boxplots for each selected column
# data[columns_to_analyze].boxplot(figsize = (100,100))

# Identify the outliers using the interquartile range (IQR)
q1 = data[columns_to_analyze].quantile(0.25)
q3 = data[columns_to_analyze].quantile(0.75)
iqr = q3 - q1

# Flag values outside [q1 - 1.5*IQR, q3 + 1.5*IQR], then drop every row
# that has at least one flagged value
outliers = ((data[columns_to_analyze] < (q1 - 1.5 * iqr)) |
            (data[columns_to_analyze] > (q3 + 1.5 * iqr)))
df_no_outliers = data[~outliers.any(axis=1)]

df_no_outliers.shape

(4393187, 9)

data.shape

(6362620, 9)

total_outlier = data.shape[0] - df_no_outliers.shape[0]

total_outlier

1969433
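
Fraudulent transactions may involve unusually large amounts, in which case an IQR filter can discard a disproportionate share of the fraud rows. The following sketch, run on a small made-up frame (the values and the helper name `iqr_filter` are illustrative and not part of the original notebook), shows one way to check how many positives survive the filter.

```python
import pandas as pd

def iqr_filter(df, columns, k=1.5):
    """Drop rows where any of `columns` lies outside [q1 - k*iqr, q3 + k*iqr]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    outliers = (df[columns] < (q1 - k * iqr)) | (df[columns] > (q3 + k * iqr))
    return df[~outliers.any(axis=1)]

# Made-up data: 96 ordinary amounts and 4 very large fraudulent ones.
toy = pd.DataFrame({
    "amount": list(range(100, 196)) + [10_000, 20_000, 30_000, 40_000],
    "isFraud": [0] * 96 + [1] * 4,
})
kept = iqr_filter(toy, ["amount"])

print("fraud rows before:", int(toy["isFraud"].sum()))
print("fraud rows after: ", int(kept["isFraud"].sum()))
```

Running the same comparison on `data['isFraud']` and `df_no_outliers['isFraud']` would show how the real filter affected the target class.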

cleaned_data = df_no_outliers

cleaned_data.head()

   step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     1     3   9839.64       170136.0       160296.36             0.0
1     1     3   1864.28        21249.0        19384.72             0.0
2     1     4    181.00          181.0            0.00             0.0
3     1     1    181.00          181.0            0.00         21182.0
4     1     3  11668.14        41554.0        29885.86             0.0

   newbalanceDest  isFraud  isFlaggedFraud
0             0.0        0               0
1             0.0        0               0
2             0.0        1               0
3             0.0        1               0
4             0.0        0               0

Checking the Multicollinearity


from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_VIF(x):
    # Compute the variance inflation factor of every column in x
    vif = pd.DataFrame()
    vif['variables'] = x.columns
    vif['VIF'] = [variance_inflation_factor(x.values, i)
                  for i in range(x.shape[1])]
    return vif

x = cleaned_data.drop('isFraud', axis=1)
calc_VIF(x)

C:\Users\devan\.conda\envs\MachineLearning\lib\site-packages\
statsmodels\regression\linear_model.py:1783: RuntimeWarning: invalid
value encountered in double_scalars
return 1 - self.ssr/self.uncentered_tss

variables VIF
0 step 2.964152
1 type 2.435823
2 amount 4.805709
3 oldbalanceOrg 2.420535
4 newbalanceOrig 2.996540
5 oldbalanceDest 53.449834
6 newbalanceDest 68.756255
7 isFlaggedFraud NaN
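
For reference, the variance inflation factor of column j is 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns. The sketch below computes it with NumPy on synthetic data. Unlike the call above, it adds an intercept to each regression, which is an assumption about the intended usage rather than what the notebook did. The `NaN` reported for `isFlaggedFraud` is consistent with a column that contains only zeros in the filtered data, since the warning shows a division by `uncentered_tss`.

```python
import numpy as np

def vif(X):
    """Return 1 / (1 - R^2) for each column of X, regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)   # nearly a copy of x1
x3 = rng.normal(size=1000)              # unrelated column
values = vif(np.column_stack([x1, x2, x3]))
print(values)
```

The two nearly identical columns get a large VIF while the unrelated one stays close to 1, which mirrors the pattern seen for `oldbalanceDest` and `newbalanceDest`.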

As we can see, oldbalanceDest and newbalanceDest are highly collinear: both have a VIF above 50.

plt.figure(figsize=(10,10))
corr = cleaned_data.corr()
sns.heatmap(corr,annot=True)

<AxesSubplot:>
The heatmap also shows that the columns oldbalanceDest and newbalanceDest are
highly correlated. Both are kept anyway, because the initial and final balances
on the destination side are important to know.

cleaned_data.head()

   step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     1     3   9839.64       170136.0       160296.36             0.0
1     1     3   1864.28        21249.0        19384.72             0.0
2     1     4    181.00          181.0            0.00             0.0
3     1     1    181.00          181.0            0.00         21182.0
4     1     3  11668.14        41554.0        29885.86             0.0

   newbalanceDest  isFraud  isFlaggedFraud
0             0.0        0               0
1             0.0        0               0
2             0.0        1               0
3             0.0        1               0
4             0.0        0               0

sns.kdeplot(cleaned_data['oldbalanceOrg'])

<AxesSubplot:xlabel='oldbalanceOrg', ylabel='Density'>
sns.kdeplot(cleaned_data['oldbalanceDest'])

<AxesSubplot:xlabel='oldbalanceDest', ylabel='Density'>

sns.kdeplot(cleaned_data['newbalanceDest'])
<AxesSubplot:xlabel='newbalanceDest', ylabel='Density'>

Splitting the data into training and testing


from sklearn.model_selection import train_test_split
X = cleaned_data.drop('isFraud',axis=1)
y= cleaned_data['isFraud']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.9, random_state=42, stratify=y)

X_test.head()

         step  type     amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
1609976   156     3    9545.78       26355.00        16809.22            0.00
2220960   186     3    4790.25           0.00            0.00            0.00
4596673   328     3      51.52      108710.26       108658.74            0.00
19284       8     1   32966.31       59607.00        26640.69      1450296.94
5990985   419     3  106993.87      157767.00        50773.13            0.00

         newbalanceDest  isFlaggedFraud
1609976            0.00               0
2220960            0.00               0
4596673            0.00               0
19284        1236584.82               0
5990985            0.00               0

Describe your fraud detection model in elaboration.

Model Training
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             ConfusionMatrixDisplay, precision_score)

lg = LogisticRegression()
lg.fit(X_train,Y_train)
print("accuracy", accuracy_score(Y_test, lg.predict(X_test)))

accuracy 0.9994499059022947

xgb = XGBClassifier()
xgb.fit(X_train,Y_train)
pred1 = xgb.predict(X_test)
print("accuracy", accuracy_score(Y_test, pred1))

accuracy 0.9997152156533259

How did you select variables to be included in the model?
• The variables were selected on the basis of EDA and the feature correlations
shown in the heatmap, using a threshold of 0.5.

Demonstrate the performance of the model by using best set of tools.
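# confusion_matrix expects (y_true, y_pred): rows are actual classes,
# columns are predicted classes
conf = confusion_matrix(Y_test, pred1)
disp = ConfusionMatrixDisplay(confusion_matrix=conf,
                              display_labels=[False, True])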

disp.plot()

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at
0x244498d7700>

As shown above, a model trained on only 10% of the dataset performs well on the
remaining 90% held out for testing, so the model is working well.

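
Accuracy alone can look excellent on a dataset where fraud is rare, so precision and recall on the fraud class are worth reporting alongside it. The small illustration below uses made-up labels, not results from this notebook: a classifier that never predicts fraud still reaches 99% accuracy while catching nothing.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# 10 fraud cases among 1000 transactions (invented numbers).
y_true = [1] * 10 + [0] * 990

always_legit = [0] * 1000
print(binary_metrics(y_true, always_legit))   # (0.99, 0.0, 0.0)

# A model that catches 8 of the 10 frauds at the cost of 4 false alarms.
better = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986
print(binary_metrics(y_true, better))
```

With scikit-learn, `precision_score(Y_test, pred1)` and `recall_score(Y_test, pred1)` return the same quantities for the fitted model.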
What are the key factors that predict fraudulent customer?
Predicting fraudulent customers is crucial for businesses to protect themselves from financial
losses and maintain trust with legitimate customers. Several key factors can help in identifying
potentially fraudulent customers:
1. Abnormal transaction patterns: Look for unusual or atypical behavior, such as a
sudden increase in transaction volume, larger-than-usual purchases, or multiple
transactions from different locations within a short timeframe.

2. Unusual login activity: Frequent failed login attempts, multiple login locations, or
suspicious IP addresses could indicate potential unauthorized access.

3. Geographical incongruities: Analyze the geographic location of the customer's
transactions compared to their usual location. Rapid changes in locations can be a
red flag.

4. Payment discrepancies: Monitor for inconsistencies between the billing address,
shipping address, and the customer's location. Mismatched or incomplete
information may raise suspicions.

5. Velocity checks: Identify customers with an unusually high number of transactions
in a short period. This could suggest automated or bot-driven activities.

6. Device fingerprinting: Track and analyze the characteristics of the customer's
device used for transactions. Sudden changes in device information might indicate
suspicious behavior.

7. Account age and history: New accounts with a large number of transactions or
customers with little history may pose a higher risk.

8. Unusual purchase timing: Transactions made during non-business hours or
holidays may warrant additional scrutiny.

9. Unusual product combinations: Customers purchasing an unusual mix of
products or a large quantity of high-value items may raise suspicion.

10. Customer behavior changes: Look for changes in a customer's behavior, such as a
sudden shift in spending habits or a switch to higher-risk products.

11. Social network analysis: Investigate relationships between customers and identify
connections between potentially fraudulent accounts.

12. Watchlists and databases: Check against internal and external fraud databases or
watchlists for known fraudulent customers.

13. Payment methods: Some payment methods, such as virtual credit cards or prepaid
cards, are associated with higher fraud risk.

14. Machine learning models: Implement advanced machine learning algorithms that
can analyze large amounts of data and identify patterns indicative of fraud.
It's important to remember that no single factor can reliably predict fraudulent customers. A
combination of these factors, along with continuous monitoring and analysis, will provide a
more accurate assessment of potential fraud. Moreover, it's essential to maintain a balance
between fraud detection and customer experience to avoid false positives that could harm
genuine customers.
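
As an illustration of the velocity check in item 5, the sketch below flags customers who make more than a given number of transactions inside a sliding time window. The function name, the window length and the threshold are all hypothetical choices, not values taken from the dataset.

```python
from collections import defaultdict, deque

def velocity_flags(transactions, window=3, max_count=3):
    """Return the customers with more than `max_count` transactions within
    any span of `window` time steps.

    `transactions` is an iterable of (step, customer) pairs sorted by step.
    """
    recent = defaultdict(deque)
    flagged = set()
    for step, customer in transactions:
        q = recent[customer]
        q.append(step)
        # Drop transactions that have fallen out of the window.
        while q and step - q[0] >= window:
            q.popleft()
        if len(q) > max_count:
            flagged.add(customer)
    return flagged

txs = [(1, "C1"), (1, "C2"), (1, "C1"), (2, "C1"), (2, "C1"), (9, "C2"), (10, "C2")]
print(velocity_flags(txs))   # {'C1'}
```

In the notebook's data the `step` column could play the role of the time stamp, although the customer identifiers were dropped early in the cleaning stage.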

What kind of prevention should be adopted while the company updates its infrastructure?
When a company updates its infrastructure for handling fraud transactions, it should implement
a multi-layered approach to prevention. This will help to minimize the risk of fraud and protect
both the company and its customers. Here are some key prevention strategies to adopt:

1. Advanced authentication methods: Implement multi-factor authentication (MFA)
to add an extra layer of security. This can include something the user knows
(password), something they have (OTP or token), and something they are
(fingerprint or facial recognition).

2. Encryption and secure communication: Ensure that all data transmissions are
encrypted using industry-standard protocols like TLS (Transport Layer Security) to
protect sensitive information during transit.

3. Real-time transaction monitoring: Utilize sophisticated fraud detection systems
that can analyze transactions in real-time and identify suspicious patterns or
anomalies.

4. Behavioral analysis: Implement machine learning algorithms to analyze customer
behavior and create profiles for normal usage. Any deviations from these profiles
can trigger alerts for potential fraud.

5. IP geolocation and device profiling: Use geolocation data and device
fingerprinting to detect suspicious login attempts or transactions from unfamiliar
locations or devices.

6. Velocity checks: Set thresholds for transaction volume to identify and block
multiple transactions occurring in quick succession.

7. Blacklists and whitelists: Maintain lists of known fraudulent customers or
high-risk regions to block or flag suspicious activities.

8. Regular security audits and vulnerability assessments: Conduct periodic
security audits to identify and address potential weaknesses in the system.

9. Employee training and awareness: Educate employees about fraud prevention,
cybersecurity best practices, and how to recognize potential threats.

10. Secure APIs and third-party integrations: If the company integrates with third-
party services or APIs, ensure that they have robust security measures in place to
prevent data breaches.

11. Fraud analysis and reporting: Establish a process for reporting and investigating
suspected fraudulent activities promptly.
12. Customer communication: Keep customers informed about security measures,
potential risks, and the steps they can take to protect themselves.

13. Compliance with industry standards and regulations: Ensure that the company
complies with relevant data protection laws and industry security standards.

14. Regular system updates and patches: Keep all software and systems up to date
with the latest security patches to minimize vulnerabilities.

15. Continuous improvement: Regularly review and update fraud prevention
strategies to stay ahead of evolving fraud tactics.

By adopting these prevention measures, the company can create a robust and secure
infrastructure that safeguards against fraud transactions and builds trust with its customers.
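
The behavioral analysis in item 4 can be approximated, in its simplest form, by comparing a new transaction amount with the customer's own history. This is only a sketch: the z-score rule, the threshold of 3 and the function name are assumptions, and a production system would likely use richer features than the amount alone.

```python
from statistics import mean, stdev

def deviates_from_profile(history, amount, threshold=3.0):
    """True if `amount` is more than `threshold` sample standard deviations
    away from the mean of the customer's past amounts."""
    if len(history) < 2:
        return False  # not enough history to build a profile
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

past = [120.0, 95.0, 130.0, 110.0, 105.0]
print(deviates_from_profile(past, 115.0))    # False
print(deviates_from_profile(past, 5000.0))   # True
```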

Assuming these actions have been implemented, how would you determine if they work?
To determine the effectiveness of the implemented actions in fraud detection, you can follow a
comprehensive evaluation process. Here are some steps to assess the success of the fraud
prevention measures:

1. Data analysis and metrics: Collect and analyze data related to fraud detection and
prevention. This includes the number of detected fraud incidents, false positives,
true positives, and the overall accuracy of the system.

2. Benchmarking: Establish benchmarks based on historical data before
implementing the new fraud prevention measures. This will allow you to compare
the current performance with past performance.

3. False positive rate: Measure the rate of false positives, i.e., legitimate transactions
flagged as fraudulent. A high false positive rate can inconvenience customers and
impact the company's revenue.

4. True positive rate: Measure the rate of true positives, i.e., actual fraudulent
transactions correctly identified by the system. A high true positive rate indicates
effective fraud detection.

5. Reduction in fraud losses: Calculate the reduction in financial losses due to fraud
after implementing the prevention measures.

6. Customer feedback and complaints: Gather feedback from customers to gauge
their experience with the new security measures. Address any complaints or
concerns promptly.

7. Comparison with industry standards: Compare the company's fraud detection
performance with industry benchmarks and best practices.

8. Time-to-detect and response time: Measure the time taken to detect potential
fraud and respond to suspicious activities. A faster response can minimize damage.

9. Adaptability to new fraud patterns: Assess how well the system adapts to
evolving fraud patterns and whether it can detect new types of fraud.

10. Cost-effectiveness: Evaluate the cost-effectiveness of the fraud prevention
measures. The benefits of preventing fraud should outweigh the expenses
associated with implementing and maintaining the system.

11. External validation: Consider seeking third-party validation or conducting
penetration tests to assess the system's resilience against potential attacks.

12. Continuous improvement: Establish a feedback loop for ongoing improvement.
Regularly review and fine-tune the fraud detection system based on the analysis of
new data and emerging fraud trends.

13. Comparing against control groups: Use control groups to compare the
performance of the fraud detection system with areas where the new measures
haven't been implemented. This can help isolate the impact of the prevention
measures.
By conducting a thorough evaluation using these metrics, the company can determine the
effectiveness of its fraud detection measures. This information can be used to make informed
decisions on refining existing strategies or implementing new ones to further strengthen the
fraud prevention system. It's important to note that fraudsters continually evolve their tactics, so
the evaluation process should be ongoing and adaptive.
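
Items 2 to 5 amount to comparing a few rates before and after the change. A minimal sketch with invented counts follows; the numbers and dictionary keys are placeholders, not measurements.

```python
def rates(tp, fp, fn, tn):
    """True positive rate and false positive rate from confusion counts."""
    return {"tpr": tp / (tp + fn), "fpr": fp / (fp + tn)}

# Hypothetical counts and fraud losses for two comparable periods.
baseline = {"tp": 60, "fp": 400, "fn": 40, "tn": 99_500, "loss": 250_000.0}
current = {"tp": 85, "fp": 300, "fn": 15, "tn": 99_600, "loss": 100_000.0}

before = rates(baseline["tp"], baseline["fp"], baseline["fn"], baseline["tn"])
after = rates(current["tp"], current["fp"], current["fn"], current["tn"])
loss_reduction = 1 - current["loss"] / baseline["loss"]

print("before:", before)
print("after: ", after)
print(f"loss reduction: {loss_reduction:.0%}")
```

The measures can be judged effective only if the true positive rate rises, or losses fall, without the false positive rate getting worse over a comparable period.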

Predictive System
# Feature values for a single transaction, in the same column order as X_test
# (this is row 19284 from the X_test preview above)
sample = [8, 1, 32966.31, 59607.00, 26640.69, 1450296.94, 1236584.82, 0]
sample = pd.DataFrame([sample], columns=X_test.columns, dtype='float')

   step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0   8.0   1.0  32966.31        59607.0        26640.69      1450296.94

   newbalanceDest  isFlaggedFraud
0      1236584.82             0.0

prediction = xgb.predict(sample)[0]

if prediction == 0:
    print("Not A Fraud Transaction")
else:
    print("Is a Fraud Transaction")

Not A Fraud Transaction
