Fraud Prediction Random Forest
This code provides a comprehensive end-to-end implementation of fraud detection with the
following key steps:
1. Data Preprocessing
   o Outlier Handling: caps outliers at the 1st and 99th percentiles to mitigate their impact.
2. Feature Engineering
   o Restores isFraud: adds back the isFraud column for correlation analysis and modeling.
   o Derived Features: creates new variables such as orig_balance_diff and balance ratios.
   o Multicollinearity Detection: uses the Variance Inflation Factor (VIF) to detect and remove
     highly collinear features.
   o Correlation Analysis: retains features with meaningful correlation to the target.
   o Log Transform: applies log transforms (e.g., log_amount) to reduce skewness.
3. Model Training
   o Trains the model on resampled data with verbose output for tracking.
4. Model Evaluation
5. Feature Importance
Key Outputs
1. Classification Report:
2. Confusion Matrix:
   o Visualizes true positives, true negatives, false positives, and false negatives.
3. Feature Importance:
Strengths of the Approach
1. Scalable:
2. Informative:
3. Balanced Approach:
Summary of the Workflow
1. Missing Values:
   o No missing values were found in the dataset (isnull().sum() showed all columns
     have complete data).
2. Outliers:
   o Outliers were capped to the 1st and 99th percentiles to reduce their impact.
3. Feature Engineering:
   o SMOTE was applied to balance the dataset, resulting in equal numbers of fraud
     (1) and non-fraud (0) cases.
4. Confusion Matrix:
5. Feature Importance:
1. Data Cleaning
a. Missing Values
• Process:
   o Checked every column for missing values using isnull().sum().
• Conclusion:
   o No missing values were found; all columns contain complete data (a minimal check is
     sketched below).
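A minimal sketch of this check, assuming the transactions are loaded into a pandas DataFrame; the file name Fraud.csv is an assumption and should be replaced with the actual dataset path:

    import pandas as pd

    # Assumed file name; adjust to the actual dataset location.
    df = pd.read_csv("Fraud.csv")

    # Count missing values per column; the report notes every count was zero.
    print(df.isnull().sum())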
b. Outliers
• Detection:
   o Visualized outliers with boxplots for each numerical column to better understand
     their distribution.
• Handling:
   o Capped the values at the 1st percentile (lower bound) and 99th percentile (upper
     bound) to reduce the impact of extreme outliers (a capping sketch follows the counts below).
   o This ensured the dataset remained robust without significantly altering the underlying
     data distribution.
• Outcome: the number of outliers identified in each column:
Column            Outlier count
step              66,620
amount            44,945
oldbalanceOrg     155,140
newbalanceOrig    155,931
oldbalanceDest    79,846
newbalanceDest    75,166
isFraud           8,213
isFlaggedFraud    16
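A minimal sketch of the capping step, assuming df is the DataFrame from the check above and that capping is applied to the continuous transaction columns while the binary flags are left untouched (that column choice is an assumption):

    # Numeric columns to cap; isFraud and isFlaggedFraud are binary and left as-is.
    num_cols = ["step", "amount", "oldbalanceOrg", "newbalanceOrig",
                "oldbalanceDest", "newbalanceDest"]

    for col in num_cols:
        lower, upper = df[col].quantile(0.01), df[col].quantile(0.99)
        # Cap values below the 1st percentile and above the 99th percentile.
        df[col] = df[col].clip(lower=lower, upper=upper)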
c. Multi-Collinearity
• Detection:
   o Computed VIF values for all numeric features, identifying features with VIF > 5 as
     having high multicollinearity (a VIF sketch follows below).
• Handling:
   o Dropped the features flagged with VIF > 5, as they introduced redundancy.
• Outcome:
   o The remaining features were independent and relevant for modeling.
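A minimal VIF sketch using statsmodels, assuming df and num_cols from the capping step above (isFraud is excluded here because it is the target, which is an assumption about how the original code was organized):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Numeric feature matrix with an intercept column for a well-defined VIF.
    X = add_constant(df[num_cols])

    vif = pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })

    # Features with VIF > 5 are flagged as highly collinear (ignore the intercept row).
    print(vif[vif["feature"] != "const"].sort_values("VIF", ascending=False))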
Summary
These steps ensured a clean and well-prepared dataset for feature engineering and modeling.
2. Fraud Detection Model
a. Model Overview
   o A Random Forest classifier trained on the cleaned, SMOTE-resampled data.
b. Features Used
• Cleaned and Engineered Features:
• Feature Importance:
c. Model Training
1. Algorithm: Random Forest classifier.
2. Training Process:
   o Key hyperparameter: max_depth = None (trees are grown until pure or reach a minimum
     split size).
   o The model is trained on the SMOTE-resampled data with verbose output for tracking
     (a minimal training sketch follows).
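A minimal training sketch consistent with the description above, assuming X holds the final engineered features and y the isFraud labels; the split ratio, n_estimators, and random_state values are illustrative assumptions, not the report's exact settings:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Hold out a test set, preserving the original class ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    # Balance the training data only, so the test set reflects real-world imbalance.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # Random Forest with max_depth=None, as noted above; verbose output tracks progress.
    rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                                n_jobs=-1, verbose=1, random_state=42)
    rf.fit(X_res, y_res)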
d. Model Evaluation
1. Metrics:
   o Precision (Class 1): 72% (the percentage of correctly identified fraud cases out of all
     predicted fraud cases).
   o Recall (Class 1): 95% (the percentage of actual fraud cases correctly identified).
2. Confusion Matrix:
   o The model achieves 95% recall, ensuring most fraud cases are identified.
   o Low false negatives are critical in fraud detection to minimize missed fraudulent
     transactions.
   o SMOTE ensures the model performs well on the minority fraud class despite the
     initial imbalance.
3. Explainability:
e. Limitations
1. Precision:
   o Precision (72%) for fraud detection could be improved. It indicates that some non-fraud
     cases are misclassified as fraud, which could lead to customer dissatisfaction.
2. Processing Time:
Summary
The Random Forest model is effective for fraud detection, achieving a high recall and ROC-AUC
score. Its feature importance helps identify the primary factors driving fraud, making it a
practical choice for deployment in real-world systems. However, further optimizations, such as
hyperparameter tuning or using alternative algorithms like XGBoost, can enhance precision and
efficiency.
3. Variable Selection for the Model
The selection of variables for the fraud detection model was a systematic process that involved
exploratory data analysis, feature engineering, and statistical techniques to ensure that only
relevant and non-redundant features were included. Below are the steps taken:
a. Initial Feature Selection
1. Numerical Features:
   o Selected all numerical features from the dataset for initial analysis (step, amount,
     oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, isFraud, and
     isFlaggedFraud).
2. Removed Constant Columns:
   o Example: isFlaggedFraud was dropped because it was effectively constant and did not
     contribute to the model.
b. Feature Engineering
1. Derived Features:
   o Created features such as orig_balance_diff, capturing the net change in the origin
     account's balance before and after the transaction.
2. Skewness Reduction:
   o Applied log transforms to highly skewed variables (sketched below), producing:
      ▪ log_amount
      ▪ log_orig_balance_ratio
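A sketch of how these derived features might be computed, assuming df from the earlier steps; the exact formula behind log_orig_balance_ratio is not spelled out in the report, so the ratio used here is an assumption:

    import numpy as np

    # Net change in the origin account's balance (as described in Section 5).
    df["orig_balance_diff"] = df["newbalanceOrig"] - df["oldbalanceOrg"]

    # Log transform to reduce skewness; log1p handles zero values safely.
    df["log_amount"] = np.log1p(df["amount"])

    # Assumed definition: transaction amount relative to the origin account balance.
    df["log_orig_balance_ratio"] = np.log1p(df["amount"] / (df["oldbalanceOrg"] + 1))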
c. Multi-Collinearity Analysis
1. Handling:
   o Features with VIF > 5 were dropped, as they introduced redundancy and could
     negatively impact the model.
2. Outcome:
   o After removing multicollinear features, the remaining features were independent
     and relevant.
d. Correlation Analysis
1. Process:
   o Derived features were analyzed for their correlation with the target variable (isFraud).
2. Outcome:
   o Only features with meaningful correlation (absolute correlation > 0.05) were
     retained (a minimal filter is sketched below).
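A minimal sketch of the |correlation| > 0.05 filter, assuming df holds the engineered features plus the restored isFraud column:

    # Correlation of every numeric feature with the target.
    corr_with_target = df.select_dtypes("number").corr()["isFraud"].drop("isFraud")

    # Keep only features whose absolute correlation exceeds the 0.05 threshold.
    selected = corr_with_target[corr_with_target.abs() > 0.05].index.tolist()
    print(selected)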
e. Final Feature Set
After the above steps, the final features included in the model were the cleaned and
engineered features that survived these filters (e.g., log_amount, log_orig_balance_ratio,
orig_balance_diff, and step).
f. Rationale for Selection
1. Domain Knowledge:
2. Statistical Relevance:
   o Only features with low multicollinearity and meaningful correlation with the
     target were retained.
3. Model Efficiency:
   o Removing redundant and irrelevant features ensured faster training and better
     generalization.
Summary
The variable selection process was rigorous, combining domain expertise, statistical techniques
(VIF and correlation), and feature engineering. This ensured the model used relevant,
independent, and predictive features for fraud detection.
4. Demonstration of Model Performance
The performance of the Random Forest fraud detection model was evaluated using standard
machine learning metrics and tools. Below is a detailed analysis of the evaluation process:
a. Evaluation Metrics
1. Classification Report:
   o Key metrics (from the printed classification report, Class 1 / fraud): precision 72%,
     recall 95%.
2. Confusion Matrix:
   o Visualized using a heatmap for a better understanding of true positives, true
     negatives, false positives, and false negatives.
   o Output (from the printed confusion matrix): True Negatives (TN): 1,905,425.
   o Insights:
      ▪ The model has very high recall (low FN) for fraud detection.
3. ROC-AUC Score:
o Score: 0.996
o A high ROC-AUC score reflects the model's ability to effectively rank fraud
probabilities.
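These metrics can be reproduced with standard scikit-learn utilities; a minimal sketch, assuming rf, X_test, and y_test from the earlier training sketch (the use of seaborn for the heatmap is an assumption):

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

    y_pred = rf.predict(X_test)

    # Precision, recall, and F1 per class.
    print(classification_report(y_test, y_pred))

    # Confusion matrix rendered as a heatmap.
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

    # ROC-AUC from predicted fraud probabilities.
    print(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))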
b. Visualization Tools
   o Confusion matrix heatmap (described above).
c. Strengths
   o Recall of 95% ensures that most fraud cases are correctly identified, minimizing
     losses due to undetected fraud.
   o Class imbalance was addressed using SMOTE, ensuring fair performance on both
     classes.
d. Areas for Improvement
1. Precision:
   o Precision of 72% for fraud cases indicates some legitimate transactions are
     misclassified as fraud (false positives). This could be improved with:
      ▪ Hyperparameter tuning (a tuning sketch follows this list).
2. Scalability:
o Although the model performs well, processing large datasets can be time-
consuming. Optimizations (e.g., using XGBoost or LightGBM) may improve
efficiency.
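One way the hyperparameter tuning mentioned above could be approached; the parameter grid, scoring choice, and search settings are illustrative assumptions, and X_res/y_res come from the earlier training sketch:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {
        "n_estimators": [100, 200, 400],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5, 10],
    }

    # Optimize for precision on the fraud class to reduce false positives.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42, n_jobs=-1),
        param_distributions=param_dist,
        n_iter=10, scoring="precision", cv=3, random_state=42)
    search.fit(X_res, y_res)
    print(search.best_params_)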
e. Tools Used
1. Scikit-learn:
2. SMOTE (imbalanced-learn):
Conclusion
The model demonstrated strong performance metrics, particularly in identifying fraud cases
(high recall and ROC-AUC). While the model is effective, further fine-tuning and scalability
improvements can enhance its practical application in real-time fraud detection systems.
5. Key Factors That Predict Fraudulent Customers
From the feature importance analysis of the Random Forest model (a short extraction sketch
follows this list), the key factors predicting fraud are:
1. log_orig_balance_ratio:
2. orig_balance_diff:
   o This feature captures the net change in the origin account's balance before and
     after the transaction.
3. log_amount:
   o The log-transformed transaction amount; fraudulent transactions often involve
     unusually high or unusually low amounts.
4. step:
   o While less significant than the other features, the timing of transactions (e.g.,
     late at night or during off-peak hours) may provide clues to fraudulent activity.
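A short sketch of how this ranking can be read from the fitted model, assuming rf and the feature matrix X from the earlier training sketch:

    import pandas as pd

    # Impurity-based importances, sorted from most to least influential.
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))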
6. Do These Factors Make Sense?
1. log_orig_balance_ratio:
o Rationale:
o Example:
2. orig_balance_diff:
o Rationale:
o Example:
3. log_amount:
o Rationale:
▪ Fraudulent transactions often involve extremes: either unusually high
amounts (to maximize gain in a single transaction) or unusually low
amounts (to avoid detection).
o Example:
4. step:
o Rationale:
o Example:
• Fraudsters' Behavior:
• Statistical Evidence:
o Feature importance analysis confirms that these factors are the strongest
predictors of fraud in the dataset.
• Possible Enhancements:
   o While the step feature captures transaction timing, additional temporal features
     such as transaction frequency or clustering similar transactions within time windows
     could provide deeper insights.
Conclusion
The identified factors make intuitive sense and are backed by both statistical and domain-
specific reasoning. They effectively differentiate fraudulent from legitimate transactions, but
further refinement (e.g., additional features or contextual data) can enhance precision and
reduce false positives.
7. Prevention Strategies for Updating Company Infrastructure
To enhance fraud prevention and detection, companies should adopt the following measures
while updating their infrastructure:
a. Security Enhancements
1. Encryption:
2. Multi-Factor Authentication (MFA):
   o Implement MFA for all transactions, requiring customers to verify via multiple
     channels (e.g., OTPs, biometrics).
b. Advanced Fraud Detection
1. Real-Time Monitoring:
2. Behavioral Analysis:
   o Track user behavior patterns over time and flag deviations (e.g., unusual
     transaction amounts, locations, or frequencies).
3. Anomaly Detection:
   o Flag transactions that deviate sharply from normal patterns (one possible approach
     is sketched below).
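As one illustration of the anomaly-detection idea (separate from the supervised fraud model described earlier), an unsupervised detector such as scikit-learn's IsolationForest could flag transactions that deviate from normal behavior; the feature selection and contamination rate here are assumptions:

    from sklearn.ensemble import IsolationForest

    # Fit on transaction features; contamination approximates the expected fraud rate.
    features = df[["amount", "oldbalanceOrg", "newbalanceOrig"]]
    iso = IsolationForest(contamination=0.001, random_state=42)
    iso.fit(features)

    # -1 marks transactions flagged as anomalous for manual review.
    df["anomaly_flag"] = iso.predict(features)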
c. Infrastructure Improvements
1. Scalability:
2. Automation:
o Automate routine checks for fraudulent patterns, such as frequent failed login
attempts or multiple high-value transactions.
3. Backup Systems:
o Implement robust data backup solutions to ensure no data is lost in case of
cyberattacks or system failures.
d. Customer Awareness
1. Education:
o Inform customers about common fraud tactics and encourage safe practices, like
not sharing credentials.
2. Fraud Reporting:
e. Collaboration
1. Industry Partnerships:
2. Threat Intelligence:
o Stay updated on new fraud techniques through threat intelligence tools and
partnerships.
8. Determining if These Actions Work
To evaluate the effectiveness of these preventive measures, the company should adopt the
following assessment strategies:
a. Key Performance Indicators (KPIs)
3. Customer Complaints:
b. Regular Audits
1. System Audits:
   o Periodically assess the fraud detection system for accuracy, scalability, and
     adaptability to new patterns.
2. Compliance Checks:
   o Ensure the infrastructure complies with relevant data security regulations (e.g.,
     GDPR, PCI DSS).
c. Model Performance Monitoring
1. Model Metrics:
2. Drift Detection:
   o Use tools to detect data drift or model drift, indicating changes in transaction
     patterns that may reduce model efficacy (a minimal drift check is sketched below).
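A minimal sketch of one common drift check, the Population Stability Index (PSI), comparing a reference feature distribution to recent data; the bin count, the ~0.2 alert threshold, and the DataFrame names train_df and recent_df are assumptions for illustration:

    import numpy as np

    def psi(reference, current, bins=10):
        """Population Stability Index between two samples of one feature."""
        edges = np.histogram_bin_edges(reference, bins=bins)
        ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=edges)[0] / len(current)
        # Floor the proportions to avoid division by zero and log(0).
        ref_pct = np.clip(ref_pct, 1e-6, None)
        cur_pct = np.clip(cur_pct, 1e-6, None)
        return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

    # A PSI above roughly 0.2 is a common rule of thumb for significant drift.
    drift_score = psi(train_df["amount"], recent_df["amount"])
    print(drift_score)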
d. Customer Feedback
1. Surveys:
o Conduct customer surveys to measure satisfaction with fraud detection and
resolution processes.
2. Support Metrics:
o Analyze metrics like average resolution time for fraud disputes and the number
of resolved cases.
e. Simulated Testing
1. Penetration Testing:
o Test the fraud detection system using synthetic data to ensure robustness.
f. Continuous Improvement
1. Update Models:
o Retrain fraud detection models periodically with new data to ensure they remain
effective.
2. Iterative Enhancements:
Conclusion
By adopting these preventive measures and continuously monitoring their effectiveness through
defined KPIs, audits, and customer feedback, the company can create a robust infrastructure
that minimizes fraud risks and ensures a seamless customer experience.