Final Report PDF
A NOVEL HYBRID DATA BALANCING AND
FRAUD DETECTION APPROACH FOR
AUTOMOBILE INSURANCE CLAIMS
BACHELOR OF TECHNOLOGY IN
COMPUTER SCIENCE & ENGINEERING
Submitted by:
Atul Kumar Agrawal
Registration no. : 1602040031
Associate Professor
VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY, BURLA, ODISHA
Declaration
I declare that this written submission represents my ideas in my own words and wherever
others’ ideas or words have been included, I have adequately cited and referenced the
original sources. I also declare that I have adhered to all principles of academic honesty and
integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in
my submission. I understand that any violation of the above will be cause for disciplinary
action by the university and can also evoke penal action from the sources which have thus
not been properly cited or from whom proper permission has not been taken when needed.
Certificate
This is to certify that the dissertation entitled "A novel hybrid data balancing and fraud
detection approach for automobile insurance claims", submitted by Atul Kumar Agrawal
for the degree of Bachelor of Technology in Computer Science and Engineering,
is a record of original research work carried out by him under my supervision and
guidance.
I would like to express my sincere gratitude to my supervisor, Dr. Suvasini Panigrahi, for
her invaluable help during the course work towards this dissertation. She was a source of
constant ideas and encouragement and provided a friendly atmosphere to work in. I am
very thankful to her for everything.
I am also thankful to Dr. Manas Ranjan Kabat, Head of the Department, and to all the
faculty members of the Department of Computer Science and Engineering for having supported
me in carrying out this dissertation and for their constant advice. I would like to thank all my
friends for their encouragement and understanding. I would like to express my heartfelt
gratitude to them.
Regd.no.-1602040031
Approval Sheet
This dissertation entitled "A novel hybrid data balancing and fraud detection approach for
automobile insurance claims" by Atul Kumar Agrawal is approved for the degree of
Bachelor of Technology in "Computer Science and Engineering", Department of
Computer Science and Engineering.
Date: Supervisor
Place: Burla
ABSTRACT
Automobile insurance fraud has been a major issue for insurance companies and has caused
losses of several crores due to fraudulent and false claims. It is a serious crime in most parts
of the world, and scammers may be sentenced to between 1 and 20 years in jail.
Fraud schemes may involve fake patients, fake doctors and fake lawyers working together. Various
machine learning and deep learning techniques have been developed to detect these kinds of fraud,
and research is ongoing to find the most suitable method for detecting new fraud patterns
over time. As the number of frauds is low in comparison to legitimate transactions, we use
one-class classification for the minority class to get better results and to minimize the
classification of non-fraudulent data as fraud, i.e. to minimize the false positive alarm rate.
Various machine learning techniques such as Support Vector Machine, K-Nearest Neighbors,
Decision Tree, Logistic Regression and Naive Bayes have been deployed to detect fraud. Most of
these classifiers achieve good accuracy. These methods have been trained using undersampling
and oversampling of the data to reduce the class imbalance between fraudulent and non-fraudulent records. It is
Keywords : Automobile insurance fraud, false positive alarm rates, undersampling, Support
Vector Machine, K-Nearest Neighbors, Decision Tree, Logistic Regression, Naive Bayes
Table of Contents
CHAPTER 1 INTRODUCTION
1.1 Definitions
1.2 Fraudsters
Problem statement
4.4 Fraud detection measures
5.1 Proposal diagram
REFERENCES
List of Figures
List of Tables
LIST OF ABBREVIATIONS
Abbreviation Description
KNN K-Nearest Neighbors
SVM Support Vector Machine
TP True Positive
TN True Negative
FP False Positive
FN False Negative
CM Confusion Matrix
DT Decision Tree
LR Logistic Regression
NB Naive Bayes
Chapter 1
INTRODUCTION
1.1 Definitions :
Fraud : It is a serious crime that includes use of one’s occupation for personal enrichment
Fraud Detection : monitoring the behavior of population of users using data sets to estimate,
Automobile Insurance fraud : Automobile insurance fraud has been a major issue to the
insurance company and has caused several losses due to the false claims.
1.2 Fraudsters :
3. False claims that accident/damage happened after policy or coverage was purchased.
4. Claimants hide the information that excluded driver was driving at the time of accident.
2. Telecommunication fraud
3. Bankruptcy fraud
5. Application fraud
6. Behavioral fraud
7. Insurance fraud
8. Statement fraud
9. Security fraud
1.4 Various methods of making fraudulent claims are:
1. Staged Collisions : In this type of frauds, fraudsters use a motor vehicle to stage fake
2. Exaggerated claims : Fake claims involving injuries and damages that may have already
3. False stolen reports : Claimant might have sold the vehicle or gifted it to a relative and
4. Hidden information : Claimants may hide information regarding the driver at the time of
5. Multiple claims: It includes people who claim multiple times for the same loss.
1. Soft auto-insurance fraud: Examples of soft auto-insurance fraud include filing more
than one claim for a single injury, filing claims for injuries not related to an automobile
accident, misreporting wage losses due to injuries, and reporting higher costs for car repairs
filing claims when the claimant was not actually involved in the accident, submitting claims
for medical treatments that were not received, or inventing injuries and false stolen reports.
1.6 Problems associated :
a. Class imbalance problem [minority class = fraud, majority class = legitimate]
b. Outliers problem : Outliers are the records which exhibit dissimilarity with the defined set
of clusters and they cannot be part of representatives while undersampling the data and
1.7 Organization of the thesis:-
Chapter 3 – It includes some basic concepts of machine learning that have been used
in the work.
Chapter 4 – It presents an existing methodology that has been implemented and analyzed.
Chapter 5 – It presents the materials and methodology and its algorithms as well as
flowcharts.
Chapter 6 – It presents the results and discussion for various sample data sets.
CHAPTER 2
MOTIVATION
The Insurance Fraud Bureau in the UK estimated there were more than 20,000 fake collisions
and false insurance claims across the UK from 1999 to 2006. One tactic fraudsters use is to
drive to a busy junction or roundabout and brake sharply causing a motorist to drive into the
back of them. They claim the other motorist was at fault because they were driving too fast or
too close behind them, and make a false and inflated claim to the motorist's insurer for injury
and damage, which can pay the fraudsters up to 30 Lakhs. In the Insurance Fraud Bureau's
first year of operation, the use of data mining initiatives exposed insurance fraud networks
and led to 74 arrests and a five-to-one return on investment. The Insurance Research Council
suspected fraud. There is a wide variety of schemes used to defraud automobile insurance
providers.
According to data released by the Beijing bureau of China, 10% of all insurance claims
were fraudulent. The Coalition Against Insurance Fraud estimates that in 2006 a total of
about $80 billion was lost in the United States due to insurance fraud. According to estimates
by the Insurance Information Institute, insurance fraud accounts for about 10 percent of the
property/casualty insurance industry's incurred losses and loss adjustment expenses. India
forensic Center of Studies estimates that insurance fraud in India costs about $6.25 billion
annually.
Problem Statement
CHAPTER 3
Literature Review
performance measures (Accuracy, Sensitivity and AUPRC).
2. Predicting Fraudulent 2018 Random Random forest
Bayes algorithms
3. One-class support vector 2015 OCSVM OCSVM based
fraud detection
4. The Identification 2015 Outlier Data mining had
Mining
5. Random Rough Subspace 2011 random Random subspace
neural
network
ensemble
method
Various deep learning, Machine learning and data mining techniques have been implemented
2. Machine learning
i. Supervised learning
A. Classification
a. Logistic regression
b. Binary regression
A. Clustering
a. K-means clustering
b. Hierarchical clustering
3. Multi layered perceptron based (MLP)
4. Data mining
5. Random forest
10. kRNN and K-Means hybrid for outlier elimination and undersampling [3]
16. hybrid of back-propagation neural networks (ANN) and self-organizing maps (SOM) [9]
22. Stacking – bagging method - MLP together with Naïve Bayesian (NB) and C4.5
algorithm [16]
24. Non negative matrix factorization approach for health care fraud detection. [17]
27. SVM-Recursive Feature Elimination for feature selection and employed active learning
5. Imbalanced classification approaches are required because the ratio of fraudulent to
legitimate claims is very low.
Chapter 4
BACKGROUND STUDY
It uses only the minority class for training. After training, it detects whether a transaction
belongs to the minority class or not.
-1, otherwise
$C_i \geq 0 \quad \forall\, i = 1, 2, \ldots, l$
ω = weight factor
Oversampling methods :
artificial minority class instances. It replicates instances that are difficult to learn.
minority oversampling. It is used for generating artificial minority samples. The module
works by generating new instances from existing minority cases that you supply as input.
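The interpolation idea behind synthetic minority oversampling can be sketched as follows. This is a minimal illustration in plain Python and not the implementation used in this work: the function name `smote_like_oversample`, the neighbourhood size `k` and the toy minority samples are all assumptions made for the example.

```python
import math
import random


def _distance(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: _distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority (fraud) samples with two numeric features each (assumed data).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority, n_new=6)
```

Each synthetic sample lies on a line segment between two existing minority samples, so the new points stay inside the region already occupied by the minority class instead of being exact copies.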
Clustering : Making clusters and using associative rule mining to identify correlated data
Validation techniques : K – fold validation technique. It chooses the test and train data
4.3.1. Decision Tree
> It divides the values into subsequent sub trees for decision making.
(5)
(6)
Disadvantages :
4.3.3. Support Vector Machine (SVM) :
hyperplane that optimally separates the data into two categories. SVM models are closely
related to neural networks. Using a kernel function, SVMs are an alternative training method
for polynomial, Radial Basis Function (RBF) networks and MLP classifiers, in which the
weights of the network are found by solving a quadratic programming problem with linear
It is a classification tool used to divide the data points into two classes using a hyperplane.
It performs row dimensional reduction (by way of picking up Support Vectors) while
$i = \{1, 2, \ldots, l\}$
n = no. of explanatory variables
$y \in R^l$
minimize $\left( \frac{1}{2} \lVert w \rVert^2 + \sum_{i=1}^{l} C_i \right)$ (7)
w= vector of weights
n = no. of features
y = dependent variable
x= independent variable
y= {-1,1}
It is for a hyperplane $w$ that separates the point $x_i$ from the origin with margin $\rho$ and $\xi_i$
-> Euclidean distance, $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$, for two observations $p$ and $q$.
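The Euclidean distance above is the core of a nearest-neighbour classifier. The following is a minimal sketch of how it could be used for a majority vote among the k closest training samples; the function names, the value of `k` and the toy data are assumptions for illustration, not the code of this work.

```python
import math
from collections import Counter


def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2) for two observations p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumed): label 1 = fraud, label 0 = legitimate.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = [0, 0, 0, 1, 1]
near_fraud = knn_predict(train_x, train_y, [5, 6])
near_legit = knn_predict(train_x, train_y, [0.5, 0.5])
```

An odd value of k avoids ties in a two-class problem such as fraud versus legitimate.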
4.3.5. Logistic Regression
Vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_n)$ = coefficients
$x = (x_0, x_1, \ldots, x_n)$ = explanatory variables
ϵ = model’s error
g = logistic link function taking values in [0,1] (for getting variable values between 0 and 1).
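As an illustration of these symbols, the sketch below evaluates the logistic link function on a linear combination of the explanatory variables. It assumes that $x_0 = 1$ acts as the intercept term and uses made-up coefficient values; neither assumption is stated in the text above.

```python
import math


def logistic(z):
    """Logistic link g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(alpha, x):
    """g(alpha_0*x_0 + ... + alpha_n*x_n); x_0 is assumed to be 1 (intercept)."""
    z = sum(a * xi for a, xi in zip(alpha, x))
    return logistic(z)


alpha = [-1.0, 0.8, 0.3]   # hypothetical coefficients alpha_0..alpha_2
x = [1.0, 2.0, 1.0]        # x_0 = 1, followed by two explanatory variables
p_fraud = predict_probability(alpha, x)
```

The output always lies between 0 and 1, so it can be read as a probability and thresholded (for example at 0.5) to obtain a class label.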
1. False positive
2. False negative
3. True positive
4. True negative
Chapter 5
PROPOSED SYSTEM
Most of the above-mentioned work had limitations due to the data imbalance
problem. The proposed model in this work is a one-class classification approach to deal with the
As mentioned in the proposed methodology, we split the data into two subsets in the ratio
80% to 20%, ensuring that each subset has the same proportion of positive and negative
samples.
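A split that preserves the class proportions in both subsets is usually called a stratified split. The sketch below shows one way to obtain it in plain Python; the helper name `stratified_split`, the seed and the toy label vector are assumptions, and in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument could be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) such that every class contributes
    the same fraction of its samples to the test subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumed): 10 fraudulent (1) and 90 legitimate (0) claims.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

With these toy labels the 20% subset receives 2 of the 10 fraudulent claims and 18 of the 90 legitimate ones, so both subsets keep the original 1:9 ratio.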
Flow Chart:
5.2 Steps involved in the system implemented :
c. Removing redundancy
4. Testing
5. Validation
date and report date, claim occurrence time, claim open date, claim report date, claim loss
data, claim event location name, claim amount, policy premium, part market cost, claim on
vehicle, count of customer communication, are claim document submitted, policy effective
5.4 Dataset Description :
Year : 1994 – 96 in US
No. of attributes = 24
The experiments were carried out under the following hardware and software specifications.
The study of the data set and the project were carried out on a laptop with an Intel Core
i3-5005U processor at 2 GHz, 8 GB of RAM and 2 GB of graphics memory. The operating system used is
Ubuntu 18.04.3.
The project has been implemented in the Spyder IDE, and the language used is Python 3.6.
confusion matrix
1. Accuracy
Accuracy = (No. of True Positives + No. of True Negatives) / (No. of True Positives + No. of
True Negatives + No. of False Positives + No. of False Negatives)
2. Specificity
Specificity = No. of True Negatives / (No. of True Negatives + No. of False Positives)
3. Sensitivity
Sensitivity, also known as the True Positive Rate or Recall, is calculated as
Sensitivity = No. of True Positives / (No. of True Positives + No. of False Negatives)
4. Precision
Precision = No. of True Positives / (No. of True Positives + No. of False Positives)
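The four measures can be computed directly from the confusion matrix counts. The following is a minimal sketch; the function names and the toy label vectors are assumptions for illustration, and the choice to return 0.0 when a denominator is zero is a convention of this example only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN, treating `positive` as the fraud class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    # Example convention: report 0.0 instead of failing on an empty denominator.
    return numerator / denominator if denominator else 0.0


def detection_measures(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and precision as defined above."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "specificity": _ratio(tn, tn + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
    }


# Toy labels (assumed): 1 = fraud, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
measures = detection_measures(*counts)
```

On imbalanced data, accuracy alone can look high even when few frauds are caught, which is why sensitivity and precision are reported alongside it.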
Chapter 6
Thus, a complete survey was carried out of the various techniques used in automobile
insurance fraud detection. It shows that there is still scope for improvement, as the minority class can
be oversampled to further increase the accuracy, which will be helpful in preventing losses of
Observation : It is evident from the table that by using undersampling of the majority class
and oversampling of the minority class for data balancing, most algorithms showed an increase
in accuracy scores.
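The balancing step referred to in this observation can be illustrated with plain random resampling. This sketch is a simplification and not the hybrid method of this work: the helper name `balance_classes`, the target size and the toy records are assumptions.

```python
import random


def balance_classes(majority, minority, target, seed=0):
    """Randomly undersample the majority class and oversample the minority
    class (by replication) so that both end up with `target` records."""
    if not len(minority) <= target <= len(majority):
        raise ValueError("target must lie between the two class sizes")
    rng = random.Random(seed)
    # Undersampling: keep `target` majority records, drawn without replacement.
    kept_majority = rng.sample(majority, target)
    # Oversampling: keep every minority record and replicate random ones.
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return kept_majority, minority + extra


# Toy record identifiers (assumed): 100 legitimate and 10 fraudulent claims.
legitimate = list(range(100))
fraudulent = list(range(100, 110))
balanced_legit, balanced_fraud = balance_classes(legitimate, fraudulent, target=40)
```

Resampling should be applied to the training subset only, so that the test subset keeps the original class ratio and the reported scores stay realistic.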
Chapter 7
The system is currently implemented using supervised learning methods. In future work, the proposed
system will be implemented with unsupervised algorithms and/or hybrid models, which will be
compared and tested to find the best possible method for higher accuracy and
REFERENCES
3. M. Vasu and V. Ravi, “A hybrid undersampling approach for mining unbalanced datasets:
Application to Banking and insurance”, International Journal of Data Mining Modeling and
Management, Vol. 3(1), pp. 75-105, 2011.
4. M.A.H. Farquad, V. Ravi and S. Bapi Raju, “Analytical CRM in banking and finance using
SVM: a modified active learning-based rule extraction approach”, International Journal of
Electronic Customer Relationship Management, Vol. 6(1), pp. 48-73, 2011.
7. D. C. Li, C. W. Liu, S. C. Hu, “A learning method for the class imbalance problem with
medical datasets”, Computers in Biology and Medicine, Vol. 40(5), pp. 509-518, 2010.
8. L. Peng, H. Zhang, B. Yang, and Y. Chen, “A new approach for imbalanced data
classification based on data gravitation”, Information Sciences, Vol. 288, pp. 347-373, 2014.
9. C. F. Tsai and Y. H. Lu, “Customer churn prediction by hybrid neural networks”, Expert
Systems with Applications, Vol. 36 (10), pp. 12547-12553, 2009.
10. Makki, Sara, Zainab Assaghir, Yehia Taher, Rafiqul Haque, Mohand-Saïd Hacid, and
Hassan Zeineddine. "An Experimental Study With Imbalanced Classification Approaches for
Credit Card Fraud Detection." IEEE Access 7 (2019): 93010-93022.
11. Kowshalya, G., and M. Nandhini. "Predicting Fraudulent Claims in Automobile
Insurance." In 2018 Second International Conference on Inventive Communication and
Computational Technologies (ICICCT), pp. 1338-1343. IEEE, 2018.
12. Yan, Chun, and Yaqi Li. "The Identification Algorithm and Model Construction of
Automobile Insurance Fraud Based on Data Mining." In 2015 Fifth International Conference
on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp.
1922-1928. IEEE, 2015.
13. Xu, Wei, Shengnan Wang, Dailing Zhang, and Bo Yang. "Random rough subspace based
neural network ensemble for insurance fraud detection." In 2011 Fourth International Joint
Conference on Computational Sciences and Optimization, pp. 1276-1280. IEEE, 2011.
14. K. Nian, H. Zhang, A. Tayal, T. Coleman, and Y. Li, “Auto insurance fraud detection
using unsupervised spectral ranking for anomaly,” The Journal of Finance and Data Science,
vol. 2, no. 1, pp. 58–75, 2016.
15. C. Phua, D. Alahakoon, and V. Lee, “Minority report in fraud detection: classification of
skewed data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 50–59, 2004.
16. Phua, C., Alahakoon, D., Lee, V., 2004. Minority report in fraud detection: classification
of skewed data (Special Issue on Imbalanced Data Sets). SIGKDD Explor. 6 (1), 50–59.
17. Zhu, S., Wang, Y., Wu, Y., 2011. Health care fraud detection using non-negative matrix
factorization. In: Proceedings of the IEEE International Conference on Computer Science and
Education, pp. 499–503.
18. Šubelj, L., Furlan, S., Bajec, M., 2011. An expert system for detecting automobile
insurance fraud using social network analysis. Expert Syst. Appl. 38 (1), 1039–1052.