This document presents a novel hybrid approach for detecting automobile insurance fraud using data balancing and machine learning techniques. It aims to minimize false positive classifications of legitimate claims as fraudulent. The approach uses undersampling and oversampling methods to address class imbalance in fraud and non-fraud data. It then trains supervised machine learning models like decision trees, naive Bayes, support vector machines, KNN and logistic regression. The models are evaluated using performance metrics on a test dataset to identify the most accurate model for fraud detection. This hybrid approach integrating data balancing and multiple classifiers seeks to develop an effective solution for detecting insurance fraud.


A NOVEL HYBRID DATA BALANCING AND

FRAUD DETECTION APPROACH FOR

AUTOMOBILE INSURANCE CLAIMS

Atul Kumar Agrawal

Computer Science & Engineering

Veer Surendra Sai University Of Technology, Burla
2019

A NOVEL HYBRID DATA BALANCING AND
FRAUD DETECTION APPROACH FOR
AUTOMOBILE INSURANCE CLAIMS

A minor project submitted in partial fulfillment of the requirements for the

degree of:

BACHELOR OF TECHNOLOGY IN
COMPUTER SCIENCE & ENGINEERING

Submitted by:
Atul Kumar Agrawal
Registration no. : 1602040031

Under The supervision of:

Dr. Suvasini Panigrahi

Associate Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY, BURLA
2019

VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY, BURLA, ODISHA

Declaration

I declare that this written submission represents my ideas in my own words and wherever
others’ ideas or words have been included, I have adequately cited and referenced the
original sources. I also declare that I have adhered to all principles of academic honesty and
integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in
my submission. I understand that any violation of the above will be cause for disciplinary
action by the university and can also evoke penal action from the sources which have thus
not been properly cited or from whom proper permission has not been taken when needed.

DATE: Atul Kumar Agrawal

Regd.no.-1602040031
Department of Computer Science & Engineering
Veer Surendra Sai University Of Technology, Burla, Odisha

Certificate

This is to certify that the dissertation entitled "A novel hybrid data balancing and fraud
detection approach for automobile insurance claims", submitted by Atul Kumar Agrawal
and approved for the degree of Bachelor of Technology in Computer Science and Engineering,
is a record of original research work carried out by him under my supervision and
guidance.

Dr. Manas Ranjan Kabat Dr. Suvasini Panigrahi

Head of Department Supervisor
Acknowledgment

I would like to express my sincere gratitude to my supervisor, Dr. Suvasini Panigrahi, for
her invaluable help during the course work towards this dissertation. She was a source of
constant ideas and encouragement and provided a friendly atmosphere to work in. I am
very thankful to her for everything.

I am also thankful to Dr. Manas Ranjan Kabat, Head of the Department, and to all the
faculty of the Department of Computer Science and Engineering for having supported me in
carrying out this dissertation and for their constant advice. I would like to thank all my friends
for their encouragement and understanding, and to express my heartfelt gratitude to
them.

Atul Kumar Agrawal

Regd.no.-1602040031

Approval Sheet

This dissertation entitled "A novel hybrid data balancing and fraud detection approach for
automobile insurance claims" by Atul Kumar Agrawal is approved for the degree of
Bachelor of Technology in "Computer Science and Engineering", Department of
Computer Science and Engineering.

Date: Supervisor
Place: Burla

ABSTRACT

Automobile insurance fraud is a major issue for insurance companies and has caused losses of
several crores through fraudulent and false claims. It is a serious crime in most parts of the
world, and offenders may be sentenced to between 1 and 20 years in jail. Fraud rings may
involve fake patients, fake doctors and fake lawyers working together. Various machine
learning and deep learning techniques have been developed to detect these kinds of fraud, and
research is ongoing to find the most suitable method for detecting new fraud patterns over
time. As the number of fraudulent claims is low in comparison to legitimate ones, we use
one-class classification for the minority class so as to minimize the classification of
non-fraudulent data as fraud, i.e. to minimize the false positive alarm rate. Machine learning
techniques such as Support Vector Machine, K-Nearest Neighbors, Decision Tree, Logistic
Regression and Naive Bayes have been deployed to detect fraud, and most of these classifiers
achieve good accuracy. The models have been trained using undersampling and oversampling
of the data to reduce the class imbalance between fraud and non-fraud records, and then
validated using 10-fold cross-validation.

Keywords : Automobile insurance fraud, false positive alarm rates, undersampling, Support
vector machine, K-nearest neighbors, Decision tree, Logistic regression, Naive Bayes

Table of Contents

CHAPTER 1 INTRODUCTION
1.1 Definitions
1.2 Fraudsters
1.3 Types of fraud
1.4 Methods of fake fraud claims
1.5 Types of automobile fraud
1.6 Problems associated
1.7 Organization of the thesis

CHAPTER 2 MOTIVATION
Problem statement

CHAPTER 3 LITERATURE REVIEW

CHAPTER 4 BACKGROUND STUDY
4.1 Methods for class balancing
4.2 Undersampling and Oversampling methods
4.3 Supervised model training algorithms
4.3.1. Decision Tree
4.3.2. Naive Bayes
4.3.3. Support Vector Machine (SVM)
4.3.4. KNN (K-Nearest Neighbors)
4.3.5. Logistic Regression
4.4 Fraud detection measures

CHAPTER 5 PROPOSED SYSTEM
5.1 Proposed system diagram
5.2 Steps involved in the system implemented
5.3 Data set used
5.4 Data set description
5.5 Implementation Environment
5.5.1 Hardware specifications
5.5.2 Software specifications
5.6 Performance metrics

CHAPTER 6 RESULTS AND DISCUSSION

CHAPTER 7 CONCLUSION AND FUTURE WORK

REFERENCES

List of Figures

Figure 4.1 : Decision Tree: rules leading to the colored clusters
Figure 4.2 : Decision tree : insurance classification
Figure 4.3 : SVM trained with samples from two classes
Figure 4.4 : KNN classification model
Figure 4.5 : Logistic Regression model
Figure 5.1 : Proposed model flow chart

List of Tables

Table 3.1 : Related work with algorithms used on insurance fraud
Table 6.1 : Accuracies after application of 5 models to the data set
Table 6.2 : Result after undersampling and oversampling

LIST OF ABBREVIATIONS
Abbreviation Description
KNN K-Nearest Neighbors
SVM Support Vector Machine
TP True Positive
TN True Negative
FP False Positive
FN False Negative
CM Confusion Matrix
DT Decision Tree
LR Logistic Regression
NB Naive Bayes

Chapter 1

INTRODUCTION

1.1 Definitions :

Fraud : A serious crime that involves the use of one's occupation for personal enrichment
through the deliberate misuse or misapplication of the employing organization's resources or
assets. In the insurance context, it is the illegal misuse of insurance policies for personal
benefit.

Fraud Detection : Monitoring the behavior of a population of users, using data sets, to
estimate, detect or avoid undesirable behavior.

Automobile Insurance fraud : Fraud committed through false automobile insurance claims. It
has been a major issue for insurance companies and has caused substantial losses.

1.2 Fraudsters :

Fraud is committed through:

1. False accident and injury claims

2. False stolen-vehicle reports

3. False claims that an accident or damage happened after the policy or coverage was purchased

4. Claimants hiding the fact that an excluded driver was driving at the time of the accident

1.3 Types of fraud:

1. Credit card fraud

2. Telecommunication fraud

3. Bankruptcy fraud

4. Theft / counterfeit fraud

5. Application fraud

6. Behavioral fraud

7. Insurance fraud

8. Statement fraud

9. Security fraud

1.4 Various methods of making fraudulent claims are:

1. Staged collisions : Fraudsters use a motor vehicle to stage fake accidents with an
innocent party.

2. Exaggerated claims : Claims for injuries and damage that may already have been present
before the actual accident took place.

3. False stolen reports : The claimant may have sold the vehicle or gifted it to a relative,
and then claims insurance by reporting it stolen.

4. Hidden information : Claimants may hide the fact that the driver at the time of the
accident was excluded under the terms of the insurance.

5. Multiple claims : People who claim multiple times for the same loss.

1.5 Types of automobile fraud:

1. Soft auto-insurance fraud : Examples include filing more than one claim for a single
injury, filing claims for injuries not related to an automobile accident, misreporting wage
losses due to injuries, and reporting higher costs for car repairs than those that were
actually paid.

2. Hard auto-insurance fraud : It includes activities such as staging automobile collisions,
filing claims when the claimant was not actually involved in the accident, submitting claims
for medical treatments that were not received, inventing injuries, and filing false stolen
reports.

1.6 Problems associated :

a. Class imbalance problem : Fraud is the minority class and legitimate claims are the
majority class.

b. Outliers problem : Outliers are records that are dissimilar to the defined set of clusters.
They cannot serve as representatives when undersampling the data, and such noise needs to
be eliminated to improve the data quality.

1.7 Organization of the thesis :

The thesis is organized as follows:

Chapter 2 – It presents the motivation for the research project.

Chapter 3 – It reviews the related literature and the machine learning techniques that have
been applied to insurance fraud detection.

Chapter 4 – It presents the background study: the class balancing methods and the supervised
learning algorithms used in the work.

Chapter 5 – It presents the proposed system, including the methodology, data set and flow
chart.

Chapter 6 – It presents the results and discussion.

Chapter 7 – It concludes the thesis with some prospects for future work.

CHAPTER 2

MOTIVATION

Automobile Fraud Statistics

The Insurance Fraud Bureau in the UK estimated that there were more than 20,000 fake
collisions and false insurance claims across the UK from 1999 to 2006. One tactic fraudsters
use is to drive to a busy junction or roundabout and brake sharply, causing a motorist to drive
into the back of them. They claim the other motorist was at fault for driving too fast or too
close behind them, and make a false and inflated claim to the motorist's insurer for injury
and damage, which can pay the fraudsters up to 30 lakhs. In the Insurance Fraud Bureau's
first year of operation, the use of data mining initiatives exposed insurance fraud networks
and led to 74 arrests and a five-to-one return on investment. The Insurance Research Council
estimated that in 1996, 21 to 36 percent of auto-insurance claims contained elements of
suspected fraud. There is a wide variety of schemes used to defraud automobile insurance
providers.

According to data released by the Beijing bureau of China, 10% of all insurance claims were
fraudulent. The Coalition Against Insurance Fraud estimates that in 2006 a total of about
$80 billion was lost in the United States due to insurance fraud. According to estimates by
the Insurance Information Institute, insurance fraud accounts for about 10 percent of the
property/casualty insurance industry's incurred losses and loss adjustment expenses. The
Indiaforensic Center of Studies estimates that insurance fraud in India costs about
$6.25 billion annually.

Problem Statement

1. We are given an automobile insurance data set comprising various features of the
claimants. The data set is labeled and is divided into a training set and a test set.

2. A hybrid technique is implemented in which the data is sampled to balance the classes,
and the models are trained and tested on the given data set.

CHAPTER 3

Literature Review

Table 3.1 : Related work with algorithms used on insurance fraud

| S. No | Name of the Research Paper | Year of Publication | Technique(s) used | Result |
|---|---|---|---|---|
| 1. | An Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud Detection | 2019 | - | LR, C5.0 decision tree algorithm, SVM and ANN are the best methods according to the 3 considered performance measures (Accuracy, Sensitivity and AUPRC). |
| 2. | Predicting Fraudulent Claims in Automobile Insurance | 2018 | Random forest + J48 + Naive Bayes | Random forest outperforms the remaining two algorithms. |
| 3. | One-class support vector machine based undersampling: Application to churn prediction and insurance fraud detection | 2015 | OCSVM | OCSVM based undersampling improves the performance of classifiers. |
| 4. | The Identification Algorithm and Model Construction of Automobile Insurance Fraud Based on Data Mining | 2015 | Outlier detection method based on KNN | Data mining had the advantages of low time complexity, high recognition rate and high accuracy. |
| 5. | Random Rough Subspace based Neural Network Ensemble for Insurance Fraud Detection | 2011 | Random rough subspace based neural network ensemble method | The random subspace method can be used for an online fraud detection system. |

Various deep learning, machine learning and data mining techniques have been applied to
automobile insurance fraud detection. These are:

1. Decision tree based

2. Machine learning

i. Supervised learning

A. Classification

a. Support vector machine (SVM) [4]

b. Recursive neural network(RNN)

c. Radial Basis Function neural network [5]

B. Regression and statistics

a. Logistic regression

b. Binary regression

ii. Unsupervised learning

A. Clustering

a. K-means clustering

b. Hierarchical clustering

B. Spectral Ranking Anomaly (SRA) [14]

iii. Semi- supervised learning : Combination of supervised and unsupervised learning

iv. Reinforcement learning

3. Multi layered perceptron based (MLP)

4. Data mining

5. Random forest

6. Naive Bayes Tree [4]

7. Probabilistic neural network

8. Group method of data handling (GMDH)

9. Synthetic Minority Oversampling Technique [2]

10. kRNN and K-Means hybrid for outlier elimination and undersampling [3]

11. Geometric mean based [6]

12. fuzzy Gaussian membership based oversampling [7]

13. data gravitation [8]

14. Fuzzy logic control (FLC)

15. Genetic algorithms

16. hybrid of back-propagation neural networks (ANN) and self-organizing maps (SOM) [9]

17. 10-fold cross validation method

18. Oversampling techniques (for the minority class):

a. ADASYN (Adaptive Synthetic sampling)

b. SMOTE (Synthetic Minority Oversampling Technique)

19. Back Propagation

20. C4.5 algorithm [15]

21. Meta learning approaches [15]

22. Stacking – bagging method - MLP together with Naïve Bayesian (NB) and C4.5

algorithm [16]

23. Bayesian belief network

24. Non negative matrix factorization approach for health care fraud detection. [17]

25. Random Rough subspace based Neural Network Ensemble [13]

26. Iterative Assessment Algorithm based on Graph components [18]

27. SVM-Recursive Feature Elimination for feature selection, with active learning methods
employed for synthetic data generation. [4]

Limitations:

1. Data sets are not available / public. [privacy concerns]

2. Results are often not disclosed to public

3. Confidential data set and results.

4. False alarms may be generated.

5. Imbalanced classification approaches are required because the ratio of fraudulent to
legitimate claims is very low.

Chapter 4

BACKGROUND STUDY

4.1 Methods for class balancing :

1. Oversampling the minority class by creating artificial samples

2. Undersampling the majority class data

3. Cost-sensitive models, which assign a higher misclassification cost to the minority class

4. One-class classification, which trains using only the minority class

4.2 Undersampling and Oversampling methods used :

1. Random oversampling : Balances the classes by replicating minority class observations,
so that the model concentrates on both the minority and the majority classes.

2. One class classification : The model is trained using only minority class samples. After
training, it detects whether a transaction belongs to the minority class or not.

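3. One class SVM

-> It finds a small region capturing most of the data points.

-> The decision function is

$$ f(x) = \begin{cases} +1, & \text{if } x \text{ is in the region} \\ -1, & \text{otherwise} \end{cases} $$

Φ : mapping function from the variable space to a higher dimensional space F

Hyperplane equation :

$$ \omega^{T} \Phi(x_i) = \rho \tag{1} $$

Objectives of one class SVM :

Maximize the margin

$$ \frac{\rho}{\lVert \omega \rVert} \tag{2} $$

i.e. minimize

$$ \frac{1}{2} \lVert \omega \rVert^{2} + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i - \rho \tag{3} $$

subject to

$$ \omega^{T} \Phi(x_i) \ge \rho - \xi_i, \qquad \xi_i \ge 0 \quad \forall\, i = 1, 2, \ldots, l $$

where

ω = weight vector

ρ = offset parameter of the hyperplane in F

$\xi_i$ = slack variables that account for possible errors

ν = parameter in (0, 1] that bounds the fraction of points allowed to fall outside the region

l = number of training observations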

Oversampling methods :

1. ADASYN (Adaptive Synthetic sampling) : It is applied to the minority class and creates
additional artificial minority class instances. It generates more synthetic instances for the
minority samples that are difficult to learn. The density distribution factor r_i measures the
degree of learning difficulty of each minority class sample.

2. SMOTE (Synthetic Minority Oversampling Technique) : It generates artificial minority
samples from the existing minority cases supplied as input. Each new instance is created by
interpolating between a minority sample and one of its nearest minority class neighbors.
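The interpolation step of SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the project; the function name `smote`, the toy points and the parameter values are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolation (illustrative sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority samples
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.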

Undersampling methods : Random undersampling, Clustering

Clustering : Forming clusters and using association rule mining to identify correlated data
and generate association rules.

Validation techniques : K-fold cross-validation. The data is randomly partitioned into K
folds; each fold is used once as the test set while the remaining K - 1 folds are used for
training, and the accuracy is averaged over the K runs.
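How the folds are formed can be sketched as follows. The helper `k_fold_indices` is a hypothetical name, and the sample count of 25 is only for illustration; the project's own splitting code is not shown in the report.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each observation appears in exactly one test fold, so every record is used for testing once and for training K - 1 times.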

4.3 Supervised model training algorithms used :

4.3.1. Decision Tree

4.3.2. Naive Bayes

4.3.3. Support Vector Machine (SVM)

4.3.4. KNN (K – Nearest Neighbors)

4.3.5. Logistic Regression

4.3.1. Decision Tree

Figure 4.1 : Decision Tree: Rules leading to the colored clusters

Figure 4.2 : Decision tree : insurance classification

The decision tree uses the C5.0 algorithm.

> It uses cross entropy (information statistics and information gain) to choose the splits.

> It is a classification algorithm.

> It divides the data into successive subtrees for decision making.
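The entropy and information gain criterion mentioned above can be sketched in a few lines. This only illustrates the split criterion, not C5.0 itself, which adds further refinements; the attribute values below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Reduction in entropy obtained by splitting the labels on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy claims: one attribute separates the classes perfectly, the other not at all
labels = ["fraud", "fraud", "legit", "legit"]
fault = ["policy_holder", "policy_holder", "third_party", "third_party"]
sex = ["M", "F", "M", "F"]
gain_fault = information_gain(labels, fault)
gain_sex = information_gain(labels, sex)
```

In this toy example the tree would split on `fault` first, because it has the larger information gain.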

4.3.2. Naive Bayes

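Uses Bayes' conditional probability rule for classification.

Objective : To find the class Y of a new observation that maximizes its posterior probability,
i.e. the Y for which $P(Y \mid X_1, X_2, \ldots, X_n)$ is maximum.

$$ P(Y \mid X_1, X_2, \ldots, X_n) = \frac{P(X_1, X_2, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, X_2, \ldots, X_n)} \tag{4} $$

Since the denominator does not depend on Y,

$$ \arg\max_{Y} P(Y \mid X_1, X_2, \ldots, X_n) = \arg\max_{Y} P(X_1, X_2, \ldots, X_n \mid Y)\, P(Y) \tag{5} $$

and, assuming that the variables are conditionally independent given Y,

$$ \arg\max_{Y} P(Y \mid X_1, X_2, \ldots, X_n) = \arg\max_{Y} P(Y) \prod_{i=1}^{n} P(X_i \mid Y) \tag{6} $$

Disadvantages :

1. The assumption that the variables are independent may not be true.

2. Discretization of continuous variables.

3. Information may be lost.

4. The data may not be normally distributed.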
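The classification rule can be made concrete with a small sketch for categorical attributes. The function names, the toy rows and the use of Laplace (add-one) smoothing are assumptions for illustration and are not taken from the project code.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
            values[j].add(value)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class that maximizes P(Y) * product of P(X_j | Y)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(Y)
        for j, value in enumerate(row):
            # Laplace-smoothed estimate of P(X_j = value | Y = label)
            score *= (feature_counts[(j, label)][value] + 1) / (count + len(values[j]))
        if score > best_score:
            best_class, best_score = label, score
    return best_class


# Hypothetical rows: (witness present, accident area)
rows = [("no", "urban"), ("no", "urban"), ("no", "rural"),
        ("yes", "urban"), ("yes", "rural"), ("yes", "urban")]
labels = ["fraud", "fraud", "fraud", "legit", "legit", "legit"]
model = train_naive_bayes(rows, labels)
```

Smoothing keeps a single unseen attribute value from forcing the whole product to zero.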

4.3.3. Support Vector Machine (SVM) :

Figure 4.3 : SVM trained with samples from two classes

A Support Vector Machine (SVM) performs classification by constructing an N-dimensional

hyperplane that optimally separates the data into two categories. SVM models are closely

related to neural networks. Using a kernel function, SVMs are an alternative training method

for polynomial, Radial Basis Function (RBF) networks and MLP classifiers, in which the

weights of the network are found by solving a quadratic programming problem with linear

constraints, rather than by solving a non-convex, unconstrained minimization problem, as in

standard neural network training.

It is a classification tool used to divide the data points into 2 classes using a hyperplane.
It performs row dimensional reduction (by picking out the support vectors) while
classifying the data sets.

For a given training vector $x_i \in \mathbb{R}^{n}$,

i = 1, 2, ..., l

n = no. of explanatory variables

l = no. of observations in the training set

$y \in \{-1, 1\}^{l}$

Binary classification is done by solving the following optimization problem :

$$ \min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{l} \xi_i \tag{7} $$

Minimizing the norm of w maximizes the distance between the two margins.

subject to

$$ y_i \left( w^{T} \Phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, 2, \ldots, l \tag{8} $$

Hyperplane equation :

$$ w^{T} \Phi(x) + b = 0 \tag{9} $$

w = vector of weights

b = bias (offset) term

$\xi_i$ = slack variables (for errors / misclassifications)

C = cost parameter > 0

n = no. of features

y = dependent variable, $y_i \in \{-1, 1\}$

x = independent variable

In the one-class case, the hyperplane w separates the points $x_i$ from the origin with
margin ρ, and $\xi_i$ accounts for possible errors.
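The role of the slack variables in constraint (8) can be illustrated numerically. The sketch below assumes a linear kernel (Φ is the identity) and a hypothetical, hand-picked w and b rather than a trained model; it uses the standard fact that, for fixed w and b, the smallest slack satisfying (8) is max(0, 1 - y(w·x + b)).

```python
def decision_value(w, b, x):
    """Value of w.x + b for a linear SVM (Phi taken as the identity)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest slack that satisfies the margin constraint for one point."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def objective(w, b, points, labels, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of slacks."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slack(w, b, x, y) for x, y in zip(points, labels))


# Hypothetical hyperplane and points
w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [-2.0, 1.0], [0.5, 0.0]]
labels = [1, -1, 1]
slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
value = objective(w, b, points, labels, C=1.0)
```

The third point is on the correct side of the hyperplane but inside the margin, so it is the only one with non-zero slack.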

4.3.4. KNN (K – Nearest Neighbors)

Figure 4.4 : KNN classification model

-> Uses the K nearest points (based on the distance between the points)

-> It is a classification algorithm

-> Euclidean distance, $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^{2}}$, for two observations p and q.
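A minimal sketch of the KNN vote using the Euclidean distance above is given below. The training points, labels and the choice k = 3 are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data
train_points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = ["legit", "legit", "legit", "fraud", "fraud", "fraud"]
```

An odd k avoids ties in a two-class problem such as fraud versus legitimate.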

4.3.5. Logistic Regression

Figure 4.5 : Logistic Regression model.

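Vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_n)$ = coefficients

$x = (x_0, x_1, \ldots, x_n)$ = explanatory variables, with $x_0 = 1$

$\epsilon$ = model's error

$$ Y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \ldots + \alpha_n x_n + \epsilon = x \cdot \alpha + \epsilon \tag{10} $$

g = logit link function, which maps a probability in (0, 1) to ℝ; its inverse gives values between 0 and 1.

$$ g(p) = x \cdot \alpha \tag{11} $$

p = probability of fraud risk

$$ g(p) = \ln\left( \frac{p}{1 - p} \right) \tag{12} $$

$$ p = \frac{e^{x \cdot \alpha}}{1 + e^{x \cdot \alpha}} \tag{13} $$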
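Equations (12) and (13) are inverses of each other, which the following sketch checks numerically. The coefficient values are hypothetical, and the first entry of x is assumed to be the constant 1 that multiplies the intercept.

```python
import math


def logit(p):
    """Link function of equation (12): log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse link of equation (13); algebraically equal to e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


def fraud_probability(alpha, x):
    """Probability of fraud for one claim, given coefficients alpha and features x."""
    z = sum(a * xi for a, xi in zip(alpha, x))
    return sigmoid(z)


# Hypothetical coefficients: intercept 0.0 and one slope of 1.0
p = fraud_probability([0.0, 1.0], [1.0, 0.0])
```

A claim is then labeled as fraud when the predicted probability exceeds a chosen threshold, commonly 0.5.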

4.4 Fraud detection measures :

1. False positive

2. False negative

3. True positive

4. True negative

Chapter 5

PROPOSED SYSTEM
Most of the work mentioned above was limited by the data imbalance problem. The model
proposed in this report uses one-class classification to deal with the minority class
imbalance problem.

Following the proposed methodology, we split the data into two subsets in the ratio
80% : 20%, ensuring that each subset has the same proportion of positive and negative
samples.
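A split that preserves the class proportions is usually called a stratified split. The sketch below shows one way to do it with the standard library; the helper name and the toy label counts (60 positive, 940 negative) are assumptions, not the project's data.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with equal class proportions in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        # Take the same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 1 = fraud, 0 = legitimate
labels = [1] * 60 + [0] * 940
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately guarantees that the rare fraud class is represented in the 20% subset in the same proportion as in the 80% subset.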

5. 1 Proposed system diagram

Flow Chart:

Figure 5.1 : Proposed model flow chart

5.2 Steps involved in the system implemented :

1. Data pre processing : Oversampling and undersampling

a. Handling missing data

b. Joining the claim payment data

c. Removing redundancy

d. Data cleaning (using only essential attributes)

2. Clustering and classification

3. Training the model

4. Testing

5. Validation
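The undersampling part of step 1 can be sketched as follows, assuming simple random undersampling of the majority class. The function name, the toy data and the target of 20 majority rows are hypothetical.

```python
import random


def random_undersample(rows, labels, majority_label, n_majority, seed=0):
    """Keep all minority rows and a random subset of n_majority majority rows."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept = sorted(rng.sample(majority, n_majority) + minority)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical data: 94 legitimate (0) and 6 fraudulent (1) claims
labels = [0] * 94 + [1] * 6
rows = list(range(100))
new_rows, new_labels = random_undersample(rows, labels, majority_label=0, n_majority=20)
```

All fraud records are kept, so only information about legitimate claims is discarded.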

5.3 Data set used : mycarclaims.csv [15]

Data set attributes used for automobile insurance fraud detection :

week_past , is_holiday , age_price_wsum , make , accidentarea , sex , maritalstatus , fault ,

vehicleCategory , RepNumber , Deductible , DriverRating , Days: policy_accident , Days:

policy_claim , PastNumberOfClaims , AgeOfPolicyHolder , PolicyReportFiled ,

WitnessPresent , AgentType , NumberOfSuppliments , AddressChange_claim ,

NumberOfCars , BasePolicy , FraudFound , Claim number, Policy number, Claim occurrence

date and report date, Claim occurrence time , Claim open date, claim report date , claim loss

data , claim event location name , claim amount , policy premium, part market cost , claim on

vehicle , count of customer communication , are claim document submitted , policy effective

date, claim occurrence date , claim on same vehicle check.

5.4 Dataset Description :

Total = 15,420 insurance claims

Year : 1994 – 96 in US

Genuine = 14,497 (94%)

Fraud = 923 (6%)

Imbalance ratio = 0.06 : 0.94

No. of attributes = 24
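The class proportions quoted above can be checked with a few lines of arithmetic. The last line uses the figure of 5000 majority samples quoted in Chapter 6 for the random undersampling experiment; the variable names are only illustrative.

```python
total, genuine, fraud = 15420, 14497, 923

fraud_share = fraud / total            # share of fraudulent claims
genuine_share = genuine / total        # share of genuine claims
majority_per_minority = genuine / fraud

# Fraud share if the majority class is undersampled to 5000 claims
fraud_share_undersampled = fraud / (5000 + fraud)
```

The fraud share is roughly 6% before balancing and rises to roughly 15.6% after undersampling the majority class to 5000 claims.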

5.5 Implementation Environment

The experiments were implemented under the following hardware and software specifications

5.5.1 Hardware Specification

The study of the data set and the project have been carried out on a laptop with an Intel Core
i3-5005U at 2 GHz, 8 GB of RAM and 2 GB of graphics memory. The operating system used is
Ubuntu 18.04.3.

5.5.2 Software Specification

The project has been implemented in the Spyder IDE and the language used is Python 3.6.

5.6 Performance metrics :

confusion matrix

1. Accuracy

Accuracy = (No. of True Positives + No. of True Negatives) / (No. of True Positives + No. of
True Negatives + No. of False Positives + No. of False Negatives)

$$ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{14} $$

2. Specificity

Specificity, also known as the True Negative Rate, is calculated as

Specificity = No. of True Negatives / (No. of True Negatives + No. of False Positives)

$$ \text{Specificity} = \frac{TN}{TN + FP} \tag{15} $$

3. Sensitivity

Sensitivity, also known as the True Positive Rate or Recall, is calculated as

Sensitivity = No. of True Positives / (No. of True Positives + No. of False Negatives)

$$ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{16} $$

4. Precision

Precision, also known as the Positive Predictive Value, is calculated as

Precision = No. of True Positives / (No. of True Positives + No. of False Positives)

$$ \text{Precision} = \frac{TP}{TP + FP} \tag{17} $$
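The four metrics of equations (14) to (17) can be computed directly from the confusion matrix counts. The counts below are hypothetical and are not results from this project.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and precision from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts: 100 fraudulent and 1000 legitimate claims
m = confusion_metrics(tp=80, tn=900, fp=100, fn=20)
```

With these counts the accuracy is about 0.89 while the precision is only about 0.44, which shows why accuracy alone can be misleading on imbalanced data.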

Chapter 6

RESULTS AND DISCUSSION

A complete survey was carried out of the various techniques used in automobile insurance
fraud detection. It shows that there is still scope for improvement, as the minority class
can be oversampled to further increase accuracy, which will help to a great extent in
preventing losses for insurance companies.

The following are the results produced :

| Model | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) |
|---|---|---|---|
| 1. Decision Tree | 89.06 | 91 | 79.08 |
| 2. SVM | 94.04 | 83.5 | 71.88 |
| 3. KNN | 93.36 | 84.5 | 82.66 |
| 4. Logistic regression | 93.60 | 83 | 86.43 |
| 5. Naive Bayes | 73.97 | 80.5 | 77.16 |

Table 6.1 : Accuracies after application of 5 models to the data set.

After random undersampling of the data [no. of majority samples = 5000] :

| Model | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) |
|---|---|---|---|
| 1. Decision Tree | 90.84 | 91.2 | 69.65 |
| 2. SVM | 94.55 | 87.7 | 74.28 |
| 3. KNN | 94.19 | 25.5 | 84.37 |
| 4. Logistic regression | 94.03 | 85.7 | 97.32 |
| 5. Naive Bayes | 79.86 | 83.1 | 86.44 |

Table 6.2 : Result after undersampling and oversampling

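Observation : It is evident from the tables that, by using undersampling of the majority class
and oversampling of the minority class for data balancing, all five algorithms showed an
increase in accuracy. Sensitivity and specificity did not improve uniformly: for example, the
specificity of the decision tree fell from 79.08% to 69.65%, and the sensitivity of KNN fell
from 84.5% to 25.5%.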

Chapter 7

Conclusion and Future Work

The system has been implemented using supervised learning methods. In future work, the
proposed system will be implemented with unsupervised algorithms and/or hybrid models, and
their accuracies will be compared and tested to find the method with the highest accuracy
and the lowest false positive alarm rate.

REFERENCES

1. Sundarkumar, G. Ganesh, Vadlamani Ravi, and V. Siddeshwar. "One-class support vector
machine based undersampling: Application to churn prediction and insurance fraud
detection." In 2015 IEEE International Conference on Computational Intelligence and
Computing Research (ICCIC), pp. 1-7. IEEE, 2015.

2. N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, “SMOTE: Synthetic
Minority Oversampling Technique”, Journal of Artificial Intelligence Research, vol. 16(1), pp.
321-357, 2002.

3. M. Vasu, and V. Ravi, “ A hybrid undersampling approach for mining unbalanced datasets:
Application to Banking and insurance”, International Journal of Data Mining Modeling and
Management, Vol. 3(1), pp. 75-105, 2011.

4. M.A.H. Farquad, V. Ravi and S. Bapi Raju, “Analytical CRM in banking and finance using
SVM: a modified active learning-based rule extraction approach”, International Journal of
Electronic Customer Relationship Management, vol. 6(1), pp. 48-73, 2011.

5. M. D. Pérez-Godoy, A. J. Rivera, C. J. Carmona, M. J. D. Jesus, “Training algorithm for
radial basis function network to tackle learning process with imbalanced datasets”, Applied
Soft Computing, Vol. 25, pp. 26-39, 2014.

6. M. J. Kim, D. K. Kang, and H. B. Kim, “Geometric mean based boosting algorithm with
oversampling to resolve data imbalance problem for bankruptcy prediction”, Expert Systems
with Applications, Vol. 41(3), pp. 1074-1082, 2015.

7. D. C. Li, C. W. Liu, S. C. Hu, “A learning method for the class imbalance problem with
medical datasets”, Computers in Biology and Medicine, Vol. 40(5), pp. 509-518, 2010.

8. L. Peng, H. Zhang, B. Yang, and Y. Chen, “A new approach for imbalanced data
classification based on data gravitation”, Information Sciences, Vol. 288, pp. 347-373, 2014.

9. C. F. Tsai, and Y. H. Lu, “Customer churn prediction by hybrid neural networks”, Expert
Systems with Applications, Vol. 36 (10), pp. 12547-12553, 2009.

10. Makki, Sara, Zainab Assaghir, Yehia Taher, Rafiqul Haque, Mohand-Saïd Hacid, and
Hassan Zeineddine. "An Experimental Study With Imbalanced Classification Approaches for
Credit Card Fraud Detection." IEEE Access 7 (2019): 93010-93022.

11. Kowshalya, G., and M. Nandhini. "Predicting Fraudulent Claims in Automobile
Insurance." In 2018 Second International Conference on Inventive Communication and
Computational Technologies (ICICCT), pp. 1338-1343. IEEE, 2018.

12. Yan, Chun, and Yaqi Li. "The Identification Algorithm and Model Construction of
Automobile Insurance Fraud Based on Data Mining." In 2015 Fifth International Conference
on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp.
1922-1928. IEEE, 2015.

13. Xu, Wei, Shengnan Wang, Dailing Zhang, and Bo Yang. "Random rough subspace based
neural network ensemble for insurance fraud detection." In 2011 Fourth International Joint
Conference on Computational Sciences and Optimization, pp. 1276-1280. IEEE, 2011.

14. K. Nian, H. Zhang, A. Tayal, T. Coleman, and Y. Li, “Auto insurance fraud detection
using unsupervised spectral ranking for anomaly,” The Journal of Finance and Data Science,
vol. 2, no. 1, pp. 58–75, 2016.

15. C. Phua, D. Alahakoon, and V. Lee, “Minority report in fraud detection: classification of
skewed data,” Acm sigkdd explorations newsletter, vol. 6, no. 1, pp. 50–59, 2004.

16. Phua, C., Damminda, A., Lee, V., 2004. Minority report in fraud detection: classification
of skewed data (Special Issue on Imbalanced Data Sets). SIGKDD Explor. 6 (1), 50–59.

17. Zhu. S., Wang, Y., Wu, Y., 2011. Health care fraud detection using non-negative matrix
factorization. In: Proceedings of the IEEE International Conference on Computer Science and
Education, pp. 499–503.

18. Sublej, L., Furlan, S., Bajec, M., 2011. An expert system for detecting automobile
insurance fraud using social network analysis. Expert Syst. Appl. 38 (1), 1039–1052.

