INSURANCE FRAUD DETECTION REPORT
MACHINE LEARNING
ABSTRACT
Insurance companies, operating as commercial enterprises, have for many years been experiencing fraud across all types of claims. The amounts claimed fraudulently are significant and can cause serious problems; hence, alongside governments, various organizations are also working to detect and reduce such activity. Fraud occurs in every area of insurance with high severity, and claims in the auto sector are among the most widely abused and prominent types, typically through fake accident claims. Our aim is therefore to develop a project that works on an insurance-claim dataset to detect fraudulent and fake claim amounts. The project implements machine learning algorithms to build models that label and classify claims, and presents a comparative study of all the classification algorithms used, evaluated via the confusion matrix in terms of accuracy, precision, recall, etc. For fraudulent-transaction validation, the machine learning model is built using Python libraries.
CHAPTER 1
INTRODUCTION
The insurance industry has faced numerous challenges due to fraudulent claims from the very beginning. Losses incurred due to fraud impact all the parties involved. Even one undetected fraud can lead to a huge loss, resulting in increased premium costs, process inefficiency, and loss of trust. Though all insurance companies have fraud-detection systems in place, most of those processes are inefficient and time consuming. Traditional mechanisms rely heavily on human intervention and hence do not adapt to changing situations. A long, ongoing investigation delays pay-outs and has a negative impact on the customer. Uncaught fraudulent claims not only hinder the profitability of the firm but also encourage other policyholders to show similar behavior.
Insurance fraud occurs when individuals attempt to profit by failing to fulfill the terms of the insurance agreement. Frauds can be categorized as soft fraud or hard fraud. If a policyholder intentionally stages an accident or invents a loss just to gain benefits from the insurance company, it is said to be hard fraud. However, when an actual injury or theft occurs and the insured exaggerates the claim to obtain more money from the company, that is termed soft fraud.
The evolution of big data and the growth of unstructured data have given rise to many fraudsters exploiting the system. If the data is not analyzed thoroughly, the chances of fraud occurring are high. Data mining and analytics have changed the fraud-detection scenario. Data can be gathered from various sources and stored in a combined repository for further use. Implementing analytical solutions requires an initial investment from insurance companies, so they often resist it. However, it has been observed that machine learning and analytical capabilities have strengthened the insurance lifecycle in many ways. They have provided substantial cost benefits to companies by reducing the overall cost of fraud detection and improving its overall ROI. Insurers therefore need to start leveraging their machine learning capability to build more robust and risk-free systems. Hence, there is a crucial need to develop a system that can help the insurance industry identify potential frauds with a high degree of accuracy, so that other claims can be cleared rapidly while the identified cases are examined in detail.
The dataset used in this study is found to have a class-imbalance problem, meaning that the number of instances of one class (positive) far exceeds the number of instances of the other class (negative). The class with far fewer instances becomes the minority class; the other is called the majority class. As a result, the minority class tends to be ignored during the classification process. To prevent minority instances from being treated as noise and the classifier from being biased toward the majority class, this data imbalance needs to be fixed. A simple way to fix it is to balance the dataset, either by oversampling instances of the minority class or by under-sampling instances of the majority class.
This research aims to develop a model that helps insurers take proactive decisions and makes them better equipped to combat fraud. In this paper, we propose a procedure for auto-fraud identification using the Random Forest classification technique, before which we remove the class imbalance of our original dataset. This is done using the synthetic minority oversampling technique (SMOTE), as sketched below.
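As a hedged illustration of this resampling step, the sketch below uses the SMOTE implementation from the third-party imbalanced-learn package; the generated dataset is a stand-in assumption in place of the actual claims data.

from collections import Counter
from sklearn.datasets import make_classification     # stand-in for the claims data
from imblearn.over_sampling import SMOTE             # from the imbalanced-learn package

# Stand-in imbalanced data: roughly 5% positive (fraud) class.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

# SMOTE synthesizes minority examples until both classes are the same size.
print(Counter(y), "->", Counter(y_balanced))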
This project aims to suggest the most accurate and simplest approach for fighting fraudulent claims. The main difficulty in detecting fraudulent activity is the massive number of claims that run through companies' systems. This can also be turned into an advantage: if officials combine their claim databases, they hold a dataset large enough to develop better models for flagging suspicious claims. This paper looks into the different methods that have been used to solve similar problems, tests the best previously used methods, and tries to enhance them to build a predictive model that can flag suspicious claims. By testing and comparing these models, the goal is to arrive at a simple, time-efficient, and accurate model that can flag suspicious claims without stressing the system it runs on.
CHAPTER 2
LITERATURE REVIEW
Title: Fraud Detection and Analysis for Insurance Claim using Machine Learning
Authors: Abhijeet Urunkar, Amruta Khot, Rashmi Bhat, Nandinee Mudegol
Publication: 2022 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES)
Insurance companies, operating as commercial enterprises, have for many years been experiencing fraud across all types of claims. The amounts claimed fraudulently are significant and can cause serious problems; hence, alongside governments, various organizations are also working to detect and reduce such activity. Fraud occurs in every area of insurance with high severity, and claims in the auto sector are among the most widely abused and prominent types, typically through fake accident claims. So, the authors aim to develop a project that works on an insurance-claim dataset to detect fraud and fake claim amounts. The project implements machine learning algorithms to build models that label and classify claims, and presents a comparative study of all the classification algorithms used, evaluated via the confusion matrix in terms of accuracy, precision, and recall. For fraudulent-transaction validation, the machine learning model is built using the PySpark Python library.
Title: A time-efficient model for detecting fraudulent health insurance claims using
Artificial neural networks
Authors: Shamitha S. K., V. Ilango
Publication: 2020 International Conference on System, Computation, Automation and Networking (ICSCAN)
Health insurance has come to the rescue of people in reducing their medical expenditure, which would otherwise have taken a high toll on their income. There are both private and government-funded agencies serving the health insurance sector. With soaring demand among the public, healthcare is not safe from fraudsters, and the use of computerized techniques has made this area even more vulnerable. It has become highly essential to detect this fraud at the earliest, so that the impact of loss can be minimized. This paper throws light on a framework for detecting fraud that learns faster and identifies the maximum number of fraud instances. The usual problems, like data heterogeneity and imbalanced classes, are also discussed. As a part of developing an efficient framework for fraud detection, the authors applied several learners and optimization techniques. The framework was evaluated with a claims dataset obtained from the CMS Medicare facility. They conclude that a Multi-Layer Perceptron, a feed-forward neural network, combined with genetic-algorithm optimization helped enhance the results and gain higher accuracy. PCA was also applied to pick the most significant variables. The use of PCA and other appropriate pre-processing techniques also helped reduce the training time, thereby achieving efficiency in terms of both accuracy and speed.
Title: Fraud detection and frequent pattern matching in insurance claims using data
mining techniques
Authors: Aayushi Verma, Anu Taneja, Anuja Arora
Publication: 2017 Tenth International Conference on Contemporary Computing
(IC3)
Fraudulent insurance claims increase the burden on society. Frauds in healthcare systems not only lead to additional expenses but also degrade the quality of care that should be provided to patients. Insurance fraud detection is quite subjective in nature and is fettered to societal need. This empirical study aims to identify and gauge frauds in health insurance data. Its contribution is to untangle the frequent fraud-identification patterns underlying the insurance-claim data using rule-based pattern mining. The experiment assesses fraudulent patterns in the data on the basis of two criteria: period-based claim anomalies and disease-based anomalies. Rule-based mining results according to both criteria are analysed. Statistical decision rules and k-means clustering are applied to period-based claim-anomaly outlier detection, and association-rule mining with a Gaussian distribution is applied to disease-based anomaly outlier detection. These outliers depict fraudulent insurance claims in the data. The proposed approach has been evaluated on a real-world dataset from a health insurance organization, and the results show that it is efficient in detecting fraudulent insurance claims using rule-based mining.
Title: Health Care Insurance Fraud Detection Using Blockchain
Authors: Gokay Saldamli, Vamshi Reddy, Krishna S. Bojja, Manjunatha K. Gururaja, Yashaswi Doddaveerappa, Loai Tawalbeh
Publication: 2020 Seventh International Conference on Software Defined Systems
(SDS)
The healthcare industry is one of the important service providers that improves people's lives. As the cost of healthcare services increases, health insurance becomes the only way to get quality service in case of an accident or a major illness, since it reduces costs and provides financial and economic stability for an individual. One of the main tasks of healthcare insurance providers is to monitor and manage the data and to provide support to customers. Due to regulations and business secrecy, insurance companies do not share patients' data, but since the data are not integrated and not in sync between insurance providers, there has been an increase in the number of frauds occurring in healthcare. Oftentimes, ambiguous or false information is provided to health insurance companies in order to make them pay false claims to policyholders. An individual policyholder may also claim benefits from multiple insurance providers. The National Health Care Anti-Fraud Association (NHCAA) estimates a financial loss of billions of dollars each year. In order to prevent health insurance fraud, it is necessary to build a system that securely manages and monitors insurance activities by integrating data from all the insurance companies. As blockchain provides immutable data maintenance and sharing, the authors propose a blockchain-based solution for health insurance fraud detection.
CHAPTER 3
SYSTEM DESIGN
EXISTING SYSTEM:
PROPOSED SYSTEM:
BLOCK DIAGRAM:
CHAPTER 4
METHODOLOGY
RANDOM FOREST:
We can understand the working of the Random Forest algorithm with the help of the following steps; a minimal code sketch follows the list.
Step 1 − First, start with the selection of random samples from a given dataset.
Step 2 − Next, the algorithm constructs a decision tree for every sample and obtains a prediction result from every decision tree.
Step 3 − In this step, voting is performed over every predicted result.
Step 4 − At last, the most-voted prediction result is selected as the final prediction.
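As a hedged sketch of these steps with scikit-learn (the generated dataset and split sizes below are illustrative assumptions, not the project's actual data):

from sklearn.datasets import make_classification        # stand-in for the claims dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Illustrative data in place of the real insurance-claim features/labels.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Each of the 100 trees is fit on a bootstrap sample (Steps 1-2);
# predict() aggregates the per-tree votes into a final label (Steps 3-4).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # accuracy, precision, recall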
• Collection of dataset:
The dataset is collected from data science platforms such as Kaggle, and the data is then pre-processed.
• Pre-processing:
Before the dataset is fed to the learning algorithm, pre-processing techniques such as one-hot encoding are applied to enhance it.
• Training:
After pre-processing, the model is trained with the best-neighbour setting of the KNN algorithm to achieve maximum accuracy. Once the training procedure is completed, the fitted model and its learned features are stored in a .sav file.
• Classification:
In this final part, the input parameters are given, analysed against the model file, and the claim status is classified. A minimal end-to-end sketch of these steps appears below.
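As a hedged end-to-end sketch of these four steps (the CSV path, the fraud_reported column name, and the claim_model.sav filename are illustrative assumptions; pickle writes the .sav file mentioned above):

import pickle
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# 1. Collection: load a claims CSV (path and columns are hypothetical).
df = pd.read_csv("insurance_claims.csv")

# 2. Pre-processing: one-hot encode the categorical feature columns.
X = pd.get_dummies(df.drop(columns=["fraud_reported"]))
y = df["fraud_reported"]

# 3. Training: grid-search the best n_neighbors setting for KNN.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 21))}, cv=5)
search.fit(X, y)

with open("claim_model.sav", "wb") as f:      # persist the fitted model
    pickle.dump(search.best_estimator_, f)

# 4. Classification: reload the model file and label a claim record.
with open("claim_model.sav", "rb") as f:
    model = pickle.load(f)
print(model.predict(X.iloc[[0]]))             # predicted claim status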
CHAPTER 5
SOFTWARE DESCRIPTION
PYTHON:
PYTHON 3.7:
There are tools which use docstrings to automatically produce online or printed documentation, or to let the user interactively browse through code; it's good practice to include docstrings in code that you write, so make a habit of it. The execution of a function introduces a new symbol table used for the local variables of the function. More precisely, all variable assignments in a function store the value in the local symbol table, whereas variable references first look in the local symbol table, then in the local symbol tables of enclosing functions, then in the global symbol table, and finally in the table of built-in names. Thus, global variables cannot be directly assigned a value within a function (unless named in a global statement), although they may be referenced. The actual parameters (arguments) to a function call are introduced in the local symbol table of the called function when it is called; thus, arguments are passed using call by value (where the value is always an object reference, not the value of the object). When a function calls another function, a new local symbol table is created for that call. A function definition introduces the function name in the current symbol table. The value of the function name has a type that is recognized by the interpreter as a user-defined function. This value can be assigned to another name which can then also be used as a function.
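A small sketch of these scoping rules (the variable and function names are arbitrary):

counter = 0              # lives in the module's global symbol table

def bump():
    global counter       # without this, the assignment below would create a local
    counter += 1

def greet(name):         # "name" is entered into greet's local symbol table
    message = "Hello, " + name   # purely local; gone after the call returns
    return message

bump()
print(counter)           # -> 1
print(greet("Ada"))      # -> Hello, Ada

say = greet              # the function value can be bound to another name
print(say("Bob"))        # -> Hello, Bob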
The comparison operators in and not in check whether a value occurs (or does not occur) in a sequence. The operators is and is not compare whether two objects are really the same object; this only matters for mutable objects like lists. All comparison operators have the same priority, which is lower than that of all numerical operators. Comparisons can be chained. For example, a < b == c tests whether a is less than b and moreover b equals c. Comparisons may be combined using the Boolean operators and and or, and the outcome of a comparison (or of any other Boolean expression) may be negated with not. These have lower priorities than comparison operators; between them, not has the highest priority and or the lowest, so that A and not B or C is equivalent to (A and (not B)) or C. As always, parentheses can be used to express the desired composition. The Boolean operators and and or are so-called short-circuit operators: their arguments are evaluated from left to right, and evaluation stops as soon as the outcome is determined. For example, if A and C are true but B is false, A and B and C does not evaluate the expression C. When used as a general value and not as a Boolean, the return value of a short-circuit operator is the last evaluated argument.
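A brief demonstration of chaining and short-circuiting (the values are chosen arbitrarily):

a, b, c = 1, 2, 2
print(a < b == c)            # True: chained, same as (a < b) and (b == c)

def loud(x):
    print("evaluating", x)   # shows which operands are actually evaluated
    return x

# Short-circuit: the falsy middle operand stops evaluation before "C".
print(loud(True) and loud(False) and loud("C"))   # "C" is never evaluated

# Used as a general value, the result is the last evaluated argument.
print(loud(0) or loud("fallback"))                # -> fallback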
Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by their class) for modifying their state. Compared with other programming languages, Python's class mechanism adds classes with a minimum of new syntax and semantics. It is a mixture of the class mechanisms found in C++ and Modula-3. Python classes provide all the standard features of Object-Oriented Programming: the class inheritance mechanism allows multiple base classes, a derived class can override any methods of its base class or classes, and a method can call the method of a base class with the same name. Objects can contain arbitrary amounts and kinds of data. As is true for modules, classes partake of the dynamic nature of Python: they are created at runtime, and can be modified further after creation. In C++ terminology, normally class members (including the data members) are public (except see below, Private Variables), and all member functions are virtual. As in Modula-3, there are no shorthands for referencing the object's members from its methods: the method function is declared with an explicit first argument representing the object, which is provided implicitly by the call. As in Smalltalk, classes themselves are objects. This provides semantics for importing and renaming. Unlike C++ and Modula-3, built-in types can be used as base classes for extension by the user. Also, like in C++, most built-in operators with special syntax (arithmetic operators, subscripting, etc.) can be redefined for class instances. (Lacking universally accepted terminology to talk about classes, I will make occasional use of Smalltalk and C++ terms. I would use Modula-3 terms, since its object-oriented semantics are closer to those of Python than C++, but I expect that few readers have heard of it.)
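A minimal sketch of these mechanics, namely inheritance, method override, and the explicit self argument (class and method names are arbitrary):

class Claim:
    def __init__(self, amount):
        self.amount = amount          # instance attribute holding state

    def describe(self):               # explicit first argument "self"
        return f"claim of {self.amount}"

class AutoClaim(Claim):               # derived class with one base class
    def describe(self):               # overrides the base-class method
        return "auto " + Claim.describe(self)   # calls the base method by name

c = AutoClaim(500)
print(c.describe())                   # -> auto claim of 500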
Objects have individuality, and multiple names (in multiple scopes) can be bound to the same object. This is known as aliasing in other languages. This is usually not appreciated on a first glance at Python, and can be safely ignored when dealing with immutable basic types (numbers, strings, tuples). However, aliasing has a possibly surprising effect on the semantics of Python code involving mutable objects such as lists, dictionaries, and most other types. This is usually used to the benefit of the program, since aliases behave like pointers in some respects. For example, passing an object is cheap since only a pointer is passed by the implementation; and if a function modifies an object passed as an argument, the caller will see the change — this eliminates the need for two different argument-passing mechanisms as in Pascal.
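A short illustration of aliasing with a mutable list (the names are arbitrary):

def add_fee(charges):
    charges.append(25)      # mutates the caller's list through the reference

bill = [100, 200]
alias = bill                # both names are bound to the same list object
add_fee(bill)
print(alias)                # -> [100, 200, 25]: the change is visible via the alias
print(alias is bill)        # -> True: really the same object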
A namespace is a mapping from names to objects. Most namespaces are currently implemented as Python dictionaries, but that's normally not noticeable in any way (except for performance), and it may change in the future. Examples of namespaces are: the set of built-in names (containing functions such as abs(), and built-in exception names); the global names in a module; and the local names in a function invocation. In a sense the set of attributes of an object also forms a namespace. The important thing to know about namespaces is that there is absolutely no relation between names in different namespaces; for instance, two different modules may both define a function maximize without confusion — users of the modules must prefix it with the module name. By the way, the word attribute is used for any name following a dot — for example, in the expression z.real, real is an attribute of the object z. Strictly speaking, references to names in modules are attribute references: in the expression modname.funcname, modname is a module object and funcname is an attribute of it. In this case there happens to be a straightforward mapping between the module's attributes and the global names defined in the module: they share the same namespace! Attributes may be read-only or writable. In the latter case, assignment to attributes is possible. Module attributes are writable: you can write modname.the_answer = 42. Writable attributes may also be deleted with the del statement. For example, del modname.the_answer will remove the attribute the_answer from the object named by modname. Namespaces are created at different moments and have different lifetimes. The namespace containing the built-in names is created when the Python interpreter starts up, and is never deleted. The global namespace for a module is created when the module definition is read in; normally, module namespaces also last until the interpreter quits. The statements executed by the top-level invocation of the interpreter, either read from a script file or interactively, are considered part of a module called __main__, so they have their own global namespace. (The built-in names actually also live in a module; this is called builtins.) The local namespace for a function is created when the function is called, and deleted when the function returns or raises an exception that is not handled within the function. (Actually, forgetting would be a better way to describe what actually happens.) Of course, recursive invocations each have their own local namespace.
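A small sketch of writable module attributes (using the standard math module purely as a convenient module object; the_answer is the example name from the text above):

import math                           # any imported module is a module object

math.the_answer = 42                  # module attributes are writable
print(math.the_answer)                # -> 42

del math.the_answer                   # and deletable with the del statement
print(hasattr(math, "the_answer"))    # -> False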
To speed up loading modules, Python caches the compiled version of each module in the __pycache__ directory under the name module.version.pyc, where the version encodes the format of the compiled file; it generally contains the Python version number. For example, in CPython release 3.3 the compiled version of spam.py would be cached as __pycache__/spam.cpython-33.pyc. This naming convention allows compiled modules from different releases and different versions of Python to coexist. Python checks the modification date of the source against the compiled version to see if it's out of date and needs to be recompiled. This is a completely automatic process. Also, the compiled modules are platform-independent, so the same library can be shared among systems with different architectures. Python does not check the cache in two circumstances. First, it always recompiles and does not store the result for a module that's loaded directly from the command line. Second, it does not check the cache if there is no source module. To support a non-source (compiled-only) distribution, the compiled module must be in the source directory, and there must not be a source module. Some tips for experts: you can use the -O or -OO switches on the Python command to reduce the size of a compiled module. The -O switch removes assert statements; the -OO switch removes both assert statements and docstrings. Since some programs may rely on having these available, you should only use this option if you know what you're doing. "Optimized" modules have an opt- tag and are usually smaller. Future releases may change the effects of optimization.
A program doesn't run any faster when it is read from a .pyc file than when it is read from a .py file; the only thing that's faster about .pyc files is the speed with which they are loaded. The compileall module can create .pyc files for all modules in a directory. There is more detail on this process, including a flow chart of the decisions.
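As a hedged usage note, these standard CPython invocations (shown in the same shell style as the build steps below) exercise the caching and optimization behaviour described above:

% python -m compileall .          # pre-compile every module in the directory
% python -O -m compileall .       # compiled files get an opt-1 tag; asserts removed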
THONNY IDE:
% ./configure
% make
% make install
The configure script supports several common options, for a detailed list, type
% ./configure --help
There are also some compile-time options which can be found in src/Thonny.h. Please see Appendix C for more information. In the case that your system lacks dynamic-linking loader support, you probably want to pass the option --disable-vte to the configure script. This prevents the build from including the embedded virtual terminal (VTE) support.
Thonny detects an already-running instance of itself and opens files from the command line in that instance. Thus, Thonny can be used to view and edit files by opening them from other programs such as a file manager. If you do not like this behaviour for some reason, you can disable use of the first instance with the appropriate command-line option.
As long as a project is open, the Make and Run commands will use the project's settings instead of the defaults. These are used whichever document is currently displayed. The current project's settings are saved when it is closed, or when Thonny is shut down. When restarting Thonny, the project file that was in use at the end of the last session will be reopened.
Execute will run the corresponding executable file, shell script, or interpreted script in a terminal window. Note that the terminal tool path must be correctly set in the Tools tab of the Preferences dialog; you can use any terminal program that runs a Bourne-compatible shell and accepts the "-e" command-line argument to start a command. After your program or script has finished executing, you will be prompted to press the return key. This allows you to review any text output from the program before the terminal window is closed.
By default the Compile and Build commands invoke the compiler and linker with only the basic arguments needed by all programs. Using Set Includes and Arguments you can add any include paths and compile flags for the compiler, any library names and paths for the linker, and any arguments you want to use when running Execute.
Thonny has basic printing support. This means you can print a file by passing the filename of the current file to a command which actually prints the file. However, the printed document contains no syntax highlighting.
CHAPTER 6
RESULTS
CHAPTER 7
CONCLUSION
As countries around the world evolve into more economy-driven societies, stimulating their economies is the goal. Fighting fraudsters and money launderers was quite a complex task before the era of machine learning, but thanks to machine learning and AI we are now able to fight these kinds of attacks. The proposed solution can be used by insurance companies to find out whether a given insurance claim is fraudulent or not. The model was designed after testing multiple algorithms and comparing them to arrive at the best model for detecting fraudulent claims. It is aimed at insurance companies as a pitch, to be tailored further to their own systems. The model should be simple enough to process big datasets, yet complex enough to achieve a decent success rate.
REFERENCES: