ISSN 1330-3651 (Print), ISSN 1848-6339 (Online)
https://doi.org/10.17559/TV-20180720122815
Original scientific paper
Tehnički vjesnik / Technical Gazette 25, 6(2018), 1783-1791

Auto Insurance Business Analytics Approach for Customer Segmentation Using Multiple
Mixed-Type Data Clustering Algorithms

Kai ZHUANG, Sen WU, Xiaonan GAO

Abstract: Customer segmentation is critical for auto insurance companies seeking to gain a competitive advantage by mining useful customer related information. While some efforts have been made to use customer segmentation to support auto insurance decision making, the resulting segmentations tend to be affected by the characteristics of the single algorithm used and lack validation from multiple algorithms. To this end, we propose an auto insurance business analytics approach that segments customers by using three mixed-type data clustering algorithms: k-prototypes, improved k-prototypes and similarity-based agglomerative clustering. The customer segmentation results of these algorithms can complement and reinforce each other and provide as much information as possible to support decision-making. To confirm its practical value, the proposed approach extracts seven rules for an auto insurance company that may support the company in making customer related decisions and developing insurance products.

Keywords: auto insurance; business analytics approach; clustering; customer segmentation; mixed-type data

1 INTRODUCTION

Insurance companies are indispensable to people's lives: they provide appropriate services and improve people's welfare. Since customers are considered a critical factor for insurance companies in producing revenue and improving profitability, how to obtain and keep customers is a major problem for insurance companies [1]. The profitability of insurance companies mainly depends on the services they offer and on meeting customer demand on a regular basis, so a good customer related strategy must be found to analyse customers' features and biases. Customer segmentation is a powerful tool for dividing customers into different clusters and analysing their characteristics. Auto insurance is one of the important components of the insurance industry and is highly profitable. In this paper, we focus on customer segmentation in auto insurance companies.

Currently, most existing research about auto insurance companies focuses on fraud detection [2-4], premium calculation [5], feature selection [6] and customer segmentation [7]. The first three research objectives only support auto insurance companies in accomplishing daily work but do not utilize the mass of customer related data to segment customers and discover helpful information for companies. The few existing studies on insurance customer segmentation utilize only one algorithm to analyse customer related data. A Fuzzy Analytic Network Process (FANP) based weighted RFM (Recency, Frequency, Monetary value) model [8] combines the k-means paradigm to learn hidden knowledge and information by segmenting auto insurance customers. The Auto Insurance Customers Segmentation Intelligent Tool [9], a two-phase model, segments customers of auto insurance companies based on risk. These customer segmentation methods exploit merely one algorithm, which tends to produce defective analysis results shaped by the characteristics of that single algorithm. We want to mine information that reflects real customers rather than the algorithm used. Therefore, we use multiple algorithms to segment customers in auto insurance companies and analyse the characteristics of the different categories of customers to provide customer related strategies for decision makers. In this way, the multiple analysis results can complement and reinforce each other and provide as much information as possible to assist in the decision making of auto insurance companies.

Obtaining labels is costly and time-consuming in the auto insurance industry, while much unlabelled customer related data is ready for analysis; therefore, we analyse the customer related data of auto insurance companies using unsupervised data mining technologies. Clustering algorithms are typical unsupervised learning technologies. Additionally, in real-world tasks, customer related data contains both numerical (e.g. new car purchase price and vehicle age) and categorical (e.g. policy nature and policy state) attributes simultaneously. Hence, we select several different mixed-type data clustering algorithms to segment customers and analyse their characteristics.

The purpose of this paper is to segment the customers of auto insurance companies using multiple mixed-type data clustering algorithms and to analyse the characteristics of different customers. In this way, more convincing and accurate analysis results, not biased by a single clustering algorithm, can be obtained to support decision making.

The main contributions of this paper can be summarized as follows:
(1) An auto insurance business analytics approach is proposed, in which multiple mixed-type data clustering algorithms are utilized to segment customers. Since mixed-type data clustering is a challenging task, there is no perfect method to deal with this problem. Compared with processes that only use one algorithm, the results of different algorithms can complement and reinforce each other, and we can extract more convincing and accurate information to assist in auto insurance decision making.
(2) The practical value of the approach is validated. We exploit a real case to demonstrate our approach and to extract appropriate rules for the auto insurance company. The auto insurance company can develop more appropriate insurance products for different customers based on the extracted rules to keep and attract customers.

2 MIXED-TYPE DATA CLUSTERING ALGORITHMS

We review a few typical techniques for clustering mixed-type data.


Currently, the clustering algorithms for processing mixed-type data can be classified into three types [10, 11]: converting numerical attributes into categorical attributes, converting categorical attributes into numerical attributes, and clustering mixed-type data directly. The first two types may lead to information loss and impact the accuracy of the results [12]. Therefore, we focus on the third type of mixed-type data clustering algorithm.

K-prototypes [13] is a classical mixed-type data clustering algorithm that combines k-means for numerical attribute data and k-modes for categorical attribute data, and it is very cost-effective. Similarity-Based Agglomerative Clustering (SBAC) [14], based on the Goodall similarity metric [15], circumvents parameter determination. Improved k-prototypes [16] can cluster incomplete mixed-type data directly and eliminates the sensitivity to initial prototypes. The Evidence-Based Spectral Clustering algorithm [17] integrates the spectral clustering framework and an evidence-based similarity computation method to cluster mixed-type data. Moreover, several similarity or dissimilarity measures for mixed-type data have been proposed. A dissimilarity for mixed-type data derived from a probabilistic model was proposed in [18]. A unified similarity coefficient [19] was presented based on the importance of the categorical attribute values. A similarity measure [20] between any two mixed-type data objects was proposed based on the uncertainty of the attribute values.

Among the aforementioned algorithms, k-prototypes is very efficient, but its parameter determination is complicated, and its clustering result is sensitive to the initial prototypes. Improved k-prototypes reduces the sensitivity to initial prototypes by determining prototypes based on neighbours. Additionally, SBAC needs no pre-set parameters and clusters mixed-type data very effectively, but its computation time is high. Since these three algorithms can complement each other, we select them to construct the business analytics approach for customer segmentation.

Next, we briefly review these three mixed-type data clustering algorithms. Let X_i be a data object described by m attributes, where m_n and m_c are the numbers of numerical and categorical attributes respectively, m_n + m_c = m, x_{il}^n represents the lth numerical attribute value of X_i and x_{il}^c represents the lth categorical attribute value of X_i.

2.1 K-prototypes

K-prototypes eliminates the numerical-data limitation of k-means but preserves its efficiency by inheriting its paradigm. A dissimilarity measure for mixed-type data is obtained by combining k-means and k-modes. It can be computed as follows:

d(X_i, X_j) = \sum_{l=1}^{m_n} \left(x_{il}^n - x_{jl}^n\right)^2 + \gamma \sum_{l=1}^{m_c} \delta\left(x_{il}^c, x_{jl}^c\right)    (1)

\delta(p, q) = \begin{cases} 0, & p = q \\ 1, & p \neq q \end{cases}    (2)

where \gamma is a weight for categorical attributes that is usually set to the ratio of the number of categorical attributes to the number of all attributes.

The steps of k-prototypes are the same as those of k-means. Firstly, pre-set the number of clusters k and initialize k prototypes. Secondly, allocate each data object to the cluster whose prototype has the minimum dissimilarity to this data object. Thirdly, update the prototypes and reallocate data objects until no prototype can be updated.

2.2 Improved K-prototypes

Improved k-prototypes refines k-prototypes in two aspects. It not only can cluster incomplete data with no need to impute the missing values, but also can avoid the sensitivity to initial prototypes.

Improved k-prototypes defines a new dissimilarity computing method called 'Incomplete Set Mixed Dissimilarity (ISMD)', which computes the dissimilarity between two incomplete mixed-type data objects with no need to impute missing values in advance, thereby avoiding estimation errors. The categorical and numerical attribute dissimilarities between X_i and X_j are defined respectively as follows:

\delta_k(X_i, X_j) = \begin{cases} 1, & x_i^k \neq x_j^k \wedge x_i^k \neq "*" \wedge x_j^k \neq "*" \\ 0, & x_i^k = x_j^k \vee x_i^k = "*" \vee x_j^k = "*" \end{cases}    (3)

d_l(X_i, X_j) = \begin{cases} \dfrac{\left|x_i^l - x_j^l\right|}{Max_l - Min_l}, & x_i^l \neq "*" \wedge x_j^l \neq "*" \\ 0, & x_i^l = "*" \vee x_j^l = "*" \end{cases}    (4)

where \delta_k(X_i, X_j) and d_l(X_i, X_j) respectively represent the dissimilarity between X_i and X_j in categorical attribute k and numerical attribute l, and "*" represents a missing value.

In addition, improved k-prototypes initializes the k prototypes according to neighbours: the k (number of clusters) data objects with the most neighbours are selected as initial prototypes. This method reduces the randomness of the clustering result.

The steps of improved k-prototypes are similar to those of k-prototypes and can be defined in four steps. Firstly, initialize the prototypes based on neighbours. Secondly, allocate each data object to the cluster whose prototype has the minimum dissimilarity to this data object. Thirdly, update the prototypes and reallocate data objects until there is no prototype that can be updated. Finally, fill the missing values based on the clustering result.

Improved k-prototypes not only inherits the effectiveness of k-prototypes, but also eliminates the sensitivity to initial prototypes and computes the dissimilarity of incomplete mixed-type data more accurately.
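To make the two dissimilarity definitions above concrete, the following Python sketch implements Eq. (1)-(2) and the ISMD measures of Eq. (3)-(4). It is only an illustrative reading of the formulas, not the authors' implementation; the missing-value marker "*", the index lists and the way the two ISMD parts are summed into one object-level value are assumptions of the example.

```python
MISSING = "*"  # missing-value marker used in Eq. (3)-(4)

def kprototypes_dissimilarity(x, y, num_idx, cat_idx, gamma):
    """Eq. (1)-(2): squared Euclidean distance on numerical attributes plus
    a gamma-weighted simple-matching count on categorical attributes."""
    d_num = sum((float(x[l]) - float(y[l])) ** 2 for l in num_idx)
    d_cat = sum(0 if x[l] == y[l] else 1 for l in cat_idx)
    return d_num + gamma * d_cat

def ismd_dissimilarity(x, y, num_idx, cat_idx, num_min, num_max):
    """Eq. (3)-(4): Incomplete Set Mixed Dissimilarity. Any comparison that
    involves a missing value contributes 0, so no imputation is required.
    (Summing the two parts into one value is an assumption of this sketch.)"""
    d = 0.0
    for l in cat_idx:                                  # categorical part, Eq. (3)
        if x[l] != MISSING and y[l] != MISSING and x[l] != y[l]:
            d += 1.0
    for l in num_idx:                                  # numerical part, Eq. (4)
        if x[l] != MISSING and y[l] != MISSING:
            value_range = num_max[l] - num_min[l]
            if value_range > 0:
                d += abs(float(x[l]) - float(y[l])) / value_range
    return d

# gamma is usually the ratio of categorical attributes to all attributes,
# e.g. gamma = len(cat_idx) / (len(cat_idx) + len(num_idx))
```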


2.3 Similarity-Based Agglomerative Clustering

Similarity-Based Agglomerative Clustering (SBAC) utilizes a similarity measure proposed by Goodall [15] that needs no pre-set parameters and constructs a dendrogram from which the final clustering result is extracted heuristically.

The similarity CS_{ij}^l between two data objects X_i and X_j in the lth categorical attribute is computed as follows:

CS_{ij}^l = 1 - \sum_{k \in MSFVS\left(x_{il}^c, x_{jl}^c\right)} (p_k)_l^2    (5)

where (p_k)_l is the probability of occurrence of the value x_{kl}^c in the data set, and MSFVS(x_{il}^c, x_{jl}^c) is the set of all pairs of values of the lth categorical attribute that are equally or more similar than the pair (x_{il}^c, x_{jl}^c).

Analogously, the similarity NS_{ij}^l between two data objects X_i and X_j in the lth numerical attribute is computed as follows:

NS_{ij}^l = 1 - \sum_{k, m \in MSFSS\left(x_{il}^n, x_{jl}^n\right)} (p_k)_l (p_m)_l    (6)

where (p_k)_l and (p_m)_l are the probabilities of occurrence of x_{kl}^n and x_{ml}^n, and MSFSS(x_{il}^n, x_{jl}^n) is the set of all pairs of values of the lth numerical attribute that are equally or more similar than the pair (x_{il}^n, x_{jl}^n).

Next, the similarity S_{ij} between X_i and X_j over all attributes can be computed as follows:

g_{ij}^2 = 2 \sum_{l=1}^{m_c} \left(1 - \frac{CD_{ij}^l \ln CD_{ij}^l - CD_{ij}^{l'} \ln CD_{ij}^{l'}}{CD_{ij}^l - CD_{ij}^{l'}}\right) - 2 \sum_{l=1}^{m_n} \ln\left(ND_{ij}^l\right)    (7)

S_{ij} = 1 - e^{-\frac{g_{ij}^2}{2}} \sum_{l=0}^{m_c + m_n - 1} \frac{\left(g_{ij}^2 / 2\right)^l}{l!}    (8)

where CD_{ij}^l and ND_{ij}^l are respectively equal to 1 minus CS_{ij}^l and NS_{ij}^l. Using this similarity measure, the dissimilarity matrix of the data set can be calculated.

Based on the dissimilarity matrix, we can implement SBAC in three steps. Firstly, regard each data object as a cluster and merge the two clusters with the minimum dissimilarity in the matrix into one cluster. Secondly, delete the rows and columns corresponding to the two merged clusters and update the dissimilarity matrix by computing the dissimilarity between the newly merged cluster and each other existing cluster. Finally, repeat the first two steps until all data objects are merged into the same cluster.

After conducting the above steps, a dendrogram is constructed from bottom to top. Next, we need to extract appropriate parts of the dendrogram to construct the final clustering result. A threshold t (a multiple of the dissimilarity of the root node) is given for a depth-first traversal of the dendrogram. When the difference between the dissimilarity of the former node and that of the current node is larger than t, the cluster represented by the current node is selected as one cluster of the clustering result. We then return to the former node and continue the traversal until there is no node left to traverse. Finally, the clustering result is obtained.

3 BUSINESS ANALYTICS APPROACH FOR CUSTOMER SEGMENTATION

We propose a business analytics approach for customer segmentation which exploits three clustering algorithms to segment customers and recognize the characteristics of different customers by analysing the categories of customers who have purchased auto insurance products.

The procedure of the business analytics approach, shown in Fig. 1, is divided into three phases: (a) data collection and preparation, (b) customer segmentation and (c) customer characterization and integration. Next, we summarize the phases of the approach.

Figure 1 Business analytics approach for customer segmentation (flowchart: data collection and preparation — input sales data and relevant data, imputation, transformation; customer segmentation — segmentation using k-prototypes, improved k-prototypes and SBAC, followed by evaluation and the optimal clustering result of each algorithm; customer characterization and integration — representative data object of each cluster, integration of clustering results, extracted rules)

3.1 Data Collection and Preparation

For auto insurance companies, sales data is the core data set for customer segmentation; it contains products, turnover, purchase time, customer information and so on, and reflects the essential information of an auto insurance transaction. Moreover, if more relevant data of other types can be obtained, it is a good way to expand the study data beyond sales data. In this approach, sales data is indispensable, and other relevant data is encouraged in order to enrich the study data.

In reality, the sales data and other relevant data of auto insurance companies are usually incomplete and mixed-type. Thus, data preparation is required, which includes data imputation for imputing missing data and data transformation for digitizing categorical attribute values.
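As a minimal illustration of the imputation part of this preparation step (a sketch with hypothetical column lists, not the authors' pipeline), the mode can be used for categorical attributes and the mean for numerical ones:

```python
import pandas as pd

def impute(df, categorical_cols, numerical_cols):
    """Fill missing values: mode for categorical attributes,
    mean for numerical attributes (Section 3.1)."""
    df = df.copy()
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    for col in numerical_cols:
        df[col] = df[col].fillna(df[col].mean())
    return df
```

As noted later in Section 4.1, such an imputed copy would only be needed for k-prototypes and SBAC; improved k-prototypes works on the incomplete data directly.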


3.2 Customer Segmentation

This phase is essential for the approach. It performs the following four tasks: customer segmentation using k-prototypes, customer segmentation using improved k-prototypes, customer segmentation using SBAC, and evaluation, so that convincing and accurate results can be obtained.

First, use the k-prototypes algorithm to segment customers; the number of clusters k should be pre-set. To ensure accurate customer segmentation and analysis, the optimal clustering result is always expected. k is chosen according to the data size and the characteristics of the auto insurance company. If there is no experience for determining k, we suggest setting a wider range of k and then selecting the optimal clustering result in the following evaluation step.

Next, use improved k-prototypes to segment customers; the number of clusters k also needs to be pre-set. Similarly, k is chosen according to the data size and the characteristics of the auto insurance company and can be given a large range if there is no prior idea of how to set it.

For SBAC, to obtain the final clustering result we have to extract the appropriate parts of the dendrogram. The threshold t is used to perform this extraction and needs to be pre-set in advance. The determination of t depends on the constructed dendrogram and the dissimilarity of each node. We encourage setting a wider range of t for selecting the optimal clustering result.

After all clustering results are prepared, we need to select the optimal clustering result of each algorithm. Since the labels of the real auto insurance data are unknown, an internal cluster validation metric should be selected to find the optimal clustering result.

3.3 Customer Characterization and Integration

Here, the optimal clustering results have been extracted, and next we need to identify the final customer segmentation and analyse the characteristics of each segment to mine new information and knowledge for decision support. We suggest computing the representative data object of each cluster: the mode values of the categorical attributes and the mean values of the numerical attributes are calculated as the representative attribute values of each cluster. Furthermore, the most important attribute, the one that directly reflects profitability, should be selected to guide companies or researchers in identifying the property of a cluster.

The characteristics of different clusters can be discovered by analysing the representative data object of each cluster. Then we integrate the clustering results and extract customer related rules for decision support. We assume that characteristics on which two or more clustering results are consistent should be noted and analysed; the corresponding rules are extracted after further discussion. For example, if the clustering results of k-prototypes and SBAC show that new car purchase price is always the highest when accumulative paid-in amount is the highest, and after analysis this is reasonable, we can extract the rule that in the cluster corresponding to the highest accumulative paid-in amount, new car purchase price is always the highest. In this way, the auto insurance company should focus on the customers who own luxury cars. On this basis, the auto insurance company can develop more appropriate customer related strategies to keep and attract customers.
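A minimal sketch of the representative data object computation described above, assuming the data is held in a pandas DataFrame and cluster labels are available (the column names and layout are illustrative, not the authors' code):

```python
import pandas as pd

def representative_objects(df, labels, categorical_cols, numerical_cols):
    """One representative data object per cluster: mode of each categorical
    attribute and mean of each numerical attribute (Section 3.3)."""
    data = df.assign(cluster=labels)
    reps = {}
    for cluster_id, group in data.groupby("cluster"):
        rep = {col: group[col].mode().iloc[0] for col in categorical_cols}
        rep.update({col: group[col].mean() for col in numerical_cols})
        rep["Object number"] = len(group)
        reps[cluster_id] = rep
    # columns = clusters, rows = attributes, mirroring the layout of Tab. 3-8
    return pd.DataFrame(reps)
```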
Table 1 Valuable attributes: Policy nature, Policy state, Vehicle type, Seating capacity, License plate color, Document type, Insurance type, New car purchase price, Vehicle age, Accumulative paid-in amount, Written premium, Tonnage

Table 2 Comparison table of categorical attribute digitization
Policy nature: 0 = New insurance, 1 = Renewal of insurance
Policy state: 0 = Effective, 1 = Insurance cancellation
Vehicle type: 0 = Passenger car with 6 or less seats, 1 = Passenger car with more than 6 and less than 10 seats, 2 = Passenger car with more than 10 and less than 20 seats
License plate color: 0 = Blue, 1 = Yellow, 2 = Black, 3 = Others
Document type: 0 = Insurance policy, 1 = Insurance cancellation, 2 = Information correction
Insurance type: 0 = Mandatory traffic liability insurance, 1 = Motor vehicle comprehensive insurance clauses, 2 = Shenxing auto insurance motor auto insurance
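The digitization of Table 2 amounts to a simple lookup. A hedged sketch follows (the dictionary keys take the value spellings from Table 2; Seating capacity is left untouched because it is already numeric, and the helper name is illustrative):

```python
# Integer codes for the categorical attributes, following Table 2.
DIGITIZATION = {
    "Policy nature": {"New insurance": 0, "Renewal of insurance": 1},
    "Policy state": {"Effective": 0, "Insurance cancellation": 1},
    "Vehicle type": {
        "Passenger car with 6 or less seats": 0,
        "Passenger car with more than 6 and less than 10 seats": 1,
        "Passenger car with more than 10 and less than 20 seats": 2,
    },
    "License plate color": {"Blue": 0, "Yellow": 1, "Black": 2, "Others": 3},
    "Document type": {
        "Insurance policy": 0,
        "Insurance cancellation": 1,
        "Information correction": 2,
    },
    "Insurance type": {
        "Mandatory traffic liability insurance": 0,
        "Motor vehicle comprehensive insurance clauses": 1,
        "Shenxing auto insurance motor auto insurance": 2,
    },
}

def digitize(record):
    """Replace categorical values in a dict-like record with their codes;
    attributes not listed above (e.g. Seating capacity) pass through."""
    return {k: DIGITIZATION.get(k, {}).get(v, v) for k, v in record.items()}
```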

4 CASE: CUSTOMER SEGMENTATION IN AN AUTO INSURANCE COMPANY

In this section we demonstrate the practical value of our approach through a real case.

4.1 Data Collection and Preparation

The data for the case concern vehicle insurance sales from an auto insurance company and include comprehensive attributes of all vehicle insurances in the company since 2014. There are a total of 25738 objects, and each object is described by 46 attributes. Since some attributes have no value for clustering analysis, it is necessary to identify the useful attributes. In this case, a total of 34 unrelated attributes are eliminated, including 18 sensitive information attributes, 5 attributes with a unique value, 4 code number attributes, 4 unrelated date attributes and 3 meaning-repetition attributes. Twelve valuable attributes are chosen and shown in Table 1.

Among these attributes, seven are categorical, namely Policy nature, Policy state, Vehicle type, Seating capacity, License plate color, Document type and Insurance type, and they need to be digitized before further analysis. In particular, since Seating capacity is originally described by digits, this attribute does not need to be digitized. The comparison table of categorical attribute digitization is shown in Table 2, and the categorical attributes listed in this table are only six, excluding Seating capacity.

Additionally, the categorical and numerical missing values are imputed respectively by the mode values and mean values of the corresponding attributes. In particular, the imputed dataset is only used for k-prototypes and SBAC.


4.2 Customer Segmentation

Since improved k-prototypes and SBAC require higher time complexity, 500 data objects are extracted randomly as one group for the experiments in this case, and each algorithm undergoes 20 groups.

In this case, we exploit the Silhouette index (S) [21] to evaluate the effectiveness of the clustering results. It is given as follows:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}    (9)

where a(i) is the dissimilarity between data object i and its own cluster, and b(i) is the dissimilarity between data object i and its neighbouring cluster. Obviously, the closer s(i) is to one, the better the data object i is clustered. The average s(i) over all data in the dataset reflects the performance of a clustering result.

(1) Customer segmentation using k-prototypes
In this case, k is given nine values, respectively 2, 3, 4, 5, 6, 7, 8, 9 and 10, for each group, so k-prototypes undergoes a total of 180 runs. The Silhouette index is utilized to select the optimal clustering result in each group experiment for analysis.
Fig. 2 shows that the optimal clustering results of k-prototypes are scattered. Different initial prototypes lead to different optimal clustering results. Next, we analyse several representative clustering results, which are selected from all optimal clustering results due to the limit of space. We select the k-prototypes clustering results when the number of clusters (k) is equal to 2 and 7. These clustering results are respectively given in Tab. 3 and Tab. 4, in which each cluster is represented by its representative data object.

Table 3 Clustering result of k-prototypes algorithm – k = 2
k=2 Cluster1 Cluster2
Policy nature 0 0
Policy state 0 0
Vehicle type 0 0
Seating capacity 5 5
License plate color 0 0
Document type 0 0
Insurance type 1 0
New car purchase price 463195 83289.39
Vehicle age 0.92828 1.943439
Accumulative paid-in amount 7565.74 1608.708
Written premium 7565.74 1608.708
Tonnage 0 0
Object number 29 471

For auto insurance companies, the higher the order premium, the higher the profit for the company. Since 'Accumulative paid-in amount' and 'Written premium' can represent the order premium, these two attributes should be used to identify the property of a cluster.
In the tables, the values that are worth paying attention to are in bold.

Table 4 Clustering result of k-prototypes algorithm – k = 7


k=7 Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7
Policy nature 0 0 0 0 0 0 0
Policy state 0 0 0 0 0 0 0
Vehicle type 0 0 0 0 0 0 1
Seating capacity 5 5 5 5 5 5 8
License plate color 0 0 0 0 0 0 0
Document type 0 0 0 0 0 0 0
Insurance type 0 0 1 0 0 0 0
New car purchase price 40263.3 120698.9 358824 198788.4 745696.7 70857.64 28184
Vehicle age 1.99333 1.737303 1.04842 1.824364 1.785 1.443493 3.79543
Accumulative paid-in amount 1082.38 1801.168 4854.02 3243.607 3877.767 1622.538 958.6
Written premium 1082.38 1801.168 4854.02 3243.607 3877.767 1622.538 958.6
Tonnage 0 0 0 0 0 0 0.01429
Data object number 81 89 19 55 12 209 35
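The Silhouette-based selection of Eq. (9) can be expressed with an off-the-shelf implementation. The sketch below picks the best clustering result for one 500-object group from a precomputed dissimilarity matrix; how that matrix is built (e.g. with the mixed dissimilarities of Section 2) and the function names are assumptions of the example, not the authors' code.

```python
from sklearn.metrics import silhouette_score

def select_optimal_result(dissimilarity, clusterings):
    """Pick the clustering with the highest average silhouette, cf. Eq. (9).

    dissimilarity: (n, n) precomputed dissimilarity matrix for one group
    clusterings:   dict mapping a parameter value (e.g. k) to a label array
    """
    scores = {
        param: silhouette_score(dissimilarity, labels, metric="precomputed")
        for param, labels in clusterings.items()
        if len(set(labels)) > 1   # the silhouette needs at least two clusters
    }
    best = max(scores, key=scores.get)
    return best, scores
```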

Figure 2 Silhouette index values of k-prototypes clustering results

(2) Customer segmentation using improved k-prototypes
For improved k-prototypes, k is also given nine values, respectively 2 to 10, for each group, so improved k-prototypes undergoes a total of 180 runs. Fig. 3 shows that the optimal clustering results are concentrated at k = 2 or 3 and are very stable, which is related to the prototype initialization method based on the number of nearest neighbours. Therefore, the randomness of selection is avoided, and the clustering result is more stable. The two optimal clustering results at k = 2 and 3 are respectively given in Tab. 5 and Tab. 6.

(3) Customer segmentation using SBAC
In this case, t is given ten values, respectively 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09 and 0.1 multiples of the dissimilarity of the root node D(root), for each group, so the SBAC algorithm undergoes a total of 200 runs.


Fig. 4 shows that the optimal clustering results of some groups are mainly concentrated at t = 0.02, 0.03 and 0.04 multiples of D(root), while the optimal clustering results of the other groups appear when t takes 0.09 and 0.1 multiples of D(root), as shown at the bottom of the figure. However, since the Silhouette index values of the first situation are much larger than those of the second, we select the optimal clustering results at t = 0.02 and 0.03 multiples of D(root) and exhibit them in Tab. 7 and Tab. 8 respectively.
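To illustrate how a threshold t expressed as a multiple of D(root) yields a clustering, the sketch below implements the traversal rule described in Section 2.3 on a SciPy linkage matrix. Storing the SBAC merge history as a linkage matrix is an assumption of this example, and the actual SBAC implementation may differ.

```python
from scipy.cluster.hierarchy import to_tree

def extract_clusters(Z, t_fraction):
    """Cut a dendrogram (SciPy linkage matrix Z): descend from the root and,
    when the drop in merge dissimilarity between a node and its parent
    exceeds t = t_fraction * D(root), keep that node as one final cluster."""
    root = to_tree(Z)
    threshold = t_fraction * root.dist        # t as a multiple of D(root)
    clusters = []

    def visit(node, parent_dist):
        if node.is_leaf() or (parent_dist - node.dist) > threshold:
            clusters.append(node.pre_order())  # indices of the cluster members
        else:
            visit(node.get_left(), node.dist)
            visit(node.get_right(), node.dist)

    visit(root.get_left(), root.dist)
    visit(root.get_right(), root.dist)
    return clusters

# e.g. clusters = extract_clusters(Z, 0.02)    # t = 0.02 multiple of D(root)
```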

Table 5 Clustering result of improved k-prototypes algorithm – k = 2


k=2 Cluster1 Cluster2
Policy nature 0 0
Policy state 0 0
Vehicle type 0 0
Seating capacity 5 5
License plate color 0 0
Document type 0 0
Insurance type 0 1
New car purchase price 67282.82 158789.9
Vehicle age 2.23488 1.397799
Accumulative paid-in amount 852.8802 3570.466
Written premium 852.8802 3570.466
Tonnage 0 0.005789
Object number 291 209

Figure 3 Silhouette index values of improved k-prototypes clustering results

Table 6 Clustering result of improved k-prototypes algorithm – k = 3


k=3 Cluster1 Cluster2 Cluster3
Policy nature 0 0 0
Policy state 0 0 0
Vehicle type 0 0 0
Seating capacity 5 5 5
License plate color 0 0 0
Document type 0 0 0
Insurance type 1 0 0
New car purchase price 181747 98958.19 65555
Vehicle age 0.83919 4.386443 0.67399
Accumulative paid-in amount 4273.223 1098.981 938.957
Written premium 4273.223 1098.981 938.957
Tonnage 0 0 0

Figure 4 Silhouette index values of SBAC clustering results

Table 7 Clustering result of SBAC algorithm – t = 0.02 multiple of D(root)


Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6
Policy nature 0 1 1 0 0 0
Policy state 0 0 0 0 1 0
Vehicle type 0 1 0 0 0 0
Seating capacity 5 9 5 5 5 5
License plate color 0 0 0 0 0 0
Document type 0 0 0 0 1 0
Insurance type 0 0 2 1 2 2
New car purchase price 95544.08 59800 104732.1 306405 150313.3 195310
Vehicle age 1.908996 6 2.265455 0 2.64 4.8325
Accumulative paid-in amount 1900.8 1055.97 2989.338 6942.06 1083.542 5647.64
Written premium 1900.8 1055.97 2989.338 6942.06 1083.542 5647.64
Tonnage 0 0 0 0 0 0
Object number 468 2 11 2 6 4

Table 8 Clustering result of SBAC algorithm – t = 0.03 multiple of D(root)


Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6 Cluster7
Policy nature 0 1 1 1 0 1 1
Policy state 0 1 0 0 0 0 0
Vehicle type 0 0 0 0 0 0 0
Seating capacity 5 5 5 5 5 5 5
License plate color 0 0 0 0 0 0 0
Document type 0 0 0 0 0 0 0
Insurance type 0 2 2 2 1 2 2
New car purchase price 102737.5 68273.33 97020 94320 187020 139660 175815
Vehicle age 1.814574 4.223333 2.5 4.5 4 2.71 2
Accumulative paid-in amount 1915.358 1839.56 2280.4 2563.33 4832.86 3496.06 3703.61
Written premium 1915.358 1839.56 2280.4 2563.33 4832.86 3496.06 3703.61
Tonnage 0.00252 0 0 0 0 0 0
Object number 481 3 2 2 2 2 2


4.3 Customer Characterization and Integration

In this subsection, we analyse the characteristics of each selected clustering result and extract rules by integrating the clustering results of the different algorithms. Furthermore, the effectiveness of each clustering result is discussed to verify the advantage of multi-algorithm analysis.

(1) The rules extracted from the customer segmentation results

1) In the cluster corresponding to the highest accumulative paid-in amount, new car purchase price is always the highest.
Tab. 3 to Tab. 8 show that new car purchase price is always the highest when accumulative paid-in amount is the highest. Similarly, accumulative paid-in amount is also the highest when new car purchase price is the highest. This indicates that luxury car customers produce more profit for the insurance company; they are important customers that should be followed by the insurance company.

2) In the cluster corresponding to the lowest accumulative paid-in amount, new car purchase price is the lowest.
Tab. 3 to Tab. 8 show that new car purchase price is always the lowest when accumulative paid-in amount is the lowest. Similarly, accumulative paid-in amount is also the lowest when new car purchase price is the lowest. This indicates that low-end car customers produce less profit for the insurance company.

3) When the insurance type of a cluster is motor vehicle comprehensive insurance clauses, accumulative paid-in amount is always the highest.
Tab. 3 to Tab. 8 show that accumulative paid-in amount is always the highest when the insurance type is motor vehicle comprehensive insurance clauses. This indicates that customers purchasing motor vehicle comprehensive insurance clauses always produce more profit for the insurance company. The insurance company should focus on these kinds of customers and actively recommend this insurance type to luxury car customers, thereby expanding this customer group and producing more income for the company.

4) When the vehicle type of a cluster is 6-10 seats, the accumulative paid-in amount is lower.
Tab. 4 and Tab. 7 show that accumulative paid-in amount is always the lowest or lower when the vehicle type is 6-10 seats. Cars with 6-10 seats mostly belong to micro-buses, whose new car purchase price is also lower, and the owners of such cars will not purchase expensive insurance types; therefore they produce less profit for the insurance company.

5) In the cluster corresponding to the highest accumulative paid-in amount, the number of data objects is smaller.
Tab. 3 to Tab. 4 and Tab. 7 to Tab. 8 show that the number of data objects is smaller when accumulative paid-in amount is higher. When accumulative paid-in amount is the highest, the new car purchase price is the highest, and fewer customers can afford to purchase luxury cars, so the number of data objects in the cluster with the highest accumulative paid-in amount is smaller. However, Tab. 5 shows that the number of data objects in the cluster with high accumulative paid-in amount is not prominently smaller than in the cluster with low accumulative paid-in amount, which is related to the reduction of clustering result randomness in the improved k-prototypes algorithm. In addition, compared with the other two algorithms, the Silhouette index value of the improved k-prototypes clustering result is lower. Therefore, we consider Tab. 3 to Tab. 4 and Tab. 7 to Tab. 8 more convincing.

6) In the cluster corresponding to the lowest accumulative paid-in amount, the number of data objects is larger.
Tab. 3 to Tab. 4 and Tab. 7 to Tab. 8 show that the number of data objects is always the largest or larger when accumulative paid-in amount is the lowest. According to the second rule, customers who purchase low-end cars contribute a low accumulative paid-in amount to the company. That is in line with reality: the number of customers purchasing low-end cars is prominently larger than the number of customers purchasing luxury cars.

7) When vehicle ages are similar, the accumulative paid-in amount is high if the new car purchase price is high.
Tab. 4, Tab. 6 and Tab. 8 show that when vehicle ages are similar, accumulative paid-in amount is higher if the new car purchase price is higher. This indicates that the insurance company should focus on luxury car customers when gaining new car customers.

(2) The effectiveness of the clustering results of the different algorithms

Fig. 2 shows that the optimal clustering results of k-prototypes are more scattered. The optimal clustering results appear when the number of clusters k = 2, 4, 5, 6, 7, 8 or 10, which is related to the randomly selected prototypes. Fortunately, we can find different rules from the different results. Universal rules can be discovered from the results with a smaller number of clusters, such as the positive correlation between accumulative paid-in amount and new car purchase price. Detailed rules can be discovered from the results with a greater number of clusters, for example, that when the vehicle ages are similar, there is a positive correlation between new car purchase price and accumulative paid-in amount.

Fig. 3 shows that the optimal clustering results of improved k-prototypes are the most stable; they are distributed at the number of clusters k = 2 or 3, which is related to the prototype initialization method. However, the scales of all clusters in the improved k-prototypes clustering results are even: the scale of customers purchasing luxury cars and contributing more profit to the company is similar to that of customers purchasing low-end cars and contributing less profit. This contradicts the results of the other two algorithms regarding the scales of the clusters. Since the Silhouette index values of this algorithm's clustering results are worse than those of k-prototypes and SBAC, we consider the results of k-prototypes and SBAC more accurate.

Fig. 4 shows that the optimal clustering results of SBAC are mainly distributed in the parts with a smaller threshold t, and the Silhouette index values of the clustering results are higher compared with the other two algorithms. Moreover, the analysis of the optimal clustering results of SBAC is consistent with that of the other algorithms. Therefore, the SBAC algorithm has better clustering effectiveness for the auto insurance data in this case.

After analysing the clustering effectiveness, we can find that each customer segmentation result is affected by the characteristics of the algorithm used.

We cannot obtain a convincing customer segmentation result if only one algorithm is used for the cluster analysis. Therefore, we utilize three clustering algorithms to segment customers, and more convincing and accurate customer segmentation results can be obtained to support decision-making.

5 CONCLUSION

In this paper, we investigate how to mine more convincing and accurate customer related information for auto insurance companies through customer segmentation. Along this line, we propose an auto insurance business analytics approach for customer segmentation that exploits three mixed-type data clustering algorithms, namely k-prototypes, improved k-prototypes and SBAC, to segment customers. In this way, the clustering results of these algorithms can complement and reinforce each other, and we can obtain as much information as possible to support the customer related decision making of auto insurance companies.

The practical value of this work is confirmed in the fourth section. On one side, seven useful rules are extracted for the auto insurance company. Rules 1), 3), 5) and 7) show the characteristics of customers who produce more profit for the auto insurance company; the company should actively expand the number of such customers, thereby gaining more profit. Rules 2), 4) and 6) display the characteristics of the large number of customers producing less profit for the company; the insurance company should also maintain them and convert them into the kind of customers that produce more profit. On the other side, the clustering effectiveness of the different algorithms is also analysed to verify the validity of the approach. The case shows that a more convincing and accurate customer segmentation result can be obtained by utilizing multiple algorithms.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 71271027.

6 REFERENCES

[1] Matiş, C. & Ilieş, L. (2014). Customer relationship management in the insurance industry. Procedia Economics and Finance, 2014(15), 1138-1145. https://doi.org/10.1016/S2212-5671(14)00568-1
[2] Kašćelan, L., Kašćelan, V., & Novović Burić, M. (2014). A Data Mining Approach for Risk Assessment in Car Insurance: Evidence from Montenegro. International Journal of Business Intelligence Research (IJBIR), 5(3), 11-28. https://doi.org/10.4018/ijbir.2014070102
[3] Ke, N., Zhang, H., Tayal, A., Coleman, T., & Li, Y. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly. Journal of Finance & Data Science, 2(1), 58-75. https://doi.org/10.1016/j.jfds.2016.03.001
[4] Bhowmik, R. (2011). Detecting Auto Insurance Fraud by Data Mining Techniques. Journal of Emerging Trends in Computing & Information Sciences, 2(4), 156-162.
[5] David, M. (2015). Auto Insurance Premium Calculation Using Generalized Linear Models. Procedia Economics & Finance, 2015(20), 147-156. https://doi.org/10.1016/S2212-5671(15)00059-3
[6] Kang, S. & Song, J. (2017). Feature selection for continuous aggregate response and its application to auto insurance data. Expert Systems with Applications, 2017(93), 104-117.
[7] Griva, A., Bardaki, C., Pramatari, K., & Papakiriakopoulos, D. (2018). Retail business analytics: customer visit segmentation using market basket data. Expert Systems with Applications, 2018(100), 1-16. https://doi.org/10.1016/j.eswa.2018.01.029
[8] Ravasan, A. Z. & Mansouri, T. (2015). A Fuzzy ANP Based Weighted RFM Model for Customer Segmentation in Auto Insurance Sector. International Journal of Information Systems in the Service Sector, 2(7), 71-86. https://doi.org/10.4018/ijisss.2015040105
[9] Hanafizadeh, P. & Rastkhiz Paydar, N. (2013). A Data Mining Model for Risk Assessment and Customer Segmentation in the Insurance Industry. International Journal of Strategic Decision Sciences, 1(4), 52-78. https://doi.org/10.4018/jsds.2013010104
[10] Du, M., Ding, S., & Xue, Y. (2017). A Novel Density Peaks Clustering Algorithm for Mixed Data. Pattern Recognition Letters, 2017(97), 46-53. https://doi.org/10.1016/j.patrec.2017.07.001
[11] Wangchamhan, T., Chiewchanwattana, S., & Sunat, K. (2017). Efficient algorithms based on the k-means and Chaotic League Championship Algorithm for numeric, categorical, and mixed-type data clustering. Expert Systems with Applications, 2017(90), 146-167. https://doi.org/10.1016/j.eswa.2017.08.004
[12] Skabar, A. (2017). Clustering Mixed-Attribute Data using Random Walk. Procedia Computer Science, 2017(108), 988-997. https://doi.org/10.1016/j.procs.2017.05.083
[13] Huang, Z. (1997). Clustering Large Data Sets with Mixed Numeric and Categorical Values. In 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 21-34.
[14] Li, C. & Biswas, G. (2002). Unsupervised Learning with Mixed Numeric and Nominal Data. IEEE Transactions on Knowledge & Data Engineering, 14(4), 673-690. https://doi.org/10.1109/TKDE.2002.1019208
[15] Goodall, D. W. (1966). A New Similarity Index Based on Probability. Biometrics, 22(4), 882-907. https://doi.org/10.2307/2528080
[16] Wu, S., Chen, H., & Feng, X. (2013). Clustering Algorithm for Incomplete Data Sets with Mixed Numeric and Categorical Attributes. International Journal of Database Theory & Application, 6(5), 95-104. https://doi.org/10.14257/ijdta.2013.6.5.09
[17] Luo, H., Kong, F., & Li, Y. (2006). Clustering Mixed Data Based on Evidence Accumulation. Advanced Data Mining and Applications, 2006(4093), 348-355.
[18] Chiu, T., Fang, D. P., Chen, J., Wang, Y., & Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 263-268. https://doi.org/10.1145/502512.502549
[19] Cheung, Y. M. & Jia, H. (2013). Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number. Pattern Recognition, 46(8), 2228-2238. https://doi.org/10.1016/j.patcog.2013.01.027
[20] Chen, H. L., Chuang, K. T., & Chen, M. S. (2008). On data labeling for clustering categorical data. IEEE Transactions on Knowledge & Data Engineering, 20(11), 1458-1472. https://doi.org/10.1109/TKDE.2008.81

[21] Rousseeuw, P. (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational & Applied Mathematics, 20(20), 53-65. https://doi.org/10.1016/0377-0427(87)90125-7

Contact information:

Kai ZHUANG, PhD candidate


Donlinks School of Economics and Management
University of Science and Technology Beijing
30 Xueyuan Road, Haidian District, Beijing 100083, China
[email protected]

Sen WU, PhD, Full Professor


(Corresponding author)
Donlinks School of Economics and Management
University of Science and Technology Beijing
30 Xueyuan Road, Haidian District, Beijing 100083, China
[email protected]

Xiaonan GAO, PhD candidate


Donlinks School of Economics and Management
University of Science and Technology Beijing
30 Xueyuan Road, Haidian District, Beijing 100083, China
[email protected]

