Predictive Modeling Using Transactional Data: Financial Services

1 Introduction
2 Using Transactional Data
3 Data Quality
3.1 Data Profiling
4 Cohort and Trend Analysis
5 Model Variable Definition
6 Model Selection
7 Conclusion
1 Introduction
In a world where traditional bases of competitive advantage have dissipated, analytics-driven processes may be one of the few remaining points of differentiation for firms in any industry¹. This is particularly true in financial services, which has progressed rapidly along the analytical path over the last couple of decades.

Analytics can be used to slice and dice historical data to analyze past performance and to produce reports. Here analytics helps firms react to past events. The real benefit of analytics is in using past data to forecast or predict future events, providing firms with a strategic capability to be proactive.
[Figure: Analytics value pyramid, rising from context and information (analysis, OLAP, visualization) through analytics (dashboards, scorecards, monitoring) to knowledge and value (models, forecasting, prediction). Source: Capgemini]
This provides marketing departments with a great tool to optimize their marketing
campaigns, channel performance, customer on-boarding and cross-sell. These
are typically driven by predictive models for customer life-time value, behavioral
segmentation and attrition.
¹ Competing on Analytics: The New Science of Winning, Thomas H. Davenport and Jeanne G. Harris, Harvard Business School Press.
2 Using Transactional Data
A customer’s historical activity typically comprises a few accounts and
transactions around those accounts. For example, a customer may have a checking
and savings account, a mortgage loan and a credit card from a bank. Banks also
offer services like Electronic Bill Pay (EBP) and ATM/debit cards which generate
Electronic Funds Transfer (EFT) transactions.
Data associated with accounts is typically stored in an Accounts Processing (AP) system. AP systems may contain transactions, but usually carry only the last month’s history; prior months’ transactions are reflected in monthly balance snapshots.

Unlike AP data, transaction data is typically maintained as-is in the corresponding transaction processing systems, whether EBP or EFT. Banks may have many months’ or even years’ worth of daily transactional data archived and stored. Therefore, transactional data potentially offers additional levels of insight into a customer’s activity.
The richness of transactional data poses some challenges that need to be addressed
before analytics can derive valuable insights from it. The rest of this paper
details these challenges and possible solutions by referring to a case study as an
illustrative example.
3 Data Quality
As with any kind of data for any kind of analytics, data quality is the first issue to
be tackled. In order to understand the structure of data and identify issues, the key
steps are to perform data profiling and exploratory data analysis.
Data profiling helps identify which columns warrant additional attention from a data quality perspective. The appropriate course of action for each column has to be carefully determined: for some columns, missing values may be replaced by the mean, the mode or a constant, while other columns may simply need to be dropped from the analysis.
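A per-column treatment like the one described above can be sketched as follows. This is a minimal illustration, not the case study's actual implementation; the column names and sample values are hypothetical.

```python
import pandas as pd

def impute_columns(df, strategies):
    """Apply a per-column missing-value strategy chosen during profiling.

    `strategies` maps column name -> "mean", "mode", a constant, or "drop".
    """
    df = df.copy()
    for col, strategy in strategies.items():
        if strategy == "drop":
            # Column judged unusable for analysis: remove it entirely.
            df = df.drop(columns=[col])
        elif strategy == "mean":
            df[col] = df[col].fillna(df[col].mean())
        elif strategy == "mode":
            df[col] = df[col].fillna(df[col].mode().iloc[0])
        else:
            # Any other value is treated as a constant replacement.
            df[col] = df[col].fillna(strategy)
    return df

# Hypothetical profiled data with three kinds of problem columns.
profiled = pd.DataFrame({
    "txn_amount": [10.0, None, 30.0],    # numeric: impute the mean
    "channel":    ["EBP", "EFT", None],  # categorical: impute the mode
    "legacy_id":  [None, None, None],    # entirely missing: drop
})
clean = impute_columns(profiled, {"txn_amount": "mean",
                                  "channel": "mode",
                                  "legacy_id": "drop"})
```

The mapping of column to strategy is exactly the judgment call the profiling step informs: a numeric amount tolerates a mean, a code or category needs a mode or constant, and a column with no signal is better dropped.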
[Figure: Data profiling histograms for the aggregated EBP columns, e.g. BLRORGINDVIDDCNT, BLRTYPEIDDCNT, BLRTIERRNKDCNT, CRCARDTYPEIDDCNT, PYMTDLRAMTSUM, PYMTDLRAMTAVG, RECURPYMTFLGDCNT, FNDGFNCLACTTYPEIDDCNT, EBILLIDDCNT, PYMTRQSTNBRDCNT, RISKOWNIDDCNT, MEDIACTGYIDDCNT, LGCYPOSTCDDCNT and FIRSTPYEESETDTMTHS. Source: Capgemini]
The next step is to look further into the columns at the values represented by
the data and identify any inconsistency. For example, in a transaction file, the
transaction date cannot be earlier than the customer’s account start date. There
may also be subtle issues that cannot be caught by such logic, but can be observed
simply by plotting the corresponding attribute. As an example, the plot below
shows the number of customers who attrited each month from a bank.
In this case, the spike was caused by default values entered for some customers
whose data was migrated from one source system to another. The resolution in this
case was to not rely on the end date provided in the data column, but to define
attrition as a period of inactivity as depicted by the transaction data.
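A consistency check of the kind described above, flagging transactions dated before the customer's account start date, might be sketched as follows. The column names and dates are illustrative, not from the case study.

```python
import pandas as pd

# Hypothetical toy data: transactions joined with account start dates.
txns = pd.DataFrame({
    "cust_id":    [1, 2, 3],
    "txn_date":   pd.to_datetime(["2009-03-15", "2007-12-01", "2009-06-20"]),
    "acct_start": pd.to_datetime(["2008-01-10", "2008-01-10", "2009-01-05"]),
})

# A transaction cannot predate the customer's account start date;
# rows that do usually indicate default-filled or migrated dates.
invalid = txns[txns["txn_date"] < txns["acct_start"]]
```

Rows surfaced this way are candidates for the kind of investigation the migration spike required: the fix may be to correct the date, drop the row, or stop trusting the column altogether.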
This definition also opens up the possibility of defining and detecting lower levels of customer engagement that typically precede attrition. Instead of defining attrition as a period of no activity, it could be defined as a period of declining activity.
[Figure: Number of customers attriting each month, February 2008 to November 2009, based on the end date column; the spike reflects default values entered during the source-system migration. Source: Capgemini]
For transactional data, this step often implies rolling up daily transactions into weekly or monthly aggregates for analysis purposes. For example, EBP data, which contains daily bill-pay transactions for all customers, can produce an aggregation of transactions for each customer per month. These can include the count of transactions, the total dollar amount of transactions and the average dollar amount of transactions. If individual transactions have flag values associated with them, then an aggregate count of flag value occurrences might make sense.
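The rollup described above can be sketched with a group-by aggregation. The column names and sample transactions are illustrative, not the case study's actual schema.

```python
import pandas as pd

# Hypothetical daily EBP bill-pay transactions.
daily = pd.DataFrame({
    "cust_id": [1, 1, 1, 2],
    "txn_date": pd.to_datetime(
        ["2009-01-05", "2009-01-20", "2009-02-03", "2009-01-11"]),
    "amount": [100.0, 50.0, 75.0, 200.0],
    "recurring_flag": [1, 0, 0, 1],
})
daily["month"] = daily["txn_date"].dt.to_period("M")

# Roll daily transactions up to one row per customer per month:
# transaction count, total and average dollar amount, and a count
# of flag occurrences.
monthly = (daily.groupby(["cust_id", "month"])
           .agg(txn_count=("amount", "size"),
                total_amount=("amount", "sum"),
                avg_amount=("amount", "mean"),
                recurring_count=("recurring_flag", "sum"))
           .reset_index())
```

The resulting customer-month grain is the level at which the activity analysis, cohort analysis and model variables in the following sections operate.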
When modeling customer attrition, one of the first steps is to look at periods of inactivity to determine an appropriate definition of attrition. This is sometimes referred to as activity analysis. The example analysis below can be extended to determine that three or more consecutive months of inactivity can be treated as attrition, and that customers with more than 25 transactions per month can be classified as small businesses.
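The inactivity rule above can be sketched in a few lines. The threshold of three months comes from the text; the function and sample series are illustrative.

```python
def is_attriter(monthly_counts, threshold=3):
    """Flag a customer as an attriter if their monthly transaction
    counts contain a run of `threshold` or more consecutive zeros."""
    run = 0
    for count in monthly_counts:
        run = run + 1 if count == 0 else 0
        if run >= threshold:
            return True
    return False

# A customer with four consecutive inactive months is an attriter;
# scattered single-month gaps are not enough.
assert is_attriter([5, 4, 0, 0, 0, 0])
assert not is_attriter([5, 0, 0, 2, 0, 0])
```

Parameterizing the threshold makes it easy to revisit the definition later, for example when moving from "no activity" to "declining activity" as suggested earlier.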
[Figure: Activity analysis histograms with cumulative percentage overlays: count of customers by number of consecutive inactive months (0 to 22 and more) and by monthly transaction count (10 to 250). Source: Capgemini]
4 Cohort and Trend Analysis
Once a prediction segment has been defined (e.g. attriter or high transactor), the
next step is to look at groups of customers that belong to that segment. In the case
of an attrition model, we can identify customers who attrited in each month and
bucket them into a cohort. For example, JAN09 cohort would be customers whose
three consecutive months of inactivity started in January 2009. This approach leads
to a cohort for nearly every month of data in consideration.
It is possible that each cohort is different – i.e. customers who attrited in one month
exhibit different behavior than customers who attrited in another month. Unless
there are seasonal effects, it is usually unlikely that cohorts are significantly different
from each other. To confirm this, one can compare some attributes of attriters and
non-attriters from different cohorts.
In the example below, average monthly transaction counts of attriters and non-attriters are plotted for the 12 months prior to the cohort's month of attrition. The four cohorts chosen are July 2008, January 2009, July 2009 and September 2009.
[Figure: Average monthly transaction counts of attriters and non-attriters (ATT_FLAG) over the 12 months preceding attrition, for the JUL 08, JAN 09, JUL 09 and SEP 09 cohorts. Source: Capgemini]
The plots indicate that there is no significant difference between cohorts – whether
it is across years or across months. In each case, there is a difference in level of
activity between attriters and non-attriters. Also, attriters tend to show declining
activity in months close to attrition. These patterns are consistent across all cohorts.
For example, in the first diagram below, JAN09 cohort had 98 attriters, FEB09
cohort had 105 attriters and so on. Each cohort has 12 months of history that is
considered for analysis. When aggregated, the cohorts stack up as shown in the
bottom diagram.
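The stacking described above can be sketched by re-indexing each customer's history on months relative to their attrition month, so that T-12 through T-1 line up across cohorts. Customer IDs, months and counts below are illustrative.

```python
import pandas as pd

# Hypothetical monthly activity for two attriters from different cohorts.
activity = pd.DataFrame({
    "cust_id": [1] * 3 + [2] * 3,
    "month": pd.PeriodIndex(["2008-11", "2008-12", "2009-01",
                             "2009-01", "2009-02", "2009-03"], freq="M"),
    "txn_count": [4, 2, 0, 6, 3, 0],
})
attrition_month = {1: pd.Period("2009-01", freq="M"),  # JAN09 cohort
                   2: pd.Period("2009-03", freq="M")}  # MAR09 cohort

# rel_month is 0 at T(ATT), -1 at T-1, and so on; once re-indexed
# this way, every cohort occupies the same columns and can be stacked.
activity["rel_month"] = activity.apply(
    lambda r: r["month"].ordinal - attrition_month[r["cust_id"]].ordinal,
    axis=1)
stacked = activity.pivot(index="cust_id", columns="rel_month",
                         values="txn_count")
```

After this alignment, the JAN09 and MAR09 customers sit in the same T-2, T-1, T(ATT) columns even though their calendar months differ, which is what allows cohorts to be aggregated into one training set.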
[Figure: Monthly attrition cohorts for 2009 (JAN09 through SEP09, with 98, 105, 94, 97, 93, 121, 117, 103 and 107 attriters respectively), each spanning 12 months of history across 2008-2009; when aggregated, the cohorts stack on relative months T-12 through T-1 and T(ATT).]
5 Model Variable Definition
Once cohorts are analyzed and combined (if appropriate), the next important step is
to define the set of variables that will be used for modeling.
One obvious set of variables is those associated with the customer rather than with the transactions. These are demographic attributes like Gender, Age, Location and Marital Status. They fluctuate very little over time (except age, which increases steadily) and are sometimes referred to as stock variables.
In contrast, flow variables are derived from transaction activity and change from month to month. Linear trends in flow variables can be captured using two types of variables: one to capture the level of activity (sometimes referred to as the intercept) and one to capture the trend itself (sometimes referred to as the slope). Below is a summary of the types of variable and the analysis performed on each one.
Variable Type: Stock Variable
Description: Static value for the customer during the analysis period
Type of Analysis: Distribution
Example: Age
Used for Modeling: YES
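The intercept and slope variables described above can be derived by fitting a straight line to each customer's monthly activity series. A minimal sketch using NumPy, with an illustrative series:

```python
import numpy as np

def level_and_trend(monthly_counts):
    """Fit a straight line to a monthly activity series and return
    (intercept, slope): the level of activity and its trend."""
    months = np.arange(len(monthly_counts))
    slope, intercept = np.polyfit(months, monthly_counts, deg=1)
    return intercept, slope

# A customer whose activity declines by one transaction per month:
# the level is 6 at the start of the window, the trend is -1 per month.
intercept, slope = level_and_trend([6, 5, 4, 3, 2, 1])
```

A declining slope in the months before T(ATT) is exactly the pattern the cohort plots showed for attriters, which is why the slope is a natural candidate predictor.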
6 Model Selection

Decision trees use a tree-like graph or model of decisions to determine the conditional probability of an outcome (such as attrition). Like logistic regression, they can use both numerical and categorical variables.
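The conditional-probability idea can be illustrated with a single split, the building block a decision tree applies recursively. This is a toy sketch, not a production tree learner; the feature, threshold and data are hypothetical.

```python
def fit_stump(values, labels, threshold):
    """Estimate P(attrition) on each side of one split of a numeric
    feature: the simplest possible 'decision tree' with two leaves."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    p_left = sum(left) / len(left)
    p_right = sum(right) / len(right)
    return p_left, p_right

# Hypothetical data: average monthly transaction count and attrition flag.
avg_txns = [1, 2, 1, 6, 7, 8]
attrited = [1, 1, 0, 0, 0, 0]
p_low, p_high = fit_stump(avg_txns, attrited, threshold=4)
# Low-activity customers attrite far more often than high-activity ones.
```

A real decision tree chooses the feature and threshold automatically (e.g. by information gain) and keeps splitting each leaf, but every leaf still ends up holding a conditional probability like `p_low` and `p_high` here.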
Since there are many possible predictive models to choose from, it is useful to have metrics to compare models and select the best one. Some commonly used metrics are the Receiver Operating Characteristic (ROC) curve, the Cumulative Gains Chart and the Lift Chart. All of these provide metrics by trading off desirable outcomes (i.e. correct predictions) against undesirable outcomes (false positives or false negatives). These metrics are obtained by running the model on the training data set (used to create the model) or on an out-of-sample validation set.
The ROC curve plots the True Positive rate along the y-axis and the False Positive rate along the x-axis. Visually, the higher the curve rises above the 45-degree line, and the closer it is to the top left corner, the better the model.
[Figure: Cumulative % correct prediction plotted against customers ordered by attrition loading (0 to 1,000). Source: Capgemini]
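The ROC construction described above can be sketched by ranking customers by model score and sweeping a threshold, recording the true-positive and false-positive rates at each step. The scores and labels are illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points of the ROC curve for binary labels,
    sweeping the decision threshold from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1  # a correctly captured attriter
        else:
            fp += 1  # a non-attriter flagged by mistake
        points.append((fp / neg, tp / pos))
    return points

points = roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0])
```

A model whose curve hugs the top left corner captures most true positives before accumulating false positives; a random model walks the diagonal.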
Cumulative Gains and Lift Charts are more commonly used by marketing departments, as they allow direct visual comparison and interpretation of results with respect to marketing campaigns.
[Figure: Cumulative Gains Chart, plotting the cumulative proportion of correct predictions captured against the percentage of customers targeted (0% to 100%), compared with the baseline. Source: Capgemini]
The Lift Chart directly shows the gain of using the model versus a no-model approach. For example, in the figure below the model performs about 10 times better when a small percentage of the audience is selected. The effectiveness decreases as the audience widens.
[Figure: Lift Chart, showing the model's lift over the baseline (Lift base) as the selected percentage of the audience grows from 0% to 100%, starting near 10 and declining as the audience widens. Source: Capgemini]
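The lift calculation behind such a chart can be sketched as the attrition rate among the top-scored customers divided by the overall attrition rate. The scores and labels are illustrative.

```python
def lift_at(scores, labels, fraction):
    """Lift at a given depth: the positive rate among the top
    `fraction` of customers (ranked by score) over the base rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# 10 customers, 2 attriters, both scored highest by the model:
# targeting the top 20% finds every attriter, a lift of 5 over
# contacting customers at random.
scores = [0.95, 0.9, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
lift_20 = lift_at(scores, labels, 0.20)
```

As the fraction approaches 100% the numerator converges to the base rate and the lift falls to 1, which is the declining shape the chart shows.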
7 Conclusion
Predictive modeling offers the potential for firms to be proactive rather than
reactive. Predictive modeling using transactional data poses particular challenges
which need to be carefully addressed to create useful models. Some of the key
issues identified in this paper are data quality, cohort and trend analysis, model
variable definition and model selection.
Backed by over three decades of industry and service experience, the Collaborative Business Experience™ is designed to help our clients achieve better, faster, more sustainable results through seamless access to our network of world-leading technology partners and collaboration-

Capgemini reported 2009 global revenues of EUR 8.4 billion and employs over 90,000 people worldwide.

More information about our services, offices and research is available at www.capgemini.com.