Alternate Data
Physical copies of bank statements have long been used for manual underwriting in consumer lending. However, this information typically does not flow into a credit scoring engine as a
feature. In the digital lending paradigm, bank statements are being digitized and their information is being used for credit scoring.
Alternative Data Feature Store
Automated Feature Engineering
Feature Primitives
Feature Synthesis
Pattern Matching
Expert Judgment
Raw data points are transformed into features using Feature Synthesis (applying a library of transformations to raw data) and Feature Mining using NLP (e.g. extraction of
features from text data such as SMS and email), with an overlay of expert judgement.
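As a minimal sketch of Feature Synthesis, assuming raw data arrives as lists of values per customer and field: the primitive library, the field names (`credit_amount`, `debit_amount`) and the function `synthesize_features` are hypothetical and only illustrate the idea of applying a library of transformations to raw data.

```python
from statistics import mean

# Hypothetical library of aggregation primitives (names are illustrative).
PRIMITIVES = {
    "sum": sum,
    "mean": mean,
    "max": max,
    "count": len,
}


def synthesize_features(raw, primitives=PRIMITIVES):
    """Apply every primitive to every raw field, per customer.

    raw: {customer_id: {field_name: [values, ...]}}
    returns: {customer_id: {"<primitive>(<field>)": value}}
    """
    features = {}
    for customer_id, fields in raw.items():
        row = {}
        for field_name, values in fields.items():
            if not values:
                continue  # skip empty fields rather than fail on mean/max
            for prim_name, prim in primitives.items():
                row[f"{prim_name}({field_name})"] = prim(values)
        features[customer_id] = row
    return features


# Toy input, invented for illustration only.
raw_data = {
    "C1": {
        "credit_amount": [5000.0, 5200.0],
        "debit_amount": [1200.0, 800.0, 400.0],
    },
}
features = synthesize_features(raw_data)
```

In practice the synthesized candidates would still pass through feature selection and the expert-judgement overlay described above before entering a model.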
Illustrative Feature Mining from SMS Data using NLP
Automated Feature Mining
SMS: SMS1, SMS2, SMS3, SMS4, SMS5
SMS categories
• L1 such as Savings, Current, Debit Card, Credit Card, E-Wallets etc.
• L2 such as Savings > Salary, Spend, Balance, Investment, Loan / EMI related, Account Info
• Process: NLP based classification (SMS embeddings using neural networks)
Standard information from each SMS, such as ID, Amount, Transaction Type, Date etc.
• Process: Pattern matching based data extraction rules
L1 and L2 level data at customer level (Id / pool), to generate features for model training, such as:
• Monthly Income
• Total Loans O/s
• Total EMI
• Expected Monthly Spend and Savings
• Delinquency pattern
• Process: Feature engineering by data science team
Customer Risk Score
• Customer1 0.99
• Customer2 0.80
• Customer3 0.50
• Customer4 0.25
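The pattern-matching and aggregation stages of this flow can be sketched as follows. This is an illustration under stated assumptions, not the actual extraction rules: the SMS wording, the regular expressions, and the function names are all hypothetical, and the keyword lookup in `classify_l2` is only a stand-in for the NLP classifier (SMS embeddings with neural networks) named on the slide.

```python
import re
from collections import defaultdict

# Hypothetical extraction rules; real SMS formats vary by bank.
TXN_PATTERN = re.compile(
    r"(?P<type>credited|debited)\D*?(?P<amount>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{4}")


def classify_l2(text):
    """Keyword stand-in for the NLP-based L2 classification."""
    upper = text.upper()
    if "SALARY" in upper:
        return "Salary"
    if "EMI" in upper:
        return "Loan / EMI"
    return "Spend"


def extract(text):
    """Extract type, amount, date and L2 category; None if no transaction."""
    match = TXN_PATTERN.search(text)
    if match is None:
        return None
    date = DATE_PATTERN.search(text)
    return {
        "type": match.group("type").lower(),
        "amount": float(match.group("amount")),
        "date": date.group(0) if date else None,
        "l2": classify_l2(text),
    }


def aggregate(sms_by_customer):
    """Sum extracted amounts per L2 category at customer level."""
    features = {}
    for customer_id, messages in sms_by_customer.items():
        totals = defaultdict(float)
        for text in messages:
            record = extract(text)
            if record is not None:
                totals[record["l2"]] += record["amount"]
        features[customer_id] = {
            "monthly_income": totals["Salary"],
            "total_emi": totals["Loan / EMI"],
            "monthly_spend": totals["Spend"],
        }
    return features


# Invented sample messages covering a single month.
sample_sms = {
    "C1": [
        "A/c XX1234 credited with INR 50000.00 on 01-01-2024 (SALARY)",
        "A/c XX1234 debited by INR 1200.50 on 05-01-2024 at GROCERY STORE",
        "A/c XX1234 debited by INR 8000.00 on 07-01-2024 towards LOAN EMI",
        "Your OTP is 482913",
    ],
}
sms_features = aggregate(sample_sms)
```

The sample covers one month, so the sums double as monthly figures; a real pipeline would group by the extracted date before aggregating.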
Feature Mining: Bank Statement with Text Recognition and NLP
Aptivaa’s Bank Statement API supports English and Arabic bank statements
Customer Score
Income Estimation
Spend Analytics
Fixed Obligations
Alternative Data Modelling
Explainable Machine Learning for superior predictive power with full model transparency
Non-linear machine learning models are used for feature selection. Discretized and transformed features (e.g. using the WoE transformation) are then passed as input to a linear
algorithm or XGBoost (with monotonic constraints) to build fully explainable predictive models.
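The WoE (Weight of Evidence) transformation mentioned above can be sketched as below. The convention used here, WoE = ln(share of goods in the bin / share of bads in the bin), is one common choice and is an assumption, as are the additive smoothing of 0.5 (to avoid division by zero in empty cells), the bin labels, and the function name `woe_table`.

```python
import math
from collections import Counter


def woe_table(bins, defaults, smoothing=0.5):
    """Compute Weight of Evidence per bin.

    bins: bin label for each observation (already discretized)
    defaults: 1 for a defaulted ("bad") account, 0 for a "good" one
    smoothing: additive adjustment (an assumption) to avoid log(0)
    """
    goods, bads = Counter(), Counter()
    for label, is_default in zip(bins, defaults):
        if is_default == 1:
            bads[label] += 1
        else:
            goods[label] += 1

    labels = sorted(set(bins))
    total_good = sum(goods.values()) + smoothing * len(labels)
    total_bad = sum(bads.values()) + smoothing * len(labels)

    table = {}
    for label in labels:
        good_share = (goods[label] + smoothing) / total_good
        bad_share = (bads[label] + smoothing) / total_bad
        table[label] = math.log(good_share / bad_share)
    return table


# Toy data, invented for illustration: an income feature in three bins.
income_bin = ["low"] * 4 + ["mid"] * 4 + ["high"] * 4
defaulted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

table = woe_table(income_bin, defaulted)
woe_feature = [table[label] for label in income_bin]  # model input column
```

Under this convention a higher WoE means a lower-risk bin, so the transformed column is monotonic in risk, which is what keeps a downstream linear model easy to explain. For the tree-based route, XGBoost exposes a `monotone_constraints` parameter that plays a comparable role.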
Alternative Data Model Landscape for different customer segments
Illustrative Model Landscape
Approach 1 (Step 1, Step 2)
• Alternate + Traditional Data for some segments
• Some segments (e.g. Medium Risk Customers) are rescored using a Combined Data Model (for Bureau Hit cases only)
Approach 2 (Step 1, Step 2)
• Alternate Data Model for No Hit Segment
• Combined Model is used for Hit Segment and Standalone Alternate Data Model is used for No Hit Segment
The final approach is selected on the basis of product (ticket size, loan tenor), data cost (bureau pull, alternate data cost) and the marginal contribution of each data source to predictive power.
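The two approaches can be sketched as routing logic. Several details here are assumptions rather than part of the slide: that Step 1 of Approach 1 scores every applicant on alternate data, the cut-offs in `MEDIUM_BAND`, the applicant fields, and the toy scoring functions, which are placeholders for trained models.

```python
# Hypothetical band for "Medium Risk" on the Step 1 score (an assumption).
MEDIUM_BAND = (0.40, 0.70)


def score_approach_1(applicant, alternate_model, combined_model):
    """Approach 1: score on alternate data first (assumed Step 1), then
    rescore medium-risk, bureau-hit cases on the combined model (Step 2)."""
    score = alternate_model(applicant)
    low, high = MEDIUM_BAND
    if low <= score < high and applicant["bureau_hit"]:
        return "combined", combined_model(applicant)
    return "alternate", score


def score_approach_2(applicant, alternate_model, combined_model):
    """Approach 2: combined model for bureau-hit applicants, standalone
    alternate data model for no-hit applicants."""
    if applicant["bureau_hit"]:
        return "combined", combined_model(applicant)
    return "alternate", alternate_model(applicant)


# Placeholder scorers standing in for trained models.
def toy_alternate(applicant):
    return applicant["alt_score"]


def toy_combined(applicant):
    return round((applicant["alt_score"] + applicant["bureau_score"]) / 2, 4)


hit_medium = {"bureau_hit": True, "alt_score": 0.5, "bureau_score": 0.3}
hit_low = {"bureau_hit": True, "alt_score": 0.2, "bureau_score": 0.3}
no_hit = {"bureau_hit": False, "alt_score": 0.5, "bureau_score": None}

result_1 = score_approach_1(hit_medium, toy_alternate, toy_combined)
result_2 = score_approach_2(no_hit, toy_alternate, toy_combined)
```

The design difference is where the bureau cost is incurred: Approach 1 needs the combined model only for the rescored segment, while Approach 2 uses it for every bureau-hit applicant.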
Combining Alternative Data with Traditional Data
Prevalent methodologies to combine alternative data with traditional data
1. Single Model trained on combined dataset, with features from both sources
2. Alternative Model Score added as a feature to traditional data for model training
3. Traditional Model Score added as a feature to alternative data for model training
4. Two independent models are trained, and a matrix of scores from both models is used for decisioning
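The fourth methodology, a matrix of scores from two independent models, can be sketched as a lookup on two risk bands. The scores are assumed to be default probabilities (higher means riskier); the cut-offs, band names and the decisions in each cell are hypothetical and would normally be set from portfolio analysis.

```python
from bisect import bisect_right

# Hypothetical band boundaries on a default-probability scale.
BANDS = ["low", "medium", "high"]
CUTOFFS = [0.05, 0.15]

# Hypothetical decision matrix: rows are the traditional model's band,
# columns are the alternative model's band.
DECISION_MATRIX = {
    "low": {"low": "approve", "medium": "approve", "high": "refer"},
    "medium": {"low": "approve", "medium": "refer", "high": "decline"},
    "high": {"low": "refer", "medium": "decline", "high": "decline"},
}


def risk_band(default_probability):
    """Map a default probability to a band using the cut-offs."""
    return BANDS[bisect_right(CUTOFFS, default_probability)]


def decide(traditional_pd, alternative_pd):
    """Look up the decision for a pair of independent model scores."""
    row = risk_band(traditional_pd)
    column = risk_band(alternative_pd)
    return DECISION_MATRIX[row][column]


decision = decide(traditional_pd=0.03, alternative_pd=0.20)
```

Keeping the two models separate means either score can be recalibrated or replaced without retraining the other, at the cost of maintaining the matrix by hand.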
Illustrative Alternative Data Use Case
Credit Scoring using Telco Data
Telco data
• Call Records
• Location Data
• User Info
• Internet Usage
• Top-Ups Data
• VAS Data
• SMS Data
• Daily Balance
• Postpaid Payment
• Usage Duration
• Mobile Apps
• Device Data
• Wallet Txn Info
Feature groups
• Demographics
• Income Related
• Spend Related
• Social Network
• Employment
Model: XGBoost
Use of predictive models instead of heuristic/rule-based models can significantly improve profitability, business volume and ROA
1. For instance, for a default prediction model, an improvement of Gini coefficient from 40% to 50% would result in Lower Default Rate for same approval rate (reduction to 1.3% DR from 3.0% DR at same score cut-off for the ‘illustrative portfolio’) or Higher Approval Rate for same default rate (improvement in Approval Rate from 72.7% to 89.1% at ~3% DR for the ‘illustrative portfolio’).
2. This would result in either higher business volumes at same delinquency rates; or lower delinquency rates at same business volume. In either case, ROA would improve significantly.
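The quantities in this comparison can be computed as in the sketch below, assuming a risk score where higher means riskier and using the standard relation Gini = 2 × AUC − 1. The toy scores are invented and do not reproduce the ‘illustrative portfolio’ figures; the function names are hypothetical.

```python
def gini(scores, defaults):
    """Gini coefficient of a risk score via pairwise AUC (Gini = 2*AUC - 1).

    scores: higher means riskier (an assumption of this sketch)
    defaults: 1 if the account defaulted, else 0
    """
    bad = [s for s, y in zip(scores, defaults) if y == 1]
    good = [s for s, y in zip(scores, defaults) if y == 0]
    if not bad or not good:
        raise ValueError("need at least one defaulted and one good account")
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0  # defaulter correctly ranked as riskier
            elif b == g:
                wins += 0.5  # ties count as half
    auc = wins / (len(bad) * len(good))
    return 2 * auc - 1


def approval_and_default_rate(scores, defaults, cutoff):
    """Approve accounts scoring below the cut-off; return both rates."""
    approved = [y for s, y in zip(scores, defaults) if s < cutoff]
    approval_rate = len(approved) / len(scores)
    default_rate = sum(approved) / len(approved) if approved else 0.0
    return approval_rate, default_rate


# Toy data, invented for illustration.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
toy_defaults = [0, 0, 1, 0, 1, 1]

toy_gini = gini(toy_scores, toy_defaults)
toy_approval, toy_dr = approval_and_default_rate(toy_scores, toy_defaults, 0.5)
```

Comparing two models on the same portfolio then amounts to sweeping the cut-off for each and reading off either the default rate at a fixed approval rate or the approval rate at a fixed default rate.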
1 Compliance with GDPR guidelines for expats
2 Data sparsity (incomplete datasets)
5 Vendor Risk (e.g. financial strength of third-party data providers)
6 Data Quality and Veracity
7 Commercial Implications (Cost vs. Benefit)
8 Different predictive power for different data sources, so cannot be used without performance assessment