Py - Clustering Credit Card Fraud - Actuaries' Analytical Cookbook
Fraud
Contents
Define the problem
Identifying Fraud:
References:
Packages
Functions
Data
Modelling
Evaluation / observations
This notebook was originally created by Amanda Aitken for the Data Analytics
Applications subject, as Case study 2 - Credit card fraud detection in the DAA M06
Unsupervised learning module.
stealing another person’s credit card details and using these to purchase goods;
https://actuariesinstitute.github.io/cookbook/docs/DAA_M06_CS2.html
10/13/24, 4:13 PM Py: Clustering Credit Card Fraud — Actuaries' Analytical Cookbook
In July 2019, the Centre for Counter Fraud Studies at the University of Portsmouth
estimated that fraud costs the global economy £3.89 trillion, with losses from fraud rising
by 56% in the past decade. Fraudulent behaviour can occur in any industry, including in
many of the industries that actuaries typically work in, such as banking and finance.
Fraud detection is a set of activities that are undertaken to identify fraud. Fraud detection
can help to rectify the outcomes of any fraudulent behaviour and/or to prevent fraud from
occurring in the future.
Identifying Fraud:
Some methods that can be used to detect fraud are:
reputation list;
rules engine;
supervised machine learning; and
unsupervised machine learning.
A reputation list, or ‘blacklist’, is a relatively static list of known fraudulent identities. For
example, a bank’s ‘blacklist’ might include a list of individuals who have previously been
convicted of credit card fraud. A drawback of this method is that the reputation list is
difficult to keep up to date, and an identity can usually only be added to the list after it
has been found committing fraudulent behaviour in the past.
A rules engine approach to fraud detection involves running a set of rules over an activity.
If the activity meets one of the rules, it might be further investigated manually. For example,
a retailer might use a rules engine to flag any potentially fraudulent online purchases. They
might, for example, flag for further investigation any transactions that are over a certain
volume or that are being requested for delivery to a foreign country. While rules engines
have the benefit of being easy to understand, they can also be hard to keep up-to-date
with recent fraudulent activities.
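The rules engine idea described above can be sketched in a few lines of Python. This is only an illustration: the field names (`amount`, `delivery_country`), the threshold `MAX_AMOUNT` and the home country `HOME_COUNTRY` are hypothetical and do not come from the case study data.

```python
# Hypothetical thresholds and field names, chosen only for illustration.
HOME_COUNTRY = "AU"
MAX_AMOUNT = 1_000.00

# Each rule is a (name, predicate) pair; a predicate returns True when the
# transaction should be flagged for manual investigation.
RULES = [
    ("large_amount", lambda t: t["amount"] > MAX_AMOUNT),
    ("foreign_delivery", lambda t: t["delivery_country"] != HOME_COUNTRY),
]


def triggered_rules(transaction):
    """Return the names of all rules that the transaction meets."""
    return [name for name, rule in RULES if rule(transaction)]


# Made-up transactions.
transactions = [
    {"id": 1, "amount": 59.90, "delivery_country": "AU"},
    {"id": 2, "amount": 2500.00, "delivery_country": "AU"},
    {"id": 3, "amount": 120.00, "delivery_country": "NZ"},
]

# Keep only the transactions that meet at least one rule.
flagged = {}
for t in transactions:
    hits = triggered_rules(t)
    if hits:
        flagged[t["id"]] = hits

print(flagged)
```

Each rule is easy to read and audit, which is the benefit noted above, but every new fraud pattern needs a new hand-written rule.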
media site might have a historical dataset that contains a list of new accounts and their
attributes (‘features’) as well as a label to indicate whether each new account was opened
by a legitimate customer (‘not fraud’) or by someone pretending to be a legitimate
customer (‘fraud’). The classification techniques discussed in Module 5 could be used for
this supervised learning approach to fraud detection.
Supervised learning, therefore, approximates a rules engine and might be used to create a
rules engine. A drawback of this method is that it requires a large dataset of past examples
of fraud which can be very ‘expensive’ to obtain, as the exercise of validating historical
instances of fraud can be very time-consuming. In addition, past examples of fraud may not
be a good indicator of the types of fraud to which an organisation could be subject in the
future. In fact, there should be some anticipation that fraudsters will change their strategies
over time when they see that their existing methods of fraud are thwarted or investigated.
by identifying outliers that are dissimilar to other observations and do not align closely
with any of the clusters found in the dataset—these outliers are potential cases of
fraud; or, conversely,
by identifying a cluster of observations, when all other observations appear to be more
random and not tightly bunched together—this method is described in Video 6.6.
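The first of the two approaches above (flagging observations that do not sit close to any cluster) might look like the sketch below. It uses a small hand-written k-means on made-up two-dimensional data rather than the case study's own model, and the "mean plus three standard deviations" cut-off is an arbitrary assumption chosen for illustration.

```python
import numpy as np


def kmeans(X, centroids, n_iter=20):
    """Plain Lloyd's algorithm starting from the given initial centroids."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every observation to every centroid: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    # Recompute distances against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1), dists.min(axis=1)


rng = np.random.default_rng(0)
# Toy data: two tight clusters plus one observation far from both.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
outlier = np.array([[2.5, 12.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

# Start from one point in each cluster so the result is deterministic.
centroids, labels, dist_to_centroid = kmeans(X, centroids=X[[0, 50]])

# Flag observations whose distance to their nearest centroid is unusually large.
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
suspects = np.where(dist_to_centroid > threshold)[0]
print(suspects)
```

The flagged indices are only candidates for investigation; a large distance to the nearest centroid does not by itself establish fraud.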
References:
The datasets that are used in this case study were sourced from the following Kaggle
competition: https://www.kaggle.com/c/ieee-fraud-detection.
The aim of the competition was to improve the efficacy of fraudulent transaction
identification models.
The data originates from real-world e-commerce transactions and contains two different
types of datasets. Note that the meaning of some of the features has been masked by the
data provider, so it is not clear what all the features mean.
The first dataset contains the following features for each credit card transaction:
The training version of the transactions dataset also contains a ‘TransactionID’ identifier for
each transaction and a label ‘isFraud’ to indicate whether the transaction was fraudulent.
The second dataset contains ‘identity’ features for each transaction. These features are
described on Kaggle as being ‘network connection information (IP, ISP, Proxy, etc) and
digital signature (UA/browser/os/version, etc) associated with transactions’. These features
were ‘collected by Vesta’s fraud protection system and digital security partners’. The actual
meaning of each of these fields is masked. These features have the following names:
id_01 - id_38;
DeviceType; and
DeviceInfo.
Both the transactions and identity datasets also contain a ‘TransactionID’ column that
allows the two datasets to be joined together.
The above information about the features contained in these datasets was sourced from
the following Kaggle webpage: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203.
Packages
This section installs packages that will be required for this exercise/case study.
Functions
This section defines functions that will be used for this exercise/case study.
def heading(string):
    '''
    This function is used to print a bold heading followed by a new line.
    '''
    print('\033[1m'+string+'\033[0m\n')
    # The '\033[1m' at the start of the string makes the font bold
    # and the '\033[0m' at the end of the string makes the font go back to normal.
    # The '\n' at the very end of the string prints a new line to give some space
    # before the next item is printed to the output box.
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment='center',
color='white' if cm[i, j] > thresh else 'black')
plt.tight_layout()
plt.ylabel('True response')
plt.xlabel('Predicted response')
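The lines above appear to be the tail of a helper that plots a confusion matrix; its opening lines are not shown. A self-contained sketch of such a helper is given below. The function name `plot_confusion_matrix`, its parameters, the colour map and the example counts are all assumptions made for illustration and may differ from the original notebook.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_confusion_matrix(cm, classes, title="Confusion matrix"):
    """Draw a confusion matrix as a heat map with the count printed in each cell."""
    fig, ax = plt.subplots()
    ax.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
    ax.set_title(title)
    tick_marks = np.arange(len(classes))
    ax.set_xticks(tick_marks)
    ax.set_xticklabels(classes)
    ax.set_yticks(tick_marks)
    ax.set_yticklabels(classes)

    # Use white text on dark cells and black text on light cells.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, cm[i, j],
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    ax.set_ylabel("True response")
    ax.set_xlabel("Predicted response")
    return fig, ax


# Made-up counts: rows are true classes, columns are predicted classes.
cm = np.array([[90, 10],
               [4, 6]])
fig, ax = plot_confusion_matrix(cm, classes=["not fraud", "fraud"])
```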
Data
This section:
Import data
The code below uploads the following two csv files needed for this case study:
Note that the transaction dataset is very large (667MB), so the file is zipped. pandas
is able to read zipped csv files directly.
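As a small illustration of reading a zipped csv file, the sketch below writes a tiny stand-in archive to a temporary folder and reads it back with `pd.read_csv`. The file name `toy_transactions.zip` and its contents are hypothetical; the real file names used by the case study are not shown here.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy_transactions.zip")  # hypothetical name
csv_text = "TransactionID,isFraud,TransactionAmt\n1,0,68.5\n2,1,29.0\n3,0,59.0\n"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("toy_transactions.csv", csv_text)

# pandas infers the compression from the '.zip' extension; the archive
# must contain exactly one file.
toy = pd.read_csv(zip_path)
print(toy.head())
```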
heading('Identity dataset')
print(identity.head())
Transaction dataset
P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 \
0 NaN NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1 gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 outlook.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
3 yahoo.com NaN 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0
4 gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 \
0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 \
0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
2 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 \
0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 0.0 0.0 0.0
1 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
2 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
3 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 \
0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 \
0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0
1 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0
3 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 \
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 1.0 48.0 28.0 0.0 10.0 4.0 1.0 38.0
4 NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 \
0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 24.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 \
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 50.0
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 \
0 117.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1758.0 925.0 0.0 354.0 135.0 50.0 1404.0 790.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 6.0 18.0 140.0 0.0 0.0 0.0 0.0
V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1803.0 49.0 64.0 0.0 0.0 0.0 0.0 0.0 0.0 15557.990234
V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 169690.796875 0.0 0.0 0.0 515.0 5155.0 2840.0 0.0 0.0 0.0
V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
V206 V207 V208 V209 V210 V211 V212 V213 V214 V215 V216 V217 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V218 V219 V220 V221 V222 V223 V224 V225 V226 V227 V228 V229 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
V230 V231 V232 V233 V234 V235 V236 V237 V238 V239 V240 V241 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
V242 V243 V244 V245 V246 V247 V248 V249 V250 V251 V252 V253 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
V254 V255 V256 V257 V258 V259 V260 V261 V262 V263 V264 V265 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0
V266 V267 V268 V269 V270 V271 V272 V273 V274 V275 V276 V277 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288 V289 \
0 NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 NaN 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 NaN 1.0 28.0 0.0 0.0 0.0 0.0 10.0 0.0 4.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
V290 V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 \
0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0 38.0 24.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V302 V303 V304 V305 V306 V307 V308 V309 V310 V311 V312 \
0 0.0 0.0 0.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 50.0 1758.0 925.0 0.0 354.0 0.0 135.0
4 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V313 V314 V315 V316 V317 V318 V319 V320 V321 V322 V323 V324 \
0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
3 0.0 0.0 0.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Identity dataset
id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 \
0 NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 NaN
1 NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 NaN
2 0.0 0.0 100.0 NotFound 52.0 NaN Found Found 121.0 NaN
3 NaN NaN 100.0 NotFound 52.0 NaN New NotFound 225.0 NaN
4 0.0 0.0 100.0 NotFound NaN -300.0 Found Found 166.0 15.0
id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 \
0 542.0 144.0 NaN NaN NaN NaN NaN NaN NaN New
1 621.0 500.0 NaN NaN NaN NaN NaN NaN NaN New
2 410.0 142.0 NaN NaN NaN NaN NaN NaN NaN Found
3 176.0 507.0 NaN NaN NaN NaN NaN NaN NaN New
4 529.0 575.0 NaN NaN NaN NaN NaN NaN NaN Found
DeviceInfo
0 SAMSUNG SM-G892A Build/NRD90M
1 iOS Device
2 Windows
3 NaN
4 MacOS
The outputs above show that there are many features in these two datasets that have
missing values (NaN).
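One way to quantify how widespread the missing values are is to compute the share of NaN entries in each column. The sketch below does this on a made-up stand-in for the identity dataset; the values are invented, and only the column names echo those seen above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the identity dataset; values are made up.
toy_identity = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "id_01": [0.0, -5.0, -5.0, -10.0],
    "id_03": [np.nan, np.nan, 0.0, np.nan],
    "DeviceType": ["mobile", None, "desktop", "desktop"],
})

# isna() marks missing cells as True; the column mean of those flags is
# the share of missing values per column. Sort so the worst columns come first.
missing_share = toy_identity.isna().mean().sort_values(ascending=False)
print(missing_share)
```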
heading('Identity dataset')
print(identity.info())
Transaction dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float64(376), int64(4), object(14)
memory usage: 1.7+ GB
None
Identity dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144233 entries, 0 to 144232
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TransactionID 144233 non-null int64
1 id_01 144233 non-null float64
2 id_02 140872 non-null float64
3 id_03 66324 non-null float64
4 id_04 66324 non-null float64
5 id_05 136865 non-null float64
6 id_06 136865 non-null float64
7 id_07 5155 non-null float64
8 id_08 5155 non-null float64
9 id_09 74926 non-null float64
10 id_10 74926 non-null float64
11 id_11 140978 non-null float64
12 id_12 144233 non-null object
13 id_13 127320 non-null float64
14 id_14 80044 non-null float64
15 id_15 140985 non-null object
16 id_16 129340 non-null object
17 id_17 139369 non-null float64
18 id_18 45113 non-null float64
19 id_19 139318 non-null float64
20 id_20 139261 non-null float64
21 id_21 5159 non-null float64
22 id_22 5169 non-null float64
23 id_23 5169 non-null object
24 id_24 4747 non-null float64
25 id_25 5132 non-null float64
26 id_26 5163 non-null float64
27 id_27 5169 non-null object
28 id_28 140978 non-null object
29 id_29 140978 non-null object
30 id_30 77565 non-null object
31 id_31 140282 non-null object
32 id_32 77586 non-null float64
33 id_33 73289 non-null object
34 id_34 77805 non-null object
35 id_35 140985 non-null object
36 id_36 140985 non-null object
37 id_37 140985 non-null object
38 id_38 140985 non-null object
39 DeviceType 140810 non-null object
40 DeviceInfo 118666 non-null object
dtypes: float64(23), int64(1), object(17)
memory usage: 45.1+ MB
None
heading('Identity dataset')
print(identity.describe())
Transaction dataset
C1 C2 C3 C4 \
count 590540.000000 590540.000000 590540.000000 590540.000000
mean 14.092458 15.269734 0.005644 4.092185
std 133.569018 154.668899 0.150536 68.848459
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 0.000000 0.000000
50% 1.000000 1.000000 0.000000 0.000000
75% 3.000000 3.000000 0.000000 0.000000
max 4685.000000 5691.000000 26.000000 2253.000000
C5 C6 C7 C8 \
count 590540.000000 590540.000000 590540.000000 590540.000000
mean 5.571526 9.071082 2.848478 5.144574
std 25.786976 71.508467 61.727304 95.378574
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 0.000000 0.000000
50% 0.000000 1.000000 0.000000 0.000000
75% 1.000000 2.000000 0.000000 0.000000
max 349.000000 2253.000000 2255.000000 3331.000000
C13 C14 D1 D2 \
count 590540.000000 590540.000000 589271.000000 309743.000000
mean 32.539918 8.295215 94.347568 169.563231
std 129.364844 49.544262 157.660387 177.315865
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 0.000000 26.000000
50% 3.000000 1.000000 3.000000 97.000000
75% 12.000000 2.000000 122.000000 276.000000
max 2918.000000 1429.000000 640.000000 640.000000
D3 D4 D5 D6 \
count 327662.000000 421618.000000 280699.000000 73187.000000
mean 28.343348 140.002441 42.335965 69.805717
std 62.384721 191.096774 89.000144 143.669253
min 0.000000 -122.000000 0.000000 -83.000000
25% 1.000000 0.000000 1.000000 0.000000
50% 8.000000 26.000000 10.000000 0.000000
75% 27.000000 253.000000 32.000000 40.000000
max 819.000000 869.000000 819.000000 873.000000
D7 D8 D9 D10 D11 \
count 38917.000000 74926.000000 74926.000000 514518.000000 311253.000000
mean 41.638950 146.058108 0.561057 123.982137 146.621465
std 99.743264 231.663840 0.316880 182.615225 186.042622
min 0.000000 0.000000 0.000000 0.000000 -53.000000
25% 0.000000 0.958333 0.208333 0.000000 0.000000
50% 0.000000 37.875000 0.666666 15.000000 43.000000
75% 17.000000 187.958328 0.833333 197.000000 274.000000
max 843.000000 1707.791626 0.958333 876.000000 670.000000
V2 V3 V4 V5 \
count 311253.000000 311253.000000 311253.000000 311253.000000
mean 1.045204 1.078075 0.846456 0.876991
std 0.240133 0.320890 0.440053 0.475902
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 1.000000 1.000000
50% 1.000000 1.000000 1.000000 1.000000
75% 1.000000 1.000000 1.000000 1.000000
max 8.000000 9.000000 6.000000 6.000000
V6 V7 V8 V9 \
count 311253.000000 311253.000000 311253.000000 311253.000000
mean 1.045686 1.072870 1.027704 1.041529
std 0.239385 0.304779 0.186069 0.226339
min 0.000000 0.000000 0.000000 0.000000
25% 1.000000 1.000000 1.000000 1.000000
50% 1.000000 1.000000 1.000000 1.000000
75% 1.000000 1.000000 1.000000 1.000000
max 9.000000 9.000000 8.000000 8.000000
V126 V127 V128 V129 \
count 590226.000000 590226.000000 590226.000000 590226.000000
mean 129.979417 336.611559 204.094038 8.768944
std 2346.951681 4238.666949 3010.258774 113.832828
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000
75% 0.000000 107.949997 0.000000 0.000000
max 160000.000000 160000.000000 160000.000000 55125.000000
V339
count 82351.000000
mean 100.700882
std 814.946722
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 104060.000000
Identity dataset
percent_fraud = fraud/(fraud+not_fraud)
print('There are '+'{:,.0f}'.format(fraud)+' fraudulent transactions out of '+
'{:,.0f}'.format(fraud+not_fraud)+' total transactions in the dataset.')
print('These fraudulent transactions represent approximately '+'{:,.2%}'.format(pe
' of all transactions.')
There are 20,663 fraudulent transactions out of 590,540 total transactions in the
These fraudulent transactions represent approximately 3.50% of all transactions.
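The lines that define `fraud` and `not_fraud` are not shown above. One plausible way to obtain them, sketched here on made-up data, is to count the values of the 'isFraud' label; this is an assumption about how the counts could be produced, not necessarily how the original notebook computes them.

```python
import pandas as pd

# Made-up labels: 1 = fraud, 0 = not fraud.
toy_transaction = pd.DataFrame({"isFraud": [0, 0, 0, 1, 0, 0, 1, 0]})

# Count how many transactions carry each label.
counts = toy_transaction["isFraud"].value_counts()
fraud = int(counts.get(1, 0))
not_fraud = int(counts.get(0, 0))

percent_fraud = fraud / (fraud + not_fraud)
print("{:,.0f} of {:,.0f} transactions are fraudulent ({:,.2%}).".format(
    fraud, fraud + not_fraud, percent_fraud))
```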
# Calculate the average transaction amounts for fraud and non fraud transactions.
heading('Average transaction amounts for fraud and non fraud')
print(transaction.groupby('isFraud')['TransactionAmt'].mean())
print('\n')
isFraud
0 134.511665
1 149.244779
Name: TransactionAmt, dtype: float64
isFraud
mean count
ProductCD
C 0.116873 68519
H 0.047662 33024
R 0.037826 37699
S 0.058996 11628
W 0.020399 439670
The average transaction amount is slightly higher for fraud cases (approximately 149 versus
135 for non-fraud cases).
The percentage of transactions that are fraudulent differs markedly by product code. The
fraud percentage is greatest for product code ‘C’ (11.7%) and smallest for product code ‘W’
(2.0%). These differences suggest that it might be helpful to look at instances of fraud
separately for each of the five product codes.
The rest of this case study will focus on fraud for product code ‘C’.
You could then repeat the analysis below for each of the other product codes.
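The code that produced the table of fraud rates by product code is not shown above. A table of that shape (mean and count of 'isFraud' for each 'ProductCD') can be produced with a grouped aggregation, as in the sketch below on made-up data. The exact call used in the original notebook is an assumption.

```python
import pandas as pd

# Made-up transactions across three product codes.
toy_transaction = pd.DataFrame({
    "ProductCD": ["C", "C", "C", "C", "W", "W", "W", "W", "W", "H"],
    "isFraud":   [1,   0,   0,   1,   0,   0,   0,   0,   1,   0],
})

# The mean of a 0/1 label is the fraud rate; the count is the number of
# transactions for that product code.
by_product = toy_transaction.groupby("ProductCD").agg({"isFraud": ["mean", "count"]})
print(by_product)
```

Showing the count next to the rate helps to judge how much weight to give each rate, since a rate based on few transactions is less reliable.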
Prepare data
# Join the transaction and identity datasets using a left join so that all
# observations in the transaction dataset are retained
# (the transaction dataset contains the response 'isFraud'), and any matching
# observations in the identity dataset are joined into the merge.
product = 'C'
fraud_data = pd.merge(transaction[transaction['ProductCD']=='C'],identity,on='Tran
print(fraud_data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 227.4+ MB
None
The fraud dataset created above contains 68,519 transactions for the product
being investigated (product code ‘C’) and 434 columns (‘TransactionID’, ‘isFraud’ and 432
features).
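The effect of the left join can be seen on a small made-up example. The sketch below assumes the join key is 'TransactionID' (the column that both datasets share) and filters on the `product` variable; the toy tables and their values are invented for illustration.

```python
import pandas as pd

# Made-up transaction and identity tables.
toy_transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "ProductCD": ["C", "W", "C", "C"],
    "isFraud": [0, 0, 1, 0],
})
toy_identity = pd.DataFrame({
    "TransactionID": [1, 4, 9],
    "DeviceType": ["mobile", "desktop", "mobile"],
})

product = "C"
# how='left' keeps every filtered transaction, whether or not it has a
# matching identity record; unmatched rows get NaN in the identity columns.
toy_fraud_data = pd.merge(
    toy_transaction[toy_transaction["ProductCD"] == product],
    toy_identity,
    on="TransactionID",
    how="left",
)
print(toy_fraud_data)
```

Transaction 3 has no identity record, so it is retained with a missing 'DeviceType', while identity record 9 has no matching transaction and is dropped.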
The number of features should be reduced to make the model fitting more tractable.
The features used in the clustering algorithm should be numeric and not contain missing
values, so that Euclidean distances between the observations can be calculated.
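A minimal sketch of this kind of feature selection is shown below on made-up data: keep only numeric columns, then drop any column that contains missing values. The standardisation step is an added assumption (it is not stated above) and is included because Euclidean distances are otherwise dominated by features measured on larger scales.

```python
import numpy as np
import pandas as pd

# Made-up data with a numeric, an all-missing and a text column.
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 20.0, 30.0, 40.0],
    "C1": [1.0, 1.0, 2.0, 4.0],
    "dist1": [np.nan, np.nan, np.nan, np.nan],
    "card4": ["visa", "visa", None, "mastercard"],
})

# Keep numeric columns only, then drop any column containing NaN.
numeric = toy.select_dtypes(include="number")
complete = numeric.dropna(axis=1)

# Assumption: standardise each feature to mean 0 and standard deviation 1.
scaled = (complete - complete.mean()) / complete.std(ddof=0)

# Euclidean distance between the first two observations.
d01 = np.linalg.norm(scaled.iloc[0].to_numpy() - scaled.iloc[1].to_numpy())
print(list(complete.columns), round(d01, 3))
```

Dropping incomplete columns is the simplest option; imputing missing values is an alternative when too many features would otherwise be lost.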
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TransactionDT 68519 non-null int64
1 TransactionAmt 68519 non-null float64
2 card1 68519 non-null int64
3 card2 67988 non-null float64
4 card3 68326 non-null float64
5 card4 68324 non-null object
6 card5 67896 non-null float64
7 card6 68326 non-null object
8 addr1 3400 non-null float64
9 addr2 3400 non-null float64
10 dist1 0 non-null float64
11 dist2 26741 non-null float64
dtypes: float64(8), int64(2), object(2)
memory usage: 6.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C1 68519 non-null float64
1 C2 68519 non-null float64
2 C3 68519 non-null float64
3 C4 68519 non-null float64
4 C5 68519 non-null float64
5 C6 68519 non-null float64
6 C7 68519 non-null float64
7 C8 68519 non-null float64
8 C9 68519 non-null float64
9 C10 68519 non-null float64
10 C11 68519 non-null float64
11 C12 68519 non-null float64
12 C13 68519 non-null float64
13 C14 68519 non-null float64
dtypes: float64(14)
memory usage: 7.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 D1 68324 non-null float64
1 D2 20184 non-null float64
2 D3 26800 non-null float64
3 D4 64356 non-null float64
4 D5 32584 non-null float64
5 D6 64352 non-null float64
6 D7 32503 non-null float64
7 D8 29436 non-null float64
8 D9 29436 non-null float64
9 D10 65021 non-null float64
10 D11 0 non-null float64
11 D12 64717 non-null float64
12 D13 57948 non-null float64
13 D14 51083 non-null float64
14 D15 64581 non-null float64
15 DeviceType 61015 non-null object
16 DeviceInfo 41593 non-null object
dtypes: float64(15), object(2)
memory usage: 9.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 M1 0 non-null object
1 M2 0 non-null object
2 M3 0 non-null object
3 M4 66785 non-null object
4 M5 0 non-null object
5 M6 0 non-null object
6 M7 0 non-null object
7 M8 0 non-null object
8 M9 0 non-null object
dtypes: object(9)
memory usage: 5.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V1 0 non-null float64
1 V2 0 non-null float64
2 V3 0 non-null float64
3 V4 0 non-null float64
4 V5 0 non-null float64
5 V6 0 non-null float64
6 V7 0 non-null float64
7 V8 0 non-null float64
8 V9 0 non-null float64
9 V10 0 non-null float64
10 V11 0 non-null float64
11 V12 65021 non-null float64
12 V13 65021 non-null float64
13 V14 65021 non-null float64
14 V15 65021 non-null float64
15 V16 65021 non-null float64
16 V17 65021 non-null float64
17 V18 65021 non-null float64
18 V19 65021 non-null float64
19 V20 65021 non-null float64
20 V21 65021 non-null float64
21 V22 65021 non-null float64
22 V23 65021 non-null float64
23 V24 65021 non-null float64
24 V25 65021 non-null float64
25 V26 65021 non-null float64
26 V27 65021 non-null float64
27 V28 65021 non-null float64
28 V29 65021 non-null float64
29 V30 65021 non-null float64
30 V31 65021 non-null float64
31 V32 65021 non-null float64
32 V33 65021 non-null float64
33 V34 65021 non-null float64
34 V35 64356 non-null float64
35 V36 64356 non-null float64
36 V37 64356 non-null float64
37 V38 64356 non-null float64
38 V39 64356 non-null float64
39 V40 64356 non-null float64
40 V41 64356 non-null float64
41 V42 64356 non-null float64
42 V43 64356 non-null float64
43 V44 64356 non-null float64
44 V45 64356 non-null float64
45 V46 64356 non-null float64
46 V47 64356 non-null float64
47 V48 64356 non-null float64
48 V49 64356 non-null float64
49 V50 64356 non-null float64
50 V51 64356 non-null float64
51 V52 64356 non-null float64
52 V53 67067 non-null float64
53 V54 67067 non-null float64
54 V55 67067 non-null float64
55 V56 67067 non-null float64
56 V57 67067 non-null float64
57 V58 67067 non-null float64
58 V59 67067 non-null float64
59 V60 67067 non-null float64
60 V61 67067 non-null float64
61 V62 67067 non-null float64
62 V63 67067 non-null float64
63 V64 67067 non-null float64
64 V65 67067 non-null float64
65 V66 67067 non-null float64
66 V67 67067 non-null float64
67 V68 67067 non-null float64
68 V69 67067 non-null float64
69 V70 67067 non-null float64
70 V71 67067 non-null float64
71 V72 67067 non-null float64
72 V73 67067 non-null float64
73 V74 67067 non-null float64
74 V75 64581 non-null float64
75 V76 64581 non-null float64
76 V77 64581 non-null float64
77 V78 64581 non-null float64
78 V79 64581 non-null float64
79 V80 64581 non-null float64
80 V81 64581 non-null float64
81 V82 64581 non-null float64
82 V83 64581 non-null float64
83 V84 64581 non-null float64
84 V85 64581 non-null float64
85 V86 64581 non-null float64
86 V87 64581 non-null float64
87 V88 64581 non-null float64
88 V89 64581 non-null float64
89 V90 64581 non-null float64
90 V91 64581 non-null float64
91 V92 64581 non-null float64
92 V93 64581 non-null float64
93 V94 64581 non-null float64
94 V95 68447 non-null float64
95 V96 68447 non-null float64
96 V97 68447 non-null float64
97 V98 68447 non-null float64
98 V99 68447 non-null float64
99 V100 68447 non-null float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V101 68447 non-null float64
1 V102 68447 non-null float64
2 V103 68447 non-null float64
3 V104 68447 non-null float64
4 V105 68447 non-null float64
5 V106 68447 non-null float64
6 V107 68447 non-null float64
7 V108 68447 non-null float64
8 V109 68447 non-null float64
9 V110 68447 non-null float64
10 V111 68447 non-null float64
11 V112 68447 non-null float64
12 V113 68447 non-null float64
13 V114 68447 non-null float64
14 V115 68447 non-null float64
15 V116 68447 non-null float64
16 V117 68447 non-null float64
17 V118 68447 non-null float64
18 V119 68447 non-null float64
19 V120 68447 non-null float64
20 V121 68447 non-null float64
21 V122 68447 non-null float64
22 V123 68447 non-null float64
23 V124 68447 non-null float64
24 V125 68447 non-null float64
25 V126 68447 non-null float64
26 V127 68447 non-null float64
27 V128 68447 non-null float64
28 V129 68447 non-null float64
29 V130 68447 non-null float64
30 V131 68447 non-null float64
31 V132 68447 non-null float64
32 V133 68447 non-null float64
33 V134 68447 non-null float64
34 V135 68447 non-null float64
35 V136 68447 non-null float64
36 V137 68447 non-null float64
37 V138 0 non-null float64
38 V139 0 non-null float64
39 V140 0 non-null float64
40 V141 0 non-null float64
41 V142 0 non-null float64
42 V143 0 non-null float64
43 V144 0 non-null float64
44 V145 0 non-null float64
45 V146 0 non-null float64
46 V147 0 non-null float64
47 V148 0 non-null float64
48 V149 0 non-null float64
49 V150 0 non-null float64
50 V151 0 non-null float64
51 V152 0 non-null float64
52 V153 0 non-null float64
53 V154 0 non-null float64
54 V155 0 non-null float64
55 V156 0 non-null float64
56 V157 0 non-null float64
57 V158 0 non-null float64
58 V159 0 non-null float64
59 V160 0 non-null float64
60 V161 0 non-null float64
61 V162 0 non-null float64
62 V163 0 non-null float64
63 V164 0 non-null float64
64 V165 0 non-null float64
65 V166 0 non-null float64
66 V167 59669 non-null float64
67 V168 59669 non-null float64
68 V169 59885 non-null float64
69 V170 59885 non-null float64
70 V171 59885 non-null float64
71 V172 59669 non-null float64
72 V173 59669 non-null float64
73 V174 59885 non-null float64
74 V175 59885 non-null float64
75 V176 59669 non-null float64
76 V177 59669 non-null float64
77 V178 59669 non-null float64
78 V179 59669 non-null float64
79 V180 59885 non-null float64
80 V181 59669 non-null float64
81 V182 59669 non-null float64
82 V183 59669 non-null float64
83 V184 59885 non-null float64
84 V185 59885 non-null float64
85 V186 59669 non-null float64
86 V187 59669 non-null float64
87 V188 59885 non-null float64
88 V189 59885 non-null float64
89 V190 59669 non-null float64
90 V191 59669 non-null float64
91 V192 59669 non-null float64
92 V193 59669 non-null float64
93 V194 59885 non-null float64
94 V195 59885 non-null float64
95 V196 59669 non-null float64
96 V197 59885 non-null float64
97 V198 59885 non-null float64
98 V199 59669 non-null float64
99 V200 59885 non-null float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V201 59885 non-null float64
1 V202 59669 non-null float64
2 V203 59669 non-null float64
3 V204 59669 non-null float64
4 V205 59669 non-null float64
5 V206 59669 non-null float64
6 V207 59669 non-null float64
7 V208 59885 non-null float64
8 V209 59885 non-null float64
9 V210 59885 non-null float64
10 V211 59669 non-null float64
11 V212 59669 non-null float64
12 V213 59669 non-null float64
13 V214 59669 non-null float64
14 V215 59669 non-null float64
15 V216 59669 non-null float64
16 V217 52036 non-null float64
17 V218 52036 non-null float64
18 V219 52036 non-null float64
19 V220 59160 non-null float64
20 V221 59160 non-null float64
21 V222 59160 non-null float64
22 V223 52036 non-null float64
23 V224 52036 non-null float64
24 V225 52036 non-null float64
25 V226 52036 non-null float64
26 V227 59160 non-null float64
27 V228 52036 non-null float64
28 V229 52036 non-null float64
29 V230 52036 non-null float64
30 V231 52036 non-null float64
31 V232 52036 non-null float64
32 V233 52036 non-null float64
33 V234 59160 non-null float64
34 V235 52036 non-null float64
35 V236 52036 non-null float64
36 V237 52036 non-null float64
37 V238 59160 non-null float64
38 V239 59160 non-null float64
39 V240 52036 non-null float64
40 V241 52036 non-null float64
41 V242 52036 non-null float64
42 V243 52036 non-null float64
43 V244 52036 non-null float64
44 V245 59160 non-null float64
45 V246 52036 non-null float64
46 V247 52036 non-null float64
47 V248 52036 non-null float64
48 V249 52036 non-null float64
49 V250 59160 non-null float64
50 V251 59160 non-null float64
51 V252 52036 non-null float64
52 V253 52036 non-null float64
53 V254 52036 non-null float64
54 V255 59160 non-null float64
55 V256 59160 non-null float64
56 V257 52036 non-null float64
57 V258 52036 non-null float64
58 V259 59160 non-null float64
59 V260 52036 non-null float64
60 V261 52036 non-null float64
61 V262 52036 non-null float64
62 V263 52036 non-null float64
63 V264 52036 non-null float64
64 V265 52036 non-null float64
65 V266 52036 non-null float64
66 V267 52036 non-null float64
67 V268 52036 non-null float64
68 V269 52036 non-null float64
69 V270 59160 non-null float64
70 V271 59160 non-null float64
71 V272 59160 non-null float64
72 V273 52036 non-null float64
73 V274 52036 non-null float64
74 V275 52036 non-null float64
75 V276 52036 non-null float64
76 V277 52036 non-null float64
77 V278 52036 non-null float64
78 V279 68512 non-null float64
79 V280 68512 non-null float64
80 V281 68324 non-null float64
81 V282 68324 non-null float64
82 V283 68324 non-null float64
83 V284 68512 non-null float64
84 V285 68512 non-null float64
85 V286 68512 non-null float64
86 V287 68512 non-null float64
87 V288 68324 non-null float64
88 V289 68324 non-null float64
89 V290 68512 non-null float64
90 V291 68512 non-null float64
91 V292 68512 non-null float64
92 V293 68512 non-null float64
93 V294 68512 non-null float64
94 V295 68512 non-null float64
95 V296 68324 non-null float64
96 V297 68512 non-null float64
97 V298 68512 non-null float64
98 V299 68512 non-null float64
99 V300 68324 non-null float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 V301 68324 non-null float64
1 V302 68512 non-null float64
2 V303 68512 non-null float64
3 V304 68512 non-null float64
4 V305 68512 non-null float64
5 V306 68512 non-null float64
6 V307 68512 non-null float64
7 V308 68512 non-null float64
8 V309 68512 non-null float64
9 V310 68512 non-null float64
10 V311 68512 non-null float64
11 V312 68512 non-null float64
12 V313 68324 non-null float64
13 V314 68324 non-null float64
14 V315 68324 non-null float64
15 V316 68512 non-null float64
16 V317 68512 non-null float64
17 V318 68512 non-null float64
18 V319 68512 non-null float64
19 V320 68512 non-null float64
20 V321 68512 non-null float64
21 V322 0 non-null float64
22 V323 0 non-null float64
23 V324 0 non-null float64
24 V325 0 non-null float64
25 V326 0 non-null float64
26 V327 0 non-null float64
27 V328 0 non-null float64
28 V329 0 non-null float64
29 V330 0 non-null float64
30 V331 0 non-null float64
31 V332 0 non-null float64
32 V333 0 non-null float64
33 V334 0 non-null float64
34 V335 0 non-null float64
35 V336 0 non-null float64
36 V337 0 non-null float64
37 V338 0 non-null float64
38 V339 0 non-null float64
dtypes: float64(39)
memory usage: 20.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id_01 62192 non-null float64
1 id_02 61015 non-null float64
2 id_03 27141 non-null float64
3 id_04 27141 non-null float64
4 id_05 59809 non-null float64
5 id_06 59809 non-null float64
6 id_07 491 non-null float64
7 id_08 491 non-null float64
8 id_09 29436 non-null float64
9 id_10 29436 non-null float64
10 id_11 61016 non-null float64
11 id_12 62192 non-null object
12 id_13 60121 non-null float64
13 id_14 2396 non-null float64
14 id_15 61016 non-null object
15 id_16 51136 non-null object
16 id_17 60724 non-null float64
17 id_18 15052 non-null float64
18 id_19 60701 non-null float64
19 id_20 60656 non-null float64
20 id_21 495 non-null float64
21 id_22 495 non-null float64
22 id_23 495 non-null object
23 id_24 274 non-null float64
24 id_25 484 non-null float64
25 id_26 494 non-null float64
26 id_27 495 non-null object
27 id_28 61016 non-null object
28 id_29 61016 non-null object
29 id_30 0 non-null object
30 id_31 60898 non-null object
31 id_32 0 non-null float64
32 id_33 0 non-null object
33 id_34 0 non-null object
34 id_35 61016 non-null object
35 id_36 61016 non-null object
36 id_37 61016 non-null object
37 id_38 61016 non-null object
dtypes: float64(23), object(15)
memory usage: 20.4+ MB
None
Based on the information above, the number of features could be reduced by only selecting
those that are numeric and have at least 60,000 observations.
# Select the numeric features, excluding the TransactionID identifier.
fraud_numeric = fraud_data.select_dtypes(['number']).drop(columns='TransactionID')
# Select only the features that have more than 60,000 non-missing values.
# items() is used rather than iteritems(), which was removed in pandas 2.0.
fraud_numeric2 = pd.DataFrame()
for (columnName, columnData) in fraud_numeric.items():
    enough_data = columnData.notna().sum() > 60000
    if enough_data:
        fraud_numeric2[columnName] = columnData
# Then select only the rows that do not have any missing values.
fraud_numeric3 = fraud_numeric2.dropna(axis='rows')
print(fraud_numeric3.info())
print(list(fraud_numeric3.columns))
# Check that there are no missing values in this new fraud data.
print('There are '+str(fraud_numeric3.isna().sum().sum())+' missing values')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51660 entries, 0 to 68518
Columns: 203 entries, isFraud to id_20
dtypes: float64(200), int64(3)
memory usage: 80.4 MB
None
['isFraud', 'TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3', 'card5',
There are 0 missing values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51660 entries, 0 to 68518
Columns: 202 entries, TransactionDT to id_20
dtypes: float64(200), int64(2)
memory usage: 80.0 MB
None
The fraud data now contains 51,660 observations on 202 features with no missing values.
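The column filter above can also be written without an explicit loop. The sketch below is illustrative only: the toy DataFrame and the `min_non_missing` threshold are hypothetical stand-ins for `fraud_data` and the 60,000 cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for fraud_data.
toy = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5],
    "amount": [10.0, 20.0, np.nan, 40.0, 50.0],    # 4 non-missing values
    "sparse": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 2 non-missing values
    "label": ["a", "b", "c", "d", "e"],            # non-numeric, so excluded
})
min_non_missing = 3  # stands in for the 60,000 threshold

numeric = toy.select_dtypes(["number"]).drop(columns="TransactionID")
# notna().sum() counts non-missing values per column; the boolean mask
# keeps only the columns whose count exceeds the threshold.
dense = numeric.loc[:, numeric.notna().sum() > min_non_missing]
# Drop any remaining rows that still contain a missing value.
complete = dense.dropna(axis="rows")

print(list(complete.columns))
print(len(complete))
```

Selecting columns with a boolean mask avoids adding columns to a DataFrame one at a time, which can be slow when there are hundreds of features.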
If fraud assessors who were familiar with this dataset were available to speak to, they could
provide qualitative information about the features that are likely to be good predictors of
fraud. This information could then be used to refine the feature selection process.
Assume that you have access to these fraud assessors and they have suggested that the
following features might be the most helpful in identifying fraud:
id_02;
id_19;
id_20;
V44;
V45;
V86;
V87;
C1;
C2;
C8;
C11;
C12;
C13;
C14;
card1;
card2;
TransactionAmt; and
TransactionDT.
You will use this reduced feature list to perform your initial modelling.
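The selection and scaling step might look like the sketch below. This is an assumption rather than the notebook's own code: the toy DataFrame is hypothetical, and `StandardScaler` is only one plausible way to produce a scaled feature matrix such as the `fraud_features_scaled` used later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned numeric data, with a few columns only.
toy_numeric = pd.DataFrame({
    "TransactionAmt": [10.0, 20.0, 30.0, 40.0],
    "card1": [1000.0, 2000.0, 3000.0, 4000.0],
    "C1": [1.0, 1.0, 2.0, 4.0],
    "V1": [0.0, 1.0, 0.0, 1.0],  # not in the suggested list, so not selected
})
selected_columns = ["C1", "card1", "TransactionAmt"]

features_selected = toy_numeric[selected_columns]
# StandardScaler (an assumption here) centres each column at 0 and rescales
# it to unit variance, so no single feature dominates Euclidean distances.
features_scaled = StandardScaler().fit_transform(features_selected)

print(np.round(features_scaled.mean(axis=0), 6))
print(np.round(features_scaled.std(axis=0), 6))
```

Scaling matters for K-means because features on large scales, such as `id_02` or `TransactionDT`, would otherwise dominate the distance calculations.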
print(fraud_features_selected.head())
id_02 id_19 id_20 V44 V45 V86 V87 C1 C2 C8 C11 C12 C13 \
0 191631.0 410.0 142.0 1.0 1.0 1.0 1.0 1.0 4.0 1.0 2.0 2.0 2.0
1 221832.0 176.0 507.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 116098.0 410.0 142.0 2.0 2.0 2.0 2.0 2.0 5.0 1.0 2.0 2.0 3.0
3 257037.0 484.0 507.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4 287959.0 254.0 507.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 1.0 1.0 2.0
Modelling
This section:
fits a model;
evaluates the fitted model;
improves the model; and
selects a final model.
Fit model
# Perform K-means clustering on the fraud dataset.
# Fit the models and create an elbow curve to help select a value of
# K that provides a good clustering outcome, without overfitting the data.
# Note that this step may take a while to run as the kmeans model needs to be
# fitted 10 times.
score = [kmeans[k].fit(fraud_features_scaled).score(fraud_features_scaled)
         for k in range(len(kmeans))]
The elbow curve shows a slight ‘kink’ at K = 2 and another at K = 3, suggesting that these
might be good values of K to use such that the clustering produces a small sum of squared
distances but does not overfit the training data.
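The elbow calculation can be sketched on synthetic data as follows. The data, the range of K values and the `n_init` and `random_state` settings are all hypothetical choices for illustration, not the settings used in this case study.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated groups of 50 points in 2 dimensions.
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

ks = range(1, 7)
models = [KMeans(n_clusters=k, n_init=10, random_state=0) for k in ks]
# score() returns the negative within-cluster sum of squares (WCSS),
# so it is negated here to recover the WCSS itself.
wcss = [-m.fit(X).score(X) for m in models]

for k, w in zip(ks, wcss):
    print(k, round(w, 1))
```

On data like this, the WCSS falls sharply up to K = 3 and then flattens, which is the 'kink' that an elbow curve is used to find.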
# The steps below will flag cases as fraud if they are far from their
# cluster centroid. In other words, they will be flagged as fraud if they
# look unusual compared to other observations in their cluster.
# Extract the selected kmeans model with K = 3 from the list of all
# kmeans models fitted.
# Note: because indexing starts at 0, the model with K = 3 has the list index 2.
kmeans_selected = kmeans[2]
# Get the cluster number for each observation in the fraud dataset.
labels = kmeans_selected.predict(fraud_features_scaled)
# Get the cluster centroids so that the distance from each observation to the
# centroid can be calculated to select cases to be flagged as fraud.
centroids = kmeans_selected.cluster_centers_
# Calculate the distance between each observation and its cluster centroid.
# np.linalg.norm is the 'norm' function from Numpy's linear algebra package
# and is used to calculate the Euclidean distance between two points,
# in this case between each observation (X) and its cluster centroid (Z).
distance = np.array([np.linalg.norm(X-Z) for X,Z in zip(fraud_features_scaled,
                                                         centroids[labels])])
# From the exploration of data above, it was seen that approximately 12% of
# Product Code C transactions were fraudulent. The code below will therefore
# classify an observation as fraudulent if its distance from its
# cluster centroid is at or above the 88th percentile of all distances.
fraud_boundary = 88
threshold = np.percentile(distance, fraud_boundary)
# Assign an observation as being fraudulent (1) if its distance is at or above
# the threshold. Otherwise assign the observation as being non-fraudulent (0).
fraud_prediction = np.where(distance >= threshold, 1.0, 0.0)
print(fraud_prediction)
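The distance-and-percentile rule can be checked by hand on a tiny example. The sketch below uses hypothetical observations, centroids and labels, together with a vectorised distance calculation in place of the list comprehension.

```python
import numpy as np

# Hypothetical scaled observations, centroids and cluster labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1])

# centroids[labels] picks out the centroid for each row; taking the norm
# over axis=1 gives one Euclidean distance per observation.
distance = np.linalg.norm(X - centroids[labels], axis=1)

boundary = 80  # percentile; a hypothetical value for this toy example
threshold = np.percentile(distance, boundary)
# Flag observations whose distance is at or above the threshold.
flagged = (distance >= threshold).astype(int)

print(distance)
print(threshold)
print(flagged)
```

Here the distances are 0, 1, 5, 0 and 2, the 80th percentile is 2.6, and only the third observation is flagged. Note that the percentile is taken over all distances together rather than separately within each cluster.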
Evaluate model
Unsupervised learning often involves working with datasets that do not contain a response
variable. In these cases, the output of a clustering algorithm can be evaluated using internal
or manual evaluation techniques as described in Section 6.3.5 of Module 6.
An example of internal evaluation was shown above when the clustering score (WCSS) was
calculated based on the sum of squared distances between observations and their cluster
centroids.
In fraud detection exercises, manual evaluation might be carried out by showing a fraud
expert the observations in the training dataset that were classified as fraud and asking them
to evaluate the likelihood that these observations were in fact fraud.
In this case study, response variables are available in the dataset, so these can be used to
conduct external evaluation by comparing known cases of fraud to predicted cases of
fraud. This external evaluation is performed below.
# Calculate and plot a confusion matrix to show True Positives, False Positives,
# True Negatives and False Negatives.
The confusion matrix shows that 2,690 (47%) of fraudulent transactions have been correctly
classified as ‘Fraud’. It also shows that 3,510 (8%) of non-fraudulent transactions were
incorrectly predicted to be fraud.
This suggests that there is room to improve the model to make it a more helpful tool for
the fraud detection team to use.
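As a minimal sketch of how such rates are read off a confusion matrix, the example below uses scikit-learn's `confusion_matrix` on hypothetical labels. The data is made up for illustration and is unrelated to the fraud dataset.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

detection_rate = tp / (tp + fn)       # share of true fraud that was flagged
false_positive_rate = fp / (fp + tn)  # share of non-fraud that was flagged

print(tp, fp, fn, tn)
print(detection_rate, false_positive_rate)
```

The 47% and 8% figures quoted above are the same two ratios computed from the fraud model's confusion matrix.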
Improve model
The model might be improved by trying different values for the fraud_boundary or K, or by
experimenting with different features used in the clustering.
The code below tries some different values for the fraud_boundary and K. You should also
experiment to see if you can obtain better fraud predictions from the model, either by
trying other values for fraud_boundary or K, or by trying different features to use in the
clustering.
# Extract the selected kmeans model with K = 4 from the list of all kmeans
# models fitted.
# Note: because indexing starts at 0, the model with K = 4 has the list index 3.
kmeans2 = kmeans[3]
# Get the cluster number for each observation in the fraud dataset.
labels2 = kmeans2.predict(fraud_features_scaled)
# Get the cluster centroids so that the distance from each observation to the
# centroid can be calculated to select cases to be flagged as fraud.
centroids2 = kmeans2.cluster_centers_
# Calculate the distance between each observation and its cluster centroid.
distance2 = np.array([np.linalg.norm(X-Z) for X,Z in zip(fraud_features_scaled,
                                                          centroids2[labels2])])
# This time the fraud_boundary will be set to 80 so that more transactions will be
# identified as fraud.
# Note: the results reported below (6,200 transactions flagged, about 12% of
# 51,660) correspond to a boundary of 88; applying fraud_boundary2 = 80 flags
# about 20% of transactions, so the counts will differ.
fraud_boundary2 = 80
threshold2 = np.percentile(distance2, fraud_boundary2)
fraud_prediction2 = np.where(distance2 >= threshold2, 1.0, 0.0)
print(fraud_prediction2)
# Calculate and plot a confusion matrix to show True Positives, False Positives,
# True Negatives and False Negatives.
cm = confusion_matrix(fraud_response, fraud_prediction2)
plot_confusion_matrix(cm, classes = ['Not fraud','Fraud'])
This model has a slightly higher F1 score, indicating it provides a slightly better fraud
prediction than the original model.
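The F1 comparison can be reproduced from the counts quoted in this case study: 2,690 true positives and 3,510 false positives for the original model, 2,766 and 3,434 for the revised model, and 5,763 true fraud cases in total. The helper function below is a hypothetical illustration, not part of the original notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1 score, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of flagged cases that are true fraud
    recall = tp / (tp + fn)     # share of true fraud cases that are flagged
    return 2 * precision * recall / (precision + recall)

total_fraud = 5763  # true fraud cases in the training data

# False negatives are the true fraud cases that were not flagged.
f1_original = f1_from_counts(tp=2690, fp=3510, fn=total_fraud - 2690)
f1_revised = f1_from_counts(tp=2766, fp=3434, fn=total_fraud - 2766)

print(round(f1_original, 3), round(f1_revised, 3))
```

On these counts the F1 score rises from roughly 0.450 to roughly 0.462, which is consistent with the description of a slight improvement.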
Evaluation / observations
This fraud detection model:
detects about 48% of true fraud cases in the training data (2,766 out of 5,763); and
classifies about 7.5% of true non-fraud cases in the training data as being fraudulent
(3,434 out of 45,897).
Further, of the 6,200 transactions classified as fraud, only 45% of them are actual cases of
fraud.
Therefore, this model probably requires further refinement to avoid missing too many true
cases of fraud and to avoid classifying too many non-fraud cases as fraud.
It is likely that another classifier, such as a GBM or a neural network, would make better
fraud predictions based on the training data. However, the power of the model presented
above is that it does not need examples of fraud to be able to make a reasonable attempt
at detecting them. This means that this ‘outlier’ classifier might be better at identifying new
types of fraud that emerge over time than another classifier might be.
Further, while this case study has not provided a very good classification of fraudulent
transactions based on the training data, it has demonstrated how clustering might be used
to detect outliers in a dataset.
Afterword
The output of fraud detection is a ‘short-list’ of possible outliers. Incorrectly accusing
someone of fraudulent behaviour can have very detrimental financial and reputational
impacts on an organisation, as well as likely causing the accused individual significant
stress. For this reason, the results of a fraud detection activity usually need to be reviewed
manually by a human to determine whether fraud has occurred or is likely to occur in each
case. Further monitoring and/or investigation might be required before deciding whether
fraud has occurred. Again, when conducting fraud investigations, it is important to consider
the potential harm that can be caused to someone undergoing investigation.
At the same time, letting fraud go undetected and uncorrected can also be very costly for
an organisation. Therefore, it is important to track how well any fraud detection algorithm
performs over time. As the nature of fraud changes over time, it is important to assess
whether the algorithm is keeping pace with the changing environment.