Identifying Banking Transaction Descriptions via SVM Based on a Specialized Labelled Corpus
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2983584, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier xx.xxxx/ACCESS.xxxx.DOI
ABSTRACT Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional
text representation methods have been successfully applied to self-contained documents of medium size.
However, information in short texts is often insufficient, due, for example, to the use of mnemonics,
which makes them hard to classify. Therefore, the particularities of specific domains must be exploited.
In this article we describe a novel system that combines Natural Language Processing techniques with
Machine Learning algorithms to classify banking transaction descriptions for personal finance management,
a problem that had not previously been considered in the literature. We trained and tested this system on a
labelled dataset of real customer transactions, which will be made available to other researchers on request.
Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce
training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining
this detector with an SVM indicate high accuracy in comparison with alternative approaches, taking into
account complexity and computing time. Finally, we present a use case with a personal finance application,
CoinScrap, which is available on Google Play and the App Store.
INDEX TERMS Machine Learning, Natural Language Processing, banking, personal finance management.
VOLUME X, 2020 1
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2983584, IEEE Access
Silvia García-Méndez et al.: Identifying Banking Transaction Descriptions via SVM Based on a Specialized Labelled Corpus
[8], [9], as it is difficult to extract key features from large feature spaces for accurate classification training.

2) Real-time generation: Nowadays vast amounts of information are continuously produced in the form of short messages. Consider, for example, chat and micro-blog information and news comments, among others. They reflect reactions in real time to outside world events and, therefore, are difficult to collect. Consequently, short-text classification methods must be highly efficient.

3) Irregularity: Short-text terminology is not standardized and vocabularies are informal or specific (in our case, related to banking).

Two key aspects are that words are seldom repeated in a given BT description and that few words are irrelevant. The level of significance of a word cannot be simply determined by its repetition within the text. However, for the same reasons, short texts are less noisy than long texts.

Our proposal is based on Natural Language Processing (NLP) and Machine Learning (ML). It characterizes financial short messages with features such as character and word n-grams, which feed a supervised Support Vector Machine (SVM) classifier. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Therefore, our proposal consists of a two-stage classifier combining this detector with an SVM. In any case, the sizes of short-text banking description datasets discourage the application of deep learning techniques [10].

The rest of this article is organized as follows. In Section II we review the state of the art in short-text classification. In Section III we describe the classification problem. Subsections III-A1-III-A4 explain the modules of our system. In particular, Section III-A4 describes the short text similarity detector to reduce training set size based on the Jaccard distance. Section IV presents the experimental text corpora and evaluates our approach with real data. Section V cites a real-world solution based on our approach. Finally, Section VI concludes the paper.

II. RELATED WORK

A. CUSTOMER ANALYSIS
BT data have grown considerably with the expansion of electronic banking [11]. The banking sector is well aware of the value of customer information covering demographics, leisure, wealth, insurance, financial transactions, and so on.

Several studies have been conducted on the analysis of customer attrition and retention. Some focus on aspects influencing customer choices, such as customer care, speed and quality of service, variety of services, fees, online accessibility, etc. [12]–[14]. Other studies have focused on customer churn (that is, leaving one bank for another) [12], [15]–[17], fraud [18], [19] and even spatial distribution from transaction activity in commercial areas [20].

B. PERSONAL FINANCE MANAGEMENT
Personal finance management (PFM) aggregates household bank accounts and offers users a view of their day-to-day personal finances. It involves planning and budgeting, cash flow control, investment, taxation, and insurance [21]. It is becoming increasingly popular, and many PFM resources such as BudgetBuddy1, AccBiz2, Prosper3, Finn4 and Figo5 exploit PFM by recommending personalized insurance products or long-term financing plans. These applications also provide budgeting and credit scoring tools to help households track their expenses and credit score.

C. OPEN BANKING EUROPEAN REGULATION
The European path to digitization is based on four pillars [22]: (1) extensive reporting requirements to control systemic risk and change financial sector behaviour; (2) strict data protection rules; (3) open banking to enhance competition; and (4) a legislative framework for digital identification. In this line, the Second Payments Services Directive6 (PSD 2) empowers customers to make their banking data available to third parties such as FinTech companies. In essence, it paves the way for new banking products and services by promoting competition without compromising security.

D. TEXT CLASSIFICATION
Most existing approaches for text classification rely on simple document representations in word-oriented input spaces. Despite considerable efforts to introduce more sophisticated techniques for document representation, such as those based on higher-order word statistics [23], NLP [24], string kernels [25] and word clusters [26], simple bag-of-words (BOW) approaches [27] are still popular.

Different ML methods, such as Naive Bayes [28], logistic regression [29] and SVMs [30], have been proposed for text classification. In particular, linear classifiers, which are efficient, robust and easy to interpret, have been successful at sentiment analysis [31].

Diverse complex features have been added to these text classification models. Some examples are parts-of-speech and phrase information [32], syntax integration by means of explicit features and implicit kernels [33], and, for sentiment analysis, dependency tree features [34] and semantic composition models [35]. In [36] it was shown that BOW and bigram features are more productive than much more complex features. Distributed word representations [37]–[39] have enriched discrete models for semi-supervised learning. Word embeddings have mostly been used to feed neural

1 Available at https://www.budgetbuddyaus.com.au/.
2 Available at http://www.webunit.co.uk/clients/access/index.html.
3 Formerly available at https://www.prosper.com/.
4 Available at https://www.chase.com/personal/finnbank.
5 Available at https://www.figo.io/.
6 Directive (EU) 2015/2366 of the European Parliament and of the Council of 25 November 2015 on payment services in the internal market, amending Directives 2002/65/EC, 2009/110/EC and 2013/36/EU; Regulation (EU) No 1093/2010; Repealing Directive 2007/64/EC, OJ of 23.12.2015, L 337/35.
few characters. In most cases verbs are totally absent. Nevertheless, BT descriptions may still contain useless information that may affect text classification.

First, each BT description is split into tokens and, in some cases, into sentences. Then meaningless words or stopwords, such as determiners and prepositions (‘el’/‘the’, ‘en’/‘in’, ‘entonces’/‘so’, ‘aunque’/‘although’, ‘pero’/‘but’ and so on), are removed. Table 1 shows some stopword examples7. Next, all punctuation marks apart from ‘.’ and ‘,’ are also removed.

TABLE1: Some examples of stopwords.

Stopwords
algún   como    incluso   poder     también
ambos   esta    otro      por       tras
ante    estar   para      primero   un
antes   hacer   pero      ser       uso

7 Available at https://www.ranks.nl/stopwords/spanish.

3) Proper name detection
Finally, proper names are detected using lists of names and surnames8 and replaced by a tag.

Taking the real BT description ‘Compra en supermercado Elvira Madrid 28. TARJ. :*320546’ as an example, after text tokenization, stopword removal and proper name extraction, the result is ‘Compra # supermercado #PNegi# Madrid 28. TARJ. #320546’. The “#" symbol marks the place where a word is removed. Note that each proper name is substituted by “#PN" followed by a set of characters (‘egi’ in the example) and “#". Thus, a given name is always replaced by the same identifier (‘Elvira’ by #PNegi# in the example). Credit card numbers were always anonymized.

8 Available at https://github.com/olea/lemarios.

4) Training sample reduction with similarity detection stage
We take advantage of the fact that many BT descriptions are similar to reduce the size of the training set. For that purpose, we insert a similarity detector based on the Jaccard distance [55] before the classifier. This is inspired by spam detection techniques that use this distance to seek characteristic sentences [56]–[58].

The similarity detector only considers the text of the descriptions. When the Jaccard similarity between a new labelled description and a previous entry in our dataset exceeds 85%, and both belong to the same category, the new description is not added to the SVM training set. Otherwise, we keep it. When the similarity between a new unlabelled description and a previous entry exceeds 85%, we assign the class of that entry to the description. Otherwise, the description is passed to the SVM for classification.

Figure 2 illustrates the architecture of the system including the Jaccard similarity detector. The SVM classifier is explained in Section III-B3.

B. MACHINE LEARNING ANALYSIS
In this section we explain the knowledge-based linguistic extraction as well as the feature selection.

1) Linguistic knowledge extraction
In this step we create lexica whose entries are related to the categories of the classification problem. Figure 3 represents the lexicon generation procedure.

First, starting from the preprocessed BT descriptions in the training set, which are labelled according to the classification categories, all non-alphabetic characters such as numbers, punctuation marks and symbols are cleared. Next, useful final elements for the lexica are extracted. These are the unigrams that appear at least five times in the text corpora for each category (all others are excluded) and the bigrams that are present at least three times in the corpora. Single-character alphabetic elements are also discarded. The final result is a set of lexica with unigrams and bigrams and their corresponding categories.

For example, let us suppose that the training set only has the following entries for a given category:
1) Compra en Pescados Diego, S.L. (‘Purchase at Pescados Diego, S.L.’)
2) Compra en supermercado Elvira Madrid 28 (‘Purchase at Elvira supermarket Madrid 28’)
3) Compra en amazon.es (‘Purchase in amazon.es’)
4) Compra en supermercado Carrefour Enero 2018 (‘Purchase at Carrefour supermarket January 2018’)
5) Compra en amazon.es Febrero 2018 (‘Purchase in amazon.es February 2018’)
6) Compra en Amazon (‘Purchase in Amazon’)
7) Pago en supermercado Elvira Alicante (‘Payment at Elvira Alicante supermarket’)
8) Pago en supermercado El Corte Inglés Vigo (‘Payment at El Corte Inglés Vigo supermarket’)
9) Compra en supermercado Carrefour Febrero 2018 (‘Purchase at Carrefour supermarket February 2018’)
10) Compra en supermercado amazon.es (‘Purchase in amazon.es supermarket’)

The resulting lexicon would only contain the words ‘compra’ and ‘supermercado’ and the bigram ‘compra supermercado’, followed by the categories.

2) Feature selection and weight calculation
The system uses a standard SVM algorithm for modelling and prediction. Short texts are encoded according to the vector space model in [59]. The smallest data unit in the model corresponds to a feature. A text T may be seen as an n-dimensional vector in the vector space, as follows:

T = ((t1, w1), (t2, w2), ..., (tn, wn))    (1)

where t is the value of a feature of text T and w its weight. The greater the w, the more information the feature contains in that case [60].

Many different types of features are possible, such as Boolean, word frequency (number of times a word appears in the text) and TF-IDF. Note that classification results depend greatly on feature selection [61], [62]. An efficient feature selection method not only reduces the dimension of the
feature space but also avoids useless features. The features in our system are the following:

1) Lexicon data. These features count the words in the BT descriptions that appear in the lexica for each existing category.
2) Amount. The range of the BT amount field, since ranges are more significant for our application than exact values. Specifically, we consider non-overlapping intervals limited by 20, 60, 200, 800, 1500 and 3000 euros.
3) Sign of the amount. This feature indicates if the BT is an income (positive) or an expense (negative).
4) Date. The information in the date field of each BT. Again, we use ranges. This is because some events occur on specific days of the month (e.g. salary at the end), whereas other events (e.g. purchases) may happen anytime during the month. The selected ranges were the last five, ten, twenty and twenty-five days of the month.
5) Word n-grams. N-gram representation is language-independent. It transforms documents into high-dimensional feature vectors where each feature corresponds to a contiguous sub-string. Formally, an n-gram consists of n adjacent items from alphabet A. Items can be phonemes, syllables, letters, words or base pairs depending on the application. Hence, the number of different n-grams in a text is |A|^n at most. The dimension of an n-gram feature sub-vector may therefore be very high even for moderate values of n. However, since not all n-grams are present in a document, the dimension is substantially reduced. During the formation of an n-gram feature sub-vector, all upper-case characters are converted into lower-case characters and punctuation marks are converted to spaces. Sub-vectors are then normalized. The optimal n depends on the text corpora. We explain feature sub-vectors with an example that computes the n-grams from one to four words for the BT description ‘Operación tarjeta débito Amazon’
(‘Amazon debit card transaction’). The resulting vector consists of the following components: ‘operación’ (‘transaction’), ‘tarjeta’ (‘card’), ‘débito’ (‘debit’), ‘amazon’; ‘operación tarjeta’ (‘card transaction’), ‘tarjeta débito’ (‘debit card’), ‘débito amazon’ (‘amazon debit’); ‘operación tarjeta débito’ (‘debit card transaction’), ‘tarjeta débito amazon’ (‘amazon debit card’); ‘operación tarjeta débito amazon’ (‘amazon debit card transaction’).
6) Character n-grams. Character n-grams have proven useful for a variety of ML problems, such as language detection. Simple models based on them have outperformed convolutional and recursive deep neural networks (CNNs and RNNs) [63]–[65].
We illustrate them with an example that computes the trigram, four-gram and five-gram character sub-vectors for the sentence ‘Operación tarjeta débito Amazon’ (note that spaces are also taken into account when computing character n-grams): (ope, per, era, rac, aci, ció, ión, ón , n t, ta, tar, arj, rje, jet, eta, ta , a d, dé, déb, ébi, bit, ito, to , o a, am, ama, maz, azo, zon; oper, pera, erac, raci, ació, ción, ión , ón t, n ta, tar, tarj, arje, rjet, jeta, eta , ta d, a dé, déb, débi, ébit, bito, ito , to a, o am, ama, amaz, mazo, azon; opera, perac, eraci, ració, ación, ción , ión t, ón ta, n tar, tarj, tarje, arjet, rjeta, jeta , eta d, ta dé, a déb, débi, débit, ébito, bito , ito a, to am, o ama, amaz, amazo, mazon).
They have been applied in scenarios with misspelling errors [66], [67]. Character n-grams may also capture other effects of language usage, such as re-named entities and abbreviations, e.g. ‘maths’ instead of ‘mathematics’. In our case, they are justified by the many shortened words in BT descriptions.

3) SVM classifier
We decompose the overall problem into pairwise two-class problems, following a one-versus-one approach. Therefore, k(k − 1)/2 SVM classification models are necessary for k text classes. The category is decided by majority voting.

IV. EXPERIMENTAL RESULTS
All experiments were performed on a computer with the following specifications:
1) Operating System: Ubuntu 18.04 LTS, 64 bits
2) Processor: Intel Core i5-3470 CPU @ 3.2 GHz x 4
3) RAM: 15.4 GB
4) Disk: 1.9 TB

A. DATASET
The dataset comprises 30,844 BT descriptions from customer accounts of major Spanish banks, written mostly in Spanish and issued between August 2017 and February 2018. They were collected during the CatCoin project with the collaboration of CoinScrap Finance S.L., Spain, using the CoinScrap platform. The entries of the dataset have the following attributes:
1) ID: a unique numeric identifier.
2) Description: the BT short-text description.
3) Amount: the amount in euros of the BT, either positive (income) or negative (expense).
4) Date: the date when the BT occurred.

Every entry has an extra field with the category label that determines the classification goal. The dataset may be requested from the authors by e-mail. Table 2 shows the numerical distributions of the fifteen categories in the dataset. Table 3 shows some examples of dataset entries.

B. EVALUATION METRICS
Due to the issues of accuracy with class asymmetries [68], [69], we employed precision, recall and F metrics using a macro-average approach.

Macro-averaged results were computed as indicated by [70]. Consider a binary evaluation metric B(tp, tn, fp, fn) that is calculated based on the number of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). Let tpλ, fpλ, tnλ and fnλ be the amounts of true positives, false positives, true negatives and false negatives, respectively, after binary evaluation for label λ. The macro-average evaluation metric is calculated as follows:

B_macro = (1/q) * sum_{λ=1}^{q} B(tpλ, fpλ, tnλ, fnλ)    (2)

where q is the number of categories. Macro-averaging weights all classes equally, whereas micro-averaging weights all document classification decisions equally. Since F ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate over small classes in micro-averaging [71]. For this reason we preferred the macro-average approach.

To calculate precision, recall and F rates we first computed each of these measures separately for each category q using expressions (3)-(5):

Precision_micro_q = tp_q / (tp_q + fp_q)    (3)

Recall_micro_q = tp_q / (tp_q + fn_q)    (4)

F_micro_q = 2 * (Precision_micro_q * Recall_micro_q) / (Precision_micro_q + Recall_micro_q)    (5)

These metrics were then averaged by category using expression (2) to produce the macro-averaged metrics.

C. NUMERICAL RESULTS
We performed cross-validation in different dataset splits of training and testing subsets (in all cases the first and second percentages correspond to training and testing subset sizes, respectively): 30%-70%, 40%-60%, 60%-40% and 70%-30%. The purpose was to check the robustness of our system when fewer training data were available.
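The macro-averaged evaluation of expressions (2)-(5) can be sketched as follows. This is a minimal illustration; the per-category count layout and the function names are our assumptions, not the authors' implementation, and true negatives are omitted because precision, recall and F do not use them.

```python
# Sketch of expressions (2)-(5): per-category precision, recall and F,
# then equal-weight (macro) averaging over the q categories.
# Counts layout {category: (tp, fp, fn)} is an illustrative assumption.

def prf(tp: int, fp: int, fn: int) -> tuple:
    """Expressions (3)-(5) for one category, guarding empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def macro_average(counts: dict) -> tuple:
    """Expression (2): average each metric over the q categories,
    weighting all categories equally regardless of their size."""
    q = len(counts)
    per_category = [prf(tp, fp, fn) for tp, fp, fn in counts.values()]
    return tuple(sum(m[i] for m in per_category) / q for i in range(3))
```

For instance, with two categories whose precisions are 0.8 and 1.0, the macro-averaged precision is 0.9 regardless of how many instances each category holds, which is exactly why small classes are not dominated by large ones under this scheme.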
TABLE2: Numerical distributions of the fifteen categories in the dataset.

Category	Instances
Bank 4,835
Means of transport 3,479
Shopping 11,061
Household expenses 1,158
Taxes and charges 489
Off-cycle income 89
Payroll 248
Leisure 2,362
Health, sport and education 867
Insurances 883
Social security, grants and pensions 67
Transfers 2,086
Business and professional expenses 197
Rentals 116
Others 2,907
Total 30,844
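The Jaccard similarity detector stage (Section III-A4), which the experiments below apply both to our system and to the competitors, can be sketched as follows. This is a minimal illustration assuming whitespace tokenization over token sets; the helper names and the (description, category) layout are ours, and only the 85% threshold and the two-stage logic come from the text.

```python
# Sketch of the Jaccard similarity detector stage (Section III-A4).
# Whitespace tokenization and all names are illustrative assumptions.

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two BT descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

THRESHOLD = 0.85  # similarity above which two descriptions match

def reduce_training_set(labelled_entries):
    """Training time: drop a labelled description when a kept entry of
    the same category is more than 85% similar to it."""
    kept = []  # list of (description, category) pairs
    for desc, cat in labelled_entries:
        redundant = any(c == cat and jaccard_similarity(desc, d) > THRESHOLD
                        for d, c in kept)
        if not redundant:
            kept.append((desc, cat))
    return kept

def classify(desc, kept, svm_predict):
    """Prediction time: reuse the category of a sufficiently similar
    stored entry; otherwise fall back to the SVM (second stage)."""
    for d, c in kept:
        if jaccard_similarity(desc, d) > THRESHOLD:
            return c
    return svm_predict(desc)
```

Because the detector runs on plain token sets, it can be prepended unchanged to any classifier, which is how the competitor systems receive the same reduction below.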
TABLE4: Average word distribution in the lexica for the different training-testing splits before applying the similarity filter.
In each experiment we extracted the lexica of the set as explained in Section III-B1. Table 4 shows the distributions of words in the lexica for all categories before applying the similarity detector. We added features incrementally to the model to assess their significance. Therefore, first we only used word n-grams and lexica, then we added BT amount and date, and finally character n-gram features.

Given the target sector (finance), precision may be more important than recall. This is because banking campaigns prefer to obtain fewer positives for key categories. By doing so, they maximize the probability that customers will be receptive to personalized products.

We compared our system with three competitor approaches, All-In-1 [72] and two variants of the method by IITP (Indian Institute of Technology Patna) [73]. These approaches analyzed customer feedback to manufacturers, which also consisted of short texts, although with more elaborate sentences than BT descriptions. Note that no other researchers have considered BT to date. For the sake of fairness, we applied the Jaccard distance detector stage to the competitors as well.

The All-In-1 approach in [72] is based on a classic SVM classifier that takes character n-grams and monolingual word embeddings as input. Logically we only used the monolin-
TABLE5: Elapsed training and testing times of our system for different dataset splits.

competing approaches from the state-of-the-art, All-In-1 and two variants of the IITP method.

The Jaccard similarity detector achieved reductions of training data exceeding 56% for all splits.

For the 30%-70% split, our system attained the best precision. It was inferior to All-In-1 in recall and F unless all features were enabled. If they were, our system also outperformed its competitors in F. For the 40%-60% split, our system outperformed the competitors in terms of precision, recall and F when all features were enabled. It was better in precision even with the basic combination of features. For the 60%-40% and 70%-30% splits, our system again outperformed the competitors in terms of precision, and the performance gap with All-In-1, in the cases where it existed, was reduced. Indeed, our approach is simpler than the competitors, which allowed a significant training time reduction.

V. USE CASE: COINSCRAP
CoinScrap launched its mobile app for iOS and Android in November 2016, and since then it has had thousands of downloads. A new version of the application was launched in October 2018. It includes journey improvements for product fulfilment; dynamic "gamified" saving rules (e.g. saving when your favourite team wins, or when you take a coffee); and personalised recommendations for financial management. The latter rely on our system to classify BT transactions. In this line, CoinScrap recommends personalized services and products based on financial necessities and objectives. Figure 4 shows a screenshot of the app.

VI. CONCLUSIONS
Compared to normal texts, short-text analysis is challenging due to sparsity, irregularity and real-time data generation. In this paper we describe a short-text SVM BT classification system using a combination of meta-information and linguistic knowledge (by relying on specialized lexica).

Motivated by existing solutions in spam detection, we achieved a significant reduction of training information with a short text similarity detector based on the Jaccard distance.

Experimental results, comparing our approach with three state-of-the-art competitors with higher computational complexity, are very promising. Our lexicon feature is crucial to attain high precision, especially if the training dataset is small.
TABLE6: Elapsed training and testing times of the competitor systems for different dataset splits.
TABLE8: Average evaluation metrics for the basic combinations of features, 30%-70% split.
TABLE9: Average evaluation metrics of the proposed system for all combinations of features, 30%-70% split.
TABLE10: Average evaluation metrics for the basic combinations of features, 40%-60% split.
TABLE11: Average evaluation metrics of the proposed system for all combinations of features, 40%-60% split.
TABLE12: Average evaluation metrics for the basic combinations of features, 60%-40% split.
TABLE13: Average evaluation metrics of the proposed system for all combinations of features, 60%-40% split.
TABLE14: Average evaluation metrics for the basic combinations of features, 70%-30% split.
TABLE15: Average evaluation metrics of the proposed system for all combinations of features, 70%-30% split.
TABLE16: Performance of our system by category with all features enabled, 70%-30% split.
FIGURE 4: The CoinScrap app.

The effectiveness of the proposed system was demonstrated on a real dataset reflecting the activity of real customers of Spanish banks, organised into fifteen different classes including means of transport, shopping, household expenses, taxes, charges and payroll. This labelled dataset is a valuable asset that will be available to other researchers on request.

Our system attained the best precision (the most relevant metric in PFM) and performed similarly in terms of recall and F-measure when enough features were enabled, especially when the methods were stressed by reducing the training-to-test subset size ratio.

Given the encouraging results of this work, we are currently extending it to obtain sub-categorisations of the descriptions. Our approach has been put into production in a real PFM application, CoinScrap.
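As a rough illustration of the two-stage approach described above, the sketch below combines a Jaccard-distance near-duplicate detector (to reduce the training set) with a linear SVM over TF-IDF features, using scikit-learn. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the toy transaction descriptions, the 0.5 distance threshold and the helper names are invented for the example.

```python
# Minimal sketch of the two-stage idea: (1) a Jaccard-distance
# similarity detector prunes near-duplicate training descriptions,
# (2) a linear SVM classifies the remainder. Toy data; the threshold
# and helper names are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def jaccard_distance(a: str, b: str) -> float:
    """1 - |A∩B| / |A∪B| over the word sets of two descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def prune_near_duplicates(texts, labels, threshold=0.5):
    """Drop a description if it lies within `threshold` Jaccard
    distance of an already-kept description with the same label."""
    kept, kept_labels = [], []
    for text, label in zip(texts, labels):
        if not any(l == label and jaccard_distance(text, t) < threshold
                   for t, l in zip(kept, kept_labels)):
            kept.append(text)
            kept_labels.append(label)
    return kept, kept_labels

# Invented mnemonic-style transaction descriptions.
texts = ["card purchase supermarket 1234",
         "card purchase supermarket 5678",   # near-duplicate of the first
         "payroll transfer acme corp",
         "monthly payroll acme corp",        # near-duplicate of the third
         "fuel station payment"]
labels = ["shopping", "shopping", "payroll", "payroll", "transport"]

# Stage 1: similarity detector shrinks the training set from 5 to 3.
train_texts, train_labels = prune_near_duplicates(texts, labels)

# Stage 2: TF-IDF features + linear SVM on the pruned set.
vectorizer = TfidfVectorizer()
clf = LinearSVC().fit(vectorizer.fit_transform(train_texts), train_labels)
pred = clf.predict(vectorizer.transform(["card purchase supermarket 4321"]))
```

In practice the threshold would be tuned on held-out data: too low and redundant mnemonics survive, too high and genuinely distinct descriptions of the same class are discarded.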