
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2983584, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier xx.xxxx/ACCESS.xxxx.DOI

Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus
SILVIA GARCÍA-MÉNDEZ¹, MILAGROS FERNÁNDEZ-GAVILANES², JONATHAN JUNCAL-MARTÍNEZ¹, FRANCISCO J. GONZÁLEZ-CASTAÑO¹, AND ÓSCAR BARBA SEARA³
¹Information Technology Group, atlanTTic, School of Telecommunications Engineering, University of Vigo, Campus, 36310 Vigo, Spain
²Defense University Center, 36920 Marín, Pontevedra, Spain
³CoinScrap Finance S.L., Cobián Roffignac 2, 36002 Pontevedra, Spain
Corresponding author: Silvia García-Méndez (e-mail: [email protected]).
This work was partially supported by Ministerio de Economía, Industria y Competitividad under grant TEC2016-76465-C2-2-R and Xunta
de Galicia under grants GRC2018/053 and ED341D-R2016/012.

ABSTRACT Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional
text representation methods have been successfully applied to self-contained documents of medium size.
However, information in short texts is often insufficient, due, for example, to the use of mnemonics,
which makes them hard to classify. Therefore, the particularities of specific domains must be exploited.
In this article we describe a novel system that combines Natural Language Processing techniques with
Machine Learning algorithms to classify banking transaction descriptions for personal finance management,
a problem that was not previously considered in the literature. We trained and tested this system on a labelled dataset of real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short-text similarity detector, based on the Jaccard distance, to reduce training set size. Experimental results with a two-stage classifier combining this detector with an SVM indicate high accuracy in comparison with alternative approaches, taking complexity and computing time into account. Finally, we present a use case with a personal finance application, CoinScrap, which is available on Google Play and the App Store.

INDEX TERMS Machine Learning, Natural Language Processing, banking, personal finance management.

I. INTRODUCTION
Financial companies need to develop new strategies to keep and expand their customer base. Their product portfolios have diversified over the years and customer behaviour has shifted from long-term loyalty to online interaction.
The fierce competition between banks has led to a growing need to convert customer data – which include short-text banking transaction (BT) descriptions – into information relevant for decision making.
Data mining has been successfully applied to finance in various ways: identifying likely candidates for loan disbursement [1] and product acceptance [2]; characterizing product segments [3]; and analysing customer attrition and retention [4]. However, to the best of our knowledge the problem of automatic classification of short-text BT descriptions (according to a predefined set of labels) has not yet been tackled.
From a broader perspective, automated text classification has become a popular research area due to the many public digital text sources available. Text classification is useful for a wide range of applications, such as web searching [5], opinion mining [6] and event detection [7]. Nevertheless, most text classification methods are designed for long texts. Some distinctive aspects of short texts are:
1) Sparsity: Short texts often have fewer than 150 words and are usually organized in few sentences. They convey very little effective information. Since sparsity affects the quality of short-text semantics, traditional techniques such as those used for long texts are impractical

VOLUME X, 2020 1

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

Silvia García-Méndez et al.: Identifying Banking Transaction Descriptions via SVM Based on a Specialized Labelled Corpus

[8], [9], as it is difficult to extract key features from large feature spaces for accurate classification training.
2) Real-time generation: Nowadays vast amounts of information are continuously produced in the form of short messages. Consider, for example, chat and micro-blog messages and news comments, among others. They reflect real-time reactions to outside world events and, therefore, are difficult to collect. Consequently, short-text classification methods must be highly efficient.
3) Irregularity: Short-text terminology is not standardized and vocabularies are informal or specific (in our case, related to banking).
Two key aspects are that words are seldom repeated in a given BT description and that few words are irrelevant. The level of significance of a word cannot be simply determined by its repetition within the text. However, for the same reasons, short texts are less noisy than long texts.
Our proposal is based on Natural Language Processing (NLP) and Machine Learning (ML). It characterizes financial short messages with features such as character and word n-grams, which feed a supervised Support Vector Machine (SVM) classifier. Motivated by existing solutions in spam detection, we also propose a short-text similarity detector, based on the Jaccard distance, to reduce training set size. Therefore, our proposal consists of a two-stage classifier combining this detector with an SVM. In any case, the sizes of short-text banking description datasets discourage the application of deep learning techniques [10].
The rest of this article is organized as follows. In Section II we review the state of the art in short-text classification. In Section III we describe the classification problem. Subsections III-A1-III-A4 explain the modules of our system. In particular, Section III-A4 describes the short-text similarity detector, based on the Jaccard distance, that reduces training set size. Section IV presents the experimental text corpora and evaluates our approach with real data. Section V cites a real-world solution based on our approach. Finally, Section VI concludes the paper.

II. RELATED WORK
A. CUSTOMER ANALYSIS
BT data have grown considerably with the expansion of electronic banking [11]. The banking sector is well aware of the value of customer information covering demographics, leisure, wealth, insurance, financial transactions, and so on.
Several studies have been conducted on the analysis of customer attrition and retention. Some focus on aspects influencing customer choices, such as customer care, speed and quality of service, variety of services, fees, online accessibility, etc. [12]–[14]. Other studies have focused on customer churn (that is, leaving one bank for another) [12], [15]–[17], fraud [18], [19] and even the spatial distribution of transaction activity in commercial areas [20].

B. PERSONAL FINANCE MANAGEMENT
Personal finance management (PFM) aggregates household bank accounts and offers users a view of their day-to-day personal finances. It involves planning and budgeting, cash flow control, investment, taxation, and insurance [21]. It is becoming increasingly popular, and many PFM resources such as BudgetBuddy1, AccBiz2, Prosper3, Finn4 and Figo5 exploit PFM by recommending personalized insurance products or long-term financing plans. These applications also provide budgeting and credit scoring tools to help households track their expenses and credit scores.

C. OPEN BANKING EUROPEAN REGULATION
The European path to digitization is based on four pillars [22]: (1) extensive reporting requirements to control systemic risk and change financial sector behaviour; (2) strict data protection rules; (3) open banking to enhance competition; and (4) a legislative framework for digital identification. In this line, the Second Payment Services Directive6 (PSD2) empowers customers to make their banking data available to third parties such as FinTech companies. In essence, it paves the way for new banking products and services by promoting competition without compromising security.

D. TEXT CLASSIFICATION
Most existing approaches for text classification rely on simple document representations in word-oriented input spaces. Despite considerable efforts to introduce more sophisticated techniques for document representation, such as those based on higher-order word statistics [23], NLP [24], string kernels [25] and word clusters [26], simple bag-of-words (BOW) approaches [27] are still popular.
Different ML methods, such as Naive Bayes [28], logistic regression [29] and SVMs [30], have been proposed for text classification. In particular, linear classifiers, which are efficient, robust and easy to interpret, have been successful at sentiment analysis [31].
Diverse complex features have been added to these text classification models. Some examples are part-of-speech and phrase information [32], syntax integration by means of explicit features and implicit kernels [33], and, for sentiment analysis, dependency tree features [34] and semantic composition models [35]. In [36] it was shown that BOW and bigram features are more productive than much more complex features. Distributed word representations [37]–[39] have enriched discrete models for semi-supervised learning. Word embeddings have mostly been used to feed neural

1 Available at https://www.budgetbuddyaus.com.au/.
2 Available at http://www.webunit.co.uk/clients/access/index.html.
3 Formerly available at https://www.prosper.com/.
4 Available at https://www.chase.com/personal/finnbank.
5 Available at https://www.figo.io/.
6 Directive (EU) 2015/2366 of the European Parliament and of the Council of 25 November 2015 on payment services in the internal market, amending Directives 2002/65/EC, 2009/110/EC and 2013/36/EU; Regulation (EU) No 1093/2010; Repealing Directive 2007/64/EC, OJ of 23.12.2015, L 337/35.



network models such as recursive tensor networks [40], dynamic pooling networks [41] and deep convolutional neural
networks [42]. Finally, direct learning of distributed vector
representations of paragraphs and sentences for text classifi-
cation was discussed in [43].
As previously mentioned, unlike normal text classification,
short-text classification must tackle the problem of sparsity
[44]. Rare and even missing words in training texts may ap-
pear in testing data. Most words only appear once in the texts
that include them. Therefore, the term frequency-inverse document frequency (TF-IDF) metric is not representative.
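This effect is easy to see in a toy computation (our own illustrative sketch; the tokenized descriptions and the `tf_idf` helper are invented for the example, not taken from the paper):

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF for a term of one tokenized document."""
    tf = doc.count(term) / len(doc)                # term frequency
    df = sum(1 for d in corpus if term in d)       # document frequency
    return tf * math.log(len(corpus) / df)         # TF times IDF

# Two short BT-like descriptions (invented): every word occurs once.
corpus = [["compra", "supermercado", "elvira"],
          ["recibo", "seguro", "hogar"]]
weights = [tf_idf(term, corpus[0], corpus) for term in corpus[0]]
# Each term appears exactly once in its description and in a single
# document, so all weights coincide: TF-IDF cannot rank the terms.
```

Since no term repeats, the TF component is uniform and the weights collapse to a single value, which is exactly why TF-IDF stops being informative for such texts.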
To address this issue, some researchers enrich data con-
texts with information from Wikipedia [45] and ontologies
[46]. However, this requires solid NLP knowledge and high-
dimensional representations that may be expensive in terms
of memory and computing time and, thus, inefficient for
real-time solutions. The more sophisticated approach in [47]
applied a Dirichlet multinomial mixture model for short-text
classification. The approach in [48] clustered texts using the
Locality Preserving Indexing (LPI) algorithm. Textual trees in recurrent neural networks (RNNs) are also computationally expensive [49]. Therefore, the design of efficient models is
still challenging.
Two well-known methods for short-text classification are Probase Bag-of-Concepts short-text classification (Probase BOC STC) [50] and Entity Explicit Semantic Analysis (ESA) [51]. ESA is based on semantic relation degrees [52]–[54] from Wikipedia. It associates all words in a Wikipedia page to the corresponding Wikipedia entry (concept) using the TF-IDF value as correlation metric and produces indexes that map each word in a short text to the concepts considered. Note that the short text may not mention the concept explicitly. ESA uses the vector representation of a short text as the input of an SVM classifier. Probase BOC STC, in turn, is based on the Probase knowledge base of entity relationships and other related information that Microsoft extracted from massive Internet data using the is-a relationship. A key difference with ESA is that Probase is a knowledge base in itself, produced with an automatic extraction algorithm. However, as in the case of ESA indexing, is-a relationships may lack relevant information for short-text classification.
To the best of our knowledge no previous research has considered short-text BT classification. We propose a simple and efficient approach that could be easily adapted to other application domains.

FIGURE 1: System stages.

III. SYSTEM DESCRIPTION
We seek to develop a simple and efficient short-text classification system, with high macro-average precision, recall and F measures, by taking advantage of the particularities of BT descriptions. Our approach has three stages, as described in Figure 1: (1) preprocessing, (2) ML (linguistic knowledge extraction and probabilistic model training), and (3) classification.

A. PREPROCESSING
1) Data retrieval
Data was retrieved with the CoinScrap coin scrapper embedded into the electronic banking apps of real users, who granted us permission. Section IV-A describes the resulting dataset.

2) Text tokenization and stopwords
The language of BT descriptions is quite particular because it must be concise. The meaning of the message is condensed in


few characters. In most cases verbs are totally absent. Nevertheless, BT descriptions may still contain useless information that may affect text classification.
First, each BT description is split into tokens and, in some cases, into sentences. Then meaningless words, or stopwords, such as determiners and prepositions (‘el’/‘the’, ‘en’/‘in’, ‘entonces’/‘so’, ‘aunque’/‘although’, ‘pero’/‘but’ and so on), are removed. Table 1 shows some stopword examples (list available at https://www.ranks.nl/stopwords/spanish). Next, all punctuation marks apart from ‘.’ and ‘,’ are also removed.

TABLE 1: Some examples of stopwords.

algún   como    incluso   poder     también
ambos   esta    otro      por       tras
ante    estar   para      primero   un
antes   hacer   pero      ser       uso

3) Proper name detection
Finally, proper names are detected using lists of names and surnames (available at https://github.com/olea/lemarios) and replaced by a tag.
Taking the real BT description ‘Compra en supermercado Elvira Madrid 28. TARJ. :*320546’ as an example, after text tokenization, stopword removal and proper name extraction, the result is ‘Compra # supermercado #PNegi# Madrid 28. TARJ. #320546’. The “#” symbol marks the place where a word is removed. Note that each proper name is substituted by “#PN” followed by a set of characters (‘egi’ in the example) and “#”. Thus, a given name is always replaced by the same identifier (‘Elvira’ by #PNegi# in the example). Credit card numbers were always anonymized.

4) Training sample reduction with similarity detection stage
We take advantage of the fact that many BT descriptions are similar in order to reduce the size of the training set. For that purpose, we insert a similarity detector based on the Jaccard distance [55] before the classifier. This is inspired by spam detection techniques that use this distance to search for characteristic sentences [56]–[58].
The similarity detector only considers the text of the descriptions. When the Jaccard similarity between a new labelled description and a previous entry in our dataset exceeds 85%, and both belong to the same category, the new description is not added to the SVM training set. Otherwise, we keep it. When the similarity between a new unlabelled description and a previous entry exceeds 85%, we assign the class of that entry to the description. Otherwise, the description is passed to the SVM for classification.
Figure 2 illustrates the architecture of the system including the Jaccard similarity detector. The SVM classifier is explained in Section III-B3.

B. MACHINE LEARNING ANALYSIS
In this section we explain the knowledge-based linguistic extraction as well as the feature selection.

1) Linguistic knowledge extraction
In this step we create lexica whose entries are related to the categories of the classification problem. Figure 3 represents the lexicon generation procedure.
First, starting from the preprocessed BT descriptions in the training set, which are labelled according to the classification categories, all non-alphabetic characters, such as numbers, punctuation marks and symbols, are cleared. Next, useful final elements for the lexica are extracted. These are the unigrams that appear at least five times in the text corpora for each category (all others are excluded) and the bigrams that are present at least three times in the corpora. Single-character alphabetic elements are also discarded. The final result is a set of lexica with unigrams and bigrams and their corresponding categories.
For example, let us suppose that the training set only has the following entries for a given category:
1) Compra en Pescados Diego, S.L. (‘Purchase at Pescados Diego, S.L.’)
2) Compra en supermercado Elvira Madrid 28 (‘Purchase at Elvira supermarket Madrid 28’)
3) Compra en amazon.es (‘Purchase in amazon.es’)
4) Compra en supermercado Carrefour Enero 2018 (‘Purchase at Carrefour supermarket January 2018’)
5) Compra en amazon.es Febrero 2018 (‘Purchase in amazon.es February 2018’)
6) Compra en Amazon (‘Purchase in Amazon’)
7) Pago en supermercado Elvira Alicante (‘Payment at Elvira Alicante supermarket’)
8) Pago en supermercado El Corte Inglés Vigo (‘Payment at El Corte Inglés Vigo supermarket’)
9) Compra en supermercado Carrefour Febrero 2018 (‘Purchase at Carrefour supermarket February 2018’)
10) Compra en supermercado amazon.es (‘Purchase in amazon.es supermarket’)
The resulting lexicon would only contain the words ‘compra’ and ‘supermercado’ and the bigram ‘compra supermercado’, followed by the categories.

2) Feature selection and weight calculation
The system uses a standard SVM algorithm for modelling and prediction. Short texts are encoded according to the vector space model in [59]. The smallest data unit in the model corresponds to a feature. A text T may be seen as an n-dimensional vector in the vector space, as follows:

T = ((t1, w1), (t2, w2), ..., (tn, wn))    (1)

where ti is the value of the i-th feature of text T and wi its weight. The greater wi, the more information the feature contains in that case [60].
Many different types of features are possible, such as Boolean, word frequency (number of times a word appears in the text) and TF-IDF. Note that classification results depend greatly on feature selection [61], [62]. An efficient feature selection method not only reduces the dimension of the


FIGURE 2: Flow diagram of the system with Jaccard similarity detector.
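The decision flow of Figure 2 can be sketched as follows (a minimal illustration; the function names and the token-set representation are our own assumptions — the paper fixes only the Jaccard similarity measure and the 85% threshold):

```python
THRESHOLD = 0.85  # similarity threshold used in the paper

def jaccard_similarity(a, b):
    """Jaccard similarity between two token sets (1 - Jaccard distance)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def tokens(description):
    return set(description.lower().split())

def add_training_example(description, label, training_set):
    """Training time: skip near-duplicates of the same category."""
    for desc, lab in training_set:
        if lab == label and \
                jaccard_similarity(tokens(description), tokens(desc)) > THRESHOLD:
            return  # near-duplicate: not added to the SVM training set
    training_set.append((description, label))

def classify(description, training_set, svm_predict):
    """Prediction time: reuse the class of a near-duplicate entry;
    otherwise fall back to the SVM stage."""
    for desc, lab in training_set:
        if jaccard_similarity(tokens(description), tokens(desc)) > THRESHOLD:
            return lab
    return svm_predict(description)
```

The linear scan over stored entries is a deliberate simplification; an index over token sets could replace it without changing the decision rules.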

FIGURE 3: UML diagram of the lexicon generation procedure.
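Likewise, the lexicon generation procedure of Figure 3 (Section III-B1) admits a compact sketch (hypothetical helper names; it assumes stopword removal already happened during preprocessing, and the character class standing in for "alphabetic" is our own choice for Spanish text):

```python
from collections import Counter
import re

MIN_UNIGRAM, MIN_BIGRAM = 5, 3  # frequency thresholds from the paper

def build_lexicon(descriptions):
    """Build one category lexicon from that category's preprocessed
    training descriptions: clear non-alphabetic characters, discard
    single-character tokens, then keep unigrams seen at least five
    times and bigrams seen at least three times."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for text in descriptions:
        words = [w for w in re.findall(r"[a-záéíóúñü]+", text.lower())
                 if len(w) > 1]
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    unigrams = {w for w, c in unigram_counts.items() if c >= MIN_UNIGRAM}
    bigrams = {" ".join(b) for b, c in bigram_counts.items() if c >= MIN_BIGRAM}
    return unigrams, bigrams
```

Running this per category yields the set of lexica whose entries later feed the lexicon-data features of the classifier.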

feature space but also avoids useless features. The features in our system are the following:
1) Lexicon data. These features count the words in the BT descriptions that appear in the lexica for each existing category.
2) Amount. The range of the BT amount field, since ranges are more significant for our application than exact values. Specifically, we consider non-overlapping intervals limited by 20, 60, 200, 800, 1500 and 3000 euros.
3) Sign of the amount. This feature indicates if the BT is an income (positive) or an expense (negative).
4) Date. The information in the date field of each BT. Again, we use ranges. This is because some events occur on specific days of the month (e.g. salary at the end), whereas other events (e.g. purchases) may happen anytime during the month. The selected ranges were the last five, ten, twenty and twenty-five days of the month.
5) Word n-grams. N-gram representation is language-independent. It transforms documents into high-dimensional feature vectors where each feature corresponds to a contiguous sub-string. Formally, an n-gram consists of n adjacent items from alphabet A. Items can be phonemes, syllables, letters, words or base pairs depending on the application. Hence, the number of different n-grams in a text is |A|^n at most. The dimension of an n-gram feature sub-vector may therefore be very high even for moderate values of n. However, since not all n-grams are present in a document, the dimension is substantially reduced. During the formation of an n-gram feature sub-vector, all upper-case characters are converted into lower-case characters and punctuation marks are converted to spaces. Sub-vectors are then normalized. The optimal n depends on the text corpora.
We explain feature sub-vectors with an example that computes the n-grams from one to four words for the BT description ‘Operación tarjeta débito Amazon’


(‘Amazon debit card transaction’). The resulting vector consists of the following components: ‘operación’ (‘transaction’), ‘tarjeta’ (‘card’), ‘débito’ (‘debit’), amazon; ‘operación tarjeta’ (‘card transaction’), ‘tarjeta débito’ (‘debit card’), ‘débito amazon’ (‘amazon debit’); ‘operación tarjeta débito’ (‘debit card transaction’), ‘tarjeta débito amazon’ (‘amazon debit card’); ‘operación tarjeta débito amazon’ (‘amazon debit card transaction’).
6) Character n-grams. Character n-grams have been proven useful for a variety of ML problems, such as language detection. Simple models based on them have outperformed convolutional and recursive deep neural networks (CNNs and RNNs) [63]–[65].
We illustrate them with an example that computes the trigram, four-gram and five-gram character sub-vectors for the sentence ‘Operación tarjeta débito Amazon’ (note that spaces are also taken into account when computing character n-grams): (ope, per, era, rac, aci, ció, ión, ón , n t, ta, tar, arj, rje, jet, eta, ta , a d, dé, déb, ébi, bit, ito, to , o a, am, ama, maz, azo, zon; oper, pera, erac, raci, ació, ción, ión , ón t, n ta, tar, tarj, arje, rjet, jeta, eta , ta d, a dé, déb, débi, ébit, bito, ito , to a, o am, ama, amaz, mazo, azon; opera, perac, eraci, ració, ación, ción , ión t, ón ta, n tar, tarj, tarje, arjet, rjeta, jeta , eta d, ta dé, a déb, débi, débit, ébito, bito , ito a, to am, o ama, amaz, amazo, mazon).
Character n-grams have been applied in scenarios with misspelling errors [66], [67]. They may also capture other effects of language usage, such as re-named entities and abbreviations, e.g. ‘maths’ instead of ‘mathematics’. In our case, they are justified by the many shortened words in BT descriptions.

3) SVM classifier
We decompose the overall problem into pairwise two-class problems, following a one-versus-one approach. Therefore, k(k − 1)/2 SVM classification models are necessary for k text classes. The category is decided by majority voting.

IV. EXPERIMENTAL RESULTS
All experiments were performed on a computer with the following specifications:
1) Operating System: Ubuntu 18.04 LTS 64 bits
2) Processor: Intel Core i5 3470 CPU 3.2 GHz x 4
3) RAM: 15.4 GB
4) Disk: 1.9 TB

A. DATASET
The dataset comprises 30,844 BT descriptions from customer accounts of major Spanish banks, written mostly in Spanish and issued between August 2017 and February 2018. They were collected during the CatCoin project with the collaboration of CoinScrap Finance S.L., Spain, using the CoinScrap platform. The entries of the dataset have the following attributes:
1) ID: a unique numeric identifier.
2) Description: the BT short-text description.
3) Amount: the amount in euros of the BT, either positive (income) or negative (expense).
4) Date: the date when the BT occurred.
Every entry has an extra field with the category label that determines the classification goal. The dataset may be requested from the authors by e-mail. Table 2 shows the numerical distributions of the fifteen categories in the dataset. Table 3 shows some examples of dataset entries.

B. EVALUATION METRICS
Due to the issues of accuracy with class asymmetries [68], [69], we employed precision, recall and F metrics using a macro-average approach.
Macro-averaged results were computed as indicated by [70]. Consider a binary evaluation metric B(tp, tn, fp, fn) that is calculated from the number of true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). Let tp_λ, fp_λ, tn_λ and fn_λ be the numbers of true positives, false positives, true negatives and false negatives, respectively, after binary evaluation for label λ. For k labels, the macro-average evaluation metric is calculated as follows:

B_macro = (1/k) · Σ_{λ=1}^{k} B(tp_λ, fp_λ, tn_λ, fn_λ)    (2)

Macro-averaging weights all classes equally, whereas micro-averaging weights all document classification decisions equally. Since F ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate small classes in micro-averaging [71]. For this reason we preferred the macro-average approach.
To calculate the precision, recall and F rates we first computed each of these measures separately for each category q using expressions (3)-(5):

Precision_micro_q = tp_q / (tp_q + fp_q)    (3)

Recall_micro_q = tp_q / (tp_q + fn_q)    (4)

F_micro_q = 2 · (Precision_micro_q · Recall_micro_q) / (Precision_micro_q + Recall_micro_q)    (5)

These metrics were then averaged by category using expression (2) to produce the macro-averaged metrics.

C. NUMERICAL RESULTS
We performed cross-validation on different splits of the dataset into training and testing subsets (in all cases the first and second percentages correspond to training and testing subset sizes, respectively): 30%-70%, 40%-60%, 60%-40% and 70%-30%. The purpose was to check the robustness of our system when fewer training data were available.
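The per-category measures of expressions (3)-(5) and their macro-average of expression (2) can be computed as in this sketch (hypothetical function name; the convention of returning 0.0 on empty denominators is our own assumption):

```python
def macro_metrics(per_category_counts):
    """Expressions (3)-(5) per category, then macro-averaged as in
    expression (2). per_category_counts: one (tp, fp, fn) tuple per
    category."""
    precisions, recalls, f_scores = [], [], []
    for tp, fp, fn in per_category_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f)
    k = len(per_category_counts)  # number of categories
    return sum(precisions) / k, sum(recalls) / k, sum(f_scores) / k
```

Note how a category with few instances contributes as much to the average as a large one, which is the property that motivates the macro-average choice above.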


TABLE 2: Distribution of categories in the labelled dataset.

Category Instances
Bank 4,835
Means of transport 3,479
Shopping 11,061
Household expenses 1,158
Taxes and charges 489
Off-cycle income 89
Payroll 248
Leisure 2,362
Health, sport and education 867
Insurances 883
Social security, grants and pensions 67
Transfers 2,086
Business and professional expenses 197
Rentals 116
Others 2,907
Total 30,844

TABLE 3: Examples of entries in the labelled dataset.

ID | Description | Category | Amount | Date
59da944c5858aa32256f883a | Recibo ORANGE ESPAGNE S.A.U (‘ORANGE ESPAGNE S.A.U bill’) | Household expenses | -42,29 € | Thu Sep 28 2017 02:00:00
59da944c5858aa32256f882a | Traspaso recibido Cuenta Nómina (‘Transfer received Payroll Account’) | Bank | 100,00 € | Thu Oct 05 2017 02:00:00
5a046cf2d9f70921c74182d9 | www.just-eat.es | Leisure | -39,60 € | Thu Nov 09 2017 01:00:00
5a69353d323ca817506a2bdd | RECIBO ASOC DE CONSUMIDORES EN ACCION-FACUA (‘Association of Consumers in Action-FACUA bill’) | Health, sport and education | -63,00 € | Tue Jan 09 2018 01:00:00
5a8d5a8d1a33590273326bed | TRANSFERENCIA A FAVOR DE PN CONCEPTO Alquiler Febrero 2017 + Luz (‘Transfer in favour of PN CONCEPT February 2017 rent + electricity’) | Rentals | -377,75 € | Wed Feb 07 2018 01:00:00

TABLE 4: Average word distribution in the lexica for the different training-testing splits before applying the similarity filter.

Lexicon 30%-70% 40%-60% 60%-40% 70%-30%


Bank 368.6 421.8 486.2 513.6
Means of transport 515 602.6 742.4 782.6
Shopping 1,551.2 1,830.2 2,295.2 2,501.4
Household expenses 191.4 226.8 280.4 307.6
Taxes and charges 101.8 128.4 159.8 169.2
Off-cycle income 13.2 18.2 24.2 113.2
Payroll 65 80.4 105 27.6
Leisure 479.6 569.8 706.4 765.2
Health, sport and education 262 310.2 393.2 437.2
Insurances 154 181.2 244.6 259.4
Social security, grants and pensions 15.2 19.2 23.4 24.2
Transfers 516.6 639.6 849.6 933
Business and professional expenses 61.8 80.4 107.8 119.6
Rentals 54.8 58.6 79.2 90
Others 160.2 190.6 226.4 241.8

In each experiment we extracted the lexica of the set as explained in Section III-B1. Table 4 shows the distributions of words in the lexica for all categories before applying the similarity detector. We added features incrementally to the model to assess their significance. Therefore, first we only used word n-grams and lexica, then we added BT amount and date, and finally character n-grams features.

Given the target sector (finance), precision may be more important than recall. This is because banking campaigns prefer to obtain fewer positives for key categories. By doing so, they maximize the probability that customers will be receptive to personalized products.

We compared our system with three competitor approaches, All-In-1 [72] and two variants of the method by IITP (Indian Institute of Technology Patna) [73]. These approaches analyzed customer feedback to manufacturers, which also consisted of short texts, although with more elaborate sentences than BT descriptions. Note that no other researchers have considered BT to date. For the sake of fairness, we applied the Jaccard distance detector stage to the competitors as well.

The All-In-1 approach in [72] is based on a classic SVM classifier that takes character n-grams and monolingual word embeddings as input. Logically, we only used the monolingual version.

The two IITP variants [73] are based on CNNs. The second variant combines a CNN with an RNN. Specifically, a convolutional feature extractor is applied to the input, a recurrent network is applied to the CNN output, an optional fully connected layer is applied to the RNN output, and finally a softmax layer delivers the result.

Tables 5 and 6 show average elapsed training and testing times for our system and its competitors for the different splits and selections of features, obtained with cross-validation (five different dataset samplings in each experiment). For our system, the values of n in character and word n-grams were adjusted to 3-5 and 1-4, respectively.

Note that, even though testing times were comparable, the training times of the competitors were significantly higher. This is due to their greater computational complexity. Specifically, the SVM classifier of All-In-1 uses word embeddings and the two variants of IITP are based on CNNs.

Table 7 shows the average training set reductions that the similarity detector achieved for the different splits. Note that they exceeded 56% in all cases.

1) 30%-70% split
In each experiment the training dataset had on average 4,031 entries (after similarity detection) and the testing dataset had 21,590 entries.

Tables 8 and 9 show the results of BT classification. Note that we did not modify the design or the implementation of the selected competitors. Thus, for a fair comparison, only word and character n-grams features were enabled in Table 8. Our system outperformed the competitors in terms of precision and All-In-1 was the best option in recall and F-measure.

In Table 9 we observe that, after activating the lexicon feature, the precision, recall and F of our system increased by about 15%, 38% and 35%, respectively, so the lexicon feature was crucial. Another key result is that meta-information features yielded a precision increase of 8% in our system. After activating all features, precision, recall and F further improved by around 3%, 19% and 13%, respectively.

Our system attained the best precision, but only attained better F than All-In-1 if all features are activated. On the other hand, All-In-1 was better in recall, but the difference with our system in that regard was only about 3%. Note, however, that regardless of the fact that precision is more important in our scenario, our system is simpler than its competitors (based, depending on the case, on a CNN, a CNN+RNN or an SVM with word embeddings), especially in terms of training time, as shown in Tables 5 and 6.

2) 40%-60% split
In this case, in each experiment the training dataset had on average 4,849 annotated entries (after similarity detection) and the testing dataset had 18,506 entries.

Tables 10 and 11 show the results. In this case our improvement in precision with the basic features was about 2% compared to the competitors. The precision, recall and F of our system increased by about 17%, 37% and 34%, respectively, after activating the lexica feature. By adding meta-information, the improvements were 7%, 3% and 5%, respectively. Total precision, recall and F improved compared to the baseline (word n-grams) by about 2%, 18% and 12%, respectively, after activating all the features.

We again attained the best precision performance, outperforming the best competitor by almost 4% when all features are activated.

3) 60%-40% split
The training dataset had on average 6,209 annotated entries (after similarity detection) and the testing dataset was composed of 12,338 entries.

Tables 12 and 13 summarize the results. Table 12 shows that our system had the best precision performance if both word and character n-grams features are enabled, but All-In-1 was still the best alternative in terms of recall and F for the basic combinations of features. In this case, the precision, recall and F metrics of our system improved versus the baseline by about 27%, 58% and 50%, respectively, after activating all the features. We again attained the best precision, outperforming the best competitor by almost 2% when all features are activated. Our system ranked second in recall after All-In-1 by a narrow margin of about 3%, again when all features are activated. We remark again that we are not using semantic information from word embeddings.

4) 70%-30% split
The training dataset had on average 6,780 annotated entries (after similarity detection) and the testing dataset had 9,253 entries.

Tables 14 and 15 show the results. With the basic features all systems achieved similar results (Table 14). In this case, the precision, recall and F of our system improved by about 28%, 57% and 49%, respectively, after activating all the features. We again attained the best precision and almost matched All-In-1 in terms of recall and F when all features are activated.

Table 16 shows the performance of our system by BT category when all the features are enabled. In general the performance was satisfactory. The worst performance corresponded to the categories with fewer entries in the training set (according to Table 2).

D. SUMMARY OF NUMERICAL RESULTS
To evaluate the performance of our system we applied cross-validation (five dataset samplings per experiment) on four training-testing splits (30%-70%, 40%-60%, 60%-40% and 70%-30%), to check the robustness of our approach as the sizes of the testing subsets decreased. In these experiments we added features to the model incrementally to assess their significance, in the following order: word n-grams, lexica, amount, date and character n-grams. We compared our system with three

TABLE 5: Elapsed training and testing times of our system for different dataset splits (computing time in seconds).

30%-70% split, Train (4,031 instances):
Word n-grams: 2.20 ± 0.40
Word n-grams + char n-grams: 5.00 ± 1.55
Word n-grams + lexica: 2.20 ± 0.40
Word n-grams + lexica + amount + date: 2.20 ± 0.40
Word n-grams + lexica + amount + date + char n-grams: 6.20 ± 1.47

30%-70% split, Test (21,590 instances):
Word n-grams: 8.20 ± 0.40
Word n-grams + char n-grams: 15.40 ± 0.49
Word n-grams + lexica: 10.00 ± 0.00
Word n-grams + lexica + amount + date: 10.40 ± 0.49
Word n-grams + lexica + amount + date + char n-grams: 18.60 ± 2.15

40%-60% split, Train (4,849.40 instances):
Word n-grams: 2.00 ± 0.00
Word n-grams + char n-grams: 5.80 ± 0.75
Word n-grams + lexica: 3.00 ± 0.00
Word n-grams + lexica + amount + date: 3.20 ± 0.40
Word n-grams + lexica + amount + date + char n-grams: 6.00 ± 0.00

40%-60% split, Test (18,506 instances):
Word n-grams: 7.00 ± 0.00
Word n-grams + char n-grams: 14.00 ± 0.00
Word n-grams + lexica: 8.40 ± 0.49
Word n-grams + lexica + amount + date: 8.40 ± 0.49
Word n-grams + lexica + amount + date + char n-grams: 15.20 ± 0.40

60%-40% split, Train (6,209.20 instances):
Word n-grams: 3.00 ± 0.00
Word n-grams + char n-grams: 9.40 ± 1.50
Word n-grams + lexica: 4.20 ± 0.40
Word n-grams + lexica + amount + date: 4.00 ± 0.00
Word n-grams + lexica + amount + date + char n-grams: 9.00 ± 0.63

60%-40% split, Test (12,338 instances):
Word n-grams: 5.00 ± 0.00
Word n-grams + char n-grams: 11.40 ± 0.80
Word n-grams + lexica: 6.20 ± 0.40
Word n-grams + lexica + amount + date: 6.00 ± 0.00
Word n-grams + lexica + amount + date + char n-grams: 12.00 ± 0.00

70%-30% split, Train (6,780.20 instances):
Word n-grams: 4.00 ± 0.00
Word n-grams + char n-grams: 11.20 ± 0.98
Word n-grams + lexica: 4.20 ± 0.40
Word n-grams + lexica + amount + date: 4.60 ± 0.80
Word n-grams + lexica + amount + date + char n-grams: 11.20 ± 0.40

70%-30% split, Test (9,253 instances):
Word n-grams: 4.00 ± 0.00
Word n-grams + char n-grams: 9.60 ± 0.49
Word n-grams + lexica: 5.20 ± 0.40
Word n-grams + lexica + amount + date: 5.20 ± 0.40
Word n-grams + lexica + amount + date + char n-grams: 10.60 ± 0.49
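The feature sets timed in Table 5 start from word and character n-grams with n in 1-4 and 3-5, respectively. A self-contained sketch of such a bag-of-n-grams extractor follows; the `w:`/`c:` key prefixes are our own convention, not the paper's:

```python
def ngram_features(text, word_range=(1, 4), char_range=(3, 5)):
    """Bag of word 1-4-grams and character 3-5-grams, the n ranges the
    paper reports for its SVM input (sketch implementation)."""
    feats = {}
    words = text.lower().split()
    lo, hi = word_range
    for n in range(lo, hi + 1):  # word n-grams
        for i in range(len(words) - n + 1):
            key = "w:" + " ".join(words[i:i + n])
            feats[key] = feats.get(key, 0) + 1
    chars = text.lower()
    lo, hi = char_range
    for n in range(lo, hi + 1):  # character n-grams
        for i in range(len(chars) - n + 1):
            key = "c:" + chars[i:i + n]
            feats[key] = feats.get(key, 0) + 1
    return feats

feats = ngram_features("Recibo ORANGE")
```

A real pipeline would map these sparse counts into a vector space before feeding the SVM, but the extraction logic is the same.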

competing approaches from the state-of-the-art, All-In-1 and two variants of the IITP method.

The Jaccard similarity detector achieved reductions of training data exceeding 56% for all splits.

For the 30%-70% split, our system attained the best precision. It was inferior to All-In-1 in recall and F unless all features were enabled; if they were, our system also outperformed its competitors in F. For the 40%-60% split, our system outperformed the competitors in terms of precision, recall and F when all features were enabled. It was better in precision even with the basic combination of features. For the 60%-40% and 70%-30% splits, our system again outperformed the competitors in terms of precision, and the performance gap with All-In-1, in the cases it existed, was reduced. Indeed, our approach is simpler than the competitors, which allowed a significant training time reduction.

V. USE CASE: COINSCRAP
CoinScrap launched its mobile app for iOS and Android in November 2016, and since then it has had thousands of downloads. A new version of the application was launched in October 2018. It includes journey improvements for product fulfilment; dynamic "gamified" saving rules (e.g., saving when your favourite team wins or when you take a coffee); and personalised recommendations for financial management. The latter rely on our system to classify BTs. In this line, CoinScrap recommends personalized services and products based on financial necessities and objectives. Figure 4 shows a screenshot of the app.

VI. CONCLUSIONS
Compared to normal texts, short-text analysis is challenging due to sparsity, irregularity and real-time data generation. In this paper we describe a short-text SVM BT classification system using a combination of meta-information and linguistic knowledge (by relying on specialized lexica).

Motivated by existing solutions in spam detection, we achieved a significant reduction of training information with a short-text similarity detector based on the Jaccard distance.

Experimental results, comparing our approach with three state-of-the-art competitors of higher computational complexity, are very promising. Our lexicon feature is crucial to attain high precision, especially if the training dataset is small.

TABLE 6: Elapsed training and testing times of the competitor systems for different dataset splits (computing time in seconds).

30%-70% split, Train (4,031 instances):
Proposed system, word n-grams: 2.20 ± 0.40
Proposed system, word n-grams + char n-grams: 5.00 ± 1.55
All-In-1, word n-grams: 1.83 ± 0.08
All-In-1, word n-grams + char n-grams: 30.04 ± 1.66
IITP-CNN: 976.10 ± 33.34
IITP-CNN + RNN: 1153.03 ± 59.08

30%-70% split, Test (21,590 instances):
Proposed system, word n-grams: 8.20 ± 0.40
Proposed system, word n-grams + char n-grams: 15.40 ± 0.49
All-In-1, word n-grams: 1.56 ± 0.02
All-In-1, word n-grams + char n-grams: 5.53 ± 0.16
IITP-CNN: 4.47 ± 0.37
IITP-CNN + RNN: 5.99 ± 0.62

40%-60% split, Train (4,849.40 instances):
Proposed system, word n-grams: 2.00 ± 0.00
Proposed system, word n-grams + char n-grams: 5.80 ± 0.75
All-In-1, word n-grams: 2.28 ± 0.15
All-In-1, word n-grams + char n-grams: 38.68 ± 1.70
IITP-CNN: 1224.89 ± 71.48
IITP-CNN + RNN: 1698.12 ± 108.46

40%-60% split, Test (18,506 instances):
Proposed system, word n-grams: 7.00 ± 0.00
Proposed system, word n-grams + char n-grams: 14.00 ± 0.00
All-In-1, word n-grams: 1.33 ± 0.04
All-In-1, word n-grams + char n-grams: 5.01 ± 0.12
IITP-CNN: 3.85 ± 0.28
IITP-CNN + RNN: 5.80 ± 0.19

60%-40% split, Train (6,209.20 instances):
Proposed system, word n-grams: 3.00 ± 0.00
Proposed system, word n-grams + char n-grams: 9.40 ± 1.50
All-In-1, word n-grams: 2.93 ± 0.08
All-In-1, word n-grams + char n-grams: 61.95 ± 3.68
IITP-CNN: 2025.20 ± 22.99
IITP-CNN + RNN: 2759.47 ± 66.43

60%-40% split, Test (12,338 instances):
Proposed system, word n-grams: 5.00 ± 0.00
Proposed system, word n-grams + char n-grams: 11.40 ± 0.80
All-In-1, word n-grams: 0.91 ± 0.02
All-In-1, word n-grams + char n-grams: 3.91 ± 0.16
IITP-CNN: 2.90 ± 0.09
IITP-CNN + RNN: 4.45 ± 0.13

70%-30% split, Train (6,780.20 instances):
Proposed system, word n-grams: 4.00 ± 0.00
Proposed system, word n-grams + char n-grams: 11.20 ± 0.98
All-In-1, word n-grams: 3.21 ± 0.09
All-In-1, word n-grams + char n-grams: 69.35 ± 1.75
IITP-CNN: 2327.73 ± 38.76
IITP-CNN + RNN: 3072.74 ± 327.81

70%-30% split, Test (9,253 instances):
Proposed system, word n-grams: 4.00 ± 0.00
Proposed system, word n-grams + char n-grams: 9.60 ± 0.49
All-In-1, word n-grams: 0.69 ± 0.02
All-In-1, word n-grams + char n-grams: 3.30 ± 0.10
IITP-CNN: 2.37 ± 0.06
IITP-CNN + RNN: 3.41 ± 0.12
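The times in Tables 5 and 6 are means with standard deviations over five dataset samplings. A minimal sketch of how such wall-clock measurements can be collected; the sorting workload is only a stand-in for training or testing a model:

```python
import statistics
import time

def timed(fn, *args):
    """Return wall-clock seconds for one call of fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

# Five repetitions stand in for the paper's five dataset samplings;
# sorting a list is a placeholder workload, not an actual model fit.
samples = [timed(sorted, list(range(10_000))) for _ in range(5)]
mean, std = statistics.mean(samples), statistics.pstdev(samples)
```

Reporting the mean alongside the standard deviation, as the tables do, makes it clear how stable each timing is across samplings.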

TABLE 7: Training sample reduction for different dataset splits.

Split Instances Instances after similarity detection


30%-70% 9,254 4,031
40%-60% 12,338 4,849
60%-40% 18,506 6,209
70%-30% 21,591 6,780
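The reductions in Table 7 come from discarding near-duplicate training descriptions with a Jaccard-distance test. A sketch of such a filter follows; the 0.75 similarity threshold and the token-level sets are illustrative assumptions, not the paper's exact settings:

```python
def jaccard_similarity(a, b):
    """Token-set Jaccard similarity between two BT descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def filter_near_duplicates(descriptions, threshold=0.75):
    """Keep an entry only if it is not too similar (Jaccard similarity
    above threshold) to an entry that was already kept."""
    kept = []
    for d in descriptions:
        if all(jaccard_similarity(d, k) <= threshold for k in kept):
            kept.append(d)
    return kept
```

For example, `filter_near_duplicates(["a b c d", "a b c d e", "x y"])` drops the second entry, whose similarity to the first (4/5) exceeds the threshold.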

TABLE 8: Average evaluation metrics for the basic combinations of features, 30%-70% train-test split.

System and features Pmacro Rmacro Fmacro

Proposed system, word n-grams 68.19% 25.70% 37.32%
Proposed system, word n-grams + char n-grams 93.36% 76.30% 83.97%
All-In-1, word n-grams 90.87% 85.22% 87.67%
All-In-1, word n-grams + char n-grams 90.75% 86.87% 88.57%
IITP-CNN 87.83% 81.02% 83.78%
IITP-CNN+RNN 88.27% 75.05% 80.32%
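The Pmacro, Rmacro and Fmacro columns in Tables 8-15 are macro-averages: precision, recall and F1 are computed per category and then averaged, so small categories weigh as much as large ones. A plain sketch of the computation:

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all categories."""
    labels = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

p, r, f = macro_prf(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In the toy call, class "a" scores P=1.0, R=0.5 and class "b" scores P=2/3, R=1.0, so the macro averages are P=5/6 and R=0.75.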

10 VOLUME X, 2020

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2983584, IEEE Access

Silvia García-Méndez et al.: Identifying Banking Transaction Descriptions via SVM Based on a Specialized Labelled Corpus

TABLE 9: Average evaluation metrics of the proposed system for all combinations of features, 30%-70% train-test split.

Features Pmacro Rmacro Fmacro

Word n-grams 68.19% 25.70% 37.32%
Word n-grams + lexica 82.90% 63.51% 71.90%
Word n-grams + lexica + amount + date 91.31% 64.92% 75.87%
Word n-grams + lexica + amount + date + char n-grams 94.59% 84.15% 89.06%

TABLE 10: Average evaluation metrics for the basic combinations of features, 40%-60% train-test split.

System and features Pmacro Rmacro Fmacro

Proposed system, word n-grams 68.25% 27.36% 39.05%
Proposed system, word n-grams + char n-grams 93.56% 78.72% 85.50%
All-In-1, word n-grams 91.69% 87.12% 89.05%
All-In-1, word n-grams + char n-grams 90.99% 88.11% 89.27%
IITP-CNN 90.27% 84.06% 86.70%
IITP-CNN+RNN 88.84% 77.93% 82.15%

TABLE 11: Average evaluation metrics of the proposed system for all combinations of features, 40%-60% train-test split.

Features Pmacro Rmacro Fmacro

Word n-grams 68.25% 27.36% 39.05%
Word n-grams + lexica 85.54% 64.50% 73.53%
Word n-grams + lexica + amount + date 92.94% 67.30% 78.05%
Word n-grams + lexica + amount + date + char n-grams 95.43% 85.43% 90.15%

TABLE 12: Average evaluation metrics for the basic combinations of features, 60%-40% train-test split.

System and features Pmacro Rmacro Fmacro

Proposed system, word n-grams 68.54% 29.24% 40.97%
Proposed system, word n-grams + char n-grams 94.38% 80.42% 86.83%
All-In-1, word n-grams 92.74% 89.75% 91.12%
All-In-1, word n-grams + char n-grams 90.98% 89.89% 90.29%
IITP-CNN 90.54% 85.15% 87.37%
IITP-CNN+RNN 89.97% 81.42% 85.02%

TABLE 13: Average evaluation metrics of the proposed system for all combinations of features, 60%-40% train-test split.

Features Pmacro Rmacro Fmacro

Word n-grams 68.54% 29.24% 40.97%
Word n-grams + lexica 85.74% 65.42% 74.21%
Word n-grams + lexica + amount + date 93.14% 68.28% 78.79%
Word n-grams + lexica + amount + date + char n-grams 95.60% 87.05% 91.12%

TABLE 14: Average evaluation metrics for the basic combinations of features, 70%-30% train-test split.

System and features Pmacro Rmacro Fmacro

Proposed system, word n-grams 67.41% 30.67% 42.16%
Proposed system, word n-grams + char n-grams 94.37% 80.15% 86.88%
All-In-1, word n-grams 92.58% 90.01% 91.16%
All-In-1, word n-grams + char n-grams 91.35% 89.95% 90.52%
IITP-CNN 89.33% 85.67% 87.19%
IITP-CNN+RNN 90.06% 82.14% 85.63%

TABLE 15: Average evaluation metrics of the proposed system for all combinations of features, 70%-30% train-test split.

Features Pmacro Rmacro Fmacro

Word n-grams 67.41% 30.67% 42.16%
Word n-grams + lexica 87.91% 66.71% 75.84%
Word n-grams + lexica + amount + date 92.99% 70.22% 80.01%
Word n-grams + lexica + amount + date + char n-grams 95.05% 87.51% 91.12%


TABLE 16: Performance of our system by category with all features enabled, 70%-30% split.

Category Pmacro Rmacro Fmacro


Bank 92.32% 93.08% 92.69%
Means of transport 94.54% 90.06% 92.24%
Shopping 89.88% 98.07% 93.79%
Household expenses 98.15% 91.35% 94.62%
Taxes and charges 96.34% 82.18% 88.58%
Off-cycle income 95.32% 88.89% 91.97%
Payroll 95.92% 88.11% 91.76%
Leisure 95.24% 87.36% 91.12%
Health, sport and education 97.01% 82.38% 89.10%
Insurances 98.29% 94.87% 96.54%
Social security, grants and pensions 98.89% 85.00% 91.31%
Transfers 88.84% 80.19% 84.29%
Business and professional expenses 89.12% 73.22% 80.31%
Rentals 98.00% 82.86% 89.77%
Others 97.85% 95.02% 96.42%

REFERENCES
[1] B. L. Derby, “Data mining for improper payments,” The Journal of
Government Financial Management, vol. 52, no. 4, p. 10, 2003.
[2] E. W. Ngai, L. Xiu, and D. C. Chau, “Application of data mining tech-
niques in customer relationship management: A literature review and
classification,” Expert Systems With Applications, vol. 36, no. 2, pp.
2592–2602, 2009.
[3] X. Hu, “A data mining approach for retailing bank customer attrition
analysis,” Applied Intelligence, vol. 22, no. 1, pp. 47–60, 2005.
[4] M. R. Islam and M. A. Habib, “A data mining approach to predict
prospective business sectors for lending in retail banking using decision
tree,” CoRR, vol. abs/1504.02018, 2015.
[5] C. Chekuri, M. H. Goldwasser, P. Raghavan, and E. Upfal, “Web search
using automatic classification,” in Proceedings of the Sixth International
Conference on the World Wide Web, 1997.
[6] D.-T. Vo and Y. Zhang, “Target-Dependent Twitter Sentiment Classifica-
tion with Rich Automatic Features.” in Proc. IJCAI, 2015, pp. 1347–1353.
[7] G. Kumaran and J. Allan, “Text classification and named entities for
new event detection,” in Proceedings of the 27th Annual International
ACM SIGIR Conference on Research and Development in Information
Retrieval. ACM, 2004, pp. 297–304.
[8] Y. Cai, W.-H. Chen, H.-F. Leung, Q. Li, H. Xie, R. Y. Lau, H. Min,
FIGURE4: Coinscrap app. and F. L. Wang, “Context-aware ontologies generation with basic level
concepts from collaborative tags,” Neurocomputing, vol. 208, pp. 25–38,
2016.
[9] Q. Du, H. Xie, Y. Cai, H.-F. Leung, Q. Li, H. Min, and F. L. Wang,
The effectiveness of the proposed system was demon- “Folksonomy-based personalized search by hybrid user profiles in multiple
strated on a real dataset reflecting the activity of real levels,” Neurocomputing, vol. 204, pp. 142–152, 2016.
[10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
customers of Spanish banks, organized in fifteen different no. 7553, p. 436, 2015.
classes including means of transport, shopping, household [11] A. M. Hormozi and S. Giles, “Data mining: A competitive weapon for
expenses, taxes, charges and payroll. This labelled dataset is banking and retail industries,” Information Systems Management, 2004.
[12] O. Aregbeyen, “The determinants of bank selection choices by customers:
a valuable asset that will be available to other researchers on Recent and extensive evidence from Nigeria,” International Journal of
request. Business and Social Science, vol. 2, no. 2, pp. 276–288, 2011.
Our system attained the best precision (which is the most [13] H. U. Rehmann and S. Ahmed, “An empirical analysis of the determinants
of bank selection in Pakistan: A customer view,” Pakistan Economic and
relevant metric in PFM) and performed similarly in terms Social Review, vol. 46, no. 2, pp. 147–160, 2008.
of recall and F if enough features were enabled, especially [14] V. Dinh and L. Pickler, “Examining service quality and customer satis-
when the methods were stressed by reducing the training-to- faction in the retail banking sector in Vietnam,” Journal of Relationship
Marketing, vol. 11, no. 4, pp. 199–214, 2012.
test subset size ratio.
[15] A. Keramati, H. Ghaneei, and S. M. Mirmohammadi, “Developing a
Given the encouraging results in this work, we are cur- prediction model for customer churn from electronic banking services
rently expanding it to obtain sub-categorisations of the de- using data mining,” Financial Innovation, vol. 2, no. 1, p. 10, dec 2016.
scriptions. Our approach has been put into production in a [16] A. Sharma and P. K. Panigrahi, “A neural network based approach for
predicting customer churn in cellular network services,” CoRR, vol.
real PFM application, CoinScrap. abs/1309.3945, 2013.
[17] K. Chen, Y.-H. Hu, and Y.-C. Hsieh, “Predicting customer churn from
valuable B2B customers in the logistics industry: A case study,” Inf. Syst.
E-bus. Manag., vol. 13, no. 3, pp. 475–494, 2015.
[18] S. Barman, U. Pal, M. A. Sarfaraj, B. Biswas, A. Mahata, and P. Mandal,
“A complete literature review on financial fraud detection applying data

12 VOLUME X, 2020

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.2983584, IEEE Access

Silvia García-Méndez et al.: Identifying Banking Transaction Descriptions via SVM Based on a Specialized Labelled Corpus

mining techniques,” International Journal of Trust Management in Com- [41] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural
puting and Communications, vol. 3, no. 4, pp. 336–359, 2016. network for modelling sentences,” arXiv preprint arXiv:1404.2188, 2014.
[19] J. West and M. Bhattacharya, “Intelligent financial fraud detection: A [42] C. dos Santos and M. Gatti, “Deep convolutional neural networks for
comprehensive review,” Computers & Security, vol. 57, pp. 47 – 66, 2016. sentiment analysis of short texts,” in Proceedings of COLING 2014, the
[20] Y. Yoshimura, A. Amini, S. Sobolevsky, J. Blat, and C. Ratti, “Analysis 25th International Conference on Computational Linguistics: Technical
of customers’ spatial distribution through transaction datasets,” in Trans- Papers, 2014, pp. 69–78.
actions on Large-Scale Data and Knowledge-Centered Systems XXVII - [43] Q. Le and T. Mikolov, “Distributed representations of sentences and docu-
Volume 9860. New York, NY, USA: Springer-Verlag New York, Inc., ments,” in Proc. International Conference on Machine Learning, 2014, pp.
2016, pp. 177–189. 1188–1196.
[21] R. Vahidov and X. He, “Situated DSS for personal finance management: [44] C. C. Aggarwal and C. Zhai, “A Survey of Text Clustering Algorithms,” in
Design and evaluation,” Information & Management, vol. 47, no. 2, pp. Mining Text Data. Springer, 2012, pp. 77–128.
78–86, 2010. [45] S. Banerjee, K. Ramanathan, and A. Gupta, “Clustering short texts using
[22] D. A. Zetzsche, D. W. Arner, R. P. Buckley, and R. H. Weber, “The Wikipedia,” in Proceedings of the 30th Annual International ACM SI-
future of data-driven finance and regtech: Lessons from EU Big Bang II,” GIR Conference on Research and Development in Information Retrieval.
Available at SSRN 3359399, 2019. ACM, 2007, pp. 787–788.
[23] M. F. Caropreso, S. Matwin, and F. Sebastiani, “A learner-independent [46] S. Fodeh, B. Punch, and P.-N. Tan, “On ontology-driven document clus-
evaluation of the usefulness of statistical phrases for automated text tering using core semantic features,” Knowledge and Information Systems,
categorization,” Text Databases and Document Management: Theory and vol. 28, no. 2, pp. 395–421, 2011.
Practice, vol. 5478, pp. 78–102, 2001. [47] J. Yin and J. Wang, “A Dirichlet multinomial mixture model-based ap-
[24] P. S. Jacobs, “Joining statistics with NLP for text categorization,” in proach for short text clustering,” in Proceedings of the 20th ACM SIGKDD
Proceedings of the Third Conference on Applied Natural Language Pro- International Conference on Knowledge Discovery and Data Mining.
cessing. Association for Computational Linguistics, 1992, pp. 178–185. ACM, 2014, pp. 233–242.
[25] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, [48] D. Cai, X. He, and J. Han, “Document clustering using locality preserv-
“Text classification using string kernels,” Journal of Machine Learning ing indexing,” IEEE Transactions on Knowledge and Data Engineering,
Research, vol. 2, no. Feb, pp. 419–444, 2002. vol. 17, no. 12, pp. 1624–1637, 2005.
[26] L. D. Baker and A. K. McCallum, “Distributional clustering of words [49] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural
for text classification,” in Proceedings of the 21st Annual International Networks for Text Classification.” in Proc. AAAI, vol. 333, 2015, pp.
ACM SIGIR Conference on Research and Development in Information 2267–2273.
Retrieval. ACM, 1998, pp. 96–103. [50] W. Wu, H. Li, H. Wang, and K. Q. Zhu, “Probase: A probabilistic taxon-
[27] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146– omy for text understanding,” in Proceedings of the 2012 ACM SIGMOD
162, 1954. International Conference on Management of Data. ACM, 2012, pp. 481–
[28] A. McCallum, K. Nigam et al., “A comparison of event models for naive 492.
bayes text classification,” in Proc. AAAI-98 Workshop on Learning for [51] E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using
Text Categorization, vol. 752, 1998, pp. 41–48. Wikipedia-based explicit semantic analysis,” in Proc. IJCAI, vol. 7, 2007,
[29] K. Nigam, J. Lafferty, and A. McCallum, “Using maximum entropy for pp. 1606–1611.
text classification,” in Proc. IJCAI-99 Workshop on Machine Learning for [52] X. Han and J. Zhao, “Named entity disambiguation by leveraging
Information Filtering, vol. 1, 1999, pp. 61–67. Wikipedia semantic knowledge,” in Proceedings of the 18th ACM con-
[30] T. Joachims, “Text categorization with support vector machines: Learning ference on Information and knowledge management. ACM, 2009, pp.
with many relevant features,” in European Conference on Machine Learn- 215–224.
ing. Springer, 1998, pp. 137–142. [53] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou, “Exploiting Wikipedia as
[31] F. Sebastiani, “Machine learning in automated text categorization,” ACM external knowledge for document clustering,” in Proceedings of the 15th
Computing Surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002. ACM SIGKDD International Conference on Knowledge Discovery and
[32] D. D. Lewis, “An evaluation of phrasal and clustered representations on a Data Mining. ACM, 2009, pp. 389–396.
text categorization task,” in Proceedings of the 15th Annual International [54] X. Ni, J.-T. Sun, J. Hu, and Z. Chen, “Mining multilingual topics from
ACM SIGIR Conference on Research and Development in Information Wikipedia,” in Proceedings of the 18th International Conference on World
Retrieval. ACM, 1992, pp. 37–50. Wide Web. ACM, 2009, pp. 1155–1156.
[33] M. Post and S. Bergsma, “Explicit and implicit syntactic features for [55] C. Seung-Seok, C. Sung-Hyuk, and C. C. Tappert, “A Survey of Binary
text classification,” in Proceedings of the 51st Annual Meeting of the Similarity and Distance Measures,” Journal of Systemics, Cybernetics &

VOLUME X, 2020

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.