Machine Learning and NLP Approaches in Address Matching
Máster Universitario en Ciberseguridad e Inteligencia de los datos
Lamine SYNE
Dr. Isabel Sánchez Berriel, with N.I.F. 42.885.838-S, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as tutor, and
Dr. Luz Marina Moreno de Antonio, with N.I.F. 45.457.492-Q, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as co-tutor,
CERTIFY
that this work has been carried out under their supervision by Mr. Lamine SYNE, with N.I.F. Y-9077440-K.
Acknowledgments
I would like to thank my tutors, Isabel Sánchez Berriel and Luz Marina Moreno de Antonio, for their supervision and guidance throughout this project.
I would also like to thank the director of the Master's in Cybersecurity and Data Intelligence at La Laguna University, Pino Caballero Gil, for her precious help during the year.
Finally, I would like to thank the Canary Islands Government for giving me the chance to live this experience through the PBCA program.
Licence
Abstract
The aim of this project is to explore the potential of machine learning and NLP in the address matching sub-field of geographic information science. To achieve this, a deep study of word and sentence embedding models was carried out: how they work and how they can be used to generate numerical representations of an address.
We also introduce the confusion matrix to evaluate the performance of each model on a dataset of already matched addresses created from ISTAC [1] data sources, and we compare the models against each other.
Finally, a use case example is shown by choosing the best-performing model among those studied above. This can be the starting point for building a powerful tool for matching address pairs across all the Canary Islands.
Key words: machine learning, NLP, language model, address matching, word embedding, similarity
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background
1.2 Objectives
1.3 Scope
Chapter 2: State of the art
2.1 Address matching challenges
2.2 Introduction to natural language processing
2.2.1 Terminologies
2.2.2 Text preprocessing in NLP
2.2.3 Syntactic and semantic analysis
2.3 Word and sentence embedding techniques
2.3.1 Word embeddings
2.3.1.1 One-Hot Encoding & Bag of Words
2.3.1.2 Term Frequency-Inverse Document Frequency: TF-IDF
2.3.1.3 Word2Vec
2.3.1.4 FastText
2.3.2 Sentence embeddings
2.3.2.1 Doc2vec
2.3.2.2 BERT: Bidirectional Encoder Representations from Transformers
2.4 Text similarity and measures
Chapter 3: Methodology
3.1 Process definition
3.2 Implementation planning
Chapter 4: Development
4.1 Dataset Creation
4.2 Libraries: Gensim, NLTK and Sentence Transformers
4.3 Implementation
4.3.1 Data description and preprocessing
4.3.2 Modelling
4.3.3 Vectorisation and similarity calculation
4.4 Results
Chapter 5: Conclusions and future development
5.1 Conclusions
5.2 Future developments
Chapter 6: Appendices
6.1 Dataset creation source code
6.2 Project source code
6.3 Data sources
6.4 Pre-trained models
Bibliography
List of Figures
Figure 2.3: CBOW model with multiple words in the context
Figure 4.25: Applying the function with the trained word2vec model
List of Tables
Chapter 1: Introduction

In particular, the ISTAC [1], with the objective of facilitating the production of spatial statistics, as well as the production of multi-source statistics through the Integrated Data System of the Canary Islands Statistical Plan 2018-2022, is georeferencing information from different sources in the geostatistical reference of the Canary Islands.
A database covering all the georeferenced municipalities of the Canary Islands is used; however, the periodic updating of the data, as well as the integration of new sources of information into the system, requires matching the addresses of each registry against the set of those that have already been georeferenced.
In this project, we focus on providing solutions to the ISTAC [1] address matching problems described above, using machine learning and NLP techniques.
1.1 Background

Over the years, in the specific field of address matching, notable research has been carried out on resolving address records into "matched" and "not matched". In particular, there has been advanced research into quantitative methods for determining the extent of matching between pairs of text-based records, with numerous string-similarity measures developed, including Levenshtein and Jaro-Winkler.
In parallel, a group of researchers developed the concept of 'similarity join', whereby two databases are tested by evaluating each combination of record pairs against a similarity measure function, with those pairs that exceed a preset threshold being recorded. They acknowledged that, despite the availability of numerous similarity or 'distance' functions, no one measure excels in every application [2].
The ISTAC [1], in its georeferencing work [22], uses a technique based on Record Linkage that consists of comparing normalised addresses with other records that have already been georeferenced. Currently, they use the R package "RecordLinkage" to evaluate similarities and assign a weight indicating the similarity between the compared records.
The research reported in [4], [2] shows that machine-learning techniques can be used to either enhance or replace the traditional rule-based solutions that are commonly applied to address matching.
The first paper [4] introduces two particular innovations into the address matching workflow: conditional random fields (CRFs) and word (address) embeddings.
The second paper [2] introduces a framework called Post Match, which combines the open-source library "Libpostal" for address parsing with a post-parse process and the Jaro-Winkler edit distance algorithm, together with XGBoost machine learning classification.
In both cases a binary classification algorithm is applied, one pair at a time, to decide whether one address matches another or not. Applied to each address with respect to the whole reference address pool, this makes the problem very computationally expensive.
1.3 Scope

Due to the limited computing resources, the difficulty of processing address components and the time reserved for this work, we have to set some limitations.
We will focus only on the addresses of the municipality of Santa Brígida, located in Gran Canaria, where we have a register of 3600 unique addresses (the SantaBrigidareferencia set) and another register of 16024 addresses (the SantaBrigida set) containing those unique addresses together with variations of them. For example, below we have a collection of addresses, of which number 1 belongs to SantaBrigidareferencia and 1, 2, 3 and 4 are in SantaBrigida, with 2, 3 and 4 being variations of 1.
Chapter 2: State of the art

2.1 Address matching challenges

An address is one of the most unstructured data items, and this makes address matching a challenging downstream task. Here are the most likely issues we will run into while trying to match addresses:
➢ abbreviation: 29 C/ Nava
While these errors may seem easy to notice at a glance, it is very challenging to program a system to identify each difference. More than that, it requires significant computational power and a lot of processing time. These errors can lead to many failures when attempting to perform address matching, as the records will not match.
For all the reasons discussed above, sometimes we can face difficulties relating two addresses and connecting datasets. When this occurs, we end up with the following disparities between our records:
As we can see, the task of matching addresses becomes complicated when we have to compare records that are often formatted and input differently. Because of this, the simple task of matching addresses becomes much more complicated than expected.
In real location-based businesses, these issues with linking datasets will cause major problems in your workflow, slowing down your business and causing errors in delivery, billing and more [5].
One of the most common problems in address matching is data preprocessing. In many cases we fail to correctly preprocess our data. However, this step is very important, and having clean data before processing is essential for getting quality results.
2.2 Introduction to natural language processing

2.2.1 Terminologies

Corpus
A corpus is a large, structured set of machine-readable texts produced in a natural communicative setting. If we have a set of sentences in our dataset, all the sentences make up the corpus, and the corpus would be like a paragraph made of a mixture of sentences. We just have to know that a corpus is a collection of documents. In our case of study the corresponding corpus is a set of addresses [7].

Documents
A document is a single text within the corpus. If we have 100 sentences, each sentence is a document. The mathematical representation of a document is a vector [7]. In this project we will consider each address as a document.

Vocabulary
The vocabulary is the collection of unique words appearing in the corpus. Let's take the following example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
Vocabulary = { CALLE, NAVA, Y, GRIMÓN, EL, HAMBRE }

Words
All the words in the corpus. Let's take the previous example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
words = { CALLE, NAVA, Y, GRIMÓN, CALLE, EL, HAMBRE }

N-gram
In the field of computational linguistics, an n-gram is a contiguous sequence of n items from a given sample of text or speech [8]. For the address CALLE NAVA Y GRIMON, we have:
1-gram set: CALLE, NAVA, Y, GRIMON
2-gram set: CALLE NAVA, NAVA Y, Y GRIMON
Char N-gram
A char n-gram is an n-gram taken over the characters of a word rather than over the words of a sentence; for example, the character 3-grams of CALLE include CAL, ALL and LLE.

2.2.2 Text preprocessing in NLP

Preprocessing
Text preprocessing typically includes steps such as:
➢ Removal of noise, URLs, hashtags and user mentions
➢ Lowercasing
➢ Replacing emoticons and emojis
➢ Replacing elongated characters
➢ Correction of spellings
➢ Removing the punctuation
➢ etc.

Stemming
Stemming is the technique of replacing and removing suffixes and affixes to get the root, base or stem word. We may find similar words in the corpus but with different spellings, like having, have, etc. All of them are similar in meaning, so to reduce them to a base word we use a concept called stemming, which converts words to their base form [7].

Lemmatization
Lemmatization is a technique similar to stemming. In stemming, the root word may or may not have a meaning, but in lemmatization the root word surely has a meaning; it uses linguistic knowledge to transform words into their base forms [7].
Parsing
Parsing refers to the formal analysis of a sentence by a computer into its constituents, which results in a parse tree showing their syntactic relation to one another in visual form, which can be used for further processing and understanding [6].

2.2.3 Syntactic and semantic analysis

Syntactic analysis (syntax) and semantic analysis (semantics) are the two primary techniques that lead to the understanding of natural language. Language is a set of valid sentences, but what makes a sentence valid? Syntax and semantics.
Syntax is the grammatical structure of a text, whereas semantics is the meaning being conveyed. A sentence that is syntactically correct, however, is not always semantically correct [6].
2.3 Word and sentence embedding techniques

2.3.1 Word embeddings

2.3.1.1 One-Hot Encoding & Bag of Words

One-Hot Encoding and Bag of Words are among the most straightforward ways to numerically represent words.
In One-Hot Encoding, the idea is to create a vector whose size is the total number of unique words in the corpus. Each unique word has its own feature and is represented by a 1, with 0s everywhere else. In the Bag of Words representation (also called count vectorizing [11]), each word is represented by its count instead of a 1 [12]. Let's look at an easy example to understand the concepts previously explained. We could be interested in analysing tables 2.3 and 2.4:
        CALLE  NAVA  Y  GRIMÓN  …  HAMBRE
Calle     1     0    0    0     …    0
Nava      0     1    0    0     …    0
Table 2.3: One-hot encoding of the words "Calle" and "Nava"

        CALLE  NAVA  Y  GRIMÓN  EL  HAMBRE
1         1     1    1    1     0     0
2         1     0    0    0     1     1
Table 2.4: Bag-of-words representation of sentences 1 and 2
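To make the two representations concrete, here is a minimal sketch using scikit-learn's CountVectorizer; this library choice is assumed only for the illustration and is not part of the project's toolset described later. The token pattern is relaxed so that single-letter words such as "Y" are kept.

# Minimal bag-of-words sketch (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer

addresses = ["CALLE NAVA Y GRIMON", "CALLE EL HAMBRE"]

vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(addresses)

print(vectorizer.get_feature_names_out())  # the vocabulary of the corpus
print(bow.toarray())                       # one count vector per address

# A one-hot vector for a single word keeps only a 1 in that word's column,
# e.g. the row produced by transforming the one-word document "NAVA".
print(vectorizer.transform(["NAVA"]).toarray())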
2.3.1.2 Term Frequency-Inverse Document Frequency: TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document (TF) and the inverse document frequency (IDF) of the word across the set of documents [9]. The IDF is used to penalise very commonly used words that do not provide semantic information, such as articles, prepositions, etc.
Let's calculate TF-IDF("Calle", {d1, d2}, D) for the following corpus:
Address 1 (d1): Calle Nava y Grimon
Address 2 (d2): Calle el Hambre
Corpus D = [Address 1, Address 2]

     TF     IDF        TF × IDF
d1   1/4    log(2/1)   1/4 × log(2/1)
d2   1/3    log(2/1)   1/3 × log(2/1)
Table 2.5: Calculation of TF-IDF("Calle", {d1, d2}, D)
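As a further illustration, the following minimal sketch (assumed code, written only for this example) computes TF-IDF by hand following the textbook definition TF × log(N / df). Note that, under this definition, "CALLE" appears in every document of the tiny corpus and therefore gets an IDF of log(2/2) = 0, so the exact numbers may differ from the table above and from library implementations that apply smoothing.

# Hand-rolled TF-IDF for the two example addresses (illustrative only).
import math

d1 = "CALLE NAVA Y GRIMON".split()
d2 = "CALLE EL HAMBRE".split()
corpus = [d1, d2]

def tf(term, doc):
    # relative frequency of the term in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("CALLE", d1, corpus))   # 0.0 -> "CALLE" appears in every address
print(tf_idf("GRIMON", d1, corpus))  # > 0 -> "GRIMON" is distinctive of d1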
2.3.1.3 Word2Vec
Word2vec is one of the most popular techniques to learn word embeddings based on a neural network. The neural network aims to predict the distribution of word contexts in the corpus, p(w | context of w), and simultaneously learns the word representations. A single-layer neural network with a linear activation function is used; the contexts are represented by the succession of previous words within the chosen window size (n):
p(w_i | w_{i−n}, w_{i−n+1}, …, w_{i−1})
We thus obtain a representation of the words in a continuous multidimensional number space, where words with similar contexts are close to each other. Word2vec takes the text corpus as input and outputs a set of feature vectors that represent the words in that corpus. It uses two neural-network-based methods:
➢ Continuous Bag Of Words (CBOW)
➢ Skip-Gram
The CBOW model takes the context of each word as the input and tries to predict the word corresponding to the context. Here, context simply means the surrounding words.
Skip-Gram uses the target word, the word we want to generate the representation for, to predict the context. In the process of predicting the context words, the model learns the vector representation of the target word.
Figure 2.2: CBOW model with one word in the context [12]
Let's say we use the word 'Plaza' as the input to the neural network and we are trying to predict the word 'Paz'. We use the one-hot encoding of the input word 'Plaza', then measure and optimise the output error for the target word 'Paz'. In this process of trying to predict the target word, this shallow network learns its vector representation. In the same way that the model used a single word to predict the target, it can use multiple context words to do so, as in the architecture in figure 2.3:
Figure 2.3: CBOW model with multiple words in the context [12]
The choice between CBOW and Skip-Gram when training a word2vec model depends on the case we intend to solve. CBOW is better at learning syntactic relationships between words, while Skip-Gram is better at capturing semantic relationships. Skip-Gram also works better with a small amount of data and represents rare words well. On the other hand, CBOW is faster, focuses more on the morphological similarity of words, and needs more data to achieve similar performance.
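As a minimal sketch (assuming the Gensim library, whose current versions name the dimensionality parameter vector_size), the choice between the two variants is made with the sg flag:

# Choosing between CBOW (sg=0, the default) and Skip-Gram (sg=1) in Gensim.
from gensim.models import Word2Vec

sentences = [["CALLE", "NAVA", "Y", "GRIMON"], ["CALLE", "EL", "HAMBRE"]]

cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

print(cbow_model.wv["CALLE"][:5])      # first components of the CBOW vector
print(skipgram_model.wv["CALLE"][:5])  # first components of the Skip-Gram vector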
2.3.1.4 FastText
FastText tries to include the morphological structure of words, because this structure carries information about the meaning and is not taken into account by traditional word embeddings like word2vec, which train a single embedding for every individual word. FastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language independence, the subwords are taken to be the character n-grams of the word. The vector for a word is simply taken to be the sum of the vectors of all its component character n-grams [14]. For example, the fastText representation of the word "CALLE" when using 3-grams corresponds to the collection of trigrams of the string <CALLE>: <CA, CAL, ALL, LLE, LE>.
The algorithm always starts the string of each word with "<" and ends it with ">". This representation helps to extract morphological information from the words, such as suffixes and prefixes. With the generated n-grams, a skip-gram model is trained to create the word representations [15].
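A small sketch of this character n-gram decomposition (illustrative code, not FastText's internal implementation):

# Character n-grams of a word, with "<" and ">" marking the word boundaries.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("CALLE"))  # ['<CA', 'CAL', 'ALL', 'LLE', 'LE>']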
2.3.2 Sentence embeddings

So far we have discussed how word embeddings represent the meaning of the words in a text document. But sometimes we need to go a step further and encode the meaning of a whole sentence to readily understand the context in which the words are used.
A straightforward approach for creating sentence embeddings is to use a word embedding model to encode all the words of a given sentence and take the average of all the word vectors. While this provides a strong baseline, it falls short of capturing information related to word order and other aspects of overall sentence semantics.
2.3.2.1 Doc2vec
Doc2vec is a model for creating a numerical representation of a document. It extends the idea of word2vec and, like the latter, has two variants:
➢ Distributed Memory (DM)
➢ Distributed Bag Of Words (DBOW)
In the DM variant, each word and each sentence of the training corpus are one-hot encoded and stored in the matrices W and D, respectively. The training process involves passing a sliding window over the sentence, trying to predict the next word based on the previous words and the sentence vector (the Paragraph Matrix in the figure above). This prediction of the next word is done by concatenating the sentence and word vectors and passing the result into a softmax layer.
The sentence vectors change from sentence to sentence, while the word vectors remain the same; both are updated during training.
The inference process also involves the same sliding-window approach. The difference is that all the vectors of the model are fixed except the sentence vector. After all the predictions of the next word have been computed for a sentence, the resulting sentence vector is the sentence embedding [12].
The DBOW model ignores the word order and has a simpler architecture. Each sentence in the training corpus is converted into a one-hot representation. During training, a random sentence is selected from the corpus and, from that sentence, a random number of words. The model tries to predict these words using only the sentence ID, and the sentence vector is updated (the Paragraph ID and paragraph matrix in the figure). During inference, a new sentence ID is trained with random words from the sentence. The sentence vector is updated in each step, and the resulting sentence vector is the embedding for that sentence [12].
As a comparison between the two doc2vec models, we can follow the direction of the authors of the original paper [16], who state that the DM model "is consistently better than" DBOW. However, other studies [17] showed that the DBOW approach is better for more tasks. We also have to keep in mind that the DM model takes the word order into account while the DBOW model doesn't; moreover, the DBOW model doesn't use word vectors, so the semantics of the words are not preserved.
2.3.2.2 BERT: Bidirectional Encoder Representations from Transformers

BERT provides a way to pre-train models that consider contexts both to the right and to the left of words, using the Masked LM (MLM) technique. In BERT's MLM, instead of using pre- or post-word sequences, the entire sequence is used, from which a percentage of words to be predicted is masked out. The algorithm works on pairs of sentences: once the masked words have been predicted, BERT uses next sentence prediction. This part of the algorithm predicts whether the second sentence is the sentence that follows the first one in the original text.
The algorithm embeds metadata to indicate the start and end of segments, the separation between sentences, the masked words, etc., as can be seen in the example:
[CLS] the [MASK] has blue spots [SEP] it rolls [MASK] the parking lot [SEP] [19]
The implementation of BERT has two steps: pre-training and fine-tuning. During pre-training the model is trained on unlabelled data over different pre-training tasks. For fine-tuning, the BERT model is first initialised with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialised with the same pre-trained parameters [18].
This is just the initial part of the BERT implementation, and the complete procedure is described in the original paper [18], but we have to keep in mind that BERT is one of the best general language models and produces good results on sentence embeddings.
2.4 Text similarity and measures

Similarity is the distance between two vectors whose dimensions represent the features of two objects. In simple terms, similarity is the measure of how different or alike two data objects are. If the distance is small, the objects are said to have a high degree of similarity, and vice versa. Generally, it is measured in the range 0 to 1, and this score in the range [0, 1] is called the similarity score [12].
In the same way, text similarity measures how different or alike two texts or sentences are. As humans, it is very obvious to us when two sentences mean the same thing despite being written in completely different formats, but for algorithms to come to the same conclusion we first have to solve the problem of text representation, converting the text into feature vectors using one of the suitable text embedding techniques above. Once we have the text representation, we can compute the similarity score using one of the many distance/similarity measures [12].
Let's dive deeper into the text similarity measures.
Jaccard Index
The Jaccard index, also known as the Jaccard similarity coefficient, treats the data objects as sets. It is defined as the size of the intersection of two sets divided by the size of their union:
J(A, B) = |A∩B| / |A∪B| = |A∩B| / (|A| + |B| − |A∩B|)
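For instance, treating each address as a set of words (a simple illustrative sketch):

# Jaccard similarity of two addresses seen as sets of words.
def jaccard(a, b):
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("CALLE NAVA Y GRIMON", "CALLE EL HAMBRE"))  # 1/6 ≈ 0.17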
Euclidean Distance
The Euclidean distance, or L2 norm, uses the Pythagorean theorem to calculate the distance between two points, as indicated in figure 2.10. Generally speaking, when people talk about distance, they refer to the Euclidean distance. It is given by:
d(v, w) = √( Σᵢ (vᵢ − wᵢ)² )
The larger the distance d between two vectors, the lower the similarity score, and vice versa. Since the distances can vary from 0 to infinity, we need some way to normalise them to the range 0 to 1.
Although we have the typical normalisation formula that uses the mean and standard deviation, it is sensitive to outliers: if there are a few extremely large distances, every other distance will become smaller as a consequence of the normalisation operation. So the best option here is to use something based on Euler's number, such as taking the similarity to be 1 / eᵈ [12].
Levenshtein distance
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other [21].
Mathematically, the Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is lev(|a|, |b|), defined by the recurrence:
lev(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev(i, j) = min( lev(i−1, j) + 1, lev(i, j−1) + 1, lev(i−1, j−1) + [aᵢ ≠ bⱼ] )
where [aᵢ ≠ bⱼ] is 1 when the i-th character of a differs from the j-th character of b, and 0 otherwise.
Cosine Similarity
Cosine similarity computes the similarity of two vectors as the cosine of the angle between them. It determines whether two vectors are pointing in roughly the same direction; if the angle between the vectors is 0 degrees, the cosine similarity is 1 [12].
It is given by:
cos(v, w) = (v • w) / (||v|| × ||w||)
where ||v|| represents the length of the vector v, ||w|| the length of the vector w, and '•' denotes the dot product.
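A minimal sketch of this computation with NumPy (assumed here only for illustration):

# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([0.2, 0.1, 0.7])
w = np.array([0.3, 0.0, 0.6])
print(cosine_similarity(v, w))  # close to 1 when the vectors point the same way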
Chapter 3: Methodology
In order to achieve the objectives set for this project, it is necessary to find the right methodology.
Several meetings were held with the tutor and the co-tutor. First, an explanation of the topic was given; second, we defined the process and the tasks necessary to produce results. During the development of the project we held frequent meetings to check the progress of the tasks, raise doubts and verify that the steps taken were the right ones. The main idea is that at each review the project should show some evolution with respect to the previous check, which is in line with the Scrum planning model.
3.1 Process definition

During our investigation of machine learning opportunities in the address matching field, several approaches were tested, but the most promising one remains the use of word or sentence embeddings coupled with a text similarity measure.
Our proposal in this work consists of determining the similarity of each address to the reference addresses through the embeddings generated by the different algorithms described above. We consider that a match exists between two addresses when their similarity in the representation space exceeds a threshold.
Our approach is therefore to measure distances between addresses, but, by using language models, complex relationships between words such as semantics and morphology are considered, and not only similarities at the character level.
However, there are different word embedding techniques, so our process naturally consists of, first, a study of each one; second, their implementation using the address dataset; and finally, a practical comparison.
In addition, we define a performance evaluation procedure similar to those applied in machine learning classification: we set up a confusion matrix and evaluate the accuracy, precision and recall metrics.
Figure 3.1 shows the key steps of our work process.
As shown in the figure, we first generate address embeddings with each model (word2vec, FastText, doc2vec and BERT), secondly classify addresses into match or no match through the similarity, and finally evaluate the performance of each model in order to compare them.
Similarity
Classification
To classify addresses into matched or not matched, we compare the result of the similarity calculation with a fixed threshold value.
Performance Evaluation
Our initial dataset provides the matching status of the addresses through an identification number (uuid_idt), so a classical method would be to use a machine learning classification algorithm and extract its performance. However, the lack of variation in our dataset motivates us to perform a manual evaluation, calculating the true positives, false positives, true negatives and false negatives from the classification results and the known status of the addresses.
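As a sketch of this manual evaluation (with made-up counts, shown only to make the metrics explicit), accuracy, precision and recall are derived from the four cells of the confusion matrix:

# Accuracy, precision and recall from manually counted confusion-matrix cells.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Example with hypothetical counts:
print(metrics(tp=80, fp=15, tn=90, fn=20))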
3.2 Implementation planning

In order to correctly implement the process defined for this project, a plan was made; table 3.1 gives the details of the tasks needed to implement the described process end to end.
Chapter 4: Development
4.1 Dataset Creation

The ISTAC [1] provides a CSV file with the normalised and georeferenced addresses that currently appear in its Integrated Data System.
The variables included in the file correspond to the different elements that make up an address, as well as an identification code (uuid_idt) shared by all the addresses for which the matching has been positive according to their algorithms. Among the variables we find the territory codes ("códigos de territorio"), the normalised and unnormalised type of road ("tipo de vía"), the normalised and unnormalised road name ("nombre de vía"), the road code ("código de la vía") and the normalised and unnormalised portal number.
We can also find information about the technique used to generate the matching and a categorical variable with the values AVERAGE, HIGH or VERY_HIGH, which indicates the quality of the link.
This dataset covers all the municipalities in the Canary Islands, but in our case we extract the part corresponding to one municipality, Santa Brígida. We will train our models using this dataset of 16024 addresses in order to build word embeddings.
From this dataset we select the relevant columns that we will need in the rest of the work.
Column description
Create dataset: once we have a good sample of our dataset, we can export it in CSV format for evaluating the performance of our models.
Finally, we obtain the dataset "Santabrigida" (16024 addresses), which we will use to train our word and sentence embedding models.
To evaluate the performance of the models and compare them, we will use the dataset "muestra", which is a fraction of the "Santabrigida" dataset.
We make this reduction of the data because of the limited resources available and because, as explained in chapter 3, the performance evaluation is very costly.
4.2 Libraries: Gensim, NLTK and Sentence Transformers

In addition to the basic libraries for data analysis, we used some specialised libraries during the project, each with a specific role:
Gensim
Gensim is an open-source Python library for topic modelling, able to train large-scale semantic NLP models, represent text as vectors and find related documents.
Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 3.6+ and NumPy [23].
We can install it by running the command: pip install gensim
In this project we use Gensim to train the word2vec, FastText and doc2vec models, to generate vector representations of addresses and to calculate the similarity.
The figure below shows the basic syntax for importing and training Gensim models.
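A minimal sketch of that syntax (assuming Gensim 4.x, which names the dimensionality parameter vector_size; older versions call it size):

# Importing and training Gensim embedding models on tokenized addresses.
from gensim.models import Word2Vec, FastText

data = [["CALLE", "NAVA", "Y", "GRIMON"], ["CALLE", "EL", "HAMBRE"]]

w2v_model = Word2Vec(data, vector_size=100, min_count=1, workers=1)
ft_model = FastText(data, vector_size=100, min_count=1, workers=1)

print(w2v_model.wv.most_similar("CALLE", topn=2))  # words closest to "CALLE"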
NLTK
NLTK is a leading platform for building Python programs that work with human language data. It provides a suite of text processing utilities for classification, tokenisation, stemming, tagging, parsing, etc.
In this project we use NLTK to preprocess our address dataset and, in particular, to tokenize the data before passing it to the models.
Figure 4.5 shows the basic syntax for tokenising addresses with NLTK.
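A minimal sketch of that tokenisation step (assumed code; word_tokenize needs the NLTK 'punkt' resource to be downloaded once):

# Tokenizing addresses with NLTK before feeding them to the embedding models.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models, downloaded once

addresses = ["CALLE NAVA Y GRIMON 29", "CALLE EL HAMBRE 5"]
data = [word_tokenize(address) for address in addresses]
print(data)  # [['CALLE', 'NAVA', 'Y', 'GRIMON', '29'], ['CALLE', 'EL', 'HAMBRE', '5']]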
Sentence Transformers
Sentence Transformers is a Python library for computing sentence embeddings with transformer models. In this project we use it to implement the BERT model. The figure shows an example of Sentence Transformers loading a BERT model.
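A minimal sketch with the sentence-transformers library (the model name below is a generic multilingual example, not necessarily the Spanish model [27] used in the project):

# Encoding addresses into sentence embeddings with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

embeddings = model.encode(["CALLE NAVA Y GRIMON 29", "C/ NAVA Y GRIMON, 29"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair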
4.3 Implementation

4.3.1 Data description and preprocessing

Before starting to work with the data, let's check for missing values and describe the data through graphics and statistics.
As we can see, the dataset does not contain missing values, so we don't need to apply any technique to fill them in. Our next step is to take a look at the variable nvia: "nombre de vía".
Figure 4.9 shows the number of words of nvia in each address row.
Once we have the number of words for each nvia, we can plot the distribution of the word counts.
In the last part of the data description we show the most used words of the variables nvia and tvia in the addresses. The words that appear most often are drawn with the largest size, and vice versa.
Once we understand the data, the next step is preprocessing; first we format the addresses. In principle an address is a concatenation of the variables tvia, nvia, nume, codmun and nommun, but we have to remember that we are working with data from a single municipality (Santa Brígida). This means that all our addresses have the same value for the variables codmun (35021) and nommun (Santa Brígida). That is why it is necessary to remove them from the address in order to keep only its distinctive part.
So, we are going to create an address column without the two variables cited above.
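A minimal sketch of this formatting step with pandas (column names taken from the description above; the exact project code may differ):

# Building the address string from its components, dropping codmun and nommun,
# which are constant for Santa Brígida and therefore carry no information.
import pandas as pd

df = pd.DataFrame({
    "tvia": ["CALLE", "CALLE"],
    "nvia": ["NAVA Y GRIMON", "EL HAMBRE"],
    "nume": ["29", "5"],
})

df["address"] = (df["tvia"] + " " + df["nvia"] + " " + df["nume"]).str.strip()
print(df["address"].tolist())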
Now the address data is almost ready for training, but for models like word2vec and fastText it is preferable to pass the data in a specific format; that is why we will tokenize the data before the training phase.
4.3.2 Modelling

In NLP, instead of always training your own model, it is recommended in some cases to use pre-trained models. The advantage of these models is that they have been trained on a much larger corpus, so they are more mature. We can find these models in different public repositories or research publications. In this project, in addition to our own trained models, we use word2vec, fastText and BERT models pre-trained for the Spanish language.
As already mentioned, in this project we work with the word2vec, fastText, doc2vec and BERT models. For word2vec and fastText we first train our own models and then load pre-trained ones. In the case of doc2vec we also train our own model, whereas for BERT we load a Spanish BERT model.
➢ word2vec
Train model
A word2vec model uses a set of parameters that affect both the training speed and the quality. During the training phase we adjusted the parameters several times in order to obtain a good model. The figure below shows how the model is trained.
It is important to notice that word2vec sets some parameters by default; in our case we use the CBOW variant, which is the default variant implemented by word2vec. The model receives the dataset in the right format as its first parameter, here data, which contains our tokenized addresses. The min_count parameter ignores words that do not appear a certain number of times in the corpus; by default the value is 5, but this can pose a problem in our case, which is why we set it to the minimum value, 1. The size parameter determines the number of dimensions (N) of the N-dimensional space onto which gensim Word2Vec maps the words; we chose a 100-dimensional space. The workers parameter determines the number of cores used for training; it takes effect when we set it to 1, and if we put another value we have to install a tool like Cython.
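A sketch of what such a training call can look like (assumed code; in Gensim 4.x the size parameter described above is called vector_size):

# Training a CBOW word2vec model on the tokenized addresses (sg=0 is the default).
from gensim.models import Word2Vec

# data: tokenized addresses, as produced in the preprocessing step
data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]

model = Word2Vec(data, vector_size=100, min_count=1, workers=1)
model.save("word2vec_addresses.model")  # hypothetical file name
print(model.wv["CALLE"].shape)          # (100,) -> one 100-dimensional vector per word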
➢ FastText
Train model
Like word2vec, a fastText model uses a set of parameters that affect both the training speed and the quality. We train the fastText model in the same way we did with word2vec.
The model receives the dataset in the right format as its first parameter, here data, which contains our tokenized addresses. The min_count parameter ignores words that do not appear a certain number of times in the corpus. The size parameter determines the number of dimensions (N) of the N-dimensional space onto which the words are mapped; we chose a 100-dimensional space. The workers parameter determines the number of cores used for training.
Load model
For the pre-trained fastText model [26], the way to load it depends on the format in which the model was downloaded (vec or bin). In our case we download a pre-trained model for the Spanish language in vec format, because the bin format needs complex transformations.
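A sketch of both operations (assumed code; the file name cc.es.300.vec is given only as an example of a pre-trained Spanish fastText vector file):

# Training our own FastText model and loading a pre-trained Spanish model in .vec format.
from gensim.models import FastText, KeyedVectors

data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]

ft_model = FastText(data, vector_size=100, min_count=1, workers=1)

# The .vec format is plain word2vec text format, so it loads as KeyedVectors;
# note that loaded this way it does not keep FastText's subword information.
pretrained = KeyedVectors.load_word2vec_format("cc.es.300.vec")
print(pretrained.most_similar("calle", topn=3))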
➢ doc2vec
The process of training a doc2vec model is similar to word2vec, but here we have some additional steps. Below are the steps to train the model:
Tagged documents
Training model
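A sketch of these two steps (assumed code): each address is wrapped in a TaggedDocument before training.

# Tagging the addresses and training a Doc2Vec model on them.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(data)]

model = Doc2Vec(tagged, vector_size=100, min_count=1, workers=1, epochs=20)

print(model.dv[0].shape)                      # vector of the first training address
print(model.infer_vector(["CALLE", "NAVA"]))  # vector inferred for a new address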
➢ BERT
In this project we use a BERT model for the Spanish language [27]. Below are the steps to follow:
4.3.3 Vectorisation and similarity calculation

Once we have the trained and pre-trained models, we can vectorize each address in the dataset and calculate the similarities in order to classify the address pairs into matched and not matched.
For word2vec and FastText we cannot directly obtain the vector representation of a whole address; we have one vector per word, so we average the word vectors.
The function below receives a list of addresses and a model and generates the corresponding vector representation of each address.
Figure 4.25: Applying the function with the trained word2vec model
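A sketch of such an averaging function (assumed code, written against the Gensim 4.x API; unknown words are simply skipped):

# Average the word vectors of an address to obtain a single address vector.
import numpy as np

def address_vectors(addresses, model):
    vectors = []
    for tokens in addresses:                       # each address is a list of tokens
        known = [model.wv[t] for t in tokens if t in model.wv]
        if known:
            vectors.append(np.mean(known, axis=0))
        else:
            vectors.append(np.zeros(model.vector_size))
    return np.array(vectors)

# Example (placeholder names): vecs = address_vectors(data, model)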
With doc2vec and BERT we directly generate the vector representation of the whole address.
In order to measure the similarity between addresses we evaluate the cosine similarity between their vector representations. In addition, we introduce a heatmap to visualise the similarities.
Beforehand, we define two functions, one for the similarity calculations and one for the heatmap creation.
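A sketch of these two helper functions (assumed code, using scikit-learn for the pairwise cosine similarity and seaborn for the heatmap; vecs and list_of_addresses are placeholder names):

# Pairwise cosine similarities between address vectors, plotted as a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def similarity_matrix(vectors):
    return cosine_similarity(vectors)          # square matrix of pairwise similarities

def plot_heatmap(matrix, labels):
    sns.heatmap(matrix, xticklabels=labels, yticklabels=labels, cmap="viridis")
    plt.show()

# Example: plot_heatmap(similarity_matrix(vecs), labels=list_of_addresses)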
Figure 4.29 represents the heatmap of the similarities between addresses using our trained word2vec model.
The graphics for each model are available in the appendices (6.1).
4.4 Results

In order to compare the models we define, as discussed in chapter 3, a performance evaluation procedure similar to those applied to machine learning classification algorithms. A confusion matrix is defined and the accuracy, precision and recall metrics are evaluated.
The function in figure 4.30 shows the different steps for creating the confusion matrix.
The function receives as parameters the address vectors and the list of address uuids.
It proceeds by calculating the cosine similarity between addresses one by one and comparing the result with the fixed threshold (0.9). The threshold was fixed to this value after testing the performance of the models with several values: 0.7, 0.8 and 0.9.
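A sketch of how such counts can be accumulated (assumed code, not the exact project function): two addresses are predicted as a match when the cosine similarity of their vectors exceeds the threshold, and the prediction is compared with the ground truth given by the shared uuid_idt.

# Building confusion-matrix counts from pairwise similarities and known uuids.
from sklearn.metrics.pairwise import cosine_similarity

def confusion_counts(vectors, uuids, threshold=0.9):
    sims = cosine_similarity(vectors)
    tp = fp = tn = fn = 0
    for i in range(len(uuids)):
        for j in range(i + 1, len(uuids)):
            predicted_match = sims[i, j] >= threshold
            real_match = uuids[i] == uuids[j]   # same uuid_idt -> same real address
            if predicted_match and real_match:
                tp += 1
            elif predicted_match and not real_match:
                fp += 1
            elif not predicted_match and real_match:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn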
In general, the models trained on the address dataset present acceptable performance. In the case of the pre-trained models, only BERT gives acceptable results. The accuracy obtained with word2vec and fastText is very low, while the recall takes a high value. This means that these models predict as "match" many addresses that do not match (low precision), although they do predict as "match" the addresses that actually match (high recall). On the other hand, doc2vec outperforms the two previous models and, finally, the results obtained with BERT improve the performance further, reaching promising values.
Doc2vec and BERT are the best-performing models; the trained word2vec and fastText present almost the same results, while their pre-trained versions fail to perform.
Chapter 5: Conclusions and future development

5.1 Conclusions

In this project we explore the use of machine learning techniques in the field of address matching. In particular, with addresses in text-based format, we introduce natural language processing approaches and generate numerical representations for each pair of addresses.
In order to generate the numerical representations of addresses we study several word and sentence embedding models: word2vec, fastText, doc2vec and BERT.
We first train these models with real address datasets from the ISTAC [1]; we then use models for the Spanish language pre-trained by other communities on large corpora of data.
We introduce the cosine similarity as a metric for resolving address records into match and not match, and finally we evaluate the performances.
The work carried out during this project shows great potential for the use of machine learning and NLP in the field of address matching, but it is really important in the implementation process to have accurate data and to choose the right models.
The results obtained lead us to the conclusion that it is promising to solve the address matching problem through the similarity of the vectors generated by the language models. They also reveal the need for models trained on large numbers of documents; in our tests these guarantees are offered by the Spanish BERT model used, but the results also suggest that generating a doc2vec model with a much larger volume of addresses could lead to good system performance.
5.2 Future developments

➢ This study has been restricted to the municipality of Santa Brígida, but in future work, with more resources available, we plan to extend it to all the municipalities of the Canary Islands in order to confirm our generic method.
➢ As we compared the different models using cosine similarity, a comparative study based on other similarity metrics, such as the Euclidean distance, the Levenshtein distance, etc., should also be carried out.
➢ In the case of having a dataset with a large pool of variations for each address, each pool of addresses could be treated as a machine learning classification problem. In that case, we would use machine learning classification algorithms like XGBoost, random forest, etc.
➢ Generate language models with the data of all the Canary Islands addresses in the Canary Islands Integrated Data System.
Chapter 6: Appendices

6.1 Dataset creation source code
The following Colab notebook contains the Python code used to create the dataset used in the project:
Google Colab Link

6.2 Project source code
Below is a link to the project on Google Colaboratory, where you can view and test the Python code shown throughout this project.

6.3 Data sources
The following link points to a shared Google Drive folder that contains all the data files in CSV format used in this project. There we can find the "Santabrigida.csv" file used for creating the base dataset "muestra.csv".

6.4 Pre-trained models
The following link points to a shared Google Drive folder that contains all the pre-trained models used in this project.
Bibliography
[4] Comber, S., Arribas-Bel, D. Machine learning innovations in address matching: A practical comparison of word2vec and CRFs. Transactions in GIS. 2019;23:334-348. https://doi.org/10.1111/tgis.12522
[11] CountVectorizer, scikit-learn. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
[15] FastText. https://blogs.sap.com/2019/07/03/glove-and-fasttext-two-popular-word-vector-models-in-nlp/
[17] An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. https://arxiv.org/abs/1607.05368
[18] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
[19] BERT Explained: State of the art language model for NLP. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
[21] The Levenshtein Algorithm. https://www.cuelogic.com/blog/the-levenshtein-algorithm
[22] González Yanes, A., Betancor Villalba, R., Hernández García, M.S. (2021). Título. XXI Jornadas de Estadística de las Comunidades Autónomas, JECAS. Las Palmas de Gran Canaria: ISTAC. Retrieved from https://jecas.es/sistema-de-georreferenciacion-para-fines-estadisticos/