Machine Learning and NLP Approaches in Address Matching
Máster Universitario en Ciberseguridad e Inteligencia de los datos
Lamine SYNE
Dr. Isabel Sánchez Berriel, with N.I.F. 42.885.838-S, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as tutor, and
Dr. Luz Marina Moreno de Antonio, with N.I.F. 45.457.492-Q, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as co-tutor,
CERTIFY
that this work has been carried out under their supervision by Mr. Lamine SYNE, with N.I.F. Y-9077440-K.
Acknowledgments
I would like to thank my tutors, Isabel Sánchez Berriel and Luz Marina Moreno de Antonio, for their supervision and guidance throughout this project.
I would also like to thank the director of the Master's in Cybersecurity and Data Intelligence at La Laguna University, Pino Caballero Gil, for her precious help during the year.
Finally, I would like to thank the Canary Islands Government for giving me the chance to live this experience through the PBCA program.
Licence
Abstract
The aim of this project is to explore the potential of machine learning and NLP in the address matching sub-field of geographic information science. To achieve this, a deep study of word and sentence embedding models was carried out: how they work and how they can be used to generate numerical representations of an address.
We also introduce the confusion matrix to evaluate the performance of each model on a dataset of already matched addresses created from ISTAC [1] data sources, and we compare the models against each other.
Finally, a use case example is shown by choosing the best-performing model among those studied above. This can be the starting point for building a powerful tool for matching address pairs across all the Canary Islands.
Key words: machine learning, NLP, language model, address matching, word embedding, similarity
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background
1.2 Objectives
1.3 Scope
Chapter 2: State of the art
2.1 Address matching challenges
2.2 Introduction to natural language processing
2.2.1 Terminologies
2.2.2 Text preprocessing in NLP
2.2.3 Syntactic and semantic analysis
2.3 Word and sentence embedding techniques
2.3.1 Word embeddings
2.3.1.1 One-Hot Encoding & Bag of Words
2.3.1.2 Term Frequency-Inverse Document Frequency: TF-IDF
2.3.1.3 Word2Vec
2.3.1.4 FastText
2.3.2 Sentence embeddings
2.3.2.1 Doc2vec
2.3.2.2 BERT: Bidirectional Encoder Representations from Transformers
2.4 Text similarity and measures
Chapter 3: Methodology
3.1 Process definition
3.2 Implementation planning
Chapter 4: Development
4.1 Dataset Creation
4.2 Libraries: Gensim, NLTK and Sentence Transformers
4.3 Implementation
4.3.1 Data description and preprocessing
4.3.2 Modelling
4.3.3 Vectorisation and similarity calculation
4.4 Results
Chapter 5: Conclusions and future development
5.1 Conclusions
5.2 Future developments
Chapter 6: Appendices
6.1 Dataset creation source code
6.2 Project source code
6.3 Data sources
6.4 Pre-trained models
Bibliography
List of Figures
Figure 2.3: CBOW model with multiple words in the context
Figure 4.25: Applying the function with the trained word2vec model
List of Tables
Chapter 1: Introduction

In particular, the ISTAC [1], with the objective of facilitating the production of spatial statistics, as well as the production of multi-source statistics through the Integrated Data System of the Canary Islands Statistical Plan 2018-2022, is georeferencing information from different sources in the geostatistical reference of the Canary Islands.
A database covering all the georeferenced municipalities of the Canary Islands is used; however, the periodic updating of the data, as well as the integration of new sources of information into the system, requires matching the addresses of each registry against the set of those that have already been georeferenced.
In this project, we focus on providing solutions to the ISTAC [1] address matching problems described above, using machine learning and NLP techniques.
1.1 Background

Over the years, in the specific field of address matching, notable research has been carried out on resolving address records into "matched" and "not matched". In particular, there has been advanced research into quantitative methods for determining the extent of matching between pairs of text-based records, with numerous string-similarity measures developed, including Levenshtein and Jaro-Winkler.
In parallel, a group of researchers developed the concept of 'similarity join', whereby two databases are tested by evaluating each combination of record pairs against a similarity measure function, with those pairs that exceed a preset threshold being recorded. They acknowledged that, despite the availability of numerous similarity or 'distance' functions, no one measure excels in every application [2].
The ISTAC [1], in its georeferencing work [22], uses a technique based on Record Linkage that consists of comparing normalised addresses with other records that have already been georeferenced. Currently, they use the R package "RecordLinkage" to evaluate similarities and assign a weight indicating the similarity between the compared records.
The research reported in [4], [2] shows that machine-learning techniques can be used to either enhance or replace the traditional rule-based solutions that are commonly applied to address matching.
The first paper [4] introduces two particular innovations into the address matching workflow: conditional random fields (CRFs) and word (address) embeddings.
The second paper [2] introduces a framework called Post Match, which combines the open-source library "Libpostal" for address parsing with a post-parse process and the Jaro-Winkler edit distance algorithm, together with XGBoost machine learning classification.
In both cases a binary classification algorithm is applied, one pair at a time, to decide whether one address matches another or not. Applied to each address with respect to the whole reference address pool, this makes the problem very computationally expensive.
1.3 Scope

Due to the limited computing resources, the difficulty of processing address components and the time reserved for this work, we have to set some limitations.
We will focus only on the addresses of the municipality of Santa Brígida, located in Gran Canaria, where we have a register of 3600 unique addresses (the SantaBrigidareferencia set) and another register of 16024 addresses (the SantaBrigida set) containing those unique addresses together with variations of them. For example, below we have a collection of addresses, of which number 1 belongs to SantaBrigidareferencia and 1, 2, 3 and 4 are in SantaBrigida, with 2, 3 and 4 being variations of 1.
Chapter 2: State of the art

2.1 Address matching challenges

An address is one of the most unstructured data items, and this makes address matching a challenging downstream task. Here are the most likely issues we will run into while trying to match addresses:
➢ abbreviation: 29 C/ Nava
While these errors may seem easy to notice at a glance, it is very challenging to program a system to identify each difference. More than that, it requires significant computational power and a lot of processing time. These errors can lead to many failures when attempting to perform address matching, as the records will not match.
For all the reasons discussed above, sometimes we can face difficulties relating two addresses and connecting datasets. When this occurs, we end up with the following disparities between our records:
As we can see, the task of matching addresses becomes complicated when we have to compare records that are often formatted and input differently. Because of this, the simple task of matching addresses becomes much more complicated than expected.
In real location-based businesses, these issues with linking datasets will cause major problems in your workflow, slowing down your business and causing errors in delivery, billing and more [5].
One of the most common problems in address matching is data preprocessing. In many cases we fail to correctly preprocess our data. However, this step is very important, and having clean data before processing is essential for getting quality results.
2.2 Introduction to natural language processing

2.2.1 Terminologies

Corpus
A corpus is a large, structured set of machine-readable texts produced in a natural communicative setting. If we have a set of sentences in our dataset, all the sentences make up the corpus, and the corpus would be like a paragraph made of a mixture of sentences. We just have to know that a corpus is a collection of documents. In our case of study the corresponding corpus is a set of addresses [7].

Documents
A document is a single text within the corpus. If we have 100 sentences, each sentence is a document. The mathematical representation of a document is a vector [7]. In this project we will consider each address as a document.

Vocabulary
The vocabulary is the collection of unique words appearing in the corpus. Let's take the following example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
Vocabulary = { CALLE, NAVA, Y, GRIMÓN, EL, HAMBRE }

Words
All the words in the corpus. Let's take the previous example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
words = { CALLE, NAVA, Y, GRIMÓN, CALLE, EL, HAMBRE }

N-gram
In the field of computational linguistics, an n-gram is a contiguous sequence of n items from a given sample of text or speech [8]. For the address CALLE NAVA Y GRIMON, we have:
1-gram set: CALLE, NAVA, Y, GRIMON
2-gram set: CALLE NAVA, NAVA Y, Y GRIMON
Char N-gram
A char n-gram is an n-gram taken over the characters of a word rather than over the words of a sentence; for example, the character 3-grams of CALLE include CAL, ALL and LLE.

2.2.2 Text preprocessing in NLP

Preprocessing
Text preprocessing typically includes steps such as:
➢ Removal of noise, URLs, hashtags and user mentions
➢ Lowercasing
➢ Replacing emoticons and emojis
➢ Replacing elongated characters
➢ Correction of spellings
➢ Removing the punctuation
➢ etc.

Stemming
Stemming is the technique of replacing and removing suffixes and affixes to get the root, base or stem word. We may find similar words in the corpus but with different spellings, like having, have, etc. All of them are similar in meaning, so to reduce them to a base word we use a concept called stemming, which converts words to their base form [7].

Lemmatization
Lemmatization is a technique similar to stemming. In stemming, the root word may or may not have a meaning, but in lemmatization the root word surely has a meaning; it uses linguistic knowledge to transform words into their base forms [7].
Parsing
Parsing refers to the formal analysis of a sentence by a computer into its constituents, which results in a parse tree showing their syntactic relation to one another in visual form, which can be used for further processing and understanding [6].

2.2.3 Syntactic and semantic analysis

Syntactic analysis (syntax) and semantic analysis (semantics) are the two primary techniques that lead to the understanding of natural language. Language is a set of valid sentences, but what makes a sentence valid? Syntax and semantics.
Syntax is the grammatical structure of a text, whereas semantics is the meaning being conveyed. A sentence that is syntactically correct, however, is not always semantically correct [6].
2.3 Word and sentence embedding techniques

2.3.1 Word embeddings

2.3.1.1 One-Hot Encoding & Bag of Words

One-Hot Encoding and Bag of Words are among the most straightforward ways to numerically represent words.
In One-Hot Encoding, the idea is to create a vector whose size is the total number of unique words in the corpus. Each unique word has its own feature and is represented by a 1, with 0s everywhere else. In the Bag of Words representation (also called count vectorizing [11]), each word is represented by its count instead of a 1 [12]. Let's look at an easy example to understand the concepts previously explained. We could be interested in analysing tables 2.3 and 2.4:
        CALLE  NAVA  Y  GRIMÓN  …  HAMBRE
Calle     1     0    0    0     …    0
Nava      0     1    0    0     …    0
Table 2.3: One-hot encoding of the words "Calle" and "Nava"

        CALLE  NAVA  Y  GRIMÓN  EL  HAMBRE
1         1     1    1    1     0     0
2         1     0    0    0     1     1
Table 2.4: Bag-of-words representation of sentences 1 and 2
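To make the two representations concrete, here is a minimal sketch using scikit-learn's CountVectorizer; this library choice is assumed only for the illustration and is not part of the project's toolset described later. The token pattern is relaxed so that single-letter words such as "Y" are kept.

# Minimal bag-of-words sketch (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer

addresses = ["CALLE NAVA Y GRIMON", "CALLE EL HAMBRE"]

vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(addresses)

print(vectorizer.get_feature_names_out())  # the vocabulary of the corpus
print(bow.toarray())                       # one count vector per address

# A one-hot vector for a single word keeps only a 1 in that word's column,
# e.g. the row produced by transforming the one-word document "NAVA".
print(vectorizer.transform(["NAVA"]).toarray())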
2.3.1.2 Term Frequency-Inverse Document Frequency: TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document (TF) and the inverse document frequency (IDF) of the word across the set of documents [9]. The IDF is used to penalise very commonly used words that do not provide semantic information, such as articles, prepositions, etc.
Let's calculate TF-IDF("Calle", {d1, d2}, D) for the following corpus:
Address 1 (d1): Calle Nava y Grimon
Address 2 (d2): Calle el Hambre
Corpus D = [Address 1, Address 2]

     TF     IDF        TF × IDF
d1   1/4    log(2/1)   1/4 × log(2/1)
d2   1/3    log(2/1)   1/3 × log(2/1)
Table 2.5: Calculation of TF-IDF("Calle", {d1, d2}, D)
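As a further illustration, the following minimal sketch (assumed code, written only for this example) computes TF-IDF by hand following the textbook definition TF × log(N / df). Note that, under this definition, "CALLE" appears in every document of the tiny corpus and therefore gets an IDF of log(2/2) = 0, so the exact numbers may differ from the table above and from library implementations that apply smoothing.

# Hand-rolled TF-IDF for the two example addresses (illustrative only).
import math

d1 = "CALLE NAVA Y GRIMON".split()
d2 = "CALLE EL HAMBRE".split()
corpus = [d1, d2]

def tf(term, doc):
    # relative frequency of the term in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("CALLE", d1, corpus))   # 0.0 -> "CALLE" appears in every address
print(tf_idf("GRIMON", d1, corpus))  # > 0 -> "GRIMON" is distinctive of d1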
2.3.1.3 Word2Vec
Word2vec is one of the most popular techniques to learn word embeddings based on a neural network. The neural network aims to predict the distribution of word contexts in the corpus, p(w | context of w), and simultaneously learns the word representations. A single-layer neural network with a linear activation function is used; the contexts are represented by the succession of previous words within the chosen window size (n):
p(w_i | w_{i−n}, w_{i−n+1}, …, w_{i−1})
We thus obtain a representation of the words in a continuous multidimensional number space, where words with similar contexts are close to each other. Word2vec takes the text corpus as input and outputs a set of feature vectors that represent the words in that corpus. It uses two neural-network-based methods:
➢ Continuous Bag Of Words (CBOW)
➢ Skip-Gram
The CBOW model takes the context of each word as the input and tries to predict the word corresponding to the context. Here, context simply means the surrounding words.
Skip-Gram uses the target word, the word we want to generate the representation for, to predict the context. In the process of predicting the context words, the model learns the vector representation of the target word.
Figure 2.2: CBOW model with one word in the context [12]
Let's say we use the word 'Plaza' as the input to the neural network and we are trying to predict the word 'Paz'. We use the one-hot encoding of the input word 'Plaza', then measure and optimise the output error for the target word 'Paz'. In this process of trying to predict the target word, this shallow network learns its vector representation. In the same way that the model used a single word to predict the target, it can use multiple context words to do so, as in the architecture in figure 2.3:
Figure 2.3: CBOW model with multiple words in the context [12]
The choice between CBOW and Skip-Gram when training a word2vec model depends on the case we intend to solve. CBOW is better at learning syntactic relationships between words, while Skip-Gram is better at capturing semantic relationships. Skip-Gram also works better with a small amount of data and represents rare words well. On the other hand, CBOW is faster, focuses more on the morphological similarity of words, and needs more data to achieve similar performance.
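As a minimal sketch (assuming the Gensim library, whose current versions name the dimensionality parameter vector_size), the choice between the two variants is made with the sg flag:

# Choosing between CBOW (sg=0, the default) and Skip-Gram (sg=1) in Gensim.
from gensim.models import Word2Vec

sentences = [["CALLE", "NAVA", "Y", "GRIMON"], ["CALLE", "EL", "HAMBRE"]]

cbow_model = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)
skipgram_model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

print(cbow_model.wv["CALLE"][:5])      # first components of the CBOW vector
print(skipgram_model.wv["CALLE"][:5])  # first components of the Skip-Gram vector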
2.3.1.4 FastText
FastText tries to include the morphological structure of words, because this structure carries information about the meaning and is not taken into account by traditional word embeddings like word2vec, which train a single embedding for every individual word. FastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language independence, the subwords are taken to be the character n-grams of the word. The vector for a word is simply taken to be the sum of the vectors of all its component character n-grams [14]. For example, the fastText representation of the word "CALLE" when using 3-grams corresponds to the collection of trigrams of the string <CALLE>: <CA, CAL, ALL, LLE, LE>.
The algorithm always starts the string of each word with "<" and ends it with ">". This representation helps to extract morphological information from the words, such as suffixes and prefixes. With the generated n-grams, a skip-gram model is trained to create the word representations [15].
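A small sketch of this character n-gram decomposition (illustrative code, not FastText's internal implementation):

# Character n-grams of a word, with "<" and ">" marking the word boundaries.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("CALLE"))  # ['<CA', 'CAL', 'ALL', 'LLE', 'LE>']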
2.3.2 Sentence embeddings

So far we have discussed how word embeddings represent the meaning of the words in a text document. But sometimes we need to go a step further and encode the meaning of a whole sentence to readily understand the context in which the words are used.
A straightforward approach for creating sentence embeddings is to use a word embedding model to encode all the words of a given sentence and take the average of all the word vectors. While this provides a strong baseline, it falls short of capturing information related to word order and other aspects of overall sentence semantics.
2.3.2.1 Doc2vec
Doc2vec is a model for creating a numerical representation of a document. It extends the idea of word2vec and, like the latter, has two variants:
➢ Distributed Memory (DM)
➢ Distributed Bag Of Words (DBOW)
In the DM variant, each word and each sentence of the training corpus are one-hot encoded and stored in the matrices W and D, respectively. The training process involves passing a sliding window over the sentence, trying to predict the next word based on the previous words and the sentence vector (the Paragraph Matrix in the figure above). This prediction of the next word is done by concatenating the sentence and word vectors and passing the result into a softmax layer.
The sentence vectors change from sentence to sentence, while the word vectors remain the same; both are updated during training.
The inference process also involves the same sliding-window approach. The difference is that all the vectors of the model are fixed except the sentence vector. After all the predictions of the next word have been computed for a sentence, the resulting sentence vector is the sentence embedding [12].
The DBOW model ignores the word order and has a simpler architecture. Each sentence in the training corpus is converted into a one-hot representation. During training, a random sentence is selected from the corpus and, from that sentence, a random number of words. The model tries to predict these words using only the sentence ID, and the sentence vector is updated (the Paragraph ID and paragraph matrix in the figure). During inference, a new sentence ID is trained with random words from the sentence. The sentence vector is updated in each step, and the resulting sentence vector is the embedding for that sentence [12].
As a comparison between the two doc2vec models, we can follow the direction of the authors of the original paper [16], who state that the DM model "is consistently better than" DBOW. However, other studies [17] showed that the DBOW approach is better for more tasks. We also have to keep in mind that the DM model takes the word order into account while the DBOW model doesn't; moreover, the DBOW model doesn't use word vectors, so the semantics of the words are not preserved.
2.3.2.2 BERT: Bidirectional Encoder Representations from Transformers

BERT provides a way to pre-train models that consider contexts both to the right and to the left of words, using the Masked LM (MLM) technique. In BERT's MLM, instead of using pre- or post-word sequences, the entire sequence is used, from which a percentage of words to be predicted is masked out. The algorithm works on pairs of sentences: once the masked words have been predicted, BERT uses next sentence prediction. This part of the algorithm predicts whether the second sentence is the sentence that follows the first one in the original text.
The algorithm embeds metadata to indicate the start and end of segments, the separation between sentences, the masked words, etc., as can be seen in the example:
[CLS] the [MASK] has blue spots [SEP] it rolls [MASK] the parking lot [SEP] [19]
The implementation of BERT has two steps: pre-training and fine-tuning. During pre-training the model is trained on unlabelled data over different pre-training tasks. For fine-tuning, the BERT model is first initialised with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialised with the same pre-trained parameters [18].
This is just the initial part of the BERT implementation, and the complete procedure is described in the original paper [18], but we have to keep in mind that BERT is one of the best general language models and produces good results on sentence embeddings.
2.4 Text similarity and measures

Similarity is the distance between two vectors whose dimensions represent the features of two objects. In simple terms, similarity is the measure of how different or alike two data objects are. If the distance is small, the objects are said to have a high degree of similarity, and vice versa. Generally, it is measured in the range 0 to 1, and this score in the range [0, 1] is called the similarity score [12].
In the same way, text similarity measures how different or alike two texts or sentences are. As humans, it is very obvious to us when two sentences mean the same thing despite being written in completely different formats, but for algorithms to come to the same conclusion we first have to solve the problem of text representation, converting the text into feature vectors using one of the suitable text embedding techniques above. Once we have the text representation, we can compute the similarity score using one of the many distance/similarity measures [12].
Let's dive deeper into the text similarity measures.
Jaccard Index
The Jaccard index, also known as the Jaccard similarity coefficient, treats the data objects as sets. It is defined as the size of the intersection of two sets divided by the size of their union:
J(A, B) = |A∩B| / |A∪B| = |A∩B| / (|A| + |B| − |A∩B|)
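For instance, treating each address as a set of words (a simple illustrative sketch):

# Jaccard similarity of two addresses seen as sets of words.
def jaccard(a, b):
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("CALLE NAVA Y GRIMON", "CALLE EL HAMBRE"))  # 1/6 ≈ 0.17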
Euclidean Distance
The Euclidean distance, or L2 norm, uses the Pythagorean theorem to calculate the distance between two points, as indicated in figure 2.10. Generally speaking, when people talk about distance, they refer to the Euclidean distance. It is given by:
d(v, w) = √( Σᵢ (vᵢ − wᵢ)² )
The larger the distance d between two vectors, the lower the similarity score, and vice versa. Since the distances can vary from 0 to infinity, we need some way to normalise them to the range 0 to 1.
Although we have the typical normalisation formula that uses the mean and standard deviation, it is sensitive to outliers: if there are a few extremely large distances, every other distance will become smaller as a consequence of the normalisation operation. So the best option here is to use something based on Euler's number, such as taking the similarity to be 1 / eᵈ [12].
Levenshtein distance
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other [21].
Mathematically, the Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is lev(|a|, |b|), defined by the recurrence:
lev(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev(i, j) = min( lev(i−1, j) + 1, lev(i, j−1) + 1, lev(i−1, j−1) + [aᵢ ≠ bⱼ] )
where [aᵢ ≠ bⱼ] is 1 when the i-th character of a differs from the j-th character of b, and 0 otherwise.
Cosine Similarity
Cosine similarity computes the similarity of two vectors as the cosine of the angle between them. It determines whether two vectors are pointing in roughly the same direction; if the angle between the vectors is 0 degrees, the cosine similarity is 1 [12].
It is given by:
cos(v, w) = (v • w) / (||v|| × ||w||)
where ||v|| represents the length of the vector v, ||w|| the length of the vector w, and '•' denotes the dot product.
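A minimal sketch of this computation with NumPy (assumed here only for illustration):

# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([0.2, 0.1, 0.7])
w = np.array([0.3, 0.0, 0.6])
print(cosine_similarity(v, w))  # close to 1 when the vectors point the same way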
Chapter 3: Methodology
In order to achieve the objectives set for this project, it is necessary to find the right methodology.
Several meetings were held with the tutor and the co-tutor. First, an explanation of the topic was given; second, we defined the process and the tasks necessary to produce results. During the development of the project we held frequent meetings to check the progress of the tasks, raise doubts and verify that the steps taken were the right ones. The main idea is that at each review the project should show some evolution with respect to the previous check, which is in line with the Scrum planning model.
3.1 Process definition

During our investigation of machine learning opportunities in the address matching field, several approaches were tested, but the most promising one remains the use of word or sentence embeddings coupled with a text similarity measure.
Our proposal in this work consists of determining the similarity of each address to the reference addresses through the embeddings generated by the different algorithms described above. We consider that a match exists between two addresses when their similarity in the representation space exceeds a threshold.
Our approach is therefore to measure distances between addresses, but, by using language models, complex relationships between words such as semantics and morphology are considered, and not only similarities at the character level.
However, there are different word embedding techniques, so our process naturally consists of, first, a study of each one; second, their implementation using the address dataset; and finally, a practical comparison.
In addition, we define a performance evaluation procedure similar to those applied in machine learning classification: we set up a confusion matrix and evaluate the accuracy, precision and recall metrics.
Figure 3.1 shows the key steps of our work process.
As shown in the figure, we first generate address embeddings with each model (word2vec, FastText, doc2vec and BERT), secondly classify addresses into match or no match through the similarity, and finally evaluate the performance of each model in order to compare them.
Similarity
Classification
To classify addresses into matched or not matched, we compare the result of the similarity calculation with a fixed threshold value.
Performance Evaluation
Our initial dataset provides the matching status of the addresses through an identification number (uuid_idt), so a classical method would be to use a machine learning classification algorithm and extract its performance. However, the lack of variation in our dataset motivates us to perform a manual evaluation, calculating the true positives, false positives, true negatives and false negatives from the classification results and the known status of the addresses.
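As a sketch of this manual evaluation (with made-up counts, shown only to make the metrics explicit), accuracy, precision and recall are derived from the four cells of the confusion matrix:

# Accuracy, precision and recall from manually counted confusion-matrix cells.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Example with hypothetical counts:
print(metrics(tp=80, fp=15, tn=90, fn=20))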
3.2 Implementation planning

In order to correctly implement the process defined for this project, a plan was made; table 3.1 gives the details of the tasks needed to implement the described process end to end.
Chapter 4: Development
4.1 Dataset Creation

The ISTAC [1] provides a CSV file with the normalised and georeferenced addresses that currently appear in its Integrated Data System.
The variables included in the file correspond to the different elements that make up an address, as well as an identification code (uuid_idt) shared by all the addresses for which the matching has been positive according to their algorithms. Among the variables we find the territory codes ("códigos de territorio"), the normalised and unnormalised type of road ("tipo de vía"), the normalised and unnormalised road name ("nombre de vía"), the road code ("código de la vía") and the normalised and unnormalised portal number.
We can also find information about the technique used to generate the matching and a categorical variable with the values AVERAGE, HIGH or VERY_HIGH, which indicates the quality of the link.
This dataset covers all the municipalities in the Canary Islands, but in our case we extract the part corresponding to one municipality, Santa Brígida. We will train our models using this dataset of 16024 addresses in order to build word embeddings.
From this dataset we select the relevant columns that we will need in the rest of the work.
Column description
Create dataset: once we have a good sample of our dataset, we can export it in CSV format for evaluating the performance of our models.
Finally, we obtain the dataset "Santabrigida" (16024 addresses), which we will use to train our word and sentence embedding models.
To evaluate the performance of the models and compare them, we will use the dataset "muestra", which is a fraction of the "Santabrigida" dataset.
We make this reduction of the data because of the limited resources available and because, as explained in chapter 3, the performance evaluation is very costly.
4.2 Libraries: Gensim, NLTK and Sentence Transformers

In addition to the basic libraries for data analysis, we used some specialised libraries during the project, each with a specific role:
Gensim
Gensim is an open-source Python library for topic modelling, able to train large-scale semantic NLP models, represent text as vectors and find related documents.
Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 3.6+ and NumPy [23].
We can install it by running the command: pip install gensim
In this project we use Gensim to train the word2vec, FastText and doc2vec models, to generate vector representations of addresses and to calculate the similarity.
The figure below shows the basic syntax for importing and training Gensim models.
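A minimal sketch of that syntax (assuming Gensim 4.x, which names the dimensionality parameter vector_size; older versions call it size):

# Importing and training Gensim embedding models on tokenized addresses.
from gensim.models import Word2Vec, FastText

data = [["CALLE", "NAVA", "Y", "GRIMON"], ["CALLE", "EL", "HAMBRE"]]

w2v_model = Word2Vec(data, vector_size=100, min_count=1, workers=1)
ft_model = FastText(data, vector_size=100, min_count=1, workers=1)

print(w2v_model.wv.most_similar("CALLE", topn=2))  # words closest to "CALLE"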
NLTK
NLTK is a leading platform for building Python programs that work with human language data. It provides a suite of text processing utilities for classification, tokenisation, stemming, tagging, parsing, etc.
In this project we use NLTK to preprocess our address dataset and, in particular, to tokenize the data before passing it to the models.
Figure 4.5 shows the basic syntax for tokenising addresses with NLTK.
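A minimal sketch of that tokenisation step (assumed code; word_tokenize needs the NLTK 'punkt' resource to be downloaded once):

# Tokenizing addresses with NLTK before feeding them to the embedding models.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models, downloaded once

addresses = ["CALLE NAVA Y GRIMON 29", "CALLE EL HAMBRE 5"]
data = [word_tokenize(address) for address in addresses]
print(data)  # [['CALLE', 'NAVA', 'Y', 'GRIMON', '29'], ['CALLE', 'EL', 'HAMBRE', '5']]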
Sentence Transformers
Sentence Transformers is a Python library for computing sentence embeddings with transformer models. In this project we use it to implement the BERT model. The figure shows an example of Sentence Transformers loading a BERT model.
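A minimal sketch with the sentence-transformers library (the model name below is a generic multilingual example, not necessarily the Spanish model [27] used in the project):

# Encoding addresses into sentence embeddings with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

embeddings = model.encode(["CALLE NAVA Y GRIMON 29", "C/ NAVA Y GRIMON, 29"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair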
4.3 Implementation

4.3.1 Data description and preprocessing

Before starting to work with the data, let's check for missing values and describe the data through graphics and statistics.
As we can see, the dataset does not contain missing values, so we don't need to apply any technique to fill them in. Our next step is to take a look at the variable nvia: "nombre de vía".
Figure 4.9 shows the number of words of nvia in each address row.
Once we have the number of words for each nvia, we can plot the distribution of the word counts.
In the last part of the data description we show the most used words of the variables nvia and tvia in the addresses. The words that appear most often are drawn with the largest size, and vice versa.
Once we understand the data, the next step is preprocessing; first we format the addresses. In principle an address is a concatenation of the variables tvia, nvia, nume, codmun and nommun, but we have to remember that we are working with data from a single municipality (Santa Brígida). This means that all our addresses have the same value for the variables codmun (35021) and nommun (Santa Brígida). That is why it is necessary to remove them from the address in order to keep only its distinctive part.
So, we are going to create an address column without the two variables cited above.
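A minimal sketch of this formatting step with pandas (column names taken from the description above; the exact project code may differ):

# Building the address string from its components, dropping codmun and nommun,
# which are constant for Santa Brígida and therefore carry no information.
import pandas as pd

df = pd.DataFrame({
    "tvia": ["CALLE", "CALLE"],
    "nvia": ["NAVA Y GRIMON", "EL HAMBRE"],
    "nume": ["29", "5"],
})

df["address"] = (df["tvia"] + " " + df["nvia"] + " " + df["nume"]).str.strip()
print(df["address"].tolist())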
Now the address data is almost ready for training, but for models like word2vec and fastText it is preferable to pass the data in a specific format; that is why we will tokenize the data before the training phase.
4.3.2 Modelling

In NLP, instead of always training your own model, it is recommended in some cases to use pre-trained models. The advantage of these models is that they have been trained on a much larger corpus, so they are more mature. We can find these models in different public repositories or research publications. In this project, in addition to our own trained models, we use word2vec, fastText and BERT models pre-trained for the Spanish language.
As already mentioned, in this project we work with the word2vec, fastText, doc2vec and BERT models. For word2vec and fastText we first train our own models and then load pre-trained ones. In the case of doc2vec we also train our own model, whereas for BERT we load a Spanish BERT model.
➢ word2vec
Train model
A word2vec model uses a set of parameters that affect both the training speed and the quality. During the training phase we adjusted the parameters several times in order to obtain a good model. The figure below shows how the model is trained.
It is important to notice that word2vec sets some parameters by default; in our case we use the CBOW variant, which is the default variant implemented by word2vec. The model receives the dataset in the right format as its first parameter, here data, which contains our tokenized addresses. The min_count parameter ignores words that do not appear a certain number of times in the corpus; by default the value is 5, but this can pose a problem in our case, which is why we set it to the minimum value, 1. The size parameter determines the number of dimensions (N) of the N-dimensional space onto which gensim Word2Vec maps the words; we chose a 100-dimensional space. The workers parameter determines the number of cores used for training; it takes effect when we set it to 1, and if we put another value we have to install a tool like Cython.
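A sketch of what such a training call can look like (assumed code; in Gensim 4.x the size parameter described above is called vector_size):

# Training a CBOW word2vec model on the tokenized addresses (sg=0 is the default).
from gensim.models import Word2Vec

# data: tokenized addresses, as produced in the preprocessing step
data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]

model = Word2Vec(data, vector_size=100, min_count=1, workers=1)
model.save("word2vec_addresses.model")  # hypothetical file name
print(model.wv["CALLE"].shape)          # (100,) -> one 100-dimensional vector per word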
➢ FastText
Train model
Like word2vec, a fastText model uses a set of parameters that affect both the training speed and the quality. We train the fastText model in the same way we did with word2vec.
The model receives the dataset in the right format as its first parameter, here data, which contains our tokenized addresses. The min_count parameter ignores words that do not appear a certain number of times in the corpus. The size parameter determines the number of dimensions (N) of the N-dimensional space onto which the words are mapped; we chose a 100-dimensional space. The workers parameter determines the number of cores used for training.
Load model
For the pre-trained fastText model [26], the way to load it depends on the format in which the model was downloaded (vec or bin). In our case we download a pre-trained model for the Spanish language in vec format, because the bin format needs complex transformations.
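A sketch of both operations (assumed code; the file name cc.es.300.vec is given only as an example of a pre-trained Spanish fastText vector file):

# Training our own FastText model and loading a pre-trained Spanish model in .vec format.
from gensim.models import FastText, KeyedVectors

data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]

ft_model = FastText(data, vector_size=100, min_count=1, workers=1)

# The .vec format is plain word2vec text format, so it loads as KeyedVectors;
# note that loaded this way it does not keep FastText's subword information.
pretrained = KeyedVectors.load_word2vec_format("cc.es.300.vec")
print(pretrained.most_similar("calle", topn=3))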
➢ doc2vec
The process of training a doc2vec model is similar to word2vec, but here we have some additional steps. Below are the steps to train the model:
Tagged documents
Training model
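A sketch of these two steps (assumed code): each address is wrapped in a TaggedDocument before training.

# Tagging the addresses and training a Doc2Vec model on them.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

data = [["CALLE", "NAVA", "Y", "GRIMON", "29"], ["CALLE", "EL", "HAMBRE", "5"]]
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(data)]

model = Doc2Vec(tagged, vector_size=100, min_count=1, workers=1, epochs=20)

print(model.dv[0].shape)                      # vector of the first training address
print(model.infer_vector(["CALLE", "NAVA"]))  # vector inferred for a new address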
➢ BERT
In this project we use a BERT model for the Spanish language [27]. Below are the steps to follow:
4.3.3 Vectorisation and similarity calculation

Once we have the trained and pre-trained models, we can vectorize each address in the dataset and calculate the similarities in order to classify the address pairs into matched and not matched.
For word2vec and FastText we cannot directly obtain the vector representation of a whole address; we have one vector per word, so we average the word vectors.
The function below receives a list of addresses and a model and generates the corresponding vector representation of each address.
Figure 4.25: Applying the function with the trained word2vec model
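A sketch of such an averaging function (assumed code, written against the Gensim 4.x API; unknown words are simply skipped):

# Average the word vectors of an address to obtain a single address vector.
import numpy as np

def address_vectors(addresses, model):
    vectors = []
    for tokens in addresses:                       # each address is a list of tokens
        known = [model.wv[t] for t in tokens if t in model.wv]
        if known:
            vectors.append(np.mean(known, axis=0))
        else:
            vectors.append(np.zeros(model.vector_size))
    return np.array(vectors)

# Example (placeholder names): vecs = address_vectors(data, model)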
With doc2vec and BERT we directly generate the vector representation of the whole address.
In order to measure the similarity between addresses we evaluate the cosine similarity between their vector representations. In addition, we introduce a heatmap to visualise the similarities.
Beforehand, we define two functions, one for the similarity calculations and one for the heatmap creation.
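A sketch of these two helper functions (assumed code, using scikit-learn for the pairwise cosine similarity and seaborn for the heatmap; vecs and list_of_addresses are placeholder names):

# Pairwise cosine similarities between address vectors, plotted as a heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def similarity_matrix(vectors):
    return cosine_similarity(vectors)          # square matrix of pairwise similarities

def plot_heatmap(matrix, labels):
    sns.heatmap(matrix, xticklabels=labels, yticklabels=labels, cmap="viridis")
    plt.show()

# Example: plot_heatmap(similarity_matrix(vecs), labels=list_of_addresses)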
Figure 4.29 represents the heatmap of the similarities between addresses using our trained word2vec model.
The graphics for each model are available in the appendices (6.1).
4.4 Results

In order to compare the models we define, as discussed in chapter 3, a performance evaluation procedure similar to those applied to machine learning classification algorithms. A confusion matrix is defined and the accuracy, precision and recall metrics are evaluated.
The function in figure 4.30 shows the different steps for creating the confusion matrix.
The function receives as parameters the address vectors and the list of address uuids.
It proceeds by calculating the cosine similarity between addresses one by one and comparing the result with the fixed threshold (0.9). The threshold was fixed to this value after testing the performance of the models with several values: 0.7, 0.8 and 0.9.
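A sketch of how such counts can be accumulated (assumed code, not the exact project function): two addresses are predicted as a match when the cosine similarity of their vectors exceeds the threshold, and the prediction is compared with the ground truth given by the shared uuid_idt.

# Building confusion-matrix counts from pairwise similarities and known uuids.
from sklearn.metrics.pairwise import cosine_similarity

def confusion_counts(vectors, uuids, threshold=0.9):
    sims = cosine_similarity(vectors)
    tp = fp = tn = fn = 0
    for i in range(len(uuids)):
        for j in range(i + 1, len(uuids)):
            predicted_match = sims[i, j] >= threshold
            real_match = uuids[i] == uuids[j]   # same uuid_idt -> same real address
            if predicted_match and real_match:
                tp += 1
            elif predicted_match and not real_match:
                fp += 1
            elif not predicted_match and real_match:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn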
In general, the models trained on the address dataset present acceptable performance. In the case of the pre-trained models, only BERT gives acceptable results. The accuracy obtained with word2vec and fastText is very low, while the recall takes a high value. This means that these models predict as "match" many addresses that do not match (low precision), although they do predict as "match" the addresses that actually match (high recall). On the other hand, doc2vec outperforms the two previous models and, finally, the results obtained with BERT improve the performance further, reaching promising values.
Doc2vec and BERT are the best-performing models; the trained word2vec and fastText present almost the same results, while their pre-trained versions fail to perform.
Chapter 5: Conclusions and future development

5.1 Conclusions

In this project we explore the use of machine learning techniques in the field of address matching. In particular, with addresses in text-based format, we introduce natural language processing approaches and generate numerical representations for each pair of addresses.
In order to generate the numerical representations of addresses we study several word and sentence embedding models: word2vec, fastText, doc2vec and BERT.
We first train these models with real address datasets from the ISTAC [1]; we then use models for the Spanish language pre-trained by other communities on large corpora of data.
We introduce the cosine similarity as a metric for resolving address records into match and not match, and finally we evaluate the performances.
The work carried out during this project shows great potential for the use of machine learning and NLP in the field of address matching, but it is really important in the implementation process to have accurate data and to choose the right models.
The results obtained lead us to the conclusion that it is promising to solve the address matching problem through the similarity of the vectors generated by the language models. They also reveal the need for models trained on large numbers of documents; in our tests these guarantees are offered by the Spanish BERT model used, but the results also suggest that generating a doc2vec model with a much larger volume of addresses could lead to good system performance.
5.2 Future developments

➢ This study has been restricted to the municipality of Santa Brígida, but in future work, with more resources available, we plan to extend it to all the municipalities of the Canary Islands in order to confirm our generic method.
➢ As we compared the different models using cosine similarity, a comparative study based on other similarity metrics, such as the Euclidean distance, the Levenshtein distance, etc., should also be carried out.
➢ In the case of having a dataset with a large pool of variations for each address, each pool of addresses could be treated as a machine learning classification problem. In that case, we would use machine learning classification algorithms like XGBoost, random forest, etc.
➢ Generate language models with the data of all the Canary Islands addresses in the Canary Islands Integrated Data System.
Chapter 6: Appendices

6.1 Dataset creation source code
The following Colab notebook contains the Python code used to create the dataset used in the project:
Google Colab Link

6.2 Project source code
Below is a link to the project on Google Colaboratory, where you can view and test the Python code shown throughout this project.

6.3 Data sources
The following link points to a shared Google Drive folder that contains all the data files in CSV format used in this project. There we can find the "Santabrigida.csv" file used for creating the base dataset "muestra.csv".

6.4 Pre-trained models
The following link points to a shared Google Drive folder that contains all the pre-trained models used in this project.
Bibliography
[4] Comber, S., Arribas-Bel, D. Machine learning innovations in address matching: A practical comparison of word2vec and CRFs. Transactions in GIS. 2019;23:334-348. https://doi.org/10.1111/tgis.12522
[11] CountVectorizer, scikit-learn. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
[15] FastText. https://blogs.sap.com/2019/07/03/glove-and-fasttext-two-popular-word-vector-models-in-nlp/
[17] An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. https://arxiv.org/abs/1607.05368
[18] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/pdf/1810.04805.pdf
[19] BERT Explained: State of the art language model for NLP. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
[21] The Levenshtein Algorithm. https://www.cuelogic.com/blog/the-levenshtein-algorithm
[22] González Yanes, A., Betancor Villalba, R., Hernández García, M.S. (2021). Título. XXI Jornadas de Estadística de las Comunidades Autónomas, JECAS. Las Palmas de Gran Canaria: ISTAC. Retrieved from https://jecas.es/sistema-de-georreferenciacion-para-fines-estadisticos/