Machine Learning and NLP Approaches in Address Matching


Máster Universitario en Ciberseguridad e Inteligencia de Datos

Master's Thesis

Machine learning and NLP approaches in address matching

Lamine SYNE

La Laguna, 7 September 2022


Isabel Sánchez Berriel, N.I.F. 42.885.838-S, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as supervisor

Luz Marina Moreno de Antonio, N.I.F. 45.457.492-Q, Profesora Contratada Doctora in the Departamento de Ingeniería Informática y de Sistemas of the Universidad de La Laguna, as co-supervisor

CERTIFY

That the present thesis, entitled:

“Machine learning and NLP approaches in address matching”

was carried out under their supervision by Lamine SYNE, N.I.F. Y- 90 77 440 -K.

In witness whereof, in compliance with current legislation and for all appropriate purposes, they sign this certificate in La Laguna on 7 September 2022.

Acknowledgments

I would like to thank my tutors Isabel Sanchez Berriel and Luz Marina Moreno de
Antonio for their supervision and guidance throughout this project.

I would also like to thank Pino Caballero Gil, director of the Master in Cybersecurity and Data
Intelligence at the Universidad de La Laguna, for her invaluable help during the year.

Finally, I would like to thank the Canary Government for giving me the chance to live this
experience through the PBCA program.


Licence

© This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International licence.


Abstract

The aim of this project is to explore the potential of machine learning and NLP for
address matching, a sub-field of geographic information science. To this end, we studied
word and sentence embedding models in depth: how they work and how they can be used
to generate numerical representations of an address.

For each word or sentence embedding model, we generate vector representations of the
addresses in the database and compute the cosine similarity between them in order to
determine which ones represent the same geographic position.

We then use the confusion matrix to evaluate the performance of each model on a dataset
of already matched addresses created from ISTAC [1] data sources, and we compare the
models.

Finally, a use case is shown using the best-performing of the models studied. It can serve
as a starting point for building a powerful tool for matching address pairs across the
Canary Islands.

Keywords: machine learning, NLP, language model, address matching, word
embedding, similarity


List of Figures
List of Tables
Chapter 1 : Introduction
1.1 Background
1.2 Objectives
1.3 Scope
Chapter 2 : State of the art
2.1 Address matching challenges
2.2 Introduction to natural language processing
2.2.1 Terminologies
2.2.2 Text preprocessing in NLP
2.2.3 Syntactic and semantic analysis
2.3 Word and sentence embedding techniques
2.3.1 Word embeddings
2.3.1.1 One-Hot Encoding & Bag of words
2.3.1.2 Term Frequency-Inverse Document Frequency : TF-IDF
2.3.1.3 Word2Vec
2.3.1.4 FastText
2.3.2 Sentence embeddings
2.3.2.1 Doc2vec
2.3.2.2 BERT : Bidirectional Encoder Representations from Transformers
2.4 Text Similarity and measures
Chapter 3 : Methodology
3.1 Process definition
3.2 Implementation planning
Chapter 4 : Development
4.1 Dataset Creation
4.2 Libraries: Gensim, NLTK and Sentence Transformers
4.2 Implementation
4.2.1 Data description and preprocessing
4.2.2 Modelling
4.2.3 Vectorization and similarity calculation
4.3 Results
Chapter 5 : Conclusions and future development
5.1 Conclusions
5.2 Future developments
Chapter 6 : Appendices
6.1 Dataset creation source code
6.2 Project source code
6.3 Data sources
6.4 Pre-trained models
Bibliography


List of Figures

Figure 2.1 : CBOW & Skip-Gram model
Figure 2.2 : CBOW model with one word in the context
Figure 2.3 : CBOW model with multiple words in the context
Figure 2.4 : Skip-Gram model using target words
Figure 2.5 : Doc2vec Distributed Memory model
Figure 2.6 : Doc2vec distributed bag of words model
Figure 2.7 : BERT mask LM
Figure 2.8 : Pre-training and fine-tuning procedures for BERT
Figure 2.9 : Jaccard distance on two sets
Figure 2.10 : Euclidean distance representation
Figure 2.11 : Levenshtein distance formula
Figure 2.12 : θ angle of vectors (𝑣, 𝑤)
Figure 3.1 : Project process
Figure 4.1 : Dataset registers
Figure 4.2 : Selecting columns forming an address at Santa Brígida
Figure 4.4 : Gensim training models example
Figure 4.5 : NLTK tokenization example usage
Figure 4.6 : Example usage of Sentence-Transformers
Figure 4.7 : Created dataset
Figure 4.8 : Missing values and register numbers
Figure 4.9 : nvia word count
Figure 4.10 : Distribution of word count of the nvia variable
Figure 4.11 : Most commonly used words in addresses (nvia)
Figure 4.12 : Most commonly used words in addresses (tvia)
Figure 4.13 : Creation of new address column
Figure 4.14 : Tokenization of addresses
Figure 4.15 : Training word2vec model
Figure 4.16 : Loading word2vec pre-trained model
Figure 4.17 : Training word2vec model
Figure 4.18 : Loading FastText pre-trained model
Figure 4.19 : Initialise doc2vec model
Figure 4.20 : Tagged document
Figure 4.21 : Training doc2vec model
Figure 4.22 : Installing and loading Transformer
Figure 4.23 : Loading BERT model
Figure 4.24 : Function for generating a vector per document (address)
Figure 4.25 : Applying the function with the trained word2vec model
Figure 4.26 : Vector representation of an address with word2vec
Figure 4.27 : Function for similarity calculations
Figure 4.28 : Function for heatmap creation
Figure 4.29 : Heatmap of similarity using the trained word2vec model
Figure 4.30 : Function for creating the confusion matrix


List of Tables

Table 2.1 : Address matching most common input errors
Table 2.2 : Input address vs reference data to match
Table 2.3 : One-Hot Encoding illustration
Table 2.4 : Bag of words illustration
Table 2.5 : Calculation of TF("Calle", d1, D) & TF-IDF("Calle", d1, D)
Table 3.1 : Implementation planning
Table 4.1 : Column descriptions (Spanish)
Table 4.2 : Summary of results


Chapter 1 : Introduction

Address matching, the process of assigning physical location coordinates to addresses in
databases, has become a core function in many location-based businesses and tasks, such as
take-out services, express delivery, customer record merging, fraud identification and lead
outreach. As a result, the need to match two lists of addresses is a common occurrence in
many companies, organisations and government bodies.

In particular, the ISTAC [1], with the objective of facilitating the production of spatial
statistics as well as multi-source statistics through the Integrated Data System of the
Canary Islands Statistical Plan 2018-202, is georeferencing information from different sources
in the geostatistical reference of the Canary Islands.
A database covering all the georeferenced municipalities of the Canary Islands is used.
However, the periodic update of the data, as well as the integration of new sources of
information into the system, requires matching the addresses of each registry against the set
of addresses that have already been georeferenced.

Address matching is nevertheless an extremely complicated task that raises a number of
challenges, such as varying address component types, noisy and inconsistent databases
replete with missing values, input errors made by users and the free-text nature of the data.

Despite the importance of address matching in the sectors mentioned above, there is a
notable lack of robust open solutions. Recent innovations in machine learning,
particularly in natural language processing (NLP), have been introduced into the wider area
of address matching and show significant potential.

In this project, we focus on addressing the ISTAC's [1] address matching problems
described above using machine learning and NLP techniques.


1.1 Background

Over the years, notable research in the specific field of address matching has been
devoted to resolving address records into “matched” and “not matched”. In particular, there
has been advanced research into quantitative methods for determining the extent of
matching between pairs of text-based records, with numerous string-similarity measures
developed, including Levenshtein and Jaro-Winkler.

Other researchers developed the concept of ‘similarity join’, whereby every combination of
record pairs from two databases is tested against a similarity measure function, and the
pairs that exceed a preset threshold are recorded. They acknowledged that, despite the
availability of numerous similarity or ‘distance’ functions, no single measure excels in
every application [2].
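
The similarity join described above can be illustrated with a minimal sketch. It assumes a Levenshtein distance normalised by the length of the longer string as the similarity measure; the function names (`levenshtein`, `normalised_similarity`, `similarity_join`) and the threshold of 0.8 are illustrative choices, not taken from the cited works.

```python
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similarity_join(left: List[str], right: List[str], threshold: float,
                    measure: Callable[[str, str], float] = normalised_similarity
                    ) -> List[Tuple[str, str, float]]:
    """Test every record pair and keep those whose similarity reaches the threshold."""
    matches = []
    for x in left:
        for y in right:
            score = measure(x, y)
            if score >= threshold:
                matches.append((x, y, score))
    return matches


if __name__ == "__main__":
    inputs = ["29 CALE NAVA"]
    reference = ["29 CALLE NAVA", "CALLE EL HAMBRE"]
    for pair in similarity_join(inputs, reference, threshold=0.8):
        print(pair)
```

Because every pair is compared, the cost grows with the product of the two list sizes, which is the computational burden discussed later in this section.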

The ISTAC [1], in its georeferencing work [22], uses a technique based on record
linkage that consists of comparing normalised addresses with other records that have
already been georeferenced. It currently uses the R package “RecordLinkage” to evaluate
similarities and assign a weight indicating the similarity between the compared records.

The research reported in [4] and [2] shows that machine learning techniques can be used to
either enhance or replace the traditional rule-based solutions that are commonly applied to
address matching.
The first paper [4] introduces two particular innovations into the address matching
workflow: conditional random fields (CRFs) and word (address) embeddings.
The second paper [2] introduces a framework called Post Match, which combines the
open-source library “Libpostal” for address parsing with a post-parse process, the
Jaro-Winkler edit distance algorithm and XGBoost machine learning classification.

In both cases, a binary classification algorithm is applied repeatedly to decide whether or
not one address matches another. Applying it to each address against the whole reference
address pool makes the problem computationally very expensive.


1.2 Objectives

The main objectives of this project are:

➢ Study machine learning approaches in the field of address matching
➢ Study natural language processing: text analysis and word embeddings
➢ Implement different models for the numerical representation of text-based data and
introduce a metric to evaluate them
➢ Apply those models to ISTAC’s address datasets and carry out a comparative study
➢ Build a tool able to resolve address pairs into match and non-match using text
similarity

1.3 Scope

Because processing address components is computationally demanding, and given the limited
computing resources and time available for this work, we have to set some limitations.
We focus only on the addresses of the municipality of Santa Brígida, located in Gran Canaria,
for which we have a register of 3600 unique addresses (the SantaBrigidareferencia set) and
another register of 16024 records (the SantaBrigida set) that contains the unique addresses
together with variations of them. For example, in the collection of addresses below, number 1
belongs to SantaBrigidareferencia, and numbers 1, 2, 3 and 4 are in SantaBrigida, with 2, 3
and 4 being variations of 1.

1. CAMINO ACEQUIA TAFIRA 6 SANTA BRIGIDA
2. CAMINO ACEQUIA TAFIRA MADROÑAL 4 SANTA BRIGIDA
3. CALLE CAMINO ACEQUIA TAFIRA 4 SANTA BRIGIDA
4. CALLE ACEQUIA TAFIRA 15 SANTA BRIGIDA


Chapter 2 : State of the art

2.1 Address matching challenges

Addresses are among the most unstructured kinds of data, which makes address matching
a challenging downstream task. The issues most likely to arise when trying to match
addresses are the following:

➢ Input errors made by users

In many cases, addresses are entered incorrectly by users, with misspellings, missing
spaces, incorrect labels (“CALLE” vs “AVENIDA”), abbreviation formats (“C/” and “AV”),
synonyms, and more. All of these make it difficult to have standardised data within a single
database, let alone across multiple databases.

The following table shows the most common input errors:

Input error        Example
Misspelling        29 CALE NAVA
Missing space      29CALLE Nava
Incorrect label    29 AVENIDA Nava
Abbreviation       29 C/ Nava
Tokenization       Nava CALLE 29

Table 2.1 : Address matching most common input errors

While these errors may seem easy to notice at a glance, it is very challenging to program a
system to identify each difference. Doing so also requires significant computational
power and a lot of processing time. These errors can cause address matching to fail,
because records that refer to the same address will not match.


➢ Difficulty linking two datasets

For all the reasons discussed above, it can be difficult to relate two addresses and to
connect datasets. When this occurs, we end up with the following disparities between our
records:

Input address string                  Reference data to match against (AddressBase)
Unstructured text                     Structured, tokenized
Messy, containing typos, etc.         Complete and correct (more or less)
Incomplete                            Snapshot of addresses at a given time
Range from historic to very recent    Organisation/business names are not always part of
addresses, including businesses       the address. Changes due to the Historical Memory
                                      Law and other reasons

Table 2.2 : Input address vs reference data to match [5]

As we can see, matching addresses becomes complicated when we have to compare records
that are often formatted and entered differently, which makes a seemingly simple task much
harder than expected.
In real location-based businesses, these problems with linking datasets cause major issues
in the workflow, slowing down the business and causing errors in delivery, billing and
more [5].

➢ Data preprocessing failures

One of the most common problems in address matching is data preprocessing. In many
cases the data are not preprocessed correctly, yet this step is very important: having
clean data before processing is essential for getting quality results.

➢ Significant computational power required by processing algorithms

Automating the address matching process with programs still requires a large amount of
computational power and time. During the process, many comparisons and calculations are
made. When conventional techniques based on string similarity are used, each character
needs to be compared, and the characters are processed one at a time. The data often need
to be preprocessed beforehand as well.


2.2 Introduction to natural language processing

Natural language processing (NLP) is a subfield of Artificial Intelligence that uses
algorithms to interpret and manipulate human language. The goal is to make a computer
able to understand human language and content (text, documents, etc.) in the same
way humans can.
NLP can be used in many fields, such as speech recognition, knowledge representation and
text classification [6], [7].

2.2.1 Terminologies

Corpus
A corpus is a large, structured set of machine-readable texts produced in a natural
communicative setting. In practice, a corpus is a collection of documents: if our dataset
consists of a set of sentences, all of those sentences together form the corpus. In our case
study, the corpus is a set of addresses [7].
Documents
A document is a single text within the corpus. If we have 100 sentences, each sentence is a
document. The mathematical representation of a document is a vector [7]. In this project we
consider each address to be a document.
Vocabulary
The vocabulary is the collection of unique words in the corpus. Consider the following
example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
Vocabulary = { CALLE, NAVA, Y, GRIMÓN, EL, HAMBRE }
Words
All the words in the corpus, including repetitions. Taking the previous example:
sentence 1 = CALLE NAVA Y GRIMÓN
sentence 2 = CALLE EL HAMBRE
words = { CALLE, NAVA, Y, GRIMÓN, CALLE, EL, HAMBRE }
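
The difference between the words and the vocabulary of a corpus can be shown with a short sketch. It assumes simple whitespace tokenization, and the helper names `words` and `vocabulary` are illustrative.

```python
from typing import List

# Toy corpus: each address is one document.
corpus = ["CALLE NAVA Y GRIMÓN", "CALLE EL HAMBRE"]


def words(documents: List[str]) -> List[str]:
    """All word occurrences in the corpus, repetitions included."""
    return [token for document in documents for token in document.split()]


def vocabulary(documents: List[str]) -> List[str]:
    """Unique words of the corpus, in order of first appearance."""
    return list(dict.fromkeys(words(documents)))


if __name__ == "__main__":
    print(words(corpus))
    print(vocabulary(corpus))
```

Here the corpus contains seven words but only six vocabulary entries, because CALLE occurs in both addresses.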
N-gram
In the field of computational linguistics, an n-gram is a contiguous sequence of n items
from a given sample of text or speech [8]. For the address CALLE NAVA Y GRIMON, we
have:
1-gram set: CALLE, NAVA, Y, GRIMON
2-gram set: CALLE NAVA, NAVA Y, Y GRIMON

Char N-gram

Character n-grams are found in text documents by representing the document as a
sequence of characters. The n-grams are then extracted from this sequence and used as
features by a trained model [9]. For the address CALLE NAVA Y GRIMON, we have:
Char 1-gram set: C, A, L, L, E, N, A, V, A, Y, G, R, I, M, O, N
Char 2-gram set: CA, AL, LL, LE, EN, NA, …
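
Both kinds of n-gram can be extracted with a sliding window, as in the following sketch. It assumes whitespace tokenization for word n-grams and, following the example above, that spaces are dropped before character n-grams are taken; the function names are illustrative.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Contiguous sequences of n whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Contiguous sequences of n characters, with spaces removed first."""
    chars = text.replace(" ", "")
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


if __name__ == "__main__":
    address = "CALLE NAVA Y GRIMON"
    print(word_ngrams(address, 2))
    print(char_ngrams(address, 2))
```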

2.2.2 Text preprocessing in NLP

In natural language processing there are specific techniques for preprocessing and
understanding texts, and the choice depends on the problem being solved and on the type
of text. For example, when dealing with sentiment analysis of social media content, which
is usually treated as a classification problem, it is important to analyse emoticons and
emojis. The key steps of text processing in NLP are described below:

Preprocessing
➢ Removal of noise, URLs, hashtags and user mentions
➢ Lowercasing
➢ Replacing emoticons and emojis
➢ Replacing elongated characters
➢ Correction of spelling
➢ Removing punctuation
➢ …etc
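
Several of the steps listed above can be sketched with regular expressions from the Python standard library. This is a minimal illustration under stated assumptions: the patterns for URLs, mentions and hashtags are simplified, elongated letters are reduced to two repetitions, and emoji replacement and spelling correction are left out. The function name `preprocess` is hypothetical.

```python
import re
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> str:
    """Apply a few generic cleaning steps and return a normalised string."""
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # remove mentions and hashtags
    text = text.lower()                                  # lowercasing
    text = re.sub(r"([^\W\d_])\1{2,}", r"\1\1", text)    # "sooo" -> "soo" (letters only)
    text = text.translate(_PUNCTUATION_TO_SPACE)         # remove punctuation
    return " ".join(text.split())                        # collapse whitespace


if __name__ == "__main__":
    print(preprocess("Sooooo happy!!! @user #fun http://t.co/x"))
    print(preprocess("C/ Nava, 29"))
```

The elongation rule is restricted to letters so that digits, such as house numbers, are left untouched.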

Stemming
Stemming is a technique that removes suffixes and other affixes to obtain the root, base
or stem of a word. A corpus may contain related word forms with different spellings, such
as "having" and "have". These are similar in meaning, so stemming is used to convert them
to a common base form [7].

Lemmatization
Lemmatization is a technique similar to stemming. In stemming the resulting root may or
may not be a meaningful word, whereas in lemmatization the root is always a meaningful
word, because linguistic knowledge is used to transform words into their base forms [7].


Parsing
Parsing refers to the formal analysis of a sentence by a computer into its constituents,
which results in a parse tree showing their syntactic relation to one another in visual form
and which can be used for further processing and understanding [6].

2.2.3 Syntactic and semantic analysis

Syntactic analysis (syntax) and semantic analysis (semantics) are the two primary
techniques that lead to the understanding of natural language. A language is a set of valid
sentences, but what makes a sentence valid? Syntax and semantics.

Syntax is the grammatical structure of a text, whereas semantics is the meaning being
conveyed. A sentence that is syntactically correct, however, is not always semantically
correct [6].

2.3 Word and sentence embedding techniques

After processing the text data, the next step is to extract features. To achieve this, we
need techniques for representing text as vectors so that computers can process the
corpus easily. These are word and sentence embedding techniques.

2.3.1 Word embeddings

In natural language processing (NLP), word embedding is a term used for the
representation of words for text analysis, typically in the form of a real-valued vector that
encodes the meaning of the word, such that words that are closer in the vector space are
expected to be similar in meaning [10]. Word embeddings can be obtained using various
methods, which are described below.

2.3.1.1 One-Hot Encoding & Bag of words

One-Hot Encoding and Bag of Words are among the most straightforward ways to
represent words numerically.
For One-Hot Encoding, the idea is to create a vector whose size is the total number of
unique words in the corpus. Each unique word has its own feature and is represented
by a 1 in that position with 0s everywhere else. In the Bag of Words representation (also
called count vectorizing [11]), each word is represented by its count instead of 1 [12].
Tables 2.3 and 2.4 give a simple example of these concepts:


Word     Calle   Nava   …   …   …   Word n
Calle    1       0      0   0   …   0
Nava     0       1      0   0   …   0

Table 2.3 : One-Hot Encoding illustration

Address   Calle   Nava   y   Grimon   el   Hambre
1         1       1      1   1        0    0
2         1       0      0   0        1    1

Table 2.4 : Bag of words illustration
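
The two representations can be reproduced with a few lines of standard-library Python. This is a sketch assuming whitespace tokenization and a vocabulary ordered by first appearance; the helper names are illustrative.

```python
from collections import Counter
from typing import List

corpus = ["Calle Nava y Grimon", "Calle el Hambre"]


def build_vocabulary(documents: List[str]) -> List[str]:
    """Unique words in order of first appearance."""
    return list(dict.fromkeys(w for d in documents for w in d.split()))


def one_hot(word: str, vocab: List[str]) -> List[int]:
    """Vector with a 1 at the position of the word and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


def bag_of_words(document: str, vocab: List[str]) -> List[int]:
    """Vector holding the count of each vocabulary word in the document."""
    counts = Counter(document.split())
    return [counts[w] for w in vocab]


if __name__ == "__main__":
    vocab = build_vocabulary(corpus)
    print(vocab)
    print(one_hot("Nava", vocab))
    for address in corpus:
        print(bag_of_words(address, vocab))
```

With this vocabulary order, the bag-of-words vectors of the two addresses correspond to the two rows of the bag-of-words table.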

2.3.1.2 Term Frequency-Inverse Document Frequency : TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a
collection of documents. It is obtained by multiplying two metrics: how many times the word
appears in the document (TF) and the inverse document frequency (IDF) of the word across
the set of documents [9]. IDF is used to penalise very common words that provide little
semantic information, such as articles and prepositions.

The TF-IDF value of a term $t$ in a given document $d$ from a set of documents $D$ is:

$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$

where $\mathrm{TF}(t, d)$ is the frequency of the term within the document (in the example
below, its count divided by the number of words in the document) and

$\mathrm{IDF}(t, D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right)$

Here $|\{d \in D : t \in d\}|$ is the number of documents in the corpus that contain $t$, and
$|D|$ is the cardinality of the corpus.

Let us calculate TF-IDF("Calle", d, D) for d ∈ {d1, d2} in the following corpus:
Address 1 (d1) : Calle Nava y Grimon
Address 2 (d2) : Calle el Hambre
Corpus D = [Address 1, Address 2]

Document   TF     IDF         TF-IDF
d1         1/4    log(2/2)    1/4 × log(2/2) = 0
d2         1/3    log(2/2)    1/3 × log(2/2) = 0

Table 2.5 : Calculation of TF-IDF("Calle", d, D) for d ∈ {d1, d2}

Since "Calle" occurs in both addresses, its document count is 2 and its IDF is log(2/2) = 0:
a word present in every document carries no discriminating information and receives a
weight of zero.
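
The TF-IDF formula can be sketched as follows. The sketch assumes whitespace tokenization, a term frequency equal to the term count divided by the document length (which yields the fractions 1/4 and 1/3 for the two addresses), and the natural logarithm; the function names are illustrative.

```python
import math
from typing import List


def tf(term: str, document: str) -> float:
    """Relative frequency of the term within one document."""
    tokens = document.split()
    return tokens.count(term) / len(tokens)


def idf(term: str, corpus: List[str]) -> float:
    """log(|D| / number of documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    document_count = sum(1 for document in corpus if term in document.split())
    return math.log(len(corpus) / document_count)


def tf_idf(term: str, document: str, corpus: List[str]) -> float:
    return tf(term, document) * idf(term, corpus)


if __name__ == "__main__":
    d1 = "Calle Nava y Grimon"
    d2 = "Calle el Hambre"
    corpus = [d1, d2]
    for term in ("Calle", "Nava"):
        print(term, tf_idf(term, d1, corpus))
```

Under this definition, a term that occurs in every document of the corpus, such as "Calle" here, has a document count equal to |D| and therefore a weight of log(1) = 0, whereas "Nava", which occurs in only one address, gets a positive weight.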

2.3.1.3 Word2Vec

Word2vec is one of the most popular techniques for learning word embeddings with a
neural network. The neural network aims to predict the distribution of words given their
contexts in the corpus, $p(w \mid \text{context of } w)$, and simultaneously learns the
word representations. A single-layer neural network with a linear activation function is
used, and the context is represented by the sequence of previous words within the chosen
window size $n$:

$p(w_i \mid w_{i-n}, w_{i-n+1}, \ldots, w_{i-1})$

Words are represented in a continuous multidimensional vector space in which words with
similar contexts lie close to each other. The model takes the text corpus as input and
outputs a set of feature vectors that represent the words in that corpus. It uses two
neural network-based methods:
➢ Continuous Bag Of Words (CBOW)
➢ Skip-Gram

Figure 2.1 : CBOW & Skip-Gram model [13]


The CBOW model takes the context of each word as the input and tries to predict the
word corresponding to that context. Here, context simply means the surrounding words.

Skip-Gram uses the target word, that is, the word whose representation we want to
generate, to predict the context. In the process of predicting the context words, the model
learns the vector representation of the target word.
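
The difference between the two methods is easiest to see in the training examples each one produces. The sketch below only generates those examples and does not train a network. It assumes a symmetric context window around each word, and the function names `cbow_pairs` and `skipgram_pairs` are illustrative.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW examples: (surrounding context words, target word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-Gram examples: (target word, one context word to predict)."""
    return [(target, context_word)
            for context, target in cbow_pairs(tokens, window)
            for context_word in context]


if __name__ == "__main__":
    tokens = "CALLE NAVA Y GRIMON".split()
    print(cbow_pairs(tokens, window=1))
    print(skipgram_pairs(tokens, window=1))
```

CBOW produces one example per word with the whole context as input, whereas Skip-Gram produces one example per (target, context word) pair.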

Figure 2.2 : CBOW model with one word in the context [12]

Consider the following address: “Plaza de la paz”.

Suppose we use the word ‘plaza’ as the input to the neural network and try to predict the
word ‘paz’. We use the one-hot encoding of the input word ‘plaza’, then measure and
optimise the output error with respect to the target word ‘paz’. In the process of trying
to predict the target word, this shallow network learns its vector representation. In the
same way that the model uses a single word to predict the target, it can use multiple
context words, as in the architecture shown in Figure 2.3:


Figure 2.3 : CBOW model with multiple words in the context [12]

Figure 2.4 : Skip-Gram model using target words [12]


The choice between CBOW and Skip-Gram when training a word2vec model depends on
the problem to be solved. CBOW is better at learning syntactic relationships between
words, while Skip-Gram is better at capturing semantic relationships. Skip-Gram works
better with small amounts of data, focuses on the semantic similarity of words and
represents rare words well. CBOW, on the other hand, is faster, focuses more on the
morphological similarity of words and needs more data to achieve similar performance.

2.3.1.4 FastText

FastText is a model proposed by Facebook AI Research (FAIR) for learning word
embeddings and text classification. It supports both unsupervised and supervised
learning for obtaining vector representations of words, and both CBOW and Skip-Gram
models.

FastText tries to take the morphological structure of words into account, because this
structure carries important information about meaning and is ignored by traditional
word embeddings such as word2vec, which train a separate embedding for every individual
word. FastText addresses this by treating each word as the aggregation of its
subwords. For the sake of simplicity and language independence, subwords are taken to be
the character n-grams of the word. The vector for a word is simply the sum of the
vectors of its component character n-grams [14]. For example, the fastText representation
of the word “CALLE” when using 3-grams corresponds to the collection of trigrams of the
string <CALLE>: <CA, CAL, ALL, LLE, LE>.

The algorithm always adds "<" at the start and ">" at the end of each word. This
representation helps to extract morphological information from the words, such as suffixes
and prefixes. A Skip-Gram model is then trained on the generated n-grams to create the word
representations [15].
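
The subword decomposition can be sketched in a few lines. This is a simplified illustration and not the fastText implementation: it uses a single n-gram length, whereas the library works with a range of lengths and hashes the n-grams into buckets. The n-gram vectors in the usage example are made-up toy values, and the function names are illustrative.

```python
from typing import Dict, List


def subword_ngrams(word: str, n: int = 3) -> List[str]:
    """Character n-grams of the word wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def word_vector(word: str, ngram_vectors: Dict[str, List[float]], n: int = 3) -> List[float]:
    """Sum the vectors of the n-grams of the word that are present in the table."""
    dimension = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dimension
    for gram in subword_ngrams(word, n):
        vector = ngram_vectors.get(gram)
        if vector is not None:
            total = [a + b for a, b in zip(total, vector)]
    return total


if __name__ == "__main__":
    print(subword_ngrams("CALLE"))
    toy_vectors = {"<CA": [1.0, 0.0], "LE>": [0.0, 2.0]}  # hypothetical values
    print(word_vector("CALLE", toy_vectors))
```

Because a vector is built from n-grams, a misspelt word such as "CALE" still shares the n-grams <CA and LE> with "CALLE", which is why subword models can be attractive for noisy addresses.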


2.3.2 Sentence embeddings

So far we have discussed how word embeddings represent the meaning of the words in a
text document. Sometimes, however, we need to go a step further and encode the meaning
of the whole sentence in order to capture the context in which the words are used.
A straightforward approach for creating sentence embeddings is to use a word embedding
model to encode all the words of the given sentence and take the average of the word
vectors. While this provides a strong baseline, it falls short of capturing information
related to word order and other aspects of overall sentence semantics.
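
The averaging baseline, and its insensitivity to word order, can be illustrated with a short sketch. The two-dimensional word vectors below are made-up toy values rather than the output of a trained model, and the function names are illustrative. Cosine similarity is used to compare the resulting sentence vectors.

```python
import math
from typing import Dict, List


def average_embedding(tokens: List[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Sentence vector as the mean of the vectors of the known words."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no token of the sentence has a word vector")
    dimension = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dimension)]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


if __name__ == "__main__":
    toy = {"calle": [1.0, 0.0], "nava": [0.0, 1.0], "hambre": [1.0, 1.0]}  # hypothetical
    a = average_embedding(["calle", "nava"], toy)
    b = average_embedding(["nava", "calle"], toy)
    print(a, b, cosine_similarity(a, b))
```

Both word orders yield the same sentence vector, which is exactly the limitation noted above.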

2.3.2.1 Doc2vec
Doc2vec is a model for creating a numerical representation of a document. It extends the
idea of word2vec and, like word2vec, has two variants:

➢ Distributed Memory (DM) model

Figure 2.5 : Doc2vec Distributed Memory model [16]

Each sentence and each word of the training corpus is one-hot encoded and stored in
matrices D and W, respectively. Training involves passing a sliding window over
the sentence and trying to predict the next word from the previous words and the sentence
vector (the Paragraph Matrix in the figure above). The next word is predicted by
concatenating the sentence and word vectors and passing the result into a softmax layer.


The sentence vectors change from sentence to sentence, while the word vectors are shared
across sentences. Both are updated during training.
Inference uses the same sliding-window approach, except that all the vectors of the model
are fixed apart from the sentence vector. After all the next-word predictions have been
computed for a sentence, the resulting sentence vector is the sentence embedding [12].

➢ Distributed Bag of Words (DBOW) model

Figure 2.6 : Doc2vec distributed bag of words model [16]

The DBOW model ignores word order and has a simpler architecture. Each sentence in
the training corpus is converted into a one-hot representation. During training, a random
sentence is selected from the corpus and a random number of words is sampled from it.
The model tries to predict these words using only the sentence ID, and the sentence vector
is updated (Paragraph ID and Paragraph Matrix in the figure). During inference, a new
sentence ID is trained with random words from the sentence. The sentence vector is
updated at each step, and the resulting sentence vector is the embedding for that sentence
[12].
Comparing the two doc2vec models, the authors of the original paper [16] state that the
DM model “is consistently better than” DBOW. However, other studies [17] have shown that
the DBOW approach performs better on more tasks. It should also be noted that the DM
model takes word order into account whereas the DBOW model does not, and that the DBOW
model does not use word vectors, so the semantics of the words are not preserved.


2.3.2.2 BERT : Bidirectional Encoder Representations from Transformers

BERT is a Transformer-based language representation model developed by Google. It is
designed to pre-train deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context in all layers [18].

BERT provides a way to pre-train models that consider the context both to the right and to
the left of each word using the Masked LM (MLM) technique. Instead of using only the words
before or after a target word, MLM uses the entire sequence, from which a percentage of the
words is removed (masked) and must be predicted. The algorithm works on pairs of sentences:
in addition to predicting the masked words, BERT performs next sentence prediction. This
part of the algorithm predicts whether the second sentence is the one that follows the first
in the original text.
The algorithm embeds special tokens to indicate the start and end of segments, the
separation between sentences, the masked words, etc., as can be seen in the example:
[CLS] the [MASK] has blue spots [SEP] it rolls [MASK] the parking lot [SEP] [19]

Figure 2.7 : BERT mask LM [19]
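The input format described above can be illustrated with a short sketch. It is a simplification and not BERT's actual preprocessing: the function name `build_mlm_input` and its parameters are hypothetical, tokens are obtained by whitespace splitting rather than WordPiece, and every selected token is replaced by [MASK] (the original paper [18] masks 15% of the tokens and replaces only 80% of those with [MASK]).

```python
import random

def build_mlm_input(sentence_a, sentence_b, mask_prob=0.15, seed=0):
    """Format a sentence pair BERT-style and mask a fraction of the word tokens."""
    rng = random.Random(seed)
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    # Positions of ordinary words; the special tokens are never masked.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    n_mask = min(len(candidates), max(1, round(mask_prob * len(candidates))))
    masked_positions = sorted(rng.sample(candidates, n_mask))
    # The labels are the original words that the model has to recover.
    labels = {i: tokens[i] for i in masked_positions}
    for i in masked_positions:
        tokens[i] = "[MASK]"
    return tokens, labels

tokens, labels = build_mlm_input("the dog has blue spots", "it rolls in the parking lot")
print(" ".join(tokens))
print(labels)
```

The returned `labels` dictionary maps each masked position to the word that was hidden there, which is what the masked-word prediction is trained against.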


The BERT framework has two steps: pre-training and fine-tuning.
During pre-training the model is trained on unlabeled data over different pre-training tasks.
For fine-tuning, the BERT model is first initialised with the pre-trained parameters, and all of
the parameters are fine-tuned using labelled data from the downstream tasks. Each
downstream task has a separate fine-tuned model, even though they are all initialised with
the same pre-trained parameters [18].

Figure 2.8 : pre-training and fine-tuning procedures for BERT [18]

This is only an overview of BERT; the complete procedure is described in the original
paper [18]. For our purposes, the key point is that BERT is one of the best general
language models and produces good results on sentence embeddings.

2.4 Text Similarity and measures

Similarity is the distance between two vectors where the vector dimensions represent
the features of two objects. In simple terms, similarity is the measure of how different or
alike two data objects are. If the distance is small, the objects are said to have a high degree
of similarity and vice versa. Generally, it is measured in the range 0 to 1. This score in the
range of [0, 1] is called the similarity score [12].


Likewise, text similarity measures how different or alike two texts or sentences are. For
humans it is often obvious that two sentences mean the same thing despite being written in
completely different forms. For an algorithm to reach the same conclusion, we first have to
solve the problem of text representation by converting the text into feature vectors using
one of the text embedding techniques described above. Once we have the text representation,
we can compute the similarity score using one of the many distance/similarity measures [12].
The most common text similarity measures are described below.

Jaccard Index

Jaccard index, also known as Jaccard similarity coefficient, treats the data objects like
sets. It is defined as the size of the intersection of two sets divided by the size of the union.

For the two sets A and B shown in figure 2.9:

Figure 2.9 : Jaccard distance on two sets [20]

the Jaccard index is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
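As an illustration, the Jaccard index can be computed on the sets of words of two addresses. This is a minimal sketch: the function name `jaccard_index`, the whitespace tokenisation and the second address variant are assumptions made for the example.

```python
def jaccard_index(text_a, text_b):
    """Jaccard index between the sets of lowercase tokens of two texts."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:  # both texts are empty: treat them as identical
        return 1.0
    return len(set_a & set_b) / len(union)

# Three shared words out of four distinct words in total.
print(jaccard_index("CAMINO TEJAR 28", "CAMINO DEL TEJAR 28"))
```

Because the measure works on sets, it ignores both word order and repeated words.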


Euclidean Distance

Euclidean distance, or L2 norm, uses the Pythagorean theorem to calculate the distance
between two points, as indicated in figure 2.10. Generally speaking, when people talk
about distance, they refer to Euclidean distance.

Figure 2.10 : Euclidean distance representation [12]

The larger the distance d between two vectors, the lower the similarity score, and vice
versa. Since distances can range from 0 to infinity, we need some way to normalise them
to the range 0 to 1.
The typical normalisation formula that uses the mean and standard deviation is sensitive
to outliers: if there are a few extremely large distances, every other distance becomes
smaller as a consequence of the normalisation operation. A better option here is to use
Euler's number $e$ and take the following expression as the similarity score [12]:

$$\frac{1}{e^{d}}$$
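The normalisation above can be sketched in a few lines. The function names are chosen for the example; the mapping $1/e^{d}$ sends a distance of 0 to a similarity of 1 and large distances to values close to 0.

```python
import math

def euclidean_distance(v, w):
    """L2 distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity score in (0, 1]."""
    return 1.0 / math.exp(d)

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d, distance_to_similarity(d))
```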

Levenshtein distance

The Levenshtein distance is a string metric for measuring the difference between two
sequences. Informally, the Levenshtein distance between two words is the minimum number
of single-character edits (i.e. insertions, deletions or substitutions) required to change one
word into the other [21].


Mathematically, the Levenshtein distance between two strings a, b (of length |a| and |b|
respectively) is given by the formula below:

Figure 2.11 Levenshtein distance formula [21]
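A common way to evaluate this recursive definition is dynamic programming, keeping only two rows of the distance table. The sketch below is one such implementation and is not taken from the project code; the function name is chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```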

Cosine Similarity

Cosine similarity computes the similarity of two vectors as the cosine of the angle between
them. It determines whether two vectors are pointing in roughly the same direction.
So if the angle between the vectors is 0 degrees, then the cosine similarity is 1 [12].

Figure 2.12 : θ angle of vectors (𝑣, 𝑤) [12]

It is given by:

$$\cos(v, w) = \frac{v \cdot w}{\|v\| \times \|w\|}$$

where $\|v\|$ represents the length of the vector $v$, $\|w\|$ represents the length of the
vector $w$ and ‘$\cdot$’ denotes the dot product operator.
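The formula translates directly into code. The following is a minimal pure-Python sketch; returning 0 for a zero-length vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # convention for zero vectors, where the angle is undefined
    return dot / (norm_v * norm_w)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Since only the direction matters, two vectors that differ only in magnitude obtain the maximum score.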


Chapter 3 : Methodology

In order to achieve the objectives set for this project, it is necessary to find the
right methodology.
Several meetings were held with the tutor and the co-tutor. First, the topic was explained;
second, we defined the process and the tasks needed to produce results. During the
development of the project we held frequent meetings to check the progress of the tasks,
raise doubts and verify that the steps taken were the right ones. The main idea is that at
each review the project should show some evolution with respect to the previous check,
which is in line with the Scrum planning model.

3.1 Process definition

In the articles studied ([2], [4]), machine learning techniques are applied to two
sub-tasks: address normalisation and address classification into matched and not matched.
For each address to be georeferenced (or matched), they generate one classification problem
for each of the addresses that serve as a reference.

During our investigation of machine learning opportunities in the address matching field,
many approaches were tested, but the most promising one remains the use of word or
sentence embeddings coupled with a text similarity measure.

Our proposal in this work consists of determining the similarity of each address with the
reference addresses through the embeddings generated using the different algorithms
described above. We consider that two addresses match when their similarity in the
representation space exceeds a threshold.

Our approach is to measure the distances between addresses, but by using language
models, complex relationships between words, such as semantics and morphology, are
considered, and not only similarities at the character level.
However, there are different word embedding techniques, so our process consists of first
studying each one, then implementing them using the address dataset and finally making a
practical comparison.
In addition, we define a performance evaluation procedure similar to those applied in
machine learning classification: we build a confusion matrix and evaluate the accuracy,
precision and recall metrics.
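For reference, the three metrics follow the standard definitions below, where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives in the confusion matrix. Here a "positive" is assumed to be a pair of addresses classified as a match.

```latex
\begin{aligned}
\text{accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall}    &= \frac{TP}{TP + FN}
\end{aligned}
```

Precision falls when many non-matching pairs are classified as matches, whereas recall falls when true matches are missed.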


Figure 3.1 shows the key steps of our work process.

Figure 3.1 : Project process

As shown in the figure, we first generate address embeddings with each model
(word2vec, fastText, doc2vec and BERT), and then classify the addresses as match or no
match based on their similarity. Finally, we evaluate the performance of each model in
order to compare them.


Similarity

After getting the vector representation (embedding) of each address, we use the
cosine similarity as a measure of the similarity between addresses.

Classification

To classify addresses as matched or not matched, we compare the result of the
similarity calculation to a fixed threshold value.

Performance Evaluation

Our initial dataset provides the matching status of the addresses through an identification
number (uuid_idt), so a classical method would be to apply a machine learning
classification algorithm and measure its performance. However, the lack of variation in our
dataset led us to perform a manual evaluation instead, calculating true positives, false
positives, true negatives and false negatives from the classification results and the known
status of the addresses.

3.2 Implementation planning

In order to correctly implement the process defined for this project, a plan was drawn
up. Table 3.1 details the tasks needed to implement the described process end to end.


Task                                  Start date    End date

Data set creation                     10/06/2022    14/06/2022
Implement word2vec models             15/06/2022    17/06/2022
Performance and operation analysis    18/06/2022    20/06/2022
Improvements                          21/06/2022    26/06/2022
Implement fastText models             27/06/2022    30/06/2022
Performance and operation analysis    01/07/2022    03/07/2022
Implement doc2vec model               04/07/2022    09/07/2022
Performance and operation analysis    10/07/2022    15/07/2022
Improvements                          16/07/2022    20/07/2022
Implement BERT model                  21/07/2022    31/07/2022
Performance and operation analysis    01/08/2022    08/08/2022
Improvements                          09/08/2022    14/08/2022
Comparison study                      15/08/2022    28/08/2022

Table 3.1 : Implementation planning


Chapter 4 : Development
4.1 Dataset Creation

The ISTAC [1] provides a csv file with the normalised and georeferenced addresses that
currently appear in their Integrated Data System.
The variables included in the file correspond to the different elements that make up an
address, as well as an identification code (uuid_idt) shared by all the addresses for which the
matching has been positive according to their algorithms. Among the variables we find
territory codes (“códigos de territorio”), the normalised and unnormalised type of road
(“tipo de vía”), the normalised and unnormalised road name (“nombre de vía”), the road code
(“código de la vía”), and the normalised and unnormalised portal number.
The file also contains information about the technique used to generate the matching and a
categorical variable with the values AVERAGE, HIGH or VERY_HIGH, which indicates the quality
of the link.
The dataset covers all the municipalities of the Canary Islands, but in our case we extract
the part corresponding to one municipality, Santa Brígida. We train our models on this
dataset of 16024 addresses in order to build word embeddings.

Figure 4.1 : Dataset registers

From this dataset we select the relevant columns that we will need in the rest of the work.

Figure 4.2 : Selecting the columns forming an address in “Santabrigida”


Column       Description

uuid_idt     Identifier shared by the addresses that match one another
tvia         Type of road
nvia         Normalised road name
numer        Normalised portal number
codmun       Municipality code
nommun       Municipality name
direccion    Union of the fields tvia + nvia + numer + nommun, when all of them are available

Table 4.1 : Column descriptions (translated from Spanish) [22]

Create dataset: once we have a good sample of our dataset, we can export it in csv format
to evaluate the performance of our models.

Figure 4.3 : Dataset creation

Finally, we have the dataset “Santabrigida” (16024 addresses) that we will use to train
our word and sentence embedding models.
To evaluate the performance of the models and compare them, we will use the dataset
“muestra”, which is a fraction of the “Santabrigida” dataset.
We make this reduction of the data because of limited computing resources: the performance
evaluation outlined in chapter 3 is very costly.


4.2 Libraries: Gensim, NLTK and Sentence Transformers

In addition to the basic libraries for data analysis, we used some specialised libraries
during the project, each with a specific role:

Gensim
Gensim is an open source Python library for topic modelling, able to train large-scale
semantic NLP models, represent text as vectors and find related documents.
Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that
supports Python 3.6+ and NumPy [23].
It can be installed by running this command: pip install gensim

In this project we use Gensim to train word2vec, fastText and doc2vec models in order to
generate vector representations of addresses and calculate their similarity.

The figure below shows the basic syntax for importing and training Gensim models.

Figure 4.4 : Gensim training models example

NLTK : Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language
data. It provides a suite of text processing utilities for classification, tokenization,
stemming, tagging, parsing, etc.
In this project we use NLTK to preprocess our address dataset, and in particular to tokenize
the data before passing it to the models.

Figure 4.5 shows the basic syntax for tokenizing addresses with NLTK.


Figure 4.5 : NLTK tokenization example usage

Sentence Transformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image
embeddings. It can be used to compute sentence / text embeddings for more than 100
languages. These embeddings can then be compared, e.g. with cosine similarity, to find
sentences with a similar meaning. This can be useful for semantic textual similarity,
semantic search or paraphrase mining [24].

Sentence Transformers can be installed by running this command:

pip install -U sentence-transformers

Python 3.6 or higher and at least PyTorch 1.3.6 are recommended, bearing in mind that
sentence-transformers is built on PyTorch and transformers.

In this project we use Sentence Transformers to implement the BERT model. Figure 4.6
shows an example of using sentence-transformers to load a BERT model.

Figure 4.6 : Example Usage Sentence-Transformers


4.2 Implementation

4.2.1 Data description and preprocessing

The dataset creation process above produces the data on which we build our word and
sentence embedding models and evaluate their performance.

Figure 4.7 : Created dataset

Before starting to work with the data, let’s check for missing values and describe the
data through graphics and statistics.

Figure 4.8 : missing values and registers numbers


As we can see, the dataset has no missing values, so we do not need to apply any
technique to fill them in. Our next step is to take a look at the variable nvia
(“nombre de vía”).

Figure 4.9 shows the word count of nvia in each address row.

Figure 4.9 : nvia word count

Once we have the number of words of each nvia, we can plot the distribution of the word counts.

Figure 4.10 : Distribution of the word count of the nvia variable


In the last part of the data description we show the most used words of the variables nvia
and tvia in the addresses. The more often a word appears, the larger it is drawn, and
vice versa.

Figure 4.11 : Most commonly used word on addresses (nvia)


Figure 4.12 : Most commonly used word on addresses (tvia)
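The descriptive statistics above (number of words per road name and most frequent words) can be obtained with the standard library alone. This is a sketch under assumptions: the function name `word_statistics` is hypothetical and the road names are invented for illustration, not taken from the ISTAC dataset.

```python
from collections import Counter

def word_statistics(names):
    """Return (number of words per name, overall word frequencies)."""
    tokens_per_name = [name.split() for name in names]
    words_per_name = [len(tokens) for tokens in tokens_per_name]
    frequencies = Counter(word for tokens in tokens_per_name for word in tokens)
    return words_per_name, frequencies

# Hypothetical road names, for illustration only.
nvia_sample = ["TEJAR", "REAL DE COELLO", "LOMO DE LA CRUZ", "REAL"]
counts, freq = word_statistics(nvia_sample)
print(counts)
print(freq.most_common(2))
```

The resulting frequencies are the kind of input from which a word cloud like the ones in figures 4.11 and 4.12 is drawn.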

Once we understand the data, the next step is preprocessing. First, we format the addresses.
In principle an address is a concatenation of the variables tvia, nvia, numer, codmun and
nommun, but we have to remember that we are working with data from a single municipality
(Santa Brígida). This means that all our addresses have the same value for the variables
codmun (35021) and nommun (Santa Brígida). We therefore remove them from the address in
order to keep only its distinctive part.
So, we create an address column without the two variables cited above.


Figure 4.13 : Creation of new address column

The address data is now almost ready for training, but models like word2vec and fastText
expect their input as lists of tokens, which is why we tokenize the data before the
training phase.

Figure 4.14 : Tokenization of addresses
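The two preprocessing steps (building the address without the municipality fields, then tokenizing it) can be sketched as follows. This is not the project code: records are represented as plain dictionaries instead of a DataFrame, the function names are hypothetical, the split of the example address into fields is assumed, and a whitespace split is used as a dependency-free stand-in for the NLTK tokenizer.

```python
def build_address(record):
    """Join road type, road name and portal number, skipping empty fields."""
    parts = (record.get("tvia"), record.get("nvia"), record.get("numer"))
    return " ".join(str(p).strip() for p in parts if p not in (None, ""))

def tokenize(address):
    # Whitespace split used here instead of an NLTK tokenizer (assumption).
    return address.upper().split()

# Example record; codmun and nommun are deliberately left out of the address.
records = [
    {"tvia": "CAMINO", "nvia": "TEJAR", "numer": 28,
     "codmun": 35021, "nommun": "Santa Brígida"},
]
addresses = [build_address(r) for r in records]
data = [tokenize(a) for a in addresses]
print(addresses)
print(data)
```

The resulting `data` is a list of token lists, one per address.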

4.2.2 Modelling

In NLP, instead of always training your own model, it is sometimes recommended to use
pre-trained models. The advantage of these models is that they have been trained on a
larger corpus, so they are more mature. They can be found in different public
repositories or research publications. In this project, in addition to the models we train
ourselves, we use pre-trained word2vec, fastText and BERT models for the Spanish language.

As already mentioned, in this project we work with the word2vec, fastText, doc2vec and BERT
models. For word2vec and fastText, we first train our own model and then load a
pre-trained one. For doc2vec we also train our own model, whereas for BERT we
load a pre-trained Spanish model.


➢ word2vec
Train model

A word2vec model has a set of parameters that affect both the training speed and the
quality of the result. During the training phase we adjusted the parameters several times
in order to obtain a good model. The figure below shows how the model is trained.

Figure 4.15 : Training word2vec model

It is important to notice that word2vec sets some parameters by default; in our case we
use the CBOW variant, which is the default in Gensim's Word2Vec. The model receives the
dataset as its first parameter; here it is data, the list of tokenized addresses. The
min_count parameter makes the model ignore words that appear fewer than the given number
of times in the corpus. Its default value is 5, which can pose a problem in our case
because rare words would be dropped, so we set it to the minimum value, 1. The size
parameter (named vector_size in Gensim 4) determines the number of dimensions (N) of the
N-dimensional space onto which Gensim's Word2Vec maps the words; we chose a
100-dimensional space. The workers parameter determines the number of worker threads used
for training. Values greater than 1 only take effect when Cython is installed; without it,
training runs on a single core.

Load pretrained Model


To use a pre-trained word2vec model, we only need to download the corresponding model.
In most cases they are distributed in vec or bin format, and many pre-trained models are
published by the community or by IT companies such as Facebook and Google.
In this project we use a pre-trained word2vec model for the Spanish language [25]. The
figure below shows the code that loads this model.

Figure 4.16 : Loading word2vec pre-trained model


➢ FastText

Train Model

Like word2vec, the fastText model has a set of parameters that affect both the training
speed and the quality. We train the fastText model in the same way as word2vec.
The model receives the dataset as its first parameter; here it is data, the list of
tokenized addresses. The min_count parameter makes the model ignore words that appear
fewer than the given number of times in the corpus.
The size parameter determines the number of dimensions (N) of the N-dimensional space onto
which Gensim's FastText maps the words; we chose a 100-dimensional space. The workers
parameter determines the number of worker threads used for training.

Figure 4.17 : Training FastText model

Load Model

For the pre-trained fastText model [26], the loading procedure depends on the format in
which the model was downloaded (vec or bin). In our case we downloaded a pre-trained
model for the Spanish language in vec format, because the bin format requires more complex
transformations.

Figure 4.18 : Loading FastText pre-trained model


➢ doc2vec

Training a doc2vec model is similar to training word2vec, but with some additional steps,
described below.

Model initialization

Figure 4.19 : Initialise doc2vec model

The parameters vector_size and min_count have the same meaning as in word2vec and
fastText, but here we use the additional parameter epochs to set the number of iterations
over the corpus to 10.

Tagged documents

Figure 4.20 : tagged document

Unlike word2vec, where the model directly receives the addresses (documents) as a list of
addresses (the addr variable in figure 4.20), doc2vec requires a tag to be added to each
address (document) before it is passed to the model for vocabulary building and training.
The vocabulary is simply the list of all the unique words extracted from the training corpus.

Training model

Figure 4.21 : Training doc2vec model


➢ BERT

In this project we will use a BERT model for Spanish language [27]. Below we have the
steps to follow :

Install and import Transformers

Figure 4.22 : Installing and loading Transformer

Load the model

Figure 4.23 : Loading BERT model


4.2.3 Vectorization and similarity calculation

Once we have the trained and pre-trained models, we can vectorize each address in the
dataset and calculate the similarities in order to classify them as matched or not matched.
With word2vec and fastText we cannot directly obtain the vector representation of a whole
address: we have one vector per word, so we average the word vectors.
The function below receives a list of addresses and a model and generates the
corresponding vector representation of each address.

Figure 4.24 : Function for generating a vector per document (address)

Figure 4.25 : Applying the function with the trained word2vec model
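The averaging step can be sketched as below. This is not the function shown in figure 4.24: the names `average_vector` and `vectorize_addresses` are hypothetical, and the "model" is any mapping from word to vector that supports `in` and `[]`. A plain dictionary with made-up 2-dimensional vectors is used here; the word vectors of a trained Gensim model are assumed to offer the same kind of access. Returning a zero vector when no word is known is also a choice made for the example.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the components of all vectors dimension by dimension.
    return [sum(component) / len(known) for component in zip(*known)]

def vectorize_addresses(tokenized_addresses, word_vectors, dim):
    """One averaged vector per tokenized address."""
    return [average_vector(tokens, word_vectors, dim) for tokens in tokenized_addresses]

# Toy vectors standing in for a trained model (hypothetical values).
toy_vectors = {"CAMINO": [1.0, 0.0], "TEJAR": [0.0, 1.0], "28": [1.0, 1.0]}
vectors = vectorize_addresses([["CAMINO", "TEJAR", "28"], ["CALLE"]], toy_vectors, dim=2)
print(vectors)
```

Words missing from the vocabulary are skipped, so an address is still represented as long as at least one of its words is known.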


We can see the corresponding vector of the address : CAMINO TEJAR 28

Figure 4.26 : Vector representation of an address with word2vec

With doc2vec and BERT we directly generate the vector representation of the whole
address.

To measure the similarity between addresses, we compute the cosine similarity between
their vectors. In addition, we use a heatmap to visualise the similarities.
Beforehand, we define two functions, one for the similarity calculations and one for the
heatmap creation.

Figure 4.27 : Function for similarity calculations


Figure 4.28 : Function for heatmap creation

Figure 4.29 represents the heatmap of the similarity between addresses using our
trained word2vec model.
The graphics for each model are available in appendix 6.1.

Figure 4.29 : Heatmap of similarity using the trained word2vec model


4.3 Results
In order to compare the models we define, as discussed in chapter 3, a performance
evaluation procedure similar to those applied to machine learning classification
algorithms. A confusion matrix is built and the accuracy, precision and recall metrics are
evaluated.
The function in figure 4.30 shows the different steps for creating the confusion matrix.

Figure 4.30 : Function for creating the confusion matrix

The function receives as parameters the address vectors and the list of address uuids.
It calculates the cosine similarity between the addresses pair by pair and compares the
result with the fixed threshold (0.9). The threshold was fixed at this value after testing
the performance of the models with several values (0.7, 0.8 and 0.9).
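The evaluation procedure can be sketched as follows. This is an illustration and not the function of figure 4.30: the names `confusion_counts` and `metrics` are hypothetical, all unordered pairs of addresses are compared, and two addresses are assumed to be a true match when they share the same uuid.

```python
import math
from itertools import combinations

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norms = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norms if norms else 0.0

def confusion_counts(vectors, uuids, threshold=0.9):
    """Count TP, FP, TN and FN over all unordered pairs of addresses."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for i, j in combinations(range(len(vectors)), 2):
        predicted = cosine(vectors[i], vectors[j]) > threshold
        actual = uuids[i] == uuids[j]  # same uuid: known match
        if predicted and actual:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def metrics(c):
    """Accuracy, precision and recall, with 0.0 when a denominator is zero."""
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        "precision": c["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": c["tp"] / actual_pos if actual_pos else 0.0,
    }

# Toy vectors and uuids, for illustration only.
toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.1]]
print(metrics(confusion_counts(toy, ["a", "a", "b", "b"])))
```

The number of comparisons grows quadratically with the number of addresses, which is why this kind of evaluation becomes costly on large datasets.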


Table 4.2 summarises the results:

Model                   Accuracy    Precision    Recall

Trained word2vec        0.675       0.023        0.93
Pre-trained word2vec    0.789       0.034        0.92
Trained FastText        0.552       0.017        0.94
Pre-trained FastText    0.976       0.233        0.848
Trained doc2vec         0.996       0.694        0.848
BERT                    0.997       0.80         0.848

Table 4.2 : Summary of results


In general, the models trained on the address dataset present acceptable performance.
Among the pre-trained models, only BERT gives acceptable results. The precision obtained
with word2vec and fastText is very low, while their recall is high. This means that these
models classify as “match” many addresses that do not match (low precision), although they
do detect most of the addresses that really match (high recall). Doc2vec, on the other
hand, outperforms the two previous models, and finally the results obtained with BERT
improve the performance further, reaching promising values.

Doc2vec and BERT are the best performing models. The trained word2vec and fastText models
present almost the same results, while their pre-trained counterparts, despite a higher
accuracy, also fail to reach a useful precision.


Chapter 5 : Conclusions and future development


5.1 Conclusions

In this project we explore the use of machine learning techniques in the field of address
matching. In particular, since addresses come in text format, we introduce natural language
processing approaches and generate numerical representations for each pair of addresses.

In order to generate numerical representations of addresses we study several word and
sentence embedding models: word2vec, fastText, doc2vec and BERT.
First, we train these models with real address datasets from ISTAC [1]; second, we use
models for the Spanish language pre-trained by other communities on a large corpus of data.

We introduce the cosine similarity as a metric for resolving address records into match
and not match, and finally evaluate the performance.

The studies carried out during this project show great potential for the use of machine
learning and NLP in the field of address matching, but it is really important during
implementation to use accurate data and choose the right models.

The results obtained lead us to the conclusion that solving the address matching problem
through the similarity of the vectors generated by language models is promising.
They also reveal the need for models built from large numbers of documents: in our
tests the best guarantees are offered by the Spanish BERT model used, but the results also
suggest that generating a doc2vec model with a much larger volume of addresses can lead to
good system performance.


5.2 Future developments

➢ This study has been restricted to the municipality of Santa Brígida, but in future
work, with more resources available, we plan to extend it to all the municipalities of
the Canary Islands in order to confirm that our method is generic.

➢ Just as we compared the performance of the different models using cosine
similarity, a comparative study of different similarity metrics, such as Euclidean
distance, Levenshtein distance, etc., should be carried out.

➢ Given a dataset with a large pool of variations for each address, each pool of
addresses could be treated as a machine learning classification problem.
In that case, we would use machine learning classification algorithms such as XGBoost,
random forest, etc.

➢ Generate language models with the data for all the Canary Islands addresses in the
Canary Islands Integrated Data System.

➢ Explore algorithms to improve the efficiency of comparing the similarity of all
addresses, in order to extend the results to datasets that include a significant volume
of addresses, for example for the whole of the Canary Islands.


Chapter 6 : Appendices

6.1 Dataset creation source code

The following Colab notebook contains the Python code used to create the dataset that is
used in the project.
Google Colab Link

6.2 Project source code

Below is a link to the project on Google Colaboratory, where you can view and test the
Python code that has been shown throughout this project.

Google Colab Link

6.3 Data sources

The following link points to a shared Google Drive folder that contains all the data files
in csv format used for this project. It includes the “Santabrigida.csv” file used for
creating the base dataset “muestra.csv”.

Google Drive Link

6.4 Pre-trained models

The following link points to a shared Google Drive folder that contains all the pre-trained
models used in this project.

Google Drive Link


Bibliography

[1] ISTAC (Instituto Canario de Estadística) official website
http://www.gobiernodecanarias.org/istac/

[2] PostMatch: A Framework for Efficient Address Matching. In: Y. Xu et al. (Eds.): AusDM 2021, CCIS 1504, pp. 136-151. Springer Nature Singapore Pte Ltd., 2021.
https://doi.org/10.1007/978-981-16-8531-6_10

[3] ISTAC Sistema-georreferenciación
https://jecas.es/wp-content/uploads/2021/11/21.4.ISTAC_Sistema-georreferenciacion.pdf

[4] Comber S., Arribas-Bel D. Machine learning innovations in address matching: A practical comparison of word2vec and CRFs. Transactions in GIS. 2019;23:334-348.
https://doi.org/10.1111/tgis.12522

[5] The ultimate guide to address matching (online)
https://www.placekey.io/blog

[6] Introduction to NLP
https://builtin.com/data-science/introduction-nlp

[7] Theory behind the basics of NLP
https://www.analyticsvidhya.com/blog/2022/08/theory-behind-the-basics-of-nlp/

[8] Wikipedia: n-gram
https://en.wikipedia.org/wiki/N-gram

[9] Char n-gram
https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781787126787/9/ch09lvl1sec56/character-n-grams#:~:text=An%20n%2Dgram%20is%20a,high%20quality%20for%20authorship%20attribution.

[10] Word Embedding, Wikipedia
https://en.wikipedia.org/wiki/Word_embedding

[11] CountVectorizer, scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

[12] Ultimate guide to text similarity with Python
https://newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python

[13] Efficient Estimation of Word Representations in Vector Space (research paper)
https://arxiv.org/pdf/1301.3781.pdf

[14] FastText Model
https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#:~:text=The%20main%20principle%20behind%20fastText,embedding%20for%20every%20individual%20word.

[15] FastText
https://blogs.sap.com/2019/07/03/glove-and-fasttext-two-popular-word-vector-models-in-nlp/

[16] Distributed Representations of Sentences and Documents (research paper)
https://cs.stanford.edu/~quocle/paragraph_vector.pdf

[17] An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
https://arxiv.org/abs/1607.05368

[18] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/pdf/1810.04805.pdf

[19] BERT Explained: State of the art language model for NLP
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

[20] Jaccard Index, Wikipedia
https://en.wikipedia.org/wiki/Jaccard_index

[21] The Levenshtein Algorithm
https://www.cuelogic.com/blog/the-levenshtein-algorithm#:~:text=The%20Levenshtein%20distance%20is%20a,one%20word%20into%20the%20other.


[22] González Yanes, A., Betancor Villalba, R., Hernández García, M.S. (2021). Título. XXI Jornadas de Estadística de las Comunidades Autónomas, JECAS. Las Palmas de Gran Canaria: ISTAC.
Retrieved from https://jecas.es/sistema-de-georreferenciacion-para-fines-estadisticos/

[23] Gensim documentation
https://radimrehurek.com/gensim/

[24] Sentence Transformers documentation
https://www.sbert.net

[25] Word2vec pre-trained model for the Spanish language
https://github.com/aitoralmeida/spanish_word2vec

[26] FastText pre-trained models
https://fasttext.cc/docs/en/crawl-vectors.html

[27] BERT Spanish model
https://huggingface.co/hackathon-pln-es/paraphrase-spanish-distilroberta
