Name Matching
Name Matching
me Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
Listen Share
We have all been there: you have found two interesting datasets that could really
supplement each other, but… you have no way of joining them together. When
analyzing the data, you start searching for a way to join both datasets. The only field
you find to join the datasets, is a name field and you discover that the spelling of
these names is inconsistent to say the least. At the Dutch Central Bank, we
frequently encounter this problem. We get company names from different sources,
but sometimes a consistent identifier for these companies is lacking. In order to
deal with this problem, we have created a Python package for fuzzy company name
matching. In this blog, I will go over the steps that we take to be able to match
company names and leverage the various datasets.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 1/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
When matching names, you often have large databases that need to be joined. Most
name matching algorithms are computationally expensive if you take into account
that each of the names should be analyzed pairwise, so for two datasets of 10.000
names, that will be already 100.000.000 pairwise comparisons. That is why we start
with preprocessing to get the most out of perfect name matches. Next, we apply
cosine similarity to choose candidates for the fuzzy name matching and we perform
the fuzzy matching algorithms only on these data. Lastly, we do some
postprocessing to determine how well two names actually match.
Preprocessing
Before trying to match company names, it is useful to do some preprocessing of the
data, making the data easier to match. We will do this in several steps. Say we have
the following company name:
1 company_name.lower()
2
3 > 'samsung êlectronics holding, ltd'
1 import unicodedata
2 unicodedata.normalize('NFKD', company_name).encode('ASCII', 'ignore').decode()
3
4 >'samsung electronics holding, ltd'
Then, we remove punctuation, i.e. remove any character that is not a word or space
character with nothing.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 2/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
1 import re
2 re.sub(r'[^\w\s]','',company_name)
3
4 > 'samsung electronics holding ltd'
We remove common legal business suffixes, using a package called cleanco, which
is able to process company names and remove terms referring to organization type.
The idea behind this is to bring the name back to it essence and compare that, as
most of the name similarity scores are normalized based on the length of the string.
Obviously, depending on the data you have, not all of these steps are necessary. With
the preprocessing done, we proceed with approximate string matching.
Cosine Similarity
Using cosine similarity is necessary as the more advanced string matching
algorithms are computationally more complex. In this way, the potential number of
matches can be reduced from a few million down to about fifty. This is done via the
conversion of a string to an n-gram and applying a tf-idf transform.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 3/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
This results in a sparse matrix with the size of all of the unique n-grams that occur
in the dataset. In this matrix, only the elements which link to the n-grams present in
the company name are filled. By calculating the dot product between the matrix for
the entire dataset and the matrix for the names we want to match, we can get the
cosine similarity between the two. From this cosine similarity, we can then run an
partition function to select the top fifty best matches. For these matches, we can
apply the fuzzy string matching.
Discounted Levenshtein
The first way in which we judge how well two strings match, is the discounted
Levenshtein distance, using the abydos package. The Levenshtein distance can be
obtained by changing one string to another by substitution, insertion and deletion.
The discounted version is a variation on the Levenshtein distance where differences
at the end of the string are penalized less than those at the beginning. This is handy
for company names as suffixes to names are far more common than prefixes.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 4/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
A score of 1 implies a perfect match. Here you can see that even though pepsi cola
company has more letters in common and requires fewer edits then coca-cola group,
it is still ranked lower, because of the discounting of the edits further down the
name string.
You can see that even writing out the full legal suffix of the company has less of an
effect then making a few typo’s in the company name. This allows us to also match
long names and names with large differences in length well.
Token Sort
The last metric we now take into account is the token sort distance, which first
tokenizes the data, then sorts the tokens (using the thefuzz package). Based on these
sorted tokens a Levenshtein distance can be determined. This is especially useful
when the words from the company name get scrambled.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 5/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
Here, you can see that the switching around of words no longer affects the score
that a match will get.
Post Processing
After applying the fuzzy matching, we have a score indicating how well two
company names match for each of the algorithms. These scores can be combined to
get a score for how well the two company names match. Depending on the goal of
the name matching, some post processing might be necessary. During the post
processing you can flag potential false positives. When matching fund names, for
instance, it often occurs that you have different rounds of a fund, e.g. Sustainable
Equity Fund I and Sustainable Equity Fund II. These give a high matching score, but
should be differentiated in some cases. Specifically scanning for these kinds of
differences and flagging these results can avoid making these false positive
matches.
NameMatcher
In order to simplify our name matching process, we developed a name matching
Python package. In this package, we can initialize a NameMatcher class object with
the required preprocessing steps and the top n matches that should be returned
from the cosine similarity step.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 6/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
Next up we can set the algorithms we want to use for the fuzzy name matching.
1 matcher.set_distance_metrics(['discounted_levenshtein',
2 'SSK',
3 'fuzzy_wuzzy_token_sort'])
We can then load in our two datasets and indicate which column should be used for
the name matching.
1 matcher.load_and_process_master_data('company_name', name_data_a)
2 matcher.match_names(to_be_matched=name_data_b, column_matching='name company')
The package will perform the name matching and provide us with the best matched
options from the dataset including the score.
Conclusion
Using our name_matching Python package, we can easily match the names of
companies with many different algorithms depending on out data. With a scores
between the 0 and 100 for each of the matches, we can also choose how many false
positives we can accept. So in cases where we really need to be sure, a score of 95 or
higher is used as threshold, while in other cases it will be lower. Checking the
matches near this threshold gives us an idea about the number of false positives
/negatives in our matched data.
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 7/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
At DNB, we are receiving more and more data where both a company name and an
identifier is used. Based on these kind of datasets, we can use different company
names with the same identifier to build a list of alternatives for a company name.
The resulting dataset can be a training dataset for a neural net based name
matching approach once we have enough data, taking the process of name
matching one step further in the future.
TL;DR
In order to match company names from different datasets not sharing any
identifiers, we developed a Python package called name_matching , to help us with
that problem. It is available on the DNB Github.
Central Bank
Follow
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 8/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 9/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
Search
Yassine EL KHAL
254 2
Data matching is a critical task in data management, and fuzzy data matching presents its own
set of challenges. In this blog post, we will…
Lists
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 11/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
4.4K 41
David Goudet
6.1K 66
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 12/14
15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium
59
Abhit Maurya
23
https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 14/14