
Name Matching

The document discusses the challenges of matching inconsistent company names across different datasets and presents a Python package developed by the Dutch Central Bank for fuzzy company name matching. It outlines the preprocessing steps, algorithms used for matching (including cosine similarity and various fuzzy matching techniques), and the post-processing of results to improve accuracy. The package allows users to customize matching criteria and provides a score for the quality of matches, facilitating better data integration.

15/12/2023, 12:37 Company Name Matching.

me Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium

Company Name Matching


Michiel Nijhuis · Follow
Published in DNB — Data Science Hub
6 min read · Mar 3, 2022


We have all been there: you have found two interesting datasets that could really
supplement each other, but… you have no way of joining them together. When
analyzing the data, you start searching for a way to join both datasets. The only field
you find to join the datasets on is a name field, and you discover that the spelling of
these names is inconsistent, to say the least. At the Dutch Central Bank, we
frequently encounter this problem. We get company names from different sources,
but sometimes a consistent identifier for these companies is lacking. In order to
deal with this problem, we have created a Python package for fuzzy company name
matching. In this blog, I will go over the steps that we take to be able to match
company names and leverage the various datasets.

How We Match Names

https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 1/14
When matching names, you often have large databases that need to be joined. Most
name matching algorithms are computationally expensive if you take into account
that each of the names should be analyzed pairwise, so for two datasets of 10,000
names, that is already 100,000,000 pairwise comparisons. That is why we start
with preprocessing to get the most out of perfect name matches. Next, we apply
cosine similarity to choose candidates for the fuzzy name matching and we perform
the fuzzy matching algorithms only on these data. Lastly, we do some
postprocessing to determine how well two names actually match.

Preprocessing
Before trying to match company names, it is useful to do some preprocessing of the
data, making the data easier to match. We will do this in several steps. Say we have
the following company name:

    company_name = 'SAMSUNG ÊLECTRONICS Holding, LTD'

We start by converting all letters to lowercase.

    # str.lower() returns a new string, so assign the result back
    company_name = company_name.lower()
    # 'samsung êlectronics holding, ltd'

Next, we replace non-ASCII characters.

    import unicodedata

    # NFKD splits accented letters into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then drops the marks
    company_name = unicodedata.normalize('NFKD', company_name).encode('ASCII', 'ignore').decode()
    # 'samsung electronics holding, ltd'

Then, we remove punctuation, i.e. we replace any character that is not a word or
whitespace character with nothing.

    import re

    # drop every character that is neither a word character nor whitespace
    company_name = re.sub(r'[^\w\s]', '', company_name)
    # 'samsung electronics holding ltd'

We remove common legal business suffixes, using a package called cleanco, which
is able to process company names and remove terms referring to organization type.

    from cleanco import basename

    # strip terms that refer to the organization type (here 'ltd')
    company_name = basename(company_name)
    # 'samsung electronics holding'

Finally, we remove the most common words using regular expressions.

    suffix = 'holding'  # example common word, assumed here to match the output below
    company_name = ' '.join(re.sub(r'\b{}\b'.format(re.escape(suffix)), '', company_name).split())
    # 'samsung electronics'

The idea behind this is to bring the name back to its essence and compare that, as
most of the name similarity scores are normalized based on the length of the string.
Obviously, depending on the data you have, not all of these steps are necessary. With
the preprocessing done, we proceed with approximate string matching.
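
The steps above can be chained into a single function. The following is a minimal sketch that uses only the standard library: the two word lists are illustrative assumptions (the article uses cleanco for the legal suffixes and does not list the common words), and the helper names `remove_words` and `preprocess` are hypothetical.

```python
import re
import unicodedata

# Illustrative lists only; cleanco covers far more legal suffixes, and the
# article does not say which common words are stripped.
LEGAL_SUFFIXES = ("ltd", "inc", "nv", "bv", "plc", "gmbh")
COMMON_WORDS = ("holding", "group", "international")


def remove_words(name, words):
    """Delete whole-word occurrences of each word and collapse whitespace."""
    for word in words:
        name = re.sub(r"\b{}\b".format(re.escape(word)), "", name)
    return " ".join(name.split())


def preprocess(name):
    """Apply the preprocessing steps in the order described above."""
    name = name.lower()
    # decompose accented letters, then drop everything that is not ASCII
    name = unicodedata.normalize("NFKD", name).encode("ASCII", "ignore").decode()
    # remove punctuation
    name = re.sub(r"[^\w\s]", "", name)
    name = remove_words(name, LEGAL_SUFFIXES)
    name = remove_words(name, COMMON_WORDS)
    return name


print(preprocess("SAMSUNG ÊLECTRONICS Holding, LTD"))  # samsung electronics
```

Note that this sketch removes the listed words wherever they occur in the name, which is a simplification compared to a dedicated package such as cleanco.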

Cosine Similarity
Using cosine similarity first is necessary as the more advanced string matching
algorithms are computationally more expensive. In this way, the potential number of
matches per name can be reduced from a few million down to about fifty. This is done by
converting each string to character n-grams and applying a tf-idf transform.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(lowercase=False, analyzer="char", ngram_range=(2, 3))
    vec.fit(company_name_dataset)
    # transform() expects an iterable of strings, so wrap the single name in a list
    vec.transform([company_name])
    # <1x350357 sparse matrix of type '<class 'numpy.float64'>'
    #     with 35 stored elements in Compressed Sparse Row format>

This results in a sparse matrix with one column for each unique n-gram that occurs
in the dataset. In this matrix, only the elements that correspond to the n-grams present in
the company name are filled. By calculating the dot product between the matrix for
the entire dataset and the matrix for the names we want to match, we get the
cosine similarity between the two, since TfidfVectorizer normalizes each row to unit
length by default. On this cosine similarity, we can then run a
partition function to select the top fifty matches. To these matches, we can
apply the fuzzy string matching.
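
To make the candidate selection concrete, here is a dependency-free sketch of the same idea. It is a plain-Python stand-in for the scikit-learn vectorizer, not the package's implementation: the function names are hypothetical, the idf formula follows scikit-learn's smoothed default, and `heapq.nlargest` plays the role of the partition function.

```python
import heapq
import math
from collections import Counter


def char_ngrams(name, n_min=2, n_max=3):
    """All character n-grams of length n_min..n_max."""
    return [name[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(name) - n + 1)]


def fit_idf(names):
    """Smoothed inverse document frequency per n-gram: ln((1+n)/(1+df)) + 1."""
    n_docs = len(names)
    doc_freq = Counter(g for name in names for g in set(char_ngrams(name)))
    return {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(name, idf):
    """Sparse tf-idf vector (as a dict), scaled to unit length."""
    counts = Counter(g for g in char_ngrams(name) if g in idf)
    weights = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def top_candidates(query, master_names, top_n=50):
    """Return the top_n (similarity, name) pairs, best first."""
    idf = fit_idf(master_names)
    query_vec = tfidf_vector(query, idf)
    scored = ((cosine(query_vec, tfidf_vector(m, idf)), m) for m in master_names)
    return heapq.nlargest(top_n, scored)
```

Calling `top_candidates('samsung electronic', names, top_n=50)` returns the shortlist on which the more expensive fuzzy metrics are then run. For datasets of realistic size, the sparse-matrix version with scikit-learn will be much faster than this sketch.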

Fuzzy String Matching


For the fuzzy matching of company names, there are many different algorithms
available out there. To match company names well, a combination of these
algorithms is needed to find most matches. Depending on the differences between
two company names, different algorithms should be used. In this case, we will be
using three algorithms which I will now discuss in turn.

Discounted Levenshtein

The first way in which we judge how well two strings match is the discounted
Levenshtein distance, using the abydos package. The Levenshtein distance is the
minimum number of single-character substitutions, insertions and deletions needed to
change one string into the other. The discounted version is a variation on the
Levenshtein distance where differences at the end of the string are penalized less than
those at the beginning. This is handy for company names, as suffixes to names are far
more common than prefixes.

    import abydos.distance as abd

    abd.DiscountedLevenshtein().sim('coca-cola company', 'coca-cola group')
    # 0.74

    abd.DiscountedLevenshtein().sim('coca-cola company', 'pepsi cola company')
    # 0.54

A score of 1 implies a perfect match. Here you can see that even though pepsi cola
company has more letters in common and requires fewer edits than coca-cola group,
it is still ranked lower, because of the discounting of the edits further down the
name string.
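
For reference, the plain Levenshtein distance that the discounted version builds on can be sketched in a few lines. This is the standard dynamic-programming formulation and does not reproduce the position-dependent discount that abydos applies. The helper `levenshtein_similarity` and its normalization by the longer string are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Scale the distance to a 0..1 score, where 1 is a perfect match."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

In this plain version every edit costs the same regardless of where it occurs, which is exactly what the discounted variant changes.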

String Subsequence Kernel Similarity

A different way of trying to match strings is by looking at the possible substrings
shared between the two strings. By dividing each name into (non-)contiguous substrings, a
difference between the two sets of substrings can be determined. An SVM can
subsequently be applied to generate a difference score between the two strings. For
longer names, this gives a better idea of the matching.

    abd.SSK().sim('Anheuser-Busch InBev International Gesellschaft mit beschränkter Haftung
    # 0.74

    abd.SSK().sim('Anheuser-Busch InBev', 'Anhauser Bosch InBef')
    # 0.72

You can see that even writing out the full legal suffix of the company has less of an
effect than making a few typos in the company name. This allows us to also match
long names and names with large differences in length well.

Token Sort

The last metric we take into account is the token sort distance from the thefuzz
package, which first tokenizes the names and then sorts the tokens. A Levenshtein
distance is then determined on the sorted tokens. This is especially useful
when the words in a company name get scrambled.

    from thefuzz import fuzz

    fuzz.token_sort_ratio('Apple Computer Inc', 'Apple Inc Computer') / 100
    # 1.0

    fuzz.token_sort_ratio('Apple Computer Inc', 'Apple Inc') / 100
    # 0.67

Here, you can see that the switching around of words no longer affects the score
that a match will get.
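
The mechanics of the token sort step can be illustrated with the standard library alone. In this sketch, `difflib.SequenceMatcher` is swapped in as the similarity measure, so it is a stand-in for, not a copy of, what thefuzz computes. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a, b):
    """Lowercase, split into tokens, sort the tokens, then compare the results."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()


print(token_sort_similarity('Apple Computer Inc', 'Apple Inc Computer'))  # 1.0
```

Because both names reduce to the same sorted token string, the word order has no influence on the score.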

Post Processing
After applying the fuzzy matching, we have a score indicating how well two
company names match for each of the algorithms. These scores can be combined to
get a score for how well the two company names match. Depending on the goal of
the name matching, some post processing might be necessary. During the post
processing you can flag potential false positives. When matching fund names, for
instance, it often occurs that you have different rounds of a fund, e.g. Sustainable
Equity Fund I and Sustainable Equity Fund II. These give a high matching score, but
should be differentiated in some cases. Specifically scanning for these kinds of
differences and flagging these results can avoid making these false positive
matches.
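
One possible way to scan for the fund-round case is sketched below. The function names and the rule itself (identical names apart from a trailing small Roman numeral or number) are assumptions for illustration, not the package's post-processing logic.

```python
import re

# Trailing token made up only of i, v, x (small Roman numerals) or digits.
# Assumption: fund rounds rarely go beyond what these letters can express.
NUMERAL = re.compile(r"^(?:[ivx]+|\d+)$")


def split_numeral(name):
    """Split a name into (base, trailing numeral); the numeral may be empty."""
    tokens = name.lower().split()
    if tokens and NUMERAL.match(tokens[-1]):
        return " ".join(tokens[:-1]), tokens[-1]
    return " ".join(tokens), ""


def is_numeral_variant(name_a, name_b):
    """True when the names share a base but carry different trailing numerals."""
    base_a, numeral_a = split_numeral(name_a)
    base_b, numeral_b = split_numeral(name_b)
    return base_a == base_b and numeral_a != numeral_b


print(is_numeral_variant('Sustainable Equity Fund I', 'Sustainable Equity Fund II'))  # True
```

Pairs for which this returns True could be flagged for manual review rather than dropped, since whether the rounds should be kept apart depends on the goal of the matching.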

NameMatcher
In order to simplify our name matching process, we developed a name matching
Python package. In this package, we can initialize a NameMatcher class object with
the required preprocessing steps and the top n matches that should be returned
from the cosine similarity step.

    from name_matching.name_matcher import NameMatcher

    matcher = NameMatcher(top_n=10,
                          lowercase=True,
                          punctuations=True,
                          remove_ascii=True,
                          legal_suffixes=False,
                          common_words=False,
                          verbose=True)

Next up we can set the algorithms we want to use for the fuzzy name matching.

    matcher.set_distance_metrics(['discounted_levenshtein',
                                  'SSK',
                                  'fuzzy_wuzzy_token_sort'])

We can then load in our two datasets and indicate which column should be used for
the name matching.

    matcher.load_and_process_master_data('company_name', name_data_a)
    matcher.match_names(to_be_matched=name_data_b, column_matching='name company')

The package will perform the name matching and provide us with the best matched
options from the dataset including the score.

    original_name        match_name                      score
    asml nv              asml holding nv                 100.0
    unilever bv          unilever nv                     100.0
    shell bv             royal dutch shell plc            69.6
    ing bank nv          ing group nv                     64.0
    koninklijke filips   koninklijke philips nv           79.4
    adyen nv             adyen nv                        100.0
    relx plc             relx plc                        100.0
    prosus group         prosus nv                       100.0
    dsm                  koninklijke dsm nv              100.0
    ahold-delheize       koninklijke ahold delhaize nv    88.1
    heineken breweries   heineken nv                      71.2

Conclusion
Using our name_matching Python package, we can easily match the names of
companies with many different algorithms, depending on our data. With a score
between 0 and 100 for each of the matches, we can also choose how many false
positives we can accept. So in cases where we really need to be sure, a score of 95 or
higher is used as the threshold, while in other cases it will be lower. Checking the
matches near this threshold gives us an idea about the number of false
positives/negatives in our matched data.
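
As a small illustration of working with such a threshold, the sketch below splits the example matches shown earlier into accepted, to-be-reviewed and rejected groups. The function name, the width of the review band and the choice to review only matches just below the threshold are assumptions, not part of the package.

```python
# (original_name, match_name, score) tuples taken from the example output above
matches = [
    ("asml nv", "asml holding nv", 100.0),
    ("unilever bv", "unilever nv", 100.0),
    ("shell bv", "royal dutch shell plc", 69.6),
    ("ing bank nv", "ing group nv", 64.0),
    ("koninklijke filips", "koninklijke philips nv", 79.4),
    ("adyen nv", "adyen nv", 100.0),
    ("relx plc", "relx plc", 100.0),
    ("prosus group", "prosus nv", 100.0),
    ("dsm", "koninklijke dsm nv", 100.0),
    ("ahold-delheize", "koninklijke ahold delhaize nv", 88.1),
    ("heineken breweries", "heineken nv", 71.2),
]


def split_by_threshold(matches, threshold=95.0, review_band=10.0):
    """Accept scores >= threshold; send scores just below it to manual review."""
    accepted = [m for m in matches if m[2] >= threshold]
    review = [m for m in matches if threshold - review_band <= m[2] < threshold]
    rejected = [m for m in matches if m[2] < threshold - review_band]
    return accepted, review, rejected


accepted, review, rejected = split_by_threshold(matches)
print(len(accepted), len(review), len(rejected))  # 6 1 4
```

Lowering the threshold moves pairs from the review and rejected groups into the accepted group, trading fewer false negatives for more false positives.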

At DNB, we are receiving more and more data in which both a company name and an
identifier are provided. Based on these kinds of datasets, we can use different company
names with the same identifier to build a list of alternatives for a company name.
The resulting dataset can be a training dataset for a neural net based name
matching approach once we have enough data, taking the process of name
matching one step further in the future.

TL;DR
In order to match company names from different datasets that do not share any
identifiers, we developed a Python package called name_matching to help us with
that problem. It is available on the DNB GitHub.


Written by Michiel Nijhuis



Data Scientist at the Dutch central bank
