Name Matching

The document discusses the challenges of matching inconsistent company names across different datasets and presents a Python package developed by the Dutch Central Bank for fuzzy company name matching. It outlines the preprocessing steps, algorithms used for matching (including cosine similarity and various fuzzy matching techniques), and the post-processing of results to improve accuracy. The package allows users to customize matching criteria and provides a score for the quality of matches, facilitating better data integration.

15/12/2023, 12:37 Company Name Matching. We have all been there: you have found… | by Michiel Nijhuis | DNB — Data Science Hub | Medium

Company Name Matching

Michiel Nijhuis
Published in DNB — Data Science Hub
6 min read · Mar 3, 2022


We have all been there: you have found two interesting datasets that could really
supplement each other, but you have no way of joining them together. When
analyzing the data, you start searching for a way to join both datasets. The only field
you can find to join them on is a name field, and you discover that the spelling of
these names is inconsistent, to say the least. At the Dutch Central Bank, we
frequently encounter this problem. We get company names from different sources,
but sometimes a consistent identifier for these companies is lacking. To
deal with this problem, we have created a Python package for fuzzy company name
matching. In this blog, I will go over the steps that we take to match
company names and leverage the various datasets.

How We Match Names

https://medium.com/dnb-data-science-hub/company-name-matching-6a6330710334 1/14
When matching names, you often have large databases that need to be joined. Most
name matching algorithms are computationally expensive, given
that the names have to be compared pairwise: for two datasets of 10,000
names each, that is already 100,000,000 pairwise comparisons. That is why we start
with preprocessing to get the most out of perfect name matches. Next, we apply
cosine similarity to choose candidates for the fuzzy name matching, and we run
the fuzzy matching algorithms only on these candidates. Lastly, we do some
postprocessing to determine how well two names actually match.

Preprocessing
Before trying to match company names, it is useful to do some preprocessing of the
data, making the data easier to match. We will do this in several steps. Say we have
the following company name:

    company_name = 'SAMSUNG ÊLECTRONICS Holding, LTD'

We start by converting all capital letters to lowercase.

    company_name = company_name.lower()
    # > 'samsung êlectronics holding, ltd'

Next, we replace non-ASCII characters with their ASCII equivalents.

    import unicodedata
    company_name = unicodedata.normalize('NFKD', company_name).encode('ASCII', 'ignore').decode()
    # > 'samsung electronics holding, ltd'

Then, we remove punctuation, i.e. we replace any character that is not a word or
whitespace character with nothing.

    import re
    company_name = re.sub(r'[^\w\s]', '', company_name)
    # > 'samsung electronics holding ltd'

We remove common legal business suffixes, using a package called cleanco, which
is able to process company names and remove terms referring to organization type.

    from cleanco import basename
    company_name = basename(company_name)
    # > 'samsung electronics holding'

Finally, we remove the most common words using regular expressions.

    suffix = 'holding'  # example of a common word to strip
    company_name = ' '.join(re.sub(r'\b{}\b'.format(re.escape(suffix)), '', company_name).split())
    # > 'samsung electronics'

The idea behind this is to bring the name back to its essence and compare that, as
most of the name similarity scores are normalized by the length of the string.
Obviously, depending on the data you have, not all of these steps are necessary. With
the preprocessing done, we proceed with approximate string matching.
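
To see how the steps fit together, here is a minimal sketch that chains them into one function. It only uses the standard library, so the cleanco step is left out, and the `WORDS_TO_DROP` list is a made-up stand-in for the legal suffixes and common words you would derive from your own data.

```python
import re
import unicodedata

# Hypothetical word list for illustration; in practice it depends on the data.
WORDS_TO_DROP = ("holding", "ltd")


def preprocess(name, words_to_drop=WORDS_TO_DROP):
    """Lowercase, transliterate to ASCII, strip punctuation and drop common words."""
    name = name.lower()
    # Decompose accented characters and drop whatever cannot be encoded as ASCII.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Remove every character that is neither a word character nor whitespace.
    name = re.sub(r"[^\w\s]", "", name)
    for word in words_to_drop:
        name = re.sub(r"\b{}\b".format(re.escape(word)), "", name)
    # Collapse the whitespace left behind by the removals.
    return " ".join(name.split())


print(preprocess("SAMSUNG ÊLECTRONICS Holding, LTD"))  # samsung electronics
```

Keeping the word list as a parameter makes it easy to switch individual steps on or off per dataset.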

Cosine Similarity
A cosine similarity step is needed first, as the more advanced string matching
algorithms are computationally more expensive. In this way, the number of potential
matches can be reduced from a few million down to about fifty. This is done by
converting each string to character n-grams and applying a tf-idf transform.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(lowercase=False, analyzer="char", ngram_range=(2, 3))
    vec.fit(company_name_dataset)
    vec.transform([company_name])  # transform expects an iterable of strings
    # > <1x350357 sparse matrix of type '<class 'numpy.float64'>'
    # >  with 35 stored elements in Compressed Sparse Row format>

This results in a sparse matrix with one column for each unique n-gram that occurs
in the dataset. In this matrix, only the elements which correspond to the n-grams present in
the company name are filled. Because the tf-idf vectors are L2-normalized by default,
calculating the dot product between the matrix for the entire dataset and the matrix for
the names we want to match gives us the cosine similarity between the two. On this
cosine similarity, we can then run a partition function to select the top fifty matches.
To these matches, we can apply the fuzzy string matching.
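
The candidate selection can be sketched without any dependencies. The version below is an illustration of the idea, not the scikit-learn implementation: it builds L2-normalized tf-idf vectors over character 2- and 3-grams as plain dictionaries, and uses `heapq.nlargest` in place of a partition function. All function names are invented for this example.

```python
import heapq
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of length n_min..n_max."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(names):
    """Smoothed inverse document frequency for every n-gram in the dataset."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name)))
    total = len(names)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(name, idf):
    """Sparse, L2-normalized tf-idf vector; n-grams unseen during fitting are ignored."""
    counts = Counter(g for g in char_ngrams(name) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def top_candidates(query, names, top_n=2):
    """Return the top_n (cosine similarity, name) pairs for the query."""
    idf = fit_idf(names)
    q = tfidf_vector(query, idf)
    scored = []
    for name in names:
        v = tfidf_vector(name, idf)
        # For unit-length vectors the dot product is the cosine similarity.
        scored.append((sum(w * v.get(g, 0.0) for g, w in q.items()), name))
    return heapq.nlargest(top_n, scored)


names = ["samsung electronics", "samsung heavy industries",
         "royal dutch shell", "heineken"]
for score, name in top_candidates("samsung electronic", names):
    print(round(score, 2), name)
```

On real data the sparse matrix product is much faster than this loop; the sketch only shows why the dot product is enough once the vectors are normalized.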

Fuzzy String Matching

For the fuzzy matching of company names, there are many different algorithms
available out there. To match company names well, a combination of these
algorithms is needed to find most matches. Depending on the differences between
two company names, different algorithms should be used. In this case, we will be
using three algorithms which I will now discuss in turn.

Discounted Levenshtein

The first way in which we judge how well two strings match is the discounted
Levenshtein distance, using the abydos package. The Levenshtein distance is the
minimum number of substitutions, insertions and deletions needed to change one
string into the other.
The discounted version is a variation on the Levenshtein distance where differences
at the end of the string are penalized less than those at the beginning. This is handy
for company names, as suffixes to names are far more common than prefixes.

    import abydos.distance as abd

    abd.DiscountedLevenshtein().sim('coca-cola company', 'coca-cola group')
    # > 0.74

    abd.DiscountedLevenshtein().sim('coca-cola company', 'pepsi cola company')
    # > 0.54

A score of 1 implies a perfect match. Here you can see that even though pepsi cola
company has more letters in common with coca-cola company and requires no more
edits than coca-cola group, it is still ranked lower, because edits further down the
name string are discounted.
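
For comparison, here is a small sketch of the plain, undiscounted Levenshtein distance. It is a textbook dynamic-programming version written for this post, not the abydos code, and `normalised_similarity` is just one common way to turn the distance into a score between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """1 minus the distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein('coca-cola company', 'coca-cola group'))     # 6
print(levenshtein('coca-cola company', 'pepsi cola company'))  # 6
```

Both pairs are six edits apart, so the plain distance cannot separate them, and the length-normalized score even slightly favours the pepsi name (about 0.67 against 0.65). Weighting edits by their position is what pushes coca-cola group ahead.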

String Subsequence Kernel Similarity

A different way of trying to match strings is by looking at the subsequences
the two strings share. Each name is broken up into (non-)contiguous subsequences, and
the two sets of subsequences are compared. The string subsequence kernel, a kernel
function that can be used in an SVM, turns this comparison into a similarity score
between the two strings. For longer names, this gives a better idea of the matching.

    abd.SSK().sim('Anheuser-Busch InBev International Gesellschaft mit beschränkter Haftung
    # > 0.74

    abd.SSK().sim('Anheuser-Busch InBev', 'Anhauser Bosch InBef')
    # > 0.72

You can see that even writing out the full legal suffix of the company has less of an
effect than making a few typos in the company name. This allows us to also match
long names and names with large differences in length well.
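
To make the idea of weighted subsequences concrete, here is a sketch of the string subsequence kernel following its usual recursive definition. The subsequence length `n=2` and decay `lam=0.5` are assumptions picked for the example, so the scores will not necessarily match the abydos defaults used above.

```python
import math
from functools import lru_cache


def ssk(s: str, t: str, n: int = 2, lam: float = 0.5) -> float:
    """Unnormalized kernel: weighted count of shared subsequences of length n."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel K'_i on the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                # The decay grows with the gap between the match and the end of t.
                total += k_prime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(n - 1, ls - 1, j) * lam ** 2
        return total

    return k(len(s), len(t))


def ssk_similarity(s: str, t: str, n: int = 2, lam: float = 0.5) -> float:
    """Kernel value normalized to the range 0..1."""
    denominator = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denominator if denominator else 0.0


print(round(ssk_similarity('cat', 'car'), 3))  # 0.444
```

The words cat and car share only the subsequence "ca", which gives 1 / (2 + lam**2), or roughly 0.444 for a decay of 0.5. The recursion is fine for short names; longer strings call for an iterative version.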

Token Sort

The last metric we take into account is the token sort distance from the thefuzz
package, which first tokenizes the names and then sorts the tokens. Based on these
sorted tokens a Levenshtein distance can be determined. This is especially useful
when the words in the company name get scrambled.

    from thefuzz import fuzz

    fuzz.token_sort_ratio('Apple Computer Inc', 'Apple Inc Computer') / 100
    # > 1.0

    fuzz.token_sort_ratio('Apple Computer Inc', 'Apple Inc') / 100
    # > 0.67

Here, you can see that the switching around of words no longer affects the score
that a match will get.
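
The mechanics are easy to reproduce with the standard library. The sketch below sorts the tokens and then compares the resulting strings with `difflib.SequenceMatcher`; this is a stand-in chosen for the example, so scores may differ slightly from thefuzz on other inputs.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Compare two names after lowercasing and sorting their tokens."""
    def normalise(name: str) -> str:
        return ' '.join(sorted(name.lower().split()))

    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


print(token_sort_similarity('Apple Computer Inc', 'Apple Inc Computer'))   # 1.0
print(round(token_sort_similarity('Apple Computer Inc', 'Apple Inc'), 2))  # 0.67
```

Sorting puts both names into the same canonical word order before any character-level comparison happens, which is why the first pair scores a perfect 1.0.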

Post Processing
After applying the fuzzy matching, we have a score indicating how well two
company names match for each of the algorithms. These scores can be combined to
get a score for how well the two company names match. Depending on the goal of
the name matching, some post processing might be necessary. During the post
processing you can flag potential false positives. When matching fund names, for
instance, it often occurs that you have different rounds of a fund, e.g. Sustainable
Equity Fund I and Sustainable Equity Fund II. These give a high matching score, but
should be differentiated in some cases. Specifically scanning for these kinds of
differences and flagging these results can avoid making these false positive
matches.
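
A check of this kind could look like the sketch below. The function name and the numeral pattern are made up for this example, and the pattern is deliberately crude: it treats any token consisting only of digits or Roman-numeral letters as a numeral.

```python
import re

# Digits or Roman-numeral letters only; a rough heuristic, not a validator.
_NUMERAL = re.compile(r"^(?:\d+|[ivxlcdm]+)$", re.IGNORECASE)


def differs_only_by_numeral(name_a: str, name_b: str) -> bool:
    """Flag pairs whose only difference is a single numeral token."""
    tokens_a, tokens_b = name_a.lower().split(), name_b.lower().split()
    if len(tokens_a) != len(tokens_b):
        return False
    diffs = [(a, b) for a, b in zip(tokens_a, tokens_b) if a != b]
    return len(diffs) == 1 and all(_NUMERAL.match(token) for token in diffs[0])


print(differs_only_by_numeral('Sustainable Equity Fund I',
                              'Sustainable Equity Fund II'))  # True
print(differs_only_by_numeral('asml nv', 'asml holding nv'))  # False
```

Pairs flagged this way can be reviewed by hand or given a lower score, depending on whether the different rounds should count as the same entity.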

NameMatcher
In order to simplify our name matching process, we developed a name matching
Python package. In this package, we can initialize a NameMatcher class object with
the required preprocessing steps and the top n matches that should be returned
from the cosine similarity step.

    from name_matching.name_matcher import NameMatcher

    matcher = NameMatcher(top_n=10,
                          lowercase=True,
                          punctuations=True,
                          remove_ascii=True,
                          legal_suffixes=False,
                          common_words=False,
                          verbose=True)

Next up we can set the algorithms we want to use for the fuzzy name matching.

    matcher.set_distance_metrics(['discounted_levenshtein',
                                  'SSK',
                                  'fuzzy_wuzzy_token_sort'])

We can then load in our two datasets and indicate which column should be used for
the name matching.

    matcher.load_and_process_master_data('company_name', name_data_a)
    matcher.match_names(to_be_matched=name_data_b, column_matching='name company')

The package will perform the name matching and provide us with the best matched
options from the dataset including the score.

    original_name       match_name                     score
    asml nv             asml holding nv                100.0
    unilever bv         unilever nv                    100.0
    shell bv            royal dutch shell plc           69.6
    ing bank nv         ing group nv                    64.0
    koninklijke filips  koninklijke philips nv          79.4
    adyen nv            adyen nv                       100.0
    relx plc            relx plc                       100.0
    prosus group        prosus nv                      100.0
    dsm                 koninklijke dsm nv             100.0
    ahold-delheize      koninklijke ahold delhaize nv   88.1
    heineken breweries  heineken nv                     71.2

Conclusion
Using our name_matching Python package, we can easily match the names of
companies with many different algorithms, depending on our data. With a score
between 0 and 100 for each of the matches, we can also choose how many false
positives we can accept. So in cases where we really need to be sure, a score of 95 or
higher is used as the threshold, while in other cases it will be lower. Checking the
matches near this threshold gives us an idea of the number of false positives
and false negatives in our matched data.
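
Applying such a threshold takes only a few lines. In the sketch below the rows are a handful of the example scores shown earlier, and the `margin` that defines what counts as "near the threshold" is an assumption made for this illustration.

```python
# (original_name, match_name, score) rows taken from the example output.
matches = [
    ('asml nv', 'asml holding nv', 100.0),
    ('shell bv', 'royal dutch shell plc', 69.6),
    ('koninklijke filips', 'koninklijke philips nv', 79.4),
    ('ahold-delheize', 'koninklijke ahold delhaize nv', 88.1),
    ('heineken breweries', 'heineken nv', 71.2),
]


def split_by_threshold(rows, threshold=95.0, margin=10.0):
    """Return accepted matches and the borderline ones just below the threshold."""
    accepted = [row for row in rows if row[2] >= threshold]
    borderline = [row for row in rows if threshold - margin <= row[2] < threshold]
    return accepted, borderline


accepted, borderline = split_by_threshold(matches)
print([row[0] for row in accepted])    # ['asml nv']
print([row[0] for row in borderline])  # ['ahold-delheize']
```

The borderline list is the part worth checking by hand: it shows which correct matches a strict threshold throws away.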

At DNB, we are receiving more and more data in which both a company name and an
identifier are provided. Based on these kinds of datasets, we can use different company
names with the same identifier to build a list of alternatives for a company name.
The resulting dataset can serve as a training dataset for a neural-net-based name
matching approach once we have enough data, taking the process of name
matching one step further in the future.

TL;DR
In order to match company names from different datasets that do not share any
identifiers, we developed a Python package called name_matching to help us with
that problem. It is available on the DNB GitHub.

Written by Michiel Nijhuis

Data Scientist at the Dutch central bank

