Dataset Matching Toolkit
JANUARY 2023
ACKNOWLEDGEMENTS
NASTAD is grateful for the development of this resource by:
Shauna Onofrey, Rita Isabel Lechuga, Zakiya Grubbs, and Fozia Ajmal, as well as the following persons
who facilitated the review and whose contributions and support made this publication possible:
Henry Roberts; Laurie Barker; Kathleen Ly; Danica Kuncio; Lindsey Sizemore;
Nicola D. Thompson; Clarisse Tsang; Sarah New; Tony Fristachi

DEDICATION
In memory of Tony Fristachi
Introduction

Congratulations! You are taking the first step towards understanding how your hepatitis data fits into a bigger picture! Maybe you are looking at co-infection of hepatitis B and C, or trying to find out how many of the hepatitis cases in your registry gave birth this year. Regardless of your focus and the datasets you are using, this toolkit is designed to help you along the way. Here you will find information on paperwork you should consider before you start your match, how to prepare the data once you have it, and considerations across three software options for matching. Both exact matching and probabilistic or 'fuzzy' matching are covered here. If this guide does not provide the information you are looking for, please reach out to NASTAD via hepatitis@nastad.org for further assistance.

What is data matching?

Data matching, also called record linkage, is the process of identifying an entity across different data sources. In our situation, the entity is usually a person, but it could also be a hospital, service center, or other entity for which we collect data. When we match datasets, we can identify when that entity does or does not appear in both datasets and combine the data on that entity from both datasets to answer questions about the entity that you cannot answer from the data in one dataset.

Why match?

Matching may provide data you cannot otherwise get through standard surveillance processes. For example, you may not have a way to capture mortality data on your HCV cases, but this is data you want to capture in the care cascade. A match with vital records could provide you with that data.

Particular to PS21-2103, matching may provide solutions for the following:

Outcome 1.2.2: Improve monitoring of burden of disease and trends in hepatitis A, acute hepatitis B, and acute hepatitis C infections

How matching can help: Matching with other disease or outcome datasets can be used to increase completeness of age, gender, race/ethnicity, and county of residence data.

Outcome 1.2.4: Improved monitoring of the hepatitis C continuum of cure (CoC)

How matching can help: As mentioned above, matching can provide the data on mortality required by CDC in year 3. Matching may provide data about other outcomes or treatment data as well.

Outcome 1.2.5: Improved development and utilization of viral hepatitis surveillance data reports

How matching can help: Including data from matches, including co-infections among viral hepatitis types and with other syndemic illnesses, in surveillance reports provides more insight into the diseases in question and may make the data reports of interest to a wider audience.
Before you match your data, there are some steps you should take that will make the process easier.

Questions to answer before matching (all examples using a match with death record data):

1. What is the question you are trying to answer?
(Ex: Case finding: How many death records with HCV listed as a cause of death were never reported to our registry?)
(Ex: Care continuum: How many cases of HCV with confirmatory tests in the last 5 years died following the confirmatory test?)

2. What will be the outcome of answering the question? Does this have policy implications? Do you want to pursue publication? Are you adding the data to surveillance?

3. What datasets will you use to answer this question?
Dataset 1: (Ex: HCV registry)
Dataset 2: (Ex: Vital Records - Deaths)

4. What time period will you need for each dataset? How are you defining time?
Dataset 1: (Ex: HCV registry in full, HCV registry cases with confirmatory test (NAT) between 2017-2022)
Dataset 2: (Ex: Deaths with date of death in 2022)

5. Are there timing considerations between the two datasets? (Does one event have to have happened first? Is there a period of time that needs to have passed between the two events?) Which dataset will anchor the match?

6. Which variables are required for the match?
From Dataset 1 (Ex: Last_name, First_name, DOB, Address, Maiden_Name)
From Dataset 2 (Ex: LName, Gname, Birth_date, Street, City)

7. What other variables do you need to answer your question? If variables are included in both datasets (Ex: race, ethnicity), which dataset's variable will you use?
From Dataset 1 (Ex: Race, Ethnicity, Age)
From Dataset 2 (Ex: Causes of Death 1-22)

8. How will this data be used?
(Ex: Cases identified that were not in the registry will be added, and the reporting hospital, if available, will be contacted for medical records. OR Aggregate data will be used to create a new bar in the HCV care continuum produced for and available on our website)

9. How readily available is each dataset?

10. Once you are ready to do the match, how long will it take you to get your data ready?

11. How long will it take to get the dataset for matching?

12. How often will the match be repeated?
(Ex: One time only, once a year)
Data Use Agreements

Do I need one?
That answer depends on a few things. Who ‘owns’ the data you want to match? If it is a dataset you have access
to for your day-to-day work, you may not require a data use agreement. Datasets such as death records are
public and may also not require a data use agreement. However, if the dataset falls under another division or
organization, you should plan to fill out a data use agreement.
It also depends on the rules of your organization or the owner of the other dataset. Your organization may
require all data matches to have a data use agreement in place.
Even if it is not required, having a data use agreement may not be a bad idea. Having a document that
outlines the details of your matching project allows for managing everyone’s understanding and expectations
from the beginning.
Some groups are more restrictive on the use of their data. You may need to compromise. Consider if you would
be comfortable providing them with the data to match and to receive back aggregate data. If there is hesitation
to that, or if you need person-level data, consider if you might start by matching a smaller subset (a month’s
worth of data) to see what the volume might be.
Do not expect that everyone will be immediately happy with your plans. Prepare to talk through and support
why you want to look at what you want to look at.
The data use agreement itself describes the who, what, when, where, why, and how of the actual data match.

Who:
• Who will be doing the match?
• Who will have access to the dataset before it is matched, during the match process, and after the match?
• It's wise to make sure at least two people in an organization are listed as being able to access the data at all points, in case someone is unable to continue the process, and for troubleshooting during the process.

What:
• What tools are you using to do the match?
• Which variables will be used to match the two datasets?
• What additional variables do you need?

When:
• When will the match be conducted?
• How long do you expect the process to take?
• How long will you have access to the data?
• What is the time period of the dataset you need?

Why:
• What is the purpose of your match?
• What do you plan to do with the data once it is matched?

How:
• How will data be shared?
• How will it be stored?
• How will it be destroyed at the end of the process?

Institutional Review Board

You may need to put your project through the Institutional Review Board, or IRB, if you are matching to a dataset that has protected data (ex: birth registry) or if you are matching to a dataset that came from research. Most matches, even after going through IRB, will be found to be 'exempt', but you may need to go through the process anyway. The IRB form will ask for the information reviewed above.

The decision as to whether your match requires IRB approval is specific to your institution. You should plan for the review to take at least 2 weeks if it is necessary, and possibly longer depending on how often the IRB for the institution meets.
Preparing data for matching

Now that you have thought through your match and put all the appropriate agreements and permissions in place, you are ready to begin manipulating the data to conduct the match.
# Creating a new data set using the existing "iris" data in R
df1 <- iris

# Run the 'sapply()', 'lapply()', or 'str()' function to determine data elements and their types
sapply(df1, class)
lapply(df1, class)
str(df1)

# To convert a numeric variable to a character variable, use the 'as.character()' function
df1$char_Var2 <- as.character(df1$Sepal.Width)

Removing spaces IN R

# Assuming 'str' holds a character value with unwanted spaces, for example:
str <- "John  Smith "

# Method 1: Remove all spaces in a string using the base R gsub() function
str1 <- gsub(" ", "", str)
str1

# Method 2: Remove all spaces in a string using the stringr package's str_remove_all() function
library("stringr")
str2 <- str_remove_all(str, " ")
str2

Removing punctuation IN R

Punctuation in names such as "O'Brien" or "Smith-Jones" is often missed or misplaced in data entry and can lead to missed matches.

# A sample string containing assorted punctuation
s <- "I am a,# new 1 comer,!to ,please $ help,me:out,here; a!$%bbbêéè)(/&ß2"

# Method 3: Use the base R gsub() function (allows choosing which punctuation to remove)
s3 <- gsub("[?.;!¡¿·$%&#,'/:()]", "", s)
s3

IN EXCEL

You can create a custom function as described here:
https://www.extendoffice.com/documents/excel/3296-excel-remove-all-punctuation.html

IN SAS

Again, the easiest way to do this is using the COMPRESS function, with the modifier 'P' for punctuation.
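Capitalization differences (for example, "MCDONALD" versus "McDonald") can also cause missed exact matches. As a minimal sketch, not from the original examples, one way to standardize case in R before matching is to convert the matching variables to a single case with the base R toupper() (or tolower()) function; the data frame and column names below are made up for illustration:

# Hypothetical data frame with a last name recorded in inconsistent case
names_df <- data.frame(lastname = c("MCDONALD", "McDonald", "mcdonald"), stringsAsFactors = FALSE)

# Convert the matching variable to upper case so all three values compare as equal
names_df$lastname_clean <- toupper(names_df$lastname)
names_df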
Now that your data is prepared, you are ready to perform the actual match. However, there are still many options to consider
to ensure the most complete and accurate match possible. This section will describe how to perform both exact matches and
probabilistic, or ‘fuzzy’ matches and provide considerations for both.
Matching using Match*Pro

Match*Pro provides excellent documentation for using the software. Refer to the help document that accompanies your download to perform your match using Match*Pro. The information below may be helpful in understanding some of the terms referenced in Match*Pro.

Using Statistical Packages (SAS and R)

Statistical packages such as SAS and R are useful tools for matching datasets. Once matches are completed, you can move directly into analysis without importing or exporting additional data. If these are software packages with which you are already familiar, they may be an ideal place to start for matching your datasets. Different processes for matching are detailed here, with example code for using these processes in both SAS and R. These processes likely exist in other programs, including SPSS and STATA, and searching for the topics listed below may help you uncover the details of how to use those statistical packages for matching your datasets.

Types of matches

We have mentioned throughout this document different types of matches, including exact and probabilistic matches. Here we will define those terms and provide sample code for conducting the matches.

Exact matches

What is an exact match?

Exact matching is looking for the exact same values in both datasets. For datasets where you have the same identification number in both datasets, exact matching may be more straightforward. However, when you are looking for matching names, you may find that exact matching is more of a challenge. As mentioned in the 'preparing data for matching' section, above, computer software, specifically statistical software packages, are not built to ignore spaces, punctuation, and capitalization in character variables. The exact match may be your first indication that you need to spend some additional time on data preparation.

In SAS

You are likely very familiar with the exact match process in SAS if you work with datasets often. Another name for the exact match would be 'merge'.

To complete this process, make sure the variables you are looking to match in each dataset are named the same. For comparison purposes, you may want to keep a duplicate of those variables in each dataset with different names, so you can identify differences between the variables and which dataset may need changing if there are issues. Each dataset will need to be sorted by the variables of interest. In the merge statement, the variables of interest should be named as the 'by' variables.

If using names, you may also wish to consider running a secondary exact match on 'last name' in one dataset matched to the 'given name' in another dataset, if you think the two may often be switched (see the sketch following the examples below).

Exact matching is often the first step in the matching process. You can feel very confident that these matches are 'true' matches, but repeating the matching process using some of these additional tools will give you more confidence that you have identified as many matches as you can.
EXAMPLE:

*Merge a hepatitis C dataset with vital death records;
*Create a new dataset with the hepatitis C data, with the merging variables of last name, first name, and birth date renamed to match the vital records dataset;
DATA DATASET1;
SET HEPC;
FIRST_NAME=GNAME;
LAST_NAME=LNAME;
DOB=BIRTH_DATE;
RUN;

*Create a second dataset for merging with the vital records, keeping a copy of the required variables with new names to monitor for errors in matching;
DATA DATASET2;
SET VITALS_DEATHS;
DEATH_FNAME=FIRST_NAME;
DEATH_LNAME=LAST_NAME;
DEATH_DOB=DOB;
RUN;

*Sort datasets by required variables;
PROC SORT DATA=DATASET1;
BY FIRST_NAME LAST_NAME DOB;
RUN;
PROC SORT DATA=DATASET2;
BY FIRST_NAME LAST_NAME DOB;
RUN;

*Merge datasets. Here we only want to keep the matches found in both datasets, and use the 'in' indicators to keep only those that are found in both;
DATA EXACTMATCHES;
MERGE DATASET1 (IN=A) DATASET2 (IN=B);
BY FIRST_NAME LAST_NAME DOB;
IF A AND B;
RUN;

Note only those records where all three variables are exactly the same will be identified as matches. Check the log to determine how many exact matches you have found.

In R

# Creating a sample data frame HepC
HepC <- data.frame(
  firstname = c("Carsten", "Gerd", "Robert", "Stephen", "Ralf", "Ravi"),
  lastname = c("Meier", "Buaer", "Hartmana", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1971),
  nationality = c("US", "US", "UK", "US", "Poland", "India"),
  stringsAsFactors = FALSE)

# Creating a sample data frame death
death <- data.frame(
  firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
  lastname = c("Meier", "Buaer", "Hartaman", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1970),
  death = c("Yes", "No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE)

# Merge the two datasets using exact matching in R
exact_match <- merge(HepC, death, by = c("firstname", "lastname", "dob"))
exact_match
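If you suspect that first and last names are sometimes switched between the two sources, the secondary pass suggested above can be run with the name columns crossed. The following is a minimal sketch, not part of the original example code, that reuses the HepC, death, and exact_match objects created above; base R merge() with by.x/by.y is only one way to do this:

# Secondary exact match with first and last name crossed, in case the two
# fields were switched during data entry in one of the datasets
swapped_match <- merge(HepC, death,
                       by.x = c("firstname", "lastname", "dob"),
                       by.y = c("lastname", "firstname", "dob"))
swapped_match

# Records in HepC with no exact match so far; these are candidates for the
# fuzzy matching methods described in the next section
key_hepc  <- paste(HepC$firstname, HepC$lastname, HepC$dob)
key_found <- paste(exact_match$firstname, exact_match$lastname, exact_match$dob)
no_match  <- HepC[!(key_hepc %in% key_found), ]
no_match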
Soundex

Soundex is a phonetic algorithm that helps to identify words that sound alike in English but look different. By using SOUNDEX, you can identify names that may have minor spelling differences in your records, because their soundex code will be the same. Here, your software will assign a 'soundex' to each variable you are interested in, and will compare it across the two datasets.
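SAS provides a SOUNDEX function, and in R the stringdist package (a dependency of the fuzzyjoin package used below) can generate and compare soundex codes. The following is a minimal illustrative sketch, not part of the original toolkit code, using made-up names:

# Generate soundex codes with stringdist::phonetic(); names with minor spelling
# differences (e.g., "Smith" and "Smyth") share the same code
library(stringdist)
library(fuzzyjoin)
phonetic(c("Smith", "Smyth", "Robert", "Rupert"))

# Join two small, made-up data frames wherever the soundex codes agree
# (method = "soundex" gives distance 0 when the codes are identical)
registry_names <- data.frame(lastname = c("Smith", "Krueger"), stringsAsFactors = FALSE)
death_names    <- data.frame(lastname = c("Smyth", "Kruger"), stringsAsFactors = FALSE)
soundex_match <- stringdist_join(registry_names, death_names,
                                 by = "lastname",
                                 method = "soundex",
                                 max_dist = 0,
                                 mode = "inner",
                                 distance_col = "dist")
soundex_match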
Fuzzy matching

# Creating a sample data frame HepC1
HepC1 <- data.frame(
  firstname = c("Carsten", "Gerd", "Admore", "Stephen", "Ralf", "Ravi"),
  lastname = c("Meier", "Buaer", "Patrick", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1971),
  nationality = c("US", "US", "UK", "US", "Poland", "India"),
  stringsAsFactors = FALSE)

# Creating a sample data frame death1
death1 <- data.frame(
  firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
  lastname = c("Meier", "Buaer", "Hoffman", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1970),
  death = c("Yes", "No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE)

# Install required packages
# install.packages("data.table")
# install.packages("dplyr")
# install.packages("fuzzyjoin")

# Attach packages
library(data.table)
library(dplyr)
library(fuzzyjoin)

# Pasting together the key matching variables in a single column
cols <- c('firstname', 'lastname', 'dob')

# Creating a new column `x` including the three matching variables in a single column
HepC1$x <- apply(HepC1[, cols], 1, paste, collapse = "-")
death1$x <- apply(death1[, cols], 1, paste, collapse = "-")

# Perform fuzzy matching using the Jaro-Winkler (JW) method
fuzzy_jw <- stringdist_join(HepC1, death1,
                            by = c("x"),
                            mode = 'left',       # use left join
                            ignore_case = TRUE,
                            method = "jw",       # use JW distance metric
                            max_dist = 0.5,
                            distance_col = 'dist') %>%
  group_by(x.x) %>%
  slice_min(order_by = dist, n = 1)
fuzzy_jw

# Fuzzy matching using the Levenshtein (a.k.a. edit distance) method
fuzzy_lv <- stringdist_join(HepC1, death1,
                            by = c("x"),
                            mode = 'left',       # use left join
                            method = "lv",       # use Levenshtein distance metric
                            max_dist = 5,
                            distance_col = 'dist') %>%
  group_by(x.x) %>%
  slice_min(order_by = dist, n = 1)
fuzzy_lv

A note on deduplication

It may be quickly apparent to you that the methods used here are also what is employed, or can be employed, to look for duplicates within a dataset. You may find duplicates through this process because of multiple 'near matches'. If you are concerned that you have many duplicates you have not identified, you may be able to use some of these techniques to look for duplicates in your dataset.
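As a minimal sketch of that idea, not taken from the original toolkit code, the same stringdist_join() function can be pointed at a single dataset by joining it to itself on the combined key and keeping pairs of different rows whose keys nearly match; the distance threshold of 2 is an arbitrary choice for illustration:

# Look for possible duplicates within HepC1 by joining the dataset to itself
# on the combined key and keeping pairs of different rows that nearly match
HepC1$row_id <- seq_len(nrow(HepC1))
possible_dupes <- stringdist_join(HepC1, HepC1,
                                  by = "x",
                                  method = "lv",
                                  max_dist = 2,        # arbitrary threshold for illustration
                                  mode = "inner",
                                  distance_col = "dist") %>%
  filter(row_id.x < row_id.y)    # drop self-matches and mirrored pairs
possible_dupes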
Regarding matching

The code listed here provides a starting place for probabilistic matching, but by no means covers all approaches. The descriptions give you language that you can use in additional searches, which will uncover other approaches to distance measures that may be of use in your matching programs. However, one can also lose themselves in a search for a 'complete' or 'perfect' match. This quest follows the law of diminishing returns. Anyone performing a match should accept that not all matches will be found, and determine a stopping point before they start, instead of trying to include every method they know for finding additional matches. If you are going to use multiple approaches in your code, assuming you are looking for one-to-one matches (one person with hepatitis in your dataset can only match to one death event, for example), you may want to remove matches as you go, so that you are only looking for matches for cases that do not already have a match.
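A minimal sketch of that 'remove matches as you go' idea, not part of the original code, uses dplyr's anti_join() to drop already-matched records before the next pass; it reuses the HepC1 and death1 objects and the combined key x created above:

# After an exact pass on the combined key, keep only the HepC1 records that
# did not match, so later fuzzy passes only search among the unmatched cases
exact_x   <- merge(HepC1, death1, by = "x")          # exact matches on the combined key
remaining <- anti_join(HepC1, exact_x, by = "x")     # drop already-matched records
remaining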
In general, datasets are good candidates for matching if they contain data at the same level as the dataset you want to match with. That means if you have hospital-level data and find another dataset with hospital-level data, they will be good for matching. Similarly, if you are matching with a registry, you will likely want another dataset with individual-level data.

Datasets that do not have the same level of data may still be good for comparison, or 'triangulation'. Even if you can't make the datasets match exactly, they may still be used to understand trends, or where there may be gaps in the available data.
The datasets listed here may be good places to start for either of these processes.
Insurance data: All Payer Claims and Medicaid

Many states maintain 'All Payer Claims' datasets. These are claims data collected by major medical insurers in the state. They contain billing data sent to and collected from most of the insurance providers in a given state. While there may be a unique identifier that would allow understanding the number of claims per individual, most of the identifiers here are likely stripped away, making matching impossible. However, claims data can help you understand other issues such as hepatitis A and B vaccines distributed or medication prescription practices.

Similar to All Payer Claims, Medicaid for each jurisdiction may also be able to provide a dataset based on claims. Medicaid data may or may not be included in the 'All Payer Claims' dataset depending on the jurisdiction.

Hospital discharge data

Many states maintain hospital discharge datasets. These datasets contain information on all discharges from acute care hospitals in the state, often at the patient level. This data can be useful for understanding hospital utilization and changes in incidence and burden of disease and injury. Information is included on select demographics, procedures, services, and diagnoses. While there is patient-level data, most identifiers are usually stripped from the available datasets, so matching is likely not possible.

BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is a telephone-based survey administered annually across the US. BRFSS asks a multitude of questions regarding health practices, chronic conditions, and use of prevention services. Local jurisdictions are also permitted to add questions to their local BRFSS. More information, including the survey instrument, can be found here: https://www.cdc.gov/brfss/index.html

YRBS

The Youth Risk Behavior Survey (YRBS) is a school-based survey that asks students about health-related behaviors, including information on sexual identity and sexual partners, tobacco use, dietary behaviors, physical activity, and activity leading to unintentional injury and violence. More information can be found here: https://www.cdc.gov/healthyyouth/data/yrbs/index.htm

Syndromic surveillance

Syndromic surveillance data is emergency room chief complaint information that is provided to states in real time. It is most useful for looking at patterns and changes over time in the use of emergency departments. Many states have worked on 'syndromes' to better understand overdoses in their state.

Emergency Medical Services (EMS) Data

While syndromic surveillance data provides emergency department data, EMS data contains information gathered from ambulance reports. This is another data source that has been used to deepen understanding of opioid use.

Drug treatment

Drug treatment data may belong in both overarching categories. An individual drug treatment facility would have patient-level data, but working with that data would likely be especially challenging due to federal regulations which protect the identities of people seeking drug treatment. However, the area of your health department that provides addiction services may have publicly available aggregate data regarding the utilization of drug treatment facilities by individuals in different areas of your jurisdiction. They may also provide some information regarding the demographics of the individuals utilizing these services.

The US Department of Housing and Urban Development (HUD)