PDF-Dataset Matching Toolkit

This document provides guidance on preparing for and conducting a data matching project. It outlines key steps to take before matching such as defining the purpose and questions to answer, identifying which datasets and variables are needed, and considering any required data use agreements. Guidance is also given on software options and preparing the data as well as the matching process and potential datasets to match with.

TOOLKIT

Dataset Matching
JANUARY 2023
ACKNOWLEDGEMENTS
NASTAD is grateful for the development of this resource by:
Shauna Onofrey, Rita Isabel Lechuga, Zakiya Grubbs, Fozia Ajmal and the following persons
who facilitated the review and whose contributions and support made this publication possible:
Henry Roberts; Laurie Barker; Kathleen Ly; Danica Kuncio; Lindsey Sizemore;
Nicola D. Thompson; Clarisse Tsang; Sarah New; Tony Fristachi

DEDICATION
In memory of Tony Fristachi
Contents
Introduction
Before You Match
Software Options and Data Preparation
Conducting the Match
Datasets for Consideration


Introduction

Congratulations! You are taking the first step towards understanding how your hepatitis data fits into a bigger picture! Maybe you are looking at co-infection of hepatitis B and C or trying to find out how many of the hepatitis cases in your registry gave birth this year. Regardless of your focus and the datasets you are using, this toolkit is designed to help you along the way. Here you will find information on paperwork you should consider before you start your match, how to prepare the data once you have it, and considerations across three software options for matching. Both exact matching and probabilistic or ‘fuzzy’ matching are covered here. If this guide does not provide the information you are looking for, please reach out to NASTAD via hepatitis@nastad.org for further assistance.

What is data matching?

Data matching, also called record linkage, is the process for identifying an entity across different data sources. In our situation, the entity is usually a person, but it could also be a hospital, service center, or other entity for which we collect data. When we match datasets, we can identify when that entity does or does not appear in both datasets and combine the data on that entity from both datasets to answer questions about the entity that you cannot answer from the data in one dataset.

Why match?

Matching may provide data you cannot otherwise get through standard surveillance processes. For example, you may not have a way to capture mortality data on your HCV cases, but this is data you want to capture in the care cascade. A match with vital records could provide you with that data.

Particular to PS21-2103, matching may provide solutions for the following:

Outcome 1.2.2
Improve monitoring of burden of disease and trends in hepatitis A, acute hepatitis B and acute hepatitis C infections
How matching can help: Matching with other disease or outcome datasets can be used to increase completeness of age, gender, race/ethnicity, and county of residence data.

Outcome 1.2.4
Improved monitoring of hepatitis C continuum of cure (CoC)
How matching can help: As mentioned above, matching can provide the data on mortality required by CDC in year 3. Matching may provide data about other outcomes or treatment data as well.

Outcome 1.2.5
Improved development and utilization of viral hepatitis surveillance data reports
How matching can help: Including data from matches, including co-infections among viral hepatitis types and with other syndemic illnesses, in surveillance reports provides more insight into the diseases in question and may make the data reports of interest to a wider audience.



Before You Match

Before you match your data, there are some steps you should take that will make the process easier.

Define your purpose.


Before you embark on matching data, you should define the purpose of your match. Is there a particular hypothesis you want to test?
Are you using the match for case finding? Is this a subset of a larger project? Use the questions, below, to define your match.

Questions to answer before matching (all examples using match with death record data):

1. What is the question you are trying to answer?
(Ex: Case finding: How many death records with HCV listed as a cause of death were never reported to our registry)
(Ex: Care continuum: How many cases of HCV with confirmatory tests in the last 5 years died following confirmatory test)

2. What will be the outcome of answering the question?
Does this have policy implications? Do you want to pursue publication? Are you adding the data to surveillance?

3. What datasets will you use to answer this question?
Dataset 1: (Ex: HCV registry)
Dataset 2: (Ex: Vital Records - Deaths)

4. What time period will you need for each dataset? How are you defining time?
Dataset 1: (Ex: HCV registry in full, HCV registry cases with confirmatory test (NAT) between 2017-2022)
Dataset 2: (Ex: Deaths with date of death in 2022)

5. Are there timing considerations between the two datasets?
(Does one event have to have happened first? Is there a period of time that needs to have passed between the two events?)
Which dataset will anchor the match?

6. Which variables are required for the match?
From Dataset 1 (Ex: Last_name, First_name, DOB, Address, Maiden_Name)
From Dataset 2 (Ex: LName, Gname, Birth_date, Street, City)

7. What other variables do you need to answer your question?
If variables are included in both datasets (Ex: race, ethnicity) which dataset’s variable will you use?
From Dataset 1 (Ex: Race, Ethnicity, Age)
From Dataset 2 (Ex: Causes of Death 1-22)

8. How will this data be used?
(Ex: Cases identified that were not in the registry will be added, and reporting hospital, if available, will be contacted for medical records. OR Aggregate data will be used to create a new bar in the HCV care continuum produced for and available on our website)

9. How readily available is each dataset?

10. Once you are ready to do the match, how long will it take you to get your data ready?

11. How long will it take to get the dataset for matching?

12. How often will the match be repeated?
(Ex: One time only, once a year)


Data use agreements


What is it?
A data use agreement is a formal agreement between two parties that have ownership of datasets. It outlines
how a cut of a dataset will be used, protected, and eventually destroyed by the party that is asking for the
data. The questions, above, should help you outline most of the information you will need to complete a data
use agreement. Other things you should consider are the timeline you need to complete the match, who will
be accessing the data, where and how they will be accessing the data. It is wise to put a timeline in your data
agreement that accounts for potential setbacks, and to include at least two people from your team who can
access the data in case of turnover, illness, etc.

Do I need one?
That answer depends on a few things. Who ‘owns’ the data you want to match? If it is a dataset you have access
to for your day-to-day work, you may not require a data use agreement. Datasets such as death records are
public and may also not require a data use agreement. However, if the dataset falls under another division or
organization, you should plan to fill out a data use agreement.

It also depends on the rules of your organization or the owner of the other dataset. Your organization may
require all data matches to have a data use agreement in place.

Even if it is not required, having a data use agreement may not be a bad idea. Having a document that
outlines the details of your matching project allows for managing everyone’s understanding and expectations
from the beginning.

What should I expect?


Be prepared to provide detailed thinking about your project. Just as you may feel like no one understands the
caveats of the hepatitis data like you do, so may these other groups feel about their data.

Some groups are more restrictive on the use of their data. You may need to compromise. Consider if you would
be comfortable providing them with the data to match and to receive back aggregate data. If there is hesitation
to that, or if you need person-level data, consider if you might start by matching a smaller subset (a month’s
worth of data) to see what the volume might be.

Do not expect that everyone will be immediately happy with your plans. Prepare to talk through and support
why you want to look at what you want to look at.


The data use agreement, itself, describes who, what, when, where, why and how of the actual data match.

Who:
• Who will be doing the match?
• Who will have access to the dataset before it’s matched, during the match process, and after the match?
• It’s wise to make sure at least two people in an organization are listed as being able to access the data at all points, in case someone is unable to continue the process, and for troubleshooting during the process.

What:
• What tools are you using to do the match?
• Which variables will be used to match the two datasets?
• What additional variables do you need?

When:
• When will the match be conducted?
• How long do you expect the process to take?
• How long will you have access to the data?
• What is the time period of the dataset you need?

Why:
• What is the purpose of your match?
• What do you plan to do with the data once it is matched?

How:
• How will data be shared?
• How will it be stored?
• How will it be destroyed at the end of the process?

The questions posed above will all be helpful in completing a data use agreement.

Institutional Review Board

You may need to put your project through the Institutional Review Board or IRB, if you are matching to a dataset that has protected data (ex: birth registry) or if you are matching to a dataset that came from research. Most matches, even once going through IRB, will be found to be ‘exempt’, but you may need to go through the process anyway. The IRB form will ask for the information reviewed, above.

The decision as to whether your match requires IRB approval is specific to your institution. You should plan for the review to take at least 2 weeks if it is necessary, and possibly longer depending on how often the IRB for the institution meets.



Software Options and Data Preparation

Now that you have thought through your match and gotten in place all the appropriate agreements and permissions,
you are ready to begin manipulating the data to conduct the match.

Software Options

If you are already using a statistical software package, such as SAS or R, methods are available using these packages to prepare your data and to do your match. These will be detailed in the coming pages. If your data is output into Excel, information is provided on how to prepare your data for matching in Excel before moving it into another system for matching. If your statistical software is not detailed here, the information provided under SAS and R should provide terms you can search for in your particular software.

If you do not have a statistical software package you typically use, or if your datasets are particularly large and you are concerned that they may not run well in your usual software, cancer registries have come up with a few options. The most recent at the time of writing this is a software called Match*Pro, which can be found here: https://seer.cancer.gov/tools/matchpro

Match*Pro, and its predecessor LinkPlus (https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm), were created for cancer registries, but can be used with any fixed-width, delimited, or NAACCR XML file. Even if your data is not currently in a fixed width or delimited format, it can likely be put in the correct format to use with Match*Pro.

Match*Pro can be downloaded, for free, here: https://seer.cancer.gov/tools/matchpro/download. To register, you must provide a valid e-mail address and your affiliation. A link for the download is sent instantly. Installation takes only a few minutes. It is a Java-based program, and on Mac computers it can only be used with a Windows emulator. The download includes a 181 page manual on how to use the software. This documentation is constantly updated, and as such, the details of how to conduct a match using Match*Pro will not be detailed here. However, the steps below will prepare your data for matching using any software.

Preparing data for matching

Regardless of the software you decide to use to match your data, and the variables on which you are matching, the data will need to be prepared in various ways. As you already know if you are familiar with working with data, while we may view NASTAD, nastad, and N A S T A D as the same word, most software does not. You will want to ensure variables are correctly formatted as character or numeric, and for character variables, remove excess spaces, consider removing punctuation, and change all characters to upper case to avoid missing matches. Here are some steps you will want to take to prepare your data in Excel, SAS or R:

Variable format (Character/Numeric)

A common frustration in merging data can be an error indicating your data are not in the same format. Even if you are using a variable that is populated with numeric data, like a date, your software may read it as text. Checking your data format at the beginning can help you avoid this issue.

IN EXCEL

Highlight the column in question and right click to choose the ‘format cells’ menu. There you can see how the variable is already formatted and change the format.

Be aware, however, that in pulling an Excel spreadsheet into SAS, blank values in the ‘guessing rows’ (the top rows in the spreadsheet that SAS uses to determine appropriate format) can change the format in the resulting dataset.


IN SAS

Run a ‘proc contents’ on each dataset to see the format of the variables and determine if any changes need to be made.

If you have a variable that is stored as character, but should be numeric, use the ‘input’ function:

new_variable = input(original_variable, informat.);

The informat tells SAS how to read the stored text, including date informats.

If you have a numeric variable that you think would be easier to convert to character for the purpose of the match, use the ‘put’ function to change it to character:

new_variable = put(original_variable, format.);

IN R

# Creating a new data set using the existing "iris" data in R
df1 <- iris

# Run the 'sapply()' or 'lapply()' or 'str()' function to determine data elements and their types
sapply(df1, class)
lapply(df1, class)
str(df1)

# to convert numeric to character variable using 'as.character()' function
df1$char_Var2 <- as.character(df1$Sepal.Width)

# to convert character to numeric using the 'as.numeric()' function
df1$num_Var2 <- as.numeric(df1$char_Var2)

Removing excess spaces

Excess spaces can be hard to identify but will affect matching. Removing excess spaces can be done through a ‘trim’ function in various software.

IN EXCEL

Use the formula TRIM(Cell). For example, putting the formula =TRIM(A1) in cell C1 will result in a trimmed value of cell A1.

IN SAS

Use the COMPRESS function. Note that COMPRESS removes every blank, including those between words.

newvariable=compress(oldvariable);

IN R

# Creating a string variable
str <- " 001 0000 AB 01 "

# Method 1: Remove all spaces in a string using the base R "gsub()" function
str1 <- gsub(" ", "", str)
str1

# Method 2: Remove all spaces in a string using the stringr package's str_remove_all() function
library("stringr")
str2 <- str_remove_all(str, " ")
str2
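The tools named above do not all treat spaces the same way: Excel's TRIM strips leading and trailing spaces and reduces repeated internal spaces to one, while SAS COMPRESS (with no modifiers) and the gsub() example remove every space. The following base R sketch shows the three behaviors side by side so you can choose deliberately. The identifier strings are hypothetical and are not taken from any real dataset.

```r
# Hypothetical identifier values with leading, trailing, and repeated spaces
ids <- c("  001 0000 AB 01 ", "001  0000 AB 01", "0010000AB01")

# Option A: strip leading and trailing spaces only
trimmed <- trimws(ids)

# Option B: also collapse runs of internal spaces to a single space
collapsed <- gsub(" +", " ", trimws(ids))

# Option C: remove every space
no_spaces <- gsub(" ", "", ids, fixed = TRUE)

print(trimmed)
print(collapsed)
print(no_spaces)
```

Option C is the most forgiving for identifiers, but for names it also joins separate words, so it may be worth keeping the original variable alongside the cleaned one.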


Removing punctuation

Punctuation in names such as “O’Brien” or “Smith-Jones” is often missed or misplaced in data entry and can lead to missed matches.

IN EXCEL

There are two ways to handle this in Excel:

1. You can use the “find” and “replace” functions to remove the punctuation. Do this by:
a. Press CTRL+F to open the ‘Find’ and ‘Replace’ functions. Choose ‘Replace’
b. Enter the first type of punctuation, such as “-”
c. Leave the ‘replace with’ field blank
d. Choose ‘Replace All’
e. Repeat with the next punctuation

2. You can create a custom function as described here: https://www.extendoffice.com/documents/excel/3296-excel-remove-all-punctuation.html

IN SAS

Again, the easiest way to do this is using the COMPRESS function, with the modifier ‘P’ for punctuation.

Example: newvariable=compress(oldvariable, , 'P');

IN R

s <- "I am a,# new 1 comer,!to ,please $ help,me:out,here; a!$%bbbêéè)(/&ß2"

# Remove punctuation with the stringr package's str_replace_all or the base R gsub
install.packages("stringr")
library(stringr)

# Method 1: uses str_replace_all() punct option
s1 <- str_replace_all(s, "[[:punct:]]", "")
s1

# Method 2: uses str_replace_all() alnum option (removes whitespaces as well)
s2 <- str_replace_all(s, "[^[:alnum:]]", "")
s2

# Method 3: uses the base R gsub() function (allows choosing which punctuation to remove)
s3 <- gsub("[?.;!¡¿·$%&#,'/:()]", "", s)
s3
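To connect the punctuation step back to names like “O’Brien” and “Smith-Jones”, here is a minimal base R sketch that reduces differently punctuated spellings to one comparison key. The name vectors and the helper make_key() are hypothetical and exist only for illustration. The helper also removes spaces, on the assumption that a hyphen in one dataset may have been typed as a space in the other.

```r
# Hypothetical last names as they might appear in two datasets
registry_names <- c("O'Brien", "Smith-Jones", "St. Pierre")
vital_names    <- c("OBrien",  "Smith Jones", "St Pierre")

# Hypothetical helper: drop punctuation, then drop spaces
make_key <- function(x) {
  x <- gsub("[[:punct:]]", "", x)   # apostrophes, hyphens, periods
  gsub(" ", "", x, fixed = TRUE)    # spaces typed in place of punctuation
}

registry_key <- make_key(registry_names)
vital_key    <- make_key(vital_names)

# TRUE where the cleaned names agree exactly
print(registry_key == vital_key)
```

Keeping the original name variable next to the key makes it easier to review matches by eye later.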


Changing Text to Uppercase

Having text variables in different combinations of cases can be easily overlooked and will keep your matches from being found. Luckily, it’s an easy problem to fix.

IN EXCEL

Use the formula UPPER
Example: enter into cell =UPPER(A1)

IN SAS

Use the function UPCASE
Example: newvariable=UPCASE(oldvariable);

IN R

s <- "I am a new comer to this work, please help me out here"

# Change case to upper case using the toupper() function in base R
s1 <- toupper(s)
s1

File format

If you are using SAS or R, you will likely be familiar with the format your files should be in. If you are using Match*Pro, your data will need to be one of three file formats, sometimes referred to as ‘flat files’:

• A Fixed Width file has data organized into columns in fixed positions within the file. For example, in a fixed width file, last name takes the first 30 characters in the file, and town name is the next 20 characters.
• A Delimited file has data where the columns are separated by a delimiter, such as a comma or a tab.
• A NAACCR XML file is a file format specific to cancer registries.

From most software, you should be able to export your dataset to a fixed width (text) or delimited (formatted text, CSV) file. This may be through an ‘export’ or ‘save as’ function, depending on the software you are using. If the solution is not obvious, search for how to export your data as a CSV or text file.
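As a rough illustration of the two flat-file layouts described above, the following base R sketch upper-cases a small dataset and writes it both as a comma-delimited file and as a fixed width file with a 30-character last name followed by a 20-character town. The data frame, column names, and temporary file paths are hypothetical; adjust the widths to whatever layout your matching software expects.

```r
# Hypothetical cleaned dataset to export for another matching tool
cases <- data.frame(
  last_name = c("Meier", "krueger"),
  town      = c("Springfield", "Riverton"),
  stringsAsFactors = FALSE
)

# Standardize case before exporting
cases$last_name <- toupper(cases$last_name)
cases$town      <- toupper(cases$town)

# Delimited file: comma-separated values
csv_path <- tempfile(fileext = ".csv")
write.csv(cases, csv_path, row.names = FALSE)

# Fixed width file: last name left-justified in 30 characters, town in 20
fixed_lines <- sprintf("%-30s%-20s", cases$last_name, cases$town)
fwf_path <- tempfile(fileext = ".txt")
writeLines(fixed_lines, fwf_path)

# Every fixed width line should have the same length
print(nchar(fixed_lines))
```

Values longer than the chosen width are not truncated by sprintf(), so it is worth checking the maximum length of each variable before settling on a layout.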



Conducting the Match

Now that your data is prepared, you are ready to perform the actual match. However, there are still many options to consider
to ensure the most complete and accurate match possible. This section will describe how to perform both exact matches and
probabilistic, or ‘fuzzy’ matches and provide considerations for both.

Matching using Match*Pro

Match*Pro provides excellent documentation for using the software. Refer to the help document that accompanies your download to perform your match using Match*Pro. The information, below, may be helpful in understanding some of the terms referenced in Match*Pro.

Using Statistical Packages (SAS and R)

Statistical packages such as SAS and R are useful tools for matching datasets. Once matches are completed, you can move directly into analysis without importing or exporting additional data. If these are software packages with which you are already familiar, they may be an ideal place to start for matching your datasets. Different processes for matching are detailed here, with example code for use in these processes for both SAS and R. These processes likely exist in other programs, including SPSS and STATA, and searching for the topics listed below may help you uncover the details of how to use these statistical packages for matching your datasets.

Types of matches

We have mentioned throughout this document different types of matches, including exact and probabilistic matches. Here we will define those terms and provide sample code for conducting the matches.

Exact matches

What is an exact match?

Exact matching is looking for the exact same values in both datasets. For datasets where you have the same identification number in both datasets, exact matching may be more straightforward. However, when you are looking for matching names, you may find that exact matching is more of a challenge. As mentioned in the ‘preparing data for matching’ section, above, computer software, specifically statistical software packages, are not built to ignore spaces, punctuation, and capitalization in character variables. The exact match may be your first indication that you need to spend some additional time on data preparation.

In SAS

You are likely very familiar with the exact match process in SAS, if you work with datasets often. Another name for the exact match would be ‘merge’.

To complete this process, make sure the variables you are looking to match in each dataset are named the same. For comparison purposes, you may want to keep a duplicate of those variables in each dataset with different names, so you can identify differences between the variables and which dataset may need changing if there are issues. Each dataset will need to be sorted by the variables of interest. In the merge statement, the variables of interest should be named as the ‘by’ variables.

If using names, you may also wish to consider running a secondary exact match on ‘last name’ in one dataset matched to the ‘given name’ in another dataset, if you think the two may often be switched.

Exact matching is often the first step in the matching process. You can feel very confident that these matches are ‘true’ matches, but repeating the matching process using some of these additional tools will give you more confidence that you have identified as many matches as you can.
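The secondary match on switched name fields described above can be sketched in base R with merge() and its by.x and by.y arguments. This is only an illustration under assumed inputs: the two data frames, their column names, and the identifiers case_id and vital_id are hypothetical.

```r
# Hypothetical records: the second person's names were entered in swapped order
registry <- data.frame(
  case_id    = c(1, 2),
  last_name  = c("MEIER", "CHANDER"),
  given_name = c("CARSTEN", "RAVI"),
  dob        = c("1949-03-01", "1971-07-15"),
  stringsAsFactors = FALSE
)
deaths <- data.frame(
  vital_id   = c(10, 20),
  last_name  = c("MEIER", "RAVI"),
  given_name = c("CARSTEN", "CHANDER"),
  dob        = c("1949-03-01", "1971-07-15"),
  stringsAsFactors = FALSE
)

# Primary exact match: last name to last name, given name to given name
primary <- merge(registry, deaths, by = c("last_name", "given_name", "dob"))

# Secondary exact match on registry records not yet matched:
# registry last name against death given name, and the reverse
remaining <- registry[!registry$case_id %in% primary$case_id, ]
swapped <- merge(
  remaining, deaths,
  by.x = c("last_name", "given_name", "dob"),
  by.y = c("given_name", "last_name", "dob")
)

print(primary$case_id)   # matched in the usual order
print(swapped$case_id)   # matched only after swapping the name fields
```

Matches found only through the swapped comparison are good candidates for human review before they are accepted.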


EXAMPLE:

*Merge a hepatitis C dataset with vital death records;
*Create a new dataset with the hepatitis C data with the merging variables of last name, first name, and birth date, renamed to match the vital records dataset;
DATA DATASET1;
SET HEPC;
FIRST_NAME=GNAME;
LAST_NAME=LNAME;
DOB=BIRTH_DATE;
RUN;

*Create a second dataset for merging with the vital records, keeping a copy of the required variables with new names to monitor for errors in matching;
DATA DATASET2;
SET VITALS_DEATHS;
DEATH_FNAME=FIRST_NAME;
DEATH_LNAME=LAST_NAME;
DEATH_DOB=DOB;
RUN;

*Sort datasets by required variables;
PROC SORT DATA=DATASET1;
BY FIRST_NAME LAST_NAME DOB;
RUN;
PROC SORT DATA=DATASET2;
BY FIRST_NAME LAST_NAME DOB;
RUN;

*Merge datasets. Here we only want to keep the matches found in both datasets, and use the IN indicators to keep only those that are found in both;
DATA EXACTMATCHES;
MERGE DATASET1 (IN=A) DATASET2 (IN=B);
BY FIRST_NAME LAST_NAME DOB;
IF A AND B;
RUN;

Note that only those records where all three variables are exactly the same will be identified as matches. Check the logs to determine how many exact matches you have found.

In R

# Creating a sample data frame HepC
HepC <- data.frame(
firstname = c("Carsten","Gerd","Robert","Stephen","Ralf", "Ravi"),
lastname = c("Meier","Buaer","Hartmana","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1971),
nationality = c("US","US","UK","US","Poland", "India"),
stringsAsFactors=FALSE)

# Creating a sample data frame death
death <- data.frame(
firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
lastname = c("Meier","Buaer","Hartaman","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1970),
death = c("Yes", "No", "Yes", "No", "No", "Yes"),
stringsAsFactors=FALSE)

# Merge the two datasets using exact matching in R
exact_match <- merge(HepC, death, by=c("firstname","lastname", "dob"))
exact_match
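The SAS example uses IN= indicators to tell which dataset each record came from. A comparable idea in base R, sketched below under assumed inputs, is to add a flag to each data frame, keep all rows with all = TRUE, and then split the result into exact matches and records that still need attention. The data frames, the identifiers case_id and vital_id, and the flag names are hypothetical.

```r
# Hypothetical registry and death data, each with its own identifier
registry <- data.frame(
  case_id   = 1:3,
  firstname = c("CARSTEN", "GERD", "RAVI"),
  lastname  = c("MEIER", "BUAER", "CHANDER"),
  dob       = c(1949, 1968, 1971),
  stringsAsFactors = FALSE
)
deaths <- data.frame(
  vital_id  = 101:103,
  firstname = c("CARSTEN", "GRED", "RAVEE"),
  lastname  = c("MEIER", "BUAER", "CHANDER"),
  dob       = c(1949, 1968, 1970),
  stringsAsFactors = FALSE
)

keys <- c("firstname", "lastname", "dob")

# Flag the source of each row, then keep every row from both datasets
registry$in_registry <- TRUE
deaths$in_deaths <- TRUE
all_rows <- merge(registry, deaths, by = keys, all = TRUE)

# Rows flagged from both sources are exact matches
exact <- all_rows[!is.na(all_rows$in_registry) & !is.na(all_rows$in_deaths), ]

# Registry rows with no exact match could go on to further matching
unmatched_registry <- all_rows[is.na(all_rows$in_deaths), ]

print(exact$case_id)
print(unmatched_registry$case_id)
```

Setting the unmatched records aside this way also keeps an identifier from each dataset available for any later matching step.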


Fuzzy Matching

‘Fuzzy Matching’ is known by a number of other names, including approximate string matching or probabilistic matching. Where exact matching finds values or strings that match 100%, fuzzy matching provides a likelihood that two values match. There are many matching techniques available for fuzzy matching. Most of these can be categorized as either a phonetic technique or an edit distance measurement. Phonetic techniques use knowledge about sounds to determine if two values are likely matches. Edit distance measurements look at the number of changes or keystrokes that would be needed to turn one value into another value. We will provide an example of each category of technique.

Regardless of technique used, fuzzy matching will result in some false matches. Human review can help to limit the number of false matches. You may also want to increase the variables you are including in your review. For example, let’s say you are going to use fuzzy matching only on individuals who share a date of birth, and you get two names that look like they may be likely matches. If you have the city of residence from both datasets and find that, in addition to the same date of birth, they also list the same city of residence, you may feel more confident in your match.

Fuzzy matching also takes time, both computing and human. You will need to make decisions about how many more potential matches you want to look for, and how long it may take the computer to find them. One important step to take is to remove your exact matches first, if possible. Since you already know those matches exist, you do not need to take processing time to search for those matches again.

SOUNDEX

Soundex is a phonetic algorithm that helps to identify words that sound alike in English, but look different. By using SOUNDEX, you can identify names that may have minor spelling differences in your records, because their soundex will be the same. Here, your software will assign a ‘soundex’ to each variable you are interested in, and will compare it across the two datasets.

SAS

*Continuing the example using the same datasets from above;
DATA SOUNDHCV;
SET HEPC;
SLNAME=SOUNDEX(LNAME);
SGNAME=SOUNDEX(GNAME);
DOB=BIRTH_DATE;
RUN;

DATA SOUNDDEATH;
SET VITALS_DEATHS;
SLNAME=SOUNDEX(LAST_NAME);
SGNAME=SOUNDEX(FIRST_NAME);
DEATH_DOB=DOB;
RUN;

PROC SORT DATA=SOUNDHCV; BY SLNAME SGNAME DOB; RUN;
PROC SORT DATA=SOUNDDEATH; BY SLNAME SGNAME DOB; RUN;

*This merge is based on those soundex values;
DATA SOUNDMATCH;
MERGE SOUNDHCV (IN=A) SOUNDDEATH (IN=B);
BY SLNAME SGNAME DOB;
IF A AND B;
RUN;

R

# Fuzzy matching using Soundex method:
# Creating a sample data frame HepC1
HepC1 <- data.frame(
firstname = c("Carsten","Gerd","Admore","Stephen","Ralf", "Ravi"),
lastname = c("Meier","Buaer","Patrick","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1971),
nationality = c("US","US","UK","US","Poland", "India"),
stringsAsFactors=FALSE)


# Creating a sample data frame death1
death1 <- data.frame(
firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
lastname = c("Meier","Buaer","Hoffman","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1970),
death = c("Yes", "No", "Yes", "No", "No", "Yes"),
stringsAsFactors=FALSE)

# Install required packages
#install.packages("data.table")
#install.packages("dplyr")
#install.packages("fuzzyjoin")

# Attach packages
library(data.table)
library(dplyr)
library(fuzzyjoin)

# Pasting together the key matching variables in a single column
cols <- c('firstname', 'lastname', 'dob')

# Creating a new column `x` including the three matching variables in a single column
HepC1$x <- apply(HepC1[, cols], 1, paste, collapse = "-")
death1$x <- apply(death1[, cols], 1, paste, collapse = "-")

# Fuzzy matching using Soundex method
fuzzy_soundex <- stringdist_join(HepC1, death1,
by=c("x"),
mode='left', # use left join
method = "soundex", # use soundex distance metric
max_dist=0,
distance_col='dist') %>%
group_by(x.x) %>%
slice_min(order_by=dist, n=1)

fuzzy_soundex

Distance Measurements

There are several ways to measure the ‘edit distance’ between two variables. Edit distance is the term for the measurement of how many operations are required to change one ‘string’ or word into another. Different standards of measure allow different combinations of deletions, insertions, substitutions, and transpositions. Many of these standards of measure can be used in fuzzy matching. For the example, below, in SAS, we are using the COMPLEV function, which calculates the Levenshtein edit distance between two text strings, which is the number of insertions, deletions, or replacements of single characters that are required to convert one string to the other. SAS has other built-in measures (for example COMPGED and SPEDIS), and further distance measures exist that are not built-in functions of SAS.

For this method, you set a cutoff distance, where you would want to review anything with a distance less than that cutoff. You can choose this cutoff in a few ways. You could review your data, if it is short enough, starting with a higher cutoff value and determine where you stop seeing matches. You could also run a frequency on your distance values, and set a cutoff based on that distribution.

It should be noted that to run this code, SAS is merging every possible pairing between your two datasets. Depending on the size of your datasets, this can take considerable processing time, and may crash your computer. SQL code that improves the efficiency of this process has been included here as well, for larger datasets. Remember to remove exact matches and variables not required for the match (but keep an identifier from each dataset!).

SAS

Here, the distance measures are actually calculated in the merge step. Remember to remove any matches you have already identified. For this code, you would also need to have renamed birth date as DOB. Note that a DATA step MERGE pairs records in order within each DOB rather than forming every combination, so when several people share a date of birth the SQL version is the safer choice. We are calculating 2 edit distances, one for first name and one for last name, and then combining the two for an overall value. The cutoff for printing was based on review of the data. The modifier ‘ILN’ shown in first name does the following:


i or I ignores the case in string-1 and string-2.
l or L removes leading blanks in string-1 and string-2 before comparing the values.
n or N removes quotation marks from any argument that is an n-literal and ignores the case of string-1 and string-2.

DATA FUZZY;
MERGE HCV (IN=A) VITALS_DEATHS (IN=B);
BY DOB;
IF A AND B;
FIRST_COMPLEV=COMPLEV(FIRST_NAME, GNAME, 'ILN');
LAST_COMPLEV=COMPLEV(LAST_NAME, LNAME);
TOTALCOMPLEV=FIRST_COMPLEV+LAST_COMPLEV;
RUN;

PROC PRINT DATA=FUZZY; WHERE TOTALCOMPLEV<5; RUN;

SQL

Using the datasets as listed above:

PROC SQL NOPRINT;
CREATE TABLE matrix2 AS
SELECT a.gname, b.first_name, a.lname, b.last_name, a.dob, b.dob AS death_dob,
a.caseid, b.vitalid,
COMPLEV(a.gname, b.first_name, 'ILN') AS compfirstname,
COMPLEV(a.lname, b.last_name) AS complastname,
CALCULATED compfirstname + CALCULATED complastname AS combocomp
FROM HEPC a INNER JOIN vitals_deaths b
ON a.dob=b.dob AND
(COMPLEV(a.gname, b.first_name, 'ILN') <= 5 OR
COMPLEV(a.lname, b.last_name) < 5)
ORDER BY a.lname;
QUIT;

R

# Fuzzy matching using the Levenshtein method
# Creating a sample data frame HepC1
HepC1 <- data.frame(
firstname = c("Carsten","Gerd","Admore","Stephen","Ralf", "Ravi"),
lastname = c("Meier","Buaer","Patrick","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1971),
nationality = c("US","US","UK","US","Poland", "India"),
stringsAsFactors=FALSE)

# Creating a sample data frame death1
death1 <- data.frame(
firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
lastname = c("Meier","Buaer","Hoffman","Wolff","Krueger", "Chander"),
dob = c(1949, 1968, 1930, 1957, 1966, 1970),
death = c("Yes", "No", "Yes", "No", "No", "Yes"),
stringsAsFactors=FALSE)

# Install required packages
#install.packages("data.table")
#install.packages("dplyr")
#install.packages("fuzzyjoin")

# Attach packages
library(data.table)
library(dplyr)
library(fuzzyjoin)

# Pasting together the key matching variables in a single column
cols <- c('firstname', 'lastname', 'dob')

# Creating a new column `x` including the three matching variables in a single column
HepC1$x <- apply(HepC1[, cols], 1, paste, collapse = "-")
death1$x <- apply(death1[, cols], 1, paste, collapse = "-")


# Fuzzy matching using the Levenshtein (a.k.a. edit distance) method
fuzzy_lv <- stringdist_join(HepC1, death1,
    by = c("x"),
    mode = 'left',      # use left join
    method = "lv",      # use Levenshtein distance metric
    max_dist = 5,
    distance_col = 'dist') %>%
  group_by(x.x) %>%
  slice_min(order_by = dist, n = 1)
fuzzy_lv

# Fuzzy matching:
# Creating a sample data frame HepC1
HepC1 <- data.frame(
  firstname = c("Carsten", "Gerd", "Admore", "Stephen", "Ralf", "Ravi"),
  lastname = c("Meier", "Buaer", "Patrick", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1971),
  nationality = c("US", "US", "UK", "US", "Poland", "India"),
  stringsAsFactors = FALSE)

# Creating a sample data frame death
death1 <- data.frame(
  firstname = c("Carsten", "Gred", "Robert", "Stephn", "Ralff", "Ravee"),
  lastname = c("Meier", "Buaer", "Hoffman", "Wolff", "Krueger", "Chander"),
  dob = c(1949, 1968, 1930, 1957, 1966, 1970),
  death = c("Yes", "No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE)

# Install required packages
#install.packages("data.table")
#install.packages("dplyr")
#install.packages("fuzzyjoin")

# Attach packages
library(data.table)
library(dplyr)
library(fuzzyjoin)

# Pasting together the key matching variables in a single column
cols <- c( 'firstname' , 'lastname' , 'dob' )

# Creating a new column `x` including the three matching variables in a single column
HepC1$x <- apply( HepC1[ , cols ] , 1 , paste , collapse = "-" )
death1$x <- apply( death1[ , cols ] , 1 , paste , collapse = "-" )

# Perform fuzzy matching using the Jaro-Winkler (JW) method
fuzzy_jw <- stringdist_join(HepC1, death1,
    by = c("x"),
    mode = 'left',      # use left join
    ignore_case = TRUE,
    method = "jw",      # use jw distance metric
    max_dist = 0.5,
    distance_col = 'dist') %>%
  group_by(x.x) %>%
  slice_min(order_by = dist, n = 1)
fuzzy_jw

A note on deduplication

You may quickly notice that the methods used here can also be used to look for duplicates within a dataset. You may find duplicates through this process because of multiple ‘near matches’. If you are concerned that you have many duplicates you have not identified, you may be able to use some of these techniques to look for duplicates in your dataset.

Regarding matching

The code listed here provides a starting place for probabilistic matching, but by no means covers all approaches. The descriptions give you language that you can use in additional searches, which will uncover other approaches to distance measures that may be of use in your matching programs. However, you can also lose yourself in a search for a 'complete' or 'perfect' match. This quest follows the law of diminishing returns. Anyone performing a match should accept that not all matches will be found, and determine a stopping point before they start, instead of trying to include every method they know for finding additional matches. If you are going to use multiple approaches in one program, assuming you are looking for one-to-one matches (one person with hepatitis in your dataset can only match to one death event, for example), you may want to remove matches as you go, so that you are only looking for matches for cases that do not yet have a match.
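The idea of removing matches as you go can be sketched in base R as follows. This is a minimal illustration and not the toolkit's own method: the data frames `cases` and `deaths`, the identifiers `caseid` and `vitalid`, and the cutoff of 2 are all hypothetical, and `adist()` from base R stands in for the distance functions shown earlier.

```r
# Hypothetical example data; names, identifiers and values are assumptions.
cases <- data.frame(
  caseid    = 1:4,
  firstname = c("Carsten", "Gerd", "Stephen", "Ravi"),
  lastname  = c("Meier", "Bauer", "Wolff", "Chander"),
  dob       = c(1949, 1968, 1957, 1971),
  stringsAsFactors = FALSE)

deaths <- data.frame(
  vitalid   = 101:104,
  firstname = c("Carsten", "Gred", "Stephn", "Robert"),
  lastname  = c("Meier", "Bauer", "Wolff", "Hoffman"),
  dob       = c(1949, 1968, 1957, 1930),
  stringsAsFactors = FALSE)

# Pass 1: exact match on first name, last name and date of birth.
exact <- merge(cases, deaths, by = c("firstname", "lastname", "dob"))
exact_pairs <- exact[, c("caseid", "vitalid")]

# Remove records that already have a match before the next pass.
cases_left  <- cases[!cases$caseid %in% exact_pairs$caseid, ]
deaths_left <- deaths[!deaths$vitalid %in% exact_pairs$vitalid, ]

# Pass 2: require the same date of birth, then compute the Levenshtein
# distance between the full names of each remaining candidate pair.
cand <- merge(cases_left, deaths_left, by = "dob",
              suffixes = c(".case", ".death"))
cand$dist <- mapply(
  function(a, b) as.integer(adist(a, b, ignore.case = TRUE)),
  paste(cand$firstname.case, cand$lastname.case),
  paste(cand$firstname.death, cand$lastname.death))

# Hypothetical cutoff; choose yours after reviewing your own data.
fuzzy_pairs <- cand[cand$dist <= 2, c("caseid", "vitalid", "dist")]

# Combine the passes, recording which pass produced each match.
all_pairs <- rbind(
  data.frame(exact_pairs, dist = 0L, pass = "exact"),
  data.frame(fuzzy_pairs, pass = "fuzzy"))
all_pairs
```

Keeping an identifier from each dataset (`caseid` and `vitalid` here) is what makes it possible to drop matched records between passes and to record which pass produced each match.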



Datasets for Consideration

In general, datasets are good for matching if they hold data at the same level as the dataset you want to match with. That means, if you have hospital-level data and find another dataset with hospital-level data, they will be good for matching. Similarly, if you are matching with a registry, you will likely want another dataset with individual-level data.
Datasets that do not have the same level of data may be good for comparison, or ‘triangulation’. Even if you can’t make the datasets match exactly, they may still be used to understand trends, or where there may be gaps in the available data. The datasets listed here may be good places to start for either of these processes.

Datasets to Consider for Matching

The datasets listed here are available in many jurisdictions, and usually have person-level data. They may be useful in further exploring populations to prioritize for interventions, the current implementation levels of known interventions, and long-term outcomes of viral hepatitis.

Other Hepatitis Datasets

Matching to other viral hepatitis datasets available in your jurisdiction is a great place to start on matching. Often, these datasets may be more readily available to you, and you are aware of some of the challenges with matching with them. They may provide you with an opportunity to explore some of the methods described here without worrying about a data use agreement or data restrictions.

Matching between hepatitis viruses can provide a lot of insight into both diseases. For example, a match between hepatitis A cases and hepatitis C cases that shows a rising proportion of hepatitis A cases occurring annually among people who have ever been infected with hepatitis C may be used to increase vaccination efforts among people with hepatitis C. If more risk behavior data is available among the hepatitis C cases, you may be able to detect a shift in the populations affected by hepatitis A in your jurisdiction.

Other ‘Syndemic’ infectious diseases

Matching to other diseases where risk behaviors overlap, including HIV, sexually transmitted diseases, and TB, is another great place to start to try these techniques. Understanding where these diseases overlap may allow for smarter deployment of shared services when available. If allowed by your jurisdiction, and depending on timing of infections, these may also be opportunities for improving completeness of demographic and risk data.

Vital Records – Deaths

Death records are considered public records and should be easy to get access to. Depending on your jurisdiction and the goal of your match, you may want to ask if you can get access to ‘preliminary’ data. Usually, death records go through a process of coding and data cleaning that can delay the release of the final dataset, sometimes up to a year. Preliminary records may be available sooner and can be used with some caveats.

Death records can be used in a number of ways. They may be used for case-finding (deaths with hepatitis listed that were not previously reported), understanding your care cascade, looking at average age of death for people reported with hepatitis, what people with hepatitis are listed as dying from, etc.
Death records utilize International Classification of Diseases (ICD) codes to code causes of death as well as conditions contributing to deaths. Where multiple causes of death are reported, causes are put through an algorithm to determine what is the ‘underlying’ cause of death. It is recommended that when using the cause of death variables you look at all causes, not just the underlying cause, for this reason.

Many jurisdictions have been able to gain better access to death records due to COVID, and if COVID is captured in the same system as your hepatitis data, there may be a process in place that could be used as a template for matching these systems.

Vital Records – Births

Depending on your jurisdiction, birth records may be harder to access. Mothers and newborns are considered protected, and there are more restrictions on this data.

When matching to this dataset, consider WHO you are matching: it may be best to match women of childbearing age to women who have given birth, rather than trying to match to the babies. This would allow you to understand the proportion of people with viral hepatitis who are

Hepatitis may be listed as an underlying condition on the birth record, but it may be a checklist rather than a coded variable.

Vaccine Registry

Your state’s vaccine registry is another system that may provide useful data for matching. Knowing hepatitis A and B vaccination status of people living with HCV, for example, may allow you to better target vaccination campaigns. This is also a system where matching with COVID has already been explored, and a template may exist for a match.

Many vaccine registries have more complete vaccine histories on children than adults, and the completeness can vary across jurisdictions. Having a contact who knows the system well is important for any of these matches, but for the vaccine registries you will want someone who can give you a sense of whether the registry will have complete data on what you want.

Cancer Registry

Cancer registries are often accustomed to matching and staff may be able to assist you with the process. This is another dataset where you should be able to get person-level data. Cancer registry matches can provide more information for your care cascade, including a clearer understanding of long-term outcomes of viral hepatitis. You may want to match to all types of cancer, not just liver cancer.

Voter Registration

Some states have been able to match with voter registration data to improve and support the demographic data in their surveillance system.

Electronic Medical Records

Electronic Medical Records (EMRs), also called Electronic Health Records (EHRs), are the electronic records held by medical institutions. The ability for these systems to report appropriate data to public health systems is a big part of ‘Meaningful use’. Matching with a single health system’s EMR can help you understand where there are gaps in surveillance, including how good their reporting is, what data they have that is not being consistently reported, and what data they do not have.

Aggregate Data Sets

Even if you are not able to match person-level data, summary data may be used to provide insight into a question you would like to answer. The following datasets may be available in your jurisdiction and may provide useful information that can be included in annual reports. These datasets may also aid you in estimating denominators for your jurisdiction.

Census

US census data is likely one of the first data sources you turn to for understanding the denominators in your state or jurisdiction. It is a helpful reminder that the census can provide aggregate data down to the census tract, including data on select demographics.
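As a small illustration of using census counts as denominators, the sketch below turns case counts into rates per 100,000 population. Everything in it is hypothetical: the county labels, case counts and population figures are made up for the example, and the column names are assumptions rather than fields from any real census extract.

```r
# Hypothetical counts; neither the cases nor the populations are real data.
cases_by_county <- data.frame(
  county = c("A", "B", "C"),
  cases  = c(120, 45, 8),
  stringsAsFactors = FALSE)

census_pop <- data.frame(
  county     = c("A", "B", "C"),
  population = c(400000, 90000, 10000),
  stringsAsFactors = FALSE)

# Join numerators to denominators on the shared geography, then compute
# the rate per 100,000 population.
rates <- merge(cases_by_county, census_pop, by = "county")
rates$rate_per_100k <- round(rates$cases / rates$population * 100000, 1)
rates
```

In this made-up example the county with the fewest cases has the highest rate, which is the kind of pattern that only becomes visible once a denominator is applied.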


Insurance data: All Payer Claims and Medicaid

Many states maintain ‘All Payer Claims’ datasets. These are claims data collected by major medical insurers in the state. These contain billing data sent to and collected from most of the insurance providers in a given state. While there may be a unique identifier that would allow understanding the number of claims per individual, most of the identifiers here are likely stripped away, making matching impossible. However, claims data can help understand other issues such as hepatitis A and B vaccines distributed or medication prescription practices.

Similar to All Payer Claims, Medicaid for each jurisdiction may also be able to provide a dataset based on claims. Medicaid data may or may not be included in the ‘All Payer Claims’ dataset depending on the jurisdiction.

Hospital discharge data

Many states maintain hospital discharge datasets. These datasets contain information on all discharges from acute care hospitals in the state, often at the patient level. This data can be useful for understanding hospital utilization and changes in incidence and burden of disease and injury. Information is included on select demographics, procedures, services, and diagnoses. While there is patient-level data, most identifiers are usually stripped from the available datasets, so matching is likely not possible.

BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is a telephone-based survey administered annually across the US. BRFSS asks a multitude of questions regarding health practices, chronic conditions, and use of prevention services. Local jurisdictions are also permitted to add questions to their local BRFSS. More information, including the survey instrument, can be found here: https://www.cdc.gov/brfss/index.html

YRBSS

The Youth Risk Behavior Surveillance System (YRBSS) is a system of surveys collected by CDC, states, territories, tribal governments, and health agencies focused on school-aged children. It monitors health-related behaviors among youth and young adults including alcohol and drug use, sexual behaviors including information on sexual identity and sexual partners, tobacco use, dietary behaviors, physical activity, and activity leading to unintentional injury and violence. More information can be found here: https://www.cdc.gov/healthyyouth/data/yrbs/index.htm

Syndromic surveillance

Syndromic surveillance data is emergency room chief complaint information which is provided to states in real time. It is most useful for looking at patterns and changes over time in the use of emergency departments. Many states have worked on ‘syndromes’ to better understand overdoses in their state.

Emergency Medical Services (EMS) Data

While syndromic surveillance data provides emergency department data, EMS data contains information gathered from ambulance reports. This is another data source that has been used to deepen understanding of opioid use.

Drug treatment

Drug treatment data may belong in both overarching categories. An individual drug treatment facility would have patient-level data, but working with that data would likely be especially challenging due to federal regulations which protect the identities of people seeking drug treatment. However, the area of your health department that provides addiction services may have publicly available data in aggregate regarding the utilization of drug treatment facilities by individuals in different areas of your jurisdiction. They may also provide some information regarding the demographics of the individuals utilizing these services.

The US Department of Housing and Urban Development (HUD)

HUD provides data by state on sheltered and unsheltered homeless individuals through their data exchange. This may be another denominator worth understanding, particularly considering hepatitis A outbreaks in recent years. https://www.hudexchange.info/programs/hdx/
