Chapter 123
Data Matching –
Optimal and Greedy
Introduction
This procedure is used to create treatment-control matches based on propensity scores and/or observed covariate
variables. Both optimal and greedy matching algorithms are available (as two separate procedures), along with
several options that allow the user to customize each algorithm for their specific needs. The user is able to choose
the number of controls to match with each treatment (e.g., 1:1 matching, 1:k matching, and variable (full)
matching), the distance calculation method (e.g., Mahalanobis distance, propensity score difference, sum of rank
differences, etc.), and whether or not to use calipers for matching. The user is also able to specify variables whose
values must match exactly for both treatment and controls in order to assign a match. NCSS outputs a list of
matches by match number along with several informative reports and optionally saves the match numbers directly
to the database for further analysis.
Matching Overview
Observational Studies
In observational studies, investigators do not control the assignment of treatments to subjects. Consequently, a
difference in covariates may exist between treatment and control groups, possibly resulting in undesired biases.
Matching is often used to balance the distributions of observed (and possibly confounding) covariates.
Furthermore, in many observational studies, there exists a relatively small number of treatment group subjects as
compared to control group subjects, and it is often the case that the costs associated with obtaining outcome or
response data are high for both groups. Matching is used in this scenario to reduce the number of control subjects
included in the study. Common matching methods include Mahalanobis metric matching, propensity score
matching, and average rank sum matching. Each of these will be discussed later in this chapter. For a thorough
treatment of data matching for observational studies, the reader is referred to chapter 1.2 of D'Agostino, Jr.
(2004).
© NCSS, LLC. All Rights Reserved.
NCSS Statistical Software NCSS.com
$$ e(x_i) = \Pr(Z_i = 1 \mid X_i = x_i). $$
It is assumed that the Zi’s are independent, given the X’s. The observed covariates, xi, are not necessarily the same
covariates used in the matching algorithm, yi, although they could be. Rosenbaum and Rubin (1985a) suggest
using the logit of the estimated propensity score for matching because the distribution of transformed scores is
often approximately normal. The logit of the propensity score is defined as
$$ q(x) = \log\left(\frac{1 - e(x)}{e(x)}\right). $$
Matching on the observed propensity score (or logit propensity score) can balance the overall distribution of
observed covariates between the treatment and control groups. The propensity score is often calculated using
logistic regression or discriminant analysis with the treatment variable as the dependent (group) variable and the
background covariates as the independent variables. Research suggests that care must be taken when creating the
propensity score model (see Austin et al. (2007)). For more information about logistic regression or discriminant
analysis, see the corresponding chapters in the NCSS manuals.
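The logit transformation defined above can be sketched in a few lines of Python. This is only an illustration: the function name `logit_propensity` is hypothetical, and rejecting scores outside the open interval (0, 1) is an assumption made here because the logarithm is undefined at those values.

```python
import math


def logit_propensity(e):
    """Return q = log((1 - e) / e), the logit form used in this chapter."""
    # Assumption: scores must lie strictly between 0 and 1 for the log to exist.
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    return math.log((1.0 - e) / e)


scores = [0.2, 0.5, 0.8]
logits = [logit_propensity(e) for e in scores]
```

With this sign convention a score of 0.5 maps to 0, and scores above 0.5 map to negative values.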
treatment subjects matched is less than the total number of treatment subjects in the reservoir. Rosenbaum and
Rubin (1985b) present strong reasons for avoiding incomplete matched-pair samples.
Distance Measures
The complete list of possible distance measures available in NCSS is as follows:
1. Mahalanobis Distance within Propensity Score Calipers (no matches outside calipers)
2. Mahalanobis Distance within Propensity Score Calipers (matches allowed outside calipers)
3. Mahalanobis Distance including the Propensity Score (if specified)
4. Propensity Score Difference within Propensity Score Calipers (no matches outside calipers)
5. Propensity Score Difference
6. Sum of Rank Differences within Propensity Score Calipers (no matches outside calipers)
7. Sum of Rank Differences within Propensity Score Calipers (matches allowed outside calipers)
8. Sum of Rank Differences including the Propensity Score (if specified)
Distance measures #2 and #7, where matches are allowed outside calipers in caliper matching, are only available
with greedy matching. All others can be used with both the greedy and optimal matching algorithms.
For distance measures that involve propensity score calipers, the caliper size is determined by the user-specified
radius, c. For any treatment subject i, the jth control subject is included in the ith treatment caliper if
$$ |q(x_i) - q(x_j)| \le c $$
where $q(x_i) = e(x_i)$ is the propensity score based on the covariates $x_i$. If the logit transformation is used in the
analysis, then $q(x) = \log((1 - e(x)) / e(x))$. The width of each caliper is equal to 2c.
Table 2.3.1 from Cochran and Rubin (1973). Percent reduction in bias of x for caliper matching to within
$\pm a\sqrt{(\sigma_1^2 + \sigma_2^2)/2}$

a      $\sigma_1^2/\sigma_2^2 = 1/2$    $\sigma_1^2/\sigma_2^2 = 1$    $\sigma_1^2/\sigma_2^2 = 2$
0.2    0.99                             0.99                           0.98
0.4    0.96                             0.95                           0.93
0.6    0.91                             0.89                           0.86
0.8    0.86                             0.82                           0.77
1.0    0.79                             0.74                           0.69
The caliper radius to use depends on the desired bias reduction (table body), the coefficient a, and the ratio of the
treatment group sample variance of q(x), $\sigma_1^2$, to the control group sample variance of q(x), $\sigma_2^2$. “Loose
Matching” corresponds to a ≥ 1.0, while “Tight Matching” corresponds to a ≤ 0.2. The caliper radius is calculated
as
$$ c = a\sqrt{(\sigma_1^2 + \sigma_2^2)/2} = a \times \text{SIGMA} $$
NCSS allows you to choose the caliper radius using the syntax “a*SIGMA”, where you specify the value for a
(e.g. “0.2*SIGMA”) or by entering the actual value directly for c (e.g. “0.5”). In the case of the former, the
program calculates the variances of the treatment and control group propensity scores for you and determines the
pooled standard deviation, sigma. You may want to run descriptive statistics on the treatment and control group
propensity scores to determine the variance ratio of your data in order to find the appropriate value of a (from the
table above) for your research objectives.
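The caliper calculation described above can be sketched as follows. This is a minimal illustration, not the NCSS implementation: the function names `caliper_radius` and `in_caliper` and the sample values are hypothetical, and it assumes the usual (n − 1) sample variance.

```python
import math
import statistics


def caliper_radius(a, treatment_q, control_q):
    """Return c = a * SIGMA, with SIGMA = sqrt((var_t + var_c) / 2)."""
    var_t = statistics.variance(treatment_q)  # sample variance, treatment group
    var_c = statistics.variance(control_q)    # sample variance, control group
    sigma = math.sqrt((var_t + var_c) / 2.0)
    return a * sigma


def in_caliper(q_i, q_j, c):
    """True if control score q_j lies within radius c of treatment score q_i."""
    return abs(q_i - q_j) <= c


# Hypothetical (logit) propensity scores, for illustration only.
treatment_q = [1.0, 2.0, 3.0]   # sample variance 1.0
control_q = [0.0, 2.0, 4.0]     # sample variance 4.0
c = caliper_radius(0.2, treatment_q, control_q)
```

Here SIGMA is sqrt(2.5), so "0.2*SIGMA" gives a radius of roughly 0.316; a control whose score differs by 0.3 falls inside the caliper and one that differs by 0.4 does not.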
Data Structure
The propensity scores and covariate variables must each be entered in individual columns in the database. Only
numeric values are allowed in propensity score and covariate variables. Blank cells or non-numeric (text) entries
are treated as missing values. If the logit transformation is used, values in the propensity score variable that are
not between zero and one are also treated as missing. A grouping variable containing two (and only two) unique
groups must be present. A data label variable is optional. The following is a subset of the Propensity dataset,
which illustrates the data format required for the greedy and optimal data matching procedures.
Procedure Options
This section describes the options available in both the optimal and greedy matching procedures.
Variables Tab
Specify the variables to be used in matching and storage, along with the matching options.
Data Variables
Grouping Variable
Specify the variable that contains the group assignment for each subject. The values in this variable may be text or
numeric, but only two unique values are allowed. One value should designate the treatment group and the other
should designate the control group. The value assigned to the treatment group should be entered under “Treatment
Group” to the right.
Treatment Group
Specify the value in the Grouping Variable that is used to designate the treatment group. This value can be text or
numeric.
Use Logit
This option specifies whether or not to use the logit transformation on the propensity score. If selected, all
calculations and reports will be based on the logit propensity score (where applicable).
Covariate Variable(s)
Specify variables to be used in distance calculations between treatment and control subjects. Only numeric values
are allowed. Text values are treated as missing. Covariate variables are optional; however, if no propensity score
variable is specified, you must specify at least one covariate variable. If the distance calculation method involves
only the propensity score (e.g. propensity score difference) and one or more covariate variables are specified, then
the covariate variables are only used in group comparison reports (they are not used in matching nor to determine
whether or not a row contains missing values during matching).
Storage Variable
Maximum Iterations
Specify the number of optimization iterations to perform before exiting. You may choose a value from the list, or
enter your own. This option is available in order to avoid an infinite loop. We have found that as the number of
Matches per Treatment increases, it takes more and more iterations in order to arrive at a solution.
Matching Options
11. Ri,p and Rj,p are the ranks of the pth covariate values or propensity score for subjects i and j, respectively.
Average ranks are used in the case of ties.
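The average-rank rule for ties can be sketched as below. The helper name `average_ranks` is hypothetical, and ranking all subjects of a variable together in one pooled list is an assumption of this sketch rather than something the text states.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block [start, end] over all positions sharing one value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 0-based, ranks are 1-based.
        avg = ((start + 1) + (end + 1)) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


ranks = average_ranks([10, 20, 20, 30])
```

The two tied values would otherwise occupy ranks 2 and 3, so each receives 2.5.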
The options are:
• Mahalanobis Distance within Propensity Score Calipers (no matches outside calipers)
$$
d(i,j) = \begin{cases}
(u_i - u_j)^T C^{-1} (u_i - u_j) & \text{if } |q(x_i) - q(x_j)| \le c \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
\infty & \text{otherwise}
\end{cases}
$$
• Mahalanobis Distance within Propensity Score Calipers (matches allowed outside calipers)
$$
d(i,j) = \begin{cases}
(u_i - u_j)^T C^{-1} (u_i - u_j) & \text{if } |q(x_i) - q(x_j)| \le c \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
|q(x_i) - q(x_j)| & \text{if } |q(x_i) - q(x_j)| > c \text{ for all unmatched } j \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
\infty & \text{otherwise}
\end{cases}
$$
The absolute difference, $|q(x_i) - q(x_j)|$, is only used in assigning matches if there are no available controls
for which $|q(x_i) - q(x_j)| \le c$.
• Propensity Score Difference within Propensity Score Calipers (no matches outside calipers)
$$
d(i,j) = \begin{cases}
|q(x_i) - q(x_j)| & \text{if } |q(x_i) - q(x_j)| \le c \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
\infty & \text{otherwise}
\end{cases}
$$
• Sum of Rank Differences within Propensity Score Calipers (no matches outside calipers)
$$
d(i,j) = \begin{cases}
\sum_p |R_{i,p} - R_{j,p}| & \text{if } |q(x_i) - q(x_j)| \le c \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
\infty & \text{otherwise}
\end{cases}
$$
• Sum of Rank Differences within Propensity Score Calipers (matches allowed outside calipers)
$$
d(i,j) = \begin{cases}
\sum_p |R_{i,p} - R_{j,p}| & \text{if } |q(x_i) - q(x_j)| \le c \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
|q(x_i) - q(x_j)| & \text{if } |q(x_i) - q(x_j)| > c \text{ for all unmatched } j \text{ and } FM_{i,l} = FM_{j,l} \text{ for all } l \\
\infty & \text{otherwise}
\end{cases}
$$
The absolute difference, $|q(x_i) - q(x_j)|$, is only used in assigning matches if there are no available controls
for which $|q(x_i) - q(x_j)| \le c$.
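The "no matches outside calipers" distance definitions listed above share one pattern: a base distance is returned when the pair is inside the caliper and agrees on every forced match variable, and infinity is returned otherwise. The following is a rough sketch of that pattern under stated assumptions. All function names are hypothetical, the inverse covariance matrix `c_inv` is assumed to be supplied by the caller, and forced match values are represented as tuples purely for illustration.

```python
import math


def mahalanobis_form(u_i, u_j, c_inv):
    """Quadratic form (u_i - u_j)^T C^{-1} (u_i - u_j); c_inv is C^{-1}."""
    diff = [a - b for a, b in zip(u_i, u_j)]
    n = len(diff)
    return sum(diff[r] * c_inv[r][s] * diff[s] for r in range(n) for s in range(n))


def rank_sum(r_i, r_j):
    """Sum over variables p of |R_{i,p} - R_{j,p}|."""
    return sum(abs(a - b) for a, b in zip(r_i, r_j))


def within_caliper_distance(base, q_i, q_j, c, fm_i, fm_j):
    """Return base if the pair is inside the caliper and all forced match
    values agree; otherwise return infinity (the pair cannot be matched)."""
    if abs(q_i - q_j) <= c and fm_i == fm_j:
        return base
    return math.inf


# Hypothetical inputs, for illustration only.
c_inv = [[2.0, 0.0], [0.0, 0.5]]
same = ("F", 1)
d_maha = within_caliper_distance(
    mahalanobis_form([1.0, 2.0], [0.0, 0.0], c_inv), 0.40, 0.45, 0.1, same, same)
d_rank = within_caliper_distance(
    rank_sum([3.0, 5.5], [1.0, 7.0]), 0.40, 0.45, 0.1, same, same)
d_far = within_caliper_distance(1.0, 0.40, 0.80, 0.1, same, same)
d_forced = within_caliper_distance(1.0, 0.40, 0.45, 0.1, same, ("M", 1))
```

Note that the quadratic form is used without a square root, following the formulas as written in this chapter.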
In the Greedy Data Matching procedure, two distance calculation methods are available that are not in the
Optimal Data Matching procedure (option #2 and option #7). Both involve caliper matching with matches
allowed outside calipers. When matches are allowed outside calipers, the algorithm always tries to find matches
inside the calipers first, and only assigns matches outside calipers if a match was not found inside. Matches
outside calipers are created based solely on the propensity score, i.e., if matches outside calipers are allowed and
no available control subject exists that is within c propensity score units of a treatment subject, then the control
subject with the nearest propensity score is matched with the treatment. This type of matching algorithm is
described in Rosenbaum and Rubin (1985a).
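The fallback behavior described above can be sketched as a simple 1:1 greedy loop. This is an illustrative sketch, not the NCSS algorithm: the function `greedy_match`, its parameters, and the toy data are hypothetical, treatments are processed in the order given, and forced match variables are left out for brevity.

```python
def greedy_match(treat_q, control_q, c, inside_distance, allow_outside=True):
    """Greedy 1:1 matching with propensity score calipers.

    inside_distance(i, j) is the distance used inside the caliper (for
    example a Mahalanobis or rank-based distance). Outside the caliper,
    only the propensity score difference is used.
    """
    available = set(range(len(control_q)))
    matches = {}
    for i, q_i in enumerate(treat_q):
        inside = [j for j in available if abs(q_i - control_q[j]) <= c]
        if inside:
            # Prefer the closest control inside the caliper.
            best = min(inside, key=lambda j: (inside_distance(i, j), j))
        elif allow_outside and available:
            # No control inside the caliper: take the nearest propensity score.
            best = min(available, key=lambda j: (abs(q_i - control_q[j]), j))
        else:
            continue  # this treatment stays unmatched
        matches[i] = best
        available.remove(best)
    return matches


# Hypothetical data: one covariate drives the inside-caliper distance.
treat_q, control_q = [0.30, 0.90], [0.32, 0.28, 0.60]
treat_x, control_x = [10, 50], [40, 12, 45]
dist = lambda i, j: abs(treat_x[i] - control_x[j])

with_fallback = greedy_match(treat_q, control_q, 0.05, dist, allow_outside=True)
without_fallback = greedy_match(treat_q, control_q, 0.05, dist, allow_outside=False)
```

In this toy data the second treatment has no control within the caliper, so it is matched to the control with the nearest propensity score only when matches outside calipers are allowed.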
• Maximum Possible
This option causes the program to assign the maximum number (k) of matches that can be made between
treatments and controls. If greedy matching is used and controls/treatments is not an integer, then using this
option will result in incomplete pair-matching.
• Integer Values
If an integer value is entered or selected, then the program attempts to create the specified number of control
matches for each treatment.
• Random
Both treatment and control subjects are randomly ordered before entering into the matching algorithm. When
the number of matches per treatment is greater than one, the greedy algorithm finds the best match (if
possible) for each treatment before returning and creating the second match, third match, etc. It is likely that
match assignments will change from run-to-run when using random ordering.
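The round-by-round behavior described for random ordering with more than one match per treatment can be sketched as follows. This is a simplified illustration under several assumptions: the function name `greedy_k_matches` and the `seed` parameter are hypothetical, the distance is the plain propensity score difference, no calipers or forced match variables are applied, and only the treatments are shuffled.

```python
import random


def greedy_k_matches(treat_q, control_q, k, seed=None):
    """Assign up to k controls per treatment, one round at a time."""
    rng = random.Random(seed)
    order = list(range(len(treat_q)))
    rng.shuffle(order)  # random ordering of the treatment subjects
    available = set(range(len(control_q)))
    matches = {i: [] for i in range(len(treat_q))}
    for _ in range(k):
        # Every treatment gets its next match before any treatment gets another.
        for i in order:
            if not available:
                break
            best = min(available, key=lambda j: (abs(treat_q[i] - control_q[j]), j))
            matches[i].append(best)
            available.remove(best)
    return matches


# Hypothetical scores chosen so the result does not depend on the ordering.
result = greedy_k_matches([0.2, 0.8], [0.1, 0.25, 0.75, 0.95, 0.5], k=2, seed=1)
```

With less clearly separated scores, different seeds can produce different assignments, which is the run-to-run variation noted above.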
Caliper Radius
This option specifies the caliper radius, c, to be used in caliper matching. The caliper radius is calculated as
$$ c = a\sqrt{(\sigma_1^2 + \sigma_2^2)/2} = a \times \text{SIGMA} $$
where a is a user-specified coefficient, $\sigma_1^2$ is the sample variance of q(x) for the treatment group, and $\sigma_2^2$ is the
sample variance of q(x) for the control group. NCSS allows you to enter the caliper radius using the syntax
“a*SIGMA”, where you specify the value for a (e.g. “0.2*SIGMA”) or by entering the actual value directly for c
(e.g. “0.5”). In the case of the former, the program calculates the variances of the treatment and control group
propensity scores for you. You may want to run descriptive statistics on the treatment and control group
propensity scores to determine the variance ratio of your data in order to find the appropriate value of a (from the
table above) for your research objectives.
Reports Tab
The following options control the format of the reports that are displayed.
Select Reports
Report Options
Variable Names
This option lets you select whether to display variable names, variable labels, or both.
The following reports will be generated for both optimal and greedy matching with slight variations depending on
the algorithm selected.
Rows Read 30
Rows with Missing Data 0
Treatment Rows 8
Control Rows 22
This report gives a summary of the data and variables used for matching.
Group         N    Matched   Percent Matched   Unmatched   Percent Unmatched
Exposed       8    8         100.00%           0           0.00%
Not Exposed   22   8         36.36%            14          63.64%
This report gives a summary of the matches created, as well as a summary of the matching parameters used by the
matching algorithm.
Caliper Radius
This is the caliper radius entered or calculated by the program. This line is only displayed if caliper matching
based on propensity scores was used.
Sum of Match Mahalanobis Distances (Sum of Match Propensity Score Differences or Sum of
Match Rank Differences)
This is the sum of Mahalanobis distances, propensity score differences, or rank differences (depending on the
distance calculation method selected) for all matched-pairs.
Average Match Mahalanobis Distance (Average Match Propensity Score Difference or Average
Match Rank Differences)
This is the average Mahalanobis distance, propensity score difference, or rank difference (depending on the
distance calculation method selected) for all matched-pairs. This is calculated as the [Sum of Match Distances (or
Differences)]/[Number of Matches Formed].
N
This is the number of candidates for matching in each group, i.e. the number of subjects with non-missing values
for all matching variables in each group.
Matched (Unmatched)
This is the number of subjects that were matched (unmatched) from each group.
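The summary quantities described above are simple ratios. As a quick arithmetic check, the sketch below reproduces figures that appear in this chapter's example reports (sums of 74 and 80 over 14 matches, and 8 of 22 and 14 of 19 control subjects matched). The helper names are hypothetical.

```python
def average_match_distance(sum_of_distances, matches_formed):
    """[Sum of Match Distances] / [Number of Matches Formed]."""
    return sum_of_distances / matches_formed


def percent_matched(matched, n):
    """Percentage of the n candidate subjects in a group that were matched."""
    return 100.0 * matched / n


avg_optimal = round(average_match_distance(74.0, 14), 5)
avg_greedy = round(average_match_distance(80.0, 14), 5)
pct_controls_a = round(percent_matched(8, 22), 2)
pct_controls_b = round(percent_matched(14, 19), 2)
```

These evaluate to 5.28571, 5.71429, 36.36, and 73.68, in agreement with the reported values.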
This report provides summary statistics by group for the data in the propensity score variable and each covariate
variable both before and after matching. Notice that the matching seemed to improve the balance of the propensity
scores between the treatment and control groups (the Standardized Difference shrank in magnitude from –160% to
–73%), but worsened the balance for the covariate X1 (the Standardized Difference grew in magnitude from –27% to
73.58%).
Group Type
This specifies whether the summary statistics refer to groups before or after matching.
N
This is the number of non-missing values in each variable by group. If there are missing values in covariates that
were not used for matching, then these numbers may be different from the total number of subjects in each group.
Mean
This is the average value for each variable by group.
SD
This is the standard deviation for each variable by group.
Mean Difference
This is the difference between the mean of the treatment group and the mean of the control group.
where $\bar{x}_{t,p}$ and $\bar{x}_{c,p}$ are the treatment and control group means for the pth covariate variable, respectively, and
$s^2_{t,p}$ and $s^2_{c,p}$ are the treatment and control group sample variances for the pth covariate variable, respectively.
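The formula that these symbols belong to does not appear in the text above. The sketch below assumes the widely used definition of the standardized difference, 100 times the difference in group means divided by the square root of the average of the two sample variances. That definition is consistent with the symbols just defined, but it is an assumption here, and the function name is hypothetical.

```python
import math
import statistics


def standardized_difference(treatment, control):
    """Assumed form: 100 * (mean_t - mean_c) / sqrt((s2_t + s2_c) / 2)."""
    mean_t = statistics.mean(treatment)
    mean_c = statistics.mean(control)
    var_t = statistics.variance(treatment)  # sample variance, treatment group
    var_c = statistics.variance(control)    # sample variance, control group
    return 100.0 * (mean_t - mean_c) / math.sqrt((var_t + var_c) / 2.0)


# Hypothetical covariate values, for illustration only.
std_diff = standardized_difference([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A negative value indicates that the treatment group mean is below the control group mean.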
This report provides a list of all matches created and important information about each match.
Match
This is the match number assigned by the program to each match and stored to the database (if a storage variable
was specified).
Row
This is the row of the treatment or control subject in the database.
Rows Read 30
Rows with Missing Data 0
Treatment Rows 8
Control Rows 22
Exposure      N    Matched   Percent Matched   Unmatched   Percent Unmatched
Exposed       8    6         75.00%            2           25.00%
Not Exposed   22   10        45.45%            12          54.55%
.
.
.
Notice that only the propensity score variable was used in distance calculations, but group comparison reports
were generated for each covariate variable specified. In the Matching Detail Report, you can see that not all
treatments were matched (incomplete matching). Finally, notice that race and gender were both used as Forced
Match variables.
If you go back to the spreadsheet and sort the data on C11 (click on Data > Sort from the NCSS Home window),
you will notice that matches were only created where the race and gender were identical for both the treatment
and control.
This report lists the treatments that were not paired with the target number of controls (2 in this case). Rows 1 and
14 were not paired with any controls. Rows 26 and 30 were only paired with 1 control. All other treatment rows
were paired with 2 controls. Incomplete matching is usually due to the use of forced match variables, using
caliper matching, or setting Matches per Treatment to ‘Maximum Possible’.
Treatment Row
This is the row in the database containing the treatment subject that was not fully matched.
Matches (Target = k)
This is the number of matches that were found for each treatment. The target represents the number of Matches
per Treatment specified on the input window.
Matching Reports
Matching Detail Report
Treatment = "Exposed", Control = "Not Exposed"
The matching detail report is not very informative because all of the propensity scores are equal to 1. If you run
the procedure several times, you will notice that the controls are randomly pairing with the treatments when the
race and gender are the same. Your report may be slightly different from this report because random ordering was
used. If you sort on C11, you will see that all matched pairs have the same value for race and gender.
Matching Reports
Matching Summary Report
Distance Calculation Method Sum of Rank Differences including the Propensity Score
Order for Matching Random
Controls Matched per Treatment 2
Sum of Match Rank Differences 74.00000
Average Match Rank Difference 5.28571
Site          N    Matched   Percent Matched   Unmatched   Percent Unmatched
Existing      7    7         100.00%           0           0.00%
New           19   14        73.68%            5           26.32%
The optimal match-pairings found by NCSS match those in Rosenbaum (1989) exactly. Notice, however, that the
distances (Sum of Rank |Differences|) are slightly different in some instances from those given in Table 1 of the
article. This is due to the fact that Rosenbaum (1989) rounds all non-integer distances in their reports. This
rounding also affects the overall sum of match rank differences; NCSS calculates the overall sum as 74, while
Rosenbaum (1989) calculates the overall sum as 71, with the difference due to rounding.
Output
Matching Summary Report
Distance Calculation Method Sum of Rank Differences including the Propensity Score
Order for Matching Sorted by Distance
Controls Matched per Treatment 2
Sum of Match Rank Differences 80.00000
Average Match Rank Difference 5.71429
Site          N    Matched   Percent Matched   Unmatched   Percent Unmatched
Existing      7    7         100.00%           0           0.00%
New           19   14        73.68%            5           26.32%
The greedy match-pairings found by NCSS match those in Rosenbaum (1989) exactly. Again, some of the
distances are different from those in Table 1 of the article because of rounding. NCSS calculates the overall sum
of rank differences as 80, while Rosenbaum (1989) calculates the overall sum as 79 with the difference due to
rounding.