How MERGE Really Works
Bob Virgile
Robert Virgile Associates, Inc.
Overview

Three Key Concepts

All SAS® DATA steps employ three key concepts:

1. Compile and execute.¹ The SAS software compiles the statements
   within the DATA step, and performs any required set-up work for
   each statement. Then, the software executes the programming
   statements for each observation.

2. The Program Data Vector (PDV). As part of the compilation process,
   the software sets up storage locations in memory to hold the
   current values of all variables.

3. Data movement. As the DATA step executes, values are placed into
   the PDV and later copied from the PDV to output SAS data sets.

The details of the internal workings of the DATA step are well
documented in the literature.² The key concepts are reviewed here,
with special emphasis on topics related to MERGE.

Compile and Execute

All DATA steps are compiled in their entirety before being executed.
The compilation process defines all variable attributes, including:

1. Name

2. Type

3. Length

    DATA PROBLEM;
    SET OLD;
    IF SEX='M' THEN TYPE='MALE';
    ELSE TYPE='FEMALE';

A problem arises because TYPE's length is determined by the first
DATA step statement capable of defining the length:

    IF SEX='M' THEN TYPE='MALE';

Thus TYPE receives a length of 4 before any data has been read from
OLD. The order of the data in OLD is irrelevant to the length of
TYPE. The first observation in OLD (in fact, all observations) may
contain SEX='F', but TYPE will always have a length of 4.

On the other hand, programs can take advantage of the distinct
compilation phase in many ways. For example, the following program
might be useful when entering data via SAS/FSP®:

    DATA YOUR.DATA;
    STOP;
    SET MY.DATA;

In compiling the SET statement, the program reads the header
information from MY.DATA, defining all variables. Next, the program
executes, and hits the STOP statement. The DATA step is therefore
over, and YOUR.DATA contains zero observations, with all variables
defined exactly as they exist in MY.DATA. Therefore, data entry via
SAS/FSP can begin on YOUR.DATA using the same screens used for
MY.DATA. The actual program or programs which created MY.DATA are
not needed to create YOUR.DATA.
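The length problem shown earlier in this section is commonly avoided by declaring TYPE before the first statement that could set its length. The following is a minimal sketch rather than part of the original paper; the sample contents of OLD and the output name FIXED are assumptions made for illustration.

```sas
/* Hypothetical sample data; only SEX matters here. */
DATA OLD;
   INPUT SEX $;
   DATALINES;
F
M
;
RUN;

DATA FIXED;
   LENGTH TYPE $ 6;              /* long enough for 'FEMALE' */
   SET OLD;
   IF SEX='M' THEN TYPE='MALE';
   ELSE TYPE='FEMALE';
RUN;

PROC PRINT DATA=FIXED;
RUN;
```

Because the LENGTH statement is compiled before the assignment statements, TYPE should hold all six characters of FEMALE. Without it, the value would be truncated to FEMA.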
Here is another example which takes advantage of the distinct
compilation phase:³

    DATA _NULL_;
    PUT 'TOTAL OBS IS ' TOTOBS;
    STOP;
    SET DISK.SASDATA NOBS=TOTOBS;

When compiling the SET statement the software can access the header
information for DISK.SASDATA, including the total number of
observations in the data set. The software creates TOTOBS and
initializes it with that total number of observations. That value is
available to the PUT statement without reading any data values.

Next, consider the execution phase of the DATA step. Some of the
important concepts are:

1. Statements which read data (INPUT, SET, MERGE, UPDATE) are
   executable. They are not merely labels that identify the source of
   the data. For example, the SET statement means "go out to this
   data set and read in an observation." Executable statements may
   appear anywhere in the DATA step, and do not have to be placed
   right after the DATA statement.

2. The DATA step continually loops through all its statements. The
   typical way out of this loop (i.e., the typical ending to a DATA
   step) is for a SET or INPUT statement to fail because there are no
   more observations left to read.

3. Variables read with a SET, MERGE, or UPDATE statement are
   retained. That is, their values are not reinitialized to missing
   just because the program passes through the DATA statement and
   outputs an observation.

The DATA steps below illustrate these concepts in action.

    PUT '3rd: ' _N_= X=;

In comparing the messages generated, notice how SET statement
variables are retained.

    1ST: _N_=1 X=.
    2ND: _N_=1 X=11
    3RD: _N_=1 X=110
    1ST: _N_=2 X=.
    2ND: _N_=2 X=22
    3RD: _N_=2 X=220
    1ST: _N_=3 X=.

    1ST: _N_=1 X=.
    2ND: _N_=1 X=110
    3RD: _N_=1 X=1100
    1ST: _N_=2 X=1100
    2ND: _N_=2 X=220
    3RD: _N_=2 X=2200
    1ST: _N_=3 X=2200

Both DATA steps generate seven messages, not six. Neither step ends
until a read statement fails because there are no more incoming data.

Program Data Vector

The PDV is a set of storage locations set up in memory, holding the
current value of all variables. Programming statements modify values
stored in the PDV, not values stored in SAS data sets. For example,
the assignment statement below operates on values stored in the PDV
(as opposed to values stored in OLD):

    DATA TOTALS;
    SET OLD;
    CUPS=2*PINTS + 4*QUARTS;

The program uses the PDV by:

1. Copying observations from OLD into the PDV.

2. Computing CUPS based on the PDV's values for PINTS and QUARTS.
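One way to watch the PDV at work is to write its contents to the log before and after the SET statement. This is a sketch only; the two observations in OLD are invented for illustration, and the PUT statements are an addition to the TOTALS step.

```sas
/* Hypothetical sample data for OLD. */
DATA OLD;
   INPUT PINTS QUARTS;
   DATALINES;
1 2
3 4
;
RUN;

DATA TOTALS;
   PUT 'BEFORE SET: ' _ALL_;   /* PDV as the iteration begins        */
   SET OLD;
   CUPS=2*PINTS + 4*QUARTS;
   PUT 'AFTER:      ' _ALL_;   /* PDV just before the implicit output */
RUN;
```

On the second iteration the BEFORE SET line should show PINTS and QUARTS still holding the first observation's values, while CUPS has been reset to missing. The step also writes a third BEFORE SET line, because it ends only when the SET statement finds no more observations.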
The role of the PDV clears up confusing combinations of KEEPs, DROPs,
and RENAMEs. All KEEPs, DROPs, and RENAMEs on a DATA statement refer
to variable names in the PDV. All KEEPs, DROPs, and RENAMEs on a SET,
MERGE, or UPDATE statement refer to variable names in the source data
set. So when will the following program work?

    DATA TOTALS (RENAME=(QUARTS=QTS));
    SET ALL (DROP=CUPS
             RENAME=(VAR3=VAR4));
    CUPS=2*PINTS + 4*QUARTS;

Data Movement

Understanding data movement will let you write more efficient
programs as well. Compare these two DATA steps:

    DATA TOTALS;
    SET OLD (DROP=VAR5-VAR20);
    CUPS=2*PINTS + 4*QUARTS;

    DATA TOTALS (DROP=VAR5-VAR20);
    SET OLD;
    CUPS=2*PINTS + 4*QUARTS;

                     Figure 2
                The PDV in a MERGE

                       CPU
                  ==============
   VACATION       |            |         BOTH
                  |    PDV     |
==============    |------------|    ==============
|  LOCATION  |--->|  LOCATION  |--->|  LOCATION  |
|------------|    |------------|    |------------|
|    NAME    |--->|            |    |            |
==============    |    NAME    |--->|    NAME    |
==============    |            |    |            |
|    NAME    |--->|            |    |            |
|------------|    |------------|    |------------|
|    AGE     |--->|    AGE     |    |            |
==============    |------------|    |    AGE     |
                  |   DUMMY    |--->|            |
   AGEDATA        |------------|    ==============
                  | Automatic  |
                  | Variables  |
                  |------------|
                  |            |
                  ==============

The PDV for the second program contains VAR5-VAR20, while the PDV for
the first program does not. The second program performs extra work,
copying those 16 variables from OLD into the PDV for each
observation.

At Last, MERGE

The MERGE process employs the concepts above, as well as containing a
few of its own characteristics. Consider the following data sets and
program.

AGEDATA contains one observation per name.

    NAME     AGE
    ALICE     38
    BOB       49
    CAROL     55
    TED       40

VACATION contains a varying number of observations per NAME:

    NAME     LOCATION
    ALICE    ARUBA
    BOB      BERMUDA
    BOB      BIMINI
    TED      TAHITI

And the program:

    DATA BOTH;
    LENGTH NAME $ 5 LOCATION $ 8;
    PUT 'BEFORE: ' _N_= NAME= AGE= /
        @9 LOCATION= COUNT= X=;
    MERGE AGEDATA VACATION;
    BY NAME;
    COUNT + 1;
    X=5;
    PUT 'AFTER: ' _N_= NAME= AGE= /
        @9 LOCATION= COUNT= X=;

This program generates the following messages:

    BEFORE: _N_=1 NAME= AGE=.
            LOCATION= COUNT=0 X=.
    AFTER: _N_=1 NAME=ALICE AGE=38
            LOCATION=ARUBA COUNT=1 X=5
    BEFORE: _N_=2 NAME=ALICE AGE=38
            LOCATION=ARUBA COUNT=1 X=.
    AFTER: _N_=2 NAME=BOB AGE=49
            LOCATION=BERMUDA COUNT=2 X=5
    BEFORE: _N_=3 NAME=BOB AGE=49
            LOCATION=BERMUDA COUNT=2 X=.
    AFTER: _N_=3 NAME=BOB AGE=49
            LOCATION=BIMINI COUNT=3 X=5
    BEFORE: _N_=4 NAME=BOB AGE=49
            LOCATION=BIMINI COUNT=3 X=.
    AFTER: _N_=4 NAME=CAROL AGE=55
            LOCATION= COUNT=4 X=5
    BEFORE: _N_=5 NAME=CAROL AGE=55
            LOCATION= COUNT=4 X=.
    AFTER: _N_=5 NAME=TED AGE=40
            LOCATION=TAHITI COUNT=5 X=5
    BEFORE: _N_=6 NAME=TED AGE=40
            LOCATION=TAHITI COUNT=5 X=.

Notice the timing of a few key actions:

1. _N_ is incremented each time the DATA step leaves the DATA
   statement.

2. COUNT is initially 0, and is always retained (never
   reinitialized).

3. Variables read by the MERGE statement (NAME, AGE, and LOCATION)
   are retained. They are initially missing, and are reinitialized at
   the MERGE statement, whenever the program encounters a new value
   for the BY variable(s).

4. X is reinitialized to missing for each observation, at the DATA
   statement.

Both the general DATA step processes described above and the MERGE
concepts are important to understanding how MERGE works. However,
when programmers begin to apply these concepts in practice, MERGE may
produce unwanted results. Most of the time, these results occur in
match-merges which either (1) contain a common variable other than
the BY variable(s), or (2) are many-to-one MERGEs. The rest of this
paper illustrates MERGEs which contain typical problems and shows
programming fixes to overcome them.

The first problem area involves common variables other than the BY
variable(s). The merged data set contains the last value read from
either source data set. In a one-to-one MERGE this means the value
from the last data set mentioned in the MERGE statement. But in a
many-to-one MERGE the value may come from either data set. Let's once
again merge AGEDATA and VACATION:

    NAME     AGE
    ALICE     70
    BOB       70
    CAROL     70
    TED       70
    NAME     LOCATION     AGE
    ALICE    ARUBA         38
    BOB      BERMUDA       35
    CAROL    CANCUN        55
    TED      TIMBUKTU      60
    TED      TIPPERARY     60
    TED      TOLEDO        60

The results depend on the order of the data sets in the MERGE
statement.

    DATA BOTH;
    MERGE AGEDATA VACATION
          /* or VACATION AGEDATA */;
    BY NAME;

produces one of these results:

    NAME     LOCATION     AGE
    ALICE    ARUBA         38
    BOB      BERMUDA       35
    CAROL    CANCUN        55
    TED      TIMBUKTU      60
    TED      TIPPERARY     60
    TED      TOLEDO        60

The table above is the result when AGEDATA is listed first. When
VACATION is listed first, AGE is 70 on the first observation for each
NAME and 60 on the last two TED observations.

The last two AGEs are 60, not 70, because that was the last value
read from any of the merged SAS data sets. The AGE of 70 is NOT
reread, merely retained in the PDV. When merging in the last two
observations, the value of AGE (60) replaces the current value in the
PDV.

In many-to-one MERGEs, be careful when modifying variables which come
from the "one" data set. Consider one more variation for AGEDATA and
VACATION:

    NAME     AGE
    ALICE     38
    BOB       49
    CAROL     55
    TED       40

    NAME     LOCATION
    ALICE    ARUBA
    ALICE    ARGENTINA
    BOB      BAHAMAS
    BOB      BERMUDA
    BOB      BIMINI
    CAROL    CANCUN
    TED      TIMBUKTU

The merge BY NAME is straightforward for this many-to-one situation.
But suppose the objective were a little more complex, involving some
data manipulation to AGE. In particular, suppose BOB's last vacation
was to BERMUDA, and he turned 50 just before he left.

The desired result would be:

    NAME     AGE    LOCATION
    ALICE     38    ARUBA
    ALICE     38    ARGENTINA
    BOB       49    BAHAMAS
    BOB       50    BERMUDA
    BOB       49    BIMINI
    CAROL     55    CANCUN
    TED       40    TIMBUKTU

If the program simply sets AGE to 50 on the BERMUDA observation, the
new value is retained in the PDV, because AGE is not reread from
AGEDATA until the next NAME. The result is that AGE remains 50 for
BIMINI, not just for BERMUDA. To get around this problem, it is
necessary to create a new variable:

    DATA BOTH (DROP=AGE
               RENAME=(DUMMY=AGE));
    MERGE AGEDATA VACATION;
    BY NAME;
    IF NAME='BOB' AND
       LOCATION='BERMUDA' THEN DUMMY=50;
    ELSE DUMMY=AGE;

Figure 2 illustrates the MERGE process for this program.

Finally, consider a combination example where both situations exist:
a many-to-one MERGE where both incoming data sets contain a common
variable (not the BY variable).
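Before turning to that combined example, the DUMMY technique shown above can be run end to end with the sample data from this section. This is a minimal sketch: entering the two tables with DATALINES, the LENGTH values, and the closing PROC PRINT are assumptions added for illustration.

```sas
/* Sample tables from the text, entered inline. */
DATA AGEDATA;
   LENGTH NAME $ 5;
   INPUT NAME $ AGE;
   DATALINES;
ALICE 38
BOB 49
CAROL 55
TED 40
;
RUN;

DATA VACATION;
   LENGTH NAME $ 5 LOCATION $ 9;   /* 9 leaves room for ARGENTINA */
   INPUT NAME $ LOCATION $;
   DATALINES;
ALICE ARUBA
ALICE ARGENTINA
BOB BAHAMAS
BOB BERMUDA
BOB BIMINI
CAROL CANCUN
TED TIMBUKTU
;
RUN;

/* AGE itself is never modified, so its retained value stays correct. */
DATA BOTH (DROP=AGE
           RENAME=(DUMMY=AGE));
   MERGE AGEDATA VACATION;
   BY NAME;
   IF NAME='BOB' AND
      LOCATION='BERMUDA' THEN DUMMY=50;
   ELSE DUMMY=AGE;
RUN;

PROC PRINT DATA=BOTH;
RUN;
```

In the printed result, AGE should be 50 only on the BOB and BERMUDA observation, and 49 on BAHAMAS and BIMINI, since DUMMY is recomputed from the retained AGE on every observation.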
An existing SAS data set MASTER contains many records for each STATE,
but does not contain the variable STATE. The objective is to add the
STATE variable, based on either of the existing variables COUNTY or
ZIPCODE.

Two separate SAS data sets may be able to supply the STATE variable:
COUNTIES contains COUNTY and STATE, while ZIPCODES contains
three-digit zipcode and STATE. Both data sets are incomplete sources
of STATE data, and the information in ZIPCODES is more reliable than
the information in COUNTIES. So the plan of action is:

1. Add to MASTER a variable holding the first three characters of
   ZIPCODE.

2. Sort MASTER and ZIPCODES by shortened ZIP, and MERGE them to get
   some STATE values added to MASTER.

3. If COUNTIES contains multiple occurrences of a COUNTY, delete all
   of them. (For example, Suffolk County would appear for both New
   York and Massachusetts.)

4. MERGE MASTER and COUNTIES by COUNTY. However, if a legitimate
   STATE value had already been retrieved from ZIPCODES, disregard
   the STATE from COUNTIES.

    PROC SORT DATA=ZIPCODES;
    BY ZIP3;
    *CONTAINS ZIP3 + STATE ONLY;

    DATA ADDSTATE;
    MERGE MASTER (IN=INMAST)
          ZIPCODES;
    BY ZIP3;
    IF INMAST;

    *CONTAINS COUNTY + STATE ONLY;

    DATA UNIQUE;
    SET COUNTIES;
    BY COUNTY;
    IF FIRST.COUNTY AND LAST.COUNTY;

Step 4 becomes tricky. The idea is to MERGE BY COUNTY and replace
only the missing STATE values. Because of the previous MERGE, both
data sets now contain STATE. It becomes necessary to RENAME one of
the incoming variables:

    PROC SORT DATA=ADDSTATE;
    BY COUNTY;

    DATA ADDSTATE (DROP=DUMMY);
    MERGE ADDSTATE (IN=INMAST)
          UNIQUE (RENAME=(STATE=DUMMY));
    BY COUNTY;
    IF INMAST;
    IF STATE=' ' THEN STATE=DUMMY;

The author welcomes comments and questions about this paper, as well
as suggestions for future papers. Feel free to call or write:

    Bob Virgile
    Robert Virgile Associates, Inc.
    3 Rock Street
    Woburn, MA 01801
    (781) 938-0307
    [email protected]

³This is not an original program, but has appeared in various forms
in past SUGI papers.