Advanced SAS Programming Techniques
A Workshop Presented to the
Alaska Chapter
The American Fisheries Society
E. Barry Moser
Department of Experimental Statistics
Louisiana State University
and
Louisiana State University Agricultural Center
Baton Rouge, LA 70803
Phone: 504-388-8376
FAX: 504-388-8344
E-mail: [email protected]
CONTENTS
2.4.2 Indeterminate DO Loops
2.5 The _NULL_ Data Set
2.6 Data Step Examples
2.6.1 Simple Random Sampling Without Replacement
2.6.2 Data Recoding
Introduction
The SAS[1] system, composed of many diverse components, is a very powerful programming
environment, data management and data analysis environment, and report generation and
graphics presentation environment. This manuscript was developed for a short course in
"advanced SAS." Obviously the coverage will have to be quite limited. The coverage is
designed around material that I have encountered through my teaching, research, and
statistical consulting work that I believe will be relevant and useful for others dealing with
basic data management and statistical analysis needs. This manuscript is not intended as
a SAS language or SAS system reference manual; it hardly scratches the surface. Nor is
it designed to show how to do statistical analysis with the SAS system. The manuscript
will first focus on the data step, as a lot of the power of the SAS environment can be
demonstrated through the data step. Next, the input/output and library system will be
discussed. Later the macro language will be introduced. Finally, several chapters dealing
with various parts of the system (graphics, data analysis procedures, and the internet) will
be introduced. As this is an "advanced" course, some items will be introduced before they
are actually covered in detail. This was done on purpose so as to avoid completely
artificial-looking or contrived examples (although a few do exist, sorry). Further, since not
all of the "basics" are covered, keep copies of the SAS manuals available. The best way to
learn the SAS system and to benefit from this course is to experiment with the examples
and to create your own. Again, DO NOT hesitate to modify the examples and to create
new ones.
[1] SAS, SAS/BASE, SAS/GRAPH, SAS/ACCESS, SAS/ASSIST, SAS/FSP, SAS/INSIGHT, SAS/OR,
SAS/ETS, SAS/IntrNet, SAS/IML, and SAS/STAT are registered trademarks or copyrights of SAS Institute,
Inc., Cary, NC.
Chapter 2
The Data Step
Notice that the conversion factor, CONVFACT, for the second observation is missing (rep-
resented by a period). This observation could be dropped from the data set by several
methods. We'll consider a couple to further illustrate the behavior of the data step. The
OUTPUT statement can be used to output an observation to a data set or data sets. When it
is present in a data step, observations will ONLY BE OUTPUT when the OUTPUT statement
is executed. Consider the example below using both the RETURN and OUTPUT statements.
Title2 "RETURN and OUTPUT Statements";
Data One;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
If TSL=0 Then RETURN;
ConvFact=Lt/TSL;
OUTPUT;
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
Proc Print Data=One;
Run;
Note that the second observation was never output to the new SAS data set. Now, what
if we would like the observations with zero (or no) total scale length (TSL) to be placed
into one data set and the others to be placed into another data set? Let's use the OUTPUT
statement to do this for us.
Title2 "OUTPUT to Different Data Sets";
Data WithTSL NoTSL;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
ConvFact=Lt/TSL;
If TSL=0 Then OUTPUT NoTSL;
Else OUTPUT WithTSL;
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
Title3 "With TSL > 0";
Proc Print Data=WithTSL;
Run;
OBS MO DAY YR AR ST SEX AGE SN LT WT TSL CONVFACT
1 1 9 76 2 3 2 0 1 5.3 3 0 .
One problem that we will have here is that the conversion factor will be computed on
each and every observation and so we get a divide by zero error on the second observation
(TSL=0). Further, since no conversion factor is possible when the scale length is not
measured, we should like to drop the CONVFACT variable from the NoTSL data set.
OBS MO DAY YR AR ST SEX AGE SN LT WT TSL CONVFACT
1 1 9 76 2 3 2 0 1 5.3 3 0 .
Unfortunately, the CONVFACT variable is still present in the NoTSL data set, although it
was not computed for any observations in this data set. A DROP statement could be used
to remove the CONVFACT variable from ALL output data sets, but this is not what we
would like either.
OBS= Specifies the last observation number to be processed, after which processing
will stop.
IN= Names a new variable that will have the value 1 when the current observation
is read from the data set and 0 when the current observation is read from
another data set.
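As a minimal sketch of these two options, the step below stacks two small data sets and uses IN= to record where each observation came from, and OBS= then limits the listing to the first three observations. The data set names Early and Late and the variable names FromEarly and Source are hypothetical and are not part of the workshop data.

```sas
Data Early;
   Input Yr Lt;
   Datalines;
74 3.5
74 5.4
;
Data Late;
   Input Yr Lt;
   Datalines;
75 10.1
76 6.0
;
Data Both;
   Set Early(IN=FromEarly) Late;
   /* IN= variables are temporary, so copy the flag if it is to be kept */
   Source=FromEarly;     /* 1 if read from Early, 0 if read from Late */
Run;
Proc Print Data=Both(OBS=3);   /* stop after the third observation */
Run;
```

Because the IN= variable is never written to the output data set, it has to be copied into an ordinary variable, as done here with Source, whenever the information is needed later.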
Now we can use this information to update our program so that we can have a different set
of variables for the two new data sets.
Title2 "DROP Data Set Option";
Data WithTSL NoTSL(DROP=ConvFact);
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
If TSL=0 Then OUTPUT NoTSL;
Else
Do;
ConvFact=Lt/TSL;
OUTPUT WithTSL;
End;
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
Title3 "With TSL > 0";
Proc Print Data=WithTSL;
Run;
OBS MO DAY YR AR ST SEX AGE SN LT WT TSL
1 1 9 76 2 3 2 0 1 5.3 3 0
Finally, this is what we had wanted. Note that the KEEP= option could also have been used,
but would have required that we list each of the variables that we wanted to keep in the
new data set. The choice between KEEP and DROP is usually based upon which list is shorter or
easier to write.
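For comparison, here is a sketch of the same step written with the KEEP= data set option. It should produce the same two data sets, but every variable wanted in NoTSL has to be listed. Only two of the data lines are repeated to keep the sketch short.

```sas
Title2 "KEEP Data Set Option";
Data WithTSL NoTSL(KEEP=Mo Day Yr Ar St Sex Age Sn Lt Wt TSL);
   Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
   If TSL=0 Then OUTPUT NoTSL;
   Else
      Do;
         ConvFact=Lt/TSL;    /* computed only when TSL is non-zero */
         OUTPUT WithTSL;
      End;
   Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
;
Proc Print Data=NoTSL;
Run;
```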
2.1.5 DROP, KEEP, and RETAIN
There are a few statements in the SAS data step that are not executable statements and can
be placed anywhere within the SAS data step. As with the DROP= and KEEP= data set options,
the DROP and KEEP statements are used to specify which variables are to be excluded or
included in the new data set or data sets being created. Others include the FORMAT, LENGTH,
and ARRAY statements. The RETAIN statement can also be placed anywhere but can serve
several roles. Normally, after the variables have been output for an observation at the end
of the data step loop, the variables' values are reset to missing before the next observation
is processed. The RETAIN statement alters this behavior by not resetting the data values for
any variables given in its list of variables. Another use of the RETAIN statement is to give
initial values to specific variables. Examples of the RETAIN statement will be encountered
later.
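Until those examples appear, a small sketch may help fix the idea. The variable names TotWt and MaxLt and the three data lines are made up for illustration. TotWt is given an initial value of 0, while MaxLt is retained but starts out missing.

```sas
Data Running;
   Retain TotWt 0 MaxLt;     /* TotWt starts at 0; MaxLt starts missing */
   Input Lt Wt;
   TotWt=TotWt+Wt;           /* uses the value kept from the previous observation */
   MaxLt=Max(MaxLt,Lt);      /* MAX ignores missing values */
   Datalines;
3.5 1.4
5.3 3.0
5.4 3.4
;
Proc Print Data=Running;
Run;
```

Without the RETAIN statement both TotWt and MaxLt would be reset to missing at the top of every pass through the data step, so TotWt would be missing on every observation and MaxLt would simply equal Lt.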
2.2 Input/Output
One of the many powerful features of the SAS language is the diversity of methods, and
modifications to them, for inputting data into a SAS data set. We will review several of the
more important methods and then consider some options and modifications that become
more and more useful as the data become more complicated to read. For reading data from
a spreadsheet, for example, you may want to skip to the section on file import and export
on page 51. We will next look at output methods that can be very useful for generating
reports that have to be in very specific formats.
The basic statement used for reading raw data from a file is the INPUT statement. It takes as
its arguments a list of variables, pointer placement instructions, and informat information.
For options that are used to modify the standard behavior of the input process, the INFILE
statement is used.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
This may not be what you have intended. Consider the example below where there are
several missing data values scattered throughout the input data set. Note that the data
values become mismatched with the variables to which they should correspond.
Title2 "List Input / Missing Data";
Data One;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
Datalines;
7 21 74 2 7 3 0 5 3.5
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 23.4 203.00
1 28 75 2 3 1 1 5 13.5 33.2 242.00
;
First of all, the month and day information for observation 2 was used as the weight and
total scale length data for observation 1. Second, observe that the remainder of the data for
observation 2 was discarded. Third, observation 3 has missing data for the area and station
variables, but the sex and age data were used for these variables. Finally, we note that
instead of 6 observations, the final data set has only 3. What has happened is the following.
When list input is used, the input processor simply scans a line of data until it comes to
non-blank information. It will then place this data into the next variable, in order, from
the INPUT statement. It does not matter in which columns the data are placed on the line,
except that their left-to-right order corresponds with the order of the variables in the INPUT
statement. If insufficient values are found to fill the variables, then the processor moves on
to the next line of data. When all variables have been filled, the processor stops reading
in data and the current line of data being read is, by default, discarded. This is why the
remainder of line 2 was not used. You should now be able to duplicate the assignment of
the data to the variables by following the above rules.
How can we modify the list input to handle the missing data? If the missing data all occur at
the end of a data line, as might happen with repeated measurements data where dropouts
would have no data after dropping out, then the MISSOVER option of the INFILE statement
can be used. This option specifies that the pointer is NOT to be moved to the next line if
no more values are found on a data line; rather, the remaining variables are to be assigned
the value for missing data (by default, a period for numeric data and a blank for character
data).
Title2 "Missing Data / Missover Option";
Data Trout;
/* Flack Lake trout Catch Data*/
Infile Datalines MISSOVER;
Input Year Age3-Age9;
Datalines;
1975 0 105 674 446 16 2 2
1976 46 422 838 726 70 4 4
1977 3 310 1224 1068 65
1978 14 354 1264 1172 69 0 6
1979 6 429 1222 1067 192
;
Proc Print Data=Trout;
Run;
Notice that we did get 5 observations with the correct data values. In this case the missing
values are not truly missing, but rather are zero values. The data step below would change
the missing values to zeros. See the section on looping and arrays on page 26 to see how to
solve this problem more generally and easily.
Title2 "Missing Data / Missover Option";
Data Trout;
/* Flack Lake trout Catch Data*/
Infile Datalines MISSOVER;
Input Year Age3-Age9;
If Age8=. Then Age8=0;
If Age9=. Then Age9=0;
Datalines;
1975 0 105 674 446 16 2 2
1976 46 422 838 726 70 4 4
1977 3 310 1224 1068 65
1978 14 354 1264 1172 69 0 6
1979 6 429 1222 1067 192
;
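As a preview of the more general approach, here is a sketch that uses an array and a loop so that every age variable is checked, not just AGE8 and AGE9. The array name Ages and the index I are our own choices for illustration, and only two of the data lines are repeated.

```sas
Title2 "Missing Data / Missover Option With an Array";
Data Trout;
   Infile Datalines MISSOVER;
   Input Year Age3-Age9;
   Array Ages{*} Age3-Age9;          /* {*}: let SAS count the variables */
   Do I=1 To Dim(Ages);
      If Ages{I}=. Then Ages{I}=0;   /* missing catch is really a zero catch */
   End;
   Drop I;
   Datalines;
1977 3 310 1224 1068 65
1978 14 354 1264 1172 69 0 6
;
Proc Print Data=Trout;
Run;
```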
Now what if the missing data are interior to other data line values? In order to continue
to use list input, one needs to enter the missing value symbol in the data lines for these
values. This is illustrated below.
Title2 "List Input / Missing Data";
Data One;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
Datalines;
7 21 74 2 7 3 0 5 3.5 . .
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 . . 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 . 23.4 203.00
1 28 75 2 3 1 1 5 13.5 33.2 242.00
;
OBS MO DAY YR AR ST SEX AGE SN LT WT TSL
1 7 21 74 2 7 3 0 5 3.5 . .
2 1 9 76 2 3 2 0 1 5.3 3.0 0.0
3 12 18 74 . . 1 0 5 5.4 3.4 83.2
4 2 15 76 2 1 1 0 5 6.0 5.7 111.0
5 9 13 75 2 2 2 1 5 . 23.4 203.0
6 1 28 75 2 3 1 1 5 13.5 33.2 242.0
One can also mix the various input methods. Thus, the input line could have been written
as
Input Mo Day Yr Ar 14 St 18 Sex Age Sn Lt 32-36 Wt 37-42 TSL 43-50;
since the date, sex, age, and scale number information were never missing. It is more
common to use either all list or all column input.
Note that we had to leave at least 2 spaces separating the reader's name from the month
value since we used the & symbol; otherwise the month value would have been read as part
of the name. Secondly, since we did not specify a character informat, we needed to define the
length of the character variable using the LENGTH statement. Otherwise the variable would
have been given the default length of 8 characters and longer names would have been
truncated. Using formatted character input we
could have used the INPUT statement
Input @1 Name $Char13. Mo Day Yr @26 Ar 1. @30 St 1.
Sex Age Sn #2 @1 Lt 5.1 @6 Wt 6.1 @12 TSL 8.2;
and have dropped the LENGTH statement from the program. SAS also has input formats,
called "informats," for reading in other types of data such as social security numbers, phone
numbers, and dates and times. We will give some examples of working with SAS dates and
times later. The input formats can be explicitly given on the INPUT statement, or can be
assigned to the variables using the INFORMAT statement.
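A small sketch of the second approach follows. The data set Visits, its variables, and its two data lines are hypothetical. The INFORMAT statement tells list input how to read the date, and the FORMAT statement controls only how the stored value is displayed.

```sas
Data Visits;
   Informat Visit mmddyy8.;    /* how the raw value is read */
   Format   Visit date9.;      /* how the stored value is printed */
   Input Name $ Visit;
   Datalines;
Smith 07/21/74
Jones 01/09/76
;
Proc Print Data=Visits;
Run;
```

The same result could be obtained by writing the informat on the INPUT statement itself with the colon modifier, as in Input Name $ Visit : mmddyy8.;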
When reading complex data files, controlling the pointer can be very important. Normally
after the input statement has been executed for an observation, the physical input line
is discarded and the next physical input line is moved into the input buffer. Sometimes,
however, you would like to read what is on the physical line at specific points, say looking
for special key words, and then use an input statement that depends upon the key word.
The "trailing" @ and @@ signs can be used to hold the pointer on the current input line.
The single @ sign holds the line until another input statement releases it or until the data
step loop restarts. The double @@ sign holds the current line even after the data step loop
restarts. They are called "trailing" because the symbol is placed at the very end of the
input statement (just before the semicolon).
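The two steps below are a minimal sketch of each symbol. The record layout in the second step, a one-letter record type followed by type-specific values, is invented for illustration and is not the layout of the workshop data.

```sas
/* @@ : several observations on each input line */
Data LtWt;
   Input Lt Wt @@;
   Datalines;
3.5 1.4 5.3 3.0 5.4 3.4
6.0 5.7 10.1 23.4
;
/* @ : read a key value first, then decide how to read the rest */
Data Catch;
   Input Type $ @;                  /* hold the line for the next INPUT */
   If Type="F" Then Input Lt Wt;    /* fish record: length and weight */
   Else Input Temp;                 /* any other record: a temperature */
   Datalines;
F 3.5 1.4
W 12.5
F 5.3 3.0
;
```

The first step creates five observations from two data lines. The second creates one observation per line, with Temp missing on the fish records and Lt and Wt missing on the others.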
OBS LT WT
1 3.5 1.4
2 5.3 3.0
3 5.4 3.4
4 6.0 5.7
5 10.1 23.4
6 13.5 33.2
7 8.7 16.8
8 11.9 44.4
9 12.0 40.9
10 16.0 103.4
11 9.1 17.9
12 10.5 29.1
13 17.2 127.1
14 17.9 132.7
15 12.1 47.9
16 17.2 136.4
17 17.3 138.3
18 17.5 134.4
19 13.2 42.2
20 16.4 110.0
21 16.7 101.6
22 15.3 92.9
23 15.4 76.3
24 18.0 125.8
25 18.5 131.6
OBS DATE LT WT
Note that we needed to retain the date variable so that its value would be maintained
through each loop of the data step. The date is only changed when a date value is found on
the input line. Also note that in this example the date format for the DATE variable was
specified in the PROC PRINT section rather than in the data step. If the FORMAT statement
is placed in the data step, then the format will be used by future procedures. If it is placed
in a procedure step, then the format is local to that procedure and will not carry on to
future procedures.
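The difference can be seen in a short sketch. The data set Dates and its two data lines are hypothetical. The format for LT is stored with the data set, while the date format is attached only in the first PROC PRINT.

```sas
Data Dates;
   Input Date mmddyy8. Lt;
   Format Lt 5.1;               /* stored with the data set */
   Datalines;
07/21/74 3.5
01/09/76 5.3
;
Proc Print Data=Dates;
   Format Date date9.;          /* applies to this PROC PRINT only */
Run;
Proc Print Data=Dates;          /* Date prints as a plain number again */
Run;
```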
There are several features to notice in this particular example, besides its very unattractive
appearance. Note that string constants can be printed in the report simply by enclosing
the data within quotes. Note also that the output format does not have to be the same
as the input format. Further, the name of a variable along with its value can be obtained
simply by listing the variable's name on the PUT statement followed immediately by the
equals sign. The pointer control symbol / is used to move the pointer to the next line. The
FILE statement can also be used to redirect the output report to a file or device such as a
printer.
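These features can be sketched in a few lines. The report layout here is our own and the two data lines are taken from the flier data used earlier. FILE PRINT sends the PUT output to the procedure output listing rather than to the log.

```sas
Data _NULL_;
   Input Mo Day Yr Lt Wt;
   File PRINT;                          /* listing instead of the log */
   Put "Fish measured on " Mo 2. "/" Day 2. "/" Yr 2.   /* string constants */
       / @5 Lt= 5.1 Wt= 6.2;            /* new line, then name=value pairs */
   Datalines;
7 21 74 3.5 1.4
12 18 74 5.4 3.4
;
```

Replacing PRINT with a quoted file name or a file reference would write the same report to that file.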
Now we discover that SAS dates are relative to January 1, 1960, as this is the date
corresponding to the SAS date of zero. Note that because we did not assign a SAS date format
to the BIRTHDAY variable, the actual SAS date value was printed. The example also
demonstrated the SAS formats WEEKDATXn. and DATEn., where n is the format width. If the
width is not sufficient to write out the entire names, as requested, then abbreviations will
be used where possible. Decimal values of SAS dates can also be used to contain internal
values for time. See the BASE SAS documentation for these formats. SAS date-time values
are entered in a data set as DDMMMYY:hh:mm:ss (the form read by the DATETIMEn. informat),
where DDMMMYY is the date, while hh is
the hour, mm is the minutes, and ss is the seconds in the time. When typed within the data
step, such as in an IF statement, a date is written in the DDMMMYY form, enclosed in quotes,
and followed by the letter d,
such as "29SEP97"d, while times are followed by the letter t, such as "8:30"t. Note that
hour takes on the values 0 through 23, where 0 is midnight.
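A short sketch of date and time constants follows. The variable names are arbitrary, and the date constants are written in the day-month-year form used by the DATE format. The PUT statement writes the same stored value with and without a format.

```sas
Data _NULL_;
   Zero  ="01JAN1960"d;      /* the SAS date value 0 */
   Cutoff="29SEP1997"d;
   Start ="8:30"t;           /* time constant: seconds since midnight */
   Put Zero= Cutoff= Cutoff= date9. Cutoff= weekdatx29.;
   Put Start= Start= time5.;
   If Cutoff > "01JAN1997"d Then Put "Cutoff falls after the start of 1997";
Run;
```

Because dates are stored as numbers, comparisons such as the one in the IF statement are ordinary numeric comparisons.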
Formats can also be constructed using PROC FORMAT. Once constructed, these formats can
be used as with any other format. These formats provide a very nice way for coding data
very simply for input, but then producing reports with very nice labels for values. We'll
create a format for the SEX variable used in the flier data and use it to write out the labels
on the printout.
Title2 "Proc FORMAT";
Proc Format;
Value Sex 1="Male" 2="Female" 3="Unknown";
Run;
Data One;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
Format Sex Sex.;
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
SUM(v1,v2,...,vn) returns the sum of the non-missing values contained in the ar-
gument list.
A simple example to generate a standard normal variate Z, and from it a normal variate Y
with mean u and standard deviation s is given below. Assume u=5 and s=3.
Title2 "Normal Random Variates";
Data One;
Drop I u s;
Retain u 5 s 3;
Do I=1 To 25;
Z=Normal(0); /* Use system clock as seed */
Y=u + s*Z;
Output;
End;
Run;
OBS Z Y
1 2.01531 11.0459
2 0.58587 6.7576
3 0.18383 5.5515
4 -1.08207 1.7538
5 -1.87971 -0.6391
6 -0.87702 2.3689
7 0.63108 6.8932
8 1.53379 9.6014
9 -0.34128 3.9761
10 0.47535 6.4261
11 1.26282 8.7885
12 0.64412 6.9323
13 2.11873 11.3562
14 0.25801 5.7740
15 0.08347 5.2504
16 -1.37542 0.8737
17 -1.00606 1.9818
18 -0.68600 2.9420
19 -0.91837 2.2449
20 -1.28510 1.1447
21 -1.69460 -0.0838
22 -1.37999 0.8600
23 0.35292 6.0588
24 0.76663 7.2999
25 0.00566 5.0170
RIGHT(string) will right align a character string by removing trailing blanks from
the end of the string and inserting them at the beginning of the string. Thus, it does
not change the length of the string.
SCAN(string,n,delimiters) returns the nth word from the character string string,
where words are delimited by the characters in delimiters. If delimiters is omitted
from the function, then blanks and most punctuation and special characters are used
as the delimiters. Consult the SAS help or SAS/BASE documentation. If there are
fewer words in the string than given by n, then a blank character string is returned.
SUBSTR(string,start,n) returns a substring or part of string beginning with the
character at the position start in the string and continuing for n characters. If n is
omitted, then the remainder of the string is extracted.
SUBSTR(string1,start)=string2 replaces the characters in string1 beginning at
position start in string1 with string2.
TRIM(string) returns a new string whose trailing blanks have been removed and
whose length corresponds with the position of the last non-blank character in string.
A blank string, however, is returned as a string with one blank character.
TRIMN(string) is like TRIM(string), but a blank string is returned as a null string
(length of zero).
UPCASE(string) returns a new string with any lowercase characters replaced with their
uppercase counterparts.
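Several of these functions are combined in the sketch below, which splits a name written as "last, first" and rebuilds it with an initial. The name and the variable names are made up for illustration.

```sas
Data _NULL_;
   Length Name $ 20 Last First $ 10 Initial $ 1;
   Name="smith, pat";
   Last =Scan(Name,1,", ");             /* first word: comma and blank delimit */
   First=Scan(Name,2,", ");             /* second word */
   Initial=Upcase(Substr(First,1,1));   /* first character, in uppercase */
   Substr(Last,1,1)=Upcase(Substr(Last,1,1));  /* replace a character in place */
   Full=Trim(Last)||", "||Initial||".";        /* TRIM drops the trailing blanks */
   Put Last= First= Initial= Full=;
Run;
```

The LENGTH statement matters here: without it, Initial would take its length from the first argument of SUBSTR, and the trailing blanks would show up inside Full.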
while much more programming is needed if we use the character variable. Basically, we
would have to create our own "SAS date" representation of the character variable for such
comparisons.
where arrayname is the name for the array and {number|*} is either the number of variables
in that dimension of the array or is "*", indicating that all variables listed are to be used
in the one-dimensional array. Using the Flier sunfish data, let's assume that we wished to
create a set of new variables that were the logarithmic transformation of a set of the original
variables. Rather than write a number of assignment statements, we can use arrays, a single
assignment statement, and a loop to complete the task. The following SAS code illustrates
the brute force method that we wish to avoid.
Title2 "Brute Force Approach to Repetitive Task";
Data One;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
If TSL=0 Then DELETE;
LnLt=Log(Lt); LnWt=Log(Wt); LnTSL=Log(TSL);
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
Now let's rewrite this program using arrays and the basic "do" loop.
Title2 "Arrays and Do Loop";
Data One(DROP=I);
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
If TSL=0 Then DELETE;
Array RawVars{3} Lt Wt TSL;
Array NewVars{3} LnLt LnWt LnTSL;
Do I=1 To 3;
NewVars(I)=Log(RawVars(I));
End;
Datalines;
7 21 74 2 7 3 0 5 3.5 1.4 57.00
1 9 76 2 3 2 0 1 5.3 3.0 0.00
12 18 74 2 4 1 0 5 5.4 3.4 83.20
2 15 76 2 1 1 0 5 6.0 5.7 111.00
9 13 75 2 2 2 1 5 10.1 23.4 203.00
;
The data sets produced by each of these programs are the same. However, imagine that
instead of 3 variables that needed to be transformed, there were many more. Multidimen-
sional arrays are also allowed simply by specifying multiple subscript sizes.
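As a sketch of a two-dimensional array, suppose (hypothetically) that length and weight were each recorded on two occasions for every fish. The variable names and the data lines are invented. The variables fill the array row by row, so each row corresponds to an occasion and each column to a measurement.

```sas
Data Logs(DROP=I J);
   Input Lt1 Wt1 Lt2 Wt2;
   Array Raw{2,2} Lt1 Wt1 Lt2 Wt2;         /* filled row by row */
   Array Ln{2,2}  LnLt1 LnWt1 LnLt2 LnWt2;
   Do I=1 To 2;                 /* row: occasion */
      Do J=1 To 2;              /* column: length, weight */
         Ln{I,J}=Log(Raw{I,J});
      End;
   End;
   Datalines;
3.5 1.4 5.3 3.0
5.4 3.4 6.0 5.7
;
Proc Print Data=Logs;
Run;
```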
Now, let's rearrange the data into the univariate view, where one variable will contain the
age of the catch and another will contain the number caught of that age. We will start with
the raw data and use a loop to read in the data.
Title2 "Multivariate To Univariate Data View I";
Data TroutU;
/* Flack Lake trout Catch Data*/
Input Year @;
Do Age=3 To 9;
Input Number @;
Output;
End;
Datalines;
1968 13 129 646 954 99 19 4
1969 19 169 416 1031 243 47 18
1970 40 354 606 479 152 18 7
1971 32 606 1424 644 157 23 17
1972 0 226 1178 1156 116 16 5
1973 2 165 593 982 428 22 11
1974 53 209 560 410 30 0 4
1975 0 105 674 446 16 2 2
1976 46 422 838 726 70 4 4
1977 3 310 1224 1068 65 0 0
1978 14 354 1264 1172 69 0 6
1979 6 429 1222 1067 192 0 0
;
Proc Print Data=TroutU(Obs=22);
Run;
OBS YEAR AGE NUMBER
1 1968 3 13
2 1968 4 129
3 1968 5 646
4 1968 6 954
5 1968 7 99
6 1968 8 19
7 1968 9 4
8 1969 3 19
9 1969 4 169
10 1969 5 416
11 1969 6 1031
12 1969 7 243
13 1969 8 47
14 1969 9 18
15 1970 3 40
16 1970 4 354
17 1970 5 606
18 1970 6 479
19 1970 7 152
20 1970 8 18
21 1970 9 7
22 1971 3 32
Here we only listed the first 22 observations of the data set to illustrate the format of the
univariate view.
Title2 "Multivariate To Univariate Data View II";
Data TroutU;
Drop Age3-Age9;
Array Ages{3:9} Age3-Age9;
Set TroutM;
Do Age=3 To 9;
Number=Ages(Age);
Output;
End;
Run;
This data and print step will produce the same output as the previous one. In this instance
we are accessing a SAS data set that already has the data in the multivariate view. Notice
that the ARRAY statement specifies the beginning and ending value of the array index (3
and 9). Notice also that the DROP statement is needed to remove the "old" variables AGE3
through AGE9 from the new data set.
PROC TRANSPOSE
Before leaving this section it is worth looking at a procedure developed to convert between
the univariate and multivariate data views. Although not appropriate for all problems, it
can be very useful for many. PROC TRANSPOSE takes an input data set, and based upon
some structure commands, creates a new data set with a different configuration. For this
example we will again use the lake trout data and we will input it into the multivariate
view. Then PROC TRANSPOSE will be called to convert it to the univariate view. The DATA=
and OUT= options on the PROC TRANSPOSE statement specify the input and output SAS data
sets, respectively. The VAR statement lists the variables that will be transposed. The BY
statement instructs PROC TRANSPOSE to treat each year separately, i.e., we want to transpose
all of the values of the specified variables before moving on to the next year.
Title2 "PROC TRANSPOSE";
Data TroutM;
/* Flack Lake Trout Catch Data*/
Input Year Age3-Age9;
Datalines;
1968 13 129 646 954 99 19 4
1969 19 169 416 1031 243 47 18
1970 40 354 606 479 152 18 7
1971 32 606 1424 644 157 23 17
1972 0 226 1178 1156 116 16 5
1973 2 165 593 982 428 22 11
1974 53 209 560 410 30 0 4
1975 0 105 674 446 16 2 2
1976 46 422 838 726 70 4 4
1977 3 310 1224 1068 65 0 0
1978 14 354 1264 1172 69 0 6
1979 6 429 1222 1067 192 0 0
;
Proc Transpose Data=TroutM Out=TroutU;
By Year Notsorted;
Var Age3-Age9;
Run;
Proc Print Data=TroutU;
Run;
OBS YEAR _NAME_ COL1
1 1968 AGE3 13
2 1968 AGE4 129
3 1968 AGE5 646
4 1968 AGE6 954
5 1968 AGE7 99
6 1968 AGE8 19
7 1968 AGE9 4
8 1969 AGE3 19
9 1969 AGE4 169
10 1969 AGE5 416
11 1969 AGE6 1031
12 1969 AGE7 243
13 1969 AGE8 47
14 1969 AGE9 18
Only the first 14 observations are listed. The _NAME_ variable contains the names of the
original variables, while the COL1 variable contains the values that those variables held.
There are some options on the PROC TRANSPOSE procedure line that can be used to make
the variable names more attractive. However, we will use the data step below to make the
data set look like one that we might have created from scratch if our original intent was
to have a univariate view. This will make the reverse transpose to follow more realistic
looking. Note the use of the INPUT() and SUBSTR() functions to convert values such as
AGE4 into a numeric value, here, 4. Then we dropped the _NAME_ variable as it is no longer
needed. We also renamed the COL1 variable to NUMBER.
/* Make the data set look more like one that would
have come from reading the data directly into
the univariate view. This makes the example a
little more realistic. */
Data TroutU;
Set TroutU;
Age=Input(Substr(_NAME_,4),2.);
Rename Col1=Number;
Drop _NAME_;
Run;
OBS YEAR NUMBER AGE
1 1968 13 3
2 1968 129 4
3 1968 646 5
4 1968 954 6
5 1968 99 7
6 1968 19 8
7 1968 4 9
8 1969 19 3
9 1969 169 4
10 1969 416 5
11 1969 1031 6
12 1969 243 7
13 1969 47 8
14 1969 18 9
Again, only the first 14 observations are listed here. In transposing the data set back, we
would like to have the same variable names as before. This would have been easy had we
not adjusted the data set as above. However, that is not particularly realistic, since you
would not likely transpose a data set and then reverse the transpose exactly as given later.
Since variable names cannot be numbers, we need to have a way to handle the numeric
values for AGE. PROC TRANSPOSE will use formats when available and this will be our way
out. For a very large number of levels of AGE, it might be more productive to construct a
character variable (like _NAME_) containing the names of the variables that we wish to create.
The ID statement gives the variable containing the names of the new variables. Since we
used a FORMAT statement for AGE, the formatted values will be used.
Proc Format;
Value Ages 0="AGE0" 1="AGE1" 2="AGE2" 3="AGE3" 4="AGE4"
5="AGE5" 6="AGE6" 7="AGE7" 8="AGE8" 9="AGE9";
Proc Transpose Data=TroutU Out=TroutM;
By Year Notsorted;
Var Number;
Id Age;
Format Age Ages.;
Run;
OBS YEAR _NAME_ AGE3 AGE4 AGE5 AGE6 AGE7 AGE8 AGE9
With the exception of the _NAME_ variable, the data set looks like the data set we started
with. Thus, it is relatively easy to go either direction with PROC TRANSPOSE when dealing
with univariate and multivariate views.
The use of the single @ sign keeps the pointer on the same line until all data for an employee
have been read. Then the loop exits and the data step begins again, but because the single
@ sign was used, the previous line gets discarded and a new one is worked on.
This example also demonstrates several other features of the SAS language that have not
yet been discussed. The 2 lines N+1; and TOTWT+WT; are called sum statements. They take
the constant or the value of the variable to the right of the plus sign and add it to the value
of the variable to the left of the plus sign. They are not exactly like, for example, N=N+1; because
when the data step loop begins again, unless N is retained it will lose its value. This does
not happen with variables in sum statements. They are automatically retained and are
initialized to zero rather than to missing.
Note also the strange IF SEX=2; statement that appears to be missing the THEN clause. This
statement is equivalent to the one IF SEX NE 2 THEN DELETE;. It is called a subsetting IF
statement.
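The contrast between a sum statement and an ordinary assignment can be seen in the following sketch. The data set Females, the variable Count, and the three data lines are made up for illustration.

```sas
Data Females;
   Input Sex Wt;
   If Sex=2;            /* subsetting IF: only females continue */
   N+1;                 /* sum statement: starts at 0 and is retained */
   TotWt+Wt;
   Count=Count+1;       /* assignment: Count is reset to missing each pass */
   Datalines;
1 1.4
2 3.0
2 23.4
;
Proc Print Data=Females;
Run;
```

N and TotWt accumulate over the two female observations, while Count stays missing because missing plus 1 is missing. SAS writes a note about the missing values to the log.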
Lastly is the data step "subroutine." The data step does not have subroutines that are
separated from the data step; rather they are contained within it and all variables are global
to the main and subroutine sections. These "subroutines" behave like goto statements,
but control can return to the code immediately following the call. These subroutines are
called from input/output conditions, such as above where the condition is an "End of File"
condition and the INFILE option EOF is used to point to a subroutine to be executed when the
condition is met, or from LINK and GOTO statements. If the LINK statement is used, control
returns to the spot immediately following the LINK statement, while the GOTO statement
simply redirects program execution through the subroutine and control usually returns
to the top of the data step loop. Note the use of the RETURN statement. The first use
prevents execution from continuing into the subroutine, while the second marks the end of
the subroutine. Had the first RETURN been left out, a "running" average of the fish weights
would have been generated for the females.
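The LINK form of the call can be sketched as follows. The label Convert and the two data lines are our own choices. The first RETURN ends the main section and the second sends control back to the statement after the LINK.

```sas
Data One;
   Input Lt Wt TSL;
   If TSL>0 Then Link Convert;   /* run the labeled section, then come back */
   Output;
   Return;                       /* do not fall into the labeled section */
Convert:
   ConvFact=Lt/TSL;
   Return;                       /* back to the statement after LINK */
   Datalines;
3.5 1.4 57.00
5.3 3.0 0.00
;
Proc Print Data=One;
Run;
```

Both observations are output, but ConvFact is computed only for the first because the labeled section is skipped when TSL is zero.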
This particular method uses unequal probability methods in the selection of each element
into the sample, but because each sample of size 12 has the same probability of selection as
every other sample of size 12, the sample is a simple random sample. Since we do not know
when the 12 elements will be selected (it could be the first 12 or the last 12 observations),
we use an indeterminate loop, the DO UNTIL loop. Note also that the data step loop is
executed only once here. Otherwise, multiple samples of size 12 would have been taken.
The STOP statement is used to keep the data step from looping again by simply stopping
execution of the data step.
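Since the program itself is not reproduced here, the following is only a sketch of one way such a sample could be drawn, and it is not necessarily the workshop's own code. The data set Pop and the variable names Fish, Needed, Left, and TotObs are hypothetical. Each observation is selected with probability equal to the number still needed divided by the number not yet examined.

```sas
Data Pop;                        /* a stand-in population of 25 records */
   Do Fish=1 To 25;
      Output;
   End;
Run;
Data Sample(DROP=Needed Left);
   Needed=12;                    /* sample size still to be filled */
   Left=TotObs;                  /* NOBS= is set before the step executes */
   Do Until(Needed=0);
      Set Pop NOBS=TotObs;       /* reads the next observation each pass */
      If Ranuni(0) < Needed/Left Then
         Do;
            Output;
            Needed=Needed-1;
         End;
      Left=Left-1;
   End;
   Stop;                         /* one pass only: one sample of size 12 */
Run;
Proc Print Data=Sample;
Run;
```

When the number still needed equals the number of observations left, the selection probability becomes 1, so the sample always reaches 12 provided the population holds at least that many observations.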
Using formats
For some problems the data recoding can be particularly complicated. Either a simple
mathematical transformation is not possible (as could be done above), or one may need
to convert between numeric and character data. The IF-THEN-ELSE statements or the SELECT
statement can be used for these purposes, but often require a lot of programming.
Oftentimes the PUT() and INPUT() functions can be used to simplify this process. Assume that
we have a data set on vegetation collected from transects run across a marsh in which the
species of plant and coverage along a 5m stick are recorded. To facilitate data entry, only
a 2-character abbreviation for the species is input. However, the scientist would like the
complete name in the data set. The example below illustrates one approach to accomplishing
this.
Title2 "Data Recoding";
Proc Format;
Value $Veg "wg"="Wire Grass" "sp"="Spartina patens"
"br"="Bull Rush" "wi"="Widgeon Grass";
Run;
Data One;
Input Sp $ Coverage @@;
Length Species $20.;
Species=Put(Sp,$Veg.);
Drop Sp;
Datalines;
wi 0.4 wi 0.6 wg 1.2 br 0.2 sp 4.8
;
Proc Print Data=One;
Run;
where the fish subdirectory is relative to our current working directory on Unix. Alternatively,
the FILENAME statement can be used to link a file reference to the actual file. This
can make programs much easier to use on multiple platforms and to update and modify
later. The basic statement looks like
FILENAME fileref "path-and-file-name";
Using the WIN95 file given earlier, a skeleton program would look like
CHAPTER 3. WORKING WITH FILES 39
FILENAME fish "c:\fish\raw.dat";
Data one;
Infile fish;
Input ....;
Run;
Notice that the INFILE statement is still needed, but it specifies the file reference rather
than the actual file name. The FILENAME statement can also be used to reference more than
one file in a subdirectory. Study the following example.
FILENAME fish "c:\fish"; /* specify the subdirectory only */
Data Halibut;
Infile Fish(Halibut.dat);
Input ...;
Run;
Data Coho;
Infile Fish(Coho.dat);
Input ...;
Run;
The SAS system assumes that data files will have the extension .DAT, and so the extension
can be dropped from the INFILE statement. Thus the code could have been written,
FILENAME fish "c:\fish"; /* specify the subdirectory only */
Data Halibut;
Infile Fish(Halibut);
Input ...;
Run;
Data Coho;
Infile Fish(Coho);
Input ...;
Run;
An example of these methods is shown for the flier data set. The original data set was also
broken into 2 pieces called FLIERS1.DAT and FLIERS2.DAT, and the variable headings were
removed from each of these two data sets. To better illustrate the results, the SAS log for
the code is shown. The first method uses only the INFILE statement.
238 Title2 "External File Reference With Infile";
239 Data Flier;
240 Infile "c:\projects\alaska97\flier.dat" Firstobs=2;
241 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
242 Run;
243
244 Proc Print Data=Flier(Obs=10);
245 Run;
The FILENAME statement can also be used to refer to directories. When directories are
used, the INFILE statement is then used to select the member to process. This method also
permits several directories to be concatenated together to search for files, and the same file
reference can be used for multiple files, and for reading and writing.
254 Title2 "FILENAME Directory Reference";
255 Filename Fish "c:\projects\alaska97";
256 Data Flier;
257 Infile Fish(Flier.dat) Firstobs=2;
258 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
259 Run;
NOTE: A total of 664 records were read from the infile library FISH.
The minimum record length was 102.
The maximum record length was 102.
NOTE: 664 records were read from the infile FISH(Flier.dat).
The minimum record length was 102.
The maximum record length was 102.
NOTE: The data set WORK.FLIER has 664 observations and 11 variables.
NOTE: The DATA statement used 0.59 seconds.
Since the SAS system assumes that data files end with .DAT, the suffix can be dropped
from the member name. However, for names that do not have a suffix, the name should be
enclosed within quotes.
261 Data Flier;
262 Infile Fish(Flier) Firstobs=2;
263 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
264 Run;
NOTE: A total of 664 records were read from the infile library FISH.
The minimum record length was 102.
The maximum record length was 102.
NOTE: 664 records were read from the infile FISH(Flier).
The minimum record length was 102.
The maximum record length was 102.
NOTE: The data set WORK.FLIER has 664 observations and 11 variables.
NOTE: The DATA statement used 0.66 seconds.
Sometimes the data that we would like to work with exist in more than one physical raw
file. For example, the catch data might be kept in separate spreadsheets for each harvest
year. To analyze the data together, the data sets must be concatenated. Below, the two
flier data sets (the original data set split into 2 pieces) will be input and concatenated using
3 different data steps. Note that the file reference from above is being reused.
266 Title2 "Data Set Concatenation of Files";
267 Data Flier1;
268 Infile Fish(Fliers1);
269 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
270 Run;
NOTE: A total of 298 records were read from the infile library FISH.
The minimum record length was 102.
The maximum record length was 102.
NOTE: 298 records were read from the infile FISH(Fliers1).
The minimum record length was 102.
The maximum record length was 102.
NOTE: The data set WORK.FLIER1 has 298 observations and 11 variables.
NOTE: The DATA statement used 0.44 seconds.
NOTE: A total of 366 records were read from the infile library FISH.
The minimum record length was 102.
The maximum record length was 102.
NOTE: 366 records were read from the infile FISH(Fliers2).
The minimum record length was 102.
The maximum record length was 102.
NOTE: The data set WORK.FLIER2 has 366 observations and 11 variables.
NOTE: The DATA statement used 0.44 seconds.
NOTE: The data set WORK.FLIER has 664 observations and 11 variables.
NOTE: The DATA statement used 0.5 seconds.
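The log above shows the code for the first data step only. A minimal sketch of the remaining two steps, reconstructed from the notes in the log (the exact statements used are an assumption), would read the second file and then concatenate the two SAS data sets with a SET statement:

```sas
/* Sketch: read the second piece, then stack the two data sets */
Data Flier2;
  Infile Fish(Fliers2);
  Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
Run;
Data Flier;
  Set Flier1 Flier2;   /* observations of Flier1 followed by those of Flier2 */
Run;
```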
As the FILENAME statement can concatenate directories together, it can also concatenate
physical files together. The method is to put the list of filenames within parentheses, each
separated by a comma.
279 Title2 "FILENAME Concatenation of Files";
280 Filename Fish
("c:\projects\alaska97\fliers1.dat","c:\projects\alaska97\fliers2.dat");
281 Data Flier;
282 Infile Fish;
283 Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
284 Run;
Clearing a file reference can free up some memory, but it is usually more important in complicated
programs as a way to ensure that an important file is not written over, or the wrong file read as input,
because of a programming error.
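As a brief sketch, a file reference is cleared with the CLEAR argument of the FILENAME statement; the fileref Fish is the one used in the examples above.

```sas
Filename Fish Clear;   /* release the file reference Fish */
```

After this statement, any later INFILE or FILE statement that names Fish will fail until the reference is defined again.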
Files can also be created and appended to using techniques similar to those above for input.
Basically, define a file reference to receive the data and use the FILE statement to direct
output to the reference. The example below first writes the FLIER1 data to an external ASCII
file and then (perhaps at some later time) appends the FLIER2 data to
the same external file.
72 Title2 "Creating An External File";
73 Filename Fish "c:\projects\alaska97";
74 Data _NULL_;
75 File Fish(Flier1N2);
76 Set Flier1;
77 Put (Mo Day Yr) (3.) (Ar St Sex Age Sn) (2.) (Lt Wt TSL) (7.3);
78 Run;
NOTE: A total of 298 records were written to the file library FISH.
The minimum record length was 40.
The maximum record length was 40.
NOTE: 298 records were written to the file FISH(Flier1N2).
The minimum record length was 40.
The maximum record length was 40.
NOTE: The DATA statement used 0.6 seconds.
NOTE: A total of 366 records were written to the file library FISH.
The minimum record length was 40.
The maximum record length was 40.
NOTE: 366 records were written to the file FISH(Flier1N2).
The minimum record length was 40.
The maximum record length was 40.
NOTE: The DATA statement used 0.44 seconds.
A sample of what the output data set looks like is given below.
7 21 74 2 7 3 0 5 3.500 1.400 57.000
1 9 76 2 3 2 0 1 5.300 3.000 0.000
12 18 74 2 4 1 0 5 5.400 3.400 83.200
2 15 76 2 1 1 0 5 6.000 5.700111.000
12 14 75 2 1 1 0 5 6.100 6.600109.000
Notice the use of the grouped formats to associate a single format with several different
variables that appear together. A slightly larger field width for TSL appears to be needed, since
values such as 111.000 fill the whole field and run into the preceding value. To
append the second data set to the first, it was necessary to specify the MOD option on the
FILE statement.
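The append step itself is not shown in the log above. A minimal sketch, assuming the same file reference and the same formats as the first step, would be:

```sas
/* Sketch: append the FLIER2 data to the file written earlier */
Data _NULL_;
  File Fish(Flier1N2) Mod;   /* MOD appends rather than replaces */
  Set Flier2;
  Put (Mo Day Yr) (3.) (Ar St Sex Age Sn) (2.) (Lt Wt TSL) (7.3);
Run;
```

Without MOD, the FILE statement would replace the existing contents of the file.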
This technique can be used not only to input files via FTP, but also to retrieve any information
that can be obtained via FTP. To get a directory listing from the remote machine
above, we might write
Title2 "FTP Access To List A Directory";
Filename Dir FTP " " /* Null data set (required) */
ls /* FTP command */
host="unix1.sncc.lsu.edu" /* Host name */
user="bill" /* User login name */
prompt /* Prompt for password */
cd="/u/bill/alaska/stuff"; /* Change directory first */
Data _NULL_;
File Log;
Infile Dir;
Input;
Put _INFILE_;
Run;
The data step simply reads then writes the directory listing to the SAS Log. Note that the
data could have been placed into a variable or variables, and then processed using the data
step.
3.1.2 WWW Access
Not to be outdone by the FTP access method, the FILENAME statement can also reference
a URL (uniform resource locator), otherwise known as a web page. The page might be a
hypertext document or might be a file being distributed via a WWW server. In the example
below, we will assume that the flier data set is stored at the URL
http://www.stat.lsu.edu/faculty/moser/flier.dat
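A minimal sketch of reading that file with the URL access method follows. The fileref name Web is an assumption, and the file is assumed to have the same layout as the local flier.dat read earlier (one line of variable headings followed by the data).

```sas
/* Sketch: read the flier data directly from a WWW server */
Filename Web URL "http://www.stat.lsu.edu/faculty/moser/flier.dat";
Data Flier;
  Infile Web Firstobs=2;     /* skip the line of variable headings */
  Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL;
Run;
```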
An important aspect of FTP and URL access is that data may be stored in some
"distant" location, yet reachable by network, and be accessed at runtime. This could be
very important for real-time data, such as from data loggers, or when
many people are accessing the data and a single, master copy is needed. From a systems
programmer's perspective, it means that the SAS system can be used to develop an interface
to the internet.
The NOSOURCE2 option to the %INCLUDE statement is used to suppress printing the included
code to the SAS log.
where "libname" is some 1-8 character name and "path" is the path in the filesystem
where the library is to be kept. Upon startup, the SAS System internally issues a LIBNAME
statement defining a library named WORK which it uses as the default library. By default, at
the end of the SAS session, the WORK library is cleaned of all SAS files. Thus, to keep any
SAS data sets permanent, you should store them in a library that you specify. To do this,
a two-level data set name is used in the SAS program. A two-level name has two names
which are separated by a period. The first name is the library name defined in the LIBNAME
statement, and the second name is the member or data set name.
/* Point to the file containing the raw effort data */
FILENAME raw "c:\temp\effort97.csv";
/* Point to the SAS Library to store the effort data */
LIBNAME effort "c:\fishing\effort";
DATA effort.year1997;
LENGTH location $ 30;
INFILE raw DELIMITER=",";
/* The colon modifier stops reading the date at the delimiter */
INPUT date :mmddyy8. location $ effort;
RUN;
Here, the raw data are input into the SAS data set effort.year1997, which can be used
later in the same program or in some future program for analysis, without the need for the
above data step. For example, the SAS code below could be used to print the data set at
some future time.
LIBNAME effort "c:\fishing\effort";
PROC PRINT DATA=effort.year1997;
RUN;
3.3.2 Library Procedures
There are several procedures for working with SAS libraries. The procedures permit update
of the information about a data set and utility operations such as copying, deleting, or
renaming a member. You can also append one member to another.
PROC DATASETS
PROC DATASETS does many things related to the management of data sets. The usual way
to operate with this procedure is in full-screen mode (the default). In full-screen mode you
can easily modify parameters of the data set, copy, rename, and delete data sets, and list
various parameters associated with them. A couple of these tasks are shown below in a
non-full-screen session.
1777 Title2 "Proc Datasets";
1778 Proc Datasets Library=Work Nofs;
-----Directory-----
Libref: WORK
Engine: V612
Physical Name: C:\SAS\SASWORK\#TD16077
1 CURSTAT CATALOG
2 FLIER DATA
3 FLIER1 DATA
4 FLIER2 DATA
5 SASMACR CATALOG
1779 Modify Flier(Label="James Geaghan Flier Data Set");
1780 Run;
PROC CONTENTS
This type of listing can be very important for documentation purposes and for finding out
what is in a permanent SAS data set.
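A listing of this kind is produced by PROC CONTENTS. A minimal sketch, assuming the FLIER data set in the WORK library used in the examples above:

```sas
/* Sketch: describe the variables and attributes of a data set */
Title2 "Proc Contents";
Proc Contents Data=Flier;
Run;
```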
PROC COPY
PROC COPY is used to copy SAS data set members (and catalogs) from one data library to
another. This is useful for making backups. When combined with import/export engines,
it can also be used to convert data from one form into another. The example below simply
copies a data set from one library to another.
1789 Libname Perm "c:\temp";
NOTE: Libref PERM was successfully assigned as follows:
Engine: V612
Physical Name: c:\temp
1790 Proc Copy In=Work Out=Perm;
1791 Select Flier;
1792 Run;
PROC APPEND
PROC APPEND is normally used to append one SAS data set to another SAS data set called
the base. If the base is a new data set, then a simple copy is used. In the example below,
the FLIER2 data set is appended to the original FLIER data set.
1795 Title2 "Proc Append";
1796 Proc Append Base=Perm.Flier Data=Flier2;
1797 Run;
PROC DELETE
The DELETE procedure simply deletes a SAS data set.
1799 Title2 "Proc Delete";
1800 Proc Delete Data=Perm.Flier;
1801 Run;
1 AF AFGO 09/08/97
2 DMKEYS KEYS 04/06/97 Function Key Definitions
3 PASSIST SLIST 09/25/97 User profile
4 _VPLAY_ SLIST 06/25/97 VIDEO: player preferences
5 MRUWSAVE WSAVE 09/25/97
Some of these catalogs might be accessible using functions found in PROC CATALOG, but typ-
ically the parameters that they contain are to be set using the programs that these catalogs
are associated with (such as the Display Manager). This procedure is more commonly used
in its interactive mode.
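For non-interactive use, a minimal sketch of listing the entries of a catalog is given below. The catalog name SASUSER.PROFILE is an assumption made here for illustration; substitute the catalog of interest.

```sas
/* Sketch: list the entries of a catalog (catalog name is assumed) */
Proc Catalog Catalog=SASUser.Profile;
  Contents;
Run;
Quit;
```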
3.4 File Import/Export/Transport
3.4.1 Import/Export
Many users enter their data using spreadsheets or data base software such as Microsoft
Excel, Lotus 1-2-3, or Borland Paradox. If you have SAS/ACCESS installed, special
import/export filters are available to directly access and use the data stored in several
other software formats. A SAS/AF application is provided on the File->Import and
File->Export pull-down menu that automates the programming of this task.
The example below uses PROC ACCESS to import an Excel spreadsheet containing age and
growth information on Flier sunfish, courtesy of Dr. James Geaghan, LSU. The first non-blank
row of the spreadsheet contains the variable names (SAS compatible), and the data follow in
the remaining rows.
PROC ACCESS DBMS=EXCEL;
CREATE work.xcell.ACCESS;
PATH='C:\fishing\Flierdat.xls';
WORKSHEET='flier';
GETNAMES YES;
SCANTYPE=YES;
CREATE work.xcell.VIEW;
SELECT ALL;
RUN;
DATA work.flier;
SET work.xcell;
RUN;
The PROC ACCESS step creates a view into the data set and specifies the particular worksheet
to import, whether variable names are to be taken from the worksheet, etc. The library
name WORK is included to show that the data view is actually two catalog members within
the library member XCELL. Note that the data step is the process that actually converts
the data to a SAS data set. It is possible to use the data directly from the spreadsheet by
referencing the data view as in the SET statement.
PROC PRINT DATA=work.xcell;
RUN;
In general, this would not be the most efficient way to access the data as the "conversion"
would need to be performed on each data access.
It is also possible to export the data to other program formats. PROC DBLOAD is used below
to convert the flier data set back into an Excel spreadsheet.
PROC DBLOAD DBMS=EXCEL DATA=work.flier;
PATH='C:\fishing\newflier.xls';
PUTNAMES YES;
LIMIT=0;
LOAD;
RUN;
Also there are a number of other options for both PROC ACCESS and PROC DBLOAD that
can be used to control the variable names, types, data ranges, and other input/output
information.
What if you do not have SAS/ACCESS? It is not too difficult to input a spreadsheet using
the SAS data step. If the data contain no commas, then the "comma separated values"
(CSV) format produced by most spreadsheets and many data bases is often convenient.
The first step is to save the spreadsheet into the CSV format (don't remove the original
spreadsheet file; the CSV file is just an intermediate step). Next, write a SAS data step
to input the data using an INFILE statement containing the DELIMITER="," option. As an
example, let's assume that the flier data set has been saved as a CSV file called "flierdat.csv".
A portion of the data set is shown below.
Mo,Day,Yr,Ar,St,Sex,Age,Sn,Lt,Wt,TSL,Size1,Size2,Size3,Size4,Size5,Size6,Edge,No
7,21,74,2,7,3,0,5,3.5,1.4,57.00,0.00,0.00,0.00,0.00,0.00,0.00,34.72,1
1,9,76,2,3,2,0,1,5.3,3.0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2
12,18,74,2,4,1,0,5,5.4,3.4,83.20,0.00,0.00,0.00,0.00,0.00,0.00,47.79,3
2,15,76,2,1,1,0,5,6.0,5.7,111.00,0.00,0.00,0.00,0.00,0.00,0.00,60.97,4
12,14,75,2,1,1,0,5,6.1,6.6,109.00,0.00,0.00,0.00,0.00,0.00,0.00,60.04,5
Now we can write a SAS data step to read in these data. Note that we need to skip over
the first line of the data set as it contains the spreadsheet column headings.
FILENAME in "flierdat.csv";
DATA flier;
INFILE in FIRSTOBS=2 DELIMITER=",";
INPUT Mo Day Yr Ar St Sex Age Sn Lt Wt TSL
Size1 Size2 Size3 Size4 Size5 Size6 Edge No;
RUN;
If character data were included in the data set, then appropriate "informats" would need
to be used for reading those variables.
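As a sketch of what this might look like, suppose a hypothetical file catch.csv holds a species name, a sampling date, and a weight on each line after a heading line. The file name, variable names, and field widths are all assumptions made for illustration.

```sas
/* Sketch: CSV file with a character field and a date field */
Data Catch;
  Infile "catch.csv" Firstobs=2 Delimiter="," DSD;
  Length Species $ 20;            /* allow names up to 20 characters */
  Input Species $ Date :mmddyy8. Weight;
  Format Date mmddyy8.;           /* display the date in readable form */
Run;
```

The DSD option treats two adjacent commas as a missing value and removes quotation marks placed around character values, which many spreadsheets add when a value contains a comma.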
The third-party software product DBMS/COPY is designed to move data between a number
of different data formats including SAS data sets. This introduces the need for a "third"
program, but can greatly ease the data export/import process, especially when the same
data are needed in several different formats.
The SAS/CONNECT product, which interconnects a network of PCs, Unix workstations,
and/or mainframes, can also be used to move SAS data sets from one machine type to
another. This product contains PROC UPLOAD and PROC DOWNLOAD for easily moving SAS
data sets, but they can also be used to handle ASCII files. This product also permits
you to execute code and work with data on several different platforms at the same time.
3.4.2 Transport
To copy a SAS data set from one system to another may lead to trouble if the systems and
SAS versions are not exactly compatible with one another. For example, the structure of a
SAS data set on an IBM MVS mainframe is quite different from that on a WIN95 PC. To
move or copy a SAS data set, a transport data library can be created. Since it is a library,
it may contain more than one SAS data set. The CPORT and CIMPORT procedures are used,
respectively, to create and import a transport data library. The process is to first create
the transport library, transport it to the other host system, then import the library. A
real advantage to this methodology is that you can send someone a SAS data set and know
exactly what is in the data set, while sending them a "raw" data set requires that they
know how to read in the data correctly. In other instances, the data may not exist in its
"raw" form, but may have been entered directly into a SAS data set. Below is an example
that creates a transport library of the flier data set, then imports it as if we had moved
the transport library to another system. First, let's create the transport data library. The
LIBRARY argument specifies which SAS data library will be used, the FILE argument specifies
the destination for the transport image, and the MT argument specifies what types of things
from the SAS data library are to be transported. Here we specified that only SAS data sets
are to be considered.
668 Title2 "Create A Transport Data Library";
669 Filename XPT "c:\projects\alaska97\flier.xpt";
670 Proc CPort Library=Work File=XPT MT=Data;
671 Run;
Note that we could couple some of this code with the FTP file access method described
earlier to let the SAS system move the file for us. When moving the transport library,
use a binary transfer method so that the internal structure of the transport library is not
changed.
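The import step on the receiving system can be sketched as follows. The libref NewLib and its path are assumptions made for illustration, and the transport file is assumed to have been moved to the location named in the FILENAME statement.

```sas
/* Sketch: restore the data sets held in the transport file */
Title2 "Import A Transport Data Library";
Filename XPT "c:\projects\alaska97\flier.xpt";
Libname NewLib "c:\temp";          /* hypothetical destination library */
Proc CImport Library=NewLib InFile=XPT;
Run;
```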
3.5 The X Files
The SAS system provides, through various methods, access to the operating system and its
commands. The X statement can be used to pass a command along to the operating system
and have the command executed as if it were given at a system command prompt (such
as at an MS-DOS, Unix, or terminal window). The SAS program will wait, by default,
until the command has completed. Also by default, the user will have to "manually" close
the external shell or program started by the X statement. The program below will use the
WIN95 operating system to generate a directory listing of the flier data sets and will redirect
the output to a file named flier.dir. The NOXWAIT system option is used to tell the SAS
system to continue processing SAS statements once the X statement has completed, without
requiring the user to close the command window, while
the XSYNC option is used to prevent the SAS system from processing more statements before
the command has completed. We then open the flier.dir file and read its contents
and write them to the SAS log.
652 Options XSync NoXWait;
653 X "dir c:\projects\alaska97\flier*.dat > c:\projects\alaska97\flier.dir"
653 ;
654 Filename Fish "c:\projects\alaska97";
655 Data _NULL_;
656 Infile Fish("flier.dir") EOF=LF;
657 File Log;
658 Input;
659 If _N_=1 Then Link LF;
660 Put _INFILE_;
661 Return;
662 LF:
663 Put //;
664 Return;
665 Run;
NOTE: A total of 11 records were read from the infile library FISH.
The minimum record length was 0.
The maximum record length was 56.
NOTE: 11 records were read from the infile FISH("flier.dir").
The minimum record length was 0.
The maximum record length was 56.
NOTE: The DATA statement used 0.66 seconds.
Note that with very little programming we could have read the directory listing into variables
in a SAS data set and have processed them in some fashion, say to compute the disk space
occupied by the files.
Chapter 4
The Macro Language
The %LET statement has the name of the variable to receive the macro assignment, the equals
sign, then the value to be assigned to the macro variable. Note that the variable is written
without any special symbols (at least for now). Also note that the value is given without
any quotes. We'll address more complicated issues as we move along. In this particular
instance if we had several different data sets that we might print, it becomes easy to specify
a different data set and, at the same time, have its name included in the title. The type
of quotes used, however, is very important. You definitely must use double quotes if you
want the macro variable to resolve to its value. Single quotes will print the ampersand and
name as written.
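The difference between the two kinds of quotes can be seen in a short sketch. The use of a TITLE3 line is an addition made here for comparison only.

```sas
%Let DSName=Flier;
Title2 "Listing of Data Set &DSName";  /* double quotes: &DSName is replaced by Flier */
Title3 'Listing of Data Set &DSName';  /* single quotes: &DSName is printed as written */
Proc Print Data=&DSName(Obs=10);
Run;
```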
Normally, there is nothing written on the SAS log to indicate what happened with respect
to the macro substitution. To see what actually was substituted we can turn on one of the
system options, SYMBOLGEN. Then upon re-running the code we see the following:
21 Options Symbolgen;
22 %Let DSName=Flier;
SYMBOLGEN: Macro variable DSNAME resolves to Flier
23 Title2 "Listing of Data Set &DSNAME";
24 Proc Print Data=&DSName(Obs=10);
SYMBOLGEN: Macro variable DSNAME resolves to Flier
25 Run;
Now we can see what value was substituted for &DSNAME. Let's look at the following code.
27 %Let Somecode=%Str(Proc Print Data=Flier(Obs=10); Run;);
28 %Unquote(&Somecode);
SYMBOLGEN: Macro variable SOMECODE resolves to Proc Print
Data=Flier(Obs=10); Run;
SYMBOLGEN: Some characters in the above value which were subject to macro
quoting have been unquoted for printing.
Thus, macro variables can actually hold very complicated expressions. The macro quoting
function %STR() was used to quote the entire expression so that the internal parentheses
and semicolons would not cause any problems. However, when we reference the variable,
the value remains quoted, which means that its contents will not be executed as normal SAS
language statements. To get rid of the special quoting, the macro function %UNQUOTE was
used.
Although these variables are quite powerful with respect to substitution, they have their
limits for writing reusable software. Macro procedures give us even more control of code
generation and variable substitution.
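A minimal sketch of such a macro procedure is given below. It is reconstructed from the MPRINT output shown later in this chapter, so the exact form of the definition is an assumption.

```sas
/* Sketch: a macro with one positional parameter, the data set name */
%Macro PRT(DSName);
  Title2 "Listing of Data Set &DSName";
  Proc Print Data=&DSName(Obs=10);
  Run;
%Mend PRT;

%PRT(Flier)   /* call the macro for the Flier data set */
```

The statements between %MACRO and %MEND are stored when the definition is submitted and are generated, with &DSName replaced, each time the macro is called.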
Again, it is difficult to determine what is actually happening in the macro. The SYMBOLGEN
results do help but they appear out of place. This is because macros are actually compiled
before they are executed. This is also why the complexity of macro programming can be
greater within a macro procedure. You should notice that to perform the same task on a
different data set would only require another call to the procedure. For example
%PRT(Bluegill);
The MPRINT system option can be used to print out the SAS language statements generated
by the macro language. The code below turns on this option and turns off the SYMBOLGEN
option.
37 Options NoSymbolGen MPrint;
38 %PRT(Flier)
MPRINT(PRT): TITLE2 "Listing of Data Set Flier";
MPRINT(PRT): PROC PRINT DATA=FLIER(OBS=10);
MPRINT(PRT): RUN;
Not only did we get the macro variables resolved, but we also have them listed in the context
of the SAS language statements.
It is also possible to put optional arguments to a macro. This permits you to assign initial
values to these macro variables, which can be overwritten when the macro is called. The
example below makes the data set name optional.
40 %Macro PRT(DSName=Flier);
41 Title2 "Listing of Data Set &DSNAME";
42 Proc Print Data=&DSName(Obs=10);
43 Run;
44 %Mend PRT;
45 %PRT;
MPRINT(PRT): TITLE2 "Listing of Data Set Flier";
MPRINT(PRT): PROC PRINT DATA=FLIER(OBS=10);
MPRINT(PRT): RUN;
Note that the macro was called without replacing the data set name. Had we wanted to
use the BLUEGILL data set, for example, we would have made the call
%PRT(DSNAME=BLUEGILL);
We can even include required and optional parameters in the same macro. The required or
positional parameters must be called in the same order as given in the %MACRO statement,
while the optional parameters can be given in any order following after the positional
parameters.
47 %Macro Sex(Sex,DSName=Flier);
48 Data &DSName&Sex;
49 Set &DSName;
50 Where (Sex=&Sex);
51 Run;
52 %Mend Sex;
Now we can call the new macro and specify which sex we would like to create the data set
for. Notice here the complicated name &DSName&Sex. This is actually two macro names
joined together. When executed, both macro names will be resolved then joined together
to form a single name. Let's call the macro and look at the results produced.
53 %Sex(1,DSNAME=Flier);
MPRINT(SEX): DATA FLIER1;
MPRINT(SEX): SET FLIER;
MPRINT(SEX): WHERE (SEX=1);
MPRINT(SEX): RUN;
NOTE: The data set WORK.FLIER1 has 306 observations and 11 variables.
NOTE: The DATA statement used 0.44 seconds.
54 %Sex(2);
MPRINT(SEX): DATA FLIER2;
MPRINT(SEX): SET FLIER;
MPRINT(SEX): WHERE (SEX=2);
MPRINT(SEX): RUN;
NOTE: The data set WORK.FLIER2 has 340 observations and 11 variables.
NOTE: The DATA statement used 0.38 seconds.
55 %Sex(3);
MPRINT(SEX): DATA FLIER3;
MPRINT(SEX): SET FLIER;
MPRINT(SEX): WHERE (SEX=3);
MPRINT(SEX): RUN;
56
57 Proc Datasets Library=Work;
-----Directory-----
Libref: WORK
Engine: V612
Physical Name: C:\SAS\SASWORK\#TD41921
The macro created each of the 3 data sets, one for each sex category. Now if we knew
that we would be doing this, there might be other ways to accomplish this with the macro
language. The macro language also has looping statements like the SAS language does.
61 %Macro Sex(DSName=Flier);
62 %Do Sex=1 %To 3;
63 Data &DSName&Sex;
64 Set &DSName;
65 Where (Sex=&Sex);
66 Run;
67 %End;
68 %Mend Sex;
NOTE: The data set WORK.FLIER1 has 306 observations and 11 variables.
NOTE: The data set WORK.FLIER2 has 340 observations and 11 variables.
Data _NULL_;
Set AllStats;
If Iter=0 Then
Do;
MonteP=(_N_-1)/(&MaxIter+1);
Put "Monte Carlo P-value is " MonteP 6.4;
End;
Run;
%Mend MonteT;
NOTE: The data set WORK._NOUNK_ has 646 observations and 11 variables.
MPRINT(MONTET): QUIT;
NOTE: The data set WORK.BOOTSAMP has 646 observations and 12 variables.
The last 3 steps will continue until all iterations have been completed. Once completed, the
code generated will be
MPRINT(MONTET): TITLE2 "Listing of All ANOVA Statistics";
MPRINT(MONTET): PROC PRINT DATA=ALLSTATS(OBS=10);
MPRINT(MONTET): RUN;
NOTE: The data set WORK.ALLSTATS has 100 observations and 8 variables.
NOTE: The data set WORK.ALLSTATS has 100 observations and 8 variables.
The final page of the output produced had the following observations, including the original
analysis results (ITER=0),
OBS ITER _NAME_ _SOURCE_ _TYPE_ DF SS F PROB
We get an answer very similar to the one obtained by the original ANOVA analysis.
With many more simulations we would have obtained a more precise estimate of the sampling
distribution of the test statistic. It does take a while for the macro to execute. Its speed can
be increased by suppressing some of the printed output. For example, suppress the macro
printed output along with the notes.
Options NoNotes NoMPrint NoSymbolgen;
Proc Format;
Value State
1="ALABAMA" 2="ARIZONA" 3="ARKANSAS"
4="CALIFORNIA" 5="COLORADO" 6="CONNECTICUT"
7="DELAWARE" 8="WASHINGTON, D.C." 9="FLORIDA"
10="GEORGIA" 11="IDAHO" 12="ILLINOIS"
13="INDIANA" 14="IOWA" 15="KANSAS"
16="KENTUCKY" 17="LOUISIANA" 18="MAINE"
19="MARYLAND" 20="MASSACHUSETTS" 21="MICHIGAN"
22="MINNESOTA" 23="MISSISSIPPI" 24="MISSOURI"
25="MONTANA" 26="NEBRASKA" 27="NEVADA"
28="NEW HAMPSHIRE" 29="NEW JERSEY" 30="NEW MEXICO"
31="NEW YORK" 32="NORTH CAROLINA" 33="NORTH DAKOTA"
34="OHIO" 35="OKLAHOMA" 36="OREGON"
37="PENNSYLVANIA" 38="RHODE ISLAND" 39="SOUTH CAROLINA"
40="SOUTH DAKOTA" 41="TENNESSEE" 42="TEXAS"
43="UTAH" 44="VERMONT" 45="VIRGINIA"
46="WASHINGTON" 47="WEST VIRGINIA" 48="WISCONSIN"
49="WYOMING" 50="ALASKA" 51="HAWAII";
Run;
Proc Sort;
By state citysize;
Run;
Proc print;
Format murder manslght rape robbery assault burglary larceny motor 6.2;
Run;
%Include "dendro.sas";
%dendro;
Quit;
Chapter 5
SAS Special Files
5.1 Autoexec.sas
The AUTOEXEC.SAS file is typically stored in a user's home directory or in the main SAS
subdirectory. Once the SAS system initializes, it will, by default, read and execute the
statements contained in the AUTOEXEC.SAS file. This makes it a very convenient way to
set up your environment with the basic settings that you might like. You can define your
graphics drivers here as well as printout size, and other options. In developing applications,
it can be a way to autostart a program for persons that are non-programmers, but need to
use a pre-built SAS application.
Below is an example AUTOEXEC.SAS file that I use on my Unix system. Some parts I use
frequently and others much less frequently.
*************************************************;
* AUTOEXEC.SAS *;
*************************************************;
%Macro CGM;
%************************************************;
%* CGM options are CGMFRMA - Monochrome *;
%* CGMFRGA - Gray Scale *;
%* CGMFRCA - Color *;
%************************************************;
Filename GSASFile '/tmp/sas.cgm';
GOptions Device=CGMFRMA GAccess=GSASFile GSFMode=Replace
FText=HWCGM010 HText=1 CText=Black
FTitle=HWCGM010 HTitle=1 CTitle=Black;
%* FText = HWCGM010 is for Helvetica;
%* Set FText to HWCGM001 For SansSerif Font;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: CGMFRMA);
%Put %Str(NOTE: Writing Graphics Output To: /tmp/sas.cgm);
%Put %Str( );
%Mend CGM;
%Macro CGMX;
%************************************************;
%* Display device is XCOLOR - Then print to CGM *;
%* CGM options are CGMFRMA - Monochrome *;
%* CGMFRGA - Gray Scale *;
%* CGMFRCA - Color *;
%************************************************;
Filename GSASFile '/tmp/sas.cgm';
GOptions Device=XColor TargetDevice=CGMFRMA GAccess=GSASFile
GSFMode=Replace FText=HWCGM010 HText=1 CText=Black;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: XCOLOR);
%Put %Str(NOTE: Target Device is: CGMFRMA);
%Put %Str(NOTE: Writing Target Output To: /tmp/sas.cgm);
%Put %Str( );
%Mend CGMX;
%Macro EPS;
%************************************************;
%* EPS -Generate Encapsulated Postscript Output.*;
%************************************************;
FILENAME GSASFile "graph.eps";
GOPTIONS Device=PSEPSF TargetDevice=PSEPSF
CBack=White Colors=(Black)
GAccess=GSASFile NoPrompt GSFMode=Replace;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: PSEPSF);
%Put %Str(Note: Graphics Output to: graph.eps);
%Put %Str( );
%Mend EPS;
%Macro EPSX;
%************************************************;
%* EPSX-Generate Encapsulated Postscript Output.*;
%************************************************;
FILENAME GSASFile "graph.eps";
GOPTIONS Device=XCOLOR TargetDevice=PSEPSF
CBack=White Colors=(Black)
GAccess=GSASFile NoPrompt GSFMode=Replace;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: XCOLOR);
%Put %Str(Note: Graphics Output to: graph.eps);
%Put %Str( );
%Mend EPSX;
%Macro PS;
%************************************************;
%* PS - Generate Postscript Output For Printing.*;
%************************************************;
FILENAME GSASFile pipe 'lpr -Pps -h';
GOPTIONS Device=PS1200 TargetDevice=PS1200
CBack=White Colors=(Black)
GProlog='252150532D41646F62652D0D0A'x
GAccess=GSASFile NoPrompt GSFMode=Replace;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: PS1200);
%Put %Str( );
%Mend PS;
%Macro PSX;
%************************************************;
%* PSX - Display on XCOLOR, then print to PS.  *;
%************************************************;
FILENAME GSASFile pipe 'lpr -Pps -h';
GOPTIONS Device=XColor
TargetDevice=PS1200 CBack=White Colors=(Black)
GProlog='252150532D41646F62652D0D0A'x
GAccess=GSASFile NoPrompt GSFMode=Replace;
%Put %Str( );
%Put %Str(Note: Graphics Device is: XCOLOR);
%Put %Str(NOTE: Hardcopy Graphics Device is: PS1200);
%Put %Str( );
%Mend PSX;
%Macro PSCX;
%************************************************;
%* PSCX - Display on XCOLOR, print to PSCOLOR.  *;
%************************************************;
FILENAME GSASFile pipe 'lpr -Ppsc0 -J/nff/nb';
GOPTIONS Device=XColor
TargetDevice=PSCOLOR CBack=White
GProlog='252150532D41646F62652D0D0A'x
GAccess=GSASFile NoPrompt GSFMode=Replace;
%Put %Str( );
%Put %Str(Note: Graphics Device is: XCOLOR);
%Put %Str(NOTE: Hardcopy Graphics Device is: PSCOLOR);
%Put %Str( );
%Mend PSCX;
%Macro HP;
%************************************************;
%* HP - Generate HPLJS3 Output For Printing. *;
%************************************************;
FILENAME GSASFile pipe 'lpr -Php1 -J/nff/nb';
GOPTIONS Device=HPLJS3 GAccess=GSASFile NoPrompt GSFMode=Append;
%Put %Str( );
%Put %Str(NOTE: Graphics Device is: HPLJS3);
%Put %Str( );
%Mend HP;
The macros make it easy for me to assign the graphics devices that I want to use. For
incorporating graphics into FrameMaker I use the %CGM and %CGMX macros. For incorporating
graphics into LaTeX I use %EPS and %EPSX, while for direct printing I might use %PS.
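As a minimal sketch of how one of these macros might be used in a session, the following calls %EPS and then draws a scatter plot, so that the graph is sent to graph.eps rather than to the screen. The data set Demo and its variables Len and Wt are made up for illustration, and the sketch assumes that SAS/GRAPH is available and that the AUTOEXEC.SAS file above has already been run.

```sas
* Small made-up data set used only for this illustration;
Data Demo;
   Input Len Wt;
   Cards;
10 12
15 30
20 55
25 95
;
Run;

* Route graphics to graph.eps using the macro defined in AUTOEXEC.SAS;
%EPS;

Proc GPlot Data=Demo;
   Plot Wt*Len;
Run;
Quit;
```

Calling %EPSX in place of %EPS would instead preview the graph on the XCOLOR display as it would appear on the PSEPSF target device.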
/*
* -maps specifies the pathname of the map datasets used by PROC GMAP.
*/
-maps !SASROOT/maps
/*
* -msg specifies the directory where the SAS System will search
* for the files containing the text for all error messages
* and notes. These messages are stored in an internal format.
*/
-msg !SASROOT/sasmsg
/*
* -news specifies the name of a text file that will automatically
* be displayed in the log when SAS is invoked.
*/
-news !SASROOT/misc/base/news
/*
* -sasautos establishes the path(s) to director(ies) for automatic
* macro definitions to be searched by the macro facility when
* an unknown macro is referenced.
*/
-sasautos !SASROOT/sasautos
/*
* -sashelp specifies the pathname for the directory containing on-line
* help and menu screens for the SAS System.
*/
-sashelp !SASROOT/sashelp
/*
* -sasuser specifies the pathname for the directory used by the SAS
* System as a default place to store files, such as the SAS
* user profile catalog. See your SAS System documentation
* for more information.
*/
-sasuser ~/sasuser
/*
* -work specifies where to create the SAS work library. This
* library is temporary and any SAS data sets created there
* will be deleted when the system terminates. The unique name
* 'SAS_workANNNN' is assigned to each SAS work library. 'A' is
* some letter and 'NNNN' is the hexadecimal representation of the
* process ID of the SAS process.
*/
-work /tmp
/*
* -sasscript specifies the location to search for SAS/CONNECT scripts.
*/
-sasscript !SASROOT/misc/connect
/*
* -dms specifies that you are running SAS in Display Manager mode.
*/
-dms
/*
*
* -memsize limits the amount of memory that will be allocated by the
* SAS System.
*/
-memsize 32m
/*
*
* -sortsize limits the amount of memory that will be allocated during
* sorting operations.
*/
-sortsize 16m
/*
* Default windowing system to use.
*/
-fsdevice x11
/*
* -helpenv specifies that native help should be used when help is
* invoked.
*/
-helpenv helplus
/*
* -helploc specifies the location of the native help files for helplus.
* You would have to specify '-helpenv helplus' to use those files.
*/
-helploc !SASROOT/X11/native_help
/*
* -samploc specifies the location of the sample files for helplus.
*/
-samploc !SASROOT/X11/native_help
/*
* -path specifies the search path that the SAS System will use
* to find the dynamically loaded modules. Each -path
* specification indicates one directory. They will be
* searched in the order in which they are given.
*/
-path !SASROOT/sasexe/base
-path !SASROOT/sasexe/graph
-path !SASROOT/sasexe/stat
-path !SASROOT/sasexe/fsp
-path !SASROOT/sasexe/af
-path !SASROOT/sasexe/insight
-path !SASROOT/sasexe/ets
-path !SASROOT/sasexe/eis
-path !SASROOT/sasexe/iml
-path !SASROOT/sasexe/connect
-path !SASROOT/sasexe/or
-path !SASROOT/sasexe/qc
-path !SASROOT/sasexe/dbi
-path !SASROOT/sasexe/english
-path !SASROOT/sasexe/fsc
-path !SASROOT/sasexe/gis
-path !SASROOT/sasexe/image
-path !SASROOT/sasexe/lab
-path !SASROOT/sasexe/nvi
-path !SASROOT/sasexe/pub
-path !SASROOT/sasexe/share
-path !SASROOT/sasexe/trader
-path !SASROOT/sasexe/toolkit
-path !SASROOT/sasexe/spectraview
-path !SASROOT/sasexe/unixdb
-path !SASROOT/sasexe/mddbserver
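To confirm which of these settings are actually in effect for a session, the values can be queried from within SAS. The lines below are a sketch only. They assume a SAS release in which PROC OPTIONS accepts the OPTION= argument and in which %SYSFUNC and the GETOPTION function are available, which may not be true of older releases.

```sas
* Write the current values of selected system options to the log;
Proc Options Option=Work;
Run;
Proc Options Option=SortSize;
Run;

* The same information is available to the macro language;
%Put NOTE: SASAUTOS is set to %SysFunc(GetOption(SASAUTOS));
```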
CHAPTER 6. SAS INTERNET TOOLS
Options PS=55 LS=78 PageNo=1 NoDate;
Options Formchar="|----|+|---+=|-/\<>*";
Title1;
Data Flier;
Infile "c:\projects\alaska97\flierdat.csv" Delimiter="," Firstobs=2;
Input Mo Day Yr Ar St Sex Age Sn Lt Wt TSL
Size1 Size2 Size3 Size4 Size5 Size6 Edge No;
Run;
%out2htm(capture=on);
%out2htm(capture=off,proploc=sasuser.htmlgen.outprop.slist,
encode=n,htmlfile=flier.htm);
Note the use of HTML codes embedded within the TITLE statement so as to modify the default
settings. The first call to OUT2HTM turns capturing on. The second call ends capturing,
specifies the properties catalog that will control the appearance of the HTML code, and also
specifies the output file to contain the results. It is also possible to specify modifications to
the properties directly in the OUT2HTM macro, as below:
%out2htm(capture=off,encode=n,dface=COURIER,hface=COURIER,htag=PREFORMATTED TEXT,
htmlfile=flier.htm,ttag=NO FORMATTING);
The DFACE variable controls the typeface for the data, the HFACE variable controls the typeface
for the headings, HTAG controls the tag type(s) that will be used around the headings,
and the TTAG controls the tag type(s) around the titles. The ENCODE variable determines
whether or not the angle brackets (greater than and less than signs) are encoded so as to
print in HTML as angle brackets. To capture the SAS log, use the option WINDOW=LOG. To
append HTML output to an existing file, specify OPENMODE=APPEND. There are many other
options that are possible.
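As an illustration of the last two options, the sketch below captures the SAS log for a short step and appends it to the HTML file created earlier. It uses only the parameters named above (CAPTURE, WINDOW, OPENMODE, HTMLFILE). The PROC MEANS step is a made-up stand-in for whichever step is of interest, and the sketch assumes that WINDOW=LOG should be given on both calls so that capturing starts and stops on the same window.

```sas
* Start capturing what is written to the log window;
%out2htm(capture=on, window=log);

* Any step could go here; this one summarizes the Flier data;
Proc Means Data=Flier;
   Var Lt Wt;
Run;

* Stop capturing and append the log to the existing HTML file;
%out2htm(capture=off, window=log, openmode=append, htmlfile=flier.htm);
```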
For the settings that I used (fixed-width fonts and black color), the following page was
observed in my browser.
Since a properties catalog can be constructed with the properties information, once you
have found the properties you like for a particular type of report, you may want to enter
them into a properties catalog. An easy way to do this is interactively. From within the
display manager, issue the following macro call
%out2htm(runmode=I);
This will bring up a dialog box within which you can specify the properties catalog to use.
Then select the properties button on the dialog box and make whatever modifications
you have decided upon. The first dialog box that you encounter looks like the following.
From this dialog box you can update the properties very easily and then save them away. Later,
when you wish to assign those properties to the SAS output or log, simply refer to this
properties list and you'll not have to make a long list of properties in the macro call.
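For example, a later batch job might reuse a saved properties list as sketched below. The catalog entry sasuser.htmlgen.outprop.slist is the one used earlier in this chapter. The PROC PRINT step, the title text, and the output file name flier2.htm are assumptions made only for illustration.

```sas
* HTML tags in the title are left unencoded when encode=n is used below;
Title1 '<I>Flier Data Listing</I>';

%out2htm(capture=on);

Proc Print Data=Flier(Obs=10);
   Var Mo Day Yr Sex Age Lt Wt;
Run;

* Apply the saved properties list instead of listing each property;
%out2htm(capture=off, proploc=sasuser.htmlgen.outprop.slist,
         encode=n, htmlfile=flier2.htm);
```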