Jack N Shoemaker, Thotwave Technologies, Cary, NC: Proc Format in Action
Beginning Tutorials
Paper 56-27
ABSTRACT
The PROC FORMAT subsystem provides an efficient and compact way to store all sorts of facts and data for data-driven applications. This paper focuses on PROC FORMAT in action: building user-defined formats from existing tables and using PROC FORMAT in conjunction with other SAS procedures like SUMMARY, FREQ, TABULATE, REPORT, and PRINT. Since these other SAS procedures are format-aware, judicious use of user-defined formats can save hours of coding and result in more robust, data-driven applications.
There is nothing inherently wrong with this approach. If the existing data set called old has a numeric column called sex which takes only the values 1 or 2, this data step will create a new character column called SexString which takes the values Female or Male. One problem with this approach is that it may waste computer resources, namely storage. If the old data set is small, this likely isn't a concern. However, if the old data set were a twenty-million-row claims history file from the administrative records of a health insurer, creating what is essentially a copy of the data set with a new six-character column may be a show stopper.

If a SEX. format existed which displayed 1s as Female and 2s as Male, you could display the values in old directly without creating a whole new data set. For example, to count the numbers of males and females in the old data set you could use PROC FREQ as follows.

   proc freq data = old;
      format sex sex.;
      table sex;
   run;

But the SEX. format doesn't come with SAS. To make the above proc step work, you must create a user-defined format using PROC FORMAT. For example,

   proc format;
      value sex
         1 = 'Female'
         2 = 'Male'
      ;
   run;

The code above will create a numeric format called SEX. The PROC FORMAT block contains three statements, each terminated with a semicolon. The first statement calls PROC FORMAT. The second statement is a value clause that defines the name and contents of the user-defined format. The final statement is the run statement which, though not technically required, is good programming practice nonetheless. The PROC FORMAT block may contain as many value clauses as you desire.

The value clause consists of a format name followed by a set of range-value pairs which define what the format does. A range may be a single item as shown in the example above, a list of items separated by commas, or a span expressed as from - through.

Value clauses with format names which begin with a dollar sign define character formats; names without a dollar sign define numeric formats as shown above. Note that the distinction between numeric and character applies to the column to which the format will be applied, not to the displayed value. For instance, the example above defines a numeric format because it will be applied to a numeric column. Once defined, you may use the user-defined format called SEX. anywhere you would use a SAS-supplied format. Note that all formats, user-defined or otherwise, end with a period when referenced, as in the PROC FREQ example above.
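The kinds of ranges described above can be sketched in a single PROC FORMAT step. This is only an illustration: the format names AGEGRP. and $SEXC., the cut points, and the labels are invented for the example and do not appear elsewhere in this paper.

```sas
proc format;
   /* Numeric format: a single value, a comma-separated list, */
   /* a from - through range, and an open-ended range.        */
   value agegrp
      0          = 'Infant'
      1, 2, 3    = 'Toddler'
      4 - 12     = 'Child'
      13 - high  = 'Teen or adult'
   ;
   /* Character format: the name begins with a dollar sign    */
   /* because it will be applied to a character column.       */
   value $sexc
      'F' = 'Female'
      'M' = 'Male'
   ;
run;
```

Values that fall between the listed ranges, such as 3.5 in this sketch, match no range and are displayed unformatted, so the cut points should be chosen with the actual data in mind.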
WHAT IS SAS?
That may seem like an odd question to pose to a group of SAS programmers and developers. But it is one that your author has encountered frequently over the years. SAS is a programming language, or set of programming languages depending upon how you do the maths. There is the ubiquitous base language with its data steps and procs. There is the macro language which bears close resemblance to the base language, but doesn't follow quite all the same rules. There is the SAS component language, née screen control language, or SCL, which doesn't look a thing like the base language save the use of the semicolon as a statement terminator. In fact, if SAS stands for anything, it is semicolon, always semicolon. The data-step portion of the base language would be recognized as a programming language by a developer, or programmer, of any other programming language. The data step provides syntax for assignment, looping, and logical branching that are the basis of any language. The base language also supplies pre-built procedures, or procs, some of which would be recognized by users of reporting and statistical packages.
SAVING SPACE
SUGI 27
conditions in the data step. However, this approach requires that you change your code to account for these changes in the data. And, if the sex column is used at multiple points in your application, you may end up changing code in many places. For coded data only slightly more complicated than our sex example, this can quickly devolve into a maintenance nightmare.

Using a user-defined format provides a more robust solution to this problem because it separates the code in the application (the PROC FREQ in our simple example) from the data (that is, the input data set and the translation data which de-references the coded values of sex). For example, we can easily modify our initial definition of SEX. to include a translation for value 3 and a catch-all for all other values.

   proc format;
      value sex
         1     = 'Female'
         2     = 'Male'
         3     = 'Indeterminate'
         other = 'Unknown'
      ;
   run;

We can now run our PROC FREQ application without change. Furthermore, we have surfaced the translation rules for the sex codes rather than burying them in an if-then-else tree in a data step.

This brief introduction is not meant to be a full description of the FORMAT procedure. For that, you should turn to the PROC FORMAT section of the SAS reference manual. Rather, this should whet your appetite about the possibilities. Unlike the familiar aspects of data-step programming or the reporting-package-like functionality of PROC PRINT and PROC REPORT, the FORMAT procedure has no close analog in other programming languages. It is not quite an enumerated data type, nor is it exactly like a dictionary table in Python, though these constructs are close. In your author's humble opinion, PROC FORMAT is a sub-system of SAS deserving of its own special treatment, like SQL.

What follows is a loosely-knit series of examples showing how to use PROC FORMAT in everyday applications. It is hoped that some of these examples will resonate and help you with your day-to-day programming tasks.
SASAVE.SUGI.EXAMPLE.FORMATC refers to a catalog member called EXAMPLE in the catalog called SUGI in the library called SASAVE. The final node of this four-level name, FORMATC, means that EXAMPLE is a user-defined character format. If you are using one of the operating systems listed above which support tree-structured directories, you can browse the directory contents and see the actual file names which correspond to the data set and catalog listed above. For example, if you are running version 8 of the SAS system under Windows NT, then the data set would have this name:

   SUGI.sas7bdat

While the catalog would appear as:

   SUGI.sas7bcat

The default format catalog is LIBRARY.FORMATS. That is, a catalog called FORMATS in the library called LIBRARY. The library called LIBRARY should be created by the person, or group, who administers SAS at your site. The installation process does not create this library. However, somewhat paradoxically, SAS searches for a library called LIBRARY for many of its default operations, like locating user-defined formats. The definition for the library called LIBRARY usually occurs in your AUTOEXEC.SAS file which you should find in the SAS root directory which contains the SAS executable file, sas.exe.

You can use PROC CATALOG to list the contents of a format catalog or any other SAS catalog for that matter. For example, the following code fragment will display a list of all the members of the default catalog, LIBRARY.FORMATS:

   proc catalog c = library.formats;
      contents stat;
   run;

The output will look something like this:

   #  Name    Type     Description
   ----------------------------------
   1  AGE     FORMAT
   2  PHONE   FORMAT
   3  AGE     FORMATC
   4  MYDATE  INFMT

The actual display will be wider than what's shown here, which has been truncated to fit within the margins of this paper. Note that there are three different member types: FORMAT, FORMATC, and INFMT. The FORMAT member type specifies a numeric or picture format. The FORMATC member type specifies a character format. And the INFMT member type specifies an informat, which is used to read rather than display data.

In version 8, the description attribute is left blank. In earlier versions, the description attribute contains some technical details about the format like default length and maximum size. In any event, you should use the description attribute to provide short documentation about the user-defined format. The name-space for user-defined formats still remains just eight characters, which means that your format names will look pretty dense, like variable names and such in the pre-version 7 days. The description attribute provides a simple way to compensate for this lingering restriction. The following code fragment uses PROC CATALOG to modify the description attribute of two members of the temporary catalog WORK.FORMATS.

   proc catalog c = work.formats;
      modify age.format( description = 'Age Map' );
      modify age.formatc( description = 'Age Decoder' );
   run;

If your SAS system administrators have acted in a responsible fashion, you will not be allowed to modify the common LIBRARY.FORMATS catalog. So, the example above uses the temporary format catalog called WORK.FORMATS which is created in the temporary WORK library. Just as data sets created in the WORK library disappear at the end of your SAS session, a format catalog created in the WORK library will also disappear. Notwithstanding, for the purposes of illustration and discussion the remainder of this paper will use the temporary WORK library. The resulting display from PROC CATALOG would look like this:
displaying the contents of a user-defined format as regular SAS output. You can also unload the contents of a user-defined format into a SAS data set using the CNTLOUT= option on PROC FORMAT. For example, the following code fragment will create a data set called CNTLOUT from all the user-defined formats stored in the catalog called WORK.FORMATS.

   proc format library = work.formats cntlout = cntlout;
   run;

The resulting SAS data set will contain the following twenty-one columns.

   Variable  Type  Label
   -------------------------------------------
   DATATYPE  Char  Date/time/datetime?
   DECSEP    Char  Decimal separator
   DEFAULT   Num   Default length
   DIG3SEP   Char  Three-digit separator
   EEXCL     Char  End exclusion
   END       Char  Ending value for format
   FILL      Char  Fill character
   FMTNAME   Char  Format name
   FUZZ      Num   Fuzz value
   HLO       Char  Additional information
   LABEL     Char  Format value label
   LANGUAGE  Char  Language for date strings
   LENGTH    Num   Format length
   MAX       Num   Maximum length
   MIN       Num   Minimum length
   MULT      Num   Multiplier
   NOEDIT    Num   Is picture string noedit?
   PREFIX    Char  Prefix characters
   SEXCL     Char  Start exclusion
   START     Char  Starting value for format
   TYPE      Char  Type of format

If that seems like a lot of columns, it is. Most are there to provide the extra levels of control which are needed in specific circumstances. In fact there are only three required columns: FMTNAME, START, and LABEL. In addition to these required columns it is a good habit to include the TYPE column, which explicitly tells PROC FORMAT whether you are building a numeric or character format. Of course, if your format is to include ranges, you will need to include an END column as well as the START column. Finally, the HIGH, LOW, and OTHER keywords are coded in the HLO column.

In summary, the six commonly useful columns are listed below:

   Variable  Type  Label
   -------------------------------------------
   FMTNAME   Char  Format name
   TYPE      Char  Type of format
   START     Char  Starting value for format
   END       Char  Ending value for format
   LABEL     Char  Format value label
   HLO       Char  Additional information

Here's what the CNTLOUT data set for the AGE format looks like:

   FMTNAME  TYPE  START  END   LABEL  HLO
   AGE      N     0      20    1
   AGE      N     20     30    2
   AGE      N     30     HIGH  3
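The same six columns can be used in the other direction, to build a format from a data set through the CNTLIN= option on PROC FORMAT. The following is a minimal sketch rather than a definitive recipe; the data set name AGECNTL, the format name AGEGRP, and the cut points are assumptions made up for this illustration.

```sas
/* Build a control data set with the six commonly useful columns. */
/* AGECNTL and AGEGRP are hypothetical names for this sketch.     */
data agecntl;
   length fmtname $ 8 type $ 1 start end $ 16 label $ 16 hlo $ 1;
   fmtname = 'AGEGRP';   /* name of the format to create          */
   type    = 'N';        /* N = numeric format, C = character     */
   start = '0';  end = '19';   label = 'Under 20';  hlo = ' '; output;
   start = '20'; end = '29';   label = '20 - 29';   hlo = ' '; output;
   /* An H in the HLO column marks the end of the range as HIGH.  */
   start = '30'; end = 'HIGH'; label = '30 and up'; hlo = 'H'; output;
run;

/* Feed the control data set to PROC FORMAT to create AGEGRP.     */
proc format cntlin = agecntl;
run;

/* Print the stored definition to confirm what was built.         */
proc format fmtlib;
   select agegrp;
run;
```

Because no library is named, the format lands in the temporary WORK.FORMATS catalog, consistent with the other examples in this paper.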
   #  Name    Type     Description
   ----------------------------------
   1  AGE     FORMAT   Age Map
   2  PHONE   FORMAT
   3  AGE     FORMATC  Age Decoder
   4  MYDATE  INFMT
values in raw data. For example, the values of M and F could become Male and Female if displayed using a user-defined format called $SEX. In a sense, the user-defined format called $SEX. is just a two-column lookup table with M and F as the key values and Male and Female as the looked-up return values. You can use user-defined formats in just this fashion in a data step by using the PUT() function. Following along our example, if you wish to create a new data-step variable called description from an existing data-step variable called sex using a user-defined format called $SEX., you could use a piece of code like this:

   description = put( sex, $sex. );

This technique allows you to re-write if-then-else trees and replace them with a single line of code. For example, assume that you have a set of discount factors stored in a user-defined format called $DISC.

   proc format;
      value $disc
         'ABC' = '0.20'
         'DEF' = '0.25'
         'XYZ' = '0.00'
         other = '0.00'
      ;
   run;

You could replace code that looks like this:

   if vendor = 'ABC' then discount = 0.20;
   else if vendor = 'DEF' then discount = 0.25;
   else if vendor = 'XYZ' then discount = 0.00;

With a single statement that looks like this:

   discount = put( vendor, $disc. );

This technique also has the added advantage of separating the data (the table of discount factors) from the code. If you need to add or change the discount values for your vendors, you simply change that data outside of the data step and leave your existing data-step code alone.

One word of caution: the PUT() function always returns a character string. So, if you mean to use the return value as a number you must take some action to cause SAS to convert the character string to a number. For example:

   length discount 8;
   discount = put( vendor, $disc. );

or

   net = gross * ( 1 - put( vendor, $disc. ) );

That is, either explicitly declare the return variable as a number, or perform some sort of arithmetic on the result inside the assignment statement.

A simpler approach still is to create a user-defined informat instead of a format and use the input() function instead of the put() function. For example:

   proc format;
      invalue disc
         'ABC' = 0.20
         'DEF' = 0.25
         'XYZ' = 0.00
         other = 0.00
      ;
   run;

   discount = input( vendor, disc. );

This final technique has the added advantage of not producing any conversion messages in the SAS log. You may consider these messages harmless when you expect to see them. On the other hand, if you consider any conversion message in the SAS log to be a sign of sloppy or suspect programming, you should use a user-defined informat in conjunction with the input() function.
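To try the informat technique end to end, a small self-contained sketch may help. The data set name INVOICES, its columns, and the three sample rows are invented for this illustration; only the DISC. informat follows the example in the text.

```sas
/* Discount factors stored as a numeric informat.                */
proc format;
   invalue disc
      'ABC' = 0.20
      'DEF' = 0.25
      'XYZ' = 0.00
      other = 0.00
   ;
run;

/* INVOICES and its sample rows are hypothetical.                */
data invoices;
   input vendor $ gross;
   /* input() with a numeric informat returns a number, so no    */
   /* character-to-numeric conversion note is written to the log */
   discount = input( vendor, disc. );
   net = gross * ( 1 - discount );
   cards;
ABC 100
DEF 200
QRS 50
;
run;

proc print data = invoices;
run;
```

The vendor QRS is not listed in the informat, so it picks up the other= value; changing a discount means editing the PROC FORMAT step only.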
   end = intnx( 'month', rundate, 0, 'E' );
   label = 'This Month';
   output;
   start = intnx( 'month', rundate, -1 );
   end = intnx( 'month', rundate, -1, 'E' );
   label = 'Last Month';
   output;
   hlo = 'O';
   label = 'Really Old';
   output;
   stop;
run;
MULTI-VALUE LABELS
The next topic for this paper is multi-value labels. That is, how to handle situations where you want to use a user-defined format to associate more than one attribute with a given key value. For example, in our vendor example above, we might have a region and salesperson associated with each vendor as well as a discount amount. There are two choices: create a separate user-defined format for each attribute, or create a label which stores both attributes using some unique character to distinguish one attribute from the other. Consider the following VENDOR data set:

   data vendor;
      infile cards;
      input vendor $ region $ salesp $;
      cards;
   ABC NE Alice
   DEF MW Molly
   XYZ SE Linda
   ;
   run;

The following code fragment will create a CNTLIN= data set which will create two separate user-defined formats, one for the region and one for the salesperson.

   data cntlin( keep = fmtname type start label );
      retain type 'C';
      set vendor;
      start = vendor;
      fmtname = 'region';
      label = region;
      output;
      fmtname = 'salesp';
      label = salesp;
      output;
   run;

   proc sort data = cntlin;
      by fmtname;
   run;

   proc format cntlin = cntlin;
   run;

We could have created two separate CNTLIN data sets and fed them to PROC FORMAT one at a time. Instead we created a CNTLIN data set which contains two output rows for each row of input from the VENDOR data set. When using the latter technique the PROC SORT is crucial. Using it ensures that all the region definitions come first, followed by all the salesperson definitions.

Alternatively, you could create a label which concatenates the region and salesperson values with a delimiting character like #. For example,

   data cntlin( keep = fmtname type start label );
      retain fmtname 'vinfo' type 'C';
      set vendor;
      start = vendor;
      label = trim( region ) || '#' || salesp;
   run;

   proc format cntlin = cntlin;
   run;
function. For example, the following data-step code fragment will create two data-step variables called REGION and SALESP from VENDOR using the user-defined format $VINFO.

   length region $ 2 salesp $ 5 vinfo $ 8;
   vinfo = put( vendor, $vinfo. );
   region = scan( vinfo, 1, '#' );
   salesp = scan( vinfo, 2, '#' );
Before the multilabel option, there was really no way to do this save some very convoluted data-step trickery. Now, it is as simple as a single user-defined format and a PROC MEANS.
CONCLUSION
The FORMAT procedure is a real gem. You may use it in a variety of situations to make your code more robust and easier to maintain. Also, the use of PROC FORMAT encourages separation of code and data which leads to cleaner and more understandable code. This paper has explicated a couple of common uses for user-defined formats. There are plenty more that await you in your applications. Would that this paper aid you in that process of discovery.
Choice of the delimiting character is crucial when using this technique. The character you choose as a delimiter must never appear in either of the tokens inside the concatenated label.
MULTI-VALUE RANGES
Starting with version 8, PROC FORMAT allows you to define multi-value ranges. That is, ranges which have overlapping values. Normally overlapping ranges cause the FORMAT procedure to terminate in error. The special multilabel option instructs PROC FORMAT to construct a user-defined format using overlapping ranges. You may ask yourself, "Why on earth would I want to do this?" An example will illustrate the point.

   proc format;
      value age(multilabel)
         20 - 29   = '20 - 29'
         30 - 39   = '30 - 39'
         40 - 49   = '40 - 49'
         50 - 59   = '50 - 59'
         60 - high = '60 +++'
         20 - 35   = '20 - 35'
         36 - 55   = '36 - 55'
         55 - high = '55 +++'
      ;
   run;

The user-defined format called AGE. has overlapping ranges. The last three entries in the value clause overlap or are contained in the first five entries. To see how this might be useful, consider passing a data set containing income and age through PROC MEANS as follows:

   proc means data = sugi n min max maxdec = 0;
      class age / mlf;
      format age age.;
      var income;
   run;

Note the mlf option on the class statement. This instructs PROC MEANS that a multi-label format is in use. The resulting output is, frankly, quite remarkable:

   The MEANS Procedure

   Analysis Variable : Income

   Age          N    Minimum    Maximum
   ---------------------------------------------
   20 - 29   2170          0     150000
   20 - 35   3501          0     150000
   30 - 39   2201      10000     160000
   36 - 55   4417      10000     180000
   40 - 49   2189      20000     180000
   50 - 59   2289      30000     170000
   55 +++    2286      30000     190000
   60 +++    1151      40000     190000
   ---------------------------------------------
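PROC MEANS is not the only procedure that accepts the mlf option on its class statement; PROC SUMMARY and PROC TABULATE do as well. As a minimal sketch, assuming the same SUGI data set and the multilabel AGE. format defined above, the equivalent PROC TABULATE request might look like this (the table layout is only one of many possible choices):

```sas
/* Same multilabel grouping, produced with PROC TABULATE.      */
/* Assumes data set SUGI (age, income) and format AGE. exist.  */
proc tabulate data = sugi;
   class age / mlf;          /* honor the overlapping ranges   */
   format age age.;
   var income;
   /* One row per age label; count, minimum, maximum of income */
   table age, income * ( n min max );
run;
```

As with PROC MEANS, a single row of input can contribute to more than one age row, so the N column is not expected to add up to the number of observations.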
CONTACT INFORMATION
Your comments and questions are always valued and encouraged. If you have any other suggestions or techniques about PROC FORMAT that you would like to share, please feel free to drop me a line.

Jack N Shoemaker
ThotWave Technologies
Suite 202, 117 Edinburgh South
Cary, NC
Work Phone: 919 465 0322
Fax: 919 465 0323
Email: [email protected]
Web: http://www.thotwave.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.