Arrays: In and Out and All About
Marge Scerbo, CHPDM/UMBC
Abstract
This tutorial will present the basics of array statements and show easy examples of usage, leading to a final and more complicated process made efficient through the use of this statement. Arrays are SAS DATA step statements that allow clever programmers to do a great deal of work with little code. Iterative inputting of text and outputting of records are two tasks which can utilize the power of arrays to their fullest. Calculations of multiple values are also a simple task for this statement.

  if tot1 = 0 then tot1 = . ;
  if tot2 = 0 then tot2 = . ;
  if tot3 = 0 then tot3 = . ;
  if tot4 = 0 then tot4 = . ;
  if tot5 = 0 then tot5 = . ;
  if tot6 = 0 then tot6 = . ;
  if tot7 = 0 then tot7 = . ;
  if tot8 = 0 then tot8 = . ;
  if tot9 = 0 then tot9 = . ;
  if tot10 = 0 then tot10 = . ;
  if tot11 = 0 then tot11 = . ;
  if tot12 = 0 then tot12 = . ;
  if tot13 = 0 then tot13 = . ;
  if tot14 = 0 then tot14 = . ;
run;

This code is acceptable and readable, but imagine what this code would look like if there were 400 different variables!
Introduction
Some programmers actually like writing lines and lines of simple code. As long as the project is completed and the results are delivered within the expected time frame, this is usually acceptable. Other programmers hate to type and look for any method that lets them write as little code as possible, but they too want to get the job done. In either case, code may be easy or hard to review. A very long, monotonous program may be difficult to debug simply because of its volume; fix one mistake and another one is discovered. A very terse program with complicated algorithms may also be difficult to interpret in the future, by either the original programmer or a subsequent one.

SAS code allows programmers to write either way, long or short, easy or complicated. It is easy to get to the same result from different directions. Data step programming is the core of SAS code. Within a data step it is possible to accomplish many different tasks. This paper will add one new technique, that of the array statement, to a programmer's tool belt.

This paper will cover only:
- One-dimensional arrays
- Explicit arrays (the SAS Language guide recommends the use of explicit arrays rather than implicit arrays)
named, SAS creates new variables that are named array-name with the subscript number concatenated to the end. Arrays themselves are not data in a SAS data set.

The array-name and all the array-elements must be valid SAS names. In Version 6, this means that the name can be between 1 and 8 characters long, beginning with a letter (A-Z) or an underscore (_). These names cannot contain a blank or any special character except the underscore. Finally, names cannot be SAS reserved words. Version 8 allows names to be between 1 and 32 characters in length, with all the other rules still enforced.
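As a minimal sketch of the naming behavior described above, the step below declares an array without listing any elements. The array name score and the data set name demo are hypothetical, chosen only for illustration.

```sas
data demo;
  /* No elements are listed, so SAS creates the numeric variables
     score1, score2 and score3 (array name plus subscript). */
  array score(3);
  do i = 1 to 3;
    score(i) = i * 10;
  end;
  drop i;
run;

proc print data=demo;   /* shows score1=10, score2=20, score3=30 */
run;
```

The array itself does not appear in the output data set; only the three generated variables do.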
  array tots(400) tot1-tot400;
  do i = 1 to 400;
    if tots(i) = 0 then tots(i) = . ;
  end;
  drop i;
run;

Changing the array code to add or subtract array-elements is easy and can make the SAS code more flexible than the earlier, non-array, versions. Imagine adding 376 more if statements!
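One way to make the array code easier to resize, sketched here under assumptions, is to let SAS count the elements with an asterisk subscript and the DIM function. The data set names totals and totals_clean and the five-variable test data are hypothetical; only the variable range on the array statement would need to change to handle more variables.

```sas
/* Hypothetical test data with five totals per record */
data totals;
  input tot1-tot5;
datalines;
0 12 0 7 3
4 0 0 0 9
;
run;

data totals_clean;
  set totals;
  array tots(*) tot1-tot5;      /* (*) tells SAS to count the elements */
  do i = 1 to dim(tots);        /* dim() returns the number of elements */
    if tots(i) = 0 then tots(i) = . ;
  end;
  drop i;
run;
```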
  surgcnt = 0;
  array procs (6) $5 procs1-procs6;
  do i = 1 to 6;
    if procs(i) ^= ' ' then surgcnt = surgcnt + 1;
  end;
  drop i;
run;

Again, the code is similar to the numeric example. Since there are only 6 procedures involved, there are no spectacular differences between the non-array and array examples. If more codes were added, the non-array code would get longer and longer, and errors could easily appear. It is quite easy to copy the line of code over and over, but remember that the number attached to the variable name must be changed, and it is simple to miss one number or repeat a number.
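To make the comparison concrete, the sketch below runs a non-array count and the array count side by side on two made-up records. The data set name surgcount, the variable plaincnt and the procedure codes are assumptions for illustration; the non-array statements are a guess at what such code typically looks like, not the original listing.

```sas
data surgcount;
  infile datalines missover;    /* short lines leave trailing codes blank */
  length procs1-procs6 $5;
  input procs1-procs6;

  /* Non-array version: one statement per variable */
  plaincnt = 0;
  if procs1 ^= ' ' then plaincnt = plaincnt + 1;
  if procs2 ^= ' ' then plaincnt = plaincnt + 1;
  if procs3 ^= ' ' then plaincnt = plaincnt + 1;
  if procs4 ^= ' ' then plaincnt = plaincnt + 1;
  if procs5 ^= ' ' then plaincnt = plaincnt + 1;
  if procs6 ^= ' ' then plaincnt = plaincnt + 1;

  /* Array version: the same test written once */
  surgcnt = 0;
  array procs(6) $5 procs1-procs6;
  do i = 1 to 6;
    if procs(i) ^= ' ' then surgcnt = surgcnt + 1;
  end;
  drop i;
datalines;
44950 47600
33533 33534 33535
;
run;
```

Both counters should agree on every record (2 on the first line, 3 on the second), which is a quick check that no variable number was skipped or repeated in the copied statements.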
  array units(50) unit1-unit50;
  do i = 1 to 50;
    if codes(i) in('150','151','152','153','160') then
      rbunits = sum(rbunits, units(i));
  end;
  drop i;
run;

As these examples get more complicated, the efficiency of array programming becomes more evident!
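The fragment above relies on a second array, codes, that runs parallel to units. A self-contained sketch of that pattern follows, cut down to three elements. The variable names code1-code3, the data set name and the test values are assumptions for illustration.

```sas
data rbunits;
  infile datalines missover;
  length code1-code3 $3;
  input code1-code3 unit1-unit3;

  array codes(3) $3 code1-code3;   /* assumed names for the code fields */
  array units(3) unit1-unit3;

  /* The same subscript picks the matching code and unit count */
  do i = 1 to 3;
    if codes(i) in('150','151','152','153','160') then
      rbunits = sum(rbunits, units(i));   /* sum() ignores missing values */
  end;
  drop i;
datalines;
150 250 160 2 5 1
300 301 302 4 4 4
;
run;
```

On the first record, codes 150 and 160 match, so rbunits is 2 + 1 = 3; on the second record nothing matches and rbunits stays missing.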
  *trailing at sign will hold this line;
  input @19 clmstat $1. @;
  *keep only records which were paid;
  if clmstat eq 'P' then do;
    input @1   invnum   $17.
          @18  acctcode $1.
          @20  clmtyp   $1.
          @131 provnum  $9.
          @140 category $2. ;
  end;
run;

On the first input statement, the one that ends with a trailing at sign, note that only one field is read. This field designates the record as a paid claim or not. The final database will contain only paid claims, and the steps to accomplish this are:
- The claim status variable, clmstat, is read.
- The pointer remains on that record or line. This is indicated by the at sign (@) as the last character before the semicolon.
- The status field is tested to see if the value equals 'P'.
- If this condition is true, the rest of the record is read.
- If the condition is false, processing continues to the bottom of the data step, at which point processing begins again and a new record is read.

This background is important to understand the next piece of code. The new code using an array to complete the task above follows:

data surgproc;
  infile 'hospital.dat' lrecl = 568 missover;
  input @1  recipid  $11.
        @12 servdate mmddyy10. @ ;
  array procs(6) $5 surgpr1-surgpr6;
  cols = 400;
  do i = 1 to 6;
    input @cols procs(i) $5. @;
    cols + 5;
  end;
  drop i;
run;

So, to parse the various pieces of code in this example:
- A character ($) array procs is built with 6 elements, corresponding to the 6 surgical procedures. The array-elements are defined as variables surgpr1 through surgpr6.
- A portion of the record is read, including the fields recipid and servdate. The pointer remains on this line because of the trailing at sign.
- A new variable named cols is created with the initial value of 400, the beginning column for the set of fields.
- A do loop iterates 6 times. On each iteration, the pointer is moved to the column identified by the variable cols, and the next variable (surgpr1 through surgpr6) is read. Although the pointer moves across the line throughout the do loop, it remains on the same record because of the trailing at sign.
- At the end of each iteration of the do loop, 5 (the length of the procedure field) is added to the pointer variable, cols.
- After the do loop has completed processing, the index-variable, i, is dropped, since this field is not needed in the output data set.
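The pointer technique described above can be tried on a small scale before pointing it at a wide production file. The sketch below uses in-stream data instead of hospital.dat; the data set name, the variable names id and code1-code3, and the column positions are all hypothetical.

```sas
data pointer_demo;
  infile datalines missover;
  input @1 id $3. @;               /* trailing @ holds the record */
  array code(3) $5 code1-code3;
  cols = 5;                        /* first code starts in column 5 */
  do i = 1 to 3;
    input @cols code(i) $5. @;     /* move the pointer to column cols, read 5 characters */
    cols + 5;                      /* advance to the next 5-character field */
  end;
  drop i cols;
datalines;
A01 111112222233333
A02 4444455555
;
run;
```

The first record fills all three codes; the second leaves code3 blank because the line ends before column 15. The single trailing at sign is released automatically when the data step begins its next iteration, so each pass starts on a new record.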
            if surgpr6 not in(' ','000','0000','00000') then do;
              surgproc = surgpr6;
              output;
            end;
          end;
        end;
      end;
    end;
  end;
run;

Using an array, the data step could be:
data surgproc (keep = idnum surgdate surgproc);
  set basefile4;
  length surgproc $6;
  array surg(6) $ surgpr1-surgpr6;
  do i = 1 to 6;
    if surg(i) not in(' ','000','0000','00000') then do;
      surgproc = surg(i);
      output;
    end;
    else leave;
  end;
  drop i;
run;

The in operator allows a list of values to be tested. In the above case, the value of the surgical procedure should not be in that list of missing values. If a missing value is encountered, the leave statement causes the do loop to end.

*Thanks to Ron Cody for his introduction to the leave statement.
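A small, self-contained sketch of the same in/leave pattern follows, cut down to three procedure fields so the effect is easy to see. The data set names basefile_demo and surgproc_demo and the test values are assumptions for illustration.

```sas
/* Hypothetical stand-in for the source file */
data basefile_demo;
  infile datalines missover;
  length idnum $3 surgpr1-surgpr3 $5;
  input idnum surgpr1-surgpr3;
datalines;
A01 44950 47600
A02 33533
A03
;
run;

data surgproc_demo (keep = idnum surgproc);
  set basefile_demo;
  length surgproc $5;
  array surg(3) $ surgpr1-surgpr3;
  do i = 1 to 3;
    if surg(i) not in(' ','000','0000','00000') then do;
      surgproc = surg(i);
      output;            /* one output record per valid procedure */
    end;
    else leave;          /* first missing code ends the loop early */
  end;
run;
```

A01 contributes two records, A02 one, and A03 none, so the result has three observations. The early exit assumes, as the original code does, that valid codes are always filled from the first field onward with no gaps.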
The original programmer did not know how to write anything other than basic data step code and began coding each group of fields separately. It was determined that 4 fields in each group were needed for the studies underway. These four fields are identified above as: first-date-of-svc, revenue-code, units-of-service, allowed-charge.
An example of the non-array code to read 2 of the 50 groups is shown below:

data hospital;
  infile 'hospital.dat' lrecl = 5421 missover;
  input @1    invnum   $17.
        @21   lastdos  mmddyy10.
        @31   billdate mmddyy10.
        @2324 detdos1  mmddyy10.
        @2337 billcd1  $3.
        @2341 units1   5.
        @2353 detchg1  zd9.2
        @2386 detdos2  mmddyy10.
        @2399 billcd2  $3.
        @2403 units2   5.
        @2415 detchg2  zd9.2
        /* ... */
        ;
run;

Clearly, repeating these fields 50 times is time-consuming and difficult to debug. There would be 200 lines to read in these 4 variables 50 times! Applying the ideas above, along with some careful calculation, leads to the following efficient and readable code to input the hospital data:

data hospital;
  infile 'hospital.dat' lrecl = 5421 missover;
  input @1  invnum   $17.
        @21 lastdos  mmddyy10.
        @31 billdate mmddyy10.
        /* ... */
        @; /*hold the pointer on this record*/

  *create 4 arrays to read in 4 fields 50 times;
  array dos (50) detdos1-detdos50;
  array billcode(50) $3 billcd1-billcd50;
  array units(50) units1-units50;
  array detchg(50) detchg1-detchg50;

  *always begin reading in column 2324;
  pntr = 2324;
  do i = 1 to 50;
    input @pntr dos(i) mmddyy10. @;
    pntr = pntr + 13;
    input @pntr billcode(i) $3. @;
    pntr = pntr + 4;
    input @pntr units(i) 5. @;
    pntr = pntr + 12;
    input @pntr detchg(i) zd9.2 @;
    *skip the unwanted fields;
    pntr = pntr + 33;
  end;
  *additional input statements to follow;
  drop i;
run;
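The "careful calculation" behind the pointer increments can be checked in isolation. The sketch below prints the starting column of each field for the first two groups, using the fact that the increments (13 + 4 + 12 + 33) give a group width of 62 columns. The variable names base, dos_col, billcd_col, units_col and detchg_col are hypothetical.

```sas
data _null_;
  do i = 1 to 2;
    base       = 2324 + (i - 1) * 62;   /* 62 = 13 + 4 + 12 + 33 */
    dos_col    = base;                  /* first-date-of-svc */
    billcd_col = base + 13;             /* revenue-code */
    units_col  = base + 13 + 4;         /* units-of-service */
    detchg_col = base + 13 + 4 + 12;    /* allowed-charge */
    put i= dos_col= billcd_col= units_col= detchg_col=;
  end;
run;
```

The log should list columns 2324, 2337, 2341 and 2353 for the first group and 2386, 2399, 2403 and 2415 for the second, matching the column pointers in the non-array input statement.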
These statements provide an efficient mechanism for inputting a large number of fields. Again, there are other ways to accomplish this! In testing this code, first execute the do loop only two or three times to create a small number of variables. Compare this output with the results of code which actually reads each field separately. Again, never assume code is correct if there are no errors listed in the LOG!
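One possible way to carry out the comparison suggested above is PROC COMPARE, which reports any differences between two data sets. This is a sketch under assumptions: the data set names hosp_plain and hosp_array are hypothetical, and the tiny data sets built here only stand in for the output of the field-by-field code and the array code.

```sas
/* Stand-in for the output of the field-by-field code */
data hosp_plain;
  invnum = 'A1'; detdos1 = '01JAN2000'd; units1 = 3; output;
  invnum = 'A2'; detdos1 = '15FEB2000'd; units1 = 1; output;
run;

/* Stand-in for the output of the array code */
data hosp_array;
  set hosp_plain;
run;

/* Report any variable or value differences between the two */
proc compare base=hosp_plain compare=hosp_array;
run;
```

When the two programs read the same fields the same way, the procedure output reports no unequal values.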
Conclusion
With a little practice and common sense, arrays can become a standard tool in a programmer's tool belt. Follow these tips:
- First, always have a SAS Language Guide available!
- In the process of learning how to use arrays, make sure to test the program with non-array code. Print out the Log and Output.
- Then rework the program to include array code and compare these results with the non-array code.

After a while, it will become second nature to use arrays. Once the learning curve is over, the usefulness will increase and soon there will be multiple arrays and do loops within do loops!
References
Leighton, Ralph, (1992), Working with Arrays: Doing More with Less Code, in the Proceedings of the NorthEast SAS Users Group Conference, 129-139
Contact Information
For more information contact:
Marge Scerbo
CHPDM/UMBC
1000 Hilltop Circle
Social Science Room 309
Baltimore, MD 21250
Phone: 410-455-6807
Fax: 410-455-6850
Email: [email protected]