Sas Proc Summary and Proc Format

The document discusses using PROC SUMMARY and PROC FORMAT together to generate customized reports from summary data. Specifically: 1) PROC FORMAT is used to map original data values to interim values for sorting and grouping purposes. 2) PROC SUMMARY generates totals and counts based on the interim values. 3) PROC FORMAT and PROC PRINT or a DATA step are used to substitute the original values back in when reporting, allowing customized sorting and grouping of the summary results. This technique provides control over the ordering and grouping of summary results beyond what PROC SUMMARY alone allows. However, it requires an additional DATA step pass over the data, so it may not be efficient for very large datasets.


PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION

Alan Dickson - Consultant

Introduction:

Is this scenario at all familiar to you? Your users want a report with multiple levels of subtotals and grand totals - BUT they don’t need the lowest level of detail and/or they want some wrinkle (such as calculated values on the subtotal lines) that good old PROC PRINT can’t provide. Furthermore, they want the final report in a certain sequence that bears no resemblance to any collating sequence (EBCDIC, ASCII, ascending/descending…) that you have available.

As PROC REPORT matures, it will perhaps address all these problems, but another viable approach is to use PROC SUMMARY to tackle the totaling end and PROC FORMAT to address the order. How can FORMAT be used to effectively sort a dataset? Well, of course it can’t, but one of the beauties of SUMMARY is that the CLASS statement does not require sorted input to do its stuff.

Essentially, the technique involves what could be called a 2-phase format. The steps are:

1) Use FORMAT to superimpose your own simple sequence over the user’s requirements for grouping and ordering.

2) Use SUMMARY to generate the appropriate totals.

3) Use FORMAT in conjunction with PROC PRINT or DATA-step reporting techniques to produce the report, substituting the user’s values back over your interim values. If you use a DATA step, you can also address any subtotal calculation needs.

Consider this example:

PROC FORMAT;
   VALUE IN_AGE
      30-36 = '1'
      37-39 = '2'
       0-29 = '3'
      40-49 = '4'
      50-53 = '5'
      54-99 = '6';
   VALUE $OUT_AGE
      '1' = '30-36'
      '2' = '37-39'
      '3' = '0-29'
      '4' = '40-49'
      '5' = '50-53'
      '6' = '54-99';

Here, the general trend of increased age equating to poorer health and thus to greater risk is distorted by one group, including the very young (infant illnesses) and young adults (wild & crazy lifestyles). It is a straightforward task to summarize the data based on the intermediate, grouped values, and then report them with the “for-show-only” values of the second format. Clearly, it would be difficult, if not impossible, to reproduce this kind of user sequence without using some kind of recoding.

However, while the technique outlined above solves many problems, it should not be applied without careful consideration. Since an entire DATA step pass of the data is required to set up the interim values, the cost could be substantial for large files. Of course, if you are already passing the data for other reasons, this setup can turn out to be almost a free ride.

Regular use of this technique has made me a life-long fan, but a cautious one. The degree of control it gives you is impressive, but there are a few pitfalls for the unwary. The remainder of this paper will walk you through an example which will show how to apply the technique, what PROC SUMMARY can do for you, and what it can sometimes do TO you.

The Demonstration Data (See Figure 1)

This is a handful of selected records from a fictitious database of participants in a weight-loss clinic in New England which draws its clientele from the Northeast and Canada. Each individual has an indicator (Y/N) for each of the three programs (exercise, weights, aerobics) and the number of pounds lost on each.
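Before going further into the demonstration data, the age-band example from the introduction can be sketched end to end. This is only a minimal illustration, not code from the original paper: the PATIENTS dataset, its AGE and COST variables, the sample rows, and the RECODED and AGE_TOT dataset names are all assumptions made up for the sketch. Only the two formats come from the text.

```sas
/* Phase-1 and phase-2 formats, as given in the introduction. */
PROC FORMAT;
   VALUE IN_AGE                 /* user's age bands -> our own sort keys */
      30-36 = '1'
      37-39 = '2'
       0-29 = '3'
      40-49 = '4'
      50-53 = '5'
      54-99 = '6';
   VALUE $OUT_AGE               /* sort keys -> the band text to display */
      '1' = '30-36'
      '2' = '37-39'
      '3' = '0-29'
      '4' = '40-49'
      '5' = '50-53'
      '6' = '54-99';
RUN;

/* Hypothetical detail data: made-up rows, for illustration only. */
DATA PATIENTS;
   INPUT AGE COST;
   DATALINES;
25 100
34 250
38 300
45 400
52 500
60 700
;
RUN;

/* Step 1: the DATA step pass that sets up the interim values. */
DATA RECODED;
   SET PATIENTS;
   LENGTH AGE_GRP $ 1;
   AGE_GRP = PUT(AGE, IN_AGE.);
RUN;

/* Step 2: total by the interim value; CLASS needs no sorted input. */
PROC SUMMARY DATA=RECODED NWAY;
   CLASS AGE_GRP;
   VAR COST;
   OUTPUT OUT=AGE_TOT SUM=;
RUN;

/* Step 3: substitute the user's values back when printing. */
PROC PRINT DATA=AGE_TOT;
   VAR AGE_GRP _FREQ_ COST;
   FORMAT AGE_GRP $OUT_AGE.;
RUN;
```

Because SUMMARY writes its CLASS cells in order of the interim values '1' to '6', the printed rows should follow the user's sequence (30-36, 37-39, 0-29, and so on) without a PROC SORT. Note that AGE_GRP deliberately carries no format during the SUMMARY step.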
Note that, while the combination of an N-flag and a non-zero loss is invalid, a Y-flag with no attendant loss is valid. A participant may be merely toning up, for instance.

The males and females are banded by initial weight into the appropriate weight class for their sex, using the PUT function. The generated variable is associated with the output format for printing.

The Basic SUMMARY

First, we run a simple SUMMARY, using the NWAY option to suppress all subtotal and grand total records. Later, we’ll look at the output when all records are included. For now, we will use the SUM= option on the analysis variables. This allows us to keep the same variable names and would also retain any LABEL or FORMAT attributes associated with the analysis variables. However, to protect you from overflow, SAS will always use a length of 8 for the summed variables.

As always, SUMMARY generates the special variables _TYPE_ and _FREQ_. We will cover _TYPE_ in more depth later. _FREQ_ is simply a count of the observations included in each CLASS cell.

PROC SUMMARY DATA=TEST.DATA NWAY;
   CLASS SEX COUNTRY STATE;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST1 SUM=;

PROC PRINT;
   VAR SEX COUNTRY STATE _FREQ_
       EXERLOSS WEITLOSS AEROLOSS;

Due to space constraints, the output is not shown but the resulting report, containing the summed totals for each loss variable, is ordered as expected - alphabetically by STATE within COUNTRY within SEX.

PROC SUMMARY also has options for different orders. For example, adding ORDER=FREQ to the PROC statement orders the output based on frequency of occurrence. However, close examination of the results shows that the Female Canadians from the Province of Quebec (F/CN/PQ) come out after the sole lady from Ontario (Figure 2). What is happening here?

It seems that SAS “remembers” the Male from Ontario and takes him into account when processing the Females. Since there are now two each from PQ and ON, and ON was encountered first, it remains first in the Female category, leading to the unexpected output.

Similarly, if you try ORDER=DATA, SUMMARY will order the output based strictly on the order in which it encounters values in the input dataset, with similar non-intuitive results in the output. Because ON is encountered first in the Male group, it remains first in the Female group, despite the fact that there is a PQ Female before the ON Female in the data.

Clearly, these unexpected results are dependent on the actual data. You could run a job several times using these options and then suddenly start running into problems when you least expect it. It should be noted that these results were obtained under CMS Version 6.09E and Win95 Version 6.12. Other operating system environments have produced different results in the past. Caveat emptor!

The underlying reason for these results is that SUMMARY is evaluating the dataset in its entirety relative to the CLASS statement. You can cause it to treat categories in isolation by using the BY statement instead at the highest level or levels. This, of course, requires either that you know your data is sorted or you incur the cost of sorting it.

In our example, based on our knowledge of the data, we could use a BY SEX NOTSORTED; statement to cause SUMMARY to deal with the two groups independently of each other. This will also result in less memory being used since SAS can essentially re-use the space when the SEX changes rather than holding all combinations separate for the entire step.

An even more surprising result (at least it was to me) occurs when we add our derived, formatted variable WT_CLASS to the CLASS breakdown. Obviously, with so few observations, the results are pretty meaningless, but look closely at the data values when we run this program (Figure 3):

PROC SUMMARY DATA=TEST.DATA NWAY;
   CLASS SEX COUNTRY STATE WT_CLASS;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST1 SUM=;

PROC PRINT;
   VAR SEX COUNTRY STATE WT_CLASS
       _FREQ_ EXERLOSS WEITLOSS
       AEROLOSS;
   FORMAT WT_CLASS $CHAR1.;

Without the final FORMAT statement, everything actually looked fine. However, when we examine the underlying data values, we find that there are no WT_CLASS 6 values any more - in fact, they’ve all been turned into value 3. Why??

The first time I encountered this phenomenon, I called SAS Institute to report it as a bug. I was politely informed that SUMMARY is supposed to work this way and that the behavior is furthermore fully documented on page 376 of the Version 6 Procedures manual. Essentially, because both values 3 and 6 are formatted to ‘150-199’, SUMMARY will force them all to 3. Again, using a

BY SEX NOTSORTED;

statement will eliminate this effect since only Females can be 3. However, if BY is not an option for you, it’s best to be aware of this subtle possibility.

The ID and MAXID Statements

Although not directly related to the technique of combining formats with SUMMARY, the ID statement can sometimes prove useful when using SUMMARY.

ID gives you the highest value encountered for a variable in each cell. This often overlooked option is helpful when you want to carry a value from the detail records forward to the summarized output dataset. It is most applicable when the value remains the same across all detail records in each CLASS cell.

For example, if you had hospital identifier codes and hospital names on the detail data and you wanted to summarize admissions or anything else by hospital and produce the report in code order, but also show the name on the report, ID fits the bill perfectly.

PROC SUMMARY DATA=HOSP.DATA NWAY;
   CLASS HOSPCODE;
   ID HOSPNAME;
   VAR ADMISSNS;
   etc……

If the variable you want to carry forward is a numeric variable, another option is to use the MAX statistic on that variable. However, you can only request statistics on analysis variables, which by definition must be numeric. Hospital name is clearly alphabetic.

I have frequently seen hospital name used as a final CLASS variable to achieve this result, but that causes a lot of redundant summarization.

To return to the weight-loss example, let’s say that you wanted to produce a report showing total weight losses for each program on a separate line along with the indicators to show participation in a given cell. Simply ignoring zero values for the losses may be invalid (remember the guy just toning up). Unfortunately, this code will NOT work.

PROC SUMMARY DATA=TEST.DATA NWAY;
   CLASS SEX COUNTRY STATE;
   ID EXER WEIT AERO;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST1 SUM=;
   etc……

As the manual makes quite clear, when more than 1 ID variable is specified, the result represents the highest SINGLE OBSERVATION combination of values within each cell. In the test data, the Males in MA give a result of YYN - not YYY - because no single Male participated in all three programs, although all three had some MA Male participation.

In Version 5, there was no easy way around this, short of generating numeric flag variables to replace the Y/N flags and taking the MAX. However, Version 6 added the MAXID (and MINID) options of the OUTPUT statement, which address this need.

Essentially, MAXID returns the value of a “partner” variable when the other partner reaches its maximum value within a cell. MINID, obviously, does the same at the minimum.

PROC SUMMARY DATA=TEST.DATA NWAY;
   CLASS SEX COUNTRY STATE;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST1 SUM=
          MAXID( EXERLOSS(EXER)
                 WEITLOSS(WEIT)
                 AEROLOSS(AERO) )=
          MAX_EXER MAX_WEIT
          MAX_AERO;
   etc….

This code will now give us the desired result of YYY for MA Males.

Finally, let’s look at the output from SUMMARY when we remove the NWAY option and all the intermediate level totals are produced. This is when the _TYPE_ special variable really comes into play.

The number of values which _TYPE_ will take is a function of the number of variables in the CLASS statement - in fact, 2 raised to that power. So, in our example of CLASS SEX COUNTRY STATE;, _TYPE_ will have 8 values (2 cubed) ranging from 0 to 7.

The _TYPE_ 0 record is always the grand total, representing all observations, and so is always a single observation. The number of observations generated for each other _TYPE_ value depends on how many discrete combinations occur in the dataset.
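The paper does not list the program for the run without NWAY, so the following is a hedged sketch of what it presumably looks like. It assumes the TEST.DATA dataset built in Figure 1; the output dataset name TEST_ALL is an assumption, and the title text is borrowed from Figure 4.

```sas
/* Same summary as before, but with NWAY omitted so that every   */
/* _TYPE_ level (grand total, subtotals, full detail) is kept.   */
PROC SUMMARY DATA=TEST.DATA;
   CLASS SEX COUNTRY STATE;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST_ALL SUM=;        /* TEST_ALL is an assumed name */
RUN;

/* Print _TYPE_ alongside the CLASS variables so each level of   */
/* totaling can be identified on the listing.                    */
PROC PRINT DATA=TEST_ALL;
   VAR _TYPE_ SEX COUNTRY STATE _FREQ_
       EXERLOSS WEITLOSS AEROLOSS;
   TITLE 'PRINT OF SUMMARY DATA - NO NWAY - ALL SUMMARIES';
RUN;
```

On subtotal records, the CLASS variables that are not part of that level of totaling are left blank, which is why _TYPE_ is worth printing.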
Refer to the final output (Figure 4) as we walk through the rest of the _TYPE_s and you should see how SUMMARY builds them. Essentially, SAS moves from right to left along the CLASS statement in a binary fashion, getting progressively more detailed in the contents of each cell. So,

_TYPE_ 1 represents a STATE breakdown, irrespective of SEX and COUNTRY.

_TYPE_ 2 represents a COUNTRY breakdown, irrespective of SEX and STATE.

_TYPE_ 3 represents a COUNTRY and STATE breakdown, irrespective of SEX.

_TYPE_ 4 represents a SEX breakdown, irrespective of COUNTRY and STATE.

_TYPE_ 5 represents a SEX and STATE breakdown, irrespective of COUNTRY.

_TYPE_ 6 represents a SEX and COUNTRY breakdown, irrespective of STATE.

_TYPE_ 7 represents a SEX, COUNTRY and STATE breakdown, the finest level of detail possible with this CLASS statement.

Within each _TYPE_ value, the total of the _FREQ_ values adds up to that of the grand total, i.e. the whole dataset is counted in each _TYPE_. There is a good diagram in the V6 Procedures manual (p.377) which outlines how the _TYPE_s are built up in this binary fashion. However, it is helpful if you note that the diagram works up to the results of the statement

CLASS C B A

rather than

CLASS A B C

which you might be tempted to assume.

With 3 CLASS variables, it is quite possible that the user will find all combinations valid and meaningful. With only 2, it’s almost certain. However, when you move up to 4 or 5 CLASS variables (16 or 32 combinations), some of them are bound to be redundant or meaningless. Two solutions are possible at this point.

The first is to let SUMMARY generate them all, and then discard the unwanted ones based on their _TYPE_ values. One way to do this would be a WHERE= dataset option on the output dataset.

The second is to run an NWAY SUMMARY on the finest level of detail required and then subsequent SUMMARY step(s) on the resulting dataset. This may be one more non-NWAY step, or you could repeat the process of NWAY steps at increasingly higher levels of totaling. Finally, SET all the resulting datasets together to give you the master data for printing. As usual, the choice depends on your particular data and application but, in general, the first option will use more space (at least until you discard), while the second will use more CPU (multiple steps).

I hope this brief explanation has given you some new insight into using SUMMARY and another valuable addition to your SAS toolkit.

Acknowledgments:

SAS is a registered trademark of SAS Institute Inc., Cary NC, USA.
------ E X H I B I T S -------

Figure 1:
PROC FORMAT LIBRARY=LIBRARY;
   VALUE FWT_FMT
      000-099  = '1'
      100-149  = '2'
      150-199  = '3'
      200-HIGH = '4';
   VALUE MWT_FMT
      000-149  = '5'
      150-199  = '6'
      200-249  = '7'
      250-HIGH = '8';
   VALUE $WT_FMT
      '1' = '000-099'          '5' = '000-149'
      '2' = '100-149'          '6' = '150-199'
      '3' = '150-199'          '7' = '200-249'
      '4' = '200+'             '8' = '250+';
   *** FEMALE WEIGHT CLASSES   MALE WEIGHT CLASSES ***;

DATA TEST.DATA;
   LENGTH SEX $ 1 COUNTRY $ 2 STATE $ 2 WT 3 WT_CLASS $ 1
          EXER $ 1 EXERLOSS 2 WEIT $ 1 WEITLOSS 2 AERO $ 1 AEROLOSS 2;
   LABEL WT_CLASS = 'WEIGHT CLASS';
   FORMAT WT_CLASS $WT_FMT.;
   INPUT SEX $ COUNTRY $ STATE $ WT
         EXER $ EXERLOSS WEIT $ WEITLOSS AERO $ AEROLOSS;
   IF SEX = 'F'
      THEN WT_CLASS = PUT(WT,FWT_FMT.);
      ELSE WT_CLASS = PUT(WT,MWT_FMT.);
   CARDS;
M US MA 148 Y 03 Y 00 N 00
M US MA 195 N 00 Y 12 N 00
M US MA 262 Y 11 N 00 N 00
M US MA 172 Y 08 N 00 Y 05
M US MA 211 N 00 Y 16 Y 05
M CN ON 188 Y 08 N 00 N 00
M CN PE 222 Y 07 Y 07 Y 07
F CN PQ 202 Y 11 N 00 Y 11
F CN ON 177 N 00 N 00 Y 09
F CN PQ 169 Y 10 N 00 Y 04
F US MA 210 N 00 N 00 Y 08
F US MA 081 Y 02 N 00 Y 03
F US MA 111 Y 02 N 00 Y 02
;
PROC PRINT;
   VAR SEX COUNTRY STATE WT WT_CLASS
       EXER EXERLOSS WEIT WEITLOSS AERO AEROLOSS;
   TITLE 'PRINT OF RAW INPUT DATA - FORMATTED WEIGHT CLASS';

PROC PRINT;
   VAR SEX COUNTRY STATE WT WT_CLASS
       EXER EXERLOSS WEIT WEITLOSS AERO AEROLOSS;
   FORMAT WT_CLASS $CHAR1.;
   TITLE 'PRINT OF RAW INPUT DATA - UNFORMATTED WEIGHT CLASS';
RUN;

PRINT OF RAW INPUT DATA - FORMATTED WEIGHT CLASS

OBS SEX COUNTRY STATE WT WT_CLASS EXER EXERLOSS WEIT WEITLOSS AERO AEROLOSS

1 M US MA 148 000-149 Y 3 Y 0 N 0
2 M US MA 195 150-199 N 0 Y 12 N 0
3 M US MA 262 250+ Y 11 N 0 N 0
4 M US MA 172 150-199 Y 8 N 0 Y 5
5 M US MA 211 200-249 N 0 Y 16 Y 5
6 M CN ON 188 150-199 Y 8 N 0 N 0
7 M CN PE 222 200-249 Y 7 Y 7 Y 7
8 F CN PQ 202 200+ Y 11 N 0 Y 11
9 F CN ON 177 150-199 N 0 N 0 Y 9
10 F CN PQ 169 150-199 Y 10 N 0 Y 4
11 F US MA 210 200+ N 0 N 0 Y 8
12 F US MA 81 000-099 Y 2 N 0 Y 3
13 F US MA 111 100-149 Y 2 N 0 Y 2

PRINT OF RAW INPUT DATA - UNFORMATTED WEIGHT CLASS

OBS SEX COUNTRY STATE WT WT_CLASS EXER EXERLOSS WEIT WEITLOSS AERO AEROLOSS

1 M US MA 148 5 Y 3 Y 0 N 0
2 M US MA 195 6 N 0 Y 12 N 0
3 M US MA 262 8 Y 11 N 0 N 0
4 M US MA 172 6 Y 8 N 0 Y 5
5 M US MA 211 7 N 0 Y 16 Y 5
6 M CN ON 188 6 Y 8 N 0 N 0
7 M CN PE 222 7 Y 7 Y 7 Y 7
8 F CN PQ 202 4 Y 11 N 0 Y 11
9 F CN ON 177 3 N 0 N 0 Y 9
10 F CN PQ 169 3 Y 10 N 0 Y 4
11 F US MA 210 4 N 0 N 0 Y 8
12 F US MA 81 1 Y 2 N 0 Y 3
13 F US MA 111 2 Y 2 N 0 Y 2

Figure 2:
PRINT OF SUMMARY DATA - FREQUENCY ORDERING

OBS SEX COUNTRY STATE _FREQ_ EXERLOSS WEITLOSS AEROLOSS


1 M US MA 5 22 28 10
2 M CN ON 1 8 0 0
3 M CN PE 1 7 7 7
4 F US MA 3 4 0 13
5 F CN ON 1 0 0 9
6 F CN PQ 2 21 0 15
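
The ordering surprise in Figure 2 arises because SUMMARY evaluates the CLASS variables across the whole dataset. A minimal sketch of the BY SEX NOTSORTED alternative mentioned in the text follows. It assumes the TEST.DATA dataset of Figure 1, whose records are already grouped by SEX, and TEST_BY is an assumed output name. The resulting listing is not reproduced here, since the paper notes that results have varied between environments.

```sas
/* SEX moves from CLASS to BY, so each SEX group is summarized   */
/* on its own. NOTSORTED only requires that the records for a    */
/* given SEX be contiguous, not that SEX be in sorted order.     */
PROC SUMMARY DATA=TEST.DATA NWAY ORDER=FREQ;
   BY SEX NOTSORTED;
   CLASS COUNTRY STATE;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST_BY SUM=;         /* TEST_BY is an assumed name */
RUN;

PROC PRINT DATA=TEST_BY;
   VAR SEX COUNTRY STATE _FREQ_
       EXERLOSS WEITLOSS AEROLOSS;
RUN;
```

The same BY statement is what the text suggests for keeping the Male and Female WT_CLASS values apart in the run shown in Figure 3.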
Figure 3: PRINT OF SUMMARY DATA - UNFORMATTED WEIGHT CLASS

OBS SEX COUNTRY STATE WT_CLASS _FREQ_ EXERLOSS WEITLOSS AEROLOSS


1 F CN ON 3 1 0 0 9
2 F CN PQ 3 1 10 0 4
3 F CN PQ 4 1 11 0 11
4 F US MA 1 1 2 0 3
5 F US MA 2 1 2 0 2
6 F US MA 4 1 0 0 8
7 M CN ON 3 1 8 0 0
8 M CN PE 7 1 7 7 7
9 M US MA 3 2 8 12 5
10 M US MA 5 1 3 0 0
11 M US MA 7 1 0 16 5
12 M US MA 8 1 11 0 0

Figure 4: PRINT OF SUMMARY DATA - NO NWAY - ALL SUMMARIES


OBS  _TYPE_  SEX  COUNTRY  STATE  _FREQ_  EXERLOSS  WEITLOSS  AEROLOSS
  1       0                           13        62        35        54
  2       1                MA          8        26        28        23
  3       1                ON          2         8         0         9
  4       1                PE          1         7         7         7
  5       1                PQ          2        21         0        15
  6       2       CN                   5        36         7        31
  7       2       US                   8        26        28        23
  8       3       CN       ON          2         8         0         9
  9       3       CN       PE          1         7         7         7
 10       3       CN       PQ          2        21         0        15
 11       3       US       MA          8        26        28        23
 12       4  F                         6        25         0        37
 13       4  M                         7        37        35        17
 14       5  F             MA          3         4         0        13
 15       5  F             ON          1         0         0         9
 16       5  F             PQ          2        21         0        15
 17       5  M             MA          5        22        28        10
 18       5  M             ON          1         8         0         0
 19       5  M             PE          1         7         7         7
 20       6  F    CN                   3        21         0        24
 21       6  F    US                   3         4         0        13
 22       6  M    CN                   2        15         7         7
 23       6  M    US                   5        22        28        10
 24       7  F    CN       ON          1         0         0         9
 25       7  F    CN       PQ          2        21         0        15
 26       7  F    US       MA          3         4         0        13
 27       7  M    CN       ON          1         8         0         0
 28       7  M    CN       PE          1         7         7         7
 29       7  M    US       MA          5        22        28        10
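
If only some of the levels in Figure 4 are wanted, the first of the two solutions described in the text (generate everything, then discard by _TYPE_) might be sketched as below. The choice of _TYPE_ values 0, 4, 6 and 7 (grand total, SEX, SEX by COUNTRY, and full detail) is an arbitrary illustration, and TEST_SEL is an assumed dataset name.

```sas
/* Let SUMMARY build all eight _TYPE_ levels, but write only the */
/* wanted ones by putting a WHERE= dataset option on OUT=.       */
PROC SUMMARY DATA=TEST.DATA;
   CLASS SEX COUNTRY STATE;
   VAR EXERLOSS WEITLOSS AEROLOSS;
   OUTPUT OUT=TEST_SEL (WHERE=(_TYPE_ IN (0, 4, 6, 7)))
          SUM=;
RUN;

PROC PRINT DATA=TEST_SEL;
   VAR _TYPE_ SEX COUNTRY STATE _FREQ_
       EXERLOSS WEITLOSS AEROLOSS;
RUN;
```

With the Figure 4 data, this selection would keep 1 + 2 + 4 + 6 = 13 of the 29 observations.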
