Programmatic Approach Using DATA Step and PROC SQL
Creating A SAS Data Set Using DATA Step
Storing Results:
Very often you don’t want to display results. Instead you want to store them for
use in subsequent computations. That’s what this DATA step will do:
E-Mail: [email protected] Phone: +91-9848733309/+91-9676828080
www.covalentech.com & www.covalenttrainings.com
DATA new;
SET raj;
RUN;
Column Subsets:
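What if you don’t need all of the variables available in the existing data set? In
the DATA step, a KEEP statement identifies the variables to be stored in the new
data set, while a DROP statement names the variables to be left out. PROC SQL
accepts the same choice through the DROP= data set option. For example: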
DATA subset;
SET raj;
KEEP fname sex age;
RUN;
DATA subset;
SET raj;
DROP height weight;
RUN;
PROC SQL;
CREATE TABLE subset(DROP=height weight) AS
SELECT *
FROM raj
;
QUIT;
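The KEEP form has a PROC SQL counterpart as well: name the wanted columns in the SELECT clause instead of using an asterisk. The following is a minimal sketch, assuming the same raj table; the output name subset_sql is hypothetical and is used only so that subset is not overwritten.

```sas
PROC SQL;
  /* Listing columns explicitly mirrors KEEP fname sex age */
  CREATE TABLE subset_sql AS   /* subset_sql is a hypothetical name */
  SELECT fname, sex, age
  FROM raj
  ;
QUIT;
```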
Creating Subtotals:
Suppose that instead of an overall summary, we want the computations stratified
by SEX. The PROC SUMMARY code shown previously can be adapted by
inserting a CLASS statement and coding the NWAY option (to suppress
production of the grand overall statistics, which we no longer want). Here is the
code:
PROC SUMMARY DATA=raj NWAY;
CLASS sex;
VAR age height weight;
OUTPUT OUT=group_averages(DROP = _type_ _freq_)
MIN (age )=Youngest
MAX (age )=Oldest
MEAN(height)=Avg_Height
MEAN(weight)=Avg_Weight;
RUN;
Conditionality:
It is not uncommon to have values that depend on other values—in other words,
conditionality. Probably the most common way of implementing conditionality in
the DATA step is the IF/THEN/ELSE structure.
For example, suppose that students of different ages and sexes are to go on
different field trips. The 11-year-olds (boys and girls) are going to the zoo; girls
who are not going to the zoo (that is, 12-year-old girls) are going to the museum;
and boys who aren’t going to the zoo have to stay behind. Here’s one way of
generating a list of individual student destinations:
DATA trip_list;
SET raj;
LENGTH trip $ 6; /* wide enough for 'Museum' and '[None]', so no value is truncated */
IF age=11 THEN trip = 'Zoo';
ELSE IF sex='F' THEN trip = 'Museum';
ELSE trip = '[None]';
KEEP fname age sex trip;
RUN;
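The same conditional logic can probably be written in PROC SQL with a CASE expression, which evaluates its WHEN conditions in order much as the IF/THEN/ELSE chain does. This is a sketch rather than part of the original handout; the table name trip_list_sql is hypothetical, and the explicit LENGTH=6 is an added precaution so that the longest value fits.

```sas
PROC SQL;
  CREATE TABLE trip_list_sql AS   /* hypothetical output name */
  SELECT fname,
         age,
         sex,
         CASE
           WHEN age=11  THEN 'Zoo'      /* all 11-year-olds */
           WHEN sex='F' THEN 'Museum'   /* remaining girls */
           ELSE '[None]'                /* remaining boys stay behind */
         END AS trip LENGTH=6
  FROM raj
  ;
QUIT;
```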
Filtering Using the WHERE Statement:
Using DATA Step:
DATA girls;
SET raj;
WHERE sex='F';
RUN;
Using PROC SQL:
PROC SQL;
CREATE TABLE girls AS
SELECT *
FROM raj
WHERE sex='F'
;
QUIT;
PROC SQL;
SELECT *
FROM raj
WHERE age=10
;
QUIT;
PROC SQL;
CREATE TABLE tens AS
SELECT *
FROM raj
WHERE age=10
;
QUIT;
To illustrate, consider this PROC SQL step, which calculates the extreme values
of the HEIGHT variable separately for each SEX/AGE combination:
PROC SQL;
CREATE TABLE hilo AS
SELECT sex,
age,
MAX(height) AS Tallest,
MIN(height) AS Shortest
FROM raj
GROUP BY sex, age
;
QUIT;
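For comparison, a PROC SUMMARY step that should yield the same statistics might look like the sketch below. It follows the pattern of the group_averages example; the output name hilo_summary is hypothetical. One difference worth keeping in mind: by default the CLASS statement excludes observations with missing SEX or AGE, whereas GROUP BY treats missing values as a group of their own.

```sas
PROC SUMMARY DATA=raj NWAY;
  CLASS sex age;                 /* one output row per SEX/AGE combination */
  VAR height;
  OUTPUT OUT=hilo_summary(DROP = _type_ _freq_)   /* hypothetical name */
         MAX(height)=Tallest
         MIN(height)=Shortest;
RUN;
```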
Reordering Rows:
The purpose of PROC SORT is the reordering of observations. In PROC SQL, the
ORDER BY clause serves the same purpose. For example, we can run:
PROC SQL;
CREATE TABLE age_sort AS
SELECT *
FROM raj
ORDER BY age DESCENDING, fname
;
QUIT;
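A PROC SORT step producing the same ordering could be sketched as follows. It assumes the same raj data set and writes to age_sort, as the SQL step does. Note that in a BY statement the DESCENDING keyword precedes the variable it applies to, whereas in ORDER BY it follows the column name.

```sas
PROC SORT DATA=raj OUT=age_sort;
  BY DESCENDING age fname;   /* oldest first, then alphabetical by name */
RUN;
```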
PROC SQL;
CREATE TABLE sex_age AS
SELECT sex, age
FROM raj
;
QUIT;
SQL has a special keyword, DISTINCT, to specify that duplicate rows are to be
eliminated. The keyword appears in the SELECT statement or clause, immediately
following SELECT and preceding the list of columns. So the SQL code to
eliminate duplicates from our table is:
PROC SQL;
CREATE TABLE sex_age_distinct AS
SELECT DISTINCT *
FROM sex_age
;
QUIT;
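The merge steps that follow rely on a data set named cohorts holding one count per AGE value; the step that builds it does not appear in this section. One way it could have been built is sketched below. This is an assumption, not the original code: the choice of PROC SUMMARY is a guess, and the count variable is named Many only to match the PROC SQL version shown later.

```sas
PROC SUMMARY DATA=teens NWAY;
  CLASS age;                     /* one output row per distinct AGE */
  /* _FREQ_ holds the number of observations in each AGE group */
  OUTPUT OUT=cohorts(DROP = _type_ RENAME = (_freq_ = Many));
RUN;
```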
To combine these counts with the original data, we first sort that original data:
PROC SORT DATA=teens OUT=sorted;
BY age;
RUN;
Then we combine the original data with the counts, via a MERGE statement:
DATA detail_and_counts;
MERGE sorted cohorts;
BY age;
RUN;
We now have all of the data together, but the names are grouped by AGE and thus
not in alphabetical order. So we sort again to restore the original alphabetical
order:
PROC SORT DATA=detail_and_counts;
BY fname;
RUN;
It has taken four steps to get the output. In contrast, using SQL, we can simply
write:
PROC SQL;
CREATE TABLE detail_and_counts AS
SELECT fname,
age,
COUNT(*) AS Many
FROM teens
GROUP BY age
ORDER BY fname
;
QUIT;
Deriving our unweighted mean via PROC MEANS is more complicated, and is a
two-step proposition. First we have to eliminate repetitions of AGE values; one way
to do this is with PROC FREQ:
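The PROC FREQ step itself is missing from this section. A minimal sketch of what it might look like is given here, assuming the input is the teens data set and the output is the freq2means data set that the PROC MEANS step reads:

```sas
PROC FREQ DATA=teens NOPRINT;
  /* OUT= writes one row per distinct AGE value, plus COUNT and PERCENT */
  TABLES age / OUT=freq2means;
RUN;
```

Because freq2means holds each AGE value exactly once, averaging AGE over it gives every distinct age equal weight.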
Now we can proceed to find the average of these distinct (unduplicated) AGE
values, using PROC MEANS:
PROC MEANS DATA=freq2means MEAN MAXDEC=3;
VAR age;
RUN;
This derivation can be done in just one PROC SQL statement. We can even display
the simple weighted mean alongside. The code is:
PROC SQL;
SELECT MEAN(age)
LABEL = 'Weighted' FORMAT=8.3,
MEAN(DISTINCT age)
LABEL = 'Unweighted' FORMAT=8.3
FROM teens
;
QUIT;