Effective Use of SQL in SAS Programming
INTRODUCTION
Structured Query Language (SQL) is a data manipulation tool that many SAS®
programmers are either unaware of or not comfortable with. SQL can often accomplish
the same goal as several SAS data steps in fewer lines of code, and in some cases with
better performance. This paper gives a brief introduction to relational databases and SQL
syntax, followed by a variety of tips on how to use SQL effectively in SAS programming.
RELATIONAL DATABASE
SQL BASICS
SQL (Structured Query Language), developed by IBM in the early 1970s, is a standard
interactive and programming language for querying and modifying data and for managing
databases. The basic syntax is shown in the following example:
Select d.subjid,
       d.treat_cd,
       a.exam_val
From demos d,
     assy a
Where d.subjid = a.subjid
Group by d.treat_cd
Order by d.subjid;
NESUG 2008 Programming Beyond the Basics
Although SQL is both an ANSI and an ISO standard, many database products, such as
Oracle, SQL Server, and MySQL, support SQL with their own proprietary extensions to
the standard language. Proc SQL is the SAS implementation of SQL. It adopts most of
the standard SQL features and adds SAS-specific ones such as dataset options and SAS
functions. As a result, Proc SQL combines the power of standard SQL with many
SAS-specific features.
TERMINOLOGY
To help less-experienced SAS programmers better understand the different terms used by
database SQL programmers and SAS programmers, a comparison of these terms is
displayed in Table 1 below:
SAS uses SQL in two different ways: the Where statement and Proc SQL. The Where
statement is one of the most commonly used SAS statements. Its concept and syntax,
however, were originally adopted from SQL. This is one example of how SAS imports
and mixes syntax from other languages.
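As a small illustration of this shared syntax, the sketch below applies the same condition once with a Where statement in a data step and once with a Where clause in Proc SQL. It assumes the SASHELP.CLASS sample dataset that ships with SAS; the output names teens_ds and teens_sql are made up for the example.

```sas
/* Data step: the Where statement filters rows as they are read. */
data teens_ds;
   set sashelp.class;
   where age >= 13;
run;

/* Proc SQL: the same condition written as a Where clause. */
proc sql;
   create table teens_sql as
   select *
   from sashelp.class
   where age >= 13;
quit;
```

Both steps should select the same rows, since the condition is identical.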
Proc SQL is the main tool within SAS to use SQL. While Proc SQL is a SAS procedure,
it performs many functions similar to those found within SAS data steps. For many data
manipulation tasks, a data step and Proc SQL can be used interchangeably or in
combination. Four major areas in which Proc SQL can be used effectively are outlined
in the following sections.
In SAS, there are two approaches to accessing relational databases. One is the LIBNAME
statement and the other is the pass-through facility. Below is an example of the
pass-through facility. The code reads a demographic table from an Oracle database and
outputs all allocated subjects.
Proc sql;
   connect to odbc (dsn=&dsn uid=&uid pwd=&pwd);
   create table demo as
   select *
   from connection to odbc
      (select distinct allocation_number subjid,
                       visit_number vt_num,
                       age
       from std_demos
       where allocation_number is not null
      );
   disconnect from odbc;
quit;
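For comparison, the LIBNAME approach might look like the sketch below. It assumes that SAS/ACCESS Interface to ODBC is licensed and that &dsn, &uid and &pwd are defined as in the pass-through example; the libref ora is a made-up name. With a libref, the query is written in Proc SQL syntax and SAS decides how much of the work to hand to the database.

```sas
/* Assumption: &dsn, &uid and &pwd hold the same values as in the
   pass-through example; the libref ORA is hypothetical. */
libname ora odbc dsn=&dsn uid=&uid pwd=&pwd;

proc sql;
   create table demo as
   select distinct allocation_number as subjid,
                   visit_number      as vt_num,
                   age
   from ora.std_demos
   where allocation_number is not null;
quit;

/* Release the database connection when finished. */
libname ora clear;
```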
Programming Tips:
• For security, prompt for login credentials through an interactive window rather
than hard-coding them.
• Avoid joining many tables in a single query to retrieve data; it is more efficient
to use multiple CREATE TABLE statements.
• If possible, avoid the use of ORDER BY to speed up execution.
• Use an index if one is available.
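The index tip can be sketched as follows for a SAS dataset. The demo table and its values are toy data invented for the example. In Proc SQL, a simple (single-column) index must have the same name as its column, and MSGLEVEL=I asks SAS to note in the log when an index is used. Whether the optimizer actually chooses the index depends on the data, so no particular log output is assumed here.

```sas
/* Toy data standing in for a larger demographic table. */
data demo;
   input subjid age;
   datalines;
101 34
102 57
103 45
;
run;

options msglevel=i;   /* write index-usage notes to the log */

proc sql;
   /* A simple index takes the name of the column it is built on. */
   create index subjid on demo (subjid);

   select subjid, age
   from demo
   where subjid = 102;
quit;
```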
SAS programmers often use %LET or the CALL SYMPUT routine to create macro
variables. The following is an example:
Data _null_;
   set dup nobs=obs;
   call symput('totdup', compress(put(obs, best.)));
run;
There is an alternative approach to achieving the same result by using the following SQL
procedure:
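A minimal sketch of such a Proc SQL alternative is given below, assuming the goal is the same as in the data step above: to store the number of rows of dup in the macro variable &totdup. The first step only builds a toy dup dataset so that the sketch runs on its own.

```sas
/* Toy data so the sketch runs on its own; DUP stands in for the
   dataset used in the data step example. */
data dup;
   do subjid = 1 to 5;
      output;
   end;
run;

proc sql noprint;
   /* Count the rows and store the result in &totdup. */
   select count(*) into :totdup
   from dup;
quit;

/* Re-assigning with %LET strips the leading blanks left by INTO. */
%let totdup = &totdup;
%put totdup = &totdup;
```

The NOPRINT option suppresses the query output, since only the macro variable is needed.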
The Into clause stores the value of one or more columns in macro variable(s) for use later
in another Proc SQL query or SAS statement. Below is an example:
quit;
The above code creates a macro variable &TOT_TRT to store the total number of
treatment groups, creates macro variables &TRT1, &TRT2 …, and stores the names of
treatment groups in them. The total number of macro variables is determined by the value
in &TOT_TRT.
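One way to get the behavior just described is sketched below. It is an illustration under assumptions, not a reproduction of the original listing: the demos dataset, its treat_cd column and the toy values are made up for the example.

```sas
/* Toy data: four subjects in three treatment groups. */
data demos;
   input subjid treat_cd $;
   datalines;
101 A
102 B
103 A
104 C
;
run;

proc sql noprint;
   /* Count the distinct treatment groups. */
   select count(distinct treat_cd) into :tot_trt
   from demos;

   /* Strip leading blanks so that :trt&tot_trt resolves cleanly. */
   %let tot_trt = &tot_trt;

   /* One macro variable per treatment group: &trt1, &trt2, ... */
   select distinct treat_cd into :trt1 - :trt&tot_trt
   from demos;
quit;

%put tot_trt=&tot_trt trt1=&trt1 trt2=&trt2 trt3=&trt3;
```

Because Proc SQL executes each statement as soon as it is complete, &tot_trt is already available when the second Select statement is parsed.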
Programming Tips:
The biggest advantage of a SQL join is that there is no need for sorting and renaming,
which is especially useful when dealing with large datasets. The following is the
corresponding code for a by-merge data step and a SQL join:
Merge (Join)
Data step:
   Proc sort data = one;
      By subjid;
   Proc sort data = two (rename = (an_num = subjid));
      By subjid;
   Data three;
      Merge one two;
      By subjid;
   Run;
Proc SQL:
   Proc sql;
      Create table three as
      Select *
      From one, two
      Where one.subjid = two.an_num;
   Quit;
There are two kinds of joins in SQL: inner join and outer join. An inner join returns a
result table for all the rows in a table that have one or more matching rows in the other
table(s). The example above is an implied inner join and can be rewritten with the
explicit INNER JOIN and ON keywords as shown below:
Inner Join
Data step:
   Proc sort data = one;
      By subjid;
   Proc sort data = two (rename = (an_num = subjid));
      By subjid;
   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If a and b;
   Run;
Proc SQL:
   Proc sql;
      Create table three as
      Select *
      From one INNER JOIN two
      ON one.subjid = two.an_num;
   Quit;
Outer joins are inner joins that have been augmented with the rows that did not match
any row from the other table in the join. The three types of outer joins are left, right, and
full joins. Below are examples of outer joins:
Left Join
Data step:
   Proc sort data = one;
      By subjid;
   Proc sort data = two (rename = (an_num = subjid));
      By subjid;
   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If a;
   Run;
Proc SQL:
   Proc sql;
      Create table three as
      Select *
      From one LEFT JOIN two
      ON one.subjid = two.an_num;
   Quit;
Right Join
Data step:
   Proc sort data = one;
      By subjid;
   Proc sort data = two (rename = (an_num = subjid));
      By subjid;
   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If b;
   Run;
Proc SQL:
   Proc sql;
      Create table three as
      Select *
      From one RIGHT JOIN two
      ON one.subjid = two.an_num;
   Quit;
A full outer join, specified with the keywords FULL JOIN and ON, returns all the rows
from both tables, whether or not they have a match in the other table. In practice, the
full outer join is used less often than the other join types.
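For completeness, a full join can be sketched in the same style as the examples above. The toy contents of one and two, and the columns v1 and v2, are invented for the illustration. COALESCE is used so that the key is populated whichever table a row comes from.

```sas
/* Toy data: subject 2 is in both tables, 1 and 3 are in one only. */
data one;
   input subjid v1;
   datalines;
1 10
2 20
;
run;

data two;
   input an_num v2;
   datalines;
2 200
3 300
;
run;

proc sql;
   create table three as
   select coalesce(one.subjid, two.an_num) as subjid,
          one.v1,
          two.v2
   from one FULL JOIN two
   ON one.subjid = two.an_num;
quit;
```

With this toy data, table three has three rows: subject 1 with v2 missing, subject 2 with both values, and subject 3 with v1 missing. The data step counterpart is a by-merge without any subsetting If statement.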
SQL is frequently used for creating new variables, renaming variables, and ordering the
output columns. Suppose we have the following task at hand:
• Create a new variable new_v1 by concatenating v1 and v2
• Create a new variable new_v2 as the sum of v3
• Rename v4 and v5 as out4 and out5
• Only output new_v1, new_v2, out4, out5, v3 and in that particular order in the
output dataset
Proc sql;
   create table new as
   select v1 || v2 as new_v1,
          sum(v3) as new_v2,
          v4 as out4,
          v5 as out5,
          v3
   from old;
quit;
Programming Tips:
• SAS dataset options such as keep, drop, rename and SAS functions can be used
within Proc SQL. Here is an example:
scores t2
where t1.subject_no=input(substr(t2.subject_id,5),8.) and
t1.center=input(substr(t2.subject_id,1,3),8.);
quit;
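As a further minimal sketch of dataset options and SAS functions inside Proc SQL, the code below uses the SASHELP.CLASS sample dataset that ships with SAS. The output name class_sub and the new column names yrs and height_cm are made up for the example.

```sas
proc sql;
   /* LABEL= is a dataset option on the output table. */
   create table class_sub (label = 'Subset of SASHELP.CLASS') as
   select name,
          yrs,
          /* ROUND is an ordinary SAS function; height is in inches. */
          round(height * 2.54, 0.1) as height_cm
   /* KEEP= and RENAME= are applied as the input table is read,
      so the query itself sees the new name YRS. */
   from sashelp.class (keep = name age height rename = (age = yrs))
   where yrs >= 13;
quit;
```

KEEP= refers to the original variable names because it is applied before RENAME=.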
CONCLUSION
• Proc SQL is more powerful and efficient than SAS data steps in certain cases,
with fewer lines of code.
• SQL is a basic tool for many job functions that involve working with databases.
Mastering SQL could result in project (or job) opportunities and enhance career
growth.
• Proc SQL must be used wisely or it can become complicated and inefficient.
• In summary, Proc SQL is an excellent alternative to non-SQL Base SAS, making
it worth the programmer's time to explore its use.
SAS and all other SAS Institute Inc. product or service names are registered trademarks
or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration. Other brand and product names are trademarks of their respective
companies.
Yi Zhao
Senior Scientific Programming Analyst
Merck Research Laboratories
UG1CD-38
PO Box 1000
North Wales, PA 19454
Phone: 267-305-7672