Database ACCESS Basics SQL and Data Step
ABSTRACT
SAS/ACCESS® provides virtually seamless interaction with relational databases such as DB2®. The SQL pass-through
facility gives direct access to the database, but requires knowledge of the native SQL syntax. Alternatively,
the libname statement can be used to assign SAS® library references to database objects, allowing database tables
to be used like SAS data sets. SAS SQL functions are available that cannot be used with the pass-through
facility. For data step diehards, database tables can be treated (almost) just like indexed SAS data sets. SAS
statistical procedures can also reference database tables directly using the libref name. This paper covers the basics
of accessing database data from SAS using SQL and data step language, using the database with SAS procedures,
and getting information about the database delivered through SAS.
INTRODUCTION
Many businesses use the power of SAS® analytics while using a separate relational database management
system (RDBMS or DBMS) for data storage. SAS/ACCESS® allows these components to work together. SAS/ACCESS is
actually a family of separate products; each SAS/ACCESS engine translates requests from SAS into the appropriate
query syntax for a specific DBMS or file structure. Specific DBMS access engines are available for
DB2®, ORACLE®, OLE DB, SYBASE®, Teradata®, Informix®, MySQL, MS SQL Server, and ODBC. Other data
sources that can be connected to with SAS/ACCESS include ADABAS® and PC file formats, including .DBF and
.XLS. The connection to these data sources is virtually transparent and can be used to read, update, create, and
delete data tables or records in the native data source. Each connection to the data source can be defined using
the LIBNAME statement. In a windowing environment, the connection can be defined from the Explorer window. If
it is necessary to pass non-ANSI standard query language to the DBMS, the SAS Pass-Through Facility is available
to pass native SQL directly to the DBMS without translation or optimization by SAS. This paper covers basic
considerations of using a DBMS through SAS, including LIBNAME syntax, optimization of SQL, use of DBMS data in
the Data Step, and data translation issues.
LIBNAME STATEMENT
Since SAS v7, the general LIBNAME syntax has been extended to cover connections to other data sources. For
review, an example of a basic LIBNAME statement is:
LIBNAME mydata 'C:\project\sasdata' ;
mydata is a shortcut name to associate with the physical location ('C:\project\sasdata') recognized by the OS. A
few options are also available that affect how data in the defined library is handled. One is the ACCESS= option
which can be READONLY or TEMP (no CPU used to monitor data integrity).
LIBNAME mydata 'C:\project\sasdata' ACCESS=READONLY ;
To associate a libref (mydata is the libref in the above example) with a DBMS database, this syntax is extended to
include the SAS/ACCESS engine name and connection options:
LIBNAME mydata db2 DEFER=YES ACCESS=READONLY ;
Notice that the file location is gone. ‘db2’ specifies which ACCESS engine to use, and the DEFER= option specifies
that the connection not be made until the libref is first referenced. The connection can be terminated by
de-assigning the libref using the LIBNAME statement with no parameters:
LIBNAME mydata ;
The libname option ACCESS=READONLY is still valid for the DB2 engine. Connection and libname options control
how SAS manages the actual connection to the DBMS and define how data objects are processed or handled.
Many libname options are available for use with each DBMS data source. Their availability or default behavior
is often DBMS-specific.
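As an illustration of the statements described above, the following is a minimal sketch of assigning, using, and releasing a DBMS libref. The database name yourdb and the schema YOURSCHEMA are placeholders rather than values from any real system, and the options that are accepted may differ for other ACCESS engines.

```sas
/* Placeholders (assumptions): yourdb is a DB2 database, YOURSCHEMA a schema in it */
libname mydata db2 datasrc=yourdb schema=YOURSCHEMA
        defer=yes access=readonly;

/* First reference to the libref: because of DEFER=YES the connection
   is made here, not when the LIBNAME statement was submitted.
   PROC DATASETS writes a directory of the library members to the log. */
proc datasets library=mydata;
quit;

/* Release the libref when finished; CLEAR is the documented keyword form */
libname mydata clear;
```

ACCESS=READONLY is a cheap safeguard for programs that only need to read from the database.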
NESUG 18 Ins & Outs
Once you have assigned a library reference to the data source, you can reference tables as if they were SAS
data sets. Some important differences should be remembered when working with DBMS data. While SAS data
sets can be sorted, sort order has no meaning for a DBMS table, which has no inherent row order. Also, SAS has
DBMS-specific default behavior for translating SAS data types to and from DBMS data types. Most data
manipulations that involve DBMS data are best handled by the DBMS; write programs so that the DBMS does as much
of the work as possible, and limit SAS data handling to operations that require SAS functions and procedures.
for LACT_TABLE. For the DBMS to make use of this index, both must be accounted for in the where clause.
Thus the most efficient code becomes:
data to_lacs;
set db2lib.LACT_TABLE (where=(SPECIES_CODE='C' and BREED_CODE='TO' and
"01jan2000"d le CALV_PDATE le "31dec2002"d));
run;
The DBMS will use the index to locate and retrieve only the small percentage of records requested. The DBMS
will make use of a partial key, thus to extract all of the goat records it is possible to use a partial key including only
SPECIES_CODE.
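The partial-key case can be sketched as follows. This assumes, as the example above implies, that 'C' is the species code for goats and that SPECIES_CODE is the leading column of the index; the database name yourdb is a placeholder.

```sas
/* Placeholder (assumption): yourdb is the DB2 database name */
libname db2lib db2 datasrc=yourdb access=readonly;

/* Only the leading key column is constrained, so the DBMS can still
   use the index as a partial key to locate the goat records */
data goats;
  set db2lib.LACT_TABLE (where=(SPECIES_CODE='C'));
run;
```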
Creating DBMS data
It is possible to use SAS to create, modify, or delete DBMS tables. Maintaining data integrity, managing data
access, and maintaining database integrity can be complicated when using a non-native application and are outside
the scope of this paper. One important reason for creating DBMS tables, however, is to facilitate complex joins of
DBMS data with other data sources. This will be covered in more detail later. When creating DBMS data using the
data step, it can help the performance of DBMS joins later if the data type is defined on the data statement
with the DBTYPE= data set option.
libname mydb2 db2 DB=mydbms;
filename bigzip pipe "zcat cows_hol.Z 2>bout.2";
data mydb2.cows (dbtype=(brd='CHAR(2)' cty='CHAR(3)' idn='CHAR(12)')
dbnull=(_ALL_=no));
infile bigzip;
input brd $2. cty $3. idn $12. ;
run;
The libref mydb2 was created in the LIBNAME statement and points to the DBMS database. Since no schema or
user is specified, the DB2 default environment for temporary user tables will be used (i.e., the session user must
have some write privileges). The file cows_hol.Z is a very large compressed file, so the UNIX command zcat is used
to read records out of the archive one at a time without needing the space to inflate the whole file at once. Some
IDs include 12 numeric digits, but others include characters. If the transaction set includes only IDs that are all
numeric, the default data type would be numeric, and these data could not be matched to the character field in the
DBMS. Matching the data type of analogous data in the database eliminates the need for translation during
processing, so it is generally more efficient. DBMS and SAS handle missing or null data differently. If you have
null or missing data, be sure to read the DBMS-specific references to understand how this data will be treated.
key for retrieving a record from the DBMS. You cannot retrieve multiple rows from the DBMS from a single look-up
record. Likewise, if no match is found, an error code is returned. Resetting _error_ to 0 prevents all the
non-matches from cluttering the log. Additional processing for non-matching records could also be nested with
_error_ = 0 in a DO group using if _error_ then do. Another automatic variable, _IORC_, is created when you use the
SET statement with the KEY= option. Using this variable for error checking can be more complex than using the
simple _error_ return code. For a full explanation, see the SAS documentation on error processing.
This process is most efficient when a small number of look-ups are to be done against a large DBMS. Although it
is generally quite fast, it does require a separate query to the database for each record processed. Also, all the
variables in the key must be included in the record from the PDV, with the same names as the variables in the
DBMS table.
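The keyed look-up described above might be sketched as follows. This is an illustration under stated assumptions, not the paper's original program: work.lookup and its variable cowid are hypothetical, yourdb is a placeholder, and ID_XREF_TABLE and ANIM_ID_NUM are borrowed from the join example later in this paper.

```sas
/* Placeholder (assumption): yourdb is the DB2 database name */
libname db2lib db2 datasrc=yourdb access=readonly;

data found;
  /* work.lookup is a hypothetical SAS data set of IDs; cowid is renamed
     so the PDV holds the key under the DBMS column name */
  set work.lookup (rename=(cowid=ANIM_ID_NUM));

  /* One keyed query is sent to the DBMS for each observation read above */
  set db2lib.ID_XREF_TABLE (dbkey=ANIM_ID_NUM) key=dbkey;

  if _error_ then do;
    _error_ = 0;   /* keep non-matches from cluttering the log */
    delete;        /* drop look-up records that have no DBMS match */
  end;
run;
```

Deleting the non-matching observations also avoids keeping DBMS values retained from the previous successful look-up.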
While DBKEY= identifies the key or partial key (variables) for the DBMS index, the values of the DBINDEX= option
are YES (look for an appropriate index to use), NO (don’t look for an index), or the name of the specific
DBMS index to use. DBINDEX= can be used like DBKEY= above, in the data step, or it can be used as an option on
the LIBNAME statement that defines the DBMS libref. If DBINDEX= is used, the value of KEY= is the name of the
DBMS index. Use of DBKEY= overrides DBINDEX=.
DBCONDITION= can also be used as a data set option on the SET statement to accomplish this type of look-up. It
allows the specification of DBMS-specific SQL query clauses, which SAS passes directly to the DBMS for processing.
For information on structuring these clauses, see the documentation for SAS data set options for relational
databases, and the documentation for the DBMS data source.
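A minimal sketch of DBCONDITION= follows. It shows only a simple subsetting clause rather than a full look-up; the clause text is handed to the DBMS as written, so it must be valid for that DBMS. The database name yourdb is a placeholder, and 'C' is the species code used in the earlier example.

```sas
/* Placeholder (assumption): yourdb is the DB2 database name */
libname db2lib db2 datasrc=yourdb access=readonly;

/* The quoted clause is passed to the DBMS without translation by SAS */
data goats;
  set db2lib.LACT_TABLE (dbcondition="where SPECIES_CODE='C'");
run;
```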
SQL
To work with DBMS data, experience with SQL syntax and PROC SQL is nearly essential. A complete introduction
to SQL is outside the scope of this paper. For a solid foundation in SAS SQL, try the SAS training course
“SQL Processing with the SAS System.” Many excellent references are also available from SAS Press.
Since version 8, data set options have been available in PROC SQL. Care must be taken when using these options
on DBMS data sources because not all options can be translated to the DBMS. The SAS SQL query optimization
process is fairly robust, and SAS generally attempts to translate as much of a DBMS query as it can to
pass to the DBMS for processing. If the query contains SAS-specific language that cannot be separated from
the overall query, however, SAS will attempt to process the query by bringing the entire DBMS data table into SAS
data space. This is almost always undesirable, if not impossible due to lack of resources.
While SAS may know what to do with DBMS data, the DBMS has no idea what to do with SAS data. On the other
hand, while SAS is a fairly powerful data processor, that is the sole function of the DBMS, and thus getting it to do
as much of the work as possible is desirable. SAS will, of course, do exactly what it is told to do, and it is
regrettably simple to write perfectly valid SQL statements that prove the importance of these facts. Knowing how
SQL statements will be processed is often as important as knowing what the end product will be.
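One way to see how a statement is processed is to have SAS write the SQL it sends to the DBMS to the log. The sketch below uses the SASTRACE= system option for that purpose; this option is not discussed elsewhere in this paper, and yourdb is a placeholder database name.

```sas
/* Write the SQL that SAS/ACCESS sends to the DBMS to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Placeholder (assumption): yourdb is the DB2 database name */
libname db2lib db2 datasrc=yourdb access=readonly;

proc sql;
  select count(*) as n_records
  from db2lib.LACT_TABLE
  where SPECIES_CODE='C';
quit;

/* Turn tracing off again */
options sastrace=off;
```

If the WHERE clause appears in the traced statement, the DBMS did the subsetting; if only a bare SELECT of the table appears, the rows were brought into SAS first.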
BASIC SQL
Retrieving DBMS data
Whether the desired information is output or used to create a SAS data set, DBMS data can be read using the
defined libref:
proc sql;
create table breeds as
select * from db2lib.BREED_TABLE
where SPECIES_CODE='B';
quit;
From the program statements, SAS creates a query string in valid ANSI standard SQL, which it passes to the
DBMS for processing. SAS functions are available and are translated if possible. Otherwise they are performed after
data is retrieved from the DBMS. This is important, because all records that might meet a sub-setting requirement
based on a function that must be performed by SAS will need to be retrieved from the DBMS for processing.
proc sql;
create table breeds as
select * from db2lib.BREED_TABLE
where SPECIES_CODE='B' and substr(BREED_CODE,1,1)='H';
quit;
The SUBSTR function is not available in all DBMSs, and even for those where it is (the DB2 function name is
SUBSTRING), SUBSTR is processed by SAS unless the libname option SQL_FUNCTIONS= is set to ALL. In this
example, SAS would pass the first part of the where clause to the DBMS to retrieve all of the records with
SPECIES_CODE='B' and then check the returned records for matches to the SUBSTR function. In this case, only
a few possible breed codes would match the SUBSTR, so the problem can be avoided by restructuring the query
using the IN operator:
proc sql;
create table breeds as
select * from db2lib.BREED_TABLE
where SPECIES_CODE='B' and BREED_CODE in ('HO', 'HI');
quit;
This entire where clause would be passed to the DBMS for processing.
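When a query cannot be restructured this way, the SQL_FUNCTIONS= libname option mentioned above is the alternative. The following sketch assumes a placeholder database name yourdb; whether SUBSTR is actually passed to the DBMS depends on the engine and SAS version, so the result should be checked rather than assumed.

```sas
/* Placeholder (assumption): yourdb is the DB2 database name.
   SQL_FUNCTIONS=ALL asks SAS to pass more SAS functions to the DBMS. */
libname db2lib db2 datasrc=yourdb access=readonly sql_functions=all;

proc sql;
  create table breeds as
  select * from db2lib.BREED_TABLE
  where SPECIES_CODE='B' and substr(BREED_CODE,1,1)='H';
quit;
```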
Combining SAS or raw data with DBMS data
Often, data from an outside source needs to be combined with DBMS data, but not stored permanently in the
DBMS itself. To combine large non-DBMS data sets with data in the DBMS, it is most efficient to load the data
into temporary space in the DBMS so that the join can be processed by the DBMS. Consult the database
administrator to be sure access to temporary write space is available. You can create the DBMS table with the
data step, PROC SQL, or the DBLOAD procedure (however, in version 9, DBLOAD is no longer recommended):
data mydb2.cows;
set sav.cow_data (keep=cowbreed cowid);
by cowbreed cowid;
if first.cowid;
run;
proc sql;
select SPECIES_CODE, ANIM_KEY, ANIM_ID_NUM, COUNTRY_CODE, BREED_CODE, SEX_CODE
from db2lib.ID_XREF_TABLE id, mydb2.cows cows
where id.SPECIES_CODE='0' and
id.ANIM_ID_NUM=cows.cowid and
id.BREED_CODE=cows.cowbreed and
id.COUNTRY_CODE='USA' and
id.SEX_CODE='F';
quit;
In this example, the cow records were already stored in a SAS data set. The key variables were loaded into a
temporary DBMS table COWS. This data was then available for an SQL join to data in the DBMS. Retrieving
DBMS keys (ANIM_KEY) could be done through the data step, as shown in the previous sections, but for larger
transaction sets, or when complex joins to data in multiple DBMS tables are required, this is a more efficient way
to get the DBMS to process the data.
GRAB BAG
Using a libref, DBMS data can be used directly in PROCs. Any data set options that are valid in the PROC and
valid for the DBMS are available.
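For instance, a procedure can read a DBMS table through the libref with a WHERE= data set option. The sketch below reuses the table and column names from the earlier data step example; yourdb is a placeholder database name.

```sas
/* Placeholder (assumption): yourdb is the DB2 database name */
libname db2lib db2 datasrc=yourdb access=readonly;

/* Count lactation records by breed for one species, reading the DBMS table directly */
proc freq data=db2lib.LACT_TABLE (where=(SPECIES_CODE='C'));
  tables BREED_CODE;
run;
```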
The SQL pass-through facility allows explicit ANSI-standard (or DBMS-specific) SQL code to be passed directly
to the database. This is sometimes more efficient than the query that would be created by SAS. This is an
advanced topic that requires knowledge of DBMS-specific SQL, so it has been omitted from discussion in this paper.
Store DBMS libname statements in the SAS macro library. Include the DEFER=yes option to keep from making
unnecessary connections to the DBMS. Programs that require DBMS access then need only a single line of code,
librefs are standardized across programs, and nobody needs to remember all the libname options.
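A stored libname macro could look like the sketch below. The macro name db2lib and the database name yourdb are illustrative; for autocall use, the definition would be saved as db2lib.sas in a directory listed in the SASAUTOS= option.

```sas
/* Macro definition: in practice stored once in the autocall macro library.
   Placeholder (assumption): yourdb is the DB2 database name. */
%macro db2lib;
  libname db2lib db2 datasrc=yourdb defer=yes access=readonly;
%mend db2lib;

/* The single line a program needs in order to get the standard libref */
%db2lib
```

Because of DEFER=YES, calling the macro in a program that never touches the DBMS costs no connection.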
Use DBMS names for data whenever feasible. Use ALL_CAPS or some other easy identifier for DBMS variables.
The DB2 schema SYSCAT includes metadata tables; among them are TABLES, COLUMNS, and INDEXES (All
DBMS keep this information somewhere). These tables can be accessed by SAS using a LIBNAME statement
with the SCHEMA= option. Create a tool for getting a report of tables and table properties, including valid indexes
so that SAS programmers can remain up to date on what data is available.
libname db2lib db2 datasrc=yourdb access=readonly
schema=SYSCAT;
This is an example of the LIBNAME statement for accessing SYSCAT tables; ‘yourdb’ is the name of the DB2
database.
proc sql;
create table db2_tables as
select trim(tb.TABNAME) as tname, substr(tb.TBSPACE,1,10) as tt,
trim(tb.REMARKS) as rmk
from db2lib.TABLES tb, db2lib.COLUMNS col
where trim(col.TABSCHEMA)="YOURSCHEMA" and
tb.TABNAME=col.TABNAME and
tb.TABSCHEMA=col.TABSCHEMA
order by tname;
quit;
This creates a list of available tables with the tablespace they occupy and the remarks describing the tables'
contents, which are stored in the TABLES table; ‘YOURSCHEMA’ is the primary schema for your database (DB2INST1,
for example).
proc sql;
create table col_&sysparm as
select *
from db2lib.COLUMNS
where TABNAME = "&sysparm" and TABSCHEMA = "YOURSCHEMA";
quit;
CONCLUSIONS
The SAS/ACCESS engine opens up the power of DBMS data management to SAS applications and makes the
flow of data between storage (in the DBMS) and function (SAS analytical power) fairly seamless. To work
effectively and efficiently, knowledge of the DBMS structure and an understanding of SQL and of the function of
the SAS/ACCESS engine are required. The most successful programmer will also be willing to experiment with
system-based optimization to determine which procedures produce the most satisfactory results based on
correctness and use of resources (memory, CPU time, and programmer time).
This paper is intended only to give examples of the types of syntax and programming that are possible using
DBMS data in SAS processing and analysis, partly because the specific options available differ substantially
between DBMS ACCESS engines. At the same time, the implication of the libref for DBMS data is that all the
programming possibilities for SAS data can be used for DBMS data. SAS documentation is the best resource
for determining the compatibility of SAS code and any specific DBMS.
REFERENCES
SAS Institute Inc., SAS/ACCESS® Software for Relational Databases: Reference, Version 8, Cary, NC: SAS
Institute Inc., 1999, 300 pp.
ACKNOWLEDGMENTS
SAS is a registered trademark of SAS Institute Inc. of Cary, North Carolina.
DB2 and Informix are registered trademarks of IBM.
ORACLE is a registered trademark of Oracle Corporation.
SYBASE is a registered trademark of Sybase Incorporated.
Teradata is a registered trademark of NCR.
ADABAS is a registered trademark of Software AG.
CONTACT INFORMATION
Ashley H. Sanders
Animal Improvement Programs Laboratory
BARC-West, Bldg. 005, Rm. 312
Beltsville, MD 20705-2350
Work Phone: 301-504-8667
Fax: 301-504-8092
Email: [email protected]
Web: http://aipl.arsusda.gov