Accessing Metadata
Paper TU06
ABSTRACT
There is an extensive range of metadata available to the SAS programmer: information not only about all of our SAS file
entities (e.g. libraries, datasets, variables, etc.), but also the settings associated with our current SAS session, such as
system options, titles, and macro variable values.
We can exploit this information to help us write programs which are more flexible, require less maintenance, and are less
error prone.
This tutorial first introduces some of the metadata available to us; it then explains how the programmer can access this
information, using SAS features such as Proc SQL, the Macro language, and SAS Component Language (SCL) functions.
Examples are presented which illustrate both the principles and benefits of adopting a data-driven approach to some
common programming tasks.
INTRODUCTION
Metadata is data dictionary information, i.e. data about data. We are all familiar with the CONTENTS procedure in SAS,
which lists information about our SAS datasets, including the following:
• Dataset name
• Date of creation
• Number of observations and number of variables
• Whether the dataset is sorted, compressed, indexed
• Information for each variable
o Position (sequential number) in the dataset
o Variable name
o Type
o Length
o Format and Informat
o Label
Most of us are also aware that this information can be output to a SAS dataset, from where it can be accessed by the
programmer and used to drive processing, for example:
• finding all datasets that contain a certain variable
• testing whether a dataset has any observations, and making subsequent actions conditional on this
• checking that variable type and length are consistent before merging or appending datasets, etc.
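As a minimal sketch of this familiar approach, PROC CONTENTS can write its results to a dataset, which is then queried like any other. The DERIVED library is borrowed from later examples in this paper; the variable name PATID and the output dataset name WORK.CONTENTS are assumptions made for illustration.

```sas
/* Write variable-level metadata for every dataset in DERIVED to WORK.CONTENTS */
proc contents data=derived._all_ noprint
              out=work.contents(keep=libname memname name type length nobs);
run;

/* Which datasets contain the (assumed) variable PATID? */
proc print data=work.contents noobs;
  where upcase(name) = 'PATID';
  var memname name type length nobs;
run;
```

Note that in the OUT= dataset TYPE is numeric (1 for numeric variables, 2 for character), whereas the dictionary tables described later hold it as text.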
However, there is much more information available to the programmer. This information is held in data dictionary tables,
and allows us to retrieve information about the libraries, datasets, external files, system options, titles, and macro variables
associated with the current SAS session.
This information can help us to write more flexible code. For example, we may wish to carry out some sort of data
conversion (e.g. increase the length of a character variable) in every dataset where a variable exists. One approach might
be to look in all of our datasets to find out where the variable is present, and then type in all of the relevant dataset names
into a program which carries out the required conversion.
However, this approach is laborious, error-prone, and requires maintenance (e.g. if new datasets are subsequently added
to our library that may also require conversion).
By making use of metadata, we can write code that does not need the dataset names hard-coded: in this example, the
program can find out for itself which datasets require conversion, and process each one accordingly. This is a
data-driven program.
PhUSE 2009
DICTIONARY TABLES
Dictionary tables are special read-only PROC SQL tables. SAS automatically assigns the DICTIONARY libref, but these
tables are accessible only through PROC SQL.
However, SAS also provides views based on these dictionary tables, in the SASHELP library, and these can be accessed
by SAS procedures and in the data step.
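For instance, the view SASHELP.VCOLUMN (the counterpart of DICTIONARY.COLUMNS) can be read directly in a data step. This is a minimal sketch; the library name DERIVED and the variable name PATID are assumptions for illustration.

```sas
/* Read column metadata through the SASHELP view in an ordinary data step */
data work.hasvar;
  set sashelp.vcolumn(keep=libname memname name type length);
  /* LIBNAME and MEMNAME are stored in uppercase; NAME keeps its original case */
  where libname = 'DERIVED' and upcase(name) = 'PATID';
run;

proc print data=work.hasvar noobs;
run;
```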
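Table 1 below lists some of the dictionary tables available and their equivalent SASHELP views:
Table 1: Dictionary tables and equivalent SASHELP views
Dictionary table              SASHELP view
DICTIONARY.CATALOGS           SASHELP.VCATALG
DICTIONARY.COLUMNS            SASHELP.VCOLUMN
DICTIONARY.DICTIONARIES       SASHELP.VDCTNRY
DICTIONARY.EXTFILES           SASHELP.VEXTFL
DICTIONARY.FORMATS            SASHELP.VFORMAT
DICTIONARY.GOPTIONS           SASHELP.VGOPT
DICTIONARY.INDEXES            SASHELP.VINDEX
DICTIONARY.LIBNAMES           SASHELP.VLIBNAM
DICTIONARY.MACROS             SASHELP.VMACRO
DICTIONARY.MEMBERS            SASHELP.VMEMBER
DICTIONARY.OPTIONS            SASHELP.VOPTION
DICTIONARY.STYLES             SASHELP.VSTYLE
DICTIONARY.TABLES             SASHELP.VTABLE
DICTIONARY.TITLES             SASHELP.VTITLE
DICTIONARY.VIEWS              SASHELP.VVIEW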
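To list the entire contents of a dictionary table, select all of its columns in Proc SQL, for example:
proc sql;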
select * from dictionary.extfiles;
quit;
This will list all columns (FILEREF and PATH in this example) and all rows in the table. To see which columns are
available in each dictionary table, we can use the DESCRIBE TABLE statement, for example:
proc sql;
describe table dictionary.columns;
quit;
The table definition is written to the log. The definitions of COLUMNS and of the other dictionary tables listed in
Table 1 are displayed in the appendix.
Dictionary tables are created at the start of each SAS session and dynamically maintained throughout by SAS.
Dictionary tables can be large, so once we have the definition available, we can specify individual columns on the SELECT
statement, and add a subsetting WHERE clause, in order to obtain the specific information of interest, as in the example
that follows:
SQL syntax:
proc sql;
select memname from dictionary.columns
where upcase(libname)='DERIVED' and upcase(name)='SID1A';
quit;
Output:
Member Name
--------------------------------
A_AEV
A_CMP
A_DAR
A_DYS
A_IDENT
A_SCR
A_UPD23
...
Note: accessing dictionary information through SASHELP views is generally slower than using Proc SQL.
If the query produces more than one row of output, the macro variable(s) will contain only the value(s) from the first row. If
the query has no rows in its output, the macro variable is not modified, or, if it did not already exist, is not created. The
automatic variable &SQLOBS contains the number of rows produced by the query.
The different forms of the syntax are illustrated in the following panels.
To create a macro variable based on values in the first row of a query result, just provide a single name
for the macro variable (for each column selected), for example:
proc sql noprint;
select count(patid)
into :npat
from derived.demo;
quit;
To create one macro variable per row of a query result, code a range of macro variable names, with the
hyphen (or keywords THROUGH, THRU), for example:
This produces sequential macro variables from multiple rows. In this example, room is created for 99
macro variables, but only those required are actually created (one per centre).
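A minimal sketch of this form follows. The dataset DERIVED.DEMO appears elsewhere in this paper; the variable name CENTRE and the macro variable names are assumptions for illustration.

```sas
proc sql noprint;
  /* One macro variable per distinct centre: &cen1, &cen2, ... */
  select distinct centre
    into :cen1 - :cen99
    from derived.demo;
quit;

/* &SQLOBS still holds the row count of the last query after QUIT */
%let ncen = &sqlobs;
%put NOTE: &ncen centre(s) found.;
```

Saving &SQLOBS straight away matters because the next Proc SQL statement will overwrite it.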
To concatenate the values from one column into one macro variable, code the SEPARATED BY option
with a delimiting character (e.g. space, comma, etc.), for example:
This produces a single macro variable, containing a list of values (in this example, high recruiting centre
numbers), each separated by a blank space.
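A minimal sketch of this form follows, assuming (hypothetically) that a "high recruiting" centre is one with at least 20 subjects in DERIVED.DEMO and that the centre variable is called CENTRE.

```sas
/* Initialise first: if the query returns no rows, INTO leaves the variable untouched */
%let hicen = ;

proc sql noprint;
  select centre
    into :hicen separated by ' '
    from derived.demo
    group by centre
    having count(*) >= 20;   /* assumed threshold for a high recruiter */
quit;

%put NOTE: High recruiting centres: &hicen;
```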
So, continuing on from the above example, we could produce a separate randomization listing for each high recruiting
centre, by doing the following (within a macro):
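One way this could look is sketched below. It is an illustration only: the dataset DERIVED.RAND, the character variable CENTRE, the threshold of 20 subjects and the macro name %RANDLIST are all assumptions.

```sas
%macro randlist(mincount=20);
  %local hicen ncen i cen;

  /* Build the space-separated list of high recruiting centres */
  proc sql noprint;
    select centre into :hicen separated by ' '
      from derived.demo
      group by centre
      having count(*) >= &mincount;
  quit;
  %let ncen = &sqlobs;

  /* One listing per centre on the list */
  %do i = 1 %to &ncen;
    %let cen = %scan(&hicen, &i, %str( ));
    title "Randomization listing: centre &cen";
    proc print data=derived.rand noobs;
      where centre = "&cen";
    run;
  %end;
  title;
%mend randlist;

%randlist()
```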
Or, if unique laboratory parameter codes have been stored into a sequence of macro variables, e.g. &param1-&param99,
we could use this list to produce a separate plot for each parameter:
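A minimal sketch of such a loop is given here, using the && notation to resolve each numbered macro variable in turn. The dataset DERIVED.LAB and the variables PARAMCD, VISIT and RESULT are hypothetical, and PROC SGPLOT is simply one possible plotting procedure.

```sas
%macro plotparams;
  %local i nparam;

  /* One macro variable per distinct parameter code: &param1, &param2, ... */
  proc sql noprint;
    select distinct paramcd
      into :param1 - :param99
      from derived.lab;
  quit;
  %let nparam = &sqlobs;

  /* &&param&i resolves to &param1, &param2, ... and then to the stored code */
  %do i = 1 %to &nparam;
    title "Parameter &&param&i";
    proc sgplot data=derived.lab;
      where paramcd = "&&param&i";
      scatter x=visit y=result;
    run;
  %end;
  title;
%mend plotparams;

%plotparams
```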
Therefore, if there is a subsequent change to our data (e.g. a centre recruits sufficient subjects to become a ‘high
recruiter’, or further lab parameters are added) we do not have to edit our program. There is no hard coding of which
centres are high recruiting, or which lab parameters are plotted – the programs are data-driven.
CALL EXECUTE
This is a data step routine, of the general form:
CALL EXECUTE(argument);
where argument is a quoted string or a character expression. The SAS statements contained within the string are
executed immediately after the data step containing the CALL EXECUTE has finished, for example:
data _null_;
call execute('proc print; run;');
run;
We can make a program data-driven by building the CALL EXECUTE argument based upon values in a dataset. For
example, if we have a dataset LFT_PATS, containing patients with elevated liver function tests (LFTs), we can write
dynamic SAS statements to fetch adverse event (AE) data specifically for these patients. The SAS log displays the
statements generated by the CALL EXECUTE.
data _null_;
set lft_pats end = lastone;
if _n_ = 1 then do;
call execute('data ae_lft;');
call execute('set ae;');
call execute('where patid in (');
end;
call execute(quote(patid));
if lastone then do;
call execute(');');
call execute('run;');
end;
run;
Log:
NOTE: CALL EXECUTE generated line.
1 + data ae_lft;
2 + set ae;
3 + where patid in (
4 + "1234"
5 + "2345"
6 + "7890"
7 + "8901"
8 + );
9 + run;
The code that is executed has been built according to our data, i.e. data-driven.
• This is often analogous to writing SAS statements to an external file, which is then %INCLUDE’d back into our
program and executed.
• Since CALL EXECUTE can be executed conditionally, it can often be used where we would otherwise need
%IF… (i.e. it can sometimes overcome the need to write a macro).
• CALL EXECUTE can also generate a macro call for each row of a dataset, for example:
data _null_;
set demo;
call execute('%profile(pno = '||patid||');');
run;
The macro %PROFILE is executed for every row in the DEMO dataset, with patient number passed as a
parameter.
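One point to be aware of: when the CALL EXECUTE argument is a macro call, the macro's own macro statements run immediately, while the data step is still executing, and only the SAS statements it generates are held back until the step ends. If the macro contains logic that depends on the results of code it generates (for example CALL SYMPUT followed by %IF), this timing can give unexpected results. A common remedy is to wrap the call in %NRSTR so that the whole macro call is deferred. The sketch below assumes a DEMO dataset with a character PATID, and uses a stand-in definition of %PROFILE purely for illustration.

```sas
/* Stand-in for the real %PROFILE macro (hypothetical content) */
%macro profile(pno=);
  %put NOTE: Producing profile for patient &pno;
%mend profile;

data _null_;
  set demo;
  /* %NRSTR hides the macro call until after this data step has finished */
  call execute('%nrstr(%profile(pno=' || strip(patid) || '))');
run;
```

With this form the log shows the generated lines as macro calls, e.g. %profile(pno=...), rather than as the statements produced by the macro.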
%SYSFUNC FUNCTIONS
The %SYSFUNC function allows access from within a SAS macro to most data step functions and several SAS
Component Language (SCL) functions (SAS file I/O and SAS external file functions). This simplifies many common macro
tasks, and gives us ready access to data about datasets, variables, external files, etc. In other words, they represent
another means of accessing SAS metadata.
• Data step functions: most are available. INPUT and PUT are not, but INPUTC / INPUTN and PUTC / PUTN can be used
instead of INPUT and PUT respectively.
• SCL functions: some useful SCL functions are shown in Table 2 (not an exhaustive list):
Table 2: Some useful SCL functions
OPEN
    Opens a SAS dataset and returns a value (id). Many of the subsequent functions in this table use the id as an
    argument (and not the dataset name):
        %let dsid = %sysfunc(open(crt.d_ae));
CLOSE
    Closes the dataset given by the id, and returns a value (0 if successful). Any dataset opened with the OPEN
    function should be closed with the CLOSE function:
        %let rc = %sysfunc(close(&dsid));
ATTRC
    Returns the value of a character attribute for the dataset associated with the id:
        %let att = %sysfunc(attrc(&dsid,attrib));
    where attrib is (amongst others)
        sortedby - sort order if dataset is sorted (otherwise blank)
        label    - dataset label
        mem      - dataset name
ATTRN
    Returns the value of a numeric attribute for the dataset associated with the id:
        %let att = %sysfunc(attrn(&dsid,attrib));
    where attrib is (amongst others)
        nobs  - number of observations
        nvars - number of variables
        crdte - date created (SAS date time format)
VARNUM
    Returns the number (position) of a variable in a SAS dataset. The following VARxxxx functions in this table use
    this number as an argument (and not the variable name):
        %let vnum = %sysfunc(varnum(&dsid,varname));
    where varname is the variable name.
VARFMT, VARINFMT
    Returns the format (or informat) assigned to a SAS dataset variable:
        %let fmt = %sysfunc(varfmt(&dsid,&vnum));
    or
        %let fmt = %sysfunc(varfmt(&dsid,%sysfunc(varnum(&dsid,varname))));
VARTYPE
    Returns the type of a SAS dataset variable, C for character and N for numeric:
        %let type = %sysfunc(vartype(&dsid,&vnum));
Note: these functions are also available in the data step, in which case character strings should be enclosed in quotes as
usual, for example:
data _null_;
dsid = open('derived.ae');
vn = varnum(dsid,'patno');
rc = close(dsid);
put dsid= vn= rc=;
run;
Other similar SAS functions exist (and can also be called in a macro via %SYSFUNC), for working with:
• Catalogs and catalog entries (CEXIST)
• External files (FILEEXIST, FOPEN, FCLOSE, etc.)
• Directories (DOPEN, DNUM, etc.)
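As a small illustration of these functions used through %SYSFUNC, the sketch below tests whether an external file exists before a program tries to read it. The macro name %CHKFILE and the path are hypothetical, and the sketch assumes the path contains no commas or unbalanced parentheses.

```sas
%macro chkfile(path);
  /* FILEEXIST returns 1 if the physical file exists, 0 otherwise */
  %if %sysfunc(fileexist(&path)) %then
    %put NOTE: &path exists.;
  %else
    %put NOTE: &path was not found.;
%mend chkfile;

%chkfile(/tmp/example.txt)
```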
Example Application 1: Performing an operation on all variables in a dataset
General principles: Proc SQL is used to obtain a list from the data dictionary tables of the variable names
in a given dataset. The SELECT…INTO…SEPARATED BY clause stores this list to a single macro
variable. The automatic macro variable &SQLOBS tells us the number of items on the list, and this
information is then used in a macro loop to perform the operation (in this example a rename) on each
variable in turn.
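The principles just described might be sketched as follows. This is an illustration under stated assumptions rather than the original listing: the macro name %ADDPREFIX and its parameters are hypothetical, the rename simply adds a prefix, and the dataset is assumed to have at least one variable, with names short enough to take the prefix.

```sas
%macro addprefix(lib=, ds=, prefix=);
  %local varlist nvar i var;

  /* List of variable names in the dataset, held in one macro variable */
  proc sql noprint;
    select name into :varlist separated by ' '
      from dictionary.columns
      where libname = "%upcase(&lib)" and memname = "%upcase(&ds)";
  quit;
  %let nvar = &sqlobs;   /* number of names on the list */

  /* Generate one old=new pair per variable */
  proc datasets lib=&lib nolist;
    modify &ds;
    rename
    %do i = 1 %to &nvar;
      %let var = %scan(&varlist, &i, %str( ));
      &var = &prefix&var
    %end;
    ;
  run; quit;
%mend addprefix;

%addprefix(lib=work, ds=demo, prefix=x_)
```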
Example Application 2: Performing an operation on all variables of a particular type within a library
General principles: Proc SQL is used to obtain from the dictionary tables a list of datasets containing, in this
example, character date variables. For each dataset on this list, a second list is obtained of the names of
variables. Each of these will undergo a conversion to a numeric date. Both lists are in the form of a set of
sequentially-numbered macro variables, plus a count of the number of items on the list. Macro loops are used to
process these two lists, one nested within the other.
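The %CONVERT macro that follows relies on a dataset DATEVARS and on the macro variables &NSET and &SET1, &SET2, etc. already being in place. A minimal sketch of how they might be created is given here. The rule used to recognise a character date variable (a character column whose name ends in DT) is purely an assumption for illustration, as is the upper limit of 999 datasets; the library name OUTDATA is taken from the macro itself.

```sas
proc sql noprint;
  /* First list: every character date variable in the library */
  create table datevars as
    select memname, name
      from dictionary.columns
      where libname = 'OUTDATA'
        and memtype = 'DATA'
        and type    = 'char'
        and upcase(name) like '%DT';   /* assumed naming convention */

  /* Dataset names go into sequentially numbered macro variables */
  select distinct upcase(memname)
    into :set1 - :set999
    from datevars;
quit;

/* Count of datasets on the list */
%let nset = &sqlobs;
```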
Syntax:
* Macro to process first list, creating and processing second list for each item ;
%macro convert;
%put NOTE: There are %trim(&nset) dataset(s) to process...;
%* Loop to process first list;
%do i = 1 %to &nset;
%put NOTE: ...Processing dataset &&set&i.....;
proc sql noprint;
select count(distinct name) into :ndat
from datevars
where upcase(memname) = "&&set&i";
%* Create second list (date variable names);
select distinct name into :dat1 - :dat%left(%trim(&ndat))
from datevars
where upcase(memname) = "&&set&i";
quit;
%put NOTE: ...There are %trim(&ndat) date(s) to convert on &&set&i.....;
data &&set&i;
set outdata.&&set&i;
%* Loop to process 2nd list;
%do j = 1 %to &ndat;
%put NOTE: ......Converting date &&dat&j;
if length(&&dat&j)= 8 then &&dat&j..N = input(&&dat&j,yymmdd8.);
else if not missing(&&dat&j) then put
"CHECK: partial date found on dataset &&set&i - " PATID= &&dat&j=;
%end;
run;
%end;
%mend;
%convert;
A new numeric version of each date is created, with a suffix of ‘N’ added to the variable name. Character dates of
length 8 are assumed to be complete; incomplete dates are not converted, and a message is issued to the log.
Example Application 3: Saving and restoring current title and footnote settings
General principles: Proc SQL is used to fetch the current title and footnote settings from the dictionary tables,
and store these in a dataset. CALL EXECUTE can then be used to restore them later if required.
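A minimal sketch of this technique follows. The dataset name WORK.SAVED_TITLES is hypothetical, and the sketch assumes that the saved text contains no macro triggers (& or %), since these would be resolved when CALL EXECUTE submits the generated statements.

```sas
/* Save the current titles and footnotes */
proc sql;
  create table work.saved_titles as
    select type, number, text
      from dictionary.titles;
quit;

/* ... statements that change the titles and footnotes go here ... */
title "Interim listing";

/* Clear everything, then rebuild the saved settings */
title; footnote;
data _null_;
  set work.saved_titles;
  length stmt $300;
  /* TYPE is 'T' for a title and 'F' for a footnote */
  if type = 'T' then stmt = cats('title', number);
  else stmt = cats('footnote', number);
  /* e.g. title2 "Safety population" ; */
  stmt = catx(' ', stmt, quote(strip(text)), ';');
  call execute(stmt);
run;
```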
Example Application 4: Deleting global macro variables
General principles: Proc SQL is used to fetch global macro variable names from the dictionary tables, and store
them in a dataset. This dataset can be subsequently read, and CALL EXECUTE used to issue calls to %SYMDEL
to delete each macro variable (without having to code their individual names). This can be useful for clearing up at
the end of a macro.
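A minimal sketch of this technique follows; the dataset name WORK.GLOBALS is hypothetical. DISTINCT is used because a long macro variable value occupies several rows of DICTIONARY.MACROS (one per OFFSET).

```sas
/* Names of all macro variables in the global symbol table */
proc sql;
  create table work.globals as
    select distinct name
      from dictionary.macros
      where scope = 'GLOBAL';
quit;

/* Issue one %SYMDEL per name; single quotes stop %SYMDEL running at compile time */
data _null_;
  set work.globals;
  call execute('%symdel ' || strip(name) || ' / nowarn;');
run;
```

In practice it may be safer to add a condition on NAME (for example a project-specific prefix) so that only the macro variables created by the current program are deleted.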
Example Application 5: Performing an operation on all variables in a dataset, using SCL functions
General principles: %SYSFUNC is used within a macro to call SCL functions which access dataset information.
This information (in this example the names of the dataset variables) is stored to a series of sequentially
numbered macro variables. In turn this permits us to process these variables in a macro loop, renaming each one,
without having to code its individual name. This could be used to add a suffix / prefix to the name, or, as in this
example, convert the name to uppercase where necessary.
Syntax:
%macro rename;
%let dsid = %sysfunc(open(outdata.adev));
%let nvars = %sysfunc(attrn(&dsid,nvars));
%do i = 1 %to &nvars;
%let oldname&i = %sysfunc(varname(&dsid,&i));
%let newname&i = %upcase(&&oldname&i);
%end;
%let rc = %sysfunc(close(&dsid));
%* Only rename where new name is different to old name;
proc datasets lib = outdata nolist nodetails;
modify adev;
rename
%do i = 1 %to &nvars;
%if &&oldname&i ne &&newname&i %then &&oldname&i = &&newname&i;
%end;
;
run; quit;
%mend rename;
%rename;
Example Application 6: Building a list of variable names for use in a KEEP or DROP statement
General principles: %SYSFUNC is used within a macro to call SCL functions which access dataset information.
This time the information is used to build a list of variable names in a single macro variable, which is returned by
calling the macro. The macro could subsequently be called, for example, in a KEEP or DROP statement.
Syntax:
%macro varlist(dsn);
%let dsid = %sysfunc(open(&dsn));
%let nvars = %sysfunc(attrn(&dsid,nvars));
%let varlist=;
%do i = 1 %to &nvars;
%let varlist = &varlist %sysfunc(varname(&dsid,&i));
%end;
%let rc = %sysfunc(close(&dsid));
&varlist
%mend varlist;
The example shows the trapping of mismatched rows into separate datasets. In each case we only want to keep
the original variables, and not those coming from the other dataset in the merge (which will have missing values in
the output dataset anyway). The example shows how the macro call removes the need for us to determine all of
the relevant variable names, and code them in a lengthy KEEP statement.
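A merge of the kind described might be sketched as below. It assumes that the %VARLIST macro above has already been compiled, and that DEMO and AE are hypothetical datasets, both sorted by PATID.

```sas
/* Mismatched rows are trapped in DEMO_ONLY and AE_ONLY, each keeping
   only the variables of the dataset the rows came from */
data both
     demo_only(keep=%varlist(demo))
     ae_only(keep=%varlist(ae));
  merge demo(in=indemo) ae(in=inae);
  by patid;
  if indemo and inae then output both;
  else if indemo      then output demo_only;
  else                     output ae_only;
run;
```

Because %VARLIST contains only macro statements and returns plain text, it can be called in the middle of a statement in this way.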
Example Application 7: Testing for dataset existence and number of observations; using this metadata to
control program flow
General principles: %SYSFUNC is used within a macro to call SCL functions which access dataset information.
This information is used to control the actions of, in this example, a reporting program. This permits the program
to be run without errors when the expected dataset is empty, or even if it does not exist at all.
Syntax:
%macro genrpt(dsn=);
%if %sysfunc(exist(&dsn)) = 1 %then %do;
%let dsid = %sysfunc(open(&dsn));
%let nobs = %sysfunc(attrn(&dsid,nobs));
%if &nobs GT 0 %then %do;
%put NOTE: &nobs obs will be reported from &dsn..;
*<Usual reporting syntax here>;
%end;
%else %do;
%put NOTE: 0 obs in &dsn - Producing placeholder report. ;
*<Alternative reporting syntax here>;
%end;
%let rc = %sysfunc(close(&dsid));
%end;
%else %do;
%put WARNING: Dataset &dsn does not exist.;
%end;
%mend genrpt;
%genrpt(dsn=sae);
Example Application 8: Saving and restoring a system option setting
General principles: %SYSFUNC is used within a macro to call the GETOPTION function, storing the specified
option setting to a macro variable. If subsequent statements alter the option setting, we can revert to the original
setting by referencing the macro variable in an OPTIONS statement.
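A minimal sketch of this technique follows, using the LINESIZE option purely as an example; the macro variable name SAVE_LS is hypothetical.

```sas
/* Save the current setting of the option */
%let save_ls = %sysfunc(getoption(linesize));
%put NOTE: Original LINESIZE is &save_ls;

/* Change the option for the statements that need it */
options linesize=80;
/* ... statements that rely on the changed setting go here ... */

/* Revert to the original setting */
options linesize=&save_ls;
```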
CONCLUSION
This paper has demonstrated how metadata, used in conjunction with a few relevant programming techniques, can yield
efficiency gains for the SAS programmer. Much SAS code has been written in the pharmaceutical arena with a “write once,
use once” mentality, and the wheel then reinvented with similar code written on the next project.
By adopting more of an application software developer’s mindset, we can raise our game and exploit the metadata at our
disposal to write more flexible, data-driven programs. This enhanced flexibility benefits us in several ways:
• there is less need to “hard-code” the names of SAS entities such as datasets and variables in our programs, and,
in some instances, to even know their names;
• there is less chance of making an error while editing programs, especially when dealing with large numbers of
entities, and therefore less re-work required;
• there is less maintenance required, since changes to our data libraries (e.g. the addition of a new variable, or the
removal of a dataset) are reflected automatically in the dictionary information that now drives our programs;
• there is improved re-usability, through the creation of code that can be applied to common programming tasks
across many projects.
Many of the concepts and SAS components described here have been with us for a number of years. However, aided by
the emergence of data standards such as CDISC, and with increasing pressure to produce our deliverables faster, the
data-driven approach to programming offers us a useful tool in our quest to accelerate clinical development.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Iain Humphreys
PRA International
Pacific House, Imperial Way
Reading, RG2 0TD, UK
Email: [email protected]
Web: www.praintl.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks of SAS Institute Inc. in the USA and other
countries. Other brand and product names are trademarks of their respective companies.
APPENDIX
SAS DATA DICTIONARY TABLE DEFINITIONS
DICTIONARY.CATALOGS
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
objname char(32) label='Object Name',
objtype char(8) label='Object Type',
objdesc char(256) label='Object Description',
created num format=DATETIME informat=DATETIME label='Date Created',
modified num format=DATETIME informat=DATETIME label='Date Modified',
alias char(32) label='Object Alias',
level num label='Library Concatenation Level'
DICTIONARY.COLUMNS
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
name char(32) label='Column Name',
type char(4) label='Column Type',
length num label='Column Length',
npos num label='Column Position',
varnum num label='Column Number in Table',
label char(256) label='Column Label',
format char(49) label='Column Format',
informat char(49) label='Column Informat',
idxusage char(9) label='Column Index Type',
sortedby num label='Order in Key Sequence',
xtype char(12) label='Extended Type',
notnull char(3) label='Not NULL?',
precision num label='Precision',
scale num label='Scale',
transcode char(3) label='Transcoded?'
DICTIONARY.DICTIONARIES
memname char(32) label='Member Name',
memlabel char(256) label='Dataset Label',
name char(32) label='Column Name',
type char(4) label='Column Type',
length num label='Column Length',
npos num label='Column Position',
varnum num label='Column Number in Table',
label char(256) label='Column Label',
format char(49) label='Column Format',
informat char(49) label='Column Informat'
DICTIONARY.EXTFILES
fileref char(8) label='Fileref',
xpath char(1024) label='Path Name',
xengine char(8) label='Engine Name'
DICTIONARY.FORMATS
libname char(8) label='Library Name',
memname char(32) label='Member Name',
path char(1024) label='Path Name',
objname char(32) label='Object Name',
fmtname char(32) label='Format Name',
fmttype char(1) label='Format Type',
source char(1) label='Format Source',
minw num label='Minimum Width',
mind num label='Minimum Decimal Width',
maxw num label='Maximum Width',
maxd num label='Maximum Decimal Width',
defw num label='Default Width',
defd num label='Default Decimal Width'
DICTIONARY.GOPTIONS
optname char(32) label='Option Name',
opttype char(8) label='Option type',
setting char(1024) label='Option Setting',
optdesc char(160) label='Option Description',
level char(8) label='Option Location',
group char(32) label='Option Group'
DICTIONARY.INDEXES
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
name char(32) label='Column Name',
idxusage char(9) label='Column Index Type',
indxname char(32) label='Index Name',
indxpos num label='Position of Column in Concatenated Key',
nomiss char(3) label='Nomiss Option',
unique char(3) label='Unique Option'
DICTIONARY.LIBNAMES
libname char(8) label='Library Name',
engine char(8) label='Engine Name',
path char(1024) label='Path Name',
level num label='Library Concatenation Level',
fileformat char(8) label='Default File Format',
readonly char(3) label='Read-only?',
sequential char(3) label='Sequential?',
sysdesc char(1024) label='System Information Description',
sysname char(1024) label='System Information Name',
sysvalue char(1024) label='System Information Value'
DICTIONARY.MACROS
scope char(32) label='Macro Scope',
name char(32) label='Macro Variable Name',
offset num label='Offset into Macro Variable',
value char(200) label='Macro Variable Value'
DICTIONARY.MEMBERS
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
dbms_memtype char(32) label='DBMS Member Type',
engine char(8) label='Engine Name',
index char(32) label='Indexes',
path char(1024) label='Path Name'
DICTIONARY.OPTIONS
optname char(32) label='Option Name',
opttype char(8) label='Option type',
setting char(1024) label='Option Setting',
optdesc char(160) label='Option Description',
level char(8) label='Option Location',
group char(32) label='Option Group'
DICTIONARY.STYLES
libname char(8) label='Library Name',
memname char(32) label='Member Name',
style char(32) label='Style Name',
crdate num format=DATETIME informat=DATETIME label='Date Created'
DICTIONARY.TABLES
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
dbms_memtype char(32) label='DBMS Member Type',
memlabel char(256) label='Dataset Label',
typemem char(8) label='Dataset Type',
crdate num format=DATETIME informat=DATETIME label='Date Created',
modate num format=DATETIME informat=DATETIME label='Date Modified',
nobs num label='Number of Physical Observations',
obslen num label='Observation Length',
nvar num label='Number of Variables',
protect char(3) label='Type of Password Protection',
compress char(8) label='Compression Routine',
encrypt char(8) label='Encryption',
npage num label='Number of Pages',
filesize num label='Size of File',
pcompress num label='Percent Compression',
reuse char(3) label='Reuse Space',
bufsize num label='Bufsize',
delobs num label='Number of Deleted Observations',
nlobs num label='Number of Logical Observations',
maxvar num label='Longest variable name',
maxlabel num label='Longest label',
maxgen num label='Maximum number of generations',
gen num label='Generation number',
attr char(3) label='Dataset Attributes',
indxtype char(9) label='Type of Indexes',
datarep char(32) label='Data Representation',
sortname char(8) label='Name of Collating Sequence',
sorttype char(4) label='Sorting Type',
sortchar char(8) label='Charset Sorted By',
reqvector char(24) format=$HEX48 informat=$HEX48 label='Requirements Vector',
datarepname char(170) label='Data Representation Name',
encoding char(256) label='Data Encoding',
audit char(3) label='Audit Trail Active?',
audit_before char(3) label='Audit Before Image?',
audit_admin char(3) label='Audit Admin Image?',
audit_error char(3) label='Audit Error Image?',
audit_data char(3) label='Audit Data Image?'
DICTIONARY.TITLES
type char(1) label='Title Location',
number num label='Title Number',
text char(256) label='Title Text'
DICTIONARY.VIEWS
libname char(8) label='Library Name',
memname char(32) label='Member Name',
memtype char(8) label='Member Type',
engine char(8) label='Engine Name'