Parallel Processing in SAS: Pooja Matekar


COGNIZANT TECHNOLOGY SOLUTIONS

PARALLEL PROCESSING in SAS


Pooja Matekar
12/9/2014
Introduction:
When we work with large data sets and large volumes of SAS processes, using multiple
CPUs to run parts of a SAS job in parallel threads can significantly speed up processing.
There are several ways to multi-thread jobs in SAS. For example, we can use threaded
procedures in SAS 9, or run parallel jobs by using MP Connect in SAS/CONNECT 8 or
SAS/CONNECT 9. However, in many scenarios we cannot take advantage of these methods.

Parallel processing, or multi-threading a SAS job, is a mode of execution in which two or
more portions of a job run in parallel streams to accelerate processing and reduce the real
time to completion. By using the technique that is explained here, we can significantly
reduce run time. However, the gain also depends on the number of CPUs that are available.

Let us discuss one such approach: introducing parallelism through code.

Parallel processing through code:


This technique is based on a mechanism that spawns multiple jobs from a single instance of
code. Depending on how your SAS code is structured, either the whole process or parts of
the process can be run in parallel.

With this technique the user executes just a single copy of the code, which in turn
generates and executes many SAS scripts. Once the code is submitted, the remaining steps
all follow automatically: splitting the input data set, generating multiple instances of
the code, executing them, and aggregating the output data sets. Because the process is
automated, there is no need to monitor the spawned jobs.
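The split, execute, and aggregate flow described above can be illustrated outside SAS with a small shell sketch. It is only an analogy under stated assumptions: the directory `./demo_pipeline`, the ten-line toy input, and the "double each value" step are all hypothetical stand-ins for the input data set and for your SAS code.

```shell
#!/bin/sh
# Hypothetical scratch directory and toy data; nothing here comes from the paper.
WORK=./demo_pipeline
mkdir -p "$WORK"
rm -f "$WORK"/part*.txt "$WORK"/out*.txt
N=4                                   # plays the role of NUM_PARTITIONS

# 1. Toy input: ten numbered records standing in for the input data set.
awk 'BEGIN { for (i = 1; i <= 10; i++) print i }' > "$WORK/input.txt"

# 2. Split: deal the records round-robin into N partition files.
awk -v n="$N" -v dir="$WORK" \
    '{ print > (dir "/part" ((NR - 1) % n + 1) ".txt") }' "$WORK/input.txt"

# 3. Execute: process every partition in the background (here: double each value).
i=1
while [ "$i" -le "$N" ]; do
    awk '{ print $1 * 2 }' "$WORK/part$i.txt" > "$WORK/out$i.txt" &
    i=$((i + 1))
done
wait                                  # block until all N workers are done

# 4. Aggregate: concatenate the partial outputs in partition order.
: > "$WORK/output.txt"
i=1
while [ "$i" -le "$N" ]; do
    cat "$WORK/out$i.txt" >> "$WORK/output.txt"
    i=$((i + 1))
done
wc -l < "$WORK/output.txt"            # number of records in the combined output
```

Note that the combined output is ordered by partition, not by original record order, which is also true of the concatenated SAS data set produced later in this paper.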

The following is a generalized version of the code, which you can adapt to implement
parallelism in your own job. The %PARTITION macro wraps your code (the placeholder marked
YOUR CODE).

PROC PRINTTO LOG="./redirected.log" NEW; RUN;

➢ The PRINTTO procedure helps keep the main log clean: it redirects the log output of the
parallel-processing steps to another location under the specified file name.
%LET JOBNAME = MYJOB ;
*** Job name must be a unique name and should NOT have extensions **;
➢ This job name is used to build the names of the generated SAS scripts.
%LET NUM_PARTITIONS = 10; * Number of threads;
➢ NUM_PARTITIONS sets the number of threads, i.e. the number of independent SAS
scripts to be created.



%MACRO PARTITION;
➢ This is the macro in which the multiple scripts are created.
OPTIONS OBS=0 NOSYNTAXCHECK FULLSTIMER SOURCE2;

%DO PART_NUM=1 %TO &NUM_PARTITIONS;


FILENAME MPRINT "./&JOBNAME._PART&PART_NUM..sas" LRECL=170;
OPTIONS MPRINT MFILE;
.
.
YOUR CODE
.
.
%END; *PART_NUM;
➢ The above %DO loop creates one SAS script per partition. With the MPRINT and MFILE
system options in effect, the code that the macro generates is written to the file assigned
to the MPRINT fileref.

OPTIONS OBS=MAX NOMFILE;


➢ As the %PARTITION macro iterates and generates the SAS scripts, it also executes the
statements that it contains. Run as is, this would defeat the purpose of multi-threading,
because the data sets would be created sequentially rather than concurrently. To avoid
this, we set OBS=0, so that no observations are processed and the macro passes through all
partitions almost instantly, spending practically no processing time.
➢ Once the scripts are generated, we set OBS=MAX again.

*** Create a shell script ***;


DATA _NULL_;
FILE "RUN_&JOBNAME..sh";
%DO PART_NUM = 1 %TO &NUM_PARTITIONS;
PUT "sas &JOBNAME._PART&PART_NUM..sas -log &JOBNAME._PART&PART_NUM..log &";
%IF %SYSFUNC(MOD(&PART_NUM, 20)) = 0 %THEN %DO;
PUT "wait";
%END; *SYSFUNC;
%END; *PART_NUM;
PUT "wait";
RUN;
%MEND PARTITION;

%PARTITION;
➢ The above %DO loop writes a single shell script that contains one command line per SAS
script. Each line ends with &, so that script starts in the background. The %IF block
inserts a wait after every 20th line, which makes the shell pause until all scripts in that
group have finished before it starts the next group.



➢ The value 20 means that the SAS scripts execute in batches of 20. The first 20 scripts
run in parallel; once they have all finished, the next 20 start, and so on until the last
script has executed. The final wait covers the last batch, which may hold fewer than 20
scripts.
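The batching behaviour described above can be tried without SAS by using a small shell sketch. It makes several assumptions: `run_part` is a hypothetical stand-in for the `sas ... -log ...` command line, `./demo_batch` is a hypothetical scratch directory, and the batch size is 3 instead of 20 so that the example stays small.

```shell
#!/bin/sh
# Hypothetical scratch directory for the stand-in log files.
WORK=./demo_batch
mkdir -p "$WORK"
rm -f "$WORK"/MYJOB_PART*.log

# Stand-in for "sas MYJOB_PARTn.sas -log MYJOB_PARTn.log" (assumption:
# it only writes a one-line log so the sketch runs without SAS installed).
run_part() {
    echo "part $1 finished" > "$WORK/MYJOB_PART$1.log"
}

NUM_PARTITIONS=7
BATCH_SIZE=3        # the text above uses 20

part=1
while [ "$part" -le "$NUM_PARTITIONS" ]; do
    run_part "$part" &                      # launch in the background
    if [ $((part % BATCH_SIZE)) -eq 0 ]; then
        wait                                # batch full: block until all finish
    fi
    part=$((part + 1))
done
wait                                        # catch the final, partial batch

ls "$WORK"/MYJOB_PART*.log | wc -l          # one log per partition
```

With 7 partitions and a batch size of 3, the loop waits after parts 3 and 6, and the trailing `wait` picks up part 7. Choosing a batch size close to the number of available CPUs is a reasonable starting point, though the best value depends on the workload.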

*** Execute the shell script ***;


X "chmod 755 RUN_&JOBNAME..sh";
X "./RUN_&JOBNAME..sh";
X "wait";
➢ The above piece of code executes the shell script created in the previous step.

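%DO PART_NUM = 1 %TO &NUM_PARTITIONS;
X "cat &JOBNAME._PART&PART_NUM..log >> &outlog.&run_dt._&sysuserid._&sysmacroname._%upcase(&option.).log";
%END; *PART_NUM;
➢ This %DO loop appends each partition's log file to the main log. Like every iterative
%DO loop, it must run inside a macro definition, and the macro variables &outlog, &run_dt
and &option must already be defined by the surrounding job.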
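The effect of the log-concatenation loop can be sketched in plain shell. The directory `./demo_logs`, the one-line stand-in logs, and the target name `combined.log` are hypothetical; in the SAS code the target name is built from macro variables instead.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
WORK=./demo_logs
mkdir -p "$WORK"
JOBNAME=MYJOB
NUM_PARTITIONS=3

# Create stand-in partition logs so the sketch is self-contained.
part=1
while [ "$part" -le "$NUM_PARTITIONS" ]; do
    echo "NOTE: log of part $part" > "$WORK/${JOBNAME}_PART${part}.log"
    part=$((part + 1))
done

# Start from an empty target, then append each partition log in order.
: > "$WORK/combined.log"
part=1
while [ "$part" -le "$NUM_PARTITIONS" ]; do
    cat "$WORK/${JOBNAME}_PART${part}.log" >> "$WORK/combined.log"
    part=$((part + 1))
done

cat "$WORK/combined.log"
```

Appending with `>>` in a numbered loop keeps the logs in partition order, which a shell wildcard would not guarantee once there are ten or more parts (PART10 sorts before PART2).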

***Concatenate all the datasets***;


DATA OUT_LIB.OUTPUT;
SET %DO PART_NUM = 1 %TO &NUM_PARTITIONS;
LIB.OUTPUT_&JOBNAME._PART&PART_NUM
%END;;
RUN;
➢ The above DATA step concatenates all the intermediate data sets into the final data
set.

PROC PRINTTO; RUN;


➢ PROC PRINTTO without any options routes the log back to its default destination.

***Optional Part: Delete intermediate scripts and logs***;


%DO PART_NUM = 1 %TO &NUM_PARTITIONS;
X "rm &JOBNAME._PART&PART_NUM..sas";
X "rm &JOBNAME._PART&PART_NUM..log";
%END; *PART_NUM;
X "rm RUN_&JOBNAME..sh";
➢ The final %DO loop deletes the intermediate scripts and log files. Lastly, the shell
script itself is deleted.
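For reference, the cleanup step amounts to the following shell commands. This is a minimal sketch under assumptions: `./demo_cleanup` is a hypothetical scratch directory, the files are empty stand-ins, and `final_output.txt` is a hypothetical placeholder for output that must survive the cleanup.

```shell
#!/bin/sh
# Hypothetical scratch directory and stand-in files, for illustration only.
WORK=./demo_cleanup
mkdir -p "$WORK"
JOBNAME=MYJOB
NUM_PARTITIONS=3

# Create empty stand-ins for the generated scripts, logs and run script.
part=1
while [ "$part" -le "$NUM_PARTITIONS" ]; do
    : > "$WORK/${JOBNAME}_PART${part}.sas"
    : > "$WORK/${JOBNAME}_PART${part}.log"
    part=$((part + 1))
done
: > "$WORK/RUN_${JOBNAME}.sh"
: > "$WORK/final_output.txt"          # placeholder for output that must be kept

# Delete the per-partition scripts and logs, then the run script itself.
part=1
while [ "$part" -le "$NUM_PARTITIONS" ]; do
    rm -f "$WORK/${JOBNAME}_PART${part}.sas" "$WORK/${JOBNAME}_PART${part}.log"
    part=$((part + 1))
done
rm -f "$WORK/RUN_${JOBNAME}.sh"

ls "$WORK"                            # only the kept output should remain
```

Using `rm -f` means a missing file (for example, a partition that never produced a log) does not raise an error. It may be worth postponing this step until the combined log has been checked.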
