Integrating Python and Base SAS PDF
Integrating Python and Base SAS PDF
ABSTRACT
This paper presents a new framework for integrating Python into Base SAS® on the
Microsoft Windows platform. Previous attempts at invoking Python functionality through
Base SAS involved the use of a Java object in the DATA step. This paper aims to present a
more comprehensive framework that includes support for the following: transferring data
and parameters from SAS to Python; transferring data from Python to SAS; transferring
figures generated in Python (using Matplotlib) back to SAS reports; and transferring
contents from standard output and error streams from the Python process back to SAS
reports. Although the components required to invoke Python from Base SAS have been
available for a while, the lack of ease-of-use could prove to be a barrier to widespread use.
The proposed framework aims to improve the usability of Python functionality through Base
SAS by making the process more user-friendly. The framework is implemented through
helper SAS macros, helper Python scripts, and helper binaries.
INTRODUCTION
Integration pathways between SAS and popular open source analytics software such as R
and Python allow data scientists and data analysts to use the best features of both worlds. A
solid integration pathway between SAS and R was introduced through SAS/IML® in 2009
(SAS Institute Inc., 2009). However, Python does not seem to enjoy similar first-class
integration support yet.
A white paper published in 2015 (Hall, Myneni, and Zhang, 2015) provides a practical way
to achieve this using the Base SAS Java Object. The approach presented is not even specific
to (or restricted to) Python; it can be used to integrate with R or any other program.
Perhaps due to the broad range of programs that can be targeted using the approach, it
does not address certain specific usability issues that could deliver an improved integration
experience to SAS users.
This paper attempts to: enumerate various features/options that are needed to provide a
comprehensive SAS-Python integration experience; propose an architecture to support such
features; and provide an implementation of the proposed architecture. An open-source
implementation is made available under the permissive MIT License (for commercial use,
modification, distribution, and private use) at: https://github.com/svglolla/sasnpy.
A point to note: while SAS-Python integration is referred to in this paper multiple times, it
focuses on the case of trying to use/invoke Python capabilities from SAS and excludes the
case of trying to use SAS from within Python. Ample development seems to be occurring
and available through SAS-sponsored open-source GitHub repositories (SAS Institute Inc.,
2015) in the context of using SAS functionality (and feature set) through Python.
1
DESIRABLE FEATURES
Several features are needed to provide a complete SAS-Python integration solution. An
integration pathway that supports the following feature subset is likely to cover a significant
number of use cases arising in the SAS-Python integration context:
Ability to invoke Python scripts: the ability to invoke Python scripts from SAS is the
first and perhaps the only strictly necessary requirement.
Ability to transfer data from SAS to Python and Python to SAS: the ability to transfer
data between SAS and Python semi-automatically is one of the first steps that could
enhance the usability of the integration significantly. Data to be transferred could be:
(a) tables representing datasets used by (or produced during) analyses; (b) scalar
values that could be used to govern specific behaviors in the Python script; (c)
binary payloads representing data that might not readily fit into a tabular form (for
example, PMML models or network graph representations).
Ability to capture output and error streams from Python: the ability to automatically
capture output and error streams and inject them into SAS logs, outputs, or reports
will make information that is potentially not present in data returned from Python
available to the user without the user having to process/search log files.
Ability to retrieve plots generated in Python: the ability to automatically capture plots
and inject them into SAS reports is likely to make information (and insights) that is
potentially not readily available in the data returned from Python available to the
user.
Ability to set/modify various parameters that modify the behavior of the integration
component such as Python installation to use, turning on (or off) plot capture,
whether or not to keep the Python program alive, etc. Whenever there is a design
decision to be made and no clear winner emerges, it is probably desirable that the
user be able to switch between various options elegantly without having to modify
the code of the integration component.
It is also desirable that the integration component is light-weight and minimally intrusive in
both the SAS and Python environments.
While the feature subset is enumerated in the context of SAS-Python integration, they are
likely to apply to other similar integration use-cases and should probably be considered
during the development of similar integration components.
ARCHITECTURE OVERVIEW
While it is possible to provide a platform-independent mechanism that works on various
operating systems, the remainder of this paper restricts its scope to the Windows platform.
The general idea driving the design of the architecture for the SAS-Python integration
solution proposed in this paper is to hide from the end-user, as much as possible, the
technical details relating to the integration and let the user focus on using the integration
instead. In other words, the integration solution should allow the end-user to focus on using
the integration as opposed to having to exercise (or learn) the nitty-gritty technical details
about the integration itself. The user should be allowed to focus on their task (for example,
perform network analytics computations using the igraph library on data from SAS) instead
of worrying about how to capture the output stream from Python.
Python provides an ability to specify the script to run in a file form as a command line
argument; it is also possible to provide the script through an input stream. The ability to
invoke Python (or any other program) through SAS will satisfy the first requirement
2
discussed in the previous section. SAS provides multiple mechanisms to achieve this: the
SYSTASK statement, the X statement, the CALL SYSTEM routine, and the %SYSEXEC
macro. However, when the other requirements from the previous section are considered,
the idea of a separate component managing interactions with the Python program begins to
gain traction. This component will be in-charge of creating temporary folders, managing
temporary files, managing the Python program (process) lifetime, capturing the output and
error streams, etc.
The details of using the aforementioned component can be encapsulated into a SAS helper
script file that the end-user can use. This script will surface macros to the end-user that can
be used as entry points into the integration pipeline. This script file will also be responsible
for insulating the end-user from the details relating to data transfer between SAS and
Python and injecting output and error streams from the Python process into SAS reports.
A helper Python script (module) will be responsible for insulating the end-user from the
details relating to capturing plots, converting data between SAS datasets to pandas data
frames (or other Python-native representations), installing required Python packages if
necessary etc.
The SAS-Python integration solution described in this paper is delivered as single a
component (referred to as “SASnPy integration component”; .ZIP file) that can be
downloaded and used. The component will have three major subcomponents: a helper SAS
script file (.SAS file), a core integration binary component (.DLL file), and a helper Python
script file (.PY file); see Figure 1. The component will also contain other files (text files
serving as templates etc.) that might be used by any of the three major subcomponents.
The SAS helper file will be included into the user’s SAS program (%INCLUDE) and will be
executing in the SAS session along with the user’s SAS script. The integration binary
component will be loaded into the SAS session upon initialization and will instantiate a
Python process. The helper Python script and the user’s Python script will be running in the
Python process instantiated by the integration binary. Figure 2 illustrates the activity
sequence involved in executing a Python script from the user’s SAS script using the SASnPy
integration component. Separate sequences are involved in other activities such as
transferring data between SAS and Python.
3
Figure 2: Example activity diagram for the execution of a Python script from SAS.
IMPLEMENTATION OVERVIEW
The implementation, source code, and related information are available under the MIT
license at https://github.com/svglolla/sasnpy. This section will provide an overview of some
of the key implementation details.
4
Some of the key methods exposed by this component are as follows:
PyStartSession(): initialize a Python process and load helper modules.
PyEndSession(): shutdown an open Python session, close the process, and conduct
clean-up activities.
PySetPath(): used to set the Python installation to be used to execute the user's
Python script.
PyExecuteScript(): used to execute the user's Python script. This method is also
responsible for capturing (and processing) output and error streams from the Python
process and making them available for use by the SAS helper script.
PySessionTempLocation(): used to get the temporary folder location where
intermediate files are created for the consumption of various sub-components.
PySetInputTable(): used to induce information relating to a SAS dataset into the
Python session.
PySetInputScalar(): used to induce a scalar value into the Python session.
PyGetOutputTable(): used to get information relating to a dataset in Python into the
SAS session.
PyGetOutputScalar(): used to get the value of a scalar object in the Python session
into the SAS session.
The end-user does not have to be aware of these methods or their usage. These methods
will be invoked almost exclusively through the SAS helper script file sub-component.
5
get_output_table(): used to route a table-like object (for example, a pandas data
frame) from Python back into the SAS session.
get_output_scalar(): used to route a scalar value from the Python back into the SAS
session.
The end-user does not have to be aware of these methods or their usage. These methods
will be invoked almost exclusively through the .NET binary component.
6
To use the component, the SAS-Python integration component zip file matching the bit-ness
(32-bit or 64-bit) of the SAS installation has to be downloaded and extracted. For the
purpose of the following example, assume that the files have been extracted to "C:/sasnpy".
Also assume that the Python installation to use is located at "C:/Python3.6/Python.exe" and
the user’s Python script is located at "C:/scripts/test.py".
data _null_;
/* Execute script */
%PyExecute("C:/scripts/test.py", result);
run;
7
Example Python script file:
import pandas as pd
air_ds.describe()
comet_ds.describe()
for i in range(max_iter):
print(i)
py_ds1 = pd.DataFrame.from_dict(data1)
py_ds2 = pd.DataFrame.from_dict(data2)
py_res1 = 12.34
py_res2 = "This works"
8
The current implementation does not support the use of non-local Python engines;
adding this feature will allow SAS-Python integration to take advantage of remote
Pythons installations (possibly as services) that offer better performance.
The current implementation does not support asynchronous script execution or
multiple concurrent Python sessions; adding this feature will allow SAS-Python
integration to allow the user to execute large Python jobs without blocking the
execution of the user’s SAS scripts when warranted.
The current implementation does not provide any fault-tolerance; if a Python process
shuts down unexpectedly or is stopped, the job/script submitted to the Python
process is lost. Adding this feature to the implementation will improve the reliability
of the integration.
Further abstraction (and parametrization) of the implementation is possible in a
manner that will allow the use of the implementation for engines other than Python.
REFERENCES
SAS Institute Inc. 2009. “Enhancements in SAS/IML® 9.22 Software.” Accessed March 23,
2019. https://support.sas.com/rnd/app/iml/IML922.html
Hall, Patrick, Myneni, Radhika and Zhang, Ruiwen. 2015. "Open Source Integration using
the Base SAS Java Object." Accessed March 23, 2019.
https://github.com/sassoftware/enlighten-
integration/blob/master/SAS_Base_OpenSrcIntegration/SAS_Base_OpenSrcIntegration.pdf
SAS Institute Inc. 2015. "SAS Software: Open Source from SAS Software." Accessed March
23, 2019. https://github.com/sassoftware
Giesecke, Robert. 2012. "UnmanagedExports 1.2.7: Unmanaged Exports (DllExport for
.Net)." Accessed March 23, 2019. https://www.nuget.org/packages/UnmanagedExports
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Venu Gopal Lolla
Oklahoma State University
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration.
Other brand and product names are trademarks of their respective companies.