CUDA Programming within Mathematica
Introduction
CUDA, short for Compute Unified Device Architecture, is a C-like programming language developed by NVIDIA to facilitate general computation on the Graphics Processing Unit (GPU). CUDA allows users to design programs around the many-core hardware architecture of the GPU. By using many cores, carefully designed CUDA programs can achieve speedups (1000x in some cases) over similar CPU implementations. Coupled with the low purchase price and power consumption per GFLOP (billion floating-point operations per second), the GPU has quickly become an ideal platform for both high-performance clusters and scientists wishing for a supercomputer at their disposal.
Yet while users can achieve such speedups, CUDA has a steep learning curve, including learning the CUDA programming API and understanding how to set up and compile CUDA programs. This learning curve has, in many cases, alienated potential CUDA programmers.
Mathematica's CUDALink simplifies the use of the GPU within Mathematica by introducing dozens of functions that tackle areas ranging from image processing to linear algebra. CUDALink also allows users to load their own CUDA functions into the Mathematica kernel.
By utilizing Mathematica's language, mirroring Mathematica's function syntax, and integrating with existing programs and development tools, CUDALink offers an easy way to use CUDA. In this document we describe the benefits of CUDA integration in Mathematica and provide some applications for which it is suitable.
Today, GPUs priced at just $500 can achieve a performance of 2 TFLOP (trillion floating-point operations per second). The GPU also competes with the CPU in terms of power consumption, using a fraction of the power required by the CPU for the same GFLOP performance.
Because GPUs are off-the-shelf hardware, fit into a standard desktop, have low power consumption, and perform exceptionally, they are very attractive to users. Yet a steep learning curve has always been a hindrance for users wanting to use CUDA in their applications.
Mathematica's CUDALink alleviates much of the burden of using CUDA from within Mathematica. CUDALink allows users to query system hardware, use the GPU for dozens of built-in functions, and define new CUDA functions to be run on the GPU.
High-performance computing
Mathematica has built-in support for multicore systems, utilizing all cores on the system for
optimal performance. Many functions automatically utilize the power of multicore processors,
and built-in parallel constructs make high-performance programming easy.
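For instance, a tiny sketch using one of the built-in parallel constructs, ParallelTable, which distributes the iterations across all available kernels with no extra setup:

ParallelTable[PrimeQ[2^p - 1], {p, {11, 13, 17, 19, 23}}]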
Scalability
Wolfram Research's gridMathematica allows Mathematica programs to be parallelized over many machines in a cluster or grid configuration.
With Mathematica, memory and thread management is automatically handled for the user.
The Mathematica memory manager handles memory transfers intelligently in the background.
Memory, for example, is not copied to the GPU until computation is needed and is flushed out
when the GPU memory gets full.
Mathematica's CUDA support streamlines the whole programming process, allowing GPU programmers to follow a more interactive style of programming.
Ready-to-use applications
CUDA integration in Mathematica provides several ready-to-use CUDA functions that cover a broad range of topics such as mathematics, image processing, and financial engineering. Examples are given in the application sections below.
C Code Generation
Mathematica 8 introduces the ability to export expressions written using Compile to a C file. The C file can then be compiled and either run as a Mathematica command (for native speed) or integrated with an external application. Through the code generation mechanism, you can use Mathematica both for prototyping and for native-speed deployment.
To motivate the C code generation feature, we will price a call option using the Black-Scholes equation. The call option in terms of the Black-Scholes equation is defined by:

d_1 = \frac{(T - t)\left(r + \frac{\sigma^2}{2}\right) + \log\frac{S}{X}}{\sigma\sqrt{T - t}}, \qquad d_2 = d_1 - \sigma\sqrt{T - t}

Setting t = 0 (and writing s for \sigma), this translates directly into Mathematica, and evaluating the pricing formula symbolically yields a closed form in terms of Erfc:

d1 = (T (r + s^2/2) + Log[S/X])/(Sqrt[T] s);
d2 = d1 - s Sqrt[T];
BlackScholes = CDF[NormalDistribution[0, 1], d1] S -
  CDF[NormalDistribution[0, 1], d2] X E^(-r T)

1/2 S Erfc[-((T (r + s^2/2) + Log[S/X])/(Sqrt[2] Sqrt[T] s))] -
 1/2 E^(-r T) X Erfc[-((T (r - s^2/2) + Log[S/X])/(Sqrt[2] Sqrt[T] s))]
The following command generates the C code, compiles it, and links it back into Mathematica to
provide native speed:
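The command itself is not reproduced above; a minimal sketch of the idea, using Compile with CompilationTarget -> "C" to compile the Black-Scholes formula down to C (the variable names are illustrative):

cf = Compile[{{s, _Real}, {x, _Real}, {t, _Real}, {r, _Real}, {v, _Real}},
  Module[{d1 = (t (r + v^2/2) + Log[s/x])/(Sqrt[t] v)},
   (* closed-form Black-Scholes call price, written with Erf so it compiles *)
   s (1 + Erf[d1/Sqrt[2]])/2 -
    x Exp[-r t] (1 + Erf[(d1 - v Sqrt[t])/Sqrt[2]])/2],
  CompilationTarget -> "C"]

A call such as cf[30., 32., 1., 0.08, 0.3] then runs the generated C code at native speed.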
LibraryLink
LibraryLink allows you to load C functions as Mathematica functions. It is similar in purpose to
MathLink, but, by running in the same process as the Mathematica kernel, it avoids the memory
transfer cost associated with MathLink. This loads a C function from a library; the function adds
one to a given integer:
addOne =
  LibraryFunctionLoad["demo", "demo_I_I", {Integer}, Integer]

LibraryFunction[<>, demo_I_I, {Integer}, Integer]
The library function is run with the same syntax as any other function:
addOne[3]

4
CUDALink and OpenCLLink are examples of LibraryLink's usage.
Symbolic C Code
Using Mathematica's symbolic capabilities, users can generate C programs within Mathematica.
The following, for example, creates macros for common math constants:
<< SymbolicC`
These are all constants in the Mathematica system context. We use Mathematica's CDefine to declare a C macro:
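The cell defining s is not reproduced above; a minimal sketch that would generate the macros shown below:

constants = {Catalan, Degree, E, EulerGamma, Glaisher,
   GoldenRatio, Khinchin, MachinePrecision, Pi};
(* one CDefine per constant, named after the symbol, valued at its numericization *)
s = CDefine[SymbolName[#], N[#]] & /@ constants;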
ToCCodeString[s]

#define Catalan 0.915965594177219
#define Degree 0.017453292519943295
#define E 2.718281828459045
#define EulerGamma 0.5772156649015329
#define Glaisher 1.2824271291006226
#define GoldenRatio 1.618033988749895
#define Khinchin 2.6854520010653062
#define MachinePrecision 15.954589770191003
#define Pi 3.141592653589793
By representing the C program symbolically, you can manipulate it using standard Mathematica
techniques. Here, we convert all the macro names to lowercase:
ReplaceAll[s,
 CDefine[name_, val_] :> CDefine[ToLowerCase[name], val]]

{CDefine[catalan, 0.915966],
 CDefine[degree, 0.0174533], CDefine[e, 2.71828],
 CDefine[eulergamma, 0.577216], CDefine[glaisher, 1.28243],
 CDefine[goldenratio, 1.61803], CDefine[khinchin, 2.68545],
 CDefine[machineprecision, 15.9546], CDefine[pi, 3.14159]}
ToCCodeString[%]

#define catalan 0.915965594177219
#define degree 0.017453292519943295
#define e 2.718281828459045
#define eulergamma 0.5772156649015329
#define glaisher 1.2824271291006226
#define goldenratio 1.618033988749895
#define khinchin 2.6854520010653062
#define machineprecision 15.954589770191003
#define pi 3.141592653589793
C Compiler Invoking
Another Mathematica 8 innovation is the ability to call external C compilers from within Mathematica. The following compiles a simple C program into an executable:
<< CCompilerDriver`
exe = CreateExecutable["
  #include \"stdio.h\"
  int main(void) {
    printf(\"Hello from CCompilerDriver.\");
    return 0;
  }", "foo"]

/home/abduld/.Mathematica/SystemFiles/LibraryResources/Linux/foo
Using the above syntax, you can create executables with any Mathematica-supported C compiler (Visual Studio, GCC, Intel CC, etc.) in a compiler-independent fashion. The resulting executable can then be run from within Mathematica:
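One way to run the executable and capture its output (a sketch; Import with a "!" prefix runs a command and reads what it prints):

Import["!" <> exe, "Text"]

"Hello from CCompilerDriver."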
Needs@"CUDALink`"D
CUDAQ tells whether the current hardware and system configuration support CUDALink:
CUDAQ@D
True
SystemInformation gives information on the available GPU hardware:

SystemInformation[]
Mathematica's dynamic interactivity integrates with CUDALink; Manipulate, for example, can apply a chosen CUDALink operation to an image while a parameter x is varied with a slider.
Using the same technique you can build more complicated interfaces. This allows users to choose
different Gaussian kernel sizes (and their angle) and performs a convolution on the image on the
right:
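A minimal sketch of such an interface, assuming img holds an image (the angle control from the original interface is omitted here):

Manipulate[
 (* convolve with a Gaussian kernel of adjustable radius r *)
 CUDAImageConvolve[img, GaussianMatrix[r]],
 {r, 1, 10, 1}]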
Users can also get data from the web or Wolfram curated datasets. The following code imports
an image from a given URL:
image = Import[
   "http://gallery.wolfram.com/2d/popup/00_contourMosaic.pop.jpg"];
The function Import automatically recognizes the file format and converts it into a Mathematica
expression. This can be directly used by CUDALink functions, such as CUDAImageAdd:
output = CUDAImageAdd[image, ...]
All outputs from Mathematica functions, including the ones from CUDALink functions, are also
expressions, and can be easily exported to one of the supported formats using the Export
function. For example, the following code exports the above output into PNG format:
Export@"masked.png", outputD
masked.png
CUDALink Programming
Programming the GPU in Mathematica is straightforward. It begins with writing a CUDA kernel. Here, we create a simple example that negates the colors of a 3-channel image:
kernel = "
__global__ void cudaColorNegateHint
*img, int *dim, int channelsL 8
int width = dim@0D, height = dim@1D;
int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
int index = channels * HxIndex + yIndex*widthL;
if HxIndex < width && yIndex < heightL 8
for Hint c = 0; c < channels; c++L
img@index + cD = 255 - img@index + cD;<<";
Pass that string to the built-in function CUDAFunctionLoad, along with the kernel function
name and the argument specification. The last argument denotes the CUDA block size:
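The load call itself is not reproduced above; a sketch consistent with the kernel's signature (the {16, 16} block size is an assumption):

CUDAColorNegate = CUDAFunctionLoad[kernel, "cudaColorNegate",
  {{_Integer, "InputOutput"}, {_Integer, "Input"}, _Integer}, {16, 16}]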
Several things are happening at this stage. Mathematica automatically compiles the kernel
function and loads it as a Mathematica function. Now you can apply this new CUDA function to
an image:
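A minimal sketch of applying it, using a standard test image in place of the image shown in the original:

img = ExampleData[{"TestImage", "Mandrill"}];
CUDAColorNegate[img, ImageDimensions[img], ImageChannels[img]]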
System requirements

To utilize Mathematica's CUDALink, the following is required:
- Operating system: Windows, Linux, or Mac OS X 10.6.3+, in both 32- and 64-bit architectures.
- An NVIDIA CUDA-enabled product.
- For CUDA programming, a CUDALink-supported C compiler.
Financial Engineering
CUDALink's option pricing function uses either the binomial method or the Monte Carlo method, depending on the type of option selected. Computing options on the GPU can be dozens of times faster than using the CPU.
numberOfOptions = 32;
spotPrices = RandomReal[{..., 35.0}, numberOfOptions];
strikePrices = RandomReal[{..., 40.0}, numberOfOptions];
expiration = RandomReal[{..., 10.0}, numberOfOptions];
interest = 0.08;
volatility = RandomReal[{..., 0.50}, numberOfOptions];
dividend = RandomReal[{..., 0.06}, numberOfOptions];
This computes the Asian arithmetic call option with the above data:
CUDAFinancialDerivative[{"AsianArithmetic", "Call"},
 {"StrikePrice" -> strikePrices, "Expiration" -> expiration},
 {"CurrentPrice" -> spotPrices, "InterestRate" -> interest,
  "Volatility" -> volatility, "Dividend" -> dividend}]

{8.34744, 1.18026, 9.53711, 5.39746, 2.2478, 4.94333, 0.859259,
 6.08291, 2.4044, 2.41929, 6.53313, 7.48516, 2.71696, 1.08229,
 7.50222, 0.790236, 0.816325, 1.28744, 0.953413, 0.131352,
 7.60693, 1.15648, 7.07213, 8.2441, 4.45964, 7.94849,
 2.22669, 1.17793, 10.1456, 0.263328, 4.12236, 4.99476}
Computing the price of Asian arithmetic call options corresponding to random data.
Black-Scholes
For options with no built-in implementation in CUDALink, users can load their own. Here, we show how to load the Black-Scholes model for calculating a vanilla European option, and in the next section we show how to load code that computes the binary call and put of an asset-or-nothing option.
Recall from above that the call option in the Black-Scholes model is defined by:

d_1 = \frac{(T - t)\left(r + \frac{\sigma^2}{2}\right) + \log\frac{S}{X}}{\sigma\sqrt{T - t}}, \qquad d_2 = d_1 - \sigma\sqrt{T - t}

The following CUDA code implements this formula:
code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__global__ void blackScholesHReal_t
* call, Real_t * S, Real_t * X, Real_t *
sigma, Real_t * T, Real_t * r, mint lengthL 8
int ii = threadIdx.x + blockIdx.x*blockDim.x;
if Hii < lengthL 8
Real_t d1 =
HlogHS@iiDX@iiDL+Hr@iiD+HpowHsigma@iiD,2.0L2L*T@iiDLLH
sigma@iiD*sqrtHT@iiDLL;
Real_t d2 = d1 - sigma@iiD*sqrtHT@iiDL;
call@iiD =
S@iiD*NHd1L - X@iiD*expH-r@iiD*T@iiDL*NHd2L;
<
<";
This loads the above code into Mathematica:
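The load cell is not reproduced in the original; a sketch matching the kernel's signature (the 128 block size is an assumption):

CUDABlackScholes = CUDAFunctionLoad[code, "blackScholes",
  {{_Real, "Output"}, {_Real, "Input"}, {_Real, "Input"},
   {_Real, "Input"}, {_Real, "Input"}, {_Real, "Input"},
   _Integer}, 128];

This creates random input data: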
numberOfOptions = 32;
S = RandomReal[{..., 40.0}, numberOfOptions];
X = RandomReal[{..., 40.0}, numberOfOptions];
T = RandomReal[{..., 10.0}, numberOfOptions];
R = RandomReal[{..., 0.1}, numberOfOptions];
Q = RandomReal[{..., 0.08}, numberOfOptions];
V = RandomReal[{..., 0.4}, numberOfOptions];
This allocates memory for the call result:
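A minimal sketch, using CUDAMemoryAllocate:

call = CUDAMemoryAllocate[Real, numberOfOptions];

This runs the loaded function on the data: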
CUDABlackScholes[call, S, X, V, T, R, numberOfOptions]

{CUDAMemory[<17988>, Double], CUDAMemory[<29045>, Double]}
This retrieves the CUDA memory back into Mathematica:
CUDAMemoryGet[call]

{-54.0707, -7.72175, -71.9744, -52.5466, -37.523,
 -54.7824, -66.2942, -38.6423, -44.1287, -57.2223,
 -95.4658, -89.4017, -7.29522, -100.874, -34.8353,
 -7.77607, -65.9444, -88.8973, -51.1116, -72.3198,
 -16.5077, -83.9489, -73.5695, -66.6177, -44.9486, -5.54228,
 -36.1501, -92.0541, -53.079, -97.6815, -34.6764, -38.1144}
Binary Option
Using the same Black-Scholes model, we can calculate both the asset-or-nothing call and put for the binary/digital option model:
code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__global__ void binaryAssetOptionHReal_t * call,
Real_t * put, Real_t * S, Real_t * X, Real_t * T,
Real_t * R, Real_t * D, Real_t * V, mint lengthL 8
int ii = threadIdx.x + blockIdx.x*blockDim.x;
if Hii < lengthL 8
Real_t d1 = HlogHS@iiDX@iiDL + HR@iiD - D@iiD +
0.5f * V@iiD * V@iiDL * T@iiDLHV@iiD * sqrtHT@iiDLL;
call@iiD = S@iiD * expH-D@iiD * T@iiDL * NHd1L;
put@iiD = S@iiD * expH-D@iiD * T@iiDL * NH-d1L;
<
<";
This loads the function into Mathematica:
CUDABinaryAssetOption =
  CUDAFunctionLoad[code, "binaryAssetOption",
   {{_Real, "Output"}, {_Real, "Output"}, {_Real, "Input"},
    {_Real, "Input"}, {_Real, "Input"}, {_Real, "Input"},
    {_Real, "Input"}, {_Real, "Input"}, _Integer}, 128];
This creates some random data for the strike price, expiration, etc.:
numberOfOptions = 64;
S = RandomReal[{..., 40.0}, numberOfOptions];
X = RandomReal[{..., 40.0}, numberOfOptions];
T = RandomReal[{..., 10.0}, numberOfOptions];
R = RandomReal[{..., 0.1}, numberOfOptions];
Q = RandomReal[{..., 0.08}, numberOfOptions];
V = RandomReal[{..., 0.4}, numberOfOptions];
The call and put memory is allocated:
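A minimal sketch, allocating the output buffers with CUDAMemoryAllocate and then invoking the loaded function:

call = CUDAMemoryAllocate[Real, numberOfOptions];
put = CUDAMemoryAllocate[Real, numberOfOptions];
CUDABinaryAssetOption[call, put, S, X, T, R, Q, V, numberOfOptions];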
Both the call and put memory can be retrieved using CUDAMemoryGet:
CUDAMemoryGet[call]

{11.4286, 17.9626, 9.18249, 6.12558, 16.0267, 31.328, 9.00912,
 18.1401, 15.0725, 22.843, 16.9324, 2.47273, 13.3358,
 12.6676, 27.3212, 21.1174, 3.89489, 21.0599, 24.4702,
 26.2096, 16.4893, 23.412, 17.8685, 5.92021, 16.5681,
 29.1562, 10.5462, 21.7121, 12.9029, 28.6162, 13.7667,
 29.118, 13.1107, 13.5246, 30.2435, 17.8356, 19.787, 20.3238,
 15.7308, 21.2625, 7.60982, 14.5222, 17.4748, 20.4389, 2.9495,
 6.98209, 18.9727, 33.6374, 14.5634, 14.1232, 12.1017,
 17.8766, 22.6046, 18.3085, 17.1108, 18.7836, 24.816, 13.8477,
 22.9212, 12.7896, 26.5338, 9.44261, 13.4497, 0.0381422}
CUDAMemoryGet[put]

{5.44932, 5.30771, 4.84476, 11.6373, 10.5256, 4.30614,
 15.0703, 9.17361, 7.14722, 5.68002, 7.53906, 21.5292, 12.769,
 7.99204, 1.85855, 3.09161, 13.1441, 5.38475, 6.38312, 7.681,
 8.36659, 2.28956, 7.79013, 8.42515, 6.09523, 3.58002,
 6.99927, 9.59092, 5.91341, 2.7295, 7.77723, 3.32061, 6.80901,
 6.62897, 4.22096, 8.75582, 4.88707, 10.5992, 5.17202,
 4.62305, 5.16322, 13.3896, 6.92575, 7.32805, 16.0042,
 13.4299, 5.21345, 2.79763, 11.8877, 6.21821, 13.8582,
 5.52745, 6.02657, 6.10139, 3.20472, 3.7745, 1.86951, 6.29216,
 2.6889, 7.95113, 7.44534, 15.6227, 7.80008, 20.6595}
src = "
__device__ mint primes@D = 8
2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97,101,103,107,109,113,
127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,
<;
__global__ void HaltonHReal_t
* out, unsigned int dim, unsigned int nL 8
const mint tx = threadIdx.x, bx =
blockIdx.x, dx = blockDim.x;
const mint index = tx + bx*dx;
if Hindex >= nL
return ;
mint ii;
double digit, rnd, idx, half;
for Hii = 0,
idx=index, rnd=0, digit=0; ii < dim; ii++L 8
half = 1.0primes@iiD;
while Hidx > DBL_EPSILONL 8
digit = HHmintLidxL%primes@iiD;
rnd += half*digit;
idx = Hidx - digitLprimes@iiD;
half = primes@iiD;
<
out@index*dim + iiD = rnd;
<
<
";
This loads the CUDA source into Mathematica:
CUDAHaltonSequence = CUDAFunctionLoad[src,
  "Halton", {{_Real, "Output"}, _Integer, _Integer}, 256]

CUDAFunction[<>, Halton, {{_Real, _, Output}, _Integer, _Integer}]
This allocates 1024 real elements. Real elements are interpreted to be the highest floating-point precision on the machine:
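A minimal sketch, using CUDAMemoryAllocate:

mem = CUDAMemoryAllocate[Real, 1024];

This runs the Halton kernel over the buffer: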
CUDAHaltonSequence[mem, 1, 1024]

{CUDAMemory[<11521>, Double]}
You can use Mathematica's extensive visualization support to visualize the result. Here we plot the data:
ListPlot[CUDAMemoryGet[mem]]

The plot shows the 1024 generated values scattered over the unit interval.
Some random number generators and distributions are not naturally parallelizable. In those cases, users can adopt a hybrid GPU programming approach, utilizing the CPU for some tasks and the GPU for others. With this approach, users can use Mathematica's extensive statistics capabilities to generate or derive distributions from their data.
Here, we simulate a random walk by generating numbers on the CPU, performing a reduction
(using CUDAFoldList) on the GPU, and plotting the result using Mathematica:
ListLinePlot[
 Thread[List[CUDAFoldList[Plus, 0, RandomReal[{-1, 1}, 500]],
   CUDAFoldList[Plus, 0, RandomReal[{-1, 1}, 500]]]]]

The plot traces the resulting two-dimensional random walk.
Image Processing
CUDALink's image processing capabilities can be classified into three categories. The first is convolution, which is optimized for CUDA. The second is morphology, which includes operations such as erosion, dilation, opening, and closing. Finally, there are the binary operators: image multiplication, division, subtraction, and addition. All operations work on either images or lists.
Image convolution
CUDALink's convolution is similar to Mathematica's ListConvolve and ImageConvolve functions. It operates on images, lists, or CUDA memory references, and it can use Mathematica's built-in filters as the kernel.
CUDAImageConvolve[image, {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}}]
Pixel operations
CUDALink supports simple pixel operations on one or two images, such as adding or multiplying pixel values from two images. CUDAImageMultiply[img1, img2], for example, multiplies the pixel values of two images.
Morphological operations
CUDALink supports fundamental morphological operations such as erosion, dilation, opening, and closing. CUDAErosion, CUDADilation, CUDAOpening, and CUDAClosing are equivalent to Mathematica's built-in Erosion, Dilation, Opening, and Closing functions. More sophisticated operations can be built from these fundamental ones.
CUDATopHatTransform[image_, r_] :=
  Image[CUDAImageSubtract[image, CUDAOpening[image, r]]];

CUDATopHatTransform[image, 2]
Linear Algebra
You can perform various linear algebra operations with CUDALink, such as matrix-matrix and matrix-vector multiplication, finding minimum and maximum elements, and transposing matrices.
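For example, a small sketch using the built-in CUDADot and CUDATranspose (the data here is arbitrary):

m = RandomReal[1, {4, 4}];
v = RandomReal[1, 4];
CUDADot[m, v]    (* matrix-vector product on the GPU *)
CUDATranspose[m] (* matrix transpose on the GPU *)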
Fourier Analysis
The Fourier analysis capabilities of the CUDALink package include forward and inverse Fourier transforms that operate on 1D, 2D, or 3D lists of real or complex numbers.
ArrayPlot[Log[Abs[CUDAFourier[
   Table[Mod[Binomial[i, j], 2], {i, 0, 63}, {j, 0, 63}]]]]]
PDE Solving
This computational fluid dynamics example is included with CUDALink. It solves the Navier-Stokes equations for a million particles using the finite element method:
Volumetric Rendering
CUDALink includes functions to read and display volumetric data in 3D, with interactive interfaces for setting the transfer functions and other volume-rendering parameters.
code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__kernel void onetouchH__global Real_t * call, __global
Real_t * put, __global Real_t * S, __global Real_t *
X, __global Real_t * T, __global Real_t * R, __global
Real_t * D, __global Real_t * V, mint lengthL 8
Real_t d1, d5, power;
int ii = get_global_idH0L;
if Hii < lengthL 8
d1 = HlogHS@iiDX@iiDL + HR@iiD - D@iiD + 0.5f
* V@iiD * V@iiDL * T@iiDL HV@iiD * sqrtHT@iiDLL;
d5 = HlogHS@iiDX@iiDL - HR@iiD - D@iiD + 0.5f
* V@iiD * V@iiDL * T@iiDL HV@iiD * sqrtHT@iiDLL;
pThetaower = powHX@iiDS@iiD, 2*R@iiDHV@iiD*V@iiDLL
call@iiD = S@iiD < X@iiD
? power * NHd5L + HS@iiDX@iiDL*NHd1L : 1.0;
put@iiD = S@iiD > X @iiD ? power *
NH-d5L + HS@iiDX@iiDL*NH-d1L : 1.0;
<
<";
This loads the OpenCL function into Mathematica:
OpenCLOneTouchOption = OpenCLFunctionLoad[code,
  "onetouch", {{_Real, _, "Output"}, {_Real, _, "Output"},
   {_Real, _, "Input"}, {_Real, _, "Input"},
   {_Real, _, "Input"}, {_Real, _, "Input"},
   {_Real, _, "Input"}, {_Real, _, "Input"}, _Integer}, 128];
This generates random input data:
numberOfOptions = 64;
S = RandomReal[{..., 40.0}, numberOfOptions];
X = RandomReal[{..., 40.0}, numberOfOptions];
T = RandomReal[{..., 10.0}, numberOfOptions];
R = RandomReal[{..., 0.1}, numberOfOptions];
Q = RandomReal[{..., 0.08}, numberOfOptions];
V = RandomReal[{..., 0.4}, numberOfOptions];
This allocates memory for both the call and put result:
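A minimal sketch, allocating the output buffers with OpenCLMemoryAllocate and then invoking the loaded function:

call = OpenCLMemoryAllocate[Real, numberOfOptions];
put = OpenCLMemoryAllocate[Real, numberOfOptions];
OpenCLOneTouchOption[call, put, S, X, T, R, Q, V, numberOfOptions];

The call result is retrieved with OpenCLMemoryGet: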
OpenCLMemoryGet[call]

{1., 0.398116, 1., 1., 1.00703, 0.909275, 1., 1., 1.,
 0.541701, 0.631649, 1., 0.702748, 1., 1., 1., 0.626888,
 1., 1., 0.827843, 0.452237, 0.998761, 0.813008, 1.,
 1., 0.96773, 0.795428, 1., 1.79325, 1., 1., 1., 1., 1.,
 0.547425, 0.968162, 1., 1., 0.907489, 1., 1.90031, 0.316174,
 1., 0.998824, 0.383825, 1., 0.804287, 0.977305, 1., 1.,
 0.855764, 1., 0.952568, 0.573249, 0.239455, 0.635454,
 0.917078, 0.624179, 1., 0.679681, 1., 1., 0.968929, 0.712148}
Summary
Due to Mathematica's integrated platform design, all functionality is included without the need to buy and maintain multiple tools and add-on packages.
With its simplified development cycle, multicore computing, and built-in functions, Mathematica's built-in CUDALink application provides a powerful high-level interface for GPU computing.