0% found this document useful (0 votes)

252 views25 pages

Cuda Programming Within Mathematica

Uploaded by

Ignorato

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

252 views25 pages

Cuda Programming Within Mathematica

Uploaded by

Ignorato

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 25

Wolfram White Paper

CUDA Programming
within Mathematica

Introduction
CUDA, short for Common Unified Device Architecture, is a C-like programming language developed by NVIDIA to facilitate general computation on the Graphical Processing Unit (GPU). CUDA
allows users to design programs around the many-core hardware architecture of the GPU. And,
by using many cores, carefully designed CUDA programs can achieve speedups (1000x in some
cases) over a similar CPU implementation. Coupled with the investment price and power required
to GFLOP (billion floating point operations per second), the GPU has quickly become an ideal
platform for both high performance clusters and scientists wishing for a super computer at their
disposal.
Yet while the user can achieve speedups, CUDA does have a steep learning curveincluding
learning the CUDA programming API, and understanding how to set up CUDA and compile
CUDA programs. This learning curve has, in many cases, alienated many potential CUDA programmers.
Mathematicas CUDALink simplified the use of the GPU within Mathematica by introducing
dozens of functions to tackle areas ranging from image processing to linear algebra. CUDALink
also allows the user to load their own CUDA functions into the Mathematica kernel.
By utilizing Mathematicas language, mirroring Mathematica function syntax, and integrating
with existing programs and development tools, CUDALink offers an easy way to use CUDA. In this
document we describe the benefits of CUDA integration in Mathematica and provide some
applications for which it is suitable.

Motivations for Mathematica CUDALink

CUDA is a C-like language designed to write general programs around the NVIDIA GPU hardware. By programming the GPU, users can get performance unrivaled by a CPU for a similar
investment. The following graph shows the performance of the GPU compared to the CPU:

Today, GPUs priced at just $500 can achieve performance of 2 TFLOP (trillion operations per
second). The GPU also competes with the CPU in terms of power consumption, using a fraction of
the power compared to the CPU for the same GFLOP performance.
Because GPUs are off-the-shelf hardware, can fit into a standard desktop, have low power
consumption, and perform exceptionally, they are very attractive to users. Yet a steep learning
curve has always been a hindrance for users wanting to use CUDA in their applications.
Mathematicas CUDALink alleviates much of the burden required to use CUDA from within
Mathematica. CUDALink allows users to query system hardware, use the GPU for dozens of
functions, and define new CUDA functions to be run on the GPU.

Wolfram White Paper: CUDA Programming within Mathematica | 1

A Brief Introduction to Mathematica

In its over 20 years of existence, Mathematica has always been at the forefront of technical
computing, providing a highly expressive language and inventing and implementing groundbreaking algorithms.
Mathematica is a development environment that combines a flexible programming language
with a wide range of symbolic and numeric computational capabilities, production of highquality visualizations, built-in application area packages, and a range of immediate deployment
options. Combined with integration of dynamic libraries, automatic interface construction, and C
code generation, Mathematica is the most sophisticated build-to-deploy environment on the
market today.
Mathematica 8 introduces the ability to program the GPU. For developers, this new integration
means native access to Mathematicas computing abilitiescreating hybrid algorithms that
combine the CPU and the GPU. Some of Mathematicas features include:

Full-featured, unified development environment

Through its unique interface and integrated features for computation, development, and deployment, Mathematica provides a streamlined workflow. Wolfram Research also offers Wolfram
Workbench, a state-of-the-art integrated development engine based on the Eclipse platform.

Unified data representation

At the core of Mathematica is the foundational idea that everythingdata, programs, formulas,
graphics, documentscan be represented as symbolic entities, called expressions. This unified
representation makes Mathematicas language and functions extremely flexible, streamlined, and
consistent.

Multiparadigm programming language

Mathematica provides its own highly declarative functional language, as well as several different
programming paradigms, such as procedural and rule-based programming. Programmers can
choose their own style for writing code with minimal effort. Along with comprehensive documentation and resources, Mathematicas flexibility greatly reduces the cost of entry for new users.

Symbolic-numeric hybrid system

The principle behind Mathematica is full integration of symbolic and numeric computing capabilities. Through its full automation and preprocessing mechanisms, users reap the power of a hybrid
computing system without needing knowledge of specific methodologies and algorithms.

Scientific and technical area coverage

Mathematica provides thousands of built-in functions and packages that cover a broad range of
scientific and technical computing areas, such as statistics, control systems, data visualization, and
image processing. All functions are carefully designed and tightly integrated with the core system.

2 | Wolfram White Paper: CUDA Programming within Mathematica

High-performance computing
Mathematica has built-in support for multicore systems, utilizing all cores on the system for
optimal performance. Many functions automatically utilize the power of multicore processors,
and built-in parallel constructs make high-performance programming easy.

Data access and connectivity

Mathematica natively supports hundreds of formats for importing and exporting, as well as realtime access to data from Wolfram|Alpha, Wolfram Researchs computational knowledge
engine. It also provides APIs for accessing many programming languages and databases, such
as C/C++, Java, .New, MySQL, and Oracle.

Platform-independent deployment options

Through its interactive notebooks, Mathematica Player, and browser plugins, Mathematica
provides a wide range of options for deployment. Built-in code generation functionality can be
used to create standalone programs for independent distribution.

Scalability
Wolfram Researchs gridMathematica allows Mathematica programs to be parallelized on
many machines in cluster or grid configuration.

CUDA Integration in Mathematica

CUDALink offers a high-level interface to the GPU built on top of Mathematicas development
technologies. It allows users to execute code on their GPU with minimal effort. By fully integrating and automating the GPUs capabilities using Mathematica, users experience a more productive and efficient development cycle.

Automation of development project management

Unlike other development frameworks that require the user to manage project setup, platform
dependencies, and device configuration, CUDALink makes the process transparent and
automated.

Automated GPU memory and thread management

A CUDA program written from scratch delegates memory and thread management to the
programmer. This bookkeeping is required in lieu of the need to write the CUDA program:

Wolfram White Paper: CUDA Programming within Mathematica | 3

With Mathematica, memory and thread management is automatically handled for the user.

The Mathematica memory manager handles memory transfers intelligently in the background.
Memory, for example, is not copied to the GPU until computation is needed and is flushed out
when the GPU memory gets full.
Mathematicas CUDA support streamlines the whole programming process. This allows GPU
programmers to follow a more interactive style of programming:

Integration with Mathematicas built-in capabilities

CUDA integration provides full access to Mathematicas native language and built-in functions.
With Mathematicas comprehensive symbolic and numerical functions, built-in application area
support, and graphical interface building functions, users can write hybrid algorithms that use
the CPU and GPU depending on the efficiency of each algorithm.

Ready-to-use applications
CUDA integration in Mathematica provides several ready-to-use CUDA functions that cover a
broad range of topics such as mathematics, image processing, financial engineering, and more.
Examples will be given in the section Mathematicas CUDALink Applications.

Zero device configuration

Mathematica automatically finds, configures, and makes CUDA devices available to the users.

Multiple GPU Support

Through Mathematicas built-in parallel programming support, users can launch CUDA programs
on different GPUs. Users can also scale the setup across machines and networks using
gridMathematica.

Technologies Underlying CUDALink

GPU integration in Mathematica has only been possible due to advancement in system integration introduced in Mathematica 8. Features such as C code generation, SymbolicC manipulation,
dynamic library loading, and C compiler invocation are all used internally by CUDALink to enable
fast and easy access to the GPU.
4 | Wolfram White Paper: CUDA Programming within Mathematica

C Code Generation
Mathematica 8 introduces the ability to export expressions written using Compile to a C file. The
C file can then be compiled and run either as a Mathematica command (for native speed), or be
integrated with an external application. Through the code generation mechanism, you can use
Mathematica for both prototype and native speed deployment.
To motivate the C code generation feature, we will solve the call option using the BlackScholes
equation. The call option in terms of the BlackScholes equation is defined by:

C HS, tL = N Hd1 L S - N Hd2 L X -r HT-tL

with

HT - tL Jr +
d1 =

s2
2

N + LogA X E

T-t s
d2 = d1 - s

T-t

N(p) is the cumulative distribution function of the normal distribution.

Here, we define the equation for t = 0:

T Jr +
d1 =

s2
2

N + LogA X E

T s
d2 = d1 - s T ;
BlackScholes = CDF@NormalDistribution@0, 1D, d1 D S CDF@NormalDistribution@0, 1D, d2 D X -r T
1
2

T Jr +
S ErfcB-

s2
2

S
N + LogA X E

T s

T Kr+

F-

1
2

T s-

s2
2

O+LogB F
X

T s

-r T X ErfcB

The following command generates the C code, compiles it, and links it back into Mathematica to
provide native speed:

cf = Compile@88S, _Real<, 8X, _Real<,

8s, _Real<, 8T, _Real<, 8r, _Real<<, BlackScholes,
CompilationOptions 8"InlineExternalDefinitions" True<,
CompilationTarget "C"D;
The function can be used as any other Mathematica function. Here we call the above function:

[email protected], 1.0, 1.0, 1.0, 1.0D

0.678818

LibraryLink
LibraryLink allows you to load C functions as Mathematica functions. It is similar in purpose to
MathLink, but, by running in the same process as the Mathematica kernel, it avoids the memory
transfer cost associated with MathLink. This loads a C function from a library; the function adds
one to a given integer:

addOne =
LibraryFunctionLoad@"demo", "demo_I_I", 8Integer<, IntegerD
LibraryFunction@<>, demo_I_I, 8Integer<, IntegerD

Wolfram White Paper: CUDA Programming within Mathematica | 5

The library function is run with the same syntax as any other function:

addOne@3D
4
CUDALink and OpenCLLink are examples of LibraryLinks usage.

Symbolic C Code
Using Mathematicas symbolic capabilities, users can generate C programs within Mathematica.
The following, for example, creates macros for common math constants:

<< SymbolicC`
These are all constants in the Mathematica system context. We use Mathematicas CDefine to
declare a C macro:

s = Map@CDefine@ToString@D, N@DD &,

Map@ToExpression, Select@Names@"System`*"D,
MemberQ@Attributes@D, ConstantD &DDD
8CDefine@Catalan, 0.915966D,
CDefine@Degree, 0.0174533D, CDefine@E, 2.71828D,
CDefine@EulerGamma, 0.577216D, CDefine@Glaisher, 1.28243D,
CDefine@GoldenRatio, 1.61803D, CDefine@Khinchin, 2.68545D,
CDefine@MachinePrecision, 15.9546D, CDefine@Pi, 3.14159D<
The symbolic expression can be converted to C using the ToCCodeString function:

ToCCodeString@sD
define
define
define
define
define
define
define
define
define

Catalan 0.915965594177219
Degree 0.017453292519943295
E 2.718281828459045
EulerGamma 0.5772156649015329
Glaisher 1.2824271291006226
GoldenRatio 1.618033988749895
Khinchin 2.6854520010653062
MachinePrecision 15.954589770191003
Pi 3.141592653589793

By representing the C program symbolically, you can manipulate it using standard Mathematica
techniques. Here, we convert all the macro names to lowercase:

ReplaceAll@s,
CDefine@name_, val_D CDefine@ToLowerCase@nameD, valDD
8CDefine@catalan, 0.915966D,
CDefine@degree, 0.0174533D, CDefine@e, 2.71828D,
CDefine@eulergamma, 0.577216D, CDefine@glaisher, 1.28243D,
CDefine@goldenratio, 1.61803D, CDefine@khinchin, 2.68545D,
CDefine@machineprecision, 15.9546D, CDefine@pi, 3.14159D<

6 | Wolfram White Paper: CUDA Programming within Mathematica

Again, the code can be converted to C code using ToCCodeString:

ToCCodeString@%D
define
define
define
define
define
define
define
define
define

catalan 0.915965594177219
degree 0.017453292519943295
e 2.718281828459045
eulergamma 0.5772156649015329
glaisher 1.2824271291006226
goldenratio 1.618033988749895
khinchin 2.6854520010653062
machineprecision 15.954589770191003
pi 3.141592653589793

C Compiler Invoking
Another Mathematica 8 innovation is the ability to call external C compilers from within Mathematica. The following compiles a simple C program into an executable:

<< CCompilerDriver`
exe = CreateExecutable@"
include \"stdio.h\"
int mainHvoidL 8
printfH\"Hello from CCompilerDriver.\"L;
return 0;
<", "foo"D
homeabduld.MathematicaSystemFilesLibraryResourcesLinuxfoo
Using the above syntax, you can create executables using any Mathematica-supported C compiler
(Visual Studio, GCC, Intel CC, etc.) in a compiler independent fashion. The above command can be
executed within Mathematica:

Import@"!" <> exe, "Text"D

Hello from CCompilerDriver.
By using the Mathematica enhancements mentioned earlier in this section, CUDALink and
OpenCLLink facilitate fast and simple access to the GPU.

Mathematicas CUDALink: Integrated

GPU Programming
CUDALink is a built-in Mathematica application that provides a powerful interface for using
CUDA within Mathematica. Through CUDALink, users get carefully tuned linear algebra, Fourier
transform, financial derivative, and image processing algorithms. Users can also write their own
CUDALink modules with little effort.

Wolfram White Paper: CUDA Programming within Mathematica | 7

Accessing System Information

CUDALink supplies functions that query the systems GPU hardware. To use CUDALink operations,
users have to first load the CUDALink application:

Needs@"CUDALink`"D
CUDAQ tells whether the current hardware and system configuration support CUDALink:

CUDAQ@D
True
SystemInformation gives information on the available GPU hardware:

SystemInformation@D

Example of a report generated by SystemInformation.

Integration with Mathematica Functions

CUDALink integrates with existing Mathematica functions such as its import/export facilities,
functional language, and interface building. This allows you to build deployable programs in
Mathematica with minimal disruption to the GPU task. This section showcases how you can build
interfaces as well as use the import/export capabilities in Mathematica.

Manipulate: Mathematicas automatic interface generator

Mathematica provides extensive built-in interface building functions. Users can customize controls using Mathematicas highly declarative interface language.
One fully automated interface generating function is Manipulate, which builds the interface by
inspecting the possible values of variables. It then chooses the appropriate GUI widget based on
the interpretation of the variable values.

8 | Wolfram White Paper: CUDA Programming within Mathematica

Here, we build an interface that performs a morphological operation on an image

with varying radii:

ManipulateBoperationB

, xF,

8x, 0, 9<, 8operation, 8CUDAErosion, CUDADilation<<F

Using the same technique you can build more complicated interfaces. This allows users to choose
different Gaussian kernel sizes (and their angle) and performs a convolution on the image on the
right:

Example of a user interface built with Manipulate.

Support for import and export

Mathematica natively supports hundreds of file formats and their subformats for importing and
exporting. Supported formats include: common image formats (JPEG, PNG, TIFF, BMP, etc.), video
formats (AVI, MOV, H264, etc.), audio formats (WAV, AU, AIFF, FLAC, etc.), medical imaging
formats (DICOM), data formats (Excel, CSV, MAT, etc.), and various raw formats for further
processing.
Any supported data formats will be automatically converted to Mathematicas unified data
representation, or an expression, which can be used in all Mathematica functions, including
CUDALink functions.

Wolfram White Paper: CUDA Programming within Mathematica | 9

Users can also get data from the web or Wolfram curated datasets. The following code imports
an image from a given URL:

image = Import@
"http:gallery.wolfram.com2dpopup00_contourMosaic.pop.jpg
"D;
The function Import automatically recognizes the file format and converts it into a Mathematica
expression. This can be directly used by CUDALink functions, such as CUDAImageAdd:

output = CUDAImageAddBimage,

All outputs from Mathematica functions, including the ones from CUDALink functions, are also
expressions, and can be easily exported to one of the supported formats using the Export
function. For example, the following code exports the above output into PNG format:

Export@"masked.png", outputD
masked.png

CUDALink Programming
Programming the GPU in Mathematica is straightforward. It begins with writing a CUDA kernel.
Here, we will create a simple example that negates colors of a 3-channel image:

kernel = "
__global__ void cudaColorNegateHint
*img, int *dim, int channelsL 8
int width = dim@0D, height = dim@1D;
int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
int index = channels * HxIndex + yIndex*widthL;
if HxIndex < width && yIndex < heightL 8
for Hint c = 0; c < channels; c++L
img@index + cD = 255 - img@index + cD;<<";
Pass that string to the built-in function CUDAFunctionLoad, along with the kernel function
name and the argument specification. The last argument denotes the CUDA block size:

colorNegate = CUDAFunctionLoad@kernel, "cudaColorNegate",

88_Integer, "InputOutput"<,
8_Integer, "Input"<, _Integer<, 816, 16<D;

10 | Wolfram White Paper: CUDA Programming within Mathematica

Several things are happening at this stage. Mathematica automatically compiles the kernel
function and loads it as a Mathematica function. Now you can apply this new CUDA function to
an image:

img =

colorNegate@img, ImageDimensions@imgD, ImageChannels@imgDD

System requirements
To utilize Mathematicas CUDALink, the following is required :

Operating System: Windows, Linux, and Mac OS X 10.6.3+, both 32- and 64-bit
architecture.
NVIDIA CUDA enabled products.
For CUDA programming, a CUDALink supported C compiler is required.

Mathematicas CUDALink Applications

In addition to support for user-defined CUDA functions and automatic compilation, CUDALink
includes several ready-to-use functions ranging from image processing to financial option
valuation.

Financial Engineering
CUDALinks options pricing function uses the binomial or Monte Carlo method, depending on
the type of option selected. Computing options on the GPU can be dozens of times faster than
using the CPU.

Wolfram White Paper: CUDA Programming within Mathematica | 11

This generates some random input data:

numberOfOptions = 32;
spotPrices = [email protected], 35.0<, numberOfOptionsD;
strikePrices = [email protected], 40.0<, numberOfOptionsD;
expiration = [email protected], 10.0<, numberOfOptionsD;
interest = 0.08;
volatility = [email protected], 0.50<, numberOfOptionsD;
dividend = [email protected], 0.06<, numberOfOptionsD;
This computes the Asian arithmetic call option with the above data:

CUDAFinancialDerivative@8"AsianArithmetic", "Call"<,
8 "StrikePrice" strikePrices, "Expiration" expiration<,
8 "CurrentPrice" spotPrices, "InterestRate" interest,
"Volatility" volatility, "Dividend" dividend<D
88.34744, 1.18026, 9.53711, 5.39746, 2.2478, 4.94333, 0.859259,
6.08291, 2.4044, 2.41929, 6.53313, 7.48516, 2.71696, 1.08229,
7.50222, 0.790236, 0.816325, 1.28744, 0.953413, 0.131352,
7.60693, 1.15648, 7.07213, 8.2441, 4.45964, 7.94849,
2.22669, 1.17793, 10.1456, 0.263328, 4.12236, 4.99476<
Computing the price of Asian arithmetic call options corresponding to random data.

BlackScholes
For options with no built-in implementation in CUDALink, users can load their own. Here, we will
show how to load the BlackScholes model for calculating the vanilla American option and in the
next section we will show how to load code to compute the binary call and put of an asset-ornothing option.
Recall from above that the call option in the BlackScholes model is defined by:

C HS, tL = N Hd1 L S - N Hd2 L X -r HT-tL

with

HT - tL Jr +
d1 =

s2
2

N + LogA X E

T-t s
d2 = d1 - s

T-t

N(p) is the cumulative distribution function of the normal distribution.

12 | Wolfram White Paper: CUDA Programming within Mathematica

The following CUDA code computes the call option when t = 0:

code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__global__ void blackScholesHReal_t
* call, Real_t * S, Real_t * X, Real_t *
sigma, Real_t * T, Real_t * r, mint lengthL 8
int ii = threadIdx.x + blockIdx.x*blockDim.x;
if Hii < lengthL 8
Real_t d1 =
HlogHS@iiDX@iiDL+Hr@iiD+HpowHsigma@iiD,2.0L2L*T@iiDLLH
sigma@iiD*sqrtHT@iiDLL;
Real_t d2 = d1 - sigma@iiD*sqrtHT@iiDL;
call@iiD =
S@iiD*NHd1L - X@iiD*expH-r@iiD*T@iiDL*NHd2L;
<
<";
This loads the above code into Mathematica:

CUDABlackScholes = CUDAFunctionLoad@code, "blackScholes",

88_Real<, 8_Real, "Input"<, 8_Real, "Input"<, 8_Real, "Input"<,
8_Real, "Input"<, 8_Real, "Input"<, _Integer<, 128D;
Here we generate some random input data for the model. We are only computing 32 options:

numberOfOptions = 32;
S = [email protected], 40.0<, numberOfOptionsD;
X = [email protected], 40.0<, numberOfOptionsD;
T = [email protected], 10.0<, numberOfOptionsD;
R = [email protected], 0.1<, numberOfOptionsD;
Q = [email protected], 0.08<, numberOfOptionsD;
V = [email protected], 0.4<, numberOfOptionsD;
This allocates memory for the call result:

call = CUDAMemoryAllocate@Real, numberOfOptionsD;

This calls the function:

CUDABlackScholes@call, S, X, T, R, Q, V, numberOfOptionsD
8CUDAMemory@<17988>, DoubleD, CUDAMemory@<29045>, DoubleD<
This retrieves the CUDA memory back into Mathematica:

CUDAMemoryGet@callD
8- 54.0707, - 7.72175, - 71.9744, - 52.5466, - 37.523,
- 54.7824, - 66.2942, - 38.6423, - 44.1287, - 57.2223,
- 95.4658, - 89.4017, - 7.29522, - 100.874, - 34.8353,
- 7.77607, - 65.9444, - 88.8973, - 51.1116, - 72.3198,
- 16.5077, - 83.9489, - 73.5695, - 66.6177, - 44.9486, - 5.54228,
- 36.1501, - 92.0541, - 53.079, - 97.6815, - 34.6764, - 38.1144<

Wolfram White Paper: CUDA Programming within Mathematica | 13

Binary Option
Using the same BlackScholes model we can calculate both the asset-or-nothing call and put for
the binary/digital option model:

code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__global__ void binaryAssetOptionHReal_t * call,
Real_t * put, Real_t * S, Real_t * X, Real_t * T,
Real_t * R, Real_t * D, Real_t * V, mint lengthL 8
int ii = threadIdx.x + blockIdx.x*blockDim.x;
if Hii < lengthL 8
Real_t d1 = HlogHS@iiDX@iiDL + HR@iiD - D@iiD +
0.5f * V@iiD * V@iiDL * T@iiDLHV@iiD * sqrtHT@iiDLL;
call@iiD = S@iiD * expH-D@iiD * T@iiDL * NHd1L;
put@iiD = S@iiD * expH-D@iiD * T@iiDL * NH-d1L;
<
<";
This loads the function into Mathematica:

CUDABinaryAssetOption =
CUDAFunctionLoad@code, "binaryAssetOption",
88_Real, "Output"<, 8_Real, "Output"<, 8_Real, "Input"<,
8_Real, "Input"<, 8_Real, "Input"<, 8_Real, "Input"<,
8_Real, "Input"<, 8_Real, "Input"<, _Integer<, 128D;
This creates some random data for the strike price, expiration, etc.:

numberOfOptions = 64;
S = [email protected], 40.0<, numberOfOptionsD;
X = [email protected], 40.0<, numberOfOptionsD;
T = [email protected], 10.0<, numberOfOptionsD;
R = [email protected], 0.1<, numberOfOptionsD;
Q = [email protected], 0.08<, numberOfOptionsD;
V = [email protected], 0.4<, numberOfOptionsD;
The call and put memory is allocated:

call = CUDAMemoryAllocate@Real, numberOfOptionsD;

put = CUDAMemoryAllocate@Real, numberOfOptionsD;
This calls the function, returning the call and put memory handles:

CUDABinaryAssetOption@call, put, S, X, T, R, Q, V, numberOfOptionsD

8CUDAMemory@<17988>, DoubleD, CUDAMemory@<29045>, DoubleD<

14 | Wolfram White Paper: CUDA Programming within Mathematica

Both the call and put memory can be retrieved using CUDAMemoryGet:

CUDAMemoryGet@callD
811.4286, 17.9626, 9.18249, 6.12558, 16.0267, 31.328, 9.00912,
18.1401, 15.0725, 22.843, 16.9324, 2.47273, 13.3358,
12.6676, 27.3212, 21.1174, 3.89489, 21.0599, 24.4702,
26.2096, 16.4893, 23.412, 17.8685, 5.92021, 16.5681,
29.1562, 10.5462, 21.7121, 12.9029, 28.6162, 13.7667,
29.118, 13.1107, 13.5246, 30.2435, 17.8356, 19.787, 20.3238,
15.7308, 21.2625, 7.60982, 14.5222, 17.4748, 20.4389, 2.9495,
6.98209, 18.9727, 33.6374, 14.5634, 14.1232, 12.1017,
17.8766, 22.6046, 18.3085, 17.1108, 18.7836, 24.816, 13.8477,
22.9212, 12.7896, 26.5338, 9.44261, 13.4497, 0.0381422<
CUDAMemoryGet@putD
85.44932, 5.30771, 4.84476, 11.6373, 10.5256, 4.30614,
15.0703, 9.17361, 7.14722, 5.68002, 7.53906, 21.5292, 12.769,
7.99204, 1.85855, 3.09161, 13.1441, 5.38475, 6.38312, 7.681,
8.36659, 2.28956, 7.79013, 8.42515, 6.09523, 3.58002,
6.99927, 9.59092, 5.91341, 2.7295, 7.77723, 3.32061, 6.80901,
6.62897, 4.22096, 8.75582, 4.88707, 10.5992, 5.17202,
4.62305, 5.16322, 13.3896, 6.92575, 7.32805, 16.0042,
13.4299, 5.21345, 2.79763, 11.8877, 6.21821, 13.8582,
5.52745, 6.02657, 6.10139, 3.20472, 3.7745, 1.86951, 6.29216,
2.6889, 7.95113, 7.44534, 15.6227, 7.80008, 20.6595<

Wolfram White Paper: CUDA Programming within Mathematica | 15

Random Number Generators

One of the difficult problems when parallelizing algorithms is generating good random numbers.
CUDALink offers many examples on how to generate both pseudo- and quasi-random numbers.
Here, we generate quasi-random numbers using the Halton sequence:

src = "
__device__ mint primes@D = 8
2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97,101,103,107,109,113,
127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,
<;
__global__ void HaltonHReal_t
* out, unsigned int dim, unsigned int nL 8
const mint tx = threadIdx.x, bx =
blockIdx.x, dx = blockDim.x;
const mint index = tx + bx*dx;
if Hindex >= nL
return ;
mint ii;
double digit, rnd, idx, half;
for Hii = 0,
idx=index, rnd=0, digit=0; ii < dim; ii++L 8
half = 1.0primes@iiD;
while Hidx > DBL_EPSILONL 8
digit = HHmintLidxL%primes@iiD;
rnd += half*digit;
idx = Hidx - digitLprimes@iiD;
half = primes@iiD;
<
out@index*dim + iiD = rnd;
<
<
";
This loads the CUDA source into Mathematica:

CUDAHaltonSequence = CUDAFunctionLoad@src,
"Halton", 88_Real, "Output"<, _Integer, _Integer<, 256D
CUDAFunction@<>, Halton, 88_Real, _, Output<, _Integer, _Integer<D
This allocates 1024 real elements. Real elements are interpreted to be the highest floating
precision on the machine:

mem = CUDAMemoryAllocate@Real, 81024<D

CUDAMemory@<11521>, DoubleD
This calls the function:

CUDAHaltonSequence@mem, 1, 1024D
8CUDAMemory@<11521>, DoubleD<

16 | Wolfram White Paper: CUDA Programming within Mathematica

You can use Mathematicas extensive visualization support to visualize the result. Here we plot
the data:

ListPlot@CUDAMemoryGet@memDD
1.0
0.8
0.6
0.4
0.2

200

400

600

800

1000

Some random number generators and distributions are not naturally parallelizable. In those
cases, users can adopt a hybrid GPU programming approachutilizing the CPU for some tasks
and the GPU for others. Using this approach, users can use Mathematicas extensive statistics
capabilities to generate or derive distributions from their data.
Here, we simulate a random walk by generating numbers on the CPU, performing a reduction
(using CUDAFoldList) on the GPU, and plotting the result using Mathematica:

ListLinePlot@
Thread@List@CUDAFoldList@Plus, 0, RandomReal@8- 1, 1<, 500DD,
CUDAFoldList@Plus, 0, RandomReal@8- 1, 1<, 500DDDDD
2

-8

-6

-4

-2

-2
-4
-6
-8

Random walk simulation.

Image Processing
CUDALinks image processing capabilities can be classified into three categories. The first is
convolution, which is optimized for CUDA. The second is morphology, which contains abilities
such as erosion, dilation, opening, and closing. Finally, there are the binary operators. These are
the image multiplication, division, subtraction, and addition. All operations work on either
images or lists.

Wolfram White Paper: CUDA Programming within Mathematica | 17

Image convolution
CUDALinks convolution is similar to Mathematicas ListConvolve and ImageConvolve functions. It will operate on images, lists, or CUDA memory references, and it can use Mathematicas
built-in filters as the kernel.

CUDAImageConvolveB

-1 0 1
-2 0 2 F
-1 0 1

Convolving a microscopic image with a Sobel mask to detect edges.

Pixel operations
CUDALink supports simple pixel operations on one or two images, such as adding or multiplying
pixel values from two images.

CUDAImageMultiplyB

Multiplication of two images.

18 | Wolfram White Paper: CUDA Programming within Mathematica

Morphological operations
CUDALink supports fundamental operations such as erosion, dilation, opening, and closing.
CUDAErosion , CUDADilation, CUDAOpening , and CUDAClosing are equivalent to Mathematicas built-in Erosion, Dilation, Opening, and Closing functions. More sophisticated operations can be built using these fundamental operations.

CUDATopHatTransform@image_, r_D :=
Image@CUDAImageSubtract@image, CUDAOpening@image, rDDD;
CUDATopHatTransformB

, 2F

Building non-elementary image processing operations using primitive ones.

Linear Algebra
You can perform various linear algebra functions with CUDALink such as matrix-matrix and
matrix-vector multiplication, finding minimum and maximum elements, and transposing matrices.

Nest@CUDADot@RandomReal@1, 8100, 100<D, D &,

RandomReal@1, 8100<D, 1000D;
Perform Arnolidi-like linear algebra operation.

Fourier Analysis
The Fourier analysis capabilities of the CUDALink package include forward and inverse Fourier
transforms that can operate on a list of 1D, 2D, or 3D real or complex numbers.

ArrayPlot@Log@Abs@CUDAFourier@
Table@Mod@Binomial@i, jD, 2D, 8i, 0, 63<, 8j, 0, 63<DDDDD

Find the logarithmic power spectrum of a dataset.

Wolfram White Paper: CUDA Programming within Mathematica | 19

PDE Solving
This computational fluid dynamics example is included with CUDALink. This solves the Navier
Stokes equations for a million particles using the finite element method:

Fluid simulation with multi-particles.

Volumetric Rendering
CUDALink includes functions to read and display volumetric data in 3D, with interactive interfaces for setting the transfer functions and other volume-rendering parameters.

Volumetric rendering of a medical image.

20 | Wolfram White Paper: CUDA Programming within Mathematica

OpenCL Integration in Mathematica

Mathematica also includes the ability to use the GPU using OpenC via OpenCLLink. This a vendorneutral way to use the GPU and works both on NVIDIA and non-NVIDIA hardware. OpenCLLink
and CUDALink offer the same syntax, and the following demonstrates how to compute the onetouch option:

code = "
define NHxL
HerfHHxLsqrtH2.0LL2+0.5L
__kernel void onetouchH__global Real_t * call, __global
Real_t * put, __global Real_t * S, __global Real_t *
X, __global Real_t * T, __global Real_t * R, __global
Real_t * D, __global Real_t * V, mint lengthL 8
Real_t d1, d5, power;
int ii = get_global_idH0L;
if Hii < lengthL 8
d1 = HlogHS@iiDX@iiDL + HR@iiD - D@iiD + 0.5f
* V@iiD * V@iiDL * T@iiDL HV@iiD * sqrtHT@iiDLL;
d5 = HlogHS@iiDX@iiDL - HR@iiD - D@iiD + 0.5f
* V@iiD * V@iiDL * T@iiDL HV@iiD * sqrtHT@iiDLL;
pThetaower = powHX@iiDS@iiD, 2*R@iiDHV@iiD*V@iiDLL
call@iiD = S@iiD < X@iiD
? power * NHd5L + HS@iiDX@iiDL*NHd1L : 1.0;
put@iiD = S@iiD > X @iiD ? power *
NH-d5L + HS@iiDX@iiDL*NH-d1L : 1.0;
<
<";
This loads OpenCL function into Mathematica:

OpenCLOneTouchOption = OpenCLFunctionLoad@code,
"onetouch", 88_Real, _, "Output"<, 8_Real, _, "Output"<,
8_Real, _, "Input"<, 8_Real, _, "Input"<,
8_Real, _, "Input"<, 8_Real, _, "Input"<,
8_Real, _, "Input"<, 8_Real, _, "Input"<, _Integer<, 128D;
This generates random input data:

Wolfram White Paper: CUDA Programming within Mathematica | 21

This allocates memory for both the call and put result:

call = OpenCLMemoryAllocate@Real, numberOfOptionsD;

put = OpenCLMemoryAllocate@Real, numberOfOptionsD;
This calls the function:

OpenCLOneTouchOption@call, put, S, X, T, R, Q, V, numberOfOptionsD

8CUDAMemory@<6029>, DoubleD, CUDAMemory@<15684>, DoubleD<
This retrieves the result for the call option (the put option can be retrieved similarly):

OpenCLMemoryGet@callD
81., 0.398116, 1., 1., 1.00703, 0.909275, 1., 1., 1.,
0.541701, 0.631649, 1., 0.702748, 1., 1., 1., 0.626888,
1., 1., 0.827843, 0.452237, 0.998761, 0.813008, 1.,
1., 0.96773, 0.795428, 1., 1.79325, 1., 1., 1., 1., 1.,
0.547425, 0.968162, 1., 1., 0.907489, 1., 1.90031, 0.316174,
1., 0.998824, 0.383825, 1., 0.804287, 0.977305, 1., 1.,
0.855764, 1., 0.952568, 0.573249, 0.239455, 0.635454,
0.917078, 0.624179, 1., 0.679681, 1., 1., 0.968929, 0.712148<

Summary
Due to Mathematicas integrated platform design, all functionality is included without the need
to buy and maintain multiple tools and add-on packages.
With its simplified development cycle, multicore computing, and built-in functions, Mathematicas built-in CUDALink application provides a powerful high-level interface for GPU computing.

22 | Wolfram White Paper: CUDA Programming within Mathematica

Pricing and Licensing Information

Wolfram Research offers many flexible licensing options for both organizations and individuals.
You can choose a convenient, cost-effective plan for your workgroup, department, directorate,
university, or just yourself, including network licensing for groups.
Visit us online for more information:
www.wolfram.com/mathematica-purchase

Recommended Next Steps

Try Mathematica for free:

http://www.wolfram.com/mathematica/trial
Schedule a technical demo:
www.wolfram.com/mathematica-demo
Learn more about CUDA programming in Mathematica:
U.S. and Canada:
Europe:
1-800-WOLFRAM (965-3726)
+44-(0)1993-883400
[email protected]
[email protected]
Outside U.S. and Canada (except Europe and Asia):
+1-217-398-0700
[email protected]

Asia:
+81-(0)3-3518-2880
[email protected]

2011 Wolfram Research, Inc. Mathematica is a registered trademark and Mathematica Player, Wolfram Workbench,
and gridMathematica are trademarks of Wolfram Research, Inc. Wolfram|Alpha is a registered trademark and
computational knowledge engine is a trademark of Wolfram Alpha LLCA Wolfram Research Company. All other
trademarks are the property of their respective owners. Mathematica is not associated with Mathematica Policy
Research, Inc. or Mathtech, Inc. MKT3014 2173925 0111.hk

Wolfram White Paper: CUDA Programming within Mathematica | 23

Axh1140 - Uk - Ed02 - Agri Plus
100% (1)
Axh1140 - Uk - Ed02 - Agri Plus
312 pages
An Introduction To Programming With Mathematica PDF
100% (12)
An Introduction To Programming With Mathematica PDF
556 pages
한도샤프트2
100% (1)
한도샤프트2
56 pages
Cuda Lab Manual
100% (1)
Cuda Lab Manual
22 pages
Cuda PDF
No ratings yet
Cuda PDF
18 pages
An Introduction To Programming With Ma Thematic A
No ratings yet
An Introduction To Programming With Ma Thematic A
556 pages
Precommissioning - Loto
0% (1)
Precommissioning - Loto
29 pages
Cuda
No ratings yet
Cuda
15 pages
Pipe Supports & Dynamic Restraints - An Overview
No ratings yet
Pipe Supports & Dynamic Restraints - An Overview
27 pages
Borehole Imaging Fracture Analysis Client PDF
100% (1)
Borehole Imaging Fracture Analysis Client PDF
57 pages
Green Roof HYDROPACK
No ratings yet
Green Roof HYDROPACK
4 pages
2002-10 ISA S95 Part 3 Overview
No ratings yet
2002-10 ISA S95 Part 3 Overview
12 pages
Parallel Processing With Cuda
No ratings yet
Parallel Processing With Cuda
25 pages
Introduction - CUDA C Programming Guide
No ratings yet
Introduction - CUDA C Programming Guide
573 pages
Blooms Taxonomy and MCQs
100% (1)
Blooms Taxonomy and MCQs
1 page
An Integrated Machine Learning and Finite Element Analysis Framework, Applied To Composite Substructures Including Damage
No ratings yet
An Integrated Machine Learning and Finite Element Analysis Framework, Applied To Composite Substructures Including Damage
120 pages
Computer Basics For AP Grama Sachivalayam Exams
0% (1)
Computer Basics For AP Grama Sachivalayam Exams
7 pages
Gyro Instructions
No ratings yet
Gyro Instructions
1 page
GPU Computing Revolution CUDA
100% (1)
GPU Computing Revolution CUDA
5 pages
Dynmod 1 4
No ratings yet
Dynmod 1 4
238 pages
Vibration Report
No ratings yet
Vibration Report
5 pages
Tutorial hpcs2011 Fixed
No ratings yet
Tutorial hpcs2011 Fixed
89 pages
Part1 22
No ratings yet
Part1 22
77 pages
ACA Unit3 Revised
No ratings yet
ACA Unit3 Revised
53 pages
Cuda
No ratings yet
Cuda
69 pages
Goulds Pumps 3355 Series
100% (1)
Goulds Pumps 3355 Series
7 pages
Yatwin IP Camera Manual
No ratings yet
Yatwin IP Camera Manual
47 pages
CUDA Lab Instruction
No ratings yet
CUDA Lab Instruction
40 pages
Student Handbook
No ratings yet
Student Handbook
41 pages
Chapter7 GPU
No ratings yet
Chapter7 GPU
45 pages
Intro GPUs
No ratings yet
Intro GPUs
36 pages
CUDA Programming: Johan Seland Johan - Seland@sintef - No
No ratings yet
CUDA Programming: Johan Seland Johan - Seland@sintef - No
76 pages
MCQ Data Entry
100% (2)
MCQ Data Entry
3 pages
Unit 4
No ratings yet
Unit 4
48 pages
Unit 6 Chapter 1 Parallel Programming Tools Cuda - Programming
No ratings yet
Unit 6 Chapter 1 Parallel Programming Tools Cuda - Programming
28 pages
CUDA Programming: Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen
No ratings yet
CUDA Programming: Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen
28 pages
0803
No ratings yet
0803
29 pages
A Beginner'S Guide To Programming Gpus With Cuda: Mike Peardon
No ratings yet
A Beginner'S Guide To Programming Gpus With Cuda: Mike Peardon
21 pages
Barnett Haskins
No ratings yet
Barnett Haskins
29 pages
Cuda
No ratings yet
Cuda
25 pages
Course 7
No ratings yet
Course 7
21 pages
7563.ARM White Paper - DSP Capabilities of Cortex-M4 and Cortex-M7
No ratings yet
7563.ARM White Paper - DSP Capabilities of Cortex-M4 and Cortex-M7
19 pages
HPC 5th Unit - 240504 - 160548
No ratings yet
HPC 5th Unit - 240504 - 160548
18 pages
Puting Experiences
No ratings yet
Puting Experiences
15 pages
Cuuda Nvidai Guide - Part1
No ratings yet
Cuuda Nvidai Guide - Part1
15 pages
CUDA Programming On Nvidia Gpus: Mike Giles
No ratings yet
CUDA Programming On Nvidia Gpus: Mike Giles
21 pages
CUDA
No ratings yet
CUDA
20 pages
TFM - Unfinished
No ratings yet
TFM - Unfinished
17 pages
MCUDA: An Efficient Implementation of CUDA Kernels On Multi-Cores
No ratings yet
MCUDA: An Efficient Implementation of CUDA Kernels On Multi-Cores
19 pages
Matlab With Cuda: by Vishwanath Sarkar NIT Bhopal
No ratings yet
Matlab With Cuda: by Vishwanath Sarkar NIT Bhopal
17 pages
CUDA Programming Within Mathematica
No ratings yet
CUDA Programming Within Mathematica
17 pages
Modulos Capacitacion API
No ratings yet
Modulos Capacitacion API
18 pages
Graphics Processing Unit GPU Programming Strategie
No ratings yet
Graphics Processing Unit GPU Programming Strategie
14 pages
Adax Thermostat
No ratings yet
Adax Thermostat
8 pages
Nvidia Cuda C Getting Started Guide For Microsoft Windows: Installation and Verification On Windows
No ratings yet
Nvidia Cuda C Getting Started Guide For Microsoft Windows: Installation and Verification On Windows
14 pages
JCUDA
No ratings yet
JCUDA
13 pages
Wolfram Mathematica
No ratings yet
Wolfram Mathematica
7 pages
Nvidia Cuda Getting Started Guide For Microsoft Windows: Installation and Verification On Windows
No ratings yet
Nvidia Cuda Getting Started Guide For Microsoft Windows: Installation and Verification On Windows
15 pages
Graphics Processing Unit (GPU) Programming Strategies and Trends in GPU Computing
No ratings yet
Graphics Processing Unit (GPU) Programming Strategies and Trends in GPU Computing
10 pages
Solution Overview. Motorola Solutions Dimetra IP Micro Automatic Failover PDF
No ratings yet
Solution Overview. Motorola Solutions Dimetra IP Micro Automatic Failover PDF
8 pages
Cambridge Books Online
No ratings yet
Cambridge Books Online
9 pages
Christian Eh An Sen 2
No ratings yet
Christian Eh An Sen 2
18 pages
Cuda-: An Emerging Technology That Can Make Robots Reflex Action Faster
No ratings yet
Cuda-: An Emerging Technology That Can Make Robots Reflex Action Faster
11 pages
Case Study On: Nitte Meenakshi Institute of Technology
No ratings yet
Case Study On: Nitte Meenakshi Institute of Technology
8 pages
2014 Toyota Sai Press Release
No ratings yet
2014 Toyota Sai Press Release
4 pages
The Notebook Interface: Wolfram Mathematica (Usually Termed Mathematica) Is A Modern Technical Computing System
No ratings yet
The Notebook Interface: Wolfram Mathematica (Usually Termed Mathematica) Is A Modern Technical Computing System
9 pages
Irr Eo 801
No ratings yet
Irr Eo 801
9 pages
KMP Algorithm
No ratings yet
KMP Algorithm
3 pages
Brodtkorb Etal Meta10
No ratings yet
Brodtkorb Etal Meta10
15 pages
CUDA Wikipedia
No ratings yet
CUDA Wikipedia
10 pages
It Acquisition Management
No ratings yet
It Acquisition Management
6 pages
Example of Array List Using Object Stud With Data Type Student
No ratings yet
Example of Array List Using Object Stud With Data Type Student
3 pages
Csaq
No ratings yet
Csaq
4 pages
Parallel & Distributed Computing Report
No ratings yet
Parallel & Distributed Computing Report
4 pages
What Is National Philosophny of Education Education Essay
No ratings yet
What Is National Philosophny of Education Education Essay
5 pages
Time Management: We Should Fix Priority
No ratings yet
Time Management: We Should Fix Priority
4 pages
Ac 5
No ratings yet
Ac 5
2 pages
GPU Mathematica 8
No ratings yet
GPU Mathematica 8
3 pages
Learnopencv Com Demystifying Gpu Architectures For Deep Learning
No ratings yet
Learnopencv Com Demystifying Gpu Architectures For Deep Learning
1 page
BCA Timetable
No ratings yet
BCA Timetable
2 pages
VTU - B.E B.Tech - 2019 - 4th Semester - July - CBCS 17 Scheme - MECH - 17ME44 Fluid PDF
No ratings yet
VTU - B.E B.Tech - 2019 - 4th Semester - July - CBCS 17 Scheme - MECH - 17ME44 Fluid PDF
2 pages
DoP M100 Gypframe Metal Profiles
No ratings yet
DoP M100 Gypframe Metal Profiles
2 pages
387 / 387TC / 387ST Hose: Delivering Value and Performance For High-Pressure Systems
No ratings yet
387 / 387TC / 387ST Hose: Delivering Value and Performance For High-Pressure Systems
1 page
Rfe-Hf: Flame-Retardant Halogen-Free Instru-Mentation and Communication Cable
No ratings yet
Rfe-Hf: Flame-Retardant Halogen-Free Instru-Mentation and Communication Cable
1 page