Star-P® Programming Guide for Use with MATLAB®
Release 2.7
12/11/08
COPYRIGHT
Copyright © 2004-2008, Interactive Supercomputing, Inc. All rights reserved. Portions Copyright
© 2003-2004 Massachusetts Institute of Technology. All rights reserved.
Star-P® Introduction
Extending MATLAB with Star-P® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Parallel Computing Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
About the Star-P® Programming Guide for Use with MATLAB® . . . . . . . . . . . . . . . . . 5
Star-P® Functions
Basic Server Functions Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
General Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
fseek. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
np . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
pp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
ppbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
ppclear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
ppgetoption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
ppsetoption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
ppgetlog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
ppgetlogpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
ppinvoke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
pploadpackage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Application Examples
Application Example: Image Processing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
How the Analysis Is Done . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Application Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Images For Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
M Files for the Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Application Example Not Using Star-P® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
patmatch_color_noStarP.m File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
patmatch_calc.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Application Example Using Star-P® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
patmatch_colordemo_StarP.m File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
Application Example Using ppeval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
About ppeval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
About the ppeval Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
patmatch_color_ppeval.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Star-P® Introduction
Star-P® extends easy-to-use Very High Level Languages (VHLLs) such as MATLAB®1 and
Python to support simple, user-friendly parallel computing on a spectrum of computing
architectures: multi-core desktops and servers, large shared memory servers, and clusters.
Star-P® fundamentally transforms the workflow, substantially shortening the “time to solution”
by allowing the user to easily adapt their application for use on parallel resources.
This chapter provides an overview of using Star-P® in the MATLAB® VHLL environment. It
includes sections on the following topics:
1. MATLAB® is a registered trademark of The MathWorks, Inc. Star-P® and the "star" logo are reg-
istered trademarks of Interactive Supercomputing, Inc. Other product or brand names are trade-
marks or registered trademarks of their respective holders. ISC's products are not sponsored or
endorsed by The MathWorks, Inc. or by any other trademark owner referred to in this docu-
ment.
With Star-P®, existing MATLAB scripts and functions can be re-used to run larger problems in
parallel with minimal modification, and new parallel MATLAB code can be developed in a
fraction of the time normally required to develop parallel applications in traditional
programming languages, such as C, C++, or Fortran with MPI. Parallel programming with
Star-P® in MATLAB requires learning a bare minimum of additional programming constructs.
Implementing Data Parallelism with Star-P® does not require the addition of any new
functions to your MATLAB code, and adding Task Parallelism requires only one additional
construct.
To implement Data Parallelism, Star-P® overloads ordinary MATLAB commands with the *p
construct. This simply multiplies (*) array dimension(s) by a symbolic variable (p) denoting
that a matrix dimension is to be distributed. A class of overloaded MATLAB programs
becomes parallel with the insertion of this construct. The *p syntax tells data construction
routines (for example, rand) to build the matrix on the parallel HPC back-end, and perform
the indicated operation (for example, matrix inversion) there as well. Creating a distributed
random matrix and taking its inverse with MATLAB can be expressed with the following two
lines of code:
A = rand(100,100);
B = inv(A);
Expressing the same operations in data parallel form with Star-P® requires only one slight
change:
App = rand(100,100*p);
Bpp = inv(App);
Once the *p construct has been applied to a variable, all subsequent operations on that
variable will occur in parallel on the HPC and result in new variables that are also resident on
the HPC. This important inheritance feature of Star-P® allows you to parallelize your MATLAB
code with minimal effort. For more information on distributed data operations, see “Data
Parallelism with Star-P® and MATLAB”.
For implementing Task Parallel functionality, Star-P® introduces the ppeval function into
MATLAB. The ppeval function, which is called in a similar manner to the MATLAB function
feval, allows you to pass a string containing the name of a valid MATLAB function foo, as
well as all of foo’s calling arguments. The ppeval function then packages foo, along with all functions
called within foo, and ships those functions to the HPC server. The calling arguments of foo
are also shipped to the HPC and can be either broadcast to all processors using the bcast
function or split amongst the processors using the split function.
App = rand(100,100,100*p);
Bpp = ppeval('inv',App);
Or equivalently:
Bpp = ppeval('inv',split(App,3));
In this example, the ppeval call splits the variable App into 100 individual slices (along the
last dimension). The slices are divided among available processors on the server, and then
each processor iterates over its received slices, performing an inv operation on each slice.
The results from all processors are then combined, preserving original order, and returned as
the output variable. More information about task parallel functionality can be found in “Task
Parallelism with Star-P® and MATLAB”
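An argument that every processor needs in its entirety can be wrapped with bcast rather than split. The following sketch is illustrative only; the use of the standard MATLAB function mtimes and the variable M are assumptions, not taken from this guide:
App = rand(100,100,100*p);
M = rand(100,100);                               % local matrix broadcast to all processors
Cpp = ppeval('mtimes', split(App,3), bcast(M));  % each slice App(:,:,k) is multiplied by M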
To use Star-P® with MATLAB, the user needs only one copy of The MathWorks’ product to
serve as a front-end, which need not be the parallel machine. No copies of MATLAB are
required on the parallel computer.
Users have the benefit of working in the familiar MATLAB environment. When new releases
of MATLAB are distributed, the user merely plugs in the new copy and Star-P® continues to
execute.
Despite Star-P®’s ability to add functionality for distributed matrices and parallel operations,
don’t forget that you are still using MATLAB as your desktop development tool. This means
that you can run an existing MATLAB program in Star-P® with almost no changes, and it will
run strictly on your desktop (client) machine, never invoking the Star-P® system after
initialization. Of course, this would be a waste of HPC resources, if you ran this way all the
time. But it is a convenient way of porting the compute-intensive portions of your code one at
a time, allowing the unported portions to execute in MATLAB proper.
In the Star-P® context, there are many features of the MATLAB environment that are still
relevant for developing applications with distributed objects and operations. The MATLAB
debugger and the script and function editor are two of the most useful MATLAB functions
when you’re programming with Star-P®. The designers of Star-P® have taken great pains to
fit within the MATLAB mindset, using the approach “It’s still MATLAB.” So if you’re wondering
whether a MATLAB operation works in Star-P®, just try it. Most operations work in the
obvious way.
Note: If a MATLAB function that has high value for you does not work, please let us
know via [email protected]
Star-P® greatly simplifies the parallelization of new and existing MATLAB code by allowing
the user to run code either on the local MATLAB client or on the HPC back-end, taking
advantage of the respective strengths of each.
This section reviews various domains of parallel computing. We present these concepts for
users who are new to parallel computing and then discuss their implementation by Star-P®.
Parallel computing textbooks list many models for parallelizing programs, including:
• Data Parallel Computation
• Message Passing
• Task Parallel Computation
You may wish to consult a website related to parallel computing, such as
http://beowulf.csail.mit.edu/18.337, or any of the many textbooks that cover these topics. In
brief, the current version of Star-P® is best expressed as a data parallel language or
a global array language. The prototypical example of data parallelism is matrix addition:
Cpp = App + Bpp
where App and Bpp are matrices. When we add two n-by-n matrices, we perform n² data
parallel additions. In other words, we perform the same operation (addition) simultaneously
on each of the n² numbers.
The name “data parallel” is often extended to operations that have communication of
dependencies among some of the operations, but at some level can be viewed as identical
operations happening purely in parallel. Two simple examples are matrix multiplication
(Cpp=App*Bpp) and prefix sums (Dpp=cumsum(App)).
For many users, a helpful description of Star-P® is that it is a global array syntax language.
With a global array syntax, a user variable such as App refers to the entirety of a distributed
object on the back-end server. The abstraction of an array that
contains many elements is a powerful construct. With one variable name such as App, you
are able to package up a large collection of numbers. This construct enables higher level
mathematical operations expressed with a minimal amount of notation. On a parallel
computer, this construct allows you to consider data on many processors as one entity.
“Task parallel” or “embarrassingly parallel” computations are those operations where there is
little or no dependency among the computational pieces. Each piece can easily be placed on
a distinct processor. While not strictly required, such computations typically depend on a
relatively small amount of input data, and produce relatively small amounts of output data. In
such circumstances, the implementation may not store any persistent data on distributed
memory. An example is Monte Carlo simulation of financial instruments, where the
calculations for each sample are done completely in isolation from every other sample. While
Star-P® may be considered a data parallel language, it also has task parallel functionality
through the use of its ppeval operation.
Most of the operations for which Star-P® will deliver good performance will be operations on
global arrays, so most of this document treats arrays as global arrays. An important exception
to this is the ppeval function, which supports task parallelism and works on global arrays,
but in a less straightforward manner. A global array that is an input to the ppeval function is
partitioned into sections, each of which is converted to an array that is local to a single
instance of a MATLAB function on a single processor. The reverse process is used for output
arrays: the sections are assembled back into global arrays.
The remainder of this document provides chapters that cover the following topics:
• “Starting Star-P® with MATLAB” takes you through a sample session to illustrate
how to start up Star-P® from a graphical or command line interface with various
command-line options. A simple program is shown that illustrates the use of
Star-P®’s ability to parallelize MATLAB code.
• “Task Parallelism with Star-P® and MATLAB” describes Star-P®’s ppeval function
for performing embarrassingly parallel operations on either local or distributed data.
• “Tips and Tools for High Performance Star-P® Code” provides suggestions for
maximizing the performance of code written for both data and task parallel
computations, and describes tools for monitoring and profiling MATLAB code using
Star-P®.
• “Star-P® Functions” summarizes functions that are not part of the standard MATLAB
language and describes their implementation.
• “Supported MATLAB® Functions” lists the MATLAB functions that are supported in
both data and task parallel modes, as well as MATLAB toolbox functions that are
supported only in task parallel computations.
This chapter is intended for users who have a working Star-P® installation on a client system
as well as a high performance computing server. It includes the following topics:
• "Getting Help at the IDE Window" explains how to use and invoke help.
• "Starting Star-P® on a Linux Client System" provides information for users running
Star-P® under Linux.
• "Starting Star-P® on a Windows Client System" provides information for users running
Star-P® under Windows.
When working at the IDE, you can invoke online help in the following ways:
• Using the HTML-Based Help
• Using the Text-Based Help
• Getting Command Syntax Information
You can get Star-P® HTML-based help within the MATLAB IDE by entering the starpdoc
command.
>> starpdoc rosser
You can also get information on how to use starpdoc by using the MATLAB command help.
For example:
>> help starpdoc
starpdoc Get browser-displayed help related to Star-P® parallel computing
Syntax 1:
starpdoc % Bring up the main Star-P® online help page
Syntax 2:
starpdoc <Star-P®-M-function-name> | <Star-P®-M-library-name> | syntax
You can get Star-P® text-based help within the MATLAB IDE by entering the starphelp
command.
>> starphelp rosser
You can also get information on how to use starphelp by using the MATLAB command help.
For example:
Syntax 1:
starphelp % Bring up the main list of Star-P® help
Syntax 2:
starphelp <Star-P®-M-function-name> | <Star-P®-M-library-name> | syntax
You can get Star-P®-specific conventions and syntax information by way of the following
methods:
• Syntax grammar and conventions used in the Star-P® documentation
• Get syntax information for a particular function
For general syntax grammar usage and conventions, you can invoke either starphelp or
starpdoc using the <syntax> option.
(Table: documentation syntax conventions and their meanings.)
You can get help on individual functions by calling either form of Star-P® help with a function
name as its argument.
Syntax 1:
<vector-cross-product> = cross( <input-vector-1> , <input-vector-2> )
Your system administrator will usually have installed the Star-P® software on the systems
(client(s) and server) you will be running on in advance. The default location of the starp
software is /usr/local/starp/<version>. Assuming this install location is in your shell
path, then the following sequence will start the Star-P® client (on a system named
your_system) and connect to the Star-P® server configured by the administrator, which
happens to be a system named remote_server.
your_system% starp
user@remote_server’s password: **********
< M A T L A B >
Copyright 1984-2009 The MathWorks, Inc.
By using this software you agree to the terms and conditions described
in the license agreement. Type help agreement
client log file: /home_directory/.starp/log/2008_04_05_1111_54/starpclient.log
>>
As you can see, the HPC server will typically require a password for user authentication. You
will either need to supply this password upon every start-up or configure SSH so it is not
needed on every session initiation. Otherwise, there are few visible signs that the Star-P®
server is running on a distinct machine from your client.
This last line (“>>”) is the MATLAB prompt. At this point you can type the commands and
operators that you are familiar with using from prior MATLAB experience, and can start to use
the Star-P® extensions described in “Data Parallelism with Star-P® and MATLAB” and “Task
Parallelism with Star-P® and MATLAB”.
A full description of the starp command and its command line options is provided in the
section “Star-P® Functions”, or by typing the following at the command prompt:
$ ./starp --help
By default, the Star-P® installation on a Windows XP system will create a shortcut on the
desktop, as well as an entry in the list of programs under the Windows Start menu.
The default location for the Star-P® programs will be C:\Program Files\starp; if you
can’t find them there, check with your system administrator to see if an alternate location was
used. For installation instructions, see the “Star-P® Installation and Configuration Guide”.
To invoke the Star-P® software, either double-click the desktop icon, or click on:
Start -> All Programs -> Star-P® Client Software -> Star-P® M Client
Figure 2-1 Star-P® Desktop icon
If passwordless SSH has not been configured for the user name in your current Star-P®
properties configuration file (the default file being starpd.properties), a dialogue box will
appear prompting you for a password. If no user name appears in the configuration file, then
the user name associated with your current Windows session will be utilized.
Once the connection has been established, MATLAB will start, with Star-P® enabled.
Star-P® can also be started from a Windows command line prompt using the starp
command. A full description of the starp command and its options is provided in the section
“Star-P® Start-Up Command Line Options”, or type starp --help at the Windows
command line.
Star-P® Dashboard
The Star-P® Dashboard serves to:
• inform the user of the connection status of the Star-P® client and server
• provide an interface to allow the user to kill the server should it be necessary.
The Server Status window displays information about the server startup process, information
about connectivity to the server, and information about the success or failure of kill button
operations.
The server status light on the dashboard provides a simple visual indicator representing the
primary set of possible states. At any time, it may display one of the following values:
• Server Initializing
• Server Ready
• Server Busy
• Connection Lost
During the start-up phase, the dashboard will indicate that the server is initializing. Then
when a command is submitted to the server, it switches to the “busy” state, and returns to the
“ready” state when the command completes. If connectivity to the server is lost at any time,
this will be reflected by the status light. Connectivity is tested by periodic heartbeats that pass
between the client and the server.
By default, the dashboard always appears when connecting to the Star-P® server. The
dashboard can then be hidden or shown using the following pair of commands, which take no
arguments:
ppshowdashboard
pphidedashboard
If you want to change the default settings for dashboard initialization, you can uncomment
the environment variable line starpd.dashboard.no_gui=1 in the starpd.properties file.
The Kill Star-P® Server button is not intended for routine use, but only for situations where the
user is unable to exit Star-P® in the usual way. After pressing the Kill Star-P® Server button,
click Yes when the confirmation dialog appears.
Figure 2-4 Star-P® Kill Button Confirmation
The Star-P® Dashboard opens set to Always On Top mode. However it can be minimized or
the user can unset Always On Top using the View menu.
If you are running Star-P® on a system without graphical display capability (for example, a
UNIX shell with no DISPLAY environment set), the Dashboard will not be visible or
accessible.
First, we check whether the server is alive and how many processors have been allocated to the session.
>> np
ans =
8
>> norm(App*Xpp-Bpp)
ans =
3.4621e-13
>> ppwhos
Your variables are:
Name Size Bytes Class
App 100x100p 80000 ddense array
Bpp 100px100 80000 ddense array
Xpp 100px100 80000 ddense array
ans 1x1 8 double array
Grand total is 30001 elements using 240008 bytes
MATLAB has a total of 1 elements using 8 bytes
Star-P® server has a total of 30000 elements using 240000 bytes
Finally, to end Star-P® execution, you can use either the quit or the exit command:
>> quit
your_system%
At this point you are ready to write a Star-P® program or port a MATLAB program to Star-P®.
You may have a set of Star-P® options that you want to choose every time you run Star-P®.
Just as MATLAB will execute a startup.m file in the current working directory when you
start MATLAB, Star-P® will execute a ppstartup.m file. Note that Star-P® itself executes
some initial commands to create the link between the Star-P® Client for use with MATLAB
and the Star-P® server. The ppstartup.m file will be executed after those Star-P®
initialization commands. Thus the order of execution is:
• startup.m % MATLAB configuration commands
• Star-P® initialization commands % create the client/server link
• ppstartup.m % your Star-P® configuration commands
For example, this mechanism can be useful for choosing a particular sparse linear solver to
use (see “ppsetoption” documentation in “Star-P® Functions”) or for loading your own
packages (see the “Star-P® Software Development Kit (SDK) Tutorial and Reference
Guide”).
Star-P® can be used with default options enabled, but advanced users might prefer to
override defaults at start-up time. The start-up executable is named starp. The starp
application reads its default start-up options from the starpd.properties file. For
information on how to edit these properties directly, please see the section titled
Administration Topics in the Star-P® Installation and Configuration Guide.
Note: You can get help regarding Star-P® startup options by executing the following
command: starp --help.
• -a
Hostname or address of the HPC to which to connect. May also be a comma-delimited
list of machines comprising a cluster, head node first.
• -c, --config config_file
Use config_file as the Star-P® properties configuration file instead of the default
starpd.properties.
• Distribute Star-P® processes when not using a workload manager; acceptable values
are one of the following:
• Run MATLAB in filter mode so that it reads from stdin and writes to stdout, for
testing.
• -h, --help
Print help text associated with the other arguments you provide.
• -j, --wlmargs <wlmargs>
Extra arguments to be passed to the Workload Manager; these will not override
Workload Manager options normally supplied by Star-P®.
• Specify a non-standard SSH port for communication with the HPC server.
• -t, --datadir data_path
Path that will be used by the HPC Server for file I/O. The Star-P® HPC Server reads and
writes data to the directory you specify with this path.
• -u, --hpcuser
User name to use on the HPC server.
• -x, --exclude <exclude>
Specify which nodes of a cluster not to use (mutually exclusive with --use).
• -z, --use <use>
Specify which nodes of a cluster to use (mutually exclusive with --exclude).
When running in a cluster, it is also useful to understand the precedence order of potential
machine files.
• Any nodes specified in a machine file passed in using -m, or specified with a -x or -z
option, that are not also included in the default machine file will not be used by
<starp>.
• A user default machine file
(~/.starp/.config/machine_file.user_default by default, or
<starp-usr-config>/<usr>/ if overridden during installation) will take
precedence over the system default machine file
(<StarP_dir>/config/machine_file.system_default) and need not
represent a subset of the system default machine file.
• -m,--machine machine_file_path
The path to a machine file to be used for this instance of starp. The file format is one
machine name per line, with no empty lines at the end of the file. The node specified by
the -a argument must be included in the file. Example contents of this file:
node1
node2
...
nodeN
• -x,--exclude [node] or [node1,node2,...,nodeN] or
[node2-nodeN]
Exclude a node, a set of nodes or a range of nodes from the current instance of starp.
This argument will be used as a modifier against either a machine file passed in using
the -m argument, or against either the user's or the system's default machine file. This
flag is mutually exclusive with -z.
• -z,--use [node] or [node1,node2,...,nodeN] or [node2-nodeN]
Use a node, a set of nodes or a range of nodes for the current instance of starp. This
argument will be used as a modifier against either a machine file passed in using the
-m argument, or against either the user's or the system's default machine file. This flag
is mutually exclusive with -x.
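The following command lines sketch how these flags are used on a starp invocation; the node names and machine file path are placeholders, and any other options your site requires (for example, the HPC host and user name) would be added as usual:
starp -x node3,node7            (exclude two nodes from the default machine file)
starp -z node3-node14           (use only a contiguous range of nodes)
starp -m /path/to/machine_file  (run against an explicit machine file)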
By providing command-line options, you can override some of the information normally
supplied by the starpd.properties file. The following example shows the minimal set of
command-line options required for running Star-P®. In this case, the command would cause
MATLAB to start up, running eight Star-P® Server processes on a machine with the
hostname altix as the user joe:
Examples
Using this command line, a new machine file for this one run of <starp> will be
generated using the default machine file, but with node3 and node7 removed.
Note: If node3 or node7 are not members of the default machine file, they will be ignored
as defined in Cluster Configurations at the end of this section.
• If you are running on a cluster and you want to specify a range of nodes in the cluster
to be excluded from a particular run of <starp> (perhaps a rack of nodes has been
taken offline), your <starp> command line would look like this:
Using this command line, a new machine file for this one run of <starp> will be
generated using the default machine file, but with node3 through node14 utilized.
Note: If either node3 or node14 are not members of the default machine file, <starp>
will return a "bad range" error.
Note: If node14 appears before node3 in the default machine file, <starp> will return
a "bad range" error.
• If you are running on a cluster and you want to specify a custom machine file for a
particular run of <starp>, your <starp> command line would look like this:
Using this command line, the machine file specified by [machine file path] will
be used for this one run of <starp>.
Note: The machine file specified by [machine file path] must represent a subset of the
user's or system's default machine file.
• If you are running on a cluster and you want to specify a custom machine file for a
particular run of <starp> and you only want to use a subset of that machine file,
your <starp> command line would look like this:
Using this command line, the machine file specified by [machine file path] will be used
for this one run of <starp>.
Note: The machine file specified by [machine file path] must represent a subset
of the user's or system's default machine file.
Note: If either node3 or node14 are not members of the default machine file, <starp>
will return a "bad range" error.
Note: If node14 appears before node3 in the default machine file, <starp> will return
a "bad range" error.
A limited form of batch processing can be used in Star-P® that is separate from the realm of
full workload management systems that are also supported by Star-P®. This process involves
use of command line options listed above as well as the name of a desired script you wish to
run within your VHLL environment. If you wish to run a .m script named myscript.m, you
would redirect the contents of a MATLAB .m file into the starp command like this:
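A minimal sketch, assuming standard shell input redirection and omitting any site-specific connection options:
your_system% starp [options] < myscript.m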
Cluster Configurations
There are two files and several command-line arguments that can affect cluster configuration.
This chapter contains information on creating, manipulating, loading, and saving data in
parallel and includes the following:
• "Star-P® Naming Conventions"
• "Examining Star-P® Data"
• "Special Variables: p and np"
• "Supported Data Types"
• "Creating Distributed Arrays"
• "Types of Distributions"
• "Propagation of Distribution"
• "Explicit Data Movement with ppback and ppfront"
• "Loading And Saving Data on the Parallel Server"
The Star-P® extensions to MATLAB allow you to parallelize computations by declaring data
as distributed. This places the data in the memory of multiple processors. Once the data is
distributed, then operations on the distributed data will run implicitly in parallel. Since
declaring the data as distributed requires very little code in a Star-P® program, performing the
MATLAB operations in parallel requires very little change from standard, serial MATLAB
programming.
Another key concept in Star-P® is that array dimensions are declared as distributed, not the
array proper. Of course, creating an array with array dimensions that are distributed causes
the array itself to be distributed as well. This allows the distribution of an array to propagate
through not only computational operators like + or fft, but also data operators like size.
Propagation of distribution is one of the key concepts that allows large amounts of MATLAB
code to be reused directly in Star-P® without change.
Star-P® commands and data types generally use the following conventions, to distinguish
them from standard MATLAB commands and data types:
• Most Star-P® commands begin with the letters pp, to indicate parallel. For example,
the Star-P® ppload command loads a distributed matrix from local files. Exceptions
to this rule include the split and bcast commands.
• Star-P® data types begin with the letter d, to indicate “distributed”. For example, the
Star-P® dsparse class implements distributed sparse matrices.
The following convention for displaying Star-P® related commands and classes is used
throughout this chapter:
Command/Variable                        Font
p and other dlayout variables           bold green font
This section describes how you can look at your variables, see their sizes and determine
whether they reside on the client as a regular MATLAB object or on the server as a Star-P®
object. The MATLAB whos command is often used for this function, but whos is unaware of
the true sizes of the distributed arrays. Star-P® supports a similar command called ppwhos.
Here is a sample calling sequence and its output:
>> n = 1000;
>> app = ones(n*p);
>> bpp = ones(n*p,n);
>> ppwhos
Your variables are:
Name Size Bytes Class
app 1000x1000p 8000000 ddense array
bpp 1000px1000 8000000 ddense array
n 1x1 8 double array
Grand total is 2000001 elements using 16000008 bytes
MATLAB has a total of 1 elements using 8 bytes
Star-P® server has a total of 2000000 elements using 16000000 bytes
Note that each dimension of the arrays includes the “p” if it is distributed. Size and Bytes
reflect the size on the server for distributed objects, and transition naturally to scientific
notation when their integer representations get too large for the space.
>> n = 2*10^9;
>> xpp = ones(1,n*p);
>> ppwhos
Your variables are:
Name Size Bytes Class
n 1x1 8 double array
xpp 1x2000000000p 1.600000e+10 ddense array
Grand total is 2000000001 elements using 1.600000e+10 bytes
MATLAB has a total of 1 elements using 8 bytes
Star-P® server has a total of 2000000000 elements using 1.600000e+10 bytes
Note that the MATLAB whos command, when displaying distributed objects, only shows the
amount of memory they consume on the front-end, not including their server memory. This
does not reflect their true extent. For example, the output from whos for the session above
looks like the following:
>> n = 1000;
>> app = ones(n*p);
>> bpp = ones(n*p,n);
>> whos
Name Size Bytes Class
app 1000x1000 1728 ddense
bpp 1000x1000 1728 ddense
n 1x1 8 double
The built-in MATLAB routine hilb constructs a Hilbert matrix:
>> H = hilb(4096);
Because the operators in the routine (:, ones, subsasgn, transpose, rdivide, +, -) are
overloaded to work with distributed matrices and arrays, typing the following would create a
4096 by 4096 Hilbert matrix on the server.
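For example, passing a distributed dimension to hilb builds the matrix on the server (the variable name Hpp is illustrative):
>> Hpp = hilb(4096*p);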
By exploiting MATLAB’s object-oriented features in this way, existing scripts can run in
parallel under Star-P® with minimal modification.
As a general rule, you will probably not want to view an entire distributed array, because the
arrays that are worth distributing tend to be huge. For example, the text description of 10
million floating-point numbers is vast. But looking at a portion of an array can be useful. To
look at any portion of a distributed array bigger than a scalar, it will have to be transferred
explicitly to the client MATLAB program. But looking at a single element of the array can be
done simply. Remember from above that result arrays that are 1x1 matrices are created as
local arrays on the MATLAB client.
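For instance (a sketch; the variable names are illustrative):
>> app = rand(100*p);
>> app(5,5)             % a single element: its value is returned locally as ans
>> bpp = app(1:10,:)    % multiple elements: a new distributed object that stays on the server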
As you can see, examining a single element of the array returns its value. Examining multiple
elements creates another distributed object, which remains on the server, as in the last
command above. To see the values of these elements, you will need to use ppfront to
move them to the front-end. For information on ppfront and ppback see "Explicit Data
Movement with ppback and ppfront".
In Star-P® you use two special variables to control parallel programming. While they are
technically functions, you can think of them as special variables. The first is p, which is used
in declarations such as the following to denote that an array should be distributed for parallel
processing.
The second variable with special behavior is np, denoting the number of processors that
have been allocated to the user’s job for the current Star-P® session. Because these are not
unique names, and existing MATLAB programs may use these names, care has been taken
to allow existing programs to run, as described here. The behavior described here for p and
np is the same as the behavior for MATLAB built-in variables such as i and eps, which
represent the imaginary unit and floating-point relative accuracy, respectively.
The variables p and np exist when Star-P® is initiated, but they are not visible by the whos or
ppwhos command.
After Star-P® initializes in a new session, the following commands yield no output.
>> whos
>> ppwhos
Even though the variables p and np do not appear in the output of whos or ppwhos, they do
have values:
>> p
ans =
1p
>> np
ans =
8
The variable np will contain the number of processors in use in the current Star-P® session.
In this example, the session was using eight processors.
Because these variable names may be used in existing programs, it is possible to replace the
default Star-P® definitions of p and np with your own definitions, as in the following example:
>> p
ans =
1p
>> np
ans =
8
>> n = 100;
>> app = ones(n*p);
>> bpp = ones(n*p,n);
>> cpp = bpp*bpp;
>> p = 3.14;
>> z = p*p;
>> z
z =
9.8596
>> p
p =
3.1400
>> ppwhos
Note that in the first output from ppwhos, the variable p is displayed, because it has been
defined by the user, and it works as a normal variable. But once it is cleared, it reverts to the
default Star-P® definition. If you define p in a function, returning from the function acts like a
clear and the definition of p will revert in the same way.
Assignments to p
The variable pp is a synonym for p. If you use a mechanism to control client versus Star-P®
operation (execution solely on the client versus execution with Star-P®), the assignment of p
= 1 anywhere in the MATLAB script will alter the p function. In this case, use a construct
similar to the following:
if StarP
p = pp;
else
p = 1;
end
Anytime you clear the variable p, for example clear p, the symbolic nature of p is restored.
Real and complex numbers in Star-P® are supported as in MATLAB. Matrices of double
precision real and complex data can be directly created and manipulated by use of the
complex, real, imag, conj, and isreal operators and the special variables i and j
(equal to the square root of -1, or the imaginary unit), and they can be the output of certain
operators.
Note: Complex integer types are not supported within the Star-P® Task Parallel Engine
(TPE). However, it does support floating point complex types such as double.
>> n = 1000;
>> app = rand(n*p,n)
app =
ddense object: 1000p-by-1000
>> bpp = rand(n*p,n)
bpp =
ddense object: 1000p-by-1000
>> cpp = app + i*bpp
cpp =
ddense object: 1000p-by-1000
>> ccpp = conj(cpp)
ccpp =
ddense object: 1000p-by-1000
>> dpp = real(cpp)
dpp =
ddense object: 1000p-by-1000
>> epp = imag(cpp)
epp =
ddense object: 1000p-by-1000
>> fpp = complex(app)
fpp =
ddense object: 1000p-by-1000
>> ppwhos
Your variables are:
Name Size Bytes Class
app 1000px1000 8000000 ddense array
bpp 1000px1000 8000000 ddense array
cpp 1000px1000 16000000 ddense array (complex)
ccpp 1000px1000 16000000 ddense array (complex)
dpp 1000px1000 8000000 ddense array
epp 1000px1000 8000000 ddense array
fpp 1000px1000 16000000 ddense array (complex)
n 1x1 8 double array
Grand total is 7000001 elements using 80000008 bytes
MATLAB has a total of 1 elements using 8 bytes
Star-P® server has a total of 7000000 elements using 80000000 bytes
Besides these direct means of constructing complex numbers, they are often the result of
specific operators, perhaps the most common example being FFTs.
The *p Syntax
The symbol p means “distributed” and can add that attribute to a variety of other operators
and variables by the multiplication operator *. Technically, p is a function, but it may be
simpler to think of it as a special variable. Any scalar that is multiplied by p will be of class
dlayout. For more information about p, see "Special Variables: p and np".
>> p
ans =
1p
>> whos
Name Size Bytes Class Attributes
ans 1x1 362 dlayout
Note: While it might seem natural to add a *p to the bounds of a for loop to have it run in
parallel, unfortunately that doesn't work. The simplicity of this type of approach has
not been lost on the designers of Star-P®, and a functionality of this type or similar
may appear in future releases.
The first and second examples create matrices that are distributed in the first and second
dimensions, respectively. The last two examples create a matrix that is distributed in the
second dimension. For more detail, see "Types of Distributions".
App = sprandn(100*p,100,0.03);
The dense constructors ones, zeros, rand, and eye behave the same way as randn, and the
sparse constructors sprand and speye behave the same way as sprandn. The horzcat and
vertcat operators work in the obvious way; concatenation of distributed objects yields
distributed objects.
The meshgrid operator can create distributed data in a similar way, although this example
may not be the way you would use it in practice:
Also, the diag operator extends a distributed object in the obvious way.
The reshape command can also create distributed arrays, even from local arrays.
>> a = rand(100,100);
>> app = reshape(a,100,100*p)
app =
ddense object: 100-by-100p
>> ppwhos
Your variables are:
Name Size Bytes Class
a 100x100 80000 double array
app 100x100p 80000 ddense array
Grand total is 20000 elements using 160000 bytes
MATLAB has a total of 10000 elements using 80000 bytes
Star-P® server has a total of 10000 elements using 80000 bytes
Note: The data sizes shown in the examples illustrate the functionality of Star-P® but do not
necessarily reflect the sizes of problems for which Star-P® will provide significant
benefit.
Some programs or functions take as input not an array, but the bounds of arrays that are
created internally. The *p syntax can be used in this situation as well, as shown in the
following:.
>> n = 1000*p;
>> whos
Name Size Bytes Class Attributes
n 1x1 362 dlayout
>> App = rand(n)
App =
ddense object: 1000-by-1000p
Indexing allows creation of new matrices or arrays from subsections of existing matrices or
arrays. Indexing on distributed matrices or arrays always creates a distributed object, unless
the result is a scalar, in which case it is created as a local object. Consider the following
example:
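A sketch of the kind of indexing meant here (cpp and dpp are illustrative names, consistent with the following paragraph):
>> cpp = rand(1000*p);
>> dpp = cpp(101:200, 1:500);   % indexing a distributed matrix creates a new distributed matrix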
Note that creating a new matrix or array by indexing, as in the creation of dpp above, may
involve interprocessor communication on the server, as the new matrix or array will need to
be evenly distributed across the processors (memories) in use, and the original position of
the data may not be evenly distributed.
It may seem logical that you could create a distributed object by adding the *p to the left-hand
side of an equation, just as you can to the right-hand side. But this approach doesn't work,
either in MATLAB in general or in Star-P® specifically for distributed arrays.
Note: There is an incompatibility between MATLAB and Star-P® in this area. In MATLAB,
when you type the command app or bpp, as soon as that assignment is complete, you
can modify either app or bpp and know that they are distinct entities, even though the
data may not be copied until later. For technical reasons Star-P® can get fooled by this
deferment. Thus if you modify either app or bpp, the contents of both app and bpp get
modified. Because of the semantics of the MATLAB language, this is only relevant for
assignments of portions of app or bpp; i.e., app(18,:) = ones(1,100*p) or
app(1234) = 3.14159. There are several ways to avoid the deferment and force
the data to be copied immediately to avoid this problem. One example would be (for
a 2D matrix) to do the copy with app = bpp(:,:). Another example that works for
all non-logical arrays is app = +bpp.
Note: Related to the previous note, if a shallow copy of a variable is created using the
command app = bpp, then the deletion of either app or bpp using clear or ppclear
on app or bpp will delete the data for both app and bpp but will not delete the symbols
for both variables. To avoid this scenario, use an assignment statement of the form
app = bpp(:,:) or app = +bpp.
Types of Distributions
Star-P® supports row and column distribution of dense matrices. These distributions assign a
block of contiguous rows/columns of a matrix to successive processes.
A two-dimensional distributed dense matrix can be created with any of the following
commands:
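For example (illustrative sizes, consistent with the 400-row, 8-processor discussion that follows):
>> app = rand(400*p,400);   % row distributed
>> bpp = rand(400,400*p);   % column distributed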
The *p designates which of the dimensions are to be distributed across multiple processors.
Row distribution
In the example above, app is created with groups of rows distributed across the memories of
the processors in the parallel server. Thus, with 400 rows on 8 processors, the first 400/8 ==
50 rows would be on the first processor, the next 50 on the second processor, and so forth, in
a style known as row-distributed. Figure 3-1 illustrates the layout of a row-distributed array.
Figure 3-1 Row Distribution
Column distribution
Column-distribution works just the same as row distribution, except column data is split over
available processors; bpp is created that way above. Figure 3-2 illustrates the layout of a
column distributed array. When a *p is placed in more than one dimension, the matrix or
multi-dimensional array will be distributed in the rightmost dimension containing a *p. For
example, if there was a *p in both dimensions of the constructor for a two dimensional matrix,
it would result in a column distribution.
Distributed multidimensional arrays are also supported in Star-P®. They are distributed on
only a single dimension, like row- and column-distributed 1D or 2D matrices. Hence if you
create a distributed object with the following command, then app will be distributed on the
third dimension:
>> n = 10;
>> app = rand(n,n,n*p,n);
If you should happen to request distribution on more than one dimension, the resulting array
will be distributed on the rightmost non-singleton requested dimension. A singleton is defined
as a matrix dimension with a size equal to 1.
Distributed sparse matrices in Star-P® use the compressed sparse row format. Distributed
sparse matrices are represented as dsparse objects. This format represents the nonzeros in
each row by the index (in the row) of the nonzero and the value of the nonzero, as well as
one per-row entry in the matrix data structure. This format consumes storage proportional to
the number of nonzeros and the number of rows in the matrix. Sparse matrices in Star-P®
typically consume 12 bytes per double-precision element, compared to 8 bytes for a dense
matrix. The matrix is distributed by rows, with the same number of rows per processor
(modulo an incomplete number on the last processor(s)). Note that, as a consequence, it is
possible to create sparse matrices that do not take advantage of the parallel nature of the
server. For instance, if a series of operations creates a distributed sparse row vector, all of
that vector will reside on one processor and would typically be operated on by just that one
processor.
While one might imagine the data stored in three columns headed by i, j, Aij, in fact the data is
stored as described by this picture:
Figure 3-3 Star-P® Sparse Data Structure
Notice that if you subtract the row index vector from itself shifted one position to the left, you
get the number of elements in a row. This makes it clear what to do if element (2,2) with the
value of 59 gets deleted in Figure 3-3, resulting in no elements left in the second row. The
indices would then point to [1 3 3 5]. In other words, noticing that the number of non-zeros per
row is [2 0 2] in this case, you could perform a cumsum on [1 2 0 2] and obtain [1 3 3 5].
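In MATLAB terms, the row-pointer vector is simply the cumulative sum of the per-row counts prefixed by 1:
>> cumsum([1 2 0 2])
ans =
     1     3     3     5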
Figure 3-4 shows what happens when the sparse data structure from Figure 3-3 is
distributed across multiple processors by Star-P®. The number of rows is divided among the
participating processors, and each processor gets the information about its local rows. Thus
operations that occur on rows can happen locally on a processor; operations that occur on
columns require communication among the processors.
The dcell class is analogous to MATLAB cell arrays. The dcell type differs from the other
distributed matrix and array types in that it may not have the same number of data elements
per dcell iteration, and hence does not have the same degree of regularity as the other
distributions. This enables dcells to be used as return arguments for ppevalsplit(). For
more information on ppevalsplit, see "ppevalsplit" in "Star-P® Functions".
The data distribution mechanisms can be combined in a program. For instance, the array App
can be loaded from a file and then its dimensions used to create internal work arrays based
on the size of the passed array.
Similarly, input data created by ones or zeros or sprand can be used as input to other
functions, scripts, or toolboxes that are not aware of the distributed nature of their input, but
will work anyway. For example, the function foo is defined as follows:
In this example, the following code will then work because all the operators in foo are
defined for distributed objects as well as regular MATLAB objects:
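A sketch of such a function (the body shown here is illustrative, not the guide's original definition of foo), together with a call on a distributed input:
% foo.m -- uses only operators that are overloaded for distributed arrays
function y = foo(x)
y = fft(x) + x.^2;

>> App = rand(1000*p);
>> Bpp = foo(App);   % Bpp is distributed because App is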
These mechanisms are designed to work this way so that a few changes can be made when
data is input to the program or initially created, and then the rest of the code can be
untouched, giving high re-use and easy portability from standard MATLAB to Star-P®
execution.
The examples up until now have covered operations that included exclusively local or
distributed data. Of course, it is possible to have operations that include both. In this case,
Star-P® typically moves the local object from the client to the server, following the philosophy
that operations on distributed objects should create distributed objects. In the example here,
you can see this by the pptoc output showing 80KB received by the server.
>> A = rand(100);
>> Bpp = rand(100*p);
>> pptic; Cpp = A + Bpp; pptoc;
Client/server communication report:
Sent by server: 2 messages, 1.560e+02 bytes
Received by server: 2 messages, 8.017e+04 bytes
And of course, note that all scalars are local, so whenever a scalar is involved in a calculation
with a distributed object, it will be sent to the server.
The mixing of local and distributed data arrays is not as common as you might think.
Remember that Star-P® is intended for solving large problems, so distributed arrays will
typically be bigger than the memory of the client system. So, a typically sized distributed
array would not have an equal size client array to add to it.
There are cases where mixed calculations can be useful. For example, if a vector and a
matrix are being multiplied together, the vector may be naturally stored on the client, but a
calculation involving a distributed array will move it to the server.
You may have been wondering about these class types you have been seeing in the output of
ppwhos, namely dlayout, ddense, dsparse, and ddensend. Classes are the way that
MATLAB supports extensions of its baseline functionality, similar to the way C++ and other
languages support classes. To create a new class, it must have a name and a set of functions
that implement it.
The ddense class may be the simplest Star-P® class to understand. It is a dense matrix, just
like a MATLAB dense matrix, except it is distributed across the processors (memories) of the
HPC server system. When you create a distributed dense object, you will see its type listed
by ppwhos, as in the following example:
>> n = 1000;
>> App = ones(n*p);
>> Bpp = ones(n*p,n);
>> ppwhos
Creating a new class is simple. Having it do something useful requires operators that know
how to operate on the class. MATLAB allows class-specific operators to be in a directory
named @ddense, in the case of class ddense. For instance, if you wanted to know where the
routine is that implements a given operator, say sum, you would use the MATLAB which
command, as in the following example:
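A sketch of the commands the next paragraph discusses (installation-specific output paths are elided):
>> which sum            % location of the generic built-in sum
>> which @double/sum    % MATLAB's implementation for the double class
>> which @ddense/sum    % Star-P®'s implementation, found under <starp_root>/...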
In the above example, <starp_root> is the location where the Star-P® client installation took
place.
The which sum command tells you where the routine is that implements the sum operator for
a generic MATLAB object. The which @double/sum command tells you where the
MATLAB code is that implements the sum operator for the MATLAB double type. The which
@ddense/sum command tells you where the Star-P® code is that implements it for the
Star-P® ddense class. The MATLAB class support is essential to the creation of Star-P®’s
added classes.
Similarly to the ddense class, the dsparse class implements distributed sparse matrices.
Since the layout and format of data is different between dense and sparse matrices, typically
each will have its own code implementing primitive operators. The same holds for the
ddensend class implementing multidimensional arrays.
However, as shown in the hilb example below, there are non-primitive MATLAB routines
which use the underlying primitives that are implemented for ddense and dsparse. These
routines will work in the obvious way, and so no further class-specific version of the routine is
necessary.
The dlayout class is not as simple as the ddense and dsparse classes, because the only
function of the dlayout class is to declare dimensions of objects to be distributed. Thus, you
will see that operators are defined for dlayout only where it involves array construction (e.g.
ones, rand, speye) and simple operators often used in calculations on array bounds (for
example, max, floor, log2, abs). The complete set of functions supported by dlayout are
found in "Supported MATLAB® Functions". The only way to create an object of class
dlayout is to append a *p to an array bound at some point, or to create a distributed object
otherwise, as via ppload.
To create dlayout objects without the *p construct, we can import data with ppload and
extract the dlayout objects from the size of the imported variable.
>> n = 1000;
>> app = rand(n*p)
app =
ddense object: 1000-by-1000p
>> [rows, cols] = size(app)
rows =
1000
cols =
1000p
>> ppload imagedata App
>> Bpp = inv(App)
Bpp =
ddense object: 1000-by-1000p
>> [Brows, Bcols] = size(Bpp)
Brows =
1000
Bcols =
1000p
>> ppwhos
Your variables are:
Name Size Bytes Class
App 1000x1000p 8000000 ddense array
Bpp 1000x1000p 8000000 ddense array
Bcols 1x1 258 dlayout array
Brows 1x1 8 double array
app 1000x1000p 8000000 ddense array
cols 1x1 258 dlayout array
n 1x1 8 double array
rows 1x1 8 double array
Grand total is 3000005 elements using 24000540 bytes
MATLAB has a total of 5 elements using 540 bytes
Star-P® server has a total of 3000000 elements using 24000000 bytes
Since the distributed attribute of matrices and arrays is what triggers parallel execution, the
semantics of Star-P® have been carefully designed to propagate distribution as frequently as
possible. In general, operators which create data objects as large as their input (*, +, \ (linear
solve), fft, svd, etc.) will create distributed objects if their input is distributed. Operators
which reduce the dimensionality of their input, such as max or sum, will create distributed
objects if the resulting object is larger than a scalar (1x1 matrix). Routines that return a fixed
number of values, independent of the size of the input (like eigs, svds, and histc) will
return local MATLAB (non-distributed) objects even if the input is distributed. Operators
whose returns are bigger than the size of the input (e.g. kron) will return distributed objects if
any of their inputs are distributed. Note that indexing, whether for a reference or an
assignment, is just another operator, and follows the same rules.
The following example creates a distributed object through the propagation of a distributed
object. In this case, since App is created as a distributed object through the *p syntax, Bpp will
be created as distributed.
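A sketch consistent with the description (App built with ones and the *p syntax, Bpp produced by matrix multiplication; the exact original code may differ):
>> App = ones(1000*p);
>> Bpp = App * App;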
Note that in this example, both ones and “*” are overloaded operations and will perform the
same function whether the objects they operate on are local or distributed.
The following computes the eigenvalues of Xpp, and stores the result in a matrix Epp, which
resides on the server.
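>> Epp = eig(Xpp);   % Epp remains on the server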
The result is not returned to the client, unless explicitly requested, in order to reduce data
traffic.
Operators which reduce the dimensionality of their input naturally transition between
distributed and local arrays, in many cases allowing an existing MATLAB script to be reused
with Star-P® having little or no change. Putting together all of these concepts in a single
example, you can see how distribution propagates depending on the size of the output of an
operator. (Note that the example omits trailing semicolons for operators that create
distributed objects so their size will be apparent.)
In that case, distribution will propagate through its operations as follows (note that we are
omitting the use of a suffix pp variable notation here, since the script is being reused without
modification):
>> a = ones(1000*p,1000)
a =
ddense object: 1000p-by-1000
% now executing the commands in script 'propagate'
>> [rows, cols] = size(a)
rows =
1000p
cols =
1000
>> b = rand(rows,cols)
b =
ddense object: 1000p-by-1000
>> c = b+a
c =
ddense object: 1000p-by-1000
>> d = b*a
d =
ddense object: 1000p-by-1000
>> e = b.*a
e =
ddense object: 1000p-by-1000
>> f = max(e)
f =
ddense object: 1-by-1000p
>> ff = max(max(e))
ff =
1.0000
>> gg = sum(sum(e))
gg =
4.9991e+05
>> size(ff), size(gg)
ans =
1 1
ans =
1 1
>> h = fft(e)
h =
ddense object: 1000p-by-1000
>> i = ifft(h)
i =
ddense object: 1000p-by-1000
>> [i j k] = find(b > 0.95)
i =
ddense object: 49977p-by-1
j =
ddense object: 49977p-by-1
k =
ddense object: 49977p-by-1
>> q = sparse(i, j, k, rows, cols)
q =
dsparse object: 1000p-by-1000
>> r = q' + speye(rows);
>> s = svd(d);
>> t = svds(d,4);
>> ee = eig(d);
% end of 'propagate' script, back to main session
>> ppwhos
Your variables are:
Name Size Bytes Class
a 1000px1000 8000000 ddense array
ans 1x2 16 double array
b 1000px1000 8000000 ddense array
c 1000px1000 8000000 ddense array
cols 1x1 8 double array
d 1000px1000 8000000 ddense array
e 1000px1000 8000000 ddense array
ee 1000px1 16000 ddense array (complex)
f 1x1000p 8000 ddense array
ff 1x1 8 double array
gg 1x1 8 double array
i 49977px1 399816 ddense array
j 49977px1 399816 ddense array
k 49977px1 399816 ddense array
q 1000px1000 807696 dsparse array (sparse)
r 1000px1000 822688 dsparse array (sparse)
rows 1x1 258 dlayout array
s 1000px1 8000 ddense array
t 4x1 32 double array
As long as the sizes of the resulting arrays depend on the size of an input array, and hence
the results will likely be used in further parallel computations, the output arrays are created as distributed
objects. When the output is small and likely to be used in local operations in the MATLAB
front-end, it is created as a local object. For this example, with two exceptions, all of the
outputs have been created as distributed objects. The exceptions are rows, which is a scalar
of class dlayout, and t, whose size is based on the size of a value passed to svds. Even in
cases where dimensionality is reduced, as with find, when the resulting object is large, it is
created as distributed.
Propagation of Distribution
A natural question often asked is, “What is the distribution of the output of a given function
expressed in terms of the inputs?” In Star-P®, there is a general principle on distribution that
has been carefully implemented in the case of indexing and for a large class of functions.
Perhaps like the irregular verbs of a natural language, there are also a number of special cases
that do not follow these rules, some of which we list here.
In Star-P®, the value of the output of an operation does not depend on the distribution of its inputs. The
rules specifying the exact distribution of the output may vary in future releases of Star-P®.
Note: Performance and floating point accuracy may be affected, see "Accuracy of Star-P®
Routines" for more information.
Type Distribution
ddense row, column
ddensend linear distribution along any dimension
dsparse row distribution only
The distributions of the output of operations follow the “calculus of distribution”. To calculate
the expected distribution of the output of a given function, express the size of the output in
terms of the size of the inputs. Note that matrices and multidimensional arrays are never
distributed along singleton dimensions (dimensions with a size of one), unless explicitly
created that way.
In the simplest case, for functions of one argument where the size of the output is the size of
the input, the output distribution matches that of the input.
The cosine function operates on each element, and the output retains the same distribution
as the input:
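For example (the sizes are illustrative):
>> App = rand(1000*p,100);
>> Bpp = cos(App)
Bpp =
ddense object: 1000p-by-100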
A conjugate transpose exchanges the dimension sizes of its input, so it also exchanges the
dimensions' distribution attributes:
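Continuing the sketch above, the transpose of the row-distributed App is column distributed:
>> Cpp = App'
Cpp =
ddense object: 100-by-1000p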
Exceptions:
Certain Linear Algebra functions such as qr, svd, eig and schur benefit from a different
approach and do not follow this rule. See "Single ddense arguments" below.
For functions with multiple input arguments, we again express the size of the output in terms
of the size of the inputs. When the calculation provides an ambiguous result, the output will
be distributed in the rightmost dimension that has a size greater than one.
For operations in which the output size is the same as both inputs, such as element-wise
operations (App+Bpp, App.*Bpp, App./Bpp, etc), we consider the distribution of both inputs.
If both inputs are row distributed, then the output will be row distributed. If the combination of
inputs has more than one distributed dimension, then the default of distributing on the
rightmost dimension applies.
For Cpp=App*Bpp, if App and Bpp are both row distributed, the output will have its first
dimension distributed as a result of the fact that App has its first dimension distributed. Its
second dimension will not be distributed since Bpp's second dimension is not distributed.
Therefore Cpp will be row distributed as well.
For Cpp=App*Bpp, if App and Bpp are both column distributed, similar logic forces the output
to be column distributed.
For Cpp=App*Bpp, if App is row distributed and Bpp is column distributed, the calculus of
distribution indicates that both dimensions of the output should be distributed. Since this is
not permissible, the rightmost dimension is distributed, resulting in a column distribution.
For Cpp=App*Bpp, If App is column distributed and Bpp is row distributed, the calculus of
distribution indicates that neither dimension of the output should be distributed. Once again,
we fall back on the default of distributing the rightmost (column) dimension.
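For instance, a sketch of the mixed case (the sizes are illustrative):
>> App = rand(100*p,100);    % row distributed
>> Bpp = rand(100,100*p);    % column distributed
>> Cpp = App*Bpp
Cpp =
ddense object: 100-by-100p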
As a less trivial example, consider Cpp = kron(App,Bpp). The sizes of the dimensions of
Cpp are calculated through the following formula:
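size(Cpp) = [size(App,1)*size(Bpp,1) size(App,2)*size(Bpp,2)]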
The resulting distribution would be ambiguous, so it defaults to the standard of distributing the
rightmost dimension:
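(Assuming, for illustration, that App is a 100p-by-20 ddense matrix and Bpp is a 100-by-20p ddense matrix:)
>> Cpp = kron(App,Bpp)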
Cpp =
ddense object: 10000-by-400p
Cpp = App.'
size(Cpp) = [size(App,2) size(App,1)]
For transpose, if App is row distributed, the output will be column distributed. If App is column
distributed, the output will be row distributed.
The following operations benefit from special-case rules and must be accounted for one by
one. The list below covers only the non-trivial cases.
Indexing Operations
Indexing operations follow the same style of rules as other operations. Since the output size
depends on the size of the indices (as opposed to the size of the array being indexed), the
output distribution will depend on the distribution of the arguments being used to index into
the array. If all objects being used to index into the array are front-end objects, then the result
will default to distribution along the rightmost dimension.
Indexing is a particularly tricky example, because subsref has many different forms.
Bpp = App(:,:) has the same distribution as App, because size(Bpp) == size(App).
Bpp = App(:) vectorizes (linearizes) the elements of App, so the output will be row or column
distributed accordingly.
Other linear indexing forms inherit the output distribution from the indexing array:
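For example (a sketch, assuming ppback distributes the index vector along its last dimension and App has at least 500 elements):
>> Ipp = ppback(1:500);      % 1-by-500p index vector
>> Bpp = App(Ipp)
Bpp =
ddense object: 1-by-500p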
To summarize:
• Output distributions follow the “calculus of distribution” in which the rules for
determining the size of the output define the rules for the distribution of the output,
though a selection of Linear Algebra functions do not follow these rules.
• Typically, functions with one input and one output will have outputs that match the
distribution of the input.
• When the output distribution will be ambiguous or undefined by the standard rules,
the output will be distributed along its rightmost dimension.
• Outputs are never distributed along singleton dimensions (dimensions with a size of
one).
In some instances a user wants to move data explicitly between the client and the server. The
ppback command and its inverse, ppfront, perform these functions.
>> n = 1000;
>> mA = rand(n);
>> mB = rand(n);
>> ppwhos
Your variables are:
Name Size Bytes Class
mA 1000x1000 8000000 double array
mB 1000x1000 8000000 double array
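A sketch of moving one of these arrays to the server and back again (the name mApp is illustrative):
>> mApp = ppback(mA);
>> mA2 = ppfront(mApp);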
ppfront is the inverse operation, and is in fact the only interface for moving data back to the
front end system. This conforms to the principle that once you, the programmer, have
declared data to be distributed, it should stay distributed unless you explicitly want it back on
the front end. Early experience showed that some implicit forms of moving data back to the
front end were subtle enough that users sometimes moved much more data than they
intended, introducing correctness problems (due to memory size) or performance problems.
Note that the memory size of the client system running MATLAB, compared to the parallel
server, will usually prevent full-scale distributed arrays from being transferred back to the
client.
The ppback and ppfront commands will emit a warning message if the array already
resides on the destination (parallel server or client, respectively), so you will know if the
movement is superfluous or if the array is not where you think it is.
These two commands, as well as the ppchangedist command, will also emit a warning
message if the array being moved is bigger than a threshold data size (default size being
100MB). The messages can be disabled, or the threshold changed, by use of the
ppsetoption command, documented in "Star-P® Functions".
Just as the load command reads data from a file into MATLAB variable(s), the ppload
command reads data from a file into distributed Star-P® variable(s). Assume that you have
a file created from a prior MATLAB or Star-P® run, called imagedata.mat, with variables
App and Bpp in it. (MATLAB or Star-P® appends the .mat suffix.) You can then read that data
into a distributed object in Star-P® as follows:
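A sketch, assuming ppload follows the same calling forms as load:
>> ppload imagedata App Bpp
or, equivalently,
>> ppload('imagedata','App','Bpp')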
Note that the file to be loaded from must be available in a filesystem visible from the HPC
server system, not just from the client system on which MATLAB itself is executing.
Consequently, if your .mat file is initially located on your client system, then copy the file into
a working directory on your server.
The distributed I/O commands ppload and ppsave store distributed matrices in the same
uncompressed Level 5 .mat-File Format used by MATLAB.
Information about which dimension(s) of an array were distributed is not saved with the
array, so ddense matrices retrieved by ppload will, by default, be distributed on the last
dimension.
Note: The use of *p to make objects distributed and thereby make operators parallel can
almost always be made backwards compatible with MATLAB by setting p = 1. The
use of ppload does not have the same backward compatibility.
If you use ppsave to store distributed matrices into a file, you can later use load to retrieve
the objects into the MATLAB client. Distributed matrices (ddense and dsparse) will be
converted to local matrices (full and sparse), as if ppfront had been invoked on them. (The
exception to this operation is that some very large matrices break .mat-File compatibility; if
ppsave is applied to a distributed matrix with more than 2^32 rows or columns, or whose
data requires more than 2^31 bytes of storage, then load may not be able to read the file.)
To move data from the front-end to the back-end via a file, the MATLAB save command must
use the -v6 format, as in save('foo','w','-v6') for saving variable w in file foo. Then
you can use ppload to read the resulting file to the server. This will convert local matrices to
global matrices, just as if ppback had been invoked, except that the resulting matrices will be
distributed only on the last dimension.
Star-P®’s ppload command cannot yet read the older Level 4 .mat-File, nor the
compressed Level 5 format. Use the -v6 flag in the MATLAB client to convert such files to
uncompressed Level 5 format.
Another method of loading data is through the use of the ppfopen command. By calling
ppfopen with only a single string argument specifying a target file to open, the contents of
the file are opened in a read-only mode. The following command opens your_file and
returns a distributed file identifier of class @dfid.
fid = ppfopen('your_file');
Using a second input argument to ppfopen, further permissions for handling the contents of
the target file can be specified.
fid = ppfopen('your_file',MODE);
The input MODE can take values that allow for various permissions for viewing or altering the
file’s contents.
MODE    Permission
'rb'    read
'wb'    write (create if necessary)
'ab'    append (create if necessary)
'rb+'   read and write (do not create)
'wb+'   truncate or create for read and write
'ab+'   read and append (create if necessary)
Note: Only native machine format is supported and the ppfopen interface will return an error
if the caller tries to specify a different machine format or encoding parameter.
The functions fopen, fread, fwrite, frewind, and fclose have been overloaded to
work with distributed data, including distributed file identifiers.
For example, the fread function can be used in the following manner to assign a 1000 by
1000 matrix to a variable that has previously been associated with the distributed file
identifier fid:
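A sketch, assuming the overloaded fread accepts a size vector containing a *p dimension to specify the distribution (the precision string is illustrative):
>> App = fread(fid,[1000 1000*p],'double');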
You will notice that fread allows you to specify the distribution properties of the data assigned
to the distributed variable App.
Star-P® supports import and export of datasets in the Hierarchical Data Format, Version 5
(HDF5). The HDF5 format
• is widely used in the high-performance computing community,
• is portable across platforms,
• provides built-in support for storing large scientific datasets (larger than 2GB) and
• permits lossless compression of data.
For more information about the HDF5 file format, please visit http://hdf.ncsa.uiuc.edu/HDF5.
The Star-P® interface to the HDF5 file format currently supports the import and export of
distributed dense and sparse matrices with double precision and complex double precision
elements. In addition, a utility function is provided to list meta-data information about all
variables stored in a HDF5 file.
The next few sub-sections discuss the syntax of the individual HDF5 commands in more
detail.
Distributed variables are written to a remote HDF5 file using the pph5write command. This
command takes a filename, and a list of pairs consisting of a distributed variable and its
corresponding fully-qualified dataset name within the HDF5 file. If the file already exists, an
optional string argument can be passed to the command: 'clobber' causes the file to be
overwritten and 'append' causes the variables to be appended to the file. The default mode
is 'clobber'. If the write mode is 'append' and a variable already exists in the location
specified, it is replaced.
Example 1
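A possible sketch (the variable names matrix_a and matrix_b and their dataset paths are hypothetical; the file name /tmp/temp.h5 is taken from the pph5whos call shown later):
pph5write('/tmp/temp.h5', matrix_a, '/my_matrices/workspace1/matrix_a', ...
          matrix_b, '/my_matrices/workspace1/matrix_b');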
Example 2
To append a distributed variable matrix_c to the HDF5 file created in the previous example
to the location /my_matrices/workspace2/temp/matrix_c, one would use:
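A possible form, assuming the optional mode string follows the file name:
pph5write('/tmp/temp.h5', 'append', matrix_c, '/my_matrices/workspace2/temp/matrix_c');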
Datasets in a HDF5 file can be read into distributed variables using the pph5read command.
It takes a file name and a list of fully-qualified dataset names to read.
Example 3
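A possible form, reading back the dataset written in Example 2 (the output variable name is illustrative):
Cpp = pph5read('/tmp/temp.h5', '/my_matrices/workspace2/temp/matrix_c');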
It is possible to obtain a list of variables stored in an HDF5 file and their associated types
using the pph5whos command that takes in the name of the HDF5 file as its sole argument.
With a single output argument, the command returns a structure array containing the variable
name, dimensions and type information. With no output arguments, the command simply
prints the output on the MATLAB console.
Example 4
Running pph5whos on the file after running Examples 1 and 2, the following is obtained:
>> pph5whos('/tmp/temp.h5')
This section describes the internal representation of HDF5 files used by the functions
described previously. If the HDF5 file to be read is not generated using pph5write, it is
important to read the following subsections carefully.
Multidimensional arrays
Distributed matrices are stored in column-major (or Fortran) ordering. Therefore, pph5write
follows the same strategy used by Fortran programs that import or export data in the HDF5
format: multidimensional matrices are written to disk in the same order in which they are
stored in memory, except that the dimensions are reversed. This implies that HDF5 files
generated from a C program will have their dimensions permuted when read back in using
pph5read, but the dimensions will not be permuted if the HDF5 file was generated either
using a Fortran program or pph5write. In the former case, the data must be manually
permuted using ctranspose for two-dimensional and permute for multidimensional
matrices.
Complex data
An array of complex numbers is stored in an interleaved format consisting of pairs of HDF5
native double-precision numbers representing the real and imaginary components.
Sparse matrices
A sparse matrix is stored in its own group, consisting of three attributes (a sparsity flag,
IS_SPARSE, the number of rows, ROWS and the number of columns, COLS) and three
datasets (row_indices, col_indices and nonzero_vals) containing the matrix data
stored in the triplet form. All attributes and datasets are stored as double precision numbers,
except IS_SPARSE which is stored as an integer and nonzero_vals which can either be
double or double complex.
Limitations
The HDF5 import-export features in Star-P® currently differ from that provided in MATLAB in
the following respects:
1. Permutations of dimensions for multidimensional arrays. MATLAB only permutes the first
two dimensions even for multidimensional arrays; the permutation in Star-P® is consistent
with that used by other Fortran programs.
2. Handling of complex matrices. MATLAB does not support saving of complex matrices
natively.
3. Handling of sparse matrices. MATLAB does not support saving of sparse matrices
natively.
4. Handling of hdf5 objects. Star-P® currently does not support the loading and saving of
datasets described using instances of the hdf5 class supported by MATLAB.
5. Direct access to the HDF5 library. Unlike MATLAB, Star-P® does not provide direct access
to the HDF5 library; all access must happen through the pph5write, pph5read and
pph5whos commands.
In the previous chapter, the operators used on distributed arrays operated on the entire
array(s) in a fine-grained parallel approach. While this operation is easy to understand and
easy to implement (in terms of changing only a few lines of code), there are other types of
parallelism that don't fit this model. The ppeval function allows for coarse-grained parallel
computation, otherwise known as MIMD (multiple instruction multiple data) or task
parallelism, where operations are conducted on blocks (coarse grains) of the data. This
coarse-grained computation is distributed uniformly over the number of parallel processors.
This mode of computation allows non-uniform parallelism to be expressed (e.g., the sum
operator could be used on odd columns and the max operator on even columns).
This chapter contains information on performing operations in task parallel and includes the
following:
• "The ppeval Function: The Mechanism for Task Parallelism"
ppeval allows you to execute built-in functions and user-defined functions in parallel on a
High Performance Computer. ppeval handles the distribution of data and code over the
processors in the HPC, as well as the execution and the gathering of computational results.
To define some of relevant terminology for ppeval, let’s look again at the example from
“Extending MATLAB with Star-P®”.
Xpp = rand(1000,1000,100*p);
Ypp = ppeval('inv',Xpp);
In this example, the ppeval call splits up the variable Xpp into 100 individual slices (by default
splitting is done along the last dimension). The slices are then divided over the available
processors; so in the case of 100 slices and 10 processors, each processor would receive 10
slices. Each processor iterates over the slices it receives and applies the function 'inv' to
each of the slices. When each processor has completed its job, the results of all processors are
combined, preserving order, and returned as the output value.
You can view ppeval as a parallel loop. You cannot assume anything about the order in
which the iterations occur or the processor(s) on which they occur. Since the computations of
the individual iterations are performed in complete isolation of all the other iterations, ppeval
requires that the computation being performed is independent over the iterations.
Consequently, functions that contain recursive relations or that update variables based on
sequentially previous iterations inside the function body are not applicable for task-parallel
execution.
1. A subset of the MATLAB operators is supported. While you might want to extend this set with
routines that are part of MATLAB or one of its toolboxes, The MathWorks software license
prohibits this given the way ppeval is implemented. To comply with this prohibition, ppeval
will not move to the HPC server any routines that are generated by The MathWorks.
Star-P® commands and data types generally use the following conventions, to distinguish
them from standard MATLAB commands and data types:
• Most Star-P® commands begin with the letters pp, to indicate parallel. For example,
the Star-P® ppload command loads a distributed matrix from local files. Exceptions
to this rule include the split and bcast commands.
• Star-P® data types begin with the letter d, to indicate “distributed”. For example, the
Star-P® dsparse class implements distributed sparse matrices.
The following convention for displaying Star-P® related commands and classes is used
throughout this chapter.
Command/Variable Font
p & other dlayout variables bold green font
The typical work-flow of introducing ppeval into a code that is currently serial takes the
following steps:
1. Identify a for loop that is embarrassingly parallel.
2. Determine the input and output variables of the for loop.
3. Transform the body of the for loop into a function.
4. Call your newly defined function with ppeval using the correct input and output
variables.
Here we will walk through an example of these steps. Below we will discuss ways in which
the user can control the splitting and broadcasting behavior of the input variables to ppeval.
x = rand(n,n,m);
y = rand(n,m);
z = zeros(n,m);
for i = 1:m
[v d] = eig(x(:,:,i));
a = v*y(:,i) + diag(d);
z(:,i) = x(:,:,i)\a;
end
The for loop in this example is indeed embarrassingly parallel since it contains no recurrent
relations and/or variable updates. In principle, if we had m computers, then we could compute
one iteration of the for loop on every computer and obtain the correct result after
recombining them.
The input variables to the loop are x and y and the output variable is z. The variables v, d,
and a are variables whose scope is limited to the for loop.
The function foo1, defined below, contains the for loop body content with output variable z
and input variables x and y.
function z = foo1(x,y)
[v d] = eig(x);
a = v*y + diag(d);
z = x\a;
Note that we removed all of the indexing operations that are present in the for loop body in
Step 1. Since the ppeval process splits the variables x and y into individual slices along the
last dimension (by default), ppeval does the indexing operations for you in the process of
dividing the input data and gathering the output data.
Now that we have defined our function foo1 and we know the input and the output
arguments, we can perform the ppeval call:
X = rand(n,n,m*p);
Y = rand(n,m*p);
Zpp = ppeval('foo1',X,Y);
This completes the transformation of the serial for loop to a task-parallel execution of the
same code with Star-P®.
Note: X and Y do not necessarily have to be distributed objects. See "Splitting" and
"Broadcasting" for more details.
Note: It might seem natural that you could transform a for loop to run in parallel just
by adding a *p to the loop bounds. Unfortunately, this does not have the desired
effect. The simplicity of this approach has not been lost on the Star-P® developers,
and some support for this method may appear in a future release.
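The general calling form is (a syntax template; the names are placeholders):
[o1, o2, ...] = ppeval('foo', In1, In2, ...)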
foo is the name of the function you would like to execute in task parallel. In1, In2, ... are the
input arguments to foo and o1, o2, ... are the output arguments of foo. The supported
input argument types are: strings, function handles, scalars, arrays, and matrices (see
workaround section below for input arguments of type string-array and struct-array) and the
supported output arguments are scalars, arrays, and matrices.
Input Arguments
The user has complete control over the splitting and broadcasting of input variables with the
split/ppsplit and bcast/ppbcast commands. These commands can only be used in
conjunction with the ppeval command.
Default Behavior
By default all scalars, strings, and function handles are broadcast to every processor on the
HPC. Every processor receives an identical copy. Arrays and matrices are split up into slices
along the last dimension and divided over the processors. The default behavior of splitting
and broadcasting input arguments can be overridden by the user.
Splitting
To split an array or matrix in a dimension other than the last dimension, use the split
command in conjunction with ppeval. The syntax of the split/ppsplit commands is
split(A,DIM)
ppsplit(A,DIM)
where A is the input argument and DIM is the dimension along which you want to split the
variable A. The possible arguments to split and ppsplit are:
As stated above, the split and ppsplit commands can only be used in conjunction with
ppeval. For example:
Xpp = rand(100*p,1000,1000);
Ypp = ppeval('inv',split(Xpp,1));
Broadcasting
To send an identical, complete copy of an array or matrix to every processor instead of splitting it, use the bcast (or ppbcast) command in conjunction with ppeval. For example:
Xpp = rand(n*p);
Ypp = rand(n,n,m*p);
Zpp = ppeval('+',Ypp,bcast(Xpp));
This ppeval command is equivalent to the following for loop, apart from the fact that it is
executed in parallel rather than serially:
x = rand(n);
y = rand(n,n,m);
z = zeros(n,n,m);
for i = 1:m
z(:,:,i) = y(:,:,i) + x;
end
The supported input argument types are: strings and functions handles, as well as scalars,
arrays, and matrices of type double and complex double. Scalars, arrays, and matrices of
other types (for example single, ints, logical) are first converted to type double before being
transferred to ppeval. By default, strings, function handles and scalars are broadcast.
This section discusses how you can use ppeval in non-broadcast (serial mode) with a single
scalar input argument.
Example
>> ppsetoption('TaskParallelEngine','octave')
>> ppeval('rand',1)
??? Error using ==> ppeval_octave at 208
At least one argument in a call to ppeval must be split
(either implicitly or explicitly)
The message indicates that at least one of the input arguments must be split. By default scalar
input arguments are not split. They are broadcast. It is possible to split on a scalar by explicitly
including the split command, such as:
>> ppeval('rand',split(1));
This results in one function evaluation of the function rand with the input argument 1.
Example
>> ppsetoption('TaskParallelEngine','starp_tpe')
>> ppeval('ones',1 ) % This example was run on a two processor install
ans =
1x2 double
So, the execution of ppeval is effectively a per-process evaluation of the function ones. If you
want just one function evaluation, use the split function.
Example
>> ppsetoption('TaskParallelEngine','starp_tpe')
>> ppeval('rand',split(1))% This syntax works the same as w/ the octave engine.
ans =
0.8147
In the examples above, we used the *p construct to create the data that ppeval operates
on. It is not a necessary requirement that ppeval operate on server variables only. In the
case that ppeval receives a client variable, say a MATLAB variable, ppeval will first move
the client variable to the server. Then the task parallel operation will be performed. Hence the
result of operating on a client or server variable will be exactly the same. However, since the
client variable must be moved from the client to the server, you will incur a performance
penalty (moving large amounts of data over networks can be costly).
The Star-P® server stores variables in a distributed matrix fashion. The information/memory
contained by one variable is divided across the processors with each processor having
access to part of the data. Star-P® supports several distributions. 2D matrices can be stored
by rows or columns, where each processor has access to a single set of rows or columns
respectively. ND arrays can be distributed only along one of the dimensions.
For the correctness of the ppeval execution, the dimensions of distributions for server
variables are not important. However, the dimensions of distributions do have an effect on the
performance characteristics of the ppeval execution. The best performance is achieved
when the distributed dimension and the split dimension are the same; for example, splitting
an input variable Xpp, defined by Xpp = rand(10,10*p,10), as ppsplit(Xpp,2).
This superior performance occurs because all of the data is already distributed to the correct
processors. As a counter-example, if an input variable is row-distributed (along the first
dimension of a 2D matrix), and the ppeval splits the input along the columns (along the
second dimension of a 2D matrix), then the first operation that must be performed is a
distribution change of the input data. These operations are not free; they cost communication
time. Consequently, the optimal performance of a ppeval
operation occurs when all of the variables to be “split” are distributed along the same
dimension as the dimension requested for the split or ppsplit operations.
Output Arguments
The supported output arguments to ppeval are scalars, arrays, and matrices of type double
and complex-double. Scalars, arrays, and matrices of different types are converted to double
before being handed from the task parallel engine to Star-P®. Additionally, each of the output
arguments of the function called by ppeval need to have the same size for each iteration, or
for each input slice for that function. For example, the following function func2, with input
scalar variables in1 and in2, always returns output variables of size 1-by-10, 3-by-5 and
13-by-1:
function [out1, out2, out3] = func2(in1, in2)
out1 = zeros(1,10);
out2 = zeros(3,5);
out3 = zeros(13,1);
As you can see, for every call to func2, the outputs will have exactly the same size. The
requirement that the function called by ppeval returns arguments of the same size is
important because of the way that ppeval returns the aggregate of task parallel
computation. ppeval laminates the outputs for each iteration together along an additional
dimension. The rules for laminating the output are the following:
1. If the output is a scalar, then laminate them in the column direction, and distribute by
columns.
2. If the output is a row-vector, then laminate them in the row dimension, and distribute by
rows.
3. If the output is a column-vector, then laminate them in the column dimension, and
distribute by columns.
4. If the output is a 2D array or ND array, then laminate them in an additional dimension,
and distribute along that dimension.
This means that if the size of the output argument of the function called by ppeval is
k-by-1 and the ppeval operation performs r iterations, then the output of ppeval is of size
k-by-1-by-rp.
Let’s first consider a simple example. Rather than using the built-in sum function on a
ddense array, you could code it using ppeval and sum on a row or column.
>> n = 100
n =
100
>> App = 1:n*p
App =
ddense object: 1-by-100p
>> Bpp = repmat(App,n,1)
Bpp =
ddense object: 100-by-100p
>> ppfront(Bpp(1:6,1:6))
ans =
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
>> Cpp = ppeval('sum',Bpp)
Cpp =
ddense object: 1-by-100p
>> ppfront(Cpp(1,1:6))
ans =
100 200 300 400 500 600
>> Epp = ppeval('sum',ppsplit(Bpp,1))
Epp =
ddense object: 1-by-100p
>> ppfront(Epp(1,1:6))
ans =
5050 5050 5050 5050 5050 5050
• The first call in the previous ppeval example uses the default behavior to split its
arguments along the last dimension (columns, in the case of 2D matrices).
• The variable b in the previous example did not need to be distributed, created with
the *p construct, or transferred to the server using ppback, because ppeval
automatically handled its distribution on the server.
• In the second call to ppeval, it was desired to split along rows, so the ppsplit
function was used explicitly to obtain that result.
In this example, the function 'sum' was called on each column of the input array. While
useful for a simple example of functionality, you would not do this in practice because the
sum operator on the whole array has the same behavior and is simpler to use. However, as
shown in the next example, the function passed to ppeval does not have to perform the
same computation for each input, and thus can be used to implement MIMD/task parallelism.
In this example, we will make use of the MATLAB function quad, which computes the definite
integral of a function over an interval. The function being integrated could be highly nonlinear
in its behavior, but ppeval supports that functionality.
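A sketch of one such call (the integrand and the interval endpoints are hypothetical):
fun = @(x) exp(-x.^2);          % hypothetical integrand
a = 0:0.1:0.9;                  % left endpoints of ten subintervals
b = a + 0.1;                    % right endpoints
Ipp = ppeval('quad', fun, split(a), split(b));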
ppevalsplit
In the case where the function returns a differently sized output for every iteration (see func4
below), the ppeval procedure will fail, since there is no logical way of laminating the output
values of the function. In this case, Star-P® provides the user with the ppevalsplit
command, which returns the outputs of the individual iterations in a cell array to the client.
Cell arrays are capable of holding variables of different sizes. Although the cell array returned
from a ppevalsplit call is stored as a variable of type dcell on the Star-P® server, when
indexing into a dcell array, the contents are automatically returned to the client. An example
of a function that returns differently sized outputs and how to use it with ppevalsplit
follows:
function out = func4(in)
% A hypothetical reconstruction: the size of the output depends on the
% value of the input, so each iteration returns a differently sized result.
out = rand(in,1);
end
in = 1:10;
outpp = ppevalsplit('func4',in);
When performing task parallel operations in Star-P®, ppeval and ppevalsplit allow
you to choose the environment that will be used to perform those operations.
In using ppeval or ppevalsplit, use one of the following as your task parallel engine:
• Star-P® M TPE
• Star-P® Octave TPE
• Your own compiled C/C++ functions
The Star-P® M TPE and Star-P® Octave TPE provide high performance computing
compatible with MATLAB m-files.
Note! The Star-P® M TPE provides the fastest performance times for non-Altix users.
When choosing the option of using your own compiled C/C++ functions, packages must be
loaded on to the server and used in accordance with the instructions specified in the “Star-P®
Software Development Kit (SDK) Tutorial and Reference Guide”.
Note! Choose an Octave task-parallel engine for task parallel applications that use sparse arrays
or if you are running Star-P® on Altix. Star-P® TPE support for sparse task parallel
operations will be implemented as a follow on to Star-P® Release 2.7.
Star-P® M TPE
The Star-P® M task-parallel engine (starp_tpe) is native to Star-P® and yields the best
overall performance for non-Altix users when using ppeval. Select it by using the following
call:
ppsetoption('TaskParallelEngine','starp_tpe')
You can choose Octave as the task parallel engine used when you call ppeval. If your task
parallel codes require the use of MEX file functionality or include functionality that was not
included in Octave 2.9.5, then you will want to set the task parallel engine to Octave 2.9.9 by
calling:
ppsetoption('TaskParallelEngine','octave-2.9.9')
Within a given Star-P® session, only a single version of Octave can be set using
ppsetoption. If you initially call:
ppsetoption('TaskParallelEngine','octave-2.9.9')
you cannot switch to Octave 2.9.5 during that session. Likewise, by initially calling:
ppsetoption('TaskParallelEngine','octave-2.9.5')
you set the Octave version to 2.9.5 for the duration of the session. The first call to
ppsetoption('TaskParallelEngine','octave-2.9.x') sets the available
Octave engine for that particular Star-P® session.
In order to set the task parallel engine to use your own compiled C/C++ functions,
ppsetoption can be configured in the following manner:
ppsetoption('TaskParallelEngine','C')
This allows you to write functions that run in parallel in C/C++ rather than Octave or Star-P®
M with the Star-P® TPE. Use it if you have serial libraries in C and C++ you want to use in a
task parallel manner. Using ppeval or ppevalsplit makes it easy to write wrapper
functions in C/C++ using the task parallel Star-P® SDK API. When using ppeval or
ppevalsplit in this manner, you need to build the function on the server and copy the
module over to the HPC server machines.
The function pploadpackage can be used to load previously compiled shared object
libraries whose contents can then be called using ppeval. To load a compiled library for
task parallel operations, call pploadpackage in one of two ways:
stringTP = pploadpackage('C','/path/to/package.so','TPname')
stringTP = pploadpackage('C','/path/to/package.so')
For more information about the task parallel API for using ppeval and ppevalsplit with
compiled languages, see the “Star-P® Software Development Kit (SDK) Tutorial and
Reference Guide”.
It is often useful to perform a certain operation only once per processor rather than
performing the exact same operation within each iteration. Examples of such operations
include opening and closing files or setting global variables. To enable a per process
execution, one can use the Star-P® function named np, which returns the number of
processors active in the current Star-P® session. For instance, the following sections of code
exemplify how to open a file for reading, read and process the data in the file, and close the
file:
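One possible sketch (the file name and the processing performed are hypothetical; the use of ppsplit(1:np) is explained below):
fidpp = ppeval('open_file', ppsplit(1:np));
respp = ppeval('process_file', fidpp);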
where the functions open_file and process_file could look something like:
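function fid = open_file(i)
% Runs once per process; the argument i only forces one iteration per
% processor. The file name is hypothetical and must be visible from the server.
fid = fopen('mydata.bin','r');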
and
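function s = process_file(fid)
% Read the data, compute a (hypothetical) summary value, and close the file.
x = fread(fid,inf,'double');
s = sum(x);
fclose(fid);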
Note that in the first line of the example above we used ppsplit(1:np) instead of 1:np.
This is because in the case that np happens to be equal to 1, the expression 1:np returns a
scalar. Normally, this input is not valid, because ppeval then has no input argument over
which it can iterate; in other words, ppeval would receive a string and a scalar, both of which
are broadcast by default. To override this behavior, use the ppsplit command on the
1:np expression.
The ppeval function can also be used to call a non-MATLAB program, via the system
function and get results from that executable back into the Star-P® context. The simple
example here illustrates a function callapp2 that calls a pipeline of shell commands that
returns the number of currently executing processes for a given user ID.
function z = callapp2(uid)
s = sprintf('ps -ael | grep %i | wc -l\n',uid)
[status, result] = system(s);
z = str2num(result);
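A hypothetical invocation, assuming App is a row vector of user IDs:
Zpp = ppeval('callapp2', split(App));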
Note: In this case, the variable App has been split evenly across the available processors,
whose number is given by np. The default behavior of split(App) is to split along
the last dimension, which for a row vector is equivalent to split(App,2).
By extension of this last example, almost any executable program could be called in parallel
via ppeval using the system command, including end-user applications (written in C, C++
or Fortran) or third-party applications such as ANSYS, NASTRAN, FLUENT, or Gaussian. For
further information on incorporating external applications in Star-P®, see the “Star-P®
Software Development Kit (SDK) Tutorial and Reference Guide”.
Some limitations result from the requirements on the types of the input and output
arguments to ppeval. There are, however, several ways to work around them.
This section provides guidelines for the
following:
• “String Arrays”
• “Splitting on a Scalar”
• “Global Variables”
String Arrays
To work with string arrays and ppeval, we use the fact that characters can be converted to
variables of type double and doubles can be converted to variables of type character. So with
minor modifications to a code, we can incorporate string arrays into applications that use
ppeval to increase their performance.
str_arr = ['filename1';'filename2';'filename3'];
fnames = char(ppfront(ppeval('func',double(str_arr))));
function y = func(x)
y = x;
The ppfront command in this example is necessary since the char function has not been
implemented in Star-P® and since Star-P® currently can only store objects of type double or
double-complex.
Splitting on a Scalar
As shown earlier, scalar input arguments are broadcast by default. To force a single function
evaluation with a scalar input, wrap the scalar with split (or ppsplit), as in ppeval('rand',split(1)).
Global Variables
When running code inside of a ppeval command, several global variables that are specific
to Star-P® are defined. These are:
1. PP_COMM_SIZE : The number of processors and ppeval engine processes.
2. PP_MY_RANK : The rank of this ppeval engine process, running from 0 to
PP_COMM_SIZE-1.
3. PP_TEMP_DIR : The temporary work directory for the ppeval engine process.
4. PP_CUR_ITER : The value of the current iteration for each ppeval engine process.
The PP_CUR_ITER counter runs from one to the number of slices for each ppeval
engine process.
Star-P® enables MATLAB users to harness the computing power of HPC systems from within
their familiar desktop environment. But as with any other software development environment
or tool, there are advantageous and disadvantageous methods of using Star-P®.
This chapter provides tips for structuring your MATLAB codes for optimal performance using
Star-P® and describes tools that can be utilized for monitoring and profiling the performance
of your MATLAB applications using Star-P®. The tips and tools contained in this chapter are
organized into the following sections:
• "Performance and Productivity"
• "Tips for Data Parallel Code"
• "Tips for Task Parallel Code"
• "Using External Libraries"
The two most common reasons for users moving off their desktops to parallel computers are:
• to solve larger problems
• to solve problems faster
To make the most of Star-P®, you need to find your own “comfort level” in the trade-off
between productivity and performance. This is not a new trade-off. In 1956, the first so-called
high level computer language was invented: FORTRAN. At the time, the language was highly
criticized because of its relatively poor performance compared to programs that were highly
tuned for special machines. Of course, as the years passed, the higher-level language
outlasted any code developed for any one machine. Libraries became available and
compilers improved.
This lesson is valuable today. To take advantage of Star-P®, you will benefit from simply
writing MATLAB code, and inserting the characters *p at just the right times. You can improve
performance both in terms of problem sizes and speed by any of the following means:
• restructure the serial MATLAB program through vectorization (described in
“Vectorization”)
• restructure the serial MATLAB program through uses of functionally equivalent
commands that run faster
• restructure the serial MATLAB program through algorithmic changes
You may not wish to change your MATLAB programs. Programs are written in a certain style
that expresses the job that needs to be done. Psychologically, a change to the code may feel
risky or uncomfortable. Programmers who are willing to make small or even large changes to
programs may find huge performance increases both in serial MATLAB and with Star-P®.
Typically, changes that speed up serial MATLAB also speed up Star-P®. In other words, the
benefits of speeding up the serial code multiply when going parallel.
You may want to develop new applications rapidly that work on very large problems, but
absolute performance may not be critically important. The MATLAB operators have proven to
be very powerful for expressing typical scientific and engineering problems. Star-P® provides
a simple way to use those operators on large data sets. Today, Star-P® is early in its product
life, and will undoubtedly see significant optimizations of existing operators in future releases.
Your programs will transparently see the benefit of those optimizations. You benefit from ease
of use and portability of code today.
Vectorization
Vectorization speeds up serial MATLAB programs and eases the path to parallelization in
many instances.
Note: The following MATLAB timings were performed on a Dell Dimension 2350. The
Star-P® timings were performed on an SGI Altix system. Note that small test
cases are used so that the unvectorized versions will complete in reasonable
time, so the speedups shown in these examples are modest.
>> v = 1:1e6;
>> s = 0;
>> tic;
>> for i=1:length(v), s = s+v(i); end
>> toc;
Elapsed time is 0.684787 seconds.
>> s
s =
5.0000e+11
>> v = 1:1e6;
>> tic;
>> s = sum(v);
>> toc;
Elapsed time is 0.003273 seconds.
The two ways of summing the elements of v give the same answer, yet the vectorized version
using the sum operator runs more than 100 times faster. This is an extreme case of the
speed-up due to vectorization, but it is not rare. Expressing your algorithm in high-level operators
provides more opportunities for optimization by Star-P® (or MATLAB) developers within those
operators, resulting in better performance.
Based on the vectorized form, it is straightforward to move to a parallel version with Star-P®.
Note that the unvectorized form, since it’s calculating element-by-element, would be
executing on only a single processor at a time, even though Star-P® would have multiple
processors available to it!
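A sketch of the parallel form (timing output omitted):
>> vpp = 1:1e6*p;
>> spp = sum(vpp);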
>> v = 1:1e7;
>> w = 0*v;
>> tic;
>> for i=1:length(v), w(i) = v(i)^3 + 2*v(i); end
>> toc;
Elapsed time is 19.815496 seconds.
% The following code is vectorized
>> tic;
>> w = v.^3 + 2*v;
>> toc;
Elapsed time is 2.137521 seconds.
% The following code is parallelized
>> vpp = 1:1e7*p;
>> tic;
>> wpp = vpp.^3 + 2*vpp;
>> toc;
Elapsed time is 0.118621 seconds.
This example shows exactly the value of vectorization: it creates simpler code, as you don’t
have to worry about getting subscripts right, and it allows the Star-P® system bigger chunks
of work to operate on, which leads to better performance.
This example compares two methods of multiplying two matrices. One (partially vectorized)
uses dot n^2 times to calculate the result. The vectorized version uses the simple * operator
to multiply the two matrices; this results in a call to optimized libraries (PBLAS in the case of
Star-P®) tuned for the specific machine you’re using. These versions are analogous to the BLAS
Level 1 DDOT and BLAS Level 3 DGEMM routines, where exactly the same effect holds.
Higher-level operators allow more flexibility on the part of the library writer to achieve optimal
performance for a given machine.
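A sketch of the two approaches (the matrix size is illustrative, and timing output is omitted):
>> n = 500;
>> A = rand(n); B = rand(n);
>> C = zeros(n);
>> tic;
>> for i = 1:n, for j = 1:n, C(i,j) = dot(A(i,:),B(:,j)); end, end   % n^2 calls to dot
>> toc;
>> tic;
>> D = A*B;   % one call to the optimized library routine
>> toc;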
This example is a bit fancy. If you are going to restructure this construct, it requires you to
recognize that two computations are the same; the first is not vectorized, while the second
may be considered vectorized. Here the trick is to recognize that the code is computing a
histogram and then cumulatively adding the numbers in the bins.
>> v = rand(1,1e7);
>> w = [];
>> i = 0;
>> tic;
>> while (i<1), i=i+0.1; w = [w sum(v<i)]; end
>> toc
Elapsed time is 0.947873 seconds.
>> w.'
ans =
997890
1998324
2996577
3997599
4999280
6000307
7000870
8000829
9000054
10000000
10000000
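One possible vectorized form, using histc over the same 0.1-wide bins and cumsum to accumulate the counts:
>> tic;
>> w2 = cumsum(histc(v, 0:0.1:1));
>> toc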
As one would expect, the vectorized version works best in Star-P® as well.
For all but the smallest of loops, vectorization can give enormous benefits to serial MATLAB
code. However, as array sizes get larger, much of the benefit of serial vectorization can break
down. The good news is that in Star-P® vectorization is nearly always a good thing. It is
unlikely to break down.
The problem with serial MATLAB is that as variable sizes get larger, MATLAB swaps out the
memory to disk. This is a very costly measure. It often slows down serial MATLAB programs
immensely.
There is a serial approach that can partially remedy the situation. You may be able to rewrite
the code with an outer loop that keeps the variable size small enough to remain in main
memory while large enough to enjoy the benefit of vectorization. While for some problems
this may solve the problem, users often find the solution ugly and not particularly scalable.
The other remedy uses the Star-P® system. This example continues to use vectorized code,
inserting the Star-P® *p construct at the correct points to mark the large data set.
As an example, consider the case of FFTs performed on matrices that are near the memory
capacity of the system MATLAB is running on.
>> n = 1.2*10^4;
>> a = rand(n);
>> app = rand(n*p);
>> tic; b = fft( a); toc;
Elapsed time is 92.685374 seconds.
>> tic; bpp = fft(app); toc;
Elapsed time is 6.916634 seconds.
While you would expect Star-P® to be faster due to running on multiple processors, Star-P® is
also benefiting from larger physical memory. The serial MATLAB execution is hampered by a
lack of physical memory and hence runs inordinately slow. A recurring requirement for
efficient Star-P® programs is keeping large datasets off the front end.
The code below shows what happens upon computing 2^26 random real numbers with
decreasing vector sizes. When k=0, there is no loop, just one big vectorized command. On
the other extreme, when k=25, the code loops 2^25 times computing a small vector of length
2.
Notice that in the beginning, the vectorized code is not efficient. This turns out to be due to
paging overhead, as the matrix exceeds the physical memory of the system on which
MATLAB is running. Later on, the code is inefficient due to loop overhead. Star-P®
overcomes the problem of insufficient memory by enabling you to run on larger-memory HPC
systems. The simple command app = randn(2^26*p,1) parallelizes this computation.
Serial:
>> for k=0:25, tic; for i=1:2^k, a = randn(2^(26-k),1); end; toc; end;
Elapsed time is 1.865770 seconds.
Elapsed time is 1.600310 seconds.
Elapsed time is 1.581707 seconds.
Elapsed time is 1.590823 seconds.
Elapsed time is 1.597639 seconds.
Elapsed time is 1.577038 seconds.
Elapsed time is 1.579628 seconds.
Elapsed time is 1.578954 seconds.
Elapsed time is 1.581229 seconds.
Elapsed time is 1.163945 seconds.
Elapsed time is 1.059308 seconds.
Elapsed time is 1.165907 seconds.
Elapsed time is 1.079797 seconds.
Elapsed time is 1.069463 seconds.
Elapsed time is 1.090218 seconds.
Elapsed time is 1.145205 seconds.
Elapsed time is 1.235547 seconds.
Elapsed time is 1.453363 seconds.
Elapsed time is 1.883642 seconds.
Elapsed time is 2.731986 seconds.
Elapsed time is 4.467244 seconds.
Elapsed time is 7.057231 seconds.
Elapsed time is 13.076593 seconds.
Elapsed time is 25.143928 seconds.
Elapsed time is 44.867566 seconds.
Elapsed time is 88.540178 seconds.
The ability of MATLAB and Star-P® to create and manipulate large matrices easily sometimes
conflicts with the desire to run a problem that consumes a large percentage of the physical
memory on the system in question. Many operators require a copy of the input, or sometimes
temporary array(s) that are the same size as the input, and the memory consumed by those
temporary arrays is not always obvious. Both MATLAB and Star-P® will run much more
slowly when their working set exceeds the size of physical memory, though Star-P® has the
advantage that the size of physical memory will be bigger.
If you are running into memory capacity issues, as evidenced by server exceptions being
logged in the ~/.starp/log/latest/starpserver.log, then there may be one or a
few places that are using the most memory. In those places, manually inserting clear
statements for arrays no longer in use allows the Star-P® garbage collector to free up as
much memory as possible.
To determine where in your application you are requesting more memory than is available,
you may consider enabling the STARP_SOFT_MEM_LIMITS environment variable in the
env.sh file on the server, located in the <path/to/starp/install>/config directory,
or placing this environment variable in the user’s .bashrc file, also on the server.
STARP_SOFT_MEM_LIMITS controls whether “soft limits” will be enforced (= true) or not
(= false). By default, the value of STARP_SOFT_MEM_LIMITS is set to false. When soft
limits are enabled and a requested malloc() exceeds the memory available on the system,
an exception is thrown on the server, and an error message is returned to the client.
When examining the performance of your code using "soft-limits", you should also be aware
of the Star-P® mallochooks setting on the server. mallochooks is set using ppsetoption. It
provides a thin wrapper around the malloc() operation in C that records the user request in
the case of a failed malloc(). This wrapper also provides that record at a later point to the
Star-P® server for logging purposes. If you choose to set mallochooks to be off, then you
are turning off Star-P®'s mechanism for tracking memory usage on the server. Any out of
memory errors that occur with mallochooks turned off are subject to the memory limits and
error handling provided by your server's operating system.
Further user control for the actual memory “soft” limit is available via the
STARP_MBYTES_PER_PEER environment variable. If defined, this environment variable will
override the default limit which is calculated to be 1/Nth of the host’s actual physical memory.
STARP_MBYTES_PER_PEER can be set to exceed the default value, but should be used with
caution, since it allows oversubscription of memory and could thereby cause the application
to slow down due to swapping.
MATLAB codes allow for the use of structs and cell arrays as a convenient method of
collecting and organizing related data sets. Within MATLAB, the contents of these containers
can be any valid MATLAB data type, including matrices, strings, and other structures or cell
arrays. Depending on the code being developed, these arrays may be gigantic arrays of
structures or cells.
Star-P® currently allows the use of structures locally inside of functions called by “ppeval”.
Structs and cell arrays on the client side continue to work within the MATLAB environment.
You can assign distributed data to members of a struct or cell array, as well as manipulate
distributed data that is a member of a struct or cell array. However, current versions of
Star-P® lack the ability to pass entire structures or cell arrays from client to server. This
means that you cannot pass a top-level struct or cell array name as an argument to
“ppeval”, nor can you distribute an entire struct using ppback. Here are some examples:
a.scalar = 57.36;            % a local scalar stored in a struct
a.foo = ppback([1:100]);     % a distributed vector as a struct member
a.left = rand(100*p,100);    % distributed matrices as struct members
a.right = rand(100,100*p);
myprod = a.left*a.right;     % operate directly on distributed struct members
bar = ppeval('somefunc', split(a.foo), split(a.left,1), split(a.right,2));
MATLAB structs or cell arrays can be arbitrarily more complex than shown here (for example,
structs containing cell arrays containing structs among other possibilities). As a general rule,
if your data is held in a struct or cell array, and you need to pass a part of that data to the
server, then pass only the structure members or contents of a cell array element that contain
distributable matrix data or string variables.
When creating replacement variables for passing this data into or out of a ppeval call, give
your replacement matrices names evocative of your original struct or cell array to help you
keep track of what your code is doing.
For similar reasons that vectorization is key to achieving optimal performance with data
parallel codes, vectorization is also extremely important for good performance of task parallel
codes. Each iteration of a function in task parallel takes place on an individual processor on
the server and still involves the use of an interpreter. Consequently, the benefits of
vectorization that can be achieved in serial MATLAB code are also available with task parallel
MATLAB code with Star-P®.
The following example shows the effort needed and gains achieved by vectorization inside a
“ppeval” call:
%Top.m -- Top level fcn invokes two different versions of sum to check speeds.
% (yarr and zarr are assumed to be 3-by-n-by-m arrays defined earlier.)
tic;
x_looping = ppeval('fcn_looping',n,split(yarr,3),split(zarr,3));
toc
tic;
x_vectorized = ppeval('fcn_vectorized',n,split(yarr),split(zarr,3));
toc
function x = fcn_looping(n,y,z)
%===== Unvectorized version -- Bad! =====
for i = 1:n
    if z(1,i) >= 0.5
        x(i) = y(1,i)*z(1,i) + y(2,i)*z(2,i) + y(3,i)*z(3,i);
    else
        % Reconstructed else branch, mirroring the vectorized version below.
        x(i) = y(1,i)/z(1,i) + y(2,i)/z(2,i) + y(3,i)/z(3,i);
    end
end
function x = fcn_vectorized(n,y,z)
%===== Vectorized version -- Good! =====
indx = z(1,:) >= 0.5; %Replace if with a logical expression
x(indx) = sum(y(:,indx).*z(:,indx),1);
indx = indx == 0; %use complement of indx for else case
x(indx) = sum(y(:,indx)./z(:,indx),1);
The performance gains achieved by this vectorization inside of a ppeval call are shown in the
following table as a function of the main loop index.
Each ppeval iteration has overhead cost associated with it on the order of 10s of
microseconds. This means that if iterations of your for loop take less time than this
overhead cost no performance gains will be achieved by using ppeval directly over the
entire set of iterations. By blocking iterations within a function call, you can:
1. reduce the number of iterations performed by ppeval
2. increase the time per ppeval iteration
3. reduce the overall time necessary to perform all target function iterations.
To illustrate this point, let us consider a function foo contained in a file foo.m. Let us also
assume that evaluating a single iteration of foo takes less than 10-20 microseconds.
for i=1:N
z(i) = foo(x(i),y(i));
end
% file foo.m
function z = foo(x,y);
z = x+y;
Now in the assumed case where foo takes less than 10-20 microseconds to execute, it is
recommended that you rearrange the loop body to create a wrapper function, say
foo_wrapper, that executes only a portion of your iterations serially. Then, this function
foo_wrapper would be passed to ppeval.
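A possible sketch, assuming x and y are 1-by-N row vectors and N is a multiple of the number of processors (the block-size logic is illustrative):
% file foo_wrapper.m
function z = foo_wrapper(x,y)
% x and y are column slices holding a block of iterations; loop over the
% block serially so each ppeval iteration does a meaningful amount of work.
z = zeros(size(x));
for i = 1:length(x)
    z(i) = foo(x(i),y(i));
end

blk = N/np;    % iterations handled per processor
Zpp = ppeval('foo_wrapper', split(reshape(x,blk,np),2), split(reshape(y,blk,np),2));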
This example assumes that the total number of iterations desired for foo is a multiple of the
number of processors. When this is not the case, the logic of how you choose to break up
your for loops needs to be changed. In addition, the optimal method for determining the
number of iterations that should be performed inside foo_wrapper, which is called by
ppeval, is something that you will need to determine through experimentation based on the
following quantities:
• The amount of time necessary to call a single iteration of foo.
• The total number of iterations of foo needed.
• The number of processors available.
In addition to the functions available within MATLAB®, Star-P® allows for the integration of
external functions from your own libraries or third-party vendor libraries through the
ppinvoke, ppeval, ppevalsplit, pploadpackage, and ppunloadpackage
functions that are part of the Star-P® SDK interface. More information on the Star-P® SDK can be
found in the “Star-P® Software Development Kit Reference and Tutorial”. External functions
can also be run in task parallel within ppeval through the use of the MATLAB system
command. For more information on calling external functions using the system command,
see "Calling Non-”M” Functions from within ppeval".
Although most Star-P® functions are insensitive to matrix distribution, many (or most) third
party libraries are not. Consequently, if your MATLAB® program interfaces to external
programs or libraries that are sensitive to distribution through the Star-P® SDK, then you
must carefully consider how you distribute your matrices. In this situation you may ask, “how
do I know whether to call the function with row distributed or column distributed input
matrices?” Unless the third party programs explicitly state their desired distribution, then the
answer is: experiment. Surround the function with “tic/toc”, “pptic/pptoc” and send it
random matrices distributed in all ways possible. Then scale the matrix sizes up and see
which distributions (if any) offer faster execution time, or which distributions break first when
the matrix size becomes gigantic.
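For example, a minimal sketch of such an experiment might look like the following, where
myextfun is a hypothetical stand-in for the external routine under test and the matrix sizes
are assumptions:
n = 2000;
Arow = rand(n*p, n); % row-distributed random input
Acol = rand(n, n*p); % column-distributed random input
tic; pptic;
r1 = myextfun(Arow); % time the call with row-distributed input
pptoc; toc
tic; pptic;
r2 = myextfun(Acol); % time the call with column-distributed input
pptoc; toc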
In MATLAB®, all operations on integer types “saturate.” This means values greater than
intmax of that integer class are set to intmax and values below intmin of that integer
class are set to intmin.
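For example, with standard MATLAB integer arithmetic:
x = int8(100) + int8(100) % saturates to 127, which is intmax('int8')
y = int8(-100) - int8(100) % saturates to -128, which is intmin('int8')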
The underlying numerical libraries in Star-P® such as ScaLAPACK, FFTW and SPRNG are of
high accuracy and comply with the IEEE standards. However, in many cases the results from
Star-P® may differ from those reported by MATLAB® for a number of reasons:
1. In the most common case, the answers may simply be non-unique. For example, the
eigenvalues from the single-return form of eig might be returned in a different order
from MATLAB® or the eigenvectors might be scaled differently. Similarly, the outputs
from svd (singular value decomposition) and its derivatives such as null and orth,
hess (reduction to the upper-Hessenberg form) and schur (reduction to the Schur
form) are non-unique and therefore not guaranteed to match the corresponding outputs
from MATLAB®. Instead, you must verify that the results satisfy the properties of the
underlying decomposition. For instance, if you were to run
[Upp,Spp,Vpp]=svd(App) for App being a ddense object, the outputs Upp, Spp, and
Vpp are valid if Upp and Vpp are unitary and norm(Upp*Spp*Vpp' - App) is small.
2. Another reason numerical results from Star-P® might not correspond to those from
MATLAB® has to do with the influence of small round-off errors. For example, it is
well known that even addition is not associative in the presence of rounding errors: the
result of (a + b) + c can differ from a + (b + c) (a small example follows this list).
3. When the underlying problem is ill-conditioned or singular, it is very likely that the
results from Star-P® will not match MATLAB. For instance, when a matrix A is singular
to working precision, inv(A) returns inf(size(A)) in MATLAB, but not in Star-P®.
When such cases are encountered, Star-P® does its best to return a descriptive
warning message.
4. Differences between MATLAB and Star-P® numerical results might arise for extremely
large or extremely small input values.
5. Finally, differences between the numerical results in Star-P® and MATLAB might result
from software issues. If you suspect that the Star-P® result is incorrect, please contact
us at [email protected].
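As a small double-precision illustration of the round-off effect mentioned in item 2 above:
a = 1e16; b = -1e16; c = 1;
(a + b) + c % returns 1
a + (b + c) % returns 0, because b + c rounds back to -1e16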
The following items should be considered when the performance of your Star-P® application
is critical.
1. By default, each call to the Star-P® server causes an entry in a log file on the server
system. Some performance benefits can be achieved by disabling logging in the server,
via the following command at the MATLAB® prompt:
ppsetoption('log','off')
2. On SGI Altix systems, the Star-P® server will yield CPU usage after each command
completes. Significantly improved performance is available at the cost of continuing to
use CPU time in a loop even when Star-P® is idle. This increased performance can be
obtained via the following command at the MATLAB prompt:
ppsetoption('YieldCPU','off')
3. By default, the Star-P® server maintains a count of how much memory it is consuming
and how much memory is being used on the system. This enables it to more gracefully
handle situations that arise when the server machine is running low on available
memory. There is a minor performance cost associated with this functionality, because
it requires a small amount of extra work to be done with each call to malloc() inside
the server. This feature can be disabled, providing improved performance, via the
following command:
ppsetoption('mallochooks','off')
Star-P® TPE provides the option of using various versions of Octave or compiled C codes as
task parallel computational engines using the 'TaskParallelEngine' option as an
argument in ppsetoption.
For information on using ppsetoption to change your task parallel engine, see "Choosing
Your Task Parallel Engine (TPE)".
Star-P® provides several diagnostic commands that help determine the following:
• Which variables are distributed,
• How much time is spent on communication between the client and the server, and
• How much time is spent on each function call inside the server.
Each of these diagnostics can help identify bottlenecks in the code and improve
performance. The diagnostic commands are ppwhos, pptic/pptoc, ppeval_tic/ppeval_toc, and
ppprofile.
Communication between the client and server can be measured by use of the pptic and
pptoc commands, which are modeled after the MATLAB® tic and toc commands, but
instead of providing wall-clock time between the two calls, they provide the number of
client-server messages and bytes sent during the interval.
And of course the two can be combined to provide information about transfer rates.
The pptic and pptoc commands can be used on various amounts of code, to focus on the
source of a suspected performance problem involving communications between the client
and the server. For instance, when you explicitly move data between the client and server via
ppfront or ppback, you will expect to see a large number of bytes moved.
But there might be places where implicit data movement occurs. For example, consider a
distributed matrix being multiplied by a local, client-side matrix b. In performing this
operation, the matrix b must be shipped to the server.
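A minimal sketch of this situation, with assumed sizes and variable names, is:
app = rand(1000*p, 1000); % distributed matrix, lives on the server
b = rand(1000, 1000); % ordinary MATLAB matrix, lives on the client
pptic
cpp = app*b; % b is implicitly shipped to the server for the multiply
pptoc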
Other operations may produce different amounts of communication depending upon how
they are called. For example, the single-return case of the find function may move only a
few hundred or a few thousand bytes between the client and the server, but when the find
operation is called on a distributed variable with three return values, the row indices, column
indices, and array values are all moved from the server to the client. Depending on the size
of the distributed input, this could be a very large amount of data.
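A minimal sketch of the two cases, with an assumed input, is:
app = rand(10000*p, 1000);
pptic; kpp = find(app > 0.999); pptoc % single return: little client-server traffic
pptic; [i,j,v] = find(app > 0.999); pptoc % three returns: indices and values all come back to the client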
An excessive number of client-server messages (as opposed to bytes transferred) can also
hurt performance. For instance, the values of an array could be created element by element,
as in the for loop below, or by a single array-level construct, also shown below.
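A sketch of the two approaches (sizes and values are assumptions) is:
% First construct: element-by-element assignment, one round trip per element.
app = ppback(zeros(1,100));
for i = 1:100
app(i) = sin(i);
end
% Second construct: a single array-level statement builds the same data.
bpp = sin(ppback(1:100));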
The first construct calls the Star-P® server for each element of the array, meaning almost all
the time will be spent communicating between the client and the server, rather than letting the
server spend time working on its large data.
The second construct is drastically better because it allows the Star-P® server to be called
only a few times to operate on the same amount of data.
The execution of this script bears out the differences in messages sent/received, with the first
method sending 200 times more messages than the second. Worse still for the element-wise
approach, the performance difference will grow as the size of the data grows.
The different subfunctions of the ppprofile command can be combined to give you a great
deal of information about where the time is being spent in your Star-P® program. Several
types of information are available.
Perhaps the most common usage of ppprofile is to get a report on a section of code, as
follows.
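A minimal sketch of this usage pattern, with an assumed workload, is:
ppprofile on
app = rand(4000*p, 4000);
bpp = fft(app);
ppprofile report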
The report prints out all server functions that are used between the calls to ppprofile on
and ppprofile report, sorted by the percentage of the execution time spent in that
function. For this example, it shows you that 34% of the time is spent executing in the server
routine ppfftw_fft, which calls the FFT routine in the FFTW parallel library. This report
also tells you how many calls were made to each server routine, and the average time per
call.
Information from this report can be used to identify routines that your program is calling more
often than necessary, or that are not yet implemented optimally. An example of the former is
given below, by a script which does a matrix multiplication in a non-vectorized manner,
compared to a vectorized routine. The script has the following contents:
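A sketch of such a script, in which the array names, sizes, and the use of a transposed
second operand are assumptions (each ppprofile on is assumed to restart collection), is:
app = rand(20, 20*p);
bpp = rand(20, 20*p);
% Non-vectorized: every element of the result triggers its own server calls.
ppprofile on
cpp = zeros(20,20);
for i = 1:20
for j = 1:20
cpp(i,j) = app(i,:)*bpp(j,:)';
end
end
ppprofile report
% Vectorized: the same product in essentially one server operation.
ppprofile on
dpp = app*bpp';
ppprofile report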
With two input arrays sized as 20-by-20p, you get the following output:
You can see that the first report requires over 2,000 server calls, while the second requires
only one. This accounts for the drastic performance difference between the two styles of
accomplishing the same computational task.
If you want to delve more deeply and understand the sequential order of system calls, or get
more detailed info about each server call, you can use the ppprofile display option.
With this option, the information comes out interspersed with the usual MATLAB console
output, so you can see which MATLAB or Star-P® commands are invoking which server calls.
This can help you identify situations where Star-P® is doing something you didn’t expect, and
possibly creating a performance issue.
Another level of information is available with the ppprofile on -detail full option
coupled with the ppprofile display option.
As you can see, the per-server-call information now includes not only the time spent
executing on the server (“stime”) but also the number of times that the distribution of an
object was changed in the execution of a function (“chdist”). Changes of distribution are
necessary to provide good usability (think of the instance where you might do element-wise
addition on 2 arrays, one of which is row-distributed and one of which is column-distributed),
but changing the distribution also involves communication among the processors of the
Star-P® server, which can be a bottleneck if done too often. In this example, the max function
is doing 2 changes of distribution.
ppeval_tic/toc:
Star-P® also provides a set of timer functions specific to the ppeval command:
ppeval_tic/ppeval_toc. They provide information on the complete ppeval process by
breaking down the time spent in each step necessary to perform a ppeval call:
>> ppeval_tic();
>> ypp = ppeval('inv',rand(10,10,1000*p));
>> ppeval_toc(0)
ans =
TotalCalls: 1
ServerInit: 6.1989e-06
ServerUnpack: 5.0068e-06
ServerFunctionGen: 0.0019
ServerCallSetup: 1.9908e-04
ServerOctaveExec: 0.0493
ServerDataCollect: 2.0599e-04
ServerTotalTime: 0.0516
ClientArgScan: 0.0050
ClientDepFun: 0.0028
ClientEmode: 0.0549
ClientReturnValues: 0.0096
ClientTotalTime: 0.0723
TPELogFileLength: 84
InputElementsPP: [12503 0]
OutputElementsPP: [12500 0]
TPEInnerExec: 0
TPEOuterExec: 0.0478
TPESliceCount: 125
Maximizing Performance
Three points matter most when tuning a Star-P® program: vectorize your code, minimize data
transfer between the client and the server, and avoid unnecessary changes of distribution.
The first point is most important for data parallel computation and can be achieved by
vectorizing your code, meaning that instead of using loops and control structures, you use
higher-level functions to perform your calculations. Vectorization takes control of the
execution away from MATLAB (e.g., MATLAB is no longer executing the for loop line by
line) and hands it over to optimized parallel libraries on the server. Not only will vectorized
code run faster with Star-P®, it will also run faster with MATLAB.
The second point simply reflects the fact that transferring data from the client to the server is
the slowest link in the Star-P® system. Any operation that involves a distributed variable and
a normal MATLAB variable will be executed on the server, and hence, includes transferring
the MATLAB variable to the server so that the server has access to it. When the MATLAB
variables are scalars, this does not impact the execution time, but when the variables
become large it does impact the time it takes to perform the operation.
Note that when combining a distributed and MATLAB variable inside a loop, the MATLAB
variable will be sent over to the server for each iteration of the loop.
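For example, in the following sketch (sizes are assumptions) the client-side matrix b crosses
the network on every pass through the loop; moving it to the server once with ppback avoids
the repeated transfer:
app = rand(1000*p, 1000);
b = rand(1000, 1000); % ordinary client-side matrix
for k = 1:10
app = app + b; % b is sent to the server on each iteration
end
% Better: move b to the server once, outside the loop.
bpp = ppback(b);
for k = 1:10
app = app + bpp; % no client-to-server transfer inside the loop
end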
The third point reflects the fact that changes in the distribution type, say from row to column
distributed, costs a small amount of time. This time is a function of the interconnect between
the processors and will be larger for slower interconnects. In general, avoiding distribution
changes is straightforward and is easily achieved by aligning the distribution types of all
variables, i.e. all row distributed or all column distributed.
Distributed objects in Star-P® reside on the server system, which is usually a different
physical machine from the system where the MATLAB client is running. Thus, whenever data
is moved between the client and the server, it involves interprocessor communication, usually
across a typical TCP/IP network (Gigabit Ethernet, for instance). While this connection
enables the power of the Star-P® server, excessive data transfer between the client and
server can cause performance degradation, and thus the default behavior for Star-P® is to
leave large data on the server. One typical programming style is to move any needed data to
the server at the beginning of the program (via ppback, ppload, etc.), operate on it
repeatedly on the server, then save any necessary results at the end of the program (via
ppfront, ppsave, etc.).
However, there are times when you want to move the data between the client and the server.
This communication can be explicit.
The load command loads data from a file into MATLAB variable(s). The ppback command
moves the data from the client working space to the Star-P® working space, in this case as a
ddense array, and the ppfront command moves data from the Star-P® server working
space back to the MATLAB client working space, as sketched below.
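A minimal sketch of this explicit movement, with an assumed file name and variable names, is:
load mydata.mat % creates the client-side variable data
app = ppback(data); % ships data to the server as a ddense array
% ... operate on app repeatedly on the server ...
result = ppfront(app); % brings the result back to the client workspace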
When accessing data from disk, it may be faster to load it directly as distributed array(s)
rather than loading it into the client and then moving it via ppback (and similarly to save it
directly as distributed arrays). The ppload/ppsave commands are the distributed versions
of the load/save commands. For information on ppload and ppsave, see "The ppload
and ppsave Star-P® Commands".
Implicit Communication
The communication between the client and the server can also be implicit. The most frequent
cases of this communication pattern are the call(s) that are made to the Star-P® server for
operations on distributed data. While attention has been paid to optimizing these calls,
making too many of them will slow down your program. The best approach to minimizing the
number of calls is to operate on whole arrays and minimize the use of control structures such
as for and while, with operators that match what you want to achieve.
Another type of implicit communication is done via reduction operations, which reduce the
dimensionality of arrays, often to a single data element, or other operators which produce
only a scalar.
>> d = max(max(bpp));
>> e = norm(bpp);
>> ppwhos
Your variables are:
Name Size Bytes Class
bpp 100x100p 80000 ddense array
d 1x1 8 double array
e 1x1 8 double array
One of the motivations behind the design of Star-P® was to allow larger problems to be
tackled than was possible on a single-processor MATLAB session. Because these problems
often involve large data (i.e., too big to fit on the MATLAB client), and because of the
possibility of performance issues mentioned above, Star-P®’s default behavior is to avoid
moving data between the client and the server. Indeed, given the memory sizes of parallel
servers compared to client systems (usually desktops or laptops), in general it will be
impractical to move large arrays from the server to the client. The exception to this rule arises
when operations on the server result in scalar output, in which case the scalar value will
automatically be brought to the client.
>> f = rand(8,8)
f =
0.4838 0.1520 0.1996 0.7267 0.4563 0.7669 0.3624 0.7185
0.5923 0.5584 0.1937 0.4047 0.2911 0.2298 0.2460 0.8987
0.7036 0.2819 0.4815 0.3219 0.0787 0.4983 0.9179 0.8907
0.8828 0.1345 0.1551 0.3135 0.4714 0.7376 0.1811 0.8055
0.1802 0.1512 0.2509 0.2147 0.9806 0.0915 0.6026 0.8420
0.6950 0.4017 0.5268 0.0104 0.9427 0.0030 0.1507 0.3435
0.9811 0.0213 0.4433 0.7595 0.8324 0.7831 0.4493 0.2497
0.1848 0.7306 0.0034 0.5078 0.7174 0.1684 0.6500 0.8098
0.0904 0.5250 0.2795 0.5770 0.5986 0.0795 0.3651 0.4867
0.4757 0.5727 0.9461 0.6291 0.4177 0.8044 0.2065 0.3597
While this makes good sense for small data sizes, printing out the data sizes possible with
Star-P® distributed objects, which often contain hundreds of millions to trillions of elements,
would not be useful. Thus the Star-P® behavior for a command lacking a trailing semicolon is
to print out the size of the resulting object.
>> fpp = rand(8*p,8)
fpp =
ddense object: 8p-by-8
If you want to see the contents of an array, or a portion of an array, you can display a single
element in the obvious way, as follows:
>> fpp(1,4)
ans =
0.2433
Note: When you call ppfront and leave off the final semicolon, MATLAB will print out
the whole contents of the array.
Note: Communication can happen implicitly as described in "Mixing Local and
Distributed Data".
During operations on the parallel server, communication among processors can happen for a
variety of reasons. Users who are focused on fast application development time can probably
ignore distribution and communication of data, but those wanting the best performance will
want to pay attention to them.
Element-wise operators (such as +, -, .*, and ./) operate on just one element from each
array, and if those elements happen to be on the same processor, no communication occurs.
If the elements are not on the same processor, the element-wise operators can cause
communication. In the example below, app and epp are distributed differently, so internally
Star-P® redistributes epp to the same distribution as app before doing the element-wise
operations.
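A minimal sketch of this case (sizes are assumptions) is:
app = rand(1000*p, 1000); % row distributed
epp = rand(1000, 1000*p); % column distributed
gpp = app + epp; % epp is redistributed to match app before the addition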
Often redistribution cannot be avoided, but for arrays which will be operated on together, it is
usually best to give them the same distribution.
Any operator that rearranges data (for example, sort, transpose, reshape, permute,
horzcat, circshift, extraction of a submatrix) will typically involve communication on a
parallel system. Other operators by definition include communication when executed on
distributed arrays. For example, multiplication of two matrices requires, for each row and
column, multiplication of each element of the row by the corresponding element of the
column and then taking the summation of those results. Similarly, a multi-dimensional FFT is
often implemented by executing the FFT in one dimension, transposing the data, and then
executing the FFT in another dimension. Some operators require communication, in the
general case, because of the layout of data in Star-P®. For instance, the find operator returns
a distributed dense array (column vector) of the nonzero elements in a distributed array.
Column vectors in Star-P® contain an equal number of elements per processor for as many
processors as can be fully filled, with any remainder in the high-numbered processors. Thus
the find operator must take the result values and densely pack them into the result array. In
general, this requires interprocessor communication. For the same reason, creating a
submatrix by indexing into a distributed array also requires communication.
The cost of this interprocessor communication depends on the interconnect of the parallel
server. For high communication problems, a tightly integrated system, such as an SGI Altix
system, will provide the best performance.
If you are using Star-P®, it is probably because you want to achieve maximum performance
from your MATLAB program. That is, you are interested in making your program run as
quickly as possible, while still returning correct results. Since Star-P® is a client/server
program, correctly distributing your processing tasks between client and server is
instrumental in obtaining best performance. Also, calling the right functions to achieve your
goals is important to obtaining good run times.
Knowing which functions are invoked, where they are running, how many times they are
called, and how long they run before completion is information you can use while optimizing
your program for best performance. Therefore, Star-P® provides several profiling facilities to
help you wring maximum performance out of your program. These facilities include:
• MATLAB's tic/toc and Star-P®'s pptic/pptoc, which report the time elapsed
on the client (tic/toc) and the server (pptic/pptoc) between the tic/pptic
call and the toc/pptoc call.
• Star-P®'s ppperf, which records and reports client, network, and server activity
over a profiling run.
The rest of this section describes the use of ppperf in investigating your code.
Using ppperf
Usage of Star-P®'s profiling tool is loosely based upon MATLAB's profile functionality. If
you are used to code profiling using MATLAB's profile, then Star-P®'s ppperf will feel
comfortable to you.
To use ppperf, you should first have a mental model of what ppperf is doing. Figure 5-1 is
a simplified diagram showing the major software components at work when you execute a
function called myfunc on the parallel supercomputer. You, the user, interact with MATLAB
on your local PC. When you execute myfunc:
• MATLAB passes control to the Star-P® client software, which in turn sends it to the
Star-P® server software, which evaluates your function using the appropriate
numerical library.
• Then, the Star-P® server passes the result to the Star-P® client, which sends it up
to MATLAB, which displays the result to you in your MATLAB session. Meanwhile,
performance data is recorded at several points within the system.
Figure 5-1 is a conceptual picture of what happens when you invoke myfunc() in the
MATLAB client. Star-P® software passes the function call down to the appropriate numerical
library on the server, and passes the returned results back to the MATLAB session (thick
dotted black line in Figure 5-1).
Three different sets of performance data are gathered: One set in MATLAB on the client, and
two sets in the Star-P® server (red arrows). Later, when the user types “ppperf report”,
the performance data is gathered into a table living on the client (blue arrows).
Gathering these three categories of performance parameters is suggested in the figure by the
red arrows, which indicate collection of performance data into the associated data structures
and storage tables.
After your program has run, you may request a report showing all recorded performance data
by issuing the command ppperf report. This command brings all the performance data
scattered around the client and server into a table living on the client. This is suggested by
the blue arrows shown in the figure. Once the data is gathered into a table on the client, the
table is then displayed to the user.
Another important thing to understand when profiling your code is how the client and the
server interact. Under Star-P®, the client and the server process your computation in a
“ping-pong” mode. That is, while Star-P® is performing your calculation, the client
does some work while the server sits idle, and then the server does some work while the
client is idle, then the client does work and the server sits idle, and so on. Each time a work
hand-off occurs, a burst of network activity occurs as data is exchanged between client and
server. Keep this work flow in mind as you examine the data generated by ppperf.
Using ppperf is simple. First, you initiate performance monitoring. This means that you tell
Star-P® to clear any old performance data stored on the server, and start the performance
counters and timers afresh. Next you run your program. When your program is done, you
fetch the performance data from the server and display it. Finally, assuming you are done
with performance monitoring, you turn off the monitoring facility. Here's an example sequence
of commands you could enter in your Star-P® session:
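A minimal sketch of such a session, where my_program stands for your own function or
script, is:
ppperf on        % clear old performance data and start the counters afresh
my_program       % run the program being profiled
ppperf report    % fetch the performance data from the server and display it
ppperf off       % turn off the monitoring facility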
ppperf will accumulate performance statistics until you turn it off using one of ppperf
report, ppperf off, or ppperf clear. The distinction between these commands lies in
whether they erase the statistics table. As a general rule, ppperf's subcommands will
behave similarly to the analogous subcommands of MATLAB's profile function.
• ppperf off - Turns off the performance monitoring process, but leaves the results
table alone. Use this command if you want to perform some work without gathering
statistics.
Star-P®'s ppperf facility supports two methods to display performance statistics: textual and
graphical. Text reports are covered in this section and graphical output is covered in "Using
ppperf's graphical mode".
Here's an example ppperf run. First, we'll look at the program being profiled. It is called
SumDifferences_loop.m. It calculates the RMS deviation from one point to the next in a
1000 element random vector.
% SumDifferences_loop.m -- the setup of xpp and n is assumed here; a
% 1000-element distributed random vector is used, as described above.
xpp = rand(1000*p,1);
n = length(xpp)-1;
tic
dx_sum = 0;
for i = 1:n
dx = xpp(i+1)-xpp(i);
dx_sum = dx_sum + dx^2;
end
dx_sum = sqrt(dx_sum);
toc
fprintf('dx_sum = %f\n', dx_sum);
This is a particularly bad program for Star-P®, since it involves using a for loop to perform a
simple sum. It is easy to create a vectorized version of this program whose run time is
perhaps 100 times faster. Nonetheless, this code provides a very interesting example for
ppperf profiling since it demonstrates many of the things you can learn by running ppperf
on your code.
>> ppperf on
Start MATLAB/Star-P® Performance Metrics
>> SumDifferences_loop
Elapsed time is 12.254049 seconds.
dx_sum = 12.304663
>> ppperf report
=============================================================================
MATLAB/Star-P® Performance Metrics
Date: 17-May-2007 18:11:01
Client: my_client_machine_address.com
Server: my_server
Elapsed: 32 seconds
-----------------------------------------------------------------------------
Star-P® Profiling
-----------------------------------------------------------------------------
function calls time avg time %calls %time
ppdense_viewelement 1998 16.9636 0.0083551 99.8501 99.6233
ppbase_setoption 1 0.030958 0.030958 0.049975 0.18475
ppdense_rand 1 0.023654 0.023654 0.049975 0.14116
ppbase_profile_onoff 1 0.008517 0.008517 0.049975 0.050827
Total 2001 16.7567 0.0083742
>>
Now that we've seen the output generated by a typical ppperf run, the question is: What
does all that data mean? Let's look at each section generated by ppperf.
The preamble
The preamble provides basic information about the performance run just completed. Here's
the preamble from the above run:
=============================================================================
MATLAB/Star-P® Performance Metrics
Date: 17-May-2007 18:11:01
Client: my_client_machine_address.com
Server: my_server
Elapsed: 32 seconds
If you are running ppperf interactively (for example, typing each command into the Star-P®
prompt), then the elapsed time is the time duration between when you typed in ppperf o2
and when you typed in ppperf report. If you wait for 20 seconds before running your
function, the additional 20 seconds of idle time will be incorporated into the reported elapsed
time.
This section provides information about how the two computers (client and server) spent their
time on your calculation.
Remember that Star-P® operates in a ping-pong mode. The client does some work while the
server idles, then the server does some work while the client idles, and so on, until your
computation is done.
Also, every time there is a hand-off of work between client and server, a burst of network
activity occurs. The performance time measurement shows you how many times each
component was active in your program, the min, max, and mean time it was active (in
seconds), and the total time required by each component to do its job. Here's the
“Performance Time Measurement” section of the report shown earlier:
As you can see, the client performed a chunk of work 2001 times. Also, it consumed by far
the majority of the elapsed time. Since the program SumDifferences_loop.m involves a
for loop iterated 1000 times, it appears that the client performed two tasks for each loop
iteration -- plus one task at the end of the loop -- giving rise to the activity count of 2001.
The network was active 2001 times, which reflects the fact that upon each iteration of the for
loop the client was required to perform two tasks. Also, the large amount of time
communicating on the network signals that this program spent too much time
communicating.
Finally, the server was active 2001 times, but each burst of activity lasted at most a few
milliseconds. This indicates that the server was very underutilized by this program. Since the
whole point of using Star-P® is to effectively harness and use the power of the
supercomputer server, this program, SumDifferences_loop.m, clearly does an inefficient
job of exploiting the potential of Star-P®. Of course, this is expected, since the program is
essentially a big for loop, which is known to be a slow and inefficient way to implement this
computation. Later, we'll look at a vectorized version of the same computation.
One of the most interesting things you can learn from ppperf is the amount of time spent in
the various software subcomponents (daemons or libraries) running on the server. The
“process measurement” results provide this information.
To visualize the meaning of the “process measurement” data, imagine that the server runs a
Star-P® server daemon. The daemon manages a set of numerical libraries that are used to
evaluate your function. This is shown schematically in Figure 5-1.
When a server call is made, the Star-P® server daemon must spend some time figuring out
how to handle your function call. Having done that, it then hands your data to the appropriate
function in one of the numerical libraries. Once the function is done executing, it hands the
returned data back to the Star-P® server daemon, which in turn sends it to the client.
Meanwhile, performance timers and counters are running, measuring the amount of time
spent in the Star-P® server daemon, as well as the time spent executing the library function.
Here's a report returned in the default mode, for example, ppperf report:
• In this example the only process that ran was the “Starp” process. The “Starp”
process is the Star-P® server daemon, which manages your computation on the
server side. Depending upon the details of your calculation -- and specifically which
numerical engines it invokes on the server -- you may see other processes listed in
this section alongside “Starp”. Rerun the example to get Octave times.
• Three times are listed for each process:
1. Real Time - The wall clock time spent by this process during the execution of your
function. This can be greater or less than User Time depending on the application.
2. Sys Time - This is the CPU time spent executing “kernel space” code on the server
at the behest of your program. This will likely be smaller than the real, wall clock
time since the server is multitasking many jobs at once. This can be greater or less
than User Time depending on the application.
3. User Time - This is the CPU time spent executing “user space” code on the server
at the request of your program. This will likely be smaller than the real, wall clock
time since the server is multitasking many jobs at once.
• The elapsed time reported is the time since the ppperf command was invoked, not
since the process started.
If you issue the command ppperf report detail, then ppperf will return the time spent
broken down by compute node. The “Performance Process Measurement” section is the only
one in which ppperf report detail will provide additional detailed information about
your run. Here are the results of a new run of SumDifferences_loop.m under ppperf
showing the difference between the default and the detailed report:
As you can see, using ppperf report detail shows a breakdown of time spent on each
compute node. You can use this information to help locate a node which might be particularly
slow, either due to an excessively long task-parallel computation, or perhaps because it has
other activity running on it. You can use this information to help balance the load across all
compute nodes on your machine.
Star-P® Profiling
This section lists all functions called on the server while running your program. It tabulates
the number of invocations as well as the average time spent in each function, along with
some other performance metrics.
The functions listed in the Star-P® Profiling section are exclusively built-in Star-P® functions.
The built-in Star-P® functions are typically named something like ppdense_foo or
ppbase_bar to distinguish them from the names you might give to your functions. In
general, these Star-P® built-in functions are not available to you to run, and you cannot use
help ppdense_foo at the command line to get more information about the function.
However, the functions are usually named in a logical way so that you can make an educated
guess about what the functions are doing.
The information provided in the Star-P® Profiling section is particularly useful when tweaking
your code for best performance, since it allows you to identify which functions consume the
majority of your compute time. You can focus your optimization efforts on improving the
functions which consume the most time, or at least optimizing the number of times each
function is invoked.
Here's the Star-P® Profiling section copied from the above run of
SumDifferences_loop.m:
Star-P® Profiling
-----------------------------------------------------------------------------
function calls time avg time %calls %time
ppdense_viewelement 1998 16.9636 0.0083551 99.8501 99.6233
ppbase_setoption 1 0.030958 0.030958 0.049975 0.18475
ppdense_rand 1 0.023654 0.023654 0.049975 0.14116
ppbase_profile_onoff 1 0.008517 0.008517 0.049975 0.050827
Total 2001 16.7567 0.0083742
As you can see, the Star-P® built-in function ppdense_viewelement consumed the
majority of the compute time during the run. But what is ppdense_viewelement? This
function is invoked each time an array element (for example, a scalar) must be returned from
the server to the client for processing. Our program SumDifferences_loop.m iterates
over all elements in the vector and sums them, as follows:
n = length(xpp)-1;
for i = 1:n
dx = xpp(i+1)-xpp(i);
dx_sum = dx_sum + dx^2;
end
Recall that under Star-P®, scalar variables always live on the client. Therefore, when this
program requests the values xpp(i+1) and xpp(i), Star-P® goes to the server, gets the
individual elements out of the vector, and brings them to the client where they are added.
That's why ppdense_viewelement is invoked 1998 times.
Lessons Learned
When using ppperf on your code, look for the following things:
• Time spent on client vs. server. Is the program running on the client and the server
according to your expectations? That is, if you think you have exported a calculation
to the server, but you see significant activity on the client, this is a signal that your
program isn't fully optimized.
• Excessive network time. Monitor your program's network time. Keep in mind that
a 100 Mbit/s or 1 Gbit/s network link between your client and server can transport many
millions of bytes in well under a second. Therefore, if your network time is not
commensurate with transferring a small number of bytes (depending upon your
program's structure), you may be paying a communication time penalty due to for
loops or an unexpected data transfer.
• Number of times the client runs a task. If your client runs a short task many times,
or there is significant “ping-ponging” between the client and the server, your program
is causing too much client/server communication. Find a way to keep all the
computation on the server. Perhaps you need to vectorize more?
• Excessive time spent on one or two functions. If your program spends most of
its time running one particular function, you should probably focus attention on why
that is the case. If the one function was written by you, then it is a good candidate
for further optimization.
• Many calls to ppdense_viewelement. ppdense_viewelement transfers
scalar data from server to client. If this function dominates your function usage, it is
a signal that you need to vectorize your code.
Besides providing you a text report, ppperf can also show a graph of client and server
activity. This information can be useful if you want to see exactly when your computation was
passed from client to server and back again.
You invoke ppperf graphical mode in much the same way as you get a text report. The
particular command sequence looks like this:
ppperf o2 1
ppperf graph on
my_function
ppperf off
A screenshot showing the results of a graphical profiling session using ppperf is presented
in Figure 5-2.
Figure 5-2 Graphical output of ppperf
Graphical output of ppperf, showing activity on client (top), network (middle), and server
(bottom). The units are percent of activity, where 100% means that the particular Star-P®
subcomponent was active 100% of the time over the last measurement interval. Remember
that this is not a measure of CPU utilization! Rather, the graph shows which Star-P®
subcomponent is active (or has control over) performing your calculation.
Here are some things to keep in mind when using ppperf's graphical mode:
• The graphical display is only available in conjunction with the o2 and o3 statistics
gathering levels.
• When you turn on performance logging for graphical display, you must specify the
logging interval. The default value (1 second for a text report) does not apply to
graphical output. If you neglect to specify a logging interval, Star-P® will give you a
“No samples” error. Again, the logging interval must be an integer; the units are
seconds. Accordingly, in the above example the logging interval is explicitly set to
one second.
• You can get a real-time running update of activity on the client and server by turning
on graphics mode before invoking your function, as in the example shown in Figure
5-2. Your graphic display will be updated whenever the client has control over your
computation, and is available to update the graphics window. Note that if you kick
off a long, server-side calculation (for example, using ppeval), the ppperf graphic
won't update until the server is done working, and passes control to the client again.
The same is true for the client side when it is in a CPU-bound execution.
• After the run is over, turn off the performance logging using ppperf off, as shown
in Figure 5-2. If you do not turn off logging, then the performance graphic will
continue to update, and the region of interest - the portion of the graph showing your
computation running - will compress within the available space, making it hard to
read.
• If ppperf data gathering is not active when you invoke ppperf graph on (for
example, you have issued a ppperf off command), then ppperf will plot the
static results contained in the performance results table. If you do not have any data
in the performance results table, Star-P® will give you a “No samples” error.
• Since performance samples are made at regular, but large intervals, the client and
server utilization graphs - like that shown in Figure 5-2 - represent averages over the
sample interval. As you know, Star-P® ping-pongs control between client and server;
while one machine is busy processing, the other is essentially idle. In the program
SumDifferences_loop.m, control is passed between the client and the server
about 2000 times. However, the update interval is 1 second. Therefore, ppperf
cannot graph each and every time control is passed between client and server.
Rather, it plots the average time spent on each, over each 1-second measurement
interval.
• When both client and server are idle, ppperf's graphing function will indicate 100%
client utilization. This is because the graph itself is not a graph of CPU loading.
Rather, it is simply an indication of which computer is currently in control of your
computation. When both computers are idle, waiting for you to type something, then
the client is the computer that is in control. Therefore, its graph will indicate 100%
utilization.
You might wonder, “For what is ppperf's graphics facility useful?” It can be used for quick,
visual identification of situations where there is too much communication between client and
server during the course of a computation.
This is signaled by graphs showing compute activity on both the client and the server at the
same time. A better computation is shown in the graph in Figure 5-3. This particular
computation involved computing the Mandelbrot set using a task-parallel algorithm. In this
case, the client initialized some variables, and then passed control to the server. The server
performed the bulk of the computation over a period of about 30 seconds, and then returned
control to the client.
The resulting graph shows that 100% of the computation takes place on either the client or
the server, depending upon time. At no time does it appear that the computation is shared
between client and server. Therefore, this computation is not bogged down with client/server
communication.
Figure 5-3 Graphical output from ppperf showing a different Star-P® session.
In this example, a ppeval call was used to compute the Mandelbrot set. The ppeval call
lasted about 30 seconds, during which time the server was working on the computation 100%
of the time, while the client idled.
Any successful computation performed using Star-P® should show a similar graph: 100% of
the computation should take place on either the client or the server for long periods of time
(seconds or longer). Control of the computation may bounce back and forth between client
and server, but certainly not frequently, and at no time should your program appear to share
the computation between client and server.
This highlights another use of ppperf's graphing facility: It can alert you to situations where
you think a portion of a calculation is taking place on the server, but it is actually running on
the client. That is, since ppperf shows you where the computation is happening, it can help
you verify that your program is actually doing what you think it should be doing.
Finally, ppperf’s graphic mode can quickly show you if you are spending too much time
running on the client. Since you likely purchased Star-P® to help export computations to the
server, if you find that a lot of your compute time is spent on the client, then you probably
need to modify your program so that more of the computation is performed on the server.
Lessons Learned
• Use ppperf's graphic facility as a quick way to see if your program requires too
much client/server communication. This scenario is signaled by performance
graphs showing computation is shared between client and server.
• ppperf’s graphic facility can also verify that your code is actually doing what you
think it should be doing.
• The graphic facility can also show you if you are not getting enough use from your
server. In the best case, you should see client activity at the beginning and end of
your program run, and server activity in the middle for the bulk of its run.
To illustrate the utility of ppperf when optimizing your code's performance, let's look at an
example finite element method calculation (FEM). FEM problems typically involve
manipulating large matrices, and are computation intensive.
The example code shown below was originally written solely in MATLAB, with no parallel
extensions. The program consists of four parts:
1. Data read-in and initialization,
2. Building the stiffness matrix,
3. Solving the set of linear equations, and
4. Post-processing and solution visualization.
% Set up the global variables used in the calculation of the stiffness matrix.
tocnp = 1;
disp('Building stiffnessmatrix');
tic
% Set up global variables
set_globals(1);
% allocate row, column, value arrays
II = zeros(6,6,nelem);
JJ = II;
KK = II;
% Get stiffnessmatrix contribution for each mesh element
for i = 1:nelem
[II(:,:,i), JJ(:,:,i), KK(:,:,i)] = ...
get_k_matrix(pi(i,:),pj(i,:),pm(i,:),connec(i,:));
end
toc;
%
% Now we need to set the boundary conditions.
% For the boundary conditions we require that the
% vertices on the bottom stay fixed.
%
% find vertices on the bottom
disp('Apply boundary conditions');
tic;
nbase = length(ipoints_base);
K(2*ipoints_base-1,:) = 0.0;
K(2*ipoints_base,:) = 0.0;
K(:,2*ipoints_base-1) = 0.0;
K(:,2*ipoints_base) = 0.0;
K(2*ipoints_base-1,2*ipoints_base-1) = speye(nbase);
K(2*ipoints_base,2*ipoints_base) = speye(nbase);
toc;
% Calculate new point positions based on old ones and the displacement
% from the FEM analysis
tic;
i= 1:npoints;
new_points(i,1) = points(i,1)+displacement(2*i-1);
new_points(i,2) = points(i,2)+displacement(2*i);
toc
subplot(1,2,2)
h=trimesh(connec,new_points(:,1),new_points(:,2),zeros(size(new_points,1),1));
set(h,'EdgeColor','k');
view(2),axis equal,axis off,drawnow;
It might be tempting to run this program under ppperf to see what happens. However, since
it is a pure MATLAB program, it runs exclusively on the client. Therefore, it doesn't generate
any server-side performance data, so profiling with ppperf does not provide any useful
statistics. (Running this program under ppperf o3 would indeed show performance data,
specifically the performance data gathered by MATLAB. Since this is not relevant to Star-P®
performance tweaking, we will skip that step here.)
First, since FEM modeling is a logical candidate for data-parallel processing, we will simply
read the matrix data into the server (instead of the client) by replacing the load statement
with ppload. This tells Star-P® to read the data from a disk on the server directly into the
server's memory. (This implies that you previously copied the data onto the server machine
using a separate step, for example, using FTP.) This change is highlighted in blue in the
listing below.
Once the data is read into the server using ppload, a couple of other changes become
necessary. First, since points, connec, pi, pj, and pm are now all server-side variables, the
return from get_k_matrix(pi(i,:),pj(i,:),pm(i,:),connec(i,:)) will also be a
server variable. Therefore, we must initialize II, JJ, and KK on the server, instead of the
client. Second, since the matrices all live on the server, we must bring them to the client using
ppfront before plotting.
With these changes, the parallelized FEM program takes the following form:
%
% Now we need to set the boundary conditions.
%
% For the boundary conditions we require that the
% vertices on the bottom stay fixed.
%
% a) find vertices on the bottom
%
disp('Apply boundary conditions');
tic;
nbase = length(ipoints_base);
K(2*ipoints_base-1,:) = 0.0;
K(2*ipoints_base,:) = 0.0;
K(:,2*ipoints_base-1) = 0.0;
K(:,2*ipoints_base) = 0.0;
K(2*ipoints_base-1,2*ipoints_base-1) = speye(nbase);
K(2*ipoints_base,2*ipoints_base) = speye(nbase);
toc;
% Calculate new point positions based on old ones and the displacement
% from the FEM analysis
tic;
i= 1:npoints;
new_points(i,1) = points(i,1)+displacement(2*i-1);
new_points(i,2) = points(i,2)+displacement(2*i);
toc
set(h,'EdgeColor','k');
view(2),axis equal,axis off,drawnow;
subplot(1,2,2)
h=trimesh(connec,new_points(:,1),new_points(:,2),zeros(size(new_points,1),1));
set(h,'EdgeColor','k');
view(2),axis equal,axis off,drawnow;
With these changes, this program is now parallelized, and will execute on the server.
Unfortunately, with the above changes, the FEM program is now extremely slow. Watching
the output of disp as the program executes, it is clear that the above program gets stuck
somehow when it tries to build the stiffness matrix. But what is wrong? To investigate this
question, you can use ppperf in your Star-P® session as follows:
1. Turn on profiling: ppperf o2.
2. Run the program: fem_ppload.
3. Let the program run for a while. Then, when you are tired of waiting for it to complete, hit
<control>-C.
4. Stop profiling: ppperf off.
5. Bring up the ppperf graph: ppperf graph on.
>> ppperf o2
Start MATLAB/Star-P® Performance Metrics
>> fem_ppload
load grid file
Elapsed time is 0.100384 seconds.
Building stiffnessmatrix
Error in ==> datenum at 92
n = datenummx(arg1);
Error in ==>
/usr/local/starp-versions/6718/matlab/pp_perfupdate.p>pp_perfupdate at 87
Error in ==>
/usr/local/starp-versions/6718/matlab/cppclient/private/ppprofileupdate.p>pppr
ofileupdate at 63
Error in ==>
/usr/local/starp-versions/6718/matlab/@ddense/ctranspose.p>ctranspose at 4
Since both client and server show compute activity occurring at the same time, it is likely that
control of the program is ‘ping-ponging’ rapidly back and forth between client and server. This
implies a severe performance penalty, since each transfer of control involves a
communications delay.
The hypothesis of too much client-server activity is further evidenced by the result of running
ppperf report, as shown here:
Server: my_server
Elapsed: 72 seconds
-----------------------------------------------------------------------------
Performance Time Measurement
-----------------------------------------------------------------------------
count min max mean time metric
6205 0.0014 3.9374 0.0077 29.6009 Client Time
6205 0.0002 0.1298 0.0004 2.6080 Network Time
6206 0.0000 0.0025 0.0000 0.1488 Client2Server
6206 0.0000 0.0043 0.0000 0.1171 Server2Client
6205 0.0073 0.1317 0.0093 39.9593 Server Time
6206 0.0073 0.1317 0.0093 39.9717 Command
6206 0.0072 0.1239 0.0073 27.3850 Command Distribute
5555 0.0000 0.0528 0.0004 2.1035 Command Execute
978 0.0000 0.0137 0.0016 1.5488 Command Execute MathOp
6206 0.0000 0.0014 0.0001 0.4486 Command EStatus
3423 0.0000 0.0001 0.0001 0.2308 Command Execute Move
327 0.0002 0.0527 0.0005 0.1577 Command Execute LibOp
658 0.0000 0.0164 0.0001 0.0470 Command Execute SubsRef
1312 0.0000 0.0001 0.0000 0.0327 Command Execute Misc
163 0.0000 0.0002 0.0000 0.0055 Command Execute Redist
Performance Process Measurement
-----------------------------------------------------------------------------
value metric
109.1050 Starp Real Time
122.6221 Starp Sys Time
258.5273 Starp User Time
Star-P® Profiling
-----------------------------------------------------------------------------
function calls time avg time %calls %time
ppdense_viewelement 3423 29.0718 0.0084931 55.1652 43.2014
pp_dense_ppback 651 11.4339 0.017564 10.4915 16.991
ppdensend_subsasgn_slice 487 8.1989 0.016835 7.8485 12.1838
ppdense_subsref_row 652 7.5395 0.011564 10.5077 11.2039
ppdense_kron 326 3.2293 0.0099058 5.2538 4.7988
ppdense_scalar_op 326 3.1701 0.0097243 5.2538 4.7109
ppdense_transpose 163 2.7824 0.01707 2.6269 4.1348
ppdense_binary_op 163 1.5718 0.0096428 2.6269 2.3357
ppio_loadallvar 1 0.073072 0.073072 0.016116 0.10859
ppdense_subsref_col 3 0.06607 0.022023 0.048348 0.098182
ppdensend_add 1 0.042937 0.042937 0.016116 0.063806
ppdense_subsref_drow 3 0.033902 0.011301 0.048348 0.050379
ppbase_id2ddata 4 0.033217 0.0083042 0.064464 0.049361
ppbase_setoption 1 0.030537 0.030537 0.016116 0.045379
ppbase_profile_onoff 1 0.016158 0.016158 0.016116 0.024011
Total 6205 67.2935 0.010845
>>
At this point, it is clear that the code suffers from having a for loop, which drags down
Star-P® performance. Inspecting the code shows that there is indeed a for loop involved in
initializing the elements of the stiffness matrix, II, JJ, and KK. Optimizing this code obviously
requires eliminating the for loop. Since the loop involves no dependencies, it can become a
task-parallel operation, and can be replaced by a ppeval call to perform the job of initializing
II, JJ, and KK.
A new version of the program - in which the II, JJ, and KK are initialized in a ppeval call - is
shown below.
% Set up the global variables used in the calculation of the stiffness matrix.
tocnp = 1;
disp('Building stiffnessmatrix');
tic
% Set up global variables.
% We need to use ppeval in this case since
% we need to set the global variables on ALL processors.
out = ppeval('set_globals',1:np); % --- Star-P®! ---
%
% Get stiffnessmatrix contribution for each mesh element
% Note that we need to split the arguments along the rows
[II,JJ,KK] = ppeval('get_k_matrix',split(pi,1),split(pj,1),...
split(pm,1),split(connec,1));
% --- Star-P®! ---
toc;
%
% Now we need to set the boundary conditions.
%
% For the boundary conditions we require that the
% vertices on the bottom stay fixed.
%
% a) find vertices on the bottom
%
disp('Apply boundary conditions');
tic;
nbase = length(ipoints_base);
K(2*ipoints_base-1,:) = 0.0;
K(2*ipoints_base,:) = 0.0;
K(:,2*ipoints_base-1) = 0.0;
K(:,2*ipoints_base) = 0.0;
K(2*ipoints_base-1,2*ipoints_base-1) = speye(nbase);
K(2*ipoints_base,2*ipoints_base) = speye(nbase);
toc;
disp('Solve system');
tic;
displacement = K\F;
toc;
% Calculate new point positions based on old ones and the displacement
% from the FEM analysis
tic;
i= 1:npoints;
new_points(i,1) = points(i,1)+displacement(2*i-1);
new_points(i,2) = points(i,2)+displacement(2*i);
toc
subplot(1,2,2)
h=trimesh(connec,new_points(:,1),new_points(:,2),zeros(size(new_points,1),1));
set(h,'EdgeColor','k');
view(2),axis equal,axis off,drawnow;
In this version of the program, initializing the stiffness matrix is performed almost totally as a
parallel operation on the back-end HPC. The entire program takes under 15 seconds to
complete.
Running this program under ppperf reveals the reasons for the performance improvement:
Control of the computation stays with the server for almost the entire computation. Because
the for loop in the initialization section has been replaced with ppeval, the computation
does not need to ping-pong rapidly and repeatedly between client and server. This is shown
quite clearly in the graphical result from ppperf, shown in Figure 5-5. In that figure, transfer
of computational control started with the client, but quickly passed to the server. The
computation stayed on the server until the end of the run, when the computation was passed
back to the client for results visualization.
It's interesting to note that both client and server seem to have been active towards the end
(starting at around 27 seconds); this reflects the fact that ppfront was invoked to move the
results back to the client after they were generated on the server. As such, this behavior is
unavoidable since the results must live on the client in order to graph them.
At the beginning of the graph, there is a dead time of about 12 seconds. This represents the
time elapsed between typing ppperf on and typing the name of the function to run,
fem_ppload_ppeval. Then, once the program started up, control passed quickly from client
to server, and stayed with the server for most of the execution time.
At the end, client and server were both active (starting at about 27 seconds) as evidenced by
the rise in client utilization and the accompanying fall in server utilization. This is likely due to
data being transferred back to the client via ppfront.
Finally, the difference between this optimized run and the previous, slow run can be seen in
the results returned by ppperf report. The report generated by this successful run shows
the following:
• In the “Star-P® Profiling” section, no single function dominates the call count. This is
in contrast to the report generated for fem_ppload.m above, in which
ppdense_viewelement was invoked 3423 times.
• A new function, ppemode_emodecall, was invoked only twice, but soaked up 75%
of the compute time. This function is the Star-P® function which handles ppeval
calls on the server side. Since the ppeval call which initialized the stiffness matrix
soaked up the majority of the wall clock time during this run, it makes sense that
ppemode_emodecall uses most of the server processing time.
Lessons Learned
• Use ppperf in graphics mode to identify sections of code with excessive
client-server communication.
• Use ppperf report to provide detailed analysis of what resources your program
uses while executing.
• For best Star-P® performance, make sure your program is thoroughly vectorized.
Avoid using for loops over multi-dimensional data whenever you can; looping over
multi-dimensional data forces scalar data to be transferred between the client and
server, causing a significant time penalty due to communication overhead (see the
sketch below).
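The contrast looks like this in practice; a minimal sketch, assuming xpp is an existing
distributed vector on the server (the loop shown in the comment is the pattern to avoid):
% Looping touches one element per client/server round trip:
%   for i = 1:length(xpp), ypp(i) = xpp(i)^2 + 1; end
% The vectorized form performs the whole update in a single server-side operation:
ypp = xpp.^2 + 1;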
Command Explanation
ppperf clear This command clears the results table, and turns off
performance data logging. Use this command if you want
to end your performance monitoring session.
ppperf report This command prints out a large text report providing
information about compute resources utilized by your
program while it ran.
ppperf report detail This command prints out a large text report providing
information about compute resources utilized by your
program while it ran. It provides more detail than “ppperf
report”. In particular, it breaks down the process
measurement results for each compute node on your
parallel server.
ppperf graph off This command closes the performance graph window. It
does not affect the compiled performance data table.
UNIX Commands to Monitor the Server
While Star-P® is designed to allow you to program at the level of the MATLAB command
language and ignore the details of how your program runs on the HPC server, there are times
when you may want to monitor the execution of your program directly on the server.
The following commands will often be useful for monitoring server processes. Execute these
commands on the HPC server in a terminal window.
• top: This command is often the most useful. See man top for details. It displays the
most active processes on the system over the previous time interval, and can
display all processes or just those of a specific user. It can help you understand if
your Star-P® server processes are being executed, if they're using the processors,
if they're competing with other processes for the processors, etc. top also gives
information about the amount of memory your processes are using, and the total
amount of memory in use by all processes in the system.
• ps: The ps command will tell you about your active processes, giving a snapshot
similar to the information available via top. Since the Star-P® server processes are
initiated from an ssh or rsh session, you may find that ps -lu <yourlogin> will
give you the information you want about your Star-P® processes. In the event that
Star-P® processes hang or get disconnected from the client, this can give you the
process IDs you need to kill the processes.
Chapter 6
Star-P® Functions
This chapter summarizes the Star-P® functions that are not part of standard MATLAB,
describes their implementation, and documents their syntax.
Basic Server Functions Summary
Function Description
General Functions
fseek The return value FID is a distributed file
identifier. Passing this value to the
following MATLAB functions: fopen(),
fread(), fwrite(), fseek(), frewind() and
fclose() will operate on distributed
matrices on the server with the same
semantics as with regular file id on the
client.
np Returns the number of processes in the
server.
p Creates an instance of a dlayout
object.
pp Is useful for users who wish to use the
variable p for another purpose.
ppbench Collects basic information about the
hardware and software characteristics
of your server, and runs low-level
performance tests on your server.
ppclear Clears distributed variables from the
server memory.
ppgetoption Returns the value of Star-P®
properties.
ppsetoption Sets the value of Star-P® properties.
ppgetlog Get the Star-P® server log file.
ppgetlogpath Get starpserver log file path.
ppinvoke Invoke a function contained in a
previously loaded user library via the
Star-P® Software Development Kit
(SDK).
pploadpackage Load a compiled user library on the
server.
ppunloadpackage Unload a user library from the server.
ppfopen Open a distributed server-side file
descriptor. The syntax is similar to that
of the regular fopen() but the file is
accessed on the server. You control
data distribution when reading data
from a file on the server as column
distributed only.
ppquit Disconnects from the server and
causes the server to terminate.
ppwhos Gives information about distributed
variables and their sizes (similar to
whos).
pph5whos Print information about variables in a
HDF5 file.
Data Movement Functions
ppchangedist Allows you to explicitly change the
distribution of a matrix in order to avoid
implicit changes in subsequent
operations.
pph5write Writes variables to an HDF5 file on the
server.
pph5read Reads distributed variables from an
HDF5 file on the server
ppload Loads a data set from the server
filesystem to the back-end.
ppsave Saves backend data to the server
filesystem.
Task Parallel Functions
bcast, ppbcast Broadcasts an array section where the
entire argument is passed along to
each invocation of a function called by
ppeval.
split, ppsplit Splits an array for each iteration of a
ppeval function.
ppeval Executes a specified function in parallel
on sections of input array(s). When using
ppeval or ppevalsplit to call a compiled C++
library function, use the format PACKAGENAME:FNAME,
where PACKAGENAME is the module name as returned by an
earlier call to ppevalcloadmodule, and FNAME is the
function name registered in that module. For example, the
call ppeval('imsl:polyfit', arg1, arg2) invokes the polyfit
function in the imsl module.
ppevalcloadmodule Loads a C++ module for task parallel
operation on the server. This function is
deprecated as of Release 2.6.0.
Loading compiled C++ libraries can
now also be performed using
pploadpackage.
ppevalcunloadmodule Removes a previously loaded C++
module from the server. This function is
deprecated as of Release 2.6.0.
Unloading compiled C++ libraries can
now also be performed using
ppunloadpackage.
Performance Functions
ppperf Star-P®’s performance monitoring
function.
ppprofile Collects and displays performance
information on Star-P®.
pptic/pptoc Provides information complementary to
the MATLAB® tic/toc command
General Functions
fseek
Repositions the file position indicator in the file with the given distributed file identifier FID to
the byte specified with the offset.
The return value FID is a distributed file identifier. Passing this value to the following MATLAB
functions:
• fopen()
• fread()
• fwrite()
• fseek()
• frewind() and
• fclose()
will operate on distributed matrices on the server with the same semantics as with a regular
file identifier on the client.
np
n = np
Returns the number of processes on the server.
p
z = p
Creates an instance of a dlayout object.
pp
z = pp
• pp is an alias to p. pp is useful for users who wish to use the variable p for another
purpose.
Reference
• See p.
ppbench
ppbench collects information about the basic hardware and software characteristics of your
server. When utilizing multiple CPUs in a cluster configuration, the output of this test should
be examined for consistency; for example, the amount of memory per node should be the
same and the reported CPU information should be similar.
If ppbench is invoked with an output argument, then it will return a data structure that can be
stored using the save function and later displayed (see example 1 below).
If ppbench is invoked with no input arguments, then it acts as if it were invoked with the
-levels [0,1] switch.
If the -levels switch is used, the additional argument is either a scalar or a list of levels to
be run. (see example 2 below)
ppbench('-levels',0) will print the lowest level system information which is extracted
from /proc/cpuinfo and /proc/meminfo. In addition, when the Star-P® server utilizes
more than 1 CPU, the generated report will include MPI latency and bandwidth data.
ppbench('-levels',1) will print the results of a single CPU HPC Streams benchmark.
This provides an interesting data point that represents an important class of simple
operations that turn up frequently in HPC applications. See http://www.cs.virginia.edu/stream/ to
see how your results compare with a range of commodity and special purpose CPUs.
If the '-display' switch is used (example 3), the additional argument identifies the data
structure saved from a previous invocation of ppbench, which is then displayed.
Example 1:
X = ppbench
Example 2:
ppbench('-levels',[0,1])
Example 3:
ppbench('-display',X)
ppclear
ppclear eliminates distributed variables from the caller’s Star-P® workspace, and
immediately frees the memory allocated for them on the server. If no argument is provided,
then ppclear removes all distributed variables in the workspace.
Important: Invoking bpp = app; ppclear app; will leave the symbol bpp in your
workspace, but the distributed object accessed through bpp will no longer
exist. When you desire a hard copy of a variable, as opposed to a soft copy,
use assignment statements such as bpp = +app; or bpp = app(:,:);.
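A minimal sketch of the soft-copy versus hard-copy behavior described in the note; app, bpp,
and cpp are placeholder variable names:
app = rand(1000*p);   % distributed matrix created on the server
bpp = app;            % soft copy: bpp refers to the same distributed object
cpp = app(:,:);       % hard copy: a separate distributed object
ppclear app           % frees the object behind app; bpp is no longer usable, cpp still is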
ppgetoption
Returns the value of Star-P® properties.
ppsetoption
ppsetoption('option','value')
Sets the value of Star-P® properties.
ppgetlog
Returns the filename of a local temporary file containing the Star-P® server log. The
temporary file is deleted when MATLAB exits.
• ppgetlog(FILENAME)
Copies ALL files from the server log directory into the client log directory and creates
an all_logs.zip archive with all files in the client log directory.
• ppgetlog -all <filename>
Creates a <filename> zip archive with all files from the client and server log directories.
• f = ppgetlog -all
Creates an all_logs.zip archive with all files from the client and server log
directories and returns the full filename of the archive.
• ppgetlog -all -nozip
Copies all files from the server log directory into the client log directory.
The '-nozip' option is ignored if '-all' is not specified, <filename> is ignored if '-all
-nozip' is specified, and the output is an empty string if '-all -nozip' is specified.
Note: ppgetlog will make an SSH connection to the Star-P® server machine to fetch
the log file, so if your ssh client is not configured for passwordless SSH, then you
may be prompted for your server password again.
ppgetlogpath
Returns the path of the Star-P® server log directory. The naming format for individual
session log directories is YYYY_MM_DD_HHMM_SS. Hours are represented in the 24-hour
format. The log files associated with a session include the following:
<log>/workgroup_manager.log
<log>/starp_server.log
<log>/octave_$MPI_RANK.log
<log>/machine_file
<config>/machine_file.user_default
<log>/starp_session_id.*
<config>/user_env.sh
<log>/starpmatlab.log
<log>/starpclient.log
ppinvoke
Invoke a function contained in a previously loaded user library via the Star-P® SDK.
Note: See the “Star-P® Software Development Kit (SDK) Tutorial and Reference Guide”
for more information on this function.
pploadpackage
Loads a compiled task parallel or data parallel user library on the server using positional
arguments.
stringTP = pploadpackage('C','/path/to/package.so','TPname')
stringTP = pploadpackage('C','/path/to/package.so')
Loads a package named 'package.so' containing compiled functions for later use in
ppeval. The first argument specifies the language in which the target package is written.
Currently, only C or C++ libraries can be loaded on the server for task parallel operation, and
they require the initial argument to be the string 'C'. The third argument, 'TPname', specifies
a user-defined name used to identify the task parallel package on the server; this name is
returned in the function output stringTP. If 'TPname' is not provided, then the name
assigned to stringTP is the filename without path, extension, or underscores, converted to
lowercase. This convention ensures that the default name can always be used to prefix a
function name and is recognizable by the Star-P® client and server.
stringDP = pploadpackage('/path/to/package.so','DPname')
stringDP = pploadpackage('/path/to/package.so')
When the initial engine string argument is omitted, the specified package is loaded as a data
parallel package. Currently, only C or C++ libraries can be loaded on the server in this way.
The second argument, 'DPname', specifies a user-defined name used to identify the data
parallel package on the server; this name is returned in the function output stringDP. If
'DPname' is not provided, then the name assigned to stringDP is the filename without path,
extension, or underscores, converted to lowercase. This convention ensures that the default
name can always be used to prefix a function name and is recognizable by the Star-P® client
and server.
Note: See the “Star-P® Software Development Kit (SDK) Tutorial and Reference Guide”
for more information on this function.
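A minimal sketch of how pploadpackage, ppeval, and ppunloadpackage fit together; the
library path, package name mylib, function name myfun, and input Xpp are placeholders, not
part of the SDK itself:
tp = pploadpackage('C', '/path/to/mylib.so', 'mylib');   % tp contains 'mylib'
y = ppeval([tp ':myfun'], split(Xpp, 1));                 % call myfun once per row of Xpp
ppunloadpackage('C', tp);                                 % remove the package from the server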
ppunloadpackage
Unload a user task parallel or data parallel library from the server.
ppunloadpackage('C','TPname')
ppunloadpackage('C',stringTP)
By passing the initial engine string argument,'C', along with a string containing the name of
a compiled language task parallel package that has previously been loaded on the server,
ppunloadpackage will unload the package from the Star-P® server’s current compiled
language task parallel engine.
ppunloadpackage('DPname')
ppunloadpackage(stringDP)
By passing only a single string argument, containing the name of a compiled language data
parallel package that has previously been loaded to the server, ppunloadpackage will
unload the package from the Star-P® server.
In the case of unloading either a task parallel or data parallel package, if the name given for the
package does not match the name of a package already loaded on the server, then an error will
be thrown.
Note: See the “Star-P® Software Development Kit (SDK) Tutorial and Reference Guide”
for more information on this function.
ppfopen
Open a distributed server-side file descriptor. The syntax is similar to that of the regular
fopen(), but the file is accessed on the server. Data read from a file on the server can be
distributed by columns only.
Opens file F in the mode specified by MODE. MODE can be:
MODE DESCRIPTION
rb read
wb write (create if necessary)
ab append (create if necessary)
rb+ read and write (do not create)
wb+ truncate or create for read and write
ab+ read and append (create if necessary)
Return Values
The return value FID is a distributed file identifier. Passing this value to the following
MATLAB functions: fopen(), fread(), fwrite(), frewind() and fclose() will
operate on distributed matrices on the server with the same semantics as with regular file
id on the client.
Note: For fread(), data read from a file on the server is column distributed only.
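A minimal sketch of reading server-side data through a distributed file identifier; the path and
matrix dimensions are placeholders:
fid = ppfopen('/server/data/samples.bin', 'rb');  % distributed file identifier
App = fread(fid, [1000 1000], 'double');          % data is read into a column-distributed matrix
fclose(fid);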
ppquit
Disconnects from the server and causes the Star-P® server to terminate.
ppwhos
ppwhos lists the variables in the caller's Star-P® workspace. ppwhos is aware of distributed
matrices that exist on the server so it will return the correct dimensions and sizes for those
matrices, as well as returning the distribution information.
ppwhos is the Star-P® equivalent of the MATLAB whos command. It provides detailed
information about the distribution of the server side variables (2nd column), their size (3rd
column), and their types (4th column).
All distributed variables also show up in the MATLAB whos command, but the information
displayed there does not accurately represent their size and distribution properties. The
ppwhos output helps you align the distributions of the variables; in general, giving all
variables similar distributions provides the best performance. It also helps you identify large
variables that should be distributed but are not, and small variables that are distributed but
should not be.
pph5whos
pph5whos('FILE')
Prints size and type information of variables in an HDF5 FILE on the server. The format is
similar to the MATLAB whos function.
S = pph5whos('FILE')
Returns the dataset names in an HDF5 FILE along with the corresponding size and type
information in a structure array, S.
Note: pph5whos is able to parse an arbitrary HDF5 file, but will return accurate size and
type information only for datasets that consist of double or double complex dense
and sparse data. In all other cases, the type field is marked 'unknown'.
ppback
Bpp = ppback(A)
Bpp = ppback(A, d)
Transfers the MATLAB matrix A to the backend server and stores the result in Bpp. A can be
dense or sparse.
If A is sparse:
• Bpp is row distributed.
ppfront
Transfers the distributed matrix App from the server to the MATLAB client.
B = ppfront(App)
ppfront transfers the distributed matrix App from the server to the MATLAB client.
• If App is a distributed dense matrix, then B is a dense MATLAB matrix.
• If App is a distributed sparse matrix, then B is a sparse matrix.
dlayout objects are converted to double and other non-distributed objects are preserved.
Important: A warning message is emitted if the transfer is over a threshold size (currently
100MB), to avoid silent performance losses. Display of the warning message or
the value of the threshold can be changed by use of the ppsetoption
command. Currently, there is also a 2GB limit for the size of data that can be
transferred from the server to the client using ppfront.
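A minimal sketch of a client-to-server-and-back round trip using ppback and ppfront:
a = rand(1000);     % ordinary MATLAB matrix on the client
App = ppback(a);    % transfer it to the server as a distributed matrix
Bpp = App * App;    % the computation runs on the server
B = ppfront(Bpp);   % bring the result back to the client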
ppchangedist
The ppchangedist command allows you to explicitly change the distribution of a matrix in
order to avoid implicit changes in subsequent operations. This is especially important to do
when performing operations within loops. In order to maximize performance, operands
should have conformant distributions. ppchangedist can be used before and/or after the
loop to prepare for subsequent operations.
ppchangedist(App,dist)
Important: A warning message is emitted if the transfer is over a threshold size (currently
100MB), to avoid silent performance losses. Emission of the message or the
value of the threshold can be changed by use of the ppsetoption command.
pph5write
Writes VARIABLE1 to DATASET1 in the FILE specified on the server in the HDF5 format.
• If the FILE already exists, it is overwritten.
• Similarly if one of the dataset variables already exists, it is also overwritten with the
new variable.
Note: Currently, only writing double and double complex dense and sparse matrices is
supported.
pph5read
Reads from FILE, the contents of DATASET1 into VARIABLE1, DATASET2 into VARIABLE2,
etc.
• If any of the datasets is missing or invalid, or the FILE is not a valid HDF5 file, the
function returns an error.
Only the contents of datasets which contain double or double complex dense or sparse data
can currently be read. In the latter case, the sparse matrix must be stored in a specific format
outlined in “How Star-P® Represents Sparse Matrices”.
ppload
Loads the distributed objects named v1, v2, ... from the file f into variables of the same
names. Specify the distribution to use with dist.
Loads all variables out of mat file f, retaining their original names. All loaded matrices
will be distributed the same way, given by dist. A dist value of 1 denotes a
row-distributed object, and a value of 2 denotes a column-distributed object.
• ppload('f','v1', 'v2', ...)
• ppload('f')
ppsave
Saves the distributed objects v1, v2, ... directly to the server file f, each under its own name.
If no variables are listed, saves all distributed objects currently assigned to variable
names in the workspace.
• ppsave('f', 'v1', 'v2', ..., -append)
Splits the variable data into one file per processor, each containing the local data for that
processor.
• ppsave f v1, v2, ...
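A minimal sketch of moving data through the server filesystem with ppload and ppsave; the
file names and the variables A and B are placeholders, and data.mat is assumed to already
exist on the server:
ppload('data.mat', 'A', 'B');   % A and B become distributed variables
Cpp = A + B;                    % server-side computation
ppsave('results.mat', 'Cpp');   % write Cpp back to the server filesystem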
bcast, ppbcast
y = bcast(x)
y = ppbcast(x)
Broadcasts x in its entirety to each invocation of the function called by ppeval.
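A minimal sketch of combining split and bcast in a ppeval call; myfun is a placeholder
function that takes one row of Xpp together with the whole vector w:
y = ppeval('myfun', split(Xpp, 1), bcast(w));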
split, ppsplit
Split a distributed object Xpp along dimension dim. Used as input to ppeval.
y = split(Xpp,dim)
y = ppsplit(Xpp,dim)
Example
y = split(Xpp,1)
Each row of Xpp is then an input to the function specified in the ppeval call.
ppeval
Execute a function in parallel on distributed data. ppeval is just another way of specifying
iteration.
[o1,o2,...,oN] = ppeval('foo',in1,in2,...,inl)
The output arguments, o1, o2, ..., oN are ddense or ddensend arrays representing
the results of calling 'foo'. Each output argument is created by concatenating the result of
each iteration along the next highest dimension; for example, if K iterations of foo are
performed and the output of each iteration is a matrix of size MxN, then the corresponding
output after the ppeval invocation will be an MxNxK array.
Note: Prior versions of Star-P® had a version of ppeval that did not reshape the output
arguments into ddense objects. For backward compatibility, that behavior is now
available as ppevalsplit.
If foo returns n output arguments, then ppeval returns n output arguments. See also split and
bcast.
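A minimal sketch of the concatenation rule described above, assuming Xpp is an 8-by-5
distributed matrix and rowstats is a placeholder function that returns a 1-by-5 row for each
input row:
Ypp = ppeval('rowstats', split(Xpp, 1));   % 8 iterations, each returning a 1x5 result
size(Ypp)                                  % 1 5 8: results stacked along the next dimension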
When using ppeval or ppevalsplit to call a compiled C++ library function, use the format
MODULENAME:FNAME, where MODULENAME is the module name as returned by an earlier call
to ppevalcloadmodule, and FNAME is the function name registered in that module. For
example, the call ppeval('imsl:polyfit', arg1, arg2) invokes the polyfit function in
the imsl C++ module with input arguments arg1 and arg2.
This section lists the known differences between MATLAB and Octave, which is useful to
know when Octave is set as your task parallel engine (the default setting).
• If an inf value is present in a matrix that is used as an argument to eig in ppeval,
Star-P® may hang, while MATLAB returns an error.
• When using the Star-P® Octave TPE, the evaluation of the '++' and '--'
auto-increment/decrement operators differs between ppeval and MATLAB. For
example, x=7;++x returns 8 in ppeval, but returns 7 in MATLAB.
ppevalsplit
ppevalsplit()
ppevalsplit executes a function in parallel in the same way as ppeval but returns its
outputs as dcell objects rather than reshaping them into ddense arrays. A dcell is
analogous to a MATLAB cell array. The dcell type is different from the other
distributed matrix or array types, as it may not have the same number of data elements per
dcell iteration and hence doesn't have the same degree of regularity as the other
distributions. This enables dcells to be used as return arguments for ppevalsplit().
Because of this potential irregularity, a dcell object cannot be used for much of anything
until it is converted into a “normal” distributed object via the reshape operator. The only
operators that work on a dcell are those that help you determine what to convert it into
(size, numel, length, and ppwhos) plus reshape, which performs the conversion. Luckily,
you will almost never need to be aware of dcell arrays or manipulate them.
When using ppevalsplit to call a compiled C++ library function, use the format
PACKAGENAME:FNAME, where PACKAGENAME is the module name as returned by an earlier
call to ppevalcloadmodule (deprecated) or pploadpackage, and FNAME is the function
name registered in that package. For example, the call
ppevalsplit('imsl:polyfit', arg1, arg2) invokes the polyfit function in the imsl
C++ module with input arguments arg1 and arg2.
ppevalcloadmodule
This function is deprecated as of release 2.6.0. Compiled C and C++ task parallel libraries
can now be loaded on the server with pploadpackage.
ppevalcunloadmodule
ppevalcunloadmodule(NAME)
This function is deprecated as of release 2.6.0. Compiled C and C++ task parallel libraries
can now be unloaded from the server with ppunloadpackage.
Performance Functions
ppperf
ppperf
Provides fine-grained profiling of compute activity on both the client and the server together.
It pays close attention to the time required to perform computational tasks. It also tracks
communication between the client and server over the network. The vision behind ppperf is
to provide you a top-level view of what your program is doing as it runs your calculation.
Using the information provided by ppperf, you can
• identify program choke points,
• identify excessive client/server communication,
• see what functions are invoked on both client and server, and
• see how long each function takes to finish.
Command Explanation
ppperf report This command prints out a large text report providing
information about compute resources utilized by your
program while it ran.
ppperf report detail This command prints out a large text report providing
information about compute resources utilized by your
program while it ran. It provides more detail than
ppperf report. In particular, it breaks the process
measurement results down for each compute node on
your parallel server.
ppperf graph on This command displays a graph showing compute
resource utilization on the client, network, and server.
If you invoke this command before running your
program, it will show you a real-time graph of your
computation's activity (as long as control passes to the
client). If you invoke this command after executing
ppperf off, it will show you the static graph of
compute activity recorded between ppperf on and
ppperf off.
ppperf graph off This command closes the performance graph window.
It does not affect the compiled performance data table.
ppprofile
The ppprofile command collects and displays performance information for Star-P®.
ppprofile is a profiler for the Star-P® server. It allows you to examine which function calls
the Star-P® server makes and how much time is spent in each call.
• ppprofile display displays the data about each server call as it occurs.
• ppprofile nodisplay delays the immediate display of data about each server call.
See "Summary and Per-Server-Call Timings with ppprofile" for examples of the usage of
ppprofile.
Example
>> ppprofile on
Then follow with the commands or scripts of interest and end with ppprofile report:
>> ppprofile on
>> app = rand(1000*p);
>> bpp = inv(app);
>> dpp = inv(app);
>> cpp = eig(bpp);
>> ppprofile report
function calls time avg time %calls %time
ppscalapack_eig 1 6.3244 6.3244 10 90.082
ppscalapack_inv 2 0.62992 0.31496 20 8.9723
ppdense_scalar_op 1 0.014254 0.014254 10 0.20303
ppdense_binary_op 1 0.012628 0.012628 10 0.17987
ppdense_sumv 1 0.009353 0.009353 10 0.13322
ppdense_rand 1 0.009036 0.009036 10 0.1287
ppbase_setoption 1 0.00856 0.00856 10 0.12192
ppdense_transpose 1 0.006571 0.006571 10 0.093594
ppdense_sum 1 0.005996 0.005996 10 0.085404
Total 10 7.0208 0.70208
The ppprofile information is ordered in columns and displays, from left to right, the server
function called, the number of function calls, the time spent inside the function, the average
time spent inside the function per function call, the percentage of function calls, and the
percentage of time spent inside the function. For the full range of functionality of ppprofile,
please consult the Command Reference Guide or type help ppprofile in Star-P®.
pptic/pptoc
In addition to the number of messages and bytes received and sent, pptic/pptoc shows
the time spent on communication and calculation as well as the number of distribution
changes needed to accomplish the instructions enclosed by the pptic/pptoc statement.
The two important pieces of information contained in pptic/pptoc that affect performance
are the bytes received or sent and the number of distribution changes.
Combining client variables and server variables in the expression will result in the movement
of the client variable to the server, which will show up in the bytes received field. Since data
movement is expensive, this is a possible place to enhance performance, especially if the
expression happens to be located inside a looping construct. For example, compare the
following two calculations:
Example 1
Example 2
In the first example, you see that the number of bytes received by the server is exactly the
size of App, 1000*1000*8 bytes = 8 MB, and that the communication took 1.16 sec.
In the second example, the number of bytes received is 212 or 37,000 times smaller. These
212 bytes contain the instructions to the server that specify what operations need to be
performed. The penalty you pay in the first example is 1.16 sec of data transfer, which could
have been prevented by creating the variable App on the server instead of on the client.
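A sketch of the two patterns being compared (the original example code is not reproduced
here); a is an ordinary 1000-by-1000 client matrix, while ones(1000*p) and rand(1000*p) are
created directly on the server:
a = rand(1000);
pptic; Bpp = a + ones(1000*p); pptoc     % the 8 MB client matrix a must be shipped to the server
App = rand(1000*p);                      % create the data on the server instead
pptic; Cpp = App + ones(1000*p); pptoc   % only a short instruction message is sent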
The number of distribution changes reported by pptic/pptoc indicates how often Star-P®
needed to make a temporary change to the distribution of a variable, for example, from row to
column distributed, in order to perform a set of instructions. Distribution changes cost time
and should be avoided whenever possible when optimizing code for performance (note that
distribution changes become more expensive for slower interconnects between the
processors, e.g., clusters). In general, keeping the distributions of all variables aligned, i.e.,
all row distributed or all column distributed, prevents distribution changes and improves
performance.
Chapter 7
This chapter lists the MATLAB® functions supported by Star-P®. The table in the section titled
“Data Parallel Functions Listed Alphabetically” lists the supported data parallel functions in
alphabetical order, while the tables in the section titled “Task-Parallel Functions Listed by
Default Platform TPE” list task-parallel functions for what is referred to as “ppeval()” mode.
Sparse matrices and functions operating on sparse matrices cannot currently be passed into
a ppeval call, but may be used within the function called by ppeval.
Table 1 lists the MATLAB® functions available for Data-Parallel Computing with Star-P®
Release 2.7 x86/64 or Itanium-based Servers.
1. MATLAB® is a registered trademark of The MathWorks, Inc. Star-P® and the "star p" logo are
registered trademarks of Interactive Supercomputing, Inc. Other product or brand names are
trademarks or registered trademarks of their respective holders. ISC's products are not
sponsored or endorsed by The MathWorks, Inc. or by any other trademark owner referred to in this
document.
Data Parallel Functions Listed Alphabetically
acot elfun
acoth elfun
acscd elfun Yes Yes
acsc elfun
acsch elfun
all ops Yes Yes
and ops Yes Yes
angle elfun Yes Yes
any ops Yes Yes
asecd elfun Yes Yes
asec elfun
asech elfun
asind elfun Yes Yes
asin elfun Yes Yes
asinh elfun Yes Yes
atan2 elfun Yes Yes
atand elfun Yes
atan elfun Yes Yes
atanh elfun Yes Yes
blkdiag elmat Yes Yes
cat elmat Yes Yes
ceil elfun Yes Yes
cell datatypes Yes
chol matfun Yes
cplxpair elfun
colon ops
colperm sparfun Yes
linspace elmat
log10 elfun Yes Yes
log1p elfun Yes
log2 elfun Yes Yes
log elfun Yes Yes
logical datatypes Yes Yes
logspace elmat
lt ops Yes Yes
lu matfun Yes
magic elmat
max datafun Yes Yes
mean datafun Yes Yes
median datafun Yes Yes
meshgrid elmat Yes Yes
min datafun Yes Yes
minus (-) ops Yes Yes
mldivide (\) ops Yes Yes
mod elfun Yes Yes
mpower (^) ops Yes Yes
mrdivide (/) ops Yes Yes
mtimes (*) ops Yes Yes
nan elmat
nchoosek specfun
ndgrid elmat Yes
ndims elmat Yes Yes
ne ops Yes Yes
nnz sparfun Yes Yes
Task-Parallel Functions Listed by Default Platform TPE
Table 2 lists the MATLAB® functions available for the Default Task Parallel Engine (TPE) for
SGI-Altix/Itanium-based Servers in Star-P® Release 2.7, and for the optional TPE for
x86/64-based Servers.
conv datafun
corrcoef datafun
cov datafun
cumprod datafun
cumsum datafun
cumtrapz datafun
deconv datafun
del2 datafun
detrend datafun
diff datafun
fft datafun
fft2 datafun
fftn datafun
fftshift datafun
filter datafun
filter2 datafun
gradient datafun
hist datafun
ifft datafun
ifftn datafun
max datafun
mean datafun
median datafun
min datafun
prod datafun
sort datafun
sortrows datafun
std datafun
sum datafun
trapz datafun
var datafun
cast datatypes
cell datatypes
cell2mat datatypes
cell2struct datatypes
cellfun datatypes
class datatypes
deal datatypes
double datatypes
fieldnames datatypes
func2str datatypes
functions datatypes
getfield datatypes
isa datatypes
iscell datatypes
isfield datatypes
isnumeric datatypes
isstruct datatypes
logical datatypes
mat2cell datatypes
num2cell datatypes
orderfields datatypes
rmfield datatypes
setfield datatypes
single datatypes
str2func datatypes
struct datatypes
struct2cell datatypes
abs elfun
acos elfun
acosh elfun
acot elfun
acoth elfun
acsc elfun
acsch elfun
angle elfun
asec elfun
asech elfun
asin elfun
asinh elfun
atan elfun
atan2 elfun
atanh elfun
ceil elfun
cplxpair elfun
complex elfun
conj elfun
cos elfun
cosh elfun
cot elfun
coth elfun
csc elfun
csch elfun
exp elfun
fix elfun
imag elfun
isreal elfun
log elfun
log10 elfun
log2 elfun
mod elfun
nextpow2 elfun
nthroot elfun
pow2 elfun
real elfun
rem elfun
round elfun
sec elfun
sech elfun
sign elfun
sin elfun
sind elfun
sinh elfun
sqrt elfun
tan elfun
tanh elfun
unwrap elfun
blkdiag elmat
cat elmat
circshift elmat
compan elmat
diag elmat
eps elmat
eye elmat
find elmat
flipdim elmat
fliplr elmat
flipud elmat
flops elmat
hankel elmat
hilb elmat
i elmat
ind2sub elmat
intmax elmat
intmin elmat
invhilb elmat
ipermute elmat
isempty elmat
isequal elmat
isequalwithequalnans elmat
isinf elmat
isnan elmat
isscalar elmat
isvector elmat
j elmat
length elmat
linspace elmat
logspace elmat
meshgrid elmat
ndims elmat
numel elmat
ones elmat
pascal elmat
permute elmat
pi elmat
rand elmat
randn elmat
realmax elmat
repmat elmat
reshape elmat
rosser elmat
rot90 elmat
rref elmat
shiftdim elmat
size elmat
squeeze elmat
sub2ind elmat
toeplitz elmat
tril elmat
triu elmat
vander elmat
wilkinson elmat
zeros elmat
fminbnd funfun
fminsearch funfun
fzero funfun
inline funfun
ode23 funfun
ode45 funfun
quad funfun
quadl funfun
vectorize funfun
addpath general
ans general
beep general
brighten general
cd general
clear general
computer general
delete general
diary general
dir general
dos general
echo general
exit general
fileattrib general
format general
genpath general
getenv general
isdir general
ispc general
isunix general
load general
ls general
mex general
mkdir general
more general
pack general
path general
pwd general
quit general
rehash general
rmdir general
rmpath general
save general
savepath general
system general
type general
unix general
ver general
which general
who general
whos general
clc iofun
csvread iofun
csvwrite iofun
fclose iofun
feof iofun
ferror iofun
fgetl iofun
fgets iofun
fileparts iofun
filesep iofun
fopen iofun
fprintf iofun
fread iofun
frewind iofun
fscanf iofun
fseek iofun
ftell iofun
fullfile iofun
fwrite iofun
home iofun
rename iofun
tar iofun
tempdir iofun
tempname iofun
textread iofun
untar iofun
unzip iofun
assignin lang
break lang
builtin lang
case lang
catch lang
continue lang
disp lang
else lang
elseif lang
end lang
error lang
eval lang
evalin lang
exist lang
feval lang
for lang
global lang
if lang
input lang
inputname lang
isglobal lang
iskeyword lang
isvarname lang
keyboard lang
lasterr lang
lastwarn lang
mislocked lang
mlock lang
munlock lang
nargchk lang
nargin lang
nargout lang
otherwise lang
persistent lang
return lang
switch lang
try lang
varargin lang
varargout lang
warning lang
while lang
orth matfun
pinv matfun
qr matfun
qz matfun
rank matfun
schur matfun
sqrtm matfun
svd matfun
trace matfun
ismember ops
kron ops
ldivide (.\) ops
le ops
lt ops
minus (-) ops
mldivide (\) ops
mpower (^) ops
mrdivide (/) ops
mtimes (*) ops
ne ops
not ops
or ops
plus (+) ops
power (.^) ops
rdivide (./) ops
setdiff ops
setxor ops
times (.*) ops
uminus (-) ops
union ops
unique ops
uplus (+) ops
vertcat ops
xor ops
interp1 polyfun
interp2 polyfun
interpft polyfun
mkpp polyfun
pchip polyfun
poly polyfun
polyarea polyfun
polyder polyfun
polyfit polyfun
polyval polyfun
polyvalm polyfun
ppval polyfun
residue polyfun
roots polyfun
spline polyfun
ss2tf polyfun
unmkpp polyfun
colamd sparfun
colperm sparfun
dmperm sparfun
etree sparfun
etreeplot sparfun
full sparfun
gplot sparfun
issparse sparfun
luinc sparfun
nnz sparfun
nonzeros sparfun
nzmax sparfun
randperm sparfun
spalloc sparfun
sparse sparfun
spconvert sparfun
speye sparfun
spfun sparfun
spones sparfun
spparms sparfun
sprand sparfun
sprandn sparfun
sprandsym sparfun
spy sparfun
symamd sparfun
airy specfun
besselh specfun
besseli specfun
besselj specfun
besselk specfun
bessely specfun
beta specfun
betainc specfun
betaln specfun
cart2pol specfun
cart2sph specfun
cross specfun
dot specfun
erf specfun
erfc specfun
erfinv specfun
gamma specfun
gammainc specfun
gammaln specfun
gcd specfun
hsv2rgb specfun
lcm specfun
legendre specfun
perms specfun
pol2cart specfun
primes specfun
rgb2hsv specfun
sph2cart specfun
base2dec strfun
bin2dec strfun
blanks strfun
cellstr strfun
char strfun
deblank strfun
dec2base strfun
dec2bin strfun
dec2hex strfun
findstr strfun
hex2dec strfun
hex2num strfun
int2str strfun
iscellstr strfun
ischar strfun
isletter strfun
isspace strfun
isstr strfun
lower strfun
mat2str strfun
num2str strfun
regexp strfun
regexpi strfun
regexprep strfun
setstr strfun
sprintf strfun
sscanf strfun
str2double strfun
str2mat strfun
str2num strfun
strcat strfun
strcmp strfun
strcmpi strfun
strfind strfun
strjust strfun
strmatch strfun
strncmp strfun
strncmpi strfun
strrep strfun
strtok strfun
strtrim strfun
strvcat strfun
upper strfun
calendar timefun
clock timefun
cputime timefun
date timefun
datenum timefun
datestr timefun
datevec timefun
eomday timefun
etime timefun
now timefun
pause timefun
weekday timefun
iqr timeseries
Table 3 lists the MATLAB® functions available for Star-P® Release 2.7 Default Task-Parallel
Engine (TPE) for x86/64-based Servers.
csvread FileIO
csvwrite FileIO
disp FileIO
dlmread FileIO
dlmwrite FileIO
fclose FileIO
feof FileIO
ferror FileIO
fgetl FileIO
fgets FileIO
fopen FileIO
fprintf FileIO
fputs FileIO
fread FileIO
frewind FileIO
fscanf FileIO
fseek FileIO
ftell FileIO
fwrite FileIO
ls FileIO
contour Graphics
contourc Graphics
inpolygon Graphics
ode23 ODE
ode45 ODE
ode78 ODE
odeset ODE
assert Programming
cast Programming
else Programming
elseif Programming
end Programming
error Programming
exist Programming
feval Programming
for Programming
func2str Programming
global Programming
if Programming
lasterr Programming
lasterror Programming
lastwarn Programming
nargchk Programming
nargin Programming
nargout Programming
persistent Programming
rethrow Programming
return Programming
str2func Programming
switch Programming
try Programming
typecast Programming
varargin Programming
varargout Programming
warning Programming
while Programming
Appendix A
Application Examples
The application examples in this section show pattern matching for an input image and a
target image using the Fourier transform of the image, or, in basic terms, Fourier pattern
matching.
The program performs Fourier analysis of an input image and a target image. This analysis
tries to locate the target image within the input image. Correlation peaks show where the
target image exists. The output matrix shows where high correlation exists in the Fourier
plane. In other words, X marks the spot.
The analysis in this simplified application takes the transform of the input and target images,
multiplies the elements of the transforms, and then transforms the product back. This results
in correlation peaks located where the target image occurs within the input image. Since the
image is in color, the processing is performed in three different color spaces, so correlation
matches occur three times. Strong peaks exist in the image along with the possibility of some
noise. To further reduce the data, a threshold is applied, which reduces the information to a
two-dimensional (2D) binary map. The 2D binary map collapses the three color-space images
into a single binary map indicating the locations of the target image: the positions of the ones
(1) show where the target image occurs within the input image. In this example, the ones fall
in four separate clusters, and the centroid of each cluster indicates the center location of the
target image.
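A hedged sketch of the core Fourier step described above, for a single image plane; a and b
are same-size 2-D arrays, and taking the complex conjugate of one transform (the usual way
to obtain correlation rather than convolution) is an assumption here rather than a detail
stated in the text:
c = real(ifft2(fft2(a) .* conj(fft2(b))));   % correlation peaks mark where b occurs in a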
Application Examples
Three versions of the pattern matching example are provided:
• An example not using Star-P®, see "Application Example Not Using Star-P®".
• An example using *p to distribute the computation, see "Application Example Using
Star-P®".
• An example using ppeval to distribute the computation see "Application Example
Using ppeval".
The images used for the examples are shown in the figures below.
There are two .m files used in each example. The files used for each example are as follows:
• Application Example Not Using Star-P®: patmatch_color_noStarP.m and patmatch_calc.m
• Application Example Using Star-P®: patmatch_colordemo_StarP.m and patmatch_calc.m
• Application Example Using ppeval: patmatch_color_ppeval.m and patmatch_calc.m
M files are text files which typically contain the following elements:
• Function definition line: informs MATLAB that the M-file contains a function. This line
defines the function name and the number and order of input and output arguments.
• Function or script body: program code that performs the actual computations and
assigns values to any output arguments.
• Comments: text in the body of the program that explains the internal workings of
the program.
The following provides the actual flow for this application example where Star-P® is not used.
The M files associated with this example are shown immediately after this table.
Step Description
1 The input image (Figure A-2) is separated into Hue, Saturation, and Value (HSV)
components.
2 The image is tiled and replicated. The color constituent parts are each replicated in a
tiling fashion to make larger H, S, and V images.
3 A correlation calculation is performed on the HSV components of the input and target
images by calling the patmatch_calc.m file once for each of the three HSV images.
a. The function correlates the input and target image by padding the target image,
which is assumed to be the smaller image, with bright regions, or ones (1).
• 1 represents background
• 0 represents lack of background
4 A threshold is applied to find the target within the input image.
5 The results are displayed.
patmatch_color_noStarP.m File
The following is the sample file that contains the program code for the application example.
The step numbers in the comments correspond to the table in the previous section.
% Step 1: Read in RGB data, convert to HSV
thres = 0.85;
% Load the data, comes in RGB, transfer to HSV space
a = rgb2hsv(imread(img));    % Get the image containing targets
b = rgb2hsv(imread(target)); % Get the filter mask
% Step 2: Image is tiled and replicated
% Setup the input image tiling problem
if imr > 1 || imc > 1
    a = repmat(a,imr,imc);
end
% Step 3: Calculate the correlation on each of the HSV components
% Perform correlation calculation in HSV space
d = zeros(size(a));
for i = 1:3
    d(:,:,i) = patmatch_calc(a(:,:,i),b(:,:,i));
end
% Step 4: Perform the threshold
% Threshold for finding target within input image
e = (1-d(:,:,2)) > 0.5 & d(:,:,3) > thres;
% Step 5: Display the results
% Display the result
figure(1);
imagesc(hsv2rgb(a)); colormap jet; title('Input Image');
figure(2);
imagesc(hsv2rgb(b)); colormap jet; title('Filter Pattern');
figure(3);
imagesc(d(:,:,1)); colormap jet; title('Correlation H');
figure(4);
imagesc(1-d(:,:,2)); colormap jet; title('Correlation S');
figure(5);
imagesc(d(:,:,3)); colormap jet; title('Correlation V');
figure(6);
imagesc(e); colormap gray; title('Threshold Correlation');
patmatch_calc.m
% Step 3c: Scale between 0 and 1
corr = (d-min(min(d)))/max(max(d-min(min(d))));
Application Example Using Star-P®
The following provides the actual flow for this application example using Star-P®. The M files
associated with this example are shown immediately after this table.
Step Description
1 The input image is loaded and separated as previously described in "Application Example
Not Using Star-P®". The main difference is that each of these images is transferred to the
backend (server or HPC). From this point, every subsequent operation or computation
occurs on the backend.
2 The tiled image is now created on the backend. See "Application Example Not Using
Star-P®".
3 The correlation calculation described for "Application Example Not Using Star-P®" is
performed on the backend. The patmatch_calc.m file is identical to the one used in
"Application Example Not Using Star-P®" except that the calculation is performed on the
backend. No changes are required.
4 The threshold operation is performed on the backend (see "Application Example Not
Using Star-P®").
5 The ppfront function moves the data to the frontend (client) for viewing (see "Application
Example Not Using Star-P®").
patmatch_colordemo_StarP.m File
The following is the sample file that contains the program code for the application example.
The step numbers in the comments correspond to the table in the previous section. Only the
differences from the "Application Example Not Using Star-P®" example are described.
a = ppback(a);
% Setup the input image tiling problem
if imr > 1 | imc > 1
    a = repmat(a,imr,imc);
end
% Perform correlation calculation in HSV space
d = zeros(size(a));
for i = 1:3
    d(:,:,i) = patmatch_calc(a(:,:,i),b(:,:,i));
end
% Threshold for finding target within input image
e = (1-d(:,:,2)) > 0.5 & d(:,:,3) > thres;
% Step 5: Image is transferred from the back-end
% Transfer results to the client
a = ppfront(a);
d = ppfront(d);
e = ppfront(e);
% Display the result
figure(1);
imagesc(hsv2rgb(a)); colormap jet; title('Input Image');
figure(2);
imagesc(hsv2rgb(b)); colormap jet; title('Filter Pattern');
figure(3);
imagesc(d(:,:,1)); colormap jet; title('Correlation H');
figure(4);
imagesc(1-d(:,:,2)); colormap jet; title('Correlation S');
figure(5);
imagesc(d(:,:,3)); colormap jet; title('Correlation V');
figure(6);
imagesc(e); colormap gray; title('Threshold Correlation');
The following provides the actual flow for this application example using ppeval. The M files
associated with this example are shown immediately after the table.
About ppeval
ppeval executes embarrassingly parallel operations in a task parallel mode. The tasks are
completely independent and are computed individually, with access only to local data. For
example, if there are four function evaluations to be computed and Star-P® has four
processors allocated, ppeval takes the function to be evaluated and sends it to each of the
four processors for calculation.
This function takes the HSV components for the input and target images and calculates all
the correlations for each of these components simultaneously.
The technical explanation of the computation is identical to the previous example and is omitted for brevity. The key difference in using the patmatch_calc function is the setup of the ppeval call that invokes it.
In the case of item 5, ppeval calls patmatch_calc with the input image a and target image b. The parallelization is performed with the split function, which breaks the input and target images into their respective HSV components. The split in each case is along the 3rd dimension. If you have three processors, processor 1 gets the H component, processor 2 gets the S component, and processor 3 gets the V component.
Step 1: The operation is the same as described for the previous two examples.
Step 2: Not included for the ppeval example. Tiling to a larger image (or working with a larger input image) would still be handled by a single processor per task, which limits the performance gain; single-processor calculations perform well only on small data sizes.
Step 3: The correlation calculation described for the previous two examples is performed on an individual processor on the backend.
Step 4: The operation is the same as described for the previous two examples.
Step 5: The operation is the same as described for the previous two examples.
patmatch_color_ppeval.m
The sample file patmatch_color_ppeval.m contains the program code for this application example. Its step numbers correspond to the table in the previous section, and only the differences from the "Application Example Not Using Star-P®" are described.
Appendix B
This chapter introduces a mode of thinking about a large class of combinatorial problems.
Star-P® can be considered as a potential tool whenever you are faced with a discrete
problem where quantitative information is extracted from a data structure such as those
found on networks or in databases.
Sparse matrix operations are widely used in many contexts, but what is less well known is
that these operations are powerfully expressive for formulating and parallelizing
combinatorial problems. This chapter covers the basic theory and illustrates a host of
examples. In many ways this chapter extends the notion that array syntax is more powerful
than scalar syntax by applying this syntax to the structures of a class of real-world problems.
At the mathematical level, a sparse matrix is simply a matrix with sufficiently many zeros that
it is sensible to save storage and operations by not storing the zeros or performing
unnecessary operations on zero elements such as x+0 or x*0. For example, the discretization
of partial differential equations typically results in large sparse linear systems of equations.
Sparse matrices and the associated algorithms are particularly useful for solving such
problems.
Sparse matrices additionally specify connections and relations among objects. Simple
discrete operations including data analysis, sorting, and searching can be expressed in the
language of sparse matrices.
Graphs are used for networks and relationships. Sparse matrices are the data structures
used to represent graphs and to perform data analysis on large data sets represented as
graphs.
In the following discussion, a “graph” is simply a group of discrete entities and their
connections. While standard, the term is not especially illuminating, so it may be helpful to
consider a graph as a “network”. Think of a phone network or a computer network or a social
network. The most important things to know are the names and who is connected to whom.
Formally, a graph is a set of nodes and edges. It captures who can directly influence whom, or
at least who has a link to whom.
Consider the route map of an airline. The nodes are cities, the edges are plane routes.
The earth is a geometrical object, i.e. continuous, yet the important information for the airline
is the graph, the discrete object connecting the relevant cities. Next time you see a subway
map, think of the graph connecting the train stops. Next time you look at a street map think of
the intersections as nodes, and each street as an edge.
Electrical circuits are graphs. Connect up resistors, batteries, diodes, and inductors. Ask
questions about the resistance of the circuit. In high school one learns to follow Ohm’s law
and Kirchhoff’s laws around the circuit. Graph theory gives the bigger picture. We can take a
large grid of resistors and connect a battery across one edge. Looked at one way, this is a
discrete man-made problem requiring a purchase of electrical components.
The internet is a great source for graphs. We could have started with any communications
network: telegraphs, telephones, smoke signals... but let us consider the internet. The
internet can be thought of as the physical links between computers. The current internet is
composed of various subnetworks of connected computers that are connected at various
peering points. Run traceroute from your machine to another machine and take a walk along
the edges of this graph.
More exciting than the hardware connections are the virtual links. Any web page is a node;
hyperlinks take us from one node to another. Web pages live on real hardware, but there is
no obvious relationship between the hyperlinks connecting web pages and the wires
connecting computers.
The graph that intrigues us all is the social graph: in its simplest form, the nodes are people.
Two people are connected if they know each other.
A graph may be a discretization of a continuous structure. Think of the graph whose vertices
are all the USGS benchmarks in North America, with edges joining neighboring benchmarks.
This graph is a mesh: its vertices have coordinates in Euclidean space, and the discrete
graph approximates the continuous surface of the continent. Finite element meshes are the
key to solving partial differential equations on (finite) computers.
Graphs can represent computations. Compilers use graphs whose vertices are basic blocks
of code to optimize computations in loops. The heart of a finite element computation might be
the sparse matrix-vector multiplication in an iterative linear solver; the pattern of data
dependency in the multiplication is the graph of the mesh.
Oftentimes graphs come with labels on their edges (representing length, resistance, cost) or
vertices (name, location, cost).
There are so many examples -- some are discrete from the start, others are discretizations of
continuous objects, but all are about connections.
Consider putting everybody at a party in a circle holding hands, and have each person rate how
well they know the person on the left and the person on the right with a number from 1 to 10.
Each person can be represented by an index i; the rating of the person on the right can be
stored as A(i,i+1) and the rating of the person on the left as A(i,i-1).
As an example, with Star-P® (in serial MATLAB the code would be identical except for n = 1000 in place of n = 1000*p):
>> n = 1000*p;
>> i = 1:n;
>> j = ones(1,n);
>> j(1,1:end-1) = i(1,2:end); j(1,end) = i(1,1);
>> k = ones(1,n);
>> k(1,2:end) = i(1,1:end-1); k(1,1) = i(1,end);
>> r = rand(1,n);
>> l = rand(1,n);
>> A = sparse([i i], [j k], [r l])
A =
dsparse object: 1000p-by-1000
>> B = spones(A)   % replace each stored rating with a one: B is the 0/1 adjacency matrix
Imagine we have an airline that flies certain routes on certain days of the week, and we are
interested in the revenue per route and per day. We begin with a table m, which can be simply
an n x 3 array whose columns give the route number, the day of the week (0 through 6), and the
revenue.
In Microsoft Excel, there is a little-known feature on the Data menu called PivotTable, which
allows for the analysis of such data.
MATLAB and Star-P® users can perform the same analysis with sparse matrices.
For instance, the last two rows of m are:
3 6 4
3 0 4
>> a = sparse(m(:,1),m(:,2)+1,m(:,3))
a =
(1,1) 3
(3,1) 4
(1,2) 5
(2,2) 3
(2,3) 3
(1,4) 4
(2,5) 3
(1,6) 5
(2,7) 3
(3,7) 4
>> sum(a')
ans =
(1,1) 17
(1,2) 12
(1,3) 8
>> sum(a)
ans =
(1,1) 7
(1,2) 8
(1,3) 3
(1,4) 4
(1,5) 3
(1,6) 5
(1,7) 7
>> sum(a(:))
ans =
(1,1) 37
Since Star-P® extends the functionality of sparse matrices to parallel machines, one can do
very sophisticated data analysis on large data sets using Star-P®.
Note that the sparse command also adds data with duplicate indices.
If the sparse constructor encounters duplicate (i,j) indices, the corresponding nonzero
values are added together. This is sometimes useful for data analysis; for example, here is an
example of a routine that computes a weighted histogram using the sparse constructor. In
the routine, bin is a vector that gives the histogram bin number into which each input
element falls. Notice that h is a sparse matrix with just one column! However, all the values of
w that have the same bin number are summed into the corresponding element of h. The
MATLAB function bar plots a bar chart of the histogram.
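The routine itself is short. The following is a minimal sketch consistent with the description above (the function name whist and its argument list are assumptions, not the guide's verbatim listing):

function h = whist(bin, w, nbins)
% Weighted histogram via the sparse constructor (sketch only).
% bin(i) is the bin number of the i-th input element and w(i) is its weight.
% Duplicate (bin,1) index pairs are summed, so h is an nbins-by-1 sparse
% column whose k-th entry is the total weight falling into bin k.
h = sparse(bin, 1, w, nbins, 1);
bar(full(h));   % bar chart of the histogram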
Multiplication of a sparse matrix by a dense vector (sometimes called “matvec”) turns out to
be useful for many kinds of data analysis that have nothing directly to do with linear algebra.
We will see several examples later that have to do with paths or searches in graphs. Here is
a simple example that has to do with the nonzero structure of a matrix.
Suppose G is a dsparse matrix with nr rows and nc columns. For each row, we want to
compute the average of the column indices of the nonzeros in that row (or zero if the whole
row is zero, say). The result will be a ddense vector with nr elements. The following code
does this. (The first line replaces each nonzero in G with a one; it can be omitted if, say, G is
the adjacency matrix of a graph or a 0/1 logical matrix.)
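A sketch of that code, reconstructed from the description that follows (the variable names Gpp, epp, vpp, and averageindex are taken from the text; the exact listing is not reproduced, and it is written here in plain MATLAB form):

Gpp = spones(G);                  % replace each nonzero of G with a one
epp = ones(nc, 1);                % a column of all ones
vpp = (1:nc)';                    % the column indices 1..nc
averageindex = (Gpp * vpp) ./ max(Gpp * epp, 1);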
Since epp is a column of all ones, the first matvec Gpp*epp computes the number of
nonzeros in each row of Gpp. The second matvec Gpp*vpp computes the sum of the column
indices of the nonzeros in each row. The *max* in the denominator of the last line makes
averageindex zero whenever a row has no nonzeros.
The Laplacian matrix is a matrix associated with an undirected graph. Like the adjacency
matrix, it is square and symmetric and has a pair of nonzeros (i,j) and (j,i) for each edge (i,j) of
the graph. However, the off-diagonal nonzero elements of the Laplacian all have value -1,
and the diagonal element L(i,i) is the number of edges incident on vertex i. If A is the adjacency
matrix of an undirected graph, one way to compute the Laplacian matrix is with the following:
>> L = -spones(A);
>> L = L - diag(diag(L));
>> L = L - diag(sum(L));
This code is a little more general than it needs to be -- it doesn’t assume that all the nonzeros
in A have value 1, nor does it assume that the diagonal of A is zero. If both of these are true,
as in a proper adjacency matrix, it would be enough to say:
>> L = diag(sum(A)) - A;
The Laplacian matrix has many algebraic properties that reflect combinatorial properties of
the graph. For example, it is easy to see that the sums of the rows of L are all zero, so zero is
an eigenvalue of L (with an eigenvector of all ones). It turns out that the multiplicity of zero as
an eigenvalue is equal to the number of connected components of the graph. The other
eigenvalues are positive, so L is a positive semidefinite matrix. The eigenvector
corresponding to the smallest nonzero eigenvalue has been used in graph partitioning
heuristics.
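As a small illustration (an example added here for concreteness, not taken from the guide), the Laplacian of the path graph 1-2-3, built with the one-line construction above, has rows that sum to zero as just described:

>> A = sparse([1 2 2 3], [2 1 3 2], 1, 3, 3);   % path graph 1-2-3
>> L = diag(sum(A)) - A;
>> full(L)
ans =
     1    -1     0
    -1     2    -1
     0    -1     1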
For a connected graph, the eigenvectors corresponding to the three smallest Laplacian
eigenvalues can be used as vertex coordinates (the coordinates of vertex number i are (xi, yi,
zi), where x, y, and z are the eigenvectors), and the result is sometimes an interesting picture
of the graph. Figure B-1 is an example of this technique applied to the graph created in
Kernel 1 of the SSCA#2 benchmark.
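A minimal sketch of how such coordinates might be computed (an illustration only; this is not the code used to produce Figure B-1, and the eigs options shown are assumptions that may vary by MATLAB version):

L = diag(sum(A)) - A;             % Laplacian of a 0/1 adjacency matrix with zero diagonal
[V, D] = eigs(L, 4, 'sm');        % a few of the smallest eigenpairs
[lam, order] = sort(diag(D));     % sort ascending; lam(1) is (numerically) zero
xyz = V(:, order(2:4));           % vertex coordinates from the next three eigenvectors
gplot(A, xyz(:, 1:2));            % quick two-dimensional picture using the x and y coordinates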
Figure B-1: 8192-vertex graph from Kern1 plotted with Fiedler coordinates
On Path Counting
You may want to know how many paths connect two nodes in a graph. Powers of the graph's
adjacency matrix answer this question: if I is the 0/1 connection matrix, then for any
particular path length k, element (i,j) of I^k is the number of paths of length k that connect
node i to node j. The matrix a in "Sparse Matrices: Representing Graphs and General Data
Analysis" is such an adjacency matrix.
>> a = spones(sprandn(100*p,100,0.1));
>> b = a^3
b =
ddense object: 100-by-100p
>> b(14,23)
ans =
4
>> nnz(b)
ans =
9928
In this example there are 4 paths of length 3 that connect nodes 14 and 23 (the exact counts
vary from run to run, since the matrix is generated randomly). Another characteristic of the
graph that can be gleaned from this calculation is that almost all of the nodes are reachable
from all other nodes by a path of length 3 (9928 of the 10000 entries are nonzero).