SC08 Engineering Track: Parallel computing using MATLAB
Siddharth Samsi, Computational Science Researcher
[email protected]
Goals of the session
Overview of parallel MATLAB
Why parallel MATLAB?
Multiprocessing in MATLAB
Parallel MATLAB using the Parallel Computing Toolbox (PCT)
Running a serial job
Running an interactive parallel job
Running a batch job on the OSC cluster
Parallel Computing
Goals:
Speed up computations by using multiple processors
Utilize more memory than available on a single machine
How?
Using MPI: Message Passing Interface, a library that is used to exchange data and control information between the processors.
Used in distributed memory environments
Using OpenMP: A set of compiler directives that is used to run threads in parallel in a shared memory environment
Reality
Parallel programming using C/C++/FORTRAN and MPI is hard
Creating parallel code in C/FORTRAN and MPI takes a long time
Why Parallel MATLAB ?
MATLAB is widely used for developing/prototyping algorithms
The high-level language and integrated development/visualization environment leads to productive code development
By parallelizing MATLAB code :
Algorithms can be run with different/larger data sets
Algorithms can be run with larger parameter sweeps
Compute times may be reduced
Multiprocessing in MATLAB
MATLAB R2008a supports implicit and explicit multiprocessing
Implicit multiprocessing :
Built-in multithreading
Speeds up many linear algebra routines and matrix operations
Leverages multiple cores on a processor
Explicit multiprocessing :
Parallel computing using the Parallel Computing Toolbox and MATLAB Distributed Computing Server
Leverages multiple processors on clusters
Implicit Multiprocessing : Multithreading in MATLAB
MATLAB runs computations on multiple threads on your machine
No changes to MATLAB code required
Users can change behavior via preferences
Maximum gain in element-wise operations and BLAS routines
To see the performance improvements possible on your multi-core system, run the following demo : multithreadedcomputations
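As a quick illustration, the following sketch times one element-wise operation with one thread and then with four (maxNumCompThreads is assumed available, as in R2008a; the exact speedup depends on your hardware) :
nThreadsOld = maxNumCompThreads(1);   % force a single computation thread
A = rand(3000);
tic; B = sin(A); t1 = toc;            % single-threaded baseline
maxNumCompThreads(4);                 % allow four computation threads
tic; B = sin(A); t4 = toc;
maxNumCompThreads(nThreadsOld);       % restore the previous setting
fprintf('Speedup with 4 threads : %.2fx\n', t1/t4);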
Implicit Multiprocessing : Multithreading in MATLAB
Sample speedup graph on a 4-core processor
[Figure: Performance Improvement with 4 Threads on Arrays of 3000x3000, showing speedups for the operations qr, lu, sin, .^, sqrt and .*]
Explicit Multiprocessing : The Parallel Computing Toolbox
Explicit multiprocessing is enabled through the use of the following two products :
The Parallel Computing Toolbox (PCT)
MATLAB Distributed Computing Server (MDCS)
The Parallel Computing Toolbox
Provides parallel constructs in the MATLAB language, such as parallel for loops, distributed arrays and message passing
[Diagram: a desktop MATLAB session using the Parallel Computing Toolbox with local Workers]
Enables rapid prototyping of parallel code through an interactive parallel MATLAB session
Provides the ability to scale the problem by harnessing resources in a remote cluster
[Diagram: Simulink, blocksets and toolboxes layered on top of MATLAB and the Parallel Computing Toolbox]
Parallel Computing Toolbox
Language enhancements include
Ability to create and use distributed arrays
Over 150 parallel functions for use on distributed arrays
cos, sin, log, find, isempty, etc. Full list available here :
www.mathworks.com/access/helpdesk/help/toolbox/distcomp/bqxooam-1.html
ScaLAPACK based parallel linear algebra routines
svd, lu
Global, collective operations such as global addition, reduction, etc.
Explicit, fine-grained parallelism via MPI functions
Parallel Computing Toolbox
The Parallel Computing Toolbox supports the following types of schedulers for job submission
Local scheduler
Can run up to 4 Workers simultaneously
Useful for debugging parallel code locally
Job Manager
Supported third-party scheduler (PBS, LSF, Microsoft CCS)
Generic Scheduler
Generic interface that allows use with a third-party scheduler
Additionally the PCT supports the use of configurations
Configurations are a convenient way to store scheduler parameters
MATLAB Distributed Computing Server
The MATLAB Distributed Computing Server enables scaling of parallel MATLAB code on clusters
It includes a basic scheduler and also supports LSF, PBS, TORQUE and Windows CCS
[Diagram: a scheduler dispatching work to MATLAB Distributed Computing Server Workers]
Roadmap
Topics to be covered
Quick note on Configurations
Short setup for using OSC cluster
Interactive parallel MATLAB
Task and Data Parallelism
Running serial batch jobs
Running parallel batch jobs
A Note About Configurations
The Parallel Computing Toolbox allows the use of configurations for storing scheduler and job parameters
Typically, configurations are used to store scheduler and job settings that do not change
By default, the PCT ships with the 'local' configuration
We will be adding a new configuration for use with the OSC Cluster
Setup
Start MATLAB
You should have one of these directories :
C:\osctools\
Or
C:\Documents and Settings\your_username\osctools\
For brevity, we will refer to either of these directories as <OSCTOOLSDIR>
Setup
If the directories do not exist, you should have received a zip file called osctools.zip on your USB drive
Alternatively, you can download this file from : http://www.osc.edu/~samsi/SC08/Downloads/osctools.zip
Save the file to this location : C:\Documents and Settings\your_username\
Unzip the osctools.zip file from MATLAB using the commands :
cd('C:\Documents and Settings\your_username\')
unzip osctools.zip
Setup
In MATLAB, change directory to : <OSCTOOLSDIR>\common
Run the command : oscsetup
You will see the following prompt :
Enter the OSC username you have been given
Next, you will see the following message in the MATLAB Command Window :
In order to complete the setup process, we need to connect to glenn.osc.edu
You will be prompted for your OSC password in order to connect
Press return to continue
Setup
Once you press the Enter key, you will see a confirmation dialog
After you click Yes, you will be asked for your password
The setup is now complete
Testing the setup
In order to test the setup, first connect to the OSC cluster using the command
a = ssh('your_OSC_username', 'glenn.osc.edu')
You will be prompted for your password
Next, change to the directory <OSCTOOLSDIR>\demo and run the command
pctdemo
Interactive Parallel MATLAB
The PCT provides the ability to use up to 4 Workers on a single desktop
Useful for debugging and code development
Starting pmode : Run the following command
pmode start local 4
What can we do with pmode ?
Run a for loop in parallel
Serial :
s = 0;
for k = 1:10
    s = s + k;
end
disp(s)
Parallel :
s = 0;
for k = drange(1, 10)
    s = s + k;
end
disp(s)
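Note that with drange each lab accumulates only a partial sum, so disp(s) shows a different value on each lab. A minimal sketch of combining the partial results (gplus is introduced on the next slide) :
total = gplus(s);   % add the per-lab partial sums
disp(total)         % every lab now displays 55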
Collective Operations
The PCT provides the following collective operations
gplus : Global addition
Example : p = gplus(s)
gcat : Global concatenation
Example : c = gcat(s)
gop : Global operation
Example : m = gop(@mean, s)
Notes on pmode
Some useful commands
pmode lab2client labvar lab clientvar
Send data from the lab to the client MATLAB
pmode client2lab clientvar labs labvar
Send data from the client MATLAB to the specified lab(s)
pmode exit
Limitations
A maximum of 4 Workers permitted with the 'local' configuration
Workers cannot use graphics functions
To plot data, you must send the data to the MATLAB client that started the pmode session
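For example, to plot a result computed on lab 2 (a minimal sketch using the syntax above; x and xc are placeholder variable names) :
pmode lab2client x 2 xc   % copy variable x from lab 2 into client variable xc
plot(xc)                  % plotting happens on the client MATLAB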
Lab 1
Familiarizing yourself with pmode
Serial version of pi
Parallel version of pi using pmode
Lab 1 : Calculating pi
Algorithm
Consider a circle of radius 1
Let N = some large number (say 1000) and count = 0
Repeat the following procedure N times :
Generate two random numbers x and y between 0 and 1
Check whether (x,y) lies inside the circle
Increment count if it does
PI = 4 * count / N
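A minimal MATLAB sketch of this algorithm, in both serial form and a pmode variant (the parallel version assumes labs have been started with pmode; drange and gplus were introduced earlier) :
% Serial version
N = 1000; count = 0;
for k = 1:N
    x = rand; y = rand;
    if x^2 + y^2 <= 1
        count = count + 1;   % (x,y) fell inside the circle
    end
end
piEstimate = 4 * count / N

% Parallel version for pmode : each lab takes a share of the N trials
count = 0;
for k = drange(1, N)
    if rand^2 + rand^2 <= 1
        count = count + 1;
    end
end
piEstimate = 4 * gplus(count) / N   % combine counts across labs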
Running Non-Interactive Jobs
The Parallel Computing Toolbox can also be used to run non-interactive jobs
Jobs can be run :
Locally :
Useful for prototype development
Remotely :
On a cluster, in conjunction with the MATLAB Distributed Computing Server
Can scale up to a much larger number of parallel Labs
The functions discussed in this section can be used to run jobs locally as well as on a cluster
Basic Commands
The PCT offers the following two functions for evaluating MATLAB functions on multiple processors
dfeval : Evaluate function on cluster
dfevalasync : Evaluate function on cluster asynchronously
Both functions are similar to the eval function, but they leverage the Parallel Computing Toolbox to evaluate functions on the specified compute resources
For this tutorial, we will be using the dfevalasync function
More on dfevalasync
Syntax
job = dfevalasync(F, numArgout, input, 'P1', 'V1');
where
F : Function to be evaluated
numArgout : Number of output arguments
input : Cell array of input values
P1/V1 : Name/value pairs
Example
job = dfevalasync(@rand, 1, {4}, ...
'Configuration', 'local')
Running on a cluster : Submitting jobs
We will use the 'OSC Opteron' configuration to run our jobs on the OSC cluster
First, connect to the cluster using ssh :
sshobj = ssh('your_username', 'glenn.osc.edu')
Using the previous example :
job = dfevalasync(@rand, 1, {4}, ...
'Configuration', 'OSC Opteron')
To check the status of the job, run the command :
getjobstatus(job)
Running on a cluster : Getting output
Once the job has finished, the output can be retrieved by running the command
out = getAllOutputArguments(job)
The output is returned in a cell array
In the above example, out{k} gives the output from the kth Worker/Lab
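A minimal sketch of collecting results, assuming job was returned by dfevalasync as above (waitForState blocks until the job completes) :
waitForState(job, 'finished');    % wait for the job to finish
out = getAllOutputArguments(job);
result = out{1};                  % output from the first Worker/Lab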
Running on a cluster using Schedulers
Programming with a scheduler consists of the following steps
Create a scheduler object
Create a new job
Add task(s) to the job
Submit the job
Retrieve results
Jobs (and tasks) are persistent and can be retrieved later
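Putting these steps together, here is a hedged end-to-end sketch using the generic scheduler and the @rand example from the earlier slides (the individual calls are detailed on the following slides) :
sched = findResource('scheduler', 'type', 'generic');
set(sched, 'Configuration', 'OSC Opteron');
job = createParallelJob(sched);     % create a new parallel job
createTask(job, @rand, 1, {4});     % add one task to the job
submit(job);                        % submit to the cluster
waitForState(job, 'finished');
out = getAllOutputArguments(job);   % retrieve results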
Running on a cluster using a Generic Scheduler
Throughout this tutorial, we will use the generic scheduler interface for running jobs on the OSC cluster
Creating the scheduler :
sched = findResource('scheduler', ... 'type', 'generic');
set(sched, 'Configuration', 'OSC Opteron')
Running on a cluster using a Generic Scheduler
Some important scheduler properties
Configuration
Customized settings for specific cluster
DataLocation
Location of the Job and Task files created by MATLAB
SubmitFcn
The MATLAB function to call for actually submitting the job to the cluster. Used for serial jobs
ParallelSubmitFcn
This is similar to the SubmitFcn but is used for parallel jobs
Running on a cluster using a Generic Scheduler
Creating a job
job = createParallelJob(sched);
set(job, 'MaximumNumberOfWorkers', 4);
set(job, 'MinimumNumberOfWorkers', 4);
Some important properties of jobs
FileDependencies : List of user m-files that the job needs
PathDependencies : List of directories to be added to the MATLAB path
Output is retrieved using the function
getAllOutputArguments
Running on a cluster using a Generic Scheduler
Creating tasks :
task = createTask(job, @rand, 1, {4});
set(task, 'CaptureCommandWindowOutput', 1);
Some important properties of Tasks
CommandWindowOutput
Returns the messages printed to the screen
Error
Returns the error stack (if an error occurs on the Worker)
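For example, once the job has finished, these properties can be inspected as follows (a minimal sketch; task is the object created above) :
disp(task.CommandWindowOutput)   % text the task printed on the Worker
disp(task.Error)                 % error information, if the task failed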
Running on a cluster using a Generic Scheduler
Finally, the job is submitted using the command
submit(job)
You can check the status of the job using the getjobstatus command as shown here :
getjobstatus(job)
Note : The getjobstatus function is a custom function developed at OSC
Lab 2 : Image Processing
Many image processing operations tend to be compute intensive. Examples of common operations include histogram equalization, contrast enhancement, filtering, etc.
Let's look at one particular implementation of a simple automatic contrast enhancement algorithm
Lab 2 : Contrast Enhancement Algorithm
All color images are basically MxNx3 matrices
Our algorithm will look at each pixel p(i,j) and its 3x3 pixel neighborhood
[Figure: 3x3 neighborhoods of pixel values around p(i,j) in the Red, Green and Blue channels]
The value of p(i,j) will be replaced appropriately
Lab 2 : Contrast Enhancement Algorithm (continued)
The new value of p(i,j) is calculated as follows
1. Find the low-frequency component of the pixel :
m_p(i,j) = \frac{1}{(2n+1)^2} \sum_{k=i-n}^{i+n} \sum_{l=j-n}^{j+n} p(k,l)
In this case, n = 1 for a 3x3 neighborhood
2. Calculate the new value of p(i,j) as :
f(i,j) = m_p(i,j) + C \, [ x(i,j) - m_p(i,j) ]
where C is a constant > 1 and x(i,j) is the original pixel value
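A minimal serial sketch of this algorithm for a single grayscale channel (assumes the Image Processing Toolbox for imshow and the pout.tif demo image; C = 1.5 is just an illustrative choice) :
img = double(imread('pout.tif'));
n = 1; C = 1.5;                    % 3x3 neighborhood; C > 1
kernel = ones(2*n+1) / (2*n+1)^2;  % local averaging mask
mp = conv2(img, kernel, 'same');   % low-frequency component m_p(i,j)
f = mp + C * (img - mp);           % enhanced pixel values f(i,j)
imshow(uint8(f))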
Lab 2
Implement the serial version of the contrast enhancement algorithm
Run the algorithm locally
Run the algorithm on a single image on the OSC cluster (use the image pout.tif)
Programming Parallel Jobs
In this section, we will discuss
Types of parallel jobs
Running parallel jobs using the Parallel Computing Toolbox
Types of Parallel Jobs
Task Parallel (Embarrassingly parallel)
Multiple Workers work on different parts of the problem
No communication between the Workers
Results independent of the execution order
Example : Monte Carlo simulations
Data Parallel
Typically, the data being analyzed is too large for one computer
Each Worker operates on part of the data
Workers may or may not communicate with each other
Example : Image enhancement
Task Parallel Jobs
Consider our contrast enhancement application from Lab 2
Examples of a task parallel implementation :
Consider an RGB image : a simple task parallel implementation would be to process each color channel independently
We have hundreds of images that we want to process : simply have multiple Workers process a subset of the images
Lab 3
Start pmode with 3 labs
Read in the given image file
Process each channel (R, G, B) separately on a different Worker
Combine the 3 channels into a new image (Hint : use the gcat function; a sketch of these steps follows)
Visualize the processed image
Now run the same code on the OSC cluster using the dfevalasync function
Modify the code to process the given list of images in parallel
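A hedged sketch of the per-channel steps inside pmode with 3 labs (the file name and the enhance helper are placeholders; enhance stands for your Lab 2 implementation) :
rgb = imread('image.tif');               % each lab reads the full image
myChannel = double(rgb(:,:,labindex));   % lab k processes channel k
newChannel = enhance(myChannel);         % your Lab 2 algorithm (hypothetical helper)
newImage = gcat(newChannel, 3);          % stack the channels along dimension 3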
Data Parallel Jobs
Data parallel jobs can be broadly classified into two types of problems
Capacity : It may be possible to process the data on a single processor, but it may take hours or days
Capability : The data to be processed is simply too large for a single system
For example :
In medical imaging applications, images can be as large as 100,000x100,000 pixels with file sizes of several Gigabytes
Data Parallel : Contrast Enhancement
We will now re-implement this algorithm so that it works as a data parallel algorithm
Advantages of this approach :
Process much larger images
Run many more iterations in a reasonable amount of time
Total compute time may or may not be reduced depending on the actual size of the image being processed
Data Parallel Implementation
Distribute chunks of data to Workers
[Diagram: the image divided into row blocks assigned to Labs 1-4]
Data Parallel Implementation : How-To
Data can be distributed in two ways using the PCT
1. Use explicit message passing :
Labs/Workers can use MPI to share data
Data distribution must be programmed by the user
2. Use distributed arrays :
The PCT manages the communication necessary to organize data across Labs/Workers
User decides the distribution pattern
Creating Distributed Arrays
Distributed arrays can be created in three ways
Partitioning a larger array
All Labs have the entire array
Assumes sufficient memory
From smaller arrays
All Labs contain part of the data
Using constructor functions
Functions such as zeros, ones, randn, eye can create distributed arrays directly
Creating Distributed Arrays : Distribution Type
MATLAB supports the following distribution schemes :
1d : Distribution along a single dimension
Supported for all arrays
Distributes data non-cyclically along one dimension
2d : Distribution along two dimensions
Supported only for 2D arrays
Distributes the matrix along two dimensions
The default distribution is 1d with arrays distributed column-wise
Distributed Arrays : Useful Commands
distributed()
Creates a distributed array
r = labindex*(10:15); rd = distributed(r);
rep = reshape(1:64, 8, 8); repd = distributed(rep, distributor(), 'convert');
distributor()
Defines the type of distribution
a = zeros(100, distributor()); b = rand(16, distributor('1d', 1));
Distributed Arrays : Useful Commands
redistribute()
Changes the distribution of a distributed array
c = redistribute(b, distributor('1d', 2));
localPart()
Retrieves part of the data local to the Lab
a = rand(100, distributor()); a_local = localPart(a);
Example : 2D FFT
a = reshape(1:16, 4, 4);
% Create 4x4 distributed array from replicated array 'a'
da = distributed(a, distributor(), 'convert');
Da = fft(da, [], 1);   % FFT in first dimension
Da = fft(Da, [], 2);   % FFT in second dimension
% Gather the FFT matrix into all labs
A = abs(gather(Da));
Going back to our image processing problem
Our goal is to process parts of the image on different Labs
We can process the subset of rows in a number of ways :
Use indices to figure out the rows each lab works on
Use distributed arrays and let MATLAB figure out the distribution
Approach I : Calculate the row indices ourselves
This approach involves :
Get the size of the entire image (size() function)
Divide the number of rows by the number of labs
If not perfectly divisible, figure out the remainder
Based on previous step, calculate the row indices that each lab works on
Problems with this approach :
Need to be careful when dimensions are not perfectly divisible across the labs
Must debug code with different combinations of image sizes and labs to ensure correctness
(A sketch of the required index arithmetic follows)
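A sketch of the index arithmetic this approach requires (numlabs and labindex are the pmode built-ins; img is assumed replicated on every lab; note how the remainder rows must be spread across the first few labs) :
nRows  = size(img, 1);
perLab = floor(nRows / numlabs);
extra  = mod(nRows, numlabs);          % leftover rows
first  = (labindex-1)*perLab + min(labindex-1, extra) + 1;
last   = first + perLab - 1 + (labindex <= extra);
myRows = img(first:last, :);           % this lab's block of rows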
Approach II : Use distributed arrays
This approach involves :
Read in the image on all the labs
Create a distributed array using the distributed command
Get the local part of the data (localPart() function)
Advantages of this approach
Simple to program
Much less prone to error
Disadvantage : Image replicated across the Labs. This may not be desirable in all applications
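A minimal sketch of Approach II (the enhance helper is again a hypothetical stand-in for your Lab 2 code) :
img = double(imread('pout.tif'));                           % replicated on every lab
dimg = distributed(img, distributor('1d', 1), 'convert');   % split by rows
myRows = localPart(dimg);                                   % this lab's block of rows
myOut = enhance(myRows);                                    % process the local block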
Lab 4
Modify your contrast enhancement program so that it uses distributed arrays
Test the code using pmode
Run the code on the OSC cluster
Output of Data Parallel Algorithm
Let's examine the output of the data parallel implementation
As shown here, the resulting image has stripes across it
[Figure: processed image showing stripes at the block boundaries]
This is caused by missing data at the boundaries
Solution : Each Lab needs to exchange data with its neighbor
Modified Data Parallel Algorithm
Add communication between Labs
Each lab exchanges data with its neighbor, as shown in the diagram
[Figure: Labs 1-3 exchanging boundary rows of pixel values with their neighboring Labs]
Communication between Labs
The PCT provides the following functions for sending data between labs :
labSend : Send data to a lab
labSend(data, destination)
labReceive : Receive data from another lab
data = labReceive(source)
Communication between Labs
labSendReceive : Simultaneously send and receive data
This function can avoid deadlocks when communicating between labs
received = labSendReceive(labTo, labFrom, data)
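A hedged sketch of the boundary exchange for our image problem (myRows is each lab's local block of rows, as in the earlier sketches; an empty lab index is assumed to mean no send/receive at the edges) :
up   = labindex - 1;                     % neighbor above
down = labindex + 1;                     % neighbor below
if labindex == 1,       up   = []; end   % lab 1 has no neighbor above
if labindex == numlabs, down = []; end   % last lab has no neighbor below
% send my first row up, receive the first row of the lab below
rowBelow = labSendReceive(up, down, myRows(1, :));
% send my last row down, receive the last row of the lab above
rowAbove = labSendReceive(down, up, myRows(end, :));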
Lab 5
Modify the data parallel implementation of the contrast enhancement algorithm to add communication
Questions ?