ParallelProgramming_Start2016
Agenda for the Presentation
HPC Overview
Parallel Programming
C/C++ in Linux with:
Try it…
2. Technologies Combined for Solving the Challenge
Parallel Programming
• OpenMP
Bitmap Processing
• Sample-code
Serial Computing
Parallel Computing
Parallel Computing & Systems - Intro
https://computing.llnl.gov/tutorials/parallel_comp/
2. HW & SW Platform
Alternative to:
1. C/C++ Nvidia CUDA
2. C/C++ OpenCL
– programming on GPUs (video boards)
2. Vector Adding with Parallel Computing
http://ism.ase.ro/vm/Ubuntu12x64_Intel.zip
Download the VMware virtual machine with Ubuntu 12 x64 (Intel, 2 cores, 2048 MB RAM, 20 GB HDD) in order to solve the
contest problem with the Intel C/C++ compiler, Intel Parallel Studio 2013, Intel TBB, Eclipse CDT Juno, GCC and Oracle JDK 7,
PLUS all the documents needed for the Intel contest.
http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160
2. Parallel Programming Restrictions
http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160
The C/C++ source code can be compiled without modification by the
Intel compiler (icc) to run in the following modes:
An Application Program Interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism.
API components:
Compiler directives (supporting compilers: the GNU & Intel C/C++ compilers – gcc/g++ & icc)
Runtime library routines
Environment variables
Portability
API is specified for C/C++ and Fortran
Implementations on almost all platforms including Unix/Linux and Windows
Standardization
Jointly defined and endorsed by major computer hardware and software vendors
Possibility to become ANSI standard
Partial Copyright:
http://www3.nd.edu/~zxu2/acms60212-40212-S12/Lec-11-01.pdf | https://computing.llnl.gov/tutorials/openMP/
2.1 OpenMP Architecture – Version 3
2.1 OpenMP Mini-Tutorial – Version 3
Thread
A process is an instance of a computer program that is being executed. It contains the program code and its current activity.
A thread of execution is the smallest unit of processing that can be scheduled by an operating system.
Differences between threads and processes:
A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as
memory. The threads of a process share the latter’s instructions (code) and its context (values that its variables
reference at any given moment).
Different processes do not share these resources.
http://en.wikipedia.org/wiki/Process_(computing) |
Process
A process contains all the information needed to execute the program:
Process ID
Program code
Data on run time stack
Global data
Data on heap
Each process has its own address space.
In multitasking, processes are given time slices in a round robin fashion.
If computer resources are assigned to another process, the status of the current process has to be saved, so that
the execution of the suspended process can be resumed at a later time.
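A minimal sketch (not from the original slides) illustrating the process model above with POSIX fork(): the child receives its own copy of the address space, so its change to a variable is invisible to the parent.

// fork_demo.c – each process has its own address space
// Compile: gcc fork_demo.c -o fork_demo && ./fork_demo
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int counter = 0;
    pid_t pid = fork();                   // duplicate the current process
    if (pid == 0) {                       // child process
        counter = 100;                    // modifies only the child's copy
        printf("child  (PID %d): counter = %d\n", getpid(), counter);
        exit(0);
    }
    wait(NULL);                           // parent waits for the child
    printf("parent (PID %d): counter = %d\n", getpid(), counter); // still 0
    return 0;
}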
2.1 OpenMP - Summary of MS Windows Memory
MS Windows – native EXE file on HDD vs. the RAM memory layout (diagram):
The load module (EXE image) on disk begins with the 'MZ' signature and ends with a relocation pointer table holding references / pointers to its segments.
In RAM, each running program (e.g. Firefox, Adobe Reader) becomes a PROCESS with its own EXE image; processes communicate through IPC and may contain one or more threads (Optional – Thread 1, Thread 2, … Thread n).
http://www.codinghorror.com/blog/2007/03/dude-wheres-my-4-gigabytes-of-ram.html
2.1 OpenMP - Mini-Tutorial – Version 3
Thread Features:
Thread operations include thread creation, termination, synchronization (joins, blocking), scheduling, data management and process interaction.
The thread model is an extension of the process model.
Each process consists of multiple independent instruction streams (or threads) that are assigned computer resources by some scheduling procedure.
Threads of a process share the address space of that process; all threads within a process share the same address space.
Global variables and all dynamically allocated data objects are accessible by all threads of a process.
Each thread has its own run-time stack, registers and program counter.
Threads can communicate by reading/writing variables in the common address space.
A thread does not maintain a list of created threads, nor does it know the thread that created it.

Threads in the same process share:
Process instructions
Most data
Open files (descriptors)
Signals and signal handlers
Current working directory
User and group ID

Each thread has a unique:
Thread ID
Set of registers, stack pointer
Stack for local variables, return addresses
Signal mask
Priority
Return value: errno

pthread functions return "0" if OK.
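A minimal Pthreads sketch (illustrative names, not from the slides) showing that threads created in one process share its globals, and that pthread_* calls return 0 on success:

// pthread_demo.c – threads of one process share the address space
// Compile: gcc pthread_demo.c -pthread -o pthread_demo && ./pthread_demo
#include <pthread.h>
#include <stdio.h>

int shared_value = 42;                    // global: visible to every thread

void *worker(void *arg)
{
    long id = (long)arg;                  // each thread has its own stack/arguments
    printf("thread %ld sees shared_value = %d\n", id, shared_value);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        if (pthread_create(&t[i], NULL, worker, (void *)i) != 0)
            return 1;                     // non-zero means the call failed
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);         // wait for both threads to finish
    return 0;
}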
2.1 OpenMP - Mini-Tutorial – Version 3
Multi-threading vs. Multi-process development in UNIX/Linux:
https://computing.llnl.gov/tutorials/pthreads/ | http://www.javamex.com/tutorials/threads/how_threads_work.shtml
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP Programming Model:
Shared memory, thread-based parallelism
OpenMP is based on the existence of multiple threads in the shared memory programming paradigm.
A shared memory process consists of multiple threads.
Explicit Parallelism
Programmer has full control over parallelization. OpenMP is not an automatic parallel programming
model.
Compiler directive based
Most OpenMP parallelism is specified through the use of compiler directives which are embedded in the
source code.
OpenMP is NOT:
Necessarily implemented identically by all vendors
Meant for distributed-memory parallel systems (it is designed for shared address spaced machines)
Guaranteed to make the most efficient use of shared memory
Required to check for data dependencies, data conflicts, race conditions, or deadlocks (see the race-condition sketch after this list)
Required to check for code sequences
Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such
parallelization
Designed to guarantee that input or output to the same file is synchronous when executed in parallel.
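As a sketch of the "race conditions" point in the list above (not from the original slides): OpenMP compiles the unsynchronized increment without complaint; making it correct, e.g. with an atomic directive, is up to the programmer.

// race_demo.c – OpenMP does not detect the data race for you
// Compile: gcc -fopenmp race_demo.c -o race_demo && ./race_demo
#include <stdio.h>
#include "omp.h"

int main(void)
{
    int counter = 0;
    #pragma omp parallel num_threads(8)
    {
        for (int i = 0; i < 100000; i++) {
            // counter++;            // data race: the final value is unpredictable
            #pragma omp atomic
            counter++;               // synchronized update: always 800000
        }
    }
    printf("counter = %d\n", counter);
    return 0;
}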
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP - Fork-Join Parallelism Model:
An OpenMP program begins as a single process: the master thread (shown in red/grey in the pictures). The master thread
executes sequentially until the first parallel region construct is encountered.
When a parallel region is encountered, the master thread:
Creates a group of threads by FORK.
Becomes the master of this group of threads, and is assigned thread id 0 within the group.
The statements in the program that are enclosed by the parallel region construct are then executed in
parallel among these threads.
JOIN: when the threads complete executing the statements in the parallel region construct, they
synchronize and terminate, leaving only the master thread.
2.1 OpenMP - Mini-Tutorial – Version 3
I/O
OpenMP does not specify parallel I/O.
It is up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.
Memory Model
Threads can “cache” their data and are not required to maintain exact consistency with real memory all of the time.
When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the
variable is updated by all threads as needed.
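A hedged sketch (not from the original slides) of what that responsibility looks like in code: the critical section serializes the shared update, and the implicit barrier/flush at the end of the parallel region makes the final value visible to the master thread.

// memory_demo.c – programmer-enforced consistency for a shared variable
#include <stdio.h>
#include "omp.h"

int main(void)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double local = omp_get_thread_num() * 1.0;   // per-thread private work
        #pragma omp critical
        sum += local;                                // one thread at a time
    }   // implicit barrier + flush: 'sum' is now consistent for the master thread
    printf("sum = %f\n", sum);
    return 0;
}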
//OpenMP Code Structure
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

int main()
{
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf("Hello (%d)\n", ID);
    printf(" world (%d)\n", ID);
  }
}

Set # of threads for OpenMP (environment variable OMP_NUM_THREADS):
- In csh: setenv OMP_NUM_THREADS 8
- In bash: export OMP_NUM_THREADS=8

Compile: g++ -fopenmp hello.c
Run: ./a.out
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP Core Syntax:

#include "omp.h"
void main ()
{
  int var1, var2, var3;
  // 1. Serial code
  ...
  // 2. Beginning of parallel section.
  // Fork a team of threads. Specify variable scoping
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    // 3. Parallel section executed by all threads
    ...
    // 4. All threads join master thread and disband
  }
  // 5. Resume serial code
}

OpenMP C/C++ Directive Format:
C/C++ use compiler directives; prefix: #pragma omp …
A directive consists of a directive name followed by clauses.
Example: #pragma omp parallel default (shared) private (var1, var2)

OpenMP Directive Format - General Rules:
Case sensitive
Only one directive-name may be specified per directive
Each directive applies to at most one succeeding statement, which must be a structured block.
Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash "\" at the end of a directive line.
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP parallel Region Directive:

#pragma omp parallel [clause list]

Typical clauses in [clause list]:
Conditional parallelization
– if (scalar expression): determines whether the parallel construct creates threads
Degree of concurrency
– num_threads (integer expression): number of threads to create
Data scoping
– private (variable list): specifies variables local to each thread
– firstprivate (variable list): similar to private; the private copies are initialized to the variable's value before the parallel directive is encountered
– shared (variable list): specifies variables that are shared among all the threads
– default (data scoping specifier): the default data scoping specifier may be shared or none

Example:

#pragma omp parallel if (is_parallel == 1) num_threads(8) \
        shared(var_b) private(var_a) firstprivate(var_c) default(none)
{
  /* structured block */
}

if (is_parallel == 1) num_threads(8) – if the value of the variable is_parallel is one, create 8 threads
shared (var_b) – each thread shares a single copy of variable var_b
private (var_a) firstprivate (var_c) – each thread gets private copies of variables var_a and var_c; each private copy of var_c is initialized with the value of var_c in the main thread when the parallel directive is encountered
default (none) – the default state of a variable is specified as none (rather than shared); signals an error if not all variables are specified as shared or private
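A self-contained sketch combining the clauses above (file name and printed values are only illustrative; is_parallel is listed as shared to satisfy default(none)):

// clauses_demo.c – if, num_threads, shared, private, firstprivate, default(none)
// Compile: gcc -fopenmp clauses_demo.c -o clauses_demo && ./clauses_demo
#include <stdio.h>
#include "omp.h"

int main(void)
{
    int is_parallel = 1;
    int var_a = 1, var_b = 2, var_c = 3;

    #pragma omp parallel if (is_parallel == 1) num_threads(4) \
            shared(var_b, is_parallel) private(var_a) firstprivate(var_c) default(none)
    {
        int tid = omp_get_thread_num();  // declared inside the region: private by default
        var_a = tid;                     // private: each thread has its own (uninitialized) copy
        var_c += tid;                    // firstprivate: every copy starts from 3
        #pragma omp critical
        printf("thread %d: var_a=%d var_b=%d var_c=%d\n", tid, var_a, var_b, var_c);
    }
    return 0;
}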
2.1 OpenMP - Mini-Tutorial – Version 3
Work-Sharing Constructs:
A parallel construct by itself creates an SPMD ("Single Program Multiple Data") program, i.e., each
thread executes the same code.
Work-sharing splits up pathways through the code among the threads within a team:
– Loop construct (for/do)
– Sections/section constructs
– Single construct
Within the scope of a parallel directive, work-sharing directives allow concurrency between iterations or tasks
Work-sharing constructs do not create new threads.
A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to
execute in parallel.
Work-sharing constructs must be encountered by all members of a team or none at all.
Two directives to be presented
– do/for: concurrent loop iterations
– sections: concurrent tasks
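The loop construct is detailed on the next slides; for the sections construct, a minimal sketch (task names are illustrative only):

// sections_demo.c – each section is executed once, by some thread of the team
// Compile: gcc -fopenmp sections_demo.c -o sections_demo && ./sections_demo
#include <stdio.h>
#include "omp.h"

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        #pragma omp sections
        {
            #pragma omp section
            printf("task A done by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("task B done by thread %d\n", omp_get_thread_num());
        }   // implicit barrier at the end of the sections construct
    }
    return 0;
}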
2.1 OpenMP - Mini-Tutorial – Version 3
Work-Sharing do/for Directive

do/for:
Shares the iterations of a loop across the group of threads
Represents "data parallelism"
The for directive partitions parallel iterations across threads
do is the analogous directive in Fortran

Usage:
#pragma omp for [clause list]
/* for loop */

Implicit barrier at the end of the for loop.

#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

void main()
{
  int nthreads, tid;

  omp_set_num_threads(3);
  #pragma omp parallel private(tid)
  {
    int i;
    tid = omp_get_thread_num();
    printf("Hello world from (%d)\n", tid);
    #pragma omp for
    for(i = 0; i <= 4; i++)
    {
      printf("Iteration %d by %d\n", i, tid);
    }
  } // all threads join master thread and terminate
}
2.1 OpenMP - Mini-Tutorial – Version 3
//Sequential code to add two vectors:
for(i=0; i<N; i++) {
  c[i] = b[i] + a[i];
}

//OpenMP implementation 1 (not desired) – the work is split by hand:
#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id*N/Nthrds;
  iend = (id+1)*N/Nthrds;
  if (id == Nthrds-1) iend = N;   // last thread takes the remaining iterations
  for(i=istart; i<iend; i++) { c[i] = b[i] + a[i]; }
}

//A worksharing for construct to add vectors:
#pragma omp parallel
{
  #pragma omp for
  for(i=0; i<N; i++) { c[i] = b[i] + a[i]; }
}

//A worksharing for construct to add vectors (combined directive):
#pragma omp parallel for
for(i=0; i<N; i++) { c[i] = b[i] + a[i]; }
2.1 OpenMP - Mini-Tutorial – Version 3
schedule clause
Describes how the iterations of the loop are divided among the threads in the group. The default schedule is implementation dependent.
Usage: schedule (scheduling_class[, parameter])
– static - Loop iterations are divided into pieces of size chunk and then statically assigned to threads. If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.
– dynamic - Loop iterations are divided into pieces of size chunk and then dynamically assigned to threads. When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
– guided - For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (k > 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). The default chunk size is 1.
– runtime - The scheduling decision is deferred until runtime via the environment variable OMP_SCHEDULE. It is illegal to specify a chunk size for this clause.
– auto - The scheduling decision is made by the compiler and/or runtime system.

Matrix-Vector | Matrix-Matrix Multiplication – static scheduling (16 iterations, 4 threads):

// Static schedule maps iterations to threads at compile time
// static scheduling of matrix multiplication loops
#pragma omp parallel default(none) private(i, j, k) \
        shared(a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for(i=0; i < dim; i++)
{
  for(j=0; j < dim; j++)
  {
    c[i][j] = 0.0;
    for(k=0; k < dim; k++)
      c[i][j] += a[i][k]*b[k][j];
  }
}
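For comparison with the static example above, a short sketch (not from the slides) where the iteration cost is uneven, so schedule(dynamic, chunk) balances the load at run time:

// dynamic_demo.c – chunks of 2 iterations are handed out as threads become free
// Compile: gcc -fopenmp dynamic_demo.c -o dynamic_demo && ./dynamic_demo
#include <stdio.h>
#include "omp.h"

int main(void)
{
    #pragma omp parallel for schedule(dynamic, 2) num_threads(4)
    for (int i = 0; i < 16; i++)
    {
        double s = 0.0;
        for (int j = 0; j < (i + 1) * 100000; j++)   // uneven work per iteration
            s += j * 0.5;
        printf("iteration %2d (work %.0f) done by thread %d\n",
               i, s, omp_get_thread_num());
    }
    return 0;
}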
2.2 BMP Format and Sample
http://en.wikipedia.org/wiki/BMP_file_format
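A hedged sketch of the fixed 54-byte header described at the link above (field names follow the common BITMAPFILEHEADER/BITMAPINFOHEADER naming; the structs are only meant to show the on-disk layout of an uncompressed 24-bit BMP):

// bmp_header.h – BMP file header + info header, byte-packed as stored on disk
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint16_t bfType;            // 'BM' = 0x4D42
    uint32_t bfSize;            // total file size in bytes
    uint16_t bfReserved1;
    uint16_t bfReserved2;
    uint32_t bfOffBits;         // offset of the pixel array (usually 54)
} BMPFileHeader;

typedef struct {
    uint32_t biSize;            // size of this header (40)
    int32_t  biWidth;           // image width in pixels
    int32_t  biHeight;          // image height (positive = bottom-up rows)
    uint16_t biPlanes;          // always 1
    uint16_t biBitCount;        // bits per pixel, e.g. 24
    uint32_t biCompression;     // 0 = BI_RGB (uncompressed)
    uint32_t biSizeImage;       // pixel data size (may be 0 for BI_RGB)
    int32_t  biXPelsPerMeter;
    int32_t  biYPelsPerMeter;
    uint32_t biClrUsed;
    uint32_t biClrImportant;
} BMPInfoHeader;
#pragma pack(pop)

// Each pixel row of a 24-bit BMP is padded to a multiple of 4 bytes:
static inline int bmp_row_stride(int width) { return (width * 3 + 3) & ~3; }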
2.3 A.I / Data-mining Neural Network Sample
Are you in?
?
(Parallel System).Equals(Parallel Computing) == true?
But wait…
There’s More!