ParallelProgramming_Start2016

The document outlines a presentation agenda focused on High-Performance Computing (HPC) and parallel programming techniques, including various programming models like OpenMP and MPI. It discusses the use of parallel programming for data mining and artificial intelligence, providing sample code and resources for participants. Additionally, it covers the architecture of OpenMP, threading concepts, and practical examples for implementing parallel computing solutions.


ism.ase.ro | acs.ase.ro | dice.ase.ro | csie.ase.ro
Agenda for the Presentation

 HPC Overview
 Parallel Programming & BMP
 Data-mining / AI
 Communicate & Exchange Ideas

www.dice.ase.ro | www.ism.ase.ro | www.intel.com


Objectives, Conditions

HPC Overview
Parallel Programming
C/C++ in Linux with:

 MP – Multi-Processing Programming (OpenMP)
 MPI – Message Passing Interface (OpenMPI)
 TBB – Thread Building Blocks (Intel TBB)
 OpenCL – Open Computing Language (Intel OpenCL SDK)
 Multi-threaded Parallel Computing (Intel Cilk Plus, POSIX Threads, C++’11 Multithread)
Sample in C++’11/POSIX Threads/OpenMP, BMP image processing, Data-mining / A.I. – Artificial Intelligence issues

PARALLEL PROGRAMMING FOR INTEL CONTEST – TECH & Hints


It’s not just about deploying software, but about providing smart & fast solutions

Try it…
2. Technologies Combined for Solving the Challenge

Parallel Programming
• OpenMP

Bitmap Processing
• Sample-code

Data-Mining (Artificial Intelligence)


• Algorithms for pattern recognition
Parallel Computing & Systems - Intro  https://computing.llnl.gov/tutorials/parallel_comp/
Flynn Taxonomy
Parallel vs. Distributed Systems

Parallel vs. Distributed Computing / Algorithms

Where is the picture for: Distributed System and Parallel System?
http://en.wikipedia.org/wiki/Distributed_computing
http://en.wikipedia.org/wiki/Flynn's_taxonomy
Parallel Computing & Systems - Intro
https://computing.llnl.gov/tutorials/parallel_comp/

Serial Computing

Parallel Computing
2. HW & SW Platform

Alternative to:
1. C/C++ Nvidia CUDA
2. C/C++ OpenCL
programming on GPU – video boards
2. Vector Adding with Parallel Computing
http://ism.ase.ro/vm/Ubuntu12x64_Intel.zip
Download the VMware virtual machine with Ubuntu 12 x64 - Intel 2 cores, RAM 2048MB, HDD 20GB in order to solve the
contest problem with Intel C/C++ compiler, Intel Parallel Studio 2013 and Intel TBB, Eclipse CDT Juno, GCC and Oracle JDK 7,
PLUS all necessary documents for the Intel contest.

Adding two vectors sample (a C++’11 threads sketch follows this list):


 POSIX Threads
 C++’11 Threads
 Java Threads
 C++ (in OpenMP) – OpenMP mini Tutorial
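
A minimal sketch of the C++’11 threads variant (not from the slides; the vector size and thread count are illustrative assumptions), compiled with g++ -std=c++11 -pthread. The OpenMP variants appear later in the mini-tutorial.

// Adding two vectors with C++'11 threads: each thread handles one contiguous block.
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const std::size_t N = 1000000;                         // illustrative size
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 2;                       // fallback if unknown

    std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.0);

    // Worker adds its assigned block of the vectors.
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            c[i] = a[i] + b[i];
    };

    std::vector<std::thread> pool;
    std::size_t chunk = N / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == nthreads - 1) ? N : begin + chunk;
        pool.emplace_back(worker, begin, end);
    }
    for (auto &th : pool) th.join();                       // wait for all workers

    std::cout << "c[0] = " << c[0] << ", c[N-1] = " << c[N - 1] << std::endl;
    return 0;
}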
2. Parallel Programming Restrictions
http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160
The source code for C/C++ can be compiled without modification by the
Intel compiler (icc), to run in the following modes:

 Native: The entire application runs on the Intel Xeon Phi.


 Offload: The host processor runs the application and offloads
compute intensive code and associated data to the device as
specified by the programmer via pragmas in the source code.
 Host: Run the code as a traditional OpenMP application on the
host.

# compile for host-based OpenMP


icc -mkl -O3 -no-offload -openmp -Wno-unknown-pragmas -std=c99 -vec-report3 \
matrix.c -o matrix.omp
# compile for offload mode
icc -mkl -O3 -offload-build -Wno-unknown-pragmas -std=c99 -vec-report3 \
matrix.c -o matrix.off
# compile to run natively on the Xeon Phi
icc -mkl -O3 -mmic -openmp -L /opt/intel/lib/mic -Wno-unknown-pragmas \
-std=c99 -vec-report3 matrix.c -o matrix.mic -liomp5
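
For completeness, a hedged sketch of what offload-mode source might look like with Intel’s offload pragmas of that era; the function, array names and sizes are illustrative assumptions, and the exact clause spelling depends on the compiler version.

// Sketch: offload a vector addition to the Xeon Phi (Intel C/C++ compiler, ~2013).
#include <omp.h>

void vector_add_offload(float *a, float *b, float *c, int n)
{
    // Copy a and b to the coprocessor, run the loop there, copy c back to the host.
    #pragma offload target(mic) in(a : length(n)) in(b : length(n)) out(c : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}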
2.1 OpenMP
Open specifications for Multi-Processing
Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the
hardware and software industry, government and academia.

 An Application Program Interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism.
 API components:
 Compiler directives (compilers that support them: GNU & Intel C/C++ compilers - gcc/g++ & icc)
 Runtime library routines
 Environment variables

 Portability
 API is specified for C/C++ and Fortran
 Implementations on almost all platforms including Unix/Linux and Windows

 Standardization
 Jointly defined and endorsed by major computer hardware and software vendors
 Possibility to become ANSI standard

Partial Copyright:
http://www3.nd.edu/~zxu2/acms60212-40212-S12/Lec-11-01.pdf | https://computing.llnl.gov/tutorials/openMP/
2.1 OpenMP Architecture – Version 3
2.1 OpenMP Mini-Tutorial – Version 3
Thread
 A process is an instance of a computer program that is being executed. It contains the program code and its current activity.
 A thread of execution is the smallest unit of processing that can be scheduled by an operating system.
 Differences between threads and processes:
 A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as
memory. The threads of a process share the latter’s instructions (code) and its context (values that its variables
reference at any given moment).
 Different processes do not share these resources.
http://en.wikipedia.org/wiki/Process_(computing)

Process
 A process contains all the information needed to execute the program:
 Process ID
 Program code
 Data on run time stack
 Global data
 Data on heap
Each process has its own address space.
 In multitasking, processes are given time slices in a round robin fashion.
 If computer resources are assigned to another process, the status of the present process has to be saved so that the execution of the suspended process can be resumed at a later time.
2.1 OpenMP - Summary of MS Windows Memory
Native EXE File on HDD vs. RAM Memory Layout in MS Windows:
[Diagram: an EXE file on disk (beginning with the ’MZ’ signature, 16/32-bit headers, references/pointers to the segments, relocation pointer table, load module) is mapped as an EXE image into a process in RAM; each process (e.g. Firefox, Adobe Reader) has its own address space, communicates with other processes via IPC, and may optionally contain threads 1, 2, … n.]
http://www.codinghorror.com/blog/2007/03/dude-wheres-my-4-gigabytes-of-ram.html
2.1 OpenMP - Mini-Tutorial – Version 3
Threads Features:
 Thread operations include thread creation, termination, synchronization (joins, blocking), scheduling, data management and process interaction.
 The thread model is an extension of the process model: each process consists of multiple independent instruction streams (or threads) that are assigned computer resources by some scheduling procedure.
 A thread does not maintain a list of created threads, nor does it know the thread that created it.
 Threads of a process share the address space of this process: global variables and all dynamically allocated data objects are accessible by all threads of the process.
 Each thread has its own run-time stack, registers and program counter.
 Threads can communicate by reading/writing variables in the common address space.

Threads in the same process share:
 Process instructions
 Most data
 Open files (descriptors)
 Signals and signal handlers
 Current working directory
 User and group id

Each thread has a unique:
 Thread ID
 Set of registers, stack pointer
 Stack for local variables, return addresses
 Signal mask
 Priority
 Return value: errno

pthread functions return "0" if OK.
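
The slide describes thread features without code; below is a minimal POSIX Threads sketch (illustrative names, compile with gcc -pthread) showing creation, shared memory, join, and the "0 if OK" return convention.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NUM_THREADS 4

int shared_counter = 0;                              /* global data: visible to all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    long id = (long)arg;                             /* local variable on this thread's stack */
    pthread_mutex_lock(&lock);
    shared_counter++;                                /* threads communicate via shared memory */
    pthread_mutex_unlock(&lock);
    printf("thread %ld running, counter = %d\n", id, shared_counter);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    long t;

    for (t = 0; t < NUM_THREADS; t++)
        if (pthread_create(&tid[t], NULL, worker, (void *)t) != 0) {  /* returns 0 if OK */
            perror("pthread_create");
            exit(EXIT_FAILURE);
        }

    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);                  /* synchronization: wait for each thread */

    printf("final counter = %d\n", shared_counter);
    return 0;
}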
2.1 OpenMP - Mini-Tutorial – Version 3
Multi-threading vs. Multi-process development in UNIX/Linux:

https://computing.llnl.gov/tutorials/pthreads/ | http://www.javamex.com/tutorials/threads/how_threads_work.shtml
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP Programming Model:
 Shared memory, thread-based parallelism
 OpenMP is based on the existence of multiple threads in the shared memory programming paradigm.
 A shared memory process consists of multiple threads.
 Explicit Parallelism
 Programmer has full control over parallelization. OpenMP is not an automatic parallel programming
model.
 Compiler directive based
 Most OpenMP parallelism is specified through the use of compiler directives which are embedded in the
source code.
OpenMP is NOT:
 Necessarily implemented identically by all vendors
 Meant for distributed-memory parallel systems (it is designed for shared address space machines)
 Guaranteed to make the most efficient use of shared memory
 Required to check for data dependencies, data conflicts, race conditions, or deadlocks
 Required to check for code sequences that cause a program to be classified as non-conforming
 Meant to cover compiler-generated automatic parallelization and directives to the compiler to assist such
parallelization
 Designed to guarantee that input or output to the same file is synchronous when executed in parallel.
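
Since OpenMP does not detect race conditions for you (see the list above), here is a hedged illustration with made-up variable names: the first loop races on a shared counter, the second removes the race with the atomic directive (a reduction would work as well).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int count = 0;
    int i;

    /* WRONG: unsynchronized updates of a shared variable - OpenMP will not warn about this */
    #pragma omp parallel for shared(count)
    for (i = 0; i < 100000; i++)
        count++;                          /* data race: result is unpredictable */
    printf("racy count   = %d\n", count);

    count = 0;
    /* CORRECT: the atomic directive serializes the update (reduction(+:count) also works) */
    #pragma omp parallel for shared(count)
    for (i = 0; i < 100000; i++)
    {
        #pragma omp atomic
        count++;
    }
    printf("atomic count = %d\n", count);
    return 0;
}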
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP - Fork-Join Parallelism Model:

 An OpenMP program begins as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.
 When a parallel region is encountered, the master thread:
  Creates a group of threads by FORK.
  Becomes the master of this group of threads, and is assigned the thread id 0 within the group.
 The statements in the program that are enclosed by the parallel region construct are then executed in parallel among these threads.
 JOIN: When the threads complete executing the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
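
A small sketch, not in the original slides, that makes the FORK/JOIN behaviour visible by printing the team size before, inside, and after a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("before: %d thread(s)\n", omp_get_num_threads());   /* master thread only */

    #pragma omp parallel num_threads(4)                          /* FORK: team of 4 threads */
    {
        printf("inside: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                                            /* JOIN: implicit barrier */

    printf("after:  %d thread(s)\n", omp_get_num_threads());   /* back to the master only */
    return 0;
}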
2.1 OpenMP - Mini-Tutorial – Version 3
I/O
 OpenMP does not specify parallel I/O.
 It is up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.

Memory Model
 Threads can “cache” their data and are not required to maintain exact consistency with real memory all of the time.
When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is updated by all threads as needed.
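
As a hedged illustration of this memory-model point (it follows the classic producer/consumer flush example from the OpenMP literature, not a slide from this deck), the flush directive forces a consistent view of shared variables:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel sections num_threads(2) shared(data, flag)
    {
        #pragma omp section                 /* producer */
        {
            data = 42;
            #pragma omp flush(data)         /* publish data before raising the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section                 /* consumer */
        {
            int ready = 0;
            while (!ready) {
                #pragma omp flush(flag)     /* re-read flag from memory, not a cached copy */
                ready = flag;
            }
            #pragma omp flush(data)         /* make sure the up-to-date data is visible */
            printf("consumer read data = %d\n", data);
        }
    }
    return 0;
}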
//OpenMP Code Structure
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

int main()
{
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("Hello (%d)\n", ID);
        printf(" world (%d)\n", ID);
    }
    return 0;
}

Set # of threads for OpenMP:
- In csh:   setenv OMP_NUM_THREADS 8
- In bash:  export OMP_NUM_THREADS=8

Compile: g++ -fopenmp hello.c
Run:     ./a.out
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP Core Syntax:

#include "omp.h"
void main()
{
    int var1, var2, var3;
    // 1. Serial code
    ...
    // 2. Beginning of parallel section.
    //    Fork a team of threads. Specify variable scoping.
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        // 3. Parallel section executed by all threads
        ...
        // 4. All threads join master thread and disband
    }
    // 5. Resume serial code
}

OpenMP C/C++ Directive Format:
 OpenMP directive forms: C/C++ use compiler directives with the prefix #pragma omp …
 A directive consists of a directive name followed by clauses
 Example: #pragma omp parallel default (shared) private (var1, var2)

OpenMP Directive Format - General Rules:
 Case sensitive
 Only one directive-name may be specified per directive
 Each directive applies to at most one succeeding statement, which must be a structured block.
 Long directive lines can be “continued” on succeeding lines by escaping the newline character with a backslash “\” at the end of a directive line.
2.1 OpenMP - Mini-Tutorial – Version 3
OpenMP parallel Region Directive

#pragma omp parallel [clause list]

Typical clauses in [clause list]:
 Conditional parallelization
  – if (scalar expression): determines whether the parallel construct creates threads
 Degree of concurrency
  – num_threads (integer expression): number of threads to create
 Data Scoping
  – private (variable list): specifies variables local to each thread
  – firstprivate (variable list): similar to private; private variables are initialized to the variable’s value before the parallel directive is encountered
  – shared (variable list): specifies variables that are shared among all the threads
  – default (data scoping specifier): default data scoping specifier may be shared or none

Example:

#pragma omp parallel if (is_parallel == 1) num_threads(8) \
    shared(var_b) private(var_a) firstprivate(var_c) default(none)
{
    /* structured block */
}

 if (is_parallel == 1) num_threads(8)
  – If the value of the variable is_parallel is one, create 8 threads
 shared (var_b)
  – Each thread shares a single copy of variable var_b
 private (var_a) firstprivate (var_c)
  – Each thread gets private copies of variables var_a and var_c
  – Each private copy of var_c is initialized with the value of var_c in the main thread when the parallel directive is encountered
 default (none)
  – The default state of a variable is specified as none (rather than shared)
  – Signals an error if not all variables are specified as shared or private
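
A minimal runnable sketch of the clauses above, reusing the slide’s variable names; the initial values are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int is_parallel = 1;
    int var_a = 10, var_b = 20, var_c = 30;

    #pragma omp parallel if (is_parallel == 1) num_threads(4) \
            shared(var_b, is_parallel) private(var_a) firstprivate(var_c) default(none)
    {
        /* var_a is private and uninitialized here; var_c starts at 30 in every thread */
        var_a = omp_get_thread_num();
        var_c += var_a;
        printf("thread %d: var_a=%d var_b=%d var_c=%d\n",
               omp_get_thread_num(), var_a, var_b, var_c);
    }

    /* private/firstprivate copies are discarded; var_a and var_c keep their original values */
    printf("after: var_a=%d var_b=%d var_c=%d\n", var_a, var_b, var_c);
    return 0;
}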
2.1 OpenMP - Mini-Tutorial – Version 3

Number of Threads:

 The number of threads in a parallel region is determined by the following factors, in order of precedence:
  1. Evaluation of the if clause
  2. Setting of the num_threads() clause
  3. Use of the omp_set_num_threads() library function
  4. Setting of the OMP_NUM_THREADS environment variable
  5. Implementation default – usually the number of cores on a node

 Threads are numbered from 0 (master thread) to N-1
2.1 OpenMP - Mini-Tutorial – Version 3
Thread Creation: Parallel Region Example - Create threads with the parallel construct
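
The code for this example did not survive extraction; below is a minimal sketch in the spirit of the LLNL tutorial cited earlier (not the original slide content):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads, tid;

    /* Fork a team of threads, each with its own private copy of tid */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        if (tid == 0) {                   /* only the master thread does this */
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }   /* all threads join the master thread and disband */
    return 0;
}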
2.1 OpenMP - Mini-Tutorial – Version 3
Work-Sharing Construct:

 A parallel construct by itself creates a “Single Program Multiple Data” (SPMD) program, i.e., each thread executes the same code.
 Work-sharing is to split up pathways through the code between threads within a team.
– Loop construct (for/do)
– Sections/section constructs
– Single construct
 Within the scope of a parallel directive, work-sharing directives allow concurrency between iterations or tasks
 Work-sharing constructs do not create new threads.
 A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to
execute in parallel.
 Work-sharing constructs must be encountered by all members of a team or none at all.
 Two directives to be presented
 – do/for: concurrent loop iterations
 – sections: concurrent tasks
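
The sections and single constructs listed above are not illustrated elsewhere in this deck; here is a brief hedged sketch of both:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp sections            /* each section is run by one thread of the team */
        {
            #pragma omp section
            printf("task A on thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("task B on thread %d\n", omp_get_thread_num());
        }                               /* implicit barrier at the end of sections */

        #pragma omp single              /* executed by exactly one thread of the team */
        printf("single region on thread %d\n", omp_get_thread_num());
    }
    return 0;
}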
2.1 OpenMP - Mini-Tutorial – Version 3
Work-Sharing do/for Directive

do/for:
 Shares iterations of a loop across the group
 Represents a “data parallelism”
 The for directive partitions parallel iterations across threads; do is the analogous directive in Fortran
 Usage:
   #pragma omp for [clause list]
   /* for loop */
 Implicit barrier at end of the for loop

#include <stdlib.h>
#include <stdio.h>
#include "omp.h"

int main()
{
    int nthreads, tid;

    omp_set_num_threads(3);

    #pragma omp parallel private(tid)
    {
        int i;
        tid = omp_get_thread_num();
        printf("Hello world from (%d)\n", tid);

        #pragma omp for
        for(i = 0; i <= 4; i++)
        {
            printf("Iteration %d by %d\n", i, tid);
        }
    } // all threads join master thread and terminate
    return 0;
}
2.1 OpenMP - Mini-Tutorial – Version 3
//Sequential code to add two vectors:
for(i = 0; i < N; i++) {
    c[i] = b[i] + a[i];
}

//OpenMP implementation 1 (not desired - manual decomposition):
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id*N/Nthrds;
    iend = (id+1)*N/Nthrds;
    if(id == Nthrds-1) iend = N;
    for(i = istart; i < iend; i++) {
        c[i] = b[i] + a[i];
    }
}

//A worksharing for construct to add vectors:
#pragma omp parallel
{
    #pragma omp for
    for(i = 0; i < N; i++) { c[i] = b[i] + a[i]; }
}

//A worksharing for construct to add vectors (combined directive):
#pragma omp parallel for
for(i = 0; i < N; i++) { c[i] = b[i] + a[i]; }
2.1 OpenMP - Mini-Tutorial – Version 3
C/C++ for Directive Syntax:

#pragma omp for [clause list]
    schedule (type [,chunk])
    ordered
    private (variable list)
    firstprivate (variable list)
    shared (variable list)
    reduction (operator: variable list)
    collapse (n)
    nowait
/* for_loop */

for Directive Restrictions – for the “for loop” that follows the for directive:
 It must not have a break statement
 The loop control variable must be an integer
 The initialization expression of the “for loop” must be an integer assignment
 The logical expression must be one of <, ≤, >, ≥
 The increment expression must have integer increments or decrements only

How to combine values into a single accumulation variable (avg)?

//Sequential code to compute the average value of an array-vector:
{
    double avg = 0.0, A[MAX];
    int i;
    ...
    for(i = 0; i < MAX; i++) {
        avg += A[i];
    }
    avg /= MAX;
}
2.1 OpenMP - Mini-Tutorial – Version 3
Reduction Clause

 reduction (operator: variable list): specifies how to combine local copies of a variable in different threads into a single copy at the master when the threads exit. Variables in the variable list are implicitly private to threads.
 Operators used in the reduction clause: +, *, -, &, |, ^, &&, and ||

Reduction in an OpenMP for, inside a parallel or a work-sharing construct:
 A local copy of each list variable is made and initialized depending on the operator (e.g. 0 for “+”)
 The compiler finds standard reduction expressions containing the operator and uses them to update the local copy
 Local copies are reduced into a single value and combined with the original global value when control returns to the master thread

Usage sample:

#pragma omp parallel reduction(+: sums) num_threads(4)
{
    /* compute local sums in each thread */
}
/* sums here contains the sum of all local instances of sums */

Reduction Operators / Initial Values in C/C++ OpenMP:
Operator | Initial Value      Operator | Initial Value
   +     |       0               |     |       0
   *     |       1               ^     |       0
   -     |       0               &&    |       1
   &     |      ~0               ||    |       0

//A work-sharing for to compute the average value of a vector:
{
    double avg = 0.0, A[MAX];
    int i;
    ...
    #pragma omp parallel for reduction(+: avg)
    for(i = 0; i < MAX; i++) { avg += A[i]; }
    avg /= MAX;
}
2.1 OpenMP - Mini-Tutorial – Version 3
Matrix-Vector Multiplication

#pragma omp parallel default(none) \
    shared(a, b, c, m, n) private(i, j, sum) num_threads(4)
#pragma omp for
for(i = 0; i < m; i++)
{
    sum = 0.0;
    for(j = 0; j < n; j++)
        sum += b[i][j]*c[j];
    a[i] = sum;
}
2.1 OpenMP - Mini-Tutorial – Version 3
Matrix-Vector | Matrix-Matrix Multiplication – the schedule clause

 Describes how iterations of the loop are divided among the threads in the group. The default schedule is implementation dependent.
 Usage: schedule (scheduling_class [, parameter])
  – static: Loop iterations are divided into pieces of size chunk and then statically assigned to threads. If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.
  – dynamic: Loop iterations are divided into pieces of size chunk and then dynamically assigned to threads. When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
  – guided: For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (k > 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). The default chunk size is 1.
  – runtime: The scheduling decision is deferred until runtime by the environment variable OMP_SCHEDULE. It is illegal to specify a chunk size for this clause.
  – auto: The scheduling decision is made by the compiler and/or runtime system.

Static scheduling - 16 iterations, 4 threads:

// Static schedule maps iterations to threads at compile time
// static scheduling of matrix multiplication loops
#pragma omp parallel default(none) \
    shared(a, b, c, dim) private(i, j, k) num_threads(4)
#pragma omp for schedule(static)
for(i = 0; i < dim; i++)
{
    for(j = 0; j < dim; j++)
    {
        c[i][j] = 0.0;
        for(k = 0; k < dim; k++)
            c[i][j] += a[i][k]*b[k][j];
    }
}
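
Only the static case is shown above; here is a short hedged sketch of requesting dynamic scheduling for an irregular workload (the work() function and result[] array are illustrative assumptions):

#include <stdio.h>
#include <omp.h>

/* Illustrative irregular workload: iteration i costs O(i) operations. */
static double work(int i)
{
    double s = 0.0;
    for (int j = 0; j < i; j++)
        s += j * 0.5;
    return s;
}

int main(void)
{
    enum { DIM = 1000 };
    static double result[DIM];

    /* Chunks of 4 iterations are handed out as threads become free, which
       balances the load better than schedule(static) for this loop. */
    #pragma omp parallel for schedule(dynamic, 4) num_threads(4)
    for (int i = 0; i < DIM; i++)
        result[i] = work(i);

    printf("result[DIM-1] = %f\n", result[DIM - 1]);
    /* schedule(runtime) would defer the choice to OMP_SCHEDULE, e.g.:
       export OMP_SCHEDULE="guided,8" */
    return 0;
}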
2.2 BMP Format and Sample

http://en.wikipedia.org/wiki/BMP_file_format
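
The bitmap sample code itself is not reproduced in this extract; below is a minimal hedged sketch of reading the standard BMP header fields (the input file name is illustrative):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Minimal BMP header reader: offsets follow the standard layout
   ('BM' signature, 14-byte file header, 40-byte BITMAPINFOHEADER). */
int main(void)
{
    FILE *f = fopen("input.bmp", "rb");      /* illustrative file name */
    if (!f) { perror("fopen"); return 1; }

    unsigned char header[54];
    if (fread(header, 1, 54, f) != 54 || header[0] != 'B' || header[1] != 'M') {
        fprintf(stderr, "not a BMP file\n");
        fclose(f);
        return 1;
    }

    uint32_t data_offset;                    /* where the pixel array starts   */
    int32_t  width, height;
    uint16_t bpp;
    memcpy(&data_offset, &header[10], 4);
    memcpy(&width,       &header[18], 4);
    memcpy(&height,      &header[22], 4);
    memcpy(&bpp,         &header[28], 2);

    printf("%dx%d pixels, %u bits/pixel, pixel data at offset %u\n",
           width, height, bpp, data_offset);
    fclose(f);
    return 0;
}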
2.3 A.I / Data-mining Neural Network Sample
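
The neural-network sample is not included in this extract either; as a placeholder illustration of the kind of pattern-recognition kernel meant here (weights and inputs are made-up values), a single perceptron forward pass, whose dot product is the same reduction idiom shown in section 2.1:

#include <stdio.h>

#define N_INPUTS 4

/* Step-activation perceptron: output = 1 if w·x + bias > 0, else 0. */
static int perceptron(const double w[], const double x[], double bias)
{
    double sum = bias;
    int i;
    /* The reduction only pays off for large N_INPUTS; shown here to tie back to OpenMP. */
    #pragma omp parallel for reduction(+: sum)
    for (i = 0; i < N_INPUTS; i++)
        sum += w[i] * x[i];
    return sum > 0.0 ? 1 : 0;
}

int main(void)
{
    double w[N_INPUTS] = { 0.5, -0.6, 0.2, 0.1 };   /* made-up weights        */
    double x[N_INPUTS] = { 1.0,  0.0, 1.0, 1.0 };   /* made-up input pattern  */
    printf("class = %d\n", perceptron(w, x, -0.3));
    return 0;
}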
Are you in?

Communicate & Exchange Ideas


Some “myths” – Would you like to set up another meeting for an OpenMP tutorial?

(Distributed Systems).Equals(Distributed Computing) == true?

(Parallel System).Equals(Parallel Computing) == true?

(Parallel System == Distributed System) != true?

(Sequential vs. Parallel vs. Concurrent vs. Distributed Programming) ? (Different) : (Same)
Questions & Answers!

But wait… There’s More!

if (HTC != HPC)
    HTC (High Throughput Computing) >
    MTC (Many Task Computing) >
    HPC (High Performance Computing);

… Will be continued!?! …
