Lecture02 Types
2: Types of Parallelism
▪ Parallelism in Hardware (Uniprocessor)
– Pipelining
– Superscalar, VLIW etc.
▪ Parallelism in Hardware (SIMD, Vector processors, GPUs)
▪ Parallelism in Hardware (Multiprocessor)
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
▪ Parallelism in Hardware (Multicomputers a.k.a. clusters)
▪ Parallelism in Software
– Task parallelism
– Data parallelism
[Figure: diagram with blocks labelled "Interconnection" and "Main memory"]
flag = 0;               flag = 0;
…                       …
a = 10;                 while (!flag) {}
flag = 1;               x = a * y;
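The flag hand-off above can be sketched in C as follows. This is a minimal illustration, not the lecture's own code: it assumes POSIX threads and C11 atomics, and the function names (producer, consumer, run_flag_example) are hypothetical. A plain int flag would be a data race in C, so the sketch makes the flag atomic with release/acquire ordering, which guarantees that the consumer sees a = 10 once it sees flag = 1.

```c
#include <pthread.h>
#include <stdatomic.h>

static int a;             /* ordinary shared data, written by the producer */
static atomic_int flag;   /* synchronisation flag, 0 until a is ready */

/* Producer: write the data first, then publish it by setting the flag. */
static void *producer(void *unused) {
    (void)unused;
    a = 10;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* Consumer: spin until the flag is set, then use the data.
   arg points to y on entry and receives x = a * y on exit. */
static void *consumer(void *arg) {
    int *xy = arg;
    while (!atomic_load_explicit(&flag, memory_order_acquire)) {
        /* busy-wait, as in the slide's while (!flag) {} */
    }
    *xy = a * *xy;
    return NULL;
}

/* Hypothetical driver: runs both threads and returns x = a * y. */
int run_flag_example(int y) {
    pthread_t p1, p2;
    int x = y;
    a = 0;
    atomic_store(&flag, 0);
    pthread_create(&p2, NULL, consumer, &x);
    pthread_create(&p1, NULL, producer, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return x;
}
```

Calling run_flag_example(3) should return 30 whichever thread happens to be scheduled first, because the consumer cannot leave its loop before the producer has stored the flag.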
▪ Message passing
Producer (p1)           Consumer (p2)
…                       …
a = 10;                 receive(p1, b, label);
send(p2, a, label);     x = b * y;
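One possible way to mimic the send/receive pair above is with two processes joined by a POSIX pipe. The sketch below is an assumption-laden illustration: send_int, receive_int, struct message and the label value 42 are hypothetical stand-ins for the slide's send, receive and label, not a real message-passing library.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message format: a label (tag) plus one integer payload. */
struct message { int label; int value; };

/* Stand-in for send(p2, a, label): write one tagged value to the channel. */
static int send_int(int fd, int value, int label) {
    struct message m = { label, value };
    return write(fd, &m, sizeof m) == (ssize_t)sizeof m ? 0 : -1;
}

/* Stand-in for receive(p1, b, label): block until a message arrives,
   then check that it carries the expected label. */
static int receive_int(int fd, int *value, int label) {
    struct message m;
    if (read(fd, &m, sizeof m) != (ssize_t)sizeof m || m.label != label)
        return -1;
    *value = m.value;
    return 0;
}

/* Child process plays the producer p1, parent plays the consumer p2.
   Returns x = b * y, or -1 if the channel could not be set up or read. */
int run_message_example(int y) {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {                 /* producer (p1) */
        close(fds[0]);
        int a = 10;
        _exit(send_int(fds[1], a, 42) == 0 ? 0 : 1);
    }
    close(fds[1]);                    /* consumer (p2) */
    int b = 0;
    int rc = receive_int(fds[0], &b, 42);
    close(fds[0]);
    waitpid(child, NULL, 0);
    return rc == 0 ? b * y : -1;
}
```

Unlike the shared-memory version, the two processes share no variables: the consumer gets its own copy b of the producer's a, and the blocking read provides the synchronisation that the flag provided before.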
▪ Transaction-level parallelism
– Multiple threads/processes from different transactions can be executed
concurrently
– Limited by access to metadata and by interconnection bandwidth
while (!done) {
  diff = 0;
  for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i][j];
      /* average each point with its four neighbours; rows/columns 0 and n+1 hold the boundary */
      A[i][j] = 0.2*(A[i][j]+A[i][j-1]+A[i-1][j] +
                     A[i][j+1]+A[i+1][j]);
      diff += fabs(A[i][j] - temp);
    }
  }
  if (diff/(n*n) < TOL) done=1;
}
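For experimenting with the sequential kernel, a self-contained version might look like the sketch below. The grid size N, the tolerance TOL, the helper absd and the function name solve_sequential are assumptions made for illustration; the slide does not specify them.

```c
#define N   4       /* interior rows/columns (hypothetical size) */
#define TOL 1e-3    /* hypothetical convergence tolerance */

/* Absolute value for doubles, written out to avoid a dependency on libm. */
static double absd(double v) { return v < 0 ? -v : v; }

/* Sweeps the interior of A in place until the average change per point
   drops below TOL. Rows and columns 0 and N+1 hold fixed boundary values.
   Returns the number of sweeps performed. */
int solve_sequential(double A[N + 2][N + 2]) {
    int sweeps = 0, done = 0;
    while (!done) {
        double diff = 0.0;
        for (int i = 1; i <= N; i++) {
            for (int j = 1; j <= N; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i - 1][j] +
                                 A[i][j + 1] + A[i + 1][j]);
                diff += absd(A[i][j] - temp);
            }
        }
        sweeps++;
        if (diff / (N * N) < TOL) done = 1;
    }
    return sweeps;
}
```

A caller would fill the boundary (for example, set row 0 to 1.0 and everything else to 0.0), call solve_sequential(A), and then read the relaxed interior values out of A.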
CS4/MSc Parallel Architectures - 2014-2015
Example: Equation Solver Kernel
▪ TLP version (shared-memory):
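int mymin = 1+(pid * n/P);      /* first row owned by this processor (assumes P divides n) */
int mymax = mymin + n/P - 1;    /* last row owned by this processor */
while (!done) {
  diff = 0; mydiff = 0;
  barrier(bar, P);              /* keeps a late reset of diff from erasing another processor's partial sum */
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i][j];
      A[i][j] = 0.2*(A[i][j]+A[i][j-1]+A[i-1][j] +
                     A[i][j+1]+A[i+1][j]);
      mydiff += fabs(A[i][j] - temp);
    }
  }
  lock(diff_lock); diff += mydiff; unlock(diff_lock);
  barrier(bar, P);              /* all partial sums are now in diff */
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);              /* everyone has tested diff before it is reset */
}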
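The lock-and-barrier structure above could be realised with POSIX threads roughly as follows. This is a sketch under several assumptions rather than the lecture's implementation: N, P, TOL, absd, worker and solve_parallel are hypothetical names, pthread_barrier_t is an optional POSIX feature (available on Linux), and a barrier follows the reset of diff so that a late reset cannot erase a partial sum. To stay free of data races in C, the sketch also reads from one copy of the grid and writes to another (Jacobi-style), which differs from the in-place update on the slide.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define N   8       /* interior rows/columns (hypothetical size) */
#define P   2       /* worker threads; assumed to divide N */
#define TOL 1e-3    /* hypothetical convergence tolerance */

static double grid[2][N + 2][N + 2];  /* two copies: read one, write the other */
static double diff;                   /* shared sum of changes in one sweep */
static int sweeps;                    /* completed sweeps, counted by thread 0 */
static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar;

static double absd(double v) { return v < 0 ? -v : v; }

static void *worker(void *arg) {
    int pid = (int)(long)arg;
    int mymin = 1 + pid * N / P;      /* first row owned by this thread */
    int mymax = mymin + N / P - 1;    /* last row owned by this thread */
    int cur = 0, done = 0;            /* private to each thread */

    while (!done) {
        double (*src)[N + 2] = grid[cur], (*dst)[N + 2] = grid[1 - cur];
        double mydiff = 0.0;
        if (pid == 0) diff = 0.0;     /* one thread resets the shared sum */
        pthread_barrier_wait(&bar);   /* reset happens before anyone adds */
        for (int i = mymin; i <= mymax; i++) {
            for (int j = 1; j <= N; j++) {
                dst[i][j] = 0.2 * (src[i][j] + src[i][j - 1] + src[i - 1][j] +
                                   src[i][j + 1] + src[i + 1][j]);
                mydiff += absd(dst[i][j] - src[i][j]);
            }
        }
        pthread_mutex_lock(&diff_lock);   /* updates to diff take turns */
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);
        pthread_barrier_wait(&bar);   /* all partial sums are in diff */
        if (diff / (N * N) < TOL) done = 1;
        if (pid == 0) sweeps++;
        cur = 1 - cur;                /* swap the roles of the two copies */
        pthread_barrier_wait(&bar);   /* everyone has tested diff */
    }
    return NULL;
}

/* Relaxes a grid whose top boundary row is 1.0 and returns the sweep count. */
int solve_parallel(void) {
    pthread_t tid[P];
    for (int k = 0; k < 2; k++)
        for (int i = 0; i < N + 2; i++)
            for (int j = 0; j < N + 2; j++)
                grid[k][i][j] = (i == 0) ? 1.0 : 0.0;
    sweeps = 0;
    pthread_barrier_init(&bar, NULL, P);
    for (long p = 0; p < P; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < P; p++)
        pthread_join(tid[p], NULL);
    pthread_barrier_destroy(&bar);
    return sweeps;
}
```

Every thread tests the same value of diff between the last two barriers, so all of them set their private done in the same sweep and leave the loop together.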
Example: Equation Solver Kernel
▪ TLP version (shared-memory), for 2 processors:
– Each processor gets a contiguous chunk of n/P rows
– E.g., with n=4 and P=2, processor 0 gets mymin=1 and mymax=2,
and processor 1 gets mymin=3 and mymax=4
while (!done) {
  diff = 0; mydiff = 0;
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i][j];
      A[i][j] = 0.2*(A[i][j]+A[i][j-1]+A[i-1][j] +
                     A[i][j+1]+A[i+1][j]);
      mydiff += fabs(A[i][j] - temp);
    }
  ...
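The row assignment can be checked with a small helper. The sketch below is illustrative only: row_range is a hypothetical name, and handing any leftover rows to the lowest-numbered processors is an assumption added here so that n need not be a multiple of P. When P divides n it yields the same mymin and mymax as the formula on the slide.

```c
/* Computes the 1-based, inclusive row range [*mymin, *mymax] owned by
   processor pid (0 <= pid < P) when n rows are split into P chunks.
   If P does not divide n, the first n % P processors get one extra row. */
void row_range(int pid, int n, int P, int *mymin, int *mymax) {
    int base = n / P;                  /* rows every processor gets */
    int extra = n % P;                 /* leftover rows to hand out */
    int before = pid * base + (pid < extra ? pid : extra);
    *mymin = 1 + before;
    *mymax = *mymin + base + (pid < extra ? 1 : 0) - 1;
}
```

With n=4 and P=2 this gives rows 1 to 2 for processor 0 and rows 3 to 4 for processor 1, matching the example; with n=10 and P=4 the chunks are 1 to 3, 4 to 6, 7 to 8 and 9 to 10.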
Example: Equation Solver Kernel
▪ TLP version (shared-memory):
– All processors can freely access the same data structure A
– Access to diff, however, must be serialized: only one processor at a time
may update it (hence diff_lock)
– Each processor updates its own copy of done; the barriers ensure that all
of them test the same final value of diff
...
  for (i=mymin; i<=mymax; i++) {
    for (j=1; j<=n; j++) {
      temp = A[i][j];
      A[i][j] = 0.2*(A[i][j]+A[i][j-1]+A[i-1][j] +
                     A[i][j+1]+A[i+1][j]);
      mydiff += fabs(A[i][j] - temp);
    }
  }
  lock(diff_lock); diff += mydiff; unlock(diff_lock);
  barrier(bar, P);
  if (diff/(n*n) < TOL) done=1;
  barrier(bar, P);
}
Types of Speedups and Scaling
▪ Scalability: adding x times more resources to the machine yields
close to x times better “performance”
– Usually resources are processors (but can also be memory size or
interconnect bandwidth)
– Usually means that with x times more processors we can get ~x times
speedup for the same problem
– In other words: how well does efficiency (see Lecture 1) hold up as the
number of processors increases?
$S_{PC} = \frac{\text{Time}(1\text{ processor})}{\text{Time}(p\text{ processors})}$
$S_{TC} = \frac{\text{Work}(p\text{ processors})}{\text{Work}(1\text{ processor})}$
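A short worked example may help separate the two ratios. All numbers below are hypothetical, chosen only for illustration, and efficiency is taken to be speedup divided by the number of processors, which is assumed to match the Lecture 1 definition.

```latex
% Hypothetical measurements, for illustration only.
% Same problem on 1 and on 8 processors: Time(1) = 120 s, Time(8) = 20 s
S_{PC} = \frac{120\,\text{s}}{20\,\text{s}} = 6,
\qquad
\text{efficiency} = \frac{S_{PC}}{p} = \frac{6}{8} = 0.75

% Same time budget on 1 and on 8 processors:
% Work(1) = 10^6 grid points, Work(8) = 7 \times 10^6 grid points
S_{TC} = \frac{7 \times 10^{6}}{10^{6}} = 7
```

In this made-up case the machine looks better under the second ratio because the larger problem keeps the processors busier. The two ratios answer different questions, so they should be reported separately.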