Juan Mendivelso
MULTITHREADING ALGORITHMS
SERIAL ALGORITHMS & PARALLEL ALGORITHMS
Serial Algorithms: Suitable for running on a
uniprocessor computer in which only one
instruction executes at a time.
Parallel Algorithms: Run on a multiprocessor
computer that permits multiple instructions to
execute concurrently.
PARALLEL COMPUTERS
Computers with multiple processing units.
They can be:
Chip multiprocessors: Inexpensive
laptops/desktops. They contain a single multicore
integrated-circuit chip that houses multiple processor
“cores”, each of which is a full-fledged processor
with access to common memory.
Clusters:
Built from individual computers with a
dedicated network interconnecting them.
Intermediate price/performance.
Supercomputers: Combination of custom
architectures and custom networks to deliver the
highest performance (instructions per second).
High price.
MODELS FOR PARALLEL COMPUTING
Although the random-access machine model
was accepted early on for serial computing, no
single model has been established for parallel
computing.
A major reason is that vendors have not agreed
on a single architectural model for parallel
computers.
For example, some parallel computers feature
shared memory, where all processors can
access any location of memory.
Others employ distributed memory, where each
processor has a private memory.
However, the trend appears to be toward
shared-memory multiprocessors.
STATIC THREADING
Shared-memory parallel computers use static
threading.
Software abstraction of “virtual processors” or
threads sharing a common memory.
Each thread can execute code independently.
For most applications, threads persist for the
duration of a computation.
PROBLEMS OF STATIC THREADING
Programming a shared-memory parallel
computer directly using static threads is
difficult and error prone.
Dynamically partitioning the work among the
threads so that each thread receives
approximately the same load turns out to be
complicated.
The programmer must use complex
communication protocols to implement a
scheduler to load-balance the work.
This has led to the creation of concurrency
platforms. They provide a layer of software that
coordinates, schedules and manages the
parallel-computing resources.
DYNAMIC MULTITHREADING
Class of concurrency platform.
It allows programmers to specify parallelism in
applications without worrying about
communication protocols, load balancing, etc.
The concurrency platform contains a scheduler
that load-balances the computation
automatically.
It supports:
Nested parallelism: It allows a subroutine to be
spawned, allowing the caller to proceed while the
spawned subroutine is computing its result.
Parallel loops: like ordinary for loops, except that
the iterations can execute concurrently.
ADVANTAGES OF DYNAMIC MULTITHREADING
The user only specifies the logical parallelism.
Simple extension of the serial model with the
keywords parallel, spawn and sync.
Clean way to quantify parallelism.
Many multithreaded algorithms involving
nested parallelism follow naturally from the
Divide & Conquer paradigm.
BASICS OF MULTITHREADING
Fibonacci Example
The serial algorithm Fib(n) does repeated work:
it recomputes the same subproblems, so its
running time is exponential in n.
However, the two recursive calls are independent!
Parallel algorithm P-Fib(n): spawn one recursive
call and execute the other concurrently.
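A minimal Python sketch of the idea (an illustration, not the book's pseudocode): spawn maps to starting a thread, sync maps to joining it. Note that CPython's GIL means this shows only the logical parallelism, not an actual speedup.

```python
import threading

def fib(n):
    """Serial Fib(n): recomputes subproblems, exponential time."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

def p_fib(n):
    """P-Fib(n): 'spawn' becomes a thread, 'sync' becomes join()."""
    if n < 2:
        return n
    result = {}
    # spawn: the child computes P-Fib(n-1) while the parent continues
    child = threading.Thread(target=lambda: result.update(x=p_fib(n - 1)))
    child.start()
    y = p_fib(n - 2)   # the parent works on the other call concurrently
    child.join()       # sync: wait for the spawned child to finish
    return result['x'] + y

print(p_fib(10))  # 55, the same value as the serial fib(10)
```

Deleting the thread machinery (start/join) from `p_fib` yields exactly the serial `fib`, which is the serialization discussed next.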
SERIALIZATION
Concurrency keywords: spawn, sync and
parallel
The serialization of a multithreaded algorithm
is the serial algorithm that results from deleting
the concurrency keywords.
NESTED PARALLELISM
It occurs when the keyword spawn precedes a
procedure call.
It differs from an ordinary procedure call in
that the procedure instance that executes the
spawn (the parent) may continue to execute
in parallel with the spawned subroutine (its child)
instead of waiting for the child to complete.
KEYWORD SPAWN
It doesn’t say that a procedure must execute
concurrently with its spawned children; only
that it may!
The concurrency keywords express the logical
parallelism of the computation.
At runtime, it is up to the scheduler to
determine which subcomputations actually run
concurrently by assigning them to processors.
KEYWORD SYNC
A procedure cannot safely use the values
returned by its spawned children until after it
executes a sync statement.
The keyword sync indicates that the procedure
must wait until all its spawned children have
been completed before proceeding to the
statement after the sync.
Every procedure executes a sync implicitly
before it returns.
COMPUTATIONAL DAG
We can see a multithread computation as a
directed acyclic graph G=(V,E) called a
computational dag.
The vertices are instructions and the edges
represent dependencies between instructions,
where (u,v) є E means that instruction u must
execute before instruction v.
If a chain of instructions contains no parallel
control (no spawn, sync, or return), we may
group them into a single strand, each of which
represents one or more instructions.
Instructions involving parallel control are not
included in strands, but are represented in the
structure of the dag.
For example, if a strand has two successors,
one of them must have been spawned, and a
strand with multiple predecessors indicates the
predecessors joined because of a sync.
Thus, in the general case, the set V forms the
set of strands, and the set E of directed edges
represents dependencies between strands
induced by parallel control.
If G has a directed path from strand u to strand v,
we say that the two strands are (logically) in
series. Otherwise, strands u and v are (logically)
in parallel.
We can picture a multithreaded computation as
a dag of strands embedded in a tree of
procedure instances.
Example!
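As a sketch of the series/parallel definition (using a small made-up four-strand dag, not a figure from the source): two strands are in series exactly when the dag has a directed path between them.

```python
# A small hypothetical strand dag: edges (u, v) mean u must finish
# before v starts. Strand 1 spawns 2 and continues as 3; 2 and 3
# join at strand 4 because of a sync.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

def reachable(g, u, v):
    """Depth-first search: is there a directed path from u to v?"""
    stack, seen = [u], set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(g[w])
    return False

def in_series(g, u, v):
    """Strands are in series iff a directed path joins them."""
    return reachable(g, u, v) or reachable(g, v, u)

print(in_series(edges, 1, 4))  # True: strand 4 depends on strand 1
print(in_series(edges, 2, 3))  # False: 2 and 3 are logically in parallel
```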
We can classify the edges:
Continuation edges: connect a strand u to its
successor u’ within the same procedure instance.
Call edges: represent normal procedure calls.
Return edges: connect a strand u that returns to its
calling procedure with the strand x immediately
following the next sync in the calling procedure.
A computation starts with an initial strand and
ends with a single final strand.
IDEAL PARALLEL COMPUTER
A parallel computer that consists of a set of
processors and a sequentially consistent shared
memory.
Sequentially consistent means that the shared
memory behaves as if the multithreaded
computation’s instructions were interleaved to
produce a linear order that preserves the partial
order of the computation dag.
Depending on scheduling, the ordering could
differ from one run of the program to another.
The ideal-parallel-computer model makes some
performance assumptions:
Each processor in the machine has equal
computing power
It ignores the cost of scheduling.
PERFORMANCE MEASURES
Work:
Total time to execute the entire computation on
one processor.
Sum of the times taken by each of the strands.
In the computational dag, it is the number of
strands (assuming each strand takes a time unit).
Span:
Longest time to execute the strands along any path
in the dag.
The span equals the number of vertices on a
longest or critical path.
Example!
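With unit-time strands, both measures can be read directly off the dag: work is the vertex count, span is a longest-path computation. A sketch on the same kind of hypothetical dag as before:

```python
# Hypothetical strand dag: strand 1 spawns 2, continues as 3;
# 2 and 3 sync at strand 4. Each strand takes one time unit.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

def work(g):
    """Work T1 = total number of (unit-time) strands."""
    return len(g)

def span(g):
    """Span T_inf = number of vertices on a longest (critical) path."""
    memo = {}
    def longest_from(u):  # longest path starting at u, counting u itself
        if u not in memo:
            memo[u] = 1 + max((longest_from(v) for v in g[u]), default=0)
        return memo[u]
    return max(longest_from(u) for u in g)

print(work(edges), span(edges))  # 4 3: T1 = 4, T_inf = 3 (e.g. 1->2->4)
```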
The actual running time of a multithreaded
computation depends also on how many
processors are available and how the
scheduler allocates strands to processors.
Running time on P processors: TP
Work: T1
Span: T∞ (unlimited number of processors)
The work and span provide lower bounds on the
running time TP of a multithreaded computation
on P processors:
Work law: TP ≥ T1 /P
Span law: TP ≥ T∞
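The two laws combine into a single lower bound, max(T1/P, T∞). A worked example with hypothetical numbers (T1 = 100, T∞ = 10):

```python
def tp_lower_bound(T1, Tinf, P):
    """Combine the work law (TP >= T1/P) and the span law (TP >= T_inf)."""
    return max(T1 / P, Tinf)

# Hypothetical computation: work T1 = 100, span T_inf = 10.
for P in (1, 4, 16, 64):
    print(f"P={P:2d}: TP >= {tp_lower_bound(100, 10, P)}")
# With P = 4 the work law dominates (TP >= 25.0);
# with P = 64 the span law dominates (TP >= 10).
```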
Speedup:
Speedup of a computation on P processors is the
ratio T1 /TP
How many times faster the computation is on P
processors than on one processor.
It’s at most P.
Linear speedup: T1 /TP = Θ(P)
Perfect linear speedup: T1 /TP =P
Parallelism:
T1 /T∞
Average amount of work that can be
performed in parallel for each step along the critical
path.
As an upper bound, the parallelism gives the
maximum possible speedup that can be achieved
on any number of processors.
The parallelism provides a limit on the possibility of
attaining perfect linear speedup.
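The speedup bound can be made concrete (same hypothetical numbers as above, so parallelism = 100/10 = 10): the work law caps speedup at P, and the span law caps it at the parallelism T1/T∞.

```python
def max_speedup(T1, Tinf, P):
    """Speedup T1/TP is at most P (since TP >= T1/P) and at most
    the parallelism T1/T_inf (since TP >= T_inf)."""
    return min(P, T1 / Tinf)

# Hypothetical computation: T1 = 100, T_inf = 10, parallelism = 10.
for P in (2, 10, 50):
    print(f"P={P:2d}: speedup <= {max_speedup(100, 10, P)}")
# Beyond P = 10, adding processors cannot raise the speedup bound.
```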
SCHEDULING
Good performance depends on more than
minimizing the span and work.
The strands must also be scheduled efficiently
onto the processors of the parallel machine.
The multithreaded programming model provides
no way to specify which strands to execute on
which processors. Instead, we rely on the
concurrency platform’s scheduler.
A multithreaded scheduler must schedule the
computation with no advance knowledge of
when strands will be spawned or when they will
complete—it must operate on-line.
Moreover, a good scheduler operates in a
distributed fashion, where the threads
implementing the scheduler cooperate to load-
balance the computation.
To keep the analysis simple, we shall consider
an on-line centralized scheduler, which knows
the global state of the computation at any given
time.
In particular, we shall consider greedy
schedulers, which assign as many strands to
processors as possible in each time step.
If at least P strands are ready to execute during
a time step, we say that the step is a complete
step, and a greedy scheduler assigns any P of
the ready strands to processors.
Otherwise, fewer than P strands are ready to
execute, in which case we say that the step is
an incomplete step, and the scheduler assigns
each ready strand to its own processor.
A greedy scheduler executes a multithreaded
computation in time: TP ≤ T1 /P + T∞
Greedy scheduling is provably good because it
achieves the sum of the lower bounds as an
upper bound.
Besides, it is within a factor of 2 of optimal.
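The greedy bound can be checked by simulation (a sketch on a hypothetical dag with unit-time strands, not a proof): run complete and incomplete steps as defined above and count them.

```python
# Hypothetical strand dag: source strand 0 forks into 6 parallel
# strands, which all join at sink strand 7. Unit-time strands.
edges = {0: [1, 2, 3, 4, 5, 6],
         1: [7], 2: [7], 3: [7], 4: [7], 5: [7], 6: [7],
         7: []}

def greedy_time(g, P):
    """Simulate a greedy scheduler on P processors; return TP in steps."""
    indeg = {u: 0 for u in g}
    for u in g:
        for v in g[u]:
            indeg[v] += 1
    ready = [u for u in g if indeg[u] == 0]
    steps = 0
    while ready:
        # Complete step: run any P ready strands; incomplete: run them all.
        running, ready = ready[:P], ready[P:]
        steps += 1
        for u in running:
            for v in g[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

T1 = greedy_time(edges, 1)             # work: 8 strands in total
Tinf = greedy_time(edges, len(edges))  # span: critical path has 3 strands
TP = greedy_time(edges, 2)
print(T1, Tinf, TP)                    # 8 3 5
assert TP <= T1 / 2 + Tinf             # greedy bound: TP <= T1/P + T_inf
```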