Notes 03
1 Overview
In the last lecture, we started introducing the various models of parallel computation. We first
covered the shared-memory model, where each processor is connected to a common global memory.
There are multiple forms of the shared-memory model, distinguished mainly by how they handle
the situation where two or more processors attempt to read from or write to the same memory
location at the same time. The next model we covered was the local-memory model, where each
processor has its own private memory and is connected to an interconnection network that allows
the processors to communicate with one another. We then looked at the modular-memory model,
where each processor is connected to an interconnection network, which is in turn connected to
$k$ memory modules; all processors can access each of the $k$ memory modules. Lastly, we looked
at mixed models, such as the parallel external memory (PEM) model and the GPU architecture.
We finished up the last lecture by examining the different types of interconnection networks, such
as Ethernet, ring, and 2D mesh.
In this lecture, we will first introduce the circuit model. Then, we will look at how the circuit
model is related to parallel models.
2 Circuit Model
The circuit model is composed of gates; each gate has a constant fan-in and executes a primitive
operation. Some examples of primitive operations of a circuit are addition, subtraction, multiplication,
and Boolean operations. Each gate works in parallel and starts its execution once all of its
inputs are ready. Hence, we can combine gates to compute more complex functions [BM04]. The
formal definition of a circuit is as follows:
Definition 1. A circuit for a particular problem is a family of directed acyclic graphs (DAGs), one
graph per input size, where each node is a primitive operation and each edge is a dependency
between operations [JaJ92].
Therefore, in the circuit model, the runtime is equal to the depth of the circuit, which is the length
of the longest directed path in the DAG, and the work is equal to the size of the circuit, which is
the number of nodes in the DAG [BM04]. Notice that there will be at most $kn$ edges in the DAG,
where $n$ is the number of nodes in the DAG and $k$ is the maximum fan-in of the gates. Thus, if
$k = O(1)$, the number of edges in the DAG is $\Theta(n)$.
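To make these definitions concrete, here is a minimal sketch in Python (the example circuit, the
node names, and the depth_and_size helper are illustrative assumptions, not part of the lecture;
inputs are modeled as zero-depth source nodes) that computes the work and runtime of a circuit
given as a DAG:

\begin{verbatim}
from functools import lru_cache

# Hypothetical circuit for (x + y) + (y * z), given as a DAG:
# each node maps to the nodes that depend on its output.
circuit = {
    "x": ["a"], "y": ["a", "b"], "z": ["b"],
    "a": ["out"],   # a = x + y
    "b": ["out"],   # b = y * z
    "out": [],      # out = a + b
}

def depth_and_size(dag):
    """Return (depth, size): the length of the longest directed path
    (the circuit's runtime) and the number of nodes (its work)."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest path, counted in edges, starting at `node`.
        succs = dag[node]
        return 1 + max(longest_from(s) for s in succs) if succs else 0
    return max(longest_from(v) for v in dag), len(dag)

print(depth_and_size(circuit))  # (2, 6): runtime 2, work 6
\end{verbatim}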
2.1 Relation to Parallel Models
Theorem 2 (Brent's Theorem). A circuit for a particular problem with time $T(n)$ and work $W(n)$
can be simulated on a CREW PRAM with $p$ processors in time
$$t(n) \le \frac{W(n)}{p} + T(n).$$
Proof. We simulate the circuit level by level. Let level $l$ be the set of gates $g_1, \dots, g_{n_l}$
whose inputs are all computed at earlier levels, where each $g_i$ is a gate, and let $n_l$ be the
number of gates at level $l$. It will take $\lceil n_l / p \rceil$ time to simulate level $l$ using a
$p$-processor CREW PRAM (note: we need concurrent reads in order to handle cases where two or
more gates all use a common input). By definition, we know that the number of levels in the circuit
is $T(n)$, hence we have:
$$t(n) = \sum_{l=1}^{T(n)} \left\lceil \frac{n_l}{p} \right\rceil \le \sum_{l=1}^{T(n)} \left( \frac{n_l}{p} + 1 \right) = \frac{1}{p} \sum_{l=1}^{T(n)} n_l + T(n).$$
By definition, we have that $\sum_{l=1}^{T(n)} n_l = W(n)$, therefore:
$$t(n) \le \frac{1}{p} \sum_{l=1}^{T(n)} n_l + T(n) = \frac{W(n)}{p} + T(n).$$
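The counting argument above is easy to check numerically. The following minimal sketch (the
level sizes and the simulate_time helper are assumptions made for illustration) computes the
simulated time $\sum_{l=1}^{T(n)} \lceil n_l/p \rceil$ for a levelized circuit and verifies it against
the bound $W(n)/p + T(n)$:

\begin{verbatim}
import math

def simulate_time(level_sizes, p):
    # Each level l with n_l gates takes ceil(n_l / p) parallel steps
    # on a p-processor CREW PRAM.
    return sum(math.ceil(n_l / p) for n_l in level_sizes)

# Hypothetical circuit with T(n) = 4 levels and W(n) = 14 gates.
levels = [8, 3, 2, 1]
W, T = sum(levels), len(levels)

for p in (1, 2, 4, 8):
    t = simulate_time(levels, p)
    assert t <= W / p + T
    print(f"p={p}: t(n)={t} <= W/p + T = {W / p + T:.2f}")
\end{verbatim}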
This simulation also gives a lower bound on circuit size: letting $T_A(n)$ denote the runtime of the
fastest sequential algorithm for the problem, no circuit for the problem can have size smaller than
$T_A(n)$.

Proof. We will prove by contradiction. Assume that there exists a circuit with size $W_c(n)$ such
that $W_c(n) < T_A(n)$. By Brent's Theorem, for $p = 1$ the simulation evaluates each gate exactly
once, so we have:
$$t(n) = \sum_{l=1}^{T(n)} \left\lceil \frac{n_l}{1} \right\rceil = W_c(n) < T_A(n).$$
This is a sequential algorithm that is faster than the fastest sequential algorithm, a contradiction.
Therefore, we can never have a circuit smaller than the fastest sequential algorithm's runtime. For
example, summing $n$ numbers takes $T_A(n) = \Theta(n)$ sequentially, so any circuit for it must
have $\Omega(n)$ gates.
We know that, given a circuit, we can create a CREW PRAM algorithm by converting each edge into
a memory read and each gate into an operation. Similarly, a $p$-processor CREW PRAM algorithm
with work $W(n)$ and time $T(n)$ can be converted into a circuit with $W(n)$ gates and depth $T(n)$
by converting each memory read into an incoming edge, each memory write into an outgoing edge,
and each operation into a gate.
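A minimal sketch of the PRAM-to-circuit direction follows (the program encoding and the
pram_to_circuit helper are assumptions made for illustration; it handles only straight-line
programs with no control flow):

\begin{verbatim}
# Hypothetical straight-line PRAM program: at each time step, every
# processor performs one operation (dest, op, operands). We convert it
# into a circuit DAG: one gate per operation, one edge per memory read.
program = [
    [("a", "+", ("x", "y")), ("b", "*", ("y", "z"))],  # step 1 (2 procs)
    [("out", "+", ("a", "b"))],                        # step 2
]

def pram_to_circuit(program, inputs):
    last_writer = {v: v for v in inputs}  # inputs act as source nodes
    gates, edges = [], []
    for step in program:
        for dest, op, operands in step:
            gate = f"{dest}:{op}"
            gates.append(gate)
            for v in operands:             # each memory read -> an edge
                edges.append((last_writer[v], gate))
        for dest, op, _ in step:           # writes visible at next step
            last_writer[dest] = f"{dest}:{op}"
    return gates, edges

gates, edges = pram_to_circuit(program, inputs={"x", "y", "z"})
print(len(gates), "gates (work); depth = number of steps =", len(program))
\end{verbatim}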
3 Conclusion
In conclusion, we now know that a $p$-processor CREW PRAM algorithm with work $W(n)$ and
time $T(n)$ can be converted into a CREW PRAM algorithm with $p' < p$ processors that has
work $W(n)$ and time $O\left(\frac{W(n)}{p'} + T(n)\right)$. In other words, if we design a CREW
PRAM algorithm with $p$ processors, then we can always scale the algorithm down to $p' < p$
processors. Therefore, when designing parallel algorithms, we want to design for the largest
possible $p$.
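As a concrete illustration (the particular work and time functions are assumed for this example,
not taken from the lecture): suppose we design an algorithm for $p = n$ processors with work
$W(n) = O(n)$ and time $T(n) = O(\log n)$. Scaling down to $p' = n/\log n$ processors still gives time
$$O\!\left(\frac{W(n)}{p'} + T(n)\right) = O\!\left(\frac{n}{n/\log n} + \log n\right) = O(\log n),$$
so the smaller machine matches the asymptotic runtime while doing the same work.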
References