04tutorials Solutions
Tutorials
#   Element     Processing Element   # PEs      Amount of Variation
1   Processor   ALU                  very few   very high
An alternative definition of flexibility is the ratio of control information to
data flowing into the device. In a processor, every piece of data that is
processed requires one or more instructions (control information). In a full custom chip,
no instructions are needed, since all data processing is hardwired. This definition yields
the same flexibility ordering as above.
b) In a processor, the actual logic computation is carried out in the arithmetic/logic
unit (ALU). This execution logic covers only a fraction of the total chip area; the
remainder of the chip provides the flexibility.
Since this processor can ideally perform two logic operations on 32-bit words every clock
cycle, its raw computational density (CD) becomes
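$$\mathrm{CD} = \frac{2 \cdot 32\,\mathrm{op/Hz} \cdot 466\,\mathrm{MHz}}{9\,\mathrm{mm}^2 / (0.11\,\mu\mathrm{m})^2} \approx 40\,\frac{\mathrm{op}}{\mathrm{s} \cdot \mathrm{sq}}.$$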
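The calculation in b) can be checked numerically. The following is a minimal sketch, assuming that one "sq" is a square with the feature size as its side length, which is how the division by (0.11 µm)² reads. The function name `raw_cd` and its parameter names are hypothetical and chosen only for this illustration.

```python
def raw_cd(ops_per_cycle, clock_hz, area_mm2, feature_um):
    """Raw computational density in op/(s*sq).

    Assumption: one 'sq' is a square whose side is the feature size,
    so the chip area is first converted into a count of such squares.
    """
    area_um2 = area_mm2 * 1e6            # 1 mm^2 = 1e6 um^2
    squares = area_um2 / feature_um**2   # area normalized to the feature size
    return ops_per_cycle * clock_hz / squares


# Processor from b): two 32-bit operations per cycle, 466 MHz, 9 mm^2, 0.11 um
cd_processor = raw_cd(2 * 32, 466e6, 9.0, 0.11)
print(f"{cd_processor:.1f} op/(s*sq)")   # roughly 40
```

Normalizing the area to feature-size squares is what makes chips from different technology generations comparable.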
c) Cache misses (just 1% is enough!) and subsequent DRAM latencies cause the low
value. The cache miss rate depends on the application. The effective CD can be
calculated as
d) ASIPs: Computation can be done as for processors, since they are per se processors
with specialized instruction sets and execution units. Raw computational density is:
SoC Paradigm: Solutions System-on-Chip Technologies
Tutorials
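$$\mathrm{CD} = \frac{32 \cdot 311\,\mathrm{MHz} \cdot (0.11\,\mu\mathrm{m})^2}{0.45\,\mathrm{mm}^2} \approx 270\,\frac{\mathrm{op}}{\mathrm{s} \cdot \mathrm{sq}}$$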
e) FPGAs: Each LUT delivers (at most) one logic operation, thus, for the FPGA we get:
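$$\mathrm{CD} = \frac{3333\,\mathrm{op/Hz} \cdot 200\,\mathrm{MHz}}{7\,\mathrm{mm}^2 / (0.07\,\mu\mathrm{m})^2} \approx 467\,\frac{\mathrm{op}}{\mathrm{s} \cdot \mathrm{sq}}$$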
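The ASIP and FPGA figures from d) and e) can be reproduced with the same arithmetic. This is a sketch under the assumption that one "sq" is a feature-size square; the helper `raw_cd` and the dictionary layout are hypothetical, and the input numbers are taken from the two solutions above.

```python
def raw_cd(ops_per_cycle, clock_hz, area_mm2, feature_um):
    """Raw computational density in op/(s*sq), area counted in feature-size squares."""
    squares = area_mm2 * 1e6 / feature_um**2   # mm^2 -> um^2 -> squares
    return ops_per_cycle * clock_hz / squares


# (operations per cycle, clock in Hz, area in mm^2, feature size in um)
devices = {
    "ASIP": (32, 311e6, 0.45, 0.11),   # one 32-bit operation per cycle
    "FPGA": (3333, 200e6, 7.0, 0.07),  # at most one operation per LUT
}

cd = {name: raw_cd(*params) for name, params in devices.items()}
for name, value in cd.items():
    print(f"{name}: {value:.0f} op/(s*sq)")
```

The ASIP value comes out slightly below 270, which matches the rounded figure in d).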
f) Standard cell ASICs: Every cell/flip-flop combination is assumed to deliver (again, at
most) one logic operation.
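$$\mathrm{CD} = \frac{1}{\dfrac{90 \cdot 10^{-6}\,\mathrm{mm}^2}{(0.07\,\mu\mathrm{m})^2} \cdot 0.43 \cdot 10^{-3}\,\mu\mathrm{s}} \approx 0.13\,\frac{\mathrm{M\,op}}{\mathrm{s} \cdot \mathrm{sq}}$$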
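For the standard cell case, the density follows from the area and delay of a single cell rather than from a whole chip. The sketch below assumes the figures in f) read as a cell area of 90·10⁻⁶ mm² and a cell delay of 0.43·10⁻³ µs (0.43 ns); both readings, as well as the function name `cell_cd`, are assumptions made for this illustration.

```python
def cell_cd(cell_area_mm2, feature_um, cell_delay_s, ops_per_cell=1):
    """Computational density in op/(s*sq) of one cell/flip-flop pair.

    Assumption: the cell delivers `ops_per_cell` operations every
    `cell_delay_s` seconds and occupies `cell_area_mm2` of silicon.
    """
    squares = cell_area_mm2 * 1e6 / feature_um**2   # mm^2 -> um^2 -> squares
    return ops_per_cell / (squares * cell_delay_s)


# Assumed reading of f): 90e-6 mm^2 per cell, 0.07 um feature size, 0.43 ns delay
cd_asic = cell_cd(90e-6, 0.07, 0.43e-9)
print(f"{cd_asic / 1e6:.2f} M op/(s*sq)")
```

With these inputs the result is on the order of 10⁵ op/(s·sq), several hundred times the FPGA value from e).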
[Figure: design styles arranged along a flexibility axis, with a second, logarithmic axis: CPU/DSP, ASIP, FPGA, ASIC, custom IC. Annotations: "Dream Chip", "research: SoC", "research: reconfigurable logic".]
Exercise 2: Multicore
a) Since the processing is done on blocks of 32x32 pixels, we need 16 blocks in order to
process one 128x128 frame. At each loop iteration a block must be read from the
main memory, processed and written back to the main memory. In total the complete
frame requires:
$$T = 1000 + (100 + 400 + 700 + 100) \cdot 16 + 1000 = 22800\ \text{cycles}$$
At a clock frequency of 1 GHz, 22.8 µs are needed to process one frame. This
value is well above the frame inter-arrival time of 12.5 µs. To avoid dropping
frames, the processor performance must be improved. One solution is to increase the
operating frequency to f = 22800 cycles / 12.5 µs = 1.824 GHz. Alternatively, we can employ
a multi-core processor.
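The numbers in a) can be reproduced with a few lines of arithmetic. This is a minimal sketch; the constant names are hypothetical, and the 1000-cycle stages before and after the loop are simply called prologue and epilogue here, since the solution does not name them.

```python
PROLOGUE_CYCLES = 1000                 # runs once before the block loop
EPILOGUE_CYCLES = 1000                 # runs once after the block loop
BLOCK_STEP_CYCLES = (100, 400, 700, 100)  # per-block steps of one loop iteration

FRAME_SIDE, BLOCK_SIDE = 128, 32
blocks_per_frame = (FRAME_SIDE // BLOCK_SIDE) ** 2          # 16 blocks

cycles_per_block = sum(BLOCK_STEP_CYCLES)                   # 1300 cycles
total_cycles = (PROLOGUE_CYCLES
                + cycles_per_block * blocks_per_frame
                + EPILOGUE_CYCLES)

clock_hz = 1e9                                              # 1 GHz
frame_time_us = total_cycles / clock_hz * 1e6

inter_arrival_s = 12.5e-6                                   # one frame every 12.5 us
required_clock_ghz = total_cycles / inter_arrival_s / 1e9

print(total_cycles, "cycles per frame")
print(f"{frame_time_us:.1f} us per frame at 1 GHz")
print(f"{required_clock_ghz:.3f} GHz needed for a single core")
```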
b) Since the 16 blocks within one frame are independent, they can be processed in
parallel. The resulting task graph is depicted below:
[Task graph: 1000 cycles, then N parallel branches (one per core), each running 16/N iterations of the 100-, 400-, 700- and 100-cycle steps, then 1000 cycles.]
The computation is now distributed over N cores; therefore, each core has to process
only 16/N blocks.
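The effect of distributing the blocks can be estimated with a simple model. The sketch below assumes that the two 1000-cycle stages stay sequential, that the cores do not slow each other down when accessing main memory, and that a core with a partial share still runs a whole iteration; these are simplifying assumptions for illustration, and the function name `frame_cycles` is hypothetical.

```python
import math

CYCLES_PER_BLOCK = 100 + 400 + 700 + 100   # one loop iteration
BLOCKS_PER_FRAME = 16
SERIAL_CYCLES = 1000 + 1000                # stages before and after the loop


def frame_cycles(n_cores):
    """Cycles per frame when the 16 blocks are spread over n_cores cores.

    Assumes ideal scaling of the loop: no memory contention, no
    synchronization overhead.
    """
    iterations_per_core = math.ceil(BLOCKS_PER_FRAME / n_cores)
    return SERIAL_CYCLES + CYCLES_PER_BLOCK * iterations_per_core


BUDGET_CYCLES = 12_500   # 12.5 us inter-arrival time at 1 GHz

# Smallest core count whose frame time fits into the budget
min_cores = next(n for n in range(1, BLOCKS_PER_FRAME + 1)
                 if frame_cycles(n) <= BUDGET_CYCLES)

for n in (1, 2, 4, 8, 16):
    print(n, "cores:", frame_cycles(n), "cycles")
print("minimum cores under these assumptions:", min_cores)
```

Because the 2000 serial cycles do not shrink with N, the gain from adding cores flattens out; any contention on the main memory would reduce it further.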