04tutorials Solutions

The document discusses the computational density and flexibility of various system-on-chip (SoC) technologies, including processors, DSPs, ASIPs, FPGAs, ASICs, and full custom chips. It also explores multicore processing, detailing how independent tasks can be parallelized to improve performance and reduce processing time for image frames. The analysis includes calculations for performance gains, speedups, and power consumption comparisons between single-core and multi-core processors.

SoC Paradigm: Solutions System-on-Chip Technologies

Tutorials

SoC Paradigm Solutions


Exercise 1: Computational Density
a) One possible definition of flexibility could be derived from an empirical metric for the
number of different operations available on a general-purpose computing device.

#  Element      Processing Element         # PEs        Amount of Variation
1  Processor    ALU                        very few     very high
2  DSP          Multiplier                 very few     high
3  ASIP         Special Execution Unit(s)  few          some
4  FPGA         LUT                        thousands    some
5  ASIC         Standard Cell              millions     very low
6  Full Custom  Transistor                 10 millions  almost none

An alternative definition of flexibility could be the amount of control information versus the
amount of data flowing into the device. In a processor, for every piece of data that is
processed, one or more instructions (control information) are needed. In a full custom chip,
no instructions are needed, since all data processing is hard wired. This definition will lead to
the same order of flexibility as above.
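b) In a processor, the actual logic computation is carried out in the arithmetic/logic
unit (ALU). This logic execution part covers only a fraction of the total chip area; the
remainder of the chip provides the flexibility.
Since the processor can ideally perform two logical operations on 32-bit words every
clock cycle, its raw computational density (CD), with the chip area expressed in squares
(sq) of the feature size, becomes

$$ CD = \frac{2 \times 32\,\text{op/Hz} \times 466\,\text{MHz}}{9\,\text{mm}^2 / (0.11\,\mu\text{m})^2} \approx 40\,\frac{\text{op}}{\text{s}\cdot\text{sq}} $$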

c) Cache misses (just 1% is enough!) and subsequent DRAM latencies cause the low
value. The cache miss rate depends on the application. The effective CD can be
calculated as

$$ CD = \frac{0.35 \times 32\,\text{op/Hz} \times 466\,\text{MHz}}{9\,\text{mm}^2 / (0.11\,\mu\text{m})^2} \approx 7\,\frac{\text{op}}{\text{s}\cdot\text{sq}} $$
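The two processor figures above can be checked with a short calculation. This is a minimal sketch, not part of the original solution: the function name `computational_density` and its parameter names are made up for illustration, and the numbers are the ones used in parts b) and c).

```python
def computational_density(ops_per_cycle, clock_hz, area_mm2, feature_um):
    """Return operations per second per feature-size square, in op / (s * sq)."""
    area_um2 = area_mm2 * 1e6              # 1 mm^2 = 10^6 um^2
    squares = area_um2 / feature_um ** 2   # chip area counted in feature-size squares
    return ops_per_cycle * clock_hz / squares

# Raw case from part b): two 32-bit logical operations per clock cycle.
raw_cd = computational_density(2 * 32, 466e6, 9.0, 0.11)

# Effective case from part c): the factor 2 is replaced by 0.35.
effective_cd = computational_density(0.35 * 32, 466e6, 9.0, 0.11)

print(f"raw CD       = {raw_cd:.1f} op/(s*sq)")
print(f"effective CD = {effective_cd:.1f} op/(s*sq)")
```

The ratio of the two results is simply 2 / 0.35, since clock frequency, area and feature size are unchanged between the raw and the effective case.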

d) ASIPs: Computation can be done as for processors, since they are per se processors
with specialized instruction sets and execution units. Raw computational density is:
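$$ CD = \frac{32\,\text{op/Hz} \times 311\,\text{MHz} \times (0.11\,\mu\text{m})^2}{0.45\,\text{mm}^2} \approx 270\,\frac{\text{op}}{\text{s}\cdot\text{sq}} $$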
e) FPGAs: Each LUT delivers (at most) one logic operation, thus, for the FPGA we get:
$$ CD = \frac{3333\,\text{op/Hz} \times 200\,\text{MHz}}{7\,\text{mm}^2 / (0.07\,\mu\text{m})^2} \approx 467\,\frac{\text{op}}{\text{s}\cdot\text{sq}} $$
f) Standard cell ASICs: Every cell/flip-flop combination is assumed to deliver (again, at
most) one logic operation.
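$$ CD = \frac{1\,\text{M op}}{\dfrac{90 \times 10^{6}\,\mu\text{m}^2}{(0.07\,\mu\text{m})^2} \times 0.43 \times 10^{-3}\,\text{s}} \approx 0.13\,\frac{\text{op}}{\text{s}\cdot\text{sq}} $$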
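The densities of parts d) to f) can be recomputed side by side with the same kind of calculation. The sketch below is illustrative only: the helper name `computational_density` and the dictionary layout are assumptions, and the ASIC area is read here as 90 mm², which is the value that reproduces the stated result of about 0.13.

```python
import math

def computational_density(ops_per_second, area_mm2, feature_um):
    """Return op / (s * sq), with the chip area counted in feature-size squares."""
    squares = area_mm2 * 1e6 / feature_um ** 2   # 1 mm^2 = 10^6 um^2
    return ops_per_second / squares

densities = {
    # d) ASIP: 32 op per cycle at 311 MHz on 0.45 mm^2 in 0.11 um technology
    "ASIP": computational_density(32 * 311e6, 0.45, 0.11),
    # e) FPGA: 3333 LUT operations per cycle at 200 MHz on 7 mm^2 in 0.07 um
    "FPGA": computational_density(3333 * 200e6, 7.0, 0.07),
    # f) ASIC: 1 M op in 0.43 ms; the 90 mm^2 area is an assumed reading
    "ASIC": computational_density(1e6 / 0.43e-3, 90.0, 0.07),
}

for name, cd in densities.items():
    # log10 is printed because the flexibility plot uses logarithmic axes
    print(f"{name:5s} CD = {cd:8.2f} op/(s*sq)   log10(CD) = {math.log10(cd):6.2f}")
```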

[Figure: Log FLEXIBILITY plotted against Log COMPUTATIONAL DENSITY = performance / area, showing CPU, DSP, ASIP, FPGA, ASIC and Custom IC. Annotations: "Dream chip" (research: SoC) and "research: reconfigurable logic".]


Exercise 2: Multicore
a) Since the processing is done on blocks of 32x32 pixels, we need 16 blocks in order to
process one 128x128 frame. At each loop iteration a block must be read from the
main memory, processed and written back to the main memory. In total the complete
frame requires:
$$ T = 1000 + (100 + 400 + 700 + 100) \times 16 + 1000 = 22800\ \text{cycles} $$
At a clock frequency of 1 GHz, processing one frame takes 22.8 µs. This is much
longer than the frame inter-arrival time of 12.5 µs. To avoid dropping frames, the
processor performance must be improved. One solution is to increase the operating
frequency to f = 22800 cycles / 12.5 µs = 1.824 GHz. Alternatively, we can employ
a multi-core processor.
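The arithmetic of part a) can be reproduced as follows. This is a small sketch under the numbers given above; the constant names are chosen for illustration, and the four per-block cycle counts are taken as they appear in the formula without assigning them to specific steps.

```python
FRAME_OVERHEAD_CYCLES = 1000          # appears once before and once after the block loop
BLOCK_STAGE_CYCLES = (100, 400, 700, 100)
BLOCKS_PER_FRAME = (128 // 32) ** 2   # 128x128 frame split into 32x32 blocks -> 16
CLOCK_HZ = 1e9                        # 1 GHz
FRAME_INTERVAL_S = 12.5e-6            # frame inter-arrival time

# Total cycles for one frame on a single core.
cycles = (FRAME_OVERHEAD_CYCLES
          + sum(BLOCK_STAGE_CYCLES) * BLOCKS_PER_FRAME
          + FRAME_OVERHEAD_CYCLES)

frame_time_us = cycles / CLOCK_HZ * 1e6    # processing time per frame at 1 GHz
required_hz = cycles / FRAME_INTERVAL_S    # clock needed to finish within one interval

print(f"{cycles} cycles, {frame_time_us:.1f} us per frame at 1 GHz")
print(f"required clock: {required_hz / 1e9:.3f} GHz")
```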
b) Since the 16 blocks within one frame are independent, they can be processed in
parallel. The resulting task graph is depicted below:

[Figure: task graph. A sequential 1000-cycle stage is followed by N parallel branches, one per core, each running 16 / N iterations of the block pipeline (100 cycles, 400 cycles, 700 cycles, 100 cycles), and then by a final sequential 1000-cycle stage.]

The computation is now distributed over N cores; therefore, each core has to process
only 16/N blocks.
c) In an N-core processor, the processing of a frame will now take:


$$ T = 1000 + (100 + 400 + 700 + 100) \times \frac{16}{N} + 1000 = 2000 + \frac{20800}{N} $$

Performance gain can be calculated as:

$$ G(N) = \frac{22800}{2000 + \dfrac{20800}{N}} $$

G(2) = 1.84, G(4) = 3.17
G(8) = 4.96, G(16) = 6.91
No, because the application cannot be completely parallelized; receiving and sending
of frames to I/O is performed sequentially. The total speedup will be even lower if we
consider the limited bandwidth of the shared bus.
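The gain values can be tabulated with a few lines of code. This sketch only evaluates the formula for G(N) given in part c); the function and constant names are illustrative, and the upper bound shown is the limit of that formula for large N, which ignores the fact that a frame offers only 16 independent blocks as well as any bus contention.

```python
SEQUENTIAL_CYCLES = 2000    # the two 1000-cycle parts that are not parallelized
PARALLEL_CYCLES = 20800     # 16 blocks x 1300 cycles, shared among the cores

def frame_cycles(n_cores):
    """Cycles per frame when the block loop is split evenly over n_cores."""
    return SEQUENTIAL_CYCLES + PARALLEL_CYCLES / n_cores

def gain(n_cores):
    """Performance gain relative to the single-core case."""
    return frame_cycles(1) / frame_cycles(n_cores)

gains = {n: gain(n) for n in (2, 4, 8, 16)}
upper_bound = frame_cycles(1) / SEQUENTIAL_CYCLES   # limit of G(N) as N grows

for n, g in gains.items():
    print(f"G({n}) = {g:.2f}")
print(f"limit for large N: {upper_bound:.1f}")
```

Because the sequential 2000 cycles stay constant, each doubling of N yields a smaller additional gain, which is the effect described in the answer above.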
d) Considering the formula above, a quad-core processor will require:
$$ T = 2000 + \frac{20800}{4} = 7200\ \text{cycles} $$

Hence, speedup = 22800 cycles / 7200 cycles = 3.17
New clock frequency = 1.824 GHz / 3.17 = 575 MHz
The total savings of the capacitive dynamic power consumption will be:
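$$ P_{\text{MULTI CORE}} \propto \frac{f}{k} \cdot \left(\frac{V}{k}\right)^{2} \cdot n = \frac{n}{k^{3}} \cdot f V^{2} \quad\Rightarrow\quad P_{\text{MULTI CORE}} = P_{\text{SINGLE CORE}} \cdot \frac{n}{k^{3}} $$

$$ P_{\text{QUAD CORE}} = P_{\text{SINGLE CORE}} \cdot \frac{4}{(3.17)^{3}} \approx 0.13 \cdot P_{\text{SINGLE CORE}} $$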
Thus, the capacitive dynamic power consumption of the quad-core processor is 7.9
times lower compared to the single-core processor.
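The quad-core numbers can be verified numerically. The sketch below assumes, as the formula above does, that both clock frequency and supply voltage are scaled down by the same factor k; the function name `multicore_power_ratio` is made up for illustration. It uses the unrounded speedup, so the clock comes out at 576 MHz rather than the 575 MHz obtained from the rounded value 3.17.

```python
def multicore_power_ratio(n_cores, k):
    """P_multi / P_single for n cores with f and V both scaled down by k."""
    return n_cores / k ** 3

SINGLE_CORE_CYCLES = 22800
QUAD_CORE_CYCLES = 2000 + 20800 / 4              # 7200 cycles per frame on 4 cores

k = SINGLE_CORE_CYCLES / QUAD_CORE_CYCLES        # speedup, about 3.17
new_clock_hz = 1.824e9 / k                       # clock that still meets 12.5 us per frame
power_ratio = multicore_power_ratio(4, k)        # quad-core power relative to single core

print(f"speedup k        = {k:.2f}")
print(f"new clock        = {new_clock_hz / 1e6:.0f} MHz")
print(f"power ratio      = {power_ratio:.3f}")
print(f"reduction factor = {1 / power_ratio:.1f}")
```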
