Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines

Priya Chandran
National Institute of Technology Calicut
[email protected]

Deepak Ravi
Intel Corporation
[email protected]

Keywords: affinity

Abstract
SystemC is a system-level modeling language and simulation framework that facilitates the design and verification of processor designs at different levels of abstraction. SystemC has recently become a popular choice for designers of both System-on-Chip (SoC) and embedded processors, owing to its adaptability at the cycle as well as the transaction level, and its ability to model concurrent processes. However, the single-threaded simulation kernel inherent to SystemC prevents it from utilizing the potential computing power of symmetric multiprocessing (SMP) machines to speed up hardware simulation. We present a parallel SystemC simulation kernel, implemented using parallel programming techniques, which leverages the parallel execution capabilities of multi-core machines to speed up hardware simulation. We discuss the mechanism we use for mapping parallel SystemC modules onto different cores. Finally, we report the performance of the parallelized SystemC kernel on a linear pipelined performance model and on a pipelined performance model tailored to exhibit the behavior of a real-world simulation. Our results demonstrate that the performance improvement obtained by using parallelized SystemC for these models is significant, and that it grows with the design complexity of the simulated design and with the number of cores in the machine running the simulation.
1. Introduction

Simulation of processors at various stages of the design process is the technique popularly adopted for verification and validation of processor designs.

Task-level parallelism can be exploited by running several possible experiments in parallel on different machines, but the presence of task-level dependencies in processor simulation limits the scope of such techniques. Parallelizing the hardware simulation kernel, in contrast, increases the speed of individual simulation runs. In this paper, we present a scheme for parallelizing the simulation kernel, and report the performance benefits of the parallelized version over the conventional one on multi-core machines.
The rest of this paper is organized as follows. Section 2 presents an overview of existing attempts at parallelizing SystemC, along with a description of the OSCI (Open SystemC Initiative) SystemC-2.2.0 scheduler. Section 3 describes our parallel kernel, including an overview of the techniques we use for parallelizing the OSCI SystemC-2.2.0 scheduler; we also describe our strategy of setting core affinity for simulation modules and manually grouping SystemC modules, which resists speed-up degradation as the number of cores increases. Section 4 presents our experimental setup for the performance analysis of parallelized SystemC on a pipelined performance model in which the number of modules and the amount of computation inside each module can be varied; we also demonstrate the performance improvement for a parallelized benchmark SystemC simulation, tailored to exhibit the behavior of a real graphics hardware simulation.
simulation. We conclude with note on future work in this
Introduction
Simulation of processors at various stages in their design process is the technique popularly adopted for verification and validation of the design of the processors. Simula1087-4097/09 $25.00 2009 IEEE
DOI 10.1109/PADS.2009.25
Deepak Ravi
Intel Corporation
[email protected]
80
Authorized licensed use limited to: Technion Israel School of Technology. Downloaded on November 18, 2009 at 12:20 from IEEE Xplore. Restrictions apply.
area in Section 5.
2. Background
Chopard et al. [4] present a functional parallel kernel which runs multiple copies of the SystemC scheduler on a large number of inexpensive machines. The partitioning of the hardware design is manual and has to be fairly balanced; it requires the definition of a module hierarchy, and has the additional drawback of high communication overhead. In [5], a SystemC distribution library is introduced for the geographical distribution of an arbitrary number of SystemC simulations; it can also be adapted to multi-core architectures. However, it supports only distributed functional and approximately-timed TLM (Transaction Level Modeling) simulation, and cannot be used for cycle-level simulations. Heterogeneity is introduced into the SystemC language and simulation framework by implementing an SDF (Synchronous Data Flow) kernel extension in SystemC [9]; in [9], efficiency is gained through the concurrency of the SDF model. A fast SystemC engine is proposed in [10] by introducing a new scheduling strategy, which combines features of SystemC's dynamic scheduling technique with static scheduling, but it requires fixing a unique execution order for the SystemC processes. A method to map VHDL models to PDES (Parallel Discrete Event Simulation) is discussed in [8]. [7] introduces the architecture of a parallel Verilog simulator, but parallel HDL simulations are communication intensive [7]. The static partitioning algorithm for parallel VHDL simulation in [12] places tightly connected parts on different processors, which does not scale to a large number of cores because the inter-process communication delays are added to the simulation time. Low-level schemes in general do not scale well to complex designs and are not suitable for cycle-level exploration, whereas module-based parallel simulation objects can provide a comparatively better computation-to-communication ratio. In our work, we explore the benefits of applying parallel programming techniques to the OSCI SystemC-2.2.0 scheduler.
[Figure: Flowchart of the OSCI SystemC-2.2.0 scheduler. While runnable processes exist, they are executed; update requests are then processed. If delta notifications exist, they are processed and evaluation repeats. Otherwise, if timed notifications exist, simulation time is advanced and the timed notifications are processed; if not, the simulation ends.]
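The control flow in the figure can be summarized in code. The following sketch is our paraphrase of that loop, with hypothetical helper names standing in for the kernel's internal routines; it is not the actual OSCI source.

    // Hypothetical stand-ins for kernel internals (names are illustrative).
    bool runnable_processes_exist();
    void run_next_runnable_process();
    void process_update_requests();
    bool delta_notifications_exist();
    void process_delta_notifications();
    bool timed_notifications_exist();
    void advance_simulation_time();
    void process_timed_notifications();

    // Sketch of the scheduler loop implied by the flowchart above.
    void simulate()
    {
      while (true) {
        // Evaluation phase: run every runnable process.
        while (runnable_processes_exist())
          run_next_runnable_process();

        // Update phase: apply pending channel updates.
        process_update_requests();

        // Delta notifications re-trigger evaluation at the same time point.
        if (delta_notifications_exist()) {
          process_delta_notifications();
          continue;
        }

        // No deltas left: advance to the next timed event, or stop.
        if (!timed_notifications_exist())
          break;
        advance_simulation_time();
        process_timed_notifications();
      }
    }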
3.2.2 Synchronization Issues

Work Stealing

Work Sharing

3.2.3 Manual Grouping

The sub-modules and methods of a module are automatically placed in the same group as the module itself. If a particular method inside a module is explicitly added to a group, however, only that method is added to that group; a user can place another method of the same module into a different group if desired.
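The grouping API itself is not reproduced here, but the effect of binding a group to a core can be illustrated with standard Linux thread-affinity calls; the helper below is our own sketch, not the kernel's code.

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread (e.g., the thread executing one module group)
    // to a single core, so the group's working set stays in that core's cache.
    static void pin_current_thread_to_core(int core)
    {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }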
Authorized licensed use limited to: Technion Israel School of Technology. Downloaded on November 18, 2009 at 12:20 from IEEE Xplore. Restrictions apply.
Compiler

We used the gcc-4.3.1 compiler and included the parallel/algorithm, parallel/settings.h, and omp.h header files, with the following settings for the global parameters:

Work-sharing chunk size = 8
Work-stealing chunk size = 8
Minimum number of elements for a parallel operation = 16
Algorithm strategy = forced parallel
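As an illustration, these parameters can be expressed through the libstdc++ parallel-mode settings object. This is our sketch, assuming the _Settings get/set interface; the exact field names and the availability of omp_set_schedule vary across gcc versions.

    #include <parallel/algorithm>
    #include <parallel/settings.h>
    #include <omp.h>

    // Apply the global parallel-mode parameters listed above
    // (call once at startup, before elaboration).
    static void configure_parallel_mode()
    {
      __gnu_parallel::_Settings s = __gnu_parallel::_Settings::get();
      s.algorithm_strategy      = __gnu_parallel::force_parallel; // forced parallel
      s.workstealing_chunk_size = 8;   // work-stealing chunk size
      s.for_each_minimal_n      = 16;  // minimum elements for a parallel operation
      __gnu_parallel::_Settings::set(s);
      omp_set_schedule(omp_sched_dynamic, 8); // work-sharing chunk size (OpenMP 3.0)
    }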
4.1.2 Hardware
Computations

Computations are generated by executing sequences of asm volatile("nop") instructions, which are invariant under all compiler optimizations. We assume that the execution of one thousand such instructions takes on the order of one microsecond of CPU time; the actual computation time depends on the underlying hardware.
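A minimal sketch of such a computation generator (the function name is ours) is:

    // Burn roughly n_instructions CPU instructions. The volatile qualifier
    // prevents the compiler from optimizing the nops away, so ~1000 iterations
    // approximate one microsecond of CPU time under the assumption above.
    static inline void compute(unsigned long n_instructions)
    {
      for (unsigned long i = 0; i < n_instructions; ++i)
        asm volatile("nop");
    }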
4.3 Results
We first assess the effect of design complexity, represented by the number of modules and the frequency of computations in the performance model, on the speed-up. The speed-up reported is with reference to execution on a single-core machine. The parallelization strategy used is manual partitioning, on a 16-core machine.
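To make the model concrete, one stage of such a pipelined performance model can be written as follows; this sketch is ours, and the module and port names are illustrative.

    #include <systemc.h>

    // One stage of the linear pipeline: on every clock edge it performs a
    // configurable amount of computation and forwards its input token.
    SC_MODULE(Stage)
    {
      sc_in<bool> clk;
      sc_in<int>  in;
      sc_out<int> out;
      unsigned long work;             // nop instructions executed per cycle

      void step()
      {
        for (unsigned long i = 0; i < work; ++i)
          asm volatile("nop");        // the computation described in Sec. 4
        out.write(in.read());
      }

      SC_CTOR(Stage) : work(40000)    // ~40 us per cycle under our assumption
      {
        SC_METHOD(step);
        sensitive << clk.pos();
      }
    };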
Figure 4 illustrates the effect of varying the number of modules at the same computation frequency. We observe that the speed-up is almost linear in log(number of modules) until it approaches the number of cores. The figure also illustrates the lower bound on the computation frequency per module required for achieving a speed-up.

Figure 5 demonstrates the speed-up for a set of performance models obtained by varying the computations per module
for different numbers of modules, using the parallel SystemC kernel with manual partitioning on a 16-core SMP machine. The lower bound on the number of modules for achieving a speed-up greater than one can also be noted. Simulations having average computations per module of more than 10 microseconds exhibit significant speed-up.
Figure 6 demonstrates the speed-up for a performance model having 128 modules and 40 microseconds of computation per module on the parallel SystemC kernel, using different parallelization techniques on varying numbers of cores. We note that the parallel SystemC kernel implementation using manual grouping shows considerable speed-up enhancement over the other parallelization techniques when the number of cores exceeds eight. The performance of the work-stealing algorithm degrades with an increasing number of cores, due to the high multi-threading overhead. Except for manual grouping, all the schemes show a degradation with an increasing number of cores after an initial improvement. We can observe that the speed-up achieved by parallel simulation depends on the number of cores, the underlying hardware complexity, the average computations per module, the number of modules per core, and the overhead per thread.
Figure 7 demonstrates the CPU utilization for a performance model with 128 modules and 40 microseconds of computation per module using the parallel SystemC kernel with different parallelization techniques on varying numbers of cores. We note that the CPU utilization percentage decreases with an increase in the number of cores for all parallelization techniques, and that the difference in CPU utilization among the various parallelization techniques becomes prominent as the number of cores increases.
Figure 8 demonstrates the difference between the speed-up achieved and the increase in CPU utilization for a performance model with 128 modules and 40 microseconds of computation per module using the parallel SystemC kernel with manual partitioning on varying numbers of cores. The speed-up achieved is higher than the increase in CPU utilization.
When a problem with a data-set of size d, represented in our experiment by d modules, is solved by n CPUs, the data-set is evenly distributed among the n CPUs:

Size of the data-set per CPU for serial execution = d
Size of the data-set per CPU for parallel execution = d/n

Since every CPU has a separate L1 cache, the L2 cache is shared between two CPUs, and core affinity is set for the modules, L1 cache misses (per cache) ideally reduce by a factor of n and L2 cache misses by a factor of n/2. Hence, the time saved in servicing cache misses contributes to the increase in speed-up for parallel simulation by manual partitioning. The efficiency (in terms of performance per watt) of parallel SystemC simulation depends on the ratio between the speed-up achieved and the increase in CPU utilization.
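For instance, for the 128-module model above on 16 cores, this reasoning gives:

\[
d = 128,\; n = 16:\quad \frac{d}{n} = 8 \text{ modules per core},\quad
\text{L1 misses ideally reduce by } n = 16,\quad
\text{L2 misses by } \frac{n}{2} = 8.
\]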
References

[1] Approved IEEE Draft Standard SystemC Language Reference Manual (superseded by 1666-2005). IEEE Std P1666/D2.1.1, 2005.
[2] K. Andreev and H. Räcke. Balanced Graph Partitioning. Theory of Computing Systems, 39(6):929-939, November 2006.
[3] R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the Thirty-fifth Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, 1994.
[4] B. Chopard, P. Combes, and J. Zory. A Conservative Approach to SystemC Parallelization. In Proceedings of the Workshop on Scientific Computing in Electronics Engineering, pages 653-660, May 2006.
[5] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele. Scalably Distributed SystemC Simulation for Embedded Applications. In Proceedings of the International Symposium on Industrial Embedded Systems (SIES 2008), pages 271-274, June 2008.
[6] G. Karypis and V. Kumar. Parallel Multilevel k-way Partitioning Scheme for Irregular Graphs. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, page 35, 1996.
[7] T. Li, Y. Guo, and S.-K. Li. Design and Implementation of a Parallel Verilog Simulator: PVSim. In Proceedings of the Seventeenth International Conference on VLSI Design, pages 329-334, 2004.
[8] E. Naroska. Parallel VHDL Simulation. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE 98), pages 159-165, Washington, DC, USA, 1998. IEEE Computer Society.
[9] H. D. Patel and S. K. Shukla. Towards a Heterogeneous Simulation Kernel for System Level Models: A SystemC Kernel for Synchronous Data Flow Models. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pages 241-242, February 2004.
[10] D. G. Perez, G. Mouchard, and O. Temam. A New Optimized Implementation of the SystemC Engine using Acyclic Scheduling. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 552-557, February 2004.
[11] M. J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Science/Engineering, first edition, 2003.
[12] W. Yue, J. Ling, Y. Hong-Bin, and L. Zong-Tian. A New Partitioning Scheme of Parallel VHDL Simulation. In Proceedings of the Conference on High Density Microsystem Design and Packaging and Component Failure Analysis, pages 1-4, June 2005.