Modeling and Simulation of System Bus and Memory Collisions in Heterogeneous SoCs
ABSTRACT A system simulator is proposed and developed, which can help to optimize design parameters
and hence minimize the number of collisions. In order to search the optimal design parameter combi-
nation which meets the user requirement, the proposed simulator has some knobs: partitioning between
software and hardware, scheduling the operations in the system, and memory merging, all of which can be
adjusted to predict collisions and search the optimal architecture. Also, design parameters can be adjusted
sequentially to cover all design options and estimate the predicted performance for each option. The
proposed system simulator is evaluated with an example signal processing algorithm, the orthogonal matching pursuit (OMP) algorithm. The performances of four cases of the OMP algorithm are predicted by the proposed
simulator and in turn are compared with the actual performances on ZedBoard. The proposed simulator
can predict the performance of heterogeneous systems on chips with under 5% error for all the candidate
architectures for OMP while taking the system bus and memory conflicts into account. Moreover, the
optimized heterogeneous SoC architecture for the OMP algorithm improves performance by up to 32%
compared with the conventional CAG-based approach. The proposed performance estimation algorithm is also verified to be generally applicable to estimating the performance of any heterogeneous SoC architecture. For example, the estimation error is measured to be no more than 5.9% for the convolutional
layers of CNNs and no more than 5.6% for the LDPC-coded MIMO-OFDM. In addition, the optimized
heterogeneous SoC architecture improves performance by up to 48% for the third convolutional layer of
AlexNet and 56% for the LDPC-coded MIMO-OFDM. Lastly, compared with the conventional simulation-
based approaches, the proposed estimation algorithm provides a speedup of one to two orders of magnitude.
The source code is available on the GitHub repository: https://github.com/SDL-KU/HetSoCopt.
INDEX TERMS Collision, conflict, design space exploration, hardware accelerator, heterogeneous SoC,
modeling, performance prediction and estimation, system simulator.
bus and the paths of data become more complex. In turn, leveraging hardware accelerators to enhance the system performance may make the system bus more complicated and increase data collisions in memory and the system bus. Consequently, the actual performance of the overall system can be considerably lower than the predicted performance if memory and system bus collisions are not properly taken into account. Thus, predicting the collisions in the system bus and memory and obtaining the system performance at early stages are important. To this end, a system simulator is needed, which can help to optimize design parameters and hence minimize the number of collisions. To reduce design turnaround time and also explore system-level trade-offs of SoCs, design space exploration and performance prediction by using system simulators early in the design flow are indispensable for successful SoC prototyping.

The Zynq platform [6] is widely used to verify many heterogeneous SoC simulators [7]–[11]. The Zynq platform has a dual-core ARM processor tightly coupled with a field-programmable gate array (FPGA). The FPGA can be used as a reconfigurable accelerator to efficiently implement some hardware functionality. The ARM side of the SoC is known as the processing system (PS), whereas the FPGA side is termed programmable logic (PL). When an application is ported to the Zynq SoC, it can run as a pure software implementation on the PS. To gain performance, compute-bound parts of an application can be mapped as hardware accelerators on the PL. For example, [7] predicted the performance of a heterogeneous SoC consisting of a DMA-controlled hardware accelerator, a single processor core, a DRAM subsystem, and on-chip buses using a SystemC TLM-based simulator. The results predicted using the SystemC TLM simulator were compared with those obtained using the Zynq platform. [8] introduced gem5-aladdin, which integrates the gem5 system simulator with the Aladdin accelerator simulator to enable the simulation of SoCs with complex accelerator system interactions. The gem5-aladdin simulator was validated against the Xilinx Zynq platform and achieved less than 6% error. Specifically, the gem5-aladdin simulator measured and utilized the execution time information of the ZedBoard [12] for the CPU (i.e., Cortex-A9) to model the cache line latency. In [9], the requirements for memory, computation, and flexibility of the system were summarized for mapping a CNN on embedded FPGAs. Based on these requirements, they proposed Angel-Eye, a programmable and flexible CNN accelerator architecture simulator, along with a data quantization strategy and compilation tool. The design strategy obtained using the Angel-Eye simulator [9] was implemented on the Zynq platform, and the actual performance gains were evaluated. Next, [10] proposed a performance estimation algorithm to optimize the communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. Using the communication bandwidths of CSs obtained from prior full-system simulations based on the Zynq platform, the proposed performance estimation algorithm can predict the communication performance of CSs more accurately than conventional performance estimation algorithms. In [11], a revised roofline model was proposed to estimate the performance of a DMA-controlled accelerator, considering the impact of DRAM latency. The roofline model proposed in [11] was verified using the Zynq platform.

In this work, a system simulator which can predict the performance of heterogeneous SoCs and model dynamic effects such as collisions is proposed. Static analysis-based techniques typically provide sufficient efficiency in terms of time cost, but their accuracy is low. On the other hand, simulation-based techniques generally provide satisfactory accuracy at the expense of lower efficiency than static analysis-based techniques. The proposed simulator can help optimize design parameters and accordingly minimize the number of conflicts or collisions while exploring a large design space. Both speed and accuracy are required for rapid design space exploration and for the consideration of dynamic effects such as collisions. To search for the optimal design parameter combination which meets the user requirement, the proposed simulator has some knobs (based on the information extracted from the target system): partitioning between software and hardware (or mapping operations into software or hardware), scheduling the operations (or changing the order of operations) in the system, and memory merging (separate memories or a shared memory), all of which can be adjusted to predict collisions. Also, design parameters can be adjusted sequentially to cover all design options and estimate the predicted performance for each option. The simulator proposed in this study utilizes the time information extracted for each component, such as a direct memory access controller (DMAC), processor core, and on-chip bus, through emulation based on the Zynq platform [6]. When the target application is implemented on the Altera/Intel SoC platform [13], if the time information of the Altera/Intel SoC platform is pre-measured and applied to the proposed simulator, the performance of the target application implemented on the Altera/Intel SoC platform can be easily predicted.
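To make the knob set concrete, the following minimal sketch (ours, not the released implementation; all names are illustrative assumptions) shows how one design point in the explored space can be represented:

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative only; the released simulator (https://github.com/SDL-KU/HetSoCopt)
# may organize these knobs differently.
@dataclass(frozen=True)
class DesignPoint:
    partitioning: Tuple[bool, ...]  # per step: True = hardware accelerator, False = processor
    schedule: Tuple[int, ...]       # issue order of the operation steps
    memory_merging: int             # 0 = one shared memory, 1 = two memories, 2 = separate memories

# Example: five steps with S1 and S3 in hardware, issued in the order
# S1, S3, S5, S2, S4, and a separate memory per data item.
point = DesignPoint((True, False, True, False, False), (0, 2, 4, 1, 3), 2)
```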
The proposed simulator in this study is used to seek the optimal condition which meets the system requirement while using a minimal number of hardware accelerators. Pipelining is assumed for the implementation of a given algorithm, which is subdivided into multiple steps of operations. The SoC performance is predicted while the allocation of operations to hardware accelerators and processors is changed and the order of operations is adjusted, whereby the types of collisions will vary. Namely, both partitioning and scheduling are considered in the proposed simulator to recommend design parameter combinations that achieve the optimal system performance. The execution time of each operation step varies as the partitioning varies, and the type of data collision varies (and hence the execution time) as the scheduling varies.

The proposed system simulator is evaluated with an example compressive sensing algorithm, orthogonal matching pursuit (OMP), which is subdivided into 5 steps. The OMP algorithm estimates the channel in the LTE wireless standard with a 5MHz channel bandwidth, and multiple hardware accelerators are employed to implement the algorithm. The data type, the size of data, the order of data, the access type (read or write), and timing parameters are the inputs to the simulator. The performances of 4 cases of the OMP algorithm are predicted by the simulator, and those are compared with the measured performance on ZedBoard with the Xilinx Zynq 7020 SoC chip [12]. The execution time of the optimal OMP architecture, obtained from the proposed simulator, is 10295 cycles, whereas the execution time from the actual implementation on ZedBoard with the Zynq SoC chip is 10788 cycles, leading to a 4.8% error. The proposed system simulator can predict the performance of heterogeneous SoCs with under 5% error while taking the system bus and memory collisions into account. Moreover, the proposed performance estimation algorithm is verified to be generally applicable to estimating the performance of any heterogeneous SoC architecture. For example, the estimation error is measured to be no more than 5.9% for the convolutional layers of CNNs and no more than 5.6% for the LDPC-coded MIMO-OFDM. Moreover, the optimized heterogeneous SoC architecture improves performance by up to 48% for the third convolutional layer of AlexNet and 56% for the LDPC-coded MIMO-OFDM.

We summarize the novelty of this study as follows:
• We propose a novel system simulator to optimize a heterogeneous SoC, which is defined by the number of hardware accelerators and processors. The novel performance estimation algorithm of the simulator proposed herein evaluates the performance of a heterogeneous SoC architecture based on the timing information of accelerators and processors, measured on commercial SoC platforms, such as the Zynq 7020 SoC chip on ZedBoard [12], and on AccTLMSim [7].
• The performance estimation algorithm of the proposed simulator can search the design space by reflecting all combinations of memory merging, hardware-software partitioning, and scheduling to find the optimal heterogeneous SoC architecture. In addition, the proposed performance estimation algorithm primarily considers the performance impact of both the memory latency and the bus protocol overhead on a hardware accelerator.

The paper is organized as follows. Section II reviews some of the related works and summarizes our contributions. System modeling preliminaries are accounted for in Section III, where the system, the bus, and data transfer types are addressed, followed by transfer delay types, which are explained extensively. Section IV deals with the process of the performance estimation method in terms of modeling, using an example of a signal processing algorithm. Simulation methods, results, and analysis are elaborated on in Section V with case studies. Finally, the conclusion and future work are given in Section VI.
II. RELATED WORK
A multitude of studies have addressed design space exploration and performance estimation of heterogeneous SoC architectures [14]–[18]. Scheduling and timing analysis of on-chip communication, factoring in software parts such as interrupt service routines and device drivers, were presented in [14], [15], wherein buffer resources, bus contention, and bus sharing were also taken into consideration to enhance accuracy. In [16], a static performance estimation method was used to first reduce a large design space, and then the reduced space was explored using a trace-driven memory simulator to select the optimal on-chip communication architecture. Because memory accesses are involved in bus contention, memory allocation was also considered in [16]. In today's complex heterogeneous embedded SoC designs, it is imperative to model and simulate the system of interest at a high level. A methodology and framework to perform this with analytical modeling and multi-objective optimization were presented in [17], wherein candidate architectures were explored at different levels of abstraction. Dataflow graphs in [17] represented the transformation of coarse-grained events in an application into fine-grained events in the architecture. A system-level modeling and simulation environment was specified in [18], wherein pruning of the design space and calibration of timing parameters by coupling with low-level simulators or prototype implementations were discussed.

In our work on heterogeneous SoC modeling and simulation, we assumed a bus compliant with the AMBA AXI bus protocol [19]. In [10] and [20], a SystemC TLM-based simulator was used to evaluate and predict the performance of a hardware accelerator implemented with an AXI bus-based SoC architecture based on memory access modeling, including the memory access type, burst type, and access latency. In [21], the design space for the AMBA hierarchical shared bus was explored, and the execution time of each bus architecture was estimated by pipelining and accounting for burst transactions. The execution order and amount of transferred data were also considered during design space exploration. However, dynamic effects such as scheduling and hardware-software partitioning at the memory and system bus were not taken into account in [10], [20], [21].

A performance estimation method based on a tree model was presented in [22] to determine the relationship between SoC parameters and performance, as well as to improve the prediction accuracy during design space exploration, using parameter ranking to guide the design. A high-level analytical tool to evaluate the communication and computation overheads of heterogeneous systems was presented in [23], wherein dataflow and loop pipelining within the high-level synthesis framework were considered with the assumption of an AXI4-Stream protocol. In [23], dynamic data dependencies were considered, and the communication and computation times were estimated. A design space exploration method was introduced in [24] with task scheduling, which is based on a traffic-aware priority-based earliest-finish-time algorithm, to achieve a high level of core utilization. Guidance on how to offer a well-designed scratchpad on-chip memory system in terms of memory capacity and number of memory banks was provided in [24]. There have been several reports on the optimization of the bank allocation of DRAM [25], [26] in a heterogeneous SoC architecture. Most proposed bank allocations focus on balancing interference mitigation and bank-level parallelism. For example, in [25], DRAM banks were dynamically partitioned according to application profiling (e.g., memory-intensive vs. non-intensive). In [26], locality-aware bank partitioning (LABP) was proposed to improve dynamic bank partitioning by mitigating the interference caused by non-intensive applications, for example, by separating their banks from those of memory-intensive, high row-buffer-locality applications. However, these bank allocations are targeted at general-purpose processors (e.g., CPUs) running different applications. Specifically, optimization of the bank allocation for application-specific hardware accelerators has not been considered.

When it comes to the estimation of communication performance, there are two different approaches: the static analysis-based approach and the simulation-based approach. The static analysis-based approach tends to be faster than the simulation-based approach, but the estimation accuracy may not be sufficiently high to drive the design of communication architectures. In particular, the static analysis-based approach cannot accurately capture the dynamic nature of communication bandwidth. To improve accuracy, a static analysis-based approach is often combined with a set of traces extracted from simulations. In [14], a hybrid trace-based performance estimation algorithm was proposed to estimate the performance of bus-based communication architectures using a communication analysis graph (CAG). In [27], a simulation-based performance estimation was proposed based on the observation of the actual traffic of each core (e.g., bus master) in the application. In [28], in order to estimate the performance of bus-based communication architectures, a static analysis based on a modified queueing model was incorporated into the schedule-aware performance estimation. [29] proposes to estimate the memory latency based on the statistics of different access conditions. In [30], bus-based on-chip communication architectures were explored early in the design flow using a modeling abstraction called cycle count accurate at transaction boundaries. The abstraction used in [30] can speed up the simulation by a factor of 2 when compared with bus cycle accurate abstractions. However, in [27]–[30], neither dynamic effects, such as collisions or conflicts at the memory and the system bus, nor operation scheduling were considered.

However, the simulation-based approach is sufficiently accurate to capture the dynamic communication bandwidth. Thus, most reports on bank allocation rely on performance evaluation using full-system simulators in conjunction with DRAM simulators. For example, in [25], the proposed bank allocations were evaluated using gem5 in conjunction with an open DRAM simulator. In [7], a full-system simulator modeled the communication bandwidth of a DMA-controlled accelerator from/to the DRAM subsystem through an on-chip bus at the transaction level. However, a simulation-based approach may often be too time-consuming to explore a broad design space for communication architectures.

We summarize the contributions of this study as follows:
• In order to improve the performance of the heterogeneous SoC architecture, compared with that of a CAG-based optimization [14], which considers only memory merging, the proposed simulator considers both hardware-software partitioning and scheduling. The optimal combination of design options suggested by the proposed simulator can improve the performance of a heterogeneous SoC architecture by up to 32%.
• Compared with the conventional simulation-based approach [7], the proposed performance-estimation algorithm provides a speedup of two orders of magnitude. The simulation-based approach is sufficiently accurate to capture the dynamic nature of the communication bandwidth, but it often takes a prohibitively long time to simulate. According to our experiments, the conventional simulation-based approach takes at least a few hours to evaluate a single heterogeneous SoC architecture with hundreds of different combinations of hardware-software partitioning, memory merging, and scheduling. Using the conventional simulation-based approach, we run the full-system simulator proposed in [7] once for each design point, taking approximately 12.5 seconds per design point. By contrast, to minimize the simulation time and maintain estimation accuracy, the proposed performance estimation algorithm confines the use of an evaluation board (e.g., ZedBoard [12]) to evaluating the time information of the heterogeneous SoC architecture. Because a few tens of time parameters are sufficient to express most of the heterogeneous SoC architectures of interest, the extra simulation time required to obtain the time information of each hardware component becomes negligible, particularly in the case of a broad design space (i.e., a space of hundreds of design points).
• Compared with static analysis-based bus performance estimation [28] and statistics-based estimation [29], the proposed performance estimation algorithm can predict the communication performance of heterogeneous SoC architectures more accurately because it considers both the bus protocol overhead (bus conflicts) and memory latency (memory collisions), using an evaluation board (e.g., ZedBoard [12]) to evaluate the time information of the hardware and software. The experimental results show that the proposed algorithm approaches the full-system simulator [7] more closely than the conventional algorithms. For example, the proposed algorithm reduced the estimation error to 6%, whereas the conventional algorithms in [28] and [29] experienced estimation errors of 18% and 16%, respectively.
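The cost gap can be illustrated with the numbers quoted above (the design-space size below is a hypothetical example, not a measured figure):

```python
# Conventional flow: one full-system simulation per design point.
points = 720                # hypothetical design space ("hundreds of combinations")
per_point = 12.5            # seconds per design point with the simulator of [7]
print(f"exhaustive simulation: {points * per_point / 3600:.1f} hours")  # 2.5 hours
# Proposed flow: a one-time board emulation collects a few tens of time
# parameters; each design point is then evaluated analytically, so the total
# exploration time is dominated by that one-time measurement.
```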
processor. Assume that the time consumed on the bus is ignored and the MAMD mode is considered for the sake of examining only the memory collision situation. Also assume that the DMA uses bursts with burst length 4 and supports multiple outstanding transactions (i.e., a new transfer request is allowed before the currently ongoing transfer is terminated), whereas the memory does not support multiple outstanding transactions. The processor is assumed to support neither burst transfers nor multiple outstanding transactions. Figure 9 shows two situations: two masters read from two memories (left) and two masters read from one memory, leading to collision (right). The master can be a processor or a DMA, and hence three cases are considered with the two masters in Figure 9: the case with two DMAs, the case with two processors, and the case with one processor and one DMA.

FIGURE 7. Data transfer associated with a hardware accelerator.

Figure 10 shows the case with two DMAs and two memories, which corresponds to the left half of Figure 9. DMA 0 and DMA 1 read from memory 0 and memory 1 by delivering addresses A1 and D1, respectively. No collision occurs since different memories are accessed, and addresses A1 and D1 are delivered without delay such that memory 0 and memory 1, respectively, dispatch data bursts of length 4 without delay. Owing to the allowed multiple outstanding transactions, the DMAs send addresses B1 and E1 before their first data transfers are finished. However, each memory receives its next address only after its currently ongoing data transfer is completed. Namely, addresses B1 and E1 are not immediately input to the memories. On the other hand, the DMA data channel can transfer at almost every interval, and hence the delay in the address does not impact the actual data transfer, incurring no data transfer delay in effect.

FIGURE 8. Memory collision: address channel.

Figure 11 shows the case with two DMAs and one memory, corresponding to the right half of Figure 9, assuming DMA 0 has a higher priority. DMA 0 and DMA 1 deliver read addresses A1 and D1 to memory 0. Since the MAMD bus is unable to deliver two addresses simultaneously to memory 0, address A1 of the higher-priority DMA 0 is delivered first. Subsequently, DMA 0 tries to send address B1 while memory 0 dispatches data corresponding to address A1. The first address D1 of DMA 1 is delayed at memory 0 until the data transfer of DMA 0 corresponding to A1 is terminated, which is different from the situation illustrated in Figure 10. After the data transfer of DMA 1 corresponding to D1 starts, the opposite situation occurs: the second address B1 of DMA 0 is delayed at memory 0 until the data transfer of DMA 1 corresponding to D1 is finished. To sum up, memory collisions alternately incur transfer delays on DMA 0 and DMA 1 caused by each other, thus entailing more time to

FIGURE 11. DMA transaction with memory collision.

FIGURE 15. Processor core and DMA transactions in combination: memory collision case.

FIGURE 16. SAMD collision between DMAs.

FIGURE 18. SAMD collision between processor core and DMA.
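The alternating delay described above can be reproduced with a small timing sketch (ours, not the paper's collision model itself), under the stated assumptions: burst length 4, one beat per cycle, a memory serving one burst at a time, and DMA 0 having priority:

```python
BURST = 4  # beats per read burst, one beat per cycle (assumed)

def finish_times(bursts_per_dma, shared_memory):
    # memory_free[m]: cycle at which memory m can accept its next address
    memory_free = [0] if shared_memory else [0, 0]
    done = [0, 0]                       # completion time of each DMA's latest burst
    for _ in range(bursts_per_dma):
        for dma in (0, 1):              # DMA 0 first, i.e., higher priority
            m = 0 if shared_memory else dma
            start = max(done[dma], memory_free[m])  # wait for memory and own burst
            memory_free[m] = done[dma] = start + BURST
    return done

print(finish_times(3, shared_memory=False))  # [12, 12]: two memories, no collision
print(finish_times(3, shared_memory=True))   # [20, 24]: bursts alternate, both DMAs stall
```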
memory and bus conflicts or collisions. Algorithm 1 proposed in this study can predict the performance of a heterogeneous SoC architecture. This algorithm considers hardware-software partitioning, memory merging, and scheduling in each loop, to determine the optimal design combination when implementing a heterogeneous SoC using a hardware accelerator.

First, the partitioning loop addresses the primary consideration in an algorithm for system optimization: whether each operation is executed by the processor or by the hardware accelerator. Through this loop, it is possible to consider the cases where all functions of the target application to be implemented in a heterogeneous SoC are mapped to the software domain, the cases where all functions are mapped to the hardware domain, and everything in between. The total number of partitionings (NoPart) considered in the proposed estimation algorithm is determined by a power of 2 (i.e., 2^NoStep). Next, the scheduling loop determines the order in which each function (i.e., step) of the target application, mapped to either the hardware domain or the software domain, is executed. This loop considers the effects of memory and bus conflicts that occur when functions share one memory simultaneously. In addition, the execution time considered in this loop differs depending on the domain in which each function is performed. The total number of schedulings considered in the proposed estimation algorithm is determined by a factorial (i.e., NoFunc!) of the number of functions. Finally, a memory merging step is performed to predict the performance of the heterogeneous SoC architectures according to the number of on-chip memories. In general, on-chip memory collisions degrade the performance of accelerators implemented with heterogeneous SoC architectures. In other words, if the variables of different functions do not use the same on-chip memory, the performance of the accelerator implemented with a heterogeneous SoC architecture can be improved. However, using a large amount of on-chip memory can increase the hardware complexity of the heterogeneous SoC architecture that needs to be implemented. Therefore, in this loop, we merge on-chip memory to lower the complexity of the heterogeneous SoC architecture while gaining almost the same performance improvement as having an on-chip memory for each function of the target application. We assume three situations in this loop: 1) all functions share the same memory (m=0); 2) all functions share two memories (m=1); 3) all functions use different memories (m=2).

When design options are determined by the indices of the three loops, an ideal heterogeneous SoC architecture (idealArch) that does not reflect memory collisions and bus collisions is returned by a predefined function. The returned ideal heterogeneous SoC architecture (idealArch) is used as an input to a predefined function (swConstraints) to reflect the constraints related to the software domain. This function returns the architecture (swConstArch), wherein time information (e.g., memory load/store time information) is updated for the processor core in a heterogeneous SoC architecture. The heterogeneous SoC architecture, which reflects software domain time information, is used as an input for a predefined function (hwConstraints) to reflect the constraints related to the hardware domain. This function returns the architecture (hwConstArch), which updates information regarding the hardware accelerator to be designed in the heterogeneous SoC architecture (e.g., burst length and computation time). Next, in the heterogeneous SoC architecture (hwConstArch), which is an output from the hardware domain function (hwConstraints), the time information of memory collisions and bus collisions is not yet reflected. Therefore, the collision modeling function (busConstraints), written in Algorithm 2, is used to update the heterogeneous SoC architecture (busConstArch) latency due to memory collisions and bus collisions. Finally, the performance (T) of the heterogeneous SoC architecture, reflecting all the constraints, is returned by the predefined perf function. The proposed performance estimation algorithm simulates all design points for a heterogeneous SoC architecture that can be obtained through each design option of hardware-software partitioning, memory merging, and scheduling. Among the simulation results of all heterogeneous SoC architectures obtained through the combination of design options, the optimum design point is determined to have the minimum execution time and minimum hardware complexity. The execution time of many design points can be quickly estimated according to the combination of design options because the time information of each component of each heterogeneous SoC is obtained through emulation on the target platform (e.g., ZedBoard [12]). Based on this information, it is possible to calculate the execution time of combinations for all possible heterogeneous SoC architectures.

In summary, the time information of each domain (hardware and software) is reflected in an ideal heterogeneous SoC architecture, to which three design options are applied. Additionally, the latency due to bus and memory collisions is reflected in the heterogeneous SoC architecture. We measured the performance of a heterogeneous SoC architecture that reflects all these factors.
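The loop structure and the chain of constraint functions described above can be summarized as follows (a compact paraphrase of Algorithm 1, not the released code; the five callables stand in for the paper's predefined functions and are supplied by the caller):

```python
from itertools import permutations, product
from math import inf

def estimate_all(no_step, ideal_arch, sw_constraints, hw_constraints,
                 bus_constraints, perf):
    best, best_t = None, inf
    for part in product((False, True), repeat=no_step):   # 2**NoStep partitionings
        for sched in permutations(range(no_step)):        # NoStep! schedules
            for m in (0, 1, 2):                           # memory merging options
                arch = ideal_arch(part, sched, m)         # idealArch: no collisions yet
                arch = sw_constraints(arch)               # swConstArch: processor timing
                arch = hw_constraints(arch)               # hwConstArch: accelerator timing
                arch = bus_constraints(arch)              # busConstArch: collision latency (Algorithm 2)
                t = perf(arch)                            # execution time T of this point
                if t < best_t:                            # keep the fastest design point
                    best, best_t = (part, sched, m), t    # (complexity tie-breaking omitted)
    return best, best_t
```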
C. CASE STUDY: ORTHOGONAL MATCHING PURSUIT (OMP) ALGORITHM
The design space is explored over all partitioning and scheduling methods. A detailed explanation is given with an example application, orthogonal matching pursuit (OMP) [34], [35], as follows. Figure 22 shows the user inputs (right) of the OMP algorithm (left) (based on least squares to iteratively recover sparse data), which constitute the initial phase to obtain the necessary data preceding the simulation run. The OMP algorithm will be revisited in Section V. To implement a digital system architecture from an algorithm, parameters are needed at the initial phase: to predict the overall execution time, basic information and the required time of each operation in the system should be fed as user inputs, namely, the number of operations (the number of computations), the number of input and output data used in each operation, the order of input and output data (in/out, timing), and information on times (time parameters) such as the operation time consumed inside the hardware accelerator, the required time for the processor to transfer a data item, and the required time for the hardware to transfer a data item.

Next, schedule setting is conducted in Algorithm 1. Based on the number of operations, all the sequences of operations are explored and generated beforehand and called afterward to predict each performance. In the case of a system with, for instance, three operations, six distinct sequences exist.

Subsequently, the constraint on hardware accelerators is set in Algorithm 1. The number of hardware accelerators is a representative constraint. Among the operations constituting the given system, any operation(s) may be chosen to be executed by the hardware accelerator(s). In this phase, other conditions such as conflicts (or collisions) are not yet set. The number of processors needed to set up DMAs for hardware accelerators is assumed to be infinite, and hence all the operations are potentially allowed to get started at once. Also, no bus or memory conflict is assumed to exist. Based on these assumptions and the time information from the user inputs, the execution time of each operation is predicted, which is not yet accurate. As conditions on processors in conjunction with bus and memory conflicts are taken into consideration, the system performance estimate will become more accurate.

In succession, the constraint on processors is set in Algorithm 1. The number of processors in the system is the constraint. Processors are used for both the operations (computation) and the hardware accelerators (DMA setting). The number of operations that can be started concurrently is determined by the number of processors. On the basis of the sequence or order of operations, only after the current operation by a processor is terminated is the next operation by that processor allowed to get started. The limited number of processors renders the system simulator more practical.
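As a toy illustration of this processor constraint (durations are hypothetical, not measured values), operations mapped to the same processor are serialized in schedule order:

```python
# Each operation starts only when its assigned processor becomes free.
ops = [("S1", "P1", 30), ("S2", "P2", 25), ("S3", "P1", 40)]  # (step, processor, duration)

proc_free = {}
for step, proc, dur in ops:            # ops listed in the chosen schedule order
    start = proc_free.get(proc, 0)     # wait until the processor is free
    proc_free[proc] = start + dur
    print(f"{step}: {start}..{start + dur} on {proc}")
# S1: 0..30 on P1, S2: 0..25 on P2, S3: 30..70 on P1 (serialized after S1)
```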
Then, memory and bus conflicts (or collisions) are considered in Algorithm 1. In this phase, which operations are affected, and how much they are affected by each conflict, can be predicted. Owing to the use of the SAMD bus, a conflict on the address channel may occur between, e.g., a processor and a DMA, yielding some data transfer delay in the processor. This delay may cause another (equal-length) delay of the following operation executed by the same processor. As a result, the lastly ending operation in the system may change from one to another, and the frame execution time in block interlacing may become longer.
Figure 24 shows how the transactions are affected by the SAMD bus conflict between a DMA and a processor. Assume that the masters individually access separate slaves during their reads. The first read of the DMA and the first read of the processor get started almost at the same time, as shown at time point 1 of Figure 24, and subsequently the SAMD bus receives the data for the DMA and the data for the processor on its two distinct data channels. At this time, the DMA, which allows multiple outstanding transactions, sends the address of the second read, but since the corresponding slave cannot process this read address, the address is held on the bus. Owing to this address held on the bus, the address of the second read sent by the processor is not delivered immediately to the bus but delayed until the first read burst of the DMA is finished, as shown at time point 2 of Figure 24. In this manner, the read operations of the processor appear to be in sync with the data transfers of the DMA. This condition lasts up to the penultimate DMA read burst; at the instant of the last DMA read burst, the DMA no longer has an address to send, and hence the address channel of the SAMD bus is empty and the read operations of the processor are carried on without delay, as shown at time point 3 of Figure 24.
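A toy reading of this timeline is sketched below (ours, with one beat per cycle and hypothetical burst and read counts; Figure 24 itself is the authoritative description): while the DMA still holds a next read address on the shared address channel, the processor completes only one single-beat read per DMA burst, and it runs freely from the last burst on.

```python
BURST = 4  # DMA read burst length (beats), one beat per cycle (assumed)

def processor_read_times(n_dma_bursts, n_reads):
    times, t = [], 0
    for _ in range(n_dma_bursts - 1):  # up to the penultimate DMA burst, the held
        if len(times) < n_reads:       # DMA address blocks the address channel:
            times.append(t)            # one processor read per burst window
        t += BURST
    while len(times) < n_reads:        # from the last burst on, no DMA address is
        times.append(t)                # held, so processor reads proceed
        t += 1                         # back-to-back
    return times

print(processor_read_times(n_dma_bursts=3, n_reads=5))  # [0, 4, 8, 9, 10]
```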
V. SIMULATION
To validate the proposed performance estimation methods, an example algorithm, OMP [34], [35], is taken and modeled. The OMP algorithm is utilized in compressive sensing and long-term evolution (LTE) to estimate the wireless channel. In order to prove that the proposed performance estimation algorithm is generally applicable to any other application, we have included additional experimental results on CNNs (AlexNet [36]) and wireless communications (LDPC-coded MIMO-OFDM [37], [38]). In the case of CNNs, the third layer of AlexNet is assumed as the example application. Moreover, in the case of wireless communications, the LDPC-coded MIMO-OFDM consists of five functions, including initial synchronization, fast Fourier transform (FFT), channel estimation, multiple-input and multiple-output (MIMO), and low-density parity-check code (LDPC). The system architectures for the OMP, CNN, and LDPC-coded MIMO-OFDM are implemented in the full-system simulator [7] and on the Xilinx Zynq 7020 SoC chip on ZedBoard [12] for comparison and verification.

A. SIMULATION METHODS
The OMP algorithm can estimate the LTE channel with a 5MHz bandwidth and is converted to a system architecture implemented on the dual-core Zynq 7020 [12]. The algorithm is subdivided into 5 steps, S1-S5, corresponding to 5 operation units. Block interlacing is utilized to enable all the steps to run concurrently without data dependency. Hardware accelerators, DMAs, processors, and memories are connected to one another through an SAMD bus. Each data item is assumed to have its respective memory unless otherwise stated.

The OMP algorithm is revisited in Algorithm 3, which is equivalent to the left half of Figure 22. Five operation unit steps accompany the initialization phase: computing the correlation (S1), finding the location with the maximum correlation and adding that location to the set with the locations found up to now (S2), estimating the x value by using the least mean squares method (S3), calculating the new r value by using the estimated x value (S4), and judging whether to proceed or not by using the calculated r value (S5). These steps are iterated, where if the r value in S5 is small enough to meet the performance required by the system, the x value estimated so far is printed and the algorithm is terminated. If the r value is not sufficiently small, the steps are iterated unceasingly, and hence the maximum number of iterations is typically specified, which is set to 10 in this work.

TABLE 1. Simulation conditions.

The simulation conditions are listed in Table 1, where 4 conditions, case 1–case 4, are shown. Case 1 is the optimal case when 3 hardware (HW) accelerators are used. Case 2–case 4 are the cases which deviate from the optimal case by altering some condition(s). Case 2 uses a shared memory while case 1 uses separate memories for data items, and hence memory conflicts become more frequent in case 2. Case 3 has a scheduling different from case 1. Namely, the order and the number of operations executed by processors P1 and P2 are changed, leading to changes in the conflict order and locations. Case 4 has a different partitioning. Namely, S4, executed by hardware in case 1, is executed by software (SW), leading to the change in conflict in S4 from a memory conflict to an SAMD bus conflict and hence a potential increase in the execution time.

TABLE 2. Block interlacing applied to the OMP algorithm.

The proposed performance estimation methods are applied as follows. Block interlacing is assumed, and different data items are processed in different steps in parallel. Table 2 shows the block interlacing applied to OMP. Each step operates on one of the data items 1-5 at every instant. By using the proposed methods, the performance of the last phase (or stage) of the fifth iteration (of the 10 iterations in total), which is the shaded area in Table 2, is predicted. Thus the first data item (the data entry numbered 1 in Table 2) for the fifth iteration is the target. Partitioning between processors and hardware accelerators to execute the operations from S1 to S5 is determined, and scheduling (or the operation order) is also determined, to predict the overall execution time. The performance is predicted in view of the effect of memory merging (or the number of memories) as well.
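The block-interlaced schedule of Table 2 can be generated mechanically; the sketch below assumes the indexing that data item d enters S1 at phase d and advances one step per phase, with items numbered 1-5 within each iteration (an illustrative reconstruction, not the paper's table generator):

```python
STEPS, ITEMS = 5, 5
for phase in range(1, 10):  # the first few phases of the pipeline
    row = []
    for s in range(1, STEPS + 1):
        d = phase - s + 1                       # item index entering step Ss
        label = f"item{(d - 1) % ITEMS + 1}" if d >= 1 else "-"
        row.append(f"S{s}:{label}")
    print(f"phase {phase}: " + "  ".join(row))
# e.g., at phase 5 all five steps are busy: S1:item5 S2:item4 ... S5:item1
```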
B. SIMULATION RESULTS AND ANALYSIS
According to the simulation conditions mentioned above, simulation results from the system simulator are provided in this subsection. The performance of each of the modeled system architectures mapped from the example algorithm, OMP, is predicted and estimated, and also compared with the result from the Zynq-on-ZedBoard implementation. First, simulation results and analysis of case 1 to case 4 in terms of memory merging, scheduling, and partitioning will be provided. Then, implementation through optimization will be explained. Lastly, simulation results will be briefly compared with those of an existing work.

Errors between the performance of the system simulator for the algorithm and the performance of the implementation in the Zynq 7020 on ZedBoard are under 5% for all the simulation conditions, case 1–case 4. These errors come from the modeling based on the data transfer unit (by the burst), which exhibits lower accuracy than cycle-accurate modeling (by the clock cycle). However, the details that impact the performance only marginally are simplified in our modeling, and instead the speed is enhanced by a factor of 10 to 100 in the proposed system simulator for expeditious performance prediction and large design space exploration.

Figure 25 shows the timing diagram of the input and output data of each operation (e.g., S1) for case 1. In other words, for each operation or step, the upper half expresses the data fetched by the operation and the lower half expresses the outcome of the operation. The computation time is typically much smaller than the data transfer time and not visible from the outside, but in the case of S3, the computation time is relatively long and explicitly denoted as comp in Figure 25. In the first place, data r and then A are read by S1. As data A are read, output g is produced. In S2, all of data g are read and the outcome T is produced. In the case of S1, data A and g can be processed concurrently since S1 is assumed to be handled by a hardware accelerator, and hence two distinct DMAs take charge of the input and the output of the operation. Whereas both a read and a write are conducted concurrently in S1, either a read or a write is conducted at a time in S2, since S2 is assumed to be handled by a processor. Therefore, in the case of S2, the output of the operation is produced after all the inputs are fetched. Figure 25 shows the case when each data item is allotted a separate memory. Only the memory conflict between two distinct hardware accelerators is modeled, and its effect predicted, in this work. Operation units or steps S1, S3, and S4, which are implemented in hardware accelerators, are subject to memory conflict if the identical data item is read (or written) at the same time. Accordingly, a memory conflict occurs where S1 and S3 read data A at once, which is the shaded area in Figure 25.

Figure 26 shows the memory conflict for case 2, which is defined in Table 1. The simulation with case 2 is to identify the effect of memory merging by using an integrated or shared memory instead of the separate memories in case 1. In case 2, all the data are assigned one memory. All the other simulation conditions of case 2 are identical to those of case 1. The execution time will grow in case 2 because only one master
drop since more operations can be carried on concurrently. The optimum execution time is improved maximally, relative to the mean execution time, when only one hardware accelerator is used. This means a specific operation will occupy a large percentage of the overall execution time when processors deal with the operations, and also the gain will be large if that operation is managed by the hardware accelerator. In the architecture optimization process below, two hardware accelerators are assumed to be used. Figure 30 shows the heterogeneous SoC architecture of each optimum bar in Figure 29, their timing diagrams, and memory complexity. As mentioned earlier, the execution time tends to decrease when more hardware accelerators and memories are used in the heterogeneous SoC architecture, because more tasks can be performed simultaneously without memory collision. Additionally, since it is more efficient for the hardware accelerator to access the memory in burst units than for CPUs to read and write to the memory, the performance of the heterogeneous SoC architecture improves as the number of hardware accelerators increases. For example, in Figure 30 (a), since the steps assigned to each CPU are executed sequentially, scheduling cannot be freely performed in a block-interlacing manner [33]. However, as Figure 30 (f) assumes that all steps are implemented with hardware accelerators, block interlacing can be applied to execute all steps simultaneously and without dependency. Figure 30 shows that allocating a memory for each variable further improves performance. Furthermore, as shown in Figure 30 (c), if bus masters do not access memory simultaneously, owing to hardware-software partitioning and scheduling, hardware complexity can be reduced and a high performance gain can be obtained through merged memory. This shows that using separate memories in a heterogeneous SoC architecture is not always optimal for execution time. Figure 30 (g) illustrates the hardware area measured by the memory model based on the commercial SRAM memory compiler provided with the TSMC 28-nm standard cell library. The simulation result indicates that the hardware area of the merged memory manner (Figure 30 (c)) is approximately 67.6% smaller than that of the separated memory manner (Figure 30 (a), (b), (d), (e), and (f)). The separate memory manner allocates additional input/output (I/O) ports compared with the merged memory manner, which is the primary reason the memory area is different between the two methods using the same capacity of memory. In addition, multiple separated memories make the AXI interconnect more complex, which in turn makes the overall heterogeneous SoC architecture more complex. As a result, the proposed simulator can achieve a heterogeneous SoC architecture with high performance gains and low memory complexity.

TABLE 3. Optimized design option.

FIGURE 31. Comparison of computational complexity between the CAG-based optimization and the proposed optimization: (a) throughput optimization and (b) memory area optimization.

For the OMP algorithm, various system architecture candidates are searched in terms of partitioning, scheduling, and memory merging, where two hardware accelerators are assumed to be used. The optimum architecture has the parameters listed in Table 3. To execute S1 and S3, two hardware accelerators are employed, and to run S2, S4, and S5, processors are used. According to the chosen scheduling, processor P1 runs one hardware accelerator for S1 and then runs the S5 operation, followed by the S4 operation. Processor P2 first runs the other hardware accelerator for S3 and then runs the S2 operation. The simulated execution time of the optimized architecture is 10295 cycles, obtained from the system simulator, and the actual execution time of the optimized architecture is 10788 cycles, obtained from the board implementation, resulting in a 4.8% error.

Some comparison is made with an existing work [14], where the performance difference according to scheduling is not considered but a predetermined scheduling is fixed during the simulation. However, if the timing of an operation's access to data coincides with the timing of another operation's access to the same data, a conflict occurs and the performance is impacted accordingly. Thus, the effect of scheduling for the OMP algorithm is considered for comparison with the method in [14]. Assuming two hardware accelerators are used, the optimum architecture for the algorithm is shown in Table 3, which has an execution time of 10295 cycles from simulation. If the optimum scheduling is not explored but the operations are executed from S1 to S5 in order, then the execution time is 11604 cycles, even if the partitioning and memory merging are the same for the two architectures. Scheduling is not considered in [14], while the optimum execution order can be explored in this work, which models the data transfers with both the processor and the DMA. The execution time difference between the architecture with optimum scheduling and the architecture without scheduling is about 1309 cycles, leading to a 12.7% discrepancy if conflicts are not taken into account.
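Both percentages quoted in this subsection follow directly from the reported cycle counts (a simple check):

```python
sim, board = 10295, 10788     # cycles: simulator prediction vs. ZedBoard measurement
print(f"{(board - sim) / sim:.1%}")                              # 4.8% estimation error
opt, in_order = 10295, 11604  # cycles: optimum schedule vs. S1..S5 in order
print(f"{in_order - opt} cycles, {(in_order - opt) / opt:.1%}")  # 1309 cycles, 12.7%
```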
of magnitude. In this work, by considering scheduling in the algorithm where block interlacing or pipelining is applied, the optimum system architecture can be found with an improved performance. For example, the optimized heterogeneous SoC architecture for the OMP algorithm improves performance by up to 32% compared with the conventional CAG-based approaches. The proposed performance estimation algorithm is also verified to be generally applicable to estimating the performance of any heterogeneous SoC architecture. For example, the estimation error is measured to be no more than 5.9% for the convolutional layers of CNNs and no more than 5.6% for the LDPC-coded MIMO-OFDM. In addition, the optimized heterogeneous SoC architecture improves performance by up to 48% for the third convolutional layer of AlexNet and 56% for the LDPC-coded MIMO-OFDM.

Lastly, it is worthwhile to mention that the estimation algorithm in the proposed simulator is generally applicable to any heterogeneous SoC architecture. In particular, the extension of the performance estimation algorithm to the emerging compute-in-memory (CiM) hardware accelerators for general matrix-to-matrix multiplication (GEMM) is considered to be promising future work. Note that such CiM-based hardware accelerators for GEMM are often equipped with DMACs [39]–[43]. In addition, as standalone IPs, they are connected to an off-chip memory through an on-chip bus [44], [45]. Moreover, the memory allocation (e.g., the bank allocation of DRAM) tends to affect the communication performance of the emerging CiM-based hardware accelerators, as depicted in [46], [47]. Thus, we expect the performance estimation algorithm proposed in this paper to be generally applicable to CiM-based hardware accelerators for GEMM.

REFERENCES
[1] L. Cohen, A. Nadkarni, P. Rutten, K. Stolarski, and J. Vela, "IDC's worldwide computing platforms taxonomy," Int. Data Corp., Needham, MA, USA, Tech. Rep. US42024017, 2017.
[2] K. Sano, Y. Hatsuda, and S. Yamamoto, "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 58–68, Jan. 2011.
[3] P. Knag, J. K. Kim, T. Chen, and Z. Zhang, "A sparse coding neural network ASIC with on-chip learning for feature extraction and encoding," IEEE J. Solid-State Circuits, vol. 50, no. 4, pp. 1070–1079, Apr. 2015.
[4] M. A. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West, "Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures," J. Comput. Graph. Statist., vol. 19, no. 2, pp. 419–438, Jan. 2010.
[5] Y. S. Shao, "Design and modeling of specialized architectures," Ph.D. dissertation, Harvard Univ., Cambridge, MA, USA, 2016.
[6] Zynq-7000 All Programmable SoC Technical Reference Manual V1.12.2, Xilinx, San Jose, CA, USA, Jul. 2018.
[7] S. Kim, J. Wang, Y. Seo, S. Lee, Y. Park, S. Park, and C. S. Park, "Transaction-level model simulator for communication-limited accelerators," 2020, arXiv:2007.14897.
[8] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing accelerators and SoC interfaces using gem5-Aladdin," in Proc. Int. Symp. Microarchitecture, Oct. 2016, pp. 1–12.
[9] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[10] J. Wang, S. Park, and C. S. Park, "Optimization of communication schemes for DMA-controlled accelerators," IEEE Access, vol. 9, pp. 139228–139247, 2021.
[11] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 11, pp. 2072–2085, Nov. 2019.
[12] Digilent. ZedBoard Zynq-7000 ARM/FPGA SoC Development Board. Accessed: Apr. 10, 2019. [Online]. Available: https://store.digilentinc.com/zedboardzynq-7000-arm-fpga-soc-development-board/
[13] Altera SoC FPGAs. Accessed: Nov. 8, 2017. [Online]. Available: http://www.altera.com/devices/processor/soc-fpga/overview/procsoc-fpga.html
[14] K. Lahiri, A. Raghunathan, and S. Dey, "System-level performance analysis for designing on-chip communication architectures," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 20, no. 6, pp. 768–783, Jun. 2001.
[15] Y. Cho, G. Lee, S. Yoo, K. Choi, and N.-E. Zergainoh, "Scheduling and timing analysis of HW/SW on-chip communication in MP SoC design," in Proc. IEEE Design, Automat. Test Eur. Conf. Exhib. (DATE), Mar. 2003, pp. 132–137.
[16] S. Kim, C. Im, and S. Ha, "Efficient exploration of on-chip bus architectures and memory allocation," in Proc. 2nd IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Sep. 2004, pp. 248–253.
[17] A. D. Pimentel, C. Erbas, and S. Polstra, "A systematic approach to exploring embedded system architectures at multiple abstraction levels," IEEE Trans. Comput., vol. 55, no. 2, pp. 99–112, Feb. 2006.
[18] C. Erbas, A. D. Pimentel, M. Thompson, and S. Polstra, "A framework for system-level modeling and simulation of embedded systems architectures," EURASIP J. Embedded Syst., vol. 2007, pp. 1–11, Dec. 2007.
[19] AMBA AXI and ACE Protocol Specification, AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite, ARM Infocenter, ARM Ltd., Cambridge, U.K., 2011.
[20] S. Kim, S. Park, and C. S. Park, "System-level communication performance estimation for DMA-controlled accelerators," IEEE Access, vol. 9, pp. 141389–141402, 2021.
[21] S. Sombatsiri, K. Kobashi, K. Sakanushi, Y. Takeuchi, and M. Imai, "An AMBA hierarchical shared bus architecture design space exploration method considering pipeline, burst and split transaction," in Proc. 10th Int. Conf. Electr. Eng./Electron., Comput., Telecommun. Inf. Technol., May 2013, pp. 1–6.
[22] C. Lin, X. Du, X. Jiang, and D. Wang, "An efficient and effective performance estimation method for DSE," in Proc. Int. Symp. VLSI Design, Automat. Test (VLSI-DAT), Apr. 2016, pp. 1–4.
[23] M. Makni, S. Niar, M. Baklouti, G. Zhong, T. Mitra, and M. Abid, "A rapid data communication exploration tool for hybrid CPU-FPGA architectures," in Proc. 25th Euromicro Int. Conf. Parallel, Distrib. Netw.-Based Process. (PDP), 2017, pp. 85–92.
[24] H. Meng, H. Meng, P. Ding, M. Wang, and D. Wang, "A design space exploration method for on-chip memory system based on task scheduling," in Proc. IEEE 9th Int. Conf. Softw. Eng. Service Sci. (ICSESS), Nov. 2018, pp. 912–915.
[25] M. Xie, D. Tong, K. Huang, and X. Cheng, "Improving system throughput and fairness simultaneously in shared memory CMP systems via dynamic bank partitioning," in Proc. Int. Symp. High Perform. Comput. Archit., Feb. 2014, pp. 344–355.
[26] Y. Liu, J. Lu, D. Tong, and X. Cheng, "Locality-aware bank partitioning for shared DRAM MPSoCs," in Proc. 22nd Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2017, pp. 16–19.
[27] S. Murali, L. Benini, and G. De Micheli, "An application-specific design methodology for on-chip crossbar generation," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 7, pp. 1283–1296, Jul. 2007.
[28] S. Kim, C. Im, and S. Ha, "Schedule-aware performance estimation of communication architecture for efficient design space exploration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 5, pp. 19–24, May 2005.
[29] R. V. W. Putra, M. A. Hanif, and M. Shafique, "DRMap: A generic DRAM data mapping policy for energy-efficient processing of convolutional neural networks," 2020, arXiv:2004.10341.
[30] S. Pasricha, N. Dutt, and M. Ben-Romdhane, "Fast exploration of bus-based on-chip communication architectures," in Proc. 2nd IEEE/ACM/IFIP Int. Conf. Hardw./Softw. Codesign Syst. Synth. (CODES+ISSS), Sep. 2004, pp. 242–247.
[31] G. De Micheli and L. Benini, Networks on Chips: Technology and Tools. San Francisco, CA, USA: Morgan Kaufmann, Aug. 2006, ch. 8.
[32] S. Pasricha and N. Dutt, On-Chip Communication Architectures. Burlington, VT, USA: Morgan Kaufmann, 2008, chs. 2–9.
[33] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Block-interlaced LDPC decoders with reduced interconnect complexity," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 1, pp. 74–78, Jan. 2008.
[34] P. Maechler, P. Greisen, N. Felber, and A. Burg, "Matching pursuit: Evaluation and implementation for LTE channel estimation," in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 589–592.
[35] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. New York, NY, USA: Cambridge Univ. Press, 2012.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Conf. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[37] P.-Y. Tsai, P.-C. Lo, F.-J. Shih, W.-J. Jau, M.-Y. Huang, and Z.-Y. Huang, "A 4 × 4 MIMO-OFDM baseband receiver with 160 MHz bandwidth for indoor gigabit wireless communications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 12, pp. 2929–2939, Dec. 2015.
[38] T. Suzuki, H. Yamada, T. Yamagishi, D. Takeda, K. Horisaki, T. V. Aa, T. Fujisawa, L. Perre, and Y. Unekawa, "High-throughput, low-power software-defined radio using reconfigurable processors," IEEE Micro, vol. 31, no. 6, pp. 19–28, Dec. 2011.
[39] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable embedded microprocessor for bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[40] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[41] W.-S. Khwa, J.-J. Chen, J.-F. Li, X. Si, E.-Y. Yang, X. Sun, R. Liu, P.-Y. Chen, Q. Li, S. Yu, and M.-F. Chang, "A 65 nm 4 Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processor," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 496–498.
[42] M. Zhu, Y. Zhuo, C. Wang, W. Chen, and Y. Xie, "Performance evaluation and optimization of HBM-enabled GPU for data-intensive applications," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1245–1248.
[43] Z. Wang, H. Huang, J. Zhang, and G. Alonso, "Shuhai: Benchmarking high bandwidth memory on FPGAs," in Proc. IEEE 28th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), May 2020, pp. 111–119.
[44] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna, S. Stuijk, O. Mutlu, and H. Corporaal, "NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling," in Proc. Field-Program. Log. Appl. (FPL), Sep. 2020, pp. 9–17.
[45] A. Kurth, W. Rönninger, T. Benz, M. Cavalcante, F. Schuiki, F. Zaruba, and L. Benini, "An open-source platform for high-performance non-coherent on-chip communication," 2020, arXiv:2009.05334.
[46] P. Gu, X. Xie, S. Li, D. Niu, H. Zheng, K. T. Malladi, and Y. Xie, "DLUX: A LUT-based near-bank accelerator for data center deep learning training workloads," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 8, pp. 1586–1599, Aug. 2021.
[47] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and energy efficient deep learning with smart memory cubes," IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 2, pp. 420–434, Feb. 2018.

JOOHO WANG received the B.S. degree in electronics engineering from Korea Polytechnic University (KPU), Siheung, South Korea, in 2014. He is currently pursuing the M.S./Ph.D. degree in electronics engineering with Konkuk University, Seoul, South Korea. His research interests include hardware/software co-design of programmable accelerators and simulation for SoC architecture.

YUNGYU GIM received the B.S. and M.S. degrees in electronics engineering from Konkuk University, Seoul, South Korea, in 2015 and 2018, respectively. He is currently conducting research on hardware security architectures for the Android mobile OS. His research interest includes tamper-resistant integrated secure elements (iSE).

SUNGKYUNG PARK (Senior Member, IEEE) received the Ph.D. degree in electronics engineering from Seoul National University, South Korea, in 2002. From 2002 to 2004, he was with Samsung Electronics, as a Senior Engineer, where he worked on the development of system-level simulators for cellular standards. From 2004 to 2006, he was with the Electronics and Telecommunications Research Institute (ETRI), as a Senior Member of Research Staff, where he worked on fiber-optic front-end IC design. From 2006 to 2009, he was with Ericsson Inc., as a Senior Staff Hardware Designer, where he worked on the design and modeling of multi-standard RF transceivers and clocking circuits. In 2009, he joined as a Faculty Member with the Department of Electronics Engineering, Pusan National University, South Korea, where he is currently a Professor. His research interests include design and modeling of SoC, hardware accelerators, and virtual platforms for neural networks and 5G.

CHESTER SUNGCHUNG PARK (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 2006. From 2006 to 2007, he was with Samsung Electronics, Giheung, South Korea. From 2007 to 2013, he was with Ericsson Research, Plano, TX, USA, as a Senior Engineer. Since 2013, he has been with the Department of Electronics Engineering, Konkuk University, South Korea, as an Associate Professor, where he is working on the design and modeling of SoC, hardware accelerators, and virtual platforms for neural networks and 5G. His research interests include SoC architecture design for artificial intelligence, processing in memory, and wireless communication.