
Analysis of Performance Bottlenecks in SoC Interconnect Subsystems
Kirill A. Zhezlov, Fedor M. Putrya, Andrey A. Belyaev
R&D ELVEES
Moscow, Zelenograd, Russia
[email protected]

Abstract— An interconnect subsystem as a part of an SoC most often determines the performance of the entire system, so the analysis of its effectiveness must be started as early as possible. The main point is to determine whether the interconnect subsystem itself is a potential bottleneck. To solve this problem, the following components are important: a test scenario aimed at identifying performance problems, and a methodology for evaluating the results of its execution. This article introduces a new approach to SoC interconnect performance evaluation based on specific test scenarios and a new set of performance metrics and criteria for their evaluation. The core idea of the proposed approach is to take system-level constraints into account when evaluating interconnect performance.

Keywords—system on a chip, verification, performance analysis, high-level model, performance metrics.

I. INTRODUCTION

The process of verifying the feasibility of target applications and confirming the effectiveness of the designed system for all target modes of operation is an integral step of functional verification of modern SoCs. The interconnect subsystem as a part of the SoC most often determines the performance of the entire system; therefore, it is necessary to begin the analysis of its efficiency as soon as possible. The complexity of this task lies in the need to determine whether the interconnect subsystem itself is a potential bottleneck. The following components are important for solving this task: a test scenario aimed at identifying performance problems and a methodology for evaluating the results of its execution [1, 2].

The complexity of the designed systems poses the problem of finding approaches to performance evaluation at the earliest stages of development. Verifying the functional correctness of the entire system, as well as analyzing its effectiveness on the full SoC model, is very resource-intensive both in terms of creating tests and in terms of modeling time, and it requires a working model of the whole system, which appears too late for significant architecture revisions if system performance problems are detected.

Modern systems on a chip (SoC) contain many heterogeneous blocks, such as different types of memory, DMA, processor cores, and others. All of these blocks are integrated onto the same chip and can communicate with each other through the interconnect subsystem. The interconnect subsystem is a set of data exchange channels and devices that ensure the transmission and routing of this data inside the SoC. It can be implemented through various technologies, such as a cross-bus-based approach or a Network-on-Chip [3]. Each block in the SoC generates traffic flows that have their own characteristics and specific performance requirements in terms of latency and throughput.

This paper describes an approach to automated performance analysis, which includes a set of metrics, performance evaluation criteria, and a method for their automated analysis. This approach concerns the development stage when the SoC and interconnect architecture have already been chosen and there is a need to test the RTL model under system constraints.

II. PERFORMANCE EVALUATION METHODOLOGY

A. Task of Performance Analysis

The task of evaluating the performance of an interconnect subsystem is to check whether the system meets the stated performance requirements, and the key indicator is compliance with the requirements of a task. The parameters under control (in terms of a task) are the following:

• execution time of a specific task;

• system response time to external events.

The following factors are critical for evaluating system performance, as well as for finding problems with it:

• a test scenario aimed at creating conditions for measuring performance;

• a way of representing, forming and executing this scenario, suitable for reproducing errors in the future;

• a methodology for evaluating performance, which defines what parameters are calculated and how.

The main step at which performance evaluation begins is the characterization of the load. It includes the task of describing traffic patterns and evaluating their temporal characteristics. That is, it is necessary to describe not only the amount of data, its source, receiver, and the sequence of arrival in the network, but also the sparseness of the data, the volumes of individual packets, etc. In this context abstract traffic is meant; it is not tied to data transfer protocols; therefore, at the characterization stage, the type of transactions used does not matter.
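To make the notion of an abstract, protocol-independent traffic flow concrete, the following minimal Python sketch shows one possible way to describe such a flow at the characterization stage. The class and field names (TrafficFlow, burst_bytes, gap_cycles, and so on) are illustrative assumptions for this example, not part of the methodology described in the paper.

```python
from dataclasses import dataclass

@dataclass
class TrafficFlow:
    """Abstract description of one traffic flow between two SoC blocks.

    The flow is protocol-independent: it records only where the data goes,
    how much of it there is, and how it is spread out in time.
    """
    source: str              # generating block, e.g. "CPU0" or "DMA0"
    destination: str         # receiving block, e.g. "DDR"
    total_bytes: int         # total amount of data to be transferred
    burst_bytes: int         # volume of an individual packet/burst
    gap_cycles: int          # idle cycles between bursts (sparseness)
    max_latency_cycles: int  # latency requirement for a single burst

    def required_throughput(self, clock_hz: float, burst_cycles: int) -> float:
        """Average throughput (bytes/s) this flow needs, assuming each burst
        occupies `burst_cycles` cycles followed by `gap_cycles` idle cycles."""
        period_cycles = burst_cycles + self.gap_cycles
        return self.burst_bytes * clock_hz / period_cycles

# Example: a DMA flow moving 1 MiB to DDR in 256-byte bursts with 16 idle
# cycles between bursts on a 1 GHz interconnect clock (numbers are illustrative).
flow = TrafficFlow("DMA0", "DDR", 1 << 20, 256, 16, 200)
print(f"{flow.required_throughput(1e9, burst_cycles=4):.1f} B/s")
```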

Today, approaches to the tasks of load characterization and automated generation of test scenarios are actively studied in the works of Z. Qian [1], A. B. Mehta [3], R. Shen [4] and C. Keimel [5]. The approaches proposed by these authors can be divided into two groups: the use of synthetic templates and mathematical models (in particular, statistical distributions) to describe traffic, and methods based on the analysis of device usage scenarios.

In synthetic models, traffic is usually described through the distribution of the pause length between transactions, which allows the injection rate, i.e. the speed of data arrival, to be determined. The most common models are the Poisson and fractal models. The advantage of this approach is the simplicity of forming the input stimuli. However, this approach is primarily convenient for testing and comparing various interconnect architectures during the development of their mathematical models, since it allows them to be evaluated quickly. When a high-level model of the entire system is available, this approach is less effective, since it does not allow a sufficiently accurate description of the traffic patterns of real applications. In addition, in the event of errors, this approach (because of the probability distributions involved) greatly complicates the reproduction of the error situation.

The most convenient way to solve the problems of generating and executing the test scenario is to represent it in the form of a graph. This approach makes it possible to create a scenario that operates with transactions or even data flows, which moves the test description to a higher level of abstraction. Independence from the internal architecture of the blocks increases the degree of reuse of test scenario code and facilitates its modification. At the same time, the determinism of the test scenario makes it possible to unambiguously reproduce the situation in which a system error was detected and to determine the specific event that caused it.

The way of constructing, representing and simulating performance-oriented test scenarios was described in our previous paper [6]. It is based on a graph-based approach to describing traffic flows.

Approaches to the analysis of the efficiency of interconnect subsystems and verification of their compliance with performance requirements are actively explored in the works of Z. Qian [1], U. Ogras [7] and S. Suboh [8]. Based on their work, it can be concluded that the main measured values by which it is possible to estimate the performance of the interconnect as a part of the SoC are bandwidth (or throughput) and latency. Throughput shows the actual load of the data transmission channel and is generally defined as the ratio of the amount of transmitted data to the length of the time period; it is expressed in bytes/s or bytes/clock and multiples of these units. Latency, or the amount of delay, shows the time it takes to complete a transaction and is defined as the time elapsed from the start of the transaction (entering the network) to the completion of its reception or the receipt of a response.

This set of metrics is basic and allows the performance of the interconnect subsystem to be characterized accurately, but it still has several disadvantages. First of all, there are various ways to define these metrics, and each of them depends on the mathematical model of the interconnect being used. It also does not allow the causes of performance problems to be identified, i.e. the blocks (e.g. interfaces) that limit the system's performance, since these metrics are parameters controlled by the system, and additional investigation with the analysis of time charts is required to explain their deviation from the required values.

Thus, based on the analysis of the problems existing in this field, we can conclude that it is necessary to develop a methodology for the analysis of interconnects that allows potential bottlenecks that can adversely affect system performance to be detected.

B. Metrics and Criteria

The factors that can potentially affect the performance of the system and, therefore, the parameters being monitored may be software or hardware in nature. Software problems amount to the executable task not being optimized for the given architecture (assuming the hardware operates satisfactorily). In the case of hardware, there are three main potential bottlenecks: master devices, slave devices, and the interconnect itself. The interconnect subsystem can cause poor system performance in the following cases: it does not provide the throughput necessary for the masters and slaves, or arbitration is not configured correctly.

The master devices are, as a rule, CPUs. Master devices can potentially reduce performance if their memory settings are not optimized. Various types of memory are most often used as slaves. In this case, the problem may be the low speed of their operation or non-optimized internal interconnects.

The set of metrics proposed in this paper is shown in Table 1. The following notation is used in this table: Tn is the current transaction, Tsizen is the volume (length) of the current transaction, startn is the start time of the current transaction, endn is the stop time of the current transaction, endn-1 is the stop time of the previous transaction, and t is the time period over which the measurement is made.

Each metric is measured based on the simulation trace, from which information about transactions is extracted. For each metric, its average, maximum and minimum values can also be measured. Metric values can be calculated over different time intervals (if it is necessary to analyze the dependence of the metric value on time) as well as over the entire modeling time period to obtain integral values. When calculating metric values over several time intervals, it must be taken into account that increasing the number of segments into which the entire simulation time is divided increases the accuracy of the calculation, but also increases the time spent.
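As a concrete illustration of how the metrics of Table 1 can be extracted from a simulation trace, the following Python sketch computes them for a list of completed transactions. The trace format (a list of transactions with start, end and size fields) and the function name are assumptions made for this example, not the authors' tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Txn:
    start: int   # clock cycle at which the transaction enters the network
    end: int     # clock cycle at which the response is received
    size: int    # transferred volume in bytes

def metrics(trace: List[Txn], t_start: int, t_end: int) -> dict:
    """Compute the metrics of Table 1 over the interval [t_start, t_end]."""
    txns = sorted((x for x in trace if t_start <= x.start <= t_end),
                  key=lambda x: x.start)
    latency = [x.end - x.start for x in txns]                        # Ln
    exec_speed = [x.size / (x.end - x.start) for x in txns]          # En
    local_thr = [b.size / (b.end - a.end)                            # Cn
                 for a, b in zip(txns, txns[1:]) if b.end > a.end]
    vdavg = sum(x.size for x in txns) / (t_end - t_start)            # Vdavg
    outstanding = [1 + sum(1 for m in txns                           # On
                           if m is not n and n.start <= m.start <= n.end)
                   for n in txns]
    summary = lambda v: (sum(v) / len(v), max(v), min(v)) if v else (0, 0, 0)
    return {"Vdavg": vdavg,
            "L (avg, max, min)": summary(latency),
            "E (avg, max, min)": summary(exec_speed),
            "C (avg, max, min)": summary(local_thr),
            "O (avg, max, min)": summary(outstanding)}

# Example on a tiny hand-made trace (sizes in bytes, times in clock cycles):
trace = [Txn(0, 10, 64), Txn(4, 18, 64), Txn(20, 28, 64)]
print(metrics(trace, 0, 30))
```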

The first metric is latency. It is measured as the difference between transaction start time and end time and shows how long it takes for a transaction to reach the destination point. The second metric shows the actual throughput of an interface or transmission line. The key point is that it can be measured over different time periods, and this should be done carefully: the best way is to measure it over the longest period of active device operation. Local throughput is the metric which shows the sparseness of traffic; it takes into consideration the time gaps between transactions. The last metric is the number of outstanding transactions; it shows the number of transactions which are active at the same time as the current one.

It is important to remember that data transfer speed and transaction execution time are not directly or inversely related. A decrease in the data transfer rate together with a decrease in the transaction execution time means that transactions are executed in less time, but the intervals between them have increased.

To evaluate the performance of interconnect subsystems based on the proposed set of metrics, a number of criteria are proposed. They are given in Table 2. As all the metrics can be measured for master and slave agents, these criteria can also be applied to both of them. Table 2 mentions the coefficients Kc and Ke. Both of them should be defined by the developer of the block under test and should be in the range from 0 to 1. These coefficients define the acceptable deviations from the maximum value. The value Vdref is the minimal required data transfer speed and is also defined by the developer.

TABLE I. PERFORMANCE METRICS

№ | Metric | Formula | Units
1 | Transaction execution time (latency) | Ln = endn – startn | clock, s
2 | Average data transfer speed | Vdavg = ∑Tsizen / t | Byte/clock, Byte/s
3 | Local throughput | Cn = Tsizen / (endn – endn-1) | Byte/clock, Byte/s
4 | Transaction execution speed | En = Tsizen / (endn – startn) | Byte/clock, Byte/s
5 | Outstanding transactions | On = ∑Tm + 1, for all Tm such that startn ≤ startm ≤ endn | None

TABLE II. PERFORMANCE CRITERIA

№ | Criterion | Interpretation
1 | Vdavg < Vdref | The device does not provide the minimal required data transfer speed.
2 | Cavg < Ke * Emax | Local throughput is much lower than the maximum. Potentially, the device can operate faster.
3 | Eavg < Ke * Emax | Transactions run slower than they could. Potentially, the device can run faster.
4 | Cmax < Vdref | Maximum local throughput is below the expected bandwidth. The desired data transfer speed is not achieved.
5 | Cavg ≈ Eavg | Transactions are carried in a dense stream; the transmission channel is fully loaded. Otherwise, the device does not create the necessary load.
6 | Eavg(MS) < Eavg(SL) | The master device processes transactions slower than the slave one.
7 | Eavg(SL) < Eavg(MS) | The slave device processes transactions slower than the master one.

C. Basic Performance Tests

The basis of performance evaluation is point-to-point tests. Such a test iterates over all possible pairs of connected interfaces and sends a number of transactions for each pair. The basic performance tests are tests of latency, throughput and the number of simultaneously executed (outstanding) transactions. It makes sense to carry out these tests as soon as the RTL description of the interconnect subsystem has passed a basic functional check for compliance with the memory map and the register map.

The following characteristics are crucial for this kind of test: transaction length and width, the total data volume to be sent, and additional delays on different signals.

The tracked metric for the latency test is the average transaction execution time. To get correct values, transactions of minimum length should be sent, and they should be sent one by one.

The tracked metric for the throughput test is the average data transfer speed. For this test, it is necessary to load the channel of the interconnect subsystem being used as much as possible; therefore, transactions of maximum length should be sent at maximum speed (i.e., with a zero pause between transactions) and with the maximum number of simultaneously executed transactions. This test must be carried out both with zero delays, to assess the maximum capabilities of the switching infrastructure, and with realistic delays for all devices, to get an idea of how close the system is to the maximum values achievable under ideal conditions.

The test of outstanding transactions is carried out with the minimum transaction length and width, and the transactions should be sent at maximum speed. It is also necessary to simultaneously send into the network more transactions than the master device can transfer and the slave device can accept. To get correct results, it is necessary to set additional delays on the response signals (for example, RVALID) in order to artificially increase the transaction execution time, that is, to delay the "valid handshake" moment and the beginning of the data transfer phase. The delay should be such that before the completion of the first transaction, the master device can transmit the addresses for all transactions sent.

The next step of interconnect performance evaluation is constructing and executing performance "use case"-based tests. The scenarios of such tests strongly depend on the SoC structure and its targets.
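Before turning to the results, the following Python sketch shows one possible way to apply the criteria of Table 2 automatically to the measured metric values of a master/slave pair. The function name, the dictionary-based input format and the default thresholds (ke, the tolerance used for the Cavg ≈ Eavg check) are assumptions of this example rather than the authors' implementation.

```python
def check_criteria(master: dict, slave: dict, vd_ref: float,
                   ke: float = 0.8, tol: float = 0.05) -> list:
    """Evaluate the criteria of Table 2 for one master/slave pair.

    `master` and `slave` map metric names ("Vdavg", "Cavg", "Cmax",
    "Eavg", "Emax") to measured values in the same units as vd_ref.
    Returns a list of (criterion number, agent, message) that fired.
    """
    findings = []
    for name, m in (("master", master), ("slave", slave)):
        if m["Vdavg"] < vd_ref:
            findings.append((1, name, "required data transfer speed not provided"))
        if m["Cavg"] < ke * m["Emax"]:
            findings.append((2, name, "local throughput much lower than maximum"))
        if m["Eavg"] < ke * m["Emax"]:
            findings.append((3, name, "transactions run slower than they could"))
        if m["Cmax"] < vd_ref:
            findings.append((4, name, "max local throughput below expected bandwidth"))
        if abs(m["Cavg"] - m["Eavg"]) > tol * m["Eavg"]:
            findings.append((5, name, "device does not create the necessary load"))
    if master["Eavg"] < slave["Eavg"]:
        findings.append((6, "master", "master processes transactions slower than slave"))
    elif slave["Eavg"] < master["Eavg"]:
        findings.append((7, "slave", "slave processes transactions slower than master"))
    return findings
```

Applied to the master-agent column of Table 3 with Vdref = 640 MB/s, such a check would flag criterion 1 (524.3 < 640), criterion 4 (600.1 < 640) and criterion 5 (Cavg = 546.4 differs markedly from Eavg = 338.9), which is consistent with the analysis given in Section III.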

III. RESULTS

The results presented here primarily concern the use cases and were obtained while testing SoC models with a cross-bus-based interconnect and a NoC.

Table 3 illustrates an example of bandwidth limitation by the master agent. In this example data was sent from the CPU to system memory, so the CPU in this case is the master agent and the memory is the slave one. The required data transfer speed is 640 MB/s. After performing a test and taking measurements on both master and slave agents, the average data transfer speed was measured as 524 MB/s. The table shows that the maximum local throughput is less than the expected data transfer speed and the average local throughput is higher than the average transaction execution speed on the master agent (criteria 1, 4 and 5). On the slave agent the situation is the same, but the average local throughput is less than the average transaction execution speed. This means that the bandwidth is limited by the master agent.

Table 4 illustrates the same example after correction of the RTL model. Here, in both cases, the measured values of the average data transfer speed and the maximum local throughput are higher than the required value of the average data transfer speed, but on the master agent the average local throughput is approximately equal to the average transaction execution speed, and on the slave agent it is less. Thus, the average data transfer speed exceeds the minimum required value and is now limited by the interconnect subsystem.

TABLE III. MEASUREMENTS BEFORE RTL MODEL CORRECTION

Metric | Master agent | Slave agent
Vdavg, MB/s | 524.3 | 524.3
Cavg, MB/s | 546.4 | 560.4
Cmax, MB/s | 600.1 | 636.9
Eavg, MB/s | 338.9 | 638.1
Emax, MB/s | 342.2 | 672.2

TABLE IV. MEASUREMENTS AFTER RTL MODEL CORRECTION

Metric | Master agent | Slave agent
Vdavg, MB/s | 719.2 | 718.9
Cavg, MB/s | 720.4 | 721.4
Cmax, MB/s | 727.1 | 721.9
Eavg, MB/s | 720.1 | 720.3
Emax, MB/s | 725.4 | 723.2

IV. CONCLUSION

The performance analysis methodology proposed in this paper consists of recommendations for the construction of test scenarios aimed at localizing performance bottlenecks, together with a set of metrics and criteria. This methodology makes it possible to measure and analyze the performance of a system being designed at the early stages. Application of the proposed metrics and criteria showed that actual performance bottlenecks can be uncovered easily, but every such case still needs to be checked by a deep analysis of waveforms in order to confirm that the problem really exists and to find its reason. Thus, the proposed methodology can help significantly reduce the time required to localize performance problems.

REFERENCES

[1] Qian, Z., Bogdan, P., Tsui, C. and Marculescu, R. (2016). Performance Evaluation of NoC-Based Multicore Systems: From Traffic Analysis to NoC Latency Modelling. ACM Transactions on Design Automation of Electronic Systems, 21(3), pp. 1-38.
[2] Silva, D., Oliveira, B. and Moraes, F. (2014). Effects of the NoC architecture in the performance of NoC-based MPSoCs. 2014 21st IEEE International Conference on Electronics, Circuits and Systems (ICECS).
[3] Mehta, A. B. ASIC/SoC Functional Design Verification. Springer. DOI: 10.1007/978-3-319-59418-7.
[4] Shen, R., Tan, S. and Yu, H. (2012). Statistical Performance Analysis and Modeling Techniques for Nanometer VLSI Designs. Boston, MA: Springer US.
[5] Keimel, C. (2016). Design of Video Quality Metrics with Multi-Way Data Analysis. Springer Science+Business Media Singapore.
[6] Zhezlov, K. A. (2019). Graph-based approach to test generation aimed at SoC. Microelectronica i informatica [Microelectronics and Information Technologies], Moscow, Zelenograd, pp. 43-47.
[7] Ogras, U., Bogdan, P. and Marculescu, R. (2010). An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(12), pp. 2001-2013.
[8] Suboh, S., Bakhouya, M., Gaber, J. and El-Ghazawi, T. (2010). Analytical modeling and evaluation of network-on-chip architectures. 2010 International Conference on High Performance Computing & Simulation.
