Technische Universität München

Fakultät für Elektrotechnik und Informationstechnik

Study on Embedded
Multi-Processor Systems-on-Chip
with Virtual Prototyping Technique

Bogdan Pricope

A thesis submitted for the degree of Master of Science.

September 11, 2008

Supervisors:
Dr. Jinan Lin
Dr. Xiaoning Nie
Dipl. Ing. Michael Meitinger
Prof. Andreas Herkersdorf
Abstract

Processor performance has been primarily driven by increasing clock frequency and advances
in silicon process technology. However, power dissipation density is the critical factor limiting
performance increases. For this reason, performance growth has slowed down in recent years. It
has become clear that future performance demands can only be met by new design solutions.
Moreover, today's embedded applications are very different from those of the past, in terms of
both application complexity and dataset size. Consequently, it is no longer feasible to meet the
demands of embedded applications with single-core systems.
Multiprocessor system-on-chip (MPSoC) designs are a way to scale performance in accor-
dance with Moore's law. There is a growing trend towards employing MPSoC-type architectures,
where multiple processor cores reside on the same chip and share data through on-chip
memory and an on-chip communication network.
However, high performance MPSoC architectures need high memory bandwidth. With the
widening gap between processor and memory speeds, system performance has become increas-
ingly dependent upon the effective use of the memory hierarchy. Moreover, the integration of
multiple processors on a single chip makes the problem even worse.
Caches, which store frequently used instructions and data in high-speed memory close to the
processor, are a means of increasing the effective memory bandwidth. However, caches are very
expensive, especially in embedded systems. Therefore, cache design is still an important area of research.
An important concept for understanding how caches behave is the principle of locality. In
this thesis, the locality of a stream of instructions is described using the reuse-distance model.
This model bases the probability of a cache hit on the instruction reuse-distance. The concept
of Instruction Reuse is introduced as a reference for our measurements, in order to abstract
our results from implementation details such as the application being executed or the cache
configuration.
An ARM11 MPCore based multiprocessor system is modelled and simulated using virtual
prototyping technology from VaST Systems and the effect of Instruction Reuse on system perfor-
mance and scalability is studied. We show that a low Instruction Reuse limits the performance
and scalability of multiprocessor systems. Moreover, it is observed that even doubling the mem-
ory bandwidth does not improve system scalability when Instruction Reuse is low. In Symmetric
Multiprocessing mode, it is shown that a solution to the MPSoC scalability problem is the ad-
dition of a shared Level 2 cache. However, in Asymmetric Multiprocessing, the shared Level 2
cache may actually decrease system performance when Instruction Reuse is low.

Declaration of Originality

I hereby declare that the research documented in this thesis and the thesis itself are the result of
my own work in the Communications Solutions business group at Infineon Technologies.

Bogdan Pricope

Acknowledgments

I would hereby like to express my gratitude to all those who made it possible for me to complete
this thesis.

First and foremost, I wish to thank my supervisors from Infineon Technologies, Dr. Jinan
Lin and Dr. Xiaoning Nie for offering me this thesis topic and for their continuous support and
guidance. I also would like to thank Mr. Stefan Maier and Mr. Thomas Niedermeier for their
fruitful discussions which improved the quality of this thesis, and I thank all my colleagues from
the Advanced Systems and Circuits department for making me feel at home.

Moreover, I would like to thank Dipl. Ing. Michael Meitinger and Prof. Herkersdorf from the
Lehrstuhl für Integrierte Systeme at Technische Universität München, for providing the initial
support without which I would not have been able to commence this thesis. I especially want
to thank Mr. Meitinger for his valuable support and discussions.

Last but not least, I wish to thank my family and my girlfriend for their continuous support
during my studies.

Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Declaration of Originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 The Instruction Reuse Challenge 6


2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Reuse-distance analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Virtual Prototyping Technology 18


3.1 Virtual System Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Virtual Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 VaST Systems Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Experiment System Architecture 24


4.1 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Simulation Results & Analysis 42


5.1 Effect of Cache Size on Instruction Reuse . . . . . . . . . . . . . . . . . . . . . . 43
5.2 The Low Instruction Reuse Problem . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 The Shared Level 2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 The effect of Tightly Coupled Memory . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Conclusion & Future Work 57

List of Figures

1.1 Exponentially increasing application complexity [7] . . . . . . . . . . . . . . . . . 2

2.1 Symmetric multiprocessing (SMP) . . . . . . . . . . . . . . . . . . . . . . . . . . 7


2.2 Asymmetric Multiprocessing (AMP) . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Cache block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Instruction Reuse Distance Histogram . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Instruction Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Instruction Reuse increases, as the cache capacity increases from 4 to 7 instructions. 16

3.1 Virtual Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


3.2 CoMET window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Experiment System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


4.2 Arm11 MPCore block diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Memory Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Output image structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Flowchart main() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Basis for the OS instruction reuse distance benchmark . . . . . . . . . . . . . . . 38
4.7 Benchmark OS instruction reuse-distance histogram . . . . . . . . . . . . . . . . 39
4.8 Test() function control flow graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1 Instruction Reuse as cache size is varied for the TEST benchmark. . . . . . . . . 43
5.2 A low Instruction Reuse results in no performance improvement as the number of
CPUs is increased. In other words, a low Instruction Reuse limits the scalability
of a multiprocessor system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 High Instruction Reuse values enable a multiprocessor system to scale to a higher
number of processors, and significant performance gains can be seen over a single
processor system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4 Doubling the memory bandwidth increases system IPC but does not help to
improve the scalability of the multiprocessor system when Instruction Reuse is low. 46
5.5 Doubling the Level 1 cache size or even the memory bandwidth may not improve
the scalability of a multiprocessor system. However, increasing the cache size
above a certain threshold value solves the scalability problem. . . . . . . . . . . . 47


5.6 Instruction Reuse as cache size is varied for the modified histogram. . . . . . . . 48
5.7 Effect of application instruction reuse-distance histogram . . . . . . . . . . . . . 50
5.8 In SMP mode, the addition of a shared Level 2 cache increases system IPC and
also improves the scalability of the multiprocessor system for low Instruction Reuse. 52
5.9 In AMP mode, the addition of a shared Level 2 cache slightly increases system
IPC but does not improve the scalability of the multiprocessor system. In fact,
for low Instruction Reuse, increasing the number of processors decreases system
IPC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.10 The addition of a relatively small Level 2 shared cache (e.g. twice the size of the
Level 1 cache) provides a significantly greater performance improvement than
doubling the Level 1 cache size alone. However, increasing the Level 2 cache
size without also increasing the number of CPUs, does not bring any significant
performance gain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.11 As opposed to SMP mode, in AMP mode increasing the Level 2 cache size con-
siderably increases system IPC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

List of Tables

1.1 MPSoC cache configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Instruction Reuse Distance example . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.1 Experiment System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 25


4.2 Cache and TLB Operation Functions . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Typical memory sizes and access times . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Memory Controller configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Memory Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Page Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.7 Instruction Reuse Distance is modelled by loops. . . . . . . . . . . . . . . . . . . 40
4.8 Type and number of instructions executed. . . . . . . . . . . . . . . . . . . . . . 41

5.1 Instruction Reuse as cache size is varied for the TEST benchmark . . . . . . . . 43
5.2 Instruction Reuse comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Chapter 1

Introduction

1.1 Motivation

Embedded electronic systems are specialized to carry out specific tasks and are embedded in their
environment. This is in contrast to personal computers or supercomputers which are general
purpose and interact with users. Embedded systems are much more prevalent than their general-
purpose counterparts; for instance, 98% of all microprocessors manufactured in a given year are
used within embedded systems [14, 21]. Embedded systems designers must meet strict time-
to-market and productivity requirements. Thus, embedded systems are generally restrictive,
because designers have to make trade-offs between design cost and design complexity.

However, computational requirements for embedded applications are increasing exponentially.


During the past 15 years, a variety of new protocols and standards have been introduced which
feature rapidly increasing computational requirements. Figure 1.1 shows some of these trends for
three classes of multimedia applications: video, cellular, and wireless LAN. Code size for these
applications is also increasing, reflecting the trend that application complexity is increasing along
with computational requirements. This exponential trend is creating demand for the increasing
number of transistors that can be integrated with scaling [18].

Advances in process technology have made it possible to roughly double the number of transistors
per area every two years, in accordance with Moore's law [22]. Thus, performance gains have been
achieved by higher transistor integration densities and increased clock frequencies due to the
smaller size of the transistors. However, the increase in performance came at the cost of increased
power consumption and thus heat dissipation. The latter has been the most critical technical
challenge in maintaining performance growth.


Figure 1.1: Exponentially increasing application complexity [7]

The total power consumption for a chip is given by the sum of two components: the active (or
dynamic) power consumption and the static power consumption. The dynamic power dissipation
density is proportional to the number of transistor devices per area (N), the activation factor
of the device (α), the switched capacitance per device (C), the operation frequency (f) and the
square of the supply voltage (V).

P_dynamic / Area ≈ N · α · C · f · V²

For older process technologies, the dynamic power consumption was dominant. As the dimensions
of the transistors shrank by √2 every two years, the power dissipation density increased by more
than √2, or 40%. This was due to the increase in the static power consumption, with gate leakage
current being the dominant component.

In order to keep the power dissipation under acceptable levels, designers have traded silicon
area against power consumption. Thus, performance of a single processor has increased with
the square root of complexity [6]. Each new processor architecture has required two to three
times the silicon area, while providing only a 20% improvement in performance [11]. Due to these
marginal increases in performance, a new approach to increasing the gain from the same silicon
area was needed.

Multiprocessor systems-on-chip (MPSoCs) have emerged as a solution to scale performance by
exploiting software parallelism. Nevertheless, MPSoCs introduce some challenges to the system
architects concerning the efficient design of memory hierarchies and system interconnects while
maintaining the low power and cost constraints of embedded systems.


Cache Type Size Line size No. of ways No. of sets


Data 16 KB 32 B 4 128
Instruction 16 KB 32 B 4 128

Table 1.1: MPSoC cache configuration

For example, in [2], multiprocessor performance has been investigated for network protocol
processing. The MPSoC platform is based on two 32-bit MIPS cores with the following features:

• 32-bit address path

• 64-bit data path to the caches

• Eight-stage pipeline

• Two separate instruction and data caches with 16 KByte each

• Cache lines are virtually indexed, physically tagged

• Cache replacement policy is based on least recently used (LRU) strategy

• Processor clock frequency to bus clock frequency ratio is 2:1

The cache configuration is given in Table 1.1.

During the measurements, which were conducted in the course of studying software based cache
coherence, an interesting phenomenon was observed: for TCP/IP protocol processing on Linux,
up to 70% of the total cycles are stall cycles due to instruction cache misses caused by the Linux
operating system code [2].

As the example above shows, using multiple processors does not necessarily increase system
performance. The challenge is scalability: system performance must increase as additional
processors are added to the system.

High instruction cache hit rates are key to achieving high performance. In contrast to data cache
accesses, instruction cache accesses are serialized and cannot be overlapped. Instruction cache
misses prevent the flow of instructions through the processor and directly affect performance. To
maximize processor instruction cache utilization and minimize stalls, application code should
have high locality, i.e. few branches (exhibiting high spatial locality), a repeating pattern
when deciding whether to follow a branch (yielding a low branch misprediction rate), and most
importantly, the working set code footprint should fit in the processor’s instruction cache.

Unfortunately, many applications and especially the operating system code exhibit exactly the
opposite behavior. As the example above showed, the low instruction locality of the operating
system while running a TCP/IP processing application is responsible for a substantial number
of idle processor clock cycles. What makes this worse is the widening gap between memory and
processor speeds, which results in large performance penalties.


Even if the processor clock frequency remains the same, doubling the core count requires the
memory bandwidth to double. Unfortunately, doubling the on-chip cache size does not halve the
miss rate: empirically, the miss rate only goes down by a factor of the square root of two when
the cache size is doubled [1]. As a result, for each new generation, the bandwidth in and out of
the chip will increase exponentially. This poses a dilemma, as the pin bandwidth will not increase
exponentially but rather linearly, according to ITRS (International Technology Roadmap for
Semiconductors) predictions [19, 10].
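To make the dilemma concrete, the short C sketch below combines the two rules of thumb just mentioned: bandwidth demand doubles with the core count, while doubling each cache only divides the miss rate by √2, so off-chip traffic still grows by roughly √2 per generation. The five-generation horizon and the normalisation are arbitrary choices made only for this illustration.

/* Illustrative arithmetic only: 2x cores per generation, miss rate divided
 * by sqrt(2) when the cache size is doubled [1]. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double traffic = 1.0;                  /* normalised off-chip bandwidth demand */
    for (int gen = 0; gen <= 4; gen++) {
        printf("generation %d: relative off-chip bandwidth %.2f\n", gen, traffic);
        traffic *= 2.0 / sqrt(2.0);        /* 2x cores, miss rate / sqrt(2) */
    }
    return 0;
}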

1.2 Objective

Memory latency and bandwidth are two important metrics when designing a multiprocessor
system. While the latency is the (round-trip) time from issuing a read/write request to memory
until the data is returned, the bandwidth is the amount of data that can be transferred per
unit of time. The proximity of multiple processors makes interprocessor communication much
cheaper, but providing enough memory bandwidth for all processors to function becomes a
serious problem. This is because, if the cache hierarchy is not designed appropriately, one can incur
a significant number of off-chip references (accesses), which can be very expensive from both
performance and power perspectives.

With modern CPUs having 16 KB to 64 KB Level 1 instruction cache sizes, operating system
code is too large to fit in the cache. Current chip design trends for improving processor
performance are leading to thread-parallel architectures, where multiple threads run simulta-
neously on a single chip via multiple on-chip processor cores (chip multiprocessors, or CMP)
and/or multiple simultaneous threads per processor (SMT).

To be able to fit more cores on a single chip without overheating, and also save time in hardware
verification, chip designers are expected to use simple processor cores as building blocks. One
such example is Sun's UltraSPARC T1, which uses up to eight cores on a single chip with
four threads per core. The instruction cache size of these cores is not expected to grow. For
example, the UltraSPARC T1 features a 16KB Level 1 instruction cache per core, which is
the same size as in the first UltraSPARC chip, introduced 10 years ago. Moreover, SMT chips
already operate on a reduced effective instruction cache size, since the instruction cache is shared
among all simultaneous threads. In future processors, the combined effect of larger and shared
Level 2 cache sizes and small Level 1 instruction caches will make instruction cache stalls the
key performance bottleneck.

Although a great deal of research has been devoted to caches, the topic remains important and
there is still room for innovation. While uniprocessor architectures are well understood, this is
not the case for embedded multiprocessor systems and especially for multithreaded chip
multiprocessors (CMT), i.e. multiple on-chip processor cores with multiple simultaneous threads per core.


The goal and contribution of this thesis is to study and understand the effect of instruction
locality on the performance and scalability of multiprocessor systems. The study includes the
following parts:

1. Investigate the instruction locality issue of embedded multiprocessor systems:

• Design and write test programs with configurable instruction locality, which can be
used for systems with different numbers of cores.
• Analyse the performance (e.g. IPC) of multiprocessor architectures for various in-
struction localities, cache sizes, and numbers of processors.

2. Explore performance/cost optimization possibilities:

• Analyse the tradeoff between using a shared Level 2 cache and TCM (Tightly Coupled
Memory).

Chapter 2

The Instruction Reuse Challenge

The concept of Instruction Reuse provides the foundation of this thesis. First, an introduction to
the main concepts used throughout this thesis is given. Then, reuse-distance analysis is presented
and the proposed Instruction Reuse concept for measuring instruction locality is described in
detail.

2.1 Background

2.1.1 MPSoC Classification

Depending on the combination of processors, memory and operating system, multiprocessor


systems are divided into two major categories: symmetric multiprocessing and assymetric mul-
tiprocessing.

Symmetric Multiprocessing (SMP)

SMP is a homogeneous topology, which means processors share a common instruction set archi-
tecture (ISA) and have a common view of the rest of the system resources, including a shared
memory architecture. In SMP mode, a single operating system runs on all processors, which
access a single image of the operating system in memory. In Figure 2.1 a diagram of an SMP
system is shown.


Figure 2.1: Symmetric multiprocessing (SMP)

The operating system (OS) is responsible for dynamically distributing tasks across the pro-
cessors, managing the ordering of task completion, and controlling the sharing of all resources
between the cores. Thus, processes or threads can be assigned and reassigned to different
processors depending on processor loading. Moreover, porting of applications developed for
single-processor systems to SMP systems is easy and load balancing algorithms are efficient in
making maximum use of the available processing power.

The major disadvantage of the SMP approach is that as the number of processors increases,
the communication overhead becomes dominant and the shared memory cannot support the
bandwidth demands for all processors. Each additional processor in the system increases the
amount of time load balancing algorithms spend assessing load conditions, deciding task assign-
ments and transferring tasks between processors. Moreover, the shared communication medium
quickly becomes a bottleneck. As a result, SMP systems typically do not scale to more than
about 8 processors.

Moreover, the SMP behavior is non-deterministic. This means that critical software functions
cannot be guaranteed to execute with a certain response time because execution time is highly
dependent on the system's current state and load distribution. Without a guaranteed response
time, the SMP approach does not meet the needs of real-time systems.

The SMP approach also cannot be implemented in a heterogeneous system. The software de-
pends on each processor having the same instruction set architecture and identical resources
available to it, including the operating system it is running, so that tasks can be readily in-
terchanged. Multiprocessor systems that have different processors to handle different types of
tasks simply cannot run an SMP operating system, nor can SMP be constructed using different
operating systems on each core.


Figure 2.2: Asymmetric Multiprocessing (AMP)

Asymmetric Multiprocessing (AMP)

Asymmetric multiprocessing differs from symmetric multiprocessing in allowing the use of het-
erogeneous processors and operating systems as well as the homogeneous environment supported
by SMP. In AMP mode, different operating systems run on different processors from private local
memories. These processors are specialized for certain tasks by having different instruction set
architectures and communicate with each other through shared-memory and message passing.
In Figure 2.2 a diagram of an AMP system is shown.

The advantage of AMP is that the memory bandwidth available to each processor is increased
and the latency of accesses to local memory is reduced. Moreover, the processors spend less time
handshaking with each other. This enables designs to scale to much larger numbers of processors
than SMP does.

AMP performs selective load balancing, allowing the designer to permanently assign some tasks
to a fixed processor while allowing others to be load-balanced among many processors. This means
that the application can be made deterministic in those areas where system response is critical.

The major disadvantage is that communication between processors is much more complex and
it requires more effort from the software side. It is up to the programmer to make sure the
processors are being utilized to their maximum potential and to worry about whether a processor
can complete a certain task and how to make the processors communicate effectively to distribute
tasks accordingly.

Moreover, since application partitioning and mapping is an NP-hard problem, designers cannot
easily port their applications over from earlier generations to an AMP system. They must decide
which components need to be fixed and which can be distributed, and map them to processors
accordingly.


Figure 2.3: Memory Hierarchy

2.1.2 The principle of locality

The principle of locality is an empirically observed phenomenon that has numerous practical
implications. The basic observation is that programs tend to reuse data and instructions they
have used recently. A widely held rule of thumb is that a program spends 90% of its execution
time in only 10% of the code [16].

An implication of locality is that we can predict with reasonable accuracy what instructions and
data a program will use in the near future based on its accesses in the recent past. The principle
of locality also applies to data accesses, though not as strongly as to code accesses.

Two different types of locality have been observed:

• Temporal locality states that recently accessed items have a high probability to be accessed
in the near future.

• Spatial locality says that items whose addresses are physically near an item recently ac-
cessed have a high probability of being accessed in the near future.

The principle of locality, together with the higher speed of smaller memories, led to memory
hierarchies based on memories of different speeds and sizes [15].

Figure 2.3 shows a multilevel memory hierarchy, including typical sizes and speeds of access. As
we move farther away from the processor, the memory in the level below becomes slower and
larger. Since fast memory is expensive, a memory hierarchy is organized into several levels with
each being smaller, faster, and more expensive per byte than the next lower level. The goal is


Figure 2.4: Cache block diagram

to provide a memory system with cost per byte almost as low as the cheapest level of memory
and speed almost as fast as the fastest level.

The importance of the memory hierarchy has increased with advances in performance of pro-
cessors. Cache memories have become a major means of bridging the gap between main memory
access time and the faster clock rates of current processors. Cache behavior has become
one of the major factors affecting application performance. Since memory access times improve
much more slowly than processor speeds, performance is bound by instruction and data cache misses
that cause expensive main-memory accesses.

2.1.3 Caches

To hide the slowness of the main memory, caches are used. Caches are fast but small memories
between the processor and the main memory. In order to achieve high performance, data should
be found in the cache most of the time. However, because of the limited capacity of the cache,
cache misses occur.

A cache miss occurs when a word is not found in the cache by the processor. The word must
be fetched and placed in the cache before continuing. Because of spatial locality, multiple words
called a block or line are moved at one time. Since the cache is much smaller than main memory,
a key design decision is where cache blocks (lines) can be placed in the cache. The most popular
scheme is set associative, where a set is a group of blocks in the cache.

Figure 2.4 shows the structure of a cache. A cache block is first mapped onto a set, and then
the block can be placed anywhere within that set. Finding a block consists of first mapping the


block address to the set, and then searching the set to find the block. The set is chosen by the
address of the data:

(Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The end points
of set associativity have their own names:

• A direct-mapped cache has just one block per set, thus a block is always mapped to the
same location.

• A fully associative cache has just one set, thus a block can be placed anywhere.
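As a concrete illustration of the set-mapping formula above, the C sketch below decomposes an address for the 16 KB, 4-way, 32-byte-line configuration of Table 1.1 (16 KB / (4 ways × 32 B) = 128 sets, so 5 offset bits and 7 set-index bits). The example address is made up for the illustration.

/* Illustrative sketch: tag / set / offset decomposition for the cache of Table 1.1. */
#include <stdio.h>

int main(void)
{
    unsigned int addr   = 0x80004A64u;           /* made-up example address         */
    unsigned int offset = addr & 0x1Fu;          /* byte within the 32-byte line    */
    unsigned int set    = (addr >> 5) & 0x7Fu;   /* (block address) MOD 128         */
    unsigned int tag    = addr >> 12;            /* identifies the block in its set */

    printf("offset = %u, set = %u, tag = 0x%X\n", offset, set, tag);
    return 0;
}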

The replacement algorithm is the process used to select one of the blocks in a given set for
occupation by a newly referenced block. The important schemes include LRU (Least Recently
Used), random, and FIFO (First In First Out).

Most cache designs also assume demand fetch and write allocate. Demand fetch means that
a block is fetched from memory into the cache only on a cache miss; and write allocate is the
policy where an entire block is fetched into the cache on a write into the block if it is absent
from the cache.

One measure of the benefits of different cache configurations is miss rate. Miss rate is simply
the fraction of cache accesses that result in a miss, i.e. the number of accesses that miss divided
by the number of total accesses.

To gain insights into the causes of high miss rates, the three Cs model sorts all misses into three
simple categories [16]:

• Compulsory: The very first access to a block cannot be in the cache, so the block must
be brought into the cache. Compulsory misses are those that occur even if you had an
infinite cache.

• Capacity: If the cache cannot contain all the blocks needed during execution of a program,
capacity misses (in addition to compulsory misses) will occur because of blocks being
discarded and later retrieved.

• Conflict: If the block placement strategy is not fully associative, conflict misses (in addition
to compulsory and capacity misses) will occur because a block may be discarded and later
retrieved if conflicting blocks map to its set.

To exploit the principle of locality, cache designs are adding more cache levels and dynamic
configuration control. It is common in today's designs to have two or three levels of cache
memory. As the memory hierarchy becomes deeper and more adaptive, its performance will
increasingly depend on our ability to predict instruction and data locality.


2.2 Reuse-distance analysis

The reuse distance is a metric for the cache behavior of programs. A large reuse distance
indicates a high probability of cache misses. A low reuse distance indicates good temporal
locality and thus a high probability of cache hits.

Reuse-distance analysis predicts program locality by experimentally determining locality prop-
erties as a function of the data size of a program, allowing accurate locality analysis when the
program's data size changes.

Prior work has established the effectiveness of reuse distance analysis in predicting program
locality over a wide range of data sizes. Ding et al. [9, 24] have proposed techniques to
predict the reuse distance of memory references across all program inputs using a few profiling
runs. They use curve fitting to predict reuse distance (the number of distinct memory locations
accessed between two references to the same memory location) as a function of a program's data
size. By quantifying reuse as a function of data size, the information obtained via a few profiled
runs allows the prediction of reuse to be quite accurate over varied data sizes. Ding et al. have
used reuse-distance predictions to accurately predict whole-program miss rates [24, 5].

The most obvious application of reuse distance is prefetching for those memory operations that
cause the most misses. Both hardware and software prefetching may issue many unneces-
sary prefetches. Hardware could be constructed to use reuse-distance information to schedule
prefetches dynamically for important instructions.

Knowledge from reuse-distance analysis can be used to reduce capacity misses. Since most of
the cache misses are capacity misses, to eliminate a capacity miss the reuse distance must be
made smaller than the cache size. On the hardware level, this can be done by increasing the
cache size. As a result, the probability of hitting the cache will increase for references with long
reuse distances.

Moreover, reuse-distance analysis may also be used in architectural optimization via compiler
hints to gain a more global view of the expected behavioral patterns of a program. On the com-
piler and algorithmic level, the cache size cannot be changed, but the program or the algorithm
can be changed, so that fewer long reuses occur. On the compiler level, the most well-known
techniques are loop tiling and loop fusion.

At the algorithmic level, one has more freedom to restructure the program than at the compiler
level. Also the programmer has a better understanding of the global program structure. There-
fore, the programmer can decide to use different algorithms to decrease long reuse distances.
However, it is difficult to know exactly where in the code the bad data locality occurs, so
instrumentation and visualization of the program can help the programmer to pinpoint the hot
spots.


2.2.1 Instruction Reuse Distance

In 1970, Mattson et al. studied stack algorithms in cache management and defined the concept
of stack distance [17]. Instruction Reuse Distance (IRD) is the same as LRU stack distance or
stack distance using LRU (Least Recently Used) replacement policy.
Whenever a memory location is used multiple times throughout program execution (i.e. it is
reused), cache hits may result if the corresponding instruction stays in the cache between the
different accesses to it. However, when the reuses are separated by accesses to a lot of other
different instructions, the probability that it remains in the cache between use and reuse is low.
By ordering the instruction memory accesses of a program execution by logical time, we obtain
a program trace. In a sequential execution, reuse distance is the number of distinct instructions
executed between two consecutive executions of the same instruction (i.e. between use and
reuse).
Instruction reuse distance measures the volume of the intervening instructions not the time
between two executions. While time distance is unbounded in a long-running program, reuse
distance is always bounded by the code footprint. Moreover, the reuse distance is a property
of the trace and is independent of hardware parameters. In Table 2.1 an example of how to
compute the instruction reuse distance is shown.

Time 1 2 3 4 5 6 7 8 9
Memory Address A1 A2 A3 A4 A2 A3 A4 A1 A1
Instruction I1 I2 I3 I4 I2 I3 I4 I1 I1
Reuse-distance of I1 ∞ 3 0
Reuse-distance of I2 ∞ 2
Reuse-distance of I3 ∞ 2
Reuse-distance of I4 ∞ 2

Table 2.1: Instruction Reuse Distance example

When an instruction has reuse distance d, exactly d distinct instructions were executed since its
previous execution. If d is smaller than the number of instructions that can fit in the cache, the referenced
instruction will be found in a fully associative cache. Conversely, if d is larger, the referenced
instruction will result in a cache miss. If the reuse distance is zero, the referenced instruction
will always result in a cache hit.
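To illustrate the definition, the C sketch below computes the reuse distances of the trace in Table 2.1 with an LRU stack; the numeric trace encoding (1..4 for I1..I4), the fixed-size stack, and the use of -1 for an infinite distance are assumptions made only for this example, not part of the thesis measurement flow.

/* LRU-stack computation of instruction reuse distance for the Table 2.1 trace. */
#include <stdio.h>

#define TRACE_LEN 9
#define MAX_DISTINCT 64

int main(void)
{
    int trace[TRACE_LEN] = {1, 2, 3, 4, 2, 3, 4, 1, 1};
    int stack[MAX_DISTINCT];                 /* most recently used entry first */
    int depth = 0;

    for (int t = 0; t < TRACE_LEN; t++) {
        int insn = trace[t];
        int pos = -1;                        /* stack position = reuse distance, -1 = infinity */
        for (int i = 0; i < depth; i++)
            if (stack[i] == insn) { pos = i; break; }

        printf("t=%d I%d reuse distance = %d\n", t + 1, insn, pos);

        /* move the instruction to the top of the LRU stack */
        int start = (pos < 0) ? depth++ : pos;
        for (int i = start; i > 0; i--)
            stack[i] = stack[i - 1];
        stack[0] = insn;
    }
    return 0;
}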
The classification of a miss into compulsory, conflict, or capacity misses is easily made using the
reuse distance (a short sketch in C follows the list below):

• A compulsory miss has an infinite reuse distance, since it was not previously referenced.

• If the reuse distance is smaller than the cache size, it is a conflict miss, since the same
reference would have been a hit in the fully associative cache.

• When the reuse distance is larger than the cache size, it is a capacity miss, since the
reference also misses in a fully associative cache.
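The sketch announced above is a minimal C helper that applies this classification to a single reference known to have missed in the real (set-associative) cache; the enum name, the capacity unit (instructions), and the encoding of an infinite distance as -1 are illustrative choices.

typedef enum { COMPULSORY, CONFLICT, CAPACITY } miss_kind_t;

miss_kind_t classify_miss(long d, long cache_capacity)
{
    if (d < 0)
        return COMPULSORY;      /* never referenced before                  */
    if (d < cache_capacity)
        return CONFLICT;        /* would have hit a fully associative cache */
    return CAPACITY;            /* misses even in a fully associative cache */
}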


Figure 2.5: Instruction Reuse Distance Histogram

2.2.2 Instruction Reuse Distance Histograms

A reuse distance histogram summarizes the locality of a program execution and is important for
cache performance prediction [13, 12, 23], reference affinity detection [25], and data reorganiza-
tion [8].

The reuse distance histogram is a histogram showing the distribution of reuse distances in an
execution. Each bar in the histogram shows the fraction of total instructions executed with
a certain reuse distance. The X-axis is the instruction reuse distance, and the Y-axis is the
fraction of total references.

An example of an instruction reuse distance histogram is shown in Figure 2.5. The fraction of
references to the left of the mark Cache Capacity will hit in a fully associative cache having
the capacity indicated by the dotted line. For set-associative caches, reuse distance is still an
important indicator of cache behavior; the probability of a conflict miss was determined in [1].
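As a small illustration of how such a histogram can be used, the C sketch below reads the predicted hit ratio of a fully associative LRU cache off a reuse-distance histogram. The data layout (hist[d] holding the count of references with finite distance d, a separate count of compulsory references) is an assumption made for this example.

/* Predicted fully associative hit ratio from a reuse-distance histogram. */
double predicted_hit_ratio(const unsigned long *hist, int max_distance,
                           unsigned long cold_misses, int capacity)
{
    unsigned long hits = 0, total = cold_misses;

    for (int d = 0; d <= max_distance; d++) {
        total += hist[d];
        if (d < capacity)
            hits += hist[d];     /* left of the cache-capacity mark: a hit */
    }
    return total ? (double)hits / (double)total : 0.0;
}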


Figure 2.6: Instruction Reuse

2.3 Our approach

Cache performance depends on program locality, which changes from program to program and
also from input to input for the same program. For example, different inputs may require the
execution of different routines with diverse locality profiles. In addition, different programs
usually have different instruction reuse-distance histograms. For these reasons, measurements
of cache performance are application or program specific.

Moreover, using cache miss-ratios as a reference to compare the performance of different systems
requires knowledge of implementation details such as cache line size or set-associativity. Such
knowledge is not available in early stages of the design.

In order to abstract from implementation details and to obtain application/program-independent
results, we introduce the notion of Instruction Reuse as a method to measure instruction
locality.

Definition:
Instruction Reuse is the projection of a particular program execution on a particular cache
configuration.

As shown in Figure 2.6, different application-cache combinations can have the same projection,
i.e. Instruction Reuse. Thus, using Instruction Reuse as a reference for measuring system perfor-
mance, systems with different applications and/or cache configurations can be compared. This
means that results based on Instruction Reuse values are not specific to a particular instruction
reuse-distance histogram or cache configuration. It is the effect of the combination of these two
components that we study.


Figure 2.7: Instruction Reuse increases, as the cache capacity increases from 4 to 7 instructions.

We measure Instruction Reuse as:


Instruction Reuse = (Number of instructions executed) / (Number of instructions loaded in the Level 1 cache)

The Number of instructions executed is a parameter that only depends on the applica-
tion. On the other hand, the Number of instructions loaded in Level 1 cache depends
both on the application (represented by its instruction reuse-distance histogram) and the cache
configuration.

For the example given in Table 2.1, the Instruction Reuse is equal to 9/4, assuming the cache
capacity is larger than 4 instructions, i.e. once loaded from memory, instructions I1 . . . I4 can be
reused from the cache.
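In practice, the denominator has to be derived from the simulator's cache counters. The C sketch below shows one possible way to do this; the function and parameter names are ours, and the assumption that each instruction-cache line fill loads line_size/insn_size instructions (32 B / 4 B = 8 for the configuration used in Chapter 4) only holds if whole lines are fetched on a miss and is not stated by the VaST tools.

/* Sketch: Instruction Reuse from an executed-instruction count and an
 * instruction-cache line-fill count (assumed mapping, see above). */
double instruction_reuse(unsigned long long insns_executed,
                         unsigned long long icache_line_fills,
                         unsigned line_size_bytes,
                         unsigned insn_size_bytes)
{
    unsigned long long insns_loaded =
        icache_line_fills * (line_size_bytes / insn_size_bytes);
    return insns_loaded ? (double)insns_executed / (double)insns_loaded : 0.0;
}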

Different Instruction Reuse values can be obtained either by:

• fixing a particular application profile and varying the cache configuration (size, associativ-
ity, replacement policy) or

• by keeping a particular cache configuration fixed and varying the application profile.

By application profile, we mean the instruction reuse-distance histogram. In this thesis, the
former approach is used in order to obtain a large range of Instruction Reuse values.

In Figure 2.7 this idea is illustrated graphically. When the cache size increases, as indicated
by the arrow, more instructions will fit inside the cache and thus the number of instructions
loaded in the Level 1 cache will decrease. Since the total number of instructions executed by the


processors is independent of the cache configuration and thus remains constant, the Instruction
Reuse will increase.

To investigate the effect of instruction locality on the performance and scalability of multipro-
cessor systems, our approach consisted of the following steps:

1. Model a multiprocessor system-on-chip with a configurable number of CPUs using virtual
prototyping technology.

2. Model the instruction reuse-distance histogram of a typical application, which will serve
as a benchmark for comparison with future real system measurements.

3. Measure the Instruction Reuse values resulting from the modelled reuse-distance histogram
in combination with different Level 1 cache sizes.

4. Measure multiprocessor system performance and scalability for the measured Instruction
Reuse values.

Before describing the experiment architecture, the virtual prototyping technology used for mod-
elling both the hardware and software parts of the design is introduced.

Chapter 3

Virtual Prototyping Technology

3.1 Virtual System Prototypes

A Virtual System Prototype (VSP) is a model of a complete embedded system including software.
The hardware platform component of the VSP is called a Virtual Prototype. Characteristics
such as performance and power for a complex system cannot be represented and computed as
a formal mathematical problem. The only realistic solution for determining such characteristics
is through simulation.

One option for this simulation is hardware acceleration and/or emulation. Unfortunately, in
addition to providing only limited visibility into the inner workings of the system, the highest
level of abstraction supported by these solutions is the register transfer level (RTL) representation.
As a result, development and evaluation cannot commence until a long way into the design cycle,
when the hardware portion of the design is largely completed. In turn, this limits the design
team's ability to explore and evaluate the hardware architecture. In addition,
FPGA implementations of processors are typically slow, executing software at around 1 MIPS,
about 50 times slower than a virtual processor model of the same processor [??].

A VSP is a pure software model of the entire system: that is, the combination of the virtual
prototype and the software that will run on it. Fully evaluating the characteristics of a complex
system may require performing many hundreds of experiments on various system configurations.
Furthermore, it is not unusual for a single simulation to require that 100 billion instructions be
run to reproduce a problem or to compute a representative result. This represents less than one
hour of simulation time using a high performance, timing-accurate VSP. By comparison, the


same software simulation would take between 100 and 500 hours or more using a typical timing-
accurate structural instruction set simulator model and 100,000 hours or more using an RTL
model.
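As a rough cross-check of these figures, the C sketch below converts a 100-billion-instruction run into wall-clock hours at assumed simulation speeds (50 MIPS for the VSP, 0.2 MIPS for a structural ISS, 300 instructions per second for RTL); the speeds are illustrative values chosen only to be consistent with the ranges quoted above.

/* Wall-clock hours needed to simulate 100 billion instructions. */
#include <stdio.h>

int main(void)
{
    const double insns  = 100e9;
    const double mips[] = {50.0, 0.2, 0.0003};   /* VSP, structural ISS, RTL (assumed) */
    const char *label[] = {"VSP (~50 MIPS)", "ISS (~0.2 MIPS)", "RTL (~300 IPS)"};

    for (int i = 0; i < 3; i++)
        printf("%-16s %10.1f hours\n", label[i], insns / (mips[i] * 1e6) / 3600.0);
    return 0;
}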

System architects use VSPs to explore the optimum architecture while software developers use
the VSP to develop software before hardware is available. Overall, simulation speed proves to be
more important for SW developers, but accuracy (in terms of processor and bus cycles) is more
important for the hardware architects. Nevertheless, software developers often also require high
degrees of accuracy, for example in real-time critical inter-processor communication. In turn,
the system architects are moving towards SW-driven architecture analysis and optimization
strategies, in which real SW loads are used as stimulus for the architectural exploration.

An effective and efficient VSP simulation system must have the following characteristics:

• Near Silicon Speed: The solution must be fast enough so that the real software applications
written for the SoC may be run on the VSP, including the operating system (OS) and any
target application that may run on top of the OS.

• Complete System: The solution must model and simulate the whole system (processors,
buses, peripherals, external hardware).

• Cycle-Accurate: The solution must retain accuracy; i.e., the simulated hardware must have
timing associated with it that is reflected in the real hardware. This must also include
asynchronous events and multiple clock domains.

• Model Library: For the purpose of architecture design productivity and efficiency, the
system should offer a portfolio of processor, bus, and peripheral models.

• High-Speed Modeling Method: A proven modeling method that supports simulation results
orders of magnitude faster than traditional RTL simulations must exist, by which high-
speed, system-level modules of custom hardware are modeled in the VSP.

• Binary Compatibility: The solution must be capable of using the same target images that
will be executed by real hardware for execution by the modeled processor; that is, binary
compatibility between simulated and actual. The solution must also provide the capability
to use commercial debugging and development tools for those applications.

• Configurable: The solution must include run-time configurability for as many parameters
as possible, i.e., no recompilation should be necessary in order to run experiments with
different parameters such as the cache size of the processor models.

• Visibility: The solution must make available data mining statistics and events that occur
in the hardware system. For example, the VSP must be able to track things like instruction
counts, cache statistics (hits, misses, fetches) and bus transactions.


Figure 3.1: Virtual Platform

3.2 Virtual Platforms

Virtual Platforms which contain the underlying models of the system are the building blocks
of a Virtual System Prototype. As shown in Figure 3.1, a simple VSP usually consists of a
single Virtual Platform containing one or more virtual devices: Virtual Processor Models (that
model the actual processor and execute software embedded within the product), Virtual Memory
Models, Peripheral Device Models (that emulate the functionality of the peripheral hardware
devices) and interconnections.

Virtual Processor Models


A Virtual Processor Model (VPM) emulates the behavior of the physical processor running the
software written and compiled for that processor. A Virtual Platform can contain one or more
VPMs.
A VPM runs the actual target code that is designed for the physical processor. This means that
target software can be developed, executed and debugged exactly as with a physical prototype.
However, using a VPM provides greater control and flexibility than with an actual processor.
With a VPM, internal processor resources such as the cache size or the number of TLB entries can be
configured, which would be impossible with a physical processor. Thus, the performance of
the target code on various configurations of the processor can be analyzed. Moreover, after
porting the software to a new VPM, the performance and suitability of various processors can
be compared.
In order to accurately simulate the effects of the physical processor, a VPM has to be instruction
cycle, bus cycle, and register cycle accurate.


Peripheral Device Models

A Peripheral Device Model (PDM) emulates the behavior of a physical device in the hardware
architecture, such as interrupt controller, clock generator, etc. Peripheral Device Models can
connect directly to other PDMs or interface to Virtual Processor Models using interconnections
such as bus connections or asynchronous connections.

Proprietary (pre-built) device models can be used within a VSP. However, some Peripheral
Device Models are unique for each platform and must be developed to suit the architecture.

3.3 VaST Systems Tools

Figure 3.2: CoMET window

VaST Systems is an Electronic Design Automation company, which builds and markets
system-level design tools and intellectual property to support the engineering of virtual system
prototypes. It was founded by Graham Hellestrand, a professor of computer science and
engineering at the University of New South Wales, Australia.

CoMET from VaST Systems was used to implement our experimental virtual multiprocessor
system and to evaluate its performance. Some of the features of CoMET are:

• It has high performance, typically 20-100 MIPS, depending on the complexity of the plat-
form and the performance of the host PC. Therefore it is possible to run real applications
at near real-time speeds.


• The simulation technology is cycle accurate.

• A library of models of commercially available processors, bus architectures, and peripheral
devices is provided.

• Target images may be specified for each processor core in the design and third-party
debuggers (such as Lauterbach T32) are supported so that users may use their standard
environment to debug software in the virtual environment.

• Virtual processor model parameters (such as cache size, processor frequency) may be
specified at run-time and therefore no recompilation of the system model is necessary.

• Through its Metrix profiling tool, CoMET enables tracing of system events so that system
performance may be evaluated.

The CoMET window is shown in Figure 3.2. A VSP is constructed by adding instances of
component modules in a hierarchical structure. Module instances, nets, and port connections are
added or edited using the XML standard view. Target software code for the ARM processors
can be created and compiled within the CoMET environment.

Metrix is a component of CoMET which provides non-intrusive performance monitoring
capabilities for the entire VSP, including the virtual processor models, buses, and peripheral
device models. Metrix consists of three components:

• VPM Metrix that provides access to the Virtual Processor Models (VPMs) in a Virtual
System Prototype. VPMs have features not available with the actual hardware such as
visibility. They can provide details of the instruction path, registers, memory, and cache
usage while executing, and can issue reports that summarize such activity over a user-
determined period.

• Bus Metrix that allows triggering and monitoring of bus accesses.

• Net Metrix that allows the monitoring of logic, 32-bit vector and clock nets defined within
the module hierarchy.

The output of a Metrix VPM can look as follows:

VpmCtrl Counter 1
VpmCtrl 7733 Total Instructions Executed, Using
VpmCtrl 12965 Cycles, with
VpmCtrl 0 Inst Page Table Walks, and
VpmCtrl 0 Data Page Table Walks
VpmCtrl 0 Page Table Walks
VpmCtrl
VpmCtrl Data Read Access Counts
VpmCtrl 3336 Total
VpmCtrl 3304 - Cache Hits
VpmCtrl 19 - Cache Miss Allocate new line in cache
VpmCtrl 13 - Cache Miss no allocate, Uncached Region
VpmCtrl 0 - Cache Miss no allocate, Cache Disabled
VpmCtrl 0 - Cache Miss, All ways locked

VpmCtrl 0 - Cache Miss, Hit Pending Buffer
VpmCtrl 0 - TLB Abort - Read access denied
VpmCtrl
VpmCtrl Data Write Access Counts
VpmCtrl 1326 Total
VpmCtrl 1189 - Cache Hits
VpmCtrl 0 - Cache Miss allocate new line in cache
VpmCtrl 100 - Cache Miss no allocate, Uncached Region
VpmCtrl 0 - Cache Miss no allocate, Cache Disabled
VpmCtrl 37 - Cache Miss no allocate, No write allocate
VpmCtrl 0 - Cache Miss, All ways locked
VpmCtrl 0 - Cache Miss, Hit Pending Buffer
VpmCtrl 0 - Cache Hit with Write Through attribute
VpmCtrl 0 - TLB Abort - Write access denied
VpmCtrl
VpmCtrl Inst Access Counts
VpmCtrl 0 - Pipeline Stall Ticks, Cache Miss
VpmCtrl 0 - TLB Abort - Execute access denied
VpmCtrl
VpmCtrl Cache Counts
VpmCtrl 19 Data Cache Line Fill
VpmCtrl 0 Data Cache Write Back (sub line)
VpmCtrl 24 Instruction Cache Line Fill

VpmCtrl 0 Inst. Line Not Cacheable

Chapter 4

Experiment System Architecture

The final purpose of this thesis is to model a workable multiprocessor system for evaluating the
effect of Instruction Reuse on the performance and scalability of multiprocessor systems. There-
fore, one of the primary goals of the chosen system architecture was to be easily configurable
and scalable, and at the same time to gather the necessary experience for building a real system in
the future.

This chapter describes details about the modelled virtual multiprocessor architecture. The
processor, memory system, and buses are discussed separately in the Hardware Architecture
section, while the target code running on the processor is described in the Software Architecture
section. The whole system was modelled with the VaST CoMET virtual prototyping tools
described in Chapter 3.


4.1 Hardware Architecture

Figure 4.1: Experiment System Architecture

Processor No. of CPUs 1 to 8


Instruction Size 32-bit
L1 Cache Latency one CPU clock cycle
Instruction Cache Size per CPU 4 KB to 512 KB
Data Cache Size per CPU 4 KB to 512 KB
Set Associativity 4-way
Line Size 32 Bytes
Instruction Cache virtually indexed, physically tagged
Data Cache physically indexed, physically tagged
L2 Cache Latency 6 clock cycles
Size 16 KB to 1 MB
Set Associativity Direct Mapped
Line Size 32 Bytes
Memory Latency 28 CPU clock cycles
CPU to MEM Frequency Ratio 2:1
CPU to MEM Bandwidth Ratio 2:1

Table 4.1: Experiment System Configuration


A block diagram of the experiment system architecture is shown in Figure 4.1. It contains the
following device modules that simulate the functionality of the hardware devices indicated in
parentheses:

• VaST Arm11MPCore (Arm11MPCore)

• VaST ARM L220 Cache Controller (Arm L220 Cache Controler)

• VaST ARM AXI PL300 Interconnect (ArmAxiPl300)

• VaST ARM AXI PL340 Memory Controller (ArmAxiPl340)

• VaST Gp Memory (GenericMemory)

• VaST StdBus AXI (AMBA 3 AXI Protocol)

• VaST StdBus AHB (AMBA 3 AHB Protocol)

• VaST StdBus APB (AMBA 3 APB Protocol)

4.1.1 Arm11 MPCore

Figure 4.2: Arm11 MPCore block diagram

The processor is the most important element in a multiprocessor system, because it influences
both the hardware and software design. It determines the hardware interfaces, which connect
the processor to the rest of the system and it influences the choice of the operating system or


the structure and functionality of the applications running on it.

The Arm11MPCore was chosen for the following reasons:

• Can be configured to contain between one and four processors.

• Both data and instruction cache can be configured individually across each processor with
support for full data coherence.

• Ability for data to move between each processor's cache, permitting rapid data sharing
without accesses to main memory.

• Either dual or single 64-bit AMBA 3 AXI bus interfaces providing high bandwidth.

• Support for both asymmetric multiprocessing (AMP) and symmetric multiprocessing
(SMP) programming.

• Designed for low power by providing gate level shutdown of unused resources and sup-
porting the ability for each processor to go into standby, dormant or power off energy
management states.

The Snoop Control Unit

A block diagram of the Arm11MPCore processor is shown in Figure 4.2. The Snoop Control Unit
(SCU) is a key component for the MPCore solution as it interfaces up to four multiprocessing
CPUs with each other and with an L2 memory system. Individual CPUs can be dynamically
configured as operating in a symmetric (SMP mode) or asymmetric (AMP mode) manner, i.e.
taking part in the L1 coherency or not. The SCU manages the coherent traffic between CPUs
marked as being part of the coherent system and routes all non-coherent/instruction traffic
between CPUs and L2 memory through its dual 64-bit AMBA AXI ports. In order to limit the
number of requests to individual CPUs, the SCU contains a duplicate of all CPU L1 physical
tag RAMs, so it sends a coherent request only to the CPUs that contain the target data line.

The MPCore implements a modified MESI write invalidate protocol. MESI stands for the 4
possible states of a data line in a CPU cache:

• Modified: The data line is in one CPU cache only and has been written to.

• Exclusive: The data line is in one CPU cache only and has not been modified.

• Shared: The data line is in multiple CPUs' caches and has not been modified.

• Invalid: The cache line is empty or the line can be present in another CPU’s cache in the
modified state.


The MESI protocol has been modified in order to reduce the amount of coherency commands
and data traffic on the bus. When a CPU reads a data line and allocates it into its cache, the
standard MESI protocol states that the line should be placed in the Shared state, whether it is
already in another CPU cache or not, and only later moved to the Exclusive state if the CPU
requests it. In MPCore, if the data is not in any other CPU cache, the data line is marked as
Exclusive from the start, removing the need for an additional coherency command.

Another optimization is known as Direct Data Intervention (DDI), which consists of passing
data directly from one CPU to another, without having to request the data from the Level 2
memory. If the data line in the source CPU was in the Modified state, the data is still written to
Level 2 (the data line then moves to the Shared state), but the destination CPU gets its data
directly from the source CPU, without having to perform an additional request to Level 2 for
the updated data.

The last improvement, called Migratory Lines support, is based on additional logic capable
of detecting that a line is moving across multiple CPUs. Instead of writing the dirty data back
to Level 2 on each line migration, the data line is allocated into the destination CPU Level 1
cache as dirty (Modified state). This prevents unnecessary writes to the Level 2 memory
system until the data line ceases to be migratory and is brought back coherent with Level 2.

In our system, the ARM11MPCore Virtual Processor Model from VaST Systems was used. It
provides all the properties of the real processor while offering more flexibility in terms of
performance analysis. With profiling enabled, a cycle- and instruction-accurate trace of
the application being executed can be obtained. Moreover, cache statistics are recorded, which
is helpful in evaluating the system and locating problems.

The Level 1 Cache

Each MPCore CPU has separate instruction and data caches, which have the following features:

• The instruction and data cache can be configured to sizes between 16KB and 64KB. The
VaST Virtual Processor Model allowed for cache sizing of any power of 2.

• Both caches are 4-way set-associative.

• The cache line length is 8 words or 32 bytes.

• Each cache can be sized or disabled independently, using the CP15 system control copro-
cessor.

Cache operations are controlled through a dedicated coprocessor, CP15, integrated within the
core. This coprocessor provides a standard mechanism for configuring the level one memory
system. The CP15 registers can be accessed with MRC and MCR assembler instructions. The
assembler syntax for these instructions is:

MRC{cond} P15,<Opcode_1>,<Rd>,<CRn>,<CRm>,<Opcode_2>
MCR{cond} P15,<Opcode_1>,<Rd>,<CRn>,<CRm>,<Opcode_2>


Function Assembler Instruction


Instruction cache invalidate MCR p15, 0, R0, c7, c5, 0
Clean and invalidate cache MCR p15, 0, R0, c7, c14, 0
TLB Invalidate MCR p15, 0, R0, c8, c7, 0

Table 4.2: Cache and TLB Operation Functions
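As an illustration of how these operations are issued in practice, the sketch below (GNU assembler syntax) performs the instruction cache invalidate operation from the first row of Table 4.2. The choice of R0 is a convention assumed here for illustration; the register is set to zero before the write, as is usual for this operation.

mov r0, #0                   ; value conventionally zero for this CP15 operation
mcr p15, 0, r0, c7, c5, 0    ; invalidate the entire instruction cache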

CPU synchronization

Some additions to the ARMv6 architecture are implemented in Arm11MPCore for multiprocess-
ing support, such as the 64-bit non-bus-locking exclusive read and write instructions LDREXD
and STREXD. Exclusive loads and stores are a way to implement interprocess communication in
multiprocessor and shared-memory systems, where the load and store operations are not atomic.

Moreover, these instructions rely on the ability to tag a physical address as exclusive-access for
a particular processor. This tag is later used to determine whether an exclusive store to that
address may proceed. The system guarantees that if the data that has been previously loaded
has been modified by another CPU, the store fails and the load-store sequence must be retried.

• LDREX loads data from memory. If the physical address has the Shared TLB attribute,
LDREX tags the physical address as exclusive access for the current processor, and clears
any exclusive access tag for this processor for any other physical address.

• STREX performs a conditional store to memory. The conditions are:

– If the physical address has the Shared TLB attribute, and the physical address is
tagged as exclusive access for the executing processor, the store takes place, the tag
is cleared, and the value 0 is returned in Rd.
– If the physical address has the Shared TLB attribute, and the physical address is not
tagged as exclusive access for the executing processor, the store does not take place,
and the value 1 is returned in Rd.

An example of how a synchronization semaphore can be implemented is provided below:

tryAgain:
ldrex r2, [r1] ; load semaphore and set exclusive
orr r0, r0, r2 ; update the semaphore
strex r2, r0, [r1] ; if still exclusive access then store
cmp r2, #0 ; did this succeed?
bne tryAgain ; no try again


4.1.2 Arm L220 Cache Controller

The addition of an on-chip secondary cache, also referred to as a Level 2 cache, is a recognized
method of improving the system performance when significant memory traffic is generated by
the processor. By definition a secondary cache assumes the presence of a Level 1 cache, closely
coupled or internal to the CPU. Memory access is fastest to Level 1 cache, followed closely by
the Level 2 cache. Memory access is significantly slower with Level 3 memory or main memory.

Memory Type Typical Size Access Time


Processor registers 64 B 1 cycle
Level 1 Cache 32 KB 1-2 cycles
Level 2 Cache 128 KB 8 cycles
Off-chip memory MB or GB 30-42 cycles

Table 4.3: Typical memory sizes and access times

The Cache Controller has the following features:

• Physically addressed and physically tagged

• Fixed line length of 32 bytes (eight words or 256 bits)

• Cache size can be configured from 16KB to 2MB

• Configurable set-associativity from Direct Mapped to 8-way associativity

• Configurable latency from 1-8 cycles

• Designed to work with 64-bit AXI master and slave interfaces

Unlike the Level 1 cache, the L220 Cache Controller is configured using memory-mapped regis-
ters. In our design, the Level 2 cache was configured to be Direct Mapped with a latency of 8
cycles. The cache size was varied from 16KB to 128 KB.
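Because the L220 is configured through memory-mapped registers rather than through CP15, enabling it amounts to ordinary stores to its register block. The following is a minimal sketch in the same assembler style; the base address and the register offset used here are assumptions chosen purely for illustration and are not taken from the experiment platform configuration:

ldr r0, =0xFFF00000          ; assumed base address of the L220 register block (illustrative)
mov r1, #1
str r1, [r0, #0x100]         ; write the (assumed) Control register to enable the L2 cache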

4.1.3 ARM AXI PL300 Interconnect

The PrimeCell AXI Configurable Interconnect (PL300) is a high performance interconnect model
that provides connectivity between one or more AXI Masters and one or more AXI Slaves.

The PL300 supports a full multi-layer connection of all master and slave interfaces on all of the
AXI channels. Multi-layer interconnect enables parallel access paths between multiple masters
and slaves in a system, which increases data throughput and decreases latency.

Write data interleaving enables the interconnect to combine write data streams from different
physical masters, to a single slave. This is useful because you can combine write data from a


fast master with write data from a slow master and consequently increase the throughput of
data across the interconnect.

In any interconnect that is connected to a slave that reorders read or write signals, there is the
potential for deadlock. To prevent this the PL300 provides arbitration priority and three cyclic
dependency schemes that enable the slave interface to accept or stall a new transaction address.

The following list highlights the functionality available:

• Compliant with the AMBA 3 AXI Protocol v1.0 Specification

• Multi-layer capability to allow multiple masters to access different slaves simultaneously

• Automatically connect to buses of varying data width (32, 64, 128 or 256 bits wide)

• Independently configurable number of Slave and Master Interfaces.

• Each slave interface has configurable: Read and Write transaction acceptance, arbitration
priority and cyclic dependency scheme.

• Each master interface has configurable: Read or Combined Issuing capability, Write In-
terleave depth, and Arbitration scheme.

• Supports read and write data interleaving for increased data throughput

4.1.4 ARM AXI PL340 Memory Controller

The PL340 memory controller is a high-performance, area-optimized SDRAM memory controller


compatible with the AMBA 3 AXI protocol.

The following list highlights the functionality available:

• Highly configurable via APB protocol register interface

• Multiple active read and write transactions via AXI protocol Slave Interface

• Timing accurate internal modeling of DRAM devices

• Automatically connects to AXI buses of 4, 8 or 16 bytes data width

• Support for Exclusive Access transactions

Before the PL340 memory controller can be used to access external memory, its internal
configuration registers must be set up and the external memory initialized. Table 4.4 lists the
main configuration values used to model the memory:


Symbol   Memory Cycles   Description

CAS      5               Column Address Strobe latency
tRCD     2               RAS to CAS minimum delay
tRP      2               Precharge to RAS delay
tRAS     9               Row Address Strobe to Precharge delay

Table 4.4: Memory Controller configuration

Figure 4.3: Memory Timing

4.1.5 Memory

High speed memory can increase the speed of the system dramatically. The VaST generic
memory model is used to model memory blocks with configurable timing such as ROM or RAM.
The following list highlights the functionality available:

• Supports Read, Write, Fetch and Load access types

• Supports memory paging.

• Supports exclusive access.

• Connects to AHB, AHB Lite and APB bus protocols.

• Configurable memory width and size.

• Configurable burst read and burst write limit.

• Configurable first read/write delay and next read/write delay.

The memory timing can be configured for:

• InitialRead/InitialWrite delay: indicates the number of bus clock cycles that is inserted
on initiating the first read/write burst to memory. In our system it is set to 1 clock cycle.

• FirstRead/FirstWrite delay: indicates the number of bus clock cycles on data phase that
is inserted for first read/write of a memory width of data in a burst. In our system it is
set to 1 clock cycle.


• NextRead/NextWrite delay: indicates the number of bus clock cycles inserted for each
read/write of a memory width of data in a burst. In our system it is set to 1 clock cycle.

The burst read and burst write limit is set to 8-beats and the memory timing is shown in Figure
4.3.
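With these settings, a rough estimate of the cost of one 8-beat burst read can be made, assuming the initial, first and next delays simply add up (this additivity is an assumption made here for illustration, not a statement taken from the model documentation):

InitialRead + FirstRead + 7 · NextRead = 1 + 1 + 7 · 1 = 9 bus clock cycles per 8-beat burst.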

4.1.6 Buses

The VaST Standard Bus (StdBus) provides an interface to processors, peripheral devices and
memory models and represents the standard concept of address and data phases along with
associated timing in a bus transaction. The following bus protocols were used in our architec-
ture: AXI, AHB and APB. These protocols provide a single interface definition for describing
interfaces:

• between a master and the interconnect

• between a slave and the interconnect

• between a master and a slave.

In order to resolve bus accesses in a multi-master system, the following arbitration algorithms are
used: First Come, Round Robin and Fixed Priority.

AXI Bus Protocol Support

The AMBA 3 AXI protocol supports several functionalities, which make it suitable for high-
performance, high-frequency system designs. The functionalities include:

• separate address/control and data phases

• separate read and write data channels

• burst-based transactions with only start address issued

• out-of-order transaction completion

• ability to issue multiple outstanding addresses

The AXI protocol is burst-based. Every transaction has address and control information on the
address channel that describes the nature of the data to be transferred. The data is transferred
between master and slave using a write data channel to the slave or a read data channel to the
master. In write transactions, in which all the data flows from the master to the slave, the AXI
protocol has an additional write response channel to allow the slave to signal to the master the
completion of the write transaction.
Out-of-order transaction completion means that transactions with the same ID tag are completed
in order, but transactions with different ID tags can be completed out of order. Out-of-order
transactions can improve system performance in two ways:


• The interconnect can enable transactions with fast-responding slaves to complete in ad-
vance of earlier transactions with slower slaves.

• Complex slaves can return read data out of order. For example, a data item for a later
access might be available from an internal buffer before the data for an earlier access is
available.

AHB Bus Protocol Support

The AMBA 3 AHB interface specification enables highly efficient interconnect between simpler
peripherals in a single frequency subsystem where the performance of AMBA 3 AXI is not
required. The functionalities include:

• separate address/control and data phases

• separate read and write data channels

The master starts a transfer by driving the address and control signals. These signals provide
information about the address, direction, width of the transfer, and indicate if the transfer forms
part of a burst. Transfers can be single, incrementing bursts or wrapping bursts that wrap at
the address boundaries.

APB Bus Protocol Support

The APB protocol is used for low-bandwidth transactions, such as accesses to configuration
registers in peripherals and data traffic through low-bandwidth peripherals. It is used to isolate
this traffic from the high-performance AXI and AHB interconnects, and thus to reduce the
power consumption of a design.


4.2 Software Architecture

The main aim of our software architecture is to model the instruction locality of a typical
application. As explained in Section 2.2, an instruction reuse-distance histogram summarizes
instruction locality information and in combination with a configurable cache architecture per-
mits the measurement of Instruction Reuse values.

The software architecture consists of two parts: system initialization and the TEST application.
In the system initialization phase, all the necessary functionality required to configure the multi-
processor system is implemented. The TEST application contains Assembler code that matches
the modelled benchmark instruction reuse-distance histogram.

4.2.1 System Initialization

The target code is usually a stand-alone program, such as an operating system, which has
access to I/O devices and/or a file system. Target applications compiled to run on top of a
target operating system (OS), such as WinCE or Linux would normally have access, via the OS,
to the I/O devices and/or file system. However, using the VaST VPM, a target application can
also run without the support of a target operating system.

The ARM-ELF GCC compiler and assembler are used to build the target executable. The archi-
tecture was set up with a default memory map. The GNU ARM-ELF GCC compiler supports
linker directives embedded in the start-up files that have been tailored to produce a binary
executable, which will load the image in SDRAM.

Memory Map

The device and memory configuration are defined in a platform configuration file. At startup
this configuration file is read and configures all memory regions with the corresponding physical
addresses.

The SDRAM memory has 64MB starting from physical address 0x0 and is split into 64 pages.
The TCM memory is modelled as 1 MB in size and therefore fits in one page.

Memory Start Address Size Memory Width Page Size


SDRAM 0x0000 0000 64 MB 32 bit 1 MB
TCM 0x1000 0000 1 MB 64 bit 1 MB

Table 4.5: Memory Map


The Linker

The purpose of the ARM linker is to combine the contents of one or more object files (the output
of a compiler or assembler) with selected parts of one or more object libraries, to produce an
executable program. The ELF file format is used for the output image and specifies an executable
binary image made up of several sections starting from virtual address 0x0. Moreover, the linker
is used to create a virtual map of these sections by specifying the base virtual addresses of each
section in the output image.

Each ELF file is made up of one ELF header, followed by file data.
The file data includes the following sections:

• Program Header: contains assembler directives to load the ELF image in SDRAM at
address 0x0, initializes the ARM11 MPCore and calls the main() function.

• TCM Code: contains the code that shall be copied to the TCM memory.

• Page Table: contains the modified Page Table code.

• .text: contains all other executable instruction code of the compiled program.

• .data: contains the initialized global and static variables and their values.

Figure 4.4: Output image structure.

The Page Table

A Page Table is a data structure used to store the mapping between virtual addresses (used in
the ELF image) and physical addresses (unique to TCM, SDRAM) and to set the attributes of
each page. While the code and data residing in SDRAM memory is cacheable, TCM is used to
hold critical code where the unpredictability of a cache is not desired. Therefore, TCM addresses
are non-cacheable.

Output image section Mapped to Attribute


Program Header SDRAM Shared, cacheable
TCM Code TCM Shared, non-cacheable
Page Table SDRAM Shared, cacheable
.text SDRAM Shared, cacheable
.data SDRAM Shared, cacheable

Table 4.6: Page Table


The main() function

The purpose of the main() function is to:

1. configure the PL340 Memory Controller

2. copy TCM code to TCM memory (if applicable)

3. configure and initialize the L220 Cache Controller (if applicable)

4. set the Page Table

5. call the TEST application

Device configuration and initialization is different in a multiprocessor system than in a unipro-
cessor system because some code needs to be executed by only one processor, while some must
be executed by all processors. Therefore, processor cooperation is essential. For example, steps
1-3 above need to be executed by only one processor, while step 4 must be executed by all
processors. A sketch of this cooperation is given below; the control flow graph of the main()
function is shown in Figure 4.5.
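The sketch below illustrates, in GNU assembler syntax, how this cooperation could look: each CPU reads its CPU ID from the CP15 CPU ID register, and only CPU 0 performs the one-time device initialization, while every CPU sets up the Page Table before calling the TEST application. The label names, the subroutine structure and the omission of any synchronization between CPU 0 and the other CPUs are simplifying assumptions made for illustration; they do not reproduce the actual start-up code.

        mrc   p15, 0, r0, c0, c0, 5   ; read the CP15 CPU ID register
        ands  r0, r0, #0x0f           ; keep the CPU number field
        bne   skip_once               ; CPUs other than CPU 0 skip the one-time initialization
        bl    init_pl340              ; 1. configure the PL340 Memory Controller
        bl    copy_tcm_code           ; 2. copy TCM code to TCM memory
        bl    init_l220               ; 3. configure and enable the L220 Cache Controller
skip_once:
        bl    set_page_table          ; 4. every CPU sets its Page Table
        bl    TEST                    ; 5. every CPU runs the TEST application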

Figure 4.5: Flowchart main() function


4.2.2 The Modelled OS Benchmark

Figure 4.6: Basis for the OS instruction reuse distance benchmark

In order to obtain results which are independent of specific implementation details, such as the
application being executed or the cache line size, the concept of Instruction Reuse is used as
the reference for measuring system performance and scalability. This concept was introduced in
Section 2.3.

Different Instruction Reuse values are obtained by changing either the application (represented
by its instruction reuse-distance histogram) or the cache configuration. The disadvantage of
this technique is therefore that benchmarks are required in order for designers to quickly
estimate in what range of Instruction Reuse values their particular application-cache combina-
tion is situated.

The TEST application was modelled to serve as a benchmark for the instruction reuse-distance
of an operating system. The significant number of processor stall cycles caused when workloads
include the operating system motivates a thorough characterization of the effect of operating sys-
tem references. Exclusion of operating system’s references have caused cache miss-rate estimates
to be optimistic because:

• the working sets of system references are much larger than single process user workloads

• system code is less repetitive than user code

• interruption of user activity by system calls, or by other user processes, tends to evict
portions of user code from the cache.

Due to the limited amount of time available for the thesis, it was decided not to port an operating
system to the modelled experimental platform. Moreover, in [4, 3], the RDVIS tool is used to
measure and visualize the reuse-distance histogram of data references. Unfortunately, we cannot
use this tool to obtain the instruction reuse-distance histogram of a current operating system,
as it is limited to visualizing load/store instructions only.


Figure 4.7: Benchmark OS instruction reuse-distance histogram

However, in [20] the instruction cache performance of the operating system is studied. The oper-
ating system used in their experiments is Alliant's Concentrix 3.0, which is based on Unix BSD 4.2.

Figure 4.6 shows the number of intervening operating system instruction words referenced be-
tween two consecutive calls to the same routine in the same operating system invocation. The
data corresponds to the 10 most frequently invoked routines in the operating system and is the
average of four workloads [20].

What is important to note is the shape of the histogram: the majority of routine invocations
have a small number of intervening instructions i.e. a small instruction reuse-distance.

Figure 4.6 differs from an instruction reuse-distance histogram because the reuse-distance (in
number of instructions) of OS routines was measured, as opposed to the reuse-distance of indi-
vidual instructions. However, routines can be viewed as very complex instructions. Therefore,
Figure 4.6 provides a good basis for modelling a benchmark instruction reuse-distance histogram.

The modelled benchmark histogram is shown in Figure 4.7. The code modelling the benchmark
instruction reuse-distance histogram of an operating system is contained in the TEST() function.
The instruction reuse-distance values are provided in KBytes for an easier comparison with cache
size. The conversion is possible because all instructions have a fixed length of 32-bit or 4-Bytes.
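For example, under this convention a reuse-distance given as 16 KB corresponds to 16 384 Bytes / 4 Bytes per instruction = 4096 instructions.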

Each individual instruction reuse-distance in the histogram is modelled as one basic block of
instructions. A basic block is code that has one entry point (no code within it is the destination
of a branch instruction), one exit point and no jump instructions contained within it. The start


Time 1 2 3 4 5 6 7 8 9 ...
Memory Address A1 A2 A3 A1 A2 A3 A1 A2 A3 ...
Instruction I1 I2 I3 I1 I2 I3 I1 I2 I3 ...
Instruction Reuse-distance ∞ ∞ ∞ 2 2 2 2 2 2 ...

Table 4.7: Instruction Reuse Distance is modelled by loops.

of a basic block may be jumped to from more than one location. The end of a basic block may
be a branch instruction or the statement before the destination of a branch instruction.

The instruction reuse-distance quantifies temporal locality, which is modelled by loops. When
the destination of the branch at the end of a basic block is the start of the same basic block, a
loop is created. All instructions in the loop have a reuse-distance given by the loop size (the
number of instructions in the loop) minus 1. An example with a loop size of 3 is provided in
Table 4.7, and a small assembler sketch of such a loop is given below.
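The sketch below shows, in GNU assembler syntax, how one such basic block can be written. The body consists of register-to-register filler instructions and the block ends in a single backwards branch, so every instruction in the loop is re-executed on each iteration. The register allocation, the iteration count and the label name are illustrative assumptions; the actual TEST() code is generated so that the number and length of such loops match the benchmark histogram.

        mov   r3, #1000           ; iteration count for this reuse-distance (illustrative value)
block:  add   r0, r0, r1          ; filler register-to-register instruction
        add   r0, r0, r1          ; ... repeated until the loop reaches the desired
        add   r0, r0, r1          ;     length (reuse-distance) in instructions
        subs  r3, r3, #1          ; decrement the iteration counter
        bne   block               ; branch back to the start of the same basic block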

The control flow graph of the TEST() function is shown in Figure 4.8. Between the first in-
struction beginning a basic block and the branch instruction ending a basic block a number of
instructions equal to the instruction reuse-distance are executed. Note, that all instructions in
the basic block have the same reuse-distance.

The number of times each basic block is executed is representative of the percentage of instructions
executed with a particular instruction reuse-distance, i.e. the Y-axis of the instruction reuse-
distance histogram.

In our modelling, the following assumptions and simplifications are made:

1. The ARM11 MPCore is used to model a multiprocessor system. However, we want to
obtain measurements that are independent of the choice of processor. Therefore, the
number of branch instructions is minimized in order to eliminate the effect of the processor
pipeline. Thus, almost all instructions perform register-to-register operations and the ideal
processor performance is modelled. This implies that a real application will perform worse
than our benchmark.

2. The effect of data instructions is not taken into consideration. The code to model in-
struction reuse-distance does not contain any load or store instructions. This is motivated
by the work published in [2] and described in the multiprocessor system example given in
the Motivation section. It was found that the misses to the instruction cache, not to the
data cache, were causing the majority of processor stall cycles and were limiting system
scalability.

3. Instruction reuse-distance is a measure of temporal locality only. Spatial locality depends
not only on the cache implementation, such as cache block sizes and cache associativity,
but also on the program implementation, such as data placement. Therefore, in our
benchmark model, spatial locality is modelled to be optimal in order to abstract from such
implementation details. Because of this, the performance of a real system will be lower
than for our benchmark.


Figure 4.8: Test() function control flow graph

The table below illustrates the type and total number of instructions executed in modelling the
benchmark instruction reuse-distance histogram.

Instruction Number Executed Percent Executed


MOV, ADD, CMP 15 346 441 99.92%
Branch 11 803 0.08%

Table 4.8: Type and number of instructions executed.

Chapter 5

Simulation Results & Analysis

In this chapter, the results of our simulation experiments are described. The concept of Instruc-
tion Reuse, described in Section 2.3, is used in order to evaluate the performance and scalability
of a multiprocessor system.

For the purpose of our experiment, the TEST benchmark (Section 4.2.2) for the instruction
reuse-distance histogram of an operating system was modelled. In our simulations, we vary only
the cache size since all other cache parameters are fixed for the ARM11 MPCore processor.

To evaluate the effects of Instruction Reuse on a multiprocessor system, we simulate the TEST
benchmark both in Symmetric Multiprocessing (SMP) configuration and in Asymmetric Mul-
tiprocessing (AMP) configuration. In SMP mode, there is only one instance of the TEST
benchmark code in SDRAM, which is run by all processors. In AMP mode, there are several
copies of the TEST benchmark code in memory, and each processor executes a different copy.
Here, the ideal behavior of both multiprocessing modes is modelled. In a real SMP system,
applications from different memory regions may be executed, while in AMP mode, code might
still be shared with other processors.

The total number of instructions executed, processor clock cycles and instruction cache line fills
are measured using Metrix. An example of Metrix output is given in Section 3.2.


5.1 Effect of Cache Size on Instruction Reuse

(a) TEST benchmark histogram (b) Instruction Reuse as a function of cache size

Figure 5.1: Instruction Reuse as cache size is varied for the TEST benchmark.

Table 5.1 shows the Instruction Reuse values as a function of cache size for the TEST benchmark
(Figure 5.1(a)). The Instruction Reuse values are plotted in Figure 5.1(b) and are computed
according to the definition given in Section 2.3. For convenience, the formula is shown again
below:
Instruction Reuse = Number of instructions executed / Number of instructions loaded in Level 1 cache
Instructions Executed 15358244
Instructions Per Cache Line 8
Level 1 Cache Size [KB] Disabled 4 8 16 32 64 128 256 512
Cache Line Fills - 1299285 1014593 722563 353735 291126 104394 33927 9372
Instruction Reuse 0 1.48 1.89 2.66 5.43 6.59 18.39 56.59 204.84

Table 5.1: Instruction Reuse as cache size is varied for the TEST benchmark
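As a worked example of how the values in Table 5.1 are obtained, consider the 4 KB cache configuration. Each cache line fill loads 8 instructions, so

Instruction Reuse = 15 358 244 / (1 299 285 · 8) = 15 358 244 / 10 394 280 ≈ 1.48,

which is the value listed in the table.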

When the Level 1 cache is disabled, all instructions are fetched from main memory. There-
fore, the Instruction Reuse is zero. As expected, when the cache size is increased, the number
of instructions fetched from main memory decreases due to the principle of locality and the
Instruction Reuse increases accordingly.
For a 512 KB cache size, the TEST benchmark code completely fits in the cache, and the only
instructions loaded into the cache are those caused by compulsory cache misses. This corresponds
to the maximum Instruction Reuse, because any further increase in the cache size will not
decrease the number of instructions loaded from main memory.
If we recall the example given in the Motivation section, for TCP/IP protocol processing under
Linux the Instruction Reuse was measured to be around 2 for a Level 1 cache size of 16 KB [2].
We note that the TEST benchmark shows a similar Instruction Reuse for the same cache size.


5.2 The Low Instruction Reuse Problem

In a single processor system, the IPC is computed simply as the number of instructions executed
by the processor divided by the number of clock cycles required to execute the instructions. In
a multiprocessor system, there are several processors operating in parallel and thus the IPC is
computed for the entire system as follows:
IPC = Σ (Number of instructions executed per CPU) / Maximum number of clock cycles
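As a small worked example (the figures below are derived from results reported later in this chapter, under the assumption that every CPU retires the complete benchmark of 15 358 244 instructions): for 8 CPUs and the measured system IPC of 5.64, the slowest CPU needs roughly

(8 · 15 358 244) / 5.64 ≈ 21.8 million clock cycles

to finish its share of the work.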
Figures 5.2 and 5.3 show the effect of Instruction Reuse on system IPC in both SMP and AMP
mode, as the number of processors is increased from 1 to 8.

If the Instruction Reuse is low, our multiprocessor system does not scale to more than two
processors. From Figure 5.2, it can be seen that while a second CPU slightly helps to improve
system IPC, the addition of a 3rd or 4th CPU does not result in any performance improvement.
This is justified by the fact that when Instruction Reuse is low, a high number of processor cache
line fills occurs, which in the worst case must be brought from main memory, consuming a total
bandwidth of:
Total BW consumed = Number of CPUs · Cache line size / Memory Latency
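Using the parameters of Table 4.1 (32-Byte cache lines, a memory latency of 28 CPU clock cycles), four CPUs that continuously miss would, in this worst case, demand roughly

Total BW consumed = 4 · 32 Bytes / 28 cycles ≈ 4.6 Bytes per CPU clock cycle.

This is only an illustrative estimate; the actual demand depends on how the individual line fills overlap on the memory interface.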
In our example, adding the second core compensates for the Memory Latency and thus a perfor-
mance improvement can still be measured. However, increasing the number of CPUs increases
the total amount of bandwidth consumed beyond the limit the memory system can satisfy. In-
creasing the number of CPUs, when the memory bandwidth limit has been reached, results in
no additional performance improvement.

This is an important result because it proves through simulation that a low instruction locality
limits the performance and scalability of multiprocessor systems. For example, an Instruction
Reuse of almost 2 was measured for the MPSoC example in the Motivation section, which
explains the high number of stall cycles.

If the Instruction Reuse is high, the performance scales almost linearly with the number of
processors. In Figure 5.3, for a single processor, the ideal IPC is reached at the maximum
Instruction Reuse. As the number of CPUs increases, the IPC does not increase exactly linearly
due to contention at the main memory. Therefore, for the maximum Instruction Reuse and 8
processors, an IPC of 5.64 is obtained instead of the ideal IPC of 8.

In AMP mode, the results are almost the same as in SMP mode. This is due to the fact
that, in both modes, the same contention at the SDRAM interface applies. Moreover, in AMP
operation, there is never contention for the same word in memory, while in SMP mode this type
of contention has little effect on performance.

One obvious way to decrease the number of stall cycles, while fetching instructions from main
memory, is to increase the memory bandwidth. Instruction Reuse and memory bandwidth are
two independent parameters. Next, we look at what influence memory bandwidth has on the
scalability of a multiprocessor system.


(a) Symmetric Multiprocessing mode (b) Asymmetric Multiprocessing mode

Figure 5.2: A low Instruction Reuse results in no performance improvement as the number of
CPUs is increased. In other words, a low Instruction Reuse limits the scalability of a multipro-
cessor system.

(a) Symmetric Multiprocessing mode (b) Asymmetric Multiprocessing mode

Figure 5.3: High Instruction Reuse values enable a multiprocessor system to scale to a higher
number of processors, and significant performance gains can be seen over a single processor system.


5.2.1 Effect of Memory Bandwidth

(a) BW_CPU = 2 · BW_MEM (b) BW_CPU = BW_MEM

Figure 5.4: Doubling the memory bandwidth increases system IPC but does not help to improve
the scalability of the multiprocessor system when Instruction Reuse is low.

Figure 5.4 shows the system IPC as a function of Instruction Reuse with the original (Figure
5.4(a)) and doubled (Figure 5.4(b)) memory bandwidth. Results are presented only for SMP
mode because there is almost no difference in AMP mode, as could be seen in the previous
subsection.

As expected, doubling the memory bandwidth increases system IPC over the whole range of
Instruction Reuse values. However, Figure 5.4(b) shows an interesting result: if the Instruction
Reuse is low, the doubled memory bandwidth does not improve the scalability of the multiproces-
sor system. The addition of the second CPU provides a higher performance improvement than in
Figure 5.4(a) because main memory can return instructions two times faster. Nevertheless, the
3rd and 4th CPUs still do not increase IPC significantly due to the high number of instruction
fetches, which cause many stall cycles.

With this information, it is certain that for the MPSoC example in the Motivation section, dou-
bling the memory bandwidth will increase system IPC but will bring no scalability improvement
because the Instruction Reuse was as low as 2.

Instruction Reuse is a function of both the application instruction reuse-distance histogram
and the cache configuration. From the software side, Instruction Reuse can be increased by
optimizing the code layout such that the spatial and temporal locality is maximized. From the
hardware point of view, Instruction Reuse can be increased by increasing the cache size. In
the following, we investigate the effect of both varying cache size and of a different instruction
reuse-distance histogram.


5.2.2 Effect of Level 1 Cache Size

(a) BW_CPU = 2 · BW_MEM (b) BW_CPU = BW_MEM

Figure 5.5: Doubling the Level 1 cache size or even the memory bandwidth may not improve
the scalability of a multiprocessor system. However, increasing the cache size above a certain
threshold value solves the scalability problem.

As could be seen so far, knowledge of the Instruction Reuse of a particular application-cache
combination offers important information on the scalability of a multiprocessor system. However,
it does not provide any information on how to design a multiprocessor system for improved
performance or scalability.

For a fixed application profile, one method to improve Instruction Reuse is to increase cache size.
The relation between Instruction Reuse and cache size for the TEST benchmark is shown in
Figure 5.1(b). In Figure 5.5, the system IPC is shown as a function of cache size with the original
(Figure 5.5(a)) and doubled (Figure 5.5(b)) memory bandwidth for the TEST benchmark.

An interesting observation is that increasing the Level 1 cache size or even doubling the memory
bandwidth may not necessarily improve the scalability of a multiprocessor system.
This is due to the fact that the Instruction Reuse is still small enough, creating a considerable
amount of instruction fetches to main memory, which imply processor stall cycles.

However, increasing Level 1 cache size above a certain threshold value solves the scalability
problem. As cache size increases, the number of capacity cache misses decreases, resulting
in fewer accesses to main memory which decreases the number of processor stall cycles. The
result is that the system IPC scales linearly with the number of CPUs. This is an important
observation because it allows designers to choose an optimal value for the Level 1 cache.


5.2.3 Effect of modified application profile

(a) Gaussian Instruction Reuse-Distance histogram. (b) Instruction Reuse as a function of cache size.

Figure 5.6: Instruction Reuse as cache size is varied for the modified histogram.

Instructions Executed      TEST Benchmark       15358244
                           Gaussian histogram   38282521

Level 1 Cache Size [KB]                         4     8     16    32    64    128    256    512
Instruction Reuse          TEST Benchmark       1.48  1.89  2.66  5.43  6.59  18.39  56.59  204.84
                           Gaussian histogram   1.05  1.25  1.41  2.81  5.98  13.36  27.06  52.14

Table 5.2: Instruction Reuse comparison.

The concept of Instruction Reuse was introduced as a measurement reference in order to abstract
our results from a particular application. We showed that a low Instruction Reuse limits the
performance and scalability of a multiprocessor system using the TEST benchmark. From the
definition of Instruction Reuse, our result is independent of the TEST benchmark: whatever
the application running on the processors may be, if a low Instruction Reuse is obtained in
combination with the Level 1 cache, our results are valid.

In order to investigate the effect of a different application profile, a reuse-distance histogram
was modelled to have the envelope of a Gaussian distribution. Many measurements of physical
phenomena can be approximated by the Gaussian distribution. The use of the Gaussian
distribution is justified by assuming that many small, independent events contribute additively
to each experiment observation; by the central limit theorem, the sum will then be Gaussian
distributed.

The Gaussian instruction reuse-distance histogram is shown in Figure 5.6(a). It can be seen
that the average instruction reuse-distance is about 32 KB, while the average instruction reuse-
distance for the TEST benchmark is about 16 KB. As explained in Section 2.3, the cache capacity
needs to be bigger than the average reuse-distance in order to achieve a high Instruction Reuse.


Therefore, higher cache size values are required to obtain the same Instruction Reuse in the case
of the Gaussian histogram as for the TEST benchmark.

Table 5.2 shows the Instruction Reuse values as a function of cache size for the Gaussian reuse-
distance histogram and the TEST benchmark; the values are also plotted in Figure 5.6(b). As
expected, the Instruction Reuse curve for the Gaussian histogram is below the one for the TEST
benchmark.

In Figure 5.7, the main results presented so far can be compared for the TEST benchmark (left
column) and the Gaussian histogram (right column). The comparison is given only for the SMP
mode, because the results are the same in AMP mode as shown in Section 5.2.

The two instruction reuse-distance histograms are shown for comparison in Figures 5.7(a) and
5.7(b).

In Figure 5.7(d), the system IPC is plotted as a function of Instruction Reuse for the Gaussian
histogram. The same trend as in Figure 5.7(c) can be observed: for low Instruction Reuse,
increasing the number of CPUs to more than two, provides no additional performance improve-
ment.

Figure 5.7(f), shows the system IPC in relation to the cache size. As the Level 1 cache size
increases, Instruction Reuse increases as well and if the Level 1 cache size exceeds a certain
threshold value, system IPC increases almost linearly with increasing number of CPUs for both
application profiles.

Due to the limited amount of time available, the effect of other application profiles could not be
investigated. In order to statistically show that our results are independent of the application,
a higher number of reuse-distance profiles would have to be simulated. However, the TEST
benchmark (designed to model the profile of an operating system) and the Gaussian histogram
show that our results are not influenced by the different application profiles.


(a) TEST Benchmark (b) Gaussian histogram

(c) TEST Benchmark, BW_CPU = BW_MEM (d) Gaussian histogram, BW_CPU = BW_MEM

(e) TEST Benchmark, BW_CPU = BW_MEM (f) Gaussian histogram, BW_CPU = BW_MEM
Figure 5.7: Effect of application instruction reuse-distance histogram

5.3 The Shared Level 2 Cache

When the Instruction Reuse is low, significant memory traffic is generated by the processor. The
addition of an on-chip Level 2 cache, is a recognized method of exploiting instruction locality
and improving system performance. In this section, a shared Level 2 cache is investigated as a
solution to the low Instruction Reuse problem.

Figure 5.8 shows a comparison between the system IPC without a Level 2 cache and with a
shared 128 KB Level 2 Direct Mapped cache, in Symmetric Multiprocessing mode and for
BW_CPU = 2 · BW_MEM. As can be seen, not only is the system IPC considerably higher for
all Instruction Reuse values, but the scalability is significantly improved for Instruction Reuse
values higher than 3.

In SMP mode, the shared Level 2 cache transforms the compulsory cache misses of one processor
into Level 2 cache hits for the other processors. Thus, an instruction needs to be fetched only
once from main memory and then it can be reused by all other processors as long as the reuse-
distance is smaller than the cache size.

In AMP mode (Figure 5.9), the situation is different. Because processors execute instructions
from different memory regions, the cache misses of one processor may evict instructions required
by other processors from the shared Level 2 cache. This means conflict misses are created. The
number of conflict misses increases with an increasing number of CPUs and depends on the set-
associativity of the shared Level 2 cache. The higher the set-associativity, the lower the
probability of a conflict miss. A Direct Mapped cache has the highest probability of a conflict
miss.

While the shared Level 2 cache offers the best-case scenario in SMP mode, it provides the
worst case in AMP mode. As shown in Figure 5.9(b), when the number of CPUs increases
to more than two, the low Instruction Reuse causes a high number of instruction fetches to
main memory, which results in decreasing performance due to conflict misses at the Level 2
cache. When the Instruction Reuse is increased, the number of conflict misses decreases, but the
performance and scalability of the multiprocessor system do not improve significantly and are
comparable to those of the system without a Level 2 cache.

In a realistic scenario, the results will correspond to a weighted average between the two multi-
processing cases. One example is the multiprocessor system presented in the Motivation section:
the operating system is Linux, therefore the system code runs in Symmetric Multiprocessing
mode, while the applications running on top of the operating system run in Asymmetric
Multiprocessing mode.


(a) SMP mode, no Level 2 cache (b) SMP mode, 128 KB shared Level 2 cache

Figure 5.8: In SMP mode, the addition of a shared Level 2 cache increases system IPC and also
improves the scalability of the multiprocessor system for low Instruction Reuse.

(a) AMP mode, no Level 2 cache (b) AMP mode, 128 KB shared Level 2 cache

Figure 5.9: In AMP mode, the addition of a shared Level 2 cache slightly increases system IPC
but does not improve the scalability of the multiprocessor system. In fact, for low Instruction
Reuse, increasing the number of processors decreases system IPC.


5.3.1 Level 1 Cache vs. Level 2 cache

In Figures 5.10(a)-5.10(d), the system IPC as a function of Level 1 and Level 2 cache size is
shown for increasing number of processors in Symmetric Multiprocessing mode.

One of the most important observations is that the addition of a relatively small shared Level 2
cache (e.g. twice the size of the Level 1 cache) provides a significantly greater performance
improvement than doubling the Level 1 cache size with no Level 2 cache. For example, increasing
the Level 1 cache from 16 KB to 32 KB or even 64 KB results in a system IPC significantly
below the system IPC with a 16 KB Level 1 cache and a 32 KB shared Level 2 cache. This
is due to the fact that the shared Level 2 cache transforms the compulsory cache misses of one
processor into cache hits for the other CPUs, eliminating even the first-time fetches that would
be required in the absence of a shared Level 2 cache.

Another important observation is that in the absence of a Level 2 cache, the threshold Level 1
cache size beyond which system IPC scales linearly with increasing number of processors is 128
KB for the TEST benchmark. With the addition of Level 2 cache, the threshold Level 1 cache
size decreases to just 16 KB. An instruction needs to be fetched only once from main memory in
order to be reused by all other processors from the Level 2 cache. Therefore, the low Instruction
Reuse due to a smaller Level 1 cache is compensated by a high reuse of instructions at the Level
2 cache.

Finally, it can be seen that for a given number of processors, increasing the size of the shared
Level 2 cache alone does not offer a significant performance gain. When the Level 2 cache size
increases, a higher number of instructions can reside in the cache, but this does not imply that
the reuse of instructions at the Level 2 cache increases. Only if the number of CPUs is increased
will the reuse increase, because the additional processors will fetch instructions directly from the
shared Level 2 cache.

In Asymmetric Multiprocessing mode, the size of the shared Level 2 cache plays an important
role in system performance. In Figures 5.11(a)-5.11(d), the system IPC as a function of Level 1
and Level 2 cache size is shown for Asymmetric Multiprocessing.

The problem with a shared Level 2 cache in AMP mode is that instructions that one processor
fetches into the Level 2 cache may be evicted by instruction fetches from a different processor,
resulting in conflict misses. Increasing the Level 2 cache size decreases the number of conflict
misses and thus increases system IPC.


(a) SMP mode, #CPUs = 1 (b) SMP mode, #CPUs = 2

(c) SMP mode, #CPUs = 3 (d) SMP mode, #CPUs = 4

Figure 5.10: The addition of a relatively small shared Level 2 cache (e.g. twice the size of the Level
1 cache) provides a significantly greater performance improvement than doubling the Level 1
cache size alone. However, increasing the Level 2 cache size without also increasing the number
of CPUs does not bring any significant performance gain.


(a) AMP mode, #CPUs = 1 (b) AMP mode, #CPUs = 2

(c) AMP mode, #CPUs = 3 (d) AMP mode, #CPUs = 4

Figure 5.11: As opposed to SMP mode, in AMP mode increasing the Level 2 cache size consid-
erably increases system IPC.


5.4 The effect of Tightly Coupled Memory

Tightly Coupled Memory (TCM) is a type of on-chip memory used to hold critical code when
deterministic memory behavior is required, such as in real-time systems. Because cache behavior
is not deterministic, TCM occupies an address space that is non-cacheable. The words "Tightly
Coupled" come from the fact that the TCM sits very close to the processor, with a latency of
about 1-2 clock cycles, just like the Level 1 cache.

TCM could provide a solution to the multiprocessor scalability problem by storing code with
low Instruction Reuse, i.e. the code which creates the largest amount of memory traffic and thus
increases the total bandwidth consumed.

The effect of placing low Instruction Reuse code in TCM could not be properly investigated
because of hardware constraints in placing the TCM as close as possible to the ARM11 MPCore.
In the following, these constraints are described:

• The Arm11 MPCore processor was not designed to support TCM. It contains an advanced
internal memory management system with support for snooping cache coherency and two
AXI interfaces specifically designed to be connected to the ARM L220 Cache Controller.
Therefore, the TCM can only be connected as a second slave to the MPCore via the PL300
Interconnect.

• The AXI PL300 Interconnect supports only AXI interfaces while the on-chip memory
model supports at most the AHB protocol. Therefore, a Bridge is required in order to
convert between the AXI and AHB protocols.

In this configuration, the latency to TCM was measured to be 14 processor clock cycles. Given
this high latency compared to the typical TCM latency of 1-2 processor clock cycles, the effect
of placing code in the modelled TCM would not be realistic.

Moreover, compared to the Level 2 cache latency of 6 processor clock cycles, the performance
of the modelled TCM would be significantly lower than that obtained by using the Level 2 cache.

For the reasons above, the effect of using TCM memory as an alternative to Level 2 Cache in
order to increase system performance and/or scalability could not be investigated.

Chapter 6

Conclusion & Future Work

In this thesis, an ARM11 MPCore based multiprocessor system is modelled and simulated us-
ing virtual prototyping technology from VaST Systems. The purpose of this modelling is to
study and understand the effect of instruction locality on the performance and scalability of
multiprocessor systems, in preparation for a future MPSoC design.

The design of the entire system consists of two aspects: hardware architecture and software
architecture. The hardware architecture was modelled using virtual models of ARM fabric
components. The software architecture was designed to permit a configurable instruction locality
to be modelled, and the concept of Instruction Reuse is introduced in order to evaluate the
performance and scalability of a multiprocessor system independent of the target application
being executed.

In order to obtain Instruction Reuse values, the instruction reuse-distance histogram of an
operating system is modelled based on previously published work in [20]. The modelled histogram
serves as a benchmark for comparison with future real measurements.

After implementation, the system is simulated and the system IPC is recorded using Metrix from
VaST Systems. System performance is analyzed both in Symmetric Multiprocessing and Asym-
metric Multiprocessing modes.

One of the main contributions of this thesis is that it is proved by means of simulation that
whatever the target application may be, if in combination with a specific cache configuration it
results in a low Instruction Reuse, then increasing the number of processors above a fairly small
number results in no additional performance increase. In other words, a low Instruction Reuse
limits the scalability of a multiprocessor system.


The effects of doubling memory bandwidth, cache configuration and modified application profile
are also investigated. Doubling the memory bandwidth increases system IPC but does not help
to improve the scalability of the multiprocessor system when Instruction Reuse is low. However,
increasing the Level 1 cache size above a certain threshold value solves the scalability problem
and system IPC increases linearly with increasing number of CPUs. Using a different application
profile, the above mentioned conclusions did not change.

It was also shown that in the absence of a shared cache there is no significant difference between
Symmetric Multiprocessing and Asymmetric Multiprocessing modes. However, the addition of
a shared Level 2 cache introduces great differences between the two processing modes:

• In Symmetric Multiprocessing mode: the addition of a shared Level 2 cache increases
system IPC for all Instruction Reuse values or Level 1 cache sizes. Moreover, adding a
shared Level 2 cache with double the size of the Level 1 cache provides a significantly
greater performance improvement than doubling the Level 1 cache size alone. However,
increasing the Level 2 cache size without also increasing the number of CPUs does not
bring any significant performance gain.

• In Asymmetric Multiprocessing mode: the addition of a shared Level 2 cache may actually
decrease system performance if the Instruction Reuse or the Level 1 cache size is small. More-
over, increasing the Level 2 cache size offers considerable performance improvement even
when the number of CPUs is constant.

The conclusions above give meaningful insights to the factors that govern MPSoC performance
and scalability.

The effect of Tightly Coupled Memory could not be investigated due to hardware architecture
restrictions on the placement of the TCM.

A few interesting issues that future work could focus on are:

• extending the analysis to include both Data Reuse as well as Instruction Reuse.

• porting an actual operating system to the modelled hardware architecture to investigate
the effects of Instruction Reuse based on a real application.

• exploring the effect of placing code with low Instruction Reuse in TCM memory.

Bibliography

[1] A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. ACM Trans. Comput.
Syst., 7(2):184–215, 1989.

[2] Mohamed A. Bamakhrama. Embedded multiprocessor system-on-chip for access network
processing. MSc. Thesis, Technische Universität München, December 2007.

[3] Kristof Beyls and Erik D‘Hollander. Platform-independent cache optimization by pinpoint-
ing low-locality reuse. In M. Bubak, G.D. van Albada, P.M.A. Sloot, and J.J. Dongarra,
editors, Computational Science - ICCS 2004: 4th International Conference, Proceedings,
Part III, volume 3038, pages 448–455, Krakow, 6 2004. Springer-Verlag Heidelberg.

[4] Kristof Beyls, Erik D‘Hollander, and Frederik Vandeputte. Rdvis: A tool that visualizes
the causes of low locality and hints program optimizations. In V.S. et al. Sunderam, editor,
Computational Science – ICCS 2005, 5th International Conference, volume 3515, pages
166–173, Atlanta, 5 2005. Springer.

[5] Changpeng Fang, Steve Carr, Soner Önder, and Zhenlin Wang. Reuse-distance-based miss-
rate prediction on a per instruction basis. MSP '04: Proceedings of the 2004 workshop on
Memory system performance, pages 60–68, 2004.

[6] Peter Claydon. Multicore gives more bang for the buck. EE Times, 15th October 2007.

[7] Abhijit Davare. Automated Mapping for Heterogeneous Multiprocessor Embedded Systems.
PhD thesis, EECS Department, University of California, Berkeley, Sep 2007.

[8] C. Ding. Improving effective bandwidth through compiler enhancement of global and dy-
namic cache reuse. PhD thesis, Dept. of Computer Science, Rice University, January 2000.

[9] Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse distance
analysis. Proceedings of the ACM SIGPLAN 2003 conference on Programming language
design and implementation, pages 245–257, 2003.

[10] International Technology Roadmap for Semiconductors. http://www.itrs.net.

[11] D. Geer. Chip makers turn to multicore processors. Computer, 38:11–13, May 2005.

[12] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section
analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350–360, July 1991.


[13] M. D. Hill. Aspects of cache memory and instruction buffer performance. PhD thesis,
University of California, Berkeley, November 1987.

[14] Anthony Massa and Michael Barr. Programming Embedded Systems. OReilly Publishers,
San Francisco, CA, chapter 1 edition, October 2006.

[15] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hard-
ware/Software Interface. Morgan Kaufmann Publishers, San Francisco, CA, third edition,
2004.

[16] David A. Patterson and John L. Hennessy. Computer Architecture: A Quantitative Ap-
proach. Morgan Kaufmann Publishers, San Francisco, CA, fourth edition edition, 2007.

[17] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage
hierarchies. IBM Systems Journal, 9(2):78–117, 1970.

[18] IEEE Design & Test staff. DAC, Moore's law still drive EDA. IEEE Des. Test, 20(3):99–100,
2003.

[19] P. Stenstrom. The paradigm shift to multi-cores: Opportunities and challenges. Appl.
Comput. Math., 6(2):253–257, 2007.

[20] Josep Torrellas, Chun Xia, and Russell Daigle. Optimizing instruction cache performance
for operating system intensive workloads. In Proceedings of the 1st Intl. Conference on High
Performance Computer Architecture, pages 360–369, 1995.

[21] Jim Turley. The two percent solution. http://www.embedded.com/story/OEG20021217S0039,
December 2002.

[22] Intel website. http://www.intel.com/museum/archives/history_docs/mooreslaw.htm.

[23] Y. Zhong, C. Ding, and K. Kennedy. Reuse distance analysis for scientific programs. Proceed-
ings of Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers,
March 2002.

[24] Y. Zhong, S. Dropsho, and C. Ding. Miss rate prediction across all program inputs. Pro-
ceedings of the 12th International Conference on Parallel Architectures and Compilation
Techniques, pages 91–101, September 2003.

[25] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using
whole-program reference affinity. Proceedings of ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, June 2004.

