ITEC582 Chapter18
ITEC582 Chapter18
Chapter 18
Multicore Processors
After studying this chapter, you should be able to:
2
Introduction
A multicore computer, also known as a chip
multiprocessor, combines two or more processors
(called cores) on a single piece of silicon (called a die).
Typically, each core consists of all of the components of
an independent processor, such as registers, ALU,
pipeline hardware, and control unit, plus L1 instruction
and data caches.
In addition to the multiple cores, contemporary
multicore chips also include L2 cache and, in some
cases, L3 cache.
3
The most highly integrated multicore processors, known as
systems on chip (SoCs), also include memory and peripheral
controllers.
This chapter provides an overview of multicore systems. We begin
with a look at the hardware performance factors that led to the
development of multicore computers and the software challenges
of exploiting the power of a multicore system. Next, we look at
multicore organization
4
1.Hardware Performance Issues
5
Increase in Parallelism and Complexity
The organizational changes in processor design have primarily
been focused on exploiting ILP, so that more work is done in
each clock cycle. These changes include;
■ Pipelining:
7
■ Simultaneous multithreading (SMT):
Register banks are expanded so that multiple threads can share
the use of pipeline resources
8
Multicore architecture places multiple processor cores and
bundles them as a single physical processor. The objective is
to create a system that can complete more tasks at the same
time, thereby gaining better overall system performance.
Multicore Organization
9
• With each of these innovations, designers have over the years
attempted to increase the performance of the system by adding
complexity.
In the case of pipelining, simple three-stage pipelines were replaced
by pipelines with five stages. Intel’s Pentium 4 “Prescott” core had 31
stages for some instructions.
10
Power Consumption
To maintain the trend of higher performance as the number of
transistors per chip rises, designers have resorted to more elaborate
processor designs (pipelining, superscalar, SMT) and to high clock
frequencies. Unfortunately, power requirements have grown
exponentially as chip density and clock frequency have risen.
One way to control power density
is to use more of the chip area for
cache memory. Memory transistors
are smaller and have a power
density an order of magnitude
lower than that of logic
Why is there a trend toward
given an increasing fraction of
chip area to cache memory?
11
As chip transistor density has increased, the percentage of chip area
devoted to memory has grown, and is now often half the chip area
Pollack’s rule
12
2. Software Performance Issues
The potential performance benefits of a multicore organization
depend on the abilitity to effectively exploit the parallel resources
available to the application. Let us focus first on a single application
running on a multicore system.
Even a small amount of serial code
has a noticeable impact. If only
10% of the code is inherently serial
(f = 0.9), running the program
on a multicore system with eight
processors yields a performance
gain of only a factor of 4.7
13
In addition, software
typically incurs overhead as
a result of communication and
distribution of work among
multiple processors and as a
result of cache coherence
overhead. This overhead
results in a curve where
performance peaks and then
begins to degrade because of
the increased burden of the
overhead of using multiple
processors
14
3.Multicore Organization
The main variables in a multicore
organization are as follows:
The number of core processors on the chip
The number of levels of cache memory
How cache memory is shared among cores
Whether simultaneous multithreading (SMT)
is employed
The types of cores
16
In this, there is enough area
available on the chip to allow for
L2 cache.
An example of this organization is
the AMD Opteron
17
The organization for the given
figure is a similar allocation of chip
space to memory, but with the use
of a shared L2 cache. The Intel
Core Duo has this organization
18
Finally, a shared L3 cache is used
with dedicated L1 and L2 caches
for each core processor.
19
The use of a shared higher-level cache on the chip has several
advantages over exclusive reliance on dedicated caches:
20
List some advantages of a shared L2 cache among cores compared
to separate dedicated L2 caches for each core.
21
3. With proper frame replacement algorithms, the amount of
shared cache allocated to each core is dynamic, so that threads
that have a less locality can employ more cache.
4. Interprocessor communication is easy to implement, via
shared memory locations.
5. The use of a shared L2 cache confines the cache coherency
problem to the L1 cache level, which may provide some
additional performance advantage
22
A potential advantage to having only dedicated L2 caches on the
chip is that each core enjoys more rapid access to its private L2
cache. This is advantageous for threads that exhibit strong locality.
As both the amount of memory available and the number of
cores grow, the use of a shared L3 cache combined with
dedicated per core L2 caches seems likely to provide better
performance than simply a massive shared L2 cache or very
large dedicated L2 caches with no on-chip L3. An example of
this latter arrangement is the Xeon E5-2600/4600 chip processor
23
4. Heterogeneous Multicore Organization
As clock speeds and logic densities increase, designers must
balance many design elements in attempts to maximize
performance and minimize power consumption. We have so
far examined a number of such approaches, including the
following:
1. Increase the percentage of the chip devoted to cache memory.
2. Increase the number of levels of cache memory.
3. Change the length (increase or decrease) and functional
components of the instruction pipeline.
4. Employ simultaneous multithreading.
5. Use multiple cores
24
A typical case for the use of multiple cores is a chip with
multiple identical cores, known as homogenous multicore
organization.
To achieve better results, in terms of performance and/or power
consumption, an increasingly popular design choice is
heterogeneous multicore organization, which refers to a
processor chip that includes more than one kind of core.
Two approaches to heterogeneous multicore organization.
25
(i) Different Instruction Set Architectures
CPU/GPU multicore :
26
Multiple CPUs and GPUs share on-chip resources, such as the last-level
cache (LLC), interconnection network, and memory controllers. Most
critical is the way in which cache management policies provide
effective sharing of the LLC. The differences in cache sensitivity and
memory access rate between CPUs and GPUs create significant
challenges to the efficient sharing of the LLC
27
Table above illustrates the potential performance benefit of combining
CPUs and GPUs for scientific applications. This table shows the basic
operating parameters of an AMD chip, the A10 5800K. For floating- point
calculations, the CPU’s performance at 121.6 GFLOPS is dwarfed by the
GPU, which offers 614 GFLOPS to applications that can utilize the
resource effectively.
28
The overall objective is to allow programmers to write applications
that exploit the serial power of CPUs and the parallel-processing
power of GPUs seamlessly with efficient coordination at the OS and
hardware level.
CPU / DSP multicore
29
DSPs are used to process analog data from sources such as sound,
weather satellites, and earthquake monitors.
Signals are converted into digital data and analyzed using various
algorithms such as Fast Fourier Transform.
DSP cores are widely used in myriad devices, including cellphones,
sound cards, fax machines, modems, hard disks, and digital TVs
30
(ii) Equivalent Instruction Set Architectures
31
Chip containing two high- performance Cortex- A15 cores and
two lower performance, lower-power-consuming Cortex-A7
cores. The A7 cores handle less computation-intense tasks, such
as background processing, playing music, sending
texts, and making phone calls. The A15 cores are invoked for
high intensity tasks, such as for video, gaming, and navigation.
32
These are devices whose performance demands from users are
increasing at a much faster rate than the capacity of batteries or the
power savings from semiconductor process advances.
The usage pattern for smartphones and tablets is quite dynamic.
The A15 is designed for maximum performance within the mobile
power budget.
The A7 processor is designed for maximum efficiency and high
enough performance to address all but the most intense periods of
work.
33
Intel Core i7-990X
Each core has its own
dedicated L2 cache and the
six cores share a 12-MB L3
cache.
It uses prefetching in which
the hardware examines
memory access patterns and
attempts to fill the caches
speculatively with data that’s
likely to be requested soon.
The Core i7-990X chip supports two forms of external communications to
other chips.
The DDR3 memory controller
It brings the memory controller for the DDR main memory onto
the chip. The interface supports three channels that are 8 bytes
wide for a total bus width of 192 bits, for an aggregate data rate
of up to 32 GB/s.
35
2. Give several reasons for the choice by designers to move to a
multicore organization rather than increase parallelism within a single
processor.
4. List some examples of applications that benefit directly from the
ability to scale throughput with the number of cores.
5. At a top level,
6.
36