ITEC582 Chapter 18

Multicore processors place multiple processor cores on a single chip to improve performance. As chip complexity has increased to gain performance, power consumption has become an issue. This has driven the adoption of multicore designs, which can gain performance through parallelism while controlling power. Key design aspects of multicore chips include the number of cores, the levels of cache memory, and whether cache is shared or dedicated among cores. Shared caches can improve performance by reducing cache misses and supporting communication between cores.

Eastern Mediterranean University

School of Computing and Technology


Master of Technology

Architecture and Hardware (ITEC582)

Chapter 18
Multicore Processors
After studying this chapter, you should be able to:

• Understand the hardware performance issues that have driven the move to multicore computers.
• Understand the software performance issues posed by the use of multithreaded multicore computers.
• Present an overview of the two principal approaches to heterogeneous multicore organization.
• Have an appreciation of the use of multicore organization in embedded systems, PCs and servers, and mainframes.
Introduction

• A multicore computer, also known as a chip multiprocessor, combines two or more processors (called cores) on a single piece of silicon (called a die).
• Typically, each core consists of all of the components of an independent processor, such as registers, ALU, pipeline hardware, and control unit, plus L1 instruction and data caches.
• In addition to the multiple cores, contemporary multicore chips also include L2 cache and, in some cases, L3 cache.
The most highly integrated multicore processors, known as systems on chip (SoCs), also include memory and peripheral controllers.

This chapter provides an overview of multicore systems. We begin with a look at the hardware performance factors that led to the development of multicore computers and the software challenges of exploiting the power of a multicore system. Next, we look at multicore organization.
1. Hardware Performance Issues

Microprocessor systems have experienced a steady increase in execution performance for decades. This increase is due to a number of factors, including:
• increases in clock frequency,
• increases in transistor density, and
• refinements in the organization of the processor on the chip.
Increase in Parallelism and Complexity

The organizational changes in processor design have primarily been focused on exploiting instruction-level parallelism (ILP), so that more work is done in each clock cycle. These changes include:

■ Pipelining: Individual instructions are executed through a pipeline of stages, so that while one instruction is executing in one stage of the pipeline, another instruction is executing in another stage of the pipeline.

■ Superscalar: Multiple pipelines are constructed by replicating execution resources. This enables parallel execution of instructions in parallel pipelines, so long as hazards are avoided.

■ Simultaneous multithreading (SMT): Register banks are expanded so that multiple threads can share the use of pipeline resources.

• Summarize the differences among simple instruction pipelining, superscalar, and simultaneous multithreading.
Multicore Organization

Multicore architecture places multiple processor cores on a single chip and bundles them as a single physical processor. The objective is to create a system that can complete more tasks at the same time, thereby gaining better overall system performance.
• With each of these innovations, designers have over the years attempted to increase the performance of the system by adding complexity.
• In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages; Intel's Pentium 4 "Prescott" core had 31 stages for some instructions.
• With superscalar organization, increased performance can be achieved by increasing the number of parallel pipelines.
• This increases the difficulty of designing, fabricating, and debugging the chips as the complexity increases.
Power Consumption

To maintain the trend of higher performance as the number of transistors per chip rises, designers have resorted to more elaborate processor designs (pipelining, superscalar, SMT) and to high clock frequencies. Unfortunately, power requirements have grown exponentially as chip density and clock frequency have risen.

One way to control power density is to use more of the chip area for cache memory. Memory transistors are smaller and have a power density an order of magnitude lower than that of logic.

• Why is there a trend toward giving an increasing fraction of chip area to cache memory?
As chip transistor density has increased, the percentage of chip area devoted to memory has grown, and is now often half the chip area.

Pollack's rule
• states that the performance increase is roughly proportional to the square root of the increase in complexity.
• In other words, if you double the logic in a processor core, then it delivers only about 40% more performance.
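A worked instance of Pollack's rule (an empirical rule of thumb, so the numbers are approximate):

\[
\text{Performance} \propto \sqrt{\text{Complexity}}
\quad\Rightarrow\quad
\frac{P_{\text{new}}}{P_{\text{old}}} = \sqrt{\frac{2C}{C}} = \sqrt{2} \approx 1.4
\]

That is, doubling a core's logic buys roughly 40% more performance, which is why adding whole cores, each of which can scale nearly linearly on parallel workloads, became more attractive than growing a single core.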
2. Software Performance Issues

The potential performance benefits of a multicore organization depend on the ability to effectively exploit the parallel resources available to the application. Let us focus first on a single application running on a multicore system.

Even a small amount of serial code has a noticeable impact. If only 10% of the code is inherently serial (f = 0.9), running the program on a multicore system with eight processors yields a performance gain of only a factor of 4.7.
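The factor of 4.7 follows from Amdahl's law, with f the parallelizable fraction and N the number of processors:

\[
\text{Speedup} = \frac{1}{(1-f) + \dfrac{f}{N}}
= \frac{1}{0.1 + \dfrac{0.9}{8}}
= \frac{1}{0.2125} \approx 4.7
\]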
In addition, software typically incurs overhead as a result of communication and distribution of work among multiple processors, and as a result of cache coherence overhead. This overhead results in a curve where performance peaks and then begins to degrade because of the increased burden of the overhead of using multiple processors.
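One simple way to see why the curve peaks is to add an overhead term to Amdahl's law; the linear form kN below is an illustrative modeling assumption, not a result from the chapter:

\[
S(N) = \frac{1}{(1-f) + \dfrac{f}{N} + kN}
\]

For small N the f/N term shrinks quickly and speedup rises; once the coordination and coherence overhead kN dominates, adding further processors reduces speedup.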
3. Multicore Organization

The main variables in a multicore organization are as follows:
• The number of core processors on the chip
• The number of levels of cache memory
• How cache memory is shared among cores
• Whether simultaneous multithreading (SMT) is employed
• The types of cores

• At a top level, what are the main design variables in a multicore organization?
Levels of Cache

There are four general organizations for multicore systems.

(a) Dedicated L1 cache: In this organization, the only on-chip cache is the L1 cache, with each core having its own dedicated L1 cache, divided into instruction and data caches for performance reasons, while L2 and higher caches are unified. This organization is found in some of the earlier multicore computer chips and is still seen in embedded chips. An example of this organization is the ARM11 MPCore.
(b) Dedicated L2 cache: In this organization, there is enough area available on the chip to allow a dedicated L2 cache for each core. An example of this organization is the AMD Opteron.
(c) Shared L2 cache: This organization makes a similar allocation of chip space to memory, but with the use of a shared L2 cache. The Intel Core Duo has this organization.
(d) Shared L3 cache: Finally, a shared L3 cache is used, with dedicated L1 and L2 caches for each core processor. The Intel Core i7 is an example of this organization.
The use of a shared higher-level cache on the chip has several advantages over exclusive reliance on dedicated caches:
• It can reduce overall miss rates.
• Data shared by multiple cores is not replicated at the shared cache level.
• Inter-core communication is easy to implement, via shared memory locations.
• It confines the cache coherency problem to the lower cache levels, which may provide some additional performance advantage.

• List some advantages of a shared L2 cache among cores compared to separate dedicated L2 caches for each core.

1. Constructive interference can reduce overall miss rates. That is, if a thread on one core accesses a main memory location, this brings the frame containing the referenced location into the shared cache. If a thread on another core soon thereafter accesses the same memory block, the memory locations will already be available in the shared on-chip cache.
2. A related advantage is that data shared by multiple cores is not replicated at the shared cache level.
3. With proper frame replacement algorithms, the amount of shared cache allocated to each core is dynamic, so that threads that have less locality can employ more cache.
4. Interprocessor communication is easy to implement, via shared memory locations (see the sketch below).
5. The use of a shared L2 cache confines the cache coherency problem to the L1 cache level, which may provide some additional performance advantage.
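A minimal sketch of point 4, inter-core communication through shared memory locations, using C11 atomics and POSIX threads; the flag-and-payload pattern and all names here are illustrative, not taken from the chapter:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Two shared memory locations: with a shared L2/L3, the consumer's read
 * can be satisfied from the on-chip cache the producer just filled. */
static int shared_data;
static atomic_int ready;   /* static storage: zero-initialized */

static void *producer(void *arg) {
    shared_data = 42;                                   /* write payload */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                               /* spin on flag */
    printf("received %d\n", shared_data);               /* payload visible */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

The release/acquire pair ensures the payload write becomes visible before the flag; on a multicore chip, this is exactly the kind of pattern the cache coherence protocol must support efficiently.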
A potential advantage of having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality.

As both the amount of memory available and the number of cores grow, the use of a shared L3 cache combined with dedicated per-core L2 caches seems likely to provide better performance than simply a massive shared L2 cache or very large dedicated L2 caches with no on-chip L3. An example of this arrangement (shared L3 with dedicated per-core L2 caches) is the Xeon E5-2600/4600 processor.
4. Heterogeneous Multicore Organization

As clock speeds and logic densities increase, designers must balance many design elements in attempts to maximize performance and minimize power consumption. We have so far examined a number of such approaches, including the following:
1. Increase the percentage of the chip devoted to cache memory.
2. Increase the number of levels of cache memory.
3. Change the length (increase or decrease) and functional components of the instruction pipeline.
4. Employ simultaneous multithreading.
5. Use multiple cores.
A typical case for the use of multiple cores is a chip with multiple identical cores, known as a homogeneous multicore organization.

To achieve better results, in terms of performance and/or power consumption, an increasingly popular design choice is heterogeneous multicore organization, which refers to a processor chip that includes more than one kind of core.

There are two approaches to heterogeneous multicore organization:
(i) Different instruction set architectures
(ii) Equivalent instruction set architectures
(i) Different Instruction Set Architectures

CPU/GPU multicore:

The most prominent trend in heterogeneous multicore design is the use of both CPUs and graphics processing units (GPUs) on the same chip. Briefly, GPUs are characterized by the ability to support thousands of parallel execution threads. Thus, GPUs are well matched to applications that process large amounts of vector and matrix data.
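A sketch of the kind of data-parallel kernel this refers to (plain C here; on a GPU, each loop iteration would typically become its own hardware thread):

```c
/* Element-wise vector addition: every iteration is independent, so a GPU
 * can assign one of thousands of threads per element, while a CPU core
 * must iterate serially or in short SIMD vectors. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```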
Multiple CPUs and GPUs share on-chip resources, such as the last-level cache (LLC), interconnection network, and memory controllers. Most critical is the way in which cache management policies provide effective sharing of the LLC. The differences in cache sensitivity and memory access rate between CPUs and GPUs create significant challenges to the efficient sharing of the LLC.
As an illustration of the potential performance benefit of combining CPUs and GPUs for scientific applications, consider the basic operating parameters of an AMD chip, the A10-5800K. For floating-point calculations, the CPU's performance at 121.6 GFLOPS is dwarfed by the GPU, which offers 614 GFLOPS (roughly five times as much) to applications that can utilize the resource effectively.
The overall objective is to allow programmers to write applications that exploit the serial power of CPUs and the parallel-processing power of GPUs seamlessly, with efficient coordination at the OS and hardware level.

CPU/DSP multicore:

Another common example of a heterogeneous multicore chip is a mixture of CPUs and digital signal processors (DSPs). A DSP provides ultra-fast instruction sequences (shift and add; multiply and add), which are commonly used in math-intensive digital signal processing applications.
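A minimal sketch of the multiply-and-add pattern DSPs accelerate, here as an n-tap FIR filter in C; the function and names are illustrative:

```c
/* n-tap FIR filter: the inner statement is a multiply-accumulate (MAC).
 * On a DSP core, each iteration typically maps to one single-cycle MAC
 * instruction, often with zero-overhead looping. */
float fir_sample(const float *x, const float *h, int n) {
    float acc = 0.0f;
    for (int k = 0; k < n; k++)
        acc += h[k] * x[k];    /* multiply and add */
    return acc;
}
```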
DSPs are used to process analog data from sources such as sound, weather satellites, and earthquake monitors. Signals are converted into digital data and analyzed using various algorithms, such as the Fast Fourier Transform. DSP cores are widely used in myriad devices, including cellphones, sound cards, fax machines, modems, hard disks, and digital TVs.
(ii) Equivalent Instruction Set Architectures

Another recent approach to heterogeneous multicore organization is the use of multiple cores that have equivalent ISAs but vary in performance or power efficiency. The leading example of this is ARM's big.LITTLE architecture, which we examine in this section.
A typical big.LITTLE chip contains two high-performance Cortex-A15 cores and two lower-performance, lower-power-consuming Cortex-A7 cores. The A7 cores handle less computation-intense tasks, such as background processing, playing music, sending texts, and making phone calls. The A15 cores are invoked for high-intensity tasks, such as video, gaming, and navigation.

The big.LITTLE architecture is aimed at the smartphone and tablet market.
These are devices whose performance demands from users are increasing at a much faster rate than the capacity of batteries or the power savings from semiconductor process advances. The usage pattern for smartphones and tablets is quite dynamic. The A15 is designed for maximum performance within the mobile power budget. The A7 processor is designed for maximum efficiency and high enough performance to address all but the most intense periods of work.
Intel Core i7-990X

Each core has its own dedicated L2 cache, and the six cores share a 12-MB L3 cache. The chip uses prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that is likely to be requested soon.
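The i7's prefetching happens in hardware, invisibly to software. As a rough software analogy only, GCC/Clang expose an explicit prefetch hint; the loop and the lookahead distance of 16 below are illustrative assumptions:

```c
/* Explicit prefetch hint: ask for data 16 iterations ahead so it is in
 * cache by the time the demand access arrives. Prefetches are hints and
 * never fault, so running past the end of the array is harmless. */
void scale(float *a, int n, float s) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 1, 1);  /* write hint, low reuse */
        a[i] *= s;
    }
}
```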
The Core i7-990X chip supports two forms of external communication with other chips:
• The DDR3 memory controller
• The QuickPath Interconnect (QPI)
The DDR3 memory controller

This brings the memory controller for the DDR3 main memory onto the chip. The interface supports three channels that are 8 bytes wide, for a total bus width of 192 bits and an aggregate data rate of up to 32 GB/s.
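Checking the arithmetic (the per-channel transfer rate is inferred from the stated totals, i.e., it is an assumption):

\[
3 \times 8\ \text{B} = 24\ \text{B per transfer}\ (192\ \text{bits}),
\qquad
24\ \text{B} \times 1.333\ \text{GT/s} \approx 32\ \text{GB/s}
\]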

The QuickPath Interconnect (QPI)

QPI is a cache-coherent, point-to-point, link-based electrical interconnect specification for Intel processors and chipsets. It enables high-speed communications among connected processor chips. The QPI link operates at 6.4 GT/s (gigatransfers per second). At 16 bits per transfer, that adds up to 12.8 GB/s, and since QPI links involve dedicated bidirectional pairs, the total bandwidth is 25.6 GB/s.
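The stated bandwidth follows directly:

\[
6.4\ \text{GT/s} \times 2\ \text{B per transfer} = 12.8\ \text{GB/s per direction},
\qquad
2 \times 12.8 = 25.6\ \text{GB/s}
\]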
Review Questions

2. Give several reasons for the choice by designers to move to a multicore organization rather than increase parallelism within a single processor.
4. List some examples of applications that benefit directly from the ability to scale throughput with the number of cores.
5. At a top level, what are the main design variables in a multicore organization?
