ITEC582 Chapter18

Multicore processors place multiple processor cores on a single chip to improve performance. As chip complexity has increased to gain performance, power consumption has become an issue. This has driven the adoption of multicore designs which can gain performance through parallelism while controlling power. Key design aspects of multicore chips include the number of cores, levels of cache memory, and whether cache is shared or dedicated between cores. Shared caches can improve performance by reducing cache misses and supporting communication between cores.

Uploaded by

Ana Clara Cavalcante Sousa

Eastern Mediterranean University

School of Computing and Technology

Master of Technology

Architecture and Hardware (ITEC582)

Chapter 18
Multicore Processors
After studying this chapter, you should be able to:

• Understand the hardware performance issues that have driven the move to multicore computers.
• Understand the software performance issues posed by the use of multithreaded multicore computers.
• Present an overview of the two principal approaches to heterogeneous multicore organization.
• Have an appreciation of the use of multicore organization on embedded systems, PCs and servers, and mainframes.

Introduction
• A multicore computer, also known as a chip multiprocessor, combines two or more processors (called cores) on a single piece of silicon (called a die).
• Typically, each core consists of all of the components of an independent processor, such as registers, ALU, pipeline hardware, and control unit, plus L1 instruction and data caches.
• In addition to the multiple cores, contemporary multicore chips also include L2 cache and, in some cases, L3 cache.

The most highly integrated multicore processors, known as
systems on chip (SoCs), also include memory and peripheral
controllers.
This chapter provides an overview of multicore systems. We begin
with a look at the hardware performance factors that led to the
development of multicore computers and at the software challenges
of exploiting the power of a multicore system. Next, we look at
multicore organization.

1. Hardware Performance Issues

Microprocessor systems have experienced a steady increase in execution performance for decades.
This increase is due to a number of factors, including:
• increases in clock frequency,
• increases in transistor density, and
• refinements in the organization of the processor on the chip.

Increase in Parallelism and Complexity
The organizational changes in processor design have primarily
been focused on exploiting instruction-level parallelism (ILP), so
that more work is done in each clock cycle. These changes include:
■ Pipelining:
Individual instructions are executed through a pipeline of stages
so that, while one instruction is executing in one stage of the
pipeline, another instruction is executing in another stage of the
pipeline.
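A minimal sketch of why pipelining helps, assuming an ideal k-stage pipeline in which one instruction enters per cycle and no hazards or stalls occur: n instructions then finish in k + n - 1 cycles instead of n × k. The function names are hypothetical, chosen for illustration.

```python
def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """Ideal pipeline: the first instruction needs `stages` cycles to fill the
    pipeline, then one instruction completes every cycle (no hazards assumed)."""
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1


if __name__ == "__main__":
    n, k = 100, 5
    print(unpipelined_cycles(n, k))  # 500
    print(pipelined_cycles(n, k))    # 104
    # Ideal speedup approaches k as n grows.
    print(round(unpipelined_cycles(n, k) / pipelined_cycles(n, k), 2))  # 4.81
```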
Review question: Summarize the differences among simple
instruction pipelining, superscalar, and
simultaneous multithreading.
■ Superscalar:
Multiple pipelines are constructed by replicating execution
resources. This enables parallel execution of instructions in
parallel pipelines, so long as hazards are avoided.
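Extending the same idealized model (an assumption, since real superscalar cores lose issue slots to hazards), a core with w replicated pipelines can complete up to w instructions per cycle once the pipelines are full. The helper below is a hypothetical illustration, not a model of any particular processor.

```python
import math


def superscalar_cycles(n_instructions: int, stages: int, width: int) -> int:
    """Ideal superscalar model: `width` replicated pipelines of `stages` stages.
    Up to `width` instructions are issued together each cycle, so after the
    pipelines fill, `width` instructions complete per cycle (no hazards)."""
    if n_instructions == 0:
        return 0
    issue_groups = math.ceil(n_instructions / width)
    return stages + issue_groups - 1


if __name__ == "__main__":
    # Prints one "width cycles" pair per line: 1 104, 2 54, 4 29
    for w in (1, 2, 4):
        print(w, superscalar_cycles(100, 5, w))
```

With width 1 the result matches the plain pipeline; each doubling of width roughly halves the cycle count only because hazards are ignored here.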

■ Simultaneous multithreading (SMT):
Register banks are expanded so that multiple threads can share
the use of pipeline resources.

A multicore architecture places multiple processor cores on a single
chip and bundles them as a single physical processor. The objective is
to create a system that can complete more tasks at the same
time, thereby gaining better overall system performance.

Figure: Multicore organization

• With each of these innovations, designers have over the years
attempted to increase the performance of the system by adding
complexity.
In the case of pipelining, simple three-stage pipelines were replaced
by pipelines with five stages. Intel’s Pentium 4 “Prescott” core had 31
stages for some instructions.
With superscalar organization, increased performance can be
achieved by increasing the number of parallel pipelines.
In both cases, the added complexity makes the chips more difficult
to design, fabricate, and debug.

Power Consumption
To maintain the trend of higher performance as the number of
transistors per chip rises, designers have resorted to more elaborate
processor designs (pipelining, superscalar, SMT) and to high clock
frequencies. Unfortunately, power requirements have grown
exponentially as chip density and clock frequency have risen.
One way to control power density is to use more of the chip area for
cache memory. Memory transistors are smaller and have a power
density an order of magnitude lower than that of logic.
Review question: Why is there a trend toward giving an increasing
fraction of chip area to cache memory?

As chip transistor density has increased, the percentage of chip area
devoted to memory has grown and is now often half the chip area.
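The cache-area argument can be made concrete with an area-weighted average. This sketch assumes logic power density is normalized to 1.0 and reads "an order of magnitude lower" as exactly 10 times lower for memory; both values are illustrative assumptions, and the function name is hypothetical.

```python
def average_power_density(memory_fraction: float,
                          logic_density: float = 1.0,
                          memory_ratio: float = 10.0) -> float:
    """Area-weighted power density of a chip.

    memory_fraction: share of chip area devoted to cache memory (0..1).
    logic_density:   power per unit area of logic (normalized to 1.0 here).
    memory_ratio:    how many times lower memory's power density is than
                     logic's; 10.0 stands in for "an order of magnitude".
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    memory_density = logic_density / memory_ratio
    return (1.0 - memory_fraction) * logic_density + memory_fraction * memory_density


if __name__ == "__main__":
    # Prints 0.1 0.91, then 0.25 0.775, then 0.5 0.55 (one pair per line)
    for m in (0.1, 0.25, 0.5):
        print(m, round(average_power_density(m), 3))
```

Under these assumptions, devoting half the die to cache cuts average power density by roughly 45% relative to an all-logic die.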

Pollack’s rule

Pollack’s rule states that the performance increase is roughly
proportional to the square root of the increase in complexity.
In other words, if you double the logic in a processor core, it
delivers only about 40% more performance (the square root of 2 is about 1.41).
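A short sketch of Pollack's rule, with relative performance modeled as the square root of the complexity factor. The comparison with a second core assumes a workload that parallelizes perfectly, which the software discussion that follows shows is rarely the case. The function name is hypothetical.

```python
import math


def pollack_performance(complexity_factor: float) -> float:
    """Pollack's rule: relative single-core performance grows roughly with
    the square root of the increase in core complexity (logic)."""
    return math.sqrt(complexity_factor)


if __name__ == "__main__":
    # One core with twice the logic: about 1.41x the performance (~40% more).
    print(round(pollack_performance(2.0), 2))  # 1.41
    # The same extra logic spent on a second, identical core: up to 2x
    # throughput, but only if the workload parallelizes perfectly.
    print(2 * pollack_performance(1.0))  # 2.0
```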

2. Software Performance Issues
The potential performance benefits of a multicore organization
depend on the ability to effectively exploit the parallel resources
available to the application. Let us focus first on a single application
running on a multicore system.
Even a small amount of serial code has a noticeable impact. By
Amdahl's law, the speedup on N processors is 1 / ((1 - f) + f/N),
where f is the fraction of the code that can be parallelized. If only
10% of the code is inherently serial (f = 0.9), running the program
on a multicore system with eight processors yields a performance
gain of only a factor of about 4.7.
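The 4.7 figure follows directly from Amdahl's law. A minimal sketch, where f is the parallelizable fraction and N the number of cores (the function name is hypothetical):

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - f) + f / N).

    parallel_fraction (f): share of execution time that can run in parallel.
    n_processors (N):      number of cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


if __name__ == "__main__":
    print(round(amdahl_speedup(0.9, 8), 1))  # 4.7, the figure quoted above
    # Even with unlimited cores, the 10% serial part caps speedup at 1 / 0.1 = 10.
    print(round(amdahl_speedup(0.9, 1_000_000), 2))  # 10.0
```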

In addition, software typically incurs overhead from the
communication and distribution of work among multiple
processors and from maintaining cache coherence. This
overhead results in a curve where performance peaks and
then begins to degrade as the burden of using more
processors grows.
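The peaked curve can be reproduced by adding an overhead term to Amdahl's law. The linear cost per extra core used below, and its value of 0.005, are purely hypothetical assumptions chosen to show the shape of the curve, not measurements.

```python
def speedup_with_overhead(parallel_fraction: float, n_processors: int,
                          overhead_per_core: float) -> float:
    """Amdahl's law plus a hypothetical linear overhead term.

    Normalized run time = serial part + parallel part / N
                          + overhead_per_core * (N - 1),
    where the last term stands in for communication, work distribution and
    cache-coherence costs that grow with the number of cores.
    """
    serial = 1.0 - parallel_fraction
    time = (serial + parallel_fraction / n_processors
            + overhead_per_core * (n_processors - 1))
    return 1.0 / time


def best_core_count(parallel_fraction: float, overhead_per_core: float,
                    max_cores: int = 64) -> int:
    """Core count (1..max_cores) that maximizes speedup under this model."""
    return max(range(1, max_cores + 1),
               key=lambda n: speedup_with_overhead(parallel_fraction, n,
                                                   overhead_per_core))


if __name__ == "__main__":
    f, c = 0.9, 0.005
    peak = best_core_count(f, c)
    print(peak, round(speedup_with_overhead(f, peak, c), 2))  # 13 4.36
    print(round(speedup_with_overhead(f, 64, c), 2))          # 2.33
```

With the overhead set to zero the model reduces to plain Amdahl's law, and speedup keeps rising (ever more slowly) with core count.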

3. Multicore Organization
The main variables in a multicore organization are as follows:
• The number of core processors on the chip
• The number of levels of cache memory
• How cache memory is shared among cores
• Whether simultaneous multithreading (SMT) is employed
• The types of cores

Review question: What are the main design variables in a multicore organization?
Levels of Cache
There are four general organizations for multicore systems.

Dedicated L1 cache: In this organization, the only on-chip cache is
L1 cache, with each core having its own dedicated L1 cache. The L1
cache is divided into instruction and data caches for performance
reasons, whereas L2 and higher-level caches are unified.
This organization is found in some of the earlier multicore computer
chips and is still seen in embedded chips. An example of this
organization is the ARM11 MPCore.

Dedicated L2 cache: In this organization, there is enough area
available on the chip to allow for L2 cache, and each core has its
own dedicated L2 cache.
An example of this organization is the AMD Opteron.

Shared L2 cache: This organization uses a similar allocation of chip
space to memory, but with a shared L2 cache. The Intel Core Duo
has this organization.

Shared L3 cache: Finally, a shared L3 cache is used with dedicated
L1 and L2 caches for each core processor.
The Intel Core i7 is an example of this organization.

The use of a shared higher-level cache on the chip has several
advantages over exclusive reliance on dedicated caches:

• It can reduce overall miss rates.

• The data shared by multiple cores is not replicated at the shared
cache level.
• Inter-core communication is easy to implement, via shared
memory locations.
• It confines the cache coherency problem to the lower cache
levels, which may provide some additional performance
advantage.

Review question: List some advantages of a shared L2 cache among cores compared
to separate dedicated L2 caches for each core.

1. Constructive interference can reduce overall miss rates. That is,
if a thread on one core accesses a main memory location, this
brings the frame containing the referenced location into the shared
cache. If a thread on another core soon thereafter accesses the same
memory block, the memory locations will already be available in the
shared on-chip cache.
2. A related advantage is that data shared by multiple cores is not
replicated at the shared cache level.

3. With proper frame replacement algorithms, the amount of
shared cache allocated to each core is dynamic, so that threads
that have less locality can employ more cache.
4. Interprocessor communication is easy to implement, via
shared memory locations.
5. The use of a shared L2 cache confines the cache coherency
problem to the L1 cache level, which may provide some
additional performance advantage.
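A toy simulation of advantages 1 and 3 (constructive interference and dynamic allocation). It assumes fully associative caches with LRU replacement, two threads whose accesses strictly alternate, and made-up block traces; real L2 caches are set-associative and real traces are far messier. All names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of `capacity` blocks with LRU replacement."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()  # block number -> None, oldest first
        self.misses = 0

    def access(self, block: int) -> bool:
        """Return True on a hit; on a miss, load the block (evicting LRU)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block] = None
        return False


def interleave(trace_a, trace_b):
    """Alternate accesses from two threads: (0, a0), (1, b0), (0, a1), ..."""
    for a, b in zip(trace_a, trace_b):
        yield 0, a
        yield 1, b


def private_misses(trace_a, trace_b, total_capacity):
    """Each core gets its own cache holding half the total capacity."""
    caches = [LRUCache(total_capacity // 2), LRUCache(total_capacity // 2)]
    for core, block in interleave(trace_a, trace_b):
        caches[core].access(block)
    return sum(c.misses for c in caches)


def shared_misses(trace_a, trace_b, total_capacity):
    """Both cores share one cache with the full capacity."""
    cache = LRUCache(total_capacity)
    for _core, block in interleave(trace_a, trace_b):
        cache.access(block)
    return cache.misses


if __name__ == "__main__":
    # Constructive interference: both threads walk the same 8 shared blocks.
    shared_data = list(range(8))
    print(private_misses(shared_data, shared_data, 8))  # 16
    print(shared_misses(shared_data, shared_data, 8))   # 8

    # Dynamic allocation: thread A cycles over 6 blocks, thread B reuses 1.
    big, small = list(range(6)) * 3, [100] * 18
    print(private_misses(big, small, 8))  # 19 (A thrashes its 4-block cache)
    print(shared_misses(big, small, 8))   # 7 (A effectively gets 7 of 8 blocks)
```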

A potential advantage to having only dedicated L2 caches on the
chip is that each core enjoys more rapid access to its private L2
cache. This is advantageous for threads that exhibit strong locality.
As both the amount of memory available and the number of
cores grow, the use of a shared L3 cache combined with
dedicated per-core L2 caches seems likely to provide better
performance than either a single massive shared L2 cache or very
large dedicated L2 caches with no on-chip L3. An example of
the shared-L3 arrangement is the Xeon E5-2600/4600 processor.
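The trade-off between fast private access and a lower shared miss rate can be quantified with the usual average memory access time (AMAT) formula. All latencies and miss rates below are hypothetical values chosen for illustration, not figures for any real chip.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical numbers (cycles): a private L2 is closer to its core, so it
    # hits faster; a larger shared L2 misses less often.
    PRIVATE_HIT, SHARED_HIT, MEMORY_PENALTY = 10, 20, 200

    # Strong locality: both caches rarely miss, so the faster private hit wins.
    print(amat(PRIVATE_HIT, 0.02, MEMORY_PENALTY))   # 14.0
    print(amat(SHARED_HIT, 0.015, MEMORY_PENALTY))   # 23.0

    # Weak locality: the shared cache's lower miss rate matters more.
    print(amat(PRIVATE_HIT, 0.20, MEMORY_PENALTY))   # 50.0
    print(amat(SHARED_HIT, 0.10, MEMORY_PENALTY))    # 40.0
```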

4. Heterogeneous Multicore Organization
As clock speeds and logic densities increase, designers must
balance many design elements in attempts to maximize
performance and minimize power consumption. We have so
far examined a number of such approaches, including the
following:
1. Increase the percentage of the chip devoted to cache memory.
2. Increase the number of levels of cache memory.
3. Change the length (increase or decrease) and functional
components of the instruction pipeline.
4. Employ simultaneous multithreading.
5. Use multiple cores.

A typical case for the use of multiple cores is a chip with
multiple identical cores, known as homogeneous multicore
organization.
To achieve better results, in terms of performance and/or power
consumption, an increasingly popular design choice is
heterogeneous multicore organization, which refers to a
processor chip that includes more than one kind of core.
There are two approaches to heterogeneous multicore organization:

(i) Different Instruction Set Architectures

(ii) Equivalent Instruction Set Architectures

(i) Different Instruction Set Architectures
CPU/GPU multicore:
The most prominent trend in terms of heterogeneous multicore design
is the use of both CPUs and graphics processing units (GPUs) on the
same chip.
Briefly, GPUs are characterized by the ability to support thousands of
parallel execution threads. Thus, GPUs are well matched to
applications that process large amounts of vector and matrix data.

Multiple CPUs and GPUs share on-chip resources, such as the last-level
cache (LLC), interconnection network, and memory controllers. Most
critical is the way in which cache management policies provide
effective sharing of the LLC. The differences in cache sensitivity and
memory access rate between CPUs and GPUs create significant
challenges to the efficient sharing of the LLC.

The table above illustrates the potential performance benefit of combining
CPUs and GPUs for scientific applications. It shows the basic
operating parameters of an AMD chip, the A10-5800K. For floating-point
calculations, the CPU’s performance of 121.6 GFLOPS is dwarfed by the
GPU, which offers 614 GFLOPS to applications that can utilize the
resource effectively.
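Using the peak figures quoted for the A10-5800K, a quick calculation shows the size of the gap. The 1,000-GFLOP workload is hypothetical, and peak GFLOPS are rarely sustained in practice.

```python
CPU_GFLOPS = 121.6  # A10-5800K CPU cores, peak figure quoted above
GPU_GFLOPS = 614.0  # A10-5800K integrated GPU, peak figure quoted above


def compute_seconds(gflop: float, gflops_rate: float) -> float:
    """Time to execute `gflop` billion floating-point operations at a
    sustained rate of `gflops_rate` GFLOPS (peak rate assumed)."""
    return gflop / gflops_rate


if __name__ == "__main__":
    print(round(GPU_GFLOPS / CPU_GFLOPS, 2))  # 5.05
    work = 1000.0  # hypothetical workload: 1,000 GFLOP (10**12 operations)
    print(round(compute_seconds(work, CPU_GFLOPS), 2))  # 8.22
    print(round(compute_seconds(work, GPU_GFLOPS), 2))  # 1.63
```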

The overall objective is to allow programmers to write applications
that exploit the serial power of CPUs and the parallel-processing
power of GPUs seamlessly with efficient coordination at the OS and
hardware level.
CPU/DSP multicore:
Another common example of a heterogeneous multicore chip is a
mixture of CPUs and digital signal processors (DSPs).
A DSP provides ultra-fast instruction sequences (shift and add;
multiply and add), which are commonly used in math-intensive
digital signal processing applications.

DSPs are used to process analog data from sources such as sound,
weather satellites, and earthquake monitors.
The signals are converted into digital data and analyzed using
algorithms such as the Fast Fourier Transform (FFT).
DSP cores are widely used in myriad devices, including cellphones,
sound cards, fax machines, modems, hard disks, and digital TVs.
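The multiply-and-add pattern that DSPs accelerate sits at the heart of most filtering algorithms. As a sketch, here is a direct-form finite impulse response (FIR) filter, chosen for brevity in place of the FFT mentioned above; each step of the inner loop is one multiply-accumulate. The function name is hypothetical, and a real DSP would run this loop in dedicated hardware rather than in Python.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k].

    The inner loop is a chain of multiply-and-add (MAC) steps, the operation
    DSP cores are built to execute quickly.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coefficients):
            if n - k >= 0:
                acc += h * samples[n - k]  # one multiply-accumulate
        output.append(acc)
    return output


if __name__ == "__main__":
    # Two-tap moving average over a short digitized signal.
    print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```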

(ii) Equivalent Instruction Set Architectures

Another recent approach to heterogeneous multicore organization is the
use of multiple cores that have equivalent ISAs but vary in performance
or power efficiency. The leading example of this is ARM’s big.LITTLE
architecture, which we examine in this section.

An example big.LITTLE chip contains two high-performance Cortex-A15 cores
and two lower-performance, lower-power-consuming Cortex-A7
cores. The A7 cores handle less computation-intensive tasks, such
as background processing, playing music, sending
texts, and making phone calls. The A15 cores are invoked for
high-intensity tasks, such as video, gaming, and navigation.

The big.LITTLE architecture is aimed at the smartphone and tablet
market.

These are devices whose performance demands from users are
increasing at a much faster rate than the capacity of batteries or the
power savings from semiconductor process advances.
The usage pattern for smartphones and tablets is quite dynamic.
The A15 is designed for maximum performance within the mobile
power budget.
The A7 processor is designed for maximum efficiency and high
enough performance to address all but the most intense periods of
work.
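A deliberately simplified placement policy illustrates the idea of matching task intensity to core type. The intensity scores and the 0.6 threshold are hypothetical, and this sketch does not represent the mechanism that big.LITTLE systems actually use to move work between core types.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    intensity: float  # hypothetical load estimate: 0.0 (idle) .. 1.0 (peak)


BIG_CORE = "Cortex-A15"
LITTLE_CORE = "Cortex-A7"


def pick_core(task: Task, threshold: float = 0.6) -> str:
    """Toy placement policy: light tasks go to the efficient A7 cores,
    demanding tasks go to the high-performance A15 cores."""
    return BIG_CORE if task.intensity >= threshold else LITTLE_CORE


if __name__ == "__main__":
    tasks = [Task("play music", 0.1), Task("send text", 0.05),
             Task("phone call", 0.2), Task("navigation", 0.7),
             Task("video", 0.8), Task("gaming", 0.95)]
    for t in tasks:
        print(f"{t.name:12} -> {pick_core(t)}")
```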

Intel Core i7-990X
Each core has its own dedicated L2 cache, and the six cores share a
12-MB L3 cache.
The chip uses prefetching, in which the hardware examines memory
access patterns and attempts to fill the caches speculatively with
data that is likely to be requested soon.
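The text above does not describe the i7-990X's prefetch logic, so the sketch below shows a generic stride prefetcher instead: when two consecutive address deltas match, it predicts the next few addresses along that stride. The class name and the prefetch degree are assumptions for illustration.

```python
class StridePrefetcher:
    """Minimal stride detector: if the last two address deltas match,
    predict the next `degree` addresses along that stride."""

    def __init__(self, degree: int = 2) -> None:
        self.degree = degree
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr: int) -> list:
        """Record an access and return addresses to prefetch (possibly none)."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


if __name__ == "__main__":
    pf = StridePrefetcher(degree=2)
    # A loop walking an array in 64-byte steps.
    for a in (1000, 1064, 1128, 1192):
        print(a, pf.observe(a))
    # 1000 []
    # 1064 []
    # 1128 [1192, 1256]
    # 1192 [1256, 1320]
```

The detector needs two matching deltas before it speculates, which keeps it from prefetching on a single random jump.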
The Core i7-990X chip supports two forms of external communication to
other chips:
• the DDR3 memory controller
• the QuickPath Interconnect (QPI)

The DDR3 memory controller

It brings the memory controller for the DDR main memory onto
the chip. The interface supports three channels that are 8 bytes
wide for a total bus width of 192 bits, for an aggregate data rate
of up to 32 GB/s.

The QuickPath Interconnect (QPI)

It is a cache-coherent, point-to-point, link-based electrical
interconnect specification for Intel processors and chipsets. It
enables high-speed communication among connected processor
chips. The QPI link operates at 6.4 GT/s (gigatransfers per second). At
16 bits per transfer, that adds up to 12.8 GB/s in each direction, and
since QPI links involve dedicated bidirectional pairs, the total bandwidth
is 25.6 GB/s.
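The bandwidth figures above follow from simple arithmetic. The DDR3 transfer rate is not stated in the text, so DDR3-1333 (1333 MT/s) is assumed here; with it, three 8-byte channels give roughly 32 GB/s. Function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
def ddr_bandwidth_gbs(channels: int, bytes_per_channel: int,
                      megatransfers_per_s: float) -> float:
    """Peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return channels * bytes_per_channel * megatransfers_per_s * 1e6 / 1e9


def qpi_bandwidth_gbs(gigatransfers_per_s: float, bits_per_transfer: int,
                      directions: int = 2) -> float:
    """Peak QPI link bandwidth in GB/s, summed over `directions`."""
    per_direction = gigatransfers_per_s * bits_per_transfer / 8
    return per_direction * directions


if __name__ == "__main__":
    # 3 channels x 8 bytes = 24 bytes (192 bits) per transfer.
    # Assumed transfer rate: DDR3-1333 (1333 MT/s).
    print(round(ddr_bandwidth_gbs(3, 8, 1333), 1))   # 32.0
    print(qpi_bandwidth_gbs(6.4, 16, directions=1))  # 12.8
    print(qpi_bandwidth_gbs(6.4, 16))                # 25.6
```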

2. Give several reasons for the choice by designers to move to a
multicore organization rather than increase parallelism within a single
processor.
4. List some examples of applications that benefit directly from the
ability to scale throughput with the number of cores.
5. At a top level,
6.
