Comp Architecture 101

Areg Melik-Adamyan, an Engineering Manager at Intel, presents an overview of CPU architecture, focusing on concepts such as pipelining, memory hierarchy, out-of-order execution, and branch prediction. The lecture aims to enhance understanding of CPU operations and performance optimization. Key topics include the limitations of pipelining, memory trade-offs, and the role of superscalar architecture in improving instruction execution efficiency.


Areg Melik-Adamyan, PhD

Engineering Manager, Intel Developer Products Division


Introduction

Who am I?
• 7 years at Intel, 17 years in industry
• Managing compiler teams (GCC, Go)
• 10 years teaching
Why are we here?
• To better understand how the CPU works

Textbooks and References

• We only touch the tip of the iceberg


• Explain main concepts only
• Not enough to develop your own microprocessor…
• But enough to better understand the behavior and performance of your programs
• Hennessy, Patterson, Computer Architecture: A Quantitative Approach, 6th Ed.
• Blaauw, Brooks, Computer Architecture: Concepts and Evolution

Lecture Outline

• Pipeline
• Memory Hierarchy (Caches: +1 lecture later)
• Out-of-order execution
• Branch prediction
• Real example: Haswell Microarchitecture

Layers of Abstraction
Application
Algorithms
Software
Programming Languages
Operating Systems/Libraries
Interface between
HW and SW
Instruction Set Architecture
Microarchitecture
Gates/Register-Transfer Level (RTL)
Hardware
Circuits
Physics

Basic CPU Actions

F D E M W

1. Fetch instruction by PC from memory


2. Decode it and read its operands from registers
3. Execute calculations
4. Read/write memory
5. Write the result into registers and update PC
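
A minimal sketch in C of the five steps above, for a hypothetical toy machine; the opcodes, register count, and instruction encoding are invented for illustration and do not correspond to any real ISA:

#include <stdint.h>

/* Toy machine state: a register file, a program counter, and a small
   unified instruction/data memory. The 3-operand encoding is invented. */
enum { OP_ADD = 0, OP_LOAD = 1, OP_STORE = 2 };

typedef struct {
    uint32_t regs[16];
    uint32_t pc;
    uint32_t mem[1024];
} Cpu;

void step(Cpu *c) {
    /* 1. Fetch the instruction at the PC from memory */
    uint32_t inst = c->mem[c->pc];

    /* 2. Decode it and read its operands from registers */
    uint32_t op  = inst >> 24, rd  = (inst >> 16) & 0xF,
             rs1 = (inst >> 8) & 0xF, rs2 = inst & 0xF;
    uint32_t a = c->regs[rs1], b = c->regs[rs2];

    /* 3. Execute the calculation */
    uint32_t result = (op == OP_ADD) ? a + b : a;

    /* 4. Read or write memory if the instruction needs it */
    if (op == OP_LOAD)  result = c->mem[result];
    if (op == OP_STORE) c->mem[result] = c->regs[rd];

    /* 5. Write the result into the register file and update the PC */
    if (op != OP_STORE) c->regs[rd] = result;
    c->pc += 1;
}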

Non-Pipelined Processing

• Instructions are processed sequentially, one per cycle


• How to speed up?
• SW: decrease the number of instructions
• HW: decrease the time to process one instruction, or overlap the processing of instructions, i.e., build a pipeline
Pipeline

• Processing is split into several steps called “stages”


• Each stage takes one cycle
• The clock cycle is determined by the longest stage
• Instructions are overlapped
• A new instruction occupies a stage as soon as the previous one leaves it
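
A back-of-envelope model of the effect; the stage count, stage latency, and instruction count below are example values, not measurements of any particular CPU:

#include <stdio.h>

int main(void) {
    const int stages = 5;          /* example: a 5-stage pipeline        */
    const double stage_ns = 1.0;   /* example: 1 ns per stage            */
    const long n = 1000;           /* example: 1000 instructions         */

    /* Non-pipelined: each instruction passes through all stages
       before the next one starts. */
    double serial_ns = n * stages * stage_ns;

    /* Pipelined (ideal, no hazards): once the pipeline is full,
       one instruction completes per cycle. */
    double pipelined_ns = (stages + n - 1) * stage_ns;

    printf("serial: %.0f ns, pipelined: %.0f ns, speedup: %.2fx\n",
           serial_ns, pipelined_ns, serial_ns / pipelined_ns);
    return 0;
}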

Pipeline vs Non-Pipeline

Pipeline Limitations

• The maximum throughput of the pipeline is one instruction per clock cycle
• This rate is rarely reached, due to dependencies among instructions (data or control) and in-order processing

Pipeline Limitations
• Various types of hazards:
• read after write (RAW), a true dependency
• write after read (WAR), an anti-dependency
• write after write (WAW), an output dependency
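
A C-level sketch of the three hazard types; the assumption is that a, b, c, d live in registers, so the constraints apply to the machine instructions a compiler would emit for these statements:

int hazards(int a, int b, int c, int d) {
    c = a + b;   /* writes c                                               */
    d = c + 1;   /* RAW: reads c after the write above (true dependency)   */
    a = b * 2;   /* WAR: writes a, which the first statement read (anti)   */
    c = d - b;   /* WAW: writes c again after the first write (output)     */
    return a + c + d;
}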

Motivation for Memory Hierarchy

Memory Tradeoffs

• Large memories are slow


• Small memories are fast, but expensive and consume high power
• Goal: give the processor the illusion of a memory that is fast, large, cheap, and consumes little energy
• Solution: Hierarchy of Memories
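
A classic illustration of why the hierarchy matters, as a sketch assuming a row-major C array: the first loop walks memory sequentially and reuses cache lines, while the second jumps a whole row between accesses and, for large matrices, misses far more often:

#define N 2048
static double m[N][N];

/* Good locality: consecutive iterations touch consecutive addresses,
   so most accesses hit in the upper levels of the hierarchy. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor locality: each access is N*sizeof(double) bytes away from the
   previous one, so the caches help much less and DRAM latency dominates. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}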

Superscalar: Wide Pipeline

• Pipeline exploits instruction level parallelism (ILP)


• Can we improve? Execute two instructions in parallel
• Need to double HW structures
• Max speedup is 2 instructions per cycle (IPC=2)
• The real speedup is less due to dependencies and in-order execution

Is Superscalar Good Enough?
• Theoretically can execute multiple instructions in parallel
• Wide pipeline => more performance
• But…
• Only independent subsequent instructions can be executed in parallel
• Whereas subsequent instructions are often dependent
• So the utilization of the second pipe is often low
• Solution: out-of-order execution
• Execute instructions based on the “data flow” graph rather than program order
• Still need to preserve the illusion of in-order execution
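
A small sketch of what the hardware exploits: the two chains below are independent in the data-flow graph, so an out-of-order core can interleave their operations even though the program lists one chain after the other:

/* Chain 1: b depends on a, c depends on b (serial in the data-flow graph). */
/* Chain 2: y depends on x, z depends on y (serial, but independent of 1).  */
long dataflow(long a, long x) {
    long b = a * 3;      /* chain 1                                     */
    long c = b + 7;      /* chain 1, must wait for b                    */
    long y = x * 5;      /* chain 2: can execute in parallel with chain 1 */
    long z = y + 11;     /* chain 2, must wait for y                    */
    return c ^ z;        /* joins both chains                           */
}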
Data Flow Analysis

Instruction “Grinder”

• By then, technology allowed building wide hardware, but the code representation remained sequential
• Decision: extract the parallelism back by means of hardware
• Compatibility burden: the machine must still appear to execute sequentially

Why Is Order Important?

Maintaining Architectural State

Dependency Check

How Large Should the Window Be?

Limitation: False Dependencies

Register Renaming

Limitation: Branches

Dynamic Branch Prediction

How to Predict a Branch?

Using History Patterns

Local Predictor

Global Predictor

Concepts Covered

Intel Processor Roadmap

Haswell Floorplan

Block Diagram

FrontEnd
• Instruction Fetch and Decode
• 32 KB 8-way Icache
• 4 decoders, up to 4 inst/cycle
• CISC to RISC transformation
• Decode pipeline supports 16 bytes per cycle
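
A quick sanity check of the Icache geometry, assuming the usual 64-byte line size (the slide does not state the line size, so that value is an assumption):

#include <stdio.h>

int main(void) {
    const int size_bytes = 32 * 1024;  /* 32 KB L1 instruction cache */
    const int ways = 8;                /* 8-way set associative      */
    const int line_bytes = 64;         /* assumption: 64-byte lines  */
    int sets = size_bytes / (ways * line_bytes);
    printf("sets = %d\n", sets);       /* 64 under these assumptions */
    return 0;
}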

FrontEnd: Instruction Decode

• Four decoding units decode instructions into uops
• The first can decode all instructions up to four uops in size
• Uops emitted by the decoders are directed to the Decode Queue and to the Decoded Uop Cache
• Instructions with more than 4 uops generate their uops from the MSROM
• The MSROM bandwidth is 4 uops per cycle

FrontEnd: Decode UOP Cache

FrontEnd: Loop Stream Detector
• The LSD detects small loops that fit in the Decode Queue
• The loop streams from the uop queue, with no more fetching, decoding, or reading uops from any of the caches
• Works until a branch misprediction
• Loops with the following attributes qualify for LSD replay:
• Up to 56 uops
• All uops are also resident in the Uop Cache
• No more than eight taken branches
• No CALL or RET
• No mismatched stack operations (e.g., more PUSHes than POPs)
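
A hedged example of the kind of loop that can qualify: a tight, call-free loop whose body compiles to only a handful of uops with a single backward branch. Whether it actually replays from the LSD depends on the compiled uop count and the other criteria above:

/* A small, call-free loop body: typically just a few uops per iteration
   (load, add, increment, compare-and-branch), well under the 56-uop limit. */
long sum(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}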

FrontEnd: Macro-Fusion
• Merge two instructions into a single uop
• Increased decode, rename, and retire bandwidth
• Power savings from representing more work in fewer bits
• The first instruction of a macro-fused pair modifies the flags
• CMP, TEST, ADD, SUB, AND, INC, DEC
• The second instruction of a macro-fusible pair is a conditional branch
• For each first instruction, only certain branches can fuse with it
• These pairs are common in many applications
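
A hedged example: for a loop like the one below, compilers commonly emit a flag-setting compare immediately followed by a conditional branch for the exit test, which is exactly the kind of pair that can macro-fuse into one uop:

/* The "i < n" test typically becomes a CMP followed by a conditional
   branch in the generated code, a fusible flag-setter + Jcc pair. */
int count_positive(const int *a, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)     /* loop-exit compare + branch          */
        if (a[i] > 0)               /* another compare + branch, also potentially fusible */
            count++;
    return count;
}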
OOO Structures

OOO: Renamer

OOO: Dependency Breaking Idiom

EXE

Core Cache Size/Latency/BW

ST vs MT

