Computer Architecture

Lecture 01 - Introduction

Pengju Ren
Institute of Artificial Intelligence and Robotics
Xi'an Jiaotong University
http://gr.xjtu.edu.cn/web/pengjuren
Course Administration

Instructor: Pengju Ren
TA: Gelin Fu (Ph.D. candidate)
Lectures: Two 100-minute lectures a week
Textbook: Computer Architecture: A Quantitative Approach, 6th Edition (2019)
Prerequisite: Digital System Structure and Design
Preface
What is Computer Architecture

Application
   |
   |  Gap too large to bridge in one step
   v
Physics

In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to implement information processing applications efficiently using available manufacturing technologies.
What is Computer Architecture

Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture (ISA)
Microarchitecture
Register-Transfer Level (RTL)
Gates
Circuits
Devices
Physics

(The stack of abstraction layers bridges the application-to-physics gap one step at a time.)
What is Computer Architecture

Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture (ISA)
Microarchitecture                    <- This course
Register-Transfer Level (RTL)
Gates
Circuits
Devices
Physics
What is Computer Architecture

Applications and algorithms:
– Suggest how to improve architecture
– Provide revenue to fund development

Technology constraints:
– Restrict what can be done efficiently
– New technologies make new architectures possible
Computing Devices Then…
(Photographs not reproduced in this transcription.)

Computing Devices Now
(Photographs not reproduced in this transcription.)

Modern computing is as much about enhancing capabilities as about data processing!
Architecture continually changing
Single-Thread (Sequential) Processor Performance
(Performance plot not reproduced; annotation: "Multicore or ManyCore")
Moore’s Law Scaling with Cores
Global Semiconductor Market
The global semiconductor market is estimated at $450 billion USD in revenue for 2020. Products using these semiconductors represent global revenues of $2 trillion USD, or around 3.5% of global gross domestic product (GDP).

(Source: ISSCC 2021, Feb.)
Advanced Technology Nodes Continue to Provide Value

Steady progress in two-dimensional transistor scaling and a variety of device enhancement techniques have sustained energy-efficiency improvements and device density gains from one technology generation to the next.

(Source: ISSCC 2021, Feb.)
Upheaval in Computer Design

• For most of the last 50 years, Moore's Law ruled
  – Technology scaling allowed continual performance/energy improvements without changing the software model
• In the last decade, technology scaling slowed/stopped
  – Dennard (voltage) scaling is over (supply voltage ~fixed)
  – Moore's Law (cost/transistor) over?
  – No competitive replacement for CMOS anytime soon
  – Energy efficiency constrains everything
• No "free lunch" for software developers; they must consider:
  – Parallel systems
  – Heterogeneous systems
Today's Dominant Target Systems

• Mobile (smartphone/tablet)
  – >1 billion sold/year
  – Market dominated by ARM-ISA-compatible general-purpose processors in a system-on-a-chip (SoC)
  – Plus a sea of custom accelerators (radio, image, video, graphics, audio, motion, location, security, etc.)
• Warehouse-Scale Computers (WSCs)
  – 100,000s of cores per warehouse
  – Market dominated by x86-compatible server chips
  – Dedicated apps, plus cloud hosting of virtual machines
  – Now seeing increasing use of GPUs, FPGAs, and custom hardware to accelerate workloads
• Embedded computing
  – Wired/wireless network infrastructure, printers
  – Consumer TV/Music/Games/Automotive/Camera/MP3
  – Internet of Things!
Course Content: Computer Architecture

• Instruction-Level Parallelism
  – Superscalar
  – Very Long Instruction Word (VLIW)
• Advanced Memory and Caches
• Data-Level Parallelism
  – Vector Machines
  – GPUs
• Thread-Level Parallelism
  – Multithreading
  – Multiprocessor/Multicore/ManyCore
• Warehouse-Scale Computers (Request-Level Parallelism)
• Domain-Specific Architectures (DNN Accelerators)
Same Architecture, Different Microarchitecture
(Comparison figure not reproduced in this transcription.)
Different Architecture, Different Microarchitecture
(Comparison figure not reproduced in this transcription.)
Where do Operands Come From, and Where do Results Go?

(Diagram: a Processor containing an ALU, connected to MEMORY.)
Where do Operands Come From, and Where do Results Go?

(Diagrams: four machine models, Stack, Accumulator, Reg-Mem, and Reg-Reg, each a Processor with an ALU above a MEMORY, differing in where the ALU's operands live.)
Where do Operands Come From, and Where do Results Go?

(Diagrams as on the previous slide.)

Number of explicitly named operands:
  Stack: 0 | Accumulator: 1 | Reg-Mem: 2 or 3 | Reg-Reg: 2 or 3
Stack-Based Instruction Set Architecture (ISA)

• Burroughs B5000 (1960)
• Burroughs B6700
• HP 3000
• ICL 2900
• Symbolics 3600
• Inmos Transputer

Modern:
• Forth machines
• Java Virtual Machine
• Intel x87 Floating-Point Unit

(Diagram: the Stack machine model, a Processor with an ALU over MEMORY.)
Evaluation of Expressions

(A sequence of animation slides, not reproduced here, steps through evaluating an expression on the stack.)
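Since the animation itself is lost in this transcription, here is a minimal C sketch of the idea: operands are pushed, and each operator pops its inputs and pushes its result. The expression (A + B) * C and its values are illustrative assumptions, not necessarily the slides' example.

#include <stdio.h>

/* A minimal stack machine: operands are pushed, and each
   operator pops its inputs and pushes its result. */
static int stack[16];
static int sp = 0;                 /* index of first free slot */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

int main(void) {
    int A = 2, B = 3, C = 4;

    /* Evaluate (A + B) * C in postfix order: A B + C *  */
    push(A);
    push(B);
    push(pop() + pop());           /* Add: pops two, pushes the sum     */
    push(C);
    push(pop() * pop());           /* Mul: pops two, pushes the product */

    printf("(A + B) * C = %d\n", pop());   /* prints 20 */
    return 0;
}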
Hardware Organization of the Stack

• The stack is part of the processor state
  ⇒ the stack must be bounded and small,
    ≈ the number of registers, not the size of main memory
• Conceptually the stack is unbounded
  ⇒ a part of the stack is included in the processor state; the rest is kept in main memory
Stack Operations / Implicit Memory References

Suppose the top 2 elements of the stack are kept in registers and the rest is kept in memory.
  ⇒ Each push operation costs 1 memory reference;
    each pop operation costs 1 memory reference.

Better performance comes from keeping the top N elements in registers; memory references are then made only when the register stack overflows or underflows.
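A minimal C sketch of this top-of-stack register scheme (the register count N = 2, the memory size, and the example expression are illustrative assumptions): a push that finds all registers full spills the oldest value to memory, and a pop that empties the registers refills one from memory, and the counter tallies the implicit memory references.

#include <stdio.h>

/* Top-of-stack cache: the top N stack elements live in registers;
   the rest spill to memory. Counts the implicit memory references. */
#define N 2                            /* registers holding the stack top */

static int regs[N], in_regs = 0;       /* register portion of the stack */
static int mem[64], in_mem = 0;        /* memory portion of the stack   */
static int mem_refs = 0;

static void push(int v) {
    if (in_regs == N) {                /* overflow: spill oldest to memory */
        mem[in_mem++] = regs[0];
        for (int i = 1; i < N; i++) regs[i - 1] = regs[i];
        in_regs--;
        mem_refs++;
    }
    regs[in_regs++] = v;
}

static int pop(void) {
    int v = regs[--in_regs];
    if (in_regs == 0 && in_mem > 0) {  /* underflow: refill from memory */
        regs[in_regs++] = mem[--in_mem];
        mem_refs++;
    }
    return v;
}

int main(void) {
    /* (A + B) * (C + D): needs 3 live values at once, but only 2 registers */
    push(2); push(3); push(pop() + pop());
    push(4); push(5); push(pop() + pop());
    push(pop() * pop());
    printf("result = %d, memory references = %d\n", pop(), mem_refs);
    return 0;                          /* result = 45, memory references = 2 */
}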
Stack Size and Memory References

(Worked example not reproduced; its stack traffic amounts to four stores and four fetches.)
Where do Operands Come From, and Where do Results Go?

(Diagrams: the four machine models — Stack, Accumulator, Reg-Mem, Reg-Reg — as before.)

C = A + B

  Stack     Accumulator   Reg-Mem          Reg-Reg
  Push A    Load A        Load R1, A       Load R1, A
  Push B    Add B         Add R3, R1, B    Load R2, B
  Add       Store C       Store R3, C      Add R3, R1, R2
  Pop C                                    Store R3, C
Classes of Instructions

• Data Transfer
  – LD, ST, MFC1, MTC1, MFC0, MTC0
• ALU
  – ADD, SUB, AND, OR, XOR, MUL, DIV, SLT, LUI
• Control Flow
  – BEQZ, JR, JAL, TRAP, ERET
• Floating Point
  – ADD.D, SUB.S, MUL.D, C.LT.D, CVT.S.W
• Multimedia (SIMD)
  – ADD.PS, SUB.PS, MUL.PS, C.LT.PS
• String
  – REP MOVSB (x86)
Addressing Modes: How to Get Operands from Memory (MIPS)

(Table of addressing modes not reproduced in this transcription.)
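MIPS exposes a single memory addressing mode for loads and stores: displacement, where the effective address is a base register plus a sign-extended 16-bit constant. A minimal C sketch of that computation (the toy register file, memory size, register numbers, and values are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>

/* Toy machine state */
static uint32_t R[32];          /* register file         */
static uint8_t  mem[1024];      /* byte-addressed memory */

/* MIPS-style displacement addressing: lw rt, imm(rs)
   effective address = R[rs] + sign-extended 16-bit immediate */
static uint32_t lw(int rt, int16_t imm, int rs) {
    uint32_t ea = R[rs] + (int32_t)imm;        /* effective address */
    R[rt] = mem[ea] | mem[ea + 1] << 8 |       /* little-endian word load */
            mem[ea + 2] << 16 | (uint32_t)mem[ea + 3] << 24;
    return ea;
}

int main(void) {
    mem[100] = 0x2A;                 /* the value 42, little-endian */
    R[8] = 104;                      /* base register, e.g., $t0    */
    uint32_t ea = lw(9, -4, 8);      /* lw $t1, -4($t0)             */
    printf("EA = %u, loaded = %u\n", ea, R[9]);  /* EA = 100, loaded = 42 */
    return 0;
}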
Case Study: x86 (IA-32) Instruction Encoding

(Encoding diagram not reproduced in this transcription.)
RISC-V Instruction Encoding (1)

(Encoding diagram not reproduced in this transcription.)
RISC-V Instruction Encoding (2)

http://www-inst.eecs.berkeley.edu/~cs61c/fa18/img/riscvcard.pdf
Real-World Instruction Sets

(Comparison table not reproduced in this transcription.)
Why the Diversity in ISAs?
Recap

(Left: the abstraction-layer stack — Application, Algorithm, Programming Language, Operating System/Virtual Machines, Instruction Set Architecture (ISA), Microarchitecture, Register-Transfer Level (RTL), Gates, Circuits, Devices, Physics.)

• ISA vs. Microarchitecture
• ISA Characteristics
  – Machine Models
  – Encoding
  – Data Types
  – Instructions
  – Addressing Modes
And in conclusion …

• Computer Architecture >> ISAs and RTL
• Computer architecture is about the interaction of hardware and software, and the design of appropriate abstraction layers
• Computer architecture is shaped by technology and applications
  – History provides lessons for the future
• Computer science is at a crossroads, from sequential to parallel computing
  – Salvation requires innovation in many fields, including computer architecture
• Read Chapter 1 & Appendix A for next time! (6th edition)
Next Lecture: RISC-V ISA, Datapath & Control
(ISA and Microarchitecture)
Acknowledgements

• Some slides contain material developed and copyrighted by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – David Patterson (UCB)
  – David Wentzlaff (Princeton University)
• MIT material derived from course 6.823
• UCB material derived from courses CS252 and CS61C
Idealized Uniprocessor Model

• Processor names variables:
  – Integers, floats, pointers, arrays, structures, etc.
  – These are really words, e.g., 64-bit doubles, 32-bit ints, bytes, etc.
• Processor performs operations on those variables:
  – Arithmetic, logical operations, etc.
  – Only performs these operations on values in registers
• Processor controls the order, as specified by the program
  – Branches (if), loops (while), function calls, etc.
• Idealized cost
  – Each operation has roughly the same cost: add, multiply, etc.
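Under this idealized model, the made-up loop below costs exactly one multiply and one add per element, every operation at the same unit price; much of this course is about why real machines, with memory hierarchies and pipelines, do not behave this way.

#include <stdio.h>

/* In the idealized model, each of the n multiplies and n adds below
   has the same unit cost, and reaching the operands is free. */
int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    double a = 2.0, s[4];
    for (int i = 0; i < 4; i++)
        s[i] = a * x[i] + y[i];          /* 1 multiply + 1 add per element */
    printf("s[0] = %g\n", s[0]);         /* 7 */
    return 0;
}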
Six Great Ideas in Computer Architecture
New-School Machine Architecture

• Parallel Requests
  – Assigned to a computer, e.g., search "Cats"
• Parallel Threads
  – Assigned to a core, e.g., lookup, ads
• Parallel Instructions
  – >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data
  – >1 data item @ one time, e.g., add of 4 pairs of words: A0+B0, A1+B1, A2+B2, A3+B3
• Hardware descriptions
  – All gates working in parallel at the same time

(Right: the software/hardware stack that harnesses this parallelism to achieve high performance — Warehouse-Scale Computer and Smart Phone at the top, then Computer, Cores, Memory (Cache), Input/Output, Instruction Unit(s) and Functional Unit(s), Main Memory, down to Logic Gates.)
Great Idea #1: Abstraction
(Levels of Representation/Interpretation)

High-Level Language Program (e.g., C)
  ↓ Compiler
Assembly Language Program
  ↓ Assembler
Machine Language Program
  (e.g., 1010 1110 0001 0010 0000 0000 0000 0000
         1010 1101 1110 0010 0000 0000 0000 0100)
  ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
  ↓ Architecture Implementation
Logic Circuit Description (Circuit Schematic Diagrams)
Great Idea #2: Moore's Law
(Technology-driven development)

(Transistor-count plot not reproduced in this transcription.)
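Stated as a formula (a common idealization of the observation, assuming a roughly two-year doubling period), the transistor count per chip grows as

\[
N(t) \;\approx\; N_0 \cdot 2^{\,t/2}, \qquad t \text{ in years since a baseline count } N_0 .
\]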
Great Idea #3: Principle of Locality / Memory Hierarchy

(Memory-hierarchy figure not reproduced in this transcription.)
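As a small demonstration of the principle (the array size is an arbitrary assumption), the two loops below do the same arithmetic, but the first walks the array in row-major order, so consecutive accesses share cache lines, while the second strides across rows and defeats the caches.

#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0;

    /* Row-major traversal: consecutive accesses hit the same
       cache line, exploiting spatial locality. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal: each access jumps N*8 bytes,
       touching a new cache line almost every time. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%g\n", sum);   /* same result; very different memory behavior */
    return 0;
}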
Great Idea #4: Parallelism

(Figure not reproduced in this transcription.)
Amdahl's Law

(Portrait of Gene Amdahl, computer pioneer.)
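The law itself: if a fraction f of execution time benefits from a speedup of s, the overall speedup is

\[
\text{Speedup} \;=\; \frac{1}{(1-f) + \dfrac{f}{s}}, \qquad
\lim_{s \to \infty} \text{Speedup} \;=\; \frac{1}{1-f},
\]

so even an infinitely fast improvement to part of a program is bounded by the part it leaves untouched.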
Great Idea #5: Performance Measurement and Improvement

• Match the application to the underlying hardware (e.g., matrix manipulation)
• Latency
  – How long to set the problem up
  – How much faster does it execute once it gets going
  – It is all about time to finish
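"Time to finish" is measured directly; a minimal sketch using the POSIX monotonic clock (the workload being timed is a placeholder, not anything from the slides):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    volatile double sum = 0;   /* volatile keeps the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 1; i <= 10000000L; i++)     /* placeholder workload */
        sum += 1.0 / i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("latency: %.3f s (sum=%f)\n", secs, sum);
    return 0;
}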
Great Idea #6: Dependability via Redundancy

1 + 1 = 2     1 + 1 = 2     1 + 1 = 1 (FAIL!)

Increasing transistor density reduces the cost of redundancy.
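The 1 + 1 example above is two-out-of-three voting: run the computation on three units and take the majority, so a single faulty unit is outvoted. A minimal C sketch (the deliberately faulty adder is contrived for illustration):

#include <stdio.h>

/* Triple modular redundancy: three copies compute, and a majority
   vote masks any single faulty result. */
static int vote(int a, int b, int c) {
    if (a == b || a == c) return a;   /* a agrees with at least one other */
    return b;                         /* with one fault, b and c must agree */
}

static int good_add(int x, int y)   { return x + y; }
static int faulty_add(int x, int y) { return x + y - 1; }   /* stuck fault */

int main(void) {
    int r = vote(good_add(1, 1), good_add(1, 1), faulty_add(1, 1));
    printf("1 + 1 = %d\n", r);   /* 2: the faulty unit is outvoted */
    return 0;
}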