ECE 5745 Overview
Course Overview
Christopher Batten
http://www.csl.cornell.edu/courses/ece5745
• Course Goal, Structure, Motivation • Activity • ASIC Design Case Studies
Application
Algorithm
Programming Language
Operating System
Compiler
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gate Level
Circuits
Devices
Technology
Application Requirements
• Provide motivation for building system
Computer Architecture
Figure: Microprocessor trends on a log scale (10^0 to 10^8), 1975 to 2025. Series: Transistors (Thousands), SPECint (single-core), SPECrate (4-7 cores), Frequency (MHz), Power (W), Number of Cores, Number of Accelerators. Annotated eras: Pipelining & Caches; Superscalar Execution; Superscalar Out-of-Order Execution; Aggressive Superscalar Out-of-Order Execution; Parallelization & Specialization. Processor organization progresses from Single-Core to Multi-Core to Accelerator.
C. Batten, M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, K. Rupp & [Y. Shao, IEEE Micro'15] & [C. Leiserson, Science'20]
Figure: Processor organizations (Out-of-Order, Superscalar, Superpipelined, Accelerated) built from processors (P), accelerators (Xcel), instruction caches (I$), data caches (D$), and networks.
Figure: Flexibility vs. Specialization. Axis: Performance (Tasks per Second). Regions: Simple Processor, Embedded Architectures, High-Performance Architectures, Accelerator Architectures (More Flexible Accelerator, Less Flexible Accelerator, Custom), Instructions. Limits: Design Performance Constraint, Design Power Constraint.
Standard Cell Design
• Cells arranged in rows
• Cells have standard height but vary in width
• Designed to connect power, ground, and wells by abutment
• Generated memory arrays (Mem 1, Mem 2)
Figure: Standard-cell layout of NAND2 and flip-flop cells showing the well contact, VDD rail, GND rail, power rails in M1, and cell I/O on M2.
Figure: Ripple carry adder with carry chain highlighted.
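To make the row-based arrangement concrete, here is a minimal sketch of packing fixed-height, variable-width cells into rows by abutment. The cell widths, the row width, and the names `Cell` and `place_in_rows` are hypothetical and chosen only for illustration; real placers optimize wirelength and timing rather than packing greedily.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    width: int  # width in placement-grid units (hypothetical values)


def place_in_rows(cells, row_width):
    """Greedily pack cells left to right; abutting cells share no gap,
    so each cell's x offset is the running sum of widths in its row."""
    rows = [[]]
    x = 0
    for cell in cells:
        if cell.width > row_width:
            raise ValueError(f"{cell.name} is wider than a row")
        if x + cell.width > row_width:
            # Current row is full: start a new row at x = 0.
            rows.append([])
            x = 0
        rows[-1].append((cell.name, x))
        x += cell.width
    return rows


# Hypothetical cell library: all cells share one height, widths differ.
cells = [Cell("NAND2", 3), Cell("DFF", 10), Cell("INV", 2), Cell("NAND2", 3)]
rows = place_in_rows(cells, row_width=14)
for index, row in enumerate(rows):
    print(f"row {index}: {row}")
```

Because every cell has the same height, only the x offset needs tracking; power and ground rails line up automatically when cells abut within a row.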
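Figure: System performance over successive technologies (integrated bipolar, then integrated CMOS), followed by a possible fallow period ("Fallow Period?"). Other labels on this slide: Power Analysis; Energy (J/task).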
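Figure: Deep learning training vs. inference. Training: many labeled images (e.g., "starfish", "dog") pass forward through the model, the prediction is compared with the label (=?), and the error propagates backward. Inference: few images pass forward through the model to produce a prediction (e.g., "dog").
Figure: Image classification error rate series (28%, 26%, 16%, 12%, 7%, 3.6%, 3%, 2.3%) with a human error rate reference; other plotted values: ~100%, 89%, 74%, 14%. Hardware: Graphics Processing Units.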
• Circuit-level researchers need to appreciate the system-level context for their circuits
• Cross-layer interaction can generate some of the most exciting research ideas!
Course Structure
Prereq: Computer Architecture
Part 1: ASIC Design Overview
Part 2: Digital CMOS Circuits
Part 3: CAD Algorithms
• Topic 1: Hardware Description Languages
• Topic 2: CMOS Devices
• Topic 3: CMOS Circuits
• Topic 4: Full-Custom Design Methodology
• Topic 5: Automated Design Methodologies
• Topic 6: Closing the Gap
• Topic 7: Clocking, Power Distribution, Packaging, and I/O
• Topic 8: Testing and Verification
• Topic 9: Combinational Logic
• Topic 10: Sequential State
• Topic 11: Interconnect
• Topic 12: Synthesis Algorithms
• Topic 13: Physical Design Automation
Topic 12 Synthesis Algorithms
• RTL to Logic Synthesis, giving x = a'bc + a'bc' and y = b'c' + ab' + ac
• Technology Independent Synthesis, giving x = a'b and y = b'c' + ac
• Technology Dependent Synthesis
Topic 13 Physical Design Automation
• Placement
• Global Routing
• Detailed Routing
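The simplification from x = a'bc + a'bc', y = b'c' + ab' + ac to x = a'b, y = b'c' + ac can be checked exhaustively. The sketch below compares truth tables over all eight input combinations; the function names are introduced here for illustration and are not part of any synthesis tool.

```python
from itertools import product


def x_before(a, b, c):
    return (not a and b and c) or (not a and b and not c)


def x_after(a, b, c):
    return not a and b


def y_before(a, b, c):
    return (not b and not c) or (a and not b) or (a and c)


def y_after(a, b, c):
    return (not b and not c) or (a and c)


def equivalent(f, g, num_inputs=3):
    """Return True if f and g agree on every input combination."""
    return all(
        bool(f(*bits)) == bool(g(*bits))
        for bits in product([False, True], repeat=num_inputs)
    )


print("x equivalent:", equivalent(x_before, x_after))
print("y equivalent:", equivalent(y_before, y_after))
```

For x, the two terms differ only in c, so they merge into a'b. For y, the term ab' is redundant: when c is true it is covered by ac, and when c is false it is covered by b'c'.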
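Figure: Datapath with 32b registers a_reg, b_reg, and result_reg. Inputs req_msg.a and req_msg.b feed the a and b input muxes; a is shifted left (<<) and b is shifted right (>>) each step; an adder accumulates into result_reg, which drives resp_msg. Control signals: a_mux_sel, b_mux_sel, result_mux_sel, add_mux_sel, result_en. Status signal: b_lsb.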
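The register and signal names in this datapath (a left shift on a, a right shift on b, an adder feeding a result register, and a b_lsb status bit) are consistent with an iterative shift-and-add multiplier. The slide text itself does not name the unit, so treat the following as a behavioral sketch under that assumption; the function name `shift_add_multiply` and the fixed 32-iteration loop are illustrative choices, not taken from the source.

```python
MASK32 = 0xFFFFFFFF  # model 32b registers by masking after each update


def shift_add_multiply(a, b):
    """Behavioral model of an assumed 32b shift-and-add multiplier."""
    a_reg = a & MASK32
    b_reg = b & MASK32
    result_reg = 0
    for _ in range(32):
        b_lsb = b_reg & 1
        if b_lsb:
            # Adder path selected: accumulate the shifted multiplicand.
            result_reg = (result_reg + a_reg) & MASK32
        a_reg = (a_reg << 1) & MASK32  # the "<<" block
        b_reg >>= 1                    # the ">>" block
    return result_reg


print(shift_add_multiply(3, 5))
```

The result is the product truncated to 32 bits, which is what a datapath with 32b registers would produce.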
Vector-SIMD Processors
Figure: MIMD organization (4 cores w/ 1 lane each, cores 0-3, data memory, memory architecture) compared with vector-SIMD organization (1 core w/ 4 lanes, vector lanes, VMU, memory architecture). Axes labeled Energy and Performance.
• Single-core multi-lane design reduces area by 15%
• Multi-core single-lane design increases area by 20% (increasing number of control logic)
Figure: Normalized Tasks / Second for Quad-Core w/ Vertical Multithreading (4 core w/ 1 lane; 1, 2, 4, 8 threads) and Vector-SIMD (1, 2, 4, 8 elm/lane).
• Performance reduction with increasing threads due to increased cycle time and thread management overhead on fine-grain loops
Figure: Chip floorplan labels: SP Control, SP Datapath, SP Regfile, RAM Interface, VCO, Controller, AHIP, and four RAM Subbanks (2KB each).
Figure 2: STC1 chip plot and die photo. On the chip plot, the Exec Unit is colored blue, and the Fetch Unit is colored green and purple.
Figure: Block diagram with CP, Cache Control, Cache Data RAM Banks, XBAR, VMU, VIU, and Vector Lanes 0-3.
Figure: Chip clock, reset, and I/O block diagram: diff clk (+) and diff clk (−) into an LVDS Recv, single ended clk, clk div, clk tree, reset tree, divided clk out and clk out over LVDS, debug, reset, Ctrl Reg, host2chip and chip2host through the Host Interface, RISC Core, Sort Accel, L1 Instruction $ (32KB), LLFU Arbiter.
Figure 68. Measured and simulated energy per instruction breakdown at 66 MHz clock and 3.3 V core voltage. Both measurements include static power.
As can be seen in the figure, the simulated energy per instruction is quite close to the post-silicon measured energy use. This gives us confidence in the tools' ability to provide accurate energy and power estimates based on the post-place-and-route design. This figure also confirms the theory that load and store instructions use significantly less energy in the ALU than the add and addi instructions.
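As a rough sketch of how an energy-per-instruction number relates to a power measurement: energy per instruction equals average power (static power included) times cycles per instruction, divided by clock frequency. Only the 66 MHz clock comes from the text above; the power values, the CPI of 1, and the helper names are hypothetical placeholders.

```python
def energy_per_instruction(avg_power_w, clock_hz, cpi):
    """Energy in joules for one instruction: P * (CPI / f)."""
    return avg_power_w * cpi / clock_hz


def relative_error(simulated, measured):
    """Fractional gap between a simulated and a measured value."""
    return abs(simulated - measured) / measured


CLOCK_HZ = 66e6  # 66 MHz, as in the figure caption

# Hypothetical average power readings in watts (not from the source).
measured = energy_per_instruction(0.033, CLOCK_HZ, cpi=1.0)
simulated = energy_per_instruction(0.0315, CLOCK_HZ, cpi=1.0)

print(f"measured:  {measured * 1e9:.3f} nJ/instruction")
print(f"simulated: {simulated * 1e9:.3f} nJ/instruction")
print(f"relative error: {relative_error(simulated, measured):.1%}")
```

A small relative error between the two values is the kind of agreement that justifies trusting post-place-and-route estimates.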
6.6: Performance
Evaluating the performance of the chip involves exploring the tradeoffs between energy and latency for programs run on the processor. The shmoo plot in Figure 69 illustrates the range of frequencies at which the chip can operate for different core voltage levels.
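A shmoo plot is a pass/fail grid over two swept operating parameters, here core voltage and clock frequency. The sketch below builds a text version of such a grid. The pass/fail model `toy_chip_passes`, its constants, and the sweep points are entirely hypothetical stand-ins for real bench measurements.

```python
def shmoo(voltages, freqs_mhz, passes):
    """Return one text row per voltage: '*' marks pass, '.' marks fail."""
    rows = []
    for v in voltages:
        marks = "".join("*" if passes(v, f) else "." for f in freqs_mhz)
        rows.append(f"{v:.1f} V |{marks}")
    return rows


def toy_chip_passes(voltage, freq_mhz):
    # Hypothetical model: maximum frequency grows linearly with voltage
    # above a 1.0 V floor. A real shmoo uses measured pass/fail results.
    return freq_mhz <= 40.0 * (voltage - 1.0)


voltages = [3.3, 2.5, 1.8]    # swept core voltages (hypothetical)
freqs_mhz = [20, 40, 60, 80]  # swept clock frequencies (hypothetical)
grid = shmoo(voltages, freqs_mhz, toy_chip_passes)
print("\n".join(grid))
```

Reading across a row shows the highest passing frequency at that voltage; reading down a column shows the lowest voltage that sustains that frequency.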