


18-643 Lecture 2:

Basic FPGA Fabric

James C. Hoe
Department of ECE
Carnegie Mellon University

18-643-F21-L02-S1, James C. Hoe, CMU/ECE/CALCM, ©2021


Housekeeping
• Your goal today: know enough to build a basic
FPGA (even if not a very good one)
• Notices
– Complete survey on Canvas, due noon, 9/8
– Handout #2: lab 0, due noon, 9/13
– Make friends, make teams, due noon, 9/13
• Readings (see lecture schedule online)
– Altera 2006 white paper (see course website)
– skim databooks referenced for more details



What it means:

“Field Programmable” “Gate Array”



SSI MSI LSI VLSI

From Quora, “How did people design integrated circuits in early years?”
How to democratize 100K gates
[Figure: gate-array cells between VCC and GND rails, wired as (AB)’ with inputs A, B and as (X+Y)’ with inputs X, Y]


Idea behind Gate Arrays
• Mass produce identical gate array wafers
• Finish into any design by custom metal layers (2)
– so-called Mask-Programmable GA (MPGA)
– reduced design effort (more automation, no layout)
– reduced mask and fab cost
– faster fab turn-around
• Proliferation of ASIC design starts
– don’t need volume for economy of scale
– small design team could keep up with Moore’s law
Of course, not as efficient as full-custom
or standard-cell designs
How about no mask, no fab?
i.e., “field programmable”
• Again, mass produce identical devices but this
time fully-finalized
• Then what can be changed?
– bits: SRAM, EPROM, (anti)fuse
– connections: pass gate, mux, diode
[Figure: circuit sketches of each option, with {1,0} configuration values and terminals A, B, C]
programmable vs reprogrammable
Configurable “Logic Gates”



Reconfigurable Logic
• Arbitrary logic (combinational and sequential) can
be formed by wiring up enough NANDs or muxes
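[Figure: Shannon expansion: a 2-to-1 mux selected by X, with f(…,0,…) on input 0 and f(…,1,…) on input 1, outputs f(…,X,…)]
• Lookup table as universal logic primitive
– arbitrary n-input function from 2^n-entry table
– this is 8-by-1 bit “memory”
[Figure: 3-LUT: inputs A, B, C select one of the entries f(0,0,0), f(0,0,1), …, f(1,1,0), f(1,1,1) as output f(A,B,C)]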
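A minimal Python sketch of the two ideas above, offered as an illustration rather than as how any real device is built. The helper names (`make_lut`, `lut_read`, `shannon_mux`) and the choice of majority as the example function are assumptions made here. The LUT is a 2^n-entry list addressed by the inputs, and the Shannon expansion is a 2-to-1 mux choosing between the two cofactors.

```python
from itertools import product

def make_lut(f, n):
    """Tabulate an n-input Boolean function into a 2**n-entry list (the LUT contents)."""
    return [f(*bits) & 1 for bits in product((0, 1), repeat=n)]

def lut_read(table, *inputs):
    """Use the inputs as a memory address; the first input is the most significant bit."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]

def shannon_mux(x, f0, f1):
    """2-to-1 mux: pass the f(..,0,..) cofactor when x is 0, the f(..,1,..) cofactor when x is 1."""
    return f1 if x else f0

def majority(a, b, c):
    """Example 3-input function (hypothetical choice for this sketch)."""
    return (a & b) | (b & c) | (a & c)

maj_lut = make_lut(majority, 3)  # an 8-by-1-bit "memory"

for a, b, c in product((0, 1), repeat=3):
    via_lut = lut_read(maj_lut, a, b, c)
    via_mux = shannon_mux(a, majority(0, b, c), majority(1, b, c))
    print(a, b, c, via_lut, via_mux)
```

Changing the function only changes the stored bits; the lookup hardware stays the same.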
Size of Lookup Tables (aka LUTs)
• n-input function from 2^n-entry LUT
• Count only the 6T SRAM cells, an n-LUT has 6∙2^n T
• Some points of reference
– 2-input NAND = 4T
– 3-input NAND = 6T
– 3-input full-adder (a, b, cin)
• s = a ⊕ b ⊕ cin = 8T
• cout = b∙cin + a∙cin + a∙b = 18T
– 10-input 5-bit adder = 130T
– basic flip-flop = 16T (compare to 2 LUTs per latch)

n-LUT   T-count
2       24
3       48
4       96
5       192
6       384
7       768
8       1536
9       3072
10      6144
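The T-count column follows directly from the 6∙2^n formula. The short Python sketch below regenerates it and compares each reference gate with the smallest LUT that can hold it. The function name `lut_sram_transistors` and the `reference` dictionary layout are assumptions for this illustration; the transistor counts are the ones quoted above.

```python
def lut_sram_transistors(n):
    """Transistors in the 6T SRAM cells of an n-input LUT: 6 * 2**n."""
    return 6 * 2 ** n

# Regenerate the n-LUT / T-count table for n = 2..10.
t_count = {n: lut_sram_transistors(n) for n in range(2, 11)}
for n, t in t_count.items():
    print(f"{n:2d}-LUT  {t:5d} T")

# Reference points from the lecture: name -> (number of inputs, custom transistor count).
reference = {
    "2-input NAND": (2, 4),
    "3-input NAND": (3, 6),
    "full-adder s": (3, 8),
    "full-adder cout": (3, 18),
}
for name, (n, custom_t) in reference.items():
    lut_t = t_count[n]
    print(f"{name:16s} custom {custom_t:2d}T vs {n}-LUT {lut_t}T ({lut_t / custom_t:.1f}x)")
```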
Choosing LUT Granularity
• Small LUTs
+ shorter propagation delay (per LUT)
– a given fxn consumes many LUTs (comes with wiring cost and delay)
– high “interpretation overhead” if too small
• Big LUTs
– longer propagation delay (per LUT)
+ a given fxn consumes fewer (but bigger) LUTs
– high “interpretation overhead” if too large (and fxn has exploitable structure, e.g., 5-bit ripple add)
– wastage if not all inputs are used in a LUT
Where is the sweet spot?
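As a rough worked example of “interpretation overhead”, the Python sketch below prices the 10-input 5-bit adder two ways, counting only LUT SRAM cells at 6∙2^n T each. The mapping choices are assumptions made for this illustration: one 10-LUT per output bit (5 sum bits plus carry-out) for the monolithic case, and two 3-LUTs per full adder for the ripple case. Wiring cost and delay are ignored.

```python
def lut_sram_transistors(n):
    """Transistors in the 6T SRAM cells of an n-input LUT: 6 * 2**n."""
    return 6 * 2 ** n

CUSTOM_ADDER_T = 130  # 10-input 5-bit adder, as quoted in the lecture

# Monolithic mapping (assumed): every output is an arbitrary function of all 10 inputs,
# so each of the 6 outputs (5 sum bits + carry-out) gets its own 10-LUT.
big_lut_cost = 6 * lut_sram_transistors(10)

# Ripple mapping (assumed): 5 full adders, each using one 3-LUT for s and one for cout.
small_lut_cost = 5 * 2 * lut_sram_transistors(3)

for label, cost in (("six 10-LUTs", big_lut_cost), ("ten 3-LUTs", small_lut_cost)):
    print(f"{label:12s} {cost:6d} T  ({cost / CUSTOM_ADDER_T:.1f}x the custom adder)")
```

The large-LUT mapping pays for generality the adder never uses, while the small-LUT mapping exploits the ripple structure but needs more LUT-to-LUT wiring.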
A Quantitative Look at LUT Sizing
e.g., 2006 Altera White Paper on Stratix-II ALMs

• Large-enough functions have shorter total delay using bigger LUTs
• But bigger LUTs cost more and are prone to “internal fragmentation”
– 3-LUTs: 50+% fully utilized
– 6-LUTs: less than 40% fully utilized
• No one LUT size is optimal ⇒ “adaptive” LUT approach
LUT-based Configurable Logic Block
(simplified sketch)
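[Figure: two 3-LUTs with inputs A, B, C produce f(A,B,C) and g(A,B,C); input D combines them into h(A,B,C,D); {2,1,0} configuration selects drive outputs X and Y; an FF (also latch mode) is included or bypassed by a {1,0} configuration bit]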

• 2 fxns (f & g) of 3 inputs OR 1 fxn (h) of 4 inputs


• hardwired FFs (too expensive/slow to fake)
• Just 10s of these in the earliest FPGAs
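The “2 fxns of 3 inputs OR 1 fxn of 4 inputs” option can be sketched in Python as a Shannon expansion on D: one 3-LUT holds the D=0 cofactor, the other holds the D=1 cofactor, and D picks between them. This is an illustration only. Which LUT serves which value of D, the helper names, and the 4-input parity example are assumptions made here, and the flip-flop is left out.

```python
from itertools import product

def tabulate(f, n):
    """Fill a 2**n-entry LUT from an n-input Boolean function."""
    return [f(*bits) & 1 for bits in product((0, 1), repeat=n)]

def lut3(table, a, b, c):
    """Read a 3-LUT, with A as the most significant address bit."""
    return table[(a << 2) | (b << 1) | c]

def split_on_d(h):
    """Split a 4-input function h into two 3-LUT tables: cofactors at D=0 and D=1."""
    f_table = tabulate(lambda a, b, c: h(a, b, c, 0), 3)
    g_table = tabulate(lambda a, b, c: h(a, b, c, 1), 3)
    return f_table, g_table

def clb_4input(f_table, g_table, a, b, c, d):
    """h(A,B,C,D): D selects between the outputs of the two 3-LUTs (assumed polarity)."""
    return lut3(g_table, a, b, c) if d else lut3(f_table, a, b, c)

def parity4(a, b, c, d):
    """Example 4-input function (hypothetical choice for this sketch)."""
    return a ^ b ^ c ^ d

f_table, g_table = split_on_d(parity4)
mismatches = [
    bits for bits in product((0, 1), repeat=4)
    if clb_4input(f_table, g_table, *bits) != parity4(*bits)
]
print("mismatches:", mismatches)
```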
Xilinx XC2000 CLB (1980s)



[XC2064, XC2018 Logic Cell Array]
Contemporary Xilinx CLB Architecture

• each 6LUT is two 5LUTs
• LUTs can also be used as small SRAMs
• special paths for addition and multiplexer
• 2 slices per CLB
• Largest devices (many $K each) have several 100K slices
• Largest extreme in 2021 has over 1M slices
[Figure 2-3: 7 Series FPGAs CLB User Guide]
Still Coarser Logic Blocks?
• So-called Coarse-Grain Reconfigurable Arrays
(CGRAs) based on complete adders or ALUs
– native arithmetic units have low interpretation
overhead if you are doing arithmetic
– poor fit if you are working with narrow data or bit-
level manipulations
• Even coarser is to use many tiny processors
– still a spatial computing paradigm
– not programmed with RTLs
– converging with software multicores



More on this later on
Brief Aside: Mapping Logic To LUTs
• Start from primary outputs and inputs to registers, cover logic graph with cuts of at most K input edges
• K-cuts correspond to K-LUT realizable functions
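A minimal Python sketch of finding the K-cuts, assuming a combinational netlist given as a dictionary from each node to its fanins. It uses the common bottom-up enumeration (merge one cut from each fanin, keep the result if it has at most K leaves), which is a swapped-in technique and not necessarily what any particular tool does. It only lists candidate cuts and does not choose a final cover from the outputs back. The tiny netlist and all names are hypothetical.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of at most k leaf nodes.

    fanins: node -> tuple of fanin nodes (empty tuple for primary inputs)
    order:  nodes in topological order (fanins before fanouts)
    """
    cuts = {}
    for node in order:
        node_cuts = {frozenset([node])}  # trivial cut: the node itself
        ins = fanins[node]
        if ins:
            # Combine one cut from each fanin; keep the union if it is K-feasible.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts

# Hypothetical netlist: n1 = op(a, b); n2 = op(n1, c); out = op(n2, d)
fanins = {"a": (), "b": (), "c": (), "d": (),
          "n1": ("a", "b"), "n2": ("n1", "c"), "out": ("n2", "d")}
order = ["a", "b", "c", "d", "n1", "n2", "out"]

cuts3 = enumerate_cuts(fanins, order, 3)
cuts4 = enumerate_cuts(fanins, order, 4)
for cut in sorted(cuts3["out"], key=lambda s: (len(s), sorted(s))):
    print(sorted(cut))
```

With K=3 the whole cone of `out` does not fit in one LUT because it has four leaves; with K=4 the cut {a, b, c, d} becomes feasible.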

[Figure 13.1: “Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation”]
Placement



[Vivado Implementation Screenshot]
… and Route



[Vivado Implementation Screenshot]
Configurable “Wires”



PLA-style Configurable Routing
[Figure: PLA-style AND plane and OR plane; inputs I0, I1, …, In-1; outputs O0, O1, …, Om-1; each crosspoint marked “?” is a configurable connection]
Island Style Routing Architecture
• CLB islands in sea of interconnects
• Flexible routing to support ASIC style netlists
• Note regularity in structure
[Figure: 5-by-5 grid of CLBs (C) surrounded by interconnect]


Configurable Routing
(1980s Xilinx simplified)
[Figure: two CLBs with pins A, B, C, D, X, Y; a Connection Block joins CLB pins to the routing wires, and a Switch Block joins routing wires to each other]


Reconfigurable Routing is Expensive!
• Routing resource area is on par with logic
• Each configurable connection costs
– area of configuration bit
– area of configurable connection
– and don’t forget propagation delay
• Too much: cost for everyone who doesn’t need it
• Too little: congestion leaves unreachable CLBs
unused
– worse for larger arrays/designs (why?)
– buy a $10K FPGA and only get to use 70%?



Rent’s Rule
• Tgp
– T = number of inputs and outputs
– g = number of internal components
– p typically between 0.5 (regular) and 0.8 (random)
• In a square, perimeter=4area0.5
– unless regular, I/O signals grow faster than
available routes exiting a design area
• Need hierarchy of progressively longer additional
routing resources
long routes also reduce delay when going far
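A short numerical sketch of the argument above, in Python. It assumes a proportionality constant t = 1 in T = t∙g^p and one component per unit area (so a square region holding g components has area g). Both are simplifications introduced here, and the function names are hypothetical. The quantity of interest is how many I/O signals must cross each unit of perimeter as the region grows.

```python
import math

def rent_terminals(g, p, t=1.0):
    """Rent's rule estimate T = t * g**p (t is an assumed constant)."""
    return t * g ** p

def square_perimeter(area):
    """Perimeter of a square with the given area: 4 * area**0.5."""
    return 4.0 * math.sqrt(area)

def io_per_unit_perimeter(g, p, t=1.0):
    """I/O signals per unit of perimeter, assuming one component per unit area."""
    return rent_terminals(g, p, t) / square_perimeter(g)

for g in (100, 10_000, 1_000_000):
    regular = io_per_unit_perimeter(g, 0.5)
    random_like = io_per_unit_perimeter(g, 0.8)
    print(f"g={g:>9,d}  p=0.5: {regular:.2f}  p=0.8: {random_like:.2f}")
```

At p = 0.5 the demand per unit of perimeter stays constant, while at p = 0.8 it keeps growing with region size, which is why fixed-width channels alone fall short for larger designs.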



Virtex-II Routing Architecture

[Figure 48: Virtex-II Platform FPGAs: Complete Data Sheet]


Virtex-II Routing Architecture

Later architectures extended in reach and in diagonals
Separate, dedicated clock trees
[Figure 49: Virtex-II Platform FPGAs: Complete Data Sheet]
Between-Die Routing in 2.5D IC
Virtex7 Stacked Silicon Interconnect (SSI), 2011
• Longest routes go across dies carried on interposer
• No change to design tool and abstraction

[Figure 1, Stacked & Loaded: Xilinx SSI, 28-Gbps I/O Yield Amazing FPGAs, Xcell, Q1 2011]
Intel Stratix-X HyperFlex
• Long routes need buffered repeaters; very long
routes need pipelining
• Add (bypassable) pipeline registers throughout
• RTL designs have to be pipelined explicitly to benefit; high-level synthesized designs leverage directly
• a high-freq strategy, e.g., 0.5x logic at 2x freq for perf. parity
[Figure 2: Understanding How the New HyperFlex Architecture Enables Next-Generation High-Performance Systems]
Don’t Forget Configurable I/O
[Figure: I/O Block: PAD with In/Out direction, Dout driver with {fast,slow} setting, Din path, and an FF; each option is selected by a {1,0} configuration bit]
- real devices more complicated
- modern devices support special signaling and protocols
Putting it all together:
a Universal ASIC
[Figure: array of LUT + FF blocks in Interconnect, with I/O pins; labels: programmable routing; programmable lookup tables (LUT) and flip-flops (FF), aka “soft logic” or “fabric”]


Bitstream defines the chip
• After power up, SRAM FPGA loads bitstream from
somewhere before becoming the “chip”
a bonus “feature” for sensitive devices that need to forget what they do
• Many built-in loading options
• Non-trivial amount of time; must control reset
timing and sequence with the rest of the system
• Reverse-engineering concerns ameliorated by
– encryption
– proprietary knowledge
Return to this later in term . . . .
Parting Thoughts
• Birth of FPGAs rooted entirely in digital logic and
ASIC concerns; today, you can use an FPGA
without knowing any of this stuff
• You can find a lot of specific details on-line
(databooks and research papers)
• So far still just the basic fabric . . . .
. . . more next time
- saving “configuration” for later in term
- won’t say anything about low-level EDA

