ADSD Fall 2011 05 Architecting Speed 2011-Nov-03
http://lms.nust.edu.pk/
Acknowledgement: Material from the following sources has been consulted/used in these slides: 1. [CIL] Advanced Digital Design with the Verilog HDL, M D. Ciletti 2. [SHO] Digital Design of Signal Processing System by Dr Shoab A Khan 3. [STV] Advanced FPGA Design, Steve Kilts 4. Some slides from : [ECEN 248 Dr Shi]
Material/Slides from these slides CAN be used with following citing reference: Dr. Rehan Hafiz: Advanced Digital System Design 2010
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Tuesday @ 5:30-6:20 pm, Friday @ 6:30-7:20 pm By appointment/Email VISpro Lab above SEECS Library
This Lecture
Reading Assignment
Chapter 1: Advanced FPGA Design, by Steve Kilts
Xilinx Application Note (uploaded on MOODLE) + practice in Xilinx ISE
Setup/Hold time violation
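Speed
Throughput: amount of data processed per clock cycle. Metric: bits/sec
Latency: time between data input and processed data output. Metric: time or no. of clock cycles
Timing: logic delays between sequential elements. Metric: clock period / frequency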
A high-throughput design
More concerned with the steady-state data rate
Less concerned about the time any specific piece of data requires to propagate through the design (latency)
Technique: Pipelining
Throughput
[Figure: top-level entity with 8-bit input and 8-bit output; two combinational-logic stages separated by D flip-flops clocked by a 100 MHz clk; timing diagram of input(0), input(1), ...]
Throughput = (bits per output sample) / (time between consecutive output samples)
Bits per output sample: in this example, 8 bits per output sample.
Time between consecutive output samples: clock cycles between output(n) and output(n+1). It can be measured in clock cycles, then translated to time. In this example, there is 1 cycle between output samples, so the time between consecutive output samples = 1 clock cycle = 10 ns.
Throughput = (8 bits per output sample) / (10 ns) = 0.8 bits/ns = 800 Mbits/s
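The throughput formula above can be checked numerically. The following is a minimal Python sketch; the function name and parameter names are illustrative and are not part of the slides.

```python
def throughput_bits_per_second(bits_per_sample, cycles_between_samples, clock_hz):
    """Throughput = bits per output sample / time between consecutive output samples.

    The time between samples is cycles_between_samples / clock_hz seconds,
    so the ratio simplifies to bits_per_sample * clock_hz / cycles_between_samples.
    """
    return bits_per_sample * clock_hz / cycles_between_samples


# Values from the slide: 8-bit output, one output sample per cycle, 100 MHz clock.
rate = throughput_bits_per_second(8, 1, 100e6)
print(f"{rate / 1e6:.0f} Mbits/s")  # prints "800 Mbits/s"
```

Measuring the sample spacing in clock cycles first, and only then converting to time, keeps the calculation independent of the clock frequency until the last step.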
An Example...
[KIL]
The same register and computational resources are reused.
No new computation can begin until the previous computation has completed.
Latency: 3 clk cycles
Timing: 1 multiplier delay
module power3(
  output [7:0] XPower,
  output finished,
  input [7:0] X,
  input clk, start);
  reg [7:0] ncount;
  reg [7:0] XPower;
Loop Unrolling
[Figure: unrolled datapath with signals x[n], x[n-1]^2, x[n-2]^3 and x[n-1]]
Both the final calculation of X^3 (XPower3 resources) and the first calculation of the next value of X (XPower2 resources) occur simultaneously.
Coding
module power3(
  output reg [7:0] XPower,
  input clk,
  input [7:0] X
  );
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  always @(posedge clk) begin
    // Pipeline stage 1
    X1 <= X;
    XPower1 <= X;
    // Pipeline stage 2
    X2 <= X1;
    XPower2 <= XPower1 * X1;
    // Pipeline stage 3
    XPower <= XPower2 * X2;
  end
endmodule
[Figure: pipelined datapath with registers X1, X2, XPower1 and XPower2]
Latency: 3 clk cycles
Timing: 1 multiplier delay
In general, if an algorithm requiring n iterative loops is unrolled, the pipelined implementation will exhibit a throughput performance increase of a factor of n. The penalty for unrolling an iterative loop is a proportional increase in area.
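As a rough illustration of the factor-of-n claim, the sketch below compares the iterative and unrolled power3 designs. It assumes the iterative design delivers one 8-bit result every 3 clock cycles (no new computation starts until the previous one completes) and that the unrolled pipeline delivers one result per cycle once it is full. The function and constant names are hypothetical.

```python
def bits_per_cycle(width_bits, cycles_per_result):
    """Steady-state throughput in bits per clock cycle."""
    return width_bits / cycles_per_result


WIDTH = 8          # data width of the power3 example
N_ITERATIONS = 3   # assumed number of cycles the iterative design spends per result

iterative = bits_per_cycle(WIDTH, N_ITERATIONS)  # one result every 3 cycles
unrolled = bits_per_cycle(WIDTH, 1)              # one result every cycle when the pipeline is full
speedup = unrolled / iterative

print(f"iterative: {iterative:.2f} bits/cycle, unrolled: {unrolled:.2f} bits/cycle")
print(f"throughput gain: {speedup:.1f}x")
```

The gain equals the number of unrolled iterations; the area cost grows in proportion because each iteration now has its own multiplier and registers.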
Decreasing Latency
A low-latency design is one that passes the data from the input to the output as quickly as possible by minimizing the intermediate processing delays.
Techniques:
Parallelism
Removal of pipelining, and logical shortcuts that may reduce the throughput or the max clock speed of a design
Latency
[Figure: top-level entity with 8-bit input and 8-bit output; two combinational-logic stages separated by D flip-flops clocked by clk; timing diagram of input(0), input(1), ...]
Latency is the time between input(n) and output(n), i.e. the time it takes from first input to first output, second input to second output, etc. It is also called input-to-output latency.
In this example there are 2 rising edges between input and output, so the latency is 2 cycles.
If the clock period is 10 ns, the latency is 20 ns.
Removal of pipelining
Throughput: 8/1 = 8 bits/cycle
Latency: less than a cycle
Timing: 2 multiplier delays
Penalty
Penalty in timing: the previous implementations could theoretically run with a system clock period close to the delay of a single multiplier. For the low-latency implementation, the clock period must be at least two multiplier delays.
module power3(
  output [7:0] XPower,
  input [7:0] X
  );
  reg [7:0] XPower1, XPower2;
  reg [7:0] X1, X2;
  assign XPower = XPower2 * X2;
  always @* begin
    X1 = X;
    XPower1 = X;
  end
  always @* begin
    X2 = X1;
    XPower2 = XPower1*X1;
  end
endmodule
Understanding Timing
Timings: classification
Combinational: logic and routing
Flip-flops: setup, hold, clock-to-Q
tLOGIC: propagation delay through logic components (e.g. LUTs)
tROUTING: propagation delay through routing (wires)
The output remains unchanged for a time period equal to the contamination delay, tcd.
The new output value is guaranteed to be valid after a time period equal to the propagation delay, tLOGIC.
[Figure: flip-flop timing diagram showing tS, tH and tCLK2Q relative to the rising edge of clk]
Setup time tS: minimum time the input has to be stable before the rising edge of the clock
Hold time tH: minimum time the input has to be stable after the rising edge of the clock
Propagation delay tCLK2Q: time to propagate input to output after the rising edge of the clock
A path runs from the output of one flip-flop, through combinational logic and routing, to the input of another flip-flop.
[Figure: two flip-flops with combinational logic between them; tCLK2Q, tLOGIC, tROUTING and tS shown against the clock period T]
tCLK2Q + tLOGIC + tROUTING < (T - tS) to avoid a setup time violation
Rewriting the equation: tCLK2Q + tLOGIC + tROUTING + tS < T
Path delay: tpath = tCLK2Q + tLOGIC + tROUTING + tS
The largest of all the path delays in a circuit is called the critical path delay (tcritical_path). The associated path is called the critical path.
There can be millions of paths in a circuit; timing analysis CAD tools help to locate the critical path.
Critical Path
[Figure: example circuit with paths PATH 1 to PATH 4; flip-flops with tCLK2Q = 0.4 ns and tS = 0.2 ns; combinational delays annotated on each path]
Path delays: tpath1 = 2.2 ns, tpath2 = 1.1 ns, tpath3 = 3.0 ns, tpath4 = 1.4 ns
The critical path is path 3; the critical path delay is tcritical_path = tpath3 = 3.0 ns
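Locating the critical path is a maximum over all path delays. The sketch below uses the four path delays quoted above; the dictionary keys, the helper name and the derived maximum frequency are illustrative additions.

```python
# Path delays in ns, as listed on the slide.
path_delays_ns = {"path1": 2.2, "path2": 1.1, "path3": 3.0, "path4": 1.4}


def critical_path(delays):
    """Return (name, delay) of the path with the largest delay."""
    name = max(delays, key=delays.get)
    return name, delays[name]


name, t_min_ns = critical_path(path_delays_ns)
f_max_mhz = 1e3 / t_min_ns  # a period in ns converts to a frequency in MHz as 1000 / T

print(f"critical path: {name}, Tmin = {t_min_ns} ns, fmax = {f_max_mhz:.1f} MHz")
```

A real timing analyzer performs the same maximization, but over every register-to-register path extracted from the placed and routed netlist.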
[Figure: example circuit whose critical path delay is compared against the clock period T]
Critical path delay = tcritical_path = 5.2 ns
The minimum period for this circuit to work is Tmin = 5.2 ns.
If the clock period is smaller than Tmin, you will get a timing violation and the circuit will not operate correctly!
This kind of timing violation is called a "setup time" violation (also known as a critical path violation).
Throughput
Amount of data that is processed per clock cycle, OR the aggregate/average data processing rate
Metric: bits/sec
Ideally the average data rate IN to your system should be equal to the average data rate OUT of your system, or you will miss data!
Technique: Pipelining & Loop Unrolling
Latency
Time between data input and processed data output (response time)
Important for time-critical signals, e.g. an interrupt-triggered operation processing an external signal of an avionics system
Technique: Parallelising the system
Normally a compromise!
Timing
tCLK2Q + tLOGIC + tROUTING + tS < T
The rising edge of the clock does not arrive at the clock inputs of all flip-flops at the same time.
Clock Skew
[Figure: clk reaches the second flip-flop as clk' after a delay; the offset between clk and clk' is tskew]
Positive slack: the data arrives at the capture flip-flop before the capture clock edge less the setup time.
Negative slack: the data arrives after the capture clock edge less the setup time. Negative slack is an issue.
Lead clock skew is bad because it may cause setup time violations
[Figure: launch and capture flip-flops; timing diagram showing tCLK2Q, tLOGIC + tROUTE, tS and tskew]
WITHOUT SKEW: tCLK2Q + tLOGIC + tROUTE + tS < T to avoid a setup time violation
WITH SKEW: tCLK2Q + tLOGIC + tROUTE + tS < (T - tskew) to avoid a setup time violation
With skew there is less time to perform logic than you normally would have.
Solution: optimize, pipeline, or use a faster speed grade!
Lag clock skew is bad because it may cause hold time violations
[Figure: launch and capture flip-flops; timing diagram showing tCLK2Q, tLOGIC + tROUTE, tskew and tH, with the capture clock clk' lagging clk]
tCLK2Q + tLOGIC + tROUTE > (tskew + tH) to avoid a hold time violation
If this is violated, you get data feedthrough (data gets fed into the next register one cycle too early).
There is no clock period (T) in the equation; changing the clock period cannot help this problem!
Solution: add dummy logic, e.g. a buffer!
For FPGAs hold time violation predict clock skew
tskew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop; its sign (-ve or +ve) depends on lead or lag.
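The setup and hold inequalities with skew can be written as two small checks. This is a sketch only: the function names are hypothetical and the numeric values are invented for illustration, not taken from the slides.

```python
def setup_ok(t_clk2q, t_logic, t_route, t_setup, period, t_skew=0.0):
    """Setup check with lead skew: tCLK2Q + tLOGIC + tROUTE + tS < T - tskew."""
    return t_clk2q + t_logic + t_route + t_setup < period - t_skew


def hold_ok(t_clk2q, t_logic, t_route, t_hold, t_skew=0.0):
    """Hold check with lag skew: tCLK2Q + tLOGIC + tROUTE > tskew + tH.

    Note that the clock period does not appear here at all.
    """
    return t_clk2q + t_logic + t_route > t_skew + t_hold


# Illustrative numbers in ns (assumed): a path that meets setup at T = 3.0 ns
# without skew fails once the capture clock leads by 0.5 ns.
print(setup_ok(0.4, 1.5, 0.5, 0.2, period=3.0))              # True
print(setup_ok(0.4, 1.5, 0.5, 0.2, period=3.0, t_skew=0.5))  # False

# A short path that meets hold without skew fails with 0.7 ns of lag skew.
print(hold_ok(0.4, 0.3, 0.1, t_hold=0.2))              # True
print(hold_ok(0.4, 0.3, 0.1, t_hold=0.2, t_skew=0.7))  # False
```

For the hold check the shortest (contamination) delays are the relevant ones, since the concern is data arriving too early rather than too late.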
Some Examples
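[Figure: circuit with flip-flops FFA and FFB, combinational logic G, and a common CLK]
Shall we get a hold time violation in this example?
Make sure Y remains stable for the hold time (Th) after the rising clock edge.
Remember: the contamination delay ensures the signal doesn't change.
TCLK2Q(FFA) + Tcd(G) >= Th
1 ns + 2 ns = 3 ns > 2 ns, so there is no hold time violation.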
Example-3
[Figure: circuit with flip-flops FFA and FFB, combinational logic blocks F and H, and a common CLK]
Comb. Logic F: Tlogic+Route = 4 ns
Comb. Logic H: Tlogic+Route = 5 ns
FFA: TClk-Q = 5 ns
FFB: TClk-Q = 4 ns, Ts = 2 ns
What is the minimum clock period (Tmin) of this circuit?
What if FFB has a clock skew lead of 1 ns?
Solution
TClk-Q(FFA) + Tpd(H) + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns
TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns
Tmin is set by the longer path: Tmin = 15 ns
With a clock skew lead of 1 ns at FFB:
TClk-Q(FFA) + Tpd(H) + Ts(FFB) + Tskew = 5 ns + 5 ns + 2 ns + 1 ns = 13 ns
TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns
The FFB-to-FFB path still dominates, so Tmin remains 15 ns.
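The two path sums of Example-3 can be reproduced in a few lines. The sketch below uses the delays given in the example (in ns); the dictionary layout and function name are assumptions made for illustration.

```python
# Delays from Example-3, in ns.
T_CLK2Q = {"FFA": 5, "FFB": 4}
T_SETUP_FFB = 2
T_PD = {"F": 4, "H": 5}


def min_period(skew_lead_ffb=0):
    """Minimum clock period: the largest setup-constrained path delay into FFB."""
    # FFA -> H -> FFB: the lead skew at FFB shortens the time available on this path.
    ffa_to_ffb = T_CLK2Q["FFA"] + T_PD["H"] + T_SETUP_FFB + skew_lead_ffb
    # FFB -> F -> H -> FFB: launch and capture use the same clock pin, so skew cancels.
    ffb_to_ffb = T_CLK2Q["FFB"] + T_PD["F"] + T_PD["H"] + T_SETUP_FFB
    return max(ffa_to_ffb, ffb_to_ffb)


print(min_period())                 # 15
print(min_period(skew_lead_ffb=1))  # 15
```

With these numbers the skew would have to exceed 3 ns before the FFA-to-FFB path became the critical one.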
All paths must satisfy the requirements.
[Figure: same circuit with Comb. Logic F: Tlogic+Route = 1 ns; FFA: Tclk2Q = 1 ns; FFB: Tclk2Q = 1 ns, Th = 2 ns]
TClk2Q(FFB) + Tcd(F) + Tlogic+Route(H) > Th(FFB)
1 ns + 1 ns + 2 ns = 4 ns > 2 ns
Optimizing Timing
Few Simple Design Considerations
If L = 5:
y[0] = h0 x[0] + h1 x[-1] + h2 x[-2] + h3 x[-3] + h4 x[-4]
y[1] = h0 x[1] + h1 x[0] + h2 x[-1] + h3 x[-2] + h4 x[-3]
y[2] = h0 x[2] + h1 x[1] + h2 x[0] + h3 x[-1] + h4 x[-2]
y[3] = h0 x[3] + h1 x[2] + h2 x[1] + h3 x[0] + h4 x[-1]
y[4] = h0 x[4] + h1 x[3] + h2 x[2] + h3 x[1] + h4 x[0]
y[5] = h0 x[5] + h1 x[4] + h2 x[3] + h3 x[2] + h4 x[1]
Critical Path ??
module fir(
  output [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2, Y;
  always @(posedge clk)
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      Y <= A* X+B* X1+C* X2;
    end
endmodule
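Code
module fir(
  output [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2, Y;
  reg [7:0] prod1, prod2, prod3;
  always @ (posedge clk) begin
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      // Register each product so only the adders remain in the final stage
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule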
Optimize the critical path such that logic structures can be implemented in parallel.
Example: for the x-cube code, break the multipliers into independent operations and then recombine them.
Taking a square
8-bit binary multiplier
[Figure: 8-bit multiplication worked as a partial-product array (terms such as a3b0, a2b1, ..., a0b0), then repeated with the operand split into 4-bit halves A and B]
Each of the three 4-bit multipliers needs 4 shifts + 4 4-bit serial additions; the three results are then combined with 4-bit additions.
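The idea of breaking a multiplier into independent operations and recombining them can be illustrated for the squaring case. The sketch below splits an 8-bit operand X into 4-bit halves A (upper) and B (lower), so that X = 16A + B and X*X = 256*A*A + 32*A*B + B*B. The three 4-bit products do not depend on each other and could be computed in parallel. The function name is hypothetical, and this arithmetic model is an assumption about what the lost figure illustrated.

```python
def square_by_nibbles(x):
    """Square an 8-bit value using three independent 4-bit multiplications."""
    a = (x >> 4) & 0xF  # upper nibble
    b = x & 0xF         # lower nibble
    # Independent partial products: these three could run in parallel in hardware.
    aa = a * a
    ab = a * b
    bb = b * b
    # Recombine: (16a + b)^2 = 256*a*a + 2*16*a*b + b*b
    return (aa << 8) + (ab << 5) + bb


# Exhaustive check over all 8-bit inputs.
assert all(square_by_nibbles(x) == x * x for x in range(256))
print(square_by_nibbles(0xB7), 0xB7 * 0xB7)
```

The critical path shrinks from one 8-bit multiplier to one 4-bit multiplier plus the recombining additions, at the cost of extra adders.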
Keep a balance in the critical path.
Redistribute logic evenly between registers to minimize the worst-case delay between any two registers.
Break up logic structures that are coded in a serial fashion.
Avoid priority structures if they are not required.
Example: control signals coming from an address decode that are used to write four 1-bit registers
module regwrite(
  output reg [3:0] rout,
  input clk, in,
  input [3:0] ctrl);
  always @(posedge clk)
    if(ctrl[0]) rout[0] <= in;
    else if(ctrl[1]) rout[1] <= in;
    else if(ctrl[2]) rout[2] <= in;
    else if(ctrl[3]) rout[3] <= in;
endmodule
If the control lines are strobes from an address decoder in another module, each strobe is mutually exclusive to the others, as they all represent a unique address.
module regwrite(
  output reg [3:0] rout,
  input clk, in,
  input [3:0] ctrl);
  always @(posedge clk) begin
    if(ctrl[0]) rout[0] <= in;
    if(ctrl[1]) rout[1] <= in;
    if(ctrl[2]) rout[2] <= in;
    if(ctrl[3]) rout[3] <= in;
  end
endmodule
Tip
Reorder the paths in the data flow to minimize the critical path. (Mostly done by the synthesizer!)
When to use: where multiple paths combine with the critical path. The combined path can be reordered such that the critical path is moved closer to the destination register.
module randomlogic(
  output reg [7:0] Out,
  input [7:0] A, B, C,
  input clk,
  input Cond1, Cond2);
  always @(posedge clk)
    if(Cond1) Out <= A;
    else if(Cond2 && (C < 8)) Out <= B;
    else Out <= C;
endmodule
module randomlogic(
  output reg [7:0] Out,
  input [7:0] A, B, C,
  input clk,
  input Cond1, Cond2);
  wire CondB = (Cond2 & !Cond1);
  always @(posedge clk)
    if(CondB && (C < 8)) Out <= B;
    else if(Cond1) Out <= A;
    else Out <= C;
endmodule
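Path reordering must not change behaviour. The sketch below models the next-state value of Out for both randomlogic versions in Python and compares them exhaustively over Cond1, Cond2 and all 8-bit values of C. The function names and the fixed values chosen for A and B are illustrative assumptions.

```python
from itertools import product


def original(cond1, cond2, a, b, c):
    """Next value of Out in the first randomlogic module."""
    if cond1:
        return a
    elif cond2 and c < 8:
        return b
    else:
        return c


def reordered(cond1, cond2, a, b, c):
    """Next value of Out in the reordered module, with the C < 8 test moved first."""
    cond_b = cond2 and not cond1
    if cond_b and c < 8:
        return b
    elif cond1:
        return a
    else:
        return c


A, B = 0xAA, 0xBB  # arbitrary fixed data values for the comparison
mismatches = [
    (c1, c2, c)
    for c1, c2, c in product([False, True], [False, True], range(256))
    if original(c1, c2, A, B, c) != reordered(c1, c2, A, B, c)
]
print("mismatches:", len(mismatches))  # prints "mismatches: 0"
```

Folding !Cond1 into CondB is what allows the comparison on C to move to the front of the chain without altering the result.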
Recap
High Throughput: Pipelining, Parallelism
Low Latency: Pipeline Removal, Parallelism
Timing: Pipelining, Flattening Logic Structure, Register Balancing, Path Reordering
A high-throughput architecture is one that maximizes the number of bits per second that can be processed by a design.
Unrolling an iterative loop increases throughput.
By removing priority encodings where they are not needed, the logic structure is flattened, and the path delay is reduced.
Register balancing improves timing by moving combinatorial logic from the critical path to an adjacent path.
Timing can be improved by reordering paths that are combined with the critical path in such a way that some of the critical path logic is placed closer to the destination register.
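[Figure: direct-form FIR filter with coefficients h0, h1, h2, ..., hM-1 and output y(n)]
y(n) = sum over k = 0 to M-1 of h(k) x(n-k)
TA = delay through adder
TM = delay through multiplier
Critical path delay: 1 TM + (M-1) TA
Area: M-1 registers, M multipliers, M-1 adders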
Signal Flow Graph (SFG): a collection of nodes and directed edges
A directed edge (j,k) denotes a branch originating at node j and terminating at node k
Edge (j,k) denotes a linear transformation from the signal at node j to the signal at node k; a gain can be specified
Nodes represent computations or tasks, e.g. addition
Source node: no input edges; sink node: no originating edges
Reversing the direction of an SFG and interchanging the input and output ports preserves the functionality of the system.
[Figure: transposed-form FIR filter with coefficients hM-1, hM-2, hM-3, ..., h0, unit delays Z^-1 and output y(n)]
Critical path delay: 1 TM + 1 TA
Area: M-1 registers + M multipliers + M-1 adders
Disadvantages:
Larger register sizes, depending on the quantization scheme used, since registers are now placed after multiplication!
Fanout of x(n) can become prohibitive.
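The two critical-path expressions can be compared directly. The sketch below assumes illustrative delays TM = 2 ns and TA = 1 ns and a 5-tap filter; these numbers and the function names are not from the slides.

```python
def direct_form_critical_path(m_taps, t_mult, t_add):
    """Direct form: one multiplier followed by a chain of M-1 adders."""
    return t_mult + (m_taps - 1) * t_add


def transposed_form_critical_path(t_mult, t_add):
    """Transposed form: one multiplier and one adder between registers."""
    return t_mult + t_add


T_M, T_A, M = 2, 1, 5  # assumed delays in ns and tap count
print(direct_form_critical_path(M, T_M, T_A))   # 6
print(transposed_form_critical_path(T_M, T_A))  # 3
```

The direct-form delay grows linearly with the number of taps, while the transposed-form delay is independent of M, which is why transposition helps most for long filters.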
Some Terms
Nodes represent computations/tasks, e.g. addition, multiplication
The computational time for a node can be specified with the node
Each edge has a non-negative number of delays associated with it
A node shall only compute once all its input data is ready
Technique: DFG-based pipelining
Data Flow Graphs (DFGs): DFG-based pipelining example (1/4 to 4/4)
[Figures: FIR data flow graphs with input x(n), unit delays Z^-1, coefficients h0, h1, h2, ..., hM-1 and output y(n), shown as the pipeline registers are added and moved]
A convenient way to implement pipelining is to add the desired number of registers to all input edges and then, by repeated application of the node transfer theorem, systematically move the registers to break the delay of the critical path.
Functionality is not changed if a register is transferred from all incoming edges of a node (e.g. FA0) to all outgoing edges, and vice versa!
Important
For a multiple-input multiple-output system, add a Source Node (generating all inputs) and a Destination Node (receiving/sinking all outputs).
This helps avoid confusion about the order of placement of the various nodes in your design.
And of course you can't cut nodes.
[Figure: DFG with an added Source Node and Destination Node]
This scheme can also be applied for register balancing (as discussed earlier).
What if we can't optimize our system any further using pipelining?
Convert a SISO system to a MIMO system using parallel logic!
The effective sampling speed is increased by the level of parallelism, L.
Multiple outputs are computed in parallel in a clock period.
A parallel processing system is also called a block processing system, and the number of inputs processed in a clock cycle is referred to as the block size, L.
y(n) = a x(n) + b x(n-1) + c x(n-2)
Convert the SISO system into a MIMO (multiple-input multiple-output) system in order to obtain a parallel processing structure.
To get a parallel system with L = 2 inputs per clock cycle, we rewrite the equation for n = 2k and n = 2k+1:
y(2k) = a x(2k) + b x(2k-1) + c x(2k-2)
y(2k+1) = a x(2k+1) + b x(2k) + c x(2k-1)
i.e. at the kth cycle two outputs are processed: y(2k) and y(2k+1)
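Substituting n = 2k and n = 2k+1 into y(n) = a x(n) + b x(n-1) + c x(n-2) gives the two outputs produced per cycle. The sketch below checks that this L = 2 block form matches the sample-by-sample filter. It assumes zero initial conditions and an even number of input samples; all function names are hypothetical.

```python
def fir_serial(x, a, b, c):
    """Reference SISO filter: one output per step, zero initial conditions."""
    def sample(i):
        return x[i] if i >= 0 else 0
    return [a * sample(n) + b * sample(n - 1) + c * sample(n - 2)
            for n in range(len(x))]


def fir_parallel_l2(x, a, b, c):
    """Block-processing form with L = 2: y(2k) and y(2k+1) computed per cycle k."""
    def sample(i):
        return x[i] if i >= 0 else 0
    y = []
    for k in range(len(x) // 2):
        y_even = a * sample(2 * k) + b * sample(2 * k - 1) + c * sample(2 * k - 2)
        y_odd = a * sample(2 * k + 1) + b * sample(2 * k) + c * sample(2 * k - 1)
        y.extend([y_even, y_odd])
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir_serial(x, 1, 2, 3))
print(fir_parallel_l2(x, 1, 2, 3))
```

Each loop iteration corresponds to one clock cycle of the parallel hardware, so the same clock rate processes twice as many samples.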
By combining parallel processing (block size: L) and pipelining (pipelining stage: M), the sample period can be reduced to:
Quiz ...
Time: 8 minutes!
Please assume that the computational time required for each node = T. Also assume that all nodes are atomic!
Q-1) What is the maximum sampling rate of this system without any optimization?
Q-2) Optimize this design such that the sampling rate of the optimized system is 1/T. (You must show the DFG for the optimized design.)
Solution!
Please assume that the computational time required for each node = T. Also assume that all nodes are atomic!
Q-1) What is the maximum sampling rate of this system without any optimization?
Sampling period = 4T, sampling rate = 1/(4T)
Q-2) Optimize this design such that the sampling rate of the optimized system is 1/T. (You must show the DFG for the optimized design.)
Please see the figure above (a total of 9 registers were added!)