ADSD Fall2011 05 Architect Ing Speed 2011nov03

The document discusses speed, throughput, latency, and timing in digital system design. It defines throughput as the amount of data processed per clock cycle, latency as the time between input and output, and timing as the logic delays between sequential elements and the clock period. It discusses techniques for optimizing these factors like pipelining, loop unrolling, and removing pipelining.

Uploaded by

Rehan Hafiz


Lecture # 05

Dr. Rehan Hafiz

<[email protected]>

Course Website for ADSD Fall 2011

http://lms.nust.edu.pk/
Acknowledgement: Material from the following sources has been consulted/used in these slides:
1. [CIL] Advanced Digital Design with the Verilog HDL, M. D. Ciletti
2. [SHO] Digital Design of Signal Processing Systems, by Dr. Shoab A. Khan
3. [STV] Advanced FPGA Design, Steve Kilts
4. Some slides from [ECEN 248, Dr. Shi]

Material/Slides from these slides CAN be used with following citing reference: Dr. Rehan Hafiz: Advanced Digital System Design 2010
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Lectures: Tuesday @ 5:30-6:20 pm, Friday @ 6:30-7:20 pm
Contact: By appointment/Email
Office: VISpro Lab above SEECS Library

This Lecture

Understanding & Optimizing
    Speed
    Throughput
    Timings

Reading Assignment
    Chapter 1: Advanced FPGA Design, by Steve Kilts
    Xilinx Application Note (uploaded on MOODLE) + practice in Xilinx ISE
        Setup/Hold time violation

Speed

Throughput
    Amount of data that is processed per clock cycle
    Metric: bits/sec
Latency
    Time between data input and processed data output
    Metric: No. of cycles or time
Timing
    Logic delays between sequential elements
    Metric: Clock period or frequency

A high-throughput design
    More concerned with the steady-state data rate
    Less concerned about the time any specific piece of data requires to propagate through the design (latency)
    Technique: Pipelining

High Throughput Design

Throughput

[Figure: top-level entity with an 8-bit input and an 8-bit output; three D flip-flops separated by two combinational logic blocks, all clocked by clk at 100 MHz. Timing diagram: input(0), input(1), input(2) against output(0), output(1), with 1 cycle between output samples.]

Throughput = (bits per output sample) / (time between consecutive output samples)
    Bits per output sample:
        In this example, 8 bits per output sample
    Time between consecutive output samples: clock cycles between output(n) and output(n+1)
        Can be measured in clock cycles, then translated to time
        In this example, time between consecutive output samples = 1 clock cycle = 10 ns

Throughput = (8 bits per output sample) / (10 ns) = 0.8 bits/ns = 800 Mbits/s
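The throughput arithmetic above can be checked with a few lines of Python. This is only a sketch of the slide's formula; the function name `throughput_bits_per_second` is made up for illustration.

```python
def throughput_bits_per_second(bits_per_sample, cycles_between_samples, clock_hz):
    """Throughput = bits per output sample / time between consecutive output samples."""
    # time between samples = cycles_between_samples / clock_hz, so dividing by it
    # is the same as multiplying by clock_hz / cycles_between_samples
    return bits_per_sample * clock_hz / cycles_between_samples


# Values from the slide: 8-bit samples, one output per clock cycle, 100 MHz clock.
rate = throughput_bits_per_second(8, 1, 100e6)
print(f"{rate / 1e6:.0f} Mbits/s")
```

Changing `cycles_between_samples` to 3 models a design that only produces one result every three cycles.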

An Example... [KIL]

Software Code
    XPower = 1;
    for (i = 0; i < 3; i++)
        XPower = X * XPower;

Digital Implementation
    Same register and computational resources are reused
    No new computations can begin until the previous computation has completed

    Throughput: 8/3 = 2.7 bits/cycle
    Latency:    3 clk cycles
    Timing:     1 multiplier delay

Coding an iterative algorithm <with dependency>

    // Software equivalent: XPower = 1; for (i = 0; i < 3; i++) XPower = X * XPower;
    module power3(
        output reg [7:0] XPower,
        output           finished,
        input      [7:0] X,
        input            clk, start);

        reg [7:0] ncount;

        // The computation is finished when the iteration counter reaches zero
        assign finished = (ncount == 0);

        always @(posedge clk)
            if (start) begin
                // Load X and schedule the two remaining multiplications
                XPower <= X;
                ncount <= 2;
            end
            else if (!finished) begin
                ncount <= ncount - 1;
                XPower <= XPower * X;
            end
    endmodule

Loop Unrolling

    XPower = 1;
    for (i = 0; i < 3; i++)
        XPower = X * XPower;

[Figure: unrolled datapath; the pipeline stages hold x[n], x[n-1]^2 and x[n-2]^3, with the delayed input x[n-1] carried alongside.]

Both the final calculation of X^3 (XPower3 resources) and the first calculation of the next value of X (XPower2 resources) occur simultaneously

Coding

    module power3(
        output reg [7:0] XPower,
        input            clk,
        input      [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X1, X2;

        always @(posedge clk) begin
            // Pipeline stage 1
            X1      <= X;
            XPower1 <= X;
            // Pipeline stage 2
            X2      <= X1;
            XPower2 <= XPower1 * X1;
            // Pipeline stage 3
            XPower  <= XPower2 * X2;
        end
    endmodule

[Figure: pipelined datapath with intermediate registers XPower1 and XPower2.]

                Iterative               Pipelined (unrolled)
    Throughput  8/3 = 2.7 bits/cycle    8/1 = 8 bits/cycle
    Latency     3 clk cycles            3 clk cycles
    Timing      1 multiplier delay      1 multiplier delay

In general, if an algorithm requiring n iterative loops is unrolled, the pipelined implementation will exhibit a throughput performance increase of a factor of n. The penalty for unrolling an iterative loop is a proportional increase in area.
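The factor-of-n claim can be illustrated with a small cycle-level model in Python. This is a behavioural sketch, not a replacement for HDL simulation: the function names are invented, and the model assumes the iterative version spends one cycle loading X and two cycles multiplying, while the unrolled version accepts one sample per cycle.

```python
MASK = 0xFF  # 8-bit datapath, as in the power3 modules


def iterative_power3(samples):
    """One shared multiplier: each sample occupies the datapath for 3 cycles."""
    results, cycles = [], 0
    for x in samples:
        xpower = x & MASK              # cycle 1: load X (start asserted)
        cycles += 1
        for _ in range(2):             # cycles 2 and 3: two multiplications
            xpower = (xpower * x) & MASK
            cycles += 1
        results.append(xpower)
    return results, cycles


def pipelined_power3(samples):
    """Unrolled loop: three register stages, one new sample accepted per cycle."""
    x1 = x2 = xp1 = xp2 = 0
    outputs = []
    stream = list(samples) + [0, 0]    # two extra cycles to drain the pipeline
    for cycle, x in enumerate(stream):
        # Non-blocking semantics: every next value is computed from the old state.
        next_xpower = (xp2 * x2) & MASK
        next_x2, next_xp2 = x1, (xp1 * x1) & MASK
        x1, xp1, x2, xp2 = x & MASK, x & MASK, next_x2, next_xp2
        if cycle >= 2:                 # first valid result appears on the third edge
            outputs.append(next_xpower)
    return outputs, len(stream)


data = [1, 2, 3, 4]
print(iterative_power3(data))   # same cubes, 3 cycles per sample
print(pipelined_power3(data))   # same cubes, 1 cycle per sample after the pipeline fills
```

For long input streams the cycle count of the pipelined model approaches one third of the iterative one, which is the factor n = 3 for this loop.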

Decreasing Latency

A low-latency design is one that passes the data from the input to the output as quickly as possible by minimizing the intermediate processing delays.

Techniques
    Parallelism
    Removal of pipelining, and logical shortcuts that may reduce the throughput or the max clock speed in a design

Latency

[Figure: top-level entity with an 8-bit input and an 8-bit output; three D flip-flops separated by two combinational logic blocks, clocked at 100 MHz. Timing diagram: input(0), input(1), input(2) against output(0), output(1).]

Latency is the time between input(n) and output(n)
    i.e. the time it takes from first input to first output, second input to second output, etc.
    Also called input-to-output latency

Latency is measured in clock cycles (then translated to seconds)
    Count the number of rising edges after input
    In this example, 2 rising edges, so latency is 2 cycles
    In this example, say the clock period is 10 ns; then latency is 20 ns

Removal of pipelining

    Throughput: 8/1 = 8 bits/cycle
    Latency:    Less than a cycle
    Timing:     2 multiplier delays

Penalty
    Penalty in timing
    Previous implementations could theoretically run the system clock period close to the delay of a single multiplier
    For the low-latency implementation, the clock period must be at least two multiplier delays

    module power3(
        output [7:0] XPower,
        input  [7:0] X);

        reg [7:0] XPower1, XPower2;
        reg [7:0] X1, X2;

        // No registers: the whole computation is combinational
        assign XPower = XPower2 * X2;

        always @* begin
            X1      = X;
            XPower1 = X;
        end

        always @* begin
            X2      = X1;
            XPower2 = XPower1 * X1;
        end
    endmodule

Understanding Timing

Timings
    Combinational logic & routing
    Flip-flops
        Setup time
        Hold time
        Propagation delay tCLK2Q

Timing: Combinational Logic (tLOGIC + tROUTING)

Classification
    tLOGIC: propagation delay through logic components (e.g. LUTs)
    tROUTING: propagation delay through routing (wires)

The output remains unchanged for a time period equal to the contamination delay, tcd
The new output value is guaranteed to be valid after a time period equal to the propagation delay, tLOGIC

Timing: Flip Flops (Sequential Logic)

[Figure: D flip-flop with waveforms for clk, D and Q. Input D must remain stable during the interval tS before and tH after the rising clock edge, and can change freely outside it; Q changes tCLK2Q after the edge.]

Setup time tS: minimum time the input has to be stable before the rising edge of the clock
Hold time tH: minimum time the input has to be stable after the rising edge of the clock
Propagation delay tCLK2Q: time to propagate input to output after the rising edge of the clock

Timing: Path timing

A path is defined as a path from the output of one flip-flop to the input of another flip-flop

[Figure: launch flip-flop, combinational logic and capture flip-flop, with tCLK2Q, tLOGIC, tROUTING and tS laid end to end against the clock period T.]

tCLK2Q + tLOGIC + tROUTING < (T - tS) to avoid setup time violation
Rewriting the equation: tCLK2Q + tLOGIC + tROUTING + tS < T
    The left-hand side is the path delay, tpath

Critical Path Delay

Path delay: tpath = tCLK2Q + tLOGIC + tROUTE + tS
The largest of all the path delays in a circuit is called the critical path delay (tcritical_path)
    The associated path is called the critical path
    There can be millions of paths in a circuit; timing analysis CAD tools help to locate the critical path

Critical Path

[Figure: four register-to-register paths (PATH 1 to PATH 4) between D flip-flops, with tCLK2Q = 0.4 ns and tS = 0.2 ns on the flip-flops; delay labels surviving from the original figure: 1.1 ns, 0.5 ns, 0.8 ns.]

Path delays: tpath1 = 2.2 ns, tpath2 = 1.1 ns, tpath3 = 3.0 ns, tpath4 = 1.4 ns
The critical path is path 3; the critical path delay is tcritical_path = tpath3 = 3.0 ns
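Locating the critical path is just a maximum over all path delays. The Python sketch below uses the four delays quoted above; the dictionary keys and the helper name `critical_path` are illustrative only, and a real timing tool enumerates paths from the netlist rather than from a hand-written table.

```python
def critical_path(path_delays_ns):
    """Return (name, delay) of the path with the largest delay."""
    name = max(path_delays_ns, key=path_delays_ns.get)
    return name, path_delays_ns[name]


# Path delays from the slide, in nanoseconds.
paths = {"path1": 2.2, "path2": 1.1, "path3": 3.0, "path4": 1.4}
print(critical_path(paths))
```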

Setup Time Violation (a.k.a. Critical Path Violation)

[Figure: launch flip-flop (tCLK2Q = 0.4 ns), wire 1 (twire1 = 0.4 ns), combinational gate A (tgateA = 2.0 ns), wire 2 (twire2 = 0.2 ns), combinational gate B (tgateB = 1.2 ns), wire 3 (twire3 = 0.8 ns), capture flip-flop (tS = 0.2 ns). The delays tCLK2Q, twire1, tgateA, twire2, tgateB, twire3 and tS are laid end to end against the clock period T.]

Critical path delay = tcritical_path = 5.2 ns
The minimum period for this circuit to work is Tmin = 5.2 ns
    If the clock period is smaller than Tmin, you will get a timing violation and the circuit will not operate correctly!!
Maximum clock frequency = 1/Tmin = 192 MHz
This kind of timing violation is called a "setup time" violation (also known as critical path violation)
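The 5.2 ns and 192 MHz figures follow directly from summing the delays along the path. A short Python check, with the delay values taken from the slide and the variable names chosen here for illustration:

```python
# Delays along the path, in nanoseconds (values from the slide).
delays_ns = {
    "tCLK2Q": 0.4,
    "twire1": 0.4,
    "tgateA": 2.0,
    "twire2": 0.2,
    "tgateB": 1.2,
    "twire3": 0.8,
    "tS": 0.2,
}

t_min_ns = sum(delays_ns.values())      # minimum clock period
f_max_mhz = 1000.0 / t_min_ns           # 1 / ns = GHz, so multiply by 1000 for MHz

print(f"Tmin = {t_min_ns:.1f} ns, Fmax = {f_max_mhz:.0f} MHz")
```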

Review From Last Lecture

Throughput
    Amount of data that is processed per clock cycle, OR the aggregate/average data processing rate
    Ideally, the average data rate IN to your system should equal the average data rate OUT of your system, OR you will miss data !
    Technique: Pipelining & Loop Unrolling !
    Streaming applications are more concerned with throughput !
    Metric: bits/sec

Latency
    Time between data input and processed data output
    Response time: important for time-critical signals, e.g. some interrupt-triggered operation processing an external signal of an avionics system !
    Technique: Parallelising the system
    Metric: No. of cycles or time

Normally a compromise !

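Timing

    Logic delays between sequential elements
    Metric: Clock period or frequency
    [tCLK2Q + tLOGIC + tROUTING + tS] < T

Clock Skew

Rising edge of the clock does not arrive at clock inputs of all flip-flops at the same time
Delay often caused by wire routing delay

[Figure: lag clock skew; two flip-flops between in and out, clocked by clk and clk', with a routing delay on the clock and the offset tskew between the clk and clk' waveforms.]

[Figure: lead clock skew; the same two flip-flops between in and out, clocked by clk and clk', with the routing delay placed so that tskew has the opposite sense.]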

Positive slack
    The data arrives at the capture flip-flop before the capture clock edge less the setup time.

Negative slack
    The data arrives after the capture clock edge less the setup time. Negative slack is a timing problem.

Lead clock skew is bad because it may cause setup time violations

[Figure: launch flip-flop, combinational logic and capture flip-flop, drawn once with both flip-flops on clk and once with the capture flip-flop on clk'. In each case tCLK2Q, tLOGIC+tROUTE and tS are laid against the clock period T; with skew, clk' leads clk by tskew.]

WITHOUT SKEW: tCLK2Q + tLOGIC + tROUTE + tS < T to avoid setup time violation
WITH SKEW: tCLK2Q + tLOGIC + tROUTE + tS < (T - tskew) to avoid setup time violation
    Less time to perform logic than you normally would have
    Solution: Optimize / Pipeline / Speed grade !

Lag clock skew is bad because it may cause hold time violations

[Figure: launch flip-flop on clk, combinational logic, capture flip-flop on clk'; tCLK2Q and tLOGIC+tROUTE are shown against clk, and clk' lags clk by tskew.]

tCLK2Q + tLOGIC + tROUTE > (tskew + tH) to avoid hold time violation
    If this is violated, you get data feedthrough (data gets fed into the next register one cycle too early)
    There is no clock period (T) in the equation; changing the clock period cannot help this problem!
    Solution: Add dummy logic, e.g. a buffer !
    For FPGAs hold time violation predict clock skew
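The two skew inequalities can be written as a pair of small checks. This Python sketch follows the slide's convention (lead skew tightens the setup check, lag skew tightens the hold check); the function names and the example numbers are hypothetical.

```python
def setup_ok(t_clk2q, t_logic_route, t_setup, period, lead_skew=0.0):
    """Setup check: tCLK2Q + tLOGIC + tROUTE + tS < T - tskew (lead skew)."""
    return t_clk2q + t_logic_route + t_setup < period - lead_skew


def hold_ok(t_clk2q, t_logic_route, t_hold, lag_skew=0.0):
    """Hold check: tCLK2Q + tLOGIC + tROUTE > tskew + tH (lag skew)."""
    return t_clk2q + t_logic_route > lag_skew + t_hold


# Illustrative numbers in ns (not from the slides).
print(setup_ok(1, 6, 2, period=10))                 # 9 < 10: passes
print(setup_ok(1, 6, 2, period=10, lead_skew=2))    # 9 < 8 fails: setup violation
print(hold_ok(1, 2, 2))                             # 3 > 2: passes
print(hold_ok(1, 2, 2, lag_skew=1.5))               # 3 > 3.5 fails: hold violation
```

Note that `hold_ok` has no `period` argument, which mirrors the point that slowing the clock cannot fix a hold violation.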

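Maximum Achievable Frequency

Maximum-frequency equation (ignoring clock-to-clock jitter):
    Fmax = 1 / (tCLK2Q + tLOGIC + tROUTING + tS + tskew)
    tskew is the propagation delay of the clock between the launch flip-flop and the capture flip-flop
    Its sign (-ve or +ve) depends on lead or lag; here it is taken as +ve for lead skew, consistent with the WITH SKEW setup inequality

Reading Assignment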

Some Examples

Example 1: Analyzing Sequential Circuits

[Figure: FFA (TClk-Q = 5 ns) drives X into combinational logic G (Tlogic+Route = 5 ns), whose output Y feeds FFB (TClk-Q = 5 ns, Ts = 2 ns); both flip-flops share CLK.]

What is the minimum time between rising clock edges?
    Tmin = TCLK-Q(FFA) + [TLogic(G) + TRoute(G)] + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns

Example 2: Hold Time Violation

[Figure: FFA (Tclk2Q = 1 ns) drives X into combinational logic G (Tcd = 2 ns), whose output Y feeds FFB (Th = 2 ns); both flip-flops share CLK.]

Will we get a hold time violation in this example?
    Make sure Y remains stable for the hold time (Th) after the rising clock edge
    Remember: the contamination delay ensures the signal doesn't change
    TCLK2Q(FFA) + Tcd(G) >= Th
    1 ns + 2 ns = 3 ns > 2 ns, so there is no hold time violation

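Example 3

[Figure: FFA (TClk-Q = 5 ns) and FFB (TClk-Q = 4 ns, Ts = 2 ns) share CLK. Combinational logic H (Tlogic+Route = 5 ns) drives the D input of FFB; the output of FFB is fed back through combinational logic F (Tlogic+Route = 4 ns) into H.]

What is the minimum clock period (Tmin) of this circuit?
What if FFB has a lead clock skew of 1 ns?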

Solution

[Figure: same circuit as in the problem statement: F = 4 ns, H = 5 ns, TClk-Q(FFA) = 5 ns, TClk-Q(FFB) = 4 ns, Ts(FFB) = 2 ns.]

Path FFA to FFB
    TClk-Q(FFA) + Tpd(H) + Ts(FFB) = 5 ns + 5 ns + 2 ns = 12 ns
Path FFB to FFB
    TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns

The longer path sets the limit: Tmin = 15 ns

Solution (with lead of 1 ns for FFB)

[Figure: same circuit: F = 4 ns, H = 5 ns, TClk-Q(FFA) = 5 ns, TClk-Q(FFB) = 4 ns, Ts(FFB) = 2 ns.]

Path FFA to FFB
    TClk-Q(FFA) + Tpd(H) + Ts(FFB) + Tskew = 5 ns + 5 ns + 2 ns + 1 ns = 13 ns
Path FFB to FFB
    TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB) = 4 ns + 4 ns + 5 ns + 2 ns = 15 ns

The FFB to FFB path is launched and captured by the same clock, so the skew does not affect it; it remains the longer path.
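The Example 3 numbers can be reproduced by summing each path and taking the maximum. A minimal Python sketch; the path names and the helper `min_clock_period` are illustrative, and the lead skew is applied only to the FFA to FFB path because the FFB to FFB path starts and ends on the same clock.

```python
def min_clock_period(paths_ns, lead_skew_ns=None):
    """Tmin is the largest (path delay + lead skew) over all register-to-register paths."""
    lead_skew_ns = lead_skew_ns or {}
    totals = {name: sum(parts) + lead_skew_ns.get(name, 0)
              for name, parts in paths_ns.items()}
    return max(totals.values()), totals


# Delays in ns from Example 3: [TClk-Q, logic+route..., Ts]
paths = {
    "FFA->FFB": [5, 5, 2],        # TClk-Q(FFA) + Tpd(H) + Ts(FFB)
    "FFB->FFB": [4, 4, 5, 2],     # TClk-Q(FFB) + Tpd(F) + Tpd(H) + Ts(FFB)
}

print(min_clock_period(paths))
print(min_clock_period(paths, lead_skew_ns={"FFA->FFB": 1}))
```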

Example: Analyzing Sequential Circuits: Hold Time Violations

[Figure: FFA (Tclk2Q = 1 ns) and FFB (Tclk2Q = 1 ns, Th = 2 ns) share CLK. Combinational logic H (Tlogic+Route = 2 ns) drives the D input of FFB; the output of FFB is fed back through combinational logic F (Tlogic+Route = 1 ns) into H.]

All paths must satisfy requirements

Path FFA to FFB
    TClk2Q(FFA) + Tlogic+Route(H) > Th(FFB): 1 ns + 2 ns > 2 ns
Path FFB to FFB
    TClk2Q(FFB) + TCD(F) + Tlogic+Route(H) > Th(FFB): 1 ns + 1 ns + 2 ns > 2 ns

Optimizing Timing
Few Simple Design Considerations

Consider an FIR Filter

The equation for the computation of an L-tap FIR filter is:
    y[n] = h0 x[n] + h1 x[n-1] + ... + h(L-1) x[n-(L-1)]

If L = 5:
    y[0] = h0 x[0] + h1 x[-1] + h2 x[-2] + h3 x[-3] + h4 x[-4]
    y[1] = h0 x[1] + h1 x[0]  + h2 x[-1] + h3 x[-2] + h4 x[-3]
    y[2] = h0 x[2] + h1 x[1]  + h2 x[0]  + h3 x[-1] + h4 x[-2]
    y[3] = h0 x[3] + h1 x[2]  + h2 x[1]  + h3 x[0]  + h4 x[-1]
    y[4] = h0 x[4] + h1 x[3]  + h2 x[2]  + h3 x[1]  + h4 x[0]
    y[5] = h0 x[5] + h1 x[4]  + h2 x[3]  + h3 x[2]  + h4 x[1]
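The expanded sums above are instances of one convolution sum, y[n] = sum over k of h[k] x[n-k]. A direct Python sketch, assuming samples outside the stored range are zero (the helper name `fir_output` and the test data are illustrative):

```python
def fir_output(h, x, n):
    """y[n] = sum over k of h[k] * x[n - k]; x is a dict {index: sample}, missing samples are 0."""
    return sum(h_k * x.get(n - k, 0) for k, h_k in enumerate(h))


h = [1, 2, 3, 4, 5]          # L = 5 coefficients h0..h4 (made-up values)
x = {0: 1}                   # unit impulse at n = 0

# The impulse response of an FIR filter is its coefficient list.
print([fir_output(h, x, n) for n in range(6)])
```

Feeding an impulse is a convenient sanity check because each output y[n] then picks out exactly one coefficient.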

Parallel FIR Implementation

Critical Path ??

    module fir(
        output reg [7:0] Y,
        input      [7:0] A, B, C, X,
        input            clk,
        input            validsample);

        reg [7:0] X1, X2;

        always @(posedge clk)
            if (validsample) begin
                // Shift the sample history and compute all three taps in one cycle
                X1 <= X;
                X2 <= X1;
                Y  <= A * X + B * X1 + C * X2;
            end
    endmodule

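Technique-1- Pipelining <Reducing TLOGIC+PROPAGATION>

Code

    module fir(
        output reg [7:0] Y,
        input      [7:0] A, B, C, X,
        input            clk,
        input            validsample);

        reg [7:0] X1, X2;
        reg [7:0] prod1, prod2, prod3;

        always @(posedge clk) begin
            if (validsample) begin
                X1 <= X;
                X2 <= X1;
                // Register the products so multipliers and adders sit in separate stages
                prod1 <= A * X;
                prod2 <= B * X1;
                prod3 <= C * X2;
            end
            Y <= prod1 + prod2 + prod3;
        end
    endmodule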

Technique-2- Increasing Parallelism <Speeding-up the logic-process>

Optimize the critical path such that logic structures can be implemented in parallel
Example:
    For the x-cube code, break the multipliers into independent operations and then recombine them.

Taking a square

8-bit binary multiplier
    8 mux shifts + 8 8-bit additions

[Figure: worked 8-bit binary multiplication showing the eight shifted partial-product rows.]
Optimizing Logic by adding Parallelism

Assume we are squaring an 8-bit number, which can be represented by nibbles A and B: {A, B}

[Figure: partial-product array for {a3 a2 a1 a0 b3 b2 b1 b0} x {a3 a2 a1 a0 b3 b2 b1 b0}, followed by the same array regrouped into the three terms BB, 2AB and AA, and a worked binary example.]

{A,B} x {A,B} as a single multiplier:
    8 mux shifts + 8 8-bit serial additions

Split into A x A, A x B and B x B, evaluated in parallel:
    4 shifts + 4 4-bit serial additions (for each of the three multipliers)
    followed by 4-bit additions to recombine
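The nibble decomposition relies on the identity (16A + B)^2 = 256 AA + 32 AB + BB, where the three 4x4 products are independent of each other. A quick exhaustive check in Python (the function name is made up for illustration):

```python
def square_by_nibbles(x):
    """Square an 8-bit value using three independent 4-bit x 4-bit products."""
    a, b = x >> 4, x & 0xF               # high nibble A, low nibble B
    aa, ab, bb = a * a, a * b, b * b     # these three products can run in parallel
    # Recombine: A*A weighted by 2^8, 2*A*B weighted by 2^4 (a shift by 5), B*B unweighted
    return (aa << 8) + (ab << 5) + bb


# Exhaustive check over all 8-bit inputs.
assert all(square_by_nibbles(x) == x * x for x in range(256))
print(square_by_nibbles(0xB5), 0xB5 * 0xB5)
```

In hardware, the benefit is that the critical path contains one 4-bit multiplier plus the recombining adders instead of one 8-bit multiplier.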

Technique-3- Register Balancing
<Distribute long logic paths evenly across register layers>

Keep a balance in the critical path
Redistribute logic evenly between registers to minimize the worst-case delay between any two registers

Technique-4- Flatten Logic Structures <Removing redundant logic>

Break up logic structures that are coded in a serial fashion
Avoid priority structures if not required

Example: control signals coming from an address decode that are used to write four 1-bit registers

    module regwrite(
        output reg [3:0] rout,
        input            clk, in,
        input      [3:0] ctrl);

        // if / else if chain infers a priority structure
        always @(posedge clk)
            if (ctrl[0])      rout[0] <= in;
            else if (ctrl[1]) rout[1] <= in;
            else if (ctrl[2]) rout[2] <= in;
            else if (ctrl[3]) rout[3] <= in;
    endmodule

If the control lines are strobes from an address decoder in another module
    Each strobe is mutually exclusive to the others, as they all represent a unique address.

Is there any need for a priority structure ?

    module regwrite(
        output reg [3:0] rout,
        input            clk, in,
        input      [3:0] ctrl);

        // Independent if statements: no priority between the strobes
        always @(posedge clk) begin
            if (ctrl[0]) rout[0] <= in;
            if (ctrl[1]) rout[1] <= in;
            if (ctrl[2]) rout[2] <= in;
            if (ctrl[3]) rout[3] <= in;
        end
    endmodule
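Flattening is only safe because the strobes are mutually exclusive. The Python model below (a behavioural sketch with invented function names, not generated from the Verilog) shows that the two versions agree whenever at most one strobe is asserted, and differ otherwise.

```python
def set_bit(word, bit, value):
    """Return word with the given bit position replaced by value (0 or 1)."""
    return (word & ~(1 << bit)) | (value << bit)


def priority_write(rout, data, ctrl):
    """if / else if chain: only the lowest-numbered asserted strobe writes."""
    for bit in range(4):
        if (ctrl >> bit) & 1:
            return set_bit(rout, bit, data)
    return rout


def flat_write(rout, data, ctrl):
    """Independent if statements: every asserted strobe writes its own bit."""
    for bit in range(4):
        if (ctrl >> bit) & 1:
            rout = set_bit(rout, bit, data)
    return rout


# With one-hot (or all-zero) strobes the two structures behave identically.
one_hot = [0b0000, 0b0001, 0b0010, 0b0100, 0b1000]
assert all(priority_write(r, d, c) == flat_write(r, d, c)
           for r in range(16) for d in (0, 1) for c in one_hot)

# With two strobes asserted they differ, so the flattening relies on mutual exclusivity.
print(bin(priority_write(0, 1, 0b0011)), bin(flat_write(0, 1, 0b0011)))
```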


Technique-5- Reordering Paths <Shortening Critical Paths>

Mostly done by the synthesizer !!!
Reorder the paths in the dataflow to minimize the critical path
When to use:
    Where multiple paths combine with the critical path
    The combined path can be reordered such that the critical path can be moved closer to the destination register

Technique-5- Reordering Paths

Events not mutually exclusive

    module randomlogic(
        output reg [7:0] Out,
        input      [7:0] A, B, C,
        input            clk,
        input            Cond1, Cond2);

        always @(posedge clk)
            if (Cond1)
                Out <= A;
            else if (Cond2 && (C < 8))
                Out <= B;
            else
                Out <= C;
    endmodule

    module randomlogic(
        output reg [7:0] Out,
        input      [7:0] A, B, C,
        input            clk,
        input            Cond1, Cond2);

        // Fold the Cond1 priority into CondB so the (C < 8) comparison is tested first
        wire CondB = (Cond2 & !Cond1);

        always @(posedge clk)
            if (CondB && (C < 8))
                Out <= B;
            else if (Cond1)
                Out <= A;
            else
                Out <= C;
    endmodule
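Reordering must not change the function. The following Python sketch models the next-state value of Out for both versions of randomlogic and compares them exhaustively over Cond1, Cond2 and all 8-bit values of C; the function names and the fixed A and B values are illustrative.

```python
from itertools import product


def original(cond1, cond2, a, b, c):
    """First version: Cond1 has priority over the (Cond2 and C < 8) branch."""
    if cond1:
        return a
    elif cond2 and c < 8:
        return b
    return c


def reordered(cond1, cond2, a, b, c):
    """Second version: CondB = Cond2 & !Cond1 lets the C comparison be tested first."""
    cond_b = cond2 and not cond1
    if cond_b and c < 8:
        return b
    elif cond1:
        return a
    return c


A, B = 0xAA, 0x55   # arbitrary distinct values
mismatches = [(c1, c2, c)
              for c1, c2, c in product([False, True], [False, True], range(256))
              if original(c1, c2, A, B, c) != reordered(c1, c2, A, B, c)]
print(len(mismatches))
```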

Summary- Architecting Speed

High Throughput
    Pipelining
    Parallelism

Low Latency
    Pipeline Removal
    Parallelism

Timing
    Pipelining
    Flattening Logic Structure
    Register Balancing
    Path Reordering

In your digital design
    Make your specification your goal and apply the techniques

Recap

A high-throughput architecture is one that maximizes the number of bits per second that can be processed by a design.
Unrolling an iterative loop increases throughput.
The penalty for unrolling an iterative loop is a proportional increase in area.

A low-latency architecture is one that minimizes the delay from the input of a module to the output.
Latency can be reduced by removing pipeline registers.
The penalty for removing pipeline registers is an increase in combinatorial delay between registers.

Timing refers to the clock speed of a design. A design meets timing when the maximum delay between any two sequential elements is smaller than the minimum clock period.
Adding register layers improves timing by dividing the critical path into two paths of smaller delay.
Separating a logic function into a number of smaller functions that can be evaluated in parallel reduces the path delay to the longest of the substructures.
By removing priority encodings where they are not needed, the logic structure is flattened, and the path delay is reduced.
Register balancing improves timing by moving combinatorial logic from the critical path to an adjacent path.
Timing can be improved by reordering paths that are combined with the critical path in such a way that some of the critical path logic is placed closer to the destination register.

Reading

Chapter 3 of Parhi, VLSI Digital Signal Processing Systems


Direct Form FIR Filters

[Figure: M-tap FIR filter in direct form; x(n) feeds a chain of Z^-1 delays, the taps are scaled by coefficients up to hM-1 and summed to give y(n).]

M-tap FIR filter in direct form:
    y(n) = sum for i = 0 to M-1 of h(i) x(n - i) = h(n) * x(n)    (* denotes convolution)

Critical path:
    TA = delay through adder
    TM = delay through multiplier
    Critical path delay: 1 TM + (M-1) TA

Area:
    M-1 registers
    M multipliers
    M-1 adders

Arithmetic complexity of M-tap filter modeled as:
    M multiplications/sample + M-1 adds/sample

Representations of DSP algorithms and architectures: Block Diagram

[Figure: Block diagram of a 3-tap FIR filter]

Representations of DSP algorithms and architectures: Signal Flow Graph Representation !

Collection of nodes & directed edges
A directed edge (j,k) denotes a branch originating at node j & terminating at node k
Edge (j,k) denotes a linear transformation from the signal at node j to the signal at node k
    Can specify gain
Nodes represent computations or tasks, e.g. addition
Source node: no input edges; sink node: no originating edges

[Figure: Signal Flow Graph of a 3-tap FIR filter]

Technique: Signal Flow Graph: From Direct Form to Transpose Form

Reversing the direction of an SFG and interchanging the input and output ports preserves the functionality of the system.

Also called data broadcast structure

[Figure: transposed-form FIR filter; x(n) is broadcast to the multipliers hM-1, hM-2, hM-3, ..., h0, whose products are accumulated through a chain of adders and Z^-1 registers to give y(n).]

Critical path:
    Delay: 1 TM + 1 TA

Area:
    M-1 registers + M multipliers + M-1 adders

Disadvantages
    Larger register sizes depending on the quantization scheme used, since registers are now placed after multiplication !
    Fanout of x(n) can become prohibitive
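The two critical-path formulas can be compared numerically. In this Python sketch the unit delays TM = 10 and TA = 2 are the example values used elsewhere in these slides, while the filter lengths and function names are assumptions chosen for illustration.

```python
def direct_form_delay(m_taps, t_mult, t_add):
    """Direct form: one multiplier followed by a chain of M-1 adders."""
    return t_mult + (m_taps - 1) * t_add


def transposed_form_delay(t_mult, t_add):
    """Transposed form: one multiplier and one adder between registers, independent of M."""
    return t_mult + t_add


T_M, T_A = 10, 2   # example unit delays
for m in (3, 4, 8):
    print(m, direct_form_delay(m, T_M, T_A), transposed_form_delay(T_M, T_A))
```

The direct-form delay grows linearly with the number of taps, while the transposed form stays at TM + TA, at the cost of the wider registers and the fanout on x(n) listed above.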

Representations of DSP algorithms and architectures: Data Flow Graph Representation !

Nodes represent computations/tasks, e.g. addition, multiplication
Computational time for a node can be specified with the node
Edges have a non-negative number of delays associated with them
A node shall only compute once all the input data is ready
Non-recursive systems have no loops in the DFG

[Figure: Data Flow Graph of a 3-tap FIR filter]

Consider this example !

Technique : DFG based Pipelining ! Data Flow Graphs (DFGs)

Some Terms

DFG based Pipelining Example (1/4)

DFG based Pipelining Example (2/4)

DFG based Pipelining Example (3/4)

[Figure: FIR data flow graph with input x(n), Z^-1 delays and taps h1 ... hM-1, shown before and after the cut is applied.]

Put delay on all cuts

DFG based Pipelining Example (4/4)

Let Tm = 10 units, Ta = 2 units, Desired clock = 6 units !

Initial design:
[Figure: transposed-form FIR filter with taps hM-1, hM-2, hM-3, ..., h0 and Z^-1 registers between the adders, producing y(n).]

Fine Grained Pipelining
[Figure: the same filter with additional Z^-1 registers; the annotation "insert registers here" marks where the multipliers are split.]

Pipelining using the Delay Transfer Theorem

Feedforward only (Example-1)

A convenient way to implement pipelining is to add the desired number of registers to all input edges and then, by repeated application of the node transfer theorem, systematically move the registers to break the delay of the critical path.
Functionality is not changed if a register is transferred from all incoming edges of a node (e.g. FA0) to all outgoing edges & vice versa !

Article : 7.2.7 [SHO]

Pipelining using the Delay Transfer Theorem

Feedforward only (Example-2)

Important

For a multiple-input multiple-output system: add a Source Node (generating all inputs) & add a Destination Node receiving/sinking all outputs.
    This helps avoid confusion in your design about the order of placement of the various nodes
    & of course you can't CUT NODES

[Figure: DFG with the added Source Node and Destination Node.]

Pipelining using the Delay Transfer Theorem

This scheme can also be applied for Register Balancing (as discussed earlier)

Technique : DFG based Parallel Processing: Data Flow Graphs (DFGs)

What if we can't optimize our system anymore using pipelining ?
Convert a SISO system to a MIMO system using parallel logic !
The effective sampling speed is increased by the level of parallelism: L
Multiple outputs are computed in parallel in a clock period

A parallel processing system is also called block processing, and the number of inputs processed in a clock cycle is referred to as the block size: L

Technique : DFG based Parallel Processing: An Example

Suppose we have multiple inputs available on every processing clock !

SISO to MIMO Conversion !

Technique : DFG based Parallel Processing: Example : FIR Filtering !

Consider a single-input single-output (SISO) FIR filter:
    y(n) = a x(n) + b x(n-1) + c x(n-2)

Convert the SISO system into a MIMO (multiple-input multiple-output) system in order to obtain a parallel processing structure.
To get a parallel system with L = 2 inputs per clock cycle, we re-write the equations as:
    y(2k)   = a x(2k)   + b x(2k-1) + c x(2k-2)
    y(2k+1) = a x(2k+1) + b x(2k)   + c x(2k-1)

i.e. at the kth cycle two outputs are processed: y(2k) & y(2k+1)
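Substituting n = 2k and n = 2k+1 into y(n) = a x(n) + b x(n-1) + c x(n-2) gives the pair of outputs computed in cycle k. The Python sketch below checks that this L = 2 block formulation produces the same sequence as the sample-by-sample filter; the function names, coefficients and input data are illustrative, and samples before n = 0 are assumed to be zero.

```python
def sample(x, n):
    """x(n), with zero assumed outside the stored range."""
    return x[n] if 0 <= n < len(x) else 0


def siso_fir(a, b, c, x):
    """One output per step: y(n) = a x(n) + b x(n-1) + c x(n-2)."""
    return [a * sample(x, n) + b * sample(x, n - 1) + c * sample(x, n - 2)
            for n in range(len(x))]


def parallel_fir_l2(a, b, c, x):
    """Two outputs per step k: y(2k) and y(2k+1), computed from the same block of inputs."""
    assert len(x) % 2 == 0, "block size L = 2 needs an even number of samples"
    y = []
    for k in range(len(x) // 2):
        y_even = a * sample(x, 2 * k) + b * sample(x, 2 * k - 1) + c * sample(x, 2 * k - 2)
        y_odd = a * sample(x, 2 * k + 1) + b * sample(x, 2 * k) + c * sample(x, 2 * k - 1)
        y.extend([y_even, y_odd])
    return y


x = [3, 1, 4, 1, 5, 9, 2, 6]
print(siso_fir(2, 3, 5, x) == parallel_fir_l2(2, 3, 5, x))
```

The parallel version needs half as many steps for the same input, which is the sense in which the effective sampling speed is increased by the block size L.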

Important: Delays in a MIMO system

2 Parallel 3-Tap Filter !

Combining Parallelism & Pipelining

By combining parallel processing (block size: L) and pipelining (pipelining stage: M), the sample period can be reduced to:

Technique : Parallel Processing + Pipelining Example : FIR Filtering !

Quiz ...

Time 8 Minutes !

Please assume that the computational time required for each node = T
Also assume that all nodes are atomic !!
Q-1) What is the maximum sampling rate of this system without any optimization?
Q-2) Optimize this design such that the sampling rate of the optimized system is 1/T. (You must show the DFG for the optimized design)

Solution !

Q-1) Sampling period = 4T, sampling rate = 1/(4T)
Q-2) Please see figure above (a total of 9 registers were added !!)
