Block Interconnection: Today's Topics Divide Into Two
Block Interconnection
Today's topics divide into two:
Problems of logical interfacing
  Standardization of interfaces
    AMBA
    OCP
  Interface operation
Problems of electrical interfacing
  Skew
    clock skew source/dest.
    bit skew in buses
  Round trip delays
  Drive capacity
The latter set of topics is more in the area of the later lectures.
COMP32212 Implementing System-on-Chip Designs
Section 6
Slide 1
University of Manchester
Traditional bus
[Timing diagram: example read and write cycles, showing CS, Address, Read, Read data (data_in), Write and Write data (data_out)]
Asynchronous bus (timed by strobes from master)
Asynchronous bus
The example in the slide is not the only protocol for bus timing. Another common approach uses an enable (CE) and a direction signal to specify the operation.
[Timing diagram: CS, Address, R/W, En, Read data and Write data]
Although the data is shown here as unidirectional, off-chip buses typically use
bidirectional data signals, so they must be either reading or writing when active. This
is due to pin restrictions on the package (and wiring on the PCB).
On-chip buses are limited by distance but not (particularly) restricted by width
because there is a considerable wiring resource on a chip. However, on-chip signals are now universally unidirectional so that electrical buffers can be inserted
along the wire to keep switching edges reasonably fast.
Wiring
Let's elaborate on that point.
A simple model of the load on a gate estimates a lumped capacitance: 200 µm of
wire will have twice the capacitance of 100 µm.
Assuming the same driver, the edge on the longer wire will be correspondingly slower; thus the signal delay will be greater.
[Figure: voltage against time, showing where signal edges cross the switching threshold]
Possible solutions:
Increase the drive strength
Decrease the load
However, the wires also have resistance, which slows down the edge more at
greater distances from the driver. The first solution is therefore not as effective
as might be thought at driving longer wires.
The load can be decreased by cutting the wire and inserting buffers (amplifiers) at intervals. These also insert delay but keep the edges fast.
Buffers have an input and an output so the wires, necessarily, are unidirectional.
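As a rough numerical illustration of the points above, the following Python sketch uses a first-order (Elmore-style) RC estimate, which is a modelling choice made here rather than something stated in these notes. The function names and every resistance, capacitance and buffer-delay figure are assumptions chosen purely for illustration; they are not data for any real process.

```python
def wire_delay(length_um, r_drv=1000.0, r_per_um=0.1,
               c_per_um=0.2e-15, c_load=2e-15):
    """First-order delay estimate (seconds) for one driver and one wire.

    All default values are illustrative assumptions, not process data.
    """
    r_wire = r_per_um * length_um   # wire resistance grows with length
    c_wire = c_per_um * length_um   # so does wire capacitance
    # The driver charges the whole load; the wire's own resistance charges
    # (on average) half the wire capacitance plus the far-end load.
    return r_drv * (c_wire + c_load) + r_wire * (c_wire / 2 + c_load)


def buffered_delay(length_um, segments, t_buf=20e-12, **kwargs):
    """Delay when the wire is cut into equal segments joined by buffers."""
    segment_delay = wire_delay(length_um / segments, **kwargs)
    # Each inserted buffer adds its own (assumed) delay t_buf.
    return segments * segment_delay + (segments - 1) * t_buf
```

With these assumed figures the wire-resistance term grows with the square of the length, so for a long wire `buffered_delay(10_000, 4)` comes out lower than `wire_delay(10_000)` even though the buffers add delay of their own.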
Buffers
The term buffer as applied here refers purely to an electrical amplifier.
Buffer is also used to refer to, for example, latches which hold data and
are thus part of the logic. Beware of potential confusion!
[Timing diagram: APB transfers, showing Address, Write data (data_out) and Read data (data_in)]
Simple
Single master
Used for low speed peripherals
APB
APB is a simple bus model where commands and addresses and possibly write
data are output at the beginning of the bus cycle and any read data is read at
the end of the cycle. Thus there needs to be adequate time for a round trip
within the bus cycle.
[Figure: one bus master connected to several peripherals]
The first APB spec. performed every transfer unconditionally in two clock
cycles. This was subsequently extended so that slow peripherals can insert extra
wait states to extend the cycle time if they cannot respond quickly enough.
Wait states may be acceptable when communicating with peripheral devices
because such accesses are infrequent so the penalty is small.
Another extension was an error signal, so the failure (abort) of a bus cycle can
be signalled.
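The behaviour described above (a two-cycle transfer, optional wait states and an error indication) can be sketched as a small behavioural model in Python. This is not the AMBA signal-level protocol: the `Peripheral` class, the `apb_transfer` function and the register-file peripheral are hypothetical names and structures invented for this illustration.

```python
class Peripheral:
    """Hypothetical APB-style peripheral with a small register file."""

    def __init__(self, size=4, wait_states=0):
        self.regs = [0] * size
        self.wait_states = wait_states

    def access_phase(self, addr, write, wdata):
        """Yield (ready, rdata, error) once per clock of the access phase."""
        for _ in range(self.wait_states):
            yield False, None, False        # not ready yet: extend the cycle
        if not 0 <= addr < len(self.regs):
            yield True, None, True          # signal an error (abort)
        elif write:
            self.regs[addr] = wdata
            yield True, None, False
        else:
            yield True, self.regs[addr], False


def apb_transfer(peripheral, addr, write=False, wdata=0):
    """Run one transfer; return (rdata, error, clock_cycles_used)."""
    cycles = 1                              # first cycle: address and command driven
    for ready, rdata, error in peripheral.access_phase(addr, write, wdata):
        cycles += 1                         # one more clock in the access phase
        if ready:
            return rdata, error, cycles
    raise RuntimeError("peripheral never signalled ready")
```

In this model a peripheral with no wait states completes every transfer in two cycles, and one created with `wait_states=3` takes five.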
[Timing diagram: pipelined AHB transfers. Commands: read 0, read 1, write 2, idle. Addresses: address 0, address 1, address 2. Read data: data_in 0, data_in 1. Write data: data_out 2.]
Moderately complex
Multi-master via centralised arbitration
Bus cycles can be extended or aborted
Used for processor buses on medium performance devices (e.g. ARM9)
AHB
AHB increases performance by pipelining. For example, in a read operation it
outputs an address and status asking for the read on a rising clock edge. This is
decoded and selects the appropriate slave device.
On the next active clock edge the slave is expected to latch the address and start
the read. At this point the bus master can start the next cycle.
AHB operation is pipelined, so that as one set of data is transferred the subsequent
address can be sent.
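The overlap of address and data phases can be sketched as a simple schedule in Python. The function name `ahb_schedule` is hypothetical, and the model assumes back-to-back transfers with no wait states, which is a simplification.

```python
def ahb_schedule(transfers):
    """Return one (address_phase, data_phase) pair per clock cycle.

    Assumes back-to-back transfers and no wait states (a simplification).
    """
    schedule = []
    for cycle in range(len(transfers) + 1):
        # The address phase of transfer n overlaps the data phase of n - 1.
        addr = transfers[cycle] if cycle < len(transfers) else None
        data = transfers[cycle - 1] if cycle > 0 else None
        schedule.append((addr, data))
    return schedule
```

Under these assumptions N transfers occupy N + 1 cycles, rather than the 2N cycles needed if each address phase had to wait for the previous data phase to finish.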
[Figure: bus masters sharing a set of devices. Timing diagram: address phases addr_0 and addr_1 with data phases data_0 and data_1, annotated "now quiescent so abort can proceed".]
[Figure: AXI channels (write data, read command/address, read data/response); the channels may be pipelined]
AXI
AXI is more like a network than a bus. Transactions can be initiated from various units and will arrive at various destinations. In between they may be arbitrated and multiplexed as desired. The packet IDs allow steering so that the
correct response is returned to the correct initiator.
Data can be pipelined to reduce the distance travelled per clock cycle and, consequently, allow faster clocking and higher throughput.
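One way the ID-based steering described above could work is for the interconnect to extend each initiator's own ID with the number of the port it arrived on. The Python sketch below assumes that arrangement; `ID_BITS`, `tag_request` and `steer_response` are hypothetical names and the ID width is an arbitrary choice.

```python
ID_BITS = 4  # assumed width of each initiator's own transaction ID


def tag_request(port, local_id):
    """Extend an initiator's ID with its port number on the way in."""
    if not 0 <= local_id < (1 << ID_BITS):
        raise ValueError("local_id does not fit in ID_BITS")
    return (port << ID_BITS) | local_id


def steer_response(tagged_id):
    """Recover (port, local_id) from the ID carried by a response."""
    return tagged_id >> ID_BITS, tagged_id & ((1 << ID_BITS) - 1)
```

Because the port number travels with the transaction, responses can come back in any order and still be returned to the correct initiator without the interconnect keeping a table of outstanding requests.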
Protocol
[Timing diagram: Ready, Valid and Data signals across a valid/data/ready interface, with data items A to E]
AXI-like pipeline
Consider a synchronous AXI pipeline stage.
[Figure: pipeline stage with signals valid_in, valid_out, data_in, data_out, ready_in and ready_out]
Data moves across an interface if both valid and ready are active.
If you indicate (upstream) that you are willing to accept data (ready), that is
a commitment.
With extra buffering it is possible to achieve full throughput and still stall the
pipe locally.
[Figure: a fully occupied pipeline told to "Stop!"; stages marked "Crunch!" show data colliding]
With sparser occupancy data can stop safely; however throughput is reduced.
[Figure: a sparsely occupied pipeline, with each stage marked "Go" or "Stop!"]
Note that in some pipelines there will be buffering implicit in the architecture to
even out such flow irregularities. Examples could include network routers storing and forwarding packets.
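The valid/ready behaviour and the effect of extra buffering can be explored with a small cycle-based model in Python. This is a sketch under stated assumptions: the `Stage`, `clock` and `run` names are invented here, and each stage's ready signal is assumed to depend only on its stored state at the start of the cycle (so that asserting ready really is a commitment).

```python
from collections import deque


class Stage:
    """Pipeline stage holding up to `capacity` items (assumed structure)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.buf = deque()

    @property
    def valid(self):                 # something to offer downstream
        return len(self.buf) > 0

    @property
    def ready(self):                 # room to accept from upstream
        return len(self.buf) < self.capacity


def clock(stages, source, sink, sink_ready):
    """Advance every interface by one clock cycle."""
    n = len(stages)
    # Decide all transfers from the state at the start of the cycle:
    # data moves only where valid and ready are both active.
    out_fire = stages[-1].valid and sink_ready
    fires = [stages[i].valid and stages[i + 1].ready for i in range(n - 1)]
    in_fire = len(source) > 0 and stages[0].ready
    # Apply them together (downstream first, so nothing moves twice).
    if out_fire:
        sink.append(stages[-1].buf.popleft())
    for i in reversed(range(n - 1)):
        if fires[i]:
            stages[i + 1].buf.append(stages[i].buf.popleft())
    if in_fire:
        stages[0].buf.append(source.popleft())


def run(items, n_stages, ready_pattern, capacity=2):
    """Push items through the pipe; ready_pattern gives the sink's ready per cycle."""
    stages = [Stage(capacity) for _ in range(n_stages)]
    source, sink = deque(items), []
    for sink_ready in ready_pattern:
        clock(stages, source, sink, sink_ready)
    return sink
```

With `capacity=2` the model sustains one item per cycle and loses nothing when the sink stalls. With `capacity=1` each stage must empty before it can advertise ready, so data still stops safely but throughput is reduced.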
Bus hierarchy
Simple example:
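[Block diagram: ARM processor with caches ($) and TCM RAM; AHB connections to ROM, the USB host, the LCD controller (ctrl), the off-chip bus interface (I/F) and a bridge to the APB]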
Atmel AT91SAM9261
Example SoC
The example in the slide is the Atmel AT91SAM9261 ARM-based microcontroller; this is the chip used in the laboratory equipment. The view shows the
interconnection structure around the processor.
The processor masters two buses (instruction and data) which are fed into a bus
switch matrix. Other devices can also be bus masters as the USB host interface,
the LCD controller and the APB bridge all have DMA capability.
Dependent on the matrix are:
APB
ROM
USB host and LCD controller (for programming)
External bus interface
RAM
The crossbar switch allows parallel operations so different masters can have
access to different slave devices simultaneously. Clashes have to be resolved by
inserting wait states.
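The parallelism of a crossbar, and the way clashes turn into waits, can be sketched for a single cycle in Python. The `arbitrate` function and the fixed-priority policy are assumptions made for illustration; these notes do not say what arbitration policy the AT91SAM9261 matrix actually uses.

```python
def arbitrate(requests, priority):
    """Resolve one cycle of crossbar requests.

    requests: dict mapping master name -> requested slave name.
    priority: list of master names, highest priority first (assumed policy).
    Returns (grants, waiting): granted master -> slave, and stalled masters.
    """
    grants = {}
    waiting = []
    taken = set()
    for master in priority:
        slave = requests.get(master)
        if slave is None:
            continue                    # this master is idle this cycle
        if slave in taken:
            waiting.append(master)      # clash: must insert wait states
        else:
            taken.add(slave)
            grants[master] = slave      # different slaves proceed in parallel
    return grants, waiting
```

For example, if the instruction bus requests ROM while the data bus and the LCD controller both request RAM, the first two proceed in parallel and the LCD controller waits.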
Bus occupancy can be reduced because the processor has:
separate instruction and data caches
direct access to the on-chip RAM as Tightly Coupled Memory (TCM)
Tightly coupled memory
Tightly Coupled Memory (TCM) maps fast SRAM to specific addresses. (This
device has ten individually switched 16 KB blocks.) This can allow parallel
instruction and data access and still leave the I/O buses free for DMA.
TCM is sometimes preferred over cache in microcontroller applications because
its timing behaviour is easy to predict. Cache accesses may be faster on average
(as the hit rate may be better optimised) but predictability means that a worst-case
response can be guaranteed, which is important in some real-time applications.
APB
The APB hosts numerous lower performance peripherals. It may be run at a
lower clock speed than the AHBs as a power saving measure.
Bridge
A bus bridge is simply a means of converting from one protocol to another. Usually a bridge is a slave on one bus and a master of another, although bidirectionality is possible.
Split transactions
When a bus structure becomes sufficiently complicated it can be an advantage to
allow transactions to complete out-of-order.
[Figure: three arrangements of master, bridge, RAM and I/O]
This gives decreased latency for some (urgent) operations at the expense of
greater complexity, especially at the master where dependencies between reordered transactions may have to be resolved.
GALS
As clock speeds increase and wiring delays become more significant it is difficult to maintain a synchronous clock model across a whole chip. This problem
was discussed in the section on timing (q.v.).
However one solution to this problem is to allow different IP blocks to be
clocked independently with an arbitrary phase and, possibly, at different frequencies. It is then the job of the interconnection to cross the clock domains.
This form of interconnection is known as GALS (Globally Asynchronous,
Locally Synchronous). GALS frees the SoC designers from a number of timing
constraints which makes timing closure much easier. Each block is developed as
a synchronous circuit but there is no need for chip-wide skew-free clock distribution.
Another advantage is the ability to run each block at its own best frequency
with the possibility of consequent power reduction.
There can also be a reduction in power supply noise. In a synchronous circuit
logic begins to switch just after each active clock edge. Typically the number of
gates switching over time diminishes during the clock period because not all
logic paths are the same length. When gates switch they pull charge from the
power supply or dump it onto the ground. The demand for charge (a.k.a. current) therefore varies periodically setting up a regular AC signal in the (extensive) power wiring. This both acts as a transmitting aerial (especially the wiring
into the chip) and may affect other gates switching. If a whole chip is synchronous then this problem is at its worst; if there are several clocks with different
phases (or frequencies) the demand tends to even out, reducing noise problems.
There are also disadvantages to the unsynchronised communication used in GALS. The
biggest is the need for synchronisation of signals when they arrive at their destination. This inherently adds some latency to the signal; more if the reliability is
increased by adding longer waits for the resolution of any metastability. Communication is therefore slowed down in some way.
Handshaking
The simplest communication mechanism is synchronous on a one-item-per-clock
basis; this relies on the assumption that data will always be available and
accepted on every cycle.
If data is not available on every cycle a validity (or request) signal can be
used to indicate when data is available.
If the receiver may not always accept data then some sort of flow control must be
included. Across a synchronous interface such as AXI, discussed earlier, this
can be another status bit.
With an asynchronous interface various assumptions cannot be made and some
form of handshake protocol is needed. This must be subject to synchronisation
to the local clock, with a concomitant latency penalty.
[Timing diagram: Request, Acknowledge and Data signals in a handshake]
Block transfers
A simple method of communication between asynchronous blocks is to synchronise each data request and, subsequently, latch the data from the bus. This
results in a moderate latency but quite a low bandwidth because every transmission requires two synchronisations, one for the forward request and another for
the reverse acknowledge.
Higher bandwidth can be achieved by buffering several data elements for a single synchronisation. The transmitter owns a RAM into which it writes a message. When this is complete it passes the RAM to the receiver. After
synchronising with the receiver's clock the data can be read out at full speed.
The overall latency is greater but the average bandwidth is also higher. This type
of mechanism may be further enhanced (at additional hardware cost) by double
buffering so that one RAM is filled whilst the previous one is emptied.
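The trade-off between latency and bandwidth described above can be put into rough numbers. The Python sketch below assumes each synchronisation costs a fixed number of receiver clock cycles and each item takes one cycle to transfer; both figures, and the function names, are illustrative assumptions. Double buffering is not modelled.

```python
def items_per_cycle(block_size, sync_cycles=2, cycles_per_item=1):
    """Average throughput when one request/acknowledge pair covers a block.

    Every block pays for two synchronisations: the forward request and
    the reverse acknowledge.
    """
    total_cycles = 2 * sync_cycles + block_size * cycles_per_item
    return block_size / total_cycles


def first_item_latency(block_size, sync_cycles=2, cycles_per_item=1):
    """Cycles from writing the first item until the receiver can read it."""
    # The whole block is written before the RAM is handed over.
    return block_size * cycles_per_item + sync_cycles
```

With these assumed costs, synchronising every item individually (`block_size=1`) delivers 0.2 items per cycle, whereas a 16-item block delivers 0.8 items per cycle at the price of a longer wait for the first item.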
At its most extreme the interconnection may be asynchronous logic which can
implement an elastic FIFO between transmitter and receiver. This could be a
dual-port RAM which is written and read at different rates (synchronisation is
only necessary when the FIFO is almost empty or almost full) or truly clock-free circuits.
Serial buses
This slide is something of an aside, in that it is chiefly concerned with systems off chip.
For wider system interconnection it is common to use serial interconnection:
Inherently slower
Far fewer chip-pins required
Cheaper interconnection medium (wires, connectors, ...)
Suitable for wireless applications
Examples include:
Ethernet
USB
I2C
On SoC
Pin restrictions do not apply to intra-chip connections.
Nevertheless the reduction in wiring is becoming attractive for some SoC applications.
Serial buses
Differential signalling
In a serial bus transactions must occur as packets, so that the various signals are
time-domain multiplexed onto the medium. Thus it may be that a transmitter
sends a packet which contains C bits of a command (such as read or write), A
bits of address (which may be a subsystem and/or a memory address) and D
bits of data.
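The time-domain multiplexing of command, address and data can be sketched in Python as follows. The field widths (2, 8 and 8 bits), the most-significant-bit-first ordering and the function names are assumptions for illustration; real serial buses each define their own packet formats.

```python
def serialise(command, address, data, c_bits=2, a_bits=8, d_bits=8):
    """Return a packet as a list of bits: command, then address, then data."""
    bits = []
    for value, width in ((command, c_bits), (address, a_bits), (data, d_bits)):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        # Send each field most significant bit first (assumed ordering).
        bits.extend((value >> i) & 1 for i in reversed(range(width)))
    return bits


def deserialise(bits, c_bits=2, a_bits=8, d_bits=8):
    """Rebuild (command, address, data) from the received bit sequence."""
    fields = []
    pos = 0
    for width in (c_bits, a_bits, d_bits):
        value = 0
        for bit in bits[pos:pos + width]:
            value = (value << 1) | bit
        fields.append(value)
        pos += width
    return tuple(fields)
```

Both ends must agree on the field widths and ordering in advance, since the single wire carries no indication of where one field ends and the next begins.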
Ethernet
Differential signalling is used for noise immunity. If two wires are physically
close to each other, any induced noise is likely to affect them in a similar way. A
single wire compared with an unmoving ground signal may have its state altered,
but the difference between a pair of wires should be (largely) preserved. This is known as common-mode rejection.
USB
You are probably more familiar with USB (Universal Serial Bus) as a user than
aware of its operation. It is a hierarchical structure where devices (slaves) are
polled by the host (master) to allow them to transfer data. Data is communicated
across a half-duplex (one direction at a time) serial line carried on a
differential pair (see the note on differential signalling).
Communication is asynchronous so each device has to have a precise clock reference matching the specification.
I2C
I2C (Inter-Integrated Circuit) is a Philips invention; to avoid legal complications
it is typically referred to as Two Wire Interface (TWI) by other manufacturers.
I2C is a fairly slow interconnection, suitable for driving entirely in software with
two PIO bits if required. It is typically used as a PCB level interconnection, for
example for adding memory to small microcontrollers. However, it is a multi-master bus where arbitration for mastery takes place via the same two wires.
Communication is synchronous as one wire is used as data, the other as a clock.
However, the clock (really more of a strobe) need not be regular, as it may be
software driven or paused by the receiving device if it is not ready.
With differential signalling, the state of the signal is interpreted by looking at the difference between the
two wires, which will be either positive or negative: a binary choice.
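Common-mode rejection can be illustrated with a few lines of Python. The voltage levels, the single-ended threshold and the function names are illustrative assumptions only.

```python
def drive_differential(bit, high=1.0, low=0.0):
    """Return the (positive, negative) wire voltages for one bit."""
    return (high, low) if bit else (low, high)


def receive_differential(p, n):
    """Interpret the bit from the sign of the difference between the wires."""
    return 1 if p - n > 0 else 0


def receive_single_ended(v, threshold=0.5):
    """Interpret the bit by comparing one wire against a fixed threshold."""
    return 1 if v > threshold else 0
```

If the same 0.7 V of noise is added to both wires of a pair carrying a 0, the difference is unchanged and `receive_differential` still returns 0, whereas a single wire at 0 V with the same noise added would be read as a 1 by `receive_single_ended`.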