SoC-Designs Which Can Be Optimized-270924-051829

Designs which can be optimized

Big mux in at_apb_spi_master.v

In the following line, rdata is 135 bits wide and dummy_cycles is 6 bits wide, so the variable right shift incurs a big mux. The
right shift can be removed simply by not shifting in the dummy bits in loopback mode. There are 5 at_apb_spi instances, so the saving is 5x.

localparam WSZ = 7'd72;


localparam RSZ = 8'd135; // longer than WSZ because of optional dummy_cycles
reg [RSZ-1:0] rdata;
assign rdata_no_dummy = loopback ? rdata >> dummy_cycles : rdata;

After removing the right shift, the rdata width can be reduced to 72 bits.

Reuse wdata and rdata in at_apb_spi_master.v

There are wdata (72 bits) and rdata (135 bits) registers, which hold the shift-out write data and shift-in read data respectively.

In fact, the rdata register is not necessary. The read data can be shifted into the LSB of wdata while the write data is shifted out
from the MSB of wdata.
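
A minimal sketch of such a shared shift register, assuming MSB-first transfers and a single shift enable per SPI bit. The module name spi_shared_shreg and the ports load, shift_en, mosi, miso and rdata_out are hypothetical and are not taken from at_apb_spi_master.v.

```verilog
// Hypothetical sketch: one register serves as both wdata and rdata.
module spi_shared_shreg #(
    parameter WSZ = 72
) (
    input  wire           clk,
    input  wire           rst_n,
    input  wire           load,       // load the write data before the transfer
    input  wire [WSZ-1:0] wdata_in,
    input  wire           shift_en,   // one pulse per SPI bit (assumed)
    input  wire           miso,       // serial read data
    output wire           mosi,       // serial write data
    output wire [WSZ-1:0] rdata_out   // holds the read data after the last shift
);
    reg [WSZ-1:0] sreg;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            sreg <= {WSZ{1'b0}};
        else if (load)
            sreg <= wdata_in;
        else if (shift_en)
            sreg <= {sreg[WSZ-2:0], miso};  // MSB leaves, read bit enters at the LSB
    end

    assign mosi      = sreg[WSZ-1];
    assign rdata_out = sreg;
endmodule
```

The sketch ignores the difference between the SPI drive and sample edges; a real implementation would probably need a one-bit sample flop on miso.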

2x register count for sticky bits

We have rram_sticky_write_protection* and rram_sticky_read_protection* in at_apb_wrpr_pins, along with other sticky bits in that module. But
these registers are not the real sticky registers. They are only used as the write pulses to the real sticky bits. The real sticky registers are in
u_cpu/u_at_sticky_wrpr. So the register count is 2x.

By making these registers _EXTERNAL_, the registers in at_apb_wrpr_pins are gone. We can still use the write enable signals to set the
sticky registers.

This is software transparent.
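
A minimal sketch of a sticky bit set directly by the write enable of an _EXTERNAL_ register. It assumes the bit is set by writing 1 and cleared only by reset; the module and port names are hypothetical and do not come from u_at_sticky_wrpr.

```verilog
// Hypothetical sketch: the only flop is the real sticky bit.
// No shadow register is kept on the APB register side.
module sticky_bit_sketch (
    input  wire clk,
    input  wire rst_n,     // assumed: only reset clears the sticky bit
    input  wire wr_en,     // write enable decoded for the _EXTERNAL_ register
    input  wire wr_data,   // bit written by software
    output reg  sticky     // also returned as the register read value
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            sticky <= 1'b0;
        else if (wr_en && wr_data)
            sticky <= 1'b1;   // writing 0 has no effect once the bit is set
    end
endmodule
```

Because software still writes and reads the same address, the register map is unchanged.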

Moving protection registers from u_cpu to romc

There are non-sticky registers in u_cpu/u_apb0 and sticky registers in u_cpu/u_at_sticky_wrpr. These registers control the
accessibility of RRAM/ROM/FLASH, which are in romc. Numerous routes are required from u_cpu to romc. Moving the registers to romc
saves the RFFs, iso cells and routing space.

Pinmux related

Outgoing signals:

at_pinmux_kso_cairo -\
cairo_dbg → at_pinmux_cairo_bundle → at_pinmux_o_cairo → at_pads_logic_cairo → at_pads_iso_cairo → at_pads_cairo

Incoming signals:

at_pads_cairo → at_pinmux_i_cairo

Note that at_pinmux_o_cairo is in PD_noret while all other modules are in PD_top.

Some observations:

1. In ret and hib mode, control signals to pads (IE, OEN, DSN, REN, etc.) do not change except for the I/O direction of the following
pads:
a. KSM
b. SWDIO
2. Signals from PD_noret to PD_top are isolated so they don’t propagate to pads.
3. Input signals from pads to PD_noret are of no use because PD_noret is off. Only signals to PD_top that wake up the system matter
here. They include:
a. KSM and QDEC
b. SPI and I2C
c. GPIOs
d. SWD

Possible improvements:

1. cairo_dbg:
Most input signals to cairo_dbg are in PD_noret. So we can place cairo_dbg in PD_noret (which removes many iso cells) and have
another debug mux module in PD_top for debug signals from PD_top (such as clocks and pseq). Alternatively, the new debug mux module
can be merged with at_pinmux_kso_cairo.
2. at_pads_logic_cairo and at_pinmux_cairo_bundle in PD_noret
at_pads_logic_cairo is in PD_top because it contains incoming signals ('I'). After splitting the incoming paths from at_pads_logic_cairo, it
can be in PD_noret.
Similarly, most inputs to at_pinmux_cairo_bundle come from PD_noret, so leaving it in PD_top is not optimal. It should be in PD_noret; otherwise many
unnecessary isolation cells are inserted.
3. Latches in at_pinmux_cairo_bundle
pin_sel is latched in this module because at_pinmux_kso_cairo/at_pinmux_o_cairo/at_pinmux_i_cairo use it. Modules in
PD_top (which include KSM, SWD, and GPIO) need this information to control the IOs and to route the outgoing and control
signals to pads or the incoming signals from pads to modules in PD_top. However, for IOs used by modules in PD_noret, the
controls are latched and held, so pin_sel is of no use for these I/Os. Therefore, two or three bits of encoding are enough instead of the
complete 8-bit pin_sel.
4. at_pinmux_i_cairo
Logic related to incoming signals to PD_noret can be in PD_noret. But at_pinmux_i_cairo is a small module, so it is fine to keep
everything in PD_top.

Make QSPI read 1 cycle shorter in ST_CS_TAIL state when clkdiv=1

Won’t do it. It makes sel_clk1x_early and qspi_clk_1x_180deg change at the same time and results in a glitch in SCLK.

// from at_ahb_qspi_csn.v
assign qspi_clk = sel_clk1x_early ? qspi_clk_1x_180deg : qspi_clk_x;

When clkdiv=1, ST_CS_TAIL is two cycles. If you look closely at CSN and SCLK, you will find that:

1. The SCLK pulse in ST_CS_TAIL state is redundant.
2. The timing check between SCLK and CSN here is tCHSH, measured from the rising edge of SCLK to the rising edge of CSN. The flashes
that we support require tCHSH to be more than 5ns or 10ns. That is to say, it is possible to remove one cycle of latency here as long
as the SCLK frequency is less than 100MHz.

When I tried to make ST_CS_TAIL one cycle by modifying load_csn_cnt_cmp in at_ahb_qspi_fsm_sdq.v, the simulation passed but
with a tCH (SCLK high time) violation in the GigaDevice model.

It was a false alarm, but it led me to find that sel_clk1x_early and qspi_clk_1x_180deg flipped at the same time. This could produce a
pulse in SCLK and it also triggered the false tCH violation.

Gate clock of APB slave register access (CAIRO-1147)

There are two clocks to an APB slave. One is PCLK for the functional logic (state machines and others). The other is PCLKG for
accessing registers.

PCLK (or apb_clk[i] in wrpr) is generated in wrpr. Its on/off is controlled by software and the pseudo code is like this:

ICG u_cg ( .clk_out(apb_clk[i]), .clk_in(clk_sel[i] ? PCLK_ALT : PCLK), .en(clk_enable[i]) );

u_apb[012]/PCLKG is gated by APBACTIVE. PCLKG runs when APB registers are accessed (APBACTIVE=1). In some slaves (e.g.
u_apb0/u_apb_uart_0 and u_apb_uart_1), a local CG gates PCLKG further, so that the slave's PCLKG runs only when its own registers are
accessed. In other slaves (e.g. u_apb0/u_wrpr), PCLKG is used for register access directly, so it keeps running even if
the access is not targeted at that slave.
For example, in the following diagram, the CPU reads UART1’s register. PCLKG of UART0 is not running but wrpr’s PCLKG keeps
running.

Things that we can do:

1. All APB slaves should have a clock bridge to generate the PCLK_en_periph signal and gate u_apb[012]/PCLKG. Moreover, having a
clock bridge there removes unnecessary signal toggling, such as u_wrpr/PENABLE in the above diagram. at_apb_clk_bridge is used
for peripherals that require a fixed frequency and at_apb_clk_bridge_simple is used for the others.
2. Each APB slave's PCLKG should be gated by APBACTIVE & PCLK_en_periph (or is PCLK_en_periph alone enough because it is more
accurate than APBACTIVE?). Don’t use two ICGs to do this. Use a single ICG instead to minimize the latency.
3. Merge at_clk_gate into the clock bridge, so that at_apb_clk_bridge outputs the gated version of PCLKG.
4. Registers of RWW type must be handled with care. These registers can be made _EXTERNAL_ and be clocked by PCLK, or
request the clock when they are to be updated.
5. See also APB clock bridge - Google Docs
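
Item 2 can be sketched as follows: the two enables are combined first and a single gate is applied to the clock. The latch-based ICG below is a generic behavioral model, not the at_clk_gate implementation, and the module names and the ungated PCLK input are assumptions for illustration.

```verilog
// Generic behavioral model of a latch-based ICG (assumed structure).
module icg_sketch (
    input  wire clk_in,
    input  wire en,
    output wire clk_out
);
    reg en_lat;
    always @(clk_in or en)
        if (!clk_in)
            en_lat <= en;            // latch is transparent while clk_in is low
    assign clk_out = clk_in & en_lat;
endmodule

// Hypothetical per-slave gate: one ICG instead of two cascaded ones.
module pclkg_gate_sketch (
    input  wire PCLK,                // ungated bus clock (assumed input)
    input  wire APBACTIVE,
    input  wire PCLK_en_periph,
    output wire PCLKG_periph
);
    // Combine the enables first, then gate the clock once.
    wire en = APBACTIVE & PCLK_en_periph;

    icg_sketch u_cg (
        .clk_in  (PCLK),
        .en      (en),
        .clk_out (PCLKG_periph)
    );
endmodule
```

If PCLK_en_periph alone turns out to be sufficient, the AND disappears and en is simply PCLK_en_periph.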

Here is the proper way to gate the PCLKG:

at_apb_clk_bridge u_apb_uart_1_clk_bridge (
    // shared
    .PRESETn       (apb_sreset_n[5]),
    // bus side
    .PCLK_bus      (PCLKG),
    .PSEL_bus      (psels[5]),
    .PENABLE_bus   (i_penable),
    .PREADY_bus    (preadys[5]),
    .PRDATA_bus    (prdatas[5]),
    // peripheral side
    .PCLK_periph   (PCLKG), // replaces apb_clk[5]
    .PSEL_periph   (uart1_psel_int),
    .PENABLE_periph(i_penable_uart1),
    .PREADY_periph (uart1_pready_int),
    .PRDATA_periph (uart1_prdata_int),
    .PCLK_en_periph(uart1_pclk_en)
);

at_clk_gate #(.SPN(SPN)) u_uart1_cg (
    .ck_out   (PCLKG_uart1),
    // Using PCLKG here (instead of apb_clk[5]) is more consistent, and there is
    // no need to switch on apb_clk[5] before programming its registers.
    .ck_in    (PCLKG),
    .enable   (uart1_pclk_en),
`ifdef FPGA
    .CGBYPASS (1)
`else
    .CGBYPASS (CGBYPASS)
`endif
);

at_apb_uart u_apb_uart_1 (
.PCLK (apb_clk[5]), // Peripheral clock
.PCLKG (PCLKG_uart1), // Gated PCLK for bus
.PRESETn (apb_sreset_n[5]), // Reset

Speed up APB access (CAIRO-1151)

The first diagram shows an access to the pwm registers. An at_apb_clk_bridge is instantiated in pwm. There is a three-cycle latency from APB0’s AHB
side to pwm’s APB (the blue lines). Besides, there are 5 cycles from the completion of pwm’s APB to APB0’s AHB (the red lines). See
Shorten the latency in apb-apb clock bridge (at_apb_clk_bridge.v) - Google Docs for more information.

PRDATA is registered in the clock bridge (the two highlighted signals). But if you look at it closely, PRDATA’s value doesn’t
change after the completion of the APB transaction because PADDR was latched and does not change. Therefore, the registers can be
removed.

The second diagram shows an access to the wrpr registers. There is no at_apb_clk_bridge instantiated in wrpr. There is a one-cycle latency from
APB0’s AHB side to wrpr’s APB (the blue lines) and a one-cycle latency from the completion of wrpr’s APB to APB0’s AHB (the red
lines).
The 1-cycle trailing latency comes from the AHB-to-APB bridge (cmsdk_ahb_to_apb.v or
u_cpu_/u_ahb_to_apb_hnonsec/u_at_ahb_to_apb). When the parameter REGISTER_RDATA is set to 1, an extra pipeline stage is
added. This pipeline stage may be removed because the path from APB to the AHB-to-APB bridge is short.

Gating clocks of DMA, GPIO and other AHB peripherals

SW uses DMA extensively. For example, memcpy() uses DMA to do the job. This leaves the DMA clock always on. There are 4 cores. Ideally, their
clocks can be gated based on the DMA FSM states. (CAIRO-1148)

GPIO’s clock comes from u_cpu/sysugclk_out and it is always running even if the GPIO function is not used. (CAIRO-1149)

Similarly, the mpc’s clock is free-running and can be shut off if there is no flash/sram access. (CAIRO-1150)

Speed up access to ATLC (CAIRO-1152)

ATLC runs at a fixed frequency of 16MHz. Therefore a down sync bridge is used to bridge it and the bp_clk domain. However, this
bridge makes register accesses (and also TCM accesses) slow. The faster bp_clk runs, the more cycles the CPU spends waiting for
the access to complete. Making the access faster helps active power and eases the CPU critical flow (Rx to Tx).

Write

The down sync bridge has a write buffer that can hold one write request. Once the write request is buffered, the transaction
completes on the CPU side. This effectively hides the write latency of the down sync bridge from the CPU.

To make use of the write buffer, we have to do the following things:

1. Set the WB parameter of the down sync bridge (u_cmsdk_ahb_to_ahb_down_sync) to 1.
2. The write request to the ATLC must be ‘bufferable’. That is to say, HPROT[2] must be 1. There are two possible ways to do this; the first
one is more flexible.
a. Use a register to control HPROT[2] on the slave side of u_cmsdk_ahb_to_ahb_down_sync. When the register is set, the WB is
enabled.
b. Program the memory protection unit and make the ATLC space bufferable.
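
Option (a) could look like the sketch below, which ORs a software-controlled bit into HPROT[2] before it reaches the bridge. The module name and the signal names hprot_in, wb_enable_reg and hprot_to_bridge are hypothetical; only the HPROT[2] bit position comes from the text above.

```verilog
// Hypothetical sketch: force the bufferable attribute on accesses to ATLC.
module hprot_bufferable_override (
    input  wire [3:0] hprot_in,         // HPROT arriving from the system bus
    input  wire       wb_enable_reg,    // software-controlled register bit (assumed)
    output wire [3:0] hprot_to_bridge   // HPROT on the slave side of the down sync bridge
);
    // Only bit 2 (bufferable) is modified; the other attributes pass through.
    assign hprot_to_bridge = {hprot_in[3],
                              hprot_in[2] | wb_enable_reg,
                              hprot_in[1:0]};
endmodule
```

With wb_enable_reg cleared, HPROT passes through unchanged, so the write buffer can be switched off without touching the MPU settings.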

Note:

1. The write buffer just holds the write request; it doesn’t re-order accesses. For back-to-back requests, the following request won’t
take place until the write request completes on the ATLC side. (See the following waveform)
2. If the back-to-back requests (write followed by a read) are to the same address, the data is read from the ATLC side after the write
completes on the ATLC side. It doesn’t return the write data held in the write buffer.
3. The error response (HRESP) is ignored. That is fine because ATLC doesn’t generate error responses.

Compared to WB=0, 4 bp_clk (32MHz) cycles are saved.

Read

There is a big mux in at_lc_regs_core.v, and cpu_if_address_dcore selects the register to be read out.
cpu_if_address_dcore is the output of the down sync bridge and cpu_if_read_data_dcore passes through the down sync bridge
again to the system bus. Passing through the down sync bridge twice makes the read very slow.

case (cpu_if_address_dcore)
    10'b0000000000:
        begin
            cpu_if_read_data_dcore = dbg_ctrl0_full_read_data;
        end
    10'b0000000001:
        begin
            cpu_if_read_data_dcore = { 22'h000000, dbg_ctrl1_full_read_data };
        end
endcase

But for reads, we can bypass the down sync bridge completely. HADDR can drive cpu_if_address_dcore directly and
cpu_if_read_data_dcore can drive HRDATA directly in the next cycle to complete the read. It takes only two bp_clk cycles.

Note:

The mux serves as the decoding circuit and is shared by read and write. To avoid a read and a write happening in the same cycle,
we can take either approach:

1. Allow a read only when the WB is empty.
2. Allow the read to take place whenever it is not a write cycle.

If we take the second approach, we have to take care of the read/write completion order because a later read may complete first
while the earlier write is still in the WB. This can be undesirable in some cases. So I suggest that we take the first approach.
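
The first approach can be sketched as a simple qualifier on the AHB address phase. The wb_empty status signal and the module and output names are assumptions; the down sync bridge may not expose such a signal as-is.

```verilog
// Hypothetical sketch: allow the read bypass only when the write buffer is empty.
module atlc_read_bypass_sketch (
    input  wire       hsel,
    input  wire       hready,
    input  wire [1:0] htrans,
    input  wire       hwrite,
    input  wire       wb_empty,    // assumed status: no write pending in the WB
    output wire       rd_bypass,   // HADDR may drive cpu_if_address_dcore directly
    output wire       rd_stall     // read must wait until the buffered write drains
);
    // Valid AHB read address phase: selected, ready, NONSEQ or SEQ, not a write.
    wire rd_req = hsel & hready & htrans[1] & ~hwrite;

    assign rd_bypass = rd_req &  wb_empty;
    assign rd_stall  = rd_req & ~wb_empty;
endmodule
```

Stalling the read while the WB is occupied keeps reads and writes in program order, which is why this approach avoids the completion-order problem of the second one.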
