LPDDR4 Multi-Channel Architectures WP
Author: Marc Greenberg, Director of Product Marketing for DDR IP, Synopsys

LPDDR4, the latest double data rate synchronous DRAM for mobile applications, is a DRAM specification found in today's high-end portable products such as the Samsung Galaxy S6 smartphone, the Apple iPhone 6S [1], and several other recently announced devices. In addition to mobile use, we predict that LPDDR4 will follow its predecessor LPDDR3 into tablets and thin and light laptops in a "memory down" configuration, i.e. when the DRAM is physically soldered onto the board.
LPDDR4 offers huge bandwidth in a physically small PCB area and volume; up to 25.6 GByte/s of
bandwidth at a 3,200 Mbps data rate from a single 15mmx15mm LPDDR4 package when two dies are
packaged together. LPDDR4 builds on the success of LPDDR2 and LPDDR3 by adding new features and
introducing a major architectural change.
This white paper explains how LPDDR4 is different from all previous JEDEC DRAM specifications.
It discusses:
- Why designers are selecting LPDDR4
- Highlights of the LPDDR4 architecture
- How to best configure LPDDR4 channels
- How to handle 2-die and 4-die packages with multi-channel connections
- The advantages of sharing channels through system-on-chip (SoC) partitioning
- How to optimize channels for the lowest power consumption
Why LPDDR4?
LPDDR4 includes a number of features that enable SoC design teams to reduce power consumption of
discrete DRAM. PCs and servers commonly use DDR devices mounted on dual inline memory modules (DIMMs) hosted on 64-bit wide buses. This board-level solution allows field-
upgradeable DRAM capacity expansion, but requires long and more heavily loaded interconnects which
consume more power than short traces. Systems using LPDDR2, LPDDR3 and LPDDR4 tend to have
fewer memory devices on each bus and shorter interconnects, and thus consume less power than DDR2,
DDR3 and DDR4 devices.
Design teams can call on power-saving options within the LPDDR4 DRAM. These features include
reduced voltage and I/O capacitance; a reduced width, multiplexed command and address bus;
eliminating the on-DRAM DLL; providing lower power standby modes with faster entry and exit; and
enabling faster, less complex frequency changes.
For part of the time, the memory drops to the LPDDR3 speed grade. This level of performance is sufficient
to support texts, calls, web browsing, photography, simple gaming: all features that don’t place too many
demands on the CPU or GPU.
For the majority of the time, when the mobile device is not in use and in a pocket or at a bedside, the DRAM is
switched off or in low speed mode. It will have one channel of the memory active just to perform ‘always-on,
always-connected’ tasks. In this mode, the device is performing background activities such as maintaining
cell contact, receiving messages, receiving / displaying push notifications, synchronizing mail, and displaying
the time.
However, it is the performance of the device during the highest use time that drives many mobile users to
upgrade their devices, which is why it is so important to provide an outstanding user experience in this use
mode (Figure 1).
Figure 1. Highest use times drive the upgrade cycle for mobile users. (The figure contrasts best-performance use with the 200-1600 Mbps, LPDDR3-range, demands of text, phone calls, browsing, reading, photography, puzzles and simple games, and with low-speed operation under standby power limits.)
DDR2, DDR3, and DDR4 devices offer one command address bus input and one data bus per package,
and most commonly one die per package. LPDDR2 and LPDDR3 may offer one to four dies per package. In
the case of two-die and four-die packages for LPDDR4, LPDDR3 and LPDDR2, generally two independent
command address input and data busses (channels) are provided. In other words, LPDDR2 and LPDDR3 partially enable multi-channel operation by offering two independent channels per package. LPDDR4 brings the issue to the forefront: there are two independent channels per die and four channels in most packages.
Connecting Multiple Channels
The LPDDR4 architecture is natively two-channel (Figure 2): each die has two command address inputs and two data buses. Four independent channels are available on an LPDDR4 2-die package. To deploy LPDDR4 effectively, designers must understand how this architectural change affects the system architecture.
Figure 2 diagram: a single LPDDR4 die provides two independent channels, Channel A and Channel B, each with its own DQ bus (two x8 groups) and a 2KB page.
A single DRAM device with one channel (for example, a single-die package of LPDDR3) can only be connected
one way — with the command/address bus on the SoC to the command/address bus on the DRAM and the
SoC data bus to the DRAM data bus (Figure 3). A chip select enables the DRAM when it is required.
Figure 3 diagram: the SoC's command/address bus, data bus, and chip select connect directly to a single-channel DRAM device (example: LPDDR3).
Having two DRAM devices, or one DRAM device with two independent interfaces like LPDDR4, supports four
possible configurations:
- Parallel (lockstep)
- Series (multi-rank)
- Multi-channel
- Shared command/address
Parallel (lockstep) connection
In the parallel (lockstep) connection (Figure 4), both DRAM devices share the same command/address bus and chip select, so they receive every command simultaneously and both of the DRAM devices are always in the same state. They always have the same page of memory open and access the same column, although the data stored in each DRAM is different.
Figure 4 diagram: parallel (lockstep) connection. One command/address bus and one chip select from the SoC drive both DRAM devices; each device has its own data bus.
Series (multi-rank) connection
In the series (multi-rank) connection (Figure 5), the two DRAM devices share both the command/address bus and the data bus, and individual chip selects determine which device responds to a given command.
Figure 5 diagram: series (multi-rank) connection. The SoC's command/address and data buses are shared by both DRAM devices, each of which has its own chip select.
Multi-channel connection
The multi-channel connection (Figure 6) provides each channel of DRAM or each DRAM device with an
independent connection to the SoC, where each device or channel has its own command/address bus, data
bus and chip select. This flexible configuration enables each DRAM device (or group of devices) to operate
completely independently of the other. They may be in different states, receiving different commands and
different addresses, and one may be reading while the other is writing.
A multi-channel connection also allows for the DRAMs to operate in different power states. For example, one
memory might be in a standby self-refresh mode, while the other is fully active.
Figure 6 diagram: multi-channel connection. Each DRAM device (or channel) has its own command/address bus, data bus, and chip select to the SoC.
Shared command/address (CA) connection
The final configuration option, which is used more commonly in non-low-power DDR installations, is multi-
channel with shared command/address (CA) or shared AC (Figure 7). In this configuration, both of the
DRAM devices receive the same command and address, but like the serial implementation, the chip selects
determine which DRAM device is listening on any particular clock cycle, so each device may be in a different
state. The DRAM commands are arbitrated between the two channels at the SoC, but each DRAM can
transmit data independently.
Figure 7 diagram: shared command/address (CA) connection. Both DRAM devices share the SoC's command/address bus, and each has its own data bus and chip select.
Figure 8 diagram: comparison of the four connection options.

              Parallel   Series   Multi-channel   Shared CA
CA pins       6          6        12              6
DQ pins       32         16       32              32
CS pins       1          1        2               2
Banks         8          8        16              16
Fetch (bytes) 64         32       32              32/64
The series connection is also less suited for PoP implementation. It does save some DQ pins, but because the
DRAM devices share a data bus it offers half the bandwidth of the other solutions, which makes this approach
less attractive.
While a shared CA implementation is better suited to DDR systems, a multi-channel connection can help
design teams to get the best out of LPDDR4.
Design teams that want to get the most bandwidth out of their LPDDR4 device, especially if using small
data transfers, may consider a true four-channel implementation (Figure 9). Compared to the other
implementations, it has the highest number of banks and the smallest fetch size. It requires 24 CA pins on the
SoC and may be implemented with four separate memory controllers and PHYs on the SoC.
Figure 9 diagram: 4-channel implementation. The SoC connects to four channels of LPDDR4, each with its own CA bus, DQ (data) bus, and chip select. CA pins: 24, DQ pins: 64, CS pins: 4, Banks: 32, Fetch: 32 bytes.
The two-channel and parallel implementation offers a good compromise between a fully parallel and a four-
channel implementation. It is especially useful for LPDDR3-LPDDR4 combinations (Figure 10). Most early
examples of commercial SoCs using LPDDR4 have used this configuration.
Figure 10 diagram: 2-channel and parallel implementation. Each of two SoC channels drives a pair of LPDDR4 channels in parallel. CA pins: 12, DQ pins: 64, CS pins: 2, Banks: 16, Fetch: 64 bytes.
The fully parallel implementation uses only six CA pins and has the maximum number of DQs (64). However, there are only eight banks available in this system, and the minimum fetch size is 128 bytes, which can limit its usefulness for some applications. It may also be necessary to duplicate the pins of the CA bus for bus loading or chip-level timing closure reasons.
Figure 11 shows an example of a 2-die, 4-channel LPDDR4 multi-channel implementation (left) and a 4-die implementation (right). In the 4-die case, the LPDDR4 package contains four dies and each physical channel has two ranks of memory connected to it. This configuration requires the design team to extend the connection in a serial direction on each of the four channels on the package. Unfortunately, a 4-die package doesn't provide 8-channel connectivity; there are only four channels of package balls on the 4-die package.
Figure 11. Two-die and four-die implementations. Four-die LPDDR4 multichannel and serial implementation adds
DRAM capacity. This solution is compatible with two-die packages.
Accessing each channel independently of the others means that every bank on every channel can have
a different row activated. For small transfers like video and network packets that are spread randomly
throughout the memory, having more banks available will avoid some of the inherent memory timing
parameters that could limit performance. Spreading transactions across as many banks as possible will
improve the performance because it reduces the probability of hitting some of the memory timing parameters.
Having more banks in the system, and therefore more time between successive commands to any one bank, can improve performance by reducing the probability of delays due to the tRRD, tFAW, and tRC memory timing parameters:
- tRC, the row cycle time of the memory: the minimum time between activate commands to different rows in the same bank.
- tRRD, the row-to-row delay: the minimum time between activate commands to rows in different banks.
- tFAW, the four activate window: no more than four activate commands can be issued within a rolling tFAW window. The LPDDR4 standard sets tFAW to four times tRRD, so for LPDDR4 these are effectively the same constraint, although other memory types may use a different relationship between tRRD and tFAW.
The tRC timing causes problems particularly in faster devices. At the highest LPDDR4 speeds, tRC is over 100 clock cycles: once a row in a bank has been activated, no other row in that bank can be accessed for at least 100 clock cycles, which is a long time to lock that bank out from being used again. Having more banks available lowers the probability of having to access a new row in a bank that is currently locked out because of the tRC time.
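As a rough illustration of how these parameters translate into clock cycles at LPDDR4-3200, the short Python sketch below converts nanosecond timings to cycles at the 1600 MHz clock. The tRC and tRRD values used (65 ns and 10 ns) are assumptions chosen only to match the ranges quoted in this paper; the real values come from the DRAM datasheet.

import math

def cycles(t_ns, clk_mhz):
    """Convert a timing parameter in nanoseconds to DRAM clock cycles (rounded up)."""
    return math.ceil(t_ns * clk_mhz / 1000.0)

CLK_MHZ = 1600.0   # LPDDR4-3200: 3200 MT/s data rate, 1600 MHz clock

T_RC_NS = 65.0     # assumed row cycle time (illustrative only)
T_RRD_NS = 10.0    # assumed row-to-row activate delay (illustrative only)

t_rc = cycles(T_RC_NS, CLK_MHZ)     # ~104 cycles: how long the bank is locked out
t_rrd = cycles(T_RRD_NS, CLK_MHZ)   # ~16 cycles between activates to different banks
t_faw = 4 * t_rrd                   # LPDDR4 sets tFAW to four times tRRD

print(f"tRC = {t_rc} cycles, tRRD = {t_rrd} cycles, tFAW = {t_faw} cycles")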
The tRRD and tFAW limit the ability to change banks frequently, something that the design team may want to
do to avoid the tRC timing parameter.
Figure 12 shows an example device with a four activate window tFAW of four times the row-to-row delay tRRD.
The tRRD time may be up to 16 clock cycles at LPDDR4-3200.
Figure 13 shows a continuous sequence of transactions executing on the parallel implementation. The annotation AC/BA0 is shorthand for an activate command to bank 0. The command next to it, RD/BA4, shows a read command to bank 4 (assume that bank 4 was activated some time earlier). Each command bubble represents four clock cycles because of the four-phase addressing of the LPDDR4 device. In practice, the sequence would be extended as activate, read, activate, read, activate, read, activate, read. The returning data completely occupies the DQ bus. The parallel access pattern achieves 100% memory bandwidth utilization, but only when accessing the device at 800MHz (DDR1600).
Figure 13. Parallel implementation using continuous 64-byte reads to rotating addresses at BL 16 and 800MHz/DDR1600. The SoC drives the DRAM CA bus with the sequence AC/BA0, RD/BA4, AC/BA1, RD/BA5, AC/BA2, RD/BA6, AC/BA3, RD/BA7. One bubble represents multiple clock cycles. AC = Activate Command, RD = Read Command.
Figure 14 shows the two-channel implementation executing the same sequence using each of the command
address channels independently. Each command address bus has a slightly different pattern on it: activate,
read, no-op, read, activate, read, no-op, read. The space in the command channel could be used for
something else like a commanded pre-charge or a per-bank refresh, or simply left as an idle clock cycle. The
data bus is fully occupied.
Figure 14. Two-channel implementation using the command address channels independently, with continuous 64-byte reads to rotating addresses at BL 16 and 800MHz/DDR1600. CA_a carries AC/BA0, RD/BA4, RD/BA4, AC/BA2, RD/BA6, RD/BA6, AC/BA4 while CA_b carries AC/BA1, RD/BA5, RD/BA5, AC/BA3, RD/BA7, RD/BA7, AC/BA5. One bubble represents multiple clock cycles. AC = Activate Command, RD = Read Command.
When the frequency is doubled to 1600 MHz (DDR 3200 operation, Figure 15), the tRRD time limits the SoC's ability to send activate commands to the LPDDR4 device in the upper example of a parallel implementation. The sequence becomes: activate, read, no-op, no-op, activate, read, no-op, no-op. The no-op cycles could be used for pre-charges or refreshes, but the memory cannot be activated fast enough to issue sequential 64-byte transactions to a new bank with each transaction.
Figure 15 diagram: at DDR 3200, the parallel implementation (top) can no longer keep the data bus fully occupied because tRRD(min) limits activates and creates data gaps, while the two-channel implementation (bottom) still works, with CA_a and CA_b each carrying their own activate/read sequences. One bubble represents multiple clock cycles. AC = Activate Command, RD = Read Command.
Without another 64-byte transaction to the same page of memory, the SoC must wait until tRRD has elapsed before it can activate a new page in memory. This mode of operation limits the maximum performance of the device to 50% of the available bandwidth if the transactions are not long enough to allow two reads to each bank before moving to a new bank.
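A back-of-the-envelope way to see this effect: on the parallel configuration's 32-bit lockstep data bus, a 64-byte BL16 read occupies the DQ bus for 8 clock cycles, so when every access needs a fresh activate, the achievable utilization is roughly those 8 data cycles divided by the tRRD spacing between activates. The sketch below uses that simplification (ignoring other timing parameters) with an assumed tRRD of about 10 ns, i.e. roughly 8 cycles at 800 MHz and 16 cycles at 1600 MHz; it is a model of the argument above, not a datasheet calculation.

def utilization(data_cycles_per_read, trrd_cycles, reads_per_activate=1):
    """Approximate DQ-bus utilization when activates are spaced at least tRRD apart."""
    data_cycles = data_cycles_per_read * reads_per_activate
    return min(1.0, data_cycles / max(data_cycles, trrd_cycles))

# 64-byte read on a 32-bit bus at BL16 = 16 beats = 8 clock cycles of data.
print(utilization(8, trrd_cycles=8))                         # 1.0 -> 100% at 800 MHz (DDR 1600)
print(utilization(8, trrd_cycles=16))                        # 0.5 -> 50% at 1600 MHz (DDR 3200)
print(utilization(8, trrd_cycles=16, reads_per_activate=2))  # 1.0 -> two reads per row restores full bandwidth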
By contrast, the two-channel implementation at the bottom of Figure 15 allows each channel to satisfy tRRD
because of the “activate, read, no-op, read” pattern, even with shorter accesses. The bus bandwidth can run
at full capacity, even at the DDR 3200 data rate.
The best approach is to match the fetch size to the SoC, both in terms of the size of transfers to be transmitted
over the bus and the total bandwidth targeted from the device.
A preferred size for the cache lines of many SoCs and CPUs is 32 bytes, while some larger 64-bit CPUs use 64-byte cache lines. Video and networking traffic often requires short transactions of 32 bytes or less. Ideally, the multichannel architecture should match the system fetch size, so the memory delivers data in units the system can actually use.
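The minimum fetch size follows directly from the data bus width and the burst length: fetch bytes = DQ pins x burst length / 8. Below is a minimal sketch of that arithmetic for the configurations discussed in this paper, assuming LPDDR4's minimum burst length of 16 throughout.

def min_fetch_bytes(dq_pins, burst_length=16):
    """Minimum fetch in bytes: each burst beat transfers dq_pins bits."""
    return dq_pins * burst_length // 8

print(min_fetch_bytes(16))   # 32 bytes  -- one x16 LPDDR4 channel (multi-channel configuration)
print(min_fetch_bytes(32))   # 64 bytes  -- two channels in lockstep (parallel pair)
print(min_fetch_bytes(64))   # 128 bytes -- fully parallel 64-bit implementation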
The parallel implementation shown in Figure 16, with a minimum burst length of 16 for LPDDR4, and 64 DQ
pins in parallel, produces a 128-byte fetch, which is really only suitable for long data transfers to contiguous
addresses. The parallel implementation can work for accesses in units of 128 bytes at a time, but if the
accesses are smaller than 128 bytes and to random addresses, the parallel implementation will be inefficient.
Figure 16 diagram: fully parallel implementation. A single CA bus and chip select from the SoC drive all four LPDDR4 channels in lockstep. CA pins: 6, DQ pins: 64, CS pins: 1, Banks: 8, Fetch: 128 bytes.
Another issue in creating a 64-bit parallel implementation is the physical connection between the SoC and the DRAM dies. The ball-out of the LPDDR4 PoP package is arranged with a channel in each corner, so there are four channels on the package to accommodate two or four dies. Ideally, the SoC memory controller and PHY placement should match that LPDDR4 ballout, allowing channel A to map to channel A, channel B to B, C to C, and D to D, and keeping the routes within the LPDDR4 PoP package as short as possible without crossovers. This package layout makes a parallel 4-channel LPDDR4 interface challenging to implement physically.
The user should also take care that, if the transactions are to different pages in memory, tRRD may limit the effective bandwidth at higher frequencies, as explained in the previous section.
For these reasons, the multichannel implementations of LPDDR4 are often preferred over the four-channel
parallel implementation.
Command/address bus
LPDDR4 has a very narrow command/address bus (only six bits wide per channel compared to 20 or
more bits for DDR4) so the overhead of using multiple command/address channels is less than with other
technologies. Using all four of the command/address buses independently on the LPDDR4 package offers the
most flexibility and potentially the highest performance for the overall system.
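To put a number on that overhead, the snippet below compares the command/address pin cost of running multiple independent channels, using only the widths quoted above: 6 CA bits per LPDDR4 channel versus a 20-bit lower bound for DDR4 (an illustrative figure, not a specific device's pinout).

LPDDR4_CA_WIDTH = 6    # CA bits per LPDDR4 channel (from the text above)
DDR4_CA_WIDTH = 20     # "20 or more" bits for DDR4; lower bound, for illustration only

for channels in (1, 2, 4):
    print(f"{channels} channel(s): LPDDR4 uses {channels * LPDDR4_CA_WIDTH} CA pins, "
          f"DDR4 would use at least {channels * DDR4_CA_WIDTH}")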
One way to partition the SoC (Figure 17) is to give each CPU (or group of CPUs) access to its own independent channel. This architecture has some advantages: the CPUs don't block each other and the SoC buses are shorter. Channels that are not being used can be powered down.
Figure 17 diagram: dedicated channels. Each CPU connects to its own channel of the LPDDR4 package ballout.
However, this architecture is also inflexible. If channel A needs to use some of the data that is in channel C, it cannot use the memory as a mailbox; it must transfer the data through the SoC. It also makes it harder for the CPUs to work on shared tasks for load balancing.
Another approach is to have every CPU share every memory (Figure 18). This allows more flexible partitioning. It tends to work better for heterogeneous processing and lets the CPUs work on shared data, but it requires much more wiring, longer wires on the chip, and possibly a sophisticated on-chip interconnect. This more accurately represents how real chips work, especially in a heterogeneous processing architecture with different sizes of CPUs, GPUs, and other processing elements.
Figure 18. Share the channels — all CPUs share all memory
One option is a separate (non-interleaved) memory map, in which each channel occupies its own contiguous region of the logical address space (Figure 19).
Figure 19. Logical to physical address mapping using a separate memory map: Channel A occupies logical addresses 0 to X MByte and Channel B occupies X MByte to Y MByte, each as its own contiguous region.
For example, Channel A might hold the operating system and always-on, always-connected functions.
Channel B may contain application data, a video buffer, and similar data. These two different address spaces
are independent and separate. This helps power control because, for example, channel B can be powered
down when not in use.
Another approach is to interleave the memory map by having small consecutive logical address regions access different channels of the memory (Figure 20): for example, bytes 0 to 63 in channel A, bytes 64 to 127 in channel B, and so on back and forth up through the memory, so the logical space is interleaved across the whole memory. This approach helps load balancing across the two channels and can enable good performance. However, because both channels are always required, it is not possible to shut down either channel to save power.
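As a minimal sketch of this kind of fine-grained interleaving, the function below maps a logical address to a channel and a channel-local offset, assuming two channels and the 64-byte interleave granularity used in the example above; real SoCs choose the interleave boundary (and often hash several address bits) to suit their traffic.

INTERLEAVE_BYTES = 64   # bytes 0-63 -> channel A, 64-127 -> channel B, and so on
NUM_CHANNELS = 2

def map_interleaved(logical_addr):
    """Return (channel, offset within that channel) for a logical address."""
    block = logical_addr // INTERLEAVE_BYTES
    channel = "AB"[block % NUM_CHANNELS]
    # Each channel receives every NUM_CHANNELS-th block, packed contiguously.
    offset = (block // NUM_CHANNELS) * INTERLEAVE_BYTES + logical_addr % INTERLEAVE_BYTES
    return channel, offset

print(map_interleaved(0))     # ('A', 0)
print(map_interleaved(64))    # ('B', 0)
print(map_interleaved(130))   # ('A', 66)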
Figure 20. Logical to physical address mapping using an interleaved memory map: consecutive logical address blocks from 0 to Y MByte alternate between Channel A and Channel B.
A further implementation option is to use a hybrid memory map (Figure 21), where different regions of the logical address space provide either non-interleaved or interleaved access. This hybrid approach could include a region of memory that is always on and always connected, a region of memory that is interleaved between the two channels to get the highest performance, and an upper area of memory for programs associated with the high-bandwidth data.
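A minimal sketch of how such a hybrid map might be decoded, assuming (purely for illustration) a 256 MByte always-on region that lives entirely in channel A, with a 64-byte interleaved region above it; the region boundary and granularity are assumptions for this example, not values from the LPDDR4 specification.

MB = 1 << 20
ALWAYS_ON_TOP = 256 * MB    # assumed: addresses below this live only in channel A
INTERLEAVE_BYTES = 64

def map_hybrid(logical_addr):
    """Pick the channel for a logical address in a hybrid (part linear, part interleaved) map."""
    if logical_addr < ALWAYS_ON_TOP:
        # Non-interleaved region: channel B can stay powered down while only this region is in use.
        return "A"
    # Interleaved region: consecutive 64-byte blocks alternate between the two channels.
    block = (logical_addr - ALWAYS_ON_TOP) // INTERLEAVE_BYTES
    return "AB"[block % 2]

print(map_hybrid(4096))                 # 'A' -- always-on region
print(map_hybrid(ALWAYS_ON_TOP))        # 'A' -- first interleaved block
print(map_hybrid(ALWAYS_ON_TOP + 64))   # 'B' -- next interleaved block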
Figure 21 diagram: hybrid memory map. Logical addresses from 0 to X MByte map to one channel and hold the operating system and "always on, always connected" functions; addresses from X MByte to Y MByte are interleaved across Channel A and Channel B; the region above Y MByte holds memory for programs associated with high bandwidth.
The Synopsys DDR memory controllers, including the uMCTL2 memory controller, offer a multiport or single
port connection into the SoC. The buses available include AXI3, AXI4, or AHB from 1-16 ports. A single-port
protocol controller, uPCTL2, is available for systems that schedule memory traffic outside the controller.
uMCTL2 offers low latency, high bandwidth, and strong QoS, including QoS-driven arbitration and a high-performance scheduling algorithm. The low-power functions within the memory controller are automated, allowing the design team to focus on the system design. It supports multiple memory types: DDR2, DDR3, and DDR4, as well as LPDDR2, LPDDR3, and LPDDR4. For automotive and other high-reliability systems,
the IP offers a range of Reliability, Availability, Serviceability (RAS) features.
The uMCTL2 memory controller for LPDDR4 offers a CAM-based scheduling architecture optimized for 2667-4266 Mbps data rates, and multiple address maps to allow flexibility in systems supporting
different use modes and multiple memory types. It has automatic power-down and self-refresh with fast
frequency switching, and supports automatic temperature monitoring and refresh rate adjustment.
Conclusion
The LPDDR4 multichannel specification provides new opportunities for novel system designs, especially
within multichannel architectures that can improve system performance. Design teams need to weigh performance, power, and complexity when deciding how to deploy the LPDDR4 architecture.
Synopsys, Inc. • 690 East Middlefield Road • Mountain View, CA 94043 • www.synopsys.com
©2016 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is
available at http://www.synopsys.com/copyright.html . All other names mentioned herein are trademarks or registered trademarks of their respective owners.
01/27/16.CS6789_Optimizing LPDDR4_WP_kw.