DMC Performance Optimization For Mobile Memory Subsystem
Ashwin Matta
Senior Product Marketing Manager, Systems and Software Group, ARM
Introduction
Contemporary mobile platform SoCs impose intense traffic management demands on the memory subsystem.
An intelligent memory controller design comprehends the fundamental memory streaming requirements of a
mobile SoC and provides the necessary capabilities for optimal Quality of Service (QoS) while ensuring best
use of available memory bandwidth. This paper describes some of the performance challenges for memory
subsystems in an ARM®-based mobile SoC*. Memory controller features necessary for optimizing performance
of mobile traffic are described along with their effects, using benchmarking data. Moreover, the combined effect
of optimizing memory subsystem performance by closely integrating both the memory controller and the
interconnect fabric is demonstrated.
(*In the context of this paper, the word ‘mobile’ somewhat loosely refers to all types of systems that deploy
LPDDR4/3 or DDR4/3 memories, ranging from smartphones, tablets and laptops/clamshells to systems in
consumer and automotive devices.)
As a side note, having a fully-coherent GPU enables Shared Virtual Memory between heterogeneous processing
units – the CPU and the GPU – allowing memory pointers to be passed between these units for true
heterogeneous processing. This in itself improves overall performance by enabling these powerful compute
blocks to operate on the same block of memory down to the byte level. For further information on this topic,
refer to the whitepaper from Tirias Research: http://www.tiriasresearch.com/downloads/arm-enables-heterogeneous-computing-the-corelink-cci-550-and-dmc-500/
GPUs and other coherent IO agents such as co-processors, camera streams, etc. share data with CPUs over the
CoreLink CCI-550. In order to have the same view of physical memory as the CPUs, accesses by these agents
are translated by a Memory Management Unit (MMU), which also optionally provides a Stage2 translation for
supporting virtualization. The Video and Display agents communicate data over a non-coherent hybrid
interconnect, CoreLink NIC-450, but are also subject to the same MMU translations seen by the IO coherent
agents. MMU translation tables are stored in memory and cached locally to minimize the overhead of fetching
the translations from memory for each agent communicating with memory.
A complex system such as the one shown in Figure 1 would typically have a memory subsystem consisting of
four to six channels of memory controllers, CoreLink DMC-500, connected to LPDDR4/3 memories via an
external DFI-compatible PHY. At the maximum supported data transfer rate of LPDDR4-4266 Mbps, the
maximum bandwidth to memory would thus be 51.2 GB/s using 6 channels of x16 memory. This memory data
bandwidth is capable of supporting data-hungry devices such as advanced tablets and clamshells while
continuing to support the extremely low power budgets that ARM is known for.
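As a back-of-the-envelope check on that figure, the peak bandwidth follows directly from the per-pin data rate, the channel width and the channel count. The short Python calculation below is purely illustrative:

    channels = 6            # LPDDR4 channels in the example system
    bits_per_channel = 16   # each channel is x16
    data_rate_mbps = 4266   # LPDDR4-4266: megabits per second per data pin

    peak_gb_per_s = channels * bits_per_channel * data_rate_mbps / 8 / 1000
    print(round(peak_gb_per_s, 1))   # ~51.2 GB/s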
Memory traffic in a mobile SoC can be grouped into three broad categories:
1. Low Latency (LL) – the dominant characteristics of memory traffic coming from the CPUs are
random, small-size accesses (typically cache line fills) that are sporadic in nature. The key requirement
for CPU accesses is low latency so as to provide maximum thread execution performance.
2. Real Time (RT) – typically these agents are Video and Display units which need to access data from
memory at a guaranteed rate so as to ensure the real time performance of the mobile device. Any
interruptions in data access from memory could result in noticeable flickering in the display thereby
rendering the device defective.
3. High Bandwidth (HB) – the GPU and external co-processors are the best examples of agents that require
high bandwidth access to memory. In the absence of LL or RT traffic, HB traffic can easily consume
the entire available bandwidth to memory.
Note that it is not necessary for CPU traffic to always be low latency. Multi-modal latency requirements for
CPU traffic are plausible but don’t necessarily change the discussions presented in this paper.
By classifying key memory activity in a mobile SoC into the LL, RT and HB categories, the primary mobile
memory subsystem optimization problem can be summed up as follows:
“The memory controller needs to provide the lowest latency for LL agents while ensuring that RT agents get
guaranteed rate of access to memory to meet real-time performance requirements – and do all this while
ensuring that all remaining bandwidth after servicing LL and RT agents is available to the HB agents for 100%
total memory bandwidth utilization.”
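To make the scheduling sketches later in this paper concrete, the three traffic classes can be modelled as simple records, one per agent. The Python snippet below is only an illustration; the field names and QoS values are hypothetical, not recommended settings:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Agent:
        name: str
        traffic_class: str                       # "LL", "RT" or "HB"
        qos: int                                 # base priority used for QoS arbitration
        latency_deadline: Optional[int] = None   # in cycles; only RT agents carry one

    # Illustrative values only; real QoS programming is system-specific.
    agents = [
        Agent("CPU cluster", "LL", qos=12),
        Agent("Display", "RT", qos=8, latency_deadline=500),
        Agent("GPU", "HB", qos=4),
    ]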
Prior to a deep dive into various techniques for achieving fine grain control over memory access requirements
for different agents, it is necessary to benchmark the bandwidth utilization of the memory controller across a
varying number of banks and sizes of accessed data. This is to ensure that the scheduler is robust enough for any
system traffic scenario.
Figure 2 plots this benchmarking data. The red grid is the theoretical maximum utilization achievable for any
memory controller assuming ideal operation with memory access timing restrictions imposed by standard
JEDEC timing parameters. The green heat map is the measured bandwidth utilization of CoreLink DMC-500.
The measured utilization tracks the theoretical maximum very closely (>90% match). There are no local ‘hot
spots’ indicating that the scheduler for this memory controller is flexible enough for handling the wide range of
operating conditions with minimal bandwidth loss.
Building on this analysis, the various memory performance optimization techniques are described below:
1. QoS Arbitration – One of the simplest methods to provide Quality of Service for various agents trying
to access memory is through priority arbitration whereby simultaneous incoming memory transactions
get serviced based on assigned priorities for the respective agents. Figure 3 shows the effects of QoS
arbitration for 2 types of systems. In non-congested systems, latency response to transactions at various
priority levels is nearly identical through the range of transactions. Conversely, in congested systems,
latency response is stretched out (longer latencies) for transactions from RT and HB agents whereas
those from LL agents get a faster response.
Figure 3: Effect of QoS Arbitration on Congested and Non-Congested Systems
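As an illustration of the basic idea (and not a description of the CoreLink DMC-500 implementation), a minimal QoS arbiter can be sketched as follows, assuming a toy transaction representation in which a higher QoS value means higher priority:

    def qos_arbitrate(pending):
        """Service the highest-QoS pending transaction; earlier arrival breaks ties.

        'pending' is a list of (arrival_order, qos, agent) tuples -- a toy model,
        not the DMC-500's internal transaction representation.
        """
        if not pending:
            return None
        return max(pending, key=lambda t: (t[1], -t[0]))

    # The CPU (LL) request at QoS 12 is serviced ahead of the Display (RT) request
    # at QoS 8 and the GPU (HB) request at QoS 4.
    pending = [(0, 4, "GPU"), (1, 12, "CPU"), (2, 8, "Display")]
    print(qos_arbitrate(pending))   # (1, 12, 'CPU')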
2. Priority Escalation – A system that remains congested for long periods of time runs the risk of starving
execution of low priority transactions in favor of high priority ones. A well-designed QoS scheme must
therefore provide a mechanism for priority escalation based on aging counters or programmable
overrides. Figure 4 shows the effect of priority escalation in a congested system. Latency response for
QoS levels qv0, qv1, etc., which are lower priority transactions, improves with escalation. Thus a system
can be designed to provide an appropriate latency response across all transfers at various QoS levels.
Figure 4: Effect of Priority Escalation in a Congested System
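A minimal sketch of age-based escalation is shown below; the escalation period and maximum QoS value are hypothetical parameters, not CoreLink DMC-500 register settings:

    ESCALATION_PERIOD = 64   # hypothetical: cycles waited before each priority bump
    MAX_QOS = 15

    def effective_qos(base_qos, cycles_waited):
        """Raise the effective priority with age so low-priority traffic is not starved."""
        return min(base_qos + cycles_waited // ESCALATION_PERIOD, MAX_QOS)

    def arbitrate_with_aging(pending, now):
        """'pending' is a list of (arrival_cycle, base_qos, agent); highest effective QoS wins."""
        if not pending:
            return None
        return max(pending, key=lambda t: (effective_qos(t[1], now - t[0]), -t[0]))

    # A qv0 transaction that has waited 400 cycles now outranks a freshly issued qv4 one.
    print(effective_qos(0, 400), effective_qos(4, 0))   # 6 4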
3. Latency Deadline Scheduling – Whereas both QoS arbitration and priority escalation are good levers
for managing latency responses for LL, RT and HB agents, they lack the fundamental requirement of
timeout or deadline scheduling. This requirement is particularly necessary for RT agents where
guaranteed response within a maximum number of cycles is necessary for proper operation of the
mobile device, regardless of transaction activity occurring in the memory controller. Figure 5 shows
the effect of deadline scheduling on latency response of RT agents. In this scenario, a latency deadline
of ‘D’ cycles has been programmed for the RT traffic. For latencies well below the deadline threshold,
the LL transactions get higher priority and have a steeper latency response for nearly 95% of the LL
transactions. Closer to the latency deadline, the latency response of RT agents picks up with 100% of
the RT transactions completing before the deadline – ahead of the LL transactions.
Figure 5: Effect of Latency Deadline Scheduling for RT Traffic
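This behaviour can be sketched as a deadline-aware arbiter, assuming each RT transaction carries a programmed deadline and becomes urgent within a hypothetical window of cycles before the deadline expires:

    URGENCY_WINDOW = 32   # hypothetical: cycles before the deadline at which RT traffic becomes urgent

    def pick_next(pending, now):
        """'pending' holds dicts with 'agent', 'qos' and 'deadline' (None for non-RT traffic).

        An RT transaction whose deadline is within URGENCY_WINDOW cycles overrides
        normal QoS arbitration; otherwise the highest QoS value wins.
        """
        urgent = [t for t in pending
                  if t["deadline"] is not None and t["deadline"] - now <= URGENCY_WINDOW]
        if urgent:
            return min(urgent, key=lambda t: t["deadline"])   # earliest deadline first
        return max(pending, key=lambda t: t["qos"], default=None)

    # Far from its deadline the Display (RT) request yields to the CPU (LL) request;
    # close to the deadline it overtakes the CPU and completes first.
    reqs = [{"agent": "CPU", "qos": 12, "deadline": None},
            {"agent": "Display", "qos": 8, "deadline": 500}]
    print(pick_next(reqs, now=100)["agent"])   # CPU
    print(pick_next(reqs, now=480)["agent"])   # Display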
4. Deadline Arbitration with Increasing Traffic – The techniques described above are now applied to an
illustrative scenario to test the extreme limits. The traffic from LL and RT agents is continuously
increased from 0 to 100% injection rates, while the HB agents are also sending transactions through to
the memory controller. Figure 6 shows the combined effect of all these controlling mechanisms
operating simultaneously. At very low injection rates, LL transactions get their guaranteed bandwidth
(~10% of total bandwidth) due to their higher priority. HB traffic utilizes the bulk of the remaining
bandwidth as RT traffic is still small. With increasing injection rates, LL transactions continue to get as
much bandwidth as requested but HB transactions start stalling in favor of RT transactions which need
a guaranteed response time due to deadline scheduling. This trend continues with increasing rate of
injection of RT traffic. At around 90% injection rate, the LL transactions can no longer continue
getting their guaranteed bandwidth despite having the highest priority because the rate of RT injection
is so high that deadline scheduling arbitrates over priority. Bandwidth utilization by HB transactions
drops to 0% as they have no mechanism to override LL and RT transactions. At the tail end, near 100%
injection rates, essentially all the bandwidth is allocated to RT transactions. A minuscule amount is left
for the LL transactions on account of the memory controller scheduler finding opportunity holes
between RT transfers to sneak in an LL transaction. The total delivered system bandwidth is
consistently high throughout. Thus agent QoS demands are being met without compromising on overall
system bandwidth.
Figure 6: Deadline Arbitration with increasing LL and RT Traffic Rates
Under normal operating conditions, LL/RT injection rates in a mobile system do not exceed 60%.
Under these conditions, LL traffic is guaranteed its ~10% bandwidth whereas the rest is distributed
between RT and HB traffic.
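Combining the mechanisms above, a toy arbiter might order transactions by deadline urgency first, age-escalated QoS second and arrival order last. The sketch below only illustrates that layering under the same hypothetical parameters used in the earlier snippets, and is not the DMC-500 scheduling algorithm:

    ESCALATION_PERIOD = 64   # hypothetical cycles per priority bump (as in the escalation sketch)
    URGENCY_WINDOW = 32      # hypothetical cycles before an RT deadline becomes urgent
    MAX_QOS = 15

    def combined_key(txn, now):
        """Ordering for a toy arbiter that layers the mechanisms above:
        deadline urgency first, then age-escalated QoS, then earlier arrival."""
        urgent = txn["deadline"] is not None and txn["deadline"] - now <= URGENCY_WINDOW
        aged_qos = min(txn["qos"] + (now - txn["arrival"]) // ESCALATION_PERIOD, MAX_QOS)
        return (urgent, aged_qos, -txn["arrival"])

    def arbitrate_combined(pending, now):
        return max(pending, key=lambda t: combined_key(t, now), default=None)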
5. Bus Turnaround Time – An important control knob associated with the memory controller scheduler is
the number of cycles after which the DRAM bus is released upon completing a series of transactions
from a single agent. Once released, other agents get access to the bus and can start piping through their
transactions. Context switching of the bus from one agent to another is not necessarily efficient for
bandwidth utilization. Figure 7 shows the two ends of the spectrum from fast bus release (low bus
turnaround time) to slow bus release (high bus turnaround time). Fast release ensures lower latency for
other agent transfers but potentially wastes DRAM bandwidth. On the other hand, slow release of the
bus delivers high bandwidth – assuming the agents are always requesting – but increases the latency for
other agents. The CoreLink DMC-500 scheduler has been designed to manage this bus turnaround
effect efficiently. It provides users with programmable parameters and guidelines on when to do this in
the most optimized manner, taking into account the types of agents requesting, the transaction type,
coherency, rank, bank and priority.
Figure 7: Effect of Bus Turnaround Time on Bandwidth Utilization
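A minimal sketch of this trade-off is shown below, assuming a single hypothetical hold-window parameter that governs how long the bus stays with the current agent while other agents are waiting:

    BUS_HOLD_CYCLES = 8   # hypothetical programmable turnaround window

    def keep_bus(current_agent, pending, cycles_others_waiting):
        """Decide whether the DRAM bus stays with the agent that currently owns it.

        A small hold window (fast release) lowers latency for other agents but loses
        bandwidth to turnarounds; a large window (slow release) keeps the bus streaming
        but increases latency for everyone else.
        """
        current_has_more = any(t["agent"] == current_agent for t in pending)
        others_waiting = any(t["agent"] != current_agent for t in pending)
        if not current_has_more:
            return False                     # nothing left to stream, release the bus
        if not others_waiting:
            return True                      # nobody else is requesting, keep streaming
        return cycles_others_waiting < BUS_HOLD_CYCLES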
6. Queue Fill Threshold – The memory controller has a large internal queue for storing incoming
transactions. During operation, it may so happen that all the queue entries are filled with low priority
transactions creating back-pressure on subsequent incoming high priority transactions. In order to
ensure that these high priority transactions are always serviced, it is necessary to reserve some queue
entries for high priority accesses only. The degree of ‘fullness’ of the queue as measured in groups of
1/16th size of the queue is referred to as the “Fill Threshold”. So a Fill Threshold of 16 implies there are
no reserved queue entries for subsequent high priority transactions whereas a threshold of 6 implies
62.5% of the queue entries are reserved for high priority transactions. Reserving queue entries reduces
the number of opportunities that the scheduler has to optimize the bandwidth utilization and hence
results in slightly lower utilization. Figure 8 shows simulation results achieved for a Fill Threshold
of 6. This threshold value was chosen as it only resulted in a 4% loss of bandwidth utilization as seen in
the left graph in Figure 8. Lower values of Fill Threshold (implying higher number of reserved queue
entries) resulted in a substantial bandwidth loss and would hence not be advisable. The right graph in
Figure 8 shows the effect of selected Fill Threshold on the average and maximum latencies for LL and
RT transactions. With Fill Threshold of 6 there is a significant reduction in these latencies at only 4%
additional loss of bandwidth utilization. This is the type of programming trade-off users of CoreLink
DMC-500 can make to optimize the latency response for their LL and RT traffic.
Figure 8: Effect of Queue Fill Threshold on Overall Bandwidth, Average and Max Latencies (left: BW Utilization Against Queue Fill Threshold, showing a ~4% bandwidth cost for Fill Threshold = 6; right: Average and Max Latency Against Queue Fill Threshold, showing a significant reduction in LL and RT average latency with Fill Threshold = 6)
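The arithmetic behind the Fill Threshold, and the admission rule it implies, can be sketched as follows (the function and parameter names are illustrative, not DMC-500 programming interfaces):

    QUEUE_GROUPS = 16   # queue fullness is measured in sixteenths of the queue

    def reserved_fraction(fill_threshold):
        """Fraction of queue entries reserved for high-priority traffic.

        Fill Threshold 16 -> nothing reserved; Fill Threshold 6 -> (16 - 6) / 16 = 62.5%.
        """
        return (QUEUE_GROUPS - fill_threshold) / QUEUE_GROUPS

    def admit(is_high_priority, queue_occupancy, queue_size, fill_threshold):
        """Low-priority traffic is back-pressured once the queue is fuller than the
        programmed Fill Threshold, keeping the remaining entries free for
        high-priority transactions."""
        if is_high_priority:
            return queue_occupancy < queue_size
        return queue_occupancy < queue_size * fill_threshold / QUEUE_GROUPS

    print(reserved_fraction(6))   # 0.625, matching the 62.5% figure quoted above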
7. QoSACCEPT Signaling – All the QoS mechanisms described above operate at the micro-level inside
the DMC-500. Superimposed on these mechanisms are coarser, macro-level QoS mechanisms
that operate in the CoreLink CCI-550 interconnect. An in-depth description of these mechanisms is
outside the scope of this paper. Our performance analyses of CoreLink CCI-550 and DMC-500
subsystems showed that if the memory controller gives the interconnect visibility of its queue
fullness, the interconnect can make more informed decisions at the macro-level, thereby
improving system performance significantly. This visibility is provided to the interconnect using
QoSACCEPT signaling which indicates the QoS value above which the memory controller is willing
to accept transactions based on its own internal Queue Fill Threshold and transactions that are in flight.
The interconnect then attempts to route to the memory controller only those transactions whose QoS values
the controller is willing to accept.
Figure 9 shows the effect of enabling QoSACCEPT in a system with CPU (LL) and GPU (HB) traffic. The
memory controller is programmed with a QoS Threshold value of 12, implying that it will back-
pressure transactions with QoS value less than 12 if its internal queue crosses a certain Fill Threshold.
The graph measures the latency of the transaction within the interconnect before it gets accepted by the
memory controller. Consider 4 different cases in this performance experiment:
Case 1 – QoSACCEPT mechanism and GPU traffic are turned off. Since the only traffic flowing
through the interconnect is the CPU traffic, there’s no back pressure from the memory controller.
Consequently the latency of transactions within the interconnect is small.
Case 2 – Next, GPU traffic is turned ON at the same QoS value as CPU traffic. Both agents are
competing for memory bandwidth and there is significant back-pressure from the memory controller
resulting in very long CPU latency within the interconnect.
Case 3 – The CPU QoS value is then increased from 11 to 15, giving it higher priority over GPU
transactions. This reduces the latency of CPU traffic within the interconnect.
Case 4 – Finally, keeping the same QoS values for GPU and CPU traffic, QoSACCEPT mechanism is
turned ON. When its internal queues start filling up, the memory controller exerts back-pressure on the
GPU traffic whose QoS value is lower than the programmed QoSACCEPT threshold. However in this
scenario the interconnect uses the QoSACCEPT signaling to hold back only the lower-QoS GPU traffic rather
than regulating traffic indiscriminately, and consequently CPU transaction latency drops to almost the same
level as in Case 1.
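A minimal sketch of the handshake described above follows, assuming the controller advertises the lowest QoS value it will currently accept based on its queue fill, and the interconnect holds back anything below that level; the threshold values are hypothetical:

    QUEUE_GROUPS = 16           # queue fullness measured in sixteenths, as in the previous sketch
    QOS_ACCEPT_THRESHOLD = 12   # QoS level advertised once the queue crosses its Fill Threshold

    def qos_accept_level(queue_occupancy, queue_size, fill_threshold):
        """QoS value at or above which the memory controller will currently accept transactions."""
        if queue_occupancy >= queue_size * fill_threshold / QUEUE_GROUPS:
            return QOS_ACCEPT_THRESHOLD   # queue filling up: only high-QoS traffic accepted
        return 0                          # plenty of room: accept everything

    def interconnect_forward(transactions, accept_level):
        """The interconnect forwards only transactions the controller is willing to accept,
        holding back the rest rather than stalling all traffic indiscriminately."""
        forwarded = [t for t in transactions if t["qos"] >= accept_level]
        held_back = [t for t in transactions if t["qos"] < accept_level]
        return forwarded, held_back

    # With the queue past its Fill Threshold, CPU traffic at QoS 15 still flows while
    # GPU traffic at QoS 11 is held back in the interconnect, as in Case 4 above.
    level = qos_accept_level(queue_occupancy=30, queue_size=32, fill_threshold=6)
    fwd, held = interconnect_forward([{"agent": "CPU", "qos": 15},
                                      {"agent": "GPU", "qos": 11}], level)
    print([t["agent"] for t in fwd], [t["agent"] for t in held])   # ['CPU'] ['GPU']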
The QoSACCEPT mechanism fundamentally escalates Queue Fill Threshold information up to the
memory controller interface. This escalation enables better macro-level regulation by the interconnect and increases
overall efficacy of the combined solution. This showcases the importance of close integration between the
interconnect and memory controller, achieved by CoreLink CCI-550 and DMC-500.
Summary
In this paper, we have explored various memory controller performance optimization techniques for SoCs
characterized by a mobile-style architecture. ARM builds memory controllers for advanced memories targeting
mobile SoCs with the goal of providing the best, optimized performance from CPU to memory for LL, HB and
RT traffic agents. Although the mobile traffic characteristics described in this paper have been simplified for
ease of analysis and explanation, extensive system simulations, hardware emulation using complex, real-world
traffic traces and years of partner success confirm that the mobile memory subsystem performance remains true
to its design intent. Specifically, integration of interconnect and memory controller QoS provides further
improvements in results not achievable by individual blocks alone.
Future work includes extension of this integration to include ARM processors as well, providing a low-latency
fast path to memory, performing speculative fetches and optimizing the use of CPU and system caches for
further enhancements in memory access performance.