SPRU610C
Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections, modifications,
enhancements, improvements, and other changes to its products and services at any time and to discontinue
any product or service without notice. Customers should obtain the latest relevant information before placing
orders and should verify that such information is current and complete. All products are sold subject to TI’s terms
and conditions of sale supplied at the time of order acknowledgment.
TI warrants performance of its hardware products to the specifications applicable at the time of sale in
accordance with TI’s standard warranty. Testing and other quality control techniques are used to the extent TI
deems necessary to support this warranty. Except where mandated by government requirements, testing of all
parameters of each product is not necessarily performed.
TI assumes no liability for applications assistance or customer product design. Customers are responsible for
their products and applications using TI components. To minimize the risks associated with customer products
and applications, customers should provide adequate design and operating safeguards.
TI does not warrant or represent that any license, either express or implied, is granted under any TI patent right,
copyright, mask work right, or other TI intellectual property right relating to any combination, machine, or process
in which TI products or services are used. Information published by TI regarding third-party products or services
does not constitute a license from TI to use such products or services or a warranty or endorsement thereof.
Use of such information may require a license from a third party under the patents or other intellectual property
of the third party, or a license from TI under the patents or other intellectual property of TI.
Reproduction of information in TI data books or data sheets is permissible only if reproduction is without
alteration and is accompanied by all associated warranties, conditions, limitations, and notices. Reproduction
of this information with alteration is an unfair and deceptive business practice. TI is not responsible or liable for
such altered documentation.
Resale of TI products or services with statements different from or beyond the parameters stated by TI for that
product or service voids all express and any implied warranties for the associated TI product or service and
is an unfair and deceptive business practice. TI is not responsible or liable for any such statements.
Following are URLs where you can obtain information on other Texas Instruments products and application
solutions:
Products
  Amplifiers          amplifier.ti.com
  Data Converters     dataconverter.ti.com
  DSP                 dsp.ti.com
  Interface           interface.ti.com
  Logic               logic.ti.com
  Power Mgmt          power.ti.com
  Microcontrollers    microcontroller.ti.com

Applications
  Audio               www.ti.com/audio
  Automotive          www.ti.com/automotive
  Broadband           www.ti.com/broadband
  Digital Control     www.ti.com/digitalcontrol
  Military            www.ti.com/military
  Optical Networking  www.ti.com/opticalnetwork
  Security            www.ti.com/security
  Telephony           www.ti.com/telephony
  Video & Imaging     www.ti.com/video
  Wireless            www.ti.com/wireless
Notational Conventions
This document uses the following conventions.
- Hexadecimal numbers are shown with the suffix h. For example, the
following number is 40 hexadecimal (decimal 64): 40h.
- Registers in this document are shown in figures and described in tables.
  - Each register figure shows a rectangle divided into fields that represent
    the fields of the register. Each field is labeled with its bit name, its
    beginning and ending bit numbers above, and its read/write properties
    below. A legend explains the notation used for the properties.
  - Reserved bits in a register figure designate bits that are reserved for
    future device expansion.
Contents
6 Registers
6.1 Cache Configuration Register (CCFG)
6.2 L2 EDMA Access Control Register (EDMAWEIGHT)
6.3 L2 Allocation Registers (L2ALLOC0−L2ALLOC3)
6.4 L2 Writeback Base Address Register (L2WBAR)
6.5 L2 Writeback Word Count Register (L2WWC)
6.6 L2 Writeback−Invalidate Base Address Register (L2WIBAR)
6.7 L2 Writeback−Invalidate Word Count Register (L2WIWC)
Revision History
Figures
1 TMS320C64x DSP Block Diagram
2 TMS320C64x Two-Level Internal Memory Block Diagram
3 L1D Address Allocation
4 Address to Bank Number Mapping
5 Potentially Conflicting Memory Accesses
6 L1P Address Allocation
7 L2 Address Allocation, 256K Cache (L2MODE = 111b)
8 L2 Address Allocation, 128K Cache (L2MODE = 011b)
9 L2 Address Allocation, 64K Cache (L2MODE = 010b)
10 L2 Address Allocation, 32K Cache (L2MODE = 001b)
11 Cache Configuration Register (CCFG)
12 L2 EDMA Access Control Register (EDMAWEIGHT)
13 L2 Allocation Registers (L2ALLOC0−L2ALLOC3)
14 L2 Writeback Base Address Register (L2WBAR)
15 L2 Writeback Word Count Register (L2WWC)
16 L2 Writeback−Invalidate Base Address Register (L2WIBAR)
17 L2 Writeback−Invalidate Word Count Register (L2WIWC)
18 L2 Invalidate Base Address Register (L2IBAR)
19 L2 Invalidate Word Count Register (L2IWC)
20 L1P Invalidate Base Address Register (L1PIBAR)
21 L1P Invalidate Word Count Register (L1PIWC)
22 L1D Writeback−Invalidate Base Address Register (L1DWIBAR)
23 L1D Writeback−Invalidate Word Count Register (L1DWIWC)
24 L1D Invalidate Base Address Register (L1DIBAR)
25 L1D Invalidate Word Count Register (L1DIWC)
26 L2 Writeback All Register (L2WB)
27 L2 Writeback-Invalidate All Register (L2WBINV)
28 L2 Memory Attribute Register (MAR)
29 CPU Control and Status Register (CSR)
30 Block Cache Operation Base Address Register (BAR)
31 Block Cache Operation Word Count Register (WC)
32 Streaming Data Pseudo-Code
33 Double Buffering Pseudo-Code
34 Double-Buffering Time Sequence
35 Double Buffering as a Pipelined Process
Tables
1 TMS320C621x/C671x/C64x Internal Memory Comparison
2 Terms and Definitions
3 Cycles Per Miss for Different Numbers of L1D Misses That Hit L2 Cache
4 Cycles Per Miss for Different Numbers of L1D Misses That Hit L2 SRAM
5 Average Miss Penalties for Large Numbers of Sequential Execute Packets
6 Internal Memory Control Registers
7 Cache Configuration Register (CCFG) Field Descriptions
8 L2 EDMA Access Control Register (EDMAWEIGHT) Field Descriptions
9 L2 Allocation Registers (L2ALLOC0−L2ALLOC3) Field Descriptions
10 L2 Writeback Base Address Register (L2WBAR) Field Descriptions
11 L2 Writeback Word Count Register (L2WWC) Field Descriptions
12 L2 Writeback−Invalidate Base Address Register (L2WIBAR) Field Descriptions
13 L2 Writeback−Invalidate Word Count Register (L2WIWC) Field Descriptions
14 L2 Invalidate Base Address Register (L2IBAR) Field Descriptions
15 L2 Invalidate Word Count Register (L2IWC) Field Descriptions
16 L1P Invalidate Base Address Register (L1PIBAR) Field Descriptions
17 L1P Invalidate Word Count Register (L1PIWC) Field Descriptions
18 L1D Writeback−Invalidate Base Address Register (L1DWIBAR) Field Descriptions
19 L1D Writeback−Invalidate Word Count Register (L1DWIWC) Field Descriptions
20 L1D Invalidate Base Address Register (L1DIBAR) Field Descriptions
21 L1D Invalidate Word Count Register (L1DIWC) Field Descriptions
22 L2 Writeback All Register (L2WB) Field Descriptions
23 L2 Writeback−Invalidate All Register (L2WBINV) Field Descriptions
24 Memory Attribute Register (MAR) Field Descriptions
25 L1D Mode Setting Using DCC Field
26 L1P Mode Setting Using PCC Field
27 L2 Mode Switch Procedure
28 Memory Attribute Registers
29 Summary of Program-Initiated Cache Operations
30 L2ALLOC Default Queue Allocations
31 Coherence Assurances in the Two-Level Memory System
32 Program Order for Memory Operations Issued From a Single Execute Packet
33 Document Revision History
[Figure 1. TMS320C64x DSP Block Diagram. Surviving labels: EMIFA, EMIFB, L1P cache.]
Note: EMIFB is available only on certain C64x devices. Refer to the device-specific data sheet for the available peripheral set.
Table 1. TMS320C621x/C671x/C64x Internal Memory Comparison (excerpt)

Characteristic                   C621x/C671x                        C64x
L1P → L2 single request stall    5 cycles for L2 hit                8 cycles for L2 hit
L1D write hit action             Data updated in L1D; line marked dirty
L2 total size                    Varies by part number. Refer to the datasheet for the specific device.
L2 SRAM size                     Varies by part number. Refer to the datasheet for the specific device.
L2 replacement strategy          1/2/3/4-way Least Recently Used    4-way Least Recently Used
L2 read miss action              Data is read via EDMA into a newly allocated line in L2; the requested data is passed to the requesting L1.
L2 write miss action             Data is read via EDMA into a newly allocated line in L2; the write data is then written to the newly allocated line.
† Some C64x devices may not support the 256K cache mode. Refer to the device-specific datasheet.
[Figure 2. TMS320C64x Two-Level Internal Memory Block Diagram. Surviving labels: C6000 CPU; program fetch; program address and program data paths; L1P cache RAM and controller; L1 data cache RAM and controller; L2 cache controller; DA2 address; address and data paths; snoop address.]
Table 2 lists the terms used throughout this document that relate to the
operation of the C64x two-level memory hierarchy.
Capacity miss A cache miss that occurs because the cache does not have sufficient room to hold the
entire working set for a program. Compare with compulsory miss and conflict miss.
Clean A cache line that is valid and that has not been written to by upper levels of memory
or the CPU. The opposite state for a valid cache line is dirty.
Coherence Informally, a memory system is coherent if any read of a data item returns the most
recently written value of that data item. This includes accesses by the CPU and the
EDMA. Cache coherence is covered in more detail in section 8.1.
Compulsory miss Sometimes referred to as a first-reference miss. A compulsory miss is a cache miss
that must occur because the data has had no prior opportunity to be allocated in the
cache. Typically, compulsory misses for particular pieces of data occur on the first
access of that data. However, some cases can be considered compulsory even if
they are not the first reference to the data. Such cases include repeated write misses
on the same location in a cache that does not write allocate, and cache misses to
noncacheable locations. Compare with capacity miss and conflict miss.
Conflict miss A cache miss that occurs due to the limited associativity of a cache, rather than due
to capacity constraints. A fully-associative cache is able to allocate a newly cached
line of data anywhere in the cache. Most caches have much more limited
associativity (see set-associative cache), and so are restricted in where they may
place data. This results in additional cache misses that a more flexible cache would
not experience.
Direct-mapped cache A direct-mapped cache maps each address in the lower-level memory to a single
location in the cache. Multiple locations may map to the same location in the cache.
This is in contrast to a multi-way set-associative cache, which selects a place for the
data from a set of locations in the cache. A direct-mapped cache can be considered
a single-way set-associative cache.
Dirty In a writeback cache, writes that reach a given level in the memory hierarchy may
update that level, but not the levels below it. Thus, when a cache line is valid and
contains updates that have not been sent to the next lower level, that line is said to
be dirty. The opposite state for a valid cache line is clean.
Fetch packet A block of 8 instructions that are fetched in a single cycle. One fetch packet may
contain multiple execute packets, and thus may be consumed over multiple cycles.
First-reference miss A cache miss that occurs on the first reference to a piece of data. First-reference
misses are a form of compulsory miss.
Fully-associative cache A cache that allows any memory address to be stored at any location within the
cache. Such caches are very flexible, but usually not practical to build in hardware.
They contrast sharply with direct-mapped caches and set-associative caches, both of
which have much more restrictive allocation policies. Conceptually, fully-associative
caches are useful for distinguishing between conflict misses and capacity misses
when analyzing the performance of a direct-mapped or set-associative cache. In
terms of set-associative caches, a fully-associative cache is equivalent to a
set-associative cache that has as many ways as it does line frames, and that has
only one set.
Higher-level memory In a hierarchical memory system, higher-level memories are memories that are
closer to the CPU. The highest level in the memory hierarchy is usually the Level 1
caches. The memories at this level exist directly next to the CPU. Higher-level
memories typically act as caches for data from lower-level memory.
Hit A cache hit occurs when the data for a requested memory location is present in the
cache. The opposite of a hit is a miss. A cache hit minimizes stalling, since the data
can be fetched from the cache much faster than from the source memory. The
determination of hit versus miss is made on each level of the memory hierarchy
separately—a miss in one level may hit in a lower level.
Lower-level memory In a hierarchical memory system, lower-level memories are memories that are further
from the CPU. In a C64x system, the lowest level in the hierarchy includes the
system memory below L2 and any memory-mapped peripherals.
LRU Least Recently Used. See least recently used allocation for a description of the LRU
replacement policy. When used alone, LRU usually refers to the status information
that the cache maintains for identifying the least-recently used line in a set. For
example, consider the phrase “accessing a cache line updates the LRU for that line.”
Set A collection of line frames in a cache in which a single address can potentially
reside. A direct-mapped cache contains one line frame per set, and an N-way
set-associative cache contains N line frames per set. A fully-associative cache has
only one set that contains all of the line frames in the cache.
Set-associative cache A set-associative cache contains multiple line frames in which each lower-level
memory location can be held. When allocating room for a new line of data, the
selection is made based on the allocation policy for the cache. The C64x devices
employ a least-recently used allocation policy for their set-associative caches.
Thrash An algorithm is said to thrash the cache when its access pattern causes the
performance of the cache to suffer dramatically. Thrashing can occur for multiple
reasons. One possible situation is that the algorithm is accessing too much data or
program code in a short time frame with little or no reuse. That is, its working set is
too large, and thus the algorithm is causing a significant number of capacity misses.
Another situation is that the algorithm is repeatedly accessing a small group of
different addresses that all map to the same set in the cache, thus causing an
artificially high number of conflict misses.
Touch A memory operation on a given address is said to touch that address. Touch can also
refer to reading array elements or other ranges of memory addresses for the sole
purpose of allocating them in a particular level of the cache. A CPU-centric loop used
for touching a range of memory in order to allocate it into the cache is often referred
to as a touch loop. Touching an array is a form of software-controlled prefetch for data.
Valid When a cache line holds data that has been fetched from the next level memory, that
line frame is valid. The invalid state occurs when the line frame holds no data, either
because nothing has been cached yet, or because previously cached data has been
invalidated for whatever reason (coherence protocol, program request, etc.). The
valid state makes no implications as to whether the data has been modified since it
was fetched from the lower-level memory; rather, this is indicated by the dirty or
clean state of the line.
Victim When space is allocated in a set for a new line, and all of the line frames in the set
that the address maps to contain valid data, the cache controller must select one of
the valid lines to evict in order to make room for the new data. Typically, the
least-recently used (LRU) line is selected. The line that is evicted is known as the
victim line. If the victim line is dirty, its contents are written to the next lower level of
memory using a victim writeback.
Victim Buffer A special buffer that holds victims until they are written back. Victim lines are moved
to the victim buffer to make room in the cache for incoming data.
Victim Writeback When a dirty line is evicted (that is, a line with updated data is evicted), the updated
data is written to the lower levels of memory. This process is referred to as a victim
writeback.
Way In a set-associative cache, each set in the cache contains multiple line frames. The
number of line frames in each set is referred to as the number of ways in the cache.
The collection of corresponding line frames across all sets in the cache is called a
way in the cache. For instance, a 4-way set-associative cache has 4 ways, and each
set in the cache has 4 line frames associated with it, one associated with each of the
4 ways. As a result, any given cacheable address in the memory map has 4 possible
locations it can map to in a 4-way set-associative cache.
Write allocate A write-allocate cache allocates space in the cache when a write miss occurs. Space
is allocated according to the cache’s allocation policy (LRU, for example), and the
data for the line is read into the cache from the next lower level of memory. Once the
data is present in the cache, the write is processed. For a writeback cache, only the
current level of memory is updated—the write data is not immediately passed to the
next level of memory.
Writeback The process of writing updated data from a valid but dirty cache line to a lower-level
memory. After the writeback occurs, the cache line is considered clean. Unless
paired with an invalidate (as in writeback-invalidate), the line remains valid after a
writeback.
Writeback cache A writeback cache will only modify its own data on a write hit. It will not immediately
send the update to the next lower-level of memory. The data will be written back at
some future point, such as when the cache line is evicted, or when the lower-level
memory snoops the address from the higher-level memory. It is also possible to
directly initiate a writeback for a range of addresses using cache control registers. A
write hit to a writeback cache causes the corresponding line to be marked as
dirty—that is, the line contains updates that have yet to be sent to the lower levels of
memory.
Writeback-invalidate A writeback operation followed by an invalidation. See writeback and invalidate. On
the C64x devices, a writeback-invalidate on a group of cache lines only writes out
data for dirty cache lines, but invalidates the contents of all of the affected cache lines.
Write merging Write merging combines multiple independent writes into a single, larger write. This
improves the performance of the memory system by reducing the number of
individual memory accesses it needs to process. For instance, on the C64x device,
the L1D write buffer can merge multiple writes under some circumstances if they are
to the same double-word address. In this example, the result is a larger effective
write-buffer capacity and a lower bandwidth impact on L2.
Write-through cache A write-through cache passes all writes to the lower-level memory. It never contains
updated data that it has not passed on to the lower-level memory. As a result, cache
lines can never be dirty in a write-through cache. The C64x devices do not utilize
write-through caches.
Because L1D is a two-way cache, each set contains two cache lines, one for
each way. On each access, the L1D compares the tag portion of the address
for the access to the tag information for both lines in the appropriate set. If the
tag matches one of the lines and that line is marked valid, the access is a hit.
If these conditions are not met, the access is a miss. Miss penalties are
discussed in detail in section 3.2.
The L1D is a read-allocate-only cache. This means that new lines are allocated
in L1D for read misses, but not for write misses. For this reason, a 4-entry write
buffer exists between the L1D and L2 caches that captures data from write
misses. The write buffer is enhanced in comparison to the write buffer on the
C621x/C671x devices. The write buffer is described in section 3.2.3.
The L1D implements a least-recently used (LRU) line allocation policy. This
means that on an L1D read miss, the L1D evicts the least-recently read or
written line within a set in order to make room for the incoming data. Note that
invalid lines are always considered least-recently used.
If the selected line is dirty, that is, its contents are updated, then the victim line’s
data is prepared for writeback to L2 as a victim writeback. The actual victim
writeback occurs after the new data is fetched, and then only if the newly
fetched data is considered cacheable. If the newly fetched data is
noncacheable, the victim writeback is cancelled and the victim line remains in
the L1D cache.
The C64x DSP has a least-significant bit (LSB) based memory banking
structure that is similar to the structure employed by the C620x/C670x
families. The L1D on C64x devices divides memory into eight 32-bit-wide
banks. These banks are single-ported, allowing only one access per cycle.
This is in contrast to the C621x/C671x devices, which use a single bank of
dual-ported memory rather than multiple banks of single-ported memory. In
Figure 4, bits 4−2 of the address select the bank and bits 1−0 select the byte
within the bank.
 31                          5   4             2   1          0
 |    Upper address bits     |   Bank number     |   Offset    |
- The memory accesses are both writes to nonoverlapping bytes within the
same word. That is, bits 31−2 of the address are the same.
- The memory accesses are both reads that access all or part of the same
word. That is, bits 31−2 of the address are the same. In this case, the two
accesses may overlap.
- One or both of the memory accesses is a write that misses L1D and is
serviced by the write buffer instead. (See section 3.2.3 for information on
the write buffer.)
Notice that a read access and a write access in parallel to the same bank will
always cause a stall. Two reads or two writes to the same bank may not stall
as long as the above conditions are met.
[Figure 4 (continued). Address to Bank Number Mapping: a table showing, for byte, halfword, word, and doubleword accesses, the values of address bits 4−0 that fall in each of the eight 32-bit banks.]
- Banks are 16 bits wide on the C620x/C670x devices, and 32 bits wide on
the C64x devices.
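To make the banking rules concrete, the following C sketch shows one common way to avoid per-cycle bank conflicts between two arrays that are read in parallel. The array names, sizes, and the use of the DATA_ALIGN pragma are illustrative assumptions; the point is simply that offsetting one array by one word places paired accesses in different banks.

#define N 256

#pragma DATA_ALIGN(x, 32)        /* x[0] starts at a 32-byte boundary (bank 0) */
int x[N];

#pragma DATA_ALIGN(y_store, 32)
int y_store[N + 1];
int *const y = &y_store[1];      /* y[0] starts one word later (bank 1) */

int dot_product(void)
{
    int sum = 0;
    int i;

    /* If the compiler issues the two loads in parallel, x[i] and y[i]
     * now fall in different banks (bits 4-2 of their addresses differ),
     * so the paired reads avoid a bank-conflict stall. */
    for (i = 0; i < N; i++)
        sum += x[i] * y[i];
    return sum;
}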
Reads that miss L1D stall the CPU while the requested data is fetched. The
L1D is a read-allocate cache, and so it will allocate a new line for the requested
data, as described in section 3.1. An isolated L1D read miss that hits L2 SRAM
stalls the CPU for 6 cycles, and an isolated L1D read miss that hits L2 cache
stalls the CPU for 8 cycles. This assumes there is no other memory traffic in
L2 that delays the processing of requests from L1D. Section 5.4 discusses
interactions between the various requestors that access L2.
An L1D read miss that also misses L2 stalls the CPU while the L2 retrieves the
data from external memory. Once the data is retrieved, it is stored in L2 and
transferred to the L1D. The external miss penalty varies depending on the type
and width of external memory used to hold external data, as well as other
aspects of system loading. Section 5.2 describes how L2 handles cache
misses on behalf of L1D.
If there are two read misses to the same line in the same cycle, only one miss
penalty is incurred. Similarly, if there are two accesses in succession to the
same line and the first one is a miss, the second access will not incur any
additional miss penalty.
The process of allocating a line in L1D can result in a victim writeback. Victim
writebacks move updated data out of L1D to the lower levels of memory. When
updated data is evicted from L1D, the cache moves the data to the victim
buffer. Once the data is moved to the victim buffer, the L1D resumes
processing of the current read miss. Further processing of the victim writeback
occurs in the background. Subsequent read and write misses, however, must
wait for the victim writeback to be processed. As a result, victim writebacks can
noticeably lengthen the time for servicing cache misses.
The L1D pipelines read misses. Consecutive read misses to different lines
may be overlapped, reducing the overall stall penalty. The incremental stall
penalty can be as small as 2 cycles per miss. Section 3.2.4 discusses miss
pipelining.
Write misses do not stall the CPU directly. Rather, write misses are queued in
the write buffer that exists between L1D and L2. Although the CPU does not
always stall for write misses, the write buffer can stall the CPU under various
circumstances. Section 3.2.3 describes the effects of the write buffer.
The L1D does not write allocate. Rather, write misses are passed directly to
L2 without allocating a line in L1D. A write buffer exists between the L1D cache
and the L2 memory to capture these write misses. The write buffer provides
a 64-bit path for writes from L1D to L2 with room for four outstanding write
requests.
Writes that miss L1D do not stall the CPU unless the write buffer is full. If the
write buffer is full, a write miss will stall the CPU until there is room in the buffer
for the write. The write buffer can also indirectly stall the CPU by extending the
time for a read miss. Reads that miss L1D will not be processed as long as the
write buffer is not empty. Once the write buffer has emptied, the read miss will
be processed. This is necessary as a read miss may overlap an address for
which a write is pending in the write buffer.
The L2 can process a new request from the write buffer every cycle, provided
that the requested L2 bank is not busy. Section 5.3 describes the L2 banking
structure and its impact on performance.
The C64x write buffer allows merging of write requests. It merges two write
misses into a single transaction provided that all of the following rules are obeyed:
- The double-word addresses (that is, the upper 29 bits) for the two
accesses are the same.
- The two writes are to locations in L2 SRAM (not locations that may be held
in L2 cache).
- The first write has just been placed in the write buffer queue.
- The first write has not yet been presented to the L2 controller.
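As a minimal illustration, the two stores below satisfy these rules when flags resides in L2 SRAM and both stores miss L1D on consecutive cycles; the write buffer may then merge them into a single 64-bit transaction. The names and the alignment pragma are illustrative assumptions.

#pragma DATA_ALIGN(flags, 8)     /* both words share one double-word address */
unsigned int flags[2];           /* assumed to be linked into L2 SRAM */

void post_flags(unsigned int a, unsigned int b)
{
    flags[0] = a;   /* write miss: entry queued in the L1D write buffer    */
    flags[1] = b;   /* same upper 29 address bits: may merge with entry a  */
}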
For L1D miss pipelining to be effective, there must be multiple outstanding L1D
read misses. Load instructions on the C64x DSP have a 5-cycle-deep pipeline,
and the C64x DSP may issue up to two accesses per cycle. In this pipeline,
the L1D performs tag comparisons in one pipeline stage (E2), and services
cache hits and misses on the following stage (E3). Cache read misses result
in a CPU stall.
L1D processes single read misses only when there are no outstanding victim
writebacks and when the write buffer is empty. When two cache misses occur
in parallel, the L1D processes the misses in program order. (The program
order is described in section 8.3.1.) In the case of two write misses, the misses
are inserted in the write buffer and the CPU does not stall unless the write
buffer is full. (Section 3.2.3 describes the write buffer.) In the case of two read
misses or a read and a write miss, the misses are overlapped as long as they
are to different sets, that is, their addresses differ in bits 13−6.
Cache misses are processed in the E3 pipeline stage. Once L1D has issued
commands to L2 for all of the cache misses in E3, the L1D may decide to
advance its state internally by one pipeline stage to consider cache misses due
to accesses that were in the E2 pipeline stage. This allows L1D to aggressively
overlap requests for cache misses that occur in parallel and cache misses that
occur on consecutive cycles. L1D considers the accesses in E2 only if the write
buffer and victim writeback buffer are empty. Although the L1D internal state
advances, the CPU stall is not released until the data returns for accesses that
were in the E3 stage.
Once the CPU stall is released, memory accesses that were in the E2 stage
advance to the E3 pipeline stage. This may bring one or two new accesses into
the E2 pipeline stage. It also potentially brings one or two unprocessed cache
misses from E2 into E3. The L1D first issues commands for any cache misses
that are now in E3 but that have not yet been processed. Once the accesses
in E3 are processed, the L1D may consider accesses in E2 as previously
described. In any case, the L1D stalls the CPU when there are accesses in E3
that have not yet completed.
The net result is that the L1D can generate a continuous stream of requests
to L2. Code that issues pairs of memory reads to different cache lines every
cycle will maximize this effect. As noted above, this pipelining can result in
improved performance, especially in the presence of sustained read misses.
The incremental miss penalty can be as small as 2 cycles per miss when the
L1D is able to overlap the processing for a new cache miss with that of prior
misses. Therefore, the average miss penalty for a sustained sequence of
back-to-back misses approaches 2 cycles per miss in the ideal case. Table 3
and Table 4 illustrate the performance for various numbers of consecutive L1D
read misses that hit in L2 cache and L2 SRAM, assuming all misses are able
to overlap. These further assume that there is no other memory traffic in L2 that
may lengthen the time required for an L1D cache miss, and that all misses are
within the same half of the affected L1D cache lines.
Table 3. Cycles Per Miss for Different Numbers of L1D Misses That Hit L2 Cache

Number of Misses (M)    Total Stall Cycles    Stall Cycles Per Miss
2                       10                    5
3                       12                    4
4                       14                    3.5
> 4, even               6 + (2 * M)           2 + (6 / M)
Table 4. Cycles Per Miss for Different Numbers of L1D Misses That Hit L2 SRAM

Number of Misses (M)    Total Stall Cycles    Stall Cycles Per Miss
2                       8                     4
3                       10                    3.33
4                       12                    3
> 4, even               4 + (2 * M)           2 + (4 / M)
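A touch loop (see the definition in Table 2) is one way to exploit this miss pipelining deliberately. The C sketch below assumes a 64-byte L1D line size and issues two reads per iteration to different lines so that several misses remain outstanding at once; the function name and structure are illustrative, not a TI-supplied routine.

/* Allocate a buffer into L1D ahead of use, overlapping read misses. */
void touch(const void *array, unsigned int byte_count)
{
    const volatile unsigned char *p = (const volatile unsigned char *)array;
    unsigned int i;

    /* Read one byte from each of two different 64-byte lines per
     * iteration so that consecutive misses can be pipelined. */
    for (i = 0; i + 64 < byte_count; i += 128) {
        (void)p[i];
        (void)p[i + 64];
    }
    if (i < byte_count)
        (void)p[i];          /* touch any remaining partial line */
}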
Physical addresses map onto the cache in a fixed manner. The physical
address divides into three fields as shown in Figure 6. Bits 4−0 of the address
specify an instruction within a set. Bits 13−5 of the address select one of the
512 sets within the cache. Bits 31−14 of the address serve as the tag for the
line.
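Expressed as shift-and-mask operations on a 32-bit address, the three fields look like the following sketch (the macro names are illustrative, not from TI's headers):

#define L1P_OFFSET(addr)  ((addr) & 0x1Fu)           /* bits 4-0: offset within the line */
#define L1P_SET(addr)     (((addr) >> 5) & 0x1FFu)   /* bits 13-5: one of 512 sets       */
#define L1P_TAG(addr)     ((addr) >> 14)             /* bits 31-14: tag                  */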
An L1P miss that misses in L2 cache stalls the CPU until the L2 retrieves the
data from external memory and transfers the data to the L1P, which then
returns the data to the CPU. This delay depends upon the type of external
memory used to hold the program, as well as other aspects of system loading.
The C64x DSP allows an execute packet to span two fetch packets. This
spanning does not change the penalty for a single miss. However, if both fetch
packets are not present in L1P, two cache misses occur.
The L1P cache pipelines cache misses. A single L1P cache miss requires
8 cycles to retrieve data from L2. Miss pipelining can hide much of this
overhead by overlapping the processing for several cache misses.
Additionally, some amount of the cache miss overhead can be overlapped with
dispatch stalls that occur in the fetch pipeline.
The fetch and decode pipeline is divided into 6 stages leading up to but not
including the first execution stage, E1. The stages are:
- PG − Program Generate
- PS − Program Send
- PW − Program Wait
- PR − Program Read
- DP − Dispatch
- DC − Decode
C6000 DSP instructions are grouped into two kinds of packets: fetch packets and
execute packets. The CPU fetches instructions from memory in fixed bundles
of 8 instructions, known as fetch packets. The instructions are decoded and
separated into bundles of parallel-issue instructions known as execute
packets. A single execute packet may contain between 1 and 8 instructions.
Thus, a single fetch packet may contain multiple execute packets. On the
C64x DSP, an execute packet may also span two fetch packets. The Program
Read (PR) stage of the pipeline is responsible for identifying a sequence of
execute packets within a sequence of fetch packets. The Dispatch (DP) stage
is responsible for extracting and dispatching them to functional units.
As a result of the disparity between fetch packets and execute packets, the
entire fetch pipeline need not advance every cycle. Rather, the PR pipeline
stage only allows the Program Wait (PW) stage to advance its contents into
the PR stage when the DP stage has consumed the complete fetch packet
held in PR. The stages before PR advance as needed to fill in gaps. Thus,
when there are no cache misses, the early stages of the fetch pipeline are
stalled while the DP stage pulls the individual execute packets from the current
fetch packet. These stalls are referred to as dispatch stalls.
The C64x DSP takes advantage of these dispatch stalls by allowing the earlier
stages of the pipeline to advance toward DP while cache misses for those
stages are still pending. Cache misses may be pending for the PR, PW, and
PS pipeline stages. Because the DP stage stalls the PR stage with a dispatch
stall while it consumes the fetch packets in the PR stage of the pipeline, it is
not necessary to expose these cache stalls to the CPU. When a fetch packet
is consumed completely, however, the contents of the PW stage must advance
into the PR stage. At this point, the CPU is stalled if DP requests an execute
packet from PR for which there is still an outstanding cache miss.
When a branch is taken, the fetch packet containing the branch target
advances through the fetch pipeline every cycle until the branch target
reaches the E1 pipeline stage. Branch targets override the dispatch stall
described above. As a result, they do not gain as much benefit from miss
pipelining as other instructions. The fetch packets that immediately follow a
branch target do benefit, however. Although the code in the fetch packets that
follows the branch target may not execute immediately, the branch triggers
several consecutive fetches for this code, and thus pipelines any misses for
that code. In addition, no stalls are registered for fetch packets that were
requested prior to the branch being taken, but that never made it to the DP
pipeline stage.
The miss penalty for a single L1P miss is 8 cycles. The second miss in a pair
of back-to-back misses will see an incremental stall penalty of up to 2 cycles.
Sustained back-to-back misses in straight-line (nonbranching) code incur an
average miss penalty based on the average parallelism of the code. The
average miss penalty for a long sequence of sustained misses in straight-line
code is summarized in Table 5.
Table 5. Average Miss Penalties for Large Numbers of Sequential Execute Packets

Instructions Per Execute Packet    Average Stalls Per Execute Packet
1                                  0.125
2                                  0.125
3                                  0.688
4                                  1.500
5                                  1.813
6                                  2.375
7                                  2.938
8                                  4.000
The total size of the L2 depends on the specific C64x device. The C6414,
C6415, and C6416 devices provide 1024K bytes of L2 memory. The C6411
and C6412 devices provide 256K bytes of L2 memory. For other C64x devices,
consult the data sheet to determine the L2 size for the device.
The L2 cache is a 4-way set associative cache whose capacity varies between
32K bytes and 256K bytes depending on its mode. The L2 cache is enabled
by the L2MODE field in the cache configuration register (CCFG). Enabling L2
cache reduces the amount of available L2 SRAM. Section 7.1.3 discusses how
the L2 memory map varies according to cache mode and the specific device
being used.
5.2 L2 Operation
When L2 cache is enabled, it services requests from L1P and L1D to external
addresses. The operation of the L2 cache is similar to the operation of both
L1P and L1D. When a request is made, the L2 first determines if the address
requested is present in the cache. If the address is present, the access is
considered a cache hit and the L2 services the request directly within the
cache. If the address is not present, the access results in a cache miss. On a
miss, the L2 processes the request according to the cacheability of the
affected address.
On a cache hit, the L2 updates the LRU status for the corresponding set in L2
cache. If the access is a read, the L2 returns the requested data. If the access
is a write, the L2 updates the contents of the cache line and marks the line as
dirty. L2 is a writeback cache, and so write hits in L2 are not immediately
forwarded to external memory. The external memory will be updated when this
line is later evicted, or is written back using the block-writeback control
registers described in section 7.3.2.
The L2 allocates a new line within the L2 cache on a cache miss to a cacheable
external memory location. Note that unlike L1D, the L2 allocates a line for both
read and write misses. The L2 cache implements a least-recently used policy
to select a line within the set to allocate on a cache miss. If the line being
replaced is valid, it is evicted as described below. Once space is allocated for
the new data, an entire L2 line's worth of data is fetched via the EDMA into the
allocated line.
Evicting a line from L2 requires several steps, regardless of whether the victim
is clean or dirty. For each line in L2, L2 tracks whether the given line is also
cached in L1D. If it detects that the victim line is present in L1D, it sends L1D
snoop-invalidate requests to remove the affected L1D lines. L1D responds by
invalidating the corresponding line. If the line in L1D was dirty, the updated
data is passed to L2 and merged with the L2 line that is being evicted. The
combined result is written to external memory. If the victim line was not dirty
in either L1D or L2, its contents are discarded. These actions ensure that the
most recent writes to the affected cache line are written to external memory,
but that clean lines are not needlessly written to external memory.
Note that L1P is not consulted when a line is evicted from L2. This allows
program code to remain in L1P despite having been evicted from L2. This
presumes that program code is never written to. In those rare situations where
this is not the case, programs may use the cache controls described in
section 7.3.2 to remove cached program code from L1P.
If the cache miss was a read, the data is stored in L2 cache when it arrives.
It is also forwarded directly to the requestor, thereby reducing the overall stall
time. The portion of the L2 cache line requested by L1 is sent directly to that
L1, and is allocated within that cache. The entire L2 line is also stored within
the L2 cache as it arrives.
If the cache miss was a write, the incoming data is merged with the write data
from L1D, and the merged result is stored in L2. In the case of a write, the line
is not immediately allocated in L1D, as L1D does not write allocate.
A long distance write causes the L2 to store a temporary copy of the written
data. It then issues a write transfer for the write miss to the EDMA controller.
Long distance writes can only originate from L1D using the L1D write buffer.
Because the written data is stored in a special holding buffer, it is not necessary
to stall the CPU while the long-distance write is being processed. Also, further
writes to the L2 SRAM address space or on-chip peripherals may be
processed while the long-distance access is being executed.
Only one requestor may access the L2 in a given cycle. In the case of an L2
conflict, the L2 prioritizes requests in the above order. That is, L1P read hits
have highest priority, followed by L1D, and so on.
5.4 L2 Interfaces
The L2 services requests from the L1D cache, L1P cache, and the EDMA. The
L2 provides access to its own memory, the peripheral configuration bus
(CFGBUS), and the various cache control registers described in section 6,
Registers. The following sections describe the interaction between L2 and the
various requestors that it interfaces.
Note:
The L2 only supports 32-bit accesses to on-chip peripheral addresses,
including its own control registers. Byte, half-word, and double-word
accesses may not work as expected.
The end result of this system of snoops and invalidates is that coherence is
retained between EDMA and CPU accesses to L2 SRAM. The example in
section 8.1 illustrates this protocol in action.
L2 requests are queued along with other EDMA requests, and are
serviced according to the policies set by the EDMA. The L2 requests may be
placed on any of the EDMA priority queues. The priority level for requests and
the number of outstanding requests permitted can be controlled as described
in section 7.4. The EDMA is described in TMS320C6000 DSP Enhanced
Direct Memory Access (EDMA) Peripheral Reference Guide (SPRU234).
Sections 5.4.2 through 5.4.4 describe the interaction between the EDMA and
the two-level memory system. Section 8, Memory System Policies, describes
the coherence and ordering policies for the memory system.
6 Registers
The two-level memory hierarchy is controlled by several memory-mapped
control registers listed in Table 6. It is also controlled by the data cache control
(DCC) and program cache control (PCC) fields in the CPU control and status
register (CSR), see section 7.1. See the device-specific datasheet for the
memory address of these registers.
MAR0−MAR95 Reserved −
MAR112−MAR127 Reserved −
MAR192−MAR255 Reserved −
† MAR96−MAR111 are available only on the C6414/C6415/C6416 devices; on all other devices, these registers are reserved.
Figure 11. Cache Configuration Register (CCFG)

 31       29  28                                       16
 |    P      |                Reserved                  |
    R/W-0                        R-0

 15           10   9     8    7           3   2         0
 |   Reserved    |  IP  |  ID  |  Reserved  |  L2MODE    |
       R-0         W-0    W-0       R-0        R/W-0

Legend: R = Read only; W = Write only; R/W = Read/write; -n = value after reset
− 4h−7h Reserved
28−10 Reserved − 0 Reserved. The reserved bit location is always read as 0. Bits in this field should always be written with 0.
7−3 Reserved − 0 Reserved. The reserved bit location is always read as 0. Bits in this field should always be written with 0.
2−0 L2MODE‡ OF(value) 0−7h L2 operation mode bits. See the device-specific datasheet to determine the available cache options for a device and how the various L2 cache modes affect the memory map for different sizes of L2.
− 4h−6h Reserved
1−0 EDMAWEIGHT 0−3h EDMA weight limits the amount of time L1D blocks EDMA access to L2.
2−0 QnCNT OF(value) 0−7h The total number of outstanding L2 and QDMA requests permitted on the corresponding EDMA priority level. Further requests on that priority level are stalled until the number of outstanding requests falls below the QnCNT setting.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2WBAR_L2WBAR_symval
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2WWC_L2WWC_symval
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2WIBAR_L2WIBAR_symval
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2WIWC_L2WIWC_symval
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2IBAR_L2IBAR_symval
DEFAULT 0
† For CSL implementation, use the notation CACHE_L2IWC_L2IWC_symval
Table 16. L1P Invalidate Base Address Register (L1PIBAR) Field Descriptions
Bit Field symval† Value Description
31−0 L1PIBAR OF(value) 0−FFFF FFFFh L1P invalidate base address.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1PIBAR_L1PIBAR_symval
Table 17. L1P Invalidate Word Count Register (L1PIWC) Field Descriptions
Bit Field symval† Value Description
31−16 Reserved − 0 Reserved. The reserved bit location is always read as 0.
Bits in this field should always be written with 0.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1PIWC_L1PIWC_symval
Table 18. L1D Writeback−Invalidate Base Address Register (L1DWIBAR) Field Descriptions
Bit Field symval† Value Description
31−0 L1DWIBAR OF(value) 0−FFFF FFFFh L1D writeback−invalidate base address.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1DWIBAR_L1DWIBAR_symval
Table 19. L1D Writeback−Invalidate Word Count Register (L1DWIWC) Field Descriptions
Bit Field symval† Value Description
31−16 Reserved − 0 Reserved. The reserved bit location is always read as 0.
Bits in this field should always be written with 0.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1DWIWC_L1DWIWC_symval
Table 20. L1D Invalidate Base Address Register (L1DIBAR) Field Descriptions
Bit Field symval† Value Description
31−0 L1DIBAR OF(value) 0−FFFF FFFFh L1D invalidate base address.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1DIBAR_L1DIBAR_symval
Table 21. L1D Invalidate Word Count Register (L1DIWC) Field Descriptions
Bit Field symval† Value Description
31−16 Reserved − 0 Reserved. The reserved bit location is always read as 0.
Bits in this field should always be written with 0.
DEFAULT 0
† For CSL implementation, use the notation CACHE_L1DIWC_L1DIWC_symval
0 CE OF(value) Cache enable bit determines whether the L1D, L1P, and L2 are
allowed to cache the corresponding address range.
Figure 29. CPU Control and Status Register (CSR), bits 15−0

 15         10   9     8    7      5   4      2    1      0
 |   PWRD     | SAT  | EN  |  PCC   |   DCC    | PGIE | GIE |
Table 25 excerpt (DCC field values): 001 Reserved; 011−111 Reserved
Table 26 excerpt (PCC field values): 001 Reserved; 011−111 Reserved
Note:
Reads or writes to L2 address ranges that are configured as L2 cache may
result in undesired operation of the cache hierarchy. Programs must confine
L2 accesses to L2 addresses that are mapped as L2 SRAM to ensure correct
program operation.
The L2 controller processes reads to the address range
0000 0000h−0010 0FFFh without undesired cache operation, even if some
of these addresses are configured as L2 cache. This address range
represents the entire allotted L2 address range, plus some additional space
to allow for certain program optimizations. Therefore, the restriction above
does not apply to reads; however, programs should not interpret values
returned by reads nor should programs perform writes to L2 addresses that
are not configured as L2 SRAM.
Table 27 excerpt. L2 Mode Switch Procedure (from a mode with mixed L2 SRAM and L2 cache to a mode with less L2 mapped SRAM)

1) Use EDMA to transfer any data needed out of the L2 SRAM space to be converted into cache.
2) Perform a block writeback-invalidate in L1D of L2 SRAM addresses that are about to become L2 cache.
3) Wait for block writeback-invalidate to complete.
4) Perform global writeback-invalidate of L2 (L2WBINV).
5) Wait for L2WBINV to complete.
6) Write to CCFG to change mode.
7) Force CPU to wait for CCFG modification by reading CCFG.
8) Execute 8 cycles of NOP.
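A C sketch of steps 2 through 8 follows. The register addresses are assumptions based on a typical C64x memory map and must be verified against the device-specific datasheet; step 1 (moving live data out with the EDMA) is left to the caller.

#define CCFG     (*(volatile unsigned int *)0x01840000)  /* assumed address */
#define L1DWIBAR (*(volatile unsigned int *)0x01844030)  /* assumed address */
#define L1DWIWC  (*(volatile unsigned int *)0x01844034)  /* assumed address */
#define L2WBINV  (*(volatile unsigned int *)0x01845004)  /* assumed address */

void l2_mode_switch(unsigned int new_mode,   /* new L2MODE value, 0-7   */
                    unsigned int base,       /* L2 SRAM becoming cache  */
                    unsigned int word_count) /* size of that range      */
{
    L1DWIBAR = base;                   /* step 2: writeback-invalidate in L1D */
    L1DWIWC  = word_count;
    while (L1DWIWC != 0)
        ;                              /* step 3: wait for completion */

    L2WBINV = 1;                       /* step 4: global writeback-invalidate */
    while (L2WBINV & 1)
        ;                              /* step 5: wait for the C bit to clear */

    CCFG = (CCFG & ~0x7u) | new_mode;  /* step 6: set L2MODE (bits 2-0) */
    (void)CCFG;                        /* step 7: read CCFG to wait for it */
    asm(" nop 8");                     /* step 8: 8 cycles of NOP */
}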
The cache enable (CE) bit in each MAR determines whether the L1D, L1P, and
L2 are allowed to cache the corresponding address range. After reset, the CE
bit in each MAR is cleared to 0, thereby disabling caching of external memory
by default. This is in contrast to L2 SRAM, which is always considered cacheable.
To disable caching for a given address range, programs should use the following
sequence to ensure that all future accesses to the particular address
range are not cached.
1) Ensure that all addresses within the affected range are removed from the
L1 and L2 caches. Any one of the following operations is sufficient.
a) If L2 cache is enabled, invoke a global writeback-invalidate using
L2WBINV. Wait for the C bit in L2WBINV to read as 0. Alternately,
invoke a block writeback-invalidate of the affected range using
L2WIBAR/L2WIWC. Wait for L2WIWC to read as 0.
b) If L2 is in all SRAM mode, invoke a block writeback-invalidate of the
affected range using L1DWIBAR/L1DWIWC. Wait for L1DWIWC to
read as 0.
Note that the block-oriented cache controls can only operate on a
256K-byte address range at a time, so multiple block writeback-invalidate
operations may be necessary to remove the entire affected address range
from the cache. These cache controls are discussed in section 7.3.
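As a sketch, case (a) above might be coded as follows, finishing with a clear of the CE bit so that the range is no longer cached. The L2WIBAR/L2WIWC addresses are assumptions to verify against the device datasheet, and the caller supplies a pointer to the MAR that covers the range.

#define L2WIBAR (*(volatile unsigned int *)0x01844010)  /* assumed address */
#define L2WIWC  (*(volatile unsigned int *)0x01844014)  /* assumed address */

void disable_caching(volatile unsigned int *mar,  /* MAR covering the range */
                     unsigned int base, unsigned int word_count)
{
    /* Step 1a: writeback-invalidate the range out of L2 (and L1D). */
    L2WIBAR = base;
    L2WIWC  = word_count;
    while (L2WIWC != 0)
        ;                    /* wait for the block operation to finish */

    *mar = 0;                /* clear CE: future accesses are not cached */
}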
The memory system can only perform one program-initiated cache operation
at a time. This includes global operations, block operations, and mode
changes. For this reason, the memory system may stall accesses to cache
control registers while a cache control operation is in progress. Table 29 gives
a summary of the available operations and their impact on the memory
system.
Table 29 excerpt. Summary of Program-Initiated Cache Operations

L2 Block Writeback with Invalidate (L2WIBAR, L2WIWC)
  L1P: All lines in block are invalidated.†
  L1D: Updated lines in block are written to L2. All lines in block are invalidated in L1D.†
  L2:  Updated lines in block are written to external memory. All lines in block are invalidated in L2.

L2 Block Invalidate (L2IBAR, L2IWC)
  L1P: All lines in block are invalidated.†
  L1D: All lines in block are invalidated in L1D. Updated data in block is discarded.†
  L2:  All lines in block are invalidated in L2. Updated data in block is discarded.
Global operations in L1D and L1P are initiated by the ID and IP bits in CCFG
(Figure 11). The L1D and L1P only offer global invalidation. By writing a 1 to
the ID or IP bit in CCFG, a program can invalidate the entire contents of the
corresponding L1 cache. Upon initiation of the global invalidate, the entire
contents of the corresponding cache is discarded — no updated data is written
back. Reading CCFG after the write will stall the program until the invalidate
operation is complete.
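For example, a global invalidation of both L1 caches reduces to a read-modify-write of CCFG followed by a read. This is a sketch; the CCFG address is an assumption to check against the device datasheet.

#define CCFG (*(volatile unsigned int *)0x01840000)   /* assumed address */

void global_l1_invalidate(void)
{
    CCFG |= (1u << 9) | (1u << 8);  /* set IP (bit 9) and ID (bit 8) */
    (void)CCFG;                     /* read stalls until the invalidates finish */
}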
The block cache operations execute on a range of addresses that may be held
in the cache. Block operations execute in the background, allowing other
program accesses to interleave with the block cache operation.
Programs initiate block cache operations with two writes. The program first
writes the starting address to one of the base address registers (BAR), shown
in Figure 30. Next, the program writes the total number of words to operate on
to the corresponding word count register (WC), shown in Figure 31. The cache
operation begins as soon as the word count register is written with a non-zero
value. The cache provides a set of BAR/WC pseudo-register pairs, one for
each block operation the cache supports. The complete list of supported
operations is shown in Table 29.
Notice that the word count field in WC is only 16 bits wide. This limits the block
size to 65535 words (approximately 256K bytes). Larger ranges require
multiple block commands to be issued to cover the entire range.
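The protocol is the same for every BAR/WC pair, so a generic helper can encapsulate it, including the splitting of large ranges into 65535-word commands. A sketch, with the register addresses supplied by the caller:

void block_cache_op(volatile unsigned int *bar,  /* base address register */
                    volatile unsigned int *wc,   /* word count register   */
                    unsigned int addr, unsigned int word_count)
{
    while (word_count > 0) {
        unsigned int chunk = (word_count > 65535u) ? 65535u : word_count;

        *bar = addr;         /* write the starting address first...         */
        *wc  = chunk;        /* ...then the count; a nonzero write starts   */
        while (*wc != 0)
            ;                /* WC reads 0 when the operation completes     */

        addr       += chunk * 4u;  /* advance by the words just processed   */
        word_count -= chunk;
    }
}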
Although block operations specify the block in terms of a word address and
word count, the block operations always operate on whole cache lines. Whole
lines are always written back and/or invalidated in each affected cache. For
this reason, programs should be careful to align arrays on cache-line
boundaries, and to pad arrays to be a multiple of the cache line size. This is
especially true when invoking the invalidate-only commands with respect to
these arrays.
Figure 30. Block Cache Operation Base Address Register (BAR)

 31                                                      0
 |                     Base Address                       |
                          R/W-0

Figure 31. Block Cache Operation Word Count Register (WC)

 31              16  15                                   0
 |    Reserved      |             Word Count               |
        R-0                          R/W-0
Programs should not assume that the value of BAR is retained between block
cache operations. Programs should always write an address to BAR and a
word count to WC for each cache operation. Also, programs should not
assume that the various BAR and WC map to the same physical register. For
each cache operation, programs should write to BAR and WC for that
operation.
The L1P block invalidate can be used in conjunction with the L1D block
writeback-invalidate to provide software controlled coherence between L1D
and L1P. (Section 8.1 discusses the memory system coherence policies.) To
execute code that was previously written to the CPU, the program should use
L1PIBAR/L1PIWC to invalidate the block in L1P, and L1DWIBAR/L1DWIWC
to writeback-invalidate the block in L1D. These operations can be performed
in either order. The specific timing of these operations relative to program
fetches is not well defined. Therefore, programs should wait for L1DWIWC and
L1PIWC to read as zero prior to branching to an address range that has been
invalidated in this manner. (Note that the behavior of L1DIBAR/L1DIWC differs
on C621x/C671x devices. See TMS320C621x/C671x DSP Two-Level
Internal Memory Reference Guide, SPRU609, for details.)
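The following hedged C sketch shows this sequence for a freshly written
block of code. The register addresses are again assumptions patterned on the
C64x memory map, and run_loaded_code is a hypothetical helper rather than a
documented API.

#include <stdint.h>

/* Assumed addresses; verify against the device data sheet. */
#define L1DWIBAR (*(volatile uint32_t *)0x01844030)
#define L1DWIWC  (*(volatile uint32_t *)0x01844034)
#define L1PIBAR  (*(volatile uint32_t *)0x01844028)
#define L1PIWC   (*(volatile uint32_t *)0x0184402C)

typedef void (*entry_fn)(void);

void run_loaded_code(void *code, uint32_t nwords)
{
    L1DWIBAR = (uint32_t)code; /* push the newly written code from L1D to L2 */
    L1DWIWC  = nwords;
    L1PIBAR  = (uint32_t)code; /* discard any stale copy held in L1P */
    L1PIWC   = nwords;

    /* Both word counts must read zero before branching to the new code. */
    while (L1DWIWC != 0 || L1PIWC != 0)
        ;

    ((entry_fn)code)(); /* now safe to execute */
}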
Block cache operations in L2 can indirectly affect L1D and L1P, as noted in
Table 29. Section 7.3.3 discusses these interactions in detail.
Note:
Reads or writes to addresses within the block being operated on, issued while
a block cache operation is in progress, may prevent those addresses from
being written back or invalidated as requested. To avoid this, programs should not
access addresses within the range of cache lines affected by a block cache
operation while the operation is in progress. Programs may consult the
appropriate WC to determine when the block operation is complete.
Note:
The behaviors described in this section may change on future C6000
devices that implement the two-level memory subsystem. Future devices
may remove the inclusivity of L1D within L2, and may cause the L2 cache
operations to act on L1, regardless of the current L2 cache mode. Forward-
compatible programs should not rely on these specific memory system
behaviors.
Under normal circumstances, the L1D cache is inclusive in L2, and L1P is not.
Inclusive implies that the entire contents of L1D are also held either in
L2 SRAM or L2 cache. The L2 cache operations are designed with these
properties in mind.
Because L1P is not inclusive in L2, the L2 cache operations that invalidate
lines in L2 send explicit invalidation commands to L1P. A global
writeback-invalidate of L2 (L2WBINV) triggers a complete invalidation of L1P.
Block invalidate and writeback-invalidate operations in L2 blindly send
invalidate commands to L1P for the corresponding L1P cache lines. This
ensures that L1P always fetches the most recent contents of memory after the
cache operation is complete.
One result of this is that L2 SRAM addresses cached in L1D are not affected
by program-initiated cache operations in L2, as L2 cache never holds copies
of L2 SRAM. To remove L2 SRAM addresses from L1D, programs must use
the L1D block cache operations directly. Ordinarily, direct removal of L2 SRAM
addresses from L1D is required only when changing L2 cache modes. The
coherence policy described in section 8.1 eliminates most other cases in
which a program would need to manually write back portions of L1D to L2 SRAM.
Another result is nonintuitive behavior when L1D is not inclusive in L2. L1D is
inclusive in L2 under normal circumstances, and so most programs do not
need to be concerned about this situation. Indeed, the recommended L2
cache mode-change procedure in section 7.1.3 ensures that the memory
system is never in this state. If the procedure is not followed precisely,
however, it is possible for L1D to hold copies of external memory that are not
held in L2. This noninclusive state arises through the following rare
sequence (a sketch of the safeguard appears after the list):
- The program enabled caching for an external address range while L2 was
in all SRAM mode.
- The program then enabled L2 cache without first removing the external
address range from L1D with a block cache operation.
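A hedged sketch of the safeguard follows, reusing the L1DWIBAR/L1DWIWC
macros from the earlier sketch; ext_buf and ext_words are hypothetical names
for the externally cached range.

#include <stdint.h>

/* Evict an external address range from L1D before enabling L2 cache, so that
   L1D never holds lines that L2 does not (preserving inclusivity). */
void evict_external_range_from_l1d(const void *ext_buf, uint32_t ext_words)
{
    L1DWIBAR = (uint32_t)ext_buf; /* writeback-invalidate the range in L1D */
    L1DWIWC  = ext_words;
    while (L1DWIWC != 0)          /* wait for the block operation to finish */
        ;
    /* ...then change L2MODE following the procedure in section 7.1.3. */
}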
Programs should take care when changing the priority level for L2 requests in
order to ensure proper operation of the cache. The following sequence should
be followed:
1) Poll the EDMA priority queue status register (PQSR) and wait for the PQ
bit that corresponds to the current priority level to read as 1. PQSR is
described in TMS320C6000 DSP Enhanced Direct Memory Access
(EDMA) Peripheral Reference Guide (SPRU234).
This step may require that other transfers using this same priority queue,
such as externally-triggered EDMA transfers, be disabled. Otherwise, in a
heavily loaded system, the PQ bit in PQSR may not read as 1 for an
arbitrarily long period of time.
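A hedged sketch of this polling step follows. The PQSR address and the
bit-per-level assignment are assumptions; SPRU234 is the authoritative
reference for the EDMA register layout.

#include <stdint.h>

#define PQSR (*(volatile uint32_t *)0x01A0FFE4) /* assumed EDMA PQSR address */

/* Spin until the priority queue for the given level drains. Assumes PQ bit n
   corresponds to priority level n and reads 1 when the queue is empty. */
void wait_for_queue_empty(unsigned level)
{
    while ((PQSR & (1u << level)) == 0)
        ;
}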
Table 30 lists the default queue allocation. The L2ALLOC settings must also
take into account the current settings for PQAR so that no transfers are lost.
The correct procedure for modifying the L2ALLOC and PQAR settings is
described in TMS320C6000 DSP Enhanced Direct Memory Access (EDMA)
Peripheral Reference Guide (SPRU234).
Table 30. Default Queue Allocation

Priority Level   L2/QDMA Allocation Register   Default Allocation
Urgent           L2ALLOC0                      6
High             L2ALLOC1                      2
Medium           L2ALLOC2                      2
Low              L2ALLOC3                      2
If L1D blocks L2 for n consecutive cycles, then EDMA is given priority for a
single cycle.
This section discusses the various policies of the memory system, such as
coherence between CPU and EDMA or host accesses and the order in which
memory updates are made.
Cache memories work by retaining copies of data from lower levels of the
memory hierarchy in the hierarchy’s higher levels. This provides a
performance benefit as higher levels of the memory hierarchy may be
accessed more quickly than the lower levels, and so the lower levels need not
be consulted on every access. Because many accesses to memory are
captured at higher levels of the hierarchy, the opportunity exists for the CPU
and other devices to see a different picture of what is in memory.
A memory system is coherent if all requestors into that memory see updates
to individual memory locations occur in the same order. A requestor is a device
such as a CPU or EDMA. A coherent memory system ensures that all writes
to a given memory location are visible to future reads of that location from any
requestor, so long as no intervening write to that location overwrites the value.
If the same requestor is writing and reading, the results of a write are
immediately visible. If one requestor writes and a different requestor reads,
the reader may not see the updated value immediately, but it will see the
update after a sufficient period of time. Coherence also implies that all
writes to a given memory location appear to be serialized. That is, all
requestors see the same order of writes to a memory location, even when
multiple requestors are writing to that location.
Notice that the hardware ensures that accesses by the CPU and EDMA to
internal SRAM addresses are coherent, but accesses to external addresses are not.
Software must ensure external addresses accessed by the EDMA are not held
in cache when the EDMA accesses them. Failure to do so can result in data
corruption on the affected range of addresses. See section 7.3 for the steps
required to ensure particular ranges of addresses are not held in cache.
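For illustration, a hedged sketch of the two directions of this
software-managed coherence follows; it reuses l2_block_writeback_invalidate
from section 7.3, and the L2IBAR/L2IWC addresses are likewise assumptions.

#include <stdint.h>

/* Assumed addresses for the L2 invalidate BAR/WC pair. */
#define L2IBAR (*(volatile uint32_t *)0x01844018)
#define L2IWC  (*(volatile uint32_t *)0x0184401C)

extern void l2_block_writeback_invalidate(const void *addr, uint32_t nwords);

/* CPU wrote buf and the EDMA will read it: push the updates out to external
   memory before starting the transfer. */
void flush_before_edma_read(const void *buf, uint32_t nwords)
{
    l2_block_writeback_invalidate(buf, nwords);
}

/* EDMA wrote buf and the CPU will read it: discard any stale cached copies
   before the first CPU read. */
void invalidate_before_cpu_read(const void *buf, uint32_t nwords)
{
    L2IBAR = (uint32_t)buf;
    L2IWC  = nwords;
    while (L2IWC != 0)
        ;
}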
Also notice that CPU data and program accesses are not coherent. The
reason that these are not considered coherent is that the L1P does not query
the L1D when it makes a program fetch, and thus CPU writes captured in L1D
may not be visible to L1P for an arbitrarily long period of time. Therefore,
programs that write to locations that are subsequently executed from must
ensure that the updated data is written from L1D to at least L2 before
execution. The L1DWIBAR/L1DWIWC and L1PIBAR/L1PIWC registers described in
section 7.3 can be used for this purpose.
Although the memory system is coherent, it does not ensure that all requestors
see updates to cacheable memory occurring in the same order. This is of
primary importance when a given buffer of memory is accessed by the CPU
and an EDMA or host port access. Unless care is taken, it is possible for an
EDMA or host port access to see a mixture of old and new data if the access
occurs while the CPU updates are being processed by the memory system.
Section 8.3 discusses the order in which CPU accesses are made visible to
the memory system.
Graphically, the time sequence for the internal input and output buffers would
look as shown in Figure 34. EDMA reads are one step ahead of the processing
and EDMA writes are one step behind. In Figure 34, step 2 operates in parallel
with steps 3 and 4; steps 5 and 6 overlap with steps 7 and 8.
[Figure 34. Time sequence for the internal input and output buffers. InBuff
and OutBuff each alternate between PING and PONG halves, switching roles in
the second half of each while-loop iteration; iterations 1 through 6 show the
EDMA read (RD), CPU processing (PROC), and EDMA write (WR) stages overlapping
across neighboring iterations.]
3) The CPU reads InBuff 0. These lines were snoop-invalidated from L1D in
step 1. Therefore, these accesses miss L1D, forcing L1D to read the new
data from L2 SRAM.
5) The EDMA reads OutBuff 0. The EDMA reads in L2 trigger a snoop for
each cache line held in L1D. Any dirty data for the line is written to L2
before the EDMA’s read is processed. Thus, the EDMA sees the most
up-to-date data. The line is marked clean and is left valid. (On
C621x/C671x devices, the line is subsequently invalidated in L1D.)
6) The EDMA writes to InBuff 0. This step proceeds identically to step 1. That
is, InBuff 0 is snoop-invalidated from L1D as needed. The EDMA writes
are processed after any dirty data is written back to L2 SRAM.
The C6000 DSP cores may initiate up to two parallel memory operations per
cycle. The program order of memory accesses defines the outcome of
memory accesses in terms of a hypothetical serial implementation of the
architecture. That is, it describes the order in which parallel memory
operations are treated as being processed, so that time-sequence terms such
as earlier and later have a precise meaning with respect to a particular
sequence of operations.
Memory accesses initiated from different execute packets have the same
temporal ordering as the execute packets themselves. That is, in the defined
program order, memory operations issued on cycle i are always earlier than
memory accesses issued on cycle i + 1, and are always later than those issued
on cycle i - 1.
For accesses issued in parallel, the type of operations (reads or writes), and
the data address ports that execute the operations determine the ordering.
Table 32 describes the ordering rules.
Table 32. Program Order for Memory Operations Issued From a Single Execute Packet
The data address port for a load or store instruction is determined by the
datapath that provides the data (as opposed to the address) for the memory
operation. Load and store instructions that operate on data in the A datapath
use DA1. Load and store instructions that operate on data in the B datapath
use DA2. Note that the datapath that provides the data to be operated on
determines whether DA1 or DA2 is used. The datapath that provides the
address for the access is irrelevant.
The C64x DSP supports nonaligned memory accesses using the LDNW, STNW,
LDNDW, and STNDW instructions. The memory system does not assure that
these accesses will be atomic. Rather, it may divide
the accesses for these instructions into multiple operations. The program
order of memory accesses does not define the order of the individual memory
operations that comprise a single nonaligned access. The program order only
defines how the entire nonaligned access is ordered relative to earlier and later
accesses. So, although the complete nonaligned access does follow the
program order defined above with respect to the CPU itself, other requestors
may see the nonaligned memory access occur in pieces.
The previous definition describes the memory system semantics. The memory
system assures that the semantics of the program order of memory accesses
will be retained for CPU accesses relative to themselves. The memory system
may, however, relax the ordering of operations as they are executed within the
memory hierarchy so long as the correct semantics are retained. It may also
allow other requestors to the memory system to see the accesses occur in an
order other than the original program order. Section 8.3.2 describes this in
detail.
Memory system coherence implies that writes to a single memory location are
serialized for all requestors, and that all requestors see the same sequence
of writes to that location. Coherence does not make any implications about the
ordering of accesses to different locations, or the ordering of reads with
respect to other reads of the same location. Rather, the memory system
ordering rules (strong or relaxed) describe the ordering assurances applied to
accesses to different locations.
For locations in L2 SRAM that are not within the same cache line, strong
ordering is provided only on writes and only as long as the addresses involved are
not present in L1D. This can be ensured by using the L1DWIBAR and
L1DWIWC control registers described in section 7.3.2. In all other cases, a
relaxed ordering is provided for CPU accesses to L2 SRAM.
Table 33 lists the changes made since the previous version of this document.