
DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs

Daewook Kim, Manho Kim and Gerald E. Sobelman Department of Electrical and Computer Engineering University of Minnesota, Minneapolis, MN 55455 USA Email: {daewook,mhkim,sobelman}@ece.umn.edu
Abstract: Shared memory is a common inter-processor communication paradigm for on-chip multiprocessor SoC (MPSoC) platforms. The latency overhead of switch-based interconnection networks plays a critical role in shared memory MPSoC designs. In this paper, we propose a directory-cache embedded switch architecture with distributed shared cache and distributed shared memory. It is able to reduce the number of home node cache accesses, which results in a reduction in the inter-cache transfer time and the total execution time. Simulation results verify that the proposed methodology can improve performance substantially over a design in which directory caches are not embedded in the switches.

I. INTRODUCTION

Rapid advances in silicon and parallel processing technologies have made it possible to build multiprocessor systems-on-chip (MPSoCs). In particular, packet-switched MPSoCs [1], also called networks-on-chip (NoCs) [2], are becoming increasingly attractive platforms due to their better scalability, higher data throughput, flexible IP reuse, and their avoidance of the clock skew problems associated with bus-based on-chip interconnection schemes. Distributed shared memory (DSM) [3] and distributed shared cache (DSC) [4] are architectural approaches that allow multiprocessors to support a single shared address space implemented with physically distributed memory. A DSM or DSC multiprocessor platform is also called a non-uniform memory access (NUMA) [5] or non-uniform cache architecture (NUCA) [6] platform, since the access time depends on the physical location of a data word in memory or cache. Coherence protocols allow such architectures to use caching to exploit temporal and spatial locality without changing the programmer's model of memory or cache.

While interconnection networks provide the basic mechanisms for communication, shared address space processors require additional hardware to keep multiple copies of data consistent with each other. Specifically, if multiple copies of the same data exist in different caches or memories, we must ensure that every processor uses the freshest value. Snoopy cache coherence is typically associated with MPSoC systems built on broadcast interconnects such as a bus or a ring: all attached processors snoop on the bus for transactions and can thereby maintain coherence. However, this snooping protocol [5] is not scalable, and it generates heavy bus traffic for coherence maintenance.

An obvious solution to this problem is to propagate coherence operations only to the processors that must participate in them. This requires keeping track of which processors hold copies of each data value, together with the relevant state information. This state information is stored in a structure called the directory, and the cache coherence scheme based on it is called directory cache coherence. In a distributed shared memory MPSoC that connects all processors through switches, the directory cache coherence scheme can be applied. In the conventional directory cache protocol, each directory resides in a distributed shared memory bank or distributed L2 cache bank and contains an entry for each memory or cache block. An entry points to the exact locations of every cached copy of the block and maintains its status for future reference. The classical full-map directory scheme proposed by Censier [7] uses an invalidation approach and allows multiple unmodified cached copies of the same block to exist in the system. In such a system, each directory, together with its distributed shared memory or cache, is distributed among all nodes so as to provide a nearby local memory or cache and several remote memories. While local memory access latencies can be tolerated, the remote memory accesses generated during execution can degrade application performance.

In this paper, we present a method to mitigate the impact of remote memory access latency. We propose a switch architecture for low-latency cache coherence in a distributed shared memory MPSoC platform, which we denote DCOS (Directory Cache On a Switch). The proposed architecture was applied to our MPSoC platform, which features packet-switched cache-coherent NUMA and NUCA in an on-chip system, as shown in Figure 1. We have tested and evaluated this architecture using the RSIM [8] distributed shared memory MPSoC simulator; some core parts of the simulator were modified, and new directory cache modules with a crossbar switch were added in order to model our system. Simulations of applications from the SPLASH-2 benchmark suite [9] were performed. The results show a substantial reduction in average read latency and execution time compared to a platform in which directory caches are not embedded in the switches.
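To make the latency argument concrete, the toy model below contrasts a conventional three-leg read miss (requester to home-node directory, home node to owner, owner back to requester) with a DCOS-style hit in a switch-resident directory that lets the request reach the owner directly. It is a minimal C++ sketch; the hop counts, cycle costs, and function names are illustrative assumptions, not the paper's analytical model or measured values.

    // Toy latency model (illustration only, not the authors' analytical model).
    // Assumption: without an on-switch directory, a remote read miss travels
    // requester -> home-node directory -> owner -> requester; a DCOS-style hit
    // in the switch-resident directory skips the home-node visit.
    #include <iostream>

    struct NetParams {
        int hop_cycles;        // per-switch traversal cost (placeholder value)
        int dir_lookup_cycles; // home-node directory access cost (placeholder value)
    };

    int missLatencyBaseline(const NetParams& p, int hops_req_to_home,
                            int hops_home_to_owner, int hops_owner_to_req) {
        return p.hop_cycles * (hops_req_to_home + hops_home_to_owner + hops_owner_to_req)
             + p.dir_lookup_cycles;
    }

    int missLatencyDcosHit(const NetParams& p, int hops_req_to_owner,
                           int hops_owner_to_req) {
        // The embedded directory is consulted while routing, so no separate
        // home-node directory access is charged in this simplified model.
        return p.hop_cycles * (hops_req_to_owner + hops_owner_to_req);
    }

    int main() {
        NetParams p{4, 15};  // placeholder cycle counts, not taken from the paper
        std::cout << missLatencyBaseline(p, 2, 3, 2) << " vs "
                  << missLatencyDcosHit(p, 3, 2) << " cycles\n";
    }

With the placeholder numbers used here, the direct path saves both the home-node directory access and one leg of the round trip, which is the effect the results in Section V quantify.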

Fig. 1. DCOS architecture based MPSoC platform with (a) on-chip shared memory and (b) off-chip shared memory.

II. SWITCH-BASED MPSOC WITH DCOS

Figure 1(a) shows a packet-switched MPSoC platform with on-chip shared memory, and Figure 1(b) shows the platform with off-chip shared memory. The proposed DCOS architecture was implemented within each switch to reduce cache-to-cache data transfer time, and the switches were connected in a 4x2 2D mesh topology. Wormhole routing was adopted as the packet switching methodology. For our simulations, we used the MIPS R10000 core model supported by RSIM, as shown in Figure 2. The main features of the MIPS R10000 include superscalar execution, out-of-order scheduling, register renaming, static and dynamic branch prediction, and non-blocking memory load and store operations. The first-level cache can be either a write-through cache with a no-allocate policy on writes or a write-back cache with a write-allocate policy. The second-level shared cache is a write-back cache with write-allocate. The directory cache coherence protocol we adopted in our switch is the modified-shared-invalid (MSI) protocol [10]. The protocol was also implemented within each distributed shared memory bank and distributed shared L2 cache in order to compare against the DCOS-based platform and verify its performance. The detailed DCOS MSI protocol and switch architecture are shown in Figure 3 and Figure 4.
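The paper fixes wormhole switching on a 4x2 mesh but does not state the routing algorithm. Dimension-order (XY) routing is a common, deadlock-free choice for such a mesh and is sketched below purely as an assumption about how head flits could be steered.

    // Dimension-order (XY) routing on the 4x2 mesh, shown only as a plausible
    // choice; the routing algorithm itself is not specified in the paper.
    #include <cstdio>

    enum class Port { Local, East, West, North, South };

    struct Node { int x; int y; };  // x in [0,3], y in [0,1] for a 4x2 mesh

    // Decide the output port for a head flit: correct X first, then Y.
    Port xyRoute(Node cur, Node dst) {
        if (dst.x > cur.x) return Port::East;
        if (dst.x < cur.x) return Port::West;
        if (dst.y > cur.y) return Port::North;
        if (dst.y < cur.y) return Port::South;
        return Port::Local;  // arrived: deliver to the attached node
    }

    int main() {
        Node cur{0, 0}, dst{3, 1};
        for (Port p = xyRoute(cur, dst); p != Port::Local; p = xyRoute(cur, dst)) {
            if (p == Port::East)  ++cur.x;
            if (p == Port::West)  --cur.x;
            if (p == Port::North) ++cur.y;
            if (p == Port::South) --cur.y;
            std::printf("hop to (%d,%d)\n", cur.x, cur.y);
        }
    }

Each switch applies this decision only to the head flit; under wormhole switching, the remaining flits of the packet simply follow the path the header has reserved.
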
Fig. 2. MIPS R10000 core block diagram.

III. DIRECTORY CACHE ON A SWITCH FOR SHARED L2$ AND SHARED MEMORY

A. DCOS Cache Coherence and Caching Flow

Systems for directory-based cache coherence combine distributed shared memory architectures with scalable cache coherence mechanisms.

We exploit a full-map directory cache coherence protocol in our DCOS architecture. In this scheme, the directory resides in main memory and contains an entry for each memory block. An entry points to the exact locations of every cached copy of a memory block and maintains its status. With this information, the directory preserves the coherence of data in each distributed shared cache bank by sending directed messages to known locations, avoiding expensive broadcasts.

Figure 3 gives a simple example of the data-sharing flow for DCOS cache coherence. A presence bit of 1, for example, denotes that cache #1 holds a copy of the memory block. The state entry assigned to a memory block holds the current state of the block: empty, shared, or modified/invalid. Figure 3(a) presents the initial status of the DCOS directory cache, in which no data items have yet been copied into caches or memories; the directory entries are empty, so the state tag shows empty, marked E. Figure 3(b) presents the case where data is shared with other caches and memories. The bold arrow pointing from the shared memory bank to the L1 cache of core 2 indicates the case where the shared data is not available in the shared L2 cache due to a write-miss. Figure 3(c) shows that a write request by one processor leads to the invalidation of all other copies in the processor caches; the new data value is stored in the caches or memories and the entry state is changed to modified, marked M. The bold arrow pointing from the shared memory bank to the L1 cache of core 2 in Figure 3 also shows the direct data invalidation case when the shared L2 cache does not have the data due to a write-miss.

B. Directory Cache Embedded Switch Architecture

Figure 4 shows the overall DCOS architecture block diagram, including our proposed directory caches. The directory caches for both the shared memory bank and the shared L2 cache bank are embedded within the crossbar switch. The cache dir update and memory dir update inputs to the directory cache module carry the signals required for the DCOS to update whenever the data states of the attached node cache and node memory change. This information is sent to the arbiter through the directory controller, which enables efficient routing to the destination node. It reduces cache-to-cache transfer time and results in an overall performance improvement in terms of packet latency and execution time.
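The sketch below restates the full-map entry and the read and write transitions of Figure 3 in C++, together with the forwarding decision that a switch-resident copy of an entry enables. The field widths, names, and the 8-node presence vector are illustrative assumptions, not the authors' RTL or RSIM data structures.

    // Full-map directory entry and the MSI-style transitions of Figure 3, plus
    // the forwarding decision an on-switch copy of the entry enables.
    #include <bitset>
    #include <cstdint>
    #include <optional>

    constexpr int kNodes = 8;  // the 4x2 platform of Figure 1 (assumed node count)

    enum class DirState { Empty, Shared, Modified };  // the E / S / M tags of Figure 3

    struct DirEntry {
        std::uint32_t block_addr = 0;       // "shared data address"
        DirState state = DirState::Empty;
        std::bitset<kNodes> presence;       // bit i set => cache i holds a copy
    };

    // Read request: record the requester as a sharer.
    void onRead(DirEntry& e, int requester) {
        e.presence.set(requester);
        e.state = DirState::Shared;
    }

    // Write request: invalidate every other cached copy, then mark the block
    // modified. Directed invalidations go only to caches whose bit is set.
    void onWrite(DirEntry& e, int writer) {
        for (int i = 0; i < kNodes; ++i)
            if (i != writer && e.presence.test(i))
                e.presence.reset(i);        // stands in for sending an invalidate
        e.presence.set(writer);
        e.state = DirState::Modified;
    }

    // DCOS idea: if the switch's embedded directory knows the current owner of
    // a modified block, forward the request straight to that node instead of
    // detouring through the home node's directory.
    std::optional<int> forwardTarget(const DirEntry& e) {
        if (e.state != DirState::Modified) return std::nullopt;  // fall back to home node
        for (int i = 0; i < kNodes; ++i)
            if (e.presence.test(i)) return i;
        return std::nullopt;
    }

    int main() {
        DirEntry e{0x1000};
        onRead(e, 2);   // core 2 reads: the block becomes Shared
        onWrite(e, 1);  // core 1 writes: other copies invalidated, block Modified
        return forwardTarget(e).value_or(-1) == 1 ? 0 : 1;
    }

The point of embedding such entries in the switch is that a lookup like forwardTarget can be evaluated while the request is being routed, so a hit avoids the detour through the home node's directory.
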

Fig. 3. Full-map directory cache coherent MSI protocol for DCOS: (a) empty, (b) shared, and (c) modified/invalid.

TABLE I
SIMULATION PARAMETERS

Processor: architecture model MIPS R10000; speed 1 GHz; max. fetch/retire rate 4; instruction window 64; functional units 2 integer arithmetic, 2 floating point; memory queue size 32 entries.
L1 cache: line size 16 bytes; write-through, direct-mapped, 32 KB; request ports 2; hit time 2 cycles.
Shared L2 cache bank: line size 64 bytes; write-back, direct-mapped, 128 KB; hit time 15 cycles, pipelined.
Shared memory: access time 70 cycles (70 ns); interleaving 4-way.
Directory cache in switch: shared L2$ directory bank size 32 entries; shared memory directory size 512 / 1024 / 2048 entries.
On-chip switched network: topology 2D mesh (4x2); flit size 8 bytes; switch speed 250 MHz; channel speed 500 MHz; channel width 32 bits.

IV. SIMULATION ENVIRONMENT

We have used the RSIM simulator for distributed shared memory multiprocessor systems. Some core parts of the simulator, written in C++, were modified for a shared L2 cache environment, and the proposed directory cache module was added to the default switch block. Table I summarizes the parameters of the simulated platform. The application programs used in our evaluations are FFT, Radix, Ocean, and Barnes from the SPLASH-2 benchmark suite; the input data sizes are shown in Table II.

V. SIMULATION RESULTS AND ANALYSIS

In this section, we present and analyze the performance results obtained through extensive simulations and evaluate the impact of the DCOS architecture. The main objective of the switch directory caches is to reduce the number of cache-to-cache transfers and the total execution time. Figure 5 shows that both the cache-to-cache transfer time and the execution time of the DCOS-based schemes were substantially reduced relative to the non-DCOS scheme in terms of total consumed clock cycles. The parameter we varied was the size of the shared memory directory cache on a switch, from 512 to 2048 entries, while fixing the size of the shared L2 directory cache at 32 entries.
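For a rough sense of the storage these directory capacities imply, the sketch below computes per-configuration sizes under assumed per-entry fields (an address tag, a 2-bit state, and one presence bit per node). The field widths are not given in the paper, so the resulting figures are only ballpark estimates.

    // Back-of-the-envelope storage for the on-switch directory caches of Table I.
    // Per-entry field widths are assumptions (the paper does not give them).
    #include <cstdio>
    #include <initializer_list>

    constexpr int kNodes     = 8;    // 4x2 platform
    constexpr int kTagBits   = 26;   // 32-bit address minus 6 block-offset bits (assumed)
    constexpr int kStateBits = 2;    // E / S / M
    constexpr int kEntryBits = kTagBits + kStateBits + kNodes;

    constexpr double entryStorageBytes(int entries) { return entries * kEntryBits / 8.0; }

    int main() {
        for (int entries : {512, 1024, 2048})
            std::printf("%4d-entry shared memory directory -> ~%.1f KB\n",
                        entries, entryStorageBytes(entries) / 1024.0);
        std::printf("  32-entry shared L2$ directory      -> ~%.0f bytes\n",
                    entryStorageBytes(32));
    }

Under these assumptions, even the largest 2048-entry configuration stays below roughly 9 KB per switch.
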

TABLE II
BENCHMARK APPLICATIONS AND INPUT SIZES

FFT: 32K
Radix: 1M keys, 1024 radix
Ocean: 100 x 100 grid
Barnes-Hut: 2048 bodies

As shown in Figure 5(a), the total execution time for each benchmark application was reduced further as the size of the on-switch directory cache was increased. Comparing the execution time of the non-DCOS design to DCOS with 2048 entries, the overheads for FFT, Radix, Ocean, and Barnes were reduced by 43.1%, 28.7%, 21.4%, and 27.9%, respectively. In addition, as shown in Figure 5(b), the total cache-to-cache transfer time from home node to local node was analyzed to determine the impact of the DCOS scheme. The cache-to-cache transfer time also decreased as the size of the shared memory directory cache on a switch was increased. Comparing non-DCOS to DCOS with 2048 entries, the cache-to-cache transfer time for FFT, Radix, Ocean, and Barnes was reduced by 35.8%, 63.2%, 30.8%, and 43.2%, respectively. Based on these two performance metrics across the four benchmark programs, a substantial performance improvement is obtained with the proposed DCOS scheme.

Fig. 4. DCOS architecture block diagram.

Fig. 5. Simulation results: (a) execution time [10^4 cycles] and (b) cache-to-cache transfer time [cycles], for the non-DCOS design and for DCOS with 512-, 1024-, and 2048-entry on-switch shared memory directory caches, across the four benchmark applications.

VI. CONCLUSIONS

We have presented a novel directory-cache embedded switch architecture with distributed shared cache and distributed shared memory. This scheme reduces the number of home node cache accesses, which in turn reduces inter-cache transfer time and total execution time. Simulation results verify that the proposed methodology can improve performance substantially over a platform in which directory caches are not embedded in the switches.

ACKNOWLEDGMENTS

We thank Sangwoo Rhim, Bumhak Lee, and Euiseok Kim of the Samsung Advanced Institute of Technology (SAIT) for their help with this manuscript. This research was supported by a grant from SAIT.

REFERENCES

[1] M. D. Nava, P. Blouet, P. Teninge, M. Coppola, T. Ben-Ismail, S. Picchiottino, and R. Wilson, "An open platform for developing multiprocessor SoCs," Computer, vol. 38, no. 7, pp. 60-67, 2005.
[2] L. Benini and G. De Micheli, "Networks on chip: a new paradigm for systems on chip design," in Proc. Design, Automation and Test in Europe Conf., Jan. 2002, pp. 418-419.
[3] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, chapter 6, Aug. 2003.
[4] J. Ahn, K. Lee, and H. Kim, "Architectural issues in adopting distributed shared memory for distributed object management systems," in Proc. Fifth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, Aug. 1995, pp. 294-300.
[5] J. Hennessy, M. Heinrich, and A. Gupta, "Cache-coherent distributed shared memory: perspectives on its development and future challenges," Proceedings of the IEEE, vol. 87, pp. 418-429, Mar. 1999.
[6] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in Proc. 10th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002, pp. 211-222.
[7] L. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Transactions on Computers, vol. C-27, no. 12, pp. 1112-1118, 1978.
[8] C. Hughes, V. Pai, P. Ranganathan, and S. Adve, "RSIM: simulating shared-memory multiprocessors with ILP processors," Computer, vol. 35, pp. 40-49, 2002.
[9] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. 22nd Annual International Symposium on Computer Architecture, 1995, pp. 24-36.
[10] D. Culler and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, chapter 5.
