DCOS: Cache Embedded Switch Architecture for Distributed Shared Memory Multiprocessor SoCs
Daewook Kim, Manho Kim and Gerald E. Sobelman
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA
Email: {daewook,mhkim,sobelman}@ece.umn.edu
Abstract: Shared memory is a common inter-processor communication paradigm for on-chip multiprocessor SoC (MPSoC) platforms. The latency overhead of switch-based interconnection networks plays a critical role in shared memory MPSoC designs. In this paper, we propose a directory-cache embedded switch architecture with distributed shared cache and distributed shared memory. It reduces the number of home node cache accesses, which results in a reduction in inter-cache transfer time and total execution time. Simulation results verify that the proposed methodology can improve performance substantially over a design in which directory caches are not embedded in the switches.
I. INTRODUCTION
Rapid advances in silicon and parallel processing technologies have made it possible to build multiprocessor systems-on-chip (MPSoCs). In particular, packet-switched MPSoCs [1], which are called networks-on-chip (NoCs) [2], are becoming increasingly attractive platforms due to their better scalability, higher data throughput, flexible IP reuse, and freedom from the clock skew problems associated with bus-based on-chip interconnection schemes. Distributed shared memory (DSM) [3] or distributed shared cache (DSC) [4] is an architectural approach which allows multiprocessors to support a single shared address space that is implemented with physically distributed memory. A DSM or DSC multiprocessor platform is also called non-uniform memory access (NUMA) [5] or non-uniform cache architecture (NUCA) [6], since the access time depends on the physical location of a data word in memory or cache. Coherence protocols allow such architectures to use caching in order to take advantage of temporal and spatial locality without changing the programmer's model of memory or cache. While interconnection networks provide basic mechanisms for communication, shared address space processors require additional hardware to keep multiple copies of data consistent with each other. Specifically, if multiple copies of a data item exist in different caches or memories, we must ensure that every processor is using the freshest data. Snoopy cache coherence is typically associated with MPSoC systems based on broadcast interconnection networks such as a bus or a ring. In bus-based interconnection systems, all attached processors snoop on the bus for transactions and can thereby maintain cache coherence. However, this snooping protocol [5] is not scalable, and it generates serious bus traffic for cache coherence. An obvious solution to this problem is
to propagate coherence operations only to those processors that must participate in them. This requires keeping track of which processors hold copies of each data value, along with the relevant state information. This state information is stored in a structure called the directory, and the cache coherence scheme based on such information is called directory cache coherence. In a distributed shared memory MPSoC that connects all the processors through switches, the directory cache coherence scheme can be applied. In the conventional directory cache protocol, each directory resides in a distributed shared memory bank or distributed L2 cache bank and contains an entry for each memory or cache block. An entry points to the exact locations of every cached copy of a memory block and maintains its status for future reference. The classical full-map directory scheme proposed by Censier and Feautrier [7] uses an invalidation approach and allows multiple unmodified cached copies of the same block to exist in the system. However, in such a system, each directory, together with its distributed shared memory or cache, is distributed among all nodes so that every node has a nearby local memory or local cache and several remote memories. While local memory access latencies can be tolerated, the remote memory accesses generated during execution can reduce application performance. In this paper, we present a method to mitigate the impact of remote memory access latency. We propose a switch architecture for low-latency cache coherence in a distributed shared memory MPSoC platform, which we denote as DCOS (Directory Cache On a Switch). The proposed architecture was applied to our MPSoC platform, which features packet-switched cache-coherent NUMA and NUCA in an on-chip system, as shown in Figure 1. We have tested and evaluated this architecture using the RSIM [8] distributed shared memory MPSoC simulator. Some core parts of the simulator were modified, and new directory cache modules with a crossbar switch were added in order to model our system. Simulations of the SPLASH-2 benchmark suite [9] were performed. The results show a substantial reduction of average read latency and execution time compared to a platform in which directory caches are not embedded into the switches.

II. SWITCH-BASED MPSOC WITH DCOS
Figure 1(a) shows a packet-switched MPSoC platform with on-chip shared memory. Figure 1(b) shows the platform with
off-chip shared memory. The proposed DCOS architecture was implemented within each switch to reduce the cache-to-cache data transfer time, and the switches were connected in a 4 x 2 2D mesh topology. Wormhole routing was adopted as the packet switching methodology. For our simulations, we used the MIPS R10000 core model supported by RSIM, as shown in Figure 2. The main features of the MIPS R10000 include superscalar execution, out-of-order scheduling, register renaming, static and dynamic branch prediction, and non-blocking memory load and store operations. The first-level cache can be either a write-through cache with a no-allocate policy on writes or a write-back cache with a write-allocate policy. The second-level shared cache is a write-back cache with write-allocate. The directory cache coherence protocol we adopted in our switch is the modified-shared-invalid (MSI) protocol [10]. For comparison, the protocol was also implemented within each distributed shared memory bank and distributed shared L2 cache bank of a baseline platform whose switches do not contain directory caches, in order to verify the performance of the DCOS-based platform. The detailed DCOS MSI protocol and switch architecture are shown in Figure 3 and Figure 4.
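Since the switches form a small 2D mesh and use wormhole switching, the path a header flit takes can be illustrated with a short sketch. The paper does not state the routing algorithm, so the dimension-order (XY) routing below, together with the route_xy helper and the port names, is only an assumption chosen to show how a request would be steered hop by hop across a 4 x 2 mesh; it is not our switch implementation.

```cpp
#include <cstdio>

// Dimension-order (XY) route computation on a 4 x 2 mesh of switches.
// XY routing is assumed here purely for illustration.
enum class Port { Local, East, West, North, South };

struct NodeId { int x; int y; };  // x in [0,3], y in [0,1] for a 4 x 2 mesh

Port route_xy(NodeId cur, NodeId dst) {
  if (dst.x > cur.x) return Port::East;   // resolve the X dimension first
  if (dst.x < cur.x) return Port::West;
  if (dst.y > cur.y) return Port::North;  // then resolve the Y dimension
  if (dst.y < cur.y) return Port::South;
  return Port::Local;                     // arrived: eject to the attached core
}

int main() {
  NodeId cur{0, 0}, dst{3, 1};
  // Walk the header flit from switch (0,0) to switch (3,1); in wormhole
  // switching the body flits simply follow the path the header reserves.
  while (true) {
    Port p = route_xy(cur, dst);
    if (p == Port::Local) break;
    if (p == Port::East) ++cur.x;
    else if (p == Port::West) --cur.x;
    else if (p == Port::North) ++cur.y;
    else --cur.y;
    std::printf("hop to (%d,%d)\n", cur.x, cur.y);
  }
}
```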
Fig. 1. DCOS architecture based MPSoC platform with (a) on-chip shared memory and (b) off-chip shared memory.

Fig. 2. Block diagram of the MIPS R10000 processor core model.
III. DIRECTORY CACHE ON A SWITCH FOR SHARED L2$ AND SHARED MEMORY
A. DCOS Cache Coherence and Caching Flow
Systems for directory-based cache coherence combine distributed shared memory architectures with scalable cache
coherence mechanisms. We exploit a full-map directory cache coherence protocol in our DCOS architecture. In this scheme, the directory resides in main memory and contains an entry for each memory block. An entry points to the exact locations of every cached copy of a memory block and maintains its status. With this information, the directory preserves the coherence of data in each distributed shared cache bank by sending directed messages to known locations, avoiding expensive broadcasts. Figure 3 describes a simple example of the data-sharing flow for DCOS cache coherence. For example, presence bit 1 being set denotes that cache #1 holds a copy of the memory block. The state entry assigned to a memory block holds the current state of the block: empty, shared, or modified/invalid. Figure 3(a) presents the initial status of the DCOS directory cache, in which no data items have been copied into any cache or memory. The directory entries are empty, and therefore the state tag shows empty, marked as E. Figure 3(b) presents the case where a data block is shared with other caches and memories. The bold arrow pointing from the shared memory bank to the L1 cache of core 2 indicates the case where the shared data is not available in the shared L2 cache due to a write-miss. Figure 3(c) shows that a write request from one processor leads to the invalidation of all other copies in the processor caches. The new data value is stored in the caches or memories, and the entry state is changed to the modified state, marked as M. The bold arrow pointing from the shared memory bank to the L1 cache of core 2 in Figure 3(c) also shows the direct data invalidation case when the shared L2 cache does not have the data due to a write-miss.
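To make the directory bookkeeping concrete, the following C++ fragment sketches one full-map entry and the write-invalidation step just described. It is illustrative only and is not code from RSIM or from our switch; the entry layout (a state tag plus one presence bit per cache) and the handle_write and send_invalidate names are assumptions used to show how directed invalidations go only to the caches whose presence bits are set, rather than being broadcast.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>

constexpr int kNumCaches = 8;  // one L1 cache per core (assumption)

enum class DirState : uint8_t { Empty, Shared, Modified };  // E / S / M states

// Hypothetical full-map directory entry for one memory or L2 cache block.
struct DirEntry {
  DirState state = DirState::Empty;
  std::bitset<kNumCaches> presence;  // bit i set => cache i holds a copy
};

// Stand-in for the network primitive that carries a directed invalidation.
void send_invalidate(int cache_id) { std::printf("invalidate cache %d\n", cache_id); }

// Serve a write request from core 'writer': invalidate only the caches whose
// presence bits are set, then mark the block Modified with a single owner.
void handle_write(DirEntry& entry, int writer) {
  for (int i = 0; i < kNumCaches; ++i) {
    if (entry.presence[i] && i != writer) {
      send_invalidate(i);
      entry.presence.reset(i);
    }
  }
  entry.presence.set(writer);
  entry.state = DirState::Modified;
}

int main() {
  DirEntry blk;
  blk.state = DirState::Shared;
  blk.presence.set(1);   // caches 1 and 2 currently share the block
  blk.presence.set(2);
  handle_write(blk, 0);  // core 0 writes: caches 1 and 2 receive invalidations
}
```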
Fig. 3. Full-map directory cache coherent MSI protocol for DCOS: (a) empty, (b) shared, and (c) modified/invalid.

B. Directory Cache Embedded Switch Architecture
Figure 4 shows the overall DCOS architecture block diagram, including our proposed directory caches. All the directory caches for both the shared memory bank and the shared L2 cache bank are embedded within the crossbar switch. The cache dir update and memory dir update inputs to the directory cache module are the signals that allow the DCOS to update its entries whenever the data states of the attached node
cache and node memory are changed. This information is sent to the arbiter through the directory controller, which enables efficient routing to the destination node. This reduces the cache-to-cache transfer time and results in an overall performance improvement in terms of packet latency and execution time.
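A rough software analogy of this lookup is sketched below. The class and function names (SwitchDirCache, route_read_request) are hypothetical, and the structure is an assumption based on the description above rather than the actual switch hardware: the point is that a hit in the on-switch directory cache lets a read request be forwarded toward a node that already holds the block, while a miss falls back to the conventional home-node directory access.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>

// Hypothetical entry kept by the directory cache embedded in a DCOS switch.
struct SwitchDirEntry {
  uint8_t state;    // MSI state tracked for the block
  int owner_node;   // a node currently known to hold a valid copy
};

class SwitchDirCache {
 public:
  // Returns the cached entry on a hit, or nothing on a miss.
  std::optional<SwitchDirEntry> lookup(uint64_t block_addr) const {
    auto it = entries_.find(block_addr);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }
  void update(uint64_t block_addr, SwitchDirEntry e) { entries_[block_addr] = e; }

 private:
  std::unordered_map<uint64_t, SwitchDirEntry> entries_;  // e.g., 512-2048 entries
};

// Choose the destination node for an incoming read-request packet.
int route_read_request(const SwitchDirCache& dir, uint64_t block_addr, int home_node) {
  if (auto e = dir.lookup(block_addr)) {
    return e->owner_node;  // hit: forward toward a holder, skipping the home node
  }
  return home_node;        // miss: use the conventional home-node directory
}

int main() {
  SwitchDirCache dir;
  dir.update(0x1000, SwitchDirEntry{1 /* shared */, 3});
  std::printf("dest = %d\n", route_read_request(dir, 0x1000, 0));  // hit -> node 3
  std::printf("dest = %d\n", route_read_request(dir, 0x2000, 0));  // miss -> home 0
}
```

In the hardware, the corresponding entries would be kept current through the cache dir update and memory dir update signals described above.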
Fig. 4. Block diagram of the proposed DCOS switch architecture, with the directory caches embedded alongside the 4 x 4 crossbar switch fabric.

IV. SIMULATION ENVIRONMENT
We have used the RSIM simulator for distributed shared memory multiprocessor systems. Some core parts of the simulator, which is written in C++, were modified for a shared L2 cache environment, and the proposed directory cache module was added to the default switch block. Table I summarizes the parameters of the simulated platform. The application programs used in our evaluations are FFT, Radix, Ocean, and Barnes from the SPLASH-2 benchmark suite. The input data sizes are shown in Table II.

TABLE I. PARAMETERS OF THE SIMULATED PLATFORM
L1 cache: direct-mapped, 32 KB, write-through; 16-byte lines; 2 request ports; 2-cycle hit time
Shared memory: 70-cycle (70 ns) access time; 4-way interleaving

TABLE II. BENCHMARK APPLICATIONS AND INPUT SIZES
FFT: 32 K
Radix: 1 M keys, radix 1024
Ocean: 100 x 100 grid
Barnes: 2048 bodies

V. SIMULATION RESULTS AND ANALYSIS
In this section, we present and analyze the performance results obtained through extensive simulations and evaluate the impact of the DCOS architecture. The main objective of the switch directory caches is to reduce the number of cache-to-cache transfers and the total execution time. Figure 5 shows that both the cache-to-cache transfer time and the execution time of the DCOS-based schemes were substantially reduced relative to the non-DCOS scheme in terms of total consumed clock cycles. The parameter we varied was the size of the shared memory directory cache on a switch, which ranged from 512 to 2048 entries, while fixing the size of the shared L2 directory cache at 32 entries. As shown in Figure 5(a), the total execution time for each benchmark application was reduced proportionally as the size of the on-switch directory cache was increased. When comparing the execution time of the non-DCOS platform to the DCOS platform with 2048 entries, the overheads for FFT, Radix, Ocean and Barnes were reduced by 43.1%, 28.7%, 21.4%, and 27.9%, respectively. In addition, as shown in Figure 5(b), the total cache-to-cache transfer time overhead from the home node to the local node was analyzed to determine the impact of the DCOS scheme. The cache-to-cache transfer time also decreased proportionally as we increased the size of the shared memory directory cache on a switch. When comparing the non-DCOS platform to the DCOS platform with 2048 entries, the cache-to-cache transfer time overheads for FFT, Radix, Ocean and Barnes were reduced by 35.8%, 63.2%, 30.8%, and 43.2%, respectively. Based on these two performance metrics over the four benchmark programs, a substantial performance improvement is obtained with the proposed DCOS scheme.

Fig. 5. Simulation results: (a) execution time [10^4 cycles]; (b) cache-to-cache transfer time [cycles].

VI. CONCLUSIONS
We have presented a novel directory-cache embedded switch architecture with distributed shared cache and distributed shared memory. This scheme is able to reduce the number of home node cache accesses, which results in a reduction in inter-cache transfer time and total execution time. Simulation results verify that the proposed methodology can improve performance substantially over a platform in which directory caches are not embedded in the switches.

ACKNOWLEDGMENTS
We thank Sangwoo Rhim, Bumhak Lee and Euiseok Kim of the SAMSUNG Advanced Institute of Technology (SAIT) for their help with this manuscript. This research work is supported by a grant from SAIT.
REFERENCES
[1] M. D. Nava, P. Blouet, P. Teninge, M. Coppola, T. Ben-Ismail, S. Picchiottino, and R. Wilson, "An open platform for developing multiprocessor SoCs," Computer, vol. 38, no. 7, pp. 60-67, 2005.
[2] L. Benini and G. De Micheli, "Networks on chip: a new paradigm for systems on chip design," in Proc. Design, Automation and Test in Europe Conf., Jan. 2002, pp. 418-419.
[3] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, ch. 6, Aug. 2003.
[4] J. Ahn, K. Lee, and H. Kim, "Architectural issues in adopting distributed shared memory for distributed object management systems," in Proc. Fifth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, Aug. 1995, pp. 294-300.
[5] J. Hennessy, M. Heinrich, and A. Gupta, "Cache-coherent distributed shared memory: perspectives on its development and future challenges," Proceedings of the IEEE, vol. 87, pp. 418-429, Mar. 1999.
[6] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in ASPLOS-X: Proc. 10th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, 2002, pp. 211-222.
[7] L. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Transactions on Computers, vol. C-27, no. 12, pp. 1112-1118, 1978.
[8] C. Hughes, V. Pai, P. Ranganathan, and S. Adve, "RSIM: simulating shared-memory multiprocessors with ILP processors," Computer, vol. 35, pp. 40-49, 2002.
[9] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. 22nd Annual International Symposium on Computer Architecture, 1995, pp. 24-36.
[10] D. Culler and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, ch. 5.