
International Journal of Engineering Research & Technology (IJERT)

ISSN: 2278-0181
Vol. 4 Issue 08, August-2015

Exploring Cache Coherency Design for Chip Multiprocessor using Multi2Sim

Vinh Ngo Quang, Hao Do, Trang Hoang, and Thanh Vu D.

IC Design Research and Education Center, University of Technology, VNU-HCM
Ho Chi Minh City, Vietnam

Abstract—Memory hierarchy design plays an important role in improving the performance of a chip multiprocessor (CMP), because CMP performance is strongly affected by the latency of fetching data from the memory system. Several organizations of the memory hierarchy have been explored to optimize this latency. Within the memory hierarchy, data movement is governed by a cache coherence protocol, which is the skeleton of the CMP's memory system. In this paper, we concentrate on exploring MOESI, a well-defined and popular cache coherence protocol for CMPs. Our experiments are based on the Splash-2 benchmark suite, which is widely used in publications on CMP design. The experimental results show that by rearranging the address ranges of the memory banks, the L2 hit ratio can be improved by up to 13.5%.

Keywords—Chip Multiprocessor; Memory Hierarchy; Coherence Protocol; MOESI; Memory Bank

I. INTRODUCTION

Nowadays, the chip multiprocessor (CMP) is the main trend in CPU design for high-performance devices. This originates from the fact that single-core chips have reached the limit of execution speed because of heat and power dissipation issues. Moreover, modern technologies allow millions of transistors to be integrated on one chip, which eases the design of multicore chips in terms of area. In fact, several CMPs have been commercialized in the market [1], [2], [3], [4].

In a CMP, memory hierarchy design is a concern that takes a lot of research effort, since the memory organization, not the CPU core, is the bottleneck in CMP design. Most memory systems have multiple levels of cache; for instance, Intel's commercial Ivy Bridge has three levels of cache. These cache levels and the main memory need a coherence protocol to keep memory consistent; for example, an L2 cache must ensure data consistency among the L1 caches above it. Moreover, a good coherence protocol also helps the CMP improve performance in terms of memory access latency. To the best of our knowledge, MOESI [6] is the most widely used cache coherence protocol in CMPs. In this paper, we first evaluate CMP performance using the MOESI protocol. Then, observing that the instruction and data caches have different access patterns, we concentrate on optimizing the last-level cache to exploit this difference and obtain a better L2 hit ratio. Our idea is to interleave the address ranges of the memory banks, and we show by simulation that performance can improve by up to 13.5% in comparison with a baseline model that arranges the memory addresses linearly. The experiment is carried out with Multi2Sim [5], an open-source simulator for heterogeneous multiprocessor design, and the Splash-2 benchmark suite [17] is used as the workload. In the results, we focus on analyzing the cache miss latency, because this parameter not only determines the efficiency of MOESI but also strongly affects CMP performance. Our contributions in this work are (1) statistical results on CMP performance using the MOESI protocol, and (2) a demonstration of the performance improvement obtained by rearranging the address ranges of the memory banks.

The paper is structured as follows. Section II surveys recent work on CMP cache organizations and coherence protocols. Section III presents the experimental method. Section IV gives and explains the results, and Section V concludes the paper.

II. RELATED WORK

Several works have been carried out to improve CMP performance by optimizing the on-chip memory hierarchy. There are different aspects of the memory hierarchy to consider, such as the shared or private last-level cache model, the cache coherence protocol, and the on-chip interconnection. While the L1 cache is always private to the processor core, the L2 cache can be designed to be private or shared, and many research papers explore the design space of the L2 cache [7]. A shared last-level cache has the advantage over a private one that it dynamically allocates the overall cache space among all cores on the chip; the last-level cache space is therefore better utilized and its miss ratio is reduced. A shared last-level cache (LLC) can be physically centralized or distributed with respect to the processor cores. In the first CMP designs, researchers proposed shared LLC organizations with uniform cache access time (UCA). Even though UCA is simple to design, it was soon replaced by non-uniform cache access (NUCA) techniques. With NUCA, cache banks nearer to the requesting core provide lower access latency than farther banks. NUCA was first proposed in [9]; the complexity of the interconnection network is the price to pay for using it, and the latency of a bank access depends on the bank's size and the number of network hops between the requesting core and the bank.


Kim et al. [9] investigated a model with a single core and a large L2 cache divided into multiple banks. They argued that a highly banked cache structure with a distributed cache controller is desirable for reducing cache access latency. The two main NUCA techniques proposed in their paper are static NUCA (S-NUCA) and dynamic NUCA (D-NUCA). In S-NUCA, data is statically mapped into banks, with the least significant bits determining the bank. S-NUCA is shown in [9] to have advantages over UCA for two reasons: first, the banks have non-uniform access times, so accesses to banks nearer the requesting core incur lower latency; second, different banks can be accessed simultaneously, which helps reduce contention. D-NUCA further improves on S-NUCA by dynamically mapping data into different LLC banks, i.e. frequently accessed data are placed in closer banks while less used data are cached in farther banks. D-NUCA raises data management policy issues: (1) how data are mapped to banks and in which banks a given datum may reside, (2) how to search for a cache line as quickly as possible in a large cache with many banks, and (3) how and when to migrate data between banks. D-NUCA is widely exploited in several works [10], [11], [12]. To the best of our knowledge, however, no prior work addresses the way in which the entire memory address range is divided among the memory banks. In this paper we show that the L2 hit ratio can be improved by up to 13.5% by interleaving the memory address range between the memory banks.

To ensure a consistent view of memory among all processor cores, a cache coherence protocol must be implemented in a CMP. A coherence mechanism has two main parts: (1) storage that holds the data-sharing information, and (2) a set of protocols that keep the data consistent using the information in (1). One essential piece of sharing information is the status of each cached copy, which is usually tracked by attaching a state to each cache block. The minimum states a coherence protocol must have are: (1) invalid (I), indicating that the cache block does not hold valid data; (2) shared (S), meaning that the block is shared by one or more other processor caches, may only be read (not written), and holds the same value as memory; and (3) modified (M), signifying that the block is uniquely held; a block in the M state must be written back to memory before being evicted. This simple coherence protocol is therefore called the three-state MSI protocol.

More sophisticated protocols employ additional cache block states to reduce coherence traffic and the latency of fetching a data block. Popular examples are MESI, MOSI, and MOESI [13]. MOESI is considered the most complex of these and encompasses all the states commonly used in the other protocols. The O (owned) state describes a dirty but shared block: it reduces coherence traffic because a block in the M state does not need to write its data back when it receives a read request; it simply changes to the O state instead. The E (exclusive) state signifies a clean and exclusive block: such a block can change to the M state without notifying the lower-level cache/memory, and when a block in the E state is evicted, no write-back is needed because its data is clean. Published results suggest that MOESI performs better than MESI and MOSI, although there is no conclusive evidence.
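To make the roles of these states concrete, the following minimal C sketch (our own illustration, not the protocol engine of any particular simulator) encodes the five MOESI states and the two transitions highlighted above: a remote read turns an M block into an O block without a write-back, and a write to an E block silently upgrades it to M.

#include <stdio.h>

/* The five MOESI states attached to every cache block. */
enum moesi_state { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED };

struct cache_block {
    enum moesi_state state;
    int dirty;                    /* block differs from the lower-level copy */
};

/* A remote core issues a read: a block in M supplies the data and
 * becomes O (dirty but shared) instead of writing back to memory. */
void on_remote_read(struct cache_block *b)
{
    if (b->state == MODIFIED)
        b->state = OWNED;         /* no write-back needed yet */
    else if (b->state == EXCLUSIVE)
        b->state = SHARED;        /* clean copy is now shared */
}

/* The local core writes: a block held in E upgrades to M silently,
 * i.e. without notifying the lower-level cache/memory. */
void on_local_write(struct cache_block *b)
{
    if (b->state == EXCLUSIVE || b->state == SHARED || b->state == OWNED) {
        /* From S or O, other sharers would first have to be invalidated;
         * the coherence messages are omitted in this sketch. */
        b->state = MODIFIED;
    }
    b->dirty = 1;
}

/* Eviction: only M and O hold dirty data that must be written back. */
int needs_writeback_on_evict(const struct cache_block *b)
{
    return b->state == MODIFIED || b->state == OWNED;
}

int main(void)
{
    struct cache_block b = { MODIFIED, 1 };
    on_remote_read(&b);
    printf("state after remote read: %d (OWNED = %d)\n", b.state, OWNED);
    return 0;
}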
Depending on the interconnection, there are two major classes of cache coherence protocols: bus-based and directory-based. Directory-based protocols are preferable for large numbers of cores using a scalable interconnect such as a mesh. A directory-based protocol replaces the broadcast coherence messages of a bus-based protocol with point-to-point messages that involve only the appropriate nodes of the interconnection network. In more detail, the sharing information is logically centralized in a directory. The directory is usually co-located with the data block in memory, and each of its entries corresponds to one memory address. An entry keeps the information needed to track the block's current sharers and their read/write privileges. The directory-based mechanism dramatically reduces coherence traffic in comparison with a bus-based one. Moreover, it allows coherence messages to travel over dedicated, fast channels of the interconnection network rather than over a single shared bus. In this paper, we choose the MOESI, directory-based coherence protocol for the sake of performance and scalability [8].
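As an illustration of the per-block information such a directory must keep (a generic sketch of our own; the actual layout used by Multi2Sim differs), an entry can be reduced to a coherence state, an owner, and a sharer bit-vector with one bit per cache module, so that invalidations on a write are sent only to the nodes whose bits are set.

#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 8               /* e.g. 4 L1 + 2 L2 modules plus spares; an assumption for the sketch */

/* One directory entry per memory block address. */
struct dir_entry {
    uint8_t  state;               /* coherence state of the block (e.g. a MOESI state) */
    int8_t   owner;               /* node holding the M/O copy, -1 if none */
    uint32_t sharers;             /* bit-vector: bit i set if node i holds a copy */
};

/* On a write request, point-to-point invalidations are sent only to the
 * nodes whose bit is set, instead of broadcasting to every node. */
static inline bool must_invalidate(const struct dir_entry *e, int node)
{
    return (e->sharers >> node) & 1u;
}

static inline void add_sharer(struct dir_entry *e, int node)
{
    e->sharers |= 1u << node;
}

static inline int sharer_count(const struct dir_entry *e)
{
    int n = 0;
    for (int i = 0; i < NUM_NODES; i++)
        n += (e->sharers >> i) & 1u;
    return n;
}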
The simulator used in this research is Multi2Sim. It models an event-driven memory hierarchy that uses MOESI as the coherence protocol between the caches of different processor cores. It also supports multi-level cache organizations as well as directories for caches and main memory. The simulator is written purely in C, which makes it simpler to read and modify than gem5 [16]. Other well-known simulators for this kind of research are CACTI [14] and Simics [15], but CACTI builds only the cache model, which costs users extra effort to run benchmarks, while Simics is mainly intended for commercial use.

III. METHODOLOGY

We use Multi2Sim to run and simulate the operation of the CMP, focusing on the memory hierarchy. Multi2Sim is a simulation framework written in C for heterogeneous computing. It provides an easy way to design, configure, and launch a CMP for research on heterogeneous systems. The framework builds the CMP from INI configuration files, through which we can control the main memory, the caches, and the internal network. In the first experiment, we use the memory configuration and network configuration files to generate the CMP shown in Fig. 1; a hedged sketch of such a configuration follows below.

Fig. 1. The CMP architecture used for the experiment.
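The fragment below sketches what a Multi2Sim memory-configuration file for one L1/L2 pair and one main-memory bank could look like. It is an assumption-laden illustration: the section and key names follow our reading of the Multi2Sim documentation and should be checked against the manual of the simulator version in use, and the sizes and latencies are placeholders rather than the values of TABLE I.

; Hypothetical excerpt of a Multi2Sim memory configuration file
[CacheGeometry geo-l1]
Sets = 64
Assoc = 2
BlockSize = 64
Latency = 2

[CacheGeometry geo-l2]
Sets = 512
Assoc = 8
BlockSize = 64
Latency = 10

[Module mod-l1-0]
Type = Cache
Geometry = geo-l1
LowNetwork = net-l1-l2-0
LowModules = mod-l2-0

[Module mod-l2-0]
Type = Cache
Geometry = geo-l2
HighNetwork = net-l1-l2-0
LowNetwork = net-l2-mm
LowModules = mod-mm-0 mod-mm-1 mod-mm-2 mod-mm-3

[Module mod-mm-0]
Type = MainMemory
BlockSize = 64
Latency = 200
; Baseline: this bank serves one contiguous quarter of the 32-bit address space
AddressRange = BOUNDS 0x00000000 0x3FFFFFFF
; An interleaved alternative (1 KB granularity across 4 banks) would be
; expressed along the lines of: AddressRange = ADDR DIV 1024 MOD 4 EQ 0

[Entry core-0]
Arch = x86
Core = 0
Thread = 0
DataModule = mod-l1-0
InstModule = mod-l1-0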


The CMP comprises four processor cores, four L1 cache modules, two L2 cache modules, and a main memory module. Each L2 cache is shared by two L1 cache modules but is private from the main memory's point of view. We also create four threads for our four-core CMP. The detailed configuration of this system is shown in TABLE I.

TABLE I. Configuration of the CMP in detail.

We use a ring network to connect all modules to main memory, so the private L2 caches can directly access every memory bank on the ring. The MOESI protocol is used in this CMP design to keep data consistent between the L1 caches of the processor cores.

The experiment is run with the Splash-2 benchmark suite, which contains 11 applications solving 11 classic problems in parallel computing, such as N-Body, Cholesky factorization, and FFT. Since the main application field of CMPs is solving complex problems through parallel computing, this benchmark is a suitable workload for evaluating the effect of the MOESI protocol on CMP performance. We run the 11 applications and record the L1 and L2 hit ratios; the results are reported in Fig. 2. The average L1 hit ratio is very high: over 98% for L1-Data and 99% for L1-Instruction. For L2 the number is lower, about 77%. These are reasonable results, because the L1 caches hold the data and instructions most likely to be used, while the L2 hit ratio, although lower, is still acceptable: L2 is one level further from the core, and the core can proceed with other tasks that do not depend on data from L2. Besides, in this configuration the effective L2 size is limited by its private character. From these results we conclude that the Multi2Sim simulator and the MOESI protocol work correctly. Importantly, in this first experiment we divide the main memory into four equal banks, each holding one contiguous range of memory addresses. This way of separating the banks serves as the baseline for the next experiment, which shows the improvement in L2 hit ratio obtained by interleaving the banks' address ranges.

Fig. 2. The average hit ratio in the L1 and L2 caches.

In the second experiment, the main memory of the same model is separated into four banks by interleaving the memory address range. There are 32 bits to index memory bytes, but we do not use all of them. With interleaving, a memory bank becomes a set of equal, smaller ranges; if the interleaving ranges are too small, memory accesses become inefficient because of locality of reference. We therefore choose 1 KB as the smallest range, which leaves 22 bits to index a range. Since there are four memory banks, a pair of bits is needed to locate the bank that contains a block. In theory any pair could be used; in our experiment we use two consecutive bits within the 22 index bits. This approach, called interleaved memory-bank addressing, spreads the data and instructions of the applications across every memory bank. Fig. 3 illustrates the possible ways of dividing the memory addresses into banks: there are 21 pairs of consecutive bits within the 22 index bits, so the main memory can be divided in 21 different ways. Our goal is to identify which pair is the best choice and how much it improves the L2 hit ratio. First, we run the model of Fig. 1 for the 21 cases using one benchmark from Splash-2, arbitrarily chosen to be the Sparse Cholesky Factorization problem, and record the L2 hit ratio for each case. From these results we determine the best pair for interleaving. Next, we run all remaining Splash-2 benchmarks with that pair, compare the results against the baseline, and compute the improvement of the interleaved-bank model over the linear-range bank model. (A sketch of the two bank-selection schemes is given after Fig. 3.)

Fig. 3. Memory partitions using different pairs of bits.
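The following minimal C sketch (our own illustration, not simulator code) contrasts the two bank-selection schemes compared in this paper: the baseline, in which each of the four banks holds one contiguous quarter of the 32-bit address space, and the interleaved scheme, in which a pair of consecutive address bits above the 1 KB (10-bit) offset selects the bank. The parameter pair_index follows the 1-to-21 numbering used in the text.

#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS   4u
#define MEM_SIZE    (1ull << 32)      /* 32-bit physical address space */

/* Baseline: bank i covers one contiguous quarter of the address space. */
unsigned bank_linear(uint32_t addr)
{
    return (unsigned)(addr / (MEM_SIZE / NUM_BANKS));
}

/* Interleaved: pair_index = 1..21 selects bits (9 + pair_index, 10 + pair_index),
 * i.e. an interleaving granularity of 2^(9 + pair_index) bytes.
 * pair_index = 1 -> 1 KB ranges; pair_index = 9 -> 256 KB ranges. */
unsigned bank_interleaved(uint32_t addr, int pair_index)
{
    int shift = 9 + pair_index;
    return (addr >> shift) & (NUM_BANKS - 1);
}

int main(void)
{
    uint32_t addr = 0x00740000;       /* arbitrary test address */
    printf("linear bank      : %u\n", bank_linear(addr));
    printf("interleaved bank : %u (pair 9, 256 KB ranges)\n",
           bank_interleaved(addr, 9));
    return 0;
}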


This design uses inclusive caches, so the directories in the four main-memory banks limit the number of cache blocks that can be held in all level-1 and level-2 cache modules. If the directories are full, blocks are evicted from the directories, and the corresponding blocks in the L1 and L2 caches are evicted as well. We therefore predict that dividing the main memory into four linear (contiguous) address-range banks gives lower performance than interleaving the address range across the banks.

IV. EXPERIMENT RESULTS

Fig. 4 shows two important observations from running Cholesky factorization: the hit ratio reaches its highest value when the ninth pair of bits is used for partitioning, and when the interleaving size becomes too large the hit ratio settles to a constant.

Fig. 4. The L2 hit ratios obtained with different pairs of bits used to partition main memory, when running Cholesky factorization.

First, we explain why the L2 hit ratio is constant when the interleaving size is large. Because the size of the application is limited, once the pair of bits is high enough, i.e. the range is large enough, the application code fits within one bank. Strictly speaking, the hit ratio is not exactly constant; due to factors such as the hardware and other running programs it fluctuates around a value, but the amplitude is so small that the fluctuation is not visible.

Second, the hit ratio reaches its maximum when the ninth pair of bits is used, which corresponds to an interleaved bank range of 256 KB. The reason is that L2 is a unified cache for both instructions and data, and the CMP runs four threads in parallel. By interleaving the memory address range, the instructions and data of the applications are distributed evenly across the four directories of the main-memory banks. This banking mechanism helps the directories store cache blocks efficiently given their limited capacity. Moreover, when a cache block is evicted from a directory, it is likely that the block will not be accessed again in the upper-level caches.
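As a consistency check on these numbers (a derivation added here, assuming the pairs are numbered from 1 starting just above the ten offset bits of the 1 KB granule, as in Section III), pair k implies an interleaving granularity of

\[
\text{granularity}(k) = 2^{\,10 + (k-1)} = 2^{\,9+k}\ \text{bytes}, \qquad k = 1,\dots,21,
\]
\[
\text{granularity}(9) = 2^{18}\ \text{bytes} = 262144\ \text{bytes} = 256\ \text{KB}.
\]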
Fig. 5. The L2 hit ratios when interleaving with 256 KB ranges, in comparison with the baseline.

In Fig. 5, the upper curve presents the hit ratio when interleaving with 256 KB ranges, and the other curve is the baseline. The average improvement is 7%, and up to 13.5% in the best case. We therefore conclude from these experiments that the interleaving method is the better choice.

V. CONCLUSION

This paper explores the MOESI protocol used in a CMP. Based on the different characteristics of instruction and data cache blocks, we reorganize the memory banks by interleaving the address range between the banks. The experimental results show that by interleaving 256 KB ranges among four memory banks, the L2 hit ratio of the CMP improves by up to 13.5% compared with the baseline model, in which the memory address space is divided into four contiguous address ranges. This result should also apply to other workloads; depending on the size of the workload, the interleaving coefficient may need to change accordingly to achieve the best L2 hit ratio.

ACKNOWLEDGMENT

This research was funded by Vietnam National University - Ho Chi Minh City under grant number C2015-40-01.

REFERENCES

[1] Chen, Thomas, et al., "Cell Broadband Engine architecture and its first implementation - a performance view," IBM Journal of Research and Development 51.5 (2007): 559-572.
[2] George, Varghese, T. Piazza, and H. Jiang, "Technology Insight: Intel Next Generation Microarchitecture Codename Ivy Bridge," 2011.
[3] Conway, Pat, et al., "Cache hierarchy and memory subsystem of the AMD Opteron processor," IEEE Micro 30.2 (2010): 16-29.
[4] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way multithreaded SPARC processor," IEEE Micro, vol. 25, pp. 21-29, March 2005.
[5] Ubal, Rafael, et al., "Multi2Sim: a simulation framework for CPU-GPU computing," Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, ACM, 2012.


[6] Milo M. K. Martin, Token Coherence, Ph.D. Dissertation, Dec. 2003.
[7] Balasubramonian, Rajeev, Norman P. Jouppi, and Naveen Muralimanohar, "Multi-core cache hierarchies," Synthesis Lectures on Computer Architecture 6.3 (2011): 1-153.
[8] Martin, Milo M. K., Mark D. Hill, and Daniel J. Sorin, "Why on-chip cache coherence is here to stay," Communications of the ACM 55.7 (2012): 78-89.
[9] Kim, Changkyu, Doug Burger, and Stephen W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," ACM SIGPLAN Notices, Vol. 37, No. 10, ACM, 2002.
[10] Chang, Jichuan, and Gurindar S. Sohi, "Cooperative caching for chip multiprocessors," Vol. 34, No. 2, IEEE Computer Society, 2006.
[11] Zhang, Michael, and Krste Asanovic, "Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors," ACM SIGARCH Computer Architecture News, Vol. 33, No. 2, IEEE Computer Society, 2005.
[12] Beckmann, Bradford M., and David A. Wood, "Managing wire delay in large chip-multiprocessor caches," MICRO-37: 37th International Symposium on Microarchitecture, IEEE, 2004.
[13] Sorin, Daniel J., Mark D. Hill, and David A. Wood, "A primer on memory consistency and cache coherence," Synthesis Lectures on Computer Architecture 6.3 (2011): 1-212.
[14] Wilton, Steven J. E., and Norman P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE Journal of Solid-State Circuits 31.5 (1996): 677-688.
[15] Magnusson, Peter S., et al., "Simics: A full system simulation platform," Computer 35.2 (2002): 50-58.
[16] Binkert, Nathan, et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News 39.2 (2011): 1-7.
[17] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," Proceedings of the 22nd International Symposium on Computer Architecture, pages 24-36, June 1995.
