
GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management

Harini Muthukrishnan (harinim@umich.edu), University of Michigan, Ann Arbor, Michigan, USA
Daniel Lustig (dlustig@nvidia.com), NVIDIA, Westford, Massachusetts, USA
David Nellans (dnellans@nvidia.com), NVIDIA, Austin, Texas, USA
Thomas Wenisch (twenisch@umich.edu), University of Michigan, Ann Arbor, Michigan, USA
ABSTRACT

Suboptimal management of memory and bandwidth is one of the primary causes of low performance on systems comprising multiple GPUs. Existing memory management solutions like Unified Memory (UM) offer simplified programming but come at the cost of performance: applications can even exhibit slowdown with increasing GPU count due to their inability to leverage system resources effectively. To solve this challenge, we propose GPS, a HW/SW multi-GPU memory management technique that efficiently orchestrates inter-GPU communication using proactive data transfers. GPS offers the programmability advantage of multi-GPU shared memory with the performance of GPU-local memory. To enable this, GPS automatically tracks the data accesses performed by each GPU, maintains duplicate physical replicas of shared regions in each GPU's local memory, and pushes updates to the replicas in all consumer GPUs. GPS is compatible with the existing NVIDIA GPU memory consistency model but takes full advantage of its relaxed nature to deliver high performance. We evaluate GPS in the context of a 4-GPU system with varying interconnects and show that GPS achieves an average speedup of 3.0× relative to the performance of a single GPU, outperforming the next best available multi-GPU memory management technique by 2.3× on average. In a 16-GPU system using a future PCIe 6.0 interconnect, we demonstrate a 7.9× average strong scaling speedup over single-GPU performance, capturing 80% of the available opportunity.

CCS CONCEPTS

• Computer systems organization → Multicore architectures; Peer-to-peer architectures.

KEYWORDS

GPGPU, multi-GPU, strong scaling, GPU memory management, communication, heterogeneous systems

ACM Reference Format:
Harini Muthukrishnan, Daniel Lustig, David Nellans, and Thomas Wenisch. 2021. GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21), October 18–22, 2021, Virtual Event, Greece. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3466752.3480088

1 INTRODUCTION

Graphics Processing Units (GPUs) are central to high-performance computing because of their high memory bandwidth and microarchitecture tailored for parallel execution. Because workload demands continue to grow beyond what single-GPU performance can provide, GPU manufacturers now offer systems comprising multiple GPUs to continue scaling application throughput [5, 40]. These aggregated multi-GPU systems provide teraflops of computational power and many terabytes per second of memory bandwidth [20, 42]. However, effectively managing these resources to extract performance from a multi-GPU system remains a challenge for many GPU developers.

One of the primary challenges to achieving scalable performance lies in managing the order-of-magnitude gap between local and remote memory bandwidths. If applications are naively partitioned across GPUs, most memory accesses will traverse (relatively) slow remote links, resulting in the inter-GPU interconnect becoming a performance bottleneck. Figure 1 demonstrates that for a variety of hard-to-scale HPC benchmarks on a 4-GPU system (see Section 6 for specifics), interconnect bandwidth limits scalability. A system with infinite interconnect bandwidth and a system with projected PCIe 6.0 performance attain 3× and 2× speedups over a single GPU, respectively, while using a current PCIe 3.0 interconnect can result in application performance 30% slower than its single-GPU counterpart.

The multi-GPU partitioning problem is difficult because the current techniques available for developers to effectively manage data in multi-GPU systems fall short. Unified Memory (UM) [21] provides a single memory address space accessible from any processor in the system by employing fault-based and/or hint-based page migration to move data pages to the local physical memory of the accessing processor. Although this enables UM to automatically migrate pages for locality, the page fault handling overheads are often performance prohibitive. Some programmers therefore resort to manual hints, but using hints effectively requires substantial tuning effort on the part of the programmer.

Figure 1: Many HPC programs strong-scale poorly due to insufficient inter-GPU bandwidth, as shown on a system with 4 NVIDIA GV100 GPUs. (Performance relative to 1 GPU under PCIe 3.0, projected PCIe 6.0, and an infinite-bandwidth interconnect.)

Figure 2: Load/store paths for conventional and GPS pages. Because GPS transfers data to consumers' memory proactively, all GPS loads can be performed to high bandwidth local memory.

Peer-to-peer transfers [51], in which GPUs perform loads/stores directly to the physical memories of other GPUs, can potentially achieve good strong scaling when properly coordinated with computation. However, peer-to-peer loads perform remote accesses on demand, leading to compute stalls. Peer-to-peer stores avoid stalls, but can result in wastage of an already scarce interconnect bandwidth if sent to GPUs that do not require the pushed data. Frameworks such as Gunrock [60] and Groute [11] can improve strong scaling for some workloads but are typically domain-specific and limited in scope.

To overcome the limitations of these techniques, we propose GPS (GPU Publish-Subscribe), a multi-GPU memory management technique that transparently improves the performance of multi-GPU applications following a fully Unified Memory compatible programming model. GPS provides a set of architectural enhancements that automatically track which GPUs subscribe to shared memory pages, and through driver support, it replicates those pages locally on each subscribing GPU. GPS hardware then broadcasts stores proactively to all subscribers, enabling the subscribers to read that data from local memory at high bandwidth. Figure 2 shows the behavioral difference between GPS and conventional accesses. GPS loads always return from local memory, while GPS stores are broadcast to all subscribers. On the other hand, conventional load/stores result in local or remote accesses depending on physical memory location. GPS takes advantage of the fact that remote stores do not stall execution. Performing remote accesses on the write path instead of the read path hides latency and enables further optimizations to schedule and combine writes to use the interconnect bandwidth efficiently.

GPS successfully improves strong scaling performance because it can: (1) transfer data in a proactive, fine-grained fashion and overlap compute with communication, (2) issue all loads to local DRAM rather than to remote GPUs' memory over slower interconnects, and (3) perform aggressive coalescing optimizations that reduce inter-GPU bandwidth requirements without violating the GPU's memory model.

This work makes the following contributions:

• We propose and describe GPS, a novel HW/SW co-designed multi-GPU memory management system extension that leverages a publish-subscribe paradigm to improve multi-GPU system performance.
• We propose new simple and intuitive programming model extensions that can naturally integrate GPS into applications while conforming to the pre-existing NVIDIA GPU memory model.
• To minimize the utilization of scarce inter-GPU bandwidth, we propose a novel memory subscription management mechanism that tracks GPUs' access patterns and unsubscribes GPUs from pages they do not access.
• Evaluated in simulation with several interconnects against a 4 GPU system, GPS provides an average strong scaling performance of 3.0× over a single GPU, capturing 93.7% of the available opportunity. In a 16 GPU system using a future PCIe 6.0 interconnect, GPS achieves an average 7.9× speedup over a single GPU and captures over 80% of the hypothetical performance of an infinite bandwidth interconnect.

2 BACKGROUND AND MOTIVATION

Proper orchestration of inter-GPU communication is essential for strong scaling in multi-GPU systems. With each new GPU and interconnect generation, the compute throughput, local memory bandwidth, and inter-GPU bandwidth increase. However, the bandwidth available to local GPU memory remains much higher than the bandwidth available to remote GPU memories. For example, Figure 3 shows that even though interconnect bandwidth has improved 38× while evolving from PCIe 3.0 [3] to NVIDIA's most recent NVLink and NVSwitch-based topology [38], it remains 3× slower than the local GPU memory bandwidth.

2.1 Inter-GPU communication mechanisms

Because local vs. remote bandwidth is a first-order performance concern, multi-GPU workloads typically use one of the following mechanisms (summarized in Figure 4) to manage data placement among the GPUs' physical memories.

• Host-initiated DMA using cudaMemcpy: cudaMemcpy() programs a GPU's DMA engine to copy data directly between GPUs and/or CPUs. Though CUDA provides the ability to pipeline compute and cudaMemcpy()-based transfers [55], implementing pipeline parallelism requires significant programmer effort and detailed knowledge of the applications' behavior in order to work effectively.
• Fault-based migration via Unified Memory (UM): Unified Memory [21] provides a single unified virtual memory address space accessible to any processor in a system. UM uses a page fault and migrate mechanism to perform data transfers among GPUs. Although this enables data movement among GPUs in an implicit and programmer-agnostic fashion, the performance overhead of these page faults typically is a first-order performance concern.
Figure 3: Local and remote bandwidths on varying GPU platforms. Despite significant increases in both metrics, a 3× bandwidth gap persists between local and remote memories. (Bandwidth in GB/s for Discrete/Kepler/PCIe, DGX-1/Pascal/NVLink 1, DGX-1V/Volta/NVLink 2, DGX-2/Volta/NVLink 2 + NVSwitch, and DGX-A100/Ampere/NVLink 3 + NVSwitch.)

Figure 4: Data transfer patterns in different paradigms: (a) peer-to-peer loads or fault-based UM, (b) memcpy between kernels or UM with hints, (c) publish-subscribe (including GPS). In demand-based loads and UM, transfers happen on demand; in memcpy, they happen bulk-synchronously at the end of the producer kernel; in GPS, proactive fine-grained transfers are performed to all subscribers.
• Hint-based migration via Unified Memory: UM also offers the ability for programmers to provide hints to improve performance. Read-mostly, prefetching, placement, and "accessed-by" hints enable duplication of read-only pages across GPUs as well as page placement on specific GPUs and thus can reduce page faults if used correctly. However, pipelining prefetching and compute using hints to achieve fine-grained data sharing, our goal in GPS, is challenging even for expert programmers. Furthermore, crucially, UM does not support the replication of pages with at least one writer and multiple readers. Writes to read-duplicated pages "collapse" the page to a single GPU (usually the writer) and trigger an expensive TLB shootdown, thus degrading performance substantially [59]. GPS aims to provide a better solution for read-write pages accessed by multiple GPUs.
• Peer-to-peer transfers: GPU threads can also perform peer-to-peer loads and stores that directly access the physical memory of other GPUs without requiring page migration. In principle, peer-to-peer accesses can be performed at a fine granularity, overlap compute and communication, and incur low initiation overhead. However, peer-to-peer loads suffer from the high latency of traversing interconnects such as PCIe or NVLink, often stalling thread execution beyond the GPU's ability to mitigate those stalls via multi-threading. On the other hand, peer-to-peer stores typically do not stall GPU thread execution and can be used to proactively push data to future consumers without slowing down the computation phases of the application.
• Message Passing Interface (MPI): MPI is a standardized and portable API for communicating via messages between distributed processes. With CUDA-aware MPI and optimizations such as GPUDirect Remote Direct Memory Access (RDMA) [45], the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. However, porting an application to MPI increases programmer burden, and it is hard to overlap compute and communication effectively by leveraging the pipelining features of MPI.

One way to avoid the remote access bottleneck is to transfer data from the producing GPUs to the consuming GPUs in advance, as soon as the data is generated. The consumers can then read the data directly from their local memory when needed. These proactive transfers help strong scaling for two reasons: (1) they provide more opportunities to overlap data transfers with computation, and (2) they improve locality and ensure that critical path loads enjoy higher local memory bandwidth. As such, GPS relies on proactive peer-to-peer stores to perform data transfers as the basis of its performance-scalable implementation.

2.2 Publish-subscribe frameworks

Although proactive fine-grained data movement can improve locality, performing broad all-to-all transfers wastes inter-GPU bandwidth in cases where only a subset of GPUs will consume the data. In these cases, tracking which GPUs read from a page and then transmitting data only to these consumers can save precious interconnect bandwidth. To track a page's consumers and propagate updates only to them, GPS adopts a conceptual publish-subscribe framework, which is often used in software distributed systems [7, 15, 18, 33].

Figure 5 shows a simple example of a publish-subscribe framework. It consists of publishers who generate data and subscribers who have requested specific updates. The publish-subscribe processing unit tracks subscription information at page granularity, receives data updates from publishers and forwards them to subscribers. This mechanism provides the advantage that publishers and subscribers can be decoupled, and subscription management is handled entirely by the publish-subscribe processing unit.

The major challenge faced by a publish-subscribe model relying on proactive remote stores is deciding which GPUs should receive the data and when the stores should be transmitted. We address this in Section 3.

Figure 5: A simple publish-subscribe framework. (Publishers GPU0–GPUm issue stores to the publish-subscribe processing unit, which tracks per-page subscriber lists and forwards the stores to all subscribers.)

Figure 6: GPS address space: Allocations made are replicated in the physical memory of all subscribers. (A region subscribed by GPUs 0 and 1 is backed in the PA spaces of GPU0 and GPU1; a region subscribed by GPUs 0, 1, and 2 is backed in all three.)

2.3 GPU memory consistency

The NVIDIA GPU memory model [43] prescribes rules regarding the apparent ordering of GPU memory operations and the values that may be returned by each read operation. Of most relevance to GPS are the notions of weak vs. strong accesses, and the notion of scope associated with strong accesses. In short, sys-scoped memory accesses or fences are used to explicitly indicate inter-GPU synchronization. All other access types need not be made visible to or ordered with respect to memory accesses from other GPUs unless sys-scoped operations are used for synchronization. GPS makes use of these relaxed memory ordering guarantees to perform several hardware optimizations, as described later in Section 3.3, without violating the memory model.

3 GPS ARCHITECTURAL PRINCIPLES

In this section, we describe the different architectural principles upon which GPS is based. Several possible hardware implementations can support the GPS semantics with differing performance characteristics. These high-level design principles are intended to decouple the GPS concept from our specific proposed implementation. We highlight one particular implementation approach in Section 5.

3.1 The GPS address space

The GPS address space, which is an extension of the conventional multi-GPU shared virtual address space, enables programmers to incorporate GPS features into their applications through simple, intuitive APIs (described later in Section 4). As shown in Figure 6, allocations made in the GPS address space have local replicas in all subscribing GPUs' physical memories. Subscription management is described in Section 3.2.

Loads and stores are issued to GPS pages in the same way as they are to normal pages, with the same syntax, although the underlying behavior is different. GPS' architectural enhancements intercept each store and forward a copy to each subscriber's local replica. Loads to GPS pages are also intercepted but are not duplicated. Instead, they are issued to the replica in the issuing GPU's local memory and can therefore be performed at full local bandwidth and without consuming interconnect bandwidth. In the corner case where the issuing GPU is not a subscriber to that page (e.g., because of an incorrect subscription hint), the load is issued remotely to one of the subscribers. This design offers GPS a significant advantage over existing multi-GPU programming frameworks: a programmer can integrate GPS into their workloads with only minor changes for allocation and subscription management and need not modify GPU kernels written for UM, except to apply performance tuning.

3.2 Subscription management

The key innovation behind GPS is a set of architectural enhancements designed to coordinate proactive inter-GPU communication for high performance. These architectural enhancements maintain subscription information to enable data transfer only to GPUs that require it, enabling local accesses during consumption.

GPS provides both manual and automatic mechanisms to manage page subscriptions. We describe the general mechanisms below, the APIs in Section 4, and the implementation in Section 5.

Manual subscription tracking: The manual mechanism allows the user to explicitly specify subscription information through subscription/unsubscription APIs that can be called while referencing each page (or a range of pages) in memory. Though this requires extra programmer effort, it can lead to significant bandwidth savings compared to an all-to-all subscription, even if the subscription list is not minimized.

Automatic subscription tracking: For applications with an iterative/phase behavior wherein the access patterns in each program segment match those of prior segments, the subscriber lists can be determined automatically. The repetition in iterative workloads enables GPS to discover the access patterns from one segment in an initial profiling phase and determine the set of subscriptions for all subsequent execution.

Automatic subscriber tracking can be performed in one of two ways: subscribed-by-default, i.e., indiscriminate all-to-all subscription followed by an unsubscription phase, or unsubscribed-by-default, i.e., a GPU subscribes to a page only when it issues the first read request to that page. Either way, once captured, the sharer information then feeds into the subscription tracking mechanism used to orchestrate the inter-GPU communication.

It is important to note that the subscriptions are hints to GPS for both mechanisms and are not functional requirements for correct application execution. In other words, if a GPU issues a load to a page to which it is not a subscriber, it does not fault; the hardware simply issues the load remotely to one of the subscribers. Performing a peer-to-peer load to a remote GPU imposes a performance penalty but does not break functional correctness.
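The GPS page semantics described in Sections 3.1 and 3.2 can be summarized with a small behavioral sketch (a software model written for illustration, not the proposed hardware): a store to a GPS page is applied to every subscriber's replica, a load is served from the local replica when the issuing GPU is a subscriber, and an unsubscribed load falls back to a remote read, matching the "hint, not correctness requirement" property above. The page representation and GPU count are assumptions.

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr int kNumGpus = 4;

    // Behavioral model of a single GPS page and its per-GPU replicas.
    struct GpsPage {
        std::bitset<kNumGpus> subscribers;   // subscription hints
        std::vector<std::vector<uint8_t>> replica =
            std::vector<std::vector<uint8_t>>(kNumGpus);   // one (possibly empty) copy per GPU

        void subscribe(int gpu, size_t page_bytes) {   // back a local replica for 'gpu'
            subscribers.set(gpu);
            replica[gpu].assign(page_bytes, 0);
        }

        // A store is applied to every subscriber's replica, local and remote alike.
        void store(size_t offset, uint8_t value) {
            for (int g = 0; g < kNumGpus; ++g)
                if (subscribers.test(g) && offset < replica[g].size())
                    replica[g][offset] = value;
        }

        // A load is served locally when 'gpu' subscribes to the page; otherwise it
        // falls back to a (slower) remote read from some subscriber, without faulting.
        uint8_t load(int gpu, size_t offset) const {
            if (subscribers.test(gpu)) return replica[gpu][offset];
            for (int g = 0; g < kNumGpus; ++g)
                if (subscribers.test(g)) return replica[g][offset];
            return 0;   // no subscribers: unreachable in a valid configuration
        }
    };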

3.3 Functionally correct coalescing

While publish-subscribe models have been proposed in the past, a unique aspect of GPS is the way it exploits the weakness of the GPU memory consistency model described in Section 2.3. In particular, GPS performs aggressive coalescing of stores to GPS pages before those stores are forwarded to other GPUs, as described below.

Non-sys-scoped accesses: As described in Section 2.3, aside from sys-scoped operations, the GPU memory model only requires writes to be made visible to other GPUs upon execution of sys-scoped synchronization (e.g., memory fences). GPS uses this delayed visibility requirement to aggressively coalesce weak stores if their addresses fall within the same cache line. Stores need not be consecutive to be coalesced, as the GPU memory model allows store-store reordering as long as there is no synchronization or same-address relationship between the stores. This decreases the data transferred across the interconnect, saving valuable inter-GPU bandwidth. The coalescer must be flushed only upon sys-scoped synchronization, including the implicit release operation at the end of every grid.

The GPU memory model also does not require cache coherence between GPUs, i.e., a consistent store visibility order, unless stores are from the same GPU to the same address, have sys scope, or are synchronized by sys-scoped operations. While it is possible for GPS stores broadcast from different GPUs to the same address to cross each other in flight and therefore arrive at different consumer GPUs in different orders, this is not a violation of the memory model. Weak stores performed by different GPUs to the same address simultaneously without synchronization are racy, and therefore, no ordering guarantees need to be maintained. As long as proper synchronization is being used, weak writes will only be sent to a particular address from one GPU at a time, and point-to-point ordering will ensure that all consumer GPUs see those stores arriving in the same order. In this way, GPS maintains all of the required memory ordering behavior.

sys-scoped accesses: These writes are intended for synchronization and must be kept coherent across all GPUs. Therefore, GPS neither coalesces nor broadcasts such writes but instead simply handles them as traditional GPUs do. Specifically, all sys-scoped accesses are sent to a single point of coherence and performed there. Typically, the number of sys-scoped operations in programs is low, as they are only used when grids launched concurrently on multiple GPUs need to explicitly synchronize through memory. Hence, the cost of not coalescing system-scoped strong stores is minimal. Further discussion of the handling of sys-scoped writes in our GPS implementation is described in Section 5.3.

The design choices described above ensure that GPS can deliver consistent performance gains without breaking backward compatibility with the GPU programming model or memory consistency model. This compatibility enables developers to easily integrate GPS into their applications with minimal code or conceptual change.

4 GPS PROGRAMMING INTERFACE

We next describe the programming interface that an application developer uses to leverage GPS features. We seek to develop a minimal, simple programming interface to ease the integration of GPS into existing multi-GPU applications. GPS API functions are implemented in the CUDA library and GPU driver, just as the existing CUDA APIs are. Listing 1 shows sample code.

    __global__ void mvmul(float *invec, float *outvec, ...) {
        ...
        // Stores to outvec in the GPS address space are
        // forwarded to the replicas at each subscriber GPU
        for (int i = 0; i < mat_dim; i++)
            outvec[tid] += mat[tid * mat_dim + i] * invec[i];
    }

    int main(...) {
        // enable GPS for mat, vec1, and vec2
        cudaMallocGPS(&mat, mat_dim * mat_dim_size);
        cudaMallocGPS(&vec1, mat_dim_size);
        cudaMallocGPS(&vec2, mat_dim_size);
        cudaMemset(vec2, 0, mat_dim_size);
        for (int iter = 0; iter < MAX_ITER; iter++) {
            // Automatic profiling: all GPUs are tentatively
            // subscribed to all GPS pages at the start
            if (iter == 0) cuGPSTrackingStart();
            for (int device = 0; device < num_devices; device++) {
                cudaSetDevice(device);
                mvmul<<<num_blocks, num_threads, 0, stream[device]>>>(
                    mat, vec1, vec2, ...);
                mvmul<<<num_blocks, num_threads, 0, stream[device]>>>(
                    mat, vec2, vec1, ...);
            }
            // GPUs are unsubscribed from pages they did not touch
            if (iter == 0) cuGPSTrackingStop();
        }
    }

Listing 1: A sample GPS application. GPS requires code changes only for the GPS allocation and tracking calls.

Memory allocation and release: GPS provides an API call, cudaMallocGPS(), as a drop-in replacement for the cudaMalloc() (for GPU-pinned memory) or cudaMallocManaged() (for Unified Memory) APIs. This call allocates memory in the GPS virtual address space and backs it with physical memory in at least one GPU. At allocation, the programmer can pass an optional manual parameter to indicate that subscriptions will be managed explicitly for the region. Otherwise, GPS performs automatic subscription management. GPS re-purposes the existing cudaFree() function to release a GPS memory region.

Manual subscription: To allow expert programmers to explicitly manage subscriptions, GPS overloads the existing cuMemAdvise() API used for providing hints to UM with two additional hints to perform manual subscription and unsubscription. Specifically, GPS uses new flag enums CU_MEM_ADVISE_GPS_SUBSCRIBE and CU_MEM_ADVISE_GPS_UNSUBSCRIBE for subscription and unsubscription, respectively. Upon subscription, GPS backs the region with physical memory on the specified GPU. When a programmer unsubscribes a GPU from a GPS region, GPS removes the GPU from the set of subscribers for that region and frees the corresponding physical memory. GPS ensures that there is at least one subscriber to a GPS region and will return an error on attempts to unsubscribe the last subscriber, leaving the allocation in place.

Automatic subscription and profiling phase: As described in Section 3.2, the automatic subscription mechanism comprises a hardware profiling phase during which the mechanism observes application access patterns and determines a set of subscriptions. This profiling phase requires the user to demarcate the start and end of the profiling period using two new APIs, cuGPSTrackingStart() and cuGPSTrackingStop(), which are similar to the existing CUDA calls cuProfilerStart() and cuProfilerStop(). GPS automatically updates subscriptions at the end of the profiling phase; a GPU remains a subscriber if and only if it accessed the page during profiling. Thus, upon receiving cuGPSTrackingStop(), GPS invokes the API cuMemAdvise(..., CU_MEM_ADVISE_GPS_UNSUBSCRIBE) to unsubscribe GPUs from any page they did not access during profiling. (Recall that a GPU may still access a page to which it is not subscribed, but such accesses will be performed remotely at reduced performance; hence, profiling need not be exact to maintain correctness.)

Figure 7: Modifications to GPU hardware needed for GPS provisioning. (Conventional GPU components — SMs, L1 caches, TLBs, and the L2 — are augmented with a GPS remote write queue, a GPS address translation unit, and a GPS access tracking unit; R1–R3 mark the GPS load path, W1–W6 the GPS store path, and T1 the TLB-miss tracking path.)

5 ARCHITECTURAL SUPPORT FOR GPS

We now describe one possible GPS hardware implementation that extends a generic GPU design, such as NVIDIA's recent Ampere GPUs. Our hardware proposal comprises two major components. First, it requires one bit in the GPU page table entry (PTE), the GPS bit, to indicate whether a virtual memory page is a GPS page (i.e., potentially replicated). Second, it requires a new hardware unit to propagate writes to GPUs that have subscribed to particular pages.

5.1 GPS memory operations

GPS must support the following basic memory operations:

Conventional loads, stores, and atomics: Memory accesses to non-GPS pages (virtual addresses for which the GPS bit is not set in the PTE) proceed as they do on conventional GPUs, through the existing GPU TLB and cache hierarchy to either local or remote physical addresses.

GPS loads: Figure 7 shows the paths taken by writes and reads to GPS pages. Note that this figure is a simplified view intended to highlight only the modifications required by GPS. For loads issued to the GPS address space by a GPU that is a subscriber to the page, the conventional GPU page table is configured at the time of subscription to translate the virtual address to the physical address of the local replica. GPS loads thus follow the same path as conventional loads to local memory, as shown by R1, R2, R3 in Figure 7. In the uncommon case, if a GPU is not subscribed to this particular page, either the load forwards a value from the remote write queue (Section 5.2) if there is a hit or it issues remotely to one of the subscribers.

GPS stores and atomics: Stores to GPS pages initially proceed as normal stores, as shown by W1 and W2 in Figure 7. When a thread issues a store to an address whose GPS bit is set in the conventional TLB, and for which there is a local replica, the write operation is forwarded to the local replica with both virtual and physical address (W3), ensuring that subsequent local reads from the same GPU thread will observe the new write, a requirement of the existing GPU memory model. This pattern also ensures that the L2 cache holding the local replica will serve as a common intra-GPU ordering point for stores to that address prior to their being forwarded outside of the GPU, as the memory model requires. In the uncommon case, there is no local replica, and a dummy physical address is used. In addition, whether or not there is a local replica, the write is also forwarded with its virtual address to the GPS unit (described next) for replication to remote subscribers (W4, W5, W6). Atomics follow the same behavior as stores.

5.2 GPS hardware units and extensions

Page table support: Our GPS implementation modifies the baseline GPU page table entries (PTEs) to re-purpose a single currently unused bit (of which there are many [37]) to indicate whether a given page is a GPS page. When this bit is set, the virtual address is GPS-enabled, and the stores to the page will be propagated to the GPS units described below. The baseline GPU virtual memory system remains unchanged aside from this extension.

Our GPS support also introduces a new secondary page table, the GPS page table, for tracking multiple physical mappings that can coexist for a given virtual address when multiple GPUs subscribe to a page. This limited page table for the GPS address space is a variant of a traditional 5-level hierarchical page table with very wide leaf PTEs. Notably, the GPS page table lies off the critical path for memory operations, as it is used only for remote writes triggered by writes to GPS pages. These remote writes are already aggressively coalesced, so the additional latency of the GPS address translation unit can overlap with the coalescing period and adds only a small additional latency. Furthermore, by design, these writes are not required to become visible until the next synchronization boundary, so they are not latency-sensitive. Therefore, the latencies imposed by the address translation unit are not a critical factor, even under TLB misses.

Each GPS-PTE contains the physical page addresses of all the remote subscribers to that page, as shown in Figure 7. The GPS-PTE is sized at GPU initialization based on the number of GPUs in the system. For example, with 64KB pages, for a Virtual Page Number (VPN) size of 33 bits and Physical Page Number (PPN) size of 31 bits [37], for a 4 GPU system, the minimum GPS-PTE entry size is 126 bits.

We choose to allocate memory in the GPS address space using 64KB pages for two reasons. First, the negative impact of false sharing due to large pages is multiplied due to GPS' replication of remote stores. Second, the GPU's conventional TLB is already sized to provide full coverage of the entire VA range [47]. As discussed later in Section 7.4, the GPS address translation unit requires a surprisingly small translation capacity even with 64KB pages. Therefore, neither TLB lies on the execution's critical path.
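As a sanity check on the GPS-PTE sizing above (our arithmetic, under the assumption that each entry stores one VPN plus one PPN per remote subscriber): with a 33-bit VPN and 31-bit PPNs, a 4-GPU system has 3 remote replicas per page, giving 33 + 3 × 31 = 126 bits per entry, matching the figure quoted above; a 16-GPU system would need 33 + 15 × 31 = 498 bits per entry under the same layout.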


Coalescing remote writes: If a store's translation information indicates it is to a GPS-enabled address, the GPS write proceeds to the remote write queue (W4). This queue is fully associative and virtually addressed at cache block granularity. The queue coalesces all writes to the same cache block (unless they are sys-scoped) and buffers them until they are drained to remote GPU memory. This simple optimization results in substantial inter-GPU bandwidth savings for many applications yet maintains weak memory model compatibility.

Whenever the occupancy of the write combining buffer reaches a high watermark, it seeks to drain the least recently added entry to the remote destinations. In our evaluation, we set this high watermark to one less than the buffer's capacity to maximize coalescing opportunity. On draining, the flushed entry moves to the GPS address translation unit. Additionally, the remote write queue unit must fully drain at synchronization points, e.g., when a sys-scoped memory fence is issued. In our final proposal with 512 entries, the GPS-write buffer requires 68 KB of SRAM storage, an insignificant chip area for a single resource shared among all SMs in the GPU.

GPS address translation: When a GPS store reaches the GPS address translation unit (W5), it looks up translation information cached from the GPS page table in its internal, wide GPS-TLB. Much like conventional TLB misses, GPS-TLB misses trigger a hardware page walk that fetches a GPS-PTE entry containing the physical addresses for all subscribers. Once translated, the GPS address translation unit sends a duplicate copy of the store for each subscriber to the inter-GPU interconnect (W6). The GPS address translation unit also drains at synchronization points.

Access tracking unit: Our GPS implementation uses subscribed-by-default profiling: it employs a dynamic mechanism that typically begins by subscribing all GPUs to the entirety of GPS allocations at the beginning of execution. As discussed in Section 4, GPS tracks the access patterns during a profiling phase, after which it unsubscribes each GPU from pages they did not access. Although the early over-subscription initially transfers more data than is required, unsubscribed-by-default profiling incurs stalls due to faults or remote accesses on the first touch and hence is more expensive.

The GPS access tracking unit provides the hardware support for runtime subscription profiling. It maintains a bitmap in DRAM with one bit per page in the GPS address space. Misses at the last-level conventional GPU TLBs to pages in the GPS virtual address space are forwarded to the access tracking unit, which sets the bit corresponding to the page (as shown by T1 in Figure 7). TLB misses are infrequent yet cover all pages accessed by the GPU, so the bandwidth required to maintain this bitmap is low (typically only 1.4 TLB misses per thousand cycles [47]). We maintain the bitmap in DRAM since the bitmap is used only during the initial profiling and unsubscription. Tracking a 32GB virtual address range, the bitmap requires only 64KB of DRAM, and updates can be aggressively cached or write-combined to minimize DRAM bandwidth impact. Thus, the total area and energy consumed by these hardware extensions are negligible relative to the GPU SoC.

The bitmap managed by the access tracking unit is read by the GPU driver during the cuGPSTrackingStop() API call and used to configure the conventional and GPS page tables appropriately. GPS pages with only a single subscriber are downgraded to conventional pages within the page tables, as duplication of writes to such pages at the conventional TLB (Section 5.1) is an unnecessary waste of resources when there is only one subscriber. For GPS pages with multiple subscribers, the GPS bit is set in the conventional page table entry for that page, and the GPS page table entry for the page is updated to record the physical addresses of all subscribers' replicas for that page.

5.3 Discussion

Coalescing in the L2 cache: An alternative implementation strategy for GPS would be to perform coalescing directly within the L2 cache rather than provisioning a dedicated structure. Given that the GPS remote write queue amounts to only a few kilobytes of state and the L2 size is in megabytes (6MB in NVIDIA V100 GPUs and 40MB in NVIDIA A100 GPUs [39]), cache capacity impact would be negligible. We chose to implement a dedicated write queue to isolate its impact from any other cache interference effects and so that the only change required to the L2 cache will be a shim at the ingress point, which forwards GPS writes to the GPS remote write queue.

GPS remote write queue addressing: Our GPS implementation assumes the remote write queue is virtually addressed, irrespective of whether it is maintained as a dedicated SRAM structure or as specially marked and reserved cache lines in L2. If GPS were to perform translation prior to the GPS write queue, a remote store would require one entry per subscribing GPU (one per remote physical address); performing translation as the stores are drained to the interconnect conserves space in the write queue.

Handling sys-scoped writes: Strong sys-scoped writes must be kept coherent across all GPUs. Our GPS implementation handles sys-scoped writes in the same way that UM handles writes to pages annotated with read-mostly cudaMemAdvise hints. Upon detection of a sys-scoped store to a GPS page, the access faults, all in-flight accesses to the page are flushed, and the page is collapsed to a single copy and demoted to a conventional page (i.e., its GPS bit is cleared). Accesses to the page are all issued to the GPU hosting the single physical copy from this point on. This approach ensures coherence and same-address ordering for this access and all future accesses to the page.

As mentioned in Section 3.3, sys-scoped accesses are rare, and hence the impact on typical programs is minimal. The user is expected to explicitly opt pages holding synchronization variables out of GPS (i.e., use cudaMalloc instead of cudaMallocGPS). If the user provides incorrect hints, then just as with UM, there will be a performance penalty. Nevertheless, the execution remains functionally correct.

Handling memory oversubscription: If the GPU driver swaps out a page from a subscriber due to oversubscription, that GPU will be unsubscribed and will access that page remotely.

6 EXPERIMENTAL METHODOLOGY

To evaluate the benefits of GPS, we extend the NVIDIA Architectural Simulator (NVAS) [57] to model multi-GPU systems comprising NVIDIA GV100 GPUs on PCIe, with parameters shown in Table 1. The simulator is driven by application traces collected at the SASS (GPU assembly) level using the binary instrumentation tool NVBit [58] on real hardware.

These traces contain CUDA API events, GPU kernel instructions, and memory addresses accessed, but no pre-recorded timing events. The simulator models the timing aspects of the trace replay in accordance with the GPU and interconnect architectural models and respects all functional dependencies such as work scheduling, barrier synchronization, and load dependencies. We have specifically calibrated the link and switch parameters in our interconnect models to match several (sometimes speculative) PCIe generations. This simulator has been correlated across a wide range of benchmarks and GPU models but remains fast enough to model complex multi-GPU systems and the hard-to-scale applications suitable for evaluating GPS.

GPU Parameters
  Cache block size: 128 bytes
  Global memory: 16 GB
  Streaming multiprocessors (SMs): 80
  CUDA cores/SM: 64
  L2 cache size: 6 MB
  Warp size: 32
  Maximum threads per SM: 2048
  Maximum threads per CTA: 1024
GPS Structures
  Remote write queue: 512 entries
  Remote write queue entry size: 135 bytes
  TLB: 8-way set associative
  TLB size: 32 entries
  Virtual address: 49 bits
  Physical address: 47 bits

Table 1: Simulation settings, based on NVIDIA V100.

We evaluate a suite of multi-GPU applications shown in Table 2. These include all applications used to evaluate PROACT [34]. We also study those applications from the Tartan benchmark suite [29] whose strong scaling performance was bound by inter-GPU communication when measured on real systems. These applications also possess varying communication patterns, giving us a broader opportunity to evaluate GPS. For the Tartan applications not bound by inter-GPU communication, we found that GPS obtains the same performance as the native version and have not included them in the interest of space. We modify the applications only to implement the different multi-GPU programming paradigms, and the partitioning of applications across multi-GPUs remains the same as the original code for all paradigms. All our application variants are written in CUDA and compiled using CUDA 10.

To demonstrate the ability of GPS to improve multi-GPU scalability, we compare it with several contemporary multi-GPU programming paradigms as discussed in Section 2.1:

Unified Memory without Hints: We simulate baseline Unified Memory without user-provided hints. Application code allocates shared memory regions using the cudaMallocManaged() API. By default, the simulator allocates pages on the first GPU that touches the page. Subsequent accesses by peer GPUs to the same page will result in fault-based page migration as described in Section 2.1.

Unified Memory with Hints: For this paradigm, we hand-tune each application using a combination of four manual hints, namely read-mostly, accessed-by, prefetch, and preferred-location hints. Based on the compute partitioning across GPUs, we set the GPU that issues writes to a given memory region as its preferred location. The most proactive approach we can configure with UM hints is to pick one consumer to be the preferred location, and since each producer of a page is always also a consumer of the page in our applications, that was a convenient and close-to-optimal choice. We also set GPUs that read from remote pages as accessed-by those GPUs. Although read-mostly hints are generally effective, we did not use them because our applications had no read-only pages accessed by multiple GPUs. Before each kernel launch, we enable GPUs to prefetch remote regions they may access through prefetch hints.

Remote Demand Loads (RDL): While GPS performs all loads locally by issuing the stores to all subscribers, RDL performs the converse: it issues stores to local memory and loads to the most recent GPU to issue a store to a given page. We believe that this paradigm is representative of an expert programmer who manually tracks writers to each page. We simulate this expertise by explicitly tracking the latest write to each page in the simulator and using this information during address translation to issue the read to the appropriate GPU.

Memcpy: This paradigm duplicates data structures among all GPUs and broadcasts updates via cudaMemcpy() calls at the synchronization barriers. This duplication ensures that all data structures are resident in local GPU memory when accessed by kernels in the subsequent synchronization phase; there are no remote accesses during kernel execution. However, there is also no overlap between data transfers and compute.

GPS with automatic subscription: We implement GPS with automatic subscription management by modifying applications as described in Section 4 and marking all memory allocations as GPS allocations.

Infinite bandwidth: Finally, we provide an infinite bandwidth comparison, which establishes an upper bound on achievable multi-GPU performance if all data were always accessible locally at each GPU (i.e., it ignores all transfer costs). We obtain this comparison by eliding the data transfer time from the memcpy variant.
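For reference, the Unified Memory with Hints baseline above can be expressed with today's CUDA hint APIs roughly as follows (a generic sketch; the per-application choices of regions, devices, and prefetch points used in the evaluation are not spelled out here).

    #include <cuda_runtime.h>

    // 'region' is written by GPU 'writer' and also read by GPU 'reader'.
    void apply_um_hints(void* region, size_t bytes, int writer, int reader,
                        cudaStream_t stream) {
        // Preferred location: keep the pages resident on the GPU that writes them.
        cudaMemAdvise(region, bytes, cudaMemAdviseSetPreferredLocation, writer);
        // Accessed-by: let the reader map the pages instead of faulting them over.
        cudaMemAdvise(region, bytes, cudaMemAdviseSetAccessedBy, reader);
        // Prefetch the region toward the reader before launching its kernel.
        cudaMemPrefetchAsync(region, bytes, reader, stream);
    }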


Application | Description | Predominant Communication Pattern
Jacobi | Iterative algorithm that solves a diagonally dominant system of linear equations | Peer-to-peer
Pagerank | Algorithm used by Google Search to rank web pages in their search engine results | Peer-to-peer
SSSP | Shortest path computation between every pair of vertices in a graph | Many-to-many
ALS | Matrix factorization algorithm | All-to-all
CT | Model Based Iterative Reconstruction algorithm used in CT imaging | All-to-all
B2rEqwp | 3D earthquake wave-propagation model simulation using a 4th-order finite difference method | Peer-to-peer
Diffusion | A multi-GPU implementation of the 3D Heat Equation and inviscid Burgers' Equation | Peer-to-peer
HIT | Simulating Homogeneous Isotropic Turbulence by solving the Navier-Stokes equations in 3D | Peer-to-peer

Table 2: Applications under study.
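To illustrate the predominant "peer-to-peer" pattern in Table 2: a stencil solver such as Jacobi only needs to push its boundary rows (halos) to its neighboring GPU each iteration, which is why most of its shared pages end up with a single remote subscriber (Section 7.2). The sketch below is ours, with illustrative pointer names; the halo buffers are assumed to be GPS-allocated or peer-accessible regions on the neighboring GPUs.

    // After computing its partition, a GPU pushes its first and last grid rows into
    // the halo regions owned by its upper and lower neighbors (nullptr at domain edges).
    __global__ void push_halos(const float* my_grid, float* upper_halo, float* lower_halo,
                               int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col < width) {
            if (upper_halo) upper_halo[col] = my_grid[col];                         // first row
            if (lower_halo) lower_halo[col] = my_grid[(height - 1) * width + col];  // last row
        }
    }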

7 EXPERIMENTAL RESULTS

GPS relies on fine-grained, proactive data transfers to remote GPUs during kernel execution to optimize GPU locality. The subscription management mechanism ensures that only the required data is transferred, resulting in interconnect bandwidth savings. GPS performance benefits arise for three reasons: (1) GPS proactively publishes updates to subscribers, enabling them to fetch hot-path data from high-bandwidth local memory. (2) By automatically identifying subscribers for a given page, GPS publishes updates only to the GPUs that require them, resulting in significant interconnect bandwidth savings. (3) Coalescing in the GPS write queue results in substantial bandwidth reduction, especially for applications where subscriptions alone are not sufficient to achieve peak performance.

7.1 End-to-end performance

Figure 8 shows the 4-GPU application speedup over a single GPU for the different programming paradigms described in Section 6.

Figure 8: 4-GPU speedup of different paradigms. (Performance relative to 1 GPU for UM, UM + hints, RDL, Memcpy, GPS, and an infinite-bandwidth interconnect.)

Figure 9: Subscriber distribution for shared application pages. GPS subscriptions result in interconnect bandwidth savings for all pages with less than 4 subscribers.

First, we observe that the unified memory paradigm is ineffective for these applications, despite having attractive programmability properties, due to the cost of page faults necessary to migrate data between GPUs. While work is ongoing [9, 64], it is unclear if executing page faults on the GPU's critical load path in multi-GPU systems will ever scale well.

Second, though applying UM hints results in performance benefits over baseline UM, we found that extracting maximum benefits out of hints by achieving fine-grained data sharing requires detailed knowledge about the data access patterns of the applications along with significant programmer effort. The performance penalty for incorrectly applied hints is still substantial, ranging from slow remote access in the best case to thrashing page migrations and expensive faults and TLB shootdowns in the worst case. Also, writes to pages replicated across subscribers result in the replicated pages collapsing back to the writer, further degrading performance. This degradation can be somewhat mitigated through profiling and fine-tuning of hints, but generally this approach is only effective after substantial expert effort and nevertheless faces fundamental limitations like the inability to replicate read-write pages, giving GPS an advantage.

Third, memcpy at kernel boundaries performs well for CT, but on average does not achieve any improvement over a well-optimized single GPU implementation, due to inefficient interconnect utilization during compute phases. This finding is significant because it demonstrates that though using programmer-directed memcpy to optimize locality is the most common programming technique today, it is unlikely to result in good strong scaling performance for applications bound by inter-GPU communication.

Finally, remote data loading performs well for applications where multi-threading is sufficient to hide remote load latencies; however, for others, these loads lie in the critical path and can have a severe adverse effect on performance.

The GPS approach ensures a good overlap of communication and computation phases while at the same time decreasing the unwanted stores to GPUs that will not access them (locally or remotely). As a result, GPS offers an overall average speedup of 3× (out of 3.2× possible) over the single-GPU programs. Furthermore, we observe that EQWP achieves greater than 4× speedup due to an improvement in L2 hit rate from 55% to 68% when scaling to 4 GPUs due to the increase in aggregate cache capacity.

7.2 Benefits of subscription tracking

Figure 9 shows the distribution of GPS pages with more than one subscriber at the beginning of the GPS execution phase. Whereas some applications do perform all-to-all transfers for nearly all pages in the GPS space (ALS), other applications (like Jacobi) require only one remote subscriber for most pages because of how the algorithm performs boundary exchange of halos. The variation in these applications' subscription sets supports the idea that programmer efficiency is maximized by allowing promiscuous subscription to the GPS space, with automatic hardware unsubscription when the programmer cannot easily determine the ideal subscription set for each GPU.

Figure 10 compares the total data moved over the interconnect for all the paradigms. We normalize the numbers for all the paradigms to the memcpy variant since it copies all shared data exactly once across all the GPUs, thus enabling us to quantify the additional write traffic introduced by each paradigm. The figure shows that the effectiveness of each technique is highly dependent on the access pattern of the applications.

Figure 10: Total data moved over the interconnect normalized to memcpy (bulk-synchronous transfers); lower is better. (Series: UM, UM + hints, RDL, GPS.)

Figure 11: Performance sensitivity to subscription. (GPS without subscription vs. GPS with subscription; performance relative to 1 GPU.)

Figure 12: 16-GPU performance achieved by different paradigms. (Performance relative to 1 GPU for UM, UM + hints, RDL, Memcpy, GPS, and an infinite-bandwidth interconnect.)

We find that Unified Memory often causes significant increases in interconnect traffic compared to manual memcpy because the presence of multiple subscribers to the same page results in pages thrashing back and forth between the multiple GPUs that access it. The exceptions are Jacobi and CT, where the interconnect traffic due to memcpy needlessly copying data to GPUs that do not access them outweighs the traffic due to page migrations. Adding hints to UM decreases the total data moved when compared to UM in all cases except Diffusion, where more fine-grained prefetching hints are required to avoid over-fetching pages needlessly. RDL moves less data than memcpy in all cases except for ALS, wherein a lack of temporal locality results in the same cacheline being fetched multiple times over the interconnect.

Compared to other paradigms, GPS's unsubscription mechanism drastically reduces the total data transferred over the interconnect in most cases. However, compared to the memcpy paradigm, there may be improvement or degradation in the amount of data transferred. We observe improvements when GPS allows small granularity updates to occur within a page, in contrast to bulk DMA transfers. When degradation occurs, it is typically due to the GPS write combining buffer failing to coalesce multiple writes to the same block effectively. However, if the excess writes do not saturate the interconnect, they will not typically stall GPU execution.

The end-to-end performance benefits of subscription tracking are shown in Figure 11, which compares the performance of GPS with and without subscription. The bandwidth savings due to subscription tracking is the primary factor in GPS's scalable performance results for most applications. The exceptions are ALS and CT, where a majority of the shared pages are subscribed by all the GPUs, resulting in significant all-to-all transfers.

7.3 Scalability beyond 4-GPUs

To understand GPS's scalability as GPU counts in a system increase, we simulate the performance of the different programming paradigms on a system comprising 16 NVIDIA GV100 GPUs using a projected PCIe 6.0 interconnect (operating at 128GB/s). Figure 12 shows the performance relative to one GPU for each of these paradigms. The application performance trend across the five techniques is similar to the 4-GPU trends discussed in Section 7.1. While current paradigms do not scale well on average, GPS achieves scalable performance with a mean speedup of 7.9×, capturing 80% of the performance opportunity available when modeling an infinite bandwidth interconnect.

7.4 Sensitivity Studies

Interconnect bandwidth: Figure 13 provides the geometric mean performance of each programming paradigm shown in Figure 8 while increasing the throughput of the inter-GPU interconnect. Our results show that despite expected dramatic improvements in PCIe and other inter-GPU interconnects [4, 41], strong scaling remains difficult to achieve using traditional GPU multi-programming paradigms. Conversely, as GPUs move to higher performance interconnects, GPS will approach the limits of performance scalability by making efficient use of inter-GPU bandwidth.

Write queue size: Sizing the GPS remote write queue properly is critical to overall GPS system performance. An ideal queue contains enough entries to exploit applications' temporal locality but is not so large that the associative lookup operations become overly expensive. Figure 14 studies the sensitivity of buffer size versus achieved hit rate. With 512 buffer entries, all applications achieve near peak performance.

UM UM + hints RDL Memcpy GPS Infinite BW CT EQWP Diffusion HIT


60

GPS remote write queue


Speedup over 1 GPU

3.5
3 40

hit rate (%)


2.5
2
1.5
20
1
0.5
0
0
PCIe 3.0 PCIe 4.0 PCIe 5.0 PCIe 6.0
(projected) 0 200 400 600 800 1000
GPS remote write queue size

Figure 13: Sensitivity to interconnect bandwidth.


Figure 14: Sensitivity to GPS write queue size.

near peak performance. Further performance is difficult to capture


due to random writes that have neither temporal nor spatial locality. bytes within a cache block are updated for many HPC applications,
We note that Jacobi exhibits a 0% hit rate since all spatial locality is resulting in the transfer of some unneeded bytes. Third, although
fully captured in the coalescer internal to the SM, while Pagerank, the write combining buffer coalesces stores to the same cache block,
ALS, and SSSP exhibit a 0% hit rate since they predominantly issue if the stores are temporally distant, the prior store might have al-
atomic operations whose coalescing is not supported by the GPS ready been flushed from the write combining buffer before the
write queue. subsequent store to the same cache block arrives. Despite these
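The trade-off in this paragraph can be illustrated with a small hit-rate model
of the remote write queue, sketched below under two assumptions of ours: FIFO
eviction and a 128-byte line size (neither is specified in the text). Stores
coalesce into an existing entry for the same cache line, the oldest entry is
flushed over the interconnect when the queue is full, and atomics bypass
coalescing, matching the 0% hit rates noted for Pagerank, ALS, and SSSP.

```cpp
// Minimal hit-rate model of a remote write queue that coalesces stores to the
// same cache line before they cross the interconnect. The eviction policy,
// line size, and class name are illustrative assumptions, not GPS details.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

constexpr uint64_t    kLineSize   = 128;  // assumed transfer granularity
constexpr std::size_t kMaxEntries = 512;  // size at which the studied apps reach near-peak hit rate

class WriteQueueModel {
    std::deque<uint64_t> lines_;          // pending cache-line addresses, oldest at the front
public:
    uint64_t hits = 0, misses = 0;

    void store(uint64_t addr, bool is_atomic) {
        if (is_atomic) { ++misses; return; }      // atomics are not coalesced by the queue
        uint64_t line = addr & ~(kLineSize - 1);
        if (std::find(lines_.begin(), lines_.end(), line) != lines_.end()) {
            ++hits;                               // associative lookup found a pending entry
            return;
        }
        ++misses;
        if (lines_.size() == kMaxEntries)
            lines_.pop_front();                   // oldest entry is flushed to the interconnect
        lines_.push_back(line);
    }
};
```

Sweeping kMaxEntries in such a model is one way to produce a curve of the shape
shown in Figure 14: larger queues capture more temporal locality until the
remaining misses come only from streams with no reuse or from atomics.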
GPS-TLB Size: In virtual memory systems, the translation information for a
virtual address will typically be fetched only once (upon the first request to
the page), and subsequent accesses to the page will result in TLB hits. The
GPS-TLB is no exception, and it is important to size the GPS-TLB appropriately
to capture locality in the GPS space. GPUs now have thousands of entries in
their last-level TLB [35], and we initially expected the GPS-TLB to require a
similar number of entries. However, we found that the GPS-TLB hit rate
approaches 100% at just 32 entries. Because the GPS-TLB only services a
fraction of the entire GPU memory space (GPS-allocated heap pages) and does not
service read requests to the GPS address space (those are typically serviced
from the normal local memory system), it is under substantially less pressure
than the general-purpose GPU TLBs and can thus be much smaller.
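As one way to frame such a sizing study, the sketch below models the GPS-TLB as
a small fully associative translation cache with LRU replacement that is
consulted only by stores to the GPS heap. Both the organization and the
replacement policy are our assumptions for exposition; the paper does not
specify them.

```cpp
// Small translation-cache model for writes to the GPS heap: fully associative
// with LRU replacement (assumed organization), used to see how quickly the
// hit rate saturates as the entry count grows.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class GpsTlbModel {
    std::size_t capacity_;
    std::list<uint64_t> lru_;                                   // most recent page at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
public:
    explicit GpsTlbModel(std::size_t entries) : capacity_(entries) {}
    uint64_t hits = 0, misses = 0;

    void access(uint64_t va, uint64_t page_size = 64 * 1024) {
        uint64_t vpn = va / page_size;
        auto it = map_.find(vpn);
        if (it != map_.end()) {                                 // translation already cached
            ++hits;
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        ++misses;                                               // page walk on first touch
        if (lru_.size() == capacity_) {                         // evict least recently used page
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(vpn);
        map_[vpn] = lru_.begin();
    }
};
```

Because only GPS-heap writes reach this structure, sweeping the entry count
over that write stream saturates at a few tens of entries rather than the
thousands needed by the general-purpose GPU TLBs.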
Page size: To measure the impact of the virtual memory page size, we study the
performance of GPS under three page sizes, namely 4kB, 64kB, and 2MB. While a
smaller page size can decrease the false sharing of GPS pages, it significantly
increases the pressure on all the TLBs in the GPU, resulting in the 4kB variant
being 42% slower than 64kB. Conversely, though a larger page size enables
improved TLB hit rates, the interconnect traffic increases due to a larger
number of redundant remote transfers, resulting in the 2MB variant being 15%
slower than 64kB. Therefore, 64kB is the sweet spot we focus on in our
evaluation.

7.5 Limitations of the GPS approach
The GPS paradigm does not achieve maximum theoretical performance due to three
primary issues. First, even though a GPU may subscribe to a page, it may only
require updates to a portion of the (64kB) page, yet the interconnect will
still transmit all updates to this page due to false sharing. Second, GPUs
typically issue interconnect transfers at cache-line granularity. We observe
that, for many HPC applications, not all bytes within a cache block are
updated, resulting in the transfer of some unneeded bytes. Third, although the
write-combining buffer coalesces stores to the same cache block, if the stores
are temporally distant, the prior store might have already been flushed from
the write-combining buffer before the subsequent store to the same cache block
arrives. Despite these inefficiencies, GPS represents a significant performance
improvement over the state of the art in multi-GPU performance scalability
while also providing a simple and universal programming interface.

8 RELATED WORK
Prior work [6, 9, 11, 25, 32, 62] has explored the use of various hardware and
software mechanisms to improve multi-GPU performance. Our work explores
employing the publish-subscribe paradigm for strong scaling in multi-GPU
systems. Prior work [49, 50, 56] has also proposed different mechanisms for
inter-GPU coherence. GPS does not implement an expensive coherence protocol
because it is not required for GPU memory model compatibility.

Several works propose multi-GPU memory management solutions. Although
Griffin [9] optimizes page migration, reads to pages accessed by more than one
GPU still result in on-demand remote reads that can fall on the execution
critical path. GPS avoids these demand reads. CARVE [62] derives its benefits
from caching remote data in local DRAM. However, it does not proactively update
locally cached data, resulting in demand reads to data updated by peer GPUs.
Only subsequent reads, after the first demand miss, are serviced from the DRAM
cache, thus benefiting only workloads with substantial data re-use.

Several teams have studied prefetching in the context of single GPUs
[26, 27, 52], but multi-GPU systems pose new challenges due to concurrent
accesses and the high cost of GPU TLB shootdowns to migrate pages among the
GPUs. NVIDIA's Unified Memory allows programmers to explicitly provide data
placement hints [36] to improve locality through programmer-controlled
prefetching and decrease the rate of page migration in multi-GPU systems. It
also provides a mechanism to allow read-only page duplication among GPUs, but
upon any write to the page, it must collapse back to a single GPU. In future
work, it could be possible to improve UM performance by layering it on top of a
GPS-like system.


The publish-subscribe communication paradigm for distributed interaction has
been explored by prior work [2, 7, 8, 17]. Hill et al. [23] propose a
Check-In/Check-out model for shared-memory machines. The more traditional
alternative to publish-subscribe support is NUMA memory management. Dashti et
al. [14] develop and implement a memory placement algorithm in Linux to address
traffic congestion in modern NUMA systems. Many other works [1, 16, 28, 48, 63]
perform NUMA-aware optimizations to improve performance, and hardware-based
peer caching has been explored but is yet to be adopted by GPU
vendors [10, 13, 30, 46, 54]. Recently, DRAM caches for multi-node systems [12]
have been proposed to achieve large capacity advantages.

Prior work has also explored scoped synchronization for memory
models [19, 22, 24, 31, 44, 61]. Non-scoped GPU memory models are simpler [53],
but do not permit the same type of coalescing as GPS, which makes explicit use
of scopes.

9 CONCLUSION
Strong scaling in multi-GPU systems is a challenging task. In this work, we
proposed and evaluated GPS, a HW/SW multi-GPU memory management technique to
improve strong scaling in multi-GPU systems. GPS automatically tracks the
subscribers to each page of memory and proactively broadcasts fine-grained
stores to these subscribers. This enables each subscriber to read data from its
local memory at high bandwidth. GPS provides significant performance
improvement while retaining compatibility with conventional GPU programming and
memory models. Evaluated on a model of 4 NVIDIA V100 GPUs and several
interconnect architectures, GPS offers an average speedup of 3.0× over 1 GPU
and performs 2.3× better than the next best available multi-GPU programming
paradigm. On a similar 16-GPU system, GPS captures 80% of the available
performance opportunity, a significant lead over today's multi-GPU programming
models.

ACKNOWLEDGMENTS
The authors thank Zi Yan and Oreste Villa from NVIDIA Research for their
support with NVAS and the anonymous reviewers for their valuable feedback. This
work was supported by the Center for Applications Driving Architectures (ADA),
one of six centers of JUMP, a Semiconductor Research Corporation program
co-sponsored by DARPA.

REFERENCES
[1] Neha Agarwal, David Nellans, Mark Stephenson, Mike O'Connor, and Stephen W Keckler. 2015. Page Placement Strategies for GPUs Within Heterogeneous Memory Systems. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[2] Marcos K Aguilera, Robert E Strom, Daniel C Sturman, Mark Astley, and Tushar D Chandra. 1999. Matching Events in a Content-Based Subscription System. In Symposium on Principles of Distributed Computing (PODC).
[3] Jasmin Ajanovic. 2009. PCI Express 3.0 Overview. In A Symposium on High Performance Chips (Hot Chips).
[4] AMD. 2019. AMD Infinity Architecture: The Foundation of the Modern Datacenter. Product Brief. amd.com/system/files/documents/LE-70001-SB-InfinityArchitecture.pdf, last accessed on 08/17/2020.
[5] AMD. 2020. AMD Crossfire™ Technology. www.amd.com/en/technologies/crossfire, last accessed on 04/14/2021.
[6] Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for Continued Performance Scalability. In International Symposium on Computer Architecture (ISCA).
[7] Guruduth Banavar, Tushar Chandra, Bodhi Mukherjee, Jay Nagarajarao, Robert E Strom, and Daniel C Sturman. 1999. An Efficient Multicast Protocol for Content-Based Publish-Subscribe Systems. In International Conference on Distributed Computing Systems (ICDCS).
[8] Guruduth Banavar, Tushar Chandra, Robert Strom, and Daniel Sturman. 1999. A Case for Message Oriented Middleware. In International Symposium on Distributed Computing (DISC).
[9] Trinayan Baruah, Yifan Sun, Ali Dinçer, Md Saiful Arefin Mojumder, José L. Abellán, Yash Ukidave, Ajay Joshi, Norman Rubin, John Kim, and David Kaeli. 2020. Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems. In International Symposium on High Performance Computer Architecture (HPCA).
[10] Arkaprava Basu, Sooraj Puthoor, Shuai Che, and Bradford M Beckmann. 2016. Software Assisted Hardware Cache Coherence for Heterogeneous Processors. In International Symposium on Memory Systems (ISMM).
[11] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2020. Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing. Transactions on Parallel Computing (TOPC) 7, 3 (2020), 1–27.
[12] Chiachen Chou, Aamer Jaleel, and Moinuddin K Qureshi. 2016. CANDY: Enabling Coherent DRAM Caches for Multi-node Systems. In International Symposium on Microarchitecture (MICRO).
[13] Mohammad Dashti and Alexandra Fedorova. 2017. Analyzing Memory Management Methods on Integrated CPU-GPU Systems. In International Symposium on Memory Management (ISMM).
[14] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[15] Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. 2006. Towards Expressive Publish/Subscribe Systems. In International Conference on Extending Database Technology (EDBT).
[16] Andi Drebes, Karine Heydemann, Nathalie Drach, Antoniu Pop, and Albert Cohen. 2014. Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-parallel Languages. Transactions on Architecture and Code Optimization (TACO) 11, 3 (2014), 1–25.
[17] Patrick Th Eugster, Pascal A Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. 2003. The Many Faces of Publish/Subscribe. Computing Surveys (CSUR) 35, 2 (2003), 114–131.
[18] Françoise Fabret, H Arno Jacobsen, François Llirbat, João Pereira, Kenneth A Ross, and Dennis Shasha. 2001. Filtering Algorithms and Implementation for Very Fast Publish/Subscribe Systems. In International Conference on Management of Data (SIGMOD).
[19] Benedict R Gaster. 2013. HSA Memory Model. In A Symposium on High Performance Chips (Hot Chips).
[20] Tom's Hardware. 2019. AMD Big Navi and RDNA 2 GPUs. tomshardware.com/news/amd-big_navi-rdna2-all-we-know, last accessed on 08/17/2020.
[21] Mark Harris. 2017. Unified Memory for CUDA Beginners. developer.nvidia.com/blog/unified-memory-cuda-beginners/, last accessed on 08/17/2020.
[22] Blake A Hechtman, Shuai Che, Derek R Hower, Yingying Tian, Bradford M Beckmann, Mark D Hill, Steven K Reinhardt, and David A Wood. 2014. QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[23] Mark D Hill, James R Larus, Steven K Reinhardt, and David A Wood. 1992. Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[24] Derek R Hower, Blake A Hechtman, Bradford M Beckmann, Benedict R Gaster, Mark D Hill, Steven K Reinhardt, and David A Wood. 2014. Heterogeneous-race-free Memory Models. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[25] Hyojong Kim, Jaewoong Sim, Prasun Gera, Ramyad Hadidi, and Hyesoon Kim. 2020. Batch-Aware Unified Memory Management in GPUs for Irregular Workloads. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[26] Nagesh B Lakshminarayana and Hyesoon Kim. 2014. Spare Register Aware Prefetching for Graph Algorithms on GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[27] Jaekyu Lee, Nagesh B Lakshminarayana, Hyesoon Kim, and Richard Vuduc. 2010. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. In International Symposium on Microarchitecture (MICRO).
[28] Baptiste Lepers, Vivien Quéma, and Alexandra Fedorova. 2015. Thread and Memory Placement on NUMA Systems: Asymmetry Matters. In USENIX Annual Technical Conference (USENIX ATC).
[29] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent, and Kevin Barker. 2018. Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite. In International Symposium on Workload Characterization (IISWC).

[30] Daniel Lustig and Margaret Martonosi. 2013. Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization. In International Symposium on High Performance Computer Architecture (HPCA).
[31] Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[32] Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. 2017. Beyond the Socket: NUMA-aware GPUs. In International Symposium on Microarchitecture (MICRO).
[33] Gero Mühl. 2002. Large-Scale Content-Based Publish-Subscribe Systems. Ph.D. Dissertation. Technische Universität Darmstadt.
[34] Harini Muthukrishnan, David Nellans, Daniel Lustig, Jeffrey Fessler, and Thomas Wenisch. 2021. Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-grained Transfers. In International Symposium on Computer Architecture (ISCA).
[35] Prashant J Nair, David A Roberts, and Moinuddin K Qureshi. 2016. Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures. Transactions on Architecture and Code Optimization (TACO) 12, 4 (2016), 1–24.
[36] NVIDIA. 2013. CUDA Toolkit Documentation. docs.nvidia.com/cuda/, last accessed on 08/17/2020.
[37] NVIDIA. 2019. GP100 MMU Format. nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf, last accessed on 08/17/2020.
[38] NVIDIA. 2019. NVLink and NVSwitch: The Building Blocks of Advanced Multi-GPU Communication. nvidia.com/en-us/data-center/nvlink/, last accessed on 08/17/2020.
[39] NVIDIA. 2020. NVIDIA Ampere Architecture. www.nvidia.com/en-us/data-center/ampere-architecture/, last accessed on 04/14/2021.
[40] NVIDIA. 2020. NVIDIA DGX Systems. www.nvidia.com/en-us/data-center/dgx-systems/, last accessed on 04/14/2021.
[41] NVIDIA. 2020. NVIDIA NVLink High-Speed GPU Interconnect. nvidia.com/en-us/design-visualization/nvlink-bridges/, last accessed on 08/17/2020.
[42] NVIDIA. 2020. NVIDIA TITAN V, NVIDIA's Supercomputing GPU Architecture, Now for Your PC. www.nvidia.com/en-us/titan/titan-v/, last accessed on 08/17/2020.
[43] NVIDIA. 2020. PTX: Parallel Thread Execution ISA Version 7.0. docs.nvidia.com/cuda/pdf/ptx_isa_7.0.pdf, last accessed on 08/17/2020.
[44] Marc S Orr, Shuai Che, Ayse Yilmazer, Bradford M Beckmann, Mark D Hill, and David A Wood. 2015. Synchronization Using Remote-Scope Promotion. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[45] Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K Panda. 2013. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. In International Conference on Parallel Processing (ICPP).
[46] Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M Beckmann, Mark D Hill, Steven K Reinhardt, and David A Wood. 2013. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In International Symposium on Microarchitecture (MICRO).
[47] Jason Power, Mark D Hill, and David A Wood. 2014. Supporting x86-64 Address Translation for 100s of GPU Lanes. In International Symposium on High Performance Computer Architecture (HPCA).
[48] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki. 2015. Scaling Up Concurrent Main-memory Column-store Scans: Towards Adaptive NUMA-aware Data and Task Placement. In Proceedings of the VLDB Endowment (PVLDB).
[49] Xiaowei Ren and Mieszko Lis. 2017. Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. In International Symposium on High Performance Computer Architecture (HPCA).
[50] Xiaowei Ren, Daniel Lustig, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, and David Nellans. 2020. HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems. In International Symposium on High Performance Computer Architecture (HPCA).
[51] Tim C Schroeder. 2011. Peer-to-peer & Unified Virtual Addressing. In GPU Technology Conference (GTC).
[52] Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2013. APOGEE: Adaptive Prefetching on GPUs for Energy Efficiency. In International Conference on Parallel Architectures and Compilation Techniques (PACT).
[53] Matthew D Sinclair, Johnathan Alsop, and Sarita V Adve. 2015. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models. In International Symposium on Microarchitecture (MICRO).
[54] Inderpreet Singh, Arrvindh Shriraman, Wilson WL Fung, Mike O'Connor, and Tor M Aamodt. 2013. Cache Coherence for GPU Architectures. In International Symposium on High Performance Computer Architecture (HPCA).
[55] Mohammed Sourouri, Tor Gillberg, Scott B Baden, and Xing Cai. 2014. Effective Multi-GPU Communication Using Multiple CUDA Streams and Threads. In International Conference on Parallel and Distributed Systems (ICPADS).
[56] Abdulaziz Tabbakh, Xuehai Qian, and Murali Annavaram. 2018. G-TSC: Timestamp Based Coherence for GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[57] Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, and David Nellans. 2021. Need for Speed: Experiences Building a Trustworthy System-level GPU Simulator. In International Symposium on High Performance Computer Architecture (HPCA).
[58] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In International Symposium on Microarchitecture (MICRO).
[59] Peng Wang. 2017. Unified Memory on P100. olcf.ornl.gov/wp-content/uploads/2018/02/SummitDev_Unified-Memory.pdf, last accessed on 02/14/2021.
[60] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D Owens. 2016. Gunrock: A High-performance Graph Processing Library on the GPU. In Principles and Practice of Parallel Programming (PPoPP).
[61] John Wickerson, Mark Batty, Bradford M Beckmann, and Alastair F Donaldson. 2015. Remote-scope Promotion: Clarified, Rectified, and Verified. In International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA).
[62] Vinson Young, Aamer Jaleel, Evgeny Bolotin, Eiman Ebrahimi, David Nellans, and Oreste Villa. 2018. Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems. In International Symposium on Microarchitecture (MICRO).
[63] Kaiyuan Zhang, Rong Chen, and Haibo Chen. 2015. NUMA-aware Graph-structured Analytics. In Symposium on Principles and Practice of Parallel Programming (PPoPP).
[64] Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and Stephen W Keckler. 2016. Towards High Performance Paged Memory for GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
