GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management

Harini Muthukrishnan, Daniel Lustig, David Nellans, Thomas Wenisch
University of Michigan, Ann Arbor, Michigan, USA (harinim@umich.edu) · NVIDIA, Westford, Massachusetts, USA (dlustig@nvidia.com)

MICRO '21, October 18-22, 2021, Virtual Event, Greece
Figure 1: Many HPC programs strong-scale poorly due to insufficient inter-GPU bandwidth, as shown on a system with 4 NVIDIA GV100 GPUs. (Bars: performance relative to 1 GPU for non-GPS and GPS under PCIe 3.0, projected PCIe 6.0, and an infinite-bandwidth interconnect.)

Figure 2: Load/store paths for conventional and GPS pages. Because GPS transfers data to consumers' memory proactively, all GPS loads can be performed to high bandwidth local memory.
effort on the part of the programmer. Peer-to-peer transfers [51], in which GPUs perform loads/stores directly to the physical memories of other GPUs, can potentially achieve good strong scaling when properly coordinated with computation. However, peer-to-peer loads perform remote accesses on demand, leading to compute stalls. Peer-to-peer stores avoid stalls, but can result in wastage of an already scarce interconnect bandwidth if sent to GPUs that do not require the pushed data. Frameworks such as Gunrock [60] and Groute [11] can improve strong scaling for some workloads but are typically domain-specific and limited in scope.

To overcome the limitations of these techniques, we propose GPS (GPU Publish-Subscribe), a multi-GPU memory management technique that transparently improves the performance of multi-GPU applications following a fully Unified Memory compatible programming model. GPS provides a set of architectural enhancements that automatically track which GPUs subscribe to shared memory pages, and through driver support, it replicates those pages locally on each subscribing GPU. GPS hardware then broadcasts stores proactively to all subscribers, enabling the subscribers to read that data from local memory at high bandwidth. Figure 2 shows the behavioral difference between GPS and conventional accesses. GPS loads always return from local memory, while GPS stores are broadcast to all subscribers. On the other hand, conventional loads/stores result in local or remote accesses depending on physical memory location.

GPS takes advantage of the fact that remote stores do not stall execution. Performing remote accesses on the write path instead of the read path hides latency and enables further optimizations to schedule and combine writes to use the interconnect bandwidth efficiently.

GPS successfully improves strong scaling performance because it can: (1) transfer data in a proactive, fine-grained fashion and overlap compute with communication, (2) issue all loads to local DRAM rather than to remote GPUs' memory over slower interconnects, and (3) perform aggressive coalescing optimizations that reduce inter-GPU bandwidth requirements without violating the GPU's memory model.

This work makes the following contributions:
• We propose and describe GPS, a novel HW/SW co-designed multi-GPU memory management system extension that leverages a publish-subscribe paradigm to improve multi-GPU system performance.
• We propose new, simple, and intuitive programming model extensions that can naturally integrate GPS into applications while conforming to the pre-existing NVIDIA GPU memory model.
• To minimize the utilization of scarce inter-GPU bandwidth, we propose a novel memory subscription management mechanism that tracks GPUs' access patterns and unsubscribes GPUs from pages they do not access.
• Evaluated in simulation with several interconnects against a 4 GPU system, GPS provides an average strong scaling performance of 3.0× over a single GPU, capturing 93.7% of the available opportunity. In a 16 GPU system using a future PCIe 6.0 interconnect, GPS achieves an average 7.9× speedup over a single GPU and captures over 80% of the hypothetical performance of an infinite bandwidth interconnect.

2 BACKGROUND AND MOTIVATION

Proper orchestration of inter-GPU communication is essential for strong scaling in multi-GPU systems. With each new GPU and interconnect generation, the compute throughput, local memory bandwidth, and inter-GPU bandwidth increase. However, the bandwidth available to local GPU memory remains much higher than the bandwidth available to remote GPU memories. For example, Figure 3 shows that even though interconnect bandwidth has improved 38× while evolving from PCIe 3.0 [3] to NVIDIA's most recent NVLink and NVSwitch-based topology [38], it remains 3× slower than the local GPU memory bandwidth.

Figure 3: Local and remote bandwidths on varying GPU platforms. Despite significant increases in both metrics, a 3× bandwidth gap persists between local and remote memories.

2.1 Inter-GPU communication mechanisms

Because local vs. remote bandwidth is a first-order performance concern, multi-GPU workloads typically use one of the following mechanisms (summarized in Figure 4) to manage data placement among the GPUs' physical memories.

Figure 4: Data transfer patterns in different paradigms. In demand-based loads and UM, transfers happen on demand; in memcpy, they happen bulk-synchronously at the end of the producer kernel; in GPS, proactive fine-grained transfers are performed to all subscribers.

• Host-initiated DMA using cudaMemcpy: cudaMemcpy() programs a GPU's DMA engine to copy data directly between GPUs and/or CPUs. Though CUDA provides the ability to pipeline compute and cudaMemcpy()-based transfers [55], implementing pipeline parallelism requires significant programmer effort and detailed knowledge of the applications' behavior in order to work effectively (a sketch of such a pipeline appears after this list).
• Fault-based migration via Unified Memory (UM): Unified Memory [21] provides a single unified virtual memory address space accessible to any processor in a system. UM uses a page fault and migrate mechanism to perform data transfers among GPUs. Although this enables data movement among GPUs in an implicit and programmer-agnostic fashion, the performance overhead of these page faults is typically a first-order performance concern.
• Hint-based migration via Unified Memory: UM also offers the ability for programmers to provide hints to improve performance. Read-mostly, prefetching, placement, and "accessed-by" hints enable duplication of read-only pages across GPUs as well as page placement on specific GPUs and thus can reduce page faults if used correctly. However, pipelining prefetching and compute using hints to achieve fine-grained data sharing, our goal in GPS, is challenging even for expert programmers. Furthermore, crucially, UM does not support the replication of pages with at least one writer and multiple readers. Writes to read-duplicated pages "collapse" the page to a single GPU (usually the writer) and trigger an expensive TLB shootdown, thus degrading performance substantially [59]. GPS aims to provide a better solution for read-write pages accessed by multiple GPUs.
• Peer-to-peer transfers: GPU threads can also perform peer-to-peer loads and stores that directly access the physical memory of other GPUs without requiring page migration. In principle, peer-to-peer accesses can be performed at a fine granularity, overlap compute and communication, and incur low initiation overhead. However, peer-to-peer loads suffer from the high latency of traversing interconnects such as PCIe or NVLink, often stalling thread execution beyond the GPU's ability to mitigate those stalls via multi-threading. On the other hand, peer-to-peer stores typically do not stall GPU thread execution and can be used to proactively push data to future consumers without slowing down the computation phases of the application.
• Message Passing Interface (MPI): MPI is a standardized and portable API for communicating via messages between distributed processes. With CUDA-aware MPI and optimizations such as GPUDirect Remote Direct Memory Access (RDMA) [45], the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. However, porting an application to MPI increases programmer burden, and it is hard to overlap compute and communication effectively by leveraging the pipelining features of MPI.
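To make the pipelining effort referred to in the first bullet concrete, the sketch below overlaps peer-to-peer copies with compute using two CUDA streams and an event. It is a minimal illustration rather than code from this paper: produce_chunk, the chunk layout, and the launch geometry are hypothetical, and error checking is omitted.

#include <cuda_runtime.h>

// Hypothetical producer kernel: fills one chunk of buf on the current GPU.
__global__ void produce_chunk(float* buf, int chunk, int chunk_elems) {
    int i = chunk * chunk_elems + blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = 2.0f * i;   // stand-in for real work
}

void pipelined_push(float* src_gpu0, float* dst_gpu1, int nchunks, int chunk_elems) {
    cudaSetDevice(0);
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    cudaEvent_t chunk_done;
    cudaEventCreateWithFlags(&chunk_done, cudaEventDisableTiming);

    for (int c = 0; c < nchunks; ++c) {
        // Produce chunk c on GPU 0 ...
        produce_chunk<<<chunk_elems / 256, 256, 0, compute>>>(src_gpu0, c, chunk_elems);
        cudaEventRecord(chunk_done, compute);
        // ... and ship it to GPU 1 on a second stream while chunk c+1 is computed.
        cudaStreamWaitEvent(copy, chunk_done, 0);
        cudaMemcpyPeerAsync(dst_gpu1 + c * chunk_elems, /*dstDevice=*/1,
                            src_gpu0 + c * chunk_elems, /*srcDevice=*/0,
                            chunk_elems * sizeof(float), copy);
    }
    cudaStreamSynchronize(copy);
}

Even this simple double-buffered structure requires the programmer to know chunk boundaries and producer/consumer timing in advance, which is exactly the burden the bullet above describes.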
One way to avoid the remote access bottleneck is to transfer data from the producing GPUs to the consuming GPUs in advance, as soon as the data is generated. The consumers can then read the data directly from their local memory when needed. These proactive transfers help strong scaling for two reasons: (1) they provide more opportunities to overlap data transfers with computation, and (2) they improve locality and ensure that critical path loads enjoy higher local memory bandwidth. As such, GPS relies on proactive peer-to-peer stores to perform data transfers as the basis of its performance-scalable implementation.

2.2 Publish-subscribe frameworks

Although proactive fine-grained data movement can improve locality, performing broad all-to-all transfers wastes inter-GPU bandwidth in cases where only a subset of GPUs will consume the data. In these cases, tracking which GPUs read from a page and then transmitting data only to these consumers can save precious interconnect bandwidth. To track a page's consumers and propagate updates only to them, GPS adopts a conceptual publish-subscribe framework, which is often used in software distributed systems [7, 15, 18, 33].

Figure 5 shows a simple example of a publish-subscribe framework. It consists of publishers who generate data and subscribers who have requested specific updates. The publish-subscribe processing unit tracks subscription information at page granularity, receives data updates from publishers, and forwards them to subscribers. This mechanism provides the advantage that publishers and subscribers can be decoupled, and subscription management is handled entirely by the publish-subscribe processing unit.

The major challenge faced by a publish-subscribe model relying on proactive remote stores is deciding which GPUs should receive the data and when the stores should be transmitted. We address this in Section 3.
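Conceptually, the processing unit is just a per-page subscriber table consulted on every published store. The C++ sketch below is illustrative only; GPS realizes the equivalent state in hardware and in the driver, and none of these types correspond to a real API.

#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;
using GpuId  = int;

// Illustrative model of the publish-subscribe processing unit of Figure 5.
class PublishSubscribeUnit {
  // Subscription information is tracked at page granularity.
  std::unordered_map<PageId, std::set<GpuId>> subscribers_;

 public:
  void Subscribe(PageId page, GpuId gpu)   { subscribers_[page].insert(gpu); }
  void Unsubscribe(PageId page, GpuId gpu) { subscribers_[page].erase(gpu); }

  // A publisher's store to `page` is forwarded only to that page's
  // subscribers; GPUs that never read the page see no traffic at all.
  std::vector<GpuId> RouteStore(PageId page, GpuId publisher) const {
    std::vector<GpuId> targets;
    auto it = subscribers_.find(page);
    if (it == subscribers_.end()) return targets;
    for (GpuId gpu : it->second)
      if (gpu != publisher) targets.push_back(gpu);
    return targets;
  }
};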
Figure 5: A simple publish-subscribe framework. (Stores issued by a publisher are forwarded to all subscribers recorded in the per-page subscriber lists.)

Figure 6: GPS address space: allocations made are replicated in the physical memory of all subscribers.
2.3 GPU memory consistency

The NVIDIA GPU memory model [43] prescribes rules regarding the apparent ordering of GPU memory operations and the values that may be returned by each read operation. Of most relevance to GPS are the notions of weak vs. strong accesses, and the notion of scope associated with strong accesses. In short, sys-scoped memory accesses or fences are used to explicitly indicate inter-GPU synchronization. All other access types need not be made visible to or ordered with respect to memory accesses from other GPUs unless sys-scoped operations are used for synchronization. GPS makes use of these relaxed memory ordering guarantees to perform several hardware optimizations, as described later in Section 3.3, without violating the memory model.
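In CUDA source, the distinction reads roughly as follows. This is a schematic illustration only (not code from this paper, and not a complete synchronization protocol); it uses the standard __threadfence_system() intrinsic to stand in for a sys-scoped fence.

__global__ void produce(float* out, volatile int* flag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Plain stores are weak: they need not become visible to other GPUs
    // until the next sys-scoped synchronization, so hardware (including
    // GPS, see Section 3.3) is free to buffer and coalesce them.
    if (i < n) out[i] = 2.0f * i;
    // A sys-scoped operation marks an inter-GPU synchronization point.
    // __threadfence_system() orders this thread's prior writes before the
    // flag update; making all threads' writes visible additionally relies
    // on grid-wide synchronization or the implicit release at grid end.
    if (i == 0) {
        __threadfence_system();
        *flag = 1;
    }
}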
3 GPS ARCHITECTURAL PRINCIPLES

In this section, we describe the different architectural principles upon which GPS is based. Several possible hardware implementations can support the GPS semantics with differing performance characteristics. These high-level design principles are intended to decouple the GPS concept from our specific proposed implementation. We highlight one particular implementation approach in Section 5.

3.1 The GPS address space

The GPS address space, which is an extension of the conventional multi-GPU shared virtual address space, enables programmers to incorporate GPS features into their applications through simple, intuitive APIs (described later in Section 4). As shown in Figure 6, allocations made in the GPS address space have local replicas in all subscribing GPUs' physical memories. Subscription management is described in Section 3.2.

Loads and stores are issued to GPS pages in the same way as they are to normal pages, with the same syntax, although the underlying behavior is different. GPS' architectural enhancements intercept each store and forward a copy to each subscriber's local replica. Loads to GPS pages are also intercepted but are not duplicated. Instead, they are issued to the replica in the issuing GPU's local memory and can therefore be performed at full local bandwidth and without consuming interconnect bandwidth. In the corner case where the issuing GPU is not a subscriber to that page (e.g., because of an incorrect subscription hint), the load is issued remotely to one of the subscribers. This design offers GPS a significant advantage over existing multi-GPU programming frameworks: a programmer can integrate GPS into their workloads with only minor changes for allocation and subscription management and need not modify GPU kernels written for UM, except to apply performance tuning.

3.2 Subscription management

The key innovation behind GPS is a set of architectural enhancements designed to coordinate proactive inter-GPU communication for high performance. These architectural enhancements maintain subscription information to enable data transfer only to GPUs that require it, enabling local accesses during consumption.

GPS provides both manual and automatic mechanisms to manage page subscriptions. We describe the general mechanisms below, the APIs in Section 4, and the implementation in Section 5.

Manual subscription tracking: The manual mechanism allows the user to explicitly specify subscription information through subscription/unsubscription APIs that can be called while referencing each page (or a range of pages) in memory. Though this requires extra programmer effort, it can lead to significant bandwidth savings compared to an all-to-all subscription, even if the subscription list is not minimized.

Automatic subscription tracking: For applications with an iterative/phase behavior wherein the access patterns in each program segment match those of prior segments, the subscriber lists can be determined automatically. The repetition in iterative workloads enables GPS to discover the access patterns from one segment in an initial profiling phase and determine the set of subscriptions for all subsequent execution.

Automatic subscriber tracking can be performed in one of two ways: subscribed-by-default, i.e., indiscriminate all-to-all subscription followed by an unsubscription phase, or unsubscribed-by-default, i.e., a GPU subscribes to a page only when it issues the first read request to that page. Either way, once captured, the sharer information then feeds into the subscription tracking mechanism used to orchestrate the inter-GPU communication.

It is important to note that the subscriptions are hints to GPS for both mechanisms and are not functional requirements for correct application execution. In other words, if a GPU issues a load to a page to which it is not a subscriber, it does not fault; the hardware simply issues the load remotely to one of the subscribers. Performing a peer-to-peer load to a remote GPU imposes a performance penalty but does not break functional correctness.
3.3 Functionally correct coalescing

While publish-subscribe models have been proposed in the past, a unique aspect of GPS is the way it exploits the weakness of the GPU memory consistency model described in Section 2.3. In particular, GPS performs aggressive coalescing of stores to GPS pages before those stores are forwarded to other GPUs, as described below.

Non-sys-scoped accesses: As described in Section 2.3, aside from sys-scoped operations, the GPU memory model only requires writes to be made visible to other GPUs upon execution of sys-scoped synchronization (e.g., memory fences). GPS uses this delayed visibility requirement to aggressively coalesce weak stores if their addresses fall within the same cache line. Stores need not be consecutive to be coalesced, as the GPU memory model allows store-store reordering as long as there is no synchronization or same-address relationship between the stores. This decreases the data transferred across the interconnect, saving valuable inter-GPU bandwidth. The coalescer must be flushed only upon sys-scoped synchronization, including the implicit release operation at the end of every grid.

The GPU memory model also does not require cache coherence between GPUs, i.e., a consistent store visibility order, unless stores are from the same GPU to the same address, have sys scope, or are synchronized by sys-scoped operations. While it is possible for GPS stores broadcast from different GPUs to the same address to cross each other in flight and therefore arrive at different consumer GPUs in different orders, this is not a violation of the memory model. Weak stores performed by different GPUs to the same address simultaneously without synchronization are racy, and therefore, no ordering guarantees need to be maintained. As long as proper synchronization is being used, weak writes will only be sent to a particular address from one GPU at a time, and point-to-point ordering will ensure that all consumer GPUs see those stores arriving in the same order. In this way, GPS maintains all of the required memory ordering behavior.

sys-scoped accesses: These writes are intended for synchronization and must be kept coherent across all GPUs. Therefore, GPS neither coalesces nor broadcasts such writes but instead simply handles them as traditional GPUs do. Specifically, all sys-scoped accesses are sent to a single point of coherence and performed there. Typically, the number of sys-scoped operations in programs is low, as they are only used when grids launched concurrently on multiple GPUs need to explicitly synchronize through memory. Hence, the cost of not coalescing system-scoped strong stores is minimal. Further discussion of the handling of sys-scoped writes in our GPS implementation is described in Section 5.3.

The design choices described above ensure that GPS can deliver consistent performance gains without breaking backward compatibility with the GPU programming model or memory consistency model. This compatibility enables developers to easily integrate GPS into their applications with minimal code or conceptual change.

4 GPS PROGRAMMING INTERFACE

We next describe the programming interface that an application developer uses to leverage GPS features. We seek to develop a minimal, simple programming interface to ease the integration of GPS into existing multi-GPU applications. GPS API functions are implemented in the CUDA library and GPU driver, just as the existing CUDA APIs are. Listing 1 shows sample code.

__global__ void mvmul(float *invec, float *outvec, ...) {
    ...
    // Stores to outvec in the GPS address space are
    // forwarded to the replicas at each subscriber GPU
    for (int i = 0; i < mat_dim; i++)
        outvec[tid] += mat[tid * mat_dim + i] * invec[i];
}

int main(...) {
    // enable GPS for mat, vec1, and vec2
    cudaMallocGPS(&mat, mat_dim * mat_dim_size);
    cudaMallocGPS(&vec1, mat_dim_size);
    cudaMallocGPS(&vec2, mat_dim_size);
    cudaMemset(vec2, 0, mat_dim_size);
    for (int iter = 0; iter < MAX_ITER; iter++) {
        // Automatic profiling: all GPUs are tentatively
        // subscribed to all GPS pages at the start
        if (iter == 0) cuGPSTrackingStart();
        for (int device = 0; device < num_devices; device++) {
            cudaSetDevice(device);
            mvmul<<<num_blocks, num_threads, 0, stream[device]>>>(mat, vec1, vec2, ...);
            mvmul<<<num_blocks, num_threads, 0, stream[device]>>>(mat, vec2, vec1, ...);
        }
        // GPUs are unsubscribed from pages they did not touch
        if (iter == 0) cuGPSTrackingStop();
    }
}

Listing 1: A sample GPS application. GPS requires code changes only for GPS allocation and tracking (the cudaMallocGPS(), cuGPSTrackingStart(), and cuGPSTrackingStop() calls).

Memory allocation and release: GPS provides an API call, cudaMallocGPS(), as a drop-in replacement for the cudaMalloc() (for GPU-pinned memory) or cudaMallocManaged() (for Unified Memory) APIs. This call allocates memory in the GPS virtual address space and backs it with physical memory in at least one GPU. At allocation, the programmer can pass an optional manual parameter to indicate that subscriptions will be managed explicitly for the region. Otherwise, GPS performs automatic subscription management. GPS re-purposes the existing cudaFree() function to release a GPS memory region.

Manual subscription: To allow expert programmers to explicitly manage subscriptions, GPS overloads the existing cuMemAdvise() API used for providing hints to UM with two additional hints to perform manual subscription and unsubscription. Specifically, GPS uses new flag enums CU_MEM_ADVISE_GPS_SUBSCRIBE and CU_MEM_ADVISE_GPS_UNSUBSCRIBE for subscription and unsubscription, respectively. Upon subscription, GPS backs the region with physical memory on the specified GPU. When a programmer unsubscribes a GPU from a GPS region, GPS removes the GPU from the set of subscribers for that region and frees the corresponding physical memory. GPS ensures that there is at least one subscriber to a GPS region and will return an error on attempts to unsubscribe the last subscriber, leaving the allocation in place.
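To make the manual path concrete, a usage sketch follows. The GPS-specific calls (cudaMallocGPS() and the two CU_MEM_ADVISE_GPS_* advice values) are the proposals described above, not shipping CUDA APIs, and the device numbering is illustrative.

#include <cuda.h>
#include <cuda_runtime.h>

void manual_subscription_example(size_t bytes) {
    float* buf;
    cudaMallocGPS(&buf, bytes);   // allocated with manual subscription management
    // Subscribing GPU 1 backs the region with a local replica on that GPU.
    cuMemAdvise((CUdeviceptr)buf, bytes, CU_MEM_ADVISE_GPS_SUBSCRIBE, /*device=*/1);
    // ... GPU 0 produces into buf; GPU 1 reads it from its local replica ...
    // Unsubscribing frees GPU 1's replica; later reads by GPU 1 go remote.
    cuMemAdvise((CUdeviceptr)buf, bytes, CU_MEM_ADVISE_GPS_UNSUBSCRIBE, /*device=*/1);
    cudaFree(buf);
}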
Automatic subscription and profiling phase: As described in Section 3.2, the automatic subscription mechanism comprises a hardware profiling phase during which the mechanism observes application access patterns and determines a set of subscriptions. This profiling phase requires the user to demarcate the start and end of the profiling period using two new APIs, cuGPSTrackingStart() and cuGPSTrackingStop(), which are similar to the existing CUDA calls cuProfilerStart() and cuProfilerStop(). GPS automatically updates subscriptions at the end of the profiling phase; a GPU remains a subscriber if and only if it accessed the page during profiling. Thus, upon receiving cuGPSTrackingStop(), GPS invokes the API cuMemAdvise(..., CU_MEM_ADVISE_GPS_UNSUBSCRIBE) to unsubscribe GPUs from any page they did not access during profiling. (Recall that a GPU may still access a page to which it is not subscribed, but such accesses will be performed remotely at reduced performance; hence, profiling need not be exact to maintain correctness.)

5 ARCHITECTURAL SUPPORT FOR GPS

We now describe one possible GPS hardware implementation that extends a generic GPU design, such as NVIDIA's recent Ampere GPUs. Our hardware proposal comprises two major components. First, it requires one bit in the GPU page table entry (PTE), the GPS bit, to indicate whether a virtual memory page is a GPS page (i.e., potentially replicated). Second, it requires a new hardware unit to propagate writes to GPUs that have subscribed to particular pages.

Figure 7: Modifications to GPU hardware needed for GPS provisioning. (The figure distinguishes conventional GPU components from the added GPS remote write queue, GPS address translation unit, and GPS access tracking unit; R1-R3 mark the GPS load path, W1-W6 the GPS store path, and T1 the TLB miss tracking path.)

5.1 GPS memory operations

GPS must support the following basic memory operations:

Conventional loads, stores, and atomics: Memory accesses to non-GPS pages (virtual addresses for which the GPS bit is not set in the PTE) proceed as they do on conventional GPUs, through the existing GPU TLB and cache hierarchy to either local or remote physical addresses.

GPS loads: Figure 7 shows the paths taken by writes and reads to GPS pages. Note that this figure is a simplified view intended to highlight only the modifications required by GPS. For loads issued to the GPS address space by a GPU that is a subscriber to the page, the conventional GPU page table is configured at the time of subscription to translate the virtual address to the physical address of the local replica. GPS loads thus follow the same path as conventional loads to local memory, as shown by R1, R2, R3 in Figure 7. In the uncommon case, if a GPU is not subscribed to this particular page, either the load forwards a value from the remote write queue (Section 5.2) if there is a hit, or it issues remotely to one of the subscribers.

GPS stores and atomics: Stores to GPS pages initially proceed as normal stores, as shown by W1 and W2 in Figure 7. When a thread issues a store to an address whose GPS bit is set in the conventional TLB, and for which there is a local replica, the write operation is forwarded to the local replica with both virtual and physical address (W3), ensuring that subsequent local reads from the same GPU thread will observe the new write, a requirement of the existing GPU memory model. This pattern also ensures that the L2 cache holding the local replica will serve as a common intra-GPU ordering point for stores to that address prior to their being forwarded outside of the GPU, as the memory model requires. In the uncommon case, there is no local replica, and a dummy physical address is used. In addition, whether or not there is a local replica, the write is also forwarded with its virtual address to the GPS unit (described next) for replication to remote subscribers (W4, W5, W6). Atomics follow the same behavior as stores.

5.2 GPS hardware units and extensions

Page table support: Our GPS implementation modifies the baseline GPU page table entries (PTEs) to re-purpose a single currently unused bit (of which there are many [37]) to indicate whether a given page is a GPS page. When this bit is set, the virtual address is GPS-enabled, and stores to the page will be propagated to the GPS units described below. The baseline GPU virtual memory system remains unchanged aside from this extension.

Our GPS support also introduces a new secondary page table, the GPS page table, for tracking the multiple physical mappings that can coexist for a given virtual address when multiple GPUs subscribe to a page. This limited page table for the GPS address space is a variant of a traditional 5-level hierarchical page table with very wide leaf PTEs. Notably, the GPS page table lies off the critical path for memory operations, as it is used only for remote writes triggered by writes to GPS pages. These remote writes are already aggressively coalesced, so the additional latency of the GPS address translation unit can overlap with the coalescing period and adds only a small additional latency. Furthermore, by design, these writes are not required to become visible until the next synchronization boundary, so they are not latency-sensitive. Therefore, the latencies imposed by the address translation unit are not a critical factor, even under TLB misses.

Each GPS-PTE contains the physical page addresses of all the remote subscribers to that page, as shown in Figure 7. The GPS-PTE is sized at GPU initialization based on the number of GPUs in the system. For example, with 64KB pages, for a Virtual Page Number (VPN) size of 33 bits and Physical Page Number (PPN) size of 31 bits [37], for a 4 GPU system, the minimum GPS-PTE entry size is 126 bits.
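The 126-bit figure follows from storing the 33-bit VPN tag together with one 31-bit PPN per remote subscriber (three of them in a 4 GPU system):

\[ \underbrace{33}_{\text{VPN}} \;+\; \underbrace{3 \times 31}_{\text{remote PPNs}} \;=\; 126 \text{ bits} \]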
We choose to allocate memory in the GPS address space using 64KB pages for two reasons. First, the negative impact of false sharing due to large pages is multiplied due to GPS' replication of remote stores. Second, the GPU's conventional TLB is already sized to provide full coverage of the entire VA range [47]. As discussed later in Section 7.4, the GPS address translation unit requires a surprisingly small translation capacity even with 64KB pages. Therefore, neither TLB lies on the execution's critical path.
Coalescing remote writes: If a store's translation information indicates it is to a GPS-enabled address, the GPS write proceeds to the remote write queue (W4). This queue is fully associative and virtually addressed at cache block granularity. The queue coalesces all writes to the same cache block (unless they are sys-scoped) and buffers them until they are drained to remote GPU memory. This simple optimization results in substantial inter-GPU bandwidth savings for many applications yet maintains weak memory model compatibility.

Whenever the occupancy of the write combining buffer reaches a high watermark, it seeks to drain the least recently added entry to the remote destinations. In our evaluation, we set this high watermark to one less than the buffer's capacity to maximize coalescing opportunity. On draining, the flushed entry moves to the GPS address translation unit. Additionally, the remote write queue unit must fully drain at synchronization points, e.g., when a sys-scoped memory fence is issued. In our final proposal with 512 entries, the GPS write buffer requires 68 KB of SRAM storage, an insignificant chip area for a single resource shared among all SMs in the GPU.
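The queue's behavior can be summarized in a few lines of software. The C++ model below is purely illustrative (the paper does not specify the hardware at this level of detail); it follows the Table 1 parameters of 512 entries at 135 bytes each, which is where the roughly 68 KB figure comes from (512 × 135 B ≈ 67.5 KB).

#include <cstdint>
#include <cstring>
#include <iterator>
#include <list>
#include <unordered_map>

// Illustrative software model of the GPS remote write queue.
class RemoteWriteQueue {
  static constexpr size_t kLineBytes = 128;   // cache block granularity
  static constexpr size_t kEntries   = 512;   // Table 1; ~68 KB at 135 B/entry

  struct Entry {
    uint64_t line_va;               // cache-line-aligned virtual address
    uint8_t  data[kLineBytes];
    bool     written[kLineBytes];   // byte-enable mask for coalesced stores
  };
  std::list<Entry> fifo_;           // least recently added entry at the front
  std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;

 public:
  // Coalesce a weak (non-sys-scoped) store; sys-scoped stores bypass this queue.
  void Store(uint64_t va, const void* src, size_t len) {
    uint64_t line = va & ~uint64_t(kLineBytes - 1);
    auto it = index_.find(line);
    if (it == index_.end()) {
      if (fifo_.size() >= kEntries - 1) DrainOldest();   // high watermark
      fifo_.push_back(Entry{line, {}, {}});
      it = index_.emplace(line, std::prev(fifo_.end())).first;
    }
    size_t off = va - line;
    std::memcpy(it->second->data + off, src, len);
    for (size_t b = 0; b < len; ++b) it->second->written[off + b] = true;
  }

  // Full drain at synchronization points (sys-scoped fences, grid end).
  void Flush() { while (!fifo_.empty()) DrainOldest(); }

 private:
  void DrainOldest() {
    Entry& e = fifo_.front();
    // Hand the coalesced block to the GPS address translation unit, which
    // replicates it once per remote subscriber (see below).
    ForwardToTranslationUnit(e);
    index_.erase(e.line_va);
    fifo_.pop_front();
  }
  void ForwardToTranslationUnit(const Entry&) { /* modeled elsewhere */ }
};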
GPS address translation: When a GPS store reaches the GPS address translation unit (W5), it looks up translation information cached from the GPS page table in its internal, wide GPS-TLB. Much like conventional TLB misses, GPS-TLB misses trigger a hardware page walk that fetches a GPS-PTE entry containing the physical addresses for all subscribers. Once translated, the GPS address translation unit sends a duplicate copy of the store for each subscriber to the inter-GPU interconnect (W6). The GPS address translation unit also drains at synchronization points.

Access tracking unit: Our GPS implementation uses subscribed-by-default profiling: it employs a dynamic mechanism that typically begins by subscribing all GPUs to the entirety of GPS allocations at the beginning of execution. As discussed in Section 4, GPS tracks the access patterns during a profiling phase, after which it unsubscribes each GPU from pages they did not access. Although the early over-subscription initially transfers more data than is required, unsubscribed-by-default profiling incurs stalls due to faults or remote accesses on the first touch and hence is more expensive.

The GPS access tracking unit provides the hardware support for runtime subscription profiling. It maintains a bitmap in DRAM with one bit per page in the GPS address space. Misses at the last level conventional GPU TLBs to pages in the GPS virtual address space are forwarded to the access tracking unit, which sets the bit corresponding to the page (as shown by T1 in Figure 7). TLB misses are infrequent yet cover all pages accessed by the GPU, so the bandwidth required to maintain this bitmap is low (typically only 1.4 TLB misses per thousand cycles [47]). We maintain the bitmap in DRAM since the bitmap is used only during the initial profiling and unsubscription. Tracking a 32GB virtual address range, the bitmap requires only 64KB of DRAM, and updates can be aggressively cached or write-combined to minimize DRAM bandwidth impact. Thus, the total area and energy consumed by these hardware extensions are negligible relative to the GPU SoC.
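The 64KB figure follows directly from the page granularity (one bit per 64KB page across the 32GB range):

\[ \frac{32\,\mathrm{GB}}{64\,\mathrm{KB}} = 2^{19} = 524{,}288 \text{ pages} \;\Longrightarrow\; 524{,}288 \text{ bits} = 64\,\mathrm{KB} \]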
The bitmap managed by the access tracking unit is read by the GPU driver during the cuGPSTrackingStop() API call and used to configure the conventional and GPS page tables appropriately. GPS pages with only a single subscriber are downgraded to conventional pages within the page tables, as duplication of writes to such pages at the conventional TLB (Section 5.1) is an unnecessary waste of resources when there is only one subscriber. For GPS pages with multiple subscribers, the GPS bit is set in the conventional page table entry for that page, and the GPS page table entry for the page is updated to record the physical addresses of all subscribers' replicas for that page.

5.3 Discussion

Coalescing in the L2 cache: An alternative implementation strategy for GPS would be to perform coalescing directly within the L2 cache rather than provisioning a dedicated structure. Given that the GPS remote write queue amounts to only a few kilobytes of state and the L2 size is in megabytes (6MB in NVIDIA V100 GPUs and 40MB in NVIDIA A100 GPUs [39]), the cache capacity impact would be negligible. We chose to implement a dedicated write queue to isolate its impact from any other cache interference effects and so that the only change required to the L2 cache is a shim at the ingress point, which forwards GPS writes to the GPS remote write queue.

GPS remote write queue addressing: Our GPS implementation assumes the remote write queue is virtually addressed, irrespective of whether it is maintained as a dedicated SRAM structure or as specially marked and reserved cache lines in L2. If GPS were to perform translation prior to the GPS write queue, a remote store would require one entry per subscribing GPU (one per remote physical address); performing translation as the stores are drained to the interconnect conserves space in the write queue.

Handling sys-scoped writes: Strong sys-scoped writes must be kept coherent across all GPUs. Our GPS implementation handles sys-scoped writes in the same way that UM handles writes to pages annotated with read-mostly cudaMemAdvise hints. Upon detection of a sys-scoped store to a GPS page, the access faults, all in-flight accesses to the page are flushed, and the page is collapsed to a single copy and demoted to a conventional page (i.e., its GPS bit is cleared). Accesses to the page are all issued to the GPU hosting the single physical copy from this point on. This approach ensures coherence and same-address ordering for this access and all future accesses to the page.

As mentioned in Section 3.3, sys-scoped accesses are rare, and hence the impact on typical programs is minimal. The user is expected to explicitly opt pages holding synchronization variables out of GPS (i.e., use cudaMalloc instead of cudaMallocGPS). If the user provides incorrect hints, then just as with UM, there will be a performance penalty. Nevertheless, the execution remains functionally correct.

Handling memory oversubscription: If the GPU driver swaps out a page from a subscriber due to oversubscription, that GPU will be unsubscribed and will access that page remotely.

6 EXPERIMENTAL METHODOLOGY

To evaluate the benefits of GPS, we extend the NVIDIA Architectural Simulator (NVAS) [57] to model multi-GPU systems comprising NVIDIA GV100 GPUs on PCIe, with parameters shown in Table 1. The simulator is driven by application traces collected at the SASS (GPU assembly) level using the binary instrumentation tool NVBit [58] on real hardware. These traces contain CUDA API events, GPU kernel instructions, and memory addresses accessed, but no pre-recorded timing events. The simulator models the timing aspects of the trace replay in accordance with the GPU and interconnect architectural models and respects all functional dependencies such as work scheduling, barrier synchronization, and load dependencies. We have specifically calibrated the link and switch parameters in our interconnect models to match several (sometimes speculative) PCIe generations. This simulator has been correlated across a wide range of benchmarks and GPU models but remains fast enough to model complex multi-GPU systems and the hard-to-scale applications suitable for evaluating GPS.
GPU Parameters
  Cache block size: 128 bytes
  Global memory: 16 GB
  Streaming multiprocessors (SMs): 80
  CUDA cores/SM: 64
  L2 cache size: 6 MB
  Warp size: 32
  Maximum threads per SM: 2048
  Maximum threads per CTA: 1024

GPS Structures
  Remote write queue: 512 entries
  Remote write queue entry size: 135 bytes
  TLB: 8-way set associative
  TLB size: 32 entries
  Virtual address: 49 bits
  Physical address: 47 bits

Table 1: Simulation settings, based on NVIDIA V100.

We evaluate a suite of multi-GPU applications shown in Table 2. These include all applications used to evaluate PROACT [34]. We also study those applications from the Tartan benchmark suite [29] whose strong scaling performance was bound by inter-GPU communication when measured on real systems. These applications also possess varying communication patterns, giving us a broader opportunity to evaluate GPS. For the Tartan applications not bound by inter-GPU communication, we found that GPS obtains the same performance as the native version and have not included them in the interest of space. We modify the applications only to implement the different multi-GPU programming paradigms, and the partitioning of applications across multi-GPUs remains the same as the original code for all paradigms. All our application variants are written in CUDA and compiled using CUDA 10.

To demonstrate the ability of GPS to improve multi-GPU scalability, we compare it with several contemporary multi-GPU programming paradigms as discussed in Section 2.1:

Unified Memory without Hints: We simulate baseline Unified Memory without user-provided hints. Application code allocates shared memory regions using the cudaMallocManaged() API. By default, the simulator allocates pages on the first GPU that touches the page. Subsequent accesses by peer GPUs to the same page will result in fault-based page migration as described in Section 2.1.

Unified Memory with Hints: For this paradigm, we hand-tune each application using a combination of four manual hints, namely read mostly, accessed by, prefetch, and preferred location hints. Based on the compute partitioning across GPUs, we set the GPU that issues writes to a given memory region as its preferred location. The most proactive approach we can configure with UM hints is to pick one consumer to be the preferred location, and since each producer of a page is always also a consumer of the page in our applications, that was a convenient and close-to-optimal choice. We also set GPUs that read from remote pages as accessed by those GPUs. Although read-mostly hints are generally effective, we did not use them because our applications had no read-only pages accessed by multiple GPUs. Before each kernel launch, we enable GPUs to prefetch remote regions they may access through prefetch hints.
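For concreteness, a minimal sketch of the hint configuration just described is shown below using the standard Unified Memory APIs (cudaMemAdvise and cudaMemPrefetchAsync). The kernel, region, and device identifiers are hypothetical and error checking is omitted.

#include <cuda_runtime.h>

// Hypothetical consumer kernel that reads the shared region.
__global__ void consumer_kernel(const float* region) { /* ... */ }

// Configure UM hints for one shared region: producer_gpu writes it,
// consumer_gpu reads it remotely.
void configure_um_hints(float* region, size_t bytes, int producer_gpu, int consumer_gpu) {
    // Place the pages on the GPU that writes them (preferred location) ...
    cudaMemAdvise(region, bytes, cudaMemAdviseSetPreferredLocation, producer_gpu);
    // ... and record which other GPU accesses them, to avoid faults on reads.
    cudaMemAdvise(region, bytes, cudaMemAdviseSetAccessedBy, consumer_gpu);
    // Before launching the consumer, prefetch the data it may touch.
    cudaSetDevice(consumer_gpu);
    cudaMemPrefetchAsync(region, bytes, consumer_gpu, /*stream=*/0);
    consumer_kernel<<<256, 256>>>(region);
}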
Remote Demand Loads (RDL): While GPS performs all loads locally by issuing the stores to all subscribers, RDL performs the converse: it issues stores to local memory and loads to the most recent GPU to issue a store to a given page. We believe that this paradigm is representative of an expert programmer who manually tracks writers to each page. We simulate this expertise by explicitly tracking the latest write to each page in the simulator and using this information during address translation to issue the read to the appropriate GPU.

Memcpy: This paradigm duplicates data structures among all GPUs and broadcasts updates via cudaMemcpy() calls at the synchronization barriers. This duplication ensures that all data structures are resident in local GPU memory when accessed by kernels in the subsequent synchronization phase; there are no remote accesses during kernel execution. However, there is also no overlap between data transfers and compute.

GPS with automatic subscription: We implement GPS with automatic subscription management by modifying applications as described in Section 4 and marking all memory allocations as GPS allocations.

Infinite bandwidth: Finally, we provide an infinite bandwidth comparison, which establishes an upper bound on achievable multi-GPU performance if all data were always accessible locally at each GPU (i.e., it ignores all transfer costs). We obtain this comparison by eliding the data transfer time from the memcpy variant.

7 EXPERIMENTAL RESULTS

GPS relies on fine-grained, proactive data transfers to remote GPUs during kernel execution to optimize GPU locality. The subscription management mechanism ensures that only the required data is transferred, resulting in interconnect bandwidth savings. GPS performance benefits arise for three reasons: (1) GPS proactively publishes updates to subscribers, enabling them to fetch hot-path data from high bandwidth local memory. (2) By automatically identifying subscribers for a given page, GPS publishes updates only to the GPUs that require them, resulting in significant interconnect bandwidth savings. (3) Coalescing in the GPS write queue results in substantial bandwidth reduction, especially for applications where subscriptions alone are not sufficient to achieve peak performance.

7.1 End-to-end performance

Figure 8 shows the 4-GPU application speedup over a single GPU for the different programming paradigms described in Section 6.
Figure 8: 4-GPU speedup of different paradigms. (Bars: speedup over a single GPU, per application and paradigm.)

Figure 9: Subscriber distribution for shared application pages. GPS subscriptions result in interconnect bandwidth savings for all pages with less than 4 subscribers.
Figure 10: Total data moved over the interconnect normalized to memcpy (bulk-synchronous transfers); lower is better.

(Figure: performance relative to 1 GPU for GPS without subscription vs. GPS with subscription, per application.)

Figure 12: 16-GPU performance achieved by different paradigms.

... When degradation occurs, it is typically due to the GPS write combining buffer failing to coalesce multiple writes to the same block effectively. However, if the excess writes do not saturate the interconnect, they will not typically stall GPU execution.
8 RELATED WORK

The publish-subscribe communication paradigm for distributed interaction has been explored by prior work [2, 7, 8, 17]. Hill et al. [23] propose a Check-In/Check-Out model for shared-memory machines. The more traditional alternative to publish-subscribe support is NUMA memory management. Dashti et al. [14] develop and implement a memory placement algorithm in Linux to address traffic congestion in modern NUMA systems. Many other works [1, 16, 28, 48, 63] perform NUMA-aware optimizations to improve performance, and hardware-based peer caching has been explored but is yet to be adopted by GPU vendors [10, 13, 30, 46, 54]. Recently, DRAM caches for multi-node systems [12] have been proposed to achieve large capacity advantages.

Prior work has also explored scoped synchronization for memory models [19, 22, 24, 31, 44, 61]. Non-scoped GPU memory models are simpler [53], but do not permit the same type of coalescing as GPS, which makes explicit use of scopes.

9 CONCLUSION

Strong scaling in multi-GPU systems is a challenging task. In this work, we proposed and evaluated GPS, a HW/SW multi-GPU memory management technique to improve strong scaling in multi-GPU systems. GPS automatically tracks the subscribers to each page of memory and proactively broadcasts fine-grained stores to these subscribers. This enables each subscriber to read data from their local memory at high bandwidth. GPS provides significant performance improvement while retaining compatibility with conventional GPU programming and memory models. Evaluated on a model of 4 NVIDIA V100 GPUs and several interconnect architectures, GPS offers an average speedup of 3.0× over 1 GPU and performs 2.3× better than the next best available multi-GPU programming paradigm. On a similar 16 GPU system, GPS captures 80% of the available performance opportunity, a significant lead over today's multi-GPU programming models.

ACKNOWLEDGMENTS

The authors thank Zi Yan and Oreste Villa from NVIDIA Research for their support with NVAS and the anonymous reviewers for their valuable feedback. This work was supported by the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by DARPA.

REFERENCES
[1] Neha Agarwal, David Nellans, Mark Stephenson, Mike O'Connor, and Stephen W Keckler. 2015. Page Placement Strategies for GPUs Within Heterogeneous Memory Systems. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[2] Marcos K Aguilera, Robert E Strom, Daniel C Sturman, Mark Astley, and Tushar D Chandra. 1999. Matching Events in a Content-Based Subscription System. In Symposium on Principles of Distributed Computing (PODC).
[3] Jasmin Ajanovic. 2009. PCI Express 3.0 Overview. In A Symposium on High Performance Chips (Hot Chips).
[4] AMD. 2019. AMD Infinity Architecture: The Foundation of the Modern Datacenter. Product Brief. amd.com/system/files/documents/LE-70001-SB-InfinityArchitecture.pdf, last accessed on 08/17/2020.
[5] AMD. 2020. AMD Crossfire™ Technology. www.amd.com/en/technologies/crossfire, last accessed on 04/14/2021.
[6] Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. In International Symposium on Computer Architecture (ISCA).
[7] Guruduth Banavar, Tushar Chandra, Bodhi Mukherjee, Jay Nagarajarao, Robert E Strom, and Daniel C Sturman. 1999. An Efficient Multicast Protocol for Content-Based Publish-Subscribe Systems. In International Conference on Distributed Computing Systems (ICDCS).
[8] Guruduth Banavar, Tushar Chandra, Robert Strom, and Daniel Sturman. 1999. A Case for Message Oriented Middleware. In International Symposium on Distributed Computing (DISC).
[9] Trinayan Baruah, Yifan Sun, Ali Dinçer, Md Saiful Arefin Mojumder, José L. Abellán, Yash Ukidave, Ajay Joshi, Norman Rubin, John Kim, and David Kaeli. 2020. Griffin: Hardware-Software Support for Efficient Page Migration in Multi-GPU Systems. In International Symposium on High Performance Computer Architecture (HPCA).
[10] Arkaprava Basu, Sooraj Puthoor, Shuai Che, and Bradford M Beckmann. 2016. Software Assisted Hardware Cache Coherence for Heterogeneous Processors. In International Symposium on Memory Systems (ISMM).
[11] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2020. Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing. Transactions on Parallel Computing (TOPC) 7, 3 (2020), 1-27.
[12] Chiachen Chou, Aamer Jaleel, and Moinuddin K Qureshi. 2016. CANDY: Enabling Coherent DRAM Caches for Multi-node Systems. In International Symposium on Microarchitecture (MICRO).
[13] Mohammad Dashti and Alexandra Fedorova. 2017. Analyzing Memory Management Methods on Integrated CPU-GPU Systems. In International Symposium on Memory Management (ISMM).
[14] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[15] Alan Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, and Walker White. 2006. Towards Expressive Publish/Subscribe Systems. In International Conference on Extending Database Technology (EDBT).
[16] Andi Drebes, Karine Heydemann, Nathalie Drach, Antoniu Pop, and Albert Cohen. 2014. Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-parallel Languages. Transactions on Architecture and Code Optimization (TACO) 11, 3 (2014), 1-25.
[17] Patrick Th Eugster, Pascal A Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. 2003. The Many Faces of Publish/Subscribe. Computing Surveys (CSUR) 35, 2 (2003), 114-131.
[18] Françoise Fabret, H Arno Jacobsen, François Llirbat, João Pereira, Kenneth A Ross, and Dennis Shasha. 2001. Filtering Algorithms and Implementation for Very Fast Publish/Subscribe Systems. In International Conference on Management of Data (SIGMOD).
[19] Benedict R Gaster. 2013. HSA Memory Model. In A Symposium on High Performance Chips (Hot Chips).
[20] Tom's Hardware. 2019. AMD Big Navi and RDNA 2 GPUs. tomshardware.com/news/amd-big_navi-rdna2-all-we-know, last accessed on 08/17/2020.
[21] Mark Harris. 2017. Unified Memory for CUDA Beginners. developer.nvidia.com/blog/unified-memory-cuda-beginners/, last accessed on 08/17/2020.
[22] Blake A Hechtman, Shuai Che, Derek R Hower, Yingying Tian, Bradford M Beckmann, Mark D Hill, Steven K Reinhardt, and David A Wood. 2014. QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[23] Mark D Hill, James R Larus, Steven K Reinhardt, and David A Wood. 1992. Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[24] Derek R Hower, Blake A Hechtman, Bradford M Beckmann, Benedict R Gaster, Mark D Hill, Steven K Reinhardt, and David A Wood. 2014. Heterogeneous-race-free Memory Models. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[25] Hyojong Kim, Jaewoong Sim, Prasun Gera, Ramyad Hadidi, and Hyesoon Kim. 2020. Batch-Aware Unified Memory Management in GPUs for Irregular Workloads. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[26] Nagesh B Lakshminarayana and Hyesoon Kim. 2014. Spare Register Aware Prefetching for Graph Algorithms on GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[27] Jaekyu Lee, Nagesh B Lakshminarayana, Hyesoon Kim, and Richard Vuduc. 2010. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. In International Symposium on Microarchitecture (MICRO).
[28] Baptiste Lepers, Vivien Quéma, and Alexandra Fedorova. 2015. Thread and Memory Placement on NUMA Systems: Asymmetry Matters. In USENIX Annual Technical Conference (USENIX ATC).
[29] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent, and Kevin Barker. 2018. Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite. In International Symposium on Workload Characterization (IISWC).
[30] Daniel Lustig and Margaret Martonosi. 2013. Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization. In International Symposium on High Performance Computer Architecture (HPCA).
[31] Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[32] Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, and David Nellans. 2017. Beyond the Socket: NUMA-aware GPUs. In International Symposium on Microarchitecture (MICRO).
[33] Gero Mühl. 2002. Large-Scale Content-Based Publish-Subscribe Systems. Ph.D. Dissertation. Technische Universität Darmstadt.
[34] Harini Muthukrishnan, David Nellans, Daniel Lustig, Jeffrey Fessler, and Thomas Wenisch. 2021. Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-grained Transfers. In International Symposium on Computer Architecture (ISCA).
[35] Prashant J Nair, David A Roberts, and Moinuddin K Qureshi. 2016. Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures. Transactions on Architecture and Code Optimization (TACO) 12, 4 (2016), 1-24.
[36] NVIDIA. 2013. CUDA Toolkit Documentation. docs.nvidia.com/cuda/, last accessed on 08/17/2020.
[37] NVIDIA. 2019. GP100 MMU Format. nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf, last accessed on 08/17/2020.
[38] NVIDIA. 2019. NVLink and NVSwitch: The Building Blocks of Advanced Multi-GPU Communication. nvidia.com/en-us/data-center/nvlink/, last accessed on 08/17/2020.
[39] NVIDIA. 2020. NVIDIA Ampere Architecture. www.nvidia.com/en-us/data-center/ampere-architecture/, last accessed on 04/14/2021.
[40] NVIDIA. 2020. NVIDIA DGX Systems. www.nvidia.com/en-us/data-center/dgx-systems/, last accessed on 04/14/2021.
[41] NVIDIA. 2020. NVIDIA NVLink High-Speed GPU Interconnect. nvidia.com/en-us/design-visualization/nvlink-bridges/, last accessed on 08/17/2020.
[42] NVIDIA. 2020. NVIDIA TITAN V, NVIDIA's Supercomputing GPU Architecture, Now for Your PC. www.nvidia.com/en-us/titan/titan-v/, last accessed on 08/17/2020.
[43] NVIDIA. 2020. PTX: Parallel Thread Execution ISA Version 7.0. docs.nvidia.com/cuda/pdf/ptx_isa_7.0.pdf, last accessed on 08/17/2020.
[44] Marc S Orr, Shuai Che, Ayse Yilmazer, Bradford M Beckmann, Mark D Hill, and David A Wood. 2015. Synchronization Using Remote-Scope Promotion. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[45] Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, and Dhabaleswar K Panda. 2013. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. In International Conference on Parallel Processing (ICPP).
[46] Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M Beckmann, Mark D Hill, Steven K Reinhardt, and David A Wood. 2013. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In International Symposium on Microarchitecture (MICRO).
[47] Jason Power, Mark D Hill, and David A Wood. 2014. Supporting x86-64 Address Translation for 100s of GPU Lanes. In International Symposium on High Performance Computer Architecture (HPCA).
[48] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki. 2015. Scaling Up Concurrent Main-memory Column-store Scans: Towards Adaptive NUMA-aware Data and Task Placement. In Proceedings of the VLDB Endowment (PVLDB).
[49] Xiaowei Ren and Mieszko Lis. 2017. Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. In International Symposium on High Performance Computer Architecture (HPCA).
[50] Xiaowei Ren, Daniel Lustig, Evgeny Bolotin, Aamer Jaleel, Oreste Villa, and David Nellans. 2020. HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems. In International Symposium on High Performance Computer Architecture (HPCA).
[51] Tim C Schroeder. 2011. Peer-to-peer & Unified Virtual Addressing. In GPU Technology Conference (GTC).
[52] Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2013. APOGEE: Adaptive Prefetching on GPUs for Energy Efficiency. In International Conference on Parallel Architectures and Compilation Techniques (PACT).
[53] Matthew D Sinclair, Johnathan Alsop, and Sarita V Adve. 2015. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models. In International Symposium on Microarchitecture (MICRO).
[54] Inderpreet Singh, Arrvindh Shriraman, Wilson WL Fung, Mike O'Connor, and Tor M Aamodt. 2013. Cache Coherence for GPU Architectures. In International Symposium on High Performance Computer Architecture (HPCA).
[55] Mohammed Sourouri, Tor Gillberg, Scott B Baden, and Xing Cai. 2014. Effective Multi-GPU Communication Using Multiple CUDA Streams and Threads. In International Conference on Parallel and Distributed Systems (ICPADS).
[56] Abdulaziz Tabbakh, Xuehai Qian, and Murali Annavaram. 2018. G-TSC: Timestamp Based Coherence for GPUs. In International Symposium on High Performance Computer Architecture (HPCA).
[57] Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, and David Nellans. 2021. Need for Speed: Experiences Building a Trustworthy System-level GPU Simulator. In International Symposium on High Performance Computer Architecture (HPCA).
[58] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In International Symposium on Microarchitecture (MICRO).
[59] Peng Wang. 2017. Unified Memory on P100. olcf.ornl.gov/wp-content/uploads/2018/02/SummitDev_Unified-Memory.pdf, last accessed on 02/14/2021.
[60] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D Owens. 2016. Gunrock: A High-performance Graph Processing Library on the GPU. In Principles and Practice of Parallel Programming (PPoPP).
[61] John Wickerson, Mark Batty, Bradford M Beckmann, and Alastair F Donaldson. 2015. Remote-scope Promotion: Clarified, Rectified, and Verified. In International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA).
[62] Vinson Young, Aamer Jaleel, Evgeny Bolotin, Eiman Ebrahimi, David Nellans, and Oreste Villa. 2018. Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems. In International Symposium on Microarchitecture (MICRO).
[63] Kaiyuan Zhang, Rong Chen, and Haibo Chen. 2015. NUMA-aware Graph-structured Analytics. In Symposium on Principles and Practice of Parallel Programming (PPoPP).
[64] Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, and Stephen W Keckler. 2016. Towards High Performance Paged Memory for GPUs. In International Symposium on High Performance Computer Architecture (HPCA).