Ditto: End-to-End Application Cloning for Networked Cloud
Services
Mingyu Liang∗ ([email protected]), Cornell University, Ithaca, New York, USA
Yu Gan∗ ([email protected]), Cornell University, Ithaca, New York, USA
Yueying Li ([email protected]), Cornell University, Ithaca, New York, USA
Carlos Torres (Meta), Abhishek Dhanotia (Meta), Mahesh Ketkar (Intel)
Christina Delimitrou ([email protected]), MIT, Cambridge, Massachusetts, USA
to adjust to different studies. Unfortunately, most prior work on synthetic benchmark cloning is limited to CPU-centric, single-tier, and user-level applications [72, 87, 90].

Only capturing CPU-centric microarchitectural events is not enough to reproduce the performance and resource characteristics of cloud applications, which spend a large fraction of their execution in the networking stack and OS. Moreover, prior work on synthetic application cloning mostly considers generating assembly code to mimic metrics like IPC, cache miss rate, and dependency distance, but overlooks critical higher-level performance metrics, such as average and tail latency.

We present Ditto, an automated application cloning framework for end-to-end cloud services, designed for both monolithic applications and microservices. Ditto is the first system to clone an application's behavior across the system stack, including the hardware, I/O, networking layers, and OS. This is critical for cloud applications, which spend a large fraction of their time at kernel level and in the I/O stack [74]. It additionally targets multi-tier microservices, which span distributed deployments and are gaining in popularity.

Ditto relies on the following key techniques. First, it captures the dependency graph across distributed services using distributed tracing [4, 7, 11, 96]. Then, it recreates the high-level control and data flow inside each service, and, finally, it generates system calls and user-space assembly to capture the on-CPU and off-CPU behavior. Ditto operates transparently to the user, with the cloning process working in an automated fashion, from obtaining a microservice deployment's dependency graph to populating each tier with appropriate assembly code and I/O operations. It generalizes across platforms, deployments, and application configurations, such as load and thread pool size, without retraining, and the synthetic applications react to changes similarly to the original ones.

Ditto is beneficial to hardware vendors, cloud providers, and researchers. Hardware vendors can obtain synthetic versions of production applications to test new platforms, cloud providers can specify performance and/or resource specs to hardware vendors using the synthetic workloads, and researchers can use representative end-to-end cloud services without the need for production code access.

We evaluate Ditto across a set of both monolithic applications and multi-tier microservices and show that it consistently captures the low- and high-level performance metrics and resource characteristics of the original service. We also validate that synthetic applications generated with Ditto react the same way as the original workloads to changes in the input load, platform, resource allocation, and deployment configuration, including interference from external workloads and power management. Ditto is open-source software¹.

¹ https://github.com/Mingyu-Liang/Ditto.

2 RELATED WORK
2.1 CPU and Cloud Benchmarking
The architecture and system community rely heavily on software benchmarking to learn the performance characteristics of target applications. Prior studies have found that traditional CPU benchmark suites, such as SPEC [32, 55, 99] and MiBench [61], differ significantly from the services running in production clouds [100, 103, 111].

One of the earliest efforts towards modern cloud application benchmarking was the Cloudstone benchmark [97], which proposed a new interaction-heavy Web 2.0 workload. CloudSuite [48] further composes a collection of workloads for the evaluation of scaling-out cloud services. The YCSB suite [37] collects workloads for database systems, while SPEC Cloud [9] utilizes a subset of workloads representing real-world use cases found on IaaS clouds. More recently, uSuite [98] and DeathStarBench [53] focus on benchmarking cloud microservices, given the increased popularity of this programming model.

Instead of condensing cloud services to a pre-set group of benchmarks, Ditto enables generating arbitrary applications that resemble the features of a target service.

2.2 Simulation and Trace Replay
Simulation and trace replay provide another way to estimate service performance when hardware or software is inaccessible. Many microarchitectural simulators, including gem5 [31], Sniper [34], and ZSim [94], can accurately simulate the CPU performance of a given binary. BigHouse [82] and μqsim [109] are queueing-based simulators which quickly estimate high-level performance metrics of monolithic applications and microservices. While useful when hardware is not available, these simulators still make approximations about the application behavior, and do not capture all complexities of a real system. On the other hand, RecPlay [92], iDNA [29], PinPlay [89], and Jalangi [95] log execution and memory traces, and reproduce an application's behavior for debugging and performance analysis. Unfortunately, a lot of prior work has shown that traces can leak confidential information about production services [33, 38, 66], restricting an application owner's incentive to publicly share the collected traces. West [22], STM [20], HALO [86] and Dangwal's paper [39], for example, analyze the memory access patterns of an original application, and generate synthetic memory traces. Although they can constrain the information leakage, they only target the cache and memory subsystems. Compared to trace-based techniques, Ditto generates synthetic services that clone the performance characteristics across the system stack, and can run both on real systems and microarchitectural simulators.

2.3 Performance Cloning and Synthetic Benchmarks
Workload cloning is a way to generate synthetic code that mimics real-world applications. Previous studies profile architecture-independent characteristics of real applications, and generate corresponding proxy benchmarks that capture their CPU performance [26, 56, 73, 87]. PerfProx, for example, generates miniature proxies which resemble the low-level CPU metrics of real databases [87]. MicroGrad [90] introduces a gradient-based mechanism to generate workload clones and stress tests. NanoBench [18] generates microbenchmarks with certain instructions to evaluate undocumented features of x86 CPUs. In [104], the authors hide the functional semantics of proprietary applications through code mutation.

However, these systems are not sufficient for performance cloning in cloud services. First, they only consider performance
metrics in user space. Cloud applications spend a large fraction of their execution at kernel level [43, 48, 53, 74, 77, 99]. Synthetic benchmarks generated with these tools focus on matching low-level performance metrics, e.g., instructions per cycle (IPC) or misses per kilo instructions (MPKI), which do not always translate to the high-level metrics cloud applications care about, like tail latency and throughput [28, 40]. Second, cloud services are often bottlenecked by off-CPU events, such as context switching or network or disk I/O, which are not captured in previous work. Finally, cloud services do not operate like independent processes, having instead client-server interfaces, which need to be captured by the cloning framework. This is even more the case for multi-tier microservices, which can have hundreds of dependent tiers, and are becoming the norm in many clouds.

3 CLONING ACROSS THE SYSTEM STACK
Application cloning for cloud services is challenging due to the complexity and heterogeneity of their design, and the various platforms they can be deployed on. Different services can have entirely different bottlenecks; for example, key-value stores (KVS) require high CPU performance and high memory and network bandwidth to retrieve a large amount of data under a strict latency SLO, while databases are usually bottlenecked by disk I/O bandwidth [36]. Therefore, it is important to consider the performance breakdown across the system stack to accurately clone the performance of end-to-end cloud services.

Figure 1: The system stack that shapes a cloud service's performance: workload configurations (application inputs), the application (user-level libraries and application binary), system calls, and the deployment environment (container engine and (guest) kernel, including virtual memory, task scheduler, file systems, network stack, and device drivers).

3.1 Application Inputs
The behavior and performance of cloud applications are significantly impacted by the service configuration and input load, with the latter going through well-documented fluctuations [19, 24, 42, 43, 81, 91, 99]. The application's configuration, although changing less frequently than load, can substantially alter the execution flow of an application and impact performance. For instance, configuring a smaller in-memory cache for a database can cause more disk I/O accesses, significantly increasing latency.

3.2 Application Codebase and Binary
The application and its linked libraries are intrinsic to its performance, regardless of the platform it is deployed on. Modifications in the application code can alter the control and data flow of a service, its memory access patterns, and its resource bottlenecks. This is especially true for new cloud programming frameworks, like microservices and serverless, where services are updated on a daily basis.

3.3 Deployment Environment
3.3.1 Containers and Virtual Machines (VMs). Cloud services are often deployed with containers and/or VMs. These add different levels of performance overheads, primarily due to the extra I/O and network layers [47]. Unlike prior work, Ditto faithfully clones the I/O behaviors of the cloud services, and thus, the synthetic applications generated by Ditto can be affected by virtualization the same way as the original services.

3.3.2 OS Kernel. Cloud applications are especially dependent on OS performance, given that they spend a large fraction of their execution at kernel level for interrupt handling, I/O requests, memory management, task scheduling, etc. [25, 53, 74, 76]. Prior work on application cloning has mostly focused on user-level application logic; for cloud services, overlooking kernel operations leads to very different performance characteristics compared to the original application.
Figure 2: Top-down analysis of the CPU-memory subsystem performance [107]. Letters at the bottom show the corresponding analysis in Ditto. IX: Instruction Mix. BB: Branch Behavior. IM: Instruction Memory Access Pattern. DM: Data Memory Access Pattern. DD: Data Dependency.

3.4 Multi-Tenancy
Multi-tenancy improves datacenter utilization by deploying multiple services on the same node. Applications share resources, including CPU cores, LLC, memory, disk I/O, and network bandwidth [36, 79]. Resource contention can degrade performance, and should be accounted for in the application cloning process.

4 END-TO-END CLONING FOR CLOUD SERVICES
4.1 Overview
Ditto is an application cloning framework for cloud services; it applies to both single-tier applications and multi-tier microservices. It generates services that faithfully reproduce the performance, resource profile, and thread-level control/data flow of the original workload, decoupling representative system studies from access to the source code or the binary of production cloud services.

Ditto profiles an application at runtime and extracts key performance and resource metrics using dynamic instrumentation and runtime emulators (SystemTap [45], Valgrind [85], eBPF [44], Perf [12], VTune [67], and Intel SDE [68]). Then, it generates a synthetic service which preserves the performance of the original, using an entirely distinct code sequence, to avoid revealing the implementation of the original service.

Figure 3 shows an overview of Ditto's profiling and generation process. If the target service consists of a set of microservices, Ditto first learns their Remote Procedure Call (RPC) dependency graph, using distributed tracing [4, 7, 11, 96]. This graph is then used to generate the API interfaces between the different synthetic microservices. Next, Ditto analyzes the thread and networking model, e.g., single- or multi-threaded, and synchronous or asynchronous, respectively, using kernel-level profiling, and builds the skeleton of each service. The application skeleton contains empty handlers which are filled with appropriate functionality in the next step. The handlers can either be triggered upon receiving requests, for worker threads, or by a timer, for background threads.

Figure 3: Overview of Ditto's profiling and generation process. Distributed traces yield the microservice topology (e.g., A calls B and C on every request, while B and C fan out to D, E, and F with profiled probabilities such as 0.5, 0.7, or 0.3). For each tier (or for a single-tier application), Ditto first builds the application skeleton (thread model and network model; left code snippet, C/C++ level), then generates the application body (system calls, instruction mix, branch behavior, data and instruction memory access patterns, and data dependencies; right code snippet, assembly level), and finally performs fine tuning to produce the synthetic microservices or synthetic application.

Left code snippet (application skeleton, C/C++ level):

    1  // Main thread
    2  void main_loop() {
    3    while (!stop) {
    4      epoll_wait(listen_fd, events, MAX_EVENTS, -1);
    5      int socket_fd = accept(listen_fd, addr, len);
    6      init_worker_thread_a(socket_fd);
    7    }
    8  }
    9
    10 // Worker thread type A
    11 void worker_a_loop() {
    12   while (!conn_closed) {
    13     epoll_wait(socket_fd, events, MAX_EVENTS, -1);
    14     read(socket_fd, buffer, BUFFER_SIZE);
    15     // Handler to be generated in next step
    16     worker_a_main(req_id);
    17     dispatch_to_worker_b(req_id);
    18     wait_worker_b();
    19     sendmsg(socket_fd, buffer, BUFFER_SIZE);
    20   }
    21 }
    22
    23 // Worker thread type B
    24 void worker_b_loop() {
    25   ...
    26 }

Right code snippet (generated application body, assembly level):

    1  // Main function of worker A
    2  void worker_a_main(req_id) {
    3    // Syscalls
    4    int fd = open(file, O_RDONLY);
    5    int size = read(fd, buffer, BUFFER_SIZE);
    6    close(fd);
    7
    8    // Assembly blocks
    9    __asm__ __volatile__ (
    10     ...
    11   );
    12
    13   // Block for data_size = i & inst_size = j
    14   __asm__ __volatile__ (
    15     "xor r9, r9\n"
    16     ".BLOCK_I_J:\n"  // Inner loop
    17     "add <X_REG>, <R_REG>\n"
    18     "sub <R_REG>, DWORD PTR [r10 + <OFFSET>]\n"
    19     "mul QWORD PTR [r10 + <OFFSET>]\n"
    20     "mov r11, QWORD PTR [r11]\n"  // Ptr chasing
    21     "test r8d, <BIT_MASK>\n"
    22     "jz .COND_BR_FOO\n"
    23     ...
    24     "cmp r9, <LOOP_COUNT>\n"
    25     "jl .BLOCK_I_J\n"
    26   );
    27
    28   __asm__ __volatile__ (
    29     ...
    30   );
    31 }

To generate the synthetic application body, Ditto instruments the application binary using kernel- and user-space profilers to fine-tune the generator. The eventual synthetic service can serve as a performance and resource proxy for the original service.

Ditto profiles applications in isolation to capture their characteristics alone; in Section 6.5 we show that in the presence of interference, synthetic applications behave the same way as their original counterparts.

Ditto adheres to the following design principles:
• End-to-end system stack modeling: Cloud services often contain a large fraction of kernel-space operations for network and disk I/O. Ditto captures the inputs, RPC dependency graph, application binary, OS kernel, CPU, memory, disk, networks, and resource interference.
• Portability: Ditto uses platform-independent features to ensure that generated services are portable across platforms without reprofiling. Synthetic applications also faithfully adjust to load and configuration changes, such as queries per second (QPS) and scaling, because of the fine-grained network and thread modeling.
• Abstraction: Ditto does not disclose the implementation of the original application, only exposing the skeleton and post-processed performance characteristics to the synthetic benchmark user. It replaces the skeleton of an application with a template, refills the body with artificial instructions and their operands, and abstracts the memory access patterns away to avoid side-channel attacks. Application-specific characteristics, including user-space function calls, memory accesses, and application inputs, are also concealed. Thus, the synthetic workload can be publicly shared without a user reverse engineering the implementation of the original service.
• Automation: Ditto automates the profiling and generation process. It entirely relies on static and dynamic profiling of the original application to generate a benchmark. Users are not required to have expertise in the implementation of a service to use the framework.

4.2 Microservice Topology
A topology of microservices is a directed acyclic graph (DAG), where the nodes are microservices and the edges indicate the dataflow between dependent tiers [52–54, 80]. Ditto leverages the distributed tracing frameworks present in most production deployments to collect traces of end-to-end requests. The performance overhead is negligible if the traces are sampled properly [4, 11, 96]. It then automatically extracts the dependency graph between microservices and uses it as input to the skeleton generator.
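As an illustration of how the extracted topology drives generation, the sketch below stores per-edge call probabilities, like the ones annotated in Figure 3, and samples which downstream tiers a synthetic handler should call for a given request. It is a simplified example for this discussion, not Ditto's actual data structures.

    #include <stdlib.h>

    /* Hypothetical, simplified representation of one edge of the RPC
     * dependency graph extracted from distributed traces. */
    struct rpc_edge {
        int downstream;       /* index of the callee microservice         */
        double call_prob;     /* fraction of parent requests that fan out */
    };

    struct rpc_node {
        const char *name;
        struct rpc_edge *edges;
        int num_edges;
    };

    /* Decide which downstream RPCs a synthetic request issues: each edge
     * fires independently with its profiled probability. */
    int sample_fanout(const struct rpc_node *node, int *callees, int max)
    {
        int n = 0;
        for (int i = 0; i < node->num_edges && n < max; i++) {
            double u = (double)rand() / RAND_MAX;
            if (u < node->edges[i].call_prob)
                callees[n++] = node->edges[i].downstream;
        }
        return n;   /* number of downstream calls to issue */
    }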
4.3 Application Skeleton
We define the application skeleton as the network and thread models of an application, which determine how it handles remote service communication, and how tasks are assigned to different threads, respectively. The application skeleton is a critical design choice for cloud services facing tight latency constraints [46, 88, 101, 106], as it directly impacts their performance and scalability.

4.3.1 Network Model. The network model describes how an application communicates with other services, acting as a client, server,
or both. When acting as a client, a service can use synchronous or asynchronous communication. In synchronous models, threads block on network I/O (e.g., send(), write()) to await responses. Asynchronous models are typically event-based, with responses handled by specific threads via callback functions. They are more complicated, as they involve additional synchronization and state machine transitions. In return, they avoid long queueing delays by allowing threads to process new requests, and offer better performance [101].

On the server side, there are three common options for the network model: blocking, non-blocking, and I/O multiplexing [102]. In all three models, threads await requests through system calls (e.g., recv(), read(), epoll()). In contrast to the other two models, the non-blocking model needs to periodically call the I/O interfaces to look for new requests, which can waste CPU time at low loads. In both blocking and I/O multiplexing models, threads block on system calls, although I/O multiplexing allows monitoring multiple sockets via a single system call (e.g., select() or epoll()). I/O multiplexing is the most commonly used model in services like Memcached, Redis, and NGINX, since they support many concurrent connections, and I/O multiplexing reduces the required number of threads.

Ditto uses SystemTap [45] to profile the network model by probing kernel-space functions and data structures. It acquires key attributes of sockets, and monitors network-related system calls, gathering the distribution of their types, arguments, and call frequency. Ditto then chooses one out of several network models that combine the different design choices described above, with socket options and network message parameters set based on profiling.
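To make the client-side skeleton concrete, the minimal sketch below shows how a synchronous client handler could be realized: it sends a request whose size is drawn from the profiled message-size distribution and then blocks on the response. It is an illustrative example rather than Ditto's generated code; the structure fields and payload byte pattern are placeholders.

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical profiled parameters for one downstream connection. */
    struct conn_profile {
        int    fd;           /* connected TCP socket                    */
        size_t req_size;     /* request size sampled from the profile   */
        size_t resp_size;    /* expected response size from the profile */
    };

    /* Synchronous client call: the thread blocks on recv() until the
     * downstream tier replies, mirroring the blocking client model above. */
    ssize_t sync_rpc_call(const struct conn_profile *c,
                          char *buf, size_t buf_len)
    {
        memset(buf, 0xA5, c->req_size);               /* synthetic payload */
        if (send(c->fd, buf, c->req_size, 0) < 0)
            return -1;

        size_t got = 0;
        while (got < c->resp_size && got < buf_len) { /* block for reply */
            ssize_t n = recv(c->fd, buf + got, buf_len - got, 0);
            if (n <= 0)
                return -1;
            got += (size_t)n;
        }
        return (ssize_t)got;
    }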
Application performance is also significantly impacted by fac-
4.3.2 Thread Model. Cloud services rely on multithreading for tors like instruction mix, and memory (data and instruction) ac-
asynchronous networking, disk I/O, and parallel processing [111]. cess patterns, branch behavior, and data dependencies. Ditto uses
The thread model describes how tasks are scheduled to and handled these platform-independent features to ensure that the generated
synthetic applications can be ported to other platforms without reprofiling.

4.4.1 System Calls. Applications use system calls to perform privileged operations in the OS kernel. Besides network handling and spawning new threads, cloud applications can make system calls to access file descriptors, allocate memory space, or synchronize on shared memory. Capturing the system call characteristics is critical to clone the kernel-level CPU and un-core metrics. Prior performance cloning studies either do not profile system call characteristics [72, 90], or only profile the total number of kernel-level instructions [87]. To accurately capture kernel-level characteristics, Ditto profiles the distribution of system calls, including their counts and arguments, with SystemTap. For example, MongoDB calls pread() to read a database file from disk. During system call profiling, Ditto captures the flags of fd and the distribution of count and offset, to accurately clone key metrics, such as disk latency, utilization, and page cache miss rates.
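For illustration, a minimal sketch of how such a profiled distribution could be replayed is shown below; the histogram structure and the sampling helper are assumptions made for this example, not Ditto's actual generator output.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Hypothetical empirical distributions extracted during profiling. */
    struct pread_profile {
        int     fd;           /* file opened with the profiled flags     */
        off_t  *offsets;      /* sampled offsets observed at runtime     */
        size_t *counts;       /* sampled read sizes observed at runtime  */
        int     num_samples;
    };

    /* Issue one synthetic pread() whose arguments follow the profile, so the
     * kernel-level work (disk I/O, page cache behavior) resembles the original. */
    ssize_t replay_pread(const struct pread_profile *p, char *buf, size_t buf_len)
    {
        int i = rand() % p->num_samples;          /* sample one observation */
        size_t count = p->counts[i];
        if (count > buf_len)
            count = buf_len;
        return pread(p->fd, buf, count, p->offsets[i]);
    }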
4.4.2 Instruction Mix. The instruction mix in Ditto captures the distribution of x86 assembly instructions at runtime in the original service, and reproduces it faithfully in the synthetic benchmark. Previous studies categorize x86 assembly instructions into integer arithmetic, integer multiplication, integer division, floating-point operations, SIMD operations, loads, stores, and control instructions [72, 87, 90]. They then generate the synthetic benchmark using a representative instruction from each category.

However, this categorization is too coarse-grained and does not capture the characteristics of modern CPU microarchitectures. The x86 ISA, for instance, contains assembly instructions with different uops, port usages, and execution cycles. For example, the CRC32 (r64, r64) instruction, which implements the checksum function, takes three cycles and can only be executed via port 1 on Skylake CPUs, while other integer arithmetic instructions usually take one cycle on any of the ports 0, 1, 5, and 6 [17, 60]. Instructions with REP/REPZ/REPNZ (repeat string operations) or LOCK prefixes can take tens of cycles or more, depending on the repeat count, or the cache/RAM configuration [51].

Ditto uses Intel SDE [68] to collect the dynamic count of each x86 instruction using Intel x86 Encoder Decoder (XED) Iforms [35]. It then clusters x86 assembly instructions by functionality (data movement, arithmetic/logic, control-flow, lock-prefixed, and repeat string operations), operands (general-purpose registers, x87 floating-point registers, XMM registers, and memory), and ALU usage [17] using hierarchical clustering, so that each cluster has similar hardware resource requirements. Ditto also profiles the average number of dynamic instructions per request, and the repeat counts of each REP-prefixed instruction. During the generation phase, Ditto randomly samples the next instruction from the instruction mix distribution. Registers and memory addresses are assigned after data memory access profiling (Section 4.4.4).
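A simplified sketch of this sampling step is shown below; the cluster table and the way a representative opcode template is emitted are assumptions made for illustration, since Ditto's real generator works over XED Iforms and per-cluster operand constraints.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical instruction cluster with its profiled probability and a
     * representative assembly template to emit into the synthetic body. */
    struct inst_cluster {
        const char *asm_template;   /* e.g. "add <X_REG>, <R_REG>"      */
        double      prob;           /* fraction of dynamic instructions */
    };

    /* Sample the next instruction cluster from the profiled mix
     * (inverse-CDF sampling over the categorical distribution). */
    const char *sample_instruction(const struct inst_cluster *mix, int n)
    {
        double u = (double)rand() / RAND_MAX, cdf = 0.0;
        for (int i = 0; i < n; i++) {
            cdf += mix[i].prob;
            if (u <= cdf)
                return mix[i].asm_template;
        }
        return mix[n - 1].asm_template;
    }

    /* Emit a block of `count` sampled instructions as inline-assembly text;
     * registers and memory offsets are filled in later (Sec. 4.4.4, 4.4.6). */
    void emit_block(FILE *out, const struct inst_cluster *mix, int n, int count)
    {
        for (int i = 0; i < count; i++)
            fprintf(out, "\"%s\\n\"\n", sample_instruction(mix, n));
    }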
4.4.3 Branch Behavior. Branch prediction accuracy, which is determined by both the branch behavior of the application and the branch predictors, is critical in modern out-of-order CPUs [60, 78]. Prior studies observe that branch taken ratios and transition rates (the frequency with which a branch switches between taken and not-taken) impact the branch prediction accuracy and misprediction penalty [50, 64]. Branches with extremely high taken or not-taken ratios, even if their patterns are completely random, have fewer mispredictions, since the majority of executions are in one direction. Similarly, branches with low transition rates are easier to predict. We also find that instruction locality and the number of static branch instructions significantly contribute to the branch prediction accuracy, especially for applications with large binaries.

Based on these observations, Ditto profiles the distribution of branch taken/not-taken rates and transition rates across all conditional branch instructions, and together with the instruction memory access pattern analysis it accurately clones the branch misprediction behavior of the target application. We quantize the taken/not-taken rates and transition rates in log scale, from 2^-1 to 2^-10. During the generation phase, Ditto samples a taken/not-taken rate and transition rate from the profiled distribution for each conditional branch instruction.

Lines 21-22 in the right code snippet in Figure 3 show how Ditto generates conditional branch instructions with profiled taken/not-taken rates and transition rates. <BIT_MASK> is a binary mask precomputed during the generation phase, which contains M ones in the highest bits and N zeros in the lowest bits; 2^-M is the taken/not-taken rate, and 2^-N is the transition rate. The ZF flag, which determines the branch direction of jz or jnz, changes periodically according to the bitmask in the test instruction.
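The fragment below illustrates the idea in isolation: a loop counter ANDed with a precomputed mask yields a branch whose taken rate and transition rate are set by the number of mask bits. It is a self-contained toy rather than the emitted code; the mask layout is simplified and the counter here is a C variable instead of the reserved register r8d.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* M ones in the high bits, N zeros in the low bits of the mask,
         * mirroring <BIT_MASK> in Figure 3 (values here are arbitrary). */
        const int M = 3, N = 2;
        const uint32_t mask = ((1u << M) - 1u) << N;

        uint64_t taken = 0, transitions = 0, iters = 1u << 20;
        int prev = -1;

        for (uint32_t counter = 0; counter < iters; counter++) {
            /* Branch condition: analogue of "test r8d, <BIT_MASK>; jz ..." */
            int is_taken = ((counter & mask) == 0);
            taken += is_taken;
            if (prev >= 0 && is_taken != prev)
                transitions++;
            prev = is_taken;
            /* ... synthetic instruction block body would go here ... */
        }
        printf("taken rate: %.4f  transition rate: %.4f\n",
               (double)taken / iters, (double)transitions / iters);
        return 0;
    }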
4.4.4 Data Memory Access Pattern. The memory access pattern is a dominant characteristic of an application, as it impacts the backend of the CPU and memory subsystem. Since operands in arithmetic instructions in synthetic benchmarks are randomly generated, they cannot calculate meaningful memory addresses at runtime. Thus, memory addresses or offsets need to be pre-calculated in the generation phase and hard-coded in the synthetic application binaries.

Previous studies [20, 22, 86] capture memory access patterns using the stack distance, reuse distance, and stride pattern profiles. However, they need 10 to 20 million memory traces to accurately represent target memory access patterns because of the sparsity of the memory address space, and the multimodality in memory accesses [63]. Preserving the original access patterns requires millions of hard-coded memory instructions, which significantly interferes with other performance characteristics. Moreover, directly replicating the target memory access pattern introduces security concerns, since previous studies showed that memory access patterns reveal confidential information about the service [57, 66, 71].

Instead, Ditto uses profiling of the memory working set to synthesize appropriate data memory access patterns without incurring high instruction misses or leaking application context. We construct a sequence of memory accesses for working sets with different sizes, from 64 bytes (one cache line) to the maximum memory size allocated to the target application, increasing by a factor of two. Each memory access only reads or writes the first data in a cache line to ensure that a new cache line is loaded, assuming the most common write-allocate policy. We use Valgrind [85] to compute the distribution of memory accesses with different working set sizes, which can be efficiently simulated as "cache hits" for different "cache sizes". Each "cache size" only needs to be simulated once during profiling. We calculate the number of memory accesses in a working set of 2^i bytes as follows:
$$A_d(2^i) = \begin{cases} H_d(2^i) & \text{if } 2^i = 64 \text{ bytes} \\ H_d(2^i) - H_d(2^{i-1}) & \text{otherwise} \end{cases} \qquad (1)$$

where A_d(2^i) is the number of memory accesses for a working set of 2^i bytes in generated code, and H_d(2^i) is the number of cache hits in a 2^i-byte cache in the original application. The synthetic working set-based memory access pattern is illustrated in Figure 4, with the number of memory accesses for each working set equal to that of the profiled distribution. Since the memory accesses are limited to the working set size, it is guaranteed that A_d(2^i) accesses will contribute to A_d(2^i) hits when cache size ≥ 2^i bytes. Assuming a least-recently-used (LRU) cache replacement policy or its pseudo-LRU variant, commonly used in recent Intel processors [18, 105], since we iterate through cache lines in a working set sequentially, there must be previous memory accesses which evict this cache line when cache size < 2^i bytes. Therefore, every memory access of a 2^i-byte working set ends up with a miss when cache size < 2^i bytes. The statement is true for any memory hierarchy and cache inclusion policy because of the sequential access pattern within each working set. Therefore, even if applications are profiled with a single-level cache, the results can be applied to any number of cache levels and inclusion policies. Applications are profiled with an 8-way cache for working sets < 1MB and a 16-way cache for working sets ≥ 1MB, which are close to the typical values of modern CPUs. There is an average 1.9% error in the cache miss rate when cache associativity changes across all examined applications. We allocate an array for memory accesses in the heap when the synthetic application is initialized, and store the base address in a register (for example, r10). Ditto generates the address offsets for each memory instruction, which can access [r10 + <OFFSET>] at runtime.

Figure 4: Working-set-based data memory access generation. Except for the 64-byte working set, the memory accesses of a 2^i-byte working set start at address 2^(i-1) and loop iteratively within the working set.
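The sketch below illustrates Equation (1) and Figure 4 end to end. It is a simplified model: it performs the accesses directly at run time instead of emitting hard-coded offsets, and the working-set layout follows the illustrative scheme in Figure 4.

    #include <stdlib.h>

    #define CACHE_LINE 64

    /* A_d for each working-set size, derived from the profiled hit counts H_d
     * per Equation (1): A_d(64 B) = H_d(64 B); A_d(2^i) = H_d(2^i) - H_d(2^(i-1)). */
    static void accesses_from_hits(const long *H_d, long *A_d, int levels)
    {
        A_d[0] = H_d[0];
        for (int i = 1; i < levels; i++)
            A_d[i] = H_d[i] - H_d[i - 1];
    }

    /* Touch the first word of every cache line, starting at offset ws/2 and
     * wrapping within [0, ws), until `accesses` accesses have been issued. */
    static void run_working_set(volatile char *buf, long ws, long accesses)
    {
        long line = (ws == CACHE_LINE) ? 0 : ws / 2;   /* start at 2^(i-1) */
        for (long a = 0; a < accesses; a++) {
            buf[line] += 1;                  /* one access per cache line  */
            line += CACHE_LINE;
            if (line >= ws)
                line = 0;                    /* loop within the working set */
        }
    }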
Coherence misses also contribute to cache miss rates in multi-threaded applications. Coherence misses happen when cache lines containing shared data are invalidated by another core. To accurately clone cache behavior with multi-threading, we use Intel SDE to profile the ratio between private data accesses and shared data accesses, and generate memory accesses accordingly.

Modern CPUs implement hardware prefetching mechanisms to improve cache performance. Hardware prefetchers detect load instructions with regular strides, sequences of consecutive cache line accesses, and adjacent cache line accesses, to load data into caches before they are needed [84]. To clone the performance impact of cache prefetching, we calculate the ratio of regular to irregular memory access patterns from the runtime memory trace and use this ratio to control the number of regular memory access sequences in the synthetic applications.

4.4.5 Instruction Memory Access Pattern. Instruction memory access patterns significantly impact CPU frontend and backend performance, as they determine the L1i, L2, L3 cache misses and branch mispredictions. Replicating the original application's instruction memory access pattern is not possible with a synthetic benchmark because the execution flow is usually controlled by the computation's output at runtime.

Therefore, Ditto synthesizes instruction memory access patterns with a similar approach to that of Section 4.4.4. We profile the i-cache hits of the original application with different i-cache sizes using Valgrind. Then, we calculate the distribution of dynamic executions in an instruction memory working set of 2^j bytes as follows, assuming the cache line size is 64 bytes, and the average instruction size is 4 bytes:

$$E_i(2^j) = \begin{cases} 16 \cdot H_i(2^j) - H_i(2^{j-1}) & \text{if } 2^j > 64 \text{ bytes} \\ H_i(2^N) - \sum_{j'=7}^{N} E_i(2^{j'}) & \text{if } 2^j = 64 \text{ bytes} \end{cases} \qquad (2)$$

where E_i(2^j) is the number of instruction executions with a working set of 2^j bytes in the synthetic code, 2^N is the maximum instruction working set size, H_i(2^j) is the number of i-cache hits on a 2^j-byte i-cache in the original application, and the number of instructions in a cache line is 16 (64 B cache line / 4 B instruction size). After profiling the distribution of i-cache accesses with different instruction working sets, Ditto generates static assembly instruction blocks, shown in lines 14-26 in the right code snippet of Fig. 3. The number of instructions per block matches the instruction working set size, and the loop iteration number is determined by the distribution.
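As a small illustration of how Equation (2) translates into generated code, the sketch below computes, for each instruction working set, how many iterations its static block must loop so that the block's dynamic executions match the profiled distribution. It is simplified; the helper names and the 4-byte average instruction size are taken from the text above.

    /* Static instructions in a block of 2^j bytes, assuming ~4-byte instructions. */
    static long block_static_insts(long ws_bytes) { return ws_bytes / 4; }

    /* Loop count for the block so that its dynamic executions equal E_i(2^j). */
    static long block_loop_count(long E_i, long ws_bytes)
    {
        long per_iter = block_static_insts(ws_bytes);
        return (E_i + per_iter - 1) / per_iter;   /* round up */
    }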
4.4.6 Data Dependencies. Data dependencies are another inherent characteristic of an application that impacts performance. Data dependencies can flow through registers or memory locations, limiting the number of simultaneous instructions issued to an execution unit (instruction-level parallelism, or ILP), and the number of outstanding memory requests (memory-level parallelism, or MLP) [65].

Ditto uses the distribution of data dependency distances to quantify data flows through registers. We measure the read after write (RAW), write after read (WAR), and write after write (WAW) data dependency distance from the dynamic control flow graph (DCFG) generated using Intel SDE [69]. The dependency distance is quantized into 11 bins, increasing exponentially from 1 to 1024, since a larger dependency distance does not impact the ILP, due to the limited size of the reorder buffer. When generating the synthetic code, we reserve several registers for recording the loop counters and data memory addresses, and use the rest of the general-purpose and SIMD registers to clone the data dependency characteristics. To assign registers for each instruction, Ditto samples a (RAW, WAR, WAW) distance tuple from the profiled distributions, and chooses an available register with the closest distance values. Data dependencies through registers can also impact MLP if the register values determine memory locations. Such behavior cannot be captured since the synthetic application never writes to a reserved register with the memory base address. To address this, we replace a fraction of memory reads with pointer-chasing reads (mov r11, QWORD PTR [r11]), determined by the MLP measured with Perf.
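A minimal sketch of the pointer-chasing idea is shown below. Because each load's address depends on the previous load's result, the chain serializes memory requests and therefore lowers the effective MLP; the fraction of such reads is what Ditto tunes against the MLP measured with Perf. The sketch is illustrative only: the chain is built and walked in C rather than with the emitted mov r11, [r11] instructions.

    #include <stdlib.h>
    #include <stdint.h>

    /* Build a random cyclic pointer chain over `lines` cache lines: each slot
     * stores the address of the next slot to visit. */
    uintptr_t *build_chain(size_t lines)
    {
        uintptr_t **slots = malloc(lines * 64);
        size_t *order = malloc(lines * sizeof(size_t));
        for (size_t i = 0; i < lines; i++) order[i] = i;
        for (size_t i = lines - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < lines; i++)            /* link the cycle */
            *(uintptr_t **)((char *)slots + order[i] * 64) =
                (uintptr_t *)((char *)slots + order[(i + 1) % lines] * 64);
        free(order);
        return (uintptr_t *)slots;
    }

    /* Dependent loads: each iteration must wait for the previous one. */
    uintptr_t *chase(uintptr_t *p, long steps)
    {
        for (long i = 0; i < steps; i++)
            p = (uintptr_t *)*p;                      /* like mov r11, [r11] */
        return p;
    }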
Data dependencies through memory locations are much more difficult to profile with the DCFG, since memory addresses are often calculated at runtime. However, they are partially determined by the data access patterns that Ditto already profiles (Section 4.4.4). A program with a shorter memory dependency distance can be modeled with a smaller working set, because the probability that the cache line is evicted by other instructions in between is lower.

4.5 Fine Tuning
Finally, Ditto implements fine tuning to calibrate the output of previous steps, due to inaccuracies introduced by the instrumentation tools. For example, application body profiling does not consider the interaction between user-space and kernel-space functions, or the correlation between the application skeleton and body; thus the actual d-cache and i-cache miss rates are often higher than the profiled results. Ditto iteratively runs the synthetic application on a specific platform, computes the errors between the target and synthetic service, adjusts the inputs to the generator accordingly, and regenerates the synthetic application. Although there are many knobs to tune, most of them are orthogonal to each other. We have characterized the correlation across knobs, and derived the small groups of parameters that need to be jointly tuned (e.g., branch taken/transition rate and i-cache pattern, because they all influence branch prediction). Since the relationships between knobs and performance are mostly linear, we use a feedback-based heuristic to tune knobs within a group. Fine tuning uses performance counters for calibration. It usually takes fewer than ten iterations to reach over 95% accuracy, and incurs low overhead since each iteration only takes a couple of tens of seconds. Since Ditto captures performance characteristics well with platform-independent data, this fine tuning does not compromise the generality of the synthetic service, as shown in Section 6.2.2.
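A sketch of such a feedback step is given below: a generic proportional update under the stated near-linear assumption. The knob and metric names are placeholders, not Ditto's actual parameters.

    /* One feedback iteration: nudge a generator knob proportionally to the
     * relative error between the synthetic and the target metric. */
    double tune_knob(double knob, double synthetic_metric,
                     double target_metric, double gain)
    {
        double rel_err = (target_metric - synthetic_metric) / target_metric;
        return knob * (1.0 + gain * rel_err);   /* assumes roughly linear response */
    }

In a full loop, one such update would be applied per knob group, the benchmark regenerated, and the hardware counters re-measured until the error falls below the desired threshold.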
5 IMPLEMENTATION
Ditto is implemented primarily in Python and C, in about 16,000 lines of code. It supports C/C++ applications, the Apache Thrift [1] and gRPC [3] RPC frameworks, and x86 ISAs, which are commonly used in cloud environments. It can be extended to more languages, frameworks, and ISAs by leveraging compatible profiling tools. Ditto can generate applications that run on a single machine or containerized microservices that run on multiple nodes, using Docker Swarm or Kubernetes. The runtime profilers and emulators, including SystemTap, Intel SDE, and Valgrind, can introduce overheads to the original application during profiling. This overhead only occurs once, and does not affect the accuracy of the platform-independent features collected during profiling.

To generate a clone, cloud providers only need to specify a representative input for their service. Ditto automatically instruments the application at runtime, collecting profiling statistics and feeding them to the code generator, followed by the fine-tuning process. Ditto does not require reprofiling if the input change does not affect the application body, such as changes in QPS or number of connections. Inevitably, if a new input exercises an entirely new code path or memory access pattern, this will need to be profiled to create a new clone. The synthesized binaries can run directly on hardware or on execution-driven simulators like gem5 [31] and ZSim [94], or their traces can be fed to trace-driven simulators like Ramulator [75].

6 EVALUATION
6.1 Methodology
6.1.1 Platforms. We validate Ditto on a heterogeneous cluster with three types of servers, whose specs are in Table 1. All servers run the x86 ISA, but differ in their CPU and memory architectures, and their storage and network.

Table 1: Server platform specifications.
Figure 5: CPU performance metrics (IPC, branch mispredictions, L1i, L1d, L2 and LLC miss rates), network bandwidth, disk
bandwidth (MongoDB only) and service latency under varying load across six services. CPU metrics are normalized to each
original application’s metrics under medium load. Network and disk bandwidth are, by exception, normalized to each original
application’s bandwidth under current load, because their magnitudes change significantly, and would obscure the figure’s
shape.
Figure 7: CPU metrics (IPC, branch misprediction, L1i, L1d, L2 and LLC misses), network BW, disk BW (MongoDB only) and
latencies across platforms. CPU metrics are normalized to each original service on Platform A.
are representative of the other tiers of the service. TextService manages the text users add to composed posts, and SocialGraphService manages follow relationships between users. We do not show each tier due to space constraints, but have validated that the results are similar for them. All applications are generated using profiling data under medium load; Ditto has not profiled any other load. We increase the load until the single-tier application or the bottleneck tier in the microservice topology saturates one or more resources (e.g., disk I/O for MongoDB and CPU for the other applications). Since we use a closed-loop workload generator for MongoDB and Redis, which only allows one outstanding request per connection, their latency does not increase significantly at high load. While the end-to-end latency of Social Network increases at high load, the latency of TextService and SocialGraphService only increases slightly, since they are not bottleneck tiers.

The upper three rows show IPC, branch misprediction, L1i, L1d, L2, and LLC miss rates, and network and disk I/O bandwidth under low, medium, and high load, with average errors across all applications being 4.1%, 9.9%, 7.1%, 5.1%, 6.9%, 12.1%, 0.1%, and 0.1%, respectively. This indicates that Ditto accurately clones the overall hardware performance metrics. Memcached and NGINX have low IPC under low load because of high branch misprediction and L1i and L2 misses, while SocialGraphService has high IPC due to fewer LLC misses. At high load, Memcached and Redis have similar metrics to medium load; however, the other four applications exhibit different degrees of L2 and LLC miss variation. The results illustrate that applications can have very different characteristics under different loads, which are accurately captured by Ditto in their synthetic counterparts. The network and disk bandwidth also conform to the original by faithfully reproducing the system calls. We only show disk bandwidth for MongoDB since the other services do not involve disk I/O. The bottom line plot shows the average, 95th, and 99th percentile latencies, which also match the originals, with the p99 diverging at high load, due to the queueing behavior in the network stack at saturation.

Fig. 6 shows the end-to-end latency of the original and synthetic Social Network when every individual microservice is replaced with a synthetic one.

6.2.2 Validation on Varying Platforms. We also validate the CPU, network, and disk metrics and the service latency as we vary the host platforms. Each application is profiled only on Platform A, and validated on Platforms A, B, and C. Figure 7 shows that the synthetic benchmarks react to platform changes in a similar way to
the original applications. More specifically, all six applications have different degrees of L2 cache miss increases on Platforms B and C, due to their smaller L2 cache sizes. Applications running on Platform B, which is an older CPU generation, have consistently lower IPC. When running all microservices of the Social Network on the small-scale Platform C server, the high degree of interference results in high LLC miss rates for TextService and SocialGraphService, both original and synthetic. Network and disk I/O bandwidths are identical across platforms, since the amount of data transferred is independent of the platform. The line plots at the bottom show the latency on the three platforms, where the synthetic always matches the original. All applications experience the highest latency on Platform B because it has the lowest IPC. The latency of MongoDB is significantly lower on Platform A because it benefits from the low random access latency of SSDs. In general, the fact that the synthetic applications react to platform changes the same way as the original, without reprofiling, shows that Ditto accurately captures critical, platform-independent features that impact performance.

Figure 8: Cycles breakdown (A: actual, S: synthetic). Bars show the retiring, front-end, bad speculation, and back-end fractions for Memcached, NGINX, MongoDB, Redis, TextService, and SocialGraphService.

6.3 CPU Top-down Analysis
Figure 8 shows the cycles per instruction (CPI) top-down analysis of the original and synthetic applications. Ditto accurately captures the cycle breakdown of the original applications. Many prior studies have shown that cloud services diverge from traditional scientific CPU benchmarks like SPEC CPU by having significant fractions of front-end stalls, due to large code footprints and frequent context switches between user and kernel mode [53, 101, 111]. Our synthetic benchmarks show similar bottlenecks to the original applications, and can be used as proxies for microarchitectural optimizations.

Figure 9: Evolution of IPC, instructions, cycles, and p99 latency for MongoDB as we add sophistication to Ditto.

6.4 Decomposition of Ditto's Accuracy
In Fig. 9, we use MongoDB as an example to show Ditto's accuracy as the framework incorporates more information. We start with a version of Ditto that only generates the thread model and network interfaces skeleton, but an empty request handling body. From A to B, we inject the system calls with arguments drawn from the distribution of the original application, which increases the kernel-level instructions and disk I/Os. In C, we add user-level instructions (add rax, rax) to match the total instruction count, but not their specific mix. From C to D, user-level instructions are generated based on the profiled mix. We assume the highest branch taken/transition rate, strongest data dependencies, and all memory operations accessing the smallest working sets. We observe an IPC decrease from 1.11 to 1.02 due to memory instructions incurring additional cycles in the backend. From D to E, we clone the branch behaviors following the profiled branch taken and transition rates. The branch misprediction rate drops from 1.95% to 1.47% but has a negligible impact on IPC. In step F, we synthesize the instruction memory accesses, which causes more i-cache misses (from 1.3% to 7.3%) and branch mispredictions (from 1.47% to 4.56%, as discussed in Sec. 4.4.3), and significantly lowers the IPC. From F to G, we synthesize the data memory access pattern by accessing different sizes of private and shared working sets. The IPC further decreases as the L1d miss rate rises from 17% to 24%. In H, we mimic data dependencies by reassigning registers for each instruction, which clones the ILP and MLP characteristics and slightly lowers the IPC. From H to I, we perform the fine tuning, which calibrates instruction and data access patterns, lowers the IPC from 0.6 to 0.51, and further improves accuracy. This shows that, even if not every aspect in Ditto is equally important, they are all required to accurately clone complex cloud services.

6.5 Case Study: Interference Analysis
Figure 10 shows that synthetic applications react to resource interference in a similar way to their original counterparts, even though we only profile the original application in isolation. We show the analysis on NGINX, but the results are similar for other services. We use a set of stress benchmarks to generate interference in different resources. We use stress-ng [10] to generate hyperthreading (HT), L1d, and L2 interference by co-locating the applications and microbenchmarks on different logical cores of the same physical core. The synthetic application captures the IPC and latency degradation caused by memory contention. When generating L2 interference,
besides the L2 miss rate increase, the synthetic workload also captures the LLC miss rate change in the original service, due to an increase in LLC accesses with constant misses.

We also use iBench [41] to generate LLC interference on the shared socket, and the result shows that the synthetic application captures the IPC drop of the original service. Finally, we use iperf3 [16] to compete with the service for network bandwidth, and the latency of the synthetic application successfully matches the original service.

Figure 10: Interference impact on NGINX. Panels show IPC, p99 latency (ms), and L1i, L1d, L2, and LLC miss rates (%) for the original run (Orig.) and under HT, L1d, L2, LLC, and network (Net) interference.

Figure 11: 99th percentile latency of actual and synthetic Memcached under varying CPU frequency and core count.

6.6 Case Study: CPU Core and Frequency Scaling
Fig. 11 shows using Ditto to evaluate power management in Memcached with CPU core and frequency scaling. Each cell represents the p99 latency under a given number of cores and frequency. We set the QoS as 1ms, and cells with marks mean that the QoS cannot be satisfied for that configuration. Memcached cannot meet the QoS at low frequency even with the maximum number of cores, which prohibits aggressive power management. Synthetic Memcached accurately captures the latency variation of Memcached under different settings. This similarity indicates that cloud providers can use synthetic applications to determine whether power management is beneficial for a service, without needing access to its source code.

7 DISCUSSION
7.1 Suitable and Unsuitable Use Cases
Ditto's main contribution is cloning an end-to-end application across the system stack. This makes it more suitable for architecture-, OS-, application-, and cluster-level studies, including scalability, networking, threading, interference, and power management. It can also be used for certain microarchitecture studies, as we showed when changing memory hierarchies across platforms.

However, to enable fast, automated, and obfuscated cloning, our method abstracts away the original individual instructions and memory accesses. Thus, it is less accurate for microarchitecture studies that rely on the exact application implementation rather than its statistical performance patterns. For instance, studying memory access patterns to improve hardware prefetchers would not be a good fit for Ditto. There is a fundamental trade-off between the granularity at which information is captured in one subsystem and the overall performance accuracy.

7.2 Confidentiality
Although Ditto cannot guarantee zero information leakage, as it may expose the RPC graph and statistics of some hardware counters, many software companies do not consider these sensitive data. For example, Alibaba has open-sourced their production RPC traces [80], Facebook shared the kernel-level cycles breakdown of its production workloads [99], and Google open-sourced their workload traces via DynamoRIO [21]. Additionally, the application skeleton, while it may reflect the original workflow to some extent, is chosen and adapted from network and threading models that have been extensively studied and used [46, 101, 108]. As the actual logic, functionality, and per-access memory patterns are concealed, inferring useful proprietary information would be very difficult.

For collaboration with hardware vendors, sharing the proprietary code under NDA is of course a more straightforward way, but most cloud providers would not share the binaries related to their core business regardless of NDA agreements. Other solutions, like internal evaluation of prototypes, can be time-consuming and inaccurate, since the prototype is usually immature and cloud providers need to adapt their workloads for each new prototype.

7.3 Application Phases
Previous studies observe execution phases in SPEC CPU benchmark applications [62, 70]. To verify whether program phases exist in the evaluated cloud services, we collect time series of CPU metrics spanning 600 seconds, with sampling frequencies ranging from one second to ten seconds. We do not observe regular program phases at such sampling granularity. There may be program phases in the execution of individual requests, which range from tens of
[37] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. 2010.
[38] Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, and Luis Irun-Briz. Tupni: Automatic reverse engineering of input formats. In Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS '08, page 391–402, New York, NY, USA, 2008. Association for Computing Machinery.
[39] Deeksha Dangwal, Weilong Cui, Joseph McMahan, and Timothy Sherwood. Safer program behavior sharing through trace wringing. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1059–1072, 2019.
[40] Jeffrey Dean and Luiz Andre Barroso. The tail at scale. In CACM, Vol. 56 No. 2.
[41] Christina Delimitrou and Christos Kozyrakis. iBench: Quantifying Interference for Datacenter Workloads. In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC). Portland, OR, September 2013.
[42] Christina Delimitrou and Christos Kozyrakis. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.
[43] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, USA, 2014.
[44] eBPF Foundation. ebpf, 2021.
[45] Frank C. Eigler, Vara Prasad, Will Cohen, Hien Nguyen, Martin Hunt, Jim Keniston, and Brad Chen. Architecture of systemtap: a linux trace/probe tool, 2005.
[46] Qi Fan and Qingyang Wang. Performance comparison of web servers with different architectures: A case study using high concurrency workload. In 2015 Third IEEE Workshop on Hot Topics in Web Systems and Technologies (HotWeb), pages 37–42. IEEE, 2015.
[47] Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. An updated performance comparison of virtual machines and linux containers. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 171–172, 2015.
[48] Michael Ferdman, Almutaz Adileh, et al. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In Proc. of ASPLOS. London, England, UK, 2012.
[49] Brad Fitzpatrick. Distributed caching with memcached. In Linux Journal, Volume 2004, Issue 124, 2004.
[50] Agner Fog. The microarchitecture of intel, amd and via cpus: an optimization guide for assembly programmers and compiler makers.
[51] Agner Fog et al. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for intel, amd and via cpus. Copenhagen University College of Engineering, 93:110, 2011.
[52] Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. Sage: Practical and scalable ml-driven performance debugging in microservices. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2021, page 135–151, New York, NY, USA, 2021. Association for Computing Machinery.
[53] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayantara Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
[54] Yu Gan, Yanqi Zhang, Kelvin Hu, Yuan He, Meghna Pancholi, Dailun Cheng, and Christina Delimitrou. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
[55] Karthik Ganesan, Jungho Jo, and Lizy K. John. Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and implantBench workloads. ISPASS 2010 - IEEE International Symposium on Performance Analysis of Systems and Software, pages 33–44, 2010.
[56] Karthik Ganesan and Lizy Kurian John. Automatic generation of miniaturized synthetic proxies for target applications to efficiently design multicore processors. IEEE Transactions on Computers, 63(4):833–846, 2014.
[57] Oded Goldreich and Rafail Ostrovsky. Software protection and simulation on oblivious rams. Journal of the ACM (JACM), 43(3):431–473, 1996.
[58] Tarun Goyal, Ajit Singh, and Aakanksha Agrawal. Cloudsim: simulator for cloud computing infrastructure and modeling. Procedia Engineering, 38:3566–3572, 2012.
[59] Brendan Gregg. Systems performance: enterprise and the cloud. Pearson Education, 2014.
[60] Part Guide. Intel® 64 and ia-32 architectures software developer's manual. Volume 3B: System programming Guide, Part, 2(11), 2011.
[61] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, and R.B. Brown. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538), pages 3–14, 2001.
[62] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. Simpoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism, 7(4):1–28, 2005.
[63] Milad Hashemi, Kevin Swersky, Jamie Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, and Parthasarathy Ranganathan. Learning memory access patterns. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1919–1928. PMLR, 10–15 Jul 2018.
[64] M. Haungs, P. Sallee, and M. Farrens. Branch transition rate: a new metric for improved branch classification analysis. In Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), pages 241–250, 2000.
[65] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[66] Weizhe Hua, Zhiru Zhang, and G. Edward Suh. Reverse engineering convolutional neural networks through side-channel information leaks. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6, 2018.
[67] Intel. Intel vtune amplifier. https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html, 2021.
[68] Intel. Intel software development emulator, 2012.
[69] Intel. Dynamic control-flow graph generation with pinplay, 2015.
[70] Canturk Isci, Alper Buyuktosunoglu, and Margaret Martonosi. Long-term workload phases: Duration predictions and applications to dvfs. IEEE Micro, 25(5):39–51, 2005.
[71] Tara Merin John, Syed Kamran Haider, Hamza Omar, and Marten van Dijk. Connecting the dots: Privacy leakage via write-access patterns to the main memory. IEEE Transactions on Dependable and Secure Computing, 17(2):436–442, 2020.
[72] Ajay Joshi, Lieven Eeckhout, Robert H. Bell, and Lizy John. Performance cloning: A technique for disseminating proprietary applications as benchmarks. In 2006 IEEE International Symposium on Workload Characterization, pages 105–115, 2006.
[73] Ajay Joshi, Lieven Eeckhout, Robert H. Bell, and Lizy K. John. Distilling the essence of proprietary workloads into miniature benchmarks. ACM Trans. Archit. Code Optim., 5(2), September 2008.
[74] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. Profiling a warehouse-scale computer. SIGARCH Comput. Archit. News, 43(3S):158–169, June 2015.
[75] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A fast and extensible dram simulator. IEEE Computer Architecture Letters, 15(1):45–49, 2015.
[76] Nikita Lazarev, Shaojie Xiang, Neil Adit, Zhiru Zhang, and Christina Delimitrou. Dagger: Efficient and fast rpcs in cloud microservices with near-memory reconfigurable nics. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2021, page 36–51, New York, NY, USA, 2021. Association for Computing Machinery.
[77] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. In Proc. of EuroSys. 2014.
[78] Chit-Kwan Lin and Stephen J. Tarsa. Branch prediction is not a solved problem: Measurements, opportunities, and future directions. In 2019 IEEE International Symposium on Workload Characterization (IISWC), pages 228–238, 2019.
[79] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, OR, 2015.
[80] Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, and Chengzhong Xu. Characterizing microservice dependency and performance: Alibaba trace analysis. In Proceedings of the ACM Symposium on Cloud Computing, SoCC '21, page 412–426, New York, NY, USA, 2021. Association for Computing Machinery.
[81] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. Power management of online data-intensive services. In Proceedings of the 38th annual international symposium on Computer architecture, pages 319–330, 2011.
[82] David Meisner, Junjie Wu, and Thomas F. Wenisch. Bighouse: A simulation infrastructure for data center systems. In 2012 IEEE International Symposium on Performance Analysis of Systems Software, pages 35–45, 2012.
[83] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.
[84] Intel® 64 and ia-32 architectures optimization reference manual. February 2022.
[85] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. ACM Sigplan notices, 42(6):89–100, 2007.
[86] Reena Panda and Lizy K. John. Halo: A hierarchical memory access locality modeling technique for memory system explorations. In Proceedings of the 2018 International Conference on Supercomputing, ICS '18, page 118–128, New York, NY, USA, 2018. Association for Computing Machinery.
[87] Reena Panda and Lizy Kurian John. Proxy benchmarks for emerging big-data workloads. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 105–116, 2017.
[88] David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, Amol Shukla, and David R. Cheriton. Comparing the performance of web server architectures. ACM SIGOPS Operating Systems Review, 41(3):231–243, 2007.
[89] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. Pinplay: A framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, page 2–11, New York, NY, USA, 2010. Association for Computing Machinery.
[90] G. Ravi, R. Bertran, P. Bose, and M. Lipasti. Micrograd: A centralized framework for workload cloning and stress testing. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 70–72, Los Alamitos, CA, USA, March 2021. IEEE Computer Society.
[91] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of SOCC. 2012.
[92] Michiel Ronsse and Koen De Bosschere. Recplay: A fully integrated practical record/replay system. ACM Trans. Comput. Syst., 17(2):133–152, May 1999.
[93] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
[94] Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, page 475–486, New York, NY, USA, 2013. Association for Computing Machinery.
[95] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. Jalangi: A selective record-replay and dynamic analysis framework for javascript. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, page 488–498, New York, NY, USA, 2013. Association for Computing Machinery.
[96] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, Inc., 2010.
[97] Will Sobel, Shanti Subramanyam, Akara Sucharitakul, Jimmy Nguyen, Hubert Wong, Arthur Klepchukov, Sheetal Patil, Armando Fox, and David Patterson. Cloudstone: Multi-platform, multi-language benchmark and measurement tools for web 2.0. In Proc. of CCA, volume 8, page 228, 2008.
[98] A. Sriraman and T. F. Wenisch. µSuite: A benchmark suite for microservices. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pages 1–12, 2018.
[99] Akshitha Sriraman and Abhishek Dhanotia. Accelerometer: Understanding acceleration opportunities for data center overheads at hyperscale. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, page 733–750, New York, NY, USA, 2020. Association for Computing Machinery.
[100] Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. Softsku: Optimizing server architectures for microservice diversity @scale. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, page 513–526, New York, NY, USA, 2019. Association for Computing Machinery.
[101] Akshitha Sriraman and Thomas F. Wenisch. µtune: Auto-tuned threading for OLDI microservices. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 177–194, Carlsbad, CA, October 2018. USENIX Association.
[102] W. Richard Stevens and Thomas Narten. Unix network programming. ACM SIGCOMM Computer Communication Review, 20(2):8–9, 1990.
[103] Takanori Ueda, Takuya Nakaike, and Moriyoshi Ohara. Workload characterization for microservices. In Proc. of IISWC. 2016.
[104] Luk Van Ertvelde and Lieven Eeckhout. Dispersing proprietary applications as benchmarks through code mutation. In Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, pages 201–210, 2008.
[105] Pepe Vila, Pierre Ganty, Marco Guarnieri, and Boris Köpf. Cachequery: Learning replacement policies from hardware caches. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, page 519–532, New York, NY, USA, 2020. Association for Computing Machinery.
[106] Qingyang Wang, Chien-An Lai, Yasuhiko Kanemasa, Shungeng Zhang, and Calton Pu. A study of long-tail latency in n-tier systems: Rpc vs. asynchronous invocations. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 207–217. IEEE, 2017.
[107] Ahmad Yasin. A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 35–44, 2014.
[108] Shungeng Zhang, Qingyang Wang, Yasuhiko Kanemasa, Huasong Shan, and Liting Hu. The impact of event processing flow on asynchronous server efficiency. IEEE Transactions on Parallel and Distributed Systems, 31(3):565–579, 2019.
[109] Yanqi Zhang, Yu Gan, and Christina Delimitrou. µqsim: Enabling accurate and scalable simulation for interactive microservices. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 212–222, 2019.
[110] Yanqi Zhang, Iñigo Goiri, Gohar Irfan Chaudhry, Rodrigo Fonseca, Sameh Elnikety, Christina Delimitrou, and Ricardo Bianchini. Faster and cheaper serverless computing on harvested resources. In Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP), October 2021.
[111] Yuhao Zhu, Daniel Richins, Matthew Halpern, and Vijay Janapa Reddi. Microarchitectural implications of event-driven server-side web applications. In Proc. of MICRO, 2015.
Received 2022-07-07; accepted 2022-09-22