0% found this document useful (0 votes)

14 views16 pages

Ditto

Uploaded by

Saulo Santos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views16 pages

Ditto

Uploaded by

Saulo Santos

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

MIT Open Access Articles

Ditto: End-to-End Application Cloning

for Networked Cloud Services

The MIT Faculty has made this article openly available. Please share
how this access benefits you. Your story matters.

Citation: Liang, Mingyu, Gan, Yu, Li, Yueying, Torres, Carlos, Dhanotia, Abhishek et al. 2023.
"Ditto: End-to-End Application Cloning for Networked Cloud Services."

As Published: https://doi.org/10.1145/3575693.3575751

Publisher: ACM|Proceedings of the 28th ACM International Conference on Architectural Support

for Programming Languages and Operating Systems, Volume 2

Persistent URL: https://hdl.handle.net/1721.1/147828

Version: Final published version: final published article, as it appeared in a journal, conference
proceedings, or other formally published context

Terms of Use: Article is made available in accordance with the publisher's policy and may be
subject to US copyright law. Please refer to the publisher's site for terms of use.
Ditto: End-to-End Application Cloning for Networked Cloud
Services
Mingyu Liang∗ Yu Gan∗ Yueying Li
[email protected] [email protected] [email protected]
Cornell University Cornell University Cornell University
Ithaca, New York, USA Ithaca, New York, USA Ithaca, New York, USA

Carlos Torres Abhishek Dhanotia Mahesh Ketkar

[email protected] [email protected] [email protected]
Meta Meta Intel
Menlo Park, California, USA Menlo Park, California, USA Folsom, California, USA

Christina Delimitrou
[email protected]
MIT
Cambridge, Massachusetts, USA

ABSTRACT methodologies; • Software and its engineering → Software

The lack of representative, publicly-available cloud services has reverse engineering; Software performance.
been a recurring problem in the architecture and systems communi-
ties. While open-source benchmarks exist, they do not capture the KEYWORDS
full complexity of cloud services. Application cloning is a promising cloud computing, architecture, benchmarking and emulation, mi-
way to address this, however, prior work is limited to CPU-/cache- croservices, software reverse engineering
centric, single-node services, operating at user level.
ACM Reference Format:
We present Ditto, an automated framework for cloning end-to-
Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia,
end cloud applications, both monolithic and microservices, which Mahesh Ketkar, and Christina Delimitrou. 2023. Ditto: End-to-End Ap-
captures I/O and network activity, as well as kernel operations, in plication Cloning for Networked Cloud Services. In Proceedings of the
addition to application logic. Ditto takes a hierarchical approach to 28th ACM International Conference on Architectural Support for Program-
application cloning, starting with capturing the dependency graph ming Languages and Operating Systems, Volume 2 (ASPLOS ’23), March
across distributed services, to recreating each tier’s control/data 25–29, 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 15 pages.
flow, and finally generating system calls and assembly that mim- https://doi.org/10.1145/3575693.3575751
ics the individual applications. Ditto does not reveal the logic of
the original application, facilitating publicly sharing clones of pro- 1 INTRODUCTION
duction services with hardware vendors, cloud providers, and the
Cloud computing now hosts a large fraction of the world’s compu-
research community.
tation, ranging from machine learning workloads to latency-critical
We show that across a diverse set of single- and multi-tier appli-
interactive services [23, 24]. Studying these applications is imper-
cations, Ditto accurately captures their CPU and memory charac-
ative to correctly design the systems that populate future cloud
teristics as well as their high-level performance metrics, is portable
infrastructures.
across platforms, and facilitates a wide range of system studies.
There are three approaches to performing studies that re-
quire cloud applications; using real services (production or open-
CCS CONCEPTS source) [43, 48, 53, 91, 98, 110], using simulation or trace replay [20,
• Computer systems organization → Cloud computing; n- 22, 58, 82, 86, 109], and generating synthetic services that resemble
tier architectures; • Computing methodologies → Modeling the original in behavior and characteristics [18, 27, 56, 72, 87, 90].
All three approaches are subject to pitfalls.
∗ Equal contribution. Using real production services is, naturally, the most representa-
tive approach. Unfortunately production services are rarely publicly
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed available, and open-source applications, although useful, often lack
for profit or commercial advantage and that copies bear this notice and the full citation the complexity and update cadence of a real cloud deployment. Stud-
on the first page. Copyrights for components of this work owned by others than the ies that rely on simulation or replaying traces from a production
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission system offer some representativeness, but are tied to the system
and/or a fee. Request permissions from [email protected]. configuration the trace was collected on, and cannot easily general-
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada ize to arbitrary studies. Finally, generating synthetic benchmarks
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9916-6/23/03. . . $15.00 offers a middle ground, with the synthetic application capturing
https://doi.org/10.1145/3575693.3575751 critical features of the original service, but being malleable enough

222
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

to adjust to different studies. Unfortunately, most prior work on significantly from the services running in production clouds [100,
synthetic benchmark cloning is limited to CPU-centric, single-tier, 103, 111].
and user-level applications [72, 87, 90]. One of the earliest efforts towards modern cloud application
Only capturing CPU-centric microarchitectural events is not benchmarking was the Cloudstone benchmark [97], which pro-
enough to reproduce the performance and resource characteristics posed a new interaction-heavy Web 2.0 workload. CloudSuite [48]
of cloud applications, which spend a large fraction of their execution further composes a collection of workloads for the evaluation of
in the networking stack and OS. Moreover, prior work on synthetic scaling-out cloud services. The YCSB suite [37] collects workloads
application cloning mostly considers generating assembly code to for database systems, while SPEC Cloud [9] utilizes a subset of
mimic metrics like IPC, cache miss rate, and dependency distance, workloads representing real-world use cases found on IaaS clouds.
but overlooks critical higher-level performance metrics, such as More recently, uSuite [98] and DeathStarBench [53] focus on bench-
average and tail latency. marking cloud microservices, given the increased popularity of this
We present Ditto, an automated application cloning framework programming model.
for end-to-end cloud services, designed for both monolithic ap- Instead of condensing cloud services to a pre-set group of bench-
plications and microservices. Ditto is the first system to clone an marks, Ditto enables generating arbitrary applications that resemble
application’s behavior across the system stack, including the hard- in features a target service.
ware, I/O, networking layers, and OS. This is critical for cloud
applications which spend a large fraction of their time at kernel 2.2 Simulation and Trace Replay
level and the I/O stack [74]. It additionally also targets multi-tier Simulation and trace replay provide another way to estimate ser-
microservices which span distributed deployments and are gaining vice performance when hardware or software is inaccessible. Many
in popularity. microarchitectural simulators, including gem5 [31], Sniper [34] and
Ditto relies on the following key techniques. First, it captures the ZSim [94], can accurately simulate the CPU performance of a given
dependency graph across distributed services using distributed trac- binary. BigHouse [82] and 𝜇qsim [109] are queueing-based simu-
ing [4, 7, 11, 96]. Then, it recreates the high-level control and data lators which quickly estimate high-level performance metrics of
flow inside each service, and, finally, it generates system calls and monolithic applications and microservices. While useful when hard-
user-space assembly to capture the on-CPU and off-CPU behavior. ware is not available, these simulators still make approximations
Ditto operates transparently to the user, with the cloning process
about the application behavior, and do not capture all complexi-
working in an automated fashion, from obtaining a microservice
ties of a real system. On the other hand, RecPlay [92], iDNA [29],
deployment’s dependency graph to populating each tier with ap- PinPLay [89], Jalangi [95] log the execution and memory traces,
propriate assembly code and I/O operations. It generalizes across and reproduce an application’s behavior for debugging and per-
platforms, deployments, and application configurations, such as formance analysis. Unfortunately, a lot of prior work has showed
load and thread pool size, without retraining, and the synthetic that traces can leak confidential information about production ser-
applications react to changes similarly to the original ones. vices [33, 38, 66], restricting an application owner’s incentive to
Ditto is beneficial to hardware vendors, cloud providers, and publicly share the collected traces. West [22], STM [20], HALO [86]
researchers. Hardware vendors can obtain synthetic versions of and Dangwal’s paper [39], for example, analyze the memory access
production applications to test new platforms, cloud providers can patterns of an original application, and generate synthetic memory
specify performance and/or resource specs to hardware vendors
traces. Although they can constrain the information leakage, they
using the synthetic workloads, and researchers can use represen-
only target the cache and memory subsystems. Compared to trace-
tative end-to-end cloud services without the need for production based techniques, Ditto generates synthetic services that clone the
code access. performance characteristics across the system stack, and can run
We evaluate Ditto across a set of both monolithic applications both on real systems and microarchitectural simulators.
and multi-tier microservices and show that it consistently captures
the low- and high-level performance metrics and resource char-
acteristics of the original service. We also validate that synthetic
2.3 Performance Cloning and Synthetic
applications generated with Ditto react the same way as the original Benchmarks
workloads to changes in the input load, platform, resource alloca- Workload cloning is a way to generate synthetic code that mim-
tion, and deployment configuration, including interference from ics real-world applications. Previous studies profile architecture-
external workloads and power management. Ditto is open-source independent characteristics of real applications, and generate cor-
software. 1 responding proxy benchmarks that capture their CPU performance
[26, 56, 73, 87]. PerfProx, for example, generates miniature proxies
2 RELATED WORK which resemble the low-level CPU metrics of real databases [87].
2.1 CPU and Cloud Benchmarking MicroGrad [90] introduces a gradient-based mechanism to gener-
ate workload clones and stress tests. NanoBench [18] generates mi-
The architecture and system community rely heavily on software crobenchmarks with certain instructions to evaluate undocumented
benchmarking to learn the performance characteristics of target features of x86 CPUs. In [104], the authors hide the functional se-
applications. Prior studies have found that traditional CPU bench- mantics of the proprietary applications through code mutation.
mark suites, such as SPEC [32, 55, 99] and MiBench [61], differ However, these systems are not sufficient for performance
1 https://github.com/Mingyu-Liang/Ditto. cloning in cloud services. First, they only consider performance

223
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

metrics in user space. Cloud applications spend a large fraction 3.1 Application Inputs
of their execution at kernel level [43, 48, 53, 74, 77, 99]. Synthetic The behavior and performance of cloud applications is significantly
benchmarks generated with these tools focus on matching low-level impacted by the service configuration and input load, with the
performance metrics, e.g., instructions per cycle (IPC) or misses latter going through well-documented fluctuations [19, 24, 42, 43,
per kilo instructions (MPKI), which do not always translate to the 81, 91, 99]. The application’s configuration, although changing less
high-level metrics cloud applications care about, like tail latency frequently than load, can substantially alter the execution flow of
and throughput [28, 40]. Second, cloud services are often bottle- an application and impact performance. For instance, configuring a
necked by off-CPU events, such as context switching or network smaller in-memory cache for a database can cause more disk I/O
or disk I/O, which are not captured in previous work. Finally, cloud accesses, significantly increasing latency.
services do not operate like independent processes, having instead
client-server interfaces, which need to be captured by the cloning
framework. This is even more the case for multi-tier microservices, 3.2 Application Codebase and Binary
which can have hundreds of dependent tiers, and are becoming the The application and its linked libraries are intrinsic to its perfor-
norm in many clouds. mance, regardless of the platform it is deployed on. Modifications
in the application code can alter the control and data flow of a
3 CLONING ACROSS THE SYSTEM STACK service, its memory access patterns, and its resource bottlenecks.
Application cloning for cloud services is challenging due to the This is especially true for new cloud programming frameworks,
complexity and heterogeneity of their design, and the various plat- like microservices and serverless, where services are updated on a
forms they can be deployed on. Different services can have entirely daily basis.
different bottlenecks; for example, key-value stores (KVS) require
high CPU performance, high memory and network bandwidth to
3.3 Deployment Environment
retrieve a large amount of data under a strict latency SLO, while
databases are usually bottlenecked by disk I/O bandwidth [36]. 3.3.1 Containers and Virtual Machines (VMs). Cloud services are
Therefore, it is important to consider the performance breakdown often deployed with containers and/or VMs. These add different
across the system stack to accurately clone the performance of levels of performance overheads, primarily due to the extra I/O
end-to-end cloud services. and network layers [47]. Unlike prior work, Ditto faithfully clones
the I/O behaviors of the cloud services, and thus, the synthetic
applications generated by Ditto can be affected by virtualization
Workload Configurations the same way as the original services.
Application Inputs

Application User-level Libraries 3.3.2 OS Kernel. Cloud applications are especially dependent on
Application Binary OS performance, given that they spent a large fraction of their exe-
System Calls
Deployment Environment cution at kernel level for interrupt handling, I/O requests, memory
Container Engine management, task scheduling, etc. [25, 53, 74, 76]. Prior work on
application cloning has mostly focused on user-level application
(Guest) Kernel logic; for cloud services overlooking kernel operations leads to
Virtual Task File Network … very different performance characteristics compared to the original
Memory Scheduler Systems Stack
application.
Device Drivers

3.3.3 CPU-Memory Subsystem. The CPU-memory subsystem is a

Hypervisor & Host Kernel
dominant factor in cloud application performance, even for services
that spend significant time processing network requests [53, 74, 76].
CPU Memory GPU Disk Network We follow the top-down analysis methodology in [107] to identify
the key CPU performance metrics that impact the overall IPC and
Colocated Applications reproduce them in the synthetic applications, as shown in Figure 2.
Section 4.4 discusses how Ditto accounts for each of these factors
Figure 1: General system stack for cloud applications [59]. during application generation.
Dashed boxes are optional layers for virtualization.
3.3.4 Hardware Devices. Services interact with hardware devices,
including disks, and NICs through system calls. In cloud services
Figure 1 demonstrates an abstract view of a generic system stack specifically, peripherals can dominate performance, especially when
for a single cloud server [59]. The performance of an application they experience long queueing delays. We mainly consider the
is determined by factors that range from the application code and impact of storage and network devices in our study, as many cloud
inputs, to the environment it is running on, including containeriza- services involve I/O and network operations. Ditto can be extended
tion technology, the hypervisor, server platforms, and any colocated to clone the behavior of other devices, such as GPUs and hardware
applications. We briefly describe why these factors matter below. accelerators, which we defer to future work.

224
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

Bad different subsystems. Finally, Ditto uses the deviation in perfor-

Frontend Bound Retiring Backend Bound
Speculation
mance metrics between original and synthetic application to fine

misprediction

Long-latency
IX uops decode
tune the generator. The eventual synthetic service can serve as a

Useful uops

Mem bound
L1d bound
iTLB miss

utilization
L2 bound
L3 bound
resteers
L1i miss

Branch

Branch
performance and resource proxy for the original service.

uops
Port
Ditto profiles applications in isolation to capture their character-
istics alone; in Section 6.5 we show that in the presence of interfer-
ence, synthetic applications behave the same way as their original
IM
IM

BB, IM, IX

DM, DD, IX
DM, DD, IX, IM
DM, DD, IX, IM
DM, DD, IX, IM

IX
BB, DM, DD,
counterparts.

IX, IM
Ditto adheres to the following design principles:
• End-to-end system stack modeling: Cloud services often con-
tain a large fraction of kernel-space operations for network and
disk I/O. Ditto captures the inputs, RPC dependency graph, ap-
Figure 2: Top-down analysis of the CPU-memory subsystem plication binary, OS kernel, CPU, memory, disk, networks, and
performance [107]. Letters at the bottom show the corre- resource interference.
sponding analysis in Ditto. IX: Instruction Mix. BB: Branch • Portability: Ditto uses platform-independent features to ensure
Behavior. IM: Instruction Memory Access Pattern. DM: Data that generated services are portable across platforms without
Memory Access Pattern. DD: Data Dependency. reprofiling. Synthetic applications also faithfully adjust to load
and configuration changes, such as queries per second (QPS),
and scaling, because of the fine-grained network and thread
3.4 Multi-Tenancy modeling.
Multi-tenancy improves datacenter utilization by deploying mul- • Abstraction: Ditto does not disclose the implementation of
tiple services on the same node. Applications share resources, in- the original application, only exposing the skeleton and post-
cluding CPU cores, LLC, and memory, disk I/O, and network band- processed performance characteristics to the synthetic bench-
width [36, 79]. Resource contention can degrade performance, and mark user. It replaces the skeleton of an application with a
should be accounted for in the application cloning process. template, refills the body with artificial instructions and their
operands, and abstracts the memory access patterns away to
4 END-TO-END CLONING FOR CLOUD avoid side-channel attacks. Application-specific characteristics,
SERVICES including user-space function calls, memory accesses, and appli-
cation inputs, are also concealed. Thus, the synthetic workload
4.1 Overview can be publicly shared, without a user reverse engineering the
Ditto is an application cloning framework for cloud services; it implementation of the original service.
applies to both single-tier applications and multi-tier microservices. • Automation: Ditto automates the profiling and generation pro-
It generates services that faithfully reproduce the performance, cess. It entirely relies on static and dynamic profiling of the
resource profile, and thread-level control/data flow of the original original application to generate a benchmark. Users are not re-
workload, decoupling representative system studies from access to quired to have expertise in the implementation of a service to
the source code or the binary of production cloud services. use the framework.
Ditto profiles an application at runtime and extracts key per-
formance and resource metrics using dynamic instrumentation 4.2 Microservice Topology
and runtime emulators (SystemTap [45], Valgrind [85], eBPF [44], A topology of microservices is a directed acyclic graph (DAG),
Perf [12], VTune [67], and Intel SDE [68]). Then, it generates a where the nodes are microservices and the edges indicate the
synthetic service which preserves the performance of the original, dataflow between dependent tiers [52–54, 80]. Ditto leverages the
using an entirely distinct code sequence, to avoid revealing the distributed trace frameworks present in most production deploy-
implementation of the original service. ments to collect traces of end-to-end requests. The performance
Figure 3 shows an overview of Ditto’s profiling and generation overhead is negligible if the traces are sampled properly [4, 11, 96].
process. If the target service consists of a set of microservices, Ditto It then automatically extracts the dependency graph between mi-
first learns their Remote Procedure Call (RPC) dependency graph, croservices and uses it as input to the skeleton generator.
using distributed tracing [4, 7, 11, 96]. This graph is then used to
generate the API interfaces between the different synthetic mi- 4.3 Application Skeleton
croservices. Next, Ditto analyzes the thread and networking model,
We define the application skeleton as the network and thread mod-
e.g., single- or multi-threaded, and synchronous or asynchronous
els of an application, which determine how it handles remote service
respectively using kernel-level profiling, and builds the skeleton
communication, and how tasks are assigned to different threads,
of each service. The application skeleton contains empty handlers
respectively. The application skeleton is a critical design choice for
which are filled with appropriate functionality in the next step. The
cloud services facing tight latency constraints [46, 88, 101, 106], as
handlers can either be triggered upon receiving requests for worker
it directly impacts their performance and scalability.
threads, or by a timer for background threads.
To generate the synthetic application body, Ditto instruments 4.3.1 Network Model. The network model describes how an appli-
the application binary using kernel- and user-space profilers for cation communicates with other services, acting as a client, server,

225
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

Microservices
Microservices
Microservices
A
1.0 1.0

B C
0.5 0.5 0.7 0.3 Microservice Topology 1 // Main function of worker A
2 void worker_a_main(req_id) {
D E F Single-tier 3 // Syscalls
4 int fd = open(file, O_RDONLY);
Application
1 // Main thread 5 int size = read(fd, buffer, BUFFER_SIZE);
2 void main_loop() { 6 close(fd);
3 while (!stop) { 7
Application Skeleton 8 // Assembly blocks
4 epoll_wait(listen_fd, events, MAX_EVENTS, -1);
5 int socket_fd = accept(listen_fd, addr, len); Thread Model Network Model 9 __asm__ __volatile__ (
6 init_worker_thread_a(socket_fd); 10 ...
7 } 11 );
12
8 } Application Body 13 // Block for data_size = i & inst_size = j
9
10 // Worker thread type A Instruction Mix 14 __asm__ __volatile__ (
11 void worker_a_loop() { 15 "xor r9, r9\n"
12 while (!conn_closed) { Branch Behavior 16 ".BLOCK_I_J:\n" // Inner loop
13 epoll_wait(socket_fd, events, MAX_EVENTS, -1); 17 "add <X_REG>, <R_REG>\n"
14 read(socket_fd, buffer, BUFFER_SIZE); Data Memory 18 "sub <R_REG>, DWORD PTR [r10 + <OFFSET>]\n"
15 // Handler to be generated in next step Access Pattern 19 "mul QWORD PTR[r10 + <OFFSET>]\n"
System Call
16 worker_a_main(req_id); 20 "mov r11, QWORD PTR [r11]\n" // Ptr chasing
Instruction Memory
17 dispatch_to_worker_b(req_id); 21 "test r8d, <BIT_MASK>\n"
Access Pattern
18 wait_worker_b(); 22 "jz .COND_BR_FOO\n"
19 sendmsg(socket_fd, buffer, BUFFER_SIZE); Data Dependency 23 ...
20 } 24 "cmp r9, <LOOP_COUNT>\n"
C/C++ level Assembly level 25 "jl .BLOCK_I_J\n"
21 }
22 26 );
23 // Worker thread type B 27
24 void worker_b_loop() { Fine Tuning 28 __asm__ __volatile__ (
25 ... 29 ...
26 } 30 );
Synthetic Synthetic 31 }
Synthetic
Microservices
Microservices Application

Figure 3: Overview of Ditto’s synthetic benchmark generation process.

or both. When acting as a client, a service can use synchronous by various threads. Ditto uses SystemTap to profile the functional-
or asynchronous communication. In synchronous models, threads ity, lifecycle, and trigger points of threads by experimenting with
block on network I/O (e.g., send(), write()) to await responses. different connections, QPS, and execution times. First, it combines
Asynchronous models are typically event-based with responses network and user-space call stack analysis to cluster threads with
handled by specific threads via callback functions. They are more similar functionalities. We build a call graph for each thread, use
complicated, as they involve additional synchronization and state tree-edit distance [30] to measure the similarity between threads,
machine transitions. In return, they avoid long queueing delays by and cluster threads with similar call graphs using agglomerative
allowing threads to process new requests and offer better perfor- clustering [83], since the number of clusters is unknown in ad-
mance [101]. vance. Second, we categorize each thread cluster into short- and
On the server side, there are three common options for the net- long-lived threads by probing clone() and context switches. Short-
work model: blocking, non-blocking, and I/O multiplexing [102]. In lived threads are usually spawned and terminated frequently, while
all three models, threads await requests through system calls (e.g., long-lived threads are spawned at initialization, waiting for tasks
recv(), read(), epoll()). In contrast to the other two models, the to arrive. Finally, thread functions can be triggered by both kernel-
non-blocking model needs to periodically call the I/O interfaces and user-space events, including reads and writes to sockets, timers,
to look for new requests, which can waste CPU time at low loads. signals, user-space locks, and condition variables. We monitor event
In both blocking and I/O multiplexing models, threads block on notification functions in kernel space and common user-level li-
system calls, although I/O multiplexing allows monitoring mul- braries, such as libpthread and libc++, and analyze the relation-
tiple sockets via a single system call (e.g., select() or epoll()). ship between them and thread spawning or wakeup to identify
I/O multiplexing is the most commonly-used in services like Mem- trigger points.
cached, Redis, and NGINX, since they support many concurrent
connections, and I/O multiplexing reduces the required threads.
4.4 Application Body
Ditto uses SystemTap [45] to profile the network model by prob-
ing kernel-space functions and data structures. It acquires key at- The application body corresponds to the workload-specific work,
tributes of sockets, and monitors network-related system calls, consisting of kernel-space functions, via system calls and user-
gathering the distribution of their types, arguments, and call fre- level functions. While assembly-level profiling for kernel-space
quency. Ditto then chooses one out of several network models that functions is unnecessary, since they can be cloned by imitating the
combine the different design choices described above, with socket system calls themselves, it is critical to clone user-space functions
options and network message parameters set based on profiling. at assembly level to capture the low-level usage of CPU resources.
Application performance is also significantly impacted by fac-
4.3.2 Thread Model. Cloud services rely on multithreading for tors like instruction mix, and memory (data and instruction) ac-
asynchronous networking, disk I/O, and parallel processing [111]. cess patterns, branch behavior, and data dependencies. Ditto uses
The thread model describes how tasks are scheduled to and handled these platform-independent features to ensure that the generated

226
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

synthetic applications can be ported to other platforms without Branches with extremely high taken or not-taken ratios, even if
reprofiling. their patterns are completely random, have fewer mispredictions,
since the majority of executions are in one direction. Similarly,
4.4.1 System Calls. Applications use system calls to perform priv- branches with low transition rates are easier to predict. We also
ileged operations in the OS kernel. Besides network handling and find that instruction locality and the number of static branch in-
spawning new threads, cloud applications can make system calls structions significantly contribute to the branch prediction accuracy,
to access file descriptors, allocate memory space, or synchronize especially for applications with large binaries.
on shared memory. Capturing the system call characteristics is Based on these observations, Ditto profiles the distribution of
critical to clone the kernel-level CPU and un-core metrics. Prior branch taken/not-taken rates and transition rates across all con-
performance cloning studies either do not profile system call char- ditional branch instructions, and together with the instruction
acteristics [72, 90], or only profile the total number of kernel-level memory access pattern analysis it accurately clones the branch
instructions [87]. To accurately capture kernel-level characteris- misprediction behavior of the target application. We quantize the
tics, Ditto profiles the distribution of system calls, including their taken/not-taken rates and transition rates in log scale, from 2 −1
counts and arguments with SystemTap. For example, MongoDB to 2 −10 . During the generation phase, Ditto samples a taken/not-
calls pread() to read a database file from disk. During system call taken rate and transition rate from the profiled distribution for each
profiling, Ditto captures the flags of fd and the distribution of conditional branch instruction.
count and offset, to accurately clone key metrics, such as disk Lines 21-22 in the right code snippet in Figure 3 show how Ditto
latency, utilization, and page cache miss rates. generates conditional branch instructions with profiled taken/not-
4.4.2 Instruction Mix. The instruction mix in Ditto captures the taken rates and transition rates. <BIT_MASK> is a binary mask pre-
distribution of x86 assembly instructions at runtime in the original computed during the generation phase, which contains 𝑀 ones in
service, and reproduces it faithfully in the synthetic benchmark. the highest bits and 𝑁 zeros in the lowest bits. 2 −𝑀 is the taken/not-
Previous studies categorize x86 assembly instructions into integer taken rate, and 2 −𝑁 is the transition rate. The ZF flag, which deter-
arithmetic, integer multiplication, integer division, floating-point mines the branch direction of jz or jnz, will change periodically
operations, SIMD operations, loads, stores, and control instruc- according to the bitmask in the test instruction.
tions [72, 87, 90]. They then generate the synthetic benchmark
using a representative instruction from each category. 4.4.4 Data Memory Access Pattern. The memory access pattern is a
However, this categorization is too coarse-grained and does not dominant characteristic of an application, as it impacts the backend
capture the characteristics of modern CPU microarchitectures. The of the CPU and memory subsystem. Since operands in arithmetic
x86 ISA, for instance, contains assembly instructions with different instructions in synthetic benchmarks are randomly generated, they
uops, port usages, and execution cycles. For example, the CRC32 cannot calculate meaningful memory addresses at runtime. Thus,
(r64, r64) instruction, which implements the checksum function, memory addresses or offsets need to be pre-calculated in the gener-
takes three cycles and can only be executed via port 1 on Skylake ation phase and hard-coded in the synthetic application binaries.
CPUs, while other integer arithmetic instructions usually take one Previous studies [20, 22, 86] capture memory access patterns us-
cycle on any of the ports 0, 1, 5, and 6 [17, 60]. Instructions with ing the stack distance, reuse distance, and stride pattern profiles.
REP/REPZ/REPNZ (repeat string operations) or LOCK prefixes can However, they need 10 to 20 million memory traces to accurately
take tens of cycles or more, depending on the repeat count, or the represent target memory access patterns because of the sparsity of
cache/RAM configuration [51]. the memory address space, and the multimodality in memory ac-
Ditto uses Intel SDE [68] to collect the dynamic count of each cesses [63]. Preserving the original access patterns requires millions
x86 instruction using Intel x86 Encoder Decoder (XED) Iforms [35]. of hard-coded memory instructions, which significantly interferes
It then clusters x86 assembly instructions by functionality (data with other performance characteristics. Moreover, directly replicat-
movement, arithmetic/logic, control-flow, lock-prefixed, and re- ing the target memory access pattern introduces security concerns,
peat string operations), operands (general-purpose registers, x87 since previous studies showed that memory access patterns reveal
floating-point registers, XMM registers, and memory), and ALU confidential information about the service [57, 66, 71].
usage [17] using hierarchical clustering, so that each cluster has Instead, Ditto uses profiling of the memory working set to syn-
similar hardware resource requirements. Ditto also profiles the av- thesize appropriate data memory access patterns without incurring
erage number of dynamic instructions per request, and the repeat high instruction misses or leaking application context. We construct
counts of each REP-prefixed instruction. During the generation a sequence of memory accesses for working sets with different sizes,
phase, Ditto randomly samples the next instruction from the in- from 64 bytes (one cache line) to the maximum memory size allo-
struction mix distribution. Registers and memory addresses are cated to the target application, increasing by a factor of two. Each
assigned after data memory access profiling (Section 4.4.4). memory access only reads or writes the first data in a cache line to
ensure that a new cache line is loaded, assuming the most common
4.4.3 Branch Behavior. Branch prediction accuracy, which is de- write-allocate policy. We use Valgrind [85] to compute the distribu-
termined by both the branch behavior of the application and the tion of memory accesses with different working set sizes, which can
branch predictors, is critical in modern out-of-order CPUs [60, 78]. be efficiently simulated as “cache hits” for different “cache sizes”.
Prior studies observe that branch taken ratios and transition rates Each “cache size” only needs to be simulated once during profiling.
(frequency a branch switches between taken and not-taken) impact We calculate the number of memory accesses in a working set of 2𝑖
the branch prediction accuracy and misprediction penalty [50, 64]. bytes as follows:

227
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

( memory access patterns from the runtime memory trace and use
𝑖 𝐻𝑑 (2𝑖 ) if 2𝑖 = 64 bytes this ratio to control the number of regular memory access sequences
𝐴𝑑 (2 ) = , (1)
𝐻𝑑 (2𝑖 ) − 𝐻𝑑 (2𝑖 −1 ) otherwise in the synthetic applications.
4.4.5 Instruction Memory Access Pattern. Instruction memory ac-
where 𝐴𝑑 (2𝑖 ) is the number of memory accesses for a working cess patterns significantly impact CPU frontend and backend per-
set of 2𝑖 bytes in generated code, and 𝐻𝑑 (2𝑖 ) is the number of cache formance, as they determine the L1i, L2, L3 cache misses and branch
hits in a 2𝑖 -byte cache in the original application. The synthetic mispredictions. Replicating the original application’s instruction
working set-based memory access pattern is illustrated in Figure 4, memory access pattern is not possible with a synthetic benchmark
with the number of memory accesses for each working set equal because the execution flow is usually controlled by the computa-
that of the profiled distribution. Since the memory accesses are tion’s output at runtime.
limited to the working set size, it is guaranteed that 𝐴𝑑 (2𝑖 ) accesses Therefore, Ditto synthesizes instruction memory access patterns
will contribute to 𝐴𝑑 (2𝑖 ) hits when cache size ≥ 2𝑖 bytes. Assuming with a similar approach to that of Section 4.4.4. We profile the i-
a least-recently-used (LRU) cache replacement policy or its pseudo- cache hits of the original application with different i-cache sizes
LRU variant, commonly used in recent Intel processors [18, 105], using Valgrind. Then, we calculate the distribution of dynamic
since we iterate through cache lines in a working set sequentially, executions in an instruction memory working set of 2 𝑗 bytes as
there must be previous memory accesses which evict this cache line follows, assuming the cache line size is 64 bytes, and the average
when cache size < 2𝑖 bytes. Therefore, every memory access of a instruction size is 4 bytes:
2𝑖 -byte working set ends up with a miss when cache size < 2𝑖 bytes. (
16 ∗ 𝐻𝑖 (2 𝑗 ) − 𝐻𝑖 (2 𝑗 −1 ) if 2 𝑗 > 64 bytes

The statement is true for any memory hierarchy and cache inclusion 𝑗
𝐸𝑖 (2 ) = , (2)
𝐻𝑖 (2𝑁 ) − 2𝑗=9 𝐸𝑖 (2 𝑗 )
policy because of the sequential access pattern within each working Í𝑁
if 2 𝑗 = 64 bytes
set. Therefore, even if applications are profiled with a single-level
where 𝐸𝑖 (2 𝑗 ) is the number of instruction executions with a
cache, the results can be applied to any number of cache levels and
working set of 2 𝑗 bytes in the synthetic code, 2𝑁 is the max in-
inclusion policies. Applications are profiled with an 8-way cache for
struction working set size, 𝐻𝑖 (2 𝑗 ) is the number of i-cache hits
working sets < 1MB and a 16-way cache for working sets ≥ 1MB,
on a 2 𝑗 -byte i-cache in the original application, and the number
which are close to the typical values of modern CPUs. There is an
of instructions in a cache line is 16 (64B cacheline / 4B inst size).
average 1.9% error in the cache miss rate when cache associativity
After profiling the distribution of i-cache accesses with different in-
changes across all examined applications. We allocate an array
struction working sets, Ditto generates static assembly instruction
for memory accesses in the heap when the synthetic application is
blocks, shown in lines 14-26 in the right code snippet of Fig. 3.
initialized, and store the base address in a register (for example, r10).
The number of instructions per block matches the instruction
Ditto generates the address offsets for each memory instruction,
working set size, and the loop iteration number is determined by
which can access [r10 + <OFFSET>] at runtime.
the distribution.
0x0 0x40 0x80 0x100 0x200 0x400 4.4.6 Data Dependencies. Data dependencies are another inher-
… ent characteristic of an application that impact performance. Data
64B A(64) accesses dependencies can flow through registers or memory locations, limit-
128B A(128) accesses ing the number of simultaneous instructions issued to an execution
256B A(256) accesses
unit (instruction-level parallelism, or ILP), and the number of out-
standing memory requests (memory-level parallelism, or MLP) [65].
512B A(512) accesses
Ditto uses the distribution of data dependency distances to quan-
1024B
tify data flows through registers. We measure the read after write
… … A(1024) accesses
(RAW), write after read (WAR), and write after write (WAW) data
dependency distance from the dynamic control flow graph (DCFG)
Figure 4: Working-set-based data memory access generation. generated using Intel SDE [69]. The dependency distance is quan-
Except for the 64-byte working set, the memory accesses of tized into 11 bins, increasing exponentially from 1 to 1024, since
2𝑖 -byte working set start at address 2𝑖 −1 and loop iteratively a larger dependency distance does not impact the ILP, due to the
within the working set. limited size of the reorder buffer. When generating the synthetic
Coherence misses also contribute to cache miss rates in multi- code, we reserve several registers for recording the loop counters
threaded applications. Coherence misses happen when cache lines and data memory addresses, and use the rest of general-purpose
containing shared data are invalidated by another core. To accu- and SIMD registers to clone the data dependency characteristics. To
rately clone cache behavior with multi-threading, we use Intel SDE assign registers for each instruction, Ditto samples a (RAW, WAR,
to profile the ratio between private data accesses and shared data WAW) distance tuple from the profiled distributions, and chooses
accesses, and generate memory accesses accordingly. an available register with the closest distance values. Data depen-
Modern CPUs implement hardware prefetching mechanisms dencies through registers can also impact MLP if the register values
to improve cache performance. Hardware prefetchers detect load determine memory locations. Such behavior cannot be captured
instructions with regular strides, sequences of consecutive cache since the synthetic application never writes to a reserved register
line accesses and adjacent cache line accesses, to load data into with the memory base address. To address this, we replace a frac-
caches before they are needed [84]. To clone the performance impact tion of memory reads with pointer chasing reads (mov r11, QWORD
of cache prefetching, we calculate the ratio of regular to irregular PTR [r11]); determined by the MLP measured with Perf.

228
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

Data dependencies through memory locations are much more Ditto is implemented primarily in Python and C in about 16,000
difficult to profile with DCFG since memory addresses are often lines of code. It supports C/C++ applications, the Apache Thrift [1]
calculated at runtime. However, they are partially determined by and gRPC [3] RPC frameworks, and x86 ISAs, which are commonly
data access patterns that Ditto already profiles (Section 4.4.4). A used in cloud environments. It can be extended to more languages,
program with a shorter memory dependency distance can be mod- frameworks, and ISAs, by leveraging compatible profiling tools.
eled with a smaller working set because the probability that the Ditto can generate applications that run on a single machine or con-
cache line is evicted by other instructions in between is lower. tainerized microservices that run on multiple nodes, using Docker
Swarm or Kubernetes. The runtime profilers and emulators, includ-
4.5 Fine Tuning ing SystemTap, Intel SDE, and Valgrind, can introduce overheads to
Finally, Ditto implements fine tuning to calibrate the output of pre- the original application during profiling. This overhead only occurs
vious steps, due to inaccuracies introduced by the instrumentation once, and does not affect the accuracy of the platform-independent
tools. For example, application body profiling does not consider features collected during profiling.
the interaction between user-space and kernel-space functions and To generate a clone, cloud providers only need to specify a rep-
the correlation between the application skeleton and body; thus resentative input for their service. Ditto automatically instruments
the actual d-cache and i-cache miss rates are often higher than the the application at runtime, collecting profiling statistics and feeding
profiled results. Ditto iteratively runs the synthetic application on a them to the code generator, followed by the fine-tuning process.
specific platform, computes the errors between target and synthetic Ditto does not require reprofiling if the input change does not affect
service, adjusts the inputs to the generator accordingly, and regen- the application body, such as changes in QPS or number of connec-
erates the synthetic application. Although there are many knobs tions. Inevitably, if a new input exercises an entirely new code path
to tune, most of them are orthogonal with each other. We have or memory access pattern, this will need to be profiled to create a
characterized the correlation across knobs, and derived the small new clone. The synthesized binaries can run directly on hardware,
groups of parameters that need to be jointly tuned (e.g., branch execution-driven simulators like gem5 [31] and ZSim [94], or their
taken/transition rate and i-cache pattern because they all influence traces can be fed to trace-driven simulators like Ramulator [75].
branch prediction). Since relationships between knobs and perfor-
mance are mostly linear, we use a feedback-based heuristic to tune 6 EVALUATION
knobs within a group. Fine tuning uses performance counters for 6.1 Methodology
calibration. It usually takes within ten iterations to reach over 95%
accuracy, incurring low overhead since each iteration only takes a 6.1.1 Platforms. We validate Ditto on a heterogeneous cluster,
couple tens of seconds. Since Ditto captures performance charac- with three types of servers, whose specs are in Table 1. All servers
teristics well with platform-independent data, this fine tuning does run x86 ISA, but differ in the CPU and memory architectures, and
not compromise the generality of the synthetic service, as shown their storage and network.
in Section 6.2.2. Table 1: Server platform specifications.

5 IMPLEMENTATION Platform A Platform B Platform C

Ditto implements several analyzers and code generators to capture CPU model Gold 6152 E5-2660 v3 E3-1240 v5
Base Frequency 2.10GHz 2.60GHz 3.50GHz
the microservice topology, application skeleton, and application
CPU cores 22 10 4
body. If the target service is a graph of microservices, the microser-
CPU family Skylake Haswell Skylake
vice topology analyzer leverages distributed tracing systems, like
Sockets 2 2 1
Jaeger [4], to obtain RPC call graphs and call statistics. For both L1i/L1d 32KB/32KB 32KB/32KB 32KB/32KB
microservices and single-tier applications, the application skeleton L2 1MB 256KB 256KB
analyzer then deploys SystemTap to profile network- and thread- LLC 30.25MB 25MB 8MB
related functions and data structures in kernel space, and identify RAM 192GB@2666 128GB@2400 32GB@2133
the network and thread models used. Disk 1TB SSD 2TB HDD 1TB HDD
The skeleton generator creates a synthetic application skeleton Network 10Gbe 1Gbe 1Gbe
using either a TCP- or RPC-based network interface, leaving the
body of each thread’s handler to the application body generator. 6.1.2 Applications and Workload Generators.
The latter runs SystemTap to profile system calls, and uses Intel • Memcached: Memcached [49] is a distributed low-latency, key-
SDE and Valgrind to capture the platform-independent features value store for in-memory caching. We build Memcached 1.6.9
of binaries, such as instruction mix and working set size distribu- from source, deployed with four worker threads, and load it with
tion, etc. The generator creates handlers according to these features 10K items, each with a 30B key and a 4KB value. It is driven by
using POSIX APIs in libc and inline assembly in C code. The assem- an open-loop version of the mutated workload generator [13].
bly code contains tens to hundreds of instruction blocks looping • NGINX: NGINX [6] is a high-performance web server and is the
iteratively with different instruction and working set sizes. Finally most commonly-deployed technology in Docker [15]. We build
the fine tuner runs the synthetic application, collects performance NGINX 1.20.0 from source and configure it with one worker pro-
data from Perf, eBPF and VTune on the deviation between origi- cess. For NGINX, we use tcpkali [14] to generate HTTP requests.
nal and synthetic workloads, and calibrates the input data for the • MongoDB: MongoDB [5] is an open-sourced cross-platform
application body generator accordingly. NoSQL database. We use MongoDB 4.4.4 and set up a dataset

229
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

Memcached NGINX MongoDB Redis TextService SocialGraphService

L1i L1i L1i L1i L1i L1i
Branch Branch L1d Branch Branch Branch Branch
Low Load

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

01234 012345

0
0.3
0.6
0.9
1.2

0
0.4
0.8
1.2
1.6
L2 L2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC
L1i L1i L1i L1i L1i L1i
Branch Branch L1d Branch Branch Branch Branch
Medium Load

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0
0.3
0.6
0.9
1.2

0
0.3
0.6
0.9
1.2
L2 L2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC
L1i L1i L1i L1i L1i L1i
Branch Branch L1d Branch Branch Branch Branch
High Load

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0
0.3
0.6
0.9
1.2

0
0.5
1
1.5
2
2.5

0
0.5
1
1.5
2

0
0.3
0.6
0.9
1.2

0
0.4
0.8
1.2
1.6

0
0.4
0.8
1.2
1.6
L2 L2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC

1.5 2 2 0.4 15 1.5

Latency (ms)

1.5 1.5 0.3

1 10 1
1 1 0.2
0.5 5 0.5
0.5 0.5 0.1

0 0 0 0 0 0
0 500 1000 0 20k 40k
0

0
10
20
30
40

15
0k

k
k
k
k

00
QPS QPS QPS QPS QPS QPS
Actual Synthetic Actual avg Synthetic avg Actual p95 Synthetic p95 Actual p99 Synthetic p99

Figure 5: CPU performance metrics (IPC, branch mispredictions, L1i, L1d, L2 and LLC miss rates), network bandwidth, disk
bandwidth (MongoDB only) and service latency under varying load across six services. CPU metrics are normalized to each
original application’s metrics under medium load. Network and disk bandwidth are, by exception, normalized to each original
application’s bandwidth under current load, because their magnitudes change significantly, and would obscure the figure’s
shape.

of 40GB with one million records. To load MongoDB, we use 40

Actual p50
YCSB [37] with all read operations, following a uniform distribu- 35
Synth. p50
30
Latency (ms)

tion. Actual p95

• Redis: Redis [8] is a fast, single-threaded, in-memory data store 25 Synth. p95
used as a database, cache, and message broker. We build Redis Actual p99
20
Synth. p99
6.2.6 from source, disable its persistent storage, and load a dataset 15
with 100K records. We use YCSB as the load generator. 10
• Social Network: Social Network is a microservice topology 5
from DeathStarBench [53], consisting of 20+ individual services.
0
We compose its social graph with the socfb-Reed98 Facebook 200 500 1000 1500 2000
dataset [93], which contains 962 users and 18.8K follow rela- QPS
tionships. We also modify the wrk2 [2] workload generator to Figure 6: End-to-end latency for the Social Network.
open-loop and use it as the client. The Social Network is deployed
with one replica per microservice, both locally and on a cluster
using Docker containers. 6.2 Validation
For all synthetic applications, we use the same load generator as 6.2.1 Validation on Varying Loads. Figure 5 shows CPU, network
the original application, sending dummy requests with the same and disk performance metrics, and latency for six applications un-
traffic distribution. The number of threads of MongoDB and Social der different QPS in platform A. In addition to the four single-tier
Network microservices changes dynamically with the number of applications, we also show resource characteristics for TextService
concurrent connections, up to a few tens under our load settings. and SocialGraphService, two of the Social Network’s tiers, which

230
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

Memcached NGINX MongoDB Redis TextService SocialGraphService

L1i L1i L1i L1i L1i L1i
Branch Branch L1d Branch Branch Branch Branch
Platform A

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0
0.3
0.6
0.9
1.2

0
0.3
0.6
0.9
1.2
L2 L2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC

L1i L1i L1i L1i L1i L1i

Branch Branch L1d Branch Branch Branch Branch
Platform B

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0 1 2 3 0 3 6 912 0 2 4 6 0 2 4 6 0 2 4 6

0
0.3
0.6
0.9
1.2
L2 L2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC

L1i L1i L1i L1i L1i L1i

Branch Branch L1d Branch Branch Branch Branch
Platform C

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0 2 4 6 0 3 6 912 02468 01234 01234
0
0.3
0.6
0.9
L2 L2 1.2 L2 L2 L2
Net BW Net BW LLC Disk BW Net BW Net BW Net BW
LLC LLC Net BW LLC LLC LLC

0.3 0.6 20 0.15 6 0.8

Latency (ms)

15 0.6
0.2 0.4 0.1 4
10 0.4
0.1 0.2 0.05 2
5 0.2

0 0 0 0 0 0
A B C A B C A B C A B C A B C A B C

QPS QPS QPS QPS QPS QPS

Actual Synthetic Actual avg Synthetic avg Actual p95 Synthetic p95 Actual p99 Synthetic p99

Figure 7: CPU metrics (IPC, branch misprediction, L1i, L1d, L2 and LLC misses), network BW, disk BW (MongoDB only) and
latencies across platforms. CPU metrics are normalized to each original service on Platform A.

are representative of the other tiers of the service. TextService man- misses, while SocialGraphService has high IPC due to fewer LLC
ages the text users add to composed posts, and SocialGraphService misses. At high load, Memcached and Redis have similar metrics to
manages follow relationships between users. We do not show each medium load, however the other four applications exhibit different
tier due to space constraints, but have validated that the results degrees of L2, LLC misses variation. The results illustrate that
are similar for them. All applications are generated using profiling applications can have very different characteristics under different
data under medium load; Ditto has not profiled any other load. We loads, which are accurately captured by Ditto in their synthetic
increase the load until the single-tier application or bottleneck tier counterparts. The network and disk bandwidth also conform to the
in the microservice topology saturates in one or more resources original by faithfully reproducing the system calls. We only show
(e.g., disk I/O for MongoDB and CPU for the other applications). disk bandwidth for MongoDB since other services do not involve
Since we use a close-loop workload generator for MongoDB and disk I/O. The bottom line plot shows the average, 95th, and 99th
Redis, which only allows one outstanding request per connection, percentile latencies, which also match the originals, with the p99
the latency does not increase significantly at high load. While the diverging at high load, due to the queueing behavior in the network
end-to-end latency of Social Network increases at high load, the la- stack at saturation.
tency of TextService and SocialGraphService only increases slightly, Fig. 6 shows the end-to-end latency of original and synthetic
since they are not bottleneck tiers. Social Network when every individual microservice is replaced
The upper three rows show IPC, branch misprediction, L1i, L1d, with a synthetic one.
L2, LLC miss rates, and network and disk I/O bandwidth under low,
medium, and high load, with average errors across all applications 6.2.2 Validation on Varying Platforms. We also validate the CPU,
being 4.1%, 9.9%, 7.1%, 5.1%, 6.9%, 12.1%, 0.1%, 0.1%, respectively. network and disk metrics and service latency as we vary the hosted
This indicates that Ditto accurately clones the overall hardware platforms. Each application is profiled only on Platform A, and
performance metrics. Memcached and NGINX have low IPC under validated on Platforms A, B and C. Figure 7 shows that the syn-
low load because of high branch misprediction, and L1i and L2 thetic benchmarks react to platform changes in a similar way to

231
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

1.2
Retiring Front-end Bad speculation Back-end
A S 0.8

IPC
2.0 A
S 0.4
1.5 A S target IPC
A S 0.0
CPI

A S A S 2.5 1e9
1.0

Instructions
2.0
0.5
1.5
1.0
0.0
0.5 target instructions
Mem Nginx MongoDB Redis Text SocialGraph 0.0
cached Service Service 5.5 1e9
4.5

Cycles
3.5
Figure 8: Cycles breakdown. (A: actual, S: synthetic) 2.5
1.5 target cycles
the original applications. More specifically, all six applications have 0.5
different degrees of L2 cache miss increases on Platforms B and C, 1.4

p99 (ms)
due to their smaller L2 cache sizes. Applications running on Plat- 1.2
form B, which is an older CPU generation, have consistently lower 1.0
IPC. When running all microservices of the Social Network on the 0.8 target latency
small-scale Platform C server, the high degree of interference re- 0.6

I:T
sults in high LLC miss rates for TextService and SocialGraphService,

:D
:In

:D
I-
Br
S

un
m
ke

at
in

-m
st

an
sc

e
em
both original and synthetic. Network and disk I/O bandwidths are

st
le

a
.m

em
ch
al
to

de
l

ix
n
identical across platforms, since the amount of data transferred is

p.
independent of the platform. The line plots at the bottom show the Figure 9: Evolution of IPC, instructions, cycles, and p99 la-
latency on the three platforms, where the synthetic always matches tency for MongoDB as we add sophistication to Ditto.
the original. All applications experience the highest latency on
Platform B because it has the lowest IPC. The latency of MongoDB
is significantly lower on Platform A because it benefits from the decrease from 1.11 to 1.02 due to memory instructions incurring
low random access latency of SSDs. In general, the fact that the syn- additional cycles in the backend. From D to E, we clone the branch
thetic applications react to platform changes the same way as the behaviors following the profiled branch taken and transition rates.
original, without reprofiling, shows that Ditto accurately captures The branch misprediction rate drops from 1.95% to 1.47% but has a
critical, platform-independent features that impact performance. negligible impact on IPC. In step F, we synthesize the instruction
memory accesses, which causes more i-cache misses (from 1.3% to
6.3 CPU Top-down Analysis 7.3%) and branch mispredictions (from 1.47% to 4.56%, as discussed
Figure 8 shows the cycles per instruction (CPI) top-down analysis in Sec. 4.4.3), and significantly lowers the IPC. From F to G, we
of the original and synthetic applications. Ditto accurately captures synthesize the data memory access pattern by accessing different
the cycle breakdown of the original applications. Many prior studies sizes of private and shared working sets. The IPC further decreases
have showed that cloud services diverge from traditional scientific as the L1d miss rate rises from 17% to 24%. In H, we mimic data
CPU benchmarks like SPEC CPU by having significant fractions of dependencies by reassigning registers for each instruction, which
front-end stalls, due to large code footprints and frequent context clones the ILP and MLP characteristics and slightly lowers the
switches between user and kernel mode [53, 101, 111]. Our synthetic IPC. From H to I, we perform the fine tuning, which calibrates
benchmarks show similar bottlenecks to the original applications, instruction and data access patterns, lowers the IPC from 0.6 to
and can be used as proxies for microarchitectural optimizations. 0.51, and further improves accuracy. This shows that, even if not
every aspect in Ditto is equally important, they are all required to
accurately clone complex cloud services.
6.4 Decomposition of Ditto’s Accuracy
In Fig. 9, we use MongoDB as an example to show Ditto’s accuracy,
as the framework incorporates more information. We start with a 6.5 Case Study: Interference Analysis
version of Ditto that only generates the thread model and network Figure 10 shows that synthetic applications react to resource inter-
interfaces skeleton, but an empty request handling body. From ference in a similar way to their original counterparts, even though
A to B, we inject the system calls with arguments drawn from we only profile the original application in isolation. We show the
the distribution of the original application, which increases the analysis on NGINX, but the results are similar for other services. We
kernel-level instructions and disk I/Os. In C, we add user-level use a set of stress benchmarks to generate interference in different
instructions (add rax, rax) to match the total instruction count, resources. We use stress-ng [10] to generate hyperthreading (HT),
but not their specific mix. From C to D, user-level instructions are L1d, and L2 interference by co-locating the applications and mi-
generated based on the profiled mix. We assume the highest branch crobenchmarks on different logical cores of the same physical core.
taken/transition rate, strongest data dependencies, and all memory The synthetic application captures the IPC and latency degradation
operations accessing the smallest working sets. We observe an IPC caused by memory contention. When generating L2 interference,

232
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

Actual Synthetic Actual Synthetic

>1ms
1.1 X X X X XXX X X X X XXX

Frequency (GHz)
1.0 1ms
1.3 X X X X XXX X X X X XXX
IPC

0.5 0.8ms
1.5 X X X X XXX X X X X XXX
0.0 1.7 X X X X X X X X X X 0.6ms
5.0
1.9 X X X X X X X 0.4ms
p99 (ms)

4.0
3.0
2.0
2.1 X X 0.2ms
1.0 4 6 8 10 12 14 16 4 6 8 10 12 14 16
0.0
10
#Cores #Cores
LLC Miss (%) L2 Miss (%) Lid Miss (%) L1i Miss (%)

8
6
4 Figure 11: 99th percentile latency of actual and synthetic
2
0 Memcached under varying CPU frequency core count.
12
9
6
3
networking, threading, interference, and power management. It can
0 also be used for certain microarchitecture studies, as we showed
10 when changing memory hierarchies across platforms.
8
6 However, to enable fast, automated, and obfuscated cloning, our
4 method abstracts away the original individual instructions and
2
0 memory accesses. Thus, it is less accurate for microarchitecture
80
studies that rely on exact application implementation rather than
60
40
their statistical performance patterns. For instance, studying the
20 memory access patterns to improve hardware prefetchers would
0 not be a good fit for Ditto. There is a fundamental trade-off between
Orig. HT L1d L2 LLC Net
the granularity at which information is captured in one subsystem,
Figure 10: Interference impact on NGINX. and the overall performance accuracy.
besides the L2 miss rate increase, the synthetic workload also cap-
7.2 Confidentiality
tures the LLC miss rate change in the original service due to an
increase in the LLC accesses with constant misses. Although Ditto cannot guarantee zero information leakage, as it
We also use iBench [41] to generate LLC interference on the may expose the RPC graph, and statistics of some hardware coun-
shared socket, and the result shows the synthetic application cap- ters, many software companies do not consider these sensitive
tures the IPC drop in the original service. Finally, we use iperf3 [16] data. For example, Alibaba, has open-sourced their production RPC
to compete with the service for network bandwidth, and the latency traces [80], Facebook shared the kernel-level cycles breakdown
of synthetic application successfully matches the original service. of its production workloads [99], and Google open-sourced their
workload traces via DynamoRIO [21]. Additionally, the application
6.6 Case Study: CPU Core and Frequency skeleton, while may reflect the original workflow to some extent,
Scaling is chosen and adapted from the network and threading models that
have been extensively studied and used [46, 101, 108]. As the actual
Fig. 11 shows using Ditto to evaluate power management in Mem- logic, functionality and per-access memory patterns are concealed,
cached with CPU core and frequency scaling. Each cell represents inferring useful proprietary information would be very difficult.
the p99 latency under a given number of cores and frequency. We For collaboration with hardware vendors, sharing the propri-
set the QoS as 1ms and cells with marks mean that QoS cannot be etary code under NDA is of course a more straightforward way, but
satisfied for that configuration. Memcached cannot meet the QoS most cloud providers would not share the binaries related to their
at low frequency even with the maximum number of cores, which core business regardless of NDA agreements. Other solutions like
prohibits aggressive power management. Synthetic Memcached internal evaluation of prototypes can be time-consuming and inac-
accurately captures the latency variation of Memcached under dif- curate since the prototype is usually immature, and cloud providers
ferent settings. This similarity indicates that cloud providers can need to adapt their workloads for each new prototype.
use synthetic applications to determine whether power manage-
ment is beneficial for a service, without needing access its source 7.3 Application Phases
code.
Previous studies observe execution phases in SPEC CPU benchmark
7 DISCUSSION applications [62, 70]. To verify whether program phases exists in
the evaluated cloud services, we collect time series of CPU metrics
7.1 Suitable and Unsuitable Use Cases spanning 600 seconds, with sampling frequencies ranging from one
Ditto’s main contribution is cloning an end-to-end application second to ten seconds. We do not observe regular program phases
across the system stack. This makes it more suitable for architecture- with such sampling granularity. There may be program phases
, OS-, application-, and cluster-level studies, including scalability, in the execution of individual requests, which range from tens of

233
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

microseconds to tens of milliseconds. Nevertheless, they rarely [10] stress-ng. https://wiki.ubuntu.com/Kernel/Reference/stress-ng.

manifest across requests, unless caused by load fluctuations, which [11] Zipkin. http://zipkin.io.
[12] perf: Linux profiling with performance counters, 2015.
Ditto captures. As the service processes hundreds of requests per [13] Mutated: A high-performance and very accurate load-generator for stressing
second, the overall program phases are averaged across concurrent servers and measuring their latency behaviour under load, 2016.
[14] tcpkali: A high performance tcp and websocket load generator and sink, 2017.
requests. [15] 8 surprising facts about real docker adoption, 2018.
[16] iperf3: A tcp, udp, and sctp network bandwidth measurement tool, 2021.
7.4 Multi-tenancy and Virtualization [17] Andreas Abel and Jan Reineke. Uops.info: Characterizing latency, throughput,
and port usage of instructions on intel microarchitectures. In Proceedings of the
We demonstrate Ditto’s accuracy in local, virtualized, and multi- Twenty-Fourth International Conference on Architectural Support for Programming
tenant deployments in Sections 6.2 and 6.5. Although we do not Languages and Operating Systems, ASPLOS ’19, page 673–686, New York, NY,
USA, 2019. Association for Computing Machinery.
directly model contention and hypervisor events, the behavior of [18] Andreas Abel and Jan Reineke. nanobench: A low-overhead tool for running
synthetic and target applications under multi-tenancy and virtu- microbenchmarks on x86 systems. In 2020 IEEE International Symposium on
alization is similar. In Ditto, the synthetic application clones the Performance Analysis of Systems and Software (ISPASS), August 2020.
[19] Dan Ardelean, Amer Diwan, and Chandra Erdman. Performance analysis of
interference sensitivity by matching the target application’s re- cloud applications. In 15th USENIX Symposium on Networked Systems Design
source usage patterns. The performance overhead of virtualization and Implementation (NSDI 18), pages 405–417, Renton, WA, April 2018. USENIX
Association.
comes primarily from network and disk I/O, nested paging, VM [20] Amro Awad and Yan Solihin. Stm: Cloning the spatial and temporal memory
scheduling, and OS interactions [47]. Cloning the network and access behavior. In 2014 IEEE 20th International Symposium on High Performance
thread models, system calls and memory accesses ensures that Computer Architecture (HPCA), pages 237–247, 2014.
[21] Grant Ayers, Nayana Prasad Nagendra, David I. August, Hyoun Kyu Cho, Svilen
synthetic and original applications experience similar overheads. Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Mose-
ley, and Parthasarathy Ranganathan. Asmdb: Understanding and mitigating
7.5 Limitations and Future Work front-end stalls in warehouse-scale computers. In International Symposium on
Computer Architecture (ISCA), 2019.
Ditto currently only supports C/C++ applications compiled for x86 [22] Ganesh Balakrishnan and Yan Solihin. West: Cloning data cache behavior using
ISA, as it relies on instrumentation tools designed for C/C++ and stochastic traces. In IEEE International Symposium on High-Performance Comp
Architecture, pages 1–12, 2012.
x86. However, Ditto does not depend on any language- or ISA- [23] Luiz Barroso. Warehouse-scale computing: Entering the teenage decade. In
specific features, and can be easily adapted to other languages and Proceedings of the 38th Intl. symposium on Computer architecture, San Jose, CA,
2011.
ISAs. [24] Luiz Barroso and Urs Hoelzle. The Datacenter as a Computer: An Introduction to
Cloning for specialized hardware, such as GPUs, FPGAs, and the Design of Warehouse-Scale Machines. MC Publishers, 2009.
smartNICs is also gaining in importance. Due to their radically [25] Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan.
Attack of the killer microseconds. Commun. ACM, 60(4):48–54, March 2017.
different ISAs and computing paradigms, we leave extending Ditto [26] Robert H Bell Jr and Lizy K John. Improved automatic testcase synthesis for
for them to future work. performance model validation. In Proceedings of the 19th annual international
conference on Supercomputing, pages 111–120, 2005.
[27] Ramon Bertran, Alper Buyuktosunoglu, Meeta S. Gupta, Marc Gonzalez, and
8 CONCLUSION Pradip Bose. Systematic energy characterization of cmp/smt processor systems
We presented Ditto, an accurate cloning framework for end-to-end via automated micro-benchmarks. In 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, pages 199–211, 2012.
monolithic services and microservices. Ditto captures the activity [28] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site relia-
of an application across the system stack, including kernel and bility engineering: How Google runs production systems. " O’Reilly Media, Inc.",
network events, and accurately reproduces its characteristics, de- 2016.
[29] Sanjay Bhansali, Wen-Ke Chen, Stuart de Jong, Andrew Edwards, Ron Murray,
coupling representative cloud studies from access to production Milenko Drinić, Darek Mihočka, and Joe Chau. Framework for instruction-level
code. tracing and analysis of program executions. In Proceedings of the 2nd Interna-
tional Conference on Virtual Execution Environments, VEE ’06, page 154–163,
New York, NY, USA, 2006. Association for Computing Machinery.
ACKNOWLEDGMENTS [30] Philip Bille. A survey on tree edit distance and related problems. Theoretical
computer science, 337(1-3):217–239, 2005.
We sincerely thank Ramesh Illikkal, Yanqi Zhang, Nikita Lazarev, [31] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali
Zhuangzhuang Zhou, Daniel Sanchez, and the anonymous review- Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh
ers for their feedback on earlier versions of this manuscript. This Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D.
Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News,
work was in part supported by an NSF CAREER Award CCF- 39(2):1–7, aug 2011.
1846046, NSF grant NeTS CSR-1704742, an Intel Research Award, [32] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. Spec cpu2017:
an Intel Faculty Rising Star Award, a Sloan Research Fellowship, a Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC
International Conference on Performance Engineering, ICPE ’18, page 41–42, New
Microsoft Research Fellowship, a Facebook Research Faculty Award, York, NY, USA, 2018. Association for Computing Machinery.
and a John and Norma Balen Sesquisentennial Faculty Fellowship. [33] Gerardo Canfora and Massimiliano Di Penta. New frontiers of reverse engi-
neering. In Future of Software Engineering (FOSE’07), pages 326–341. IEEE,
2007.
REFERENCES [34] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: Exploring the
[1] Apache thrift. https://thrift.apache.org. level of abstraction for scalable and accurate parallel multi-core simulation. In
[2] giltene/wrk2. https://github.com/giltene/wrk2. Proceedings of 2011 International Conference for High Performance Computing,
[3] grpc: A high performance open-source universal rpc framework. https://grpc.io/. Networking, Storage and Analysis, SC ’11, New York, NY, USA, 2011. Association
[4] Jaeger: open source, end-to-end distributed tracing. https://www.jaegertracing. for Computing Machinery.
io/. [35] Mark Charney. X86 encoder decoder user guide, 2019.
[5] Mongodb. https://www.mongodb.com. [36] Shuang Chen, Christina Delimitrou, and Jose F. Martinez. PARTIES: QoS-Aware
[6] Nginx. https://nginx.org/en. Resource Partitioning for Multiple Interactive Services. In Proceedings of the
[7] Opentracing. https://opentracing.io/. Twenty Fourth International Conference on Architectural Support for Programming
[8] Redis. https://redis.io. Languages and Operating Systems (ASPLOS), April 2019.
[9] SPEC Cloud® IaaS 2018.

234
ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada Mingyu Liang, Yu Gan, Yueying Li, Carlos Torres, Abhishek Dhanotia, Mahesh Ketkar, and Christina Delimitrou

[37] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell [60] Part Guide. Intel® 64 and ia-32 architectures software developer’s manual.
Sears. Benchmarking Cloud Serving Systems with YCSB. 2010. Volume 3B: System programming Guide, Part, 2(11), 2011.
[38] Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, and Luis Irun-Briz. [61] M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, and R.B. Brown.
Tupni: Automatic reverse engineering of input formats. In Proceedings of the Mibench: A free, commercially representative embedded benchmark suite. In
15th ACM Conference on Computer and Communications Security, CCS ’08, page Proceedings of the Fourth Annual IEEE International Workshop on Workload
391–402, New York, NY, USA, 2008. Association for Computing Machinery. Characterization. WWC-4 (Cat. No.01EX538), pages 3–14, 2001.
[39] Deeksha Dangwal, Weilong Cui, Joseph McMahan, and Timothy Sherwood. [62] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. Simpoint 3.0:
Safer program behavior sharing through trace wringing. In Proceedings of the Faster and more flexible program phase analysis. Journal of Instruction Level
Twenty-Fourth International Conference on Architectural Support for Programming Parallelism, 7(4):1–28, 2005.
Languages and Operating Systems, pages 1059–1072, 2019. [63] Milad Hashemi, Kevin Swersky, Jamie Smith, Grant Ayers, Heiner Litz, Jichuan
[40] Jeffrey Dean and Luiz Andre Barroso. The tail at scale. In CACM, Vol. 56 No. 2. Chang, Christos Kozyrakis, and Parthasarathy Ranganathan. Learning memory
[41] Christina Delimitrou and Christos Kozyrakis. iBench: Quantifying Interfer- access patterns. In Jennifer Dy and Andreas Krause, editors, Proceedings of the
ence for Datacenter Workloads. In Proceedings of the 2013 IEEE International 35th International Conference on Machine Learning, volume 80 of Proceedings of
Symposium on Workload Characterization (IISWC). Portland, OR, September Machine Learning Research, pages 1919–1928. PMLR, 10–15 Jul 2018.
2013. [64] M. Haungs, P. Sallee, and M. Farrens. Branch transition rate: a new metric for
[42] Christina Delimitrou and Christos Kozyrakis. Quality-of-Service-Aware Sched- improved branch classification analysis. In Proceedings Sixth International Sym-
uling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue posium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550),
on Top Picks from the Computer Architecture Conferences. May/June 2014. pages 241–250, 2000.
[43] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-Efficient and [65] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition:
QoS-Aware Cluster Management. In Proceedings of the Nineteenth International A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA,
Conference on Architectural Support for Programming Languages and Operating USA, 2006.
Systems (ASPLOS). Salt Lake City, UT, USA, 2014. [66] Weizhe Hua, Zhiru Zhang, and G. Edward Suh. Reverse engineering convolu-
[44] eBPF Foundation. ebpf, 2021. tional neural networks through side-channel information leaks. In 2018 55th
[45] Frank C. Eigler, Vara Prasad, Will Cohen, Hien Nguyen, Martin Hunt, Jim ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6, 2018.
Keniston, and Brad Chen. Architecture of systemtap: a linux trace/probe tool, [67] Intel. Intel vtune amplifier. https://www.intel.com/content/www/us/en/
2005. developer/tools/oneapi/vtune-profiler.html, year = 2021,.
[46] Qi Fan and Qingyang Wang. Performance comparison of web servers with [68] Intel. Intel software development emulator, 2012.
different architectures: A case study using high concurrency workload. In 2015 [69] Intel. Dynamic control-flow graph generation with pinplay, 2015.
Third IEEE Workshop on Hot Topics in Web Systems and Technologies (HotWeb), [70] Canturk Isci, Alper Buyuktosunoglu, and Margaret Martonosi. Long-term
pages 37–42. IEEE, 2015. workload phases: Duration predictions and applications to dvfs. Ieee Micro,
[47] Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. An updated 25(5):39–51, 2005.
performance comparison of virtual machines and linux containers. In 2015 [71] Tara Merin John, Syed Kamran Haider, Hamza Omar, and Marten van Dijk.
IEEE International Symposium on Performance Analysis of Systems and Software Connecting the dots: Privacy leakage via write-access patterns to the main
(ISPASS), pages 171–172, 2015. memory. IEEE Transactions on Dependable and Secure Computing, 17(2):436–442,
[48] Michael Ferdman, Almutaz Adileh, and et al. Clearing the clouds: A study of 2020.
emerging scale-out workloads on modern hardware. In Proc. of ASPLOS. London, [72] Ajay Joshi, Lieven Eeckhout, Robert H. Bell, and Lizy John. Performance cloning:
England, UK, 2012. A technique for disseminating proprietary applications as benchmarks. In 2006
[49] Brad Fitzpatrick. Distributed caching with memcached. In Linux Journal, Volume IEEE International Symposium on Workload Characterization, pages 105–115,
2004, Issue 124, 2004. 2006.
[50] Agner Fog. The microarchitecture of intel, amd and via cpus an optimization [73] Ajay Joshi, Lieven Eeckhout, Robert H. Bell, and Lizy K. John. Distilling the
guide for assembly programmers and compiler makers. essence of proprietary workloads into miniature benchmarks. ACM Trans.
[51] Agner Fog et al. Instruction tables: Lists of instruction latencies, throughputs Archit. Code Optim., 5(2), September 2008.
and micro-operation breakdowns for intel, amd and via cpus. Copenhagen [74] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan,
University College of Engineering, 93:110, 2011. Tipp Moseley, Gu-Yeon Wei, and David Brooks. Profiling a warehouse-scale
[52] Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. Sage: computer. SIGARCH Comput. Archit. News, 43(3S):158–169, June 2015.
Practical and scalable ml-driven performance debugging in microservices. In [75] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A fast and extensible
Proceedings of the 26th ACM International Conference on Architectural Support for dram simulator. IEEE Computer architecture letters, 15(1):45–49, 2015.
Programming Languages and Operating Systems, ASPLOS 2021, page 135–151, [76] Nikita Lazarev, Shaojie Xiang, Neil Adit, Zhiru Zhang, and Christina Delim-
New York, NY, USA, 2021. Association for Computing Machinery. itrou. Dagger: Efficient and fast rpcs in cloud microservices with near-memory
[53] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayantara reconfigurable nics. In Proceedings of the 26th ACM International Conference
Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, on Architectural Support for Programming Languages and Operating Systems,
Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine ASPLOS 2021, page 36–51, New York, NY, USA, 2021. Association for Computing
Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Machinery.
Liu, Jake Padilla, and Christina Delimitrou. An Open-Source Benchmark Suite [77] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and
for Microservices and Their Hardware-Software Implications for Cloud and sub-millisecond quality-of-service. In Proc. of EuroSys. 2014.
Edge Systems. In Proceedings of the Twenty Fourth International Conference [78] Chit-Kwan Lin and Stephen J. Tarsa. Branch prediction is not a solved problem:
on Architectural Support for Programming Languages and Operating Systems Measurements, opportunities, and future directions. In 2019 IEEE International
(ASPLOS), April 2019. Symposium on Workload Characterization (IISWC), pages 228–238, 2019.
[54] Yu Gan, Yanqi Zhang, Kelvin Hu, Yuan He, Meghna Pancholi, Dailun Cheng, [79] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and
and Christina Delimitrou. Seer: Leveraging Big Data to Navigate the Complex- Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proc.
ity of Performance Debugging in Cloud Microservices. In Proceedings of the of the 42Nd Annual International Symposium on Computer Architecture (ISCA).
Twenty Fourth International Conference on Architectural Support for Programming Portland, OR, 2015.
Languages and Operating Systems (ASPLOS), April 2019. [80] Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang,
[55] Karthik Ganesan, Jungho Jo, and Lizy K. John. Synthesizing memory-level paral- Yu Ding, Jian He, and Chengzhong Xu. Characterizing microservice dependency
lelism aware miniature clones for SPEC CPU2006 and implantBench workloads. and performance: Alibaba trace analysis. In Proceedings of the ACM Sympo-
ISPASS 2010 - IEEE International Symposium on Performance Analysis of Systems sium on Cloud Computing, SoCC ’21, page 412–426, New York, NY, USA, 2021.
and Software, pages 33–44, 2010. Association for Computing Machinery.
[56] Karthik Ganesan and Lizy Kurian John. Automatic generation of miniatur- [81] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber,
ized synthetic proxies for target applications to efficiently design multicore and Thomas F. Wenisch. Power management of online data-intensive services. In
processors. IEEE Transactions on Computers, 63(4):833–846, 2014. Proceedings of the 38th annual international symposium on Computer architecture,
[57] Oded Goldreich and Rafail Ostrovsky. Software protection and simulation on pages 319–330, 2011.
oblivious rams. Journal of the ACM (JACM), 43(3):431–473, 1996. [82] David Meisner, Junjie Wu, and Thomas F. Wenisch. Bighouse: A simulation
[58] Tarun Goyal, Ajit Singh, and Aakanksha Agrawal. Cloudsim: simulator for cloud infrastructure for data center systems. In 2012 IEEE International Symposium on
computing infrastructure and modeling. Procedia Engineering, 38:3566–3572, Performance Analysis of Systems Software, pages 35–45, 2012.
2012. [83] Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an
[59] Brendan Gregg. Systems performance: enterprise and the cloud. Pearson Educa- overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,
tion, 2014. 2(1):86–97, 2012.
[84] Intel® 64 and ia-32 architectures optimization reference manual. February 2022.

235
Ditto: End-to-End Application Cloning for Networked Cloud Services ASPLOS ’23, March 25–29, 2023, Vancouver, BC, Canada

[85] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight [99] Akshitha Sriraman and Abhishek Dhanotia. Accelerometer: Understanding
dynamic binary instrumentation. ACM Sigplan notices, 42(6):89–100, 2007. acceleration opportunities for data center overheads at hyperscale. In Proceed-
[86] Reena Panda and Lizy K. John. Halo: A hierarchical memory access locality ings of the Twenty-Fifth International Conference on Architectural Support for
modeling technique for memory system explorations. In Proceedings of the 2018 Programming Languages and Operating Systems, ASPLOS ’20, page 733–750,
International Conference on Supercomputing, ICS ’18, page 118–128, New York, New York, NY, USA, 2020. Association for Computing Machinery.
NY, USA, 2018. Association for Computing Machinery. [100] Akshitha Sriraman, Abhishek Dhanotia, and Thomas F. Wenisch. Softsku:
[87] Reena Panda and Lizy Kurian John. Proxy benchmarks for emerging big-data Optimizing server architectures for microservice diversity @scale. In Proceedings
workloads. In 2017 26th International Conference on Parallel Architectures and of the 46th International Symposium on Computer Architecture, ISCA ’19, page
Compilation Techniques (PACT), pages 105–116, 2017. 513–526, New York, NY, USA, 2019. Association for Computing Machinery.
[88] David Pariag, Tim Brecht, Ashif Harji, Peter Buhr, Amol Shukla, and David R [101] Akshitha Sriraman and Thomas F. Wenisch. µtune: Auto-tuned threading for
Cheriton. Comparing the performance of web server architectures. ACM SIGOPS OLDI microservices. In 13th USENIX Symposium on Operating Systems Design
Operating Systems Review, 41(3):231–243, 2007. and Implementation (OSDI 18), pages 177–194, Carlsbad, CA, October 2018.
[89] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. USENIX Association.
Pinplay: A framework for deterministic replay and reproducible analysis of [102] W Richard Stevens and Thomas Narten. Unix network programming. ACM
parallel programs. In Proceedings of the 8th Annual IEEE/ACM International SIGCOMM Computer Communication Review, 20(2):8–9, 1990.
Symposium on Code Generation and Optimization, CGO ’10, page 2–11, New [103] Takanori Ueda, Takuya Nakaike, and Moriyoshi Ohara. Workload characteriza-
York, NY, USA, 2010. Association for Computing Machinery. tion for microservices. In Proc. of IISWC. 2016.
[90] G. Ravi, R. Bertran, P. Bose, and M. Lipasti. Micrograd: A centralized framework [104] Luk Van Ertvelde and Lieven Eeckhout. Dispersing proprietary applications
for workload cloning and stress testing. In 2021 IEEE International Symposium as benchmarks through code mutation. In Proceedings of the 13th international
on Performance Analysis of Systems and Software (ISPASS), pages 70–72, Los conference on Architectural support for programming languages and operating
Alamitos, CA, USA, mar 2021. IEEE Computer Society. systems, pages 201–210, 2008.
[91] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael [105] Pepe Vila, Pierre Ganty, Marco Guarnieri, and Boris Köpf. Cachequery: Learning
Kozych. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. replacement policies from hardware caches. In Proceedings of the 41st ACM
In Proceedings of SOCC. 2012. SIGPLAN Conference on Programming Language Design and Implementation,
[92] Michiel Ronsse and Koen De Bosschere. Recplay: A fully integrated practical PLDI 2020, page 519–532, New York, NY, USA, 2020. Association for Computing
record/replay system. ACM Trans. Comput. Syst., 17(2):133–152, may 1999. Machinery.
[93] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with [106] Qingyang Wang, Chien-An Lai, Yasuhiko Kanemasa, Shungeng Zhang, and
interactive graph analytics and visualization. In AAAI, 2015. Calton Pu. A study of long-tail latency in n-tier systems: Rpc vs. asynchronous
[94] Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitec- invocations. In 2017 IEEE 37th International Conference on Distributed Computing
tural simulation of thousand-core systems. In Proceedings of the 40th Annual Systems (ICDCS), pages 207–217. IEEE, 2017.
International Symposium on Computer Architecture, ISCA ’13, page 475–486, [107] Ahmad Yasin. A top-down method for performance analysis and counters
New York, NY, USA, 2013. Association for Computing Machinery. architecture. In 2014 IEEE International Symposium on Performance Analysis of
[95] Koushik Sen, Swaroop Kalasapur, Tasneem Brutch, and Simon Gibbs. Jalangi: Systems and Software (ISPASS), pages 35–44, 2014.
A selective record-replay and dynamic analysis framework for javascript. In [108] Shungeng Zhang, Qingyang Wang, Yasuhiko Kanemasa, Huasong Shan, and Lit-
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineer- ing Hu. The impact of event processing flow on asynchronous server efficiency.
ing, ESEC/FSE 2013, page 488–498, New York, NY, USA, 2013. Association for IEEE Transactions on Parallel and Distributed Systems, 31(3):565–579, 2019.
Computing Machinery. [109] Yanqi Zhang, Yu Gan, and Christina Delimitrou. µqsim: Enabling accurate and
[96] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, scalable simulation for interactive microservices. In 2019 IEEE International
Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a Symposium on Performance Analysis of Systems and Software (ISPASS), pages
large-scale distributed systems tracing infrastructure. Technical report, Google, 212–222, 2019.
Inc., 2010. [110] Yanqi Zhang, Iñigo Goiri, Gohar Irfan Chaudhry, Rodrigo Fonseca, Sameh
[97] Will Sobel, Shanti Subramanyam, Akara Sucharitakul, Jimmy Nguyen, Hubert Elnikety, Christina Delimitrou, and Ricardo Bianchini. Faster and cheaper
Wong, Arthur Klepchukov, Sheetal Patil, Armando Fox, and David Patterson. serverless computing on harvested resources. In Proceedings of the 28th ACM
Cloudstone: Multi-platform, multi-language benchmark and measurement tools Symposium on Operating Systems Principles (SOSP), October 2021.
for web 2.0. In Proc. of CCA, volume 8, page 228, 2008. [111] Yuhao Zhu, Daniel Richins, Matthew Halpern, and Vijay Janapa Reddi. Microar-
[98] A. Sriraman and T. F. Wenisch. 𝜇 suite: A benchmark suite for microservices. chitectural implications of event-driven server-side web applications. In Proc.
In 2018 IEEE International Symposium on Workload Characterization (IISWC), of MICRO, 2015.
pages 1–12, 2018.
Received 2022-07-07; accepted 2022-09-22

236

Operating Systems Notes
100% (8)
Operating Systems Notes
325 pages
Vulcan Gantt Scheduler Tutorial
100% (2)
Vulcan Gantt Scheduler Tutorial
44 pages
Cloud Computing, Seminar Report
78% (9)
Cloud Computing, Seminar Report
17 pages
Unit-3 - Advanced Operating Systems-23pcsce24-3
No ratings yet
Unit-3 - Advanced Operating Systems-23pcsce24-3
15 pages
Cs8791-Cloud Computing-931237381-Cs8791cc
No ratings yet
Cs8791-Cloud Computing-931237381-Cs8791cc
241 pages
Unit I
No ratings yet
Unit I
12 pages
OS in 6 Hours
No ratings yet
OS in 6 Hours
73 pages
Cloud Computing Syllabus
No ratings yet
Cloud Computing Syllabus
2 pages
Cloud Computing
No ratings yet
Cloud Computing
71 pages
CC Unit-2
No ratings yet
CC Unit-2
33 pages
IEEE Xplore Citation Plain Text Download 2024.8.11.8.40.33
No ratings yet
IEEE Xplore Citation Plain Text Download 2024.8.11.8.40.33
32 pages
Cloud Computing Question Bank Unit IV and Unit V Updated
No ratings yet
Cloud Computing Question Bank Unit IV and Unit V Updated
25 pages
A Multi-Dimensional Extensible Cloud-Native Service Stack For Enterprises
No ratings yet
A Multi-Dimensional Extensible Cloud-Native Service Stack For Enterprises
18 pages
A Brief History of Cloud Application Architectures
No ratings yet
A Brief History of Cloud Application Architectures
26 pages
Applsci 12 11115
No ratings yet
Applsci 12 11115
22 pages
16) Maximo 7.5.1 Scheduler With Integrated Dispatch and Mobile
No ratings yet
16) Maximo 7.5.1 Scheduler With Integrated Dispatch and Mobile
47 pages
Welcome To International Journal of Engineering Research and Development (IJERD)
No ratings yet
Welcome To International Journal of Engineering Research and Development (IJERD)
5 pages
Paper 58-Metaheuristic Optimization For Dynamic Task Scheduling
No ratings yet
Paper 58-Metaheuristic Optimization For Dynamic Task Scheduling
8 pages
Survey On Middleware Systems in Cloud Computing Integration: Jonathan A.P. Marpaung, Mangal Sain, Hoon-Jae Lee
No ratings yet
Survey On Middleware Systems in Cloud Computing Integration: Jonathan A.P. Marpaung, Mangal Sain, Hoon-Jae Lee
4 pages
Assignment 3
No ratings yet
Assignment 3
3 pages
A Referencemodelfordevelopingcloud
No ratings yet
A Referencemodelfordevelopingcloud
6 pages
Admas CCC II Final Exam
No ratings yet
Admas CCC II Final Exam
18 pages
Cse 334 Introduction To Cloud Computing
No ratings yet
Cse 334 Introduction To Cloud Computing
2 pages
16.1 Purpose of An Operating System (MT-L) PDF
No ratings yet
16.1 Purpose of An Operating System (MT-L) PDF
10 pages
Ho To Do CIF Customizing
100% (1)
Ho To Do CIF Customizing
59 pages
Process Management
100% (1)
Process Management
43 pages
Advanced Mapreduce
No ratings yet
Advanced Mapreduce
37 pages
22CS304 - Operating Systems (Lab Integrated)
No ratings yet
22CS304 - Operating Systems (Lab Integrated)
3 pages
OS 2022 Solution
No ratings yet
OS 2022 Solution
37 pages
A Review of Metaheuristic Algorithms For Job Shop Scheduling
No ratings yet
A Review of Metaheuristic Algorithms For Job Shop Scheduling
27 pages
International Mining FMS Report May 2020
No ratings yet
International Mining FMS Report May 2020
10 pages
OS Summary
No ratings yet
OS Summary
22 pages
5th Semester B.Tech AI&DS Syllabus
No ratings yet
5th Semester B.Tech AI&DS Syllabus
30 pages
LSSEU2019 - Exploiting Race Conditions On Linux
No ratings yet
LSSEU2019 - Exploiting Race Conditions On Linux
35 pages
Blueprint OS
No ratings yet
Blueprint OS
4 pages
OPNET LTE Simulator Project PDF
0% (2)
OPNET LTE Simulator Project PDF
2 pages
Process Synchronization & Disk Scheduling - criticalRegionOfCodeAndHardDiskProblems - Tut-7 With Solution
No ratings yet
Process Synchronization & Disk Scheduling - criticalRegionOfCodeAndHardDiskProblems - Tut-7 With Solution
9 pages
TMS320F2812 Dsp/bios
No ratings yet
TMS320F2812 Dsp/bios
26 pages
DSP Applications Using C and The TMS320C6x DSK - G
No ratings yet
DSP Applications Using C and The TMS320C6x DSK - G
3 pages
Cimple: Instruction and Memory Level Parallelism: A DSL For Uncovering ILP and MLP
No ratings yet
Cimple: Instruction and Memory Level Parallelism: A DSL For Uncovering ILP and MLP
16 pages
Unit III OSY Handout Revised 24.07.2023
No ratings yet
Unit III OSY Handout Revised 24.07.2023
15 pages
Unit 9 Scheduling
No ratings yet
Unit 9 Scheduling
14 pages
Operating System
No ratings yet
Operating System
15 pages
Nptel Lecture Notes Cpu Sharing
No ratings yet
Nptel Lecture Notes Cpu Sharing
17 pages
CS3461 Set 3
No ratings yet
CS3461 Set 3
3 pages
Understanding The Linux Kernel: From I/O Ports To Process Management - Daniel P. Bovet
No ratings yet
Understanding The Linux Kernel: From I/O Ports To Process Management - Daniel P. Bovet
5 pages
WhatsNew PlatformLSF8
No ratings yet
WhatsNew PlatformLSF8
2 pages
TypeScript in Microservices Architecture: Effective Patterns and Techniques
From Everand
TypeScript in Microservices Architecture: Effective Patterns and Techniques
Baldurs L.
No ratings yet
Cybersecurity in Cloud Computing
From Everand
Cybersecurity in Cloud Computing
Akula Achari
No ratings yet
Cloud Computing
From Everand
Cloud Computing
Dr. Nirvikar Katiyar
No ratings yet
Edge Computing Applications in Supply Chain Management
From Everand
Edge Computing Applications in Supply Chain Management
Bo Li
No ratings yet
Dancing on a Cloud: A Framework for Increasing Business Agility
From Everand
Dancing on a Cloud: A Framework for Increasing Business Agility
David Sterling
No ratings yet
Graph Layout Support for Model-Driven Engineering
From Everand
Graph Layout Support for Model-Driven Engineering
Miro Spönemann
No ratings yet
Network Coding and Signcryption for Cloud Data Integrity
From Everand
Network Coding and Signcryption for Cloud Data Integrity
Noah Joan
No ratings yet
Handbook of Cloud Computing: Basic to Advance research on the concepts and design of Cloud Computing
From Everand
Handbook of Cloud Computing: Basic to Advance research on the concepts and design of Cloud Computing
Dr. Anand Nayyar
No ratings yet
Implementing Linkerd Service Mesh
From Everand
Implementing Linkerd Service Mesh
Kimiko Lee
No ratings yet
Mastering Cloud Computing With Best Practices
From Everand
Mastering Cloud Computing With Best Practices
Manish Soni
No ratings yet
The Ultimate Guide to Landing a Network Engineering Job
From Everand
The Ultimate Guide to Landing a Network Engineering Job
J.L Parham
No ratings yet
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
From Everand
Introduction to Quantum Computing & Machine Learning Technologies: 1, #1
M. Sreedevi
No ratings yet
Quantum Ethics Navigating the Moral Landscape of Quantum Computing and Cybersecurity
From Everand
Quantum Ethics Navigating the Moral Landscape of Quantum Computing and Cybersecurity
Nikiforos Kontopoulos
No ratings yet
Cloud Computing Unveiled: A Short Journey Through Time
From Everand
Cloud Computing Unveiled: A Short Journey Through Time
Maula Issa
No ratings yet
Shedding Light on Cloud Computing
From Everand
Shedding Light on Cloud Computing
Gregor Petri
5/5 (1)
Litmus Chaos Engineering for Kubernetes: The Complete Guide for Developers and Engineers
From Everand
Litmus Chaos Engineering for Kubernetes: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
IGNOU MCS 227 Cloud Computing and IoT Previous Years Solved Papers
From Everand
IGNOU MCS 227 Cloud Computing and IoT Previous Years Solved Papers
Manish Soni
No ratings yet
Edge Computing Architecture and Applications: Definitive Reference for Developers and Engineers
From Everand
Edge Computing Architecture and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Digital Technologies – an Overview of Concepts, Tools and Techniques Associated with it
From Everand
Digital Technologies – an Overview of Concepts, Tools and Techniques Associated with it
Editor IJSMI
No ratings yet
EDGE Computing Architecture and Protocols: Definitive Reference for Developers and Engineers
From Everand
EDGE Computing Architecture and Protocols: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
IGNOU MCA Cloud Computing and IoT Previous year Unsolved Papers MCS 227
From Everand
IGNOU MCA Cloud Computing and IoT Previous year Unsolved Papers MCS 227
Manish Soni
No ratings yet
The Quantum Computer
From Everand
The Quantum Computer
Roberto Miguel Rodriguez
No ratings yet
Moleculer for Scalable Microservices: Definitive Reference for Developers and Engineers
From Everand
Moleculer for Scalable Microservices: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Kestra Pipeline Orchestration Essentials: The Complete Guide for Developers and Engineers
From Everand
Kestra Pipeline Orchestration Essentials: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Designing Resilient Distributed Systems with CAP: Definitive Reference for Developers and Engineers
From Everand
Designing Resilient Distributed Systems with CAP: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Architecting Distributed Applications with Macrometa: The Complete Guide for Developers and Engineers
From Everand
Architecting Distributed Applications with Macrometa: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Building Desktop Applications with Electron: Definitive Reference for Developers and Engineers
From Everand
Building Desktop Applications with Electron: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Chaos Mesh for Resilient Kubernetes Deployments: The Complete Guide for Developers and Engineers
From Everand
Chaos Mesh for Resilient Kubernetes Deployments: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Cilk Programming and Algorithms: Definitive Reference for Developers and Engineers
From Everand
Cilk Programming and Algorithms: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Networking Programming with C++: Build Efficient Communication Systems
From Everand
Networking Programming with C++: Build Efficient Communication Systems
Robert Johnson
No ratings yet
Istio in Depth: Definitive Reference for Developers and Engineers
From Everand
Istio in Depth: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Containerization Technology Essentials: Definitive Reference for Developers and Engineers
From Everand
Containerization Technology Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Kuma Service Mesh in Depth: Definitive Reference for Developers and Engineers
From Everand
Kuma Service Mesh in Depth: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Engineering Anthos Solutions: Definitive Reference for Developers and Engineers
From Everand
Engineering Anthos Solutions: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Mesh Networks and Mesh Generation: Definitive Reference for Developers and Engineers
From Everand
Principles of Mesh Networks and Mesh Generation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
OpenTracing in Distributed Systems: Definitive Reference for Developers and Engineers
From Everand
OpenTracing in Distributed Systems: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Cloud Computing Made Simple: Navigating the Cloud: A Practical Guide to Cloud Computing
From Everand
Cloud Computing Made Simple: Navigating the Cloud: A Practical Guide to Cloud Computing
Poonam Devi
No ratings yet
Blue-Green Deployment Engineering: Definitive Reference for Developers and Engineers
From Everand
Blue-Green Deployment Engineering: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Edge Cloud Operations: A Systems Approach
From Everand
Edge Cloud Operations: A Systems Approach
Larry L Peterson
No ratings yet
Mastering OpenTelemetry: Building Scalable Observability Systems for Cloud-Native Applications
From Everand
Mastering OpenTelemetry: Building Scalable Observability Systems for Cloud-Native Applications
Robert Johnson
No ratings yet
Cloud Computing: Harnessing the Power of the Digital Skies: The IT Collection
From Everand
Cloud Computing: Harnessing the Power of the Digital Skies: The IT Collection
Christopher Ford
No ratings yet
Sussman Anomaly: Fundamentals and Applications
From Everand
Sussman Anomaly: Fundamentals and Applications
Fouad Sabry
No ratings yet
Mesh Generation: Advances and Applications in Computer Vision Mesh Generation
From Everand
Mesh Generation: Advances and Applications in Computer Vision Mesh Generation
Fouad Sabry
No ratings yet

Ditto

Uploaded by

Ditto

Uploaded by

MIT Open Access Articles

Ditto: End-to-End Application Cloning

Publisher: ACM|Proceedings of the 28th ACM International Conference on Architectural Support

Persistent URL: https://hdl.handle.net/1721.1/147828

Carlos Torres Abhishek Dhanotia Mahesh Ketkar

ABSTRACT methodologies; • Software and its engineering → Software

3.3.3 CPU-Memory Subsystem. The CPU-memory subsystem is a

Bad different subsystems. Finally, Ditto uses the deviation in perfor-

Figure 3: Overview of Ditto’s synthetic benchmark generation process.

5 IMPLEMENTATION Platform A Platform B Platform C

Memcached NGINX MongoDB Redis TextService SocialGraphService

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

1.5 2 2 0.4 15 1.5

1.5 1.5 0.3

of 40GB with one million records. To load MongoDB, we use 40

tion. Actual p95

Memcached NGINX MongoDB Redis TextService SocialGraphService

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

L1i L1i L1i L1i L1i L1i

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

L1i L1i L1i L1i L1i L1i

L1d L1d L1d L1d L1d

IPC IPC L2 IPC IPC IPC IPC

0.3 0.6 20 0.15 6 0.8

QPS QPS QPS QPS QPS QPS

Actual Synthetic Actual Synthetic

microseconds to tens of milliseconds. Nevertheless, they rarely [10] stress-ng. https://wiki.ubuntu.com/Kernel/Reference/stress-ng.

You might also like