Towards Modern Development of Cloud Applications

Sanjay Ghemawat, Robert Grandl, Srdjan Petrovic, Michael Whittaker,
Parveen Patel, Ivan Posva, Amin Vahdat
Google
Abstract

When writing a distributed application, conventional wisdom says to split your application into separate services that can be rolled out independently. This approach is well-intentioned, but a microservices-based architecture like this often backfires, introducing challenges that counteract the benefits the architecture tries to achieve. Fundamentally, this is because microservices conflate logical boundaries (how code is written) with physical boundaries (how code is deployed). In this paper, we propose a different programming methodology that decouples the two in order to solve these challenges. With our approach, developers write their applications as logical monoliths, offload the decisions of how to distribute and run applications to an automated runtime, and deploy applications atomically. Our prototype implementation reduces application latency by up to 15× and reduces cost by up to 9× compared to the status quo.

CCS Concepts

• Computer systems organization → Cloud computing; Client-server architectures.

ACM Reference Format:
Sanjay Ghemawat, Robert Grandl, Srdjan Petrovic, Michael Whittaker, Parveen Patel, Ivan Posva, Amin Vahdat. 2023. Towards Modern Development of Cloud Applications. In Workshop on Hot Topics in Operating Systems (HOTOS '23), June 22–24, 2023, Providence, RI, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3593856.3595909

1 Introduction

Cloud computing has seen unprecedented growth in recent years. Writing and deploying distributed applications that can scale up to millions of users has never been easier, in large part due to frameworks like Kubernetes [25], messaging solutions like [7, 18, 31, 33, 40, 60], and data formats like [5, 6, 23, 30]. The prevailing wisdom when using these technologies is to manually split your application into separate microservices that can be rolled out independently.

Via an internal survey of various infrastructure teams, we have found that most developers split their applications into multiple binaries for one of the following reasons: (1) It improves performance. Separate binaries can be scaled independently, leading to better resource utilization. (2) It improves fault tolerance. A crash in one microservice doesn't bring down other microservices, limiting the blast radius of bugs. (3) It improves abstraction boundaries. Microservices require clear and explicit APIs, and the chance of code entanglement is severely minimized. (4) It allows for flexible rollouts. Different binaries can be released at different rates, leading to more agile code upgrades.

However, splitting applications into independently deployable microservices is not without its challenges, some of which directly contradict the benefits.

• C1: It hurts performance. The overhead of serializing data and sending it across the network is increasingly becoming a bottleneck [72]. When developers over-split their applications, these overheads compound [55].
• C2: It hurts correctness. It is extremely challenging to reason about the interactions between every deployed version of every microservice. In a case study of over 100 catastrophic failures of eight widely used systems, two-thirds of failures were caused by the interactions between multiple versions of a system [78].
• C3: It is hard to manage. Rather than having a single binary to build, test, and deploy, developers have to manage 𝑛 different binaries, each on their own release schedule. Running end-to-end tests with a local instance of the application becomes an engineering feat.
• C4: It freezes APIs. Once a microservice establishes an API, it becomes hard to change without breaking the other services that consume the API. Legacy APIs linger around, and new APIs are patched on top.
• C5: It slows down application development. When making changes that affect multiple microservices, developers cannot implement and deploy the changes atomically. They have to carefully plan how to introduce the change across 𝑛 microservices with their own release schedules.


In our experience, we have found that an overwhelming number of developers view the above challenges as a necessary part of doing business. Many cloud-native companies are in fact developing internal frameworks and processes that aim to ease some of the above challenges, but not fundamentally change or eliminate them altogether. For example, continuous deployment frameworks [12, 22, 37] simplify how individual binaries are built and pushed into production, but they do nothing to solve the versioning issue; if anything, they make it worse, as code is pushed into production at a faster rate. Various programming libraries [13, 27] make it easier to create and discover network endpoints, but do nothing to help ease application management. Network protocols like gRPC [18] and data formats like Protocol Buffers [30] are continually improved, but still take up a major fraction of an application's execution cost.

There are two reasons why these microservice-based solutions fall short of solving challenges C1-C5. The first reason is that they all assume that the developer manually splits their application into multiple binaries. This means that the network layout of the application is predetermined by the application developer. Moreover, once made, the network layout becomes hardened by the addition of networking code into the application (e.g., network endpoints, client/server stubs, network-optimized data structures like [30]). This means that it becomes harder to undo or modify the splits, even when it makes sense to do so. This implicitly contributes to the challenges C1, C2 and C4 mentioned above.

The second reason is the assumption that application binaries are individually (and in some cases continually) released into production. This makes it more difficult to make changes to the cross-binary protocol. Additionally, it introduces versioning issues and forces the use of more inefficient data formats like [23, 30]. This in turn contributes to the challenges C1-C5 listed above.

In this paper, we propose a different way of writing and deploying distributed applications, one that solves C1-C5. Our programming methodology consists of three core tenets:

(1) Write monolithic applications that are modularized into logically distinct components.
(2) Leverage a runtime to dynamically and automatically assign logical components to physical processes based on execution characteristics.
(3) Deploy applications atomically, preventing different versions of an application from interacting.

Other solutions (e.g., actor based systems) have also tried to raise the abstraction. However, they fall short of solving one or more of these challenges (Section 7). Though these challenges and our proposal are discussed in the context of serving applications, we believe that our observations and solutions are broadly useful.

2 Proposed Solution

The two main parts of our proposal are (1) a programming model with abstractions that allow developers to write single-binary modular applications focused solely on business logic, and (2) a runtime for building, deploying, and optimizing these applications.

The programming model enables a developer to write a distributed application as a single program, where the code is split into modular units called components (Section 3). This is similar to splitting an application into microservices, except that microservices conflate logical and physical boundaries. Our solution instead decouples the two: components are centered around logical boundaries based on application business logic, and the runtime is centered around physical boundaries based on application performance (e.g., two components should be co-located to improve performance). This decoupling—along with the fact that boundaries can be changed atomically—addresses C4.

By delegating all execution responsibilities to the runtime, our solution is able to provide the same benefits as microservices but with much higher performance and reduced costs (addresses C1). For example, the runtime makes decisions on how to run, place, replicate, and scale components (Section 4). Because applications are deployed atomically, the runtime has a bird's eye view into the application's execution, enabling further optimizations. For example, the runtime can use custom serialization and transport protocols that leverage the fact that all participants execute at the same version.

Writing an application as a single binary and deploying it atomically also makes it easier to reason about its correctness (addresses C2) and makes the application easier to manage (addresses C3). Our proposal provides developers with a programming model that lets them focus on application business logic, delegating deployment complexities to a runtime (addresses C5). Finally, our proposal enables future innovations like automated testing of distributed applications (Section 5).

3 Programming Model

3.1 Components

The key abstraction of our proposal is the component. A component is a long-lived, replicated computational agent, similar to an actor [2]. Each component implements an interface, and the only way to interact with a component is by calling methods on its interface. Components may be hosted by different OS processes (perhaps across many machines). Component method invocations turn into remote procedure calls where necessary, but remain local procedure calls if the caller and callee component are in the same process.


Components are illustrated in Figure 1. The example application has three components: 𝐴, 𝐵, and 𝐶. When the application is deployed, the runtime determines how to co-locate and replicate components. In this example, components 𝐴 and 𝐵 are co-located in the same OS process, and method calls between them are executed as regular method calls. Component 𝐶 is not co-located with any other component and is replicated across two machines. Method calls on 𝐶 are executed as RPCs over the network.

Figure 1: An illustration of how components are written and deployed. An application is written as a set of components (left) and deployed across machines (right). Note that components can be replicated and co-located.

Components are generally long-lived, but the runtime may scale up or scale down the number of replicas of a component over time based on load. Similarly, component replicas may fail and get restarted. The runtime may also move component replicas around, e.g., to co-locate two chatty components in the same OS process so that communication between the components is done locally rather than over the network.

3.2 API

For the sake of concreteness, we present a component API in Go, though our ideas are language-agnostic. A "Hello, World!" application is given in Figure 2. Component interfaces are represented as Go interfaces, and component implementations are represented as Go structs that implement these interfaces. In Figure 2, the hello struct embeds the Implements[Hello] struct to signal that it is the implementation of the Hello component.

// Component interface.
type Hello interface {
    Greet(name string) string
}

// Component implementation.
type hello struct {
    Implements[Hello]
}

func (h *hello) Greet(name string) string {
    return fmt.Sprintf("Hello, %s!", name)
}

// Component invocation.
func main() {
    app := Init()
    hello := Get[Hello](app)
    fmt.Println(hello.Greet("World"))
}

Figure 2: A "Hello, World!" application.

Init initializes the application. Get[Hello] returns a client to the component with interface Hello, creating it if necessary. The call to hello.Greet looks like a regular method call. Any serialization and remote procedure calls are abstracted away from the developer.

4 Runtime

4.1 Overview

Underneath the programming model lies a runtime that is responsible for distributing and executing components. The runtime makes all high-level decisions on how to run components. For example, it decides which components to co-locate and replicate. The runtime is also responsible for low-level details like launching components onto physical resources and restarting components when they fail. Finally, the runtime is responsible for performing atomic rollouts, ensuring that components in one version of an application never communicate with components in a different version.

There are many ways to implement a runtime. The goal of this paper is not to prescribe any particular implementation. Still, it is important to recognize that the runtime is not magical. In the rest of this section, we outline the key pieces of the runtime and demystify its inner workings.

4.2 Code Generation

The first responsibility of the runtime is code generation. By inspecting the Implements[T] embeddings in a program's source code, the code generator computes the set of all component interfaces and implementations. It then generates code to marshal and unmarshal arguments to component methods. It also generates code to execute these methods as remote procedure calls. The generated code is compiled along with the developer's code into a single binary.
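To make this concrete, the following is a minimal sketch of the kind of client stub the code generator might emit for the Hello component of Figure 2. The Conn, Encoder, and Decoder types are hypothetical stand-ins for the generated plumbing (error handling elided); the paper does not show the actual generated code.

// Hypothetical transport handle: runs a method (identified by
// index) on a possibly remote replica and returns the results.
type Conn interface {
    Invoke(method int, args []byte) []byte
}

// Encoder and Decoder write and read values in a fixed order agreed
// on at build time, so no field tags are needed (see Section 6.1).
type Encoder struct{ data []byte }

func (e *Encoder) String(s string) {
    e.data = append(e.data, byte(len(s))) // sketch: 1-byte length prefix
    e.data = append(e.data, s...)
}

type Decoder struct{ data []byte }

func (d *Decoder) String() string {
    n := int(d.data[0])
    s := string(d.data[1 : 1+n])
    d.data = d.data[1+n:]
    return s
}

// helloStub is what Get[Hello] might hand back when the Hello
// component is not co-located with the caller.
type helloStub struct{ conn Conn }

func (s *helloStub) Greet(name string) string {
    var enc Encoder
    enc.String(name)                  // marshal the argument
    res := s.conn.Invoke(0, enc.data) // method index 0 = Greet
    dec := Decoder{data: res}
    return dec.String() // unmarshal the result
}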


4.3 Application-Runtime Interaction

With our proposal, applications do not include any code specific to the environment in which they are deployed, yet they must ultimately be run and integrated into a specific environment (e.g., across machines in an on-premises cluster or across regions in a public cloud). To support this integration, we introduce an API (partially outlined in Table 1) that isolates application logic from the details of the environment.

API              | Description
RegisterReplica  | Register a proclet as alive and ready.
StartComponent   | Start a component, potentially in another process.
ComponentsToHost | Get components a proclet should host.

Table 1: Example API between the application and runtime.

The caller of the API is a proclet. Every application binary runs a small, environment-agnostic daemon called a proclet that is linked into the binary during compilation. A proclet manages the components in a running binary. It runs them, starts them, stops them, restarts them on failure, etc.

The implementer of the API is the runtime, which is responsible for all control plane operations. The runtime decides how and where proclets should run. For example, a multiprocess runtime may run every proclet in a subprocess; an SSH runtime may run proclets via SSH; and a cloud runtime may run proclets as Kubernetes pods [25, 28].

Concretely, proclets interact with the runtime over a Unix pipe. For example, when a proclet is constructed, it sends a RegisterReplica message over the pipe to mark itself as alive and ready. It periodically issues ComponentsToHost requests to learn which components it should run. If a component calls a method on a different component, the proclet issues a StartComponent request to ensure it is started.

The runtime implements these APIs in a way that makes sense for the deployment environment. We expect most runtime implementations to contain the following two pieces: (1) a set of envelope processes that communicate directly with proclets via UNIX pipes, and (2) a global manager that orchestrates the execution of the proclets (see Figure 3).

Figure 3: Proposed Deployer Architecture. (The figure shows a global manager that handles rollouts, scaling, placement, and co-location, integrates with cloud APIs such as AWS, GKE, Azure, and Cloudlab, aggregates metrics, traces, and logs, and exposes a web UI plus profiling, debugging, and end-to-end testing tools. Below it, envelope processes each host a proclet and its application components, executing on OS processes, containers, servers, VMs, or pods.)

An envelope runs as the parent process to a proclet and relays API calls to the manager. The manager launches envelopes and (indirectly) proclets across the set of available resources (e.g., servers, VMs). Throughout the lifetime of the application, the manager interacts with the envelopes to collect health and load information of the running components; to aggregate metrics, logs, and traces exported by the components; and to handle requests to start new components. The manager also issues environment-specific APIs (e.g., Google Cloud [16], AWS [4]) to update traffic assignments and to scale up and down components based on load, health, and performance constraints. Note that the runtime implements the control plane but not the data plane. Proclets communicate directly with one another.
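As a sketch of what this interaction might look like in Go, the message types below mirror the three calls in Table 1. The type and field names are hypothetical; the paper does not specify the wire format used over the Unix pipe.

// Messages a proclet exchanges with the runtime (hypothetical).
type RegisterReplica struct {
    Proclet string // "I am alive and ready."
}

type ComponentsToHost struct {
    Proclet    string   // request: which components should I run?
    Components []string // reply: components this proclet should host
}

type StartComponent struct {
    Component string // "ensure this component is running somewhere."
}

// A proclet's lifecycle, per Section 4.3:
//  1. On startup, send RegisterReplica over the Unix pipe.
//  2. Periodically send ComponentsToHost and launch whatever
//     components the runtime assigns.
//  3. When a hosted component calls a component that is not yet
//     running, send StartComponent.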
4.4 Atomic Rollouts

Developers inevitably have to release new versions of their application. A widely used approach is to perform rolling updates, where the machines in a deployment are updated from the old version to the new version one by one. During a rolling update, machines running different versions of the code have to communicate with each other, which can lead to failures. [78] shows that the majority of update failures are caused by these cross-version interactions.

To address these complexities, we propose a different approach. The runtime ensures that application versions are rolled out atomically, meaning that all component communication occurs within a single version of the application. The runtime gradually shifts traffic from the old version to the new version, but once a user request is forwarded to a specific version, it is processed entirely within that version. One popular implementation of atomic rollouts is the use of blue/green deployments [9].
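A minimal sketch of the version-pinning idea, assuming the runtime draws one uniform random number per incoming request; the names are illustrative, not the runtime's actual code:

// pickVersion pins a request to one application version for its
// entire lifetime. As the rollout progresses, the runtime raises
// newFraction from 0.0 to 1.0; components from the two versions
// never talk to each other.
func pickVersion(coin, newFraction float64) string {
    if coin < newFraction {
        return "v2" // every component call for this request stays in v2
    }
    return "v1" // otherwise the request is served entirely by v1
}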


5 Enabled Innovations

5.1 Transport, Placement, and Scaling

The runtime has a bird's-eye view into application execution, which enables new avenues to optimize performance. For example, our framework can construct a fine-grained call graph between components and use it to identify the critical path, the bottleneck components, the chatty components, etc. Using this information, the runtime can make smarter scaling, placement, and co-location decisions. Moreover, because serialization and transport are abstracted from the developer, the runtime is free to optimize them. For network bottlenecked applications, for example, the runtime may decide to compress messages on the wire. For certain deployments, the transport may leverage technologies like RDMA [32].

5.2 Routing

The performance of some components improves greatly when requests are routed with affinity. For example, consider an in-memory cache component backed by an underlying disk-based storage system. The cache hit rate and overall performance increase when requests for the same key are routed to the same cache replica. Slicer [44] showed that many applications can benefit from this type of affinity based routing and that the routing is most efficient when embedded in the application itself [43]. Our programming framework can be naturally extended to include a routing API. The runtime could also learn which methods benefit the most from routing and route them automatically.
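The following is a sketch of the simplest form such affinity routing could take: a deterministic hash of the request key over a component's replicas. The scheme (an inlined FNV-1a hash with modulo placement) is purely illustrative; the paper proposes a routing API but does not prescribe an algorithm.

// pickReplica maps a key to one of n replicas of a component, so
// repeated requests for the same key always hit the same replica
// and its cache stays warm. Uses an inlined FNV-1a hash.
func pickReplica(key string, n int) int {
    h := uint32(2166136261)
    for i := 0; i < len(key); i++ {
        h ^= uint32(key[i])
        h *= 16777619
    }
    return int(h % uint32(n))
}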
5.3 Automated Testing

One of the touted benefits of microservice architectures is fault-tolerance. The idea is that if one service in an application fails, the functionality of the application degrades but the app as a whole remains available. This is great in theory, but in practice it relies on the developer to ensure that their application is resilient to failures and, more importantly, to test that their failure-handling logic is correct. Testing is particularly challenging due to the overhead in building and running 𝑛 different microservices, systematically failing and restoring them, and checking for correct behavior. As a result, only a fraction of microservice-based systems are tested for this type of fault tolerance. With our proposal, it is trivial to run end-to-end tests. Because applications are written as single binaries in a single programming language, end-to-end tests become simple unit tests. This opens the door to automated fault tolerance testing, akin to chaos testing [47], Jepsen testing [14], and model checking [62].
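For example, an end-to-end test of the application in Figure 2 can be an ordinary Go unit test; no separate services need to be built or deployed. This sketch assumes that Init and Get run all components inside the test process, as the text above describes:

func TestGreet(t *testing.T) {
    app := Init()
    hello := Get[Hello](app)
    if got, want := hello.Greet("World"), "Hello, World!"; got != want {
        t.Errorf("Greet: got %q, want %q", got, want)
    }
}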
5.4 Stateful Rollouts

Our proposal ensures that components in one version of an application never communicate with components in a different version. This makes it easier for developers to reason about correctness. However, if an application updates state in a persistent storage system, like a database, then different versions of an application will indirectly influence each other via the data they read and write. These cross-version interactions are unavoidable—persistent state, by definition, persists across versions—but an open question remains about how to test these interactions and identify bugs early to avoid catastrophic failures during rollout.

5.5 Discussion

Note that innovation in the areas discussed in this section is not fundamentally unique to our proposal. There has been extensive research on transport protocols [63, 64], routing [44, 65], testing [45, 75], resource management [57, 67, 71], troubleshooting [54, 56], etc. However, the unique features of our programming model enable new innovations and make existing innovations much easier to implement.

For instance, by leveraging the atomic rollouts in our proposal, we can design highly-efficient serialization protocols that can safely assume that all participants are using the same schema. Additionally, our programming model makes it easy to embed routing logic directly into a user's application, providing a range of benefits [43]. Similarly, our proposal's ability to provide a bird's eye view of the application allows researchers to focus on developing new solutions for tuning applications and reducing deployment costs.

6 Prototype Implementation

Our prototype implementation is written in Go [38] and includes the component APIs described in Figure 2, the code generator described in Section 4.2, and the proclet architecture described in Section 4.3. The implementation uses a custom serialization format and a custom transport protocol built directly on top of TCP. The prototype also comes with a Google Kubernetes Engine (GKE) deployer, which implements multi-region deployments with gradual blue/green rollouts. It uses Horizontal Pod Autoscalers [20] to dynamically adjust the number of container replicas based on load and follows an architecture similar to that in Figure 3. Our implementation is available at github.com/ServiceWeaver.

6.1 Evaluation

To evaluate our prototype, we used a popular web application [41] representative of the kinds of microservice applications developers write. The application has eleven microservices and uses gRPC [18] and Kubernetes [25] to deploy on the cloud. The application is written in various programming languages, so for a fair comparison, we ported the application to be written fully in Go. We then ported the application to our prototype, with each microservice rewritten as a component. We used Locust [26], a workload generator, to load-test the application with and without our prototype.

The workload generator sends a steady rate of HTTP requests to the applications. Both application versions were configured to auto-scale the number of container replicas in response to load. We measured the number of CPU cores used by the application versions in a steady state, as well as their end-to-end latencies. Table 2 shows our results.

Metric                  | Our Prototype | Baseline
QPS                     | 10000         | 10000
Average Number of Cores | 28            | 78
Median Latency (ms)     | 2.66          | 5.47

Table 2: Performance Results.

Most of the performance benefits of our prototype come from its use of a custom serialization format designed for non-versioned data exchange, as well as its use of a streamlined transport protocol built directly on top of TCP. For example, the serialization format used does not require any encoding of field numbers or type information. This is because all encoders and decoders run at the exact same version and agree on the set of fields and the order in which they should be encoded and decoded in advance.
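To illustrate why this helps, here is a self-contained sketch of tag-free encoding for a hypothetical Order message. Unlike formats such as Protocol Buffers [30], nothing in the byte stream identifies fields or types; both sides rely on running the exact same version of the generated code:

// Order is a hypothetical message type.
type Order struct {
    ID       uint32
    Quantity uint32
    Note     string
}

// putUint32 appends v in little-endian byte order.
func putUint32(b []byte, v uint32) []byte {
    return append(b, byte(v), byte(v>>8), byte(v>>16), byte(v>>24))
}

func getUint32(b []byte) (uint32, []byte) {
    v := uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
    return v, b[4:]
}

// encode writes the fields in a fixed order with no tags: both the
// encoder and decoder are compiled from the same schema version.
func encode(o Order) []byte {
    b := putUint32(nil, o.ID)
    b = putUint32(b, o.Quantity)
    b = putUint32(b, uint32(len(o.Note)))
    return append(b, o.Note...)
}

func decode(b []byte) Order {
    var o Order
    o.ID, b = getUint32(b)
    o.Quantity, b = getUint32(b)
    var n uint32
    n, b = getUint32(b)
    o.Note = string(b[:n])
    return o
}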


For an apples-to-apples comparison to the baseline, we did not co-locate any components. When we co-locate all eleven components into a single OS process, the number of cores drops to 9 and the median latency drops to 0.38 ms, both an order of magnitude lower than the baseline. This mirrors industry experience [34, 39].

7 Related Work

Actor Systems. The closest solutions to our proposal are Orleans [74] and Akka [3]. These frameworks also use abstractions to decouple the application and runtime. Ray [70] is another actor based framework but is focused on ML applications. None of these systems support atomic rollouts, which is a necessary component to fully address challenges C2-C5. Other popular actor based frameworks such as Erlang [61], E [52], Thorn [48] and the C++ Actor Framework [10] put the burden on the developer to deal with system and low level details regarding deployment and execution, hence they fail to decouple the concerns between the application and the runtime and therefore don't fully address C1-C5. Distributed object frameworks like CORBA, DCOM, and Java RMI use a programming model similar to ours but suffered from a number of technical and organizational issues [58] and don't fully address C1-C5 either.

Microservice Based Systems. Kubernetes [25] is widely used for deploying container based applications in the cloud. However, its focus is orthogonal to our proposal and doesn't address any of C1-C5. Docker Compose [15], Acorn [1], Helm [19], Skaffold [35], and Istio [21] abstract away some microservice challenges (e.g., configuration generation). However, challenges related to splitting an application into microservices, versioned rollouts, and testing are still left to the user. Hence, they don't satisfy C1-C5.

Other Systems. There are many other solutions that make it easier for developers to write distributed applications, including dataflow systems [51, 59, 77], ML inference serving systems [8, 17, 42, 50, 73], serverless solutions [11, 24, 36], databases [29, 49], and web applications [66]. More recently, service meshes [46, 69] have raised networking abstractions to factor out common communication functionality. Our proposal embodies these same ideas but in a new domain of general serving systems and distributed applications. In this context, new challenges arise (e.g., atomic rollouts).

8 Discussion

8.1 Multiple Application Binaries

We argue that applications should be written and built as single binaries, but we acknowledge that this may not always be feasible. For example, the size of an application may exceed the capabilities of a single team, or different application services may require distinct release cycles for organizational reasons. In all such cases, it may be necessary for the application to consist of multiple binaries.

While this paper doesn't address the cases where the use of multiple binaries is required, we believe that our proposal allows developers to write fewer binaries (i.e., by grouping multiple services into single binaries whenever possible), achieve better performance, and postpone hard decisions related to how to partition the application. We are exploring how to accommodate applications written in multiple languages and compiled into separate binaries.

8.2 Integration with External Services

Applications often need to interact with external services (e.g., a Postgres database [29]). Our programming model allows applications to interact with these services as any application would. Not everything has to be a component. However, when an external service is used extensively within and across applications, defining a corresponding component might provide better code reuse, as sketched below.
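A hedged sketch of that pattern, reusing the Implements idiom from Figure 2; the UserStore interface and its Postgres-backed implementation are illustrative, not part of the proposal's API:

// UserStore wraps an external Postgres database as a component so
// that multiple applications can share one well-tested client.
type UserStore interface {
    GetUser(id string) (string, error)
}

type userStore struct {
    Implements[UserStore]
    // db *sql.DB // e.g., a Postgres connection pool, opened at startup
}

func (s *userStore) GetUser(id string) (string, error) {
    // A real implementation would query Postgres via s.db here.
    return "", fmt.Errorf("not implemented")
}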
8.3 Distributed Systems Challenges

While our programming model allows developers to focus on their business logic and defer a lot of the complexity of deploying their applications to a runtime, our proposal does not solve fundamental challenges of distributed systems [53, 68, 76]. Application developers still need to be aware that components may fail or experience high latency.

8.4 Programming Guidance

There is no official guidance on how to write distributed applications, hence there has been a long and heated debate on whether writing applications as monoliths or microservices is a better choice. Each approach comes with its own pros and cons. We argue that developers should write their application as a single binary using our proposal and decide later whether they really need to move to a microservices-based architecture. By postponing the decision of how exactly to split an application into different microservices, developers can write fewer and better microservices.

9 Conclusion

The status quo when writing distributed applications involves splitting applications into independently deployable services. This architecture has a number of benefits but also many shortcomings. In this paper, we propose a different programming paradigm that sidesteps these shortcomings. Our proposal encourages developers to (1) write monolithic applications divided into logical components, (2) defer to a runtime the challenge of physically distributing and executing the modularized monoliths, and (3) deploy applications atomically. These three guiding principles unlock a number of benefits and open the door to a bevy of future innovations. Our prototype implementation reduced application latency by up to 15× and reduced cost by up to 9× compared to the status quo.


References

[1] Acorn. https://www.acorn.io/.
[2] Actor model. https://en.wikipedia.org/wiki/Actor_model.
[3] Akka. https://akka.io.
[4] Amazon Web Services. https://aws.amazon.com/.
[5] Apache Avro. https://avro.apache.org/docs/1.2.0/.
[6] Apache Thrift. https://thrift.apache.org/.
[7] AWS Cloud Map. https://aws.amazon.com/cloud-map/.
[8] Azure Machine Learning. https://docs.microsoft.com/en-us/azure/machine-learning.
[9] Blue/green deployments. https://tinyurl.com/3bk64ch2.
[10] The C++ Actor Framework. https://www.actor-framework.org/.
[11] Cloudflare Workers. https://workers.cloudflare.com/.
[12] Continuous integration and delivery - CircleCI. https://circleci.com/.
[13] Dapr - Distributed Application Runtime. https://dapr.io/.
[14] Distributed systems safety research. https://jespen.io.
[15] Docker Compose. https://docs.docker.com/compose/.
[16] Google Cloud. https://cloud.google.com/.
[17] Google Cloud AI Platform. https://cloud.google.com/ai-platform.
[18] gRPC. https://grpc.io/.
[19] Helm. http://helm.sh.
[20] Horizontal Pod Autoscaling. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
[21] Istio. https://istio.io/.
[22] Jenkins. https://www.jenkins.io/.
[23] JSON. https://www.json.org/json-en.html.
[24] Kalix. https://www.kalix.io/.
[25] Kubernetes. https://kubernetes.io/.
[26] Locust. https://locust.io/.
[27] Micro | Powering the future of cloud. https://micro.dev/.
[28] Pods. https://kubernetes.io/docs/concepts/workloads/pods/.
[29] PostgreSQL. https://www.postgresql.org/.
[30] Protocol Buffers. https://developers.google.com/protocol-buffers.
[31] RabbitMQ. https://www.rabbitmq.com/.
[32] Remote direct memory access. https://en.wikipedia.org/wiki/Remote_direct_memory_access.
[33] REST API. https://restfulapi.net/.
[34] Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%. https://tinyurl.com/yt6nxt63.
[35] Skaffold. https://skaffold.dev/.
[36] Temporal. https://temporal.io/.
[37] Terraform. https://www.terraform.io/.
[38] The Go programming language. https://go.dev/.
[39] To Microservices and Back Again - Why Segment Went Back to a Monolith. https://tinyurl.com/5932ce5n.
[40] WebSocket. https://en.wikipedia.org/wiki/WebSocket.
[41] Online Boutique. https://github.com/GoogleCloudPlatform/microservices-demo, 2023.
[42] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
[43] A. Adya, R. Grandl, D. Myers, and H. Qin. Fast key-value stores: An idea whose time has come and gone. In HotOS, 2019.
[44] A. Adya, D. Myers, J. Howell, J. Elson, C. Meek, V. Khemani, S. Fulger, P. Gu, L. Bhuvanagiri, J. Hunter, R. Peon, L. Kai, A. Shraer, A. Merchant, and K. Lev-Ari. Slicer: Auto-sharding for datacenter applications. In OSDI, 2016.
[45] D. Ardelean, A. Diwan, and C. Erdman. Performance analysis of cloud applications. In NSDI, 2018.
[46] S. Ashok, P. B. Godfrey, and R. Mittal. Leveraging service meshes as a new network layer. In HotNets, 2021.
[47] A. Basiri, N. Behnam, R. De Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal. Chaos engineering. In IEEE Software, 2016.
[48] B. Bloom, J. Field, N. Nystrom, J. Östlund, G. Richards, R. Strniša, J. Vitek, and T. Wrigstad. Thorn: Robust, concurrent, extensible scripting on the JVM. In OOPSLA, 2009.
[49] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In OSDI, 2012.
[50] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In NSDI, 2017.
[51] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[52] J. Eker, J. Janneck, E. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong. Taming heterogeneity - the Ptolemy approach. In Proceedings of the IEEE, 2003.
[53] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. In Journal of the ACM, 1985.
[54] Y. Gan, M. Liang, S. Dev, D. Lo, and C. Delimitrou. Sage: Practical and scalable ML-driven performance debugging in microservices. In ASPLOS, 2021.
[55] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In ASPLOS, 2019.
[56] Y. Gan, Y. Zhang, K. Hu, Y. He, M. Pancholi, D. Cheng, and C. Delimitrou. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In ASPLOS, 2019.
[57] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In SIGCOMM, 2014.
[58] M. Henning. The rise and fall of CORBA: There's a lot we can learn from CORBA's mistakes. In Queue, 2006.
[59] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, 2007.
[60] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB, 2011.
[61] J. Armstrong. Erlang. In Communications of the ACM, 2010.
[62] L. Lamport. The temporal logic of actions. In ACM TOPLAS, 1994.
[63] A. Langley, A. Riddoch, A. Wilk, A. Vicente, C. Krasic, D. Zhang, F. Yang, F. Kouranov, I. Swett, J. Iyengar, J. Bailey, J. Dorfman, J. Roskind, J. Kulik, P. Westin, R. Tenneti, R. Shade, R. Hamilton, V. Vasiliev, W.-T. Chang, and Z. Shi. The QUIC transport protocol: Design and internet-scale deployment. In SIGCOMM, 2017.
[64] N. Lazarev, N. Adit, S. Xiang, Z. Zhang, and C. Delimitrou. Dagger: Towards efficient RPCs in cloud microservices with near-memory reconfigurable NICs. In ASPLOS, 2021.
[65] S. Lee, Z. Guo, O. Sunercan, J. Ying, T. Kooburat, S. Biswal, J. Chen, K. Huang, Y. Cheung, Y. Zhou, K. Veeraraghavan, B. Damani, P. M. Ruiz, V. Mehta, and C. Tang. Shard Manager: A generic shard management framework for geo-distributed applications. In SOSP, 2021.
[66] B. Livshits and E. Kiciman. Doloto: Code splitting for network-bound Web 2.0 applications. In FSE, 2008.
[67] S. Luo, H. Xu, C. Lu, K. Ye, G. Xu, L. Zhang, Y. Ding, J. He, and C. Xu. Characterizing microservice dependency and performance: Alibaba trace analysis. In SoCC, 2021.


[68] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., 1996.
[69] S. McClure, S. Ratnasamy, D. Bansal, and J. Padhye. Rethinking networking abstractions for cloud tenants. In HotOS, 2021.
[70] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging AI applications. In OSDI, 2018.
[71] H. Qiu, S. S. Banerjee, S. Jha, Z. T. Kalbarczyk, and R. K. Iyer. FIRM: An intelligent fine-grained resource management framework for SLO-oriented microservices. In OSDI, 2020.
[72] D. Raghavan, P. Levis, M. Zaharia, and I. Zhang. Breakfast of champions: Towards zero-copy serialization with NIC scatter-gather. In HotOS, 2021.
[73] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis. INFaaS: Automated model-less inference serving. In ATC, 2021.
[74] S. Bykov, A. Geller, G. Kliot, J. Larus, R. Pandya, and J. Thelin. Orleans: Cloud computing for everyone. In SoCC, 2011.
[75] M. Waseem, P. Liang, G. Márquez, and A. D. Salle. Testing microservices architecture-based applications: A systematic mapping study. In APSEC, 2020.
[76] Wikipedia contributors. Fallacies of distributed computing.
[77] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
[78] Y. Zhang, J. Yang, Z. Jin, U. Sethi, K. Rodrigues, S. Lu, and D. Yuan. Understanding and detecting software upgrade failures in distributed systems. In SOSP, 2021.
