Testing Resilience of Envoy Service Proxy with Microservices

Nikhil Dattatreya Nadig
Acknowledgements
I would like to thank my industrial supervisor Florian Biesinger for his unwavering
interest, support and patience in mentoring me throughout the thesis; Henry Bogaeus
for his constant willingness to help, his assistance in keeping my progress on schedule,
and for discussing new ideas with me; and all of my colleagues with whom I had the
pleasure of working.
I want to thank my KTH examiner, Prof. Seif Haridi and my supervisor Lars Kroll
for providing me with valuable inputs and guidance throughout the thesis. I am
grateful to all my friends for their support and encouragement all along.
Finally, my deep and sincere gratitude to my family for their unconditional
love, help and support. I am forever indebted to my parents for giving me the
opportunities and enabling me to pursue my interests. This would not have been
possible if not for them.
Authors
Nikhil Dattatreya Nadig
Spotify AB
Stockholm, Sweden

Examiner
Seif Haridi
KTH Royal Institute of Technology
Kista, Sweden

Supervisors
Florian Biesinger
Spotify GmbH
Berlin, Germany

Lars Kroll
KTH Royal Institute of Technology
Kista, Sweden
Contents
1 Introduction
  1.1 Context and Motivation
  1.2 Problem Statement and Research Question
  1.3 Purpose
  1.4 Goal
  1.5 Methodology
  1.6 Delimitations
  1.7 Outline
2 Background
  2.1 Evolution of Cloud Computing
  2.2 Resiliency Patterns in Microservices
  2.3 Related Work
3 Implementation
  3.1 Services Setup
  3.2 Apollo Services
  3.3 Load Testing Client
  3.4 Deployed Environment
  3.5 Experiment Design
4 Results and Evaluation
  4.1 Results
  4.2 Evaluation
  4.3 Envoy's Resilience Capabilities
5 Conclusions
  5.1 Discussion
  5.2 Future Work
References
1 Introduction
1.2 Problem Statement and Research Question
1.3 Purpose
The purpose of this thesis is to determine whether a service proxy deployed
as a sidecar in a microservices system induces substantial latency and reduces
the performance of the overall system to the extent that a system without sidecar
proxies performs better than one deployed with them. The resilience capabilities
provided by the service proxy, such as automatic retries, rate limiting, and circuit
breaking, which help services handle failures better, are tested, and the results are
compared with test runs without Envoy to understand the advantages of Envoy.
1.4 Goal
1.5 Methodology
1.6 Delimitations
Service proxies are a very young field. There is little academic research on this
topic, and most of the literature study has therefore been based on official
documentation, books, and blog posts.
Service proxies, and Envoy in particular, are used as sidecar proxies along
with service meshes such as Istio[18], Consul[7], and Linkerd2[22], which use
Kubernetes[26] as a container-orchestration system. Using all of these
components for testing can become complex, as the possible configuration
space to explore may become too large to be tractable. To limit the scope of this
research, service meshes and Kubernetes were not used in the experiment setup.
Additionally, features of the Envoy proxy such as distributed tracing and advanced
load balancing were not evaluated.
1.7 Outline
2 Background

2.1 Evolution of Cloud Computing

Most internet services followed a three-tier architecture until recently. The three
tiers were the data layer, the business layer, and the application layer. These
tiers did not necessarily correspond to the physical locations of the various
computers on a network, but rather to logical layers of the application[4]. The
data layer primarily interacted with persistent storage, usually a database. The
business layer accessed this layer to retrieve data to pass on to the application
layer; it was responsible for holding the logic that performed the computation of
the application, essentially moving data back and forth between the other two
layers. The application (presentation) layer was responsible for all the
user-facing aspects of the application: information retrieved from the data layer
was presented to the user in the form of a web page. Web-based applications
contained most of the data manipulation features that traditional applications
used.
2.1.2 Service Oriented Architecture
context. Any change requires the entire application to be built and re-deployed.
This does not scale well and is problematic to maintain. Microservices are a
software development technique, a variant of the service-oriented architecture
(SOA) style, that structures an application as a collection of loosely coupled
services. In a microservices architecture, services are fine-grained and the
protocols are lightweight. The benefit of dividing an application into smaller
services is that it improves modularity, which makes the application easier to
understand, develop, test, and deploy[25]. Figure 2.2 depicts the difference
between a microservices architecture and a monolithic or three-tiered
architecture.
System Virtual Machines Also known as full-virtualisation virtual machines,
these provide a substitute for a real machine, i.e. the functionality needed to
execute an entire operating system. A hypervisor is software that creates and runs
virtual machines. It uses native execution to share and manage hardware, which
allows multiple environments to coexist in isolation while sharing the same
physical machine[30].
However, virtual machines are less efficient than real machines since they access
the hardware indirectly: software runs on a guest OS, which must request access
to the hardware from the host. When several VMs run on a single host, the
performance of each VM is hindered by the limited resources available to it.
2.1.6 Comparison of Docker Containers and Virtual Machines
There are use cases where either containers or virtual machines are the better
choice. Virtual machines are preferable when applications need access to the
complete operating system's functionality and resources, or when several
applications must run on one server. Containers, on the other hand, are better
suited when the goal is optimal resource utilisation, i.e., running the maximum
number of applications on a minimum number of servers.
system, request tracing, tracking, fault tolerance, and debugging interactions
across the entire system are some of the issues faced while maintaining a
system based on a microservices architecture.
2.1.8 Service Proxy
HTTP L7 filter architecture: HTTP is such a crucial component of modern
application architectures that Envoy supports an additional HTTP L7 filter layer.
HTTP filters are plugged into the HTTP connection management subsystem and
perform different tasks such as buffering, rate limiting, and routing/forwarding.
First class HTTP/2 support: When operating in HTTP mode, Envoy supports
both HTTP/1.1 and HTTP/2. Envoy operates as a transparent HTTP/1.1 to
HTTP/2 proxy in both directions. Any combination of HTTP/1.1 and HTTP/2
clients and target servers can be bridged.
gRPC support: gRPC is an open-source RPC framework from Google that uses
HTTP/2 as the underlying multiplexed transport. Envoy supports all of the
HTTP/2 features required as the routing and load balancing substrate for gRPC
requests and responses. The two systems are complementary.
Advanced load balancing: Since Envoy is a self-contained proxy instead of a
library, it can implement advanced load balancing techniques in a single place
and make them accessible to any application. Currently, Envoy includes support
for automatic retries, circuit breaking, global rate limiting via an external rate
limiting service, request shadowing, and outlier detection. Future support is
planned for request racing.
NGINX Nginx is a web server that can also be used as a reverse proxy, load
balancer, mail proxy and HTTP cache[24]. It supports serving static content,
HTTP L7 reverse proxy load balancing, HTTP/2, and many other features. Envoy
provides the following main advantages over nginx as an edge proxy:
• Ability to run the same software at the edge as well as on each service node.
Many infrastructures run a mix of nginx and HAProxy. A single proxy
solution at every hop is substantially simpler from an operations perspective.
• HTTP/2 support.
• Pluggable architecture.
• Integration with a remote service discovery service.
AWS Elastic Load Balancer AWS Elastic Load Balancing was created by
Amazon[8]. It automatically distributes incoming application traffic across
multiple targets, such as Amazon EC2 instances, containers, IP addresses, and
Lambda functions, and it can handle the varying loads of application traffic in a
single Availability Zone or across multiple Availability Zones. Elastic Load
Balancing offers three types of load balancers that all feature the high
availability, automatic scaling, and robust security necessary to make
applications fault tolerant[9]. Envoy provides the following main advantages
over ELB as a load balancer and service discovery system:
Finagle Finagle is an extensible RPC system for the JVM, used to construct
high-concurrency servers.[12] Finagle implements uniform client and server APIs
for several protocols, and is designed for high performance and concurrency.
Twitter's infrastructure team maintains Finagle. It has many of the same features
as Envoy, such as service discovery, load balancing, and filters. Envoy provides the
following main advantages over Finagle as a load balancer and service discovery
package:
• Out-of-process and application-agnostic architecture. Envoy works with any
application stack.
2.2.1 Introduction
other services, this can cause cascading failures
• Microservices are optimised for being deployed in the cloud and they can be
relocated to different machines during runtime
• Cancel: If the fault is not transient or the operation is not likely to succeed
on retry, the application should cancel the operation and report an exception.
An example of this is an authentication failure caused by invalid credentials.
• Retry: If the fault is rare or unusual, it might have been caused by a corrupt
network packet being transmitted. In this case, the likelihood of the packet
being corrupt again is low, and retrying the request would probably yield a
correct response.
or the backlog of work is cleared. The application should wait for a suitable
time before retrying the request.
For services that often suffer from transient failures, the duration between retries
needs to be chosen to spread requests from multiple instances of the application as
evenly as possible. This reduces the chance of a busy service being continuously
overloaded. If retries keep a service from recovering, it will take even longer to
recover.
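As an illustration of such a policy, the following is a minimal sketch, not taken
from the thesis codebase, of a retry helper that combines capped exponential
backoff with full jitter; the class and parameter names are illustrative.

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Minimal retry helper with capped exponential backoff and full jitter.
// Randomising the wait keeps many clients from retrying in lockstep
// against a service that is trying to recover.
public final class Retry {

    public static <T> T withBackoff(Callable<T> operation,
                                    int maxAttempts,
                                    long baseDelayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                // Exponential backoff: baseDelay * 2^attempt, capped at 1 s.
                long cap = Math.max(1, Math.min(1000, baseDelayMillis << attempt));
                // Full jitter: sleep a uniformly random duration in [0, cap).
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap));
            }
        }
        throw last;
    }
}

Capping the delay bounds the worst-case wait, while the jitter spreads retries
from multiple application instances as evenly as possible over the backoff window.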
However, there can be situations where failures occur due to unusual events that
could take longer to fix. These failures can range from a simple loss of connection
to a complete failure of the service. In such scenarios, there is no point in an
application retrying an operation that is not likely to succeed. Instead, the
application should accept the failure of the operation and handle it accordingly.
Solution The Circuit Breaker pattern allows the service to continue without
waiting for the fault to be fixed while it determines whether the fault is long
lasting. The pattern also enables an application to detect whether the fault has
been resolved. If the problem appears to have been fixed, the application can try
to invoke the operation again.
A circuit breaker acts as a proxy for operations that might fail. The proxy
should monitor the number of recent failures that have occurred and decide
whether the operation should be allowed to proceed or an exception returned
immediately. The proxy can be implemented as a state machine with the
following states that mimic the functionality of a circuit breaker (a minimal
sketch follows the list):
• Closed: The request from the service is routed to the operation, and the
proxy maintains a count of recent failures; if a call fails, the proxy
increments the count. If the number of failures exceeds a threshold within a
given period of time, the proxy is set to the OPEN state and a timeout starts;
when the timeout expires, the proxy is placed in the HALF-OPEN state.
• Open: The request from the service fails immediately and an exception is
returned to the service.
• Half-Open: A limited number of requests are allowed through to invoke the
operation. If they succeed, the fault is assumed to be fixed and the proxy
switches to the CLOSED state; if any request fails, the proxy reverts to the
OPEN state.
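The following is a minimal sketch of this state machine, assuming a single failure
counter and a fixed open timeout; it is illustrative and not the mechanism Envoy
itself implements internally.

import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker mirroring the states described above.
// Counts recent failures while CLOSED; trips to OPEN past a threshold;
// after a timeout moves to HALF_OPEN and lets a trial request through.
public final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openTimeout;
    private State state = State.CLOSED;
    private int failures = 0;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration openTimeout) {
        this.failureThreshold = failureThreshold;
        this.openTimeout = openTimeout;
    }

    public synchronized boolean allowRequest() {
        if (state == State.OPEN
                && Instant.now().isAfter(openedAt.plus(openTimeout))) {
            state = State.HALF_OPEN;   // timeout expired: allow a trial request
        }
        return state != State.OPEN;    // OPEN requests fail immediately
    }

    public synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;          // trial succeeded: close the circuit
    }

    public synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;        // trip (or re-trip) the breaker
            openedAt = Instant.now();
        }
    }
}

Callers would invoke allowRequest() before each operation and report the outcome
via onSuccess() or onFailure(), so the breaker can trip, wait, and probe as
described above.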
There may be a number of clients, each implementing various types of retry or
rate-limiting policies. Such clients can easily starve resources from other clients
by saturating a service, and may continue to do so until they completely bring
down the service.
Solution To avoid such abuse, rate limiting is enabled for the service. Envoy
allows for reliable global rate limiting at the HTTP layer, as opposed to IP-based
rate limiting or the application-level rate limiting that many web frameworks
provide.
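For intuition, the following is a minimal token-bucket sketch of the underlying
idea; Envoy's global rate limiting instead delegates this admission decision to an
external rate limit service shared by all proxies, so this class is only an
illustration with assumed names and parameters.

// Minimal token-bucket rate limiter illustrating request throttling:
// tokens refill at a steady rate up to a burst capacity, and each
// admitted request consumes one token.
public final class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // steady-state refill rate
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request is admitted, false if it should be
    // rejected (e.g. with HTTP 429) to protect the service behind it.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}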
Figure 2.5: OSI Model[6].
• Routing: The network layer protocols determine the most suitable route from
source to destination. This function of the network layer is known as routing.
• Logical addressing: To identify each device uniquely, the network layer
defines an addressing scheme. The sender's and receiver's IP addresses are
placed in the header by the network layer. Such an address distinguishes each
device uniquely and universally.
• Segmentation and Reassembly: This layer receives the message from the
session layer and breaks it into smaller blocks. Each segment produced has a
header associated with it. The transport layer at the destination reassembles
the message.
• Session establishment, maintenance and termination: The layer allows two
processes to establish, manage and terminate a connection.
When there are several containers deployed on a server, a lot of operational
work needs to be carried out: containers may need to be shut down and
respawned when there are issues, or additional instances deployed when there
is more traffic in the system. System resource management is an essential
aspect of managing servers, and all resources should be used optimally to
reduce operational costs. Performing all of this manually is not only arduous
but unnecessary.
Container orchestrators such as Kubernetes can be used instead for this purpose.
Kubernetes is an open-source container orchestration system for automating
application deployment, scaling, and management. Google originally designed
it, building on its internal Borg system[32], and later open-sourced it. It provides
a platform for automating deployment, scaling, and operations of application
containers across clusters of hosts. Containers deployed onto hosts are usually in
replicated groups. When deploying a new container into a cluster, the container
orchestration tool schedules the deployment and finds the most appropriate host
to place the container on, based on predefined constraints (for example, CPU or
memory availability). Once the container is running on the host, the
orchestration tool manages its lifecycle according to the specifications laid out in
the container's definition file, for example, a Dockerfile.
As seen in figure 2.6, there are several components that make up a Kubernetes
orchestrator. The following are the components and their functions:
Figure 2.6: Kubernetes Architecture[11].

– kube-scheduler:
It is a component on the master that monitors newly created pods that have
no node assigned, and selects a node for them to run on. Scheduling decisions
are based on individual and collective resource requirements,
hardware/software/policy constraints, affinity and anti-affinity
specifications, data locality, inter-workload interference and deadlines.
– kubelet:
An agent that runs on each node and makes sure that containers are running
and healthy. The kubelet is not responsible for containers that were not
created by Kubernetes.
2.3.3 gRPC
gRPC (gRPC Remote Procedure Calls) is an open-source remote procedure call
(RPC) system initially developed at Google[15]. It uses HTTP/2 for transport and
Protocol Buffers as the interface description language, and provides features such
as authentication, bidirectional streaming and flow control, blocking or non-
blocking bindings, and cancellation and timeouts. It generates cross-platform
client and server bindings for many languages. The most common usage scenarios
include connecting services in a microservices-style architecture and connecting
mobile devices and browser clients to backend services.
The main usage scenarios (a client-side sketch follows the list):
• Generate server and client code using the protocol buffer compiler.
• Use the gRPC API of the language used to write a simple client and server.
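As a concrete example, the following is a minimal Java client sketch. It assumes a
hypothetical Greeter service with a SayHello RPC defined in a .proto file and
compiled with protoc, so GreeterGrpc, HelloRequest and HelloReply are generated
classes; the host and port are illustrative.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Minimal blocking gRPC client, assuming the protoc-generated classes
// (GreeterGrpc, HelloRequest, HelloReply) are on the classpath.
public final class GreeterClient {
    public static void main(String[] args) {
        // Plaintext channel for local testing; production would use TLS.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 5990)
                .usePlaintext()
                .build();
        GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
        HelloReply reply = stub.sayHello(
                HelloRequest.newBuilder().setName("world").build());
        System.out.println(reply.getMessage());
        channel.shutdown();
    }
}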
Figure 2.7: gRPC Architecture Diagram.
2.3.4 Service Mesh
Istio Istio is built by Google and IBM in collaboration with Lyft[18]. Istio is an
open source framework for connecting, monitoring, and securing microservices.
It allows the creation of a network or "mesh" of deployed services with load
balancing, service-to-service authentication, and monitoring, without requiring
any changes in service code. Istio provides support to services by deploying
an Envoy sidecar proxy to each of the application’s pods. The Envoy proxy
intercepts all network communication between microservices, and is configured
and managed using Istio’s control plane functionality.
Kubernetes. Development is continuous, and certain features around reliability,
security and traffic shifting are still under development.
Consul Consul is a control plane built by HashiCorp that works with multi-
datacenter topologies and specialises in service discovery. Consul works with
many data planes and can be used with or without other control planes such as
Istio.
3 Implementation
This chapter describes the implementation of the setup that was used to test the
resiliency of Envoy proxies deployed as sidecars.
A collection of three microservices, each with its own sidecar proxy, is deployed
in two different configurations. Each configuration is load tested with different
parameters. The services communicate with each other through gRPC. The
intention was to find the most commonly recurring configurations of
microservices in a large-scale system and perform experiments on them to
understand the effects of Envoy. The following were the two common
configurations found:
3.1.1 Configuration 1
In this configuration, as seen in figure 3.1, the client makes a request to service 1,
which calls two upstream services, 2 and 3, simultaneously. Service 1 returns a
response only after receiving responses from the two upstream services. Each of
the services is deployed with a sidecar proxy which handles the communication
between the services.
3.1.2 Configuration 2
In this configuration, as seen in figure 3.2, the client makes a request to service 1,
which calls service 2; service 2 in turn makes a call to service 3. Service 1 sends a
response after it receives a response from service 2, which only responds once it
has received a response from service 3. Each of the services is deployed with a
sidecar proxy which handles the communication between the services.
Figure 3.1: Configuration 1.
• apollo-api: The apollo-api library defines the interfaces for request routing
and request/reply handlers.
Figure 3.3: Apollo Setup.
• apollo-core: The apollo-core library manages the lifecycle (loading,
starting, and stopping) of the service and defines a module system for adding
functionality to an Apollo assembly.
As seen in figure 3.3, when the user's code starts a service with a callback,
apollo-http-service instantiates different modules such as okhttp-client,
jetty-http-server and apollo-environment. It then connects the different modules
and starts the service by calling the apollo-core component.
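For illustration, the following is a minimal Apollo assembly sketch based on the
public Apollo documentation; the service name and route are illustrative, and this
is not the exact service code used in the experiments.

import com.spotify.apollo.Environment;
import com.spotify.apollo.httpservice.HttpService;
import com.spotify.apollo.httpservice.LoadingException;
import com.spotify.apollo.route.Route;

// Minimal Apollo service: boot() wires up the modules described above
// and hands control to init(), where request handlers are registered.
public final class PingService {

    public static void main(String[] args) throws LoadingException {
        HttpService.boot(PingService::init, "ping", args);
    }

    static void init(Environment environment) {
        // Register a synchronous handler on the routing engine.
        environment.routingEngine()
                .registerAutoRoute(Route.sync("GET", "/ping", requestContext -> "pong"));
    }
}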
3.3 Load Testing Client
3.3.1 ghz
A tool called ghz is used in the experiment to make the service requests[2]. It
is a command-line utility and Go package for load testing and benchmarking
gRPC services. It can be used for testing and debugging services locally, and
in automated continuous integration environments for performance regression
testing. Additionally, the core of the command-line tool is implemented as a Go
library package that can be used to implement performance tests
programmatically as well.
3.3.2 ghz-web
ghz-web is a web server and a web application for storing, viewing and comparing
data that has been generated by the ghz CLI[14]. Figure 3.4 shows the trends of
the average time taken by each request in the experiments, along with the fastest
request, the slowest request, the requests per second, and the time taken by 95%
and 99% of requests.
3.4 Deployed Environment

The experiments were run on a shared virtual machine instance on Google Cloud
Platform with 32-core Intel Xeon E5 v3 (Haswell) CPUs and 120 GB of RAM.
3.5 Experiment Design

Using the setup of services shown in figures 3.1 and 3.2, we perform load testing
using ghz. Each microservice can be configured with a certain error probability
and latency. Using Java CompletableFuture [5], the requests are handled
asynchronously: the latency (in milliseconds) set as a parameter in the request is
applied by CompletableFuture, which has a new thread sleep in the background.
For the purposes of the experiment, we set the latency of each service at 10 ms
and vary the error probability. The data from each experiment without using
Envoy as a sidecar proxy is compared with the data from experiments where the
microservices use a sidecar proxy. The error probabilities chosen for the
experiments are 0%, 0.0035%, 0.02%, 0.2%, 2%, and 20%.
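The following is a minimal sketch of such a handler, assuming illustrative class
and method names rather than the exact experiment code: each request is
completed on a background thread that sleeps for the configured latency and fails
with the configured probability.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of per-service fault injection: respond asynchronously after a
// fixed latency, and fail a configurable fraction of requests.
public final class FaultInjectingHandler {
    private final long latencyMillis;      // e.g. 10 ms per service
    private final double errorProbability; // e.g. 0.02 for a 2% error rate

    public FaultInjectingHandler(long latencyMillis, double errorProbability) {
        this.latencyMillis = latencyMillis;
        this.errorProbability = errorProbability;
    }

    public CompletableFuture<String> handle(String request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latencyMillis); // simulated processing latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (ThreadLocalRandom.current().nextDouble() < errorProbability) {
                throw new RuntimeException("injected error"); // simulated failure
            }
            return "ok: " + request;
        });
    }
}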
The error probability of 0% was chosen to test the setups in an ideal scenario.
0.0035% is an acceptable error-reply percentage for a large-scale production-grade
system. Tenfold increments from 0.02% to 20% were chosen to understand how the
systems performed with and without Envoy and its resiliency patterns.
4 Results and Evaluation
4.1 Results
This chapter contains the results of the experiments conducted. It is divided into
sections: the first contains the results of experiments run without the Envoy
service proxy, and the second the results of experiments run with it. This chapter
mainly contains the results and a preliminary analysis; a detailed analysis is
provided in the next chapter.
For the purposes of the experiments, the latency of each microservice was set
at 10 ms and the error probability was increased to observe the effects on the
system. Experiments were performed using the load-testing tool ghz by sending
10000 requests on each run. Runs with an error probability of 0% test the system
in an ideal scenario without any errors, establishing the average time taken per
request and the throughput; these experiments were repeated ten times. The error
probability was then increased to 0.0035% and then to 0.02%, which is
representative of a production-grade large-scale system. At these error
probabilities, the resilience of the Envoy proxies is tested but not pushed to its
limit. By increasing the error probability to 0.2%, 2%, and finally 20%, we can
test the Apollo services and gRPC and observe how they handle failures;
additionally, the resilience of Envoy is pushed to its limit. We were able to
observe how Envoy uses automatic retries and circuit breaking to limit the
number of errors.
4.1.1 Experiments without Envoy

The experiments are first run without using Envoy in both configurations, as
shown in figures 3.1 and 3.2. The time values in the tables are in milliseconds.
P95 and P99 refer to the time taken by 95% and 99% of the requests respectively.
As seen from table 4.1 there is no significant difference in the time taken by the
different configurations with the same parameters.
4.1.2 Experiments with Envoy
Table 4.1 contains comparison data of experiments run with and without Envoy
in configuration 1 and table 4.2 contains the comparison of time taken by
experiments with and without Envoy in configuration 2.
As seen in table 4.2, requests in the experiments without Envoy complete faster
than in the experiments that use Envoy. When the error probability reduces, the
values of P95 and P99 for experiments with and without Envoy converge. In an
ideal scenario of 0% error probability, the time taken by experiments using Envoy
is on average 3.42% slower than by the experiments without Envoy.

Table 4.2: Request times (in milliseconds) with and without Envoy in
configuration 2

Error Probability              No Envoy    Envoy
0.2         fastest            21.7        22.83
            slowest            47.65       1390
            p95                24.55       124.15
            p99                27          257.27
0.02        fastest            21.97       22.7
            slowest            33.87       154.36
            p95                24.25       41.69
            p99                25.55       59.57
0.002       fastest            21.62       22.7
            slowest            36.12       154.36
            p95                24.92       41.69
            p99                27.41       59.57
0.0002      fastest            21.68       22.33
            slowest            37.73       75.11
            p95                25.42       29.67
            p99                29          35.44
0.000035    fastest            21.8        21.94
            slowest            37.61       64.63
            p95                26.57       25.54
            p99                30.62       30.23
0           fastest            21.67       22.15
            slowest            36.42       35.69
            p95                26.02       26.66
            p99                30.74       31.66
4.2 Evaluation
4.2.1 Envoy in Production
The results in table 4.1 show that the average time taken by services using Envoy
is slightly higher than by those which do not use Envoy as a sidecar proxy. Each
of the services was set to a certain error probability while performing the load
testing; when each service has a certain error probability, the error probability of
the entire system is compounded. For example, if the error probability of each
service is 0.2, the compounded error probability of the entire system with three
services is 1 - (1 - 0.2)^3 = 0.488. This means that about 48.8% of all requests
will be errors. Figure 4.1 depicts the distribution of errors for an experiment with
each service having an error probability of 0.2.
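To make the compounding explicit, the following is a small sketch, assuming
independent failures across the three services in a request chain, that reproduces
the system-level values listed in table 4.3.

// Compounded error probability for a request traversing n services, each
// failing independently with probability p: the request succeeds only if
// every service succeeds, so it fails with probability 1 - (1 - p)^n.
public final class CompoundedError {

    static double compounded(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double[] perService = {0.2, 0.02, 0.002, 0.0002, 0.000035, 0.0};
        for (double p : perService) {
            // Three services per request, as in both experiment configurations.
            System.out.printf("%-10s -> %.12f%n", p, compounded(p, 3));
        }
    }
}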
When the error probability of each service reduces, as shown in table 4.3, the
error probability of the entire system reduces too. When there is a high number of
errors, throughput is hindered severely. When many requests are failing, circuit
breakers are needed to make sure that these errors do not affect the entire
system's performance. This is done by throttling the number of incoming requests
and allowing the system to recover from the errors.

Error Probability of each service    Error Probability of the system
0.2                                  0.488
0.02                                 0.058808
0.002                                0.005988008
0.0002                               0.000599880008
0.000035                             0.000104996325
0                                    0

Table 4.3: Compounded error probability of the entire system for the
corresponding error probability of each service

When testing the services, the throughput, i.e. requests per second (RPS), was
higher for the services without Envoy at error probabilities of 0%, 0.0035%, and
0.02%. However, the RPS declined rapidly at higher error probabilities. Figure 4.2
shows the trend in which the RPS reduced at higher error probabilities.
4.3 Envoy’s Resilience Capabilities
When Envoy is deployed as a sidecar proxy, it employs features such as automatic
retries, rate limiting and circuit breaking to minimise the number of errors
thrown by the system. As seen in figure 4.3, the number of errors thrown by the
services with Envoy is lower than the number of errors thrown by the services
which do not use Envoy. At a 20% error probability per service, this reduces the
percentage of overall errors to 13.8%, compared to 48.8% when not using Envoy.
Having higher error percentages decreases the RPS as shown in figure 4.2.
When running the experiments with the Envoy service proxies deployed as
sidecars, we were able to capture from the proxies' logs the number of failed
requests, the number of requests that were retried, and the number of requests
that were not retried by each sidecar. These provide insights into the resilience
offered by the service proxies. The logs from an experiment with an error
probability of 2%, retrieved from each of the Envoy sidecar proxies, are presented
in Appendix A.
The upstream_rq_200 log is the total number of requests the Envoy proxy retried
and received an HTTP 200 response for. upstream_rq_completed provides the
total number of upstream requests that were completed.
retry_or_shadow_abandoned is the total number of times shadowing or retry
buffering was cancelled due to buffer limits; this happens when too many requests
are failing and the proxy cannot perform retries on all of them due to limits on
resources. upstream_rq_retry_overflow is the total number of requests not
retried due to circuit breaking; this occurs when the circuit breaker is set to the
OPEN state to give the service time to self-correct. upstream_rq_retry_success
is the total number of successful retries.
As seen in figure 4.2, due to the automatic retry capability of Envoy, the requests
per second in the experiments run with Envoy are much higher than in the ones
run without Envoy. This is because Envoy retries failed requests, which increases
the throughput as well as the total time taken to complete the experiment; the
total time taken to send all requests determines the requests per second for the
experiment.
5 Conclusions
With the adoption of microservices, it has become clear that even though they
reduce complexity in the services themselves, they add a lot of operational
complexity. To improve availability, resilience and scalability, tooling such as
Kubernetes, the Envoy proxy, and Istio has been developed. An Envoy service
proxy deployed as a sidecar to microservices provides higher resilience, advanced
load balancing and distributed tracing. In this thesis, we determined, in the
context of our testing setup, whether Envoy adds latency to the system and
whether its disadvantages outweigh its advantages. After conducting the
experiments and analysing the results, we can conclude that the Envoy service
proxy adds very little latency to the request time. Although the presence of Envoy
adds latency to the system, the data obtained from the experiments shows that
this latency was not statistically significant: we performed a Student's t-test [17]
on the time taken by both configurations with and without Envoy. Additionally,
the resilience offered by the Envoy service proxy improves the overall
performance of the system, making it more resilient to failures. Based on the
experiment setup, the data gathered and the analysis made, the advantages of
having an Envoy sidecar proxy outweigh the disadvantage it brings in the form of
additional latency.
5.1 Discussion
The goals initially set when testing Envoy were to see whether it added significant
latency to the system and to observe its resilience features. When performing load
testing with very low error probabilities for each service, such as 0.0035%, 0.02%
and 0.2%, the sidecar proxy performed as expected, retrying all the failed requests
and reducing the system's error percentage to zero. However, at a very high error
probability such as 20% per service, too many requests were failing, and even with
the Envoy proxies not all failures could be handled. There were still fewer failures
than in the system without Envoy, due to Envoy's automatic retries, rate limiting,
and circuit breaking features. While trying to disable these features to see whether
they made any difference in request times, we found that they cannot be truly
disabled; by design, they are meant to be present. The workaround was to increase
the values for maximum connections and maximum requests to a very high
number. The configuration used for Envoy is presented in Appendix B. Similarly,
by reducing the number of retries to zero, the automatic retries feature could be
disabled.
Service proxies and service meshes are very young and developing fast, with
additional features being added to make systems more resilient, scalable and
available. Envoy service proxies are used as the data plane in the Istio service
mesh, whose control plane provides additional features such as telemetry, traffic
routing and policy enforcement. These features help a system in a microservices
architecture not only to grow in scale but also assist in maintaining and
monitoring the system. Introducing Kubernetes and service meshes into the
experiments would give insights into the effects these systems might have on each
other and on the performance of the overall system. Our experiments consisted of
three services having Envoy proxies as sidecars; however, a large-scale system has
dozens if not hundreds of services, and performing tests on such a setup would
give more accurate data about the advantages of Envoy.
References

[2] bojand/ghz: Simple gRPC benchmarking and load testing tool.
https://github.com/bojand/ghz. (Accessed on 06/10/2019).

[8] Elastic Load Balancing - Amazon Web Services.
https://aws.amazon.com/elasticloadbalancing/. (Accessed on 06/10/2019).

[9] Elastic Load Balancing - Amazon Web Services.
https://aws.amazon.com/elasticloadbalancing/. (Accessed on 06/11/2019).

[13] Get Started, Part 1: Orientation and setup | Docker Documentation.
https://docs.docker.com/get-started/. (Accessed on 06/10/2019).

[19] Krafzig, Dirk, Banke, Karl, and Slama, Dirk. Enterprise SOA: Service-
Oriented Architecture Best Practices (The Coad Series). Upper Saddle
River, NJ, USA: Prentice Hall PTR, 2004. ISBN: 0131465759.

[24] NGINX | High Performance Load Balancer, Web Server, & Reverse Proxy.
https://www.nginx.com/. (Accessed on 06/10/2019).

[25] (PDF) Service-Oriented Architecture and Software Architectural Pattern
– A Literature Review | Kholed Langsari - Academia.edu.
https://www.academia.edu/24716210/Service-Oriented_Architecture_and_Software_Architectural_Pattern_A_Literature_Review.
(Accessed on 06/11/2019).

[30] Smith, J.E. and Nair, Ravi. "The architecture of virtual machines". In:
Computer 38.5 (May 2005), pp. 32–38. DOI: 10.1109/mc.2005.173. URL:
https://doi.org/10.1109/mc.2005.173.

[33] What's a service mesh? And why do I need one? | Buoyant.
https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/.
(Accessed on 06/10/2019).

[34] White Paper: The Definitive Guide to Container Platforms | Docker.
https://www.docker.com/resources/white-paper/the-definitive-guide-to-container-platforms.
(Accessed on 06/10/2019).
Appendices

Appendix - Contents
A Envoy Logs
B Envoy Configuration
A Envoy Logs
ENVOY 1
cluster.envoytest.retry.upstream_rq_200: 417
cluster.envoytest.retry.upstream_rq_2xx: 417
cluster.envoytest.retry.upstream_rq_completed: 417
cluster.envoytest.retry_or_shadow_abandoned: 0
cluster.envoytest.upstream_rq_retry: 417
cluster.envoytest.upstream_rq_retry_overflow: 38
cluster.envoytest.upstream_rq_retry_success: 408
ENVOY 2
cluster.envoytest.retry.upstream_rq_200: 396
cluster.envoytest.retry.upstream_rq_2xx: 396
cluster.envoytest.retry.upstream_rq_completed: 396
cluster.envoytest.retry_or_shadow_abandoned: 0
cluster.envoytest.upstream_rq_retry: 396
cluster.envoytest.upstream_rq_retry_overflow: 15
cluster.envoytest.upstream_rq_retry_success: 386
ENVOY 3
cluster.envoytest.retry.upstream_rq_200: 413
cluster.envoytest.retry.upstream_rq_2xx: 413
cluster.envoytest.retry.upstream_rq_completed: 413
cluster.envoytest.retry_or_shadow_abandoned: 0
cluster.envoytest.upstream_rq_retry: 413
cluster.envoytest.upstream_rq_retry_overflow: 13
cluster.envoytest.upstream_rq_retry_success: 405
B Envoy Configuration
# Envoy admin interface, used to read the statistics shown in Appendix A.
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
static_resources:
  listeners:
  # Ingress listener in front of the local service.
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 5991
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                  grpc: {}
                route:
                  host_rewrite: 10.48.32.124
                  cluster: envoytest
                  # Automatic retries for failed gRPC calls.
                  retry_policy:
                    retry_on: cancelled,deadline-exceeded,internal,resource-exhausted,unavailable
                    num_retries: 10
          http_filters:
          - name: envoy.router
  clusters:
  - name: envoytest
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    # Thresholds set very high so the circuit breaker rarely throttles
    # (see the discussion in section 5.1).
    circuit_breakers:
      thresholds:
      - priority: HIGH
        max_connections: 10000
        max_pending_requests: 10000
        max_requests: 10000
        max_retries: 10000
    hosts:
    - socket_address:
        address: 10.48.32.124
        port_value: 5990