Khan Fouzia
Many companies are using a distributed approach in which systems are developed in smaller chunks of functionality called microservices. Due to their smaller size, microservices offer benefits such as smaller development teams, an independent choice of development technologies and a shorter time to market, to name a few. However, as systems grow, the number of microservices can increase to hundreds or thousands, which makes it more challenging for development teams and project managers to analyse service performance, prioritize resources and see the overall picture from a business and management perspective. This thesis investigates the possibility of developing a tool, by combining existing open source tools, which could help overcome this challenge. It researches and explores various existing open source microservices monitoring and tracing tools to determine whether they could be used to build a microservices visualization tool comprised of a dynamic call graph displaying performance and business metrics for individual microservices. These tools are individually explored and run with test projects to verify their potential, and their output is carefully analysed before deciding whether a tool should be included in the implementation. The results show that such a tool can be developed using existing open source tools, namely Jaeger, Prometheus and Grafana. The implemented tool is capable of constructing a dynamic microservices dependency network graph which includes various metrics, e.g. the number of calls made from one microservice to another, the average response time per service, the average load per minute per service, the ratio of open and closed issues, the ratio of open and closed bugs, the ratio of cost and revenue, and the amount of effort spent on each microservice in the system. This tool can make it easier for developers and managers to visualize the number of calls being made and the performance-related challenges in microservices architecture-based systems simply by looking at the graph, and it can also help business managers make strategic decisions on the basis of an overall picture of the system.
1 Introduction
Over time, technology has evolved at a very fast pace. Many improvements in coding practices and software architectures have been introduced in the software development industry to improve the quality of developed products. For example, continuous delivery helps software companies deliver products to their customers faster. Similarly, automating recurrent development and operational processes has reduced the overhead for developers. However, as products grow and become complicated, the size of the codebase increases and it becomes very difficult for developers to debug issues or identify the areas of code to modify for additional features. Despite following a modular approach to keep various functionalities separate, it becomes very difficult to maintain code quality and developer productivity with a traditional monolithic approach. Therefore, the need for microservices based systems arises.
Microservices based systems comprise several smaller, autonomous services communicating with each other. These services are called microservices. A microservices architectural system is divided into smaller functionalities and each microservice is developed to cater to one functionality. These services are independent and different development teams can work on them independently using their desired tools and technologies; hence, each code base remains small. Owing to the benefits of the microservices based architecture, many big companies like Uber, Netflix, Amazon and eBay have adopted it, and many other companies are planning to move towards it.
Despite the many advantages, one of the major challenges of using microservices is their complex communication and their dependencies upon each other. Currently, there are systems with hundreds or thousands of microservices communicating with each other, and it has become very difficult for companies to make strategic and business decisions for such large and complex systems. It can also be hard for developers to find problematic connections between the services and to locate performance issues. Hence, a microservices visualization tool is needed which could draw an overall service dependency graph and display some important metrics, making it easier to reach business decisions, e.g. deciding which service to invest time and money in based on its usage and priority, and to identify service performance bottlenecks. Currently, there are some open source and commercial distributed tracing tools able to perform this task. However, the open source tools draw only a very basic dependency graph of microservices with no additional metrics, e.g. service latency time, load per minute, cost vs revenue, and open vs closed issues and bugs.
2 Background
This chapter covers the concepts related to microservices, their monitoring and how microservices communicate with each other. The chapter also includes the concepts of logging, monitoring and tracing, how they differ from each other and some open source logging, monitoring and tracing tools. Each section is divided into subsections to describe that particular aspect in detail.
2.1 Microservices
Microservices based systems offer the following characteristics over monolithic architectures [19]:
Heterogeneous: In a system of microservices, it is possible to choose the most suitable programming language for each particular microservice. This aspect is very useful when some functionality of the system needs to excel in different attributes compared to other functionality. For example, a microservice which needs better performance can be developed in a language which helps achieve this attribute. Besides this, heterogeneity makes it easier to try different or new programming languages and frameworks. In a monolithic system, it is quite hard to try a new technology, as it imposes a huge risk onto the whole system. With microservices, a new technology can be used for the least important microservice to evaluate its benefits before choosing it for a larger chunk.
Resilient: As monolithic systems are not resilient, a failure in one component can break down the whole system. To solve this problem, one can run the whole system on more than one machine, so that if a failure occurs on one, the other keeps running. In microservices, by contrast, resiliency avoids the breakage of the whole system by degrading the system's functionality. However, to ensure resilience, it is important to first understand the possible failures which can break the system.
Scaling: In monolithic systems, if a smaller part of the system needs scaling up, the whole system has to be scaled, which can cause extra cost. As microservices comprise smaller independent chunks of code, only the relevant services can be scaled up or down on demand.
Easy Deployments: A major issue with monolithic systems is frequent re-deployments and longer downtimes. Once a change (small or large) is made, the whole system has to be re-deployed. Sometimes the re-deployment package might contain plenty of new work or an entirely new release, which has a greater potential to introduce risk into the working system. Once this new release is deployed, it can be quite challenging to identify problems and the areas where they occur, and the re-deployment can be quite time consuming as well. Microservices based systems ease this deployment process and allow deploying code only to the microservice it belongs to, while keeping the other services unaffected. This practice also makes it easier to debug issues if they occur, or even to roll back the whole deployment if required.
Organizational Alignment: Large teams working on the same code-base can create problems, whereas smaller teams working on small code-bases can be more productive. Hence, microservices ease this kind of organizational structure, where one team can be assigned a single microservice to improve code understanding.
A service registry provides a registration API, used by the services to register or de-register themselves, and a query API, used to make requests to a known service. Client-side and server-side discovery are the two patterns of service discovery. In client-side discovery, services use the service registry to query an instance, whereas in server-side discovery services use a router which sends a service request to the service registry, which forwards the request to the actual service instance. In the end, microservices based architectures cannot overcome these challenges if the requirements are incorrect, as this will lead to overall architectural decay. On the other hand, excellent DevOps skills are needed to monitor and deploy such systems. [10]
Another challenge when dealing with microservices is their partitioned database architecture, which becomes an issue for transactions that need to update multiple databases belonging to various microservices in a system. Testing microservices systems is also very challenging compared to monoliths. For testing a whole business flow, the entire system needs to be in place; an alternative is to create stubs for the unavailable services for test use. [11]
Figure 2.1 shows how the service registry acts in a microservices based architecture (MSBA).
Some proprietary tools are IBM WebSphere MQ and Microsoft Message Queuing. Others are open source tools such as RabbitMQ, ActiveMQ and JBoss. [13]
In the subsequent subsections, three common asynchronous communication technologies are discussed.
2.2.2 RabbitMQ
2.2.3 ActiveMQ
ActiveMQ (written in Java and implementing the JMS standard) is also a message broker and follows the same message delivery mechanism as other message brokers do. Many ActiveMQ brokers can be combined to create a broker network, which increases the performance capabilities. In contrast to RabbitMQ, without middleware an ActiveMQ broker cannot send data from a C# based application using MSMQ to a Ruby based application using STOMP, due to the limitation that JMS based applications can only directly transfer data between Java platforms. [14]
Old data can be very useful at times to predict future events. In the past, Amazon and Delta Airlines suffered immense losses due to the unavailability of their systems' internal data: they could not predict the failure in their data centers which caused the systems to shut down. Similarly, in the case of microservices, we need to keep track of information such as CPU usage, memory usage, the request initiator and request latency time to keep a check on a microservice's health and ability to work. Keeping track of the internal states of microservices (or systems in general) is called observability, and generally it can be achieved by logging, metrics and tracing. Here, logging means tracking the sequence of events.
In logs, everything about an event, such as its timestamp, latency, trigger source and the status of the event/call, can be written. The logs are stored so that useful information can later be deduced from them. A metric is a measure of an event's attributes; e.g. the number of events occurring per minute is a metric. Similarly, the latency time, or the average latency of all the events per minute, can help determine the response or execution time. These metrics can be very helpful in monitoring the system. For example, you can monitor the system by configuring an alert which triggers if the latency time exceeds a certain threshold. Logs can also help perform tracing. Tracing is about tracking a sequence of events within a system/microservice or across multiple systems/microservices. Within the same system it is possible to trace a request to identify where exactly the delays occurred. Similarly, events can be traced from one microservice to another to identify failures or delays. [20]
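As a rough illustration of how a metric can be derived from logged events, the TypeScript sketch below computes the average latency per minute from a list of log events; the field names are assumptions made for this example, not part of any particular logging library.

interface LogEvent {
  timestamp: number; // epoch milliseconds
  latencyMs: number;
}

// Bucket events by minute and average their latencies.
function avgLatencyPerMinute(events: LogEvent[]): Map<number, number> {
  const buckets = new Map<number, { sum: number; count: number }>();
  for (const e of events) {
    const minute = Math.floor(e.timestamp / 60000);
    const b = buckets.get(minute) ?? { sum: 0, count: 0 };
    b.sum += e.latencyMs;
    b.count += 1;
    buckets.set(minute, b);
  }
  return new Map([...buckets].map(([minute, b]) => [minute, b.sum / b.count]));
}

An alert rule could then compare these per-minute averages against a threshold.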
2.3.1 Monitoring
Without a centralized monitoring tool, it is quite tedious to take care of all these individual aspects by logging into the microservices separately and scanning for the right information. Hence, it is important to obtain the bigger picture by combining the smaller chunks together. To achieve this, one needs to start with the simplest picture, which is monitoring a single node or server. Three different scenarios of monitoring are described below, with the monitoring level increasing from simple to complex. [21]
Single Node and Single Server Monitoring: A single service running on a single server requires monitoring of the host, the server logs and the performance of the microservice itself. Monitoring the host might involve monitoring the CPU and the memory. Server logs can be monitored using a monitoring tool. The most important task is monitoring the response/latency time and the error rate of the microservice. To obtain information on latency time and error rate, one can monitor the logs of the server hosting the microservice or directly the logs coming out of the microservice.
Single Node and Multi Server Monitoring: In this scenario, multiple instances of the same service might be running on different hosts. A load-balancer can be used to manage these instances. To monitor this kind of setup, one needs to monitor the hosts to get metrics such as CPU and memory, and the logs of the microservice instances. A monitoring system such as Nagios can be used to achieve this collective monitoring of the logs of the microservice instances. Response times could also be monitored by tracking the traces from the load-balancer. If an issue is detected, it is advisable to debug both the load-balancer logs and the microservice logs.
Multiple Services Running on Multiple Instances: In this scenario, several services are running on multiple hosts to provide the desired functionality to the users, and each service can have more than one instance. Detecting problems and debugging issues in this case becomes the most tedious, as developers need to ssh into several instances to scan the logs and find the problems. The most efficient way to cater to this is by aggregating the logs and metrics in one place. Figure 2.2 shows a small system running multiple services on multiple hosts, where each service is hosted on a different host/instance.
Microservices reside independently from each other, are written in multiple programming languages and can be deployed on different platforms; hence, their constant monitoring is required to make sure they do not break before and after a particular release and that their overall performance meets expectations. [16]
Microservices based systems are designed to be fault tolerant, hence they should be able to handle failures due to unavailability of the required infrastructure. Negligence in this area can cause a bad end-customer experience. The performance metrics are usually latency time, CPU utilization, service health status, memory, disk usage and errors.
Additionally, architectural elements such as the number of calls per minute, as well as business metrics, should also be continuously monitored with a monitoring tool. At a minimum, a monitoring tool displaying the status of metrics such as latency time and throughput in a dashboard is needed to meet the expectations of development teams and to overcome the challenges related to service communication. [10] [16]
In monolithic systems, system metrics such as CPU, disk IO, memory and network utilization are tested using black box techniques. In white box metrics monitoring (which is low-level monitoring), system and database logs are generated and sent to the metrics monitoring tool through SNMP. However, as the deployment patterns of microservices might vary from service to service and services might use additional message exchange brokers, using a time series database is better, as it supports data labelling and real time querying. [11]
Apart from black box and white box monitoring, there are various commercial and open source microservices monitoring tools available. Some tools perform component level monitoring and some perform domain specific monitoring. The monitored metrics can also be visualized as graphs using a metrics monitoring visualization tool. [15]
In the subsequent subsections, a few microservices monitoring tools are described.
Prometheus
AppDynamics
New Relic
New Relic is an application performance monitoring tool which has around 14,000 registered companies. It collects real time application performance data to help create insights into application performance, problems and solutions. The tool provides various features such as APM, Infrastructure, Browser and Insights. The APM feature helps users to see charts with deep, real data which depict the performance bottlenecks of an application, server or database. The New Relic Infrastructure feature provides system properties such as memory and CPU usage and can be configured to send alerts in critical situations. The Browser feature can help identify and improve page load times and resolve front-end errors. New Relic Insights can help visualize the data collected through metrics explorers in different charts. Looking at these charts, viewers can analyze the system's performance. [18]
Grafana
Grafana is a visualization tool which can be used to monitor large volumes of data in the form of graphs, charts or tables in order to extract useful information from it. These graphs can display multiple queries as series and use different units for the data and colors for the graphs. Grafana allows graphical visualization of bulk data tracked over time. [19] For example, IoT sensor devices produce a lot of data which gets stored in cloud environments; without a visualization tool this data is meaningless. Grafana is one of the most commonly used open source data monitoring tools. It is available as a web application and displays different graphs, charts or tables in the form of panels on a dashboard. Grafana allows connectivity to various open source applications and time series databases, e.g. InfluxDB and Elasticsearch. IoT and real time data can be effectively stored in these time series databases. Although external applications can be used with Grafana as plugins, Grafana nevertheless has a very tight relationship with time series databases. All the visual results on a dashboard, e.g. graphs and charts, are the result of a query to the time series database. Grafana's ability to fetch and display data in the form of graphs depends on the functions its time series database can provide, which poses limitations for some users. The identified limitations are [8]:
• Inability to perform customized queries.
• Certain applications cannot be plugged in to Grafana due to their inability to conform with time series database functionality.
2.3.2 Logging
Logs are very crucial in that errors, stacktraces and other information about service calls and data flows can be detected by looking at them. In the case of a monolithic application where a single instance is running, it is easy to ssh into the machine and look at the log file to find out if anything problematic is going on. However, the same task is very complicated when the system is comprised of many microservices. The solution to this problem is to gather the logs from the different microservices in one place. Another important feature is the ability to apply queries to analyze the logs from a specific microservice, host or process ID, to make it easier to find what one is looking for. [19]
2.3.3 Tracing
Dapper is a monitoring tool which was developed for Google's production distributed systems. Dapper started as a tracing tool but evolved into a monitoring platform. It was initially developed to trace Google's internal system of microservices. Dapper's tracing logic is similar to that of previous tracing tools, e.g. Magpie and X-Trace; however, it was designed to have better sampling and instrumentation logic. The Dapper system performed well in Google's environment when tested against similar tracing systems for over two years: it causes low overhead on the performance of the microservices system, its tracing instrumentation is kept minimal, the system scales with respect to the data size, and it processes information efficiently to display results quickly in the monitoring tool. When a system needs to send tracing information, it has to be instrumented to generate a trace. In Dapper, a trace can comprise several spans, where a span is a time-stamped log record which includes the span's start and end times. A span has a span name, a parent ID and a span ID, except for the top-level span, which does not include a parent ID. All spans under a specific trace must share the same trace ID. Once a request is completed, the microservices call path is determined from the spans sharing a common trace ID. Dapper's trace logging and collection comprises three phases: first the tracing data is written to logs, then the data is fetched from all hosts via Dapper daemons and stored in Dapper's Bigtable repository, from where it can be fetched and displayed for monitoring purposes. Dapper's API can be used to fetch this tracing data to build analysis tools. [4] Open source microservices monitoring tools like Zipkin [5] and Jaeger [6] are based on Google's Dapper technology and are widely used by companies with a microservices architecture to monitor calls and service dependency networks.
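The span model described above can be summarised with the following TypeScript sketch; the field names are illustrative rather than Dapper's exact schema.

interface Span {
  traceId: string;   // shared by every span belonging to one request
  spanId: string;    // unique identifier of this span
  parentId?: string; // absent for the top-level (root) span
  name: string;      // operation or span name
  startTime: number; // timestamped start of the span
  endTime: number;   // timestamped end of the span
}

// Once a request completes, spans sharing the same traceId are grouped
// to reconstruct the call path through the microservices.
type Trace = Span[];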
Spring Cloud Sleuth performs distributed tracing for Spring microservices. It borrows some concepts and solutions from Google's Dapper system. Sleuth creates spans which represent the work between two points in a request. For example, if a request is made from the order service to the product_purchase service, Sleuth generates three spans: one for the order service's internal work, a second for the call from order to product_purchase, and a third for the product_purchase service's internal work. Each of these spans contains a unique ID. Spans contain a parent ID, except for the parent span, so that all related spans can be grouped. A trace is a collection of related spans. These spans are stored in the logs, hence requests can be tracked or filtered by applying search operations on the aggregated logs. To work with Sleuth, a dependency needs to be added to the code. Sleuth will take care of tracing the requests and adding this information to the logs. For an incoming request, the tracing information is extracted by Sleuth and stored in the logs. Similarly, Sleuth attaches the tracing information to outgoing requests to make it available to other microservices. Spring Cloud Sleuth can also be configured to send other information, such as the service name, along with the span and trace IDs to the logs. This particularly helps in tracing calls in a system of multiple microservices. It is also possible to configure Sleuth to share logs and traces with a Zipkin server. [20]
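Sleuth itself targets Java/Spring, but the underlying idea of propagating the trace context in request headers can be illustrated in a language-neutral way. The sketch below attaches B3-style headers (the convention used by Zipkin-compatible tracers) to an outgoing call; the URL and ID values are placeholders, not Sleuth's own API.

// Propagate the trace context of the current span to a downstream service call.
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

async function callDownstream(ctx: TraceContext, url: string) {
  const headers: Record<string, string> = {
    'X-B3-TraceId': ctx.traceId, // same trace ID for every span of the request
    'X-B3-SpanId': ctx.spanId,   // the span covering this outgoing call
    'X-B3-Sampled': '1',
  };
  if (ctx.parentSpanId) {
    headers['X-B3-ParentSpanId'] = ctx.parentSpanId;
  }
  return fetch(url, { headers });
}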
Zipkin
Zipkin is an open source distributed tracing tool. It helps in tracing call sequences from one microservice to the others. Zipkin also gathers timing data for calls, which helps in monitoring system events. Zipkin has four main components: collector, storage, search and the web user interface. Instrumented applications generate and send tracing information to the collector component of Zipkin, which then validates, stores and indexes the data in the storage component. Zipkin supports Cassandra, Elasticsearch and MySQL for data storage. The search component is comprised of a JSON API which fetches the indexed data from the storage and displays it on the web user interface. The web component of Zipkin has two main features. The first allows filtering the calls/requests data (with end points and latency time) using filters such as 'Service Name', 'Request Duration' etc. The other displays a call dependency diagram, which is very helpful in monitoring a system which has several microservices communicating with each other. However, to send data to the Zipkin server, the code needs to be instrumented with the relevant libraries. On the other hand, setting up the Zipkin server is quite easy: it can be run inside a Docker container by executing a simple Docker command, or it can be installed onto the server of choice. Zipkin's user interface is available on port 9411 by default. Zipkin can also receive metrics from Sleuth for Spring Boot projects to obtain logs and other tracing information. Once tracing data sharing is enabled within Sleuth, it shares the metrics and traces in a Zipkin readable format. [20]
Jaeger
Jaeger is another open source distributed tracing system. Jaeger draws on the same inspiration and concepts as Zipkin and Google's Dapper system. Jaeger was developed by Uber to monitor their distributed system. Jaeger uses the OpenTracing libraries, which are available for various programming languages. To monitor with Jaeger, the source application needs to be instrumented with the relevant OpenTracing language library in order to send tracing information to Jaeger. Once Jaeger is set up, the call requests with latency times can be viewed in the Jaeger user interface (UI), which is by default available on port 16686. Jaeger also displays a dynamic call dependency graph of the system of microservices. Jaeger can easily be set up inside a Docker container by running a Docker command. [20]
Various microservices monitoring and tracing tools were taken into account to get information about their features, installation, compatibility with other tools and their usage in industry. To get the list of tool names, the search query below was executed in some popular search engines and databases, such as Google Scholar, IEEE and university databases:
(“open tracing” OR opentracing) AND tool*
Results which did not mention anything significant about a tool either in the title or the abstract were excluded. The results were mostly links to the tools' official websites, tutorials, documentation and blogs. During the final step, tools which did not have an official website or which lacked documentation were excluded from the final list of tools to be considered for comparison. Below is the list of tools which was selected for comparison:
• AppDynamics
• Datadog
• Elastic APM
• Jaeger
• InspectIT
• Instana
• LightStep
• SkyWalking
• Stagemonitor
• Wavefront VMware
• Zipkin
Data regarding tool’s features, installation and usage documentation was collected
from the official websites of tools. Each tool was individually explored for its char-
acteristics and features. The result/information was noted down the paper. Once
the initial data collection process finished, a table was created which only comprised
the features which were common among most of the tools. The final features on
which the comparison was made are:
• APIs Availability
• License
• Programming Language
• Installation
• Supported Back-end/DB
• Setup Requirement
• Kubernetes Monitoring
• Container Monitoring
• Context Monitoring
• Data Retention
• Dashboards
• Charts/Graphs
• Application logs
• Query/Tag filter
• Alerts
• Metrics
3.3 Comparison
In this section, three comparison tables are presented which compare the eleven monitoring and tracing tools. The eleven tools have been divided between Table 3.2 and Table 3.3, which compare data relating to each tool's installation, operation and additional functionalities (such as integration with an external back-end, metrics monitoring or tracing tool/data storage). Table 3.1 compares performance metrics such as timing data/latency time, call count, error rate, page load time, CPU usage, memory usage and the availability of a dependency diagram.
Table 3.1: Tool comparison of supported performance metrics across the eleven tools (Zipkin, Jaeger, LightStep, Instana, SkyWalking, InspectIT, Stagemonitor, Datadog, Wavefront VMware, AppDynamics, Elastic APM). All eleven tools provide timing data for requests; a dependency diagram is offered by ten of the eleven; CPU usage is reported by nine and memory usage by eight (in some cases only by instrumenting); error rate by seven; page load time by five; and an explicit call count by only three.
Table 3.2: Tool comparison - features about the tool itself (independent of usage)
Table 3.3: Tool comparison - features about the tool itself (independent of usage)
As mentioned earlier in the background chapter, there are various open source and commercial microservices monitoring and tracing tools available which can provide information about the overall health and status of a system comprised of several microservices. A comparison of some OpenTracing tools has also been drawn in the previous section, from which it is evident that there is no single tool which provides all the necessary insights related to performance and business metrics inside one tool. The tools which provide more features than the others tend to be commercial tools and might cost a company a lot on a monthly or yearly basis in order to provide performance and business monitoring of their microservices based systems. According to the literature review done during this thesis, there are certain open source monitoring and tracing tools which are widely used by many software companies, such as Prometheus (as a monitoring tool) and Zipkin and Jaeger (as monitoring and tracing tools). Besides these tools, Grafana is used to visualize system metrics in the form of graphs and charts to make quick judgements about the system's overall health and condition. However, to perform thorough monitoring, all these tools need to be set up separately and constantly monitored. Furthermore, these tools lack insights about business metrics, which can be very important for companies when making strategic decisions about the future of their system's individual components/services. Hence, a tool which could provide the important metrics and tracing information in one place, along with business insights, can close this gap.
MsViz (a microservices metrics visualization tool) provides a solution which combines some of the important metrics and tracing data from the Prometheus and Jaeger servers with business metrics stored in a MySQL database and creates a visualization diagram which provides all these metrics in one place. It draws a basic call dependency graph of a system of microservices and displays certain performance and business metrics for each microservice inside the call graph. The tool is comprised of four Grafana plugins, two of which (the MS-Visualization-Panel panel plugin and the Jaeger-backend-datasource datasource plugin) were developed as a result of this thesis.
The MsViz tool is comprised of three datasources and one front-end panel plugin. The tool uses the backend/datasource plugins to gather all the required metrics data from the servers of the relevant monitoring and tracing tools, and then draws a diagram with the available information inside the front-end plugin. In order to use these plugins, it is assumed that the microservices of the system to be monitored have been instrumented to send the required metrics to the Jaeger and Prometheus servers, and that a MySQL database contains the business metrics to be displayed inside the final output of the tool.
The tool is comprised of built-in and custom plugins, which include Grafana's built-in Prometheus datasource plugin (to obtain performance metrics from the Prometheus server) and MySQL plugin (to obtain business metrics stored in an external MySQL database). A third datasource plugin, named "Jaeger-backend-datasource", was created to acquire microservices call dependency and call count data from the Jaeger server. To use these plugins, the URLs of the respective servers need to be entered in the plugins' configuration user interfaces inside the Grafana web server. Jaeger-backend-datasource fetches data from the Jaeger server and converts this data into a format which is readable by Grafana's internal backend. The front-end plugin, named "MS-Visualization-Panel", automatically fetches data from the datasource plugins and plots a dynamic call dependency graph with the performance and business metrics displayed inside. Figure 4.1 describes how the various plugins and tools have been combined to form the MsViz tool.
Apart from the request/call tracing data used to create the parent-child relationships, the MsViz tool displays metrics which can be categorized as performance and business metrics. The performance metrics include two different metrics: latency time (or service response time) and load. Response time is the average time which a particular microservice took over all the responses in the last minute. For example, if a microservice processed three requests in the last minute, where the responses took 0.7 ms, 0.8 ms and 0.9 ms respectively, then the average response time is (0.7 + 0.8 + 0.9)/3 = 0.8 ms. The load metric indicates the number of requests a microservice processed during the last minute.
• Issues: Includes the number of open and closed issues against a particular microservice. A comparison between these two counts is drawn inside the graph.
• Bugs: Includes the number of open and closed bugs against a particular microservice. A comparison between these two counts is drawn inside the graph.
5 MsViz Implementation
This chapter describes the tool from an implementation perspective. In the first two sections, the tools which were researched and selected to gather data are described; in the subsequent sections, the tool selected as the baseline of MsViz is described and then the implementation of the self-developed Grafana plugins is explained. After that, the built-in Grafana datasources are described with regard to their usage in the MsViz tool. Lastly, the results of the MsViz tool and the testing process are described.
During the literature review of the relevant material, it turned out that there was not much academic research available on microservices metrics visualization tools and techniques, other than Google's Dapper technology, which is mainly used by Jaeger and Zipkin.
Initially, the research also considered investigating possible ways to create a microservices dependency graph without manually instrumenting the code (i.e. adding code which generates the tracing data to be sent to the monitoring/tracing tool's backend/server). To research this possibility, the logic and workings of open source microservices monitoring and tracing tools, e.g. Jaeger and Zipkin, were studied, and it turned out that developers need to instrument their code with OpenTracing libraries in order to send traces to the Zipkin or Jaeger server to be processed and displayed. Hence, other possible ways were investigated to determine whether the microservices call dependency graph could be generated without manually instrumenting the code.
According to the literature review of microservices message based communication protocols, it was found that Kafka is one of the most popular message brokers used for asynchronous communication between microservices. The Kafka message format was studied to see if information about the request sender and receiver microservices could be extracted. The research findings showed that Kafka is fairly agnostic to the messages. Producer groups produce messages which do not contain information about the calling or called microservices. These messages are directed towards Kafka topics and are basically comprised of a key and a value. On the other hand, consumers constantly check the Kafka topics assigned to them to look for messages with a relevant key. Multiple producer and consumer groups might be sending and consuming messages from the same topic, and multiple microservices might be using similar producer or consumer groups to produce and consume messages. Considering this information, it was not possible to get the calling and called service names by monitoring the Kafka message bus in a network of microservices. However, it is possible to add custom headers (with service names/IPs) to the messages before they are directed to a Kafka topic by a producer group. The information about the calling and called services can then be extracted when a relevant message is found by a Kafka consumer and is ready to be consumed. Nevertheless, this approach is not very suitable, as it requires instrumenting the Kafka messages with proper headers and it only covers microservices communicating through the Kafka message bus. Also, no previous research has included any study of microservices visualization techniques which avoid instrumenting the code by using Kafka or other message broker technologies.
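The custom-header workaround described above could look roughly like the following TypeScript sketch, here using the kafkajs client; the client choice, broker address, topic and header name are assumptions made for illustration.

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-service', brokers: ['localhost:9092'] });

// Producer side: identify the calling service in a custom message header.
async function produceWithCallerHeader() {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'purchases',
    messages: [{
      key: 'order-created',
      value: JSON.stringify({ orderId: 42 }),
      headers: { 'caller-service': 'order-service' },
    }],
  });
  await producer.disconnect();
}

// Consumer side: read the header back when the message is consumed.
async function consumeAndReadCaller() {
  const consumer = kafka.consumer({ groupId: 'product-purchase' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'purchases' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const caller = message.headers?.['caller-service']?.toString(); // header values arrive as Buffers
      console.log(`message produced by ${caller}`);
    },
  });
}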
Additionally, some log monitoring tools, e.g. the Elastic stack, and event monitoring tools, such as Prometheus, were studied as part of this research to assess whether a dependency graph could be created by monitoring their event/request data. Prometheus is an event monitoring and alerting tool and uses a time series database. Prometheus metrics were analyzed and contained no information about the services' call patterns: the information in Prometheus is generally about the resources being consumed, e.g. which services are up, error counts, CPU/memory utilization, request counts, tasks etc. On the other hand, the Elastic stack is a centralized logging tool. The Elastic stack indexes the logs in Elasticsearch via Logstash, and a Kibana dashboard displays the logs from all the services configured to send their logs to the Elastic stack through Logstash. The Elastic stack can provide a way to figure out the called and calling services if the logging in the application emits enough information, e.g. the microservice names in a message before a call to an external service is made, which is something developers need to do manually inside the code while generating the logs. Hence it was not a viable option either for the goal of monitoring microservices without instrumentation.
Jaeger and Zipkin are two of the most popular open source tools to monitor a network of microservices and are used by many microservices based companies. However, they only provide a very basic dependency diagram which depicts the relation from the calling to the called microservices and the total call count from one microservice to the other.
After performing research on some open source monitoring and tracing tools, the following tools were selected to construct the MsViz tool.
• D3JS: To improve on the existing call dependency graph from the Jaeger tool, D3JS, which is a graph plotting library, has been utilized to draw the desired dependency graph, with the data fetched from the Jaeger server by using the Jaeger API to obtain the microservices dependency data.
Different icons are used to represent microservices, databases and message buses. The graph also includes information on the number of calls made from one microservice to another. The final graph output is embedded into the Grafana monitoring tool.
• Panel Plugin: Panel plugins are needed to visualize data in a desirable form such as a chart, graph or table. These panels can utilize data from a Grafana datasource, or static data can be used inside a custom plugin.
In order to develop and later test the plugins, a Grafana environment was set up and configured on Ubuntu 18.04 as a service. Grafana can be installed and run by cloning the Grafana repository from Grafana's official GitHub repository, by running it as a service, or by running the Grafana Docker image. In order to run Grafana from source, the Grafana repository needs to be cloned from GitHub and then the following commands need to be run in the src/github.com/grafana/grafana directory to build the Grafana frontend and backend respectively:
yarn start
make run
Grafana needs to know the plugins directory in order to include the plugins and display them in the Grafana web server. The plugin directory path needs to be written in the Grafana configuration file. On Linux, the configuration file is by default located at:
/usr/local/etc/grafana/grafana.ini
Once the grafana.ini configuration file is updated, the Grafana server is restarted so that the changes are reflected.
• Plugin.json
Once Grafana is looking for plugins inside the plugin directory mentioned in the configuration file, it treats each individual directory containing a plugin.json file as a plugin. The plugin.json file must contain some important information about the plugin, such as the plugin type, plugin name and plugin ID, where the plugin type can be datasource, panel or app, the plugin name describes the functionality or use of the plugin in a few words, and the plugin ID should be a unique identifier so that the plugin can be identified uniquely inside Grafana. The contents of plugin.json are shown in Figure 5.1.
• Module.ts: The plugin's Module.ts file contains the entry point of the plugin and describes the plugin logic. The file extends GrafanaPlugin through one of the PanelPlugin, DataSourcePlugin or AppPlugin objects. In the case of the Jaeger-backend-datasource plugin it extends the DataSourcePlugin object, and in the case of the panel plugin it extends the PanelPlugin object.
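A minimal sketch of what such an entry point might look like for a datasource plugin is shown below; the import path and the editor classes are assumptions based on the Grafana plugin SDK of that era and may differ between Grafana versions.

// module.ts -- illustrative entry point of a datasource plugin
import { DataSourcePlugin } from '@grafana/data';
import { DataSource } from './datasource';
import { ConfigEditor } from './ConfigEditor';
import { QueryEditor } from './QueryEditor';

export const plugin = new DataSourcePlugin(DataSource)
  .setConfigEditor(ConfigEditor)
  .setQueryEditor(QueryEditor);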
The testDataSource function sends a test request to the external data server (the URL of the external server is fetched from the datasource configuration interface) when the "Save and Test" button is clicked. The response "Datasource is working" is returned if the external datasource is running, and "Datasource is not working" is returned if the query response is not received successfully. The testDataSource function implementation logic is shown in Figure 5.3.
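A minimal sketch of this health check, written as a standalone function, could look as follows; the Jaeger URL and the use of the /api/services endpoint for the reachability check are assumptions, not the exact code from Figure 5.3.

async function testDataSource(jaegerUrl: string) {
  try {
    // Any lightweight Jaeger endpoint is enough to verify the server is reachable.
    const res = await fetch(`${jaegerUrl}/api/services`);
    return res.ok
      ? { status: 'success', message: 'Datasource is working' }
      : { status: 'error', message: 'Datasource is not working' };
  } catch {
    return { status: 'error', message: 'Datasource is not working' };
  }
}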
The Jaeger datasource is comprised of two sides: the frontend and the server side. The frontend is written in TypeScript, whereas the server side is written in the Go programming language. When a query from the panel is sent to the datasource plugin, the 'query' function inside /src/datasource.ts is called, which calls the Grafana server query function inside /pkg/datasource.go, passing the query parameters. The 'query' function in datasource.go, shown in Figure 5.4, is responsible for making the actual API call to the external Jaeger server. The Jaeger API response is converted into the required format and the result is sent back to the datasource query function inside datasource.ts, which then returns the response to the panel, where the data can be visualized as per the panel's implementation logic. Figure 5.5 describes how the communication between the frontend and backend plugin takes place.
url-to-the-jaeger-server/api/dependencies?endTs=1605034853803&lookback=604800000
where endTs and lookback are optional parameters. The Jaeger API returns the response in the format shown in Figure 5.6.
Grafana queries need responses in either 'timeseries' or 'table' format. Hence, the above Jaeger format is converted into the Grafana-specific table format. The formatted response is shown in Figure 5.7.
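A rough TypeScript sketch of this conversion is shown below. The Jaeger /api/dependencies response carries parent, child and callCount fields, and the output follows Grafana's legacy 'table' response layout; the plugin's real conversion happens in the Go backend (Figures 5.6 and 5.7), so this sketch only illustrates the mapping.

interface JaegerDependency {
  parent: string;    // calling service
  child: string;     // called service
  callCount: number; // total calls observed between the two
}

// Convert the Jaeger dependency list into a Grafana table-format response.
function toGrafanaTable(deps: JaegerDependency[]) {
  return {
    type: 'table',
    columns: [{ text: 'parent' }, { text: 'child' }, { text: 'callCount' }],
    rows: deps.map((d) => [d.parent, d.child, d.callCount]),
  };
}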
Once the datasource plugin logic is in place, the plugin's frontend and backend need to be built. The build creates a directory called 'dist' and keeps the compiled code in it. Later, when Grafana is restarted to reflect the changes, it looks for the plugin implementation logic inside the 'dist' folder.
To build the Jaeger-backend-datasource plugin, the commands shown in Figure 5.8 need to be run inside the datasource plugin repository. Once the plugins have been built, the Grafana server needs to be restarted to reflect the changes.
The existing Grafana Prometheus datasource plugin is used to call the Prometheus APIs to gather performance and load metrics. The following Prometheus API is called to fetch the performance metrics:
GET /api/v1/query
As Prometheus supports the PromQL query language, the queries to fetch the response time and load for each registered service are sent via the Prometheus API. The response is then validated and converted into a suitable format.
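As an illustration, a query against this endpoint could be issued as in the TypeScript sketch below; the Prometheus address and the exact PromQL expression are assumptions, with the metric names taken from the test setup described later (performance_per_ms, load_per_min).

const promUrl = 'http://localhost:9090'; // assumed Prometheus server address

async function queryAvgResponseTime() {
  const expr = encodeURIComponent('avg by (service) (performance_per_ms)');
  const res = await fetch(`${promUrl}/api/v1/query?query=${expr}`);
  const body = await res.json();
  // The result is a vector of { metric: {...}, value: [timestamp, "value"] } entries.
  return body.data.result;
}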
Grafana's built-in MySQL plugin is used to fetch data from the external MySQL database. The data in the external MySQL database is comprised of open/closed issues, open/closed bugs, cost, revenue and effort for individual microservices. This data can be extracted from the relevant project management and tracking tools and then stored in a table in the MySQL database.
To access Grafana's built-in datasources inside the panel plugin, Grafana's internal datasource APIs are utilized. The following API has been used inside the panel plugin logic:
GET /api/datasources/name/:name
The request to the Grafana internal API has the format shown in Figure 5.10, and the expected response contains the datasource ID, as shown in Figure 5.11. Once the datasource IDs are known, the MsViz front-end panel plugin requests data from the Prometheus and MySQL datasources.
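A hedged TypeScript sketch of such a lookup is given below; the Grafana address, the datasource name and the way the API key is supplied are assumptions, and the key corresponds to the api_key_admin attribute described below.

const grafanaUrl = 'http://localhost:3000';       // assumed Grafana server address
const apiKey = process.env.GRAFANA_API_KEY ?? ''; // admin API key generated in Grafana

async function lookupDatasourceId(name: string): Promise<number> {
  const res = await fetch(`${grafanaUrl}/api/datasources/name/${name}`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const ds = await res.json();
  return ds.id; // the ID is then used when querying that datasource
}

// Example: const prometheusId = await lookupDatasourceId('Prometheus');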
The front-end plugin contains two important files which implement the panel logic, named "App.tsx" and "FormNetwrok.tsx". App.tsx is responsible for fetching data from all the datasources based on the input of a configuration file provided by the user. This configuration file contains the properties shown in Figure 5.12 and has to be placed in the default Grafana configuration directory (/etc/grafana on Linux based operating systems).
The configuration file is comprised of eighteen attributes whose values can be updated as required. The attribute 'api_key_admin' is an authentication token which is generated from the Grafana API configuration user interface. The token needs to have admin rights for the MS-Visualization plugin to use the Grafana internal APIs.
Once the data from the datasources has been collected inside App.tsx, the 'mergeMetricsData' function calls the 'getNetwrokData' function inside the FormNetwrok.tsx file. The 'getNetwrokData' function formats and aggregates all the data into one TypeScript array which is read by the D3JS graph plotting library. Once the data is merged and formatted, it is returned to the 'mergeMetricsData' function. The formatted data is stored in a TypeScript array in the format shown in Figure 5.13.
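Since Figure 5.13 defines the exact layout, the sketch below only illustrates the kind of node/link structure that a D3JS force-directed graph typically consumes; the field names are assumptions based on the metrics listed earlier, not the plugin's actual types.

interface ServiceNode {
  id: string;            // microservice name
  avgResponseMs: number; // average response time over the last minute
  loadPerMin: number;    // requests processed during the last minute
  openIssues: number;
  closedIssues: number;
  openBugs: number;
  closedBugs: number;
  cost: number;
  revenue: number;
  effort: number;
}

interface CallEdge {
  source: string;    // calling microservice
  target: string;    // called microservice
  callCount: number; // number of calls made between the two
}

interface GraphData {
  nodes: ServiceNode[];
  links: CallEdge[];
}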
The function which plots the D3JS graph is called 'drawGraph' and is located inside 'App.tsx'. The function is called by the 'mergeMetricsData' function with the TypeScript data array as an input parameter. Finally, drawGraph plots the dynamic call dependency graph of the microservices using the D3JS graph plotting library and displays a legend alongside it to describe the metrics data presented inside the graph.
To build the MS-Visualization plugin, the following command was run inside the plugin directory:
yarn watch
The command compiles and builds the code inside the dist directory. The Grafana server needs to be restarted to incorporate the plugin changes.
The tool has been installed and tested on Ubuntu 18.04. In order to test the tool on an Ubuntu 18.04 server, the following steps were executed.
5.8.1 Prerequisites
The Jaeger "Hot R.O.D. - Rides on Demand" application has been used to test the MsViz tool. This is a demo application which was developed by Jaeger to demonstrate the use of the OpenTracing API. The application is instrumented with the OpenTracing library by default and sends data to the Jaeger server for monitoring. The application can be run from source or by running its Docker image as per the Jaeger Hot R.O.D. instructions available in the Jaeger GitHub repository.
To obtain the microservices' performance and load, the application was instrumented with Prometheus to send the performance_per_ms and load_per_min metrics.
The following business metrics were stored in a MySQL database named 'business_metrics':
• Open Issues
• Closed Issues
• Open Bugs
• Closed Bugs
• Cost
• Revenue
• Effort
The business metrics do not reflect actual data. In a production environment, business metrics can be fetched from GitHub, a project management tool or Excel files (in case the data is stored manually). For the Hot R.O.D. application, dummy business metrics are used to test the tool.
Lastly, the msvis.json file needs to be updated with the right properties and placed in Grafana's default configuration directory (/etc/grafana). The Grafana configuration file was given the directory path of the pre-built plugins repository. Then the Grafana server was restarted.
In order to run and test the Hot R.O.D. application, all the datasources were configured in the Grafana web server interface. The respective URLs of Jaeger and Prometheus were provided for the Jaeger and Prometheus datasources. The MySQL datasource was configured by providing the MySQL host address as localhost:3306, the database name as 'business_metrics', and a username and password for database authentication. After adding the datasources, the output graph was viewed by accessing the 'MS-Visualization-Panel' from the Grafana panels interface. The tracing data and the values of all the metrics were successfully cross-verified for their validity. However, only the positive use cases (with complete input data and with no input data) were tested. Negative test cases could not be conducted due to a shortage of time.
The final output of the tool can be viewed in Figure 5.14.
6 Discussion
This chapter discusses the uncertainty factors and the limitations of this thesis work.
As the MsViz tool has been developed on top of the existing Grafana tool and its functionality, it is highly dependent on some external factors. For example, the tool might not work, without proper maintenance, with a Grafana version higher or lower than v6.6.0. Moreover, if the format of Grafana's query requests to datasources or the Grafana internal datasource APIs change, the corresponding MsViz code would need to be updated to keep the tool working.
The tool will not work if the corresponding Jaeger or Prometheus API request or response formats change in a way which is not compatible with the existing Jaeger-backend-datasource plugin's logic.
The tool has not been tested with a significantly larger test project; hence the performance and accuracy cannot be guaranteed for bigger input projects.
Finally, the tool has only been tested with positive test cases. Negative testing has been left out due to a shortage of time.
6.2 Limitations
7 Conclusion
As the IT industry has been moving from monolithic to microservices based architectures, developers, managers and other stakeholders have had to face challenges in making sure the systems meet the expected performance and the users' expectations. The major challenge is the observability of such big systems. Despite the availability of open source monitoring tools, none of them fulfils the needs of both the developers and the managers working in financial departments. Different monitoring tools need to be set up and individually monitored according to their outputs and the stakeholders' needs.
In this research work, various monitoring tools have been studied to understand the way they work and the kind of output they produce. After performing research on some open source monitoring tools, three tools, Prometheus, MySQL and Jaeger, were selected to obtain the microservices call dependency and metrics data. Then a tool named MsViz was developed inside Grafana as a combination of different plugins. MsViz uses the above three datasources to fetch performance, tracing and business data and metrics and graphically displays them inside a Grafana panel in the form of a graph. The metrics are shown graphically inside the microservices call dependency graph, which can help developers and operations engineers quickly assess the overall system's health and performance. The graph can help managers quickly gauge how the microservices are doing from the business perspective, and a deeper analysis of the output graph can help managers make strategic decisions. The tool eases the monitoring process for the stakeholders by aggregating some important data in one place.
Currently, there are various open source and commercial microservices monitoring and tracing tools capable of providing far deeper and more complex information and insights about microservices based systems, whereas the MsViz tool only incorporates a few important metrics from different open source tools and combines them to produce a better and visually more powerful graphical presentation. However, the current work shows how this tool could be improved and enhanced in the future to incorporate more metric data, and different visualizations could be added for the whole system or for individual microservices.
8 References
[1] J. Thöness. Microservices. IEEE Software, 32(1):116-116, 2015.
[2] X. Zhou et al. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. IEEE Transactions on Software Engineering. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8580420&isnumber=4359463
[7] J. Shaheen. Apache Kafka: Real Time Implementation with Kafka Architecture Review. International Journal of Advanced Science and Technology, 109 (2017): 35-42.
[9] N. Dragoni et al. Microservices: Yesterday, Today, and Tomorrow. In: Mazzara M., Meyer B. (eds) Present and Ulterior Software Engineering. Springer, Cham, 2017. URL: https://link.springer.com/chapter/10.1007/978-3-319-67425-4_12
IEEE Aerospace Conference, Big Sky, MT, 2017, pp. 1-8, doi:10.1109/AERO.2017.7943959.