Fault Analysis and Debugging 2018
Abstract—The complexity and dynamism of microservice systems pose unique challenges to a variety of software engineering tasks
such as fault analysis and debugging. In spite of the prevalence and importance of microservices in industry, there is limited research
on the fault analysis and debugging of microservice systems. To fill this gap, we conduct an industrial survey to learn typical faults of
microservice systems, current practice of debugging, and the challenges faced by developers in practice. We then develop a
medium-size benchmark microservice system (to the best of our knowledge, the largest and most complex open source microservice
system) and replicate 22 industrial fault cases on it. Based on the benchmark system and the replicated fault cases, we conduct an
empirical study to investigate the effectiveness of existing industrial debugging practices and whether they can be further improved by
introducing the state-of-the-art tracing and visualization techniques for distributed systems. The results show that the current industrial
practices of microservice debugging can be improved by employing proper tracing and visualization techniques and strategies. Our
findings also suggest that there is a strong need for more intelligent trace analysis and visualization, e.g., by combining trace
visualization and improved fault localization, and employing data-driven and learning-based recommendation for guided visual
exploration and comparison of traces.
1 INTRODUCTION
Microservice architecture [1] is an architectural style and approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. Microservice architecture allows each microservice to be independently developed, deployed, upgraded, and scaled. Thus, it is particularly suitable for systems that run on cloud infrastructures and require frequent updating and scaling of their components.

Nowadays, more and more companies have chosen to migrate from the so-called monolithic architecture to microservice architecture [2], [3]. Their core business systems are increasingly built based on microservice architecture. Typically, a large-scale microservice system can include hundreds to thousands of microservices. For example, Netflix's online service system [4] uses about 500+ microservices and handles about two billion API requests every day [5]; Tencent's WeChat system [6] accommodates more than 3,000 services running on over 20,000 machines [7].

A microservice system is complicated due to the extremely fine-grained and complex interactions of its microservices and the complex configurations of the runtime environments. The execution of a microservice system may involve a huge number of microservice interactions. Most of these interactions are asynchronous and involve complex invocation chains. For example, Netflix's online service system involves 5 billion service invocations per day and 99.7% of them are internal (most are microservice invocations); Amazon.com makes 100-150 microservice invocations to build a page [8]. The situation is further complicated by the dynamic nature of microservices. A microservice can have several to thousands of physical instances running on different containers and managed by a microservice discovery service (e.g., the service discovery component of Docker Swarm). The instances can be dynamically created or destroyed according to the scaling requirements at runtime, and the invocations of the same microservice in a trace may be accomplished by different instances. Therefore, there is a strong need to address architectural challenges such as dealing with asynchronous communication, cascading failures, data consistency problems, and the discovery and authentication of microservices [9].

The complexity and dynamism of microservice systems pose great and unique challenges to debugging, as the developers are required to reason about the concurrent behaviors of different microservices and understand the interaction topology of the whole system. A basic and effective way of understanding and debugging distributed systems is tracing and visualizing system executions [10]. However, microservice systems are much more complex and dynamic than traditional distributed systems. For example, there is no natural correspondence between microservices and system nodes in distributed systems, as microservice instances can be dynamically created and destroyed.

• X. Peng is the corresponding author.
• X. Zhou, X. Peng, C. Ji, W. Li, and D. Ding are with the School of Computer Science and the Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China, and Shanghai Institute of Intelligent Electronics & Systems, China.
• T. Xie is with the University of Illinois at Urbana-Champaign, USA.
• J. Sun is with the Singapore University of Technology and Design, Singapore.
IEEE TRANSACTION ON SOFTWARE ENGINEERING, VOL. 14, NO. 8, AUGUST 2018 2
Therefore, it is not clear whether or how well the state-of-the-art debugging visualization tools for distributed systems can be used for microservice systems.

In spite of the prevalence and importance of microservices in industry, there exists limited research on the subject, with only a few papers on microservices in the software engineering research community, and even fewer in major conferences. The existing research focuses on a wide range of topics about microservices, including design [11], testing [3], [12], [13], [14], deployment [15], [16], [17], verification [18], composition [19], architecture recovery [20], legacy migration [21], and runtime adaptation [22]. There exists little research on the fault analysis and debugging of microservice systems. Moreover, the existing research on microservices is usually based on small systems with few microservices (e.g., 5 microservices or fewer [2]). Such lack of non-trivial open source benchmark microservice systems results in a gap between what the research community can produce and what the industrial practices really need. There have been appeals that practitioners and researchers develop and share a common microservice infrastructure that can emulate the production environments of typical microservice applications for more repeatable and industry-focused empirical studies [23], [24].

To fill this gap and pursue practice-relevant research on microservices, we conduct an industrial survey on the typical faults of microservice systems, the current practice of debugging, and the challenges faced by the developers. Our survey shows that the current techniques used in practice are limited and the developers face great challenges in microservice debugging. We then conduct an empirical study to further investigate the effectiveness of existing industrial debugging practices and whether the practices can be facilitated by state-of-the-art debugging visualization tools.

To enable our study and also provide a valuable practice-reflecting benchmark for the broad research community, we develop a medium-size benchmark microservice system named TrainTicket [25]. To the best of our knowledge, our system is the largest and most complex open source microservice system. Upon the system, we replicate the 22 representative fault cases collected in the industrial survey. Based on the benchmark system and replicated fault cases, we empirically evaluate the effectiveness of execution tracing and visualization for microservice debugging by extending a state-of-the-art debugging visualization tool [10] for distributed systems. Based on the study results, we summarize our findings and suggest directions for future research.

In this work, we make the following main contributions:

• We conduct a survey on industrial microservice systems and report the fault-analysis results about typical faults, the current practice of debugging, and the challenges faced by the developers.
• We develop a medium-size benchmark microservice system (to the best of our knowledge, the largest and most complex open source microservice system) and replicate 22 representative fault cases upon it. The system and the replicated fault cases can be used as a benchmark for the research community to further conduct practice-relevant research on microservice fault analysis and debugging, and potentially other practice-relevant research on microservices.
• We experimentally evaluate the effectiveness of execution tracing and visualization for microservice debugging and propose a number of visualization analysis strategies for microservice debugging.

In this work, we also extend the benchmark system presented in our earlier 2-page poster paper [26] by introducing more microservices (from 24 to 41) and more characteristics (e.g., more languages and interaction modes). The replicated fault cases have been released as an open-source project [27], which can be easily integrated into the benchmark system [25]. The details of our industrial survey and empirical study (along with the source code of our open source benchmark system and replicated faults) can be found in our replication package [28].

The rest of the article is structured as follows. Section 2 presents background knowledge of microservice architecture. Section 3 describes the industrial survey, including the process and the results. Section 4 introduces the benchmark system and the 22 replicated fault cases. Section 5 presents the effectiveness evaluation of execution tracing and visualization based on the replicated fault cases and discusses our observations and suggestions. Section 6 discusses threats to validity. Section 7 reviews related work. Section 8 concludes the paper and outlines future work.

2 BACKGROUND

Microservice architecture arises from the broader area of Service-Oriented Architecture (SOA), with a focus on the componentization of small lightweight microservices, the application of agile and DevOps practices, and decentralized data management and governance among microservices [2]. With the migration from monolithic architecture to microservice architecture, architectural complexity moves from the code base to the interactions of microservices. The interactions among different microservices must be implemented using network communication. Microservice invocations can be synchronous or asynchronous. Synchronous invocations are considered harmful due to the multiplicative effect of downtime [9]. Asynchronous invocations can be implemented by asynchronous REST invocations or using message queues. The former provides better performance whereas the latter provides better reliability. As a user request usually involves a large number of microservice invocations and each microservice may fail, the microservices need to be designed accordingly, i.e., taking possible failures of microservice invocations into account.

Microservice architecture is supported by a series of infrastructure systems and techniques. Microservice development frameworks such as Spring Boot [29] and Dubbo [30] facilitate the development of microservice systems by providing common functionalities such as REST clients, database integration, externalized configuration, and caching. Microservice systems widely employ container-based (e.g., Docker [31]) deployment for portability, flexibility, efficiency, and speed [9]. Microservice containers can be organized and managed as clusters with configuration management, service discovery, service registry, and load balancing by using runtime infrastructure frameworks such as
Spring Cloud [32], Mesos [33], Kubernetes [34], and Docker Swarm [35].

The unique characteristics of microservices pose challenges to existing debugging techniques. Existing debugging techniques are designed based on setting up breakpoints, manual inspection of intermediate program states, and profiling. However, these techniques are ineffective for microservices. For instance, due to the high degree of concurrency, the same breakpoint might be reached through very different executions resulting in different intermediate program states. Furthermore, a microservice system contains many asynchronous processes, which requires tracing multiple breakpoints across different processes; such tracing is considerably more challenging than debugging monolithic systems. Besides inspecting intermediate program states, it is equally, if not more, important to comprehend how microservices interact with each other for debugging. Profiling similarly becomes more complicated due to the high dynamism of microservices.

In addition, existing fault localization techniques [36] are ineffective for microservices. Program-slicing-based fault localization [37] works by identifying program statements that are irrelevant to the faulty statement and allowing the developers to investigate the fault based on the remaining statements. Program slicing for microservices is complicated since we must slice through many processes considering different interleavings of the processes. Spectrum-based fault localization [38] computes the suspiciousness of every statement using information such as how many times it is executed in passing or failing test executions, and ranks the statements accordingly so that the developers can focus on the highly suspicious ones. There is no evidence that such techniques work for highly concurrent and dynamic systems such as microservices. Similarly, related fault localization techniques, such as statistics-based fault localization [39] and machine-learning-based ones [40], [41], are designed mainly for sequential programs. In recent years, fault localization has been extended to concurrent programs [42], [43], [44] and distributed systems [45], [46], [47]. Both groups of work start with logging thread- (and node-) level execution information and then locate faults using existing techniques. Applying such techniques to microservices is highly non-trivial since the container instances in microservices are constantly changing, causing difficulty in log checking and overly fragmented logs.

3 INDUSTRIAL SURVEY

In order to precisely understand the industry needs, we start with an industrial survey and then proceed with the collection of typical fault cases and the understanding of the current practice on microservice debugging.

3.1 Participants and Process

We identify an initial set of candidates for the survey from the local technical community who have talked about microservices in industrial conferences or online posts (e.g., articles, blogs). These candidates further recommend more candidates (e.g., their colleagues). Among these candidates, we select and invite 46 engineers for interviews based on the following criteria: the candidate must have more than 6 years' experience of industrial software development and more than 3 years' experience of microservice development.

Among the invited engineers, 16 accept the invitation. The 16 participants are from 12 companies and their feedback is based on 13 microservice systems that they are working on or have worked on. These participants have a broad representation of different types of companies (i.e., traditional IT companies, Internet companies, and non-IT companies), different types of microservice systems (Internet services and enterprise systems) of different scales (50 to more than 1,000 microservices), and different roles in development (technical roles and managerial roles). The information of the participants and subject systems is listed in Table 1, including the company (C.), the participant (P.), the subject system, the number of microservices (#S.), and the position of the participant.

Among the 12 companies, C1, C6, C7, C8, C11, and C12 are leading traditional IT companies, of which C1 and C8 are Fortune 500 companies; C3, C4, C5, and C10 are leading Internet companies; C2 and C9 are non-IT companies. The 13 subject systems can be categorized into two types. One type is Internet microservice systems that serve consumers via the Internet, including A3, A4, A5, A6, A10, and A11. The other type is enterprise systems that serve company employees, including A1, A2, A7, A8, A9, A12, and A13. The number of microservices in these systems ranges from 50 to more than 1,000, with a majority of them involving about 100-200 microservices. The 16 participants take different positions in their respective companies. Among these positions, Junior Software Engineer, Staff Software Engineer, Senior Software Engineer, and Architect are technical positions, and Manager is a managerial position that manages the development process and project schedule.

We conduct a face-to-face interview with each of the participants. The participant is first asked to recall a microservice system that he/she is the most familiar with and provide his/her subsequent feedback based on that system. The participant introduces the subject system and the role that he/she takes in the system-development project. Then we interview and discuss with the participant around the following questions:

• Why does your company choose to apply the microservice architecture in this system? Is the system migrated from an existing monolithic system or developed as a new system?
• How does your team design the system? For example, how does your team determine the partitioning of microservices?
• What kinds of techniques and what programming languages are used to develop the system?
• What challenges does your team face during the maintenance of the system?

Afterwards, the participant is asked to recall those fault cases that he/she has handled. For each fault case, the participant is asked to describe the fault and answer the following questions:

• What is the symptom of the fault and how can it be reproduced?
TABLE 1
Survey Participants and Subject Systems

C.  | P.  | Subject System                       | #S.   | Position
C1  | P1  | A1: online meeting system            | 50+   | Staff Software Engineer
C1  | P2  | A2: collaborative translation system | 100+  | Senior Software Engineer
C2  | P3  | A3: personal financial system        | 100+  | Manager
C3  | P4  | A4: message notification system      | 80+   | Staff Software Engineer
C4  | P5  | A5: mobile payment system            | 200+  | Architect
C5  | P6  | A6: travel assistance system         | 1000+ | Senior Software Engineer
C6  | P7  | A7: OA (Office Automation) system    | 100+  | Manager
C7  | P8  | A8: product data management system   | 200+  | Architect
C7  | P9  | A8: product data management system   | 200+  | Senior Software Engineer
C8  | P10 | A9: price management system          | 100+  | Manager
C8  | P11 | A9: price management system          | 100+  | Senior Software Engineer
C9  | P12 | A10: electronic banking system       | 200+  | Senior Software Engineer
C10 | P13 | A11: online retail system            | 100+  | Senior Software Engineer
C11 | P14 | A12: BPM system                      | 60+   | Staff Software Engineer
C12 | P15 | A13: enterprise wiki system          | 200+  | Senior Software Engineer
C12 | P16 | A13: enterprise wiki system          | 200+  | Senior Software Engineer

• What is the root cause of the fault and how many microservices are involved?
• What is the process of debugging? How much time is spent on debugging and what techniques are used?

After the interview, whenever necessary, we conduct follow-up communication with the participants via emails or phone calls to clarify some details.

3.2 General Practice

Our survey shows that most of these companies, not only the Internet companies but also the traditional IT companies, have adopted microservice architecture to a certain degree. Independent development and deployment as well as diversity in development techniques are the main reasons for adopting microservice architecture. Among the 13 surveyed systems, 6 adopt microservice architecture by migrating from existing monolithic systems, while the remaining 7 are new projects using microservice architecture for a comparatively independent business. The migration of some systems is still incomplete: these systems include both microservices and old monolithic modules. The decisions on the migration highly depend on its business value and effort.

Feedback in response to the second question mainly comes from the participants who hold the positions of manager or architect. 4 of the 5 choose to take a product perspective instead of a project perspective on the architectural design and consider the microservice partitioning based on the product business model. They express that this strategy ensures stable boundaries and responsibilities of different microservices.

Among the 13 surveyed systems, 10 use more than one language, e.g., Java, C++, C#, Ruby, Python, and Node.js. One of the systems (A6) uses more than 5 languages. 9 of the participants state that runtime verification and debugging are the main challenges, and they heavily depend on runtime monitoring and tracing of microservice systems. The managers and architects are interested in using runtime monitoring and tracing for verifying the conformance of their systems to microservice best practices and patterns, while the developers are interested in using them for debugging. Debugging remains a major challenge for almost all of the participants. They often spend days or even weeks analyzing and debugging a fault.

3.3 Fault Cases

In total, the 16 participants report 22 fault cases as shown in Table 2. For each case, the table lists its reporter, symptom, root cause, and the time (in days) used to locate the root cause. Detailed descriptions of these fault cases (along with the source code of our open source benchmark system and replicated faults) can be found in our replication package [28]. Note that developers take several days to locate the root causes in most cases. These faults can be grouped into 6 common categories as shown in Table 3, based on their symptoms (functional or non-functional) and root causes (internal, interaction, or environment).

Functional faults result in the malfunctioning of system services by raising errors or producing incorrect results. Non-Functional faults influence the quality of services such as performance and reliability. From Table 3, it can be seen that most of the faults are functional, causing incorrect results (F1, F2, F8, F9, F10, F11, F12, F13, F14, F18, F19, F21, F22), runtime failures (F7, F15, F16), or no response (F20); only 4 of them are non-functional, causing unreliable services (e.g., F3, F5) or long response time (F4, F17).

The root causes of Internal faults lie in the internal implementation of individual microservices. For example, F14 is an internal fault caused by a mistake in the calculation of the Consumer Price Index (CPI) implemented in a microservice. The root causes of Interaction faults lie in the interactions among multiple microservices. These faults are often caused by missing or incorrect coordination of microservice interactions. For example, F1 is caused by the lack of sequence control in the asynchronous invocations of multiple message delivery microservices; F12 is caused by the incorrect behaviors of a microservice resulting from an unexpected state of another microservice. The root causes of Environment faults lie in the configuration of the runtime infrastructure, which may influence the instances of a single microservice or the instances of a cluster of microservices. For example, F3 and F20 are caused by improper configuration of Docker (cluster level) and JBoss (service level), respectively. These faults may influence the availability, stability, performance, and even functionality of the related microservices.

To learn the characteristics of the faults in microservice systems, we discuss with each participant to determine whether the reported fault cases are particular to microservice architecture. The criterion is whether similar fault cases may occur in systems of monolithic architecture. Based on the discussion, we find that internal faults and service-level environment configuration faults are common in both microservice systems and monolithic systems, while interaction faults and cluster-level environment configuration faults are particular to microservice systems.

3.4 Debugging Practice

Based on the survey, we summarize the existing debugging process of microservice systems and identify different maturity levels of the practices and techniques on debugging. We also analyze the effectiveness of the debugging processes of the reported fault cases.

3.4.1 Debugging Process

Our survey shows that all the participants depend on log analysis for fault analysis and debugging. Their debugging processes are usually triggered by failure reports describing
TABLE 2
Microservice Fault Cases Reported by the Participants

Fault | Reporter | Symptom | Root Cause | Time
F1  | P1 (A1)  | Messages are displayed in wrong order | Asynchronous message delivery lacks sequence control | 7D
F2  | P2 (A2)  | Some information displayed in a report is wrong | Different data requests for the same report are returned in an unexpected order | 3D
F3  | P2 (A2)  | The system periodically returns server 500 error | JVM configurations are inconsistent with Docker configurations | 10D
F4  | P3 (A3)  | The response time for some requests is very long | SSL offloading happens at a fine granularity (in almost each Docker instance) | 7D
F5  | P4 (A4)  | A service sometimes returns timeout exceptions for user requests | The high load of one type of requests causes the timeout failure of another type of requests | 6D
F6  | P5 (A5)  | A service is slowing down and finally returns an error | Endless recursive requests of a microservice are caused by SQL errors of another, dependent microservice | 3D
F7  | P6 (A6)  | The payment service of the system fails | The overload of requests to a third-party service leads to denial of service | 2D
F8  | P7 (A7)  | A default selection on the web page is changed unexpectedly | The key in the request of one microservice is not passed to its dependent microservice | 5D
F9  | P7 (A7)  | There is a Right To Left (RTL) display error for UI words | There is a CSS display style error in bi-directional | 0.5D
F10 | P8 (A8)  | The number of parts of a specific type in a bill of material (BOM) is wrong | An API used in a special case of BOM updating returns unexpected output | 4D
F11 | P9 (A8)  | The bill of material (BOM) tree of a product is erroneous after updates | The BOM data is updated in an unexpected sequence | 4D
F12 | P10 (A9) | The price status shown in the optimized result table is wrong | Price status querying does not consider an unexpected output of a microservice in its call chain | 6D
F13 | P11 (A9) | The result of price optimization is wrong | Price optimization steps are executed in an unexpected order | 6D
F14 | P11 (A9) | The result of the Consumer Price Index (CPI) is wrong | There is a mistake in including the locked product in CPI calculation | 2D
F15 | P11 (A9) | The data-synchronization job quits unexpectedly | The spark actor is used for the configuration of actorSystem (part of Apache Spark) instead of the system actor | 3D
F16 | P11 (A9) | The file-uploading process fails | The "max-content-length" configuration of spray is only 2 MB, which does not allow uploading a bigger file | 2D
F17 | P12 (A10) | The grid-loading process takes too much time | Too many nested "select" and "from" clauses are in the constructed SQL statement | 1D
F18 | P13 (A11) | Loading the product-analysis chart is erroneous | One key of the returned JSON data for the UI chart includes the null value | 0.5D
F19 | P13 (A11) | The price is displayed in an unexpected format | The product price is not formatted correctly in the French format | 1D
F20 | P14 (A12) | Nothing is returned upon a workflow data request | The JBoss startup classpath parameter does not include the right DB2 jar package | 3D
F21 | P15 (A13) | JAWS (a screen reader) misses reading some elements | The "aria-labeled-by" element for accessibility cannot be located by JAWS | 0.5D
F22 | P16 (A13) | The error of SQL column missing is returned upon some data request | The constructed SQL statement includes a wrong column name in the "select" part according to its "from" part | 1.5D
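Fault F1 in Table 2 (messages displayed in the wrong order because asynchronous delivery lacks sequence control) can be made concrete with a minimal sketch. The simulated delivery and the sequence-number fix below are our own illustration, not code from the reported system or the benchmark:

```python
# Minimal illustration of fault F1: asynchronous delivery without
# sequence control displays messages in arrival order rather than send
# order; attaching a sequence number at send time and reordering on the
# consumer side restores the intended order.
# (Hypothetical sketch, not code from the benchmark system.)

def deliver_async(messages):
    """Simulate asynchronous delivery: completion order != send order."""
    # Pretend the last message finished its network round-trip first.
    arrival_order = [2, 0, 1]
    return [messages[i] for i in arrival_order]

def display_naive(arrived):
    """Buggy consumer: displays messages in arrival order."""
    return [text for _, text in arrived]

def display_with_sequence_control(arrived):
    """Fixed consumer: reorders by the sequence number attached at send time."""
    return [text for _, text in sorted(arrived, key=lambda m: m[0])]

messages = [(0, "hello"), (1, "how are you?"), (2, "bye")]
arrived = deliver_async(messages)

print(display_naive(arrived))                  # arrival order: wrong
print(display_with_sequence_control(arrived))  # send order: restored
```

In the real fault, the reordering spans multiple message delivery microservices, which is exactly why the missing sequence control took days to locate.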
TABLE 3
Fault Categories

Root Cause \ Influence | Functional | Non-Functional
Internal    | F9, F14, F18, F19, F21, F22 | F17
Interaction | F1, F2, F6, F7, F8, F10, F11, F12, F13 | F5
Environment | F15, F16, F20 | F3, F4

TABLE 4
Maturity Levels of Debugging

Maturity Level | Systems | Percentage
Basic Log Analysis | A1, A7, A11 | 23%
Visual Log Analysis | A2, A3, A8, A9, A10, A12 | 46%
Visual Trace Analysis | A4, A5, A6, A13 | 31%
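Spectrum-based fault localization, one of the existing techniques discussed in Section 2, can be made concrete with a small sketch. The Ochiai formula used here is one common choice of suspiciousness metric (the survey does not prescribe one), and the coverage data are invented for illustration:

```python
# Sketch of spectrum-based fault localization: rank program entities by
# how strongly their execution correlates with failing tests.
# The Ochiai metric is one common choice; data here are invented.
import math

def ochiai_suspiciousness(coverage, outcomes):
    """Rank statements by Ochiai suspiciousness.

    coverage[t] is the set of statement ids executed by test t;
    outcomes[t] is True if test t passed, False if it failed.
    """
    total_failed = sum(1 for passed in outcomes if not passed)
    statements = set().union(*coverage)
    scores = {}
    for s in statements:
        failed_cov = sum(1 for cov, passed in zip(coverage, outcomes)
                         if not passed and s in cov)
        passed_cov = sum(1 for cov, passed in zip(coverage, outcomes)
                         if passed and s in cov)
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[s] = failed_cov / denom if denom else 0.0
    # Most suspicious first: statements executed mainly by failing tests.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Three tests over statements s1..s3; only the test touching s3 fails,
# so s3 ranks first.
coverage = [{"s1", "s2"}, {"s1", "s2"}, {"s2", "s3"}]
outcomes = [True, True, False]
print(ochiai_suspiciousness(coverage, outcomes))
```

As Section 2 notes, it is an open question how well such per-statement ranking carries over to microservice systems, where "executions" span many concurrently changing container instances.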
the symptoms and possibly reproduction steps of the fail- tions, or interactions of a group of microservices.
ures, and ended when the faults are fixed. The debugging • Fault Localization (FL). The developers localize the
processes typically include the following 7 steps. root causes of the failure based on the identified
suspicious locations. For each suspicious location,
• Initial Understanding (IU). The developers get an the developers confirm whether it involves real faults
initial understanding of the reported failure based that cause the failure and identify the precise location
on the failure report. They may also examine the of the faults.
logs from the production or test environment to • Fault Fixing (FF). The developers fix the identified
understand the failure. Based on the understanding, faults and verify the fixing by rerunning related test
they may have a preliminary judgement of the root cases.
causes or decide to further reproduce the failure for
debugging. Note that these steps are not always sequentially exe-
• Environment Setup (ES). The developers set up a cuted. Some steps may be repeated if the subsequent steps
runtime environment to reproduce the failure based can not be successfully done. For example, the developers
on their initial understanding of the failure. The environment setup includes the preparation of virtual machines, the deployment of related microservices, and the configuration of related microservice instances. To ease the debugging process, the developers usually set up a simplified environment, which, for example, includes as few virtual machines and microservices as possible. In some cases the developers can directly use the production or test environment that produces the failure for debugging, and thus this step can be skipped.
• Failure Reproduction (FR). Based on the prepared runtime environment, the developers execute the failure scenario to reproduce the failure. The developers usually try different data sets to reproduce the failure to get a preliminary feeling of the failure patterns, which are important for the subsequent steps.
• Failure Identification (FI). The developers identify failure symptoms from the failure reproduction executions. The symptoms can be error messages of microservice instances found in logs or abnormal behaviours of the microservice system (e.g., no response for a long time) observed by the developers.
• Fault Scoping (FS). The developers identify suspicious locations of the microservice system where the root causes may reside, for example implementations of individual microservices or environment configurations.
The developers may go back to set up the environment again if they find they cannot reproduce the failure. Some steps may be skipped if they are not required. For example, some experienced developers may skip environment setup and failure reproduction if they can locate the faults based on the logs from the production or test environment and verify the fault fixing by their special partial execution strategies.

3.4.2 Maturity Levels of Debugging Practices
We find that the practices and techniques on debugging for the 13 systems can be categorized into 3 maturity levels, as shown in Table 4.

The first level is basic log analysis. At this level, the developers analyze the original execution logs produced by the system to locate faults. The logs record the execution information of the system at specific points, including the time, executed methods, values of parameters and variables, intermediate results, and extra context information such as execution threads. Basic log analysis follows the debugging practices of monolithic systems and requires only common logging tools such as Log4j [48] for capturing and collecting execution logs. To locate a fault, the developers manually examine a large number of logs. Successful debugging at this level depends heavily on the developers' experience with the system (e.g., its overall architecture and error-prone microservices) and with similar fault cases, as well as on the technology stack being used.
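As a concrete illustration of this first level, basic log analysis boils down to filtering raw log lines with hand-written patterns. The sketch below is ours, not taken from the surveyed systems; the log format and service names are hypothetical:

```python
import re

# Hypothetical Log4j-style lines: "date time LEVEL service method message".
LOGS = [
    "2018-08-01 10:02:11 INFO order-service createOrder order 42 created",
    "2018-08-01 10:02:12 ERROR payment-service pay timeout after 30s",
    "2018-08-01 10:02:13 INFO order-service cancelOrder order 42 cancelled",
]

# A manually chosen pattern: extract service, method, and message of ERROR lines.
ERROR = re.compile(r"^\S+ \S+ ERROR (\S+) (\S+) (.*)$")

def find_errors(lines):
    """Return (service, method, message) for every line that logs an ERROR."""
    return [m.groups() for m in map(ERROR.match, lines) if m]

print(find_errors(LOGS))
```

At this maturity level the pattern itself, and the decision of which services' logs to scan, rest entirely on the developer's experience.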
IEEE TRANSACTION ON SOFTWARE ENGINEERING, VOL. 14, NO. 8, AUGUST 2018 6
TABLE 5
Time Analysis of Debugging Practices from Industrial Survey
The second level is visual log analysis. At this level, execution logs are structured and visualized for fault localization. The developers can flexibly retrieve the specific execution logs that they are interested in using conditions and regular expressions, and sort the candidate results according to specific debugging strategies. The selected execution logs can be aggregated and visualized by different kinds of statistical charts. Log retrieval and visualization are usually combined to allow the developers to interactively drill up and down through the data (execution logs). For example, to locate a fault resulting in abnormal execution results for a microservice, the developers can first use a histogram to learn the range and distribution of different results and then choose a specific abnormal result to examine related execution logs. To support visual log analysis, the developers need to use a centralized logging system to collect the execution logs produced in different nodes and include information about the microservice and its instances in the execution logs. Log analysis at this level highly depends on tools for log collection, retrieval, and visualization. A commonly used toolset is the ELK stack, i.e., Logstash [49] for log collection, ElasticSearch [50] for log indexing and retrieval, and Kibana [51] for visualization.

The third level is visual trace analysis. At this level, the developers further analyze collected traces of system executions with the support of trace visualization tools. A trace results from the execution of a scenario (e.g., a test case), and is composed of user-request trace segments. A user-request trace segment consists of logs that share the same user request ID (created for each user request). In particular, when a user request comes in the front door of the system, the adopted tracing framework [52] creates a unique user request ID, which is passed along with the request to each directly or indirectly invoked microservice. Thus, the logs collected for each such invoked microservice record the user request ID. The developers can use visualization tools to analyze user requests' microservice-invocation chains (extracted from the traces) and identify suspicious ranges of microservice invocations and executions. As a microservice can invoke multiple microservices in parallel, the visualization tools usually organize the microservice-invocation chains into a tree structure. For example, a visualization tool can vertically show a nested structure of microservice invocations and horizontally show the duration of each microservice invocation with colored bars. Analysis at this level highly depends on advanced microservice execution tracing and visualization tools. Commonly used toolsets include Dynatrace [52] and Zipkin [53]. Our survey shows that most companies choose to implement their own tracing and visualization tools, as these are specific to the implementation techniques of their microservice architecture.

Visual log analysis provides better support for most types of faults than basic log analysis. Flexible log retrieval provides quick filtering of execution logs. Visualized statistics of microservice executions (e.g., variable values or method execution counts) reveal patterns of microservice executions. These patterns can help locate suspicious microservice executions. For example, for F22, the developers can easily exclude those methods that are executed fewer times than the number of failure occurrences based on the statistics. However, locating interaction-related faults often requires the developers to understand microservice executions in the context of microservice-invocation chains. Visual trace analysis further improves visual log analysis by embedding log analysis in the context of traces. For example, for F1, the developers can compare the traces of success scenarios with the traces of failure scenarios, and identify the root cause based on the different orders of specific microservice executions.

Tables 2 and 4 show that there is often a mismatch between the log analysis level and the faults. For example, the system A7 is still at the basic log analysis level, which cannot help locate the fault F8 reported from this system. In such cases, the developers often need to manually investigate a lot of execution logs and code. They usually start with the failure-triggering location in the logs and then examine the logs backwards to find suspicious microservices and check the code of those microservices.

3.4.3 Effectiveness Analysis
To analyze the effectiveness of different debugging practices, we collect the maturity levels of debugging practices and
the time consumed by each step of the 22 fault cases. Table 5 shows the results, including the fault type, the number of microservices involved in the fault case (#MS), the supported maturity level and the actually adopted maturity level of debugging, and the time consumed for the whole debugging process and individual steps. The last line shows the average of the #MS and the average of the time consumed for the entire debugging process and individual steps.

Note that for some fault cases the maturity levels of the debugging practices adopted by the developers are lower than the levels supported in their teams. For example, for F5 the developers choose to use basic log analysis while they are equipped with visual trace analysis. Moreover, the developers may also combine practices of different levels. For example, when they adopt visual trace analysis or visual log analysis they may also use basic log analysis to examine details.

The time consumed for the whole debugging process and individual steps is obtained from the descriptions of the participants during the interviews. The participants are asked to estimate the time in hours. To validate their estimation, the participants are asked to confirm the estimation with their colleagues and examine the records (e.g., the debugging time indicated by the period between bug assignment and resolution) in their issue tracking systems.

On average, the time used to locate and fix a fault increases with the number of microservices involved in the fault: 9.5 hours for one microservice, 20 hours for two microservices, 40 hours for three microservices, and 48 hours for more than three microservices. For some fault cases (e.g., F6) the overall time is less than the sum of the time spent on each step. This is usually caused by the simultaneous execution of multiple steps. For example, when confirming a suspicious location of the fault in fault localization, the developers can simultaneously conduct fault scoping to identify more suspicious locations.

We find that the advantages of visual log analysis and visual trace analysis are more obvious for interaction faults. On average, the developers spend 20, 35, and 45 hours in these fault cases when adopting visual trace analysis, visual log analysis, and basic log analysis, respectively.

In general, initial understanding, fault scoping, and fault localization are more time consuming than the other steps, as these steps require in-depth understanding and analysis of logs. Also, in these steps the advantages of visual log/trace analysis are more obvious. For example, the average time for initial understanding is 3, 7, and 21 hours when using visual trace analysis, visual log analysis, and basic log analysis, respectively. In some cases (e.g., F9 and F17-19) the developers choose to skip environment setup and failure reproduction, as they can easily identify the failure symptoms from user interfaces or exceptions. In other cases (e.g., F17, F19, F21, F22) the developers choose to skip failure identification and fault scoping, as they can identify potential locations of the faults based on failure symptoms and past experience.

According to the feedback of the participants, 11 out of the 13 of them who have experience with visual log/trace analysis believe that the visual analysis tools and practices are very useful. But how much the tools and practices can help depends on the fault types and the developers' experiences, skills, and preferences.

4 BENCHMARK SYSTEM AND FAULT-CASE REPLICATION
Our survey clearly reveals that the existing practices for fault analysis and debugging of microservice systems can be much improved. To conduct research in this area, one of the difficulties faced by researchers is that there is a lack of benchmark systems, which is due to the great complexity of setting up realistic microservice systems. We thus set up a benchmark system: TrainTicket [25]. Our empirical study is based on TrainTicket and the 22 fault cases that are reported in the survey and replicated in the system. The system and the replicated fault cases can be used as a valuable benchmark for the broad research community to further conduct practice-relevant research on microservice fault analysis and debugging, and even other broad types of practice-relevant research on microservices.

TrainTicket provides typical train ticket booking functionalities such as ticket enquiry, reservation, payment, change, and user notification. It is designed using microservice design principles and covers different interaction modes such as synchronous invocations, asynchronous invocations, and message queues. The system contains 41 microservices related to business logic (without counting all database and infrastructure microservices). It uses four programming languages: Java, Python, Node.js, and Go. A detailed description of the system (along with the source code of our open source benchmark system and replicated faults) can be found in our replication package [28].

We replicate all the 22 fault cases collected from the industrial survey. In general, these fault cases are replicated by transferring the fault mechanisms from the original systems to the benchmark system. In the following, we describe the replication implementation of some representative fault cases. Descriptions of the replication implementation of the other fault cases can be found in our replication package [28].

F1 is a fault associated with asynchronous tasks, i.e., when messages are sent asynchronously without message sequence control. We replicate this fault in the order cancellation process of TrainTicket. In the process, there are two asynchronous tasks being sent, which have no additional sequence control. The first task should always be completed before the second one. However, if the first task is delayed and completed only after the second one, the order reaches an abnormal status, leading to a failure.

F3 is a reliability problem caused by improper configurations of JVM and Docker. The JVM's max memory configuration conflicts with the Docker cluster's memory limitation configuration. As a result, Docker sometimes kills the JVM process. We replicate this fault in the ticket searching process. We select some microservices that are involved in this process and revise them to be more resource consuming. These revised microservices are deployed in a Docker cluster with conflicting configurations, thus making these microservices sometimes unavailable.

F4 is a performance problem caused by improper configuration of Secure Sockets Layer (SSL) applied for many microservices. The result is frequent SSL offloading at a fine
granularity, which slows down the executions of related microservices. We replicate this fault by applying the faulty SSL configuration to every microservice of TrainTicket. Then, when a user requests a service (e.g., ticket reservation), he/she will feel that the response time is very long.

F5 is a reliability problem caused by improper usage of a thread pool. The microservice uses a thread pool to process multiple different types of service requests. When the thread pool is exhausted due to the high load of one type of service requests, another type of service requests will fail due to timeout. We replicate this fault in the ticket reservation service, which serves both the ticket searching process and the ticket booking process. When the load of ticket searching is high, the thread pool of the service will be exhausted and the ticket booking requests to the service will fail due to timeout.

F8 is caused by missing or incorrect parameter passing along an invocation chain. We replicate this fault in the order cancellation process. When a VIP user tries to cancel a ticket order, the login token saved in Redis [54] (an in-memory data store) is not passed to some involved microservices that require the token. This fault causes the user to get an unexpectedly lower ticket refund rate.

F10 is caused by an unexpected output of a microservice, which is used in a special case of business processing. We replicate this fault in the ticket booking process. In the ticket ordering service, we implement two APIs, which respectively serve general ticket ordering and ticket ordering for some special stations. The API for special ticket ordering sometimes returns an unexpected output that is not correctly handled, thus making the ticket booking process fail.

F11 is a fault that occurs in asynchronous updating of data, caused by missing sequence control. When the bill of material (BOM) tree is updated in an unexpected order, the resulting tree is incorrect. But when the user turns on the "strict mode" on product BOM services, the resulting tree is rebuilt when the BOM tree includes some negative numbers, leading to a correct tree. We replicate this fault in the order cancellation process, which includes two microservices (payment service and cancel service) that asynchronously set the same value in the database. Due to the missing sequence control, the two microservices may set the value in a wrong sequence, thus causing an incorrect value. But if the user turns on the "strict order" mode on the order service, the incorrect value will eventually be corrected.

F12 is caused by an unexpected output of a microservice when it is in a special state. We replicate this fault in the ticket booking process. We introduce the state admDepStation/admDesStation for a ticket reservation service instance to indicate the departure/destination station whose tickets the administrator is examining. If no administrator is examining tickets, the corresponding ticket reservation service instance is without state. If the departure/destination station of a ticket reservation request is admDepStation/admDesStation, and the ticket reservation service is accessed by the same request thread twice or more times, including both with and without the state instance, the request will be denied with an unexpected output and the ticket booking process returns an error.

For each of these preceding faults, we create a development branch for its replication in the fault case repository [27]. Researchers using the repository can easily mix and match different faults to produce a faulty version of TrainTicket including multiple faults.

5 EMPIRICAL STUDY
Our empirical study with the TrainTicket system and the replicated fault cases includes two parts. In the first part, we investigate the effectiveness of existing industrial debugging practices for the fault cases. In the second part, we develop a microservice execution tracing tool and two trace visualization strategies for fault localization based on a state-of-the-art debugging visualization tool [10] for distributed systems, and investigate whether it can improve the effectiveness of debugging interaction faults. A group of 6 graduate students who are familiar with TrainTicket and have comparable experience of microservice development serve as the developers for conducting debugging independently. For each fault case, the developers locate and fix the faults based on a given failure report, and the developers who debug with different practices are different. To provide a fair comparison, we randomly select a developer for each fault case and each practice to allow a developer to use different practices for different fault cases. The developers follow the general process presented in Section 3.4.1 for debugging. For any step, if the developers cannot complete it in two hours they can choose to give up, and then the step and the whole process fail.

5.1 Debugging with Industrial Debugging Practices
In this part of the study, we investigate the effectiveness of the debugging practices of the three maturity levels by qualitative analysis and quantitative analysis, respectively. For each fault case three developers are selected to debug with the practices of different maturity levels. The tools provided for the different maturity levels are as follows.
• Basic Log Analysis. The developers use command line tools to grab and analyze logs.
• Visual Log Analysis. The developers use the ELK stack, i.e., Logstash [49] for log collection, ElasticSearch [50] for log indexing and retrieval, and Kibana [51] for visualization.
• Visual Trace Analysis. The developers use both the ELK stack and Zipkin [53] for debugging.

5.1.1 Qualitative Analysis
We qualitatively compare different levels of practices based on the debugging processes of F8, as shown in Figure 1.

Figure 1(a) presents a snapshot of basic log analysis, which shows the logs captured from a container running a microservice of the food service. The developers identify a suspicious log fragment in the red box and find that the food refund rate is 68%, which is lower than the predefined VIP refund rate for food ordering. Thus they can regard the calculation of the food refund rate as a potential fault location. A shortage of basic log analysis is the lack of context of
microservice invocation, which makes it hard for the developers to analyze and understand the logs in the context of user requests and invocation chains.

Figure 1(b) presents a snapshot of visual log analysis. It shows the histograms of the average refund rate of the instances of two related microservices (food service and consign service) in different virtual machines, as well as the corresponding logs. The refunds of these two services are both included in the ticket refund. As the failure symptom is a low ticket refund rate, the developers choose the lowest bar, which shows the average food refund rate in VM3 (see the red box in Figure 1(b)), to check the logs. From the logs the developers find that the lowest food refund rate is 65%, and thus regard the calculation of the food refund rate as a potential fault location. Compared with basic log analysis, visual log analysis provides aggregated statistics of variables and quality attributes (e.g., response time), and thus can help developers to identify suspicious microservices and instances. However, it lacks the context of user requests and invocation chains, and thus cannot support the analysis of microservice interactions.

Figure 1(c) presents a snapshot of visual trace analysis. It shows the entire trace of the order cancellation process, including the nested invocations of microservices and the consumed time of each invocation. The developers find that the ticket cancellation process invokes not only the food service and the consign service, but also the config service and the route service. Then they further analyze the logs of the config service and find that a suspicious general refund rate (which can be 36% in the lowest case) is used by the ticket cancellation process to calculate the final refund rate. They thus regard the calculation of the general refund rate in the config service as a potential fault location. This fault localization is more precise than the localization supported by the basic log analysis and the visual log analysis. Compared with visual log analysis, visual trace analysis supports the understanding of microservice executions in the context of user requests and invocation chains.

5.1.2 Quantitative Analysis
The results of the study are shown in Table 6, including the time used for the whole debugging process and that of each
step. A mark "-" means that the developer skips the step. If all the steps are skipped, it means that the fault can be easily identified and fixed with lower-level practices (e.g., basic log analysis) and thus there is no need for higher-level practices. A mark "failed" means that the developer fails to complete the step. If a step of a debugging process fails, the whole process fails as well.

TABLE 6
Time Analysis of Debugging with Industrial Debugging Practices (in hours; IU = initial understanding, ES = environment setup, FR = failure reproduction, FI = failure identification, FS = fault scoping, FL = fault localization, FF = fault fixing)

Fault  Practice      Total  IU   ES   FR   FI   FS   FL   FF
F10    basic log     4      1.2  0.3  0.3  0.8  0.6  0.6  0.3
F10    visual log    3.6    1    0.3  0.3  0.6  0.6  0.6  0.3
F10    visual trace  3      0.8  0.3  0.3  0.4  0.4  0.6  0.2
F11    basic log     6      1.6  0.8  0.8  1    0.6  1    0.3
F11    visual log    5      1.1  0.6  0.4  1    0.6  0.8  0.2
F11    visual trace  2.9    0.4  0.4  0.4  0.6  0.4  0.6  0.2
F12    basic log     10     4    0.8  0.8  1    1.4  1.2  0.8
F12    visual log    6      2    0.6  0.6  0.6  1    0.8  0.4
F12    visual trace  3.3    1    0.6  0.4  0.4  0.4  0.6  0.2
F13    basic log     6.3    2    0.7  0.7  0.8  1    1    0.3
F13    visual log    5.4    1.6  0.6  0.6  0.6  0.6  0.8  0.4
F13    visual trace  4.3    1    0.6  0.3  0.4  0.6  0.9  0.3
F14    basic log     1      0.2  0.1  0.1  0.2  0.2  0.1  0.1
F14    visual log    1.1    0.3  0.1  0.1  0.2  0.2  0.1  0.1
F14    visual trace  0.8    0.2  0.1  0.1  0.1  0.1  0.1  0.1
F15    basic log     1      0.2  0.2  0.2  0.1  0.1  0.1  0.1
F15    visual log    0.4    0.1  -    -    -    -    0.2  0.1
F15    visual trace  0.4    0.1  -    -    -    -    0.2  0.1
F16    basic log     2.1    0.6  0.3  0.3  0.3  0.3  0.2  0.2
F16    visual log    1.8    0.4  0.3  0.3  0.3  0.3  0.2  0.2
F16    visual trace  2      0.4  0.3  0.3  0.4  0.4  0.2  0.2
F17    basic log     2.6    1    0.3  0.3  0.2  0.2  0.4  0.2
F17    visual log    1.6    0.4  0.3  0.3  0.2  0.2  0.1  0.1
F17    visual trace  1.7    0.4  0.3  0.3  0.1  0.3  0.2  0.1
F18    basic log     1.3    0.4  0.2  0.2  0.2  0.2  0.1  0.1
F18    visual log    1      0.4  -    -    0.2  0.2  0.2  0.1
F18    visual trace  1.1    0.4  -    -    0.2  0.2  0.2  0.1
F19    basic log     0.7    0.3  -    -    0.1  0.1  0.1  0.1
F19    visual log    -      -    -    -    -    -    -    -
F19    visual trace  -      -    -    -    -    -    -    -
F20    basic log     2.2    0.8  0.3  0.3  0.4  0.2  0.2  0.1
F20    visual log    0.8    0.3  -    -    0.2  0.1  0.1  0.1
F20    visual trace  0.8    0.3  -    -    0.2  0.1  0.1  0.1
F21    basic log     1.6    0.4  0.2  0.2  0.3  0.2  0.2  0.1
F21    visual log    -      -    -    -    -    -    -    -
F21    visual trace  -      -    -    -    -    -    -    -
F22    basic log     0.4    0.2  -    -    -    -    0.1  0.1
F22    visual log    -      -    -    -    -    -    -    -
F22    visual trace  -      -    -    -    -    -    -    -

The developers fail in F3 and F4 with all the three levels of industrial practices. Both of them are non-functional Environment faults. For F9, F19, F21, and F22, the developers easily locate and fix the faults with basic log analysis. All of them are Internal faults. For the other faults, there is a general trend of reduced debugging time with the employment of higher-level debugging practices (from basic log and visual log to visual trace analysis). In these fault cases Interaction faults are the ones that benefit the most from higher levels of debugging practices.

Similar to the industrial survey, initial understanding, fault scoping, and fault localization are more time consuming than the other steps, and environment setup and failure reproduction are sometimes skipped. The time used for environment setup and failure reproduction varies with the employed debugging practices. According to the feedback from the developers, they often try to make the simplest failure reproduction based on the initial understanding, so the accuracy of the initial understanding influences the time used for environment setup and failure reproduction.

5.2 Debugging with Improved Trace Visualization
From the above, we observe that tracing and visualization can potentially help fault analysis and debugging of microservice systems. Thus, in this part of the study, in order to better support fault analysis and debugging of microservice systems, we investigate the effectiveness of state-of-the-art distributed system debugging techniques for microservice system debugging.

5.2.1 Tracing and Visualization Approach
ShiViz [10] is a state-of-the-art debugging visualization tool for distributed systems. It visualizes distributed system executions as interactive time-space diagrams that explicitly capture the distributed ordering of events in the system. ShiViz supports pairwise comparison of two traces by highlighting their differences. It compares the distributed-system nodes and events from two traces by names and descriptions, and highlights the nodes or events (in one trace) that do not appear in the other. ShiViz supports the selection of a part of a trace for comparison. For example, we can select a user-request trace segment, i.e., the events for a specific user request, based on the request ID.

Figure 2 presents an example of trace visualization by ShiViz, which shows the nodes (colored boxes at the top), the node timelines (vertical lines), events (circles on timelines), and partial orders between events (edges connecting events). The rhombuses (events) on the left side highlight the differences between the two traces, and we can also click to see the details of the rhombuses.

This pairwise comparison can be used to locate suspicious nodes and events in microservice system debugging when execution information (e.g., service name, user request ID, and invoked method) is added to the names and descriptions of nodes and events, by treating a microservice unit as a distributed-system node. We can leverage ShiViz to visualize the traces of microservices by transforming the trace logs
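The grouping of collected logs into user-request trace segments described in this subsection can be sketched as follows. The record layout and service names are hypothetical, and the real transformation into ShiViz's input format involves additional fields (e.g., event-ordering information):

```python
from collections import defaultdict

# Hypothetical trace records: (request_id, service, event description).
TRACE_LOG = [
    ("req-1", "ts-ui-service", "receive cancelOrder"),
    ("req-1", "ts-order-service", "invoke getOrder"),
    ("req-2", "ts-ui-service", "receive searchTicket"),
    ("req-1", "ts-config-service", "invoke getRefundRate"),
]

def segments_by_request(records):
    """Group events into user-request trace segments keyed by request ID,
    preserving the order in which the events were logged."""
    segments = defaultdict(list)
    for request_id, service, event in records:
        segments[request_id].append((service, event))
    return dict(segments)

print(segments_by_request(TRACE_LOG))
```

Each resulting segment can then be visualized or compared on its own, which is what makes per-request selection possible.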
[Figure: (a) Service Level Analysis; (b) State Level Analysis (Success vs. Failure); (c) State Level Analysis (Failure vs. Failure)]
When necessary, the developers can introduce more state variables/expressions for comparison, at the price of having more nodes in the visualized trace. The developers can gradually adjust the strategies and attempt different combinations of state variables/expressions. Heuristics can be applied to identify such combinations. For instance, desirable state variables/expressions are likely built from static variables, singleton member variables, or key-values in temporary storage, e.g., in Redis [54].

We find that all the successful analyses are based on the following four tactics for comparing a failure trace and a success trace.
T1 (Single-Event Difference). The fault-revealing range is a single event, and the difference lies only in the descriptions (e.g., invoked method) of the event.
T2 (Single-Range Difference). The fault-revealing range involves different interaction sequences among nodes.
T3 (Multiple-Range Difference). The execution orders of multiple fault-revealing ranges are different.
T4 (Multiple-Request Difference). The execution orders of multiple user requests are different.

Among these tactics, T1 does not involve differences in node interaction sequences, while the other three tactics do. The tactics used for the analysis of each fault case are also shown in Table 7. It can be seen that tactics may be combined for debugging to locate a fault, because a fault may involve multiple fault-revealing ranges at different levels. The difficulty of debugging increases from T1 to T4 with the analysis of trace differences in a larger scope.

T4 is relatively hard to use, as it involves complex interactions among different user requests. F2 is an example for which T4 must be used. Due to the extensive usage of asynchronous interactions in microservice systems, the processing orders of user requests do not always correspond to their receiving order. If there are interactions among different user requests, it is likely that a fault will be introduced due to erroneous coordination of the processing of user requests. F2 is an example of this case. As the trace analysis involves a large number of events across multiple user requests, and the events of different requests are interleaved, F2 cannot be effectively analyzed based on existing visualization techniques, unless the differences between success and failure traces are reflected in the trace comparison of a single request. For F2, the developer spends a lot of time seeking the root cause. But for F13, as the trace analysis involves a much smaller number of events across multiple user requests compared to F2, and the events of different requests on ShiViz can be easily distinguished, F13 can be effectively analyzed with less time consumed.

5.2.4 Quantitative Analysis
The time-analysis results of debugging with improved trace visualization are shown in Table 7. Among the 12 fault cases, the developers fail in 2 cases (F3 and F4), in which they also fail with visual trace analysis. For F16, the developers succeed but use more time than with visual trace analysis. These three cases are all Environment faults (F3 and F4 are non-functional, F16 is functional); this result suggests that debugging such faults cannot benefit from trace analysis.

In all the other 9 fault cases, the developers achieve improved debugging effectiveness, with the average debugging time decreasing from 3.23 hours to 2.14 hours. Note that these 9 fault cases are all Interaction faults. For these faults, fault localization, initial understanding, and failure reproduction are the three steps that benefit the most from the analysis. The time used for these steps is reduced by 49%, 28%, and 24%, respectively, compared with visual trace analysis.

Table 7 also shows the detailed analysis processes of the developers on each of the 12 fault cases, including the used visualization strategy, the number of nodes (#N.), the number of events (#E.), the number of user requests (#UR.), the number of fault-revealing ranges identified in each analysis (#FR.), the number of events in fault-revealing ranges in each analysis (#FE.), and hit (i.e., whether the analysis succeeds in identifying at least one true fault-revealing range; 'Y' indicates successfully identified, 'N' indicates failed) in each analysis. A mark "-" indicates that the developers fail in identifying the ranges or events. The results show that each debugging with a combination of success and failure traces involves about 7-22 nodes (representing services or service states) and hundreds to thousands of events. These events belong to 2-7 user requests, and the traces of each user request are compared separately.

Each successful analysis identifies several fault-revealing ranges with dozens of events. For some fault cases (F1, F2, F7, F8, F10, and F13), service-level analysis can effectively identify the fault-revealing ranges (Hit is 'Y'). Some other fault cases (F5, F11, and F12) require service state-level analysis to identify the fault-revealing ranges. For F12, both developers successfully identify one of the issue states.
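The pairwise success/failure comparison that these analyses rely on can be sketched as a set difference over (service, event) pairs, in the spirit of ShiViz's name/description matching. The traces and service names below are hypothetical:

```python
def trace_diff(success, failure):
    """Return the events that appear in only one of the two traces,
    comparing events by (service, description) pairs."""
    s, f = set(success), set(failure)
    return sorted(f - s), sorted(s - f)

# Hypothetical per-request traces of an order cancellation scenario.
success_trace = [("ts-order-service", "cancelOrder"),
                 ("ts-food-service", "refundFood"),
                 ("ts-consign-service", "refundConsign")]
failure_trace = [("ts-order-service", "cancelOrder"),
                 ("ts-food-service", "refundFood"),
                 ("ts-config-service", "getGeneralRefundRate")]

only_in_failure, only_in_success = trace_diff(success_trace, failure_trace)
print(only_in_failure)  # candidate fault-revealing events
print(only_in_success)
```

A presence difference of this kind corresponds roughly to the single-event and single-range tactics; ordering differences (the multiple-range and multiple-request tactics) additionally require comparing the positions of matching events across the two traces.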
TABLE 7
Empirical Study Results of Debugging with Improved Trace Visualization
There are also unsuccessful cases (F3, F4), indicating that the developers fail to locate at least one fault-revealing range. The main reason is that the two cases are environmental faults, and they are not sufficiently supported by our debugging methodology.

5.3 Findings

Our study shows that most fault cases, except those caused by environmental settings, can benefit from trace visualization, especially those related to microservice interactions. By treating microservices or microservice states as the nodes, we can further improve the effectiveness of microservice debugging using state-of-the-art debugging visualization tools for distributed systems. A difficulty for state-level tracing and visualization mainly lies in the definition of microservice states. As a microservice system may have a large number of microservices and state variables/expressions, achieving effective and efficient fault analysis highly depends on the experience of the developers in identifying a few key states that can help reveal the faults.

A challenge for trace visualization lies in the huge number of nodes and events. Large-scale industrial microservice systems have hundreds to thousands of microservices and tens of thousands to millions of events in a trace. Such a number of nodes and events can make visualization analysis infeasible. This problem can be alleviated from two aspects. First, better trace visualization techniques such as zoom in/out and node/event clustering are required to allow the developers to focus on suspicious scopes. For example, node/event clustering can adaptively group cohesive nodes and events together, and thus reduce the number of nodes and events to be examined by progressively disclosing information. Second, fault localization techniques such as spectrum-based fault localization [38], [55] and delta debugging [56] can be combined with visualization analysis for microservice debugging. On the one hand, the combination can suggest suspicious scopes in traces by applying statistical fault localization to microservice invocations; on the other hand, it can provide results of code-level fault localization (e.g., code blocks) within specific microservices.

In view of the great complexity caused by the scale of microservice interactions and the dynamics of the infrastructure, we believe that the debugging of microservices needs to be supported in a data-driven way. For instance, one way is to combine human expertise and machine intelligence for guided visual exploration and comparison of traces. The supporting tools can take full advantage of the large amount of data produced by runtime monitoring and historical analysis and provide critical suggestions and guidance during the visual exploration and comparison of traces. For example, the tools can suggest suspicious scopes in traces and sensitive state variables that may differentiate success and failure traces based on probabilistic data analysis, or recommend historical fault cases that share similar trace patterns. Based on these suggestions and guidance, the developers can dig into possible segments of traces or add relevant state variables to trace comparison and visualization. These actions are in turn collected and used by the tools as feedback to improve further suggestions and guidance.

6 THREATS TO VALIDITY

One common threat to the external validity of our studies lies in the limited participants and fault cases. The industrial experiences learned from these participants may not represent other companies or microservice systems that have different characteristics. The fault cases collected from the industrial participants may not cover more complex faults or other different fault types. One major threat to the internal validity of the industrial survey lies in the accuracy of the information (e.g., time of each debugging step) collected from the participants. As such information is not completely based on precise historical records, some of the information may not be accurate.

The threats to the external validity of the empirical study mainly lie in the representativeness of the benchmark system. The system currently is smaller and less complex (e.g., less heterogeneous) than most of the surveyed industrial systems, despite being the largest and most complex open source microservice system within our knowledge. Thus some experiences of debugging obtained from the study may not be valid for large industrial systems.

There are three major threats to the internal validity of the empirical study. The first one lies in the implementation of the fault cases based on our understanding. The understanding may be inaccurate, and the replication of some faults in a different system may not fully capture the essential characteristics of the fault cases. The second one lies in the uncertainty of runtime environments such as access load and network traffic. Some faults may behave differently with different environment settings, and thus need
different debugging strategies. The third one lies in the differences in the experience and skills of the developers who participated in the study. These differences may also contribute to the differences in debugging time and results across the different practices.

7 RELATED WORK

Some researchers review the development and status of microservice research using systematic mapping studies and literature reviews. Francesco et al. [2] present a systematic mapping study on the current state of the art on architecting microservices from three perspectives: publication trends, focus of research, and potential for industrial adoption. One of their conclusions is that research on architecting microservices is still in its initial phases and the balanced involvement of industrial and academic authors is promising. Alshuqayran et al. [57] present a systematic mapping study on microservice architecture, focusing on the architectural challenges of microservice systems, the architectural diagrams used for representing them, and the involved quality requirements. Dragoni et al. [58] review the development history from objects, services, to microservices, present the current state of the art, and raise some open problems and future challenges. Aderaldo et al. [23] present an initial set of requirements for a candidate microservice benchmark system to be used in research on software architecture. They evaluate five open source microservice systems based on these requirements, and the results indicate that none of them is mature enough to be used as a community-wide research benchmark. Our open source benchmark system offers a promising candidate to fill this vacancy. Our industrial survey well supplements these previous systematic mapping studies and literature reviews.

There has been some research on debugging concurrent programs [38], [59], [60] and distributed systems [10], [61], [62], [63]. Asadollah et al. [64] present a systematic mapping study on debugging concurrent and multicore software in the decade between 2005 and 2014. Bailis et al. [61] present a survey on recent techniques for debugging distributed systems, with the conclusion that the state of the art of debugging distributed systems is still in its infancy. Giraldeau et al. [62] propose a technique to visualize the execution of distributed systems using scheduling, network, and interrupt events. Aguerre et al. [63] present a simulation and visualization platform that incorporates a distributed debugger. Beschastnikh et al. [10] discuss the key features and debugging challenges of distributed systems and present a debugging visualization tool named ShiViz, which our empirical study investigates and extends. In contrast to such previous research, our work is the first to focus on debugging support for microservice systems.

8 CONCLUSION

In this work, we have presented an industrial survey to conduct fault analysis on typical faults of microservice systems, current industrial practice of debugging, and the challenges faced by the developers. Based on the survey results, we have developed a medium-size benchmark microservice system (being the largest and most complex open source microservice system within our knowledge) and replicated 22 representative fault cases from industrial ones based on the system. These replicated faults have then been used as the basis of our empirical study on microservice debugging. The results of the study show that, by using proper tracing and visualization techniques or strategies, tracing and visualization analysis can help debugging for locating various kinds of faults involving microservice interactions. Our findings from the study also indicate that there is a need for more intelligent trace analysis and visualization, e.g., by combining techniques of trace visualization and improved fault localization, and employing data-driven and learning-based recommendation for guided visual exploration and comparison of traces.

Industrial microservice systems are often large and complex. For example, industrial systems are highly heterogeneous in microservice interactions and may use not only REST invocations and message queues but also remote procedure calls and socket communication. Moreover, industrial systems run on highly complex infrastructures such as auto-scaling microservice clusters and service mesh (a dedicated infrastructure layer for service-to-service communication [65]). Such complexity and heterogeneity pose additional challenges on execution tracing and visualization. Our future work plans to further extend our benchmark system to reflect more characteristics of industrial microservice systems, and to explore effective trace visualization techniques and their combination with fault localization techniques (e.g., spectrum-based fault localization [38] and delta debugging [56]) for microservice debugging. Moreover, we plan to explore a more technology-independent way to inject tracing information at every service invocation via service mesh tools such as Linkerd [66] and Istio [67].

ACKNOWLEDGMENTS

This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1004803. Tao Xie's work was supported in part by National Science Foundation under grants no. CNS-1513939, CNS-1564274, and CCF-1816615.

REFERENCES

[1] J. Lewis and M. Fowler, "Microservices: a definition of this new architectural term," 2014. [Online]. Available: http://martinfowler.com/articles/microservices.html
[2] P. D. Francesco, I. Malavolta, and P. Lago, "Research on architecting microservices: Trends, focus, and potential for industrial adoption," in 2017 IEEE International Conference on Software Architecture, ICSA 2017, Gothenburg, Sweden, April 3-7, 2017, 2017, pp. 21–30.
[3] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M. K. Reiter, and V. Sekar, "Gremlin: Systematic resilience testing of microservices," in 36th IEEE International Conference on Distributed Computing Systems, ICDCS 2016, Nara, Japan, June 27-30, 2016, 2016, pp. 57–66.
[4] Netflix.Com, "Netflix," 2018. [Online]. Available: https://www.netflix.com/
[5] SmartBear, "Why you can't talk about microservices without mentioning Netflix," 2015. [Online]. Available: https://smartbear.com/blog/develop/why-you-cant-talk-about-microservices-without-ment/
[6] Wechat.Com, "Wechat," 2018. [Online]. Available: https://www.wechat.com/
[7] H. Zhou, M. Chen, Q. Lin, Y. Wang, X. She, S. Liu, R. Gu, B. C. Ooi, and J. Yang, "Overload control for scaling wechat microservices," in Proceedings of the ACM Symposium on Cloud Computing, SoCC 2018, Carlsbad, CA, USA, October 11-13, 2018, 2018, pp. 149–161.
[8] A. Deb, "Application delivery service challenges in microservices-based applications," 2016. [Online]. Available: http://www.thefabricnet.com/application-delivery-service-challenges-in-microservices-based-applications/
[9] Amazon.Com, "Amazon," 2017. [Online]. Available: https://d0.awsstatic.com/whitepapers/microservices-on-aws.pdf
[10] I. Beschastnikh, P. Wang, Y. Brun, and M. D. Ernst, "Debugging distributed systems," Commun. ACM, vol. 59, no. 8, pp. 32–37, 2016.
[11] S. Hassan and R. Bahsoon, "Microservices and their design trade-offs: A self-adaptive roadmap," in IEEE International Conference on Services Computing, SCC 2016, San Francisco, CA, USA, June 27 - July 2, 2016, 2016, pp. 813–818.
[12] G. Schermann, D. Schöni, P. Leitner, and H. C. Gall, "Bifrost: Supporting continuous deployment with automated enactment of multi-phase live testing strategies," in Proceedings of the 17th International Middleware Conference, Trento, Italy, December 12 - 16, 2016, 2016, p. 12.
[13] A. de Camargo, I. L. Salvadori, R. dos Santos Mello, and F. Siqueira, "An architecture to automate performance tests on microservices," in Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, iiWAS 2016, Singapore, November 28-30, 2016, 2016, pp. 422–429.
[14] R. Heinrich, A. van Hoorn, H. Knoche, F. Li, L. E. Lwakatare, C. Pahl, S. Schulte, and J. Wettinger, "Performance engineering for microservices: Research challenges and directions," in Companion Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE 2017, L'Aquila, Italy, April 22-26, 2017, 2017, pp. 223–226.
[15] P. Leitner, J. Cito, and E. Stöckli, "Modelling and managing deployment costs of microservice-based cloud applications," in Proceedings of the 9th International Conference on Utility and Cloud Computing, UCC 2016, Shanghai, China, December 6-9, 2016, 2016, pp. 165–174.
[16] W. Hasselbring, "Microservices for scalability: Keynote talk abstract," in Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016, Delft, The Netherlands, March 12-16, 2016, 2016, pp. 133–134.
[17] S. Klock, J. M. E. M. van der Werf, J. P. Guelen, and S. Jansen, "Workload-based clustering of coherent feature sets in microservice architectures," in 2017 IEEE International Conference on Software Architecture, ICSA 2017, Gothenburg, Sweden, April 3-7, 2017, 2017, pp. 11–20.
[18] A. Panda, M. Sagiv, and S. Shenker, "Verification in the age of microservices," in Proceedings of the 16th Workshop on Hot Topics in Operating Systems, HotOS 2017, Whistler, BC, Canada, May 8-10, 2017, 2017, pp. 30–36.
[19] I. L. Salvadori, A. Huf, R. dos Santos Mello, and F. Siqueira, "Publishing linked data through semantic microservices composition," in Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, iiWAS 2016, Singapore, November 28-30, 2016, 2016, pp. 443–452.
[20] G. Granchelli, M. Cardarelli, P. D. Francesco, I. Malavolta, L. Iovino, and A. D. Salle, "Microart: A software architecture recovery tool for maintaining microservice-based systems," in 2017 IEEE International Conference on Software Architecture Workshops, ICSA Workshops 2017, Gothenburg, Sweden, April 5-7, 2017, 2017, pp. 298–302.
[21] J. Lin, L. C. Lin, and S. Huang, "Migrating web applications to clouds with microservice architectures," in 2016 International Conference on Applied System Innovation (ICASI). IEEE, 2016, pp. 1–4.
[22] S. Hassan and R. Bahsoon, "Microservices and their design trade-offs: A self-adaptive roadmap," in IEEE International Conference on Services Computing, SCC 2016, San Francisco, CA, USA, June 27 - July 2, 2016, 2016, pp. 813–818.
[23] C. M. Aderaldo, N. C. Mendonca, C. Pahl, and P. Jamshidi, "Benchmark requirements for microservices architecture research," in 1st IEEE/ACM International Workshop on Establishing the Community-Wide Infrastructure for Architecture-Based Software Engineering, ECASE@ICSE 2017, Buenos Aires, Argentina, May 22, 2017, 2017, pp. 8–13.
[24] P. Jamshidi, C. Pahl, N. C. Mendonça, J. Lewis, and S. Tilkov, "Microservices: The journey so far and challenges ahead," IEEE Software, vol. 35, no. 3, pp. 24–35, 2018.
[25] Microservice.System.Benchmark, "Trainticket," 2018. [Online]. Available: https://github.com/FudanSELab/train-ticket/
[26] X. Zhou, X. Peng, T. Xie, J. Sun, C. Xu, C. Ji, and W. Zhao, "Benchmarking microservice systems for software engineering research," in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, 2018, pp. 323–324.
[27] Fault.Replication, "Fault replication," 2017. [Online]. Available: https://github.com/FudanSELab/train-ticket-fault-replicate
[28] Replication.Package, "Fault analysis and debugging of microservice systems," 2018. [Online]. Available: https://fudanselab.github.io/research/MSFaultEmpiricalStudy/
[29] SpringBoot.Com, "Spring boot," 2018. [Online]. Available: http://projects.spring.io/spring-boot/
[30] Dubbo.Com, "Dubbo," 2017. [Online]. Available: http://dubbo.io/
[31] Docker.Com, "Docker," 2018. [Online]. Available: https://docker.com/
[32] SpringCloud.Com, "Spring cloud," 2018. [Online]. Available: http://projects.spring.io/spring-cloud/
[33] Mesos.Com, "Mesos," 2018. [Online]. Available: http://mesos.apache.org/
[34] Kubernetes.Com, "Kubernetes," 2018. [Online]. Available: https://kubernetes.io/
[35] DockerSwarm.Com, "Docker swarm," 2018. [Online]. Available: https://docs.docker.com/swarm/
[36] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, "A survey on software fault localization," IEEE Trans. Software Eng., vol. 42, no. 8, pp. 707–740, 2016.
[37] R. A. Santelices, Y. Zhang, S. Jiang, H. Cai, and Y. Zhang, "Quantitative program slicing: separating statements by relevance," in 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, 2013, pp. 1269–1272.
[38] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund, "Spectrum-based multiple fault localization," in ASE 2009, 24th IEEE/ACM International Conference on Automated Software Engineering, Auckland, New Zealand, November 16-20, 2009, 2009, pp. 88–99.
[39] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, "Scalable statistical bug isolation," in Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, 2005, pp. 15–26.
[40] W. E. Wong and Y. Qi, "BP neural network-based effective fault localization," International Journal of Software Engineering and Knowledge Engineering, vol. 19, no. 4, pp. 573–597, 2009.
[41] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin, "Using mutation analysis for assessing and comparing testing coverage criteria," IEEE Trans. Software Eng., vol. 32, no. 8, pp. 608–624, 2006.
[42] E. H. da S. Alves, L. C. Cordeiro, and E. B. de Lima Filho, "Fault localization in multi-threaded C programs using bounded model checking," in 2015 Brazilian Symposium on Computing Systems Engineering, SBESC 2015, Foz do Iguacu, Brazil, November 3-6, 2015, 2015, pp. 96–101.
[43] F. Koca, H. Sözer, and R. Abreu, "Spectrum-based fault localization for diagnosing concurrency faults," in Testing Software and Systems - 25th IFIP WG 6.1 International Conference, ICTSS 2013, Istanbul, Turkey, November 13-15, 2013, Proceedings, 2013, pp. 239–254.
[44] I. Laguna, D. H. Ahn, B. R. de Supinski, S. Bagchi, and T. Gamblin, "Probabilistic diagnosis of performance faults in large-scale parallel applications," in International Conference on Parallel Architectures and Compilation Techniques, PACT '12, Minneapolis, MN, USA - September 19 - 23, 2012, 2012, pp. 213–222.
[45] G. Qi, L. Yao, and A. V. Uzunov, "Fault detection and localization in distributed systems using recurrent convolutional neural networks," in Advanced Data Mining and Applications - 13th International Conference, ADMA 2017, Singapore, November 5-6, 2017, Proceedings, 2017, pp. 33–48.
[46] A. B. Sharma, H. Chen, M. Ding, K. Yoshihira, and G. Jiang, "Fault detection and localization in distributed systems using invariant relationships," in 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Budapest, Hungary, June 24-27, 2013, 2013, pp. 1–8.
[47] C. Pham, L. Wang, B. Tak, S. Baset, C. Tang, Z. T. Kalbarczyk, and R. K. Iyer, "Failure diagnosis for distributed systems using
targeted fault injection," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 2, pp. 503–516, 2017.
[48] Log4j.Com, "Log4j," 2017. [Online]. Available: https://logging.apache.org/log4j/2.x/
[49] Logstash.Com, "Logstash," 2018. [Online]. Available: https://www.elastic.co/products/logstash
[50] Elasticsearch.Com, "Elasticsearch," 2018. [Online]. Available: https://www.elastic.co/products/elasticsearch
[51] Kibana.Com, "Kibana," 2018. [Online]. Available: https://www.elastic.co/products/kibana
[52] Dynatrace.Com, "Dynatrace," 2013. [Online]. Available: https://www.dynatrace.com/
[53] Zipkin.Com, "Zipkin," 2016. [Online]. Available: https://zipkin.io/
[54] Redis.Io, "redis.io," 2016. [Online]. Available: https://redis.io/
[55] J. Campos, A. Riboira, A. Perez, and R. Abreu, "GZoltar: An Eclipse plug-in for testing and debugging," in IEEE/ACM International Conference on Automated Software Engineering, ASE'12, Essen, Germany, September 3-7, 2012, 2012, pp. 378–381. [Online]. Available: https://doi.org/10.1145/2351676.2351752
[56] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Trans. Software Eng., vol. 28, no. 2, pp. 183–200, 2002.
[57] N. Alshuqayran, N. Ali, and R. Evans, "A systematic mapping study in microservice architecture," in 9th IEEE International Conference on Service-Oriented Computing and Applications, SOCA 2016, Macau, China, November 4-6, 2016, 2016, pp. 44–51.
[58] N. Dragoni, S. Giallorenzo, A. Lluch-Lafuente, M. Mazzara, F. Montesi, R. Mustafin, and L. Safina, "Microservices: yesterday, today, and tomorrow," CoRR, vol. abs/1606.04036, 2016.
[59] S. Park, R. W. Vuduc, and M. J. Harrold, "UNICORN: a unified approach for localizing non-deadlock concurrency bugs," Softw. Test., Verif. Reliab., vol. 25, no. 3, pp. 167–190, 2015.
[60] ——, "Falcon: fault localization in concurrent programs," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, 2010, pp. 245–254.
[61] P. Bailis, P. Alvaro, and S. Gulwani, "Research for practice: tracing and debugging distributed systems; programming by examples," Commun. ACM, vol. 60, no. 7, pp. 46–49, 2017.
[62] F. Giraldeau and M. Dagenais, "Wait analysis of distributed systems using kernel tracing," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 8, pp. 2450–2461, 2016.
[63] C. Aguerre, T. Morsellino, and M. Mosbah, "Fully-distributed debugging and visualization of distributed systems in anonymous networks," in GRAPP & IVAPP 2012: Proceedings of the International Conference on Computer Graphics Theory and Applications and International Conference on Information Visualization Theory and Applications, Rome, Italy, 24-26 February, 2012, 2012, pp. 764–767.
[64] S. A. Asadollah, D. Sundmark, S. Eldh, H. Hansson, and W. Afzal, "10 years of research on debugging concurrent and multicore software: a systematic mapping study," Software Quality Journal, vol. 25, no. 1, pp. 49–82, 2017.
[65] W. Morgan, "What's a service mesh? And why do I need one?" 2017. [Online]. Available: https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/
[66] Linkerd, "Linkerd," 2018. [Online]. Available: https://linkerd.io/
[67] Istio, "Istio," 2018. [Online]. Available: https://istio.io/

Xin Peng is a professor in the School of Computer Science at Fudan University, China. He received Bachelor's and PhD degrees in computer science from Fudan University in 2001 and 2006, respectively. His research interests include data-driven intelligent software development, software maintenance and evolution, and mobile and cloud computing. His work won the Best Paper Award at the 27th International Conference on Software Maintenance (ICSM 2011), the ACM SIGSOFT Distinguished Paper Award at the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2018), and the IEEE TCSE Distinguished Paper Award at the 34th IEEE International Conference on Software Maintenance and Evolution (ICSME 2018).

Tao Xie is a professor and Willett Faculty Scholar in the Department of Computer Science at the University of Illinois at Urbana-Champaign, USA. His research interests are software testing, program analysis, software analytics, software security, intelligent software engineering, and educational software engineering. He is a Fellow of the IEEE.

Jun Sun is currently an associate professor at Singapore University of Technology and Design (SUTD). He received Bachelor's and PhD degrees in computing science from National University of Singapore (NUS) in 2002 and 2006. In 2007, he received the prestigious LEE KUAN YEW postdoctoral fellowship. He has been a faculty member of SUTD since 2010. He was a visiting scholar at MIT from 2011 to 2012. Jun's research interests include software engineering, formal methods, program analysis, and cyber-security.

Chao Ji is a Master's student in the School of Computer Science at Fudan University, China. He received his Bachelor's degree from Fudan University in 2017. His work mainly concerns the development and operation of microservice systems.