AIOps Done Right in Banking
Automating the Next Generation of Enterprise Software
What’s inside
Introduction
The promise of AI
Chapter 1
Chapter 2
Auto-remediation
Chapter 6
Chapter 7
AIOps Done Right In Banking: Automating the Next Generation of Enterprise Software ©2020 Dynatrace 3
A proven record in AIOps
Dynatrace helps some of the world’s top financial institutions to simplify cloud complexity and accelerate digital transformation. Davis, our deterministic, causation-based AI engine, was built into the fabric of the Dynatrace software intelligence platform four years ago, at a time when cloud computing became mainstream and conventional monitoring tools hit a wall. Since then, many leading states and cities have relied on Dynatrace to accurately and reliably identify the root cause of performance problems while automating Ops, DevOps, and business processes.
“Dynatrace is the first I’ve seen where the AI really shines. Incredible.”
Ariel Molina, Sr. Dir., Software Engineering & Enterprise Architecture at Carnival Cruise Line
How to avoid closing down multiple branches or on-line banking on a busy Saturday:
“Dynatrace, within two minutes came back and said 'you have a problem in your cloud instance', and we spun up extra resources. So we avoided having to close down supermarkets and disappoint customers waiting in line.”
Jeppe Hedesgaard Lindberg, Application Performance Manager at Coop Denmark
“We fire up Dynatrace, and immediately the AI goes to work and identifies problems. There's no digging; it’s bubbling to the top. It’s right there in your face. It just does it for you; it’s amazing.”
Chapter 1
The concept of automating operations revolves around better troubleshooting, with the ultimate goal to reduce the Mean Time To Recovery (MTTR). This is accomplished through automatic anomaly detection and alerting, i.e., speedy Mean Time To Discovery (MTTD). However, further reduction of MTTR requires automatic root cause analysis.
Traditional monitoring tools focus on application performance metrics and baselining methods to distinguish normal from faulty behavior. Defining the anomaly thresholds turns out to be a tricky task that requires advanced statistics like machine learning. However, even the best baselining methods prove to be inadequate when it comes to the cloud.
With modern microservice architectures, a single fault impacts a multitude of connected services which subsequently also fail. Therefore, a single problem typically triggers many alerts, which are all justified. This is called an alert storm or noisy alerts.
Conventional monitoring solutions fall short of resolving this issue. It remains up to human operators to make sense of the alerts. Problem triage becomes a time-consuming and often frustrating exercise involving war rooms and graveyard shifts.
[Figure: incident handling before and after, by number of calls. Before (without Dynatrace): 20 Engage, 20 Triage, 45 Find & Assemble, 30 Resolve, 35 Restore. After (with Dynatrace and an incident response service): 5, 5, 5, 10 Resolve, 35 Restore. Callouts: 90% faster, 40% faster, -90% SLA calls. Incident response services: xMatters, PagerDuty, VictorOps, Opsgenie.]
The only way out is a reliable method for determining the underlying root cause automatically.
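The alert storm described above can be illustrated with a minimal sketch. It assumes a simple per-metric baseline of mean plus three standard deviations (an illustrative choice, not the method of any specific product), and all service names and sample values are hypothetical. One saturated host pushes every dependent metric over its own baseline, so a single fault produces four justified alerts.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Upper alert threshold: mean plus k standard deviations of past samples."""
    return mean(history) + k * stdev(history)


def alerts(history_by_metric, current_by_metric, k=3.0):
    """Return the metrics whose current value exceeds their own baseline."""
    return sorted(
        name
        for name, history in history_by_metric.items()
        if current_by_metric[name] > baseline_threshold(history, k)
    )


# Hypothetical history: response times in ms, CPU usage in percent.
history = {
    "web-app": [200, 210, 190, 205, 195],
    "service-1": [80, 85, 75, 82, 78],
    "service-2": [40, 42, 38, 41, 39],
    "host-cpu": [35, 40, 30, 38, 32],
}
# One CPU-saturated host slows down every service that depends on it.
current = {"web-app": 900, "service-1": 700, "service-2": 650, "host-cpu": 100}

storm = alerts(history, current)
print(storm)
```

Every alert in the list is correct on its own, yet nothing in it says which one is the cause. That is the gap root cause analysis has to close.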
Fine-tuning individual baselines helps, but it does not fix alert storms. For a real cure, we need to step outside the box and try to find the underlying
root cause directly.
There are two very distinct AI-based approaches to reduce alert noise:
Deterministic AI performs a step-by-step fault tree analysis, as is common in safety engineering.
Results: Precise identification of the problem root cause
• Works in near real-time
• Explainable results: problem evolution over time can be visualized step-by-step
• Includes technical and foundational root causes as well as impact analysis
Machine learning AI is a statistical approach that correlates metrics, events, and alerts to build a multi-dimensional model of the analyzed system.
Results: A set of correlated alerts; it is still up to human operators to determine the root cause
• Building machine learning models takes time
• Tend to lag behind in dynamic environments
• Some systems suggest likely root causes by accessing historic records created by humans
The aggregate potential cost savings for banks from AI applications is estimated at $447 billion by 2023. (Autonomous Next Research)
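To make the limitation of the correlation approach concrete, here is a minimal sketch that assumes plain Pearson correlation between metric time series as a stand-in for a statistical model. The metric names and values are hypothetical. The result is a set of metrics that move together, with no indication of which one caused the others.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(series, threshold=0.9):
    """Return all metric pairs whose absolute correlation reaches the threshold."""
    names = sorted(series)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(series[a], series[b])) >= threshold
    ]


# Hypothetical samples taken at the same five points in time.
series = {
    "host_cpu": [30, 32, 95, 98, 97],
    "service_rt": [80, 82, 600, 640, 630],
    "unrelated": [5, 9, 4, 8, 6],
}

pairs = correlated_pairs(series)
print(pairs)
```

The pair is symmetric: the output says host CPU and service response time rise together, and a human still has to decide which one is the root cause.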
Chapter 2
In an environment of disparate monitoring tools, operations personnel are left to make sense of multiple diverse inputs coming from various sources. This increases
the likelihood of error in situational awareness and diagnosis.² Currently, only 5% of applications are monitored. The aim is to get full end-to-end visibility.
²Use AIOps for a Data-Driven Approach to Improve Insights From IT Operations Monitoring Tools (Gartner Research Note)
“Dynatrace is one of our strategic platforms that enables us to make huge strides in our banking transformation.”
CTO, Multi-National Bank
Challenge
Full system visibility is a necessary precondition for automating operations including solid self-remediation. We need full insight not only into the application (including containers and functions-as-a-service) but also into all layers of the cloud infrastructure, networks, the CI/CD pipeline, and the real user experience. In many cases, data collection itself comes for free, as all major public cloud providers offer monitoring APIs, and open-source tools are abundantly available. However, the following considerations are critical:
Rich data in context
In order to accomplish true root cause analysis, the collected data need
to be high-fidelity (minimal or no sampling) and context-rich in order
to create real-time topology and service flow maps.
Topology map
A topology map captures and visualizes the entire application
environment. This includes the vertical stack (infrastructure, services
and processes) and the horizontal dependencies, i.e. all incoming
and outgoing call relationships. Leading monitoring solutions
provide auto-discovery of new environment components
and near real-time updates.
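As a rough illustration of what a topology map has to store, the following sketch models the vertical stack and the horizontal call relationships as a small graph. The class and method names and the example entities are hypothetical and do not reflect any vendor's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    kind: str                                     # e.g. "application", "process", "host"
    runs_on: list = field(default_factory=list)   # vertical stack: next layer down
    calls: list = field(default_factory=list)     # horizontal: outgoing calls


class Topology:
    def __init__(self):
        self.entities = {}

    def add(self, name, kind):
        self.entities[name] = Entity(name, kind)

    def link_runs_on(self, upper, lower):
        self.entities[upper].runs_on.append(lower)

    def link_call(self, caller, callee):
        self.entities[caller].calls.append(callee)

    def vertical_stack(self, name):
        """All components below `name`, listed from top to bottom."""
        stack, todo = [], list(self.entities[name].runs_on)
        while todo:
            current = todo.pop(0)
            if current not in stack:
                stack.append(current)
                todo.extend(self.entities[current].runs_on)
        return stack

    def callers_of(self, name):
        """Incoming call relationships of `name`."""
        return sorted(e.name for e in self.entities.values() if name in e.calls)


topo = Topology()
for entity_name, entity_kind in [
    ("web-app", "application"),
    ("service-1", "service"),
    ("tomcat-process", "process"),
    ("host-a", "host"),
    ("host-b", "host"),
]:
    topo.add(entity_name, entity_kind)

topo.link_call("web-app", "service-1")
topo.link_runs_on("web-app", "tomcat-process")
topo.link_runs_on("tomcat-process", "host-a")
topo.link_runs_on("service-1", "host-b")

print(topo.vertical_stack("web-app"))
print(topo.callers_of("service-1"))
```

Auto-discovery would amount to calling `add` and the two `link_` methods whenever a new component or call relationship is observed, which keeps the map current without manual configuration.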
Chapter 3
Banks without AI attempt the impossible and will eventually die. Gartner predicts 30% of IT organizations that fail to adopt AI will no longer be operationally viable by 2022.³ As financial businesses embrace a hybrid, multi-cloud environment, the sheer volume of data and massive environmental complexity will make it impossible for humans to monitor, comprehend, and take action.
³AI (in a box) for IT Ops: The AIOps 101 you’ve been looking for.
Dynatrace is helping KeyBank get on the path to autonomous cloud operations, allowing DevOps teams to create an unbreakable software delivery pipeline that enables faster innovation and improved product quality that enhances customer experience.
Challenge
We are quickly entering a time when humans will no longer be the main actors to fix IT problems or push code into production. Cloud and AI solutions revolve around automation, so DevOps won’t require nearly as much human intervention in the future. For AIOps (truly autonomous cloud operations) to work perfectly, we need a system that can not only identify that something is wrong, but pinpoint the true root cause.
Modern, highly dynamic microservice architectures run in hybrid and multi-cloud environments. Infrastructure and services are spun up and killed within the blink of an eye as loads demand. Determining the root cause of an anomaly requires exponentially more effort than humans can take on.
Root cause analysis with deterministic AI
Davis, the Dynatrace AI engine, uses the application topology and service flow maps together with high-fidelity metrics to perform a fault tree analysis. A fault tree shows all the vertical and horizontal topological dependencies for a given alert.
Vertical stack
The deployment stack an application or service depends on. Always automatically analyzed for abnormal behavior.
Horizontal stack
The real-time dynamic dependency across host boundaries measured by all incoming transactions (PurePath).
Consider the following example, visualized in the accompanying chart.
1. A web app exhibits an anomaly, like a degraded response time (see top left in the graphic).
2. Davis first “takes a look” at the vertical stack below and finds that everything performs as expected: no problems there.
3. From here, Davis follows all the transactions and detects a dependency on Service 1 that also shows an anomaly. In addition, all further dependencies (Services 2 and 3) exhibit anomalies as well.
4. The automatic root-cause detection includes all the relevant vertical stacks as shown in the example and ranks the contributors to determine the one with the most negative impact.
5. In this case, the root cause is a CPU saturation in one of the Linux hosts.
[Figure: fault tree with the Application and Services 1, 2, and 3, each running on a webserver or microservice cluster and a host. Components showing abnormal behavior are highlighted.]
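The five steps above can be sketched as a graph walk. This is a simplified illustration under stated assumptions, not the Davis algorithm: the dictionaries `calls` and `runs_on`, the entity names, and the numeric anomaly scores used for ranking are all hypothetical.

```python
from collections import deque


def find_root_cause(alerting, calls, runs_on, score):
    """Follow anomalous call dependencies, inspect each vertical stack,
    and return the anomalous component with the highest score."""
    visited = {}                      # dict keeps the order of discovery
    queue = deque([alerting])
    candidates = {}
    while queue:
        service = queue.popleft()
        if service in visited or score.get(service, 0.0) <= 0.0:
            continue                  # healthy or already analyzed
        visited[service] = score[service]
        # Vertical stack: infrastructure this service runs on.
        for component in runs_on.get(service, []):
            if score.get(component, 0.0) > 0.0:
                candidates[component] = score[component]
        # Horizontal stack: services this one calls.
        queue.extend(calls.get(service, []))
    if not candidates:
        candidates = visited          # no infrastructure anomaly found
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


calls = {"web-app": ["service-1"], "service-1": ["service-2", "service-3"]}
runs_on = {
    "web-app": ["webserver-cluster", "host-a"],
    "service-1": ["microservice-cluster-1", "host-b"],
    "service-2": ["microservice-cluster-2", "host-c"],
    "service-3": ["microservice-cluster-3", "host-c"],
}
# Hypothetical anomaly scores; entities not listed are healthy.
score = {"web-app": 0.4, "service-1": 0.5, "service-2": 0.6,
         "service-3": 0.6, "host-c": 0.9}

root_cause = find_root_cause("web-app", calls, runs_on, score)
print(root_cause)
```

Because the walk only follows dependencies that exist in the topology, every step of the result can be explained: the web app is slow because Service 1 is slow, because Services 2 and 3 are slow, because their shared host is saturated.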
Understanding problem evolution
Deterministic fault tree analysis yields precise, explainable results. This can be used
to replay the evolution and resolution of a problem step by step and visualize the
affected components in a topology map. This is an extremely powerful feature
because it allows the DevOps team to gain a deep understanding of the problem
right from the get-go, cutting triage and research time to a minimum.
The problem evolution data is key for auto-remediation. Given that it can be
accessed through APIs, remediation sequences can be triggered to resolve a problem
with surgical precision and at a speed not achievable by human operators.
Chapter 4
Infrastructure and services get spun up and killed as needed at a mind-boggling speed in a modern dynamic microservice application. That’s the nature of a healthy system.
A disappearing container can be a desired event to optimize resources, or it can be a sign of an unintended disruption that requires immediate mitigation. The AI needs to be able to tell an anomaly from a desired change.
Challenge
A precise and reliable determination of the technical root cause is absolutely essential for auto-remediation, but it is not sufficient. We also need a measure of an anomaly’s severity and some indication of what led to the technical root cause in the first place.
Not every disappearing container or host is a problem, and a slow service that nobody uses does not require immediate attention. Therefore, an advanced software intelligence system assesses the severity of a problem:
Customer Impact
How many customers at multiple branches have been impacted by a detected problem since it occurred? Ideally, the number should be based on actual real users rather than a statistical extrapolation of historic data.
Service calls impacted
Some parts of the system are not built for human interaction. In this case, the number of impacted service calls is a good estimate of the severity.
Business Impact
As software intelligence solutions increasingly cover banking systems end-to-end, from user actions all the way to the infrastructure, it is possible to map system performance to business KPIs. A retailer, for example, can measure the dollar value of purchases during a system slowdown and compare it with a reference timeframe in the past.
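The three severity measures can be sketched in a few lines. The function names and all numbers below are illustrative assumptions: the first function picks the measure that fits the component, and the second compares a business KPI during the problem with a reference timeframe.

```python
def severity(impacted_users, impacted_calls, user_facing):
    """Use real-user impact for user-facing components,
    impacted service calls for everything else."""
    if user_facing:
        return ("impacted users", impacted_users)
    return ("impacted service calls", impacted_calls)


def percent_change(current, reference):
    """Relative change of a KPI against a reference timeframe, in percent."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (current - reference) * 100.0 / reference


# Hypothetical dollar value of purchases during the slowdown
# and during the same hour one week earlier.
purchases_during_slowdown = 8000.0
purchases_reference = 10000.0

business_impact = percent_change(purchases_during_slowdown, purchases_reference)
print(severity(1170, 384000, user_facing=True))
print(f"{business_impact:.1f}% vs. reference")
```

Comparing against a matching reference window (same weekday, same hour) keeps normal daily and weekly patterns from being mistaken for impact.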
Foundational root causes
Typical foundational root causes are:
• Deployments
Collecting metrics and events from the CI/CD tool chain makes it possible to link a problem to a specific deployment (and roll it back if needed).
• Third-party configuration changes
These can relate to changes in the underlying cloud infrastructure or a third-party service.
• Infrastructure availability
In many cases, the shutdown or restart of hosts or individual processes causes the problem.
To determine the foundational root causes, the AI engine needs to have access to metrics and events from the CI/CD pipeline, ITSM solutions, and other connected tools. Dynatrace provides an API and plug-ins to ingest third-party data into Davis.
[Screenshot: Problem 753, "2 Applications: User action duration degradation", detected at Nov 28 06:58 to Nov 28 07:54 (was open for 56 minutes). This problem affects real users. Business Impact Analysis (first 10 minutes of the problem): 1.17k impacted users, 384k affected service calls. Root Cause (all incidents have the same root cause): custom service Check Destination, response time degradation, current response time (19.6 s) exceeds the auto-detected baseline (120 ms) by 16,309%, 551 affected requests/min, all methods affected; host BB1-apache-tomcatjms-iis, CPU saturation, 100% CPU usage. Business Metric Analysis (problem timeframe vs. yesterday and vs. last week): Basket 51,782 (17.16%, 10.42%); Checkout 379 (8.12%, 30.17%); Order Details 16 (44.64%, 25.68%).]
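One simple way to connect a problem to candidate foundational root causes is to look for change events shortly before the problem started. The sketch below assumes such events have already been ingested as plain dictionaries; the event fields, the 30-minute lookback, and the dates are hypothetical, and no real ingestion API is shown.

```python
from datetime import datetime, timedelta


def candidate_changes(problem_start, events, lookback=timedelta(minutes=30)):
    """Return change events inside the lookback window before the problem
    started, newest first."""
    window_start = problem_start - lookback
    hits = [e for e in events if window_start <= e["time"] <= problem_start]
    return sorted(hits, key=lambda e: e["time"], reverse=True)


# Hypothetical events from the CI/CD pipeline and infrastructure tools.
events = [
    {"time": datetime(2019, 11, 28, 5, 0), "type": "config-change"},
    {"time": datetime(2019, 11, 28, 6, 50), "type": "deployment"},
    {"time": datetime(2019, 11, 28, 6, 55), "type": "host-restart"},
]
problem_start = datetime(2019, 11, 28, 6, 58)

recent = candidate_changes(problem_start, events)
print([e["type"] for e in recent])
```

A time window alone only yields candidates. Restricting the events to entities that appear in the problem's fault tree would narrow the list to changes that can actually explain the technical root cause.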
Chapter 5
Auto-remediation
Insight
Infrastructure as code and powerful cloud orchestration layers provide the necessary ingredients to automate operations and enable self-remediation. This will
not only reduce operational cost and deliver better service, but also avoid human error. The key to truly autonomous cloud operations is reliable system health
information including deep anomaly root cause and impact analyses.
Challenge
Complex auto-remediation sequences
This example shows how a precise analysis of the technical root cause, foundational root causes and user/business impact can be used to automate problem resolution through integration with a [...]
[Figure: Path to NoOps: Auto-Remediation, Self-Healing. A problem evolution chart (08:00 to 08:45; 325 impacted users, 1.82 affected service calls) drives a numbered sequence with these labels: Auto Mitigate! (High garbage collection? Adjust/revert memory settings! Issue with BLUE only? Switch back to GREEN! Hung threads? Restart service. Still ongoing? Initiate Rollback!), Mark Bad Commits, and Still ongoing? Escalate.]
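A remediation sequence of this kind can be sketched as an ordered list of checks and actions, with a re-check after each action and escalation as the last resort. The step names follow the labels in the figure; the problem structure and the simulated actions are hypothetical stand-ins for calls to real orchestration tools.

```python
def remediate(problem, steps, still_ongoing, escalate):
    """Run matching steps in order, stop once the problem is resolved,
    and escalate to a human if it is still ongoing at the end."""
    applied = []
    for name, applies, action in steps:
        if not still_ongoing(problem):
            break
        if applies(problem):
            action(problem)
            applied.append(name)
    if still_ongoing(problem):
        escalate(problem)
        applied.append("escalate")
    return applied


def fix(symptom):
    """Simulated action: clears a symptom and resolves the problem
    only if that symptom was the actual cause."""
    def action(problem):
        problem["symptoms"].discard(symptom)
        if problem["cause"] == symptom:
            problem["ongoing"] = False
    return action


steps = [
    ("adjust/revert memory settings", lambda p: "high_gc" in p["symptoms"], fix("high_gc")),
    ("switch back to GREEN", lambda p: "blue_only" in p["symptoms"], fix("blue_only")),
    ("restart service", lambda p: "hung_threads" in p["symptoms"], fix("hung_threads")),
    ("initiate rollback", lambda p: True, fix("bad_deployment")),
]


def ongoing(problem):
    return problem["ongoing"]


def notify_on_call(problem):
    pass  # placeholder for an incident response integration


blue_problem = {"symptoms": {"blue_only"}, "cause": "blue_only", "ongoing": True}
unknown_problem = {"symptoms": {"hung_threads"}, "cause": "unknown", "ongoing": True}

blue_result = remediate(blue_problem, steps, ongoing, notify_on_call)
unknown_result = remediate(unknown_problem, steps, ongoing, notify_on_call)
print(blue_result)
print(unknown_result)
```

Ordering the steps from least to most disruptive means a rollback or a human escalation happens only after the cheaper fixes have been tried and the problem has been confirmed as still open.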
Chapter 6
Automation doesn’t stop at software operations and auto-remediation in an enterprise-grade financial services application environment. Accurate and
explainable software intelligence has the capacity to move towards automating the entire digital value chain and to enable novel business processes.
[Figure: delivery pipeline. Check-in, auto trigger, AI-powered quality gate.]
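A quality gate in a delivery pipeline can be reduced to a comparison between a candidate build and a known-good baseline. The sketch below is a plain threshold check under assumed metric names and a 10% tolerance; it illustrates the idea of a gate and is not the performance signature mechanism referenced in the footnote.

```python
def quality_gate(baseline, candidate, tolerance=0.10):
    """Pass only if no metric is more than `tolerance` worse than the baseline.
    All metrics are assumed to be 'lower is better'."""
    violations = {}
    for metric, reference in baseline.items():
        value = candidate[metric]
        if value > reference * (1.0 + tolerance):
            violations[metric] = (reference, value)
    return (not violations, violations)


# Hypothetical measurements from a test run of two builds.
baseline = {"response_time_ms": 120.0, "failure_rate": 0.01}
candidate = {"response_time_ms": 150.0, "failure_rate": 0.01}

passed, violations = quality_gate(baseline, candidate)
print(passed, violations)
```

A pipeline would run this after the automated tests and stop the promotion of the build when `passed` is false.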
⁴https://www.dynatrace.com/news/blog/shift-left-in-jenkins-how-to-implement-performance-signature-with-dynatrace/
Automating customer service
Any good software intelligence solution needs to include real user data, and an impact analysis (as described in chapter 4) can be used to ensure customer satisfaction even if something goes wrong.
In case of a breakdown or slowdown, the system can engage autonomously with impacted users. One way is to open a chat window operated by a chatbot behind the scenes and inform the customer about the specific performance issue, then offer to make it up to them by providing discounts, etc.
[Figure: example chat.]
Mary: “Hi Dirk! We want to apologize for the poor performance of our website today. We’re working under high pressure to fix this. Thank you for your patience.”
Dirk: “Thank you!”
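The proactive outreach described here comes down to two steps: select the users who were actually affected, then compose a message for each of them. The sketch below assumes session records with a user name, a minute offset, and a response time; all field names, thresholds, and the offer text are hypothetical.

```python
def impacted_users(sessions, start, end, slow_ms):
    """Users with at least one slow request inside the problem window."""
    return sorted({
        s["user"]
        for s in sessions
        if start <= s["minute"] <= end and s["response_ms"] >= slow_ms
    })


def apology(user, offer):
    """Compose the chat message for one impacted user."""
    return (
        f"Hi {user}! We want to apologize for the poor performance of our "
        f"website today. {offer}"
    )


sessions = [
    {"user": "Dirk", "minute": 12, "response_ms": 19600},
    {"user": "Ana", "minute": 14, "response_ms": 110},
    {"user": "Dirk", "minute": 40, "response_ms": 100},
]

users = impacted_users(sessions, start=10, end=30, slow_ms=2000)
messages = [apology(u, "Here is a discount code for your next visit.") for u in users]
print(messages)
```

Selecting on real user sessions, rather than messaging everyone who was online, keeps the apology limited to customers who experienced the problem.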
Chapter 7
Virtual assistants have emerged as one of the fastest-growing areas of AI and exist in many forms, including voice bots, text bots, and SMS bots.⁵ These technologies
have become mainstream with smartphones and home automation systems. When applied to software intelligence platforms, which usually require expert knowledge
to navigate, natural language interfaces enable broad user adoption, organizational learning and innovation.
Most people are not well-trained performance engineers, but everybody knows how to ask a question. To improve service quality in the eyes of your customers, up-to-the-minute information on system health and business KPIs needs to be available to customer service reps, analysts, software developers and architects, and agency leaders across the board.
Information and actionable knowledge can be democratized to allow for cross-departmental organizational learning and culture change.
A software intelligence solution that holds actionable insights about recent and current problems, their causes, and their impacts on customers is well suited for a new and convenient human user interface.
With Dynatrace Davis Assistant, users can simply ask a question verbally or use a text-based chat tool to interact with the software intelligence platform, and they’ll get a plain-language answer back within the second. Alternatively, for a deep dive, the bot can open a web browser and display the relevant charts. The ultimate level of system intelligence would be smart assistants that identify an error pattern, proactively suggest remediation actions, and only ask for approval to execute them.
"Give me my morning report."
Good morning. After analyzing 13 applications, here are some interesting findings from the last 24 hours.
• Easytravel.com reported an apdex customer experience rating of 0.85, a 49 percent increase from what was expected; and 7,200 sessions per minute, a 24 percent increase from what was expected.
• Vmware.dynatrace.com had an apdex customer experience rating of 0.97 which is near the expected value of 0.9.
"What was the customer experience over the weekend?"
Over the weekend, Vmware.easytravel.com was the most heavily used application. The customer experience was excellent, with an average Apdex rating of 0.97. Is there anything else I can help you with?
"What was the total dollar value of bookings?"
"Yes."
(Bot opens a web browser with the pertinent information.)
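The Apdex ratings quoted in these answers follow a simple formula: requests at or below a target threshold T count as satisfied, requests between T and 4T count as tolerating at half weight, and slower requests count as frustrated. The sketch below applies that standard definition to hypothetical response times; the 0.5-second threshold is an assumption.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds, with T = 0.5 s:
# three satisfied, one tolerating (1.0 s), one frustrated (3.0 s).
samples = [0.2, 0.4, 0.5, 1.0, 3.0]
rating = apdex(samples, threshold=0.5)
print(rating)
```

The score ranges from 0 (all users frustrated) to 1 (all users satisfied), which is why a rating of 0.97 reads as an excellent customer experience.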
Software intelligence holds the key to true AIOps, auto-remediation
and automation along the entire digital value chain.
We hope this AIOps Done Right ebook has inspired you to take
the next step in your digital journey.
Dynatrace is committed to providing financial institutions the data and intelligence they need to be successful with their
enterprise cloud and digital transformation initiatives, no matter how complex.
Learn more
If you are ready to learn more, please visit dynatrace.com/platform for assets, resources, and a free 15-day trial.
About Dynatrace
Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers
precise answers about the performance of applications, the underlying infrastructure and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver
more value with dramatically less effort. That’s why many of the world’s largest enterprises trust Dynatrace® to modernize and automate cloud operations, release better software faster, and deliver
unrivaled digital experiences.