
AIOps Done Right in Banking
Automating the Next Generation of Enterprise Software
What’s inside

Introduction: The promise of AI
Chapter 1: Anomaly detection and alerting
Chapter 2: Getting the best monitoring data
Chapter 3: AI operations and root cause analysis
Chapter 4: Impact analysis and foundational root causes
Chapter 5: Auto-remediation
Chapter 6: Automation and system integrations
Chapter 7: Natural language interfaces

Introduction

The adoption of AI by banking and financial institutions is on the rise. OpenText reports that 80% of banks are “highly aware” of the benefits that AI delivers. However, AI has many applications, and it is not as easy as “just adopting AI”. Banks that want to see tangible benefit from AI know that it is a journey, and one that will demand continued innovation. AIOps, when done right, unleashes a whole new level of automation, software quality, agility, efficiency and time to market.

AIOps Done Right In Banking ©2020 Dynatrace 2


The Promise of AI

Enable autonomous operations, boost innovation, and offer new modes of customer engagement by automating everything.

• AIOps: Replace a storm of noisy anomaly alerts with accurate and reliable root cause analysis.
• Intelligent DevOps: Increase the speed of innovation and software quality through intelligent performance and regression testing.
• Auto-remediation: Automate anomaly remediation and performance optimization based on system health and real user demands.
• Smart customer engagement: Use business intelligence data to improve customer experience, including automatic remediation of breakdowns and complaints.

Beyond error detection, towards self-healing

Consider this all-too-familiar challenge: An anomaly in a large microservice application triggers a storm of alerts as services around your IT banking infrastructure are impacted. As your application contains literally millions of dependencies, how do you find the original error and avoid a complete loss of service to your customers? Conventional monitoring tools are not much help. They collect metrics and raise alerts, but they provide few answers as to what went wrong in the first place.

In contrast, envision an intelligent system that accurately provides the answers: in this case, the technical root cause of the anomaly and how to fix it. Such intelligence, if accurate and reliable, can be trusted to trigger auto-remediation procedures before most customers even notice a glitch.

AI and automation are poised to radically change the game in operations. Beyond operations, it is about collecting and applying intelligence along the entire digital value chain, from software development through service delivery all the way to end-user interactions. Smart integration and automation will drive the next innovation cycle in enterprise software.

A proven record in AIOps

Dynatrace helps some of the world’s top financial institutions to simplify cloud complexity and accelerate digital transformation. Davis, our deterministic, causation-based AI engine, was built into the fabric of the Dynatrace software intelligence platform four years ago, at a time when cloud computing became mainstream and conventional monitoring tools hit a wall. Since then, many leading states and cities have relied on Dynatrace to accurately and reliably identify the root cause of performance problems while automating Ops, DevOps, and business processes.

How to avoid closing down multiple branches or on-line banking on a busy Saturday:

• A 200-year-old bank and one of the largest in North America that serves 7 million customers extends monitoring to all 900 of their branches in North America.
• A multi-national bank serving over 32 million digital customers improves time to resolution and costs by automating root cause analysis.
• An international bank with $1B in growth improved customer experience by being able to see the digital technology health across the organization and monitor their critical apps. Before Dynatrace they would be alerted when a customer would tell them.

“Dynatrace is the first I’ve seen where the AI really shines. Incredible.”
Ariel Molina, Sr. Dir., Software Engineering & Enterprise Architecture at Carnival Cruise Line

“Dynatrace, within two minutes came back and said 'you have a problem in your cloud instance', and we spun up extra resources. So we avoided having to close down supermarkets and disappoint customers waiting in line.”
Jeppe Hedesgaard Lindberg, Application Performance Manager at Coop Denmark

“We fire up Dynatrace, and immediately the AI goes to work and identifies problems. There's no digging; it’s bubbling to the top. It’s right there in your face. It just does it for you; it’s amazing.”
Steve Strout, Director, Platform Engineering Assurant

“The AI paves the way for autonomous operations, enabling us to create auto-remediation workflows that remove the need for human intervention in the resolution of recurring problems.”
David Shepherd, Service Delivery Manager, Global IT Service Excellence at Experian

Chapter 1

Anomaly detection and alerting


Insight

The concept of automating operations revolves around better troubleshooting, with the ultimate goal to reduce the Mean Time To Recovery (MTTR). This is accomplished through automatic anomaly detection and alerting, i.e., speedy Mean Time To Discovery (MTTD). However, further reduction of MTTR requires automatic root cause analysis.

Challenge

Traditional monitoring tools focus on application performance metrics and baselining methods to distinguish normal from faulty behavior. Defining the anomaly thresholds turns out to be a tricky task that requires advanced statistics like machine learning. However, even the best baselining methods prove to be inadequate when it comes to the cloud.

With modern microservice architectures, a single fault impacts a multitude of connected services which subsequently also fail. Therefore, a single problem typically triggers many alerts, which are all justified. This is called an alert storm or noisy alerts.

Conventional monitoring solutions fall short of resolving this issue. It remains up to human operators to make sense of the alerts. Problem triage becomes a time-consuming and often frustrating exercise involving war rooms and graveyard shifts.

The only way out is a reliable method for determining the underlying root cause automatically.

[Figure: incident handling before and after. BEFORE, without Dynatrace (manual communication): 20 ENGAGE, 20 TRIAGE, 45 FIND & ASSEMBLE, 30 RESOLVE, 35 RESTORE. AFTER, with Dynatrace and an incident response service: 5, 5, 5, 10 RESOLVE, 35 RESTORE. Callouts: 90% faster, 40% faster, 90% fewer calls. Incident response services: xMatters, PagerDuty, VictorOps, Opsgenie.]

Fine-tuning individual baselines helps, but it does not fix alert storms. For a real cure, we need to step outside the box and try to find the underlying
root cause directly.
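
To make the alert-storm effect concrete, here is a minimal Python sketch. It assumes a simple statistical baseline (mean plus three standard deviations), which is only one of many possible baselining methods; the service names and response times are invented for illustration. Each per-service baseline works correctly on its own, yet a single fault still produces one alert per dependent service.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Upper alert threshold: mean plus k standard deviations of past samples."""
    return mean(history) + k * stdev(history)


def alerts(histories, current, k=3.0):
    """Return the services whose current value exceeds their own baseline."""
    return sorted(
        name
        for name, history in histories.items()
        if current[name] > baseline_threshold(history, k)
    )


# Hypothetical response times in milliseconds for three dependent services.
histories = {
    "frontend": [100, 105, 98, 102, 101],
    "checkout": [50, 52, 49, 51, 50],
    "database": [10, 11, 9, 10, 10],
}

# A single fault in the database slows down every service that depends on it.
current = {"frontend": 900, "checkout": 700, "database": 600}

# Every service raises a justified alert, but none of them names the origin.
print(alerts(histories, current))
```

All three alerts are individually correct, which is why tightening or loosening thresholds does not reduce the noise: the missing piece is the dependency information that says which alert came first in the causal chain.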

There are two very distinct AI-based approaches to reduce alert noise:

Deterministic AI performs a step-by-step fault tree analysis as is common in safety engineering.

Results: Precise identification of the problem root cause
• Works in near real-time
• Explainable results: problem evolution over time can be visualized step-by-step
• Includes technical and foundational root causes as well as impact analysis

Machine learning AI is a statistical approach that correlates metrics, events, and alerts to build a multi-dimensional model of the analyzed system.

Results: A set of correlated alerts; it is still up to human operators to determine the root cause
• Building machine learning models takes time
• Tends to lag behind in dynamic environments
• Some systems suggest likely root causes by accessing historic records created by humans

The aggregate potential cost savings for banks from AI applications is estimated at $447 billion by 2023. (Autonomous Next Research)

Chapter 2

Getting the best monitoring data


Insight

In an environment of disparate monitoring tools, operations personnel are left to make sense of multiple diverse inputs coming from various sources. This increases the likelihood of error in situational awareness and diagnosis.² Currently, only 5% of applications are monitored. The aim is to get full end-to-end visibility.

Challenge

Full system visibility is a necessary precondition for automating operations including solid self-remediation. We need full insight not only into the application (including containers and functions-as-a-service) but also into all layers of the cloud infrastructure, networks, the CI/CD pipeline, and the real user experience. In many cases, data collection itself comes for free, as all major public cloud providers offer monitoring APIs, and open-source tools are abundantly available. However, the following considerations are critical:

• How much manual effort is required for instrumentation and deployment of updates?
• Can the monitoring agents inject themselves into ephemeral components like functions or containers, and do configuration changes require additional manual instrumentation?
• Are the metrics coarsely sampled or high-fidelity?
• Is there enough meta-information and context to build a unifying data model?

“Dynatrace is one of our strategic platforms that enables us to make huge strides in our banking transformation.”
CTO, Multi-National Bank

²Use AIOps for a Data-Driven Approach to Improve Insights From IT Operations Monitoring Tools (Gartner Research Note)

Rich data in context
In order to accomplish true root cause analysis, the collected data need
to be high-fidelity (minimal or no sampling) and context-rich in order
to create real-time topology and service flow maps.

Topology map
A topology map captures and visualizes the entire application
environment. This includes the vertical stack (infrastructure, services
and processes) and the horizontal dependencies, i.e. all incoming
and outgoing call relationships. Leading monitoring solutions
provide auto-discovery of new environment components
and near real-time updates.
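
As an illustration of what such a map holds, the following Python sketch models the two kinds of relationships described above: vertical "runs on" links and horizontal call relationships. The class and all component names are hypothetical and are not a Dynatrace data structure or API.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical in-memory topology map (illustration only)."""

    runs_on: dict = field(default_factory=dict)  # vertical: component -> layer below it
    calls: dict = field(default_factory=dict)    # horizontal: caller -> set of callees

    def add_runs_on(self, component, below):
        self.runs_on[component] = below

    def add_call(self, caller, callee):
        self.calls.setdefault(caller, set()).add(callee)

    def vertical_stack(self, component):
        """Everything underneath a component, from the nearest layer down to the host."""
        stack = []
        while component in self.runs_on:
            component = self.runs_on[component]
            stack.append(component)
        return stack

    def incoming(self, service):
        """Services that call the given service."""
        return sorted(c for c, callees in self.calls.items() if service in callees)


topo = Topology()
topo.add_runs_on("booking-service", "tomcat-process")
topo.add_runs_on("tomcat-process", "host-1")
topo.add_call("web-app", "booking-service")
topo.add_call("booking-service", "payment-service")

print(topo.vertical_stack("booking-service"))
print(topo.incoming("booking-service"))
```

In a real deployment the entries would be created and removed by auto-discovery rather than by hand, which is what keeps the map current as components come and go.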

Service flow map
A service flow map offers a transactional view that illustrates the
sequence of service calls from the perspective of a single service or
request. Unlike topologies, which are higher abstractions and only show
general dependencies, service flows display the step-by-step sequence
of a whole transaction.
Service flows require high-fidelity data with minimal or no sampling.
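
A minimal Python sketch of how a service flow could be derived from recorded call data follows. The span format (a `trace` identifier, `caller`, `callee`, and `start_ms` fields) is an assumption made for this example, not a real tracing schema.

```python
def service_flow(spans, trace_id):
    """Return the ordered (caller, callee) steps of one transaction."""
    steps = [s for s in spans if s["trace"] == trace_id]
    steps.sort(key=lambda s: s["start_ms"])
    return [(s["caller"], s["callee"]) for s in steps]


# Hypothetical call records for two transactions, stored in arrival order.
spans = [
    {"trace": "t1", "caller": "booking", "callee": "payment", "start_ms": 12},
    {"trace": "t2", "caller": "web-app", "callee": "search", "start_ms": 3},
    {"trace": "t1", "caller": "web-app", "callee": "booking", "start_ms": 5},
    {"trace": "t1", "caller": "payment", "callee": "database", "start_ms": 20},
]

for step in service_flow(spans, "t1"):
    print(" -> ".join(step))
```

If the records were sampled, any dropped span would leave a gap in the sequence, which is why service flows depend on high-fidelity data.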

Chapter 3

AI Operations and Root Cause Analysis


Insight

Banks without AI attempt the impossible and will eventually die. Gartner predicts 30% of IT organizations that fail to adopt AI will no longer be operationally viable by 2022.³ As financial businesses embrace a hybrid, multi-cloud environment, the sheer volume of data and massive environmental complexity will make it impossible for humans to monitor, comprehend, and take action.

Challenge

We are quickly entering a time when humans will no longer be the main actors to fix IT problems or push code into production. Cloud and AI solutions revolve around automation, so DevOps won’t require nearly as much human intervention in the future. For AIOps (truly autonomous cloud operations) to work perfectly, we need a system that can not only identify that something is wrong, but pinpoint the true root cause.

Modern, highly dynamic microservice architectures run in hybrid and multi-cloud environments. Infrastructure and services are spun up and killed within the blink of an eye as loads demand. Determining the root cause of an anomaly requires exponentially more effort than humans can take on.

Dynatrace is helping KeyBank get on the path to autonomous cloud operations, allowing DevOps teams to create an unbreakable software delivery pipeline that enables faster innovation and improved product quality that enhances customer experience.

³AI (in a box) for IT Ops: The AIOps 101 you’ve been looking for.

Root cause analysis with deterministic AI

Davis, the Dynatrace AI engine, uses the application topology and service flow maps together with high-fidelity metrics to perform a fault tree analysis. A fault tree shows all the vertical and horizontal topological dependencies for a given alert.

Consider the following example, visualized in the accompanying chart.

[Figure: fault tree. Application calls Service 1, which calls Service 2 and Service 3. Each runs on its own vertical stack (webserver cluster, microservice clusters, Docker image, hosts). Components showing abnormal behavior are highlighted. Vertical stack: the deployment stack an application or service depends on, always automatically analyzed for abnormal behavior. Horizontal stack: the real-time dynamic dependency across host boundaries, measured by all incoming transactions (PurePath).]

1. A web app exhibits an anomaly, like a degraded response time (see top left in the graphic).
2. Davis first “takes a look” at the vertical stack below and finds that everything performs as expected: no problems there.
3. From here, Davis follows all the transactions and detects a dependency on Service 1 that also shows an anomaly. In addition, all further dependencies (Services 2 and 3) exhibit anomalies as well.
4. The automatic root-cause detection includes all the relevant vertical stacks as shown in the example and ranks the contributors to determine the one with the most negative impact.
5. In this case, the root cause is a CPU saturation in one of the Linux hosts.

Deterministic AI automatically and accurately determines the technical anomaly root cause. This is a necessary precondition for true AIOps. We’ll go deeper into the requirements for auto-remediation in the next sections.
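
The five steps above can be sketched in a few lines of Python. This is a simplified illustration of a dependency walk, not the actual Davis algorithm: the topology, the anomaly scores, and all component names are invented, and a real engine would weigh far more evidence than a single score per component.

```python
def vertical_stack(runs_on, component):
    """Components underneath `component`, nearest layer first."""
    stack = []
    while component in runs_on:
        component = runs_on[component]
        stack.append(component)
    return stack


def root_cause_candidates(start, calls, runs_on, anomalies):
    """Walk the fault tree from `start`; return anomalous stack components, worst first."""
    visited, queue, candidates = set(), [start], {}
    while queue:
        service = queue.pop(0)
        if service in visited:
            continue
        visited.add(service)
        # Check the vertical stack below this service.
        for component in vertical_stack(runs_on, service):
            if component in anomalies:
                candidates[component] = anomalies[component]
        # Follow horizontal dependencies, but only those that are also anomalous.
        for dependency in calls.get(service, ()):
            if dependency in anomalies:
                queue.append(dependency)
    return sorted(candidates, key=candidates.get, reverse=True)


# Hypothetical topology mirroring the example: the web app calls Service 1,
# which in turn calls Services 2 and 3.
calls = {"web-app": ["service-1"], "service-1": ["service-2", "service-3"]}
runs_on = {
    "web-app": "webserver-cluster", "webserver-cluster": "host-a",
    "service-1": "cluster-1", "cluster-1": "host-b",
    "service-2": "cluster-2", "cluster-2": "host-c",
    "service-3": "cluster-3", "cluster-3": "host-d",
}
# Invented severity scores for components that currently behave abnormally.
anomalies = {
    "web-app": 0.4, "service-1": 0.5, "service-2": 0.6, "service-3": 0.6,
    "cluster-2": 0.7, "host-c": 0.9,  # host-c: CPU saturation
}

print(root_cause_candidates("web-app", calls, runs_on, anomalies))
```

Because every step follows a known dependency, the result can be explained by replaying the path from the alerting web app to the saturated host.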

Understanding problem evolution

Deterministic fault tree analysis yields precise, explainable results. This can be used
to replay the evolution and resolution of a problem step by step and visualize the
affected components in a topology map. This is an extremely powerful feature
because it allows the DevOps team to gain a deep understanding of the problem
right from the get-go, cutting triage and research time to a minimum.

The problem evolution data is key for auto-remediation. Given that it can be
accessed through APIs, remediation sequences can be triggered to resolve a problem
with surgical precision and at a speed not achievable by human operators.

Chapter 4

Impact analysis and foundational root causes


Insight

Infrastructure and services get spun up and killed as needed at a mind-boggling speed in a modern dynamic microservice application. That’s the nature of a healthy system.

A disappearing container can be a desired event to optimize resources, or it can be a sign of an unintended disruption that requires immediate mitigation. The AI needs to be able to tell an anomaly from a desired change.

Challenge

A precise and reliable determination of the technical root cause is absolutely essential for auto-remediation, but it is not sufficient. We also need a measure of an anomaly’s severity and some indication of what led to the technical root cause in the first place.

Impact severity

Not every disappearing container or host is a problem, and a slow service that nobody uses does not require immediate attention. Therefore, an advanced software intelligence system assesses the severity of a problem:

Customer Impact
How many customers at multiple branches have been impacted by a detected problem since it occurred? Ideally, the number should be based on actual real users rather than a statistical extrapolation of historic data.

Service calls impacted
Some parts of the system are not built for human interaction. In this case, the number of impacted service calls is a good estimate of the severity.

Business Impact
As software intelligence solutions increasingly cover banking systems end-to-end, from user actions all the way to the infrastructure, it is possible to map system performance to business KPIs. A retailer, for example, can measure the dollar value of purchases during a system slowdown and compare it with a reference timeframe in the past.
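
The business-impact comparison in the last example is simple arithmetic, sketched below in Python. The function names, the severity rule, and all figures are assumptions chosen for illustration.

```python
def relative_change(current, reference):
    """Percentage change of `current` against a reference timeframe."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (current - reference) * 100.0 / reference


def severity(real_users_impacted, service_calls_impacted):
    """Hypothetical rule: prefer real-user counts, fall back to service calls."""
    if real_users_impacted > 0:
        return ("customer impact", real_users_impacted)
    return ("service calls impacted", service_calls_impacted)


# Invented figures: purchases during a slowdown vs. the same hour last week.
purchases_during_slowdown = 80_000
purchases_reference = 100_000

print(relative_change(purchases_during_slowdown, purchases_reference))
print(severity(real_users_impacted=0, service_calls_impacted=4_200))
```

A backend component with no real users still gets a severity measure through its impacted service calls, which matches the fallback described above.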

Foundational root causes

The technical root cause determines what is broken.
The foundational root cause specifies why it is broken.

Typical foundational root causes are:

• Deployments
Collecting metrics and events from the CI/CD tool chain makes it possible to link a problem to a specific deployment (and roll it back if needed).

• Third-party configuration changes
These can relate to changes in the underlying cloud infrastructure or a third-party service.

• Infrastructure availability
In many cases, the shutdown or restart of hosts or individual processes causes the problem.

To determine the foundational root causes the AI engine needs to have access to metrics and events from the CI/CD pipeline, ITSM solutions, and other connected tools. Dynatrace provides an API and plug-ins to ingest third-party data into Davis.

[Screenshot: Problem 753, "2 Applications: User action duration degradation", detected Nov 28 06:58 to 07:54 (open for 56 minutes); the problem affects real users. Affected: 2 applications, 15 services, 3 infrastructure components; 654,998,400 discrepancies analyzed. Business impact analysis (first 10 minutes of the problem): 1.17k impacted users, 384k affected service calls. Root cause (all incidents share the same root cause): custom service Check Destination with response time degradation (current response time 19.6 s exceeds the auto-detected baseline of 120 ms by 16,309%; 551 affected requests/min; all methods affected) and host BB1-apache-tomcatjms-iis with CPU saturation (100% CPU usage). Business metric analysis for the problem timeframe vs. yesterday and vs. last week: Basket 51,782 (17.16%, 10.42%); Checkout 379 (8.12%, 30.17%); Order Details 16 (44.64%, 25.68%).]
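
Linking a problem to a specific deployment, the first foundational root cause listed above, can be approximated by a time-window match. The Python sketch below assumes deployment events with `service`, `version`, and `time` fields and a 30-minute look-back window; both the event format and the window are illustrative assumptions, and the year in the timestamps is arbitrary.

```python
from datetime import datetime, timedelta


def suspect_deployments(problem_start, affected_services, deployments,
                        window=timedelta(minutes=30)):
    """Deployments to affected services that happened shortly before the problem."""
    hits = [
        d for d in deployments
        if d["service"] in affected_services
        and timedelta(0) <= problem_start - d["time"] <= window
    ]
    # Most recent deployment first: it is the most likely rollback candidate.
    return sorted(hits, key=lambda d: d["time"], reverse=True)


# Hypothetical events collected from a CI/CD tool chain.
deployments = [
    {"service": "basket", "version": "1.4.0", "time": datetime(2019, 11, 28, 3, 0)},
    {"service": "checkout", "version": "2.1.7", "time": datetime(2019, 11, 28, 6, 45)},
    {"service": "search", "version": "0.9.2", "time": datetime(2019, 11, 28, 6, 50)},
]

problem_start = datetime(2019, 11, 28, 6, 58)
affected = {"basket", "checkout"}

for d in suspect_deployments(problem_start, affected, deployments):
    print(d["service"], d["version"])
```

A time-window match only yields candidates; confirming the cause still requires the dependency analysis described earlier.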

Chapter 5

Auto-remediation
Insight

Infrastructure as code and powerful cloud orchestration layers provide the necessary ingredients to automate operations and enable self-remediation. This will not only reduce operational cost and deliver better service, but also avoid human error. The key to truly autonomous cloud operations is reliable system health information including deep anomaly root cause and impact analyses.

Challenge

Many cloud platforms offer mechanisms to dynamically adjust resources based on load demand or restart unhealthy hosts and services. Some of these solutions are very advanced; however, they only work within their designed scope. Software intelligence solutions cover the entire banking system end-to-end, including hybrid environments where mainframes exist alongside multiple cloud platforms.

Enabling auto-remediation

There are many ways of implementing auto-remediation in practice. Typically, the software intelligence platform integrates with CI/CD solutions or with cloud platform configuration layers to execute remediation actions. In any case, the software intelligence solution needs to provide full stack monitoring, automatic anomaly detection, precise root cause analysis and problem notification through APIs.

[Figure: auto-remediation flow. Software Intelligence Platform: full-stack environment is monitored -> anomalies are detected automatically -> root cause analysis is performed -> problem notification is sent. CI/CD Automation: event is received -> job is triggered -> playbook is executed -> problem is remediated.]
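
The receiving side of this flow (event received, job triggered, playbook executed) might look like the following Python sketch. The notification fields (`id`, `root_cause_type`, `entity`) and the playbook names are hypothetical; a real integration would use the payload format of the monitoring platform and the job interface of the automation tool in use.

```python
PLAYBOOKS = {}


def playbook(root_cause_type):
    """Register a remediation function for one type of root cause."""
    def register(func):
        PLAYBOOKS[root_cause_type] = func
        return func
    return register


@playbook("CPU_SATURATION")
def scale_out(problem):
    # Placeholder for a call into a cloud or orchestration layer.
    return f"scale out {problem['entity']}"


@playbook("BAD_DEPLOYMENT")
def roll_back(problem):
    # Placeholder for a call into the CI/CD tool chain.
    return f"roll back {problem['entity']}"


def handle_notification(problem):
    """Pick the playbook that matches the reported root cause, or escalate."""
    action = PLAYBOOKS.get(problem["root_cause_type"])
    if action is None:
        return f"escalate problem {problem['id']} to on-call"
    return action(problem)


# Hypothetical problem notification as it might arrive through an API.
notification = {"id": 753, "root_cause_type": "CPU_SATURATION", "entity": "host-c"}
print(handle_notification(notification))
```

Keeping an explicit escalation path for unknown root causes means automation only acts where a playbook was deliberately written for the situation.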

Path to NoOps: Auto-Remediation, Self-Healing...

Complex auto-remediation sequences

This example shows how a precise analysis of the technical root cause, foundational root causes and user/business impact can be used to automate problem resolution through integration with a variety of CI/CD, ITOM, workflow and cloud technologies.

[Figure: problem evolution for www.easytravel.com between 08:00 and 08:45, with impacted users and affected service calls. Instead of "Escalate at 2am?", the sequence is "Auto Mitigate!":
1. CPU exhausted? Add a new service instance!
2. High garbage collection? Adjust/revert memory settings!
3. Issue with BLUE only? Switch back to GREEN!
4. Hung threads? Restart service
Impact mitigated? Update dev tickets and mark bad commits.
5. Still ongoing? Initiate rollback!
Still ongoing? Escalate.]
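
The decision sequence in this example can be expressed as a short Python sketch. The symptom names, the action strings, and the `is_mitigated` callback are assumptions made for illustration; in practice the check would query the monitoring platform for the current impact.

```python
# Ordered checks taken from the example: (symptom, remediation action).
STEPS = [
    ("cpu_exhausted", "add service instance"),
    ("high_garbage_collection", "revert memory settings"),
    ("blue_only", "switch back to green"),
    ("hung_threads", "restart service"),
]


def remediate(symptoms, steps, is_mitigated):
    """Apply matching steps in order and stop as soon as the impact is mitigated."""
    executed = []
    for symptom, action in steps:
        if symptom in symptoms:
            executed.append(action)
            if is_mitigated(executed):
                return executed + ["update dev tickets"]
    # Nothing helped so far: roll back, and escalate only as the last resort.
    executed.append("initiate rollback")
    if not is_mitigated(executed):
        executed.append("escalate")
    return executed


# Hypothetical run: restarting the service resolves the hung threads.
print(remediate({"hung_threads"}, STEPS, lambda done: "restart service" in done))

# Hypothetical run: nothing helps, so a human is paged after the rollback.
print(remediate({"cpu_exhausted"}, STEPS, lambda done: False))
```

Verifying the impact after every action is what allows the sequence to stop early and keeps escalation as the final step rather than the first.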

Chapter 6

Automation and system integrations


Insight

Automation doesn’t stop at software operations and auto-remediation in an enterprise grade financial services application environment. Accurate and explainable software intelligence has the capacity to move towards automating the entire digital value chain and to enable novel business processes.

The unbreakable DevOps pipeline

Over the last years, many DevOps teams have come a long way in implementing a CI/CD pipeline that codifies and automates parts of the build, testing, and deployment steps. The goal is to speed up time to market and ensure excellent software quality: to get faster and better. AI-powered software intelligence helps to close existing automation gaps like manual approval steps at decision gates or build validation. It also provides valuable performance signatures to test new builds against production scenarios.⁴

This follows the concept of “shift left”: using more production data earlier in the development lifecycle to answer the question: "Is this a good or bad change that we try to push towards production?"

3x faster build and test cycles, 50% reduction in issues. (Verizon Enterprise)

[Figure: pipeline spanning Continuous Integration (CI) and Continuous Delivery (CD). Check in -> auto trigger -> AI-powered quality gate, across the stages 1 BUILD, 2 DEV, 3 BETA, 4 PROD.]
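
A quality gate of this kind boils down to comparing a build's measurements with a reference signature. The Python sketch below is a minimal illustration under stated assumptions: the metric names, the 10% tolerance, and the rule that higher values are worse are all invented for the example and do not describe a specific Dynatrace or Jenkins feature.

```python
def quality_gate(build_metrics, signature, tolerance=0.10):
    """Pass only if no metric exceeds its signature value by more than `tolerance`."""
    violations = {}
    for metric, limit in signature.items():
        value = build_metrics.get(metric)
        if value is not None and value > limit * (1 + tolerance):
            violations[metric] = {"build": value, "signature": limit}
    return len(violations) == 0, violations


# Hypothetical performance signature derived from production behavior.
signature = {"response_time_ms": 120, "db_calls_per_request": 4}

good_build = {"response_time_ms": 125, "db_calls_per_request": 4}
bad_build = {"response_time_ms": 180, "db_calls_per_request": 4}

print(quality_gate(good_build, signature))
print(quality_gate(bad_build, signature))
```

Returning the violations alongside the verdict gives developers the reason for a rejected build, replacing a manual approval step with an explainable automated one.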

⁴https://www.dynatrace.com/news/blog/shift-left-in-jenkins-how-to-implement-performance-signature-with-dynatrace/

Automating customer service

Any good software intelligence solution needs to include real user data, and an impact analysis (as described in chapter 4) can be used to ensure customer satisfaction even if something goes wrong.

In case of a breakdown or slowdown, the system can engage autonomously with impacted users. One way is to open a chat window operated by a chatbot behind the scenes and inform the customer about the specific performance issue, then offer to make it up to them by providing discounts, etc.

[Figure: chat window.
Mary: "Hi Dirk! We want to apologize for the poor performance of our website today. We’re working under high pressure to fix this. Thank you for your patience."
Dirk: "Thank you!"]

Chapter 7

Natural language interfaces


Insight

Virtual assistants have emerged as one of the fastest-growing areas of AI and exist in many forms, including voice bots, text bots, and SMS bots.⁵ These technologies have become mainstream with smartphones and home automation systems. When applied to software intelligence platforms, which usually require expert knowledge to navigate, natural language interfaces enable broad user adoption, organizational learning and innovation.

Challenge

Most people are not well-trained performance engineers, but everybody knows how to ask a question. To improve service quality in the eye of your customers, up-to-the-minute information on system health and business KPIs needs to be available to customer service reps, analysts, software developers and architects, and agency leaders across the board.

Information and actionable knowledge can be democratized to allow for cross-departmental organizational learning and culture change.

Smart assistants meet software intelligence

A software intelligence solution that holds actionable insights about recent and current problems, their causes, and their impacts on customers is well suited for a new and convenient human user interface.

With Dynatrace Davis Assistant users can simply ask a question verbally or use a text-based chat tool to interact with the software intelligence platform, and they’ll get a plain-language answer back within the second. Alternatively, for a deep dive, the bot can open a web browser and display the relevant charts. The ultimate level of system intelligence would be smart assistants that identify an error pattern, proactively suggest remediation actions, and only ask for approval to execute them.

⁵AI Technologies: William Blair Industry Report, June 28, 2018

"Give me my morning report."

Good morning. After analyzing 13 applications, here are some interesting findings from the last 24 hours.

• Easytravel.com reported an apdex customer experience rating of 0.85, a 49 percent increase from what was expected; and 7,200 sessions per minute, a 24 percent increase from what was expected.

• Vmware.dynatrace.com had an apdex customer experience rating of 0.97 which is near the expected value of 0.9.

"What was the customer experience over the weekend?"

Over the weekend, Vmware.easytravel.com was the most heavily used application. The customer experience was excellent, with an average Apdex rating of 0.97. Is there anything else I can help you with?

"What was the total dollar value of bookings?"

$3.78 million. That’s up from the previous weekend by 2 percent.

"Are there any problems right now?"

There are currently 11 open problems. The most affected application is easytravel dynatrace-dev, which is being affected by two issues. Would you like to see a listing of these issues?

"Yes."
(Bot opens a web browser with the pertinent information.)
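
The Apdex ratings quoted in these answers follow a published formula: with a target response time T, samples up to T count as satisfied, samples up to 4T count as tolerating (weighted by one half), and slower samples count as frustrated. The Python sketch below applies that formula to invented response times; how any particular product collects or aggregates the underlying measurements is not shown and is not implied here.

```python
def apdex(response_times, threshold):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    if not response_times:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds, with a target threshold T of 0.5 s.
samples = [0.2, 0.4, 0.5, 1.0, 1.9, 3.0]

print(round(apdex(samples, threshold=0.5), 2))
```

A score of 1.0 means every user was satisfied, so a rating such as 0.97 indicates that very few requests were slower than the target.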

Software intelligence holds the key to true AIOps, auto-remediation
and automation along the entire digital value chain.

We hope this AIOps Done Right ebook has inspired you to take
the next step in your digital journey.
Dynatrace is committed to providing financial institutions the data and intelligence they need to be successful with their
enterprise cloud and digital transformation initiatives, no matter how complex.

Learn more
If you are ready to learn more, please visit dynatrace.com/platform for assets, resources, and a free 15-day trial.

About Dynatrace
Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers
precise answers about the performance of applications, the underlying infrastructure and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver
more value with dramatically less effort. That’s why many of the world’s largest enterprises trust Dynatrace® to modernize and automate cloud operations, release better software faster, and deliver
unrivaled digital experiences.

09.24.20 10337_EBK_Agency/cs ©2020 Dynatrace
