A - JVM - Threading - Model - For - The - Containerized - Times 2

The document discusses building an adaptive threading model to handle spikes in load for a payment processing system. It describes detecting degradation through fine-grained CPU metrics and multiple checks, including CPU usage, throttled percentage, and memory usage. An adaptive concurrency approach is proposed to dynamically adjust the thread pool size to avoid CPU throttling, along with reactive backpressure to reject work above system capacity.

Uploaded by

Ícaro

A JVM threading model for the containerized times

Flavio Brasil, Principal Engineer
Luiz Hespanha, Principal Engineer
Systems Performance @ Nubank
01 { The perfect storm }
PIX

Running since the end of 2020, Pix is an instant payment platform created and
managed by the monetary authority of Brazil, the Central Bank of Brazil (BCB), which
enables the quick execution (max 10 seconds) of payments and transfers 24/7.
PIX

Monthly transfers (thousands) - 4 Billion in July!


PIX

Essential for people's day-to-day


Payday madness
- Number of failures (red) increasing
- "Nubank's PIX down today?"
- Service experiencing instability
02 { Understanding the problem }
The crash resolution paradox

The system normally operates at a low CPU usage

But when there's a load spike, the CPU becomes a bottleneck

Latencies skyrocket, and the system sometimes becomes unresponsive

Resolution: more CPU capacity!?

Cluster-wide instability

Some crashes escalated to k8s nodes becoming saturated

A few systems consumed all CPU resources and became noisy neighbors

Several nodes became saturated and instability spread to collocated services

Symptoms of a bottleneck

As described by the Universal Scalability Law (USL), efficiency can drop
significantly when a bottleneck is reached

The median CPU usage (blue line) was dropping over time while the number of
cores (bars) was growing at a higher pace than the load
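
For reference, the USL is commonly written as C(N) = N / (1 + alpha * (N - 1) + beta * N * (N - 1)), where alpha models contention and beta models coherency cost. The sketch below evaluates it for illustrative coefficients (the values of alpha and beta are assumptions, not measurements from the talk) to show how efficiency, C(N) / N, falls as capacity is added.

```java
// Minimal sketch of the Universal Scalability Law (USL).
// The alpha (contention) and beta (coherency) values used in main are
// made-up illustrative numbers, not figures from the presentation.
final class UslSketch {

    /** Relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)). */
    static double capacity(int n, double alpha, double beta) {
        return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1));
    }

    /** Efficiency is the capacity obtained per unit of resource: C(N) / N. */
    static double efficiency(int n, double alpha, double beta) {
        return capacity(n, alpha, beta) / n;
    }

    public static void main(String[] args) {
        double alpha = 0.05;  // hypothetical contention coefficient
        double beta = 0.001;  // hypothetical coherency coefficient
        for (int n : new int[] {1, 4, 16, 64}) {
            System.out.printf("N=%d capacity=%.2f efficiency=%.2f%n",
                n, capacity(n, alpha, beta), efficiency(n, alpha, beta));
        }
    }
}
```

With these coefficients the capacity at N=64 is lower than at N=16: past the bottleneck, adding cores makes things worse rather than better.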
Our goals
● Systems remain functional even if overloaded
● Problematic systems can't affect others
03 { Seeing through the noise }
CPU isolation

Setting      | Linux config | Nature     | Mechanism                               | Prevents node saturation
CPU requests | cpu.shares   | Soft limit | Prioritization when all CPUs are busy   | No
CPU limits   | cpu.quota    | Hard limit | Enforced even if node has available CPU | Yes**

** Partially since it's a quota and not a concurrency limit, but it's generally enough
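
As a rough illustration of the two mechanisms, the sketch below converts Kubernetes CPU requests and limits (in millicores) into the cgroup v1 values derived from them: `cpu.shares` is proportional to the request (1024 shares per core), and the CFS quota (`cpu.cfs_quota_us`) is the limit multiplied by the scheduling period, which defaults to 100 ms. The class and method names are hypothetical, and the minimum-value clamping applied by the kubelet is left out.

```java
// Sketch: how CPU requests/limits in millicores map to cgroup v1 CFS settings.
// Names are illustrative; lower-bound clamping is intentionally omitted.
final class CpuIsolationSketch {
    static final long SHARES_PER_CPU = 1024;        // cpu.shares for one full core
    static final long DEFAULT_PERIOD_US = 100_000;  // default cpu.cfs_period_us (100 ms)

    /** CPU request -> cpu.shares: a relative weight that only matters under contention. */
    static long sharesFor(long requestMillicores) {
        return requestMillicores * SHARES_PER_CPU / 1000;
    }

    /** CPU limit -> cpu.cfs_quota_us: CPU time allowed per period, always enforced. */
    static long quotaUsFor(long limitMillicores, long periodUs) {
        return limitMillicores * periodUs / 1000;
    }

    public static void main(String[] args) {
        System.out.println(sharesFor(500));                       // 512
        System.out.println(quotaUsFor(4000, DEFAULT_PERIOD_US));  // 400000
    }
}
```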
CPU throttling oddness
CPU throttling intuition

CPU quota:
4 cores

Each square:
20ms of CPU
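
The arithmetic behind the picture can be sketched as follows, assuming the default 100 ms CFS period (the slides do not state the period). A 4-core quota is then 400 ms of CPU time per period, or 20 squares of 20 ms, and how quickly the squares are consumed depends on how many threads run in parallel. The class and method names are made up for this illustration.

```java
// Sketch: how long is a container throttled within one CFS period?
// Assumes a 100 ms period and threads that stay busy on separate cores.
final class ThrottlingIntuition {
    static final double PERIOD_MS = 100.0; // assumed default CFS period

    /** CPU time budget per period, in ms, for a quota expressed in cores. */
    static double budgetMs(double quotaCores) {
        return quotaCores * PERIOD_MS;
    }

    /**
     * Wall-clock ms per period during which the container is throttled when
     * runningThreads threads are continuously busy, each on its own core.
     */
    static double throttledMs(double quotaCores, int runningThreads) {
        if (runningThreads <= 0) {
            return 0.0;
        }
        double msUntilBudgetSpent = budgetMs(quotaCores) / runningThreads;
        return Math.max(0.0, PERIOD_MS - msUntilBudgetSpent);
    }

    public static void main(String[] args) {
        System.out.println(budgetMs(4) / 20.0);  // 20.0 squares of 20 ms
        System.out.println(throttledMs(4, 4));   // 0.0
        System.out.println(throttledMs(4, 8));   // 50.0
        System.out.println(throttledMs(4, 16));  // 75.0
    }
}
```

Under these assumptions, 16 busy threads spend the whole budget in 25 ms and are then paused for the remaining 75 ms of the period, even though the usage averaged over the period never exceeds 4 cores.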
Current schools of thought

Just disable limits!
● How can we prevent node saturation?
● Environment-dependent performance

CPU pinning is the one true way!
● What about small systems with fractional quotas?
● No possibility of bursts
● k8s scheduling pressure
The danger of averages

What is it hiding? 🤔

The CPU avg is a lie! 😱
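
A small numeric illustration of the point, using made-up per-100 ms usage samples (not data from the talk): the average over one second looks comfortable, while two of the ten periods sit exactly at the quota, which is where throttling happens.

```java
import java.util.Arrays;

// Sketch: an average can hide short bursts that hit the CPU quota.
// The samples below are invented for illustration only.
final class AverageVsPeak {

    static double average(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    static double peak(double[] samples) {
        return Arrays.stream(samples).max().orElse(0.0);
    }

    /** Number of samples in which usage reached the quota. */
    static long periodsAtQuota(double[] samples, double quotaCores) {
        return Arrays.stream(samples).filter(s -> s >= quotaCores).count();
    }

    public static void main(String[] args) {
        // Cores used in each 100 ms window of one second, with a 4-core quota.
        double[] usage = {0.5, 0.5, 0.5, 4.0, 4.0, 0.5, 0.5, 0.5, 0.5, 0.5};
        System.out.printf("avg=%.1f peak=%.1f windowsAtQuota=%d%n",
            average(usage), peak(usage), periodsAtQuota(usage, 4.0));
    }
}
```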
Fine-grained CPU metrics

Based on https://github.com/sqshq/sampler

Does not require a shorter Prometheus scraping interval
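
One fine-grained signal available inside a container is the cgroup `cpu.stat` file, which exposes the counters `nr_periods` and `nr_throttled`. The sketch below parses two snapshots of that file's content and computes the share of CFS periods that were throttled in between. Whether the authors' tooling reads this file is not stated in the slides; the class name and the snapshot-diff approach are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: throttled percentage between two snapshots of a cgroup cpu.stat file.
// Not the authors' tooling; an illustrative assumption about one possible signal.
final class CpuStatSketch {

    /** Parses "key value" lines as found in a cgroup cpu.stat file. */
    static Map<String, Long> parse(String content) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : content.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    /** Percentage of CFS periods between the two snapshots that were throttled. */
    static double throttledPercent(Map<String, Long> before, Map<String, Long> after) {
        long periods = after.getOrDefault("nr_periods", 0L) - before.getOrDefault("nr_periods", 0L);
        long throttled = after.getOrDefault("nr_throttled", 0L) - before.getOrDefault("nr_throttled", 0L);
        return periods <= 0 ? 0.0 : 100.0 * throttled / periods;
    }

    public static void main(String[] args) {
        // Hard-coded sample contents; a real reader would load the file twice.
        String t0 = "nr_periods 1000\nnr_throttled 100\nthrottled_usec 5000000\n";
        String t1 = "nr_periods 1010\nnr_throttled 105\nthrottled_usec 5200000\n";
        System.out.println(throttledPercent(parse(t0), parse(t1))); // 50.0
    }
}
```

Sampling the counters at short intervals inside the process keeps the resolution high regardless of how often the metrics backend scrapes.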


How can we make systems behave within the CPU quota?
04 { Building an adaptive threading model }
Nauvoo

01 Fine-grained perf metrics
You can't improve what you don't measure!

02 Adaptive concurrency
No more manual thread pool tuning, avoids CPU throttling on the fly

03 Reactive backpressure
Rejects work above the system's capacity, avoids unbounded queuing and GC death spirals
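
As a generic illustration of the backpressure point (not Nauvoo's implementation, which the slides do not show), a plain `ThreadPoolExecutor` with a bounded queue and `AbortPolicy` already rejects work beyond a fixed capacity instead of queuing it without bound. The pool size, queue size, and class name are arbitrary choices for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: bounded queue + AbortPolicy as a simple form of backpressure.
final class BackpressureSketch {

    /** Fixed-size pool with a bounded queue; excess submissions are rejected. */
    static ThreadPoolExecutor boundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy());
    }

    /** Submits blocking tasks and returns how many submissions were rejected. */
    static int countRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor executor = boundedExecutor(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                executor.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // the caller can shed load here, e.g. fail fast
            }
        }
        release.countDown();
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 running + 3 queued are accepted, the remaining 5 are rejected.
        System.out.println(countRejections(2, 3, 10)); // 5
    }
}
```

Rejected requests fail immediately, so memory held by queued work stays bounded and the garbage collector is not asked to track an ever-growing backlog.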
Detecting degradation

v0: check all the things
Multiple checks:
● CPU usage
● Throttled %
● Memory
Tries to avoid degradation

v1: heartbeat mode
● Inspired by jHiccup
● while (true) { measure(); Thread.sleep(1); }
Allows a configurable level of degradation
Also detects GC pauses, safepoints, allocation stalls
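
A minimal sketch of the heartbeat idea, in the spirit of jHiccup: a thread repeatedly sleeps for 1 ms and records how much longer than requested each sleep took. Large overshoots suggest the thread was not scheduled on time (CPU throttling, GC pauses, safepoints). The class name, the fixed number of beats, and the use of the maximum as the summary are assumptions, not details from the talk.

```java
// Sketch: heartbeat-style degradation probe based on sleep overshoot.
final class HeartbeatSketch {

    /** Extra delay beyond the requested sleep, in nanoseconds (never negative). */
    static long overshootNanos(long requestedNanos, long elapsedNanos) {
        return Math.max(0L, elapsedNanos - requestedNanos);
    }

    /** Runs the given number of 1 ms sleeps and returns the worst overshoot seen. */
    static long worstOvershootNanos(int beats) {
        final long requestedNanos = 1_000_000L;
        long worst = 0L;
        for (int i = 0; i < beats; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            long elapsed = System.nanoTime() - start;
            worst = Math.max(worst, overshootNanos(requestedNanos, elapsed));
        }
        return worst;
    }

    public static void main(String[] args) {
        long worst = worstOvershootNanos(200);
        System.out.printf("worst overshoot: %.3f ms%n", worst / 1_000_000.0);
    }
}
```

A production probe would run continuously on a dedicated thread and compare the overshoot against a configurable tolerance rather than stopping after a fixed number of beats.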
Controlling degradation

(Diagram: Nauvoo placed between the Linux Scheduler and three Executors)

If there's degradation, reduce concurrency
If threads are reliably scheduled, allow more concurrency

Main challenges
- Reaction time
Start with small changes and escalate via exponential steps
- Control loop stability
Introduce metastable state thresholds to stabilize changes
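
The control loop described above might be sketched like this. It is a simplified illustration, not Nauvoo's algorithm: the class name, the doubling step, and the streak threshold used for stability are all assumptions. The limit only changes after several consecutive samples agree on a direction, and the step size doubles while that direction persists.

```java
// Sketch: adaptive concurrency limit with exponential steps and a
// streak threshold that damps oscillation. Illustrative only.
final class AdaptiveLimitSketch {
    private final int min;
    private final int max;
    private final int streakThreshold;
    private int limit;
    private int streak = 0; // > 0: consecutive healthy samples, < 0: consecutive degraded samples
    private int step = 1;

    AdaptiveLimitSketch(int initial, int min, int max, int streakThreshold) {
        this.limit = initial;
        this.min = min;
        this.max = max;
        this.streakThreshold = streakThreshold;
    }

    /** Feeds one observation and returns the (possibly updated) concurrency limit. */
    int onSample(boolean degraded) {
        int direction = degraded ? -1 : 1;
        if (streak != 0 && Integer.signum(streak) != direction) {
            // The trend flipped: start over with small changes.
            streak = 0;
            step = 1;
        }
        streak += direction;
        if (Math.abs(streak) >= streakThreshold) {
            limit = Math.max(min, Math.min(max, limit + direction * step));
            step = Math.min(step * 2, max); // escalate while the trend holds
        }
        return limit;
    }

    int limit() {
        return limit;
    }

    public static void main(String[] args) {
        AdaptiveLimitSketch controller = new AdaptiveLimitSketch(16, 1, 64, 3);
        for (int i = 0; i < 5; i++) {
            controller.onSample(true);  // sustained degradation
        }
        System.out.println(controller.limit()); // 9
        for (int i = 0; i < 3; i++) {
            controller.onSample(false); // threads scheduled reliably again
        }
        System.out.println(controller.limit()); // 10
    }
}
```

The returned limit could then be applied to a pool, for example through `ThreadPoolExecutor.setMaximumPoolSize`, though how Nauvoo applies it is not described in the slides.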
Demo
Adapting # of threads to a load with low CPU usage and thread blocking
- Load increasing
- CPU throttling under control
- Number of threads increasing to handle the load

Demo
Adapting # of threads to a CPU intensive load + rejections
- Tasks being rejected
- Number of threads decreasing
- Service suffering with CPU throttling

Results
Social - Stability improvement
Magnitude - Latency reduction
Stormshield - Cost reduction


05 { The path ahead }
Further optimization
Nauvoo v2
01 Lower overhead
02 Loom integration
03 Prepare for open source

Optimizations by several teams!
major rewrites, database migration, tunings, ...

Nauvoo

CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik
Payday sanity
Thanks!
@fbrasisil
@luiz_hespanha
