A JVM Threading Model for the Containerized Times
Systems Performance @ Nubank
Nubank
Hespanha
Nubank
01 { ..
The perfect storm
} ..
PIX
Running since the end of 2020, Pix is an instant payment platform created and
managed by the Central Bank of Brazil (BCB), the country's monetary authority. It
enables payments and transfers to execute quickly (10 seconds at most), 24/7.
Nubank's PIX down today? Service experiencing instability
02 { ..
Understanding the
problem
} ..
The crash resolution paradox
The system normally operates
at a low CPU usage
Flavio
Symptoms of a bottleneck
As described by the Universal Scalability
Law (USL), efficiency can drop
significantly when a bottleneck is
reached
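A minimal sketch of the efficiency drop the USL describes, using the standard USL throughput formula X(N) = λN / (1 + σ(N−1) + κN(N−1)). The coefficient values (λ, σ, κ) and the class name are hypothetical and chosen only for illustration; they are not measurements from the talk.

```java
// Universal Scalability Law sketch. All coefficient values are hypothetical.
final class UslSketch {

    /** Throughput at concurrency n: lambda*n / (1 + sigma*(n-1) + kappa*n*(n-1)). */
    static double throughput(int n, double lambda, double sigma, double kappa) {
        return lambda * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    /** Efficiency relative to perfect linear scaling (1.0 means nothing is lost). */
    static double efficiency(int n, double sigma, double kappa) {
        return 1.0 / (1.0 + sigma * (n - 1) + kappa * n * (n - 1));
    }

    public static void main(String[] args) {
        double lambda = 1000.0; // throughput of a single worker (assumed)
        double sigma = 0.05;    // contention coefficient (assumed)
        double kappa = 0.002;   // coherency coefficient (assumed)
        for (int n : new int[] {1, 8, 32, 128}) {
            System.out.printf("n=%d throughput=%.0f efficiency=%.2f%n",
                    n, throughput(n, lambda, sigma, kappa), efficiency(n, sigma, kappa));
        }
    }
}
```

With these assumed coefficients, throughput at 128 workers is lower than at 32: past the bottleneck, adding concurrency makes the system slower, not faster.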
Our goals
● Systems remain functional even if overloaded
● Problematic systems can't affect others
03 { ..
Seeing through
the noise
} ..
CPU isolation
              Linux config  Nature      Mechanism                                 Prevents node saturation
CPU requests  cpu.shares    Soft limit  Prioritization when all CPUs are busy     No
CPU limits    cpu.quota     Hard limit  Enforced even if node has available CPU   Yes**
** Partially, since it's a quota and not a concurrency limit, but it's generally enough
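A hedged sketch of how CPU requests and limits, expressed in millicores, are commonly translated into cgroup v1 shares and quota values. The arithmetic (1024 shares per core, quota scaled to a 100 ms period, with small minimums) reflects my understanding of the usual Kubernetes conversion; the class and method names are hypothetical.

```java
// Sketch of the millicore -> cgroup v1 conversion. Names are hypothetical.
final class CgroupCpuMapping {

    /** Default CFS bandwidth period: 100 ms, in microseconds. */
    static final long CFS_PERIOD_US = 100_000;

    /** CPU request (millicores) -> relative weight; 1 core maps to 1024 shares, minimum 2. */
    static long sharesFor(long requestMillicores) {
        return Math.max(2, requestMillicores * 1024 / 1000);
    }

    /** CPU limit (millicores) -> CPU microseconds allowed per period, minimum 1000. */
    static long quotaUsFor(long limitMillicores) {
        return Math.max(1000, limitMillicores * CFS_PERIOD_US / 1000);
    }

    public static void main(String[] args) {
        System.out.println("request 500m -> shares " + sharesFor(500));
        System.out.println("limit 4000m -> quota " + quotaUsFor(4000) + "us per " + CFS_PERIOD_US + "us");
    }
}
```

The shares value only matters when CPUs are contended, while the quota is a budget of CPU time per period that applies regardless of how idle the node is.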
CPU throttling oddness
CPU throttling intuition
CPU quota: 4 cores
Each square: 20ms of CPU
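The intuition can be sketched with a little arithmetic. This assumes the default 100 ms CFS period, CPU-bound threads, and a node with enough free cores to run every thread in parallel; the class and method names are hypothetical.

```java
// Back-of-the-envelope throttling model. Names and scenario are hypothetical.
final class ThrottlingIntuition {

    /**
     * Wall-clock milliseconds per period during which the container is throttled,
     * assuming every runnable thread burns CPU in parallel until the budget runs out.
     */
    static double throttledMsPerPeriod(double quotaCores, int runnableThreads, double periodMs) {
        if (runnableThreads <= 0) {
            return 0.0;
        }
        double budgetMs = quotaCores * periodMs;               // CPU time allowed per period
        double wallMsUntilExhausted = budgetMs / runnableThreads; // budget drains N times faster
        return Math.max(0.0, periodMs - wallMsUntilExhausted);
    }

    public static void main(String[] args) {
        for (int threads : new int[] {4, 8, 20}) {
            System.out.printf("quota=4 cores, threads=%d -> throttled %.0f ms of every 100 ms%n",
                    threads, throttledMsPerPeriod(4.0, threads, 100.0));
        }
    }
}
```

With a 4-core quota, 4 busy threads never throttle, 8 busy threads exhaust the 400 ms budget after 50 ms and then stall for the remaining 50 ms, and 20 busy threads stall for 80 ms of every period.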
Current schools of thought
"Just disable limits!"
● How can we prevent node saturation?
● Environment-dependent performance
"CPU pinning is the one true way!"
● What about small systems with fractional quotas?
● No possibility of bursts
● k8s scheduling pressure
What is it hiding? 🤔
The danger of averages
The CPU avg is a lie! 😱
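A small illustration of why an averaged CPU metric can hide saturation. The sample values are invented for the example (per-period CPU usage in cores against a 4-core quota); the class and method names are hypothetical.

```java
// Average vs. per-period view of the same CPU samples. Data is made up.
final class AverageVsPeak {

    static double average(double[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Fraction of samples at or above the threshold (for example, the CPU quota). */
    static double fractionAtOrAbove(double[] samples, double threshold) {
        if (samples.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (double s : samples) {
            if (s >= threshold) {
                hits++;
            }
        }
        return (double) hits / samples.length;
    }

    public static void main(String[] args) {
        // Hypothetical usage per period, in cores: mostly quiet, with two bursts at quota.
        double[] cores = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 4.0, 4.0};
        System.out.printf("average = %.1f cores%n", average(cores));
        System.out.printf("periods at quota = %.0f%%%n", 100 * fractionAtOrAbove(cores, 4.0));
    }
}
```

Here the average is 1.2 cores, which looks like 30% of the quota, yet the container sits at its quota in 20% of the periods.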
Fine-grained CPU metrics
Based on https://github.com/sqshq/sampler
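One way to obtain a fine-grained throttling metric is to sample the cgroup's `cpu.stat` file, which exposes `nr_periods` and `nr_throttled` counters. The sketch below parses two samples and computes the percentage of throttled periods between them. This is an assumption about how such a metric could be derived, not a description of the tooling shown in the talk; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Throttled-percentage sketch based on cgroup cpu.stat counters. Names are hypothetical.
final class CpuStatSketch {

    /** Parses "key value" lines as found in a cgroup cpu.stat file. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    /** Percentage of enforcement periods that were throttled between two samples. */
    static double throttledPercent(Map<String, Long> before, Map<String, Long> after) {
        long periods = after.getOrDefault("nr_periods", 0L) - before.getOrDefault("nr_periods", 0L);
        long throttled = after.getOrDefault("nr_throttled", 0L) - before.getOrDefault("nr_throttled", 0L);
        return periods <= 0 ? 0.0 : 100.0 * throttled / periods;
    }

    public static void main(String[] args) {
        Map<String, Long> t0 = parse(List.of("nr_periods 100", "nr_throttled 10"));
        Map<String, Long> t1 = parse(List.of("nr_periods 200", "nr_throttled 35"));
        System.out.printf("throttled in %.0f%% of periods%n", throttledPercent(t0, t1));
    }
}
```

In a container the lines would typically come from something like `Files.readAllLines(Path.of("/sys/fs/cgroup/cpu.stat"))`; the exact path depends on the cgroup version and mount layout.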
} ..
Nauvoo
01 Fine-grained perf metrics
You can't improve what you don't measure!
02 Adaptive concurrency
No more manual thread pool tuning, avoids CPU
throttling on the fly
03 Reactive backpressure
Rejects work above the system's capacity,
avoids unbounded queuing and GC death spirals
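Rejecting work instead of queuing it without bound can be sketched with the standard `ThreadPoolExecutor`: a bounded queue plus `AbortPolicy` makes `execute` throw `RejectedExecutionException` once the pool and queue are full. This only illustrates the general idea with JDK classes and is not Nauvoo's implementation; the class and helper names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded executor that rejects excess work. Names are hypothetical.
final class BackpressureSketch {

    /** Fixed-size pool with a bounded queue; excess tasks are rejected, not queued. */
    static ThreadPoolExecutor boundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Returns true if the task was accepted, false if it was rejected. */
    static boolean trySubmit(ExecutorService executor, Runnable task) {
        try {
            executor.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // caller can fail fast, e.g. answer with an "overloaded" error
        }
    }
}
```

A caller that gets `false` back can shed the request immediately, which keeps queue length and heap usage bounded under overload.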
Detecting degradation
v0: check all the things
Multiple checks:
● CPU usage
● Throttled %
● Memory
Tries to avoid degradation
v1: heartbeat mode
● Inspired by jHiccup
● while(true) { measure Thread.sleep(1) }
● Allows a configurable level of degradation
● Also detects GC pauses, safepoints, allocation stalls
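The heartbeat loop can be sketched as follows: sleep for 1 ms, measure how much longer than 1 ms the wake-up actually took, and track the worst overshoot. This is a simplified, bounded version written for illustration; the class and method names are hypothetical, and a real detector would run continuously on a dedicated thread.

```java
// Heartbeat-style degradation probe, in the spirit of jHiccup. Names are hypothetical.
final class HeartbeatSketch {

    /** Runs the given number of 1 ms beats and returns the worst overshoot in milliseconds. */
    static double maxHiccupMs(int beats) {
        long worstNanos = 0;
        for (int i = 0; i < beats; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop probing
                break;
            }
            // Anything beyond the requested 1 ms is time the thread could not run:
            // scheduling delay, CPU throttling, GC pause, safepoint, and so on.
            long overshoot = System.nanoTime() - start - 1_000_000L;
            worstNanos = Math.max(worstNanos, overshoot);
        }
        return worstNanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.printf("worst hiccup over 1000 beats: %.2f ms%n", maxHiccupMs(1000));
    }
}
```

Because the probe measures the delay directly, it does not need to know the cause: any stall long enough to delay the wake-up shows up as a hiccup.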
Controlling degradation
[Diagram: Linux Scheduler, Nauvoo, Executors]
● If there's degradation, reduce concurrency
● If threads are reliably scheduled, allow more concurrency
Main challenges
- Reaction time
Start with small changes and escalate via exponential steps
- Control loop stability
Introduce metastable state thresholds to stabilize changes
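A minimal sketch of such a control loop, under stated assumptions: the input is a measured scheduling delay in milliseconds, the step size doubles while the controller keeps moving in the same direction, and two thresholds form a band in which the limit is held steady. Interpreting the stabilizing thresholds as a hold band is my assumption; the class, its parameters, and the threshold values are hypothetical and not taken from Nauvoo.

```java
// Adaptive concurrency limit with exponential steps and a hold band. Names are hypothetical.
final class AdaptiveLimitSketch {
    private final int min;
    private final int max;
    private final double lowMs;   // below this delay: threads are reliably scheduled, grow
    private final double highMs;  // above this delay: degradation, shrink
    private int limit;
    private int step = 1;
    private int lastDirection = 0; // -1 shrinking, +1 growing, 0 holding

    AdaptiveLimitSketch(int min, int max, int initial, double lowMs, double highMs) {
        this.min = min;
        this.max = max;
        this.limit = initial;
        this.lowMs = lowMs;
        this.highMs = highMs;
    }

    /** Feeds one delay measurement and returns the updated concurrency limit. */
    int update(double delayMs) {
        int direction = delayMs > highMs ? -1 : (delayMs < lowMs ? 1 : 0);
        if (direction == 0) {
            // Inside the band: hold the limit and reset the step so the next move starts small.
            step = 1;
            lastDirection = 0;
            return limit;
        }
        // Same direction as last time: escalate exponentially; otherwise start over at 1.
        step = (direction == lastDirection) ? Math.min(step * 2, Math.max(1, max - min)) : 1;
        lastDirection = direction;
        limit = Math.max(min, Math.min(max, limit + direction * step));
        return limit;
    }
}
```

The returned limit could then be applied to an executor, for example through `ThreadPoolExecutor.setCorePoolSize` and `setMaximumPoolSize`.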
Demo
● Load increasing
● CPU throttling under control
● Number of threads increasing to handle the load
Adapting # of threads to a load with low CPU usage and thread blocking
Demo
● Tasks being rejected
● Number of threads decreasing
● Service suffering with CPU throttling
} ..
Further optimization
Nauvoo v2
01 Lower overhead
02 Loom integration
Optimizations by several teams!
Major rewrites, database migration, tunings, ...
Nauvoo