Tracking Spark Memory Usage - Phase 1
1. Goal
Expose all of the memory usage information for a Spark executor, including all major components, and
record it as history for later tracing, debugging, and tuning.
The memory used in each executor includes memory used by Spark itself and by third-party libraries. Memory used by
Spark itself includes the RDD cache, shuffle maps, and so on. Other memory is used by third-party library
buffers, such as Netty for network transfer and Kryo for serialization. For Spark itself, task
metrics can collect some of the memory usage (currently task metrics do not aim to collect memory
usage information), and future work can build on task metrics to expose it.
In this design doc, which is phase 1 of SPARK-9103, we mainly focus on how to track the network-layer
buffers that Netty uses in Spark. Memory used by other third-party libraries can be exposed based on
this design.
3. Proposal
Spark has two ways to perform network transfer: one is NIO based and the other is Netty based.
Since NIO is going to be removed in Spark 1.6, we focus on the network buffers that Netty uses. Netty
is a third-party library for Spark, and its internals are not controlled by Spark, so we can only
obtain the memory Netty uses by invoking the metrics API that Netty exposes. That means the architecture
must be sample based: Spark can only take a sample at each time interval.
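For illustration, sampling the Netty pooled allocator might look like the sketch below. It assumes Netty 4.1's allocator metric API; the exact accessors depend on the Netty version shipped with Spark, and the helper name is hypothetical.

```scala
import io.netty.buffer.PooledByteBufAllocator

// Illustrative sketch: periodically sample the pooled allocator that Spark's
// transport layer uses. PooledByteBufAllocatorMetric is available in Netty 4.1+;
// older versions expose similar per-arena metrics.
def sampleNettyMemory(allocator: PooledByteBufAllocator): (Long, Long) = {
  val metric = allocator.metric()
  val heapUsed = metric.usedHeapMemory()      // bytes of pooled heap buffers
  val directUsed = metric.usedDirectMemory()  // bytes of pooled direct buffers
  (heapUsed, directUsed)
}
```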
3.1 Metrics
We introduce two classes, ExecutorMetrics and TransportMetrics, to present the metrics for executors.
ExecutorMetrics holds the metrics for the whole executor; it is at the
executor level, not the task level. TransportMetrics holds the metrics for network transport (Netty). We
can also add other metrics to ExecutorMetrics, for example metrics aggregated over all tasks
running in this executor.
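A minimal sketch of these two classes, where the field names and types are illustrative assumptions based on the description above rather than the final API:

```scala
// Network (Netty) buffer usage sampled at one point in time.
case class TransportMetrics(
    timeStamp: Long,    // when the Netty allocator metrics were sampled
    onHeapSize: Long,   // bytes of heap memory held by Netty buffers
    offHeapSize: Long)  // bytes of direct (off-heap) memory held by Netty buffers

// Executor-level (not task-level) metrics reported through the heartbeat.
case class ExecutorMetrics(
    hostPort: String,                    // executor location
    transportMetrics: TransportMetrics)  // network buffer usage
```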
3.3 WebUI
We add an extra WebUI tab (MemoryTab) to show the memory usage of each executor, and a
MemoryListener to handle the updated ExecutorMetrics and other events.
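A minimal sketch of the MemoryListener class, where the bookkeeping structures, and the assumption that SparkListenerExecutorMetricsUpdate carries the new ExecutorMetrics, reflect the description below rather than the final implementation:

```scala
import scala.collection.mutable

import org.apache.spark.scheduler._

class MemoryListener extends SparkListener {
  // Latest ExecutorMetrics reported by each live executor.
  val activeExecutorToMetrics = mutable.HashMap.empty[String, ExecutorMetrics]
  // Last ExecutorMetrics seen right before an executor was removed.
  val removedExecutorToMetrics = mutable.HashMap.empty[String, ExecutorMetrics]
  // Metrics observed per stage and executor while the stage runs.
  val stageToExecutorMetrics =
    mutable.HashMap.empty[Int, mutable.HashMap[String, ExecutorMetrics]]

  override def onExecutorMetricsUpdate(
      update: SparkListenerExecutorMetricsUpdate): Unit = {
    // Assumed extension: the heartbeat update also carries ExecutorMetrics.
    update.executorMetrics.foreach { metrics =>
      activeExecutorToMetrics(update.execId) = metrics
      stageToExecutorMetrics.values.foreach { perExecutor =>
        perExecutor(update.execId) = metrics // in practice, keep the peak value
      }
    }
  }

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    stageToExecutorMetrics(event.stageInfo.stageId) =
      mutable.HashMap(activeExecutorToMetrics.toSeq: _*)
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    activeExecutorToMetrics.remove(event.executorId).foreach { metrics =>
      removedExecutorToMetrics(event.executorId) = metrics
    }
  }
}
```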
Events like ExecutorAdded and StageSubmitted are handled in MemoryListener because we want to show
the ExecutorMetrics for each stage, for example the maximum size of the Netty buffers of each executor
during one stage. We also want to show all removed executors along with the last
ExecutorMetrics reported right before each executor was removed. Note that removing executors happens
frequently with dynamic allocation.
Since we show the ExecutorMetrics for each stage, we also add an additional web page named
“StageMemoryPage” attached to MemoryTab.
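A minimal sketch of how the tab and page could be wired into the web UI; the constructor parameters and rendering logic here are assumptions, and only the MemoryTab and StageMemoryPage names come from this design:

```scala
import javax.servlet.http.HttpServletRequest

import scala.xml.Node

import org.apache.spark.ui.{SparkUI, SparkUITab, WebUIPage}

private[ui] class MemoryTab(parent: SparkUI) extends SparkUITab(parent, "memory") {
  val listener = new MemoryListener
  attachPage(new StageMemoryPage(this))  // per-stage memory view
}

private[ui] class StageMemoryPage(parent: MemoryTab) extends WebUIPage("stage") {
  override def render(request: HttpServletRequest): Seq[Node] = {
    // Render the per-stage ExecutorMetrics collected by the listener.
    <p>{parent.listener.stageToExecutorMetrics.size} stages tracked</p>
  }
}
```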
We support both live web display and history replaying of ExecutorMetrics. The data processing
for these two parts is similar but not identical.
Taking one executor as an example, Figure-1 shows the stages and the heartbeats of that executor.
[Figure-1: heartbeats HB1–HB6 received across the stages of one executor]
If no heartbeat is received within a stage, such as Stage2 shown in Figure-1, then we choose the
latest heartbeat received before it, which is HB3 for Stage2.
The maintained SparkListenerExecutorMetricsUpdate events are written to the event log right before the following events:
SparkListenerStageCompleted
SparkListenerExecutorRemoved
After each such log, all maintained SparkListenerExecutorMetricsUpdate events are discarded, which means the status is refreshed.
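A rough sketch of this event-log bookkeeping, assuming a logEvent() helper that appends a SparkListenerEvent to the event log; the class and method names here are hypothetical:

```scala
import scala.collection.mutable

import org.apache.spark.scheduler._

class MemoryMetricsEventLogger(logEvent: SparkListenerEvent => Unit) {
  // One maintained (latest) metrics-update event per executor since the last flush.
  private val pendingUpdates =
    mutable.HashMap.empty[String, SparkListenerExecutorMetricsUpdate]

  def onExecutorMetricsUpdate(update: SparkListenerExecutorMetricsUpdate): Unit = {
    // Keep only the latest heartbeat per executor; a stage that received no
    // heartbeat of its own reuses the last one kept here (HB3 for Stage2).
    pendingUpdates(update.execId) = update
  }

  // Called right before SparkListenerStageCompleted or
  // SparkListenerExecutorRemoved is written to the event log.
  def flushBefore(event: SparkListenerEvent): Unit = {
    pendingUpdates.values.foreach(logEvent)
    logEvent(event)
    pendingUpdates.clear() // the status is refreshed after every flush
  }
}
```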
The case in Figure-1 is a simple one: only one stage runs at a time. It becomes a little more
complicated when several stages run at the same time, as shown in Figure-3.
[Figure-3: overlapping stages Stage1–Stage5 running concurrently on one executor]
If the stages are interleaved, we cannot simply combine several events into one per stage. We still
log the events in the way discussed previously, that is, right before the
SparkListenerStageCompleted event. So we separate the time into 4 segments for the case in Figure-3,
making it equivalent to Figure-4.
[Figure-4: the stages from Figure-3 with the timeline split into segments T1–T4]
Then we can process it the same way as the case in Figure-1, which makes the result of history
replaying the same as that of live presentation. In this case, we combine (HB1, HB2) into CHB1,
(HB3, HB4) into CHB2, HB5 into CHB3, and (HB6, HB7, HB8, HB9) into CHB4.
The actual event logging will look like Figure-5.
[Figure-5: combined heartbeats CHB1–CHB4 logged at HB2, HB4, HB5, and HB9, at the segment boundaries T1–T4]
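A rough sketch of the per-segment combining step, assuming a combined heartbeat simply keeps the last sample per executor within a segment (a peak-value merge would follow the same shape); the function name is hypothetical:

```scala
import org.apache.spark.scheduler.SparkListenerExecutorMetricsUpdate

// Combine the heartbeats received inside one time segment (bounded by
// stage-completion events) into one combined heartbeat (CHB) per executor.
def combineSegment(
    heartbeats: Seq[SparkListenerExecutorMetricsUpdate])
  : Seq[SparkListenerExecutorMetricsUpdate] = {
  heartbeats.groupBy(_.execId).values.map(_.last).toSeq
}
```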
4. Limitations
There are several limitations to this design. First, the sample-based metrics system is not very accurate,
and in the future it cannot be fully integrated with the current taskMetrics collection, because the
current taskMetrics is both sample based and event based. Second, since we only log part of the
SparkListenerExecutorMetricsUpdate events, the logic for combining events might be
coupled with the logic for how they are processed for live presentation. Third, the number of
logged SparkListenerExecutorMetricsUpdate events might still be very large. In general, the number of logged events
can be estimated as 2*S*E, where “S” is the number of stages and “E” is the number of executors. In the
worst case, the number of logged SparkListenerExecutorMetricsUpdate events might even be larger than
the actual number of SparkListenerExecutorMetricsUpdate events received.
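For example, a hypothetical application with 1,000 stages running on 100 executors would log on the order of 2 * 1,000 * 100 = 200,000 SparkListenerExecutorMetricsUpdate events.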