
Design Doc for Tracking Spark Memory Usage – Phase 1

1. Goal
Expose the memory usage information of Spark executors, covering all major components, and
record it as history for post-hoc tracing, debugging, and tuning.

2. Background and Motivation


Currently Spark exposes very little memory usage information for executors; the only exposed
memory metric is cached memory. For the other parts, take shuffle as an example: even though
users can infer that shuffle memory usage is bounded by spark.shuffle.memoryFraction when
shuffle spills occur, that assumption may not be correct. So in most situations users have no idea
what the memory consumption is when their Spark applications use a lot of memory in the
executors. An even more severe situation is that when they hit an OOM, it is really hard to know
what caused the problem and what the last memory state was before the OOM. It would therefore
be helpful to report detailed memory consumption for each part of Spark, so that users have a
clear picture of where the memory is actually used.

The memory used in each executor includes memory used by Spark itself and by third-party
libraries. Memory used by Spark itself includes the RDD cache, shuffle maps, and so on. Other
memory is used by third-party library buffers, such as Netty for network transfer and Kryo for
serialization. For Spark itself, there are task metrics that can collect some of the memory usage
(currently task metrics do not aim to collect memory usage information), and future work can
build on task metrics to expose that memory usage.

In this design doc, which is phase 1 of SPARK-9103, we mainly focus on how to track the
network-layer buffers that Netty uses in Spark. Memory used by other third-party libraries can
then be exposed based on this design.

3. Proposal
Spark has two implementations of network transfer: one is NIO based and the other is Netty
based. Since the NIO one is going to be removed in Spark 1.6, we focus on the network buffers
that Netty uses. Netty is a third-party library for Spark, and its internals are not controlled by
Spark, so we can only get the memory Netty uses by invoking the metrics API that Netty exposes.
That means the architecture must be sample based: Spark can only take a sample at each time
interval.

3.1 Metrics
We introduce

class ExecutorMetrics extends Serializable {
  var _transportMetrics: Option[TransportMetrics] = None
}

to represent the metrics for executors. ExecutorMetrics covers the whole executor: it is at the
executor level, not the task level. TransportMetrics is the metrics for network transport (Netty).
We can also add other metrics into ExecutorMetrics later, for example metrics aggregated over all
tasks running in the executor.
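
For illustration, here is a minimal sketch of what TransportMetrics might look like and how it
could be populated. The field names below are assumptions of this sketch, not taken from the
actual patch, and the sampling code assumes a Netty version that exposes
PooledByteBufAllocator#metric() (available since Netty 4.1):

import io.netty.buffer.PooledByteBufAllocator

// Illustrative shape of the network-transport metrics; the field names are
// assumptions of this sketch.
case class TransportMetrics(
    timeStamp: Long,
    onHeapSize: Long,
    offHeapSize: Long)

object TransportMetrics {
  // Take one sample from Netty's pooled allocator, e.g.
  // TransportMetrics.sample(PooledByteBufAllocator.DEFAULT).
  // Assumes Netty 4.1+, where the allocator exposes metric().
  def sample(allocator: PooledByteBufAllocator): TransportMetrics = {
    val m = allocator.metric()
    TransportMetrics(
      timeStamp = System.currentTimeMillis(),
      onHeapSize = m.usedHeapMemory(),
      offHeapSize = m.usedDirectMemory())
  }
}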

3.2 Updated by Heartbeat


As discussed previously, the Netty metrics are sample based, so we can use the Heartbeat to
sample the memory metrics periodically and send the ExecutorMetrics within the Heartbeat. The
Heartbeat is changed as follows.

private[spark] case class Heartbeat(
    executorId: String,
    executorMetrics: ExecutorMetrics,
    taskMetrics: Array[(Long, TaskMetrics)],
    blockManagerId: BlockManagerId)

And on the Spark listener side, we change SparkListenerExecutorMetricsUpdate to carry the ExecutorMetrics.

case class SparkListenerExecutorMetricsUpdate(
    execId: String,
    executorMetrics: ExecutorMetrics,
    taskMetrics: Seq[(Long, Int, Int, TaskMetrics)])
  extends SparkListenerEvent

3.3 WebUI
We add an extra WebUI tab (MemoryTab) to show the memory usage of each executor, and a
MemoryListener to handle the updated ExecutorMetrics and other events. The MemoryListener
class is as follows:

class MemoryListener extends SparkListener {
  override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate)
  override def onExecutorAdded(event: SparkListenerExecutorAdded)
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved)
  override def onBlockManagerRemoved(event: SparkListenerBlockManagerRemoved)
  override def onStageSubmitted(event: SparkListenerStageSubmitted)
  override def onStageCompleted(event: SparkListenerStageCompleted)
}

Events like SparkListenerExecutorAdded and SparkListenerStageSubmitted are handled in
MemoryListener because we want to show the ExecutorMetrics for each stage, for example the
maximum size of the Netty buffers of each executor during one stage. We also want to show all of
the removed executors along with the last ExecutorMetrics recorded right before each executor
was removed. Note that removing executors happens frequently with dynamic allocation.

Since we show the ExecutorMetrics for each stage, we also add an additional web page named
"StageMemoryPage" attached to the MemoryTab.
We support both live web presentation and history replaying of ExecutorMetrics. The data
processing for these two parts is similar but not identical.

3.3.1 Live presenting


Since the metrics are sample based, we can only give a precise value at a specific point in time.
The accuracy depends on the sampling frequency.

Take one executor as an example: Figure-1 shows the stages and the heartbeats of that executor.
[Figure-1. Stages and heartbeats: a timeline of one executor, with heartbeats HB1-HB3 arriving during Stage1, no heartbeat during Stage2, and HB4-HB6 during Stage3.]

Whenever we receive a Heartbeat (a SparkListenerExecutorMetricsUpdate event), we update both
the current ExecutorMetrics value and the maximum ExecutorMetrics since application start in the
MemoryTab. Once a stage completes, we go through all the Heartbeats received during that stage
and compute the maximum ExecutorMetrics for the stage. For example, for Stage1 in Figure-1, we
compare the ExecutorMetrics of its 3 heartbeats and choose the maximum one.

If no Heartbeat was received during the stage, such as Stage2 in Figure-1, then we choose the
latest heartbeat received so far, which is HB3 for Stage2.
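
For illustration, this live aggregation can be sketched for a single executor as follows, reusing the
illustrative TransportMetrics from section 3.1. The class ExecutorMemoryState and its methods
are hypothetical names for this sketch, not part of the actual patch; a real implementation would
also key the state by executor:

import scala.collection.mutable

// Hypothetical helper tracking one executor's samples (a sketch, not the
// actual MemoryListener implementation).
class ExecutorMemoryState {
  // Latest sample seen so far, updated on every heartbeat.
  private var latest: Option[TransportMetrics] = None
  // Samples received while each running stage is active, keyed by stage id.
  private val stageSamples = mutable.Map[Int, mutable.Buffer[TransportMetrics]]()

  def onHeartbeat(m: TransportMetrics): Unit = {
    latest = Some(m)
    // A sample counts toward every stage currently running.
    stageSamples.values.foreach(_ += m)
  }

  def onStageSubmitted(stageId: Int): Unit =
    stageSamples(stageId) = mutable.Buffer.empty

  // Peak usage for the completed stage; if no heartbeat arrived during the
  // stage (Stage2 in Figure-1), fall back to the latest sample (HB3).
  def onStageCompleted(stageId: Int): Option[TransportMetrics] = {
    val samples = stageSamples.remove(stageId).getOrElse(mutable.Buffer.empty)
    if (samples.nonEmpty) Some(samples.maxBy(m => m.onHeapSize + m.offHeapSize))
    else latest
  }
}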

3.3.2 History replaying


The Spark history server is event-log based, so we need to log the ExecutorMetrics information
into the log files. Only the heartbeat contains ExecutorMetrics, so we log
SparkListenerExecutorMetricsUpdate events. However, we cannot log all of them: there are too
many SparkListenerExecutorMetricsUpdate events, which would make the log file too large and
slow down log processing, causing performance issues. So we only log a subset of the
SparkListenerExecutorMetricsUpdate events. We log them right before logging the following
events:

SparkListenerStageCompleted
SparkListenerExecutorRemoved

After each such log entry, all maintained SparkListenerExecutorMetricsUpdate events are
discarded; that is, the state is refreshed.

Like other listeners, EventLoggingListener receives all SparkListenerExecutorMetricsUpdate
events, but the logged SparkListenerExecutorMetricsUpdate events are not simply the received
ones. Because we only log a few SparkListenerExecutorMetricsUpdate events, we need to
combine several events into one. This is simple when there is only one stage at a time: for
example, in Figure-1 we just combine HB1, HB2, and HB3 into one event CHB1, and HB4, HB5,
and HB6 into one event CHB2. To cover the case where a stage receives no Heartbeat at all, we
also log the latest SparkListenerExecutorMetricsUpdate event right after each combined one. So
for the case in Figure-1, we log the events as shown in Figure-2: in total 4 events are logged,
namely CHB1, HB3, CHB2, and HB6.
[Figure-2. Events logged for the case in Figure-1: CHB1 (HB1-HB3 combined) followed by HB3, then CHB2 (HB4-HB6 combined) followed by HB6.]

The case in Figure-1 is simple: only one stage runs at a time. It gets a little more complicated
when several stages run at the same time, as shown in Figure-3.

[Figure-3. Interleaved stages: heartbeats HB1-HB9 on a timeline where stages Stage1-Stage5 overlap instead of running one at a time.]

If the stages are interleaved, we cannot simply combine several events into one per stage. We still
log the events in the way discussed previously, i.e. right before each SparkListenerStageCompleted
event. So for the case in Figure-3 we split the timeline into 4 segments, making it equivalent to
Figure-4.

[Figure-4. Equivalent segments: the timeline of Figure-3 split at each stage-completion point into four segments T1-T4, where T1 covers HB1-HB2, T2 covers HB3-HB4, T3 covers HB5, and T4 covers HB6-HB9.]

Then we can process it the same way as the case in Figure-1, which makes the result of history
replaying match that of live presenting. In this case, we combine (HB1, HB2) into CHB1, (HB3,
HB4) into CHB2, HB5 alone into CHB3, and (HB6, HB7, HB8, HB9) into CHB4. The actual event
logging then looks like Figure-5.
[Figure-5. Events logged for the case in Figure-3: CHB1 and HB2 for T1, CHB2 and HB4 for T2, CHB3 and HB5 for T3, and CHB4 and HB9 for T4.]
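
For illustration, the combine-and-flush policy can be sketched as follows, again using the
illustrative TransportMetrics from section 3.1. The class name and the choice of taking the
per-field maximum as the combined value are assumptions of this sketch rather than details
specified above; flush() would be invoked right before logging a SparkListenerStageCompleted or
SparkListenerExecutorRemoved event:

import scala.collection.mutable

// Hypothetical combiner buffering heartbeats between flush points.
class MetricsEventCombiner {
  private val buffered = mutable.Buffer[TransportMetrics]()

  def onMetricsUpdate(m: TransportMetrics): Unit = buffered += m

  // Collapse the buffered heartbeats of the segment that just ended into
  // (combined "CHB" sample, latest raw sample), e.g. (CHB1, HB2) for T1 in
  // Figure-5, then discard the buffer so the state is refreshed.
  def flush(): Option[(TransportMetrics, TransportMetrics)] = {
    if (buffered.isEmpty) return None
    val combined = TransportMetrics(
      timeStamp = buffered.map(_.timeStamp).max,
      onHeapSize = buffered.map(_.onHeapSize).max,
      offHeapSize = buffered.map(_.offHeapSize).max)
    val latestRaw = buffered.last
    buffered.clear()
    Some((combined, latestRaw))
  }
}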

4. Limitations
There are several limitations to this design. First, the sample-based metrics system is not fully
accurate, and in the future it cannot be completely integrated with the current taskMetrics
collection, because taskMetrics is both sample based and event based. Second, since we only log
part of the SparkListenerExecutorMetricsUpdate events, the logic for combining events may
become coupled with the logic of how they are processed for live presenting. Third, the number
of logged SparkListenerExecutorMetricsUpdate events might still be very large: in general it is
about 2*S*E, where S is the number of stages and E is the number of executors (a combined
event plus a latest event, per stage, per executor). In the worst case, the number of logged
SparkListenerExecutorMetricsUpdate events might even be bigger than the actual number of
SparkListenerExecutorMetricsUpdate events received.
