Spark notes
Tasks are basically the threads that run within the Executor JVM of a Worker node to do the needed computation. A task is the smallest unit of execution, and it operates on a partition of our dataset. Given that Spark is an in-memory processing engine where all of the computation a task does happens in memory, it's important to understand Task Memory Management...
To understand this topic better, let's split Task Memory Management into two parts:
1. Execution Memory: memory used for computation in shuffles, joins, sorts and aggregations. If it fills up, data is spilled to disk.
2. Storage Memory: memory used for caching data and propagating internal data across the cluster.
Likewise, if the Storage memory gets filled, cached blocks are evicted via LRU.
Spilled execution data is always going to be read back from disk, whereas cached data may or may not be read back (users sometimes cache data aggressively, whether or not it is actually needed).
Still, one can't just blow away cached data in this case. So, for this use case, Spark allows the user to specify a minimal unevictable amount of storage, a.k.a. cached data. Note that this is not a reservation: you don't pre-allocate a chunk of storage for cached data such that execution cannot borrow from it. Rather, this value only comes into effect when there is cached data.
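In terms of configuration, this unevictable storage maps to spark.memory.storageFraction within the unified region sized by spark.memory.fraction; a minimal sketch (values shown are just the defaults):

import org.apache.spark.sql.SparkSession

// spark.memory.fraction: fraction of (heap - 300 MB) shared by execution and storage.
// spark.memory.storageFraction: portion of that shared region immune to eviction by execution,
// i.e. the minimal unevictable storage described above.
val spark = SparkSession.builder()
  .appName("memory-config-demo")
  .config("spark.memory.fraction", "0.6")
  .config("spark.memory.storageFraction", "0.5")
  .getOrCreate()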
As soon as another task comes in, task1 will have to spill to disk and free space for task2, for fairness. So, the number of slots is determined dynamically, depending on the number of actively running tasks.
Key advantage: One notable behaviour here is what happens to a straggler, i.e. the last remaining task. Stragglers are potentially expensive because everybody else is already done and this is the last task standing. This model allocates all the memory to the straggler, because the number of actively running tasks is one. This behaviour has been there since Spark 1.0 and has been working fine since then, so Spark hasn't found a reason to change it.
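As a rough worked example (assuming the usual policy where each of N active tasks can claim between 1/(2N) and 1/N of the execution memory M):

// With M bytes of execution memory and N active tasks, each task can claim
// between M / (2 * N) and M / N of it. For example, with M = 4 GB:
val M = 4L * 1024 * 1024 * 1024
for (n <- Seq(1, 2, 4)) {
  println(s"N=$n: min share = ${M / (2 * n)} bytes, max share = ${M / n} bytes")
}
// N = 1 is the straggler case: the single remaining task may use all 4 GB.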
Say we have a stream of objects that we would like to filter based on some reference data. We would typically think of broadcasting the reference data to give every executor its own local cached copy. But then, how do we handle periodic updates to it? This is where the thought of having an updatable broadcast, or re-broadcasting, gets instilled in the user's mind.
Dealing with streaming applications that need a way to weave (filter, map etc.) the streaming data with changing reference data (from a DB, files etc.) has become a relatively common use case. I believe this is more than just a relatively common use case in the world of Machine Learning applications or Active Learning systems. Let me illustrate the situations that will help us understand this necessity:
Spark 2.x - 2nd generation Tungsten Engine
Spark 2.x had an aggressive goal of getting orders-of-magnitude faster performance. For such an aggressive goal, traditional techniques like using a profiler to identify hotspots and shaving those hotspots are not going to help much. Hence came the 2nd generation Tungsten Engine, with the following two goals (focusing on changes in Spark's execution engine):
It's a very straightforward query: scan the entire sales table and output the items where item_id = 512. The right-hand side shows Spark's query plan for the same. Each of the stages shown in the query plan is an operator which performs a specific operation on the data, like Filter, Count, Scan etc.
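In Spark code, the query under discussion is roughly the following sketch (assuming a SparkSession named spark and the table/column names from the text above):

import spark.implicits._   // for the $"col" syntax

val result = spark.table("sales").filter($"item_id" === 512)
// equivalently: spark.sql("SELECT * FROM sales WHERE item_id = 512")
result.explain()   // prints the physical plan with its Scan and Filter operators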
There's a standard row format exchanged between all the operators, and this means writes to main memory: each operator potentially reads a row in and sends a new row to its parent. This model suffers from the cost of writing intermediate rows to main memory.
Left-hand side: shows how the query plan looks in the volcano iterator model. There are 9 virtual function calls with 8 intermediate results.
Right-hand side: shows how whole-stage code generation happens for this case. It has only 2 stages:
- The first stage reads, filters and projects input2.
- The second stage starts with reading and filtering input1, joins it with input2 and generates the final aggregated result.
- Here, we reduced the number of function calls to 2 and the number of intermediate results to 1.
- Each of these 2 stages (or boxes) is going to be converted into a single Java function.
- There are different rules as to how we split up those pipelines depending on the use case; we can't possibly fuse everything into one single function.
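To make the fusion concrete, here is a hand-written sketch (with a hypothetical SalesRow type) of what such a generated function conceptually collapses to for the earlier scan-and-filter query:

// Hypothetical row type, just for illustration.
case class SalesRow(itemId: Int, amount: Double)

// Hand-written sketch of what a fused "scan sales -> filter item_id = 512" stage conceptually
// boils down to: one tight loop with the predicate inlined, no virtual function calls between
// operators and no intermediate rows written to main memory.
def fusedScanFilter(partition: Iterator[SalesRow]): Iterator[SalesRow] = {
  val out = scala.collection.mutable.ArrayBuffer[SalesRow]()
  while (partition.hasNext) {
    val row = partition.next()
    if (row.itemId == 512) out += row   // Filter fused directly into the scan loop
  }
  out.iterator
}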
Observation:
Whole-stage code generation works particularly well when the operations we want to do are simple. But there are cases where it is infeasible to generate code that fuses the entire query into a single function, like the ones listed below:
Complicated I/O:
- Complicated parsing like CSV or Parquet.
- We can't have the pipeline extend over physical machines (network I/O).
External integrations:
- With third-party components like Python, TensorFlow etc., we can't integrate their code into our generated code.
- Reading cached data.
What is Vectorization?
As main memory grew, query performance became more and more determined by the raw CPU cost of query processing. That's where vector operations evolved: they allow in-core parallelism for operations on arrays (vectors) of data via specialised instructions, vector registers and more FPUs per core.
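As a toy illustration (this is not Spark's actual ColumnarBatch API, just a sketch of the idea), compare a row-at-a-time layout with a columnar batch that a tight, SIMD-friendly loop can process:

// Row layout: one object per tuple.
case class SaleRow(itemId: Int, price: Double)

// Columnar layout: one array per column, which tight loops (and the JIT's auto-vectorization,
// i.e. SIMD instructions) can chew through efficiently.
class SalesBatch(val itemIds: Array[Int], val prices: Array[Double])

def totalForItem(batch: SalesBatch, targetId: Int): Double = {
  var total = 0.0
  var i = 0
  while (i < batch.itemIds.length) {
    if (batch.itemIds(i) == targetId) total += batch.prices(i)
    i += 1
  }
  total
}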
So, we saw how SIMD instructions perform vector operations. Are there any other ways to optimise processing? Yes, there are other kinds of processing techniques like loop unrolling, pipeline scheduling etc. (Further details on these are discussed in the APPENDIX section at the end of this blog.)
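For instance, a minimal sketch of loop unrolling (plain Scala, purely illustrative):

// Scalar loop: one addition and one loop-condition check per element.
def sumScalar(a: Array[Int]): Int = {
  var s = 0
  var i = 0
  while (i < a.length) { s += a(i); i += 1 }
  s
}

// Manually unrolled by 4: fewer branches per element and four independent accumulators,
// giving the CPU pipeline more work it can schedule in parallel.
def sumUnrolled(a: Array[Int]): Int = {
  var s0 = 0; var s1 = 0; var s2 = 0; var s3 = 0
  var i = 0
  while (i + 3 < a.length) {
    s0 += a(i); s1 += a(i + 1); s2 += a(i + 2); s3 += a(i + 3)
    i += 4
  }
  var s = s0 + s1 + s2 + s3
  while (i < a.length) { s += a(i); i += 1 }
  s
}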
Pipeline with CPU stall: Consider the same instruction set used in the above example. What if the second instruction fetch incurs a cache miss, requiring the data to be fetched from memory? This results in quite some CPU stalling/idling, as shown in the figure below.
Performance benchmarking:
Whole-Stage Code Generation (WSCG) benchmarking:
Summary:
We've explored the following in this article:
Example 1: Consider the task of training a k-means model given a set of data points. After each iteration, one would want to:
- Cache the cluster centroids and keep refining them with each iteration.
Example 2: Similarly, consider another example, phrase mining, which aims at extracting quality phrases from a text corpus. A streaming application doing phrase mining would want to:
- Cache the mined phrases and keep refining them as new text arrives.
The reference data, be it the cluster centroids or the mined phrases, would in both tasks need to be cached and updated periodically.
Most importantly, the reference data that one is learning/mining is very small.
For the cases discussed above, one would think that we want a way to broadcast our periodically changing reference data. But, given that such cases have very small reference data, is it really necessary to have a local copy per executor? Let's look at alternative perspectives from which one can think about handling such cases.
Before going further, please note that a broadcast variable is effectively read-only: once broadcast, its value cannot be updated in place across the cluster. So, stop thinking about, or searching for, a solution to update it.
Now, hopefully, you are on the same page as me about not trying to modify a broadcast variable. Let's explore the right approach to handle such cases. Enough talk; let's see the code to demo the same.
Demo:
Consider a phrase-mining streaming application. We want to cache the mined phrases and keep refining them periodically. How do we do this at large scale?
Common mistake:
val sentences = spark.readStream
  .format("text")
  .load("/tmp/gensim-input").as[String]
The phrasesBc broadcast value is not a shared variable across the executors running on different nodes of the cluster.
One has to be mindful of the fact that phrasesBc is a local copy, one in each of the cluster's worker nodes where executors are running.
Therefore, changes made to phrasesBc by one executor are local to it and are not visible to the other executors.
Our streaming input data is split into multiple partitions and passed over to multiple executor nodes. The right approach is to:
- Mine the <phrase, phrase-count> info as RDDs or DataFrames locally for each partitioned block (i.e., locally within that executor).
- Combine the per-partition mined phrases using aggregateByKey or combineByKey to sum up the phrase counts.
Code (sketch: minePhrases is the application-specific mining step, and in Structured Streaming the driver-side merge below runs once per micro-batch, e.g. inside foreachBatch):
import spark.implicits._
import org.apache.spark.sql.functions.sum

val sentences = spark.readStream
  .format("text")
  .load("/tmp/gensim-input").as[String]
// Mine <phrase, count> pairs locally within each partition.
val partitionCorpusDf = sentences.mapPartitions { sentencesInPartitionIter =>
  val localCorpus = scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)
  while (sentencesInPartitionIter.hasNext) {
    minePhrases(sentencesInPartitionIter.next()).foreach(p => localCorpus(p) += 1L)
  }
  localCorpus.iterator
}.toDF("key", "value")
// Aggregate the per-partition counts and merge them into the driver-side globalCorpus.
partitionCorpusDf.groupBy($"key")
  .agg(sum($"value"))
  .collect()
  .foreach { x =>
    val (word, localCount) = (x.getString(0), x.getLong(1))
    globalCorpus.put(word, globalCorpus.getOrElse(word, 0L) + localCount)
  }
The driver machine is the only place where we maintain a cumulative corpus of the phrases learnt. At the same time, this approach doesn't overload the driver with one update per record.
Essentially, every time we receive a new batch of input data points, the reference data (i.e., the phrases) gets updated in only one place, the driver node, while the job of mining phrases is still computed in a distributed way.
FAQ
Won't collecting the reference data at the driver blow up its memory? Note that, in these use cases, the reference data being collected at the driver is a small cache, and each batch only collects a small set of new data to add to it. Just make sure that the driver machine has enough memory to hold the reference data. That's it!
Why keep globalCorpus in the driver? Can't one maintain it using an in-memory alternative like Redis? Yes! One can use an in-memory cache for this purpose.
But be mindful that, with Spark's distributed computing, every partition run by every executor will try to update its count in that cache simultaneously. So, if one wants to take this route, be prepared to take on the additional responsibility of lock-based or synchronised updates to the cache.
Another option is accumulators: the onus is on the programmer to add support for new types by subclassing AccumulatorV2, as shown here.
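As a minimal sketch (class and names are illustrative), such an accumulator for merging per-partition phrase counts might look like:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Merges per-partition <phrase, count> maps on the driver.
class PhraseCountAccumulator extends AccumulatorV2[(String, Long), mutable.Map[String, Long]] {
  private val counts = mutable.Map[String, Long]().withDefaultValue(0L)

  override def isZero: Boolean = counts.isEmpty
  override def copy(): PhraseCountAccumulator = {
    val acc = new PhraseCountAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(v: (String, Long)): Unit = counts(v._1) += v._2
  override def merge(other: AccumulatorV2[(String, Long), mutable.Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) += v }
  override def value: mutable.Map[String, Long] = counts
}

// Usage: register once on the driver, call add() inside tasks, read value on the driver.
// val acc = new PhraseCountAccumulator; spark.sparkContext.register(acc, "phraseCounts")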
Can one write the reference data directly to a file instead of collecting it into globalCorpus, as shown in the code below?
partitionCorpusDf.groupBy($"key")
  .agg(sum($"value"))
  .foreach(x => {
    // write x to file
  })
The above code writes the aggregated counts to a file. But note that it writes the aggregated counts per batch. So, if we receive the phrase "apache spark" once in batch1 and once more in batch2, this approach will write the <"apache spark", 1> entry to the file twice. In other words, it is agnostic of the counts aggregated in earlier batches.
So, there is still a need to merge the batch-wise counts into the global corpus.
Save the globalCorpus to a file or a database using a shutdown hook right before the streaming application shuts down.
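A minimal sketch of such a shutdown hook (the output path is illustrative, and globalCorpus is the driver-side map from the code above):

// Persist the driver-side globalCorpus to a file when the application shuts down.
sys.addShutdownHook {
  val writer = new java.io.PrintWriter("/tmp/global-corpus.txt")
  try globalCorpus.foreach { case (phrase, count) => writer.println(s"$phrase\t$count") }
  finally writer.close()
}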
Insights into the troubles of using a filesystem (S3/HDFS) as a data source in Spark...
I was experimenting to compare using a filesystem (like S3 or HDFS) versus a queue (like Kafka or Kinesis) as Spark's data source, i.e.,
// 1. filesystem (s3/hdfs) as datasource
spark.read.format("text").load("s3://<bucket>/<path>")
// 2. kinesis as datasource
spark.read.format("kinesis")
As I started analysing S3 as a source, at first it all looked dandy: ease of use, reliable uptime, easy maintenance etc. On top of that, its checkpointing system also seemed fine. Checkpointing essentially keeps track of the files that have been fully or partly processed. Spark does this by maintaining a list of the files it has processed, along with the offsets of files that were only partially processed. So, the next time our Spark application kicks off, it will not reprocess all the files present in the S3 bucket. Instead, Spark will pick up the last partially processed file, according to the offsets saved in the checkpoint, and continue from there.
To figure out what to process in each batch, Spark then does the following:
1. Compute all the files in the S3 bucket: Spark calls the LIST API to list all the files in S3.
2. Compute the files processed in previous runs: it then pulls the list of processed files from the checkpoint.
3. Compute the files to process next = Subtract(AllFiles - ProcessedFiles).
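Conceptually, that per-batch computation amounts to the following sketch (listS3Bucket and readCheckpoint are illustrative stand-ins, not real Spark APIs):

// Each micro-batch, the file source effectively does the following:
val allFiles       = listS3Bucket("s3://<bucket>/<path>")          // expensive LIST calls on S3
val processedFiles = readCheckpoint("<checkpoint-dir>/sources/0")  // files recorded in earlier batches
val filesToProcess = allFiles.filterNot(processedFiles.contains)   // AllFiles - ProcessedFiles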
The S3 LIST API used to get all the files in a bucket is very expensive. From what I observed when this application was deployed on a 3-node Amazon EMR cluster, it is slow for the following reasons:
1. High latency: large S3 buckets need to be listed on every batch, which is slow and resource intensive.
2. Higher costs: the LIST API requests made to S3 are costly and involve network I/O.
3. Eventual consistency: there is always a lag in listing the files written to S3 because of its eventual-consistency policy.
Solutions:
Before we jump into solutions, note that S3 is not a real file system; that is, the semantics of the S3 file system are not those of a POSIX file system. So, the S3 file system may not behave entirely as expected. This is why it has limitations such as eventual consistency in listing the files written to it.
Idea:
The core idea of the solutions mentioned below is to avoid S3's LIST API when computing the next batch of records to process. For this, one establishes an alternative, secondary source to track the files written to the S3 bucket.
Netflix's solution:
Netflix addressed this issue with a tool named S3mper. It is an open-source library that provides an additional layer of consistency checking on top of S3 through the use of a consistent secondary index. Essentially, any files written to S3 are tracked in the secondary index, which makes listing the files written to S3 consistent and fast. The default implementation of the secondary index that they provide is backed by DynamoDB.
Databricks' solution:
Databricks has also implemented a solution to address the same problem, but it is unfortunately not open-sourced. Instead of maintaining a secondary index, they use an SQS queue as the secondary source to track the files in the S3 bucket. Every time a new file is written to the S3 bucket, its full path is added to the SQS queue. This way, to find the next files to process, our Spark application just needs to poll SQS. Files that have been processed are removed from the queue, because polling removes those entries from SQS. Therefore, we no longer need to call the LIST API on S3 to find the next batch of records.
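A hypothetical sketch of this approach using the AWS SDK (queue URL and message format are illustrative; real S3 event notifications wrap the object key in a JSON body):

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import scala.collection.JavaConverters._

val sqs = AmazonSQSClientBuilder.defaultClient()
val queueUrl = "https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>"

val messages = sqs.receiveMessage(queueUrl).getMessages.asScala
val newFiles = messages.map(_.getBody)   // assume each message body carries the full S3 path

if (newFiles.nonEmpty) {
  // Process only the newly arrived files; no S3 LIST call needed.
  val batchDf = spark.read.format("text").load(newFiles: _*)
  // ... process batchDf ...
  messages.foreach(m => sqs.deleteMessage(queueUrl, m.getReceiptHandle))   // mark as processed
}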