Spark notes

Task Memory Management

Tasks are basically the threads that run within the Executor JVM of a
worker node to do the needed computation. A task is the smallest unit of
execution and operates on a single partition of our dataset. Given that
Spark is an in-memory processing engine where all of the computation a
task does happens in memory, it's important to understand Task Memory
Management...

To understand this topic better, we'll split Task Memory Management into 3
parts:

1. What are the memory needs of a task?
2. Memory Management within a Task - How does Spark arbitrate
memory within a task?
3. Memory Management across the Tasks - How is memory
shared among different tasks running on the same worker node?

1. What are the memory needs of a task?


Every task needs 2 kinds of memory:

1. Execution Memory:

 Execution Memory is the memory used to buffer intermediate
results.
 As soon as we are done with the operation, we can go ahead and
release it. It's short-lived.
 For example, a task performing a Sort operation would need some
sort of collection to store the intermediate sorted values.

2. Storage Memory:

 Storage Memory is more about reusing the data for future
computation.
 This is where we store cached data, and it's long-lived.
 Until the allotted storage gets filled, Storage Memory stays in place.
 LRU eviction is used to spill the storage data when it gets filled.

The following picture illustrates this with an example task of "Sorting a
collection of Ints".
Now that we've seen the memory needs of a task, let's understand how
Spark manages them.

2. Memory Management within a Task


How does Spark arbitrate between Execution Memory (EM) and
Storage Memory (SM) within a task?

Simplest Solution – Static Assignment

 Static Assignment - This approach basically splits the total available
on-heap memory (the size of your JVM heap) into 2 parts, one for
Execution Memory and the other for Storage Memory.
 As the name says, this memory split is static and doesn't change
dynamically.
 This has been the solution since Spark 1.0.
 While running our task, if the execution memory gets filled, it'll get
spilled to disk as shown below:
 Likewise, if the Storage Memory gets filled, it's evicted via LRU
(Least Recently Used).

Disadvantage: Because of the hard split of memory between Execution
and Storage, even if the task doesn't need any Storage Memory,
Execution Memory can still use only its own chunk of the total available
free memory.
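(For reference, in Spark 1.x this static split was controlled by configuration. A minimal sketch; the fractions shown are, to the best of my knowledge, the legacy defaults:)

// Legacy (pre-unified) static split, set when the application is submitted
val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.6")  // share of heap reserved for Storage Memory
  .set("spark.shuffle.memoryFraction", "0.2")  // share of heap reserved for Execution (shuffle) Memory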

How to fix this?

UNIFIED MEMORY MANAGEMENT - This is how Unified Memory Management
works:

 Express execution and storage memory as one single unified region.
 So, there's no hard split of memory in this approach.
 Execution and Storage share the region with this
agreement: keep acquiring execution memory and evict storage as you
need more execution memory.

The following picture depicts Unified Memory Management.

But why evict storage rather than execution memory?

Spilled execution data is always going to be read back from disk, whereas
cached data may or may not be read back. (Users sometimes cache data
aggressively, whether or not it is actually needed.)

What if the application relies on caching, like a Machine Learning
application?

We can't just blow away cached data like that in this case. So, for this
use case, Spark allows the user to specify a minimal unevictable amount of
storage, a.k.a. cache data. Notice this is not a reservation, meaning we
don't pre-allocate a chunk of storage for cache data such that execution
cannot borrow from it. Rather, this value comes into effect only when
there is cached data.
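(For reference, these knobs are exposed through Spark's unified memory configuration; a minimal sketch, with the values below being the documented defaults as I understand them:)

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.6")         // fraction of (heap - 300 MB) shared by execution and storage
  .set("spark.memory.storageFraction", "0.5")  // portion of that region protected from eviction,
                                               // i.e. the "minimal unevictable" storage discussed above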

3. Memory Management across the Tasks


How is memory shared among different tasks running on the
same worker node?

Ans: Static Assignment (again!!) - No matter how many tasks are
currently running, if the worker machine has 4 cores, we'll have 4 fixed
memory slots.
Drawback: Even if there's only 1 task running, it's going to get only one
quarter of the total memory.

Better Solution – Dynamic Assignment (again!!)

The more efficient alternative is dynamic assignment, where how much memory
a task gets depends on the total number of tasks running. If there is only
one task running, it is free to acquire all the available memory.

As soon as another task comes in, task 1 will have to spill to disk and free
space for task 2, for fairness. So, the number of slots is determined
dynamically based on the number of actively running tasks.
Key Advantage: One notable behaviour here is what happens to a
straggler, i.e., the last remaining task. Straggler tasks are
potentially expensive because everybody else is already done and this is
the last task holding things up. This model allocates all the memory to the
straggler, because the number of actively running tasks is one. This scheme
has been there since Spark 1.0 and has been working fine since then, so Spark
hasn't found a reason to change it.
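(As a rough illustration of the fair-sharing rule described in Spark's unified memory management design, where each of N active tasks can claim between 1/(2N) and 1/N of the execution pool before it has to spill; the function and numbers below are purely illustrative:)

def taskShareBounds(poolBytes: Long, activeTasks: Int): (Long, Long) =
  (poolBytes / (2L * activeTasks), poolBytes / activeTasks)

// e.g. a 4 GB execution pool:
//   1 active task  -> it may hold between 2 GB and the full 4 GB
//   4 active tasks -> each may hold between 512 MB and 1 GB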

Did you ever think of updating or re-broadcasting a broadcast variable?

Why would you need this?

You have a stream of objects that you would like to filter based on some reference data.

This reference data will keep changing periodically.

You would typically think of broadcasting the reference data to give every executor its own local
cached copy. But then, how do you handle periodic updates to it? This is where the thought
of having an updatable broadcast, or re-broadcasting, gets instilled in the user's mind.

Dealing with streaming applications which need a way to weave (filter, map etc.) the streaming
data with changing reference data (from a DB, files etc.) has become a relatively common use case.

Is this requirement only a relatively common use case?

I believe it is more than just a relatively common use case in the world of Machine
Learning applications or Active Learning systems. Let me illustrate situations that will help us
understand this necessity:
Spark 2.x - 2nd generation Tungsten Engine
Spark 2.x had an aggressive goal of getting orders-of-magnitude faster
performance. For such an aggressive goal, traditional techniques like
using a profiler to identify hotspots and shaving those hotspots are not
going to help much. Hence came the 2nd generation Tungsten Engine, with the
following two goals (focusing on changes in Spark's execution engine):

1. Optimise the query plan - solved via Whole-Stage Code Generation
2. Speed up query execution - solved via supporting vectorized in-memory
columnar data

Goal 1 - Whole-Stage Code Generation - Optimise the query plan:

To understand what optimising the query plan means, let's take a user query
and understand how Spark generates a query plan for it:
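(The figure with the original query and its plan isn't reproduced here. Going by the description below, the query was presumably something along these lines; "sales" and "item_id" are the assumed table and column names from that description:)

spark.sql("SELECT count(*) FROM sales WHERE item_id = 512")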

It's a very straightforward query: basically, scan the entire sales table and
output the items where item_id = 512. The right-hand side shows Spark's
query plan for it. Each of the stages shown in the query plan is an
operator which performs a specific operation on the data, like Filter, Count,
Scan etc.

How does Spark 1.x evaluate this query plan?


Ans: Volcano Iterator Model
Spark SQL uses the traditional database technique called the Volcano
Iterator Model. This is a standard technique adopted in the majority of
database systems for over 30 years. As the name suggests [Iterator Model],
all the operators like filter, project, scan etc. implement a common iterator
interface, and they all generate output in a common standard output
format. The query plan shown on the right side of the figure above is
basically nothing but a list of operators chained together, which are
processed like this:

 Read the output generated by the parent operator
 Do some processing
 Produce the return value in a standard output row format
 Hand over the return value to the next child operator
 Who is the child/parent of what is known only at runtime.
 Every handshake between 2 operators costs one virtual function call + a
read of the parent's output from memory + a write of the final output to memory.
 In the example query plan shown on the right side of the above figure,
Scan is the parent of the chain. Scan reads input data one by one, writes
its output to main memory and hands it over to the next child, which is
the Filter operator, and so on.

Downsides of Volcano Iterator Model:


Too many virtual function calls

 We don't know where the child is coming from.
 It's all dynamic dispatch between parent and child operators at runtime.
 Each operator is agnostic to the operator below it.

Extensive memory access

 There's a standard row format exchanged between all the operators, and
this means writes to main memory. Potentially, you read a row in and send
a new row to your parent. This suffers from the cost of writing
intermediate rows to main memory.

Unable to leverage a lot of modern techniques like pipelining,
prefetching, branch prediction, SIMD, loop unrolling etc.

Conclusion: With the Volcano Iterator Model, it's difficult to get
orders-of-magnitude performance speedups using traditional profiling
techniques.

Instead, let's look bottom-up.

What does look bottom-up mean?


A college freshman would implement the same query using a for loop and
an if condition, like the one shown below:
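(The original hand-written code figure isn't reproduced here. A minimal Scala sketch of such code, assuming the query counts the sales rows where item_id = 512 and that salesRows is the already-scanned table held as a simple in-memory collection:)

var count = 0L
for (row <- salesRows) {     // the Scan: iterate over the sales table
  if (row.itemId == 512) {   // the Filter: item_id = 512
    count += 1               // the Count
  }
}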

Volcano model vs College freshman Code:


There's a ~10x speed difference between these two models.

Why is the difference so huge?


The college freshman's hand-written code is very simple. It does exactly the
work it needs to do. No virtual function calls. Data stays in CPU registers and
we are able to maximise the benefits of the compiler and the hardware. The key
thing is: hand-written code takes advantage of all the information that
is known. It's designed specifically to run that query and nothing else,
whereas the Volcano model is a more generic one.

The key IDEA is to come up with an execution engine which:
has the functionality of a general-purpose execution engine like the Volcano
model, and performs just like a hand-built system that does exactly what
the user wants to do.

Okay! How do we get that?


Answer: Whole-Stage Code Generation

 This is a new technique now popular in DB literature.
 Basically, fuse the operators in the query plan together so that the generated
code looks like hand-optimised code, as shown in the below
picture:

What does this mean?


 Identify chains of operators, a.k.a. stages.
 Instead of having each operator as an individual function, combine and
compile each of those stages into a single function.
 At runtime, generate the bytecode that needs to be run (see the sketch
right after this list for a quick way to inspect it).
Let’s take another example..


Join with some filters and aggregation.

 Left-hand side: shows how the query plan looks in the Volcano Iterator
Model. There are 9 virtual function calls with 8 intermediate results.
 Right-hand side: shows how whole-stage code generation happens for
this case. It has only 2 stages.
o The first stage reads, filters and projects input2.
o The second stage starts with reading and filtering input1, joins it with
input2 and generates the final aggregated result.
o Here, we reduced the number of function calls to 2 and the number of
intermediate results to 1.
o Each of these 2 stages (or boxes) is going to be converted into a
single Java function.
o There are different rules as to how we split up these pipelines
depending on the use case. We can't possibly fuse everything into
one single function.

Observation:
Whole-stage code generation works particularly well when the operations
we want to do are simple. But there are cases where it is infeasible to
generate code that fuses the entire query into a single function, like the ones
listed below:

 Complicated I/O:
o Complicated parsing like CSV or Parquet.
o We can't have the pipeline extend over physical machines
(network I/O).
 External integrations:
o With third-party components like Python, TensorFlow etc., we can't
integrate their code into our code.
o Reading cached data.

Is there anything we can do for the above-mentioned cases
which can't be fused together in whole-stage
code generation?
Indeed, yes!!
Goal 2 - Speed up query execution via supporting
vectorized in-memory columnar data:
Let's start with the output of Goal 1 (Whole-Stage Code Generation).

Goal 1 output - What did Whole-Stage Code Generation
(WSCG) give us?
WSCG generates an optimized query plan for the user:

What extension can we add on top of this?

This is where Goal 2 comes into the picture - Speed up query execution
How can we speed up?


Vectorization

What is Vectorization?
As main memory grew, query performance became more and more determined
by the raw CPU cost of query processing. That's where vector operations
evolved, allowing in-core parallelism for operations on arrays (vectors) of
data via specialised instructions, vector registers and more FPUs per
core.

To better exploit in-core parallelism, Spark made two changes:

 Vectorization: The idea is to take advantage of Data Level Parallelism (DLP)
within an algorithm via vector processing, i.e., processing batches of rows
together.
 Shift from row-based to column-based storage format: We'll discuss
what triggered this shift below.

Vectorization: Goal of Vectorization

Parallelise computations over vector arrays, a.k.a. adopt vector
processing.

Vectorization: What is Vector Processing?


Vector processing is basically a single instruction operating on one-
dimensional arrays of data called vectors, in contrast to scalar processing,
where each instruction operates on a single data item, as shown in the below
picture:

Vectorization: How did Spark adopt Vector Processing:
1. Spark 1.x's Volcano Iterator Model performs scalar
processing: We've seen earlier that in Spark 1.x, using the
Volcano Iterator Model, all the operators like filter, project, scan etc.
were implemented via a common iterator interface where we fetch
one tuple per iteration and process it. That is essentially scalar
processing.

2. Spark 2.x moved to vector processing: The traditional Volcano
Iterator Model implementation of operators has been tweaked to
operate on vectors, i.e., instead of one-at-a-time, Spark changed these
operator implementations to fetch a vector array (a batch of tuples)
per iteration and make use of vector registers to process all of
them in one go.

3. OK, wait... Vector register? What is it!? Typically, each vector
register can hold up to 4 words of data, i.e., four 32-bit
floats OR four 32-bit integers OR eight 16-bit integers OR sixteen 8-bit
integers.

4. How are these vector registers used?

 SIMD (Single Instruction Multiple Data) instructions operate on vector
registers.
 One single SIMD instruction can process eight 16-bit integers at a time,
thereby achieving DLP (Data Level Parallelism). The following picture
illustrates computing a Min() operation on 8 tuples in ONE go, compared to
EIGHT scalar instructions iterating over the 8 tuples:
5. So, we saw how SIMD instructions perform vector operations. Are
there any other ways to optimize processing? Yes, there are other
processing techniques like loop unrolling, pipeline scheduling
etc. (Further details on these are discussed in the APPENDIX section at
the end of this blog. A small sketch contrasting scalar and batch processing
follows this list.)
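(A purely illustrative Scala sketch of the difference between tuple-at-a-time and batch-at-a-time processing. Real SIMD happens at the CPU-instruction level, but tight loops over primitive arrays, like the second function, are the shape that hardware and the JIT can vectorize:)

def scalarMin(values: Iterator[Int]): Int = {
  var min = Int.MaxValue
  while (values.hasNext) min = math.min(min, values.next())          // one tuple per iteration
  min
}

def batchMin(batch: Array[Int]): Int = {
  var min = Int.MaxValue
  var i = 0
  while (i < batch.length) { min = math.min(min, batch(i)); i += 1 } // whole vector in one tight loop
  min
}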

Now that we've seen what vectorization is, it's important to
understand how to make the most of it. This is where we
reason about why Spark shifted from a row-based to a columnar format.
Row-based to Column-based storage format:


[Note: Feel free to skip this section if you want to jump to the performance
section and find out the benchmark results.]

What is critical to achieving the best efficiency when
adopting vector operations?
Data availability - All the data needed to execute an instruction should
be readily available in cache. Otherwise, it leads to CPU stalling (or CPU idling).

How is data availability critical for execution speed?

To illustrate this better, let's look at two pipelines:

1. One without any CPU stalls, and
2. The other with a CPU stall (CPU idling)

The following four stages of an instruction cycle appear in the pipelining
examples shown below:

 F Fetch: read the instruction from the memory.


 D Decode: decode the instruction and fetch the source operand(s).
 E Execute: perform the operation specified by the instruction.
 W Write: store the result in the destination location.

Pipeline without any CPU stall: The following picture depicts an ideal
pipeline of 4 instructions where everything falls beautifully into place
and the CPU is never idle:

Pipeline with CPU stall: Consider the same instruction sequence used in the
above example. What if the second instruction's operand fetch incurs a cache
miss, requiring the data to be fetched from main memory? This results in quite
some CPU stalling/idling, as shown in the figure below.

The above example clearly illustrates how critical data availability is to
the performance of instruction execution.

Goal 2 Action Plan: Support vectorized in-memory
columnar data
 So, we've seen that caching works best when the data you are about to
process is laid out next to the data you are processing now.
 This works against row-based storage, because it keeps all the data of a
row together regardless of whether the current processing stage needs only a
small subset of that row. So, the CPU is forced to keep the un-needed parts of
the row in cache too, just to get the parts it needs.
 Columnar data, on the other hand, plays nicely because, in general, each
stage of processing only needs a few columns at a time, and hence
columnar storage is more cache friendly. One could possibly get an order-of-
magnitude speed-up by adopting columnar storage while performing
vector operations. (A toy sketch of the two layouts follows this list.)
 For this and many more advantages listed in this blog, Spark moved from
a row-based storage format to supporting columnar in-memory data.
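(A purely illustrative Scala sketch of the two layouts; the Sale fields are made up:)

// Row-based layout: every element keeps all fields of one record together,
// so a stage that only needs price still drags itemId and quantity into cache.
case class Sale(itemId: Int, price: Double, quantity: Int)
val rowStore: Array[Sale] = Array(Sale(512, 9.99, 2), Sale(513, 4.50, 1))

// Column-based layout: one contiguous primitive array per column,
// so scanning prices touches only the bytes this stage actually needs.
case class SalesColumns(itemIds: Array[Int], prices: Array[Double], quantities: Array[Int])
val columnStore = SalesColumns(Array(512, 513), Array(9.99, 4.50), Array(2, 1))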

Performance benchmarking:
 Whole-Stage Code Generation (WSCG) benchmarking:

o A join and aggregation over 1 billion records on a single 2013 MacBook
Pro finished in less than 1 sec.
o Please refer to the Databricks notebook where they carried out this
experiment of joining 1 billion records to evaluate the performance of
WSCG.

 Vectorized in-memory columnar support:

o Let's benchmark Spark 1.x columnar data (vs) Spark 2.x
vectorized columnar data.
o For this, Parquet, the most popular columnar format in the
Hadoop stack, was considered.
o Parquet scan performance in Spark 1.6 ran at a rate of about
11 million rows/sec.
o The vectorized Parquet reader in Spark 2.x ran at about 90 million rows/sec,
roughly 9x faster.
o The vectorized Parquet path basically scans the data directly and
materialises it in a vectorized way.
o This is promising and clearly shows that this is the right thing to
do!!
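(For what it's worth, the vectorized Parquet reader is controlled by a Spark SQL flag and is enabled by default in Spark 2.x; a minimal sketch for toggling it when reproducing such a comparison:)

// set to "false" to fall back to the non-vectorized, row-at-a-time Parquet reader
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")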

Summary:
We've explored the following in this article:

 How the Volcano Iterator Model in Spark 1.x interprets and executes a given
query.
 Downsides of the Volcano Iterator Model, like the number of virtual function
calls and the excessive memory reads and writes for intermediate results.
 Compared it with hand-written code and noticed an easy 10x speedup!! Voila!!
 Hence came Whole-Stage Code Generation!
 But whole-stage code generation cannot be done for complex operations.
To handle those cases faster, Spark came up with Vectorization to better
leverage the techniques of modern CPUs and hardware.
 Vectorization speeds up processing by batching multiple rows
together and running them through SIMD instructions using vector
registers.
 Conclusion: Whole-stage code generation does a decent job of
combining the functionality of a general-purpose execution engine with the
performance of hand-written code. Vectorization is a good alternative for the
cases that whole-stage code generation cannot handle.
 Downside of Vectorization: As we discussed for the Volcano Iterator Model,
all the intermediate results are written to main memory. Because of
this extensive memory access, wherever possible, Spark
does whole-stage code generation first.

Example 1: Consider the task of training a k-means model given a set of data points. After each iteration,
we would want to:

Cache the cluster centroids

Be able to update these cached centroids after each iteration

Example 2: Similarly, consider another example of phrase mining, which aims at extracting quality
phrases from a text corpus. A streaming application that is trying to do phrase mining would want to:

Cache the <mined-phrase, term-frequency> pairs across the worker nodes.

Be able to update this cache as more phrases are mined.

What is common in both these cases?

The reference data, be it the cluster centroids or the phrases mined, in both tasks would need us
to:

Broadcast it to have a local cached copy per executor, and

Iteratively keep refining this broadcasted cache.

Most importantly, the reference data we are learning/mining is very small.

For the cases discussed above, we would think that we want a way to broadcast our periodically
changing reference data. But, given that such cases have very small reference data, is it really
necessary to have a local copy per executor? Let's see alternative perspectives from which we can think
about handling such cases.

Why should we not look for workarounds to update a broadcast variable?

Before going further into alternative perspectives, please note that a broadcast variable is meant to
be read-only; its value is immutable once broadcast. So, stop thinking about or searching for a solution to update it in place.

Demo time: The right perspective/approach to handle it:

Now, hopefully, you are on the same page as me about not modifying a broadcast
variable. Let's explore the right approach to handle such cases. Enough talk; let's see the code to
demo the same.

Demo:

Consider a phrase-mining streaming application. We want to cache mined phrases and keep refining
them periodically. How do we do it at large scale?

Common mistake:

// init the phrases corpus as a broadcast variable
val phrasesBc = spark.sparkContext.broadcast(phrases)

// load input data as a Dataset[String]
val sentencesDf = spark.read
  .format("text")
  .load("/tmp/gensim-input")
  .as[String]

// for each sentence, mine phrases and try to update the phrases vocabulary
sentencesDf.foreach(sentence => phrasesBc.value.updateVocab(sentence))

The above code will run fine in local mode, but not on a cluster.

Notice the phrasesBc.value.updateVocab() call written above. This is trying to update the broadcasted
variable, which will appear to work in a local run. But on a cluster this doesn't work, because:

The phrasesBc broadcasted value is not a shared variable across the executors running on different
nodes of the cluster.

We have to be mindful of the fact that phrasesBc is a local copy, one in each of the cluster's
worker nodes where executors are running.
Therefore, changes done to phrasesBc by one executor will be local to it and are not visible to other
executors.

How to solve this without broadcast?

Our streaming input data is split into multiple partitions and handed over to multiple executor nodes.

Mine the <phrase, phrase-count> info as RDDs or DataFrames locally for each partitioned block (i.e.,
locally within that executor).

Combine the mined phrases per partition using aggregateByKey or combineByKey to sum up the
phrase counts.

Collect the aggregated phrases at the driver!!

Code

import scala.collection.mutable.HashMap
import org.apache.spark.sql.functions.sum
import spark.implicits._

// word and its term frequency, maintained only on the driver
val globalCorpus = new HashMap[String, Int]()

val sentencesDf = spark.read
  .format("text")
  .load("/tmp/gensim-input")
  .as[String]

// learn a local corpus per partition
val partitionCorpusDf = sentencesDf.mapPartitions(sentencesInPartitionIter => {
  // 1. local partition corpus
  val partitionCorpus = new HashMap[String, Int]()
  // 2. iterate over each sentence in this partition
  while (sentencesInPartitionIter.hasNext) {
    val sentence = sentencesInPartitionIter.next()
    // 3. mine phrases in this sentence
    val sentenceCorpus: HashMap[String, Int] = Phrases.learnVocab(sentence)
    // 4. merge the sentence corpus into the partition corpus
    sentenceCorpus.foreach { case (phrase, count) =>
      partitionCorpus.put(phrase, partitionCorpus.getOrElse(phrase, 0) + count)
    }
  }
  partitionCorpus.iterator
}).toDF("key", "value")

// 5. aggregate the partition-wise corpora and collect the result at the driver
// 6. finally, update the global corpus with the collected info
partitionCorpusDf.groupBy($"key")
  .agg(sum($"value").as("count"))
  .collect()
  .foreach(row => {
    // merge this row into the global corpus
    val word = row.getString(0)
    val globalCount = globalCorpus.getOrElse(word, 0)
    val localCount = row.getLong(1).toInt
    globalCorpus.put(word, globalCount + localCount)
  })

What did we achieve by this?

The driver machine is the only place where we maintain a cumulative corpus of learnt phrases.

At the same time, this approach doesn't overload the driver with one update per record.

The driver copy gets updated only once per batch.

The phrase-mining workload is shared beautifully across all the executors.

Essentially, every time we receive a new batch of input data points, the reference data, i.e., the phrases,
gets updated in only one place, i.e., the driver node. And, at the same time, the job of mining phrases is
computed in a distributed way.

FAQ

Why are we collecting the reference data at the driver?

You might have apprehensions about collecting the reference data at the driver.

But note that, in these use cases, the reference data being collected at the driver is a small cache.

Also, each time we are collecting only a small set of new data to be added to this cache.

Just make sure that the driver machine has enough memory to hold the reference data. That's it!!

Why keep globalCorpus on the driver? Can't we maintain this using an in-memory alternative like Redis?

Yes! You can use an in-memory cache for this purpose.

But be mindful that, with Spark's distributed computing, every partition run by every executor will try
to update its count simultaneously.

So, if you want to take this route, then make sure to take up the additional responsibility of lock-based
or synchronized updates to the cache.

Why not use Accumulators for globalCorpus?

Accumulators do work as a mutable distributed cache.

But Spark natively supports accumulators only of numeric types.

The onus is on programmers to add support for new types by subclassing AccumulatorV2, as shown
here (a hedged sketch follows).
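(A minimal sketch of such an AccumulatorV2 subclass, assuming we accumulate <phrase, count> pairs into a map; this is illustrative and not the exact code linked above:)

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.HashMap

class CorpusAccumulator extends AccumulatorV2[(String, Int), HashMap[String, Int]] {
  private val map = new HashMap[String, Int]()
  override def isZero: Boolean = map.isEmpty
  override def copy(): CorpusAccumulator = {
    val acc = new CorpusAccumulator
    acc.map ++= map
    acc
  }
  override def reset(): Unit = map.clear()
  override def add(v: (String, Int)): Unit =
    map.put(v._1, map.getOrElse(v._1, 0) + v._2)
  override def merge(other: AccumulatorV2[(String, Int), HashMap[String, Int]]): Unit =
    other.value.foreach { case (k, c) => map.put(k, map.getOrElse(k, 0) + c) }
  override def value: HashMap[String, Int] = map
}

// register it once on the driver, then call add() from tasks:
// val corpusAcc = new CorpusAccumulator
// spark.sparkContext.register(corpusAcc, "corpus")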

Can we write the reference data directly into a file instead of collecting it in a globalCorpus, as
shown in the code below?

// write the aggregated counts per batch to a file
partitionCorpusDf.groupBy($"key")
  .agg(sum($"value"))
  .foreach(x => {
    // write x to a file
  })

The above code writes the aggregated counts to a file. But note that it writes the aggregated counts
per batch. So, if we receive the phrase "apache spark" once in batch 1 and once more in batch 2,
this approach will write the <"apache spark", 1> entry to the file twice. It is basically
agnostic of the counts aggregated in earlier batches.

So, there's a need to merge the batch-wise counts into the global corpus.

Save the globalCorpus into a file or some database using a shutdown hook right before our
streaming application shuts down.
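(A minimal sketch of such a shutdown hook; the output path and the tab-separated format are made up for illustration:)

sys.addShutdownHook {
  val out = new java.io.PrintWriter("/tmp/global-corpus.tsv")
  globalCorpus.foreach { case (phrase, count) => out.println(s"$phrase\t$count") }
  out.close()
}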
Insights into the troubles of using a filesystem (S3/HDFS)
as a data source in Spark...
I was experimenting to compare the usage of a filesystem (like S3 or HDFS) vs
a queue (like Kafka or Kinesis) as the data source of Spark, i.e.,

// 1. filesystem as the datasource
spark.read.json("s3://…<S3_PATH>..")   (vs)

// 2. kinesis as the datasource
spark.read.format("kinesis")

As I started analysing S3 as a source, at first it all looked dandy - ease of
use, reliable uptime, easy maintenance etc. On top of that, its
checkpointing system also seemed fine. Checkpointing essentially keeps
track of the files Spark finished processing or partly processed. Spark does this
by maintaining a list of the files it processed and also the offsets of the files
that it partially processed. So, the next time our Spark application kicks
off, it will not reprocess all the files present in the S3 bucket. Instead, Spark will
pick up the last partially processed file according to the saved offsets in the
checkpoint and continue from there.

Trouble started once I deployed.

Culprit: What files to pick for the next batch?

How does Spark decide the next batch of files
to process?
For every batch, it repeatedly lists all of the files in the S3 bucket in order to
decide the next batch of files to process, as shown below:

 Compute all files in the S3 bucket - Spark calls the LIST API to list all the files
in S3.
 Compute the files processed in previous runs - it then pulls the list of processed
files from the checkpoint.
 Compute the files to process next = AllFiles - ProcessedFiles (a trivial
restatement follows this list).
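(Restated as a toy Scala function with illustrative names, just to emphasise that the expensive part is producing allFilesInBucket every single batch, not the subtraction itself:)

def nextBatch(allFilesInBucket: Set[String], processedFiles: Set[String]): Set[String] =
  allFilesInBucket -- processedFiles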

The S3 LIST API call to get all the files in the bucket is very expensive. Some
numbers that I observed when this application was deployed on an Amazon
EMR cluster of 3 nodes show how slow it is:

 80-100 files in the S3 bucket take ~2-3 secs to list
 500-1000 files in the S3 bucket take ~10 secs
 1000-10000 files in the S3 bucket take ~15-20 secs

Note that the numbers didn't change much when the S3 data source was replaced
with the HDFS file system.

What does this mean?

If we use a filesystem as the Spark source, supposing our batch interval
is, say, 30 secs, ~50% of the batch-processing time is taken just to decide
the next batch to process. This is bad. On top of that, this listing
happens repeatedly, every batch - an absolute waste of resources
and time!!!

Hence, S3 as a source for Spark is bad for the following reasons:

1. High latency: We need to list large buckets on S3 every batch, which is
slow and resource intensive.
2. Higher costs: LIST API requests made to S3 are costly and involve network
I/O.
3. Eventual consistency: There is always a lag in listing the files
written to S3 because of its eventual-consistency model.

So, should we just not use a filesystem as a
data source with Spark at all?
Hmm... its low cost, reliable uptime and ease of maintenance are
too good to give up. There are alternative solutions to the problems
discussed above. Let's have a look at them.

Solutions:
Before we jump into solutions, note that S3 is not a real filesystem,
i.e., the semantics of the S3 "filesystem" are not
those of a POSIX filesystem. So, S3 may not behave entirely as
expected. This is why it has limitations such as eventual consistency in
listing the files written to it.

Idea:
 The core idea of the solutions mentioned below is to avoid the S3 LIST API
when computing the next batch of records to process.
 For this, we should establish an alternative, secondary source to track the files
written to the S3 bucket.

Netflix's solution:
 Netflix has addressed this issue with a tool named S3mper.
 It is an open-source library that provides an additional layer of consistency
checking on top of S3 through the use of a consistent secondary index.
 Essentially, any files written to S3 are tracked in the secondary index, which
makes listing the files written to S3 consistent and fast.
 The default implementation of the secondary index that they provide is backed by
DynamoDB.

After looking at Netflix's solution, hopefully it is now clear that repeated
listing of the files in an S3 bucket can be avoided through an alternative
secondary source that tracks its files.

Databricks' solution:
Databricks has also implemented a solution to address the same problem, but it's
unfortunately not open-sourced. Instead of maintaining a secondary
index, they used an SQS queue as the secondary source to track the files in the S3
bucket.

 So, every time a new file is written to the S3 bucket, its full path is added to
the SQS queue.
 This way, in our Spark application, to find the next files to process, we
just need to poll SQS.
 Processed files are removed from the queue, because polling removes
those entries from SQS.
 Therefore, we no longer need to call the S3 LIST API to find the next
batch of records.

Conclusion and key takeaways:

 For a streaming application, queues should be the first choice of
streaming source for Spark applications, because we just poll the
queue to get the next batch of records. More importantly, neither latency
nor consistency is a problem here.
 Any filesystem, be it S3 or HDFS, comes with drawbacks like eventual
consistency and/or the high latency of the file-listing API. Such filesystems as
sources are fine for quickly experimenting with a POC, as they don't need any setup
process the way a Kafka queue does.
 But for production purposes, if we intend to use a filesystem, then we
definitely need a solution like Netflix's S3mper or Databricks' S3-
SQS, where we can get rid of the S3 LIST API call.
