Databricks Question
-----------------------------------------------------------------------------------
----------------------------
5. Why did you use Databricks in your project?
-----------------------------------------------------------------------------------
----------------------------
1. Not only does Databricks sit on top of a flexible, distributed cloud computing environment on either Azure or AWS, it also masks the complexities of distributed processing from your data scientists and engineers, allowing them to develop directly in Spark's native R, Scala, Python or SQL interfaces.
2. Why is Databricks so popular?
It unifies batch and streaming data, incorporates many different processing models and supports SQL. These characteristics make it much easier to use, highly accessible and extremely expressive.
3. Databricks is a higher-level platform that also includes multi-user support, an interactive UI, security, data management, cluster sharing and job scheduling. These qualities set Databricks apart.
-----------------------------------------------------------------------------------
-----------------------------
8. Optimization Techniques in Spark
-----------------------------------------------------------------------------------
-----------------------------
Spark Performance Tuning – Best Guidelines & Practices:
1. Use DataFrame/Dataset over RDD.
2. Use coalesce() over repartition().
3. Use mapPartitions() over map() (see the sketch below).
4. Use serialized data formats.
5. Avoid UDFs (User Defined Functions).
6. Cache data in memory.
7. Reduce expensive shuffle operations.
8. Disable DEBUG & INFO logging.
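A minimal sketch of guidelines 3 and 6 above (mapPartitions() over map(), and caching data that is reused); the dataset and the per-partition setup here are hypothetical, not taken from the original notes.
Code:
import org.apache.spark.sql.SparkSession

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tuning-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // map(): the function (and any setup inside it) runs once per element
    val squaredPerElement = numbers.map(n => n * n)

    // mapPartitions(): per-partition setup (e.g. a connection or parser) is paid
    // once per partition instead of once per element
    val squaredPerPartition = numbers.mapPartitions { iter =>
      // expensive setup would go here, once per partition
      iter.map(n => n * n)
    }

    // Cache an RDD/DataFrame that is reused by more than one action
    squaredPerPartition.cache()
    println(squaredPerPartition.count()) // first action materialises the cache
    println(squaredPerPartition.sum())   // second action reads from memory

    spark.stop()
  }
}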
-----------------------------------------------------------------------------------
-----------------------------
8 Performance Optimization Techniques Using Spark:
We all know that taking care of performance during development is just as important as the program itself. A Spark job can be optimized by many techniques, so let's dig deeper into those techniques one by one. Apache Spark optimization helps with in-memory data computations; the bottleneck for these computations can be CPU, memory or any other resource in the cluster.
1. Serialization
Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer on the JVM.
Spark can also use another serializer called Kryo for better performance: the Kryo serializer uses a compact binary format and is roughly 10x faster than the Java serializer.
To set the serializer property:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Code:
val conf = new SparkConf().setMaster(…).setAppName(…)
// use Kryo instead of the default Java serializer
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// register application classes so Kryo can serialize them efficiently
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
2. API selection
Spark provides three APIs to work with – RDD, DataFrame and Dataset.
RDD is used for low-level operations and has the least optimization.
DataFrame is the best choice in most cases because the Catalyst optimizer builds a query plan for it, which gives better performance and low garbage collection (GC) overhead.
Dataset is highly type-safe and uses encoders as part of its serialization; it uses Tungsten to serialize data in a binary format.
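A minimal sketch contrasting the three APIs; the Person case class and the sample rows are hypothetical, used only for illustration.
Code:
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object ApiSelectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: low-level, no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(Person("a", 30), Person("b", 15)))
    val adultsRdd = rdd.filter(_.age >= 18)

    // DataFrame: untyped rows, optimized by Catalyst
    val df = rdd.toDF()
    val adultsDf = df.filter($"age" >= 18)

    // Dataset: typed like an RDD, but still goes through Catalyst/Tungsten
    val ds = df.as[Person]
    val adultsDs = ds.filter(_.age >= 18)

    adultsDs.show()
    spark.stop()
  }
}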
3. Advanced Variables
Spark comes with two types of advanced (shared) variables – Broadcast and Accumulator.
Broadcasting plays an important role while tuning Spark jobs: a broadcast variable makes a small dataset available locally on every node, so that node processes the data locally.
When one dataset is much smaller than the other, a broadcast join is highly recommended:
df1.join(broadcast(df2))
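A minimal broadcast-join sketch for the df1.join(broadcast(df2)) pattern above; the orders and countries tables and their columns are hypothetical.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Large fact table and a small dimension table (hypothetical data)
    val orders = Seq((1, "IN", 100.0), (2, "US", 250.0)).toDF("order_id", "country_code", "amount")
    val countries = Seq(("IN", "India"), ("US", "United States")).toDF("country_code", "country_name")

    // Broadcasting the small side avoids shuffling the large side
    val joined = orders.join(broadcast(countries), "country_code")
    joined.show()

    spark.stop()
  }
}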
5. ByKey Operations
Many Spark transformations are ByKey operations, and ByKey operations generate a lot of shuffle. Shuffles are heavy operations because they consume a lot of memory, so while coding in Spark the user should try to avoid shuffle operations wherever possible, because shuffling degrades performance.
High shuffling can lead to an OutOfMemory error; to avoid it, the user can increase the level of parallelism.
Use reduceByKey instead of groupByKey: groupByKey shuffles all the data, which hampers performance, while reduceByKey does not shuffle the data as much, so it is faster.
Whenever a ByKey operation is used, partition the data correctly.
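A minimal sketch of the reduceByKey vs. groupByKey point, using a small hypothetical word-count-style RDD.
Code:
import org.apache.spark.sql.SparkSession

object ByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bykey-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

    // groupByKey: every value is shuffled to the reducer before aggregation
    val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: values are pre-aggregated on each partition (map-side combine),
    // so far less data crosses the network
    val countsViaReduce = pairs.reduceByKey(_ + _)

    countsViaReduce.collect().foreach(println)
    spark.stop()
  }
}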
8. Level of Parallelism
In any distributed environment, parallelism plays a very important role while tuning a Spark job.
Whenever a Spark job is submitted, it builds a DAG that contains stages; the stages are split into tasks based on the partitions, and every partition (that is, every task) requires a single core for processing.
There are two ways to adjust the parallelism:
Repartition: gives an equal number of partitions, but involves a full shuffle, so it is not advisable when you only want to reduce the number of partitions.
Coalesce: generally reduces the number of partitions with less shuffling of data.
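A minimal sketch of repartition() vs. coalesce(); the DataFrame and the partition counts chosen here are hypothetical.
Code:
import org.apache.spark.sql.SparkSession

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallelism-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = (1 to 100000).toDF("id")
    println(s"initial partitions: ${df.rdd.getNumPartitions}")

    // repartition(): full shuffle, produces evenly sized partitions (can increase or decrease the count)
    val rebalanced = df.repartition(16)
    println(s"after repartition: ${rebalanced.rdd.getNumPartitions}")

    // coalesce(): merges existing partitions without a full shuffle (only decreases the count)
    val narrowed = rebalanced.coalesce(4)
    println(s"after coalesce: ${narrowed.rdd.getNumPartitions}")

    spark.stop()
  }
}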
source: https://www.syntelli.com/eight-performance-optimization-techniques-using-spark
-----------------------------------------------------------------------------------
-----------------------------
11. How to decide the number of partitions?
-----------------------------------------------------------------------------------
-----------------------------
The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound on the number of partitions, each task should still take at least 100 ms to execute, otherwise the partitions are too small and scheduling overhead starts to dominate.
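A minimal sketch of that rule of thumb, assuming the application's total cores are visible through defaultParallelism; the dataset is hypothetical.
Code:
import org.apache.spark.sql.SparkSession

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-count-sketch").master("local[*]").getOrCreate()

    // defaultParallelism is normally the total number of cores available to the application
    val cores = spark.sparkContext.defaultParallelism
    val targetPartitions = cores * 4 // ~4x cores, per the rule of thumb above

    // Apply it to Spark SQL shuffles and to an explicit repartition
    spark.conf.set("spark.sql.shuffle.partitions", targetPartitions.toString)

    import spark.implicits._
    val df = (1 to 1000000).toDF("id").repartition(targetPartitions)
    println(s"cores=$cores, partitions=${df.rdd.getNumPartitions}")

    spark.stop()
  }
}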
12. How to decide the number of partitions in repartition()?
-----------------------------------------------------------------------------------
-----------------------------
13. Which Scheduler did you use in your project?
-----------------------------------------------------------------------------------
-----------------------------
I. Default FIFO scheduler
The FIFO scheduler runs as the default algorithm on Hadoop [5]. The way this algorithm works depends on job priority: every job to be executed has a priority for running on the resources available in the cluster, and the jobs are arranged in a queue according to their priorities.
II. Fair scheduler
This scheduler [6] uses the priority of each job as a weight to determine its portion of the total resources. A job is split into a number of tasks, and the available slots are ready for processing. The scheduler examines the time deficit of each job against its ideal fair allocation; when tasks have finished and a slot is ready for the next scheduling round, high-priority tasks are assigned to the free slot. (A configuration sketch for fair scheduling within a Spark application follows this list.)
III. Delay scheduler
In this scheduler, a task tracker waits for a specific time when its local data is not ready [7]. When a node requests a task assignment, the delay scheduler reviews the size of the job; if the job is very short, its turn is skipped and any later job that is ready to run is scheduled instead. The important problem solved by this scheduler is data locality.
IV. Capacity scheduler
The Capacity scheduler is used when different organizations need to share a large cluster, each with a smaller guaranteed capacity, while surplus capacity is shared between users [8]. MapReduce slots are configured for each existing queue, and each queue is served with FIFO priority. Resources can be accessed by high-priority jobs ahead of jobs with lower priority. Scheduled tasks are tracked by the memory consumption of each task, and the scheduler is capable of monitoring memory against the available resources.
V. Matchmaking scheduler
Matchmaking scheduling can improve the data locality of map tasks [9]. The scheduler guarantees that local tasks are assigned to slave nodes first, before assigning non-local tasks, and it keeps trying to detect a match with a slave node. Each node is marked with a locality marker, which ensures that every node gets to pull tasks.
VI. LATE (Longest Approximate Time to End) scheduler
The LATE scheduling algorithm [10] tries to improve Hadoop by seeking out genuinely slow tasks, calculating the remaining time of all tasks. It is based on prioritizing the tasks to speculate, selecting fast nodes to run them on, and capping the number of speculative tasks to prevent thrashing. This method only takes action on appropriately slow tasks and does not break the synchronization between the map and reduce phases.
VII. Deadline constraint scheduler
Deadline constraints [11] are applied when job scheduling must guarantee that jobs meet their deadlines. The deadline constraint scheduler enhances system utilization while handling data processing. This is achieved through a cost model for job execution and a Hadoop scheduler that enforces the constraint; the cost model considers parameters such as the input data size and the runtime distribution of the job's tasks.
VIII. Resource-aware scheduler
This scheduler aims to minimize wasted resources (that is, improve resource utilization) in Hadoop. These schemes focus on how effectively the different resource types (IO, disk, memory, network and CPU) are utilized [12]. Dynamic free slots are used in this scheduler.
IX. Energy-aware scheduler
The enormous clusters within data centers that run big data applications have energy costs that make energy-efficient execution important. This scheduler [13] is used to enhance the energy efficiency of MapReduce applications, which leads to a reduction of cost in data centers. Table 1 of the original source provides a comparison among the different Hadoop job scheduling techniques.
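As referenced in item II above, here is a minimal sketch of enabling fair scheduling inside a Spark application itself (spark.scheduler.mode and scheduler pools). This is Spark's own within-application job scheduler, related to but distinct from the cluster-level Hadoop/YARN Fair scheduler described above; the pool name used here is hypothetical.
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    // Switch Spark's internal job scheduler from the default FIFO to FAIR
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .setMaster("local[*]")
      .set("spark.scheduler.mode", "FAIR")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    // Jobs submitted from this thread go to a (hypothetical) "etl" pool;
    // pools and their weights are normally defined in a fairscheduler.xml file
    sc.setLocalProperty("spark.scheduler.pool", "etl")
    sc.parallelize(1 to 1000000).map(_ * 2).count()

    // Clear the property so later jobs from this thread use the default pool
    sc.setLocalProperty("spark.scheduler.pool", null)

    spark.stop()
  }
}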
-----------------------------------------------------------------------------------
-----------------------------