5 Key Factors To Keep in Mind While Optimizing Apache Spark in AWS
Long Lineage
Lazy evaluation in Spark means that actual execution does not
happen until an action is triggered. Commands in Spark can be
divided into two types: transformations, which only record a step
in the lineage, and actions, which trigger execution of the whole
lineage built so far. A long lineage therefore means a lot of
deferred work executes at once, as sketched below.
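A minimal PySpark sketch of this behaviour (the file-less example
data, checkpoint path and column names are illustrative):
transformations only grow the lineage, the action runs it, and
checkpoint() can truncate a lineage that has grown long.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # illustrative path

df = spark.range(1000000)                    # transformation: nothing executes yet
df = df.filter(F.col("id") % 2 == 0)         # transformation: lineage grows
df = df.withColumn("half", F.col("id") / 2)  # transformation: lineage grows

print(df.count())                            # action: the whole lineage executes now

# in iterative jobs, truncate a long lineage before expensive steps such as joins
df = df.checkpoint()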
Join Operations
During a join, if you have a big table and a relatively small table
(a lookup or dimension table), it is advisable to broadcast the
small table. In broadcasting, a copy of the broadcasted table is
sent to each node of the cluster. While joining, the part of the
bigger table residing on a node joins with the local copy of the
broadcasted table, so no data moves across nodes; this reduces
shuffle and I/O and hence improves performance.
Optimisation Trick:
If you are joining a big table with a small one, it is a good idea
to broadcast the smaller table. But keep in mind that the smaller
table should be small enough to fit inside the memory of an
executor. If both tables you are trying to join are big and
similar in size, ensure that neither table is skewed and that both
are distributed across enough partitions; if not, repartition the
skewed table to increase its number of partitions. If one of the
tables is not similar in size to the other, yet not small enough
to be broadcast, you could cache (or persist) the smaller table
and ensure the bigger table is partitioned properly before
performing the join. A sketch of these patterns follows below.
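A minimal sketch of a broadcast join in PySpark (the table names,
join key and partition count are illustrative); the broadcast()
hint ships the small dimension table to every executor so the big
table is joined without a shuffle.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
fact_df = spark.table("sales")          # big fact table (illustrative)
dim_df = spark.table("country_codes")   # small lookup table (illustrative)

# hint Spark to broadcast the small table; the big table is not shuffled
joined = fact_df.join(broadcast(dim_df), on="country_id", how="left")

# if the second table is too big to broadcast but smaller than the first,
# persist it and make sure the big table is well partitioned instead
dim_df.cache()
fact_df = fact_df.repartition(200, "country_id")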
Maximising Parallelism
One way to increase the parallelism of Spark processing is to
increase the number of executors on the cluster. Below are two
important properties that control executor sizing, and therefore
how many executors fit on the cluster.
A node can have multiple executors, but not the other way around.
An executor can have multiple cores.
The property spark.executor.cores should only be given integer
values.
The property spark.executor.memory takes a memory size such as
512m or 4g.
It is not advisable to have more than 5 cores per executor. This
is based on studies showing that applications with more than 5
concurrent threads per executor start hampering performance.
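As a sketch, these two properties can be set when building the
session; the values below are illustrative, sized for a
hypothetical node with 16 cores and 64 GB of RAM.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing")
    .config("spark.executor.cores", "5")       # integer only; at most 5 per the advice above
    .config("spark.executor.memory", "18g")    # memory size string
    .config("spark.executor.instances", "10")  # illustrative executor count
    .getOrCreate()
)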
User Defined Functions (UDFs)
Spark allows custom logic to be written as user defined functions
(UDFs). There are two common ways of writing a UDF in Spark, and
a few things are worth noting about both.
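A minimal sketch of the two styles, assuming PySpark (the squared
function is illustrative): one defines the UDF for the DataFrame
API with pyspark.sql.functions.udf, the other registers it for use
in SQL with spark.udf.register.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# way 1: DataFrame API -- wrap a Python function with udf()
squared = udf(lambda x: x * x, LongType())
df.select(col("id"), squared(col("id")).alias("id_squared")).show()

# way 2: SQL -- register the function and call it from a SQL query
spark.udf.register("squared_sql", lambda x: x * x, LongType())
df.createOrReplaceTempView("numbers")
spark.sql("SELECT id, squared_sql(id) AS id_squared FROM numbers").show()

Note that both styles call the Python function one row at a time
and are opaque to the Catalyst optimiser, so prefer Spark's
built-in functions whenever they can do the job.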
Optimisation Trick
While writing a UDF, assume the function accepts one row and
returns one row. You can pass in multiple columns, but only one
row at a time. If you have more than two arguments (columns) to
your UDF, I would advise creating one array from all your
arguments and passing that single array to the UDF, as sketched
below.
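A sketch of this pattern (the column names and the scoring
function are illustrative), packing several columns into one array
argument with pyspark.sql.functions.array:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, array, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 3.0)], ["a", "b", "c"])

# one UDF argument: an array built from all the input columns
weighted_sum = udf(lambda xs: xs[0] + 2 * xs[1] + 3 * xs[2], DoubleType())
df = df.withColumn("score", weighted_sum(array(col("a"), col("b"), col("c"))))
df.show()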
Optimisation Trick
df.explain()      # physical plan only
df.explain(True)  # parsed, analysed and optimised logical plans plus the physical plan
Optimisation Trick
Look into all the plans above and identify opportunities for
optimisation. Avoid full table scans if possible, apply filters as
early as you can in the processing steps and ensure lineage isn’t
long before performing joins.
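For example, a minimal sketch (the table and column names are
illustrative) in which a filter is applied before a join; the
output of explain(True) should show the filter sitting below the
join in the optimised plan.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
orders = spark.table("orders")
customers = spark.table("customers")

# filter early so less data is shuffled into the join
recent = orders.filter(col("order_date") >= "2020-01-01")
joined = recent.join(customers, on="customer_id")

joined.explain(True)  # verify the filter appears below the join in the plan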
Conclusion
Although there are a lot of built-in optimisations already
available in Spark, it is necessary to use all of them smartly to
get the best out of it. In big data systems it is advisable to
optimise the data first, before thinking about optimising queries.
Here is the first part of the story; please give it a read and let
me know your thoughts.