Apache Spark Things To Know
Of course, you can only rely on .par for the parts of your
pipeline that are completely deterministic and carry no risk of a
race condition. Overzealous use of .par can quickly result in
mysteriously disappearing or overwritten data.
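To make the race-condition risk concrete, here is a minimal sketch in plain Scala. The object and method names (ParSketch, racySum, safeSum) are made up for illustration, and it assumes Scala 2.12, where .par ships with the standard library. It contrasts a parallel loop that mutates shared state with one that does not.

```scala
// Assumes Scala 2.12, where .par is part of the standard library.
// On Scala 2.13+, add the scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._

object ParSketch {
  // Risky: many threads update one unsynchronized variable,
  // so individual updates can be silently lost.
  def racySum(inputs: Vector[Int]): Long = {
    var total = 0L
    inputs.par.foreach(i => total += i)
    total
  }

  // Safer: no shared mutable state. Each element is transformed
  // independently and the collection combines the results.
  def safeSum(inputs: Vector[Int]): Long =
    inputs.par.map(_.toLong).sum

  def main(args: Array[String]): Unit = {
    val inputs = (1 to 10000).toVector
    println(s"racy: ${racySum(inputs)}") // may differ from run to run
    println(s"safe: ${safeSum(inputs)}") // 50005000
  }
}
```

The same reasoning applies when .par is used to launch several writes at once: if two parallel branches touch the same output path or the same mutable object, the result depends on timing.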
If there is a chance your join columns contain null values, you are in
danger of massive skew, since rows with a null key can all be shuffled
to the same partition. A great solution to this problem is to
“salt” your nulls. This essentially means pre-filling arbitrary unique
values (like UUIDs) into empty cells prior to running a join.
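As a rough sketch of null salting with the DataFrame API, the snippet below fills null keys with Spark SQL's uuid() function before a left outer join. The object name, the helper withSaltedKey, the join_key column, and the sample data are all hypothetical. It also assumes the key can be cast to a string and that a random UUID will never collide with a real key.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, expr}

object NullSaltSketch {
  // Hypothetical helper: adds a `salted` column that copies `key`,
  // replacing each null with a fresh UUID so null rows spread out.
  def withSaltedKey(df: DataFrame, key: String, salted: String): DataFrame =
    df.withColumn(salted, coalesce(col(key).cast("string"), expr("uuid()")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-salt-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: two of the three events have no user_id.
    val events = Seq[(Option[String], String)](
      (Some("u1"), "click"), (None, "view"), (None, "scroll")
    ).toDF("user_id", "action")
    val users = Seq(("u1", "Ada")).toDF("join_key", "name")

    // Join on the salted key, then drop it; the original user_id
    // column still holds its nulls for downstream steps.
    val joined = withSaltedKey(events, "user_id", "join_key")
      .join(users, Seq("join_key"), "left_outer")
      .drop("join_key")

    joined.show()
    spark.stop()
  }
}
```

Salting into a separate column rather than overwriting the key is a design choice made here: it keeps the original nulls visible after the join, at the cost of one extra column during the shuffle.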
Conclusion
And there you have it, a loose assemblage of suggestions, cobbled
together from a year of using Spark. Here’s hoping my future self
has already found that wormhole and is sending me the year two
edition as you’re reading this.