Databricks Spark Knowledge Base
1. Knowledgebase
2. Best Practices
i. Avoid GroupByKey
ii. Don't copy all elements of a large RDD to the driver
iii. Gracefully Dealing with Bad Input Data
3. General Troubleshooting
i. Job aborted due to stage failure: Task not serializable:
ii. Missing Dependencies in Jar Files
iii. Error running start-all.sh - Connection refused
iv. Network connectivity issues between Spark components
4. Performance & Optimization
i. How Many Partitions Does An RDD Have?
ii. Data Locality
5. Spark Streaming
i. ERROR OneForOneStrategy
Best Practices
Avoid GroupByKey
Don't copy all elements of a large RDD to the driver
Gracefully Dealing with Bad Input Data
Avoid GroupByKey
Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey.
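A minimal PySpark sketch of the two approaches (assuming a SparkContext named sc, as in the Spark shell) might look like this:

# Pair each word with a count of 1, then aggregate by key.
words = sc.parallelize(["one", "two", "two", "three", "three", "three"])
word_pairs = words.map(lambda w: (w, 1))

# reduceByKey: values with the same key are combined on each partition before the shuffle.
word_counts_with_reduce = word_pairs.reduceByKey(lambda a, b: a + b).collect()

# groupByKey: every key-value pair is shuffled, and the counting happens only afterwards.
word_counts_with_group = word_pairs.groupByKey().map(lambda kv: (kv[0], sum(kv[1]))).collect()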
While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
Look at the diagram below to understand what happens with reduceByKey. Notice how pairs on the same machine with the same key are combined (by using the lambda function passed into reduceByKey) before the data is shuffled. Then the lambda function is called again to reduce all the values from each partition to produce one final result.
On the other hand, when calling groupByKey, all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
You can imagine that for a much larger dataset, the difference in the amount of data shuffled becomes even more exaggerated between reduceByKey and groupByKey.
Here are more functions to prefer over groupByKey:
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
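For illustration, here is a minimal PySpark sketch of both alternatives (again assuming a SparkContext named sc; the data is made up):

pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])

# foldByKey: sum the values per key, starting from the neutral "zero value" 0.
sums = pairs.foldByKey(0, lambda a, b: a + b).collect()   # contains ('a', 4) and ('b', 2)

# combineByKey: the result type (a (sum, count) tuple) differs from the input value type (int).
sum_counts = pairs.combineByKey(
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1])    # mergeCombiners
).collect()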
Don't copy all elements of a large RDD to the driver
If your RDD is so large that its elements won't all fit in memory on the driver machine, don't call collect() on it. Collect will attempt to copy every single element of the RDD onto the single driver program, which will then run out of memory and crash.
Instead, you can make sure the number of elements you return is capped by calling take or takeSample, or perhaps by filtering or sampling your RDD first.
If you really do need every one of the values of the RDD and the data is too big to fit into memory, you can write out the RDD to files or export the RDD to a database that is large enough to hold all the data.
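For example (a sketch in which my_large_rdd and the output path are hypothetical stand-ins):

first_hundred = my_large_rdd.take(100)                             # at most 100 elements reach the driver
sample = my_large_rdd.takeSample(withReplacement=False, num=100)   # 100 randomly chosen elements
my_large_rdd.saveAsTextFile("hdfs:///tmp/my_large_rdd")            # write the full RDD out instead of collecting it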
Gracefully Dealing with Bad Input Data
When dealing with vast amounts of data, a common problem is that a small fraction of the input is malformed or corrupt. Using a filter transformation, you can easily discard bad inputs; with a map transformation, you can fix them where that is possible. Often the best option is a flatMap function where you can try fixing the input but fall back to discarding it if you can't.
For example, suppose our input is an RDD of JSON strings, some of which are malformed. Feeding this set of JSON strings straight into a sqlContext would clearly fail because of the malformed records.
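A plausible set of input strings for this example (one well-formed record, one unrecoverable record, and two records that are only missing a closing brace, consistent with the corrected output at the end of this section) would be:

input_rdd = sc.parallelize(['{"value": 1}',   # well-formed
                            'bad_json',        # cannot be recovered
                            '{"value": 2',     # missing the ending brace
                            '{"value": 3'])    # missing the ending brace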
sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
# The above command will throw an error.
Instead, let's try fixing the input with this Python function:
import json

def try_correct_json(json_string):
    try:
        # First check if the json is okay.
        json.loads(json_string)
        return [json_string]
    except ValueError:
        try:
            # If not, try correcting it by adding an ending brace.
            try_to_correct_json = json_string + "}"
            json.loads(try_to_correct_json)
            return [try_to_correct_json]
        except ValueError:
            # The malformed json input can't be recovered, drop this input.
            return []
Now, we can apply that function to fix our input and try again. This time we will succeed in reading three of the inputs:
corrected_input_rdd = input_rdd.flatMap(try_correct_json)
sqlContext.jsonRDD(corrected_input_rdd).registerTempTable("valueTable")
sqlContext.sql("select * from valueTable").collect()
# Returns [Row(value=1), Row(value=2), Row(value=3)]
General Troubleshooting
Job aborted due to stage failure: Task not serializable:
Missing Dependencies in Jar Files
Error running start-all.sh - Connection refused
Network connectivity issues between Spark components
Job aborted due to stage failure: Task not serializable:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...
The above error can be triggered when you initialize a variable on the driver (master), but then try to use it on one of the workers. In that case, Spark will try to serialize the object to send it over to the worker, and will fail if the object is not serializable. A typical way to hit this is to construct a non-serializable object on the driver and then reference it inside a function passed to map; shipping that closure to the workers triggers the error. Here are some ideas to fix it:
Make the class Serializable.
Declare the instance only within the lambda function passed in map.
Make the NotSerializable object a static and create it once per machine.
Call rdd.foreachPartition and create the NotSerializable object in there, like this:
rdd.foreachPartition(iter -> {
    // Constructed on the worker, so it never has to be serialized and shipped from the driver.
    NotSerializable notSerializable = new NotSerializable();
    // ...Now process iter
});
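The same pattern applies in PySpark, where capturing an unpicklable object (a database connection, say) in a closure fails in much the same way; in this sketch, open_connection and the per-record work are hypothetical stand-ins:

def process_partition(records):
    conn = open_connection()   # created on the worker, once per partition
    for record in records:
        conn.send(record)      # hypothetical per-record work
    conn.close()

rdd.foreachPartition(process_partition)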
Missing Dependencies in Jar Files
When running a Spark job, if the worker machines don't have your application's dependency jars on them, there will be an error that a class cannot be found. The easiest way around this is to build an uber (shaded) jar that packages your code together with its dependencies, and to keep out of it anything the cluster already supplies by marking those dependencies with the provided scope. Spark dependencies should be marked as provided since they are already on the Spark cluster. You may also exclude other jars that you have installed on your worker machines.
Here is an example Maven pom.xml file that creates an uber jar with all the code in that project and includes the commons-cli dependency, but not any of the Spark libraries:
<project>
<groupId>com.databricks.apps.logs</groupId>
<artifactId>log-analyzer</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>Databricks Spark Logs Analyzer</name>
<packaging>jar</packaging>
<version>1.0</version>
<repositories>
<repository>
<id>Akka repository</id>
<url>http://repo.akka.io/releases</url>
</repository>
</repositories>
<dependencies>
<dependency> <!-- Spark -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
</dependency>
<dependency> <!-- Spark SQL -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
</dependency>
<dependency> <!-- Spark Streaming -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.1.0</version>
<scope>provided</scope>
</dependency>
<dependency> <!-- Command Line Parsing -->
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>1.2</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<finalName>uber-${project.artifactId}-${project.version}</finalName>
</configuration>
</plugin>
</plugins>
</build>
</project>
Error running start-all.sh - Connection refused
If you are on a Mac and run into the following error when running start-all.sh:
% sh start-all.sh
starting org.apache.spark.deploy.master.Master, logging to ...
localhost: ssh: connect to host localhost port 22: Connection refused
You need to enable "Remote Login" for your machine. From System Preferences, select Sharing, and then turn on
Remote Login.
Network connectivity issues between Spark components
Network connectivity issues between Spark components can lead to a variety of warnings and errors. One common case involves the driver (SparkContext) and a standalone Master: if the driver is able to connect to the master but the master is unable to communicate back to the driver, the Master's logs may record multiple attempts to connect even though the driver will report that it could not connect.
In this case, the master reports that it has successfully registered an application, but if the acknowledgment of this
registration fails to be received by the driver, then the driver will automatically make several attempts to re-connect
before eventually giving up and failing. As a result, the master web UI may report multiple failed applications even
though only a single SparkContext was created.
Recommendations
If you are experiencing any of the errors described above:
Check that the workers and drivers are configured to connect to the Spark master on the exact address listed in the Spark master web UI / logs.
Set SPARK_LOCAL_IP to a cluster-addressable hostname for the driver, master, and worker processes.
The following settings determine which hostname and port each component binds to, listed in decreasing order of precedence; the last entry in each list is the default used when nothing else is supplied.
Driver (SparkContext) hostname:
The spark.driver.host configuration property.
If the SPARK_LOCAL_IP environment variable is set to a hostname, then this hostname will be used.
Otherwise, the address returned by Java's InetAddress.getLocalHost method.
Driver (SparkContext) port:
The spark.driver.port configuration property.
An ephemeral port chosen by the OS.
Master and Worker hostname:
The --host or -h option (or the deprecated --ip or -i option) passed when launching the Master or Worker process.
The SPARK_MASTER_HOST environment variable (Master only).
If the SPARK_LOCAL_IP environment variable is set to a hostname, then this hostname will be used.
Otherwise, the address returned by Java's InetAddress.getLocalHost method.
Master and Worker port:
The --port or -p option passed when launching the Master or Worker process.
The SPARK_MASTER_PORT or SPARK_WORKER_PORT environment variable (for the Master and Worker, respectively).
An ephemeral port chosen by the OS.
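For example, the driver-side settings can be pinned explicitly when constructing the SparkContext; the hostnames and port below are placeholders:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("connectivity-check")
        .setMaster("spark://master.example.com:7077")    # placeholder: the exact address from the master web UI / logs
        .set("spark.driver.host", "driver.example.com")  # placeholder: a hostname the cluster can reach
        .set("spark.driver.port", "51000"))              # placeholder: a fixed, reachable port instead of an ephemeral one
sc = SparkContext(conf=conf)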
Performance & Optimization
How Many Partitions Does An RDD Have?
Data Locality
How Many Partitions Does An RDD Have?
In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:
scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
Note from the screenshot that there are four partitions cached.
In the Python API, there is a method for explicitly listing the number of partitions:
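For example (assuming a SparkContext named sc):

rdd = sc.parallelize(range(100), 30)   # explicitly request 30 partitions
rdd.getNumPartitions()                 # returns 30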
Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.
Data Locality
Spark is a data parallel processing framework, which means it will execute tasks as close to where the data lives as
possible (i.e. minimize data transfer).
Checking Locality
The best means of checking whether a task ran locally is to inspect a given stage in the Spark UI. Notice from the
screenshot below that the "Locality Level" column displays which locality a given task ran with.
You can adjust how long Spark will wait before it falls back to a lower locality level (process local --> node local --> rack local --> any). For more information on these parameters, see the spark.locality.* configs in the Scheduling section of the Spark Configuration docs.
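For example, to make Spark wait longer before scheduling a task at a less local level (the value is illustrative; spark.locality.wait defaults to 3 seconds, and older releases expect the value in milliseconds):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-tuning")
        .set("spark.locality.wait", "10s"))   # illustrative: wait up to 10 seconds for a local slot
sc = SparkContext(conf=conf)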
Spark Streaming
ERROR OneForOneStrategy
ERROR OneForOneStrategy
If you enable checkpointing in Spark Streaming, then objects used in a function called in foreachRDD should be Serializable. Otherwise, there will be an "ERROR OneForOneStrategy: ... java.io.NotSerializableException" error.
Code that hits this error will run if you make one of these changes to it:
Turn off checkpointing by removing the jssc.checkpoint line.
Make the object used inside foreachRDD Serializable.