Spark Interview Questions
Guide to Interviews for Spark for Big Data
ZEP ANALYTICS
Introduction
We've curated this series of interview guides to accelerate your learning and your mastery of data science skills and tools.
1. Explain Spark architecture?
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:
i. Master Daemon - (Master/Driver Process)
ii. Worker Daemon - (Slave Process)
A Spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run as separate Java processes, and users can run them on the same horizontal Spark cluster, on separate machines (a vertical Spark cluster), or in a mixed machine configuration.
3. Difference between RDD, DataFrame and Dataset?
Resilient Distributed Dataset (RDD)
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes of your cluster, that can be operated on in parallel with a low-level API offering transformations and actions.
DataFrames (DF)
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing of large data sets even easier, a DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data.
Datasets (DS)
Starting in Spark 2.0, the Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API. Conceptually, consider a DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.
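A quick spark-shell sketch contrasting the three APIs (the Person case class and the sample values below are illustrative, not from the original text):
import spark.implicits._
case class Person(name: String, age: Int)
val rdd = sc.parallelize(Seq(("Alice", 29), ("Bob", 35)))   // RDD: low-level, positional tuples
val df = rdd.toDF("name", "age")                            // DataFrame: named columns (Dataset[Row])
val ds = df.as[Person]                                      // Dataset: strongly-typed JVM objects
rdd.map(_._2 + 1).collect()      // transform by position
df.select("name").show()         // transform by column name
ds.map(_.age + 1).collect()      // transform with compile-time type safety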
4. When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
1. you want low-level transformations, actions and control over your dataset, or your data is unstructured, such as media streams or streams of text;
2. you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
3. you don't care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
4. you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
7. What is an RDD and how does it work internally?
An RDD (Resilient Distributed Dataset) is a representation of data distributed across the nodes of a cluster. It is immutable: you can operate on an RDD to produce another RDD, but you can't alter the original.
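A minimal sketch of this immutability: transformations return new RDDs, and the original is never changed.
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
val rdd2 = rdd1.map(_ * 10)      // produces a new RDD
rdd1.collect()                   // Array(1, 2, 3, 4, 5) - unchanged
rdd2.collect()                   // Array(10, 20, 30, 40, 50)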
You can control the number of partitions by passing a second argument (for example, to sc.textFile() or sc.parallelize()).
You can change the number of partitions later using repartition().
If you want certain operations to consume whole partitions at a time, you can use mapPartitions().
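For example (a minimal sketch; the file path is illustrative):
val rdd = sc.textFile("/path/to/data.txt", 4)    // second argument sets the (minimum) number of partitions
rdd.getNumPartitions
val repartitioned = rdd.repartition(8)            // change the partition count later
val perPartitionCounts = repartitioned.mapPartitions(iter => Iterator(iter.size))   // consume a whole partition at a time
perPartitionCounts.collect()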
1. Broadcast variables - A broadcast variable enhances the efficiency of joins between small and large RDDs.
2. Accumulators - Accumulators help update the values of variables in parallel while executing.
3. The most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.
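A hedged sketch of point 1 (the lookup map and sample records are illustrative): broadcasting a small dataset lets the large RDD be joined map-side, without a shuffle.
val smallLookup = Map(1 -> "US", 2 -> "UK", 3 -> "IN")        // small dataset, fits in memory
val broadcastLookup = sc.broadcast(smallLookup)
val largeRdd = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (3, "order-c")))
val joined = largeRdd.map { case (countryId, order) =>
  (order, broadcastLookup.value.getOrElse(countryId, "unknown"))   // map-side "join", no shuffle
}
joined.collect()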
14. What is Sliding Window operation?
In networking, a sliding window controls the transmission of data packets between computers. In Spark, the Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
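A minimal Spark Streaming sketch (the batch interval, window and slide durations, and the socket source are illustrative):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))              // 10-second batches
val lines = ssc.socketTextStream("localhost", 9999)          // hypothetical source
val windowedCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20))   // 60s window, slides every 20s
windowedCounts.print()
// ssc.start(); ssc.awaitTermination()   // would start the streaming job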
17. What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
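Example (a small sketch):
import org.apache.spark.storage.StorageLevel
val rdd1 = sc.parallelize(1 to 1000)
rdd1.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
val rdd2 = sc.parallelize(1 to 1000)
rdd2.persist(StorageLevel.MEMORY_AND_DISK)      // persist() lets you choose the storage level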
20. What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
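Example (a small sketch):
val rdd = sc.parallelize(1 to 1000000)
val mapped = rdd.map(_ * 2)               // nothing is computed yet - only recorded
val filtered = mapped.filter(_ % 4 == 0)  // still nothing computed
filtered.count()                          // the action triggers the whole chain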
23. What is "Lineage Graph" in Spark?
Whenever a series of transformations is performed on an RDD, they are not evaluated immediately but lazily (Lazy Evaluation). When a new RDD is created from an existing RDD, the new RDD contains a pointer to the parent RDD. Similarly, all the dependencies between the RDDs are logged in a graph, rather than the actual data. This graph is called the lineage graph.
Spark does not support data replication in memory. In the event of any data loss, the data is rebuilt using the "RDD lineage", a process that reconstructs the lost data partitions.
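You can inspect the lineage of an RDD directly; a minimal sketch:
val base = sc.parallelize(1 to 100)
val lineageRdd = base.map(_ * 2).filter(_ > 50)
println(lineageRdd.toDebugString)   // prints the lineage (dependency) graph of the RDD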
25. What is an “Accumulator”?
"Accumulators" are Spark's offline debuggers. Similar to "Hadoop Counters", accumulators provide the number of "events" in a program. Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections. "aggregateByKey()" and "combineByKey()" use accumulators.
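Example (a minimal sketch using the Spark 2.x numeric accumulator API; the sample lines are illustrative):
val errorCount = sc.longAccumulator("errorCount")
val lines = sc.parallelize(Seq("ok", "ERROR: disk", "ok", "ERROR: net"))
lines.foreach(line => if (line.startsWith("ERROR")) errorCount.add(1))
println(errorCount.value)   // 2 - reliable because foreach is an action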
27. What is SparkSession?
From Spark 2.0.0 onwards, SparkSession provides a single point of entry to interact with the underlying Spark functionality and allows Spark programming with the DataFrame and Dataset APIs. All the functionality available with SparkContext is also available through SparkSession.
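Example (a minimal sketch for a standalone application; the app name and master are illustrative - in spark-shell the session already exists as spark):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("InterviewPrep")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext       // the underlying SparkContext is still available
val df = spark.range(5).toDF("id")
df.show()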
29. What is Partitioner?
A partitioner is an object that defines how the elements in a key-value pair RDD are partitioned by key; it maps each key to a partition ID from 0 to numPartitions - 1. It captures the data distribution at the output. With the help of a partitioner, the scheduler can optimize future operations. The contract of a partitioner ensures that all records for a given key reside on a single partition.
We should choose a partitioner to use for cogroup-like operations. If any of the RDDs already has a partitioner, we should choose that one. Otherwise, a default HashPartitioner is used.
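Example (a small sketch; the sample pairs are illustrative):
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
val partitioned = pairs.partitionBy(new HashPartitioner(4))   // all records for a key land in the same partition
partitioned.partitioner                                        // Some(HashPartitioner)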
30. What are the benefits of DataFrames ?
1. A DataFrame is a distributed collection of data. In DataFrames, data is organized into named columns.
2. They are conceptually similar to a table in a relational database and come with richer optimizations.
3. DataFrames support SQL queries as well as the DataFrame API (see the sketch after this list).
4. We can process both structured and unstructured data formats through them, such as Avro, CSV, Elasticsearch, and Cassandra. They also work with storage systems such as HDFS, Hive tables, MySQL, etc.
5. DataFrames support optimization through the Catalyst optimizer, and general libraries are available to represent trees. DataFrames use Catalyst tree transformation in four phases:
- Analyzing the logical plan to resolve references
- Logical plan optimization
- Physical planning
- Code generation to compile part of the query to Java bytecode
6. The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.
7. They provide Hive compatibility. We can run unmodified Hive queries on an existing Hive warehouse.
8. They can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.
9. DataFrames provide easy integration with Big Data tools and frameworks via Spark Core.
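A small sketch of points 1-3 above (the sample data is illustrative):
import spark.implicits._
val salesDF = Seq(("US", 100), ("UK", 250), ("US", 75)).toDF("country", "amount")
salesDF.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()
salesDF.groupBy("country").sum("amount").show()   // the same result via the DataFrame API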
31. What is Dataset ?
A Dataset is an immutable collection of objects that are mapped to a relational schema. Datasets are strongly typed in nature.
At the core of the Dataset API is an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored using Spark's internal binary format, which allows operations to be carried out on serialized data and improves memory utilization. Spark also supports automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long) and Scala case classes. The Dataset API offers many functional transformations (e.g. map, flatMap, filter).
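Example (a minimal sketch; the Employee case class and values are illustrative):
import spark.implicits._   // brings the implicit Encoders into scope
case class Employee(name: String, salary: Long)
val ds = Seq(Employee("Ann", 90000L), Employee("Raj", 75000L)).toDS()   // encoder generated automatically
ds.filter(_.salary > 80000L).map(_.name).show()                          // typed, functional transformations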
4. For processing demands like high-level expressions, filters, maps, aggregations, averages, sums, SQL queries, columnar access, and the use of lambda functions on semi-structured data, Datasets are best.
5. Datasets provide rich semantics, high-level abstractions, and domain-specific APIs.
35. What is the Difference between DSM and RDD?
On the basis of several features, the difference between RDD and DSM (Distributed Shared Memory) is:
i. Read
RDD - The read operation in RDD is either coarse-grained or fine-grained. Coarse-grained means we can transform the whole dataset but not an individual element of the dataset, while fine-grained means we can transform an individual element of the dataset.
DSM - The read operation in Distributed Shared Memory is fine-grained.
ii. Write
RDD - The write operation in RDD is coarse-grained.
DSM - The write operation is fine-grained in a Distributed Shared Memory system.
iii. Consistency
RDD - The consistency of RDD is trivial, meaning it is immutable in nature. We cannot alter the content of an RDD, i.e. any change on an RDD is permanent. Hence, the level of consistency is very high.
DSM - The system guarantees that if the programmer follows the rules, the memory will be consistent. Also, the results of memory operations will be predictable.
iv. Fault Tolerance
RDD - For each transformation a new RDD is formed, and as RDDs are immutable in nature, lost partitions are easy to recover using the lineage.
DSM - Fault tolerance is achieved by a checkpointing technique, which allows applications to roll back to a recent checkpoint rather than restarting.
v. Straggler Mitigation
Stragglers, in general, are tasks that take more time to complete than their peers. This can happen for many reasons such as load imbalance, I/O blocks, garbage collection, etc. An issue with stragglers is that when the parallel computation is followed by synchronizations such as reductions, all the parallel tasks are forced to wait for the others.
RDD - It is possible to mitigate stragglers in RDDs by using backup tasks.
DSM - Achieving straggler mitigation is quite difficult.
36. What is speculative execution in Spark and how do you enable it?
Speculative execution launches a backup copy of a slow-running task on another node. It does not stop the slow-running task; it launches the new task in parallel and uses whichever copy finishes first.
The related properties are:
Spark Property >> Default Value >> Description
spark.speculation >> false >> Enables (true) or disables (false) speculative execution of tasks.
spark.speculation.interval >> 100ms >> The time interval to use before checking for speculative tasks.
spark.speculation.multiplier >> 1.5 >> How many times slower a task must be than the median to be considered for speculation.
spark.speculation.quantile >> 0.75 >> The fraction of tasks that must be finished before speculation is enabled for a stage.
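A hedged sketch of setting these programmatically (the same properties can be passed to spark-submit with --conf; the app name is illustrative):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")
  .set("spark.speculation.multiplier", "1.5")
  .set("spark.speculation.quantile", "0.75")
// pass this conf when building the SparkContext / SparkSession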
in their memory. Using the lineage graph, those tasks can be re-executed on any other worker node. The data is also replicated to other worker nodes to achieve fault tolerance. There are two cases: data that has been received and already replicated, which survives the failure of a single worker node, and data that has been received but is still buffered for replication, which can be lost unless a Write Ahead Log is used.
Write Ahead Logs (WAL) were introduced in Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, such that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers will be receiving the data, and it will be stored in the executors' memory. With WAL enabled, this received data will also be stored in the log files.
WAL can be enabled as shown below.
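The concrete steps are missing here; a minimal sketch of the usual way to enable it (the app name and checkpoint directory are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // turn on the WAL for receivers
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/wal-demo")   // WAL entries are written under the checkpoint directory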
2. reduceByKey:
Data is combined at each partition, with only one output per key at each partition to send over the network. reduceByKey requires combining all your values into another value with the exact same type.
Example:
sc.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
3. aggregateByKey:
Similar to reduceByKey, but it takes an initial (zero) value.
Example:
val inp = Seq("dinesh=70", "kumar=60", "raja=40", "ram=60", "dinesh=50", "dinesh=80", "kumar=40", "raja=40")
val rdd = sc.parallelize(inp, 3)
val pairRdd = rdd.map(_.split("=")).map(x => (x(0), x(1)))
val initial_val = 0
val addOp = (intVal: Int, strVal: String) => intVal + strVal.toInt
val mergeOp = (p1: Int, p2: Int) => p1 + p2
val out = pairRdd.aggregateByKey(initial_val)(addOp, mergeOp)
out.collect.foreach(println)
4. combineByKey:
With combineByKey, values are merged into one value at each partition, and then each partition's value is merged into a single value. It's worth noting that the type of the combined value does not have to match the type of the original value, and often it won't.
combineByKey takes 3 functions as input:
1. createCombiner
2. mergeValue
3. mergeCombiners
Example:
val inp = Array(("Dinesh", 98.0), ("Kumar", 86.0), ("Kumar", 81.0), ("Dinesh", 92.0), ("Dinesh", 83.0), ("Kumar", 88.0))
val rdd = sc.parallelize(inp, 2)
//Function to merge across the partitions
val mergeCombiners = (partOutput1: (Int, Double), partOutput2: (Int, Double)) => {
  (partOutput1._1 + partOutput2._1, partOutput1._2 + partOutput2._2)
}
//Function to calculate the average; personinp is a (name, (count, sum)) pair
val CalculateAvg = (personinp: (String, (Int, Double))) => {
  val (name, (numofinps, inp)) = personinp
  (name, inp / numofinps)
}
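The createCombiner and mergeValue functions and the final combineByKey call are missing here; a minimal sketch of those pieces, assuming we accumulate a (count, sum) pair per key and then take the average:
//Function to create the initial (count, sum) combiner for the first value of a key in a partition
val createCombiner = (inp: Double) => (1, inp)
//Function to merge a new value into the (count, sum) pair within a partition
val mergeValue = (partOutput: (Int, Double), inp: Double) => (partOutput._1 + 1, partOutput._2 + inp)
val combined = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
val averages = combined.map(CalculateAvg)
averages.collect.foreach(println)   // e.g. (Dinesh, 91.0), (Kumar, 85.0)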
mapPartitions() is called once for each partition, while map() and foreach() are called for each element in an RDD. Hence one can do initialization on a per-partition basis rather than on a per-element basis.
mapPartitionsWithIndex():
It is similar to mapPartitions(), but with one difference: it takes two parameters. The first parameter is the partition index and the second is an iterator through all the items within that partition (Int, Iterator<T>). In other words, mapPartitionsWithIndex is mapPartitions() with an extra index parameter that keeps track of the partition.
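Example (a small sketch):
val rdd = sc.parallelize(1 to 10, 3)
val withIndex = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(value => s"partition $index -> $value")
}
withIndex.collect().foreach(println)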
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. The fold operation may be applied to partitions individually, and then those per-partition results are folded into the final result, rather than applying the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
zeroValue: the initial value for the accumulated result of each partition for the op operator, and also the initial value for combining the results from different partitions for the op operator. This will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation).
op: an operator used both to accumulate results within a partition and to combine results from different partitions.
Example:
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
rdd1.fold(5)(_ + _)
Output: Int = 35
(The zero value 5 is added once per partition (3 * 5 = 15) and once more when combining the partition results, so the total is 15 + 15 + 5 = 35.)
41. What is the difference between textFile() and wholeTextFiles()?
Both are methods of org.apache.spark.SparkContext.
textFile():
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings. For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each individual line is an element.
wholeTextFiles():
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Rather than creating a basic RDD, wholeTextFiles() returns a pair RDD. For example, if you have a few files in a directory, then by using the wholeTextFiles() method it creates a pair RDD with the filename (with path) as the key and the whole file content as the string value.
Example:
val myfilerdd = sc.wholeTextFiles("/home/hdadmin/MyFiles")
val keyrdd = myfilerdd.keys
keyrdd.collect
val filerdd = myfilerdd.values
filerdd.collect
42. What is the cogroup() operation?
It's a transformation in the package org.apache.spark.rdd.PairRDDFunctions:
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
For each key k in this or other1 or other2 or other3, it returns a resulting RDD that contains a tuple with the list of values for that key in this, other1, other2 and other3.
Example:
val myrdd1 = sc.parallelize(List((1,"spark"), (2,"HDFS"), (3,"Hive"), (4,"Flink"), (6,"HBase")))
val myrdd2 = sc.parallelize(List((4,"RealTime"), (5,"Kafka"), (6,"NOSQL"), (1,"stream"), (1,"MLlib")))
val result = myrdd1.cogroup(myrdd2)
result.collect
Output:
Array[(Int, (Iterable[String], Iterable[String]))] = Array(
  (4, (CompactBuffer(Flink), CompactBuffer(RealTime))),
  (1, (CompactBuffer(spark), CompactBuffer(stream, MLlib))),
  (6, (CompactBuffer(HBase), CompactBuffer(NOSQL))),
  (3, (CompactBuffer(Hive), CompactBuffer())),
  (5, (CompactBuffer(), CompactBuffer(Kafka))),
  (2, (CompactBuffer(HDFS), CompactBuffer())))
43. Explain the pipe() operation.
It returns an RDD created by piping elements to a forked external process:
def pipe(command: String): RDD[String]
In general, Spark programs are written using Scala, Java, or Python. However, if that is not enough and one wants to pipe (inject) data into code written in another language, like R, Spark provides a general mechanism in the form of the pipe() method on RDDs. With Spark's pipe() method, one can write a transformation of an RDD that reads each element of the RDD from standard input as a String and writes the results as Strings to standard output.
Example:
test.py
#!/usr/bin/python
import sys
for line in sys.stdin:
    print "hello " + line
spark-shell (Scala):
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)
val scriptPath = "./test.py"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.foreach(println)
44. Explain the coalesce() operation.
It's in the package org.apache.spark.rdd.ShuffledRDD:
def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[(K, C)] = null): RDD[(K, C)]
It returns a new RDD that is reduced into numPartitions partitions.
Example:
val myrdd1 = sc.parallelize(1 to 1000, 15)
myrdd1.partitions.length
val myrdd2 = myrdd1.coalesce(5, false)
myrdd2.partitions.length
Int = 5
If you are decreasing the number of partitions in
this RDD, consider using coalesce, which can avoid
performing a shuffle.
Example :
val rdd1 = sc.parallelize(1 to 100, 3)
rdd1.getNumPartitions
val rdd2 = rdd1.repartition(6)
rdd2.getNumPartitions
Example:
val myrdd1 = sc.parallelize(List(5, 7, 9, 13, 51, 89))
myrdd1.top(3)          // Array[Int] = Array(89, 51, 13)
myrdd1.takeOrdered(3)  // Array[Int] = Array(5, 7, 9)
Example:
val rdd1 = sc.parallelize(Seq(("myspark", 78), ("Hive", 95), ("spark", 15), ("HBase", 25), ("spark", 39), ("BigData", 78), ("spark", 49)))
rdd1.lookup("spark")
rdd1.lookup("Hive")
rdd1.lookup("BigData")
Output:
Seq[Int] = WrappedArray(15, 39, 49)
Seq[Int] = WrappedArray(95)
Seq[Int] = WrappedArray(78)
49. How do you stop INFO messages from displaying on the Spark console?
Edit the conf/log4j.properties file in your Spark directory and change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
|  5|  Robert| 41|     M|
|  6|  Sandra| 45|     F|
+---+--------+---+------+
scala> custDF.show
+---+-----+---+------+
|cId| name|age|gender|
+---+-----+---+------+
|  1|James| 21|     M|
|  2|  Liz| 25|     F|
+---+-----+---+------+
communicate. If the shared secret is not identical
they will not be allowed to communicate. The
shared secret is created as follows:
For Spark on YARN deployments, configuring
spark.authenticate to true will automatically handle
generating and distributing the shared secret. Each
application will use a unique shared secret.
For other types of Spark deployments, the Spark
parameter spark.authenticate.secret should be
configured on each of the nodes. This secret will be
used by all the Master/Workers and applications.
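A hedged sketch for a non-YARN deployment (the secret value is illustrative; the same value must be configured on every node):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "my-shared-secret")   // illustrative value only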
Variables are like values, except you can re-assign them. You can define a variable with the var keyword.
Example:
var x = 1 + 1
x = 3   // this compiles because x is a var
Functions:
Functions are expressions that take parameters. You can define an anonymous function (i.e. one with no name) that returns a given integer plus one:
(x: Int) => x + 1
You can also name functions:
val addOne = (x: Int) => x + 1
println(addOne(1))   // 2
Methods look and behave much like functions, but are defined with the def keyword:
def add(x: Int, y: Int): Int = x + y
println(add(1, 2))   // 3
56. What is case classes in Scala?
Scala has a special type of class called a “case”
class. By default, case classes are immutable and
compared by value. You can define case classes
with the case class keywords.
Example:
case class Point(x: Int, y: Int)
val point = Point(1, 2)
val anotherPoint = Point(1, 2)
val yetAnotherPoint = Point(2, 2)
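Because case classes are compared by value, the instances above compare as you would expect:
println(point == anotherPoint)     // true  - same x and y values
println(point == yetAnotherPoint)  // false - different x value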
59. What is Companion objects in scala?
An object with the same name as a class is called a
companion object. Conversely, the class is the
object’s companion class. A companion class or
object can access the private members of its
companion. Use a companion object for methods
and values which are not specific to instances of
the companion class.
Example:
import scala.math._
case class Circle(radius: Double) {
  import Circle._
  def area: Double = calculateArea(radius)
}
object Circle {
  private def calculateArea(radius: Double): Double = Pi * pow(radius, 2.0)
}
Sample Example:
val list: List[Any] = List(
"a string",
732, // an integer
'c', // a character
true, // a boolean value
() => "an anonymous function returning a string"
)
list.foreach(element => println(element))
AnyVal:
AnyVal represents value types. There are nine
predefined value types and they are non-nullable:
Double, Float, Long, Int, Short, Byte, Char, Unit, and
Boolean. Unit is a value type which carries no
meaningful information. There is exactly one
instance of Unit which can be declared literally like
so: (). All functions must return something so
sometimes Unit is a useful return type.
AnyRef:
AnyRef represents reference types. All non-value
types are defined as reference types. Every user-
defined type in Scala is a subtype of AnyRef. If
Scala is used in the context of a Java runtime
environment, AnyRef corresponds to
java.lang.Object.
Nothing:
Nothing is a subtype of all types, also called the
bottom type. There is no value that has type
Nothing. A common use is to signal non-termination
such as a thrown exception, program exit, or an
infinite loop (i.e., it is the type of an expression
which does not evaluate to a value, or a method
that does not return normally).
Null:
Null is a subtype of all reference types (i.e. any subtype of AnyRef). It has a single value identified by the keyword literal null. Null is provided mostly for interoperability with other JVM languages and should almost never be used in Scala code; prefer alternatives such as Option.
63. What is Pattern matching in Scala?
Pattern matching is a mechanism for checking a
value against a pattern. A successful match can
also deconstruct a value into its constituent parts. It
is a more powerful version of the switch statement
in Java and it can likewise be used in place of a
series of if/else statements.
Syntax:
import scala.util.Random
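The match expression itself is not shown here; a minimal sketch based on the classic random-number example:
val x: Int = Random.nextInt(10)
x match {
  case 0 => "zero"
  case 1 => "one"
  case 2 => "two"
  case _ => "other"
}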
64. What are the basic properties available in Spark?
It may be useful to provide some simple definitions for the Spark nomenclature:
Worker Node: A server that is part of the cluster and is available to run Spark jobs.
Master Node: The server that coordinates the Worker Nodes.
Executor: A sort of virtual machine inside a node; one node can have multiple executors.
Driver Node: The node that initiates the Spark session. Typically, this will be the server where the context is located.
Driver (Executor): The Driver Node will also show up in the executor list.
started, which in turn impact the amount of storage
available for the session. For more information,
please see the Dynamic Resource Allocation page in
the official Spark website.
68. When should you not rely on default type inference?
The type inferred for obj was Null. Since the only value of that type is null, it is impossible to assign a different value to it later.
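A small sketch of the situation described (the variable names are illustrative):
var obj = null          // the inferred type is Null
// obj = "hello"        // does not compile: String is not a subtype of Null
var ref: AnyRef = null  // declare the intended type explicitly instead
ref = "hello"           // fine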
3. server: whether this process should be the server when talking to the debugger (or, conversely, the client); you always need one server and one client. In this case, we are going to be the server and wait for a connection from the debugger.
4. suspend: whether to pause execution until a debugger has successfully connected. We turn this on so the driver won't start until the debugger connects.
5. address: the port to listen on for incoming debugger connection requests. You can set it to any available port; you just have to make sure the debugger is configured to connect to this same port.
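A hedged example of how these options are typically combined when launching the driver (port 5005 is illustrative):
spark-submit --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" ...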
pool of memory. This makes it attractive in environments with large heaps or multiple concurrent applications.
2. Iterations:
By exploiting its streaming architecture, Flink allows you to natively iterate over data, something Spark supports only as batches.
3. Memory Management:
Spark jobs have to be optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right.
4. Maturity:
Flink is still in its infancy and has only a few production deployments.
5. Data Flow:
In contrast to the procedural programming paradigm, Flink follows a distributed data flow approach. For data set operations where
intermediate results are required in addition to the regular input of an operation, broadcast variables are used to distribute the pre-calculated results to all worker nodes.
considered).
Apache Storm is focused on stream processing or
what some call complex event processing. Storm
implements a fault tolerant method for performing
a computation or pipelining multiple computations
on an event as it flows into a system. One might use
Storm to transform unstructured data as it flows
into a system into a desired format.
Storm and Spark are focused on fairly different use
cases. The more "apples-to-apples" comparison
would be between Storm Trident and Spark
Streaming. Since Spark's RDDs are inherently
immutable, Spark Streaming implements a method
for "batching" incoming updates in user-defined
time intervals that get transformed into their own
RDDs. Spark's parallel operators can then perform
computations on these RDDs. This is different from
Storm which deals with each event individually.
One key difference between these two technologies
is that Spark performs Data-Parallel computations
while Storm performs Task-Parallel computations.
Either design makes trade-offs that are worth knowing.
76. How do you read multiple text files into a single RDD?
You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards, like below.
Eg.:
val rdd = sc.textFile("file:///D:/Dinesh.txt,file:///D:/Dineshnew.txt")
Example: The instance stack can only take Int values.
val stack = new Stack[Int]
stack.push(1)
stack.push(2)
println(stack.pop)   // prints 2
println(stack.pop)   // prints 1
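The Stack class definition itself is not shown here; a minimal generic-class sketch (based on the standard Scala documentation example) that makes the snippet above runnable:
class Stack[A] {
  private var elements: List[A] = Nil
  def push(x: A): Unit = { elements = x :: elements }   // prepend to the internal list
  def peek: A = elements.head
  def pop: A = {
    val currentTop = peek
    elements = elements.tail                            // remove the element that was returned
    currentTop
  }
}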
Set by yarn.scheduler.minimum-allocation-mb, every container always allocates at least this amount of memory. This means that if the parameter --executor-memory is set to e.g. only 1g but yarn.scheduler.minimum-allocation-mb is e.g. 6g, the container is much bigger than needed by the Spark application.
The other way round, if the parameter --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g, the container will allocate more memory dynamically, but only if the requested amount of memory is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.
The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in total by all containers on one host.
So setting yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors (otherwise it would be a waste of memory). Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb) allows you to define bigger executors (more memory is allocated if needed, e.g. by the --executor-memory parameter).
82. How do you allocate memory for Spark jobs in a cluster?
Before answering this question we have to concentrate on the 3 main settings:
1. Number of executors (--num-executors)
2. Executor memory (--executor-memory)
3. Number of executor cores (--executor-cores)
Let's go with an example now. Imagine we have a cluster with six nodes running NodeManagers, each with 16 cores and 64GB RAM. The NodeManager sizes, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We never give 100% of each resource to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
So a likely first attempt would be --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
- 63GB plus the executor memory overhead won't fit within the 63GB RAM of the NodeManagers.
- The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node.
- 15 cores per executor can lead to bad HDFS I/O throughput.
So the best option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. This configuration results in three executors on all nodes except for the one with the Application Master, which will have two executors. --executor-memory was derived as 63GB / 3 executors per node = 21GB; the memory overhead is about 7%, 21 * 0.07 = 1.47GB; and 21 - 1.47 is roughly 19GB.
Example:
val df = spark.range(100000)
val df1= df.filter('id < 1000)
val df2= df.filter('id >= 1000)
print(df1.count() + df2.count()) //100000
Hadoop distributions.
One advantage of Mesos over both YARN and
standalone mode is its fine-grained sharing option,
which lets interactive applications such as the
Spark shell scale down their CPU allocation between
commands. This makes it attractive in environments
where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same
nodes as HDFS for fast access to storage. You can
install Mesos or the standalone cluster manager on
the same nodes manually, or most Hadoop
distributions already install YARN and HDFS together.
88. What is Stateful Transformation ?
A transformation that uses data or intermediate results from previous batches to compute the result of the current batch is called a stateful transformation. Stateful transformations are operations on DStreams that track data across time; they make use of some data from previous batches to generate the results for a new batch.
In streaming, if we have a use case that requires tracking data across batches, then we need stateful DStreams. For example, we may track a user's interaction on a website during the user session, or we may track a particular Twitter hashtag across time and see which users across the globe are talking about it.
Types of stateful transformations: windowed operations, which act over a sliding window of batches, and updateStateByKey, which tracks state across all batches.
This brings our list of 80+ SPARK for Big Data
interview questions to an end.
ZEP ANALYTICS
zepanalytics.com