Spark Interview Questions
3. How can you use Machine Learning library SciKit library which is written in Python, with
Spark engine?
Ans: A machine learning tool written in Python, such as the scikit-learn library, can be used with Spark
either by wrapping it in the Pipeline API of Spark MLlib or by calling pipe() to stream records through an external process.
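For illustration, here is a minimal Scala sketch of the pipe() approach: each record is streamed through an external process over stdin/stdout. The script name score.py is hypothetical and stands for wherever a scikit-learn model would be invoked.
val input = sc.parallelize(Seq("1.0,2.0", "3.0,4.0"))
val scored = input.pipe("python score.py")   // one line in, one line out per record
scored.collect().foreach(println)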
4. Why is Spark good at low-latency iterative workloads, e.g. graphs and machine learning?
Ans: Machine learning algorithms, for instance logistic regression, require many iterations before
producing an optimal model, and similarly graph algorithms traverse all the nodes
and edges. Any algorithm that needs many iterations before producing a result can increase its
performance when the intermediate partial results are stored in memory or on very fast solid
state drives.
Spark can cache/store intermediate data in memory for faster model building and training.
Also, when graph algorithms are processed, the graph is traversed one connection per iteration
with the partial result kept in memory. Less disk access and network traffic can make a huge
difference when you need to process lots of data.
5. What kinds of data processing does Spark support?
Ans: Spark offers three kinds of data processing using batch, interactive (Spark Shell), and
stream processing with the unified API and data structures.
A Spark context can be used to create RDDs, accumulators and broadcast variables, access Spark
services and run jobs.
A SparkContext is essentially a client of Spark's execution environment and it acts as the master
of your Spark application.
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
8. What are the ways to configure Spark properties? Order them from least important to the
most important.
Ans: There are the following ways to set up properties for Spark and user programs (in the order
of importance from the least important to the most important):
conf/spark-defaults.conf - the default
--conf - the command line option used by spark-shell and spark-submit
SparkConf
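As a rough sketch of how the three layers interact (the property name spark.executor.memory and the values shown are just example settings; SparkConf takes the highest precedence):
import org.apache.spark.{SparkConf, SparkContext}

// conf/spark-defaults.conf could contain:  spark.executor.memory 2g
// the command line could override it:      spark-submit --conf spark.executor.memory=4g ...
// and SparkConf overrides both:
val conf = new SparkConf()
  .setAppName("ConfDemo")
  .set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)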
9. What is the Default level of parallelism in Spark?
Ans: Default level of parallelism is the number of partitions when not specified explicitly by a
user.
13. Give a few examples of how an RDD can be created using SparkContext.
Ans: SparkContext allows you to create many different RDDs from input sources like:
Scala's collections: e.g. sc.parallelize(0 to 100)
Local or remote filesystems : sc.textFile("README.md")
Any Hadoop InputSource : using sc.newAPIHadoopFile
14. How would you broadcast a collection of values over the Spark executors?
Ans: sc.broadcast("hello")
18. How can you stop SparkContext and what is the impact if stopped?
Ans: You can stop a Spark context using SparkContext.stop() method. Stopping a Spark context
stops the Spark Runtime Environment and effectively shuts down the entire Spark application.
20. How would you set the amount of memory to allocate to each executor?
Ans: SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor.
You can think of actions as a valve: until an action is fired, the data to be processed is not
even in the pipes, i.e. the transformations. Only actions can materialize the entire processing
pipeline with real data.
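A minimal Scala sketch of this behaviour (the data here is made up): nothing is computed until the final count() is called.
val nums = sc.parallelize(1 to 1000)     // nothing is computed yet
val doubled = nums.map(_ * 2)            // transformation: only recorded in the lineage
val evens = doubled.filter(_ % 4 == 0)   // still nothing has run
println(evens.count())                   // action: the whole pipeline executes now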
31. How does execution start and end on an RDD or a Spark job?
Ans: Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or
reference cached data) and ends with the RDD that produces the result of the action that has
been called to execute.
All of the tuples with the same key must end up in the same partition, processed by the same
task. To satisfy these operations, Spark must execute RDD shuffle, which transfers data across
cluster and results in a new stage with a new set of partitions. (54)
36. Data is spread across all the nodes of the cluster; how does Spark try to process this data?
Ans: By default, Spark tries to read data into an RDD from the nodes that are close to it. Since
Spark usually accesses distributed partitioned data, to optimize transformation operations it
creates partitions to hold the data chunks
37. How would you hint at the minimum number of partitions for a transformation?
Ans: You can request for the minimum number of partitions, using the second input parameter
to many transformations.
scala> sc.parallelize(1 to 100, 2).count
The preferred way to set the number of partitions for an RDD is to pass it directly as the second
input parameter in the call, like rdd = sc.textFile("hdfs://… /file.txt", 400), where 400 is the
number of partitions. In this case, the partitioning makes for 400 splits that would be done by
Hadoop's TextInputFormat, not Spark, and it would work much faster. It's also that the code
spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.
38. How many concurrent tasks can Spark run for an RDD partition?
Ans: Spark can only run 1 concurrent task for every partition of an RDD, up to the number of
cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to at least have
50 partitions (and probably 2-3x times that).
As far as choosing a "good" number of partitions, you generally want at least as many as the
number of executors for parallelism. You can get this computed value by calling
sc.defaultParallelism .
40. When Spark works with file.txt.gz, how many partitions can be created?
Ans: When using textFile with compressed files ( file.txt.gz not file.txt or similar), Spark disables
splitting that makes for an RDD with only 1 partition (as reads against gzipped files cannot be
parallelized). In this case, to change the number of partitions you should do repartitioning.
Please note that Spark disables splitting for compressed files and creates RDDs with only 1
partition. In such cases, it's helpful to use sc.textFile('demo.gz') and do repartitioning using
rdd.repartition(100) as follows:
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.
42. What is the difference between cache() and persist() method of RDD
Ans: RDDs can be cached (using RDD's cache() operation) or persisted (using RDD's
persist(newLevel: StorageLevel) operation). The cache() operation is a synonym of persist() that
uses the default storage level MEMORY_ONLY.
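A minimal Scala sketch ("data.txt" and "other.txt" are placeholder paths):
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")
lines.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)

val other = sc.textFile("other.txt")
other.persist(StorageLevel.MEMORY_AND_DISK_SER)  // explicitly chosen storage level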
43. You have RDD storage level defined as MEMORY_ONLY_2 , what does _2 means ?
Ans: The _2 suffix in the name denotes 2 replicas of each partition.
47. When you call join operation on two pair RDDs e.g. (K, V) and (K, W), what is the result?
Ans: When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with
all pairs of elements for each key [68]
You mark an RDD for checkpointing by calling RDD.checkpoint() . The RDD will be saved to a file
inside the checkpoint directory and all references to its parent RDDs will be removed. This
function has to be called before any job has been executed on this RDD.
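A minimal Scala sketch of the sequence (the checkpoint directory is a placeholder; a reliable store such as HDFS is typical in production):
sc.setCheckpointDir("/tmp/spark-checkpoints")
val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()                                // mark the RDD before any job has run on it
rdd.count()                                     // the first action also materializes the checkpoint files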
49. What do you mean by Dependencies in RDD lineage graph?
Ans: Dependency is a connection between RDDs after applying a transformation.
50. Which script would you use to launch a Spark application?
Ans: You use spark-submit script to launch a Spark application, i.e. submit the application to a
Spark deployment environment.
DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent
events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes
sequentially.
A BlockManager manages the storage for most of the data in Spark, i.e. block that represent a
cached RDD partition, intermediate shuffle data, and broadcast data.
With HDFS the Spark driver contacts NameNode about the DataNodes (ideally local) containing
the various blocks of a file or directory as well as their locations (represented as InputSplits ),
and then schedules the work to the Spark workers. Spark's compute nodes / workers should be
running on storage nodes.
68. What are the data sources Spark can process?
Ans:
Hadoop File System (HDFS)
Cassandra (NoSQL databases)
HBase (NoSQL database)
S3 (Amazon WebService Storage : AWS Cloud)
1. What is the difference between Apache Spark and Hadoop MapReduce?
Ans:
Apache Spark | Hadoop MapReduce
Spark runs almost 100 times faster than Hadoop MapReduce. | Hadoop MapReduce is slower when it comes to large-scale data processing.
Spark stores data in RAM, i.e. in-memory, so it is easier to retrieve it. | Hadoop MapReduce data is stored in HDFS and hence takes a longer time to retrieve.
Spark provides caching and in-memory data storage. | Hadoop is highly disk-dependent.
2. What are the main categories that comprise the Apache Spark ecosystem?
Ans:
Apache Spark has 3 main categories that comprise its ecosystem. Those are:
Language support: Spark can integrate with different languages to applications and perform
analytics. These languages are Java, Python, Scala, and R.
Core Components: Spark supports 5 main core components. There are Spark Core, Spark SQL,
Spark Streaming, Spark MLlib, and GraphX.
Cluster Management: Spark can be run in 3 environments. Those are the Standalone cluster,
Apache Mesos, and YARN.
3. What are the different cluster managers available in Apache Spark?
Ans:
Standalone Mode: By default, applications submitted to the standalone mode cluster will run in FIFO
order, and each application will try to use all available nodes. You can launch a standalone cluster
either manually, by starting a master and workers by hand or use our provided launch scripts. It is
also possible to run these daemons on a single machine for testing.
Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can
also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic
partitioning between Spark and other frameworks as well as scalable partitioning between multiple
instances of Spark.
Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on
YARN as well.
Kubernetes: Kubernetes is an open-source system for automating deployment, scaling, and
management of containerized applications.
4. What is lazy evaluation in Spark?
Ans:
When Spark operates on any dataset, it remembers the instructions. When a transformation such as
a map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are
not evaluated until you perform an action, which aids in optimizing the overall data processing
workflow, known as lazy evaluation.
5. What makes Spark good at low latency workloads like graph processing and Machine
Learning?
Ans:
Apache Spark stores data in-memory for faster processing and building machine learning models.
Machine Learning algorithms require multiple iterations and different conceptual steps to create an
optimal model. Graph algorithms traverse through all the nodes and edges to generate a graph.
Because Spark keeps the intermediate results of these iterations in memory, such low-latency workloads see a significant performance gain.
6. How can you connect Spark to Apache Mesos?
Ans:
There are a total of 4 steps that can help you connect Spark to Apache Mesos:
1. Configure the Spark driver program to connect to Mesos.
2. Put the Spark binary package in a location accessible by Mesos.
3. Install Apache Spark in the same location as Apache Mesos.
4. Configure the spark.mesos.executor.home property to point to the location where Spark is installed.
7. What is a Parquet file?
Ans:
Parquet is a columnar format that is supported by several data processing systems. With the
Parquet file, Spark can perform both read and write operations.
8. What is shuffling in Spark?
Ans:
Shuffling is the process of redistributing data across partitions that may lead to data movement
across the executors. The shuffle operation is implemented differently in Spark compared to
Hadoop.
spark.shuffle.compress – checks whether the engine would compress shuffle outputs or not
spark.shuffle.spill.compress – decides whether to compress intermediate shuffle spill files or not
It occurs while joining two tables or while performing byKey operations such as GroupByKey or
ReduceByKey
9. What is Spark Core?
Ans:
Spark Core is the engine for parallel and distributed processing of large data sets. The various
functionalities supported by Spark Core include:
Scheduling and monitoring jobs
Memory management
Fault recovery
Interaction with storage systems
10. What is the difference between transformations and actions in Spark?
Ans:
Transformations: Transformations are operations that are performed on an RDD to create a new
RDD containing the results (Example: map, filter, join, union)
Actions: Actions are operations that return a value after running a computation on an RDD
(Example: reduce, first, count)
11. How to programmatically specify a schema for DataFrame?
Ans:
When a case class cannot be defined ahead of time, a DataFrame can be created programmatically in
three steps: create an RDD of Rows from the original RDD, create a schema represented by a
StructType that matches the structure of the Rows, and apply the schema to the RDD of Rows via the
createDataFrame method.
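A minimal Scala sketch of these three steps; the sample records and the name/age fields are made-up example data:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("SchemaDemo").getOrCreate()

// 1) an RDD of Rows built from raw strings
val rowRDD = spark.sparkContext
  .parallelize(Seq("Alice,30", "Bob,25"))
  .map(_.split(","))
  .map(attrs => Row(attrs(0), attrs(1).trim.toInt))

// 2) the schema described as a StructType
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// 3) apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.printSchema()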
12. What is a lineage graph?
Ans:
A Lineage Graph is a dependencies graph between the existing RDD and the new RDD. It means
that all the dependencies between the RDD will be recorded in a graph, rather than the original
data.
The need for an RDD lineage graph happens when we want to compute new RDD or if we want to
recover the lost data from the lost persisted RDD.Spark does not support data replication in memory.
So, if any data is lost, it can be rebuilt using RDD lineage. It is also called an RDD operator graph or
RDD dependency graph.
13. Which transformation returns a new DStream by selecting only those records of the
source DStream for which the function returns true?
1. map(func)
2. transform(func)
3. filter(func)
4. count()
Ans:
3) filter(func).
14. Does Apache Spark provide checkpoints?
Ans:
Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the
process of making streaming applications resilient to failures. It allows you to save the data and
metadata into a checkpointing directory. In case of a failure, the spark can recover this data and start
from wherever it has stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to
fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and
incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because its need arises in some of
the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous
batches.
16. What is the difference between map() and flatMap() in Spark?
Ans:
map() | flatMap()
A map function returns a new DStream/RDD by passing each element of the source through a function. | It is similar to the map function; it applies the function to each element of the RDD and returns the result as a new RDD.
Spark's map function takes one element as input, processes it according to custom code (specified by the developer) and returns one element at a time. | FlatMap allows returning 0, 1 or more elements from the map function.
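A minimal Scala sketch of the difference, using a made-up two-line dataset:
val lines = sc.parallelize(Seq("to be or", "not to be"))
lines.map(_.split(" ")).collect()      // Array(Array(to, be, or), Array(not, to, be)): one output per input
lines.flatMap(_.split(" ")).collect()  // Array(to, be, or, not, to, be): zero or more outputs per input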
17. How would you compute the total count of unique words in Spark?
Ans:
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt");
2. Define a function to split each line into words:
def toWords(line):
    return line.split();
3. Run the toWords function on each element of the RDD in Spark as a flatMap transformation:
words = lines.flatMap(toWords);
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1);
wordTuple = words.map(toTuple);
5. Perform a reduceByKey() transformation to count each word:
counts = wordTuple.reduceByKey(lambda a, b: a + b);
6. Print:
counts.collect()
19. What is a Sparse Vector?
Ans:
A Sparse vector is a type of local vector which is represented by an index array and a value array.
20. What is Spark SQL?
Ans:
Spark SQL is Apache Spark's module for working with structured data.
Spark SQL loads the data from a variety of structured data sources.
It queries data using SQL statements, both inside a Spark program and from external tools that
connect to Spark SQL through standard database connectors (JDBC/ODBC).
It provides a rich integration between SQL and regular Python/Java/Scala code, including the ability
to join RDDs and SQL tables and expose custom functions in SQL.
21. What are the different types of operators provided by the Apache GraphX library?
Ans:
Property Operator: Property operators modify the vertex or edge properties using a user-defined
map function and produce a new graph.
Structural Operator: Structure operators operate on the structure of an input graph and produce a
new graph.
Join Operator: Join operators add data to graphs and generate new graphs.
22. What are the analytic algorithms provided in Apache Spark GraphX?
Ans:
GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphX includes a set of
graph algorithms to simplify analytics tasks. The algorithms are contained in the
org.apache.spark.graphx.lib package and can be accessed directly as methods on Graph via
GraphOps.
PageRank: PageRank is a graph parallel computation that measures the importance of each vertex
in a graph. Example: You can run PageRank to evaluate what the most important pages in
Wikipedia are.
Connected Components: The connected components algorithm labels each connected component
of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected
components can approximate clusters.
Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with an edge
between them. GraphX implements a triangle counting algorithm in the TriangleCount object that
determines the number of triangles passing through each vertex, providing a measure of clustering.
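A minimal Scala sketch of calling these algorithms, assuming graph is an existing org.apache.spark.graphx.Graph built from your own data:
val ranks = graph.pageRank(0.0001).vertices            // importance score of each vertex
val components = graph.connectedComponents().vertices  // lowest vertex id per connected component
val triangles = graph.triangleCount().vertices          // number of triangles through each vertex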
23. What are custom profilers in PySpark?
Ans:
Custom profilers are supported in PySpark to allow for different profilers to be used and for
outputting to different formats than what is offered in the BasicProfiler.
25. Does PySpark provide a machine learning API?
Ans:
Yes. Just as Spark provides a machine learning API called MLlib, PySpark exposes the same
machine learning API in Python.
26. What are the main modules of PySpark MLlib?
Ans:
mllib.classification
mllib.clustering
mllib.fpm
mllib.linalg
mllib.recommendation
spark.mllib
mllib.regression
27. Name the parameters of SparkContext.
Ans:
The main parameters that can be passed to SparkContext are: Master (the URL of the cluster to connect to),
appName (the application name), sparkHome (the Spark installation directory on worker nodes),
pyFiles (.zip or .py files to ship to the cluster), environment (worker environment variables),
batchSize, serializer, and conf (a SparkConf object).
29. What Makes Apache Spark Good At Low-latency Workloads Like Graph Processing And
Machine Learning?
Ans:
Apache Spark stores data in-memory for faster model building and training. Machine learning
algorithms require multiple iterations to generate an optimal resulting model, and similarly graph
algorithms traverse all the nodes and edges. Because Spark keeps the partial results of these
iterations in memory, such low-latency workloads see increased performance. Less disk access
and controlled network traffic make a huge difference when there is lots of data to be processed.
30. Is It Necessary To Start Hadoop To Run Any Apache Spark Application ?
Ans:
Starting Hadoop is not mandatory to run any Spark application. Apache Spark has no separate
storage of its own, so it can use Hadoop HDFS, but this is not mandatory. The data can be stored in
the local file system, loaded from the local file system, and processed.
31. What is the default level of parallelism in Apache Spark?
Ans:
If the user does not explicitly specify it, the number of partitions is considered the default level of
parallelism in Apache Spark.
32. What are the steps involved in a typical Spark program?
Ans:
The foremost step in a Spark program involves creating input RDDs from external data.
Use various RDD transformations like filter() to create new transformed RDDs based on the
business logic.
persist() any intermediate RDDs which might have to be reused in the future.
Launch various RDD actions like first() and count() to begin parallel computation, which will then be
optimized and executed by Spark.
33. Name A Few Commonly Used Spark Ecosystems.
Ans:
The commonly used Spark ecosystem components are Spark SQL, Spark Streaming, MLlib, and GraphX.
Ans:
Whenever there is data streaming in continuously and you want to process the
data as early as reasonably possible, you can take advantage of
Spark Streaming.
Ans:
Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of
that.
Ans:
Spark SQL, better known as Shark, is a novel module introduced in Spark to work with structured
data and perform structured data processing. Through this module, Spark executes relational SQL
queries on the data. The core of the component supports an altogether different RDD called
SchemaRDD, composed of row objects and schema objects defining the data type of each column in
the row. It is similar to a table in a relational database.
Ans:
Parquet is a columnar format file supported by many other data processing systems. Spark SQL
performs both read and write operations with Parquet files and considers it to be one of the best big
data analytics formats so far.
Ans:
Spark is a parallel data processing framework. It allows developers to develop fast, unified big data
applications that combine batch, streaming and interactive analytics.
Ans:
Hive is a component of Hortonworks' Data Platform (HDP). Hive provides an SQL-like interface to
data stored in the HDP. Spark users will automatically get the complete set of Hive's rich features,
including any new features that Hive might introduce in the future.
The main task around implementing the Spark execution engine for Hive lies in query planning,
where Hive operator plans from the semantic analyzer which is translated to a task plan that Spark
can execute. It also includes query execution, where the generated Spark plan gets actually
executed in the Spark cluster.
“Parquet” is a columnar format file supported by many data processing systems. Spark SQL
performs both read and write operations with the “Parquet” file.
Ans:
Due to the availability of in-memory processing, Spark implements the processing around 10-100x
faster than Hadoop MapReduce. MapReduce makes use of persistence storage for any of the data
processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, like
batch processing, streaming, machine learning, and interactive SQL queries. However, Hadoop only
supports batch processing.
Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage
Spark is capable of performing computations multiple times on the same dataset. This is called
iterative computation while there is no iterative computing implemented by Hadoop.
Ans:
SparkSQL is a special component on the spark Core engine that supports SQL and Hive Query
Language without changing any syntax. It's possible to join SQL table and HQL table.
SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer
arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of
code debugging and unit testing on SparkSQL core module. The idea can boil down to describing
the data structures inside RDD using a formal description similar to the relational database schema.
On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some
straightforward relational query interface functions that are realized through SparkSQL.
Now, it is officially renamed to the DataFrame API on Spark's latest trunk.
Ans:
Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query
Language without changing any syntax. It is possible to join SQL table and HQL table to Spark SQL.
Ans:
DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
DStreams have two operations:
There are many DStream transformations possible in Spark Streaming. Let us look at filter(func).
filter(func) returns a new DStream by selecting only the records of the source DStream on
which func returns true.
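A minimal Scala sketch of filter(func) on a DStream; the socket source on localhost:9999 and the 10-second batch interval are assumptions made only for this example:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))        // sc is an existing SparkContext
val lines = ssc.socketTextStream("localhost", 9999)    // assumed test source
val errors = lines.filter(_.contains("ERROR"))         // keep only records for which the predicate is true
errors.print()
ssc.start()
ssc.awaitTermination()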
47. When running Spark applications, is it necessary to install Spark on all the nodes of the
YARN cluster?
Ans:
Spark need not be installed when running a job under YARN or Mesos because Spark can execute
on top of YARN or Mesos clusters without affecting any change to the cluster.
48. What are the various data sources available in Spark SQL?
Ans:
Parquet file, JSON datasets and Hive tables are the data sources available in Spark SQL.
49. What are the various levels of persistence in Apache Spark?
Ans:
Apache Spark automatically persists the intermediary data from various shuffle operations, however,
it is often suggested that users call persist () method on the RDD in case they plan to reuse it. Spark
has various persistence levels to store the RDDs on disk or in memory or as a combination of both
with different replication levels.
MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in
memory, some partitions will not be cached and will be recomputed on the fly each time they're
needed. This is the default level.
MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit
in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition).
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in
memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: Store the RDD partitions only on disk.
OFF_HEAP: Similar to MEMORY_ONLY_SER, but store the data in off-heap memory.
50. What is the role of Akka in Spark?
Ans:
Spark uses Akka basically for scheduling. All the workers request for a task to master after
registering. The master just assigns the task. Here Spark uses Akka for messaging between the
workers and masters.
51. What do you understand by Lazy Evaluation?
Spark is intellectual in the manner in which it operates on data. When you tell Spark to operate on
a given dataset, it heeds the instructions and makes a note of it, so that it does not forget – but it
does nothing, unless asked for the final result. When a transformation like map() is called on an
RDD, the operation is not performed immediately. Transformations in Spark are not evaluated till you
perform an action. This helps optimize the overall data processing workflow.
52. What is Spark MLlib?
Ans:
MLlib is a scalable machine learning library provided by Spark. It aims at making machine learning
simple and scalable with common learning algorithms and use cases like clustering, regression,
filtering, dimensionality reduction, and the like.
53. What are executors in Spark?
Ans:
When SparkContext connects to a cluster manager, it acquires executors on the nodes in the
cluster. Executors are Spark processes that run computations and store the data on the worker
node. The final tasks from SparkContext are transferred to the executors for their execution.
54. Name kinds of Cluster Managers in Spark.
Ans:
Spark supports the Standalone cluster manager, Apache Mesos, Hadoop YARN, and Kubernetes.
55. List some use cases where Spark outperforms Hadoop in processing.
Ans:
Sensor Data Processing: Apache Spark's "in-memory" computing works best here, as data is
retrieved and combined from different sources.
Real Time Processing: Spark is preferred over Hadoop for real-time querying of data, for
example Stock Market Analysis, Banking, Healthcare, Telecommunications, and so on.
Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache
Spark is the best solution.
Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing
medium and large-sized datasets.
57. How is Spark SQL different from HQL and SQL?
Ans:
Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query
Language without changing any syntax. It is possible to join an SQL table and an HQL
table in Spark SQL.
58. What is lineage in Spark? How is fault tolerance achieved in Spark
using the Lineage Graph?
Ans:
Whenever a series of transformations are performed on an RDD, they are not evaluated
immediately, but lazily.
When a new RDD has been created from an existing RDD, all of the dependencies
between the RDDs are logged in a graph.
This graph is known as the lineage graph.
Consider the scenario below:
First RDD
Second RDD (applying map)
Third RDD (applying filter)
Fourth RDD (applying count)
This lineage graph will be helpful in case any of the partitions of data are
lost.
59. What is the difference between RDD, DataFrame, and DataSet?
Ans:
RDD:
The low-level, immutable distributed collection; useful when you need low-level transformations and full control over the data.
DataFrame:
Gives a structured view (rows and columns). It can be thought of as a table in a database.
Like an RDD, the DataFrame is lazily evaluated.
It offers huge performance gains due to a) Custom Memory Management – data is stored in off-heap
memory in a binary format, so there is no garbage collection overhead, and b)
Optimized Execution Plan – query plans are made using the Catalyst optimizer.
DataFrame limitations: no compile-time type safety, i.e. no manipulation of data is possible when
the structure isn't known.
DataSet: an extension of DataFrame.
DataSet features – provides the best encoding mechanism and, unlike DataFrames,
supports compile-time type safety.
61. Define Task, Job, and Stage in Spark.
Ans:
Task
A task is a unit of work that is sent to an executor. Each stage has some tasks, one
task per partition. The same task is executed over different partitions of the RDD.
Job
A job is a parallel computation consisting of multiple tasks that get spawned in response to
actions in Apache Spark.
Stage
Each job gets divided into smaller sets of tasks called stages that depend on
one another. Stages are known as computational boundaries. All computation cannot be done in a single stage; it
is achieved over multiple stages.
62. What is the Spark driver?
Ans:
Spark Driver: The Spark driver is the process running the SparkContext. This driver is in charge
of converting the application into a directed graph of individual steps to execute on the cluster.
There is one driver per application.
63. How can you minimize data transfers when working with Spark?
Ans:
The various ways in which data transfers can be minimized when working with Apache Spark
are:
Using broadcast and accumulator variables, and avoiding operations that trigger shuffles.
64. When running Spark applications, is it necessary to install Spark on all of the
nodes of the YARN cluster?
Ans:
Spark need not be installed when running a job under YARN or Mesos, because Spark can
execute on top of YARN or Mesos clusters without affecting any change to the cluster.
65. Which one would you choose for a project: Hadoop MapReduce or Apache Spark?
Ans:
The answer to this question depends on the given project scenario, as it is known that Spark
uses memory instead of network and disk I/O. However, Spark uses a large amount of
RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or
Spark varies dynamically with the requirements of the project and the budget of the organization.
66. What is the difference between persist() and cache()?
Ans:
persist() enables the user to specify the storage level, while cache() uses the default
storage level (MEMORY_ONLY).
67. What are the various levels of persistence in Apache Spark?
Ans:
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is
often suggested that users call the persist() method on the RDD if they
intend to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory,
or as a combination of both, with various replication levels.
68. What are the disadvantages of using Apache Spark over Hadoop MapReduce?
Ans:
Apache Spark's in-memory capability at times becomes a major barrier for the cost-effective
processing of big data. Also, Spark does not have its own file management system and
consequently needs to be integrated with other cloud-based data platforms or Apache Hadoop.
69. What is lazy evaluation in Spark and why is it useful?
Ans:
Applying transformation operations on an RDD, or "loading data into an RDD", is not executed immediately
until it sees an action. Transformations on RDDs and storing data in an RDD are lazily
evaluated. Resources will be used in a better way if Spark uses lazy evaluation.
Lazy evaluation optimizes the disk and memory usage in Spark.
The operations are triggered only when the data is required. This reduces overhead.
70. How is Apache Spark different from Hadoop MapReduce?
Ans:
Due to the availability of in-memory processing, Spark executes the processing around 10 to
100 times faster than Hadoop MapReduce, while MapReduce uses persistent storage for
any of the data processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core,
like batch processing, streaming, machine learning, and interactive SQL queries. However,
Hadoop only supports batch processing.
Hadoop is highly disk-dependent while Spark promotes caching and in-memory data
storage.
71. How does the DAG Scheduler work in Spark?
Ans:
When an action is called on a Spark RDD at a high level, Spark submits the
lineage graph to the DAG Scheduler.
Actions are divided into stages of tasks in the DAG Scheduler. A stage contains tasks
based on the partitions of the input data. The DAG Scheduler pipelines operators
together. It dispatches tasks through the cluster manager. The dependencies of stages are unknown to the
task scheduler. The workers execute the tasks on the slaves.
72. What is a Sliding Window in Spark Streaming?
Ans:
A Sliding Window controls the transmission of data packets between various computer networks. The Spark
Streaming library provides windowed computations where the transformations on RDDs are applied over a
sliding window of data. Whenever the window slides, the RDDs that fall within the
particular window are combined and operated upon to produce new RDDs of the windowed DStream.
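A minimal Scala sketch of a windowed computation, assuming lines is an existing DStream[String] and the batch interval is 10 seconds:
import org.apache.spark.streaming.Seconds

val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // 30s window, sliding every 10s
windowedCounts.print()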
73. What are broadcast variables and accumulators?
Ans:
Broadcast variable:
If we have a large dataset, instead of transferring a copy of the
dataset for each task, we can use a broadcast variable which can be
copied to each node at one time and shares the same data for each task in that node.
Broadcast variables help to give a large dataset to each node.
Accumulator:
Spark functions use variables defined in the driver program, and local copies of those variables
are generated. Accumulators are shared variables which help to update variables in parallel during
execution and share the results from the workers to the driver.
74. What is an action in Spark?
Ans:
An action helps in bringing back the data from an RDD to the local machine. An action's
execution is the result of all previously created transformations. reduce() is an action that executes the
function passed again and again until one value is left. take() moves all of the
values from the RDD to the local node.
75. What is YARN?
Ans:
Like Hadoop, YARN is one of the key features in Spark, providing a central resource management
platform to deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for
example, whereas Spark is a data processing tool. Spark can run on YARN, the
same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a
binary distribution of Spark that is built with YARN support.
76. What are the main features of Apache Spark?
Ans:
Polyglot
Speed
Multiple Format Support
Lazy Evaluation
Hadoop Integration
Machine Learning
77. What is Spark Streaming?
Ans:
Spark Streaming is used for processing real-time streaming data. It is therefore a
useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream
processing of live data streams. The fundamental stream unit is the DStream, which is basically a
series of RDDs (Resilient Distributed Datasets) used to process the real-time data. The
data from various sources like Flume and HDFS is streamed and finally written to file systems, live
dashboards and databases. It is similar to batch processing in that the input data is divided into
streams, just like batches.
78. What are the optimizations that developers can make while working with Spark?
Ans:
79. List some use cases where Spark outperforms Hadoop in processing.
Ans:
Sensor Data Processing: Apache Spark's "In-memory" computing works best here, as data is
retrieved and combined from different sources.
Real Time Processing: Spark is preferred over Hadoop for real-time querying of data. e.g. Stock
Market Analysis, Banking, Healthcare, Telecommunications, etc.
Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache
Spark is the best solution.
Big Data Processing: Spark runs upto 100 times faster than Hadoop when it comes to processing
medium and large-sized datasets.
80. What is a DataFrame?
Ans:
A data frame is like a table; it has named columns which are organized into columns. You can
create a data frame from a file or from tables in Hive, external databases (SQL or NoSQL) or
existing RDDs. It is analogous to a table in a relational database.
81. How do you connect Hive to Spark SQL?
Ans:
The first important thing is that you have to place the hive-site.xml file in the conf directory of
Spark.
Then, with the help of the Spark session object, we can construct a data frame as sketched below.
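A minimal Scala sketch, assuming hive-site.xml is already on Spark's conf path and a Hive table named employees exists (the table name is just an example):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveOnSpark")
  .enableHiveSupport()     // makes the Hive metastore visible to Spark SQL
  .getOrCreate()

val df = spark.sql("SELECT * FROM employees")
df.show()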
82. What is GraphX?
Ans:
Often you need to process the data in the form of graphs, because you have to do some analysis
on it. GraphX performs graph computation in Spark, where the data is present in
files or in RDDs.
GraphX is built on top of Spark Core, so it has got all of the capabilities of
Apache Spark, like fault tolerance and scaling, and there are many inbuilt graph
algorithms as well. GraphX unifies ETL, exploratory analysis and iterative graph
computation within a single system.
You can view the same data as both graphs and collections, transform and join
graphs with RDDs efficiently, and write custom iterative algorithms using the Pregel API.
GraphX competes on performance with the fastest graph systems while retaining Spark's
flexibility, fault tolerance and ease of use.
1) What are the advantages of using Apache Spark over Hadoop MapReduce for big data
processing?
Simplicity, Flexibility and Performance are the major advantages of using Spark over
Hadoop.
Spark is 100 times faster than Hadoop for big data processing as it stores the data
in-memory, by placing it in Resilient Distributed Datasets (RDDs).
It provides complete recovery using lineage graph whenever something goes wrong.
2) What is Shark?
Most of the data users know only SQL and are not good at programming. Shark is a tool,
developed for people who are from a database background, to access Scala MLlib
capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on
Spark, offering compatibility with the Hive metastore, queries and data.
ii. Spark is preferred over Hadoop for real time querying of data
iii. Stream Processing – For processing logs and detecting frauds in live streams for
alerts, Apache Spark is the best solution.
5) What is RDD?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark that
represent the data coming into the system in object format. RDDs are used for in-
memory computations on large clusters, in a fault-tolerant manner. RDDs are read-only,
partitioned collections of records, that are –
Resilient – If a node holding the partition fails, the other node takes the data.
7) What are the languages supported by Apache Spark for developing big data
applications?
Scala, Java, Python, R and Clojure
8) Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.
9) Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
YARN
Apache Mesos -Has rich resource scheduling capabilities and is well suited to run
Spark along with other applications. It is advantageous when several users run
interactive shells because it scales down the CPU allocation between commands.
Standalone deployments – Well suited for new deployments which only run Spark and are easy
to set up.
Configure the spark driver program to connect to Mesos. Spark binary package should be
in a location accessible by Mesos. (or)
Install Apache Spark in the same location as that of Apache Mesos and configure the
property ‘spark.mesos.executor.home’ to point to the location where it is installed.
12) How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs that run
in a fast and reliable manner. The various ways in which data transfers can be
minimized when working with Apache Spark are:
1. Using Broadcast Variables- Broadcast variables enhance the efficiency of joins between
small and large RDDs.
2. Using Accumulators- Accumulators help update the values of variables in parallel while executing.
3. The most common way is to avoid ByKey operations, repartition or any other operations
which trigger shuffles.
13) Why is there a need for broadcast variables when working with Apache Spark?
These are read only variables, present in-memory cache on every machine. When working
with Spark, usage of broadcast variables eliminates the necessity to ship copies of a
variable for every task, so data can be processed faster. Broadcast variables help in
storing a lookup table inside the memory which enhances the retrieval efficiency when
compared to an RDD lookup ().
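A minimal Scala sketch of a broadcast lookup table; ordersRDD and the country-code data are assumed example inputs, not part of the original text:
val countryNames = Map("US" -> "United States", "IN" -> "India")
val bCountries = sc.broadcast(countryNames)

val withNames = ordersRDD.map { case (code, amount) =>
  (bCountries.value.getOrElse(code, "Unknown"), amount)   // read once per executor, not shipped with every task
}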
17) Explain about the major libraries that constitute the Spark Ecosystem
Spark MLlib- Machine learning library in Spark for commonly used learning algorithms
like clustering, regression, classification, etc.
Spark Streaming – This library is used to process real time streaming data.
Spark GraphX – Spark API for graph parallel computations with basic operators like
joinVertices, subgraph, aggregateMessages, etc.
Spark SQL – Helps execute SQL like queries on Spark data using standard visualization
or BI tools.
18) What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic
partitioning between Spark and other big data frameworks.
21) When running Spark applications, is it necessary to install Spark on all the nodes
of YARN cluster?
Spark need not be installed when running a job under YARN or Mesos because Spark can
execute on top of YARN or Mesos clusters without affecting any change to the cluster.
24) Which spark library allows reliable file sharing at memory speed across different
cluster frameworks?
Tachyon
26) How can you compare Hadoop and Spark in terms of ease of use?
Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive
make it considerably easier. Learning Pig and Hive syntax takes time. Spark has
interactive APIs for different languages like Java, Python or Scala and also includes
Shark i.e. Spark SQL for SQL lovers - making it comparatively easier to use than
Hadoop.
27) What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of-
Developers need to be careful with this, as Spark makes use of memory for processing.
JSON Datasets
Hive tables
31) What are the key features of Apache Spark that you like?
Spark provides advanced analytic options like graph algorithms, machine learning,
streaming data, etc
It has built-in APIs in multiple languages like Java, Scala, Python and R
It has good performance gains, as it helps run an application in the Hadoop cluster
ten times faster on disk and 100 times faster in memory.
33) Which one will you choose for a project –Hadoop MapReduce or Apache Spark?
The answer to this question depends on the given project scenario - as it is known
that Spark makes use of memory instead of network and disk I/O. However, Spark uses
large amount of RAM and requires dedicated machine to produce effective results. So
the decision to use Hadoop or Spark varies dynamically with the requirements of the
project and budget of the organization.
34) Explain the different types of transformations on DStreams.
Stateless Transformations- Processing of the batch does not depend on the output of
the previous batch. Examples – map (), reduceByKey (), filter ().
Stateful Transformations- Processing of the batch depends on the intermediary results of
the previous batch. Examples – transformations that depend on sliding windows.
Stream processing
38) How can you remove the elements with a key present in any other RDD?
Use the subtractByKey () function
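A minimal Scala sketch with made-up pair RDDs:
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(Seq(("b", 99)))
rdd1.subtractByKey(rdd2).collect()   // Array((a,1), (c,3)): keys that appear in rdd2 are removed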
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER, DISK_ONLY
OFF_HEAP
43) How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce) users can run any spark job inside MapReduce without
requiring any admin rights.
46) Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache
Spark?
Data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance
through lineage. RDD always has the information on how to build from other datasets.
If any partition of a RDD is lost due to failure, lineage helps build only that
particular lost partition.
Executor –The worker processes that run the individual tasks of a Spark job.
Cluster Manager-A pluggable component in Spark, to launch Executors and Drivers. The
cluster manager allows Spark to run on top of other external managers like Apache
Mesos or YARN.
3. What is “RDD”?
RDD stands for Resilient Distributed Datasets: a collection of fault-tolerant
operational elements that run in parallel. The partitioned data in an RDD is immutable
and is distributed in nature.
6. Define “Partitions”.
A “Partition” is a smaller and logical division of data, that is similar to the
“split” in Map Reduce. Partitioning is the process that helps derive logical units of
data in order to speed up data processing.
Here’s an example: val someRDD = sc.parallelize( 1 to 100, 4)
Here an RDD of 100 elements is created in four partitions, which then distributes a
dummy map task before collecting the elements back to the driver program.
Transformations
Actions
Standalone
Apache Mesos
YARN
It is a cluster computing platform designed for fast, general-purpose computation. Spark is essentially a fast and flexible data processing
framework. It is capable of getting data from HDFS, HBase, Cassandra and others. It has an advanced execution engine supporting
cyclic data flow with in-memory computing functionalities.
In-Memory Computation
RDD (Resilient Distributed Dataset)
Supports many languages
Integration with Hadoop
fast processing
Real time stream processing
3) What is RDD ?
RDD (Resilient Distrubution Datasets) : Collection of objects that runs in parallel.Partitions data in RDD is immutable and is
distributed in nature.
Transformations
Actions
1. Narrow Transformation
2. Wide transformation
"Transformations" are functions applied on an RDD that give a new RDD. Transformations do not execute until an action occurs.
map() and filter() are examples of "transformations". The filter() transformation creates a new RDD by selecting elements from the current RDD that pass a predicate function.
An "action" takes the data back from the RDD to the local machine. Execution of an "action" is the result of all transformations created
previously. fold() is an action that applies the function passed to it again and again until only one value is left.
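A minimal Scala sketch with made-up numbers showing a transformation followed by the fold() action:
val nums = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = nums.map(_ * 2)       // transformation: recorded, not executed
val total = doubled.fold(0)(_ + _)  // action: combines values until a single result (20) is left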
7) What are the commonly used Ecosystems in Apache Spark ?
Spark Streaming
Spark Sql
Spark Mllib
Spark graphx
SparkCore performs memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage
systems.
The RDD in Spark Core makes it fault tolerant. An RDD is a collection of items distributed across many nodes that can be manipulated
in parallel.
Spark SQL is a Spark module for structured data processing. Spark SQL is almost similar to SQL and it also supports Hive
Query Language. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Datasets API.
10) What is Spark Streaming ?
Spark Streaming allows stream processing of live data streams.Data can be ingested from many sources like Kafka, Flume,
Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions
like map,reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
Spark GraphX is a component in Spark which is used for graph processing (Social Media Friends Recommendation).
Spark MLlib provides support for machine learning algorithms; before Spark MLlib, Hadoop used Apache Mahout for machine
learning algorithms. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering,
collaborative filtering and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline
APIs. Machine learning algorithms are mainly used for predictions, recommendations and other purposes.
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS,
Cassandra, HBase, Amazon S3, etc.
Standalone
Apache Mesos
YARN
YARN means Yet Another Resource Negotiator. YARN is a cluster management technology introduced in Hadoop 2.x.
YARN is mainly used to reduce the burden on MapReduce.
Apache Mesos is one of the cluster Management technology like Yarn.It “provides efficient resource isolation and sharing across
distributed applications, or frameworks”.
A Spark cluster can also be run without the support of YARN, Apache Mesos or any other cluster manager; Spark can run by itself
in what is called Standalone mode.
A Spark Worker Node is a slave node. "Worker node" refers to any node that can run the application code in a cluster.
i) parallelize
ii) textFile
val a= Array(4,6,7,8)
val b= sc.parallelize(a)
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
A “Partition” is a smaller and logical division of data, that is similar to the “split” in Map
Reduce. Partitioning is the process that helps derive logical units of data in order to speed up data
processing.
What is Spark?
Spark is a parallel data processing framework. It allows developers to develop fast, unified big data
applications that combine batch, streaming and interactive analytics.
Why Spark?
Spark is a third-generation distributed data processing platform. It is a unified big data solution for all
big data processing problems such as batch, interactive and streaming processing, so it can ease
many big data problems.
What is RDD?
Spark's primary core abstraction is called Resilient Distributed Datasets. An RDD is a collection of
partitioned data that satisfies these properties: immutable, distributed, lazily evaluated and cacheable
are common RDD properties.
What is Immutable?
Once created and assigned a value, it is not possible to change it; this property is called immutability.
Spark RDDs are immutable by default; they do not allow updates and modifications. Please note that the data
collection is not immutable, but the data values are immutable.
What is Distributed?
RDD data is automatically distributed across different parallel computing nodes.
What is Cacheable?
Spark keeps the data in-memory for computation, rather than going to the disk, so it can serve
cached data up to 100 times faster than Hadoop.
How does Spark partition the data?
Spark uses the map-reduce API to do the partitioning of the data. In the input format we can create a number of
partitions. By default, the HDFS block size is the partition size (for best performance), but it is possible
to change the partition size, e.g. with a split.
What is SparkContext?
When a programmer creates RDDs, SparkContext connects to the Spark cluster to create a new
SparkContext object. The SparkContext tells Spark how to access the cluster. SparkConf is a key factor
in creating the programmer's application.
What is MLlib?
Mahout is a machine learning library for Hadoop; similarly, MLlib is a Spark library. MLlib
provides different algorithms, and those algorithms scale out on the cluster for data processing. Most
data scientists use this MLlib library.
What is GraphX?
GraphX is a Spark API for manipulating Graphs and collections. It unifies ETL, other analysis,
and iterative graph computation. It’s fastest graph system, provides fault tolerance and ease of
use without special skills.
What are transformations in Spark?
Spark provides two special kinds of operations on RDDs, called transformations and actions.
Transformations follow lazy operation and temporarily hold the data until the action is called.
Each transformation generates/returns a new RDD. Examples of transformations: map, flatMap,
groupByKey, reduceByKey, filter, co-group, join, sortByKey, union, distinct and sample are
common Spark transformations.
What is Action in Spark?
Actions are RDD operations whose values are returned back to the Spark driver program, and they kick off a
job to execute on the cluster. A transformation's output is the input of actions. reduce, collect,
takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey and foreach are common
actions in Apache Spark.
What is the difference between Map and FlatMap?
Map processes a specific line or row at a time. In FlatMap, each input item can be mapped to
multiple output items (so the function should return a Seq rather than a single item). FlatMap is most
frequently used to return the elements of an array.
Apache Spark is an easy-to-use and flexible data processing framework. Spark can run on
Hadoop, standalone, or in the cloud. It is capable of accessing diverse data sources, which
include HDFS, Cassandra, and others.
A DStream is a sequence of resilient distributed datasets which represent a stream of data. You
can create a DStream from various sources like HDFS, Apache Flume, Apache Kafka, etc.
JSON Datasets
Hive tables
Parquet file
A sparse vector is a vector which has two parallel arrays, one for indices and one for values, used for
storing non-zero entries to save space.
6) Name the language supported by Apache Spark for developing big data applications
Java
Python
R
Clojure
Scala
In Apache Spark, a Data frame can be created using Tables in Hive and Structured data files.
8) Explain SchemaRDD
An RDD which consists of row object with schema information about the type of data in each
column is called SchemaRDD.
Accumulators are write-only variables (from the workers' perspective). They are initialized once and sent to the workers.
The workers update them based on the logic written, and the updates are sent back to the driver.
Spark Core: It is a base engine for large-scale parallel and distributed data processing
Spark Streaming: This component is used for real-time data streaming.
Spark SQL: Integrates relational processing by using Spark’s functional programming
API
GraphX: Allows graphs and graph-parallel computation
MLlib: Allows you to perform machine learning in Apache Spark
If the user isn't able to specify, then the number of partitions are considered as default level of
parallelism in Apache Spark.
Uber
Netflix
Pinterest
Spark SQL is a module for structured data processing where we take advantage of SQL queries
running on that database.
Parquet is a columnar format file supported by many other data processing systems. Spark SQL
allows you to perform both read and write operations with Parquet files.
Spark Driver is the program which runs on the master node of the machine and declares
transformations and actions on data RDDs.
Spark is a processing engine which doesn't have any storage engine. It can retrieve data from
another storage engine like HDFS, S3.
The file system API allows you to read data from various storage devices like HDFS, S3 or the local
filesystem.
Spark Engine is helpful for scheduling, distributing and monitoring the data application across
the cluster.
Real-time data processing is not possible directly. However, it is possible by registering an existing
RDD as a SQL table and triggering SQL queries on top of that.
23) What are the important differences between Apache Spark and Hadoop?
Yes, you can run Apache Spark on the hardware clusters managed by Mesos.
Partition is a smaller and logical division of data. It is the method for deriving logical units of data
to speed up the processing process.
26) Define the term 'Lazy Evaluation' with reference to Apache Spark
Apache Spark delays its evaluation until it is needed. For the transformations, Spark adds them
to a DAG of computation, which is executed only when the driver requests some data.
Spark uses Akka for scheduling. It also uses Akka for messaging between the workers
and masters.
A Map transformation on an RDD produces another RDD by translating each element. It helps you
translate every element by executing the function provided by the user.
Interactive machine learning
Stream processing
Data analytics and processing
Sensor data processing
The persist() function allows the user to specify the storage level, whereas cache() uses the default
storage level.
35) Name the Spark Library which allows reliable file sharing at memory speed across
different cluster frameworks.
Tachyon is a spark library which allows reliable file sharing at memory speed across various
cluster frameworks.
36) Apache Spark is a good fit for which type of machine learning techniques?
Apache Spark is ideal for simple machine learning algorithms like clustering, regression, and
classification.
37) How can you remove the elements with a key present in any other RDD in Apache
Spark?
In order to remove the elements with a key present in any other RDD, you need to use the
subtractByKey() function.
Checkpoints allow the program to run around the clock. Moreover, they help to make it resilient
towards failure irrespective of the application logic.
Lineage graph information is used to compute each RDD on demand. Therefore, whenever a part of a
persistent RDD is lost, the lost data can be recovered using the lineage graph
information.
Spark supports file formats such as JSON, TSV, Snappy, ORC, RC, etc.
Action helps you to bring back the data from RDD to the local machine. Its execution is the
result of all previously created transformations.
42) What is Yarn?
Yarn is one of the most important features of Apache Spark. Running Spark on YARN requires a
binary distribution of Spark that is built with YARN support.
An executor is a Spark process which runs computations and stores the data on the worker
node. The final tasks from SparkContext are transferred to the executor for their execution.
44) is it necessary to install Spark on all nodes while running Spark application on Yarn?
No, you don’t necessarily need to install spark on all nodes as spark runs on top of Yarn.
A worker node is any node which can run the application code in a cluster.
46) How can you launch Spark jobs inside Hadoop MapReduce?
Spark in MapReduce (SIMR) allows users to run any kind of Spark job inside MapReduce without needing to
obtain admin rights for that application.
47) Explain the process to trigger automatic clean-up in Spark to manage accumulated
metadata.
You can trigger automatic clean-ups by setting the parameter 'spark.cleaner.ttl', or by separating the long-running jobs into various batches and writing the intermediate results to the disk.
BlinkDB is a query engine tool which allows you to execute SQL queries on huge volumes of data and renders query results with meaningful error bars.
49) Does Spark handle monitoring and logging in Standalone mode?
Yes, Spark can handle monitoring and logging in standalone mode, as it has a web-based user
interface.
50) How can you identify whether a given operation is Transformation or Action?
You can identify the operation based on the return type. If the return type is not an RDD, then the
operation is an action. However, if the return type is an RDD, then the operation is a
transformation.
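A quick way to see this in practice (a sketch assuming an existing SparkContext named sc):
rdd = sc.parallelize([1, 2, 3])
mapped = rdd.map(lambda x: x * 2)   # transformation: returns another RDD
total = mapped.count()              # action: returns a plain Python int, not an RDD
print(type(mapped), type(total))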
51) Can You Use Apache Spark To Analyze and Access Data Stored In Cassandra Databases?
Yes, you can use Spark Cassandra Connector which allows you to access and analyze data
stored in Cassandra Database.
Spark SQL is an essential component on top of the Spark Core engine. It supports SQL and Hive
Query Language without altering its syntax.
Q. What is PySpark?
This is almost always the first PySpark interview question you will
face.
PySpark is the Python API for Apache Spark. It gives you the facility to read data from multiple sources which have different data formats. Along with these features, we can also interface with Spark's core features in a simple way, instead of processing data in the older MapReduce fashion, which is not efficient for iterative workloads.
1. spark.mllib
2. mllib.clustering
3. mllib.classification
4. mllib.regression
5. mllib.recommendation
6. mllib.linalg
7. mllib.fpm
PySpark SparkFiles is used to load files onto the Apache Spark application; we call sc.addFile to load the files on the Apache Spark. SparkFiles can also be used to resolve the paths to files that were added from sc.addFile. The class methods present in the SparkFiles directory are get(filename) and getrootdirectory(), and the uploaded files become available on every node of the cluster.
We run the following code whenever we want to create a SparkConf:
class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)
The PySpark StorageLevel, in turn, makes the decisions on where the RDD will be stored (in memory, over the disk, or both) and whether partitions are serialized or deserialized and replicated:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
SparkJobInfo exposes information about the SparkJobs that are in execution. The code for using the SparkJobInfo is as follows:
class SparkJobInfo(namedtuple("SparkJobInfo", "jobId stageIds status")):
Similarly, SparkStageInfo exposes information about the SparkStages that are present at that time.
Q. What is the difference between an RDD, a DataFrame, and a DataSet?
RDD-
● If the same set of data needs to be computed again, RDDs can be efficiently reserved (cached).
● It's useful when you need to do low-level transformations, operations, and control on a dataset.
● It's more commonly used to alter data with functional programming structures than with domain-specific expressions.
DataFrame-
● It allows the structure, i.e., rows and columns, to be seen. You can think of it as a database table.
● Optimized Execution Plan- The catalyst analyzer is used to create query plans.
● One of the limitations of dataframes is the lack of compile-time safety, i.e., when the structure of the data is unknown, no manipulation of the data is possible.
● Also, if you're working in Python, start with DataFrames and then switch to RDDs if you need more flexibility.
DataSet-
● It has the best encoding component and, unlike DataFrames, it enables compile-time type safety in a structured manner.
● If you want a greater level of type safety at compile-time, or if you want typed JVM objects, Datasets are the way to go.
● Also, you can leverage Datasets when you want to take advantage of Catalyst optimization or when you are trying to benefit from Tungsten's fast code generation.
The toDF() function of PySpark RDD is used to construct a DataFrame from an existing RDD. The DataFrame is constructed with the default column names "_1" and "_2" to represent the two columns, because the RDD lacks column names.
dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()
Here, the printSchema() method gives you a database schema without column names-
root
Use the toDF() function with column names as parameters to pass column names to the DataFrame:
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()
The above code snippet gives you the database schema with the column names-
root
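Putting the two variants together, a self-contained sketch could look like this (the sample data is hypothetical):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").appName("toDFExample").getOrCreate()
rdd = spark.sparkContext.parallelize([("Java", 20000), ("Python", 100000)])  # hypothetical data
dfFromRDD1 = rdd.toDF()                              # default column names _1, _2
dfFromRDD1.printSchema()
dfFromRDD2 = rdd.toDF(["language", "users_count"])   # explicit column names
dfFromRDD2.printSchema()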
The StructType and StructField classes in PySpark are used to define the schema of the
DataFrame and create complex columns such as nested struct, array, and map columns.
StructType is a collection of StructField objects that determines column name, column data type, field nullability, and metadata. The DataFrame's printSchema() function displays StructType columns as "struct."
● To define the columns, PySpark offers the pyspark.sql.types import StructField class,
which has the column name (String), column type (DataType), nullable column (Boolean), and metadata (MetaData).
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.master("local[1]") \
    .appName('ProjectPro') \
    .getOrCreate()
data = [("James","","William","36636","M",3000),
    ("Michael","Smith","","40288","M",4000),
    ("Robert","","Dawson","42114","M",4000),
    ("Maria","","Jones","39192","F",4000)
  ]
# Schema fields match the six values in each data row
schema = StructType([ \
    StructField("firstname",StringType(),True), \
    StructField("middlename",StringType(),True), \
    StructField("lastname",StringType(),True), \
    StructField("id",StringType(),True), \
    StructField("gender",StringType(),True), \
    StructField("salary",IntegerType(),True) \
  ])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)
Q. What are the different ways to handle row duplication in a PySpark DataFrame?
There are two ways to handle row duplication in PySpark dataframes. The distinct()
function in PySpark is used to drop/remove duplicate rows (all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or multiple selected columns.
Here’s an example showing how to utilize the distinct() and dropDuplicates() methods-
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
# Illustrative sample data; "Robert" appears twice with identical values
data = [("James","Sales",3000), ("Robert","Sales",4100), ("Robert","Sales",4100),
        ("Maria","Finance",3000), ("Jen","Finance",3000), ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["employee_name","department","salary"])
df.printSchema()
df.show(truncate=False)
Output-
The record with the employee name Robert contains duplicate rows in the table above. As
we can see, there are two rows with duplicate values in all fields and four rows with duplicate values in the department and salary columns.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
# Same illustrative data as above
data = [("James","Sales",3000), ("Robert","Sales",4100), ("Robert","Sales",4100),
        ("Maria","Finance",3000), ("Jen","Finance",3000), ("Kumar","Marketing",2000)]
column = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=column)
df.printSchema()
df.show(truncate=False)
#Distinct
distinctDF = df.distinct()
distinctDF.show(truncate=False)
#Drop duplicates
df2 = df.dropDuplicates()
df2.show(truncate=False)
dropDisDF = df.dropDuplicates(["department","salary"])
dropDisDF.show(truncate=False)
Q. Explain PySpark UDF with the help of an example.
The most important aspect of Spark SQL & DataFrame is PySpark UDF (i.e., User Defined
Function), which is used to expand PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in traditional databases: once defined, they can be reused across many DataFrames and SQL expressions. We can wrap a Python function with the PySpark SQL udf() function or register it as a udf and use it on DataFrame and SQL, respectively, in the case of PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
column = ["Seqno","Name"]
# Hypothetical sample data for illustration
data = [("1", "john jones"), ("2", "tracey smith"), ("3", "amy sanders")]
df = spark.createDataFrame(data=data, schema=column)
df.show(truncate=False)
Output-
2. The next step is creating a Python function. The code below generates the
convertCase() method, which accepts a string parameter and turns every word's first letter into a capital letter.
def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
By passing the function to PySpark SQL udf(), we can convert the convertCase() function to a UDF.
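A minimal sketch of that step, assuming the DataFrame df and the convertCase() function defined above:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
convertUDF = udf(lambda z: convertCase(z), StringType())
df.select(col("Seqno"), convertUDF(col("Name")).alias("Name")).show(truncate=False)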
PySpark map or the map() function is an RDD transformation that generates a new RDD by applying a function to each element of the source RDD.
RDD map() transformations are used to perform complex operations such as adding a
column, changing a column, converting data, and so on. Map transformations always produce the same number of records as the input.
records = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"]
rdd=spark.sparkContext.parallelize(records)
map(f, preservesPartitioning=False)
● We are adding a new element having value 1 for each element in this PySpark map()
example, and the output of the RDD is PairRDDFunctions, which has key-value pairs,
where we have a word (String type) as Key and 1 (Int type) as Value.
rdd2 = rdd.map(lambda x: (x, 1))
for element in rdd2.collect():
    print(element)
Output-
Joins in PySpark are used to join two DataFrames together, and by linking them together,
one may join several DataFrames. INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT
ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types it
supports.
‘how’: default inner (Options are inner, cross, outer, full, full outer, left, left outer, right, right outer, left semi, and left anti.)
PySpark ArrayType is a collection data type that extends PySpark's DataType class, which is
the superclass for all kinds. All items in an ArrayType column should be of the same type. ArrayType()
accepts two arguments: elementType and one optional argument containsNull, which
specifies whether a value can accept null and is set to True by default. elementType should extend PySpark's DataType class.
arrayCol = ArrayType(StringType(), False)
Using one or more partition keys, PySpark partitions a large dataset into smaller parts.
When we build a DataFrame from a file or table, PySpark creates the DataFrame in memory with a certain number of partitions. Transformations on
partitioned data run quicker since each partition's transformations are executed in parallel.
Partitioning in memory (DataFrame) and partitioning on disc (File system) are both
supported by PySpark.
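A small sketch of both kinds of partitioning, assuming an existing DataFrame df with a country column (the column name and output path are hypothetical):
df2 = df.repartition(8)                    # partitioning in memory
print(df2.rdd.getNumPartitions())          # 8
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/partitioned_output")   # partitioning on disk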
PySpark MapType accepts two mandatory parameters- keyType and valueType, and one optional parameter valueContainsNull, which is set to True by default.
Here’s how to create a MapType with PySpark StructType and StructField. The StructType()
accepts a list of StructFields, each of which takes a fieldname and a value type.
from pyspark.sql.types import StructType, StructField, StringType, MapType
schema = StructType([
    StructField('name', StringType(), True),          # field names are illustrative
    StructField('properties', MapType(StringType(), StringType()), True)
])
dataDictionary = [
('James',{'hair':'black','eye':'brown'}),
('Michael',{'hair':'brown','eye':None}),
('Robert',{'hair':'red','eye':'black'}),
('Washington',{'hair':'grey','eye':'grey'}),
    ('Jefferson',{'hair':'brown','eye':''})
]
df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
Output-
Q. How can PySpark DataFrame be converted to Pandas
DataFrame?
First, you need to learn the difference between the PySpark and Pandas. The key difference
between Pandas and PySpark is that PySpark's operations are quicker than Pandas'
because of its distributed nature and parallel execution over several cores and computers.
In other words, pandas use a single node to do operations, whereas PySpark uses several
computers.
You'll need to transfer the data back to Pandas DataFrame after processing it in PySpark so
that you can use it in Machine Learning apps or other Python programs.
Below are the steps to convert PySpark DataFrame into Pandas DataFrame-
1. First, build a PySpark DataFrame.
2. The next step is to convert this PySpark dataframe into a Pandas dataframe using the toPandas() function. toPandas() gathers all records in a PySpark DataFrame and delivers them to the driver program; it should only be used on a small subset of the data, because collecting a very large DataFrame can exhaust the driver's memory.
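A minimal sketch, assuming an existing PySpark DataFrame df that is small enough to collect:
pandas_df = df.toPandas()      # collects all rows to the driver as a pandas DataFrame
print(pandas_df.head())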
Q. What do you understand by PySpark ArrayType? Explain with an example.
PySpark ArrayType is a data type for collections that extends PySpark's DataType class. It accepts
two arguments: elementType and one optional argument containsNull, which specifies
whether a value can accept null and is set to True by default. elementType should extend the DataType class.
arrayCol = ArrayType(StringType(), False)
The above example generates a string array that does not allow null values.
PySpark pivot() is used to rotate/transpose data from one column into multiple Dataframe columns and back using the unpivot() function. Pivot() is an aggregation in
which the values of one of the grouping columns are transposed into separate columns holding distinct data.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ProjectPro').getOrCreate()
# Illustrative sample data (Product, Amount, Country)
data = [("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"), \
    ("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"), \
    ("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"), \
    ("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico")]
columns= ["Product","Amount","Country"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
Output-
To determine the entire amount of each product's exports to each nation, we'll group by Product, pivot on Country, and sum the Amount:
pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.printSchema()
pivotDF.show(truncate=False)
This will convert the nations from DataFrame rows to columns, resulting in the output shown below.
Q. What are broadcast variables in PySpark? Explain with an example.
Broadcast variables in PySpark are read-only shared variables that are stored and
accessible on all nodes in a cluster so that processes may access or use them. Instead of
sending this information with each job, PySpark uses efficient broadcast algorithms to distribute the broadcast variables to the workers, lowering communication costs. The cached value can be read on any node through broadcastVariable.value.
# Example with an RDD; the states lookup dictionary is illustrative
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)
data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
]
rdd = spark.sparkContext.parallelize(data)
def state_convert(code):
    return broadcastStates.value[code]
res = rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).collect()
print(res)
# Example with a DataFrame; the states lookup dictionary is illustrative
states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)
data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","William","USA","CA"),
    ("Maria","Jones","USA","FL")
]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data=data, schema=columns)
df.printSchema()
df.show(truncate=False)
def state_convert(code):
    return broadcastStates.value[code]
res = df.rdd.map(lambda x: (x[0], x[1], x[2], state_convert(x[3]))).toDF(columns)
res.show(truncate=False)
Q. Suppose you are running a Spark application on a cluster of 10 nodes, each with 24
CPU cores. The following code works, but it may crash on huge
data sets, or at the very least, it may not take advantage of the
cluster's full processing capabilities. Which aspect is the most
problematic?
The repartition command creates ten partitions regardless of how many were originally
loaded. On large datasets, these partitions might get fairly huge, and they'll almost certainly outgrow the memory available to a single task.
In addition, each partition is processed by exactly one task at a time. This means that just ten of the 240
available cores are engaged (10 nodes with 24 cores, each node running one executor).
If the number is set exceptionally high, the scheduler's cost in handling the partitions grows,
lowering performance. It may even exceed the execution time in some circumstances, especially when the partitions are very small.
The optimal number of partitions is between two and three times the number of executors.
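A sketch of applying that heuristic, assuming an existing DataFrame df (the executor count is hypothetical):
num_executors = 10                        # hypothetical cluster size
df = df.repartition(3 * num_executors)    # roughly two to three times the number of executors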
The primary function, calculate, reads two pieces of data. (They are given in this case from
parallelize.) Each of them is transformed into a tuple by the map, which consists of a userId
and the item itself. To combine the two datasets, the userId is utilised.
All users' login actions are filtered from the combined dataset, and the uName and the event timestamp are kept for each of them.
This is eventually reduced down to merely the initial login record per user, which is then sent
to the console.
Q. Suppose you have two dataframes that share a userId column (UIdColName): one with user names and one with user activity records carrying an eventType. Join the two dataframes using code and count the number of events per uName.
.repartition(col(UIdColName))
.join(userActivityRdd, UIdColName)
.select(col(UNameColName))
.groupBy(UNameColName)
.count()
.withColumnRenamed("count", CountColName)
result.show()
Q. In the following code, identify which parts of the application will run on the master and which parts will run on each worker node.
DateTimeFormatter.ofPattern("yyyy/MM") def
getEventCountOnWeekdaysPerMonth(data:
DayOfWeek.SATURDAY.getValue) . map(mapDateTime2Date)
The driver application is responsible for calling this function. The DAG is defined by the
assignment to the result value, as well as its execution, which is initiated by the collect()
operation. The worker nodes handle all of this (including the logic of the method
mapDateTime2Date). Because the result value that is gathered on the master is an array, only collecting and handling that final array happens on the master.
Q. What are the elements used by the GraphX library, and how are they represented?
Vertex, and Edge objects are supplied to the Graph object as RDDs of type RDD[VertexId,
VT] and RDD[Edge[ET]] respectively (where VT and ET are any user-defined types
associated with a given Vertex or Edge). For Edge type, the constructor is Edge[ET](srcId:
VertexId, dstId: VertexId, attr: ET). VertexId is just an alias for Long.
Q. Under what scenarios are Client and Cluster modes used for
deployment?
● Cluster mode should be utilized for deployment if the client computers are not near
the cluster. This is done to prevent the network delay that would occur in Client
mode while communicating between executors. In case of Client mode, if the
machine goes offline, the entire operation is lost.
● Client mode can be utilized for deployment if the client computer is located within
the cluster. There will be no network latency concerns because the computer is part
of the cluster, and the cluster's maintenance is already taken care of, so there is no
need to be concerned in the event of a failure.
Hadoop MapReduce: Only batch-wise data processing is done using MapReduce.
Apache Spark: Spark can handle data in both real-time and batch mode.
Hadoop MapReduce: The data is stored in HDFS (Hadoop Distributed File System), which takes a long time to retrieve.
Apache Spark: Spark saves data in memory (RAM), making data retrieval quicker and faster when needed.
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0
lines = sparkContext.textFile("sample_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)   # a total greater than 0 means the keyword exists
Spark executors have the same fixed core count and heap size as the applications created
in Spark. The heap size relates to the memory used by the Spark executor, which is
controlled by the --executor-memory flag or the spark.executor.memory property. One executor is assigned to each worker node where Spark operates. The executor memory is a
measurement of the memory utilized by the application's worker node.
The core engine for large-scale distributed and parallel data processing is SparkCore. The
distributed execution engine in the Spark core provides APIs in Java, Python, and Scala for
constructing distributed ETL applications.
Memory management, task monitoring, fault tolerance, storage system interactions, work
scheduling, and support for all fundamental I/O activities are all performed by Spark Core.
Additional libraries on top of Spark Core enable a variety of SQL, streaming, and machine
learning applications.
Q. What are the drawbacks of using Apache Spark in applications?
Despite the fact that Spark is a strong data processing engine, there are certain drawbacks
to utilizing it in applications.
Q. How can data transfers (shuffles) be kept to a minimum when working with PySpark?
The process of shuffling corresponds to data transfers. Spark applications run quicker and
more reliably when these transfers are minimized. There are quite a number of approaches
that may be used to reduce them. They are as follows:
● Using broadcast variables improves the efficiency of joining big and small RDDs.
● Accumulators are used to update variable values in a parallel manner during
execution.
● Another popular method is to prevent operations that cause these reshuffles.
Q. What is the difference between sparse and dense vectors?
Sparse vectors are made up of two parallel arrays, one for indexing and the other for storing
values. These vectors are used to save space by storing non-zero values. E.g.- val
sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
The vector in the above example is of size 5, but the non-zero values are only found at
indices 0 and 4.
When there are just a few non-zero values, sparse vectors come in handy. If there are just a
few zero values, dense vectors should be used instead of sparse vectors, as sparse vectors
would create indexing overhead, which might affect performance.
The usage of sparse or dense vectors has no effect on the outcomes of calculations, but
when they are used incorrectly, they have an influence on the amount of memory needed
and the calculation time.
The partition of a data stream's contents into batches of X seconds, known as DStreams, is
the basis of Spark Streaming. These DStreams allow developers to cache data in memory,
which may be particularly handy if the data from a DStream is utilized several times. The
cache() function or the persist() method with proper persistence settings can be used to
cache data. For input streams receiving data through networks such as Kafka, Flume, and
others, the default persistence level setting is configured to achieve data replication on two
nodes to achieve fault tolerance.
● Cache method- e.g. dstream.cache()
● Persist method- e.g. dstream.persist(StorageLevel.MEMORY_AND_DISK)
Spark RDD is extended with a robust API called GraphX, which supports graphs and graph-
based calculations. The Resilient Distributed Property Graph is an enhanced property of
Spark RDD that is a directed multi-graph with many parallel edges. User-defined
characteristics are associated with each edge and vertex. Multiple connections between the
same set of vertices are shown by the existence of parallel edges. GraphX offers a
collection of operators that can allow graph computing, such as subgraph,
mapReduceTriplets, joinVertices, and so on. It also offers a wide number of graph builders
and algorithms for making graph analytics chores easier.
According to the UNIX Standard Streams, Apache Spark supports the pipe() function on
RDDs, which allows you to assemble distinct portions of jobs that can use any language.
The RDD transformation may be created using the pipe() function, and it can be used to
read each element of the RDD as a String. These may be altered as needed, and the results
can be presented as Strings.
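A minimal PySpark sketch of pipe(), assuming an existing SparkContext named sc and a Unix-like environment where the cat command is available:
rdd = sc.parallelize(["hello", "world"])
piped = rdd.pipe("cat")        # each element is passed to the external command as a line of text
print(piped.collect())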
Q. What are the different persistence levels available in PySpark?
Spark automatically saves intermediate data from various shuffle processes. However, it is
advised to use the RDD's persist() function. There are many levels of persistence for storing
RDDs on memory, disc, or both, with varying levels of replication. The following are the
persistence levels available in Spark:
● MEMORY ONLY: This is the default persistence level, and it's used to save RDDs on
the JVM as deserialized Java objects. In the event that the RDDs are too large to fit
in memory, the partitions are not cached and must be recomputed as needed.
● MEMORY AND DISK: On the JVM, the RDDs are saved as deserialized Java objects.
In the event that memory is inadequate, partitions that do not fit in memory will be
kept on disc, and data will be retrieved from the drive as needed.
● MEMORY ONLY SER: The RDD is stored as serialized Java objects with a one-byte array per partition.
● DISK ONLY: RDD partitions are only saved on disc.
● OFF HEAP: This level is similar to MEMORY ONLY SER, except that the data is saved
in off-heap memory.
The persist() function has the following syntax for employing persistence levels:
df.persist(StorageLevel.<level_name>)
For example: df.persist(StorageLevel.MEMORY_AND_DISK)
No. of nodes = 10
No. of cores per node = 15
No. of cores per executor = 5 (i.e. how many concurrent tasks each executor can handle)
No. of executors per node = 15 / 5
= 3
Total no. of executors = 10 * 3
= 30
Q. How would you compute the total count of unique words in Spark?
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/sample_file.txt")
2. Define a function to convert a line into words:
def toWords(line):
    return line.split()
3. As a flatMap transformation, run the toWords function on each item of the RDD in Spark:
words = lines.flatMap(toWords)
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1)
wordsTuple = words.map(toTuple)
5. Perform the reduceByKey() action:
def sum(x, y):
    return x + y
counts = wordsTuple.reduceByKey(sum)
6. Print:
counts.collect()
Q. What are Resilient Distribution Datasets (RDDs), and what types of RDDs does Spark support?
Resilient Distribution Datasets (RDD) are a collection of fault-tolerant functional units that
may run simultaneously. RDDs are data fragments that are maintained in memory and
spread across several nodes. In an RDD, all partitioned data is distributed and consistent.
1. Hadoop datasets- Those datasets that apply a function to each file record in the
Hadoop Distributed File System (HDFS) or another file storage system.
2. Parallelized Collections- Existing RDDs that operate in parallel with each other.
Q. What do you understand about custom profilers in PySpark?
PySpark allows you to create custom profiles that may be used to build predictive models.
In general, profilers are calculated using the minimum and maximum values of each
column. It is utilized as a valuable data review tool to ensure that the data is accurate and
appropriate for future usage.
● Avoid dictionaries: If you use Python data types like dictionaries, your code might not
be able to run in a distributed manner. Consider adding another column to a
dataframe that may be used as a filter instead of utilizing keys to index entries in a
dictionary. This proposal also applies to Python types that aren't distributable in
PySpark, such as lists.
● Limit the use of Pandas: using toPandas causes all data to be loaded into memory
on the driver node, preventing operations from being run in a distributed manner.
When data has previously been aggregated, and you wish to utilize conventional
Python plotting tools, this method is appropriate, but it should not be used for larger
dataframes.
● Minimize eager operations: It's best to avoid eager operations that draw whole
dataframes into memory if you want your pipeline to be as scalable as possible.
Reading in CSVs, for example, is an eager activity, thus I stage the dataframe to S3
as Parquet before utilizing it in further pipeline steps.
PySpark provides the reliability needed to upload our files to Apache Spark. This is
accomplished by using sc.addFile, where 'sc' stands for SparkContext. We use
SparkFiles.getRootDirectory() to acquire the directory path.
We use the following methods in SparkFiles to resolve the path to the files added using
SparkContext.addFile():
● get(filename),
● getrootdirectory()
Q. Explain SparkConf and its most important features.
SparkConf aids in the setup and settings needed to execute a spark application locally or in
a cluster. To put it another way, it offers settings for running a Spark application. The
following are some of SparkConf's most important attributes and methods:
● set(key, value) – sets a configuration property.
● setMaster(value) – sets the master URL.
● setAppName(value) – sets the application name.
● get(key, defaultValue=None) – gets the configuration value of a key.
The primary difference between lists and tuples is that lists are mutable, but tuples are
immutable.
When a Python object may be edited, it is considered to be a mutable data type. Immutable
data types, on the other hand, cannot be changed.
list_num = [1, 2, 5, 6]
tup_num = (1, 2, 5, 6)
list_num[3] = 7
print(list_num)
tup_num[3] = 7
Output:
[1,2,5,7]
TypeError: 'tuple' object does not support item assignment
We assigned 7 to list_num at index 3 in this code, and 7 is found at index 3 in the output.
However, we set 7 to tup_num at index 3, but the result returned a type error. Because of
their immutable nature, we can't change tuples.
Q. What are the different types of errors in Python?
There are two types of errors in Python: syntax errors and exceptions.
Syntax errors are frequently referred to as parsing errors. Errors are flaws in a program that
might cause it to crash or terminate unexpectedly. When a parser detects an error, it
repeats the offending line and then shows an arrow pointing to the line's beginning.
Exceptions arise in a program when the usual flow of the program is disrupted by an
external event. Even if the program's syntax is accurate, there is a potential that an error will
be detected during execution; nevertheless, this error is an exception. ZeroDivisionError,
TypeError, and NameError are some instances of exceptions.
PySpark is a Python API created and distributed by the Apache Spark organization to make
working with Spark easier for Python programmers. Scala is the programming language
used by Apache Spark. It can communicate with other languages like Java, R, and Python.
Also, because Scala is a compile-time, type-safe language, Apache Spark has several
capabilities that PySpark does not, one of which includes Datasets. Datasets are a highly
typed collection of domain-specific objects that may be used to execute concurrent
calculations.
SparkSession in PySpark
Spark 2.0 includes a new class called SparkSession (pyspark.sql import SparkSession).
Prior to the 2.0 release, SparkSession was a unified class for all of the many contexts we
had (SQLContext and HiveContext, etc). Since version 2.0, SparkSession may replace
SQLContext, HiveContext, and other contexts specified before version 2.0. It's a way to get
into the core PySpark technology and construct PySpark RDDs and DataFrames
programmatically. Spark is the default object in pyspark-shell, and it may be generated
programmatically with SparkSession.
In PySpark, we must use the builder pattern function builder() to construct SparkSession
programmatically (in a.py file), as detailed below. The getOrCreate() function retrieves an
already existing SparkSession or creates a new SparkSession if none exists.
spark=SparkSession.builder.master("local[1]") \
.appName('ProjectPro') \
.getOrCreate()
Py4J is a Java library integrated into PySpark that allows Python to actively communicate
with JVM instances. Py4J is a necessary module for the PySpark application to execute,
and it may be found in the $SPARK_HOME/python/lib/py4j-*-src.zip directory.
To execute the PySpark application after installing Spark, set the Py4j module to the
PYTHONPATH environment variable. We’ll get an ImportError: No module named
py4j.java_gateway error if we don't set this module to env.
export SPARK_HOME=/Users/abc/apps/spark-3.0.0-bin-hadoop2.7
export
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python
/lib/py4j-0.10.9-src.zip:$PYTHONPATH
The py4j module version changes depending on the PySpark version we’re using; to
configure this version correctly, follow the steps below:
Use the pip show command to see the PySpark location's path- pip show pyspark
Use the environment variables listed below to fix the problem on Windows-
set SPARK_HOME=C:\apps\opt\spark-3.0.0-bin-hadoop2.7
set HADOOP_HOME=%SPARK_HOME%
set PYTHONPATH=%SPARK_HOME%/python;%SPARK_HOME%/python/lib/py4j-0.10.9-
src.zip;%PYTHONPATH%
Spark shell, PySpark shell, and Databricks all have the SparkSession object 'spark' by
default. However, if we are creating a Spark/PySpark application in a.py file, we must
manually create a SparkSession object by using builder to resolve NameError: Name 'Spark'
is not Defined.
# Import PySpark
import pyspark
from pyspark.sql import SparkSession

#Create SparkSession
spark = SparkSession.builder \
          .master("local[1]") \
          .appName("SparkByExamples.com") \
          .getOrCreate()
If you get the error message 'No module named pyspark', try using findspark instead-
#Install findspark
pip install findspark
# Import findspark
import findspark
findspark.init()
#import pyspark
import pyspark
● Standalone- a simple cluster manager that comes with Spark and makes setting up a
cluster easier.
● Apache Mesos- Mesos is a cluster manager that can also run Hadoop MapReduce
and PySpark applications.
● Hadoop YARN- It is the Hadoop 2 resource management.
● Kubernetes- an open-source framework for automating containerized application
deployment, scaling, and administration.
● local – not exactly a cluster manager, but it's worth mentioning because we use
"local" for master() to run Spark on our laptop/computer.
● Reliable receiver: This receiver acknowledges the data sources when data is received and replicated successfully in Apache Spark Storage.
● Unreliable receiver: These receivers do not acknowledge the data sources even when they receive or replicate the data in Apache Spark Storage.
Apache Spark: Spark runs almost 100 times faster than Hadoop MapReduce.
Hadoop MapReduce: MapReduce is slower when it comes to large-scale data processing.
Apache Spark: Spark stores data in RAM, i.e. in-memory, so it is easier to retrieve.
Hadoop MapReduce: Data is stored in HDFS and hence takes a long time to retrieve.
Apache Spark: Spark provides caching and in-memory data storage.
Hadoop MapReduce: Hadoop is highly disk-dependent.
Apache Spark has 3 main categories that comprise its ecosystem. Those are:
Q. Explain how Spark runs applications with the help of its architecture.
This is one of the most frequently asked spark interview questions, and the interviewer
will expect you to give a thorough answer to it.
Spark applications run as independent processes that are coordinated by the
SparkSession object in the driver program. The resource manager or cluster manager
assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply
operations repeatedly to the data so they can benefit from caching datasets across
iterations. A task applies its unit of work to the dataset in its partition and outputs a new
partition dataset. Finally, the results are sent back to the driver application or can be
saved to the disk.
Q. What makes Spark good at low latency workloads like graph processing and Machine
Learning?
Apache Spark stores data in-memory for faster processing and building machine
learning models. Machine Learning algorithms require multiple iterations and different
conceptual steps to create an optimal model. Graph algorithms traverse through all the
nodes and edges to generate a graph. These low-latency workloads that need multiple
iterations benefit from keeping intermediate results in memory, which leads to increased performance.
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
There are a total of 4 steps that can help you connect Spark to Apache Mesos.
Q. What is shuffling in Spark? When does it occur?
Shuffling is the process of redistributing data across partitions that may lead to data
movement across the executors. The shuffle operation is implemented differently in
Spark compared to Hadoop.
It occurs while joining two tables or while performing byKey operations such as
GroupByKey or ReduceByKey.
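For example, the following reduceByKey() call triggers a shuffle (a sketch assuming an existing SparkContext named sc):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)   # values for the same key are pulled into the same partition
print(counts.collect())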
Suppose you want to read data from a CSV file into an RDD having four partitions.
This is how a filter operation is performed to remove all the multiples of 10 from the data.
The RDD has some empty partitions. It makes sense to reduce the number of partitions,
which can be achieved by using coalesce.
This is how the resultant RDD would look after applying coalesce.
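A sketch of the scenario above, assuming an existing SparkContext named sc (the file path and its contents are hypothetical):
rdd = sc.textFile("numbers.csv", 4)                        # RDD with four partitions
filtered = rdd.filter(lambda line: int(line) % 10 != 0)    # remove the multiples of 10
coalesced = filtered.coalesce(2)                           # shrink to fewer partitions without a full shuffle
print(coalesced.getNumPartitions())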
Spark Core is the engine for parallel and distributed processing of large data sets. The
various functionalities supported by Spark Core include:
● Scheduling and monitoring jobs
● Memory management
● Fault recovery
● Task dispatching
import com.mapr.db.spark.sql._
val df = sc.loadFromMapRDB(<table-name>)
.select("_id", "first_name").toDF()
● Using SparkSession.createDataFrame
Actions: Actions are operations that return a value after running a computation on an
RDD (Example: reduce, first, count)
Q.What is a Lineage Graph?
The need for an RDD lineage graph happens when we want to compute a new RDD or if
we want to recover the lost data from the lost persisted RDD. Spark does not support
data replication in memory. So, if any data is lost, it can be rebuilt using RDD lineage. It
is also called an RDD operator graph or RDD dependency graph.
It represents a continuous stream of data that is either in the form of an input source or
processed data stream generated by transforming the input stream.
For input streams that receive data over the network (such as Kafka or Flume), the default persistence level is set to replicate the data to two nodes for fault-tolerance.
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used to give every
node a copy of a large input dataset in an efficient manner. Spark distributes broadcast
variables using efficient broadcast algorithms to reduce communication costs.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
1. map(func)
2. transform(func)
3. filter(func)
4. count()
This is one of the most frequently asked spark interview questions where the
interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an
answer as possible here.
Yes, Apache Spark provides an API for adding and managing checkpoints.
Checkpointing is the process of making streaming applications resilient to failures. It
allows you to save the data and metadata into a checkpointing directory. In case of a
failure, the spark can recover this data and start from wherever it has stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the
metadata to fault-tolerant storage like HDFS. Metadata includes configurations,
DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because its need arises
in some of the stateful transformations. In this case, the upcoming RDD depends on the
RDDs of previous batches.
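A minimal sketch of enabling checkpointing, assuming an existing StreamingContext named ssc, a SparkContext named sc, and a hypothetical HDFS path:
ssc.checkpoint("hdfs://namenode:9000/spark-checkpoints")   # metadata checkpointing for the streaming application
sc.setCheckpointDir("hdfs://namenode:9000/spark-checkpoints")
rdd = sc.parallelize(range(100))
rdd.checkpoint()                                           # marks the RDD for data checkpointing
rdd.count()                                                # an action triggers the actual checkpoint write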
MEMORY_ONLY_SER - Stores the RDD as serialized Java objects with a one-byte array
per partition
MEMORY_ONLY - Stores the RDD as deserialized Java objects in the JVM. If the RDD is
not able to fit in the memory available, some partitions won’t be cached
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the
RDD is not able to fit in the memory, additional partitions are stored on the disk
Q. What is the difference between map and flatMap transformation in Spark Streaming?
map() flatMap()
Spark Map function takes one element as an input FlatMap allows returning 0,
process it according to custom code (specified by 1, or more elements from
the developer) and returns one element at a time the map function. In the
FlatMap operation
Q. How would you compute the total count of unique words in Spark?
1. Load the text file as an RDD:
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
2. Define a function to convert a line into words:
def toWords(line):
    return line.split()
3. Run the toWords function on each element of the RDD as a flatMap transformation:
words = lines.flatMap(toWords)
4. Convert each word into a (key, value) pair:
def toTuple(word):
    return (word, 1)
wordsTuple = words.map(toTuple)
5. Perform the reduceByKey() action:
def sum(x, y):
    return x + y
counts = wordsTuple.reduceByKey(sum)
6. Print:
counts.collect()
Suppose you have a huge text file. How will you check if a particular keyword exists
using Spark?
lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")
def isFound(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0
foundBits = lines.map(isFound)
total = foundBits.reduce(lambda a, b: a + b)
if total > 0:
    print("Found")
else:
    print("Not Found")
Accumulators are variables used for aggregating information across the executors. This
information can be about the data or API diagnosis like how many records are corrupted
or how many times a library API was called.
Spark MLlib supports local vectors and matrices stored on a single machine, as well as
distributed matrices.
Local Vector: MLlib supports two types of local vectors - dense and sparse
Labeled point: A labeled point is a local vector, either dense or sparse that is associated
with a label/response.
Local Matrix: A local matrix has integer type row and column indices, and double type
values that are stored in a single machine.
Distributed Matrix: A distributed matrix has long-type row and column indices and
double-type values, and is stored in a distributed manner in one or more RDDs.
● RowMatrix
● IndexedRowMatrix
● CoordinateMatrix
A Sparse vector is a type of local vector which is represented by an index array and a value array. It is defined as:
class SparseVector(size: Int, indices: Array[Int], values: Array[Double])
extends Object
implements Vector
where:
● size – the size of the vector
● indices – the index array (assumed to be in strictly increasing order)
● values – the value array (must have the same length as indices)
Q. Describe how model creation works with MLlib and how the model is applied.
Spark MLlib lets you combine multiple transformations into a pipeline to apply complex
data transformations.
Spark SQL loads the data from a variety of structured data sources.
It queries data using SQL statements, both inside a Spark program and from external
tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
To connect Hive to Spark SQL, place the hive-site.xml file in the conf directory of Spark.
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Select only the "name" column
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+
// Select people older than 21
df.filter(df("age") > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// | 19| 1|
// |null| 1|
// | 30| 1|
// +----+-----+
Q. What are the different types of operators provided by the Apache GraphX library?
In such spark interview questions, try giving an explanation too (not just the name of the
operators).
Property Operator: Property operators modify the vertex or edge properties using a user-
defined map function and produce a new graph.
Structural Operator: Structure operators operate on the structure of an input graph and
produce a new graph.
Join Operator: Join operators add data to graphs and generate new graphs.
GraphX is Apache Spark's API for graphs and graph-parallel computation. GraphX
includes a set of graph algorithms to simplify analytics tasks. The algorithms are
contained in the org.apache.spark.graphx.lib package and can be accessed directly as
methods on Graph via GraphOps.
Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with
an edge between them. GraphX implements a triangle counting algorithm in the
TriangleCount object that determines the number of triangles passing through each
vertex, providing a measure of clustering.
It is a plus point if you are able to explain this spark interview question thoroughly, along
with an example! PageRank measures the importance of each vertex in a graph,
assuming an edge from u to v represents an endorsement of v’s importance by u.
If a Twitter user is followed by many other users, that handle will be ranked high.
PageRank algorithm was originally developed by Larry Page and Sergey Brin to rank
websites for Google. It can be applied to measure the influence of vertices in any
network graph. PageRank works by counting the number and quality of links to a page
to determine a rough estimate of how important the website is. The assumption is that
more important websites are likely to receive more links from other websites.
A typical example of using functional programming with Apache Spark RDDs is the iterative computation of PageRank, where the link structure is repeatedly joined with the current ranks and the contributions are summed per page.
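A minimal PySpark sketch of the idea, assuming an existing SparkContext named sc and a hypothetical link structure:
links = sc.parallelize([("A", ["B", "C"]), ("B", ["C"]), ("C", ["A"])]).cache()
ranks = links.mapValues(lambda neighbors: 1.0)
for _ in range(10):
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda s: 0.15 + 0.85 * s)
print(ranks.collect())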
Q. Explain how an object is implemented in python?
Syntax:
<object-name> = <class-name>(<arguments>)
Example:
class Student:
    id = 25
    estb = 10
    def display(self):
        print("ID:", self.id)
        print("Estb:", self.estb)
stud = Student()
stud.display()
Output:
ID: 25
Estb: 10
Ans: In Python, a method is a function that is associated with an object. Any object type
can have methods.
Example:
class Student:
    roll = 17
    name = "gopal"
    age = 25
    def display(self):
        print(self.roll, self.name, self.age)
In the above example, a class named Student is created which contains three fields as
Student’s roll, name, age and a function “display()” which is used to display the
information of the Student.
Below is the example of encapsulation whereby the max price of the product cannot be
modified as it is set to 75 .
Example:
class Product:
    def __init__(self):
        self.__maxprice = 75
    def sell(self):
        print("Selling Price: {}".format(self.__maxprice))
    def setMaxPrice(self, price):
        self.__maxprice = price
p = Product()
p.sell()
p.__maxprice = 100
p.sell()
Output:
Selling Price: 75
Selling Price: 75
Ans: Inheritance refers to a concept where one class inherits the properties of another.
It helps to reuse the code and establish a relationship between different classes.
Parent class (Super or Base class): A class whose properties are inherited.
Child class (Subclass or Derived class): A class which inherits the properties.
In python, a derived class can inherit base class by just mentioning the base in the
bracket after the derived class name.
The syntax to inherit a base class into the derived class is shown below:
Syntax:
class DerivedClass(BaseClass):
The syntax to inherit multiple classes is shown below by specifying all of them inside
the bracket.
Syntax:
class DerivedClass(Base1, Base2, Base3):
Ans: A for loop in Python requires at least two variables to work. The first is the iterable
object such as a list, tuple or a string and second is the variable to store the successive
values from the sequence in the loop.
Syntax:
for iter in sequence:
    statements(iter)
The “iter” represents the iteration variable. It gets assigned with the successive values
from the input sequence.
The “sequence” may refer to any of the following Python objects such as a list, a tuple
or a string.
Example :
x = []
for i in x:
    print "in for loop"
else:
    print "in else block"
Output:
in else block
Ans: In Python, there are two types of errors - syntax error and exceptions.
Syntax Error: It is also known as parsing errors. Errors are issues in a program which
may cause it to exit abnormally. When an error is detected, the parser repeats the
offending line and then displays an arrow which points at the earliest point in the line.
Exceptions: Exceptions take place in a program when the normal flow of the program is
interrupted due to the occurrence of an external event. Even if the syntax of the program
is correct, there are chances of detecting an error during execution, this error is nothing
but an exception. Some of the examples of exceptions are - ZeroDivisionError, TypeError
and NameError.
Ans:
The key difference between lists and tuples is the fact that lists have mutable nature
and tuples have immutable nature.
It is said to be a mutable data type when a python object can be modified. On the other
hand, immutable data types cannot be modified. Let us see an example to modify an
item list vs tuple.
Example:
list_num = [1, 2, 5, 6]
tup_num = (1, 2, 5, 6)
list_num[3] = 7
print(list_num)
tup_num[3] = 7
Output:
[1,2,5,7]
TypeError: 'tuple' object does not support item assignment
In this code, we had assigned 7 to list_num at index 3 and in the output, we can see 7 is
found in index 3 . However, we had assigned 7 to tup_num at index 3 but we got type
error on the output. This is because we cannot modify tuples due to its immutable
nature.
Ans: The int() method provided by Python is a standard built-in function which converts a string into an integer value.
It can be called with a string containing a number as the argument, and it will return the
number converted to an actual integer.
Example:
print int("1") + 2
Ans: Data cleaning is the process of preparing data for analysis by removing or
modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly
formatted.
Q. What is Pyspark and explain its characteristics?
Ans: To support Python with Spark, the Spark community has released a tool called
PySpark. It is primarily used to process structured and semi-structured datasets and
also supports an optimized API to read data from the multiple data sources containing
different file formats. Using PySpark, you can also work with RDDs in the Python
programming language using its library name Py4j.
Q. Explain RDD and also state how you can create RDDs in Apache Spark.
Ans: RDD stands for Resilient Distribution Datasets, a fault-tolerant set of operational
elements that are capable of running in parallel. These RDDs, in general, are the portions
of data, which are stored in the memory and distributed over many nodes.
Hadoop datasets: Those who perform a function on each file record in Hadoop
Distributed File System (HDFS) or any other storage system.
Parallelized collections: Those existing RDDs which run in parallel with one another.
By loading an external dataset from external storage like HDFS, HBase, shared file
system.
Spark SQL: Integrates relational processing with Spark’s functional programming API.
Ans: When an action is called on a Spark RDD at a high level, Spark submits the lineage graph to the DAG Scheduler.
Actions are divided into stages of tasks in the DAG Scheduler. A stage contains tasks based on the partitions of the input data. The DAG scheduler pipelines operators together and dispatches the stages through the cluster manager. The dependencies of the stages are unknown to the task scheduler. The workers then execute the tasks on the slave nodes.
Ans: Stream processing is an extension to the Spark API that lets stream processing of
live data streams. Data from multiple sources such as Flume, Kafka, Kinesis, etc., is
processed and then pushed to live dashboards, file systems, and databases. In terms of input data, it is similar to batch processing: the incoming data is segregated into streams, like batches, before processing.
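A minimal PySpark Streaming sketch of this idea (the host, port, and batch interval are hypothetical):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "StreamingSketch")
ssc = StreamingContext(sc, 5)                             # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)           # stream of lines from a socket source
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                           # push each batch's result to the console
ssc.start()
ssc.awaitTermination()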
Ans:
Memory management.
Fault-tolerance.
Monitoring jobs.
Job scheduling.
Moreover, additional libraries built atop the core allow diverse workloads for streaming, machine learning, and SQL, all of which rely on the core for memory management and fault recovery.
Q. What is the module used to implement SQL in Spark? How does it work?
Ans: The module used is Spark SQL, which integrates relational processing with Spark’s
functional programming API. It helps to query data either through Hive Query Language
or SQL. These are the four libraries of Spark SQL.
Data Source API.
DataFrame API.
Interpreter & Optimizer.
SQL Service.
Querying data using SQL statements, both inside a Spark program and from external
tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
For instance, using business intelligence tools like Tableau.
Providing rich integration between SQL and regular Python/Java/Scala code, including
the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
Spark.mllib.
mllib.clustering.
mllib.classification.
mllib.regression.
mllib.recommendation.
mllib.linalg.
mllib.fpm.
Q. Explain the purpose of serializations in PySpark?
Ans: For improving performance, PySpark supports custom serializers to transfer data.
They are:
PickleSerializer: It is used by default for serializing objects. It supports any Python object, but at a slow speed.
MarshalSerializer: It supports fewer data types than PickleSerializer, but it is faster.
Ans: PySpark StorageLevel controls the storage of an RDD. It also manages how to store the RDD in the memory or over the disk, or sometimes both, and whether to replicate or serialize the RDD partitions. The code for StorageLevel is as follows:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
Ans: PySpark SparkContext is treated as an initial point for entering and using any Spark
functionality. The SparkContext uses py4j library to launch the JVM, and then create the
JavaSparkContext. By default, the SparkContext is available as ‘sc’.
Ans: PySpark SparkFiles is used to load our files on the Apache Spark application. It is
one of the functions under SparkContext and can be called using sc.addFile to load the
files on the Apache Spark. SparkFiles can also be used to get the path using
SparkFiles.get or resolve the paths to files that were added from sc.addFile. The class
methods present in the SparkFiles directory are getrootdirectory() and get(filename).
Ans: Apache Spark is a graph execution engine that enables users to analyze massive
data sets with high performance. For this, the data first needs to be held in memory, which improves performance drastically when it has to be manipulated through multiple stages of processing.
Ans: SparkConf helps in setting a few configurations and parameters to run a Spark
application on the local/cluster. In simple terms, it provides configurations to run a
Spark application.
Q. What is PySpark?
PySpark is an Apache Spark interface in Python. It is used for collaborating with Spark
using APIs written in Python. It also supports Spark’s features like Spark DataFrame,
Spark SQL, Spark Streaming, Spark MLlib and Spark Core. It provides an interactive
PySpark shell to analyze structured and semi-structured data in a distributed
environment. PySpark supports reading data from multiple sources and different
formats. It also facilitates the use of RDDs (Resilient Distributed Datasets). PySpark
features are implemented in the py4j library in python.
# --serializing.py----
from pyspark.context import SparkContext
from pyspark.serializers import MarshalSerializer
sc = SparkContext("local", "Marshal Serialization", serializer =
MarshalSerializer()) #Initialize spark context and serializer
print(sc.parallelize(list(range(1000))).map(lambda x: 3 *
x).take(5))
sc.stop()
When we run the file using the command:
$SPARK_HOME/bin/spark-submit serializing.py
The output of the code would be the list of size 5 of numbers multiplied by 3:
[0, 3, 6, 9, 12]
[
"interview",
"interviewbit"
]
● Action: These operations instruct Spark to perform some computations on the
RDD and return the result to the driver. It sends data from the Executer to the
driver. count(), collect(), take() are some of the examples.
Let us consider an example to demonstrate action operation by making use of
the count() function.
from pyspark import SparkContext
sc = SparkContext("local", "Action Demo")
words = sc.parallelize (
["pyspark",
"interview",
"questions",
"at",
"interviewbit"]
)
counts = words.count()
print("Count of elements in RDD -> ", counts)
In this example, we count the number of elements in the Spark RDD. The output of this code is:
Count of elements in RDD ->  5
The above figure shows the position of cluster manager in the Spark ecosystem.
Consider a master node and multiple worker nodes present in the cluster. The master
nodes provide the worker nodes with the resources like memory, processor allocation
etc depending on the nodes requirements with the help of the cluster manager.
● In-Memory Processing: PySpark’s RDD helps in loading data from the disk to the
memory. The RDDs can even be persisted in the memory for reusing the
computations.
● Immutability: The RDDs are immutable which means that once created, they
cannot be modified. While applying any transformation operations on the RDDs, a
new RDD would be created.
● Fault Tolerance: The RDDs are fault-tolerant. This means that whenever an
operation fails, the data gets automatically reloaded from other available
partitions. This results in seamless execution of the PySpark applications.
● Lazy Evaluation: The PySpark transformation operations are not performed as
soon as they are encountered. The operations would be stored in the DAG and
are evaluated once it finds the first RDD action.
● Partitioning: Whenever RDD is created from any data, the elements in the RDD are
partitioned to the cores available by default.
Q. What are the types of PySpark’s shared variables and why are they
useful?
Whenever PySpark performs the transformation operation using filter(), map() or
reduce(), they are run on a remote node that uses the variables shipped with tasks.
These variables are not reusable and cannot be shared across different tasks because
they are not returned to the Driver. To solve the issue of reusability and sharing, we have
shared variables in PySpark. There are two types of shared variables, they are:
Broadcast variables: These are also known as read-only shared variables and are used
in cases of data lookup requirements. These variables are cached and are made
available on all the cluster nodes so that the tasks can make use of them. The variables
are not sent with every task. They are rather distributed to the nodes using efficient
algorithms for reducing the cost of communication. When we run an RDD job operation
that makes use of Broadcast variables, the following things are done by PySpark:
● The job is broken into different stages having distributed shuffling. The actions
are executed in those stages.
● The stages are then broken into tasks.
● The broadcast variables are broadcasted to the tasks if the tasks need to use it.
Broadcast variables are created in PySpark by making use of the broadcast(variable)
method from the SparkContext class. The syntax for this goes as follows:
broadcastVariable = sparkContext.broadcast(variable)
Accumulator variables: These variables are called updatable shared variables. They are
added through associative and commutative operations and are used for performing
counter or sum operations. PySpark supports the creation of numeric type
accumulators by default. It also has the ability to add custom accumulator types. The
custom types can be of two types:
● Named Accumulators: These accumulators are visible under the Accumulator tab in the PySpark Web UI. Here, we will see the Accumulable section that has the sum of the Accumulator values of the variables modified by the tasks listed in the Accumulator column present in the Tasks table.
● Unnamed Accumulators: These accumulators are not shown on the PySpark Web
UI page. It is always recommended to make use of named accumulators.
Accumulator variables can be created by using
SparkContext.longAccumulator(variable) as shown in the example below:
ac = sc.longAccumulator("sumaccumulator")
sc.parallelize([2, 23, 1]).foreach(lambda x: ac.add(x))
Depending on the type of accumulator variable data - double, long and collection,
PySpark provide DoubleAccumulator, LongAccumulator and CollectionAccumulator
respectively.
class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)
where:
df.select(col("ID_COLUMN"), convertUDF(col("NAME_COLUMN")) \
    .alias("NAME_COLUMN")) \
    .show(truncate=False)
The output of the above code would be:
+----------+-----------------+
|ID_COLUMN |NAME_COLUMN |
+----------+-----------------+
|1 |Harry Potter |
|2 |Ronald Weasley |
|3 |Hermoine Granger |
+----------+-----------------+
UDFs have to be designed in a way that the algorithms are efficient and take less time
and space complexity. If care is not taken, the performance of the DataFrame
operations would be impacted.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName('InterviewBitSparkSession') \
    .getOrCreate()
Here,
● master() – This is used for setting up the mode in which the application has to
run - cluster mode (use the master name) or standalone mode. For Standalone
mode, we use the local[x] value to the function, where x represents partition
count to be created in RDD, DataFrame and DataSet. The value of x is ideally the
number of CPU cores available.
● appName() - Used for setting the application name
● getOrCreate() – For returning SparkSession object. This creates a new object if it
does not exist. If an object is there, it simply returns that.
If we want to create a new SparkSession object every time, we can use the newSession
method as shown below:
import pyspark
from pyspark.sql import SparkSession
spark_session = spark.newSession()   # 'spark' is an existing SparkSession object
+-----------+----------+
| Name | Age |
+-----------+----------+
| Harry | 20 |
| Ron | 20 |
| Hermoine | 20 |
+-----------+----------+
We can get the schema of the dataframe by using df.printSchema()
>> df.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
df = spark.read.csv("/path/to/file.csv")
PySpark supports csv, text, avro, parquet, tsv and many other file extensions.
● startsWith() – returns a Boolean value. It is True when the value of the
column starts with the specified string and False when the match is not satisfied
in that column value.
● endsWith() – returns a Boolean value. It is True when the value of the
column ends with the specified string and False when the match is not satisfied
in that column value.
Both the methods are case-sensitive.
from pyspark.sql.functions import col
df.filter(col("Name").startsWith("H")).show()
The output of the code would be:
+-----------+----------+
| Name | Age |
+-----------+----------+
| Harry | 20 |
| Hermoine | 20 |
+-----------+----------+
Notice how the record with the Name “Ron” is filtered out because it does not start with
“H”.
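Similarly, endsWith() filters on the suffix of a column value. A small sketch using the same DataFrame; the suffix "y" is an illustrative choice:
from pyspark.sql.functions import col

# Keeps only "Harry", since "Ron" and "Hermoine" do not end with "y"
df.filter(col("Name").endsWith("y")).show()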
For example, consider we have the following DataFrame assigned to a variable df:
+-----------+----------+----------+
| Name | Age | Gender |
+-----------+----------+----------+
| Harry | 20 | M |
| Ron | 20 | M |
| Hermoine | 20 | F |
+-----------+----------+----------+
In the below piece of code, we create a temporary view of the DataFrame that becomes
accessible within the SparkSession; SQL queries can then be run against it using the
sql() method.
df.createOrReplaceTempView("STUDENTS")
df_new = spark.sql("SELECT * from STUDENTS")
df_new.printSchema()
The schema will be displayed as shown below:
>> df_new.printSchema()
root
|-- Name: string (nullable = true)
|-- Age: integer (nullable = true)
|-- Gender: string (nullable = true)
For the above example, let’s try running group by on the Gender column:
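A minimal sketch of a groupBy call that could produce such a count; the Gender_Count column name is an assumption taken from the output below:
df.groupBy("Gender").count().withColumnRenamed("count", "Gender_Count").show()
The output would be: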
+------+------------+
|Gender|Gender_Count|
+------+------------+
| F| 1 |
| M| 2 |
+------+------------+
how – the type of join; it is inner by default. The values can be inner, left, right, cross,
full, outer, left_outer, right_outer, left_anti and left_semi.
The join expression can be combined with the where() and filter() methods to filter
rows, and multiple joins can be performed by chaining join() calls.
-- Employee DataFrame --
+------+--------+-----------+
|emp_id|emp_name|empdept_id |
+------+--------+-----------+
| 1| Harry| 5|
| 2| Ron | 5|
| 3| Neville| 10|
| 4| Malfoy| 20|
+------+--------+-----------+
-- Department DataFrame --
+-------+--------------------------+
|dept_id| dept_name |
+-------+--------------------------+
| 5 | Information Technology |
| 10| Engineering |
| 20| Marketting |
+-------+--------------------------+
We can inner join the Employee DataFrame with Department DataFrame to get the
department information along with employee information as:
emp_dept_df = empDF.join(deptDF, empDF.empdept_id == deptDF.dept_id, "inner")
emp_dept_df.show(truncate=False)
The result of this becomes:
+------+--------+-----------+-------+--------------------------+
|emp_id|emp_name|empdept_id |dept_id| dept_name |
+------+--------+-----------+-------+--------------------------+
| 1| Harry| 5| 5 | Information Technology |
| 2| Ron | 5| 5 | Information Technology |
| 3| Neville| 10| 10 | Engineering |
| 4| Malfoy| 20| 20 | Marketting |
+------+--------+-----------+-------+--------------------------+
We can also perform joins by chaining join() method by following the syntax:
df1.join(df2, ["column_name"]).join(df3, df1["column_name"] == df3["column_name"]).show()
Consider a third DataFrame, the Address DataFrame, with columns emp_id, city and
state, where emp_id acts as the SQL equivalent of a foreign key to the Employee
DataFrame, as shown below:
-- Address DataFrame --
+------+--------------+------+
|emp_id| city |state |
+------+--------------+------+
|1 | Bangalore | KA |
|2 | Pune | MH |
|3 | Mumbai | MH |
|4 | Chennai | TN |
+------+--------------+------+
If we want to get the address details along with the Employee and Department DataFrame
information, we can run:
resultDf = empDF.join(addressDF, ["emp_id"]) \
    .join(deptDF, empDF["empdept_id"] == deptDF["dept_id"])
resultDf.show()
The resultDf would be:
+------+--------+-----------+-----------+------+-------+------------------------+
|emp_id|emp_name|empdept_id | city      |state |dept_id| dept_name              |
+------+--------+-----------+-----------+------+-------+------------------------+
|     1|   Harry|          5| Bangalore | KA   |      5| Information Technology |
|     2|     Ron|          5| Pune      | MH   |      5| Information Technology |
|     3| Neville|         10| Mumbai    | MH   |     10| Engineering            |
|     4|  Malfoy|         20| Chennai   | TN   |     20| Marketting             |
+------+--------+-----------+-----------+------+-------+------------------------+
For example, a streaming DataFrame read from a socket source has the following schema:
root
|-- value: string (nullable = true)
Post data processing, the DataFrame can be streamed to the console or to any other
destination based on the requirements, such as Kafka, dashboards or a database.
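A minimal sketch of such a streaming pipeline using the console sink; the socket source on localhost:9999 and the app name are illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read a stream of text lines; the resulting DataFrame has the single
# column value: string shown above
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()
lines.printSchema()

# Write the (optionally processed) stream to the console sink
query = lines.writeStream.outputMode("append").format("console").start()
query.awaitTermination()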
Q. What would happen if we lose RDD partitions due to the failure of the
worker node?
If any RDD partition is lost, then that partition can be recomputed by replaying the lineage
of operations recorded from the original fault-tolerant dataset.
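A small illustration of the idea, assuming an existing SparkContext sc; the transformations are illustrative:
# Spark remembers this lineage: parallelize -> map -> filter
base = sc.parallelize(range(1000))
squares = base.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# If a worker holding some partitions of `evens` fails, Spark re-runs only
# the map and filter steps for those partitions from the source data
print(evens.count())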