0% found this document useful (0 votes)

151 views9 pages

Spark Vs Hadoop Features Spark

Spark is a parallel data processing framework that allows developing fast, unified applications for batch, streaming, and interactive analytics. It provides resilient distributed datasets (RDDs) that can be operated on in parallel. RDDs are immutable, partitioned collections that can be operated on using transformations and actions. Transformations are lazy operations that create new RDDs, while actions return values to the driver program and trigger job execution.

Uploaded by

consania

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

151 views9 pages

Spark Vs Hadoop Features Spark

Uploaded by

consania

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

Q) Apache Spark Vs Hadoop

Spark Vs Hadoop

Features Spark

Data processing Part of hadoop, hence batch processing Batch Proc

Streaming Engine Apache spark straming - micro batches Map-Redu

Data Flow Direct Acyclic Graph-DAG Map-Redu

Computation Model Collect and process Map-Redu

Performance Slow due to batch processing Slow due t

Memory Management Automatic memory management in latest Dynamic a

release

Fault Tolerance Recovery available without extra code Highly fault

Scalability Highly scalable - sSpark Cluster(8000 Highly scal

Nodes) of nodes

Q1) What is Spark?

Spark is a parallel data processing framework. It allows to develop fast, uniﬁed big data application combine
batch, streaming and interactive analytics.

Q2) Why Spark?

Spark is the third generation distributed data processing platform. It’s uniﬁed bigdata solution for all bigdata
processing problems such as batch , interacting, streaming processing.So it can ease many bigdata problems.

Q3) What is RDD?

Spark’s primary core abstraction is called Resilient Distributed Datasets. RDD is a collection of partitioned data
that satisﬁes these properties. Immutable, distributed, lazily evaluated, catchable are common RDD properties.

Q4) What is Immutable?

Once created and assign a value, it’s not possible to change, this property is called Immutability. Spark is by
default immutable, it does not allow updates and modiﬁcations. Please note data collection is not immutable, but
data value is immutable.

Q5) What is Distributed?

RDD can automatically the data is distributed across different parallel computing nodes.

Q6) What is Lazy evaluated?

https://mindmajix.com/apache-spark-interview-questions 1/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

If you execute a bunch of programs, it’s not mandatory to evaluate immediately. Especially in Transformations,
this Laziness is a trigger.

Q6) What is Catchable?

Keep all the data in-memory for computation, rather than going to the disk. So Spark can catch the data 100
times faster than Hadoop.

Q7) What is Spark engine responsibility?

Spark responsible for scheduling, distributing, and monitoring the application across the cluster.

Q8) What are common Spark Ecosystems?

Spark SQL(Shark) for SQL developers,

Spark Streaming for streaming data,

MLLib for machine learning algorithms,

GraphX for Graph computation,

SparkR to run R on Spark engine,

BlinkDB enabling interactive queries over massive data are common Spark ecosystems. GraphX,
SparkR, and BlinkDB are in the incubation stage.

Q9) What is Partitions?

Partition is a logical division of the data, this idea derived from Map-reduce (split). Logical data speciﬁcally
derived to process the data. Small chunks of data also it can support scalability and speed up the process. Input
data, intermediate data, and output data everything is Partitioned RDD.

Learn Spark vs Hadoop What's Better to Learn First.

Q10) How spark partition the data?

Spark use map-reduce API to do the partition the data. In Input format we can create number of partitions. By
default HDFS block size is partition size (for best performance), but its’ possible to change partition size like Split.

Q11) How Spark store the data?

Spark is a processing engine, there is no storage engine. It can retrieve data from any storage engine like HDFS,
S3 and other data resources.

Q12) Is it mandatory to start Hadoop to run spark application?

No not mandatory, but there is no separate storage in Spark, so it use local ﬁle system to store the data. You can
load data from local system and process it, Hadoop or HDFS is not mandatory to run spark application.

Q13) What is SparkContext?

https://mindmajix.com/apache-spark-interview-questions 2/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

When a programmer creates a RDDs, SparkContext connect to the Spark cluster to create a new SparkContext
object. SparkContext tell spark how to access the cluster. SparkConf is key factor to create programmer
application.

Q14) What is SparkCore functionalities?

SparkCore is a base engine of apache spark framework. Memory management, fault tolarance, scheduling and
monitoring jobs, interacting with store systems are primary functionalities of Spark.

Q15) How SparkSQL is different from HQL and SQL?

SparkSQL is a special component on the sparkCore engine that support SQL and HiveQueryLanguage without
changing any syntax. It’s possible to join SQL table and HQL table.

Q16) When did we use Spark Streaming?

Spark Streaming is a real time processing of streaming data API. Spark streaming gather streaming data from
different resources like web server log ﬁles, social media data, stock market data or Hadoop ecosystems like
Flume, and Kafka.

Q17) How Spark Streaming API works?

Programmer set a specific time in the configuration, with in this time how much data gets into the Spark, that data
separates as a batch. The input stream (DStream) goes into spark streaming. Framework breaks up into small
chunks called batches, then feeds into the spark engine for processing. Spark Streaming API passes that
batches to the core engine. Core engine can generate the final results in the form of streaming batches. The
output also in the form of batches. It can allows streaming data and batch data for processing.

Q18) What is Spark MLlib?

Mahout is a machine learning library for Hadoop, similarly MLlib is a Spark library. MetLib provides different
algorithms, that algorithms scale out on the cluster for data processing. Most of the data scientists use this MLlib
library.

Q19) What is GraphX?

GraphX is a Spark API for manipulating Graphs and collections. It uniﬁes ETL, other analysis, and iterative graph
computation. It’s fastest graph system, provides fault tolerance and ease of use without special skills.

Q20) What is File System API?

FS API can read data from different storage devices like HDFS, S3 or local FileSystem. Spark uses FS API to
read data from different storage engines.

Q21) Why Partitions are immutable?

Every transformation generates new partition. Partitions use HDFS API so that partition is immutable, distributed
and fault tolerance. Partition also aware of data locality.

Q22) What is Transformation in spark?

https://mindmajix.com/apache-spark-interview-questions 3/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

Spark provides two special operations on RDDs called transformations and Actions. Transformation follows lazy
operation and temporary hold the data until unless called the Action. Each transformation generates/return new
RDD. Example of transformations: Map, ﬂatMap, groupByKey, reduceByKey, ﬁlter, co-group, join, sortByKey,
Union, distinct, sample are common spark transformations.

Q23) What is Action in Spark?

Actions are RDD’s operation, that value returns back to the spar driver programs, which kick off a job to execute
on a cluster. Transformation’s output is an input of Actions. reduce, collect, takeSample, take, ﬁrst,
saveAsTextﬁle, saveAsSequenceFile, countByKey, foreach are common actions in Apache spark.

Q24) What is RDD Lineage?

Lineage is an RDD process to reconstruct lost partitions. Spark not replicate the data in memory, if data lost, Rdd
use linege to rebuild lost data.Each RDD remembers how the RDD build from other datasets.

Q25) What is Map and ﬂatMap in Spark?

The map is a speciﬁc line or row to process that data. In FlatMap each input item can be mapped to multiple
output items (so the function should return a Seq rather than a single item). So most frequently used to return
Array elements.

Q26) What are broadcast variables?

Broadcast variables let programmer keep a read-only variable cached on each machine, rather than shipping a
copy of it with tasks. Spark supports 2 types of shared variables called broadcast variables (like Hadoop
distributed cache) and accumulators (like Hadoop counters). Broadcast variables stored as Array Buffers, which
sends read-only values to work nodes.

Q27) What are Accumulators in Spark?

Spark of-line debuggers called accumulators. Spark accumulators are similar to Hadoop counters, to count the
number of events and what’s happening during job you can use accumulators. Only the driver program can read
an accumulator value, not the tasks.

Q28) How RDD persist the data?

There are two methods to persist the data, such as persist() to persist permanently and cache() to persist
temporarily in the memory. Different storage level options there such as MEMORY_ONLY,
MEMORY_AND_DISK, DISK_ONLY and many more. Both persist() and cache() uses different options depends
on the task.

Q29) When do you use apache spark? OR What are the beneﬁts of Spark over Mapreduce?

Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk. It aptly utilizes RAM to produce the faster results.

In map reduce paradigm, you write many Map-reduce tasks and then tie these tasks together using
Oozie/shell script. This mechanism is very time consuming and the map-reduce task has heavy
latency.

https://mindmajix.com/apache-spark-interview-questions 4/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

And quite often, translating the output out of one MR job into the input of another MR job might
require writing another code because Oozie may not sufﬁce.

In Spark, you can basically do everything using single application/console (pyspark or scala console)
and get the results immediately. Switching between ‘Running something on cluster’ and ‘doing
something locally’ is fairly easy and straightforward. This also leads to less context switch of the
developer and more productivity.

Spark kind of equals to MapReduce and Oozie put together.

Q30) Is there is a point of learning MapReduce, then?

Yes. For the following reason:

Mapreduce is a paradigm used by many big data tools including Spark. So, understanding the
MapReduce paradigm and how to convert a problem into series of MR tasks is very important.

When the data grows beyond what can ﬁt into the memory on your cluster, the Hadoop Map-Reduce
paradigm is still very relevant.

Almost, every other tool such as Hive or Pig converts its query into MapReduce phases. If you
understand the Mapreduce then you will be able to optimize your queries better.

Q31) When running Spark on Yarn, do I need to install Spark on all nodes of Yarn Cluster?

Since spark runs on top of Yarn, it utilizes yarn for the execution of its commands over the cluster’s nodes.
So, you just have to install Spark on one node.

Check Out Apache Spark Tutorials

Q32) What are the downsides of Spark?

Spark utilizes the memory. The developer has to be careful. A casual developer might make following mistakes:

She may end up running everything on the local node instead of distributing work over to the cluster.

She might hit some webservice too many times by the way of using multiple clusters.

The ﬁrst problem is well tackled by Hadoop Map reduce paradigm as it ensures that the data your code is
churning is fairly small a point of time thus you can make a mistake of trying to handle whole data on a single
node.
The second mistake is possible in Map-Reduce too. While writing Map-Reduce, user may hit a service from
inside of map() or reduce() too many times. This overloading of service is also possible while using Spark.

Q33) What is an RDD?

The full form of RDD is resilience distributed dataset. It is a representation of data located on a network which is

Immutable – You can operate on the rdd to produce another rdd but you can’t alter it.

Partitioned / Parallel – The data located on RDD is operated in parallel. Any operation on RDD is
done using multiple nodes.

https://mindmajix.com/apache-spark-interview-questions 5/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

Resilience – If one of the node hosting the partition fails, another nodes takes its data.

RDD provides two kinds of operations: Transformations and Actions.

Q34) What is Transformations?

The transformations are the functions that are applied on an RDD (resilient distributed data set). The
transformation results in another RDD. A transformation is not executed until an action follows.
The example of transformations are:

1. map() – applies the function passed to it on each element of RDD resulting in a new RDD.

2. ﬁlter() – creates a new RDD by picking the elements from the current RDD which pass the function
argument.

Q35) What are Actions?

An action brings back the data from the RDD to the local machine. Execution of an action results in all the
previously created transformation. The example of actions are:

reduce() – executes the function passed again and again until only one value is left. The function
should take two argument and return one value.

take() – take all the values back to the local node form RDD.

Q36) Say I have a huge list of numbers in RDD(say myrdd). And I wrote the following code to compute
average:

def myAvg(x, y):

return (x+y)/2.0;
avg = myrdd.reduce(myAvg);

Q37) What is wrong with it? And How would you correct it?

The average function is not commutative and associative;

I would simply sum it and then divide by count.
def sum(x, y):
return x+y;
total = myrdd.reduce(sum);
avg = total / myrdd.count();
The only problem with the above code is that the total might become very big thus over ﬂow. So, I would rather
divide each number by count and then sum in the following way.
cnt = myrdd.count();
def devideByCnd(x):
return x/cnt;
myrdd1 = myrdd.map(devideByCnd);
avg = myrdd.reduce(sum);

https://mindmajix.com/apache-spark-interview-questions 6/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

Q38) Say I have a huge list of numbers in a ﬁle in HDFS. Each line has one number.And I want to compute
the square root of sum of squares of these numbers. How would you do it?

# We would ﬁrst load the ﬁle as RDD from HDFS on spark

numsAsText = sc.textFile(“hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt”);
# Define the function to compute the squares
def toSqInt(str):
v = int(str);
return v*v;
#Run the function on spark rdd as transformation
nums = numsAsText.map(toSqInt);
#Run the summation as reduce action
total = nums.reduce(sum)
#finally compute the square root. For which we need to import math.
import math;
print math.sqrt(total);

Q39) Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer?

numsAsText =sc.textFile(“hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersﬁle.txt”);
def toInt(str):
return int(str);
nums = numsAsText.map(toInt);
def sqrtOfSumOfSq(x, y):
return math.sqrt(x*x+y*y);
total = nums.reduce(sum)
import math;
print math.sqrt(total);
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.

Q40) Could you compare the pros and cons of the your approach (in Question 2 above) and my approach
(in Question 3 above)?

You are doing the square and square root as part of reduce action while I am squaring in map() and summing in
reduce in my approach.
My approach will be faster because in your case the reducer code is heavy as it is calling math.sqrt() and reducer
code is generally executed approximately n-1 times the spark RDD.
The only downside of my approach is that there is a huge chance of integer overﬂow because I am computing the
sum of squares as part of map.

Q41) If you have to compute the total counts of each of the unique words on spark, how would you go
about it?

#This will load the bigtextfile.txt as RDD in the spark lines = sc.textFile(“hdfs://hadoop1.knowbigdata.com/user/st
udent/sgiri/bigtextfile.txt”);
#define a function that can break each line into words
https://mindmajix.com/apache-spark-interview-questions 7/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

def toWords(line):
return line.split();
# Run the toWords function on each element of RDD on spark as flatMap transformation.
# We are going to flatMap instead of map because our function is returning multiple values.
words = lines.flatMap(toWords);
# Convert each word into (key, value) pair. Her key will be the word itself and value will be 1.
def toTuple(word):
return (word, 1);
wordsTuple = words.map(toTuple);
# Now we can easily do the reduceByKey() action.
def sum(x, y):
return x+y;
counts = wordsTuple.reduceByKey(sum)
# Now, print
counts.collect()

Q41) In a very huge text ﬁle, you want to just check if a particular keyword exists. How would you do this
using Spark?

lines = sc.textFile(“hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextﬁle.txt”);
def isFound(line):
if line.ﬁnd(“mykeyword”) > -1:
return 1;
return 0;
foundBits = lines.map(isFound);
sum = foundBits.reduce(sum);
if sum > 0:
print “FOUND”;
else:
print “NOT FOUND”;

Q42) Can you improve the performance of this code in previous answer?

Yes. The search is not stopping even after the word we are looking for has been found. Our map code would
keep executing on all the nodes which is very inefﬁcient.
We could utilize accumulators to report whether the word has been found or not and then stop the job. Something
on these line:
import thread, threading
from time import sleep
result = “Not Set”
lock = threading.Lock()
accum = sc.accumulator(0)
def map_func(line):
#introduce delay to emulate the slowness

https://mindmajix.com/apache-spark-interview-questions 8/9
01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

sleep(1);
if line.ﬁnd(“Adventures”) > -1:
accum.add(1);
return 1;
return 0;
def start_job():
global result
try:
sc.setJobGroup(“job_to_cancel”, “some description”)
lines = sc.textFile(“hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt”);
result = lines.map(map_func);
result.take(1);
except Exception as e:
result = “Cancelled”
lock.release()
def stop_job():
while accum.value < 3 :
sleep(1);
sc.cancelJobGroup(“job_to_cancel”)
supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()
[/tab]
Facing technical problem in your current IT job, let us help you. MindMajix has highly technical people who can
assist you in solving technical problems in your project.
We have come across many developers in USA, Australia and other countries who have recently got the job but
they are struggling to survive in the job because of less technical knowledge, exposure and the kind of work
given to them.
We are here to help you.

Let us know your proﬁle and kind of help you are looking for and we shall do our best to help you out. The job
support is provided by Mindmajix Technical experts who have more than 10 years of work experience on IT
technologies landscape.
How does the job support work?
* We see your project and technologies used, if we are 100% conﬁdent then we agree to support you.
* We work on the monthly basis
* No of hours of Support: Based on customer need and the pricing also varies
* We support you to solve your technical problem and guide you in the right direction.

https://mindmajix.com/apache-spark-interview-questions 9/9

Spark Interview 4
No ratings yet
Spark Interview 4
10 pages
Master Pyspark Zero To Hero 1738689679
No ratings yet
Master Pyspark Zero To Hero 1738689679
102 pages
Pyspark
No ratings yet
Pyspark
31 pages
Spark Interview Questions 1713805760
No ratings yet
Spark Interview Questions 1713805760
40 pages
Pyspark IQ FREE Guide
100% (1)
Pyspark IQ FREE Guide
57 pages
Spark Interview Questions
No ratings yet
Spark Interview Questions
3 pages
Top Pyspark InterviewQuestions
No ratings yet
Top Pyspark InterviewQuestions
21 pages
Azure Comapny Wise Question
No ratings yet
Azure Comapny Wise Question
68 pages
ADE Azure Data Engineer Interview
No ratings yet
ADE Azure Data Engineer Interview
12 pages
SCD in Databricks
No ratings yet
SCD in Databricks
16 pages
Snowflake Prctice1
No ratings yet
Snowflake Prctice1
51 pages
3 Lecture 3-ETL
100% (1)
3 Lecture 3-ETL
42 pages
Spark RDD Dataframes SQL
No ratings yet
Spark RDD Dataframes SQL
3 pages
Data Engineer
No ratings yet
Data Engineer
19 pages
Azure Data Engineer Interview Questions
No ratings yet
Azure Data Engineer Interview Questions
15 pages
Spark With Python Notes
No ratings yet
Spark With Python Notes
206 pages
Apache Spark
No ratings yet
Apache Spark
62 pages
DBT Flow
No ratings yet
DBT Flow
15 pages
Mourya K Data Engineer
No ratings yet
Mourya K Data Engineer
7 pages
Create An Spark Streaming App: 1. Architecture and Abstraction
No ratings yet
Create An Spark Streaming App: 1. Architecture and Abstraction
8 pages
External Tables
No ratings yet
External Tables
105 pages
Azure DE Interview Que
100% (1)
Azure DE Interview Que
25 pages
Interview Series ADF Part-1
No ratings yet
Interview Series ADF Part-1
17 pages
50 PySpark Interview Questions PDF
No ratings yet
50 PySpark Interview Questions PDF
7 pages
Databricks Question
No ratings yet
Databricks Question
7 pages
Data Bricks
No ratings yet
Data Bricks
20 pages
What Is Spark?: Up To 100× Faster
No ratings yet
What Is Spark?: Up To 100× Faster
56 pages
Spark Optimizations & Deployment
No ratings yet
Spark Optimizations & Deployment
39 pages
Dec 01 2020
No ratings yet
Dec 01 2020
298 pages
Azure Data Factory Interview Questions and Aswers
No ratings yet
Azure Data Factory Interview Questions and Aswers
5 pages
Data Engineer Interview Questions
No ratings yet
Data Engineer Interview Questions
16 pages
Core Data Services (CDS)
No ratings yet
Core Data Services (CDS)
29 pages
Azure DataBricks Interview Questions
No ratings yet
Azure DataBricks Interview Questions
17 pages
SCD Type 2. Pyspark
No ratings yet
SCD Type 2. Pyspark
7 pages
Data Migration With Sap Frank Densborn
No ratings yet
Data Migration With Sap Frank Densborn
32 pages
INTERVIEW QUESTIONS - ALL Companies
No ratings yet
INTERVIEW QUESTIONS - ALL Companies
15 pages
Delta Table and Pyspark Interview Questions
100% (1)
Delta Table and Pyspark Interview Questions
14 pages
2018 02 08 Whats New in Apache Spark 2 180213220045
No ratings yet
2018 02 08 Whats New in Apache Spark 2 180213220045
57 pages
Midhun BIGDATA Curicullum
No ratings yet
Midhun BIGDATA Curicullum
17 pages
Siva
No ratings yet
Siva
4 pages
SAST Checkmarx
100% (1)
SAST Checkmarx
6 pages
150 Data Engineering Interview Questions PDF
No ratings yet
150 Data Engineering Interview Questions PDF
8 pages
De Mod 0 Get Started With Pyspark Programming
No ratings yet
De Mod 0 Get Started With Pyspark Programming
7 pages
Spark Syllabus 1
No ratings yet
Spark Syllabus 1
3 pages
Best Practices For Bucketing in Spark SQL - by David Vrba - Towards Data Science
No ratings yet
Best Practices For Bucketing in Spark SQL - by David Vrba - Towards Data Science
27 pages
PySpark Questions
No ratings yet
PySpark Questions
5 pages
Python Data Science 101
100% (1)
Python Data Science 101
41 pages
Interview Questions
No ratings yet
Interview Questions
2 pages
Oracle HRMS Question e Book
100% (1)
Oracle HRMS Question e Book
37 pages
Dataeng-Zoomcamp - 5 - Batch - Processing - MD at Main Ziritrion - Dataeng-Zoomcamp GitHub
No ratings yet
Dataeng-Zoomcamp - 5 - Batch - Processing - MD at Main Ziritrion - Dataeng-Zoomcamp GitHub
41 pages
Spark Sample Resume 2
100% (1)
Spark Sample Resume 2
7 pages
Interview Questions On ADF
No ratings yet
Interview Questions On ADF
2 pages
Bhaskar ADE - Altimetrik
No ratings yet
Bhaskar ADE - Altimetrik
3 pages
SCD Type-1,2 Implementation in Pyspark
No ratings yet
SCD Type-1,2 Implementation in Pyspark
6 pages
Spark Concept
No ratings yet
Spark Concept
18 pages
Hive Interview Questions Answers
No ratings yet
Hive Interview Questions Answers
6 pages
Databricks Delta Guide
No ratings yet
Databricks Delta Guide
11 pages
Interview
No ratings yet
Interview
86 pages
Pushpender Snowflake 24thjune Questions
No ratings yet
Pushpender Snowflake 24thjune Questions
3 pages
Iswarya - SR - Bigdata Hadoop Developer
No ratings yet
Iswarya - SR - Bigdata Hadoop Developer
8 pages
Databricks
No ratings yet
Databricks
11 pages
Supplier Creation Throug FBDI
No ratings yet
Supplier Creation Throug FBDI
9 pages
Bigdata Notes
No ratings yet
Bigdata Notes
26 pages
Database. in COBOL or SQR Programs Reside Completely Within The File Server. If
100% (2)
Database. in COBOL or SQR Programs Reside Completely Within The File Server. If
21 pages
BD - Spark - Baladasu A - SightSpectrum
No ratings yet
BD - Spark - Baladasu A - SightSpectrum
3 pages
CIT 3203 Database Administration Notes
100% (1)
CIT 3203 Database Administration Notes
66 pages
Seminar Report ON Data Warehouse
No ratings yet
Seminar Report ON Data Warehouse
21 pages
SQL Nested Queries PDF
0% (1)
SQL Nested Queries PDF
12 pages
PHP - Google Maps - How To Get GPS Coordinates For Address
No ratings yet
PHP - Google Maps - How To Get GPS Coordinates For Address
4 pages
View45 Integration Guide
No ratings yet
View45 Integration Guide
70 pages
trắc nghiệm phân tích dữ liệu trong kế toán
No ratings yet
trắc nghiệm phân tích dữ liệu trong kế toán
24 pages
Customer Life Cycle: Predictive Customer Analytics - Part I
No ratings yet
Customer Life Cycle: Predictive Customer Analytics - Part I
5 pages
LA2 Computer System Powerpoint
No ratings yet
LA2 Computer System Powerpoint
40 pages
Install
No ratings yet
Install
9 pages
Compare Hadoop and Spark.: Table
No ratings yet
Compare Hadoop and Spark.: Table
10 pages
EXDvRQz7REOw70UM-4RDMg Course1PostgreSQLpgAdminTutorial
No ratings yet
EXDvRQz7REOw70UM-4RDMg Course1PostgreSQLpgAdminTutorial
32 pages
An Automated System For Patient Record Management
No ratings yet
An Automated System For Patient Record Management
7 pages
BM 5500
No ratings yet
BM 5500
2 pages
7 - Relational Databases and SQL
No ratings yet
7 - Relational Databases and SQL
35 pages
2022-05-13
No ratings yet
2022-05-13
11 pages
Midterm Report 1
No ratings yet
Midterm Report 1
21 pages
Machine Learning101
No ratings yet
Machine Learning101
20 pages
OS Storage Management
No ratings yet
OS Storage Management
8 pages
7 - Big and Small Data
No ratings yet
7 - Big and Small Data
2 pages
Sem Iii - It - Rev 19 C - Dec-23
No ratings yet
Sem Iii - It - Rev 19 C - Dec-23
7 pages
Understanding Master Boot Record (MBR) - Know IT Like Pro
No ratings yet
Understanding Master Boot Record (MBR) - Know IT Like Pro
15 pages
A Beginner's Guide To Disks and Disk Partitions in Linux
No ratings yet
A Beginner's Guide To Disks and Disk Partitions in Linux
7 pages
Data Governance and Security Dashboard - (Tableau - Business Analyst)
No ratings yet
Data Governance and Security Dashboard - (Tableau - Business Analyst)
11 pages
DB 2 Cheat Sheet For Dev 20210323
No ratings yet
DB 2 Cheat Sheet For Dev 20210323
2 pages
YEAR 12 ICT Term 2 Scheme of Work
No ratings yet
YEAR 12 ICT Term 2 Scheme of Work
2 pages
Arik Dholiya JavaDeveloper
No ratings yet
Arik Dholiya JavaDeveloper
1 page
Trinayreddy
No ratings yet
Trinayreddy
1 page
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
From Everand
Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala
Eric Tome
No ratings yet
Fast Data Processing with Spark 2 - Third Edition
From Everand
Fast Data Processing with Spark 2 - Third Edition
Krishna Sankar
No ratings yet

Spark Vs Hadoop Features Spark

Uploaded by

Spark Vs Hadoop Features Spark

Uploaded by

01/05/2019 The Best Apache Spark Interview Questions [UPDATED] 2019

Q) Apache Spark Vs Hadoop

Data processing Part of hadoop, hence batch processing Batch Proc

Streaming Engine Apache spark straming - micro batches Map-Redu

Data Flow Direct Acyclic Graph-DAG Map-Redu

Computation Model Collect and process Map-Redu

Performance Slow due to batch processing Slow due t

Memory Management Automatic memory management in latest Dynamic a

Fault Tolerance Recovery available without extra code Highly fault

Scalability Highly scalable - sSpark Cluster(8000 Highly scal

Q1) What is Spark?

Q2) Why Spark?

Q3) What is RDD?

Q4) What is Immutable?

Q5) What is Distributed?

Q6) What is Lazy evaluated?

Q6) What is Catchable?

Q7) What is Spark engine responsibility?

Q8) What are common Spark Ecosystems?

Spark SQL(Shark) for SQL developers,

Spark Streaming for streaming data,

MLLib for machine learning algorithms,

GraphX for Graph computation,

SparkR to run R on Spark engine,

Q9) What is Partitions?

Learn Spark vs Hadoop What's Better to Learn First.

Q11) How Spark store the data?

Q12) Is it mandatory to start Hadoop to run spark application?

Q13) What is SparkContext?

Q14) What is SparkCore functionalities?

Q15) How SparkSQL is different from HQL and SQL?

Q16) When did we use Spark Streaming?

Q17) How Spark Streaming API works?

Q18) What is Spark MLlib?

Q19) What is GraphX?

Q20) What is File System API?

Q21) Why Partitions are immutable?

Q22) What is Transformation in spark?

Q23) What is Action in Spark?

Q24) What is RDD Lineage?

Q25) What is Map and ﬂatMap in Spark?

Q26) What are broadcast variables?

Q27) What are Accumulators in Spark?

Q28) How RDD persist the data?

Spark kind of equals to MapReduce and Oozie put together.

Q30) Is there is a point of learning MapReduce, then?

Yes. For the following reason:

Check Out Apache Spark Tutorials

Q32) What are the downsides of Spark?

Q33) What is an RDD?

RDD provides two kinds of operations: Transformations and Actions.

Q34) What is Transformations?

Q35) What are Actions?

def myAvg(x, y):

The average function is not commutative and associative;

# We would ﬁrst load the ﬁle as RDD from HDFS on spark

Q39) Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer?

You might also like