Databricks Question
-----------------------------------------------------------------------------------
----------------------------
5. Why did you use Databricks in your project?
-----------------------------------------------------------------------------------
----------------------------
1. Not only does Databricks sit on top of a flexible, distributed cloud computing environment on either Azure or AWS, it also masks the complexities of distributed processing from your data scientists and engineers, allowing them to develop directly in Spark's native R, Scala, Python or SQL interfaces.
2. Why is Databricks so popular?
It unifies batch and streaming data, incorporates many different processing models and supports SQL. These characteristics make it much easier to use, highly accessible and extremely expressive.
3. Databricks is a higher-level platform that also includes multi-user support, an interactive UI, security, data management, cluster sharing and job scheduling. These qualities set Databricks apart.
-----------------------------------------------------------------------------------
-----------------------------
8. Optimization Techniques in Spark
-----------------------------------------------------------------------------------
-----------------------------
Spark Performance Tuning – Best Guidelines & Practices:
1. Use DataFrame/Dataset over RDD.
2. Use coalesce() over repartition().
3. Use mapPartitions() over map() (see the sketch below).
4. Use serialized data formats.
5. Avoid UDFs (User Defined Functions).
6. Cache data in memory.
7. Reduce expensive shuffle operations.
8. Disable DEBUG & INFO logging.
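A minimal sketch of guidelines 3 and 6 above (mapPartitions() over map(), and caching data that is reused); the dataset and the per-partition setup here are hypothetical, not taken from the original notes.
Code:
import org.apache.spark.sql.SparkSession

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tuning-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // map(): the function (and any setup inside it) runs once per element
    val squaredPerElement = numbers.map(n => n * n)

    // mapPartitions(): per-partition setup (e.g. a connection or parser) is paid
    // once per partition instead of once per element
    val squaredPerPartition = numbers.mapPartitions { iter =>
      // expensive setup would go here, once per partition
      iter.map(n => n * n)
    }

    // Cache an RDD/DataFrame that is reused by more than one action
    squaredPerPartition.cache()
    println(squaredPerPartition.count()) // first action materialises the cache
    println(squaredPerPartition.sum())   // second action reads from memory

    spark.stop()
  }
}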
-----------------------------------------------------------------------------------
-----------------------------
8 Performance Optimization Techniques Using Spark:
We all know that taking care of performance during development is just as important as the program itself. A Spark job can be optimized by many techniques, so let's dig deeper into those techniques one by one. Apache Spark optimization helps with in-memory data computations; the bottleneck for these computations can be CPU, memory or any other resource in the cluster.
1. Serialization
Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer on the JVM.
Spark can also use another serializer called Kryo for better performance: the Kryo serializer uses a compact binary format and is roughly 10x faster than the Java serializer.
To set the serializer property:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Code:
val conf = new SparkConf().setMaster(…).setAppName(…)
// use Kryo instead of the default Java serializer
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// register application classes so Kryo can serialize them efficiently
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
2. API selection
Spark provides three APIs to work with – RDD, DataFrame and Dataset.
RDD is used for low-level operations and has the least optimization.
DataFrame is the best choice in most cases because the Catalyst optimizer builds a query plan for it, which gives better performance and low garbage collection (GC) overhead.
Dataset is highly type-safe and uses encoders as part of its serialization; it uses Tungsten to serialize data in a binary format.
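A minimal sketch contrasting the three APIs; the Person case class and the sample rows are hypothetical, used only for illustration.
Code:
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object ApiSelectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: low-level, no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(Person("a", 30), Person("b", 15)))
    val adultsRdd = rdd.filter(_.age >= 18)

    // DataFrame: untyped rows, optimized by Catalyst
    val df = rdd.toDF()
    val adultsDf = df.filter($"age" >= 18)

    // Dataset: typed like an RDD, but still goes through Catalyst/Tungsten
    val ds = df.as[Person]
    val adultsDs = ds.filter(_.age >= 18)

    adultsDs.show()
    spark.stop()
  }
}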
3. Advanced Variables
Spark comes with two types of advanced (shared) variables – Broadcast and Accumulator.
Broadcasting plays an important role while tuning Spark jobs: a broadcast variable makes a small dataset available locally on every node, so that node processes the data locally.
When one dataset is much smaller than the other, a broadcast join is highly recommended:
df1.join(broadcast(df2))
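A minimal broadcast-join sketch for the df1.join(broadcast(df2)) pattern above; the orders and countries tables and their columns are hypothetical.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Large fact table and a small dimension table (hypothetical data)
    val orders = Seq((1, "IN", 100.0), (2, "US", 250.0)).toDF("order_id", "country_code", "amount")
    val countries = Seq(("IN", "India"), ("US", "United States")).toDF("country_code", "country_name")

    // Broadcasting the small side avoids shuffling the large side
    val joined = orders.join(broadcast(countries), "country_code")
    joined.show()

    spark.stop()
  }
}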
5. ByKey Operations
Many Spark transformations are ByKey operations, and ByKey operations generate a lot of shuffle. Shuffles are heavy operations because they consume a lot of memory, so while coding in Spark the user should try to avoid shuffle operations wherever possible, because shuffling degrades performance.
High shuffling can lead to an OutOfMemory error; to avoid it, the user can increase the level of parallelism.
Use reduceByKey instead of groupByKey: groupByKey shuffles all the data, which hampers performance, while reduceByKey does not shuffle the data as much, so it is faster.
Whenever a ByKey operation is used, partition the data correctly.
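A minimal sketch of the reduceByKey vs. groupByKey point, using a small hypothetical word-count-style RDD.
Code:
import org.apache.spark.sql.SparkSession

object ByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bykey-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

    // groupByKey: every value is shuffled to the reducer before aggregation
    val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey: values are pre-aggregated on each partition (map-side combine),
    // so far less data crosses the network
    val countsViaReduce = pairs.reduceByKey(_ + _)

    countsViaReduce.collect().foreach(println)
    spark.stop()
  }
}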
8. Level of Parallelism
In any distributed environment, parallelism plays a very important role while tuning a Spark job.
Whenever a Spark job is submitted, it builds a DAG that contains stages; the stages are split into tasks based on the partitions, and every partition (that is, every task) requires a single core for processing.
There are two ways to adjust the parallelism:
Repartition: gives an equal number of partitions, but involves a full shuffle, so it is not advisable when you only want to reduce the number of partitions.
Coalesce: generally reduces the number of partitions with less shuffling of data.
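A minimal sketch of repartition() vs. coalesce(); the DataFrame and the partition counts chosen here are hypothetical.
Code:
import org.apache.spark.sql.SparkSession

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallelism-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = (1 to 100000).toDF("id")
    println(s"initial partitions: ${df.rdd.getNumPartitions}")

    // repartition(): full shuffle, produces evenly sized partitions (can increase or decrease the count)
    val rebalanced = df.repartition(16)
    println(s"after repartition: ${rebalanced.rdd.getNumPartitions}")

    // coalesce(): merges existing partitions without a full shuffle (only decreases the count)
    val narrowed = rebalanced.coalesce(4)
    println(s"after coalesce: ${narrowed.rdd.getNumPartitions}")

    spark.stop()
  }
}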
source: https://www.syntelli.com/eight-performance-optimization-techniques-using-spark
-----------------------------------------------------------------------------------
-----------------------------
11. How to decide the number of partitions?
-----------------------------------------------------------------------------------
-----------------------------
The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application; as an upper bound on the number of partitions, each task should still take at least 100 ms to execute, otherwise the partitions are too small and scheduling overhead starts to dominate.
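A minimal sketch of that rule of thumb, assuming the application's total cores are visible through defaultParallelism; the dataset is hypothetical.
Code:
import org.apache.spark.sql.SparkSession

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-count-sketch").master("local[*]").getOrCreate()

    // defaultParallelism is normally the total number of cores available to the application
    val cores = spark.sparkContext.defaultParallelism
    val targetPartitions = cores * 4 // ~4x cores, per the rule of thumb above

    // Apply it to Spark SQL shuffles and to an explicit repartition
    spark.conf.set("spark.sql.shuffle.partitions", targetPartitions.toString)

    import spark.implicits._
    val df = (1 to 1000000).toDF("id").repartition(targetPartitions)
    println(s"cores=$cores, partitions=${df.rdd.getNumPartitions}")

    spark.stop()
  }
}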
12. How to decide the number of partitions in repartition()?
-----------------------------------------------------------------------------------
-----------------------------
13. Which Scheduler did you use in your project?
-----------------------------------------------------------------------------------
-----------------------------
I. Default FIFO scheduler
The FIFO scheduler runs as the default algorithm on Hadoop [5]. The way this algorithm works depends on job priority: every job to be executed has a priority for running on the resources available in the cluster, and the jobs are arranged in a queue according to their priorities.
II. Fair scheduler
This scheduler [6] uses the priority of each job as a weight to determine its portion of the total resources. A job is split into a number of tasks, and the available slots are ready for processing. The scheduler examines the time deficit of each job against its ideal fair allocation; when tasks have finished and a slot is ready for the next scheduling round, high-priority tasks are assigned to the free slot. (A configuration sketch for fair scheduling within a Spark application follows this list.)
III. Delay scheduler
In this scheduler, a task tracker waits for a specific time when its local data is not ready [7]. When a node requests a task assignment, the delay scheduler reviews the size of the job; if the job is very short, its turn is skipped and any later job that is ready to run is scheduled instead. The important problem solved by this scheduler is data locality.
IV. Capacity scheduler
The Capacity scheduler is used when different organizations need to share a large cluster, each with a smaller guaranteed capacity, while surplus capacity is shared between users [8]. MapReduce slots are configured for each existing queue, and each queue is served with FIFO priority. Resources can be accessed by high-priority jobs ahead of jobs with lower priority. Scheduled tasks are tracked by the memory consumption of each task, and the scheduler is capable of monitoring memory against the available resources.
V. Matchmaking scheduler
Matchmaking scheduling can improve the data locality of map tasks [9]. The scheduler guarantees that local tasks are assigned to slave nodes first, before assigning non-local tasks, and it keeps trying to detect a match with a slave node. Each node is marked with a locality marker, which ensures that every node gets to pull tasks.
VI. LATE (Longest Approximate Time to End) scheduler
The LATE scheduling algorithm [10] tries to improve Hadoop by seeking out genuinely slow tasks, calculating the remaining time of all tasks. It is based on prioritizing the tasks to speculate, selecting fast nodes to run them on, and capping the number of speculative tasks to prevent thrashing. This method only takes action on appropriately slow tasks and does not break the synchronization between the map and reduce phases.
VII. Deadline constraint scheduler
Deadline constraints [11] are applied when job scheduling must guarantee that jobs meet their deadlines. The deadline constraint scheduler enhances system utilization while handling data processing. This is achieved through a cost model for job execution and a Hadoop scheduler that enforces the constraint; the cost model considers parameters such as the input data size and the runtime distribution of the job's tasks.
VIII. Resource-aware scheduler
This scheduler aims to minimize wasted resources (that is, improve resource utilization) in Hadoop. These schemes focus on how effectively the different resource types (IO, disk, memory, network and CPU) are utilized [12]. Dynamic free slots are used in this scheduler.
IX. Energy-aware scheduler
The enormous clusters within data centers that run big data applications have energy costs that make energy-efficient execution important. This scheduler [13] is used to enhance the energy efficiency of MapReduce applications, which leads to a reduction of cost in data centers. Table 1 of the original source provides a comparison among the different Hadoop job scheduling techniques.
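As referenced in item II above, here is a minimal sketch of enabling fair scheduling inside a Spark application itself (spark.scheduler.mode and scheduler pools). This is Spark's own within-application job scheduler, related to but distinct from the cluster-level Hadoop/YARN Fair scheduler described above; the pool name used here is hypothetical.
Code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    // Switch Spark's internal job scheduler from the default FIFO to FAIR
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .setMaster("local[*]")
      .set("spark.scheduler.mode", "FAIR")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    // Jobs submitted from this thread go to a (hypothetical) "etl" pool;
    // pools and their weights are normally defined in a fairscheduler.xml file
    sc.setLocalProperty("spark.scheduler.pool", "etl")
    sc.parallelize(1 to 1000000).map(_ * 2).count()

    // Clear the property so later jobs from this thread use the default pool
    sc.setLocalProperty("spark.scheduler.pool", null)

    spark.stop()
  }
}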
-----------------------------------------------------------------------------------
-----------------------------