Before Spark Interview

How to answer frequently asked real-time project questions?


AGENDA:
Project Questions:
Cluster configuration:

Task:
1. You were given a task to develop a new reporting pipeline that consumes data from various sources, dimension and fact tables.

2. This pipeline has to run after your SLA hours, so it has to be scheduled separately.

3. After writing the logic, it's time to test the pipeline end-to-end and validate that data flows correctly.

4. You have to define the cluster configuration for this new pipeline. How do you go about it?
Cluster configuration:

Steps to answer:

Initial questions to ask yourself:


Cluster configuration:
(Assume the workload needs roughly 2250 GB of memory in total; reserve 1 core and ~10% of memory per node for the OS and daemons, and use 5 cores per executor.)

Option 1: R5.2xlarge EMR (64 GB, 8 cores per node)
Usable memory per node: 64 GB - 10% = 57.6 GB
Executors per node: 7 / 5 ~ 1
Total nodes: 2250 / 57.6 ~ 39
Total executors: 39 * 1 = 39
Executor memory: 52 GB
Memory overhead: 5 GB (min. 10% of usable memory)
Driver memory: 52 GB
Driver overhead: 5 GB
Total tasks: 39 * 7 = 273
Total memory per executor: (52 + 5) GB

Option 2: R5.4xlarge EMR (128 GB, 16 cores per node)
Usable memory per node: 128 GB - 10% = 115.2 GB
Executors per node: 15 / 5 ~ 3
Total nodes: 2250 / 115.2 ~ 20
Total executors: 20 * 3 = 60
Executor memory: 115.2 / 3 = 38.3 GB; minus overhead ~ 34 GB
Memory overhead: 4 GB (10% of memory per executor)
Driver memory: 104 GB (115 - driver overhead)
Driver overhead: 11 GB (10% of 115)
Total tasks: 60 * 5 = 300
Total memory per executor: (34 + 4) GB
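A minimal sketch of the same sizing arithmetic in Python, assuming the 2250 GB memory requirement, the 1-core/10%-memory reservation per node, and 5 cores per executor used above. Rounding (and the total-task count for Option 1) differs slightly from the hand calculation, so treat it as a starting point, not a formula.

```python
# Rough cluster-sizing arithmetic mirroring the table above (a sketch, not a rule).
# Assumptions: ~10% of node memory and 1 core per node are reserved for OS/daemons,
# 5 cores per executor, and the workload needs roughly 2250 GB of memory in total.

def size_cluster(node_mem_gb, node_cores,
                 total_mem_needed_gb=2250, cores_per_executor=5):
    usable_mem = node_mem_gb * 0.90                      # memory left after OS reservation
    executors_per_node = (node_cores - 1) // cores_per_executor
    total_nodes = round(total_mem_needed_gb / usable_mem)
    total_executors = total_nodes * executors_per_node
    mem_per_executor = usable_mem / executors_per_node   # before overhead
    overhead_gb = round(mem_per_executor * 0.10)         # ~10% memory overhead
    executor_mem_gb = int(mem_per_executor - overhead_gb)
    return {
        "nodes": total_nodes,
        "executors": total_executors,
        "executor_memory_gb": executor_mem_gb,
        "memory_overhead_gb": overhead_gb,
        "parallel_tasks": total_executors * cores_per_executor,
    }

print(size_cluster(64, 8))     # R5.2xlarge
print(size_cluster(128, 16))   # R5.4xlarge
```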
Memory allocation:

Set the executor and driver memory size appropriately to ensure sufficient memory
for data processing and shuffle operations.
Allocate an optimal number of executor cores considering the CPU resources
required for parallel processing.

Set the memory overhead per executor to accommodate JVM overheads, off-heap
storage, and other system-related memory requirements.
(calculate these parameters below)

spark.executor.memory:
spark.executor.cores:
spark.executor.memoryOverhead:
spark.driver.memory:
spark.driver.memoryOverhead:
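
A minimal sketch of how these parameters might be set when building the session, assuming the Option 2 (R5.4xlarge) numbers worked out above. The application name is hypothetical, the values are illustrative rather than recommended, and in practice driver memory is often passed at submit time (spark-submit --conf / --driver-memory) instead of in code.

```python
from pyspark.sql import SparkSession

# Illustrative values taken from the R5.4xlarge sizing above;
# plug in the numbers from your own cluster calculation.
spark = (
    SparkSession.builder
    .appName("reporting-pipeline")                      # hypothetical job name
    .config("spark.executor.memory", "34g")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memoryOverhead", "4g")
    .config("spark.driver.memory", "104g")
    .config("spark.driver.memoryOverhead", "11g")
    .getOrCreate()
)
```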
Broadcast Join Threshold:
Tune the broadcast join threshold, which sets the maximum table size Spark will broadcast during join operations.
Use broadcast joins for smaller tables to reduce shuffle data and minimize network traffic.

spark.sql.autoBroadcastJoinThreshold: 10MB
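
For example, the threshold can be tuned at runtime and a broadcast hint applied explicitly to a known-small dimension table. The 50 MB value and the table/column names below are illustrative assumptions, not values taken from this document.

```python
from pyspark.sql.functions import broadcast

# The default threshold is 10 MB; raise it only if the small side of the
# join comfortably fits in executor and driver memory.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

fact_df = spark.table("sales_fact")       # large fact table (hypothetical)
dim_df = spark.table("product_dim")       # small dimension table (hypothetical)

# Explicit broadcast hint, regardless of the threshold:
joined = fact_df.join(broadcast(dim_df), on="product_id", how="left")
```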
What checks are done after data ingestion?
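The document does not spell these out, but typical answers cover row counts, expected columns, null keys, and duplicates. A minimal sketch, with hypothetical table and column names:

```python
# Typical post-ingestion sanity checks (table and column names are made up):
# non-empty load, expected schema, no null keys, no duplicate keys.
df = spark.table("staging.orders_ingest")

row_count = df.count()
assert row_count > 0, "no rows ingested"

expected_cols = {"order_id", "order_date", "amount"}
missing = expected_cols - set(df.columns)
assert not missing, f"missing columns: {missing}"

null_keys = df.filter(df["order_id"].isNull()).count()
assert null_keys == 0, f"{null_keys} rows with null order_id"

duplicates = row_count - df.dropDuplicates(["order_id"]).count()
assert duplicates == 0, f"{duplicates} duplicate order_id values"
```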

How are business requirements stated?

1. Complete new product/application onboarding

2. New feature enhancement

3. New source onboarding

4. New report to be generated


How are business requirements stated?

1. EPIC (Initiative)

2. Description, examples (tables, sample data), columns to include, documentation

3. SLA to be met

4. Business value

5. Timeline for UAT validation, production deployment
