PySpark Interview Questions & Solutions
Yedla Akshay Kumar
What are some strategies for optimizing Spark jobs,
and why are they important?
4. Predicate Pushdown: Filter data early in the pipeline to reduce the amount
of data processed. Use .filter() or .where() close to the data source.
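A minimal sketch of filtering close to the source (the path and column names are hypothetical):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pushdown-example").getOrCreate()
# Filter immediately after the read so Spark can push the predicate
# down into the Parquet scan instead of loading every row first.
orders = spark.read.parquet("s3://my-bucket/orders/")   # hypothetical path
recent = orders.filter(orders.order_date >= "2024-01-01")
recent.select("order_id", "amount").show()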
Repartition :
repartition(numPartitions) increases or decreases the number of partitions and performs a full shuffle of the data across all partitions.
Use it when you need to increase the number of partitions or when distributing data evenly is important, such as before a large join or to improve parallelism.
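For example, a small sketch of repartitioning on a join key (the path, key, and partition count are assumptions):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/events/")   # hypothetical path
# Full shuffle into 200 partitions hashed on the join key,
# so downstream work is spread evenly across executors.
df = df.repartition(200, "customer_id")
print(df.rdd.getNumPartitions())   # 200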
What are the steps to run a Python file containing PySpark code on AWS EC2?
1. Connect to the EC2 instance over SSH.
2. Install Java and PySpark (e.g., ‘pip install pyspark’, or download a Spark distribution and set SPARK_HOME).
3. Copy the .py file to the instance (e.g., with scp or by pulling it from S3).
4. Run the script with ‘spark-submit your_script.py’ (or ‘python your_script.py’ if it creates its own local SparkSession).
What is the difference between a Job, Stage, and Task in Apache Spark, and how do they work together?
1. Job :
A job is the top-level unit of work, triggered by an action such as `collect()`, `count()`, or `save()`. Each job is broken down into one or more stages.
2. Stage :
A stage is a set of tasks that can be executed in parallel. It represents a part of the job that involves no shuffling of data; stages are created at shuffle boundaries.
Each stage executes independently, and the number of stages depends on the job's transformations and shuffles.
3. Task :
A task is the smallest unit of work in Spark, corresponding to a single
data partition. It’s a single operation that runs on a single partition of
data within a stage.
Tasks run the computations defined in a stage across partitions. Each
stage runs multiple tasks (one per data partition).
Together, jobs, stages, and tasks define the logical execution plan for
processing large datasets in Spark, ensuring distributed and efficient
computation.
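As a rough illustration (paths and column names are hypothetical), the single action below triggers a job, the groupBy shuffle splits it into two stages, and each stage runs one task per data partition:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("s3://my-bucket/events/")             # stage 1: scan + narrow ops
counts = events.groupBy("user_id").count()                        # shuffle boundary -> stage 2
counts.write.mode("overwrite").parquet("s3://my-bucket/counts/")  # action: triggers the job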
What are the advantages of DataFrames over RDDs in PySpark?
DataFrames carry schema information, so Spark's Catalyst optimizer and Tungsten execution engine can optimize queries, making them generally faster and more memory-efficient than RDDs. They also expose a higher-level, SQL-like API (‘select’, ‘filter’, ‘groupBy’) that is easier to work with than low-level RDD operations.
What storage (persistence) levels does Spark provide?
1. Memory-Only (‘MEMORY_ONLY’)
2. Memory and Disk (‘MEMORY_AND_DISK’)
3. Disk-Only (‘DISK_ONLY’)
4. Serialized (‘MEMORY_ONLY_SER’)
5. Off-Heap (‘OFF_HEAP’)
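A minimal sketch of choosing a persistence level (the DataFrame contents are made up):
from pyspark import StorageLevel
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)
# Keep the data in memory and spill to disk if it does not fit.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # materializes the cache
df.unpersist()    # release it when no longer needed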
How do Spark Context and Streaming Context differ,
and when do you use each?
1. Spark Context:
The main entry point for Spark applications; it initializes Spark and
connects to the cluster.
Used for batch processing of data with RDDs and DataFrames.
For static data processing like reading files, running SQL queries, or
performing data transformations on large datasets.
2. Streaming Context:
A specialized context for handling streaming data, built on top of
Spark Context.
Manages real-time data streams, processing data in small batches.
For real-time analytics, such as processing data from Kafka, socket
streams, or other real-time data sources.
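A hedged sketch of creating each context (host, port, and batch interval are assumptions; this is the classic DStream API, with Structured Streaming being the newer alternative):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "streaming-example")   # batch entry point
ssc = StreamingContext(sc, batchDuration=5)          # micro-batches every 5 seconds
lines = ssc.socketTextStream("localhost", 9999)      # hypothetical socket source
lines.count().pprint()
ssc.start()
ssc.awaitTermination()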
What role does the DAG Scheduler play in executing
jobs in Spark?
Key Functions:
Converts high-level actions (like `collect()` or `save()`) into a DAG of
stages.
Divides the DAG into stages based on shuffle operations.
Creates tasks for each stage and schedules them for execution on
cluster nodes.
Handles task failures by retrying failed tasks.
What is a Catalyst Optimizer in Spark, and why is it
essential for query optimization?
Catalyst Optimizer is Spark's built-in query optimization engine that
enhances the performance of DataFrame and SQL operations.
Why is it Essential :
Analyzes and optimizes the logical plan of a query by reordering and
simplifying operations (e.g., pushing filters closer to data sources).
Converts the optimized logical plan into one or more physical plans
and selects the most efficient one for execution.
Uses statistics (like data size) to choose the best execution plan,
further improving performance.
Applies predefined rules to simplify and optimize query plans, such
as constant folding and predicate pushdown.
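One way to see Catalyst's work is to print a query's plans with explain(); a minimal sketch with made-up data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "order_id")
# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the physical plan Catalyst selected.
df.filter("order_id > 10").select("order_id").explain(extended=True)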
What distinguishes wide transformations from narrow
transformations in Spark?
Narrow Transformations :
Each partition of the parent RDD/DataFrame is used by at most one
partition of the child RDD/DataFrame.
Examples: ‘map()’, ‘filter()’, ‘flatMap()’.
No shuffling of data; data flows within the same partition.
Faster, as there's no data movement across nodes.
Wide Transformations :
Multiple child partitions depend on data from multiple parent
partitions.
Examples: ‘groupByKey()’, ‘reduceByKey()’, ‘join()’.
Requires shuffling of data across the cluster.
Slower due to the cost of shuffling data between nodes.
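A short sketch contrasting the two (column names are hypothetical):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "value"])
narrow = df.filter(F.col("id") > 1)                    # narrow: stays within each partition
wide = df.groupBy("id").agg(F.count("*").alias("n"))   # wide: requires a shuffle
wide.show()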
How do you execute an anti-join operation in Spark?
Anti-Join: Returns rows from the left DataFrame that do not have
matching rows in the right DataFrame.
Use "left_anti" in the join type to perform an anti-join, filtering out rows
with matches in the right DataFrame.
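A minimal sketch (the DataFrames are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, "Ann"), (2, "Bob"), (3, "Cat")], ["id", "name"])
orders = spark.createDataFrame([(1,), (3,)], ["id"])
# Customers with no matching order.
customers.join(orders, on="id", how="left_anti").show()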
What are semi joins in Spark, and when should you use
them?
Semi Join: Returns rows from the left DataFrame that have matching
rows in the right DataFrame, but without including columns from the
right DataFrame.
When you need to filter rows in the left DataFrame based on the
existence of matches in the right DataFrame, without needing additional
data from the right DataFrame.
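A minimal sketch (the DataFrames are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, "Ann"), (2, "Bob"), (3, "Cat")], ["id", "name"])
orders = spark.createDataFrame([(1,), (3,)], ["id"])
# Customers that have at least one order; no columns from orders are returned.
customers.join(orders, on="id", how="left_semi").show()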
What are the differences between cache and
checkpoint in Spark, and when should you use each?
1. Cache :
Stores the data in memory (or memory and disk) for quick access
during iterative operations.
Use ‘.cache()’ to keep frequently accessed DataFrames/RDDs in
memory.
Data remains only for the duration of the Spark job.
2. Checkpoint :
Saves the data to a reliable storage (like HDFS) to provide fault
tolerance by truncating the lineage graph.
Use ‘.checkpoint()’ to truncate RDD lineage and save progress on disk.
Data is stored permanently until explicitly removed.
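A hedged sketch of both (the checkpoint directory is an assumption):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")
df.cache()                 # keep in memory for repeated use
df.count()                 # materializes the cache
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   # hypothetical path
checkpointed = df.checkpoint()   # writes to reliable storage and truncates lineage
df.unpersist()             # release the cache when finished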
What is speculative execution in Spark, and how does
it help in job execution?
Speculative Execution is a feature in Spark that helps to handle slow or
"straggler" tasks in a job.
Spark monitors the progress of tasks. If a task is running much slower
than others (a straggler), Spark speculatively launches a duplicate of
that task on another node.
Whichever copy of the task finishes first is used, and the other copy is killed.
By mitigating the impact of slow tasks, it reduces overall job
completion time.
Helps recover from node performance variability, network issues, or
other unforeseen delays affecting task speed.
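Speculation is controlled through configuration; a minimal sketch (the threshold values are illustrative):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("speculation-example")
    .config("spark.speculation", "true")            # enable speculative execution
    .config("spark.speculation.multiplier", "1.5")  # how much slower than the median counts as a straggler
    .config("spark.speculation.quantile", "0.75")   # fraction of tasks that must finish before checking
    .getOrCreate()
)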
What happens if you forget to uncache data in Spark?
How does it impact performance?
Impact of Forgetting to Uncache Data in Spark :
Cached data remains in memory, occupying space even when it’s no
longer needed.
This can lead to memory pressure, reducing the space available for
other data and computations.
Excessive caching without uncaching can cause Spark to run out of
memory, leading to crashes or slower performance due to increased
garbage collection.
With memory filled by unused cached data, Spark might spill data to
disk more frequently, slowing down job execution due to disk I/O.
Uncaching: Always uncache data (`unpersist()`) when it’s no longer
needed to free up memory.
What are the main components of Spark architecture,
and how do they interact?
Main Components of Spark Architecture :
1. Driver :
The central coordinator that manages the Spark application. It
translates the user code into tasks and schedules them on the
cluster.
Sends tasks to the executors and handles the overall application flow.
2. Cluster Manager :
Allocates resources (CPU, memory) to Spark applications. Examples
include YARN, Mesos, or Spark’s standalone cluster manager.
Provides resources to the driver and executors, managing the
execution environment.
3. Executors:
Worker nodes that run tasks and execute the computations defined
by the driver. Each executor manages a subset of the data and stores
data (like cached RDDs) in memory or disk.
Executes tasks assigned by the driver, and sends back results or
progress updates.
4. Tasks :
The smallest unit of work in Spark, representing a computation on a
data partition.
Executed by the executors as instructed by the driver.
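In practice the driver requests executors from the cluster manager via configuration; a hedged sketch (the master and resource values are illustrative):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("architecture-example")
    .master("yarn")                           # cluster manager (could also be standalone, Kubernetes, etc.)
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.memory", "4g")    # memory per executor
    .config("spark.executor.cores", "2")      # cores (task slots) per executor
    .getOrCreate()
)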
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :
2. Optimize Partitioning:
Use appropriate partition sizes (‘repartition()’ or ‘coalesce()’) to avoid
small file issues or excessive shuffling.
Benefit: Balances memory use and reduces shuffle operations,
improving job performance.
6. Avoid Skewness:
Handle skewed data with techniques like salting or using ‘skew join’ hints.
Benefit: Balances workload across executors, preventing memory
bottlenecks.
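A hedged sketch of salting a skewed join key (the DataFrames, key names, and salt range of 10 are all assumptions):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
N = 10  # number of salt buckets
skewed = spark.createDataFrame([(1, "x")] * 100 + [(2, "y")], ["key", "payload"])
small = spark.createDataFrame([(1, "A"), (2, "B")], ["key", "info"])
# Add a random salt to the skewed side so one hot key spreads over N partitions.
salted = skewed.withColumn("salt", (F.rand() * N).cast("int"))
# Replicate every row of the small side once per salt value.
replicated = small.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
joined = salted.join(replicated, on=["key", "salt"]).drop("salt")
joined.show()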
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?
3. On AWS Glue:
Create a Glue Job and select “Spark” as the job type.
Upload your script to S3 or use the Glue script editor.
Configure the job with necessary IAM roles, number of DPUs (Data
Processing Units), and additional job parameters.
Run the job directly from the AWS Glue console or programmatically with
AWS SDKs.
After deploying a PySpark job in production, how do
you check if the changes have been successfully made?
Checking PySpark Job Changes in Production :
1. Review Logs:
Location: Check the logs generated by Spark. In Spark UI or cluster
management tools (like YARN, EMR console, or AWS Glue).
Look For: Confirm that the job started with the correct configurations
and inputs. Check for expected logs indicating successful execution.
2. Spark UI Monitoring:
Access: Use Spark UI to monitor stages, tasks, and job progress.
Verify: Check that all stages completed successfully without errors, and
confirm task execution matches the expected changes.
3. Validate Outputs:
Location: Check the output data location (e.g., S3, HDFS, database) where
the job writes results.
Verify: Ensure the data is written correctly and matches expected outputs
(e.g., file count, format, schema).
4. Application Metrics:
Use: Tools like CloudWatch (for AWS), Ganglia, or custom monitoring
dashboards.
Verify: Confirm that the job performance metrics align with expected
changes (e.g., runtime, memory usage).
How can you write unit test cases for PySpark code locally?
Writing Unit Test Cases for PySpark Code Locally :
Create a local SparkSession in a test fixture (pytest) or in setUp (unittest), build small input DataFrames inside the test, run the transformation function under test, and assert on the collected rows or the schema; a sketch follows below.
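A minimal pytest sketch (the function under test, add_double, is hypothetical):
import pytest
from pyspark.sql import SparkSession, functions as F

def add_double(df):
    # Hypothetical transformation under test: adds a doubled column.
    return df.withColumn("doubled", F.col("value") * 2)

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_add_double(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    result = add_double(df).collect()
    assert [row.doubled for row in result] == [2, 4]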
What is GlueContext in AWS Glue?
GlueContext:
It is a specialized context in AWS Glue that extends the standard Spark functionality with Glue-specific features.
Supports working with DynamicFrames, which are similar to Spark
DataFrames but offer additional flexibility for semi-structured data and
schema inference.
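A hedged sketch of a minimal Glue job script built around GlueContext (the database and table names are assumptions):
import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Read a Data Catalog table as a DynamicFrame (database/table are hypothetical).
dyf = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
job.commit()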
How does distribution work in Spark, and how can you
optimize partitioning?
Distribution in Spark refers to how data is split and processed across multiple
nodes in a cluster. Spark divides data into partitions, which are distributed
across executors for parallel processing.
Key Concepts :
1. Partitions:
- Data is divided into chunks called partitions.
- Each partition is processed independently by an executor.
2. Executors:
- Executors are the worker nodes that run tasks on each partition.
3. Shuffling:
- When operations require data to move between partitions (like joins or
groupBy), Spark performs a shuffle, which redistributes data across
partitions.
Optimizing Partitioning in Spark :
A general guideline is 128 MB per partition for efficient processing.
repartition(numPartitions): Increases or evenly redistributes partitions;
use for increasing parallelism or when dealing with skewed data.
coalesce(numPartitions): Decreases the number of partitions without a
full shuffle; ideal for reducing partition count after filtering operations.
Ensure data is evenly distributed to avoid some partitions having
significantly more data than others.
Use salting by adding random values to keys to distribute data more
evenly.
Adjust the number of shuffle partitions (‘spark.sql.shuffle.partitions’) to
balance between task overhead and execution parallelism.
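A brief sketch of two of these knobs (the partition counts and paths are illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Tune the number of partitions produced by shuffles (joins, groupBy, etc.).
spark.conf.set("spark.sql.shuffle.partitions", "200")
df = spark.read.parquet("s3://my-bucket/events/")      # hypothetical path
filtered = df.filter(df.status == "ERROR")
# After a selective filter, coalesce() shrinks the partition count without a full shuffle.
filtered.coalesce(8).write.mode("overwrite").parquet("s3://my-bucket/errors/")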
What are DynamicFrames in AWS Glue, and how do they
differ from DataFrames?
DynamicFrames are a data structure in AWS Glue, similar to Spark
DataFrames, but designed to handle semi-structured data and integrate with
AWS Glue's ecosystem.
1. Schema Handling:
DynamicFrames: Automatically manage schema changes and can handle
nested and complex structures.
DataFrames: Require a fixed schema and are stricter in terms of data
types and structure.
2. Transformation APIs:
DynamicFrames: Provide additional Glue-specific transformations
suitable for semi-structured data.
DataFrames: Use Spark’s standard transformation functions (‘select’,
‘filter’, ‘join’, etc.).
3. Performance:
DynamicFrames: Slightly less performant due to their schema flexibility,
but highly useful for handling messy data.
DataFrames: More optimized for performance but require cleaner, well-
structured data.
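A hedged sketch of converting between the two inside a Glue job (glueContext and dyf are assumed to come from the Glue boilerplate sketch above; names are hypothetical):
from awsglue.dynamicframe import DynamicFrame
# DynamicFrame -> DataFrame to use standard Spark transformations.
df = dyf.toDF().filter("amount > 0")
# DataFrame -> DynamicFrame to use Glue-specific writers and transforms.
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")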
How can you address and resolve Out of Memory (OOM)
errors in Spark applications?
1. Increase Executor Memory:
Adjust the executor memory setting (‘--executor-memory’) to allocate
more memory per executor.
Example: ‘--executor-memory 4G’
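Equivalently, a sketch of setting executor memory when the session is created (the values are illustrative and must be set before the application starts):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("oom-tuning-example")
    .config("spark.executor.memory", "4g")          # same effect as --executor-memory 4G
    .config("spark.executor.memoryOverhead", "1g")  # extra off-heap headroom per executor
    .getOrCreate()
)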
Yedla Akshay Kumar