PySpark Interview Questions & Solutions
Yedla Akshay Kumar
What are some strategies for optimizing Spark jobs,
and why are they important?

1. Partitioning: Ensure data is evenly distributed across partitions to avoid data skew. Use `repartition()` or `coalesce()` to manage partition sizes.
2. Caching: Cache frequently accessed DataFrames using `.cache()` or `.persist()` to avoid recomputation.
3. Broadcast Variables: Use `broadcast()` for small lookup tables to reduce data shuffling during joins.
What are some strategies for optimizing Spark jobs,
and why are they important?

4. Predicate Pushdown: Filter data early in the pipeline to reduce the amount of data processed. Use `.filter()` or `.where()` close to the data source.
5. Avoid Shuffles: Minimize shuffle operations (like `groupBy()` or `join()`) as they are costly. Optimize join conditions and use map-side (broadcast) joins where possible.

Optimizing Spark jobs improves performance, reduces execution time, and lowers resource costs, making your applications faster and more efficient.
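
A minimal sketch combining these techniques, assuming a large events dataset and a small countries lookup table (paths, column names, and the partition count are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("job-optimization-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small lookup table.
events = spark.read.parquet("s3a://my-bucket/events/")        # large
countries = spark.read.parquet("s3a://my-bucket/countries/")  # small

# Predicate pushdown: filter as close to the source as possible.
recent = events.filter(col("event_date") >= "2024-01-01")

# Partitioning: repartition to spread data evenly before heavy work.
recent = recent.repartition(200, "country_code")

# Caching: cache a DataFrame that is reused several times.
recent.cache()

# Broadcast join: avoids shuffling the large side.
enriched = recent.join(broadcast(countries), "country_code")

enriched.write.mode("overwrite").parquet("s3a://my-bucket/enriched/")
recent.unpersist()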
How do repartition and coalesce functions differ in
Spark, and when should each be used?

Repartition:
`repartition(numPartitions)`
Increases or decreases the number of partitions.
Performs a full shuffle of the data across all partitions.
Use when you need to increase the number of partitions or when distributing data evenly is important, like before a large join or when improving parallelism.

Coalesce:
`coalesce(numPartitions)`
Only decreases the number of partitions.
Avoids a full shuffle by merging existing partitions.
Use when reducing the number of partitions, for example after a filter that leaves many small partitions.
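
A short sketch of both calls (the input path and the 200/50 partition counts are arbitrary examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/sales/")  # hypothetical input

# Full shuffle: redistributes rows evenly across 200 partitions,
# optionally hash-partitioned by a column to co-locate keys.
evenly_spread = df.repartition(200, "customer_id")

# No full shuffle: merges existing partitions down to 50,
# typically used after a filter to avoid many small partitions.
filtered = df.filter("amount > 100")
compacted = filtered.coalesce(50)

print(evenly_spread.rdd.getNumPartitions(), compacted.rdd.getNumPartitions())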
What are the steps to run a Python file containing
PySpark code on AWS EC2?

1. Launch an EC2 instance
2. Connect to your instance
3. Install Java and Python
4. Install Spark
5. Set environment variables
6. Upload your PySpark script
7. Run your PySpark script
What is the difference between a Job, Stage, and Task
in Apache Spark, and how do they work together?
1. Job :
A job is a complete Spark action like collect(), save(), or count(). It
triggers a computation to return a result or save data.
Spark breaks the job into multiple stages based on transformations.

2. Stage :
A stage is a set of tasks that can be executed in parallel. It represents
a part of the job that has no shuffling of data. Stages are created
based on shuffle boundaries.
Each stage executes independently, and their number depends on
the job's transformations and shuffles.
What is the difference between a Job, Stage, and Task
in Apache Spark, and how do they work together?
3. Task :
A task is the smallest unit of work in Spark, corresponding to a single
data partition. It’s a single operation that runs on a single partition of
data within a stage.
Tasks run the computations defined in a stage across partitions. Each
stage runs multiple tasks (one per data partition).

Together, jobs, stages, and tasks define the logical execution plan for
processing large datasets in Spark, ensuring distributed and efficient
computation.
What are the advantages of DataFrames over RDDs in
PySpark?

1. DataFrames are optimized with Catalyst and Tungsten, making them faster than RDDs.
2. They have SQL-like syntax, making data operations simpler and more intuitive.
3. DataFrames use optimized memory management, reducing memory overhead.
4. Provide many built-in functions for data manipulation, making complex tasks easier.
5. You can run SQL queries directly on DataFrames.
What is a Spark session, and what are its key
functions?

A SparkSession is the entry point to Spark functionality, unifying the older `SQLContext` and `HiveContext` and wrapping the `SparkContext`. Its key functions are creating DataFrames, running SQL queries, reading and writing data sources, and managing application configuration, all through one unified interface.
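
A minimal sketch of creating and using a SparkSession (the app name and file path are placeholders):

from pyspark.sql import SparkSession

# Build (or reuse) the single session for this application.
spark = (SparkSession.builder
         .appName("example-app")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("input")
spark.sql("SELECT COUNT(*) AS n FROM input").show()

# The underlying SparkContext is still available if needed.
sc = spark.sparkContext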
What is the difference between Bucketing and
Partitioning in Spark, and how does each impact
performance?
Partitioning :
Divides data into separate directories based on a column, distributing
data across multiple files.
Useful for large datasets; each partition is stored separately, allowing
parallel processing.
Speeds up data retrieval for queries filtering on the partition column
but can cause small file issues if partitions are too many.
Bucketing :
Splits data into a fixed number of buckets based on a hash of a column.
Buckets are stored within each partition.
Reduces shuffle during joins, improving performance, especially
when bucketed on the same column in both tables.
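
A sketch of both write paths (the input path, table name, columns, and bucket count are illustrative; `bucketBy` requires `saveAsTable`):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").enableHiveSupport().getOrCreate()
orders = spark.read.parquet("s3a://my-bucket/orders/")  # hypothetical input

# Partitioning: one directory per distinct value of order_date.
orders.write.mode("overwrite").partitionBy("order_date").parquet("s3a://my-bucket/orders_partitioned/")

# Bucketing: hash customer_id into 32 buckets within each partition.
(orders.write.mode("overwrite")
       .bucketBy(32, "customer_id")
       .sortBy("customer_id")
       .saveAsTable("orders_bucketed"))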
Why is data persistence important in Spark, and what
are the common persistence mechanisms?
Data Persistence is important in Spark because it helps avoid
recomputation of intermediate results, speeding up operations that
access the same data multiple times (like iterations in machine learning
algorithms).

Common Persistence Mechanisms :

1. Memory-Only (‘MEMORY_ONLY’)
2. Memory and Disk (‘MEMORY_AND_DISK’)
3. Disk-Only (‘DISK_ONLY’)
4. Serialized (`MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`)
5. Off-Heap (`OFF_HEAP`)
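
A short sketch of the API (the input path is a placeholder; the right storage level depends on the workload):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-sketch").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/features/")  # hypothetical input

# Pick one storage level; .cache() is shorthand for MEMORY_AND_DISK on DataFrames.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # first action materializes the persisted data

# ... reuse df in several computations ...

df.unpersist()    # release memory/disk when no longer needed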
How do Spark Context and Streaming Context differ,
and when do you use each?
1. Spark Context:
The main entry point for Spark applications; it initializes Spark and
connects to the cluster.
Used for batch processing of data with RDDs and DataFrames.
For static data processing like reading files, running SQL queries, or
performing data transformations on large datasets.

2. Streaming Context:
A specialized context for handling streaming data, built on top of
Spark Context.
Manages real-time data streams, processing data in small batches.
For real-time analytics, such as processing data from Kafka, socket
streams, or other real-time data sources.
What role does the DAG Scheduler play in executing
jobs in Spark?

The DAG Scheduler (Directed Acyclic Graph Scheduler) is responsible for breaking down jobs into stages and tasks and managing their execution flow in Spark.

Key Functions:
Converts high-level actions (like `collect()` or `save()`) into a DAG of
stages.
Divides the DAG into stages based on shuffle operations.
Creates tasks for each stage and schedules them for execution on
cluster nodes.
Handles task failures by retrying failed tasks.
What is a Catalyst Optimizer in Spark, and why is it
essential for query optimization?
Catalyst Optimizer is Spark's built-in query optimization engine that
enhances the performance of DataFrame and SQL operations.

Why is it Essential :
Analyzes and optimizes the logical plan of a query by reordering and
simplifying operations (e.g., pushing filters closer to data sources).
Converts the optimized logical plan into one or more physical plans
and selects the most efficient one for execution.
Uses statistics (like data size) to choose the best execution plan,
further improving performance.
Applies predefined rules to simplify and optimize query plans, such
as constant folding and predicate pushdown.
What distinguishes wide transformations from narrow
transformations in Spark?
Narrow Transformations :
Each partition of the parent RDD/DataFrame is used by at most one
partition of the child RDD/DataFrame.
Examples: `map()`, `filter()`, `flatMap()`.
No shuffling of data; data flows within the same partition.
Faster, as there's no data movement across nodes.

Wide Transformations :
Multiple child partitions depend on data from multiple parent
partitions.
Examples: `groupByKey()`, `reduceByKey()`, `join()`.
Requires shuffling of data across the cluster.
Slower due to the cost of shuffling data between nodes.
How do you execute an anti-join operation in Spark?

Anti-Join: Returns rows from the left DataFrame that do not have
matching rows in the right DataFrame.

How to Execute an Anti-Join (DataFrame API):
Example:
result = df1.join(df2, df1["key"] == df2["key"], "left_anti")

Use "left_anti" in the join type to perform an anti-join, filtering out rows
with matches in the right DataFrame.
What are semi joins in Spark, and when should you use
them?
Semi Join: Returns rows from the left DataFrame that have matching
rows in the right DataFrame, but without including columns from the
right DataFrame.

How to Execute a Semi Join (DataFrame API):
Example:
result = df1.join(df2, df1["key"] == df2["key"], "left_semi")

When you need to filter rows in the left DataFrame based on the
existence of matches in the right DataFrame, without needing additional
data from the right DataFrame.
What are the differences between cache and
checkpoint in Spark, and when should you use each?
1. Cache :
Stores the data in memory (or memory and disk) for quick access
during iterative operations.
Use ‘.cache()’ to keep frequently accessed DataFrames/RDDs in
memory.
Data remains only for the duration of the Spark job.

2. Checkpoint :
Saves the data to a reliable storage (like HDFS) to provide fault
tolerance by truncating the lineage graph.
Use ‘.checkpoint()’ to truncate RDD lineage and save progress on disk.
Data is stored permanently until explicitly removed.
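
A minimal sketch contrasting the two (the checkpoint directory and input path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-checkpoint").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # reliable storage

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical input

# Cache: keeps data in memory/disk for this application only; lineage is preserved.
hot = df.filter("event_type = 'click'").cache()
hot.count()

# Checkpoint: writes data to the checkpoint dir and truncates the lineage graph.
checkpointed = hot.checkpoint()
checkpointed.count()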
What is speculative execution in Spark, and how does
it help in job execution?
Speculative Execution is a feature in Spark that helps to handle slow or
"straggler" tasks in a job.
Spark monitors the progress of tasks. If a task is running much slower
than others (a straggler), Spark speculatively launches a duplicate of
that task on another node.
Whichever duplicate task finishes first, the result is used, and the
other task is killed.
By mitigating the impact of slow tasks, it reduces overall job
completion time.
Helps recover from node performance variability, network issues, or
other unforeseen delays affecting task speed.
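
Speculation is enabled through configuration at session or submit time; a sketch with commonly used settings (the multiplier and quantile values are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-sketch")
         .config("spark.speculation", "true")             # enable speculative execution
         .config("spark.speculation.multiplier", "1.5")   # straggler if 1.5x slower than the median task
         .config("spark.speculation.quantile", "0.75")    # start checking after 75% of tasks finish
         .getOrCreate())

# Equivalent at submit time:
# spark-submit --conf spark.speculation=true your_app.py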
What happens if you forget to uncache data in Spark?
How does it impact performance?
Impact of Forgetting to Uncache Data in Spark :
Cached data remains in memory, occupying space even when it’s no
longer needed.
This can lead to memory pressure, reducing the space available for
other data and computations.
Excessive caching without uncaching can cause Spark to run out of
memory, leading to crashes or slower performance due to increased
garbage collection.
With memory filled by unused cached data, Spark might spill data to
disk more frequently, slowing down job execution due to disk I/O.
Uncaching: Always uncache data (`unpersist()`) when it’s no longer
needed to free up memory.
What are the main components of Spark architecture,
and how do they interact?
Main Components of Spark Architecture :

1. Driver :
The central coordinator that manages the Spark application. It
translates the user code into tasks and schedules them on the
cluster.
Sends tasks to the executors and handles the overall application flow.

2. Cluster Manager :
Allocates resources (CPU, memory) to Spark applications. Examples
include YARN, Mesos, or Spark’s standalone cluster manager.
Provides resources to the driver and executors, managing the
execution environment.
What are the main components of Spark architecture,
and how do they interact?
Main Components of Spark Architecture :

3. Executors:
Worker nodes that run tasks and execute the computations defined
by the driver. Each executor manages a subset of the data and stores
data (like cached RDDs) in memory or disk.
Executes tasks assigned by the driver, and sends back results or
progress updates.

4. Tasks :
The smallest unit of work in Spark, representing a computation on a
data partition.
Executed by the executors as instructed by the driver.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

1. Use Efficient Data Formats:
Use columnar formats like Parquet or ORC for better compression and faster I/O compared to CSV or JSON.
Benefit: Reduces storage costs and speeds up data processing.

2. Optimize Partitioning:
Use appropriate partition sizes (‘repartition()’ or ‘coalesce()’) to avoid
small file issues or excessive shuffling.
Benefit: Balances memory use and reduces shuffle operations,
improving job performance.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

3. Leverage Caching Wisely:
Cache only frequently used DataFrames/RDDs with `.cache()` or `.persist()` and uncache them after use.
Benefit: Speeds up repeated operations while managing memory usage effectively.

4. Broadcast Small Tables:
Use `broadcast()` for small lookup tables in joins to reduce shuffle.
Benefit: Minimizes data movement and memory usage, speeding up joins.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

5. Adjust Spark Configurations:
Tune configurations like `spark.executor.memory`, `spark.executor.cores`, and `spark.sql.shuffle.partitions` based on job needs.
Benefit: Optimizes resource usage and avoids memory overhead or underutilization.

6. Avoid Skewness:
Handle skewed data with techniques like salting or using ‘skew join’ hints.
Benefit: Balances workload across executors, preventing memory
bottlenecks.
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?

Deploying a PySpark Application on an Apache Spark Cluster, AWS EMR, or AWS Glue:

1. On an Apache Spark Cluster:
Package your PySpark application into a `.py` file or a `.zip` with dependencies.
Use `spark-submit` to deploy the application:
- spark-submit --master <cluster-master-url> your_app.py
Configure resources with flags (`--num-executors`, `--executor-memory`) as needed.
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?
Deploying a PySpark Application on Apache Spark Cluster, AWS EMR, or AWS
Glue :

2. On AWS EMR (Elastic MapReduce):
Launch an EMR cluster with Spark installed (choose Spark as an application).
Upload your PySpark script to S3.
Submit your job using the AWS Management Console, AWS CLI, or directly through `spark-submit` on the EMR cluster:
- spark-submit s3://path-to-your-script/your_app.py
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?
Deploying a PySpark Application on Apache Spark Cluster, AWS EMR, or AWS
Glue :

3. On AWS Glue:
Create a Glue Job and select “Spark” as the job type.
Upload your script to S3 or use the Glue script editor.
Configure the job with necessary IAM roles, number of DPUs (Data
Processing Units), and additional job parameters.
Run the job directly from the AWS Glue console or programmatically with
AWS SDKs.
After deploying a PySpark job in production, how do
you check if the changes have been successfully made?
Checking PySpark Job Changes in Production :

1. Review Logs:
Location: Check the logs generated by Spark. In Spark UI or cluster
management tools (like YARN, EMR console, or AWS Glue).
Look For: Confirm that the job started with the correct configurations
and inputs. Check for expected logs indicating successful execution.

2. Spark UI Monitoring:
Access: Use Spark UI to monitor stages, tasks, and job progress.
Verify: Check that all stages completed successfully without errors, and
confirm task execution matches the expected changes.
After deploying a PySpark job in production, how do
you check if the changes have been successfully made?
Checking PySpark Job Changes in Production :

3. Validate Outputs:
Location: Check the output data location (e.g., S3, HDFS, database) where
the job writes results.
Verify: Ensure the data is written correctly and matches expected outputs
(e.g., file count, format, schema).

4. Application Metrics:
Use: Tools like CloudWatch (for AWS), Ganglia, or custom monitoring
dashboards.
Verify: Confirm that the job performance metrics align with expected
changes (e.g., runtime, memory usage).
How can you write unit test cases for PySpark code
locally?
Writing Unit Test Cases for PySpark Code Locally :

1. Set Up a Testing Environment:
Install PySpark and a testing library like `unittest` or `pytest`.
2. Create a PySpark Session for Testing:
Set up a local Spark session (`master("local[*]")`) in your test script.
3. Write Test Cases:
Use `unittest` or `pytest` to write test functions.
4. Run Tests:
Run your tests with `python -m pytest` or `python -m unittest`.
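
A minimal sketch using pytest with a shared local session (the file name and the function under test are illustrative):

# test_transformations.py  (run with: python -m pytest test_transformations.py)
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

@pytest.fixture(scope="session")
def spark():
    # Local session shared by all tests in the run.
    session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def add_discount(df):
    # Example function under test: applies a 10% discount to the price column.
    return df.withColumn("discounted", col("price") * 0.9)

def test_add_discount(spark):
    df = spark.createDataFrame([(1, 100.0), (2, 200.0)], ["id", "price"])
    result = add_discount(df).collect()
    assert result[0]["discounted"] == pytest.approx(90.0)
    assert result[1]["discounted"] == pytest.approx(180.0)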
How does SparkSession work in AWS Glue, and what is
the use of GlueContext?
SparkSession:
In AWS Glue, the SparkSession serves as the primary entry point for interacting with Spark's DataFrame and SQL functionalities.
It’s automatically created in Glue jobs, allowing you to read, process, and
write data using Spark capabilities.

GlueContext:
It is a specialized context in AWS Glue that extends the standard Spark
functionalities with Glue-specific features.
Supports working with DynamicFrames, which are similar to Spark
DataFrames but offer additional flexibility for semi-structured data and
schema inference.
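
A sketch of the typical boilerplate inside a Glue job (the catalog database and table names are placeholders; the `awsglue` libraries are only available in the Glue runtime):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session          # regular SparkSession, created for you

# Read a catalog table as a DynamicFrame (schema handled flexibly).
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Convert to a Spark DataFrame for standard transformations, then back.
df = dyf.toDF().filter("amount > 0")
cleaned = DynamicFrame.fromDF(df, glueContext, "cleaned")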
How does distribution work in Spark, and how can you
optimize partitioning?
Distribution in Spark refers to how data is split and processed across multiple
nodes in a cluster. Spark divides data into partitions, which are distributed
across executors for parallel processing.
Key Concepts :
1. Partitions:
- Data is divided into chunks called partitions.
- Each partition is processed independently by an executor.
2. Executors:
- Executors are the worker nodes that run tasks on each partition.
3. Shuffling:
- When operations require data to move between partitions (like joins or
groupBy), Spark performs a shuffle, which redistributes data across
partitions.
How does distribution work in Spark, and how can you
optimize partitioning?
Optimizing Partitioning in Spark :
A general guideline is 128 MB per partition for efficient processing.
repartition(numPartitions): Increases or evenly redistributes partitions;
use for increasing parallelism or when dealing with skewed data.
coalesce(numPartitions): Decreases the number of partitions without a
full shuffle; ideal for reducing partition count after filtering operations.
Ensure data is evenly distributed to avoid some partitions having
significantly more data than others.
Use salting by adding random values to keys to distribute data more
evenly.
Adjust the number of shuffle partitions (‘spark.sql.shuffle.partitions’) to
balance between task overhead and execution parallelism.
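
A sketch of salting a skewed aggregation key and tuning shuffle partitions (the input path, salt range of 10, and partition count of 400 are illustrative choices):

from pyspark.sql import SparkSession
from pyspark.sql.functions import floor, rand, sum as sum_

spark = SparkSession.builder.appName("skew-sketch").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")    # tune to the data volume

sales = spark.read.parquet("s3a://my-bucket/sales/")      # hypothetical input, skewed on customer_id

# Salting: spread each hot key over 10 sub-keys, then aggregate in two steps.
salted = sales.withColumn("salt", floor(rand() * 10))
partial = salted.groupBy("customer_id", "salt").agg(sum_("amount").alias("partial_total"))
totals = partial.groupBy("customer_id").agg(sum_("partial_total").alias("total_amount"))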
What are DynamicFrames in AWS Glue, and how do they
differ from DataFrames?
DynamicFrames are a data structure in AWS Glue, similar to Spark
DataFrames, but designed to handle semi-structured data and integrate with
AWS Glue's ecosystem.

Differences Between DynamicFrames and DataFrames:

1. Schema Handling:
DynamicFrames: Automatically manage schema changes and can handle
nested and complex structures.
DataFrames: Require a fixed schema and are stricter in terms of data
types and structure.
What are DynamicFrames in AWS Glue, and how do they
differ from DataFrames?
2. Transformation APIs:
DynamicFrames: Provide additional Glue-specific transformations
suitable for semi-structured data.
DataFrames: Use Spark’s standard transformation functions (‘select’,
‘filter’, ‘join’, etc.).

3. Performance:
DynamicFrames: Slightly less performant due to their schema flexibility,
but highly useful for handling messy data.
DataFrames: More optimized for performance but require cleaner, well-structured data.
How can you address and resolve Out of Memory (OOM)
errors in Spark applications?
1. Increase Executor Memory:
Adjust the executor memory setting (‘--executor-memory’) to allocate
more memory per executor.
Example: ‘--executor-memory 4G’

2. Optimize Partition Sizes:
Reduce the size of each partition by increasing the number of partitions (`repartition()`) to distribute the data more evenly.
Example: `df.repartition(200)`

3. Use Efficient Data Formats:
Use columnar formats like Parquet or ORC, which are more memory-efficient than row-based formats like CSV.
How can you address and resolve Out of Memory (OOM)
errors in Spark applications?
4. Persist Data Wisely:
Avoid unnecessary caching and unpersist data that’s no longer needed
using ‘.unpersist()’.

5. Tune Garbage Collection:
Adjust JVM garbage collection settings (`-XX:+UseG1GC`) to improve memory management.

6. Broadcast Joins for Small Tables:
Use `broadcast()` for small lookup tables to reduce memory usage during joins.
Write a PySpark code to read a position-based text file
and store it in Delta format. The file contains lines like
"Sam 12 Mumbai", "Rahul 19 Nashik", "John 21 Pune".
Write a PySpark code to write a DataFrame to AWS S3 in
Parquet format.
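
A minimal sketch (the bucket name and path are placeholders; on EMR/Glue the S3 connector and credentials are preconfigured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-to-s3").getOrCreate()

df = spark.createDataFrame([(1, "Sam"), (2, "Rahul")], ["id", "name"])  # sample data

# Write as Parquet to S3; "s3a://" is the Hadoop S3 connector scheme.
(df.write
   .mode("overwrite")
   .parquet("s3a://my-bucket/output/people/"))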
Write a PySpark code to add a new column
price_category to a DataFrame based on the conditions
for price.
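
The price conditions are not specified, so the cut-offs below (100 and 500) are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("price-category").getOrCreate()
df = spark.createDataFrame([(1, 50.0), (2, 250.0), (3, 900.0)], ["id", "price"])

df = df.withColumn(
    "price_category",
    when(col("price") < 100, "low")          # assumed threshold
    .when(col("price") < 500, "medium")      # assumed threshold
    .otherwise("high"))

df.show()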
Write a PySpark code to read a DataFrame with a
predefined schema without inferring it.
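
A sketch assuming a CSV source with name/age/city columns (the path and columns are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("explicit-schema").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# Passing an explicit schema avoids the extra pass over the data that inferSchema would trigger.
df = spark.read.schema(schema).csv("/data/people.csv", header=True)
df.printSchema()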
Write a PySpark code to add a new column "comments"
where if the age is between 13-18 years, the value is
"teenager" and for other cases, the value is "adult".
Write a PySpark code to read the sales log, compute the total
sales amount for each sale_type and date, aggregate the sales
data by sale_type and date, and write the aggregated sales data
to a separate table.
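
A sketch assuming the sales log has `sale_type`, `sale_date`, and `amount` columns and that the output table name is free to choose (all placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("sales-aggregation").enableHiveSupport().getOrCreate()

# Read the raw sales log (source format and path are assumptions).
sales = spark.read.parquet("s3a://my-bucket/sales_log/")

# Total sales amount per sale_type and date.
aggregated = (sales
              .groupBy("sale_type", "sale_date")
              .agg(sum_("amount").alias("total_sales_amount")))

# Write the aggregated sales data to a separate table.
aggregated.write.mode("overwrite").saveAsTable("sales_aggregated")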
Write a PySpark code to detect columns with JSON log entries,
infer the schema, convert JSON columns to fields, and persist the
updated DataFrame.
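
A sketch under simplifying assumptions: string columns whose sampled values start with `{` are treated as JSON, and each column's schema is inferred from one sampled value with `schema_of_json`; the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, lit, schema_of_json
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("json-columns").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/logs/")   # hypothetical input

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

for c in string_cols:
    sample = df.select(c).dropna().first()
    if sample is None or not str(sample[0]).strip().startswith("{"):
        continue  # crude heuristic: not a JSON column
    # Infer the schema from the sampled value, then expand the JSON into top-level fields.
    schema_str = df.select(schema_of_json(lit(sample[0])).alias("s")).first()["s"]
    df = (df.withColumn(c + "_parsed", from_json(col(c), schema_str))
            .select("*", c + "_parsed.*")
            .drop(c, c + "_parsed"))

df.persist()   # keep the updated DataFrame for downstream use
df.count()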
Write a PySpark code to:
check for duplicates in primary keys (columns 1-10)
filter out rows with null or empty primary keys
replace nulls in non-primary keys with 0
reject rows where all values in columns 11 to 20 (non-primary keys) are null.
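
A sketch assuming the first 10 DataFrame columns are the primary keys and columns 11-20 are the non-primary keys; the empty-string check targets string-typed keys, and the all-null rejection is applied before the fill so the check remains meaningful:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pk-validation").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/records/")   # hypothetical input

pk_cols = df.columns[:10]        # columns 1-10: primary keys
non_pk_cols = df.columns[10:20]  # columns 11-20: non-primary keys

# 1. Detect duplicates in the primary keys.
duplicates = df.groupBy(pk_cols).count().filter(col("count") > 1)

# 2. Filter out rows with null or empty primary keys.
valid_pk = reduce(lambda a, b: a & b,
                  [col(c).isNotNull() & (col(c).cast("string") != "") for c in pk_cols])
df = df.filter(valid_pk)

# 3. Reject rows where all non-primary key columns (11-20) are null.
any_non_pk = reduce(lambda a, b: a | b, [col(c).isNotNull() for c in non_pk_cols])
df = df.filter(any_non_pk)

# 4. Replace remaining nulls in non-primary keys with 0 (affects numeric columns in the subset).
df = df.fillna(0, subset=non_pk_cols)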
If you have found it useful, please:
React
Comment
Share
Yedla Akshay Kumar
