PySpark Interview Questions & Solutions
Yedla Akshay Kumar
What are some strategies for optimizing Spark jobs,
and why are they important?

1. Partitioning: Ensure data is evenly distributed across partitions to avoid data skew. Use `repartition()` or `coalesce()` to manage partition sizes.
2. Caching: Cache frequently accessed DataFrames using `.cache()` or `.persist()` to avoid recomputation.
3. Broadcast Variables: Use `broadcast()` for small lookup tables to reduce data shuffling during joins.
What are some strategies for optimizing Spark jobs,
and why are they important?

4. Predicate Pushdown: Filter data early in the pipeline to reduce the amount of data processed. Use `.filter()` or `.where()` close to the data source.
5. Avoid Shuffles: Minimize shuffle operations (like `groupBy()` or `join()`) as they are costly. Optimize join conditions and use map-side (broadcast) joins where possible.

Optimizing Spark jobs improves performance, reduces execution time, and lowers resource costs, making your applications faster and more efficient.
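
A minimal sketch combining these techniques, assuming a large events dataset and a small countries lookup table (paths, column names, and the partition count are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("job-optimization-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small lookup table.
events = spark.read.parquet("s3a://my-bucket/events/")        # large
countries = spark.read.parquet("s3a://my-bucket/countries/")  # small

# Predicate pushdown: filter as close to the source as possible.
recent = events.filter(col("event_date") >= "2024-01-01")

# Partitioning: repartition to spread data evenly before heavy work.
recent = recent.repartition(200, "country_code")

# Caching: cache a DataFrame that is reused several times.
recent.cache()

# Broadcast join: avoids shuffling the large side.
enriched = recent.join(broadcast(countries), "country_code")

enriched.write.mode("overwrite").parquet("s3a://my-bucket/enriched/")
recent.unpersist()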
How do repartition and coalesce functions differ in
Spark, and when should each be used?

Repartition:
`repartition(numPartitions)`
Increases or decreases the number of partitions.
Performs a full shuffle of the data across all partitions.
Use when you need to increase the number of partitions or when distributing data evenly is important, like before a large join or when improving parallelism.

Coalesce:
`coalesce(numPartitions)`
Only decreases the number of partitions.
Avoids a full shuffle by merging existing partitions.
Use when reducing the number of partitions, for example after a filter that leaves many small partitions.
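
A short sketch of both calls (the input path and the 200/50 partition counts are arbitrary examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/sales/")  # hypothetical input

# Full shuffle: redistributes rows evenly across 200 partitions,
# optionally hash-partitioned by a column to co-locate keys.
evenly_spread = df.repartition(200, "customer_id")

# No full shuffle: merges existing partitions down to 50,
# typically used after a filter to avoid many small partitions.
filtered = df.filter("amount > 100")
compacted = filtered.coalesce(50)

print(evenly_spread.rdd.getNumPartitions(), compacted.rdd.getNumPartitions())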
What are the steps to run a Python file containing
PySpark code on AWS EC2?

1. Launch an EC2 instance
2. Connect to your instance
3. Install Java and Python
4. Install Spark
5. Set environment variables
6. Upload your PySpark script
7. Run your PySpark script
What is the difference between a Job, Stage, and Task
in Apache Spark, and how do they work together?
1. Job :
A job is a complete Spark action like collect(), save(), or count(). It
triggers a computation to return a result or save data.
Spark breaks the job into multiple stages based on transformations.

2. Stage :
A stage is a set of tasks that can be executed in parallel. It represents
a part of the job that has no shuffling of data. Stages are created
based on shuffle boundaries.
Each stage executes independently, and their number depends on
the job's transformations and shuffles.
What is the difference between a Job, Stage, and Task
in Apache Spark, and how do they work together?
3. Task :
A task is the smallest unit of work in Spark, corresponding to a single
data partition. It’s a single operation that runs on a single partition of
data within a stage.
Tasks run the computations defined in a stage across partitions. Each
stage runs multiple tasks (one per data partition).

Together, jobs, stages, and tasks define the logical execution plan for
processing large datasets in Spark, ensuring distributed and efficient
computation.
What are the advantages of DataFrames over RDDs in
PySpark?

1. DataFrames are optimized with Catalyst and Tungsten, making them faster than RDDs.
2. They have SQL-like syntax, making data operations simpler and more intuitive.
3. DataFrames use optimized memory management, reducing memory overhead.
4. Provide many built-in functions for data manipulation, making complex tasks easier.
5. You can run SQL queries directly on DataFrames.
What is a Spark session, and what are its key
functions?

A SparkSession is the entry point to Spark functionality, unifying the older `SQLContext` and `HiveContext` and wrapping the `SparkContext`. Its key functions are creating DataFrames, running SQL queries, reading and writing data sources, and managing application configuration, all through one unified interface.
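
A minimal sketch of creating and using a SparkSession (the app name and file path are placeholders):

from pyspark.sql import SparkSession

# Build (or reuse) the single session for this application.
spark = (SparkSession.builder
         .appName("example-app")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("input")
spark.sql("SELECT COUNT(*) AS n FROM input").show()

# The underlying SparkContext is still available if needed.
sc = spark.sparkContext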
What is the difference between Bucketing and
Partitioning in Spark, and how does each impact
performance?
Partitioning :
Divides data into separate directories based on a column, distributing
data across multiple files.
Useful for large datasets; each partition is stored separately, allowing
parallel processing.
Speeds up data retrieval for queries filtering on the partition column
but can cause small file issues if partitions are too many.
Bucketing :
Splits data into a fixed number of buckets based on a hash of a column.
Buckets are stored within each partition.
Reduces shuffle during joins, improving performance, especially
when bucketed on the same column in both tables.
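
A sketch of both write paths (the input path, table name, columns, and bucket count are illustrative; `bucketBy` requires `saveAsTable`):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").enableHiveSupport().getOrCreate()
orders = spark.read.parquet("s3a://my-bucket/orders/")  # hypothetical input

# Partitioning: one directory per distinct value of order_date.
orders.write.mode("overwrite").partitionBy("order_date").parquet("s3a://my-bucket/orders_partitioned/")

# Bucketing: hash customer_id into 32 buckets within each partition.
(orders.write.mode("overwrite")
       .bucketBy(32, "customer_id")
       .sortBy("customer_id")
       .saveAsTable("orders_bucketed"))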
Why is data persistence important in Spark, and what
are the common persistence mechanisms?
Data Persistence is important in Spark because it helps avoid
recomputation of intermediate results, speeding up operations that
access the same data multiple times (like iterations in machine learning
algorithms).

Common Persistence Mechanisms :

1. Memory-Only (‘MEMORY_ONLY’)
2. Memory and Disk (‘MEMORY_AND_DISK’)
3. Disk-Only (‘DISK_ONLY’)
4. Serialized (`MEMORY_ONLY_SER`, `MEMORY_AND_DISK_SER`)
5. Off-Heap (`OFF_HEAP`)
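
A short sketch of the API (the input path is a placeholder; the right storage level depends on the workload):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-sketch").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/features/")  # hypothetical input

# Pick one storage level; .cache() is shorthand for MEMORY_AND_DISK on DataFrames.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # first action materializes the persisted data

# ... reuse df in several computations ...

df.unpersist()    # release memory/disk when no longer needed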
How do Spark Context and Streaming Context differ,
and when do you use each?
1. Spark Context:
The main entry point for Spark applications; it initializes Spark and
connects to the cluster.
Used for batch processing of data with RDDs and DataFrames.
For static data processing like reading files, running SQL queries, or
performing data transformations on large datasets.

2. Streaming Context:
A specialized context for handling streaming data, built on top of
Spark Context.
Manages real-time data streams, processing data in small batches.
For real-time analytics, such as processing data from Kafka, socket
streams, or other real-time data sources.
What role does the DAG Scheduler play in executing
jobs in Spark?

The DAG Scheduler (Directed Acyclic Graph Scheduler) is responsible for breaking down jobs into stages and tasks and managing their execution flow in Spark.

Key Functions:
Converts high-level actions (like `collect()` or `save()`) into a DAG of
stages.
Divides the DAG into stages based on shuffle operations.
Creates tasks for each stage and schedules them for execution on
cluster nodes.
Handles task failures by retrying failed tasks.
What is a Catalyst Optimizer in Spark, and why is it
essential for query optimization?
Catalyst Optimizer is Spark's built-in query optimization engine that
enhances the performance of DataFrame and SQL operations.

Why is it Essential :
Analyzes and optimizes the logical plan of a query by reordering and
simplifying operations (e.g., pushing filters closer to data sources).
Converts the optimized logical plan into one or more physical plans
and selects the most efficient one for execution.
Uses statistics (like data size) to choose the best execution plan,
further improving performance.
Applies predefined rules to simplify and optimize query plans, such
as constant folding and predicate pushdown.
What distinguishes wide transformations from narrow
transformations in Spark?
Narrow Transformations :
Each partition of the parent RDD/DataFrame is used by at most one
partition of the child RDD/DataFrame.
Examples: `map()`, `filter()`, `flatMap()`.
No shuffling of data; data flows within the same partition.
Faster, as there's no data movement across nodes.

Wide Transformations :
Multiple child partitions depend on data from multiple parent
partitions.
Examples: `groupByKey()`, `reduceByKey()`, `join()`.
Requires shuffling of data across the cluster.
Slower due to the cost of shuffling data between nodes.
How do you execute an anti-join operation in Spark?

Anti-Join: Returns rows from the left DataFrame that do not have
matching rows in the right DataFrame.

How to Execute an Anti-Join (DataFrame API):
Example:
result = df1.join(df2, df1["key"] == df2["key"], "left_anti")

Use "left_anti" in the join type to perform an anti-join, filtering out rows
with matches in the right DataFrame.
What are semi joins in Spark, and when should you use
them?
Semi Join: Returns rows from the left DataFrame that have matching
rows in the right DataFrame, but without including columns from the
right DataFrame.

How to Execute a Semi Join (DataFrame API):
Example:
result = df1.join(df2, df1["key"] == df2["key"], "left_semi")

When you need to filter rows in the left DataFrame based on the
existence of matches in the right DataFrame, without needing additional
data from the right DataFrame.
What are the differences between cache and
checkpoint in Spark, and when should you use each?
1. Cache :
Stores the data in memory (or memory and disk) for quick access
during iterative operations.
Use ‘.cache()’ to keep frequently accessed DataFrames/RDDs in
memory.
Data remains only for the duration of the Spark job.

2. Checkpoint :
Saves the data to a reliable storage (like HDFS) to provide fault
tolerance by truncating the lineage graph.
Use ‘.checkpoint()’ to truncate RDD lineage and save progress on disk.
Data is stored permanently until explicitly removed.
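
A minimal sketch contrasting the two (the checkpoint directory and input path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-checkpoint").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # reliable storage

df = spark.read.parquet("s3a://my-bucket/events/")  # hypothetical input

# Cache: keeps data in memory/disk for this application only; lineage is preserved.
hot = df.filter("event_type = 'click'").cache()
hot.count()

# Checkpoint: writes data to the checkpoint dir and truncates the lineage graph.
checkpointed = hot.checkpoint()
checkpointed.count()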
What is speculative execution in Spark, and how does
it help in job execution?
Speculative Execution is a feature in Spark that helps to handle slow or
"straggler" tasks in a job.
Spark monitors the progress of tasks. If a task is running much slower
than others (a straggler), Spark speculatively launches a duplicate of
that task on another node.
Whichever duplicate task finishes first, the result is used, and the
other task is killed.
By mitigating the impact of slow tasks, it reduces overall job
completion time.
Helps recover from node performance variability, network issues, or
other unforeseen delays affecting task speed.
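
Speculation is enabled through configuration at session or submit time; a sketch with commonly used settings (the multiplier and quantile values are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("speculation-sketch")
         .config("spark.speculation", "true")             # enable speculative execution
         .config("spark.speculation.multiplier", "1.5")   # straggler if 1.5x slower than the median task
         .config("spark.speculation.quantile", "0.75")    # start checking after 75% of tasks finish
         .getOrCreate())

# Equivalent at submit time:
# spark-submit --conf spark.speculation=true your_app.py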
What happens if you forget to uncache data in Spark?
How does it impact performance?
Impact of Forgetting to Uncache Data in Spark :
Cached data remains in memory, occupying space even when it’s no
longer needed.
This can lead to memory pressure, reducing the space available for
other data and computations.
Excessive caching without uncaching can cause Spark to run out of
memory, leading to crashes or slower performance due to increased
garbage collection.
With memory filled by unused cached data, Spark might spill data to
disk more frequently, slowing down job execution due to disk I/O.
Uncaching: Always uncache data (`unpersist()`) when it’s no longer
needed to free up memory.
What are the main components of Spark architecture,
and how do they interact?
Main Components of Spark Architecture :

1. Driver :
The central coordinator that manages the Spark application. It
translates the user code into tasks and schedules them on the
cluster.
Sends tasks to the executors and handles the overall application flow.

2. Cluster Manager :
Allocates resources (CPU, memory) to Spark applications. Examples
include YARN, Mesos, or Spark’s standalone cluster manager.
Provides resources to the driver and executors, managing the
execution environment.
What are the main components of Spark architecture,
and how do they interact?
Main Components of Spark Architecture :

3. Executors:
Worker nodes that run tasks and execute the computations defined
by the driver. Each executor manages a subset of the data and stores
data (like cached RDDs) in memory or disk.
Executes tasks assigned by the driver, and sends back results or
progress updates.

4. Tasks :
The smallest unit of work in Spark, representing a computation on a
data partition.
Executed by the executors as instructed by the driver.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

1. Use Efficient Data Formats:
Use columnar formats like Parquet or ORC for better compression and faster I/O compared to CSV or JSON.
Benefit: Reduces storage costs and speeds up data processing.

2. Optimize Partitioning:
Use appropriate partition sizes (‘repartition()’ or ‘coalesce()’) to avoid
small file issues or excessive shuffling.
Benefit: Balances memory use and reduces shuffle operations,
improving job performance.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

3. Leverage Caching Wisely:
Cache only frequently used DataFrames/RDDs with `.cache()` or `.persist()` and uncache them after use.
Benefit: Speeds up repeated operations while managing memory usage effectively.

4. Broadcast Small Tables:
Use `broadcast()` for small lookup tables in joins to reduce shuffle.
Benefit: Minimizes data movement and memory usage, speeding up joins.
How do you optimize for memory and cost in ETL jobs
in Spark?
Optimizing Memory and Cost in ETL Jobs in Spark :

5. Adjust Spark Configurations:
Tune configurations like `spark.executor.memory`, `spark.executor.cores`, and `spark.sql.shuffle.partitions` based on job needs.
Benefit: Optimizes resource usage and avoids memory overhead or underutilization.

6. Avoid Skewness:
Handle skewed data with techniques like salting or using ‘skew join’ hints.
Benefit: Balances workload across executors, preventing memory
bottlenecks.
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?

Deploying a PySpark Application on an Apache Spark Cluster, AWS EMR, or AWS Glue:

1. On an Apache Spark Cluster:
Package your PySpark application into a `.py` file or a `.zip` with dependencies.
Use `spark-submit` to deploy the application:
- spark-submit --master <cluster-master-url> your_app.py
Configure resources with flags (`--num-executors`, `--executor-memory`) as needed.
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?
Deploying a PySpark Application on Apache Spark Cluster, AWS EMR, or AWS
Glue :

2. On AWS EMR (Elastic MapReduce):
Launch an EMR cluster with Spark installed (choose Spark as an application).
Upload your PySpark script to S3.
Submit your job using the AWS Management Console, AWS CLI, or directly through `spark-submit` on the EMR cluster:
- spark-submit s3://path-to-your-script/your_app.py
How do you deploy a PySpark application on an Apache
Spark cluster or AWS services like EMR or Glue?
Deploying a PySpark Application on Apache Spark Cluster, AWS EMR, or AWS
Glue :

3. On AWS Glue:
Create a Glue Job and select “Spark” as the job type.
Upload your script to S3 or use the Glue script editor.
Configure the job with necessary IAM roles, number of DPUs (Data
Processing Units), and additional job parameters.
Run the job directly from the AWS Glue console or programmatically with
AWS SDKs.
After deploying a PySpark job in production, how do
you check if the changes have been successfully made?
Checking PySpark Job Changes in Production :

1. Review Logs:
Location: Check the logs generated by Spark. In Spark UI or cluster
management tools (like YARN, EMR console, or AWS Glue).
Look For: Confirm that the job started with the correct configurations
and inputs. Check for expected logs indicating successful execution.

2. Spark UI Monitoring:
Access: Use Spark UI to monitor stages, tasks, and job progress.
Verify: Check that all stages completed successfully without errors, and
confirm task execution matches the expected changes.
After deploying a PySpark job in production, how do
you check if the changes have been successfully made?
Checking PySpark Job Changes in Production :

3. Validate Outputs:
Location: Check the output data location (e.g., S3, HDFS, database) where
the job writes results.
Verify: Ensure the data is written correctly and matches expected outputs
(e.g., file count, format, schema).

4. Application Metrics:
Use: Tools like CloudWatch (for AWS), Ganglia, or custom monitoring
dashboards.
Verify: Confirm that the job performance metrics align with expected
changes (e.g., runtime, memory usage).
How can you write unit test cases for PySpark code
locally?
Writing Unit Test Cases for PySpark Code Locally :

1. Set Up a Testing Environment:
Install PySpark and a testing library like `unittest` or `pytest`.
2. Create a PySpark Session for Testing:
Set up a local Spark session (`master("local[*]")`) in your test script.
3. Write Test Cases:
Use `unittest` or `pytest` to write test functions.
4. Run Tests:
Run your tests with `python -m pytest` or `python -m unittest`.
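
A minimal sketch using pytest with a shared local session (the file name and the function under test are illustrative):

# test_transformations.py  (run with: python -m pytest test_transformations.py)
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

@pytest.fixture(scope="session")
def spark():
    # Local session shared by all tests in the run.
    session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def add_discount(df):
    # Example function under test: applies a 10% discount to the price column.
    return df.withColumn("discounted", col("price") * 0.9)

def test_add_discount(spark):
    df = spark.createDataFrame([(1, 100.0), (2, 200.0)], ["id", "price"])
    result = add_discount(df).collect()
    assert result[0]["discounted"] == pytest.approx(90.0)
    assert result[1]["discounted"] == pytest.approx(180.0)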
How does SparkSession work in AWS Glue, and what is
the use of GlueContext?
SparkSession:
In AWS Glue, the SparkSession serves as the primary entry point for interacting with Spark's DataFrame and SQL functionalities.
It’s automatically created in Glue jobs, allowing you to read, process, and
write data using Spark capabilities.

GlueContext:
It is a specialized context in AWS Glue that extends the standard Spark
functionalities with Glue-specific features.
Supports working with DynamicFrames, which are similar to Spark
DataFrames but offer additional flexibility for semi-structured data and
schema inference.
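
A sketch of the typical boilerplate inside a Glue job (the catalog database and table names are placeholders; the `awsglue` libraries are only available in the Glue runtime):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session          # regular SparkSession, created for you

# Read a catalog table as a DynamicFrame (schema handled flexibly).
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Convert to a Spark DataFrame for standard transformations, then back.
df = dyf.toDF().filter("amount > 0")
cleaned = DynamicFrame.fromDF(df, glueContext, "cleaned")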
How does distribution work in Spark, and how can you
optimize partitioning?
Distribution in Spark refers to how data is split and processed across multiple
nodes in a cluster. Spark divides data into partitions, which are distributed
across executors for parallel processing.
Key Concepts :
1. Partitions:
- Data is divided into chunks called partitions.
- Each partition is processed independently by an executor.
2. Executors:
- Executors are the worker nodes that run tasks on each partition.
3. Shuffling:
- When operations require data to move between partitions (like joins or
groupBy), Spark performs a shuffle, which redistributes data across
partitions.
How does distribution work in Spark, and how can you
optimize partitioning?
Optimizing Partitioning in Spark :
A general guideline is 128 MB per partition for efficient processing.
repartition(numPartitions): Increases or evenly redistributes partitions;
use for increasing parallelism or when dealing with skewed data.
coalesce(numPartitions): Decreases the number of partitions without a
full shuffle; ideal for reducing partition count after filtering operations.
Ensure data is evenly distributed to avoid some partitions having
significantly more data than others.
Use salting by adding random values to keys to distribute data more
evenly.
Adjust the number of shuffle partitions (‘spark.sql.shuffle.partitions’) to
balance between task overhead and execution parallelism.
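
A sketch of salting a skewed aggregation key and tuning shuffle partitions (the input path, salt range of 10, and partition count of 400 are illustrative choices):

from pyspark.sql import SparkSession
from pyspark.sql.functions import floor, rand, sum as sum_

spark = SparkSession.builder.appName("skew-sketch").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")    # tune to the data volume

sales = spark.read.parquet("s3a://my-bucket/sales/")      # hypothetical input, skewed on customer_id

# Salting: spread each hot key over 10 sub-keys, then aggregate in two steps.
salted = sales.withColumn("salt", floor(rand() * 10))
partial = salted.groupBy("customer_id", "salt").agg(sum_("amount").alias("partial_total"))
totals = partial.groupBy("customer_id").agg(sum_("partial_total").alias("total_amount"))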
What are DynamicFrames in AWS Glue, and how do they
differ from DataFrames?
DynamicFrames are a data structure in AWS Glue, similar to Spark
DataFrames, but designed to handle semi-structured data and integrate with
AWS Glue's ecosystem.

Differences Between DynamicFrames and DataFrames:

1. Schema Handling:
DynamicFrames: Automatically manage schema changes and can handle
nested and complex structures.
DataFrames: Require a fixed schema and are stricter in terms of data
types and structure.
What are DynamicFrames in AWS Glue, and how do they
differ from DataFrames?
2. Transformation APIs:
DynamicFrames: Provide additional Glue-specific transformations
suitable for semi-structured data.
DataFrames: Use Spark’s standard transformation functions (‘select’,
‘filter’, ‘join’, etc.).

3. Performance:
DynamicFrames: Slightly less performant due to their schema flexibility,
but highly useful for handling messy data.
DataFrames: More optimized for performance but require cleaner, well-structured data.
How can you address and resolve Out of Memory (OOM)
errors in Spark applications?
1. Increase Executor Memory:
Adjust the executor memory setting (‘--executor-memory’) to allocate
more memory per executor.
Example: ‘--executor-memory 4G’

2. Optimize Partition Sizes:
Reduce the size of each partition by increasing the number of partitions (`repartition()`) to distribute the data more evenly.
Example: `df.repartition(200)`

3. Use Efficient Data Formats:
Use columnar formats like Parquet or ORC, which are more memory-efficient than row-based formats like CSV.
How can you address and resolve Out of Memory (OOM)
errors in Spark applications?
4. Persist Data Wisely:
Avoid unnecessary caching and unpersist data that’s no longer needed
using ‘.unpersist()’.

5. Tune Garbage Collection:
Adjust JVM garbage collection settings (`-XX:+UseG1GC`) to improve memory management.

6. Broadcast Joins for Small Tables:
Use `broadcast()` for small lookup tables to reduce memory usage during joins.
Write a PySpark code to read a position-based text file
and store it in Delta format. The file contains lines like
"Sam 12 Mumbai", "Rahul 19 Nashik", "John 21 Pune".
Write a PySpark code to write a DataFrame to AWS S3 in
Parquet format.
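
A minimal sketch (the bucket name and path are placeholders; on EMR/Glue the S3 connector and credentials are preconfigured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-to-s3").getOrCreate()

df = spark.createDataFrame([(1, "Sam"), (2, "Rahul")], ["id", "name"])  # sample data

# Write as Parquet to S3; "s3a://" is the Hadoop S3 connector scheme.
(df.write
   .mode("overwrite")
   .parquet("s3a://my-bucket/output/people/"))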
Write a PySpark code to add a new column
price_category to a DataFrame based on the conditions
for price.
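
The price conditions are not specified, so the cut-offs below (100 and 500) are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("price-category").getOrCreate()
df = spark.createDataFrame([(1, 50.0), (2, 250.0), (3, 900.0)], ["id", "price"])

df = df.withColumn(
    "price_category",
    when(col("price") < 100, "low")          # assumed threshold
    .when(col("price") < 500, "medium")      # assumed threshold
    .otherwise("high"))

df.show()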
Write a PySpark code to read a DataFrame with a
predefined schema without inferring it.
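
A sketch assuming a CSV source with name/age/city columns (the path and columns are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("explicit-schema").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# Passing an explicit schema avoids the extra pass over the data that inferSchema would trigger.
df = spark.read.schema(schema).csv("/data/people.csv", header=True)
df.printSchema()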
Write a PySpark code to add a new column "comments"
where if the age is between 13-18 years, the value is
"teenager" and for other cases, the value is "adult".
Write a PySpark code to read the sales log, compute the total
sales amount for each sale_type and date, aggregate the sales
data by sale_type and date, and write the aggregated sales data
to a separate table.
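
A sketch assuming the sales log has `sale_type`, `sale_date`, and `amount` columns and that the output table name is free to choose (all placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("sales-aggregation").enableHiveSupport().getOrCreate()

# Read the raw sales log (source format and path are assumptions).
sales = spark.read.parquet("s3a://my-bucket/sales_log/")

# Total sales amount per sale_type and date.
aggregated = (sales
              .groupBy("sale_type", "sale_date")
              .agg(sum_("amount").alias("total_sales_amount")))

# Write the aggregated sales data to a separate table.
aggregated.write.mode("overwrite").saveAsTable("sales_aggregated")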
Write a PySpark code to detect columns with JSON log entries,
infer the schema, convert JSON columns to fields, and persist the
updated DataFrame.
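
A sketch under simplifying assumptions: string columns whose sampled values start with `{` are treated as JSON, and each column's schema is inferred from one sampled value with `schema_of_json`; the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, lit, schema_of_json
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("json-columns").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/logs/")   # hypothetical input

string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

for c in string_cols:
    sample = df.select(c).dropna().first()
    if sample is None or not str(sample[0]).strip().startswith("{"):
        continue  # crude heuristic: not a JSON column
    # Infer the schema from the sampled value, then expand the JSON into top-level fields.
    schema_str = df.select(schema_of_json(lit(sample[0])).alias("s")).first()["s"]
    df = (df.withColumn(c + "_parsed", from_json(col(c), schema_str))
            .select("*", c + "_parsed.*")
            .drop(c, c + "_parsed"))

df.persist()   # keep the updated DataFrame for downstream use
df.count()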
Write a PySpark code to:
check for duplicates in primary keys (columns 1-10)
filter out rows with null or empty primary keys
replace nulls in non-primary keys with 0
reject rows where all values in columns 11 to 20 (non-primary keys) are null.
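
A sketch assuming the first 10 DataFrame columns are the primary keys and columns 11-20 are the non-primary keys; the empty-string check targets string-typed keys, and the all-null rejection is applied before the fill so the check remains meaningful:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pk-validation").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/records/")   # hypothetical input

pk_cols = df.columns[:10]        # columns 1-10: primary keys
non_pk_cols = df.columns[10:20]  # columns 11-20: non-primary keys

# 1. Detect duplicates in the primary keys.
duplicates = df.groupBy(pk_cols).count().filter(col("count") > 1)

# 2. Filter out rows with null or empty primary keys.
valid_pk = reduce(lambda a, b: a & b,
                  [col(c).isNotNull() & (col(c).cast("string") != "") for c in pk_cols])
df = df.filter(valid_pk)

# 3. Reject rows where all non-primary key columns (11-20) are null.
any_non_pk = reduce(lambda a, b: a | b, [col(c).isNotNull() for c in non_pk_cols])
df = df.filter(any_non_pk)

# 4. Replace remaining nulls in non-primary keys with 0 (affects numeric columns in the subset).
df = df.fillna(0, subset=non_pk_cols)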
If you have found it useful, please:
React
Comment
Share
Yedla Akshay Kumar
