PySpark for ETL

The document outlines a comprehensive PySpark syllabus tailored for ETL use cases, covering topics such as Spark architecture, RDDs, Spark SQL, and Spark Streaming, along with hands-on exercises. It also includes an introduction to Apache Airflow, its architecture, and how to schedule Spark applications using Airflow, with practical exercises for building and testing DAGs. The curriculum emphasizes real-world applications and best practices in data engineering and processing.


Section 1: PySpark Syllabus Tailored for ETL Use Cases

Spark | PySpark
Spark Foundations: 4 Hrs
• What is Big Data
• What is Data Engineering
• Need for Big Data Technologies
• Batch Analytics
• Real Time Analytics
• Introduction to PySpark
• Need for PySpark
• Why is Spark faster than Hadoop MapReduce?
• What is in-memory processing?
• Understanding the use cases of PySpark and where it fits when building data pipelines
• PySpark Architectural Components (how PySpark works with Hadoop YARN and in standalone mode); a minimal example follows this list
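
To make the foundations above concrete, here is a minimal PySpark session. It is a sketch only: the app name, data, and column names are made up for illustration, and on a real cluster the master would typically be YARN or standalone rather than local[*].

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; on a cluster the master would be YARN or standalone.
spark = (
    SparkSession.builder
    .appName("etl-foundations-demo")   # illustrative app name
    .master("local[*]")
    .getOrCreate()
)

# A tiny in-memory dataset stands in for a real source such as HDFS or S3.
df = spark.createDataFrame(
    [("2024-01-01", "orders", 120), ("2024-01-02", "orders", 95)],
    ["event_date", "source", "row_count"],
)

# Transformations are lazy; the aggregation only runs when show() triggers a job,
# and intermediate data stays in memory instead of being written to disk as in MapReduce.
df.groupBy("source").sum("row_count").show()

spark.stop()
```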

Deep Dive into Spark Architecture 4 Hrs


• Driver
• Executor
• Cluster Manager
• RDD
• RDD Transformations (includes hands-on; see the sketch after this list)
• Narrow vs Wide Transformations
• RDD Actions
• Spark Deploy Modes
• Spark Application
• Spark job
• Stages
• Tasks
• In-Depth Spark Architecture
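
A minimal RDD sketch covering the transformation/action split named above (the sample data is illustrative): flatMap and map are narrow transformations, reduceByKey is a wide transformation that triggers a shuffle, and collect() is the action that actually launches the job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Parallelize a small collection into an RDD (illustrative data).
lines = sc.parallelize(["a b a", "b c", "a c c"])

# Narrow transformations: each output partition depends on a single input partition.
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda w: (w, 1))

# Wide transformation: reduceByKey shuffles records with the same key to one partition.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Action: collect() triggers the job (stages and tasks) and returns results to the driver.
print(counts.collect())

spark.stop()
```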

Hands-On RDDs and Executor Memory Architecture 4 Hrs


• 50 RDD Hands-On Exercises with Solutions
• Executor Memory Architecture
• Executor Data Processing View

Distributed Shared Variables | Spark SQL Architectural Overview 4 Hrs


• Broadcast Variable
• Accumulator (see the sketch after this list)
• Need for Spark SQL
• Performance difference between RDD Operations and Spark SQL
• Catalyst Optimizer
• DataFrame vs Dataset vs RDD
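
A short sketch of the two distributed shared variables listed above, assuming a small lookup dictionary; the country codes and data are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a small lookup table shipped once to every executor.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: a counter that executors can only add to and the driver can read.
unknown_codes = sc.accumulator(0)

def resolve(code):
    name = country_lookup.value.get(code)
    if name is None:
        unknown_codes.add(1)   # counted on the executors
        return "Unknown"
    return name

rdd = sc.parallelize(["IN", "US", "XX", "IN"])
print(rdd.map(resolve).collect())
print("Unrecognised codes:", unknown_codes.value)

spark.stop()
```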

Spark SQL Intensive Hands-On 12 Hrs


• Creating DataFrames from various data sources (see the sketch after this list)
• Spark SQL Transformations (200 basic PySpark exercises)
• Spark SQL Transformations (500 basic-to-advanced PySpark exercises)
• Handling JSON Data (50 exercises, basic to advanced)
• Writing Data to Destinations (Redshift, Hive)
• Industry Coding Practices for End to End PySpark Applications
o Configs
o Resources
o src
o Tests
o Deploy
o Entry Points
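
As a taste of the hands-on material, the sketch below reads a source file, parses a JSON column, aggregates, and writes the result out. Every path, column name, and schema here is hypothetical, and a partitioned Parquet path on object storage stands in for a Redshift or Hive destination.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-etl-sketch").getOrCreate()

# Read from a source; the path and options are illustrative only.
orders = spark.read.option("header", "true").csv("s3a://my-bucket/raw/orders/")

# Handle semi-structured JSON stored in a string column (hypothetical column names).
orders = orders.withColumn(
    "payload",
    F.from_json("payload_json", "struct<item_id:string, qty:int>"),
)

# A typical Spark SQL transformation chain.
daily = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", F.col("payload.item_id").alias("item_id"))
    .agg(F.sum("payload.qty").alias("total_qty"))
)

# Write to a destination; the partitioned Parquet path stands in for Redshift or Hive here.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://my-bucket/curated/daily_orders/"
)
```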

PySpark in Industry 8 Hrs


• Adaptive Query Execution
• Dealing with Data Skewness
• Application Deployment Strategies
• Coalesce vs Repartition
• Cache vs Persist (see the tuning sketch after this list)
• Join Optimization
• Garbage Collection Tuning
• Resource Calculation for a Spark Application
• Dynamic Resource Allocation
• Understanding the Place of PySpark in Industry Projects
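
A tuning sketch touching a few of the topics above; the configuration values and partition counts are placeholders, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

# Adaptive Query Execution and skew-join handling are enabled through configuration.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

df = spark.range(0, 1_000_000)

# repartition() performs a full shuffle and can increase parallelism;
# coalesce() only merges existing partitions and avoids a shuffle.
wide = df.repartition(200)
narrow = wide.coalesce(50)

# cache() uses the default storage level; persist() lets you choose it explicitly.
narrow.persist(StorageLevel.MEMORY_AND_DISK)
print(narrow.count())

spark.stop()
```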

Spark Streaming 4 Hrs


• Introduction to Spark Streaming
• Spark Streaming Basics
• Input Data Sources
• Transformations in Spark Streaming
• Fault Tolerance and Checkpointing
• Output Modes and Sinks
• Structured Streaming (see the sketch after this list)
• Tuning and Optimization
• Integration with Batch Processing
• Real-World Example
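
A minimal Structured Streaming sketch using the built-in rate source for demonstration; a production job would more likely read from Kafka or files, and the checkpoint path below is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# The built-in "rate" source generates rows continuously and is handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A simple windowed aggregation over the generated event timestamp.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Output mode and sink; the checkpoint location is what enables fault-tolerant recovery.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/rate_demo")  # illustrative path
    .start()
)

query.awaitTermination()
```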

Section 2: Airflow Introduction and Scheduling Spark Applications

Part 1: Introduction to Apache Airflow 4 Hrs

1. Overview of Airflow
- What is Apache Airflow?
- Key Features of Airflow:
- Workflow Automation
- Dynamic Pipelines
- Extensibility and Scalability
- Real-world Use Cases of Airflow

2. Airflow Architecture
- Core Components:
- Scheduler
- Web Server
- Metadata Database
- Executor and Workers
- DAG (Directed Acyclic Graph) Concept
- Task Instances and Operators

3. Core Concepts and Terminology


- DAGs, Operators, Tasks
- Connections and Variables
- Hooks and Executors

4. Airflow Installation and Setup


- Installing Airflow in a Local or Cloud Environment
- Configuring Connections for External Systems
- Navigating the Web UI:
- DAGs Dashboard
- Task Logs and Graph View
- Hands-On Exercise:
- Create and Test a Simple DAG with BashOperator (a sketch follows below)
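
A sketch of the exercise above, assuming Airflow 2.x; the dag_id, schedule, and command are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal DAG with a single BashOperator task.
with DAG(
    dag_id="hello_airflow",            # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow'",
    )
```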

Part 2: Scheduling Spark Applications with Airflow 8 Hrs

1. Integrating Spark with Airflow


- Why Schedule Spark Applications with Airflow?
- Airflow Operators for Spark:
- SparkSubmitOperator
- BashOperator for Spark Scripts
- Setting Up SparkSubmitOperator (see the sketch after this list):
- Required Parameters (e.g., application, master, deploy-mode)
- Configuring Dependencies
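
A sketch of a SparkSubmitOperator task, assuming the apache-airflow-providers-apache-spark package is installed; the application path, connection id, and resource values are placeholders. Note that the master URL and deploy mode are normally configured on the Spark connection referenced by conn_id rather than passed to the operator directly.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_etl_job",
        application="/opt/jobs/etl_job.py",   # path to the PySpark application (placeholder)
        conn_id="spark_default",              # master URL and deploy mode live on this connection
        name="daily_etl",
        executor_memory="2g",
        num_executors=2,
        application_args=["--run-date", "{{ ds }}"],
    )
```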

2. Building a DAG to Schedule Spark Applications


- Step-by-Step Guide to DAG Creation:
- Define the DAG
- Create SparkSubmit Tasks
- Manage Dependencies Between Tasks
- Hands-On Exercise:
- Write a DAG to Submit a Spark Job to a Local/Cluster Environment

3. Monitoring and Troubleshooting Spark Jobs


- Viewing Task Logs in Airflow
- Handling Failed Spark Jobs (see the sketch after this list):
- Retry Policies
- On-Failure Callbacks
- Debugging Common Issues in SparkSubmitOperator
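
A sketch of retry policies and an on-failure callback; the callback body and the submit command are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical callback: in practice this might post to a chat channel or pager.
def notify_failure(context):
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} failed; sending alert...")

default_args = {
    "retries": 2,                          # retry a failed Spark job twice
    "retry_delay": timedelta(minutes=5),   # wait between attempts
    "on_failure_callback": notify_failure,
}

with DAG(
    dag_id="spark_job_with_retries",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    submit_job = BashOperator(
        task_id="submit_spark_job",
        bash_command="spark-submit /opt/jobs/etl_job.py",   # illustrative command
    )
```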

4. Real-World Use Case: End-to-End Pipeline


- Scenario:
- Ingest Data (from S3 or HDFS)
- Process Data Using Spark
- Write Output to a Data Warehouse or Data Lake
- DAG Implementation:
- Task for Data Ingestion
- Task for Spark Processing
- Task for Data Export
- Hands-On Exercise:
- Build and Test the Pipeline in Airflow (a skeleton DAG follows below)
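
A skeleton of the end-to-end pipeline described in this scenario; every path, script, and connection id below is a placeholder.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="end_to_end_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="aws s3 sync s3://my-bucket/raw/ /data/landing/",  # illustrative ingestion step
    )

    process = SparkSubmitOperator(
        task_id="spark_transform",
        application="/opt/jobs/transform.py",   # placeholder Spark job
        conn_id="spark_default",
    )

    export = BashOperator(
        task_id="export_to_warehouse",
        bash_command="python /opt/jobs/load_to_warehouse.py",  # illustrative export step
    )

    # Ingestion runs first, then Spark processing, then the warehouse/lake load.
    ingest >> process >> export
```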
