PySpark for ETL
Spark | PySpark
Spark Foundations: 4 Hours
• What is Big Data?
• What is Data Engineering?
• Need for Big Data Technologies
• Batch Analytics
• Real-Time Analytics
• Introduction to PySpark
• Need for PySpark
• Why is Spark faster than Hadoop MapReduce?
• What is in-memory processing?
• Understanding PySpark use cases and where it fits in building data pipelines (see the sketch after this list)
• PySpark Architectural Components (how PySpark runs with Hadoop YARN and in standalone mode)
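
To ground these foundations, here is a minimal sketch of the kind of PySpark ETL job this module covers, assuming a local Spark installation; the input path, column names, and output path are hypothetical. Note how cache() keeps an intermediate DataFrame in memory, which is the in-memory processing that sets Spark apart from Hadoop MapReduce's disk-based stages, and how replacing master("local[*]") with a YARN or standalone master moves the same job onto a cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a SparkSession; master("local[*]") runs Spark locally,
# but the same code can run under YARN or a standalone cluster.
spark = (
    SparkSession.builder
    .appName("etl-demo")
    .master("local[*]")
    .getOrCreate()
)

# Extract: read a CSV file (path and schema are hypothetical).
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Transform: filter and aggregate; cache() keeps the intermediate
# result in memory, illustrating Spark's in-memory processing.
completed = orders.filter(F.col("status") == "COMPLETED").cache()
revenue = completed.groupBy("customer_id").agg(F.sum("amount").alias("revenue"))

# Load: write the result out as Parquet (output path is hypothetical).
revenue.write.mode("overwrite").parquet("output/revenue")

spark.stop()
```
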
1. Overview of Airflow
- What is Apache Airflow?
- Key Features of Airflow:
- Workflow Automation
- Dynamic Pipelines (see the sketch after this list)
- Extensibility and Scalability
- Real-world Use Cases of Airflow
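
To make "Dynamic Pipelines" concrete, here is a minimal sketch of an Airflow DAG that generates one task per table from a plain Python list, assuming Airflow 2.x; the DAG id, table names, and extract function are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_table(table_name):
    # Placeholder extract step; a real task would pull from a source system.
    print(f"extracting {table_name}")


with DAG(
    dag_id="dynamic_extract",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Tasks are generated dynamically from a plain Python list, so
    # adding a table adds a task without rewriting the pipeline.
    for table in ["orders", "customers", "products"]:  # hypothetical tables
        PythonOperator(
            task_id=f"extract_{table}",
            python_callable=extract_table,
            op_kwargs={"table_name": table},
        )
```
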
2. Airflow Architecture
- Core Components:
- Scheduler
- Web Server
- Metadata Database
- Executor and Workers
- DAG (Directed Acyclic Graph) Concept
- Task Instances and Operators (see the sketch after this list)
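
Here is a minimal sketch tying these components together, again assuming Airflow 2.x: the operators below define a DAG, the scheduler reads it and creates task instances for each run, and the executor hands those instances to workers. The DAG id and bash commands are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A three-step ETL DAG; on each scheduled run the scheduler turns
# these operator definitions into concrete task instances.
with DAG(
    dag_id="daily_etl",                # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The >> operator declares the edges of the directed acyclic graph.
    extract >> transform >> load
```
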