DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Week 4
Orchestration, Monitoring, and
Automating your Data Pipelines
Week 4 Overview
The Three Pillars of DataOps
DataOps: Automation, Observability & Monitoring, Incident Response
• Automate individual tasks in your data pipeline: Infrastructure as Code and Pipeline as Code
The Undercurrents
3. Details of Airflow
Orchestration, Monitoring, and
Automating Data Pipelines
Before Orchestration
Cron
A cron expression consists of five time fields followed by the command to be executed; a * means no restriction on that field.
* * * * * command to be executed
The fields, from left to right, are: Minute (0 - 59), Hour (0 - 23), Day (1 - 31), Month (1 - 12), Day of week (0 - 6).
Cron Job
Example: 0 0 * * * runs a command at the 0th minute of the 0th hour, i.e., every night at midnight.

Scheduling a Data Pipeline with Cron
[Diagram: a pipeline with tasks such as "Ingest from a REST API" and several "Ingest from a database" steps, each scheduled with its own cron job to run every night at midnight.]
When To Use Cron?
Orchestration Tools
[Timeline: late 2000s (Dataswarm), 2010s (more orchestration tools introduced), 2014 (Airflow, created by Maxime Beauchemin), 2019]
Advantages and Challenges of Airflow
Orchestration Basics
Orchestration
[Diagram (pros and cons): with simple scheduling, a downstream "Ingest from a database" task starts at its fixed time even if the upstream task fails or takes more than 1 hour; an orchestrator starts it only once the previous task is complete.]
Orchestration in Airflow
A DAG consists of tasks and the dependencies between them.
Orchestration in Airflow
You can set up the conditions on which the DAG should run:
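For example, here is a minimal sketch (not taken from the lab; the dag_id, dates, and schedule are hypothetical, and the schedule parameter assumes Airflow 2.4 or later) of declaring these conditions in code:

```python
# Minimal sketch of DAG run conditions in Airflow 2.x (hypothetical names and dates).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="nightly_pipeline",          # hypothetical DAG name
    schedule="0 0 * * *",               # cron syntax: every night at midnight
    start_date=datetime(2024, 1, 1),    # first date the DAG is eligible to run
    catchup=False,                      # do not backfill runs missed before deployment
) as dag:
    start = EmptyOperator(task_id="start")  # placeholder task
```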
Airflow Components
[Architecture diagram showing how the user interacts with the User Interface / Web Server, Scheduler, Workers, Metadata Database, and DAG Directory.]
• Web Server / User Interface: visualize, monitor, trigger, and troubleshoot your DAGs.
• Scheduler
• Workers
• Metadata Database
• DAG Directory
Task status
• scheduled
• queued
• running
• success
• failed
The metadata database stores the status of the tasks; the other components, such as the web server, extract the status of tasks from it.
Amazon Managed Workflows for Apache Airflow (MWAA)
[Diagram: an Amazon MWAA environment containing the Airflow components, including the Web Server, Workers, and DAG Directory.]
Airflow Operators
• Operators are Python classes used to encapsulate the logic of the tasks.
• You can import Operators into your code.
• Each task is an instance of an Airflow Operator.
Types of Operators: PythonOperator
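As an illustration, here is a minimal sketch (function and task names are hypothetical) of two tasks, each an instance of PythonOperator, wired together with a dependency:

```python
# Sketch: each task is an instance of an operator; here, PythonOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_from_api():
    print("ingesting data from the API...")

def load_to_storage():
    print("loading data to storage...")

with DAG(dag_id="operator_example", schedule=None,
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    ingest_task = PythonOperator(task_id="ingest_from_api",
                                 python_callable=ingest_from_api)
    load_task = PythonOperator(task_id="load_to_storage",
                               python_callable=load_to_storage)

    # Dependency: load_task runs only after ingest_task completes.
    ingest_task >> load_task
```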
Orchestration, Monitoring, and
Automating your Data Pipelines
[Example DAG: tasks get_random_book, get_authors, and get_cover, with results stored in Amazon S3.]
XCom
An XCom variable stores a key, a value, a timestamp, the dag id, and the task id.
[Diagram: Task 1 pushes a value to the metadata database with xcom_push; Task 2 retrieves it with xcom_pull.]
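A rough sketch (task and key names are hypothetical) of the traditional push/pull pattern described above:

```python
# Traditional XCom: Task 1 pushes a value to the metadata database,
# Task 2 pulls it back by task id and key (names here are hypothetical).
def push_book_title(ti):
    ti.xcom_push(key="book_title", value="Fundamentals of Data Engineering")

def print_book_title(ti):
    title = ti.xcom_pull(task_ids="push_book_title", key="book_title")
    print(title)

# These callables would be wrapped in PythonOperator tasks with
# task_id="push_book_title" and task_id="print_book_title" inside a DAG.
```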
User-Created Variables

Traditional Approach
[Code screenshot: the DAG definition specifies a dag_id, and each task definition specifies a task_id.]
Keep track of:
• the task_id,
• the name of the Python function,
• the name of the task variable.

TaskFlow API
• Keep track of fewer names
• Use the @task decorator
DAG Dependencies
TaskFlow API
XCom: Traditional Approach vs. TaskFlow API
With the traditional approach, data is passed between tasks by explicitly pushing and pulling XCom values. With the TaskFlow API, you simply call one task function on the output of another, for example print_data(extract_from_api()), and Airflow performs the XCom exchange for you.
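Here is a minimal sketch of the same idea with the TaskFlow API (the function names mirror the slide's print_data(extract_from_api()) call; the returned data is made up):

```python
# TaskFlow API: returning a value and passing it to another @task creates the
# dependency and moves the data through XCom automatically.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def taskflow_xcom_example():

    @task
    def extract_from_api():
        return {"book": "Fundamentals of Data Engineering"}  # made-up payload

    @task
    def print_data(data):
        print(data)

    print_data(extract_from_api())

taskflow_xcom_example()
```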
Orchestration, Monitoring, and
Automating your Data Pipelines
Orchestration on AWS
More control
• Run the open-source version of Airflow on an Amazon EC2 instance or in a container.
• You have full control over the configuration and scaling of your Airflow environment.
• You need to manage all of the underlying infrastructure and integrations yourself.

More convenience
• MWAA runs Airflow for you and handles tasks like the provisioning and scaling of the underlying infrastructure.
• It integrates with other AWS services.
AWS Glue Workflows: allow you to create, run, and monitor complex ETL workflows.
[Diagram: a workflow chaining triggers, crawlers, and jobs until an End node.]
Trigger types:
• A schedule
• On-demand
• An event from Amazon EventBridge
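For example, an on-demand run of a Glue workflow can be triggered from Python with boto3 (a sketch; the workflow name is hypothetical):

```python
# Sketch: start an AWS Glue workflow run on demand (hypothetical workflow name).
import boto3

glue = boto3.client("glue")
run = glue.start_workflow_run(Name="books_etl_workflow")
print(run["RunId"])
```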
AWS Step Functions allow you to orchestrate multiple AWS services and states into workflows called state machines.
• States can be tasks which do work, such as running a Lambda function, a Glue job, or an ECS task.
• States can make decisions based on their input, perform actions from those inputs, and pass output to other states.
• Offers a lot of flexibility.
• Good for complex workflows.
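As a sketch (the state machine ARN and input are hypothetical), an execution of an existing state machine can be started from Python with boto3:

```python
# Sketch: start an execution of an existing Step Functions state machine.
import json

import boto3

sfn = boto3.client("stepfunctions")
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example",  # hypothetical ARN
    input=json.dumps({"dataset": "books"}),  # hypothetical input passed to the first state
)
print(response["executionArn"])
```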
Course Summary
Week 1
Source systems: databases, files, and event systems.
Relational databases, NoSQL databases (key-value and document), and object storage.
Key-value example: k1 → value 1, k2 → value 2, k3 → value 3
Document example:
{
  "firstName": "Joe",
  "lastName": "Reis",
  "age": 10,
  "address": {
    "city": "Los Angeles",
    "postalCode": 90024,
    "country": "USA"
  }
}
Week 2
Amazon Kinesis Data Streams
Week 3
DataOps pillars: Automation, Observability & Monitoring, Incident Response
Week 4
DataOps pillars: Automation, Observability & Monitoring, Incident Response