
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Orchestration, Monitoring, and
Automating your Data Pipelines

Week 4
Orchestration, Monitoring, and
Automating your Data Pipelines

Week 4 Overview
The Three Pillars of DataOps: Automation, Observability & Monitoring, Incident Response

This week focuses on automation, applied as infrastructure as code and pipeline as code:

• Automate individual tasks in your data pipeline

• Build in data quality tests and monitoring

• Orchestration

The Undercurrents

Security, Data Management, DataOps, Data Architecture, Orchestration, Software Engineering
Week 4 Plan

1. Evolution of Orchestration Tools
   Cron, Oozie, Luigi, Airflow, Prefect, Dagster, Mage

2. Basic Details of Orchestration

3. Details of Airflow
Orchestration, Monitoring, and
Automating Data Pipelines

Before Orchestration
Cron

• A command line utility introduced in the 1970s


• Used to execute a particular command at a specified
date and time.
Cron Job

* * * * * command to be executed

The five fields, from left to right (a * means no restriction on that field):

• Minute (0 - 59)
• Hour (0 - 23)
• Day of month (1 - 31)
• Month (1 - 12)
• Day of week (0 = Sunday ... 6 = Saturday)
Cron Job

0 0 1 1 * echo "Happy New Year"

Reading the fields: 0th minute, 0th hour, 1st day, 1st month, any day of the week. The command runs every year at midnight on January 1st:

01/01/2006 0:00:00  Happy New Year
01/01/2007 0:00:00  Happy New Year
01/01/2008 0:00:00  Happy New Year
01/01/2009 0:00:00  Happy New Year
Scheduling Data Pipeline with Cron (Pure Scheduling Approach)

Pipeline: ingest from a REST API and from a database, transform the API data, combine the two datasets, and load the result into the data warehouse. With cron, each step gets its own schedule:

0 0 * * * python ingest_from_rest_api.py         (every night at midnight)
0 0 * * * python ingest_from_database.py         (every night at midnight)
0 1 * * * python transform_api_data.py           (1 AM every night)
0 2 * * * python combine_api_and_database.py     (2 AM every night)
Load into the data warehouse: ?

With a pure scheduling approach, you have to keep picking a start time for each downstream step and hope the upstream steps have finished.
When To Use Cron?

• To schedule simple and repetitive tasks:


Example: Regular data downloads

• In the prototyping phase


Example: Testing aspects of your data pipeline
Orchestration, Monitoring, and
Automating Data Pipelines

Evolution of Orchestration Tools


Orchestration Tools

• Late 2000s: Dataswarm, Facebook's internal orchestration tool.

• 2010s: Hadoop-era tools such as Oozie, designed to work within a Hadoop cluster and difficult to use in a more heterogeneous environment.

• 2014: Airflow, introduced by Maxime Beauchemin at Airbnb and released as an open source project.

• 2019: a newer generation of tools such as Prefect, Dagster, and Mage.
Advantages and Challenges of Airflow
Advantages:
• Airflow is written in Python
• The Airflow open source project is very active
• Airflow is available as a managed service

Challenges:
• Scalability
• Ensuring data integrity
• No support for streaming pipelines
Other Open-Source Orchestration Tools

• More scalable orchestration solutions than Airflow

• Built-in capabilities for data quality testing


Orchestration, Monitoring, and
Automating your Data Pipelines

Orchestration Basics
Orchestration

Pros:
• Set up dependencies
• Monitor tasks
• Get alerts
• Create fallback plans

Cons:
• More operational overhead than simple Cron scheduling
Orchestration

Directed Acyclic Graph (DAG)

Example: "ingest from a REST API" feeds "transform data"; "transform data" and "ingest from a database" both feed "combine data", which feeds "load to data warehouse". Each task is a node, and each dependency between tasks is an edge.

• Data flows only in one direction
• No circles or cycles
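Jumping ahead to the Airflow syntax covered later this week, the DAG above might be declared roughly as follows. This is a minimal sketch assuming Airflow 2.x (the schedule parameter was named schedule_interval in older releases); the dag_id and the task callables are illustrative, not taken from the labs.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables; a real pipeline would put the ingestion/transform logic here.
    def ingest_from_rest_api(): ...
    def ingest_from_database(): ...
    def transform_api_data(): ...
    def combine_data(): ...
    def load_to_warehouse(): ...

    with DAG(dag_id="example_pipeline", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False) as dag:
        ingest_api = PythonOperator(task_id="ingest_from_rest_api", python_callable=ingest_from_rest_api)
        ingest_db = PythonOperator(task_id="ingest_from_database", python_callable=ingest_from_database)
        transform = PythonOperator(task_id="transform_api_data", python_callable=transform_api_data)
        combine = PythonOperator(task_id="combine_data", python_callable=combine_data)
        load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

        # Edges of the DAG: a downstream task runs only after its upstream tasks succeed.
        ingest_api >> transform >> combine
        ingest_db >> combine
        combine >> load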
Scheduling Tasks with Cron

0 0 * * * python ingest_from_rest_api.py         (every night at midnight)
0 0 * * * python ingest_from_database.py         (every night at midnight)
0 1 * * * python transform_api_data.py           (1 AM every night)
0 2 * * * python combine_api_and_database.py     (2 AM every night)

If an ingestion task fails or takes more than 1 hour, the transform step still kicks off before the previous step is finished, and the downstream tasks (combine data, load into the data warehouse) break.
Orchestration

You could build in dependencies between tasks: each task starts only when the tasks it depends on are complete. For example, "combine data" starts once "transform data" and "ingest from a database" have both completed, regardless of how long they take.
Orchestration in Airflow

Tasks

Dependencies
Orchestration in Airflow

You can set up the conditions on which the DAG should run:

time-based conditions or event-based conditions


Orchestration in Airflow

Trigger the DAG based on a schedule


Orchestration in Airflow

Trigger the DAG based on an event


Example: when a dataset has been updated by another DAG
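In recent Airflow versions (2.4+), this kind of data-aware triggering can be expressed with Datasets. A minimal sketch, where the dataset URI and the dag_ids are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.empty import EmptyOperator

    combined_data = Dataset("s3://example-bucket/combined/")   # placeholder URI

    # Producer DAG: one of its tasks declares that it updates the dataset (outlets).
    with DAG(dag_id="producer_dag", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False):
        EmptyOperator(task_id="update_dataset", outlets=[combined_data])

    # Consumer DAG: triggered whenever the dataset is updated, instead of on a time-based schedule.
    with DAG(dag_id="consumer_dag", start_date=datetime(2024, 1, 1),
             schedule=[combined_data], catchup=False):
        EmptyOperator(task_id="process_update")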
Orchestration in Airflow

Trigger a portion of the DAG based on an event


Example: presence of a file in an S3 bucket
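One common way to express this is with a sensor from the Amazon provider package. A minimal sketch, declared inside a DAG definition; the bucket and key names are placeholders:

    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    # Polls S3 until the object exists, then lets the downstream tasks run.
    wait_for_file = S3KeySensor(
        task_id="wait_for_input_file",
        bucket_name="example-bucket",     # placeholder bucket
        bucket_key="data/input.csv",      # placeholder key
        poke_interval=60,                 # check every 60 seconds
        timeout=60 * 60,                  # fail the sensor after one hour of waiting
    )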
Orchestration in Airflow

Set up monitoring, logging and alerts


Orchestration in Airflow

Set up data quality checks, for example as an extra task in the pipeline (ingest from a REST API / ingest from a database, transform data, combine data, check data quality):

• Checking the count of null values
• Checking the range of some set of values
• Verifying that the schema is as expected
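The checks themselves can be ordinary Python that is later wrapped in an Airflow task (operators are introduced below). A minimal sketch; the column names and thresholds are illustrative assumptions:

    import pandas as pd

    def check_data_quality(df: pd.DataFrame) -> None:
        # Count of null values in a key column
        assert df["user_id"].isnull().sum() == 0, "null user_id values found"
        # Range check on a set of values
        assert df["price"].between(0, 10_000).all(), "price outside the expected range"
        # Schema check
        expected_columns = {"user_id", "price", "created_at"}
        assert expected_columns.issubset(df.columns), "schema is not as expected"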
Orchestration, Monitoring, and
Automating your Data Pipelines

Airflow - Core Components


Airflow Components

• Web Server: serves the User Interface, which you use to visualize, monitor, trigger, and troubleshoot your DAGs.

• DAG Directory: stores the Python scripts defining your DAGs; the scheduler and workers read the DAG definitions from it.

• Scheduler: monitors your DAGs, triggers them based on a schedule or an event, and checks which tasks are ready to be triggered; ready tasks are pushed to the executor, which runs inside the scheduler.

• Workers: extract tasks from the executor and run them.

• Metadata Database: stores the status of each task (scheduled, queued, running, success, failed); the web server extracts task status from it to display in the UI.
Amazon Managed Workflows for Apache Airflow (MWAA)

With MWAA, the core Airflow components are provisioned and managed for you:

• Scheduler, Workers, and Web Server: run in a managed Airflow environment
• Metadata Database: Aurora PostgreSQL
• DAG Directory: an S3 bucket where you upload your DAG scripts
Airflow Components

In the labs of this week, you will:

1. set up the Airflow environment


2. use Cloud9 to define your DAGs as Python scripts
3. upload the DAG scripts to the S3 bucket
4. open the Airflow UI
Orchestration, Monitoring, and
Automating your Data Pipelines

Airflow - The Airflow UI


Orchestration, Monitoring, and
Automating your Data Pipelines

Airflow - Creating a DAG


DAG Example
Airflow Operators

• Operators are Python classes used to encapsulate the logic of the tasks.
• You can import Operators into your code.
• Each task is an instance of an Airflow Operator.
Airflow Operators

Types of Operators:

• PythonOperator: executes a Python function

• BashOperator: executes bash commands
• EmptyOperator: organizes the DAGs
• EmailOperator: sends you a notification via email
• Sensor: a special type of operator used to make your DAGs event-driven
Airflow Operators

PythonOperator
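The code shown on this slide is not reproduced here; a minimal PythonOperator sketch along the same lines (the dag_id, task_id, and callable are illustrative):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def greet(name):
        print(f"hello, {name}")

    with DAG(dag_id="python_operator_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False) as dag:
        # Each task is an instance of an operator; PythonOperator runs the given callable,
        # and op_kwargs passes keyword arguments to it.
        greet_task = PythonOperator(task_id="greet", python_callable=greet,
                                    op_kwargs={"name": "data engineer"})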
Orchestration, Monitoring, and
Automating your Data Pipelines

Airflow - XCom and Variables


Lab 1

Pass data from one task to another using intermediate storage

get_authors
get_random_book
get_cover

Amazon S3
XCom

• XCom is short for cross-communication.


• Designed to pass small amounts of data: metadata, dates, single value
metrics, simple computations.
XCom

Pass a value from task 1 to task 2:
• Task 1 calls xcom_push to write an XCom variable to the metadata database
• Task 2 calls xcom_pull to read it back

An XCom variable consists of:
• Value
• Key
• Timestamp
• DAG id
• Task id
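A minimal sketch of pushing and pulling an XCom value with PythonOperator callables. The key, values, and task names are illustrative; Airflow 2.x passes the task instance (ti) to a callable that names it in its signature:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(ti):
        # Push a small value to XCom (stored in the metadata database).
        ti.xcom_push(key="record_count", value=42)

    def report(ti):
        # Pull the value pushed by the "extract" task.
        count = ti.xcom_pull(task_ids="extract", key="record_count")
        print(f"extracted {count} records")

    with DAG(dag_id="xcom_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        report_task = PythonOperator(task_id="report", python_callable=report)
        extract_task >> report_task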
User-Created Variables

Instead of including hard-coded values in your DAGs:

• Create variables in the Airflow UI.

• Create environment variables.
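Reading such a variable in a DAG file is a one-liner. A sketch with an assumed variable name; it would be created beforehand in the Airflow UI (Admin > Variables) or as an environment variable named AIRFLOW_VAR_BUCKET_NAME:

    from airflow.models import Variable

    # Falls back to the default if the variable has not been defined.
    bucket_name = Variable.get("bucket_name", default_var="example-bucket")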
Orchestration, Monitoring, and
Automating your Data Pipelines

Airflow - TaskFlow API


TaskFlow API

Traditional paradigm:
• Instantiate a DAG object to define the DAG
• Use PythonOperator to define tasks

TaskFlow API:
• DAGs are easier to write and more concise
• Use decorators to create the DAG and its tasks
TaskFlow API
DAG Definition

• Traditional approach: explicitly instantiate a DAG object, passing the dag_id.
• TaskFlow API: the @dag decorator implicitly calls the DAG constructor; the dag_id defaults to the decorated function's name.
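A side-by-side sketch of the two paradigms; the dag ids, dates, and schedules are illustrative, and Airflow 2.x is assumed:

    from datetime import datetime
    from airflow import DAG
    from airflow.decorators import dag

    # Traditional approach: instantiate the DAG object explicitly.
    with DAG(dag_id="my_pipeline", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False) as my_dag:
        ...

    # TaskFlow API: @dag implicitly calls the DAG constructor;
    # the function name ("my_pipeline_taskflow") becomes the dag_id by default.
    @dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
    def my_pipeline_taskflow():
        ...

    my_pipeline_taskflow()   # calling the decorated function registers the DAG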
Task Definition

Traditional Approach

Keep track of three names:
• the task_id
• the name of the Python function
• the name of the task variable
Task Definition

TaskFlow API:
• Use the @task decorator, which implicitly calls PythonOperator (the task_id defaults to the function's name)
• Keep track of fewer names
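A sketch of both styles for a single task; the dag_id, task names, and printed messages are illustrative:

    from datetime import datetime
    from airflow import DAG
    from airflow.decorators import task
    from airflow.operators.python import PythonOperator

    def say_hello():
        print("hello")

    with DAG(dag_id="task_definition_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False):
        # Traditional approach: track the function name, the task_id, and the task variable.
        say_hello_task = PythonOperator(task_id="say_hello", python_callable=say_hello)

        # TaskFlow API: @task wraps the function in a PythonOperator,
        # and the task_id defaults to the function name.
        @task
        def say_hello_taskflow():
            print("hello again")

        say_hello_taskflow()   # calling it inside the DAG context creates the task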
DAG Dependencies

Use the bit-shift operator (>>) to declare dependencies between tasks; this works the same way in the TaskFlow API as in the traditional approach.
XCom

• Traditional approach: pass values between tasks with explicit xcom_push and xcom_pull calls.
• TaskFlow API: pass the return value of one task function directly as the argument of another, e.g. print_data(extract_from_api()); Airflow handles the XCom push and pull, and the task dependency, for you.
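A sketch of this pattern; the dag_id, task names, and returned data are illustrative:

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
    def taskflow_xcom_example():

        @task
        def extract_from_api():
            # The return value is pushed to XCom automatically.
            return {"book": "Fundamentals of Data Engineering"}

        @task
        def print_data(data):
            # The argument is pulled from XCom automatically.
            print(data["book"])

        # Passing the output of one task to another also creates the dependency.
        print_data(extract_from_api())

    taskflow_xcom_example()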
Orchestration, Monitoring, and
Automating your Data Pipelines

Orchestration on AWS
More control (self-managed Airflow):
• Run the open-source version of Airflow on an Amazon EC2 instance or in a container.
• You have full control over the configuration and scaling of your Airflow environment.
• You need to manage all of the underlying infrastructure and integrations yourself.

More convenience (MWAA):
• MWAA runs Airflow for you and handles tasks like the provisioning and scaling of the underlying infrastructure.
• It integrates with other AWS services.
AWS Glue Workflows: allow you to create, run, and monitor complex ETL workflows that chain Glue triggers, crawlers, and jobs.

Trigger types:
• A schedule
• On-demand
• Event from Amazon EventBridge
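For the on-demand case, a workflow run can also be started programmatically. A minimal boto3 sketch; the workflow name is a placeholder:

    import boto3

    glue = boto3.client("glue")

    # Start an on-demand run of an existing Glue workflow and check its status.
    run = glue.start_workflow_run(Name="example-etl-workflow")   # hypothetical workflow name
    status = glue.get_workflow_run(Name="example-etl-workflow", RunId=run["RunId"])
    print(status["Run"]["Status"])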

AWS Step Functions: allow you to orchestrate multiple AWS services and states into workflows called state machines.

• States can be tasks that do work (for example, run a Lambda function, a Glue job, or an ECS task)
• States can make decisions based on their input, perform actions from those inputs, and pass output to other states
• Offer a lot of flexibility
• Good for complex workflows.
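A state machine is defined in Amazon States Language and can be started programmatically. A minimal boto3 sketch; the state machine ARN and the input payload are placeholders:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    # Start an execution of an existing state machine with a JSON input.
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-pipeline",  # placeholder
        input=json.dumps({"date": "2024-01-01"}),
    )
    print(response["executionArn"])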

Which service should you use? What are your requirements? What are you trying to optimize for?

Step Functions:
• Provide a serverless option with extensive native AWS service integration
• Ideal for AWS-centric workflows

Glue Workflows:
• Specifically designed for ETL processes
• Allow you to orchestrate Glue jobs, crawlers, and triggers in a serverless environment
Source Systems, Data Ingestion,
and Pipelines

Course Summary
Week 1

Source systems: Databases, Files, Event Systems
Week 1

Relational databases

NoSQL databases:
• Key-Value stores: k1 → value 1, k2 → value 2, k3 → value 3
• Document stores, for example:

{
  "firstName": "Joe",
  "lastName": "Reis",
  "age": 10,
  "address": {
    "city": "Los Angeles",
    "postalCode": 90024,
    "country": "USA"
  }
}

Object Storage
Week 2

Ingestion patterns:
• Batch: semi-frequent
• Micro-batch: frequent
• Streaming: very frequent (e.g., Amazon Kinesis Data Streams)
Week 3

DataOps pillars: Automation, Observability & Monitoring, Incident Response

Week 4

DataOps pillars: Automation, Observability & Monitoring, Incident Response
