
AIRFLOW

➢ Apache Airflow is an open-source platform used for orchestrating complex
   workflows and data processing pipelines. It enables users to schedule,
   monitor, and manage workflows programmatically through Python scripts.
   Airflow provides a rich set of features for workflow management, including
   task dependencies, dynamic scheduling, extensibility, and monitoring
   capabilities.

Key Concepts:

➢ DAGs (Directed Acyclic Graphs):
   ● A DAG is a collection of tasks arranged in a specific order with
     dependencies between them (see the minimal example after this list).
   ● Tasks represent individual units of work, such as running a Python script,
     executing an SQL query, or transferring files.
   ● DAGs define the workflow logic and the dependencies between tasks.

➢ Operators:
   ● Operators define the type of task to be executed within a DAG.
   ● Airflow provides built-in operators for common tasks, such as the
     BashOperator, the PythonOperator, and various SQL operators.
   ● Custom operators can be developed to extend Airflow's functionality.

➢ Scheduler and Executors:
   ● The Airflow scheduler decides when tasks are due to run and hands them to
     an executor, which determines how and where they are executed.
   ● The SequentialExecutor (the default) runs one task at a time, and the
     LocalExecutor runs tasks in parallel on a single machine.
   ● The CeleryExecutor distributes task execution across multiple worker
     nodes, and the KubernetesExecutor runs each task in its own pod on a
     Kubernetes cluster.

➢ Web Interface:
   ● Airflow comes with a web-based user interface for visualizing and
     monitoring workflows.
   ● Users can view DAGs, task statuses, and execution logs, and trigger or
     pause runs through the web interface.

➢ Plugins:
   ● Airflow supports a plugin architecture to extend its capabilities.
   ● Plugins can be used to add custom operators, hooks, sensors, and other
     components to Airflow.
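
To make these concepts concrete, the following minimal sketch shows a DAG with
two dependent tasks. The dag_id 'example_dag', the task ids, and the
print_hello callable are illustrative names chosen for this sketch, not part of
Airflow itself.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def print_hello():
        # A trivial unit of work executed by the PythonOperator.
        print("hello from Airflow")


    with DAG('example_dag',
             start_date=datetime(2022, 7, 7),
             schedule_interval='@daily',
             catchup=False) as dag:

        # BashOperator runs a shell command as one task.
        extract = BashOperator(task_id='extract',
                               bash_command='echo "extracting data"')

        # PythonOperator calls a Python function as another task.
        transform = PythonOperator(task_id='transform',
                                   python_callable=print_hello)

        # The >> operator declares the dependency: extract runs before transform.
        extract >> transform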

DAGs (Directed Acyclic Graphs):

Introduction:

In Apache Airflow, Directed Acyclic Graphs (DAGs) represent a collection of tasks with
dependencies and a defined schedule. DAGs are defined using Python scripts and are
the backbone of workflows managed by Airflow. Each DAG describes how tasks are
structured and how they relate to each other.

Example:

    from datetime import datetime, timedelta

    # Typical import paths for these operators in Airflow 2.x with the Amazon
    # and Druid provider packages installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
    from airflow.providers.apache.druid.operators.druid import DruidOperator

    # `notification` and `healthcheck_uuid` are project-specific helpers
    # assumed to be defined or imported alongside this DAG file.

    default_args = {
        'owner': 'Gaurav',
        'start_date': datetime(2022, 7, 7),
        'sla': timedelta(minutes=5),
        "params": {
            "endpoint": healthcheck_uuid
        }
    }

    mysql_params = {
        "mysql_conf_file": "/opt/airflow/mysql.cnf",
        "conn_suffix": "db-zenu"
    }

    with DAG('age_of_listings_count_daily',
             default_args=default_args,
             schedule_interval='1 0 * * *',
             template_searchpath=['/opt/airflow/templates'],
             catchup=False,
             concurrency=1) as dag:

        # Send a health-check notification when the DAG run starts.
        t0 = notification.dag_start_healthcheck_notify("start")

        # Dump the daily "age of listings" counts via a bash script.
        t1 = BashOperator(task_id='get_daily',
                          bash_command='druid_dump_age_of_listings_count.bash',
                          params=mysql_params,
                          retries=2, retry_delay=timedelta(seconds=15),
                          on_failure_callback=notification.task_fail_healthcheck_notify)

        # Wait until the gzipped CSV for the execution date appears in S3.
        t2 = S3KeySensor(task_id='check_daily',
                         bucket_key="{{ var.json.s3_age_of_listings_count_data_path.s3_path }}/"
                                    "{{ execution_date.format('YYYY-MM-DD') }}*.csv.gz",
                         bucket_name="{{ var.json.s3_age_of_listings_count_data_path.bucket_name }}",
                         wildcard_match=True,
                         retries=2, retry_delay=timedelta(seconds=5),
                         on_failure_callback=notification.task_fail_healthcheck_notify)

        # Ingest the dumped file into Druid using the ingestion spec template.
        t3 = DruidOperator(task_id='ingest_daily',
                           json_index_file='druid_ingest_age_of_listings_count.json',
                           druid_ingest_conn_id='druid_ingest_conn',
                           on_failure_callback=notification.task_fail_healthcheck_notify)

        # Send a health-check notification when the DAG run succeeds.
        tn = notification.dag_success_healthcheck_notify("end")

        # Define the task order: start -> dump -> check -> ingest -> end.
        t0 >> t1 >> t2 >> t3 >> tn
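
The bit-shift chain t0 >> t1 >> t2 >> t3 >> tn defines a linear run: the
"start" health-check notification fires first, the bash dump runs next, the S3
sensor blocks until the expected CSV file appears, the Druid ingestion follows,
and the "end" notification marks success. The notification helpers and the
healthcheck_uuid value referenced in default_args come from project-specific
code outside this snippet.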
