Setting Up Airflow With Docker From Installation To Data Processing
Apache Airflow has become the industry standard for orchestrating complex
computational workflows and data processing pipelines. By the end of this
presentation, you'll understand how to set up, configure, and use Airflow to
transform your data processing capabilities.
Let's begin our journey into the world of automated workflow management
with Airflow.
by Rakhmadina Noviyanti
Introduction to Apache Airflow
Workflow Orchestration
Programmatically author, schedule, and monitor workflows as directed acyclic graphs (DAGs) of tasks
Advanced Scheduling
Define complex scheduling patterns with cron-like syntax for precise execution timing
Rich UI Dashboard
Monitor workflow execution, inspect logs, and troubleshoot issues through an intuitive web interface
Apache Airflow revolutionizes how data teams manage complex computational workflows. As an open-source platform, it excels at orchestrating sophisticated data
pipelines through its DAG-based architecture, where tasks and their dependencies are clearly defined.
The platform offers exceptional flexibility for ETL processes, enabling precise scheduling, retries, and dependency management. Its rich user interface provides
comprehensive visibility into workflow execution, making monitoring and debugging straightforward.
Installing Docker and Initial Setup
Setting up Airflow with Docker begins with installing Docker Desktop, which provides the containerization platform
needed to run Airflow services in isolated environments. This approach significantly simplifies deployment and ensures
consistency across different systems.
After installation, creating the proper directory structure is crucial for organizing Airflow components. The dags, logs, plugins, and config folders store your workflow definitions, execution logs, custom plugins, and the configuration files that Airflow needs to function properly.
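One hedged way to create these folders from Python, assuming the default docker-compose layout with dags, logs, plugins, and config mounts, is a short pathlib script:

from pathlib import Path

# Create the folders that the default docker-compose file mounts into the
# Airflow containers; adjust the names if your compose file uses different mounts.
for folder in ("dags", "logs", "plugins", "config"):
    Path(folder).mkdir(parents=True, exist_ok=True)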
Initializing Airflow Environment
Initialize Database
Run docker compose up airflow-init to create initial database schema
Access Airflow UI
Navigate to http://localhost:8080 in your web browser
Login
Use default credentials: username "airflow", password "airflow"
Once your Docker environment is configured, the next step is initializing the Airflow database. This one-time setup creates the necessary tables and default settings for Airflow to
function properly.
With initialization complete, starting the Airflow services brings up multiple containers working together: the scheduler, webserver, worker, and database. The web interface provides
your primary method of interaction with the system, where you'll monitor workflows and configure connections.
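As an optional sanity check (an assumption on my part, not one of the standard setup steps), the webserver's /health endpoint can be polled from Python to confirm the containers are responding:

import requests

# The webserver's unauthenticated /health endpoint reports the status of the
# metadatabase and scheduler once the containers are up.
response = requests.get("http://localhost:8080/health", timeout=10)
print(response.status_code, response.json())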
Understanding Airflow Components
Scheduler
Heart of Airflow that orchestrates when tasks run, based on their schedules and dependencies
Workers
Execute the individual tasks dispatched by the scheduler
Airflow's architecture consists of several key components working in harmony. The webserver provides a comprehensive user interface for monitoring and managing your
workflows, while the scheduler determines when tasks should run based on their dependencies and schedules.
Workers handle the actual execution of tasks, which may include running Python functions, executing Bash commands, or interacting with external systems. The metadata
database ties everything together by storing information about DAG structures, task states, and execution history.
Creating Your First DAG
DAG Object
Define core DAG parameters and properties
Operators
Create individual tasks using appropriate operators
Dependencies
Set execution order with task dependencies
Creating your first DAG involves defining a Python file that specifies your workflow structure. The DAG object itself requires parameters like a unique ID, schedule interval, and start date. These parameters determine when and how
often your workflow will run.
Tasks are represented by operators, pre-built components for common actions like running Python functions (PythonOperator), executing Bash commands (BashOperator), or running SQL against a database (PostgresOperator). Dependencies
between tasks establish the execution order, typically set using the bitshift operators (>> and <<) for readability.
from airflow.operators.bash import BashOperator

# Run the Unix `date` command to print the current date
task2 = BashOperator(
    task_id='print_date',
    bash_command='date'
)
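For context, the snippet above would sit inside a complete DAG file along the lines of the following sketch; the DAG ID example_dag, the print_hello callable, and the daily schedule are illustrative placeholders rather than values from this presentation.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_hello():
    # Simple callable executed by the PythonOperator
    print("Hello from Airflow!")


with DAG(
    dag_id="example_dag",             # unique identifier shown in the UI
    start_date=datetime(2024, 1, 1),  # first date the DAG is eligible to run
    schedule_interval="@daily",       # cron-like schedule preset
    catchup=False,                    # do not backfill missed runs
) as dag:
    task1 = PythonOperator(
        task_id="print_hello",
        python_callable=print_hello,
    )

    task2 = BashOperator(
        task_id="print_date",
        bash_command="date",
    )

    # Bitshift operator sets the execution order: task1 runs before task2
    task1 >> task2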
Configuring a PostgreSQL Connection
Configure Connection
Set connection type to "PostgreSQL" and enter the connection details
Securely managing database connections is a key Airflow feature. Rather than hardcoding credentials in your DAGs, Airflow provides a centralized
connection management system that enhances security and simplifies maintenance.
When creating a PostgreSQL connection, you'll need to provide a unique Connection ID (which will be referenced in your DAGs), along with the
host, port, database name, username, and password. For Docker installations, remember that database services often have their own container
names that should be used as the host (e.g., "postgres" instead of "localhost").
This connection can then be used in your DAGs by referencing the Connection ID, allowing your workflows to securely interact with the database
without exposing sensitive credentials in your code.
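As a hedged illustration of that pattern, a task can look up the stored connection through a hook at runtime; the Connection ID my_postgres and the table name below are placeholders, and the example assumes the Postgres provider package is installed.

from airflow.providers.postgres.hooks.postgres import PostgresHook


def count_rows():
    # Resolve host, credentials, and database from the stored connection at runtime
    hook = PostgresHook(postgres_conn_id="my_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM my_table")
    print(f"Row count: {records[0][0]}")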
Creating a Python Script for Data Cleansing
Script Structure
• Import psycopg2 and other required libraries
• Define database connection function using environment variables
• Implement data extraction query
• Apply cleansing operations and transformations
Common Cleansing Operations
• Handle missing values
• Remove duplicates
• Standardize formats (dates, numbers, text)
• Apply data validation rules
Data Loading
• Prepare cleaned data for insertion
• Create target tables if not exist
• Implement batch insertion for efficiency
• Handle potential errors and exceptions
Your Python data cleansing script serves as the core of your ETL process. The script should be modular and reusable, handling database
connections, data extraction, transformation, and loading back to PostgreSQL. Using environment variables or Airflow's connection
system keeps credentials secure.
The cleansing phase might include fixing inconsistent formatting, handling missing values, normalizing data, and removing duplicates.
For larger datasets, consider implementing batch processing to avoid memory issues. Finally, the script should include proper error
handling and logging to facilitate troubleshooting when running as part of an automated workflow.
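The sketch below shows one possible shape for such a script. It is an assumption-laden example: the environment variable names (DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD) and the raw_customers and clean_customers tables are illustrative, not part of the presentation.

import os

import psycopg2
from psycopg2.extras import execute_values


def get_connection():
    # Credentials come from environment variables instead of being hardcoded
    return psycopg2.connect(
        host=os.environ["DB_HOST"],            # e.g. "postgres" inside Docker
        port=os.environ.get("DB_PORT", "5432"),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )


def cleanse_rows(rows):
    # Drop rows with missing values, remove exact duplicates, normalize text columns
    seen = set()
    cleaned = []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(tuple(
            col.strip().lower() if isinstance(col, str) else col for col in row
        ))
    return cleaned


def main():
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT id, name, email FROM raw_customers")
        cleaned = cleanse_rows(cur.fetchall())

        cur.execute(
            """CREATE TABLE IF NOT EXISTS clean_customers (
                   id INTEGER PRIMARY KEY, name TEXT, email TEXT)"""
        )
        # Batch insertion for efficiency; rows that already exist are left untouched
        execute_values(
            cur,
            "INSERT INTO clean_customers (id, name, email) "
            "VALUES %s ON CONFLICT (id) DO NOTHING",
            cleaned,
        )


if __name__ == "__main__":
    main()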
Implementing BashOperator to Run Python Scripts
Define BashOperator Task
Create a BashOperator task in your DAG file that specifies the command to run your Python script
Pass Connection Details
Use Airflow's connection mechanism to securely retrieve PostgreSQL credentials
The BashOperator provides a straightforward way to execute Python scripts within your Airflow workflow. By using this operator, you can leverage existing Python scripts without needing to rewrite them as native Airflow tasks, making it ideal for
integrating established data processing code.
When configuring the BashOperator, you can pass environment variables containing connection details retrieved from Airflow's connection system. This approach keeps your credentials secure while making them available to your Python script.
Additionally, you can pass command-line arguments to your script for dynamic configuration based on the specific DAG run.
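A hedged sketch of that configuration follows; the Connection ID my_postgres and the script path /opt/airflow/scripts/cleanse.py are assumptions, and the env values rely on Airflow's {{ conn.<conn_id> }} template variables (available in Airflow 2.2 and later) so credentials are resolved only at runtime.

from airflow.operators.bash import BashOperator

# Defined inside a DAG context (omitted here for brevity). env is a templated
# field, so the connection attributes are rendered when the task runs.
run_cleansing = BashOperator(
    task_id="run_data_cleansing",
    bash_command="python /opt/airflow/scripts/cleanse.py",
    env={
        "DB_HOST": "{{ conn.my_postgres.host }}",
        "DB_PORT": "{{ conn.my_postgres.port }}",
        "DB_NAME": "{{ conn.my_postgres.schema }}",
        "DB_USER": "{{ conn.my_postgres.login }}",
        "DB_PASSWORD": "{{ conn.my_postgres.password }}",
    },
)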
Best Practices and Troubleshooting
DAG Design
Keep tasks atomic and idempotent for better reliability
Monitoring
Key areas to monitor: task duration, failure rates, resource usage
24/7 Availability
Configure proper retries and alerting for mission-critical workflows
95% Test Coverage
Aim for comprehensive testing of all DAGs and custom operators
Successful Airflow implementations follow several best practices. First, design DAGs to be idempotent, meaning repeated runs produce the same result rather than duplicating work. Use descriptive naming for DAGs and tasks to make monitoring
easier, and implement comprehensive logging within your scripts and operators.
When troubleshooting, check the Airflow logs first, which can be accessed through the web UI. Common issues include connection problems, environment variable misconfigurations, and timing issues with task dependencies. For
production deployments, consider implementing monitoring, alerting, and scaling strategies to ensure reliable operation of your data pipelines.
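For instance, retries and failure alerting can be set once through default_args and applied to every task in a DAG; the values and the alert address below are illustrative only, and email alerts additionally require SMTP to be configured.

from datetime import timedelta

# Passed to the DAG as DAG(default_args=default_args, ...)
default_args = {
    "retries": 3,                         # retry failed tasks up to three times
    "retry_delay": timedelta(minutes=5),  # wait five minutes between attempts
    "email_on_failure": True,             # send an alert email when a task fails
    "email": ["alerts@example.com"],      # illustrative alert address
}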