
Setting Up Airflow with Docker: From Installation to Data Processing
Welcome to this comprehensive guide on implementing Apache Airflow using
Docker. We'll walk through everything from initial installation to creating
sophisticated data processing workflows that leverage PostgreSQL connections.

Apache Airflow has become the industry standard for orchestrating complex
computational workflows and data processing pipelines. By the end of this
presentation, you'll understand how to set up, configure, and use Airflow to
transform your data processing capabilities.

Let's begin our journey into the world of automated workflow management
with Airflow.

by Rakhmadina Noviyanti
Introduction to Apache Airflow
Workflow Orchestration
Programmatically author, schedule, and monitor workflows as directed acyclic graphs (DAGs) of tasks

Advanced Scheduling
Define complex scheduling patterns with cron-like syntax for precise execution timing

Rich UI Dashboard
Monitor workflow execution, inspect logs, and troubleshoot issues through an intuitive web interface

Apache Airflow revolutionizes how data teams manage complex computational workflows. As an open-source platform, it excels at orchestrating sophisticated data
pipelines through its DAG-based architecture, where tasks and their dependencies are clearly defined.

The platform offers exceptional flexibility for ETL processes, enabling precise scheduling, retries, and dependency management. Its rich user interface provides
comprehensive visibility into workflow execution, making monitoring and debugging straightforward.
Installing Docker and Initial Setup

Install Docker Desktop
Download and install Docker Desktop appropriate for your operating system (Windows, macOS, or Linux) from the official Docker website.

Create Directory Structure
Open a terminal and run: mkdir -p ./dags ./logs ./plugins ./config to create the necessary folder structure.

Configure Docker Compose
Download the docker-compose.yaml file from the Airflow repository and set the Airflow user ID with: echo -e "AIRFLOW_UID=$(id -u)" > .env

Setting up Airflow with Docker begins with installing Docker Desktop, which provides the containerization platform
needed to run Airflow services in isolated environments. This approach significantly simplifies deployment and ensures
consistency across different systems.

After installation, creating the proper directory structure is crucial for organizing Airflow components. The specific
folders will store your workflow definitions, logs, plugins, and configuration files that Airflow needs to function properly.
Initializing Airflow Environment
Initialize Database
Run docker compose up airflow-init to create initial database schema

Start Airflow Services
Execute docker compose up to launch all Airflow containers

Access Airflow UI
Navigate to http://localhost:8080 in your web browser

Login
Use default credentials: username "airflow", password "airflow"

Once your Docker environment is configured, the next step is initializing the Airflow database. This one-time setup creates the necessary tables and default settings for Airflow to
function properly.

With initialization complete, starting the Airflow services brings up multiple containers working together: the scheduler, webserver, worker, and database. The web interface provides
your primary method of interaction with the system, where you'll monitor workflows and configure connections.
Understanding Airflow Components

Webserver
Provides the user interface for monitoring, triggering, and managing workflows
• Visualizes DAG structure
• Shows execution history
• Provides access to logs

Scheduler
Heart of Airflow that orchestrates workflow execution
• Determines when tasks should run
• Submits tasks to the executor
• Monitors task states

Workers
Execute the actual tasks in the workflow
• Process assigned tasks
• Report execution status
• Handle task retries

Database
Stores metadata about DAGs and executions
• Tracks task states
• Stores variable values
• Maintains connection information

Airflow's architecture consists of several key components working in harmony. The webserver provides a comprehensive user interface for monitoring and managing your
workflows, while the scheduler determines when tasks should run based on their dependencies and schedules.

Workers handle the actual execution of tasks, which may include running Python functions, executing Bash commands, or interacting with external systems. The metadata
database ties everything together by storing information about DAG structures, task states, and execution history.
Creating Your First DAG
DAG Object
Define core DAG parameters and properties

Operators
Create individual tasks using appropriate operators

Dependencies
Set execution order with task dependencies

Creating your first DAG involves defining a Python file that specifies your workflow structure. The DAG object itself requires parameters like a unique ID, schedule interval, and start date. These parameters determine when and how
often your workflow will run.

Tasks are represented by operators - pre-built components for common actions like running Python functions (PythonOperator), executing Bash commands (BashOperator), or running SQL against a database (PostgresOperator). Dependencies between tasks establish the execution order, typically set using the bitshift operators (>> and <<) for readability.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG('my_first_dag',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily') as dag:

    task1 = BashOperator(
        task_id='hello_world',
        bash_command='echo "Hello, Airflow!"'
    )

    task2 = BashOperator(
        task_id='print_date',
        bash_command='date'
    )

    task1 >> task2  # Set dependency: task1 runs before task2


Setting Up PostgreSQL Connection in Airflow
Navigate to Connections
Access Admin > Connections in the Airflow UI

Create New Connection
Click the "Add a new record" button

Configure Connection
Set the connection type to "PostgreSQL" and enter the details

Securely managing database connections is a key Airflow feature. Rather than hardcoding credentials in your DAGs, Airflow provides a centralized
connection management system that enhances security and simplifies maintenance.

When creating a PostgreSQL connection, you'll need to provide a unique Connection ID (which will be referenced in your DAGs), along with the
host, port, database name, username, and password. For Docker installations, remember that database services often have their own container
names that should be used as the host (e.g., "postgres" instead of "localhost").

This connection can then be used in your DAGs by referencing the Connection ID, allowing your workflows to securely interact with the database
without exposing sensitive credentials in your code.
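As a reference, the snippet below is a minimal sketch of a DAG task that uses this connection by its Connection ID. It assumes the apache-airflow-providers-postgres package (bundled with the official Docker image) is available and that the connection was saved as "postgres_conn", matching the ID used later in this guide; the staging_events table is purely illustrative.

# Minimal sketch: referencing the PostgreSQL connection by its Connection ID.
# Assumes the apache-airflow-providers-postgres package is installed and that
# a connection named "postgres_conn" was created in Admin > Connections.
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from datetime import datetime

with DAG('postgres_connection_example',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily') as dag:

    create_table = PostgresOperator(
        task_id='create_staging_table',
        postgres_conn_id='postgres_conn',  # Connection ID configured in the UI
        sql='''
            CREATE TABLE IF NOT EXISTS staging_events (
                id SERIAL PRIMARY KEY,
                payload TEXT,
                loaded_at TIMESTAMP DEFAULT NOW()
            );
        '''
    )

Because the credentials live in the Airflow connection store, the DAG file itself contains nothing more sensitive than the Connection ID.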
Creating a Python Script for Data Cleansing

Script Structure
• Import psycopg2 and other required libraries
• Define a database connection function using environment variables
• Implement the data extraction query
• Apply cleansing operations and transformations

Common Cleansing Operations
• Handle missing values
• Remove duplicates
• Standardize formats (dates, numbers, text)
• Apply data validation rules

Data Loading
• Prepare cleaned data for insertion
• Create target tables if they do not exist
• Implement batch insertion for efficiency
• Handle potential errors and exceptions

Your Python data cleansing script serves as the core of your ETL process. The script should be modular and reusable, handling database
connections, data extraction, transformation, and loading back to PostgreSQL. Using environment variables or Airflow's connection
system keeps credentials secure.

The cleansing phase might include fixing inconsistent formatting, handling missing values, normalizing data, and removing duplicates.
For larger datasets, consider implementing batch processing to avoid memory issues. Finally, the script should include proper error
handling and logging to facilitate troubleshooting when running as part of an automated workflow.
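To make this concrete, here is a minimal sketch of such a script. It reads the same POSTGRES_* environment variables that the BashOperator example later in this guide passes in; the raw_events and clean_events tables and their columns are hypothetical placeholders for your own schema.

# Minimal sketch of a cleansing script driven by environment variables.
# The variable names match those passed by the BashOperator in this guide;
# the table names (raw_events, clean_events) are illustrative only.
import os
import logging
import psycopg2

logging.basicConfig(level=logging.INFO)

def get_connection():
    """Open a PostgreSQL connection using environment variables."""
    return psycopg2.connect(
        host=os.environ['POSTGRES_HOST'],
        port=os.environ['POSTGRES_PORT'],
        dbname=os.environ['POSTGRES_DB'],
        user=os.environ['POSTGRES_USER'],
        password=os.environ['POSTGRES_PASSWORD'],
    )

def cleanse():
    conn = get_connection()
    try:
        with conn, conn.cursor() as cur:
            # Create the target table if it does not exist.
            cur.execute("""
                CREATE TABLE IF NOT EXISTS clean_events (
                    id INTEGER PRIMARY KEY,
                    event_name TEXT NOT NULL,
                    event_date DATE
                );
            """)
            # Extract, deduplicate, standardize, and drop rows with missing
            # values in one SQL pass, then load into the target table.
            cur.execute("""
                INSERT INTO clean_events (id, event_name, event_date)
                SELECT DISTINCT ON (id)
                       id,
                       TRIM(LOWER(event_name)),
                       event_date::date
                FROM raw_events
                WHERE id IS NOT NULL AND event_name IS NOT NULL
                ON CONFLICT (id) DO NOTHING;
            """)
            logging.info("Cleansing complete: %s rows affected", cur.rowcount)
    except Exception:
        logging.exception("Data cleansing failed")
        raise SystemExit(1)  # Non-zero exit code signals failure to the BashOperator
    finally:
        conn.close()

if __name__ == '__main__':
    cleanse()

The non-zero exit code on failure matters: it is what lets the BashOperator in the next step mark the task as failed and trigger retries or alerts.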
Implementing BashOperator to Run Python Scripts
Define BashOperator Task
Create a BashOperator task in your DAG file that specifies the command to run your Python script

Pass Connection Details
Use Airflow's connection mechanism to securely retrieve PostgreSQL credentials

Execute Script
The BashOperator runs the Python script as a shell command during DAG execution

Handle Results
Implement proper exit codes in your script and configure the operator to respond appropriately

The BashOperator provides a straightforward way to execute Python scripts within your Airflow workflow. By using this operator, you can leverage existing Python scripts without needing to rewrite them as native Airflow tasks, making it ideal for
integrating established data processing code.

When configuring the BashOperator, you can pass environment variables containing connection details retrieved from Airflow's connection system. This approach keeps your credentials secure while making them available to your Python script.
Additionally, you can pass command-line arguments to your script for dynamic configuration based on the specific DAG run.

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from datetime import datetime

with DAG('data_cleansing_workflow',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily') as dag:

    # Get connection details from Airflow's connection store
    pg_hook = PostgresHook(postgres_conn_id='postgres_conn')
    conn = pg_hook.get_connection(conn_id='postgres_conn')

    # Run the Python script with connection details as environment variables
    run_cleansing = BashOperator(
        task_id='run_data_cleansing',
        bash_command='python /opt/airflow/dags/scripts/cleanse_data.py',
        env={
            'POSTGRES_HOST': conn.host,
            'POSTGRES_PORT': str(conn.port),  # environment values must be strings
            'POSTGRES_DB': conn.schema,
            'POSTGRES_USER': conn.login,
            'POSTGRES_PASSWORD': conn.password
        }
    )
Best Practices and Troubleshooting

DAG Design
Keep tasks atomic and idempotent for better reliability

Monitoring
Key areas to monitor: task duration, failure rates, resource usage

24/7 Availability
Configure proper retries and alerting for mission-critical workflows

95% Test Coverage
Aim for comprehensive testing of all DAGs and custom operators

Successful Airflow implementations follow several best practices. First, design DAGs to be idempotent, meaning they can be run multiple times without side effects. Use descriptive naming for DAGs and tasks to make monitoring
easier, and implement comprehensive logging within your scripts and operators.

When troubleshooting, check the Airflow logs first, which can be accessed through the web UI. Common issues include connection problems, environment variable misconfigurations, and timing issues with task dependencies. For
production deployments, consider implementing monitoring, alerting, and scaling strategies to ensure reliable operation of your data pipelines.
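As a reference, the sketch below shows one common way to apply retry and alerting defaults to every task in a DAG via default_args. The retry counts and email address are placeholders, and email alerts additionally require SMTP settings in airflow.cfg or the corresponding environment variables.

# Minimal sketch of retry and alerting defaults applied to every task in a DAG.
# The email address and retry values are placeholders only.
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'retries': 3,                          # retry failed tasks automatically
    'retry_delay': timedelta(minutes=5),   # wait between retry attempts
    'email': ['alerts@example.com'],       # placeholder alert recipient
    'email_on_failure': True,              # send an email when a task fails
}

with DAG('resilient_workflow',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:

    cleanse = BashOperator(
        task_id='run_data_cleansing',
        bash_command='python /opt/airflow/dags/scripts/cleanse_data.py',
    )

Because these settings live in default_args, individual tasks can still override them where a different retry policy is needed.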
