Setting Up Airflow With Docker From Installation To Data Processing
Apache Airflow has become the industry standard for orchestrating complex
computational workflows and data processing pipelines. By the end of this
presentation, you'll understand how to set up, configure, and use Airflow to
transform your data processing capabilities.
Let's begin our journey into the world of automated workflow management
with Airflow.
by Rakhmadina Noviyanti
Introduction to Apache Airflow
Workflow Orchestration
Programmatically author, schedule, and monitor workflows as directed acyclic graphs (DAGs) of tasks
Advanced Scheduling
Define complex scheduling patterns with cron-like syntax for precise execution timing
Rich UI Dashboard
Monitor workflow execution, inspect logs, and troubleshoot issues through an intuitive web interface
Apache Airflow revolutionizes how data teams manage complex computational workflows. As an open-source platform, it excels at orchestrating sophisticated data
pipelines through its DAG-based architecture, where tasks and their dependencies are clearly defined.
The platform offers exceptional flexibility for ETL processes, enabling precise scheduling, retries, and dependency management. Its rich user interface provides
comprehensive visibility into workflow execution, making monitoring and debugging straightforward.
Installing Docker and Initial Setup
Setting up Airflow with Docker begins with installing Docker Desktop, which provides the containerization platform
needed to run Airflow services in isolated environments. This approach significantly simplifies deployment and ensures
consistency across different systems.
After installation, creating the proper directory structure is crucial for organizing Airflow components. The dags, logs, plugins, and config folders store your workflow definitions, execution logs, custom plugins, and the configuration files that Airflow needs to function properly.
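One hedged way to create these folders from Python, assuming the default docker-compose layout with dags, logs, plugins, and config mounts, is a short pathlib script:

from pathlib import Path

# Create the folders that the default docker-compose file mounts into the
# Airflow containers; adjust the names if your compose file uses different mounts.
for folder in ("dags", "logs", "plugins", "config"):
    Path(folder).mkdir(parents=True, exist_ok=True)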
Initializing Airflow Environment
Initialize Database
Run docker compose up airflow-init to create initial database schema
Access Airflow UI
Navigate to http://localhost:8080 in your web browser
Login
Use default credentials: username "airflow", password "airflow"
Once your Docker environment is configured, the next step is initializing the Airflow database. This one-time setup creates the necessary tables and default settings for Airflow to
function properly.
With initialization complete, starting the Airflow services brings up multiple containers working together: the scheduler, webserver, worker, and database. The web interface provides
your primary method of interaction with the system, where you'll monitor workflows and configure connections.
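As an optional sanity check (an assumption on my part, not one of the standard setup steps), the webserver's /health endpoint can be polled from Python to confirm the containers are responding:

import requests

# The webserver's unauthenticated /health endpoint reports the status of the
# metadatabase and scheduler once the containers are up.
response = requests.get("http://localhost:8080/health", timeout=10)
print(response.status_code, response.json())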
Understanding Airflow Components
Scheduler
Heart of Airflow that orchestrates when tasks run, based on their schedules and dependencies
Workers
Execute the individual tasks dispatched by the scheduler
Airflow's architecture consists of several key components working in harmony. The webserver provides a comprehensive user interface for monitoring and managing your
workflows, while the scheduler determines when tasks should run based on their dependencies and schedules.
Workers handle the actual execution of tasks, which may include running Python functions, executing Bash commands, or interacting with external systems. The metadata
database ties everything together by storing information about DAG structures, task states, and execution history.
Creating Your First DAG
DAG Object
Define core DAG parameters and properties
Operators
Create individual tasks using appropriate operators
Dependencies
Set execution order with task dependencies
Creating your first DAG involves defining a Python file that specifies your workflow structure. The DAG object itself requires parameters like a unique ID, schedule interval, and start date. These parameters determine when and how
often your workflow will run.
Tasks are represented by operators, pre-built components for common actions like running Python functions (PythonOperator), executing Bash commands (BashOperator), or running SQL against a database (PostgresOperator). Dependencies
between tasks establish the execution order, typically set using the bitshift operators (>> and <<) for readability.
from airflow.operators.bash import BashOperator

# Run the Unix `date` command to print the current date
task2 = BashOperator(
    task_id='print_date',
    bash_command='date'
)
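For context, the snippet above would sit inside a complete DAG file along the lines of the following sketch; the DAG ID example_dag, the print_hello callable, and the daily schedule are illustrative placeholders rather than values from this presentation.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_hello():
    # Simple callable executed by the PythonOperator
    print("Hello from Airflow!")


with DAG(
    dag_id="example_dag",             # unique identifier shown in the UI
    start_date=datetime(2024, 1, 1),  # first date the DAG is eligible to run
    schedule_interval="@daily",       # cron-like schedule preset
    catchup=False,                    # do not backfill missed runs
) as dag:
    task1 = PythonOperator(
        task_id="print_hello",
        python_callable=print_hello,
    )

    task2 = BashOperator(
        task_id="print_date",
        bash_command="date",
    )

    # Bitshift operator sets the execution order: task1 runs before task2
    task1 >> task2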
Configuring a PostgreSQL Connection
Configure Connection
Set connection type to "PostgreSQL" and enter the connection details
Securely managing database connections is a key Airflow feature. Rather than hardcoding credentials in your DAGs, Airflow provides a centralized
connection management system that enhances security and simplifies maintenance.
When creating a PostgreSQL connection, you'll need to provide a unique Connection ID (which will be referenced in your DAGs), along with the
host, port, database name, username, and password. For Docker installations, remember that database services often have their own container
names that should be used as the host (e.g., "postgres" instead of "localhost").
This connection can then be used in your DAGs by referencing the Connection ID, allowing your workflows to securely interact with the database
without exposing sensitive credentials in your code.
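As a hedged illustration of that pattern, a task can look up the stored connection through a hook at runtime; the Connection ID my_postgres and the table name below are placeholders, and the example assumes the Postgres provider package is installed.

from airflow.providers.postgres.hooks.postgres import PostgresHook


def count_rows():
    # Resolve host, credentials, and database from the stored connection at runtime
    hook = PostgresHook(postgres_conn_id="my_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM my_table")
    print(f"Row count: {records[0][0]}")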
Creating a Python Script for Data Cleansing
Script Structure
• Import psycopg2 and other required libraries
• Define database connection function using environment variables
• Implement data extraction query
• Apply cleansing operations and transformations
Common Cleansing Operations
• Handle missing values
• Remove duplicates
• Standardize formats (dates, numbers, text)
• Apply data validation rules
Data Loading
• Prepare cleaned data for insertion
• Create target tables if not exist
• Implement batch insertion for efficiency
• Handle potential errors and exceptions
Your Python data cleansing script serves as the core of your ETL process. The script should be modular and reusable, handling database
connections, data extraction, transformation, and loading back to PostgreSQL. Using environment variables or Airflow's connection
system keeps credentials secure.
The cleansing phase might include fixing inconsistent formatting, handling missing values, normalizing data, and removing duplicates.
For larger datasets, consider implementing batch processing to avoid memory issues. Finally, the script should include proper error
handling and logging to facilitate troubleshooting when running as part of an automated workflow.
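The sketch below shows one possible shape for such a script. It is an assumption-laden example: the environment variable names (DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD) and the raw_customers and clean_customers tables are illustrative, not part of the presentation.

import os

import psycopg2
from psycopg2.extras import execute_values


def get_connection():
    # Credentials come from environment variables instead of being hardcoded
    return psycopg2.connect(
        host=os.environ["DB_HOST"],            # e.g. "postgres" inside Docker
        port=os.environ.get("DB_PORT", "5432"),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )


def cleanse_rows(rows):
    # Drop rows with missing values, remove exact duplicates, normalize text columns
    seen = set()
    cleaned = []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        cleaned.append(tuple(
            col.strip().lower() if isinstance(col, str) else col for col in row
        ))
    return cleaned


def main():
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT id, name, email FROM raw_customers")
        cleaned = cleanse_rows(cur.fetchall())

        cur.execute(
            """CREATE TABLE IF NOT EXISTS clean_customers (
                   id INTEGER PRIMARY KEY, name TEXT, email TEXT)"""
        )
        # Batch insertion for efficiency; rows that already exist are left untouched
        execute_values(
            cur,
            "INSERT INTO clean_customers (id, name, email) "
            "VALUES %s ON CONFLICT (id) DO NOTHING",
            cleaned,
        )


if __name__ == "__main__":
    main()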
Implementing BashOperator to Run Python Scripts
Define BashOperator Task
Create a BashOperator task in your DAG file that specifies the command to run your Python script
Pass Connection Details
Use Airflow's connection mechanism to securely retrieve PostgreSQL credentials
The BashOperator provides a straightforward way to execute Python scripts within your Airflow workflow. By using this operator, you can leverage existing Python scripts without needing to rewrite them as native Airflow tasks, making it ideal for
integrating established data processing code.
When configuring the BashOperator, you can pass environment variables containing connection details retrieved from Airflow's connection system. This approach keeps your credentials secure while making them available to your Python script.
Additionally, you can pass command-line arguments to your script for dynamic configuration based on the specific DAG run.
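A hedged sketch of that configuration follows; the Connection ID my_postgres and the script path /opt/airflow/scripts/cleanse.py are assumptions, and the env values rely on Airflow's {{ conn.<conn_id> }} template variables (available in Airflow 2.2 and later) so credentials are resolved only at runtime.

from airflow.operators.bash import BashOperator

# Defined inside a DAG context (omitted here for brevity). env is a templated
# field, so the connection attributes are rendered when the task runs.
run_cleansing = BashOperator(
    task_id="run_data_cleansing",
    bash_command="python /opt/airflow/scripts/cleanse.py",
    env={
        "DB_HOST": "{{ conn.my_postgres.host }}",
        "DB_PORT": "{{ conn.my_postgres.port }}",
        "DB_NAME": "{{ conn.my_postgres.schema }}",
        "DB_USER": "{{ conn.my_postgres.login }}",
        "DB_PASSWORD": "{{ conn.my_postgres.password }}",
    },
)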
Best Practices and Troubleshooting
DAG Design
Keep tasks atomic and idempotent for better reliability
Monitoring
Key areas to monitor: task duration, failure rates, resource usage
24/7 Availability
Configure proper retries and alerting for mission-critical workflows
95% Test Coverage
Aim for comprehensive testing of all DAGs and custom operators
Successful Airflow implementations follow several best practices. First, design DAGs to be idempotent, meaning repeated runs produce the same result rather than duplicating work. Use descriptive naming for DAGs and tasks to make monitoring
easier, and implement comprehensive logging within your scripts and operators.
When troubleshooting, check the Airflow logs first, which can be accessed through the web UI. Common issues include connection problems, environment variable misconfigurations, and timing issues with task dependencies. For
production deployments, consider implementing monitoring, alerting, and scaling strategies to ensure reliable operation of your data pipelines.
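For instance, retries and failure alerting can be set once through default_args and applied to every task in a DAG; the values and the alert address below are illustrative only, and email alerts additionally require SMTP to be configured.

from datetime import timedelta

# Passed to the DAG as DAG(default_args=default_args, ...)
default_args = {
    "retries": 3,                         # retry failed tasks up to three times
    "retry_delay": timedelta(minutes=5),  # wait five minutes between attempts
    "email_on_failure": True,             # send an alert email when a task fails
    "email": ["alerts@example.com"],      # illustrative alert address
}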