Etalab Talk Apache Airflow Embulk
Etalab talk
Antoine Augusti
2018-10-12
Who am I?
Hello, I'm Antoine Augusti.
What is Apache Airflow
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow
scheduler executes your tasks on an array of workers while following the specified
dependencies. The rich user interface makes it easy to visualize pipelines running in
production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as Python code, they become more maintainable, versionable,
testable, and collaborative.
What Apache Airflow is not
Airflow is not:
● A data streaming solution: data does not move from one task to another
● A message queue: it does not replace RabbitMQ or Redis and is not suited to a large
number of very short-running tasks
● Tied to a language. Workflows are defined in Python but Airflow is language- and
technology-agnostic
● Designed for a small number of long and complex tasks. You should embrace the power
of DAGs made of small, reproducible tasks
Apache Airflow terminology
Main concepts:
● DAG: Directed Acyclic Graph. A set of tasks, their dependencies and how often they should run
● Task: an individual unit of work within a DAG. Tasks have dependencies between them.
● Operators: operators define how a task's work gets done. Examples: run a Bash command,
execute a SQL script, send an email, poll an API
● DAG run: an executed instance of a DAG. When a DAG is triggered, Airflow
orchestrates the execution of operators while respecting dependencies and allocated
resources
● Task instance: an executed operator inside a DAG run.
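As a rough sketch of how this terminology maps to code (assuming Airflow 1.x-style imports; the DAG and task names are made up for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG: a set of tasks, their dependencies and a schedule
dag = DAG("terminology_demo", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# Instantiating an operator inside a DAG creates a task
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# A dependency between tasks: "load" runs after "extract"
extract >> load

# Each scheduled execution of this DAG is a DAG run; each execution of
# "extract" or "load" inside a DAG run is a task instance.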
Airflow web UI
DAGs overview
Sample DAG
Tree view
Displays multiple executions over time
Task duration
Time spent on individual tasks over DAG runs
Defining DAGs
Sample DAG
Taken from Airflow's tutorial.
https://gist.github.com/AntoineAugusti/a48bd5ea414258d407a10c874ec9b70f
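For reference, a condensed sketch in the spirit of the Airflow 1.x tutorial DAG (the gist above contains the actual example; the default arguments and dates below are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 10, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# The DAG: an identifier, default task arguments and a schedule
dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(days=1))

# Tasks are operators instantiated inside the DAG
t1 = BashOperator(task_id="print_date", bash_command="date", dag=dag)
t2 = BashOperator(task_id="sleep", bash_command="sleep 5", retries=3, dag=dag)

# Bash commands are Jinja-templated: Airflow renders them at execution time
t3 = BashOperator(
    task_id="templated",
    bash_command="echo '{{ ds }}' && echo '{{ macros.ds_add(ds, 7) }}'",
    dag=dag,
)

# Dependencies: t2 and t3 both run after t1
t1 >> t2
t1 >> t3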
DAG files
● DAGs are just configuration files and define the DAG structure as code
● DAGs don't do any data processing themselves: the processing only happens when tasks
are actually executed
● Tasks defined here will run in different contexts: on different workers, at different points
in time
● Tasks don't communicate between each other
● DAG definition files should execute very quickly (hundreds of milliseconds) because
they will be evaluated often by the Airflow scheduler
● DAGs are defined in Python and you should take advantage of it: custom operators,
tests, modules, factories etc.
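For instance, a factory is just a Python function that builds similar DAGs for several sources; a minimal sketch (the source names and the shell command are invented for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def make_export_dag(source):
    """Build one export DAG per source, all sharing the same structure."""
    dag = DAG(
        "export_{}".format(source),
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    BashOperator(
        task_id="export",
        bash_command="python export.py --source {}".format(source),
        dag=dag,
    )
    return dag


# Airflow discovers DAGs in the module's global namespace
for source in ["users", "orders", "invoices"]:
    globals()["export_{}".format(source)] = make_export_dag(source)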
Airflow processes
Airflow components: the web server (the UI), the scheduler that triggers DAG runs, the
workers that execute task instances and the metadata database that stores their state.
What is Embulk
Embulk can automatically guess file formats, distribute execution to deal with large datasets,
offers transactions, can resume stuck tasks and is modular thanks to plugins.
Embulk is written in JRuby and the configuration is specified in YAML. You then execute
Embulk configuration files through the command line.
Architecture
Components
● Input: specify where the data is coming from (MySQL, AWS S3, Jira, Mixpanel etc.)
● Output: specify the destination of the data (BigQuery, Vertica, Redshift, CSV etc.)
● File parser: to parse specific input files (JSON, Excel, Avro, XML etc.)
● File decoder: to deal with compressed files
● File formatter: to format specific output files (similar to parsers)
● Filter: to keep only some rows from the input
● File encoder: to compress output files (similar to decoders)
● Executor: where Embulk tasks are executed (locally or on Hadoop)
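To make the mapping concrete, here is a skeleton of how these components appear in a configuration file, mostly using the plugins bundled with Embulk (file input/output, CSV parser and formatter, gzip decoder and encoder, local executor); the row filter assumes the third-party embulk-filter-row plugin, and paths and columns are placeholders:

in:                        # Input: where the data comes from
  type: file
  path_prefix: /data/in/events_
  decoders:                # File decoder: handle compressed input files
    - {type: gzip}
  parser:                  # File parser: how to parse the input files
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: created_at, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
filters:                   # Filter: keep only some rows (third-party plugin)
  - type: row
    condition: AND
    conditions:
      - {column: id, operator: ">", argument: 0}
out:                       # Output: destination of the data
  type: file
  path_prefix: /data/out/events_
  file_ext: csv.gz
  formatter: {type: csv}   # File formatter: how to format output files
  encoders:                # File encoder: compress output files
    - {type: gzip}
exec:                      # Executor: where Embulk tasks run
  type: local

Such a file is then executed with embulk run config.yml.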
Embulk example: from MySQL to Redshift
Example: from MySQL to Redshift
● Incremental loading: load records inserted (or updated) after the latest execution
● Merging: insert new records or update existing ones according to the values of the
updated_at and id columns
● Templating: configurations for MySQL and Redshift are defined elsewhere
Example: from MySQL to Redshift
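A sketch of what such a configuration could look like, assuming the embulk-input-mysql and embulk-output-redshift plugins and Embulk's Liquid templating (hosts, credentials and table names are placeholders, and the S3 staging options required by the Redshift output are omitted):

# config.yml.liquid, run with: embulk run config.yml.liquid
in:
  type: mysql
  host: {{ env.MYSQL_HOST }}            # Templating: MySQL settings defined elsewhere
  user: {{ env.MYSQL_USER }}
  password: {{ env.MYSQL_PASSWORD }}
  database: app
  table: users
  incremental: true                     # Incremental loading: only records newer than the last run
  incremental_columns: [updated_at, id]
out:
  type: redshift
  host: {{ env.REDSHIFT_HOST }}         # Templating: Redshift settings defined elsewhere
  user: {{ env.REDSHIFT_USER }}
  password: {{ env.REDSHIFT_PASSWORD }}
  database: warehouse
  table: users
  mode: merge                           # Merging: insert new records, update existing ones
  merge_keys: [id]
  # (S3 staging options required by embulk-output-redshift omitted here)

Shared blocks could also be pulled in with Liquid {% include %} partials instead of environment variables.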
Embulk example: from CSV file to Redshift
Example: from CSV file to Redshift
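Similarly, a sketch of the input side for a CSV file, assuming the bundled file input, gzip decoder and CSV parser (paths and columns are placeholders; the output would be the same Redshift configuration as in the previous example):

in:
  type: file
  path_prefix: /data/exports/users_     # matches /data/exports/users_*.csv.gz
  decoders:
    - {type: gzip}
  parser:
    type: csv
    delimiter: ","
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: email, type: string}
      - {name: created_at, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
out:
  type: redshift                        # same Redshift output as above, with mode: insert
  table: users
  mode: insert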