Etalab Talk Apache Airflow Embulk
Etalab talk
Antoine Augusti
2018-10-12
Who am I?
Hello, I'm Antoine Augusti.
What is Apache Airflow
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow
scheduler executes your tasks on an array of workers while following the specified
dependencies. The rich user interface makes it easy to visualize pipelines running in
production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as Python code, they become more maintainable, versionable,
testable, and collaborative.
What Apache Airflow is not
Airflow is not:
● A data streaming solution: data does not move from one task to another
● A message queue: it does not replace RabbitMQ or Redis and is not suited to a large
number of very short-running tasks
● Tied to a language. Workflows are defined in Python but Airflow is language- and
technology-agnostic
● Designed for a small number of long and complex tasks. You should embrace the power
of DAGs made of small, reproducible tasks
Apache Airflow terminology
Main concepts:
● DAG: Directed Acyclic Graph. A set of tasks, their dependencies and how often they should run
● Task: an individual unit of work within a DAG. Tasks have dependencies between them.
● Operators: operators define how a task's work gets done. Examples: run a Bash command,
execute a SQL script, send an email, poll an API
● DAG run: an executed instance of a DAG. When a DAG is triggered, Airflow
orchestrates the execution of operators while respecting dependencies and allocated
resources
● Task instance: an executed operator inside a DAG run.
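As a rough sketch of how this terminology maps to code (assuming Airflow 1.x-style imports; the DAG and task names are made up for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG: a set of tasks, their dependencies and a schedule
dag = DAG("terminology_demo", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# Instantiating an operator inside a DAG creates a task
extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# A dependency between tasks: "load" runs after "extract"
extract >> load

# Each scheduled execution of this DAG is a DAG run; each execution of
# "extract" or "load" inside a DAG run is a task instance.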
Airflow web UI
DAGs overview
Sample DAG
Tree view
Displays multiple executions over time
Task duration
Time spent on individual tasks over DAG runs
Defining DAGs
Sample DAG
Taken from Airflow's tutorial.
https://gist.github.com/AntoineAugusti/a48bd5ea414258d407a10c874ec9b70f
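For reference, a condensed sketch in the spirit of the Airflow 1.x tutorial DAG (the gist above contains the actual example; the default arguments and dates below are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 10, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# The DAG: an identifier, default task arguments and a schedule
dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(days=1))

# Tasks are operators instantiated inside the DAG
t1 = BashOperator(task_id="print_date", bash_command="date", dag=dag)
t2 = BashOperator(task_id="sleep", bash_command="sleep 5", retries=3, dag=dag)

# Bash commands are Jinja-templated: Airflow renders them at execution time
t3 = BashOperator(
    task_id="templated",
    bash_command="echo '{{ ds }}' && echo '{{ macros.ds_add(ds, 7) }}'",
    dag=dag,
)

# Dependencies: t2 and t3 both run after t1
t1 >> t2
t1 >> t3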
DAG files
● DAGs are just configuration files and define the DAG structure as code
● DAGs don't do any data processing themselves: the processing only happens when tasks
are actually executed
● Tasks defined here will run in different contexts: on different workers, at different points
in time
● Tasks don't communicate between each other
● DAG definition files should execute very quickly (hundreds of milliseconds) because
they will be evaluated often by the Airflow scheduler
● DAGs are defined in Python and you should take advantage of it: custom operators,
tests, modules, factories etc.
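For instance, a factory is just a Python function that builds similar DAGs for several sources; a minimal sketch (the source names and the shell command are invented for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def make_export_dag(source):
    """Build one export DAG per source, all sharing the same structure."""
    dag = DAG(
        "export_{}".format(source),
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    BashOperator(
        task_id="export",
        bash_command="python export.py --source {}".format(source),
        dag=dag,
    )
    return dag


# Airflow discovers DAGs in the module's global namespace
for source in ["users", "orders", "invoices"]:
    globals()["export_{}".format(source)] = make_export_dag(source)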
Airflow processes
Airflow components: the web server (the UI), the scheduler that triggers DAG runs, the
workers that execute task instances and the metadata database that stores their state.
What is Embulk
Embulk can automatically guess file formats, distribute execution to deal with large datasets,
offers transactions, can resume stuck tasks and is modular thanks to plugins.
Embulk is written in JRuby and the configuration is specified in YAML. You then execute
Embulk configuration files through the command line.
Architecture
Components
● Input: specify where the data is coming from (MySQL, AWS S3, Jira, Mixpanel etc.)
● Output: specify the destination of the data (BigQuery, Vertica, Redshift, CSV etc.)
● File parser: to parse specific input files (JSON, Excel, Avro, XML etc.)
● File decoder: to deal with compressed files
● File formatter: to format specific output files (similar to parsers)
● Filter: to keep only some rows from the input
● File encoder: to compress output files (similar to decoders)
● Executor: where Embulk tasks are executed (locally or on Hadoop)
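To make the mapping concrete, here is a skeleton of how these components appear in a configuration file, mostly using the plugins bundled with Embulk (file input/output, CSV parser and formatter, gzip decoder and encoder, local executor); the row filter assumes the third-party embulk-filter-row plugin, and paths and columns are placeholders:

in:                        # Input: where the data comes from
  type: file
  path_prefix: /data/in/events_
  decoders:                # File decoder: handle compressed input files
    - {type: gzip}
  parser:                  # File parser: how to parse the input files
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: created_at, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
filters:                   # Filter: keep only some rows (third-party plugin)
  - type: row
    condition: AND
    conditions:
      - {column: id, operator: ">", argument: 0}
out:                       # Output: destination of the data
  type: file
  path_prefix: /data/out/events_
  file_ext: csv.gz
  formatter: {type: csv}   # File formatter: how to format output files
  encoders:                # File encoder: compress output files
    - {type: gzip}
exec:                      # Executor: where Embulk tasks run
  type: local

Such a file is then executed with embulk run config.yml.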
Embulk example: from MySQL to Redshift
Example: from MySQL to Redshift
● Incremental loading: load records inserted (or updated) after the latest execution
● Merging: insert new records or update existing ones according to the values of the
updated_at and id columns
● Templating: configurations for MySQL and Redshift are defined elsewhere
Example: from MySQL to Redshift
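A sketch of what such a configuration could look like, assuming the embulk-input-mysql and embulk-output-redshift plugins and Embulk's Liquid templating (hosts, credentials and table names are placeholders, and the S3 staging options required by the Redshift output are omitted):

# config.yml.liquid, run with: embulk run config.yml.liquid
in:
  type: mysql
  host: {{ env.MYSQL_HOST }}            # Templating: MySQL settings defined elsewhere
  user: {{ env.MYSQL_USER }}
  password: {{ env.MYSQL_PASSWORD }}
  database: app
  table: users
  incremental: true                     # Incremental loading: only records newer than the last run
  incremental_columns: [updated_at, id]
out:
  type: redshift
  host: {{ env.REDSHIFT_HOST }}         # Templating: Redshift settings defined elsewhere
  user: {{ env.REDSHIFT_USER }}
  password: {{ env.REDSHIFT_PASSWORD }}
  database: warehouse
  table: users
  mode: merge                           # Merging: insert new records, update existing ones
  merge_keys: [id]
  # (S3 staging options required by embulk-output-redshift omitted here)

Shared blocks could also be pulled in with Liquid {% include %} partials instead of environment variables.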
Embulk example: from CSV file to Redshift
Example: from CSV file to Redshift
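Similarly, a sketch of the input side for a CSV file, assuming the bundled file input, gzip decoder and CSV parser (paths and columns are placeholders; the output would be the same Redshift configuration as in the previous example):

in:
  type: file
  path_prefix: /data/exports/users_     # matches /data/exports/users_*.csv.gz
  decoders:
    - {type: gzip}
  parser:
    type: csv
    delimiter: ","
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: email, type: string}
      - {name: created_at, type: timestamp, format: "%Y-%m-%d %H:%M:%S"}
out:
  type: redshift                        # same Redshift output as above, with mode: insert
  table: users
  mode: insert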