The document introduces Embulk, an open-source parallel bulk data loader built around plugins. Embulk loads records from various sources (“A”) to various targets (“B”), with plugins supplying support for each kind of source and target, which turns the painful work of data integration into something more relaxed. Embulk executes loads in parallel, validates data, handles errors, behaves deterministically, and allows bulk loads to be retried idempotently.
Embulk, an open-source plugin-based parallel bulk data loader
1. Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
Embulk
An open-source plugin-based parallel bulk data loader
that makes painful data integration work relaxed.
Sharing our knowledge on RubyGems to manage arbitrary files.
2. A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A plugin-based parallel bulk data loader
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value store
3. Today’s talk
> What’s Embulk?
> How Embulk works
> The architecture
> Writing Embulk plugins
> Roadmap & Development
> Q&A + Discussion
4. What’s Embulk?
> An open-source parallel bulk data loader
> using plugins
> to make data integration relaxed.
5. What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration relaxed.
> which was very painful…
(“A” and “B”: storage, RDBMS, NoSQL, cloud services, etc.)
(the pains: broken records, transactions (idempotency), performance, …)
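To make “A to B” concrete, here is a minimal sketch of an Embulk configuration file (YAML), assuming the bundled file input with the CSV parser and the embulk-output-postgresql plugin; paths, columns, and connection details are illustrative, not taken from the talk:

in:
  type: file
  path_prefix: ./csv/sample_        # reads ./csv/sample_01.csv, sample_02.csv, …
  parser:
    type: csv
    charset: UTF-8
    columns:
      - {name: id, type: long}
      - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: account, type: string}
out:
  type: postgresql                  # provided by the embulk-output-postgresql gem
  host: localhost
  user: pg
  database: mydb
  table: sample

Starting from a seed config with little more than the input path, "embulk guess" fills in parser details such as columns and charset, "embulk preview" shows how records will be parsed, and "embulk run" executes the load.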
6. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
• Convert “20150127T190500Z” → “2015-01-27 19:05:00 UTC”
• Convert “\N” → “” (empty string)
• many more cleanups…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
7. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequences to U+FFFD
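The ad-hoc script that grows out of these steps tends to look something like the following Ruby sketch; the file name and rules are illustrative, reconstructed from the conversions named above, not code from the talk:

require "csv"

# Hypothetical one-off cleaning script; each rule below was typically
# discovered only after another failed load attempt. Assumes no embedded
# newlines inside CSV fields, since it parses line by line.
File.open("input.csv", "rb") do |f|
  f.each_line do |line|
    # Step 8: replace invalid UTF-8 byte sequences with U+FFFD.
    line = line.force_encoding("UTF-8").scrub("\uFFFD")
    row = CSV.parse_line(line) or next
    row = row.map do |v|
      next "" if v.nil? || v == "\\N"   # step 2: NULL marker -> empty string
      v = "Infinity" if v == "Inf"      # step 3: PostgreSQL spells it out
      # step 2: "20150127T190500Z" -> "2015-01-27 19:05:00 UTC"
      if v =~ /\A(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z\z/
        v = "#{$1}-#{$2}-#{$3} #{$4}:#{$5}:#{$6} UTC"
      end
      v
    end
    puts row.to_csv
  end
end

Every one of these rules is exactly the kind of effort Embulk wants captured in a reusable plugin instead of a throwaway script.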
8. The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of these scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for every kind of storage:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
9. The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without loading the same data twice? (a checkpointing sketch follows this list)
> Performance optimization
> How to optimize the code or parallelize?
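For the idempotent-retrying problem in particular, the usual trick is to checkpoint which input tasks have already committed so that a retry skips them; the sketch below illustrates the idea in Ruby (it is not Embulk's actual resume mechanism, and all names are hypothetical):

require "json"

STATE_FILE = "resume_state.json"   # hypothetical checkpoint file
state = File.exist?(STATE_FILE) ? JSON.parse(File.read(STATE_FILE)) : {}

Dir.glob("data/*.csv").sort.each do |task|
  next if state[task] == "committed"   # already loaded; safe to skip on retry
  puts "loading #{task}"               # stand-in for the real load, ideally
                                       # done inside a database transaction
  state[task] = "committed"
  File.write(STATE_FILE, JSON.generate(state))   # checkpoint after each task
end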
10. The problems at Treasure Data
Treasure Data Service?
> “Fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or
special skills required.”
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
11. A solution:
> Package the efforts as a plugin (see the skeleton after this list).
> data cleaning, error handling, retrying
> Share & reuse the plugin.
> don’t repeat the pains!
> Keep improving the plugin code.
> rather than throwing away the efforts every time
> using OSS-style pull-reqs & frequent releases.
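For a feel of what “package the efforts as a plugin” means, here is a rough Ruby filter-plugin skeleton in the shape of Embulk's JRuby plugin API; the layout follows the standard plugin template, but treat the exact names and signatures as assumptions rather than a verified reference:

module Embulk
  module Filter
    # Hypothetical cleaning filter, registered under the name "clean".
    class Clean < FilterPlugin
      Plugin.register_filter("clean", self)

      # Declares the task and output schema; this one passes the input
      # schema through unchanged.
      def self.transaction(config, in_schema, &control)
        task = {}
        yield(task, in_schema)
      end

      # Called for each page (a chunk of records); pages are processed
      # in parallel.
      def add(page)
        page.each do |record|
          cleaned = record.map { |v| v == "\\N" ? "" : v }  # example rule
          page_builder.add(cleaned)
        end
      end

      def finish
        page_builder.finish
      end
    end
  end
end

Released as a gem (say, embulk-filter-clean, a hypothetical name), the same cleaning logic can be installed with "embulk gem install", improved through pull requests, and reused instead of being rewritten for every job.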
12. Embulk
Embulk is an open-source, plugin-based
parallel bulk data loader
that makes data integration work relaxed.
33. Roadmap
> Add missing JRuby Plugin APIs
> ParserPlugin, FormatterPlugin
> DecoderPlugin, EncoderPlugin
> Add Executor plugin SPI
> Add ssh distributed executor
> embulk run --command ssh %host embulk run %task
> Add MapReduce executor
> Add support for nested records (?)
34. Contributing to the Embulk project
> Pull-requests & issues on Github
> Posting blogs
> “I tried Embulk. Here is how it worked”
> “I read Embulk code. Here is how it’s written”
> “Embulk is good because…but bad because…”
> Talking on Twitter with the word “embulk”
> Writing & releasing plugins
> Windows support
> Integration to other software
> ETL tools, Fluentd, Hadoop, Presto, …