Large Scale ETL with Hadoop
ETL is like “REST” or “Disaster Recovery”
Everyone defines it differently (and loves to fight
about it)
It’s more of a problem/solution space than a thing
Hard to generalize without being lossy in some
way
Worst, it’s trivial at face value, complicated in
practice
So why is ETL hard?
It’s not because ƒ(A) → B is hard (anymore)
Data integration
Organization and management
Process orchestration and scheduling
Accessibility
How it all fits together
Hadoop is two components
HDFS – Massive, redundant data storage
MapReduce – Batch-oriented data processing at
scale
The ecosystem brings additional functionality
Higher level languages and abstractions on MapReduce (Hive, Pig, Cascading, ...)
File, relational, and streaming data integration (Flume, Sqoop, WebHDFS, ...)
Process orchestration and scheduling (Oozie, Azkaban, ...)
Libraries for parsing and text extraction (Tika, ?, ...)
...and now low latency query with Impala
To truly scale ETL, separate infrastructure from
processes, and make it a macro-level service
(composed of other services).
The services of ETL
Process Repository
Metadata Repository
Scheduling
Process Orchestration
Integration Adapters or Channels
Service and Process Instrumentation and
Collection
What do we have today?
HDFS and MapReduce – The core
Flume – Streaming event data integration
Sqoop – Batch exchange of relational database
tables
Oozie – Process orchestration and basic
scheduling
Impala – Fast analysis of data quality
MapReduce is the assembly language of data
processing
“Simple things are hard, but hard things are
possible”
Comparatively low level
Java knowledge required
Use higher level tools where possible
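To make the "assembly language" comparison concrete, here is a toy, single-process Python imitation of the map, shuffle, and reduce phases for a simple count of events per type. It is only a sketch of the programming model, not Hadoop's Java API; the function names and the `type,payload` record layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Emit (event_type, 1) for one 'type,payload' input line."""
    event_type = record.split(",", 1)[0]
    yield event_type, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Wire the three phases together for an in-memory input."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Even in this toy form, a simple count needs three cooperating pieces. In a higher level tool such as Hive, the same result is roughly a single GROUP BY query.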
Data organization in HDFS
Standard file system tricks to make operations
atomic
Use a well-defined structure that supports tooling
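One common form of the "standard file system tricks" mentioned above is to write into a temporary file and then rename it into place, so readers never see a half-written file. The sketch below shows the idea on a local POSIX file system using only Python's standard library. On HDFS the same pattern would go through the HDFS client's rename, which is not shown here. The function name `publish_atomically` and the `.tmp-` prefix are assumptions for illustration.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data to a hidden temp file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the destination directory so the rename
    # stays within a single file system.
    fd, tmp_path = tempfile.mkstemp(prefix=".tmp-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic on POSIX: readers see either the previous
        # state or the complete new file, never a partial write.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Tools that list a dataset directory can then skip names starting with a dot (or whatever prefix is chosen) and treat everything else as complete.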
Data organization in HDFS – Hierarchy
/intent
  /category
    /application (optional)
      /dataset
        /partitions
          /files
Examples:
/data/fraud/txs/2012-01-01/20120101-00.avro
/data/fraud/txs/2012-01-01/20120101-01.avro
/group/research/model-17/training-txs/part-00000.avro
/group/research/model-17/training-txs/part-00001.avro
/user/esammer/scratch/foo/
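A structure like this supports tooling best when paths are built by code rather than by hand. The helper below is a minimal sketch of that idea. The name `dataset_path`, its parameters, and the way the example paths are split into levels are one plausible reading of the hierarchy above, not something the slides define.

```python
from posixpath import join
from typing import Optional, Sequence


def dataset_path(intent: str, category: str, dataset: str,
                 partitions: Sequence[str] = (),
                 application: Optional[str] = None,
                 filename: Optional[str] = None) -> str:
    """Build /intent/category[/application]/dataset[/partitions...][/file]."""
    parts = [intent, category]
    if application:
        parts.append(application)
    parts.append(dataset)
    parts.extend(partitions)
    if filename:
        parts.append(filename)
    # Reject empty segments and embedded slashes so every level stays distinct.
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return join("/", *parts)


# Assumed reading of two of the example paths above.
print(dataset_path("data", "fraud", "txs", ["2012-01-01"],
                   filename="20120101-00.avro"))
print(dataset_path("group", "research", "training-txs",
                   application="model-17", filename="part-00000.avro"))
```

Keeping the levels as named arguments means a job can ask for "the fraud txs dataset for a given day" without knowing where it lives on disk.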
A view of data integration
[Diagram: event and relational data flowing into HDFS]
Event structure: headers: { app: 1234, type: 321, ts: <epoch> }, body: <bytes>
Event sources: Syslog, Application, Clickstream, Point of Sale
Relational source: Web App Database
Flume Agent
Flume (Channel 1) → /data/ops/syslog/2012-01-01/
Flume (Channel 2) → /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/
Flume (Channel 3) → /data/pos/US/NY/17/2012-01-01/, /data/pos/US/CA/42/2012-01-01/
Sqoop (Job 1) → /data/wdb/<database>/<table>/
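The event structure in the diagram above (headers with app, type, and ts, plus a byte body) and the dated directories it lands in can be sketched as follows. This is an illustration only: it assumes `ts` is seconds since the epoch in UTC, and the `Event` class and `dated_directory` function are hypothetical names, not part of Flume.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class Event:
    headers: Dict[str, int]  # app, type, ts (assumed: seconds since epoch, UTC)
    body: bytes


def dated_directory(base: str, event: Event) -> str:
    """Return base/YYYY-MM-DD/ for the day the event's ts header falls on."""
    day = datetime.fromtimestamp(event.headers["ts"], tz=timezone.utc)
    return f"{base.rstrip('/')}/{day:%Y-%m-%d}/"


event = Event(headers={"app": 1234, "type": 321, "ts": 1325376000}, body=b"")
print(dated_directory("/data/ops/syslog", event))  # /data/ops/syslog/2012-01-01/
```

Deriving the partition from the event's own timestamp, rather than from arrival time, keeps late events in the day they belong to.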
Structure data in tiers
A clear hierarchy of source/derived relationships
One step on the road to proper lineage
Simple “fault and rebuild” processes
Examples
Tier 0 – Raw data from source systems
Tier 1 – Derived from 0, cleansed, normalized
Tier 2 – Derived from 1, aggregated
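The "fault and rebuild" idea can be sketched in a few lines: if every derived dataset records what it was derived from, then faulting one dataset tells you exactly what to rebuild downstream, and in which order. The dataset names and the `DERIVED_FROM` map below are hypothetical, chosen only to mirror the three tiers above.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: derived dataset -> the datasets it is built from.
DERIVED_FROM: Dict[str, List[str]] = {
    "tier1/events_clean": ["tier0/events_raw"],
    "tier1/sessions": ["tier0/events_raw"],
    "tier2/sessions_per_day": ["tier1/sessions"],
}


def rebuild_order(faulted: str, derived_from: Dict[str, List[str]]) -> List[str]:
    """Return the datasets downstream of `faulted`, sources before dependents."""
    # Invert the lineage map: source -> datasets derived directly from it.
    children: Dict[str, List[str]] = {}
    for dataset, sources in derived_from.items():
        for source in sources:
            children.setdefault(source, []).append(dataset)

    # Collect everything reachable downstream of the faulted dataset.
    affected = set()
    queue = deque([faulted])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Topological sort (Kahn's algorithm) restricted to the affected datasets.
    pending = {d: sum(1 for s in derived_from[d] if s in affected)
               for d in affected}
    ready = deque(sorted(d for d, count in pending.items() if count == 0))
    order: List[str] = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in sorted(children.get(current, [])):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    return order


print(rebuild_order("tier0/events_raw", DERIVED_FROM))
```

Because Tier 0 is never derived from anything, it is the only tier that cannot be rebuilt this way, which is one reason to keep raw data untouched.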
[Diagram: tiered data flow in HDFS]
HDFS (Tier 0): /data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, /data/wdb/<database>/<table>/, /data/edw/<database>/<table>/
Processes: Sessionization, Inventory Reconciliation
HDFS (Tier 1): /data/reporting/sessions-day/YYYY-MM-DD/, /data/reporting/events-day/YYYY-MM-DD/
HDFS (For export): /export/edw/inventory/item-diff/<ts>/
There’s a lot to do
Build libraries or services to reveal higher-level
interfaces
Data management and lifecycle events
Instrument jobs and services for performance/
quality
Metadata, metadata, metadata (metadata)
Process (job) deployment, service location,
To the contributors, potential and current
We have work to do
Still way too much scaffolding work
I’m out of time (for now)
Join me for office hours – 1:40 - 2:20 in
Rhinelander
I’m signing copies of Hadoop Operations tonight