Large Scale ETL with Hadoop

The document discusses large-scale ETL processes with Hadoop. It describes why ETL is difficult, citing factors such as data integration, organization and management, and process orchestration. Hadoop provides the core components: HDFS for storage and MapReduce for processing, while the surrounding ecosystem adds higher-level abstractions, data integration, and process orchestration. The document advocates separating ETL infrastructure from ETL processes and treating ETL as a macro-level service composed of other services. It also discusses data organization in HDFS and structuring data in tiers with clear source/derived relationships.


Large Scale ETL with Hadoop

Eric Sammer | Principal Solution Architect
@esammer
Strata + Hadoop World 2012

ETL is like “REST” or “Disaster Recovery”
Everyone defines it differently (and loves to fight about it)
It’s more of a problem/solution space than a thing
Hard to generalize without being lossy in some way
Worst of all, it’s trivial at face value but complicated in practice

So why is ETL hard?
It’s not because ƒ(A) → B is hard (anymore)
Data integration
Organization and management
Process orchestration and scheduling
Accessibility
How it all fits together

Hadoop is two components
HDFS – Massive, redundant data storage
MapReduce – Batch-oriented data processing at scale

The ecosystem brings additional functionality
Higher level languages and abstractions on MapReduce
Hive, Pig, Cascading, ...
File, relational, and streaming data integration
Flume, Sqoop, WebHDFS, ...
Process orchestration and scheduling
Oozie, Azkaban, ...
Libraries for parsing and text extraction
Tika, ?, ...
...and now low latency query with Impala

To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).

The services of ETL
Process Repository
Metadata Repository
Scheduling
Process Orchestration
Integration Adapters or Channels
Service and Process Instrumentation and Collection

What do we have today?
HDFS and MapReduce – The core
Flume – Streaming event data integration
Sqoop – Batch exchange of relational database tables
Oozie – Process orchestration and basic scheduling
Impala – Fast analysis of data quality

MapReduce is the assembly language of data processing
“Simple things are hard, but hard things are possible”
Comparatively low level
Java knowledge required
Use higher level tools where possible

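To make the “simple things are hard” point concrete, here is a sketch of a plain count-per-key aggregation written directly against the MapReduce API. The input/output paths and the tab-delimited record layout are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountByKey {

  public static class KeyMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assume the grouping key is the first tab-separated field.
      outKey.set(line.toString().split("\t", 2)[0]);
      context.write(outKey, ONE);
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "count-by-key");
    job.setJarByClass(CountByKey.class);
    job.setMapperClass(KeyMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same aggregation is a one-line GROUP BY in Hive or Pig, which is why the last bullet matters.
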
Data organization in HDFS
Standard file system tricks to make operations atomic
Use a well-defined structure that supports tooling

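A minimal sketch of one such trick, assuming a hypothetical staging directory and the /data layout shown on the next slide: a job writes into a scratch location, and a single rename publishes the finished partition, so consumers never see partial data (an HDFS rename is one metadata operation).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicPublish {

  // Moves a completed staging directory into its final partition path.
  public static void publish(FileSystem fs, Path staging, Path finalPartition)
      throws IOException {
    if (fs.exists(finalPartition)) {
      throw new IOException("Partition already exists: " + finalPartition);
    }
    fs.mkdirs(finalPartition.getParent());
    if (!fs.rename(staging, finalPartition)) {
      throw new IOException("Rename failed: " + staging + " -> " + finalPartition);
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // A job writes here first (path names are assumptions) ...
    Path staging = new Path("/data/fraud/_incoming/txs/2012-01-01");
    // ... and only this rename makes the partition visible to consumers.
    publish(fs, staging, new Path("/data/fraud/txs/2012-01-01"));
  }
}
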
Data organization in HDFS – Hierarchy
/intent
  /category
    /application (optional)
      /dataset
        /partitions
          /files

Examples:
/data/fraud/txs/2012-01-01/20120101-00.avro
/data/fraud/txs/2012-01-01/20120101-01.avro
/group/research/model-17/training-txs/part-00000.avro
/group/research/model-17/training-txs/part-00001.avro
/user/esammer/scratch/foo/

A view of data integration

[Diagram: streaming and relational data integration into HDFS]
Streaming data: syslog, application, clickstream, and point-of-sale events flow
through Flume agents (channels 1–3) into date-partitioned HDFS paths such as
/data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/,
/data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/, and
/data/pos/US/CA/42/2012-01-01/. Each event carries headers (app, type, ts)
and an opaque byte-array body.
Relational data: Sqoop jobs copy tables from the web application database into
/data/wdb/<database>/<table>/ and from the EDW into /data/edw/<database>/<table>/.

Structure data in tiers
A clear hierarchy of source/derived relationships
One step on the road to proper lineage
Simple “fault and rebuild” processes
Examples
Tier 0 – Raw data from source systems
Tier 1 – Derived from 0, cleansed, normalized
Tier 2 – Derived from 1, aggregated

[Diagram: tiered data flow in HDFS]
Tier 0 raw datasets (/data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/,
/data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/,
/data/pos/US/CA/42/2012-01-01/, /data/wdb/<database>/<table>/,
/data/edw/<database>/<table>/) feed Sessionization, Event Report Aggregation,
and Inventory Reconciliation jobs. These produce Tier 1 datasets under
/data/reporting/sessions-day/YYYY-MM-DD/, /data/reporting/events-day/YYYY-MM-DD/,
and /data/reporting/events-hour/YYYY-MM-DD/, plus export output under
/export/edw/inventory/item-diff/<ts>/.

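The tier structure is what makes “fault and rebuild” mechanical: if a derived partition is missing or older than its source, drop it and regenerate it. A minimal sketch follows, assuming the paths above and a hypothetical runDerivationJob() hook standing in for whatever job derives the partition.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RebuildTier1 {

  // Fault and rebuild one derived partition if it is missing or stale.
  // Assumes the Tier 0 source partition exists.
  public static void rebuildIfStale(FileSystem fs, Path tier0, Path tier1)
      throws Exception {
    boolean stale = !fs.exists(tier1)
        || fs.getFileStatus(tier1).getModificationTime()
           < fs.getFileStatus(tier0).getModificationTime();
    if (stale) {
      fs.delete(tier1, true);           // fault the derived partition
      runDerivationJob(tier0, tier1);   // rebuild it from the raw tier
    }
  }

  // Placeholder for the MapReduce/Hive/Pig job that derives Tier 1 from Tier 0.
  private static void runDerivationJob(Path source, Path target) {
    throw new UnsupportedOperationException("hook up the real derivation job");
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    rebuildIfStale(fs,
        new Path("/data/web/core/2012-01-01"),
        new Path("/data/reporting/events-day/2012-01-01"));
  }
}
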
There’s a lot to do
Build libraries or services to reveal higher-level interfaces
Data management and lifecycle events
Instrument jobs and services for performance/quality
Metadata, metadata, metadata (metadata)
Process (job) deployment, service location, ...

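On the instrumentation point, MapReduce counters are the simplest place to start. A hedged sketch: a mapper that counts and skips malformed records so data quality shows up in the job's counters. The record layout and counter names are assumptions.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3) {
      // Visible in the job's counter output; a monitor can alert when the
      // malformed ratio crosses a threshold.
      context.getCounter("quality", "malformed_records").increment(1);
      return;
    }
    context.getCounter("quality", "valid_records").increment(1);
    context.write(line, NullWritable.get());
  }
}
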
To the contributors, potential and current
We have work to do
Still way too much scaffolding work

I’m out of time (for now)
Join me for office hours – 1:40 - 2:20 in Rhinelander
I’m signing copies of Hadoop Operations tonight