Large Scale ETL with Hadoop

The document discusses large scale ETL processes with Hadoop. It describes how ETL is difficult due to factors like data integration, organization, and process orchestration. Hadoop provides core components like HDFS for storage and MapReduce for processing. The ecosystem includes tools that provide higher-level abstractions, data integration, and process orchestration. The document advocates separating ETL infrastructure from processes and treating it as a macro-level service composed of other services. It also discusses data organization in HDFS and structuring data in tiers with clear source/derived relationships.

Eric Sammer | Principal Solution Architect
@esammer
Strata + Hadoop World 2012

ETL is like “REST” or “Disaster Recovery”
Everyone defines it differently (and loves to fight about it)
It’s more of a problem/solution space than a thing
Hard to generalize without being lossy in some way

So why is ETL hard?
It’s not because ƒ(A) → B is hard (anymore)
Data integration
Organization and management
Process orchestration and scheduling
Accessibility
How it all fits together

Hadoop is two components
HDFS – Massive, redundant data storage
MapReduce – Batch-oriented data processing at scale

The ecosystem brings additional functionality
Higher level languages and abstractions on MapReduce
  Hive, Pig, Cascading, ...
File, relational, and streaming data integration
  Flume, Sqoop, WebHDFS, ...
Process orchestration and scheduling
  Oozie, Azkaban, ...
Libraries for parsing and text extraction
  Tika, ?, ...

To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).

The services of ETL
Process Repository
Metadata Repository
Scheduling
Process Orchestration
Integration Adapters or Channels
Service and Process Instrumentation and Collection

What do we have today?
HDFS and MapReduce – The core
Flume – Streaming event data integration
Sqoop – Batch exchange of relational database tables
Oozie – Process orchestration and basic scheduling

MapReduce is the assembly language of data processing
“Simple things are hard, but hard things are possible”
Comparatively low level
Java knowledge required
Use higher level tools where possible
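
To make the "comparatively low level" point concrete, here is a minimal sketch of the map, shuffle, and reduce steps needed for a simple count. It runs in memory in plain Python and is not Hadoop's Java API; the record format ("<app>,<type>") and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterator[Tuple[str, int]]]
              ) -> Iterator[Tuple[str, int]]:
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_by_type_mapper(line: str) -> Iterator[Tuple[str, int]]:
    # Assumed (hypothetical) record format: "<app>,<type>"
    _app, event_type = line.split(",")
    yield event_type, 1


def sum_reducer(_key: str, values: List[int]) -> int:
    return sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    pairs = map_phase(records, count_by_type_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)


counts = run_job(["1234,321", "1234,321", "5678,400"])
```

Even for a single grouped count, the mapper, the grouping, and the reducer all have to be spelled out, which is the kind of work that tools such as Hive or Pig express in one statement.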

Data organization in HDFS
Standard file system tricks to make operations atomic
Use a well-defined structure that supports tooling
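
One common file system trick of this kind is to write to a temporary name and then rename into place, so that readers see either the complete file or nothing. The sketch below shows the pattern on a local file system as a stand-in; on HDFS the rename operation would play the same role. The function name and the ".tmp-" prefix are assumptions, not something the slides specify.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data under a temporary name, then rename it into place.

    Local-file-system stand-in for the pattern; the helper name and the
    ".tmp-" prefix are hypothetical.
    """
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the same directory so the rename
    # never crosses a file system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # The rename is the single step that makes the file visible.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Leave no partial output behind if anything fails.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Tooling that scans a dataset directory can then skip names starting with the temporary prefix and treat everything else as complete.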

Data organization in HDFS – Hierarchy
/intent
  /category
    /application (optional)
      /dataset
        /partitions
          /files

Examples:
/data/fraud/txs/2012-01-01/20120101-00.avro
/data/fraud/txs/2012-01-01/20120101-01.avro
/group/research/model-17/training-txs/part-00000.avro
/group/research/model-17/training-txs/part-00001.avro
/user/esammer/scratch/foo/
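
A well-defined hierarchy like this is easy to encode in a small helper so that every job builds paths the same way. The following is a sketch only: the function name, its parameters, and the way the two example paths are split into levels are interpretations of the slide, not part of it.

```python
from typing import Optional, Sequence


def dataset_path(intent: str,
                 category: str,
                 dataset: str,
                 application: Optional[str] = None,
                 partitions: Sequence[str] = (),
                 filename: Optional[str] = None) -> str:
    """Build /intent/category[/application]/dataset[/partitions...][/file].

    Hypothetical helper; the level names follow the hierarchy above.
    """
    parts = [intent, category]
    if application is not None:
        parts.append(application)
    parts.append(dataset)
    parts.extend(partitions)
    if filename is not None:
        parts.append(filename)
    for part in parts:
        # Reject empty components and embedded separators so a level
        # can never be skipped or smuggled in by accident.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/" + "/".join(parts)


fraud_file = dataset_path("data", "fraud", "txs",
                          partitions=["2012-01-01"],
                          filename="20120101-00.avro")
model_file = dataset_path("group", "research", "training-txs",
                          application="model-17",
                          filename="part-00000.avro")
```

With this reading, `fraud_file` and `model_file` reproduce the first and third example paths listed above.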

A view of data integration

Event
headers: {
  app:  1234,
  type: 321,
  ts:   <epoch>
},
body: <bytes>

Streaming Data (Flume Agent -> HDFS)
  Sources: Syslog Events, Application Events, Clickstream Events, Point of Sale Events
  Flume (Channel 1) -> /data/ops/syslog/2012-01-01/
  Flume (Channel 2) -> /data/web/core/2012-01-01/
                       /data/web/retail/2012-01-01/
  Flume (Channel 3) -> /data/pos/US/NY/17/2012-01-01/
                       /data/pos/US/CA/42/2012-01-01/

Relational Data (Sqoop -> HDFS)
  Web App Database -> Sqoop (Job 1) -> /data/wdb/<database>/<table>/
  EDW              -> Sqoop (Job 2) -> /data/edw/<database>/<table>/

Structure data in tiers
A clear hierarchy of source/derived relationships
One step on the road to proper lineage
Simple “fault and rebuild” processes
Examples
Tier 0 – Raw data from source systems

HDFS (Tier 0)
  /data/ops/syslog/2012-01-01/
  /data/web/core/2012-01-01/
  /data/web/retail/2012-01-01/
  /data/pos/US/NY/17/2012-01-01/
  /data/pos/US/CA/42/2012-01-01/
  /data/wdb/<database>/<table>/
  /data/edw/<database>/<table>/

Processes
  Sessionization
  Event Report Aggregation
  Inventory Reconciliation

HDFS (Tier 1)
  /data/reporting/sessions-day/YYYY-MM-DD/
  /data/reporting/events-day/YYYY-MM-DD/
  /data/reporting/events-hour/YYYY-MM-DD/

HDFS (For export)
  /export/edw/inventory/item-diff/<ts>/
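
With explicit source/derived relationships, a "fault and rebuild" process reduces to a graph walk: mark a dataset as faulted and find everything derived from it. The sketch below assumes a hand-written lineage table; the specific edges in `DERIVED_FROM` are hypothetical, since the diagram does not spell out which inputs feed which process.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: derived dataset -> the source datasets it reads.
DERIVED_FROM: Dict[str, List[str]] = {
    "/data/reporting/sessions-day": ["/data/web/core", "/data/web/retail"],
    "/data/reporting/events-day": ["/data/web/core", "/data/web/retail"],
    "/export/edw/inventory/item-diff": ["/data/pos", "/data/edw"],
}


def datasets_to_rebuild(faulted: str,
                        derived_from: Dict[str, List[str]]) -> List[str]:
    """Return every dataset that directly or transitively derives from
    the faulted one, sorted by name."""
    # Invert the table: source -> datasets that consume it.
    consumers: Dict[str, List[str]] = {}
    for derived, sources in derived_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(derived)

    affected: Set[str] = set()
    queue = deque([faulted])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return sorted(affected)


rebuild_after_web_fault = datasets_to_rebuild("/data/web/core", DERIVED_FROM)
```

Because Tier 0 holds the raw source data, anything the walk returns can be deleted and regenerated from its inputs.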

There’s a lot to do
Build libraries or services to reveal higher-level interfaces
Data management and lifecycle events
Instrument jobs and services for performance/quality
Metadata, metadata, metadata (metadata)
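
As one possible shape for job instrumentation, the sketch below wraps a job function and records its name, outcome, and duration. The in-memory `JOB_METRICS` list stands in for whatever collection service would really receive these records; the decorator and the sample job are hypothetical.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a real metrics or metadata service (assumption).
JOB_METRICS: List[Dict[str, Any]] = []


def instrumented(job_name: str) -> Callable:
    """Decorator that records status and wall-clock duration per run."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            started = time.monotonic()
            status = "failed"
            try:
                result = func(*args, **kwargs)
                status = "succeeded"
                return result
            finally:
                # Recorded on success and on failure alike.
                JOB_METRICS.append({
                    "job": job_name,
                    "status": status,
                    "seconds": time.monotonic() - started,
                })
        return wrapper
    return decorator


@instrumented("sessionization")
def drop_empty_events(events: List[str]) -> List[str]:
    # Hypothetical job body: keep only non-empty events.
    return [event for event in events if event]


kept = drop_empty_events(["a", "", "b"])
```

Putting the measurement in a wrapper keeps it out of each job's logic, which fits the goal of separating infrastructure from processes.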

To the contributors, potential and current
We have work to do
Still way too much scaffolding work

I’m out of time (for now)
Join me for office hours – 1:40 - 2:20 in Rhinelander
I’m signing copies of Hadoop Operations tonight
