DataLake Hadoop
March 3, 2014
© 2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 1
Data Lake Definition
Current Hadoop Landscape
Why Build a Data Lake
Benefits
Data Lake Design
Data Lake - a place to store practically unlimited amounts of data of any format, schema and
type that is relatively inexpensive and massively scalable. Data processing software like
Hadoop can transform the data from its raw state to a finished product.
--Revelytix
If you think of a datamart as a store of bottled water – cleansed and packaged and structured
for easy consumption – the data lake is a large body of water in a more natural state. The
contents of the data lake stream in from a source to fill the lake, and various users of the lake
can come to examine, dive in, or take samples.
--Pentaho
The difference between a data lake and a data warehouse is that in a data warehouse, the
data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed.
--Forbes
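The contrast drawn in these definitions is often described as schema-on-read: raw records are stored untouched and each consumer applies its own structure when it reads. The following is a minimal Python sketch of that idea, not part of the original deck; the sample events and the helper name `read_with_schema` are hypothetical.

```python
import json

# Raw events land in the lake exactly as they arrive (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/docs", "ms": 340, "referrer": "search"}',
    '{"device": "router-7", "cpu": 0.82}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep records that carry every requested field."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: record[name] for name in fields})
    return rows


# Two consumers project different structures from the same raw store.
clickstream = read_with_schema(RAW_EVENTS, ["user", "page"])
machine = read_with_schema(RAW_EVENTS, ["device", "cpu"])
```

In a warehouse the equivalent filtering and projection would happen once, at load time, which is the pre-categorization the Forbes quote refers to.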
Current Hadoop Landscape
[Diagram: three columns, left to right: Data Sources, Hadoop Platform, Data Consumption]
Data Sources: Database N; Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream; Cisco.com logs, etc.)
Data Consumption: Marketing; Data Science Program; Customer Network Config; Product Quality; Data Science Program Exercises
Other labels in the diagram: Customer, Bookings, Hierarchies
• Every project team spends resources bringing its own data into the platform
• Hard to track which data elements are already available in the platform
• Platform resources are used redundantly for data acquisition and maintenance
• Data quality and reliability issues
• Project teams build their data acquisition flows manually
Data Lake
[Diagram: three columns, left to right: Data Sources, Hadoop Platform, Data Consumption]
Data Sources: Databases (Database N); Unstructured Data (Docs, Cases, Content; Network Logs, Cisco.com logs, Documents, etc.; IoE, Machine Data, Clickstream)
Hadoop Platform: Data Lake (EDS)
Data Consumption: Service Renewal; Opportunities; Data Science; Customer Network Config; Product Quality
Other labels in the diagram: CPAI, CSTG
• Data reuse: data is brought in once and consumed by multiple projects
• Data stored in raw format: can be used by a variety of apps and tools
• Automated framework: can be quickly configured to get data from any source
• Better resource utilization: frees resources in source systems and on the Hadoop platform
• Faster project delivery
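As an illustration of the "bring once, consume many times" and "configured, not hand-built" benefits, the sketch below shows one way a configuration-driven ingestion step could look. It is an assumption-laden toy, not the framework described in this deck: `SourceConfig`, `DataLake`, the extractor functions, and the source names are all hypothetical, and the extractors return stand-in records instead of contacting real systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    name: str      # dataset name in the lake (hypothetical)
    kind: str      # selects an extractor: "database" or "files"
    location: str  # hypothetical connection string or path


def extract_database(config):
    # Stand-in for a real database pull; returns raw rows unchanged.
    return [{"source": config.name, "row": i} for i in range(3)]


def extract_files(config):
    # Stand-in for reading raw documents or logs.
    return [{"source": config.name, "line": "raw text"}]


# Adding a new source type means registering one extractor, not writing a new flow.
EXTRACTORS = {"database": extract_database, "files": extract_files}


class DataLake:
    def __init__(self):
        self.raw = {}  # dataset name -> raw records

    def ingest(self, config):
        """Load a source once; repeated requests reuse the stored copy."""
        if config.name in self.raw:
            return False
        self.raw[config.name] = EXTRACTORS[config.kind](config)
        return True

    def read(self, name):
        return list(self.raw[name])


lake = DataLake()
bookings = SourceConfig("bookings", "database", "db://example/bookings")
first_load = lake.ingest(bookings)   # pulls from the source
second_load = lake.ingest(bookings)  # already present, no second pull
project_a_rows = lake.read("bookings")
project_b_rows = lake.read("bookings")
```

Both projects read the same stored copy, so the source system is queried only once.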
High Level Data Lake Architecture
[Diagram: two columns, Data Sources and Hadoop Platform. Labels: CPAI, Tidal, Databases, Data Lake (EDS)]
Data Lake Population and Consumption
[Diagram: Structured Sources (Any Source, TD, CG1) and Unstructured Sources (Docs, Cases, Content; IoE, Machine Data, Clickstream) feed, via ETL Offload, into HADOOP, which holds the Data Lake (Source Like Structure) and the Transformed Layer (3NF Model); consumers are labeled F1 to F6]
• What data model to use?
  • Data Lake: Source Like Structure
  • Processed Data: 3NF Model
• What are the sources to the Data Lake?
  • Any structured / unstructured data source
• Do we build a transformed layer?
  • Yes
• SSOT to be computed in one place, consumed by many platforms
• Functional Areas can consume from the Data Lake, but are not allowed to share with other Functional Areas
• EDS governs the Data Lake and the Transformed Layer
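One possible reading of the consumption rule above is that a Functional Area may read anything in the EDS-governed layers, plus its own derived data, but nothing owned by another Functional Area. The Python sketch below encodes that reading; the layer names, dataset names, and the `can_read` helper are hypothetical and only illustrate the rule as interpreted here.

```python
# Layers governed by EDS, per the slide; the identifiers themselves are assumptions.
GOVERNED_LAYERS = {"data_lake", "transformed_layer"}

# Hypothetical catalog: dataset name -> owner (a governed layer or a Functional Area).
CATALOG = {
    "bookings_raw": "data_lake",
    "bookings_3nf": "transformed_layer",
    "f1_scores": "F1",
}


def can_read(consumer, dataset, catalog=CATALOG):
    """Allow reads from governed layers and from the consumer's own data only."""
    owner = catalog[dataset]
    return owner in GOVERNED_LAYERS or owner == consumer


f2_reads_lake = can_read("F2", "bookings_raw")  # governed layer: allowed
f2_reads_f1 = can_read("F2", "f1_scores")       # another area's data: denied
f1_reads_own = can_read("F1", "f1_scores")      # own data: allowed
```

Routing every shared dataset through the governed layers is what lets a single source of truth be computed once and consumed by many platforms.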
Thank you.