
DataLake Hadoop

Uploaded by

Kartik Sharma

Cisco Data Lake

March 3, 2014

© 2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 1
• Data Lake Definition
• Current Hadoop Landscape
• Why Build a Data Lake
• Benefits
• Data Lake Design

Data Lake - a place to store practically unlimited amounts of data of any format, schema and
type that is relatively inexpensive and massively scalable. Data processing software like
Hadoop can transform the data from its raw state to a finished product.

--Revelytix
If you think of a datamart as a store of bottled water – cleansed and packaged and structured
for easy consumption – the data lake is a large body of water in a more natural state. The
contents of the data lake stream in from a source to fill the lake, and various users of the lake
can come to examine, dive in, or take samples.

--Pentaho

The difference between a data lake and a data warehouse is that in a data warehouse, the
data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed.

--Forbes
Current Hadoop Landscape

[Diagram: Data Sources → Hadoop Platform → Data Consumption]
• Data Sources: Databases (ERP, SFDC, Collab, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream; Cisco.com logs, etc.)
• Hadoop Platform: project areas CPAI, CSTG, Marketing and Data Science Program, each shown with its own data sets (IB, Contracts, Cases, Hierarchies, Customer, Bookings, Network Logs; several labels repeat across areas)
• Data Consumption: Service Renewal Opportunities, Marketing Campaigns, Customer Network Config, Product Quality, Data Science Program Exercises

• Every project team spends resources bringing its own data onto the platform
• It is difficult to track which data elements are available on the platform
• Platform resources are used redundantly for data acquisition and maintenance
• Data quality and reliability issues
• Project teams develop their data acquisition flows manually

Data Lake

[Diagram: Data Sources → Hadoop Platform → Data Consumption]
• Data Sources: Databases (ERP, SFDC, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream)
• Hadoop Platform: Data Lake (EDS) holding IB, Contracts, Cases, Hierarchies, Bookings, Customers, Supply Chain, Network Logs, Cisco.com logs, Documents, etc.; project areas CPAI, Marketing, Data Science and CSTG
• Data Consumption: Service Renewal Opportunities, Marketing Campaigns, Customer Network Config, Product Quality, Data Science Program Exercises

• Data reuse – data is brought in once and consumed by multiple projects
• Data stored in raw format – can be used by a variety of apps and tools
• Automated framework – can be quickly configured to get data from any source
• Better resource utilization – frees resources in source systems and on the Hadoop platform
• Quick project deliveries
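
The "automated framework" bullet can be pictured as configuration-driven ingestion: each source is described once, and the load plan is derived from that description. The sketch below is only an illustration under stated assumptions. The `SourceConfig` fields, the `/datalake/raw/...` path layout, and the example source names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical description of one source feed (not from the slides)."""
    name: str    # logical dataset name, e.g. "bookings"
    system: str  # source system, e.g. "ERP" or "SFDC"
    kind: str    # "structured" or "unstructured"


def landing_path(cfg: SourceConfig, load_date: str) -> str:
    """Build the assumed raw-zone path where one day's load would land."""
    return f"/datalake/raw/{cfg.system.lower()}/{cfg.name}/load_date={load_date}"


def plan_loads(configs, load_date):
    """Return one planned load per configured source; nothing is executed."""
    return [
        {"dataset": c.name, "kind": c.kind, "target": landing_path(c, load_date)}
        for c in configs
    ]


# Adding a new source means adding one config entry, not writing a new flow.
configs = [
    SourceConfig("bookings", "ERP", "structured"),
    SourceConfig("clickstream", "Web", "unstructured"),
]
plan = plan_loads(configs, "2014-03-03")
```

Keeping the data in a raw, source-partitioned layout is what would let several projects read the same load instead of each acquiring its own copy.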

High Level Data Lake Architecture

[Diagram: Data Sources → Hadoop Edge Node → Hadoop Platform]
• Data Sources: Databases (ERP, SFDC, Database N); Unstructured Data (Docs, Cases, Content; IoE, Machine Data, Clickstream)
• Hadoop Edge Node: Tidal, Data Lake Load Process
• Hadoop Platform: Data Lake (EDS) holding IB, Contracts, Cases, Hierarchies, Bookings, Customers, Supply Chain, Network Logs, Cisco.com logs, Documents, etc.; CPAI, Marketing, Data Science
• Also shown: ETL Offload; Data Lake Metadata (TD)
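
The architecture includes a Data Lake Metadata component. One way to read its role is as a catalog that answers "is this data element already in the lake, and where?", so that a project reuses an existing load instead of acquiring the data again. The following is a minimal in-memory sketch under that assumption. The `DataLakeCatalog` class, its methods, and the example paths are hypothetical and are not described in the slides.

```python
class DataLakeCatalog:
    """Hypothetical in-memory stand-in for a data lake metadata store."""

    def __init__(self):
        self._entries = {}

    def register(self, dataset, source, location):
        """Record a dataset once; a second registration of the same name fails."""
        if dataset in self._entries:
            raise ValueError(f"{dataset} is already in the data lake")
        self._entries[dataset] = {"source": source, "location": location}

    def is_available(self, dataset):
        """Tell a project team whether the dataset has already been loaded."""
        return dataset in self._entries

    def locate(self, dataset):
        """Return the stored location of a registered dataset."""
        return self._entries[dataset]["location"]


catalog = DataLakeCatalog()
catalog.register("contracts", "ERP", "/datalake/raw/erp/contracts")

# A second project checks the catalog before building its own acquisition flow.
reuse_existing = catalog.is_available("contracts")
needs_new_load = not catalog.is_available("supply_chain")
```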

Data Lake Population and Consumption

[Diagram: Any Source (Structured Sources; Unstructured Sources: Docs, Cases, Content; IoE, Machine data, Clickstream) → HADOOP: Data Lake (Source Like Structure) → Transformed Layer (3NF Model) → functional areas F1 to F6; also labeled: TD, CG1, ETL Offload]

• What data model to use?
   • Data Lake – Source Like Structure
   • Processed Data – 3NF Model
• What are the sources to the Data Lake?
   • Any structured / unstructured data source
• Do we build a transformed layer?
   • Yes
• SSOT to be computed in one place, consumed by many platforms
• Functional Areas can consume from the Data Lake, but are not allowed to share with other Functional Areas
• EDS governs the Data Lake and the Transformed Layer
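
The two design rules on this slide (source-like structure in the lake, a 3NF model in the transformed layer, and functional areas consuming only from governed layers) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the booking and customer columns, the `to_3nf` helper, and the layer names in `GOVERNED_LAYERS` are hypothetical and not taken from the slides.

```python
def to_3nf(source_rows):
    """Split source-like booking rows into customer and booking relations.

    The customer name depends on customer_id rather than on the booking,
    so it moves to its own relation keyed by customer_id.
    """
    customers = {}
    bookings = []
    for row in source_rows:
        customers[row["customer_id"]] = {"customer_name": row["customer_name"]}
        bookings.append({
            "booking_id": row["booking_id"],
            "customer_id": row["customer_id"],
            "amount": row["amount"],
        })
    return customers, bookings


# Assumed names for the two layers governed centrally.
GOVERNED_LAYERS = {"data_lake", "transformed_layer"}


def may_read_from(provider):
    """Functional areas read only from governed layers, never from each other."""
    return provider in GOVERNED_LAYERS


# Source-like rows repeat the customer name on every booking.
raw = [
    {"booking_id": 1, "customer_id": "C1", "customer_name": "Acme", "amount": 100},
    {"booking_id": 2, "customer_id": "C1", "customer_name": "Acme", "amount": 250},
]
customers, bookings = to_3nf(raw)
```

Computing the normalized relations once in the transformed layer is in the spirit of the SSOT bullet: each consumer reads the same result rather than deriving its own.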
Thank you.