Big Data Ecosystem
• Apache Sqoop is a tool to easily import data from structured databases (Db2, MySQL, Netezza, Oracle, etc.) into your Hadoop cluster and related Hadoop systems (such as Hive and HBase)
• Can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
• Helps offload tasks such as ETL from the enterprise data warehouse to Hadoop for lower-cost, more efficient execution
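Sqoop parallelizes an import by splitting the table's primary-key range across map tasks, each of which reads one slice with a range `WHERE` clause. A rough sketch of that split idea (the function name and even-split logic are illustrative, not Sqoop's actual code):

```python
# Hypothetical sketch of a Sqoop-style parallel import: divide a table's
# primary-key range into N near-even slices, one per map task.

def split_key_range(min_id, max_id, num_mappers):
    """Return half-open (lo, hi) ranges covering [min_id, max_id]."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        ranges.append((lo, lo + size))
        lo += size
    return ranges

# Each range becomes a map task's query: WHERE id >= lo AND id < hi
print(split_key_range(1, 100, 4))  # → [(1, 26), (26, 51), (51, 76), (76, 101)]
```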
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of streaming event data.
• Flume helps you aggregate data from many sources, manipulate the data, and then add the data into
your Hadoop environment.
• Its functionality is now superseded by HDF / Apache Nifi.
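Flume's data flow is built from a source (where events enter), a channel (a buffer), and a sink (where events leave, e.g. into HDFS). A minimal in-memory sketch of that pipeline shape (class and function names are invented for illustration, not the Flume API):

```python
from collections import deque

# Toy source -> channel -> sink pipeline mirroring Flume's event-flow model.

class Channel:
    """Buffers events between a source and a sink."""
    def __init__(self):
        self.buf = deque()
    def put(self, event):
        self.buf.append(event)
    def take(self):
        return self.buf.popleft() if self.buf else None

def source(channel, lines):
    for line in lines:
        channel.put({"body": line})   # Flume events carry headers + a body

def sink(channel, out):
    while (event := channel.take()) is not None:
        out.append(event["body"])     # a real sink would write to HDFS

ch, out = Channel(), []
source(ch, ["evt1", "evt2"])
sink(ch, out)
print(out)  # → ['evt1', 'evt2']
```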
• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system used for building real-time data pipelines and streaming apps
• Often used in place of traditional message brokers (such as JMS- or AMQP-based systems) because of its higher throughput, reliability, and replication
• Kafka works in combination with a variety of Hadoop tools:
Apache Storm
Apache HBase
Apache Spark
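Kafka's core abstractions are topics split into partitions, each an append-only log; messages with the same key land in the same partition, and consumers track their position as an offset. A small in-memory sketch of those concepts (not the real client API):

```python
# Illustrative sketch of Kafka's topic/partition/offset model.

class Topic:
    def __init__(self, partitions=2):
        self.logs = [[] for _ in range(partitions)]   # one append-only log each
    def produce(self, key, value):
        p = hash(key) % len(self.logs)      # key -> partition, as Kafka does
        self.logs[p].append(value)
        return p, len(self.logs[p]) - 1     # (partition, offset)
    def consume(self, partition, offset):
        return self.logs[partition][offset:]  # read everything from an offset

t = Topic()
p, off = t.produce("user-1", "click")
t.produce("user-1", "purchase")     # same key -> same partition: order preserved
print(t.consume(p, off))  # → ['click', 'purchase']
```

The same-key-same-partition rule is what gives Kafka per-key ordering without a global ordering guarantee.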
• Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and
retrieval.
• Based on Google’s BigTable and runs on YARN
Think of it as a "highly secure HBase"
• Features:
Server-side programming
Designed to scale
Cell-based access control
Stable
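Accumulo's signature feature is cell-level access control: every key/value cell carries a visibility label, and a scan only returns cells the caller's authorizations satisfy. A simplified sketch (single-token labels; real Accumulo labels are boolean expressions like `admin|audit`):

```python
import bisect

# Toy sorted key/value store with Accumulo-style per-cell visibility labels.

class CellStore:
    def __init__(self):
        self.cells = []   # kept sorted by key, like Accumulo's sorted store
    def put(self, key, visibility, value):
        bisect.insort(self.cells, (key, visibility, value))
    def scan(self, auths):
        # Only cells whose label is among the caller's authorizations survive.
        return [(k, v) for k, vis, v in self.cells if vis in auths]

store = CellStore()
store.put("row1", "public", "hello")
store.put("row2", "secret", "classified")
print(store.scan({"public"}))            # → [('row1', 'hello')]
print(store.scan({"public", "secret"}))  # → [('row1', 'hello'), ('row2', 'classified')]
```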
• Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications by
combining the best of both worlds:
The power of standard SQL and JDBC APIs with full ACID transaction capabilities.
The flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase
as its backing store.
• Essentially this is SQL for NoSQL
• Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce
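The "SQL for NoSQL" idea can be pictured as Phoenix mapping each SQL row onto HBase-style cells, one `(rowkey, column) -> value` entry per column, so a `SELECT` becomes a filtered scan over those cells. A hedged stdlib sketch of that mapping (function names are invented; Phoenix's real encoding is more involved):

```python
# Illustrative row-to-cells mapping behind "SQL over a key/value store".

def upsert(kv, table, pk, row):
    for col, val in row.items():
        kv[(table, pk, col)] = val     # one HBase-style cell per column

def select(kv, table, columns):
    rows = {}
    for (t, pk, col), val in kv.items():
        if t == table and col in columns:
            rows.setdefault(pk, {})[col] = val
    return rows

kv = {}
upsert(kv, "users", "u1", {"name": "Ada", "city": "London"})
print(select(kv, "users", {"name"}))  # → {'u1': {'name': 'Ada'}}
```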
• Apache Solr is a fast, open source enterprise search platform built on the Apache Lucene Java search library
• Full-text indexing and search
REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
• Highly reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration, and more
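Full-text search in Solr rests on Lucene's inverted index: each term maps to the set of documents containing it, and a multi-term query intersects those sets. A minimal sketch of that core idea (no scoring, tokenization, or analysis, which real Lucene adds):

```python
from collections import defaultdict

# Toy inverted index: term -> set of document ids containing it.

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())   # AND semantics: intersect postings
    return result

docs = {1: "Hadoop scales out", 2: "Solr indexes Hadoop logs"}
idx = build_index(docs)
print(search(idx, "hadoop"))       # → {1, 2}
print(search(idx, "hadoop logs"))  # → {2}
```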
• Apache Spark is a fast and general engine for large-scale data processing.
• Spark has a variety of advantages including:
Speed
−Runs programs faster than MapReduce by processing data in memory
Easy to use
−Write apps quickly with Java, Scala, Python, R
Generality
−Can combine SQL, streaming, and complex analytics
Runs on variety of environments and can access diverse data sources
−Hadoop, Mesos, standalone, cloud…
−HDFS, Cassandra, HBase, S3…
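Spark's "easy to use" claim comes from chaining small functional transformations (flatMap, map, reduceByKey) instead of writing MapReduce boilerplate. Since a Spark cluster is not assumed here, this stdlib sketch mirrors the shape of the classic RDD word count; the commented PySpark chain is the rough equivalent:

```python
from functools import reduce
from collections import Counter

lines = ["spark is fast", "spark is general"]

# Roughly equivalent PySpark (illustrative, requires a SparkContext `sc`):
#   sc.parallelize(lines).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

words = (w for line in lines for w in line.split())                   # flatMap
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())  # reduceByKey
print(counts["spark"])  # → 2
```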
Introduction to Hortonworks Data Platform (HDP) © Copyright IBM Corporation 2018
• Apache Druid is a high-performance, distributed data store for real-time analytics on large event datasets
• Apache Atlas is a scalable and extensible set of core foundational governance services
Enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop
• Exchanges metadata with other tools and processes within and outside of the Hadoop stack
Allows integration with the whole enterprise data ecosystem
• Atlas Features:
Data Classification
Centralized Auditing
Centralized Lineage
Security & Policy Engine
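Centralized lineage, one of the Atlas features above, amounts to maintaining a directed graph of dataset-and-process dependencies and walking it to answer questions like "what feeds this report?". A small sketch of that walk (edge map and names are hypothetical, not Atlas's data model):

```python
# Toy lineage graph: {target: [sources]} edges, walked upstream.

def upstream(edges, node, seen=None):
    """Return all ancestors of `node` in the lineage graph."""
    seen = set() if seen is None else seen
    for src in edges.get(node, []):
        if src not in seen:
            seen.add(src)
            upstream(edges, src, seen)   # recurse through transitive sources
    return seen

edges = {"report": ["etl_job"], "etl_job": ["raw_events", "dim_users"]}
print(sorted(upstream(edges, "report")))  # → ['dim_users', 'etl_job', 'raw_events']
```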
• Apache Ranger is a centralized security framework to enable, monitor, and manage comprehensive data security across the Hadoop platform
• Manage fine-grained access control over Hadoop data access components like Apache Hive and Apache
HBase
• Using the Ranger console, you can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
Policies enforced within Hadoop
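The fine-grained, group-based policy model above boils down to: does any policy whose resource pattern matches the requested resource grant this access to one of the user's groups? A hedged sketch of that evaluation (policy shape and names are invented, not the Ranger API):

```python
# Toy Ranger-style allow policy: resource pattern + groups + access types.

POLICIES = [
    {"resource": "db.sales.*", "groups": {"analysts"}, "access": {"select"}},
]

def matches(pattern, resource):
    if pattern.endswith(".*"):
        return resource.startswith(pattern[:-1])   # simple prefix wildcard
    return pattern == resource

def is_allowed(policies, user_groups, resource, access):
    return any(matches(p["resource"], resource)
               and bool(user_groups & p["groups"])
               and access in p["access"]
               for p in policies)

print(is_allowed(POLICIES, {"analysts"}, "db.sales.orders", "select"))  # → True
print(is_allowed(POLICIES, {"interns"}, "db.sales.orders", "select"))   # → False
```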
• Apache Knox is a REST API and application gateway for the Apache Hadoop ecosystem
• Provides perimeter security for Hadoop clusters
• Single access point for all REST interactions with Apache Hadoop clusters
• Integrates with prevalent SSO and identity management systems
Simplifies Hadoop security for users who access cluster data and execute jobs
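A gateway as a "single access point" means clients hit one authenticated endpoint, and the gateway rewrites each request to the right internal cluster service. A toy sketch of that routing (paths and internal hostnames are invented for illustration):

```python
# Toy Knox-style routing table: gateway path prefix -> internal service URL.

ROUTES = {
    "/gateway/default/webhdfs": "http://namenode.internal:50070/webhdfs",
    "/gateway/default/hbase":   "http://hbase.internal:8080",
}

def route(path, authenticated):
    if not authenticated:
        return None                      # perimeter security: reject first
    for prefix, target in ROUTES.items():
        if path.startswith(prefix):
            return target + path[len(prefix):]
    return None

print(route("/gateway/default/webhdfs/v1/tmp", True))
# → http://namenode.internal:50070/webhdfs/v1/tmp
```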
• Apache Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Allows application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications
• Cloudbreak is a tool for provisioning and managing Apache Hadoop clusters in the cloud
• Automates launching of elastic Hadoop clusters
• Policy-based autoscaling on the major cloud infrastructure platforms, including:
Microsoft Azure
Amazon Web Services
Google Cloud Platform
OpenStack
Platforms that support Docker containers
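Policy-based autoscaling, at its core, compares a cluster metric against thresholds and adjusts the node count within the policy's bounds. A hypothetical sketch of one such decision (metric, thresholds, and field names are illustrative, not Cloudbreak's configuration schema):

```python
# Toy autoscaling policy: scale on CPU, clamped to min/max cluster size.

def autoscale(current_nodes, cpu_pct, policy):
    if cpu_pct > policy["scale_up_above"]:
        target = current_nodes + policy["step"]
    elif cpu_pct < policy["scale_down_below"]:
        target = current_nodes - policy["step"]
    else:
        target = current_nodes               # within the comfort band: no change
    return max(policy["min"], min(policy["max"], target))

policy = {"scale_up_above": 80, "scale_down_below": 20,
          "step": 2, "min": 3, "max": 10}
print(autoscale(5, 90, policy))  # → 7
print(autoscale(5, 10, policy))  # → 3
```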
• Apache ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services
All of these kinds of services are used in some form or another by distributed applications
• Distributed applications can use ZooKeeper to store and mediate updates to important configuration
information
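ZooKeeper stores that configuration in a hierarchical namespace of "znodes" addressed by slash-separated paths, each holding a small blob of data. A minimal sketch of that data model (no watches, ephemerals, or quorum replication, which real ZooKeeper adds):

```python
# Toy znode tree mirroring ZooKeeper's path-addressed namespace.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}            # the root znode always exists
    def create(self, path, data):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no parent znode: " + parent)
        self.nodes[path] = data
    def get(self, path):
        return self.nodes[path]

zk = ZNodeTree()
zk.create("/config", b"")
zk.create("/config/db_url", b"jdbc:mysql://db:3306")
print(zk.get("/config/db_url"))  # → b'jdbc:mysql://db:3306'
```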
• Oozie is a Java-based workflow scheduler system to manage Apache Hadoop jobs
• Apache Zeppelin is a Web-based notebook that enables data-driven, interactive data analytics and
collaborative documents
• Documents can contain SparkSQL, SQL, Scala, Python, JDBC connection, and much more
• Notebooks combine code samples, source data, descriptive markup, result sets, and rich visualizations in
one place
• The Ambari web interface includes a built-in set of Views that are predeployed for you to use with your cluster
• Includes Views for Hive, Pig, Tez, Capacity Scheduler, Files, and HDFS
• The Ambari Views Framework allows developers to create new user interface components that plug into the Ambari Web UI
• Big SQL
• Big Replicate
• BigQuality
• BigIntegrate
• Big Match
• Provides active-active data replication for Hadoop across supported environments, distributions, and hybrid
deployments
• Replicates data automatically with guaranteed consistency across Hadoop clusters running on any distribution, cloud object storage, and local and NFS-mounted file systems
• Provides SDK to extend Big Replicate replication to virtually any data source
• Patented distributed coordination engine enables:
Guaranteed data consistency across any number of sites at any distance
Minimized RTO/RPO
• Totally non-invasive
No modification to source code
Easy to turn on/off
• IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of
products that enable you to understand, cleanse, monitor, transform, and deliver data, as well as to
collaborate to bridge the gap between business and IT.
• You can profile, validate, cleanse, transform, and integrate your big data on Hadoop, an open source
framework that can manage large volumes of structured and unstructured data.
Ingest, transform, process and deliver any data into & within Hadoop
• Watson Studio is a collaborative platform for data scientists, built on open source components and IBM
added value, available in the cloud or on premises.
• https://datascience.ibm.com/
2) Which components provide the capability to move data from a relational database into Hadoop?
5) True or False? Data Science capabilities can be achieved using only HDP.