Big Data Ecosystem 2

The document provides an introduction to the Hortonworks Data Platform (HDP), detailing its features, components, and various tools for data management and processing. It highlights the IBM value-add components, including Big SQL, Big Replicate, BigQuality, and BigIntegrate, along with their functionalities. Additionally, it covers key tools like Apache Hive, Apache Spark, and Apache Kafka, which facilitate data access, processing, and governance within the Hadoop ecosystem.

Introduction to Hortonworks Data Platform (HDP)

Data Science Foundations

© Copyright IBM Corporation 2018


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives

• Describe the functions and features of HDP


• List the IBM value-add components
• Explain what IBM Watson Studio is
• Give a brief description of the purpose of each of the value-add components



Hortonworks Data Platform (HDP)

• HDP is a platform for data-at-rest


• Secure, enterprise-ready open source Apache Hadoop distribution based on a centralized architecture
(YARN)
• HDP is:
 Open
 Central
 Interoperable
 Enterprise ready





Appendix B
Data workflow



Sqoop

• Tool to easily import data from structured databases (Db2, MySQL, Netezza, Oracle, etc.) into your Hadoop cluster and related Hadoop systems such as Hive and HBase (see the sketch below)
• Can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
• Helps offload tasks such as ETL from the enterprise data warehouse to Hadoop for lower-cost, efficient execution
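
A minimal sketch of driving such an import from Python by shelling out to the Sqoop CLI; the JDBC URL, credentials, table name, and target directory are placeholder values, not part of the course material:

```python
import subprocess

# Illustrative only: import one MySQL table into HDFS with Sqoop.
# Host, database, table, and credential paths are placeholders.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.db_password",  # avoid passwords on the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
]
subprocess.run(sqoop_cmd, check=True)
```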



Flume

• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of streaming event data.
• Flume helps you aggregate data from many sources, manipulate the data, and then add the data into
your Hadoop environment.
• Its functionality is now largely superseded by HDF (Hortonworks DataFlow) / Apache NiFi.



Kafka

• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
 Used for building real-time data pipelines and streaming apps
• Often used in place of traditional message brokers (such as JMS- or AMQP-based systems) because of its higher throughput, reliability, and replication.
• Kafka works in combination with a variety of Hadoop tools (a minimal producer/consumer sketch follows this list):
 Apache Storm
 Apache HBase
 Apache Spark
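
A minimal producer/consumer sketch assuming the kafka-python client; the broker address and topic name are placeholders:

```python
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "broker1:6667"  # placeholder; HDP typically exposes Kafka on port 6667

# Publish a few events to a topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
for i in range(3):
    producer.send("clickstream", value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consume them back from the beginning of the topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)
```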



Data access



Hive

• Apache Hive is a data warehouse system built on top of Hadoop.


• Hive facilitates easy data summarization, ad-hoc queries, and the analysis of very large datasets that are
stored in Hadoop.
• Hive provides SQL on Hadoop
 Provides a SQL interface, known as HiveQL (HQL), which allows for easy querying of data in Hadoop (see the sketch below)
• Includes HCatalog
 A global metadata management layer that exposes Hive table metadata to other Hadoop applications.
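
A minimal sketch of querying Hive from Python, assuming the PyHive package and a reachable HiveServer2; the host, username, and the sales.orders table are illustrative:

```python
from pyhive import hive  # assumes PyHive and a running HiveServer2

conn = hive.connect(host="hive-host", port=10000, username="analyst")  # placeholders
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; the query runs as a job on the cluster.
cursor.execute("""
    SELECT product_category, COUNT(*) AS orders
    FROM sales.orders
    GROUP BY product_category
    ORDER BY orders DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```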



Pig

• Apache Pig is a platform for analyzing large data sets.


• Pig was designed for scripting a long series of data operations (good for ETL)
 Pig consists of a high-level language called Pig Latin, which was designed to simplify MapReduce
programming.
• Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs from the Pig Latin code that you write.
• The system is able to optimize your code and "translate" it into MapReduce, allowing you to focus on semantics rather than efficiency.



HBase

• Apache HBase is a distributed, scalable, big data store.


• Use Apache HBase when you need random, real-time read/write access to your Big Data (see the sketch below).
 The goal of the HBase project is to handle very large tables of data running on clusters of commodity hardware.
• HBase is modeled after Google's BigTable and provides BigTable-like capabilities on top of Hadoop and
HDFS. HBase is a NoSQL datastore.
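
A minimal sketch assuming the happybase client and the HBase Thrift server; the host, table name, and column family are placeholders:

```python
import happybase  # assumes the HBase Thrift server is running on the cluster

connection = happybase.Connection("hbase-thrift-host")  # placeholder host
table = connection.table("user_profiles")               # placeholder table

# Write one cell and read the row back; keys and values are raw bytes.
table.put(b"user-1001", {b"info:email": b"jane@example.com"})
row = table.row(b"user-1001")
print(row[b"info:email"])
```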



Accumulo

• Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and
retrieval.
• Based on Google’s BigTable and runs on YARN
 Think of it as a "highly secure HBase"
• Features:
 Server-side programming
 Designed to scale
 Cell-based access control
 Stable



Phoenix

• Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications by
combining the best of both worlds:
 The power of standard SQL and JDBC APIs with full ACID transaction capabilities.
 The flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase
as its backing store.
• Essentially this is SQL for NoSQL
• Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce
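
A minimal sketch assuming the phoenixdb package and a running Phoenix Query Server; the host, port, and table are illustrative:

```python
import phoenixdb  # assumes phoenixdb and a Phoenix Query Server

# The Query Server usually listens on port 8765; host is a placeholder.
conn = phoenixdb.connect("http://phoenix-host:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute(
    "CREATE TABLE IF NOT EXISTS events (id BIGINT PRIMARY KEY, kind VARCHAR)"
)
cursor.execute("UPSERT INTO events VALUES (1, 'login')")   # Phoenix uses UPSERT, not INSERT
cursor.execute("SELECT id, kind FROM events")
print(cursor.fetchall())
```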



Storm

• Apache Storm is an open source distributed real-time computation system.


 Fast
 Scalable
 Fault-tolerant
• Used to process large volumes of high-velocity data
• Useful when milliseconds of latency matter and Spark isn't fast enough
 Has been benchmarked at over a million tuples processed per second per node



Solr

• Apache Solr is a fast, open source enterprise search platform built on the Apache Lucene Java search library
• Full-text indexing and search
 REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
• Highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced
querying, automated failover and recovery, centralized configuration and more
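
A minimal sketch of a full-text query against Solr's HTTP API using the Python requests library; the host and collection name are placeholders:

```python
import requests  # Solr is queried over plain HTTP; no client library required

SOLR_URL = "http://solr-host:8983/solr/articles/select"  # placeholder host and collection

# Full-text search for "hadoop", returning the top 5 documents as JSON.
response = requests.get(SOLR_URL, params={"q": "hadoop", "rows": 5, "wt": "json"})
for doc in response.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```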



Spark

• Apache Spark is a fast and general engine for large-scale data processing.
• Spark has a variety of advantages including:
 Speed
−Runs programs faster than MapReduce by keeping working data in memory
 Easy to use
−Write apps quickly with Java, Scala, Python, R
 Generality
−Can combine SQL, streaming, and complex analytics
 Runs on a variety of environments and can access diverse data sources
−Hadoop, Mesos, standalone, cloud…
−HDFS, Cassandra, HBase, S3…
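
A minimal PySpark word-count sketch; the HDFS path is a placeholder, and local mode is used only for illustration (on HDP the same code would run on YARN):

```python
from pyspark.sql import SparkSession

# Local mode for illustration; on a cluster use master="yarn".
spark = SparkSession.builder.appName("word-count-sketch").master("local[*]").getOrCreate()

lines = spark.read.text("hdfs:///data/raw/notes.txt")   # placeholder HDFS path
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count().orderBy("count", ascending=False)
counts.show(10)

spark.stop()
```
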
Druid

• Apache Druid is a high-performance, column-oriented, distributed data store.


 Interactive sub-second queries
−Unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and
extremely fast aggregations
 Real-time streams
−Lock-free ingestion to allow for simultaneous ingestion and querying of high dimensional, high
volume data sets
−Explore events immediately after they occur
 Horizontally scalable
 Deploy anywhere
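
A minimal sketch of a Druid SQL query over HTTP, assuming the Broker's SQL endpoint is enabled; the host, port, and the wikipedia datasource are illustrative:

```python
import requests  # Druid exposes a SQL endpoint over HTTP on the Broker

DRUID_SQL = "http://druid-broker:8082/druid/v2/sql"  # placeholder host; 8082 is a common Broker port

query = """
  SELECT channel, COUNT(*) AS edits
  FROM wikipedia
  GROUP BY channel
  ORDER BY edits DESC
  LIMIT 5
"""
response = requests.post(DRUID_SQL, json={"query": query})
print(response.json())
```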



Data Lifecycle and Governance



Falcon

• Framework for managing data life cycle in Hadoop clusters


• Data governance engine
 Defines, schedules, and monitors data management policies
• Hadoop admins can centrally define their data pipelines
 Falcon uses these definitions to auto-generate workflows in Oozie
• Addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage
tracing by deploying a framework for data management and processing



Atlas

• Apache Atlas is a scalable and extensible set of core foundational governance services
 Enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop
• Exchange metadata with other tools and processes within and outside of the Hadoop stack
 Allows integration with the whole enterprise data ecosystem
• Atlas Features:
 Data Classification
 Centralized Auditing
 Centralized Lineage
 Security & Policy Engine



Security



Ranger

• Centralized security framework to enable, monitor and manage comprehensive data security across the
Hadoop platform
• Manage fine-grained access control over Hadoop data access components like Apache Hive and Apache
HBase
• Using the Ranger console, administrators can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
 Policies enforced within Hadoop



Knox

• REST API and Application Gateway for the Apache Hadoop Ecosystem
• Provides perimeter security for Hadoop clusters
• Single access point for all REST interactions with Apache Hadoop clusters
• Integrates with prevalent SSO and identity management systems
 Simplifies Hadoop security for users who access cluster data and execute jobs
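
A minimal sketch of calling WebHDFS through the Knox gateway with the Python requests library; the gateway URL, topology name ("default"), path, and credentials are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

# All traffic goes through the Knox gateway instead of individual service endpoints.
GATEWAY = "https://knox-host:8443/gateway/default"   # placeholder gateway and topology

resp = requests.get(
    f"{GATEWAY}/webhdfs/v1/tmp?op=LISTSTATUS",       # WebHDFS proxied through Knox
    auth=HTTPBasicAuth("analyst", "password"),       # placeholder credentials
    verify=False,                                    # demo only: skip TLS verification for self-signed certs
)
print(resp.json())
```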



Operations



Ambari

• For provisioning, managing, and monitoring Apache Hadoop clusters.

• Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs

• Ambari REST APIs

 Allow application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications (a minimal REST call is sketched below)
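
A minimal sketch of the Ambari REST API using the Python requests library; the host, port, and credentials are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

AMBARI = "http://ambari-host:8080/api/v1"   # placeholder host; 8080 is the default Ambari port
auth = HTTPBasicAuth("admin", "admin")      # placeholder credentials
headers = {"X-Requested-By": "ambari"}      # header expected by the Ambari REST API

# List clusters, then the services installed on the first one.
clusters = requests.get(f"{AMBARI}/clusters", auth=auth, headers=headers).json()
cluster_name = clusters["items"][0]["Clusters"]["cluster_name"]

services = requests.get(f"{AMBARI}/clusters/{cluster_name}/services",
                        auth=auth, headers=headers).json()
for item in services["items"]:
    print(item["ServiceInfo"]["service_name"])
```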



The Ambari web interface



Cloudbreak

• A tool for provisioning and managing Apache Hadoop clusters in the cloud
• Automates launching of elastic Hadoop clusters
• Policy-based autoscaling on the major cloud infrastructure platforms, including:
 Microsoft Azure
 Amazon Web Services
 Google Cloud Platform
 OpenStack
 Platforms that support Docker containers



ZooKeeper

• Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

 All of these kinds of services are used in some form or another by distributed applications

 Saves time so you don't have to develop your own

• It is fast, reliable, simple and ordered

• Distributed applications can use ZooKeeper to store and mediate updates to important configuration
information
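
A minimal sketch using the kazoo client to store and read back a small piece of shared configuration; the quorum address and znode path are placeholders:

```python
from kazoo.client import KazooClient  # assumes the kazoo package

zk = KazooClient(hosts="zk-host:2181")  # placeholder quorum address
zk.start()

# Store a small piece of shared configuration and read it back.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=enabled")
value, stat = zk.get("/app/config")
print(value, stat.version)

zk.stop()
```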



Oozie

• Oozie is a Java-based workflow scheduler system to manage Apache Hadoop jobs

• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions

• Integrated with the Hadoop stack

 YARN is its architectural center

 Supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop
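
A minimal sketch of Oozie's REST API queried with the Python requests library; the host is a placeholder (11000 is the usual Oozie port):

```python
import requests  # Oozie exposes a REST API, typically on port 11000

OOZIE = "http://oozie-host:11000/oozie"   # placeholder host

# Check the server status, then list a few recent workflow jobs.
print(requests.get(f"{OOZIE}/v1/admin/status").json())

jobs = requests.get(f"{OOZIE}/v1/jobs", params={"jobtype": "wf", "len": 5}).json()
for job in jobs.get("workflows", []):
    print(job["id"], job["status"])
```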



Tools



Zeppelin

• Apache Zeppelin is a Web-based notebook that enables data-driven, interactive data analytics and
collaborative documents

• Documents can contain SparkSQL, SQL, Scala, Python, JDBC connections, and much more

• Easy for both end-users and data scientists to work with

• Notebooks combine code samples, source data, descriptive markup, result sets, and rich visualizations in
one place



Zeppelin GUI



Ambari Views

• Ambari web interface includes a built-in set of Views that are predeployed for you to use with your cluster

• These GUI components increase ease-of-use

• Includes views for Hive, Pig, Tez, Capacity Scheduler, Files, and HDFS

• The Ambari Views Framework allows developers to create new user interface components that plug into the Ambari Web UI



IBM value-add components

• Big SQL

• Big Replicate

• BigQuality

• BigIntegrate

• Big Match



Big SQL is SQL on Hadoop

• Big SQL builds on the Apache Hive foundation

 Integrates with the Hive metastore
 Instead of MapReduce, uses a powerful native C/C++ MPP engine
• Provides a view on your data residing in the Hadoop file system
• No proprietary storage format
• Modern SQL:2011 capabilities
• The same SQL can be used on your warehouse data with little or no modification (see the sketch below)
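
A minimal sketch of connecting to Big SQL through the standard Db2 driver (ibm_db); the head-node host, port, database name, and credentials below are illustrative assumptions and vary by installation:

```python
import ibm_db  # Big SQL is reached through standard Db2 client drivers

# Connection details are illustrative; the Big SQL head node port and database
# name depend on the installation ("BIGSQL" and 32051 are common defaults).
conn = ibm_db.connect(
    "DATABASE=BIGSQL;HOSTNAME=bigsql-head;PORT=32051;PROTOCOL=TCPIP;"
    "UID=bigsql;PWD=password;", "", "")

# The same SQL you would run against a warehouse, executed over data in HDFS.
stmt = ibm_db.exec_immediate(
    conn, "SELECT product, SUM(qty) FROM sales.orders GROUP BY product")
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
```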



Big Replicate

• Provides active-active data replication for Hadoop across supported environments, distributions, and hybrid
deployments
• Replicates data automatically with guaranteed consistency across Hadoop clusters running on any
distribution, cloud object storage and local and NFS mounted file systems
• Provides SDK to extend Big Replicate replication to virtually any data source
• Patented distributed coordination engine enables:
 Guaranteed data consistency across any
number of sites at any distance
 Minimized RTO/RPO
• Totally non-invasive
 No modification to source code
 Easy to turn on/off



Information Server and Hadoop: BigQuality and BigIntegrate

• IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of
products that enable you to understand, cleanse, monitor, transform, and deliver data, as well as to
collaborate to bridge the gap between business and IT.

• Information Server can now be used with Hadoop

• You can profile, validate, cleanse, transform, and integrate your big data on Hadoop, an open source
framework that can manage large volumes of structured and unstructured data.

• This functionality is available with the following product offerings

 IBM BigIntegrate: Provides data integration features of Information Server.

 IBM BigQuality: Provides data quality features of Information Server.



Information Server - BigIntegrate:
Ingest, transform, process, and deliver any data into and within Hadoop

Satisfy the most complex transformation requirements with the most scalable runtime available, in batch or real time
• Connect
 Connect to wide range of traditional enterprise data
sources as well as Hadoop data sources
 Native connectors with highest level of performance
and scalability for key data sources
• Design & Transform
 Transform and aggregate any data volume
 Benefit from hundreds of built-in transformation
functions
 Leverage metadata-driven productivity and enable
collaboration
• Manage & Monitor
 Use a simple, web-based dashboard to manage
your runtime environment



Information Server - BigQuality:
Analyze, cleanse and monitor your big data
Most comprehensive data quality capabilities that run natively on Hadoop
• Analyze
 Discovers data of interest to the organization based on
business defined data classes
 Analyzes data structure, content and quality
 Automates your data analysis process
• Cleanse
 Investigate, standardize, match and survive data at
scale and with the full power of common data integration
processes
• Monitor
 Assess and monitor the quality of your data in any
place and across systems
 Align quality indicators to business policies
 Engage data steward team when issues exceed
thresholds of the business



IBM InfoSphere Big Match for Hadoop

Big Match is a Probabilistic Matching Engine (PME) running natively within Hadoop for Customer Data Matching



Watson Studio (formerly Data Science Experience (DSX))

• Watson Studio is a collaborative platform for data scientists, built on open source components and IBM
added value, available in the cloud or on premises.
• https://datascience.ibm.com/



Checkpoint

1) List the components of HDP that provide data access capabilities.

2) List the components that provide the capability to move data from relational databases into Hadoop.

3) Managing Hadoop clusters can be accomplished using which component?

4) True or False? The following components are value-add from IBM:

Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match

5) True or False? Data Science capabilities can be achieved using only HDP.



Checkpoint solution

1) List the components of HDP that provide data access capabilities.


 MapReduce, Pig, Hive, HBase, Phoenix, Spark, and more!
2) List the components that provide the capability to move data from relational databases into Hadoop.
 Sqoop, Flume, Kafka
3) Managing Hadoop clusters can be accomplished using which component?
 Ambari
4) True or False? The following components are value-add from IBM:
Big Replicate, Big SQL, BigIntegrate, BigQuality, Big Match
 True
5) True or False? Data Science capabilities can be achieved using only HDP.
 False. Data Science capabilities also require Watson Studio.



Unit summary

• Describe the functions and features of HDP

• List the IBM value-add components

• Explain what IBM Watson Studio is

• Give a brief description of the purpose of each of the value-add components

