Big Data Ecosystem
• Apache Sqoop is a tool to easily import data from structured databases (Db2, MySQL, Netezza, Oracle, etc.) into your Hadoop cluster and related Hadoop systems (such as Hive and HBase)
• Can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
• Helps offload tasks such as ETL from the enterprise data warehouse to Hadoop for lower-cost, more efficient execution
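Sqoop parallelizes an import by splitting the table's primary-key range across map tasks, each of which reads one slice with a range `WHERE` clause. A rough sketch of that split idea (the function name and even-split logic are illustrative, not Sqoop's actual code):

```python
# Hypothetical sketch of a Sqoop-style parallel import: divide a table's
# primary-key range into N near-even slices, one per map task.

def split_key_range(min_id, max_id, num_mappers):
    """Return half-open (lo, hi) ranges covering [min_id, max_id]."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        ranges.append((lo, lo + size))
        lo += size
    return ranges

# Each range becomes a map task's query: WHERE id >= lo AND id < hi
print(split_key_range(1, 100, 4))  # → [(1, 26), (26, 51), (51, 76), (76, 101)]
```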
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and
moving large amounts of streaming event data.
• Flume helps you aggregate data from many sources, manipulate the data, and then add the data into
your Hadoop environment.
• Its functionality is now superseded by HDF / Apache Nifi.
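Flume's data flow is built from a source (where events enter), a channel (a buffer), and a sink (where events leave, e.g. into HDFS). A minimal in-memory sketch of that pipeline shape (class and function names are invented for illustration, not the Flume API):

```python
from collections import deque

# Toy source -> channel -> sink pipeline mirroring Flume's event-flow model.

class Channel:
    """Buffers events between a source and a sink."""
    def __init__(self):
        self.buf = deque()
    def put(self, event):
        self.buf.append(event)
    def take(self):
        return self.buf.popleft() if self.buf else None

def source(channel, lines):
    for line in lines:
        channel.put({"body": line})   # Flume events carry headers + a body

def sink(channel, out):
    while (event := channel.take()) is not None:
        out.append(event["body"])     # a real sink would write to HDFS

ch, out = Channel(), []
source(ch, ["evt1", "evt2"])
sink(ch, out)
print(out)  # → ['evt1', 'evt2']
```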
• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system used for building real-time data pipelines and streaming apps
• Often used in place of traditional message brokers (such as JMS- or AMQP-based systems) because of its higher throughput, reliability, and replication
• Kafka works in combination with a variety of Hadoop tools:
Apache Storm
Apache HBase
Apache Spark
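Kafka's core abstractions are topics split into partitions, each an append-only log; messages with the same key land in the same partition, and consumers track their position as an offset. A small in-memory sketch of those concepts (not the real client API):

```python
# Illustrative sketch of Kafka's topic/partition/offset model.

class Topic:
    def __init__(self, partitions=2):
        self.logs = [[] for _ in range(partitions)]   # one append-only log each
    def produce(self, key, value):
        p = hash(key) % len(self.logs)      # key -> partition, as Kafka does
        self.logs[p].append(value)
        return p, len(self.logs[p]) - 1     # (partition, offset)
    def consume(self, partition, offset):
        return self.logs[partition][offset:]  # read everything from an offset

t = Topic()
p, off = t.produce("user-1", "click")
t.produce("user-1", "purchase")     # same key -> same partition: order preserved
print(t.consume(p, off))  # → ['click', 'purchase']
```

The same-key-same-partition rule is what gives Kafka per-key ordering without a global ordering guarantee.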
• Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and
retrieval.
• Based on Google’s BigTable and runs on YARN
Think of it as a "highly secure HBase"
• Features:
Server-side programming
Designed to scale
Cell-based access control
Stable
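Accumulo's signature feature is cell-level access control: every key/value cell carries a visibility label, and a scan only returns cells the caller's authorizations satisfy. A simplified sketch (single-token labels; real Accumulo labels are boolean expressions like `admin|audit`):

```python
import bisect

# Toy sorted key/value store with Accumulo-style per-cell visibility labels.

class CellStore:
    def __init__(self):
        self.cells = []   # kept sorted by key, like Accumulo's sorted store
    def put(self, key, visibility, value):
        bisect.insort(self.cells, (key, visibility, value))
    def scan(self, auths):
        # Only cells whose label is among the caller's authorizations survive.
        return [(k, v) for k, vis, v in self.cells if vis in auths]

store = CellStore()
store.put("row1", "public", "hello")
store.put("row2", "secret", "classified")
print(store.scan({"public"}))            # → [('row1', 'hello')]
print(store.scan({"public", "secret"}))  # → [('row1', 'hello'), ('row2', 'classified')]
```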
• Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications by
combining the best of both worlds:
The power of standard SQL and JDBC APIs with full ACID transaction capabilities.
The flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase
as its backing store.
• Essentially this is SQL for NoSQL
• Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce
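The "SQL for NoSQL" idea can be pictured as Phoenix mapping each SQL row onto HBase-style cells, one `(rowkey, column) -> value` entry per column, so a `SELECT` becomes a filtered scan over those cells. A hedged stdlib sketch of that mapping (function names are invented; Phoenix's real encoding is more involved):

```python
# Illustrative row-to-cells mapping behind "SQL over a key/value store".

def upsert(kv, table, pk, row):
    for col, val in row.items():
        kv[(table, pk, col)] = val     # one HBase-style cell per column

def select(kv, table, columns):
    rows = {}
    for (t, pk, col), val in kv.items():
        if t == table and col in columns:
            rows.setdefault(pk, {})[col] = val
    return rows

kv = {}
upsert(kv, "users", "u1", {"name": "Ada", "city": "London"})
print(select(kv, "users", {"name"}))  # → {'u1': {'name': 'Ada'}}
```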
• Apache Solr is a fast, open source enterprise search platform built on the Apache Lucene Java search library
• Full-text indexing and search
REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
• Highly reliable, scalable, and fault tolerant, providing distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration, and more
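Full-text search in Solr rests on Lucene's inverted index: each term maps to the set of documents containing it, and a multi-term query intersects those sets. A minimal sketch of that core idea (no scoring, tokenization, or analysis, which real Lucene adds):

```python
from collections import defaultdict

# Toy inverted index: term -> set of document ids containing it.

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())   # AND semantics: intersect postings
    return result

docs = {1: "Hadoop scales out", 2: "Solr indexes Hadoop logs"}
idx = build_index(docs)
print(search(idx, "hadoop"))       # → {1, 2}
print(search(idx, "hadoop logs"))  # → {2}
```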
• Apache Spark is a fast and general engine for large-scale data processing.
• Spark has a variety of advantages including:
Speed
−Runs programs faster than MapReduce by processing data in memory
Easy to use
−Write apps quickly with Java, Scala, Python, R
Generality
−Can combine SQL, streaming, and complex analytics
Runs on variety of environments and can access diverse data sources
−Hadoop, Mesos, standalone, cloud…
−HDFS, Cassandra, HBase, S3…
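Spark's "easy to use" claim comes from chaining small functional transformations (flatMap, map, reduceByKey) instead of writing MapReduce boilerplate. Since a Spark cluster is not assumed here, this stdlib sketch mirrors the shape of the classic RDD word count; the commented PySpark chain is the rough equivalent:

```python
from functools import reduce
from collections import Counter

lines = ["spark is fast", "spark is general"]

# Roughly equivalent PySpark (illustrative, requires a SparkContext `sc`):
#   sc.parallelize(lines).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

words = (w for line in lines for w in line.split())                   # flatMap
counts = reduce(lambda acc, w: acc + Counter([w]), words, Counter())  # reduceByKey
print(counts["spark"])  # → 2
```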
Introduction to Hortonworks Data Platform (HDP) © Copyright IBM Corporation 2018
• Apache Druid is a high-performance, distributed data store for real-time analytics on large event datasets
• Apache Atlas is a scalable and extensible set of core foundational governance services
Enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop
• Exchanges metadata with other tools and processes within and outside of the Hadoop stack
Allows integration with the whole enterprise data ecosystem
• Atlas Features:
Data Classification
Centralized Auditing
Centralized Lineage
Security & Policy Engine
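Centralized lineage, one of the Atlas features above, amounts to maintaining a directed graph of dataset-and-process dependencies and walking it to answer questions like "what feeds this report?". A small sketch of that walk (edge map and names are hypothetical, not Atlas's data model):

```python
# Toy lineage graph: {target: [sources]} edges, walked upstream.

def upstream(edges, node, seen=None):
    """Return all ancestors of `node` in the lineage graph."""
    seen = set() if seen is None else seen
    for src in edges.get(node, []):
        if src not in seen:
            seen.add(src)
            upstream(edges, src, seen)   # recurse through transitive sources
    return seen

edges = {"report": ["etl_job"], "etl_job": ["raw_events", "dim_users"]}
print(sorted(upstream(edges, "report")))  # → ['dim_users', 'etl_job', 'raw_events']
```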
• Apache Ranger is a centralized security framework to enable, monitor, and manage comprehensive data security across the Hadoop platform
• Manage fine-grained access control over Hadoop data access components like Apache Hive and Apache
HBase
• Using the Ranger console, you can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
Policies enforced within Hadoop
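The fine-grained, group-based policy model above boils down to: does any policy whose resource pattern matches the requested resource grant this access to one of the user's groups? A hedged sketch of that evaluation (policy shape and names are invented, not the Ranger API):

```python
# Toy Ranger-style allow policy: resource pattern + groups + access types.

POLICIES = [
    {"resource": "db.sales.*", "groups": {"analysts"}, "access": {"select"}},
]

def matches(pattern, resource):
    if pattern.endswith(".*"):
        return resource.startswith(pattern[:-1])   # simple prefix wildcard
    return pattern == resource

def is_allowed(policies, user_groups, resource, access):
    return any(matches(p["resource"], resource)
               and bool(user_groups & p["groups"])
               and access in p["access"]
               for p in policies)

print(is_allowed(POLICIES, {"analysts"}, "db.sales.orders", "select"))  # → True
print(is_allowed(POLICIES, {"interns"}, "db.sales.orders", "select"))   # → False
```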
• Apache Knox is a REST API and application gateway for the Apache Hadoop ecosystem
• Provides perimeter security for Hadoop clusters
• Single access point for all REST interactions with Apache Hadoop clusters
• Integrates with prevalent SSO and identity management systems
Simplifies Hadoop security for users who access cluster data and execute jobs
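A gateway as a "single access point" means clients hit one authenticated endpoint, and the gateway rewrites each request to the right internal cluster service. A toy sketch of that routing (paths and internal hostnames are invented for illustration):

```python
# Toy Knox-style routing table: gateway path prefix -> internal service URL.

ROUTES = {
    "/gateway/default/webhdfs": "http://namenode.internal:50070/webhdfs",
    "/gateway/default/hbase":   "http://hbase.internal:8080",
}

def route(path, authenticated):
    if not authenticated:
        return None                      # perimeter security: reject first
    for prefix, target in ROUTES.items():
        if path.startswith(prefix):
            return target + path[len(prefix):]
    return None

print(route("/gateway/default/webhdfs/v1/tmp", True))
# → http://namenode.internal:50070/webhdfs/v1/tmp
```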
• Apache Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Allows application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications
• Cloudbreak is a tool for provisioning and managing Apache Hadoop clusters in the cloud
• Automates launching of elastic Hadoop clusters
• Policy-based autoscaling on the major cloud infrastructure platforms, including:
Microsoft Azure
Amazon Web Services
Google Cloud Platform
OpenStack
Platforms that support Docker containers
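Policy-based autoscaling, at its core, compares a cluster metric against thresholds and adjusts the node count within the policy's bounds. A hypothetical sketch of one such decision (metric, thresholds, and field names are illustrative, not Cloudbreak's configuration schema):

```python
# Toy autoscaling policy: scale on CPU, clamped to min/max cluster size.

def autoscale(current_nodes, cpu_pct, policy):
    if cpu_pct > policy["scale_up_above"]:
        target = current_nodes + policy["step"]
    elif cpu_pct < policy["scale_down_below"]:
        target = current_nodes - policy["step"]
    else:
        target = current_nodes               # within the comfort band: no change
    return max(policy["min"], min(policy["max"], target))

policy = {"scale_up_above": 80, "scale_down_below": 20,
          "step": 2, "min": 3, "max": 10}
print(autoscale(5, 90, policy))  # → 7
print(autoscale(5, 10, policy))  # → 3
```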
• Apache ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services
All of these kinds of services are used in some form or another by distributed applications
• Distributed applications can use ZooKeeper to store and mediate updates to important configuration
information
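ZooKeeper stores that configuration in a hierarchical namespace of "znodes" addressed by slash-separated paths, each holding a small blob of data. A minimal sketch of that data model (no watches, ephemerals, or quorum replication, which real ZooKeeper adds):

```python
# Toy znode tree mirroring ZooKeeper's path-addressed namespace.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}            # the root znode always exists
    def create(self, path, data):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no parent znode: " + parent)
        self.nodes[path] = data
    def get(self, path):
        return self.nodes[path]

zk = ZNodeTree()
zk.create("/config", b"")
zk.create("/config/db_url", b"jdbc:mysql://db:3306")
print(zk.get("/config/db_url"))  # → b'jdbc:mysql://db:3306'
```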
• Oozie is a Java-based workflow scheduler system to manage Apache Hadoop jobs
• Apache Zeppelin is a Web-based notebook that enables data-driven, interactive data analytics and
collaborative documents
• Documents can contain SparkSQL, SQL, Scala, Python, JDBC connection, and much more
• Notebooks combine code samples, source data, descriptive markup, result sets, and rich visualizations in
one place
• The Ambari web interface includes a built-in set of Views that are predeployed for you to use with your cluster
• Includes Views for Hive, Pig, Tez, Capacity Scheduler, Files, and HDFS
• The Ambari Views Framework allows developers to create new user interface components that plug into the Ambari Web UI
• Big SQL
• Big Replicate
• BigQuality
• BigIntegrate
• Big Match
• Provides active-active data replication for Hadoop across supported environments, distributions, and hybrid
deployments
• Replicates data automatically with guaranteed consistency across Hadoop clusters running on any distribution, cloud object storage, and local and NFS-mounted file systems
• Provides SDK to extend Big Replicate replication to virtually any data source
• Patented distributed coordination engine enables:
Guaranteed data consistency across any number of sites at any distance
Minimized RTO/RPO
• Totally non-invasive
No modification to source code
Easy to turn on/off
• IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of
products that enable you to understand, cleanse, monitor, transform, and deliver data, as well as to
collaborate to bridge the gap between business and IT.
• You can profile, validate, cleanse, transform, and integrate your big data on Hadoop, an open source
framework that can manage large volumes of structured and unstructured data.
Ingest, transform, process and deliver any data into & within Hadoop
• Watson Studio is a collaborative platform for data scientists, built on open source components and IBM
added value, available in the cloud or on premises.
• https://datascience.ibm.com/
2) Which components provide the capability to move data from a relational database into Hadoop?
5) True or False? Data Science capabilities can be achieved using only HDP.