MODULE 2 Hadoop Ecosystem Tools
Introduction to Hadoop Ecosystem Tools
Hadoop Ecosystem
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications.
• HDFS is used to store different types of large data sets (i.e. structured, unstructured and
semi-structured data).
• It helps us store data across the various nodes of a cluster and maintains metadata (a log
file describing the stored data).
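A minimal sketch in Java using the HDFS FileSystem API to write a small file into the cluster; the path /user/demo/sample.txt and the class name HdfsWriteExample are placeholders for this example, and the Hadoop client configuration is assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // handle to the distributed file system
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
                out.writeUTF("hello hdfs");                // the file's blocks are replicated across DataNodes
            }
            fs.close();
        }
    }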
MAPREDUCE
• MapReduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of
processing.
• MapReduce is a software framework which helps in writing applications that process large data
sets using distributed and parallel algorithms inside the Hadoop environment.
• The Map function performs actions like filtering, grouping and sorting.
• The Reduce function aggregates and summarizes the results produced by the Map function.
• The result generated by the Map function is a key-value pair (K, V) which acts as the input for
the Reduce function.
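To make the (K, V) flow concrete, here is a hedged word-count sketch in Java; the class names WordCountMapper and WordCountReducer are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: tokenize each line and emit (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                      // (K, V) pair handed to the Reduce phase
            }
        }
    }

    // Reduce: aggregate and summarize the counts produced by the Map phase.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }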
YARN
• YARN (Yet Another Resource Negotiator) is the brain of the Hadoop Ecosystem: it performs all
processing activities by allocating cluster resources and scheduling tasks.
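As a small illustration of talking to the ResourceManager, this hedged sketch lists the applications YARN is currently tracking through the YarnClient API; the yarn-site.xml configuration is assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration());                // picks up the ResourceManager address
            yarn.start();
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }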
HADOOP YARN ARCHITECTURE
PIG
• PIG was initially developed by Yahoo.
• The PIG tool has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment.
• The Pig Latin language has a SQL-like command structure.
• The compiler internally converts Pig Latin to MapReduce.
• It produces a sequential set of MapReduce jobs.
• It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
• The process of extracting data from different source systems and bringing it into the data
warehouse is commonly called ETL.
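A hedged sketch of driving such a data flow from Java through the PigServer API; the input path, field names and aliases are made up for this example. Each registered statement is compiled to MapReduce and the job chain runs when the result is stored.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);   // statements compile to MapReduce jobs
            pig.registerQuery("logs = LOAD '/user/demo/access.log' AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY user;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
            pig.store("counts", "/user/demo/url_counts");        // triggers the sequential MapReduce jobs
            pig.shutdown();
        }
    }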
HADOOP PIG ARCHITECTURE
APACHE HIVE
• Facebook created HIVE for people who are fluent with SQL.
• Basically, HIVE is a data warehousing component which performs reading, writing and managing large data sets
in a distributed environment using SQL-like interface.
• HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive Command Line and the JDBC/ODBC driver.
• The Hive command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to
establish a connection to the data storage (HDFS).
• Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set (batch)
processing and real-time data processing.
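A minimal sketch of the JDBC path mentioned above: connecting to HiveServer2 from Java and running an HQL query. The table visits, the user name and the default port 10000 are assumptions for this example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 is assumed to listen on the default port 10000.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hiveuser", "");
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) FROM visits GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }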
Mahout
• Mahout performs machine learning operations such as collaborative filtering,
clustering, classification and frequent itemset mining.
• Classification: It means classifying and categorizing data into
various sub-categories; for example, articles can be categorized into
blogs, news, essays and research papers.
• Frequent itemset mining: Here Mahout checks which objects are
likely to appear together and makes suggestions if one of
them is missing. For example, a cell phone and its cover are
generally bought together, so if you search for a cell phone,
it will also recommend the cover and cases.
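To make the collaborative-filtering idea concrete, here is a hedged sketch using Mahout's Taste recommender API; the ratings.csv file, user id 42 and the neighbourhood size are placeholders.

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class PhoneCoverRecommender {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));   // lines of userId,itemId,preference
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            for (RecommendedItem item : recommender.recommend(42L, 3)) {    // top-3 suggestions for user 42
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }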
APACHE SPARK
• Apache Spark is a framework used for real-time data analytics in a
distributed computing environment.
• Spark is written in Scala and was originally developed at
the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing
over MapReduce.
• It can be up to 100x faster than Hadoop MapReduce for large-scale data processing, by
exploiting in-memory computations and other optimizations.
• Therefore, it requires more processing power and memory than MapReduce.
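A short Java sketch of Spark's in-memory processing model: a word count whose intermediate RDDs stay in memory between transformations. The input/output paths and the local master URL are assumptions for this example.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);            // intermediate results are kept in memory
            counts.saveAsTextFile("hdfs:///user/demo/wordcounts");
            sc.stop();
        }
    }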
Working of Spark Architecture
APACHE HBASE
• HBase is an open source, non-relational distributed database.
• It supports all types of data, and that is why it is capable of handling anything and everything inside a
Hadoop ecosystem.
• It is modelled after Google’s BigTable, which is a distributed storage system designed to cope with
large data sets.
• HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
• HBase is written in Java, while HBase applications can be accessed through REST, Avro and Thrift APIs.
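A minimal sketch of the HBase Java client API writing and reading back a single cell; the table users and the column family info are assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);                                     // write one cell on top of HDFS
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }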
APACHE DRILL
• As the name suggests, Apache Drill is used to drill into any kind of data.
• It’s an open source application which works with distributed environment to analyze
large data sets.
• It supports different kinds of NoSQL databases and file systems, which is a powerful
feature of Drill.
• For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB,
MapR-FS, Amazon S3, Swift, NAS and local files.
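To show how Drill queries raw files with plain SQL, here is a hedged JDBC sketch; the file /tmp/people.json, its fields and the single-Drillbit connection string are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillJdbcQuery {
        public static void main(String[] args) throws Exception {
            // Connects to a single Drillbit; the dfs storage plugin is assumed to be enabled.
            Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT name, age FROM dfs.`/tmp/people.json` LIMIT 5")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getInt("age"));
                }
            }
            conn.close();
        }
    }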
WORKING OF APACHE DRILL
APACHE ZOOKEEPER
• Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various tools in
a Hadoop Ecosystem.
• Before Zookeeper, it was very difficult and time consuming to coordinate between different tools in
Hadoop Ecosystem.
• Earlier, the services had many problems with interactions, such as sharing a common configuration while
synchronizing data.
• Even if the services are configured, changes in the configurations of the services make coordination
complex and difficult to handle.
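A small sketch of that coordination idea: storing a shared configuration value in a ZooKeeper znode that every service in the cluster can read. The znode path /demo-config and the stored value are made up for this example.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();                 // session established with the ensemble
                }
            });
            connected.await();
            // Publish a configuration value that other services can read and watch.
            zk.create("/demo-config", "batch.size=500".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }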
APACHE OOZIE
• Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem.
• It schedules Hadoop jobs and binds them together as one logical unit of work.
• Oozie workflow: These are sequential sets of actions to be executed. You can think of
it as a relay race, where each athlete waits for the previous one to complete their part.
• Oozie Coordinator: These are the Oozie jobs which are
triggered when the data is made available to them.
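A hedged sketch of submitting a workflow through the Oozie Java client; the Oozie server URL, the HDFS application path and the nameNode/resourceManager properties are placeholders, and the workflow.xml describing the chained actions is assumed to already be in HDFS.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitOozieWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/etl-workflow");
            conf.setProperty("nameNode", "hdfs://localhost:8020");
            conf.setProperty("resourceManager", "localhost:8032");
            String jobId = client.run(conf);                    // submit and start the workflow
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println(jobId + " " + job.getStatus());
        }
    }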
WORKING OF APACHE OOZIE
APACHE SQOOP
• Sqoop imports data from external sources (such as relational databases) into related Hadoop ecosystem
components like HDFS, HBase or Hive.
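As a rough illustration, this sketch shells out to the sqoop command-line tool to copy one relational table into HDFS; the MySQL database, credentials, table name and target directory are placeholders.

    public class SqoopImportLauncher {
        public static void main(String[] args) throws Exception {
            // Invokes the sqoop CLI (assumed to be on the PATH of the client machine).
            ProcessBuilder pb = new ProcessBuilder(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://localhost/sales",
                    "--username", "demo",
                    "--table", "orders",                  // relational table to copy
                    "--target-dir", "/user/demo/orders",  // HDFS directory to write to
                    "-m", "1");                           // number of parallel map tasks
            pb.inheritIO();
            int exit = pb.start().waitFor();
            System.out.println("sqoop exited with code " + exit);
        }
    }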
WORKING OF APACHE SQOOP
APACHE FLUME
• Feeding data is an important part of our Hadoop Ecosystem.
• Flume is a tool which helps in feeding unstructured and semi-structured data into
HDFS.
• It helps us consume online streaming data from various sources like network traffic,
social media, email messages, log files etc. into HDFS.
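A minimal sketch of pushing one event to a running Flume agent from Java using Flume's RPC client; it assumes an agent with an Avro source listening on localhost:41414, whose channel and sink then forward the event into HDFS.

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeEventSender {
        public static void main(String[] args) throws Exception {
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody("user logged in", StandardCharsets.UTF_8);
                client.append(event);                     // handed to the agent's channel, then its sink
            } finally {
                client.close();
            }
        }
    }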
WORKING OF APACHE FLUME
APACHE SOLR & LUCENE
• Apache Solr and Apache Lucene are the two services which are used for searching and
indexing in Hadoop Ecosystem.
• If Apache Lucene is the engine, Apache Solr is the car built around it.
• It uses the Lucene Java search library as the core for search and full-text indexing.
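A short SolrJ sketch of that indexing-and-searching workflow; the core name articles and the document fields are assumptions for this example.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexAndSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Hadoop ecosystem overview");
            solr.add(doc);
            solr.commit();                                // make the document searchable
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument hit : response.getResults()) {
                System.out.println(hit.getFieldValue("id") + " " + hit.getFieldValue("title"));
            }
            solr.close();
        }
    }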
APACHE AMBARI
• Ambari is an Apache Software Foundation project which aims at making the
Hadoop ecosystem more manageable.
• It includes software for provisioning, managing and monitoring Apache
Hadoop clusters.
• Hadoop cluster provisioning:
• It gives us a step-by-step process for installing Hadoop services across a
number of hosts.
• It also handles configuration of Hadoop services over a cluster.
• Hadoop cluster management:
• It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
• Hadoop cluster monitoring: Ambari provides a dashboard for monitoring the
health and status of the cluster.
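As an illustration of the monitoring side, this hedged sketch calls Ambari's REST API to list the clusters it manages; the host name and the admin credentials are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class AmbariClusterStatus {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://ambari-host:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            conn.setRequestProperty("Authorization", "Basic " + auth);
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);             // JSON listing of the managed clusters
                }
            }
            conn.disconnect();
        }
    }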