MODULE 2 Hadoop Ecosystem Tools

The document provides an overview of the Hadoop ecosystem, detailing its core components such as HDFS for data storage, MapReduce for data processing, and YARN for resource management. It also introduces various tools like Apache Hive for data warehousing, Apache Pig for data flow, and Apache Spark for real-time analytics, among others. Additionally, it covers the roles of Apache Zookeeper, Oozie, Sqoop, Flume, Solr, and Ambari in managing and optimizing Hadoop operations.


Experiment-1
Introduction to Hadoop Ecosystem Tools
Hadoop Ecosystem

Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications.

• HDFS is used to store large data sets of different types (i.e. structured, semi-structured and
unstructured data).

• It stores data across the nodes of a cluster and maintains metadata (log files) describing the
stored data. A short usage sketch follows.
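A minimal sketch (not from the slides) of writing and reading a file through the Hadoop FileSystem Java API; the file path is a hypothetical placeholder and the configuration is assumed to come from the cluster's core-site.xml / hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // connect to the default file system (HDFS)

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");                // data is split into blocks and replicated
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());          // read the data back
        }
        fs.close();
    }
}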

MAPREDUCE
• MapReduce is the core processing component of the Hadoop ecosystem, as it provides the
processing logic.

• MapReduce is a software framework that helps in writing applications which process large data
sets using distributed and parallel algorithms inside the Hadoop environment.

• In a MapReduce program, Map() and Reduce() are the two functions.

• The Map function performs actions like filtering, grouping and sorting,

• while the Reduce function aggregates and summarizes the results produced by the Map function.

• The result generated by the Map function is a set of key-value pairs (K, V), which acts as the input
for the Reduce function. A word-count sketch of both functions is shown below.
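A minimal sketch of the Map() and Reduce() functions for a word-count job, using the standard Hadoop MapReduce Java API; the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);          // (K, V) pair passed to the Reducer
                }
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // aggregated result
        }
    }
}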

YARN
• YARN (Yet Another Resource Negotiator) is the brain of the Hadoop ecosystem.

• It manages all processing activities by allocating resources and scheduling tasks.

• It has two major components: the ResourceManager and the NodeManager.

• The ResourceManager is the master node of the processing layer.

• It receives processing requests and then passes parts of each request to the
corresponding NodeManagers, where the actual processing takes place. A sketch of a
job driver that submits work to YARN follows.
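A minimal sketch of a driver that submits the word-count job sketched earlier; when mapreduce.framework.name is set to "yarn", the ResourceManager accepts the request and the NodeManagers run the map and reduce tasks. The input and output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // submitted to YARN for execution
    }
}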

HADOOP YARN ARCHITECTURE

PIG
• Pig was initially developed by Yahoo!.
• Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
• The Pig Latin language has an SQL-like command structure.
• The compiler internally converts Pig Latin into a sequential set of MapReduce jobs.
• It gives you a platform for building data flows for ETL (Extract, Transform and Load), and for
processing and analyzing huge data sets (see the sketch below).
• The process of extracting data from different source systems and bringing it into the data
warehouse is commonly called ETL.
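A hedged sketch of running a small Pig Latin data flow from Java through PigServer (org.apache.pig); the input file, field names and output directory are hypothetical.

import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("mapreduce");  // Pig Latin is compiled into MapReduce jobs
        // Extract: load raw data from HDFS
        pig.registerQuery("raw = LOAD '/user/demo/visits.csv' USING PigStorage(',') " +
                          "AS (uid:chararray, url:chararray, ts:long);");
        // Transform: group and count visits per user
        pig.registerQuery("grouped = GROUP raw BY uid;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(raw);");
        // Load: store the result back into HDFS
        pig.store("counts", "/user/demo/visit_counts");
    }
}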

HADOOP PIG ARCHITECTURE

APACHE HIVE
• Facebook created Hive for people who are fluent in SQL.
• Basically, Hive is a data warehousing component which performs reading, writing and managing of large data sets
in a distributed environment using an SQL-like interface.
• HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive command line and the JDBC/ODBC driver.
• The Hive command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a
connection from applications to the data stored in HDFS; a JDBC sketch is shown below.

• Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set processing and real-time
(interactive) query processing.

• It supports all primitive data types and predefined functions of SQL.
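A minimal sketch of the JDBC path described above: connecting to HiveServer2 and running an HQL query. The host, credentials and the employees table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");             // Hive JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", ""); // hypothetical endpoint
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}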

Mahout
• Mahout performs operations such as collaborative filtering, clustering, classification and
frequent itemset mining.

• Collaborative filtering: Mahout mines user behaviours, patterns and characteristics and,
based on these, predicts and makes recommendations to users (see the sketch after this
list). The typical use case is an e-commerce website.

• Clustering: it organizes similar data into groups, e.g. articles can be grouped into blogs,
news, research papers, etc.

• Classification: it means classifying and categorizing data into
various sub-categories, e.g. articles can be categorized into
blogs, news, essays and research papers.
• Frequent itemset mining: here Mahout checks which objects
are likely to appear together and makes suggestions if one of
them is missing. For example, a cell phone and a cover are
usually bought together, so if you search for a cell phone it
will also recommend covers and cases.
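A hedged sketch of user-based collaborative filtering with Mahout's Taste Java API (Mahout 0.x); the ratings file and neighbourhood size are hypothetical.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference  (hypothetical file)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 3 items for user 42, based on similar users' behaviour
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}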

APACHE SPARK
• Apache Spark is a framework for real-time data analytics in a distributed
computing environment.
• Spark is written in Scala and was originally developed at the University of
California, Berkeley.
• It executes in-memory computations to increase the speed of data processing
over MapReduce.
• It can be up to 100x faster than Hadoop MapReduce for large-scale data
processing by exploiting in-memory computation and other optimizations.
• Therefore, it requires more memory and processing power than MapReduce.
A small word-count sketch using Spark's Java API follows.
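A minimal sketch of in-memory processing with Spark's Java API: a word count over a text file cached in memory. The HDFS paths are hypothetical, and the job is assumed to be launched with spark-submit.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input").cache(); // kept in memory
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///user/demo/output");  // hypothetical output path
        }
    }
}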

Working of Spark Architecture

APACHE HBASE
• HBase is an open source, non-relational distributed database.

• In other words, it is a NoSQL database.

• It supports all types of data, which is why it is capable of handling anything and everything inside a
Hadoop ecosystem.

• It is modelled after Google's BigTable, which is a distributed storage system designed to cope with
large data sets.

• HBase was designed to run on top of HDFS and provides BigTable-like capabilities.

• HBase itself is written in Java, while HBase applications can be accessed through REST, Avro and
Thrift APIs. A Java client sketch follows.
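A minimal sketch of the HBase Java client API: writing and reading one cell. The table name, column family and values are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Write: row key "u1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);
            // Read the value back
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}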

APACHE DRILL
• As the name suggests, Apache Drill is used to drill into any kind of data.

• It is an open-source application which works in a distributed environment to analyze
large data sets.

• It supports different kinds of NoSQL databases and file systems, which is a powerful
feature of Drill.

• For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB,
MapR-FS, Amazon S3, Swift, NAS and local files. A JDBC sketch follows.
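A hedged sketch of querying Drill over JDBC. It assumes Drill's embedded-mode connection string ("jdbc:drill:zk=local") and queries cp.`employee.json`, a sample file bundled with Drill; adjust the connection string for a real cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");                 // Drill JDBC driver
        try (Connection con = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}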

WORKING OF APACHE DRILL

APACHE ZOOKEEPER
• Apache Zookeeper is the coordinator of any Hadoop job, which involves a combination of various tools in
the Hadoop ecosystem.

• Apache Zookeeper coordinates the various services in a distributed environment.

• Before Zookeeper, it was very difficult and time consuming to coordinate the different tools in the
Hadoop ecosystem.

• The services earlier had many problems with interactions, such as sharing common configuration while
synchronizing data.

27
• Even when the services are configured, changes in the configuration of a service make it
complex and difficult to handle.

• Grouping and naming of tools was also a time-consuming factor.

• Due to the above problems, Zookeeper was introduced.

• It saves a lot of time by performing synchronization, configuration maintenance, grouping
and naming. A sketch using the ZooKeeper Java client for shared configuration follows.
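A minimal sketch of the ZooKeeper Java client: storing and reading a shared configuration value under a znode. The ensemble address, znode path and value are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) ZooKeeper ensemble on localhost:2181
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();                // session established
            }
        });
        connected.await();

        String path = "/demo_config";                 // hypothetical znode holding shared configuration
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);  // every service reads the same value
        System.out.println(new String(data));
        zk.close();
    }
}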

APACHE OOZIE
• Apache Oozie acts as a clock and alarm service inside the Hadoop ecosystem.

• For Hadoop jobs, Oozie works as a scheduler.

• It schedules Hadoop jobs and binds them together as one logical unit of work.

• There are two kinds of Oozie jobs:

• Oozie workflow: a sequential set of actions to be executed. You can think of it as a relay
race, where each athlete waits for the previous one to complete their leg.

30
• Oozie coordinator: Oozie jobs which are triggered when the required data is made
available.

• Think of this as the stimulus-response system in our body.

• In the same manner as we respond to an external stimulus, an Oozie coordinator
responds to the availability of data and rests otherwise. A sketch of submitting a
workflow through the Oozie Java client follows.
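A hedged sketch of submitting a workflow with the Oozie Java client API; the Oozie URL, HDFS application path and property values are hypothetical placeholders.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");  // hypothetical server
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/wordcount"); // dir with workflow.xml
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8032");      // YARN ResourceManager address
        String jobId = oozie.run(conf);                        // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}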

WORKING OF APACHE OOZIE

APACHE SQOOP
• Sqoop imports data from external sources into Hadoop ecosystem components such as
HDFS, HBase or Hive.

• It also exports data from Hadoop to other external sources.

• Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.
A hedged programmatic sketch follows.
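Sqoop is normally driven from the command line; the hedged sketch below assumes Sqoop 1.x exposes org.apache.sqoop.Sqoop.runTool as a programmatic entry point. The JDBC URL, credentials, table name and target directory are hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/sales",  // hypothetical source RDBMS
            "--username", "demo",
            "--password", "demo",
            "--table", "orders",                               // table to import
            "--target-dir", "/user/demo/orders",               // destination in HDFS
            "--num-mappers", "4"                               // parallel map tasks
        };
        int exitCode = Sqoop.runTool(importArgs);              // assumed Sqoop 1.x entry point
        System.exit(exitCode);
    }
}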

WORKING OF APACHE SQOOP

APACHE FLUME
• Feeding data into HDFS is an important part of the Hadoop ecosystem.

• Flume is a tool which helps in feeding unstructured and semi-structured data into
HDFS.

• It gives us a solution which is reliable and distributed.

• It helps us in collecting, aggregating and moving large amounts of data.

• It helps us consume online streaming data from various sources such as network traffic,
social media, email messages and log files into HDFS.

WORKING OF APACHE FLUME

APACHE SOLR & LUCENE
• Apache Solr and Apache Lucene are the two services used for searching and
indexing in the Hadoop ecosystem.

• Apache Lucene is a Java-based search library which also helps with spell checking.

• If Apache Lucene is the engine, Apache Solr is the car built around it.

• Solr is a complete application built around Lucene.

• It uses the Lucene Java search library as its core for searching and full indexing
(see the SolrJ sketch below).

• Indexing is a way to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed.
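A minimal sketch of indexing and searching with SolrJ, Solr's Java client; the Solr URL and the "articles" core are hypothetical.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/articles").build()) {   // hypothetical core
            // Index one document
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Introduction to the Hadoop ecosystem");
            solr.add(doc);
            solr.commit();
            // Search the index (Lucene performs the actual matching underneath)
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id") + " : " + d.getFieldValue("title"));
            }
        }
    }
}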

APACHE AMBARI
• Ambari is an Apache Software Foundation project which aims at making the
Hadoop ecosystem more manageable.
• It includes software for provisioning, managing and monitoring Apache
Hadoop clusters.
• Hadoop cluster provisioning:
• It gives us a step-by-step process for installing Hadoop services across a
number of hosts.
• It also handles the configuration of Hadoop services over a cluster.
• Hadoop cluster management:
• It provides a central management service for starting, stopping and
re-configuring Hadoop services across the cluster.

• Hadoop cluster monitoring: for monitoring health status, Ambari
provides us with a dashboard.

• The Ambari Alert framework is an alerting service which notifies the
user whenever attention is needed, for example, if a node goes down
or a node is running low on disk space.

