MODULE 2 Hadoop Ecosystem Tools
Introduction to Hadoop Ecosystem Tools
Hadoop Ecosystem
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications.
• HDFS is used to store different types of large data sets (i.e. structured, unstructured and
semi-structured data).
• It helps us store data across the various nodes of a cluster and maintains metadata (a log
file describing the stored data).
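A minimal sketch in Java using the HDFS FileSystem API to write a small file into the cluster; the path /user/demo/sample.txt and the class name HdfsWriteExample are placeholders for this example, and the Hadoop client configuration is assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // handle to the distributed file system
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
                out.writeUTF("hello hdfs");                // the file's blocks are replicated across DataNodes
            }
            fs.close();
        }
    }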
MAPREDUCE
• MapReduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of
processing.
• MapReduce is a software framework which helps in writing applications that process large data
sets using distributed and parallel algorithms inside the Hadoop environment.
• The Map function performs actions like filtering, grouping and sorting.
• The Reduce function aggregates and summarizes the results produced by the Map function.
• The result generated by the Map function is a key-value pair (K, V) which acts as the input for
the Reduce function.
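To make the (K, V) flow concrete, here is a hedged word-count sketch in Java; the class names WordCountMapper and WordCountReducer are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: tokenize each line and emit (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);                      // (K, V) pair handed to the Reduce phase
            }
        }
    }

    // Reduce: aggregate and summarize the counts produced by the Map phase.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }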
YARN
• YARN (Yet Another Resource Negotiator) is the brain of the Hadoop Ecosystem: it performs all
processing activities by allocating cluster resources and scheduling tasks.
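As a small illustration of talking to the ResourceManager, this hedged sketch lists the applications YARN is currently tracking through the YarnClient API; the yarn-site.xml configuration is assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new Configuration());                // picks up the ResourceManager address
            yarn.start();
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
            }
            yarn.stop();
        }
    }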
HADOOP YARN ARCHITECTURE
PIG
• PIG was initially developed by Yahoo.
• The PIG tool has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment.
• The Pig Latin language has a SQL-like command structure.
• The compiler internally converts Pig Latin to MapReduce.
• It produces a sequential set of MapReduce jobs.
• It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
• The process of extracting data from different source systems and bringing it into the data
warehouse is commonly called ETL.
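A hedged sketch of driving such a data flow from Java through the PigServer API; the input path, field names and aliases are made up for this example. Each registered statement is compiled to MapReduce and the job chain runs when the result is stored.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);   // statements compile to MapReduce jobs
            pig.registerQuery("logs = LOAD '/user/demo/access.log' AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY user;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
            pig.store("counts", "/user/demo/url_counts");        // triggers the sequential MapReduce jobs
            pig.shutdown();
        }
    }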
HADOOP PIG ARCHITECTURE
APACHE HIVE
• Facebook created HIVE for people who are fluent with SQL.
• Basically, HIVE is a data warehousing component which performs reading, writing and managing large data sets
in a distributed environment using SQL-like interface.
• HIVE + SQL = HQL
• The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
• It has 2 basic components: the Hive Command Line and the JDBC/ODBC driver.
• The Hive command line interface is used to execute HQL commands.
• Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to
establish a connection to the data storage (HDFS).
• Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set (batch)
processing and real-time data processing.
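A minimal sketch of the JDBC path mentioned above: connecting to HiveServer2 from Java and running an HQL query. The table visits, the user name and the default port 10000 are assumptions for this example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 is assumed to listen on the default port 10000.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hiveuser", "");
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) FROM visits GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }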
Mahout
• Mahout performs machine learning operations such as collaborative filtering,
clustering, classification and frequent itemset mining.
• Classification: It means classifying and categorizing data into
various sub-categories; for example, articles can be categorized into
blogs, news, essays and research papers.
• Frequent itemset mining: Here Mahout checks which objects are
likely to appear together and makes suggestions if one of
them is missing. For example, a cell phone and its cover are
generally bought together, so if you search for a cell phone,
it will also recommend the cover and cases.
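To make the collaborative-filtering idea concrete, here is a hedged sketch using Mahout's Taste recommender API; the ratings.csv file, user id 42 and the neighbourhood size are placeholders.

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class PhoneCoverRecommender {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));   // lines of userId,itemId,preference
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);
            for (RecommendedItem item : recommender.recommend(42L, 3)) {    // top-3 suggestions for user 42
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }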
APACHE SPARK
• Apache Spark is a framework used for real-time data analytics in a
distributed computing environment.
• Spark is written in Scala and was originally developed at
the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing
over MapReduce.
• It can be up to 100x faster than Hadoop MapReduce for large-scale data processing, by
exploiting in-memory computations and other optimizations.
• Therefore, it requires more processing power and memory than MapReduce.
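A short Java sketch of Spark's in-memory processing model: a word count whose intermediate RDDs stay in memory between transformations. The input/output paths and the local master URL are assumptions for this example.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);            // intermediate results are kept in memory
            counts.saveAsTextFile("hdfs:///user/demo/wordcounts");
            sc.stop();
        }
    }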
Working of Spark Architecture
APACHE HBASE
• HBase is an open source, non-relational distributed database.
• It supports all types of data, and that is why it is capable of handling anything and everything inside a
Hadoop ecosystem.
• It is modelled after Google’s BigTable, which is a distributed storage system designed to cope with
large data sets.
• HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
• HBase is written in Java, while HBase applications can be accessed through REST, Avro and Thrift APIs.
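A minimal sketch of the HBase Java client API writing and reading back a single cell; the table users and the column family info are assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                table.put(put);                                     // write one cell on top of HDFS
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }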
APACHE DRILL
• As the name suggests, Apache Drill is used to drill into any kind of data.
• It’s an open source application which works with distributed environment to analyze
large data sets.
• It supports different kinds of NoSQL databases and file systems, which is a powerful
feature of Drill.
• For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB,
MapR-FS, Amazon S3, Swift, NAS and local files.
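To show how Drill queries raw files with plain SQL, here is a hedged JDBC sketch; the file /tmp/people.json, its fields and the single-Drillbit connection string are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillJdbcQuery {
        public static void main(String[] args) throws Exception {
            // Connects to a single Drillbit; the dfs storage plugin is assumed to be enabled.
            Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT name, age FROM dfs.`/tmp/people.json` LIMIT 5")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getInt("age"));
                }
            }
            conn.close();
        }
    }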
WORKING OF APACHE DRILL
APACHE ZOOKEEPER
• Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various tools in
a Hadoop Ecosystem.
• Before Zookeeper, it was very difficult and time consuming to coordinate between different tools in
Hadoop Ecosystem.
• Earlier, the services had many problems with interactions, such as sharing a common configuration while
synchronizing data.
• Even if the services are configured, changes in the configurations of the services make coordination
complex and difficult to handle.
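A small sketch of that coordination idea: storing a shared configuration value in a ZooKeeper znode that every service in the cluster can read. The znode path /demo-config and the stored value are made up for this example.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();                 // session established with the ensemble
                }
            });
            connected.await();
            // Publish a configuration value that other services can read and watch.
            zk.create("/demo-config", "batch.size=500".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }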
APACHE OOZIE
• Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem.
• It schedules Hadoop jobs and binds them together as one logical unit of work.
• Oozie workflow: These are sequential sets of actions to be executed. You can think of
it as a relay race, where each athlete waits for the previous one to complete their part.
• Oozie Coordinator: These are the Oozie jobs which are
triggered when the data is made available to them.
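A hedged sketch of submitting a workflow through the Oozie Java client; the Oozie server URL, the HDFS application path and the nameNode/resourceManager properties are placeholders, and the workflow.xml describing the chained actions is assumed to already be in HDFS.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitOozieWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/etl-workflow");
            conf.setProperty("nameNode", "hdfs://localhost:8020");
            conf.setProperty("resourceManager", "localhost:8032");
            String jobId = client.run(conf);                    // submit and start the workflow
            WorkflowJob job = client.getJobInfo(jobId);
            System.out.println(jobId + " " + job.getStatus());
        }
    }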
WORKING OF APACHE OOZIE
APACHE SQOOP
• Sqoop imports data from external sources (such as relational databases) into related Hadoop ecosystem
components like HDFS, HBase or Hive.
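As a rough illustration, this sketch shells out to the sqoop command-line tool to copy one relational table into HDFS; the MySQL database, credentials, table name and target directory are placeholders.

    public class SqoopImportLauncher {
        public static void main(String[] args) throws Exception {
            // Invokes the sqoop CLI (assumed to be on the PATH of the client machine).
            ProcessBuilder pb = new ProcessBuilder(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://localhost/sales",
                    "--username", "demo",
                    "--table", "orders",                  // relational table to copy
                    "--target-dir", "/user/demo/orders",  // HDFS directory to write to
                    "-m", "1");                           // number of parallel map tasks
            pb.inheritIO();
            int exit = pb.start().waitFor();
            System.out.println("sqoop exited with code " + exit);
        }
    }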
WORKING OF APACHE SQOOP
APACHE FLUME
• Feeding data is an important part of our Hadoop Ecosystem.
• Flume is a tool which helps in feeding unstructured and semi-structured data into
HDFS.
• It helps us consume online streaming data from various sources like network traffic,
social media, email messages, log files etc. into HDFS.
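A minimal sketch of pushing one event to a running Flume agent from Java using Flume's RPC client; it assumes an agent with an Avro source listening on localhost:41414, whose channel and sink then forward the event into HDFS.

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeEventSender {
        public static void main(String[] args) throws Exception {
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody("user logged in", StandardCharsets.UTF_8);
                client.append(event);                     // handed to the agent's channel, then its sink
            } finally {
                client.close();
            }
        }
    }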
WORKING OF APACHE FLUME
APACHE SOLR & LUCENE
• Apache Solr and Apache Lucene are the two services which are used for searching and
indexing in Hadoop Ecosystem.
• If Apache Lucene is the engine, Apache Solr is the car built around it.
• It uses the Lucene Java search library as the core for search and full-text indexing.
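A short SolrJ sketch of that indexing-and-searching workflow; the core name articles and the document fields are assumptions for this example.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexAndSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Hadoop ecosystem overview");
            solr.add(doc);
            solr.commit();                                // make the document searchable
            QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
            for (SolrDocument hit : response.getResults()) {
                System.out.println(hit.getFieldValue("id") + " " + hit.getFieldValue("title"));
            }
            solr.close();
        }
    }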
APACHE AMBARI
• Ambari is an Apache Software Foundation project which aims at making the
Hadoop ecosystem more manageable.
• It includes software for provisioning, managing and monitoring Apache
Hadoop clusters.
• Hadoop cluster provisioning:
• It gives us a step-by-step process for installing Hadoop services across a
number of hosts.
• It also handles configuration of Hadoop services over a cluster.
• Hadoop cluster management:
• It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
• Hadoop cluster monitoring: Ambari provides a dashboard for monitoring the
health and status of the cluster.
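As an illustration of the monitoring side, this hedged sketch calls Ambari's REST API to list the clusters it manages; the host name and the admin credentials are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class AmbariClusterStatus {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://ambari-host:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            conn.setRequestProperty("Authorization", "Basic " + auth);
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);             // JSON listing of the managed clusters
                }
            }
            conn.disconnect();
        }
    }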