
The Hadoop Ecosystem

Overview
 Big Data Challenges
 Distributed system and challenges
 Hadoop Introduction
 History
 Who uses Hadoop
 The Hadoop Ecosystem
 Hadoop core components
 HDFS
 MapReduce
 Other Hadoop ecosystem components
 HBase
 Hive
 Pig
 Impala
 Sqoop
 Flume
 Hue
 ZooKeeper
 Demo
Big Data Challenges
Solution: Distributed system
Distributed System Challenges

 Programming Complexity
 Finite bandwidth
 Partial failure
 The data bottleneck
New Approach to distributed computing

Hadoop:
A scalable, fault-tolerant distributed system for data storage and processing
 Data is distributed across the cluster as it is stored
 Processing runs on the nodes where the data resides
 Data is replicated for fault tolerance
Hadoop Introduction

 Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
 Some of its characteristics:
 Open source
 Distributed processing
 Distributed storage
 Scalable
 Reliable
 Fault-tolerant
 Economical
 Flexible
History

 Originally built as infrastructure for the “Nutch” project (an open-source web search engine).
 Based on Google’s MapReduce and Google File System papers.
 Created by Doug Cutting in 2005 and developed further at Yahoo!.
 Named after his son’s yellow toy elephant.
Who uses Hadoop

 http://wiki.apache.org/hadoop/PoweredBy
 http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
The Hadoop Ecosystem

 http://hadoopecosystemtable.github.io/
Hadoop Core Components

 HDFS – Hadoop Distributed File System (Storage)


 MapReduce (Processing)
Hadoop Core Components
A multi-node Hadoop cluster
Nodes
 NameNode:
 Master of the system
 Maintains and manages the blocks which are present on the DataNodes

 DataNodes:
 Slaves which are deployed on each machine and provide the actual storage
 Responsible for serving read and write requests for the clients

 JobTracker:
 Takes care of job scheduling and assigns tasks to TaskTrackers

 TaskTracker:
 A node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from the JobTracker
HDFS

 Hadoop Distributed File System (HDFS) is designed to reliably store very large
files across machines in a large cluster. It is inspired by the GoogleFileSystem.
 Large data files are split into blocks
 Blocks are managed by different nodes in the cluster
 Each block is replicated on multiple nodes
 The NameNode stores metadata about files and blocks (a short client-side sketch follows below)
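For illustration, a minimal client-side sketch using Hadoop's Java FileSystem API (the file path /user/demo/hello.txt is hypothetical; the NameNode address is taken from fs.defaultFS in the configuration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS splits large files into blocks and
        // replicates each block across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

The client contacts the NameNode only for metadata; the actual block data is written to and read from the DataNodes directly.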
MapReduce
 The Mapper:
 Each block is processed in isolation by a map task, called the mapper
 The map task runs on the node where the block is stored

 The Reducer:
 Consolidates the results from the different mappers
 Produces the final output (the standard word-count example is sketched below)
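The word-count job is a compact illustration of this mapper/reducer contract; the sketch below follows the stock Apache Hadoop example, with input and output paths passed on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs once per input split, ideally on the node holding the block.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: consolidates the counts emitted by all mappers for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each mapper emits (word, 1) pairs for its block; the framework shuffles and sorts by key, and the reducer sums the counts per word to produce the final output.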
What makes Hadoop unique

 Moving computation to data, instead of moving data to computation.


 Simplified programming model: allows users to write and test distributed programs quickly
 Automatic distribution of data and work across machines
Other Hadoop components in Ecosystem
HBase: Hadoop database for random read/write access

Hive: SQL-like queries and tables on large datasets

Pig: Data flow language and compiler

Oozie: Workflow for interdependent Hadoop jobs

Sqoop: Integration of databases and data warehouses with Hadoop

Flume: Configurable streaming data collection

ZooKeeper: Coordination service for distributed applications


HBase

 HBase is an open-source, non-relational, distributed database modeled after Google's BigTable.
 It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
Features of HBase

 A type of NoSQL database
 Strongly consistent reads and writes
 Automatic sharding
 Automatic RegionServer failover
 Hadoop/HDFS integration
 Massively parallel processing via MapReduce, with HBase as both source and sink
 An easy-to-use Java API for programmatic access (see the sketch after this list)
 Thrift and REST gateways for non-Java front-ends
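A minimal sketch of that Java API, assuming a table named "users" with a column family "info" already exists (the table, family, qualifier and row key are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the cluster location.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one cell addressed by row key, column family and qualifier.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}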
HBase in the CAP theorem

 In terms of Eric Brewer’s CAP theorem, HBase is a CP system: it favors consistency and partition tolerance over availability.


When to use HBase

 When the data is really big: millions or billions of rows, too much to store on a single node
 When random read/write access to big data is needed
 When thousands of operations need to be performed on big data
 When the extra features of an RDBMS (typed columns, secondary indexes, transactions, advanced query languages, etc.) are not needed
 When there is enough hardware
Difference between HBase and HDFS

HDFS:
 Good for storing large files
 Write once; appending to files is supported in some recent versions but not commonly used
 No random read/write
 No individual record lookup; data is read in bulk

HBase:
 Built on top of HDFS; good for hosting very large tables (billions of rows by millions of columns)
 Read/write many times
 Random read/write
 Fast record lookup (and update)
Hive

 An SQL-like interface to Hadoop
 Data warehouse infrastructure built on top of Hadoop
 Provides data summarization, query and analysis
 Query execution via MapReduce
 The Hive interpreter converts queries into MapReduce jobs
 Open-source project
 Developed by Facebook
 Also used by Netflix, CNET, Digg, eHarmony, etc.
Hive

 HiveQL example:

SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
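The same query can also be submitted programmatically over JDBC through HiveServer2. A minimal sketch, assuming a HiveServer2 instance on localhost:10000 and the hive_purchases table above (the hive-jdbc driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (from the hive-jdbc jar).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database and credentials are assumptions for a default local setup.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT customerId, max(total_cost) FROM hive_purchases "
                 + "GROUP BY customerId HAVING count(*) > 3")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}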
Pig

 A scripting platform for processing and analyzing large data sets


 Apache Pig allows users to write complex MapReduce programs using a simple scripting language.
 High-level language: Pig Latin
 Pig Latin is a data flow language.
 Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop.
 Open source project
 Developed by Yahoo
Pig

 Pig Latin example:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;
DUMP X;
Pig and Hive

 Both require a compiler step to generate MapReduce jobs

 Queries therefore have high latency, so they are a poor fit for real-time responses to ad-hoc queries
 Both are good for batch processing and ETL jobs
 Fault tolerant
Impala

 Cloudera Impala is a query engine that runs on Apache Hadoop.


 Uses a query language similar to HiveQL
 Does not use MapReduce
 Optimized for low-latency queries
 Open-source Apache project
 Developed by Cloudera
 Much faster than Hive or Pig
Comparing Pig, Hive and Impala

Feature                               Pig        Hive       Impala
SQL-based query language              No         Yes        Yes
Schema                                Optional   Required   Required
Process data with external scripts    Yes        Yes        No
Extensible file format support        Yes        Yes        No
Query speed                           Slow       Slow       Fast
Accessible via ODBC/JDBC              No         Yes        Yes
Sqoop

 Command-line interface for transferring data between relational databases and Hadoop
 Supports incremental imports
 Imports are used to populate tables in Hadoop
 Exports are used to move data from Hadoop into a relational database such as SQL Server

Hadoop <-> Sqoop <-> RDBMS


How Sqoop works

 The dataset being transferred is broken into small blocks.


 A map-only job is launched.
 Each mapper is responsible for transferring one block of the dataset (see the example command below).
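A typical import invocation might look like the sketch below (the JDBC URL, database, table and target directory are hypothetical); --num-mappers controls how many map tasks share the transfer, and the --incremental options drive incremental imports:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table purchases \
  --target-dir /user/hadoop/purchases \
  --num-mappers 4 \
  --incremental append --check-column id --last-value 0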
Flume

 Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
How Flume works

 Data flows as follows:

Agent tier -> Collector tier -> Storage tier

 Agent nodes are typically installed on the machines that generate the logs and are the data’s initial point of contact with Flume. They forward data to the next tier of collector nodes, which aggregate the separate data flows and forward them to the final storage tier (a sample configuration is sketched below).
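In current Flume (NG) releases this pipeline is described declaratively as sources, channels and sinks in a properties file. A minimal sketch, with a hypothetical agent name, log path and HDFS directory:

# One agent with a single source -> channel -> sink pipeline
agent1.sources  = logsrc
agent1.channels = memch
agent1.sinks    = hdfssink

# Tail an application log (hypothetical path)
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/access.log
agent1.sources.logsrc.channels = memch

# Buffer events in memory between source and sink
agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

# Deliver events into HDFS (hypothetical directory)
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /flume/events
agent1.sinks.hdfssink.channel = memch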
Hue

 Graphical front-end to the cluster
 Open-source web interface
 Makes the Hadoop platform (HDFS, MapReduce, Oozie, Hive, etc.) easier to use
ZooKeeper

 Because coordinating distributed systems is a Zoo.

 ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services (a minimal client sketch follows below).
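A minimal Java client sketch, assuming a ZooKeeper server on localhost:2181 (the znode path and the stored value are hypothetical):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble; the address assumes a local install.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client in the cluster can read (and watch) the same value.
        System.out.println(new String(zk.getData(path, false, null)));
        zk.close();
    }
}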
DEMO
Hadoop Installation (CDH) for Windows
 Download and install VMware Player
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0
Hadoop Installation (CDH) for Windows

 Make sure you have enabled virtualization in the BIOS
Hadoop Installation (CDH) for Windows
 Download the “QuickStart VM with CDH” (Download for VMware):
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-4-7-x.html
Hadoop Installation (CDH) for Windows
 Unzip “cloudera-quickstart-vm-4.7.0-0-vmware”
 Open the CDH VM using VMware Player:
 Open VMware Player
 Click “Open a Virtual Machine”
 Select the file “cloudera-quickstart-vm-4.7.0-0-vmware” in the extracted directory of “cloudera-quickstart-vm-4.7.0-0-vmware”. The virtual machine will be added to your VMware Player.
 Select this virtual machine and click “Play virtual machine”.
References

 http://training.cloudera.com/essentials.pdf
 http://en.wikipedia.org/wiki/Apache_Hadoop
 http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
 https://developer.yahoo.com/hadoop/tutorial/module1.html
 http://hadoop.apache.org/
 http://wiki.apache.org/hadoop/FrontPage
Questions?
Thanks
