BDA Presentations Unit-4 - Hadoop, Ecosystem
GTU #3170722
Unit-4
Hadoop
History of Hadoop
Hadoop is an open-source software framework for storing and
processing large datasets ranging in size
from gigabytes to petabytes.
Hadoop was developed at the Apache Software Foundation.
In 2008, Hadoop defeated supercomputers and became
the fastest system on the planet for sorting terabytes of data.
There are basically two core components in Hadoop:
1. Hadoop Distributed File System (HDFS):
It allows you to store data of various formats across a cluster.
2. YARN:
It handles resource management in Hadoop and allows parallel processing over the data
stored in HDFS.
Basics of Hadoop
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware.
It provides massive storage for any kind of data, enormous processing power and the ability
to handle virtually limitless concurrent tasks or jobs.
Unlike data residing in the local file system of a personal computer, in Hadoop, data resides
in a distributed file system called the Hadoop Distributed File System (HDFS).
The processing model is based on the 'Data Locality' concept, wherein computational logic is sent
to the cluster nodes (servers) containing the data.
This computational logic is nothing but a compiled version of a program written in a high-
level language such as Java.
Such a program processes data stored in Hadoop HDFS.
Basics of Hadoop
Hadoop
Big Data Technology
Distributed processing of large data sets
Open source
MapReduce - Simple programming model
Why Hadoop?
Handles any data type
Structured/unstructured
Schema/no Schema
High volume/Low volume
All kinds of analytic applications
Grows with business
Proven with Petabyte scale
Capacity & Performance grows
Leverages commodity hardware to mitigate costs
Hadoop Features
100% Apache open source
No vendor lock-in
Rich ecosystem & community development
Derives the complete value of all data
More affordable, cost-effective platform
Advantages & Disadvantages of Hadoop
RDBMS | Hadoop
Best suited for an OLTP environment. | Best suited for big data.
Interactive OLAP analytics | Scalability of storage/compute
Multistep ACID transactions | Complex data processing
Data stored in the form of tables | Distributed file system
Less scalable than Hadoop. | Highly scalable.
SQL – update and access data | MapReduce programming model
100% SQL compliant | Both SQL & NoSQL
Data normalization is required in RDBMS. | Data normalization is not required in Hadoop.
Stores transformed and aggregated data. | Stores huge volumes of data.
No latency in response. | Some latency in response.
The data schema of RDBMS is static. | The data schema of Hadoop is dynamic.
High data integrity. | Lower data integrity than RDBMS.
Cost applies for licensed software. | Free of cost, as it is open-source software.
Hadoop Infrastructure
The 2 infrastructure models of Hadoop are:
⦿ Data Model
⦿ Computing Model
Traditional Database Model Vs Hadoop Data Model
DISTRIBUTED DATABASE MODEL | HADOOP DATA MODEL
Deals with tables and relations | Deals with flat files in any format
Data fragmentation and partitioning | Files are divided automatically into blocks
Distributed transactions with ACID properties | MapReduce computing model, where every task is either a map or a reduce service
Why Hadoop Required? - Example
Why Hadoop Required? - Traditional Restaurant Scenario
Why Hadoop Required? - Traditional Scenario
Why Hadoop Required? - Distributed Processing Scenario
Why Hadoop Required? - Distributed Processing Scenario Failure
Why Hadoop Required? - Solution to Restaurant Problem
Why Hadoop Required? - Hadoop in Restaurant Analogy
Hadoop functions in a similar fashion to Bob's restaurant.
Just as the food shelf is distributed in Bob's restaurant,
in Hadoop the data is stored in a distributed
fashion with replication, to provide fault tolerance.
For parallel processing, the data is first processed
by the slave nodes where it is stored, producing
intermediate results, and those intermediate
results are then merged by the master node to produce
the final result.
Hadoop Ecosystem
Hadoop is a framework that can process large data sets across clusters of machines.
As a framework, Hadoop is composed of multiple modules that are compatible with a large
technology ecosystem.
The Hadoop ecosystem is a platform or suite that provides various services to solve big data
problems.
It includes Apache projects as well as various commercial tools and solutions.
Hadoop has four main elements, namely HDFS, MapReduce, YARN and Hadoop Common.
Most tools or solutions are used to supplement or support these core elements.
All of these tools work together to provide services such as data ingestion, analysis, storage,
and maintenance.
Hadoop Ecosystem
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling
Hadoop Ecosystem Distribution
HDFS
Hadoop Distributed File System is the core component, or you can say, the backbone of
Hadoop Ecosystem.
HDFS makes it possible to store different types of large data sets (i.e., structured, unstructured and
semi-structured data).
HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a
single unit.
It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
HDFS has two main components: NameNode and DataNode.
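A minimal sketch of how a client program talks to HDFS, assuming the Hadoop client libraries are on the classpath and fs.defaultFS points at the cluster's NameNode; the file path used below is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();

    // The client asks the NameNode for metadata; block data flows to/from DataNodes.
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello HDFS");                   // stored as replicated blocks on DataNodes
    }

    // Metadata (such as the replication factor) is maintained by the NameNode.
    System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}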
YARN
YARN acts as the brain of your Hadoop ecosystem.
It performs all your processing activities by allocating resources and scheduling tasks.
As the name suggests, YARN is a resource negotiator that helps
manage all the resources in the cluster.
In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three main components, namely:
1. The Resource Manager, which has the authority to allocate resources to the applications in the system.
2. The Node Manager, which is responsible for allocating resources such as CPU, memory, bandwidth and
so on per machine, and later reports to the Resource Manager.
3. The Application Manager, which acts as an interface between the Resource Manager and the Node
Manager, and negotiates according to the requirements of the two.
Map Reduce
MapReduce is the core component of processing in a Hadoop Ecosystem as it provides the
logic of processing.
MapReduce is a software framework which helps in writing applications that processes large
data sets using distributed and parallel algorithms inside Hadoop environment.
By making use of distributed and parallel algorithms, MapReduce carries the processing
logic over to the data and helps to write applications that transform big data
sets into manageable ones.
MapReduce uses two functions, namely Map() and Reduce(), whose tasks are:
The Map() function performs actions like filtering, grouping and sorting.
The Reduce() function aggregates and summarizes the results produced by the Map() function.
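As an illustration of these two functions, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the input and output paths are passed as command-line arguments and are assumed to be directories in HDFS:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): aggregate the counts emitted by the mappers for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}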
Pig
Pig was basically developed by Yahoo; it uses the Pig Latin language, which is a query-based
language similar to SQL.
It is a platform for structuring the data flows, processing and analyzing massive data sets.
Pig is responsible for executing commands and processing all MapReduce activities in the
background. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just
the way Java runs on the JVM.
Pig helps simplify programming and optimization and is therefore an important part of the
Hadoop ecosystem.
Pig works as follows: first the load command loads the data; then we perform various functions
on it such as grouping, filtering, joining and sorting.
At last, you can either dump the data on the screen or store the result back in HDFS.
Hive
With the help of an SQL methodology and interface, Hive performs reading
and writing of large data sets. Its query language is called
HQL (Hive Query Language).