
Big Data Analytics (BDA)

GTU #3170722

Unit-4
Hadoop
History of Hadoop
 Hadoop is an open-source software framework for storing and processing large datasets ranging in size from gigabytes to petabytes.
 Hadoop was developed under the Apache Software Foundation.
 In 2008, Hadoop beat the supercomputers of the day to become the fastest system on the planet for sorting a terabyte of data.
 There are basically two core components in Hadoop:
1. Hadoop Distributed File System (HDFS):
 It allows you to store data of various formats across a cluster.
2. YARN (Yet Another Resource Negotiator):
 It handles resource management in Hadoop and enables parallel processing of the data stored across HDFS.
Basics of Hadoop
 Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
 It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
 Whereas on a personal computer data resides in the local file system, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System (HDFS).
 The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data; the sketch after this list shows how those block locations are exposed.
 This computational logic is nothing but a compiled version of a program written in a high-level language such as Java.
 Such a program processes data stored in Hadoop HDFS.
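For illustration, the sketch below asks HDFS which hosts store each block of a file; this is exactly the information the framework uses to send computation to the data. A minimal sketch, assuming a configured Hadoop client (hadoop-client on the classpath, core-site.xml/hdfs-site.xml present); the path /data/input.txt is a hypothetical example, not from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster named by fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical path
        // Each block reports the DataNode hosts holding a replica; the scheduler
        // prefers to run a task on, or near, one of these hosts.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " on hosts "
                    + String.join(", ", block.getHosts()));
        }
    }
}
```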
Basics of Hadoop
Hadoop
 Big Data Technology
 Distributed processing of large data sets
 Open source
 MapReduce: a simple programming model
Why Hadoop?
 Handles any data type
 Structured/unstructured
 Schema/no Schema
 High volume/Low volume
 All kinds of analytic applications
 Grows with business
 Proven at petabyte scale
 Capacity and performance grow
 Leverages commodity hardware to mitigate costs
Hadoop Features
 100% Apache open source
 No vendor lock-in
 Rich ecosystem and community development
 Derives computing value from all of the data
 More affordable, cost-effective platform
Advantages & Disadvantages of Hadoop

Advantages:
 Varied data sources
 Cost-effective
 Performance
 Fault-tolerant
 Highly available
 Low network traffic
 High throughput
 Open source
 Scalable
 Ease of use
 Compatibility
 Multiple languages supported

Disadvantages:
 Issues with small files
 Vulnerable by nature
 Processing overhead
 Supports only batch processing
 Inefficient iterative processing
 Security concerns
Hadoop and Databases
1. RDBMS follows a structured data approach; Hadoop handles both structured and unstructured data.
2. RDBMS are traditional row-column based databases, used mainly for data storage, manipulation and retrieval; Hadoop is open-source software used for storing data and running applications or processes concurrently.
3. RDBMS is best suited for OLTP environments; Hadoop is best suited for big data.
4. RDBMS offers interactive OLAP analytics; Hadoop offers scalability of storage and compute.
5. RDBMS supports multistep ACID transactions; Hadoop supports complex data processing.
6. RDBMS stores data in the form of tables; Hadoop stores data in a distributed file system.
7. RDBMS is less scalable than Hadoop; Hadoop is highly scalable.
8. RDBMS uses SQL to update and access data; Hadoop uses the MapReduce programming model.
9. RDBMS is 100% SQL compliant; Hadoop supports both SQL and NoSQL.
10. Data normalization is required in RDBMS; it is not required in Hadoop.
11. RDBMS stores transformed and aggregated data; Hadoop stores huge volumes of raw data.
12. RDBMS has negligible latency in response; Hadoop has some latency in response.
13. The data schema of RDBMS is static; the data schema of Hadoop is dynamic.
14. RDBMS provides high data integrity; Hadoop provides lower data integrity than RDBMS.
15. RDBMS incurs costs for licensed software; Hadoop is free of cost, as it is open-source software.
Hadoop Infrastructure
 The two infrastructure models of Hadoop are:
 ⦿ Data Model
 ⦿ Computing Model
 Traditional Database Model vs. Hadoop Data Model:
 A distributed database deals with tables and relations; Hadoop deals with flat files in any format.
 A distributed database must have a schema for its data; Hadoop requires no schema.
 A distributed database uses data fragmentation and partitioning; in Hadoop, files are divided automatically into blocks.
 A distributed database has the notion of a transaction; Hadoop has the notion of a job divided into tasks.
 A distributed database supports distributed transactions with ACID properties; Hadoop uses the MapReduce computing model, in which every task is either a map or a reduce.
Why is Hadoop Required? - Example
Why is Hadoop Required? - Traditional Restaurant Scenario
Why is Hadoop Required? - Traditional Scenario
Why is Hadoop Required? - Distributed Processing Scenario
Why is Hadoop Required? - Distributed Processing Scenario Failure
Why is Hadoop Required? - Solution to the Restaurant Problem
Why is Hadoop Required? - Hadoop in the Restaurant Analogy
 Hadoop functions in a fashion similar to Bob’s restaurant.
 Just as the food shelf is distributed across Bob’s restaurant, in Hadoop the data is stored in a distributed fashion, with replication, to provide fault tolerance.
 For parallel processing, the data is first processed by the slave nodes where it is stored, producing intermediate results; the master node then merges those intermediate results to produce the final result.
Hadoop Ecosystem
 Hadoop is a framework that can process large data sets across clusters of machines.
 As a framework, Hadoop is composed of multiple modules that are compatible with a large technology ecosystem.
 The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems.
 It includes Apache projects as well as various commercial tools and solutions.
 Hadoop has four main elements, namely HDFS, MapReduce, YARN and Hadoop Common.
 Most tools or solutions are used to supplement or support these core elements.
 All of these tools work together to provide services such as data ingestion, analysis, storage and maintenance.
Hadoop Ecosystem
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 Pig, Hive: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 ZooKeeper: Cluster management
 Oozie: Job scheduling
Hadoop Ecosystem Distribution
HDFS
 The Hadoop Distributed File System is the core component, or, you can say, the backbone of the Hadoop ecosystem.
 HDFS makes it possible to store large data sets of different kinds (i.e., structured, unstructured and semi-structured data).
 HDFS creates a level of abstraction over the cluster's resources, from where we can see the whole of HDFS as a single unit.
 It helps us in storing our data across various nodes and in maintaining the log file about the stored data (the metadata).
 HDFS has two main components, the NameNode and the DataNode; a short usage sketch follows.
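The sketch below uses the HDFS Java API to write a small file and read it back. A minimal sketch, assuming a configured Hadoop client (hadoop-client on the classpath, core-site.xml/hdfs-site.xml present); the path /user/demo/hello.txt is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Configuration picks up fs.defaultFS from core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read the file back as a single unit, regardless of where blocks live.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // The replication factor underlies the fault tolerance discussed above.
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}
```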
YARN
 YARN acts as the brain of your Hadoop ecosystem.
 It performs all of your processing activities by allocating resources and scheduling tasks.
 As its full name, Yet Another Resource Negotiator, suggests, YARN is the negotiator that helps manage all the resources in the cluster.
 In short, YARN performs scheduling and resource allocation for the Hadoop system.
 It consists of three main components, namely:
1. The Resource Manager, which has the authority to allocate resources to the applications in the system.
2. The Node Manager, which is responsible for managing resources such as CPU, memory and bandwidth on each machine, and for reporting back to the Resource Manager.
3. The Application Manager, which acts as an interface between the Resource Manager and the Node Manager, negotiating according to the requirements of the two.
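To make the Resource Manager's role concrete, the sketch below asks it for the list of applications it is tracking, through the YARN client API. A minimal sketch, assuming hadoop-yarn-client on the classpath and a Resource Manager reachable via yarn-site.xml.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml
        yarn.start();
        // The Resource Manager answers this query: it tracks every application
        // in the cluster and the resources allocated to each.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```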
MapReduce
 MapReduce is the core processing component of the Hadoop ecosystem, as it provides the logic of processing.
 MapReduce is a software framework that helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
 By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps write applications that transform big data sets into manageable ones.
 MapReduce uses two functions, namely Map() and Reduce():
 The Map() function performs actions like filtering, grouping and sorting.
 The Reduce() function aggregates and summarizes the results produced by the Map() function.
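The two functions are easiest to see in the canonical word-count job: Map() emits a (word, 1) pair for every token, and Reduce() sums the counts per word. This is the standard Hadoop WordCount example, assuming the hadoop-mapreduce-client dependencies and input/output HDFS paths passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // Map(): emit a (word, 1) pair per token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result); // Reduce(): aggregate the counts per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```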
Pig
 Pig was developed by Yahoo. It uses the Pig Latin language, a query-based language similar to SQL.
 It is a platform for structuring data flows and for processing and analyzing massive data sets.
 Pig is responsible for executing commands, processing all of the corresponding MapReduce activities in the background. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime, much like the way Java runs on the JVM.
 Pig helps simplify programming and optimization and is therefore an important part of the Hadoop ecosystem.
 Pig works as follows: first the LOAD command loads the data; then we perform various functions on it, such as grouping, filtering, joining and sorting.
 At last, you can either dump the result onto the screen or store it back in HDFS; the sketch after this list drives the same flow from Java.
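Pig also ships an embedded Java API, so the LOAD/GROUP/STORE flow described above can be driven from a program. A minimal sketch, assuming the pig dependency on the classpath and a Hadoop cluster to run against; the file paths, aliases and field names are illustrative, not from the slides.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlow {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs each statement as MapReduce jobs on the cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // LOAD the data (illustrative path and schema).
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray);");
        // GROUP and aggregate: the kind of function applied after loading.
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
        // STORE the result back in HDFS rather than DUMPing it to the screen.
        pig.store("counts", "/data/ip_counts");
    }
}
```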
Hive
 With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

HIVE + SQL = HQL

 It is highly scalable, as it supports both real-time and batch processing.
 In addition, Hive supports all SQL data types, making query processing easier.
 Like a query processing framework, HIVE has two main components: the JDBC driver and the HIVE command line.
 JDBC, together with the ODBC driver, is used to set up connections and data storage permissions, whereas the HIVE command line facilitates query processing; a JDBC sketch follows.
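Since the slides mention the JDBC driver, a short sketch of querying Hive through it may help. A minimal sketch, assuming hive-jdbc on the classpath and a HiveServer2 instance listening on localhost:10000; the employees table and its columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL: jdbc:hive2://<host>:<port>/<database>
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; Hive compiles it into jobs over HDFS data.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```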
