History of Hadoop Apache Hadoop - The Hadoop Distributed File System

Hadoop is an open-source framework developed by the Apache Software Foundation for storing and processing large datasets, with its evolution beginning in 2002 by Doug Cutting and Mike Cafarella. Key components of Hadoop include HDFS (Hadoop Distributed File System), YARN for resource management, and MapReduce for data processing. Over the years, Hadoop has become a leading technology for big data, with significant milestones including its recognition as the fastest system for sorting terabytes of data in 2008 and the introduction of various versions up to Hadoop 3.1 in 2018.

Uploaded by

nikhil.2226cse1108

History of Hadoop

Apache Hadoop - The Hadoop Distributed File System

History of Hadoop – The complete evolution of the Hadoop Ecosystem

Hadoop is an open-source software framework for storing and processing large
datasets ranging in size from gigabytes to petabytes. Hadoop was developed at
the Apache Software Foundation.
In 2008, Hadoop beat the supercomputers and became the fastest system
on the planet at sorting a terabyte of data.
This article describes the evolution of Hadoop from its beginnings in 2002 to
the Hadoop 3.1 release in 2018.

What is Hadoop

Hadoop is an open-source framework from Apache that is used to store, process,
and analyze very large volumes of data. Hadoop is written in Java and is not
an OLAP (online analytical processing) system. It is used for batch/offline
processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and
many more. Moreover, it can be scaled out simply by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its paper on GFS
(Google File System), and HDFS was developed on the basis of that paper.
Files are broken into blocks and stored on nodes across the distributed
architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and
cluster management.
3. MapReduce: This is a framework that helps Java programs perform
parallel computation on data using key-value pairs. The Map task takes
input data and converts it into a data set of key-value pairs. The output
of the Map task is consumed by the Reduce task, and the output of the
reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are
used by the other Hadoop modules.
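
The Map and Reduce steps described in item 3 can be illustrated with a small word count. This is only a sketch in plain Python that imitates the data flow on one machine; it does not use Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) key-value pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling word_count(["big data", "Big cluster"]) counts "big" twice and the other words once. In a real cluster the map and reduce calls would run in parallel on different nodes; here they run one after another.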

Hadoop Architecture

The Hadoop architecture packages a file system, HDFS (Hadoop Distributed File
System), together with the MapReduce engine. The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The
master node includes the Job Tracker, Task Tracker, NameNode, and DataNode,
whereas a slave node includes a DataNode and a Task Tracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for
Hadoop. It has a master/slave architecture, which consists of a single
NameNode that performs the role of master and multiple DataNodes that
perform the role of slaves.

Both the NameNode and the DataNode can run on commodity machines. HDFS is
developed in Java, so any machine that supports Java can run the NameNode
and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o As it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as
opening, renaming, and closing files.
o Having a single master simplifies the architecture of the system.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests
from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from
the NameNode.
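
The division of labour between the NameNode and the DataNodes can be sketched as follows: a file is cut into fixed-size blocks, and the master records which DataNodes hold a copy of each block. This is a simplified illustration only. The function names are hypothetical, the tiny block size is chosen for readability, and the round-robin placement is an assumption made for this sketch rather than the placement policy HDFS actually uses.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Return NameNode-style metadata: block id -> DataNodes holding a copy.

    Assumption: simple round-robin placement, used here only for illustration.
    """
    if not 1 <= replication <= len(datanodes):
        raise ValueError("replication must be between 1 and the number of DataNodes")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", 4)
metadata = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=2)
```

Here the ten-byte input becomes three blocks, and metadata maps each block id to two distinct DataNodes. Only the mapping lives on the master; the block contents stay on the DataNodes.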

Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and
process the data with the help of the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.

Task Tracker

o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to
the file. This process can also be called a Mapper.

MapReduce Layer

The MapReduce layer comes into play when the client application submits a
MapReduce job to the Job Tracker. In response, the Job Tracker sends the
request to the appropriate Task Trackers. Sometimes a Task Tracker fails or
times out. In such a case, that part of the job is rescheduled.
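
The rescheduling behaviour can be sketched as a loop that tries the next Task Tracker whenever the preferred one is unavailable. This is a toy model, not the Job Tracker's real scheduling logic: the schedule function, the round-robin preference, and the is_healthy callback are all assumptions introduced for this example.

```python
def schedule(tasks, trackers, is_healthy):
    """Assign each task to a healthy tracker, falling back to the next one.

    is_healthy is a caller-supplied function that stands in for the
    heartbeat/timeout check; it is a hypothetical hook for this sketch.
    """
    assignments = {}
    for index, task in enumerate(tasks):
        for attempt in range(len(trackers)):
            # Start from a round-robin choice and move on after each failure.
            tracker = trackers[(index + attempt) % len(trackers)]
            if is_healthy(tracker):
                assignments[task] = tracker
                break
        else:
            raise RuntimeError(f"no healthy Task Tracker available for {task}")
    return assignments


# "tt2" is treated as failed, so its share of the work moves to "tt1".
plan = schedule(["m0", "m1", "m2"], ["tt1", "tt2"], lambda t: t != "tt2")
```

With "tt2" marked as failed, every task in plan ends up on "tt1". If no tracker is healthy, the sketch raises an error instead of retrying later.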

Advantages of Hadoop

o Fast: In HDFS, data is distributed over the cluster and mapped, which
helps in faster retrieval. The tools that process the data often run on
the same servers, which reduces processing time. Hadoop is able to process
terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended simply by adding nodes to the
cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to
store data, so it is very cost effective compared to a traditional
relational database management system.
o Resilient to failure: HDFS replicates data over the network, so if one
node is down or some other network failure happens, Hadoop uses another
copy of the data. Normally, data is replicated three times, but the
replication factor is configurable.
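
The failover described in the last point can be sketched as a lookup that skips replicas on unavailable nodes. This is a minimal illustration with hypothetical names (read_block, replicas, live_nodes); it is not how an HDFS client is actually implemented.

```python
def read_block(block_id, replicas, live_nodes):
    """Return the first node that holds block_id and is still reachable.

    replicas maps a block id to the list of nodes storing a copy;
    live_nodes is the set of nodes currently considered up.
    """
    for node in replicas[block_id]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"all replicas of block {block_id} are unavailable")


# Block 0 is stored three times; "dn1" is down, so the read falls back to "dn2".
replicas = {0: ["dn1", "dn2", "dn3"]}
source = read_block(0, replicas, live_nodes={"dn2", "dn3"})
```

With three copies, the block stays readable as long as at least one of the three nodes is up. A higher replication factor tolerates more simultaneous failures at the cost of more storage.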

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its design
traces back to the Google File System paper published by Google.

Let's follow the history of Hadoop in the following steps:

o In 2002, Doug Cutting and Mike Cafarella started to work on a project
called Apache Nutch, an open-source web crawler.
o While working on Apache Nutch, they were dealing with big data. Storing
that data was very costly, which became a serious problem for the
project. This problem became one of the important reasons for the
emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File
System). It is a proprietary distributed file system developed to provide
efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique
simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system
known as NDFS (Nutch Distributed File System), together with an
implementation of MapReduce.

Year  Event

2003  Google released the paper on the Google File System (GFS).
2004  Google released a white paper on MapReduce.
2006  o Hadoop introduced.
      o Hadoop 0.1.0 released.
      o Yahoo deploys 300 machines and within this year reaches
        600 machines.
2007  o Yahoo runs 2 clusters of 1000 machines.
      o Hadoop includes HBase.
2008  o YARN JIRA opened.
      o Hadoop becomes the fastest system to sort 1 terabyte of
        data on a 900 node cluster within 209 seconds.
      o Yahoo clusters loaded with 10 terabytes per day.
      o Cloudera was founded as a Hadoop distributor.
2009  o Yahoo runs 17 clusters of 24,000 machines.
      o Hadoop becomes capable enough to sort a petabyte.
      o MapReduce and HDFS become separate subprojects.
2010  o Hadoop added support for Kerberos.
      o Hadoop operates 4,000 nodes with 40 petabytes.
      o Apache Hive and Pig released.
2011  o Apache Zookeeper released.
      o Yahoo has 42,000 Hadoop nodes and hundreds of petabytes
        of storage.
2012  Apache Hadoop 1.0 version released.
2013  Apache Hadoop 2.2 version released.
2014  Apache Hadoop 2.6 version released.
2015  Apache Hadoop 2.7 version released.
2017  Apache Hadoop 3.0 version released.
2018  Apache Hadoop 3.1 version released.

o In 2006, Doug Cutting joined Yahoo. Based on the Nutch project, he
introduced a new project, Hadoop, with a file system known as HDFS
(Hadoop Distributed File System). The first Hadoop version, 0.1.0, was
released in this year.
o Doug Cutting named the project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data,
doing so on a 900 node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
