Unit 2
Topic
History of Hadoop
Apache Hadoop
Hadoop Distributed File System (HDFS)
Components of Hadoop
Data format
Analyzing data with Hadoop
Scaling out
Hadoop streaming
Hadoop pipes
Hadoop Ecosystem
Map Reduce framework and basics
How Map Reduce works
Developing a Map Reduce application
Unit tests with MRUnit
Test data and local tests
Anatomy of a Map Reduce job run
Failures
Job scheduling
Shuffle and sort
Task execution
Map Reduce types
Input formats
Output formats
Map Reduce features, Real-world Map Reduce
History of Hadoop
Hadoop, an open-source framework, emerged from the need to
handle big data—massive datasets that traditional systems struggled to
manage. Here’s how it all began:
1. Origins: In 2002, Doug Cutting and Mike Cafarella embarked on
the Apache Nutch project, aiming to build a web search engine
capable of indexing a billion pages. However, they faced two major
challenges:
o Storage: Storing such vast amounts of data was expensive
using traditional relational databases.
o Processing: Efficiently processing this data required a novel
approach.
2. Google’s Influence: In 2003 and 2004, Cutting and Cafarella came
across Google’s research papers:
o Google File System (GFS, 2003): Described a distributed file
system for storing large datasets.
o MapReduce (2004): Introduced a programming model for
processing these datasets.
3. Open-Source Implementation: Inspired by Google’s techniques,
Cutting and Cafarella decided to implement them as open-source
tools within the Apache Nutch project. They focused on:
o HDFS (Hadoop Distributed File System): A storage solution
akin to GFS.
o MapReduce: A way to process data efficiently.
4. Birth of Hadoop: By 2006, Hadoop was born—a framework
combining distributed storage (HDFS) and large-scale data
processing (MapReduce). It aimed to democratize big data
handling.
What is Hadoop?
Hadoop offers several advantages:
o Fast: In HDFS, data is distributed over the cluster and mapped,
which helps in faster retrieval. The tools that process the data
often run on the same servers, further reducing processing time.
Hadoop can process terabytes of data in minutes and petabytes in
hours.
o Scalable: A Hadoop cluster can be extended by simply adding
nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity
hardware to store data, so it is very cost-effective compared to a
traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if
one node goes down or some other network failure happens,
Hadoop uses another copy of the data. Normally, data is replicated
three times, but the replication factor is configurable.
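Because the replication factor is configurable, it can also be changed per file from application code. The sketch below is a minimal illustration, assuming a reachable HDFS cluster whose settings are on the classpath; the file path and class name are hypothetical. It uses the standard org.apache.hadoop.fs.FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Load the default HDFS configuration (core-site.xml / hdfs-site.xml).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Request that one file's replication be raised from the default (3) to 5.
        // Hypothetical path; replace with a file that exists in your cluster.
        Path file = new Path("/user/demo/important-dataset.csv");
        boolean ok = fs.setReplication(file, (short) 5);
        System.out.println("Replication change requested: " + ok);

        fs.close();
    }
}

The NameNode then schedules extra block copies in the background, which is why the shell help for setrep says the actual replication factor moves toward the target over time.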
Components of Hadoop
Hadoop comprises three core components:
1. Hadoop Distributed File System (HDFS):
o Purpose: Stores data across a cluster of commodity
hardware.
o Key Features:
▪ Distributed: Splits files into blocks and distributes
them across nodes.
▪ Fault-Tolerant: Replicates data for resilience.
▪ Scalable: Handles petabytes of data.
What is HDFS?
Hadoop comes with a distributed file system called HDFS. In HDFS, data
is distributed over several machines and replicated to ensure durability
against failure and high availability to parallel applications. It is
cost-effective because it uses commodity hardware. It is built around the
concepts of blocks, DataNodes and the NameNode.
HDFS is not a good fit for every workload:
o Low-latency data access: Applications that need very fast access to
the first record should not use HDFS, because it is optimized for
high throughput on whole datasets rather than the time to fetch
the first record.
o Lots of small files: The NameNode holds the metadata of all files in
memory, so a large number of small files consumes a
disproportionate amount of NameNode memory, which is not
feasible.
o Multiple writes: HDFS follows a write-once, read-many model, so it
should not be used when files must be modified repeatedly or
written by multiple writers.
HDFS Concepts
Starting HDFS
To start the HDFS daemons:
$ start-dfs.sh
Basic HDFS shell commands
In the commands below, "<localSrc>" and "<localDest>" are paths on the
local file system, while "<src>" and "<dest>" are paths within HDFS.
o put <localSrc> <dest>
Copies the file or directory from the local file system identified by
localSrc to dest within the DFS.
o copyFromLocal <localSrc> <dest>
Identical to -put.
o moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by
localSrc to dest within HDFS, and then deletes the local copy on
success.
o get <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file
system path identified by localDest.
o moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
o cat <filename>
Displays the contents of filename on stdout.
o rm -r <path>
Recursive deleting: removes the file or directory identified by path,
deleting any child entries recursively.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The
actual replication factor will move toward the target over time.)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if
a file already exists at path.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size
in bytes (%b), filename (%n), block size (%o), replication (%r), and
modification date (%y, %Y).
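The same operations are available programmatically. Below is a minimal sketch mirroring -put and -get with the org.apache.hadoop.fs.FileSystem API; it assumes a reachable HDFS cluster configured on the classpath, and the class name and paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -put /tmp/local.txt /user/demo/data.txt
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                             new Path("/user/demo/data.txt"));

        // Equivalent of: hadoop fs -get /user/demo/data.txt /tmp/copy.txt
        fs.copyToLocalFile(new Path("/user/demo/data.txt"),
                           new Path("/tmp/copy.txt"));

        fs.close();
    }
}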
Unlike many other distributed file systems, HDFS is highly fault-tolerant
and can be deployed on low-cost hardware. It can easily handle
applications that involve large data sets.
2. Hadoop MapReduce:
o Purpose: Processes data in parallel across the cluster.
o How It Works:
▪ Map Phase: Transforms input records into intermediate
key-value pairs.
▪ Shuffle and Sort: Groups the intermediate pairs by key
across the cluster.
▪ Reduce Phase: Aggregates the values for each key into the
final results.
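The classic word-count job makes these phases concrete. The sketch below is a minimal, self-contained example written against the org.apache.hadoop.mapreduce API; input and output paths are supplied as command-line arguments, and the class names are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word; the framework's shuffle
    // and sort has already grouped the values by key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}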
What is YARN?
YARN (Yet Another Resource Negotiator), introduced in Hadoop 2, is
Hadoop’s resource-management layer. It allocates cluster resources
(CPU, memory) to applications and schedules their tasks, separating
resource management from data processing.
Components of YARN
o ResourceManager: the global scheduler that arbitrates resources
across the cluster.
o NodeManager: the per-node agent that launches and monitors
containers.
o ApplicationMaster: a per-application process that negotiates
resources and coordinates the application’s tasks.
Benefits of YARN
o Better scalability and cluster utilization, and support for processing
models beyond MapReduce (for example Spark and Tez).
Real-World Applications
Hadoop has revolutionized data processing across industries; typical
uses include web search indexing, log analysis, and recommendation
systems.
Data Formats
MapReduce Types
• Input Formats:
o Define how data is read into MapReduce jobs.
o Examples: TextInputFormat, KeyValueTextInputFormat,
SequenceFileInputFormat.
• Output Formats:
o Specify how results are written.
o Examples: TextOutputFormat, SequenceFileOutputFormat,
AvroOutputFormat.
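Input and output formats are configured on the Job object. The sketch below is a minimal illustration using classes from org.apache.hadoop.mapreduce.lib; it relies on the default identity mapper and reducer, so it simply converts tab-separated text records into a SequenceFile. The class name and paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format demo");
        job.setJarByClass(FormatConfigDemo.class);

        // Read tab-separated key/value text records instead of plain lines.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));

        // Write the results as a binary, splittable SequenceFile.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out"));

        // KeyValueTextInputFormat yields Text keys and Text values, and the
        // identity mapper/reducer pass them through unchanged.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}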