The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems, but the
differences are significant: HDFS is highly fault-tolerant, is designed to be
deployed on low-cost hardware, and provides high-throughput access to application
data, which makes it suitable for applications with large datasets.
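As a concrete illustration, the following is a minimal sketch of writing and reading a small file through the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The NameNode URI and file path are hypothetical placeholders, and a running HDFS cluster is assumed.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this usually comes
        // from core-site.xml (fs.defaultFS) rather than being hard-coded.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits large files into blocks and
        // replicates them across DataNodes transparently.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same FileSystem abstraction.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}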
Apart from the above-mentioned two core components, the Hadoop framework also
includes the following two modules −
Hadoop Common − These are the Java libraries and utilities required by the other
Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
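As a small illustration of Hadoop Common, the sketch below uses its shared Configuration class, which the other modules build on to read cluster settings; the fallback value shown is purely illustrative.

import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        // Configuration (from Hadoop Common) loads core-default.xml and
        // core-site.xml from the classpath; HDFS, YARN, and MapReduce all
        // reuse this class for their own settings.
        Configuration conf = new Configuration();

        // Read the default file system URI; "file:///" here is only an
        // illustrative fallback used when no cluster config is present.
        String fsUri = conf.get("fs.defaultFS", "file:///");
        System.out.println("Default file system: " + fsUri);
    }
}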
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle
large-scale processing, but as an alternative, you can tie together many commodity
computers, each with a single CPU, as a single functional distributed system.
Practically, the clustered machines can read the dataset in parallel and provide
much higher throughput. Moreover, it is cheaper than one high-end server. So this
is the first motivational factor behind using Hadoop: it runs across clustered
and low-cost machines.
Hadoop runs code across a cluster of computers and performs the following core
tasks:
• Data is initially divided into directories and files. Files are divided into
uniform-sized blocks (128 MB by default in Hadoop 2.x; 64 MB in Hadoop 1.x).
• These files are then distributed across various cluster nodes for further
processing.
• HDFS, being on top of the local file system, supervises the processing.
• The framework performs the sort that takes place between the map and reduce
stages (illustrated in the word-count sketch after this list).
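To make the map, sort, and reduce flow above concrete, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API; the class name and input/output paths are illustrative, not prescribed by the text above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every token in an input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the framework has already sorted and grouped the
    // mapper output by key, so each call sees one word and its counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, this can be submitted with, for example, hadoop jar wordcount.jar WordCount /input /output, after which the framework handles splitting, sorting, and scheduling as described in the list above.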
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work across
the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect
and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and Hadoop
continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms, since it is Java-based.