What Is Hadoop Distributed File System (HDFS) PDF
What Is Hadoop Distributed File System (HDFS) PDF
Using big data platforms for data management, access and analytics
DEFINITION
con
u The Hadoop Distributed File System (HDFS) is the primary data storage system used
Contributor(s): Jack Vaughan and Emma Preslar
by Hadoop applications. It employs a NameNode and DataNode architecture to implement a
c distributed file system that provides high-performance access to data across highly scalable
o Hadoop clusters.
HDFS uses master/slave architecture. In its initial incarnation, each Hadoop cluster consisted
of a single NameNode that managed file system operations and supporting DataNodes that
managed data storage on individual compute nodes. The HDFS elements combine to
support applications with large data sets.
This master node "data chunking" architecture takes as its design guides elements from
Google File System (GFS), a proprietary file system outlined in in Google technical papers,
as well as IBM's General Parallel File System (GPFS), a format that boosts I/O by striping
blocks of data over multiple disks, writing blocks in parallel. While HDFS is not Portable
Operating System Interface model-compliant, it echoes POSIX design style in some aspects.
k HDFS architecture centers on commanding NameNodes that hold metadata and DataNodes that store information in
blocks. Working at the heart of Hadoop, HDFS can replicate data at great scale.
But the file system found use beyond that. HDFS was used by The New York Times as part
of large-scale image conversions, Media6Degrees for log processing and machine learning,
LiveBet for log storage and odds analysis, Joost for session analysis and Fox Audience
Network for log analysis and data mining. HDFS is also at the core of many open source data
warehouse alternatives, sometimes called data lakes.
Because HDFS is typically deployed as part of very large-scale implementations, support for
low-cost commodity hardware is a particularly useful feature. Such systems, running web
search and related applications, for example, can range into the hundreds of petabytes and
thousands of nodes. They must be especially resilient, as server failures are common at such
scale.
Prev
Next
The basic HDFS standard has been continuously updated since its inception.
With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was added,
and MapReduce and HDFS were effectively decoupled. Thereafter, diverse data processing
frameworks and file systems were supported by Hadoop. While MapReduce was often
replaced by Apache Spark, HDFS continued to be a prevalent file format for Hadoop.
After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available in
December 2017, with HDFS enhancements supporting additional NameNodes, erasure
coding facilities and greater data compression. At the same time, advances in HDFS tooling,
such as LinkedIn's open source Dr. Elephant and Dynamometer performance testing tools,
have expanded to enable development of ever larger HDFS implementations.
∙∙ Hadoop co-creator talks where Hadoop has been and where it's going
Related Terms
Apache Hive
Apache Hive is an open source data warehouse system for querying and analyzing large data sets that are
principally stored in ... See complete definitionq
Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for
big data applications ... See complete definitionq
Add My Comment
Oldest 5
Reply
[-] RajaBaba - 29 Aug 2017 10:48 PM
t
Huge databases are used to identify in instance of reference is the goal of Hadoop developer.
Reply
-ADS BY GOOGLE
Powered by:
SearchBusinessAnalytics
Latest TechTarget
resources
A2organizations 2 2
Data-rich Cloud AutoML, Google's
BUSINESS ANALYTICS
BigQuery ML BigQuery ML
AWS
turn focus to upgrades needs big
ethical data expand Google changes to
CONTENT MANAGEMENT
mining stake in ML compete
ORACLE
As data analytics have increasingly Google introduced a raft of updates to On the eve of Google Cloud Next '19,
SAP become a core component of its cloud-based machine learning and experts assess the impact of Google's
organizations' strategies, concerns AI products, including expanded BigQuery ML since its release and their
SQL SERVER have arisen around how data is... capabilities for its ... hopes for the ...