
The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
Yahoo!, Sunnyvale, California USA
{Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com

Abstract: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Keywords: Hadoop, HDFS, distributed file system

I. INTRODUCTION AND RELATED WORK

Hadoop [1][16][19] provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce [3] paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 25,000 servers and store 25 petabytes of application data, with the largest cluster being 3500 servers. One hundred other organizations worldwide report using Hadoop.

Table 1. Hadoop project components

  HDFS        Distributed file system. Subject of this paper!
  MapReduce   Distributed computation framework
  HBase       Column-oriented table service
  Pig         Dataflow language and parallel execution framework
  Hive        Data warehouse infrastructure
  ZooKeeper   Distributed coordination service
  Chukwa      System for collecting management data
  Avro        Data serialization system

Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop (HDFS and MapReduce). HBase was originally developed at Powerset, now a department at Microsoft. Hive [15] was originated and developed at Facebook. Pig [4], ZooKeeper [6], and Chukwa were originated and developed at Yahoo! Avro was originated at Yahoo! and is being co-developed with Cloudera.

HDFS is the file system component of Hadoop. While the interface to HDFS is patterned after the UNIX file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.

HDFS stores file system metadata and application data separately. As in other distributed file systems, like PVFS [2][14], Lustre [7] and GFS [5][8], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.

Unlike Lustre and PVFS, the DataNodes in HDFS do not use data protection mechanisms such as RAID to make the data durable. Instead, like GFS, the file content is replicated on multiple DataNodes for reliability. While ensuring data durability, this strategy has the added advantage that data transfer bandwidth is multiplied, and there are more opportunities for locating computation near the needed data.

Several distributed file systems have or are exploring truly distributed implementations of the namespace. Ceph [17] has a cluster of namespace servers (MDS) and uses a dynamic subtree partitioning algorithm in order to map the namespace tree to MDSs evenly. GFS is also evolving into a distributed namespace implementation [8]. The new GFS will have hundreds of namespace servers (masters) with 100 million files per master. Lustre [7] has an implementation of clustered namespace on its roadmap for the Lustre 2.2 release. The intent is to stripe a directory over multiple metadata servers (MDS), each of which contains a disjoint portion of the namespace. A file is assigned to a particular MDS using a hash function on the file name.

II. ARCHITECTURE

A. NameNode

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas. The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file) and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file). The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data).


An HDFS client wanting to read a file first contacts the NameNode for the locations of data blocks comprising the file and then reads block contents from the DataNode closest to the client. When writing data, the client requests the NameNode to nominate a suite of three DataNodes to host the block replicas. The client then writes data to the DataNodes in a pipeline fashion. The current design has a single NameNode for each cluster. The cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, as each DataNode may execute multiple application tasks concurrently.

HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image stored in the local host's native file system is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host's native file system. For improved durability, redundant copies of the checkpoint and journal can be made at other servers. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal. The locations of block replicas may change over time and are not part of the persistent checkpoint.

B. DataNodes

Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file is the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.

The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus preserving the integrity of the file system. The consistency of software versions is important because an incompatible version may cause data corruption or loss, and on large clusters of thousands of machines it is easy to overlook nodes that did not shut down properly prior to the software upgrade or were not available during the upgrade.

A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.

After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.

Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's space allocation and load balancing decisions. The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:

  1. replicate blocks to other nodes;
  2. remove local block replicas;
  3. re-register or shut down the node;
  4. send an immediate block report.

These commands are important for maintaining the overall system integrity and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.

C. HDFS Client

User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface. Similar to most conventional file systems, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application generally does not need to know that file system metadata and storage are on different servers, or that blocks have multiple replicas.

When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1.

Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes, which eventually confirm the creation of the block replicas to the NameNode.

Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, having a higher replication factor improves their tolerance against faults and increases their read bandwidth.

D. Image and Journal

The namespace image is the file system metadata that describes the organization of application data as directories and files. A persistent record of the image written to disk is called a checkpoint. The journal is a write-ahead commit log for changes to the file system that must be persistent. For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synched before the change is committed to the HDFS client. The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart, when requested by the administrator, or by the CheckpointNode described in the next section. During startup the NameNode initializes the namespace image from the checkpoint, and then replays changes from the journal until the image is up-to-date with the last state of the file system. A new checkpoint and empty journal are written back to the storage directories before the NameNode starts serving clients.

If either the checkpoint or the journal is missing, or becomes corrupt, the namespace information will be lost partly or entirely. In order to preserve this critical information, HDFS can be configured to store the checkpoint and journal in multiple storage directories. Recommended practice is to place the directories on different volumes, and for one storage directory to be on a remote NFS server. The first choice prevents loss from single volume failures, and the second choice protects against failure of the entire node. If the NameNode encounters an error writing the journal to one of the storage directories it automatically excludes that directory from the list of storage directories. The NameNode automatically shuts itself down if no storage directory is available.

The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk becomes a bottleneck since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. In order to optimize this process the NameNode batches multiple transactions initiated by different clients. When one of the NameNode's threads initiates a flush-and-sync operation, all transactions batched at that time are committed together. Remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.
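As an illustration of the client interface described in this section, the following minimal sketch in Java writes and reads a file through the HDFS client library. The NameNode address, paths, and replication factor are hypothetical; the FileSystem calls shown (create, open, setReplication, getFileBlockLocations) are the standard org.apache.hadoop.fs API that the client exports. Note that nothing in the sketch names a DataNode: block placement, pipelining, and replica selection all happen behind this interface.

    // Minimal sketch of an HDFS client session (hypothetical host and paths).
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI names the NameNode; DataNodes are resolved behind this interface.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020/"), conf);
        Path file = new Path("/user/alice/demo.txt");

        // Write: the client asks the NameNode for target DataNodes and
        // pipelines the bytes to them.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello, hdfs\n");
        out.close();

        // Applications may raise the replication factor of hot or critical files.
        fs.setReplication(file, (short) 5);

        // The API also exposes block locations, which frameworks such as
        // MapReduce use to schedule tasks close to the data.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println(java.util.Arrays.toString(loc.getHosts()));
        }

        // Read: the client fetches block locations from the NameNode and
        // reads from the closest replica.
        FSDataInputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, true); // true: close the stream when done
      }
    }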

E. CheckpointNode

The NameNode in HDFS, in addition to its primary role of serving client requests, can alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. The role is specified at node startup.

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different host from the NameNode since it has the same memory requirements as the NameNode. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.

Creating periodic checkpoints is one way to protect the file system metadata. The system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal are unavailable.

Creating a checkpoint lets the NameNode truncate the tail of the journal when the new checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time without restarts, during which the journal constantly grows. If the journal grows very large, the probability of loss or corruption of the journal file increases. Also, a very large journal extends the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-long journal. Good practice is to create a daily checkpoint.

". BackupNode
$ recently introduced "eature o" HDFS is the BackupNode* .ike a Checkpoint+ode, the ?ackup+ode is capable o" creating periodic checkpoints, but in addition it maintains an in(memory, up( to(date image o" the "ile system namespace that is al5ays synchroni>ed 5ith the state o" the +ame+ode* The ?ackup+ode accepts the journal stream o" namespace transactions "rom the active +ame+ode, saves them to its o5n storage directories, and applies these transactions to its o5n namespace image in memory* The +ame+ode treats the ?ackup+ode as a journal store the same as it treats journal "iles in its storage directories* )" the +ame+ode "ails, the ?ackup+odeKs image in memory and the checkpoint on disk is a record o" the latest namespace state*

The ?ackup+ode can create a checkpoint 5ithout do5n(loading checkpoint and journal "iles "rom the active +ame+ode, since it already has an up(to(date namespace im(age in its memory* This makes the checkpoint process on the ?ackup+ode more e""icient as it only needs to save the name(space into its local storage directories* The ?ackup+ode can be vie5ed as a read(only +ame+ode* )t contains all "ile system metadata in"ormation e:cept "or block locations* )t can per"orm all operations o" the regular +ame+ode that do not involve modi"ication o" the namespace or kno5ledge o" block locations* #se o" a ?ackup+ode pro(vides the option o" running the +ame+ode 5ithout persistent storage, delegating responsibility "or the namespace state per(sisting to the ?ackup+ode*

G. Upgrades, File System Snapshots

During software upgrades the possibility of corrupting the system due to software bugs or human mistakes increases. The purpose of creating snapshots in HDFS is to minimize potential damage to the data stored in the system during upgrades.

The snapshot mechanism lets administrators persistently save the current state of the file system, so that if the upgrade results in data loss or corruption it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

The snapshot (only one can exist) is created at the cluster administrator's option whenever the system is started. If a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges them in memory. Then it writes the new checkpoint and the empty journal to a new location, so that the old checkpoint and journal remain unchanged.

During handshake the NameNode instructs DataNodes whether to create a local snapshot. The local snapshot on the DataNode cannot be created by replicating the data file directories, as this would require doubling the storage capacity of every DataNode on the cluster. Instead each DataNode creates a copy of the storage directory and hard links existing block files into it. When the DataNode removes a block it removes only the hard link, and block modifications during appends use the copy-on-write technique. Thus old block replicas remain untouched in their old directories.

The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the system. The NameNode recovers the checkpoint saved when the snapshot was created. DataNodes restore the previously renamed directories and initiate a background process to delete block replicas created after the snapshot was made. Having chosen to roll back, there is no provision to roll forward. The cluster administrator can recover the storage occupied by the snapshot by commanding the system to abandon the snapshot, thus finalizing the software upgrade.

System evolution may lead to a change in the format of the NameNode's checkpoint and journal files, or in the data representation of block replica files on DataNodes. The layout version identifies the data representation formats, and is persistently stored in the NameNode's and the DataNodes' storage directories. During startup each node compares the layout version of the current software with the version stored in its storage directories and automatically converts data from older formats to the newer ones. The conversion requires the mandatory creation of a snapshot when the system restarts with the new software layout version.

HDFS does not separate layout versions for the NameNode and DataNodes because snapshot creation must be an all-cluster effort rather than a node-selective event. If an upgraded NameNode due to a software bug purges its image, then backing up only the namespace state still results in total data loss, as the NameNode will not recognize the blocks reported by DataNodes, and will order their deletion. Rolling back in this case will recover the metadata, but the data itself will be lost. A coordinated snapshot is required to avoid a cataclysmic destruction.

III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT

A. File Read and Write

An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed, the bytes written cannot be altered or removed except that new data can be added to the file by reopening the file for append. HDFS implements a single-writer, multiple-reader model.

The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If after the hard limit expires (one hour) the client has failed to renew the lease, HDFS assumes that the client has quit and will automatically close the file on behalf of the writer and recover the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

An HDFS file consists of blocks. When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. The DataNodes form a pipeline, the order of which minimizes the total network distance from the client to the last DataNode. Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes first buffer at the client side. After a packet buffer is filled (typically 64 KB), the data are pushed to the pipeline. The next packet can be pushed to the pipeline before receiving the acknowledgement for the previous packets. The number of outstanding packets is limited by the outstanding packets window size of the client.
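As a brief illustration of the single-writer, append-only model described above, the sketch below reopens an existing file and adds data to it; while the stream is open this client holds the file's lease. The path is hypothetical, and append support must be enabled in the cluster configuration; FileSystem.append is the relevant API call.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/alice/events.log"); // hypothetical existing file

        // Reopening for append grants this client the file's lease;
        // a second concurrent writer would be rejected.
        FSDataOutputStream out = fs.append(log);
        out.writeBytes("another record\n");
        out.close(); // closing the file releases the lease
      }
    }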

"igure -. Data pipeline during block construction


)" no error occurs, block construction goes through three stages as sho5n in Fig* ; illustrating a pipeline o" three Data+odes 8D+9 and a block o" "ive packets* )n the picture,

$"ter data are 5ritten to an HDFS "ile, HDFS does not pro(vide any guarantee that data are visible to a ne5 reader until the "ile is closed* )" a user application needs the visibility guaran(tee, it can e:plicitly call the hflush operation* Then the current packet is immediately pushed to the pipeline, and the h"lush operation 5ill 5ait until all Data+odes in the pipeline ac(kno5ledge the success"ul transmission o" the packet* $ll data 5ritten be"ore the h"lush operation are then certain to be visible to readers*

bold lines represent data packets, dashed lines represent ac( kno5ledgment messages, and thin lines represent control mes(sages to setup and close the pipeline* Gertical lines represent activity at the client and the three Data+odes 5here time pro(ceeds "rom top to bottom* From t = to t1 is the pipeline setup stage* The interval t1 to t; is the data streaming stage, 5here t1 is the time 5hen the "irst data packet gets sent and t ; is the time that the ackno5ledgment to the last packet gets received* Here an h"lush operation transmits the second packet* The h"lush indication travels 5ith the packet data and is not a separate operation* The "inal interval t ; to t7 is the pipeline close stage "or this block*
)n a cluster o" thousands o" nodes, "ailures o" a node 8most commonly storage "aults9 are daily occurrences* $ replica stored on a Data+ode may become corrupted because o" "aults in memory, disk, or net5ork* HDFS generates and stores checksums "or each data block o" an HDFS "ile* Checksums are veri"ied by the HDFS client 5hile reading to help detect any corruption caused either by client, Data+odes, or net5ork* /hen a client creates an HDFS "ile, it computes the checksum seJuence "or each block and sends it to a Data+ode along 5ith the data* $ Data+ode stores checksums in a metadata "ile sepa(rate "rom the blockKs data "ile* /hen HDFS reads a "ile, each blockKs data and checksums are shipped to the client* The client computes the checksum "or the received data and veri"ies that the ne5ly computed checksums matches the checksums it re(ceived* )" not, the client noti"ies the +ame+ode o" the corrupt replica and then "etches a di""erent replica o" the block "rom another Data+ode*
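A minimal sketch of the hflush visibility guarantee follows. The file name is hypothetical; the call is hflush() on FSDataOutputStream in current Hadoop releases (earlier releases exposed equivalent behavior as sync()).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/alice/stream.dat"));

        out.writeBytes("record 1\n");
        // Push the current packet down the pipeline and wait for every
        // DataNode to acknowledge it; the bytes written so far are now
        // visible to new readers even though the file is still open.
        out.hflush();

        out.writeBytes("record 2\n");
        out.close();
      }
    }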

When a client opens a file to read, it fetches the list of blocks and the locations of each block replica from the NameNode. The locations of each block are ordered by their distance from the reader. When reading the content of a block, the client tries the closest replica first. If the read attempt fails, the client tries the next replica in sequence. A read may fail if the target DataNode is unavailable, the node no longer hosts a replica of the block, or the replica is found to be corrupt when checksums are tested.

HDFS permits a client to read a file that is open for writing. When reading a file open for writing, the length of the last block still being written is unknown to the NameNode. In this case, the client asks one of the replicas for the latest length before starting to read its content.

The design of HDFS I/O is particularly optimized for batch processing systems, like MapReduce, which require high throughput for sequential reads and writes. However, much effort has gone into improving its read/write response time in order to support applications like Scribe, which provides real-time data streaming to HDFS, or HBase, which provides random, real-time access to large tables.

B. Block Placement

For a large cluster, it may not be practical to connect all nodes in a flat topology. A common practice is to spread the nodes across multiple racks. Nodes of a rack share a switch, and rack switches are connected by one or more core switches. Communication between two nodes in different racks has to go through multiple switches. In most cases, network bandwidth between nodes in the same rack is greater than network bandwidth between nodes in different racks. Fig. 3 describes a cluster with two racks, each of which contains three nodes.

Figure 3. Cluster topology example (two racks: DN00-DN02 on Rack 0, DN10-DN12 on Rack 1)

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. A distance between two nodes can be calculated by summing up their distances to their closest common ancestor. A shorter distance between two nodes means greater bandwidth they can utilize to transfer data.

HDFS allows an administrator to configure a script that returns a node's rack identification given a node's address. The NameNode is the central place that resolves the rack location of each DataNode. When a DataNode registers with the NameNode, the NameNode runs the configured script to decide which rack the node belongs to. If no such script is configured, the NameNode assumes that all the nodes belong to a default single rack.

The placement of replicas is critical to HDFS data reliability and read/write performance. A good replica placement policy should improve data reliability, availability, and network bandwidth utilization. Currently HDFS provides a configurable block placement policy interface so that users and researchers can experiment and test any policy that is optimal for their applications.

The default HDFS block placement policy provides a tradeoff between minimizing the write cost and maximizing data reliability, availability and aggregate read bandwidth. When a new block is created, HDFS places the first replica on the node where the writer is located, the second and the third replicas on two different nodes in a different rack, and the rest are placed on random nodes with the restrictions that no more than one replica is placed at any one node and no more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks. The choice to place the second and third replicas on a different rack better distributes the block replicas for a single file across the cluster. If the first two replicas were placed on the same rack, for any file, two-thirds of its block replicas would be on the same rack.

After all target nodes are selected, nodes are organized as a pipeline in the order of their proximity to the first replica. Data are pushed to nodes in this order. For reading, the NameNode first checks if the client's host is located in the cluster. If yes, block locations are returned to the client in the order of their closeness to the reader. The block is read from DataNodes in this preference order. (It is usual for MapReduce applications to run on cluster nodes, but as long as a host can connect to the NameNode and DataNodes, it can execute the HDFS client.)

This policy reduces the inter-rack and inter-node write traffic and generally improves write performance. Because the chance of a rack failure is far less than that of a node failure, this policy does not impact data reliability and availability guarantees. In the usual case of three replicas, it can reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three.

The de"ault HDFS replica placement policy can be summa(ri>ed as "ollo5sL

1. +o Datanode contains more than one replica


o" any block*

2. +o rack contains more than t5o replicas o"


the same block, provided there are su""icient racks on the cluster*
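The distance metric used above can be made concrete with a short, self-contained sketch; this illustrates the rule only and is not Hadoop's implementation. Node locations are written in the same "/rack/host" style that the rack-awareness script returns.

    // Illustrative only: distance = hops from each node to the closest common ancestor.
    public class TopologyDistanceSketch {
      static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
          common++;
        }
        return (pa.length - common) + (pb.length - common);
      }

      public static void main(String[] args) {
        System.out.println(distance("/rack0/dn00", "/rack0/dn01")); // 2: same rack
        System.out.println(distance("/rack0/dn00", "/rack1/dn10")); // 4: different racks
        System.out.println(distance("/rack0/dn00", "/rack0/dn00")); // 0: same node
      }
    }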

C. Replication Management

The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode will prefer not to reduce the number of racks that host replicas, and secondly prefers to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability.

When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas that is greater than two thirds of its replication factor has the lowest priority. A background thread periodically scans the head of the replication queue to decide where to place new replicas. Block replication follows a similar policy as that of new block placement. If the number of existing replicas is one, HDFS places the next replica on a different rack. In case the block has two existing replicas, if the two existing replicas are on the same rack, the third replica is placed on a different rack; otherwise, the third replica is placed on a different node in the same rack as an existing replica. Here the goal is to reduce the cost of creating new replicas.

The NameNode also makes sure that not all replicas of a block are located on one rack. If the NameNode detects that a block's replicas end up at one rack, the NameNode treats the block as under-replicated and replicates the block to a different rack using the same block placement policy described above. After the NameNode receives the notification that the replica is created, the block becomes over-replicated. The NameNode then decides to remove an old replica because the over-replication policy prefers not to reduce the number of racks.
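The rack decision in the re-replication rule above can be summarized in a small illustrative function; this is a simplified sketch of the stated policy, not the NameNode's code.

    import java.util.List;

    public class ReplicaTargetSketch {
      /** True if the next replica of an under-replicated block should go to a new rack. */
      static boolean placeOnDifferentRack(List<String> existingReplicaRacks) {
        if (existingReplicaRacks.size() == 1) {
          return true; // the second replica always leaves the first rack
        }
        if (existingReplicaRacks.size() == 2) {
          // Third replica: spread across racks only if both copies share one rack.
          return existingReplicaRacks.get(0).equals(existingReplicaRacks.get(1));
        }
        return false; // further replicas: same-rack placement is acceptable (cheaper)
      }

      public static void main(String[] args) {
        System.out.println(placeOnDifferentRack(List.of("/rack0")));           // true
        System.out.println(placeOnDifferentRack(List.of("/rack0", "/rack0"))); // true
        System.out.println(placeOnDifferentRack(List.of("/rack0", "/rack1"))); // false
      }
    }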

D. Balancer

HDFS block placement strategy does not take into account DataNode disk space utilization. This is to avoid placing new data, which is more likely to be referenced, at a small subset of the DataNodes. Therefore data might not always be placed uniformly across DataNodes. Imbalance also occurs when new nodes are added to the cluster.

The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a threshold value as an input parameter, which is a fraction in the range of (0, 1). A cluster is balanced if, for each DataNode, the utilization of the node (ratio of used space at the node to total capacity of the node) differs from the utilization of the whole cluster (ratio of used space in the cluster to total capacity of the cluster) by no more than the threshold value. The tool is deployed as an application program that can be run by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. One key requirement for the balancer is to maintain data availability. When choosing a replica to move and deciding its destination, the balancer guarantees that the decision does not reduce either the number of replicas or the number of racks.

The balancer optimizes the balancing process by minimizing the inter-rack data copying. If the balancer decides that a replica A needs to be moved to a different rack and the destination rack happens to have a replica B of the same block, the data will be copied from replica B instead of replica A.

A second configuration parameter limits the bandwidth consumed by rebalancing operations. The higher the allowed bandwidth, the faster a cluster can reach the balanced state, but with greater competition with application processes.
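The balance criterion itself is easy to state in code. The following is an illustrative check only, not the balancer's implementation; utilizations are fractions of capacity, as in the definition above.

    public class BalanceCheckSketch {
      static boolean isBalanced(long[] used, long[] capacity, double threshold) {
        long totalUsed = 0, totalCapacity = 0;
        for (int i = 0; i < used.length; i++) {
          totalUsed += used[i];
          totalCapacity += capacity[i];
        }
        double clusterUtilization = (double) totalUsed / totalCapacity;
        for (int i = 0; i < used.length; i++) {
          double nodeUtilization = (double) used[i] / capacity[i];
          if (Math.abs(nodeUtilization - clusterUtilization) > threshold) {
            return false; // this node is a source or destination candidate for the balancer
          }
        }
        return true;
      }

      public static void main(String[] args) {
        long[] used = {700, 400, 100};        // GB used per DataNode
        long[] capacity = {1000, 1000, 1000}; // GB capacity per DataNode
        System.out.println(isBalanced(used, capacity, 0.1)); // false: node 0 is ~0.3 above the cluster average
      }
    }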

E. Block Scanner

Each DataNode runs a block scanner that periodically scans its block replicas and verifies that stored checksums match the block data. In each scan period, the block scanner adjusts the read bandwidth in order to complete the verification in a configurable period. If a client reads a complete block and checksum verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.

The verification time of each block is stored in a human-readable log file. At any time there are up to two files in the top-level DataNode directory, the current and prev logs. New verification times are appended to the current file. Correspondingly, each DataNode has an in-memory scanning list ordered by the replica's verification time.

Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode marks the replica as corrupt, but does not schedule deletion of the replica immediately. Instead, it starts to replicate a good copy of the block. Only when the good replica count reaches the replication factor of the block is the corrupt replica scheduled to be removed. This policy aims to preserve data as long as possible. So even if all replicas of a block are corrupt, the policy allows the user to retrieve its data from the corrupt replicas.

F. Decommissioning

The cluster administrator specifies which nodes can join the cluster by listing the host addresses of nodes that are permitted to register and the host addresses of nodes that are not permitted to register. The administrator can command the system to re-evaluate these include and exclude lists. A present member of the cluster that becomes excluded is marked for decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the target of replica placement, but it will continue to serve read requests. The NameNode starts to schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks on the decommissioning DataNode are replicated, the node enters the decommissioned state. Then it can be safely removed from the cluster without jeopardizing any data availability.

G. Inter-Cluster Data Copy

When working with large datasets, copying data into and out of an HDFS cluster is daunting. HDFS provides a tool called DistCp for large inter/intra-cluster parallel copying. It is a MapReduce job; each of the map tasks copies a portion of the source data into the destination file system. The MapReduce framework automatically handles parallel task scheduling, error detection and recovery.

IV. PRACTICE AT YAHOO!

Large HDFS clusters at Yahoo! include about 3500 nodes. A typical cluster node has:

  1. 2 quad core Xeon processors @ 2.5 GHz
  2. Red Hat Enterprise Linux Server Release 5.1
  3. Sun Java JDK 1.6.0_13-b03
  4. 4 directly attached SATA drives (one terabyte each)
  5. 16 GB RAM
  6. 1-gigabit Ethernet

Seventy percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.) Forty nodes in a single rack share an IP switch. The rack switches are connected to each of eight core switches. The core switches provide connectivity between racks and to out-of-cluster resources. For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64 GB RAM; application tasks are never assigned to those hosts. In total, a cluster of 3500 nodes has 9.8 PB of storage available as blocks that are replicated three times, yielding a net 3.3 PB of storage for user applications. As a convenient approximation, one thousand nodes represent one PB of application storage. Over the years that HDFS has been in use (and into the future), the hosts selected as cluster nodes benefit from improved technologies. New cluster nodes always have faster processors, bigger disks and larger RAM. Slower, smaller nodes are retired or relegated to clusters reserved for development and testing of Hadoop. The choice of how to provision a cluster node is largely an issue of economically purchasing computation and storage. HDFS does not compel a particular ratio of computation to storage, or set a limit on the amount of storage attached to a cluster node.
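As a back-of-the-envelope check, the capacity figures quoted above are mutually consistent. With 4 TB of disk per node, 70% of it allocated to HDFS, and three-way replication:

    3500 \times 4\,\mathrm{TB} \times 0.7 \approx 9.8\,\mathrm{PB}\ \text{(raw HDFS space)},\qquad
    9.8\,\mathrm{PB} / 3 \approx 3.3\,\mathrm{PB}\ \text{(net user storage)},\qquad
    3.3\,\mathrm{PB} \times \tfrac{1000}{3500} \approx 0.94\,\mathrm{PB},

which matches the quoted 9.8 PB raw, 3.3 PB net, and roughly one PB of application storage per thousand nodes.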

The veri"ication time o" each block is stored in a human readable log "ile* $t any time there are up to t5o "iles in top(level Data+ode directory, current and prev logs* +e5 veri"ica(tion times are appended to current "ile* Correspondingly each Data+ode has an in(memory scanning list ordered by the rep(licaKs veri"ication time*
/henever a read client or a block scanner detects a corrupt block, it noti"ies the +ame+ode* The +ame+ode marks the replica as corrupt, but does not schedule deletion o" the replica immediately* )nstead, it starts to replicate a good copy o" the block* ,nly 5hen the good replica count reaches the replication "actor o" the block the corrupt replica is scheduled to be re(moved* This policy aims to preserve data as long as possible* So even i" all replicas o" a block are corrupt, the policy allo5s the user to retrieve its data "rom the corrupt replicas*

". Decommissioing
The cluster administrator speci"ies 5hich nodes can join the cluster by listing the host addresses o" nodes that are permitted

block typically is replicated three times, every data node hosts <E === block replicas* -ach day user applications 5ill create t5o million ne5 "iles on the cluster* The ;< === nodes in Hadoop clusters at ahoo! provide ;< @? o" on(line data stor(age* $t the start o" ;=1=, this is a modestObut gro5ingO "raction o" the data processing in"rastructure at ahoo!* ahoo! began to investigate 6apReduce programming 5ith a distrib(uted "ile system in ;==E* The $pache Hadoop project 5as "ounded in ;==3* ?y the end o" that year, ahoo! had adopted Hadoop "or internal use and had a 7== (node cluster "or devel(opment* Since then HDFS has become integral to the back o"("ice at ahoo!* The "lagship application "or HDFS has been the production o" the /eb 6ap, an inde: o" the /orld /ide /eb that is a critical component o" search 8H< hours elapsed time, <== terabytes o" 6apReduce intermediate data, 7== terabytes total output9* 6ore applications are moving to Hadoop, espe(cially those that analy>e and model user behavior*

A. Durability of Data

Replication of data three times is a robust guard against loss of data due to uncorrelated node failures. It is unlikely Yahoo! has ever lost a block in this way; for a large cluster, the probability of losing a block during one year is less than 0.005. The key understanding is that about 0.8 percent of nodes fail each month. (Even if the node is eventually recovered, no effort is taken to recover data it may have hosted.) So for the sample large cluster described above, a node or two is lost each day. That same cluster will re-create the 54,000 block replicas hosted on a failed node in about two minutes. (Re-replication is fast because it is a parallel problem that scales with the size of the cluster.) The probability of several nodes failing within two minutes such that all replicas of some block are lost is indeed small.

Correlated failure of nodes is a different threat. The most commonly observed fault in this regard is the failure of a rack or core switch. HDFS can tolerate losing a rack switch (each block has a replica on some other rack). Some failures of a core switch can effectively disconnect a slice of the cluster from multiple racks, in which case it is probable that some blocks will become unavailable. In either case, repairing the switch restores unavailable replicas to the cluster. Another kind of correlated failure is the accidental or deliberate loss of electrical power to the cluster. If the loss of power spans racks, it is likely that some blocks will become unavailable. But restoring power may not be a remedy because one-half to one percent of the nodes will not survive a full power-on restart. Statistically, and in practice, a large cluster will lose a handful of blocks during a power-on restart. (The strategy of deliberately restarting one node at a time over a period of weeks to identify nodes that will not survive a restart has not been tested.)

In addition to total failures of nodes, stored data can be corrupted or lost. The block scanner scans all blocks in a large cluster each fortnight and finds about 20 bad replicas in the process.

B. Caring for the Commons

As the use of HDFS has grown, the file system itself has had to introduce means to share the resource within a large and diverse user community. The first such feature was a permissions framework closely modeled on the Unix permissions scheme for files and directories. In this framework, files and directories have separate access permissions for the owner, for other members of the user group associated with the file or directory, and for all other users. The principal differences between Unix (POSIX) and HDFS are that ordinary files in HDFS have neither "execute" permissions nor "sticky" bits.

In the present permissions framework, user identity is weak: you are who your host says you are. When accessing HDFS, the application client simply queries the local operating system for user identity and group membership. A stronger identity model is under development. In the new framework, the application client must present to the name system credentials obtained from a trusted source. Different credential administrations are possible; the initial implementation will use Kerberos. The user application can use the same framework to confirm that the name system also has a trustworthy identity. And the name system also can demand credentials from each of the data nodes participating in the cluster.

The total space available for data storage is set by the number of data nodes and the storage provisioned for each node. Early experience with HDFS demonstrated a need for some means to enforce the resource allocation policy across user communities. Not only must fairness of sharing be enforced, but when a user application might involve thousands of hosts writing data, protection against applications inadvertently exhausting resources is also important. For HDFS, because the system metadata are always in RAM, the size of the namespace (number of files and directories) is also a finite resource. To manage storage and namespace resources, each directory may be assigned a quota for the total space occupied by files in the sub-tree of the namespace beginning at that directory. A separate quota may also be set for the total number of files and directories in the sub-tree.

While the architecture of HDFS presumes most applications will stream large data sets as input, the MapReduce programming framework can have a tendency to generate many small output files (one from each reduce task), further stressing the namespace resource. As a convenience, a directory sub-tree can be collapsed into a single Hadoop Archive file. A HAR file is similar to a familiar tar, JAR, or Zip file, but file system operations can address the individual files within the archive, and a HAR file can be used transparently as the input to a MapReduce job.
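The permissions model described above is exercised through the ordinary FileSystem API; the sketch below uses hypothetical user, group, and path names. (Space and name quotas, by contrast, are set administratively, for example with the dfsadmin tool.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class PermissionsSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path projectDir = new Path("/user/alice/project"); // hypothetical directory

        // Owner and group, as in Unix.
        fs.setOwner(projectDir, "alice", "analytics");

        // rwx for the owner, r-x for the group, nothing for others; note that
        // ordinary HDFS files carry no "execute" permission or sticky bit.
        fs.setPermission(projectDir, new FsPermission((short) 0750));
      }
    }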

C. Benchmarks

A design goal of HDFS is to provide very high I/O bandwidth for large data sets. There are three kinds of measurements that test that goal:

  1. What is the bandwidth observed from a contrived benchmark?
  2. What bandwidth is observed in a production cluster with a mix of user jobs?
  3. What bandwidth can be obtained by the most carefully constructed large-scale user application?

The statistics reported here were obtained from clusters of at least 3500 nodes. At this scale, total bandwidth is linear with the number of nodes, and so the interesting statistic is the bandwidth per node. These benchmarks are available as part of the Hadoop codebase.

The DFSIO benchmark measures average throughput for read, write and append operations. DFSIO is an application available as part of the Hadoop distribution. This MapReduce program reads/writes/appends random data from/to large files. Each map task within the job executes the same operation on a distinct file, transfers the same amount of data, and reports its transfer rate to the single reduce task. The reduce task then summarizes the measurements. The test is run without contention from other applications, and the number of map tasks is chosen to be proportional to the cluster size. It is designed to measure performance only during data transfer, and excludes the overheads of task scheduling, startup, and the reduce task.

  DFSIO Read:  66 MB/s per node
  DFSIO Write: 40 MB/s per node


For a production cluster, the number of bytes read and written is reported to a metrics collection system. These averages are taken over a few weeks and represent the utilization of the cluster by jobs from hundreds of individual users. On average each node was occupied by one or two application tasks at any moment (fewer than the number of processor cores available).

  Busy Cluster Read:  1.02 MB/s per node
  Busy Cluster Write: 1.09 MB/s per node

At the beginning of 2009, Yahoo! participated in the Gray Sort competition [9]. The nature of this task stresses the system's ability to move data from and to the file system (it really isn't about sorting). The competitive aspect means that the results in Table 2 are about the best a user application can achieve with the current design and hardware. The I/O rate in the last column is the combination of reading the input and writing the output from and to HDFS. In the second row, while the rate for HDFS is reduced, the total I/O per node will be about double because for the larger (petabyte!) data set the MapReduce intermediates must also be written to and read from disk. In the smaller test, there is no need to spill the MapReduce intermediates to disk; they are buffered in the memory of the tasks.

Table 2. Sort benchmark for one terabyte and one petabyte of data. Each data record is 100 bytes with a 10-byte key. The test program is a general sorting procedure that is not specialized for the record size. In the terabyte sort, the block replication factor was set to one, a modest advantage for a short test. In the petabyte sort, the replication factor was set to two so that the test would confidently complete in case of a (not unexpected) node failure.

  Bytes (TB) | Nodes | Maps   | Reduces | Time     | HDFS I/O bytes/s, aggregate (GB/s) | per node (MB/s)
  1          | 1460  | 8000   | 2700    | 62 s     | 32                                 | 22.1
  1000       | 3658  | 80,000 | 20,000  | 58,500 s | 34.2                               | 9.35

Large clusters require that the HDFS NameNode support the number of client operations expected in a large cluster. The NNThroughput benchmark is a single-node process which starts the NameNode application and runs a series of client threads on the same node. Each client thread performs the same NameNode operation repeatedly by directly calling the NameNode method implementing this operation. The benchmark measures the number of operations per second performed by the NameNode. The benchmark is designed to avoid communication overhead caused by RPC connections and serialization, and therefore runs clients locally rather than remotely from different nodes. This provides the upper bound of pure NameNode performance.

Table 3. NNThroughput benchmark

  Operation                | Throughput (ops/s)
  Open file for read       | 126,100
  Create file              | 5,600
  Rename file              | 8,300
  Delete file              | 20,700
  DataNode heartbeat       | 300,000
  Blocks report (blocks/s) | 639,700

V. FUTURE WORK

This section presents some of the future work that the Hadoop team at Yahoo! is considering; Hadoop being an open source project implies that new features and changes are decided by the Hadoop development community at large.

The Hadoop cluster is effectively unavailable when its NameNode is down. Given that Hadoop is used primarily as a batch system, restarting the NameNode has been a satisfactory recovery means. However, we have taken steps towards automated failover. Currently a BackupNode receives all transactions from the primary NameNode. This will allow a failover to a warm or even a hot BackupNode if we send block reports to both the primary NameNode and the BackupNode. A few Hadoop users outside Yahoo! have experimented with manual failover. Our plan is to use ZooKeeper, Yahoo!'s distributed consensus technology, to build an automated failover solution.

Scalability of the NameNode [13] has been a key struggle. Because the NameNode keeps all the namespace and block locations in memory, the size of the NameNode heap has limited the number of files and also the number of blocks addressable. The main challenge with the NameNode has been that when its memory usage is close to the maximum, the NameNode becomes unresponsive due to Java garbage collection and sometimes requires a restart. While we have encouraged our users to create larger files, this has not happened, since it would require changes in application behavior. We have added quotas to manage the usage and have provided an archive tool. However, these do not fundamentally address the scalability problem.

Our near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster. We are extending our block IDs to be prefixed by block pool identifiers. Block pools are analogous to LUNs in a SAN storage system, and a namespace with its pool of blocks is analogous to a file system volume.

This approach is fairly simple and requires minimal changes to the system. It offers a number of advantages besides scalability: it isolates namespaces of different sets of applications and improves the overall availability of the cluster. It also generalizes the block storage abstraction to allow other services to use the block storage service with perhaps a different namespace structure. We plan to explore other approaches to scaling, such as storing only a partial namespace in memory and a truly distributed implementation of the NameNode, in the future. In particular, our assumption that applications will create a small number of large files was flawed. As noted earlier, changing application behavior is hard. Furthermore, we are seeing new classes of applications for HDFS that need to store a large number of smaller files.

The main drawback of multiple independent namespaces is the cost of managing them, especially if the number of namespaces is large. We are also planning to use application or job centric namespaces rather than cluster centric namespaces; this is analogous to the per-process namespaces that are used to deal with remote execution in distributed systems in the late 80s and early 90s [10][11][12].

Currently our clusters are less than 4000 nodes. We believe we can scale to much larger clusters with the solutions outlined above. However, we believe it is prudent to have multiple clusters rather than a single large cluster (say three 6000-node clusters rather than a single 18,000-node cluster) as it allows much improved availability and isolation. To that end we are planning to provide greater cooperation between clusters, for example caching remotely accessed files or reducing the replication factor of blocks when file sets are replicated across clusters.

VI. ACKNOWLEDGMENT

We would like to thank all members of the HDFS team at Yahoo!, present and past, for their hard work building the file system. We would like to thank all Hadoop committers and collaborators for their valuable contributions. Corinne Chandel drew illustrations for this paper.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/
[2] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. of 4th Annual Linux Showcase and Conference, 2000, pp. 317–327.
[3] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec. 2004.
[4] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, U. Srivastava, "Building a High-Level Dataflow System on top of MapReduce: The Pig Experience," in Proc. of Very Large Data Bases, vol. 2, no. 2, 2009, pp. 1414–1425.
[5] S. Ghemawat, H. Gobioff, S. Leung, "The Google File System," in Proc. of ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003, pp. 29–43.
[6] F. P. Junqueira, B. C. Reed, "The life and times of a ZooKeeper," in Proc. of the 28th ACM Symposium on Principles of Distributed Computing, Calgary, AB, Canada, August 10–12, 2009.
[7] Lustre File System. http://www.lustre.org
[8] M. K. McKusick, S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, New York, NY, August 2009.
[9] O. O'Malley, A. C. Murthy, "Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds," May 2009. http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
[10] R. Pike, D. Presotto, K. Thompson, H. Trickey, P. Winterbottom, "Use of Name Spaces in Plan 9," Operating Systems Review, 27(2), April 1993, pp. 72–76.
[11] S. Radia, "Naming Policies in the Spring System," in Proc. of 1st IEEE Workshop on Services in Distributed and Networked Environments, June 1994, pp. 164–171.
[12] S. Radia, J. Pachl, "The Per-Process View of Naming and Remote Execution," IEEE Parallel and Distributed Technology, vol. 1, no. 3, August 1993, pp. 71–80.
[13] K. V. Shvachko, "HDFS Scalability: The limits to growth," ;login:, April 2010, pp. 6–16.
[14] W. Tantisiriroj, S. Patil, G. Gibson, "Data-intensive file systems for Internet services: A rose by any other name ..." Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008.
[15] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, "Hive – A Warehousing Solution Over a Map-Reduce Framework," in Proc. of Very Large Data Bases, vol. 2, no. 2, August 2009, pp. 1626–1629.
[16] J. Venner, Pro Hadoop. Apress, June 22, 2009.
[17] S. Weil, S. Brandt, E. Miller, D. Long, C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," in Proc. of the 7th Symposium on Operating Systems Design and Implementation, Seattle, WA, November 2006.
[18] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, B. Zhou, "Scalable Performance of the Panasas Parallel File System," in Proc. of the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.
[19] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, June 5, 2009.
