HDFS and Pig

HDFS follows a master-slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. Files are divided into blocks that are replicated across DataNodes for reliability. The NameNode records metadata such as block locations and permissions, while DataNodes store and retrieve the blocks themselves. A Secondary NameNode periodically merges the NameNode's edit log into the FsImage so that the log does not grow without bound. HDFS provides commands like put, get, and ls to interact with files.


HDFS:

Apache HDFS or Hadoop Distributed File System is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines. Apache Hadoop HDFS Architecture follows a Master/Slave Architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.

Features of HDFS

It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
It offers streaming access to file system data.
HDFS provides file permissions and authentication.

HDFS Architecture:

HDFS follows the master-slave architecture and it has the following elements.

NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and
manages the blocks present on the DataNodes (slave nodes). NameNode is a highly
available server that manages the File System Namespace and controls access to files by
clients. I will be discussing this High Availability feature of Apache Hadoop HDFS in my
next blog. The HDFS architecture is built in such a way that the user data never resides on the
NameNode. The data resides on DataNodes only.
Functions of NameNode:

 It is the master daemon that maintains and manages the DataNodes (slave nodes)
 It records the metadata of all the files stored in the cluster, e.g. The location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
o FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage.
 It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
 It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.
 The NameNode is also responsible to take care of the replication factor of all the
blocks which we will discuss in detail later in this HDFS tutorial blog.
 In case of the DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage, and manages the communication traffic to the DataNodes (a quick way to inspect this view from the command line is shown below).
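A quick way to check the view the NameNode maintains (assuming a running cluster and an existing path such as /user/hadoop) is to print the DataNodes it currently knows about and the blocks and locations it has recorded for a path:

hdfs dfsadmin -report
hdfs fsck /user/hadoop -files -blocks -locations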

DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or
high availability. The DataNode is a block server that stores the data in a local file
system such as ext3 or ext4.

Functions of DataNode:

These are slave daemons or processes that run on each slave machine.
 The actual data is stored on DataNodes.
 The DataNodes perform the low-level read and write requests from the file system’s
clients.
 They send heartbeats to the NameNode periodically to report the overall health of
HDFS; by default, this frequency is set to 3 seconds (see the example below).
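The 3-second default comes from the dfs.heartbeat.interval property. Assuming a standard Hadoop installation with the hdfs script on the PATH, you can print the value in effect with:

hdfs getconf -confKey dfs.heartbeat.interval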

By now, you must have realized that the NameNode is critically important. If it fails,
we are doomed. But don’t worry, we will be talking about how Hadoop solved this single
point of failure problem in the next Apache Hadoop HDFS Architecture blog. So, just relax
for now and let’s take one step at a time.

Secondary NameNode:

Apart from these two daemons, there is a third daemon or a process called Secondary
NameNode. The Secondary NameNode works concurrently with the primary NameNode as a
helper daemon. And don’t be confused about the Secondary NameNode being a backup
NameNode because it is not.
Functions of Secondary NameNode:

The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it into the hard disk or the file system.
 It is responsible for combining the EditLogs with FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode, which uses it the next time the NameNode is started.

Hence, Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called CheckpointNode.
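The checkpoint frequency is controlled by the dfs.namenode.checkpoint.period property (3600 seconds by default in recent Hadoop releases). Assuming a standard installation, the configured value can be read with:

hdfs getconf -confKey dfs.namenode.checkpoint.period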

Blocks:
Blocks are nothing but the smallest contiguous location on your hard drive where data is stored. In general, in any file system, you store data as a collection of blocks. Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB, which you can configure as per your requirement.

It is not necessary that in HDFS each file is stored in an exact multiple of the configured block size (128 MB, 256 MB, etc.). Let’s take an example where I have a file “example.txt” of size 514 MB. Suppose that we are using the default block size of 128 MB. Then, how many blocks will be created? 5, right. The first four blocks will be of 128 MB, but the last block will be only 2 MB in size.
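To see this from the shell, the sketch below assumes a local file example.txt and write access to /user/hadoop; -D dfs.blocksize overrides the block size (here 256 MB, given in bytes) for a single upload, -stat %o prints the block size recorded for the file, and fsck lists its blocks:

hdfs dfs -D dfs.blocksize=268435456 -put example.txt /user/hadoop/example.txt
hdfs dfs -stat %o /user/hadoop/example.txt
hdfs fsck /user/hadoop/example.txt -files -blocks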

Replication Management:

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable. So each block is replicated three times and stored on different DataNodes (considering the default replication factor):

Therefore, if you are storing a file of 128 MB in HDFS using the default configuration, you
will end up occupying a space of 384 MB (3*128 MB) as the blocks will be replicated three
times and each replica will be residing on a different DataNode.

Note: The NameNode collects block reports from the DataNodes periodically to maintain the replication factor. Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
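As a small sketch (assuming a file /user/hadoop/sample already exists in HDFS), you can read back the replication factor the NameNode has recorded and change it; -stat %r prints the current factor, and setrep (covered in the commands below) changes it, with -w waiting until re-replication completes:

hdfs dfs -stat %r /user/hadoop/sample
hdfs dfs -setrep -w 2 /user/hadoop/sample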
Hadoop Shell Commands:
1. cat:

Copies source paths to stdout.

Usage: hadoop fs -cat URI [URI …]

Example: hdfs dfs -cat sample

2. chgrp:

Change group association of files. With -R, make the change recursively through the directory structure. The user must be the owner of files, or else a super-user. Additional information is in the Permissions User Guide.

Usage: hadoop fs -chgrp [-R] GROUP URI [URI …]

Example: hdfs dfs -chgrp -R newgroup sample

3. chmod:

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions User Guide.

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

Example: hdfs dfs -chmod 777 /user/hadoop/dir1/sample

4. chown:

Change the owner of files. With -R, make the change recursively through the directory
structure. The user must be a super-user. Additional information is in the Permissions User
Guide.

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

Example: hdfs dfs -chown -R xdark /opt/hadoop/logs

5. put:

Copy single src, or multiple srcs from local file system to the destination filesystem. Also
reads input from stdin and writes to destination filesystem.

Usage: hadoop fs -put <localsrc> ... <dst>

Example: hdfs dfs -put /home/xdark/Desktop/sample /user/hadoop


6. copyFromLocal:

Similar to put command, except that the source is restricted to a local file reference.

Usage: hadoop fs -copyFromLocal <localsrc> URI

Example: hdfs dfs -copyFromLocal /home/xdark/Desktop/sample /user/hadoop

7. copyToLocal:

Similar to get command, except that the destination is restricted to a local file reference.

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Example: hdfs dfs -copyToLocal /user/hadoop/sample /home/xdark/Desktop

8. cp:

Copy files from source to destination. This command allows multiple sources as well in
which case the destination must be a directory.

Usage: hadoop fs -cp URI [URI …] <dest>

Example: hadoop fs -cp /user/hadoop/dir2/purchases.txt /user/hadoop/dir1

9. du:

Displays the aggregate length of files contained in the directory, or the length of a file in case it's just a file.

Usage: hadoop fs -du URI [URI …]

Example: hdfs dfs -du /user/hadoop/dir1/sample

10. dus:

Displays a summary of file lengths.

Usage: hadoop fs -dus <args>

Example: hdfs dfs -dus /sample

11. expunge:

Empty the Trash. Refer to HDFS Design for more information on Trash feature.

Usage: hadoop fs -expunge


Example: hdfs dfs -expunge

12. get:

Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option.

Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Example: hdfs dfs -get /user/hadoop/dir2/sample /home/dataflair/Desktop

13. getmerge:

Takes a source directory and a destination file as input and concatenates files in src into the
destination local file. Optionally addnl can be set to enable adding a newline character at
the end of each file.

Usage: hadoop fs -getmerge <src> <localdst> [addnl]

Example: hdfs dfs -getmerge /user/hadoop/dir1/sample.txt /user/hadoop/dir2/sample2.txt /home/sample1.txt

14. ls:

For a file, ls returns stat on the file with the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid

Usage: hadoop fs -ls <args>

Example: hdfs dfs -ls /user/hadoop/dir1

15. lsr:

Recursive version of ls. Similar to Unix ls -R.

Usage: hadoop fs -lsr <args>

Example: hadoop fs -lsr

16. mkdir:

Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p
creating parent directories along the path.

Usage: hadoop fs -mkdir <paths>


Example: hdfs dfs -mkdir /user/hadoop/dir1

17. moveFromLocal:

Similar to the put command, except that the local source is deleted after it is copied.


Usage: hadoop fs -moveFromLocal <localsrc> <dst>

Example: hdfs dfs -moveFromLocal /home/xdark/Desktop/sample /user/hadoop/dir1

18. mv:

Moves files from source to destination. This command allows multiple sources as well in
which case the destination needs to be a directory. Moving files across filesystems is not
permitted.

Usage: hadoop fs -mv URI [URI …] <dest>

Example: hdfs dfs -mv myfile.txt /dir1

19. rm:

Delete files specified as args. It deletes only files, not directories; refer to rmr for recursive deletes.

Usage: hadoop fs -rm URI [URI …]

Example: hdfs dfs -rm /user/hadoop/dir2/sample

20. rmr:

Recursive version of delete.

Usage: hadoop fs -rmr URI [URI …]

Example: hdfs dfs -rmr /sample

21. setrep:

Changes the replication factor of a file. The -R option recursively changes the replication factor of all files within a directory, and -w waits for the replication to complete.

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Example: hdfs dfs -setrep -w 3 /user/hadoop/dir1

22. stat:

Returns the stat information on the path.


Usage: hadoop fs -stat URI [URI …]

Example: hdfs dfs -stat /user/hadoop/dir1

23. tail:

Displays last kilobyte of the file to stdout. -f option can be used as in Unix.

Usage: hadoop fs -tail [-f] URI

Example: hdfs dfs -tail -f /user/hadoop/dir2/purchases.txt

24. test:

The Hadoop fs shell command test is used for file test operations.
It returns 0 if the checked condition is true, and a non-zero value otherwise.
Options:
-e check to see if the path exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check to see if the path is a directory. Return 0 if true.

Usage: hadoop fs -test -[ezd] URI

Example: 1.hdfs dfs -test -e sample

2.hdfs dfs -test -z sample

3.hdfs dfs -test -d sample

25. text:

Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.

Usage: hadoop fs -text <src>

Example: hdfs dfs -text /user/hadoop/dir1/sample

26. touchz:
Create a file of zero length.

Usage: hadoop fs -touchz URI [URI …]

Example: hadoop fs -touchz sample

27. version :

This Hadoop command prints the Hadoop version.

Usage: hadoop version

Example: hadoop version

28. df:

df displays free space. This was all on Hadoop HDFS commands.

Usage: hdfs dfs -df [-h] URI [URI ...]

Example: hdfs dfs -df -h
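Putting several of the commands above together, a typical session that copies a local file into HDFS, checks it, and brings a copy back might look like the following sketch (it assumes /home/xdark/Desktop/sample exists locally and that /user/hadoop is writable):

hdfs dfs -mkdir /user/hadoop/dir1
hdfs dfs -put /home/xdark/Desktop/sample /user/hadoop/dir1
hdfs dfs -ls /user/hadoop/dir1
hdfs dfs -cat /user/hadoop/dir1/sample
hdfs dfs -get /user/hadoop/dir1/sample /home/xdark/Desktop/sample_copy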



Apache Pig:
Apache Pig is an abstraction over MapReduce. It is a tool/platform that is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
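To give a feel for what such a script looks like, here is a minimal word-count in Pig Latin (a sketch that assumes an input file input.txt is visible to Pig; TOKENIZE, FLATTEN, GROUP, and COUNT are built-in):

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) AS cnt;
DUMP counts;

When run in MapReduce mode, the Pig Engine translates these five statements into one or more MapReduce jobs.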

Apache Pig – Architecture:


The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language that provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded). After execution, these scripts go through a series of transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer’s job easy. The major components of Apache Pig are described below.
Apache Pig Components:
There are various components in the Apache Pig framework. Let us take a look at the major components.

Parser:

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer:

The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.

Compiler:

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution Engine:

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.
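You can watch these stages yourself with the EXPLAIN operator, which prints the logical, physical, and MapReduce plans that the parser, optimizer, and compiler produce for a relation. A minimal sketch, assuming a relation named counts has already been defined in the Grunt shell:

grunt> EXPLAIN counts;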

Pig Latin Data Model:


The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple. The elements of Pig Latin’s data model are described below.

Atom:

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple:

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)

Bag:

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}


A bag can be a field in a relation; in that context, it is known as inner bag.

Example − {Raja, 30, {9848022338, raja@gmail.com}}

Map:

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]

Relation:

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
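The following Grunt-shell sketch ties these types together. It assumes a comma-separated file student.txt whose third field is a map; the data is loaded into a relation (an outer bag of tuples), and illustrative output might look like the line shown after DUMP:

grunt> students = LOAD 'student.txt' USING PigStorage(',')
                  AS (name:chararray, age:int, contact:map[]);
grunt> DUMP students;
(Raja,30,[phone#9848022338])

Here name and age are atoms, contact is a map, each output line is a tuple, and students as a whole is a relation, that is, a bag of tuples.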

Apache Pig Execution Modes:


You can run Apache Pig in two modes, namely, Local Mode and MapReduce Mode (also called HDFS mode).

Local Mode:

In this mode, all the files are installed and run from your local host and local file system. There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode:

MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
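Assuming the pig launcher script is on your PATH, the mode is selected with the -x flag when Pig is started:

pig -x local (Grunt shell against the local file system)
pig -x mapreduce (Grunt shell against HDFS; this is the default, so plain pig behaves the same way)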

Apache Pig Execution Mechanisms:


Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).

Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.

Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User Defined Functions) in programming languages such as Java and use them in our scripts.
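As a small sketch of the first two mechanisms (assuming a file wordcount.pig containing Pig Latin statements such as the word-count shown earlier): statements can be typed one at a time at the grunt> prompt in interactive mode, or the whole file can be submitted in batch mode with:

pig -x local wordcount.pig
pig wordcount.pig (runs in the default MapReduce mode)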
