HDFS and Pig
Apache HDFS, or the Hadoop Distributed File System, is a block-structured file system where
each file is divided into blocks of a pre-determined size. These blocks are stored across a
cluster of one or several machines. The Apache Hadoop HDFS Architecture follows a
master/slave design, where a cluster comprises a single NameNode (master node) and all the
other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of
machines that support Java. Though one can run several DataNodes on a single machine, in
practice these DataNodes are spread across various machines.
HDFS Architecture:
HDFS follows a master-slave architecture and has the following elements.
NameNode:
NameNode is the master node in the Apache Hadoop HDFS Architecture that maintains and
manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available
server that manages the file system namespace and controls access to files by clients. I will
be discussing this High Availability feature of Apache Hadoop HDFS in my next blog. The
HDFS architecture is built in such a way that user data never resides on the NameNode; the
data resides on DataNodes only.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes).
It records the metadata of all the files stored in the cluster, e.g. the location of the blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
o FsImage: It contains the complete state of the file system namespace since the
start of the NameNode.
o EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a
file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.
The NameNode is also responsible for maintaining the replication factor of all the blocks,
which we will discuss in detail later in this HDFS tutorial blog.
In case of DataNode failure, the NameNode chooses new DataNodes for new replicas,
balances disk usage, and manages the communication traffic to the DataNodes.
DataNode:
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or high availability. The
DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
Functions of DataNode:
These are slave daemons or processes that run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s
clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS;
by default, this frequency is set to 3 seconds.
By now, you must have realized that the NameNode is extremely important to us. If it fails,
we are doomed. But don't worry, we will be talking about how Hadoop solved this single
point of failure problem in the next Apache Hadoop HDFS Architecture blog. So, just relax
for now and let's take one step at a time.
Secondary NameNode:
Apart from these two daemons, there is a third daemon, or process, called the Secondary
NameNode. The Secondary NameNode works concurrently with the primary NameNode as a
helper daemon. And don't be confused: the Secondary NameNode is not a backup
NameNode.
Functions of Secondary NameNode:
The Secondary NameNode constantly reads the file system state and metadata from the
RAM of the NameNode and writes it to the hard disk or the file system.
It is responsible for combining the EditLogs with FsImage from the NameNode.
It downloads the EditLogs from the NameNode at regular intervals and applies them to the
FsImage. The new FsImage is copied back to the NameNode, which uses it the next time the
NameNode is started.
Blocks:
Blocks are nothing but the smallest contiguous locations on your hard drive where data is
stored. In general, in any file system, you store the data as a collection of blocks. Similarly,
HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster.
The default size of each block is 128 MB, which you can configure as per your requirement.
It is not necessary that in HDFS each file is stored in an exact multiple of the configured
block size (128 MB, 256 MB, etc.). Let's take an example where I have a file "example.txt"
of size 514 MB. Suppose that we are using the default block size of 128 MB. Then, how
many blocks will be created? 5, right? The first four blocks will be of 128 MB, but the last
block will be of 2 MB only.
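As a quick sanity check of this block arithmetic, here is a minimal Java sketch; the file name
and sizes are simply the values from the example above, not anything read from an actual
cluster.

    public class BlockSplitExample {
        public static void main(String[] args) {
            long fileSizeMb = 514;   // size of example.txt from the example above
            long blockSizeMb = 128;  // default HDFS block size

            // Full 128 MB blocks, plus one smaller block for the remainder.
            long fullBlocks = fileSizeMb / blockSizeMb;                // 4
            long lastBlockMb = fileSizeMb % blockSizeMb;               // 2
            long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0); // 5

            System.out.println("Total blocks: " + totalBlocks);
            System.out.println("Full blocks:  " + fullBlocks + " x " + blockSizeMb + " MB");
            System.out.println("Last block:   " + lastBlockMb + " MB");
        }
    }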
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data blocks.
The blocks are also replicated to provide fault tolerance. The default replication factor is 3,
which is again configurable, so each block is replicated three times and stored on different
DataNodes.
Therefore, if you are storing a file of 128 MB in HDFS using the default configuration, you
will end up occupying a space of 384 MB (3*128 MB) as the blocks will be replicated three
times and each replica will be residing on a different DataNode.
Note: The NameNode collects block reports from the DataNodes periodically to maintain
the replication factor. Therefore, whenever a block is over-replicated or under-replicated, the
NameNode deletes or adds replicas as needed.
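The replication factor of an individual file can also be inspected and changed
programmatically. Below is a minimal sketch using the HDFS Java API; the path
/user/data/example.txt is only a hypothetical example, and the code assumes the cluster
configuration files are on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/data/example.txt");  // hypothetical HDFS path

            // Current replication factor of the file (3 with the default configuration).
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication: " + current);

            // Ask for a different replication factor; the NameNode then adds or
            // deletes replicas in the background until the target is met.
            fs.setReplication(file, (short) 2);

            fs.close();
        }
    }

The same effect can be achieved from the shell with the setrep command described below.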
Hadoop Shell Commands:
1. cat:
Copies source paths to stdout.
2. chgrp:
Change group association of files. With -R, make the change recursively through the
directory structure. The user must be the owner of files, or else a super-user.
Additional information is in the Permissions User Guide.
3. chmod:
Change the permissions of files. With -R, make the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user. Additional
information is in the Permissions User Guide.
4. chown:
Change the owner of files. With -R, make the change recursively through the directory
structure. The user must be a super-user. Additional information is in the Permissions User
Guide.
5. put:
Copy single src, or multiple srcs from local file system to the destination filesystem. Also
reads input from stdin and writes to destination filesystem.
6. copyFromLocal:
Similar to put command, except that the source is restricted to a local file reference.
7. copyToLocal:
Similar to get command, except that the destination is restricted to a local file reference.
8. cp:
Copy files from source to destination. This command allows multiple sources as well in
which case the destination must be a directory.
9. du:
Displays aggregate length of files contained in the directory, or the length of a file in case it
is just a file.
10. dus:
Displays a summary of file lengths.
11. expunge:
Empty the Trash. Refer to HDFS Design for more information on Trash feature.
12. get:
Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option.
13. getmerge:
Takes a source directory and a destination file as input and concatenates files in src into the
destination local file. Optionally addnl can be set to enable adding a newline character at
the end of each file.
14. ls:
For a file returns stat on the file with the following format:
filename <number of replicas> filesize modification_date
modification_time permissions userid groupid
For a directory it returns list of its direct children as in unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions
userid groupid
15. lsr:
Recursive version of ls. Similar to Unix ls -R.
16. mkdir:
Takes path URIs as argument and creates directories. The behavior is much like Unix
mkdir -p, creating parent directories along the path.
17. moveFromLocal:
18. mv:
Moves files from source to destination. This command allows multiple sources as well in
which case the destination needs to be a directory. Moving files across filesystems is not
permitted.
19. rm:
Delete files specified as args. Only deletes non empty directory and files. Refer to rmr for
recursive deletes.
20. rmr:
Recursive version of delete.
21. setrep:
Changes the replication factor of a file. -R option is for recursively increasing the replication
factor of files within a directory.
22. stat:
Returns the stat information on the path.
23. tail:
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.
24. test:
The Hadoop fs shell command test is used for file test operations. It checks whether the
given path exists, has zero length, or is a directory, and reports the result through its return
value.
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true
-d check return 1 if the path is directory else return 0.
25. text:
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
26. touchz:
Create a file of zero length.
27. version:
Prints the Hadoop version. Usage: hadoop version.
28. df:
Displays free space.
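Several of these shell commands have direct equivalents in the HDFS Java API
(org.apache.hadoop.fs.FileSystem). The sketch below is only illustrative: the paths are
hypothetical, and a reachable cluster configuration on the classpath is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsShellEquivalents {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // mkdir: creates the directory, including missing parents.
            fs.mkdirs(new Path("/user/data"));

            // put / copyFromLocal: copy a local file into HDFS.
            fs.copyFromLocalFile(new Path("/tmp/example.txt"),        // hypothetical local file
                                 new Path("/user/data/example.txt")); // hypothetical HDFS path

            // ls: list the direct children of a directory.
            for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            // get / copyToLocal: copy a file back to the local file system.
            fs.copyToLocalFile(new Path("/user/data/example.txt"), new Path("/tmp/copy.txt"));

            // rm: delete a file (the second argument enables recursive deletes, as in rmr).
            fs.delete(new Path("/user/data/example.txt"), false);

            fs.close();
        }
    }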
Pig Architecture:
Apache Pig internally converts Pig Latin scripts into MapReduce jobs using the following
components.
Parser:
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.
Optimizer:
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
Compiler:
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine:
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are
executed to produce the desired results.
Pig Data Model:
Atom:
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored
as a string and can be used as a string or as a number. int, long, float, double, chararray, and
bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as
a field.
Example − ‘raja’ or ‘30’
Tuple:
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of
any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag:
A bag is an unordered collection of tuples, and each tuple can have any number of fields. It
is represented by ‘{}’.
Example − {(Raja, 30)}
Map:
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation:
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
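To make these types concrete, the sketch below builds a tuple, a bag, and a map using Pig's
Java data classes (org.apache.pig.data); the values simply mirror the examples above, and
the Pig libraries are assumed to be on the classpath.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PigDataModelExample {
        public static void main(String[] args) throws Exception {
            // Tuple: an ordered set of fields, like (Raja, 30).
            Tuple tuple = TupleFactory.getInstance().newTuple(2);
            tuple.set(0, "Raja");
            tuple.set(1, 30);

            // Bag: an unordered collection of tuples; a relation is a bag of tuples.
            DataBag bag = BagFactory.getInstance().newDefaultBag();
            bag.add(tuple);

            // Map: key-value pairs with chararray keys, like [name#Raja, age#30].
            Map<String, Object> map = new HashMap<>();
            map.put("name", "Raja");
            map.put("age", 30);

            System.out.println("Tuple: " + tuple);
            System.out.println("Bag:   " + bag);
            System.out.println("Map:   " + map);
        }
    }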
Pig Execution Modes:
Local Mode:
In this mode, all the files are installed and run from your local host and local file system.
There is no need for Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode:
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
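As a rough sketch of how the execution mode is chosen when Pig is embedded in Java, the
ExecType passed to PigServer decides whether the statements run against the local file
system or are compiled into MapReduce jobs on the cluster; the file student.txt and its
schema below are only hypothetical.

    import java.util.Iterator;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class PigModeExample {
        public static void main(String[] args) throws Exception {
            // Use ExecType.LOCAL for local mode or ExecType.MAPREDUCE for MapReduce mode.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // student.txt is a hypothetical comma-separated input file.
            pig.registerQuery("students = LOAD 'student.txt' USING PigStorage(',') "
                    + "AS (name:chararray, age:int);");
            pig.registerQuery("adults = FILTER students BY age >= 18;");

            // Iterate over the result, similar to using the Dump operator in the Grunt shell.
            Iterator<Tuple> it = pig.openIterator("adults");
            while (it.hasNext()) {
                System.out.println(it.next());
            }
            pig.shutdown();
        }
    }

Independently of the execution mode, there are three ways in which you can run Apache Pig: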
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User
Defined Functions) in programming languages such as Java, and to use them in our script.
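As a minimal example of this embedded/UDF approach, the hypothetical UpperCase
function below extends Pig's EvalFunc class; once packaged into a jar (the jar name here is
only a placeholder), it can be registered and called from a Pig Latin script.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A hypothetical UDF that upper-cases its first argument.
    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase();
        }
    }

In a script it would then be used along the lines of REGISTER myudfs.jar; followed by
upper_names = FOREACH students GENERATE UpperCase(name); where the jar name and
the students relation are again only placeholders.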