Data Analytics Unit 2
(Autonomous)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
V Semester
20AI501 – Data Analytics
Regulations 2020
Question Bank
UNIT – II (HADOOP DISTRIBUTED FILE SYSTEM)
PART- A
Q.No. Questions Marks CO BL
1 Define HDFS. 2 CO2 R
2 State the limitations of HAR files 2 CO2 R
3 Tell about job execution in Hadoop 2 CO2 R
4 List out three different features of Hadoop 2 CO2 R
5 Classify various compression formats that are used in Hadoop 2 CO2 R
6 State design concept of HDFS 2 CO2 R
7 Which file format is best for Hadoop 2 CO2 R
8 Define Hadoop Archive 2 CO2 R
9 Which architecture is used by HDFS 2 CO2 R
10 Recall Sqoop and Flume 2 CO2 R
11 Define data flow in Hadoop 2 CO2 R
12 Write down the drawback of Hadoop serialization 2 CO2 R
13 What is serialization in Hadoop 2 CO2 R
14 Write down Avro schema 2 CO2 R
15 Define Hadoop I/O Compression 2 CO2 R
16 What is meant by Avro serialization in Hadoop 2 CO2 R
17 What is data Replication 2 CO2 R
18 State what a cluster is in Hadoop 2 CO2 R
19 Recall what Flume is used for 2 CO2 R
20 Infer the command line interface in Hadoop 2 CO2 R
PART- B
Q.No. Questions Marks CO BL
1. Examine the following 16 CO2 U
(i) The Design of HDFS
(ii) HDFS Concepts
2. Describe data ingest with Flume and Sqoop, and Hadoop Archives 16 CO2 U
3. Explain the following 16 CO2 U
(i) Hadoop I/O Compression
(ii) Hadoop I/O Serialization
4. Discuss in detail the Hadoop file system interface 16 CO2 U
5. Summarize the Command Line Interface 16 CO2 U
6. Explain the Avro and File Based Data Structures 16 CO2 U
PART- B
Q.No. Questions Marks CO BL
1. Examine the following 16 CO2 U
(i) The Design of HDFS
HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients, and a number of DataNodes that store the actual data blocks.
2. Describe data ingest with Flume and Sqoop, and Hadoop Archives 16 CO2 U
Flume Architecture
A Flume agent is a JVM process with three components – Flume Source, Flume Channel and Flume Sink – through which events propagate after being initiated at an external source.
Sqoop:
Data stored in a relational database management system often needs to be transferred into the Hadoop environment. Transferring such a large amount of data manually is not possible, but it can be done easily with the help of Sqoop.
Some of the important features of Sqoop:
Sqoop helps us to load the results of SQL queries into the Hadoop Distributed File System.
Sqoop helps us to load the processed data directly into Hive or HBase.
It secures the data transfer with the help of Kerberos.
With the help of Sqoop, we can perform compression of processed data.
Sqoop is highly powerful and efficient in nature.
There are two major operations performed in Sqoop :
1. Import
2. Export
Hadoop Archives (HAR)
Hadoop Archives (HAR) offer an effective way to deal with the small files problem. The points covered are –
1. The problem with small files
2. What is HAR?
3. Limitations of HAR files
The problem with small files
Hadoop works best with big files; small files are handled inefficiently in HDFS. As we know, the Namenode holds the metadata in memory for all the files stored in HDFS. Let’s say we have a file in HDFS which is 1 GB in size; the Namenode will store metadata about the file – file name, creator, creation timestamp, blocks, permissions, etc.
Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 “small” files in HDFS. Now the Namenode has to store metadata for 1000 small files in memory. This is not very efficient – first, it takes up a lot of memory, and second, the Namenode will soon become a bottleneck as it tries to manage so much metadata.
3. Explain the following 16 CO2 U
(i) Hadoop I/O Compression
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk.
Data compression at various stages in Hadoop
Compressing input files- You can compress the input files, which reduces storage space in HDFS. If you compress the input files, they will be decompressed automatically when the data is read for processing (the codec is inferred from the file name extension).
Hadoop compression formats
There are many different compression formats available in Hadoop
framework. You will have to use one that suits your requirement.
Parameters that you need to look for are-
Time it takes to compress.
Space saving.
Whether the compression format is splittable or not.
Let’s go through the list of available compression formats and see which
format provides what characteristics.
Deflate– It is the compression algorithm whose implementation is zlib. The Deflate compression algorithm is also used by the gzip compression tool.
Filename extension is .deflate.
gzip- gzip compression is based on Deflate compression algorithm. Gzip
compression is not as fast as LZO or snappy but compresses better so space
saving is more.
Gzip is not splittable. Filename extension is .gz.
bzip2- Using bzip2 for compression will provide higher compression ratio
but the compressing and decompressing speed is slow. Bzip2 is splittable,
Bzip2Codec implements SplittableCompressionCodec interface which
provides the capability to compress / de-compress a stream starting at any
arbitrary position.
Filename extension is .bz2.
Snappy– The Snappy compressor from Google provides fast compression
and decompression but compression ratio is less.
Snappy is not splittable. Filename extension is .snappy.
LZO– LZO, just like Snappy, is optimized for speed, so it compresses and decompresses faster but the compression ratio is less. LZO is not splittable by default, but you can index the lzo files as a pre-processing step to make them splittable. Filename extension is .lzo.
LZ4– Has fast compression and decompression speed but the compression ratio is less. LZ4 is not splittable. Filename extension is .lz4.
Zstandard– Zstandard is a real-time compression algorithm, providing
high compression ratios. It offers a very wide range of compression / speed
trade-off.
Zstandard is not splittable. Filename extension is .zstd.
Codecs in Hadoop
A codec, short form of compressor-decompressor, is the implementation of a compression-decompression algorithm. In the Hadoop framework there are different codec classes for the different compression formats (for example, GzipCodec, BZip2Codec and SnappyCodec). A sketch of using one of these codec classes is given below.
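As an illustration (not part of the original notes), the following minimal Java sketch uses the GzipCodec class to write a gzip-compressed file to HDFS; the output path and sample text are assumed values.

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the codec through ReflectionUtils so it picks up the configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap the raw HDFS output stream with a compressing stream (.gz by convention)
        Path out = new Path("/tmp/sample.txt.gz");   // assumed path
        OutputStream compressed = codec.createOutputStream(fs.create(out));
        compressed.write("hello compressed world\n".getBytes("UTF-8"));
        compressed.close();
    }
}

Reading a compressed file works the same way with codec.createInputStream(), and CompressionCodecFactory can pick the right codec from the file name extension.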
(ii) Hadoop I/O Serialization
Serialization is the process of turning structured objects into a byte
stream for transmission over a network or for writing to persistent storage.
Deserialization is the reverse process of turning a byte stream back into a series of structured objects. Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node. In general, it is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which
is the most scarce resource in a data center.
Fast
Interprocess communication forms the backbone of a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should
be straightforward to evolve the protocol in a controlled manner for clients
and servers. For example, it should be possible to add a new argument to a
method call, and have the new servers accept messages in the old format
(without the new argument) from old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different languages from the server, so the format needs to be designed to make this possible.
Hadoop uses its own serialization format, Writables, which is
certainly compact and fast, but not so easy to extend or use from languages
other than Java. Since Writables are central to Hadoop (most MapReduce
programs use them for their key and value types), we look at them in some
depth in the next three sections, before looking at serialization frameworks
in general, and then Avro (a serialization system that was designed to
overcome some of the limitations of Writables) in more detail.
Persistent Storage
Persistent storage is digital storage that does not lose its data when the power supply is lost. Files, folders and databases are examples of persistent storage.
Writable Interface
This is the interface in Hadoop which provides the methods for serialization and deserialization. It declares two methods: write(DataOutput out), which serializes the object to a binary stream, and readFields(DataInput in), which populates the object by reading its fields back from a binary stream. A sketch of a custom Writable is given below.
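As a minimal sketch (the class and field names are illustrative, not from the notes), a custom type implementing the Writable interface looks like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public IntPairWritable() { }                  // no-arg constructor required by Hadoop

    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        first = in.readInt();
        second = in.readInt();
    }
}

The no-argument constructor is needed so that Hadoop can instantiate the type by reflection before calling readFields().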
5. Summarize the Command Line Interface 16 CO2 U
copyFromLocal - This command copies all the files inside the test folder on the edge node to the test folder in HDFS. It is similar to the put command, except that the source is restricted to a local file reference.
hadoop fs -copyFromLocal /home/sa081876/test/* /user/haas_queue/test
Options: The -f option will overwrite the destination if it already exists.
copyToLocal - This command copies all the files inside the test folder in HDFS to the test folder on the edge node. It is similar to the get command, except that the destination is restricted to a local file reference.
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
hadoop fs -copyToLocal /user/haas_queue/test/* /home/sa081876/test
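For comparison with the shell commands above, here is a minimal sketch (assumed, not part of the notes) of the equivalent copies done through the Java FileSystem API; the paths reuse the illustrative ones above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // handle to the default file system (HDFS)

        // local -> HDFS, like hadoop fs -copyFromLocal
        fs.copyFromLocalFile(new Path("/home/sa081876/test/file1"),
                             new Path("/user/haas_queue/test/file1"));

        // HDFS -> local, like hadoop fs -copyToLocal
        fs.copyToLocalFile(new Path("/user/haas_queue/test/file1"),
                           new Path("/home/sa081876/test/file1.copy"));

        fs.close();
    }
}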
count- Count the number of directories, files and bytes under the paths that match the specified file pattern.
hadoop fs -count [-q] [-h] [-v] <paths>
The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
The -h option shows sizes in human readable format.
The -v option displays a header line.
Example:
hadoop fs -count hdfs://nn1.example.com/file1
hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1
cp: Copy files from source to destination. This command allows multiple
sources as well in which case the destination must be a directory.
deleteSnapshot
Delete a snapshot from a snapshottable directory. This operation requires owner privilege of the snapshottable directory. For more information refer to HdfsSnapshots.html.
hdfs dfs -deleteSnapshot <path> <snapshotName>
path – The path of the snapshottable directory.
snapshotName – The snapshot name.
du: Displays sizes of files and directories contained in the given directory, or the length of a file in case it is just a file.
find: Finds all files that match the specified expression and applies selected
actions to them. If no path is specified then defaults to the current working
directory. If no expression is specified then defaults to -print.
hadoop fs -find <path> … <expression> …
hadoop fs -find / -name test -print
get: Copy files to the local file system. Files that fail the CRC check may
be copied with the -ignorecrc option. Files and CRCs may be copied using
the -crc option
hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Example:
hadoop fs -get /user/hadoop/file localfile
hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
getfacl: Displays the Access Control Lists (ACLs) of files and directories.
If a directory has a default ACL, then getfacl also displays the default ACL.
6. Explain the Avro and File Based Data Structures 16 CO2 U
SequenceFile Compression
The internal format of the SequenceFile depends on whether compression is enabled and, if it is, whether it is record compression or block compression.
There are three types:
A. No compression type : If compression is not enabled (the default
setting), then each record consists of its record length (number of bytes),
the length of the key, the key and the value. The Length field is four bytes.
B. Record compression type : The record compression format is basically the same as the uncompressed format; the difference is that the value bytes are compressed with the codec defined in the header. Note that the key is not compressed.
C. Block compression type : Block compression compresses multiple records at once, so it is more compact than record compression and generally preferred. When the number of bytes of buffered records reaches the minimum block size, they are added to the block. The minimum block size is defined by the io.seqfile.compress.blocksize property; the default value is 1000000 bytes. The format is record count, key length, key, value length, value.
Benefits of the Sequencefile file format:
A. Supports data compression based on records (record) or blocks (block).
B. Supports splittable, which can be used as input shards for mapreduce.
C. Simple to modify: The main responsibility is to modify the
corresponding business logic, regardless of the specific storage format.
Disadvantages of the SequenceFile file format:
The downside is the need to merge files, and the merged file is inconvenient to view because it is a binary file.
Read/write a SequenceFile
Write process:
1) Create a Configuration. 2) Get the FileSystem. 3) Create the file output Path.
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object.
5) Call SequenceFile.Writer.append to append records to the file.
6) Close the stream.
Read process:
1) Create a Configuration. 2) Get the FileSystem. 3) Create the file input Path.
4) Create a SequenceFile.Reader for reading.
5) Get the key class and value class.
6) Close the stream.
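Putting the two processes together, here is a minimal sketch (assumed, not part of the original notes) that writes a few IntWritable/Text records to an illustrative path /tmp/data.seq and reads them back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // 1) create a configuration
        Path path = new Path("/tmp/data.seq");           // 3) file path (illustrative)

        // 4)-6) write: createWriter -> append -> close
        // (to enable block compression, also pass
        //  SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
        for (int i = 0; i < 3; i++) {
            writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();

        // 4)-6) read: Reader -> next(key, value) -> close
        SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path));
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}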