CIA3 Answer
2 Marks:
1. How does failover occur and what is the role of fencing in this process?
The transition from the active namenode to the standby is managed by a new entity in the system
called the failover controller. Failover controllers are pluggable, but the first implementation uses
ZooKeeper to ensure that only one namenode is active.
Failover may also be initiated manually by an administrator, in the case of routine maintenance, for
example. This is known as a graceful failover, since the failover controller arranges an orderly
transition for both namenodes to switch roles.
In the case of an ungraceful failover, the HA implementation goes to great lengths to ensure that the
previously active namenode is prevented from doing any damage and causing corruption, a technique
known as fencing.
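As an illustration of how fencing is wired up, the sketch below shows minimal hdfs-site.xml settings for automatic failover with SSH fencing; the private-key path is an assumed value, not a required one.

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <!-- key used by the failover controller to SSH in and stop the old active namenode -->
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value> <!-- assumed path -->
</property>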
2. The default block size of HDFS is 64 MB/128 MB, while the default block size of
UNIX/Linux is 4 KB/8 KB. Why? What implication, do you think, will this have on the
design of the NameNode?
The default size of a block in HDFS is 128 MB (Hadoop 2.x) or 64 MB (Hadoop 1.x), which is much
larger than in a Linux filesystem, where the block size is typically 4 KB. The reason for having such a
large block size is to minimize the cost of seeks and to reduce the metadata generated per block.
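As a rough worked example (assuming the commonly cited figure of about 150 bytes of namenode memory per block object): a 1 TB file stored as 128 MB blocks produces 8,192 blocks, on the order of 1 MB of block metadata, whereas the same file stored as 4 KB blocks would produce over 268 million blocks and exhaust the NameNode's memory. Larger blocks therefore keep the NameNode's in-memory metadata manageable and amortize seek time over long sequential reads.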
3. How do you copy a file into HDFS with a block size different from the existing block size
configuration?
A file can be copied into HDFS with a different block size by using ‘-Ddfs.blocksize=block_size’,
where block_size is specified in bytes.
For example, suppose I want to copy a file called test.txt of size 120 MB into HDFS, and I want the
block size for this file to be 32 MB (33554432 bytes) instead of the default 128 MB. I would issue the
following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs
Now, I can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Alternatively, I can use the NameNode web UI to inspect the file in the HDFS directory.
4. Explain what happens in the event of namenode failure.
If the machine running the namenode failed, all the files on the filesystem would be lost since there
would be no way of knowing how to reconstruct the files from the blocks on the datanodes.
5. Justify “Data Locality Optimization”.
Hadoop does its best to run a map task on a node where the input data resides in HDFS, because this
does not use valuable cluster bandwidth. This is called the data locality optimization.
6. What if Writable were not there in Hadoop?
Serialization is important in Hadoop because it enables easy transfer of data. If Writable were not
present in Hadoop, it would have to fall back on standard Java serialization, which increases the data
overhead on the network.
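A minimal sketch (the class name WritableDemo is illustrative, not part of Hadoop) showing how a Writable such as IntWritable serializes a value into a compact byte stream:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // Wrap an int in Hadoop's IntWritable and serialize it
        IntWritable writable = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writable.write(out);   // Writable's write(DataOutput) method
        out.close();
        // The int serializes to just 4 bytes, far less than the header and
        // class metadata that standard Java object serialization would add
        System.out.println("Serialized length: " + bytes.size());
    }
}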
7. Interpret the term ‘Codecs’
A codec is an algorithm that is used to perform compression and decompression of a data stream to
transmit or store it.
In Hadoop, these compression and decompression operations are carried out by different codecs
supporting different compression formats.
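A minimal sketch using the Hadoop CompressionCodec API (the class name StreamCompressor is illustrative) that gzip-compresses whatever arrives on standard input and writes the result to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Obtain the gzip codec through Hadoop's codec abstraction
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Anything written to this stream is compressed before reaching stdout
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressor without closing the underlying stream
    }
}

For instance, one could run it as: echo "Text" | hadoop StreamCompressor | gunzip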
8. Discuss the desirable properties of an RPC serialization format in Hadoop I/O.
Compact – A compact format makes the best use of network bandwidth, which is the scarcest
resource in a data center.
Fast – Interprocess communication forms the backbone of a distributed system, so it is essential that
there is as little performance overhead as possible for the serialization and deserialization process.
Extensible – Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers.
Interoperable – For some systems, it is desirable to be able to support clients that are written in
different languages from the server, so the format needs to be designed to make this possible.
16 Marks:
1. (i) How does Hadoop's Wordcount program process and analyze text data efficiently and
what are its key components and functionalities?
(ii) How do Hadoop streaming and pipes differ, and how do they contribute to data processing in the
Hadoop ecosystem?
2. Illustrate the concepts and functioning of the Hadoop Distributed File System (HDFS) and provide
practical examples:
i. Name Node and Data Node
ii. Basic Filesystem operations in Hadoop
iii. Query in Filesystem
iv. Coherency Model in Hadoop Filesystem
3. Explain the concepts of data integrity and compression while emphasizing their significance and
real-world applications through a detailed explanation.
4. (i) Explain the concept of serialization in Hadoop with example code.
(ii) Write short notes on sequence files and map files in file-based data structures, with example code.
5. Demonstrate the process of implementing a MapReduce program in Java for analyzing a weather
dataset, including key steps such as mapping the data, reducing it to calculate statistics, and setting up
the configuration.
6. (i) What is metadata? What information does it provide? Explain the role of the NameNode in an
HDFS cluster with a real-time example.
(ii) Show how a client reads and writes data in HDFS. Give example code. Briefly explain the
Hadoop input and output operations.
7. Illustrate the concepts and functioning of Hadoop input and output and provide practical
examples:
• Data integrity
• Compression
• Serialization
• Avro
8. Develop a Java interface in Hadoop for custom data types in a MapReduce job.