
CIA 3 Question Bank

2 Marks:

1. How does failover occur and what is the role of fencing in this process?
The transition from the active namenode to the standby is managed by a new entity in the system
called the failover controller. Failover controllers are pluggable, but the first implementation uses
ZooKeeper to ensure that only one namenode is active.
Failover may also be initiated manually by an administrator, for example during routine maintenance. This is known as a graceful failover, since the failover controller arranges an orderly transition for both namenodes to switch roles.
In the case of an ungraceful failover, the HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption, a technique known as fencing.
2. The default block size of HDFS is 64 MB/128 MB while the default block size of
UNIX/Linux is 4 KB/8 KB. Why? What implication do you think this will have on the
design of the NameNode?
The default size of a block in HDFS is 128 MB (Hadoop 2.x) or 64 MB (Hadoop 1.x), which is much larger than in a Linux filesystem, where the block size is 4 KB. The reason for having such a large block size is to minimize the cost of seeks and to reduce the amount of metadata generated per file. Because the NameNode keeps the metadata for every block in memory, larger blocks mean far fewer block entries to track: a 1 GB file needs only 8 block entries with 128 MB blocks, compared with 262,144 entries if blocks were 4 KB.
3. How do you copy a file into HDFS with a block size different from the existing block size
configuration?
Yes, one can copy a file into HDFS with a different block size by using '-Ddfs.blocksize=block_size', where the block_size is specified in bytes.
For example: suppose I want to copy a file called test.txt of size, say, 120 MB into HDFS, and I want the block size for this file to be 32 MB (33554432 bytes) instead of the default (128 MB). I would issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs
Now, I can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Alternatively, I can use the NameNode web UI to browse the HDFS directory and inspect the file's blocks.
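The same thing can also be done programmatically. Below is a minimal sketch (the class name CopyWithBlockSize is illustrative, and the paths are reused from the example above) using the FileSystem.create() overload that accepts an explicit block size; it assumes fs.defaultFS in the Configuration points at the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.InputStream;

public class CopyWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dst = new Path("/sample_hdfs/test.txt");  // target path from the example above
        long blockSize = 32L * 1024 * 1024;            // 32 MB instead of the configured default

        // create() lets us pass the block size explicitly for this one file
        try (InputStream in = new FileInputStream("/home/edureka/test.txt");
             FSDataOutputStream out = fs.create(dst, true, 4096,
                     fs.getDefaultReplication(dst), blockSize)) {
            IOUtils.copyBytes(in, out, 4096);
        }

        // verify the block size actually recorded for the file
        System.out.println(fs.getFileStatus(dst).getBlockSize()); // 33554432
    }
}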
4. Explain the impact of namenode failure.
If the machine running the namenode failed, all the files on the filesystem would be lost since there
would be no way of knowing how to reconstruct the files from the blocks on the datanodes.
5. Justify “Data Locality Optimization”.
Hadoop tries to run each map task on the node where its input data resides in HDFS, so that the data does not have to be moved across the network. This is called the data locality optimization, and it conserves valuable cluster bandwidth.
6. What if Writable were not there in Hadoop?
Serialization is important in Hadoop because it enables efficient transfer of data between nodes. If Writable were not there, Hadoop would have to fall back on standard Java serialization, whose more verbose byte representation increases the data overhead on the network.
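To illustrate what Writable provides, here is a minimal sketch of a hypothetical custom Writable called PointWritable; its write()/readFields() contract produces a compact, fixed 8-byte representation on the wire, with no class descriptors or other per-object overhead.

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical custom type: serializes to exactly 8 bytes (two ints).
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }                 // no-arg constructor required by the framework

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    @Override
    public String toString() {
        return x + "," + y;
    }
}

A type used as a MapReduce key would additionally implement WritableComparable so it can be sorted, but the serialization contract is the same.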
7. Interpret the term ‘Codecs’
A codec is an algorithm that is used to perform compression and decompression of a data stream to
transmit or store it.
In Hadoop, these compression and decompression operations are carried out by different codecs, each corresponding to a different compression format (such as gzip, bzip2, or Snappy).
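As a small illustration (a sketch along the lines of the standard Hadoop codec API; the class name StreamCompressor is just for this example), the following compresses whatever arrives on standard input with the gzip codec and writes the compressed stream to standard output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Codecs are normally instantiated via ReflectionUtils so they pick up the Configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap stdout in a compressing stream and copy stdin through it
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed data without closing System.out
    }
}

Because the output is a standard gzip stream, it can be piped straight into gunzip to verify the round trip.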
8. Discuss the RPC Serialization format in Hadoop I/O?
In general, an RPC serialization format should be:
Compact – A compact format makes the best use of network bandwidth, which is the scarcest resource in a data center.
Fast – Interprocess communication forms the backbone of a distributed system, so it is essential that there is as little performance overhead as possible for the serialization and deserialization process.
Extensible – Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
Interoperable – For some systems, it is desirable to be able to support clients that are written in different languages to the server, so the format needs to be designed to make this possible.

9. What happens if a client detects an error when reading a block in Hadoop?
If a client detects an error when reading a block:
• It reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException.
• The namenode marks the block replica as corrupt so that it does not direct clients to it or try to copy this replica to another datanode.
• The namenode then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level.
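From the client's point of view this looks roughly like the sketch below (the file path is hypothetical); the reporting to the namenode happens inside the HDFS client library before the ChecksumException ever reaches user code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithChecksumCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/part-00000");   // hypothetical file

        // fs.setVerifyChecksum(false) would disable verification entirely,
        // which is occasionally useful for salvaging a corrupt file.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (ChecksumException e) {
            // By this point the client has already reported the bad block and
            // datanode to the namenode; user code only decides how to react.
            System.err.println("Corrupt block encountered: " + e.getMessage());
        }
    }
}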
10. Why do map tasks write their output to the local disk, not to HDFS?
Map output is intermediate output: it is processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in HDFS, with replication,
would be overkill. If the node running the map task fails before the map output has been consumed by
the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the
map output.
11. Why is scaling out important in the context of Hadoop?
Scaling out, or horizontal scaling, involves adding more servers to a cluster so that work can be done in parallel.
• The scale-out technique is a long-term solution, since more servers can be added as the workload grows.
• Moving from one monolithic system to this type of cluster can be difficult, but it is an extremely effective solution.
12. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block
size configuration and the default replication factor. How many blocks will be created
in total, and what will be the size of each block?
The default block size in Hadoop 2.x is 128 MB. So a file of size 514 MB will be divided into 5 blocks (514 MB / 128 MB), where the first four blocks will be 128 MB each and the last block will be only 2 MB. Since we are using the default replication factor of 3, each block will be replicated three times. Therefore, we will have 15 blocks in total, where 12 blocks will be of size 128 MB each and 3 blocks of size 2 MB each.
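The same arithmetic, written out as a small plain-Java sketch (not a Hadoop API call) purely to make the calculation explicit:

public class BlockCount {
    public static void main(String[] args) {
        long fileSizeMb = 514;      // file size in MB
        long blockSizeMb = 128;     // default HDFS block size in Hadoop 2.x
        int replication = 3;        // default replication factor

        long fullBlocks = fileSizeMb / blockSizeMb;                    // 4 blocks of 128 MB
        long lastBlockMb = fileSizeMb % blockSizeMb;                   // 2 MB in the final block
        long logicalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);   // 5 blocks
        long totalReplicas = logicalBlocks * replication;              // 15 block replicas

        System.out.println(logicalBlocks + " blocks, last block " + lastBlockMb
                + " MB, " + totalReplicas + " replicas in total");
    }
}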
13. What do you understand by fsck in Hadoop?
In Hadoop, "fsck" stands for "File System Check." It is a command-line utility used to check the
health and integrity of the Hadoop Distributed File System (HDFS). HDFS is the primary distributed
storage system used in Hadoop for storing and managing large volumes of data. The fsck command is
a diagnostic tool that helps administrators and users identify and fix issues within the HDFS.
1. Data Consistency Check
2. Block Replication Check
3. Block Health Check
4. Namespace Check
14. What does Hadoop encompass, and what are the key components that make up its
ecosystem?
At its core, Hadoop encompasses HDFS for distributed storage and MapReduce (on YARN in Hadoop 2.x) for distributed processing. Key HDFS building blocks include:
Blocks
Namenode
Datanode
Rack awareness
Replication
15. What are the real-time industry applications of Hadoop?
Social Media Analytics
Financial Services
Retail
Telecommunications
Healthcare
IoT
16. Why is a block in HDFS so large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly longer than the time to seek to the start of the block, so the time to transfer a large file made of multiple blocks operates at the disk transfer rate. For example, if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time about 1% of the transfer time the block size needs to be around 100 MB, which is why the default is 128 MB.
17. Infer ‘Streaming’ in Hadoop.
MapReduce is a programming model and processing framework used in Hadoop for distributed data
processing. By default, Hadoop's native implementation of MapReduce requires developers to write
their Map and Reduce functions in Java. However, Hadoop Streaming enables you to use other
scripting or programming languages, such as Python or Perl, to implement these functions.
Here's how Hadoop Streaming works:
Mapper: You write a Mapper script in your preferred language (e.g., Python) that reads input data,
processes it, and emits key-value pairs to the standard output.
Reducer: Similarly, you write a Reducer script in your chosen language, which receives the key-
value pairs emitted by the Mapper, processes them, and produces the final output.
Command-Line Interface: You use the Hadoop Streaming utility from the command line to specify
the Mapper and Reducer scripts, as well as the input and output paths. Hadoop Streaming takes care
of the data distribution, parallel processing, and aggregation.
Hadoop Streaming allows for more flexibility in Hadoop MapReduce jobs, as developers can leverage
their existing expertise in languages other than Java. This feature is particularly useful when dealing
with legacy code or when you need to quickly prototype and test MapReduce jobs without the need
for Java development.
18. Interpret the term ‘Serialization’.
Serialization is the process of converting structured object data into a stream of bytes for transmission over a network between nodes in a cluster, or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.
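A minimal sketch of what this looks like in Hadoop terms, round-tripping an IntWritable through a byte array using plain java.io streams (the class name SerializationDemo is illustrative):

import org.apache.hadoop.io.IntWritable;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        // Serialization: object -> byte stream
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(bytesOut)) {
            original.write(dataOut);
        }
        byte[] bytes = bytesOut.toByteArray();   // 4 bytes for an IntWritable

        // Deserialization: byte stream -> object
        IntWritable restored = new IntWritable();
        try (DataInputStream dataIn =
                     new DataInputStream(new ByteArrayInputStream(bytes))) {
            restored.readFields(dataIn);
        }
        System.out.println(bytes.length + " bytes, value = " + restored.get());
    }
}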
19. State the purpose of Hadoop pipes.
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.
20. Why do map tasks write their output to the local disk, not to HDFS?
Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

16 Marks

1. (i) How does Hadoop's Wordcount program process and analyze text data efficiently and
what are its key components and functionalities?
(ii) How do Hadoop Streaming and Pipes differ, and how do they contribute to data processing in the Hadoop ecosystem?
2. Illustrate the concepts and functioning of the Hadoop Distributed File System (HDFS) and provide
practical examples:
i. Name Node and Data Node
ii. Basic Filesystem operations in Hadoop
iii. Querying the Filesystem
iv. Coherency Model in Hadoop Filesystem
3. Explain the concepts of data integrity and compression while emphasizing their significance and
real-world applications through a detailed explanation.
4. (i) Explain the concept of serialization in Hadoop with example code
(ii) Write short notes on SequenceFile and MapFile in file-based data structures, with example code.
5. Demonstrate the process of implementing a MapReduce program in Java for analyzing a weather
dataset, including key steps such as mapping the data, reducing it to calculate statistics, and setting up
the job configuration.
6. (i) What is metadata? What information does it provide? Explain the role of the NameNode in an
HDFS cluster with a real-time example.
(ii) Show how a client reads and writes data in HDFS. Give example code. Briefly explain the
Hadoop input and output operations.
7. Illustrate the concepts and functioning of the Hadoop input and output format and provide practical
examples:
• Data integrity
• Compression
• Serialization
• Avro
8. Develop a Java interface in Hadoop for custom data types in a MapReduce job.
