
Bigdata Short

The document provides an overview of Hadoop's architecture, focusing on the Hadoop Distributed File System (HDFS) and MapReduce framework. It details key components such as NameNode, DataNode, and Secondary NameNode, along with features like fault tolerance, scalability, and data integrity. Additionally, it covers the phases of MapReduce, job scheduling with YARN, and various data management techniques including compression and serialization.


UNIT2 BD

1. Hadoop Architecture: HDFS Architecture

 What is HDFS?
o Java-based distributed file system designed for Big Data environments.
o Provides a resilient, clustered approach to manage files using commodity servers.
o Designed to store large files split into blocks, replicated across nodes for fault tolerance.
 Key Features of HDFS:
o Fault-Tolerance: Replicates data to prevent loss during failures.
o Scalability: Easily scales up to 200 PB of storage across thousands of nodes.
o Data Availability: Ensures continuous access by replicating data across multiple nodes.
o Data Reliability: Files are split into blocks and replicated for redundancy.
 Architecture Overview:
o NameNode: Stores file system metadata.
o DataNodes: Store actual data and communicate with the NameNode for block management.
o Replication:
 Default replication factor: 3.
 Placement policy: first replica on the writer's node, second on a node in a remote rack, third on a different node in that same remote rack.
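
To make the flow concrete, here is a minimal sketch of writing a file through the HDFS Java API, assuming a hypothetical NameNode at hdfs://namenode:9000. The client only asks the NameNode for placement; the bytes themselves stream to a pipeline of DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode for block placement; the data is
        // then streamed to a pipeline of DataNodes for replication.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
        fs.close();
    }
}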

2. NameNode, DataNode, Secondary NameNode

 NameNode:
o Acts as the master node in HDFS.
o Tracks the list of blocks, their locations, and health of DataNodes.
o Communicates with DataNodes using heartbeat messages and block reports.
o Does not store user data but coordinates storage and retrieval.
 DataNode:
o Slave nodes responsible for storing and retrieving data blocks.
o Continuously sends heartbeats and block reports to the NameNode.
o Replicates blocks as instructed by the NameNode.
 Secondary NameNode (SNN):
o Works as a checkpointing assistant to the NameNode.
o Periodically merges the edit log into a fresh snapshot (fsimage) of the file system metadata.
o Does not serve clients in real time, but its checkpoints speed up recovery after a NameNode failure.
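
Because all block metadata lives on the NameNode, a client can ask it where a file's blocks are stored without contacting any DataNode. A minimal sketch (file path reused from the write example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt"));

        // The NameNode answers this query from its in-memory metadata.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}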

3. Scaling Out - Block, Data Flow, Replica

 Block Concept:
o Files are divided into blocks (default: 64 MB in Hadoop 1.x; 128 MB in Hadoop 2.x and later).
o Blocks are replicated for fault tolerance and distributed across DataNodes.
 Data Flow:
o During writes, blocks are sent to DataNodes and replicated as per the policy.
o Reads occur directly between the client and the DataNodes.
 Replication Policy:
o Default replication factor: 3.
o Ensures fault tolerance and data availability in case of node or hardware failures.
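
Block size and replication factor are client-side settings. A minimal sketch, assuming Hadoop 2.x property names (in 1.x the block-size key was dfs.block.size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");      // default replication factor
        conf.set("dfs.blocksize", "67108864"); // 64 MB, matching these notes

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed per file after creation:
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 5);
        fs.close();
    }
}
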
4. MapReduce - Phases (Mapper, Sort and Shuffle, Reducer)

 Overview:
o A programming model for processing large datasets in parallel across clusters.
 Phases:
o Mapper Phase:
 Receives input splits (chunks of the input data).
 Processes each record and produces intermediate key-value pairs.
o Shuffle and Sort Phase:
 Transfers intermediate data to reducers.
 Sorts the data by keys to group related records.
o Reducer Phase:
 Aggregates and processes the sorted data to produce the final output.
 Intermediate Output:
o Mapper writes intermediate output to local disks.
o Reducer processes this output to generate the final result.
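
The classic word count makes the phases concrete. A minimal sketch against the Hadoop 2.x mapreduce API: the mapper emits one (word, 1) pair per token, and after the shuffle and sort the reducer sees all counts for a word together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper phase: one (word, 1) pair per token.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }
}

// Reducer phase: after shuffle and sort, all counts for a given word
// arrive together and are summed into the final output.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}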

5. Combiner Functions, Streaming, Job Scheduling

 Combiner Functions:
o Acts as a mini-reducer to minimize data transfer between Mapper and Reducer.
o Reduces network congestion and improves efficiency.
 Streaming:
o Allows MapReduce jobs to be written in languages other than Java.
o Example: Python, Shell scripts.
o Inputs and outputs are handled through standard input/output streams.
 Job Scheduling:
o Managed by YARN, which dynamically allocates resources for jobs.
o Types:
 FIFO Scheduler: Executes tasks in the order of arrival.
 Capacity Scheduler: Divides resources into multiple queues.
 Fair Scheduler: Allocates resources dynamically, ensuring fairness among jobs.
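
A minimal driver sketch showing how a combiner is wired in, reusing the hypothetical WordCountMapper and WordCountReducer above (the reducer is safe to reuse as a combiner because summing is associative and commutative). Scheduler choice, by contrast, is cluster-side YARN configuration, not job code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCountMapper.class);
        // Mini-reduce on map output: summing partial counts locally
        // shrinks the data shipped across the network to reducers.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}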

6. I/O, Data Integrity, Compression, Serialization

 I/O:
o Input is read through InputFormat classes; output is written through OutputFormat classes.
o Examples:
 TextInputFormat: Default format; processes files line by line.
 SequenceFileInputFormat: Handles binary key-value pairs.
 Data Integrity:
o Ensured using checksums: corruption is detected on read and repaired from healthy replicas.
o DataNodes also verify their stored blocks periodically to catch silent corruption.
 Compression:
o Reduces storage space and network traffic.
o Common algorithms: gzip, bzip2, LZO.
 Serialization:
o Converts structured data into byte streams for transmission or storage.
o Hadoop uses the Writable interface for compact and efficient serialization.
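
Compression is typically switched on in the job driver. A minimal sketch, assuming Hadoop 2.x property names, that gzips the final output and also compresses map output to shrink shuffle traffic:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "compressed output");
        // Compress the final reducer output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // Compressing map output cuts shuffle traffic as well.
        job.getConfiguration()
           .setBoolean("mapreduce.map.output.compress", true);
        return job;
    }
}
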
7. File-Based Data Structures

 SequenceFile:
o Stores binary key-value pairs for efficient data processing.
o Supports record-level and block-level compression.
 MapFile:
o An indexed version of SequenceFile for faster lookups.
o Useful for applications requiring sorted data access.
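
A minimal sketch of writing and reading a SequenceFile through the Hadoop 2.x createWriter API, with block-level compression (the path and record contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/pairs.seq");

        // Write binary key-value pairs with block-level compression.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }

        // Read the pairs back in write order.
        IntWritable key = new IntWritable();
        Text value = new Text();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) {
                System.out.println(key.get() + "\t" + value);
            }
        }
    }
}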

8. Developing a MapReduce Application

 Steps:
1. Input Splits:
 Data is divided into fixed-size chunks.
2. Mapper:
 Processes each chunk to produce intermediate key-value pairs.
3. Shuffle and Sort:
 Transfers and organizes intermediate data for the reducer.
4. Reducer:
 Processes grouped key-value pairs to generate the final output.
5. Output:
 Final results are stored in HDFS with replication.
 Fault Tolerance:
o Failed tasks are retried up to 4 times.
o Tasks are rescheduled on different nodes if necessary.
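
The retry limit is itself configurable. A minimal sketch, assuming Hadoop 2.x property names (4 attempts is already the default; it is set explicitly here only to make the knob visible):

import org.apache.hadoop.conf.Configuration;

public class RetryConfig {
    public static Configuration withRetries() {
        Configuration conf = new Configuration();
        // Attempts per task before the whole job is marked failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        return conf;
    }
}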


MCQs/Fill in the Blanks

1. Question: HDFS replicates data ______ times by default.
Answer: 3.
2. Question: ______ is responsible for job scheduling and resource management in Hadoop.
Answer: YARN.
3. Question: The smallest unit of data in HDFS is called a ______.
Answer: Block.
4. Question: The default block size in HDFS is ______.
Answer: 64 MB.
5. Question: ______ is the component responsible for storing metadata in HDFS.
Answer: NameNode.
6. Question: ______ nodes are used to store the actual data in HDFS.
Answer: DataNodes.
7. Question: The process of splitting files into smaller blocks is called ______.
Answer: Data partitioning.
8. Question: The framework used for parallel processing of large data is ______.
Answer: MapReduce.
9. Question: Hadoop uses ______ for resource allocation in a cluster.
Answer: YARN.
10. Question: The ______ is an optional phase that reduces the size of intermediate data.
Answer: Combiner.
11. Question: The phase in MapReduce that aggregates data based on keys is ______.
Answer: Reduce phase.
12. Question: Data flow between mappers and reducers is managed by the ______ phase.
Answer: Shuffle.
13. Question: In HDFS, the default replication factor can be changed during ______.
Answer: File creation.
14. Question: ______ ensures data accuracy and consistency in Hadoop.
Answer: Data integrity.
15. Question: The scheduling method where tasks are executed in the order of arrival is ______.
Answer: FIFO.
16. Question: In MapReduce, data is processed in ______ key-value pairs.
Answer: Intermediate.
17. Question: The container that provides binary key-value storage in Hadoop is ______.
Answer: SequenceFile.
18. Question: The ______ is a master node that assigns tasks in Hadoop.
Answer: JobTracker.
19. Question: ______ manages task execution in MapReduce.
Answer: TaskTracker.
20. Question: In HDFS, ______ nodes send heartbeat signals to the NameNode.
Answer: DataNodes.

One-Line Q&A

21. Question: What is the function of Secondary NameNode?
Answer: It takes periodic snapshots of metadata from the NameNode.
22. Question: What is the purpose of block replication in HDFS?
Answer: To ensure fault tolerance and high data availability.
23. Question: Which phase in MapReduce handles the sorting of keys?
Answer: The Shuffle and Sort phase.
24. Question: What is the main role of the Combiner in MapReduce?
Answer: To reduce the volume of mapper output.
25. Question: What is the size of a block in HDFS by default?
Answer: 64 MB.
26. Question: What is YARN’s function in Hadoop?
Answer: Resource allocation and job scheduling.
27. Question: What is the file format used by HDFS to ensure data consistency?
Answer: Checksum files.
28. Question: What type of failures does the replication feature in HDFS address?
Answer: Node and network failures.
29. Question: Which component in Hadoop splits input files for processing?
Answer: InputFormat.
30. Question: How many times can a failed task be retried in MapReduce?
Answer: 4 times.

3 Marks Q&A

31. Question: What are the key features of HDFS?
Answer: Fault tolerance, scalability, data reliability, replication, and data availability.
32. Question: What are the phases of MapReduce?
Answer: Mapper, Sort and Shuffle, and Reducer.
33. Question: How does the Combiner function improve performance?
Answer: By summarizing the mapper output locally, it reduces data transfer to the reducer.
34. Question: What are the advantages of using replication in HDFS?
Answer: Ensures data redundancy, fault tolerance, and high availability.
35. Question: Explain the FIFO scheduler in Hadoop.
Answer: Tasks are executed in the order of their arrival, with no priority adjustments.
36. Question: Describe the function of the NameNode.
Answer: It manages metadata, monitors DataNodes, and coordinates file operations.
37. Question: What is the function of YARN's Application Master?
Answer: It manages the execution of a specific job and allocates resources for it.
38. Question: How is data integrity maintained in Hadoop?
Answer: Using checksums and replication to detect and recover from errors.
39. Question: What is the purpose of the Shuffle phase in MapReduce?
Answer: To transfer and sort intermediate data for reducer input.
40. Question: Explain data compression in Hadoop.
Answer: Reduces storage needs and improves data transfer speed using tools like gzip and bzip2.

Scenario-Based Questions

41. Question: What happens if a DataNode fails in HDFS?
Answer: The NameNode redirects reads to surviving replicas and schedules re-replication of the under-replicated blocks.
42. Question: Why are intermediate outputs written to local disks in MapReduce?
Answer: To minimize network congestion during the Shuffle phase.
43. Question: How does the Fair Scheduler handle priority tasks?
Answer: It allocates resources dynamically, ensuring high-priority jobs get preference.
44. Question: What is the role of SequenceFile in Hadoop?
Answer: To store binary key-value pairs for efficient data access.
45. Question: How does a TaskTracker report its progress?
Answer: By sending heartbeats to the JobTracker.
46. Question: What is the role of the Secondary NameNode during NameNode failure?
Answer: It minimizes downtime by providing snapshots of metadata.
47. Question: Why is data split into blocks in HDFS?
Answer: To enable distributed storage and processing.
48. Question: How is data availability ensured in HDFS?
Answer: Through block replication across multiple nodes.
49. Question: What is the purpose of Serialization in Hadoop?
Answer: To convert structured data into a byte stream for efficient transmission.
50. Question: How does Hadoop ensure fault tolerance?
Answer: Using data replication and regular heartbeat checks.


Additional 3 Marks Q&A

Hadoop Distributed File System (HDFS)


1. What is the purpose of block replication in HDFS?
Answer: Block replication ensures fault tolerance and data availability. If a node fails or data is
corrupted, HDFS retrieves the data from its replicated copies.
2. What is the role of the NameNode in HDFS?
Answer: The NameNode is the master node in HDFS. It stores metadata about file locations,
manages file operations like read/write, and coordinates with DataNodes to replicate data for fault
tolerance.
3. What is the difference between vertical and horizontal scaling in HDFS?
Answer:
o Vertical Scaling: Adds more resources (CPU, memory) to existing nodes but requires
downtime.
o Horizontal Scaling: Adds more nodes to the cluster without downtime, making it real-time
scalable.

MapReduce

4. Explain the Shuffle and Sort phase in MapReduce.
Answer: This phase transfers intermediate data from mappers to reducers. It sorts keys generated by
mappers and groups similar keys for processing by reducers, ensuring efficient data aggregation.
5. What are the responsibilities of the Reducer in MapReduce?
Answer: The Reducer processes grouped intermediate key-value pairs received after the Shuffle and
Sort phase. It performs aggregation or computation and generates the final output, which is stored in
HDFS.
6. What is the purpose of the Combiner in MapReduce?
Answer: The Combiner acts as a mini-reducer, processing intermediate data locally on each node. It
reduces the volume of data transferred to the reducer, optimizing network usage.

Data Flow and Architecture

7. How does data flow occur in HDFS during write operations?
Answer: The client asks the NameNode where each block should go; the NameNode returns a pipeline of
DataNodes. The client streams each block to the first DataNode, which forwards it along the pipeline
until the replication target is met.
8. What is the function of heartbeats in HDFS?
Answer: DataNodes send periodic heartbeats to the NameNode to confirm they are functioning
correctly. If a heartbeat is not received, the NameNode assumes the DataNode has failed and
reassigns its blocks.
9. Describe the role of the Secondary NameNode in Hadoop.
Answer: The Secondary NameNode periodically takes snapshots of the NameNode's metadata and
edits logs. These snapshots help in recovering the cluster in case the NameNode fails.

Scheduling and Resource Management

10. What are the three types of schedulers in Hadoop?
Answer:
o FIFO Scheduler: Executes tasks in the order of their arrival.
o Capacity Scheduler: Allocates resources into multiple queues.
o Fair Scheduler: Dynamically assigns resources, ensuring fairness among jobs.
11. What is YARN, and what is its function in Hadoop?
Answer: YARN (Yet Another Resource Negotiator) manages cluster resources and job scheduling.
It divides responsibilities between a ResourceManager (scheduling) and ApplicationMaster (job
execution).
12. What is the role of the ApplicationMaster in YARN?
Answer: The ApplicationMaster coordinates the execution of a specific job, requests resources from
the ResourceManager, and monitors the job's progress.

Data Integrity, Compression, and Serialization

13. How is data integrity maintained in HDFS?
Answer: Data integrity is maintained using checksums. A checksum is calculated when data is
written, and verified during reads to detect corruption. Replicas are used to recover from errors.
14. Why is compression important in Hadoop?
Answer: Compression reduces storage requirements and speeds up data transfer over the network. It
optimizes storage space and enhances performance in data-intensive applications.
15. What is serialization in Hadoop, and why is it important?
Answer: Serialization converts structured data into byte streams for efficient storage and
transmission. It is critical for inter-process communication and data persistence in distributed
systems.
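
As a concrete illustration of the Writable contract, a minimal sketch of a custom serializable record type (the type and its fields are invented for the example). Hadoop serializes it by calling write() and reconstructs it by calling readFields() in the same field order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record type for illustration only.
public class PageViewWritable implements Writable {
    private long timestamp;
    private int viewCount;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);   // serialize fields in a fixed order
        out.writeInt(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();  // deserialize in the same order
        viewCount = in.readInt();
    }
}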

HDFS and Job Execution

16. What is the difference between SequenceFile and MapFile?
Answer:
o SequenceFile: Stores binary key-value pairs for efficient data processing.
o MapFile: An indexed version of SequenceFile, allowing faster lookups by key.
17. What are the key responsibilities of a TaskTracker in Hadoop?
Answer: A TaskTracker manages the execution of individual tasks. It reports progress to the
JobTracker and reschedules tasks if there is a failure.
18. What is the role of the JobTracker in MapReduce?
Answer: The JobTracker schedules tasks, assigns them to TaskTrackers, and monitors their
execution. It also manages task rescheduling in case of failures.

Additional Advanced Topics

19. Explain the concept of rack awareness in HDFS.
Answer: Rack awareness refers to the placement of data replicas across nodes in different racks to
minimize data loss during rack failures. It also optimizes network traffic during read/write
operations.
20. What happens when a DataNode fails in Hadoop?
Answer: The NameNode detects the failure (via missing heartbeats) and reassigns tasks to other
nodes. It retrieves the lost blocks from replicas to ensure data availability.
21. How are failed tasks handled in MapReduce?
Answer: If a task fails, the JobTracker reschedules it on a different node. A task can be retried up to
4 times before being marked as failed.

