Bigdata Short
What is HDFS?
o Java-based distributed file system designed for Big Data environments.
o Provides a resilient, clustered approach to manage files using commodity servers.
o Designed to store large files split into blocks, replicated across nodes for fault tolerance.
Key Features of HDFS:
o Fault-Tolerance: Replicates data to prevent loss during failures.
o Scalability: Scales to thousands of nodes; single clusters have demonstrated storage on the order of 200 PB.
o Data Availability: Ensures continuous access by replicating data across multiple nodes.
o Data Reliability: Files are split into blocks and replicated for redundancy.
Architecture Overview:
o NameNode: Stores file system metadata.
o DataNodes: Store the actual data blocks and communicate with the NameNode for block management.
o Replication:
Default replication factor: 3.
Placement policy: first replica on the writer's (local) node, second on a node in a remote rack, third on a different node in that same remote rack.
NameNode:
o Acts as the master node in HDFS.
o Tracks the list of blocks, their locations, and health of DataNodes.
o Communicates with DataNodes using heartbeat messages and block reports.
o Does not store user data but coordinates storage and retrieval.
DataNode:
o Slave nodes responsible for storing and retrieving data blocks.
o Continuously sends heartbeats and block reports to the NameNode.
o Replicates blocks as instructed by the NameNode.
Secondary NameNode (SNN):
o Works as an assistant to the NameNode.
o Periodically merges the NameNode's edit log with the fsimage (checkpointing), keeping the metadata snapshot compact.
o Does not serve real-time requests and is not a hot standby, but its checkpoints shorten restart time and aid recovery after a NameNode failure.
Block Concept:
o Files are divided into blocks (default 64 MB in Hadoop 1.x; 128 MB in Hadoop 2.x and later).
o Blocks are replicated for fault tolerance and distributed across DataNodes.
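To make the block concept concrete, here is a minimal sketch using the HDFS Java client (the file path /data/sample.txt and the class name ListBlocks are hypothetical) that prints each block of a file together with the DataNodes holding its replicas:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/sample.txt");      // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // Each BlockLocation covers one block and lists the DataNodes holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }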
Data Flow:
o During writes, blocks are sent to DataNodes and replicated as per the policy.
o Reads occur directly between the client and the DataNodes.
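A minimal sketch of the write and read paths through the FileSystem API (the path /tmp/hello.txt and the class name WriteThenRead are illustrative): the write is pipelined through DataNodes per the placement policy, while the read streams bytes directly from the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class WriteThenRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/hello.txt");           // hypothetical path

            // Write: the client streams data to a pipeline of DataNodes;
            // replication happens inside the pipeline, per the placement policy.
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode supplies block locations; the bytes flow
            // directly from the DataNodes to the client.
            try (FSDataInputStream in = fs.open(p)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }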
Replication Policy:
o Default replication factor: 3.
o Ensures fault tolerance and data availability in case of node or hardware failures.
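Replication can also be tuned per file at runtime; a small sketch, assuming the cluster-wide default factor of 3 and a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AdjustReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/hello.txt");           // hypothetical file
            fs.setReplication(p, (short) 2);               // override the default factor of 3 for this file
            System.out.println("replication = " + fs.getFileStatus(p).getReplication());
            fs.close();
        }
    }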
4. MapReduce - Phases (Mapper, Sort and Shuffle, Reducer)
Overview:
o A programming model for processing large datasets in parallel across clusters.
Phases:
o Mapper Phase:
The input is divided into splits, one per map task.
Each mapper processes its split and produces intermediate key-value pairs.
o Shuffle and Sort Phase:
Transfers intermediate data to reducers.
Sorts the data by keys to group related records.
o Reducer Phase:
Aggregates and processes the sorted data to produce the final output.
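The classic word count illustrates the Mapper and Reducer phases; this sketch follows the standard Hadoop example, and the class names are illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper phase: each input line is tokenized into (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);          // intermediate key-value pair
                }
            }
        }

        // Reducer phase: after shuffle and sort, all counts for one word
        // arrive together and are summed into the final output.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }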
Intermediate Output:
o The Mapper writes its intermediate output to local disk (not HDFS).
o Reducer processes this output to generate the final result.
Combiner Functions:
o Acts as a mini-reducer to minimize data transfer between Mapper and Reducer.
o Reduces network congestion and improves efficiency.
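As a sketch, the word-count reducer above can be reused as a combiner because addition is commutative and associative; the helper class below is hypothetical and assumes the WordCount classes from the earlier example:

    import org.apache.hadoop.mapreduce.Job;

    public class CombinerSetup {
        // Runs the reduce logic on each mapper's local output before the shuffle,
        // so far fewer (word, 1) pairs cross the network.
        static void useReducerAsCombiner(Job job) {
            job.setCombinerClass(WordCount.IntSumReducer.class);
        }
    }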
Streaming:
o Allows MapReduce jobs to be written in languages other than Java.
o Examples: Python, shell scripts.
o Inputs and outputs are handled through standard input/output streams.
Job Scheduling:
o Managed by YARN, which dynamically allocates resources for jobs.
o Types:
FIFO Scheduler: Runs jobs in order of submission.
Capacity Scheduler: Divides cluster capacity among multiple queues, each with a guaranteed share.
Fair Scheduler: Dynamically balances resources so that all running jobs receive a fair share over time.
I/O:
o Input and output formats are handled by InputFormat and OutputFormat classes, respectively.
o Examples:
TextInputFormat: Default format; reads files line by line (key = byte offset, value = line contents).
SequenceFileInputFormat: Handles binary key-value pairs.
Data Integrity:
o Ensured using checksums that detect corrupted data blocks; corrupt replicas are re-created from healthy copies rather than repaired in place.
o Periodic block verification by DataNodes ensures reliability.
Compression:
o Reduces storage space and network traffic.
o Common algorithms: gzip, bzip2, LZO.
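A small sketch of enabling gzip compression for a job's final output (assumes the new mapreduce API; the class and method names below are illustrative):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OutputCompression {
        // Compresses the final reducer output with gzip; intermediate map output
        // compression is controlled separately via mapreduce.map.output.compress.
        static void compressOutput(Job job) {
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        }
    }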
Serialization:
o Converts structured data into byte streams for transmission or storage.
o Hadoop uses the Writable interface for compact and efficient serialization.
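A minimal sketch of the Writable contract with a hypothetical two-field record; each object serializes its fields directly, with no per-record schema, which keeps the byte stream compact:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record type: serializes itself field by field.
    public class PageView implements Writable {
        private long timestamp;
        private int httpStatus;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(timestamp);
            out.writeInt(httpStatus);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();
            httpStatus = in.readInt();
        }
    }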
7. File-Based Data Structures
SequenceFile:
o Stores binary key-value pairs for efficient data processing.
o Supports compression at the record or block level.
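A short sketch that writes a block-compressed SequenceFile of (Text, IntWritable) pairs; the output path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/counts.seq");       // hypothetical output path
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class),
                    // BLOCK compression compresses runs of records together.
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                writer.append(new Text("hadoop"), new IntWritable(42));
                writer.append(new Text("hdfs"), new IntWritable(7));
            }
        }
    }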
MapFile:
o An indexed version of SequenceFile for faster lookups.
o Useful for applications requiring sorted data access.
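A sketch of writing and then looking up a MapFile; it uses the older FileSystem-based constructors, which are deprecated in recent Hadoop releases but still available, and the directory name is hypothetical. Keys must be appended in sorted order so the index stays valid:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            String dir = "/tmp/wordcounts.map";            // hypothetical MapFile directory

            // Keys must be appended in sorted order; MapFile maintains an index over them.
            try (MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, IntWritable.class)) {
                writer.append(new Text("hadoop"), new IntWritable(42));
                writer.append(new Text("hdfs"), new IntWritable(7));   // "hdfs" sorts after "hadoop"
            }

            // Random lookup by key, served via the in-memory index plus a short scan.
            try (MapFile.Reader reader = new MapFile.Reader(fs, dir, conf)) {
                IntWritable value = new IntWritable();
                if (reader.get(new Text("hdfs"), value) != null) {
                    System.out.println("hdfs -> " + value.get());
                }
            }
        }
    }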
Steps:
1. Input Splits:
Input data is divided into splits (typically one per HDFS block).
2. Mapper:
Processes each chunk to produce intermediate key-value pairs.
3. Shuffle and Sort:
Transfers and organizes intermediate data for the reducer.
4. Reducer:
Processes grouped key-value pairs to generate the final output.
5. Output:
Final results are stored in HDFS with replication.
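Tying the steps together, a minimal driver sketch that wires the hypothetical WordCount mapper and reducer from the earlier example to input and output paths (paths come from the command line; all names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // 1-2. Input splits and mapper.
            job.setInputFormatClass(TextInputFormat.class);     // the default, shown explicitly
            FileInputFormat.addInputPath(job, new Path(args[0]));
            job.setMapperClass(WordCount.TokenizerMapper.class);

            // 3-4. Shuffle and sort are handled by the framework; the reducer aggregates.
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // 5. Final output lands in HDFS at the given path.
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }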
Fault Tolerance:
o Failed map or reduce tasks are detected through missing heartbeats and re-executed on other nodes.
o Map outputs lost with a failed node are regenerated by re-running those map tasks; HDFS replication protects the final output.
One-Line Q&A
3 Marks Q&A
Scenario-Based Questions
MapReduce