Big Data Assignment 3
Q1.
Block abstraction in HDFS divides large files into fixed-size blocks (128 MB by default) for storage.
These blocks are distributed across the cluster to improve performance and scalability. Block size
matters because larger blocks reduce metadata overhead on the NameNode and improve sequential
throughput, while blocks that are too large limit parallelism, since fewer tasks can process the file concurrently.
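As a rough sketch (the path and the 256 MB value below are only example assumptions), the block size can be set per file when it is created through the FileSystem API:
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Sketch: create an HDFS file with an explicit block size (example values only).
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/blocksize-demo.txt");   // hypothetical path
        long blockSize = 256L * 1024 * 1024;               // 256 MB instead of the 128 MB default
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, blockSize);
        out.writeBytes("hello HDFS\n");
        out.close();
        System.out.println(fs.getFileStatus(path).getBlockSize()); // block size recorded for the file
        fs.close();
    }
}
```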
Q2.
HDFS stores files by splitting them into blocks and distributing them across DataNodes. To read a file,
the client asks the NameNode for the block locations and then reads the blocks directly from the
DataNodes. To write, the client streams data through a pipeline of DataNodes: each DataNode forwards
the data to the next, so every block is replicated as it is written.
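As a small sketch of the read path (the file path is an assumed example), the block locations returned by the NameNode can be inspected through the FileSystem API before the client reads from the DataNodes:
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Sketch: list which DataNodes hold each block of a file (example path).
public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/blocksize-demo.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```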
Q3.
Data replication in HDFS ensures fault tolerance and high availability. Each block is typically
replicated three times across different nodes. If a node fails, the data is still accessible from replicas.
It also helps in load balancing and improves data locality during processing.
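A minimal sketch, assuming an existing file at an example path, of checking and changing a file's replication factor through the FileSystem API:
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Sketch: inspect and change the replication factor of a file (example path).
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/blocksize-demo.txt");   // hypothetical path
        System.out.println(fs.getFileStatus(path).getReplication()); // current factor
        fs.setReplication(path, (short) 2);                 // request two replicas instead
        fs.close();
    }
}
```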
Q4.
Compression reduces data size, saving storage space and network bandwidth. Serialization converts
objects into byte streams for transmission and storage. Both are central to Hadoop I/O: efficient
serialization and compression speed up data transfer between nodes and reduce the disk and network
I/O performed during jobs.
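As an illustration of Hadoop's Writable serialization (the value 42 is just an example), an IntWritable can be written to a byte stream and read back:
```
import java.io.*;
import org.apache.hadoop.io.IntWritable;

// Sketch: serialize an IntWritable to bytes and reconstruct it.
public class WritableExample {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));            // serialize

        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))); // deserialize
        System.out.println(copy.get());                          // prints 42
    }
}
```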
Q5.
Q6.
Advantages: Scalability, cost-efficiency, and easy resource management. Cloud providers offer elastic
storage and compute, so clusters can grow or shrink with demand.
Challenges: Data security, latency, compliance issues, and dependence on internet connectivity.
Performance tuning and configuration management can also be complex in cloud environments.
Q7.
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

// Reads a file from HDFS using the FileSystem API and prints its contents.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);                     // HDFS path passed on the command line
        FSDataInputStream input = fs.open(path);
        byte[] buffer = new byte[(int) fs.getFileStatus(path).getLen()];
        input.readFully(0, buffer);                        // read the whole file from offset 0
        System.out.println(new String(buffer));
        input.close();
        fs.close();
    }
}
```
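Assuming the class above is compiled against the Hadoop client libraries and packaged into a jar, it could be run with something like `hadoop jar hdfsread.jar HdfsRead /path/in/hdfs/file.txt`, where the jar name and file path are placeholders.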