
Assignment-3: Big Data (BCS-061)

Q1.

Block abstraction in HDFS divides large files into fixed-size blocks (default 128 MB) for storage. These blocks are distributed across the cluster to optimize performance and scalability. Block size matters because larger blocks reduce metadata overhead and improve throughput, while smaller blocks may cause excessive load on the NameNode.
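
As a rough illustration of how block size is exposed to clients, the sketch below sets the dfs.blocksize property on the client configuration before obtaining a FileSystem handle; the 256 MB value is purely illustrative and not part of the answer above.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: requesting a larger block size for files this client creates.
public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize applies to newly written files; 256 MB is illustrative.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size for new files: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
        fs.close();
    }
}
```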

Q2.

HDFS stores a file by splitting it into blocks and distributing those blocks across DataNodes. To read, the client asks the NameNode for the block locations and then pulls the data directly from the DataNodes that hold the blocks. To write, the client asks the NameNode to allocate blocks and streams the data into a pipeline of DataNodes: the first DataNode forwards each packet to the second, which forwards it to the third, so the required replicas are created as the data is written.
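
The following is a hedged sketch of both data paths from the client's point of view, assuming a reachable HDFS and a hypothetical scratch path /tmp/demo.txt: the write streams through the DataNode pipeline behind fs.create(), and the read pulls blocks directly from DataNodes after the NameNode supplies their locations.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteFlow {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path demo = new Path("/tmp/demo.txt");  // hypothetical path

        // Write path: the client streams bytes into a DataNode pipeline;
        // block allocation and replication happen behind this call.
        try (FSDataOutputStream out = fs.create(demo)) {
            out.writeUTF("hello hdfs");
        }

        // Read path: the NameNode supplies block locations, and the stream
        // then reads directly from the DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(demo)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```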

Q3.

Data replication in HDFS ensures fault tolerance and high availability. Each block is typically replicated three times across different nodes. If a node fails, the data is still accessible from replicas. Replication also helps with load balancing and improves data locality during processing.
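
To make the replication factor concrete, here is a small sketch (the file path and the factor of three are illustrative) that sets and then reads back a file's replication through the FileSystem API.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/file.txt");  // hypothetical file

        // Ask HDFS to keep three replicas of this file's blocks.
        fs.setReplication(file, (short) 3);

        // Report the replication factor currently recorded for the file.
        short factor = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + factor);

        fs.close();
    }
}
```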

Q4.

Compression reduces data size, saving storage and bandwidth. Serialization converts objects into byte streams for transmission and storage. Both are crucial in Hadoop I/O to improve efficiency and performance. Efficient serialization and compression accelerate data transfer between nodes and reduce storage overhead.
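
The sketch below illustrates both ideas with Hadoop's own I/O classes: a Text and an IntWritable are serialized to a byte stream, and those bytes are then compressed with GzipCodec. The sample record and the printed sizes are illustrative only.

```
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SerializeAndCompress {
    public static void main(String[] args) throws Exception {
        // Serialization: Writables turn objects into compact byte streams.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytes);
        new Text("sample record").write(dataOut);
        new IntWritable(42).write(dataOut);
        dataOut.close();

        // Compression: wrap an output stream with a codec to shrink the bytes.
        CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, new Configuration());
        ByteArrayOutputStream compressedBytes = new ByteArrayOutputStream();
        OutputStream compressedOut = codec.createOutputStream(compressedBytes);
        compressedOut.write(bytes.toByteArray());
        compressedOut.close();

        System.out.println("Serialized: " + bytes.size()
                + " bytes, compressed: " + compressedBytes.size() + " bytes");
    }
}
```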

Q5.

Cluster setup involves:

1. Installing Java and Hadoop.
2. Configuring environment variables.
3. Editing core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml (key properties are sketched after this list).
4. Formatting the NameNode.
5. Starting HDFS and YARN daemons.
6. Verifying the setup using web interfaces or command-line tools.
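
As a rough companion to step 3, the sketch below sets in code the handful of properties a minimal single-node setup usually places in those four XML files; the localhost address, single replica, and shuffle service value are typical illustrative settings, not values mandated by the assignment.

```
import org.apache.hadoop.conf.Configuration;

public class MinimalClusterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");               // core-site.xml
        conf.setInt("dfs.replication", 1);                               // hdfs-site.xml
        conf.set("mapreduce.framework.name", "yarn");                    // mapred-site.xml
        conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");  // yarn-site.xml

        // Echo the values back to confirm what would normally live in the XML files.
        for (String key : new String[] {"fs.defaultFS", "dfs.replication",
                "mapreduce.framework.name", "yarn.nodemanager.aux-services"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}
```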

Q6.

Advantages: Scalability, cost-efficiency, and easy resource management. Cloud providers offer flexibility and eliminate hardware maintenance.

Challenges: Data security, latency, compliance issues, and dependence on internet connectivity. Performance tuning and configuration management can also be complex in cloud environments.

Q7.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHDFS {
    public static void main(String[] args) throws Exception {
        // Pick up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open the HDFS file and stream its full contents to stdout.
        Path path = new Path("/user/hadoop/file.txt");
        FSDataInputStream input = fs.open(path);
        try {
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
            fs.close();
        }
    }
}
```
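
Assuming the class is compiled against the Hadoop client libraries and packaged into a jar, it is typically launched with hadoop jar <jar-name> ReadFromHDFS, so that the cluster's configuration files on the classpath direct FileSystem.get(conf) to HDFS rather than the local file system.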
