Hadoop HDFS Notes

HDFS is designed with a master/slave architecture for high throughput on large data sets, using a single NameNode for metadata management and many DataNodes for block storage. It offers fault tolerance and scalability, but faces challenges such as access latency and management complexity. Key operations include storing, reading, and writing data, with support for very large files and data replication for reliability.

HDFS (Hadoop Distributed File System)

1. Design of HDFS:

- Master/slave architecture with NameNode and DataNodes.

- Optimized for high throughput and large data sets.

- Designed to run on commodity hardware with fault tolerance.

2. HDFS Concepts:

- NameNode: manages the filesystem namespace and metadata (file-to-block mapping, block locations).

- DataNode: stores the actual data blocks and serves client read/write requests.

- Blocks: the unit of storage; default size is 128 MB (configurable, as shown below).
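The default block size can be overridden in hdfs-site.xml; a minimal sketch (the value shown is the stock 128 MB default, expressed in bytes):

<!-- hdfs-site.xml: HDFS block size in bytes (128 MB) -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>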

3. Benefits and Challenges:

- Benefits: fault tolerance through replication, horizontal scalability, cost-effective on commodity hardware.

- Challenges: high latency (unsuited to interactive, low-latency access), poor handling of many small files, complex cluster management.

4. File Sizes and Block Sizes:

- Optimized for very large files (hundreds of megabytes to terabytes).

- The block abstraction lets a single file span many nodes and enables parallelism; for example, a 1 GB file splits into eight 128 MB blocks that can be processed in parallel.

5. Data Replication:

- Default replication factor is 3, set by dfs.replication (see below).

- Ensures reliability and fault tolerance: if a DataNode fails, the NameNode re-replicates its blocks from the surviving copies.
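The cluster-wide default comes from dfs.replication in hdfs-site.xml; replication can also be changed per file from the command line. A minimal example (the path is hypothetical):

hdfs dfs -setrep -w 2 /data/events.log   # set replication to 2 and wait for completion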

6. HDFS Operations (Store, Read, Write):


- Store: files are split into blocks and distributed across DataNodes.

- Read: the client asks the NameNode for block locations, then reads directly from the DataNodes.

- Write: the client pushes data through a pipeline of DataNodes, one per replica (see the Java sketches under sections 7 and 9).

7. Java Interfaces to HDFS:

- org.apache.hadoop.fs.FileSystem

- Provides APIs for interacting with HDFS.
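A minimal read sketch against the FileSystem API, assuming a configured client and a hypothetical path; it mirrors the read path above (NameNode for block locations, DataNodes for the bytes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // fs.defaultFS selects the cluster
        FSDataInputStream in = fs.open(new Path("/data/sample.txt")); // hypothetical path
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}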

8. Command Line Interface:

- hdfs dfs -ls, -put, -get, -rm, etc.
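Typical usage (paths are illustrative):

hdfs dfs -mkdir -p /user/alice/data         # create a directory
hdfs dfs -put local.txt /user/alice/data/   # copy from the local FS into HDFS
hdfs dfs -ls /user/alice/data               # list directory contents
hdfs dfs -get /user/alice/data/local.txt .  # copy back to the local FS
hdfs dfs -rm /user/alice/data/local.txt     # delete a file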

9. Hadoop File System Interfaces:

- FileSystem API, FSDataInputStream, FSDataOutputStream
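A matching write sketch with FSDataOutputStream (path hypothetical); the bytes written here travel through the DataNode pipeline described in section 6:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/data/out.txt")); // overwrites by default
        try {
            out.writeUTF("hello hdfs");   // any DataOutputStream method works
        } finally {
            out.close();                  // close() flushes the replica pipeline
        }
    }
}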

10. Data Flow:

- Data flows from ingestion (e.g., Flume, Sqoop) into HDFS, through processing (e.g., MapReduce), and back to HDFS or an external store for output.

11. Data Ingest with Flume and Sqoop:

- Flume: ingests streaming data (e.g., logs, events) into HDFS.

- Sqoop: transfers bulk data between relational databases and HDFS (see the example below).
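A Sqoop import sketch; the JDBC URL, credentials, and table are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4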

12. Hadoop Archives (HAR):

- Packs many small files into a single archive, reducing NameNode memory usage (each file's metadata lives in NameNode RAM).
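Creation and access follow this pattern (paths hypothetical); archives are browsed through the har:// scheme:

hadoop archive -archiveName logs.har -p /user/alice logs /user/alice/archived
hdfs dfs -ls har:///user/alice/archived/logs.har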

13. Hadoop I/O:

- Compression: Gzip, Bzip2, Snappy.


- Serialization: Writable, Avro.

- File-based structures: SequenceFile, Avro, Parquet.
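A minimal SequenceFile writer sketch; the path and the Text/IntWritable key-value types are illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/pairs.seq");                    // hypothetical path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),           // key type
                SequenceFile.Writer.valueClass(IntWritable.class)); // value type
        try {
            writer.append(new Text("alpha"), new IntWritable(1));   // one record per append
            writer.append(new Text("beta"), new IntWritable(2));
        } finally {
            writer.close();
        }
    }
}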

Hadoop Environment

1. Setting up a Hadoop Cluster:

- Define hardware, software requirements.

- Choose distribution (Apache, Cloudera, Hortonworks).

2. Cluster Specification and Installation:

- Master/slave node setup.

- Install Java, Hadoop.

3. Hadoop Configuration:

- core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
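A minimal sketch of the two settings most clients need; the hostname and port are placeholders (8020 is a common NameNode RPC port):

<!-- core-site.xml: where clients find the NameNode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>

<!-- hdfs-site.xml: cluster-wide default replication -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>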

4. Security in Hadoop:

- Kerberos authentication, HDFS file permissions.
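HDFS permissions follow the POSIX owner/group/other model and use familiar commands (paths and principals illustrative):

hdfs dfs -chown alice:analytics /data/reports    # change owner and group
hdfs dfs -chmod 640 /data/reports/q1.csv         # rw- for owner, r-- for group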

5. Administering Hadoop:

- User management, setting quotas, and checking filesystem health with hdfs fsck (examples below).
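Common checks, with illustrative paths:

hdfs fsck / -files -blocks -locations        # report block health and placement
hdfs dfsadmin -setQuota 10000 /user/alice    # cap the number of names (files + dirs)
hdfs dfsadmin -setSpaceQuota 1t /user/alice  # cap raw space, replicas included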

6. HDFS Monitoring & Maintenance:

- Tools: JMX, Ganglia, Nagios, Cloudera Manager.

7. Hadoop Benchmarks:

- TestDFSIO and TeraSort for performance testing (see below).
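Typical invocations look like the following; the exact jar names and flag syntax vary by Hadoop version, so treat these as a sketch:

# DFS I/O throughput: write 10 files of 128 MB each
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 10 -fileSize 128MB

# TeraSort: generate and sort 1 GB (10 million 100-byte rows)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 /tera/in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /tera/in /tera/out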

8. Hadoop in the Cloud:

- AWS EMR, Google Cloud Dataproc, Azure HDInsight.
