Hadoop Notes
Introduction to Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation for
processing and storing large volumes of data. It is designed to scale from a single server to
thousands of machines, each offering local computation and storage. Hadoop enables
distributed processing of large datasets across clusters of computers using simple
programming models.
Key Features:
• Open-source and cost-effective.
• High fault tolerance.
• Scalable and flexible.
• Designed for Big Data.
Advantages of Using Hadoop
1. Scalability
Hadoop can easily scale up from a single server to thousands of machines, each offering
local storage and computation.
2. Cost-Effective
Since it uses commodity hardware (inexpensive machines), Hadoop is much cheaper than
traditional systems.
3. Fault Tolerance
If a machine fails, Hadoop automatically recovers the data using replicated copies stored on
other nodes.
4. High Availability
Data is replicated across multiple nodes, ensuring that it’s always available, even during
failures.
5. Flexibility
Hadoop can store and process various types of data: structured, semi-structured, and
unstructured (like text, images, videos).
6. Speed
With parallel processing and local data computation, Hadoop can process large volumes of
data quickly.
7. Open Source
It’s free to use and has a large community for support, updates, and improvements.
8. Support for Big Data Tools
Hadoop works well with many other big data tools and ecosystems like Hive, Pig, Spark,
HBase, and more.
Hadoop Architecture
Hadoop has a Master-Slave architecture composed of the following components:
• HDFS (Hadoop Distributed File System): For storage.
• MapReduce: For processing data.
• YARN (Yet Another Resource Negotiator): Manages resources and job scheduling.
Main Nodes:
• NameNode: Master node that manages metadata and file directory structure.
• DataNodes: Slave nodes that store actual data blocks.
• ResourceManager: Manages system resources and job allocation.
• NodeManagers: Execute tasks on slave nodes.
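To make the master nodes concrete, the sketch below shows how a Hadoop client configuration
names the NameNode and the ResourceManager. The host names, port, and replication value are
placeholders rather than values from these notes; the property keys (fs.defaultFS,
yarn.resourcemanager.hostname, dfs.replication) are standard Hadoop configuration properties.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: clients and DataNodes locate the NameNode (master) through fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");             // placeholder host/port

        // YARN: NodeManagers and clients locate the ResourceManager (master) here.
        conf.set("yarn.resourcemanager.hostname", "resourcemanager"); // placeholder host

        // Number of replicas HDFS keeps of each data block (3 is the usual default).
        conf.set("dfs.replication", "3");

        System.out.println("NameNode URI: " + conf.get("fs.defaultFS"));
    }
}

In a real cluster these values normally live in core-site.xml, yarn-site.xml, and hdfs-site.xml
rather than in code; setting them programmatically here only keeps the example self-contained.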
Hadoop Distributed File System (HDFS)
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. It works by rapidly transferring data between nodes and is often used by
organizations that need to handle and store big data. HDFS is a key component of many
Hadoop systems because it provides a means of managing big data and supports big data
analytics.
1. NameNode (Master)
• Stores metadata: file names, permissions, and locations of file blocks.
• Manages the namespace of the file system.
• Does not store the actual data.
2. DataNodes (Slaves)
• Store actual data blocks of files.
• Periodically send heartbeats to the NameNode to report status.
• Serve read and write requests from clients.
3. Secondary NameNode
• Helps the NameNode by merging the edit logs and fsimage (file system snapshot).
• It is not a backup for the NameNode.
• It reduces the workload on the NameNode.
Read Process:
• Client requests file → NameNode returns block locations.
• Client reads blocks directly from the DataNodes.
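A minimal Java sketch of this read path is shown below, using the standard Hadoop FileSystem
API. The NameNode address and the file path are hypothetical; open() contacts the NameNode for
block locations and then streams the blocks directly from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");         // hypothetical file

        // open() asks the NameNode for the block locations, then the returned
        // stream reads each block directly from the DataNodes holding it.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}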
Features:
• Distributed data storage.
• Files are split into large blocks, which reduces seek time.
• The data is highly available because the same block is stored on multiple DataNodes (see
the sketch after this list).
• Even if several DataNodes are down, the data can still be read from other replicas, making
the system highly reliable.
• High fault tolerance.
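The replication behind these features can be inspected from client code. The sketch below
(with a placeholder NameNode address and a hypothetical file) lists each block of a file
together with the DataNodes that hold its replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");         // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one block of the file plus the DataNodes that
        // store its replicas; losing one of those hosts does not lose the block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}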
What is MapReduce?
MapReduce is a programming model used in Hadoop for processing large datasets in
parallel across a distributed cluster. It has two main phases:
• Map Phase
• Reduce Phase
Each phase has a specific task in breaking down and combining data.
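The classic word-count example below sketches the two phases in Java. The class and variable
names are illustrative; the Mapper and Reducer base classes are the standard Hadoop MapReduce
API.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in its input split, emit the pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum all the 1s emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Between the two phases, the framework shuffles and sorts the mapper output so that all values
for the same key arrive at the same reducer.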
Architecture of MapReduce
1. Client: The MapReduce client submits the job to MapReduce for processing (a driver
sketch follows this list). There can be multiple clients that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The actual work the client wants to do, made up of many smaller tasks that the
client wants to process or execute.
3. Hadoop MapReduce Master: Divides the job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all
the job-parts are combined to produce the final output.
5. Input Data: The data set fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
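A minimal driver sketch for the client side is shown below. It assumes the WordCountMapper and
WordCountReducer classes from the earlier sketch, and it takes the input and output HDFS paths
from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The client describes the Job and submits it; the framework splits it into
// map and reduce tasks (the "job-parts") and combines their results.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // Map phase
        job.setReducerClass(WordCountReducer.class);  // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submit the job and wait until the final output has been produced.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}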
Introduction to Hive
Apache Hive is a data warehouse tool built on top of the Hadoop framework. It allows users
to manage and analyze large datasets stored in HDFS (Hadoop Distributed File System)
using a SQL-like language called HiveQL.
Features of Hive
• SQL-like Language: Hive uses HiveQL, which is similar to SQL, making it easy for
people with database experience to use.
• Flexible Storage: Works with various file formats like Text, ORC, Parquet, etc.
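As an illustration of HiveQL, the sketch below runs a query through Hive's JDBC interface. The
HiveServer2 address, the user, and the sales table are hypothetical; the driver class and the
jdbc:hive2:// URL form are the standard Hive JDBC conventions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver

        // Placeholder HiveServer2 host, port, and database.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; Hive executes it over data stored in HDFS.
            // 'sales' is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product");
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
            }
        }
    }
}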
Hive Components