
HADOOP NOTES

Introduction to Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation for
processing and storing large volumes of data. It is designed to scale from a single server to
thousands of machines, each offering local computation and storage. Hadoop enables
distributed processing of large datasets across clusters of computers using simple
programming models.
Key Features:
• Open-source and cost-effective.
• High fault tolerance.
• Scalable and flexible.
• Designed for Big Data.
Advantages of Using Hadoop
1. Scalability
Hadoop can easily scale up from a single server to thousands of machines, each offering
local storage and computation.
2. Cost-Effective
Since it uses commodity hardware (inexpensive machines), Hadoop is much cheaper than
traditional systems.
3. Fault Tolerance
If a machine fails, Hadoop automatically recovers the data using replicated copies stored on
other nodes.
4. High Availability
Data is replicated across multiple nodes, ensuring that it’s always available, even during
failures.
5. Flexibility
Hadoop can store and process various types of data: structured, semi-structured, and
unstructured (like text, images, videos).
6. Speed
With parallel processing and local data computation, Hadoop can process large volumes of
data quickly.
7. Open Source
It’s free to use and has a large community for support, updates, and improvements.
8. Support for Big Data Tools
Hadoop works well with many other big data tools and ecosystems like Hive, Pig, Spark,
HBase, and more.

Configuring Hadoop in a Clustered Environment


In a clustered environment, Hadoop runs on multiple machines (nodes) working together.
Basic Steps:
• Install Java: Required for Hadoop to run.
• Configure SSH: Enables password-less login between nodes.
• Set Environment Variables: Like JAVA_HOME and HADOOP_HOME.
• Hadoop Configuration Files (a minimal configuration sketch follows this list):
o core-site.xml: Basic settings (e.g., HDFS URI).
o hdfs-site.xml: HDFS settings (replication factor, data directories).
o mapred-site.xml: MapReduce job configuration.
o yarn-site.xml: YARN resource management settings.
• Start Hadoop Daemons: NameNode, DataNode, ResourceManager, and
NodeManager.
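
The same properties that live in core-site.xml, hdfs-site.xml, and yarn-site.xml can also be read or overridden from Java through Hadoop's Configuration API. The sketch below is only illustrative; the host name, port, and directory path are assumptions, not values from these notes.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // core-site.xml equivalent: HDFS URI of the NameNode (host and port are assumptions)
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // hdfs-site.xml equivalents: replication factor and DataNode data directory (illustrative)
        conf.set("dfs.replication", "3");
        conf.set("dfs.datanode.data.dir", "/data/hadoop/dfs/data");

        // yarn-site.xml equivalent: ResourceManager host (illustrative)
        conf.set("yarn.resourcemanager.hostname", "namenode-host");

        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}

In a real cluster these values normally stay in the XML files so every daemon sees the same settings; the programmatic form is mainly useful for client-side overrides.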

Hadoop Architecture
Hadoop has a Master-Slave architecture composed of the following components:
• HDFS (Hadoop Distributed File System): For storage.
• MapReduce: For processing data.
• YARN (Yet Another Resource Negotiator): Manages resources and job scheduling.
Main Nodes:
• NameNode: Master node that manages metadata and file directory structure.
• DataNodes: Slave nodes that store actual data blocks.
• ResourceManager: Manages system resources and job allocation.
• NodeManagers: Execute tasks on slave nodes.
Hadoop Distributed File System (HDFS)
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. It works by rapidly transferring data between nodes and is widely used by
organizations that need to store and process big data. HDFS is a key component of many
Hadoop systems, as it provides a means for managing big data and supports big data
analytics.

1. NameNode (Master)
• Stores metadata: file names, permissions, and locations of file blocks.
• Manages the namespace of the file system.
• Does not store the actual data.
2. DataNodes (Slaves)
• Store actual data blocks of files.
• Periodically send heartbeats to the NameNode to report status.
• Serve read and write requests from clients.
3. Secondary NameNode
• Helps the NameNode by merging the edit logs and fsimage (file system snapshot).
• It is not a backup for the NameNode.
• It reduces the workload on the NameNode.

HDFS Workflow (Read/Write)


Write Process:
• Client contacts NameNode → gets list of DataNodes.
• Data is split into blocks → written to DataNodes in a pipeline.
• Each block is replicated to other nodes.

Read Process:
• Client requests file → NameNode returns block locations.
• Client reads blocks directly from the DataNodes.
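
As an illustration of the flow above, here is a minimal Java sketch using Hadoop's FileSystem API. The client code never contacts DataNodes explicitly; the library performs the NameNode lookup and the DataNode pipeline described above. The file path is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS

        Path file = new Path("/user/example/notes.txt");   // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the data block by block through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, and the bytes
        // are read directly from the DataNodes that hold them.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}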

Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available, as the same block is present on multiple DataNodes.
• Even if several DataNodes go down, the data can still be accessed, making the system
highly reliable.
• High fault tolerance.
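
For a rough sense of scale, using the common defaults of a 128 MB block size and a replication factor of 3 (these notes do not specify either value): a 1 GB file is split into 1024 / 128 = 8 blocks, and with 3 replicas each the cluster stores 8 × 3 = 24 block copies spread across DataNodes, so losing any single node still leaves at least one copy of every block available.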
What is MapReduce?
MapReduce is a programming model used in Hadoop for processing large datasets in
parallel across a distributed cluster. It has two main phases:
• Map Phase
• Reduce Phase
Each phase has a specific task in breaking down and combining data.
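
A minimal Java sketch of the two phases, using the classic word-count example (not taken from these notes): the Mapper emits (word, 1) pairs, and the Reducer sums the counts for each word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: split each input line into words and emit (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum all the 1s emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}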

Architecture of MapReduce

1. Client: The MapReduce client is the entity that submits a job to MapReduce for
processing. There can be multiple clients continuously submitting jobs to the Hadoop
MapReduce master.
2. Job: The MapReduce job is the actual work the client wants done, made up of many
smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: Divides the job into subsequent job-parts (tasks).
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all
the job-parts are combined to produce the final output.
5. Input Data: The dataset fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
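
To tie these components together, a client typically packages the Mapper and Reducer into a Job and submits it to the cluster. Below is a minimal driver sketch; the class names are carried over from the word-count example above, and the input/output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");     // the "Job" handed over by the client

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);          // Map phase
        job.setReducerClass(WordCountReducer.class);        // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/example/input"));    // input data (hypothetical)
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output")); // output data (hypothetical)

        // Submit the job; the framework splits it into map and reduce tasks (the "job-parts").
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}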
Introduction to Hive
Apache Hive is a data warehouse tool built on top of the Hadoop framework. It allows users
to manage and analyze large datasets stored in HDFS (Hadoop Distributed File System)
using a SQL-like language called HiveQL.

Features of Hive

• SQL-like Language: Hive uses HiveQL, which is similar to SQL, making it easy for
people with database experience to use.

• Data Warehousing: Helps in summarizing, querying, and analyzing large-scale data.

• Scalable: Can handle petabytes of data using Hadoop's power.

• Flexible Storage: Works with various file formats like Text, ORC, Parquet, etc.

• Batch Processing: Best for batch jobs, not real-time querying.

How Hive Works

1. The user writes a HiveQL query.

2. Hive converts the query into MapReduce, Tez, or Spark jobs.

3. These jobs are then executed over data stored in HDFS.

4. The final result is returned to the user.
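
This flow can also be driven from Java over JDBC. The sketch below assumes a running HiveServer2 instance; the host, port, credentials, and table name are assumptions, not details from these notes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Older driver versions need explicit registration; newer ones register automatically.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // JDBC URL for HiveServer2; host and port are assumptions.
        String url = "jdbc:hive2://hive-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL query over a hypothetical table; Hive translates it into
            // MapReduce, Tez, or Spark jobs over data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) FROM employees GROUP BY department");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}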

Hive Components

• Metastore: Stores metadata (table names, columns, partitions).

• Driver: Manages the lifecycle of a query.

• Compiler: Converts HiveQL into executable jobs.

• Execution Engine: Runs the jobs on the Hadoop cluster.


Hive vs Pig: Key Differences
• Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin (a procedural language).
• User Base: Hive targets SQL developers and analysts; Pig targets programmers and data
engineers.
• Data Flow Type: Hive is declarative; Pig is procedural.
• Ease of Use: Hive is easier for those with an SQL background; Pig is more flexible for
complex data transformations.
• Schema Requirement: Hive is schema-based (structured data); in Pig the schema is
optional (semi-structured/unstructured data).
• Execution Engine: Both run on MapReduce, Tez, or Spark.
• Use Case Focus: Hive suits data warehousing, reporting, and analytics; Pig suits ETL
processes and data transformation.
• Performance Tuning: Hive offers limited tuning (mostly automatic optimization); Pig
gives more control through custom UDFs and scripts.
• Data Format Support: Hive works well with ORC, Parquet, Avro, etc.; Pig handles JSON,
XML, CSV, and custom formats.
• Integration with BI Tools: Hive integrates well (e.g., Tableau, Power BI); Pig's
integration is limited.
