Hadoop Notes
Introduction to Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation for
processing and storing large volumes of data. It is designed to scale from a single server to
thousands of machines, each offering local computation and storage. Hadoop enables
distributed processing of large datasets across clusters of computers using simple
programming models.
Key Features:
• Open-source and cost-effective.
• High fault tolerance.
• Scalable and flexible.
• Designed for Big Data.
Advantages of Using Hadoop
1. Scalability
Hadoop can easily scale up from a single server to thousands of machines, each offering
local storage and computation.
2. Cost-Effective
Since it uses commodity hardware (inexpensive machines), Hadoop is much cheaper than
traditional systems.
3. Fault Tolerance
If a machine fails, Hadoop automatically recovers the data using replicated copies stored on
other nodes.
4. High Availability
Data is replicated across multiple nodes, ensuring that it’s always available, even during
failures.
5. Flexibility
Hadoop can store and process various types of data: structured, semi-structured, and
unstructured (like text, images, videos).
6. Speed
With parallel processing and local data computation, Hadoop can process large volumes of
data quickly.
7. Open Source
It’s free to use and has a large community for support, updates, and improvements.
8. Support for Big Data Tools
Hadoop works well with many other big data tools and ecosystems like Hive, Pig, Spark,
HBase, and more.
Hadoop Architecture
Hadoop has a Master-Slave architecture composed of the following components:
• HDFS (Hadoop Distributed File System): For storage.
• MapReduce: For processing data.
• YARN (Yet Another Resource Negotiator): Manages resources and job scheduling.
Main Nodes:
• NameNode: Master node that manages metadata and file directory structure.
• DataNodes: Slave nodes that store actual data blocks.
• ResourceManager: Manages system resources and job allocation.
• NodeManagers: Execute tasks on slave nodes.
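To make the master nodes concrete, the sketch below shows how a Hadoop client configuration
names the NameNode and the ResourceManager. The host names, port, and replication value are
placeholders rather than values from these notes; the property keys (fs.defaultFS,
yarn.resourcemanager.hostname, dfs.replication) are standard Hadoop configuration properties.

import org.apache.hadoop.conf.Configuration;

public class ClusterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // HDFS: clients and DataNodes locate the NameNode (master) through fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");             // placeholder host/port

        // YARN: NodeManagers and clients locate the ResourceManager (master) here.
        conf.set("yarn.resourcemanager.hostname", "resourcemanager"); // placeholder host

        // Number of replicas HDFS keeps of each data block (3 is the usual default).
        conf.set("dfs.replication", "3");

        System.out.println("NameNode URI: " + conf.get("fs.defaultFS"));
    }
}

In a real cluster these values normally live in core-site.xml, yarn-site.xml, and hdfs-site.xml
rather than in code; setting them programmatically here only keeps the example self-contained.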
Hadoop Distributed File System (HDFS)
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. It works by rapidly transferring data between nodes and is often used by
organizations that need to handle and store big data. HDFS is a key component of many
Hadoop systems because it provides a means of managing big data and supports big data
analytics.
1. NameNode (Master)
• Stores metadata: file names, permissions, and locations of file blocks.
• Manages the namespace of the file system.
• Does not store the actual data.
2. DataNodes (Slaves)
• Store actual data blocks of files.
• Periodically send heartbeats to the NameNode to report status.
• Serve read and write requests from clients.
3. Secondary NameNode
• Helps the NameNode by merging the edit logs and fsimage (file system snapshot).
• It is not a backup for the NameNode.
• It reduces the workload on the NameNode.
Read Process:
• Client requests file → NameNode returns block locations.
• Client reads blocks directly from the DataNodes.
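A minimal Java sketch of this read path is shown below, using the standard Hadoop FileSystem
API. The NameNode address and the file path are hypothetical; open() contacts the NameNode for
block locations and then streams the blocks directly from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");         // hypothetical file

        // open() asks the NameNode for the block locations, then the returned
        // stream reads each block directly from the DataNodes holding it.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}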
Features:
• Distributed data storage.
• Files are split into large blocks, which reduces seek time.
• The data is highly available because the same block is stored on multiple DataNodes (see
the sketch after this list).
• Even if several DataNodes are down, the data can still be read from other replicas, making
the system highly reliable.
• High fault tolerance.
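The replication behind these features can be inspected from client code. The sketch below
(with a placeholder NameNode address and a hypothetical file) lists each block of a file
together with the DataNodes that hold its replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");         // hypothetical file
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one block of the file plus the DataNodes that
        // store its replicas; losing one of those hosts does not lose the block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}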
What is MapReduce?
MapReduce is a programming model used in Hadoop for processing large datasets in
parallel across a distributed cluster. It has two main phases:
• Map Phase
• Reduce Phase
Each phase has a specific task in breaking down and combining data.
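The classic word-count example below sketches the two phases in Java. The class and variable
names are illustrative; the Mapper and Reducer base classes are the standard Hadoop MapReduce
API.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every word in its input split, emit the pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum all the 1s emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Between the two phases, the framework shuffles and sorts the mapper output so that all values
for the same key arrive at the same reducer.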
Architecture of MapReduce
1. Client: The MapReduce client submits the job to MapReduce for processing (a driver
sketch follows this list). There can be multiple clients that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The actual work the client wants to do, made up of many smaller tasks that the
client wants to process or execute.
3. Hadoop MapReduce Master: Divides the job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all
the job-parts are combined to produce the final output.
5. Input Data: The data set fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
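A minimal driver sketch for the client side is shown below. It assumes the WordCountMapper and
WordCountReducer classes from the earlier sketch, and it takes the input and output HDFS paths
from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The client describes the Job and submits it; the framework splits it into
// map and reduce tasks (the "job-parts") and combines their results.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // Map phase
        job.setReducerClass(WordCountReducer.class);  // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submit the job and wait until the final output has been produced.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}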
Introduction to Hive
Apache Hive is a data warehouse tool built on top of the Hadoop framework. It allows users
to manage and analyze large datasets stored in HDFS (Hadoop Distributed File System)
using a SQL-like language called HiveQL.
Features of Hive
• SQL-like Language: Hive uses HiveQL, which is similar to SQL, making it easy for
people with database experience to use.
• Flexible Storage: Works with various file formats like Text, ORC, Parquet, etc.
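As an illustration of HiveQL, the sketch below runs a query through Hive's JDBC interface. The
HiveServer2 address, the user, and the sales table are hypothetical; the driver class and the
jdbc:hive2:// URL form are the standard Hive JDBC conventions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // Hive JDBC driver

        // Placeholder HiveServer2 host, port, and database.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; Hive executes it over data stored in HDFS.
            // 'sales' is a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product");
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("total"));
            }
        }
    }
}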
Hive Components