Big Data Hadoop
1. Data Ingestion
For ingesting large volumes of data from various sources, Apache Flume or Apache Kafka can be used. Flume
is designed for collecting and aggregating large amounts of log data, while Kafka is a distributed streaming
platform that can handle real-time data feeds. Both tools ensure that data is ingested efficiently and can be
processed in real-time or batch modes.
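As an illustration, the following is a minimal Python sketch of publishing log events to Kafka, assuming the kafka-python client and a broker reachable at localhost:9092; the topic name "web-logs" is purely illustrative.

    # Publish structured log events to a Kafka topic for downstream processing.
    from kafka import KafkaProducer
    import json

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Each event becomes one message on the "web-logs" topic.
    event = {"host": "web01", "status": 200, "path": "/index.html"}
    producer.send("web-logs", value=event)
    producer.flush()

Downstream consumers (a Spark job, a Flume agent, or a custom loader) can then read from this topic either continuously or in scheduled batches.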
2. Storage
Once the data is ingested, it needs to be stored in a distributed manner. HDFS is the backbone of the Hadoop
ecosystem, providing a reliable and scalable storage solution. It allows for the storage of large files across
multiple machines, ensuring fault tolerance and high availability. HDFS is optimized for high-throughput
access to application data, making it suitable for big data applications. Because HDFS is tuned for batch processing, it is less efficient for real-time data owing to the time required to write and read large files, so it is an ideal choice when the dataset is not real-time data.
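As a small sketch, ingested batch files can be landed in HDFS with the standard hdfs dfs commands, driven here from Python via subprocess; the local file name and the HDFS paths are illustrative.

    # Copy a local batch file into HDFS so that cluster jobs can read it.
    import subprocess

    # Create the target directory (no error if it already exists).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw/logs"], check=True)

    # Upload the file, overwriting any previous copy.
    subprocess.run(["hdfs", "dfs", "-put", "-f", "logs_2024-01-01.txt", "/data/raw/logs/"], check=True)

    # Confirm the file is visible to downstream jobs.
    subprocess.run(["hdfs", "dfs", "-ls", "/data/raw/logs"], check=True)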
3. Data Processing
For data processing, YARN acts as the resource management layer of Hadoop, scheduling tasks and allocating resources based on the needs of each component so that multiple data processing engines can run on the same cluster. YARN is highly effective for large batch jobs, but real-time workloads require tighter integration with streaming frameworks such as Apache Kafka.
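For a quick view of what YARN is scheduling, the standard yarn command-line tool can be queried; the sketch below simply wraps it from Python for consistency with the other examples.

    # Inspect the applications and nodes managed by the ResourceManager.
    import subprocess

    # List applications currently running on the cluster.
    subprocess.run(["yarn", "application", "-list", "-appStates", "RUNNING"], check=True)

    # List the NodeManagers and the number of containers each is running.
    subprocess.run(["yarn", "node", "-list"], check=True)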
MapReduce is a programming model that enables the processing of large datasets in parallel across a
distributed cluster. It is particularly effective for batch processing tasks where data is processed in large
chunks, making it suitable for our use case, where real-time processing is not a requirement. It is, however, slower than in-memory engines due to its disk-based processing.
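A classic illustration is a word count written for Hadoop Streaming, which allows the mapper and reducer to be plain Python scripts that read from standard input; the file names below are illustrative.

    # mapper.py - emit a (word, 1) pair for every word in the input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sum the counts per word; Hadoop delivers the mapper
    # output sorted by key, so all counts for a word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The two scripts would be submitted with the hadoop-streaming jar, passing -mapper mapper.py and -reducer reducer.py together with the HDFS input and output paths.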
Apache Spark can also be used for data processing, especially when low-latency processing is required. Spark
provides in-memory processing capabilities, which can significantly speed up data processing tasks compared
to traditional MapReduce. This speed comes at the cost of additional memory resources, which can make Spark costlier in large-scale scenarios.
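The same word count expressed in PySpark keeps the intermediate data in memory; this sketch assumes Spark is available on the cluster and that the HDFS paths shown exist.

    # Count words across all files in an HDFS directory using Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.read.text("hdfs:///data/raw/logs")   # one row per line, column "value"
    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///data/out/wordcount")

    spark.stop()

On a Hadoop cluster the script would normally be submitted to YARN with spark-submit --master yarn, which is where the additional executor memory shows up as a cost.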
4. Data Analysis
Hive provides an SQL-like interface for querying and managing large datasets stored in HDFS. It is
suitable for users who are familiar with SQL and want to perform data analysis without writing
complex MapReduce code. It is ideal for batch processing and is well-suited for data analysis tasks,
making it a great choice when complex queries must be run on large datasets. Hive's batch-oriented nature, however, makes it slower for real-time analytics.
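As a hedged example, a Hive query can be issued from Python through HiveServer2 using the PyHive package; the page_views table and the connection details are assumptions made for illustration.

    # Run an aggregate HiveQL query instead of writing MapReduce by hand.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT status, COUNT(*) AS hits
        FROM page_views
        GROUP BY status
        ORDER BY hits DESC
    """)
    for status, hits in cursor.fetchall():
        print(status, hits)

    conn.close()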
Pig is a high-level platform for creating programs that run on Hadoop. It uses a language called Pig
Latin, which is designed to handle data transformations and analysis in a more procedural way than
Hive. This makes it an excellent choice for ETL (Extract, Transform, Load) processes within the data
pipeline. Like Hive, Pig is best suited to batch workloads rather than real-time processing.
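A small ETL sketch in Pig Latin is shown below, written out from Python and executed with the pig command-line tool; the input layout and the paths are illustrative.

    # Filter and aggregate raw log records with a generated Pig script.
    import subprocess

    script = """
    raw = LOAD '/data/raw/logs' USING PigStorage('\\t')
          AS (host:chararray, status:int, path:chararray);
    ok_only = FILTER raw BY status == 200;
    by_host = GROUP ok_only BY host;
    hits = FOREACH by_host GENERATE group AS host, COUNT(ok_only) AS n;
    STORE hits INTO '/data/clean/hits_by_host';
    """

    with open("clean.pig", "w") as f:
        f.write(script)

    # Run the script on the cluster (MapReduce mode is the default).
    subprocess.run(["pig", "clean.pig"], check=True)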
5. Data Visualization and Reporting
For visualizing the results of the data analysis, Apache Superset or Tableau can be integrated. These tools
allow users to create interactive dashboards and reports, making it easier to derive insights from the data.
6. Workflow Management
To manage the workflow of the entire data pipeline, Apache Oozie can be used. Oozie is a workflow
scheduler system that allows users to define complex data processing workflows, ensuring that tasks are
executed in the correct order and managing dependencies between different components.
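As a sketch, a minimal Oozie workflow that runs the Pig step above could be defined as follows; the workflow name, the ${jobTracker} and ${nameNode} parameters, and the launch command are illustrative, and the workflow.xml would normally be uploaded to HDFS before being submitted with the oozie job command.

    # Generate a minimal workflow.xml with a single Pig action.
    workflow = """<workflow-app name="log-etl" xmlns="uri:oozie:workflow:0.5">
        <start to="clean-logs"/>
        <action name="clean-logs">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clean.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig step failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>
    """

    with open("workflow.xml", "w") as f:
        f.write(workflow)

    # Typical launch, run on an edge node after copying the files to HDFS:
    #   oozie job -oozie http://localhost:11000/oozie -config job.properties -run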
In conclusion, a robust Hadoop architecture for processing and analysing large datasets can be built using
HDFS for storage, YARN for resource management, MapReduce for batch processing, Spark for advanced
processing, Hive for querying, and Pig for data transformations. Each component plays a crucial role in
ensuring that data is ingested, stored, processed, analysed, and visualized effectively, catering to the needs of
big data applications.