Name: Group 3&4

Course: Big Data


Hadoop is an open-source framework designed for processing and storing large datasets across clusters of
computers using simple programming models. It is a powerful tool for big data analytics. A robust Hadoop
architecture for processing and analysing large datasets involves leveraging various components of the
Hadoop ecosystem. Below are the phases of the data pipeline and the justification for the choice of specific Hadoop components at each phase.

1. Data Ingestion

Component: Apache Flume or Apache Kafka.

For ingesting large volumes of data from various sources, Apache Flume or Apache Kafka can be used. Flume
is designed for collecting and aggregating large amounts of log data, while Kafka is a distributed streaming
platform that can handle real-time data feeds. Both tools ensure that data is ingested efficiently and can be
processed in real-time or batch modes.
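
As an illustration, below is a minimal sketch of pushing records into Kafka for ingestion. It assumes the kafka-python client package, a broker reachable at localhost:9092, and an illustrative topic name ("raw-logs"); these details would differ in a real deployment.

import json
from kafka import KafkaProducer

# Connect to the Kafka broker and serialize each record as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Send an event to the ingestion topic; send() is asynchronous,
# so flush() is called to make sure buffered messages are delivered.
producer.send("raw-logs", {"source": "web-server-01", "status": 200})
producer.flush()
producer.close()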

2. Storage

Component: HDFS (Hadoop Distributed File System).

Once the data is ingested, it needs to be stored in a distributed manner. HDFS is the backbone of the Hadoop
ecosystem, providing a reliable and scalable storage solution. It allows for the storage of large files across
multiple machines, ensuring fault tolerance and high availability. HDFS is optimized for high-throughput
access to application data, making it suitable for big data applications. While HDFS excels at batch processing, it is less efficient for real-time workloads because of the time required to write and read large files; it is therefore an ideal choice when the dataset is not real-time data.
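
As a sketch of this phase, the snippet below lands an ingested batch file in HDFS over WebHDFS. It assumes the hdfs Python package (a WebHDFS client), a NameNode HTTP endpoint at namenode:9870, and illustrative paths and user name.

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Create a raw-data directory and upload a local batch file into it;
# HDFS transparently splits the file into replicated blocks.
client.makedirs("/data/raw")
client.upload("/data/raw/events.csv", "events.csv")

# List the directory to confirm the file is visible in the cluster.
print(client.list("/data/raw"))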

3. Data Processing

Component: YARN (Yet Another Resource Negotiator) and MapReduce/Spark.

For data processing, YARN acts as the resource management layer of Hadoop – scheduling tasks efficiently
and allocating resources based on the needs of each component, allowing multiple data processing engines to
run and manage resources efficiently. YARN is highly effective for large batch jobs, but real-time data requires tighter integration with streaming platforms such as Apache Kafka.

MapReduce is a programming model that enables the processing of large datasets in parallel across a
distributed cluster. It is particularly effective for batch processing tasks where data is processed in large
chunks, making it suitable for our use case, where real-time processing is not a requirement. It is, however, slower than in-memory engines because of its disk-based processing.
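
To make the model concrete, here is a minimal word-count sketch written for Hadoop Streaming, where the mapper and reducer are plain Python scripts reading stdin and writing stdout. The script name, the input and output paths, and the streaming jar location in the comment are illustrative.

# Submitted with the streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw -output /data/wordcounts \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit one (word, 1) pair per word; Hadoop sorts the pairs by key
    # before they reach the reducer.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so counts can be accumulated per key.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()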

Apache Spark can also be used for data processing, especially when low-latency processing is required. Spark provides in-memory processing capabilities, which can significantly speed up data processing tasks compared to traditional MapReduce. Spark’s in-memory processing offers speed but may require more memory resources, making it costlier in large-scale scenarios.
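
As a sketch of the in-memory alternative, the PySpark job below reads the ingested data from HDFS, aggregates it, and writes the result back as Parquet. The column and path names are illustrative, and the job is assumed to be submitted with spark-submit on a YARN-managed cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# spark-submit normally supplies the master (e.g. YARN) and resource settings.
spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Read the raw events from HDFS, aggregate them in memory,
# and persist the result back to HDFS in a columnar format.
events = spark.read.option("header", True).csv("hdfs:///data/raw/events.csv")
counts = events.groupBy("status").agg(F.count("*").alias("hits"))
counts.write.mode("overwrite").parquet("hdfs:///data/processed/status_counts")

spark.stop()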

4. Data Analysis

Component: Hive or Pig.


For analysing the processed data, Apache Hive or Apache Pig can be employed:

•	Hive provides an SQL-like interface for querying and managing large datasets stored in HDFS. It is suitable for users who are familiar with SQL and want to perform data analysis without writing complex MapReduce code. It is ideal for batch processing and data analysis tasks, making it a great choice when complex queries must be run on large datasets (a query sketch follows this list). Hive’s batch-oriented nature, however, makes it slower for real-time analytics.
•	Pig is a high-level platform for creating programs that run on Hadoop. It uses a language called Pig Latin, which is designed to handle data transformations and analysis in a more procedural way than Hive. This makes it an excellent choice for ETL (Extract, Transform, Load) processes within the data pipeline. Like Hive, Pig is best suited to batch data.
5. Data Visualization and Reporting

Component: Apache Superset or Tableau.

For visualizing the results of the data analysis, Apache Superset or Tableau can be integrated. These tools
allow users to create interactive dashboards and reports, making it easier to derive insights from the data.

6. Workflow Management

Component: Apache Oozie

To manage the workflow of the entire data pipeline, Apache Oozie can be used. Oozie is a workflow
scheduler system that allows users to define complex data processing workflows, ensuring that tasks are
executed in the correct order and managing dependencies between different components.

In conclusion, a robust Hadoop architecture for processing and analysing large datasets can be built using
HDFS for storage, YARN for resource management, MapReduce for batch processing, Spark for advanced
processing, Hive for querying, and Pig for data transformations. Each component plays a crucial role in
ensuring that data is ingested, stored, processed, analysed, and visualized effectively, catering to the needs of
big data applications.