
Week 9

Hadoop 1.0 had limitations including a single point of failure and scalability restrictions. Hadoop 2.0 improved on this with the inclusion of YARN, which removed resource management from MapReduce, improved scalability and fault tolerance, and allowed additional data processing frameworks to run alongside MapReduce. Spark can also have limitations, such as complexity when using advanced capabilities, potential memory issues when handling large datasets, and added integration complexity when used alongside Hadoop.


1. Hadoop's Limitations:

a. Complexity and Scalability: Setting up and maintaining Hadoop can be complicated, posing a substantial barrier for smaller enterprises. This complexity derives from the need to configure and manage many components, including the Hadoop Distributed File System (HDFS), MapReduce, and the related ecosystem tools. The platform comprises numerous software layers and can be resource-intensive (AltexSoft). Additionally, for smaller firms the expense and labor needed to manage Hadoop clusters may not be justified, given their modest data demands (Murthy, 2017).

b. Latency: Hadoop's core processing framework, MapReduce, is built for batch processing. While it is excellent for managing massive datasets, it is not ideal for real-time or near-real-time data processing. The batch-oriented nature of MapReduce can result in high latency, making Hadoop less appropriate for applications that demand immediate data insights (AltexSoft). This constraint can affect applications where quick decision-making is critical, such as fraud detection or recommendation systems (All You Need to Know About Big Data Analytics, n.d.).
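To make the batch-processing point concrete, here is a minimal sketch of the classic word count job written for Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer. The file names mapper.py and reducer.py are illustrative assumptions, not from the sources above.

```python
#!/usr/bin/env python3
# mapper.py -- emits one (word, 1) pair per input line for Hadoop Streaming.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming delivers keys sorted, so equal words arrive
# consecutively and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Even this trivial job runs as a full batch: results appear only after the last input split has been read, shuffled, and reduced, which is exactly where the latency comes from.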

c. Storage Overheads: Hadoop's HDFS is designed for large files and is less efficient at managing many small files. HDFS replicates data across numerous nodes for fault tolerance, adding storage overhead when dealing with small files (AltexSoft). This can be a drawback for enterprises with varied data storage demands, as it may lead to inefficient disk utilization (All You Need to Know About Big Data Analytics, n.d.).
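As a rough back-of-the-envelope illustration of the small-files problem, the sketch below assumes HDFS's default replication factor of 3 and the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file or block object; both figures are assumptions for illustration, not taken from the sources above.

```python
# Back-of-the-envelope sketch of HDFS small-file overhead.

REPLICATION = 3                   # HDFS default replication factor
NAMENODE_BYTES_PER_OBJECT = 150   # rough heap cost per file or block object


def hdfs_footprint(num_files: int, avg_file_bytes: int) -> tuple[int, int]:
    """Return (raw disk bytes, approximate NameNode heap bytes)."""
    disk = num_files * avg_file_bytes * REPLICATION
    # Each small file occupies at least one block, so it costs one file
    # object plus one block object in NameNode memory.
    heap = num_files * 2 * NAMENODE_BYTES_PER_OBJECT
    return disk, heap


# Ten million 10 KB files: ~300 GB on disk, but ~3 GB of NameNode heap,
# versus a handful of large files holding the same data.
disk, heap = hdfs_footprint(10_000_000, 10 * 1024)
print(f"disk: {disk / 1e9:.0f} GB, namenode heap: {heap / 1e9:.1f} GB")
```

The disk cost is the same as for large files, but the NameNode memory cost grows with the number of files rather than the amount of data, which is why many small files strain the cluster.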

d. High Disk I/O: Hadoop depends extensively on disk I/O for storing and retrieving data. While this approach provides durability and fault tolerance, it can become a bottleneck for certain workloads: each MapReduce stage materializes its intermediate results to disk, which limits the pace of data processing and can impair Hadoop's overall performance (AltexSoft). In circumstances where low-latency data access is necessary, this constraint becomes more obvious (All You Need to Know About Big Data Analytics, n.d.).
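The toy sketch below, a deliberately simplified analogy rather than real MapReduce, contrasts materializing an intermediate result to disk with keeping it in memory; the data sizes and the string-doubling "transformation" are arbitrary assumptions for illustration.

```python
# Toy illustration of why per-stage disk materialization costs more than
# keeping intermediates in memory (the essence of Hadoop's disk I/O overhead).
import os
import tempfile
import time

data = [str(i) for i in range(2_000_000)]

# "MapReduce style": write the intermediate result to disk, then read it back.
start = time.perf_counter()
with tempfile.NamedTemporaryFile("w+", delete=False) as f:
    f.write("\n".join(x * 2 for x in data))
    path = f.name
with open(path) as f:
    doubled = f.read().split("\n")
os.remove(path)
disk_time = time.perf_counter() - start

# "Spark style": keep the intermediate result in memory.
start = time.perf_counter()
doubled_mem = [x * 2 for x in data]
mem_time = time.perf_counter() - start

print(f"via disk: {disk_time:.3f}s, in memory: {mem_time:.3f}s")
```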

2. Spark's Drawbacks:

a. Complexity: While Spark is a robust and adaptable data processing framework, its breadth of features can add complexity. Advanced capabilities such as Spark Streaming for real-time data and the machine learning libraries can be complicated to set up and operate, making Spark less accessible to novice users (AltexSoft). Organizations may need to invest in training or hire specialists to use these capabilities effectively (Murthy, 2017).
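For a sense of the moving parts involved, below is a minimal PySpark Structured Streaming word count. Even this small example introduces sources, output modes, and long-running queries that a plain batch job does not need; the socket source and the localhost:9999 address are illustrative assumptions.

```python
# Minimal Structured Streaming word count over a text socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Read a stream of lines from a socket (illustrative source).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and count occurrences across the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print updated counts until the query is stopped.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```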

b. Memory Consumption: Spark's in-memory processing is one of its primary benefits, dramatically improving data processing speed. However, this approach can lead to high memory use, particularly when processing large datasets or running complex computations. Where memory resources are limited, the data may not fit fully in memory, forcing Spark to spill data between RAM and disk and hurting performance (AltexSoft). This can be a problem for businesses with limited memory resources (Ahmed et al., 2020).
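One common mitigation is to choose a storage level that allows spilling. The sketch below shows this for an RDD; the input path is a hypothetical example.

```python
# Caching with spill-to-disk instead of the memory-only default for RDDs.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path, used only for illustration.
rdd = spark.sparkContext.textFile("hdfs:///data/events.txt")

# MEMORY_AND_DISK keeps what fits in RAM and spills the remainder to local
# disk, trading some speed for stability on memory-constrained clusters.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())  # the first action materializes the cache
```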

c. Limited Hadoop Ecosystem Integration: Spark can run alongside Hadoop, but it does not fully replace it in every case. This means enterprises running mixed deployments must manage and integrate the two technologies. That integration adds complexity: it requires coordination between the systems, data transfer between HDFS and Spark, and, in some cases, operating and monitoring several clusters (AltexSoft). This complexity raises operational overhead for enterprises (All You Need to Know About Big Data Analytics, n.d.).
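A typical seam between the two systems is Spark reading data that other Hadoop tools produced in HDFS and writing results back for them. The namenode host, port, and paths below are illustrative assumptions.

```python
# Sketch of Hadoop/Spark integration via HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-integration").getOrCreate()

# Read raw text that a MapReduce job (or any other tool) left in HDFS...
logs = spark.read.text("hdfs://namenode:8020/data/raw/logs")

# ...process it in Spark, then write results back for downstream Hadoop jobs.
errors = logs.filter(logs.value.contains("ERROR"))
errors.write.mode("overwrite").text("hdfs://namenode:8020/data/errors")
```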

3. Difference Between Hadoop 1.0 and 2.0:

• Hadoop 1.0: The first version of Hadoop principally comprised the Hadoop Distributed File System (HDFS) for storage and the MapReduce processing engine. While it was innovative in managing enormous datasets, it had drawbacks, such as a single point of failure in the JobTracker, the daemon responsible for both resource management and task scheduling, and restrictions on scalability (In-Depth Big Data Framework Comparison, n.d.).

• Hadoop 2.0: Hadoop 2.0 introduced important advancements, most notably Yet Another Resource Negotiator (YARN). YARN separated resource management and task scheduling from MapReduce. This major improvement allowed Hadoop to run alternative data processing frameworks alongside MapReduce (see the sketch below), and it enhanced scalability, fault tolerance, and the ability to handle multiple workloads simultaneously. In summary, Hadoop 2.0 made Hadoop a more versatile and capable data processing platform (In-Depth Big Data Framework Comparison, n.d.).
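To illustrate what YARN enables, the sketch below submits a Spark application to a YARN-managed cluster, where it shares resources with MapReduce and other frameworks. It assumes the Hadoop client configuration is available on the submitting machine, and the executor counts and memory figures are illustrative assumptions, not recommendations.

```python
# Running Spark on a YARN cluster instead of a standalone one.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")                         # let YARN schedule the job
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "2g")
         .getOrCreate())

# A trivial distributed computation, scheduled by YARN across the cluster.
print(spark.range(1_000_000).count())
```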
References:

• Hadoop vs Spark: Main Big Data Tools Explained. (n.d.). AltexSoft. https://www.altexsoft.com/blog/hadoop-vs-spark/
• Murthy, A. (2017). Big Data Analysis Using Hadoop and Spark. https://digitalcommons.memphis.edu/cgi/viewcontent.cgi?article=2808&context=etd
• Ahmed, N., Barczak, A. L. C., Susnjak, T., & Rashid, M. A. (2020). A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench. Journal of Big Data, 7(1). https://doi.org/10.1186/s40537-020-00388-5
• Spark Vs Hadoop: All You Need to Know About Big Data Analytics. (n.d.). Veritis. https://www.veritis.com/blog/hadoop-vs-spark-all-you-need-to-know-about-big-data-analytics/
• Hadoop vs. Spark: In-Depth Big Data Framework Comparison. (n.d.). SearchDataManagement. https://www.techtarget.com/searchdatamanagement/feature/Hadoop-vs-Spark-Comparing-the-two-big-data-frameworks
