Week 9
1. Hadoop's Limitations:
b. Latency: Hadoop's core processing framework, MapReduce, is built for batch
processing. While it is excellent at managing massive datasets, it is not well suited to
real-time or near-real-time data processing. The batch-oriented nature of MapReduce can
result in high latency, making Hadoop less appropriate for applications that demand
immediate data insights (AltexSoft). This constraint may affect applications where quick
decision-making is critical, such as fraud detection or recommendation systems (All You
Need to Know About Big Data Analytics, n.d.).
c. Storage Overheads: Hadoop's HDFS is designed for large files and is less efficient when
managing many small files. HDFS replicates data across multiple nodes for fault
tolerance, which adds storage overhead when dealing with small files (AltexSoft); a rough
estimate of this overhead is sketched after this list. This can be a drawback for
enterprises with varied data storage demands, as it may lead to wasteful disk utilization
(All You Need to Know About Big Data Analytics, n.d.).
d. High Disk I/O: Hadoop depends extensively on disk I/O for storing and retrieving
data. While this approach provides durability and fault tolerance, it can become a
bottleneck for certain workloads. Disk I/O may limit the pace of data processing and
impair Hadoop's overall performance (AltexSoft). This constraint becomes more apparent in
circumstances where low-latency data access is necessary (All You Need to Know About Big
Data Analytics, n.d.); the caching sketch after this list shows how in-memory engines
sidestep repeated disk reads.
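
To make the small-file overhead in item c concrete, the sketch below estimates disk and NameNode memory costs for many small files. The 3x replication factor is HDFS's default, and the roughly 150 bytes of NameNode heap per file or block object is a commonly cited rule of thumb; both figures are assumptions here rather than exact values for any particular cluster.

```python
# Back-of-the-envelope estimate of HDFS overhead for many small files.
# Assumptions: default replication factor of 3, and ~150 bytes of
# NameNode heap per file object and per block object (rule of thumb).

REPLICATION_FACTOR = 3
NAMENODE_BYTES_PER_OBJECT = 150  # approximate; varies by Hadoop version

def hdfs_overhead(num_files: int, avg_file_mb: float) -> None:
    raw_mb = num_files * avg_file_mb
    stored_mb = raw_mb * REPLICATION_FACTOR
    # Each file smaller than one block still costs one file object plus
    # one block object in NameNode memory.
    namenode_mb = num_files * 2 * NAMENODE_BYTES_PER_OBJECT / 1e6
    print(f"{num_files:,} files of {avg_file_mb} MB:")
    print(f"  raw data:          {raw_mb / 1e3:,.1f} GB")
    print(f"  disk after 3x:     {stored_mb / 1e3:,.1f} GB")
    print(f"  NameNode memory:   {namenode_mb:,.1f} MB")

# NameNode memory scales with the number of objects, not bytes stored,
# so millions of tiny files can exhaust it long before disks fill up.
hdfs_overhead(10_000_000, 1.0)
```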
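The disk I/O bottleneck in item d is one reason in-memory engines such as Spark gained ground. The PySpark sketch below (the input path is an illustrative assumption) reads a dataset once, caches it in executor memory, and reuses it across two queries, avoiding the repeated disk reads that a pure MapReduce pipeline would incur between stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Hypothetical input path; any large text dataset would do.
logs = spark.read.text("hdfs:///data/server_logs")

# cache() keeps the data in executor memory after the first action,
# so the second query does not re-read everything from disk.
logs.cache()

errors = logs.filter(logs.value.contains("ERROR")).count()   # reads from disk once
warnings = logs.filter(logs.value.contains("WARN")).count()  # served from memory

print(f"errors={errors}, warnings={warnings}")
spark.stop()
```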
2. Spark's Drawbacks:
a. Complexity: While Spark is a robust and adaptable data processing framework, its
breadth of features can add complexity. Advanced capabilities such as Spark Streaming for
real-time data and the machine learning libraries can be complicated to set up and
operate, making Spark less accessible to novice users (AltexSoft); even the minimal
streaming example after this list involves several moving parts. Organizations may need
to invest in training or hire specialists to use these capabilities successfully (Murthy, 2017).
c. Limited Hadoop Ecosystem Integration: Spark can work with Hadoop, but it does not
fully replace Hadoop in every case. This means enterprises running mixed systems
with both Hadoop and Spark must manage and integrate the two technologies. This
integration can add complexity, as it requires coordination between the systems, data
transfer between HDFS and Spark (sketched after this list), and sometimes operating and
monitoring separate clusters (AltexSoft). This complexity can raise operational overhead,
making adoption less smooth for enterprises (All You Need to Know About Big Data
Analytics, n.d.).
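
As a taste of the complexity mentioned in item a, here is a minimal PySpark Structured Streaming word count. Even this small sketch requires a streaming source, an aggregation, an output mode, and an explicit termination call; the socket host and port are illustrative assumptions (the socket source is intended for testing, not production).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Test-only socket source; assumes `nc -lk 9999` is feeding text lines.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and count occurrences across the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```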
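The HDFS-to-Spark data movement in item c typically looks like the sketch below; the NameNode address, paths, and column names are assumptions for illustration. Spark reads and writes HDFS URIs directly, but operating this pipeline means keeping HDFS (and often YARN) healthy alongside Spark itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-integration").getOrCreate()

# Read input that an upstream Hadoop job left in HDFS (illustrative path).
sales = spark.read.csv("hdfs://namenode:8020/warehouse/sales.csv",
                       header=True, inferSchema=True)

# Aggregate in Spark...
totals = sales.groupBy("region").sum("amount")

# ...and write the result back to HDFS for downstream Hadoop tools.
totals.write.mode("overwrite").parquet("hdfs://namenode:8020/warehouse/sales_totals")

spark.stop()
```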
• Hadoop 1.0: The first version of Hadoop principally comprised the Hadoop
Distributed File System (HDFS) for storage and the MapReduce processing engine. While it
was groundbreaking in managing enormous datasets, it had drawbacks, such as a single point
of failure in the JobTracker and restrictions on scalability. The JobTracker was
responsible for both resource management and job scheduling (In-Depth Big Data Framework
Comparison, n.d.).
• Hadoop 2.0: Hadoop 2.0 introduced important advancements, most notably the inclusion of
Yet Another Resource Negotiator (YARN). YARN decoupled resource management and
job scheduling from MapReduce. This major improvement allowed Hadoop to
become more versatile and capable of running alternative data processing frameworks
alongside MapReduce. YARN improved scalability, fault tolerance, and the ability to
handle multiple workloads simultaneously; the sketch below shows a Spark application
submitted to a YARN cluster. In summary, Hadoop 2.0 made Hadoop a more versatile and
capable data processing platform (In-Depth Big Data Framework Comparison, n.d.).
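
To illustrate YARN hosting a non-MapReduce framework, this sketch submits a Spark job to a YARN cluster. It assumes HADOOP_CONF_DIR points at the cluster's configuration; the script name and the trivial computation are illustrative.

```python
# Launched with, e.g.:
#   spark-submit --master yarn --deploy-mode cluster yarn_demo.py
# YARN, not Spark's standalone manager, allocates the executors, so this
# job can share cluster resources with MapReduce jobs running alongside it.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-demo")
         .getOrCreate())

# Trivial distributed computation to confirm YARN granted the executors.
total = spark.sparkContext.parallelize(range(1_000_000)).sum()
print(f"sum = {total}")

spark.stop()
```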
References: