BigData-Assignment2-Last-CSP 554
This article compares the performance of Hadoop MapReduce and Apache Spark for processing
large-scale datasets. Spark has become popular for its in-memory processing, while Hadoop
relies heavily on disk operations. The authors show that tuning certain parameters can improve
performance when processing large datasets (up to 600 GB). The study selects two workloads,
WordCount and TeraSort, and evaluates performance in terms of execution time, throughput,
and speedup.
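The summary does not define these metrics explicitly, so the small Python sketch below reads them in the usual way, as an assumption: throughput as data processed per unit time, and speedup as the ratio of two execution times. The numbers in the example are made up for illustration.

# Sketch of the evaluation metrics as commonly defined (the article summary does
# not give formulas, so these are assumptions).
def throughput_gb_per_s(data_size_gb: float, execution_time_s: float) -> float:
    # Amount of data processed per second.
    return data_size_gb / execution_time_s

def speedup(baseline_time_s: float, compared_time_s: float) -> float:
    # How many times faster the compared run is than the baseline run.
    return baseline_time_s / compared_time_s

# Example with made-up numbers: a 600 GB run taking 3600 s vs. one taking 1800 s.
print(throughput_gb_per_s(600, 3600))  # ~0.167 GB/s
print(speedup(3600, 1800))             # 2.0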
Previous studies comparing Hadoop and Spark mostly focus on smaller datasets and workloads.
In those studies, Hadoop often performs more consistently on larger inputs (> 40 GB), while
Spark outperforms Hadoop by up to five times on iterative tasks (such as machine learning
algorithms) thanks to its in-memory processing capabilities (RDD caching).
Hadoop has two core components: HDFS and MapReduce. HDFS splits files into blocks and stores
them on DataNodes, while a NameNode keeps the metadata; all file operations go through these
two types of nodes. MapReduce processes the files in two phases: mappers transform the input
into key-value pairs, these pairs are shuffled, sorted, and sent to the reducers, which
aggregate them, and the final output is written back to HDFS.
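To make the mapper/reducer split concrete, here is a minimal WordCount sketch written as a Hadoop Streaming script in Python. This is my own illustration, not code from the article; the script name and the "map"/"reduce" mode argument are assumptions.

#!/usr/bin/env python3
# wordcount_streaming.py -- minimal Hadoop Streaming WordCount sketch.
# Run with argument "map" as the mapper and "reduce" as the reducer.
import sys

def mapper():
    # Map phase: emit one (word, 1) pair per word, tab-separated.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key after the shuffle,
    # so counts for consecutive identical words can be summed directly.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    reducer() if mode == "reduce" else mapper()

In a streaming job, this script would be passed as both the mapper ("wordcount_streaming.py map") and the reducer ("wordcount_streaming.py reduce"), with the shuffle and sort between them handled by the framework, as described above.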
Spark is built around two concepts: RDDs (Resilient Distributed Datasets) and DAGs (Directed
Acyclic Graphs). Spark can run on a Hadoop cluster, where RDDs are created from data in HDFS,
and the DAG scheduler manages the dependencies between RDDs by breaking the job into stages
and launching their tasks on the cluster. A DAG is created for both the map and reduce stages,
and intermediate results are kept in distributed memory, which is why Spark is faster on
smaller amounts of data.
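The RDD/DAG workflow can be illustrated with a minimal PySpark WordCount sketch (again my own illustration, not code from the article; the HDFS paths are placeholders).

from pyspark import SparkContext

# Minimal RDD-based WordCount sketch; input/output paths are placeholders.
sc = SparkContext(appName="WordCountSketch")

lines = sc.textFile("hdfs:///data/input")           # RDD backed by HDFS blocks
counts = (lines.flatMap(lambda l: l.split())        # map-side transformations
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))     # shuffle + reduce stage

counts.cache()                                       # keep results in memory for reuse
counts.saveAsTextFile("hdfs:///data/output")
sc.stop()

The two transformations before reduceByKey run in a single stage; reduceByKey introduces a shuffle boundary, which is exactly where the DAG scheduler splits the job into stages, with the intermediate results held in distributed memory.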
In the TeraSort tests with MapReduce, changing the shuffle parameters (Reduce_150 and
task.io_45) improves performance for data sizes up to 450 GB, reducing execution time by about
1%. For data sizes larger than 450 GB, however, the default parameters (Reduce_100 and
task.io_30) perform best.
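Assuming "Reduce_150" and "task.io_45" refer to the number of reduce tasks and the shuffle merge factor (the Hadoop properties mapreduce.job.reduces and mapreduce.task.io.sort.factor; this mapping is my assumption, not stated in the summary), a tuned TeraSort run could be launched from Python roughly as follows. The jar name and HDFS paths are placeholders.

import subprocess

# Launch TeraSort with tuned shuffle parameters (the property mapping is an assumption).
cmd = [
    "hadoop", "jar", "hadoop-mapreduce-examples.jar", "terasort",
    "-D", "mapreduce.job.reduces=150",         # "Reduce_150": 150 reduce tasks
    "-D", "mapreduce.task.io.sort.factor=45",  # "task.io_45": merge up to 45 spill files at once
    "hdfs:///terasort/input",                  # placeholder input path
    "hdfs:///terasort/output",                 # placeholder output path
]
subprocess.run(cmd, check=True)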
For Spark, increasing the input block size to 1024 MB improves performance by up to 4% for data
sizes over 500 GB, suggesting that larger block sizes are more efficient for very large datasets.
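The summary does not name the exact property the authors tuned, so the sketch below shows one plausible way to request roughly 1024 MB input splits in Spark, by forwarding Hadoop input settings through the spark.hadoop.* prefix; the chosen properties and the path are assumptions.

from pyspark import SparkConf, SparkContext

# Sketch: ask for ~1024 MB input splits when reading from HDFS.
# The spark.hadoop.* prefix forwards these values to the underlying Hadoop
# input format; the specific properties are assumptions, not from the article.
one_gib = str(1024 * 1024 * 1024)
conf = (SparkConf()
        .setAppName("LargeBlockWordCount")
        .set("spark.hadoop.dfs.blocksize", one_gib)
        .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", one_gib))

sc = SparkContext(conf=conf)
lines = sc.textFile("hdfs:///data/input")    # placeholder path
print(lines.getNumPartitions())              # fewer, larger partitions than the default
sc.stop()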
The results also show that Spark outperforms MapReduce by more than two times on the WordCount
workload when the data size is over 300 GB; for smaller datasets, Spark is up to ten times
faster.