BigData-Assignment2-Last-CSP 554
This article compares the performance of Hadoop MapReduce and Apache Spark for processing
large-scale datasets. Spark has become popular for its in-memory processing, while Hadoop
relies heavily on disk operations. The authors show that tuning certain parameters can improve
performance when processing large datasets (up to 600 GB). The study selects two workloads,
WordCount and TeraSort, and evaluates performance in terms of execution time, throughput,
and speedup.
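The summary does not define these metrics explicitly, so the small Python sketch below reads them in the usual way, as an assumption: throughput as data processed per unit time, and speedup as the ratio of two execution times. The numbers in the example are made up for illustration.

# Sketch of the evaluation metrics as commonly defined (the article summary does
# not give formulas, so these are assumptions).
def throughput_gb_per_s(data_size_gb: float, execution_time_s: float) -> float:
    # Amount of data processed per second.
    return data_size_gb / execution_time_s

def speedup(baseline_time_s: float, compared_time_s: float) -> float:
    # How many times faster the compared run is than the baseline run.
    return baseline_time_s / compared_time_s

# Example with made-up numbers: a 600 GB run taking 3600 s vs. one taking 1800 s.
print(throughput_gb_per_s(600, 3600))  # ~0.167 GB/s
print(speedup(3600, 1800))             # 2.0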
Previous studies comparing Hadoop and Spark mostly focus on smaller datasets and workloads.
In those studies, Hadoop often performs more consistently on larger inputs (> 40 GB), while
Spark outperforms Hadoop by up to five times on iterative tasks (such as machine learning
algorithms) thanks to its in-memory processing capabilities (RDD caching).
Hadoop has two core components: HDFS and MapReduce. HDFS splits files into blocks and stores
them on DataNodes, while a NameNode keeps the metadata; all file operations go through these
two types of nodes. MapReduce processes the files in two phases: mappers transform the input
into key-value pairs, these pairs are shuffled, sorted, and sent to the reducers, which
aggregate them, and the final output is written back to HDFS.
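To make the mapper/reducer split concrete, here is a minimal WordCount sketch written as a Hadoop Streaming script in Python. This is my own illustration, not code from the article; the script name and the "map"/"reduce" mode argument are assumptions.

#!/usr/bin/env python3
# wordcount_streaming.py -- minimal Hadoop Streaming WordCount sketch.
# Run with argument "map" as the mapper and "reduce" as the reducer.
import sys

def mapper():
    # Map phase: emit one (word, 1) pair per word, tab-separated.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key after the shuffle,
    # so counts for consecutive identical words can be summed directly.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    reducer() if mode == "reduce" else mapper()

In a streaming job, this script would be passed as both the mapper ("wordcount_streaming.py map") and the reducer ("wordcount_streaming.py reduce"), with the shuffle and sort between them handled by the framework, as described above.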
Spark is built around two concepts: RDDs (Resilient Distributed Datasets) and DAGs (Directed
Acyclic Graphs). Spark can run on a Hadoop cluster, where RDDs are created from data in HDFS,
and the DAG scheduler manages the dependencies between RDDs by breaking the job into stages
and launching their tasks on the cluster. A DAG is created for both the map and reduce stages,
and intermediate results are kept in distributed memory, which is why Spark is faster on
smaller amounts of data.
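The RDD/DAG workflow can be illustrated with a minimal PySpark WordCount sketch (again my own illustration, not code from the article; the HDFS paths are placeholders).

from pyspark import SparkContext

# Minimal RDD-based WordCount sketch; input/output paths are placeholders.
sc = SparkContext(appName="WordCountSketch")

lines = sc.textFile("hdfs:///data/input")           # RDD backed by HDFS blocks
counts = (lines.flatMap(lambda l: l.split())        # map-side transformations
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))     # shuffle + reduce stage

counts.cache()                                       # keep results in memory for reuse
counts.saveAsTextFile("hdfs:///data/output")
sc.stop()

The two transformations before reduceByKey run in a single stage; reduceByKey introduces a shuffle boundary, which is exactly where the DAG scheduler splits the job into stages, with the intermediate results held in distributed memory.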
In the TeraSort tests with MapReduce, changing the shuffle parameters (Reduce_150 and
task.io_45) improves performance for data sizes up to 450 GB, reducing execution time by about
1%. For data sizes larger than 450 GB, however, the default parameters (Reduce_100 and
task.io_30) perform best.
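Assuming "Reduce_150" and "task.io_45" refer to the number of reduce tasks and the shuffle merge factor (the Hadoop properties mapreduce.job.reduces and mapreduce.task.io.sort.factor; this mapping is my assumption, not stated in the summary), a tuned TeraSort run could be launched from Python roughly as follows. The jar name and HDFS paths are placeholders.

import subprocess

# Launch TeraSort with tuned shuffle parameters (the property mapping is an assumption).
cmd = [
    "hadoop", "jar", "hadoop-mapreduce-examples.jar", "terasort",
    "-D", "mapreduce.job.reduces=150",         # "Reduce_150": 150 reduce tasks
    "-D", "mapreduce.task.io.sort.factor=45",  # "task.io_45": merge up to 45 spill files at once
    "hdfs:///terasort/input",                  # placeholder input path
    "hdfs:///terasort/output",                 # placeholder output path
]
subprocess.run(cmd, check=True)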
For Spark, increasing the input block size to 1024 MB improves performance by up to 4% for data
sizes over 500 GB, suggesting that larger block sizes are more efficient for very large datasets.
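The summary does not name the exact property the authors tuned, so the sketch below shows one plausible way to request roughly 1024 MB input splits in Spark, by forwarding Hadoop input settings through the spark.hadoop.* prefix; the chosen properties and the path are assumptions.

from pyspark import SparkConf, SparkContext

# Sketch: ask for ~1024 MB input splits when reading from HDFS.
# The spark.hadoop.* prefix forwards these values to the underlying Hadoop
# input format; the specific properties are assumptions, not from the article.
one_gib = str(1024 * 1024 * 1024)
conf = (SparkConf()
        .setAppName("LargeBlockWordCount")
        .set("spark.hadoop.dfs.blocksize", one_gib)
        .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", one_gib))

sc = SparkContext(conf=conf)
lines = sc.textFile("hdfs:///data/input")    # placeholder path
print(lines.getNumPartitions())              # fewer, larger partitions than the default
sc.stop()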
The results also show that Spark outperforms MapReduce by more than two times on the WordCount
workload when the data size is over 300 GB; for smaller datasets, Spark is up to ten times
faster.