Basics of Big Data

9/17/2024 Big Data

Hadoop & Spark

Mostafa Fawzy
Hadoop
- A distributed software framework for storing, processing, and analyzing large-scale data.
- It is open source.
- It runs on commodity hardware (it does not require particular specifications for any machine).
- Hadoop architecture and its ecosystem.
Engines can be added to Hadoop on top of its core architecture to perform additional functions, such as an ML engine.

Hadoop Core Components


- Hadoop Distributed File System (HDFS):
o Data Storage Layer.
o Responsible for storing data on the Hadoop cluster.
o Data is split into blocks with a configurable block size, for example 64 MB, 128 MB, or 512 MB.
o Each block is replicated, 3 times by default, on different nodes across the cluster. It is recommended to keep two replicas on nodes in the same rack and the third on a node in another rack.
o HDFS files are write-once (existing HDFS files cannot be modified in place).
o To access HDFS we use the Hadoop API (a minimal client sketch follows this list).

A rack (or space) is a unit that houses a group of servers.

o Its architecture consists of:


o Name Node:
▪ Contains metadata about the file system and about each data node.
▪ Each cluster contains 2 name nodes (Active | Standby).
o Data Node:
▪ Stores the actual data blocks; each block is replicated, 3 times by default, on different data nodes across the cluster.
▪ It is recommended to keep two replicas in one rack and the third in another.
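
A minimal Python sketch of accessing HDFS from a client machine, assuming a running cluster with WebHDFS enabled on the name node. The host name, port, user, and paths below are placeholders, and the hdfs PyPI package is only one client option; the Java Hadoop API mentioned above remains the canonical interface.

# Sketch only: assumes WebHDFS is reachable at the placeholder address below.
from hdfs import InsecureClient

# Connect to the (active) name node over WebHDFS.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a new file. HDFS is write-once: without overwrite=True a second run
# would fail, because existing files cannot be modified in place.
client.write('/data/example/greeting.txt', data='hello hdfs\n',
             overwrite=True, encoding='utf-8')

# Read the file back.
with client.read('/data/example/greeting.txt', encoding='utf-8') as reader:
    print(reader.read())

# List the directory and inspect the per-file metadata kept by the name node.
print(client.list('/data/example'))
print(client.status('/data/example/greeting.txt'))
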
- MapReduce: The processing engine (compute paradigm) in Hadoop.
o Consists of 3 steps
▪ Mapping
▪ Shuffle | Grouping
▪ Reducing
MapReduce consists of 3 parts (a minimal word-count sketch follows this list):
1. The Driver
o It's responsible for setting up the job's configuration, specifying the
mapper and reducer classes, and submitting the job to the Hadoop
cluster for execution.

In short: the driver is the code that runs on the Client Machine to launch the job (the mastermind behind the job being executed).

2. The Mapper
o Input: Mapper takes a chunk of data (e.g., a line from a text file) as
input.
o Processing: It applies a user-defined mapping function to the input
data. This function typically breaks down the input into key-value
pairs.
o Output: The mapper emits these key-value pairs.
3. The Reducer
o Input: Reducer receives a key and an iterable collection of values
associated with that key from all mappers.
o Processing: It applies a user-defined reduce function to this data. This
function typically aggregates the values for the given key.
o Output: The reducer emits the aggregated key-value pair.
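
A minimal, self-contained sketch of the three MapReduce steps (map, shuffle/grouping, reduce) as a word count, simulated in plain Python. A real Hadoop job would express the same logic as driver, mapper, and reducer classes (typically in Java) and let the framework perform the shuffle across the cluster; the function and variable names here are illustrative only.

from collections import defaultdict

def mapper(line):
    # Map step: turn one chunk of input (a line) into (word, 1) key-value pairs.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: aggregate the iterable of values collected for one key.
    return word, sum(counts)

def run_job(lines):
    # Shuffle / grouping step: gather every value emitted for the same key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # One reduce call per key, mirroring how each reducer receives a key plus
    # all values for that key from all mappers.
    return dict(reducer(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    data = ["big data needs big clusters", "spark and hadoop process big data"]
    print(run_job(data))  # e.g. {'big': 3, 'data': 2, ...}
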

- YARN: The resource manager in Hadoop (responsible for allocating resources and distributing jobs and tasks across the cluster).


- Kafka:
o Used as a powerful and flexible messaging and streaming engine alongside Hadoop.
o It provides a variety of benefits for (a minimal producer/consumer sketch follows this list):
▪ real-time data processing
▪ message queuing
▪ data replication
▪ stream processing
▪ integration with other systems.
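
A minimal sketch of producing and consuming messages with Kafka, using the kafka-python client. It assumes a broker reachable at localhost:9092 and a topic named "logs"; both are placeholders, and in a Hadoop pipeline the consuming side would typically be Spark Streaming, Flume, or another ingestion tool.

import json
from kafka import KafkaProducer, KafkaConsumer

# Produce a few JSON-encoded events to the "logs" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
for i in range(3):
    producer.send("logs", {"event_id": i, "level": "INFO"})
producer.flush()

# Consume the same topic from the beginning and print each event.
consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s with no new messages
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.offset, message.value)
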
- Flume:
o a distributed, fault-tolerant, and reliable system for collecting,
aggregating, and transporting large amounts of log data to Hadoop.
o It acts as a data ingestion engine, providing a robust and scalable way
to collect data from various sources and deliver it to Hadoop for
further processing.

Spark
- A lightning-fast cluster computing technology, designed for fast computation.
- It was built on top of Hadoop MapReduce, and it extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
- The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
- It helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk: intermediate processing data is stored in memory (a minimal PySpark sketch follows the list below).
- Spark is built on top of Hadoop.
- It is designed to cover a wide range of workloads, such as
o batch applications
o iterative algorithms
o interactive queries
o streaming.
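
A minimal PySpark sketch of a batch word count illustrating the points above. It assumes a local PySpark installation; the input path is a placeholder, and the same code can run on a YARN-managed Hadoop cluster by changing only the master setting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/example/*.txt")  # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word.lower(), 1))
               .reduceByKey(lambda a, b: a + b))

# cache() keeps the intermediate result in memory, so the two actions below
# do not re-read and re-process the input from disk.
counts.cache()
print(counts.count())   # number of distinct words
print(counts.take(10))  # sample of (word, count) pairs

spark.stop()
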

Features of Apache Spark


• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark
- Spark Core
o The underlying general execution engine for the Spark platform, upon which all other functionality is built.
o It provides in-memory computing and the ability to reference datasets in external storage systems.
- Spark SQL
o A component on top of Spark Core that introduces a data abstraction called SchemaRDD (later renamed DataFrame), which provides support for structured and semi-structured data (see the sketch after this list).
- Spark Streaming
o leverages Spark Core's fast scheduling capability to perform
streaming analytics.
o It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
- MLlib (Machine Learning Library)
o A distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture.
o According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
- GraphX
o A distributed graph-processing framework on top of Spark.
o It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API.
o It also provides an optimized runtime for this abstraction.
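
A minimal Spark SQL sketch to illustrate the component list above. SchemaRDD later evolved into the DataFrame API, but the idea is the same: structured and semi-structured data with a schema that can be queried through an API or with SQL. It assumes a local PySpark installation; the sample rows are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sparksql").getOrCreate()

rows = [("alice", "clicks", 3), ("bob", "clicks", 5), ("alice", "views", 7)]
df = spark.createDataFrame(rows, schema=["user", "event", "amount"])

# Query the data through the DataFrame API...
df.groupBy("user").sum("amount").show()

# ...or with plain SQL over a temporary view.
df.createOrReplaceTempView("events")
spark.sql("SELECT user, SUM(amount) AS total FROM events GROUP BY user").show()

spark.stop()
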

Data Sharing in MapReduce and Spark

Data Sharing in MapReduce

• Slow: Involves replication, serialization, and disk I/O.
• HDFS-dependent: Relies heavily on HDFS for data storage and retrieval.
• Inefficient for iterative and interactive workloads: The overhead of disk I/O and data transfer can significantly impact performance.

Data Sharing in Spark

• Fast: Utilizes in-memory processing with Resilient Distributed Datasets (RDDs).
• Efficient for iterative and interactive workloads: Intermediate results are stored in memory, reducing the need for disk I/O.
• Supports persistence: RDDs can be persisted in memory or on disk for faster access in subsequent queries.
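
A minimal sketch of RDD persistence, the feature that makes Spark efficient for iterative workloads. It assumes a local PySpark installation; the data and the toy iterations are illustrative only.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")

# An intermediate result that would be expensive to recompute.
base = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

# Persist it in memory (spilling to disk if it does not fit), so each pass
# below reuses the cached partitions instead of recomputing them or rereading
# from HDFS, as MapReduce would have to do between jobs.
base.persist(StorageLevel.MEMORY_AND_DISK)

for i in range(5):
    total = base.map(lambda x: x + i).sum()  # each pass reuses the cached RDD
    print(f"iteration {i}: {total}")

base.unpersist()
sc.stop()
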

Key Advantages of Spark over MapReduce for Data Sharing

• Improved performance: Faster data sharing and processing due to in-memory operations.
• Reduced overhead: Less disk I/O and data transfer.
• Better suited for iterative and interactive workloads: Handles these types of workloads more efficiently.
