Basics of Big Data

9/17/2024 Big Data

Hadoop & Spark

Mostafa Fawzy
Hadoop
- A distributed software framework for storing, processing, and analyzing large-scale data.
- It is open source.
- It runs on commodity hardware (it does not require particular specifications for any machine).
- Hadoop architecture and its ecosystem.
Engines can be added to Hadoop on top of its core architecture to perform additional functions, such as an ML engine.

Hadoop Core Components


- Hadoop Distributed File System (HDFS):
o Data Storage Layer.
o Responsible for storing data on the Hadoop cluster.
o Data is split into blocks with a configurable block size, for example 64 MB, 128 MB, or 512 MB.
o Each block is replicated, 3 times by default, on different nodes across the cluster. It is recommended to keep two replicas on nodes in the same rack and the third on a node in another rack.
o HDFS files are write-once (existing HDFS files cannot be modified in place).
o To access HDFS we use the Hadoop API (a minimal client sketch follows this list).

A rack (or space) is a unit that houses a group of servers.

o Its architecture consists of:


o Name Node:
▪ Contains metadata about the file system and about each data node.
▪ Each cluster contains 2 name nodes (Active | Standby).
o Data Node:
▪ Stores the actual data blocks; each block is replicated, 3 times by default, on different data nodes across the cluster.
▪ It is recommended to keep two replicas in one rack and the third in another.
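
A minimal Python sketch of accessing HDFS from a client machine, assuming a running cluster with WebHDFS enabled on the name node. The host name, port, user, and paths below are placeholders, and the hdfs PyPI package is only one client option; the Java Hadoop API mentioned above remains the canonical interface.

# Sketch only: assumes WebHDFS is reachable at the placeholder address below.
from hdfs import InsecureClient

# Connect to the (active) name node over WebHDFS.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a new file. HDFS is write-once: without overwrite=True a second run
# would fail, because existing files cannot be modified in place.
client.write('/data/example/greeting.txt', data='hello hdfs\n',
             overwrite=True, encoding='utf-8')

# Read the file back.
with client.read('/data/example/greeting.txt', encoding='utf-8') as reader:
    print(reader.read())

# List the directory and inspect the per-file metadata kept by the name node.
print(client.list('/data/example'))
print(client.status('/data/example/greeting.txt'))
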
- MapReduce: The processing engine (compute paradigm) in Hadoop.
o Consists of 3 steps
▪ Mapping
▪ Shuffle | Grouping
▪ Reducing
MapReduce consists of 3 parts (a minimal word-count sketch follows this list):
1. The Driver
o It's responsible for setting up the job's configuration, specifying the
mapper and reducer classes, and submitting the job to the Hadoop
cluster for execution.

In short: the driver is the code that runs on the Client Machine to launch the job (the mastermind behind the job being executed).

2. The Mapper
o Input: Mapper takes a chunk of data (e.g., a line from a text file) as
input.
o Processing: It applies a user-defined mapping function to the input
data. This function typically breaks down the input into key-value
pairs.
o Output: The mapper emits these key-value pairs.
3. The Reducer
o Input: Reducer receives a key and an iterable collection of values
associated with that key from all mappers.
o Processing: It applies a user-defined reduce function to this data. This
function typically aggregates the values for the given key.
o Output: The reducer emits the aggregated key-value pair.
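
A minimal, self-contained sketch of the three MapReduce steps (map, shuffle/grouping, reduce) as a word count, simulated in plain Python. A real Hadoop job would express the same logic as driver, mapper, and reducer classes (typically in Java) and let the framework perform the shuffle across the cluster; the function and variable names here are illustrative only.

from collections import defaultdict

def mapper(line):
    # Map step: turn one chunk of input (a line) into (word, 1) key-value pairs.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: aggregate the iterable of values collected for one key.
    return word, sum(counts)

def run_job(lines):
    # Shuffle / grouping step: gather every value emitted for the same key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # One reduce call per key, mirroring how each reducer receives a key plus
    # all values for that key from all mappers.
    return dict(reducer(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    data = ["big data needs big clusters", "spark and hadoop process big data"]
    print(run_job(data))  # e.g. {'big': 3, 'data': 2, ...}
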

- YARN: The resource manager in Hadoop (responsible for allocating resources and distributing jobs and tasks across the cluster).


- Kafka:
o Used as a powerful and flexible messaging and streaming engine alongside Hadoop.
o It provides a variety of benefits for (a minimal producer/consumer sketch follows this list):
▪ real-time data processing
▪ message queuing
▪ data replication
▪ stream processing
▪ integration with other systems.
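
A minimal sketch of producing and consuming messages with Kafka, using the kafka-python client. It assumes a broker reachable at localhost:9092 and a topic named "logs"; both are placeholders, and in a Hadoop pipeline the consuming side would typically be Spark Streaming, Flume, or another ingestion tool.

import json
from kafka import KafkaProducer, KafkaConsumer

# Produce a few JSON-encoded events to the "logs" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)
for i in range(3):
    producer.send("logs", {"event_id": i, "level": "INFO"})
producer.flush()

# Consume the same topic from the beginning and print each event.
consumer = KafkaConsumer(
    "logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s with no new messages
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.offset, message.value)
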
- Flume:
o a distributed, fault-tolerant, and reliable system for collecting,
aggregating, and transporting large amounts of log data to Hadoop.
o It acts as a data ingestion engine, providing a robust and scalable way
to collect data from various sources and deliver it to Hadoop for
further processing.

Spark
- A lightning-fast cluster computing technology, designed for fast computation.
- It was built on top of Hadoop MapReduce, and it extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
- The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
- It helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk: intermediate processing data is stored in memory (a minimal PySpark sketch follows the list below).
- Spark is built on top of Hadoop.
- It is designed to cover a wide range of workloads, such as
o batch applications
o iterative algorithms
o interactive queries
o streaming.
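
A minimal PySpark sketch of a batch word count illustrating the points above. It assumes a local PySpark installation; the input path is a placeholder, and the same code can run on a YARN-managed Hadoop cluster by changing only the master setting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/example/*.txt")  # placeholder input path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word.lower(), 1))
               .reduceByKey(lambda a, b: a + b))

# cache() keeps the intermediate result in memory, so the two actions below
# do not re-read and re-process the input from disk.
counts.cache()
print(counts.count())   # number of distinct words
print(counts.take(10))  # sample of (word, count) pairs

spark.stop()
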

Features of Apache Spark


• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
Components of Spark
- Spark Core
o The underlying general execution engine for the Spark platform, upon which all other functionality is built.
o It provides in-memory computing and the ability to reference datasets in external storage systems.
- Spark SQL
o A component on top of Spark Core that introduces a data abstraction called SchemaRDD (later renamed DataFrame), which provides support for structured and semi-structured data (see the sketch after this list).
- Spark Streaming
o leverages Spark Core's fast scheduling capability to perform
streaming analytics.
o It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
- MLlib (Machine Learning Library)
o A distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture.
o According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
- GraphX
o A distributed graph-processing framework on top of Spark.
o It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API.
o It also provides an optimized runtime for this abstraction.
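
A minimal Spark SQL sketch to illustrate the component list above. SchemaRDD later evolved into the DataFrame API, but the idea is the same: structured and semi-structured data with a schema that can be queried through an API or with SQL. It assumes a local PySpark installation; the sample rows are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sparksql").getOrCreate()

rows = [("alice", "clicks", 3), ("bob", "clicks", 5), ("alice", "views", 7)]
df = spark.createDataFrame(rows, schema=["user", "event", "amount"])

# Query the data through the DataFrame API...
df.groupBy("user").sum("amount").show()

# ...or with plain SQL over a temporary view.
df.createOrReplaceTempView("events")
spark.sql("SELECT user, SUM(amount) AS total FROM events GROUP BY user").show()

spark.stop()
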

Data Sharing in MapReduce and Spark

Data Sharing in MapReduce

• Slow: Involves replication, serialization, and disk I/O.
• HDFS-dependent: Relies heavily on HDFS for data storage and retrieval.
• Inefficient for iterative and interactive workloads: The overhead of disk I/O and data transfer can significantly impact performance.

Data Sharing in Spark

• Fast: Utilizes in-memory processing with Resilient Distributed Datasets (RDDs).
• Efficient for iterative and interactive workloads: Intermediate results are stored in memory, reducing the need for disk I/O.
• Supports persistence: RDDs can be persisted in memory or on disk for faster access in subsequent queries.
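
A minimal sketch of RDD persistence, the feature that makes Spark efficient for iterative workloads. It assumes a local PySpark installation; the data and the toy iterations are illustrative only.

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")

# An intermediate result that would be expensive to recompute.
base = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

# Persist it in memory (spilling to disk if it does not fit), so each pass
# below reuses the cached partitions instead of recomputing them or rereading
# from HDFS, as MapReduce would have to do between jobs.
base.persist(StorageLevel.MEMORY_AND_DISK)

for i in range(5):
    total = base.map(lambda x: x + i).sum()  # each pass reuses the cached RDD
    print(f"iteration {i}: {total}")

base.unpersist()
sc.stop()
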

Key Advantages of Spark over MapReduce for Data Sharing

• Improved performance: Faster data sharing and processing due to in-memory operations.
• Reduced overhead: Less disk I/O and data transfer.
• Better suited for iterative and interactive workloads: Handles these types of workloads more efficiently.
