Basics of Big Data
Mostafa Fawzy
Hadoop
- A distributed software framework for storing, processing, and analyzing large-scale
data.
- It’s open source.
- It runs on commodity hardware (it does not require machines with particular specifications).
- Hadoop architecture and its ecosystems.
Engines can be added on top of Hadoop's core architecture to perform additional functions, e.g., an ML engine.
1. The Client Machine
o In short: the code that runs on the Client Machine to launch the job (the brains behind the job being executed).
2. The Mapper
o Input: The mapper takes a chunk of data (e.g., a line from a text file) as
input.
o Processing: It applies a user-defined mapping function to the input
data. This function typically breaks down the input into key-value
pairs.
o Output: The mapper emits these key-value pairs.
3. The Reducer
o Input: The reducer receives a key and an iterable collection of values
associated with that key from all mappers.
o Processing: It applies a user-defined reduce function to this data. This
function typically aggregates the values for the given key.
o Output: The reducer emits the aggregated key-value pair (see the word-count sketch after this list).
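A minimal word-count sketch in the Hadoop Streaming style (Python over stdin/stdout). The file names mapper.py and reducer.py and the tab-separated key-value format are illustrative assumptions, not taken from these notes:

# mapper.py -- hypothetical streaming mapper: emits a (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Key and value are tab-separated, one pair per line.
        print(f"{word}\t1")

# reducer.py -- hypothetical streaming reducer: sums the counts for each word.
# Hadoop sorts mapper output by key, so all values for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

With Hadoop Streaming, these two scripts would be passed as the -mapper and -reducer of a streaming job; the native Java API plays the same roles with Mapper and Reducer classes.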
Spark
- A lightning-fast cluster computing framework designed for fast computation.
- It was built on top of Hadoop MapReduce, and it extends the MapReduce
model to efficiently support more types of computations, including
interactive queries and stream processing.
- The main feature of Spark is its in-memory cluster computing that increases
the processing speed of an application.
- It helps run an application on a Hadoop cluster up to 100 times faster in
memory, and up to 10 times faster when running on disk. This is possible by
reducing the number of read/write operations to disk: it stores the
intermediate processing data in memory (see the PySpark sketch after this list).
- Spark is built on Hadoop.
- It is designed to cover a wide range of workloads such as
o batch applications
o iterative algorithms
o interactive queries
o streaming.
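A minimal PySpark sketch of the in-memory idea (the HDFS path, the "ERROR" filter, and the follow-up queries are illustrative assumptions): the filtered data is cached once, so repeated, interactive-style questions over it do not re-read the input from disk.

# Hypothetical PySpark job: cache() keeps the filtered RDD in memory,
# so later actions scan memory instead of re-reading HDFS.
from pyspark import SparkContext

sc = SparkContext(appName="in-memory-demo")

lines = sc.textFile("hdfs:///data/logs.txt")            # illustrative input path
errors = lines.filter(lambda l: "ERROR" in l).cache()   # intermediate data stays in memory

# The first action computes the RDD and fills the cache.
print(errors.count())

# Interactive-style follow-up questions reuse the cached data.
print(errors.filter(lambda l: "timeout" in l).count())
print(errors.filter(lambda l: "disk full" in l).count())

sc.stop()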
• Hadoop MapReduce is inefficient for iterative and interactive workloads: the overhead of disk I/O
and data transfer can significantly impact performance.
• Spark is better suited for iterative and interactive workloads: it handles these types
of workloads more efficiently (see the sketch below).
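To see why this matters for iterative algorithms, here is a hedged sketch (the data path and the simple gradient-descent loop are illustrative assumptions): every iteration rescans the same cached RDD in memory, whereas a chain of MapReduce jobs would write and re-read intermediate data on disk at each pass.

# Hypothetical iterative job in PySpark: the parsed data is cached once,
# then every iteration reuses it from memory.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-demo")

# One number per line; cache() keeps the parsed values in memory.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

# Simple gradient descent toward the mean of the data (illustrative only).
m = 0.0
for _ in range(20):
    grad = points.map(lambda x: m - x).mean()   # in-memory scan each pass
    m -= 0.5 * grad

print(m)
sc.stop()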