
Big Data Processing, MapReduce

The document discusses Big Data processing, focusing on batch and transactional processing, and introduces Hadoop as a framework for distributed data management. It explains the MapReduce programming model, which enables parallel processing of data across a Hadoop cluster, highlighting its scalability and fault tolerance. Examples illustrate the Map and Reduce tasks in action, such as counting words in documents and cataloging coins.

Uploaded by

azamsyed811

Big Data Processing and MapReduce

Outline
• Batch and Transactional Processing
• Hadoop
• MapReduce

Reference:
• Chapter 6, “Big Data Fundamentals: Concepts, Drivers & Techniques”, by Thomas Erl, Wajid Khattak, Paul Buhler. 1st Ed. ISBN-10: 0134291077.
Big Data Management Software Stack

Distributed Data Processing

• Achieved through physically separate machines that are networked together as a cluster

Processing Workloads
• Batch: data is processed in batches, which usually imposes delays and in turn results in high-latency responses
o Also known as offline processing
o Queries can be complex and involve multiple joins
• Transactional: data is processed interactively without delay, resulting in low-latency responses
o Also known as online processing
o Involves small amounts of data with random reads and writes

Batch Processing
● A batch workload can include grouped reads/writes to INSERT, SELECT, UPDATE and DELETE data
● Response time can vary from minutes to hours
● Generally involves processing a range of large datasets
● The majority of Big Data processing occurs in batch mode

Transactional Processing
● Transactional workloads have fewer joins and lower-latency responses than batch workloads
● Generally more write-intensive than read-intensive
● Smaller data footprint

Hadoop
● Hadoop is a versatile framework that provides both processing and storage capabilities
● Two main components:
1. Hadoop Distributed File System (HDFS) for distributed storage
2. MapReduce for distributed processing

Batch Processing with MapReduce
● MapReduce is a programming model that allows parallel and distributed processing of data across a Hadoop cluster
● Highly scalable, reliable, and based on the principle of divide-and-conquer
● Built-in fault tolerance and redundancy
● Does not require that the input data conform to any particular data model
● High coordination overhead
● Data processing algorithm is moved to the nodes that store the data
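
The last point above, moving the algorithm to the data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Hadoop is implemented: the `cluster` dictionary, the node names, and the helpers `local_sum` and `run_on_cluster` are all hypothetical, and threads stand in for physically separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" already stores one partition of the data.
cluster = {
    "node-1": [4, 8, 15],
    "node-2": [16, 23],
    "node-3": [42],
}

def local_sum(partition):
    """The task shipped to a node; it reads only that node's own partition."""
    return sum(partition)

def run_on_cluster(task, cluster):
    """Run the task on every node's partition in parallel (divide),
    returning only the small partial result from each node."""
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        futures = {node: pool.submit(task, part) for node, part in cluster.items()}
        return {node: future.result() for node, future in futures.items()}

# Conquer: merge the partial results into the final answer.
partials = run_on_cluster(local_sum, cluster)
total = sum(partials.values())
print(partials, total)
```

Only the per-node partial sums travel back to the caller; the partitions themselves never leave their nodes, which is the bandwidth saving the bullet describes.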

Map and Reduce Tasks

An illustration of a MapReduce job with the map stage highlighted

Example #1 of MapReduce
Goal: Count the number of times a word appeared in a document

Assume we have 10 servers and 200 documents

1. Map: divide the documents and assign them to the servers (e.g., 20 each)
• (Key, Value) pair → (Word, Count) → (“Taco”, 7)
2. Combine and Partition if necessary
3. Shuffle and Sort → Take the output from the previous stage and combine it into a sorted list
4. Reduce → Sum or merge to arrive at the final result
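
The four stages above can be sketched as a single-process Python simulation. It is a minimal sketch, not Hadoop's API: the function names (`map_task`, `combine`, `shuffle_and_sort`, `reduce_task`, `word_count`), the round-robin split of documents, and the three sample documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def map_task(documents):
    """Map: emit a (word, 1) pair for every word in one server's documents."""
    for text in documents:
        for word in text.lower().split():
            yield (word, 1)

def combine(pairs):
    """Combine: pre-sum counts on the same server, giving (word, count) pairs."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def shuffle_and_sort(all_pairs):
    """Shuffle and sort: group the counts by word and order the groups by key."""
    grouped = defaultdict(list)
    for word, count in all_pairs:
        grouped[word].append(count)
    return sorted(grouped.items())

def reduce_task(word, counts):
    """Reduce: sum the partial counts for one word."""
    return (word, sum(counts))

def word_count(documents, num_servers):
    # Assumed split strategy: deal the documents out to servers round-robin.
    splits = [documents[i::num_servers] for i in range(num_servers)]
    mapped = []
    for split in splits:  # each iteration stands in for one server
        mapped.extend(combine(map_task(split)))
    return dict(reduce_task(word, counts) for word, counts in shuffle_and_sort(mapped))

docs = ["taco taco burrito", "burrito taco", "salsa"]
counts = word_count(docs, num_servers=2)
print(counts)
```

With 200 documents and `num_servers=10`, the same split would hand 20 documents to each server, matching the example above.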

Example #2 of MapReduce
Goal: Count and catalog all the coins in a pile (different currency types and denominations)

“Classical” approach to parallel computing

Ref: https://freecontent.manning.com/explaining-mapreduce-with-ducks/
MapReduce
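
The slide's figure is not available in this text, so the following is one plausible reading of the coin example rather than a reproduction of it. The sample `pile`, the use of minor units (cents, pence) for denominations, and the helper names `map_coin`, `shuffle`, and `reduce_coins` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical pile: (currency, denomination in minor units such as cents).
pile = [
    ("USD", 25), ("EUR", 100), ("USD", 25),
    ("USD", 10), ("EUR", 100), ("GBP", 50),
]

def map_coin(coin):
    """Map: label each coin with its (currency, denomination) key."""
    currency, denomination = coin
    return ((currency, denomination), 1)

def shuffle(pairs):
    """Shuffle: send all coins with the same key to the same group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_coins(key, values):
    """Reduce: count the coins in one group."""
    return (key, sum(values))

groups = shuffle(map_coin(coin) for coin in pile)
catalog = dict(reduce_coins(key, values) for key, values in sorted(groups.items()))
print(catalog)
```

Each reducer only ever sees one kind of coin, so the final catalog is assembled without any worker needing to know about the whole pile.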
