Session 8 Big Data
1. Large quantity
2. Arriving quickly
3. [Un]structured, multi-modal
Distributed and Parallel Computing for Big Data
• Distributed computing
• Multiple computing resources are connected in a network
• Parallel computing
• Processing power of a standalone personal computer is enhanced by adding multiple processing units
Distributed computing | Parallel computing
An independent, autonomous system connected in a network for accomplishing specific tasks | A computer system with several processing units attached to it
Coordination is possible between connected computers that have their own memory and CPU | A common shared memory can be directly accessed by every processing unit in a network
Loose coupling of computers connected in a network that provides access to data and remotely located resources | Tight coupling of processing resources that are used for solving a single, complex problem
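The contrast between shared memory and private memory can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads on one machine stand in for both styles, and the function names `shared_memory_sum` and `message_passing_sum` are hypothetical. In a real distributed system the "nodes" would be separate machines exchanging messages over a network.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Parallel style: every worker updates one shared total directly."""
    total = [0]                      # shared memory visible to all workers
    lock = threading.Lock()          # tight coupling: access must be coordinated

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(chunks):
    """Distributed style: each node keeps its own data and only sends a message."""
    inbox = queue.Queue()            # stands in for the network (assumption)

    def node(chunk):
        inbox.put(sum(chunk))        # the node's private data never leaves it

    threads = [threading.Thread(target=node, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The coordinator combines the partial results it received.
    return sum(inbox.get() for _ in chunks)


chunks = [[1, 2, 3], [4, 5], [6]]
shared_total = shared_memory_sum(chunks)
message_total = message_passing_sum(chunks)
```

Both functions compute the same sum; the difference is whether workers touch common state or only exchange results.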
The MapReduce Paradigm
• Platform for reliable, scalable parallel computing
• Abstracts the issues of the distributed and parallel environment away from the programmer
• Runs over distributed file systems
• Google File System
• Hadoop Distributed File System (HDFS)
MapReduce: Insight
• Consider the problem of counting the number of occurrences of each word in a
large collection of documents
• How would you do it in parallel?
• Solution:
• Divide documents among workers
• Each worker parses its documents to find all words and outputs (word, count) pairs
• Partition (word, count) pairs across workers based on word
• For each word at a worker, locally add up counts
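The four solution steps above can be sketched in plain Python. This is a minimal single-process sketch, not a distributed implementation: the "workers" are ordinary function calls, and the helper names (`map_document`, `partition`, `reduce_bucket`, `word_count`) and the sample documents are made up for illustration.

```python
from collections import defaultdict


def map_document(text):
    """Parse one document and emit a (word, 1) pair for every word found."""
    return [(word, 1) for word in text.lower().split()]


def partition(pairs, num_workers):
    """Send each (word, count) pair to a worker chosen from the word itself."""
    buckets = [[] for _ in range(num_workers)]
    for word, count in pairs:
        # Toy stable hash (sum of code points); an assumption for this sketch.
        # All pairs for the same word always land in the same bucket.
        buckets[sum(map(ord, word)) % num_workers].append((word, count))
    return buckets


def reduce_bucket(bucket):
    """Locally add up the counts for each word held by one worker."""
    totals = defaultdict(int)
    for word, count in bucket:
        totals[word] += count
    return dict(totals)


def word_count(documents, num_workers=2):
    # Step 1 and 2: each document is parsed independently (one per "worker").
    mapped = [pair for doc in documents for pair in map_document(doc)]
    # Step 3: partition the pairs across workers by word.
    buckets = partition(mapped, num_workers)
    # Step 4: each worker sums its own words; buckets never share a word.
    result = {}
    for bucket in buckets:
        result.update(reduce_bucket(bucket))
    return result


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
```

Because partitioning depends only on the word, no two workers ever hold counts for the same word, so the per-worker totals can be merged without further addition.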
MapReduce Workflow
[Figure: MapReduce workflow. Input splits (Split 0, Split 1, Split 2) are read by map workers, which write intermediate results to local disk; reduce workers fetch them via remote read, sort them, and write Output File 0 and Output File 1.]
• Map: extract something you care about from each record
• Reduce: aggregate, summarize, filter, or transform
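The workflow can be mimicked in memory to show where each stage sits. The sketch below is an assumption-laden toy: `run_mapreduce` is a hypothetical function, Python lists stand in for local and output files, `zlib.crc32` is one possible partitioning hash, and the city/temperature records are invented sample data.

```python
import zlib
from collections import defaultdict


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    # Map phase: one map worker per split. Each worker writes its
    # intermediate pairs to "local files", one partition per reducer.
    local_files = []
    for split in splits:
        partitions = defaultdict(list)
        for record in split:
            for key, value in map_fn(record):
                r = zlib.crc32(key.encode("utf-8")) % num_reducers
                partitions[r].append((key, value))
        local_files.append(partitions)

    # Reduce phase: each reducer does a "remote read" of its partition from
    # every map worker, sorts by key, groups, and writes one output file.
    outputs = []
    for r in range(num_reducers):
        fetched = [pair for parts in local_files for pair in parts.get(r, [])]
        fetched.sort(key=lambda kv: kv[0])
        grouped = defaultdict(list)
        for key, value in fetched:
            grouped[key].append(value)
        outputs.append([(k, reduce_fn(k, vs)) for k, vs in grouped.items()])
    return outputs


def extract_reading(record):
    """Map: extract the (city, temperature) pair from a 'city,temp' record."""
    city, temp = record.split(",")
    yield city, int(temp)


def hottest(city, temps):
    """Reduce: summarize all readings for one city as their maximum."""
    return max(temps)


splits = [["oslo,3", "cairo,31"], ["oslo,7", "lima,19"], ["cairo,35", "lima,22"]]
outputs = run_mapreduce(splits, extract_reading, hottest, num_reducers=2)
merged = dict(pair for out in outputs for pair in out)
```

Each key appears in exactly one output list, mirroring how each reduce worker owns a disjoint set of keys and produces its own output file.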
Big Data Trends