
Session 8 Big Data

The document discusses big data analytics and distributed and parallel computing for big data. It talks about how big data is characterized by the 3 Vs - volume, velocity and variety. It then discusses distributed computing, parallel computing and the MapReduce paradigm for handling large scale data processing across distributed systems.

Uploaded by

pranjal rohilla

Big Data Analytics

Analyses which can handle the 3 Vs and do it with quality (veracity) (Laney, 2001: META Group):

1. Volume: large quantity
2. Velocity: arriving quickly
3. Variety: [un]structured, multi-modal

Distributed and Parallel Computing for Big Data
• Distributed computing
• Multiple computing resources are connected in a network

• Computing tasks are distributed across the resources

• Can be faster and more efficient than running the same work on a single machine

• Parallel computing
• Processing power of a standalone personal computer is enhanced by adding multiple processing units

• Computing tasks are distributed across processing units
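
As a minimal sketch of distributing a computing task across processing units, the example below splits a list into chunks, hands each chunk to a worker, and combines the partial results. The names `split`, `partial_sum`, and `distributed_sum` are hypothetical and not from the slides. Threads stand in for processing units here to keep the sketch short; for CPU-bound work in CPython, `concurrent.futures.ProcessPoolExecutor` would be the closer analogue.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one processing unit: sum its share of the data."""
    return sum(chunk)


def split(data, n_units):
    """Cut data into contiguous chunks, at most n_units of them."""
    size = max(1, -(-len(data) // n_units))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_sum(data, n_units=4):
    """Distribute the chunks across workers, then combine the partial sums."""
    chunks = split(data, n_units)
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


total = distributed_sum(list(range(1, 101)))
```

The same split, compute, combine shape applies whether the workers are cores in one machine or computers in a network; what changes is how the chunks and partial results travel between them.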


Distributed and Parallel Computing for Big Data

Distributed Computing System | Parallel Computing System
An independent, autonomous system connected in a network for accomplishing specific tasks | A computer system with several processing units attached to it
Coordination is possible between connected computers that have their own memory and CPU | A common shared memory can be directly accessed by every processing unit in a network
Loose coupling of computers connected in a network that provides access to data and remotely located resources | Tight coupling of processing resources that are used for solving a single, complex problem

The MapReduce Paradigm
• Platform for reliable, scalable parallel computing
• Abstracts the issues of the distributed and parallel environment away from the programmer
• Runs over distributed file systems
• Google File System
• Hadoop Distributed File System (HDFS)
MapReduce: Insight
• Consider the problem of counting the number of occurrences of each word in a large collection of documents
• How would you do it in parallel?
• Solution:
• Divide documents among workers
• Each worker parses its documents to find all words and outputs (word, count) pairs
• Partition (word, count) pairs across workers based on word
• For each word at a worker, locally add up counts
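
The four steps of this solution can be sketched in a single process as follows. This is an illustration only, not a MapReduce implementation: the workers are simulated by plain function calls, and the names `map_document`, `partition`, `reduce_bucket`, and `word_count`, as well as the word-splitting regular expression, are assumptions made for the example.

```python
import re
from collections import defaultdict


def map_document(text):
    """One worker parses a document and emits a (word, 1) pair per occurrence."""
    return [(word, 1) for word in re.findall(r"[a-z']+", text.lower())]


def partition(pairs, n_workers):
    """Route each pair to a worker chosen from the word itself, so every
    count for a given word ends up at the same worker."""
    buckets = [[] for _ in range(n_workers)]
    for word, count in pairs:
        # hash() of a str varies between runs but is stable within one run
        buckets[hash(word) % n_workers].append((word, count))
    return buckets


def reduce_bucket(bucket):
    """One worker locally adds up the counts for each word it received."""
    totals = defaultdict(int)
    for word, count in bucket:
        totals[word] += count
    return dict(totals)


def word_count(documents, n_workers=3):
    # Divide documents among (simulated) workers and collect their pairs
    pairs = [pair for doc in documents for pair in map_document(doc)]
    result = {}
    for bucket in partition(pairs, n_workers):
        result.update(reduce_bucket(bucket))  # buckets never share a word
    return result


counts = word_count(["the cat sat", "the dog sat", "The end"])
```

Partitioning on the word is what makes the final step purely local: no two workers ever hold counts for the same word, so their results can be merged without further addition.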
MapReduce Workflow

[Figure: MapReduce workflow. Input data is divided into Split 0, Split 1, and Split 2. Map workers read the splits and write intermediate results locally. Reduce workers read those results remotely, sort them, and write the output data to Output File 0 and Output File 1.]

• Map: extract something you care about from each record
• Reduce: aggregate, summarize, filter, or transform
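
The division of labor between the map and reduce phases can be sketched with a small generic driver. It is a single-process illustration under stated assumptions: `map_reduce`, `parse_reading`, and `hottest` are hypothetical names, and the "year,temperature" records are made-up sample data, not taken from the slides.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(splits, mapper, reducer):
    """Run mapper over every record of every split, sort the emitted
    (key, value) pairs by key, then call reducer once per key."""
    intermediate = []
    for split in splits:              # each split could go to its own map worker
        for record in split:
            intermediate.extend(mapper(record))
    intermediate.sort(key=itemgetter(0))   # stands in for the remote read and sort
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


def parse_reading(record):
    """Map: extract (year, temperature) from a record; drop malformed ones."""
    try:
        year, temp = record.split(",")
        return [(year.strip(), float(temp))]
    except ValueError:
        return []


def hottest(year, temps):
    """Reduce: summarize all temperatures for one year as their maximum."""
    return max(temps)


splits = [["2020,31.5", "2021,29.0"], ["2020,33.1", "bad record"], ["2021,30.2"]]
result = map_reduce(splits, parse_reading, hottest)
```

Only `parse_reading` and `hottest` are specific to the problem; the driver stays the same for any other mapper and reducer, which is the abstraction the paradigm offers the programmer.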
Big Data Trends
