
Big Data Processing

Jiaul Paik
Lecture 3
Characteristics of Big Data
• Volume:
  • the size and amount of data

• Variety:
  • the diversity and range of data types (unstructured data, semi-structured data, video, images, etc.)

• Velocity (also known as streaming data):
  • the speed at which data arrives
  • e.g., social media posts or search queries received within a day/hour

• Variability:
  • the changing nature of the data companies seek to capture, manage, and analyze
  • e.g., in sentiment or text analytics

• Veracity:
  • the “truth” or accuracy of data and information

• Value:
  • the value of big data comes from discovering interesting patterns that lead to better decision making
Why big data?
• Science
• Engineering
• Commerce
• Society

Source: Wikipedia (Everest)


Science

Data-intensive science
• Large Hadron Collider (particle collider): 15 petabytes of data per year

Maximilien Brice, © CERN


Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …

Source: Wikipedia (Three Gorges Dam)


Language Translation

Big data analytics: Case Studies
• Google
  • uses big data to improve search quality (ranked results)
  • personalized ad recommendation
  • YouTube: personalized video recommendation
  • directions in Google Maps
  • Google Translate: language translation

Big data analytics: Case Studies
• Netflix
  • on-demand streaming video for its customers
  • predicts what customers will enjoy watching using big data
  • uses lots of historical data
    • who watched what
    • genre of the movie
Big data analytics: Case Studies
• Uber
  • uses users' personal data to monitor which features of the service are used most
  • focuses on supply and demand of the services
  • finds the best routes depending on factors such as traffic, location, time, etc.
  • fixes fares
Big data analytics: Case Studies
• Walmart
  • uses transaction data to discover actionable patterns
  • provides product recommendations
    • which products were bought together
  • organizes products so that customers can easily find them
    • e.g., bread and butter in the same rack
Big data analytics: Case Studies
• LinkedIn
  • uses big data to develop product offerings such as
    • people you may know
    • who has viewed your profile
    • jobs you may be interested in, and more
  • uses network/graph data to
    • analyze profiles
    • suggest opportunities according to qualifications and interests
Big data analytics: Case Studies
• Healthcare
  • detecting disease outbreaks
  • predicting mental health from social media data
  • human genome sequencing
    • identifying, mapping, and sequencing all of the genes of the human genome from both a physical and a functional standpoint
    • can be used to identify diseases and enable better treatment
Focus of this course

[Figure: the “big data stack” with three layers, top to bottom: Data Science Tools, Analytics Infrastructure, Execution Infrastructure. “This Course” covers the Analytics Infrastructure and Execution Infrastructure layers.]


Buzzwords

[Figure: the “big data stack” again, with buzzwords attached to each layer.]
• Data Science Tools: data analytics, business intelligence, OLAP, ETL, data warehouses and data lakes
• Execution Infrastructure: MapReduce, Spark, noSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm
• Topics covered in this course:
  • Text: frequency estimation, language models, inverted indexes
  • Graphs: graph traversals, random walks (PageRank)
  • Relational data: SQL, joins, column stores
  • Data mining: hashing, clustering (k-means), classification, recommendations
  • Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

This course focuses on algorithm design and “programming at scale”
What is the Goal of Big Data Processing?

• Finding useful patterns/insights/models from large data in a reasonable amount of time

• The primary focus is on efficiency as well as on information quality
Scalability of an algorithm

• Growth of its complexity with the problem size
  • time to finish
  • memory requirement

• How well can it handle big data?


Two Common Routes to Scalability
1. Improving algorithmic efficiency
   • sampling techniques
   • efficient data structures and algorithms

2. Parallel processing
   • Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores
   • Scale-out architecture: a cluster of low-cost computers
     • Hadoop, MapReduce, Spark

(We will study all of these in detail.)
Scalability: Data Clustering

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign each member to the closest center

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively.)
K-means: Illustration

[Figure slides: initialize centers randomly → assign points to nearest center → readjust centers → assign points to nearest center → readjust centers → …]
K-means: The Expensive Part

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign each member to the closest center
        Cost per iteration: 10^9 vectors × 10^4 centers × 10^3 dimensions = 10^16 distance operations

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively; step 2 dominates the cost.)
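A minimal sketch of these steps in Python with NumPy, using small made-up sizes (the toy dimensions and variable names are illustrative, not from the lecture); the distance computation in STEP 2 is the part whose cost grows as n × k × d:

import numpy as np

# Toy sizes; the lecture's example has n = 10^9 vectors, k = 10^4 clusters, d = 10^3.
n, k, d = 10_000, 20, 50

rng = np.random.default_rng(0)
points = rng.normal(size=(n, d))                     # the data vectors
centers = points[rng.choice(n, size=k, replace=False)]  # STEP 1: initial centers

for _ in range(10):                                  # iterate STEPs 2 and 3
    # STEP 2 (the expensive part): squared distance of every point to every
    # center, an (n x k) matrix -- roughly n * k * d operations per iteration.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)

    # STEP 3: recompute each center as the mean of its assigned points.
    for j in range(k):
        members = points[assignment == j]
        if len(members):
            centers[j] = members.mean(axis=0)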
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means

• Take a random sample from the data

• Apply k-means to the sample to produce approximate centroids

• Key assumption:
  • the centroids computed from the random sample are very close to the centroids of the original data

Reference: Selective Search: Efficient and Effective Search of Large Textual Collections, by Kulkarni and Callan, ACM TOIS, 2016
Illustration: Sampling-based k-means

[Figure: take a random sample (30%) of the data → run k-means on the sample → obtain approximate centroids → assign clusters to the original data. The approximate centroids are close to the original centroids.]
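A minimal sketch of the sampling-based approach, assuming scikit-learn is available (the 30% sample rate mirrors the illustration; the data sizes are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 50))   # stand-in for the full (huge) dataset

# 1. Take a random sample from the data (30%, as in the illustration).
idx = rng.choice(len(data), size=int(0.3 * len(data)), replace=False)
sample = data[idx]

# 2. Run k-means on the sample only to get approximate centroids.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(sample)

# 3. Assign every point of the original data to its nearest approximate centroid.
labels = km.predict(data)

# Key assumption: km.cluster_centers_ is close to the centroids that
# k-means on the full dataset would have produced.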
Solution 2: Parallel Processing

The k-means steps are unchanged (STEP 1: start with k initial centers; STEP 2: assign each member to the closest center; STEP 3: recalculate the centers; iterate steps 2 and 3), but the work is parallelized:

1. Split the data into small chunks
2. Process each chunk on different cores / nodes in a cluster
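A minimal single-machine sketch of the chunking idea using Python's multiprocessing pool (chunk count and data sizes are illustrative; on a real cluster the chunks would be distributed across nodes, e.g., with MapReduce or Spark):

import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """STEP 2 for one chunk: the index of the closest center for each point."""
    chunk, centers = args
    dists = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100_000, 20))
    centers = data[rng.choice(len(data), size=10, replace=False)]

    # 1. Split the data into small chunks.
    chunks = np.array_split(data, 8)

    # 2. Process each chunk on a different core (on a cluster: a different node).
    with Pool(processes=8) as pool:
        parts = pool.map(assign_chunk, [(c, centers) for c in chunks])

    assignment = np.concatenate(parts)

    # STEP 3 (recomputing the centers) can likewise be done per chunk by
    # combining partial sums and counts from each worker.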
Pros and Cons
• Sampling-based method
  • Pros: fast processing
  • Cons: lossy inference, often lower accuracy

• Scale-up architecture
  • Pros: fast processing (if the data fits into RAM)
  • Cons: risk of data loss on system failure; scalability issues for very large data

• Scale-out architecture
  • Pros: can handle very large data, fault tolerant
  • Cons: communication bottleneck, difficulty in writing code
How to Tackle Big Data?

Source: Google
Divide and Conquer: the good, old, and reliable friend

[Figure: the “Work” is partitioned into units w1, w2, w3; each unit is handled by a worker; the workers produce partial results r1, r2, r3, which are combined into the “Result”.]
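A minimal sketch of the partition → workers → combine pattern, here as a word count over an in-memory list of lines using Python's multiprocessing (the data and the chunking scheme are illustrative):

from collections import Counter
from multiprocessing import Pool

def worker(lines):
    """One worker processes its work unit w_i and returns a partial result r_i."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # The "Work": a pile of text lines standing in for a large corpus.
    work = ["big data processing", "divide and conquer", "big data again"] * 1000

    # Partition the work into units w1, w2, w3.
    n_workers = 3
    units = [work[i::n_workers] for i in range(n_workers)]

    # Each worker handles one unit in parallel.
    with Pool(n_workers) as pool:
        partial_results = pool.map(worker, units)

    # Combine the partial results r1, r2, r3 into the final "Result".
    result = Counter()
    for r in partial_results:
        result.update(r)

    print(result.most_common(3))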
What are the Challenges?
• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


A Critical Issue

• Parallelization problems arise from:
  • communication between workers (e.g., to exchange state)
  • access to shared resources (e.g., data)

• Thus, we need a synchronization mechanism

[Image: “Bad synchronization!!” (Source: Ricardo Guimarães Herrmann)]
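A minimal sketch of why a synchronization mechanism is needed when workers touch shared state, using Python threads and a lock (the counter is just a stand-in for any shared resource):

import threading

counter = 0               # shared resource accessed by every worker
lock = threading.Lock()   # the synchronization mechanism

def worker(n_updates):
    global counter
    for _ in range(n_updates):
        # Without the lock, this read-modify-write can interleave across
        # workers and updates get lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)            # 400000 with the lock; typically less without it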


Old Tools for Big Data Processing

• Programming models
  • Shared memory (pthreads)
  • Message passing (MPI)

• Design patterns
  • Master-slaves
  • Producer-consumer flows
  • Shared work queues

[Figures: shared memory (processes P1-P5 reading and writing a common memory) vs. message passing (processes P1-P5 exchanging messages); a master coordinating slaves; producers and consumers connected by a work queue.]
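A minimal sketch of the producer-consumer / shared-work-queue pattern with Python threads and queue.Queue (the "work" here is just squaring numbers; the counts are illustrative):

import queue
import threading

work_queue = queue.Queue()   # the shared work queue
results = queue.Queue()

def producer(n_items):
    for i in range(n_items):
        work_queue.put(i)    # enqueue work units

def consumer():
    while True:
        item = work_queue.get()
        if item is None:     # sentinel: no more work for this consumer
            break
        results.put(item * item)   # stand-in for real processing

consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in consumers:
    t.start()

producer(100)
for _ in consumers:          # one sentinel per consumer
    work_queue.put(None)
for t in consumers:
    t.join()

print(results.qsize())       # 100 processed work units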
Thank you so much!!
