
Big Data Processing

Jiaul Paik
Lecture 3
Characteristics of Big Data
• Volume:
  • the size and amount of data

• Variety:
  • the diversity and range of data types (unstructured data, semi-structured data, video, images, etc.)

• Velocity (also known as streaming data):
  • the speed at which data arrives
  • e.g., social media posts or search queries received within a day/hour

• Variability:
  • the changing nature of the data companies seek to capture, manage, and analyze
  • e.g., in sentiment or text analytics

• Veracity:
  • the “truth” or accuracy of data and information

• Value:
  • the value of big data comes from discovering interesting patterns that lead to better decision making
Why big data?
• Science
• Engineering
• Commerce
• Society

Source: Wikipedia (Everest)


Science

Data-intensive science
• Large Hadron Collider (particle collider): 15 petabytes of data per year

Maximilien Brice, © CERN


Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …

Source: Wikipedia (Three Gorges Dam)


Language Translation

Big data analytics: Case Studies
• Google
  • uses big data to improve search quality (ranked results)
  • personalized ad recommendation
  • YouTube: personalized video recommendation
  • directions in Google Maps
  • Google Translate: language translation

Big data analytics: Case Studies
• Netflix
  • on-demand streaming video for its customers
  • predicts what customers will enjoy watching using big data
  • uses lots of historical data
    • who watched what
    • genre of the movie
Big data analytics: Case Studies
• Uber
  • uses users' personal data to monitor which features of the service are used most
  • focuses on supply and demand of the services
  • finds the best routes depending on factors such as traffic, location, time, etc.
  • fixes fares
Big data analytics: Case Studies
• Walmart
  • uses transaction data to discover actionable patterns
  • provides product recommendations
    • which products were bought together
  • organizes products so that customers can easily find them
    • e.g., bread and butter in the same rack
Big data analytics: Case Studies
• LinkedIn
  • uses big data to develop product offerings such as
    • people you may know
    • who has viewed your profile
    • jobs you may be interested in, and more
  • uses network/graph data to
    • analyze profiles
    • suggest opportunities according to qualifications and interests
Big data analytics: Case Studies
• Healthcare
  • detecting disease outbreaks
  • predicting mental health from social media data
  • human genome sequencing
    • identifying, mapping, and sequencing all of the genes of the human genome from both a physical and a functional standpoint
    • can be used to identify diseases and enable better treatment
Focus of this course

[Figure: the “big data stack” with three layers, top to bottom: Data Science Tools, Analytics Infrastructure, Execution Infrastructure. “This Course” covers the Analytics Infrastructure and Execution Infrastructure layers.]


Buzzwords

[Figure: the “big data stack” again, with buzzwords attached to each layer.]
• Data Science Tools: data analytics, business intelligence, OLAP, ETL, data warehouses and data lakes
• Execution Infrastructure: MapReduce, Spark, noSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm
• Topics covered in this course:
  • Text: frequency estimation, language models, inverted indexes
  • Graphs: graph traversals, random walks (PageRank)
  • Relational data: SQL, joins, column stores
  • Data mining: hashing, clustering (k-means), classification, recommendations
  • Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

This course focuses on algorithm design and “programming at scale”
What is the Goal of Big Data Processing?

• Finding useful patterns/insights/models from large data in a reasonable amount of time

• The primary focus is on efficiency as well as on information quality
Scalability of an algorithm

• Growth of its complexity with the problem size
  • time to finish
  • memory requirement

• How well can it handle big data?


Two Common Routes to Scalability
1. Improving algorithmic efficiency
   • sampling techniques
   • efficient data structures and algorithms

2. Parallel processing
   • Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores
   • Scale-out architecture: a cluster of low-cost computers
     • Hadoop, MapReduce, Spark

(We will study all of these in detail.)
Scalability: Data Clustering

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign each member to the closest center

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively.)
K-means: Illustration

[Figure slides: initialize centers randomly → assign points to nearest center → readjust centers → assign points to nearest center → readjust centers → …]
K-means: The Expensive Part

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (that is why it is called k-means)

STEP 2: Assign each member to the closest center
        Cost per iteration: 10^9 vectors × 10^4 centers × 10^3 dimensions = 10^16 distance operations

STEP 3: Recalculate the centers

(Steps 2 and 3 are repeated iteratively; step 2 dominates the cost.)
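A minimal sketch of these steps in Python with NumPy, using small made-up sizes (the toy dimensions and variable names are illustrative, not from the lecture); the distance computation in STEP 2 is the part whose cost grows as n × k × d:

import numpy as np

# Toy sizes; the lecture's example has n = 10^9 vectors, k = 10^4 clusters, d = 10^3.
n, k, d = 10_000, 20, 50

rng = np.random.default_rng(0)
points = rng.normal(size=(n, d))                     # the data vectors
centers = points[rng.choice(n, size=k, replace=False)]  # STEP 1: initial centers

for _ in range(10):                                  # iterate STEPs 2 and 3
    # STEP 2 (the expensive part): squared distance of every point to every
    # center, an (n x k) matrix -- roughly n * k * d operations per iteration.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)

    # STEP 3: recompute each center as the mean of its assigned points.
    for j in range(k):
        members = points[assignment == j]
        if len(members):
            centers[j] = members.mean(axis=0)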
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means

• Take a random sample from the data

• Apply k-means to the sample to produce approximate centroids

• Key assumption:
  • the centroids computed from the random sample are very close to the centroids of the original data

Reference: Selective Search: Efficient and Effective Search of Large Textual Collections, by Kulkarni and Callan, ACM TOIS, 2016
Illustration: Sampling-based k-means

[Figure: take a random sample (30%) of the data → run k-means on the sample → obtain approximate centroids → assign clusters to the original data. The approximate centroids are close to the original centroids.]
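A minimal sketch of the sampling-based approach, assuming scikit-learn is available (the 30% sample rate mirrors the illustration; the data sizes are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 50))   # stand-in for the full (huge) dataset

# 1. Take a random sample from the data (30%, as in the illustration).
idx = rng.choice(len(data), size=int(0.3 * len(data)), replace=False)
sample = data[idx]

# 2. Run k-means on the sample only to get approximate centroids.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(sample)

# 3. Assign every point of the original data to its nearest approximate centroid.
labels = km.predict(data)

# Key assumption: km.cluster_centers_ is close to the centroids that
# k-means on the full dataset would have produced.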
Solution 2: Parallel Processing

The k-means steps are unchanged (STEP 1: start with k initial centers; STEP 2: assign each member to the closest center; STEP 3: recalculate the centers; iterate steps 2 and 3), but the work is parallelized:

1. Split the data into small chunks
2. Process each chunk on different cores / nodes in a cluster
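A minimal single-machine sketch of the chunking idea using Python's multiprocessing pool (chunk count and data sizes are illustrative; on a real cluster the chunks would be distributed across nodes, e.g., with MapReduce or Spark):

import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """STEP 2 for one chunk: the index of the closest center for each point."""
    chunk, centers = args
    dists = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100_000, 20))
    centers = data[rng.choice(len(data), size=10, replace=False)]

    # 1. Split the data into small chunks.
    chunks = np.array_split(data, 8)

    # 2. Process each chunk on a different core (on a cluster: a different node).
    with Pool(processes=8) as pool:
        parts = pool.map(assign_chunk, [(c, centers) for c in chunks])

    assignment = np.concatenate(parts)

    # STEP 3 (recomputing the centers) can likewise be done per chunk by
    # combining partial sums and counts from each worker.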
Pros and Cons
• Sampling-based method
  • Pros: fast processing
  • Cons: lossy inference, often lower accuracy

• Scale-up architecture
  • Pros: fast processing (if the data fits into RAM)
  • Cons: risk of data loss on system failure; scalability issues for very large data

• Scale-out architecture
  • Pros: can handle very large data, fault tolerant
  • Cons: communication bottleneck, difficulty in writing code
How to Tackle Big Data?

Source: Google
Divide and Conquer: the good, old, and reliable friend

[Figure: the “Work” is partitioned into units w1, w2, w3; each unit is handled by a worker; the workers produce partial results r1, r2, r3, which are combined into the “Result”.]
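A minimal sketch of the partition → workers → combine pattern, here as a word count over an in-memory list of lines using Python's multiprocessing (the data and the chunking scheme are illustrative):

from collections import Counter
from multiprocessing import Pool

def worker(lines):
    """One worker processes its work unit w_i and returns a partial result r_i."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # The "Work": a pile of text lines standing in for a large corpus.
    work = ["big data processing", "divide and conquer", "big data again"] * 1000

    # Partition the work into units w1, w2, w3.
    n_workers = 3
    units = [work[i::n_workers] for i in range(n_workers)]

    # Each worker handles one unit in parallel.
    with Pool(n_workers) as pool:
        partial_results = pool.map(worker, units)

    # Combine the partial results r1, r2, r3 into the final "Result".
    result = Counter()
    for r in partial_results:
        result.update(r)

    print(result.most_common(3))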
What are the Challenges?
• How do we assign work units to workers?

• What if we have more work units than workers?

• What if workers need to share partial results?

• How do we aggregate partial results?

• How do we know all the workers have finished?

• What if workers die?


A Critical Issue

• Parallelization problems arise from:
  • communication between workers (e.g., to exchange state)
  • access to shared resources (e.g., data)

• Thus, we need a synchronization mechanism

[Image: “Bad synchronization!!” (Source: Ricardo Guimarães Herrmann)]
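A minimal sketch of why a synchronization mechanism is needed when workers touch shared state, using Python threads and a lock (the counter is just a stand-in for any shared resource):

import threading

counter = 0               # shared resource accessed by every worker
lock = threading.Lock()   # the synchronization mechanism

def worker(n_updates):
    global counter
    for _ in range(n_updates):
        # Without the lock, this read-modify-write can interleave across
        # workers and updates get lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)            # 400000 with the lock; typically less without it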


Old Tools for Big Data Processing

• Programming models
  • Shared memory (pthreads)
  • Message passing (MPI)

• Design patterns
  • Master-slaves
  • Producer-consumer flows
  • Shared work queues

[Figures: shared memory (processes P1-P5 reading and writing a common memory) vs. message passing (processes P1-P5 exchanging messages); a master coordinating slaves; producers and consumers connected by a work queue.]
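A minimal sketch of the producer-consumer / shared-work-queue pattern with Python threads and queue.Queue (the "work" here is just squaring numbers; the counts are illustrative):

import queue
import threading

work_queue = queue.Queue()   # the shared work queue
results = queue.Queue()

def producer(n_items):
    for i in range(n_items):
        work_queue.put(i)    # enqueue work units

def consumer():
    while True:
        item = work_queue.get()
        if item is None:     # sentinel: no more work for this consumer
            break
        results.put(item * item)   # stand-in for real processing

consumers = [threading.Thread(target=consumer) for _ in range(3)]
for t in consumers:
    t.start()

producer(100)
for _ in consumers:          # one sentinel per consumer
    work_queue.put(None)
for t in consumers:
    t.join()

print(results.qsize())       # 100 processed work units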
Thank you so much!!
