Intro Big Data
Jiaul Paik
Lecture 3
Characteristics of Big Data
• Volume:
• the size and amounts of big data
• Variety:
• the diversity and range of different data types (unstructured data, semi-structured data, video, images, etc.)
• Variability:
• the changing nature of the data companies seek to capture, manage and analyze
– e.g., in sentiment or text analytics
• Veracity:
• the “truth” or accuracy of data and information
• Value:
• the value of big data comes from recognizing interesting patterns that lead to better decision making
Why big data?
• Science
• Engineering
• Commerce
• Society
Data-intensive Science
• Large Hadron Collider (particle collider): 15 petabytes/year
Big data analytics: Case Studies
• Google
• Uses big data to improve search quality (ranked results)
• Personalized ad recommendation
• Fixing fare
• Walmart
• Uses transaction data to discover usable patterns (e.g., which products were bought together)
• Provides product recommendations
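The "bought together" pattern can be sketched as counting how often each pair of products shares a transaction. This is a minimal illustration only: the `transactions` list is hypothetical toy data, not Walmart data, and real systems use far more scalable methods.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (assumption for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count every unordered product pair that appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen in at least two baskets are candidate "bought together" patterns.
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent_pairs)
```

Sorting each basket before forming pairs ensures that (bread, milk) and (milk, bread) are counted as the same pair.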
This Course
[Figure: the “big data stack”, with layers Data Science Tools, Analytics Infrastructure, and Execution Infrastructure]
• Topics
– Relational data: SQL, joins, column stores
– Data mining: hashing, clustering (k-means), classification, recommendations
– Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)
• Systems
– MapReduce, Spark, noSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm
2. Parallel Processing
• Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores
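• Key assumption (when clustering a random sample instead of the full data):
• The centroids computed from the random sample are very close to the centroids of the original data
[Figure: k-means on a 30% random sample gives approximate centroids close to the original centroids]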
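The key assumption above can be checked on synthetic data. The sketch below is an assumption-laden illustration, not the lecture's code: it uses a plain Lloyd-style k-means with a deterministic farthest-first initialization (chosen here only to keep the run reproducible), three well-separated synthetic clusters, and a 30% random sample.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=10):
    """Lloyd's iterations with farthest-first initialization; returns sorted centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

# Synthetic data (assumption): 300 points around each of three distant centers.
rng = random.Random(0)
centers = [(0.0, 0.0), (20.0, 20.0), (40.0, 0.0)]
data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(300)]
sample = rng.sample(data, len(data) * 3 // 10)  # 30% random sample

full_centroids = kmeans(data, 3)
approx_centroids = kmeans(sample, 3)
max_gap = max(dist2(a, b) ** 0.5 for a, b in zip(approx_centroids, full_centroids))
print(f"largest gap between sample and full-data centroids: {max_gap:.3f}")
```

On clean, well-separated clusters like these the gap is small relative to the cluster spacing; on skewed or noisy data the assumption may hold less well.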
Solution 2: Parallel Processing
• Scale-up Architecture
• Pros: Fast processing (if data fits into RAM)
• Cons: Risk of data loss (system failure), scalability issues for very large data
• Scale-out architecture
• Pros: Can handle very large data, fault tolerant
• Cons: Communication bottleneck, difficulty in writing code
How to Tackle Big Data?
Source: Google
Divide and Conquer: the good, old and reliable friend
[Figure: “Work” is partitioned into units w1, w2, w3; these yield partial results r1, r2, r3, which are combined into the “Result”]
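The partition/combine flow can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `partition`, `worker`, and `combine` are hypothetical, the "work" is a sum of squares, and a thread pool stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n_parts):
    """Split a list into n_parts contiguous, nearly equal chunks (w1, w2, ...)."""
    size, extra = divmod(len(work), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(work[start:end])
        start = end
    return chunks

def worker(chunk):
    """Process one work unit and return its partial result (r1, r2, ...)."""
    return sum(x * x for x in chunk)

def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)

work = list(range(1, 101))
chunks = partition(work, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, chunks))
result = combine(partials)
print(result)
```

The pattern works here because the partial sums are independent and the combine step is associative; work that needs shared state between units is harder to split this way.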
What are the Challenges?
• How do we assign work units to workers?
• Programming models
• Shared memory (pthreads)
• Message passing (MPI)
[Figure: shared memory vs. message passing; labels: Memory, P1 to P5]
• Design Patterns
• Master-slaves
• Producer-consumer flows
• Shared work queues
[Figure: design patterns; labels: master, slaves, producer, consumer, work queue]
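A shared work queue is one way to assign work units to workers: a master (producer) puts tasks on the queue and workers (consumers) pull whichever task is next. The sketch below is illustrative only. It assumes Python threads in place of pthreads or MPI processes, and `run_work_queue` is a hypothetical helper name.

```python
import queue
import threading

def run_work_queue(tasks, n_workers=3):
    """Master puts tasks on a shared queue; workers pull and process them."""
    work_q = queue.Queue()
    results = []
    lock = threading.Lock()  # guards the shared results list

    def worker():
        while True:
            item = work_q.get()
            if item is None:  # sentinel: no more work for this worker
                break
            with lock:
                results.append((item, item * item))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:      # master / producer side
        work_q.put(task)
    for _ in threads:       # one sentinel per worker
        work_q.put(None)
    for t in threads:
        t.join()
    return sorted(results)

squares = run_work_queue(range(10))
print(squares)
```

The queue balances load automatically, since a fast worker simply takes more tasks. The lock around `results` shows the shared-memory side of the design, while the queue itself behaves like a simple message channel.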
Thank you so much!!