Course Outline and Introduction
MINING OF MASSIVE DATASETS
Zareen Alamgir
COURSE OVERVIEW
Course Information
■ Instructor: Zareen Alamgir
■ Email: [email protected]
This Course
Analytics
– Apache Spark Infrastructure
■ Algorithms and Techniques (Tentative)
– Clustering
– Graphs: Link Analysis (PageRank) and Inverted Index
■ References
– Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
– Introduction to Data Mining, by P.-N. Tan, M. Steinbach and V. Kumar
– Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer
Tentative
■ Two Midterms 30%
■ Quizzes 10%
– 5 quizzes or more
■ Assignments/Project 10%
– Programming Assignments
– Project/Presentation
■ Final 50%
WHAT IS DATA MINING?
Knowledge discovery from data
http://data-mining.philippe-fournier-viger.com/introduction-data-mining/
Introduction
■ Data is growing at a phenomenal rate
– Web data, e‐commerce
– Purchases at department/grocery stores
– Bank/Credit Card transactions
– Scientific simulations
http://www.cs.science.cmu.ac.th
Data Mining and related Disciplines
■ Data mining overlaps with:
– Databases: Large-scale data, simple queries
– Machine learning: Small data, Complex models
– CS Theory: (Randomized) Algorithms
■ Different cultures:
– To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
■ Result is the query answer
– To an ML person, data mining is the inference of models
■ Result is the parameters of the model
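The two views above can be contrasted on one small dataset. The following is a minimal sketch, not taken from the slides: the data values and the function names `query_high_sales` and `fit_line` are invented for illustration. The query returns rows; the model fit returns parameters.

```python
# Toy data: (advertising spend, sales) pairs; the numbers are made up.
records = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1)]


def query_high_sales(rows, threshold):
    """Database view: the result is the set of rows satisfying a predicate."""
    return [row for row in rows if row[1] > threshold]


def fit_line(rows):
    """Machine-learning view: the result is the parameters (slope, intercept)
    of a least-squares line fitted to the rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


answer = query_high_sales(records, 7.0)   # a subset of the stored rows
slope, intercept = fit_line(records)      # numbers that are not stored anywhere
print(answer)
print(slope, intercept)
```

The query answer is always a subset of what is stored, while the fitted slope and intercept are new quantities inferred from the data.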
Data Mining and related Disciplines
■ Emphasis is on
– Scalability in the number of features and instances (massive data)
– Algorithms and architectures, whereas the foundations of the methods are provided by statistics and machine learning
– Automation for handling large, complex and heterogeneous data
Database vs Data Mining
Output
– Database: precise; a subset of the database
– Data mining: fuzzy; not a subset of the database
Data Mining Models and Tasks
■ Descriptive data mining:
– Describe general properties
■ Predictive data mining:
– Perform inference on available data to make predictions
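The difference between the two kinds of task can be sketched in a few lines. This is only an illustration under stated assumptions: the observations are invented, and the nearest-neighbour rule in `predict` is one simple stand-in for a predictive method, not the method the course prescribes.

```python
from collections import Counter
from statistics import mean

# Toy labelled data: (hours studied, outcome); the values are invented.
observations = [(1, "fail"), (2, "fail"), (3, "fail"),
                (6, "pass"), (7, "pass"), (9, "pass")]


def describe(rows):
    """Descriptive: summarise general properties of the data we already have."""
    hours = [h for h, _ in rows]
    labels = Counter(label for _, label in rows)
    return {"mean_hours": mean(hours), "label_counts": labels}


def predict(rows, hours):
    """Predictive: infer the label of an unseen value from the available data
    by copying the label of the nearest known observation (1-nearest neighbour)."""
    nearest = min(rows, key=lambda row: abs(row[0] - hours))
    return nearest[1]


summary = describe(observations)
guess = predict(observations, 8)
print(summary)
print(guess)
```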
What is this course about?
Mining of massive datasets:
extraction of actionable information from (usually) very large datasets
Distributed Infrastructure
• Hadoop
• HDFS
Programming Models
• MapReduce
– pioneered by Google
– popularized by Yahoo
• Spark
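To give a feel for the MapReduce programming model, here is a minimal single-process sketch of word count. It is not the Hadoop or Spark API; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical and only mirror the three conceptual steps.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or block) would be mapped on a
    # different machine; here everything runs in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data beats ideas"])
print(counts)
```

The programmer writes only the map and reduce functions; in a real framework the grouping step and the distribution across machines are handled by the runtime.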
Cluster Architecture
Parallelization Challenges
■ How do we assign work units to workers?
■ What if we have more work units than workers?
■ What if workers need to share partial results?
■ How do we aggregate partial results?
■ How do we know all the workers have finished?
■ What if workers die?
■ What is the common theme of all of these problems?
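Several of the questions above can be made concrete with a small sketch that uses a thread pool from Python's standard library. The job (summing chunks of numbers) and the names `process_unit` and `run_job` are made up for illustration. The sketch covers assignment, more units than workers, aggregation and completion; it does not address worker failure.

```python
from concurrent.futures import ThreadPoolExecutor


def process_unit(unit):
    """One work unit: sum a chunk of numbers (a stand-in for real work)."""
    return sum(unit)


def run_job(numbers, chunk_size, num_workers):
    # Split the input into work units; there may be more units than workers.
    units = [numbers[i:i + chunk_size]
             for i in range(0, len(numbers), chunk_size)]
    # The pool queues the units and hands each one to the next free worker.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Collecting the results into a list blocks until every unit is done,
        # which is how we know all workers have finished.
        partials = list(pool.map(process_unit, units))
    # Aggregate the partial results into the final answer.
    return len(units), sum(partials)


num_units, total = run_job(list(range(1, 101)), chunk_size=10, num_workers=3)
print(num_units, total)
```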
Common Theme?
■ Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
■ Thus, we need a synchronization mechanism
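As one example of a synchronization mechanism, the sketch below guards a shared counter with a lock so that concurrent workers do not overwrite each other's updates. The function name `count_with_lock` and the workload are assumptions chosen for illustration; a lock is only one of several possible mechanisms.

```python
import threading


def count_with_lock(num_workers, increments_per_worker):
    total = 0                  # shared state that every worker updates
    lock = threading.Lock()    # synchronization mechanism guarding it

    def worker():
        nonlocal total
        for _ in range(increments_per_worker):
            with lock:         # only one worker may update at a time
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait until every worker has finished
    return total


result = count_with_lock(num_workers=4, increments_per_worker=10_000)
print(result)
```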
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
Scale out
A large number of commodity low-end servers is more effective for data-intensive applications
Many data-intensive applications are not very processor-demanding
Why does this make sense for compute-intensive tasks?
What’s the issue for data-intensive tasks?
What’s the solution?
Don’t move data to workers… move workers to the data!
Key idea: co-locate storage and compute
Start up workers on the nodes that hold the data
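The idea of moving workers to the data can be sketched as a tiny locality-aware scheduler. This is an illustrative sketch only: the function `assign_tasks`, the block and node names, and the least-loaded tie-breaking rule are all assumptions, not how any particular system schedules tasks.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign each data block's task to a node that already stores the block.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    free_nodes: set of nodes that can currently accept work.
    Returns a dict mapping block id -> (node, is_local).
    """
    load = {node: 0 for node in free_nodes}
    plan = {}
    for block, holders in sorted(block_locations.items()):
        # Prefer nodes that hold the block, so no data has to be moved.
        local = [n for n in holders if n in free_nodes]
        # Fall back to any free node; the block must then be sent to it.
        candidates = local if local else sorted(free_nodes)
        # Among the candidates, pick the least-loaded node (ties: by name).
        node = min(candidates, key=lambda n: (load[n], n))
        load[node] += 1
        plan[block] = (node, bool(local))
    return plan


blocks = {"b1": ["node1", "node2"], "b2": ["node2", "node3"], "b3": ["node4"]}
plan = assign_tasks(blocks, free_nodes={"node1", "node2", "node3"})
print(plan)
```

In this run, blocks b1 and b2 are processed where they are stored, while b3 has no free holder and must be shipped to another node.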
[Figure: SAN and compute nodes]