Course Outline and Introduction

This course focuses on algorithm design and analysis for mining massive datasets. It covers infrastructure such as MapReduce, Hadoop, and Spark for distributed computing, as well as algorithms and techniques for tasks such as clustering, graph analysis, recommendation, and frequent pattern mining.


MINING OF MASSIVE DATASETS
Zareen Alamgir

COURSE OVERVIEW
Course Information
■ Instructor: Zareen Alamgir
■ Email: [email protected]

■ Course information and updates will be posted on Google Classroom:
– Tentative schedule
– News and announcements
– Lecture Slides
– Assignments
– Books and Reading Material
Course Content
■ Introduction
■ Infrastructure for Massive Data
– MapReduce (very brief)
– Hadoop, HDFS
– Apache Spark
■ Algorithms and Techniques (Tentative)
– Clustering
– Graphs: Link Analysis (PageRank) and Inverted Index
– Finding Similar Items (Locality Sensitive Hashing)
– Large-Scale Machine Learning (Decision Trees)
– Recommendation Systems (ALS)
– Frequent Pattern Mining

This course focuses on algorithm design and “thinking at scale”.

[Slide figure: data science stack (Data Science Tools, Analytics, Execution Infrastructure) marking where this course fits]
Textbooks and Readings
■ Textbooks
– Mining of Massive Datasets, by Anand Rajaraman, Jure Leskovec, and Jeff Ullman
– Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber

■ References
– Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
– Introduction to Data Mining, by P.-N. Tan, M. Steinbach, and V. Kumar
– Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer

■ All textbooks are free to download


■ We will also cover important research papers and tutorials
Prerequisites

Students should have a good background in:

– Programming and Data structures


– Database Systems (familiarity with SQL queries)
Tentative Grading Scheme

■ Two Midterms 30%
■ Quizzes 10%
– 5 quizzes or more
■ Assignments/Project 10%
– Programming Assignments
– Project/Presentation

■ Final 50%
WHAT IS DATA MINING?
Knowledge discovery from data

http://data-mining.philippe-fournier-viger.com/introduction-data-mining/
Introduction
■ Data is growing at a phenomenal rate
– Web data, e‐commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Scientific simulations

DATA MINING: UNCOVER HIDDEN INFORMATION

We are drowning in data but starving for knowledge!
Information “hidden” in the data is not readily evident, and human analysts can take weeks to discover useful information.
https://whatsthebigdata.com
What is Data Mining
■ Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

http://www.cs.science.cmu.ac.th
Data Mining and related Disciplines
■ Data mining overlaps with:
– Databases: Large-scale data, simple queries
– Machine learning: Small data, Complex models
– CS Theory: (Randomized) Algorithms
■ Different cultures:
– To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
■ Result is the query answer
– To an ML person, data mining is the inference of models
■ Result is the parameters of the model
Data Mining and related Disciplines
■ Emphasis is on
– scalability in the number of features and instances (massive data)
– algorithms and architectures, whereas the foundations of the methods are provided by statistics and machine learning
– automation for handling large, complex, and heterogeneous data
Database vs Data Mining

Database query: Find all credit applicants with last name of Smith.
Data mining: Find all credit applicants who are poor credit risks. (classification)

Database query: Identify customers who have purchased more than $10,000 in the last month.
Data mining: Identify customers with similar buying habits. (clustering)

Database query: Find all customers who have purchased milk.
Data mining: Find all items which are frequently purchased with milk. (association rules)
Database Processing vs. Data Mining Processing

Query
• Database: well defined; SQL
• Data mining: poorly defined; no precise query language

Output
• Database: precise; a subset of the database
• Data mining: fuzzy; not a subset of the database
Data Mining Models and Tasks
■ Descriptive data mining:
– Describes general properties of the data
■ Predictive data mining:
– Performs inference on the available data in order to make predictions
What is this course about?
Mining of massive datasets

What is this course about?
Extraction of actionable information from (usually) very large datasets

It’s not all about machine learning, but most of it is!

• Emphasis is on algorithms that scale
• Parallelization is often essential
DISTRIBUTED COMPUTING FOR DATA MINING
What is Massive/Big Data?
Too big: petabyte-scale collections or lots of big data sets

Too hard: does not fit neatly in an existing tool


• Data sets that need to be cleaned, processed and integrated
• E.g., Twitter, news, customer transactions

Too fast: needs to be processed quickly


Single-node Architecture

[Slide figure: a single machine running Data Analysis, Data Mining, and Machine Learning]
Motivation: Google Example
20+ billion web pages x 20KB = 400+ TB

1 computer reads 30-35 MB/sec from disk


• ~4 months to read the web

~1,000 hard drives to store the web

Takes even more to do something useful with the data!

A standard architecture for such problems is emerging


• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
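To make the slide's back-of-the-envelope numbers concrete, here is a minimal sketch that reproduces the arithmetic; the figures (20 billion pages, 20 KB per page, 30 MB/s per disk) are the slide's assumptions, not measurements:

```python
# Back-of-the-envelope calculation behind the slide's numbers
# (all figures are assumptions taken from the slide).
pages = 20e9                 # ~20 billion web pages
page_size = 20 * 1024        # ~20 KB per page, in bytes
total_bytes = pages * page_size
print(total_bytes / 1e12)    # ~409.6 TB of raw data

read_rate = 30e6             # ~30 MB/s sequential read from a single disk
seconds = total_bytes / read_rate
print(seconds / (60 * 60 * 24 * 30))   # roughly 4 to 5 months for one machine to read it all
```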
How to handle massive data?
Platforms for Large-scale Data Mining

Distributed Infrastructure
• Hadoop
• HDFS

Programming Models
• MapReduce (pioneered by Google, popularized by Yahoo)
• Spark
Cluster Architecture
Parallelization Challenges
■ How do we assign work units to workers?
■ What if we have more work units than
workers?
■ What if workers need to share partial
results?
■ How do we aggregate partial results?
■ How do we know all the workers have
finished?
■ What if workers die?
■ What is the common theme of all of these problems?
Common Theme?
■ Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
■ Thus, we need synchronization mechanisms:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
■ Still, lots of problems remain:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
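As a toy illustration of why shared state forces synchronization, here is a minimal Python sketch (names and data are purely illustrative) in which workers accumulate partial sums into a shared counter guarded by a lock:

```python
# Minimal sketch: several workers update a shared counter; a lock keeps
# the read-modify-write of the shared state from racing.
import threading

counter = 0
lock = threading.Lock()

def worker(items):
    global counter
    local = sum(items)        # compute the partial result privately (no synchronization needed)
    with lock:                # synchronize only when touching shared state
        counter += local

threads = [threading.Thread(target=worker, args=(range(i, i + 10),))
           for i in range(0, 100, 10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                # deterministic result despite concurrent updates
```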
Current Tools
■ What if workers need to share partial results?
■ Programming models
– Shared memory (pthreads)
– Message passing (MPI)
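For the message-passing model, a minimal sketch using mpi4py (an assumption; MPI programs are often written in C or Fortran instead) in which each process computes a partial sum and the partial results are aggregated on one process:

```python
# Minimal message-passing sketch with mpi4py.
# Run with, e.g.: mpiexec -n 4 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each worker computes a partial result on its own slice of the data.
partial = sum(range(rank * 1000, (rank + 1) * 1000))

# reduce() gathers and combines the partial sums on the root process (rank 0).
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)
```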
When Theory Meets Practice
Concurrency is already difficult to reason about…

Now throw in:


The scale of clusters and (multiple) datacenters
The presence of hardware failures and software bugs
The presence of multiple interacting services

The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything

Bottom line: it’s hard!


Big Ideas: Abstract System-Level Details

It’s all about the right level of abstraction.

MapReduce isolates developers from system-level details: it separates the what from the how!

• The programmer defines what computations are to be performed
• The MapReduce execution framework takes care of how the computations are carried out
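A minimal single-process sketch of this separation (illustrative only, not the Hadoop API): the programmer supplies just the map and reduce functions, while a tiny driver stands in for the framework's partitioning, shuffling, and execution machinery:

```python
# Toy word-count in the MapReduce style: map_fn and reduce_fn are the
# "what"; the mapreduce() driver simulates the "how" (shuffle/group-by-key).
from collections import defaultdict

def map_fn(doc):
    # Emit (word, 1) for every word in the document.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all values emitted for the same key.
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                          # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)         # "shuffle": group values by key
    return [reduce_fn(k, v) for k, v in groups.items()]   # "reduce" phase

print(mapreduce(["big data", "big clusters", "data mining"]))
# [('big', 2), ('data', 2), ('clusters', 1), ('mining', 1)]
```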
Big Ideas: Scale Out vs. Scale Up
Scale up: a small number of high-end servers
• Symmetric multi-processing (SMP) machines with large shared memory
• Not cost-effective: the cost of machines does not scale linearly, and no single SMP machine is big enough

Scale out: a large number of commodity low-end servers is more effective for data-intensive applications
• e.g., 8 128-core machines vs. 128 8-core machines
Big Ideas: Failures are Common
■ Suppose a cluster is built using machines with a mean time between failures (MTBF) of 1,000 days
■ For a 10,000-server cluster, that is on average 10 failures per day!
■ MapReduce and Spark implementations cope with failures
– Automatic task restarts
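The expected-failure figure is just a ratio; a one-line sanity check under the slide's assumed MTBF:

```python
# Expected machine failures per day for the slide's assumed cluster.
servers = 10_000        # cluster size
mtbf_days = 1_000       # mean time between failures per machine, in days
print(servers / mtbf_days)   # -> 10.0 failures per day, on average
```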
Big Ideas: Move Processing to Data
■ Supercomputers often have processing nodes and storage nodes
– Computationally expensive tasks
– High-capacity interconnect to move data around
– Data movement leads to a bottleneck in the network!

Why does this make sense for compute-intensive tasks?
What’s the issue for data-intensive tasks? Many data-intensive applications are not very processor-demanding.

What’s the solution?
Don’t move data to workers… move workers to the data!
Key idea: co-locate storage and compute
Start up workers on the nodes that hold the data

[Slide figure: compute nodes connected to a SAN (storage area network)]
What’s the solution?
Don’t move data to workers… move workers to the data!
Key idea: co-locate storage and compute
Start up workers on the nodes that hold the data

We need a distributed file system for managing this


GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
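As a concrete, hedged illustration of co-locating compute with data: in a PySpark job that reads from HDFS, the scheduler tries to run each task on a node that already holds the corresponding HDFS block instead of shipping the data to the computation. The paths below are hypothetical and the sketch assumes a running Spark-on-HDFS cluster:

```python
# Illustrative PySpark word count over data stored in HDFS.
# Spark prefers to launch each task on a node holding the HDFS block it
# will read (data locality). The HDFS paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/webcrawl/part-*")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/webcrawl_wordcounts")
```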
