Course Outline and Introduction

This course focuses on algorithm design and analysis for mining massive datasets. It covers infrastructure such as MapReduce, Hadoop, and Spark for distributed computing, as well as algorithms and techniques for tasks such as clustering, graph analysis, recommendation, and frequent pattern mining.


MINING OF MASSIVE DATASETS
Zareen Alamgir

COURSE OVERVIEW
Course Information
■ Instructor: Zareen Alamgir
■ Email: [email protected]

■ Course information and updates will be posted on Google Classroom:
– Tentative schedule
– News and announcements
– Lecture Slides
– Assignments
– Books and Reading Material
Course Content
■ Introduction
■ Infrastructure for Massive Data
– MapReduce (very brief)
– Hadoop, HDFS
– Apache Spark
■ Algorithms and Techniques (Tentative)
– Clustering
– Graphs: Link Analysis (PageRank) and Inverted Index
– Finding Similar Items (Locality Sensitive Hashing)
– Large-Scale Machine Learning (Decision Trees)
– Recommendation Systems (ALS)
– Frequent Pattern Mining

This course focuses on algorithm design and “thinking at scale”.

[Slide figure: data science stack (Data Science Tools, Analytics, Execution Infrastructure) marking where this course fits]
Textbooks and Readings
■ Textbooks
– Mining of Massive Datasets, by Anand Rajaraman, Jure Leskovec, and Jeff Ullman
– Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber

■ References
– Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia
– Introduction to Data Mining, by P.-N. Tan, M. Steinbach, and V. Kumar
– Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer

■ All textbooks are free to download


■ We will also cover important research papers and tutorials
Prerequisites

Students should have a good background in:

– Programming and Data structures


– Database Systems (familiarity with SQL queries)
Tentative Grading Scheme

■ Two Midterms 30%
■ Quizzes 10%
– 5 quizzes or more
■ Assignments/Project 10%
– Programming Assignments
– Project/Presentation

■ Final 50%
WHAT IS DATA MINING?
Knowledge discovery from data

http://data-mining.philippe-fournier-viger.com/introduction-data-mining/
Introduction
■ Data is growing at a phenomenal rate
– Web data, e‐commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Scientific simulations

DATA MINING: UNCOVER HIDDEN INFORMATION

We are drowning in data but starving for knowledge!
Information “hidden” in the data is not readily evident, and human analysts can take weeks to discover useful information.
https://whatsthebigdata.com
What is Data Mining
■ Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

http://www.cs.science.cmu.ac.th
Data Mining and related Disciplines
■ Data mining overlaps with:
– Databases: Large-scale data, simple queries
– Machine learning: Small data, Complex models
– CS Theory: (Randomized) Algorithms
■ Different cultures:
– To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
■ Result is the query answer
– To an ML person, data mining is the inference of models
■ Result is the parameters of the model
Data Mining and related Disciplines
■ Emphasis is on
– scalability in the number of features and instances (massive data)
– algorithms and architectures, whereas the foundations of the methods are provided by statistics and machine learning
– automation for handling large, complex, and heterogeneous data
Database vs Data Mining

Database query: Find all credit applicants with last name of Smith.
Data mining: Find all credit applicants who are poor credit risks. (classification)

Database query: Identify customers who have purchased more than $10,000 in the last month.
Data mining: Identify customers with similar buying habits. (clustering)

Database query: Find all customers who have purchased milk.
Data mining: Find all items which are frequently purchased with milk. (association rules)
Database Processing vs. Data Mining Processing

Query
• Database: well defined; SQL
• Data mining: poorly defined; no precise query language

Output
• Database: precise; a subset of the database
• Data mining: fuzzy; not a subset of the database
Data Mining Models and Tasks
■ Descriptive data mining:
– Describes general properties of the data
■ Predictive data mining:
– Performs inference on the available data in order to make predictions
What is this course about?
Mining of massive datasets

What is this course about?
Extraction of actionable information from (usually) very large datasets

It’s not all about machine learning, but most of it is!

• Emphasis is on algorithms that scale
• Parallelization is often essential
DISTRIBUTED COMPUTING FOR DATA MINING
What is Massive/Big Data?
Too big: petabyte-scale collections or lots of big data sets

Too hard: does not fit neatly in an existing tool


• Data sets that need to be cleaned, processed and integrated
• E.g., Twitter, news, customer transactions

Too fast: needs to be processed quickly


Single-node Architecture

[Slide figure: a single machine running Data Analysis, Data Mining, and Machine Learning]
Motivation: Google Example
20+ billion web pages x 20KB = 400+ TB

1 computer reads 30-35 MB/sec from disk


• ~4 months to read the web

~1,000 hard drives to store the web

Takes even more to do something useful with the data!

A standard architecture for such problems is emerging


• Cluster of commodity Linux nodes
• Commodity network (ethernet) to connect them
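To make the slide's back-of-the-envelope numbers concrete, here is a minimal sketch that reproduces the arithmetic; the figures (20 billion pages, 20 KB per page, 30 MB/s per disk) are the slide's assumptions, not measurements:

```python
# Back-of-the-envelope calculation behind the slide's numbers
# (all figures are assumptions taken from the slide).
pages = 20e9                 # ~20 billion web pages
page_size = 20 * 1024        # ~20 KB per page, in bytes
total_bytes = pages * page_size
print(total_bytes / 1e12)    # ~409.6 TB of raw data

read_rate = 30e6             # ~30 MB/s sequential read from a single disk
seconds = total_bytes / read_rate
print(seconds / (60 * 60 * 24 * 30))   # roughly 4 to 5 months for one machine to read it all
```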
How to handle massive data?
Platforms for Large-scale Data Mining

Distributed Infrastructure
• Hadoop
• HDFS

Programming Models
• MapReduce (pioneered by Google, popularized by Yahoo)
• Spark
Cluster Architecture
Parallelization Challenges
■ How do we assign work units to workers?
■ What if we have more work units than
workers?
■ What if workers need to share partial
results?
■ How do we aggregate partial results?
■ How do we know all the workers have
finished?
■ What if workers die?
■ What is the common theme of all of these problems?
Common Theme?
■ Parallelization problems arise from:
– Communication between workers (e.g., to exchange state)
– Access to shared resources (e.g., data)
■ Thus, we need synchronization mechanisms:
– Semaphores (lock, unlock)
– Condition variables (wait, notify, broadcast)
– Barriers
■ Still, lots of problems remain:
– Deadlock, livelock, race conditions...
– Dining philosophers, sleeping barbers, cigarette smokers...
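As a toy illustration of why shared state forces synchronization, here is a minimal Python sketch (names and data are purely illustrative) in which workers accumulate partial sums into a shared counter guarded by a lock:

```python
# Minimal sketch: several workers update a shared counter; a lock keeps
# the read-modify-write of the shared state from racing.
import threading

counter = 0
lock = threading.Lock()

def worker(items):
    global counter
    local = sum(items)        # compute the partial result privately (no synchronization needed)
    with lock:                # synchronize only when touching shared state
        counter += local

threads = [threading.Thread(target=worker, args=(range(i, i + 10),))
           for i in range(0, 100, 10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                # deterministic result despite concurrent updates
```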
Current Tools
■ What if workers need to share partial results?
■ Programming models
– Shared memory (pthreads)
– Message passing (MPI)
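For the message-passing model, a minimal sketch using mpi4py (an assumption; MPI programs are often written in C or Fortran instead) in which each process computes a partial sum and the partial results are aggregated on one process:

```python
# Minimal message-passing sketch with mpi4py.
# Run with, e.g.: mpiexec -n 4 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each worker computes a partial result on its own slice of the data.
partial = sum(range(rank * 1000, (rank + 1) * 1000))

# reduce() gathers and combines the partial sums on the root process (rank 0).
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)
```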
When Theory Meets Practice
Concurrency is already difficult to reason about…

Now throw in:


The scale of clusters and (multiple) datacenters
The presence of hardware failures and software bugs
The presence of multiple interacting services

The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything

Bottom line: it’s hard!


Big Ideas: Abstract System-Level Details

It’s all about the right level of abstraction.

MapReduce isolates developers from system-level details: it separates the what from the how!

• The programmer defines what computations are to be performed
• The MapReduce execution framework takes care of how the computations are carried out
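A minimal single-process sketch of this separation (illustrative only, not the Hadoop API): the programmer supplies just the map and reduce functions, while a tiny driver stands in for the framework's partitioning, shuffling, and execution machinery:

```python
# Toy word-count in the MapReduce style: map_fn and reduce_fn are the
# "what"; the mapreduce() driver simulates the "how" (shuffle/group-by-key).
from collections import defaultdict

def map_fn(doc):
    # Emit (word, 1) for every word in the document.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all values emitted for the same key.
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                          # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)         # "shuffle": group values by key
    return [reduce_fn(k, v) for k, v in groups.items()]   # "reduce" phase

print(mapreduce(["big data", "big clusters", "data mining"]))
# [('big', 2), ('data', 2), ('clusters', 1), ('mining', 1)]
```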
Big Ideas: Scale Out vs. Scale Up
Scale up: a small number of high-end servers
• Symmetric multi-processing (SMP) machines with large shared memory
• Not cost-effective: the cost of machines does not scale linearly, and no single SMP machine is big enough

Scale out: a large number of commodity low-end servers is more effective for data-intensive applications
• e.g., 8 128-core machines vs. 128 8-core machines
Big Ideas: Failures are Common
■ Suppose a cluster is built using machines with a mean time between failures (MTBF) of 1,000 days
■ For a 10,000-server cluster, that is on average 10 failures per day!
■ MapReduce and Spark implementations cope with failures
– Automatic task restarts
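The expected-failure figure is just a ratio; a one-line sanity check under the slide's assumed MTBF:

```python
# Expected machine failures per day for the slide's assumed cluster.
servers = 10_000        # cluster size
mtbf_days = 1_000       # mean time between failures per machine, in days
print(servers / mtbf_days)   # -> 10.0 failures per day, on average
```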
Big Ideas: Move Processing to Data
■ Supercomputers often have processing nodes and storage nodes
– Computationally expensive tasks
– High-capacity interconnect to move data around
– Data movement leads to a bottleneck in the network!

Why does this make sense for compute-intensive tasks?
What’s the issue for data-intensive tasks? Many data-intensive applications are not very processor-demanding.

What’s the solution?
Don’t move data to workers… move workers to the data!
Key idea: co-locate storage and compute
Start up workers on the nodes that hold the data

[Slide figure: compute nodes connected to a SAN (storage area network)]
What’s the solution?
Don’t move data to workers… move workers to the data!
Key idea: co-locate storage and compute
Start up workers on the nodes that hold the data

We need a distributed file system for managing this


GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
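As a concrete, hedged illustration of co-locating compute with data: in a PySpark job that reads from HDFS, the scheduler tries to run each task on a node that already holds the corresponding HDFS block instead of shipping the data to the computation. The paths below are hypothetical and the sketch assumes a running Spark-on-HDFS cluster:

```python
# Illustrative PySpark word count over data stored in HDFS.
# Spark prefers to launch each task on a node holding the HDFS block it
# will read (data locality). The HDFS paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/webcrawl/part-*")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/webcrawl_wordcounts")
```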
