Updated DM
Introduction
What is data mining?
• After years of data mining there is still no unique
answer to this question.
• A tentative definition:
Data are stored to provide information from a historical perspective and are typically
summarized.
Example: transaction data
• Billions of real-life customers:
• WALMART: 20M transactions per day
• AT&T: 300M calls per day
• Credit card companies: billions of transactions per day.
• Amazon collects all the items that you browsed, placed into your basket,
read reviews about, purchased.
• Google and Bing record all your browsing activity via toolbar plugins.
They also record the queries you asked, the pages you saw, and the
clicks you made.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
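Because each transaction is a set of items, questions about which items occur together reduce to set operations. The sketch below stores the five transactions from the table as Python sets and computes the fraction of transactions containing a given group of items. The helper name `support` is an illustrative choice, not something the slides define.

```python
# The five transactions from the table, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= items` is the subset test on Python sets.
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


print(support({"Diaper", "Milk"}, transactions))  # 3 of 5 transactions
print(support({"Beer", "Bread"}, transactions))   # 2 of 5 transactions
```

Storing each record as a set (rather than a fixed-width row) matches the definition above: transactions have no fixed number of attributes, only membership.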
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Correlation of stocks
• Recommendations:
• Users who buy this item often buy that item as well
• Users who watched James Bond movies, also watched
Jason Bourne movies.
Figure: clustering. Intracluster distances are minimized; intercluster distances are maximized.
1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN,
LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,
ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN,
Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
• Application:
• Create a catalog to send out that has at least one item
of interest for every customer.
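One way to read this application is as a covering problem: choose a small set of items so that every customer is interested in at least one of them. The sketch below uses a greedy heuristic, which is a common choice for covering problems but not necessarily the method the slides have in mind. The function name `greedy_catalog` and the customer data are hypothetical, made up for illustration.

```python
def greedy_catalog(interests):
    """Pick items until every customer has at least one item of interest.

    `interests` maps customer -> set of items that customer likes.
    Greedy heuristic: repeatedly add the item liked by the most
    still-uncovered customers. The result is not guaranteed minimal.
    Customers with no interests cannot be covered and are skipped.
    """
    uncovered = {c for c, items in interests.items() if items}
    catalog = []
    while uncovered:
        # Count, per item, how many uncovered customers it would satisfy.
        gain = {}
        for customer in uncovered:
            for item in interests[customer]:
                gain[item] = gain.get(item, 0) + 1
        # Sort first so ties are broken alphabetically and deterministically.
        best = max(sorted(gain), key=gain.get)
        catalog.append(best)
        uncovered = {c for c in uncovered if best not in interests[c]}
    return catalog


# Hypothetical customers and interests, for illustration only.
interests = {
    "ann": {"Milk", "Bread"},
    "bob": {"Beer"},
    "cat": {"Milk", "Diaper"},
    "dan": {"Beer", "Coke"},
}
catalog = greedy_catalog(interests)
print(catalog)
```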
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function of
the values of other attributes.
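To make the definition concrete, here is a deliberately tiny learner that models the class as a function of a single attribute, predicting the majority class seen for each attribute value. It is a sketch for illustration, not the classifier the slides use. The four records are the ones shown on this slide; the attribute names (`refund`, `marital`, `income_k`, `cls`) are assumptions, since the slides do not name the columns.

```python
from collections import Counter, defaultdict

# Four training records; the last field is the class.
# Attribute names are assumed for illustration.
training_set = [
    {"refund": "Yes", "marital": "Divorced", "income_k": 220, "cls": "No"},
    {"refund": "No", "marital": "Single", "income_k": 85, "cls": "Yes"},
    {"refund": "No", "marital": "Married", "income_k": 75, "cls": "No"},
    {"refund": "No", "marital": "Single", "income_k": 90, "cls": "Yes"},
]


def learn_one_rule(records, attribute, class_attr="cls"):
    """Learn a model: majority class for each value of one attribute."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[attribute]][r[class_attr]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fallback for attribute values never seen in training.
    default = Counter(r[class_attr] for r in records).most_common(1)[0][0]
    return lambda record: rule.get(record[attribute], default)


model = learn_one_rule(training_set, "marital")
print(model({"marital": "Single"}))   # class predicted for a new record
print(model({"marital": "Married"}))
```

The returned `model` is the "function of the values of other attributes" from the definition; a realistic classifier would use several attributes and be evaluated on records held out from training.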
Figure: a training set is used to learn a classifier model. Records shown in the figure:
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
distributed nature of data
• Emphasis on the use of data
(Figure label: Database systems)
Cultures
• Databases: concentrate on large-scale (non-
main-memory) data.
• AI (machine-learning): concentrate on complex
methods, small data.
• In today’s world data is more important than algorithms
• Statistics: concentrate on models.
Data Mining: Confluence of Multiple Disciplines
Figure: data mining at the intersection of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.
Figure: data mining at the intersection of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and distributed computing.
Single-node architecture
Figure: a single node with CPU, memory, and disk, annotated "Machine Learning, Statistics".
Commodity Clusters
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server
• Standard architecture emerging:
• Cluster of commodity Linux nodes, Gigabit Ethernet interconnect
• Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common
• How to organize computations on this architecture?
• Map-Reduce paradigm
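The Map-Reduce paradigm can be sketched in a single process to show how a computation is organized: a map step emits key-value pairs from each chunk of input, the pairs are grouped by key, and a reduce step combines the values for each key. The word-count example and all function names below are illustrative assumptions; a real system such as Hadoop runs the map and reduce steps on different nodes and handles the grouping itself.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a file block that a separate node would read.
    pairs = (pair for chunk in chunks for pair in map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


chunks = ["the team lost the game", "the coach called a timeout"]
counts = word_count(chunks)
print(counts["the"])  # 3
```

The map step only reads its own chunk and the output is written as new data, which fits the usage pattern above: huge files, frequent reads and appends, rare in-place updates.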
Cluster Architecture
Figure: nodes grouped in racks and connected through switches.
• 1 Gbps between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
Figure: Data -> Preprocessing -> Data Mining -> Post-processing -> Result
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
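The two steps of stratified sampling (partition, then sample within each partition) can be sketched as follows. The function name, its parameters, and the example records are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=None):
    """Partition `records` by `key(record)`, then sample within each partition."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for record in records:
        partitions[key(record)].append(record)
    sample = []
    for stratum in partitions.values():
        # Take the whole partition if it is smaller than the request.
        k = min(per_stratum, len(stratum))
        sample.extend(rng.sample(stratum, k))
    return sample


# Hypothetical imbalanced data: 5 "Yes" records and 95 "No" records.
records = [("Yes", i) for i in range(5)] + [("No", i) for i in range(95)]
sample = stratified_sample(records, key=lambda r: r[0], per_stratum=3, seed=0)
print(len(sample))  # 6: three records from each class
```

With a plain random sample of six records, the rare "Yes" class could easily be missed entirely; sampling per partition guarantees it is represented.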
Sample Size
• Reservoir Sampling:
• Standard interview question
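A minimal sketch of reservoir sampling, which draws a uniform random sample of k items from a stream whose length is not known in advance, using only O(k) memory. This is the classic single-pass version (often called Algorithm R); the function name and parameters are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by drawing
            # a slot in [0, i]; only slots below k are in the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


picked = reservoir_sample(iter(range(1000)), k=10, seed=1)
print(len(picked))  # 10
```

If the stream has fewer than k items, the function simply returns all of them.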
Meaningfulness of Answers