Module 1 - Aug 2024
Introduction
What is data mining?
• After years of data mining there is still no unique
answer to this question.
• A tentative definition:
Data are stored to provide information from a historical perspective and are typically
summarized.
Example: transaction data
• Billions of real-life customers:
• Walmart: 20M transactions per day
• AT&T: 300M calls per day
• Credit card companies: billions of transactions per day.
• Amazon collects all the items that you browsed, placed into your basket,
read reviews about, purchased.
• Google and Bing record all your browsing activity via toolbar plugins.
They also record the queries you issued, the pages you viewed, and the
links you clicked.
Document Data
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
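A minimal sketch of how such term-count vectors can be compared. Cosine similarity is used here only as one common choice for document vectors; the notes themselves do not prescribe a measure. The three vectors are copied from the document rows above, and the function name is illustrative.

```python
import math

# Term-count vectors copied from the three document rows above.
docs = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_12 = cosine_similarity(docs["Document 1"], docs["Document 2"])
sim_13 = cosine_similarity(docs["Document 1"], docs["Document 3"])
```

Under this measure Document 1 is closer to Document 3 than to Document 2, because it shares more heavily weighted terms with Document 3.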
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
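A minimal sketch of one way to hold this table in memory: each transaction is a set of items, and the support count of an itemset is the number of transactions that contain all of its items. The data is the five-row table above; the names `support_count` and `pair_counts` are illustrative.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


# Count how often each unordered pair of items appears in the same transaction.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)
```

For example, `support_count({"Diaper", "Milk"}, transactions)` counts transactions 3, 4, and 5.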
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
• Recommendations:
• Users who buy this item often buy that item as well
• Users who watched James Bond movies also watched
Jason Bourne movies.
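A rough sketch of the "users who watched X also watched Y" idea, using simple co-occurrence counts. The viewing histories are hypothetical and the function `also_watched` is an illustrative name; real recommenders are considerably more involved.

```python
from collections import Counter

# Hypothetical viewing histories (user -> set of titles); not data from the notes.
history = {
    "u1": {"James Bond", "Jason Bourne", "Mission Impossible"},
    "u2": {"James Bond", "Jason Bourne"},
    "u3": {"James Bond", "Notting Hill"},
    "u4": {"Notting Hill", "Love Actually"},
}


def also_watched(title, history):
    """Rank other titles by the fraction of `title`'s viewers who also saw them."""
    viewers = [seen for seen in history.values() if title in seen]
    if not viewers:
        return []
    counts = Counter(other for seen in viewers for other in seen if other != title)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(other, n / len(viewers)) for other, n in ranked]


suggestions = also_watched("James Bond", history)
```

With this toy data the top suggestion for "James Bond" is "Jason Bourne", watched by two of its three viewers.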
(Figure: clustering. Intracluster distances are minimized; intercluster distances are maximized.)
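To make the two quantities concrete, the following sketch computes the mean intracluster and mean intercluster Euclidean distance for a given grouping. The points and cluster labels are hypothetical, and averaging pairwise distances is only one of several ways to summarize them.

```python
import math
from itertools import combinations

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_intracluster_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    return mean(
        math.dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    )


def mean_intercluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    return mean(
        math.dist(p, q)
        for (_, a), (_, b) in combinations(clusters.items(), 2)
        for p in a
        for q in b
    )


intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
```

A good clustering of these points is one where `intra` is small relative to `inter`.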
• Application:
• Create a catalog to send out that has at least one item
of interest for every customer.
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function of
the values of other attributes.
Training Set
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Training Set -> Learn Classifier -> Model
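A minimal sketch of "learning a model" from the training records above. It uses a one-attribute rule (each attribute value is mapped to the majority class seen with it), which is a deliberately simple stand-in and not necessarily the classifier the slides have in mind. The field names are assumptions, since the header row is not shown here; the last field of each record is taken to be the class.

```python
from collections import Counter, defaultdict

# Records 7-10 from the training set above. Field names are assumed;
# the final value in each row is treated as the class label.
records = [
    {"attr1": "Yes", "status": "Divorced", "income_k": 220, "cls": "No"},
    {"attr1": "No", "status": "Single", "income_k": 85, "cls": "Yes"},
    {"attr1": "No", "status": "Married", "income_k": 75, "cls": "No"},
    {"attr1": "No", "status": "Single", "income_k": 90, "cls": "Yes"},
]


def learn_one_rule(records, attribute, class_key="cls"):
    """Map each value of `attribute` to the majority class among matching records."""
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[attribute]][r[class_key]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}


def accuracy(rule, records, attribute, class_key="cls"):
    """Fraction of records whose class the rule predicts correctly."""
    hits = sum(1 for r in records if rule.get(r[attribute]) == r[class_key])
    return hits / len(records)


status_rule = learn_one_rule(records, "status")
attr1_rule = learn_one_rule(records, "attr1")
```

On these four records the rule built on `status` fits the training set better than the rule built on `attr1`. Accuracy on the training set says little about unseen records, which is why a separate test set is normally used.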
• Classification: Derive a model for each of these three classes based on the
descriptive features of the items, such as price, brand, place made, type, and
category.
• Decision tree: A decision tree may identify price as the single factor
that best distinguishes the three classes. The tree may reveal that, after price,
other features that help further distinguish objects of each class from one another
include brand and place made. Such a decision tree may help you
understand the impact of the given sales campaign.
• Prediction: Suppose you would like to predict the amount of revenue that
each item will generate during an upcoming sale at AllElectronics, based on
previous sales data. This is an example of (numeric) prediction because the
model constructed will predict a continuous-valued function, or ordered value.
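As an illustration of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict a continuous value. The sales figures are hypothetical, and a single-feature linear model is only the simplest possible choice, not a method named in the notes.

```python
# Hypothetical past sales: (item price in dollars, revenue in dollars).
past_sales = [(10.0, 200.0), (20.0, 380.0), (30.0, 610.0), (40.0, 790.0)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predicted continuous value for input x."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(past_sales)
forecast = predict(model, 50.0)
```

The output is a real number rather than a class label, which is what separates prediction in this sense from classification.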
Outlier Analysis
• A database may contain data objects that do not comply with
the general behavior or model of the data. These data
objects are outliers.
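One simple way to flag values that do not fit the general behavior of the data is the modified z-score based on the median absolute deviation (MAD). This is a sketch of that particular technique with its conventional constants (0.6745 and a cutoff of 3.5); the notes do not specify a method, and the readings below are hypothetical.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score (MAD-based) exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical daily purchase amounts with one unusual value.
readings = [52, 48, 50, 51, 49, 50, 250]
flagged = mad_outliers(readings)
```

The median is used instead of the mean because a single extreme value can inflate the mean and standard deviation enough to hide itself.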
• Distributed nature of data
• Emphasis on the use of data
Cultures
• Databases: concentrate on large-scale (non-
main-memory) data.
• AI (machine-learning): concentrate on complex
methods, small data.
• In today’s world data is more important than algorithms
• Statistics: concentrate on models.
Data Mining: Confluence of Multiple Disciplines
(Figure: data mining at the center of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.)
(Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and distributed computing.)
Single-node architecture
(Figure: a single node with CPU, memory, and disk, labeled "Machine Learning, Statistics".)
Commodity Clusters
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server
• Standard architecture emerging:
• Cluster of commodity Linux nodes, Gigabit Ethernet interconnect
• Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common
• How to organize computations on this architecture?
• Map-Reduce paradigm
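A single-process sketch of the Map-Reduce idea, using word count as the example task. It only imitates the three stages (map, group by key, reduce) in memory; it is not how Hadoop or any real framework is invoked, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["the ball the team", "team score"])
```

On a cluster, the map calls run on the nodes that hold each chunk of the input file, and the grouping step moves data across the network so that each key is reduced in one place.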
Cluster Architecture
• 1 Gbps between any pair of nodes in a rack
• 2-10 Gbps backbone between racks
(Figure: the nodes in each rack connect to a rack switch; the rack switches connect to a backbone switch.)
Data -> Preprocessing -> Data Mining -> Post-processing -> Result
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
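The two steps above can be sketched as follows. The partition key, the sampling fraction, and the rule of keeping at least one record per partition are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Partition `records` by `key`, then sample `fraction` of each partition."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one record from every partition.
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical data set: 90 records of a common class and 10 of a rare one.
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
```

Unlike a plain random sample of 10 records, which could easily miss the rare class altogether, this sample always contains it.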
Sample Size
• Reservoir Sampling:
• Standard interview question
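Reservoir sampling draws a uniform random sample of k items from a stream whose length is not known in advance, in one pass and with O(k) memory. The sketch below follows the classic formulation usually called Algorithm R; the function name and the `seed` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


picked = reservoir_sample(range(1000), 10, seed=1)
```

If the stream has fewer than k items, the function simply returns all of them.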
Meaningfulness of Answers
• A big data-mining risk is that you will “discover”
patterns that are meaningless.
• Statisticians call it Bonferroni’s principle:
(roughly) if you look in more places for
interesting patterns than your amount of data
will support, you are bound to find crap.
• The Rhine Paradox: a great example of how
not to conduct scientific research.
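A small calculation shows why looking in many places produces "discoveries" by chance alone. The numbers are assumptions taken from the commonly told version of the Rhine story, which these notes do not spell out: each subject guesses the color (one of two) of 10 hidden cards, and about 1,000 subjects are tested.

```python
def expected_chance_hits(num_subjects, num_cards, num_choices=2):
    """Expected number of subjects who guess every card correctly by pure luck."""
    p_all_correct = (1 / num_choices) ** num_cards
    return num_subjects * p_all_correct


def prob_at_least_one(num_subjects, num_cards, num_choices=2):
    """Probability that at least one subject guesses every card by pure luck."""
    p_all_correct = (1 / num_choices) ** num_cards
    return 1 - (1 - p_all_correct) ** num_subjects


# Assumed setup: 1,000 subjects, 10 cards, 2 possible colors per card.
expected = expected_chance_hits(1000, 10)
at_least_one = prob_at_least_one(1000, 10)
```

Under these assumptions about one subject is expected to get all 10 cards right with no special ability at all, so finding such a subject is not evidence of anything.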