Datamining-lect1 - Introduction to Data Mining
LECTURE 1
Introduction
• A tentative definition:
Medical data
• Wearable devices can measure your heart rate, blood sugar,
blood pressure, and other signals about your health. Medical
records are becoming available to individuals
• Wearable computing
• Brain imaging
• Images that monitor the activity in different areas of the brain under
different stimuli
• TB of data that need to be analyzed.
Behavioral data
• Mobile phones today record a large amount of information about user
behavior
• GPS records position
• Camera produces images
• Communication via phone and SMS
• Text via Facebook updates
• Association with entities via check-ins
• Amazon collects all the items that you browsed, placed into your basket,
read reviews about, purchased.
• Google and Bing record all your browsing activity via toolbar plugins.
They also record the queries you asked, the pages you saw, and the
clicks you made.
Recommendations
Amazon Recommendations
• “People who have bought this also bought…”
• pages clicked, ads clicked
Query auto-completion and spelling correction
Example Application
• Google auto-complete and spelling correction
Correlation of stocks
Viral Marketing
• Word-of-mouth marketing where the network of
users does the advertising itself.
• It is considered the most effective, but also the hardest
way to advertise
Friendship suggestions
• LinkedIn, Twitter, Facebook friendship
suggestions
• Useful for the users to discover their friends, but also
useful for the network in order to grow, and increase
engagement
• LinkedIn success story
Big data
• The new trend in data mining…
• An all-encompassing term to describe problems in science,
industry, everyday life where there are huge amounts of data
that need to be stored, maintained and analyzed to produce
value.
• The overall idea:
• Every activity generates data
• Wearable computing, Internet of Things, Brain Imaging, Urban behavior
• If we collect and understand this data we can improve life for the
individual and the world
• E.g., Urban computing, Health informatics.
• Deep Learning:
• New techniques that can extract useful information (learn) from
massive amounts of data.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Itemsets Discovered:
{Milk, Coke}
{Diaper, Milk}
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
Example Application
• Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
• A classic rule:
• If a customer buys diapers and milk, then they are very
likely to buy beer.
• So, don’t be surprised if you find six-packs stacked
next to diapers!
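One way to check how strongly the data backs a discovered rule is to count how often its items occur together. The sketch below uses the five transactions from the market-basket example and the standard support and confidence measures. These measures, and the function names `support` and `confidence`, are not named on the slides and are introduced here only for illustration.

```python
# The five transactions from the market-basket example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Among baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

# The two rules listed under "Rules Discovered".
rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support=%.2f" % support(lhs | rhs, transactions),
          "confidence=%.2f" % confidence(lhs, rhs, transactions))
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 baskets that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 baskets that contain both Diaper and Milk.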
Recommender systems
Collaborative filtering: Use the collective behavior of the users to
draw conclusions for an individual
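Ratings matrix: users A-D (rows) by movies (columns); most cells are empty.
Movies: Harry Potter 1, Harry Potter 2, Harry Potter 3, Twilight, Star Wars 1, Star Wars 2, Star Wars 3
A: 4, 5, 1
B: 5, 5, 4
C: 2, 4, 5
D: 3, 3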
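A minimal sketch of user-based collaborative filtering follows. The extracted text does not preserve which movie each rating belongs to, so the placement of ratings below is an assumption made for illustration. Cosine similarity and the similarity-weighted average are likewise one common choice, not something the slides specify.

```python
from math import sqrt

# ASSUMPTION: the movie that each rating belongs to is not recoverable from
# the slide text; this placement is hypothetical and only for illustration.
ratings = {
    "A": {"Harry Potter 1": 4, "Twilight": 5, "Star Wars 1": 1},
    "B": {"Harry Potter 1": 5, "Harry Potter 2": 5, "Harry Potter 3": 4},
    "C": {"Twilight": 2, "Star Wars 1": 4, "Star Wars 2": 5},
    "D": {"Harry Potter 2": 3, "Star Wars 3": 3},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing rating = 0)."""
    dot = sum(u[m] * v[m] for m in u.keys() & v.keys())
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no similar user has rated the movie.
    """
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and movie in their:
            w = cosine(ratings[user], their)
            num += w * their[movie]
            den += w
    return num / den if den > 0 else None

print("sim(A, B) =", round(cosine(ratings["A"], ratings["B"]), 3))
print("sim(A, C) =", round(cosine(ratings["A"], ratings["C"]), 3))
print("predicted rating of A for Harry Potter 2:", predict("A", "Harry Potter 2"))
```

Treating a missing rating as 0 is a simplification. It makes an unrated movie look like a disliked one, which is why practical systems often normalize ratings first.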
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.
• Similarity Measures?
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized.
Intercluster distances are maximized.
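The two quantities in this illustration can be measured directly. The sketch below assumes a handful of hypothetical 3-D points that are already assigned to two clusters, and compares the average Euclidean distance within clusters to the average distance between them. The point values and function names are made up for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical 3-D points, already grouped into two clusters.
clusters = {
    "c1": [(1.0, 1.0, 1.0), (1.5, 1.0, 0.5), (1.0, 0.5, 1.5)],
    "c2": [(8.0, 8.0, 8.0), (8.5, 7.5, 8.0), (7.5, 8.0, 8.5)],
}

def mean_intracluster(clusters):
    """Average distance over all pairs of points that share a cluster."""
    ds = [dist(p, q)
          for points in clusters.values()
          for p, q in combinations(points, 2)]
    return sum(ds) / len(ds)

def mean_intercluster(clusters):
    """Average distance over all pairs of points in different clusters."""
    ds = [dist(p, q)
          for (_, a), (_, b) in combinations(clusters.items(), 2)
          for p in a for q in b]
    return sum(ds) / len(ds)

print("mean intracluster distance:", round(mean_intracluster(clusters), 3))
print("mean intercluster distance:", round(mean_intercluster(clusters), 3))
```

A good clustering of these points should report a small first number and a large second one. A clustering algorithm searches for the assignment that achieves this.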
Clustering: Application 1
• Bioinformatics applications:
• Goal: Group genes and tissues together such that genes are
co-expressed in the same tissues
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
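The approach above can be sketched as follows: count term frequencies per document, then compare documents by the cosine of their frequency vectors. Cosine similarity is one common choice and is an assumption here, since the slide only asks for a measure based on term frequencies. The three sample documents are invented.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = sqrt(sum(c * c for c in tf_a.values()))
    norm_b = sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: two on the same topic, one unrelated.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining large data sets for patterns",
    "d3": "the recipe needs flour and sugar",
}
tfs = {name: term_frequencies(text) for name, text in docs.items()}

for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
    print(a, b, round(cosine_similarity(tfs[a], tfs[b]), 3))
```

Any clustering method that accepts a pairwise similarity can then group d1 with d2 and leave d3 apart.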
Coverage
• Given a set of customers and items and the
transaction relationship between the two, select a
small set of items that “covers” all users.
• For each user there is at least one item in the set that
the user has bought.
• Application:
• Create a catalog to send out that has at least one item
of interest for every customer.
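The coverage problem is an instance of set cover. A simple heuristic, sketched below, greedily picks the item bought by the most still-uncovered users. The slides do not name an algorithm, so the greedy strategy, the function name `greedy_cover`, and the purchase data are all assumptions for illustration. Greedy selection yields a small cover but not necessarily the smallest one.

```python
def greedy_cover(purchases):
    """Pick items until every user has bought at least one chosen item.

    purchases: dict mapping each item to the set of users who bought it.
    Returns the chosen items in the order they were selected.
    """
    uncovered = set().union(*purchases.values())
    chosen = []
    while uncovered:
        # Take the item that covers the most users not yet covered.
        best = max(purchases, key=lambda item: len(purchases[item] & uncovered))
        chosen.append(best)
        uncovered -= purchases[best]
    return chosen

# Hypothetical transaction data: item -> users who bought it.
purchases = {
    "milk": {"u1", "u2", "u3"},
    "beer": {"u3", "u4"},
    "bread": {"u4", "u5"},
    "coke": {"u5"},
}
catalog = greedy_cover(purchases)
print(catalog)
```

Here the catalog needs only two of the four items to reach all five users.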
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
Training Set --> Learn Classifier --> Model
Training set (rows 7-10):
TID  Refund  MarSt     TaxInc  Class
7    Yes     Divorced  220K    No
8    No      Single    85K     Yes
9    No      Married   75K     No
10   No      Single    90K     Yes
Model (decision tree):
Refund
  Yes: NO
  No: MarSt
    TaxInc
      NO
      YES
    NO
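A decision tree like this one can be written as nested conditions. In the sketch below, only the first test (Refund = Yes leads to NO) is readable from the slide text. The Married branch and the 80K income threshold are assumptions, chosen so that the tree agrees with the four training rows that survive. The function name `classify` is also hypothetical.

```python
def classify(refund, marital_status, taxable_income_k):
    """Return the predicted class ("Yes" or "No") for one record."""
    if refund == "Yes":
        return "No"                     # from the slide: Refund = Yes -> NO
    if marital_status == "Married":     # ASSUMED condition on the MarSt branch
        return "No"
    # ASSUMED threshold on TaxInc (in thousands); not recoverable from the slide.
    return "Yes" if taxable_income_k > 80 else "No"

# The four training rows that survive in the slide text:
# (id, Refund, MarSt, TaxInc in K, class)
records = [
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

for rid, refund, marst, income, label in records:
    predicted = classify(refund, marst, income)
    print(rid, "predicted:", predicted, "actual:", label)
```

Agreement on the training rows shows only that the model fits the data it was built from. A classifier is judged by how well it labels records it has not seen.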
Classification: Application 1
• Ad Click Prediction
• Goal: Predict if a user that visits a web page will click
on a displayed ad. Use it to target users with high click
probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.
Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and information about the account
holder as attributes.
• When a customer buys, what they buy, how often they pay on
time, etc.
• Label past transactions as fraud or fair transactions. This forms
the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
We distribute a liter of
liquid equally to all
containers
Exploratory Analysis
• Trying to understand the data as a physical
phenomenon, and describe it with simple metrics
• What does the web graph look like?
• How often do people repeat the same query?
• Are friends on Facebook also friends on Twitter?
• Distributed nature of data
• Emphasis on the use of data
Database systems
Cultures
• Databases: concentrate on large-scale (non-
main-memory) data.
• AI (machine-learning): concentrate on complex
methods, small data.
• In today’s world data is more important than algorithms
• Statistics: concentrate on models.
• Big Data: Make machine learning scale on large
data
Data Mining draws on: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, Other Disciplines
Data Mining draws on: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, Distributed Computing
Single-node architecture
Diagram: a single node with CPU, Memory, and Disk (label: Machine Learning, Statistics)
Commodity Clusters
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server
• Standard architecture emerging:
• Cluster of commodity Linux nodes, Gigabit Ethernet interconnect
• Google GFS; Hadoop HDFS; Kosmix KFS
• Typical usage pattern
• Huge files (100s of GB to TB)
• Data is rarely updated in place
• Reads and appends are common
• How to organize computations on this architecture?
• Map-Reduce paradigm
Cluster Architecture
• 2-10 Gbps backbone between racks
• 1 Gbps between any pair of nodes in a rack
• Diagram: a backbone switch connecting the switches of the individual racks
Map-Reduce paradigm
• Map the data into key-value pairs
• E.g., map a document to word-count pairs
• Group by key
• Group all pairs of the same word, with lists of counts
• Reduce by aggregating
• E.g. sum all the counts to produce the total count.
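The three steps above can be sketched on a single machine as a word count. This only illustrates the data flow. A real Map-Reduce system runs the map and reduce functions in parallel across cluster nodes, and the function names here are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def group_by_key(pairs):
    """Group: collect all counts emitted for the same word into one list."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the list of counts to get the total for one word."""
    return word, sum(counts)

def word_count(documents):
    """Run map, group, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = group_by_key(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

# Hypothetical input documents.
documents = ["big data needs big clusters", "data mining on clusters"]
print(word_count(documents))
```

Because each document is mapped independently and each word is reduced independently, both phases can be spread over many nodes without coordination beyond the grouping step.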
It is a hard job