
01 Intro

The document outlines the course CS345a: Data Mining taught by Jure Leskovec at Stanford University, highlighting various project ideas and data sources such as Netflix and Wikipedia. It discusses the importance of machine learning, data mining techniques, and the challenges of handling large datasets. Additionally, it emphasizes the significance of discovering meaningful patterns in data while being cautious of meaningless patterns.


CS345a: Data Mining

Jure Leskovec
Stanford University
 Many past projects have dealt with
collaborative filtering (advice based on what
similar people do)
 E.g., Netflix Challenge
 Others have dealt with engineering solutions
to machine‐learning problems
 Lots of interesting project ideas
 If you can’t think of one please come talk to us

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 6


 Data:
 Netflix
 WebBase
 Wikipedia
  TREC
  ShareThis
  Google
 Infrastructure:
 Aster Data cluster on Amazon EC2
 Supports both MapReduce and SQL
 ML generally requires a large
“training set” of correctly
classified data:
 Example: classify Web pages by topic

 Hard to find well‐classified data:


 Open Directory works for page topics,
because work is collaborative and
shared by many.
 Other good exceptions?
 Many problems require thought:
1. Tell important pages from unimportant
(PageRank)
2. Tell real news from publicity (how?)
3. Distinguish positive from negative product
reviews (how?)
4. Feature generation in ML
5. Etc., etc.
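
Item 1 names PageRank as the way to tell important pages from unimportant ones. A minimal power-iteration sketch follows; the toy link graph, the damping factor of 0.85, and the function name are assumptions for illustration, not taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}.

    Toy sketch: every page must appear as a key, and a page with no
    out-links spreads its rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the same "teleport" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by everyone else.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages that many others link to (here "c") end up with the highest score, while a page nobody links to ("d") keeps only the teleport share.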



 Map‐Reduce and Hadoop
 Recommendation systems
 Collaborative filtering
 Dimensionality reduction
 Finding nearest neighbors
 Finding similar sets
 Minhashing, Locality‐Sensitive hashing
 Clustering
 PageRank and measures of importance in graphs
(link analysis)
 Spam detection
 Topic‐specific search
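
Minhashing, from the topic list above, estimates how similar two sets are by comparing short signatures instead of the sets themselves. The sketch below is one common way to do it, assuming a toy hash family h(x) = (a*x + b) mod p; the prime, the seed, the example sets, and all function names are illustrative choices, not course material.

```python
import random
import zlib

PRIME = 2_147_483_647  # large prime for the toy hash family (assumption)


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs, one per hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_params):
    """For each hash function keep the minimum hash value over the set."""
    codes = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * c + b) % PRIME for c in codes) for a, b in hash_params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


set_a = {"data", "mining", "large", "scale", "web"}
set_b = {"data", "mining", "large", "graphs", "web"}
params = make_hash_params(200)
estimate = estimated_jaccard(
    minhash_signature(set_a, params), minhash_signature(set_b, params)
)
print("exact:", exact_jaccard(set_a, set_b), "estimated:", estimate)
```

More hash functions give a tighter estimate at the cost of a longer signature.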



 Large scale machine learning
 Association rules, frequent itemsets
 Extracting structured data (relations) from the
Web
 Clustering data
 Graph partitioning
 Spam detection
 Managing Web advertisements
 Mining data streams



 Lots of data is being collected
and warehoused
 Web data, e‐commerce
  purchases at department/grocery stores
  Bank/Credit Card
transactions

 Computers are cheap and powerful
 Competitive Pressure is Strong
  Provide better, customized services for an edge (e.g. in
Customer Relationship Management)



 Data collected and stored at
enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
  microarrays generating gene expression data
  scientific simulations
generating terabytes of data
 Traditional techniques infeasible for
raw data
 Data mining helps scientists
 in classifying and segmenting data
 in Hypothesis Formation
 There is often information “hidden” in the data that is
not readily evident
 Human analysts take weeks to discover useful
information
  Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999]
 Many Definitions
 Non‐trivial extraction of implicit, previously
unknown and useful information from data
  Exploration & analysis, by automatic or
semi‐automatic means, of
large quantities of data
in order to discover
meaningful patterns



 Process of semi‐automatically analyzing large
databases to find patterns that are:
  valid: hold on new data with some certainty
  novel: non‐obvious to the system
  useful: should be possible to act on the item
 understandable: humans should be able to
interpret the pattern



 A big data‐mining risk is that you will
“discover” patterns that are meaningless.

 Bonferroni’s principle: (roughly) if you look in
more places for interesting patterns than your
amount of data will support, you are bound to
find crap.



 A parapsychologist in the 1950’s hypothesized
that some people had Extra‐Sensory
Perception
 He devised an experiment where subjects
were asked to guess 10 hidden cards – red or
blue
 He discovered that almost 1 in 1000 had ESP –
they were able to get all 10 right



 He told these people they had ESP and called
them in for another test of the same type
 Alas, he discovered that almost all of them
had lost their ESP
 What did he conclude?

 He concluded that you shouldn’t tell people
they have ESP; it causes them to lose it. ☺
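
The arithmetic behind the story, as a short sketch: if every guess is an independent 50/50 choice between red and blue, pure luck already produces the "almost 1 in 1000" rate, and the lucky subjects are no more likely than anyone else to repeat it. The subject count of 1000 is a hypothetical number chosen for illustration.

```python
from fractions import Fraction

guesses = 10
# Chance of guessing all 10 red/blue cards correctly by luck alone.
p_all_correct = Fraction(1, 2) ** guesses

subjects = 1000  # hypothetical number of people tested
expected_lucky = subjects * p_all_correct

# On the retest, a previously lucky subject succeeds with the same probability.
p_repeat = p_all_correct

print("P(all 10 right by chance) =", p_all_correct)
print("expected lucky subjects among", subjects, "=", float(expected_lucky))
print("P(a lucky subject repeats) =", float(p_repeat))
```

Looking at enough subjects guarantees some perfect scores, which is exactly the situation Bonferroni's principle warns about.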



 Banking: loan/credit card approval:
  predict good customers based on old customers
 Customer relationship management:
  identify those who are likely to leave for a competitor
 Targeted marketing:
  identify likely responders to promotions
 Fraud detection: telecommunications, finance
  from an online stream of events, identify fraudulent
events
 Manufacturing and production:
  automatically adjust knobs when process parameter
changes
 Medicine: disease outcome, effectiveness of
treatments
 analyze patient disease history: find relationship
between diseases
 Molecular/Pharmaceutical:
  identify new drugs
 Scientific data analysis:
  identify new galaxies by searching for sub clusters
 Web site/store design and promotion:
 find affinity of visitor to pages and modify layout



 Overlaps with machine learning, statistics,
artificial intelligence, databases, visualization but
more stress on
  scalability of number
of features and instances
  stress on algorithms and
architectures whereas
foundations of methods
and formulations provided
by statistics and machine learning
  automation for handling large,
heterogeneous data
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
 Prediction Methods
 Use some variables to predict unknown or
future values of other variables.

 Description Methods
 Find human‐interpretable patterns that
describe the data.



 Classification
 Clustering
 Association Rule Discovery:
 Sequential Pattern Discovery
 Regression
 Anomaly Detection



Courtesy: http://aps.umn.edu

Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light
waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB



 Observe Stock Movements
 Cluster them: Stock‐{UP/DOWN}
 Similarity Measure:
  Two points are more similar if the events described by
them frequently happen together on the same day.
Discovered Clusters, by Industry Group:
1  Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,
   Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN,
   LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
   Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2  Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,
   ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN,
   Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
   Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3  Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN,
   MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4  Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,
   Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
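
One way to make the similarity measure above concrete is to record, for each event such as INTEL-DOWN, the set of days on which it happened, and to score two events by how much those sets overlap. The slide does not say which formula was used; the Jaccard-style ratio, the function name, and the day numbers below are assumptions for illustration.

```python
def co_occurrence_similarity(days_a, days_b):
    """Share of days on which both events happened, among days with either."""
    union = days_a | days_b
    if not union:
        return 0.0
    return len(days_a & days_b) / len(union)


# Hypothetical trading days (numbered) on which each event was observed.
event_days = {
    "INTEL-DOWN": {1, 2, 5, 8},
    "CISCO-DOWN": {1, 2, 5, 9},
    "Unocal-UP": {3, 4, 8},
}

tech_pair = co_occurrence_similarity(event_days["INTEL-DOWN"], event_days["CISCO-DOWN"])
mixed_pair = co_occurrence_similarity(event_days["INTEL-DOWN"], event_days["Unocal-UP"])
print("INTEL-DOWN vs CISCO-DOWN:", tech_pair)
print("INTEL-DOWN vs Unocal-UP:", mixed_pair)
```

Events that tend to fall on the same days score close to 1 and would land in the same cluster.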



 Given database of user preferences, predict
preference of new user
 Example:
 Predict what new movies you will like based on
 your past preferences
 others with similar past preferences
 their preferences for the new movies
 Example:
 Predict what books/CDs a person may want to buy
  (and suggest it, or give discounts to tempt
customer)
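
The movie example can be sketched as user-based collaborative filtering: weight other people's ratings of the new movie by how similar their past preferences are to yours. The cosine similarity over co-rated movies, the names, and the ratings below are illustrative assumptions, not a method prescribed by the slides.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity computed over the movies both users rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)
    norm_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, movie, all_ratings):
    """Similarity-weighted average of other users' ratings for the movie."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, ratings in all_ratings.items():
        if other == user or movie not in ratings:
            continue
        sim = cosine_similarity(all_ratings[user], ratings)
        weighted_sum += sim * ratings[movie]
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical 1-5 star ratings.
ratings = {
    "you": {"Alien": 5, "Heat": 4},
    "alice": {"Alien": 5, "Heat": 4, "Solaris": 5},
    "bob": {"Alien": 1, "Heat": 5, "Solaris": 2},
}
prediction = predict_rating("you", "Solaris", ratings)
print("predicted rating for Solaris:", round(prediction, 2))
```

Because "alice" shares your past preferences exactly, her rating pulls the prediction above the plain average of the two known ratings.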
 Detect significant deviations
from normal behavior
 Applications:
 Credit Card Fraud Detection

  Network Intrusion Detection
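
A minimal sketch of "significant deviation from normal behavior", assuming normal behavior is summarized by the mean and standard deviation of past values and that anything more than 3 standard deviations away is flagged. The threshold, the function name, and the transaction amounts are hypothetical.

```python
import statistics


def zscore(value, history):
    """How many standard deviations `value` lies from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (value - mean) / stdev


# Hypothetical past card transactions (normal behavior) and new ones to score.
history = [20, 25, 22, 19, 24, 21, 23, 20, 22, 24]
new_amounts = [23, 950, 18]

threshold = 3.0  # assumed cut-off, not from the slides
flagged = [x for x in new_amounts if abs(zscore(x, history)) > threshold]
print("flagged transactions:", flagged)
```

Scoring new values against a separate history keeps a single huge outlier from inflating the standard deviation it is measured against.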



 Supermarket shelf management.
 Goal: To identify items that are bought together by
sufficiently many customers.
  Approach: Process the point‐of‐sale data collected with
barcode scanners to find dependencies among items.
 A classic rule ‐‐
 If a customer buys diaper and milk, then he is likely to buy beer.
 So, don’t be surprised if you find six‐packs stacked next to diapers!
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
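
The two discovered rules can be checked against the five transactions above. The sketch uses the usual support and confidence measures for association rules; the slide itself does not name them, so the function names and the choice of measures are assumptions.

```python
# The five baskets from the TID table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the share that also has the consequent."""
    return support(antecedent | consequent) / support(antecedent)


milk_coke = confidence({"Milk"}, {"Coke"})
diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"})
print("{Milk} --> {Coke}: support", support({"Milk", "Coke"}), "confidence", milk_coke)
print("{Diaper, Milk} --> {Beer}: support",
      support({"Diaper", "Milk", "Beer"}), "confidence", round(diaper_milk_beer, 3))
```

On this data the first rule holds in 3 of the 4 baskets containing Milk, and the second in 2 of the 3 baskets containing both Diaper and Milk.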
 Network intrusion detection using a combination of
sequential rule discovery and classification tree on 4 GB
DARPA data
  Won over (manual) knowledge engineering approach
  http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good
detailed description of the entire process
 Major US bank: Customer attrition prediction
 Segment customers based on financial behavior: 3 segments
 Build attrition models for each of the 3 segments
 40‐50% of attritions were predicted == factor of 18 increase
 Targeted credit marketing: major US banks
 find customer segments based on 13 months credit balances
 build another response model based on surveys
 increased response 4 times – 2%



 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data



[Leskovec et al., TWEB ’07]

 Senders and followers of recommendations
receive discounts on products
(figure labels: 10% credit, 10% off)

 Recommendations are made to any number of
people at the time of purchase
 Only the recipient who buys first gets a
discount
[Figure: Product recommendation network. Legend: purchase following a recommendation; customer recommending a product; customer not buying a recommended product]



 Large online retailer (June 2001 to May 2003)
 15,646,121 recommendations
  3,943,084 distinct customers
  548,523 products recommended
  99% of them belonging to 4 main product
groups:
 books
 DVDs
 music
 VHS
 Recommendations
 sender (shadowed)
 recipient (shadowed)
 recommendation time
 buy bit
 purchase time
 product price

 Additional product info (from the retailer’s website)
 categories
 reviews
 ratings
 What role does the product category play?
Category  products  customers  recommendations  edges      buy + get discount  buy + no discount
Book      103,161   2,863,977  5,741,611        2,097,809  65,344              17,769
DVD       19,829    805,285    8,180,393        962,341    17,232              58,189
Music     393,598   794,148    1,443,847        585,738    7,837               2,739
Video     26,131    239,583    280,270          160,683    909                 467
Full      542,719   3,943,084  15,646,121       3,153,676  91,322              79,164

[Figure legend: people, recommendations; high, low]



 There are relatively few DVD titles, but DVDs account for ~ 50% of
recommendations.
 recommendations per person
 DVD: 10
 books and music: 2
 VHS: 1
 recommendations per purchase
 books: 69
 DVDs: 108
 music: 136
 VHS: 203
 Overall there are 3.69 recommendations per node on 3.85 different
products.
 Music recommendations reached about the same number of people as
DVDs but used only 1/5 as many recommendations
 Book recommendations reached by far the most people – 2.8 million.
 All networks have a very small number of unique edges. For books, videos
and music the number of unique edges is smaller than the number of
nodes – the networks are highly disconnected
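
The per-person and per-purchase figures on this slide appear to follow from the product-category table shown earlier. The sketch below recomputes them under two assumptions: the table's "Video" row is the VHS category, and a purchase is counted in either of the two "buy" columns.

```python
# category: (customers, recommendations, buy + get discount, buy + no discount)
stats = {
    "Book": (2_863_977, 5_741_611, 65_344, 17_769),
    "DVD": (805_285, 8_180_393, 17_232, 58_189),
    "Music": (794_148, 1_443_847, 7_837, 2_739),
    "Video": (239_583, 280_270, 909, 467),
}

per_person = {}
per_purchase = {}
for category, (customers, recs, with_discount, without_discount) in stats.items():
    per_person[category] = recs / customers
    per_purchase[category] = recs / (with_discount + without_discount)
    print(f"{category}: {per_person[category]:.2f} recommendations per person, "
          f"{per_purchase[category]:.2f} per purchase")
```

The whole-number parts of the per-purchase ratios match the slide's 69, 108, 136 and 203, and the per-person ratios round to 2, 10, 2 and 1.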



[Figure: size of giant component (x 10^4) versus number of nodes (x 10^6), plotted by month with a quadratic fit. Inset: # nodes versus m (month), with fit label 1.7*106m as extracted]


 94% of users make first recommendation without having
received one previously
 linear growth: ~ 165,000 new users added each month
  size of giant connected component increases from 1% to 2.5%
of the network (100,420 users) – small!
 some sub‐communities are better connected
 24% out of 18,000 users for westerns on DVD
 26% of 25,000 for classics on DVD
 19% of 47,000 for anime (Japanese animated film) on DVD
 others are just as disconnected
 3% of 180,000 home and gardening
 2‐7%
2 7% ffor children’s
hild ’ andd fitness
fit DVD
DVDs
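
The percentages above are the share of users that fall in the largest connected component of each network. A minimal way to compute that share, sketched with a union-find structure on a hypothetical ten-node graph (the function name and data are illustrative, and the edges are treated as undirected):

```python
from collections import Counter


def giant_component_fraction(nodes, edges):
    """Fraction of nodes in the largest connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)


# Hypothetical network: one component of 4, two pairs, two isolated users.
nodes = range(10)
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (6, 7)]
fraction = giant_component_fraction(nodes, edges)
print("giant component covers", fraction, "of the nodes")
```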



 Does sending more recommendations
influence more purchases?
[Figure: Number of Purchases versus Outgoing Recommendations]



 consider whether sender has at least one successful
recommendation
 controls for sender getting credit for purchase that resulted
from others recommending the same product to the same
person
[Figure: Probability of Credit versus Outgoing Recommendations. Annotation: probability of receiving a credit levels off for DVDs]
[Figure: DVD recommendations (8.2 million observations). Probability of purchasing versus # recommendations received]
 Effectiveness of subsequent recommendations?
 Multiple recommendations between two individuals
weaken the impact of the bond on purchases
[Figure: Probability of buying versus Exchanged recommendations]
 Consider successful recommendations in terms of
  av. # senders of recommendations per book category
  av. # of recommendations accepted
 books overall have a 3% success rate
  (2% with discount, 1% without)
 Lower than average success rate
  fiction
  romance (1.78), horror (1.81)
  teen (1.94), children’s books (2.06)
  comics (2.30), sci‐fi (2.34), mystery and thrillers (2.40)
  nonfiction
  sports (2.26)
  home & garden (2.26)
  travel (2.39)
 Higher than average success rate
  professional & technical
  medicine (5.68)
  professional & technical (4.54)
 engineering (4.10), science (3.90), computers & internet (3.61)
 law (3.66), business & investing (3.62)



 Professional & technical book recommendations are more
often accepted
 Some organized contexts other than professional also have
higher success rate, e.g. religion
 overall success rate 3.13%
 Christian themed books
 Christian living and theology (4.7%)
 Bibles (4.8%)
 not‐as‐organized religion
  new age (2.5%)
 occult spirituality (2.2%)
 Well organized hobbies
 books on orchids recommended successfully twice as often as books
on tomato growing



Variable            transformation  Coefficient
const                               -0.940 ***
# recommendations   ln(r)            0.426 ***
# senders           ln(ns)          -0.782 ***
# recipients        ln(nr)          -1.307 ***
product price       ln(p)            0.128 ***
# reviews           ln(v)           -0.011 ***
avg. rating         ln(t)           -0.027 *
R2                                   0.74
significance at the 0.01 (***), 0.05 (**) and 0.1 (*) levels
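
Read as a linear model on log-transformed inputs, the table combines into a single linear predictor. The slide does not name the response variable, so the sketch below returns only that raw predictor and makes no claim about what it measures; the function name, the keyword names, and the example inputs are hypothetical.

```python
import math

# Coefficients copied from the table; keys are illustrative names.
COEFFICIENTS = {
    "const": -0.940,
    "recommendations": 0.426,  # ln(r)
    "senders": -0.782,         # ln(ns)
    "recipients": -1.307,      # ln(nr)
    "price": 0.128,            # ln(p)
    "reviews": -0.011,         # ln(v)
    "rating": -0.027,          # ln(t)
}


def linear_predictor(**inputs):
    """const + sum of coefficient * ln(value) over the supplied variables."""
    total = COEFFICIENTS["const"]
    for name, value in inputs.items():
        total += COEFFICIENTS[name] * math.log(value)
    return total


# Hypothetical product: doubling recommendations vs doubling recipients.
base = linear_predictor(recommendations=100, senders=20, recipients=40,
                        price=30, reviews=10, rating=4)
more_recs = linear_predictor(recommendations=200, senders=20, recipients=40,
                             price=30, reviews=10, rating=4)
more_recipients = linear_predictor(recommendations=100, senders=20, recipients=80,
                                   price=30, reviews=10, rating=4)
print(base, more_recs - base, more_recipients - base)
```

The signs are the useful part: more recommendations and a higher price raise the predictor, while more senders, recipients, reviews and a higher rating lower it.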
 47,000
47 000 customers responsible for the 2.5
2 5 out of
16 million recommendations in the system

 29% success rate per recommender of an anime


DVD

 Giant component covers 19% of the nodes

 Overall, recommendations for DVDs are more


likely to result in a purchase (7%), but the anime
community i standsd out
1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 48
 Three colors: blue, white & red
 showing purchasers only



 Small community
 few reviews, senders, and recipients
  but sending more recommendations helps
 Pricey products
 Rating doesn’t play as much of a role



Observations for diffusion models:
 purchase decision more complex than threshold
or simple infection
 influence saturates as the number of contacts
expands
 links lose effectiveness if they are overused

Conditions for successful recommendations:


 professional and organizational contexts
 discounts on expensive items
 small, tightly knit communities
