01 Intro
Jure Leskovec
Stanford University
Many past projects have dealt with
collaborative filtering (advice based on what
similar people do)
E.g., Netflix Challenge
Others have dealt with engineering solutions
to machine‐learning problems
Lots of interesting project ideas
If you can’t think of one please come talk to us
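As a rough illustration of collaborative filtering ("advice based on what similar people do"), the sketch below scores unseen items for a user by the ratings of the most similar other users. It is a minimal sketch under stated assumptions: the function names `cosine` and `recommend`, the parameter `k`, and the toy ratings are all invented for illustration and are not taken from any past project or from the Netflix Challenge data.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, k=2):
    """Rank items the user has not rated, using the k most similar users.

    Each unseen item is scored by the sum of (similarity * rating) over
    those neighbours; higher scores come first.
    """
    neighbours = sorted(
        ((cosine(ratings[user], ratings[other]), other)
         for other in ratings if other != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbours:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 4},
    "cat": {"A": 1, "D": 5},
}
print(recommend(ratings, "ann"))
```

Here "bob" rates items much like "ann" does, so the item only "bob" has rated is ranked ahead of the item only "cat" has rated.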
Computers are cheap and powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g., in
Customer Relationship Management)
The Data Gap
[Figure: total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 3,500,000]
1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 15
Many Definitions
Non‐trivial extraction of implicit, previously
unknown and useful information from data
Exploration & analysis, by automatic or
semi‐automatic means, of
large quantities of data
in order to discover
meaningful patterns
Description Methods
Find human‐interpretable patterns that
describe the data.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Cluster 1 (Technology1‐DOWN): Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN,
Cluster 4 (Oil-UP): Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Network Intrusion
Detection
Recommendations are made to any number of
people at the time of purchase
Only the recipient who buys first gets a
discount
Product recommendation network
[Figure legend: customer recommending a product; purchase following a recommendation]
[Figure labels: people, recommendations; scale from low to high]
[Figure: main panel x-axis: number of nodes (x 10^6); series: by month, quadratic fit. Inset: # nodes n vs. m (month), with fit 1.7*10^6 m]
[Figure: Number of Purchases vs. Outgoing Recommendations]
[Figure: Probability of Credit vs. Outgoing Recommendations; annotation: probability of receiving a credit levels off for DVDs]
DVD recommendations
(8.2 million observations)
[Figure: Probability of purchasing vs. # recommendations received]
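A curve like this one (probability of purchasing as a function of the number of recommendations received) can be estimated by grouping observations by the number received and taking the fraction that ended in a purchase. The following is a minimal sketch, assuming each observation is a pair (number of recommendations received, whether a purchase followed). The function name `purchase_probability` and the toy observations are hypothetical; the real 8.2 million observations are not reproduced here.

```python
from collections import defaultdict


def purchase_probability(observations):
    """Map each count of received recommendations to the fraction of
    observations with that count that ended in a purchase."""
    totals = defaultdict(int)
    buys = defaultdict(int)
    for n_received, bought in observations:
        totals[n_received] += 1
        buys[n_received] += int(bought)
    return {n: buys[n] / totals[n] for n in sorted(totals)}


# Hypothetical toy observations: (recommendations received, purchased?)
observations = [
    (1, False), (1, False), (1, True), (1, False),
    (2, True), (2, False),
]
curve = purchase_probability(observations)
print(curve)
```

With real data, each point on the curve is only as reliable as the number of observations behind it, so counts with few observations give noisy estimates.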
Effectiveness of subsequent recommendations?
Multiple recommendations between two individuals
weaken the impact of the bond on purchases
[Figure: Probability of buying vs. Exchanged recommendations]
Consider successful recommendations in terms of
av. # senders of recommendations per book category
av. # of recommendations accepted
books overall have a 3% success rate
(2% with discount, 1% without)
Lower than average success rate
fiction
romance (1.78), horror (1.81)
teen (1.94), children’s books (2.06)
comics (2.30), sci‐fi (2.34), mystery and thrillers (2.40)
nonfiction
sports (2.26)
home & garden (2.26)
travel (2.39)
Higher than average success rate
professional & technical
medicine (5.68)
professional & technical (4.54)
engineering (4.10), science (3.90), computers & internet (3.61)
law (3.66), business & investing (3.62)
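The split into lower and higher than average categories can be reproduced by comparing each category's value with the overall rate. This is a minimal sketch, assuming the numbers in parentheses are success rates in percent, comparable to the 3% overall book rate (the slides do not state the unit explicitly). The function name `split_by_overall` is hypothetical, and only a subset of the categories above is included.

```python
OVERALL_RATE = 3.0  # overall book success rate in percent, as stated above

# Values copied from the lists above (a subset); assumed to be percentages.
category_rates = {
    "romance": 1.78,
    "horror": 1.81,
    "travel": 2.39,
    "computers & internet": 3.61,
    "engineering": 4.10,
    "medicine": 5.68,
}


def split_by_overall(rates, overall):
    """Return (lower, higher) lists of categories, each sorted by rate."""
    ordered = sorted(rates, key=rates.get)
    lower = [c for c in ordered if rates[c] < overall]
    higher = [c for c in ordered if rates[c] >= overall]
    return lower, higher


lower, higher = split_by_overall(category_rates, OVERALL_RATE)
print("lower than average:", lower)
print("higher than average:", higher)
```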