Lec1a-IntroDataMining
Mauro Sozio
HKU:[email protected]
! Questions:
! What is the probability that two random people in the world know each other?
! How many hops separate them? (E.g. friend of friend of friend = 3 hops.)
! Experiment:
! Randomly chosen people from Nebraska, Kansas, ..., were sent a letter with the goal of forwarding it to a target person in Boston.
! If a participant knew the target person, he/she could send him/her the letter directly.
! Otherwise, he/she could forward the letter to a relative or a friend who might know the target person.
! Some basic information about the target person was included.
COMP7103: Introduction to Data Mining HKU, Hong Kong
Small-world experiment and six degrees of separation
! Results:
! Only 64 out of 296 letters reached the destination (some people refused to participate).
! Among those reaching the destination, the average number of hops was ~5-6.
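The hop count asked about above can be read as a shortest-path length in an acquaintance graph. The following is a minimal sketch, assuming a small hypothetical graph (the names and edges are made up for illustration and are not from the experiment); it uses breadth-first search, and also computes the completion rate from the reported 64 out of 296 letters.

```python
from collections import deque

def hops(graph, source, target):
    """Return the number of hops on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == target:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None

# Hypothetical acquaintance graph (undirected, so each edge is listed both ways).
friends = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Bob", "Dan"],
    "Dan": ["Cat"],
    "Eve": [],  # knows nobody in this toy graph
}

ann_to_dan = hops(friends, "Ann", "Dan")  # friend of friend of friend
ann_to_eve = hops(friends, "Ann", "Eve")  # no chain exists
completion_rate = 64 / 296                # letters that arrived / letters sent
```

In this toy graph Ann reaches Dan in 3 hops, while Eve is unreachable. The completion rate works out to roughly 22%, so the ~5-6 average is computed only over the chains that completed.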
References:
Travers, Jeffrey & Stanley Milgram. 1969. "An Experimental Study of the Small World Problem." Sociometry, Vol. 32, No. 4, pp. 425-443.
Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, Sebastiano Vigna: Four degrees of separation. WebSci 2012:33-42
! Many definitions:
! Non-trivial extraction of implicit, previously unknown
[Figure: Data Mining shown in relation to Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
! Issues:
! Massive amount of data
! High dimensionality
! Heterogeneous, distributed nature of data
Data Mining Tasks
! Prediction Methods
! Use some variables to predict unknown or future
values of other variables.
! Description Methods
! Find human-interpretable patterns that describe
the data.
[Figure: classification workflow. Training Set -> Learn Classifier -> Model; the Model is applied to the Test Set]
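The training-set/test-set workflow named above can be sketched in a few lines. This is only an illustration, assuming a 1-nearest-neighbour classifier (a technique chosen here for brevity, not one named in the slides) and hypothetical two-feature records: a model is learned from the training set, then evaluated on held-out test records.

```python
def learn_classifier(training_set):
    """'Learning' for 1-nearest-neighbour just stores the labelled examples."""
    return list(training_set)

def predict(model, x):
    """Return the label of the stored example closest to x (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda example: dist2(example[0], x))
    return label

def accuracy(model, test_set):
    """Fraction of test records whose predicted label matches the true label."""
    correct = sum(predict(model, x) == y for x, y in test_set)
    return correct / len(test_set)

# Hypothetical records: (features, class label). Not data from the lecture.
training_set = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((4.8, 5.3), "B"),
]
test_set = [
    ((0.9, 1.1), "A"),
    ((5.1, 4.9), "B"),
    ((3.2, 3.2), "A"),  # lies closer to the "B" examples, so it is misclassified
]

model = learn_classifier(training_set)
test_accuracy = accuracy(model, test_set)
```

Keeping the test set separate from the training set is the point of the workflow: accuracy is measured on records the model has not seen, so one misclassified test record here gives an accuracy of 2/3.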
" Success Story: Could find 16 new high red-shift quasars, some of
the farthest objects that are difficult to find!
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
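Rules like the two above are commonly scored by support and confidence, two standard measures that are not spelled out here. The following is a minimal sketch, assuming five hypothetical market-basket transactions (made up for illustration, not the data behind the rules above): support is the fraction of transactions containing an itemset, and the confidence of X --> Y is support(X and Y together) divided by support(X).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs); assumes lhs occurs in at least one transaction."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transactions, one set of items per basket.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

milk_coke_conf = confidence(baskets, {"Milk"}, {"Coke"})
diaper_milk_beer_conf = confidence(baskets, {"Diaper", "Milk"}, {"Beer"})
diaper_milk_beer_supp = support(baskets, {"Diaper", "Milk", "Beer"})
```

With these made-up baskets, {Milk} --> {Coke} has confidence 3/4 (Coke appears in three of the four baskets containing Milk), and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.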
! Inventory Management:
! Goal: A consumer appliance repair company wants to
[Figure: sequence (A B) (C) (D E) annotated with the constraints <= xg, > ng, <= ws, <= ms]
! Network Intrusion Detection