CS345A: Data Mining On The Web: Course Introduction Issues in Data Mining Bonferroni's Principle
Course Staff
x Instructors:
Anand Rajaraman, Jeff Ullman
Requirements
x Homework (Gradiance and other) 20%
Go to www.gradiance.com/pearson and enter class code 83769DC9. If you took CS145 or CS245 in the past year, you should have free access; otherwise you will have to purchase access from Pearson Ed.
Project
x Software implementation related to course subject matter.
x Should involve an original component or experiment.
x More later about available data and computing resources.
Possible Projects
x Many past projects have dealt with collaborative filtering (advice based on what similar people do).
E.g., Netflix Challenge.
ML-Replacement Projects
x ML generally requires a large training set of correctly classified data.
Example: classifying Web pages by topic.
ML-Replacement (2)
x Many problems require thought rather than ML:
1. Tell important pages from unimportant (PageRank).
2. Tell real news from publicity (how?).
3. Distinguish positive from negative product reviews (how?).
4. Etc., etc.
Team Projects
x Working in pairs is OK, but:
1. No more than two per project.
2. We will expect more from a pair than from an individual.
3. The effort should be roughly evenly distributed.
Cultures
x Databases: concentrate on large-scale (non-main-memory) data.
x AI (machine-learning): concentrate on complex methods, small data.
x Statistics: concentrate on models.
Outline of Course
x Map-Reduce and Hadoop.
x Association rules, frequent itemsets.
x PageRank and related measures of importance on the Web (link analysis).
Spam detection. Topic-specific search.
x Recommendation systems.
Collaborative filtering.
Outline (2)
x Finding similar sets.
Minhashing, Locality-Sensitive hashing.
x Extracting structured data (relations) from the Web.
x Clustering data.
x Managing Web advertisements.
x Mining data streams.
Meaningfulness of Answers
x A big data-mining risk is that you will discover patterns that are meaningless.
x Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
The Details
x $10^{9}$ people being tracked.
x 1000 days.
x Each person stays in a hotel 1% of the time (10 days out of 1000).
x Hotels hold 100 people (so $10^{5}$ hotels).
x If everyone behaves randomly (i.e., there are no evil-doers), will the data mining detect anything suspicious?
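Calculations (1)
x Probability that given persons p and q will be at the same hotel on given day d:
$\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$, where the three factors are the probability that p is at some hotel, that q is at some hotel, and that the two hotels are the same.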
x Probability that p and q will be at the same hotel on given days d1 and d2:
$10^{-9} \times 10^{-9} = 10^{-18}$.
x Pairs of days:
$5 \times 10^{5}$.
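The arithmetic on this slide can be checked with exact fractions. The following is a minimal sketch, assuming the parameters from "The Details" and assuming that different days are independent; the constant and variable names are illustrative and not part of the course material.

```python
from fractions import Fraction
from math import comb

# Parameters from the slide "The Details".
N_DAYS = 1000
N_HOTELS = 10**5
P_IN_HOTEL = Fraction(1, 100)  # a person is in some hotel on 1% of days

# p is at some hotel, q is at some hotel, and q's hotel matches p's.
p_same_hotel_one_day = P_IN_HOTEL * P_IN_HOTEL * Fraction(1, N_HOTELS)

# Two given days d1 and d2, assumed independent of each other.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

# Unordered pairs of distinct days; the slide rounds this to 5 x 10^5.
day_pairs = comb(N_DAYS, 2)

print(p_same_hotel_one_day)   # 1/1000000000
print(p_same_hotel_two_days)  # 1/1000000000000000000
print(day_pairs)              # 499500
```

The exact count of day pairs is 499,500, which the slide rounds to $5 \times 10^{5}$.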
Calculations (2)
x Probability that p and q will be at the same hotel on some two days:
$5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
x Pairs of people:
$5 \times 10^{17}$.
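Multiplying the number of pairs of people by the per-pair probability gives the expected number of pairs that look suspicious purely by chance. The sketch below does this once with the slide's rounded figures and once with exact binomial counts; the function and constant names are hypothetical, and treating the product as an expected count is the usual approximation for rare events.

```python
from fractions import Fraction
from math import comb

N_PEOPLE = 10**9
N_DAYS = 1000
# Probability that a given pair shares a hotel on two given days.
P_TWO_GIVEN_DAYS = Fraction(1, 10**18)

def expected_suspicious_pairs(people_pairs, day_pairs):
    """Expected number of (pair of people, pair of days) coincidences."""
    return Fraction(people_pairs) * day_pairs * P_TWO_GIVEN_DAYS

# Rounded figures as used on the slides.
rounded = expected_suspicious_pairs(5 * 10**17, 5 * 10**5)

# Exact counts of unordered pairs.
exact = expected_suspicious_pairs(comb(N_PEOPLE, 2), comb(N_DAYS, 2))

print(rounded)       # 250000
print(float(exact))  # slightly below 250000
```

With either set of figures, roughly a quarter of a million pairs would be flagged even though everyone behaves randomly.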
Conclusion
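x Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
x Random behavior alone produces about $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$ pairs that look suspicious.
x Analysts have to sift through 250,010 candidates to find the 10 real cases.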
Not gonna happen. But how can we improve the scheme?
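To put numbers on the conclusion, the sketch below computes the fraction of candidates that are real, and then tries one hypothetical tightening of the criterion: requiring the same hotel on k days instead of two. This is only an illustration of how the expected number of coincidences reacts to a stricter property, not the lecture's answer to the question; the function names are made up for this example. Note also that the evil-doers in the scenario are only known to meet twice, so a stricter criterion could miss them.

```python
from math import comb

N_PEOPLE = 10**9
N_DAYS = 1000
P_SAME_HOTEL_ONE_DAY = 1e-9  # from Calculations (1)

def real_fraction(real_pairs, coincidental_pairs):
    """Fraction of flagged pairs that are real cases."""
    return real_pairs / (real_pairs + coincidental_pairs)

def expected_coincidences(k_days):
    """Expected (pair of people, set of k days) co-stays under random behavior."""
    return comb(N_PEOPLE, 2) * comb(N_DAYS, k_days) * P_SAME_HOTEL_ONE_DAY ** k_days

print(real_fraction(10, 250_000))  # about 4 in 100,000
print(expected_coincidences(2))    # about a quarter of a million
print(expected_coincidences(3))    # well below one
```

Under these assumptions, going from two shared days to three drops the expected number of coincidental pairs from hundreds of thousands to less than one.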
Moral
x When looking for a property (e.g., two people stayed at the same hotel twice), make sure that the property does not allow so many possibilities that random data will surely produce facts of interest.
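The moral can be seen directly in a small simulation. The sketch below scales the hotel scenario down to an assumed toy size (200 people, 100 days, 10 hotels, a 10% chance of being in a hotel on any day) so that it runs quickly; these numbers and the function name are illustrative, not taken from the slides. Everyone behaves randomly, yet the slide-style estimate predicts on the order of 100 pairs that share a hotel on at least two days.

```python
import random
from collections import Counter
from itertools import combinations

def count_suspicious_pairs(n_people, n_days, n_hotels, p_in_hotel, seed=0):
    """Count pairs of people who share a hotel on at least two days."""
    rng = random.Random(seed)
    shared_days = Counter()  # (person_a, person_b) -> number of shared hotel days
    for _ in range(n_days):
        guests_by_hotel = {}
        for person in range(n_people):
            # Each person independently visits a uniformly random hotel.
            if rng.random() < p_in_hotel:
                hotel = rng.randrange(n_hotels)
                guests_by_hotel.setdefault(hotel, []).append(person)
        for guests in guests_by_hotel.values():
            # Guests are listed in increasing order, so each pair has one key.
            for pair in combinations(guests, 2):
                shared_days[pair] += 1
    return sum(1 for days in shared_days.values() if days >= 2)

suspicious = count_suspicious_pairs(
    n_people=200, n_days=100, n_hotels=10, p_in_hotel=0.1
)
print(suspicious)
```

Every pair this reports is a coincidence, since the simulated population contains no evil-doers at all.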
Moral
x Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.