CS345A: Data Mining On The Web: Course Introduction Issues in Data Mining Bonferroni's Principle

This document provides an introduction and overview of the CS345A: Data Mining on the Web course. It discusses the course staff, requirements including homework, projects, and exams. It also outlines possible project topics related to collaborative filtering, machine learning problems, and team projects. Additionally, it covers what data mining is, relevant cultures and models, and provides an outline of topics to be covered in the course including link analysis, recommendation systems, clustering, and data streams. It emphasizes the importance of finding meaningful patterns and avoiding meaningless discoveries by understanding Bonferroni's principle.

Uploaded by

Preetham Gowda

CS345A: Data Mining on the Web

Course Introduction
Issues in Data Mining
Bonferroni's Principle

Course Staff
- Instructors:
    Anand Rajaraman
    Jeff Ullman
- Reach us at cs345a-win0809-staff @ lists.stanford.edu.
- More info on www.stanford.edu/class/cs345a.

Requirements
- Homework (Gradiance and other): 20%
    Go to www.gradiance.com/pearson.
    Enter class code 83769DC9.
    If you took CS145 or CS245 in the past year, you should have free access; otherwise you will have to purchase access from Pearson Ed.
- Project: 40%
- Final Exam: 40%

Project
- Software implementation related to course subject matter.
- Should involve an original component or experiment.
- More later about available data and computing resources.

Possible Projects
- Many past projects have dealt with collaborative filtering (advice based on what similar people do).
    E.g., Netflix Challenge.
- Others have dealt with engineering solutions to machine-learning problems.

ML-Replacement Projects
- ML generally requires a large training set of correctly classified data.
    Example: classifying Web pages by topic.
- Hard to find well-classified data.
    Exception: Open Directory works for page topics, because work is collaborative and shared by many.
    Other good exceptions?

ML-Replacement (2)
- Many problems require thought rather than ML:
    1. Tell important pages from unimportant (PageRank).
    2. Tell real news from publicity (how?).
    3. Distinguish positive from negative product reviews (how?).
    4. Etc., etc.

Team Projects
- Working in pairs OK, but:
    1. No more than two per project.
    2. We will expect more from a pair than from an individual.
    3. The effort should be roughly evenly distributed.

What is Data Mining?

- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
    Data cleaning: detection of bogus data.
        E.g., age = 150.
        Entity resolution.
    Visualization: something better than megabyte files of output.

Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine-learning): concentrate on complex methods, small data.
- Statistics: concentrate on models.

Models vs. Analytic Processing

- To a database person, data-mining is an extreme form of analytic processing: queries that examine large amounts of data.
    Result is the query answer.
- To a statistician, data-mining is the inference of models.
    Result is the parameters of the model.

(Way too Simple) Example

- Given a billion numbers, a DB person would compute their average and standard deviation.
- A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution.
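The two viewpoints can be sketched side by side. This is only an illustration under stated assumptions: the sample is 10,000 synthetic points rather than a billion, and `summarize` and `gaussian_log_likelihood` are hypothetical helper names, not anything from the course. For a Gaussian, the maximum-likelihood fit has a closed form that coincides with the sample mean and (population) standard deviation, which is why the example is "way too simple": both cultures report the same two numbers.

```python
import math
import random


def summarize(xs):
    """DB-style query answer: average and population standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return mean, std


def gaussian_log_likelihood(xs, mu, sigma):
    """Log-likelihood of xs under a Gaussian model with parameters mu, sigma."""
    n = len(xs)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


# Synthetic stand-in for the "billion numbers" (assumption: much smaller sample).
random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# The DB person's answer...
mean, std = summarize(data)

# ...is also the statistician's best-fit Gaussian: no other (mu, sigma)
# gives a higher likelihood for this data.
best = gaussian_log_likelihood(data, mean, std)
print(mean, std, best)
```

Perturbing either parameter, for example `gaussian_log_likelihood(data, mean + 0.1, std)`, yields a lower value than `best`.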

Outline of Course
- Map-Reduce and Hadoop.
- Association rules, frequent itemsets.
- PageRank and related measures of importance on the Web (link analysis).
    Spam detection.
    Topic-specific search.
- Recommendation systems.
    Collaborative filtering.

Outline (2)
- Finding similar sets.
    Minhashing, Locality-Sensitive hashing.
- Extracting structured data (relations) from the Web.
- Clustering data.
- Managing Web advertisements.
- Mining data streams.

Meaningfulness of Answers
- A big data-mining risk is that you will discover patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Examples of Bonferroni's Principle

1. A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
2. The Rhine Paradox: a great example of how not to conduct scientific research.

Stanford Professor Proves Tracking Terrorists Is Impossible!

- Three years ago, the example I am about to give you was picked up from my class slides by a reporter from the LA Times.
- Despite my talking to him at length, he was unable to grasp the point that the story was made up to illustrate Bonferroni's Principle, and was not real.

The TIA Story

- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

The Details
- $10^9$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^5$ hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

Calculations (1)

- Probability that given persons p and q will be at the same hotel on given day d:
    $\frac{1}{100} \times \frac{1}{100} \times 10^{-5} = 10^{-9}$
    (the factors are: p at some hotel, q at some hotel, same hotel).
- Probability that p and q will be at the same hotel on given days d1 and d2:
    $10^{-9} \times 10^{-9} = 10^{-18}$.
- Pairs of days:
    $5 \times 10^{5}$.

Calculations (2)
- Probability that p and q will be at the same hotel on some two days:
    $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
- Pairs of people:
    $5 \times 10^{17}$.
- Expected number of suspicious pairs of people:
    $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.
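The arithmetic on the two calculation slides can be reproduced in a few lines. This is a minimal sketch using exact rational arithmetic; the constant and function names (`PEOPLE`, `expected_pairs`, and so on) are illustrative choices, not from the slides. It evaluates the estimate twice: once with the rounded counts used on the slides, and once with exact binomial coefficients for the pairs of people and pairs of days.

```python
from fractions import Fraction
from math import comb

# Parameters taken from "The Details".
PEOPLE = 10**9
DAYS = 1000
HOTELS = 10**5
STAY_PROB = Fraction(1, 100)  # each person is in some hotel 1% of the time

# p in a hotel, q in a hotel, and both in the same one of the HOTELS hotels.
p_same_day = STAY_PROB * STAY_PROB * Fraction(1, HOTELS)

# Same hotel on two particular days (the two days are treated as independent).
p_two_given_days = p_same_day ** 2


def expected_pairs(pairs_of_people, pairs_of_days):
    """Expected number of pairs that share a hotel on some two days."""
    return pairs_of_people * pairs_of_days * p_two_given_days


# Rounded counts, as on the slides.
slide_estimate = expected_pairs(5 * 10**17, 5 * 10**5)

# Exact counts of unordered pairs.
exact_estimate = expected_pairs(comb(PEOPLE, 2), comb(DAYS, 2))

print(float(slide_estimate))
print(float(exact_estimate))
```

With the rounded counts the result is exactly 250,000; with exact pair counts it comes out near 249,750, so the rounding on the slides does not change the conclusion.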

Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
    Not gonna happen.
    But how can we improve the scheme?

Moral
- When looking for a property (e.g., two people stayed at the same hotel twice), make sure that the property does not allow so many possibilities that random data will surely produce facts of interest.

Rhine Paradox (1)

- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox (2)

- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
    Answer on next slide.

Rhine Paradox (3)

- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
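The "almost 1 in 1000" figure is what pure guessing predicts. A short sketch, assuming each card is an independent fair red/blue guess; the function names and the subject count of 1024 are hypothetical, since the slides do not say how many people were tested.

```python
from fractions import Fraction


def prob_all_correct(cards, colors=2):
    """Chance of guessing every card right when each guess is uniform at random."""
    return Fraction(1, colors) ** cards


def expected_lucky(subjects, cards):
    """Expected number of subjects who get every card right by luck alone."""
    return subjects * prob_all_correct(cards)


p = prob_all_correct(10)
print(p)                         # 1/1024, i.e. almost 1 in 1000
print(expected_lucky(1024, 10))  # 1 (hypothetical pool of 1024 subjects)

# A subject who was lucky once has the same 1/1024 chance on a retest,
# so nearly all of them "lose" their ESP the second time.
print(float(1 - p))
```

Under these assumptions, a perfect score by someone in a large pool is expected, and failing the retest is expected too; no ESP is needed to explain either result.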

Moral
- Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.
