Lecture 1
MSEC 20-760
Mini II
Location: GSIA Simon Auditorium
Time: 1:30-3:20pm, Tues. & Thurs.
Instructor: Prof. Jaime Carbonell
Office: NSH 4519
Email: [email protected]
Tel: 268-7279
[Augmented with expert guest lectures]
Teaching assistant: Jian Zhang
Office: NSH 4605
Email: [email protected]
Tel: 268-6521
Office Hours: TBD
Administrative assistant: TBD
Office: NSH 4517
Email: [email protected]
Tel: 268-4788
Administrative Issues
Prerequisites
•Basic programming skills (preferably JAVA)
•Familiarity with the web (HTML, browsing, etc.)
•Fundamentals of Web Programming (20-753).
Grading
30% homeworks (2 programming assignments)
30% miniproject (student teams will propose)
15% midterm (5 pages notes, calculator OK, no laptops)
25% final (10 pages notes, calculator OK, no laptops)
Bulletin Board
Schedule/syllabus
Lecture notes (in powerpoint)
Homework
Announcements & discussions
Textbook and Reference Materials (1)
Required: Class notes (slides on web site) and handouts (to be provided)
(2) Dan Marino states that professional football risks losing the number
one position in the hearts of fans across this land. Declines in TV audience
ratings are cited...
Challenges
•How to retrieve (1) and (3) and not (2)?
•How to rank (3) as best?
•How to cope with no shared words?
Information Retrieval in eCommerce (1)
Bringing in Customers
How do Web-search engines work?
Text Mining
•How to learn what customers want most?
•How to find out what they missed, but wanted?
•How to discover customer search/browsing patterns?
Information Retrieval
Assumption (1)
Basic IR task
•There exists a document collection {Dj }
Desiderata (1)
•Query must be natural for all users
•Sentence, phrase, or word(s)
•No AND’s, OR’s, NOT’s, ...
•No parentheses (no structure)
•System focuses on important words
•Q: I want laser printers now
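One simple way a system can focus on the important words is to drop common "fluff" words before searching. The sketch below is illustrative only: the stopword list and the function name `important_terms` are assumptions made for this example, not part of the course materials, and a real system would use a much larger list or term weighting.

```python
# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"i", "want", "now", "a", "an", "the", "to", "of", "for", "please"}

def important_terms(query):
    """Lowercase the query, split on whitespace, and drop stopwords."""
    return [word for word in query.lower().split() if word not in STOPWORDS]

print(important_terms("I want laser printers now"))  # ['laser', 'printers']
```

With this list, the free-form query from the slide reduces to the two content words, with no Boolean operators required of the user.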
Beyond the Boolean Boondoggle (2)
Desiderata (2)
• Find what I mean, not just what I say
Q: cheap car insurance
(pAND (pOR
"cheap" [1.0]
"inexpensive" [0.9]
"discount" [0.5])
(pOR "car" [1.0]
"auto" [0.8]
"automobile" [0.9]
"vehicle" [0.5])
(pOR "insurance" [1.0]
"policy" [0.3]))
Beyond the Boolean Boondoggle (3)
Desiderata (3)
•Speech-recognized queries
•Coming soon, to a system near you
•longer queries
•more fluff words to filter
•acoustic recognition errors
INFORMATION RETRIEVAL
[Diagram: User, Search Engine, Inverted Index, Library, etc.]
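To make the inverted index concrete, here is a minimal sketch that maps each word to the set of documents containing it and answers a query by intersecting those sets. The three sample documents and the names `build_inverted_index` and `lookup` are invented for illustration; production search engines also store positions and weights and compress the index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical three-document collection.
DOCS = {
    1: "laser printers on sale",
    2: "inkjet printers and ink",
    3: "laser pointers",
}
index = build_inverted_index(DOCS)
print(lookup(index, "laser printers"))  # {1}
```

The lookup touches only the posting sets for the query words, so the engine never has to scan the full text of the collection at query time.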
INFORMATION RETRIEVAL: APPLICATIONS
• Searching Document Archives
– Libraries (title, subject, full-text)
– Databases of patents and applications
– DBs of legal cases (e.g. Lexis, Westlaw)
• Searching the Web
– Pure search engines (Google, Inktomi, …)
– Browsing + Search (Yahoo, Terra-Lycos, …)
– Meta-search (Metacrawler, Vivisimo, …)
• Corporate or Government Intranets
• Non-traditional (e.g. Software DBs, News)
INFORMATION RETRIEVAL (IR) EVOLUTION
• IR in the 1980s:
– Single collection with < 10^6 documents (archive)
– Boolean queries with unordered-set answer
• IR circa 2000:
– Single collection with > 10^9 documents (web)
– Free-form queries with ranked-list answer
• IR circa 2010:
– Multiple collections > 10^12 docs (invisible web)
– “Find what I mean” queries with clustering,
summarization and customization.
Content for Rest of the Course (1)
[See the course BB for the latest updates to the
course schedule.]