Lecture 1

Uploaded by

NEDAL MOHAMMED

Web-based Information Architectures

MSEC 20-760
Mini II
Location: GSIA Simon Auditorium
Time: 1:30-3:20pm, Tues. & Thurs.
Instructor: Prof. Jaime Carbonell
Office: NSH 4519
Email: [email protected]
Tel: 268-7279
[Augmented with expert guest lectures]
Teaching assistant: Jian Zhang
Office: NSH 4605
Email: [email protected]
Tel: 268-6521
Office Hours: TBD
Administrative assistant: TBD
Office: NSH 4517
Email: [email protected]
Tel: 268-4788
Administrative Issues
Prerequisites
•Basic programming skills (preferably Java)
•Familiarity with the web (HTML, browsing, etc.)
•Fundamentals of Web Programming (20-753).
Grading
30% homeworks (2 programming assignments)
30% miniproject (student teams will propose)
15% midterm (5 pages notes, calculator OK, no laptops)
25% final (10 pages notes, calculator OK, no laptops)
Bulletin Board
Schedule/syllabus
Lecture notes (in powerpoint)
Homework
Announcements & discussions
Textbook and Reference Materials (1)
Required: Class notes (slides on web site)
and handouts (to be provided)

Required: "Understanding Search Engines:
Mathematical Modeling and Text Retrieval"
by Michael W. Berry, Murray Browne
Available at http://www.siam.org
(tel: 1-800-447-7426)

Optional: Background reading material provided

Textbook and Reference Materials (2)
Optional: "Advances in Information Retrieval" Edited
by Croft, Kluwer Academic Pub., 2000
[more detailed state-of-the-art IR book]

Optional: "Machine Learning" by Tom M. Mitchell,

WCB McGraw-Hill [Tools for text
categorization and data mining.]
Information Retrieval: The Challenge (1)
Text DB includes:
(1) Rainfall measurements in the Sahara continue to show a steady
decline starting from the first measurements in 1961. In 1996 only
12mm of rain were recorded in upper Sudan, and 1mm in Southern
Algiers...

(2) Dan Marino states that professional football risks losing the number
one position in the hearts of fans across this land. Declines in TV audience
ratings are cited...

(3) Alarming reductions in precipitation in desert regions are blamed for

desert encroachment of previously fertile farmland in Northern Africa.
Scientists measured both yearly precipitation and groundwater levels...
Information Retrieval: The Challenge (2)

User query states:

"Decline in rainfall and impact on farms near Sahara"

Challenges
•How to retrieve (1) and (3) and not (2)?
•How to rank (3) as best?
•How to cope with no shared words?
Information Retrieval in
eCommerce (1)
Bringing in Customers
How do Web-search engines work?

How to maximize hits on my eCommerce pages?

How to maximize preselection of customers who will
transact?
Information Retrieval in
eCommerce (2)
Analyzing the Competition
•How do we find the competition?
•How will customers find the competition?
•Can we do preemptive information strikes?

Text Mining
•How to learn what customers want most?
•How to find out what they missed, but wanted?
•How to discover customer search/browsing
patterns?
Information Retrieval
Assumption (1)
Basic IR task
•There exists a document collection {Dj}

•User enters ad hoc query Q

•Q correctly states user’s interest

•User wants {Di} ⊂ {Dj} most relevant to Q

Information Retrieval
Assumption (2)
"Shared Bag of Words" assumption
Every query = {wi }
Every document = {wk }
...where wi & wk in same Σ

All syntax is irrelevant (e.g. word order)

All document structure is irrelevant
All meta-information is irrelevant
(e.g. author, source, genre)
=> Words suffice for relevance assessment
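A minimal sketch of the bag-of-words assumption, for illustration only. The function name `bag_of_words` and the tokenization rule (lowercase, alphabetic runs only) are assumptions, not part of the lecture. The point it shows is that word order disappears once a text is reduced to its words.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts; order and structure are discarded.

    Tokenization here (lowercase alphabetic runs) is an illustrative
    assumption, not something the slides specify.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two queries that differ only in word order become identical bags.
q1 = bag_of_words("silver lining")
q2 = bag_of_words("lining silver")
same_bag = (q1 == q2)

# set(bag) gives the plain word set {wi} used on the slide.
word_set = set(bag_of_words("The emulsion was worth its weight in silver"))
```

Under this representation, "silver lining" and "lining silver" are indistinguishable, which is exactly what "all syntax is irrelevant" means.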
Information Retrieval
Assumption (3)
Retrieval by shared words
If Q and Dj share some wi , then Relevant(Q, Dj )

If Q and Dj share all wi , then Relevant(Q, Dj )

If Q and Dj share over K% of wi , then Relevant(Q, Dj)
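The three shared-word criteria above can be sketched as follows. This is a hedged illustration: the helper names (`words`, `shared_fraction`, `relevant_some`, `relevant_all`, `relevant_over_k`) and the tokenizer are assumptions, and the sample texts are the first sentence of document (1), a sentence of document (2), and the first sentence of document (3) from the earlier "Challenge" slide.

```python
import re

def words(text):
    """Lowercased set of alphanumeric tokens (an assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shared_fraction(query, doc):
    """Fraction of distinct query words that also occur in the document."""
    q = words(query)
    if not q:
        return 0.0
    return len(q & words(doc)) / len(q)

def relevant_some(query, doc):
    """Relevant if Q and D share at least one word."""
    return shared_fraction(query, doc) > 0

def relevant_all(query, doc):
    """Relevant if every query word occurs in D."""
    return shared_fraction(query, doc) == 1.0

def relevant_over_k(query, doc, k_percent):
    """Relevant if more than K% of the query words occur in D."""
    return shared_fraction(query, doc) * 100 > k_percent

query = "Decline in rainfall and impact on farms near Sahara"
doc1 = ("Rainfall measurements in the Sahara continue to show a steady "
        "decline starting from the first measurements in 1961.")
doc2 = "Declines in TV audience ratings are cited..."
doc3 = ("Alarming reductions in precipitation in desert regions are blamed "
        "for desert encroachment of previously fertile farmland in "
        "Northern Africa.")
```

With these texts, document (3) shares only the word "in" with the query, the same as the football document (2), so none of the three criteria can separate them. That is the "no shared words" challenge in concrete form.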

Boolean Queries (1)
Industrial use of Silver
Q: silver
R: "The Count’s silver anniversary..."
"Even the crash of ’87 had a silver lining..."
"The Lone Ranger lived on in syndication..."
"Silver dropped to a new low in London..."
...

Q: silver AND photography

R: "Posters of Tonto and the Lone Ranger..."
"The Queen’s Silver Anniversary photos..."
...
Boolean Queries (2)
Q: ((silver AND (NOT anniversary)
AND (NOT lining)
AND emulsion)
OR (AgI AND crystal
AND photography))

R: "Silver Iodide Crystals in Photography..."

"The emulsion was worth its weight in
silver..."
...
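A minimal sketch of how a Boolean query like the one above could be evaluated against a document. The nested-tuple query encoding and the names `tokens` and `matches` are illustrative assumptions; real systems evaluate such queries against an index rather than scanning each text.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic tokens (an assumed tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(query, doc_words):
    """Evaluate a nested Boolean query against a set of document words.

    A query is either a term (str) or a tuple ("AND" | "OR" | "NOT", ...).
    """
    if isinstance(query, str):
        return query.lower() in doc_words
    op, *args = query
    if op == "AND":
        return all(matches(a, doc_words) for a in args)
    if op == "OR":
        return any(matches(a, doc_words) for a in args)
    if op == "NOT":
        return not matches(args[0], doc_words)
    raise ValueError(f"unknown operator: {op}")

# The query from the slide, in the assumed tuple encoding.
QUERY = ("OR",
         ("AND", "silver", ("NOT", "anniversary"), ("NOT", "lining"), "emulsion"),
         ("AND", "AgI", "crystal", "photography"))

hit = matches(QUERY, tokens("The emulsion was worth its weight in silver"))
miss = matches(QUERY, tokens("The Queen's Silver Anniversary photos"))
title_only = matches(QUERY, tokens("Silver Iodide Crystals in Photography"))
```

Note that the title "Silver Iodide Crystals in Photography" does not match on its own under exact word matching: it contains neither "emulsion" nor "AgI", and "crystals" is not "crystal". A document is retrieved only if the required terms appear literally somewhere in its text.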
Boolean Queries (3)

Boolean queries are:

a) easy to implement
b) confusing to compose
c) seldom used (except by librarians)
d) prone to low recall
e) all of the above
Beyond the Boolean Boondoggle (1)

Desiderata (1)
•Query must be natural for all users
•Sentence, phrase, or word(s)
•No AND’s, OR’s, NOT’s, ...
•No parentheses (no structure)
•System focus on important words
•Q: I want laser printers now
Beyond the Boolean Boondoggle (2)
Desiderata (2)
• Find what I mean, not just what I say
Q: cheap car insurance
(pAND (pOR
"cheap" [1.0]
"inexpensive" [0.9]
"discount" [0.5])
(pOR "car" [1.0]
"auto" [0.8]
"automobile" [0.9]
"vehicle" [0.5])
(pOR "insurance" [1.0]
"policy" [0.3]))
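The slide does not define how pAND and pOR combine the weights, so the following is only one possible reading, stated as an assumption: each pOR group scores as the highest weight among its terms that occur in the document (0 if none occur), and pAND multiplies the group scores. The names `EXPANDED_QUERY` and `score` are hypothetical.

```python
import re

# Weighted synonym groups copied from the slide; one dict per pOR group.
EXPANDED_QUERY = [
    {"cheap": 1.0, "inexpensive": 0.9, "discount": 0.5},
    {"car": 1.0, "auto": 0.8, "automobile": 0.9, "vehicle": 0.5},
    {"insurance": 1.0, "policy": 0.3},
]

def score(doc, groups=EXPANDED_QUERY):
    """Score a document under the ASSUMED semantics:
    pOR  = max weight of any group term present in the document,
    pAND = product of the group scores.
    """
    present = set(re.findall(r"[a-z]+", doc.lower()))
    total = 1.0
    for group in groups:
        best = max((w for term, w in group.items() if term in present),
                   default=0.0)
        total *= best
    return total

exact = score("Cheap car insurance quotes")
synonyms = score("Inexpensive auto insurance")
unrelated = score("Cheap flights to London")
```

Under these assumed semantics, a document that uses the user's exact words scores highest, one that uses close synonyms scores somewhat lower, and one missing a whole concept scores zero.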
Beyond the Boolean Boondoggle (3)

Desiderata (3)
•Speech-recognized queries
•Coming soon, to a system near you
•longer queries
•more fluff words to filter
•acoustic recognition errors
INFORMATION RETRIEVAL

[Figure: system overview with components User, Search Engine, Inverted Index, Spider, The Web, Library, etc.]
INFORMATION RETRIEVAL:
APPLICATIONS
• Searching Document Archives
– Libraries (title, subject, full-text)
– Data bases of patents and applications
– DBs of legal cases (e.g. Lexis, Westlaw)
• Searching the Web
– Pure search engines (Google, Inktomi, …)
– Browsing + Search (Yahoo, Terra-Lycos, …)
– Meta-search (Metacrawler, Vivisimo, …)
• Corporate or Government Intranets
• Non-traditional (e.g. Software DBs, News)
INFORMATION RETRIEVAL
(IR) EVOLUTION
• IR in the 1980s:
– Single collection with < 10^6 documents (archive)
– Boolean queries with unordered-set answer
• IR circa 2000:
– Single collection with > 10^9 documents (web)
– Free-form queries with ranked-list answer
• IR circa 2010:
– Multiple collections > 10^12 docs (invisible web)
– “Find what I mean” queries with clustering,
summarization and customization.
Content for Rest of the Course (1)
[See the course BB for the latest updates to the
course schedule.]

Under the Hood

•The vector space model for retrieval
•Building an inverted index
•Term weighting and selection
•Web spidering
•Automated text categorization
Content for Rest of the Course (2)
IR Uses in eCommerce
•How to make search engines work for you
•How to build optimal search-attractive web sites
•The business(es) of web-based information

Beyond Web Search Engines

•Speech processing primer
•Information extraction from web pages
•Data mining primer
•Multi-media applications
•Business models
Optional Quick Review of Linear Algebra
If you know n-dimensional vectors, matrices, computing
inner products, etc., then you do not need this review.
You may take a break.

If you learned this material, but do not remember it, please
stay and listen to refresh your knowledge.

If you never learned linear algebra, stay, listen and
(optionally) read either:
• G. Hadley. Linear Algebra. Addison-Wesley, 1961. Ch. 3.
• Or, Stephen W. Goode. An Introduction to Differential
Equations and Linear Algebra. Prentice Hall, 1991. Ch. 3.
