Introduction To Information Retrieval
Lecture 1 Introduction
What is information retrieval?
Why information retrieval
• Information overload
– “It refers to the difficulty a person can have
understanding an issue and making decisions that
can be caused by the presence of too much
information.” - wiki
Why information retrieval
• Handling unstructured data
– Structured data: database system is a good choice

Table 1: People in CS Department
ID  Name          Job
1   Dr. Kashif    Professor
3   Miss Wajeeha  Secretary
5   Mr. Aftab     Academic Officer

– Unstructured data is more dominant
• Text in Web documents or emails, image, audio, video…
• “85 percent of all business information exists as
unstructured data” - Merrill Lynch
• Unknown semantic meaning
[Figure: bar chart comparing unstructured and structured data by data volume and market cap]
Unstructured (text) vs. structured (database) data today
[Figure: bar chart comparing unstructured and structured data by data volume and market cap]
Why information retrieval
• An essential tool to deal with information
overload
History of information retrieval
• Catalyst
– Industry: web search engines
• The WWW unleashed an explosion of published information
and drove innovation in IR techniques
• Lycos (started at CMU) was launched and became a
major commercial endeavor in 1994
• Boom of the search engine industry: Magellan, Excite,
Infoseek, Inktomi, Northern Light, AltaVista, Yahoo!,
Google, and Bing
Major players in this game
• Global search engine market
– By http://marketshare.hitslink.com/search-engine-market-share.aspx
How to perform information retrieval
• Information retrieval when we did not have a
computer
The Standard Retrieval Interaction Model
How to perform information retrieval
Crawler and indexer
Query parser
Ranking model
Document Analyzer
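The four components above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: the function names (`analyze`, `build_index`, `parse_query`, `rank`), the three-document collection, and the term-frequency scoring rule are all hypothetical, and crawling is replaced by an in-memory dictionary.

```python
import re
from collections import defaultdict

def analyze(text):
    """Document analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Indexer: build an inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def parse_query(query):
    """Query parser: reuse the analyzer so queries and documents share one vocabulary."""
    return analyze(query)

def rank(index, query_terms):
    """Ranking model (toy): score = total frequency of query terms in the document."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical collection standing in for crawled pages.
docs = {
    "d1": "Information retrieval finds relevant documents",
    "d2": "Database systems store structured data",
    "d3": "Web search is one application of information retrieval",
}
index = build_index(docs)
results = rank(index, parse_query("information retrieval search"))
print(results)  # [('d3', 3), ('d1', 2)]
```

Document d2 shares no term with the query, so it never enters the result list; only documents reachable through the inverted index are scored.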
How to perform information retrieval
[Figure: search engine architecture. Labels: PARSING & INDEXING (Doc Rep, Query Rep), Repository, User, query, SEARCH, Ranking, results, APPLICATIONS, LEARNING, Evaluation, judgments, FEEDBACK]
We will cover:
1) Search engine architecture; 2) Retrieval models;
3) Retrieval evaluation; 4) Relevance feedback;
5) Link analysis; 6) Search applications.
Core concepts in IR
• Query representation
– Lexical gap: say vs. said
– Semantic gap
• Document representation
– Specific data structure for efficient access
• Retrieval model
– Algorithms that find the most relevant documents
for the given information need
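As a rough illustration of these three concepts, the sketch below normalizes word variants to bridge the lexical gap (say vs. said), represents each document as a bag of term counts, and uses a simple TF-IDF sum as the retrieval model. The `LEMMAS` table, the function names, and the sample documents are assumptions made for this example; a real system would use a stemmer or lemmatizer and a more refined scoring function.

```python
import math
from collections import Counter

# Hypothetical normalization table; a real system would use a stemmer or lemmatizer.
LEMMAS = {"said": "say", "says": "say", "saying": "say"}

def normalize(text):
    """Lowercase, split on whitespace, and map word variants to one base form."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]

def tfidf_scores(query, docs):
    """Score each document by the summed TF-IDF weight of the query terms."""
    # Document representation: a bag of normalized term counts per document.
    doc_terms = {doc_id: Counter(normalize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in set(normalize(query)):
            df = sum(1 for c in doc_terms.values() if term in c)
            if df:
                idf = math.log(n_docs / df)  # rarer terms weigh more
                score += counts[term] * idf
        scores[doc_id] = score
    return scores

# Hypothetical three-document collection.
docs = {
    "d1": "the minister said taxes will fall",
    "d2": "taxes and markets",
    "d3": "weather report for the weekend",
}
scores = tfidf_scores("say taxes", docs)
best = max(scores, key=scores.get)
print(best)  # d1
```

Without the normalization step, the query term "say" would not match "said" in d1, and d1 and d2 would tie on "taxes" alone.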
A glance at modern search engines
• In old times
• In modern times
– Demand for understanding
– Demand for convenience
– Demand for efficiency
– Demand for accuracy
– Demand for diversity
IR is not just about web search
• Web search is just one important area of
information retrieval, but not all
• Information retrieval also includes
– Recommendation
– Question answering
– Text mining
– Online advertising
– Enterprise search: web search + desktop search
Related Areas
[Figure: Information Retrieval shown at the center of related areas. Labels: Applications (Web Applications, Bioinformatics…); Mathematics; Statistics; Optimization; Algorithms; Machine Learning; Pattern Recognition; Data Mining; Library & Info Science; Natural Language Processing; Databases; Software engineering; Computer systems; Systems]
IR vs. DBs
• Information Retrieval:
– Unstructured data
– Semantics of objects are subjective
– Simple keyword queries
– Relevance-driven retrieval
– Effectiveness is the primary issue, though efficiency
is also important
• Database Systems:
– Structured data
– Semantics of each object are well defined
– Structured query languages (e.g., SQL)
– Exact retrieval
– Emphasis on efficiency
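The contrast between exact and relevance-driven retrieval can be sketched as follows. The first half runs an exact SQL lookup with Python's built-in `sqlite3` module over the People in CS rows; the second half ranks free text by keyword overlap. The `notes` list, the `keyword_rank` helper, and the overlap score are hypothetical and stand in for a real retrieval model.

```python
import sqlite3

# Database style: structured rows, exact match on a well-defined attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, job TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "Dr. Kashif", "Professor"),
        (3, "Miss Wajeeha", "Secretary"),
        (5, "Mr. Aftab", "Academic Officer"),
    ],
)
exact = conn.execute(
    "SELECT name FROM people WHERE job = ?", ("Professor",)
).fetchall()
print(exact)  # [('Dr. Kashif',)]

# IR style: unstructured text, keyword query, results ordered by a relevance score.
def keyword_rank(query, texts):
    """Rank texts by the number of distinct query words they contain (toy score)."""
    q = set(query.lower().split())
    scored = [(len(q & set(t.lower().split())), t) for t in texts]
    # sorted() is stable, so equal scores keep their original order.
    return [t for score, t in sorted(scored, key=lambda p: -p[0]) if score > 0]

# Hypothetical free-text notes.
notes = [
    "office hours of the professor moved to friday",
    "the secretary handles course registration",
    "friday seminar on web search",
]
ranked = keyword_rank("professor office friday", notes)
print(ranked)
```

The SQL query returns either a row or nothing, while the keyword query returns a partial match (the seminar note) below the best match instead of rejecting it.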
IR and DBs are getting closer
• IR => DBs
– Approximate search is available in DBs
– E.g., in MySQL:
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database');
• DBs => IR
– Use information extraction to convert
unstructured data to structured data
– Semi-structured representation: XML data;
queries with structured information
IR vs. NLP
• Information retrieval
– Computational approaches
– Statistical (shallow) understanding of language
• Natural language processing
– Cognitive, symbolic and computational approaches
– Semantic (deep) understanding of language
IR and NLP are getting closer
• IR => NLP
– Larger data collections
– Scalable/robust NLP techniques, e.g.,
translation models
• NLP => IR
– Deep analysis of text documents and queries
– Information extraction for structured IR tasks
Course Learning Objectives
Web Crawling
Question Answering
Textbooks
• Introduction to Information Retrieval.
Christopher D. Manning, Prabhakar
Raghavan, and Hinrich Schuetze,
Cambridge University Press, 2008.
What to read?
• Information Retrieval: SIGIR, WWW, WSDM, CIKM
• Machine Learning, Pattern Recognition: ICML, NIPS, UAI
• NLP: ACL, EMNLP, COLING
• Databases: SIGMOD, VLDB, ICDE
• Data Mining: KDD, ICDM, SDM
• Other related areas (no venues listed): Applications (Web Applications,
Bioinformatics…), Mathematics, Statistics, Optimization, Algorithms,
Library & Info Science, Software engineering, Computer systems, Systems
Top Conferences and Journals in IR Field
• SIGIR: One of the most important and influential conferences in the IR field
(attracts more attention from academia).
• WWW: Another highly important and influential conference in the IR field
(attracts more attention from industry).
• WSDM: A new but quickly rising conference in the field, attracting attention
from both industry and academia.
• CIKM: A major conference in the IR field.
• ECIR Conference Proceedings
IR Toolkits
• ElasticSearch
• Lucene (Apache)
• Lemur & Indri (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• MeTA (University of Illinois)
• RankLib (A collection of learning-to-rank
algorithms, University of Massachusetts Amherst)
• General Information Retrieval Systems
NLP-related Resources
• Statistical natural language processing and
corpus-based computational linguistics: An
annotated list of resources
• Stanford NLP parser (Stanford University NLP
group)
• OpenNLP (Apache)
• LingPipe (Java-based)
• NLTK (Python-based)
Machine Learning Toolkits
• Weka (A rich collection of machine learning algorithms,
Machine Learning Group at the University of Waikato)
• Mallet (An alternative package to Weka, developed by
Andrew McCallum at University of Massachusetts Amherst)
• LibSVM (A collection of SVMs, developed by Chih-Chung
Chang and Chih-Jen Lin at National Taiwan University)
• SVM-light (Another collection of SVMs, developed by
Thorsten Joachims at Cornell University)
• GraphLab (Large-scale machine learning package)
• Mahout (Apache large-scale machine learning package)
• Topic Models (David Blei's collection of various topic
models)
Percentage Grade Distribution
Component                Number                                           Total Weight (%)
Quizzes                  5 (3 best will be selected for each student)     12
Programming Assignments  3                                                12
Presentation             1                                                5
Midterm                  2                                                26
Final Exam               1                                                45
Passing Criteria
Plagiarism Policy
You are not allowed to copy code for
programming assignments from the internet or
from any other student. The penalty for plagiarism in
programming assignments will be one of
the following, depending on the severity of the case:
– -1 absolute from the final grade
– Final grade is lowered
– F in the course
Slide Credits
• Dr. ChengXiang Zhai
• Lecture Notes, Text Retrieval and Mining by
Christopher Manning and Prabhakar
Raghavan, Stanford University