Introduction To Information Retrieval

The document provides an introduction to information retrieval, including definitions and examples of key concepts. It discusses why IR is needed due to information overload and the growth of unstructured data. It also outlines the history and major components of modern search engines as well as how IR compares to and relates to other fields like databases and natural language processing.

Uploaded by

Muhammad Nouman

Information Retrieval

Lecture 1 Introduction
What is information retrieval?

2
Why information retrieval
• Information overload
– “It refers to the difficulty a person can have
understanding an issue and making decisions that
can be caused by the presence of too much
information.” - wiki

3
Why information retrieval
• Information overload

Figure 1: Growth of Internet

4
Why information retrieval
• Information overload

Figure 2: Growth of WWW

5
Why information retrieval
• Handling unstructured data
– Structured data: database system is a good choice
– Unstructured data is more dominant
• Text in Web documents or emails, image, audio, video…
• “85 percent of all business information exists as
unstructured data” - Merrill Lynch
• Unknown semantic meaning

Table 1: People in CS Department
ID  Name          Job
1   Dr. Kashif    Professor
3   Miss Wajeeha  Secretary
5   Mr. Aftab     Academic Officer

[Figure: Total Enterprise Data Growth 2005-2015, IDC 2012]


6
Unstructured (text) vs. structured (database) data
in the mid-nineties

[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap; vertical axis from 0 to 250]

7
Unstructured (text) vs. structured (database) data
today

[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap; vertical axis from 0 to 250]

8
Why information retrieval
• An essential tool to deal with information
overload

You are
here!

9
History of information retrieval
• Catalyst
– Industry: web search engines
• WWW unleashed explosion of published information
and drove the innovation of IR techniques
• Lycos (started at CMU) was launched and became a
major commercial endeavor in 1994
• Boom of the search engine industry: Magellan, Excite,
Infoseek, Inktomi, Northern Light, AltaVista, Yahoo!,
Google, and Bing

10
Major players in this game
• Global search engine market
– By http://marketshare.hitslink.com/search-engine-market-share.aspx

11
How to perform information retrieval
• Information retrieval when we did not have a
computer

12
The Standard Retrieval Interaction Model

13
How to perform information retrieval
• Crawler and indexer
• Query parser
• Ranking model
• Document analyzer
14
How to perform information retrieval
[Figure: search engine architecture. Labels: PARSING & INDEXING, Doc Rep, Query Rep, query, Repository, User, SEARCH, Ranking, results, APPLICATIONS, LEARNING, Evaluation, judgments, FEEDBACK]

We will cover:
1) Search engine architecture; 2) Retrieval models;
3) Retrieval evaluation; 4) Relevance feedback;
5) Link analysis; 6) Search applications.
15
Core concepts in IR
• Query representation
– Lexical gap: say vs. said
– Semantic gap
• Document representation
– Specific data structure for efficient access
• Retrieval model
– Algorithms that find the most relevant documents
for the given information need
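
The three concepts above can be illustrated with a minimal Python sketch. It is not a method taken from the lecture: the `BASE_FORMS` table is a hypothetical stand-in for a real stemmer (it bridges the "say vs. said" lexical gap), the inverted index is one common choice of document representation, and summed term frequency is a deliberately naive retrieval model. The sample documents are invented.

```python
from collections import defaultdict

# Hypothetical toy normalizer: maps a few inflected forms to a base form.
# A real system would use a stemmer or lemmatizer instead.
BASE_FORMS = {"said": "say", "says": "say", "saying": "say"}


def normalize(text):
    """Query/document representation: lowercase, strip punctuation, map to base forms."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(BASE_FORMS.get(word, word))
    return tokens


def build_index(docs):
    """Document representation: inverted index, term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Retrieval model (naive): rank documents by summed query-term frequency."""
    scores = defaultdict(int)
    for term in normalize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented sample documents.
docs = {
    1: "She said the index was ready.",
    2: "They say search engines say a lot about users.",
    3: "Databases store structured records.",
}
index = build_index(docs)
results = search(index, "say")
```

Because both the documents and the query pass through the same `normalize` step, the queries "say" and "said" retrieve the same documents, and the index lets the search touch only the posting lists of the query terms rather than scanning every document.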

16
A glance at modern search engines
• In old times

17
A glance at modern search engines
• In modern times

18
A glance at modern search engines
• Demand for understanding
• Demand for convenience
• Demand for efficiency
• Demand for accuracy
• Demand for diversity

19
IR is not just about web search
• Web search is one important area of
information retrieval, but not the only one
• Information retrieval also includes
– Recommendation
– Question answering
– Text mining
– Online advertising
– Enterprise search: web search + desktop search

24
Related Areas
[Figure: Information Retrieval shown at the center of its related areas]
• Applications: Web Applications, Bioinformatics…
• Mathematics
• Statistics
• Optimization
• Machine Learning, Pattern Recognition
• Library & Info Science
• Natural Language Processing
• Databases
• Data Mining
• Software engineering
• Computer systems
• Algorithms
• Systems

25
IR vs. DBs
• Information Retrieval:
– Unstructured data
– Semantics of objects are subjective
– Simple keyword queries
– Relevance-driven retrieval
– Effectiveness is the primary issue, though efficiency
is also important
• Database Systems:
– Structured data
– Semantics of each object are well defined
– Structured query languages (e.g., SQL)
– Exact retrieval
– Emphasis on efficiency
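
The difference between exact retrieval and relevance-driven retrieval can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the three documents are invented, `exact_match` stands in for a database equality condition, and the ranking uses TF-IDF weights with cosine similarity (with a `+ 1.0` added to the IDF so no term gets zero weight), which is one common choice rather than the only one.

```python
import math
from collections import Counter

# Toy collection; document ids and texts are made up for illustration.
DOCS = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "database systems answer structured queries exactly",
    "d3": "retrieval of information from large collections of text",
}


def exact_match(docs, value):
    """Database-style lookup: a record either satisfies the condition or it does not."""
    return [doc_id for doc_id, text in docs.items() if text == value]


def tf_idf_vectors(docs):
    """Build a TF-IDF weighted term vector for every document."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for terms in tokenized.values() for term in set(terms))
    # The "+ 1.0" is an assumed smoothing choice, not prescribed by the slides.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ranked_search(docs, query):
    """IR-style lookup: order documents by estimated relevance to the query."""
    vectors, idf = tf_idf_vectors(docs)
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.split()).items()}
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0.0]
```

With the query "information retrieval", `exact_match` returns nothing because no stored text equals the query string, while `ranked_search` returns the two documents that mention both terms, placing the shorter, more focused one first.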

26
IR and DBs are getting closer
• IR => DBs
– Approximate search is available in DBs
– E.g., in MySQL

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database');

• DBs => IR
– Use information extraction to convert
unstructured data to structured data
– Semi-structured representation: XML data;
queries with structured information

27
IR vs. NLP
• Information retrieval
– Computational approaches
– Statistical (shallow) understanding of language
• Natural language processing
– Cognitive, symbolic and computational approaches
– Semantic (deep) understanding of language

28
IR and NLP are getting closer
• IR => NLP
– Larger data collections
– Scalable/robust NLP techniques, e.g., translation models
• NLP => IR
– Deep analysis of text documents and queries
– Information extraction for structured IR tasks

29
Course Learning Objectives

• Enable students to understand the common algorithms
and techniques for information retrieval (document
indexing and retrieval, query processing, etc.)
• Introduce the quantitative evaluation methods for
IR systems and data mining techniques
• Enable students to implement a basic textual
information retrieval system using Java or Python
• Introduce the popular probabilistic retrieval methods
and ranking principles
• Apply information retrieval techniques to the problems
of text clustering, text classification, etc.
30
Course Outline
• Inverted Index Construction
– Posting lists, dictionary
• Text Preprocessing
– Tokenization, stopping, stemming
• Retrieval Models (Vector Space Models)
– Vector-space model, cosine similarity, TF-IDF, BM25
• Retrieval Models (Language Models)
– Smoothing methods
• Relevance Feedback
• IR Evaluation Measures
– Ranking measures: R-prec, Mean Average Precision, nDCG, Reciprocal Rank
• Web Retrieval
– Link analysis, Markov chains, PageRank
• Clustering
– K-means clustering, HAC
• Distributed Representations of Words (word2vec)
• Web Crawling
• Question Answering

31
Text books
• Introduction to Information Retrieval.
Christopher D. Manning, Prabhakar
Raghavan, and Hinrich Schütze,
Cambridge University Press, 2008.

• Search Engines: Information Retrieval
in Practice. Bruce Croft, Donald
Metzler, and Trevor Strohman, Pearson
Education, 2009.
32
You should know
• IR originates from library science for handling
unstructured data
• IR has many important application areas, e.g.,
web search, recommendation, and question
answering
• IR is a highly interdisciplinary area, closely tied to
DBs, NLP, ML, and HCI

33
What to read?
[Figure: the related-areas diagram, annotated with the main venues for each area]
• Information Retrieval: SIGIR, WWW, WSDM, CIKM
• Machine Learning, Pattern Recognition: ICML, NIPS, UAI
• NLP: ACL, EMNLP, COLING
• Databases: SIGMOD, VLDB, ICDE
• Data Mining: KDD, ICDM, SDM
• Other areas shown without venues: Applications (Web Applications, Bioinformatics…), Mathematics, Statistics, Optimization, Library & Info Science, Software engineering, Computer systems, Algorithms, Systems

34
Top Conferences and Journals in the IR Field
• SIGIR: One of the most important and influential conferences in the IR field
(attracts more attention from academia).
• WWW: Another highly important and influential conference in the IR field
(attracts more attention from industry).
• WSDM: A new but rapidly rising conference in the field, attracting attention
from both industry and academia.
• CIKM: A major conference in the IR field.
• ECIR (Conference)
• TOIS: One of the major journals in the IR field.
• Information Processing and Management (Journal)
• Knowledge and Data Engineering (Journal)
• Information Retrieval (Journal)
• Information Science (Journal)
• Knowledge Based systems (Journal)

35
IR Toolkits
• ElasticSearch
• Lucene (Apache)
• Lemur & Indri (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• MeTA (University of Illinois)
• RankLib (A collection of learning-to-rank
algorithms, University of Massachusetts Amherst)
• General Information Retrieval Systems

36
NLP-related Resources
• Statistical natural language processing and
corpus-based computational linguistics: An
annotated list of resources
• Stanford NLP parser (Stanford University NLP
group)
• OpenNLP (Apache)
• LingPipe (Java-based)
• NLTK (Python-based)
37
Machine Learning Toolkits
• Weka (A rich collection of machine learning algorithms,
Machine Learning Group at the University of Waikato)
• Mallet (An alternative package to Weka, developed by
Andrew McCallum at University of Massachusetts Amherst)
• LibSVM (A collection of SVMs, developed by Chih-Chung
Chang and Chih-Jen Lin at National Taiwan University)
• SVM-light (Another collection of SVMs, developed by
Thorsten Joachims at Cornell University)
• GraphLab (Large-scale machine learning package)
• Mahout (Apache large-scale machine learning package)
• Topic Models (David Blei's collection of various topic
models)

38
Percentage Grade Distribution
Component                Number                                        Total Weight (%)
Quizzes                  5 (3 best will be selected for each student)  12
Programming Assignments  3                                             12
Presentation             1                                             5
Midterm                  2                                             26
Final Exam               1                                             45

39
Passing Criteria

• Students with 50% or higher marks in the course
will pass the course.

40
Plagiarism Policy
You are not allowed to copy code for
programming assignments from the internet or
from any other student. The penalty for plagiarism in
programming assignments will be one of
the following, depending on the severity of the case:
– -1 absolute from final grade
– Final grade is lowered
– F in course

41
Slide Credits
• Dr. ChengXiang Zhai
• Lecture Notes, Text Retrieval and Mining by
Christopher Manning and Prabhakar
Raghavan, Stanford University

42
