0% found this document useful (0 votes)
8 views5 pages

Lecture3 Roy

Uploaded by

jatinchowhan8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views5 pages

Lecture3 Roy

Uploaded by

jatinchowhan8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

01.08.

2023

Introduction to Information Retrieval Introduction to Information Retrieval

Documents Parsing a document

 We need to deal with format and language of each document.


 Last lecture: Simple Boolean retrieval system
 What format is it in? pdf, word, excel, html etc.
 Our assumptions were:
 What language is it in?
 We know what a document is.
 What character set is in use?
 We can “machine-read” each document.
 Each of these is a classification problem, which we will study
 This can be complex in reality.
later in this course.
 Alternative: use heuristics

1 2

Introduction to Information Retrieval Introduction to Information Retrieval

Format/Language: Complications Normalization


 A single index usually contains terms of several languages.  Need to “normalize” terms in indexed text as well as query
 Sometimes a document or its components contain multiple terms into the same form.
languages/formats.  Example: We want to match U.S.A. and USA
 French email with Spanish pdf attachment  We most commonly implicitly define equivalence classes of
 What is the document unit for indexing? terms.
 A file?  Alternatively: do asymmetric expansion
 window → window, windows
 An email?  windows → Windows, windows
 An email with 5 attachments?  Windows (no expansion)
 A group of files (ppt or latex in HTML)?  More powerful, but less efficient
 Upshot: Answering the question “what is a document?” is not
trivial and requires some design decisions.
 Also: XML
3 4

Introduction to Information Retrieval Introduction to Information Retrieval

Normalization: Other languages Recall: Inverted index construction

 Input:

 Normalization and language detection interact.


 PETER WILL NICHT MIT. → MIT = mit
 Output:
 He got his PhD from MIT. → MIT ≠ mit

 Each token is a candidate for a postings entry.


 What are valid tokens to emit?

5 6

1
01.08.2023

Introduction to Information Retrieval Introduction to Information Retrieval

Tokenization problems: One word or two? (or


Exercises several)
 Hewlett-Packard
 State-of-the-art
In June, the dog likes to chase the cat in the barn. – How many  co-education
word tokens?  the hold-him-back-and-drag-him-away maneuver
 data base
–Tokenize: Mr. O’Neill thinks that the boys’  San Francisco
stories about Chile’s capital aren’t amusing.  Los Angeles-based company
 cheap San Francisco-Los Angeles fares
 York University vs. New York University

7 8

Introduction to Information Retrieval Introduction to Information Retrieval

Numbers Chinese: No whitespace


 3/20/91
 20/3/91
 Mar 20, 1991
 B-52
 100.2.86.144
 (800) 234-2333
 800.234.2333
 Older IR systems may not index numbers . . .
 . . . but generally it’s a useful feature.

9 10

Introduction to Information Retrieval Introduction to Information Retrieval

Ambiguous segmentation in Chinese Other cases of “no whitespace”

 Compounds in Dutch, German, Swedish


 Computerlinguistik → Computer + Linguistik
 Lebensversicherungsgesellschaftsangestellter
 → leben + versicherung + gesellschaft + angestellter
 Inuit: tusaatsiarunnanngittualuujunga (I can’t hear very well.)
The two characters can be treated as one word meaning  Many other languages with segmentation difficulties: Finnish,
‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’. Urdu, . . .

11 12

2
01.08.2023

Introduction to Information Retrieval Introduction to Information Retrieval

Japanese Arabic script

4 different “alphabets”: Chinese characters, hiragana syllabary


for inflectional endings and functional words, katakana
syllabary for transcription of foreign words and other uses,
and latin. No spaces (as in Chinese). End user can express
query entirely in hiragana!

13 14

Introduction to Information Retrieval Introduction to Information Retrieval

Arabic script: Bidirectionality Accents and diacritics

 Accents: résumé vs. resume (simple omission of accent)


 Umlauts: Universität vs. Universitaet (substitution with special
← → ← → ← START letter sequence “ae”)
‘Algeria achieved its independence in 1962 after 132 years of French occupation.’  Most important criterion: How are users likely to write their
queries for these words?
 Even in languages that standardly have accents, users often do
not type them. (Polish?)

15 16

Introduction to Information Retrieval Introduction to Information Retrieval

Case folding Stop words

 stop words = extremely common words which would appear


 Reduce all letters to lower case to be of little value in helping select documents matching a
 Possible exceptions: capitalized words in mid-sentence user need
 MIT vs. mit  Examples: a, an, and, are, as, at, be, by, for, from, has, he, in,
 Fed vs. fed is, it, its, of, on, that, the, to, was, were, will, with
 It’s often best to lowercase everything since users will use  Stop word elimination used to be standard in older IR
lowercase regardless of correct capitalization. systems.
 But you need stop words for phrase queries, e.g. “King of
Denmark”
 Most web search engines index stop words.

17 18

3
01.08.2023

Introduction to Information Retrieval Introduction to Information Retrieval

Lemmatization Stemming

 Reduce inflectional/variant forms to base form


 Definition of stemming: Crude heuristic process that chops
 Example: am, are, is → be off the ends of words in the hope of achieving what
 Example: car, cars, car’s, cars’ → car “principled” lemmatization attempts to do with a lot of
 Example: the boy’s cars are different colors → the boy car be linguistic knowledge.
different color  Language dependent
 Lemmatization implies doing “proper” reduction to  Often inflectional and derivational
dictionary headword form (the lemma).  Example for derivational: automate, automatic, automation
 Inflectional morphology (cutting → cut) vs. derivational all reduce to automat
morphology (destruction → destroy)

19 20

Introduction to Information Retrieval Introduction to Information Retrieval

Porter algorithm Porter stemmer: A few rules


 Most common algorithm for stemming English
 Results suggest that it is at least as good as other stemming
options Rule Example
 Conventions + 5 phases of reductions SSES → SS caresses → caress
 Phases are applied sequentially IES → I ponies → poni
 Each phase consists of a set of commands. caress → caress
SS → SS
 Sample command: Delete final ement if what remains is longer cats → cat
than 1 character S→
 replacement → replac
 cement → cement
 Sample convention: Of the rules in a compound command,
select the one that applies to the longest suffix.
21 22

Introduction to Information Retrieval Introduction to Information Retrieval

Three stemmers: A comparison Does stemming improve effectiveness?


Sample text: Such an analysis can reveal features that are not easil
visible from the variations in the individual genes and  In general, stemming increases effectiveness for some
can lead to a picture of expression that is more queries, and decreases effectiveness for others.
biologically transparent and accessible to interpretation
Porter stemmer: such an analysi can reveal featur that ar not easili visibl
 Queries where stemming is likely to help: [tartan sweaters],
from the variat in the individu gene and can lead to [sightseeing tour san francisco] (equivalence classes:
pictur of express that is more biolog transpar and {sweater,sweaters}, {tour,tours})
access to interpret  Porter Stemmer equivalence class oper contains all of
Lovins stemmer: such an analys can reve featur that ar not eas vis from operate operating operates operation operative operatives
th vari in th individu gen and can lead to a pictur of operational.
expres that is mor biolog transpar and acces to
interpres  Queries where stemming hurts: [operational AND research],
Paice stemmer: such an analys can rev feat that are not easy vis from [operating AND system], [operative AND dentistry]
the vary in the individ gen and can lead to a pict of
express that is mor biolog transp and access to interpret
23 24

4
01.08.2023

Introduction to Information Retrieval

Exercise: What does Google do?


 Stop words
 Normalization
 Tokenization
 Lowercasing
 Stemming
 Non-latin alphabets
 Umlauts
 Compounds
 Numbers

25

You might also like