Multimedia Information Retrieval (CSC 545) : The Problem of IR
Textual Retrieval
By Dr. Nursuriati Jamil
The problem of IR
Goal = find documents relevant to an information need from a large document set
[Figure: an IR system matches a user's information need against a document collection and retrieves the relevant documents.]
Problem
Given N documents (D0, ..., DN-1) and a user query Q, return a ranked list of k documents Dj (0 <= j < N) that match the query sufficiently well; the ranking is with respect to the relevance of each document to the query.
- Feature extraction (words, phrases, n-grams, stemming, stop words, thesaurus, multimedia)
- Retrieval model (Boolean retrieval, vector space retrieval, LSI, signatures, probabilistic retrieval)
- Index structures (inverted list, signature files, relational database, multidimensional index structures)
- Freshness of data (real-time, update every day / week / month)
- Query transformation (AND/OR, expansion, stemming, thesaurus)
- Ranking of retrieved documents (RSV, link structure, phrases)
Forward index (built OFFLINE), docID = HBJ3N129:
  hukum -> word10, word25
  denda -> word2, word35, word100, word123
  kena -> word67, ...

Inverted file (queried ONLINE):
  dera -> HBJ3N129, HBM4N111
  budak -> HBJ2N19, HBJ3N129
  Malaysia -> HBJ3N129
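To make the two structures concrete, here is a minimal Python sketch of building both a forward index and an inverted file; the document IDs and terms are taken from the example above, while the whitespace tokenisation and the dictionary layout are simplifying assumptions:

```python
from collections import defaultdict

# Toy collection; the docIDs follow the example above.
docs = {
    "HBJ3N129": "hukum denda kena budak dera",
    "HBM4N111": "dera",
    "HBJ2N19":  "budak",
}

# Forward index (built OFFLINE): docID -> term -> word positions.
forward = {
    doc_id: {term: [i for i, w in enumerate(text.split()) if w == term]
             for term in set(text.split())}
    for doc_id, text in docs.items()
}

# Inverted file (queried ONLINE): term -> list of docIDs.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term in sorted(set(text.split())):
        inverted[term].append(doc_id)

print(inverted["dera"])  # ['HBJ3N129', 'HBM4N111']
```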
A text retrieval system represents documents as sets of terms (e.g., words). Thereby, the originally structured document becomes an unstructured set of terms, potentially annotated with attributes to denote frequency and position in the text. The transformation comprises several steps:
1. Elimination of structure (i.e., formats)
2. Elimination of frequent/infrequent terms (i.e., stop words)
3. Mapping text to terms (without punctuation)
4. Reduction of terms to their stems (stemming, syllable division)
5. Mapping to index terms
(The order of the steps may vary; often a step is broken into several sub-steps, or several steps are combined into a single pass.)
[Figure: transformation pipeline from document to index: structure elimination, removal of frequent/infrequent terms, stemming, mapping of stems to the index.]
HTML contains special markup, so-called tags, which describe meta-information about the document and the layout/presentation of its content. An HTML document is split into two parts, a header section and a body section:
Header: contains meta-information about the document, including descriptions of all embedded elements such as images.
Body: encompasses the document content enriched with markup for layout. The structure of the document is not always obvious.
Meta data: HTML provides several possibilities to define meta-information (the <meta> tag). The most frequent ones are:
- URL of the page: http://www-dbs.ethz.ch/~mmir/
- Title of the document: <title>ETH Zurich - Homepage</title>
- Meta information in the header section: <meta name="keywords" content="ETHZ,ETH,swiss,...">, <meta name="description" content="This page is about...">
Raw text: the raw text subsumes all text pieces with tags stripped from the original <body> section. A few tags are useful to derive additional information on the importance of a text piece.
Special characters: meta data and text data may contain special characters (HTML entities) that have to be translated, e.g., &nbsp; -> space, &uuml; -> ü; transformation to Unicode, ASCII, or another character set.
Step 1: Eliminate structure
Question: what does the link text describe? The document itself or the embedded/referenced object?
Usually, the link text is associated with both the embedding document and the linked document. In most cases, the link text is a good summary of the linked document. In a few cases, the link text is meaningless ("click here").
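As an illustration of structure elimination, a minimal sketch using Python's standard html.parser; it only extracts the title and the raw body text, ignoring meta tags and link texts:

```python
from html.parser import HTMLParser

class StructureEliminator(HTMLParser):
    """Strip tags: collect the title and the raw text of the body."""
    def __init__(self):
        super().__init__()
        self.in_title = self.in_body = False
        self.title, self.text = "", []

    def handle_starttag(self, tag, attrs):
        if tag == "title": self.in_title = True
        if tag == "body":  self.in_body = True

    def handle_endtag(self, tag):
        if tag == "title": self.in_title = False
        if tag == "body":  self.in_body = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.in_body and data.strip():
            self.text.append(data.strip())

p = StructureEliminator()
p.feed("<html><head><title>ETH Zurich - Homepage</title></head>"
       "<body><h1>Welcome</h1>Some <b>bold</b> text.</body></html>")
print(p.title)           # ETH Zurich - Homepage
print(" ".join(p.text))  # Welcome Some bold text.
```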
Indexing is used to determine useful answers for user queries. Thus, it is not required to index frequent terms with little or no semantics (e.g., the, a, it) or terms that appear seldom. Theoretical solution: restrict indexing to terms that have proven to be useful or that appear interesting from past, practical experience with the system. However, this requires a feedback mechanism with the user to understand term importance. How to select important terms: Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
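In formula form (a sketch; c is a collection-dependent constant and r is the rank of a term in the frequency table):

```latex
\mathrm{freq}(t_r) \approx \frac{c}{r}
\qquad\Longleftrightarrow\qquad
r \cdot \mathrm{freq}(t_r) \approx c
```

Terms above an upper cut-off rank (too frequent) and below a lower cut-off rank (too rare) are then excluded from the index.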
Insignificant terms
Stop words are terms with little or no semantic meaning and are thus often not indexed. Examples: English: the, a, is; Bahasa Melayu: ada, iaitu, mana, bersabda, wahai. Often, the rank of these terms lies on the left side of the upper cut-off line. Generally, stop words are responsible for 20% to 30% of the term occurrences in a text, so eliminating them considerably reduces the memory consumption of the index. Similarly, the most frequent terms in a collection of documents carry little information (their rank lies on the left side of the upper cut-off line).
Analogously, one can strip off words that are seldom used, assuming that users will not use them in their queries (their rank lies on the right side of the lower cut-off). However, the additional memory savings are rather small.
Example: the term "computer" is meaningless for indexing articles about computer science, as it occurs in almost every document of such a collection. The same term, however, is important to distinguish general articles, such as those about careers in computer science.
Step 2: Remove stop words
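A minimal sketch of stop-word elimination in Python; the stop-word list reuses the examples above, and the lowercase normalisation is an assumption:

```python
STOPWORDS = {
    "the", "a", "is",                             # English
    "ada", "iaitu", "mana", "bersabda", "wahai",  # Bahasa Melayu
}

def remove_stopwords(tokens):
    """Keep only tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the budak is kena denda".split()))
# -> ['budak', 'kena', 'denda']
```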
To select appropriate features for documents, one typically uses linguistic or statistical approaches to define the features based on words, fragments of words, or phrases. Most search engines use words or phrases as features. Some engines use stemming, some differentiate between upper and lower case, and some support error correction. An interesting option is the use of fragments of words, so-called n-grams. Although not directly related to the semantics of the text, they are very useful to support fuzzy retrieval. Example (trigrams):
- street -> str, tre, ree, eet
- streets -> str, tre, ree, eet, ets
- strets (misspelled) -> str, tre, ret, ets
Benefits:
- Simple misspellings or bad recognition often lead to bad retrieval results; fragments significantly improve retrieval quality.
- Stemming and syllable division are no longer necessary.
- No language-specific processing is necessary; every language is processed equally.
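The trigram decomposition above can be reproduced in a few lines of Python (n = 3 as in the example):

```python
def ngrams(word, n=3):
    """Return the character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("street"))   # ['str', 'tre', 'ree', 'eet']
print(ngrams("streets"))  # ['str', 'tre', 'ree', 'eet', 'ets']
print(ngrams("strets"))   # ['str', 'tre', 'ret', 'ets']
```

Note how the misspelled "strets" still shares two of its four trigrams with "street", which is what makes n-grams robust against misspellings.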
Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results.
- Term frequency ("feature frequency"): tf(Ti, Dj) is the number of occurrences of feature Ti in document Dj. Term frequency is important to rank documents.
- Term locations ("feature locations"): loc(Ti, Dj) -> P(N) (a set of positions). Term locations frequently influence the ranking, and whether a document appears in the result at all, e.g.:
  - Condition: Q = "shah NEAR alam" (explicit phrase matching): looking for documents with the terms shah and alam close to each other.
  - Ranking: Q = "shah alam" (implicit phrase matching): documents with the term shah next to alam should be at the top of the results.
tf = term frequency
The higher the tf, the higher the importance (weight) for the doc.
df = document frequency
The more evenly a term is distributed across documents, the less specific it is to any one document.
Example
Term   #docs   Postings (Dj, tfj)
Haji   3       (D7, 4), (D26, 10), (D40, 5)
Iman   ...     (D21, 2), ...

The term Haji occurs in three documents: 4 times in document 7, 10 times in document 26, and 5 times in document 40.
Common term-weighting variants:
- tf(t, D) = freq(t, D)
- tf(t, D) = log[freq(t, D)]
- tf(t, D) = log[freq(t, D)] + 1
- tf(t, D) = freq(t, D) / max over t' of freq(t', D)
- idf(t) = log(N/n), where n = #docs containing t and N = #docs in corpus
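A sketch of these weighting variants in Python; the function names are mine, the slides only give the formulas:

```python
import math

def idf(N, n):
    """idf(t) = log(N/n): N docs in the corpus, n docs containing t."""
    return math.log(N / n)

def tf_raw(freq):           return freq
def tf_log(freq):           return math.log(freq)
def tf_log_plus_one(freq):  return math.log(freq) + 1
def tf_max_normalized(freq, max_freq):
    return freq / max_freq

# Term "Haji" from the example: assume a corpus of N = 100 documents,
# n = 3 of which contain the term; in D7 it occurs 4 times.
print(round(tf_raw(4) * idf(100, 3), 3))  # 14.026
```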
[Table residue: an inverted list storing, per term, the positions (Pos), the number of documents (#Doc), and postings (Dj, tfj) such as (10, 1), (2, 3), (21, 2), (6, 5), (31, 2).]

Step 3: Text to term
Step 4: Stemming
How does word stemming work? Stemming broadens the result set to include both word roots and word derivations. It is commonly accepted that the removal of word endings (sometimes called suffix stripping) is a good idea; the removal of prefixes can be useful in some subject domains. Why do we need word stemming in the context of free-text searching? Free-text searching searches for exactly what is typed into the search box, without mapping it to a thesaurus term. Morphological variants of words have similar semantic interpretations, and a smaller dictionary results in savings of storage space and processing time.
Algorithms for Word Stemming
A stemming algorithm is an algorithm that converts a word to a related form. One of the simplest such transformations is the conversion of plurals to singulars. Common approaches: affix removal, successor variety, table lookup, n-grams. In most languages, words have various inflected (or sometimes derived) forms. The different forms should not carry different meanings and should be mapped to a single form. However, in many languages it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms that produce good results without the need of a dictionary (Porter algorithm).
Pros & Cons
Stemmers are used to conflate terms, to improve retrieval effectiveness and/or to reduce the size of indexing files. Stemming increases recall at the cost of decreased precision. Over-stemming and under-stemming also create problems when retrieving documents.
Porter's Algorithm
The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. The Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflectional endings from English words. It is the most effective and most widely used stemmer. Porter's algorithm is based on the measure m of a stem: the number of vowel sequences that are followed by a consonant sequence; for many rules, m must be greater than one for the rule to be applied. A word can have any one of the forms C..C, C..V, V..V, V..C; these can be represented as [C](VC){m}[V].
The rules in the Porter algorithm are separated into five distinct steps, numbered 1 to 5. They are applied to the words of the text, starting from step 1 and moving on to step 5.
- Step 1 deals with plurals and past participles; the subsequent steps are much more straightforward. Ex.: plastered -> plaster, motoring -> motor
- Step 2 deals with pattern matching on some common suffixes. Ex.: happy -> happi, relational -> relate, callousness -> callous
- Step 3 deals with special word endings. Ex.: triplicate -> triplic, hopeful -> hope
- Step 4 checks the stripped word against more suffixes in case the word is compounded. Ex.: revival -> reviv, allowance -> allow, inference -> infer
- Step 5 checks whether the stripped word ends in a vowel and fixes it appropriately. Ex.: probate -> probat, cease -> ceas, controll -> control
The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure m. There is no linguistic basis for this approach.
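Ready-made implementations of the Porter algorithm exist for most languages; a sketch using NLTK (assumes the nltk package is installed). Note that the printed stems are the output of all five steps, not of a single step:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["plastered", "motoring", "hopeful", "allowance"]:
    print(word, "->", stemmer.stem(word))
# plastered -> plaster, motoring -> motor, hopeful -> hope, allowance -> allow
```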
Dictionary-based stemming
A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words, but at the price of additional lookup costs and maintenance costs for the dictionary. The EuroWordNet initiative tries to develop a semantic dictionary for the European languages. Next to words, the dictionary shall contain inflected forms and relations between words (see next section). However, the usage of these dictionaries is not free of charge (with the exception of WordNet for English). Names remain a problem of their own... Examples of such dictionaries / ontologies:
- EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
- GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
- WordNet: http://wordnet.princeton.edu/
We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches to stemming: a rule-based approach for regular inflections, much like the Porter algorithm but much simpler, and an exception list with strong or irregular inflections of terms, as sketched below.
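A sketch of calling Morphy through NLTK's WordNet interface (assumes nltk and its WordNet corpus are installed):

```python
from nltk.corpus import wordnet as wn

# Regular inflections are handled by simple suffix rules...
print(wn.morphy("churches"))       # church
# ...while strong/irregular forms come from the exception lists.
print(wn.morphy("geese"))          # goose
print(wn.morphy("ran", wn.VERB))   # run
```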
Stemming process
[Figure: stemming process. Unstemmed words are first checked against a stopword list ("Is it a stopword?"); the remaining words are passed to a stemming algorithm (Porter's algorithm, Fatimah's algorithm, or the WordNet dictionary), which applies morphological rules (e.g., ber-..-an, me+, +lah) to produce stemmed words.]
Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). There are further relations between terms that may be useful to consider. In the following, a list of the most common relationships:
- Homonyms (equal terms but different semantics): bank (shore vs. financial institute)
- Synonyms (different terms but equal semantics): walk, go, pace, run, sprint
- Hypernyms (umbrella term) / hyponyms (species): animal -> dog, cat, bird, ...
- Holonyms (is part of) / meronyms (has parts): door -> lock
The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may also be interpreted as an occurrence of the nearby terms in this network (whereby "nearby" has to be defined appropriately). Example: a document contains the term dog; we may also interpret this as an occurrence of the term animal (with a smaller weight).
Step 5: Mapping to index terms (cont.)
Some search engines do not implement steps 4 and 5. Google only recently improved its search capabilities with stemming. If the collection contains documents in different languages, cross-lingual approaches (automatically) translate or relate terms across languages and make documents retrievable even for queries in a language different from the document's.
Term extraction for queries:
- Similar to the term extraction of documents.
- If the term extraction of the query implements step 5, omit step 5 in the term extraction of the documents in the collection.
- Extend the query terms with nearby terms (see the sketch below). Expansion with synonyms: Q = house -> Qnew = house, home, domicile, ...
- If a specialized search returns too few answers, exchange keywords with their hypernyms: e.g., Q = mare (female horse) -> Qnew = horse
- If a general search term returns too many results, let the user choose (i.e., relevance feedback) a more specialized term to reduce the result list: e.g., Q = horse -> Qnew = mare, pony, chestnut, pacer
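A sketch of both expansion strategies with WordNet via NLTK; the policy of taking all lemmas of all noun synsets is my assumption, a real system would disambiguate first:

```python
from nltk.corpus import wordnet as wn

def expand_with_synonyms(term):
    """Q = house -> Qnew = house, home, domicile, ..."""
    return sorted({lemma.name()
                   for synset in wn.synsets(term, pos=wn.NOUN)
                   for lemma in synset.lemmas()})

def expand_with_hypernyms(term):
    """Q = mare -> Qnew = horse, ... (generalize a too-specific query)."""
    return sorted({lemma.name()
                   for synset in wn.synsets(term, pos=wn.NOUN)
                   for hyper in synset.hypernyms()
                   for lemma in hyper.lemmas()})

print(expand_with_synonyms("house")[:5])
print(expand_with_hypernyms("mare"))  # includes 'horse'
```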
What is WordNet?
- A large lexical database, or "electronic dictionary", developed and maintained at Princeton University: http://wordnet.princeton.edu
- Includes most English nouns, verbs, adjectives, adverbs
- The electronic format makes it amenable to automatic manipulation
- Used in many Natural Language Processing applications (information retrieval, text mining, question answering, machine translation, AI/reasoning, ...)
- Wordnets are built for many languages.
- Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning.
- WordNet is organized by meaning: words in close proximity are semantically similar.
- Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus).
- Meaning similarity can be measured and quantified to support Natural Language Understanding.
A simple picture
animal (animate, breathes, has heart, ...)
  |
bird (has feathers, flies, ...)
  |
canary (yellow, sings nicely, ...)
Creates relationships among more/less general concepts, and thus creates hierarchies. Hierarchies can have up to 16 levels:

{vehicle}
  |- {car, automobile}
  |    |- {convertible}
  |    |- {SUV}
  |- {bicycle, bike}
       |- {mountain bike}
A car is a kind of vehicle <=> The class of vehicles includes cars, bikes
Hyponymy
Transitivity:
A car is a kind of vehicle. An SUV is a kind of car.
=> An SUV is a kind of vehicle.
Meronymy/Holonymy
Inheritance:
A finger is part of a hand. A hand is part of an arm. An arm is part of a body.
=> A finger is part of a body.
[Figure: hypernym and meronym relations, e.g., {vehicle} as hypernym; {doorlock} and {armrest} as meronyms.]
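The same relations can be traversed programmatically; a sketch with NLTK's WordNet interface (the synset name car.n.01 is WordNet's own identifier):

```python
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")

# Hypernyms: more general concepts.
print(car.hypernyms())        # [Synset('motor_vehicle.n.01')]

# The transitive closure walks the hierarchy up towards {vehicle}.
print([s.name() for s in car.closure(lambda s: s.hypernyms())][:4])

# Part meronyms: parts of a car in WordNet's inventory.
print([s.name() for s in car.part_meronyms()][:5])
```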
Homework
Select the 5 most frequent noun terms, and find homonyms, synonyms, hypernyms and holonyms of the terms. You may use WordNet at http://wordnet.princeton.edu/ (select "Use Wordnet Online"). Create the noun ontology.
IR models
Overview:
- Boolean Retrieval
- Fuzzy Retrieval
- Vector Space Retrieval
- Probabilistic Retrieval (BIR Model)
- Latent Semantic Indexing
Boolean search
Boolean model
Historically, documents were stored on tapes or punched cards, and searching allowed only sequential access. Today, Boolean search is still very frequent but is not state-of-the-art. Google uses it for its simplicity but further improves it by additionally sorting/ranking the result sets.
Model: a document D is represented by a binary vector d with di = 1 if term ti occurs in D. A query q comes from the query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; Q is given by queries of the types: t, q1 ∧ q2, q1 ∨ q2, ¬q1.
Term-document matrix
Query: Brutus AND Caesar AND NOT Calpurnia Take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
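The bitwise evaluation can be reproduced directly in Python (the incidence vectors are the ones from the example; six bits, one per document):

```python
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # its complement is 101111

mask = (1 << 6) - 1           # keep only the six document bits
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))  # 100100
```

The two set bits identify the two documents that satisfy Brutus AND Caesar AND NOT Calpurnia.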
Boolean retrieval
Fuzzy retrieval
Vector-space model
Since the Boolean model's binary weights are too limiting, the vector space model supports partial matching: non-binary weights are assigned to the index terms in queries and documents. These term weights are used to compute the degree of similarity between the documents in the database and the user's query.
[Figure: a document d as a vector in a three-dimensional term space with axes term1 = solat, term2 = ibadah, term3 = malam.]
The tf metric is considered an indication of how well a term characterizes the content of a document. The idf, in turn, reflects the number of documents in the collection in which the term occurs, irrespective of the number of times it occurs in those documents.
Document-Term-Matrix
Example
[Table: document-term matrix of the example collection, with columns for terms such as arrived, gold, silver, truck.]
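A minimal sketch of vector space ranking over the three-document "gold/silver/truck" collection that the terms above suggest as the classic example; the weighting choices (raw tf times idf, cosine similarity) are assumptions, not the only options:

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

N = len(docs)
df = Counter(t for d in docs.values() for t in set(d))
idf = {t: math.log10(N / n) for t, n in df.items()}

def weights(tokens):
    """tf-idf weight vector (raw tf times idf)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

q = weights(query)
for doc_id, tokens in sorted(docs.items()):
    print(doc_id, round(cosine(q, weights(tokens)), 3))
# D2 ranks highest: 'silver' is rare in the collection and occurs twice in D2.
```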
Class exercises
Using the 10 most frequent terms of your story, create the term-document matrix for the Boolean model and for the vector model.
Remarks
There are many more methods to determine the vector representations and to compute retrieval status values.

Advantages:
- Simple model with efficient evaluation algorithms
- Partial-match queries are possible, i.e., the model returns documents that contain only some of the query terms (similar to the OR operator of Boolean retrieval)
- Very good retrieval quality, though not state-of-the-art
- Relevance feedback may further improve vector space retrieval

Disadvantages:
- Main assumption of vector space retrieval: terms occur independently of each other in documents. This is not true: if one writes about Mercedes, the term "car" is likely to co-occur in the document.
- Many heuristics and simplifications; no proof of the "correctness" of the result set
- HTML/Web: the occurrence of terms is not the most important criterion to rank documents (spamming)