Preprocessing, Inverted Index

The document discusses text and web page preprocessing tasks like stopword removal, stemming, and identifying main content blocks. It then describes how an inverted index structures text data for efficient searching and discusses latent semantic indexing for dealing with statistical associations between terms.

Uploaded by

vaishakh2052

19CSE354 Web Mining

PART 2
Text and Web Page Pre-Processing
Text preprocessing tasks
For text documents: stopword removal, stemming, and handling of digits, hyphens,
punctuation, and letter case.
For Web pages: HTML tag removal and identification of main content blocks.

➢ Stopword removal: Stopwords are frequently occurring, insignificant words in a
language that help construct sentences but do not represent any content of the
documents. They are removed from the text before indexing.
➢ Stemming refers to the process of reducing words to their stems or roots.
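
The text preprocessing tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stopword list is a tiny hand-picked assumption, and naive_stem is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "that", "the", "to"}

# Assumption: a crude suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit.

    This also takes care of punctuation and hyphens in one step.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=True):
    """Tokenize, drop stopwords, and optionally stem the remaining tokens."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens] if stem else tokens


print(preprocess("Web mining is useful.", stem=False))
```

The naive stemmer over-stems some words (for example it cuts "mining" down to "min"), which is why real systems use a rule-based stemmer with more careful conditions.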
Webpage preprocessing tasks
➢ Identifying different text fields: allows the retrieval system to treat terms in different
fields differently. Ex: in HTML, the title and the body are separate fields.
➢ Identifying anchor text: Anchor text associated with a hyperlink is treated specially
in search engines.
➢ Removing HTML tags: The removal of HTML tags can be dealt with similarly to
punctuation.
➢ Identifying main content blocks: A typical Web page, especially a commercial page,
contains a large amount of information that is not part of the main content of the
page, such as banner ads, navigation bars, and copyright notices.
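
Three of the Web page tasks above (separating text fields, collecting anchor text, and removing HTML tags) can be sketched with Python's standard html.parser module. The class name PageFields and the three field names are assumptions made for this illustration. Identifying main content blocks is a harder problem and is not attempted here.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title text, anchor text, and other visible text separately."""

    SKIP = {"script", "style"}  # element content that is never shown as text

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "anchor": [], "body": []}
        self._stack = []  # tags that are currently open

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or any(t in self.SKIP for t in self._stack):
            return
        if "title" in self._stack:
            self.fields["title"].append(text)
        elif "a" in self._stack:
            self.fields["anchor"].append(text)
        else:
            self.fields["body"].append(text)


def extract_fields(html):
    """Return a dict with the tag-free text of each field."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


page = ('<html><head><title>Web Mining</title></head><body>'
        '<p>Main text.</p><a href="x.html">next page</a></body></html>')
print(extract_fields(page))
```

Keeping the fields apart lets a retrieval system weight a term found in the title or in anchor text differently from the same term found in the body.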
Inverted Index
Indexing: organizing a document collection so that the information it contains can be
searched and retrieved quickly.
The inverted index is one of the most popular indexing schemes used in search engines.

"The inverted index of a document collection is basically a data structure that attaches
each distinctive term with a list of all documents that contains the term."

Given a set of documents, D = {d1, d2, …, dN}, each document has a unique identifier.
An inverted index consists of two parts:
1. Vocabulary V: contains all the distinct terms (ti) in the document set.
2. Postings: for each term ti, an inverted list of postings. Each posting stores the ID
(denoted by idj) of a document dj that contains term ti.
Assume that we have three documents in our database.

Doc 1 : Web mining is useful.
Doc 2 : Usage mining applications.
Doc 3 : Web structure mining studies the Web hyperlink structure.

After lowercasing and removing the stopwords "is" and "the" (no stemming is applied),
the vocabulary is the set:

{web, mining, useful, applications, usage, structure, studies, hyperlink}
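
A minimal sketch of building the inverted index for the three example documents is shown below. The helper names terms and build_index are hypothetical. Storing a term frequency next to each document ID is an extra assumption that goes beyond the bare document ID described above.

```python
from collections import defaultdict

STOPWORDS = {"is", "the"}  # assumption: just enough for this example

DOCS = {
    1: "Web mining is useful.",
    2: "Usage mining applications.",
    3: "Web structure mining studies the Web hyperlink structure.",
}


def terms(text):
    """Lowercase, strip surrounding punctuation, drop stopwords; no stemming."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    counts = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in terms(text):
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {term: sorted(post.items()) for term, post in counts.items()}


index = build_index(DOCS)
for term in sorted(index):
    print(term, index[term])
```

The keys of index form the vocabulary, and each value is the inverted list of that term.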
Search using Inverted Index
Given the query terms, searching for relevant documents in the inverted index consists
of three main steps:
Step 1 (vocabulary search): This step finds each query term in the vocabulary, which
gives the inverted list of each term. If the query contains only a single term, this step
gives all the relevant documents, and the algorithm then goes to step 3. If the query
contains multiple terms, the algorithm proceeds to step 2.
Step 2 (results merging): After the inverted list of each term is found, merging of the
lists is performed to find their intersection, i.e., the set of documents containing all
query terms.
Step 3 (rank score computation): This step computes a rank (or relevance) score for
each document based on a relevance function.
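
The three steps can be sketched as follows. The index is written out by hand for the three example documents, and the rank score is simply the summed term frequency. Both are assumptions for illustration; a real engine would use a proper relevance function.

```python
# Assumption: a hand-written index for the three example documents,
# mapping term -> {doc_id: term frequency}.
INDEX = {
    "web": {1: 1, 3: 2},
    "mining": {1: 1, 2: 1, 3: 1},
    "useful": {1: 1},
    "usage": {2: 1},
    "applications": {2: 1},
    "structure": {3: 2},
    "studies": {3: 1},
    "hyperlink": {3: 1},
}


def search(query, index=INDEX):
    """Return (doc_id, score) pairs for documents containing all query terms."""
    query_terms = query.lower().split()
    # Step 1 (vocabulary search): look up the inverted list of each term.
    lists = [index.get(term, {}) for term in query_terms]
    if not lists:
        return []
    # Step 2 (results merging): intersect the document IDs of all lists.
    doc_ids = set(lists[0])
    for postings in lists[1:]:
        doc_ids &= set(postings)
    # Step 3 (rank score computation): summed term frequency, a stand-in
    # for a real relevance function.
    scores = {d: sum(p[d] for p in lists) for d in doc_ids}
    # Highest score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("web mining"))
```

For the query "web mining" the intersection is documents 1 and 3, and document 3 ranks first because "web" occurs twice in it.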
Index Construction
The construction of an inverted index can be done efficiently using a trie
data structure.
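
As a rough sketch of the idea, a trie can hold the vocabulary one character per level, with the inverted list attached to the node where a term ends. The class names TrieNode and TrieIndex are hypothetical, and the sketch assumes documents are added in increasing ID order.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.postings = None  # list of doc IDs if a term ends at this node


class TrieIndex:
    def __init__(self):
        self.root = TrieNode()

    def add(self, term, doc_id):
        """Walk (and extend) the trie along term, then record doc_id."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.postings is None:
            node.postings = []
        # Documents are assumed to arrive in increasing ID order, so the
        # list stays sorted and a repeated ID is always the last entry.
        if not node.postings or node.postings[-1] != doc_id:
            node.postings.append(doc_id)

    def lookup(self, term):
        """Return the inverted list of term, or an empty list if absent."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings or []


idx = TrieIndex()
for doc_id, words in [(1, ["web", "mining", "useful"]),
                      (2, ["usage", "mining", "applications"]),
                      (3, ["web", "structure", "mining", "web"])]:
    for word in words:
        idx.add(word, doc_id)
print(idx.lookup("web"))
```

Each insertion or lookup costs time proportional to the length of the term, independent of the vocabulary size.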
Latent Semantic Indexing
Latent Semantic Indexing aims to deal with the identification of statistical associations
of terms.
It uses a statistical technique called singular value decomposition (SVD) to estimate
this latent structure.
This structure is also called the hidden “concept” space.
An important feature of SVD is that we can delete some insignificant dimensions in the
transformed space to optimally approximate the term-document matrix A.
The truncated SVD captures most of the important underlying structures in the
association of terms and documents.
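
In symbols, and assuming the common convention that the rows of A are terms and the columns are documents (m terms, n documents, rank r), the decomposition and its truncation can be written as below. The dimension symbols m, n, and r are notation introduced here for illustration.

```latex
% Full SVD of the m x n term-document matrix A of rank r
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
V \in \mathbb{R}^{n \times r},\quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Truncated SVD: keep only the k largest singular values (k <= r)
A_k = U_k \Sigma_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
```

A_k is the best rank-k approximation of A in the least-squares (Frobenius norm) sense, which is the meaning of "optimally approximate" above.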
We will use an example to illustrate the process. The document collection has the
following nine documents.
The term-document matrix A has one row per term and one column per document, so with 12 terms and 9 documents it is a 12x9 matrix.
Suppose we keep only the 2 largest singular values from Σ, i.e., k = 2, so that the concept space has two dimensions.
Now, if we have a search query q, it is transformed into the same concept space by qk = qᵀ Uk Σk⁻¹, where Uk and Σk are U and Σ truncated to the k largest singular values.
qk is then compared with every document vector in Vk using the cosine
similarity. The similarity values are as follows:

We obtain the final ranking of (c3, c1, c2, c4, c5, m4, m2, m3, m1).
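
The comparison step can be sketched with a plain cosine similarity function. The 2-dimensional vectors below are made up for illustration only; they are not the values from the example, and the labels d1 to d3 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document labels sorted by decreasing similarity to the query."""
    scored = {label: cosine(query_vec, vec) for label, vec in doc_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Made-up vectors in a k = 2 concept space; NOT the values from the example.
doc_vecs = {"d1": (1.0, 0.0), "d2": (1.0, 1.0), "d3": (0.0, 1.0)}
print(rank((2.0, 0.5), doc_vecs))
```

Cosine similarity depends only on the direction of the vectors, not on their length, so a short query can be compared fairly with long documents.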
