CS583 Info Retrieval
Web Search
Introduction
Text mining refers to data mining using text
documents as data.
Most text mining tasks use Information
Retrieval (IR) methods to pre-process text
documents.
These methods are quite different from
traditional data pre-processing methods
used for relational tables.
Web search also has its roots in IR.
frequency.
N: total number of docs
df_i: the number of docs in which
term t_i appears.
The final TF-IDF term
weight is:
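A minimal sketch of the weighting, assuming the common formulation w_ij = tf_ij * log(N / df_i), where N and df_i are as defined above and tf_ij is the count of term t_i in doc d_j divided by the largest term count in d_j. The max-count normalization, the natural logarithm, and the function name tf_idf are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: w_ij = tf_ij * log(N / df_i), with tf_ij the
    raw count of the term divided by the largest count in that doc.
    """
    n_docs = len(docs)  # N: total number of docs
    # df[t]: number of docs in which term t appears (each doc counted once)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy collection, made up for illustration.
docs = [
    ["web", "search", "engine"],
    ["text", "mining", "web"],
    ["web", "text", "retrieval"],
]
w = tf_idf(docs)
print(w[0])
```

In this toy collection "web" appears in every doc, so its idf is log(3/3) = 0 and its weight vanishes, while a term found in only one doc receives the largest weight.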
may be constructed
Why do we need to remove stopwords?
Reduce indexing (or data) file size
Stopwords account for 20-30% of total word counts.
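Stopword removal can be sketched as a simple filter over tokens. The stopword list below is a hypothetical, deliberately tiny one (real lists are much longer), and whitespace tokenization is an assumption made to keep the example short.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "for", "as"}


def remove_stopwords(tokens):
    """Drop tokens that appear in STOPWORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


text = "Text mining refers to data mining using text documents as data"
tokens = text.split()  # naive whitespace tokenization (assumption)
kept = remove_stopwords(tokens)

# Share of tokens dropped, i.e. the saving in what must be indexed.
removed_share = 1 - len(kept) / len(tokens)
print(kept)
print(f"removed {removed_share:.0%} of tokens")
```

The share removed depends heavily on the text and on the stopword list used.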