Classification Methods

Uploaded by

rm23082001
Copyright
© © All Rights Reserved

Classification Methods & Cluster Hypothesis
Information Retrieval CC4151
Classification Methods

• In the context of information retrieval, a classification is required for a purpose.
• The purpose may be to group the documents in such a way that retrieval will be faster, or
alternatively it may be to construct a thesaurus automatically.
• There are two main areas of application of classification methods in IR:
  (1) keyword clustering;
  (2) document clustering.
Clustering and Cluster Hypothesis

• Clustering is used in information retrieval systems to improve the efficiency and
effectiveness of the retrieval process. It partitions the documents in a collection into
classes so that documents associated with each other are assigned to the same cluster.
• The cluster hypothesis is an assumption about the nature of the data being handled, and
it takes various forms. In information retrieval, it states that documents that are
clustered together "behave similarly with respect to relevance to information needs".
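
The partitioning idea above can be illustrated with a minimal sketch. It assumes bag-of-words term counts, cosine similarity as the measure of association, and a single-pass "leader" clustering rule; none of these choices comes from the slides, and the function names, the sample documents, and the 0.3 threshold are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_cluster(docs, threshold=0.3):
    """Put each document into the first cluster whose leader is similar
    enough; otherwise start a new cluster with the document as leader."""
    clusters = []  # list of (leader_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorize(text)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

# Hypothetical toy collection (assumption, not from the slides).
docs = [
    "stock market prices fall",
    "market prices rise as stock rally",
    "football team wins match",
    "team loses football match",
]
partition = single_pass_cluster(docs)
print(partition)
```

Under the cluster hypothesis, a query that finds one of the finance documents relevant would be expected to find the other member of its cluster relevant as well.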
Applications of Clustering
Application               What is clustered?       Benefit
Search result clustering  search results           more effective information presentation to user
Scatter-Gather            (subsets of) collection  alternative user interface: "search without typing"
Collection clustering     collection               effective information presentation for exploratory browsing
Language modeling         collection               increased precision and/or recall
Cluster-based retrieval   collection               higher efficiency: faster search
Search Result Clustering
• By search results we mean the documents that were returned in response to a query.
• The default presentation of search results in information retrieval is a simple list.
• Users scan the list from top to bottom until they have found the information they are
looking for. Search result clustering instead groups the search results, so that similar
documents appear together.
• It is often easier to scan a few coherent groups than many individual documents.
• This is particularly useful if a search term has different word senses.
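
As a rough illustration of grouping results for an ambiguous term, the sketch below clusters four made-up result snippets for the query "jaguar". It assumes term-count vectors, cosine similarity, and single-link agglomerative clustering; the slides do not name a specific algorithm, so these are stand-ins, and `cluster_results` and the snippets are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_results(snippets, k):
    """Single-link agglomerative clustering: repeatedly merge the two
    most similar clusters until only k clusters remain."""
    vecs = [Counter(s.lower().split()) for s in snippets]
    clusters = [[i] for i in range(len(snippets))]
    while len(clusters) > k:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vecs[i], vecs[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return [sorted(c) for c in clusters]

# Hypothetical result snippets for the ambiguous query "jaguar".
results = [
    "jaguar car dealer prices",
    "jaguar cat habitat rainforest",
    "used jaguar car for sale",
    "jaguar cat diet and habitat",
]
groups = cluster_results(results, k=2)
for group in groups:
    print([results[i] for i in group])
```

Every snippet contains the query term, so it is the remaining vocabulary that separates the car sense from the animal sense.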
Scatter-Gather

• Scatter-Gather clusters the whole collection to get groups of documents that the user can
select or gather.
• The selected groups are merged and the resulting set is clustered again. This process is
repeated until a cluster of interest is found.
• Example: A collection of New York Times news stories is clustered ("scattered") into eight
clusters (top row). The user manually gathers three of these into a smaller collection,
"International Stories", and performs another scattering operation. This process repeats until a
small cluster with relevant documents is found (e.g., Trinidad).
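
The scatter, gather, and re-scatter loop can be sketched as follows. The clustering step here is a deliberately simple stand-in (farthest-first seed selection followed by nearest-seed assignment), not the algorithm used by the original Scatter-Gather system; the `scatter` and `gather` names, the toy stories, and the user's choices are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def scatter(doc_ids, vecs, k):
    """Cluster doc_ids into at most k groups: choose seeds farthest-first,
    then assign every document to its most similar seed."""
    seeds = [doc_ids[0]]
    while len(seeds) < min(k, len(doc_ids)):
        candidates = [d for d in doc_ids if d not in seeds]
        # Next seed: the document least similar to all current seeds.
        seeds.append(min(candidates,
                         key=lambda d: max(cosine(vecs[d], vecs[s]) for s in seeds)))
    groups = {s: [] for s in seeds}
    for d in doc_ids:
        nearest = max(seeds, key=lambda s: cosine(vecs[d], vecs[s]))
        groups[nearest].append(d)
    return list(groups.values())

def gather(clusters, chosen):
    """Merge the clusters the user selected into one smaller collection."""
    return sorted(d for i in chosen for d in clusters[i])

# Hypothetical toy news collection (assumption, not from the slides).
stories = [
    "election vote government",
    "government election campaign",
    "football match goal",
    "football league match",
    "trinidad oil trade",
    "germany oil trade",
]
vecs = [Counter(s.split()) for s in stories]

first = scatter(list(range(len(stories))), vecs, k=3)  # scatter whole collection
smaller = gather(first, chosen=[0, 2])                 # user gathers two clusters
second = scatter(smaller, vecs, k=2)                   # scatter the gathered set
print(first, smaller, second)
```

In an interactive system the `chosen` indices would come from the user at each round, and the loop would stop once a cluster is small enough to read through.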
Collection clustering

• In IR, collection clustering groups the whole collection so that users can browse it in an
exploratory way. The term "clustered collection" is also used for a database storage layout:
clustered collections store documents ordered by the clustered index key value.
• Clustered collections have the following benefits compared to non-clustered collections:
  • Faster queries without needing a secondary index, such as queries with range scans and
  equality comparisons on the clustered index key.
  • A lower storage size, which improves performance for queries and bulk inserts.
  • Additional performance improvements for inserts, updates, deletes, and queries.
Language Modelling

• A common suggestion to users for coming up with good queries is to think of words that
would likely appear in a relevant document, and to use those words as the query. The
language modelling approach to IR directly models that idea: a document is a good match
to a query if the document model is likely to generate the query, which will in turn happen
if the document contains the query words often. This approach thus provides a different
realization of some of the basic ideas for document ranking.
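
A minimal sketch of this query-likelihood idea follows. It assumes a unigram document model mixed with collection-wide term frequencies (Jelinek-Mercer smoothing with a weight of 0.5); the smoothing scheme, the function name, and the toy documents are assumptions added for illustration and do not come from the slides.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_counts, coll_len, lam=0.5):
    """log P(query | document model) under a unigram model, where
    P(t|d) = lam * tf(t, d) / |d| + (1 - lam) * cf(t) / |C|."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        score += math.log(p)
    return score

# Hypothetical toy collection (assumption, not from the slides).
docs = {
    "d1": "cluster analysis groups similar documents".split(),
    "d2": "language models rank documents by query likelihood".split(),
}
coll_counts = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(len(tokens) for tokens in docs.values())

query = "query likelihood".split()
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], coll_counts, coll_len),
    reverse=True,
)
print(ranking)
```

The document that contains the query words scores higher because its model assigns them more probability; the smoothing term only keeps documents that miss a query word from dropping to a probability of zero.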
Example: Finite Automata
Cluster-based Retrieval

• Cluster-based information retrieval is one of the information retrieval (IR) tools that
organize web documents, extract their features, and categorize them according to
their similarity.
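
One way to picture the efficiency benefit is to compare the query against one representative per cluster and then rank only the documents in the best-matching cluster. The sketch below assumes the clusters have already been computed and uses summed term counts as cluster representatives with cosine similarity; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum of the member vectors. Cosine similarity ignores vector
    length, so dividing by the cluster size is unnecessary here."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def cluster_based_search(query, docs, clusters):
    """Pick the cluster whose centroid best matches the query, then
    rank only that cluster's documents against the query."""
    vecs = {name: Counter(text.lower().split()) for name, text in docs.items()}
    q = Counter(query.lower().split())
    centroids = [centroid([vecs[n] for n in members]) for members in clusters]
    best = max(range(len(clusters)), key=lambda i: cosine(q, centroids[i]))
    return sorted(clusters[best], key=lambda n: cosine(q, vecs[n]), reverse=True)

# Hypothetical documents and a precomputed clustering (assumptions).
docs = {
    "a": "python snake habitat",
    "b": "snake venom bite",
    "c": "python programming language",
    "d": "programming language compiler",
}
clusters = [["a", "b"], ["c", "d"]]
hits = cluster_based_search("python programming", docs, clusters)
print(hits)
```

Only the documents in the selected cluster are scored individually, which is where the speed-up comes from; the trade-off is that relevant documents in other clusters, such as "a" here, are not returned.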
