Indexing Processes (Text Transformation)
PRESENTED BY TEAM 4:
ANKIT(2001010027)
SAHIL(2001010054)
HOW DO SEARCH ENGINES WORK?
Processes involved in a search engine:
● Crawling: Scour the Internet for content, looking over the code/content for each URL they find.
● Indexing: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries.
● Ranking: Provide the pieces of content that will best answer a searcher's query, which means that results are ordered by most relevant to least relevant.
INDEXING PROCESS
The indexing process comprises the following three tasks:
● Text acquisition
● Text transformation
● Index creation
Text acquisition
Text Transformation
Index Creation
It takes the index terms created by text transformation and creates data structures to support fast
searching.
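One common data structure for this purpose is an inverted index, which maps each index term to the documents that contain it. The sketch below is only an illustration under simple assumptions: documents are already reduced to lists of index terms, and the names `build_inverted_index` and `search` are hypothetical, not taken from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index term to a sorted list of document ids containing it.

    `docs` is assumed to be a dict of {doc_id: [index terms]} produced by
    the text transformation step.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a term repeated in one document is recorded only once
        for term in sorted(set(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def search(index, *terms):
    """Return ids of documents that contain every query term (AND query)."""
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: ["fish", "tank", "fish"],
    2: ["fish", "market"],
    3: ["computer", "market"],
}
index = build_inverted_index(docs)
print(index["fish"])                    # posting list for "fish"
print(search(index, "fish", "market"))  # documents containing both terms
```

Looking up a term is a single dictionary access, which is why this kind of structure supports fast searching compared with scanning every document at query time.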
PARSING
● Processes the sequence of text tokens in the document to recognize
structural elements such as titles, links, and headings.
● It divides the data into three forms:
To be or not to be
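As a rough illustration of parsing, the sketch below tokenizes an HTML document and records which structural element each token came from. It relies on Python's standard `html.parser` module. The class name `StructureParser`, the choice of tags treated as structural, and the fallback label "body" are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Assumption: only these tags are treated as structural elements here.
STRUCTURAL_TAGS = {"title", "h1", "h2", "h3", "a"}


class StructureParser(HTMLParser):
    """Collect (token, field) pairs, where field is the enclosing element."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open structural tags
        self.tokens = []   # (token, field) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Text outside any structural tag is labelled "body".
        field = self._stack[-1] if self._stack else "body"
        for token in re.findall(r"\w+", data.lower()):
            self.tokens.append((token, field))


def parse(html):
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<title>Text Transformation</title><p>To be or not to be</p>"
for token, field in parse(page):
    print(field, token)
```

Keeping the field with each token lets later stages weight a match in a title or heading differently from a match in ordinary body text.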
STEMMING
● Group words derived from a common stem
“computer”, “computers”, “computing”, “compute”
“fish”, “fishing”, “fisherman”
● Usually effective, but not for all queries
Aggressive vs. conservative vs. not at all
● Benefits vary for different languages
Arabic: Very complicated morphology
Chinese: Few word variations anyway
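To make the grouping idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list, the `min_stem` threshold, and the function names are assumptions chosen only so that the examples above behave in an illustrative way.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ers", "er", "es", "e", "s")


def crude_stem(word, min_stem=4):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group words that share the same crude stem."""
    groups = defaultdict(list)
    for word in words:
        groups[crude_stem(word)].append(word)
    return dict(groups)


print(group_by_stem(["computer", "computers", "computing", "compute"]))
print(group_by_stem(["fish", "fishing", "fisherman"]))
```

With these rules the four "comput" words fall into one group, and "fish" and "fishing" are grouped together while "fisherman" is left alone. That is the conservative end of the trade-off: a more aggressive rule set would merge more words, at the risk of conflating terms that some queries need to keep apart.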
LINK ANALYSIS
● Makes use of links and anchor text in web pages.
Stored and indexed separately
<a href="http://www.hpi.uni-potsdam.de/naumann/home.html">
Information Systems Group
</a>
● Link analysis identifies popularity and community information
● Significant impact on web search; less important in other applications
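The sketch below shows one simple way links and anchor text could be collected and used as a popularity signal, namely by counting incoming links per target URL. It is far simpler than algorithms such as PageRank. The class `AnchorCollector`, the function `inlink_counts`, and the `example.org` URLs are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def inlink_counts(pages):
    """Count incoming links per target URL, ignoring self-links.

    `pages` is assumed to be a dict of {page URL: HTML source}.
    """
    counts = Counter()
    for url, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, _anchor in collector.links:
            if href != url:
                counts[href] += 1
    return counts


pages = {
    "http://example.org/a": '<a href="http://example.org/c">Information Systems Group</a>',
    "http://example.org/b": '<a href="http://example.org/c">IS group</a>',
    "http://example.org/c": '<a href="http://example.org/a">back</a>',
}
print(inlink_counts(pages).most_common())
```

The collected anchor text describes the target page in the words of other authors, which is why it is worth storing and indexing separately from the page's own content.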
INFORMATION EXTRACTION
● Identify classes of index terms that are important for some applications
● Simple: Bold-face, heading, title
● Part of speech tagging
● Named entity recognizers (NER) identify classes such as people,
locations, companies, dates, etc.
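A very small rule-based sketch of entity recognition follows. Practical recognizers are usually statistical or neural models; this example only shows the input and output shape of the task. The gazetteer entries, the ISO-style date pattern, and the function name `tag_entities` are assumptions invented for illustration.

```python
import re

# Hypothetical gazetteers: tiny lookup lists of known names per class.
GAZETTEER = {
    "person": {"ada lovelace"},
    "company": {"acme corp"},
    "location": {"potsdam"},
}
# Assumption: only ISO-style dates (YYYY-MM-DD) are recognized.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def tag_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    lowered = text.lower()
    for label, names in GAZETTEER.items():
        for name in sorted(names):
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, lowered):
                span = text[match.start():match.end()]
                found.append((match.start(), span, label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    # Sort by position so results follow the order in the text.
    return [(span, label) for _, span, label in sorted(found)]


print(tag_entities("Ada Lovelace visited Potsdam on 2024-05-01."))
```

Each recognized entity can then be stored as an index term with its class, so that an application can, for example, restrict a search to person names.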
CLASSIFIER
● Identifies parts of documents such as advertisements.