Indexing Processes (Text Transformation)

The document discusses the indexing process in search engines. It involves three main tasks: 1) text acquisition which identifies documents for indexing, 2) text transformation which transforms documents into index terms, and 3) index creation which creates data structures to support fast searching. Text transformation involves parsing documents, removing stop words, stemming words, analyzing links, extracting information, and classifying documents.

Uploaded by

Sagar Vashnav

INDEXING PROCESS

(TEXT TRANSFORMATION)

PRESENTED BY TEAM 4:
ANKIT(2001010027)
SAHIL(2001010054)
HOW DO SEARCH ENGINES WORK?
Processes involved in a search engine:

● CRAWLING: Scour the Internet for content, looking over the code/content for each URL found.
● INDEXING: Store and organize the content found during the crawling process. Once a page is in the index, it is in the running to be displayed as a result for relevant queries.
● RANKING: Provide the pieces of content that will best answer a searcher's query, which means that results are ordered from most relevant to least relevant.
INDEXING PROCESS
The indexing process comprises the following three tasks:

● Text acquisition
● Text transformation
● Index creation

Text acquisition

It identifies and stores documents for indexing.

Text Transformation

It transforms documents into index terms or features.

Index Creation

It takes the index terms created by text transformation and creates data structures to support fast
searching.
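
As a rough illustration of the index creation task, the sketch below builds an inverted index, which is one common data structure for fast searching. The slides do not name a specific structure, so the inverted index, the function names, and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query_terms):
    """Return IDs of documents that contain every query term (boolean AND)."""
    if not query_terms:
        return []
    result = set(index.get(query_terms[0], []))
    for term in query_terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical documents, already reduced to index terms by text transformation
docs = {
    1: ["search", "engine", "index"],
    2: ["web", "crawler", "search"],
    3: ["index", "creation"],
}
index = build_inverted_index(docs)
print(search(index, ["search", "index"]))
```

A lookup touches only the posting lists of the query terms instead of scanning every document, which is why this kind of structure supports fast searching.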
PARSING
● Processes the sequence of text tokens in the document to recognize
structural elements such as titles, links, and headings.
● Divides the data into three forms:

1 Text 2 URLs 3 Metadata

● Removes basic HTML formatting markup (e.g., font tags).


● Provides data to the later processes.
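
The parsing step can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the method used by any particular search engine: the class name `SimpleDocParser`, the choice of what counts as metadata (the title and `<meta name=...>` tags), and the sample page are assumptions.

```python
from html.parser import HTMLParser


class SimpleDocParser(HTMLParser):
    """Split an HTML page into text, URLs, and metadata (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.text = []        # visible text chunks
        self.urls = []        # href targets of <a> tags
        self.metadata = {}    # title and <meta name=... content=...> pairs
        self._in_title = False
        self._skip = 0        # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        data = data.strip()
        if not data or self._skip:
            return
        if self._in_title:
            self.metadata["title"] = data
        else:
            self.text.append(data)  # formatting tags such as <b> are dropped


# Hypothetical sample page
page = """<html><head><title>Indexing</title>
<meta name="author" content="Example Author"></head>
<body><h1>Text <b>Transformation</b></h1>
<p>See the <a href="http://example.com/stemming">stemming</a> page.</p>
</body></html>"""

parser = SimpleDocParser()
parser.feed(page)
parser.close()
print(parser.text, parser.urls, parser.metadata)
```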
STOPPING
● Remove common words

“and”, “or”, “the”, “in”, …


● Some impact on efficiency and effectiveness
● Can be a problem for some queries

e.g., "to be or not to be" consists only of very common words
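
A minimal sketch of stopping, assuming a small hand-written stop list. The slide names only "and", "or", "the", and "in"; the remaining entries in `STOP_WORDS` are assumptions added for the example.

```python
# Assumed stop list: the first four words come from the slide, the rest are illustrative
STOP_WORDS = {"and", "or", "the", "in", "to", "be", "not", "a", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words("the index in the search engine".split()))
# ['index', 'search', 'engine']
print(remove_stop_words("To be or not to be".split()))
# []
```

The second call shows the problem case: with this stop list the whole query disappears, so nothing is left to match against the index.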
STEMMING
● Group words derived from a common stem
“computer”, “computers”, “computing”, “compute”
“fish”, “fishing”, “fisherman”
● Usually effective, but not for all queries
Aggressive vs. conservative vs. not at all
● Benefits vary for different languages
Arabic: Very complicated morphology
Chinese: Few word variations anyway
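
To make the grouping idea concrete, here is a deliberately naive suffix-stripping stemmer. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions chosen so that the slide's examples work.

```python
from collections import defaultdict

# Assumed suffix list, ordered longest first so "ers" is tried before "s"
SUFFIXES = ("ers", "ing", "er", "es", "ed", "s", "e")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group words that share the same stem."""
    groups = defaultdict(list)
    for w in words:
        groups[naive_stem(w)].append(w)
    return dict(groups)


print(group_by_stem(["computer", "computers", "computing", "compute"]))
# {'comput': ['computer', 'computers', 'computing', 'compute']}
print(group_by_stem(["Fish", "fishing", "fisherman"]))
# {'fish': ['Fish', 'fishing'], 'fisherman': ['fisherman']}
```

This conservative rule set leaves "fisherman" in its own group; a more aggressive stemmer would merge it with "fish", which may or may not help a given query.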
LINK ANALYSIS
● Makes use of links and anchor text in web pages.
Stored and indexed separately
<a href="http://www.hpi.uni-potsdam.de/naumann/home.html">
Information Systems Group
</a>
● Link analysis identifies popularity and community information
● Significant impact on web search. Less important in other applications.
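
The sketch below shows two simple things that can be derived from links: an in-link count as a crude popularity signal, and the anchor text collected per target page so it can be indexed separately. The slides do not name a specific algorithm, and real systems use more elaborate methods such as PageRank; the link data and function names here are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical link data: (source page, target page, anchor text)
links = [
    ("a.html", "home.html", "Information Systems Group"),
    ("b.html", "home.html", "research group"),
    ("b.html", "a.html", "course page"),
]


def inlink_counts(links):
    """Count incoming links per target page (a crude popularity measure)."""
    return Counter(target for _, target, _ in links)


def anchor_text_index(links):
    """Collect the anchor texts pointing at each target page."""
    anchors = defaultdict(list)
    for _, target, text in links:
        anchors[target].append(text)
    return dict(anchors)


print(inlink_counts(links).most_common(1))
# [('home.html', 2)]
print(anchor_text_index(links)["home.html"])
# ['Information Systems Group', 'research group']
```

Anchor text is useful because it describes the target page in other authors' words, which may differ from the words used on the page itself.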
INFORMATION EXTRACTION
● Identify classes of index terms that are important for some applications
● Simple: Bold-face, heading, title
● Part of speech tagging
● Named entity recognizers (NER) identify classes such as people,
locations, companies, dates, etc.
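
A very rough rule-based sketch of information extraction using regular expressions. Production named entity recognizers are typically statistical or neural models; the two patterns, the function name, and the sample sentence below are assumptions for illustration only.

```python
import re

# ISO-style dates such as 2021-03-15 (an assumed date format)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Two or more consecutive capitalized words, a rough proxy for a named entity
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return candidate dates and candidate multi-word names found in text."""
    return {
        "dates": DATE_RE.findall(text),
        "candidate_names": NAME_RE.findall(text),
    }


# Hypothetical sentence
sample = "Jane Doe joined Example Corp on 2021-03-15."
print(extract_entities(sample))
# {'dates': ['2021-03-15'], 'candidate_names': ['Jane Doe', 'Example Corp']}
```

Note that the name pattern cannot tell a person from a company; assigning the class is the part a real recognizer has to learn.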
CLASSIFIER

■ Identifies class-related metadata for documents i.e., assigns labels to documents.

e.g., topics, reading levels, sentiment, genre.

Can also identify non-content parts of documents, such as advertisements.

■ Use depends on application.
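
As a toy illustration of assigning a label to a document, the sketch below picks the topic whose keyword set overlaps most with the document's tokens. Real classifiers are usually trained on labeled data; the keyword lists, labels, and function name are hypothetical.

```python
# Hypothetical keyword lists per topic label
TOPIC_KEYWORDS = {
    "sports": {"match", "team", "score", "league"},
    "technology": {"software", "index", "search", "engine"},
}


def classify(tokens, topic_keywords=TOPIC_KEYWORDS):
    """Return the label with the largest keyword overlap, or None if no keyword matches."""
    token_set = {t.lower() for t in tokens}
    scores = {label: len(token_set & kws) for label, kws in topic_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify("The search engine builds an index".split()))
# technology
print(classify("The team won the match".split()))
# sports
```

The returned label would be stored as document metadata, where it can be used for filtering or ranking depending on the application.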
