Query and Reporting Tools: Search Engine Architecture
The data warehouse is accessed using an end-user query and reporting tool from Business Objects.
Business Objects provides several tools to securely access the data warehouse or personal data files:
InfoView - a web-based tool that allows reports to be refreshed on demand (but cannot
InfoBurst - a web-based server tool that allows reports to be refreshed, scheduled, and
distributed. It can be used to distribute reports and data to users or servers in various formats
(e.g., Text, Excel, PDF, HTML). For more information, see the documentation below:
Data Warehouse List Upload - a web-based tool that allows lists of data to be uploaded into
the data warehouse for use as input to queries. For more information, see the documentation
below:
WSU has negotiated a contract with Business Objects for purchasing these tools at a discount.
Indexing
– Each page that is included in the database is indexed (automatically)
• “All” the words on the page (for full-text search)
• Stop words
• Metatags: title, others
• URL
• Hypertext anchors and links
– Spamming
• Load words into metatags
• Load invisible words (e.g., white text on white background)
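As a rough illustration of the fields listed above, the sketch below pulls the title, metatags, anchors, and body words out of one HTML page using Python's standard html.parser module. The class name PageIndexer and its attributes are hypothetical, chosen only for this example; a real indexer would also have to deal with scripts, styles, and the spamming tricks mentioned above.

```python
from html.parser import HTMLParser
import re


class PageIndexer(HTMLParser):
    """Collects indexable fields from one HTML page (illustrative helper)."""

    def __init__(self, url):
        super().__init__()
        self.url = url          # the page URL is itself an indexed field
        self.title = ""
        self.meta = {}          # metatag name -> content
        self.links = []         # (href, anchor text) pairs
        self.words = []         # body words for full-text search
        self._in_title = False
        self._href = None       # href of the <a> element currently open
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
            return
        # Lowercase and split on non-alphanumerics; an assumed tokenization.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))
        if self._href is not None:
            self._anchor.append(data.strip())


page = ('<html><head><title>Cats</title>'
        '<meta name="Keywords" content="cat, pets"></head>'
        '<body><p>All about cats.</p>'
        '<a href="/dogs.html">See dogs</a></body></html>')
indexer = PageIndexer("http://example.com/cats.html")
indexer.feed(page)
print(indexer.title, indexer.meta, indexer.links, indexer.words)
```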
Inverted Index
-How to store the words for fast lookup
-Basic steps:
– Make a “dictionary” of all the words in all of the web pages
– For each word, list all the documents it occurs in.
– Often omit very common words
• “stop words”
– Sometimes stem the words
• (also called morphological analysis)
• cats -> cat
• running -> run
-In reality, this index is huge.
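The basic steps above can be sketched in a few lines of Python. This is a minimal in-memory version under stated assumptions: the stop-word list is a tiny illustrative sample, and stem() is a deliberately crude suffix stripper written for this example, not a real stemming algorithm. Production indexes are far larger and are stored in compressed form on disk.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real lists are longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def stem(word):
    """Crude suffix stripping (an assumption, not a real stemmer)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to the sorted list of document ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "The cat is running",
    2: "Cats and dogs",
    3: "Running in the park",
}
index = build_index(docs)
print(index["cat"])   # documents containing "cat" or "cats"
print(index["run"])   # documents containing "running"
```

Looking up a word is then a single dictionary access, which is the point of inverting the document-to-words relationship.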
3. Results ranking
a) The search engine receives a query, then
b) looks up the words in the index and retrieves many documents, then
c) rank-orders the pages and extracts “snippets” or summaries containing the query words.
– Most web search engines assume the user wants all of the words (Boolean AND, not OR).
d) The ranking algorithms are complex, closely guarded, and unique to each search engine.
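The query flow in steps a) through c) can be illustrated with a toy example. Because real ranking algorithms are not public, the scoring here is an assumption made for illustration: it simply counts how often the query words occur in each document. The function names (search, snippet, build_index) are likewise hypothetical.

```python
import re
from collections import defaultdict


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Word -> set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokens(text):
            index[word].add(doc_id)
    return index


def snippet(text, terms, width=30):
    """Return a short window of text starting near the first query word."""
    lowered = text.lower()
    hits = []
    for term in terms:
        match = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if match:
            hits.append(match.start())
    if not hits:
        return text[:width]
    start = max(0, min(hits) - width // 2)
    return text[start:start + width].strip()


def search(query, docs, index):
    terms = tokens(query)
    if not terms:
        return []
    # Boolean AND: keep only documents that contain every query word.
    matches = set.intersection(*(set(index.get(t, ())) for t in terms))

    # Toy ranking (assumed): total occurrences of the query words.
    def score(doc_id):
        words = tokens(docs[doc_id])
        return sum(words.count(t) for t in terms)

    ranked = sorted(matches, key=lambda d: (-score(d), d))
    return [(d, snippet(docs[d], terms)) for d in ranked]


docs = {
    1: "Search engines index web pages. Web search is fast.",
    2: "A web page about cats.",
    3: "Search the web for web pages on web search.",
}
index = build_index(docs)
results = search("web search", docs, index)
for doc_id, text in results:
    print(doc_id, text)
```

Document 2 is excluded because it lacks the word "search", which is the Boolean AND behavior described above; switching to OR would replace the set intersection with a union.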
WRAPUP
• Loads of future work
– Even at that time, there were issues of:
• Information extraction from semi-structured sources (such as web pages)
– Still an active area of research
• Search engines as a digital library
– What services, APIs and toolkits should a search engine provide?
– What storage methods are the most efficient?
– From 2005 to 2010 to ???
• Enhancing metadata
– Automatic markup and generation
– What are the appropriate fields?
• Automatic Concept Extraction
– Present the Searcher with a context
• Searching languages: beyond context-free queries
• Other types of search: Facet, GIS, etc.