
Query and Reporting Tools: Search Engine Architecture

The document discusses query and reporting tools available from Business Objects to access a data warehouse. It describes BusinessObjects, InfoView, InfoBurst, and Data Warehouse List Upload as tools that provide point-and-click interfaces for querying, reporting, refreshing reports, and uploading lists. The document also states that WSU has negotiated a contract with Business Objects for purchasing these tools at a discount.

Uploaded by: Umang Purohit
Copyright: © Attribution Non-Commercial (BY-NC)

Query and Reporting Tools

The data warehouse is accessed using an end-user query and reporting tool from Business Objects. Business Objects provides several tools to securely access the data warehouse or personal data files with a point-and-click interface, including the following:

 BusinessObjects (Reporter and Explorer) - a Microsoft Windows-based query and reporting tool.
 InfoView - a web-based tool that allows reports to be refreshed on demand (but cannot create new reports).
 InfoBurst - a web-based server tool that allows reports to be refreshed, scheduled, and distributed. It can be used to distribute reports and data to users or servers in various formats (e.g. text, Excel, PDF, HTML). For more information, see the documentation below:
 InfoBurst Usage Notes (PDF)
 InfoBurst User Guide (PDF)
 Data Warehouse List Upload - a web-based tool that allows lists of data to be uploaded into the data warehouse for use as input to queries. For more information, see the documentation below:
 Data Warehouse List Upload Instructions (PDF)

WSU has negotiated a contract with Business Objects for purchasing these tools at a discount. View BusObj Rates.

Search Engine Architecture

A software architecture consists of software components, the interfaces provided by those components, and the relationships between them. It describes a system at a particular level of abstraction.

The architecture of a search engine is determined by two requirements: effectiveness (quality of results) and efficiency (response time and throughput).

Search engines are one tool used to answer information needs.
• Users express their information needs as queries.

What is a good answer to a query?


• One that is relevant to the user’s information need!
• Search engines typically return ten answers per page, where each answer is a short summary of a web document.
• Likely relevance to an information need is approximated by statistical similarity between web documents and the query.
• Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results.
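Precision over the first page of results can be computed directly. A minimal sketch (the document ids and relevance judgments below are made up for illustration):

```python
def precision_at_k(results, relevant, k=10):
    """Fraction of the top-k results that are relevant to the information need."""
    top_k = results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

# Hypothetical ranked result list and relevance judgments.
results = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(results, relevant, k=5))  # 2 relevant in top 5 -> 0.4
```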

HOW DO SEARCH ENGINES WORK?


1. Web Crawler / Spiders
2. Databases & Indexes (Inverted Index)
3. Search Results Ranking

Three main parts:


– Gather the contents of all web pages (using a program called a crawler or spider)
– Organize the contents of pages in a way that allows efficient retrieval (indexing)
– Take in a query, determine which pages match, and show the results (ranking and
display of results)

1. Web Crawlers / Spiders


Crawlers gather pages/sites
– Programs that move from site to site on the web and gather information about the
pages found
– Start with a list of domain names (homepages), and follow the hyperlinks on those homepages.
– Keep a list of URLs visited and those still to be visited.
– At each site, the crawler may be focused on breadth or depth
• Breadth – gather top pages and move on to another site
– Allows it to find more sites
• Depth – gathers all pages at site
– Allows it to index more pages in each site
How frequently a site gets crawled varies from engine to engine.
Web crawlers collect:
– Mostly HTML pages
– PDF
– Word
– PPT, etc.
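The crawling loop described above can be sketched as a breadth-first traversal over URLs. This is purely illustrative: the in-memory TOY_WEB dictionary stands in for real HTTP fetching and HTML link extraction, which a production crawler would perform instead.

```python
from collections import deque

# Toy "web": URL -> (page content, outgoing links). A real crawler would
# fetch pages over HTTP and extract links from the HTML.
TOY_WEB = {
    "http://a.example/":   ("home of A", ["http://a.example/p1", "http://b.example/"]),
    "http://a.example/p1": ("page 1 of A", []),
    "http://b.example/":   ("home of B", ["http://a.example/"]),
}

def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl: start from seed homepages, follow hyperlinks,
    and keep a list of URLs visited and those still to be visited."""
    to_visit = deque(seeds)
    visited = set()
    pages = {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        content, links = web[url]
        pages[url] = content
        to_visit.extend(links)  # enqueue newly discovered URLs
    return pages

pages = crawl(["http://a.example/"], TOY_WEB)
```

Breadth-first order matches the "breadth" strategy above (top pages first); switching the deque to a stack would give the "depth" strategy instead.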

2. Databases & Indexing


Databases
a. Input from crawlers, from submissions by authors, from related directories
b. Cached pages
c. Describes pages (indexes)
d. The size of the database is an important issue
e. Even the largest does not cover the entire Web

-Indexing
– Each page that is included in the database is indexed (automatically)
• “All” the words on the page (for full-text search)
• Stop words
• Metatags: title, others
• URL
• Hypertext anchors and links
– Spamming
• Load words into metatags
• Load invisible words (e.g., white text on white background)
Inverted Index
-How to store the words for fast lookup
-Basic steps:
– Make a “dictionary” of all the words in all of the web pages
– For each word, list all the documents it occurs in.
– Often omit very common words
• “stop words”
– Sometimes stem the words
• (also called morphological analysis)
• cats -> cat
• running -> run
-In reality, this index is huge.
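The basic steps above (build a dictionary of words, list the documents each word occurs in, omit stop words, stem the words) can be sketched as follows. The tokenizer, stop-word list, and suffix-stripping stemmer here are deliberately crude placeholders; real engines use proper tokenization and stemmers such as the Porter stemmer.

```python
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}

def tokenize(text):
    """Lowercase and split on whitespace, stripping common punctuation."""
    return [w.lower().strip(".,!?") for w in text.split()]

def stem(word):
    """Very crude suffix stripping (a stand-in for real morphological analysis)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    """Map each stemmed, non-stop word to the set of doc ids it occurs in."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            if word in STOP_WORDS:
                continue
            index.setdefault(stem(word), set()).add(doc_id)
    return index

docs = {1: "cats like the garden", 2: "a cat runs"}
index = build_inverted_index(docs)
# index["cat"] == {1, 2}; stop words like "the" never reach the index
```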

3. Results ranking
a) The search engine receives a query, then
b) looks up the words in the index and retrieves many documents, then
c) rank-orders the pages and extracts “snippets” or summaries containing the query words.
– Most web search engines assume the user wants all of the words (Boolean AND, not OR).
d) These are complex and highly guarded algorithms unique to each search engine.
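Step b), looking up query words in the index with an implicit Boolean AND, amounts to intersecting the postings sets of the query terms. A toy sketch (the index contents are assumed, following the inverted-index discussion above):

```python
def boolean_and_search(index, query_terms):
    """Return the doc ids that contain ALL query terms (Boolean AND)."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    # Intersect the postings lists: a document survives only if it
    # appears in every term's posting set.
    return set.intersection(*postings)

index = {"cat": {1, 2, 3}, "garden": {2, 4}}
print(boolean_and_search(index, ["cat", "garden"]))  # {2}
```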

Some ranking criteria:


-For a given candidate result page, use:
– Number of matching query words in the page
– Frequency of terms on the page and in general
– Proximity of matching words to one another
– Location of terms within the page
– Location of terms within tags e.g. <title>, <h1>, link text, body text
– Anchor text on pages pointing to this one
– Link analysis of which pages point to this one
– (Sometimes) Click-through analysis: how often the page is clicked on
– How “fresh” is the page.
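A toy scoring function combining two of these criteria (frequency of terms on the page, and the location of terms within the <title> tag) might look like the following. The weights and page structure are illustrative assumptions only, not any engine's actual formula.

```python
def score_page(page, query_terms):
    """Toy score: term frequency in the body, plus a boost for title matches."""
    body_words = page["body"].lower().split()
    title_words = page["title"].lower().split()
    score = 0.0
    for term in query_terms:
        score += body_words.count(term)  # frequency of terms on the page
        if term in title_words:
            score += 5.0                 # location of terms within <title>
    return score

pages = {
    "p1": {"title": "search engines", "body": "how search engines rank pages"},
    "p2": {"title": "cooking", "body": "search for recipes"},
}
ranked = sorted(pages, key=lambda p: score_page(pages[p], ["search", "engines"]),
                reverse=True)
# "p1" ranks above "p2": it matches both terms, including in its title
```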

HOW TO SCALE IN MODERN TIMES


• Currently
– Efficient index
– Petabyte scale storage space
– Efficient Crawling
– Cost effectiveness of hardware
• Future
– Qualitative context
• Maintaining localization data
– Perhaps send indexing to clients
– Client computers help gather Google’s index in a distributed,
decentralized fashion?

WRAPUP
• Loads of future work
– Even at that time, there were issues of:
• Information extraction from semi-structured sources (such as web pages)
– Still an active area of research
• Search engines as a digital library
– What services, APIs and toolkits should a search engine provide?
– What storage methods are the most efficient?
– From 2005 to 2010 to ???
• Enhancing metadata
– Automatic markup and generation
– What are the appropriate fields?
• Automatic Concept Extraction
– Present the Searcher with a context
• Searching languages: beyond context-free queries
• Other types of search: Facet, GIS, etc.
