
REPORT TITLED

SEARCH ENGINES: CONCEPT, TECHNOLOGY AND CHALLENGES

Submitted in partial fulfillment of the requirements for the Degree of

Master of Computer Applications


(MCA)

By Mrunalini S. Shinde Roll No. 092011015

Under the guidance of Prof. L. C. Nene

Department Of Computer Technology
Veermata Jijabai Technological Institute
(Autonomous Institute, Affiliated To University of Mumbai)
Mumbai 400019
Year 2011-2012

VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE MATUNGA, MUMBAI 400019

CERTIFICATE

This is to certify that the seminar report titled Search Engines: Concept, Technology and Challenges has been completed successfully By

Miss. Mrunalini S. Shinde


Roll No. 092011015 Class: MCA-VI in Academic year 2011-2012

Evaluator: Date:

Contents
1. Introduction
   1.1 Search Engines
   1.2 History of Search Engines
2. Components of Search Engine
3. How Search Engines Work?
   3.1 Web Crawling
   3.2 Indexing
   3.3 Searching
   3.4 Relevance Ranking
4. Types of Search Engines
   4.1 Crawler-Based Search Engines
   4.2 Human-Powered Directories
   4.3 Hybrid Search Engines
5. Search Engine Ranking Algorithms
   5.1 TF-IDF Ranking Algorithm
   5.2 PageRank Algorithm
6. Challenges to Search Engines
7. Search Engine Optimization (SEO)
   7.1 SEOs
   7.2 Advantages of SEO
8. Challenges to SEOs
9. Case Study: Google Search
   9.1 Introduction
   9.2 Architecture & Working
10. Conclusion
11. References


1. Introduction
1.1 Search Engines
A Search Engine is a tool or program designed to search for information on the World Wide Web on the basis of specified keywords and to return a list of the documents in which those keywords were found.

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:

They search the Internet -- or select pieces of the Internet -- based on important words.

They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.

1.2 History of Search Engines

Gopher

Gopher was developed in 1991 at the University of Minnesota (whose sports teams are called The Golden Gophers) and was in use up to 1996. It is an Internet server from which hierarchically organized text files could be retrieved from all over the world. HyperGopher could also display GIF and JPEG graphic images. Three important search tools of the Gopher era were Archie, Veronica and Jughead.

Archie was a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.

Veronica ("Very Easy Rodent-Oriented Net-wide Index to Computer Archives") is a search engine system for the Gopher protocol, developed in 1992 by Steven Foster and Fred Barrie at the University of Nevada, Reno. Veronica is a constantly updated database of the names of almost every menu item on thousands of Gopher servers. The Veronica database can be searched from most major Gopher menus.

Jughead ("Jonzy's Universal Gopher Hierarchy Excavation And Display") is a search engine system for the Gopher protocol, developed by Rhett Jones in 1993 at the University of Utah. It is distinct from Veronica in that it searches a single server at a time.

However, Gopher lost importance with the introduction of the first graphical browser, Mosaic.

Wide Area Information Servers (W.A.I.S.)

W.A.I.S. coexisted with Gopher. As with Gopher, files had to be stored in a predetermined manner in databases, and the W.A.I.S. user had to connect to known databases in order to retrieve information or files. It met the same fate as Gopher, i.e., it became superfluous with the introduction of browsers and search engines.

Wandex

The first real search engine, in the form that we know search engines today, did not come into being until 1993. It was developed by Matthew Gray and was called Wandex. Wandex indexed files and allowed users to search for them. It was the first program to crawl the Web, and it later became the basis for all search crawlers.

2. Components of Search Engine

Fig: Components Of Search Engine

Search Form
The Search Form can be considered the user interface of the search engine. It is a simple form where the user enters a query, usually in the form of keywords.

Query Parser
The Query Parser tokenizes the input and looks for operators and filters.
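As an illustration only, the following Python sketch shows what a very simple query parser might do, assuming a made-up syntax in which quoted strings are phrases and a leading + or - marks required and excluded terms. Real engines support far richer operators and filters; all names here are illustrative.

```python
import re

def parse_query(query):
    """Toy query parser: lowercases the input, keeps quoted phrases together,
    and separates +/- operator terms from plain keywords (hypothetical syntax)."""
    tokens = re.findall(r'"[^"]+"|\S+', query.lower())
    required, excluded, keywords = [], [], []
    for tok in tokens:
        phrase = tok.strip('"')
        if tok.startswith('+'):
            required.append(phrase.lstrip('+'))
        elif tok.startswith('-'):
            excluded.append(phrase.lstrip('-'))
        else:
            keywords.append(phrase)
    return {"keywords": keywords, "required": required, "excluded": excluded}

print(parse_query('search engines +crawler -gopher "page rank"'))
```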

Index
The Index is the file created by the web crawler; it is used as a lookup structure by the Query Engine.

Query Engine
The Query Engine finds the web pages that match the given criteria using the index.

Relevance Ranker
The Relevance ranker is the search engine algorithm that ranks the results of search in order of relevance.

Formatter
This component deals with the way results are laid out and displayed to the user. It shows the results in order of importance, as decided by the relevance ranking.

3. How Search Engines Work?


Search engines use software robots to survey the Web and build their databases. Web documents are retrieved and indexed. When you enter a query at a search engine website, your input is checked against the search engine's keyword indices. Search engines look through their own databases of information in order to find what it is that you are looking for. The best matches are then returned to you as hits.

Fig: Internal Working Of Search Engine

3.1 Web Crawling


Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
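The sketch below is a minimal, illustrative crawler in Python. It assumes a hypothetical seed URL and ignores robots.txt, politeness delays, duplicate content detection and proper HTML parsing, all of which a real spider needs; it only shows the core loop of fetching a page, recording its words and following its links.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=10):
    """Toy breadth-first crawler: fetches pages, records the words on each page,
    and follows hyperlinks (no politeness, robots.txt, or error handling)."""
    frontier, seen, word_lists = deque([seed_url]), {seed_url}, {}
    while frontier and len(word_lists) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        text = re.sub(r"<[^>]+>", " ", html)               # crude tag stripping
        word_lists[url] = re.findall(r"[a-z0-9]+", text.lower())
        for link in re.findall(r'href="([^"#]+)"', html):  # extract outgoing links
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return word_lists

# pages = crawl("http://example.com/")   # hypothetical seed URL
```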

3.2 Indexing
Indexing means extracting the content of a page and storing it, i.e., assigning each word to the page under which it will be found later when users search.

It uses techniques similar to those used when handling actual queries, such as the following:

Stopword lists: words that do not contribute to the meaning. Examples: a, an, in, the, we, you, do, and.
Word stemming: creating a canonical form. Examples: words : word, swimming : swim.
Thesaurus: words with identical or similar meaning; synonyms.
Capitalization: mostly ignored (content is important, not case).

Some search engines also index different file types. Example: Google also indexes PDF files
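As an illustration of these indexing steps, the following Python sketch builds a toy inverted index with a stopword list, case folding and a deliberately crude suffix-stripping stemmer (a real engine would use something like the Porter stemmer). All names and data are illustrative.

```python
STOPWORDS = {"a", "an", "in", "the", "we", "you", "do", "and"}

def stem(word):
    """Very crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ming", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_index(word_lists):
    """Build an inverted index: term -> set of pages the term appears on."""
    index = {}
    for page, words in word_lists.items():
        for word in words:
            word = word.lower()                  # ignore capitalization
            if word in STOPWORDS:                # drop words that carry no meaning
                continue
            index.setdefault(stem(word), set()).add(page)
    return index

index = build_index({"page1": ["Swimming", "in", "the", "sea"],
                     "page2": ["words", "and", "word", "stemming"]})
print(index)   # e.g. {'swim': {'page1'}, 'sea': {'page1'}, 'word': {'page2'}, 'stem': {'page2'}}
```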

3.3 Searching
There are two primary methods of text searching--keyword and concept.

KEYWORD SEARCHING: This is the most common form of text search on the Web. Most search engines do their text query and retrieval using keywords. Unless the author of the Web document specifies the keywords for her document (this is possible by using meta tags in HTML), it's up to the search engine to determine them.

Essentially, this means that search engines pull out and index words that are believed to be significant. Words that are mentioned towards the top of a document and words that are repeated several times throughout the document are more likely to be deemed important. CONCEPT BASED SEARCHING: Unlike keyword search systems, concept-based search systems try to determine what you mean, not just what you say. In the best circumstances, a concept-based search returns hits on documents that are "about" the subject/theme you're exploring, even if the words in the document don't precisely match the words you enter into the query. This is also known as clustering -- which essentially means that words are examined in relation to other words found nearby.
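To make the keyword approach concrete, the fragment below is a minimal Python sketch (illustrative names and data only) of how a query could be answered by intersecting posting sets from an inverted index such as the one sketched in Section 3.2.

```python
def keyword_search(index, query_terms):
    """Return pages containing all query terms (AND semantics) by
    intersecting the posting sets from the inverted index."""
    postings = [index.get(term.lower(), set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

toy_index = {"search": {"p1", "p2"}, "engine": {"p2", "p3"}, "gopher": {"p3"}}
print(keyword_search(toy_index, ["search", "engine"]))   # -> {'p2'}
```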

3.4 Relevance Ranking


Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first. Of course, the search engines don't always get it right. Non-relevant pages make it through, and sometimes it may take a little more digging to find what you are looking for. But, by and large, search engines do an amazing job. How do crawler-based search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade secret. However, all major search engines follow the general rules below.

Location, Location, Location...and Frequency

One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.

Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning. Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Those with a higher frequency are often deemed more relevant than other web pages.

Off The Page Factors

Off-the-page factors are those that a webmaster cannot easily influence. Chief among these is link analysis. By analyzing how pages link to each other, a search engine can both determine what a page is about and whether that page is deemed to be "important" and thus deserving of a ranking boost.

Another off the page factor is clickthrough measurement. In short, this means that a search engine may watch what results someone selects for a particular search, and then eventually drop high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages that do pull in visitors.

The number of other Web pages that link to the page in question is one such factor; Google uses it to calculate PageRank.
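The location/frequency idea can be illustrated with a toy scorer. The Python sketch below (illustrative only, with made-up parameters) counts how often each query term occurs on a page and adds a bonus when the term appears near the top; real ranking algorithms combine hundreds of such signals.

```python
def location_frequency_score(words, query_terms, top_n=50, top_bonus=2.0):
    """Toy location/frequency scorer: counts how often each query term occurs
    and gives an extra bonus when the term appears near the top of the page."""
    words = [w.lower() for w in words]
    score = 0.0
    for term in query_terms:
        term = term.lower()
        score += words.count(term)            # frequency
        if term in words[:top_n]:             # location: near the top of the page
            score += top_bonus
    return score

doc = ["search", "engines", "rank", "pages"] * 3
print(location_frequency_score(doc, ["search", "rank"]))   # -> 10.0
```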

4. Types of Search Engine


The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways.

4.1 Crawler-Based Search Engines


Crawler-based search engines, such as Google, create their listings automatically. They "crawl" or "spider" the web, then people search through what they have found. If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.

4.2 Human-Powered Directories


A human-powered directory, such as the Open Directory, depends on humans for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted. Changing your web pages has no effect on your listing. Things that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.

4.3 "Hybrid Search Engines" Or Mixed Results


In the web's early days, a search engine typically presented either crawler-based results or human-powered listings. Today, it is extremely common for both types of results to be presented. Usually, a hybrid search engine will favor one type of listing over the other, but it will also present the other type, for example showing crawler-based results (such as those once provided by Inktomi) for more obscure queries.

5. Search Engine Ranking Algorithms


After the database has been created and placed in the search engine computer's memory, the device is finally ready to perform searches and deliver results. Only now does another device come into play: the ranking algorithm. All search engines, including directories, score the relevancy of web pages through these mathematical machines. Their purpose is to deliver links to web pages most relevant to each search phrase. Rightfully so, these automatic mechanisms are a source of great pride and revenue for their inventors.

5.1 TF-IDF Ranking Algorithm


This algorithm calculates a relevance score for a page based on the following two concepts:

1. Term Frequency (TF) i.e. how frequently the term appears on the page.

2. Inverse Document Frequency (IDF) i.e. rare words are likely to be more important.

Consider, for example, the query "Mahendra Singh Dhoni".

A good answer contains all three words, and the more frequently they appear, the better; we call this Term Frequency (TF). Some query terms are more important, i.e. have better discriminating power, than others. For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra"; we call this Inverse Document Frequency (IDF).

wij = tfij * log2(N / n)

where
wij : weight of term Tj in document Di
tfij : frequency of term Tj in document Di
N : number of documents in the collection
n : number of documents in which term Tj occurs at least once
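As a worked illustration of the formula above, the Python sketch below computes these weights for a small toy collection (raw term counts, no length normalization; document names and contents are made up). Notice how "dhoni", the rarest term, receives the highest weight, matching the intuition described above.

```python
import math

def tfidf_weights(docs):
    """Compute wij = tfij * log2(N / n) for every term in every document,
    following the formula above (raw term counts, no normalization)."""
    N = len(docs)
    # n: number of documents in which each term occurs at least once
    df = {}
    for words in docs.values():
        for term in set(w.lower() for w in words):
            df[term] = df.get(term, 0) + 1
    weights = {}
    for doc_id, words in docs.items():
        words = [w.lower() for w in words]
        weights[doc_id] = {t: words.count(t) * math.log2(N / df[t])
                           for t in set(words)}
    return weights

docs = {"d1": ["mahendra", "singh", "dhoni", "dhoni"],
        "d2": ["mahendra", "singh"],
        "d3": ["singh"]}
print(tfidf_weights(docs)["d1"])   # 'dhoni' gets the largest weight
```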

5.2 PageRank Algorithm


PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of measuring its relative importance within the set.

Fig: Mathematical PageRanks for a simple network, expressed as percentages

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.


Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial PageRank of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page is 0.25. The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.

Suppose instead that page B had a link to pages C and A, while page D had links to all three pages. Thus, upon the next iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(·).

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ PR(v) / L(v), summed over all pages v in Bu

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
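The sketch below (Python, illustrative only) computes PageRank iteratively in the probability-distribution form used in this section, starting every page at 1/N and using (1-d)/N as the base term. It reuses the four-page example from the text, assuming (as one possible configuration) that C links only to A; pages with no out-links, such as A here, simply leak rank in this simplified version.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank, probability-distribution form: every page starts at
    1/N and receives (1-d)/N plus d times the rank flowing in from its in-links."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / N + d * incoming
        pr = new_pr
    return pr

# B links to A and C; D links to A, B and C; C is assumed to link to A.
links = {"A": set(), "B": {"A", "C"}, "C": {"A"}, "D": {"A", "B", "C"}}
print(pagerank(links))
```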


6. Challenges to Search Engines


1. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research.
2. Automated search engines that rely on keyword matching usually return too many low-quality matches.
3. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.
4. Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection, do not contribute new information to search results, and thus annoy users.
5. Uniform sampling of web pages.
6. Modelling the web graph and utilising its structure in a search engine.


7. Search Engine Optimization


7.1 SEOs
From the above discussion it is clear that there are two things involved in getting a web page into search engine results:

1. Getting into the search engine index.
2. Getting the web page to the top of the final sorted results before display.

Accomplishing step 1 is relatively easy. You just need to let the search engine spider know that the new web page exists. You can do this by pointing to the new page from an existing web page that is already indexed. Some search engines also provide an option to suggest a new URL for inclusion into their index.

Step 2 is the tough part. Most of the Search Engine Optimization tasks revolve around this. Search engines spend a lot of time and effort on making their algorithms find the best way to rank sites. According to Google, there are over 200 factors that determine the rank of a web page in the results.

Thus, Search Engine Optimization is the process of trying to get your web page to rank at the top of the search engine results for the keywords that are important to you.

7.2 Advantages of SEO


1) It improves the ranking of the web site and hence improves the turnover of the firm or company.
2) SEO can increase the number of visitors who are actively searching for your website and products.
3) SEO increases brand visibility and hence increases sales.
4) SEO services are highly cost-effective; SEO of a site does not require much capital.
5) SEO also increases flexibility, visibility and targeted traffic, gives long-term top positioning, and much more.

8. Challenges To SEOs
1. The challenge of keyword research

You want your website to be ranked well in all suitable searches users perform. This requires choosing and using the appropriate keywords in an appropriate manner, i.e., some amount of keyword research. Thinking from the user's perspective and deciding the keywords for a website is a challenge in itself. Another issue is the incorrect or misleading use of the same keywords by other sites competing for relevance ranking.

2. The challenge of optimizing your page


Many SEOs look at optimization as their major challenge. They are so focused on where the keywords are on their page that they are not really creating the level of "content quality" or value that they might have if they were not so preoccupied with optimization. You've all seen pages that are stacked with keywords, but how do these read, quality-wise, to the average human being? Usually the result is only a fraction of what it could be with a slightly different approach.

I've even seen people so focused on the optimization process that they forget to include any call to action. What good is even a number 1 ranking for the best possible phrase if, when people read the page, they just leave again because there is no call to action?

So what am I suggesting? The solution is simple. Write with your researched keyword phrase in mind, but do not try to optimize and create new content at the exact same time.

Instead, think about it as a two-step process and take these as two separate steps.

Step 1: Create new content for your readers. Write some unique, original content that you feel you can be proud of, something you know your visitors will find interesting and useful.

Don't even think about keyword density, keyword prominence or keyword placement. Stay focused on the message of your content so that you end up with a well-crafted page that reads well when you read it out loud and serves the needs of your readers. Make sure there is a significant call to action. Okay, now you've created your content; you have done the hard part.

Step 2: Now go back and do a simple re-write for those search engines. You'll be rewriting with a finished article and just making mild changes to it for the purpose of optimization.

Tip: If you've done good keyword research and are optimizing for the right phrases, you'll find that with this two-step approach you'll be creating much better-quality articles for your visitors, and because you have done good research, the optimization is not that hard.

3. The challenge of knowing exactly what search engine spiders are doing and how far they are crawling into your Web site.
The challenge is knowing exactly what those search engine spiders are doing: how often they visit, which pages they visit, how long they stay, and a whole lot more. There are tools, such as Robot Manager Professional, that allow you to track some really fascinating information about search engine spiders.

4. Are you trying to get search engine spiders to come back and re-visit your Web site more often?
Are you trying to use a Meta Tag to get search engine spiders to come back and revisit your Web site every so often?

Save your time. Search engine robots actually run much better on their own schedule, and just because you include a Meta Revisit Tag <META name="Revisit-After" content="X Days"> does not mean the robot is going to pay attention to it.


5. Content Freshness:
There is one tip that is extremely effective in getting search engine robots to visit you much more often. We refer to this as the "content freshness" factor. It is not complex to understand. What you want to do is start adding pages to your Web site on a regular, consistent basis. For example, adding one new article per month to your Web site would be good, but adding one new article per week would be even better. The key is consistency. Each time the robot returns and sees more new content, it tends to adjust the frequency of its visits. There are other benefits as well; even if you were to add new content regularly only for the sake of your visitors, you may enjoy some of these other benefits too.


9. Case Study: Google Search


9.1 Introduction:
The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page; this ranking is called PageRank. Second, Google utilizes link information, such as anchor text, to improve search results.

9.2 Architecture & Working

Fig: High Level Google Architecture

Working
In Google, the web crawling is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.

The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization.

The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.
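As a simplified illustration of this step, the Python sketch below turns a toy forward index (docID -> hits) into an inverted index keyed by wordID. A real system performs this resorting over large on-disk barrels rather than in-memory dictionaries, and the IDs and hit format used here are made up for the example.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn a forward index (docID -> list of (wordID, position) hits) into an
    inverted index (wordID -> list of (docID, position) postings sorted by docID)."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, position in forward_index[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)

forward = {1: [(42, 0), (7, 1)], 2: [(7, 0), (99, 3)]}
print(invert(forward))   # wordID 7 -> [(1, 1), (2, 0)], etc.
```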

A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.


10. Conclusions

Though there are many search engines available on the web, searching methods and the engines themselves still have a long way to go for efficient retrieval of information on relevant topics.

Indexing the entire web and building one huge integrated index will further deteriorate retrieval effectiveness, since the web is growing at an exponential rate. Building indexes in a hierarchical manner can be considered as an alternative. The current generation of search tools and services has to significantly improve its retrieval effectiveness; otherwise, the web will continue to evolve towards an information entertainment center for users with no specific search objectives. Choosing the right search engine is also a challenge, and several factors should be considered while deciding. SEO can do a great job of helping to increase search engines' efficiency and improve the performance of your websites.


11. References

1. Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine".
2. www.perfect-optimization.com
3. www.google.com
4. http://computer.howstuffworks.com
5. http://www.searchengineworkshops.com/articles/5-challenges.html
6. Monica R. Henzinger, "Algorithmic Challenges in Web Search Engines".
7. http://searchenginewatch.com/article/2065267/International-SEO-Challenges-and-Tips

