SEARCH ENGINES: CONCEPT, TECHNOLOGY AND CHALLENGES
Submitted in partial fulfillment of the requirements for the Degree of
Department of Computer Technology
Veermata Jijabai Technological Institute
(Autonomous Institute, Affiliated to University of Mumbai)
Mumbai 400019
Year 2011-2012
CERTIFICATE
This is to certify that the seminar report titled Search Engines: Concept, Technology and Challenges has been completed successfully By
Evaluator: Date:
Contents

Chapters
1 Introduction
 1.1 Search Engines
 1.2 History Of Search Engines
2 Components Of Search Engine
3 How Search Engines Work?
 3.1 Web Crawling
 3.2 Indexing
 3.3 Searching
 3.4 Relevance Ranking
4 Types Of Search Engines
 4.1 Crawler-Based Search Engines
 4.2 Human-Powered Directories
 4.3 Hybrid Search Engines
5 Search Engine Ranking Algorithms
 5.1 TF-IDF Ranking Algorithm
 5.2 PageRank Algorithm
6 Challenges To Search Engines
7 Search Engine Optimization (SEOs)
 7.1 SEOs
 7.2 Advantages Of SEO
8 Challenges To SEOs
9 Case Study: Google Search
 9.1 Introduction
 9.2 Architecture & Working
10 Conclusion
11 References
1. Introduction
1.1 Search Engines
A search engine is a tool or program designed to search for information on the World Wide Web on the basis of specified keywords and to return a list of the documents in which those keywords were found.
Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them. They allow users to look for words or combinations of words found in that index.
1.2 History Of Search Engines

Gopher
Gopher, developed in 1991 and in use up to about 1996, was an Internet service from which hierarchically organized text files could be retrieved from all over the world. It was developed at the University of Minnesota, whose sports teams are called the Golden Gophers. HyperGopher could also display GIF and JPEG graphic images. Three important search tools of this era were Archie, Veronica and Jughead.
Archie was a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.
Veronica, i.e. "Very Easy Rodent-Oriented Net-wide Index to Computer Archives", is a search engine system for the Gopher protocol, developed in 1992 by Steven Foster and Fred Barrie at the University of Nevada, Reno. Veronica is a constantly updated database of the names of almost every menu item on thousands of Gopher servers; the database can be searched from most major Gopher menus.
Jughead i.e. Jonzy's Universal Gopher Hierarchy Excavation And Display is a search engine system for the Gopher protocol. Jughead was developed by Rhett Jones in 1993 at the University of Utah. It is distinct from Veronica in that it searches a single server at a time.
However, Gopher lost importance with the introduction of the first popular graphical browser, Mosaic.
W.A.I.S. (Wide Area Information Servers) coexisted with Gopher. For Gopher, files had to be stored in a predetermined manner in databases. The W.A.I.S. user had to connect to known databases in order to retrieve information or files. It met the same fate as Gopher, becoming superfluous with the introduction of graphical browsers and web search engines.
Wandex
The first real search engine, in the form that we know search engines today, did not appear until 1993. It was developed by Matthew Gray and was called Wandex. Wandex indexed the pages it found and allowed users to search for them. Its underlying technology was the first program to crawl the Web, and it later became the basis for all search crawlers.
2. Components Of Search Engine

Search Form

The Search Form can be considered the user interface of the search engine. It is a simple form where the user enters a query, usually in the form of keywords.
Query Parser
The Query Parser tokenizes the input and looks for operators and filters.
Index
The Index is the file created by the web crawler; it is used as a lookup by the Query Engine.
Query Engine
The Query Engine finds the web pages that match the given criteria by using the index.
Relevance Ranker
The Relevance Ranker is the search engine algorithm that ranks the search results in order of relevance.
Formatter
The Formatter deals with the way results are laid out and displayed to the user. It shows the results in the order of importance decided by the relevance ranking.
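As a rough illustration of how these components fit together, here is a minimal Python sketch; every name in it is an assumption made for illustration, not taken from any real engine.

def parse_query(query):
    # Query Parser: lowercase the input and split it into keyword tokens.
    return query.lower().split()

def find_matches(tokens, index):
    # Query Engine: look each token up in the index and intersect the results.
    result = None
    for token in tokens:
        pages = index.get(token, set())
        result = pages if result is None else result & pages
    return result or set()

def rank(pages, scores):
    # Relevance Ranker: order the matching pages by a precomputed relevance score.
    return sorted(pages, key=lambda page: scores.get(page, 0.0), reverse=True)

def format_results(ranked):
    # Formatter: lay the ranked results out for display.
    return "\n".join(f"{i + 1}. {page}" for i, page in enumerate(ranked))

# Toy index as built from crawled pages: keyword -> set of pages containing it.
index = {"search": {"a.html", "b.html"}, "engine": {"a.html"}}
scores = {"a.html": 0.9, "b.html": 0.4}
print(format_results(rank(find_matches(parse_query("Search Engine"), index), scores)))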
3.2 Indexing
Indexing means extracting the content of each page and storing it, i.e. recording, for every word, the pages on which it occurs, so that those pages can be found later when users search for that word.
It uses techniques similar to those used when handling actual queries, such as the following:
Stopword lists: words that do not contribute to the meaning, e.g. a, an, in, the, we, you, do, and.
Word stemming: creating a canonical form, e.g. words -> word, swimming -> swim.
Thesaurus: grouping words with identical or similar meaning (synonyms).
Capitalization: mostly ignored (the content is important, not the case).
Some search engines also index other file types; for example, Google also indexes PDF files.
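The preprocessing steps listed above can be sketched in a few lines of Python. The stopword list is the example list given above; the suffix-stripping stemmer is a deliberately crude stand-in for a real stemming algorithm (such as Porter's), so treat it as an assumption for illustration only.

STOPWORDS = {"a", "an", "in", "the", "we", "you", "do", "and"}

def stem(word):
    # Very rough canonical form: strip a few common suffixes.
    for suffix in ("ming", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_page(url, text, index):
    # Lowercase (capitalization ignored), drop stopwords, stem, and record
    # which pages each remaining term occurs on.
    for token in text.lower().split():
        if token not in STOPWORDS:
            index.setdefault(stem(token), set()).add(url)

index = {}
index_page("a.html", "We do swimming in the sea", index)
print(index)    # {'swim': {'a.html'}, 'sea': {'a.html'}}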
3.3 Searching
There are two primary methods of text searching: keyword and concept.

KEYWORD SEARCHING: This is the most common form of text search on the Web. Most search engines do their text query and retrieval using keywords. Unless the author of a Web document specifies the keywords for that document (which is possible using meta tags in HTML), it is up to the search engine to determine them. Essentially, this means that search engines pull out and index words that are believed to be significant. Words that are mentioned towards the top of a document, and words that are repeated several times throughout the document, are more likely to be deemed important.

CONCEPT-BASED SEARCHING: Unlike keyword search systems, concept-based search systems try to determine what you mean, not just what you say. In the best circumstances, a concept-based search returns hits on documents that are "about" the subject or theme you are exploring, even if the words in the document do not precisely match the words you enter into the query. This is also known as clustering, which essentially means that words are examined in relation to other words found nearby.
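As a toy illustration of how such significance weighting might work (any real engine's exact weights are not public, so the numbers here are arbitrary assumptions), candidate keywords can be scored by how early and how often they appear:

def keyword_weights(text, top_fraction=0.2):
    # Words near the top of the document get a bonus, and every repetition
    # adds weight, so frequent early words come out as the most significant.
    words = text.lower().split()
    cutoff = max(1, int(len(words) * top_fraction))
    weights = {}
    for position, word in enumerate(words):
        bonus = 2.0 if position < cutoff else 1.0
        weights[word] = weights.get(word, 0.0) + bonus
    return weights

print(keyword_weights("search engines index the web so search works"))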
Search engines will also check whether the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning. Frequency is the other major factor in how search engines determine relevancy: a search engine will analyze how often keywords appear in relation to other words in a web page, and pages with a higher frequency are often deemed more relevant.

Off-The-Page Factors

Off-the-page factors are those that a webmaster cannot easily influence. Chief among these is link analysis. By analyzing how pages link to each other, a search engine can both determine what a page is about and decide whether that page is "important" and thus deserving of a ranking boost.
Another off the page factor is clickthrough measurement. In short, this means that a search engine may watch what results someone selects for a particular search, and then eventually drop high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages that do pull in visitors.
The number of other Web pages that link to the page in question is another such factor; Google uses it to calculate PageRank.
5.1 TF-IDF Ranking Algorithm

Two quantities are combined in this ranking scheme:
1. Term Frequency (TF), i.e. how frequently the term appears on the page.
2. Inverse Document Frequency (IDF) i.e. rare words are likely to be more important.
For a query such as "Mahendra Singh Dhoni", a good answer contains all three words, and the more frequently they appear the better; we call this Term Frequency (TF). Some query terms are more important, i.e. have better discriminating power, than others: an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra"; we call this Inverse Document Frequency (IDF).
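A minimal sketch of TF-IDF scoring under the usual textbook definitions (tf as the raw count of the term in the document, idf as log(N / document frequency)); the tiny document collection is invented to mirror the Dhoni/Mahendra example, and real engines apply many refinements on top of this.

import math

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)                         # Term Frequency
    df = sum(1 for d in all_docs if term in d)          # document frequency
    idf = math.log(len(all_docs) / df) if df else 0.0   # Inverse Document Frequency
    return tf * idf

docs = [
    "dhoni wins the match".split(),
    "mahendra singh dhoni plays cricket".split(),
    "mahendra is a common first name".split(),
    "mahendra sharma and mahendra kumar".split(),
]
# "dhoni" is rarer across the collection than "mahendra", so it scores higher
# for the second document even though each appears there once.
print(tf_idf("dhoni", docs[1], docs), tf_idf("mahendra", docs[1], docs))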
5.2 PageRank Algorithm

[Fig: Mathematical PageRanks for a simple network, expressed as percentages]

We assume page A has pages T1...Tn which point to it (i.e., which cite it). The parameter d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is then given by:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial PageRank of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page is 0.25. The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, while page D had links to all three pages. Thus, upon the next iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links.
In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ_(v ∈ Bu) PR(v) / L(v)

i.e. the PageRank value for a page u depends on the PageRank values of the pages v in the set Bu (the set of all pages linking to page u), each divided by the number L(v) of links going out of page v.
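The iterative computation can be sketched as follows, applied to the four-page example above and using the damped form PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) quoted earlier. Page A is left with no outbound links, as in the example, so in this sketch it simply keeps its score without redistributing it.

def pagerank(links, d=0.85, iterations=20):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}    # initial value 0.25 for four pages
    for _ in range(iterations):
        new = {}
        for page in pages:
            # Sum the contributions PR(v)/L(v) from every page v that links to `page`.
            incoming = sum(pr[v] / len(outs) for v, outs in links.items() if page in outs)
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr

# The example above: B -> A, C; C -> A; D -> A, B, C.
print(pagerank({"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}))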
Accomplishing step 1 is relatively easy. You just need to let the search engine spider know that the new web page exists. You can do this by pointing to the new page from an existing web page that is already indexed. Some search engines also provide an option to suggest a new URL for inclusion into their index.
Step 2 is the tough part. Most of the Search Engine Optimization tasks revolve around this. Search engines spend a lot of time and effort on making their algorithms find the best way to rank sites. According to Google, there are over 200 factors that determine the rank of a web page in the results.
Thus, Search Engine Optimization is the process of trying to get your web page to rank at the top of the search engine results for the keywords that are important to you.
8. Challenges To SEOs
1. The challenge of keyword research
You want your website to rank well in all the relevant searches that users perform. This requires choosing and using appropriate keywords in an appropriate manner, i.e. some amount of keyword research. Thinking from the user's perspective and deciding on the keywords for a website is indeed a challenge. Another issue is the incorrect or misleading use of the same keywords by other sites competing for relevance ranking.
I have even seen people so focused on the optimization process that they forget to include any call to action. What good is a number 1 ranking for the best possible phrase if, when people read the page, they just leave again because there is no call to action?
So what am I suggesting? The solution is simple. Write with your researched keyword phrase in mind, but do not try to optimize and create new content at the exact same time.
Instead, think of it as a two-step process.
Step 1: Create new content for your readers. Write some unique, original content that you feel you can be proud of, something you know your visitors will find interesting and useful.
Don't even think about keyword density, keyword prominence or keyword placement. Stay focused on the message of your content so that you end up with a well-crafted page that reads well when you read it out loud, and make sure it serves the needs of your readers. Make sure there is a significant call to action. Once you have created your content, you have done the hard part.
Step 2: Now go back and do a simple rewrite for the search engines. You'll be working from a finished article and making only mild changes to it for the purpose of optimization.
Tip: If you've done good keyword research and are optimizing for the right phrases, you'll find that with this two-step approach you create much better quality articles for your visitors, and because you have done good research, the optimization is not that hard.
3. The challenge of knowing exactly what search engine spiders are doing and how far they are crawling into your Web site.
How do you know exactly what those search engine spiders are doing, how often they visit, which pages they visit, how long they stay, and a whole lot more? There are tools, such as Robot Manager Professional, that allow you to track some really fascinating information about search engine spiders.
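The underlying idea of such tools can be sketched without them: count spider visits in the web server's access log. The sketch below assumes the common "combined" log format, in which the user-agent string is the last quoted field on each line, and the crawler names listed are just well-known examples.

import re
from collections import Counter

BOT_NAMES = ("Googlebot", "bingbot", "Baiduspider")

def spider_visits(log_path):
    visits = Counter()
    with open(log_path) as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)
            agent = quoted[-1] if quoted else ""
            for bot in BOT_NAMES:
                if bot.lower() in agent.lower():
                    visits[bot] += 1      # one more visit by this spider
    return visits

# Example call (the log path is hypothetical): print(spider_visits("access.log"))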
4. Are you trying to get search engine spiders to come back and re-visit your Web site more often?
Are you trying to use a Meta Tag to get search engine spiders to come back and revisit your Web site every so often?
Save your time. Search engine robots actually run on their own schedules, and just because you include a Meta Revisit tag <META name="Revisit-After" content="X Days"> does not mean the robot is going to pay attention to it.
5. Content Freshness:
There is one tip which is extremely effective in getting search engine robots to visit you much more often; we refer to it as the "content freshness" factor. It is not complex to understand: start adding pages to your Web site on a regular, consistent basis. For example, adding one new article per month to your Web site would be good, but adding one new article per week would be even better. The key is consistency. Each time the robot returns and sees more new content, it tends to adjust the frequency of its visits. There are other benefits as well, so even if you add new content regularly purely for the sake of your visitors, you may enjoy those benefits too.
Working
In Google, the web crawling is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.
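A much simplified, single-process sketch of that flow (the real system is distributed, and the fetch step is stubbed out here rather than issuing HTTP requests):

import zlib

doc_ids, repository = {}, {}

def assign_doc_id(url):
    # Every newly parsed URL gets the next free docID.
    return doc_ids.setdefault(url, len(doc_ids))

def store(url, html):
    # storeserver role: compress the fetched page and file it under its docID.
    repository[assign_doc_id(url)] = zlib.compress(html.encode("utf-8"))

def crawl(url_list, fetch):
    # Crawler role: fetch each URL handed out and pass the page to the store.
    for url in url_list:
        store(url, fetch(url))

crawl(["http://example.com/"], fetch=lambda url: "<html>stub page</html>")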
The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization.
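The hit records can be modelled roughly as below; the field layout and the flat font-size value are simplifying assumptions, since the real format is compactly bit-packed.

from dataclasses import dataclass

@dataclass
class Hit:
    word_id: int
    position: int      # word offset within the document
    font_size: int     # coarse approximation of relative font size
    capitalized: bool

def parse_document(words, lexicon):
    # Convert a parsed document into hit records for the forward index,
    # assigning each previously unseen word the next free wordID.
    return [
        Hit(lexicon.setdefault(w.lower(), len(lexicon)), i, 1, w[:1].isupper())
        for i, w in enumerate(words)
    ]

lexicon = {}
forward_index = {42: parse_document(["Search", "engines", "crawl"], lexicon)}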
The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.
A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
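A minimal sketch of that forward-to-inverted conversion, with hits simplified to (wordID, position) pairs and the barrel partitioning and in-place sorting of the real system left out:

def build_inverted_index(forward_index):
    # The sorter's job: turn docID-keyed hit lists into wordID-keyed posting lists.
    inverted = {}
    for doc_id, hits in forward_index.items():
        for word_id, position in hits:
            inverted.setdefault(word_id, []).append((doc_id, position))
    for postings in inverted.values():
        postings.sort()    # keep each posting list ordered by docID
    return inverted

forward = {1: [(7, 0), (3, 1)], 2: [(7, 0)]}    # docID -> [(wordID, position), ...]
print(build_inverted_index(forward))            # wordID -> [(docID, position), ...]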
10. Conclusions
Although there are many search engines available on the web, search methods and engines still have a long way to go before information on relevant topics can be retrieved efficiently.

Indexing the entire web and building one huge integrated index will further degrade retrieval effectiveness, since the web is growing at an exponential rate; building indexes in a hierarchical manner can be considered as an alternative. The current generation of search tools and services has to significantly improve its retrieval effectiveness, otherwise the web will continue to evolve towards an information entertainment center for users with no specific search objectives. Choosing the right search engine is also a challenge, and several factors should be considered when deciding. SEO can do a great deal to help increase search engine efficiency and improve the performance of your websites.
11. References
1. Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
2. www.perfect-optimization.com
3. www.google.com
4. http://computer.howstuffworks.com
5. http://www.searchengineworkshops.com/articles/5-challenges.html
6. Monica R. Henzinger, "Algorithmic Challenges in Web Search Engines"
7. http://searchenginewatch.com/article/2065267/International-SEO-Challenges-and-Tips