Building A Search Engine
Building A Search Engine
A little while ago, I was working on a intranet-site for a mid-size company. As the site grew in both size and popularity, the assigner requested me to extend the site with a searchfeature. And, since one of the general criterias on the intranet was that all logic code should be written in-house, using one of the open source engines was not an option. After about a day's work, the engine was quite complete, and the result actually turned out better than expected. By using a few simple techniques, and with the help of PHP and MySQL, these kind of projects are really not a problem anymore. In this article, I will present to you a cut-down version of the search engine that was developed. I hope this will get you started, and give you a few good ideas how to develop a search-engine of your own. With the example used in this article in hand, you could easily develop an engine that suits your particular needs, with the exact features that you would like to have. Database Design And Logic The database for the search engine consist of three table: page, word and occurrence. page holds all web pages that has been indexed, and word holds all words that has been found on the indexed pages in page. The rows in table occurrence consists of references to rows in page and word. Each row representing one occurrence of one particular word on one particular page. The SQL for creating these tables are shown below. CREATE TABLE page ( page_id int(10) unsigned NOT NULL auto_increment, page_url varchar(200) NOT NULL default '', PRIMARY KEY (page_id) ) TYPE=MyISAM; CREATE TABLE word ( word_id int(10) unsigned NOT NULL auto_increment, word_word varchar(50) NOT NULL default '', PRIMARY KEY (word_id) ) TYPE=MyISAM; CREATE TABLE occurrence ( occurrence_id int(10) unsigned NOT NULL auto_increment, word_id int(10) unsigned NOT NULL default '0', page_id int(10) unsigned NOT NULL default '0', PRIMARY KEY (occurrence_id) ) TYPE=MyISAM; As you see, while page and word hold actual data, occurrence act only as a reference table. By joining occurrence with page and word, one can find out on which page(s) a certain word is mentioned, as well as how many times it is mentioned. We'll get back to that a little later, however - first we need to populate the database so that we have some content to work with.
Populating The Database Okay, the database is created and we're ready to feed it with some content. For this, we'll create a PHP-script that takes a user-specified URL, reads the document representing the URL, and creates records in the database based on the words it extracts from the document. Take a look at the listing below. <? /* * populate.php * * Script for populating the search-database with words, * pages and word-occurences. */ /* Connect to the database: */ mysql_pconnect("localhost","root","secret") or die("ERROR: Could not connect to database!"); mysql_select_db("test"); /* Define the URL that sould be processed: */ $url = $_GET['url']; if( !$url ) { die( "You need to define a URL to process." ); } else if( substr($url,0,7) != "http://" ) { $url = "http://$url"; } /* Does this URL already have a record in the page-table? */ $result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\""); $row = mysql_fetch_array($result); if( $row['page_id'] ) { /* If yes, use the old page_id: */ $page_id = $row['page_id']; } else { /* If not, create one: */ mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")"); $page_id = mysql_insert_id(); } /* Start parsing through the text, and build an index in the database: */ $fd = fopen($url,"r"); while( $buf = fgets($fd,1024) ) { /* Remove whitespace from beginning and end of string: */ $buf = trim($buf); /* Try to remove all HTML-tags: */ $buf = strip_tags($buf); $buf = ereg_replace('/&\w;/', '', $buf); /* Extract all words matching the regexp from the current line: */ preg_match_all("/(\b[\w+]+\b)/",$buf,$words);
/* Loop through all words/occurrences and insert them into the database: */ for( $i = 0; $words[$i]; $i++ ) { for( $j = 0; $words[$i][$j]; $j++ ) { /* Does the current word already have a record in the word-table? */ $cur_word = strtolower($words[$i][$j]); $result = mysql_query("SELECT word_id FROM word WHERE word_word = '$cur_word'"); $row = mysql_fetch_array($result); if( $row['word_id'] ) { /* If yes, use the old word_id: */ $word_id = $row['word_id']; } else { /* If not, create one: */ mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")"); $word_id = mysql_insert_id(); } /* And finally, register the occurrence of the word: */ mysql_query("INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id)"); print "Indexing: $cur_word<br>"; } } } fclose($fd); ?> Basically, this script connects to the database, registers the URL (the page) in the database (if it's not already there), starts to retrieve data, uses the preg_match_all()-function to extract the words from the page, and then creates a record in the occurrence-table and/or the word-table for the currently processed word. So, for example, if the script finds the word 'linux' on http://www.onlamp.com, it will execute the following INSERT-statements: INSERT INTO page (page_url) VALUES ("http://www.onlamp.com"); INSERT INTO word (word_word) VALUES ("linux"); INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id); However, this is only true if http://www.onlamp.com has not been indexed yet, and that this occurence of 'linux' is the first one. If 'linux' occurs once more further down on the page, the two first statements will not get executed, and the 'old' page_id and word_id will be used again.
Let's now index a few pages. The seven sites that makes up the O'Reilly Network is probably a good idea. So, call populate.php with your browser using the site URLs as the only argument, one at a time: http://localhost/populate.php?url=http://www.macdevcenter.com http://localhost/populate.php?url=http://www.onjava.com http://localhost/populate.php?url=http://www.onlamp.com http://localhost/populate.php?url=http://www.openp2p.com http://localhost/populate.php?url=http://www.osdir.co http://localhost/populate.php?url=http://www.perl.com http://localhost/populate.php?url=http://www.xml.com A quick investigation of the tables now should result in something like this: mysql> SELECT * FROM page; page_id 1 2 3 4 5 6 7 page_url http://www.macdevcenter.com http://www.onjava.com http://www.onlamp.com http://www.openp2p.com http://www.osdir.com http://www.perl.com http://www.xml.com
7 rows in set (0.00 sec) mysql> SELECT COUNT(*) FROM word; COUNT(*) 2423 1 row in set (0.00 sec) mysql> SELECT COUNT(*) FROM occurrence; COUNT(*) 20628 1 row in set (0.00 sec) There you go - the database now have indexed seven pages, resulting in 2423 unique words and 20628 occurrences. We're ready for the next-level - the search-interface.
The Search Interface Of course, users of the search-engine will not be able to work with the MySQL database directly. Therefore, we'll create another PHP-script that makes it possible to query the database through a HTML-form. This will work just as any other search-engine: the user enters a word in a text-box, hits Enter, and the interface presents a result-page with links to the pages which contains the word that was searched for. In this example, the order in which the pages are presented is settled by the number of times the keyword appears in each document. The search.php-script are listed below <? /* * search.php * * Script for searching a datbase populated with keywords by the * populate.php-script. */ print "<html><head><title>My Search Engine</title></head><body>\n"; if( $_POST['keyword'] ) { /* Connect to the database: */ mysql_pconnect("localhost","root","secret") or die("ERROR: Could not connect to database!"); mysql_select_db("test"); /* Get timestamp before executing the query: */ $start_time = getmicrotime(); /* Execute the query that performs the actual search in the DB: */ $result = mysql_query(" SELECT p.page_url AS url, COUNT(*) AS occurrences FROM page p, word w, occurrence o WHERE p.page_id = o.page_id AND w.word_id = o.word_id AND w.word_word = \"".$_POST['keyword']."\" GROUP BY p.page_id ORDER BY occurrences DESC LIMIT ".$_POST['results'] ); /* Get timestamp when the query is finished: */ $end_time = getmicrotime(); /* Present the search-results: */ print "<h2>Search results for '".$_POST['keyword']."':</h2>\n"; for( $i = 1; $row = mysql_fetch_array($result); $i++ ) { print "$i. <a href='".$row['url']."'>".$row['url']."</a>\n"; print "(occurrences: ".$row['occurrences'].")<br><br>\n"; } /* Present how long it took the execute the query: */
print "query executed in ".(substr($end_time-$start_time,0,5))." seconds."; } else { /* If no keyword is defined, present the search-page instead: */ print "<form method='post'>Keyword: <input type='text' size='20' name='keyword'>\n"; print "Results: <select name='results'><option value='5'>5</option>\n"; print "<option value='10'>10</option><option value='15'>15</option>\n"; print "<option value='20'>20</option></select>\n"; print "<input type='submit' value='Search'></form>\n"; } print "</body></html>\n"; /* Simple function for retrieving the currenct timestamp in microseconds: */ function getmicrotime() { list($usec, $sec) = explode(" ",microtime()); return ((float)$usec + (float)$sec); } ?> There are two possible ways this script can get executed: with or without the keywordargument. If keyword is defined, the script tries to find that particular word in the database, and presents the result on-screen. Additionally, the script also shows how long it took for the database to come up with the result. If keyword is not defined, the search-page is presented instead. Enter the URL to this script in your browser, and you should see something like Figure 1.
Let's search on the keyword 'linux' then. You should now see something like Figure 2.
As expected, onlamp.com is presented first on the result-page. As stated, this is simply because the keyword 'linux' appears more frequently on this site than on the others. A search for 'java' would probably get onjava.com on the top, and 'xml' would most likely generate most hits on xml.com. Also note that we selected to limit the number of results to five on the search-page, which makes only the five most interesting pages to show up in the result.
Speeding Up The Database As you also see on the bottom of the result-page, the query took 0.393 seconds to execute. While this may not seem like an incredible long time, it would make the searches unreasonable slow as the database grows. Fortunately, there's a very simple trick you can do to speed this up. Create an index on the word_word-column: CREATE INDEX word_word_ix ON word (word_word); This will create an index on the column word_word in the word-table. Since this column is used every time a search is made with our search-engine, this will probably make our engine much faster. Let's try it out - search for the keyword 'linux' again, to see if we gained any performance. See Figure 3.
Nice! 0.028 seconds - a speed-increase of 0.365 seconds, or 1400%. If this engine would handle an average of 1000 queries per hour, this would mean a save in query executiontime of about 144 minutes per day.
DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.
Fig (2) Code for inserting selected file in database. Now when user enter FileID for uploaded file in textbox and click on retrieve button we will get the Content and extension field from database for that file id. You can use FillDataSet method to retrieve the byte array. Below code shows how to send retrieved file to user depending on the extension. string strExtenstion = extension of retrieved file; byte[] bytFile = Byte array retrieved from database; Response.Clear(); Response.Buffer = true; if (strExtenstion == .doc || strExtenstion == .docx) { Response.ContentType = application/vnd.ms-word; Response.AddHeader(content-disposition, attachment;filename=Tr.doc); } else if (strExtenstion == .xls || strExtenstion == .xlsx) { Response.ContentType = application/vnd.ms-excel; Response.AddHeader(content-disposition, attachment;filename=Tr.xls); } else if (strExtenstion == .pdf) { Response.ContentType = application/pdf; Response.AddHeader(content-disposition, attachment;filename=Tr.pdf); } Response.Charset = ; Response.Cache.SetCacheability(HttpCacheability.NoCache); // If you write, // Response.Write(bytFile1); // then you will get only 13 byte in bytFile. Response.BinaryWrite(bytFile); Response.End();