
Search Engine Personalization Tool Using Linear Vector Algorithm

Wadee Al Halabi, University of Miami, [email protected]; Miroslav Kubat, University of Miami, [email protected]; Moiez Tapia, University of Miami, [email protected]

Keywords: search engine, personalization, vector space, machine learning, parallel programming.

Abstract: Internet search engine ranking is a dynamic research field, full of challenging tasks. In this paper, we introduce the Linear Vector Algorithm (LVA) to provide an efficient way to rank the search engine's result. We conducted several experiments comparing the new algorithm with an existing one (the Vector Space Algorithm) and found promising results. The method uses a knowledge-based system whose information is acquired from user behavior: the system monitors the user's behavior and feeds positive and negative examples to the knowledge-based system. The system can be trained and produces good results compared with other algorithms. The search engine's result improved tremendously when this system was applied to the search engine.

1. Introduction
In this study, we attempt to determine whether a search engine's result can be enhanced by exploring web page contents. Presuppositions in this work lend weight to our hypothesis that a new algorithm can be developed to help inexperienced users utilize search engines more efficiently. It was our belief that if the text of the web pages can be obtained, then the ranking according to each user's interest can be improved. Search engines are the only available search tool on the web today; although they are widely used, their results are not efficient enough to satisfy the user. We suggest that a main reason for this inadequacy is the tremendous amount of information on the web, which is, at the same time, one of the web's advantages. Our tool seems to offer a more promising solution: exploring the text content of web pages with a personalized search engine using new tools such as a linear algorithm, concurrent programming, and agent technology, together with the ability to learn, results in much greater ranking efficiency.

Our study is centered on the hypothesis that we can design an algorithm that explores the search engine's result and downloads the text content of each URL appearing in it. The algorithm then stores the text of each URL in a distinct memory location, compares the content of each URL with a reference document provided by the user, and finally assigns a weight to each document and ranks the documents accordingly. If the suggested algorithm is intelligent enough to learn, we can apply its knowledge to the ranking task. This hypothesis implies that the new rank is much better than the original search result, and that the algorithm achieves the new rank using its knowledge. We designed an algorithm to embark on this quest, and to show that it improves the search engine's ranking we conducted several experiments. Finally, we compared our algorithm, the Linear Vector Algorithm, with the widely used Vector Space Algorithm. The findings were remarkable.

2. Assumptions
There has been a great deal of work on personalized search engines in earlier literature [1, 2, 6, 9]. Fan et al. published an interesting paper on search engine personalization [1]. The authors implemented their framework as a specific approach rather than a generic one, and showed how the method improved search efficiency. They believe that personalized information retrieval and delivery is an imminent problem facing search engines, and argue that the term-weighting strategy should be context specific; we did not adopt that strategy, for simplicity, because it would increase the new algorithm's complexity. A substantial amount of recent activity in machine learning has contributed to information retrieval and web search. Boyan et al. [2] introduced a heuristic method to optimize the search result and thereby improve overall system performance. In their work, they showed how a machine learning approach successfully improved retrieval efficiency. Their system, named Learning Architecture for Search Engine Retrieval, was designed to enhance search engine performance: a learning algorithm assigns different weights according to a word's location in the text. For instance, the algorithm assigns one weight to a word in a headline and a totally different weight to the same word in the body text. Their method is considered content-based in the way it assigns weights.

Cui et al. [12] suggested improving retrieval efficiency by tracking the user and exploring his or her logs. The authors reported that their algorithm dramatically improved the result's efficiency. They investigated the user's log files in the search engine and used them in subsequent queries; this method directs the search engine toward the information common to each user's documents, so the focal point for each user is distinct and in agreement with the log file. In this paper, we implement a similar log file to track a user's interest, but it requires the user to submit a document of interest. Haveliwala et al. [3] investigated the possibility of finding web pages relevant to a reference web page. Although the objective of that project is quite similar to this paper's, it was implemented with a totally different strategy: the authors used the reference page alone to represent the knowledge-based system, whereas in this work we demonstrate how that approach is inadequate compared with the positive-and-negative-examples method we provide. Poincot et al. [4] introduced a new approach to comparing documents and calculating their similarity using machine learning, showing how document similarities can be calculated using neural networks (Kohonen maps). Chakrabarti et al. [5] presented an algorithm for mining the web using hub-and-authority techniques to discover relevant web pages. Ahonen et al. [7] experimented with co-occurring text phrases and reported promising results. Liu et al. [9] proposed a method similar to the one provided in this paper: a personalized web search for improving retrieval effectiveness. They implemented a machine learning algorithm to capture the user's interests; every time the user connects to a URL, the system keeps track of that URL and categorizes it. This process improved overall system performance, as every URL is reflected in the subsequent query.

The authors of [9] thus presented another approach to the problem handled in this paper; however, they introduced the learning algorithm as mandatory and without the user's awareness. Further evidence, as in [1, 2, 4, 5], and [6], advocates the advent of machine learning in web mining and information retrieval. Müller et al. [11] provided another implementation of web information retrieval using a machine learning algorithm to track users' interests; the algorithm attracts relevant web pages and filters out irrelevant ones, and the system is implemented with a multi-agent approach. Lin et al. [15] showed how machine learning algorithms can be used to classify web documents: this method explores the documents before the searching process starts, then categorizes them according to the acquired knowledge. Perez et al. [10] discussed the advantages of parallelism in information retrieval systems, emphasizing cooperation and coordination among agents and showing how the three elements (agents, parallelism, and information retrieval) combine into a coherent implementation.

In summary, the personalized search engine has been addressed by many investigators as an approach to resolving search engines' difficulties, and machine learning has been investigated as a major contributor to search engine enhancement. Researchers in machine learning have provided strong results related to search engine applications. Text mining is a major technique for exploring documents, extracting features, and finding similarity. In this research, we deploy these principles along with parallel programming concepts, which showed considerable advantages over sequential programming. Lastly, agent technology is used to add flexibility to the system, as shown in previous literature.

3. System design and method
This part deals with the system design and requirements. It discusses the design concepts of the several algorithms and agents employed in this project.

Search Engine Personalization Tool (SEPT): The SEPT consists of a Graphical User Interface (GUI) agent, a Reader Algorithm, a Downloading Algorithm, a Ranking Algorithm, a Training Algorithm, and a Presentation Algorithm.

System definition: The SEPT designed herein implements parallel programming and agent concepts. Certain critical decisions (such as speed, efficiency, choice of technology, etc.) are based entirely on the predefined requirements and goals. The following sections present the detailed requirements and goals for the SEPT project.

Figure 1: SEPT system requirements (Search Engine Result -> Search Engine Personalization Tool -> The New Rank)

The SEPT system is shown in Figure 1. From this diagram we can define the basic requirements of the final design. The system enhances the search engine's performance using a new algorithm. It extracts and handles up to 100 URLs appearing in the search engine's result, and compares the retrieved documents against a reference URL or a reference local file. The system adopts parallel programming concepts and an agent-technology approach as its implementation method, and it implements a fast algorithm able to outperform an existing and widely used one.

System goal and algorithms: The goal of SEPT is to read the search's result and rank it according to a specific criterion. The new rank pops up the highly weighted web pages and pushes down the lowly weighted ones, arranging the web pages in descending order of weight. We designed an algorithm to assign weights to retrieved documents; it is the major algorithm in this paper, and it is basically a modification of the existing Vector Space Model [13, 17]. The Vector Space Model uses formula 1 to calculate document similarities. We implemented a modification to convert the nonlinear, relatively slow Vector Space Algorithm (VSA) into a fast linear algorithm entitled the Linearly Modified VSA:

  Cosine(S, Q) = [ SUM_{u in S and Q} W_{u,S} * W_{u,Q} ] / [ sqrt( SUM_{u in S} W_{u,S}^2 ) * sqrt( SUM_{u in Q} W_{u,Q}^2 ) ]    (1)

where W_{u,S} is the weight of term u in document S, and W_{u,Q} is the weight of term u in query Q. The new fast algorithm was designed to satisfy the requirements. Although its name is descriptive, it is too long, so we abridge it to Linear Vector Algorithm (LVA); this name adequately portrays the algorithm's behavior. From here on, LVA or Linear Vector refers to the Linearly Modified VSA.

Precision and recall: Precision and recall are the most widely used tools for evaluating search engine retrieval efficiency. However, they are inadequate for the experiments in this paper. With these tools we can measure the retrieved documents and evaluate their relevance to the query and to the search engine's database. Here, we are interested not only in the returned documents and their relevance to the query, but in their interest to the user compared to each other; in other words, we are interested in the rank of the result, which cannot be measured with precision and recall. The following example depicts the difference between the precision-and-recall method and the new tool used in this experiment. Suppose the returned web pages are

1. www.example1.com
2. www.example2.com
3. www.example3.com
4. www.example4.com
5. www.example5.com

and three of these web pages are relevant. Then we can measure the number of relevant web pages relative to the retrieved web pages
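For reference, formula 1 (the standard vector-space cosine) can be sketched in Python. The dictionary representation of term-weight vectors and the function name are illustrative choices, not taken from the paper:

```python
import math

def cosine(ws: dict, wq: dict) -> float:
    """Cosine similarity between term-weight vectors of document S
    and query Q (formula 1)."""
    # numerator: sum over terms that occur in both S and Q
    num = sum(w * wq[u] for u, w in ws.items() if u in wq)
    # denominator: product of the two vector norms
    den = math.sqrt(sum(w * w for w in ws.values())) * \
          math.sqrt(sum(w * w for w in wq.values()))
    return num / den if den else 0.0

# identical vectors are maximally similar
print(cosine({"cancer": 2.0, "fund": 1.0}, {"cancer": 2.0, "fund": 1.0}))  # → 1.0 (up to rounding)
```

Note that the numerator is the nonlinear part the authors replace with their linear scheme below.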

and the database of the search engine. This can be achieved using precision and recall. However, it does not satisfy our requirements, because we are interested in measuring the rank of the result. For example, suppose web page #1, "www.example1.com", is more important than web page #2, "www.example2.com". If search engine A retrieves the documents in the sequence #1 then #2, whereas search engine B retrieves them in the sequence #2 then #1, then search engine A is more efficient than search engine B. The new tool precisely measures this dependent variable. A numerical example is presented later in this paper.

The Linear Vector Algorithm (LVA): Figure 2 illustrates the behavior of the LVA. It depicts the amount of weight awarded to each word. The D/R ratio refers to the ratio of a word's frequency in the retrieved document (denoted by D) to its frequency in the reference document (denoted by R). To calculate the D/R ratio for a word, we find the word's frequency in the retrieved document and in the reference document, then apply formula 2. A practical calculation of the D/R ratio is shown in Example 1.

  D/R ratio = (D / R) * 100    (2)

Figure 2: LVA behavior (score vs. D/R ratio)

The LVA curve in Figure 2 is constructed according to formula 3:

  Weight = D/R ratio,                    for D/R ratio <= 100;
  Weight = 100 - (D/R ratio - 100) / 10, for 100 < D/R ratio < 1100;    (3)
  Weight = 0,                            otherwise.
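The piecewise-linear scoring of formulas 2 and 3 can be sketched directly; the function name is an illustrative choice:

```python
def lva_weight(dr_ratio: float) -> float:
    """Piecewise-linear LVA score (formula 3).
    dr_ratio is (D / R) * 100, per formula 2."""
    if 0 <= dr_ratio <= 100:
        return dr_ratio                        # slope 1: reward up to the reference frequency
    if 100 < dr_ratio < 1100:
        return 100 - (dr_ratio - 100) / 10     # slope -0.1: penalize excess repetition
    return 0.0                                 # beyond 10x the reference frequency

print(lva_weight(140))  # → 96.0, the "propagation" case of Example 1
```

A word matching its reference frequency exactly (ratio 100) scores the maximum 100 points; one repeated more than ten times as often as in the reference (ratio >= 1100) scores 0.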

Example 1 Assume that the retrieved and reference documents have the following words and frequencies:

Table 1: Example 1

  Word          Frequency in reference document   Frequency in retrieved document
  Automobile    25                                25
  Brazil        14                                13
  Propagation   10                                14
  Learning       0                                 7
  Miami         12                                 0
  Fund          23                                 5
  Dollar        14                                14

The weight assigned to a retrieved document is

  Document weight = SUM_{i=1..n} Weight(i) * C_i

where Weight(i) is the weight for word i, n is the number of distinct words in the document (n = 7 in this example), and

  C_i = D/R  for D < R,
  C_i = R/D  for R < D,
  C_i = 1    for D = R.

The weight algorithm thus assigns the document a weight equal to the total weight of all its words.

Calculating the weight for selected words: The weight for the word "automobile" is calculated according to formulas 2 and 3:

  D/R ratio = (frequency in the retrieved document (D) * 100) / (frequency in the reference document (R))

The frequency of "automobile" in the reference document is 25 words/document, and its peer in the retrieved document is 25 words/document. Therefore:

  D/R ratio = (25 * 100) / 25 = 100

According to the linear vector behavior curve in Figure 2 and formula 3, if the D/R ratio <= 100 then weight = ratio; thus Automobile(weight) = 100 points.

Similarly, the weight for the word "Brazil" is:

  D/R ratio = (13 * 100) / 14 = 92.8571

so Brazil(weight) = 92.8571 points.

Lastly, the weight for the word "propagation" is:

  D/R ratio = (14 * 100) / 10 = 140

Since 100 < D/R ratio < 1100,

  Weight = 100 - (140 - 100) / 10 = 100 - 4 = 96 points.

The document's weight is then

  Document weight = Automobile(weight)*C1 + Brazil(weight)*C2 + Propagation(weight)*C3 + Learning(weight)*C4 + Miami(weight)*C5 + Fund(weight)*C6 + Dollar(weight)*C7
                  = 100 * 1 + 92.8571 * (13/14) + 96 * (10/14) + ...
                  = 100 + 86.22 + 68.571 + ...

The concept of this algorithm is to reward exact similarity and to penalize the excessive repetition of similar words. The algorithm also penalizes words whose retrieved-document frequency (D) is less than the reference frequency (R). It computes the frequency of each word in the reference document and the retrieved document, then assigns 100 points to a word if its frequencies in both documents are identical (D/R = 1). The word "automobile" in Example 1 is awarded 100 points regardless of the frequency itself; as long as the frequencies in both documents are alike, it earns 100 points, the maximum. In Example 1 the algorithm assigns 100 points to the words "automobile" and "dollar": these two words have different frequencies from each other, but each matches its peer's frequency in the reference document.

In Example 1 the word "Brazil" has a frequency of 13 words/document in the retrieved document, which is less than its frequency in the reference document. In this case the algorithm applies the first part of the graph, where the D/R ratio lies in the range [0, 100]. The algorithm operates in the (100, 1100] range when the frequency in the retrieved document (D) is greater than the frequency in the reference document (R). The word "propagation" in Example 1 corresponds to this case: its frequency in the retrieved document is 14 words/document versus 10 words/document in the reference document, giving a D/R ratio of 140, which falls in (100, 1100]. According to Figure 2 the weight is 96 points; here we see how the algorithm penalizes the weight for the excessive repetition of "propagation" in the retrieved document. The algorithm follows a linear function, as illustrated in Figure 3, and applies the linear formula 3.

On the other hand, the VSA rewards similarity and penalizes excessive repetition nonlinearly, according to formula 1 [13, 14, 17]. The Vector Space curve and the Linear Vector curve are compared in Figure 3. We assigned a slope equal to 1 where the D/R ratio lies in [0, 100], so the new weight equals its D/R ratio. However, if the frequency of a word in the retrieved document is greater than its peer in the reference document, the algorithm penalizes this amount linearly. We designed the second segment to allow a maximum retrieved-document frequency D of no more than 10*R; D/R = 10 corresponds to a D/R ratio of 10 * 100 = 1000. The segment starts where the D/R ratio = 100 and intersects the horizontal axis at D/R ratio = 1100; consequently its slope is -0.1.
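Example 1's document weight can be checked end to end with a short sketch. The handling of a word absent from the reference document (R = 0, where formula 2 is undefined) is our assumption: such a word contributes nothing, consistent with C_i = R/D = 0.

```python
# Table 1: (reference frequency R, retrieved frequency D) for each word
freqs = {
    "automobile": (25, 25), "brazil": (14, 13), "propagation": (10, 14),
    "learning": (0, 7), "miami": (12, 0), "fund": (23, 5), "dollar": (14, 14),
}

def lva_weight(dr_ratio):
    """Formula 3, as above."""
    if dr_ratio <= 100:
        return dr_ratio
    if dr_ratio < 1100:
        return 100 - (dr_ratio - 100) / 10
    return 0.0

def doc_weight(freqs):
    total = 0.0
    for word, (r, d) in freqs.items():
        if r == 0:
            continue                      # assumption: unseen reference word contributes 0
        ratio = d * 100 / r               # formula 2
        # C_i = D/R for D < R, R/D for R < D, 1 for D = R; 0 if the word is missing
        c = min(d, r) / max(d, r) if d and r else 0.0
        total += lva_weight(ratio) * c
    return total

# contributions: 100 + 86.22 + 68.57 + 0 + 0 + 4.73 + 100
print(round(doc_weight(freqs), 2))  # → 359.52
```

The first three terms reproduce the 100 + 86.22 + 68.571 + ... series computed above.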
Figure 3: LVA vs. Vector Space behavior (score vs. D/R ratio)

Reader Algorithm: This algorithm reads the search's result in the result.html file, scanning the file for a pattern that distinguishes and extracts URLs.

Training Algorithm: The SEPT was implemented with a machine learning algorithm to boost its performance; we designed this algorithm to train the system. The training algorithm reads the remove files (*.rem), each of which contains a web page with a high degree of dissimilarity to the user's interest. It also reads the add files (*.add), which contain web pages with a high degree of similarity.
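The Reader Algorithm's pattern scan might look like the following. The href-based regular expression and the 100-URL cap are assumptions (the paper does not give its pattern, only that SEPT handles up to 100 URLs):

```python
import re

def extract_urls(html: str, limit: int = 100):
    """Scan a search-result page (e.g. result.html) for URLs,
    as the Reader Algorithm does. The pattern is illustrative."""
    urls = re.findall(r'href="(https?://[^"]+)"', html)
    return urls[:limit]   # SEPT handles up to 100 URLs

page = '<a href="http://example.com/a">A</a> <a href="http://example.org/b">B</a>'
print(extract_urls(page))  # → ['http://example.com/a', 'http://example.org/b']
```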
Downloading Algorithm: The SEPT reads the search's result, extracts the URLs, and connects to them. While connected, the system downloads only the text data found at each URL, since that is the only data we need.

Ranking Algorithm: This is a new algorithm we present in this paper, designed to handle the ranking task. The downloading algorithm stores the text of each web page as a string in a distinct temporary storage. The ranking algorithm reads this string one character at a time, constructs the words, and stores them in a sorted tree, a data structure with fast retrieval time. The ranking algorithm uses these trees to assign weights to web pages.

Ranking Efficiency Evaluation Algorithm: The purpose of this algorithm is to evaluate the efficiency of the new rank. We could not find a suitable algorithm in the literature for evaluating ranking efficiency; the only available methods are precision and recall, which, as mentioned previously, do not satisfy the requirements and do not measure the dependent variables. Therefore, we designed a new tool to accomplish this task. The algorithm starts at 100% at the top of the rank and deducts points each time a miss is encountered. We consider a relevant web page a hit and any irrelevant one a miss. Example 4 provides a detailed description. Points are awarded using formula 5:

  Awarded Points = 100, at the top of the list or while no miss has occurred;
  Awarded Points = min(previous points, 100) - min(100, number of misses + 1), otherwise.    (5)

Example 4: Suppose we have a list of 11 web pages, five of which have a high degree of similarity to our reference page.

a) Evaluate the ranking efficiency if the five high-similarity items are at the top of the rank. In this case we have five items in sequence without a miss, each obtaining 100 points. Thus

  efficiency = (5 * 100) / 5 = 100%

Figure 4: Downloading time per thread, (Text & Images) vs. (Text only)
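The one-thread-per-URL downloading scheme compared in Figure 4 might be sketched as follows. The `fetch_text` stub stands in for the real text-only download, which the paper does not detail:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_text(url: str) -> str:
    """Placeholder for the per-URL download; a real version would
    fetch the page and keep only its text, as SEPT does."""
    return f"text of {url}"

def download_all(urls):
    # one worker thread per URL, mirroring the one-thread-per-page design
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(fetch_text, urls))

texts = download_all(["http://example.com/1", "http://example.com/2"])
print(texts)
```

`pool.map` preserves the input order, so the downloaded texts stay aligned with the search result's URL list.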

b) If the five relevant items are in the sequence (1, 3, 4, 5, 6) in the rank, find the ranking efficiency. In this case only one web page is at the top of the list with no miss; it obtains 100 points. It is followed by one miss and then four web pages in the sequence (3, 4, 5, 6). The points awarded to these four web pages are calculated using formula 5:

  Points = min(previous points, 100) - min(100, number of misses + 1) = 100 - (1 + 1) = 98 points each.

Therefore the efficiency is:

  efficiency = (1 * 100 + 4 * 98) / 5 = 492 / 5 = 98.4%

c) If the five relevant items are in the sequence (2, 3, 5, 6, 9) in the rank, find the ranking efficiency. In this case the first place is a miss, followed by two items in the sequence (2, 3); the points awarded to these items are:

  Points = 100 - (1 + 1) = 98 points each.

Then we have two items in the sequence (5, 6) after one more miss:

  Points = 98 - (1 + 1) = 96 points each.

Finally, we have one item (9) after two misses:

  Points = 96 - (2 + 1) = 93 points.

Therefore the efficiency is:

  efficiency = (2 * 98 + 2 * 96 + 93) / 5 = 481 / 5 = 96.2%

Design methods: We designed the architecture of this system while keeping the requirements and goals in mind. The system was implemented with an agent-technology approach and parallelism concepts, in keeping with the following principles:
1. Each agent has a unique function.
2. Any agent may execute, or recruit other agents to execute, a given task.
3. Agents may exchange data using files or any suitable data structure.
4. Agents may have access to all necessary data.
Parallel programming was used whenever possible in the experiments in this paper. It was applied to download the retrieved documents, as shown in Figure 4, where a distinct thread was used to download each web page. In the ranking algorithm, another thread was assigned to each web page to reward and penalize it according to its contents.

4. Implementation
The implementation of the design depends on the technology chosen. Although the basic high-level design remains the same, the low-level design of agents, data manipulation, temporary storage, data structures, parallel programming, etc. depends on the choice of implementation method. The system is implemented according to the following algorithm:
1. Choose an appropriate programming language able to fulfill the system's requirements.
2. Use an agent-technology approach.
3. Apply parallel programming concepts.
4. Each agent should complete its assigned task and notify the following agent(s).
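The scoring scheme of formula 5 and Example 4 can be sketched as follows, under the assumption that misses are counted per gap between consecutive relevant pages:

```python
def ranking_efficiency(positions, total_relevant):
    """Formula 5: start at 100 points and deduct for each run of misses.
    `positions` are the 1-based ranks of the relevant pages, ascending."""
    points, prev, scores = 100, 0, []
    for p in positions:
        misses = p - prev - 1                 # misses since the last relevant page
        if misses:
            points = min(points, 100) - min(100, misses + 1)
        scores.append(points)
        prev = p
    return sum(scores) / total_relevant       # percent efficiency

print(ranking_efficiency([1, 3, 4, 5, 6], 5))   # Example 4b → 98.4
print(ranking_efficiency([2, 3, 5, 6, 9], 5))   # Example 4c → 96.2
```

The three cases of Example 4 (100%, 98.4%, 96.2%) follow directly.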

5. Operation and results
We conducted many experiments to achieve the results presented in this section. The experiments focused on the following:
1. Study the Linear Vector behavior under the influence of a training process.
2. Compare the LVA against the VSA.
3. Study the Linear Vector behavior with the minimum amount of knowledge, and compare it against the same algorithm with the maximum amount of knowledge and against the VSA with the maximum amount of knowledge.
4. Study the ranking efficiency evaluation experiments.
Before launching the experiments, we prepared three test beds following the same procedure (choose a keyword, create the user profile, run the system). We used jaguar, research fund, and beetle as the keywords for the three test beds. The keyword jaguar returned results categorized as cat, car, aircraft, etc. The keyword research fund returned results categorized as Medical, Engineering, Biology, Geology, General Funds, and Chemistry. The keyword beetle returned results categorized as insect, car, etc. The following results are based on the keyword research fund, with the user interested in cancer research in the medical field; the other two keywords produced similar results when we used beetle with a user interested in cars, and jaguar with a user interested in cats. We found the words "medical" and "cancer" to be the only two words repeated in the reference document and in all of the positive examples. Therefore, any web page containing both words is considered relevant, and any page lacking either word is considered irrelevant.

Study of the Linear Vector behavior under the influence of a training process: We conducted this experiment to observe the LVA behavior when it is exposed to a training algorithm.
Procedure:
1. We prepared the sample for the training.
2. We ran the training algorithm over different sets of examples (1 pair of add and remove files, 2 pairs, 3 pairs, up to 10 pairs of files).

Result: After running the experiment, we applied the Ranking Efficiency Evaluation Algorithm to calculate the ranking efficiency and find the error rate. Figure 5 illustrates the Linear Vector behavior. This experiment shows that the LVA can be trained to achieve a better error rate: with no training, or with few training examples, the error rate was relatively high compared with the error rate when the system was exposed to a larger number of examples.

Comparing the LVA against the VSA: We conducted this experiment to evaluate the performance of the LVA and the VSA, to determine whether the linear algorithm could outperform the VSA and, if so, to measure the difference. The following steps were taken to obtain the result.
Figure 5: Linear Vector Responses to the Training Process Procedures: 1- We assume that the user submits the keyword Researchs Fund to the search engine. 2- The user is interested in the Cancer Research topic. 3- We carried out the training process as follows: In order to train the system, we collected random positive and negative examples. We stored the positive examples in *.add files, and the negative examples in *.rem files. Each of them consisted of a single web page. The web pages with high degree of similarity were stored in the *.add files. However, the web pages with high degree of dissimilarity were stored in *.rem files. We dis carded all web pages, which had low degree of similarity or low degree of dissimilarity. We created 10 *.add files, and 10 *.rem files. These files have tremendous amount of examples. Some files contain more than 7000 distinct terms . Once the samples were ready, we could start the training process. We launched the training agent to train the system. The training process takes about 10 to 30 seconds. However, the sample collection takes considerable effort and time. All the positive exa mples are regarding cancer research. However, all our negative examples are strictly related to non-medical interest. Once the system is fully trained, it behaves as if it is used by a cancer researcher for a

while. Thus, the system responds effectively and rearranges the searchs result. The algorithm awards similar pages high weights, and penalizes dissimilar web pages. 4- We have trained the system and prepared it to assign weights according to VSA and LVA. In two distinct experiments, we collected the data illustrated in figure 6, which was calculated according to the Ranking Efficiency Evaluation Algorithm introduced in this paper. Error rate: The percentage Error rate is calculated as follows: Percentage Error Rate = 100 - Ranking Efficiency Evaluation Result: Figure 6 shows the compared result between the LVA and the VSA. This figure clearly shows how the LVA outperforms the VSA in efficiency. The figure shows that both algorithms have the same error rate when there is no reference page, because both algorithms are inactive. If there is not any word in the ReferenceTree(), both algorithms abort the operation. Thus the error rate at no reference page equals to the error rate of the old rank, as there will be no new rank by either algorithm. When we introduced a reference page, but no training was applied, the vector space and linear vector was found to have the same error rate of 31.36%. We considered that the lack of sufficient examples in the reference page would cause this relatively high error rate. However, once the training algorithm is introduced to the system, we found that the Linear Vector method outperforms the VSA. The latter has a very high error rate. When we introduced more examples to the training algorithm, both methods start to converge, as it is illustrated in Figure 6. The uneven behavior of both curves is expected in a practical machine learning algorithms [16]. As more examples were introduced to the training algorithm, the curves converged farther. We expect that the curve will keep converging if more examples are added. 
This experiment supports our hypothesis that the LVA could substitute for a widely used algorithm and improve search results for inexperienced users. However, knowledge and training are dominant factors in using the Linear Vector method.
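The error-rate relation above is a one-line computation; a minimal sketch, assuming the Ranking Efficiency Evaluation Algorithm returns a percentage in [0, 100]:

```python
def percentage_error_rate(ranking_efficiency):
    """Percentage Error Rate = 100 - Ranking Efficiency Evaluation result.

    `ranking_efficiency` is assumed to be a percentage in [0, 100].
    """
    if not 0 <= ranking_efficiency <= 100:
        raise ValueError("ranking efficiency must be a percentage")
    return 100.0 - ranking_efficiency

# For instance, a measured efficiency of 68.64% corresponds to the
# 31.36% error rate reported for the untrained case above.
```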

Figure 6: LVA vs. VSA (error rate % vs. number of pairs of websites, for the Linear Vector and Vector Space methods)

Minimum amount of knowledge experiment: Downloading 100 web pages can be a lengthy task, so we introduced the minimum amount of knowledge concept for inspection: if this amount of knowledge is sufficient to compose a promising result, we can save considerable time and effort. The minimum amount of knowledge concept refers to the result.html file only. This implies that a deep analysis of the result.html file either produces sufficient information to rank web pages, or that this information is insufficient to construct an efficient ranking.

Procedure:
1- We extracted the URLs from the result.html file.
2- We extracted the few lines provided by the search engine as a description of each link.
3- We analyzed the words located in these web-page descriptions, extracted them, and added them to the RetrievedTree().
4- We ran the LVA, assigned a weight to each web page, and presented the new rank to the user.
5- We ran the training algorithm to train the system over different numbers of examples.
6- We ran the Ranking Efficiency Evaluation Algorithm to find the ranking efficiency.

Figure 7 shows the behavior of the LVA when it is exposed to the minimum amount of knowledge. The algorithm is inactive in the absence of a reference page; therefore, the error rate is relatively high. When a reference page was provided, still without training, the error rate dropped from 47.14% to 23.67%. We then introduced more examples to the system, which improved the error rate slightly. Thus, the minimum knowledge method slightly improves the ranking efficiency.
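Steps 1-3 above amount to scraping links and snippet words from a saved results page. A hedged sketch using Python's standard-library HTML parser; real search-result markup varies, and this assumes links are plain `<a href="...">` tags with snippet text around them:

```python
# Sketch of extracting URLs and snippet words from a saved result.html.
# The markup assumptions below are illustrative, not the paper's format.
from html.parser import HTMLParser

class ResultPageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []           # step 1: result URLs
        self.snippet_words = []  # step 3: words destined for RetrievedTree()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self.urls.append(href)

    def handle_data(self, data):
        # step 2/3: collect the snippet text the engine shows per link
        self.snippet_words.extend(data.lower().split())

parser = ResultPageParser()
parser.feed('<a href="http://example.org/cancer">Cancer Research Fund</a>'
            ' grants for researchers')
```

The collected `snippet_words` would then be inserted into the RetrievedTree() and weighted by the LVA exactly as in the full-knowledge case, only without downloading the 100 pages themselves.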

Figure 7: Linear Vector with the minimum amount of knowledge (error rate % vs. number of pairs of websites)

Figure 8 shows the results of the previous experiments combined. In this graph we can observe the efficiency of each algorithm. We can confidently say that the LVA provides a strong result and a better error rate, as low as 10.64%, when it is exposed to a suitable amount of knowledge. When only the minimum amount of knowledge is provided to the algorithm, the error rate is somewhat higher than with the latter method. On the other hand, the LVA was able to compete with and outperform the Vector Space method even when the LVA was provided with the minimum amount of knowledge while the maximum amount of knowledge was available to the Vector Space method. Therefore, knowledge does not help the VSA, but it helps the LVA spectacularly.

Figure 8: LVA with minimum and maximum knowledge vs. VSA (error rate % vs. number of pairs of websites, for the Linear Vector, Linear Vector with minimum knowledge, and Vector Space methods)

6. Conclusions and future work

This study shows how the modified VSA (the LVA) provides the user with a much better result according to the user's interest. The study compares the result with an existing algorithm and shows the performance of both. Regarding the training algorithm, the study concludes that the modified VSA responds to training faster than the VSA. Another advantage of the modified VSA over the existing one appears when we measure the minimum amount of knowledge: the modified VSA requires less data for training, whereas the VSA requires much more data to produce the same result.

The study shows a great improvement over the VSA in a text-based retrieval system. Further research could show promising results for other algorithms in content-based retrieval systems, such as image, audio, and video retrieval. This study encourages scientists to investigate other media retrieval systems.

REFERENCES

[1] W. Fan, M.D. Gordon, P. Pathak,

Personalization of search engine services for effective retrieval and knowledge management, in Proceedings of the 2000 International Conference on Information Systems (ICIS), Brisbane, Australia, 2000.
[2] J. A. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines, in Proceedings of the AAAI Workshop on Internet-Based Information Systems, AAAI Technical Report WS-96-06, 1996.
[3] Taher Haveliwala, Aristides Gionis, Dan Klein, and Piotr Indyk. Evaluating strategies for similarity search on the web, in Proceedings of the Eleventh International World Wide Web Conference, May 2002.
[4] Philippe Poinot, Soizick Lesteven, and Fionn Murtagh. Comparison of two document similarity search engines, Library and Information Services in Astronomy III, ASP Conference Series, Vol. 153, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the world wide web, IEEE Computer, 32(8): 60-67, 1999.
[6] D. Mladenic. Text-learning and related intelligent agents, IEEE Intelligent Systems, 14(4): 44-54, 1999.
[7] H. Ahonen, O. Heinonen, M. Klemettinen, and A. Verkamo. Finding co-occurring text phrases by combining sequence and frequent set discovery, in R. Feldman, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99) Workshop on Text Mining: Foundations, Techniques and Applications, pages 1-9, 1999.
[8] Taher Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, pages 784-796, Jul-Aug 2003.
[9] Fang Liu, C. Yu, and Weiyi Meng. Personalized web search for improving retrieval effectiveness, IEEE Transactions on Knowledge and Data Engineering, Volume 16, Issue 1, pages 28-40, Jan. 2004.
[10] M. S. Perez, R. Garcia, and J. Carretero. MAPFSMAS: a model of interaction among information retrieval agents, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002), pages 248-249, May 2002.
[11] Martin E. Muller. An intelligent multi-agent architecture for information retrieval from the Internet.
[12] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. Query expansion by mining user logs, IEEE Transactions on Knowledge and Data Engineering, pages 829-839, July-Aug. 2003.
[13] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, Addison-Wesley, 1st edition, May 1999.
[14] Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial & Applied Mathematics, July 1999.
[15] Shian-Hua Lin, Meng Chang Chen, Jan-Ming Ho, and Yueh-Ming Huang. ACIRD: intelligent Internet document organization and retrieval, IEEE Transactions on Knowledge and Data Engineering, Volume 14, Issue 3, pages 599-614, May/Jun 2002.
[16] Ryszard S. Michalski, Ivan Bratko, and Miroslav Kubat. Machine Learning and Data Mining: Methods and Applications, John Wiley & Sons, April 1998.
[17] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing, Communications of the ACM, Volume 18, Issue 11, pages 613-620, November 1975.

Wadee Al halabi is a PhD candidate at the University of Miami, Head of the Computer Technology Department at the College of Technology, Saudi Arabia, and a Lecturer in the Department of Computer Science at the University of Umm Alqura.

Miroslav Kubat is an Associate Professor in the Department of Electrical and Computer Engineering at the University of Miami. His research interests are in machine learning, data mining, and artificial neural networks.

Moiez Tapia is a Professor in the Department of Electrical and Computer Engineering at the University of Miami. His research interests are in multivalued logic and calculus, real-time systems, machine learning, and fault-tolerant computing.
