Web Page Similarity Draft Final
1. INTRODUCTION
1.1 INTRODUCTION TO DATA MINING
Data mining is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a data set
and transform it into an understandable structure for further use. Apart from the raw analysis,
it involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
1.1.1 Data Mining Task
The data mining task is the automatic or semi-automatic analysis of large quantities of data to
extract previously unknown interesting patterns such as groups of data records, unusual
records and dependencies. This usually involves using database techniques such as spatial
indices. These patterns can then be seen as a kind of summary of the input data, and may be
used in further analysis or, for example, in machine learning and predictive analytics. Data mining can identify multiple groups in the data, which can then be used to obtain more accurate prediction results from a decision support system.
1.1.2 Intrinsic: Text and Metadata Analysis
Metadata extraction is the process of describing the extrinsic and intrinsic qualities of a resource such as a document, image, or video. As a result, textual descriptions of web pages are produced, which enable the efficient search, sort, and mining functionality provided by websites. Another source of data for similarity analysis is document metadata, in which information about the web page is stored by the data repository. Metadata that might be useful in a similarity query includes the author or the creation date of the document.
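For illustration, the title and meta tags of a page can be collected with Python's standard html.parser module; the sample page and field names below are invented for the example, and a production extractor would handle many more cases:

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}            # e.g. {"keywords": "...", "author": "..."}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = ('<html><head><title>Example</title>'
        '<meta name="keywords" content="mining,similarity">'
        '<meta name="author" content="Unknown"></head></html>')
ex = MetadataExtractor()
ex.feed(page)
print(ex.title)               # Example
print(ex.meta["keywords"])    # mining,similarity
```

The extracted title and keyword fields are exactly the values later compared between two pages in a similarity query.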
[Figure: Web Mining taxonomy - Content Mining (Agent Based, Database), Structure Mining, and Usage Mining (Customized, Psychographic)]
Based on the content and structure, similarity is calculated with the cosine technique using the words matched between two web pages. The structure of a web page is also used to extract the metadata and keyword details; comparing these identifies the structural information of the web page, which can then be used in the comparison and analysis.
[Figure: Similarity workflow - the Title, Keywords and Meta-data are extracted from URL 1 and URL 2 and passed to the Cosine Similarity Measurement; a score >50% marks the pages Similar, a score <=50% marks them Dissimilar]
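A minimal sketch of this comparison, assuming whitespace tokenization and term-frequency vectors, with the 50% threshold described above:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))   # matched words only
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def verdict(score, threshold=0.5):
    return "Similar" if score > threshold else "Dissimilar"

score = cosine_similarity("web page similarity mining",
                          "web page structure mining")
print(round(score, 2), verdict(score))    # 0.75 Similar
```

In practice the inputs would be the extracted title, keyword, and meta-data strings of the two URLs rather than raw sentences.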
The drawback of the existing system, namely that sub-links are not processed, has been addressed in our technique, which helps to identify exact matches of web pages based on both content and structure. Suppose the content of a web page is completely different on two different web sites. From this alone, the websites cannot be concluded to be dissimilar. The structure of the web page might provide additional evidence: for example, the hyperlinks on both pages might lead to the same locations, since a hyperlink is a structural component that connects a web page to a different location. Likewise, the document structure described in HTML or XML might contain descriptive or meta-data information that is common to both websites, and this is not compared in the existing system.
The features of the proposed system are as follows:
CHAPTER 2
2. LITERATURE REVIEW
Existing large-scale scanned book collections have many shortcomings for data-driven research, from variable quality to the lack of accurate descriptive and structural meta-data. We argue that complementary research in inferring relational metadata is important in its own right to support the use of these collections, and that it can help to mitigate other problems with scanned book collections; this is the problem of mining relational structure from millions of books. In spite of these issues, Ismet Zeki Yalniz, Ethem F. Can and R. Manmatha suggested that partial duplicate detection can be performed for large book collections. Hence, this idea is carried forward to implement the same in finding the duplicates or similarities in Web
Pages. A framework is presented for discovering partial duplicates in large collections of
scanned books. Each book in the collection is represented by its sequence of words, restricted to those words that appear only once in the book. These words are referred to as unique words, and they constitute a small percentage of all the words in a typical book.
Along with the order information the set of unique words provides a compact representation
which is highly descriptive of the content and the flow of ideas in the book. By aligning the
sequence of unique words from two books using the longest common subsequence (LCS) one
can discover whether two books are duplicates. The same idea is incorporated in finding the
similarities between web pages with the help of cosine similarity measurement which can
maintain better quality in terms of matching keywords or meta-words or description than the
existing methods.
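The unique-word representation and LCS alignment can be sketched as follows; the quadratic dynamic program is for clarity only, and book-scale collections would need the more efficient alignment the authors describe:

```python
from collections import Counter

def unique_word_sequence(text):
    """Words that occur exactly once in the text, in order of appearance."""
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

def lcs_length(a, b):
    """Classic O(len(a) * len(b)) longest-common-subsequence DP."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def duplicate_score(doc_a, doc_b):
    """Fraction of the longer unique-word sequence aligned by the LCS."""
    u, v = unique_word_sequence(doc_a), unique_word_sequence(doc_b)
    longest = max(len(u), len(v))
    return lcs_length(u, v) / longest if longest else 0.0
```

A score near 1 indicates that the two documents share most of their unique words in the same order, which is the signature of a (partial) duplicate.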
Weifeng Z [50] observed that phishing is becoming an increasingly severe security threat in the web domain. Effective and efficient similarity detection is very important for protecting web users from the loss of sensitive private information and even personal property.
One of the keys of phishing detection is to efficiently search the legitimate web page library
and to find the pages that are the most similar to a suspicious phishing page. Most existing
phishing detection methods are focused on text and image features and have paid very limited
attention to spatial layout characteristics of web pages. In this technique, the author proposes
a novel phishing detection method that makes use of the informative spatial layout
characteristics of web pages. In particular, the author develops two different options to extract
the spatial layout features as rectangle blocks from a given web page. Furthermore, the author
builds an R-tree to index all the spatial layout features of a legitimate page library. As a
result, phishing detection based on the spatial layout feature similarity is facilitated by
relevant spatial queries via the R-tree. A series of simulation experiments are conducted to
evaluate our proposals. The results demonstrate that the proposed novel phishing detection
method is effective and efficient.
Rajhans M et al. [35] proposed the adoption of similarity upper-approximation based clustering of web logs using various similarity distance metrics. The technique shows the viability of the methodology. Web logs capture information about web sites as well as the sequence of visits. The sequence of visits provides important insight into the behavior of the user. Rough set theory, a soft computing technique, deals with the vagueness present in data and captures indiscernibility at different levels of granularity. The technique has shown results on a data set with different similarity measures, along with an explanation of the results.
Christoph K et al. [6] explored three SPARQL-based techniques to solve semantic web tasks that often require similarity measures, such as semantic data integration, ontology mapping, and semantic web service matchmaking. The aim is to see how far it is possible to integrate customized similarity functions into SPARQL to achieve good results for these tasks. The first approach exploits virtual triples, calling property functions to establish virtual relations among the resources under comparison; the second approach uses extension functions to filter out resources that do not meet the requested similarity criteria; finally, the third technique applies new solution modifiers to post-process a SPARQL solution sequence. The semantics of the three approaches are formally elaborated and discussed.
Shalini P et al. [42] noted that in the current era of web page similarity technology, with its advancements and techniques, efficient and effective text document classification is becoming a challenging and highly demanded area, to capably categorize text documents into mutually exclusive categories. Fuzzy similarity provides a way to measure the similarity of features among various documents. In this research, a technical review of various fuzzy similarity based models is given. These models are discussed and compared to frame their use and necessity. A tour of different methodologies based on fuzzy similarity related concerns is provided, showing how text and web documents are efficiently categorized into different categories. Various experimental results of these models are also discussed.
Rajendra L et al. [34] proposed that predicting the news a user is likely to read provides a distinct advantage to news sites, and collaborative filtering is a widely used technique for this purpose. The author details an approach within collaborative filtering that uses the cosine similarity function to achieve this. The author also furnishes details of two different approaches, customized targeting and article-level targeting, that can be used in marketing campaigns. All through history, people have relied on some kind of
observation/advice/feedback in making decisions of any kind. With information in the web
increasing manifold and various options to choose from, customers at times find it difficult to
search and read articles that are of most interest to them. News sites have stepped in to fill
this gap by analyzing customer behavior and recommending articles that customers have high
likelihood to read.
Glen J [13] noted that measuring the similarity of objects arises in many applications, and many domain-specific measures have been developed, e.g., matching text across documents or computing overlap among item-sets. The author proposes a complementary approach, applicable in any domain with object-to-object relationships, that measures the similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, the author computes a measure that says two objects are similar if they are related to similar objects. This general similarity measure, called SimRank, is based on a simple and intuitive graph-theoretic model. For a given domain, SimRank can be combined with other domain-specific similarity measures. The author suggests techniques for efficient computation of SimRank scores and provides experimental results on two application domains, showing the computational feasibility and effectiveness of the approach.
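The recursion behind SimRank ("two objects are similar if they are related to similar objects") can be sketched naively as below; this direct all-pairs iteration is for illustration only and ignores the efficiency techniques the paper develops:

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive SimRank: s(a,a) = 1 and, for a != b,
    s(a,b) = C / (|I(a)| * |I(b)|) * sum of s(x,y) over in-neighbor pairs."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new[(a, b)] = C * total / (len(ia) * len(ib))
        sim = new
    return sim

# Pages u and v are each linked to only from page p, so they score C = 0.8.
scores = simrank({"p": [], "u": ["p"], "v": ["p"]})
print(round(scores[("u", "v")], 2))    # 0.8
```

The decay constant C and the fixed number of iterations are tunable; the published algorithm iterates until the scores converge.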
Mehran S [22] proposed that determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. The author addresses this problem by introducing a novel method for measuring the similarity between short text snippets that leverages web search results to provide greater context for the short texts. The author defines such a similarity kernel function, mathematically analyzes some of its properties, provides examples of its efficacy, and shows the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Seokkyung C et al. [41] proposed, given that web page similarity computations are essential in ontology learning and data mining, WebSim (Web-based term Similarity metric), whose feature extraction and similarity model are based on a conventional
web search engine. There are two main aspects that the author can benefit from utilizing a
web search engine. First, the author can obtain the freshest content for each term that
represents the up-to-date knowledge on the term. This is particularly useful for dynamic
ontology management in that ontologies must evolve with time as new concepts or terms
appear. Second, in comparison with approaches that use a certain amount of crawled web documents as a corpus, our method is less sensitive to the problem of data sparseness
because the author accesses as much content as possible using a search engine. At the core of WebSim, the author presents two different methodologies for similarity computation, a
mutual information based metric and a feature-based metric. Moreover, the author shows how
WebSim can be utilized for modifying existing ontologies. Finally, the author demonstrates
the characteristics of WebSim by coupling it with WordNet. Experimental results show that
WebSim can uncover topical relations between terms that are not shown in conventional
concept-based ontologies.
David B [8] proposed a brief survey of web structural similarity algorithms, including the
optimal Tree Edit Distance algorithm and various approximation algorithms. The
approximation algorithms include the simple weighted tag similarity algorithm, Fourier
transforms of the structure, and a new application of the shingle technique to structural
similarity. The author shows three surprising results. First, the Fourier transform technique proves to be the least accurate of the approximation algorithms, while also being the slowest.
Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering
pages from different sites. Third, the simplest approximation to structure may be the most
effective and efficient mechanism for many applications.
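The "simple weighted tag similarity" idea can be approximated by comparing tag-frequency vectors; the cosine weighting below is an assumption for illustration, not necessarily the exact weighting of the surveyed algorithm:

```python
import math
from collections import Counter
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Counts occurrences of each start tag, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()
    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def tag_vector(html):
    c = TagCollector()
    c.feed(html)
    return c.tags

def tag_similarity(html_a, html_b):
    """Cosine similarity between the tag-frequency vectors of two pages."""
    a, b = tag_vector(html_a), tag_vector(html_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Identical structures score 1.0 and pages built from disjoint tag sets score 0.0; unlike tree edit distance, this representation discards tag nesting and ordering, which is exactly the trade-off behind its speed.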
Ding-Yun C [9] proposed a web similarity-based 3D model retrieval system. This approach measures the similarity among 3D models by visual similarity; the main idea is that if two 3D models are similar, they also look similar from all viewing angles. Therefore, one hundred orthogonal projections of an object, excluding symmetry, are encoded both by Zernike moments and Fourier descriptors as features for later retrieval. The visual similarity-based approach is robust against similarity transformation, noise, model degeneracy, etc., and
provides 42%, 94% and 25% better performance (precision-recall evaluation diagram) than
three other competing approaches: (1) the spherical harmonics approach developed by Funk
houser et al., (2) the MPEG-7 Shape 3D descriptors, and (3) the MPEG-7 Multiple View
Descriptor. The proposed system is on the Web for practical trial use (http://3d.csie.ntu.edu.tw), and the database contains more than 10,000 publicly available 3D
models collected from WWW pages. Furthermore, a user friendly interface is provided to
retrieve 3D models by drawing 2D shapes. The retrieval is fast enough on a server with
Pentium IV 2.4GHz CPU, and it takes about 2 seconds and 0.1 seconds for querying directly
by a 3D model and by hand drawn 2D shapes respectively.
Peixiang Z [27] addressed the issue of similarity computation between entities of an information network, which has drawn extensive research interest. However, to effectively
and comprehensively measure how similar two entities are within an information network is
nontrivial, and the problem becomes even more challenging when the information network to
be examined is massive and diverse. In this research, the author proposes a new similarity
measure, P-Rank, toward effectively computing the structural similarities of entities in real
information networks. P-Rank enriches the well-known similarity measure, SimRank, by
jointly encoding both in-links and out-links relationships into structural similarity
computation. P-Rank is proven to be a united structural similarity framework, under which all
state-of-the-art similarity measures, including CoCitation, Coupling, Amsler and SimRank,
are just its special cases. Based on the recursive nature of P-Rank, the author proposes a fixed-point algorithm to reinforce the structural similarity of vertex pairs beyond the localized
neighborhood scope toward the entire information network. Our experimental studies
demonstrate the power of P-Rank as an effective similarity measure in different information
networks. Meanwhile, under the same time or space complexity, P-Rank outperforms
SimRank as a comprehensive and more meaningful structural similarity measure, especially
in large real information networks.
Allan M. Sn [1] proposed that rather than using traditional text analysis to discover web pages
similar to a given page, the author investigates applying link analysis. Web pages exist in a link-rich environment, and links have the potential to relate pages by any property imaginable, since they are not restricted to intrinsic properties of the page text or metadata. In particular, while web page similarity link analysis has been explored, prior work has deliberately ignored the explicitly hierarchical host and pathname structure within URLs. To exploit this property, the author generalizes Kleinberg's well-known hubs and authorities (HITS) algorithm; adapts this algorithm to accommodate hierarchical link structure; tests some sample web queries; and argues that the results are potentially superior and that the algorithm itself is better motivated.
Rudi L et al. [37] proposed the Normalized Web Distance (NWD) method to determine the similarity between words and phrases. It is a general way to tap the amorphous low-grade
knowledge available for free on the internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is effectively the largest
semantic electronic database in the world. Moreover, this database is available for all by
using any search engine that can return aggregate page-count estimates for a large range of
search-queries.
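The NWD of two search terms x and y is computed from the page counts f(x), f(y), f(x, y) and the number of indexed pages N as (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}). A direct sketch, with invented page counts standing in for real search-engine estimates:

```python
import math

def nwd(fx, fy, fxy, N):
    """Normalized Web Distance from aggregate page counts.

    fx, fy -- page counts for the terms x and y alone
    fxy    -- page count for pages containing both x and y
    N      -- total number of pages indexed (a rough estimate in practice)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(N) - min(lx, ly))

# Invented counts: terms that co-occur on half of their pages are "close".
print(round(nwd(1_000_000, 800_000, 500_000, 10**10), 3))    # 0.073
```

Smaller values indicate that the two terms tend to appear on the same pages; unrelated terms, whose joint count is far below either individual count, score much higher.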
Pushpa C [31] noted that semantic similarity measures play an important role in information retrieval, natural language processing, and various web tasks such as relation extraction, community mining, document clustering, and automatic meta-data extraction. In this technique, the author proposes a Pattern Retrieval Algorithm (PRA) to compute the
semantic similarity measure between the words by combining both page count method and
web snippets method. Four association measures are used to find semantic similarity between
words in page count method using web search engines. The author use a Sequential Minimal
Optimization (SMO) Support Vector Machines (SVM) to find the optimal combination of
page counts-based similarity scores and top-ranking patterns from the web snippets method.
The SVM is trained to classify synonymous word-pairs and non-synonymous word-pairs. The
proposed approach aims to improve the correlation values, precision, recall, and F-measures,
compared to the existing methods. The proposed algorithm outperforms them, achieving a correlation value of 89.8%.
Isabel F et al. [16] observed that to describe web page similarity, one often uses phrases like "it looks like a newspaper site", "there are several unordered lists" or "it's just a collection of links".
Unfortunately, no web search or classification tools provide the capability to retrieve
information using such informal descriptions that are based on the appearance, i.e., structure,
of the web page. In this technique, the author takes a look at the concept of structurally
similar web pages. The author notes that some structural properties can be identified with
semantic properties of the data and provide measures for comparison between HTML
documents.
Mohamed S [24] proposed that measuring similarity between web page using a search engine
based on page counts alone is a challenging task. Search engines consider a document as a
bag of words, ignoring the position of words in a document. In order to measure semantic
similarity between two given words, the author proposes a transformation function for web
measures along with a new approach that exploits the document's title attribute and uses page
counts alone returned by web search engines. Experimental results on benchmark datasets
show that the proposed approach outperforms snippet-only methods, achieving a correlation
coefficient up to 71%.
Chaomei Chen [5] proposed a generic approach to structuring and visualising a hypertext-based information space on the web. This approach, called Generalised Similarity
Analysis (GSA), provides a unifying framework for extracting structural patterns from a
range of proximity data concerning three fundamental relationships in hypertext, namely,
hypertext linkage, content similarity and browsing patterns. GSA emphasizes the integral role
of users' interests in dynamically structuring the underlying information space. Pathfinder
networks are used as a natural vehicle for structuring and visualising the rich structure of an
information space by highlighting salient relationships in proximity data. In this technique,
the author use the GSA framework in the study of hypertext documents automatically
retrieved over the internet, including a number of departmental web sites and conference
proceedings on the web page. The author shows that GSA has several distinct features for
structuring and visualising hypertext information spaces. GSA provides some generic tools
for developing adaptive user interfaces to hypertext systems. Link structures derived by GSA
can be used together with dynamic linking mechanisms to produce a number of hypertext
versions of a common information space.
Istvan V [17] proposed a language model for the recognition of inputs with a particular style, using a large-scale web archive. The target is an open-domain web-activated QA system whose word recognition module must recognize relatively short, domain-independent questions. The central issue is how to prepare a large-scale training corpus with
low cost, and the author tackled this problem by combining an existing domain adaptation
method and distributional word similarity. From 500 seed sentences and 600 million Web
pages the author constructed a language model covering 413,000 words. The author achieved
an average improvement of 3.25 points in word error rate over a baseline model constructed
from randomly sampled Web sentences.
Vidya K et al. [48] proposed that semantic similarity aims at providing robust tools for standardizing the content and delivery of relevant information across communicating information sources. Most of the time, the user gets a lot of irrelevant data as a result of a poorly implemented search process. To avoid this, a ranking scheme is proposed, which
provides the search result set according to the better understood and correctly interpreted user
query. This is done by considering the relevance of the query by keeping the user view in
mind and also the semantics of the document and the user query. The simple lexical and syntactical matching usually used by search engines does not retrieve web documents that meet the user's expectations. The proposed solution provides the most relevant data to the user, ranked by relevance. The proposed ranking scheme for the semantic web search engine functions
by finding the semantic similarity between the information available on the web and the
query which is specified by the user. This approach considers both the syntactic structure of
the document and the semantic structure of the document and the query. The objective of this
technique is to demonstrate that a semantic similarity based ranking scheme will provide
much better results than those by the prevailing methods. In this technique, an algorithm will
be implemented that provides ranking scheme for the semantic web documents by finding the
semantic similarity between the documents and the query which is specified by the user. The
algorithm considers both syntactical and semantic similarities of the query and categorizes
the search results based on the most probable and most appropriate interpretation of the query
based on various interpretations taking into account all the words and their combinations in
the query.
Yanshan X et al. [53] noted that in web analysis, positive and unlabelled learning (PU learning) has been investigated to deal with the situation where only positive examples and unlabelled examples are available. Most of the previous works focus on identifying
some negative examples from the unlabelled data, so that the supervised learning methods
can be applied to build a classifier. However, for the remaining unlabelled data, which cannot
be explicitly identified as positive or negative (the author call them ambiguous examples),
they either exclude them from the training phase or simply enforce them to either class.
Consequently, their performance may be constrained. This technique proposes a novel
approach, called Similarity-based PU learning (SPUL) method, by associating the ambiguous
examples with two similarity weights, which indicate the similarity of an ambiguous example
towards the positive class and the negative class, respectively. The local similarity-based and
global similarity-based mechanisms are proposed to generate the similarity weights. The
ambiguous examples and their similarity weights are thereafter incorporated into an SVM-based learning phase to build a more accurate classifier. Extensive experiments on real-world
datasets have shown that SPUL outperforms state-of-the-art PU learning methods.
Nguyen C et al. [25] proposed the use of the Vector Space Model (VSM) to represent documents for web page similarity, where each document is denoted by a vector in a word vector space. The standard VSM does not take into account the semantic relatedness between terms. Thus,
terms with some semantic similarity are dealt with in the same way as terms with no semantic
relatedness. Since this unconcern about semantics reduces the quality of clustering results,
many studies have proposed various approaches to introduce knowledge of semantic
relatedness into the VSM model. Those approaches give better results than the standard VSM. However, they still have their own issues. The author proposed a new approach as a
combination of two approaches, one of which uses Rough Sets theory and co-occurrence of
terms, and the other uses WordNet knowledge to solve these issues. Experiments for its
evaluation show advantage of the proposed approach over the others.
Ana G [2] observed that automatic extraction of similar information from the text and links in web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here the author proposes to leverage human-generated metadata, namely topical directories, to measure semantic relationships
among massive numbers of pairs of Web pages or topics. The Open Directory Project
classifies millions of URLs in a topical ontology, providing a rich source from which
semantic relationships between Web pages can be derived. While semantic similarity
measures based on taxonomies (trees) are well studied, the design of well-founded similarity
measures for objects stored in the nodes of arbitrary ontologies is an open problem. This
technique defines an information-theoretic measure of semantic similarity that exploits both
the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that
this measure improves significantly on the traditional taxonomy-based approach. This novel
measure allows us to address the general question of how text and link analyses can be
combined to derive measures of relevance that are in good agreement with semantic
similarity.
Sven H [45] proposed a technique for measuring the structural similarity of web documents based on entropy. After extracting the structural information from two documents, the author uses either Ziv-Lempel encoding or Ziv-Merhav cross-parsing to determine the entropy and consequently the similarity between the documents. This is the first true linear-time approach for evaluating structural similarity. In an experimental evaluation the author demonstrates that the results of the algorithm in terms of clustering quality are on a par with or even better than existing approaches.
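The compression view of this entropy estimate can be approximated with an off-the-shelf compressor: strip a page down to its tag sequence, then compare via a normalized compression distance. The zlib-based NCD below is a stand-in for the Ziv-Lempel / Ziv-Merhav estimators the author actually uses:

```python
import zlib
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Discards text content, keeping only the ordered tag structure."""
    def __init__(self):
        super().__init__()
        self.seq = []
    def handle_starttag(self, tag, attrs):
        self.seq.append("<" + tag + ">")
    def handle_endtag(self, tag):
        self.seq.append("</" + tag + ">")

def structure(html):
    p = TagSequence()
    p.feed(html)
    return "".join(p.seq).encode()

def ncd(a, b):
    """Normalized Compression Distance between two byte strings."""
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

def structural_distance(html_a, html_b):
    return ncd(structure(html_a), structure(html_b))
```

Pages with the same repeated structure compress well together and yield a distance near 0, while structurally unrelated pages approach 1.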
Jiahui L et al. [18] proposed a method for measuring web page semantic similarity, an
important type of semantic relationship, between entities. The method is based on Google
Directory, a search interface to the Open Directory Project. Via the search engine, the author
can locate the web pages relevant to an entity and automatically create a profile of the entity
according to the directory assignments of its web pages, which capture various features of the
entity. Using their profiles, the semantic similarity between entities can be measured in
different dimensions. The author applies the semantic similarity measurement to two
knowledge acquisition tasks: thesaurus construction of entities and fine-grained
categorization of entities. Our experiments demonstrate that the proposed method works
effectively in these two tasks.
Poonam C et al. [29] noted that many semantic web search engines have been developed, such as Ontolook and Swoogle, which help in searching the meaningful documents presented on the semantic web. In contrast, the commonly used approach is based on matching keywords extracted from the document, which is known as lexical matching. But there exist documents that contain the same information expressed in different words, i.e., one document using a word and another using a synonym of that word. So, when the similarity of such documents is computed through lexical matching, it will not give true results of similarity computation. In this technique the author has proposed a semantic web document similarity
scheme that relies not only on the keywords but on conceptual instances present between the
keywords and also considers the relationships that exists between the concepts present in the
web pages. The author explores all relevant relations between the keywords, exploring the user's intention, and then calculates the fraction of these relations on each web page to
determine their relevance and similarity with the other documents. The author has found that
this semantic similarity scheme gives better results than those by the prevailing methods.
Phyu Te [28] noted that the Web is an important source of information retrieval, and the users accessing the Web come from different backgrounds. The usage information about users is recorded in web logs. Analyzing web log files to extract useful patterns is called Web Usage Mining. Web usage similarity approaches include clustering, association rule mining, sequential pattern mining, etc., and they can be applied to predict the next page access. In this technique, a PageRank-like algorithm is proposed for conducting web page access prediction. The author extends the use of the PageRank algorithm for next page prediction with several navigational attributes, which are the similarity of the page, the size of the page, the access time of the page, the duration of the page, the transition (two pages visited sequentially), and the frequency of page and transition.
Sridevi.U [44] proposed a methodology for the web page similarity based semantic
annotation of web pages with annotation weighting scheme that takes advantage of the
different relevance of structured document fields. The retrieval model is based on the
importance factors of the structural elements, which are used to re-rank the documents retrieved by the ontology-based distance measure. The relevant concept similarities are
combined with the annotation-weighting scheme to improve the relevance measures. The
proposed method has been evaluated on the USGS Science directory collection. Preliminary experimental results show that the method can place relevant documents in the top ranks.
Radha D et al. [33] noted that phishing is a current social engineering attack that results in online identity theft. Phishing web pages generally use similar page layouts, styles (font
families, sizes, and so on), key regions, and blocks to mimic genuine pages in an effort to
convince Internet users to divulge personal information, such as bank account numbers and
passwords. A novel technique to visually compare an assumed phishing page with the
legitimate one is presented. Five important features such as signature extraction, text pieces
and their style, images, URL keywords and the overall appearance of the page as rendered by
the browser are identified and considered. An experimental evaluation using a dataset
collected of 150 real world phishing pages, along with their equivalent legitimate targets has
been performed. The investigational results are satisfactory in terms of false positives and
false negatives and an efficiency rate of about 98.11% for false positive pages and 92.95% for
false negative pages has been obtained.
Douglas L [10] notes that the lexical semantic system is an important component of human
language and cognitive processing. One approach to modeling semantic knowledge makes
use of hand-constructed networks or trees of interconnected word senses. An alternative
approach seeks to model word meanings as high-dimensional vectors derived from the
co-occurrence of words in unlabeled text corpora. The work introduces a new vector-space
method for deriving word meanings from large corpora, inspired by the HAL and LSA
models but achieving better and more consistent results in predicting human similarity
judgments. The author explains the new model, known as COALS, and how it relates to prior
methods, and then evaluates the various models on a range of tasks, including a novel set of
semantic similarity ratings involving both semantically and morphologically related terms.
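The vector-space idea behind HAL- and LSA-style models can be illustrated with a toy co-occurrence model; this sketch omits the correlation normalization that distinguishes COALS, and the corpus is made up.

```python
from collections import Counter
import math

def cooccurrence_vectors(tokens, window=2):
    """Build a co-occurrence count vector for every word (toy corpus scale)."""
    vecs = {}
    for i, w in enumerate(tokens):
        ctx = vecs.setdefault(w, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = "the cat sat on the mat the dog sat on the rug".split()
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" occur in similar contexts, so their vectors lie close together.
```

Words appearing in similar contexts end up with similar vectors, which is the property these models use to predict human similarity judgments.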
Mara A et al. [21] proposed a functional technique for identifying similar web pages based
on measuring tree similarity. The key idea behind the method is to transform each web page
into a compressed, normalized tree that effectively represents its visual structure. The authors
develop an optimization of this technique based on memoization that achieves significant
improvements in both time and space efficiency. The work also presents a tool that
implements the proposed technique, as well as case studies for two real scenarios.
Experiments on real documents show that the optimized algorithm performs significantly
better than the original technique and demonstrate the practicality of the approach.
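A minimal illustration of the underlying idea is to reduce each page to its tag structure and score the overlap; the paper's actual method compresses and normalizes the tree and uses a proper tree-similarity measure, so the path-overlap score below is only a stand-in.

```python
from html.parser import HTMLParser

class TagTreeBuilder(HTMLParser):
    """Collapse a page to a nested structure of tag names (its visual skeleton)."""
    def __init__(self):
        super().__init__()
        self.root = ("html", [])
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def tag_paths(node, prefix=()):
    """Flatten the tree into root-to-node tag paths for a simple overlap score."""
    path = prefix + (node[0],)
    paths = {path}
    for child in node[1]:
        paths |= tag_paths(child, path)
    return paths

def tree_similarity(html_a, html_b):
    """Jaccard overlap of the two pages' tag-path sets."""
    a, b = TagTreeBuilder(), TagTreeBuilder()
    a.feed(html_a)
    b.feed(html_b)
    pa, pb = tag_paths(a.root), tag_paths(b.root)
    return len(pa & pb) / len(pa | pb)

# Pages with the same layout score 1.0 regardless of their text content.
```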
Sreedevi S [43] observes that with the highly increased use of the web comes a significant
demand to provide more reliable web applications. By learning more about the usage and
dynamic behavior of these applications, the author believes that software development and
maintenance tools can be designed with increased cost-effectiveness. The work analyzes user
session data. In particular, its main contributions are the analysis of user session data with
concept analysis, an experimental study of user session data analysis with two different types
of web software, and an application of user session analysis to scalable test case generation
for web applications. In addition to fruitful experimental results, the techniques and metrics
themselves provide insight into future approaches to analyzing the dynamic behavior of web
applications.
Krishna N [19] proposed using semantic similarity techniques to identify the MOOCs
(Massive Open Online Courses) offered by e-learning websites that are similar to the regular
courses offered by a university. Over the last few years there has been significant
development in the e-learning industry, which provides online courses to the public. Due to
the drastic improvement in technology and the Internet, this form of education reaches many
people across boundaries. There is a vast set of courses currently provided by various
sources, ranging from the latest technologies in the field of computer science to any topic in
history. Since the advent of e-learning, there has been constant improvement of user-friendly
tools to enhance the learning process. In the span of the last three years, many websites that
provide online courses have come into existence. Some of the best universities in the United
States and throughout the world have also started to provide online courses that students can
easily attend. It has become very difficult for a student to pick the right online course. Hence,
applications that can integrate the courses provided by e-learning websites such as Coursera,
Udacity, and edX would be very helpful. A student can compare the regular courses provided
by his or her university with the courses offered by the e-learning websites and enroll in
similar online courses to get a better understanding of the subject.
Taher H [47] notes that finding web pages similar to a query page is an important component
of modern search engines. A variety of strategies have been proposed for answering
related-pages queries, but comparative evaluation by user studies is expensive, especially
when large strategy spaces must be searched. The author proposes a technique for
automatically evaluating strategies using web hierarchies, such as the Open Directory, in
place of user feedback, and applies this evaluation methodology to a mix of document
representation strategies, including the use of text, anchor text, and links. The relative
advantages and disadvantages of the various approaches are examined. Finally, the author
shows how to efficiently construct a similarity index of the chosen strategies and provides
sample results from the index.
Danushka B et al. [7] note that measuring the semantic similarity between words is an
important component in various tasks on the web such as relation extraction, community
mining, document clustering, and automatic metadata extraction. Despite the usefulness of
semantic similarity measures in these applications, accurately measuring semantic similarity
between two words (or entities) remains a challenging task. The authors propose an empirical
method to estimate semantic similarity using page counts and text snippets retrieved from a
web search engine for two words. Specifically, they define various word co-occurrence
measures using page counts and integrate those with lexical patterns extracted from text
snippets. To identify the numerous semantic relations that exist between two given words,
they propose a novel pattern extraction algorithm and a pattern clustering algorithm. The
optimal combination of page-count-based co-occurrence measures and lexical pattern
clusters is learned using support vector machines. The proposed method outperforms various
baselines and previously proposed web-based semantic similarity measures on three
benchmark datasets, showing a high correlation with human ratings. Moreover, it
significantly improves accuracy in a community mining task.
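The page-count co-occurrence measures can be sketched as follows, assuming f(x), f(y), and f(x AND y) are page counts returned by a search engine and N is the total number of indexed pages; the counts in the example are hypothetical.

```python
import math

def web_jaccard(fx, fy, fxy):
    """Jaccard over page counts: f(x AND y) / (f(x) + f(y) - f(x AND y))."""
    return 0.0 if fxy == 0 else fxy / (fx + fy - fxy)

def web_dice(fx, fy, fxy):
    """Dice coefficient over page counts."""
    return 0.0 if fxy == 0 else 2 * fxy / (fx + fy)

def web_overlap(fx, fy, fxy):
    """Overlap (Simpson) coefficient over page counts."""
    return 0.0 if fxy == 0 else fxy / min(fx, fy)

def web_pmi(fx, fy, fxy, n):
    """Pointwise mutual information, with n the (assumed) total page count."""
    return 0.0 if fxy == 0 else math.log2((fxy / n) / ((fx / n) * (fy / n)))

# Hypothetical page counts for two related query terms and their conjunction.
fx, fy, fxy, n = 8000, 3000, 2500, 10**6
```

In the paper these scores are not used in isolation; they are combined with lexical-pattern features and the combination is learned by an SVM.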
P. V. Praveen [26] proposed an automatic web record extraction method that extracts a set of
objects from heterogeneous web pages in an automated fashion, based on a similarity
measure among objects. It classifies a region in a web page according to the similar data
objects that emerge frequently in it, transforming unstructured data into structured data that
can be stored and analyzed in a central local database. The existing system develops a data
extraction and alignment method known as Combining Tag and Value Similarity (CTVS),
which identifies the Query Result Records (QRRs) by extracting the data from the query
result page and segmenting them. The segmented QRRs are aligned into a table where data
values of the same attribute are put into the same column. This technique is based on the
discovery of non-consecutive data records in order to detect nested data records in QRRs.
The attributes in a record are aligned using a record alignment algorithm that combines tag
and data value similarity information. However, the structure of the data values is altered
when they are extracted from the web page, and such template changes make it inefficient to
access them as in traditional databases. The proposed structural semantic entropy measures
the degree of repeated occurrence of information in the DOM tree representation and aims to
locate the data on web pages depending on the particular records of interest. The algorithm
extracts data from heterogeneous web pages and is insensitive to modifications in web page
format, which helps in detecting false positives when associating the attributes of records
with their respective values. Experiments show that this method achieves higher accuracy
than existing methods in automated information extraction.
Rekha R [36] notes that advances in semantic web page similarity and e-learning
technologies have provided more opportunities to achieve the goal of collaborative
knowledge sharing, and have also facilitated teachers in sharing their teaching material, tools,
and experiences with others through the medium of the Internet and web technologies. The
author identifies the need for a Distributed Question Bank (DQB) created by different experts
in related fields, and explores the possibility of using semantic web technology, and ontology
in particular, to address the issue of question similarity in a DQB. The author proposes a
method of creating subset ontologies for the questions and comparing them to find
overlapping concepts in order to determine question similarity, and also formulates a model
based on an information-theoretic approach using this notion of subset ontology to measure
the similarities among the questions in the dataset considered.
Wei L et al. [49] note that extracting structured data from deep web pages is a challenging
problem due to the underlying intricate structures of such pages. A large number of
techniques have been proposed to address this problem, but all of them have inherent
limitations because they depend on the web page's programming language. As a popular
two-dimensional medium, the contents of web pages are always displayed regularly for users
to browse, which motivates a different way of performing deep web data extraction that
overcomes the limitations of previous work by utilizing common visual features of deep web
pages. A novel vision-based approach that is independent of the web page programming
language is proposed. This approach primarily utilizes the visual features of deep web pages
to implement deep web data extraction, including data record extraction and data item
extraction. The authors also propose a new evaluation measure, revision, to capture the
amount of human effort needed to produce a perfect extraction. Experiments on a large set of
web databases show that the proposed vision-based approach is highly effective for deep web
data extraction.
Eric Mt [12] proposed a novel technique to visually compare a suspected phishing page with
the legitimate one, the goal being to determine whether the two pages are suspiciously
similar. The author identifies and considers three page features that play a key role in making
a phishing page look similar to a legitimate one: text pieces and their style, images embedded
in the page, and the overall visual appearance of the page as rendered by the browser. To
verify the feasibility of the approach, an experimental evaluation was performed using a
dataset composed of 41 real-world phishing pages along with their corresponding legitimate
targets. The experimental results are satisfactory in terms of false positives and false
negatives.
Poonam L et al. [29] note that in recent years, semantic search for relevant documents on the
web has been an important topic of research. Many semantic web search engines have been
developed, such as Ontolook and Swoogle, that help in searching meaningful documents
presented on the semantic web. The concept of semantic similarity has been widely used in
many fields such as artificial intelligence, cognitive science, natural language processing,
and psychology. To relate entities, texts, or documents having the same meaning, a semantic
similarity approach is used based on matching keywords extracted from the documents using
syntactic parsing. The simple lexical matching usually used by semantic search engines does
not extract web documents to the user's expectations. The authors propose a ranking scheme
for semantic web documents that finds the semantic similarity between the documents and
the query specified by the user. The novel approach relies not only on the syntactic structure
of the document but also considers the semantic structure of the document and the query,
including both lexical and conceptual matching. The combined use of conceptual, linguistic,
and ontology-based matching significantly improved the performance of the proposed
ranking scheme. The authors identify all relevant relations between the keywords, exploring
the user's intention, and then calculate the fraction of these relations on each web page to
determine their relevance with respect to the query provided by the user. They found that this
semantic similarity based ranking scheme gives much better results than the prevailing
methods.
Yanhong Z et al. [52] study structured data extraction from web pages, e.g., online product
description pages. Existing approaches to data extraction include wrapper induction and
automatic methods. The authors propose an instance-based learning method, which performs
extraction by comparing each new instance to be extracted with labeled instances. The key
advantage of the method is that it does not need an initial set of labeled pages to learn
extraction rules, as in wrapper induction. Instead, the algorithm is able to start extraction
from a single labeled instance; only when a new page cannot be extracted does the page need
labeling. This avoids unnecessary page labeling, which addresses a major problem with
inductive learning (or wrapper induction), namely that the set of labeled pages may not be
representative of all other pages. The instance-based approach is very natural because
structured data on the web usually follow templates, and pages of the same template can
usually be extracted using a single page instance of that template. The key issue is the
similarity or distance measure. Traditional measures based on Euclidean distance or text
similarity are not easily applicable in this context because items to be extracted from
different pages can be entirely different. The authors propose a novel similarity measure
suitable for templated web pages. Experimental results with product data extraction from
1,200 pages on 24 diverse web sites show that the approach is surprisingly effective,
significantly outperforming existing state-of-the-art systems.
William W [51] argues that a measure of the similarity between incomplete rankings should
handle non-conjointness, weight high ranks more heavily than low ones, and be monotonic
with increasing depth of evaluation, but that no measure satisfying all these criteria
previously existed. The author proposes a new measure having these qualities, namely
Rank-Biased Overlap (RBO), which is based on a simple probabilistic user model. It
provides monotonicity by calculating, at a given depth of evaluation, a base score that is
non-decreasing with additional evaluation and a maximum score that is non-increasing; an
extrapolated score can be calculated between these bounds if a point estimate is required.
RBO has a parameter that determines the strength of the weighting toward top ranks. The
author extends RBO to handle tied ranks and rankings of different lengths. Finally, the
measure is used in comparing the results produced by public search engines and in assessing
retrieval systems in the laboratory.
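The core of RBO can be sketched as follows; this computes only the truncated base score over an evaluated prefix, without the paper's extrapolation or tie handling.

```python
def rbo_base(s, t, p=0.9):
    """Truncated Rank-Biased Overlap of two ranked lists (no ties).

    A_d is the agreement (overlap fraction) of the two prefixes at each depth
    d, and deeper ranks are geometrically down-weighted by p. This is the
    base (lower-bound) score over the evaluated prefix, not extrapolated RBO."""
    depth = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    overlap, score = 0, 0.0
    for d in range(1, depth + 1):
        x, y = s[d - 1], t[d - 1]
        if x == y:
            overlap += 1
        else:
            # x enters the intersection if t has already shown it, and vice versa.
            overlap += (x in seen_t) + (y in seen_s)
        seen_s.add(x)
        seen_t.add(y)
        score += p ** (d - 1) * (overlap / d)
    return (1 - p) * score
```

For two identical length-k lists the score is 1 - p**k, which approaches 1 as the evaluated depth grows, matching the measure's top-weighted behavior.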
Apostolos K [4] proposed a new page ranking system, WordRank, which exploits similarity
between interconnected pages. WordRank introduces the model of the biased surfer, based on
the assumption that the visitor of a web page tends to visit web pages with similar content
rather than pages with irrelevant content. The algorithm modifies the random surfer model
by biasing the probability of a user following a link in favor of links to pages with similar
content. The intuition is that WordRank is most appropriate in topic-based searches, since it
prioritizes strongly interconnected pages, while at the same time being more robust to the
multitude of topics and to the noise produced by navigation links. The author presents
preliminary experimental evidence from a search engine built for the Greek fragment of the
worldwide web. For evaluation purposes, a new metric (SI score) based on implicit user
feedback is introduced, and explicit evaluation is also employed where available.
Hila B et al. [15] note that social media sites (e.g., Flickr, YouTube, and Facebook) are a
popular distribution outlet for users looking to share their experiences and interests on the
web. These sites host substantial amounts of user-contributed material (e.g., photographs,
videos, and textual content) for a wide variety of real-world events of different type and
scale. By automatically identifying these events and their associated user-contributed social
media documents, which is the focus of this work, event browsing and search can be enabled
in state-of-the-art search engines. To address this problem, the authors exploit the rich
context associated with social media content, including user-provided annotations (e.g., title,
tags) and automatically generated information (e.g., content creation time). Using this rich
context, which includes both textual and non-textual features, appropriate document
similarity metrics can be defined to enable online clustering of media into events. As a key
contribution, the authors propose a variety of techniques for learning multi-feature similarity
metrics for social media documents in a principled manner. The techniques are evaluated on
large-scale, real-world datasets of event images from Flickr, and the results suggest that the
approach identifies events and their associated social media documents more effectively than
the state-of-the-art strategies on which it builds.
T. Upender et al. [46] address the similarity between words, going beyond the syntactic
similarity of two strings. Semantic similarity is a confidence score that reflects the semantic
relation between the meanings of two sentences. It is difficult to attain a high accuracy score
because exact semantic meanings are completely understood only in a particular context.
The goals of the paper are to present some dictionary-based algorithms to capture the
semantic similarity between two sentences, relying heavily on a semantic dictionary. A web
search engine is designed to search for information on the World Wide Web; the search
results are generally presented in a line of results often referred to as Search Engine Results
Pages (SERPs). The information may be a mix of web pages, images, and other types of
files. Some search engines also mine data available in databases or open directories. Unlike
web directories, which are maintained only by human editors, search engines also maintain
real-time information by running an algorithm on a web crawler. To identify the numerous
semantic relations that exist between two given words, the authors propose a novel pattern
extraction algorithm and a pattern clustering algorithm, and the optimal combination of
page-count-based co-occurrence measures and lexical pattern clusters is learned using
support vector machines.
Ruzhan L [39] notes that the measurement of word similarity is foundational work in
semantic computing. The author proposes a method for measuring word similarity utilizing
the definitional words in a Machine-Readable Dictionary (MRD). It is observed that similar
words have similar definitions, so the definition of a word is transformed into a vector, and
the similarity between two words is calculated as the distance between their definition
vectors. To avoid the exponential increase of the definition vectors, two kinds of Basic
Lexical Sets (BLSs) are used to hold the most essential definitional words: one is the set of
sememes in HowNet, and the other is constructed automatically using the PageRank
algorithm. Experiments show that this method achieves competitive results.
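The definition-vector idea can be illustrated with a toy machine-readable dictionary; the glosses below are made up, and this sketch skips the Basic Lexical Set reduction.

```python
from collections import Counter
import math

# Hypothetical mini machine-readable dictionary: word -> definitional gloss.
MRD = {
    "car": "a road vehicle with an engine and four wheels",
    "automobile": "a motor vehicle with wheels and an engine",
    "banana": "a long curved fruit with yellow skin",
}

def definition_vector(word):
    """Similar words have similar definitions, so compare gloss word counts."""
    return Counter(MRD[word].split())

def cosine(u, v):
    """Cosine between two sparse count vectors (distance = 1 - cosine)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# "car" and "automobile" share definitional words such as "vehicle",
# "engine", and "wheels", so their definition vectors lie close together.
```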
Rudi L [38] observes that words and phrases acquire meaning from the way they are used in
society, from their relative semantics to other words and phrases. For computers, the
equivalent of society is a database, and the equivalent of use is a way to search the database.
The work presents a new theory of similarity between words and phrases based on
information distance and Kolmogorov complexity. To fix thoughts, the World Wide Web is
used as the database and Google as the search engine, though the method is also applicable
to other search engines and databases. This theory is then applied to construct a method to
automatically extract similarity, the Google similarity distance, of words and phrases from
the World Wide Web using Google page counts. The World Wide Web is the largest database
on earth, and the context information entered by millions of independent users averages out
to provide automatic semantics of useful quality. Applications are given in hierarchical
clustering, classification, and language translation: examples distinguish between colors and
numbers, cluster names of paintings by 17th-century Dutch masters and names of books by
English novelists, demonstrate the ability to understand emergencies and primes, and
perform a simple automatic English-Spanish translation. Finally, the WordNet database is
used as an objective baseline against which to judge the performance of the method: a
massive randomized trial in binary classification using support vector machines to learn
categories based on the Google distance results in a mean agreement of 87% with the
expert-crafted WordNet categories.
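The resulting Normalized Google Distance can be written directly from page counts; the counts and total N below are hypothetical stand-ins for real search engine results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts:

    NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                / (log N - min(log f(x), log f(y)))

    where f(x) and f(y) are hit counts for each term, f(x, y) the count for
    the conjunctive query, and N the (assumed) total number of indexed pages."""
    lx, ly, lxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Hypothetical counts: related terms co-occur often, so their distance is small.
related = ngd(fx=9_000_000, fy=8_000_000, fxy=6_000_000, n=10**10)
unrelated = ngd(fx=9_000_000, fy=8_000_000, fxy=10_000, n=10**10)
```

A distance of 0 means the terms always co-occur; larger values indicate weaker association.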
L M Patnaik [20] notes that semantic similarity measures play an important role in
information retrieval, natural language processing, and various tasks on the web such as
relation extraction, community mining, document clustering, and automatic metadata
extraction. The author proposes a Pattern Retrieval Algorithm (PRA) to compute the
semantic similarity between words by combining both the page count method and the web
snippets method; four association measures are used to find semantic similarity between
words in the page count method using web search engines. A Sequential Minimal
Optimization (SMO) Support Vector Machine (SVM) is used to find the optimal combination
of page-count-based similarity scores and top-ranking patterns from the web snippets
method. The SVM is trained to classify synonymous and non-synonymous word pairs. The
proposed approach aims to improve the correlation values, precision, recall, and F-measures
compared to existing methods, and achieves a correlation value of 89.8%.
Seema B [40] notes that web services have become a new industrial standard offering
interoperability among various platforms, but that the discovery mechanism is limited to
syntactic discovery only. A framework named AD WebS is proposed for automatic discovery
of semantic web services, which can be considered an extension of one of the most prevalent
frameworks for semantic web services, WSDL-S. At the first stage, the framework proposes
manual semantic annotation of web services to provide a functional description of the
services in the Web Service Description Language (WSDL) <document> tag. These
annotations are extracted and a term-category matrix is formed, where a category denotes a
class to which a web service will be added. Next, the semantic relatedness between terms and
pre-defined categories is calculated using a Normalized Similarity Score (NSS). A
nonparametric test, the Kruskal-Wallis test, is applied to the generated values, and based on
the test results, services are put into one or more pre-defined categories. The user or
requestor of a service is directed to the semantically categorized Universal Description,
Discovery and Integration (UDDI) repository for discovery of the required service.
Experimental results on a dataset covering multiple web services of various categories show
a significant improvement over the current state-of-the-art web service discovery methods.
H. Devaraj et al. [14] likewise address measuring the semantic similarity between two words,
an important component in various tasks on the web such as relation extraction, community
mining, document clustering, and automatic metadata extraction. Despite the usefulness of
semantic similarity measures in these applications, accurately measuring semantic similarity
between two words (or entities) remains a challenging task. The authors propose an empirical
method to estimate semantic similarity using page counts and text snippets retrieved from a
web search engine for two words. Specifically, various word co-occurrence measures are
defined using page counts and integrated with lexical patterns extracted from text snippets.
To identify the numerous semantic relations that exist between two given words, a novel
pattern extraction algorithm and a pattern clustering algorithm are proposed. The optimal
combination of page-count-based co-occurrence measures and lexical pattern clusters is
learned using support vector machines. The proposed method outperforms various baselines
and previously proposed web-based semantic similarity measures on three benchmark
datasets, showing a high correlation with human ratings, and significantly improves accuracy
in a community mining task.
R. Kotteswari et al. [32] note that web mining is the application of data mining technology to
discover patterns from the web, supporting various tasks such as relation extraction,
community mining, document clustering, and automatic metadata extraction. Previously
proposed web-based semantic similarity measures have shown high correlation with human
ratings on three benchmark datasets. One of the main problems in information retrieval is to
retrieve a set of documents that is semantically related to a given user query. The authors
propose an automatic acquisition method to estimate the semantic relation between two
words using a pattern extraction algorithm and a sequential clustering algorithm.
Ming-S et al. [23] propose a web search with double checking model that explores the web
as a live corpus. Five association measures, including variants of Dice, Overlap Ratio,
Jaccard, and Cosine, as well as Co-Occurrence Double Check (CODC), are presented. In
experiments on the Rubenstein-Goodenough benchmark dataset, the CODC measure
achieves a correlation coefficient of 0.8492, which competes with the performance (0.8914)
of the model using WordNet. Experiments on link detection of named entities using the
strategies of direct association, association matrix, and scalar association matrix verify that
the double-check frequencies are reliable. Further study on named entity clustering shows
that the five measures are quite useful; in particular, the CODC measure is very stable in
word-word and name-name experiments. The application of the CODC measure to expand
community chains for personal name disambiguation achieves 9.65% and 14.22% increases
compared to the system without community expansion. All the experiments illustrate that the
novel model of web search with double checking is feasible for mining associations from the
web.
Anita J et al. [3] also address measuring the semantic similarity between words for tasks such
as relation extraction, community mining, document clustering, and automatic metadata
extraction, noting that accurately measuring semantic similarity between two words (or
entities) remains a challenging task. The authors propose an empirical method to estimate
semantic similarity using page counts and text snippets retrieved from a web search engine
for two words: various word co-occurrence measures are defined using page counts and
integrated with lexical patterns extracted from text snippets, a novel pattern extraction
algorithm and a pattern clustering algorithm identify the numerous semantic relations that
exist between two given words, and the optimal combination of page-count-based
co-occurrence measures and lexical pattern clusters is learned using support vector machines.
The proposed method outperforms various baselines and previously proposed web-based
semantic similarity measures on three benchmark datasets, showing a high correlation with
human ratings, and significantly improves accuracy in a community mining task.
Saravanan [11] notes that semantic web mining aims at combining the two fast-developing
research areas of the semantic web and web mining. This survey analyzes the convergence of
these trends: more and more researchers are working on improving the results of web mining
by exploiting semantic structures in the web, and web mining techniques are used for
building the semantic web; last but not least, these techniques can be used for mining the
semantic web itself. The semantic web is the second-generation WWW, enriched by
machine-learning techniques that support the user in his tasks. Given the enormous size of
even today's web, it is impossible to manually enrich all of these resources; therefore,
automated schemes for learning the relevant information are increasingly being used. The
survey argues that the two areas, web mining and the semantic web, need each other to fulfill
their goals, but that the full potential of this convergence is not yet realized. It gives an
overview of where the two areas meet today and sketches ways in which a closer integration
could be profitable; for example, by applying lexico-syntactic patterns to the process of
ontology design or evolution, ontology elements might be derived.
CHAPTER - 3
3. RESEARCH METHODOLOGY
3.1 METHODOLOGY ANALYSIS
The proposed architecture accepts web page URLs as input and contains a comparison
module where the cosine similarity measurement algorithm is applied to compare the two
web pages. The architecture follows a path from the start state to the end state: the user
inputs, in the form of URLs, the web pages whose similarity is to be detected.
Figure 5. Architecture of Cosine Similarity (pipeline: URL 1 and URL 2 → Web Page →
DOM Parser → Path Completion → Data Extraction → Data Cleaning → Lexical Similarity
→ Comparison with the other web page → Cosine Similarity → Output)
There are totally four steps involved in determining the similarity for the given websites.
Initially, the URL of the two web pages is taken as input for comparison. A URL, also known
as a web address, particularly when used with HTTP is a specific character or string that
constitutes a reference to a resource or a web page. Most web browsers display the URL of a
web page in an address bar. A typical URL might look like:
http://en.example.org/test/Main_Page.html / php
Next, in order to find the contents and structure of a web page, information about its
measures is necessary. A web page is a web document suitable for the World Wide Web and
the web browser. A web page displays text, hypertext, images or audio/video, is usually
written in HTML or a comparable markup language, and its main distinction is that it
provides navigation to other web pages via links. Web browsers coordinate web resources
centered on the written web page, such as style sheets, scripts and images, to present the web
page. From such a webpage, the title, description, keywords and metadata are extracted.
From the dataset, the lists of title, description, keywords and metadata are separated as
objects to build a many-to-many mapping for finding matching words. When matching,
there is an option to store the term frequency. An easier way might be to submit URL 1 as a
query and look for URL 2 in the result set. Here, cosine similarity is used as the measure of
similarity between the objects of the two pages. These objects apply for any number of
mappings, and cosine similarity is the most widely used measure in such multiple-mapping
scenarios. For example, in information retrieval and text mining, each term is notionally
assigned a different dimension, and a document is characterized by a vector where the value
of each dimension corresponds to the number of times that term appears in the document.
Cosine similarity then gives a useful measure of how similar two documents are likely to be
in terms of their subject matter.
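The term-frequency vectors and the cosine measure just described can be sketched in Java; the class, the method names and the whitespace tokenization here are illustrative simplifications, not the thesis implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Build a term-frequency map from whitespace-separated text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range 0..1.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("web page similarity web mining");
        Map<String, Integer> d2 = termFrequencies("web page content mining");
        System.out.println(cosine(d1, d2));
    }
}
```

A value of 1.0 means identical term distributions, while 0.0 means the two pages share no terms at all.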
Finally, after applying cosine similarity, the similarity of the two web pages is displayed as
a percentage:
Similarity (%) = (Matched Words / Total Number of Words) × 100
Our research defines its own compliance ratio for declaring a webpage similar, calculated as
above: any similarity below 50% resulting from the cosine similarity comparison is declared
dissimilar. If the similarity output is, for example, 57%, then the web pages considered for
comparison are similar.
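The percentage formula and the 50% compliance threshold reduce to a few lines; the class and method names here are illustrative:

```java
public class ComplianceCheck {

    // Matched words over total words, scaled to a percentage.
    static double similarityPercent(int matchedWords, int totalWords) {
        return (matchedWords * 100.0) / totalWords;
    }

    // The 50% compliance ratio decides similar vs. dissimilar.
    static boolean isSimilar(int matchedWords, int totalWords) {
        return similarityPercent(matchedWords, totalWords) >= 50.0;
    }

    public static void main(String[] args) {
        // 57 matched words out of 100 -> 57%, above the 50% compliance ratio.
        System.out.println(similarityPercent(57, 100) + "% similar: " + isSimilar(57, 100));
    }
}
```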
The Proposed System:
The methodology, from providing the URLs as input up to the similarity measurement, is
explained as a step-by-step process below.
3.1.1 DOM Parser
The Document Object Model (DOM) is a standard for interacting with the objects stored in
XML and HTML documents. All browsers use a DOM-like model to render/parse an HTML
page. When a browser renders an HTML document, it parses the document in order to
display the contents; this is the web content mining used in our research. The HTML parser
accesses and traverses the HTML document. It is also important that the HTML parser
identifies the various HTML tags to find the structure and description of the web page. The
HTML parser also provides various functions to extract specific portions of the HTML
content. For this research we have used the Jsoup Java library.
Java library (Jsoup)
Jsoup reads the HTML document and parses it, similarly to a DOM parser, to identify the
various objects such as title, keywords, description and metadata. These objects are
represented in a structure that can be traversed, and the required data can be extracted using
the various methods provided by Jsoup. The following are some of the Jsoup methods used
to extract data.
Document doc = Jsoup.connect("http://www.bu.edu/").get();
The above call can be used to fetch and parse an HTML document so that the required data
can be found. The connect() method opens a connection to the provided URL, and the get()
method fetches and parses the HTML document. An exception is thrown if there is any error
fetching the HTML page.
Elements getid = doc.select("div[id=top_subsite]");
The select() method helps to traverse the HTML document and go to the particular location
where the required data is found. In the above case we traverse to the div tag in the HTML
page whose id attribute is top_subsite.
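The research itself extracts these objects with Jsoup as shown above. Purely as a self-contained illustration of the same idea, the title and a named meta tag can also be pulled out of raw HTML with standard-library regular expressions; the class and method names are hypothetical, and a real parser such as Jsoup should be preferred for robust extraction:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaExtractor {

    // Pull the contents of the <title> element out of the HTML.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Pull the content attribute of a <meta name="..."> tag.
    static String extractMeta(String html, String name) {
        Matcher m = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"(.*?)\"",
                Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title>"
                + "<meta name=\"keywords\" content=\"web, mining\"></head></html>";
        System.out.println(extractTitle(html));            // Sample Page
        System.out.println(extractMeta(html, "keywords")); // web, mining
    }
}
```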
Since the solution to these problems is very important in various real-time applications, a lot
of research is being done in this field. Since each website stores and presents its data
differently, the data we require might not always be present in the same location, and the
same method cannot be used to extract the data across various websites. Every web page
must be analyzed to determine the exact location of the data. Some websites display
information that is not required, which is eliminated in the next phase of our research. It is
critical to extract only the information that is required and scrap the rest of the data before
storing it in the dataset; otherwise, a lot of overhead would be involved in processing the
data every time it is fetched from the database. Hence, the complete set of data is extracted
and sent to cleaning.
3.1.4 Data Cleaning
Data cleaning, data cleansing or data scrubbing is the process of detecting and correcting (or
removing) unwanted records from a dataset. In our research, we eliminate all articles,
prepositions and conjunctions. Since words such as a, an, the, for, under and and appear in
the data, title, description, keywords and metadata of virtually every webpage, the similarity
percentage calculation would be distorted if these kinds of words were compared. Used
mainly in databases, the term cleaning refers to identifying incomplete, incorrect, inaccurate
or irrelevant parts of the data and then replacing, modifying or deleting this dirty or coarse
data. Here, we identify the irrelevant words and delete them. After cleaning, a data set will
be consistent with other similar data sets in the system. The actual process of data cleaning
may involve removing typographical errors or validating and correcting values against a
known list of entities. Here, the cleaning only identifies irrelevant words, on the assumption
that the web page does not contain typographical errors.
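A minimal sketch of this cleaning step, assuming a small illustrative stop-word list rather than the full list of articles, prepositions and conjunctions used in the research:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DataCleaner {

    // Illustrative subset of articles, prepositions and conjunctions.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "for", "under", "and", "or", "of", "in", "on"));

    // Keep only the words that are not stop words.
    static List<String> clean(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (!STOP_WORDS.contains(w.toLowerCase())) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("The", "Department", "of", "Computer", "Science");
        System.out.println(clean(raw)); // [Department, Computer, Science]
    }
}
```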
3.1.5 Lexical similarity
Once the data is cleaned, the data set is ready for comparison. When the webpage content is
compared with respect to a one-to-one mapping, lexical similarity lets us analyze two word
sets and measure the similarity between them: the title of one web page is compared with
the title of the other, and similarly for the description, keywords and metadata. The
similarity scores are in the range of zero to one, where one means that there is a complete
overlap between the two word sets and the sets are very similar to each other, while a score
of zero means there are no common words between the two sets. Zero and one thus act like
Boolean logic, where a comparison results in true or false.
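The zero-to-one overlap score can be sketched as follows; normalizing by the smaller of the two sets is an illustrative choice, not necessarily the one used in the research:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LexicalSimilarity {

    // Overlap of two word sets, normalized to 0..1:
    // 1 = complete overlap, 0 = no common words.
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        Set<String> t1 = new HashSet<>(Arrays.asList("anna", "university", "chennai"));
        Set<String> t2 = new HashSet<>(Arrays.asList("amrita", "university", "coimbatore"));
        System.out.println(overlap(t1, t2)); // 1 common word out of 3
    }
}
```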
CHAPTER - 4
4. IMPLEMENTATION
This chapter describes the implementation of the similarity measurement software. It uses
PHP 5, a Java parser, DOM, XML and standard HTML, and it can be executed on any
standard internet web browser. The interface provides a single point of web page similarity
search for the user, and the step-by-step scenario is illustrated and discussed below.
[Flow diagram: User → Main Page → Compare → Parser → Path Completion → Extraction
→ Cleaning → Lexical phase (comparison of the two URLs by one-to-one mapping) →
Cosine phase (similarity for the two URLs based on many-to-many mapping) → Generate
similar/dissimilar result]
The user enters the main website in order to compare two web pages, where he can
see the details and working scenario of the research work.
The user then identifies the websites to compare and enters the two URLs in the given
text fields; after the Compare button is clicked, the system first checks whether each
given URL is a working webpage. The user's job is now over, and he is ready to view
the similar webpages; the process below is our research work, which is taken care of
by our implementation.
Table 6.1
URL 1: ANNA UNIVERSITY (http://www.annauniv.edu/): 12
URL 2: AMRITA UNIVERSITY (http://www.amrita.edu/): 8
The various objects such as title, keywords, description and metadata play a vital role
in finding similarity. Hence, the Document Object Model, with the help of the Java
library, is used to find the objects stored in the XML and HTML documents of the
given URLs.
There may be many relevant and irrelevant links on a particular webpage/website;
the links the user accesses are not necessarily part of the webpage's relevant data, for
example an advertisement link or a link to another website. The aim of path
completion is to acquire the complete set of relevant or valid sub-links of the
entered URL.
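A minimal sketch of this filtering idea, assuming for illustration that a "valid sub-link" is one on the same host as the entered URL; the real path-completion step may use additional rules:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathCompletion {

    // Keep only the links whose host matches the host of the base URL,
    // discarding external links such as advertisements.
    static List<String> validSubLinks(String baseUrl, List<String> links) {
        List<String> valid = new ArrayList<>();
        try {
            String baseHost = new URI(baseUrl).getHost();
            for (String link : links) {
                try {
                    String host = new URI(link).getHost();
                    if (host != null && host.equals(baseHost)) {
                        valid.add(link);
                    }
                } catch (Exception ignored) {
                    // malformed link: skip it
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("bad base URL: " + baseUrl, e);
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
                "http://www.annauniv.edu/sports/",
                "http://ads.example.com/banner");
        System.out.println(validSubLinks("http://www.annauniv.edu/", found));
    }
}
```

Note that this same-host rule would also drop sub-domains such as aucoe.annauniv.edu; a looser domain-suffix check could be substituted if those should count as valid sub-links.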
Table 6.2
URL 1 | URL 2 | Frequency (%)
http://www.annauniv.edu/ | http://www.amrita.edu/ | 58%
http://www.annauniv.edu/academic_courses/index.html | http://www.amrita.edu/academics | 62%
Once the links or paths of a particular website are identified, we need to extract the
dataset in the form of title, keywords, description and metadata. This is where content
mining, which is very similar to text mining, comes into play and extracts all the data
for the cleaning phase.
Once the data is cleaned, the data set is ready for comparison. The webpage content is
computed with respect to a one-to-one mapping, and lexical similarity helps us to
analyze the two word sets.
Similar to the lexical phase, the two datasets are compared based on a many-to-many
mapping in the cosine similarity phase, which gives the percentage of similarity.
Finally, the similar pages are displayed in the form of snapshots, with a similarity
compliance above 50%.
Table 6.3
URL 1 | URL 2
http://www.svgv.in/academics.php | http://www.vidyaavikas.ac.in/about%20us/Group%20Of%20Institutions.html
http://www.svgv.in/admission.php | http://www.vidyaavikas.ac.in/achievements/achievements.html
CHAPTER - 5
5. EXPERIMENTAL RESULT
Sample Screen Designs
1. WEBSITES TAKEN FOR STUDY
UNIVERSITY: ANNA UNIVERSITY (http://www.annauniv.edu/), AMRITA UNIVERSITY (http://www.amrita.edu/)
SCHOOL: SVGV (http://www.svgv.in/), VIDYA VIKAS (http://www.vidyaavikas.ac.in/)
ONLINE SHOPPING: FLIPKART (http://www.flipkart.com/), NAAPTOL (http://www.naaptol.com/)
http://www.annauniv.edu/academic_courses/index.html
http://www.annauniv.edu/cai13b/Options.php
http://aucoe.annauniv.edu/
http://aucoe.annauniv.edu/stat.html
http://aucoe.annauniv.edu/circular.html
http://aucoe.annauniv.edu/administration.html
http://www.annauniv.edu/centres.php
http://cfr.annauniv.edu/research/index.php
http://www.annauniv.edu/cai13b/
http://www.annauniv.edu/sports/
https://mail.annauniv.edu/mail/src/login.php
http://www.amrita.edu/academics
http://www.amrita.edu/aums
http://www.amrita.edu/campus
http://www.amrita.edu/academics
http://www.amrita.edu/research
http://www.amrita.edu/global
http://www.amrita.edu/events
SVGV (http://www.svgv.in/)
http://www.svgv.in/about_us.php
http://www.svgv.in/academics.php
http://www.svgv.in/admission.php
http://www.svgv.in/events.php
http://www.svgv.in/achivements.php
http://www.svgv.in/contact.php
http://www.svgv.in/careers.php
http://www.svgv.in/facilities.php
http://www.svgv.in/gallery.php
http://www.svgv.in/rules.php
VIDYA VIKAS (http://www.vidyaavikas.ac.in/)
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/index.html
www.vidyaavikas.ac.in/vidyavikass.ac.in/about us/about-us.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/about%20us/Group%20Of%20Institutions.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/achievements/achievements.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Academic/Affiliations.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/sports.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/gallery/gallery.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/contact/contact.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html#
FLIPKART (http://www.flipkart.com/)
http://www.flipkart.com/mobiles
http://www.flipkart.com/computers
http://www.flipkart.com/computers/accessories
http://www.flipkart.com/books
http://www.flipkart.com/ebook
http://www.flipkart.com/household
http://www.flipkart.com/watches
http://www.flipkart.com/household?otracker=hp_nmenu_sub_home-kitchen_0_Home%20%26%20Kitchen%20Needs
http://www.flipkart.com/sports-fitness?otracker=hp_nmenu_sub_more-stores_0_Sports%20%26%20Fitness
http://www.flipkart.com/sports-fitness/outdoor-adventure/
https://www.flipkart.com/s/contact
http://www.flipkart.com/flipkart-first?otracker=hp_ch_vn_flipkart-first
NAAPTOL (http://www.naaptol.com/)
http://www.naaptol.com/shop-online/mobile-phones.html
http://www.naaptol.com/shop-online/computers-peripherals.html
http://www.naaptol.com/shop-online/home-kitchen-appliances.html
http://www.naaptol.com/shop-online/home-decor.html
http://www.naaptol.com/shop-online/automobiles.html
http://www.naaptol.com/shop-online/jewellery-watches.html
http://www.naaptol.com/shop-online/consumer-electronics.html
http://www.naaptol.com/shop-online/cameras.html
http://www.naaptol.com/shop-online/toys-nursery.html
http://www.naaptol.com/shop-online/sports-fitness.html
http://www.naaptol.com/shop-online/gifts.html
http://www.naaptol.com/shop-online/baby-care-maternity.html
http://www.naaptol.com/shop-online/books.html
http://www.naaptol.com/shop-online/footwear-travel-bags.html
http://www.naaptol.com/shop-online/apparels-accessories.html
URL 1: ANNA UNIVERSITY (http://www.annauniv.edu/): 12
URL 2: AMRITA UNIVERSITY (http://www.amrita.edu/): 8
URL 1
http://www.annauniv.edu/
http://www.annauniv.edu/academic_courses/index.html
http://aucoe.annauniv.edu/
http://aucoe.annauniv.edu/circular.html
http://www.annauniv.edu/sports/
URL 2
http://www.amrita.edu/
http://www.amrita.edu/academics
http://www.amrita.edu/aums
http://www.amrita.edu/campus
http://www.amrita.edu/research
http://www.amrita.edu/events
URL 1 | URL 2 | Frequency (%)
http://www.annauniv.edu/ | http://www.amrita.edu/ | 58%
http://www.annauniv.edu/academic_courses/index.html | http://www.amrita.edu/academics | 62%
http://aucoe.annauniv.edu/ | http://www.amrita.edu/aums | 77%
http://aucoe.annauniv.edu/circular.html | http://www.amrita.edu/research | 82%
http://www.annauniv.edu/sports/ | http://www.amrita.edu/events | 59%
URL 1 | URL 2
http://www.annauniv.edu/ | http://www.amrita.edu/
http://www.annauniv.edu/academic_courses/index.html | http://www.amrita.edu/academics
http://aucoe.annauniv.edu/ | http://www.amrita.edu/aums
http://aucoe.annauniv.edu/circular.html | http://www.amrita.edu/research
http://www.annauniv.edu/sports/ | http://www.amrita.edu/events
f) SIMILARITY GRAPH
[Chart: Total Words. Title: URL 1 = 49, URL 2 = 15; Description: URL 1 = 122, URL 2 = 235; Meta Keywords: URL 1 = 46, URL 2 = 614]
[Chart: Match % based on keywords. Title: URL 1 = 56, URL 2 = 53; Description: URL 1 = 64, URL 2 = 73; Meta Keywords: URL 1 = 57, URL 2 = 60]
URL 1: SVGV (http://www.svgv.in/): 11
URL 2: VIDYA VIKAS (http://www.vidyaavikas.ac.in/): 11
URL 1
http://www.svgv.in/
http://www.svgv.in/about_us.php
http://www.svgv.in/academics.php
http://www.svgv.in/admission.php
http://www.svgv.in/events.php
http://www.svgv.in/achivements.php
http://www.svgv.in/contact.php
http://www.svgv.in/careers.php
http://www.svgv.in/facilities.php
http://www.svgv.in/gallery.php
http://www.svgv.in/rules.php
URL 2
http://www.vidyaavikas.ac.in/
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/index.html
www.vidyaavikas.ac.in/vidyavikass.ac.in/about us/about-us.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/about%20us/Group%20Of%20Institutions.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/achievements/achievements.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Academic/Affiliations.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/sports.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/gallery/gallery.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/contact/contact.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html#
Home, Search, About us, Academics, Admission, Events, Achievements, Contacts, Individual
and group, programs, High Level, Education, Global recognition, Positive Stay, Parental
Care, Homely, Ambience, Careers, facilities, gallery, rules, Success, quality, etc., (with a total
of 141 matching words all over the similar/matched webpages).
d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)
URL 1 | URL 2 | Frequency (%)
http://www.svgv.in/academics.php | http://www.vidyaavikas.ac.in/about%20us/Group%20Of%20Institutions.html | 65%
http://www.svgv.in/admission.php | http://www.vidyaavikas.ac.in/achievements/achievements.html | 72%
http://www.svgv.in/events.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/gallery/gallery.html | 57%
http://www.svgv.in/contact.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/contact/contact.html | 68%
http://www.svgv.in/careers.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html | 59%
URL 1 | URL 2
http://www.svgv.in/academics.php | http://www.vidyaavikas.ac.in/about%20us/Group%20Of%20Institutions.html
http://www.svgv.in/admission.php | http://www.vidyaavikas.ac.in/achievements/achievements.html
http://www.svgv.in/events.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/gallery/gallery.html
http://www.svgv.in/contact.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/contact/contact.html
http://www.svgv.in/careers.php | http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond%20Curriculam/Co_curricular.html
f) SIMILARITY GRAPH
[Chart: Total Words. Title: URL 1 = 54, URL 2 = 28; Description: URL 1 = 243, URL 2 = 342; Meta Keywords: URL 1 = 164, URL 2 = 216]
[Chart: Match %. Title: URL 1 = 67, URL 2 = 54; Description: URL 1 = 57, URL 2 = 67; Meta Keywords: URL 1 = 67, URL 2 = 65]
URL 1: FLIPKART (http://www.flipkart.com/): 13
URL 2: NAAPTOL (http://www.naaptol.com/): 16
URL 1
http://www.flipkart.com/mobiles
http://www.flipkart.com/computers
http://www.flipkart.com/books
http://www.flipkart.com/household
http://www.flipkart.com/watches
http://www.flipkart.com/sports-fitness/outdoor-adventure/
URL 2
http://www.naaptol.com/shop-online/mobile-phones.html
http://www.naaptol.com/shop-online/computers-peripherals.html
http://www.naaptol.com/shop-online/books.html
http://www.naaptol.com/shop-online/home-decor.html
http://www.naaptol.com/shop-online/jewellery-watches.html
http://www.naaptol.com/shop-online/sports-fitness.html
c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION
Online Shopping India Shop, Cameras & Accessories, Sports & Fitness, Subscribe, Keep in
touch, latest offers, news & events, check out, Categories Men & Women, Clothing,
Footwear, Travel & Bags, Mobiles Tablets & Computers, Home & Kitchen, Automobiles,
Jewellery & Watches, Consumer Electronics, Cameras & Accessories, Toys & Nursery,
Health & Beauty, Sports & Fitness, Gifts & Stationery, Watches, Shirts, Jeans (with a total of
428 matching words all over the similar/matched webpages).
URL 1 | URL 2 | Frequency (%)
http://www.flipkart.com/mobiles | http://www.naaptol.com/shop-online/mobile-phones.html | 58%
http://www.flipkart.com/computers | http://www.naaptol.com/shop-online/computers-peripherals.html | 62%
http://www.flipkart.com/books | http://www.naaptol.com/shop-online/books.html | 82%
http://www.flipkart.com/household | http://www.naaptol.com/shop-online/home-decor.html | 59%
http://www.flipkart.com/watches | http://www.naaptol.com/shop-online/jewellery-watches.html | 77%
http://www.flipkart.com/sports-fitness/outdoor-adventure/ | http://www.naaptol.com/shop-online/sports-fitness.html | 72%
URL 1 | URL 2
http://www.flipkart.com/mobiles | http://www.naaptol.com/shop-online/mobile-phones.html
http://www.flipkart.com/computers | http://www.naaptol.com/shop-online/computers-peripherals.html
http://www.flipkart.com/books | http://www.naaptol.com/shop-online/books.html
http://www.flipkart.com/household | http://www.naaptol.com/shop-online/home-decor.html
http://www.flipkart.com/watches | http://www.naaptol.com/shop-online/jewellery-watches.html
http://www.flipkart.com/sports-fitness/outdoor-adventure/ | http://www.naaptol.com/shop-online/sports-fitness.html
f) SIMILARITY GRAPH
[Chart: Total Words. Title: URL 1 = 33, URL 2 = 43; Description: URL 1 = 253, URL 2 = 234; Meta Keywords: URL 1 = 123, URL 2 = 154]
[Chart: Match %. Title: URL 1 = 76, URL 2 = 54; Description: URL 1 = 87, URL 2 = 69; Meta Keywords: URL 1 = 67, URL 2 = 87]
URL 1: ANNA UNIVERSITY (http://www.annauniv.edu/): 12
URL 2: SVGV (http://www.svgv.in/): 11
URL 1: http://www.annauniv.edu/sports/
URL 2: http://www.svgv.in/events.php
URL 1 | URL 2 | Frequency (%)
http://www.annauniv.edu/sports/ | http://www.svgv.in/events.php | 8%
URL 1: SVGV (http://www.svgv.in/): 11
URL 2: FLIPKART (http://www.flipkart.com/): 13
URL 1: http://www.svgv.in/contact.php
URL 2: https://www.flipkart.com/s/contact
URL 1 | URL 2 | Frequency (%)
http://www.svgv.in/contact.php | https://www.flipkart.com/s/contact | 38%
CHAPTER - 5
5. PERFORMANCE EVALUATION
This chapter describes the evaluation of the similarity detection system, which allows
extraction of unique words. The similarity detection framework consists of two stages.
First, a linear process extracts the unique words in the document, each exactly once.
Second, the sequences of unique words are matched to check for repetition. A scoring
function based on information theory is used to calculate the number of similar sequences.
The following formula is used:
similarity(X, Y) = log Pr(common(X, Y)) / log Pr(description(X, Y))
The existing system uses this information-theoretic principle for its scoring function; it
performs worst in all the experiments cited in the literature.
Due to the growing amount of information on the web, a method for discovering useful
information from different websites is required to achieve accurate extraction of interesting
information. Current information extraction systems deal with either unstructured
information (static in nature) or structured information. In a recent effort on Twitter text
analysis, a system was designed to analyze and extract information from the contents of
text produced by different communities. The current challenge is how to find useful
messages from different websites whose background information is in a semi-structured
form.
The main aim of the proposed work is to look for interesting information in different text
content that has semi-structured information. As different users may enter the same type of
information, the information may appear in similar form with the same meaning, which
makes it valuable for an information extraction system to identify potential threats or
interests in domains such as fraud detection, product recommendation analysis,
cyber-attacks, terror attacks and healthcare. In the proposed work, a new method is
introduced using the cosine similarity measurement scheme, which identifies the percentage
of similar information between two vectors. The proposed method shows better results than
the existing system.
[Chart: % of Similarity against URL Comparisons for the Existing and Proposed systems]
6. CONCLUSION
This thesis presented a new approach which combines technique from various fields and
adapts to solve the problem of matching title, description, keyword or metadata. The results
show that the suggestions generated are extremely relevant. It has been observed that as the
content and structure size grows the quality of suggestions improve.
The goal of this thesis was to enlarge our understanding of how different groups of websites
use the Web for commercial purposes. Because a solid knowledge base for analyzing Web
similarity is missing, we developed an enhanced framework to analyze and categorize the
capabilities of websites. This framework distinguishes between content and design. The
technique would struggle if the structure of a website is not defined, if a description,
keyword or metadata of a website is not defined, or if the title and content are partially
modified from other websites. This leads to confusion as to whether the webpage belongs to
the current category or falls under the old category from which the original website was
modified.
The result of our study shows the importance of user observations when studying similarity
among websites. Five comparisons were performed for the existing and the proposed
systems. The first comparison shows that the existing system compares the text and reports
only 12% similarity, whereas the proposed system shows that the content and structure of
the webpages match up to 65%, which crosses our similarity compliance of 50%; hence the
webpages are similar under our research method but dissimilar under the existing technique,
which is a breakthrough in our research.
Some future works can include:
Extending the comparison based on the meaning of the words without modifying the
architecture.
Integration with systems like dictionary.com and online synonym content, which would
significantly improve the semantic similarity between keywords and result in finding the
similarity in a better way.
APPENDIX
<?php
session_start();
require_once('database.php');
require_once('library.php');
$error = "";
// checkUser() is defined in library.php (not reproduced in this appendix);
// it validates the posted credentials and returns an error message on failure.
if(isset($_POST['txtusername'])){
$error = checkUser($_POST['txtusername'],$_POST['txtpassword'],$_POST['OfficeName']);
}//if
$sql = "SELECT DISTINCT(off_name)
FROM tbl_offices";
$result = dbQuery($sql);
?>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login</title>
<link href="css/style.css" rel="stylesheet" type="text/css">
<link href="css/mystyle.css" rel="stylesheet" type="text/css">
<script language="javascript">
<!--
function memloginvalidate()
{
if(document.form1.txtusername.value == "")
{
alert("Please enter admin UserName.");
document.form1.txtusername.focus();
return false;
}
if(document.form1.txtpassword.value == "")
{
alert("Please enter admin Password.");
document.form1.txtpassword.focus();
return false;
}
}
-->
</script></head>
<body onLoad="document.form1.txtusername.focus();">
<table id="Outer" bgcolor="#FFFFFF" border="0" cellpadding="0" cellspacing="0" align="center"
width="780">
<tbody><tr>
<td><table id="inner" border="0" cellpadding="3" cellspacing="3" height="500" align="center"
width="96%">
<tbody><tr>
<td>
<link href="css/style.css" rel="stylesheet" type="text/css">
<style type="text/css">
<!--
-->
</style>
<input name="txtusername"
class="forminput" id="txtusername" maxlength="20" type="text"></td>
</tr>
<tr>
<td> <font style="font-size:12px;">Password</font></td>
<td>:</td>
<td><input name="txtpassword" class="forminput" id="txtpassword" maxlength="20"
type="password"></td>
</tr>
<tr>
<td> <font style="font-size:12px;">Office</font></td>
<td>:</td>
<td>
<select name="OfficeName">
<?php
while($data = dbFetchAssoc($result)){
?>
<option value="<?php echo $data['off_name']; ?>"><?php echo $data['off_name']; ?></option>
<?php
}//while
?>
</select>
</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td><input name="Submit" class="green-button" value="Login Now" type="submit"
style="padding:5px 10px;font-weight:bold;"></td>
</tr>
</tbody>
</table>
</form>
</td>
</tr>
</tbody></table></td>
</tr>
<tr>
<td> </td>
</tr>
</tbody></table></td>
<td background="images/boxrightBG.gif"></td>
</tr>
<tr>
<td width="18"><img src="images/boxbtmleftcorner.gif" alt="" height="12" width="18"></td>
<td background="images/boxbtmBG.gif" width="734"></td>
<td width="18"><img src="images/boxbtmrightcorner.gif" alt="" height="12" width="18"></td>
</tr>
</tbody></table>
<br>
<br></td>
</tr>
<tr>
<td><table border="0" cellpadding="0" cellspacing="0" align="center" width="780">
<tbody><tr>
<td bgcolor="#2284d5" height="40" width="476"> </td>
<td bgcolor="#2284d5" width="304"><div align="right"></div></td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table></td>
</tr>
</tbody></table>
</td></tr></tbody></table></body></html>