0% found this document useful (0 votes)

113 views71 pages

Web Page Similarity Draft Final

The document provides an introduction to data mining and web mining. It discusses how data mining involves extracting patterns from large datasets using techniques from artificial intelligence, machine learning, statistics, and databases. Data mining tasks analyze data to extract unknown patterns that can be used for prediction. Web mining extracts useful information from websites through content, structure, and usage mining. Content mining analyzes web page content while structure mining analyzes link structures. The document also introduces the objectives of the proposed system which is to determine the percentage similarity between two web pages by analyzing and comparing their title, keywords, and metadata using a cosine similarity calculation.

Uploaded by

KALAI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

113 views71 pages

Web Page Similarity Draft Final

Uploaded by

KALAI

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 71

CHAPTER - 1

1. INTRODUCTION
1.1 INTRODUCTION TO DATA MINING
Data mining is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to extract information from a data set
and transform it into an understandable structure for further use. Apart from the raw analysis,
it involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
1.1.1 Data Mining Task
The data mining task is the automatic or semi-automatic analysis of large quantities of data to
extract previously unknown interesting patterns such as groups of data records, unusual
records and dependencies. This usually involves using database techniques such as spatial
indices. These patterns can then be seen as a kind of summary of the input data, and may be
used in further analysis or, for example, in machine learning and predictive analytics. The
data mining can identify multiple groups in the data, which can then be used to obtain more
accurate prediction results by a decision support system.
1.1.2 Intrinsic: Text and Metadata Analysis
Metadata extraction is the process of describing extrinsic and intrinsic qualities of the
resource such as document, image, video, etc. As the result of that a number of texts from
webpages are produced which enables efficient search, sort and mining functionalities
provided by the websites. Another source of data for similarity analysis is document metadata
where the information is stored by the data repository about the webpage. Metadata that
might be useful in a similarity query include the author or the creation date of the document.

1.1.3 Extrinsic: Web Link Analysis

Link- or network-analysis algorithms have the prospect of being scalable, language and
media -independent, as well as robust in the face of link-spam and topic complexity A
webpage might contain link to another webpage like advertisement or adwords in form of
Googles AdSense, which is irrelevant for our research. Such links should be discarded since
the similarity between two websites is only studied and not the extrinsic link to another
website though it is relevant or irrelevant.
1.1.4 Data Warehouses
A data warehouse is a relational database that is designed for query and analysis rather than
for transaction processing. It usually contains historical data derived from transaction data,
but it can include data from other sources. It separates analysis workload from transaction
workload and enables an organization to consolidate data from several sources. In addition to
a relational database, a data warehouse environment includes an extraction, transportation,
transformation, and loading solution, an online analytical processing engine, client analysis
tools, and other applications that manage the process of gathering data and delivering it to
business users.
1.2 WEB MINING
Web usage mining is the process of extracting useful information from websites or webpages,
e.g. the process of finding out what data the users are looking for on the Internet. Some users
might be looking at only textual data, whereas some others might be interested in multimedia
data. Web usage mining is the application of data mining techniques to discover interesting
usage patterns in form of text from web data in order to understand and better serve the needs
of web-based applications. The proposed research, Content and Structure based mining is
taken for finding similarity, since usage is not mandatory.

Web Mining

Content

Structure

Usage

Mining

Agent Based

Database

Customized

Psychographic

Figure 1. Web Mining Process

1.2.1 Web Content Mining
Web content mining is the extraction and integration of useful data, information and
knowledge from web page content. When the URL is given as input to the webpage, the
content from the web are extracted. These contents play a vital role in finding the similarity
between two webpage or website.
1.2.2 Web Structure Mining
Web Structure Mining deals with the discovering and modeling the link structure of
the web. This can help in discovering similarity between sites or discovering web
communities. Web structure mining is the process of using graph theory to analyze the node
and connection structure of a web site. According to the type of web structural data, web
structure mining can be divided into two kinds:

Extracting patterns from hyperlinks in the web where a hyperlink is a structural

component that connects the web page to a different location.

Mining the document structure: analysis of the tree-like structure of page structures to
describe HTML or XML tag usage.

Based on the Content and Structure, the similarity is calculated based on the Cosine
Technique with the help of matched words between two webpages. Also, the structure of
webpage is used to find out the metadata and keyword details resulting in comparison
between two such data to identify the structural information of the webpage, which can be
used to compare and analyze the same.

Figure 2. Data Mining Process

Data Selection Process
Resource finding is the process which involves extracting data from either online or
offline text resource available on the web.
Information selection and pre-processing is the automatic selection of particular
information from retrieved web resources. This process transforms the original retrieved data
into information. The transformation could be renewal of stop words, stemming or it may be
aimed for obtaining the desired representation such as finding phrases in the training corpus
and representing the text in the first order logic form.
Generalization automatically discovers specific patterns at individual web sites. Data
Mining techniques and machine learning are used in generalization.
Analysis involves the validation and interpretation of the mined patterns. It plays an
important role in pattern mining.

1.2 INTRODUCTION ABOUT THE PROPOSED SYSTEM

Search Engines are used to search websites that contain duplicate or similar content. A
content of web page could be similar to other content of another webpage. This happens
while copying the template from online providers or copying the code from one website and
modifying the same. The proposed technique helps to determine the percentage of similarity
between two web pages.

URL 1

URL 2

Extracts

Title

Keywor

Meta-

>50%

Similarit
y
Complia

Similar

Cosine Similarity
Measurement

<=50%
Dissimil

Figure 3. Cosine Similarity Framework

The cosine similarity framework discusses how the title, keyword and meta-data are parsed
together to obtain the resultant output, which is similar or dissimilar. The proposed system
takes the title, keyword and meta-data of both the URLs (Uniform Resource Locators to
identify the Website or a Webpage) and calculates similarity based on many-to-many
mapping, such that every content is compared with content of other URL where all the data is
parsed to find the similarity among web pages.

1.3 OBJECTIVE AND MOTIVATION OF RESEARCH

This reasearch deals with similarity measures which plays an important role in the extraction
of textual relations in web to find the similarity among them. For a computer to decide the
similarity, it should understand the semantics of the words. Computer being a syntactic
machine, it cannot understand the semantics. There are various methods proposed to find the
similarity between words from websites. Some of these methods have used the precompiled
concepts based on Web Search Engine. It makes use of snippets returned by the Wikipedia or
any encyclopedia such as Britannica Encyclopedia. The snippets are preprocessed for stop
word removal or cleaning. Similarity measures here are based on the different associative
measures in Information retrieval, namely simple matching and Cosine coefficient. For a
machine to be able to decide the similarity, intelligence is needed. It should be able to
understand the semantics or meaning of the words. But a computer being a syntactic machine,
semantics associated with the words or terms is to be represented as syntax.
Word semantic similarity approaches or metrics can be categorized as where Pre-compiled
database based metrics, metrics consulting only human-built knowledge resources, such as
ontologies, Co-occurrence based metrics using WWW, metrics that assume that the semantic
similarity between words or terms can be expressed by an association ratio which is a
function of their co-occurrence Context based metrics using WWW, metrics that are fully
text-based and understand and utilize the context or proximity of words or terms to compute
semantic similarity. Counting methods consider the length of the paths that link the words, as
well as the word positions in the taxonomic structure. Information content methods compute
similarity between words by combining taxonomic features that exist in the used resource,
e.g., number of counted words, with frequencies computed over textual masses. Semantic
similarity between words changes over time as new words are constantly being created and
new meaning is also being assigned to the existing words. Also there can be a problem with
person name detection and alias detection.
This system includes two situations, namely: Pre-compiled, page counts-based co-occurrence
measures and lexical pattern clusters are used to define features for a word pair.
The research focuses on two important characteristics of Websites: content and structure or
design. Both characteristics are measured by means of title, keyword and meta-data. Fig.4
below provides an overview of the proposed framework.

Figure 4. Overview of Proposed Framework

The Framework explains that the details of content and structure are emphasized before
progressing to the similarity processing phase, which helps in determining the accurate
similarity for a web page. Initially, Web content is extracted and processed from Web page
content for useful data or information or knowledge. There may be some lack of or additional
data that permits much of the expanding information sources on the World Wide Web, such as
hypertext documents, automated discovery, advertisements, other websites links, but they do
not generally provide structural information nor categorize, filter, or interpret documents.
Here comes the web structure into play for analyzing the structure of a web site, since the
content alone can never be considered for similarity analysis. According to the type of web
structural data, web structure mining can be useful in two ways for our research. Primarily,
extracting patterns from hyperlinks in the web. Furthermore, mining the document structure
for analyzing tree-like structure of page structures to describe HTML or XML tag usage.

The drawback namely sub-links are not processed in the existing system have been
considered in our technique, which helps in identifying the exact match of web pages based
on its content as well as structure. Consider that the content of web page is completely
different on two different web sites. This information may not be concluded as a website to
be dissimilar. The Structure of the web page might provide with a set of data like all the
hyperlinks in the web might lead to the same location from both the sites since a hyperlink is
a structural component that connects the web page to a different location. Also, the document
structure described in form of HTML or XML might contain the descriptive or meta-data
information which might be common for both the websites, which is not compared in the
existing system.
The features of the proposed system are as follows:

Percentage or frequency of match is increased.

Number of texts or keywords or meta-words or description taken for consideration is

high which results in higher ranking of similarity occurrence.

Pages are compared in one-to-many correlation.

CHAPTER 2
2. LITERATURE REVIEW
Existing large-scale scanned book collections have many short-comings for data-driven
research; there is variable quality to the lack of accurate descriptive and structural meta-data.
We argue that complementary research in inferring relational metadata is important in its own
right to support use of these collections and that it can help to mitigate other problems with
scanned book collections, where the problem arises in Mining Relational Structure from
Millions of Books. Inspite of these issues, Ismet Zeki Yalniz, Ethem F. Can and R. Manmatha
suggested that Partial Duplicate Detection can be made for large book collections. Hence, this
trick is taken forward to implement the same in finding the duplicates or similarities in Web
Pages. A framework is presented for discovering partial duplicates in large collections of
scanned books. Each book in the collection is represented by the sequence of words, with
respect to the condition that it appears only once in the book. These words are referred to as
unique words and they constitute a small percentage of all the words in a typical book.
Along with the order information the set of unique words provides a compact representation
which is highly descriptive of the content and the flow of ideas in the book. By aligning the
sequence of unique words from two books using the longest common subsequence (LCS) one
can discover whether two books are duplicates. The same idea is incorporated in finding the
similarities between web pages with the help of cosine similarity measurement which can
maintain better quality in terms of matching keywords or meta-words or description than the
existing methods.
Weifeng Z [50] proposed the web similarity is becoming an increasingly severe security threat
in the web domain. Effective and efficient similarity detection is very important for
protecting web users from loss of sensitive private information and even personal properties.
One of the keys of phishing detection is to efficiently search the legitimate web page library
and to find the pages that are the most similar to a suspicious phishing page. Most existing
phishing detection methods are focused on text and image features and have paid very limited
attention to spatial layout characteristics of web pages. In this technique, the author proposes
a novel phishing detection method that makes use of the informative spatial layout
characteristics of web pages. In particular, the author develops two different options to extract
the spatial layout features as rectangle blocks from a given web page. Furthermore, the author
builds an R-tree to index all the spatial layout features of a legitimate page library. As a
9

result, phishing detection based on the spatial layout feature similarity is facilitated by
relevant spatial queries via the R-tree. A series of simulation experiments are conducted to
evaluate our proposals. The results demonstrate that the proposed novel phishing detection
method is effective and efficient.
Rajhans M et.al [35] proposed about adoption of the similarity upper approximation based
clustering of web logs using various similarity distance metrics. The technique shows the
viability of our methodology. Web logs capture the information about web sites as well the
sequence of the visit. Sequence of visit provides an important insight about the behavior of
the user. Rough set, a soft computing technique, deals with vagueness present in data. It
captures the indiscernibility at different levels of granularity. The technique has shown the
results on data set with different similarity measures along with explanation of results.
Christoph K et.al [6] proposed that research explores three SPARQL-based techniques to
solve semantic web tasks that often require similarity measures, such as semantic data
integration, ontology mapping, and semantic web service match making. Our aim is to see
how far it is possible to integrate customized similarity functions into SPARQL to achieve
good results for these tasks. Our first approach exploits virtual triples calling property
functions to establish virtual relations among resources under comparison; the second
approach uses extension functions to alter our resources that do not meet the requested
similarity criteria; externally, our third technique applies new solution modifiers to postprocess a SPARQL solution sequence. The semantics of the three approaches are formally
elaborated and discussed.
Shalini P et.al [42] proposed the current era of web page similarity technology, advancements
and techniques, efficient and effective text document classification is becoming a challenging
and highly required area to capably categorize text documents into mutually exclusive
categories. Fuzzy similarity provides a way to find the similarity of features among various
documents. In this research, a technical review on various fuzzy similarity based models is
given. These models are discussed and compared to frame out their use and necessity. A tour
of different methodologies is provided which is based upon fuzzy similarity related concerns.
It shows that how text and web documents are categorized efficiently into different
categories. Various experimental results of these models are also discussed.

Rajendra L et.al [34] proposed that predicting news about the cosine similarity series are
likely to provides a distinct advantage to news sites and Collaborative filtering is a widely
used technique for the same. The author details an approach within Collaborative Filtering
that uses the Cosine Similarity Function to achieve this purpose. Here author furnished
details about two different approaches customized targeting and article level targeting that
can be used in marketing campaigns. All through history, people have relied on some kind of
observation/advice/feedback in making decisions of any kind. With information in the web
increasing manifold and various options to choose from, customers at times find it difficult to
search and read articles that are of most interest to them. News sites have stepped in to fill
this gap by analyzing customer behavior and recommending articles that customers have high
likelihood to read.
Glen J [13] proposed measuring similarity of objects arises in many applications, and many
domain-specific measures have been developed, e.g., matching text across documents or
computing overlap among item-sets. The author propose a complementary approach,
applicable in any domain with object-to-object relationships that measures similarity of the
structural context in which objects occur, based on their relationships with other objects.
Effectively, the author computes a measure that says two objects are similar if they are
related to similar objects. This general similarity measure, called Sim-Rank, is based on a
simple and intuitive graph-theoretic model. For a given domain, Sim- Rank can be combined
with other domain-specific similarity measures. The author suggested techniques for efficient
computation of Sim-Rank scores, and provide experimental results on two application
domains showing the computational feasibility and effectiveness of our approach.
Mehran S [22] proposed that determining the similarity of short text snippets, such as search
queries, works poorly with traditional document similarity measures (e.g., cosine), since there
are often few, if any, terms in common between two short text snippets. The author address
this problem by introducing a novel method for measuring the similarity between short text
snippets by leveraging web search results to provide greater context for the short texts, such a
similarity kernel function, mathematically analyze some of its properties, and provide
examples of its escacy and also show the use of this kernel function in a large-scale system
for suggesting related queries to search engine users.
Seokkyung C et.al [41] proposed from a statement, given that web page similarity
computations are essential in ontology learning and data mining, WebSim (Web-based term
11

Similarity metric), whose feature extraction and similarity model is based on a conventional
web search engine. There are two main aspects that the author can benefit from utilizing a
web search engine. First, the author can obtain the freshest content for each term that
represents the up-to-date knowledge on the term. This is particularly useful for dynamic
ontology management in that ontologies must evolve with time as new concepts or terms
appear. Second, in comparison with the approaches that use the certain amount of crawled
web documents as corpus, our method is less sensitive to the problem of data sparseness
because the author accesses as much content as possible using a search engine. At the core of
Web Sim, the author present two different methodologies for similarity computation, a
mutual information based metric and a feature-based metric. Moreover, the author shows how
WebSim can be utilized for modifying existing ontologies. Finally, the author demonstrates
the characteristics of WebSim by coupling with word net. Experimental results show that
WebSim can uncover topical relations between terms that are not shown in conventional
concept-based ontologies.
David B [8] proposed a brief survey of web structural similarity algorithms, including the
optimal Tree Edit Distance algorithm and various approximation algorithms. The
approximation algorithms include the simple weighted tag similarity algorithm, Fourier
transforms of the structure, and a new application of the shingle technique to structural
similarity. The author shows three surprising results. First, the Fourier transform technique
proves to be the least accurate of any of approximation algorithms, while also being slowest.
Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering
pages from different sites. Third, the simplest approximation to structure may be the most
effective and efficient mechanism for many applications.
Ding-Yun C [9] proposed a web similarity-based 3D model retrieval system is proposed. This
approach measures the similarity among 3D models by visual similarity, and the main idea is
that if two 3D models are similar, they also look similar from all viewing angles. Therefore,
one hundred orthogonal projections of an object, excluding symmetry, are encoded both by
Zernike moments and Fourier descriptors as features for later retrieval. The visual similaritybased approach is robust against similarity transformation, noise, model degeneracy etc., and
provides 42%, 94% and 25% better performance (precision-recall evaluation diagram) than
three other competing approaches: (1) the spherical harmonics approach developed by Funk
houser et al., (2) the MPEG-7 Shape 3D descriptors, and (3) the MPEG-7 Multiple View
12

Descriptor.

The

proposed

system

the

Web

for

practical

trial

use

(http://3d.csie.ntu.edu.tw), and the database contains more than 10,000 publicly available 3D
models collected from WWW pages. Furthermore, a user friendly interface is provided to
retrieve 3D models by drawing 2D shapes. The retrieval is fast enough on a server with
Pentium IV 2.4GHz CPU, and it takes about 2 seconds and 0.1 seconds for querying directly
by a 3D model and by hand drawn 2D shapes respectively.
Peixiang Z [27] proposed about the issue of similarity computation between entities of an
information network arises and draws extensive research interests. However, to effectively
and comprehensively measure how similar two entities are within an information network is
nontrivial, and the problem becomes even more challenging when the information network to
be examined is massive and diverse. In this research, the author proposes a new similarity
measure, P-Rank, toward effectively computing the structural similarities of entities in real
information networks. P-Rank enriches the well-known similarity measure, SimRank, by
jointly encoding both in-links and out-links relationships into structural similarity
computation. P-Rank is proven to be a united structural similarity framework, under which all
state-of-the-art similarity measures, including CoCitation, Coupling, Amsler and SimRank,
are just its special cases. Based on its recursive nature of P-Rank, the author proposes a fixed
point algorithm to reinforce structural similarity of vertex pairs beyond the localized
neighborhood scope toward the entire information network. Our experimental studies
demonstrate the power of P-Rank as an effective similarity measure in different information
networks. Meanwhile, under the same time or space complexity, P-Rank outperforms
SimRank as a comprehensive and more meaningful structural similarity measure, especially
in large real information networks.
Allan M. Sn [1] proposed that rather than using traditional text analysis to discover web pages
similar to a given page, the author investigate applying link analysis. Since web pages exist in
a link-rich environment that has the potential to relate pages by any property imaginable since
links are not restricted to intrinsic properties of the page text or metadata. In particular, while
web page similarity link analysis has been explored, prior work has deliberately ignored the
explicitly hierarchical host & pathname structure within URLs. To exploit this property, the
author generalize Kleinbergs well-known hubs and authorities HITS algorithm; adapt this
algorithm to accommodate hierarchical link structure; test some sample web queries; and
argue that the results are potentially superior and that the algorithm itself is better motivated.
13

Rudi L. et.al [37] proposed that the Normalized Web Distance (NWD) method to determine
similarity between words and phrases. It is a general way to tap the amorphous low-grade
knowledge available for free on the internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is effectively the largest
semantic electronic database in the world. Moreover, this database is available for all by
using any search engine that can return aggregate page-count estimates for a large range of
search-queries.
Pushpa C [31] proposed semantic similarity measures plays an important role in information
retrieval, natural language processing and various tasks on web such as relation extraction,
community mining, document clustering, and automatic meta-data extraction. In this
technique, the author have proposed a Pattern Retrieval Algorithm [PRA] to compute the
semantic similarity measure between the words by combining both page count method and
web snippets method. Four association measures are used to find semantic similarity between
words in page count method using web search engines. The author use a Sequential Minimal
Optimization (SMO) Support Vector Machines (SVM) to find the optimal combination of
page counts-based similarity scores and top-ranking patterns from the web snippets method.
The SVM is trained to classify synonymous word-pairs and non-synonymous word-pairs. The
proposed approach aims to improve the correlation values, precision, recall, and F-measures,
compared to the existing methods. The proposed algorithm outperforms by 89.8 % of
correlation value.
Isabel F et.al [16] proposed a web page similarity, the author often use phrases like it looks
like a newspaper site, there are several unordered lists or its just a collection of links.
Unfortunately, no web search or classification tools provide the capability to retrieve
information using such informal descriptions that are based on the appearance, i.e., structure,
of the web page. In this technique, the author takes a look at the concept of structurally
similar web pages. The author note that some structural properties can be identified with
semantic properties of the data and provide measures for comparison between HTML
documents.
Mohamed S [24] proposed that measuring similarity between web page using a search engine
based on page counts alone is a challenging task. Search engines consider a document as a
bag of words, ignoring the position of words in a document. In order to measure semantic
similarity between two given words, the author proposes a transformation function for web
14

measures along with a new approach that exploits the documents title attribute and uses page
counts alone returned by web search engines. Experimental results on benchmark datasets
show that the proposed approach outperforms snippets alone methods, achieving a correlation
coefficient up to 71%.
Chaomei Chen [5] proposed a generic approach to structuring and visualising a hypertextbased information space on the web page. This approach, called Generalised Similarity
Analysis (GSA), provides a unifying framework for extracting structural patterns from a
range of proximity data concerning three fundamental relationships in hypertext, namely,
hypertext linkage, content similarity and browsing patterns. GSA emphasizes the integral role
of users interest in dynamically structuring the underlying information space. Pathfinder
networks are used as a natural vehicle for structuring and visualising the rich structure of an
information space by highlighting salient relationships in proximity data. In this technique,
the author use the GSA framework in the study of hypertext documents automatically
retrieved over the internet, including a number of departmental web sites and conference
proceedings on the web page. The author shows that GSA has several distinct features for
structuring and visualising hypertext information spaces. GSA provides some generic tools
for developing adaptive user interfaces to hypertext systems. Link structures derived by GSA
can be used together with dynamic linking mechanisms to produce a number of hypertext
versions of a common information space.
Istvan V [17] proposed that a language model for web page recognition of inputs with a
particular style, using a large-scale web archive. Our target is an open domain web-activated
QA system and our words recognition module must recognize relatively short, domain
independent questions. The central issue is how to prepare a large scale training corpus with
low cost, and the author tackled this problem by combining an existing domain adaptation
method and distributional word similarity. From 500 seed sentences and 600 million Web
pages the author constructed a language model covering 413,000 words. The author achieved
an average improvement of 3.25 points in word error rate over a baseline model constructed
from randomly sampled Web sentences.
Vidya K et.al [48] proposed that semantic similarity aims at providing robust tools for
standardizing the content and delivery of relevant information across communicating
information sources. Most of the times the user gets lots of irrelevant data as a result of
poorly implemented search process. To avoid this, a ranking scheme is proposed, which
15

provides the search result set according to the better understood and correctly interpreted user
query. This is done by considering the relevance of the query by keeping the user view in
mind and also the semantics of the document and the user query. The simple lexical and
syntactical matching usually used by search engines does not extract web documents to the
user expectations. The proposed solution provides the most relevant data to user ranked in
their relevance. The proposed ranking scheme for the semantic web search engine functions
by finding the semantic similarity between the information available on the web and the
query which is specified by the user. This approach considers both the syntactic structure of
the document and the semantic structure of the document and the query. The objective of this
technique is to demonstrate that a semantic similarity based ranking scheme will provide
much better results than those by the prevailing methods. In this technique, an algorithm will
be implemented that provides ranking scheme for the semantic web documents by finding the
semantic similarity between the documents and the query which is specified by the user. The
algorithm considers both syntactical and semantic similarities of the query and categorizes
the search results based on the most probable and most appropriate interpretation of the query
based on various interpretations taking into account all the words and their combinations in
the query.
Yanshan X et.al [53] proposed that in web analysis the Positive and unlabelled learning (PU
learning) has been investigated to deal with the situation where only the positive examples
and the unlabelled examples are available. Most of the previous works focus on identifying
some negative examples from the unlabelled data, so that the supervised learning methods
can be applied to build a classifier. However, for the remaining unlabelled data, which cannot
be explicitly identified as positive or negative (the author call them ambiguous examples),
they either exclude them from the training phase or simply enforce them to either class.
Consequently, their performance may be constrained. This technique proposes a novel
approach, called Similarity-based PU learning (SPUL) method, by associating the ambiguous
examples with two similarity weights, which indicate the similarity of an ambiguous example
towards the positive class and the negative class, respectively. The local similarity-based and
global similarity-based mechanisms are proposed to generate the similarity weights. The
ambiguous examples and their similarity-weights are thereafter incorporated into an SVMbased learning phase to build a more accurate classifier. Extensive experiments on real-world
datasets have shown that SPUL outperforms state-of-the-art PU learning methods.

Nguyen C et.al [25] proposed the use of Vector Space Model (VSM) to represent documents
in the web page similarity, where documents are denoted by a vector in a word vector space.
The standard VSM does not take into account the semantic relatedness between terms. Thus,
terms with some semantic similarity are dealt with in the same way as terms with no semantic
relatedness. Since this unconcern about semantics reduces the quality of clustering results,
many studies have proposed various approaches to introduce knowledge of semantic
relatedness into VSM model. Those approaches give better results than the standard VSM.
However they still have their own issues. The author proposed a new approach as a
combination of two approaches, one of which uses Rough Sets theory and co-occurrence of
terms, and the other uses WordNet knowledge to solve these issues. Experiments for its
evaluation show advantage of the proposed approach over the others.
Ana G [2] proposed about Automatic extraction of similar information from text and links in
Web pages is key to improving the quality of search results. However, the assessment of
automatic semantic measures is limited by the coverage of user studies, which do not scale
with the size, heterogeneity, and growth of the Web. Here the author proposes to leverage
human-generated metadata namely topical directories to measure semantic relationships
among massive numbers of pairs of Web pages or topics. The Open Directory Project
classifies millions of URLs in a topical ontology, providing a rich source from which
semantic relationships between Web pages can be derived. While semantic similarity
measures based on taxonomies (trees) are well studied, the design of well-founded similarity
measures for objects stored in the nodes of arbitrary ontologies is an open problem. This
technique defines an information-theoretic measure of semantic similarity that exploits both
the hierarchical and non-hierarchical structure of ontology. An experimental study shows that
this measure improves significantly on the traditional taxonomy-based approach. This novel
measure allows us to address the general question of how text and link analyses can be
combined to derive measures of relevance that are in good agreement with semantic
similarity.
Sven H [45] proposed a technique for measuring the structural similarity of web page
documents based on entropy. After extracting the structural information from two documents
the author use either ZivLempel encoding or Ziv-Merhav is crossparsing to determine the
entropy and consequently the similarity between the documents. This is the first true lineartime approach for evaluating structural similarity. In an experimental evaluation the author
17

demonstrate that the results of our algorithm in terms of clustering quality are on a bar with
or even better than existing approaches.
Jiahui L et.al [18] proposed a method for measuring web page semantic similarity, an
important type of semantic relationship, between entities. The method is based on Google
Directory, a search interface to the Open Directory Project. Via the search engine, the author
can locate the web pages relevant to an entity and automatically create a profile of the entity
according to the directory assignments of its web pages, which capture various features of the
entity. Using their profiles, the semantic similarity between entities can be measured in
different dimensions. The author applies the semantic similarity measurement to two
knowledge acquisition tasks: thesaurus construction of entities and fine grained
categorization of entities. Our experiments demonstrate that the proposed method works
effectively in these two tasks.
Poonam C et.al [29] proposed on many similar web search engines have been developed like
Ontolook, Swoogle, etc which help in searching meaningful documents presented on
semantic web. In contrast to this the commonly used approach is based on matching
keywords extracted from the document which is known as lexical matching. But there exist
the documents that contains same information but using different words i.e. one document
using a word and other document using synonym of that word. So, when similarity of such
documents is computed through lexical matching it will not give true results of similarity
computation. In this technique the author has proposed a semantic web document similarity
scheme that relies not only on the keywords but on conceptual instances present between the
keywords and also considers the relationships that exists between the concepts present in the
web pages. The author explores all relevant relations between the keywords exploring the
users intention and then calculates the fraction of these relations on each web page to
determine their relevance and similarity with the other documents. The author has found that
this semantic similarity scheme gives better results than those by the prevailing methods.
Phyu Te [28] proposed about the Web is an important source of information retrieval, and the
users accessing the Web are from different backgrounds. The usage information about users
are recorded in web logs. Analyzing web log files to extract useful patterns is called Web
Usage. Web usage similarity approaches include clustering, association rule mining,
sequential pattern mining etc. The web usage similarity approaches can be applied to predict
next page access. In this technique, the author proposed a Page Rank-like algorithm is
18

proposed for conducting web page access prediction. The author extend the use of page rank
algorithm for next page prediction with several navigational attributes, which are the
similarity of the page, size of the page, access-time of the page, duration of the page and
transition(two pages visits sequentially) and frequency of page and transition.
Sridevi.U [44] proposed a methodology for the web page similarity based semantic
annotation of web pages with annotation weighting scheme that takes advantage of the
different relevance of structured document fields. The retrieval model is based on the
importance factors of the structural elements, which are used to re-rank the documents
retrieval by the ontology based distance measure. The relevance concept similarities are
combined with the annotation-weighting scheme to improve the relevance measures. The
proposed method has been evaluated on USGS Science directory collection. Preliminary
experiments results show that our method may generate relevant document in the top rank.
Radha D et.al [33] proposed Phishing is a current social engineering attack that results in
online identity theft. Phishing Web pages generally use similar page layouts, styles (font
families, sizes, and so on), key regions, and blocks to mimic genuine pages in an effort to
convince Internet users to divulge personal information, such as bank account numbers and
passwords. A novel technique to visually compare an assumed phishing page with the
legitimate one is presented. Five important features such as signature extraction, text pieces
and their style, images, URL keywords and the overall appearance of the page as rendered by
the browser are identified and considered. An experimental evaluation using a dataset
collected of 150 real world phishing pages, along with their equivalent legitimate targets has
been performed. The investigational results are satisfactory in terms of false positives and
false negatives and an efficiency rate of about 98.11% for false positive pages and 92.95% for
false negative pages has been obtained.
Douglas L [10] proposed the web page similarity in lexical semantic system is an important
component of human language and cognitive processing. One approach to modeling semantic
knowledge makes use of hand-constructed networks or trees of interconnected word
senses .An alternative approach seeks to model word meanings as high-dimensional vectors,
which are derived from the co-occurrence of words in unlabeled text corpora. This technique
introduces a new vector-space method for deriving word-meanings from large corpora that
was inspired by the HAL and LSA models, but which achieves better and more consistent
results in predicting human similarity judgments. The author explain the new model, known
19

as COALS, and how it relates to prior methods, and then evaluate the various models on a
range of tasks, including a novel set of semantic similarity ratings involving both
semantically and morphologically related terms.
Mara A et.al [21] proposed a functional technique for identifying similar web pages that is
based on measuring tree similarity. The key idea behind the method is to transform each web
page into a compressed, normalized tree that effectively represents its visual structure. In this
technique, the author develops an optimization of this technique that is based on
memorization and that achieves significant improvements in efficiency in both time and
space. This work also presents a tool that implements the proposed technique as well as two
case studies for two real scenarios. Experiments on real documents show that the optimized
algorithm performs significantly better than the original technique and demonstrate the
practicality of our approach.
Sreedevi S [43] proposed the highly increased use of the web similarities comes a significant
demand to provide more reliable web applications. By learning more about the usage and
dynamic behavior of these applications, the author believe that some software development
and maintenance tools can be designed with increased cost-effectiveness. The author
describes our work in analyzing user session data. Particularly, the main contributions of this
technique are the analysis of user session data with concept analysis, an experimental study
of user session data analysis with two different types of web software, and an application of
user session analysis to scalable test case generation for web applications. In addition to
fruitful experimental results, the techniques and metrics themselves provide insight into
future approaches to analyzing the dynamic behavior of web applications.
Krishna N [19] proposed that the objective of this project is to use semantic similarity
techniques to identify the MOOCs [Massive Open Online Courses] offered by the e-learning
websites which are similar to the regular courses offered by the university. Over the last few
years there has been a significant development in the e-learning industry that provides online
courses to the public. Due to the drastic improvement in technology and the internet, this
form of education reaches many people across boundaries. There is vast set of courses
currently provided by various sources, which range from the latest technologies in the field of
computer science to any topic in history. Since the invention of e-learning, there has been
constant improvement of user friendly tools to enhance the learning process. In the span of
the last three years, many websites have come into existence that provides online courses.
20

Some of the best universities in the United States and other universities throughout the world
have also started to provide online courses that students can easily attend. It has become very
difficult for a student to pick the right online course. Hence, applications that can integrate
the courses provide by various e-learning websites like Coursera, Udacity and Edx would be
very helpful. The student can compare the regular courses provided by his or her university
with the courses offered by the e-learning websites and can enroll in similar online courses to
get a better understanding of the subject.
Taher H [47] proposed about finding web pages that are similar to a query page is an
important component of modern search engines. A variety of strategies have been proposed
for answering related pages queries, but comparative evaluation by user studies is
expensive ,especially when large strategy apaches must be searched. The author proposed a
technique for automatically evaluating strategies using web hierarchies, such as open
directory, in place of user feedback. The author applies this evaluation methodology to a mix
of document representation strategies, including the use of text, anchor text and links. The
author proposed the relative advantages and disadvantages of the various approaches
examined. Finally, the author proposed how to efficiently construct a similarity index of our
chosen strategies, and provide sample results from our index.
Danushka B et.al [7] proposed about measuring the semantic similarity between words is an
important component in various tasks on the web such as relation extraction, community
mining, document clustering, and automatic meta data extraction. Despite the usefulness of
semantic similarity measures in these applications, accurately measuring semantic similarity
between two words (or entities) remains a challenging task. The author propose an empirical
method to estimate semantic similarity using page counts and text snippets retrieved from a
web search engine for two words. Specifically, define various word co-occurrence measures
using page counts and integrate those with lexical patterns extracted from text snippets. To
identify the numerous semantic relations that exist between two given words, he propose a
novel pattern extraction algorithm and a pattern clustering algorithm. The optimal
combination of page counts-based on co-occurrence measures and lexical pattern clusters is
learned using support vector machines. The proposed method outperforms various baselines
and previously proposed web-based semantic similarity measures on three benchmark
datasets showing a high correlation with human ratings. Moreover, the proposed method
significantly improves the accuracy in a community mining task.
21

P.V.Praveen [26] proposed an automatic web record extraction extracts a set of objects from
heterogeneous web pages based on similarity measure among objects in an automated
fashion. This classifies a region in the web page according to similar data object which
emerge frequently in it. This involves transformation of unstructured data into structured data
that can be stored and analyzed in a central local database. The existing system develops a
data extraction and alignment method known as Combining Tag and Value Similarity
(CTVS), which identifies the Query Result Records (QRRs) by extracting the data from
query result page and segment them. Those segmented QRRs are aligned into a table where
same attribute data values are put into the same column. This technique is based on the
discovery of non-consecutive data records to detect nested data records in QRRs. Those
attributes in record are aligned using record alignment algorithm by combining the tag and
data value similarity information based on similarity measure. Besides the structure of the
data value is altered when extracting from the webpage. Those changes in template make it
inefficient to properly access them as done in traditional databases. The proposed structural
semantic entropy measures the degree of repeated occurrence of information from DOM tree
representation. This aims to locate the data on web pages depend on unique choice of interest
in extracting the record. This algorithm extracts data from heterogeneous web pages. It is
insensitive to modifications in web-page format which enable to detect false positive rate in
associating the attributes of records with their respective values. Experiments show that this
method achieves higher accuracy than existing methods in automated information extraction.
Rekha R [36] proposed about the advancement of semantic web page similarity and elearning technologies have provided more opportunities to achieve the goal of collaborative
knowledge sharing. It has also facilitated teachers to share their teaching material, tools, and
experiences with others through the medium of internet and web technologies. The author
proposed the need for creating a Distributed Question Bank (DQB) by different experts in the
related fields. The author explored the possibility of using the semantic web technology and
ontology in particular in addressing the issue of question similarity in a DQB. The author
proposes a method of creation of subset ontologies for the questions, and comparing them to
find the overlapping concepts to determine question similarity. The author also tried
formulating a model based on information theoretic approach using this notion of subset
ontology, to measure the similarities among the questions in the data set considered.

Wei L et.al [49] proposed a system on extracting structured data from deep web pages is a
challenging problem due to the underlying intricate structures of such pages. Until now, a
large number of techniques have been proposed to address this problem, but all of them have
inherent limitations because they are web-page-programming-language dependent. As the
popular two-dimensional media, the contents on web pages are always displayed regularly for
users to browse. This motivates us to seek a different way for deep web data extraction to
overcome the limitations of previous works by utilizing some interesting common visual
features on the deep web pages. In this technique, a novel vision-based approach that is webpage programming- language-independent is proposed. This approach primarily utilizes the
visual features on the deep web pages to implement deep web data extraction, including data
record extraction and data item extraction. The author also proposes a new evaluation
measure revision to capture the amount of human effort needed to produce perfect extraction.
Our experiments on a large set of web databases show that the proposed vision-based
approach is highly effective for deep web data extraction.
Eric Mt [12] proposed a novel technique to visually compare similarities a suspected phishing
page with the legitimate one. The goal is to determine whether the two pages are suspiciously
similar. The author identify and consider three page features that play a key role in making a
phishing page look similar to a legitimate one. These features are text pieces and their style,
images embedded in the page, and the overall visual appearance of the page as rendered by
the browser. To verify the feasibility of our approach, the author proposed an experimental
evaluation using a dataset composed of 41 real world phishing pages, along with their
corresponding legitimate targets. Our experimental results are satisfactory in terms of false
positives and false negatives.
Poonam l et.al [29] proposed that in recent years, semantic search for relevant documents on
web has been an important topic of research. Many semantic web search engines have been
developed like Ontolook, Swoogle, etc that helps in searching meaningful documents
presented on semantic web. The concept of semantic similarity has been widely used in many
fields like artificial intelligence, cognitive science, natural language processing, psychology.
To relate entities or texts or documents having same meaning, semantic similarity approach is
used based on matching of the keywords which are extracted from the documents using
syntactic parsing. The simple lexical matching usually used by semantic search engine does
not extract web documents to the user expectations. In this technique, the author have
23

proposed a ranking scheme for the semantic web documents by finding the semantic
similarity between the documents and the query which is specified by the user. The novel
approach proposed in this technique not only relies on the syntactic structure of the document
but also considers the semantic structure of the document and the query. The approach used
here includes the lexical as well as the conceptual matching. The combined use of conceptual,
linguistic and ontology based matching has significantly improved the performance of the
proposed ranking scheme. The author proposed all relevant relations between the keywords
exploring the users intention and then calculate the fraction of these relations on each web
page to determine their relevance with respect to the query provided by the user. The author
has found that this semantic similarity based ranking scheme gives much better results than
those by the prevailing methods.
Yanhong Z et.al [52] proposed that the technique studies about structured data extraction from
Web pages, e.g., online product description pages. Existing approaches to data extraction
include wrapper induction and automatic methods. In this technique, the author proposes an
instance-based learning method, which performs extraction by comparing each new instance
to be extracted with labeled instances. The key advantage of our method is that it does not
need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead,
the algorithm is able to start extraction from a single labeled instance. Only when a new page
cannot be extracted does the page need labeling. This avoids unnecessary page labeling,
which solves a major problem with inductive learning (or wrapper induction), i.e., the set of
labeled pages may not be representative of all other pages. The instance-based approach is
very natural because structured data on the web usually follow some templates and pages of
the same template usually can be extracted using a single page instance of the template. The
key issue is the similarity or distance measure. Traditional measures based on the Euclidean
distance or text similarity are not easily applicable in this context because items to be
extracted from different pages can be entirely different. In this technique, the author proposes
a novel similarity measure for the purpose, which is suitable for template Web pages.
Experimental results with product data extraction from 1200 pages in 24 diverse Web sites
show that the approach is surprisingly effective. It outperforms the state-of-the-art existing
systems significantly.
William w [51] proposed about the measure of the similarity between incomplete rankings
should handle non-conjointness, weight high ranks more heavily than low, and be monotonic
24

with increasing depth of evaluation; but no measure satisfying all these criteria currently
exists. In this article, the author proposes a new measure having these qualities, namely
Rank-Biased Overlap (RBO). The RBO measure is based on a simple probabilistic user
model. It provides monotonicity by calculating, at a given depth of evaluation, a base score
that is non-decreasing with additional evaluation, and a maximum score that is nonincreasing. An extrapolated score can be calculated between these bounds if a point estimate
is required. RBO has a parameter which determines the strength of the weighting to top
ranks. The author extends RBO to handle tied ranks and rankings of different lengths. Finally,
there is an example of the use of the measure in comparing the results produced by public
search engines, and in assessing retrieval systems in the laboratory.
Apostolos K [4] proposed a new page ranking system, which exploits similarity between
interconnected pages. WordRank introduces the model of the biased surfer which is based
on the following assumption: the visitor of a web page tends to visit web pages with similar
content rather than content irrelevant pages. The algorithm modifies the random surfer
model by biasing the probability of a user to follow a link in favor of links to pages with
similar content. It is our intuition that WordRank is most appropriate in topic based searches,
since it prioritizes strongly interconnected pages, and in the same time is more robust to the
multitude of topics and to the noise produced by navigation links. In this technique, the
author proposed preliminary experimental evidence from a search engine the author proposed
for the Greek fragment of the worldwide Web. For evaluation purposes, the author introduce
a new metric (SI score) which is based on implicit user's feedback, but the author also employ
explicit evaluation, where available.
Hila B et.al [15] proposed on Social media sites (e.g., Flickr, YouTube, and Facebook) to
compare the web page similarities are a popular distribution outlet for users looking to share
their experiences and interests on the web. These sites host substantial amounts of usercontributed materials (e.g., photographs, videos, and textual content) for a wide variety of
real-world events of different type and scale. By automatically identifying these events and
their associated user-contributed social media documents, which is the focus of this
technique, the author can enable event browsing and search in state-of-the-art search engines.
To address this problem, the author proposed the rich or context associated with social media
content, including user-provided annotations (e.g., title, tags) and automatically generated
information (e.g., content creation time). Using this rich context, which includes both textual
25

and non-textual features, the author can appropriate document similarity metrics to enable
online clustering of media to events. As a key contribution of this technique, the author
proposed a variety of techniques for learning multi-feature similarity metrics for social media
documents in a principled manner. The author evaluates our techniques on large-scale, real
world datasets of event images from Flickr. Our evaluation results suggest that our approach
identities events and their associated social media documents, more effectively than the stateof-the-art strategies on which the author build.
T.Upender et.al [46] proposed a similarity between words that was concerned with the
syntactic similarity of two strings. Semantic similarity is a confidence score that reflects the
semantic relation between the meanings of two sentences. It is difficult to gain a high
accuracy score because the exact semantic meanings are completely understood only in a
particular context. The goals of the paper are to present you some dictionary-based
algorithms to capture the semantic similarity between two sentences, which is heavily based
on the semantic dictionary. A web search engine is designed to search for information on the
World Wide Web. The search results are generally presented in a line of results often referred
to as Search Engine Results Pages (SERPs). The information may be a specialist in web
pages, images, information and other types of files. Some search engines also mine data
available in databases or open directories. Unlike web directories, which are maintained only
by human editors, search engines also maintain real-time information by running an
algorithm on a web crawler. To identify the numerous semantic relations that exist between
two given words, we propose a novel pattern extraction algorithm and a pattern clustering
algorithm. The optimal combination of page counts-based co-occurrence measures and
lexical pattern clusters is learned using support vector machines.
Ruzhan L [39] proposed on the measurement of word similarity is a foundation work in
semantic computing. In this paper the author propose a method for measuring word similarity
utilizing the definitional words in Machine-Readable Dictionary (MRD). It is noticed that
similar words have similar definitions. So we transform the definition of a word into a vector.
Then the similarity between two words is calculated as the distance between their definition
vectors. To avoid the exponential increase of the definition vectors, two kinds of Basic
Lexical Sets (BLSs) are used to hold the most essential definitional words. One is the set of
sememes in HowNet, and the other is constructed automatically using PageRank algorithm.
Experiment shows that this method achieves competitive results.
26

Rudi L [38] proposed about words and phrases acquire meaning from the way they are used
in society, from their relative semantics to other words and phrases. For computers the
equivalent of society is database, and the equivalent of use is way to search the
database. We present a new theory of similarity between words and phrases based on
information distance and Kolmogorov complexity. To fix thoughts we use the World-WideWeb as database, and Google as search engine. The method is also applicable to other search
engines and databases. This theory is then applied to construct a method to automatically
extract similarity, the Google similarity distance, of words and phrases from the world-wide
web using Google page counts. The World-Wide-Web is the largest database on earth, and the
context information entered by millions of independent users averages out to provide
automatic semantics of useful quality. We give applications in hierarchical clustering,
classification, and language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of books by
English novelists, the ability to understand emergencies, and primes, and we demonstrate the
ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet
database as an objective baseline against which to judge the performance of our method. We
conduct a massive randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in an a mean agreement of 87% with
the expert crafted WordNet categories.
L M Patnaik [20] proposed semantic similarity measures plays an important role in
information retrieval, natural language processing and various tasks on web such as relation
extraction, community mining, document clustering, and automatic meta-data extraction,
proposed a Pattern Retrieval Algorithm [PRA] to compute the semantic similarity measure
between the words by combining both page count method and web snippets method, four
association measures are used to find semantic similarity between words in page count
method using web search engines. We use a Sequential Minimal Optimization (SMO)
Support Vector Machines (SVM) to find the optimal combination of page counts-based
similarity scores and top-ranking patterns from the web snippets method. The SVM is trained
to classify synonymous word-pairs and non synonymous word-pairs. The proposed approach
aims to improve the Correlation values, Precision, Recall, and F-measures, compared to the
existing methods. The proposed algorithm outperforms by 89.8 % of correlation value.

Seema B [40] proposed web services have become a new industrial standard offering interoperability among various platforms but the discovery mechanism is limited to syntactic
discovery only. The framework named AD WebS is proposed for automatic discovery of
semantic web services, which can be considered as an extension to one of the most prevalent
frameworks for semantic web service, WSDL-S. At the first stage, the framework proposes
manual semantic annotations of web service to provide the functional description of the
services in Web Service Description Language(WSDL)s <document> tag. These annotations
are extracted and term-category matrix is formed, where category denotes a class in which a
web service will be added. Next, semantic relatedness between terms and pre-defined
categories i calculated using Normalized Similarity Score (NSS). A nonparametric test,
Kruskal Wallis test, is applied on values generated and based on the test results, services are
put into one or more pre-defined categories. The user or the requestor of the service is
directed to the semantically categorized Universal Description Discovery and Integration
(UDDI) repository for discovery of required service. Experimental results on a dataset
covering multiple Web services of various categories show a significant improvement over
the current state-of-the-art Web service discovery methods.
H.Devaraj et.al [14] proposed on measuring the semantic similarity between two words is an
important component in various tasks on the web such as relation extraction, community
mining, document clustering, and automatic meta-data extraction. Despite the usefulness of
semantic similarity measures in these applications, accurately measuring semantic similarity
between two words (or entities) remains a challenging task. We propose an empirical method
to estimate semantic similarity using page counts and text snippets retrieved from a web
search engine for two words. Specifically, we define various word co-occurrence measures
using page counts and integrate those with lexical patterns extracted from text snippets. To
identify the numerous semantic relations that exist between two given words, we propose a
novel pattern extraction algorithm and a pattern clustering algorithm. The optimal
combination of page counts-based co-occurrence measures and lexical pattern clusters is
learned using support vector machines. The proposed method outperforms various baselines
and previously proposed web-based semantic similarity measures on three benchmark data
sets showing a high correlation with human ratings. Moreover, the proposed method
significantly improves the accuracy in a community mining task.

R. Kotteswari et.al [32] proposed to the web mining is the application of data mining
technology to discover patterns from the web. The various tasks on web such as relation
extraction, community mining, document clustering and automatic metadata extraction. A
previously proposed web-based semantic similarity measures on three benchmark datasets
showing high correlation with human rating. One of the main problems in information
retrieval is to retrieve a set of documents that is semantically related to given user query. We
propose an automatic acquisition method to estimate semantic relation between two words by
using pattern extraction algorithm and sequential clustering algorithm.
Ming-S et.al [23] proposed the web search with double checking model is explore the web as
a live corpus. Five association measures including variants of Dice, Overlap Ratio, Jaccard,
and Cosine, as well as Co- Occurrence Double Check (CODC), are presented. In the
experiments on Rubenstein- Goodenoughs benchmark data set, the CODC measure achieves
correlation coefficient 0.8492, which competes with the performance (0.8914) of the model
using WordNet. The experiments on link detection of named entities using the strategies of
direct association, association matrix and scalar association matrix verify that the doublecheck frequencies are reliable. Further study on named entity clustering shows that the five
measures are quite useful. In particular, CODC measure is very stable on wordword and
name-name experiments. The application of CODC measure to expand community chains
for personal name disambiguation achieves 9.65% and 14.22% increase compared to the
system without community expansion. All the experiments illustrate that the novel model of
web search with double checking is feasible for mining associations from the web.
Anita J et.al [3] proposed on measuring the semantic similarity between words is an
important component in various tasks on the web such as relation extraction, community
mining, document clustering, and automatic metadata extraction. Despite the usefulness of
semantic similarity measures in these applications, accurately measuring semantic similarity
between two words (or entities) remains a challenging task. The author propose an empirical
method to estimate semantic similarity using page counts and text snippets retrieved from a
web search engine for two words. Specifically, we define various word co-occurrence
measures using page counts and integrate those with lexical patterns extracted from text
snippets. To identify the numerous semantic relations that exist between two given words, we
propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal
combination of page counts-based co-occurrence measures and lexical pattern clusters is
29

learned using support vector machines. The proposed method outperforms various baselines
and previously proposed web-based semantic similarity measures on three benchmark data
sets showing a high correlation with human ratings. Moreover, the proposed method
significantly improves the accuracy in a community mining task.
Saravanan [11] proposed about semantic web mining aims at combining the two fastdeveloping research areas semantic web and web wining. This survey analyzes the
convergence of trends. More and more researchers are working on improving the results of
web mining by exploiting semantic structures in the web, and they make use of web mining
techniques for building the semantic web. Last but not least, these techniques can be used for
mining the semantic web itself. The semantic web is the second-generation WWW, enriched
by machine-learning techniques which support the user in his tasks. Given the enormous size
of even todays web, it is impossible to manually enrich all of these resources. Therefore,
automated schemes for learning the relevant information are increasingly being used. We
argue that the two areas web mining and semantic web need each other to fulfill their goals,
but that the full potential of this convergence is not yet realized, gives an overview of where
the two areas meet today, and sketches ways of how a closer integration could be profitable.
By applying lexico-syntactic patterns to the process of ontology design or evolution, we
might derive ontology elements

CHAPTER - 3
3. RESEARCH METHODOLOGY
3.1 METHODOLOGY ANALYSIS
The proposed architecture accepts the webpage URLs as input which contains the
comparison module where the cosine similarity measurement algorithm is applied to compare
the two web pages. This architecture follows a path from the start state to the end state. The
user inputs the web pages in the form of URL in which the similarity is to be detected.
URL 1

URL 2

Web Page

Another Web Page

DOM Parser

Path Completion

Data Extraction

Data Cleaning

Lexical Similarity
Process (Comparing with another web page)

Cosine Similarity
Output
Figure 5. Architecture of Cosine Similarity
31

There are totally four steps involved in determining the similarity for the given websites.
Initially, the URL of the two web pages is taken as input for comparison. A URL, also known
as a web address, particularly when used with HTTP is a specific character or string that
constitutes a reference to a resource or a web page. Most web browsers display the URL of a
web page in an address bar. A typical URL might look like:
http://en.example.org/test/Main_Page.html / php
Next, in order to find the contents and structure of web page, information about the measures
are necessary. A web page is a web document that is suitable for the World Wide Web and the
web browser. The web page displays the text or hypertext or image or audio/video, usually
written in HTML or comparable markup language, whose main distinction is to provide
navigation to other web pages via links. Web browsers coordinate web resources centered on
the written web page, such as style sheets, scripts and images, to present the web page. From
such a webpage, the title, description, keyword and metadata are extracted.
From the dataset, the list of title, description, keyword and metadata are separated as objects
to make the many-to-many mapping to look for the matching words. When matching, there's
an option to store the text frequency. An easier way might be to submit URL 1 as a query and
look for URL 2 in the result set. Here, the cosine similarity is used as a measure of similarity
between two pages of different objects that measures the mapping similarity between them.
Note that these objects apply for any number of mappings, and cosine similarity is most
widely used for measuring multiple mapping scenarios. For example, in Information
Retrieval and text mining, each term is theoretically assigned a different object and a
document is characterized by a vector where the value of each dimension corresponds to the
number of times that term appears in the document. Cosine similarity then gives a useful
measure of how similar two documents are likely to be in terms of their subject matter.
Finally, after using the cosine similarity, similarities in two web pages are displayed using
percentage as,
Similarity (%) =

Matched Words
X 100
Total Number of Words

The web page could have our own compliance ratio for a webpage to be similar, is our
resaerch which calculate as and above. Hence, any similarity below 50% which resulted after
32

comparison from cosine similarity will be declared as dissimilar. If the similarity output is
57%, then the web page considered for comparison are similar.
The Proposed System:
The methodology right from providing URL as input until the similarity measurement is
explained as a step by step process below.
3.1.1 DOM Parser
The Document Object Model is a standard to interact with the objects stored in XML and
HTML documents. All browsers use a model like DOM to render/parse an HTML page.
When a browser renders an HTML document, it parses the document in order to display the
contents, which is the web content mining used in our research. The HTML Parser accesses
and traverses the HTML document. It is also important that the HTML Parser identifies
various HTML tags to find the structure and description of the web page. The HTML parser
also provides various functions to extract specific portions of the HTML content. For this
research we have used Java library File.
Java library
Java library reads the HTML document and parses it similarly to the DOM parser to identify
various objects like title, keyword, description and metadata. These objects are represented in
a structure, and this can be traversed and the required data can be extracted using various
methods provided by Jsoup. The following are some of the methods in Java library to extract
data.
Document doc = Jsoup.connect("http://www.bu.edu/").get();
This above method can be used to render an HTML document and find the required data. The
connect method makes connections to the URL provided and gets method parses the
HTML document. An exception is thrown if there is any error fetching in the HTML page.
Elements getid = doc.select("div[id=top_subsite]");
The select method helps to traverse the HTML document and go to the particular location
where the required data is found. In the above case we traverse to the div tag in the HTML
page where the id attribute is top_subsite

String value = getid.get(1).attr("href")

The get method extracts data from the HTML page. The attr method helps us to select the
value for the attribute href.
String content = getid.get(1).text();
The text method can be used to extract the text portion between the HTML tags.
Elements list = getid.select("a[href]");
for (int i=0;i<list.size();i++)
{
//Extract a particular data
}
The above snippet of the code is used to loop over all the objects in the HTML document
with href attribute and extract data.
3.1.2 Path Completion
There are many important and unimportant links from a particular webpage or website, where
the user accesses that are not being necessarily the webpages relevant data, for example,
advertisement link in a website or a link to another website. The aim of the path completion is
to acquire complete access path by filling up the missing page references. The incomplete
access path is recognized based on the domain.ie, a user requested for a page, which is
unlinked to the last page or moving to a different page irrelevant to the current webpage.
Similarly, if the domain we are dealing with is, say www.sample.com, then the path
completion looks for only the links that comes under the domain named sample and ignores
all the links from the websites that jumps to a different domain.
3.1.3 Data Extraction
Once the links or paths of a particular website are identified, we need to extract the dataset
with the help of identified URLs. This is where the content mining comes into play, which is
very similar to text mining and data mining. Some of the methodologies used in data mining
are used in the web content mining since both deals with information extraction. In web
content mining the data is either structured or semi-structured whereas in data mining the data
is more structured, to extract the content and structural information from the URL. Due to the
exponential growth in web content and its usage, many applications are built to use this data
and present the data in a user-friendly manner. There are various problems involved in
extracting data from the Internet due to lack of a standard structure used by websites to store
34

the data. Since the solution for these problems is very important in various real time
applications, a lot of research is being done in this field. Since each website stores and
presents the data differently, the data we require might not always be present in the same
location, and the same method cannot be used to extract the data across various websites.
Every web page should be analyzed to exactly determine the location of data. Some websites
display information that is not required, which is eliminated in the next phase of our research.
It is critical to extract only the information that is required and scrap the rest of the data
before storing the data in the dataset. Otherwise, it would involve a lot of overhead to process
the data every time it is fetched from the database. Hence, the complete set of data is
extracted and sent to cleaning.
3.1.4 Data Cleaning
Data Cleaning, data cleansing or data scrubbing is the process of detecting and correcting (or
removing) unwanted records from a dataset. In our research, we would eliminate all the
articles, prepositions and conjunctions. Since the word a, an, the, for, under, and, etc., will be
available in all the webpages data, title, description, keyword and metadata, the similarity
percentage calculation would affect when comparing with these kind of words. Used mainly
in databases, the term cleaning refers to identifying incomplete, incorrect, inaccurate,
irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data or
coarse data. Here, we identify the irrelevant and delete the same. After cleaning, a data set
will be consistent with other similar data sets in the system. The actual process of data
cleaning may involve removing typographical errors or validating and correcting values
against a known list of entities. Here, the cleaning only identifies irrelevant words,
considering that the condition that the web page might not contain typographical errors.
3.1.5 Lexical similarity
Once the data is cleaned, the data set is ready for comparison. If the webpage content we
compare is with respect to one-to-one mapping, then lexical similarity would help us analyze
two word sets and identify the measure of similarity between them. In the sense, that the title
of one web page and another is compared, similarly the description, keyword and metadata of
one another. The similarity scores are in the range of zero to one, where one means that there
is a complete overlap between the two word sets and the word sets are very similar to each
other. If the score is zero then there are no common words between the two sets. The zero and
one are simply the Boolean logic, which when compared results in true or false.
35

3.1.6 Cosine similarity

The cosine similarity framework discusses how the title, keyword and meta-data are parsed
together to obtain the resultant output, which is similar or dissimilar. The proposed system
takes the title, keyword and meta-data of both the URLs and calculates similarity based on
many-to-many mapping, such that every content is compared with content of other URL
where all the data is parsed to find the similarity among web pages, i.e., title of one webpage
is compared with title, keyword and metadata of another, followed by the comparison of
keyword and metadata of one webpage compared with all the factors of other webpage.
Cosine similarity, in general, is used to identify the relatedness of two datasets. This
technique is used in various applications related to artificial intelligence, information
retrieval, and natural language processing; here we apply after information retrieval phase.
Similar to the lexical phase, the scores are usually in the scale of zero to one. If the two word
sets are similar then the scores are closer to one and if the score is zero then the word sets are
not similar.
Advantages:

Percentage or Frequency of match is increased

Number of texts or keywords or meta-words or description taken for

consideration is high which results in higher ranking of similarity occurrence.

Pages are compared in one-to-many correlation.

CHAPTER - 4
4. IMPLEMENTATION
The research describes about the implementation for developing the similarity measurement
software. It uses PHP 5, Java Parser, DOM, XML and standard HTML. It will be capable of
executing any on standard internet web browsers, although. The interface will provide a point
of web page similarity search for the user and the step-by-step scenario is figured and
discussed below.
User

Main Page

Compare

Parser

User visits our home

User
page.
enters two URLs for comparison
Fetches the contents from URL

Cleaning

Extraction

Path Completion

Eliminating unwanted data

Fetching
like a, an,
thethe,
dataetc.,
from identified
Findssub-links
sub-links of entered URLs

Lexical Phase

Cosine Similarity Phase

Similar

(if > 50%)

similarity
for two
URL based on many-to-many mapping
Comparison of Generate
two URL by
one-to-one
mapping

Dissimilar

(if <= 50%)

Figure 6. Implementation Scenario

The user enters into the main website in order to compare two web pages, where he

can identify the details and working scenario of the research work.
The user then identifies the websites to compare and enters two URLs in the given
text field, which first identifies whether the given URL is a working webpage after
clicking the Compare button. The users job is over now, and he is ready to view the
37

similar webpages and the process below is our research work which is taken care of
our implementation.
Table 6.1

URL 1
ANNA UNIVERSITY

URL 2
AMRITA UNIVERSITY

(http://www.annauniv.edu/)
12

(http://www.amrita.edu/)
8

The various objects like title, keyword, description and metadata plays a vital role in
finding similarity. Hence, the Document Object Model with the help of Java Library
interacts together to find the objects stored in XML and HTML documents of the

given URL.
There may be many relevant and irrelevant links from a particular webpage/website,
where the user accesses that are not being necessarily the webpages relevant data, for
example, advertisement link in a website or a link to another website. The aim of the
path completion is to acquire the complete set of relevant or valid sub links of the
entered URL.

URL 1

Table 6.2

Frequency

URL 2

(%)

http://www.annauniv.edu/
http://www.annauniv.edu/acade
mic_courses/index.html

58%
62%

http://www.amrita.edu/
http://www.amrita.edu/academic
s

Once the links or paths of a particular website are identified, we need to extract the
dataset in form of title, keyword, description and metadata. This is where the content
mining comes into play, which is very similar to text mining and extracts all the data

required and stores them in the dataset.

Data cleaning is the process of detecting and correcting unwanted records from a
dataset. In this research, we would eliminate all the articles, prepositions and
conjunctions are eliminated, since they arent taken inside the similarity comparison

phase.
Once the data is cleaned, the data set is ready for comparison. The webpage content is
compute with respect to one-to-one mapping and the lexical similarity helps is to
analyze two word sets.

Similar to the lexical phase, the two datasets are compared based on many-to-many

mapping in the cosine similarity phase and gives the percentage of similarity.
Finally, that similar pages are displayed in the form of snapshots with the similarity
compliance above 50%.

Table 6.3

URL 1

URL 2

http://www.svgv.in/academics.php

http://www.vidyaavikas.ac.in/about
%20us/Group%20Of%20Institutions.
html

http://www.svgv.in/admission.php

http://www.vidyaavikas.ac.in/achieve
ments/achievements.html

CHAPTER - 5
5. EXPERIMENTAL RESULT
Sample Screen Designs
1. WEBSITES TAKEN FOR STUDY
UNIVERSITY

ANNA UNIVERSITY (http://www.annauniv.edu/)

AMRITA UNIVERSITY (http://www.amrita.edu/)

SCHOOL

SVGV (http://www.svgv.in/)
VIDYA VIKAS (http://www.vidyaavikas.ac.in/)

ONLINE SHOPPING -

FLIPKART (http://www.flipkart.com/)
NAAPTOL (http://www.naaptol.com/)

2. TOTAL NUMBER OF LINKS/WEBPAGES

ANNA UNIVERSITY (http://www.annauniv.edu/)

http://www.annauniv.edu/academic_courses/index.html
http://www.annauniv.edu/cai13b/Options.php
http://aucoe.annauniv.edu/
http://aucoe.annauniv.edu/stat.html
http://aucoe.annauniv.edu/circular.html
http://aucoe.annauniv.edu/administration.html
http://www.annauniv.edu/centres.php
http://cfr.annauniv.edu/research/index.php
http://www.annauniv.edu/cai13b/
http://www.annauniv.edu/sports/
https://mail.annauniv.edu/mail/src/login.php

AMRITA UNIVERSITY (http://www.amrita.edu/)

http://www.amrita.edu/academics
http://www.amrita.edu/aums
http://www.amrita.edu/campus
http://www.amrita.edu/academics
http://www.amrita.edu/research
http://www.amrita.edu/global
40

http://www.amrita.edu/events

SVGV (http://www.svgv.in/)

http://www.svgv.in/about_us.php
http://www.svgv.in/academics.php
http://www.svgv.in/admission.php
http://www.svgv.in/events.php
http://www.svgv.in/achivements.php
http://www.svgv.in/contact.php
http://www.svgv.in/careers.php
http://www.svgv.in/facilities.php
http://www.svgv.in/gallery.php
http://www.svgv.in/rules.php

VIDYA VIKAS (http://www.vidyaavikas.ac.in/)

http://www.vidyaavikas.ac.in/vidyavikass.ac.in/index.html
www.vidyaavikas.ac.in/vidyavikass.ac.in/about us/about-us.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/about%20us/Group%20Of

%20Institutions.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/achievements/achievements.ht

ml
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Academic/Affiliations.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond

%20Curriculam/sports.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/gallery/gallery.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/contact/contact.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond

%20Curriculam/Co_curricular.html
http://www.vidyaavikas.ac.in/vidyavikass.ac.in/Beyond
%20Curriculam/Co_curricular.html#

FLIPKART (http://www.flipkart.com/)

http://www.flipkart.com/mobiles
http://www.flipkart.com/computers
http://www.flipkart.com/computers/accessories
http://www.flipkart.com/books
http://www.flipkart.com/ebook
http://www.flipkart.com/household
http://www.flipkart.com/watches
41

http://www.flipkart.com/household?otracker=hp_nmenu_sub_home-

kitchen_0_Home%20%26%20Kitchen%20Needs
http://www.flipkart.com/sports-fitness?otracker=hp_nmenu_sub_more-

stores_0_Sports%20%26%20Fitness
http://www.flipkart.com/sports-fitness/outdoor-adventure/
https://www.flipkart.com/s/contact
http://www.flipkart.com/flipkart-first?otracker=hp_ch_vn_flipkart-first

NAAPTOL (http://www.naaptol.com/)

http://www.naaptol.com/shop-online/mobile-phones.html
http://www.naaptol.com/shop-online/computers-peripherals.html
http://www.naaptol.com/shop-online/home-kitchen-appliances.html
http://www.naaptol.com/shop-online/home-decor.html
http://www.naaptol.com/shop-online/automobiles.html
http://www.naaptol.com/shop-online/jewellery-watches.html
http://www.naaptol.com/shop-online/consumer-electronics.html
http://www.naaptol.com/shop-online/cameras.html
http://www.naaptol.com/shop-online/toys-nursery.html
http://www.naaptol.com/shop-online/sports-fitness.html
http://www.naaptol.com/shop-online/gifts.html
http://www.naaptol.com/shop-online/baby-care-maternity.html
http://www.naaptol.com/shop-online/books.html
http://www.naaptol.com/shop-online/footwear-travel-bags.html
http://www.naaptol.com/shop-online/apparels-accessories.html

3.1. COMPARISON STUDY 1 (BETWEEN UNIVERSITIES)

a) TOTAL NUMBER OF LINKS/WEBPAGES TAKEN FOR COMPARISON
URL 1
ANNA UNIVERSITY

URL 2
AMRITA UNIVERSITY

(http://www.annauniv.edu/)
12

(http://www.amrita.edu/)
8

b) LINKS OF SIMILAR WEBPAGES

URL 1
http://www.annauniv.edu/
http://www.annauniv.edu/academic

_courses/index.html
http://aucoe.annauniv.edu/
http://aucoe.annauniv.edu/circular.
html
42

URL 2
http://www.amrita.edu/
http://www.amrita.edu/academics
http://www.amrita.edu/aums
http://www.amrita.edu/campus
http://www.amrita.edu/research
http://www.amrita.edu/events

http://www.annauniv.edu/sports/

c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION

International Conference on, Advanced Computing, Science Fest, is an initiative to foster,
Independence Day was celebrated, Resources Management at Centre, technology and
research, Classes for UG/PG Course, International Students Admissions 2014, University
provides value-based education, etc., (with a total of 228 matching words all over the
similar/matched webpages).

d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)

URL 1

Frequency

URL 2

(%)

http://www.annauniv.edu/
http://www.annauniv.edu/aca
demic_courses/index.html
http://aucoe.annauniv.edu/
http://aucoe.annauniv.edu/cir
cular.html
http://www.annauniv.edu/spo
rts/

58%
62%

77%

82%
59%

http://www.amrita.edu/
http://www.amrita.edu/acade
mics
http://www.amrita.edu/aums
http://www.amrita.edu/resear
ch
http://www.amrita.edu/events

e) SIMILAR WEBPAGES SNAPSHOT

URL 1

URL 2

http://www.annauniv.edu/

http://www.amrita.edu/

http://www.amrita.edu/academics
http://www.annauniv.edu/academic_co
urses/index.html

http://www.amrita.edu/aums
http://aucoe.annauniv.edu/

http://www.amrita.edu/research

http://aucoe.annauniv.edu/circular.html

http://www.amrita.edu/events
http://www.annauniv.edu/sports/

f) SIMILARITY GRAPH

Total Words

Title
60

Title

URL 1

URL 2

Title

URL 1

URL 2

Description
Descriptio
n

400

Description
URL 1

122

URL 2

235

Meta
Keywords
URL 1

URL 2

614

200
0
URL 1 URL 2

Meta Keywords
Match % based on keywords

300

Meta
Keywords

200

Title
URL 1

URL 2

100
0
URL 1

URL 2

Title
100

Description

URL 1

URL 2

Title

URL 1

URL 2

Description
Meta
Keywords
URL 1

URL 2

70
65
60
55
50

Description

URL 1

URL 2

Meta Keywords
68

Meta
Keywords

66
64
URL 1

URL 2

3.2. COMPARISON STUDY 2 (BETWEEN SCHOOLS)

a) TOTAL NUMBER OF LINKS/WEBPAGES TAKEN FOR COMPARISON
URL 1
SVGV

URL 2
VIDYA VIKAS

(http://www.svgv.in/)

(http://www.vidyaavikas.ac.in/)
46

b) LINKS OF SIMILAR WEBPAGES

URL 1
http://www.svgv.in/
http://www.svgv.in/about_us.php
http://www.svgv.in/academics.php
http://www.svgv.in/admission.php
http://www.svgv.in/events.php
http://www.svgv.in/achivements.ph

p
http://www.svgv.in/contact.php
http://www.svgv.in/careers.php
http://www.svgv.in/facilities.php
http://www.svgv.in/gallery.php
http://www.svgv.in/rules.php

URL 2
http://www.vidyaavikas.ac.in/
http://www.vidyaavikas.ac.in/vidy

avikass.ac.in/index.html
www.vidyaavikas.ac.in/vidyavikas

s.ac.in/about us/about-us.html
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/about%20us/Group

%20Of%20Institutions.html
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/achievements/achiev

ements.html
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/Academic/Affiliation

s.html
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/Beyond

%20Curriculam/sports.html
http://www.vidyaavikas.ac.in/vidy

avikass.ac.in/gallery/gallery.html
http://www.vidyaavikas.ac.in/vidy

avikass.ac.in/contact/contact.html
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/Beyond
%20Curriculam/Co_curricular.htm

l
http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/Beyond
%20Curriculam/Co_curricular.htm
l#

c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION

Home, Search, About us, Academics, Admission, Events, Achievements, Contacts, Individual
and group, programs, High Level, Education, Global recognition, Positive Stay, Parental
Care, Homely, Ambience, Careers, facilities, gallery, rules, Success, quality, etc., (with a total
of 141 matching words all over the similar/matched webpages).
d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)
URL 1

Frequency

URL 2

(%)

http://www.svgv.in/academics.php

http://www.svgv.in/admission.php

http://www.svgv.in/events.php

http://www.svgv.in/contact.php

http://www.svgv.in/careers.php

65%

72%

57%

68%

59%

http://www.vidyaavikas.ac.in/
about%20us/Group%20Of
%20Institutions.html
http://www.vidyaavikas.ac.in/
achievements/achievements.ht
ml
http://www.vidyaavikas.ac.in/
vidyavikass.ac.in/gallery/galle
ry.html
http://www.vidyaavikas.ac.in/
vidyavikass.ac.in/contact/cont
act.html
http://www.vidyaavikas.ac.in/
vidyavikass.ac.in/Beyond
%20Curriculam/Co_curricular
.html

e) SIMILAR WEBPAGES SNAPSHOT

URL 1

URL 2

http://www.svgv.in/academics.ph

http://www.vidyaavikas.ac.in/abou
t%20us/Group%20Of
%20Institutions.html

http://www.svgv.in/admission.ph

http://www.vidyaavikas.ac.in/achi
evements/achievements.html

p
http://www.svgv.in/events.php

http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/gallery/gallery.html

http://www.svgv.in/contact.php

http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/contact/contact.html

http://www.svgv.in/careers.php

http://www.vidyaavikas.ac.in/vidy
avikass.ac.in/Beyond
%20Curriculam/Co_curricular.htm

f) SIMILARITY GRAPH

Total Words
Title
URL 1

URL 2

Title
60
Title

40
20
0
URL 1

URL 2

Description
URL 1

243

URL 2

342

Description
400

Description

200

Meta
Keywords
URL 1

164

URL 2

216

0
URL 1 URL 2

Meta Keywords
300

Meta
Keywords

200

Match % based on keywords

100

Title
URL 1

URL 2

0
URL 1

URL 2

Title
100

Title

Description
URL 1

URL 2

0
URL 1

URL 2

Description
70
65
60
55
50

Description

URL 1

Meta
Keywords

URL 2

URL 1

URL 2

Meta Keywords
68

Meta
Keywords

66
64
URL 1

URL 2

3.3. COMPARISON STUDY 3 (BETWEEN ONLINE SHOPPING WEBSITES)

a) TOTAL NUMBER OF LINKS/WEBPAGES TAKEN FOR COMPARISON
URL 1
FLIPKART

URL 2
NAAPTOL

(http://www.flipkart.com/)
13

(http://www.naaptol.com/)
16

b) LINKS OF SIMILAR WEBPAGES

URL 1
http://www.flipkart.com/mobiles
http://www.flipkart.com/computers
http://www.flipkart.com/books
http://www.flipkart.com/household
http://www.flipkart.com/watches
http://www.flipkart.com/sportsfitness/outdoor-adventure/

URL 2
http://www.naaptol.com/shop-

online/mobile-phones.html
http://www.naaptol.com/shop-

online/computers-peripherals.html
http://www.naaptol.com/shop-

online/books.html
http://www.naaptol.com/shop-

online/home-decor.html
http://www.naaptol.com/shop-

online/jewellery-watches.html
http://www.naaptol.com/shop-

online/sports-fitness.html
c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION
Online Shopping India Shop, Cameras & Accessories, Sports & Fitness, Subscribe, Keep in
touch, latest offers, news & events, check out, Categories Men & Women, Clothing,
Footwear, Travel & Bags, Mobiles Tablets & Computers, Home & Kitchen, Automobiles,
Jewellery & Watches, Consumer Electronics, Cameras & Accessories, Toys & Nursery,
Health & Beauty, Sports & Fitness, Gifts & Stationery, Watches, Shirts, Jeans (with a total of
428 matching words all over the similar/matched webpages).

d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)

URL 1

Frequency

URL 2

(%)

http://www.flipkart.com/mobil

es
http://www.flipkart.com/comp

58%

62%

http://www.naaptol.com/shop-

online/mobile-phones.html
http://www.naaptol.com/shop-

uters

online/computers-

http://www.flipkart.com/book

s
http://www.flipkart.com/house

hold
http://www.flipkart.com/watc

82%

hes
http://www.flipkart.com/sports-

59%

peripherals.html
http://www.naaptol.com/shop-

online/books.html
http://www.naaptol.com/shop-

online/home-decor.html
http://www.naaptol.com/shop-

77%

fitness/outdoor-adventure/

online/jewellery-watches.html
http://www.naaptol.com/shop
-online/sports-fitness.html

72%

e) SIMILAR WEBPAGES SNAPSHOT

URL 1

URL 2
http://www.naaptol.com/shop-

http://www.flipkart.com/mobiles

online/mobile-phones.html

http://www.flipkart.com/computers

http://www.naaptol.com/shoponline/computers-peripherals.html

http://www.flipkart.com/books

http://www.naaptol.com/shoponline/books.html

http://www.flipkart.com/household

http://www.naaptol.com/shoponline/home-decor.html

http://www.naaptol.com/shop-

http://www.flipkart.com/watches

online/jewellery-watches.html

http://www.flipkart.com/sports-

http://www.naaptol.com/shoponline/sports-fitness.html

fitness/outdoor-adventure/

f) SIMILARITY GRAPH
Total Words

Title

Title
URL 1

URL 2

Title

0
URL 1

Description
URL 1

253

URL 2

234

URL 2

Description
260

Description

240
220
URL 1

URL 2

Meta
Keywords
URL 1

123

URL 2

154

Meta Keywords
Meta
Keywords

200
100
0
URL 1

Match % based on keywords

Title
URL 1

76
54

URL 2

Title

56
100

Title

50
0
URL 1

Description
URL 1

URL 2

Description
100

Description

Meta
Keywords
URL 1

URL 2

0
URL 1

URL 2

Meta Keywords
Meta
Keywords

100
50
0
URL 1

URL 2

3.4. COMPARISON STUDY 4 (BETWEEN UNIVERSITY AND SCHOOL)

a) TOTAL NUMBER OF LINKS/WEBPAGES TAKEN FOR COMPARISON
URL 1
ANNA UNIVERSITY

URL 2
SVGV

(http://www.annauniv.edu/)
12

(http://www.svgv.in/)
11

b) LINKS OF SIMILAR WEBPAGES

URL 1
http://www.annauniv.edu/sports/

URL 2
http://www.svgv.in/events.php

c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION

Sports Board, rules, Football Federation, Looking, District First, Tennis and Board.
d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)
URL 1

Frequency

URL 2

(%)

http://www.annauniv.edu/spor
ts/

http://www.svgv.in/events.php

e) SIMILAR WEBPAGES SNAPSHOT

Only 8% of match have occurred which violated the 50% similarity rule, hence no similar
pages in the webpages for the comparison between University and School
f) SIMILARITY GRAPH
No Similarity Graph available to display.
3.5.

COMPARISON STUDY 5 (BETWEEN SCHOOL AND ONLINE SHOPPING

WEBSITE)

a) TOTAL NUMBER OF LINKS/WEBPAGES TAKEN FOR COMPARISON

URL 1
SVGV

URL 2
FLIPKART

(http://www.svgv.in/)
11

(http://www.flipkart.com/)
13

b) LINKS OF SIMILAR WEBPAGES

URL 1
http://www.svgv.in/contact.php

URL 2
https://www.flipkart.com/s/contact

c) LIST OF MATCHING WORDS/KEYWORDS/METAWORDS/DESCRIPTION

Signup, Login. Track, Email, Enter Email ID, View, Cancel, Search, Home, Events,
Achievements, Contacts, Careers, facilities, gallery, rules, Our contacts.
d) FREQUENCY OF MATCH (AFTER IMPLEMENTING THE ALGORITHM)
56

URL 1

Frequency

URL 2

(%)

http://www.svgv.in/contact.ph
p

38%

https://www.flipkart.com/s/con
tact

e) SIMILAR WEBPAGES SNAPSHOT

Only 38% of match have occurred which violated the 50% similarity rule, hence no similar
pages in the webpages for the comparison between School and Online Shopping Site.
f) SIMILARITY GRAPH
No Similarity Graph available to display.

CHAPTER - 5
5. PERFORMANCE EVALUATION
The evaluation process done by the similarity detection system has been described that allows
extraction of unique words.
The similarity detection framework consists of two stages. First, a linear process extracted the
unique words in the document once for each. Second, the sequences of unique words are
matched for checking repetition. The score function using information theory has been used
to calculate the number of similar sequence. The following is the formula used:
similarity ( X , Y )=

log Pr ( common ( X , Y ) )
log Pr ( description ( X , Y ))

The existing system has used information theoretic principle for scoring function; It performs
worst in all the experiments cited in the literature.
Due to the growing amount of information in the web world, a method for discovering useful
information from different website is required to achieve accurate extraction of interesting
information.
The current information extraction system deals with either unstructured information (static
in nature) or structured information. In a recent effort on Twitter text analysis, a system has
been designed to analyze and extract information from the contents of text that is produced
from different communities. The recent challenge is on how to find useful messages from
different websites which has background information in a semi-structured form.

The main aim of the proposed work is to look at interesting information from different text
content which has semi-structured information. As the different user may enter the same type
of information, the information may be in similar form with same meaning, which adds value
for information extraction system to identify potential threats or interest in domains such as
fraud detection, product recommendation analysis, cyber-attacks, terror attacks, and
healthcare. In the proposed work, a new method is introduced by uing cosine similarity
measurement scheme which identify the percentage of similar information between two
vectors. The proposed method shows better result than the existing system.
% 80
o
f
S
i
m
i
l
a
r
i
t
y

70
60
50
40

Existing

Proposed

20
10
0
1

URL Comparisons

Existing Reference from paper [18] Measuring web page similarity

Figure 7. Similarity Measurement Graph in terms of %
The figure displays the graph for similarity measurement, when different websites are taken
for comparison based on existing and proposed methodologies. The five comparisons are
performed for the existing and the proposed systems. The two webpages are taken as single
comparison and the percentage shows the similarity level for two different methods. The
graph clearly shows that the content matching never shows the exact result or similarity ratio
which is done in the existing system, whereas the content and structure combined together to
form the better output.

.
59

6. CONCLUSION
This thesis presented a new approach which combines technique from various fields and
adapts to solve the problem of matching title, description, keyword or metadata. The results
show that the suggestions generated are extremely relevant. It has been observed that as the
content and structure size grows the quality of suggestions improve.
The goal of this thesis was to enlarge our understanding, how different groups of websites are
using the Web for commercial purposes. Because a solid knowledge base for analyzing Web
similarity is missing, we developed an enhanced framework to analyze and categorize the
capabilities of Websites. This framework distinguishes between content and design. This
technique would struggle, if structure of a website is not defined, when a description or
keyword or metadata of a website is not defined, title and content is partially modified from
other websites. This leads to confusion whether the webpage belongs to the current category
or falls under the old category from which the original website was modified.
The result of our study shows the importance of user observations when studying about
similarity among websites is convincing. The five comparisons are performed for the existing
and the proposed systems. The first comparison shows that the existing system compares the
text and displays only 12% of similarity, whereas the proposed system shows that the content
and structure of the webpage matches upto 65%, which crossed our similarity compliance of
50% and hence the webpage are similar as per our research method and dissimilar based on
existing technique which is a breakthrough in our research.
Some future works can include:
Extending the comparison based on the meaning of the word without modifying the
architecture.

Integration with systems like dictionary.com and online synonym content would
significantly improve the semantic similarity between these keywords which would
result in finding the similarity in a better way.

REFERENCES
[1]

Allan M. Schiffman, Hierarchy in Web Page Similarity Link Analysis Carnegie

Mellon University & CommerceNet Labs Technical Report, Vol.6, No.02 May 2006.

[2]

Ana G. Maguitman, Algorithmic Detection of Semantic Similarity, MIT Press,

Vol. 30, No. 107, 1998.

[3]

Anita Jadhav, Semantic Similarity Measure using Web Page Count, NGD and
Snippet Based Method, International Journal of Computer Science and Mobile
Computing, Vol.3 Issue.7, July- 2014, Pg. 249-260.

[4]

Apostolos Kritikopoulos, Martha Sideri, Iraklis Varlamis, A Method For Ranking

Web Pages Based On Content Similarity, Journal of ACM , Vol. 46, No. 604, 1999.

[5]

Chaomei Chen, Structuring and Visualising the WWW by Generalised Similarity

Analysis, Commun. ACM, Vol.39, No.106, 2011.

[6]

Christoph Kiefer, Abraham Bernstein, and Markus Stocker, A Virtual Triple

Approach For Similarity-Based Semantic Web Tasks Department of Informatics,
University of Zurich, Vol. 10, No. 707-710, 2008.

[7]

Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka, A Web Search Enginebased Approach to Measure Semantic Similarity between Words, In Proc. Of 14th
COLING, Vol.23, No.539, 2007.

[8]

David Buttler, A Short Survey of Document Structure Similarity Algorithms,

ACM SIGMOD, Vol.16, No.1, 2001.
61

[9]

Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen and Ming Ouhyoung, On Visual
Similarity Based 3D Model Retrieval , EUROGRAPHICS 2003, Vol.22, No.3, 2003.

[10]

Douglas L. T. Rohde ,An Improved Model of Semantic Similarity Based on Lexical

Co-Occurrence, Vol.4, No.67, November 7, 2005.

[11]

Dr. V. Saravanan, Web Mining through Semantic Similarity Measures between Words
Using Page Counts, International Journal of Advanced Research in Computer
Science and Software Engineering, Volume 4, Issue 9, September 2014.

[12]

Eric Medvet DEEI, Visual-Similarity-Based Phishing Detection, Computer Journal,

Vol .7 ,No. 302, 2009.

[13]

Glen Jeh, [email protected] Jennifer Widom, [email protected], A

Measure of Structural-Context Similarity Stanford University,Vol.14, No.198, March
2000.

[14]

H.Devaraj, A Web Search Engine Application to Compare Semantic Equivalence

between Two Words, International Journal of Computer Science and Information
Technologies, Vol. 5 (5), 2014, 6572-6577.

[15]

Hila Becker E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne.,

Learning Similarity Metrics for Event Identification in Social Media, In
Proceedings of the First ACM International Conference on Web Search and Data
Mining (WSDM'08), Vol.34, No. 5, 2008.

[16]

Isabel F. Cruz1, Slava Borisov, Michael A. Marks, and Timothy R.Webb, Measuring
Structural Similarity Among Web Documents:, In ACM SIGMOD Conf.,Vol.97,
No.239, 1997.

[17]

Istvan Varga Kiyonori Ohtake Kentaro Torisawa Stijn De Saeger Teruhisa Misu
Shigeki Matsuda Junichi Kazama, Similarity Based Language Model Construction
for Voice Activated Open-Domain Question Answering, In proceedings of ACL,
Vol.40, No.11, 2001.

[18]

Jiahui Liu and Larry Birnbaum, Measuring Semantic Similarity between Named
Entities by Searching the Web Directory, conference on Computational Linguistics,
Vol. 23, No. 4, 2012.

[19]

Krishna Nitin Tenali, Semantic Similarity Based Information Retrieval As Applied To

Moocs, SJSU Conf, Vol.4, No. 1 , January 2014.

[20]

L M Patnaik, Web Search Engine Based Semantic Similarity Measure Between

Words Using Pattern Retrieval Algorithm, 15th International Conference on World
Wide Web, pp. 377-386, Jan (2006).

[21]

Mara Alpuente and Daniel Romero, A Tool for Computing the Visual Similarity of
Web Pages, 10th annual International Symposium on Applications and the Internet,
Vol.45, No.8, 2010.

[22]

Mehran Sahami, A Webbased Kernel Function for Measuring the Similarity of Short
Text Snippets, Annual International conference on Information retrieval, Vol.21,
No.206, 1999.

[23]

Ming-Shun Lin, Novel Association Measures Using Web Search with Double
Checking, Proceedings of the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the ACL, pages 10091016, Sydney, July
2006.

[24]

Mohamed Said Hamani, Word Semantic Similarity based on documents title, In

proceedings of ECML conf., Vol. 42(1), No.59, 2011.

[25]

Nguyen Chi Thanh and Koichi Yamada , Document Representation and Clustering
with WordNet Based Similarity Rough Set Model, International Journal of Computer
Science Issues, Vol. 8, No. 3, September 2011.

[26]

P.V.Praveen Sundar, Towards Automatic Data Extraction Using Tag and Value
Similarity Based on Structural -Semantic Entropy, International Journal of Advanced
Research in Computer Science and Software Engineering , Vol. 3, No. 4, April 2013.

[27]

Peixiang Zhao Jiawei Han Yizhou Sun, A Comprehensive Structural Similarity

Measure over Information Networks , IL 61801, Vol. 46(5), No.10, 2000.

[28]

Phyu Thwe, Proposed Approach For Web Page Access Prediction Using Popularity
And Similarity Based Page Rank Algorithm, INTERNATIONAL JOURNAL OF
SCIENTIFIC & TECHNOLOGY RESEARCH, Vol. 2, No. 3, MARCH 2013.

[29]

Poonam Chahal , Manjeet Singh , Suresh Kumar, An Ontology Based Approach for
Finding Semantic Similarity between Web Documents, International Journal of
Current Engineering and Technology, Vol. 3, No. 5, December 2013.

[30]

Poonam Chahal, Manjeet Singh, Suresh Kumar, Ranking of Web Documents using
Semantic Similarity, International conference on Information Systems and computer
Networks, Vol.56, No.890 , 2013.

[31]

Pushpa C N, Web search engine based semantic similarity measure between words
using pattern retrieval algorithm, J DOI, Vol.10, No.121, 2013.

[32]

R.Kotteswari, Automatic Acquisition of Similarity between Entities by Using Web

Search, International Journal of Smart Sensors and Ad Hoc Networks (IJSSAN)
ISSN No. 2248-9738 Volume-1, Issue-3, 2012.

[33]

Radha Damodaram, Dr. M.L. Valarmathi, Phishing Detection based on Web Page
Similarity, IJCST, Vol. 2, No, 3, September 2011.

[34]

Rajendra LVN1, Qing Wang2 and John Dilip Raj, Recommending News Articles
using Cosine Similarity Function, Vol.3, No.770, 2014 .

[35]

Rajhans Mishra and Pradeep Kumar Clustering Web Logs Using Similarity Upper
Approximation with Different Similarity Measures International Journal of Machine
Learning and Computing, Vol. 2, No. 3, June 2012.

[36]

Rekha Ramesh, Ontology Based Approach For Finding The Similarity Among
Questions In A Distributed Question Bank Scenario, International Journal of
Computer Science and Applications, Vol. 7, No. 1, 2010.

[37]

Rudi L. Cilibrasi and Paul M.B. Vitanyi,J. Adachi and M. Hasegawa. Normalized
web distance and web similarity, Vol.4, No. 28, 1996.

[38]

Rudi L. Cilibrasi, The Google Similarity Distance, IEEE transactions on knowledge

and data engineering, Vol. 19, No 3, March 2007, 370383.
64

[39]

Ruzhan Lu, Word Similarity Measurement Based on Basic Lexical Set, Journal of
Information & Computational Science 8: 5 (Apr 2011) 799{807).

[40]

Seema Bawa, A Framework for Semantic Discovery of Web Services, Development

and Maintenance. International Journal of Web services Research, vol. 5, pp. 62-80
(March 2008).

[41]

Seokkyung Chung, Jongeun Jun, and Dennis McLeod, A novel Term Similarity
Metric based on a web search technology University of Southern California,
Vol.13(1), No.71, 2004.

[42]

Shalini puri and Sona kaushik , A technical study and analysis on fuzzy similarity
based models for text classification, International Journal of Data Mining &
Knowledge Management Process Vol. 2, No.2, March 2012.

[43]

Sreedevi Sampath, Towards Defining and Exploiting Similarities in Web Application

Use Cases through User Session Analysis, In Int Conf on Soft Eng, Vol.23, No.5,
1997.

[44]

Sridevi.U.K, Ontology based Similarity Measure in Document Ranking,

International Journal of Computer Applications , Vol.1 , No 26, 2010.

[45]

Sven Helmer, Measuring the Structural Similarity of Semistructured, Documents

Using Entropy, Very Large Databases VLDB01, Vol.29, No 23, 2004.

[46]

T.Upender, Measuring Exegetic Semblance between Words International Journal of

Latest Trends in Engineering and Technology (IJLTET), Vol. 2 Issue 1 January 2013.

[47]

Taher H. Haveliwala_, Evaluating Strategies for Similarity Search on the Web,

Proceedings of SIGIR, Vol.49 , No.732, 2001.

[48]

Vidya Kannan, Dr. G.N Srinivasan, Yet another way of Ranking web Documents,
Based On Semantic Similarity, Internatinal Journal of Advanced Research in
Computer and communication Engineering, Vol. 3, No.4, April 2014.

[49]

Wei Liu, Xiaofeng Meng, Member, IEEE, and Weiyi Meng, A Vision-Based
Approach for Deep Web Data Extraction, IEEE Transaction on knowledge and data
engineering , Vol.22, No. 3, 2010.
65

[50]

Weifeng Zhang Web Phishing Detection Based on Page Spatial Layout Similarity
School of Computer, Nanjing University of Posts and Telecommunications, Vol. 37,
No. 231244 , 2013.

[51]

William Webber, A Similarity Measure for Indefinite Rankings, ACM Transaction

on Information System, Vol. 28, No. 4, November 2010.

[52]

Yanhong Zhai and Bing Liu, Extracting Web Data Using Instance-Based Learning,
In SIGMOD04, Vol. 23, No. 203, 2004.

[53]

Yanshan Xiao, Similarity-Based Approach for Positive and Unlabelled Learning, In

proceedings of the Twenty-second International Joint Conference on Artificial
Intellingence, Vol.89, No.60, May 2009.

APPENDIX

A. SAMPLE SOURCE CODE

Below Code is for Login

<?php
session_start();
require_once('database.php');
require_once('library.php');
$error = "";
if(isset($_POST['txtusername'])){
$error = checkUser($_POST['txtusername'],$_POST['txtpassword'],$_POST['OfficeName']);
}//if
require_once('database.php');
$sql = "SELECT DISTINCT(off_name)
FROM tbl_offices";
$result = dbQuery($sql);
?>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login</title>
<link href="css/style.css" rel="stylesheet" type="text/css">
<link href="css/mystyle.css" rel="stylesheet" type="text/css">
<script language="javascript">
<!-function memloginvalidate()
{
if(document.form1.txtusername.value == "")
{
alert("Please enter admin UserName.");
document.form1.txtusername.focus();
return false;
}
if(document.form1.txtpassword.value == "")
{
alert("Please enter admin Password.");
document.form1.txtpassword.focus();
return false;
}
}
-->
</script></head>
<body onLoad="document.form1.txtusername.focus();">
<table id="Outer" bgcolor="#FFFFFF" border="0" cellpadding="0" cellspacing="0" align="center"
width="780">
<tbody><tr>
<td><table id="inner" border="0" cellpadding="3" cellspacing="3" height="500" align="center"
width="96%">
<tbody><tr>
<td>
<link href="css/style.css" rel="stylesheet" type="text/css">
<style type="text/css">
<!--

.style2 {color: #FFFFFF}

-->
</style>
<table border="0" cellpadding="0" cellspacing="0" width="782">
<tbody><tr>
<td colspan="15"><img src="images/trheader.jpg" height="109" width="780"></td>
</tr>
<tr>
<td><div align="center">
<span class="redtext"><strong>
</strong></span><br>
<br>
</div>
<table border="0" cellpadding="0" cellspacing="0" align="center" width="300">
<tbody><tr>
<td width="18"><img src="images/boxtopleftcorner.gif" alt="" height="13" width="18"></td>
<td background="images/boxtopBG.gif" width="734"></td>
<td width="18"><img src="images/boxtoprightcorner.gif" alt="" height="13" width="18"></td>
</tr>
<tr>
<td background="images/boxleftBG.gif"></td>
<td><table border="0" cellpadding="0" cellspacing="0" align="center" width="98%">
<tbody><tr>
<td colspan="2" height="4"></td>
</tr>
<tr>
<td height="18"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr>
<td colspan="3" class="smalltextgrey" width="378">
</td>
</tr>
</tbody></table></td>
</tr>
<tr>
<td><table class="GreenBox" border="0" cellpadding="0" cellspacing="0" align="center"
width="300">
<tbody><tr>
<form name="form1" id="form1" method="post" onSubmit="return memloginvalidate()">
<td><table bgcolor="#FFFFFF" border="0" cellpadding="3" cellspacing="1" width="100%">
<tbody><tr>
<td colspan="3" class="smalltextgrey"> </td>
</tr>
<tr>
<td colspan="3" class="smalltextgrey">
<div class="headtext13"
align="center"><strong>Administrator Login Area </strong></div></td>
</tr>
<tr>
<td colspan="3" height="10">
<font color="#FF0000" style="fontsize:12px;">
<?php echo $error; ?>
</font>
</td>
</tr>
<tr>
<td width="115">    <font style="fontsize:12px;">Username</font></td>
<td width="3">:</td>
<td width="160">

<input name="txtusername"
class="forminput" id="txtusername" maxlength="20" type="text"></td>
</tr>
<tr>
<td>    <font style="font-size:12px;">Password</font></td>
<td>:</td>
<td><input name="txtpassword" class="forminput" id="txtpassword" maxlength="20"
type="password"></td>
</tr>
<tr>
<td>    <font style="font-size:12px;">Office</font></td>
<td>:</td>
<td>
<select name="OfficeName">
<?php
while($data = dbFetchAssoc($result)){
?>
<option value="<?php echo $data['off_name']; ?>"><?php echo $data['off_name']; ?
></option>
<?php
}//while
?>
</select>
</td>
</tr>
<tr>
<td> </td>
<td> </td>
<td><input name="Submit" class="green-button" value="Login Now" type="submit"
style="padding:5px 10px;font-weight:bold;"></td>
</tr>
</tbody>
</table>
</form>
</td>
</tr>
</tbody></table></td>
</tr>
<tr>
<td> </td>
</tr>
</tbody></table></td>
<td background="images/boxrightBG.gif"></td>
</tr>
<tr>
<td width="18"><img src="images/boxbtmleftcorner.gif" alt="" height="12" width="18"></td>
<td background="images/boxbtmBG.gif" width="734"></td>
<td width="18"><img src="images/boxbtmrightcorner.gif" alt="" height="12" width="18"></td>
</tr>
</tbody></table>
<br>
<br></td>
</tr>
<tr>
<td><table border="0" cellpadding="0" cellspacing="0" align="center" width="780">
<tbody><tr>
<td bgcolor="#2284d5" height="40" width="476"> </td>
<td bgcolor="#2284d5" width="304"><div align="right"></div></td>
</tr>
</tbody></table>
</td>
</tr>

</tbody></table></td>
</tr>
</tbody></table>
</td></tr></tbody></table></body></html>

Below code is to compare two web pages

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Courier / Cargo Tracking Script in PHP - Ver 0.97</title>
<link href="css/mystyle.css" rel="stylesheet" type="text/css">
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" align="center" width="900">
<tbody><tr>
<td width="900">
<?php include("header.php"); ?> </td>
</tr>
<tr>
<td bgcolor="#FFFFFF"><table border="0" cellpadding="1" cellspacing="1" align="center"
width="98%">
<tbody><tr>
<td class="Partext1"> </td>
</tr>
<tr>
<td class="Partext1"><div align="center">
<table cellpadding="4" cellspacing="0" align="center" width="70%">
<script language="javascript">
function validate()
{
if (form.Consignment.value == "" )
{
alert("Consignment No is required.");
form.track.focus( );
return false;
}
}
</script>
<tbody><tr>
<td class="TrackTitle" valign="top"><div class="newtext" align="center"></div></td>
</tr>
<tr>
<td class="bottom" valign="middle"> </td>
</tr>
<tr bgcolor="EFEFEF">
<td valign="top">Courier is added successfully.<br/>
</td>
</tr>
<tr bgcolor="EFEFEF">
<td class="TrackNormalBlue" bgcolor="#FFFFFF" valign="top"> </td>
</tr>
</tbody></table>
</div></td>
</tr>

IBPS SO IT Officers Practice Sets All Topics
No ratings yet
IBPS SO IT Officers Practice Sets All Topics
107 pages
TRANSFORMERS Core Size
100% (1)
TRANSFORMERS Core Size
3 pages
UNIT - 3 Final
No ratings yet
UNIT - 3 Final
37 pages
Bda Class - Feb 7th
No ratings yet
Bda Class - Feb 7th
28 pages
Web Mining: Day-Today: International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
No ratings yet
Web Mining: Day-Today: International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
4 pages
Uneditable - M.sc. CS Sem-II Web Data Analytics
No ratings yet
Uneditable - M.sc. CS Sem-II Web Data Analytics
93 pages
Webminingtextmining 160906165305
No ratings yet
Webminingtextmining 160906165305
18 pages
Module1PartAweb Mining-Intro
No ratings yet
Module1PartAweb Mining-Intro
28 pages
Web Mining
100% (3)
Web Mining
28 pages
EB Ining: Dvanced Opics
0% (1)
EB Ining: Dvanced Opics
48 pages
3.Eng-A Survey On Web Mining
No ratings yet
3.Eng-A Survey On Web Mining
8 pages
Data Mining-World Wide Web
No ratings yet
Data Mining-World Wide Web
4 pages
Data Mining. Mining WWW.: Sonali. Parab
No ratings yet
Data Mining. Mining WWW.: Sonali. Parab
25 pages
Introduction To Web Mining
No ratings yet
Introduction To Web Mining
20 pages
Web Mining MMMUT NOTES
No ratings yet
Web Mining MMMUT NOTES
5 pages
Web Mining Notes
100% (1)
Web Mining Notes
8 pages
Web Mining and Text Mining
No ratings yet
Web Mining and Text Mining
65 pages
Web Mining
No ratings yet
Web Mining
73 pages
Web Usage Mining
No ratings yet
Web Usage Mining
13 pages
19 Web Mining 2
No ratings yet
19 Web Mining 2
41 pages
Unit 4 (DWDM)
No ratings yet
Unit 4 (DWDM)
27 pages
Unit 7: Web Mining and Text Mining
No ratings yet
Unit 7: Web Mining and Text Mining
13 pages
Web Mining and Knowledge Discovery of Usage Patterns: CS 748T Project (Part I)
No ratings yet
Web Mining and Knowledge Discovery of Usage Patterns: CS 748T Project (Part I)
25 pages
Overview of Web Data Mining and Applications: Bamshad Mobasher Depaul University
No ratings yet
Overview of Web Data Mining and Applications: Bamshad Mobasher Depaul University
25 pages
Web Mining
No ratings yet
Web Mining
28 pages
Web Mining
No ratings yet
Web Mining
13 pages
DMDW-Unit V
No ratings yet
DMDW-Unit V
13 pages
Research Proposal On Distinct Study and Significant of Search Techniques in Web Mining
No ratings yet
Research Proposal On Distinct Study and Significant of Search Techniques in Web Mining
5 pages
Analysis of Web Usage Mining: International Journal of Application or Innovation in Engineering & Management (IJAIEM)
No ratings yet
Analysis of Web Usage Mining: International Journal of Application or Innovation in Engineering & Management (IJAIEM)
7 pages
Data Mining
No ratings yet
Data Mining
12 pages
Unit - 5
No ratings yet
Unit - 5
12 pages
Wdm-Unit I
No ratings yet
Wdm-Unit I
70 pages
Web Mining
No ratings yet
Web Mining
42 pages
Web Mining: Created By
No ratings yet
Web Mining: Created By
11 pages
Web Mining: by Saumil Shah Roll No: 46 Mca 4 Sem
No ratings yet
Web Mining: by Saumil Shah Roll No: 46 Mca 4 Sem
28 pages
Webmining I
No ratings yet
Webmining I
69 pages
Web Mining
No ratings yet
Web Mining
34 pages
Business Data Mining Week 13
No ratings yet
Business Data Mining Week 13
15 pages
Web and Text Mining
No ratings yet
Web and Text Mining
73 pages
Data Mining Unit 5
No ratings yet
Data Mining Unit 5
36 pages
Unit V - Web and Text Mining
No ratings yet
Unit V - Web and Text Mining
35 pages
Unit 3 DMW
No ratings yet
Unit 3 DMW
31 pages
Study On Web Designing
No ratings yet
Study On Web Designing
8 pages
Artificial Intelligence and Innovative A
No ratings yet
Artificial Intelligence and Innovative A
9 pages
Web Mining: Presented By: Vikash Kumar
No ratings yet
Web Mining: Presented By: Vikash Kumar
24 pages
6 WebMining
No ratings yet
6 WebMining
45 pages
Spatial & Web Mining
100% (1)
Spatial & Web Mining
45 pages
Web Mining
No ratings yet
Web Mining
42 pages
Webminingtextmining 160906165305
No ratings yet
Webminingtextmining 160906165305
17 pages
Web Data Mining - 5
No ratings yet
Web Data Mining - 5
14 pages
Webmininglec
100% (1)
Webmininglec
75 pages
Webmining I
No ratings yet
Webmining I
69 pages
Data Mining
No ratings yet
Data Mining
80 pages
A Study On Different Aspects of Web Mining and Research Issues
No ratings yet
A Study On Different Aspects of Web Mining and Research Issues
8 pages
Advanced-Applications
No ratings yet
Advanced-Applications
54 pages
Unit 7 - Advanced Application
No ratings yet
Unit 7 - Advanced Application
5 pages
Web Mining Using Artificial Ant Colonies: A Survey
No ratings yet
Web Mining Using Artificial Ant Colonies: A Survey
6 pages
QU PPT Format
No ratings yet
QU PPT Format
12 pages
Web Usage Mining: Discovery and Applications of Usage Patterns From Web Data
No ratings yet
Web Usage Mining: Discovery and Applications of Usage Patterns From Web Data
12 pages
Web Mining U-1,2
No ratings yet
Web Mining U-1,2
15 pages
Automatic Image Annotation: Fundamentals and Applications
From Everand
Automatic Image Annotation: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
From Everand
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
Fouad Sabry
No ratings yet
Hindi - 1
No ratings yet
Hindi - 1
2 pages
E-Way Bill System
No ratings yet
E-Way Bill System
1 page
New Microsoft Word Document
No ratings yet
New Microsoft Word Document
4 pages
Image-To Type
No ratings yet
Image-To Type
2 pages
English - 3
No ratings yet
English - 3
2 pages
Care of Self
No ratings yet
Care of Self
9 pages
Namma Kalvi General English - Reference Book Cum Work Sheets
No ratings yet
Namma Kalvi General English - Reference Book Cum Work Sheets
144 pages
Transformer Test Report - Calculation Transformer Specification
100% (1)
Transformer Test Report - Calculation Transformer Specification
2 pages
Comparing Numbers
No ratings yet
Comparing Numbers
5 pages
Vegetables Sample
No ratings yet
Vegetables Sample
3 pages
Counting Backwards
No ratings yet
Counting Backwards
5 pages
Kurukshetra University Kurukshetra: Vice-Chancellor: Vc@kuk - Ac.in Registrar: Proctor: Proctor@kuk - Ac.in
No ratings yet
Kurukshetra University Kurukshetra: Vice-Chancellor: Vc@kuk - Ac.in Registrar: Proctor: Proctor@kuk - Ac.in
9 pages
Color by Numbers
No ratings yet
Color by Numbers
7 pages
Kalaivani - Batch 1 - sh10
100% (1)
Kalaivani - Batch 1 - sh10
342 pages
Homoeopathic Medical College & Hospital,: H.K.E.Society's, Dr. Maalakaraddy Kalaburagi-05
No ratings yet
Homoeopathic Medical College & Hospital,: H.K.E.Society's, Dr. Maalakaraddy Kalaburagi-05
7 pages
Shab Child Rearing
No ratings yet
Shab Child Rearing
6 pages
Dynamic Data Integration
No ratings yet
Dynamic Data Integration
2 pages
Logeshwari - Sample
No ratings yet
Logeshwari - Sample
36 pages
Ges - Material Check List Customer: Capacity: Input Range:240-480V Delivery Date
No ratings yet
Ges - Material Check List Customer: Capacity: Input Range:240-480V Delivery Date
2 pages
Transformer Test Report - Calculation Transformer Specification
100% (1)
Transformer Test Report - Calculation Transformer Specification
2 pages
2018 - 4 - List of Recognised ANM, GNM, B.Sc. Nursing, Post Basic B.SC - Nursing and M.Sc. Nursing Schools-Colleges in Govt. and Pvt. Sector - in The Himachal Pradesh
No ratings yet
2018 - 4 - List of Recognised ANM, GNM, B.Sc. Nursing, Post Basic B.SC - Nursing and M.Sc. Nursing Schools-Colleges in Govt. and Pvt. Sector - in The Himachal Pradesh
16 pages
Ges - Material Check List Customer: Capacity: Input Range:110V Delivery Date
No ratings yet
Ges - Material Check List Customer: Capacity: Input Range:110V Delivery Date
2 pages
Logeshwari - Batch 1
No ratings yet
Logeshwari - Batch 1
36 pages
Klaai - Batch 27
No ratings yet
Klaai - Batch 27
66 pages
Kalaivani - Batch 6
No ratings yet
Kalaivani - Batch 6
24 pages
Transformers and Choke Lamination Standard Types Singal Phase E I
100% (7)
Transformers and Choke Lamination Standard Types Singal Phase E I
3 pages
Kalaivani - Batch 19
No ratings yet
Kalaivani - Batch 19
39 pages
E&I Lamination Core
33% (6)
E&I Lamination Core
6 pages
Kalai - Batch 28
No ratings yet
Kalai - Batch 28
64 pages
Data Mining Notes
No ratings yet
Data Mining Notes
8 pages
I Ji Scs 02222013
No ratings yet
I Ji Scs 02222013
5 pages
Sales Analysis of E-Commerce Websites Using Data M
No ratings yet
Sales Analysis of E-Commerce Websites Using Data M
6 pages
Data Mining: Concepts and Techniques: - Introduction
No ratings yet
Data Mining: Concepts and Techniques: - Introduction
44 pages
An Improved Fuzzy Clustering Technique For User's Browsing Behaviors
No ratings yet
An Improved Fuzzy Clustering Technique For User's Browsing Behaviors
4 pages
Bose 2008
No ratings yet
Bose 2008
20 pages
Sijash Feb2018 Mangayarkarasicollege PDF
No ratings yet
Sijash Feb2018 Mangayarkarasicollege PDF
243 pages
Learning Objectives: Online Consumer Behavior, Market Research, and Advertisement
No ratings yet
Learning Objectives: Online Consumer Behavior, Market Research, and Advertisement
21 pages
Shekhar - Khadka Assignment
No ratings yet
Shekhar - Khadka Assignment
11 pages
05b.BDA (18CS72) Module-5 Text Mining
No ratings yet
05b.BDA (18CS72) Module-5 Text Mining
23 pages
Intelligent Web Applications: (Part 1)
No ratings yet
Intelligent Web Applications: (Part 1)
36 pages
Ijesat 2012 02 Si 01 12
No ratings yet
Ijesat 2012 02 Si 01 12
5 pages
Data Mining Cat
No ratings yet
Data Mining Cat
6 pages
Web Mining: Based On Tutorials and Presentations
No ratings yet
Web Mining: Based On Tutorials and Presentations
101 pages
Data Mining Concepts
No ratings yet
Data Mining Concepts
35 pages
Web Mining Report
100% (2)
Web Mining Report
46 pages
Introduction To Data Mining
75% (4)
Introduction To Data Mining
45 pages
Web Mining App and Tech2 PDF
No ratings yet
Web Mining App and Tech2 PDF
443 pages
Web Analytics Tutorial
No ratings yet
Web Analytics Tutorial
29 pages
Foundations of Business Intelligence: Databases and Information Management
No ratings yet
Foundations of Business Intelligence: Databases and Information Management
44 pages
Big Data Analytics 1500
No ratings yet
Big Data Analytics 1500
7 pages
2019 Book DataScienceAndBigDataAnalytics
100% (15)
2019 Book DataScienceAndBigDataAnalytics
418 pages
Introduction To Information Technology Turban, Rainer and Potter John Wiley & Sons, Inc
No ratings yet
Introduction To Information Technology Turban, Rainer and Potter John Wiley & Sons, Inc
38 pages
000099998888
No ratings yet
000099998888
10 pages
Jayamukhi Institute of Technological Sciences (Autonomous) M.Tech. (Software Engineering) Course Structure and Syllabus I Year - I Semester
No ratings yet
Jayamukhi Institute of Technological Sciences (Autonomous) M.Tech. (Software Engineering) Course Structure and Syllabus I Year - I Semester
70 pages
Analysis of Web Server Log Files
No ratings yet
Analysis of Web Server Log Files
8 pages