(Web Mining) Assignment 3
Report 3
Grade: M1
Name: Roberto Espinoza Chamorro
Student ID: 6930-31-1295
1. VisualRank
VisualRank is an adaptation of the well-known PageRank algorithm that applies the idea of
distributing importance among the nodes of a graph to image search. Unlike earlier image
search algorithms, which rely on image metadata (name, link or other text), the algorithm
presented by Google compares the contents of the images themselves to find similarities
between them (the initial paper was presented at the International World Wide Web
Conference in Beijing in 2008 under the title “PageRank for Product Image Search”). The
basic idea of the algorithm is to find the visual themes that are common to a set of
images, and then to select the images that best represent those themes. In image ranking
terms, the goal is to identify the “authority” nodes of an inferred visual similarity graph:
the VisualRank algorithm analyzes the visual link structure among the images and picks the
“authorities”, which are returned as the answers to the image query.
The image search starts with a text query, and an initial set of candidate results is
retrieved with a metadata-based search technique. Local feature vectors are then extracted
from the candidate images using the Scale Invariant Feature Transform (SIFT), and
Locality-Sensitive Hashing (LSH) is applied to these vectors so that matching features can
be found efficiently. Local features that can be used to measure image similarity include
Harris corners, SIFT, Shape Context, and Spin Image, for example. Once the similarities
are calculated, edges are drawn between images with similar contents, weighted by how
similar they are. The more connections an image shares with others, the higher its
importance (and its VisualRank). Finally, the resulting graph structure is used to identify
clusters of similar images and to measure centrality within it, so that the images most
relevant to the query are returned.
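The ranking step described above can be illustrated with a minimal sketch, assuming the
pairwise similarities have already been computed. The function name visual_rank, the
damping value of 0.85, and the four-image similarity matrix are hypothetical choices for
illustration and are not taken from the original paper.

```python
def visual_rank(similarity, damping=0.85, iterations=100):
    """Power iteration on a column-normalized image similarity matrix.

    similarity[i][j] is an assumed, precomputed similarity between
    image i and image j (for example, a count of matched local features).
    """
    n = len(similarity)
    # Each image j splits its score among its neighbours in proportion
    # to similarity, so every column is divided by its sum.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            received = sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            # Damping mixes the propagated score with a uniform prior.
            new_rank.append(damping * received + (1.0 - damping) / n)
        rank = new_rank
    return rank


# Hypothetical data: images 0, 1 and 2 look alike, image 3 is an outlier.
similarity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.0],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
]
scores = visual_rank(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print("most central image:", best)
```

In this toy graph the image with the strongest connections to the others (image 0) ends
up with the highest score, and the outlier (image 3) with the lowest.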
Date: 20/11/2019
2. TrustRank
TrustRank was first described in the 2004 paper “Combating Web Spam with TrustRank”, a
joint work from Stanford University and Yahoo. Google, on the other hand, has its own
patent for a search engine that ranks search results according to a measure of trust.
Yahoo's TrustRank is more focused on finding web spam, while Google's TrustRank changes
the rankings of search results according to a measure of trust associated with the
entities that have provided labels for the documents in the search results. I will focus
on the former concept of the TrustRank algorithm.
Backlinks are an important factor in determining the quality of a web page when returning
results. In that sense, the TrustRank algorithm conducts link analysis to separate useful
web pages from spam and helps search engines rank pages in Search Engine Result Pages
(SERPs). Because many web pages are created with the intention of misleading search
engines, using various techniques to achieve higher-than-deserved rankings, TrustRank
uses a semi-automated process: it needs some human assistance in order to function
properly, since human experts can easily identify spam. The algorithm first selects a
small seed set of pages whose “spam status” is evaluated by a human expert (the oracle
function), who tells the algorithm whether each page is spam (a bad page) or not (a good
page). The algorithm then estimates the status of other pages by extending outward from
the seed set, looking for similarly trustworthy pages. To discover such pages without
invoking the oracle function again, the algorithm relies on the approximate isolation of
the good set, that is, the observation that good pages seldom point to bad ones. Still,
the further a page is from the seed set, the less reliable this inference becomes, so
trust is attenuated with distance, either by trust dampening or by trust splitting.
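The propagation of trust from the seed set can be sketched as a biased PageRank
computation. This is a minimal illustration under stated assumptions, not the
implementation from the paper: the function name trust_rank, the alpha value, the
iteration count, and the five-page toy web are hypothetical, and the oracle's verdicts
are modelled simply as a given set of good seed pages.

```python
def trust_rank(out_links, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust outward from seed pages judged good by a human expert.

    out_links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key of out_links as well.
    """
    pages = list(out_links)
    # Initial trust is spread uniformly over the good seeds only.
    seed = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(seed)
    for _ in range(iterations):
        new_trust = {p: (1.0 - alpha) * seed[p] for p in pages}
        for page, targets in out_links.items():
            for target in targets:
                # Trust splitting: a page divides its trust among its out-links.
                # Trust dampening: alpha shrinks trust at every hop.
                new_trust[target] += alpha * trust[page] / len(targets)
        trust = new_trust
    return trust


# Hypothetical web: good pages never link to spam, but spam links to good_a.
web = {
    "good_a": ["good_b", "good_c"],
    "good_b": ["good_c"],
    "good_c": ["good_a"],
    "spam_x": ["spam_y", "good_a"],
    "spam_y": ["spam_x"],
}
trust = trust_rank(web, good_seeds={"good_a"})
for page, score in sorted(trust.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Because no good page links to the spam pages in this example, they never receive any
trust and stay at zero, which is the approximate isolation property at work.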
3. TextRank
TextRank is a graph-based ranking algorithm, derived from the PageRank algorithm, that is
used for keyword extraction and unsupervised summarization. Like other graph-based ranking
algorithms, it decides the importance of a vertex within the graph based on information
drawn recursively from the entire graph. The idea behind it is similar to that of
PageRank. While PageRank is used for web page ranking, TextRank is used for text ranking:
sentences take the place of web pages, the similarity between any two sentences takes the
place of the web page transition probability, and, like the matrix M used for PageRank, a
square matrix is used in TextRank to store the similarity scores.
The algorithm works as follows. First, all the text contained in the articles is
concatenated and then split into individual sentences. Next, a vector representation is
computed for each sentence, and the similarities between the vectors are stored in a
matrix. The similarity matrix is then converted into a graph, with sentences as vertices
and similarity scores as edge weights; each sentence is linked either to all other
sentences or only to its k most similar sentences. As in PageRank, the more votes that
are cast for a vertex, the higher the importance of the vertex, which in turn determines
how important the votes of that vertex are; the model uses all of this information for
the ranking. Finally, the summary is created from some of the top-ranked sentences.
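The steps above can be sketched in a few lines. This is only an illustration under
stated assumptions: sentences are represented as bag-of-words count vectors compared
with cosine similarity (one of several possible representations), the graph is fully
connected, and the names cosine, text_rank and summarize, the damping value, and the
sample text are all hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def text_rank(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a sentence similarity graph."""
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    # Square similarity matrix; a sentence does not vote for itself.
    sim = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_k=2):
    """Return the top_k highest-ranked sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scores = text_rank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]


text = (
    "Cats chase mice in the garden. Mice hide from cats in the garden. "
    "The stock market fell sharply today. Cats and mice share the garden at night."
)
summary = summarize(text, top_k=2)
print(summary)
```

In this sample the off-topic sentence about the stock market shares almost no words with
the others, so it receives the lowest score and is left out of the summary.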
In the end, what the TextRank algorithm does is find how similar each sentence is to the
rest of a given text, and determine the importance of each sentence according to how
similar it is to all the others.