Chapter 06
Why PageRank?
PageRank is used to help determine the order in which search engine results are presented.
PageRank is important as it helps Google determine the value of a page
relative to other similar pages on the web. Among other factors, pages with
higher PageRank have higher chances of ranking.
Introduction
Despite this many people seem to get it wrong! In particular “Chris Ridings of
www.searchenginesystems.net” has written a paper entitled “PageRank Explained:
Everything you’ve always wanted to know about PageRank”, pointed to by many
people, that contains a fundamental mistake early on in the explanation!
Unfortunately this means some of the recommendations in the paper are not quite
accurate.
By showing code to correctly calculate real PageRank I hope to achieve several
things in this response:
Any good web designer should take the time to fully understand how PageRank
really works - if you don’t then your site’s layout could be seriously hurting your
Google listings!
[Note: I have nothing in particular against Chris. If I find any other papers on the
subject I’ll try to comment evenly]
PageRank is also displayed on the toolbar of your browser if you’ve installed the
Google toolbar (http://toolbar.google.com/). But the Toolbar PageRank only goes
from 0 to 10 and seems to be something like a logarithmic scale:
We can’t know the exact details of the scale because, as we’ll see later, the
maximum PR of all pages on the web changes every month when Google does its
re-indexing! If we presume the scale is logarithmic (although there is only
anecdotal evidence for this at the time of writing) then Google could simply give
the highest actual PR page a toolbar PR of 10 and scale the rest appropriately.
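To make the presumed logarithmic scale concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `toolbar_pr`, the use of 0.15 (the smallest actual PR this paper mentions) as the bottom of the scale, and the rounding down. Nobody outside Google knows the real mapping.

```python
import math

PR_MIN = 0.15  # assumed bottom of the scale: the smallest actual PR in this paper

def toolbar_pr(actual_pr, max_pr):
    """Map an actual PR onto a hypothetical 0 to 10 logarithmic toolbar scale."""
    if actual_pr <= PR_MIN:
        return 0
    # Position of actual_pr between PR_MIN and max_pr, measured in logs (0.0 to 1.0).
    fraction = math.log(actual_pr / PR_MIN) / math.log(max_pr / PR_MIN)
    # Assumption: round down, and never show more than 10.
    return min(10, int(10 * fraction))

# With a hypothetical maximum PR of one billion:
for pr in (0.15, 10, 1000, 1e9):
    print(pr, toolbar_pr(pr, max_pr=1e9))
```

Under these assumptions each extra toolbar point needs the actual PR to be multiplied by the same factor, which is why the top of the scale would be so hard to reach.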
Also the toolbar sometimes guesses! The toolbar often shows me a Toolbar PR for
pages I’ve only just uploaded and which cannot possibly be in the index yet!
What seems to be happening is that the toolbar looks at the URL of the page the
browser is displaying and strips off everything down to the last “/” (i.e. it goes to the
“parent” page in URL terms). If Google has a Toolbar PR for that parent then it
subtracts 1 and shows that as the Toolbar PR for this page. If there’s no PR for the
parent it goes to the parent’s parent’s page, but subtracting 2, and so on all the way
up to the root of your site. If it can’t find a Toolbar PR to display in this way,
that is if it doesn’t find a page with a real calculated PR, then the bar is greyed out.
Note that if the Toolbar is guessing in this way, the Actual PR of the page is 0 -
though its PR will be calculated shortly after the Google spider first sees it.
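The guessing behaviour described above can be sketched in Python. This is only a model of what the toolbar seems to do, not Google’s code: the function name `guess_toolbar_pr`, the `known_pr` dictionary and the example URLs are hypothetical, and clamping the guess at 0 is my own assumption.

```python
from urllib.parse import urlsplit

def guess_toolbar_pr(url, known_pr):
    """Return a Toolbar PR for url, guessing from parent URLs; None means greyed out."""
    if url in known_pr:
        return known_pr[url]              # a real, calculated value: no guessing needed
    parts = urlsplit(url)
    path = parts.path or "/"
    levels_up = 0
    while path.rstrip("/"):               # stop once we have reached the site root
        trimmed = path.rstrip("/")
        path = trimmed[: trimmed.rfind("/") + 1]   # strip back to the last "/"
        levels_up += 1
        parent = f"{parts.scheme}://{parts.netloc}{path}"
        if parent in known_pr:
            # Subtract 1 for the parent, 2 for the parent's parent, and so on.
            # Assumption: the guess is never allowed to drop below 0.
            return max(known_pr[parent] - levels_up, 0)
    return None

known_pr = {"http://example.com/": 5, "http://example.com/docs/": 4}
print(guess_toolbar_pr("http://example.com/new.html", known_pr))          # 4
print(guess_toolbar_pr("http://example.com/docs/a/new.html", known_pr))   # 2
print(guess_toolbar_pr("http://other.example/new.html", known_pr))        # None
```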
PageRank says nothing about the content or size of a page, the language it’s
written in, or the text used in the anchor of a link!
Definitions
I’ve started to use some technical terms and shorthand in this paper. Now’s as good
a time as any to define all the terms I’ll use:
PR: Shorthand for PageRank: the actual, real, page rank for
each page as calculated by Google. As we’ll see later
this can range from 0.15 to billions.
Toolbar PR: The PageRank displayed in the Google toolbar in your
browser. This ranges from 0 to 10.
Backlink: If page A links out to page B, then page B is said to
have a “backlink” from page A.
So what is PageRank?
In short PageRank is a “vote”, by all the other pages on the Web, about how
important a page is. A link to a page counts as a vote of support. If there’s no link
there’s no support (but it’s an abstention from voting rather than a vote against the
page).
Quoting from the original Google paper, PageRank is defined like this:
We assume page A has pages T1...Tn which point to it (i.e., are citations).
The parameter d is a damping factor which can be set between 0 and 1. We
usually set d to 0.85. There are more details about d in the next section.
Also C(A) is defined as the number of links going out of page A. The
PageRank of a page A is given as follows:
PR(A) = (1 – d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so
the sum of all web pages' PageRanks will be one.
but that’s not too helpful so let’s break it down into sections.
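As a minimal sketch, one application of the formula can be written as a small Python function. The names `pagerank_of`, `backlinks`, `out_count` and `pr` are illustrative choices of mine, not from the Google paper, and the numbers are made up for the example.

```python
def pagerank_of(page, backlinks, out_count, pr, d=0.85):
    """Apply the formula once: PR(A) = (1 - d) + d * sum(PR(T) / C(T))."""
    # Each page T pointing at us shares its PR equally among its C(T) outgoing links.
    votes = sum(pr[t] / out_count[t] for t in backlinks[page])
    return (1 - d) + d * votes

# Page A has backlinks from T1 (2 outgoing links) and T2 (1 outgoing link).
backlinks = {"A": ["T1", "T2"]}
out_count = {"T1": 2, "T2": 1}
pr = {"T1": 1.0, "T2": 1.0}      # current PR estimates, assumed for the example
result = pagerank_of("A", backlinks, out_count, pr)
print(result)                    # about 1.425, i.e. 0.15 + 0.85 * (0.5 + 1.0)
```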
This is where it gets tricky. The PR of each page depends on the PR of the pages
pointing to it. But we won’t know what PR those pages have until the pages
pointing to them have their PR calculated and so on… And when you consider that
page links can form circles it seems impossible to do this calculation!
But actually it’s not that bad. Remember this bit of the Google paper:
What that means to us is that we can just go ahead and calculate a page’s
PR without knowing the final value of the PR of the other pages. That seems
strange but, basically, each time we run the calculation we’re getting a closer
estimate of the final value. So all we need to do is remember each value we
calculate and repeat the calculations lots of times until the numbers stop changing
much.
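Here is a minimal Python sketch of that “repeat until the numbers stop changing much” idea. The function name `iterate_pagerank`, the `tolerance` used to decide when to stop and the three-page network are all hypothetical, chosen only for illustration.

```python
def iterate_pagerank(links, d=0.85, guess=1.0, tolerance=1e-6, max_rounds=1000):
    """links maps each page to the list of pages it links out to."""
    pr = {page: guess for page in links}
    for rounds in range(1, max_rounds + 1):
        biggest_change = 0.0
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page.
            votes = sum(pr[t] / len(out) for t, out in links.items() if page in out)
            new_value = (1 - d) + d * votes
            biggest_change = max(biggest_change, abs(new_value - pr[page]))
            pr[page] = new_value          # remember the latest estimate straight away
        if biggest_change < tolerance:    # the numbers have stopped changing much
            return pr, rounds
    return pr, max_rounds

# A hypothetical three-page network: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, rounds = iterate_pagerank(links)
print(rounds, ranks)
```

The loop never needs the final PR of any page in advance; it only ever uses the best estimate so far.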
Let’s take the simplest example network: two pages, each pointing to the other:
Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) =
1).
Guess 1
We don’t know what their PR should be to begin with, so let’s take a guess at 1.0
and do some calculations:
d = 0.85
PR(A) = (1 – d) + d(PR(B)/1)
PR(B) = (1 – d) + d(PR(A)/1)
i.e.
PR(A) = 0.15 + 0.85 * 1
= 1
PR(B) = 0.15 + 0.85 * 1
= 1
Hmm, the numbers aren’t changing at all! So it looks like we started out with a
lucky guess!!!
Guess 2
No, that’s too easy, maybe I got it wrong (and it wouldn’t be the first time). Ok,
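let’s start the guess at 0 instead and re-calculate:
PR(A) = 0.15 + 0.85 * 0
= 0.15
PR(B) = 0.15 + 0.85 * 0.15
= 0.2775
And again:
PR(A) = 0.15 + 0.85 * 0.2775
= 0.385875
PR(B) = 0.15 + 0.85 * 0.385875
= 0.47799375
And again:
PR(A) = 0.15 + 0.85 * 0.47799375
= 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875
= 0.622850484375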
and so on. The numbers just keep going up. But will the numbers stop increasing
when they get to 1.0? What if a calculation over-shoots and goes above 1.0?
Guess 3
Well let’s see. Let’s start the guess at 40 each and do a few cycles:
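PR(A) = 40
PR(B) = 40
First calculation:
PR(A) = 0.15 + 0.85 * 40
= 34.15
PR(B) = 0.15 + 0.85 * 34.15
= 29.1775
And again:
PR(A) = 0.15 + 0.85 * 29.1775
= 24.950875
PR(B) = 0.15 + 0.85 * 24.950875
= 21.35824375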
Yup, those numbers are heading down alright! It sure looks like the numbers will get to
1.0 and stop.
Principle: it doesn’t matter where you start your guess, once the PageRank
calculations have settled down, the “normalized probability distribution”
(the average PageRank for all pages) will be 1.0
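What follows is a minimal Python sketch of the two-page example, written for this explanation rather than taken from the author’s original program. The function name `two_page_pagerank` and the choice of 100 cycles are my own assumptions.

```python
def two_page_pagerank(guess, d=0.85, cycles=100):
    """Pages A and B link only to each other, so C(A) = C(B) = 1."""
    pr_a = pr_b = guess
    history = []
    for _ in range(cycles):
        pr_a = (1 - d) + d * (pr_b / 1)
        pr_b = (1 - d) + d * (pr_a / 1)   # uses the PR(A) just calculated
        history.append((pr_a, pr_b))
    return history

# Try the three starting guesses used above: 1.0, 0 and 40.
for guess in (1.0, 0, 40):
    pr_a, pr_b = two_page_pagerank(guess)[-1]
    print(f"guess {guess}: PR(A) = {pr_a:.6f}, PR(B) = {pr_b:.6f}")
```

Whatever the starting guess, both pages settle at 1.0, so the average PageRank is 1.0.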
How many times do we need to repeat the calculation for big networks? That’s a
difficult question; for a network as large as the World Wide Web it can be many
millions of iterations! The “damping factor” is quite subtle. If it’s too high then it
takes ages for the numbers to settle, if it’s too low then you get repeated over-
shoot, both above and below the average - the numbers just swing about the
average like a pendulum and never settle down.
Also choosing the order of calculations can help. The answer will always come out
the same no matter which order you choose, but some orders will get you there
quicker than others.
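The effect of calculation order can be seen in a small Python sketch. The ring network, the function name `rounds_to_settle` and the stopping `tolerance` are hypothetical choices for illustration; each page’s new value is stored as soon as it is calculated, so later pages in the same round can use it.

```python
def rounds_to_settle(links, order, d=0.85, tolerance=1e-6):
    """Update pages in the given order until no PR changes by more than tolerance."""
    pr = {page: 0.0 for page in links}    # start every guess at 0
    rounds = 0
    while True:
        rounds += 1
        biggest_change = 0.0
        for page in order:
            votes = sum(pr[t] / len(out) for t, out in links.items() if page in out)
            new_value = (1 - d) + d * votes
            biggest_change = max(biggest_change, abs(new_value - pr[page]))
            pr[page] = new_value
        if biggest_change < tolerance:
            return rounds, pr

ring = {"A": ["B"], "B": ["C"], "C": ["A"]}   # hypothetical ring: A -> B -> C -> A
with_links, pr_with = rounds_to_settle(ring, ["A", "B", "C"])
against_links, pr_against = rounds_to_settle(ring, ["C", "B", "A"])
print(with_links, against_links)
```

Updating the pages in the direction the links point lets each fresh value feed the next calculation in the same round, so it needs fewer rounds than the reverse order, yet both orders arrive at the same PR of 1.0 for every page.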
I’m sure there have been several Master’s theses on how to make this calculation as
efficient as possible, but, in the examples below, I’ve used very simple code for
clarity and roughly 20 to 40 iterations were needed!
K-Nearest Neighbor
Introduction
Suppose we have an image of a creature that looks similar
to both a cat and a dog, and we want to know whether it is a
cat or a dog. For this identification we can use the KNN
algorithm, as it works on a similarity measure. Our KNN
model will compare the features of the new image with those
of the cat and dog images and, based on the most similar
features, put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
How does K-NN work?
First, we choose the number of neighbors; here we
choose k = 5.
By calculating the Euclidean distance we get the
nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category
B. Consider the below image:
As we can see, 3 of the 5 nearest neighbors are from
category A, hence this new data point is assigned
to category A.
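The final vote can be sketched in a few lines of Python. The list `neighbor_categories` simply restates the outcome described above (three neighbors in category A, two in category B); the variable names are my own.

```python
from collections import Counter

# Categories of the k = 5 nearest neighbors from the example above.
neighbor_categories = ["A", "A", "A", "B", "B"]

votes = Counter(neighbor_categories)          # counts: A = 3, B = 2
winner, count = votes.most_common(1)[0]       # the category with the most votes
print(winner, count)                          # A 3
```

Choosing an odd k, as here, avoids a tied vote when there are only two categories.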
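The K value indicates the count of the nearest neighbors. We
have to compute the distances between each test point and the
labeled training points. KNN is called a lazy learning
algorithm because it builds no model during training: it
simply stores the training data and defers all of the distance
computation to prediction time, which makes every prediction
computationally expensive.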
Initialize a random K value and start computing.
Now you will get the idea of choosing the optimal K value
by implementing the model.
Calculating distance:
Hamming Distance: It is used for categorical variables. If
the value (x) and the value (y) are the same, the distance D
will be equal to 0. Otherwise D = 1.
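For records with several categorical attributes, the per-attribute values of D are usually added up. A minimal Python sketch, where the function name `hamming_distance` and the example records are hypothetical:

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length records differ."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    # Each attribute contributes D = 0 when the values match and D = 1 otherwise.
    return sum(0 if a == b else 1 for a, b in zip(x, y))

# Hypothetical categorical records: (colour, size, shape).
print(hamming_distance(("red", "small", "round"), ("red", "large", "round")))   # 1
```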
What is the KNN Classification Algorithm?
KNN (K-Nearest Neighbors) is a simple, non-parametric method for
classification. Given a set of labeled data points, the KNN
classification algorithm finds the k data points in the training set
that are closest to the point to be classified. Then, it assigns the
label that is most common among those k data points. Here, we
need to specify the number of nearest neighbors, k, which is a user-
specified parameter. The basic idea behind the KNN algorithm is
that similar data points will have similar labels.
Using the above inputs, we follow the below steps to classify any
data point.
can also specify your own distance metric if you have datasets
having categorical or mixed attributes.
2. For a new data point P, calculate its distance to all the existing
data points.
3. Select the k-nearest data points, where k is a user-specified
parameter.
4. Among the k-nearest neighbors, count the number of data
points in each class. We do this to select the class label with a
majority of data points in the k neighbors that we select.
5. Assign the new data point to the class with the majority class
label among the k-nearest neighbors.
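The steps above can be sketched in Python as follows. This is a minimal illustration assuming the Euclidean distance metric; the function name `knn_classify` and the toy data are hypothetical, and ties in distance are broken by dataset order.

```python
import math
from collections import Counter

def knn_classify(point, training_data, k):
    """training_data is a list of (coordinates, label) pairs."""
    # Step 2: distance from the new point to every stored point.
    distances = [(math.dist(point, coords), label) for coords, label in training_data]
    # Step 3: keep the k nearest (sorted() is stable, so ties keep dataset order).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4 and 5: count the labels among the neighbors and return the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
training_data = [((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
                 ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high")]
print(knn_classify((2, 2), training_data, k=3))   # low
```

Note that nothing is learned ahead of time: all the work happens inside `knn_classify` when a new point arrives.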
Now that we have discussed the basic intuition and the algorithm for
KNN classification, let us discuss a KNN classification numerical
example using a small dataset.
Point  Coordinates  Class Label
A1     (2, 10)      C2
A2     (2, 6)       C1
A3     (11, 11)     C3
A4     (6, 9)       C2
A5     (6, 5)       C1
A6     (1, 2)       C1
A7     (5, 10)      C2
A8     (4, 9)       C2
A9     (10, 12)     C3
A10    (7, 5)       C1
A12    (4, 6)       C1
A13    (3, 10)      C2
A15    (3, 8)       C2
For this, we will first specify the number of nearest neighbors, i.e. k.
Let us take k to be 3. Now, we will find the distance of P to each data
point in the dataset. For this KNN classification numerical example,
we will use the Euclidean distance metric. The following table shows
the Euclidean distance of P to each data point in the dataset.
Point  Coordinates  Distance from P (5, 7)
A2     (2, 6)       3.16
A4     (6, 9)       2.23
A5     (6, 5)       2.23
A6     (1, 2)       6.40
A8     (4, 9)       2.23
After finding the distance of each point in the dataset to P, we will
sort the above points according to their distance from P (5, 7). After
sorting, we get the following table.
Point  Coordinates  Distance from P (5, 7)
A4     (6, 9)       2.23
A5     (6, 5)       2.23
A8     (4, 9)       2.23
A7     (5, 10)      3.00
A2     (2, 6)       3.16
A6     (1, 2)       6.40
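Now, points A12, A4, and A5 have the class labels C1, C2, and
C1 respectively. Among these points, the majority class label
is C1. Therefore, we will specify the class label of point P =
(5, 7) as C1. Note that A4, A5, and A8 are tied at the same
distance from P, so which of them are counted among the 3
nearest neighbors depends on how ties are broken; here they
are taken in table order. Hence, we have successfully used KNN
classification to classify point P according to the given
dataset.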
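The whole numerical example can be reproduced with a short Python sketch. It uses only the points that appear in the dataset listed above, with P = (5, 7) and k = 3; the variable names are my own. Points are ranked by squared distance, which gives the same order as Euclidean distance, and ties are broken by dataset order.

```python
import math
from collections import Counter

# The points listed in the dataset above, as (name, coordinates, class label).
dataset = [
    ("A1", (2, 10), "C2"), ("A2", (2, 6), "C1"), ("A3", (11, 11), "C3"),
    ("A4", (6, 9), "C2"), ("A5", (6, 5), "C1"), ("A6", (1, 2), "C1"),
    ("A7", (5, 10), "C2"), ("A8", (4, 9), "C2"), ("A9", (10, 12), "C3"),
    ("A10", (7, 5), "C1"), ("A12", (4, 6), "C1"), ("A13", (3, 10), "C2"),
    ("A15", (3, 8), "C2"),
]
P = (5, 7)
k = 3

def squared_distance(a, b):
    """Squared Euclidean distance; exact for integer coordinates."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# sorted() is stable, so points at equal distance keep their dataset order.
ranked = sorted(dataset, key=lambda row: squared_distance(P, row[1]))
for name, coords, label in ranked:
    distance = math.sqrt(squared_distance(P, coords))
    print(f"{name:<4} {str(coords):<9} {distance:.2f}  {label}")

nearest = ranked[:k]
majority = Counter(label for _, _, label in nearest).most_common(1)[0][0]
print([name for name, _, _ in nearest], "->", majority)
```

The sketch prints 2.24 where the tables show 2.23, because the distance is the square root of 5 (2.236...) and the tables appear to truncate rather than round. With this tie-breaking the three nearest points are A12, A4 and A5, and the majority label is C1.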
By studying the above KNN classification numerical example, you
can see that the algorithm is pretty straightforward and doesn’t
require any specific mathematical skills apart from distance
calculation and majority selection.
point, it used the majority of the class labels of k nearest
neighbors. Hence, the noise in the data will become a minority
class and won’t affect the classification process.
9. Handling outliers: KNN can be robust to outliers since the
decision is based on the majority class among k-nearest
neighbors.
categorical and mixed data types to perform KNN
classification.
8. Slow prediction: KNN is slow in prediction as it needs to
calculate the distance of the new point from each stored point.
This is a slow process and computationally expensive.