Lecture 31
Preface: content of this lecture. In this lecture, we will discuss the PageRank algorithm for big data,
using different frameworks, in different ways and at scale.
So, let us see this using this particular picture. The pages which are very important are shown with a
bigger size, compared to the pages which are not that important, which are shown with a smaller size.
Let us say a page is denoted by E; its page rank is denoted by the symbol PR, so PR(E) is the page rank
of page E. The PageRank algorithm gives each page a rank, or score, based on the number of links
pointing to it. For example, look at the page shown as a very fat node: most of the links are pointing
to it. Therefore the size, that is, the rank or score, of this page is higher than that of all the other
pages, which do not have that many links pointing to them. And if this fat page points to some other
web page, then that page also gets more importance. So, this ranking is computed through an algorithm
which gives ranks, also called scores, to the pages, based on the links pointing to them. Links from
many pages indicate that it is a high-rank, important page, and a link from a high-rank page also
contributes to a high rank. PageRank combines these two notions to give the rating, or score, to the
web pages.
Now, let us go into more detail with an example. Let us say that node A is pointed to by all the other
nodes, that is, B, C and D. There are four nodes, so the initial PageRank is divided equally: every page
starts with 1/4, that is, 0.25. Each page then passes its value along its outgoing links, so the page
rank of A will be recomputed in the first iteration.
So, after completing the first iteration, the page rank of A will be recalculated in this manner.
Refer Slide Time :( 9: 13)
So, for example, let us say B also has links to both A and C. Then the page rank of B, that is 0.25,
will be divided equally into two parts, and 0.125 will be given to each of C and A.
The page rank values are recalculated in this manner in each iteration.
$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \dots$$
$$PR(A) = \frac{1-d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \dots \right)$$
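As a quick check of the arithmetic in the four-page example, here is a minimal Python sketch of a single update for page A. The function names `simple_rank` and `damped_rank` are made up for illustration, the links of C and D in the second case are an assumption (the lecture only says that B links to A and C), and d = 0.85 is the damping value quoted later in this lecture.

```python
def simple_rank(ranks, out_links, page):
    """Undamped update: sum PR(q) / L(q) over every page q that links to `page`."""
    return sum(
        ranks[q] / len(targets)
        for q, targets in out_links.items()
        if page in targets
    )


def damped_rank(ranks, out_links, page, d=0.85):
    """Damped update: (1 - d) / N + d * (undamped sum)."""
    n = len(ranks)
    return (1.0 - d) / n + d * simple_rank(ranks, out_links, page)


# Initial PageRank: 1/4 for each of the four pages.
ranks = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Case 1: B, C and D each link only to A.
case1 = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
# Case 2: B links to A and C, so its 0.25 is split into two shares.
# (C and D linking only to A is an assumption of this sketch.)
case2 = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A"]}

pr_a_case1 = simple_rank(ranks, case1, "A")
pr_a_case2 = simple_rank(ranks, case2, "A")
pr_a_damped = damped_rank(ranks, case2, "A")
print(pr_a_case1, pr_a_case2, pr_a_damped)
```

Under these assumptions, PR(A) after one undamped update is 0.75 in the first case and 0.625 in the second; with damping, the second case gives 0.15/4 + 0.85 * 0.625.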
Refer Slide Time :( 10: 30)
So, the page rank is calculated using this formula, where p_1, p_2, ..., p_N are the pages under
consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on
page p_j, and N is the total number of pages.
$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$
Refer Slide Time :( 10: 55)
Now, after a fixed number of iterations, or when the changes in the values become very small so that
convergence is assumed, the execution of the PageRank algorithm stops, and the final values are the
page ranks of the pages. So, in summary, PageRank is the first algorithm used by the Google search
engine to rank web pages in its search results. We have shown how to calculate the page rank
iteratively, using the formula, for every vertex in a given graph.
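The iterative calculation and the stopping rule just described can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `pagerank`, the tolerance, the iteration cap and the small example graph are all hypothetical, and every page is assumed to have at least one outbound link.

```python
def pagerank(out_links, d=0.85, tol=1e-6, max_iter=200):
    """Repeat PR(p_i) = (1 - d)/N + d * sum(PR(p_j)/L(p_j)) until the values settle.

    Assumes every page has at least one outbound link (no dangling pages).
    Returns the final ranks and the number of iterations performed.
    """
    pages = list(out_links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # initial rank 1/N for every page
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_ranks = {p: (1.0 - d) / n for p in pages}
        for q, targets in out_links.items():
            share = ranks[q] / len(targets)  # PR(q) / L(q)
            for p in targets:
                new_ranks[p] += d * share
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:  # convergence is assumed when the change is very small
            break
    return ranks, iteration


# Hypothetical four-page graph: page -> pages it links to.
example_graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A"]}
final_ranks, iterations_used = pagerank(example_graph)
print(final_ranks, iterations_used)
```

In this made-up graph nothing links to D, so its rank settles at (1 - d)/N, while A, which collects links from three pages, ends up with the highest rank.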
Refer Slide Time :( 11: 35)
Let us see how MapReduce can be applied to the PageRank algorithm, so that we can run PageRank on
very large graphs. PageRank in MapReduce has two components: one is called the 'Mapper' and the other
is called the 'Reducer'.
In the mapper phase, the input is the node ID n, that is, the vertex ID, together with the vertex N
for which we are going to calculate the PageRank. For this vertex we take the current page rank of
node n and the total number of entries in its adjacency list, that is, the total number of neighbours
of n. The page rank is divided by the total number of neighbours, and this gives the value p. Then,
for every node ID m in its adjacency list, the mapper emits the ID m and the value p.
As far as the reducer is concerned, for the node m it collects all the values sent to it, that is,
the page ranks p1, p2 and so on from all its neighbours. These page ranks were sent by the map
function, and when they are received at m, the reducer calculates the page rank. It starts with M
equal to null and accumulates the page rank in the variable s. For every value p it has received, if
p is not a vertex, then p is a page rank which was sent, so it adds p to s. The page rank is then
calculated as PageRank = s * 0.85 + 0.15 / TotalVertices. Here 0.85 is the damping factor d, and the
other portion is (1 - d) / n, where n is the total number of vertices and 1 - d, that is 1 - 0.85, is
0.15. The sum of these two terms is the PageRank of node m, and it is emitted from the reducer
function with the vertex ID.
Now, there is one more statement which we have omitted: the mapper sends not only the PageRank to the
neighbouring nodes, but also the node itself, as a complex object. So, it emits n and the vertex N. In
the reducer it is checked whether p is a vertex; if p is a vertex, then M is set to p, because it is
the complete complex node object which was emitted by the mapper, not a page rank. So, both are used:
the complex object, which carries the state of the vertex, is transferred along with the PageRank
values. So, this was the MapReduce implementation of PageRank.
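The mapper and reducer described above can be imitated in a single Python process. This sketch is not a real MapReduce job: the names `map_pagerank`, `reduce_pagerank` and `run_iteration`, the dictionary layout of a node, and the example graph are all assumptions made for illustration, and the shuffle phase is simulated with an in-memory dictionary.

```python
from collections import defaultdict

DAMPING = 0.85  # damping factor d used in the lecture


def map_pagerank(nid, node):
    """Mapper: emit the node object itself, then an equal share for each neighbour."""
    yield nid, node  # the complex node object (carries the adjacency list)
    p = node["rank"] / len(node["adj"])
    for m in node["adj"]:
        yield m, p  # a plain PageRank contribution


def reduce_pagerank(nid, values, total_vertices):
    """Reducer: recover the node object, sum the contributions, apply damping."""
    node, s = None, 0.0
    for p in values:
        if isinstance(p, dict):  # p is the vertex object
            node = p
        else:  # p is a page rank sent by a neighbour
            s += p
    rank = s * DAMPING + (1.0 - DAMPING) / total_vertices
    return nid, {"rank": rank, "adj": node["adj"]}


def run_iteration(graph):
    """Simulate map, shuffle (group by key) and reduce for one PageRank iteration."""
    grouped = defaultdict(list)
    for nid, node in graph.items():
        for key, value in map_pagerank(nid, node):
            grouped[key].append(value)
    return dict(reduce_pagerank(k, v, len(graph)) for k, v in grouped.items())


# Hypothetical four-node graph, each node starting with rank 1/4.
graph0 = {
    "A": {"rank": 0.25, "adj": ["B"]},
    "B": {"rank": 0.25, "adj": ["A", "C"]},
    "C": {"rank": 0.25, "adj": ["A"]},
    "D": {"rank": 0.25, "adj": ["A"]},
}
graph1 = run_iteration(graph0)
print({nid: node["rank"] for nid, node in graph1.items()})
```

Emitting the node object alongside the contributions is what lets the output of one iteration be fed straight back in as the input of the next, since the adjacency lists survive the reduce step.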
Let us see how PageRank is implemented in Pregel. Pregel is originally from Google; open-source
implementations of the same model are also available, such as 'Apache Giraph', Stanford GPS, JPregel
and Hama. Pregel is used for batch algorithms in large-scale graph processing. Now let us see the
iterations in Pregel. Pregel does not use the MapReduce notion which we have seen in the previous
slide. Here, in Pregel, the whole computation is programmed from the point of view of a vertex: only
one vertex program is written. While any vertex is active and the maximum number of iterations is not
reached, this loop runs in parallel for each vertex. In each iteration, the vertex processes the
messages from its neighbours from the previous iteration. This is required because the messages bring
the page ranks from the neighbours. So, it receives the messages of the previous iteration, computes
or updates the new page rank, sends messages to its neighbours about the new PageRank, and sets its
active flag appropriately. When a particular node is not doing any of these actions, the node is not
active: it is passive, and it will be out of the action in the further iterations.
So, let us see the detailed algorithm of PageRank in Pregel. Here we see a superstep; a superstep
means one iteration of the PageRank algorithm on a particular vertex or node. It initializes the value
of sum and collects the values which come in the messages. Then, using this summation of all the
PageRank values it has collected, it applies the damping factor, that is 0.85, adds 0.15 / total
number of nodes, and this is the new PageRank. Now, if the number of iterations is less than 30, it is
not terminated and still has to send messages. So, it sends to all the neighbours its PageRank divided
by n, where n is the number of outgoing edges, because the PageRank is divided equally across all the
outgoing edges. This iteration keeps on repeating, and in the end the vertex votes to halt when no
further action is to be taken. So, this is the vertex program of PageRank in Pregel; you see that it
is not as complicated as what you have seen in MapReduce. We write one vertex program, and it
calculates the PageRank of all the vertices in the graph in parallel. The parallel graph execution
runs this code on all the vertices over the iterations, and once the iterations finish, that is, the
PageRank converges, the resulting values are the page ranks.
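The vertex program described above can be imitated with a small serial simulation in Python. This is a sketch and not the Pregel API: the names `compute` and `run_pregel`, the dictionary layout of a vertex and the example graph are hypothetical, and the vertices are visited one after another here, whereas a real Pregel-style system runs them in parallel.

```python
from collections import defaultdict

MAX_SUPERSTEPS = 30  # iteration limit mentioned in the lecture
DAMPING = 0.85


def compute(vertex, messages, superstep, num_vertices):
    """Vertex program for one superstep; returns the (target, value) messages to send."""
    if superstep >= 1:
        total = sum(messages)  # page ranks received from the neighbours
        vertex["rank"] = DAMPING * total + (1.0 - DAMPING) / num_vertices
    if superstep < MAX_SUPERSTEPS:
        share = vertex["rank"] / len(vertex["out"])  # split over outgoing edges
        return [(target, share) for target in vertex["out"]]
    vertex["active"] = False  # vote to halt
    return []


def run_pregel(graph):
    """Run supersteps while any vertex is active, delivering messages in between."""
    n = len(graph)
    inbox = defaultdict(list)
    superstep = 0
    while any(v["active"] for v in graph.values()):
        outbox = defaultdict(list)
        for vid, vertex in graph.items():
            if vertex["active"]:
                for target, value in compute(vertex, inbox[vid], superstep, n):
                    outbox[target].append(value)
        inbox = outbox  # messages become visible in the next superstep
        superstep += 1
    return graph


# Hypothetical four-vertex graph, each vertex starting with rank 1/4.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A"]}
pregel_graph = {
    vid: {"rank": 1.0 / len(links), "out": out, "active": True}
    for vid, out in links.items()
}
run_pregel(pregel_graph)
print({vid: v["rank"] for vid, v in pregel_graph.items()})
```

Compared with the MapReduce version, the graph structure stays with the vertex between supersteps, so only the rank contributions travel as messages.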
So, you can see here that this value becomes 0.426 and this one becomes 0.1. For a node whose
incoming contribution is 0, the value becomes 0.15 / 5, that is 0.03. This is iteration number one.
In the next iteration, you see that these two values are not changing, so these nodes become red. Red,
in the sense that they stop participating in the further iterations. The values which were sent by
these nodes earlier are stored in a buffer, because although the nodes do not participate in the
further iterations, their values are fixed and constant and are still going to be used.
And you see that in this iteration this particular node is also out of the action, so it is not
active, and finally, at this stage, the PageRank algorithm terminates with these values.
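The behaviour in this walkthrough, where a node whose value stops changing drops out while its last sent value stays in a buffer, can be sketched as follows. The five-node graph here is made up (it is not the graph on the slide), and the function name `pagerank_with_halting`, the tolerance and the iteration cap are assumptions. It is also a simplification: halted nodes are never woken up again in this sketch, whereas in a Pregel-style system an incoming message reactivates a halted vertex.

```python
def pagerank_with_halting(out_links, d=0.85, tol=1e-6, max_iter=200):
    """Synchronous PageRank in which a page halts once its value stops changing.

    Assumes every page has at least one outbound link.
    Returns the ranks and the iteration at which each page halted.
    """
    pages = list(out_links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    # Buffer: the last share each page sent along each of its outgoing links.
    sent = {p: ranks[p] / len(out_links[p]) for p in pages}
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    active = set(pages)
    halted_at = {}
    for iteration in range(1, max_iter + 1):
        if not active:
            break  # every page is out of the action: terminate
        new_ranks = {}
        for p in active:
            total = sum(sent[q] for q in in_links[p])  # buffered contributions
            new_ranks[p] = d * total + (1.0 - d) / n
        for p, value in new_ranks.items():
            if abs(value - ranks[p]) < tol:
                active.discard(p)  # value is not changing: stop participating
                halted_at[p] = iteration
            ranks[p] = value
            sent[p] = value / len(out_links[p])
    return ranks, halted_at


# Hypothetical five-page graph; nothing links to E.
five_pages = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"], "E": ["A", "D"]}
halting_ranks, halted_at = pagerank_with_halting(five_pages)
print(halting_ranks, halted_at)
```

In this graph E receives no contribution, so from the first iteration its value is 0.15 / 5, and it halts in the second iteration when that value does not change; D, which depends only on E, halts one iteration later.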
Refer Slide Time :( 26: 30)
So, in conclusion, graph-structured data are increasingly common in data science contexts, due to
their ubiquity in modelling the communication between entities: people in a social network, computers
in Internet communication, cities and countries in a transportation network, web pages in the World
Wide Web, or corporations in financial cooperation. So, in this lecture, we have discussed an
important algorithm, the PageRank algorithm, for extracting information from big graph data, using
MapReduce and using the Pregel paradigm.
Thank you.