Unit2 Bda
UNIT II
Mining Data Streams & Link
Analysis
1
Mining Data Streams
• A stream refers to data or media that is sent as a continuous flow and can be played or processed as it arrives at the receiver.
• Streaming is a technique in which the sender transmits data, often in compressed form, over the Internet.
• As data is received, it is presented to the user immediately.
• Streaming does not require storing the full data on storage devices (e.g., a hard drive).
2
New Topic: Infinite Data
[Course map: topics grouped by data type]
• High dim. data: Locality sensitive hashing; Dimensionality reduction
• Graph data: PageRank, SimRank; Spam Detection
• Infinite data: Filtering data streams; Web advertising
• Machine learning: SVM; Perceptron, kNN
• Apps: Recommender systems; Duplicate document detection
3
https://hazelcast.com/blog/the-role-of-streaming-technology-in-retail-banking/
4
The Stream Data Model
• Refers to a sequence of data elements or symbols made available
over time
• A data stream is transmitted from a source and received at the processing end of a network
• A continuous stream of data flows between the source and receiver ends and is processed in real time
• In many data mining situations, we do not know the entire data
set in advance
• Stream Management is important when the input rate is
controlled externally:
• Google queries
• Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary (the
distribution changes over time)
5
SGD is a Streaming Alg.
• Stochastic Gradient Descent (SGD) is an
example of a stream algorithm
• In Machine Learning we call this: Online Learning
• Allows for modeling problems where we have
a continuous stream of data
• We want an algorithm to learn from it and
slowly adapt to the changes in data
• Idea: Do slow updates to the model
• SGD (SVM, Perceptron) makes small updates
• So: First train the classifier on training data.
• Then: For every example from the stream, we slightly
update the model (using small learning rate)
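A minimal sketch of what such a slow, streaming update can look like (illustrative only; the function name, the perceptron-style rule and the learning rate eta are assumptions, not the slide's exact algorithm):

```python
import numpy as np

def online_sgd_update(w, x, y, eta=0.01):
    """One streaming update of a perceptron-style linear model w on example (x, y) with y in {+1, -1}."""
    if y * np.dot(w, x) <= 0:        # misclassified (or on the margin)
        w = w + eta * y * x          # nudge the weights slightly toward the example
    return w

# Usage: train on a batch first, then keep adapting as stream examples arrive.
w = np.zeros(3)
stream = [(np.array([1.0, 0.5, -0.2]), 1), (np.array([-0.3, 1.2, 0.7]), -1)]
for x, y in stream:
    w = online_sgd_update(w, x, y)
```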
6
Data-stream-management System
[Figure: streams entering the data-stream-management system; each stream is composed of elements/tuples]
7
• Also refers to communication of bytes or characters
over sockets in a computer network
• A program uses stream as an underlying data type in
inter-process communication channels.
• Input elements enter at a rapid rate, at one or more
input ports (i.e., streams)
• We call elements of the stream tuples
• The system cannot store the entire stream accessibly
8
• Streams may be archived in a large archival store, but
we assume it is not possible to answer queries from the
archival store.
• It could be examined only under special circumstances
using TIME-CONSUMING RETRIEVAL PROCESSES.
• There is also a working store, into which summaries or
parts of streams may be placed, and which can be used
for answering queries.
• The working store might be disk, or it might be main
memory, depending on how fast we need to process
queries.
• But either way, it is of sufficiently limited capacity
that it cannot store all the data from all the streams.
9
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Sampling data from a stream
• Construct a random sample
• Queries over sliding windows
• Number of items of type x in the last k elements
of the stream
10
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Filtering a data stream
• Select elements with property x from the stream (drop unwanted data)
• Counting distinct elements
• Number of distinct elements in the last k elements of the stream (checking for unique elements or queries)
• Estimating moments
• Estimate the avg./std. dev. of the last k elements (summarizing unevenly distributed data for statistical purposes)
• Finding frequent elements
• e.g., which queries occur most frequently in the data stream
11
Applications (1)
• Mining query streams
• Google wants to know what queries are more frequent today
than yesterday
12
Applications (2)
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between
telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks
13
Examples of Stream Sources
Image Data
14
Examples of Streaming Data
15
Streaming Data: what’s different?
16
Important tasks with streaming data
▪ Sampling from a stream
– Because we cannot afford to process the whole data
▪ Filtering from streaming data
– Accept / discard elements based on some condition
▪ Estimating the number of distinct elements in a stream
– Without actually counting them
▪ Estimating moments
17
Stream Queries
• A stream query operates on one or two streams to transform
their contents into a single output stream.
• A stream query definition declares an identifier for the items in
the stream so that the item can be referred to by the operators
in the stream query.
• Example 1:
• The stream produced by the ocean-surface-temperature sensor
mentioned might have a standing query
• To output an alert whenever the temperature exceeds 25
degrees centigrade.
• This query is easily answered, since it depends only on the most
recent stream element.
18
• We might have a standing query that, each time a new reading
arrives, produces the average of the 24 most recent readings.
• That query also can be answered easily, if we store the 24
MOST RECENT STREAM ELEMENTS.
• When a new stream element arrives, we can drop from the
working store the 25th most recent element, since it will never
again be needed (unless there is some other standing query that
requires it).
19
• Example 2:
• Another query we might ask is the MAXIMUM TEMPERATURE
EVER RECORDED BY THAT SENSOR.
• The maximum of all stream elements ever seen.
• It is not necessary to record the entire stream.
• When a new stream element arrives, we COMPARE it with the
stored maximum, and set the maximum to whichever is larger.
• We can then answer the query by producing the current value of
the maximum.
• Similarly, if we want the average temperature over all time, we
have only to record two values: the number of readings ever sent
in the stream and the sum of those readings.
• We can adjust these values easily each time a new reading
arrives, and we can produce their quotient as the answer to the
query
20
• Example 3:
• The other form of query is AD-HOC, a question asked once
about the current state of a stream or streams.
• If we do not store all streams in their entirety, as normally we
can not, then we cannot expect to answer arbitrary queries
about streams.
• If we have some idea what kind of queries will be asked through
the ad-hoc query interface, then we can prepare for them by
storing appropriate parts or summaries of streams
• A sliding window can be the most recent n elements of a stream,
for some n, or it can be all the elements that arrived within the
last t time units, e.g., one day.
• If we regard each stream element as a tuple, we can treat the
window as a relation and query it with any SQL query
21
Sliding window
22
• Web sites often like to report the number of unique users over the past
month.
• If we think of each login as a stream element, we can maintain a
window that is all logins in the most recent month.
• We must associate the arrival time with each login, so we know when
it no longer belongs to the window.
• If we think of the window as a relation Logins(name, time)
• Then it is simple to get the number of unique users over the past
month.
• The SQL query is:
• SELECT COUNT(DISTINCT(name))
• FROM Logins
• WHERE time >= t;
• Here, t is a constant that represents the time one month before the
current time.
23
Sampling Data in a Stream
• Managing streaming data
• Extracting reliable samples from a stream
• Example 1:
• Selecting a subset of a stream,
• so that we can ask queries about the selected subset and have the responses be statistically representative of the stream as a whole.
24
• A search engine receives a stream of queries, and it would like to
study the behavior of typical users
• The stream consists of tuples (user, query, time).
• Suppose that we want to answer queries such as “What fraction
of the typical user’s queries were repeated over the past month?”
• Instead of keeping the complete set of tuples, we pick up a sample here
• Assume also that we wish to store only 1/10th of the stream elements (e.g., if the size is 100, we keep only 10)
• The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query.
• Store the tuple if and only if the random number is 0.
• Each user has, on average, 1/10th of their queries stored
• However, this is subject to statistical fluctuations
25
• The law of large numbers will assure us that most users will have
a fraction quite close to 1/10th of their queries stored.
• However, this approach gives the wrong answer to queries about the average number of duplicate queries for a user.
• ASSUMPTIONS MADE:
• Suppose a user has issued s search queries exactly once in the past month,
• d search queries exactly twice,
• and no search queries more than twice.
26
• Of the d search queries issued twice, only d/100 will appear twice in the sample; that fraction is d times 1/100,
• the probability that both occurrences of the query will be in the 1/10th sample.
• Of the queries that appear twice in the full stream, 18d/100 will appear
exactly once.
• To see why, note that 18/100 is the probability that one of the two
occurrences will be in the 1/10th of the stream that is selected, while the
other is in the 9/10th that is not selected.
• The correct answer to the query about the fraction of repeated searches is
d/(s+d).
• However, the answer we shall obtain from the sample is d/(10s+19d).
• To derive the latter formula, note that d/100 appear twice, while
s/10+18d/100 appear once.
• Thus, the fraction appearing twice in the sample is d/100 divided by d/100+
s/10 + 18d/100.
• This ratio is d/(10s+ 19d). For no positive values of s and d is d/(s + d) =
d/(10s + 19d).
27
Sampling from Data Streams
▪ Search queries coming in a stream
▪ Question: Typically, what fraction of the queries were
repeated by the same user over the last month?
▪ Assume: want to answer from a sample storing only
10% of the stream
28
Sampling: seemingly obvious approach
31
Selecting queries for “chosen” users
▪ For each query arriving in the stream
– Check if the user is seen before
• If yes, check if the user is chosen
– If yes, keep the query
– If not, discard the query
• If not, determine whether to choose the user
– Generate a random integer in the range 0-9
– If the integer is 0, keep the query, put the user in the chosen list
– If not, discard the query
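A minimal sketch of this selection procedure (the dictionary-based bookkeeping and names are illustrative assumptions):

```python
import random

user_status = {}          # user -> True ("chosen") or False ("not chosen")
sample = []               # kept (user, query, time) tuples

def process(user, query, time):
    if user not in user_status:                            # user not seen before
        user_status[user] = (random.randint(0, 9) == 0)    # choose roughly 1/10 of users
    if user_status[user]:                                  # user is chosen
        sample.append((user, query, time))                 # keep the query
    # otherwise discard the query
```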
32
Obtaining a Representative Sample
33
• When a search query arrives, we check whether the user is already on our in/out list; if the user is marked “in”, we add this search query to the sample, and if not, then not.
• However, if we have no record of ever having seen this
user before, then we generate a random integer
between 0 and 9.
• If the number is 0, we add this user to our list with
value “in,” and if the number is other than 0, we add
the user with the value “out”.
34
Checking if the user is seen before
35
The General Sampling Problem
• Our stream consists of tuples with n components.
• A subset of the components are the key components,
on which the selection of the sample will be based.
• In our running example, there are three components –
user, query, and time – of which only user is in the key.
• However, we could also take a sample of queries by
making query be the key, or even take a sample of
user-query pairs by making both those components
form the key.
36
• To take a sample of size a/b, we hash the key value for each
tuple to b buckets, and accept the tuple for the sample if the
hash value is less than a.
• If the key consists of more than one component, the hash
function needs to combine the values for those components to
make a single hash-value.
• The result will be a sample consisting of all tuples with certain
key values.
• The selected key values will be approximately a/b of all the key
values appearing in the stream.
37
Varying the Sample Size
• We retain all the search queries of the selected 1/10th of the
users, forever.
• As time goes on, more searches for the same users will be
accumulated, and new users that are selected for the sample will
appear in the stream.
• If we have a budget for how many tuples from the stream can
be stored as the sample, then the fraction of key values must
vary, lowering as time goes on.
• In order to assure that at all times, the sample consists of all
tuples from a subset of the key values, we choose a hash function
h from key values to a very large number of values 0,
1, . . . ,B−1.
38
• We maintain a threshold t, which initially can be the largest
bucket number, B − 1.
• At all times, the sample consists of those tuples whose key K
satisfies h(K) ≤ t.
• New tuples from the stream are added to the sample if and only
if they satisfy the same condition.
• If the number of stored tuples of the sample exceeds the allotted
space, we lower t to t−1 and remove from the sample all those
tuples whose key K hashes to t.
• For efficiency, we can lower t by more than 1, and remove the
tuples with several of the highest hash values, whenever we need
to throw some key values out of the sample.
• Further efficiency is obtained by maintaining an index on the
hash value, so we can find all those tuples whose keys hash to a
particular value quickly.
39
Generalized Solution
• Stream of tuples with keys:
• Key is some subset of each tuple’s components
• e.g., tuple is (user, search, time); key is user
• Choice of key depends on application
To take a sample of size a/b: hash each tuple’s key into b buckets, and pick the tuple if its hash value is less than a.
How to generate a 30% sample?
Hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets. 40
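A minimal sketch of this hash-based sampling (the use of MD5 and the helper name keep_tuple are illustrative assumptions):

```python
import hashlib

def keep_tuple(key, a=3, b=10):
    """Return True for roughly an a/b fraction of all key values (here 3/10 = 30%)."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % b
    return h < a

# Key = user, so either all or none of a given user's queries are kept.
stream = [("u1", "q1", 1), ("u2", "q2", 2), ("u1", "q3", 3)]
sample = [t for t in stream if keep_tuple(t[0])]
```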
Filtering Streams
• Common process on streams is selection, or filtering.
• To accept those tuples in the stream that meet a criterion.
• Accepted tuples are passed to another process as a stream, while other tuples
are dropped.
• If the selection criterion is a property of the tuple that can be calculated (e.g., the first component is less than 10), then the selection is easy to do.
• The problem becomes harder when the criterion involves lookup for membership in a set.
• It is especially hard when that set is too large to store in main memory.
• The technique known as “Bloom filtering” is a way to eliminate most of the tuples that do not meet the criterion.
41
• Suppose you are creating a Gmail account.
• You want to enter a cool username; you entered it and got the message “Username is already taken”.
• You added your birth date to the username; still no luck.
• Now you have added your university roll number as well, and still got “Username is already taken”.
• But have you ever wondered how quickly Gmail checks the availability of a username against the millions of usernames registered with it?
• There are many ways to do this job
42
• A Bloom filter is a data structure that can do this job.
• To understand Bloom filters, we must first know what hashing is.
• A hash function takes an input and outputs a fixed-length identifier that is used to identify the input.
43
Hashing
• Assume we want to design a system for storing employee records keyed using
phone numbers.
• We want the following operations to be performed efficiently:
• Insert a phone number and corresponding information.
• Search a phone number and fetch the information.
• Delete a phone number and related information.
• Techniques can be used:
1.Array of phone numbers and records.
2.Linked List of phone numbers and records.
3.Balanced binary search tree with phone numbers as keys.
4.Direct Access Table.
44
Hashing
S.No Emp Name Phone no
1 ABC 9876543210
2 XYZ 9012345698
45
Hashing function
• A function that converts a given big phone number to
a small practical integer value.
• The mapped integer value is used as an index in hash
table.
• In simple terms, a hash function maps a big number or
string to a small integer that can be used as index in
hash table.
• A good hash function should have following properties
• Efficiently computable.
• Should uniformly distribute the keys (Each table position
equally likely for each key)
46
Bloom Filter
• A Bloom filter is a space-efficient probabilistic data
structure that is used to test whether an element is a
member of a set.
• For example, checking availability of username is set
membership problem, where the set is the list of all
registered username.
• The price we pay for this efficiency is that the filter is probabilistic in nature, which means there might be some FALSE POSITIVE results.
• When testing if an element is in the Bloom filter, false positives are possible.
• It will either say that an element is definitely not in the set, or that the element is possibly in the set.
• It might tell us that a given username is already taken when actually it is not.
47
Interesting Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set
with an arbitrarily large number of elements.
• Adding an element never fails.
• The false positive rate increases steadily as elements are added until all bits in
the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate false negative result, i.e., telling you that a
username doesn’t exist when it actually exists.
• Deleting elements from filter is not possible because, if we delete a single
element by clearing bits at indices generated by k hash functions, it might
cause deletion of few other elements.
• Example – if we delete “hello” (in the given example below) by clearing the bits at 1, 4 and 7, we might end up “deleting” “world” as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “world” is not present.
48
• Step 1: An empty Bloom filter is a bit array of m bits, all set to zero
1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0
49
• h1(“hello”) % 10 = 1
• h2(“hello”) % 10 = 4
• h3(“hello”) % 10 = 7
• Note: These outputs are random for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1 for “hello”
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 1 0 0 0
50
• Note: These outputs are random for explanation only.
Now we will set the bits at indices 3, 5 and 4 to 1 for “world”.
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 0 1 0 0 0
51
False Positive in Bloom Filters
• The question is why we said “probably present”, why this uncertainty.
• Suppose we want to check whether “cat” is present or not.
• We’ll calculate hashes using h1, h2 and h3
• h1(“cat”) % 10 = 1
• h2(“cat”) % 10 = 3
• h3(“cat”) % 10 = 7
• If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter.
• The bits at indices 1 and 7 were set when we added “hello”, and the bit at index 3 was set when we added “world”.
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 0 1 0 0 0
52
• So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result (it incorrectly indicates that the word is present when it is not).
• Depending on the application, this could be a huge downside or relatively acceptable.
• We can control the probability of getting a false positive by controlling the size of the Bloom filter.
• More space means fewer false positives.
• If we want to decrease the probability of a false positive result, we have to use more HASH FUNCTIONS AND A LARGER BIT ARRAY.
• This would add latency when inserting items and checking membership.
53
Operations that a Bloom Filter supports
• insert(x): to insert an element into the Bloom filter.
• lookup(x): to check whether an element is already present in the Bloom filter, with some false positive probability.
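A minimal sketch of these two operations (the class name, the choice of m = 10 bits and k = 3 hash functions, and deriving the k hashes from MD5 with different seeds are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k bit-array indices by hashing the item with k different "seeds".
        return [int(hashlib.md5(f"{i}{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def insert(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def lookup(self, item):
        # True = "possibly present" (may be a false positive); False = "definitely not present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.insert("hello"); bf.insert("world")
print(bf.lookup("hello"), bf.lookup("cat"))   # True, and possibly a false positive for "cat"
```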
54
Probability of False Positivity
• Let m be the size of the bit array, k the number of hash functions, and n the number of expected elements to be inserted in the filter; then the probability of a false positive p can be calculated as:
  p = (1 − [1 − 1/m]^(kn))^k ≈ (1 − e^(−kn/m))^k
55
56
Bloom filter
57
Counting Distinct Elements in a
Stream
• Measuring the number of distinct elements in a stream of values is one of the most commonly needed utilities in practice.
• Applications: database query optimization, network topology, Internet routing, big data analytics, and data mining.
58
The Count-Distinct Problem
• Suppose stream elements are chosen from some
universal set.
• We would like to know how many different elements
have appeared in the stream
• Counting either from the beginning of the stream or
from some known time in the past.
59
• As a useful example of this problem
• Consider a Web site gathering statistics on how many unique
users it has seen in each given month.
• Input: the universal set is the set of logins for that site, and a
stream element is generated each time someone logs in.
• ASSUMPTION MADE:
• This measure is appropriate for a site like Amazon, where the
typical user logs in with their unique login name.
• A similar problem is a Web site like Google that does not
require login to issue a search query, and may be able to
identify users only by the IP address from which they send
the query.
• There are about 4 billion IP addresses
• Sequences of four 8-bit bytes will serve as the universal set in
this case.
60
• The obvious way to solve the problem is to keep in main
memory a list of all the elements seen so far in the stream.
• Keep them in an efficient search structure such as a hash
table or search tree,
• Can quickly add new elements and check whether or not
the element that just arrived on the stream was already
seen.
• As long as the number of distinct elements is not too great,
• This structure can fit in main memory and there is little
problem obtaining an exact answer to the question how
many distinct elements appear in the stream.
61
• If the number of distinct elements is too great, or if there are too many streams that need to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its pages in a month), then we cannot store the needed data in main memory.
• There are several options.
• We could use more machines, each machine handling only
one or several of the streams.
• We could store most of the data structure in secondary
memory
• Batch stream elements so whenever we brought a disk block
to main memory there would be many tests and updates to
be performed on the data in that block.
62
Counting Distinct Elements
• Problem:
• Data stream consists of a universe of elements chosen from a
set of size N
• Maintain a count of the number of distinct elements seen so
far
• Obvious approach:
Maintain the set of elements seen so far
• That is, keep a hash table of all the distinct elements seen so
far
63
Applications
• How many different words are found among the Web
pages being crawled at a site?
• Unusually low or high numbers could indicate artificial pages
(spam?)
64
Using Small Storage
• Real problem: What if we do not have space
to maintain the set of elements seen so far?
65
Steps involved in FM
• Get the input stream https://arpitbhayani.me/blogs/flajolet-martin
66
Example
• Input X={1,3,2,1,2,3,4,3,1,2,3,1}
• Hash function:6x+1 mod 5
67
The Flajolet-Martin Algorithm
• The more different elements we see in the stream, the
more different hash-values
• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0’s, possibly none.
• Call this number the tail length for a and h.
• Let R be the maximum tail length of any a seen so far in the stream.
• Then we shall use 2^R as the estimate of the number of distinct elements seen in the stream.
73
• If m is much larger than 2^r, then the probability that we shall find a tail of length at least r approaches 1.
• If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
• We conclude from these two points that the proposed estimate of m, which is 2^R (recall R is the largest tail length of any stream element), is unlikely to be either much too high or much too low.
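A minimal sketch of this estimate, reusing the input X and the hash function h(x) = (6x + 1) mod 5 from the earlier example slide (treating h(x) = 0 as tail length 0 is an assumed convention):

```python
def tail_length(v):
    """Number of trailing zeros in the binary representation of v."""
    if v == 0:
        return 0            # assumed convention for the degenerate case h(x) = 0
    t = 0
    while v % 2 == 0:
        v //= 2
        t += 1
    return t

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
R = max(tail_length((6 * x + 1) % 5) for x in stream)
print(2 ** R)   # estimate of the number of distinct elements (here 2^2 = 4; the true count is 4)
```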
74
75
76
Why It Works
• The probability that a given h(a) ends in at least i 0’s is 2^(−i).
• If there are m different elements, the probability that R ≥ i is 1 − (1 − 2^(−i))^m.
77
Why It Works – (2)
• If 2^i >> m, then 1 − e^(−m·2^(−i)) ≈ 1 − (1 − m·2^(−i)) = m/2^i ≈ 0.
• If 2^i << m, then 1 − e^(−m·2^(−i)) ≈ 1.
• Thus, 2^R will almost always be around m.
• (The approximation uses the first 2 terms of the Taylor expansion of e^x.)
78
Why It Doesn’t Work
• E(2^R) is, in principle, infinite.
• The probability of seeing at least R 0’s halves when R → R+1, but the value of 2^R doubles.
• Workaround involves using many hash
functions and getting many samples.
• How are samples combined?
• Average? What if one very large value?
• Median? All values are a power of 2.
79
Solution
• Partition your samples into small groups.
• O(log n), where n = size of universal set, suffices.
• Take the average within each group.
• Then take the median of the averages.
80
Generalization: Moments
• Suppose a stream has elements chosen from a set A of N values
• Let m_i be the number of times value i occurs in the stream
• The k-th moment is Σ_{i∈A} (m_i)^k
81
82
Special Cases
• 0th moment: the number of distinct elements in the stream
• 1st moment: the sum of the counts m_i, i.e., the length of the stream
• 2nd moment: Σ_{i∈A} (m_i)², the “surprise number” S – a measure of how uneven the distribution is
83
Example: Surprise Number
• Stream of length 100
• 11 distinct values
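For illustration (these counts are an assumption about what the missing figure shows, chosen to match the stated stream length and number of distinct values): if one value appears 90 times and the other ten values appear once each, the surprise number is S = 90² + 10·1² = 8,110; if the most common value appears 10 times and the other ten appear 9 times each, S = 10² + 10·9² = 910.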
84
85
86
The Alon-Matias-Szegedy Algorithm
for Second Moments
• A particular element of the universal set, which we
refer to as X.element
• An integer X.value, which is the value of the variable.
• To determine the value of a variable X, we choose a
position in the stream between 1 and n, uniformly and
at random.
• Set X.element to be the element found there, and
initialize X.value to 1.
• As we read the stream, add 1 to X.value each time we
encounter another occurrence of X.element
87
Problem Statement: a generalization of the problem of counting distinct elements in a stream of data.
It involves the distribution of the frequencies of the different elements in a stream.
89
90
Sample problem for AMS
• Stream={a, b, c, b, d, a, c, d, a, b, d, c, a, a, b}
93
Sample problem for AMS
• a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
• 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
94
95
96
[Alon, Matias, and Szegedy]
AMS Method
• AMS method works for all moments
• Gives an unbiased estimate
• We will just concentrate on the 2nd moment S
• We pick and keep track of many variables X:
• For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
• Note: this requires a count in main memory, so the number of Xs is limited
• Our goal is to compute S = Σ_i (m_i)²
97
One Random Variable (X)
• How to set X.val and X.el?
• Assume stream has length n (we relax this later)
• Pick some random time t (t < n) to start, so that any time is equally likely
• Let the item in the stream at time t be i. We set X.el = i
• Then we maintain a count c (X.val = c) of the number of i’s in the stream starting from the chosen time t
• Then the estimate of the 2nd moment (Σ_i m_i²) is: S = f(X) = n·(2·c − 1)
• Note: we will keep track of multiple Xs (X1, X2, … Xk) and our final estimate will be S = (1/k)·Σ_j f(Xj)
98
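A minimal sketch of the AMS estimate for the 2nd moment, using the sample stream from the earlier slide (the number of variables k and the random seed are illustrative assumptions):

```python
import random

def ams_second_moment(stream, k=3, seed=0):
    random.seed(seed)
    n = len(stream)
    estimates = []
    for _ in range(k):
        t = random.randrange(n)                     # random start time t
        x_el = stream[t]                            # X.el = item at time t
        x_val = stream[t:].count(x_el)              # X.val = count of X.el from time t onwards
        estimates.append(n * (2 * x_val - 1))       # f(X) = n*(2c - 1)
    return sum(estimates) / k                       # average the k estimates

stream = list("abcbdacdabdcaab")                    # the sample stream {a,b,c,b,d,...}
print(ams_second_moment(stream))                    # approximates S = sum_i (m_i)^2
```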
Expectation Analysis
Stream: a a b b b a b a   (running count of a’s: 1 2 3 … m_a)
• 2nd moment is S = Σ_i (m_i)²
• c_t … number of times the item at time t appears from time t onwards (c1 = m_a, c2 = m_a − 1, c3 = m_b)
• E[f(X)] = (1/n)·Σ_{t=1}^{n} n·(2·c_t − 1)
  (m_i … total count of item i in the stream; we are assuming the stream has length n)
  = (1/n)·Σ_i n·(1 + 3 + 5 + … + 2·m_i − 1)
• Group the times t by the value seen: the time when the last i is seen (c_t = 1), the time when the penultimate i is seen (c_t = 2), …, the time when the first i is seen (c_t = m_i)
99
Expectation Analysis
Stream: a a b b b a b a
• E[f(X)] = (1/n)·Σ_i n·(1 + 3 + 5 + … + 2·m_i − 1)
• Little side calculation: (1 + 3 + 5 + … + 2·m_i − 1) = Σ_{i=1}^{m_i} (2i − 1) = 2·m_i(m_i + 1)/2 − m_i = (m_i)²
• Then E[f(X)] = (1/n)·Σ_i n·(m_i)²
• So, E[f(X)] = Σ_i (m_i)² = S
• We have the second moment (in expectation)!
100
Higher-Order Moments
• For estimating the kth moment we essentially use the same algorithm but change the estimate:
• For k=2 we used n·(2·c − 1)
• For k=3 we use: n·(3·c² − 3·c + 1) (where c = X.val)
• Why?
• For k=2: remember we had (1 + 3 + 5 + … + 2·m_i − 1), and we showed the terms 2c − 1 (for c = 1, …, m) sum to m²
• Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c² − Σ_{c=1}^{m} (c − 1)² = m²
• So: 2c − 1 = c² − (c − 1)², and analogously 3c² − 3c + 1 = c³ − (c − 1)³
101
Combining Samples
• In practice:
• Compute f(X) = n·(2·c − 1) for as many variables X as you can fit in memory
• Average them in groups
• Take the median of the averages
Counting Bits (1)
• New problem: given a stream of 0s and 1s, how many 1s are there in the last N bits?
• Obvious solution: store the most recent N bits
• When a new bit comes in, discard the N+1st oldest bit
108
Counting Bits (2)
• You can not get an exact answer without storing the
entire window
• Real Problem:
What if we cannot afford to store N bits?
• E.g., we’re processing 1 billion streams and
N = 1 billion 010011011101010110110110
Past Future
109
An attempt: Simple solution
• Q: How many 1s are in the last N bits?
• A simple solution that does not really solve
our problem: Uniformity assumption
N
010011100010100100010110110111001010110011010
Past Future
• Maintain 2 counters:
• S: number of 1s from the beginning of the
stream
• Z: number of 0s from the beginning of the
stream
• How many 1s are in the last N bits? Estimate: N · S/(S + Z)
• But, what if stream is non-uniform?
• What if distribution changes over time? 110
DGIM algorithm
• DGIM algorithm (Datar-Gionis-Indyk-Motwani
Algorithm)
DGIM Method
• DGIM solution that does not assume uniformity
112
What’s Good?
• Stores only O(log² N) bits
• O(log N) counts of O(log N) bits each
113
Generic example
114
115
116
• Each bit of the stream has a timestamp, the position in
which it arrives.
• The first bit has timestamp 1, the second has
timestamp 2, and so on.
• Since we only need to distinguish positions within the
window of length N, we shall represent timestamps
modulo N, so they can be represented by log2 N bits.
• If we also store the total number of bits ever seen in
the stream (i.e., the most recent timestamp) modulo N,
then we can determine from a timestamp modulo N
where in the current window the bit with that
timestamp is.
117
Buckets representation
• We divide the window into buckets, consisting of:
• The timestamp of its right (most recent) end.
• The number of 1’s in the bucket. This number must be a power of 2, and
we refer to the number of 1’s as the size of the bucket.
• To represent a bucket, we need log2 N bits to represent the timestamp
(modulo N) of its right end.
• To represent the number of 1’s we only need log2 log2 N bits.
• The reason is that we know this number i is a power of 2, say 2j , so we
can represent i by coding j in binary.
• Since j is at most log2 N, it requires log2 log2 N bits.
• Thus, O(logN) bits suffice to represent a bucket.
118
6 Rules representing a stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to
some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left
(back in time).
119
Rules with example
• The right end of a bucket is always a position with a 1.  Example: 0 1 1 0 0 0 0 – trailing “0”s need not be covered.
• Every position with a 1 is in some bucket.  Example: 1 0 1 1 1 0 1 0 covered by buckets B4 B3 B2 B1.
• No position is in more than one bucket.  Example: 1 0 1 1 1 0 1 0 with overlapping buckets B4 B3 B2 B1 is not valid.
121
Sample Problem
• Apply the 6 rule for buckets
•1 0 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
122
Example: Bucketized Stream
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.
1001010110001011010101010101011010101010101110101010111010100010110010
124
Updating Buckets (2)
• If the current bit is 1:
• (1) Create a new bucket of size 1, for just this bit
• End timestamp = current time
• (2) If there are now three buckets of size 1, combine the
oldest two into a bucket of size 2
• (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
• (4) And so on …
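A minimal sketch of this bucket maintenance (buckets are stored newest-first as (end timestamp, size) pairs; the list-based merging and the query helper are illustrative assumptions):

```python
def dgim_update(buckets, bit, t, N):
    # Drop the oldest bucket if its right end has fallen out of the window of length N.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))          # new bucket of size 1 for just this bit
        size = 1
        while True:
            same = [j for j, (_, s) in enumerate(buckets) if s == size]
            if len(same) <= 2:
                break
            a, b = same[-2], same[-1]      # the two oldest buckets of this size (adjacent)
            buckets[a:b + 1] = [(buckets[a][0], size * 2)]   # merge them, keep newer end time
            size *= 2                      # the merge may cascade to the next size
    return buckets

def dgim_count(buckets):
    # Estimate: sizes of all buckets except the oldest, plus half the size of the oldest.
    if not buckets:
        return 0
    return sum(s for _, s in buckets[:-1]) + buckets[-1][1] / 2

buckets, N = [], 16
for t, bit in enumerate("1001010110001011"):
    buckets = dgim_update(buckets, int(bit), t, N)
print(dgim_count(buckets))
```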
125
Example: Updating Buckets
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
127
Example: Bucketized Stream
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.
1001010110001011010101010101011010101010101110101010111010100010110010
128
Steps involved for query
129
130
131
Error Bound: Proof
111111110000000011101010101011010101010101110101010111010100010110010
N 132
Further Reducing the Error
• Instead of maintaining 1 or 2 of each size bucket, we
allow either r-1 or r buckets (r > 2)
• Except for the largest size buckets; we can have any number
between 1 and r of those
• Error is at most O(1/r)
• By picking r appropriately, we can tradeoff between
number of bits we store and the error
133
Storage Requirements for the DGIM Algorithm
• Each bucket can be represented by O(log N) bits.
• If the window has length N, then there are no more than N 1’s, surely. Suppose the largest bucket is of size 2^j.
• Then j cannot exceed log₂ N, or else there are more 1’s in this bucket than there are 1’s in the entire window.
• Thus, there are at most two buckets of each size from log₂ N down to 1, and no buckets of larger sizes.
• We conclude that there are O(log N) buckets.
• Since each bucket can be represented in O(log N) bits, the total space required for all the buckets representing a window of size N is O(log² N).

Query Answering in the DGIM Algorithm
• Suppose we are asked how many 1’s there are in the last k bits of the window, for some 1 ≤ k ≤ N.
• Find the bucket b with the earliest timestamp that includes at least some of the k most recent bits.
• Estimate the number of 1’s to be the sum of the sizes of all the buckets to the right of (more recent than) bucket b, plus half the size of b itself.
134
Steps involved with example
• Convert each integer to binary
• Generate the elements [c0, c1, c2, …, c_{m−1}], where m is the number of bits
• Calculate the sum = Σ_{i=0}^{m−1} c_i · 2^i
135
136
137
138
139
Extensions
• Can we use the same trick to answer queries How
many 1’s in the last k? where k < N?
• A: Find the earliest bucket B that overlaps with k.
Number of 1s is the sum of sizes of more recent buckets + ½
size of B
1001010110001011010101010101011010101010101110101010111010100010110010
k
140
Extensions
• Stream of positive integers
• We want the sum of the last k elements
• Amazon: Avg. price of last k sales
• Solution:
• (1) If you know all have at most m bits
• Treat m bits of each integer as a separate stream
ci …estimated count for i-th bit
• Use DGIM to count 1s in each integer
• The sum is Σ_{i=0}^{m−1} c_i · 2^i
142
Definition of the Decaying Window
• let a stream currently consist of the elements a1,
a2, . . . , at, where a1 is the first element to arrive
and at is the current element.
• Let c be a small constant, such as 10−6 or 10−9.
• Define the exponentially decaying window for this stream to be the sum Σ_{i=0}^{t−1} a_{t−i}·(1 − c)^i
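A minimal sketch of maintaining this sum incrementally (the function name and the example values are illustrative assumptions): multiply the current sum by (1 − c) and add the newly arrived element.

```python
def decaying_sum(stream, c=1e-6):
    s = 0.0
    for a in stream:
        s = s * (1 - c) + a    # old contributions decay by (1 - c), new element gets weight 1
    return s

print(decaying_sum([1.0, 2.0, 3.0]))   # approximately 6 when c is tiny
```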
143
A decaying window and a fixed-
length window of equal weight
144
Summary
• Sampling a fixed proportion of a stream
• Sample size grows as the stream grows
• Sampling a fixed-size sample
• Reservoir sampling
• Counting the number of 1s in the last N
elements
• Exponentially increasing windows
• Extensions:
• Number of 1s in any last k (k < N) elements
• Sums of integers in the last N elements
145
146
New Topic: Graph Data!
[Course map: topics grouped by data type]
• High dim. data: Locality sensitive hashing; Dimensionality reduction
• Graph data: PageRank, SimRank; Spam Detection
• Infinite data: Filtering data streams; Queries on streams
• Machine learning: SVM; Perceptron, kNN
• Apps: Recommender systems; Duplicate document detection
147
Overview
• Graph data overview
• Problems with early search engines
• PageRank Model
• Flow Formulation
• Matrix Interpretation
• Random Walk Interpretation
• Google’s Formulation
• How to Compute PageRank
148
Graph Data: Social Networks
[Figure: a graph of Internet domains (domain1, domain2, domain3) and routers connected through the Internet]
154
Graph Data: Technological
Networks
157
Web as a Graph
• Web as a directed graph:
• Nodes: Webpages
• Edges: Hyperlinks
[Figure: example webpages as nodes – “I teach a class on Networks. CS224W: Classes are in the Gates building”, “Computer Science Department at Stanford”, “Stanford University” – connected by hyperlink edges]
158
Web as a Directed Graph
159
Broad Question
• How to organize the Web?
• First try: Human curated
Web directories
• Yahoo, DMOZ, LookSmart
• Second try: Web Search
• Information Retrieval investigates:
Find relevant docs in a small
and trusted set
• Newspaper articles, Patents, etc.
• But: Web is huge, full of untrusted documents,
random things, web spam, etc.
160
Web Search: 2 Challenges
2 challenges of web search:
• (1) Web contains many sources of
information
Who to “trust”?
• Trick: Trustworthy pages may point to each
other!
• Main innovations:
1. Define the importance of a page based on:
• How many pages point to it?
• How important are those pages?
2. Judge the contents of a page based on:
• Which terms appear in the page?
• Which terms are used to link to the page?
Ranking Nodes on the Graph
• All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
165
Link Analysis Algorithms
• We will cover the following Link Analysis approaches
for computing importance's
of nodes in a graph:
• Page Rank
• Topic-Specific (Personalized) Page Rank
• Web Spam Detection Algorithms
166
PageRank
167
Links as Votes
168
Example: PageRank Scores
[Figure: example PageRank scores – A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9, and the remaining small nodes: 1.6 each]
169
Simple Recursive Formulation
• Each link’s vote is proportional to the
importance of its source page
• Example: if page i (with 3 out-links) and page k (with 4 out-links) both point to page j, then r_j = r_i/3 + r_k/4, and j passes r_j/3 along each of its own 3 out-links.
170
PageRank: The “Flow” Model
• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” r_j for page j:  r_j = Σ_{i→j} r_i / d_i   (d_i … out-degree of node i)
• Example graph on nodes y, a, m (y links to y and a; a links to y and m; m links to a).
“Flow” equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
171
Solving the Flow Equations
• 3 equations, 3 unknowns, no constants
  Flow equations: r_y = r_y/2 + r_a/2;  r_a = r_y/2 + r_m;  r_m = r_a/2
• No unique solution
• All solutions are equivalent modulo a scale factor
• An additional constraint forces uniqueness: r_y + r_a + r_m = 1
• Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
• The flow equations can be written as r = M · r, where r_j = Σ_{i→j} r_i / d_i
173
Example: Flow Equations & M
      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0

r = M·r:
  [r_y]   [½  ½  0] [r_y]         r_y = r_y/2 + r_a/2
  [r_a] = [½  0  1]·[r_a]   i.e.  r_a = r_y/2 + r_m
  [r_m]   [0  ½  0] [r_m]         r_m = r_a/2
174
175
176
Markov process
• A Markov chain or Markov process is a stochastic
model describing a sequence of possible events in which
the probability of each event depends only on the state
attained in the previous event.
https://setosa.io/ev/markov-chains/
177
Example
• Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
• The flow equation in matrix form: M · r = r
• Suppose page i links to 3 pages, including j: then column i of M has the entry 1/3 in row j, so r_j receives a contribution of r_i/3 in M · r = r
178
Exercise: Matrix Formulation
Graph on pages A, B, C, D:
        M                r      r
  [ 0   1/2   1    0 ]  [rA]   [rA]
  [1/3   0    0   1/2] ·[rB] = [rB]
  [1/3   0    0   1/2]  [rC]   [rC]
  [1/3  1/2   0    0 ]  [rD]   [rD]
Eigen vector
• An eigenvector of a matrix A is a vector X such that when A is multiplied by X, the direction of the resulting vector remains the same as that of X.
• We write A·X = λ·X, where A is an arbitrary matrix, the λ are eigenvalues, and X is an eigenvector corresponding to each eigenvalue.
• Here, we can see that A·X is parallel to X, so X is an eigenvector.
http://www.sharetechnote.com/html/Handbook_EngMath_Matrix_Determinent.bak
Linear Algebra Reminders
• A is a column stochastic matrix if each of its columns add up to 1 and there are no
negative entries.
• Our adjacency matrix M is column stochastic. Why?
184
Power Iteration Method
• Given a web graph with n nodes, where the nodes are
pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
• Suppose there are N web pages
• Initialize: r(0) = [1/N, …, 1/N]^T
• Iterate: r(t+1) = M ∙ r(t), i.e. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i   (d_i … out-degree of node i)
• Stop when |r(t+1) – r(t)|_1 < ε
  |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm can be used, e.g., Euclidean
185
PageRank: How to solve?
Graph on y, a, m with column-stochastic matrix:
      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0
• Power Iteration:
  • Set r_j = 1/N
  • 1: r'_j = Σ_{i→j} r_i / d_i
  • 2: r = r'
  • Goto 1
Flow equations: r_y = r_y/2 + r_a/2;  r_a = r_y/2 + r_m;  r_m = r_a/2
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   1/3   5/12    9/24  …  6/15
  r_a = 1/3   3/6   1/3    11/24  …  6/15
  r_m = 1/3   1/6   3/12    1/6   …  3/15
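A minimal sketch of power iteration on this y/a/m example (the iteration cap and tolerance are illustrative assumptions):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # columns are y, a, m, as in the matrix above
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = np.full(3, 1/3)               # r(0) = [1/N, ..., 1/N]
for _ in range(100):
    r_next = M @ r                # r(t+1) = M . r(t)
    if np.abs(r_next - r).sum() < 1e-9:   # stop when the L1 change is tiny
        break
    r = r_next
print(r)                          # approaches [6/15, 6/15, 3/15]
```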
186
Power Iteration Convergence
• Power iteration:
A method for finding principal eigenvector
(the vector corresponding to the largest
eigenvalue)
• r(1) = M · r(0)
• r(2) = M · r(1) = M·(M·r(0)) = M² · r(0)
• r(3) = M · r(2) = M·(M²·r(0)) = M³ · r(0)
• Claim: the sequence M·r(0), M²·r(0), …, M^k·r(0), … approaches the dominant eigenvector of M
188
PageRank:
Random Walk Interpretation
189
Random Walk Interpretation of PageRank
• Consider a web surfer:
• He starts at a random page
• He follows a random link at every time step
• After a sufficiently long time:
• What is the probability that he is at page j?
• This probability corresponds to the page rank of j.
https://neo4j.com/docs/graph-data-science/current/alpha-
algorithms/random-walk/
Need for Random walks in Page rank
• Dead-ends occur when pages have no out-links.
• In this case, the random walk will abort and a path containing only the first
node will be returned.
• This problem can be avoided by running on an undirected graph, so that the
random walk will traverse relationships in both directions.
• If there are no links from within a group of pages to outside of the group,
then the group is considered a spider trap.
• Random walks starting from any of the nodes in that group will only traverse
to the others in the group - our implementation of the algorithm doesn’t
allow a random walk to jump to non-neighbouring nodes.
• Sinks can occur when a network of links forms an infinite cycle.
Random Walk Interpretation
i1 i2 i3
192
Stationary Distribution
• A stationary distribution of a Markov chain is a
probability distribution that remains unchanged in the
Markov chain as time progresses.
• Typically, it is represented as a row vector π whose
entries are probabilities summing to 1, and given
transition matrix P it satisfies
• π=πP
• In other words, π is invariant by the matrix P.
https://brilliant.org/wiki/stationary-distributions/
193
The Stationary Distribution
• Where is the surfer at time t+1? i1 i2 i3
194
Existence and Uniqueness
• A central result from the theory of random walks (a.k.a. Markov processes):
• For graphs satisfying certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is.
195
Example: Random Walk
Graph on A, B, C, D; the surfer starts at A.
Time t = 1:
  p(A, 1) = 0
  p(B, 1) = 1/3
  p(C, 1) = 1/3
  p(D, 1) = 1/3
Example: Random Walk
Time t = 1:  p(B, 1) = 1/3,  p(C, 1) = 1/3,  p(D, 1) = 1/3
Time t = 2:  p(A, 2) = ?
        M                p(t)    p(t+1)
  [ 0   1/2   1    0 ]  [pA]     [pA]
  [1/3   0    0   1/2] ·[pB]  =  [pB]
  [1/3   0    0   1/2]  [pC]     [pC]
  [1/3  1/2   0    0 ]  [pD]     [pD]
• Iterative algorithm:
1. Initialize rank of each page to 1/N (where N is the number of pages)
2. Compute the next page rank values using the formula above
3. Repeat step 2 until the page rank values do not change much
      y    a    m          r_y = r_y/2 + r_a/2
  y   ½    ½    0          r_a = r_y/2 + r_m
  a   ½    0    1          r_m = r_a/2
  m   0    ½    0
Irreducible graph
Aperiodic
• State i has period k if any return to state i must occur in
multiples of k time steps.
• If k = 1 for a state, it is called aperiodic.
• Returning to the state at irregular intervals
• A Markov chain is aperiodic if all its states are aperiodic.
• If Markov chain is irreducible, one aperiodic state means all stated are
aperiodic.
• Example: a cycle A → B → C → D → A. Starting at A at time t0, the walk returns to A at t0 + 4, t0 + 8, …, so the period is k = 4.
• How to make this aperiodic? Add any self edge.
PageRank: The Google Formulation
PageRank: Three Questions
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i,  or equivalently  r = M·r
• Does this converge?
• Does it converge to what we want?
• Are results reasonable?
206
Does this converge?
• Two pages a and b that point only to each other; r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
• Example (iterations 0, 1, 2, 3, …):
  r_a = 1  0  1  0  …
  r_b = 0  1  0  1  …
• The values oscillate and never converge.
207
Does it converge to what we
want?
• Page a points to page b, and b has no out-links; r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
• Example (iterations 0, 1, 2, …):
  r_a = 1  0  0  0  …
  r_b = 0  1  0  0  …
• All the importance “leaks out”, which is not what we want.
208
Problems and solution
209
PageRank: Problems
Dead end
2 problems:
• (1) Some pages are
dead ends (have no out-links)
• Random walk has “nowhere” to go to
• Such pages cause importance to “leak out”
Spider
trap
210
Problem: Spider Traps
      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    1
• Power Iteration:
  • Set r_j = 1/N
  • r_j = Σ_{i→j} r_i / d_i, and iterate
• m is a spider trap:  r_y = r_y/2 + r_a/2;  r_a = r_y/2;  r_m = r_a/2 + r_m
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   2/6   3/12    5/24  …  0
  r_a = 1/3   1/6   2/12    3/24  …  0
  r_m = 1/3   3/6   7/12   16/24  …  1
All the PageRank score gets “trapped” in node m. 211
Solution: Teleports!
• The Google solution for spider traps: At each
time step, the random surfer has two
options
• With prob. β, follow a link at random
• With prob. 1−β, jump to some random page
• Common values for β are in the range 0.8 to 0.9
• Surfer will teleport out of spider trap
within a few time steps
y y
a m a m 212
Problem: Dead Ends
      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0
• Power Iteration:
  • Set r_j = 1/N
  • r_j = Σ_{i→j} r_i / d_i, and iterate
• Flow equations:  r_y = r_y/2 + r_a/2;  r_a = r_y/2;  r_m = r_a/2
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   2/6   3/12   5/24  …  0
  r_a = 1/3   1/6   2/12   3/24  …  0
  r_m = 1/3   1/6   1/12   2/24  …  0
Here the PageRank “leaks” out since the matrix is not column stochastic. 213
Solution: Always Teleport!
• Teleports: Follow random teleport links
with probability 1.0 from dead-ends
• Adjust matrix accordingly
  Before:                After (dead-end column for m replaced by teleports):
       y    a    m            y    a    m
   y   ½    ½    0        y   ½    ½    ⅓
   a   ½    0    0        a   ½    0    ⅓
   m   0    ½    0        m   0    ½    ⅓
214
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a
problem
and why do teleports solve the problem?
• Spider-traps: PageRank scores are not what
we want
• Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
• Dead-ends are a problem
• The matrix is not column stochastic so our initial
assumptions are not met
• Solution: Make matrix column stochastic by
always teleporting when there is nowhere else to
go
215
Solution: Random Teleports
• Google’s solution that does it all:
At each step, random surfer has two options:
• With probability β, follow a link at random
• With probability 1−β, jump to some random page
217
Random Teleports (β = 0.8)
A = β·M + (1−β)·[1/N]_{N×N}
            M                       [1/N]_{N×N}
        [1/2  1/2   0 ]          [1/3  1/3  1/3]
A = 0.8·[1/2   0    0 ]  + 0.2 · [1/3  1/3  1/3]
        [ 0   1/2   1 ]          [1/3  1/3  1/3]

        y      a       m
  y  [7/15   7/15    1/15]
  a  [7/15   1/15    1/15]
  m  [1/15   7/15   13/15]
https://matrix.reshish.com/addSubCalculation.php
219
Matrix Formulation
• Suppose there are N pages
• Consider page i, with di out-links
• We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
• The random teleport is equivalent to:
• Adding a teleport link from i to every other page and setting the transition probability to (1−β)/N
• Reducing the probability of following each out-link from 1/|d_i| to β/|d_i|
• Equivalent: tax each page a fraction (1−β) of its score and redistribute it evenly
220
Example
• Avoiding Dead Ends
• C is now a dead end
221
Steps involved in above example
222
223
The reduced graph with no dead
ends
224
Spider Traps and Taxation
225
Apply teleporting
226
227
228
How do we actually compute
the PageRank?
229
Computing Page Rank
• Key step is matrix-vector multiplication
• rnew = A ∙ rold
• Easy if we have enough main memory to
hold A, rold, rnew
• Say N = 1 billion pages
• We need 4 bytes for each entry (say)
• 2 billion entries for the two vectors, approx 8 GB
• Matrix A has N² entries: 10^18 is a large number!
A = β·M + (1−β)·[1/N]_{N×N}
        [½  ½  0]          [1/3  1/3  1/3]   [7/15   7/15   1/15]
A = 0.8·[½  0  0]  + 0.2 · [1/3  1/3  1/3] = [7/15   1/15   1/15]
        [0  ½  1]          [1/3  1/3  1/3]   [1/15   7/15  13/15]
230
Matrix Sparseness
• Reminder: Our original matrix was sparse.
• On average: ~10 out-links per vertex
• # of non-zero values in matrix M: ~10N
• Teleport links make matrix M dense.
• Can we convert it back to the sparse form?
Rearranging the Equation
• So we get:  r = β·M·r + [(1−β)/N]_N
• Note: here we assumed M has no dead-ends
• [x]_N … a vector of length N with all entries x
232
Example: Equation with Teleports
  r_new             M               r_old
  [rA]        [ 0   1/2   1    0 ]  [rA]          [1/4]
  [rB]  = β · [1/3   0    0   1/2] ·[rB]  + (1−β)·[1/4]
  [rC]        [1/3   0    0   1/2]  [rC]          [1/4]
  [rD]        [1/3  1/2   0    0 ]  [rD]          [1/4]
235
PageRank: The Complete Algorithm
• Input: Graph G and parameter β
  • Directed graph G (can have spider traps and dead ends)
  • Parameter β
• Output: PageRank vector r^new
• Set: r_j^old = 1/N
• Repeat until convergence: Σ_j |r_j^new − r_j^old| > ε
  • ∀j: r'_j^new = Σ_{i→j} β·r_i^old / d_i
         (r'_j^new = 0 if the in-degree of j is 0)
  • Now re-insert the leaked PageRank:
    ∀j: r_j^new = r'_j^new + (1 − S)/N,  where S = Σ_j r'_j^new
  • r^old = r^new
If the graph has no dead-ends then the amount of leaked PageRank is 1−β. But since we have dead-ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
236
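A minimal sketch of this complete algorithm (the dense NumPy matrix is used only for readability; parameter defaults are illustrative assumptions):

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-8, max_iter=100):
    """M is column-stochastic except for dead-end columns, which are all zero."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = beta * (M @ r)          # r'_j = sum_{i->j} beta * r_i / d_i
        S = r_new.sum()                 # leaked PageRank (< 1 because of teleports/dead ends)
        r_new += (1.0 - S) / N          # re-insert it, spread evenly over all pages
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# The y/a/m spider-trap example: with teleports, m no longer absorbs everything.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))
```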
Sparse Matrix Encoding: First Try
Store a triplet for each nonzero entry: (row, column, weight)
Graph on A, B, C, D with matrix:
  [ 0   1/2   1    0 ]
  [1/3   0    0   1/2]
  [1/3   0    0   1/2]
  [1/3  1/2   0    0 ]
(2, 1, 1/3); (3, 1, 1/3); (4, 1, 1/3); (1, 2, 1/2); (4, 2, 1/2); (1, 3, 1); …
Assume 4 bytes per integer and 8 bytes per float: 16 bytes per entry
Inefficient: Repeating the column index and weight multiple times
Sparse Matrix Encoding
• Encode sparse matrix using only nonzero entries
• Space proportional roughly to number of links
• Say 10N, or 4*10*1 billion = 40GB
• Still won’t fit in memory, but will fit on disk
source node | degree | destination nodes
0 3 1, 5, 7
1 5 17, 64, 113, 117, 245
2 2 13, 23
238
Basic Algorithm: Update Step
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• 1 step of power-iteration is:
Initialize all entries of rnew = (1-β) / N
For each page i (of out-degree di):
Read into memory: i, di, dest1, …, destdi, rold(i)
For j = 1…di
rnew(destj) += rold(i) / di
[Figure: the r_new vector (entries 0–6), the sparse encoding of M as (source, degree, destination nodes) rows – (0, 3, 1, 5, 6), (1, 4, 17, 64, 113, 117), (2, 2, 13, 23), … – and the r_old vector (entries 0–6)]
239
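A minimal sketch of this update step over the sparse encoding (the in-memory list stands in for rows read from disk; the example rows and N are illustrative assumptions):

```python
def sparse_update(rows, r_old, beta, N):
    r_new = [(1 - beta) / N] * N               # initialize all entries to (1-beta)/N
    for i, d_i, dests in rows:                 # stream over the rows of M (as if from disk)
        for j in dests:
            r_new[j] += beta * r_old[i] / d_i  # r_new(dest_j) += beta * r_old(i) / d_i
    return r_new

rows = [(0, 3, [1, 5, 6]),                     # illustrative (source, degree, destinations) rows
        (1, 4, [0, 2, 3, 4]),
        (2, 2, [3, 5])]
N = 7
r_old = [1 / N] * N
print(sparse_update(rows, r_old, beta=0.8, N=N))
```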
Analysis
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• In each iteration, we have to:
• Read rold and M
• Write rnew back to disk
• Cost per iteration of Power method:
= 2|r| + |M|
• Question:
• What if we could not even fit rnew in memory?
240
Block-based Update Algorithm
241
Block-based Update Algorithm
242
Block-Stripe Update Algorithm
[Figure: the sparse encoding of M broken into stripes, shown alongside the r_new blocks and the r_old vector]
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of r_new. 243
Analysis of Block Update
• Similar to nested-loop join in databases
• Break rnew into k blocks that fit in memory
• Scan M and rold once for each block
• Total cost:
• k scans of M and rold
• Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
• Hint: M is much bigger than r (approx 10-20x), so we must
avoid reading it k times per iteration
244
Block-Stripe Analysis
• Break M into stripes
• Each stripe contains only destination nodes
in the corresponding block of rnew
• Some additional overhead per stripe
• But it is usually worth it
• Cost per iteration of the Power method: |M|(1 + ε) + (k+1)|r|
247
Some Problems with Page Rank
• Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
• Other models of importance
• Solution: Hubs-and-Authorities
• Susceptible to Link spam
• Artificial link topographies created in order to boost page
rank
• Solution: TrustRank
248
Topic-Sensitive PageRank
249
Topic-Specific PageRank
• Instead of generic popularity, can we measure
popularity within a topic?
• Goal: Evaluate Web pages not just according to their
popularity, but by how close they are to a particular
topic, e.g. “sports” or “history”
• Allows search queries to be answered based on interests
of the user
• Example: Query “Trojan” wants different pages depending on
whether you are interested in sports, history and computer
security
250
Topic-Specific PageRank
• Random walker has a small probability of
teleporting at any step
• Teleport can go to:
• Standard PageRank: Any page with equal
probability
• To avoid dead-end and spider-trap problems
• Topic Specific PageRank: A topic-specific set of
“relevant” pages (teleport set)
• Idea: Bias the random walk
• When walker teleports, she pick a page from a
set S
• S contains only pages that are relevant to the
topic
• E.g., Open Directory (DMOZ) pages for a given
topic/query
• For each teleport set S, we get a different vector 251
Matrix Formulation
• To make this work all we need is to update
the teleportation part of the PageRank
formulation:
A_ij = β·M_ij + (1 − β)/|S|   if i ∈ S
A_ij = β·M_ij                  otherwise
• A is stochastic!
• We weighted all pages in the teleport set S
equally
• Could also assign different weights to pages!
• Compute as for regular PageRank:
• Multiply by M, then add a vector
• Maintains sparseness 252
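A minimal sketch of this formulation (the 4-node matrix reuses the earlier A/B/C/D example, and S = {0} is an illustrative assumption; dead ends are not handled here):

```python
import numpy as np

def topic_specific_pagerank(M, S, beta=0.8, eps=1e-8, max_iter=100):
    N = M.shape[0]
    teleport = np.zeros(N)
    teleport[list(S)] = 1.0 / len(S)        # teleport mass goes only to pages in S
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = beta * (M @ r) + (1 - beta) * teleport
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

M = np.array([[0.0, 0.5, 1.0, 0.0],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0]])
print(topic_specific_pagerank(M, S={0}))
```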
Example: Topic-Specific PageRank
Graph on nodes 1–4:
  S = {1,2,3,4}, β = 0.8:  r = [0.13, 0.10, 0.39, 0.36]
  S = {1,2,3},   β = 0.8:  r = [0.17, 0.13, 0.38, 0.30]
  S = {1,2},     β = 0.8:  r = [0.26, 0.20, 0.29, 0.23]
  S = {1},       β = 0.8:  r = [0.29, 0.11, 0.32, 0.26]
  S = {1},       β = 0.90: r = [0.17, 0.07, 0.40, 0.36]
  S = {1},       β = 0.70: r = [0.39, 0.14, 0.27, 0.19]
253
DMOZ
• DMOZ (from directory.mozilla.org, an earlier domain
name, stylized in lowercase in its logo) was a
multilingual open-content directory of World Wide
Web links
https://dmoz-odp.org/
https://www.kaggle.com/shawon10/url-classification-dataset-
dmoz
255
Discovering the Topic Vector S
• Create different PageRanks for different topics
• The 16 DMOZ top-level categories:
• arts, business, sports,…
• Which topic ranking to use?
• User can pick from a menu
• Classify query into a topic
• Can use the context of the query
• E.g., query is launched from a web page talking about a known
topic
• History of queries e.g., “basketball” followed by “Jordan”
• User context, e.g., user’s bookmarks, …
256
Application to Measuring Proximity in
Graphs
257
[Tong-Faloutsos, ‘06]
Proximity on Graphs
259
[Tong-Faloutsos, ‘06]
• Multiple connections
• Quality of connection
…
• Direct & Indirect
connections
• Length, Degree,
Weight… 260
SimRank: Idea
• SimRank: Random walks from a fixed node
• Problem:
• Must be done once for each node u
• Suitable for sub-Web-scale applications
261
Bipartite Graph
• A bipartite graph is a special kind of graph with the
following properties-
• It consists of two sets of vertices X and Y.
• The vertices of set X join only with the vertices of
set Y.
• The vertices within the same set do not join
https://www.gatevidyalay.com/bipartite-graphs/
262
SimRank: Example
[Figure: bipartite Conference–Author graph]
Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}
263
SimRank: Example
264
Example
β = 0.8
265
Example
• Teleport set S = {B,D}
• Vector (1 − β)eS/|S| has 1/10 for its second and fourth
components and 0 for the other two components
• 1 − β = 1/5, the size of S is 2, and eS has 1 in the
components for B and D and 0 in the components for
A and C
266
Example
267
PageRank: Summary
• “Normal” PageRank:
• Teleports uniformly at random to any node
• All nodes have the same probability of surfer landing there: S =
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
• Topic-Specific PageRank also known as Personalized
PageRank:
• Teleports to a topic specific set of pages
• Nodes can have different probabilities of surfer landing there: S =
[0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
• Random Walk with Restarts:
• Topic-Specific PageRank where teleport is always to the same node.
S=[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
268
TrustRank:
Combating Web Spam
What is Web Spam?
• Spamming:
• Any deliberate action to boost a web
page’s position in search engine results, incommensurate with
page’s real value
• Spam:
• Web pages that are the result of spamming
• This is a very broad definition
• SEO industry might disagree!
• SEO = search engine optimization
• Link spam:
• Creating link structures that boost PageRank of a particular
page
Link Spamming
• Three kinds of web pages from a
spammer’s point of view
• Inaccessible pages
• Accessible pages
• e.g., blog comments pages
• spammer can post links to his pages
• Owned pages
• Completely controlled by spammer
• May span multiple domain names
Link Farms
• Spammer’s goal:
• Maximize the PageRank of target page t
• Technique:
• Get as many links from accessible pages as possible to target
page t
• Construct “link farm” to get PageRank
multiplier effect
Link Farms
[Figure: link-farm structure – inaccessible pages, accessible pages pointing to the target page t, and millions of owned farm pages exchanging links with t]
• y = x/(1 − β²) + c·M/N,  where c = β/(1 + β)
  (y … PageRank of the target page t; x … PageRank contributed by accessible pages; M … number of farm pages; N … total number of pages)
• Trust splitting:
• The larger the number of out-links from a page, the less
scrutiny the page author gives each out-link
• Trust is split across out-links
Categorize Spam Pages after TrustRank
• Solution 1: Use a threshold value and mark all pages below the trust
threshold as spam
• Complementary view:
What fraction of a page’s PageRank comes from spam
pages?
Spam Mass Estimation
• Solution 2:
• r_p = PageRank of page p
• r_p⁺ = PageRank of p with teleport into trusted pages only