Unit2 Bda
UNIT II
Mining Data Streams & Link
Analysis
1
Mining Data Streams
• A stream refers to data or media that is sent as a continuous flow and can be played or processed as it arrives at the receiver.
• Streaming is a technique in which the sender transmits data, often in compressed form, over the Internet.
• As data is received, it is presented to the user immediately.
• Streaming does not require storing the full data on storage devices (e.g., a hard drive).
2
New Topic: Infinite Data
[Course map: topics grouped by data type]
• High dim. data: Locality sensitive hashing; Dimensionality reduction
• Graph data: PageRank, SimRank; Spam Detection
• Infinite data: Filtering data streams; Web advertising
• Machine learning: SVM; Perceptron, kNN
• Apps: Recommender systems; Duplicate document detection
3
https://hazelcast.com/blog/the-role-of-streaming-technology-in-retail-banking/
4
The Stream Data Model
• Refers to a sequence of data elements or symbols made available
over time
• A data stream is transmitted from a source and received at the processing end of a network
• A continuous stream of data flows between the source and receiver ends and is processed in real time
• In many data mining situations, we do not know the entire data
set in advance
• Stream Management is important when the input rate is
controlled externally:
• Google queries
• Twitter or Facebook status updates
• We can think of the data as infinite and non-stationary (the
distribution changes over time)
5
SGD is a Streaming Alg.
• Stochastic Gradient Descent (SGD) is an
example of a stream algorithm
• In Machine Learning we call this: Online Learning
• Allows for modeling problems where we have
a continuous stream of data
• We want an algorithm to learn from it and
slowly adapt to the changes in data
• Idea: Do slow updates to the model
• SGD (SVM, Perceptron) makes small updates
• So: First train the classifier on training data.
• Then: For every example from the stream, we slightly
update the model (using small learning rate)
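A minimal sketch of what such a slow, streaming update can look like (illustrative only; the function name, the perceptron-style rule and the learning rate eta are assumptions, not the slide's exact algorithm):

```python
import numpy as np

def online_sgd_update(w, x, y, eta=0.01):
    """One streaming update of a perceptron-style linear model w on example (x, y) with y in {+1, -1}."""
    if y * np.dot(w, x) <= 0:        # misclassified (or on the margin)
        w = w + eta * y * x          # nudge the weights slightly toward the example
    return w

# Usage: train on a batch first, then keep adapting as stream examples arrive.
w = np.zeros(3)
stream = [(np.array([1.0, 0.5, -0.2]), 1), (np.array([-0.3, 1.2, 0.7]), -1)]
for x, y in stream:
    w = online_sgd_update(w, x, y)
```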
6
Data-stream-management System
[Figure: streams entering the data-stream-management system; each stream is composed of elements/tuples]
7
• Also refers to communication of bytes or characters
over sockets in a computer network
• A program uses stream as an underlying data type in
inter-process communication channels.
• Input elements enter at a rapid rate, at one or more
input ports (i.e., streams)
• We call elements of the stream tuples
• The system cannot store the entire stream accessibly
8
• Streams may be archived in a large archival store, but
we assume it is not possible to answer queries from the
archival store.
• It could be examined only under special circumstances
using TIME-CONSUMING RETRIEVAL PROCESSES.
• There is also a working store, into which summaries or
parts of streams may be placed, and which can be used
for answering queries.
• The working store might be disk, or it might be main
memory, depending on how fast we need to process
queries.
• But either way, it is of sufficiently limited capacity
that it cannot store all the data from all the streams.
9
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Sampling data from a stream
• Construct a random sample
• Queries over sliding windows
• Number of items of type x in the last k elements
of the stream
10
Problems on Data Streams
• Types of queries one wants to answer on a data stream:
• Filtering a data stream
• Select elements with property x from the stream (drop unwanted data)
• Counting distinct elements
• Number of distinct elements in the last k elements of the stream (checking for unique elements or queries)
• Estimating moments
• Estimate the avg./std. dev. of the last k elements (summarizing unevenly distributed data for statistical purposes)
• Finding frequent elements
• e.g., which queries occur most frequently in the data stream
11
Applications (1)
• Mining query streams
• Google wants to know what queries are more frequent today
than yesterday
12
Applications (2)
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between
telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks
13
Examples of Stream Sources
Image Data
14
Examples of Streaming Data
15
Streaming Data: what’s different?
16
Important tasks with streaming data
▪ Sampling from a stream
– Because we cannot afford to process the whole data
▪ Filtering from streaming data
– Accept / discard elements based on some condition
▪ Estimating the number of distinct elements in a stream
– Without actually counting them
▪ Estimating moments
17
Stream Queries
• A stream query operates on one or two streams to transform
their contents into a single output stream.
• A stream query definition declares an identifier for the items in
the stream so that the item can be referred to by the operators
in the stream query.
• Example 1:
• The stream produced by the ocean-surface-temperature sensor
mentioned might have a standing query
• To output an alert whenever the temperature exceeds 25
degrees centigrade.
• This query is easily answered, since it depends only on the most
recent stream element.
18
• We might have a standing query that, each time a new reading
arrives, produces the average of the 24 most recent readings.
• That query also can be answered easily, if we store the 24
MOST RECENT STREAM ELEMENTS.
• When a new stream element arrives, we can drop from the
working store the 25th most recent element, since it will never
again be needed (unless there is some other standing query that
requires it).
19
• Example 2:
• Another query we might ask is the MAXIMUM TEMPERATURE
EVER RECORDED BY THAT SENSOR.
• The maximum of all stream elements ever seen.
• It is not necessary to record the entire stream.
• When a new stream element arrives, we COMPARE it with the
stored maximum, and set the maximum to whichever is larger.
• We can then answer the query by producing the current value of
the maximum.
• Similarly, if we want the average temperature over all time, we
have only to record two values: the number of readings ever sent
in the stream and the sum of those readings.
• We can adjust these values easily each time a new reading
arrives, and we can produce their quotient as the answer to the
query
20
• Example 3:
• The other form of query is AD-HOC, a question asked once
about the current state of a stream or streams.
• If we do not store all streams in their entirety, as normally we
can not, then we cannot expect to answer arbitrary queries
about streams.
• If we have some idea what kind of queries will be asked through
the ad-hoc query interface, then we can prepare for them by
storing appropriate parts or summaries of streams
• A sliding window can be the most recent n elements of a stream,
for some n, or it can be all the elements that arrived within the
last t time units, e.g., one day.
• If we regard each stream element as a tuple, we can treat the
window as a relation and query it with any SQL query
21
Sliding window
22
• Web sites often like to report the number of unique users over the past
month.
• If we think of each login as a stream element, we can maintain a
window that is all logins in the most recent month.
• We must associate the arrival time with each login, so we know when
it no longer belongs to the window.
• If we think of the window as a relation Logins(name, time)
• Then it is simple to get the number of unique users over the past
month.
• The SQL query is:
• SELECT COUNT(DISTINCT(name))
• FROM Logins
• WHERE time >= t;
• Here, t is a constant that represents the time one month before the
current time.
23
Sampling Data in a Stream
• Managing streaming data
• Extracting reliable samples from a stream
• Example 1:
• Selecting a subset of a stream,
• so that we can ask queries about the selected subset and have the responses be statistically representative of the stream as a whole.
24
• A search engine receives a stream of queries, and it would like to
study the behavior of typical users
• The stream consists of tuples (user, query, time).
• Suppose that we want to answer queries such as “What fraction
of the typical user’s queries were repeated over the past month?”
• Instead of keeping the complete set of tuples, we pick up a sample here
• Assume also that we wish to store only 1/10th of the stream elements (e.g., if the size is 100, we keep only 10)
• The obvious approach would be to generate a random number, say an integer from 0 to 9, in response to each search query.
• Store the tuple if and only if the random number is 0.
• Each user has, on average, 1/10th of their queries stored
• However, this is subject to statistical fluctuations
25
• The law of large numbers will assure us that most users will have
a fraction quite close to 1/10th of their queries stored.
• However, this approach gives the wrong answer to queries about the average number of duplicate queries for a user.
• ASSUMPTIONS MADE:
• Suppose a user has issued s search queries exactly once in the past month,
• d search queries exactly twice,
• and no search queries more than twice.
26
• Of the d search queries issued twice, only d/100 will appear twice in the sample; that fraction is d times 1/100,
• the probability that both occurrences of the query will be in the 1/10th sample.
• Of the queries that appear twice in the full stream, 18d/100 will appear
exactly once.
• To see why, note that 18/100 is the probability that one of the two
occurrences will be in the 1/10th of the stream that is selected, while the
other is in the 9/10th that is not selected.
• The correct answer to the query about the fraction of repeated searches is
d/(s+d).
• However, the answer we shall obtain from the sample is d/(10s+19d).
• To derive the latter formula, note that d/100 appear twice, while
s/10+18d/100 appear once.
• Thus, the fraction appearing twice in the sample is d/100 divided by d/100+
s/10 + 18d/100.
• This ratio is d/(10s+ 19d). For no positive values of s and d is d/(s + d) =
d/(10s + 19d).
27
Sampling from Data Streams
▪ Search queries coming in a stream
▪ Question: Typically, what fraction of the queries were
repeated by the same user over the last month?
▪ Assume: want to answer from a sample storing only
10% of the stream
28
Sampling: seemingly obvious approach
31
Selecting queries for “chosen” users
▪ For each query arriving in the stream
– Check if the user is seen before
• If yes, check if the user is chosen
– If yes, keep the query
– If not, discard the query
• If not, determine whether to choose the user
– Generate a random integer in the range 0-9
– If the integer is 0, keep the query, put the user in the chosen list
– If not, discard the query
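A minimal sketch of this selection procedure (the dictionary-based bookkeeping and names are illustrative assumptions):

```python
import random

user_status = {}          # user -> True ("chosen") or False ("not chosen")
sample = []               # kept (user, query, time) tuples

def process(user, query, time):
    if user not in user_status:                            # user not seen before
        user_status[user] = (random.randint(0, 9) == 0)    # choose roughly 1/10 of users
    if user_status[user]:                                  # user is chosen
        sample.append((user, query, time))                 # keep the query
    # otherwise discard the query
```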
32
Obtaining a Representative Sample
33
• When a search query arrives, we check whether the user is already on our in/out list; if the user is marked “in”, we add this search query to the sample, and if not, then not.
• However, if we have no record of ever having seen this
user before, then we generate a random integer
between 0 and 9.
• If the number is 0, we add this user to our list with
value “in,” and if the number is other than 0, we add
the user with the value “out”.
34
Checking if the user is seen before
35
The General Sampling Problem
• Our stream consists of tuples with n components.
• A subset of the components are the key components,
on which the selection of the sample will be based.
• In our running example, there are three components –
user, query, and time – of which only user is in the key.
• However, we could also take a sample of queries by
making query be the key, or even take a sample of
user-query pairs by making both those components
form the key.
36
• To take a sample of size a/b, we hash the key value for each
tuple to b buckets, and accept the tuple for the sample if the
hash value is less than a.
• If the key consists of more than one component, the hash
function needs to combine the values for those components to
make a single hash-value.
• The result will be a sample consisting of all tuples with certain
key values.
• The selected key values will be approximately a/b of all the key
values appearing in the stream.
37
Varying the Sample Size
• We retain all the search queries of the selected 1/10th of the
users, forever.
• As time goes on, more searches for the same users will be
accumulated, and new users that are selected for the sample will
appear in the stream.
• If we have a budget for how many tuples from the stream can
be stored as the sample, then the fraction of key values must
vary, lowering as time goes on.
• In order to assure that at all times, the sample consists of all
tuples from a subset of the key values, we choose a hash function
h from key values to a very large number of values 0,
1, . . . ,B−1.
38
• We maintain a threshold t, which initially can be the largest
bucket number, B − 1.
• At all times, the sample consists of those tuples whose key K
satisfies h(K) ≤ t.
• New tuples from the stream are added to the sample if and only
if they satisfy the same condition.
• If the number of stored tuples of the sample exceeds the allotted
space, we lower t to t−1 and remove from the sample all those
tuples whose key K hashes to t.
• For efficiency, we can lower t by more than 1, and remove the
tuples with several of the highest hash values, whenever we need
to throw some key values out of the sample.
• Further efficiency is obtained by maintaining an index on the
hash value, so we can find all those tuples whose keys hash to a
particular value quickly.
39
Generalized Solution
• Stream of tuples with keys:
• Key is some subset of each tuple’s components
• e.g., tuple is (user, search, time); key is user
• Choice of key depends on application
To take a sample of size a/b: hash each tuple’s key into b buckets, and pick the tuple if its hash value is less than a.
How to generate a 30% sample?
Hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets. 40
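A minimal sketch of this hash-based sampling (the use of MD5 and the helper name keep_tuple are illustrative assumptions):

```python
import hashlib

def keep_tuple(key, a=3, b=10):
    """Return True for roughly an a/b fraction of all key values (here 3/10 = 30%)."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % b
    return h < a

# Key = user, so either all or none of a given user's queries are kept.
stream = [("u1", "q1", 1), ("u2", "q2", 2), ("u1", "q3", 3)]
sample = [t for t in stream if keep_tuple(t[0])]
```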
Filtering Streams
• Common process on streams is selection, or filtering.
• To accept those tuples in the stream that meet a criterion.
• Accepted tuples are passed to another process as a stream, while other tuples
are dropped.
• If the selection criterion is a property of the tuple that can be calculated (e.g., the first component is less than 10), then the selection is easy to do.
• The problem becomes harder when the criterion involves lookup for membership in a set.
• It is especially hard when that set is too large to store in main memory.
• The technique known as “Bloom filtering” is a way to eliminate most of the tuples that do not meet the criterion.
41
• Suppose you are creating a Gmail account.
• You want to enter a cool username; you entered it and got the message “Username is already taken”.
• You added your birth date to the username; still no luck.
• Now you have added your university roll number as well, and still got “Username is already taken”.
• But have you ever wondered how quickly Gmail checks the availability of a username against the millions of usernames registered with it?
• There are many ways to do this job
42
• A Bloom filter is a data structure that can do this job.
• To understand Bloom filters, we must first know what hashing is.
• A hash function takes an input and outputs a fixed-length identifier that is used to identify the input.
43
Hashing
• Assume we want to design a system for storing employee records keyed using
phone numbers.
• We want the following operations to be performed efficiently:
• Insert a phone number and corresponding information.
• Search a phone number and fetch the information.
• Delete a phone number and related information.
• Techniques can be used:
1.Array of phone numbers and records.
2.Linked List of phone numbers and records.
3.Balanced binary search tree with phone numbers as keys.
4.Direct Access Table.
44
Hashing
S.No Emp Name Phone no
1 ABC 9876543210
2 XYZ 9012345698
45
Hashing function
• A function that converts a given big phone number to
a small practical integer value.
• The mapped integer value is used as an index in hash
table.
• In simple terms, a hash function maps a big number or
string to a small integer that can be used as index in
hash table.
• A good hash function should have following properties
• Efficiently computable.
• Should uniformly distribute the keys (Each table position
equally likely for each key)
46
Bloom Filter
• A Bloom filter is a space-efficient probabilistic data
structure that is used to test whether an element is a
member of a set.
• For example, checking availability of username is set
membership problem, where the set is the list of all
registered username.
• The price we pay for this efficiency is that the filter is probabilistic in nature, which means there might be some FALSE POSITIVE results.
• When testing if an element is in the Bloom filter, false positives are possible.
• It will either say that an element is definitely not in the set, or that the element is possibly in the set.
• It might tell us that a given username is already taken when actually it is not.
47
Interesting Properties of Bloom Filters
• Unlike a standard hash table, a Bloom filter of a fixed size can represent a set
with an arbitrarily large number of elements.
• Adding an element never fails.
• The false positive rate increases steadily as elements are added until all bits in
the filter are set to 1, at which point all queries yield a positive result.
• Bloom filters never generate false negative result, i.e., telling you that a
username doesn’t exist when it actually exists.
• Deleting elements from filter is not possible because, if we delete a single
element by clearing bits at indices generated by k hash functions, it might
cause deletion of few other elements.
• Example – if we delete “hello” (in the given example below) by clearing the bits at 1, 4 and 7, we might end up “deleting” “world” as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that “world” is not present.
48
• Step 1: An empty Bloom filter is a bit array of m bits, all set to zero
1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0
49
• h1(“hello”) % 10 = 1
• h2(“hello”) % 10 = 4
• h3(“hello”) % 10 = 7
• Note: These outputs are random for explanation only.
Now we will set the bits at indices 1, 4 and 7 to 1 for “hello”
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 1 0 0 0
50
• Note: These outputs are random for explanation only.
Now we will set the bits at indices 3, 5 and 4 to 1 for “world”.
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 0 1 0 0 0
51
False Positive in Bloom Filters
• The question is why we said “probably present”, why this uncertainty.
• Suppose we want to check whether “cat” is present or not.
• We’ll calculate hashes using h1, h2 and h3
• h1(“cat”) % 10 = 1
• h2(“cat”) % 10 = 3
• h3(“cat”) % 10 = 7
• If we check the bit array, the bits at these indices are set to 1, but we know that “cat” was never added to the filter.
• The bits at indices 1 and 7 were set when we added “hello”, and the bit at index 3 was set when we added “world”.
1 2 3 4 5 6 7 8 9 10
1 0 1 1 1 0 1 0 0 0
52
• So, because the bits at the calculated indices were already set by other items, the Bloom filter erroneously claims that “cat” is present, generating a false positive result (it incorrectly indicates that the word is present when it is not).
• Depending on the application, this could be a huge downside or relatively acceptable.
• We can control the probability of getting a false positive by controlling the size of the Bloom filter.
• More space means fewer false positives.
• If we want to decrease the probability of a false positive result, we have to use more HASH FUNCTIONS AND A LARGER BIT ARRAY.
• This would add latency when inserting items and checking membership.
53
Operations that a Bloom Filter supports
• insert(x): to insert an element into the Bloom filter.
• lookup(x): to check whether an element is already present in the Bloom filter, with some false positive probability.
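A minimal sketch of these two operations (the class name, the choice of m = 10 bits and k = 3 hash functions, and deriving the k hashes from MD5 with different seeds are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=10, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k bit-array indices by hashing the item with k different "seeds".
        return [int(hashlib.md5(f"{i}{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def insert(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def lookup(self, item):
        # True = "possibly present" (may be a false positive); False = "definitely not present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.insert("hello"); bf.insert("world")
print(bf.lookup("hello"), bf.lookup("cat"))   # True, and possibly a false positive for "cat"
```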
54
Probability of False Positivity
• Let m be the size of the bit array, k the number of hash functions, and n the number of expected elements to be inserted in the filter; then the probability of a false positive p can be calculated as:
  p = (1 − [1 − 1/m]^(kn))^k ≈ (1 − e^(−kn/m))^k
55
56
Bloom filter
57
Counting Distinct Elements in a
Stream
• Measuring the number of distinct elements in a stream of values is one of the most commonly needed utilities in practice.
• Applications: database query optimization, network topology, Internet routing, big data analytics, and data mining.
58
The Count-Distinct Problem
• Suppose stream elements are chosen from some
universal set.
• We would like to know how many different elements
have appeared in the stream
• Counting either from the beginning of the stream or
from some known time in the past.
59
• As a useful example of this problem
• Consider a Web site gathering statistics on how many unique
users it has seen in each given month.
• Input: the universal set is the set of logins for that site, and a
stream element is generated each time someone logs in.
• ASSUMPTION MADE:
• This measure is appropriate for a site like Amazon, where the
typical user logs in with their unique login name.
• A similar problem is a Web site like Google that does not
require login to issue a search query, and may be able to
identify users only by the IP address from which they send
the query.
• There are about 4 billion IP addresses
• Sequences of four 8-bit bytes will serve as the universal set in
this case.
60
• The obvious way to solve the problem is to keep in main
memory a list of all the elements seen so far in the stream.
• Keep them in an efficient search structure such as a hash
table or search tree,
• Can quickly add new elements and check whether or not
the element that just arrived on the stream was already
seen.
• As long as the number of distinct elements is not too great,
• This structure can fit in main memory and there is little
problem obtaining an exact answer to the question how
many distinct elements appear in the stream.
61
• If the number of distinct elements is too great, or if there are too many streams that need to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its pages in a month), then we cannot store the needed data in main memory.
• There are several options.
• We could use more machines, each machine handling only
one or several of the streams.
• We could store most of the data structure in secondary
memory
• Batch stream elements so whenever we brought a disk block
to main memory there would be many tests and updates to
be performed on the data in that block.
62
Counting Distinct Elements
• Problem:
• Data stream consists of a universe of elements chosen from a
set of size N
• Maintain a count of the number of distinct elements seen so
far
• Obvious approach:
Maintain the set of elements seen so far
• That is, keep a hash table of all the distinct elements seen so
far
63
Applications
• How many different words are found among the Web
pages being crawled at a site?
• Unusually low or high numbers could indicate artificial pages
(spam?)
64
Using Small Storage
• Real problem: What if we do not have space
to maintain the set of elements seen so far?
65
Steps involved in FM
• Get the input stream https://arpitbhayani.me/blogs/flajolet-martin
66
Example
• Input X={1,3,2,1,2,3,4,3,1,2,3,1}
• Hash function:6x+1 mod 5
67
The Flajolet-Martin Algorithm
• The more different elements we see in the stream, the
more different hash-values
• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0’s, possibly none.
• Call this number the tail length for a and h.
• Let R be the maximum tail length of any a seen so far in the stream.
• Then we shall use 2^R as the estimate of the number of distinct elements seen in the stream.
73
• If m is much larger than 2^r, then the probability that we shall find a tail of length at least r approaches 1.
• If m is much less than 2^r, then the probability of finding a tail of length at least r approaches 0.
• We conclude from these two points that the proposed estimate of m, which is 2^R (recall R is the largest tail length of any stream element), is unlikely to be either much too high or much too low.
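A minimal sketch of this estimate, reusing the input X and the hash function h(x) = (6x + 1) mod 5 from the earlier example slide (treating h(x) = 0 as tail length 0 is an assumed convention):

```python
def tail_length(v):
    """Number of trailing zeros in the binary representation of v."""
    if v == 0:
        return 0            # assumed convention for the degenerate case h(x) = 0
    t = 0
    while v % 2 == 0:
        v //= 2
        t += 1
    return t

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
R = max(tail_length((6 * x + 1) % 5) for x in stream)
print(2 ** R)   # estimate of the number of distinct elements (here 2^2 = 4; the true count is 4)
```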
74
75
76
Why It Works
• The probability that a given h(a) ends in at least i 0’s is 2^(−i).
• If there are m different elements, the probability that R ≥ i is 1 − (1 − 2^(−i))^m.
77
Why It Works – (2)
• If 2^i >> m, then 1 − e^(−m·2^(−i)) ≈ 1 − (1 − m·2^(−i)) = m/2^i ≈ 0.
• If 2^i << m, then 1 − e^(−m·2^(−i)) ≈ 1.
• Thus, 2^R will almost always be around m.
• (The approximation uses the first 2 terms of the Taylor expansion of e^x.)
78
Why It Doesn’t Work
• E(2^R) is, in principle, infinite.
• The probability of seeing at least R 0’s halves when R → R+1, but the value of 2^R doubles.
• Workaround involves using many hash
functions and getting many samples.
• How are samples combined?
• Average? What if one very large value?
• Median? All values are a power of 2.
79
Solution
• Partition your samples into small groups.
• O(log n), where n = size of universal set, suffices.
• Take the average within each group.
• Then take the median of the averages.
80
Generalization: Moments
• Suppose a stream has elements chosen from a set A of N values
• Let m_i be the number of times value i occurs in the stream
• The k-th moment is Σ_{i∈A} (m_i)^k
81
82
Special Cases
• 0th moment: the number of distinct elements in the stream
• 1st moment: the sum of the counts m_i, i.e., the length of the stream
• 2nd moment: Σ_{i∈A} (m_i)², the “surprise number” S – a measure of how uneven the distribution is
83
Example: Surprise Number
• Stream of length 100
• 11 distinct values
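For illustration (these counts are an assumption about what the missing figure shows, chosen to match the stated stream length and number of distinct values): if one value appears 90 times and the other ten values appear once each, the surprise number is S = 90² + 10·1² = 8,110; if the most common value appears 10 times and the other ten appear 9 times each, S = 10² + 10·9² = 910.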
84
85
86
The Alon-Matias-Szegedy Algorithm
for Second Moments
• A particular element of the universal set, which we
refer to as X.element
• An integer X.value, which is the value of the variable.
• To determine the value of a variable X, we choose a
position in the stream between 1 and n, uniformly and
at random.
• Set X.element to be the element found there, and
initialize X.value to 1.
• As we read the stream, add 1 to X.value each time we
encounter another occurrence of X.element
87
Problem Statement: a generalization of the problem of counting distinct elements in a stream of data.
It involves the distribution of the frequencies of the different elements in a stream.
89
90
Sample problem for AMS
• Stream={a, b, c, b, d, a, c, d, a, b, d, c, a, a, b}
93
Sample problem for AMS
• a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
• 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
94
95
96
[Alon, Matias, and Szegedy]
AMS Method
• AMS method works for all moments
• Gives an unbiased estimate
• We will just concentrate on the 2nd moment S
• We pick and keep track of many variables X:
• For each variable X we store X.el and X.val
• X.el corresponds to the item i
• X.val corresponds to the count of item i
• Note: this requires a count in main memory, so the number of Xs is limited
• Our goal is to compute S = Σ_i (m_i)²
97
One Random Variable (X)
• How to set X.val and X.el?
• Assume stream has length n (we relax this later)
• Pick some random time t (t < n) to start, so that any time is equally likely
• Let the item in the stream at time t be i. We set X.el = i
• Then we maintain a count c (X.val = c) of the number of i’s in the stream starting from the chosen time t
• Then the estimate of the 2nd moment (Σ_i m_i²) is: S = f(X) = n·(2·c − 1)
• Note: we will keep track of multiple Xs (X1, X2, … Xk) and our final estimate will be S = (1/k)·Σ_j f(Xj)
98
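A minimal sketch of the AMS estimate for the 2nd moment, using the sample stream from the earlier slide (the number of variables k and the random seed are illustrative assumptions):

```python
import random

def ams_second_moment(stream, k=3, seed=0):
    random.seed(seed)
    n = len(stream)
    estimates = []
    for _ in range(k):
        t = random.randrange(n)                     # random start time t
        x_el = stream[t]                            # X.el = item at time t
        x_val = stream[t:].count(x_el)              # X.val = count of X.el from time t onwards
        estimates.append(n * (2 * x_val - 1))       # f(X) = n*(2c - 1)
    return sum(estimates) / k                       # average the k estimates

stream = list("abcbdacdabdcaab")                    # the sample stream {a,b,c,b,d,...}
print(ams_second_moment(stream))                    # approximates S = sum_i (m_i)^2
```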
Expectation Analysis
Stream: a a b b b a b a   (running count of a’s: 1 2 3 … m_a)
• 2nd moment is S = Σ_i (m_i)²
• c_t … number of times the item at time t appears from time t onwards (c1 = m_a, c2 = m_a − 1, c3 = m_b)
• E[f(X)] = (1/n)·Σ_{t=1}^{n} n·(2·c_t − 1)
  (m_i … total count of item i in the stream; we are assuming the stream has length n)
  = (1/n)·Σ_i n·(1 + 3 + 5 + … + 2·m_i − 1)
• Group the times t by the value seen: the time when the last i is seen (c_t = 1), the time when the penultimate i is seen (c_t = 2), …, the time when the first i is seen (c_t = m_i)
99
Expectation Analysis
Stream: a a b b b a b a
• E[f(X)] = (1/n)·Σ_i n·(1 + 3 + 5 + … + 2·m_i − 1)
• Little side calculation: (1 + 3 + 5 + … + 2·m_i − 1) = Σ_{i=1}^{m_i} (2i − 1) = 2·m_i(m_i + 1)/2 − m_i = (m_i)²
• Then E[f(X)] = (1/n)·Σ_i n·(m_i)²
• So, E[f(X)] = Σ_i (m_i)² = S
• We have the second moment (in expectation)!
100
Higher-Order Moments
• For estimating the kth moment we essentially use the same algorithm but change the estimate:
• For k=2 we used n·(2·c − 1)
• For k=3 we use: n·(3·c² − 3·c + 1) (where c = X.val)
• Why?
• For k=2: remember we had (1 + 3 + 5 + … + 2·m_i − 1), and we showed the terms 2c − 1 (for c = 1, …, m) sum to m²
• Σ_{c=1}^{m} (2c − 1) = Σ_{c=1}^{m} c² − Σ_{c=1}^{m} (c − 1)² = m²
• So: 2c − 1 = c² − (c − 1)², and analogously 3c² − 3c + 1 = c³ − (c − 1)³
101
Combining Samples
• In practice:
• Compute f(X) = n·(2·c − 1) for as many variables X as you can fit in memory
• Average them in groups
• Take the median of the averages
Counting Bits (1)
• New problem: given a stream of 0s and 1s, how many 1s are there in the last N bits?
• Obvious solution: store the most recent N bits
• When a new bit comes in, discard the N+1st oldest bit
108
Counting Bits (2)
• You can not get an exact answer without storing the
entire window
• Real Problem:
What if we cannot afford to store N bits?
• E.g., we’re processing 1 billion streams and
N = 1 billion 010011011101010110110110
Past Future
109
An attempt: Simple solution
• Q: How many 1s are in the last N bits?
• A simple solution that does not really solve
our problem: Uniformity assumption
N
010011100010100100010110110111001010110011010
Past Future
• Maintain 2 counters:
• S: number of 1s from the beginning of the
stream
• Z: number of 0s from the beginning of the
stream
• How many 1s are in the last N bits? Estimate: N · S/(S + Z)
• But, what if stream is non-uniform?
• What if distribution changes over time? 110
DGIM algorithm
• DGIM algorithm (Datar-Gionis-Indyk-Motwani
Algorithm)
DGIM Method
• DGIM solution that does not assume uniformity
112
What’s Good?
• Stores only O(log² N) bits
• O(log N) counts of O(log N) bits each
113
Generic example
114
115
116
• Each bit of the stream has a timestamp, the position in
which it arrives.
• The first bit has timestamp 1, the second has
timestamp 2, and so on.
• Since we only need to distinguish positions within the
window of length N, we shall represent timestamps
modulo N, so they can be represented by log2 N bits.
• If we also store the total number of bits ever seen in
the stream (i.e., the most recent timestamp) modulo N,
then we can determine from a timestamp modulo N
where in the current window the bit with that
timestamp is.
117
Buckets representation
• We divide the window into buckets, consisting of:
• The timestamp of its right (most recent) end.
• The number of 1’s in the bucket. This number must be a power of 2, and
we refer to the number of 1’s as the size of the bucket.
• To represent a bucket, we need log2 N bits to represent the timestamp
(modulo N) of its right end.
• To represent the number of 1’s we only need log2 log2 N bits.
• The reason is that we know this number i is a power of 2, say 2j , so we
can represent i by coding j in binary.
• Since j is at most log2 N, it requires log2 log2 N bits.
• Thus, O(logN) bits suffice to represent a bucket.
118
6 Rules representing a stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to
some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left
(back in time).
119
Rules with example
• The right end of a bucket is always a position with a 1.  Example: 0 1 1 0 0 0 0 – trailing “0”s need not be covered.
• Every position with a 1 is in some bucket.  Example: 1 0 1 1 1 0 1 0 covered by buckets B4 B3 B2 B1.
• No position is in more than one bucket.  Example: 1 0 1 1 1 0 1 0 with overlapping buckets B4 B3 B2 B1 is not valid.
121
Sample Problem
• Apply the 6 rule for buckets
•1 0 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
122
Example: Bucketized Stream
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.
1001010110001011010101010101011010101010101110101010111010100010110010
124
Updating Buckets (2)
• If the current bit is 1:
• (1) Create a new bucket of size 1, for just this bit
• End timestamp = current time
• (2) If there are now three buckets of size 1, combine the
oldest two into a bucket of size 2
• (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
• (4) And so on …
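A minimal sketch of this bucket maintenance (buckets are stored newest-first as (end timestamp, size) pairs; the list-based merging and the query helper are illustrative assumptions):

```python
def dgim_update(buckets, bit, t, N):
    # Drop the oldest bucket if its right end has fallen out of the window of length N.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))          # new bucket of size 1 for just this bit
        size = 1
        while True:
            same = [j for j, (_, s) in enumerate(buckets) if s == size]
            if len(same) <= 2:
                break
            a, b = same[-2], same[-1]      # the two oldest buckets of this size (adjacent)
            buckets[a:b + 1] = [(buckets[a][0], size * 2)]   # merge them, keep newer end time
            size *= 2                      # the merge may cascade to the next size
    return buckets

def dgim_count(buckets):
    # Estimate: sizes of all buckets except the oldest, plus half the size of the oldest.
    if not buckets:
        return 0
    return sum(s for _, s in buckets[:-1]) + buckets[-1][1] / 2

buckets, N = [], 16
for t, bit in enumerate("1001010110001011"):
    buckets = dgim_update(buckets, int(bit), t, N)
print(dgim_count(buckets))
```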
125
Example: Updating Buckets
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
127
Example: Bucketized Stream
At least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1.
1001010110001011010101010101011010101010101110101010111010100010110010
128
Steps involved for query
129
130
131
Error Bound: Proof
111111110000000011101010101011010101010101110101010111010100010110010
N 132
Further Reducing the Error
• Instead of maintaining 1 or 2 of each size bucket, we
allow either r-1 or r buckets (r > 2)
• Except for the largest size buckets; we can have any number
between 1 and r of those
• Error is at most O(1/r)
• By picking r appropriately, we can tradeoff between
number of bits we store and the error
133
Storage Requirements for the DGIM Algorithm
• Each bucket can be represented by O(log N) bits.
• If the window has length N, then there are no more than N 1’s, surely. Suppose the largest bucket is of size 2^j.
• Then j cannot exceed log₂ N, or else there are more 1’s in this bucket than there are 1’s in the entire window.
• Thus, there are at most two buckets of each size from log₂ N down to 1, and no buckets of larger sizes.
• We conclude that there are O(log N) buckets.
• Since each bucket can be represented in O(log N) bits, the total space required for all the buckets representing a window of size N is O(log² N).

Query Answering in the DGIM Algorithm
• Suppose we are asked how many 1’s there are in the last k bits of the window, for some 1 ≤ k ≤ N.
• Find the bucket b with the earliest timestamp that includes at least some of the k most recent bits.
• Estimate the number of 1’s to be the sum of the sizes of all the buckets to the right of (more recent than) bucket b, plus half the size of b itself.
134
Steps involved with example
• Convert each integer to binary
• Generate the elements [c0, c1, c2, …, c_{m−1}], where m is the number of bits
• Calculate the sum = Σ_{i=0}^{m−1} c_i · 2^i
135
136
137
138
139
Extensions
• Can we use the same trick to answer queries How
many 1’s in the last k? where k < N?
• A: Find the earliest bucket B that overlaps with k.
Number of 1s is the sum of sizes of more recent buckets + ½
size of B
1001010110001011010101010101011010101010101110101010111010100010110010
k
140
Extensions
• Stream of positive integers
• We want the sum of the last k elements
• Amazon: Avg. price of last k sales
• Solution:
• (1) If you know all have at most m bits
• Treat m bits of each integer as a separate stream
ci …estimated count for i-th bit
• Use DGIM to count 1s in each integer
• The sum is Σ_{i=0}^{m−1} c_i · 2^i
142
Definition of the Decaying Window
• let a stream currently consist of the elements a1,
a2, . . . , at, where a1 is the first element to arrive
and at is the current element.
• Let c be a small constant, such as 10−6 or 10−9.
• Define the exponentially decaying window for this stream to be the sum Σ_{i=0}^{t−1} a_{t−i}·(1 − c)^i
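A minimal sketch of maintaining this sum incrementally (the function name and the example values are illustrative assumptions): multiply the current sum by (1 − c) and add the newly arrived element.

```python
def decaying_sum(stream, c=1e-6):
    s = 0.0
    for a in stream:
        s = s * (1 - c) + a    # old contributions decay by (1 - c), new element gets weight 1
    return s

print(decaying_sum([1.0, 2.0, 3.0]))   # approximately 6 when c is tiny
```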
143
A decaying window and a fixed-
length window of equal weight
144
Summary
• Sampling a fixed proportion of a stream
• Sample size grows as the stream grows
• Sampling a fixed-size sample
• Reservoir sampling
• Counting the number of 1s in the last N
elements
• Exponentially increasing windows
• Extensions:
• Number of 1s in any last k (k < N) elements
• Sums of integers in the last N elements
145
146
New Topic: Graph Data!
[Course map: topics grouped by data type]
• High dim. data: Locality sensitive hashing; Dimensionality reduction
• Graph data: PageRank, SimRank; Spam Detection
• Infinite data: Filtering data streams; Queries on streams
• Machine learning: SVM; Perceptron, kNN
• Apps: Recommender systems; Duplicate document detection
147
Overview
• Graph data overview
• Problems with early search engines
• PageRank Model
• Flow Formulation
• Matrix Interpretation
• Random Walk Interpretation
• Google’s Formulation
• How to Compute PageRank
148
Graph Data: Social Networks
[Figure: a graph of Internet domains (domain1, domain2, domain3) and routers connected through the Internet]
154
Graph Data: Technological
Networks
157
Web as a Graph
• Web as a directed graph:
• Nodes: Webpages
• Edges: Hyperlinks
[Figure: example webpages as nodes – “I teach a class on Networks. CS224W: Classes are in the Gates building”, “Computer Science Department at Stanford”, “Stanford University” – connected by hyperlink edges]
158
Web as a Directed Graph
159
Broad Question
• How to organize the Web?
• First try: Human curated
Web directories
• Yahoo, DMOZ, LookSmart
• Second try: Web Search
• Information Retrieval investigates:
Find relevant docs in a small
and trusted set
• Newspaper articles, Patents, etc.
• But: Web is huge, full of untrusted documents,
random things, web spam, etc.
160
Web Search: 2 Challenges
2 challenges of web search:
• (1) Web contains many sources of
information
Who to “trust”?
• Trick: Trustworthy pages may point to each
other!
• Main innovations:
1. Define the importance of a page based on:
• How many pages point to it?
• How important are those pages?
2. Judge the contents of a page based on:
• Which terms appear in the page?
• Which terms are used to link to the page?
Ranking Nodes on the Graph
• All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
165
Link Analysis Algorithms
• We will cover the following Link Analysis approaches
for computing importance's
of nodes in a graph:
• Page Rank
• Topic-Specific (Personalized) Page Rank
• Web Spam Detection Algorithms
166
PageRank
167
Links as Votes
168
Example: PageRank Scores
[Figure: example PageRank scores – A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9, and the remaining small nodes: 1.6 each]
169
Simple Recursive Formulation
• Each link’s vote is proportional to the
importance of its source page
• Example: if page i (with 3 out-links) and page k (with 4 out-links) both point to page j, then r_j = r_i/3 + r_k/4, and j passes r_j/3 along each of its own 3 out-links.
170
PageRank: The “Flow” Model
• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” r_j for page j:  r_j = Σ_{i→j} r_i / d_i   (d_i … out-degree of node i)
• Example graph on nodes y, a, m (y links to y and a; a links to y and m; m links to a).
“Flow” equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
171
Solving the Flow Equations
• 3 equations, 3 unknowns, no constants
  Flow equations: r_y = r_y/2 + r_a/2;  r_a = r_y/2 + r_m;  r_m = r_a/2
• No unique solution
• All solutions are equivalent modulo a scale factor
• An additional constraint forces uniqueness: r_y + r_a + r_m = 1
• Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
• The flow equations can be written as r = M · r, where r_j = Σ_{i→j} r_i / d_i
173
Example: Flow Equations & M
      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0

r = M·r:
  [r_y]   [½  ½  0] [r_y]         r_y = r_y/2 + r_a/2
  [r_a] = [½  0  1]·[r_a]   i.e.  r_a = r_y/2 + r_m
  [r_m]   [0  ½  0] [r_m]         r_m = r_a/2
174
175
176
Markov process
• A Markov chain or Markov process is a stochastic
model describing a sequence of possible events in which
the probability of each event depends only on the state
attained in the previous event.
https://setosa.io/ev/markov-chains/
177
Example
• Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
• The flow equation in matrix form: M · r = r
• Suppose page i links to 3 pages, including j: then column i of M has the entry 1/3 in row j, so r_j receives a contribution of r_i/3 in M · r = r
178
Exercise: Matrix Formulation
Graph on pages A, B, C, D:
        M                r      r
  [ 0   1/2   1    0 ]  [rA]   [rA]
  [1/3   0    0   1/2] ·[rB] = [rB]
  [1/3   0    0   1/2]  [rC]   [rC]
  [1/3  1/2   0    0 ]  [rD]   [rD]
Eigen vector
• An eigenvector of a matrix A is a vector X such that when A is multiplied by X, the direction of the resulting vector remains the same as that of X.
• We write A·X = λ·X, where A is an arbitrary matrix, the λ are eigenvalues, and X is an eigenvector corresponding to each eigenvalue.
• Here, we can see that A·X is parallel to X, so X is an eigenvector.
http://www.sharetechnote.com/html/Handbook_EngMath_Matrix_Determinent.bak
Linear Algebra Reminders
• A is a column stochastic matrix if each of its columns add up to 1 and there are no
negative entries.
• Our adjacency matrix M is column stochastic. Why?
184
Power Iteration Method
• Given a web graph with n nodes, where the nodes are
pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
• Suppose there are N web pages
• Initialize: r(0) = [1/N, …, 1/N]^T
• Iterate: r(t+1) = M ∙ r(t), i.e. r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i   (d_i … out-degree of node i)
• Stop when |r(t+1) – r(t)|_1 < ε
  |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm can be used, e.g., Euclidean
185
PageRank: How to solve?
Graph on y, a, m with column-stochastic matrix:
      y    a    m
  y   ½    ½    0
  a   ½    0    1
  m   0    ½    0
• Power Iteration:
  • Set r_j = 1/N
  • 1: r'_j = Σ_{i→j} r_i / d_i
  • 2: r = r'
  • Goto 1
Flow equations: r_y = r_y/2 + r_a/2;  r_a = r_y/2 + r_m;  r_m = r_a/2
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   1/3   5/12    9/24  …  6/15
  r_a = 1/3   3/6   1/3    11/24  …  6/15
  r_m = 1/3   1/6   3/12    1/6   …  3/15
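A minimal sketch of power iteration on this y/a/m example (the iteration cap and tolerance are illustrative assumptions):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # columns are y, a, m, as in the matrix above
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = np.full(3, 1/3)               # r(0) = [1/N, ..., 1/N]
for _ in range(100):
    r_next = M @ r                # r(t+1) = M . r(t)
    if np.abs(r_next - r).sum() < 1e-9:   # stop when the L1 change is tiny
        break
    r = r_next
print(r)                          # approaches [6/15, 6/15, 3/15]
```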
186
Power Iteration Convergence
• Power iteration:
A method for finding principal eigenvector
(the vector corresponding to the largest
eigenvalue)
• r(1) = M · r(0)
• r(2) = M · r(1) = M·(M·r(0)) = M² · r(0)
• r(3) = M · r(2) = M·(M²·r(0)) = M³ · r(0)
• Claim: the sequence M·r(0), M²·r(0), …, M^k·r(0), … approaches the dominant eigenvector of M
188
PageRank:
Random Walk Interpretation
189
Random Walk Interpretation of PageRank
• Consider a web surfer:
• He starts at a random page
• He follows a random link at every time step
• After a sufficiently long time:
• What is the probability that he is at page j?
• This probability corresponds to the page rank of j.
https://neo4j.com/docs/graph-data-science/current/alpha-
algorithms/random-walk/
Need for Random walks in Page rank
• Dead-ends occur when pages have no out-links.
• In this case, the random walk will abort and a path containing only the first
node will be returned.
• This problem can be avoided by running on an undirected graph, so that the
random walk will traverse relationships in both directions.
• If there are no links from within a group of pages to outside of the group,
then the group is considered a spider trap.
• Random walks starting from any of the nodes in that group will only traverse
to the others in the group - our implementation of the algorithm doesn’t
allow a random walk to jump to non-neighbouring nodes.
• Sinks can occur when a network of links forms an infinite cycle.
Random Walk Interpretation
i1 i2 i3
192
Stationary Distribution
• A stationary distribution of a Markov chain is a
probability distribution that remains unchanged in the
Markov chain as time progresses.
• Typically, it is represented as a row vector π whose
entries are probabilities summing to 1, and given
transition matrix P it satisfies
• π=πP
• In other words, π is invariant by the matrix P.
https://brilliant.org/wiki/stationary-distributions/
193
The Stationary Distribution
• Where is the surfer at time t+1? i1 i2 i3
194
Existence and Uniqueness
• A central result from the theory of random walks (a.k.a. Markov processes):
• For graphs satisfying certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is.
195
Example: Random Walk
Graph on A, B, C, D; the surfer starts at A.
Time t = 1:
  p(A, 1) = 0
  p(B, 1) = 1/3
  p(C, 1) = 1/3
  p(D, 1) = 1/3
Example: Random Walk
Time t = 1:  p(B, 1) = 1/3,  p(C, 1) = 1/3,  p(D, 1) = 1/3
Time t = 2:  p(A, 2) = ?
        M                p(t)    p(t+1)
  [ 0   1/2   1    0 ]  [pA]     [pA]
  [1/3   0    0   1/2] ·[pB]  =  [pB]
  [1/3   0    0   1/2]  [pC]     [pC]
  [1/3  1/2   0    0 ]  [pD]     [pD]
• Iterative algorithm:
1. Initialize rank of each page to 1/N (where N is the number of pages)
2. Compute the next page rank values using the formula above
3. Repeat step 2 until the page rank values do not change much
      y    a    m          r_y = r_y/2 + r_a/2
  y   ½    ½    0          r_a = r_y/2 + r_m
  a   ½    0    1          r_m = r_a/2
  m   0    ½    0
Irreducible graph
Aperiodic
• State i has period k if any return to state i must occur in
multiples of k time steps.
• If k = 1 for a state, it is called aperiodic.
• Returning to the state at irregular intervals
• A Markov chain is aperiodic if all its states are aperiodic.
• If Markov chain is irreducible, one aperiodic state means all stated are
aperiodic.
• Example: a cycle A → B → C → D → A. Starting at A at time t0, the walk returns to A at t0 + 4, t0 + 8, …, so the period is k = 4.
• How to make this aperiodic? Add any self edge.
PageRank: The Google Formulation
PageRank: Three Questions
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i,  or equivalently  r = M·r
• Does this converge?
• Does it converge to what we want?
• Are results reasonable?
206
Does this converge?
• Two pages a and b that point only to each other; r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
• Example (iterations 0, 1, 2, 3, …):
  r_a = 1  0  1  0  …
  r_b = 0  1  0  1  …
• The values oscillate and never converge.
207
Does it converge to what we
want?
• Page a points to page b, and b has no out-links; r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
• Example (iterations 0, 1, 2, …):
  r_a = 1  0  0  0  …
  r_b = 0  1  0  0  …
• All the importance “leaks out”, which is not what we want.
208
Problems and solution
209
PageRank: Problems
Dead end
2 problems:
• (1) Some pages are
dead ends (have no out-links)
• Random walk has “nowhere” to go to
• Such pages cause importance to “leak out”
Spider
trap
210
Problem: Spider Traps
      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    1
• Power Iteration:
  • Set r_j = 1/N
  • r_j = Σ_{i→j} r_i / d_i, and iterate
• m is a spider trap:  r_y = r_y/2 + r_a/2;  r_a = r_y/2;  r_m = r_a/2 + r_m
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   2/6   3/12    5/24  …  0
  r_a = 1/3   1/6   2/12    3/24  …  0
  r_m = 1/3   3/6   7/12   16/24  …  1
All the PageRank score gets “trapped” in node m. 211
Solution: Teleports!
• The Google solution for spider traps: At each
time step, the random surfer has two
options
• With prob. β, follow a link at random
• With prob. 1−β, jump to some random page
• Common values for β are in the range 0.8 to 0.9
• Surfer will teleport out of spider trap
within a few time steps
y y
a m a m 212
Problem: Dead Ends
      y    a    m
  y   ½    ½    0
  a   ½    0    0
  m   0    ½    0
• Power Iteration:
  • Set r_j = 1/N
  • r_j = Σ_{i→j} r_i / d_i, and iterate
• Flow equations:  r_y = r_y/2 + r_a/2;  r_a = r_y/2;  r_m = r_a/2
• Example (iterations 0, 1, 2, …):
  r_y = 1/3   2/6   3/12   5/24  …  0
  r_a = 1/3   1/6   2/12   3/24  …  0
  r_m = 1/3   1/6   1/12   2/24  …  0
Here the PageRank “leaks” out since the matrix is not column stochastic. 213
Solution: Always Teleport!
• Teleports: Follow random teleport links
with probability 1.0 from dead-ends
• Adjust matrix accordingly
  Before:                After (dead-end column for m replaced by teleports):
       y    a    m            y    a    m
   y   ½    ½    0        y   ½    ½    ⅓
   a   ½    0    0        a   ½    0    ⅓
   m   0    ½    0        m   0    ½    ⅓
214
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a
problem
and why do teleports solve the problem?
• Spider-traps: PageRank scores are not what
we want
• Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
• Dead-ends are a problem
• The matrix is not column stochastic so our initial
assumptions are not met
• Solution: Make matrix column stochastic by
always teleporting when there is nowhere else to
go
215
Solution: Random Teleports
• Google’s solution that does it all:
At each step, random surfer has two options:
• With probability β, follow a link at random
• With probability 1−β, jump to some random page
217
Random Teleports (β = 0.8)
A = β·M + (1−β)·[1/N]_{N×N}
            M                       [1/N]_{N×N}
        [1/2  1/2   0 ]          [1/3  1/3  1/3]
A = 0.8·[1/2   0    0 ]  + 0.2 · [1/3  1/3  1/3]
        [ 0   1/2   1 ]          [1/3  1/3  1/3]

        y      a       m
  y  [7/15   7/15    1/15]
  a  [7/15   1/15    1/15]
  m  [1/15   7/15   13/15]
https://matrix.reshish.com/addSubCalculation.php
219
Matrix Formulation
• Suppose there are N pages
• Consider page i, with di out-links
• We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
• The random teleport is equivalent to:
• Adding a teleport link from i to every other page and setting the transition probability to (1−β)/N
• Reducing the probability of following each out-link from 1/|d_i| to β/|d_i|
• Equivalent: tax each page a fraction (1−β) of its score and redistribute it evenly
220
Example
• Avoiding Dead Ends
• C is now a dead end
221
Steps involved in above example
222
223
The reduced graph with no dead
ends
224
Spider Traps and Taxation
225
Apply teleporting
226
227
228
How do we actually compute
the PageRank?
229
Computing Page Rank
• Key step is matrix-vector multiplication
• rnew = A ∙ rold
• Easy if we have enough main memory to
hold A, rold, rnew
• Say N = 1 billion pages
• We need 4 bytes for each entry (say)
• 2 billion entries for the two vectors, approx 8 GB
• Matrix A has N² entries: 10^18 is a large number!
A = β·M + (1−β)·[1/N]_{N×N}
        [½  ½  0]          [1/3  1/3  1/3]   [7/15   7/15   1/15]
A = 0.8·[½  0  0]  + 0.2 · [1/3  1/3  1/3] = [7/15   1/15   1/15]
        [0  ½  1]          [1/3  1/3  1/3]   [1/15   7/15  13/15]
230
Matrix Sparseness
• Reminder: Our original matrix was sparse.
• On average: ~10 out-links per vertex
• # of non-zero values in matrix M: ~10N
• Teleport links make matrix M dense.
• Can we convert it back to the sparse form?
Rearranging the Equation
• So we get:  r = β·M·r + [(1−β)/N]_N
• Note: here we assumed M has no dead-ends
• [x]_N … a vector of length N with all entries x
232
Example: Equation with Teleports
  r_new             M               r_old
  [rA]        [ 0   1/2   1    0 ]  [rA]          [1/4]
  [rB]  = β · [1/3   0    0   1/2] ·[rB]  + (1−β)·[1/4]
  [rC]        [1/3   0    0   1/2]  [rC]          [1/4]
  [rD]        [1/3  1/2   0    0 ]  [rD]          [1/4]
235
PageRank: The Complete Algorithm
• Input: Graph G and parameter β
  • Directed graph G (can have spider traps and dead ends)
  • Parameter β
• Output: PageRank vector r^new
• Set: r_j^old = 1/N
• Repeat until convergence: Σ_j |r_j^new − r_j^old| > ε
  • ∀j: r'_j^new = Σ_{i→j} β·r_i^old / d_i
         (r'_j^new = 0 if the in-degree of j is 0)
  • Now re-insert the leaked PageRank:
    ∀j: r_j^new = r'_j^new + (1 − S)/N,  where S = Σ_j r'_j^new
  • r^old = r^new
If the graph has no dead-ends then the amount of leaked PageRank is 1−β. But since we have dead-ends, the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
236
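A minimal sketch of this complete algorithm (the dense NumPy matrix is used only for readability; parameter defaults are illustrative assumptions):

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-8, max_iter=100):
    """M is column-stochastic except for dead-end columns, which are all zero."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = beta * (M @ r)          # r'_j = sum_{i->j} beta * r_i / d_i
        S = r_new.sum()                 # leaked PageRank (< 1 because of teleports/dead ends)
        r_new += (1.0 - S) / N          # re-insert it, spread evenly over all pages
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# The y/a/m spider-trap example: with teleports, m no longer absorbs everything.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))
```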
Sparse Matrix Encoding: First Try
Store a triplet for each nonzero entry: (row, column, weight)
Graph on A, B, C, D with matrix:
  [ 0   1/2   1    0 ]
  [1/3   0    0   1/2]
  [1/3   0    0   1/2]
  [1/3  1/2   0    0 ]
(2, 1, 1/3); (3, 1, 1/3); (4, 1, 1/3); (1, 2, 1/2); (4, 2, 1/2); (1, 3, 1); …
Assume 4 bytes per integer and 8 bytes per float: 16 bytes per entry
Inefficient: Repeating the column index and weight multiple times
Sparse Matrix Encoding
• Encode sparse matrix using only nonzero entries
• Space proportional roughly to number of links
• Say 10N, or 4*10*1 billion = 40GB
• Still won’t fit in memory, but will fit on disk
source node | degree | destination nodes
0 3 1, 5, 7
1 5 17, 64, 113, 117, 245
2 2 13, 23
238
Basic Algorithm: Update Step
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• 1 step of power-iteration is:
Initialize all entries of rnew = (1-β) / N
For each page i (of out-degree di):
Read into memory: i, di, dest1, …, destdi, rold(i)
For j = 1…di
rnew(destj) += rold(i) / di
[Figure: the r_new vector (entries 0–6), the sparse encoding of M as (source, degree, destination nodes) rows – (0, 3, 1, 5, 6), (1, 4, 17, 64, 113, 117), (2, 2, 13, 23), … – and the r_old vector (entries 0–6)]
239
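A minimal sketch of this update step over the sparse encoding (the in-memory list stands in for rows read from disk; the example rows and N are illustrative assumptions):

```python
def sparse_update(rows, r_old, beta, N):
    r_new = [(1 - beta) / N] * N               # initialize all entries to (1-beta)/N
    for i, d_i, dests in rows:                 # stream over the rows of M (as if from disk)
        for j in dests:
            r_new[j] += beta * r_old[i] / d_i  # r_new(dest_j) += beta * r_old(i) / d_i
    return r_new

rows = [(0, 3, [1, 5, 6]),                     # illustrative (source, degree, destinations) rows
        (1, 4, [0, 2, 3, 4]),
        (2, 2, [3, 5])]
N = 7
r_old = [1 / N] * N
print(sparse_update(rows, r_old, beta=0.8, N=N))
```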
Analysis
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• In each iteration, we have to:
• Read rold and M
• Write rnew back to disk
• Cost per iteration of Power method:
= 2|r| + |M|
• Question:
• What if we could not even fit rnew in memory?
240
Block-based Update Algorithm
241
Block-based Update Algorithm
242
Block-Stripe Update Algorithm
[Figure: the sparse encoding of M broken into stripes, shown alongside the r_new blocks and the r_old vector]
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of r_new. 243
Analysis of Block Update
• Similar to nested-loop join in databases
• Break rnew into k blocks that fit in memory
• Scan M and rold once for each block
• Total cost:
• k scans of M and rold
• Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
• Hint: M is much bigger than r (approx 10-20x), so we must
avoid reading it k times per iteration
244
Block-Stripe Analysis
• Break M into stripes
• Each stripe contains only destination nodes
in the corresponding block of rnew
• Some additional overhead per stripe
• But it is usually worth it
• Cost per iteration of the Power method: |M|(1 + ε) + (k+1)|r|
247
Some Problems with Page Rank
• Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
• Other models of importance
• Solution: Hubs-and-Authorities
• Susceptible to Link spam
• Artificial link topographies created in order to boost page
rank
• Solution: TrustRank
248
Topic-Sensitive PageRank
249
Topic-Specific PageRank
• Instead of generic popularity, can we measure
popularity within a topic?
• Goal: Evaluate Web pages not just according to their
popularity, but by how close they are to a particular
topic, e.g. “sports” or “history”
• Allows search queries to be answered based on interests
of the user
• Example: Query “Trojan” wants different pages depending on
whether you are interested in sports, history and computer
security
250
Topic-Specific PageRank
• Random walker has a small probability of
teleporting at any step
• Teleport can go to:
• Standard PageRank: Any page with equal
probability
• To avoid dead-end and spider-trap problems
• Topic Specific PageRank: A topic-specific set of
“relevant” pages (teleport set)
• Idea: Bias the random walk
• When walker teleports, she pick a page from a
set S
• S contains only pages that are relevant to the
topic
• E.g., Open Directory (DMOZ) pages for a given
topic/query
• For each teleport set S, we get a different vector 251
Matrix Formulation
• To make this work all we need is to update
the teleportation part of the PageRank
formulation:
A_ij = β·M_ij + (1 − β)/|S|   if i ∈ S
A_ij = β·M_ij                  otherwise
• A is stochastic!
• We weighted all pages in the teleport set S
equally
• Could also assign different weights to pages!
• Compute as for regular PageRank:
• Multiply by M, then add a vector
• Maintains sparseness 252
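A minimal sketch of this formulation (the 4-node matrix reuses the earlier A/B/C/D example, and S = {0} is an illustrative assumption; dead ends are not handled here):

```python
import numpy as np

def topic_specific_pagerank(M, S, beta=0.8, eps=1e-8, max_iter=100):
    N = M.shape[0]
    teleport = np.zeros(N)
    teleport[list(S)] = 1.0 / len(S)        # teleport mass goes only to pages in S
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = beta * (M @ r) + (1 - beta) * teleport
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

M = np.array([[0.0, 0.5, 1.0, 0.0],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0]])
print(topic_specific_pagerank(M, S={0}))
```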
Example: Topic-Specific PageRank
Graph on nodes 1–4:
  S = {1,2,3,4}, β = 0.8:  r = [0.13, 0.10, 0.39, 0.36]
  S = {1,2,3},   β = 0.8:  r = [0.17, 0.13, 0.38, 0.30]
  S = {1,2},     β = 0.8:  r = [0.26, 0.20, 0.29, 0.23]
  S = {1},       β = 0.8:  r = [0.29, 0.11, 0.32, 0.26]
  S = {1},       β = 0.90: r = [0.17, 0.07, 0.40, 0.36]
  S = {1},       β = 0.70: r = [0.39, 0.14, 0.27, 0.19]
253
DMOZ
• DMOZ (from directory.mozilla.org, an earlier domain
name, stylized in lowercase in its logo) was a
multilingual open-content directory of World Wide
Web links
https://dmoz-odp.org/
https://www.kaggle.com/shawon10/url-classification-dataset-
dmoz
255
Discovering the Topic Vector S
• Create different PageRanks for different topics
• The 16 DMOZ top-level categories:
• arts, business, sports,…
• Which topic ranking to use?
• User can pick from a menu
• Classify query into a topic
• Can use the context of the query
• E.g., query is launched from a web page talking about a known
topic
• History of queries e.g., “basketball” followed by “Jordan”
• User context, e.g., user’s bookmarks, …
256
Application to Measuring Proximity in
Graphs
257
[Tong-Faloutsos, ‘06]
Proximity on Graphs
259
[Tong-Faloutsos, ‘06]
• Multiple connections
• Quality of connection
…
• Direct & Indirect
connections
• Length, Degree,
Weight… 260
SimRank: Idea
• SimRank: Random walks from a fixed node
• Problem:
• Must be done once for each node u
• Suitable for sub-Web-scale applications
261
Bipartite Graph
• A bipartite graph is a special kind of graph with the
following properties-
• It consists of two sets of vertices X and Y.
• The vertices of set X join only with the vertices of
set Y.
• The vertices within the same set do not join
https://www.gatevidyalay.com/bipartite-graphs/
262
SimRank: Example
[Figure: bipartite Conference–Author graph]
Q: What is the most related conference to ICDM?
A: Topic-Specific PageRank with teleport set S = {ICDM}
263
SimRank: Example
264
Example
β = 0.8
265
Example
• Teleport set S = {B,D}
• Vector (1 − β)eS/|S| has 1/10 for its second and fourth
components and 0 for the other two components
• 1 − β = 1/5, the size of S is 2, and eS has 1 in the
components for B and D and 0 in the components for
A and C
266
Example
267
PageRank: Summary
• “Normal” PageRank:
• Teleports uniformly at random to any node
• All nodes have the same probability of surfer landing there: S =
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
• Topic-Specific PageRank also known as Personalized
PageRank:
• Teleports to a topic specific set of pages
• Nodes can have different probabilities of surfer landing there: S =
[0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
• Random Walk with Restarts:
• Topic-Specific PageRank where teleport is always to the same node.
S=[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
268
TrustRank:
Combating Web Spam
What is Web Spam?
• Spamming:
• Any deliberate action to boost a web
page’s position in search engine results, incommensurate with
page’s real value
• Spam:
• Web pages that are the result of spamming
• This is a very broad definition
• SEO industry might disagree!
• SEO = search engine optimization
• Link spam:
• Creating link structures that boost PageRank of a particular
page
Link Spamming
• Three kinds of web pages from a
spammer’s point of view
• Inaccessible pages
• Accessible pages
• e.g., blog comments pages
• spammer can post links to his pages
• Owned pages
• Completely controlled by spammer
• May span multiple domain names
Link Farms
• Spammer’s goal:
• Maximize the PageRank of target page t
• Technique:
• Get as many links from accessible pages as possible to target
page t
• Construct “link farm” to get PageRank
multiplier effect
Link Farms
[Figure: link-farm structure – inaccessible pages, accessible pages pointing to the target page t, and millions of owned farm pages exchanging links with t]
• y = x/(1 − β²) + c·M/N,  where c = β/(1 + β)
  (y … PageRank of the target page t; x … PageRank contributed by accessible pages; M … number of farm pages; N … total number of pages)
• Trust splitting:
• The larger the number of out-links from a page, the less
scrutiny the page author gives each out-link
• Trust is split across out-links
Categorize Spam Pages after TrustRank
• Solution 1: Use a threshold value and mark all pages below the trust
threshold as spam
• Complementary view:
What fraction of a page’s PageRank comes from spam
pages?
Spam Mass Estimation
• Solution 2:
• r_p = PageRank of page p
• r_p⁺ = PageRank of p with teleport into trusted pages only