17-Matrix Sketching

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Datasets

Jure Leskovec, Stanford University
Mina Ghashami, Amazon
http://cs246.stanford.edu
¡ In many applications, we can represent data as a matrix: e.g., text analysis, recommendation systems

¡ Think of the data as A ∈ ℝ^{n×d}, containing n row vectors in ℝ^d, with typically n ≫ d

¡ Some examples of typical web-scale data:

¡ A rank-k approximation to A computes a smaller matrix B of rank k such that B approximates A

¡ B is so much smaller than A that it fits in memory

¡ rank(B) ≪ rank(A)
§ If A is a document-term matrix with 10 billion documents and 1 million words, A ∈ ℝ^{10^10 × 10^6}, then B would probably be B ∈ ℝ^{1000 × 10^6}
¡ The error between A and B is small:
§ The covariance error ‖AᵀA − BᵀB‖₂ is small
§ The projection error ‖A − Π_B A‖_{2,F} is small
§ Π_B A := the projection of the rows of A onto the subspace of B
§ If B = USVᵀ, then projecting onto the subspace of B is given by VVᵀ
§ Therefore Π_B A = AVVᵀ
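To make these error measures concrete, here is a minimal numpy sketch (not from the slides; the synthetic data and names are illustrative) that builds a rank-k sketch B from the truncated SVD and evaluates both the covariance error and the projection error:

```python
import numpy as np

# Illustrative setup: near rank-k data A of size n x d
rng = np.random.default_rng(0)
n, d, k = 1000, 50, 10
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # rank-k part
A += 0.01 * rng.standard_normal((n, d))                        # plus noise

# Rank-k sketch B = S V^T from the truncated SVD (B is only k x d)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = np.diag(s[:k]) @ Vt[:k]

# Covariance error: ||A^T A - B^T B||_2 (spectral norm)
print(np.linalg.norm(A.T @ A - B.T @ B, 2))

# Projection error: Pi_B A = A V V^T projects rows of A onto B's row space
V = Vt[:k].T
print(np.linalg.norm(A - A @ V @ V.T, "fro"))
```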
¡ We saw that SVD computes the best rank-k approximation A_k to A

¡ We compare the error of other algorithms to ‖A − A_k‖, as it is the smallest achievable error

¡ SVD requires O(nd²) time and O(nd) space

¡ Not applicable in streaming or distributed settings
¡ Not efficient for sparse matrices
¡ Can we compute a rank-k approximation in the streaming setting?
¡ Every element of the stream is a row vector of fixed dimension d
§ We’d like to process A in one pass, using a small amount of memory (sublinear in n)
¡ Streaming data arises in any time-series setting:
§ E-commerce purchases
§ Traffic sensors
§ Activity logs

¡ We cannot store the entire dataset
¡ Many data analysis tasks rely on obtaining a low-rank approximation:
§ Dimension reduction
§ Anomaly detection
§ Data denoising
§ Clustering
§ Recommendation systems
¡ B is a sketch of a streaming matrix A iff:
§ B is of a fixed small size that fits in memory
§ At any point in the stream, B approximates A
¡ Almost every matrix sketching method in the streaming setting falls into one of these categories:
1. Row sampling based
2. Random projection and hashing based
3. Iterative sketching

¡ They all compute a significantly smaller sketch matrix B such that A ≈ B or AᵀA ≈ BᵀB
¡ Row sampling methods select a subset of “important” rows of the original matrix A
§ Sampling is done w.r.t. a well-defined probability distribution
§ Often sampling is done with replacement

¡ They show that the sampled matrix B is a good approximation to the original one

¡ Methods differ in how they define the notion of “importance”
They construct the sketch B by:
¡ assigning a probability pᵢ to each row aᵢ
¡ sampling ℓ rows from A to construct B
¡ rescaling B appropriately to make it unbiased
¡ An intuitive way to define the “importance” of an item is the weight associated with the item, e.g.
§ file records → weight = size of the file
§ IP addresses → weight = number of times the IP address makes a request

¡ Why is it necessary to sample important items?
§ Consider a set of weighted items S = {(a₁, w₁), (a₂, w₂), …, (a_n, w_n)} that we want to summarize with a small and representative sample
§ We define a representative sample as one that estimates the total weight of S (i.e., W_S = Σᵢ₌₁ⁿ wᵢ) in expectation
¡ This is achievable with a sample set of size one!
§ sample any item (aⱼ, wⱼ) with an arbitrary fixed probability p,
§ and rescale it to have weight W_S/p

¡ This sample set has total weight W_S in expectation, since E[weight] = p · (W_S/p) = W_S
§ but it has a large variance too
§ To lower the variance, it is necessary to allow heavy items (i.e., important items) to be sampled with higher probability
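A tiny simulation (illustrative, not from the slides) of this size-one sample: the estimator is unbiased, but its variance is large when p ignores the weights:

```python
import numpy as np

# Keep one item with fixed probability p and rescale its weight to W_S/p;
# otherwise the sample is empty (weight 0). E[estimate] = p*(W_S/p) = W_S.
rng = np.random.default_rng(0)
W_S, p, trials = 105.0, 0.1, 100_000
est = np.where(rng.random(trials) < p, W_S / p, 0.0)
print(est.mean(), est.std())  # mean ~ W_S = 105, std ~ W_S*sqrt(1/p - 1) = 315
```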
¡ Row sampling based on the L2 norm:
§ Sample row aᵢ with probability pᵢ = ‖aᵢ‖² / ‖A‖_F²
§ Rescale the sampled rows of B by 1/√(ℓ·pᵢ)
§ We can show that E[‖B‖_F²] = ‖A‖_F²
§ And it is proved that if we sample ℓ = O(1/ε²) rows, then ‖AᵀA − BᵀB‖_F ≤ ε‖A‖_F²
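A minimal numpy sketch of L2-norm row sampling under these definitions (function and parameter names are illustrative):

```python
import numpy as np

def l2_row_sample(A, ell, seed=0):
    """Sample ell rows of A with p_i = ||a_i||^2 / ||A||_F^2, rescaled."""
    rng = np.random.default_rng(seed)
    p = (A ** 2).sum(axis=1) / (A ** 2).sum()          # L2-norm probabilities
    idx = rng.choice(A.shape[0], size=ell, replace=True, p=p)
    return A[idx] / np.sqrt(ell * p[idx])[:, None]     # unbiased: E[B^T B] = A^T A

A = np.random.default_rng(1).standard_normal((10_000, 30))
B = l2_row_sample(A, ell=500)                          # 500 x 30 sketch
print(np.linalg.norm(A.T @ A - B.T @ B, 2))            # small covariance error
```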
¡ Row sampling based on the L2 norm:
§ The CUR method samples rows/columns with probability proportional to the squared norms of the rows/columns

§ Error guarantee: if we sample c = O(k log k / ε²) columns and r = O(k log k / ε²) rows, then
‖A − CUR‖_F ≤ ‖A − A_k‖_F + ε‖A‖_F
with probability ≥ 98%
+ Easy interpretation of the basis
• since the basis vectors are actual rows/columns

+ Suitable for sparse data
• since the basis vectors are actual rows/columns

− Duplicate columns and rows
• columns of large norm will be sampled multiple times
¡ Key idea: if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between points are approximately preserved

¡ Johnson-Lindenstrauss Transform (JLT): d datapoints in any dimension (ℝⁿ, for n ≫ d) can be embedded into a space of roughly log d dimensions such that their pairwise distances are preserved to some extent
We define the JLT more precisely:
¡ A random matrix S ∈ ℝ^{r×n} has the JLT property if for all vectors v, v′ ∈ ℝⁿ:
‖Sv − Sv′‖² = (1 ± ε)‖v − v′‖²
with probability at least 1 − δ

¡ There are many ways to construct a matrix S that preserves pairwise distances
§ All such matrices are said to have the Johnson-Lindenstrauss Transform (JLT) property
One simple construction of S:

¡ Pick the matrix S ∈ ℝ^{r×n} as an orthogonal projection onto a random r-dimensional subspace of ℝⁿ, with r = O(ε⁻² log d)
§ Rows of S are orthogonal vectors

¡ Then for any matrix A ∈ ℝ^{n×d}, SA preserves the pairwise distances between the d datapoints in A
¡ A simpler construction for S ∈ ℝ^{r×n} is to take the entries as independent random variables with the standard normal distribution:

S = (1/√r) · [matrix with entries drawn from N(0,1)]
¡ Another construction for S ∈ ℝ^{r×n} is:

S = (1/√r) · [matrix with entries as independent ±1 random variables]

This is computationally simpler to construct.
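A quick numpy illustration (not from the slides; sizes are illustrative) of the ±1 construction, checking that a pairwise distance among the d column datapoints survives the projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 10_000, 20, 400

A = rng.standard_normal((n, d))                          # d datapoints in R^n
S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)    # +/-1 JLT matrix
# Gaussian variant: S = rng.standard_normal((r, n)) / np.sqrt(r)
SA = S @ A                                               # sketch of size r x d

orig = np.linalg.norm(A[:, 0] - A[:, 1])                 # distance in R^n
proj = np.linalg.norm(SA[:, 0] - SA[:, 1])               # distance in R^r
print(orig, proj)                                        # proj = (1 ± eps) * orig
```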
¡ Random projection methods use a JLT matrix S ∈ ℝ^{r×n}
¡ They construct the sketch as B = SA ∈ ℝ^{r×d}
§ this projects the datapoints from the high-dimensional space ℝⁿ onto a lower-dimensional subspace ℝʳ
¡ The resulting error bounds are shown next
¡ Depending on the JLT construction, we achieve different error bounds:

§ If S ∈ ℝ^{r×n} has iid zero-mean ±1 entries and r = O(k/ε + k log k), then the projection error satisfies
‖A − π_B(A)‖_F ≤ (1 + ε)‖A − A_k‖_F
¡ Computationally efficient
¡ Sufficiently accurate in practice
¡ A great pre-processing step in applications

¡ Data-oblivious: their computation involves only a random matrix S
§ Compare to row sampling methods, which need to access the data to form a sketch
¡ Hashing methods use a matrix S ∈ ℝ^{r×n} that contains exactly one ±1 entry per column; the rest of the entries are zero

¡ To build S, use two hash functions:
§ h: [n] → [r], and g: [n] → {−1, +1}
§ column i of S has the value g(i) in row h(i)

¡ Very efficient for sparse matrices A
§ can be applied in O(nnz(A)) operations
§ nnz(A) = number of non-zeros of A
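A minimal sketch of this hashing construction (names are illustrative). S is never materialized: row i of A is added into row h(i) of the sketch with sign g(i), so the whole computation costs O(nnz(A)):

```python
import numpy as np

def hash_sketch(A, r, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    h = rng.integers(0, r, size=n)          # h: [n] -> [r]
    g = rng.choice([-1.0, 1.0], size=n)     # g: [n] -> {-1, +1}
    B = np.zeros((r, d))
    for i in range(n):                      # for sparse A, visit non-zero rows only
        B[h[i]] += g[i] * A[i]              # implicit B = S A
    return B

A = np.random.default_rng(1).standard_normal((5_000, 40))
B = hash_sketch(A, r=500)                   # 500 x 40 sketch
```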
¡ Iterative sketching methods work over a stream A = ⟨a₁, a₂, …, a_n⟩
¡ Each aᵢ is read once, processed quickly, and not read again
¡ Only a small amount of memory is available
¡ The state-of-the-art method in this group is called “Frequent Directions”

¡ It is based on the Misra-Gries algorithm for finding frequent items in a data stream

¡ We first see how the Misra-Gries algorithm for finding frequent items works
§ Then we extend it to matrices
¡ Suppose there is a stream of items drawn from a universal set of d items, and we want to find the frequency f(i) of each item
¡ If we keep d counters, we can count the frequency of every item...
§ But that is not good enough when d is huge (IP addresses, queries, ...)
¡ Let’s keep ℓ counters, where ℓ ≪ d
¡ If a new item arrives in the stream that is
already in the counters, we add 1 to its count

¡ If the new item is not in the counters and we
have space, we create a counter for it and set
it to 1

¡ But what if we don’t have space for it?

¡ Let 𝛿 be the median counter at time t

¡ Decrease all counters by δ (or set a counter to zero if it is less than δ)
¡ Now we have space for the new item, so we continue...
¡ At any time in the stream, the approximate counts for the items are what we have kept so far
¡ This method undercounts: 0 ≤ f′(i) ≤ f(i)

¡ Each decrement step decreases a counter by at most δₜ, so
f′(i) ≥ f(i) − Σₜ δₜ

¡ At any point after seeing n elements of the stream: Σₜ δₜ ≤ 2n/ℓ
§ each decrement step removes at least (ℓ/2)·δₜ from the total count, and the total count never exceeds n

¡ The error guarantee: 0 ≤ f(i) − f′(i) ≤ 2n/ℓ
¡ Misra-Gries produces a non-zero approximate frequency f′(i) for every item whose true frequency f(i) is higher than 2n/ℓ, since
f(i) − 2n/ℓ ≤ f′(i)

¡ Example: to find items that appear more than 20% of the time, i.e., f(i) > n/5, take ℓ = 10 counters and run the Misra-Gries algorithm
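A minimal Python sketch of the Misra-Gries variant described above (median-based decrement; names are illustrative):

```python
def misra_gries(stream, ell):
    """Approximate item frequencies with at most ell counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1              # known item: increment
        elif len(counters) < ell:
            counters[item] = 1               # free slot: new counter
        else:
            # full: subtract the median counter value from every counter,
            # dropping those that fall to zero; this frees >= ell/2 slots
            delta = sorted(counters.values())[len(counters) // 2]
            counters = {k: v - delta for k, v in counters.items() if v > delta}
            counters[item] = 1
    return counters                          # f(i) - 2n/ell <= f'(i) <= f(i)

stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2
print(misra_gries(stream, ell=10))           # heavy items "a" and "b" survive
```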
¡ Let’s extend it to vectors and matrices

¡ Stream items are row vectors in d dimensions

¡ At any time n in the stream, they form a tall matrix A ∈ ℝ^{n×d}

¡ The goal is to find the most frequent directions of A
Frequent Directions maintains a sketch B ∈ ℝ^{ℓ×d} over the stream:
¡ Insert each incoming row aᵢ into an empty (all-zero) row of B
¡ When B has no empty rows, compute the SVD B = USVᵀ, with singular values S₁ ≥ S₂ ≥ … ≥ S_ℓ
¡ Shrink the singular values by the squared median singular value S²_{ℓ/2}:
S′ ← [√(S₁² − S²_{ℓ/2}), √(S₂² − S²_{ℓ/2}), …, 0, …, 0]
¡ Set B ← S′Vᵀ; at least half of the rows of B become zero, making room for the next rows of the stream
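A minimal numpy sketch of Frequent Directions as described above (parameter names are illustrative):

```python
import numpy as np

def frequent_directions(rows, ell, d):
    """Maintain an ell x d sketch B of the streamed rows."""
    B = np.zeros((ell, d))
    for a in rows:
        zero = np.where(~B.any(axis=1))[0]
        if len(zero) == 0:                    # B is full: shrink via SVD
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2          # squared median singular value
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt               # at least half the rows are zero now
            zero = np.where(~B.any(axis=1))[0]
        B[zero[0]] = a                        # insert the incoming row
    return B

A = np.random.default_rng(0).standard_normal((2_000, 30))
B = frequent_directions(A, ell=20, d=30)
print(np.linalg.norm(A.T @ A - B.T @ B, 2),    # observed covariance error
      2 * np.linalg.norm(A, "fro") ** 2 / 20)  # bound (2/ell) * ||A||_F^2
```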
¡ Similar to the frequent items case, this method has the following error guarantee:

‖AᵀA − BᵀB‖₂ ≤ (2/ℓ) ‖A‖_F²

¡ More accurate error bounds exist, e.g. for the projection error:

‖A − π_B(A)‖_F² ≤ (1 + ε) ‖A − A_k‖_F²
¡ Matrix Sketching in Streams:
§ Row sampling methods
§ CUR
§ L2 norm based sampling
§ Random projection methods
§ Johnson Lindenstrauss Transform (JLT)
§ Different ways to construct a JLT matrix
§ Iterative sketching methods
§ Misra-Gries algorithm for frequent items
§ Frequent Directions method (state of the art)
