HW 1
Problem Set 1
Please read the homework submission policies at https://cs246.stanford.edu.
Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U but share the largest number of mutual friends with U.
Output:
• The output should contain one line per user in the following format:
<User><TAB><Recommendations>
where <User> is a unique ID corresponding to a user and <Recommendations> is a
comma separated list of unique IDs corresponding to the algorithm’s recommendation
of people that <User> might know, ordered in decreasing number of mutual friends.
• Even if a user has fewer than 10 second-degree friends, output all of them in decreasing
order of the number of mutual friends. If a user has no friends, you can provide an
empty list of recommendations. If there are recommended users with the same number
of mutual friends, then output those user IDs in numerically ascending order.
Pipeline sketch: Please provide a description of how you used Spark to solve this problem.
Don’t write more than 3 to 4 sentences for this: we only want a very high-level description
of your strategy to tackle this problem.
Tips:
• Use Google Colab to run Spark seamlessly, e.g., copy and adapt the setup cells from Colab 0.
• Before submitting a complete application to Spark, you may go line by line, checking the outputs of each step. The .take(X) command is helpful if you want to check the first X elements of an RDD.
• We recommend using PySpark DataFrame and/or RDD syntax for this question (a sketch follows these tips).
• For sanity check, your top 10 recommendations for user ID 11 should be:
27552,7785,27573,27574,27589,27590,27600,27617,27620,27667.
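One possible shape for such a pipeline, sketched with PySpark RDDs, is shown below. This is an illustrative outline only, not a reference solution: the input file name and the <User><TAB><comma-separated friends> line format are assumptions (this excerpt does not specify the input format), and symmetric friendship lists are also assumed.

```python
# Hedged sketch of the mutual-friend pipeline in PySpark.
# ASSUMPTION: each input line is "<user>\t<friend1>,<friend2>,..."
# and friendship lists are symmetric; neither is stated in this excerpt.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("friend-recs").getOrCreate()
sc = spark.sparkContext

def parse(line):
    user, _, friends = line.partition("\t")
    return int(user), [int(f) for f in friends.split(",") if f]

adj = sc.textFile("soc-LiveJournal1Adj.txt").map(parse)  # hypothetical file name

def candidate_pairs(record):
    user, friends = record
    # Marker 0: already friends (must be excluded from recommendations).
    for f in friends:
        yield ((user, f), 0)
    # Marker 1: each pair of this user's friends shares one mutual friend.
    for i in range(len(friends)):
        for j in range(i + 1, len(friends)):
            yield ((friends[i], friends[j]), 1)
            yield ((friends[j], friends[i]), 1)

recs = (adj.flatMap(candidate_pairs)
           .groupByKey()
           .filter(lambda kv: 0 not in set(kv[1]))        # drop existing friendships
           .map(lambda kv: (kv[0][0], (sum(kv[1]), kv[0][1])))
           .groupByKey()
           # Most mutual friends first; ties broken by ascending user ID.
           .mapValues(lambda vs: [u for _, u in
                                  sorted(vs, key=lambda t: (-t[0], t[1]))[:10]]))
```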
What to submit
(1) Upload your code on Gradescope.
(2) Include in your writeup a short paragraph sketching your Spark pipeline.
(3) Include in your writeup the recommendations for the users with following user IDs: 924,
8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.
1. Confidence (denoted as conf(A → B)):
conf(A → B) = Pr(B|A),
where Pr(B|A) is the conditional probability of finding item set B given that item set A is present.
2. Lift (denoted as lift(A → B)): Lift measures how much more “A and B occur together”
than “what would be expected if A and B were statistically independent”:
lift(A → B) = conf(A → B) / S(B),
where S(B) = Support(B)/N and N = total number of transactions (baskets).
3. Conviction (denoted as conv(A → B)):
conv(A → B) = (1 − S(B)) / (1 − conf(A → B)).
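As a quick numeric illustration of the three measures, with invented counts (not taken from any dataset):

```python
# Toy numbers only: 100 baskets; 20 contain A, 40 contain B, 15 contain both.
N, n_A, n_B, n_AB = 100, 20, 40, 15

conf = n_AB / n_A              # Pr(B|A) = 0.75
S_B = n_B / N                  # S(B) = 0.4
lift = conf / S_B              # 0.75 / 0.4 = 1.875 > 1: more co-occurrence than independence predicts
conv = (1 - S_B) / (1 - conf)  # (1 - 0.4) / (1 - 0.75) = 2.4
print(conf, lift, conv)
```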
(a) [3pts]
A drawback of using confidence is that it ignores Pr(B). Why is this a drawback? Explain
why lift and conviction do not suffer from this drawback.
(b) [3pts]
(c) [4pts]
Perfect implications are rules that hold 100% of the time (equivalently, the associated conditional probability is 1). A measure is desirable if it reaches its maximum achievable value for all perfect implications, since this makes it easy to identify the best rules. Which of the above measures have this property? You may ignore 0/0 cases but not other infinity cases. You may find it easiest to explain with an example.
Apply the A-Priori algorithm with a support threshold of 100 (i.e., itemsets need to occur together at least 100 times to be considered frequent) and find itemsets of size 2 and 3.
Use the online browsing behavior dataset from browsing.txt in q2/data. Each line represents a browsing session of a customer. On each line, each string of 8 characters represents the ID of an item browsed during that session. The items are separated by spaces.
Note: for parts (d) and (e), the writeup requires a specific rule ordering, but the program need not sort its output. We will not give partial credit for code whose results are wrong. However, two sanity checks should help as you progress: (1) there are 647 frequent items after the 1st pass (|L1| = 647); (2) the top 5 pairs you should produce in part (d) all have confidence scores greater than 0.985. See the detailed instructions below. Please provide at least five decimal places for your confidence scores.
(d) [10pts]
Identify pairs of items (X, Y ) such that the support of {X, Y } is at least 100. For all such
pairs, compute the confidence scores of the corresponding association rules: X ⇒ Y , Y ⇒ X.
Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup.
Break ties, if any, by lexicographically increasing order on the left hand side of the rule.
(You need not use Spark for parts (d) and (e) of question 2)
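One possible plain-Python outline of the two A-Priori passes for part (d) follows. It assumes items are de-duplicated within each session (an assumption about how support is counted) and is a sketch rather than a reference solution.

```python
# Hedged outline of A-Priori for part (d); plain Python, no Spark needed here.
from collections import Counter
from itertools import combinations

sessions = [set(line.split()) for line in open("browsing.txt")]

# Pass 1: frequent single items (support >= 100).
item_counts = Counter(item for s in sessions for item in s)
L1 = {item for item, c in item_counts.items() if c >= 100}

# Pass 2: count only candidate pairs whose items are both frequent.
pair_counts = Counter()
for s in sessions:
    pair_counts.update(combinations(sorted(s & L1), 2))
L2 = {p: c for p, c in pair_counts.items() if c >= 100}

# Confidence of X => Y and Y => X for each frequent pair.
rules = []
for (x, y), c in L2.items():
    rules.append((c / item_counts[x], x, y))   # conf(X => Y)
    rules.append((c / item_counts[y], y, x))   # conf(Y => X)
rules.sort(key=lambda r: (-r[0], r[1]))        # decreasing conf; ties: lexicographic LHS
print([f"{lhs} => {rhs}: {conf:.5f}" for conf, lhs, rhs in rules[:5]])
```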
(e) [10pts]
Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such
triples, compute the confidence scores of the corresponding association rules: (X, Y ) ⇒ Z,
(X, Z) ⇒ Y , (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the
top 5 rules in the writeup. Order the left-hand-side pair lexicographically and break ties, if
any, by lexicographical order of the first then the second item in the pair.
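Part (e) extends the same sketch with one more pass over the data (reusing sessions, item_counts, L1, and L2 from the outline above):

```python
# One more pass for frequent triples, continuing the sketch for part (d).
triple_counts = Counter()
for s in sessions:
    items = sorted(s & L1)
    for t in combinations(items, 3):
        # A-Priori pruning: every size-2 subset must already be frequent.
        if all(p in L2 for p in combinations(t, 2)):
            triple_counts[t] += 1
L3 = {t: c for t, c in triple_counts.items() if c >= 100}

rules3 = []
for (x, y, z), c in L3.items():
    for lhs, rhs in (((x, y), z), ((x, z), y), ((y, z), x)):
        rules3.append((c / L2[lhs], lhs, rhs))   # conf(lhs => rhs)
rules3.sort(key=lambda r: (-r[0], r[1]))         # decreasing conf; ties: lexicographic LHS pair
print([f"{lhs} => {rhs}: {conf:.5f}" for conf, lhs, rhs in rules3[:5]])
```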
What to submit
Upload all the code on Gradescope and include the following in your writeup:
(a) [5pts]
Suppose a column has m 1’s and therefore n − m 0’s, and we randomly choose k rows to
consider when computing the minhash. Prove that the probability of getting “don’t know”
as the minhash value for this column is at most ((n − k)/n)^m.
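Before attempting the proof, the bound can be sanity-checked numerically; below is a small Monte Carlo sketch with arbitrarily chosen parameters (not a proof):

```python
# Empirical probability of "don't know" versus the bound ((n - k)/n)^m.
import random

n, m, k, trials = 100, 5, 20, 100_000
ones = set(range(m))          # place the column's m 1's in the first m rows (WLOG)
dont_know = sum(
    not (set(random.sample(range(n), k)) & ones)   # none of the k rows holds a 1
    for _ in range(trials)
)
# Up to sampling noise, the empirical value stays below the bound (~0.32768 here).
print(dont_know / trials, ((n - k) / n) ** m)
```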
(b) [5pts]
Suppose we want the probability of “don’t know” to be at most e−10 . Assuming n and m
are both very large (but n is much larger than m or k), give a simple approximation to the
smallest value of k that will ensure this probability is at most e−10 . Your expression should
be a function of n and m. Hints: (1) You can use ((n − k)/n)^m as the exact value of the probability of "don't know." (2) Remember that for large x, (1 − 1/x)^x ≈ 1/e.
(c) [5pts]
Note: Part (c) should be considered separate from the previous two parts, in that we are no
longer restricting our attention to a randomly chosen subset of the rows.
When minhashing, one might expect that we could estimate the Jaccard similarity without
using all possible permutations of rows. For example, we could allow only cyclic permutations, i.e., start at a randomly chosen row r, which becomes the first in the order, followed
by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row,
second row, and so on, down to row r − 1. There are only n such permutations if there are
n rows. However, these permutations are not sufficient to estimate the Jaccard similarity
correctly.
Give an example of two columns such that the probability (over cyclic permutations only)
that their minhash values agree is not the same as their Jaccard similarity. In your answer,
please provide (a) an example of a matrix with two columns (let the two columns correspond
to sets denoted by S1 and S2), (b) the Jaccard similarity of S1 and S2, and (c) the probability
that a random cyclic permutation yields the same minhash value for both S1 and S2.
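A small helper may be useful for experimenting with candidate matrices here; columns are 0/1 lists of equal length, and each column is assumed to contain at least one 1:

```python
# Compare Jaccard similarity with the fraction of cyclic permutations
# on which the two columns' minhash values agree.

def jaccard(c1, c2):
    inter = sum(1 for a, b in zip(c1, c2) if a and b)
    union = sum(1 for a, b in zip(c1, c2) if a or b)
    return inter / union

def cyclic_agreement(c1, c2):
    n = len(c1)
    agree = 0
    for r in range(n):                          # cyclic order starting at row r
        order = [(r + i) % n for i in range(n)]
        h1 = next(i for i in order if c1[i])    # minhash of column 1
        h2 = next(i for i in order if c2[i])    # minhash of column 2
        agree += (h1 == h2)
    return agree / n

# Example usage with placeholder columns (substitute your own candidate):
# print(jaccard(col1, col2), cyclic_agreement(col1, col2))
```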
What to submit
1. Select L = n^ρ random members g1, . . . , gL of G, where ρ = log(1/p1)/log(1/p2).
2. Hash all the data points as well as the query point using all gi (1 ≤ i ≤ L).
3. Retrieve at most 3L data points (chosen uniformly at random) from the set of L buckets to which the query point hashes. (If there are fewer than 3L data points hashing to the same buckets as the query point, just take all of them.)
¹ The equality G = H^k says that every function of G is an AND-construction of k functions of H, so g(x) = g(y) only if hi(x) = hi(y) for every hi underlying g.
4. Among the points selected in phase 3, report the one that is the closest to the query
point as a (c, λ)-ANN.
The goal of the first part of this problem is to show that this procedure leads to a correct
answer with constant probability.
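The four steps could be transcribed schematically as follows; G (given here as a list), the functions gi, the dataset A, and the metric d are abstract placeholders, so this illustrates the control flow only:

```python
# Schematic transcription of steps 1-4 of the query procedure above.
import random

def ann_query(z, A, G, n, rho, d):
    L = round(n ** rho)
    gs = random.sample(G, L)                        # step 1: pick g_1, ..., g_L
    buckets = {}
    for x in A:                                     # step 2: hash every data point
        for g in gs:
            buckets.setdefault((g, g(x)), []).append(x)
    candidates = []
    for g in gs:                                    # step 3: points colliding with z
        candidates.extend(buckets.get((g, g(z)), []))
    if len(candidates) > 3 * L:                     # keep at most 3L, chosen at random
        candidates = random.sample(candidates, 3 * L)
    # Step 4: report the closest retrieved point as the (c, λ)-ANN candidate.
    return min(candidates, key=lambda x: d(x, z), default=None)
```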
(a) [5 pts]
Let Wj = {x ∈ A|gj (x) = gj (z)} (1 ≤ j ≤ L) be the set of data points x mapping to the
same value as the query point z by the hash function gj . Define T = {x ∈ A|d(x, z) > cλ}.
Prove:
Pr[ Σ_{j=1}^{L} |T ∩ Wj| ⩾ 3L ] ⩽ 1/3.
(b) [5 pts]
(c) [5 pts]
Conclude that with probability greater than some fixed constant the reported point is an
actual (c, λ)-ANN.
implement your own linear search. The default parameters L = 10, k = 24 to lsh_setup work for this exercise, but feel free to use other parameter values as long as you explain the reasoning behind your parameter choice.
• For each of the image patches in rows 100, 200, 300, . . . , 1000, find the top 3 near neighbors (excluding the original patch itself) using both LSH and linear search.
What is the average search time for LSH? What about for linear search?
• Assuming {zj | 1 ≤ j ≤ 10} to be the set of image patches considered (i.e., zj is the image patch in row 100j), {xij | 1 ≤ i ≤ 3} to be the approximate near neighbors of zj found using LSH, and {x*ij | 1 ≤ i ≤ 3} to be the (true) top 3 near neighbors of zj found using linear search, compute the following error measure:
error = (1/10) Σ_{j=1}^{10} [ Σ_{i=1}^{3} d(xij, zj) / Σ_{i=1}^{3} d(x*ij, zj) ]
(a short code transcription of this measure appears after this list).
Plot the error value as a function of L (for L = 10, 12, 14, . . . , 20, with k = 24).
Similarly, plot the error value as a function of k (for k = 16, 18, 20, 22, 24 with L = 10).
Briefly comment on the two plots (one sentence per plot would be sufficient).
• Finally, plot the top 10 near neighbors found using the two methods (using the default
L = 10, k = 24 or your alternative choice of parameter values for LSH) for the image
patch in row 100 (i.e. index 99 when using zero-indexing), together with the image
patch itself. You may find the function plot useful.
How do they compare visually?
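For reference, once the distances are collected, the error measure above reduces to one line; the 10×3 arrays d_lsh and d_lin are hypothetical names holding d(xij, zj) and d(x*ij, zj), respectively:

```python
# error = (1/10) * sum over j of (sum_i d(x_ij, z_j)) / (sum_i d(x*_ij, z_j));
# d_lsh and d_lin are hypothetical 10 x 3 distance arrays.
error = sum(sum(d_lsh[j]) / sum(d_lin[j]) for j in range(10)) / 10
```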
What to submit
(i) Include the proof for 4(a) in your writeup.
(iii) Include the reasoning for why the reported point is an actual (c, λ)-ANN in your writeup
[4(c)].
• Plot of 10 nearest neighbors found by the two methods (also include the original
image) and brief visual comparison