HW 1
Problem Set 1
Please read the homework submission policies at https://cs246.stanford.edu.
Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U but share the largest number of mutual friends with U.
Output:
• The output should contain one line per user in the following format:
<User><TAB><Recommendations>
where <User> is a unique ID corresponding to a user and <Recommendations> is a
comma separated list of unique IDs corresponding to the algorithm’s recommendation
of people that <User> might know, ordered in decreasing number of mutual friends.
• Even if a user has fewer than 10 second-degree friends, output all of them in decreasing
order of the number of mutual friends. If a user has no friends, you can provide an
empty list of recommendations. If there are recommended users with the same number
of mutual friends, then output those user IDs in numerically ascending order.
Pipeline sketch: Please provide a description of how you used Spark to solve this problem.
Don’t write more than 3 to 4 sentences for this: we only want a very high-level description
of your strategy to tackle this problem.
Tips:
• Use Google Colab to run Spark seamlessly, e.g., copy and adapt the setup cells from Colab 0.
• Before submitting a complete application to Spark, you may go line by line, checking the outputs of each step. The .take(X) command is helpful if you want to check the first X elements of an RDD.
• We recommend using PySpark DataFrame and/or RDD syntax for this question (a sketch follows these tips).
• For sanity check, your top 10 recommendations for user ID 11 should be:
27552,7785,27573,27574,27589,27590,27600,27617,27620,27667.
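One possible shape for such a pipeline, sketched with PySpark RDDs, is shown below. This is an illustrative outline only, not a reference solution: the input file name and the <User><TAB><comma-separated friends> line format are assumptions (this excerpt does not specify the input format), and symmetric friendship lists are also assumed.

```python
# Hedged sketch of the mutual-friend pipeline in PySpark.
# ASSUMPTION: each input line is "<user>\t<friend1>,<friend2>,..."
# and friendship lists are symmetric; neither is stated in this excerpt.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("friend-recs").getOrCreate()
sc = spark.sparkContext

def parse(line):
    user, _, friends = line.partition("\t")
    return int(user), [int(f) for f in friends.split(",") if f]

adj = sc.textFile("soc-LiveJournal1Adj.txt").map(parse)  # hypothetical file name

def candidate_pairs(record):
    user, friends = record
    # Marker 0: already friends (must be excluded from recommendations).
    for f in friends:
        yield ((user, f), 0)
    # Marker 1: each pair of this user's friends shares one mutual friend.
    for i in range(len(friends)):
        for j in range(i + 1, len(friends)):
            yield ((friends[i], friends[j]), 1)
            yield ((friends[j], friends[i]), 1)

recs = (adj.flatMap(candidate_pairs)
           .groupByKey()
           .filter(lambda kv: 0 not in set(kv[1]))        # drop existing friendships
           .map(lambda kv: (kv[0][0], (sum(kv[1]), kv[0][1])))
           .groupByKey()
           # Most mutual friends first; ties broken by ascending user ID.
           .mapValues(lambda vs: [u for _, u in
                                  sorted(vs, key=lambda t: (-t[0], t[1]))[:10]]))
```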
What to submit
(1) Upload your code on Gradescope.
(2) Include in your writeup a short paragraph sketching your Spark pipeline.
(3) Include in your writeup the recommendations for the users with following user IDs: 924,
8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.
1. Confidence (denoted as conf(A → B)):
conf(A → B) = Pr(B|A),
where Pr(B|A) is the conditional probability of finding item set B given that item set A is present.
2. Lift (denoted as lift(A → B)): Lift measures how much more “A and B occur together”
than “what would be expected if A and B were statistically independent”:
lift(A → B) = conf(A → B) / S(B),
where S(B) = Support(B)/N and N = total number of transactions (baskets).
3. Conviction (denoted as conv(A → B)):
conv(A → B) = (1 − S(B)) / (1 − conf(A → B)).
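As a quick numeric illustration of the three measures, with invented counts (not taken from any dataset):

```python
# Toy numbers only: 100 baskets; 20 contain A, 40 contain B, 15 contain both.
N, n_A, n_B, n_AB = 100, 20, 40, 15

conf = n_AB / n_A              # Pr(B|A) = 0.75
S_B = n_B / N                  # S(B) = 0.4
lift = conf / S_B              # 0.75 / 0.4 = 1.875 > 1: more co-occurrence than independence predicts
conv = (1 - S_B) / (1 - conf)  # (1 - 0.4) / (1 - 0.75) = 2.4
print(conf, lift, conv)
```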
(a) [3pts]
A drawback of using confidence is that it ignores Pr(B). Why is this a drawback? Explain
why lift and conviction do not suffer from this drawback.
(b) [3pts]
(c) [4pts]
Perfect implications are rules that hold 100% of the time (equivalently, the associated conditional probability is 1). A measure is desirable if it reaches its maximum achievable value for all perfect implications, since this makes it easy to identify the best rules. Which of the above measures have this property? You may ignore 0/0 cases but not other infinity cases. You may find it easiest to explain with an example.
Apply the A-Priori algorithm with a support threshold of 100 (i.e., itemsets need to occur together at least 100 times to be considered frequent) and find itemsets of size 2 and 3.
Use the online browsing behavior dataset from browsing.txt in q2/data. Each line represents a browsing session of a customer. On each line, each string of 8 characters represents the ID of an item browsed during that session. The items are separated by spaces.
Note: for parts (d) and (e), the writeup requires a specific rule ordering, but the program need not sort its output. We will not give partial credit for code whose results are wrong. However, two sanity checks should help as you progress: (1) there are 647 frequent items after the 1st pass (|L1| = 647); (2) the top 5 pairs you should produce in part (d) all have confidence scores greater than 0.985. See the detailed instructions below. Please provide at least five decimal places for your confidence scores.
(d) [10pts]
Identify pairs of items (X, Y ) such that the support of {X, Y } is at least 100. For all such
pairs, compute the confidence scores of the corresponding association rules: X ⇒ Y , Y ⇒ X.
Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup.
Break ties, if any, by lexicographically increasing order on the left hand side of the rule.
(You need not use Spark for parts (d) and (e) of question 2)
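One possible plain-Python outline of the two A-Priori passes for part (d) follows. It assumes items are de-duplicated within each session (an assumption about how support is counted) and is a sketch rather than a reference solution.

```python
# Hedged outline of A-Priori for part (d); plain Python, no Spark needed here.
from collections import Counter
from itertools import combinations

sessions = [set(line.split()) for line in open("browsing.txt")]

# Pass 1: frequent single items (support >= 100).
item_counts = Counter(item for s in sessions for item in s)
L1 = {item for item, c in item_counts.items() if c >= 100}

# Pass 2: count only candidate pairs whose items are both frequent.
pair_counts = Counter()
for s in sessions:
    pair_counts.update(combinations(sorted(s & L1), 2))
L2 = {p: c for p, c in pair_counts.items() if c >= 100}

# Confidence of X => Y and Y => X for each frequent pair.
rules = []
for (x, y), c in L2.items():
    rules.append((c / item_counts[x], x, y))   # conf(X => Y)
    rules.append((c / item_counts[y], y, x))   # conf(Y => X)
rules.sort(key=lambda r: (-r[0], r[1]))        # decreasing conf; ties: lexicographic LHS
print([f"{lhs} => {rhs}: {conf:.5f}" for conf, lhs, rhs in rules[:5]])
```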
(e) [10pts]
Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such
triples, compute the confidence scores of the corresponding association rules: (X, Y ) ⇒ Z,
(X, Z) ⇒ Y , (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the
top 5 rules in the writeup. Order the left-hand-side pair lexicographically and break ties, if
any, by lexicographical order of the first then the second item in the pair.
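Part (e) extends the same sketch with one more pass over the data (reusing sessions, item_counts, L1, and L2 from the outline above):

```python
# One more pass for frequent triples, continuing the sketch for part (d).
triple_counts = Counter()
for s in sessions:
    items = sorted(s & L1)
    for t in combinations(items, 3):
        # A-Priori pruning: every size-2 subset must already be frequent.
        if all(p in L2 for p in combinations(t, 2)):
            triple_counts[t] += 1
L3 = {t: c for t, c in triple_counts.items() if c >= 100}

rules3 = []
for (x, y, z), c in L3.items():
    for lhs, rhs in (((x, y), z), ((x, z), y), ((y, z), x)):
        rules3.append((c / L2[lhs], lhs, rhs))   # conf(lhs => rhs)
rules3.sort(key=lambda r: (-r[0], r[1]))         # decreasing conf; ties: lexicographic LHS pair
print([f"{lhs} => {rhs}: {conf:.5f}" for conf, lhs, rhs in rules3[:5]])
```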
What to submit
Upload all the code on Gradescope and include the following in your writeup:
(a) [5pts]
Suppose a column has m 1’s and therefore n − m 0’s, and we randomly choose k rows to
consider when computing the minhash. Prove that the probability of getting “don’t know”
as the minhash value for this column is at most ((n − k)/n)^m.
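Before attempting the proof, the bound can be sanity-checked numerically; below is a small Monte Carlo sketch with arbitrarily chosen parameters (not a proof):

```python
# Empirical probability of "don't know" versus the bound ((n - k)/n)^m.
import random

n, m, k, trials = 100, 5, 20, 100_000
ones = set(range(m))          # place the column's m 1's in the first m rows (WLOG)
dont_know = sum(
    not (set(random.sample(range(n), k)) & ones)   # none of the k rows holds a 1
    for _ in range(trials)
)
# Up to sampling noise, the empirical value stays below the bound (~0.32768 here).
print(dont_know / trials, ((n - k) / n) ** m)
```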
(b) [5pts]
Suppose we want the probability of “don’t know” to be at most e−10 . Assuming n and m
are both very large (but n is much larger than m or k), give a simple approximation to the
smallest value of k that will ensure this probability is at most e−10 . Your expression should
be a function of n and m. Hints: (1) You can use ((n − k)/n)^m as the exact value of the probability of "don't know." (2) Remember that for large x, (1 − 1/x)^x ≈ 1/e.
(c) [5pts]
Note: Part (c) should be considered separate from the previous two parts, in that we are no
longer restricting our attention to a randomly chosen subset of the rows.
When minhashing, one might expect that we could estimate the Jaccard similarity without
using all possible permutations of rows. For example, we could allow only cyclic permutations, i.e., start at a randomly chosen row r, which becomes the first in the order, followed
by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row,
second row, and so on, down to row r − 1. There are only n such permutations if there are
n rows. However, these permutations are not sufficient to estimate the Jaccard similarity
correctly.
Give an example of two columns such that the probability (over cyclic permutations only)
that their minhash values agree is not the same as their Jaccard similarity. In your answer,
please provide (a) an example of a matrix with two columns (let the two columns correspond
to sets denoted by S1 and S2), (b) the Jaccard similarity of S1 and S2, and (c) the probability
that a random cyclic permutation yields the same minhash value for both S1 and S2.
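A small helper may be useful for experimenting with candidate matrices here; columns are 0/1 lists of equal length, and each column is assumed to contain at least one 1:

```python
# Compare Jaccard similarity with the fraction of cyclic permutations
# on which the two columns' minhash values agree.

def jaccard(c1, c2):
    inter = sum(1 for a, b in zip(c1, c2) if a and b)
    union = sum(1 for a, b in zip(c1, c2) if a or b)
    return inter / union

def cyclic_agreement(c1, c2):
    n = len(c1)
    agree = 0
    for r in range(n):                          # cyclic order starting at row r
        order = [(r + i) % n for i in range(n)]
        h1 = next(i for i in order if c1[i])    # minhash of column 1
        h2 = next(i for i in order if c2[i])    # minhash of column 2
        agree += (h1 == h2)
    return agree / n

# Example usage with placeholder columns (substitute your own candidate):
# print(jaccard(col1, col2), cyclic_agreement(col1, col2))
```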
What to submit
1. Select L = n^ρ random members g1, . . . , gL of G, where ρ = log(1/p1)/log(1/p2).
2. Hash all the data points as well as the query point using all gi (1 ≤ i ≤ L).
3. Retrieve at most 3L data points (chosen uniformly at random) from the set of L buckets to which the query point hashes. (If there are fewer than 3L data points hashing to the same buckets as the query point, just take all of them.)
¹ The equality G = H^k says that every function of G is an AND-construction of k functions of H, so g(x) = g(y) only if hi(x) = hi(y) for every hi underlying g.
4. Among the points selected in phase 3, report the one that is the closest to the query
point as a (c, λ)-ANN.
The goal of the first part of this problem is to show that this procedure leads to a correct
answer with constant probability.
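The four steps could be transcribed schematically as follows; G (given here as a list), the functions gi, the dataset A, and the metric d are abstract placeholders, so this illustrates the control flow only:

```python
# Schematic transcription of steps 1-4 of the query procedure above.
import random

def ann_query(z, A, G, n, rho, d):
    L = round(n ** rho)
    gs = random.sample(G, L)                        # step 1: pick g_1, ..., g_L
    buckets = {}
    for x in A:                                     # step 2: hash every data point
        for g in gs:
            buckets.setdefault((g, g(x)), []).append(x)
    candidates = []
    for g in gs:                                    # step 3: points colliding with z
        candidates.extend(buckets.get((g, g(z)), []))
    if len(candidates) > 3 * L:                     # keep at most 3L, chosen at random
        candidates = random.sample(candidates, 3 * L)
    # Step 4: report the closest retrieved point as the (c, λ)-ANN candidate.
    return min(candidates, key=lambda x: d(x, z), default=None)
```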
(a) [5 pts]
Let Wj = {x ∈ A|gj (x) = gj (z)} (1 ≤ j ≤ L) be the set of data points x mapping to the
same value as the query point z by the hash function gj . Define T = {x ∈ A|d(x, z) > cλ}.
Prove:
Pr[ Σ_{j=1}^{L} |T ∩ Wj| ⩾ 3L ] ⩽ 1/3.
(b) [5 pts]
(c) [5 pts]
Conclude that with probability greater than some fixed constant the reported point is an
actual (c, λ)-ANN.
implement your own linear search. The default parameters L = 10, k = 24 to lsh_setup work for this exercise, but feel free to use other parameter values as long as you explain the reasoning behind your parameter choice.
• For each of the image patches in rows 100, 200, 300, . . . , 1000, find the top 3 near neighbors (excluding the original patch itself) using both LSH and linear search.
What is the average search time for LSH? What about for linear search?
• Assuming {zj | 1 ≤ j ≤ 10} to be the set of image patches considered (i.e., zj is the image patch in row 100j), {xij | 1 ≤ i ≤ 3} to be the approximate near neighbors of zj found using LSH, and {x*ij | 1 ≤ i ≤ 3} to be the (true) top 3 near neighbors of zj found using linear search, compute the following error measure:
error = (1/10) Σ_{j=1}^{10} [ Σ_{i=1}^{3} d(xij, zj) / Σ_{i=1}^{3} d(x*ij, zj) ]
(a short code transcription of this measure appears after this list).
Plot the error value as a function of L (for L = 10, 12, 14, . . . , 20, with k = 24).
Similarly, plot the error value as a function of k (for k = 16, 18, 20, 22, 24 with L = 10).
Briefly comment on the two plots (one sentence per plot would be sufficient).
• Finally, plot the top 10 near neighbors found using the two methods (using the default
L = 10, k = 24 or your alternative choice of parameter values for LSH) for the image
patch in row 100 (i.e. index 99 when using zero-indexing), together with the image
patch itself. You may find the function plot useful.
How do they compare visually?
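For reference, once the distances are collected, the error measure above reduces to one line; the 10×3 arrays d_lsh and d_lin are hypothetical names holding d(xij, zj) and d(x*ij, zj), respectively:

```python
# error = (1/10) * sum over j of (sum_i d(x_ij, z_j)) / (sum_i d(x*_ij, z_j));
# d_lsh and d_lin are hypothetical 10 x 3 distance arrays.
error = sum(sum(d_lsh[j]) / sum(d_lin[j]) for j in range(10)) / 10
```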
What to submit
(i) Include the proof for 4(a) in your writeup.
(iii) Include the reasoning for why the reported point is an actual (c, λ)-ANN in your writeup
[4(c)].
• Plot of 10 nearest neighbors found by the two methods (also include the original
image) and brief visual comparison