MMDS Exam 2022
Exam
• Your Name:
Signature:
1. There are 15 questions in this exam; the maximum score that you can obtain is 140 points.
These questions require thought, but do not require long answers. Please be as concise as
possible.
2. This exam is open-book and open-notes. You may use notes (digitally created notes are
allowed) and/or lecture slides and/or any reference material. However, answers should be
written in your own words.
3. You may access the Internet, but you may not communicate with any other person. You may
use your computer to write code or do any scientific computation, though writing code is not
required to solve any of the problems in this exam. You may also use your computer as a
calculator or an e-reader.
4. Collaboration with other students is not allowed in any form. Please do not discuss the exam
with anyone until after Tuesday, Mar 15.
5. If you have any clarifying questions, make a private post on Ed. It is very important that
your post is private; if it is public, we may deduct points from your exam grade.
6. Please submit your answers here on Gradescope. You have two options to submit your
answers: (1) to upload one file per question, in a file upload field in the last sub-question;
or (2) to write your answers directly in the text fields in the sub-questions.
7. Numerical answers may be left as fractions, as decimals rounded to 2 decimal places, or as
radicals (e.g., √2).
CS246 Exam - Winter 2022
Basket 1: A, B, C
Basket 2: A, B
Basket 3: A, B, D
Basket 4: B, C, D, E
Basket 5: A, B, D, E
Basket 6: A, B, C, D
(a) (4 points) Use the Apriori algorithm to compute all the frequent itemsets (triples) with support
≥ 3. Show all of your iterations (indicating candidate and frequent itemsets at each step) for
full credit.
Write down your answers in the form: frequent single items: A, B, · · · ; candidate pairs:
AB, · · · ; frequent pairs: AB, · · · (this example does not indicate the results for this question).
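For readers working through this part, the candidate-generation/counting loop of Apriori can be sketched in Python as follows. This is an illustration only, not part of the expected answer (and per the instructions, writing code is not required); the basket data matches the table above.

```python
from itertools import combinations

# Baskets from the table above.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "D"},
    {"B", "C", "D", "E"},
    {"A", "B", "D", "E"},
    {"A", "B", "C", "D"},
]

def support(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def apriori(baskets, min_support=3, max_size=3):
    """Return {size: [frequent itemsets]} using Apriori candidate pruning."""
    items = sorted(set().union(*baskets))
    frequent = {1: [frozenset([i]) for i in items
                    if support(frozenset([i]), baskets) >= min_support]}
    for k in range(2, max_size + 1):
        # Candidates: k-sets all of whose (k-1)-subsets are frequent.
        prev = set(frequent[k - 1])
        singles = sorted(set().union(*prev)) if prev else []
        candidates = [frozenset(c) for c in combinations(singles, k)
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        frequent[k] = [c for c in candidates if support(c, baskets) >= min_support]
    return frequent
```

Running `apriori(baskets)` reproduces the candidate/frequent iterations the question asks you to write out by hand.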
(b) (4 points) Now we also consider a constraint when finding the frequent itemsets. Given an
itemset I ⊆ {A, B, C, D, E}, we want the maximum price of all the items in I minus the minimum
price of all the items in I to be smaller than a threshold T; i.e., if I is frequent, then
max(item.price for item in I) - min(item.price for item in I) < T.
Assume the prices of all items and T are positive. How can we efficiently find frequent itemsets
under such a constraint? Design your own algorithm and briefly explain why it works.
Given the following 6 points: x1 = [−1, −2], x2 = [−2, 0], x3 = [4.5, 1], x4 = [3, 1], x5 = [0, −2],
x6 = [2, −0.5]. Please complete the following tasks.
(a) (3 points) If we select x1, x6 as the initial centroids, write down the simulation of the
K-means clustering process with L1 distance until convergence. Report the new centroids and
cost for each iteration in the following table, including the assignments per step. (If, say, x1
and x3 were assigned to the first cluster and all other points to the second cluster, the assignment
would be represented as [0, 1, 0, 1, 1, 1].)
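One iteration of K-means under L1 (Manhattan) distance can be sketched as below. This is an illustration only (the question asks for the worked iterations, not code); note that it uses the mean for the centroid update, matching the standard K-means update, and assumes no cluster becomes empty.

```python
def l1(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_step(points, centroids):
    """One K-means iteration under L1 distance.

    Returns (assignments, new_centroids, cost), where cost is the total L1
    distance of each point to its assigned (old) centroid. Assumes every
    cluster receives at least one point.
    """
    assign = [min(range(len(centroids)), key=lambda c: l1(p, centroids[c]))
              for p in points]
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, a in zip(points, assign) if a == c]
        new_centroids.append([sum(x) / len(members) for x in zip(*members)])
    cost = sum(l1(p, centroids[a]) for p, a in zip(points, assign))
    return assign, new_centroids, cost
```

Calling `kmeans_step` repeatedly until the assignments stop changing simulates the process the question describes.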
(b) (3 points) Run hierarchical clustering with L1 distance until there are 2 clusters (write
down the clustering steps for each iteration, along with the centroids).
(c) (2 points) If we can arbitrarily set K (the number of clusters, K ≤ 6) and randomly select
K points from the set as initial centroids for each of the clusters, how can we make sure that we
achieve the global optimum of the cost function? What is the minimum cost we can achieve?
How many iterations do we need? What is K?
(d) If the goal is to cluster the data points in Figure 1 into 2 clusters, which algorithm(s) work
well? Select all that apply for each of them; you don't need to provide justifications.
We are given a matrix A ∈ R^{m×n} and its SVD A = U Ω V^T, where U = [u1, u2, · · · , ud] with
u_k ∈ R^m; Ω ∈ R^{d×d}; and V = [v1, v2, · · · , vd] with v_k ∈ R^n. Assume the diagonal entries
of Ω are sorted in descending order. Using the given information, please answer the following
questions:
(b) (3 points) What are the eigenvalues and eigenvectors of AA^T? (Show your steps to receive
full points.)
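Not required for the exam, but you can explore this relationship numerically with numpy. The sketch below uses a random matrix A (an assumption for illustration, not data from the exam) and checks the connection between the SVD factors of A and the spectrum of AA^T; the question still asks for the analytic derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # hypothetical matrix for illustration

# Reduced SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Numerically verify that each column u_k of U satisfies
# (A A^T) u_k = sigma_k^2 u_k.
AAT = A @ A.T
for k in range(len(s)):
    assert np.allclose(AAT @ U[:, k], s[k] ** 2 * U[:, k])
```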
(c) (3 points) Suppose that the original matrix A is a student-to-CS-class rating matrix, and the
latent factors are the topics of the classes. The columns of A are distinct courses, and the rows
of A correspond to student IDs. For example, A[0] contains the ratings from the student with
id = 0.
ii (2 points) Assume Alice rated only CS246 and CS229, with a rating row vector va, and
Bob rated only CS221 and CS224N, with a rating row vector vb; the two students' ratings
are not stored in A. Does this mean it is impossible for them to share any common interest
in course topics? If it's impossible, explain why; if it's possible, explain how you can
determine the similarity of their course-topic interests.
The table below shows ratings for three Pinterest topics by three different users.
b) (2 points) Using Pearson correlation as the similarity measure, fill in the missing values with
user-user collaborative filtering.
c) (4 points) Instead of user-user collaborative filtering, we would like to use latent factors to
fill in the missing values in the table. We are given partially complete latent factor matrices Q
and P^T (blanks "_" denote the missing values):

Q =
  [ 2  _ ]
  [ 2  _ ]
  [ 1  4 ]

P^T =
  [ 1  4  2 ]
  [ 3  _  1 ]

Solve for the missing values in Q and P such that the objective
min_{P,Q} Σ_{(i,x)∈R} (r_xi − q_x · p_i)^2
is minimized to zero (r_xi means the rating on topic i by user x). Then solve for the missing
values in the table.
Hint: r̂xi = qx · pi , and we want to fill the missing values such that rxi − r̂xi = 0 for all
observed ratings.
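The hint above amounts to the matrix identity R̂ = Q P^T. As a small numpy illustration (the matrices below are hypothetical, chosen for this sketch; they are not the exam's Q and P):

```python
import numpy as np

# Hypothetical complete latent factors with 2 latent dimensions
# (NOT the exam's partially specified Q and P).
Q = np.array([[1.0, 2.0],   # one row q_x per user
              [2.0, 1.0],
              [1.0, 4.0]])
P = np.array([[1.0, 3.0],   # one row p_i per topic
              [4.0, 0.5],
              [2.0, 1.0]])

# Predicted rating matrix: R_hat[x, i] = q_x . p_i
R_hat = Q @ P.T
```

Each missing table entry is then read off as the corresponding dot product q_x · p_i.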
d) (3 points) After fitting P and Q in (c), we found that user A is a critical reviewer whose
mean rating is 1 star lower than the overall mean, and that, across Pinterest, Kangaroos gets a
mean rating 1 higher than the average topic. Suppose the predicted rating of user A for
Kangaroos is r̂_xi = q_x · p_i = a, where q_x and p_i result from minimizing the objective
function in part (c), and the overall mean rating is µ. If we take biases into account while still
ignoring regularization, does the new predicted rating equal µ + b_x + b_i + q_x · p_i = µ + 1 + 1 + a?
Please explain your reasoning in two sentences.
Suppose we have an undirected graph G(V, E). Consider a modified Bellman-Ford algorithm that
finds the shortest distance from any node to a source node s; we want to use a GNN to learn
to execute this modified Bellman-Ford algorithm (the embeddings are 1-dimensional).
At the start of the algorithm, we randomly select a source node s and initialize the distance of
every node (except the source node) to infinity. A node becomes reachable from s if any of its
neighbors is reachable from s, and we may update the shortest distance between a node v and s
as the minimum, over v's neighbors, of the neighbor's distance plus the length of the edge
connecting v to that neighbor.
For example, if node v1 has two neighbors v2 and v3, connected by edges of length 1, and v1's
current distance before the update is +∞, then given distance(v2) = 2 and distance(v3) = 4, we
can update distance(v1) = 2 + 1 = 3.
Specifically, the Bellman-Ford initialization for node embeddings is (at layer k = 1, for node v):

h_v^1 = 1 if v = s;  h_v^1 = +∞ if v ≠ s
Similar to the example, the edges have positive weights, which are the edge lengths (the edge
length between node u and node v is e_uv). For node v at layer k + 1, describe a message
function M(h_v^k), an aggregation function h_{N(v)}^{k+1}, and an update rule h_v^{k+1} for
the GNN such that it learns the task perfectly:

M(h_v^k) =

h_{N(v)}^{k+1} =

h_v^{k+1} =
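The distance update described above is an ordinary Bellman-Ford relaxation, which has the shape of one message-passing layer with a min-aggregation. A plain-Python sketch of one such layer (illustrative only, not the answer to fill in above; the node names and example values follow the v1/v2/v3 example in the text):

```python
import math

def bf_layer(h, edges):
    """One message-passing layer emulating a Bellman-Ford relaxation.

    h: dict node -> current 1-d embedding (distance estimate)
    edges: dict (u, v) -> length e_uv; undirected, so both orders are present
    """
    new_h = {}
    for v in h:
        # Messages from neighbors: their distance plus the edge length.
        msgs = [h[u] + e for (u, w), e in edges.items() if w == v]
        # Aggregate with min; keep the better of the old value and messages.
        new_h[v] = min([h[v]] + msgs)
    return new_h

# Example from the text: v1 has neighbors v2 (distance 2) and v3 (distance 4),
# each connected to v1 by an edge of length 1.
h = {"v1": math.inf, "v2": 2, "v3": 4}
edges = {("v2", "v1"): 1, ("v1", "v2"): 1,
         ("v3", "v1"): 1, ("v1", "v3"): 1}
```

One call of `bf_layer(h, edges)` updates v1's distance to 3, matching the worked example.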
In the decision tree learning algorithm, suppose a feature F has k different values: v1 , v2 , · · · , vk .
Let p be the total number of positive examples, and n be the total number of negative examples.
Let pi be the number of positive examples with feature value vi , and ni be the number of negative
examples with feature value vi .
In addition, suppose

p_1 / (p_1 + n_1) = p_2 / (p_2 + n_2) = · · · = p_k / (p_k + n_k)
What is the information gain if the examples are split using feature F ? Show and clearly justify
your derivation steps.
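For checking your derivation numerically (an illustrative helper, not a substitute for the justification the question asks for), entropy and information gain for a split can be computed as:

```python
import math

def entropy(p, n):
    """Binary entropy of a set with p positive and n negative examples."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0   # a pure (or empty) set has zero entropy
    q = p / total
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def information_gain(p, n, splits):
    """Information gain of splitting (p, n) into the subsets in `splits`.

    splits: list of (p_i, n_i), one per feature value v_i of feature F.
    """
    total = p + n
    remainder = sum((pi + ni) / total * entropy(pi, ni) for pi, ni in splits)
    return entropy(p, n) - remainder
```

Plugging in splits that satisfy the equal-proportion condition above lets you sanity-check the answer you derive.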
Suppose there are four web pages (A, B, C, D) in the network, and they form the graph as follows:
[Figure: a directed graph on the four pages A, B, C, D]
(a) (3 points) Without calculation, what is the resulting PageRank vector? Please provide
justification for full credit.
(b) (4 points) Write down the column-stochastic adjacency matrix M corresponding to this
graph. Given that the teleport parameter β is 0, how can you fix the problem in the original
graph by modifying M? (Write down the original M for this graph and the resulting M after
fixing this problem.)
(c) (3 points) Add one edge to the graph to create an example of a spider trap. Again, without
calculation, what is the new PageRank vector? Why?
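For reference, PageRank power iteration with teleport (r ← βMr + (1 − β)·1/N for a column-stochastic M) can be sketched as below. This is a generic helper, not the answer to any of the parts above, which ask for reasoning without calculation.

```python
import numpy as np

def pagerank(M, beta=0.8, tol=1e-10, max_iter=1000):
    """Power iteration on a column-stochastic matrix M with teleport beta."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        r_new = beta * (M @ r) + (1 - beta) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r
```

For example, on a 2-node cycle the iteration immediately settles at the uniform vector.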
[Figure: a directed graph on the four pages A, B, C, D]
To save you some computation time: with teleport parameter β = 0.8, the resulting PageRank
vector of this web graph, calculated with the Google matrix, is r = (0.08, 0.08, 0.40, 0.44).
(a) (3 points) Suppose the user is only interested in the topic related to page B. Given the
teleport parameter β = 0.8, what is the new PageRank matrix formulation A for the power
iteration r_{i+1} = A r_i? Please present your answer as a matrix.
(b) (2 points) (T/F) Suppose some malicious people comment on thousands of blog posts with
the link to their spam site. Google can statistically process the contents (text) to detect and
remove duplicate pages to combat the malicious attempt. Justify your answer in one sentence.
(c) (2 points) (T/F) If we have a graph G(V, E), and a set of trusted pages T ⊂ V , a page t ∈ T
will always have a non-negative spam mass. Justify your answer in one sentence.
(d) (3 points) Using the scenario of part (a), suppose that the trusted page selected by the
oracle is page B (the teleport parameter is still 0.8). What is the spam mass of page D (assume
the power iteration algorithm converges in 1 iteration)? How does it compare to the spam mass
of page C? Why is that the case? (Note: please show your work for partial credit; you can use
this calculator for matrix multiplications: https://matrix.reshish.com/multiplication.php)
For the following two application scenarios, determine whether Node2vec is an ideal embedding
method. Explain your reasoning in 1∼2 sentences.
(a) (3 points) Learn node embeddings in a historical Facebook social network and then make
predictions on a newly registered user based on its embedding.
(b) (3 points) Find functionally similar proteins across species based on the node embeddings
learned on different protein-to-protein interaction networks.
(a) (3 points) Suppose you have a large document dataset with millions of distinct words. In
addition, the documents consist of reviews of items on Amazon, which means there are very few
rare words. Which specific model that we've seen in class would you use for the feature
representation of the documents? Why?
Models to choose from: Word2Vec(CBOW), Word2Vec(Skip Gram), Supervised NN with
one-hot input vector, Autoencoder, SVD
(b) (3 points) What is one similarity and one difference between PCA and an autoencoder?
Suppose we have a 3-layer autoencoder with input dimension 64 and hidden dimension 64;
what would happen if we want to make predictions on unseen data?
During lecture, you were given v_i(q) = x_i(1 − e^(−f_i)), where x_i is the bid and f_i is the
fraction of leftover budget for bidder i. However, in the real world, we may want to compute
v_i(q) = x_i × CTR_i × (1 − e^(−f_i)), where CTR_i is the click-through rate for bidder i.
a) (3 points) There are advertisers A, B, and C with the following metrics in the table below.
A new query targeted to all three advertisers arrives. Who is the winner this round?
b) (3 points) Continuing from part (a), there is a new query targeting all three advertisers. Who
is the winner this round?
c) (4 points) There are four advertisers A, B, C, D, who bid on some of the search queries W, X,
Y, Z. Each advertiser has a budget of 3 search queries. A bids on W and X only; B bids on
X and Y only; C bids on Y and Z only; D bids on W and Z only. We assign queries using the
BALANCE Algorithm, with ties broken alphabetically. That is, A is preferred to B/C/D, B
is preferred to C/D, and C is preferred to D in case of ties. In what follows, represent the
allocation of queries by a list of the advertiser given each query in order, and use a dash (-) if
no bidder gets a query. For example, if A gets the first query, B the second, and no one gets
the third, then write AB-.
Suppose queries arrive in the order WXYZXYZWYZWX. What is the assignment of these
queries to bidders according to the BALANCE Algorithm? (This question is independent from
the previous parts; ties are broken alphabetically.)
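The BALANCE rule described above (give each query to the eligible bidder with the most remaining budget, breaking ties alphabetically) can be sketched as below. This is an illustrative helper, not the expected written answer.

```python
def balance(queries, bids, budget):
    """BALANCE: assign each query to the bidder with the most budget left.

    queries: string of query letters, in arrival order
    bids: dict advertiser -> set of queries it bids on
    budget: starting budget (in queries) per advertiser
    Ties go to the alphabetically first advertiser; '-' means no bidder left.
    """
    remaining = {a: budget for a in bids}
    out = []
    for q in queries:
        # Eligible bidders, alphabetical order so max() breaks ties that way.
        eligible = [a for a in sorted(bids) if q in bids[a] and remaining[a] > 0]
        if not eligible:
            out.append("-")
            continue
        winner = max(eligible, key=lambda a: remaining[a])
        remaining[winner] -= 1
        out.append(winner)
    return "".join(out)
```

For example, with two advertisers A and B both bidding on X with budget 1, the stream "XXX" yields "AB-".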
We consider the following two graphs that are composed of smaller sub-graphs as building blocks.
While the first graph is based on a number of circle graphs, the second graph connects multiple
clique graphs.
Let G denote the entire graph, and H_i (i = 1, · · · , K) denote the component circle or clique
graphs. Assume there are K component graphs (K > 2), and each component graph contains
N nodes (N > 2). Let S = {H_1, H_2, · · · , H_K} be the partitioning of G that considers each
component graph as a separate community. Assume that the edge weights are all 1.
a) (6 points) For both the circle-based and the clique-based graph, express the total edge weight
2m as a function of K and N. Next, for both classes of graphs, express the modularity Q(G, S)
of the graph G under the partitioning S as a function of K, N, and m.
a) (1 point) Suppose we want to calculate the number of distinct elements in the following
data stream: 2, 3, 1, 1, 2, 5, 2, 1. Which of the following algorithms is best suited for this
problem? (Bloom filter, Flajolet-Martin, exponentially decaying window)
Note: You don't need to provide justification
b) (2 points) Now, given a hash function h(x) = (x + 2) mod 64, use the algorithm you chose
in the previous part to estimate the number of distinct elements in the data stream.
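For reference, the Flajolet-Martin estimator (2^R, where R is the maximum number of trailing zero bits of h(x) over the stream) can be sketched as below; this single-hash version is illustrative only and does not replace the hand computation the question asks for.

```python
def trailing_zeros(x, bits=6):
    """Number of trailing zero bits of x (64 buckets = 6-bit hash values)."""
    if x == 0:
        return bits          # convention: an all-zero hash counts as `bits` zeros
    r = 0
    while x % 2 == 0:
        x //= 2
        r += 1
    return r

def fm_estimate(stream, h, bits=6):
    """Flajolet-Martin: estimate the distinct count as 2^R."""
    R = max(trailing_zeros(h(x), bits) for x in stream)
    return 2 ** R
```

Applying `fm_estimate` to the stream above with h(x) = (x + 2) mod 64 reproduces the estimate asked for in part (b).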
c) (1 point) You are developing software for an online supermarket, and your goal is to
recommend new items to users, but you don't want to recommend items that they have already
bought. Each item has a unique long-integer id, but the supermarket has so many items (2^64
items) that you cannot store them all in memory. Which of the following algorithms is best
suited for testing whether a recommended item is unseen? (Bloom filter, Flajolet-Martin,
exponentially decaying window)
Note: You don't need to provide justification
d) (2 points) Using the algorithm you chose for the previous part, given the same hash function
h(x) = (x + 2) mod 64, where x is the item id, and a customer who has already bought 50
items, what is the error rate of the algorithm (labeling an item as "already bought" when the
customer never bought it)?
e) (4 points) You are not satisfied with the performance of the algorithm in part (d). Identify
whether each of the following methods can reduce the error rate.
i (Yes/No/Depends) Change h(x) to h(x) = (x + 2) mod 512 (keeping the number of hash
functions fixed and adapting them to the larger memory)
ii (Yes/No/Depends) Increase the number of hash functions (each has the same number of
possible keys/buckets as h)
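A minimal Bloom filter matching the setup of parts (c)-(e) could be sketched as below (illustrative only; the false-positive analysis is what the question asks for). A single hash over 64 buckets makes false positives easy to observe.

```python
class BloomFilter:
    """Minimal Bloom filter with a fixed list of hash functions."""

    def __init__(self, n_buckets, hash_fns):
        self.n = n_buckets
        self.hash_fns = hash_fns
        self.bits = [False] * n_buckets

    def add(self, item):
        # Set the bit for every hash of the item.
        for h in self.hash_fns:
            self.bits[h(item) % self.n] = True

    def might_contain(self, item):
        # False => definitely unseen; True => possibly seen (false positives
        # happen when another item set the same bits).
        return all(self.bits[h(item) % self.n] for h in self.hash_fns)
```

For example, with h(x) = (x + 2) mod 64, items 10 and 74 hash to the same bucket, so adding 10 makes 74 a false positive.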
The idea of the PCY Algorithm is: besides counting the frequent items in the first pass, PCY
also hashes pairs of items into buckets with a hash function h and counts how many pairs of
items hash into each bucket. Then, in the second pass, we summarize the hash table obtained
in the first pass as a bitmap/bit-vector, and we consider only candidate pairs I_cand that
(1) consist of frequent items, and (2) are hashed into a frequent bucket (the bucket's
bitmap/bit-vector entry is 1). Then, you can do the actual counting for each candidate pair.
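The two PCY passes described above can be sketched as follows (a minimal illustration with hypothetical baskets and hash function; not needed to answer the parts below):

```python
from itertools import combinations
from collections import Counter

def pcy_pass1(baskets, h, n_buckets):
    """Pass 1: count single items and hash pair occurrences into buckets."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[h(pair) % n_buckets] += 1
    return item_counts, bucket_counts

def pcy_candidates(baskets, h, n_buckets, min_support):
    """Pass 2 candidates: both items frequent AND pair in a frequent bucket."""
    item_counts, bucket_counts = pcy_pass1(baskets, h, n_buckets)
    bitmap = [c >= min_support for c in bucket_counts]   # one bit per bucket
    frequent = {i for i, c in item_counts.items() if c >= min_support}
    cands = set()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if (pair[0] in frequent and pair[1] in frequent
                    and bitmap[h(pair) % n_buckets]):
                cands.add(pair)
    return cands
```

The bitmap is the structure whose memory footprint parts (a) and (b) ask you to account for.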
(a) (2 points) Assume we have n items and a hash function h that maps pairs of items to b
buckets, and we use an integer type (4 bytes) to record the counts in the first pass. How much
space (in bytes) do we need in the first pass? (Note: express your answer in terms of n and b.)
(b) (2 points) In the second pass, assume we have m frequent items (m < n), each item ID is an
integer, and we use a single bit (1 byte = 8 bits) to represent whether each bucket is frequent
or not (i.e., the bitmap in the second pass). How much space (in bytes) do we need at the start
of the second pass?
(c) Now we consider an extension of the PCY algorithm, called the multistage algorithm.
The idea is that in the second pass, instead of doing the counting for the candidate pairs I_cand
as in PCY, we use another hash function h′ to hash I_cand. In pass 2, we use the available
main memory for another hash table, and we only hash a pair {i, j} when both i and j are
frequent AND {i, j} is hashed to a frequent bucket by h. We then obtain a new bitmap BM2
for h′, and count pairs that are hashed to frequent buckets by both h and h′; the goal is to
further reduce the number of candidate pairs for which we do the counting.
i (3 points) Suppose we would like to get the benefits of the multistage algorithm. Can we
use the same hash function of item pairs for the first and second passes of the multistage
algorithm (i.e., h = h′)? Explain your answer.
Say you’re trying to find similar data points in some data. Consider four data points they have:
“aaa”, “aaaaaa”, “ababab”, “aba” where we consider character level tokens.
(a) (3 points) Suppose we want to find nearby neighbors using shingling, min-hashing, and LSH.
If we use 2-shingles (considering only unique shingles in a data point), 4 permutations, and set
b = 2, r = 2 for LSH, which pairs, if any, are guaranteed (with probability 1) to always be
selected as candidate pairs?
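The 2-shingling and min-hashing steps can be sketched as below (illustrative; the permutations passed in are hypothetical, not the exam's 4 permutations):

```python
def shingles(s, k=2):
    """Set of unique character-level k-shingles of string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, permutations):
    """Min-hash signature: for each permutation of the shingle universe,
    record the rank of the first shingle present in the set."""
    sig = []
    for perm in permutations:
        rank = {sh: r for r, sh in enumerate(perm)}
        sig.append(min(rank[sh] for sh in shingle_set))
    return sig
```

Two data points agree on a signature row exactly when the permutation's first shingle in their union lies in both sets, which is what the band-level comparisons in LSH test.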
(b) (4 points) In order to align your data nicely, you decide to pad your data with the token “0”
so they all have the length of the longest string. For example, “aaa” becomes “aaa000” but
“aaaaaa” stays as “aaaaaa”.
i In order to account for this, you decide to exclude a certain type of length-2 shingles from
your processing pipeline. What is the form of the shingles you want to remove so that the
candidate-pair results are unchanged even after the padding? (You can describe the shingles
verbally or use a regular expression.)
ii Suppose you do not exclude any shingles after the padding. How likely are the true
candidate pairs in (a) to be identified as candidate pairs compared to the case without
padding? Think beyond the given examples. (Choose from: more likely / less likely /
the same / unclear.)
If it is more likely, less likely, or the same, provide a proof. If it is unclear, provide
counter-examples showing where padding increases the chance of pairs being identified as
candidates and where it decreases it.
(c) (3 points) Say the data points are actually unordered sets of tokens, such as the classes of
objects detected in an image. For example, "aaabbb" is functionally equivalent to "ababab", as
both indicate that 3 'a's and 3 'b's were detected (the ordering of the inputs could be
arbitrary). Is there a way to modify the shingling, min-hashing, LSH pipeline so that it gives
consistent results for this data and is permutation invariant? (Hint: what hyperparameters can
you change?)