
CS168, Spring 2024

Mini-Project #2
Due by 11am on Thursday, April 18.

Instructions
• You can work in groups of up to four students. If you work in a group, please submit one assignment
via Gradescope (please link all group members to the submission).

• Detailed submission instructions can be found on the course website (https://web.stanford.edu/class/cs168/)
under the “Coursework - Assignments” section. If you work in a group, only one member should submit.
• Use 12pt or higher font for your writeup.
• Make sure the plots you submit are easy to read at a normal zoom level.

• If you’ve written code to solve a certain part of a problem, or if the part explicitly asks you to implement
an algorithm, you must also include the code in your pdf submission.
• Code marked as a deliverable should be pasted into the relevant section. Keep variable names consistent
with those used in the problem statement, and with general conventions. There is no need to include import
statements and other scaffolding if it is clear from context. Use the verbatim (or “minted”) environment to
paste code in LaTeX.

def example():
print "Your code should be formatted like this."

• Reminder: No late assignments will be accepted, but we will drop your lowest assignment grade.

Part 1: Similarity Metrics


Goal: The goal of this part of the assignment is to better understand the differences between distance
metrics, and to think about which metric makes the most sense for a particular application.

Description: In this part you will look at the similarity between the posts on various newsgroups. We’ll
use the well-known 20 newsgroups dataset.1 You will use a version of the dataset where every article is
represented by a bag-of-words — a vector indexed by words, with each component indicating the number of
occurrences of that word. You will need 3 files: data50.csv, label.csv, and group.csv, all of which can be
downloaded from the course website. In data50.csv there is a sparse representation of the bags-of-words,
with each line containing 3 fields: articleId, wordId, and count. To find out which group an article belongs
to, use the file label.csv, where for articleId i, line i in label.csv contains the groupId. Finally, the
group names are in group.csv, with line i containing the name of group i.
We’ll use the following similarity metrics, where x and y are two bags of words:
• Jaccard Similarity: J(x, y) = (Σi min(xi, yi)) / (Σi max(xi, yi)).

• L2 Similarity2: L2(x, y) = −||x − y||2 = −√(Σi (xi − yi)²).

• Cosine Similarity: SC(x, y) = (Σi xi · yi) / (||x||2 · ||y||2).

1 http://qwone.com/~jason/20Newsgroups/
2 While we typically talk about L2 distance, to make sure that a higher number means a higher similarity we negate the
distances.

Note that Jaccard and cosine similarity are numbers between 0 and 1, while L2 similarity is between −∞
and 0 (with higher numbers indicating more similarity).
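For concreteness, here is one possible Python sketch of the three metrics, assuming each bag of words is
stored as a sparse dictionary mapping wordId to count (just one representation among many; the function
names are ours, so adapt them to your own code):

import math

def jaccard_similarity(x, y):
    # J(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i)
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den

def l2_similarity(x, y):
    # Negated Euclidean distance, so larger values mean more similar.
    keys = set(x) | set(y)
    return -math.sqrt(sum((x.get(k, 0) - y.get(k, 0)) ** 2 for k in keys))

def cosine_similarity(x, y):
    dot = sum(c * y.get(k, 0) for k, c in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    return dot / (norm_x * norm_y)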

(a) (2 points) Make sure you can import the given datasets into whatever language you’re using. For
example, if you’re using python, read the data50.csv file and store the information in an appropriate
way. Remember that the total number of words in the corpus is huge, so you might want to work with
a sparse representation of your data (e.g., you don’t want to waste space on words that don’t occur in
a document). If you’re using MATLAB, you can simply import the data using the GUI.
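One possible way to hold the data in Python is a dictionary of sparse per-article dictionaries (a sketch,
assuming data50.csv is comma-separated with no header row; adjust the parsing if your copy differs):

import csv
from collections import defaultdict

def load_articles(path="data50.csv"):
    # Map each articleId to a sparse bag of words: {wordId: count}.
    articles = defaultdict(dict)
    with open(path) as f:
        for article_id, word_id, count in csv.reader(f):
            articles[int(article_id)][int(word_id)] = int(count)
    return articles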
(b) (10 points) Implement the three similarity metrics described above. For each metric, prepare the
following plot. The plot will look like a 20 × 20 matrix. Rows and columns are indexed by newsgroups
(in the same order). For each entry (A, B) of the matrix (including the diagonal), compute the average
similarity over all ways of pairing up one article from A with one article from B. After you’ve computed
these 400 numbers, plot your results in a heatmap. Make sure that you label your axes with the group
names and pick an appropriate colormap to represent the data: the rainbow colormap may look fancy,
but a simple color map from white to blue may be a lot more insightful. Make sure to include a legend.
(Note that the computation might take five or ten minutes, but shouldn’t take more than that.)
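If it helps, the averaging and plotting can be organized roughly as follows (a sketch only; articles_by_group
and similarity are placeholders for whatever data structures and metric functions you build, and "Blues" is
just one white-to-blue colormap):

import numpy as np
import matplotlib.pyplot as plt

def average_similarity_matrix(articles_by_group, group_names, similarity):
    # articles_by_group maps a group name to the list of its articles' bags of words.
    n = len(group_names)
    avg = np.zeros((n, n))
    for a, ga in enumerate(group_names):
        for b, gb in enumerate(group_names):
            avg[a, b] = np.mean([similarity(x, y)
                                 for x in articles_by_group[ga]
                                 for y in articles_by_group[gb]])
    return avg

def plot_heatmap(avg, group_names, title):
    plt.figure(figsize=(8, 7))
    plt.imshow(avg, cmap="Blues")
    plt.colorbar(label="average similarity")
    plt.xticks(range(len(group_names)), group_names, rotation=90)
    plt.yticks(range(len(group_names)), group_names)
    plt.title(title)
    plt.tight_layout()
    plt.show()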
(c) (5 points) Based on your three heatmaps, which of the similarity metrics seems the most reasonable,
and why would you expect those metrics to be better suited to this data?
Are there any pairs of newsgroups that are very similar? Would you have expected these to be similar?

Deliverables: All of your code, three heatmaps for (b), and your discussion/explanations for (c).

Parts 2 and 3: A nearest-neighbor classification system


A “nearest-neighbor” classification system is conceptually extremely simple, and often is very effective. Given
a large dataset of labeled examples, a nearest-neighbor classification system will predict a label for a new
example, x, as follows: it will find the element of the labeled dataset that is closest to x—closest in whatever
metric makes the most sense for that dataset—and then output the label of this closest point. [As you can
imagine, there are many natural extensions of this system—for example considering the labels of the r > 1
closest neighbors.]
From a computational standpoint, naively, finding the closest point to x might be time consuming if the
labeled dataset is large, or the points are very high dimensional. In the next two parts, you will explore two
ways of speeding up this computation: dimension reduction and locality sensitive hashing.
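As a point of reference, the brute-force version of such a system is only a few lines (a sketch, assuming a
generic similarity function and a dataset stored as a list of vectors with a parallel list of labels):

def nearest_neighbor_label(query, data, labels, similarity):
    # Return the label of the datapoint most similar to the query.
    best = max(range(len(data)), key=lambda i: similarity(query, data[i]))
    return labels[best]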

Part 2: Dimension Reduction


Goal: The goal of this part is to get a feel for the trade-off in dimensionality reduction between the quality
of approximation and the number of dimensions used.

Description: You may have noticed that it takes some time to compute all the distances in the previous
part (though it should not take more than a couple of minutes). In this part we will implement a dimension
reduction technique to reduce the running time, which can be used to also speed up classification.
In the following, k will refer to the original dimension of your data, and d will refer to the target dimension.

• Random Projection: Given a set of k-dimensional vectors {v1 , v2 , . . .}, define a d × k matrix M by
drawing each entry randomly (and independently) from a normal distribution of mean 0 and variance
1. The d-dimensional reduced vector corresponding to vi is given by the matrix-vector product M vi .
We can think of the matrix M as a set of d random k-dimensional vectors {w1 , . . . , wd } (the rows
of M ), and then the jth coordinate of the reduced vector M vi is the inner product of vi
and wj . If you need to review the basics of matrix-vector multiplication, see the primer on the course
webpage.
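A minimal numpy sketch of this step (assuming the data has been arranged as a k × n matrix V whose
columns are the vectors vi; the function and variable names are ours):

import numpy as np

def random_projection(V, d, seed=0):
    # V has shape (k, n): one k-dimensional data vector per column.
    k = V.shape[0]
    rng = np.random.default_rng(seed)
    M = rng.normal(loc=0.0, scale=1.0, size=(d, k))  # d x k, i.i.d. N(0, 1) entries
    return M @ V  # column i is the reduced vector M v_i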

(a) (5 points) (Baseline Classification) Implement the baseline cosine-similarity nearest-neighbor classifi-
cation system that, for any given document, finds the document with largest cosine similarity, and
returns that newsgroup/label. (Do each computation using brute-force search.)
Compute the 20 × 20 matrix whose entry (A, B) is defined by the number of articles in group A
that have their nearest neighbor in group B. (When computing an article’s nearest neighbor, don’t
compute the similarity with itself, otherwise all the articles will be their own nearest neighbors, and
this part would be meaningless.) Does it make sense why this should correspond to the accuracy of a
nearest-neighbor classification system based on this dataset?
Plot these results in a heatmap.
What is the average classification accuracy (i.e., what fraction of the 1000 articles have the same
newsgroup/label as their closest neighbor)?

(b) (2 points) Your plots for Part 1(b) were symmetric—why is the matrix in (a) not symmetric?
(c) (7 points) Implement the random projection dimension reduction function and plot the nearest-neighbor
visualization as in part (a) for cosine similarity and d = 10, 25, 50, 100, 200, 500, 1000.
What is the average classification error for each of these settings?
For which values of the target dimension are the results comparable to the original dataset?
(d) (4 points) Suppose each document were much, much longer. Would you need larger target dimensions
in the random dimension reduction to accurately capture the similarity between these longer articles?
Explain your answer in at most 3 or 4 sentences.
(e) (5 points) Suppose you are trying to build a very fast article classification system, and have an enormous
dataset of n labeled tweets/articles. What is the time it takes to reduce the dimensionality of the data?
Give the Big-Oh runtime as a function of n (the number of labeled datapoints), k (the original dimension
of each datapoint), and d (the reduced dimension). What is the overall Big-Oh runtime of classifying
a new article? [Feel free to assume a naive matrix multiplication algorithm, as opposed to “fast matrix
multiplication” algorithms, such as Strassen’s algorithm.]
Now suppose you are instead trying to classify tweets; the bag-of-words representation is still a k-
dimensional vector, but now each tweet has, say, only 50 ≪ k words. Explain how you could exploit the
sparsity of the data to improve the runtime of the naive cosine-similarity nearest-neighbor classification
system (from part (a)).
How does this runtime compare to that of a dimension-reduction nearest-neighbor system (as in the
first step of this part) that reduces the dimension to d = 50? [For this part, we expect a theoretical
analysis—you do not need to implement these algorithms and measure their runtimes empirically.]

Deliverables: Code, figures, and classification performance for part (a); brief explanation for part (b); code,
plots, and classification performance for part (c); yes/no and brief explanation for (d); discussion and analysis
for part (e).

Part 3: Locality Sensitive Hashing


Goal: The goal of this part is to think about a basic Locality-Sensitive-Hashing nearest-neighbor classifi-
cation system which could be used to speed up the computations performed in Part 2. This part is purely
theoretical analysis, and we split this up into a number of very small pieces—(a), (b), (c), and (d) all have
correct one-sentence “proofs”.

Description: Below is a description of a Random Hyperplane Hashing LSH scheme, which has the
property that vectors with larger cosine similarity will have a higher probability of colliding. The hashing
scheme, and associated nearest-neighbor classification system, is defined as follows:

• Hyperplane Hashing: Construct ℓ hashtables in the following manner: for the i’th hashtable, define a
d × k matrix Mi by drawing each entry randomly (and independently) from a normal distribution of
mean 0 and variance 1. The ith hashvalue of the k-dimensional vector v is defined as the binary vector
sgn(Mi v) ∈ {0, 1}^d, where each positive coordinate of Mi v is replaced by a “1” and each nonpositive
coordinate by a “0”. Note that each hashtable has 2^d buckets, and each data point is placed in exactly
one bucket of each of the ℓ hashtables.

• Classification: Given a dataset X, suppose each original datapoint v ∈ X has already been hashed (to
bucket sgn(Mi v) of the ith hashtable, for each i = 1, 2, . . . , ℓ). Then, to predict the label of a (new)
query vector q, do the following: (i) compute its ℓ hashvalues (bucket sgn(Mi q) of the ith hashtable);
(ii) consider the set Sq of the original datapoints that were placed in at least one of these ℓ buckets;
(iii) if |Sq| ≤ 10√n, then go through the elements of Sq one by one, computing the angle that each one
forms with q, and return the label of the closest one to q; if |Sq| > 10√n, do the above search but only
look at the first 10√n elements of Sq.
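A short Python sketch of the hashing step (the query-time search over Sq is omitted; the matrices Mi are
drawn exactly as in Part 2, and the tuple-of-bits bucket key is just one convenient encoding of sgn(Mi v)):

import numpy as np

def build_hash_tables(X, ell, d, seed=0):
    # X has shape (n, k): one k-dimensional datapoint per row.
    rng = np.random.default_rng(seed)
    matrices = [rng.normal(size=(d, X.shape[1])) for _ in range(ell)]
    tables = []
    for M in matrices:
        table = {}
        for idx, v in enumerate(X):
            bucket = tuple((M @ v > 0).astype(int))  # positive -> 1, nonpositive -> 0
            table.setdefault(bucket, []).append(idx)
        tables.append(table)
    return matrices, tables

def candidate_set(q, matrices, tables):
    # S_q: indices of datapoints sharing at least one bucket with the query q.
    S = set()
    for M, table in zip(matrices, tables):
        S.update(table.get(tuple((M @ q > 0).astype(int)), []))
    return S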

(a) (3 points) Consider the ith hash table in the above scheme, corresponding to matrix Mi . For two
vectors, x, y ∈ Rk that form an angle of angle(x, y) = θ < π radians, what is the probability (over the
randomness in the construction of the matrix Mi ) that they hash to the same bucket in this ith hash
function? [Hint: for each of the d coordinates that define the hash of x and y, what is the probability
that they are equal, as a function of θ? To figure this out, it might be helpful to consider, geometrically,
what it means for x and y to have the same sign when multiplied by a random vector. What does
this look like in two dimensions? What random vectors will cause the inner products to have opposite
signs?] Prove your claim in at most two sentences. If you set θ = 0.1, numerically, what probability
do you get?
(b) (2 points) In the next few parts, we’ll let n denote the number of datapoints in our dataset X, and
argue that we can pick ℓ and d in such a way that with high probability, 1) Sq will contain almost all
“close” points—specifically all points whose angle with q is at most 0.1, and 2) Sq won’t contain too
many “far” points—specifically, Sq will contain at most O(√n) points with angle more than 0.2 with
q (and probably won’t contain any points with angle more than 0.3, though we won’t bother showing
that).

Suppose there is a point x ∈ X such that angle(x, q) ≤ 0.1. Prove that


Pr[x ∈ Sq] ≥ 1 − (1 − 0.968^d)^ℓ ≥ 1 − e^(−ℓ·0.968^d).

[Hint: to get the final inequality, use the fact that (1 − δ) < e^(−δ).]
(c) (2 points) Prove that the expected number of elements of Sq that have an angle with q of more than
0.2 radians is bounded as follows:
E[|{x ∈ Sq with angle(x, q) > 0.2}|] ≤ n(1 − (1 − 0.937^d)^ℓ) ≤ nℓ · 0.937^d.

[Hint: to get the final inequality, use the fact that for δ ∈ (0, 1) and m ≥ 1 the following is true:
(1 − δ)^m ≥ 1 − mδ.]

(d) (1 point) Consider setting d ≈ 15 log n so that 0.937^d ≈ 1/n, and hence 0.968^d ≈ 1/√n, and set
ℓ = 5√n. Using the previous two parts, show that if angle(x, q) < 0.1 then Pr[x ∈ Sq] > 0.99, and the
expected number of points in Sq with angle more than 0.2 from q is at most 5√n.
(e) (3 points) If your dataset X consists of n points in Rk , and the nearest neighbor of your query q
does have angle at most 0.1 with the query point, then what should we expect the runtime of the
Classification protocol to be, as a function of n and k, where we plugged in the values of d and ℓ from
the previous part? When do we get a “win” over the naive brute-force nearest-neighbor search? Is
there a regime in which it might make sense to use dimension reduction together with this scheme?

(f) (1 point) The above hashing based nearest neighbor search implicitly assumed that if there is a point
with angle at most 0.1 from our query point, then we are satisfied if we find a point with angle at most
0.2 from the query point. Is this a reasonable goal? Discuss in two or three sentences. (Note: A slight
extension of the scheme we described has similar runtime, but satisfies the stronger guarantee that if
the nearest neighbor has angle α, then we expect to find a point with angle at most 2α, and we don’t
need to know α in advance...)

(g) (1 point) [Challenge] Assuming that there is some point with angle at most 0.1 from our query point,
if our goal is to find a point whose angle is at most 0.1c from the query for some constant c ≥ 1, what
is the right variant of the approach, how should we set d and ℓ to minimize the runtime, and what is
the runtime as a function of c and n? [Feel free to assume the original dimensionality, k = O(log n),
and feel free to give a runtime of the form O(nf (c) ) where you ignore any constant terms and log(n)
terms.]

Deliverables: Parts (a)-(d): short rigorous analyses. Parts (e) and (f): short discussions. Part (g): algorithm
sketch and sketch of analysis.
