Lecture Notes For Algorithms For Data Science: 1 Nearest Neighbors
Jayesh Choudhuri
January 12
Nearest Neighbors
One of the fundamental problems in data mining is finding similar items: for example, given an
image, finding similar images in a dataset of images, or, given a collection of web pages,
finding near-duplicate pages. The basic method is to perform a linear search, i.e.
1. In the case of images: compare the query image with each image in the dataset.
2. In the case of documents: take the query document as a string and compare it with each
document in the dataset to find similar strings/documents.
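The linear search described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names `linear_search` and `squared_euclidean` are hypothetical, and using squared Euclidean distance on flattened pixel values is an assumption made only for this example (the notes do not fix a particular distance).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linear_search(query: T, dataset: Sequence[T],
                  distance: Callable[[T, T], float]) -> int:
    """Return the index of the dataset item closest to the query.

    Every item is compared with the query, so the cost grows linearly
    with the size of the dataset.
    """
    if not dataset:
        raise ValueError("dataset is empty")
    best_index = 0
    best_distance = distance(query, dataset[0])
    for i in range(1, len(dataset)):
        d = distance(query, dataset[i])
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index


# Toy "images": each one is a flattened list of four pixel values.
images = [[0, 0, 0, 0], [255, 255, 255, 255], [10, 20, 30, 40]]
query = [12, 18, 33, 41]
print(linear_search(query, images, squared_euclidean))  # prints 2
```

Because the distance is passed in as a parameter, the same loop would work for documents once a suitable document distance is chosen.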
Representation of the dataset
1. Images:
   pixel values
   SIFT features
2. Documents:
   string
   vector
   set
Similarity of Documents
Understanding what similarity means is important. Here we are looking for character-level
similarity, which does not require us to examine the words in the documents, their usage, or
their semantic meaning. Finding documents that are exact duplicates is easy and can be done by
comparing two documents character by character. In many cases, however, the documents are not
exactly identical but share a large portion of their text. Searching for such documents means
finding near duplicates rather than exact duplicates. Some applications of finding near
duplicates are plagiarism detection, mirror pages, and articles from the same source. Documents
are generally normalised or pre-processed by removing punctuation and converting all characters
to lower case.
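The normalisation step and the exact-duplicate check can be sketched as below. The helper names `normalize` and `is_exact_duplicate` are hypothetical, and treating "punctuation" as the ASCII set in Python's `string.punctuation` is an assumption; real pipelines may also collapse whitespace or handle Unicode punctuation.

```python
import string


def normalize(text: str) -> str:
    """Remove ASCII punctuation and convert all characters to lower case."""
    # str.maketrans with a third argument maps those characters to None,
    # so translate() deletes them.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower()


def is_exact_duplicate(doc_a: str, doc_b: str) -> bool:
    """Character-by-character comparison of two normalised documents."""
    return normalize(doc_a) == normalize(doc_b)


print(normalize("Hello, World!"))  # prints: hello world
print(is_exact_duplicate("We are having class here.",
                         "we are having class here"))  # prints: True
```

This only detects exact duplicates after normalisation; two documents that merely share a large portion of text still compare as different.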
Shingling:
One way of representing documents is as sets. The elements of the set are called shingles.
Given a positive integer k and the sequence of terms in a document d, the k-shingles of d are
defined to be the set of all consecutive sequences of k terms in d.
For example, consider the following text:
We are having class here
Taking k=5, the document is represented by the set of its 5-shingles.
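A small sketch of how the k-shingles of this example could be computed. It assumes that the "terms" are single characters (in line with the character-level similarity discussed earlier), that spaces count as characters, and that the text is lower-cased first; the function names `char_shingles` and `word_shingles` are hypothetical. A word-level variant is included for comparison, since the definition applies to any sequence of terms.

```python
from typing import Set, Tuple


def char_shingles(text: str, k: int) -> Set[str]:
    """Set of all substrings of k consecutive characters in text."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def word_shingles(text: str, k: int) -> Set[Tuple[str, ...]]:
    """Set of all runs of k consecutive whitespace-separated words in text."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "We are having class here".lower()  # assumed normalisation

shingles = char_shingles(doc, 5)
print(len(shingles))          # prints 20
print("class" in shingles)    # prints True
print("we ar" in shingles)    # prints True

print(len(word_shingles(doc, 2)))  # prints 4
```

The text has 24 characters, so there are 24 - 5 + 1 = 20 windows of length 5, and in this example all of them are distinct. A document shorter than k yields an empty set.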
Spring 2015
References
Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
An Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan,
Hinrich Schütze