
Lecture Notes for Algorithms for Data Science

Jayesh Choudhuri

CS430 Algo for DS, Spring 2015
Instructor: Anirban Dasgupta

January 12

Nearest Neighbors

One of the fundamental problems in data mining is finding similar items. For example, given an image, find similar images in a dataset of images, or, looking at a collection of web pages, find near-duplicate pages. The basic method would be to perform a linear search (as sketched below), i.e.
1. In case of images: compare the query image with each image in the dataset,
2. In case of documents: take a string of the query document and find the similar string/document by going through all the documents in the dataset.
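
As a minimal sketch, the linear-search baseline in Python might look like the following; similarity here is a hypothetical function that returns a higher score for more similar items:

def linear_search_nearest(query, dataset, similarity):
    # Compare the query against every item and keep the most similar one.
    best_item, best_score = None, float("-inf")
    for item in dataset:
        score = similarity(query, item)
        if score > best_score:
            best_item, best_score = item, score
    return best_item

This costs one similarity computation per item in the dataset.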

Representation of dataset
1. Images:
pixel values
SIFT features
2. Documents:
string
vector
set

Similarity of Documents

Understanding the meaning of similarity is important. In this case we are trying to find the
character-level similarity and this does not requires us to examine the words in the documents
and their uses or semantic meaning. Finding documents that are exactly duplicate is easy and
can be done by comparing two documents character-by-character. In many cases the documents
are exactly identical but share a large portion of similar texts. Searching for such documents is
like finding near duplicates instead of exact duplicates. Some of the application of finding near
duplicates are Plagiarsm, Mirror pages, Articles from same source, etc. Generally documents are
normalised or pre-processed by removing the punctuations and by converting all the characters to
lower case.
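
A minimal sketch of this normalisation step in Python (the function name normalize is only for illustration):

import string

def normalize(text):
    # Convert to lower case and strip punctuation before shingling.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

print(normalize("We are having class, HERE!"))   # "we are having class here"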
Shingling:
One of the ways of representing documents is to represent them as sets. The elements of the set are called shingles. Given a positive integer k and a sequence of terms in the document d, the k-shingles of d are defined to be the set of all consecutive sequences of k terms in d.
For example, consider the following text:
We are having class here
Taking k = 5, the representation of the document as 5-shingles would be

{We ar, e are, are , are h, ..., here}


In the above case k was taken as a number of characters. One can also take k as a number of words, which results in a different representation. Taking k = 2 words in the above example we have:
{We are, are having, having class, class here}
Such a representation is known as a k-gram representation. If k = 1 it is known as a unigram, for k = 2 a bigram, for k = 3 a trigram, and so on. Word-level k-grams work well for English, where a space acts as a separator between two words, but in languages like Chinese there is no separator between words, and so character-level shingling can be used. A sketch of both representations is given below.
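
A minimal sketch of both representations in Python, using the example sentence above (the helper names char_shingles and word_kgrams are only for illustration):

def char_shingles(text, k):
    # All consecutive character sequences of length k, collected as a set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_kgrams(text, k):
    # All consecutive sequences of k words, collected as a set.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "We are having class here"
print(char_shingles(doc, 5))   # {'We ar', 'e are', ' are ', 'are h', ...}
print(word_kgrams(doc, 2))     # {'We are', 'are having', 'having class', 'class here'}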
Representation of documents as a vector:
Documents can also be represented as a vector, where each element of the vector can be a boolean, showing the presence of a term in the document, or an integer, showing the frequency of a term in the document. In the context of representing documents as vectors the following terms are defined:
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a weighting measure used in text mining and information retrieval. TF-IDF is a statistical measure showing the importance of a word in a document within a corpus. The TF-IDF weight of a word increases with the frequency of the word in a document but is offset by the frequency of the word in the corpus. TF-IDF weighting is used for scoring and ranking document relevance.
Term Frequency:
Each term in the document is assigned a weight. The weight depends on the number of times the word occurs in the document. One of the simplest ways to weight a word in a document is to assign it a weight equal to the frequency of the word in the document. This weighting scheme is known as term frequency and is denoted tf_{t,d}, where t is the term and d the document.
The term frequency weight gives quantitative information about the document. Such a representation of a document is known as the bag-of-words model: the order of terms is not considered, but the number of occurrences of each term is important (in contrast to the boolean representation). A sketch is given below.
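
A minimal sketch of the bag-of-words term-frequency representation in Python (Counter is from the standard library; the document is assumed to be already normalised):

from collections import Counter

def term_frequency(document):
    # tf_{t,d}: how many times each term t occurs in document d.
    # Word order is discarded; only the counts matter (bag of words).
    return Counter(document.split())

print(term_frequency("we are having class here and we are learning"))
# Counter({'we': 2, 'are': 2, 'having': 1, 'class': 1, 'here': 1, 'and': 1, 'learning': 1})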
Inverse Document frequency:
The term frequency weight suffers from a critical problem: all terms are treated as equally important, without taking into account how useful a term is for determining the relevance of a document within the corpus. Some terms have no power in determining relevance. Consider a corpus of documents from the auto industry: almost all the documents are likely to contain the word auto. So, for relevance determination it is necessary to attenuate the effect of terms that occur very frequently in the collection. In order to scale down the term frequency, a new measure is introduced, the document frequency df_t, which gives the number of documents in the collection that contain term t. Using the document frequency we define the inverse document frequency idf_t, given by
idf_t = log(N / df_t)

where N is the total number of documents in the collection
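
A minimal sketch of the document-frequency and idf computation in Python, assuming each document is a normalised string:

import math

def inverse_document_frequency(term, documents):
    # df_t: number of documents in the collection that contain the term.
    df = sum(1 for doc in documents if term in doc.split())
    n = len(documents)                     # N: total number of documents
    return math.log(n / df) if df else 0.0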


Tf-idf weighting:
Tf-idf weighting combines the term frequency and inverse document frequency measures into a composite weight for each term in each document (a sketch in Python follows the list below):
tf-idf_{t,d} = tf_{t,d} · idf_t
In other words, tf-idf_{t,d} assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending
high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents
(thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
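
Combining the two measures, a minimal tf-idf sketch over a small corpus might look like the following (all names are illustrative; production tools such as scikit-learn's TfidfVectorizer add smoothing and normalisation on top of this):

import math
from collections import Counter

def tf_idf(documents):
    # tf-idf_{t,d} = tf_{t,d} * idf_t, with idf_t = log(N / df_t).
    n = len(documents)
    bags = [Counter(doc.split()) for doc in documents]
    df = Counter(term for bag in bags for term in bag)   # df_t for each term
    return [{term: tf * math.log(n / df[term]) for term, tf in bag.items()}
            for bag in bags]

docs = ["the auto industry", "auto sales rise", "new auto models arrive"]
print(tf_idf(docs)[0])
# 'auto' occurs in every document, so idf = log(3/3) = 0 and its tf-idf weight is 0.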

References
Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
