
Lecture Notes for Algorithms for Data Science

Jayesh Choudhuri

CS430 Algo for DS, Spring 2015
Instructor: Anirban Dasgupta

January 12

Nearest Neighbors

One of the fundamental problems in data mining is finding similar items. For example, given an image, find similar images in a dataset of images, or, looking at a collection of web pages, find near-duplicate pages. The basic method would be to perform a linear search (as sketched below), i.e.
1. In case of images: compare the query image with each image in the dataset,
2. In case of documents: take a string of the query document and find the similar string/document by going through all the documents in the dataset.
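
As a minimal sketch, the linear-search baseline in Python might look like the following; similarity here is a hypothetical function that returns a higher score for more similar items:

def linear_search_nearest(query, dataset, similarity):
    # Compare the query against every item and keep the most similar one.
    best_item, best_score = None, float("-inf")
    for item in dataset:
        score = similarity(query, item)
        if score > best_score:
            best_item, best_score = item, score
    return best_item

This costs one similarity computation per item in the dataset.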

Representation of dataset
1. Images:
pixel values
SIFT features
2. Documents:
string
vector
set

Similarity of Documents

Understanding the meaning of similarity is important. In this case we are trying to find the
character-level similarity and this does not requires us to examine the words in the documents
and their uses or semantic meaning. Finding documents that are exactly duplicate is easy and
can be done by comparing two documents character-by-character. In many cases the documents
are exactly identical but share a large portion of similar texts. Searching for such documents is
like finding near duplicates instead of exact duplicates. Some of the application of finding near
duplicates are Plagiarsm, Mirror pages, Articles from same source, etc. Generally documents are
normalised or pre-processed by removing the punctuations and by converting all the characters to
lower case.
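
A minimal sketch of this normalisation step in Python (the function name normalize is only for illustration):

import string

def normalize(text):
    # Convert to lower case and strip punctuation before shingling.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

print(normalize("We are having class, HERE!"))   # "we are having class here"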
Shingling:
One of the ways of representing documents is to represent them as sets. The elements of the set are called shingles. Given a positive integer k and a sequence of terms in the document d, the k-shingles of d are defined to be the set of all consecutive sequences of k terms in d.
For example, consider the following text:
We are having class here
Taking k = 5, the representation of the document as 5-shingles would be

{We ar, e are, are , are h, ..., here}


In the above case k was taken as a number of characters. One can also take k as a number of words, which results in a different representation. Taking k = 2 words in the above example we have:
{We are, are having, having class, class here}
Such a representation is known as a k-gram representation. If k = 1 it is known as a unigram, for k = 2 a bigram, for k = 3 a trigram, and so on. Word-level k-grams work well for English, where a space acts as a separator between two words, but in languages like Chinese there is no separator between words, and so character-level shingling can be used. A sketch of both representations is given below.
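
A minimal sketch of both representations in Python, using the example sentence above (the helper names char_shingles and word_kgrams are only for illustration):

def char_shingles(text, k):
    # All consecutive character sequences of length k, collected as a set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_kgrams(text, k):
    # All consecutive sequences of k words, collected as a set.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "We are having class here"
print(char_shingles(doc, 5))   # {'We ar', 'e are', ' are ', 'are h', ...}
print(word_kgrams(doc, 2))     # {'We are', 'are having', 'having class', 'class here'}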
Representation of documents as a vector:
Documents can also be represented as a vector, where each element of the vector can be a boolean, showing the presence of a term in the document, or an integer, showing the frequency of a term in the document. In the context of representing documents as vectors the following terms are defined:
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a weighting measure used in text mining and information retrieval. TF-IDF is a statistical measure showing the importance of a word in a document within a corpus. The TF-IDF weight of a word increases with the frequency of the word in a document but is offset by the frequency of the word in the corpus. TF-IDF weighting is used for scoring and ranking document relevance.
Term Frequency:
Each term in the document is assigned a weight. The weight depends on the number of times the word occurs in the document. One of the simplest ways to weight a word in a document is to assign it a weight equal to the frequency of the word in the document. This weighting scheme is known as term frequency and is denoted tf_{t,d}, where t is the term and d the document.
The term frequency weight gives quantitative information about the document. Such a representation of a document is known as the bag-of-words model: the order of terms is not considered, but the number of occurrences of each term is important (in contrast to the boolean representation). A sketch is given below.
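
A minimal sketch of the bag-of-words term-frequency representation in Python (Counter is from the standard library; the document is assumed to be already normalised):

from collections import Counter

def term_frequency(document):
    # tf_{t,d}: how many times each term t occurs in document d.
    # Word order is discarded; only the counts matter (bag of words).
    return Counter(document.split())

print(term_frequency("we are having class here and we are learning"))
# Counter({'we': 2, 'are': 2, 'having': 1, 'class': 1, 'here': 1, 'and': 1, 'learning': 1})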
Inverse Document frequency:
The term frequency weight suffers from a critical problem: all terms are treated as equally important, without taking into account how useful a term is for determining the relevance of a document within the corpus. Some terms have no power in determining relevance. Consider a corpus of documents from the auto industry: almost all the documents are likely to contain the word auto. So, for relevance determination it is necessary to attenuate the effect of terms that occur very frequently in the collection. In order to scale down the term frequency, a new measure is introduced, the document frequency df_t, which gives the number of documents in the collection that contain term t. Using the document frequency we define the inverse document frequency idf_t, given by
idf_t = log(N / df_t)

where N is the total number of documents in the collection
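
A minimal sketch of the document-frequency and idf computation in Python, assuming each document is a normalised string:

import math

def inverse_document_frequency(term, documents):
    # df_t: number of documents in the collection that contain the term.
    df = sum(1 for doc in documents if term in doc.split())
    n = len(documents)                     # N: total number of documents
    return math.log(n / df) if df else 0.0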


Tf-idf weighting:
Tf-idf weighting combines the term frequency and inverse document frequency measures into a composite weight for each term in each document (a sketch in Python follows the list below):
tf-idf_{t,d} = tf_{t,d} · idf_t
In other words, tf-idf_{t,d} assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending
high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents
(thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
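
Combining the two measures, a minimal tf-idf sketch over a small corpus might look like the following (all names are illustrative; production tools such as scikit-learn's TfidfVectorizer add smoothing and normalisation on top of this):

import math
from collections import Counter

def tf_idf(documents):
    # tf-idf_{t,d} = tf_{t,d} * idf_t, with idf_t = log(N / df_t).
    n = len(documents)
    bags = [Counter(doc.split()) for doc in documents]
    df = Counter(term for bag in bags for term in bag)   # df_t for each term
    return [{term: tf * math.log(n / df[term]) for term, tf in bag.items()}
            for bag in bags]

docs = ["the auto industry", "auto sales rise", "new auto models arrive"]
print(tf_idf(docs)[0])
# 'auto' occurs in every document, so idf = log(3/3) = 0 and its tf-idf weight is 0.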

References
Mining of Massive Datasets - Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
