
Problem Set 3: Document Distance

Pset Buddy
You do not have a buddy assigned for this pset.
Introduction
Objectives
Introducing the concept of dictionaries in Python
Writing and calling helper functions in Python
Collaboration
Students may work together, but each student should write up and hand in their assignment separately. Students may not submit the exact same code.
Students are not permitted to look at or copy each other's code or code structure.
Include the names of your collaborators in a comment at the start of each file.
Please refer to the collaboration policy in the Course Information for more details.
Although this handout is long, the information is here to provide you with context, useful examples, and hints, so be sure to read carefully.
A) File Setup
Download the file 1_ps3.zip and extract all files to the same directory. The files included are: document_distance.py, test_ps3_student.py, and
various documents of texts and lyrics within the tests/student_tests directory. When you are done, make sure you run the tester file
test_ps3_student.py to check your code against some of our test cases.
You will edit ONLY document_distance.py.
B) Document Distance Overview
Given two words or documents, you will calculate a score between 0 and 1 that will tell you how similar they are. If the words or documents are the
same, they will get a score of 1. If the documents are completely different, they will get a score of 0. You will calculate the score in two different ways
and observe whether one works better than the other. The first way will use single word frequencies in the two texts. The second will use the TF-IDF
(Term Frequency-Inverse Document Frequency) of words in a file.
Note that you do NOT need to worry about case sensitivity throughout this pset. All inputs will be lower case.

1) Text to List
The first step in any data analysis problem is prepping your data. We have provided a function called load_file that reads a text file, removes all punctuation, and returns the text in the file as a single string. This function takes in a variable called filename , which is a string of the filename you want to load, including the extension. Do not modify this function.
Here’s an example usage:
# hello_world.txt looks like this: 'hello world, hello'
>>> text = load_file("tests/student_tests/hello_world.txt")
>>> text
'hello world hello'

You will further prepare the text by taking the string and transforming it into a list representation of the text. Given the example from above, here is what
we expect:
>>> text_to_list('hello world hello')
['hello', 'world', 'hello']

Implement text_to_list in document_distance.py as per the given instructions and docstring. In addition to running the tester file, you can quickly
check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:
if __name__ == "__main__":
    # Tests Problem 0: Prep Data
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'hello_friends.txt')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    print(world)   # should print ['hello', 'world', 'hello']
    print(friend)  # should print ['hello', 'friends']

Note: You can assume that the only kinds of white space in the text documents we provide will be new lines or space(s) between words (i.e.
there are no tabs).
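If you are unsure where to start, here is a minimal sketch of one possible text_to_list. It relies on the fact that str.split() with no arguments splits on any run of whitespace (spaces or newlines), which matches the note above; the parameter name is illustrative, so follow the docstring in document_distance.py:

def text_to_list(input_text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces or newlines), returning a list of words.
    return input_text.split()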

2) Get Frequencies
Let’s start by calculating the frequency of each element in a given list. The goal is to return a dictionary with each unique element as the key, and the
number of times the element occurs in the list as the value.
Consider the following examples:
Example 1:
>>> get_frequencies(['h', 'e', 'l', 'l', 'o'])
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}

Implement get_frequencies in document_distance.py using the above instructions and the docstring provided. In addition to running the tester file,
you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:
if __name__ == "__main__":
    # Tests Problem 1: Get Frequencies
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'hello_friends.txt')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    print(world_word_freq)   # should print {'hello': 2, 'world': 1}
    print(friend_word_freq)  # should print {'hello': 1, 'friends': 1}
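For reference, a minimal sketch of one way to implement get_frequencies, using dict.get with a default value (collections.Counter is another common option):

def get_frequencies(elements):
    # Map each unique element to the number of times it occurs.
    freqs = {}
    for element in elements:
        freqs[element] = freqs.get(element, 0) + 1
    return freqs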

3) Letter Frequencies
Now, given a word in the form of a string, let's create a dictionary with each letter as the key and how many times each letter occurs in the word as the
value. That sounds very similar to get_frequencies ...
You must call get_frequencies in your get_letter_frequencies to get full credit.
Example 1:
>>> get_letter_frequencies('hello')
{'h': 1, 'e': 1, 'l': 2, 'o': 1}

Example 2:
>>> get_letter_frequencies('that')
{'t': 2, 'h': 1, 'a': 1}

Implement get_letter_frequencies in document_distance.py using the above instructions and the docstring provided. In addition to running the
tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the
bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 2: Get Letter Frequencies
    freq1 = get_letter_frequencies('hello')
    freq2 = get_letter_frequencies('that')
    print(freq1)  # should print {'h': 1, 'e': 1, 'l': 2, 'o': 1}
    print(freq2)  # should print {'t': 2, 'h': 1, 'a': 1}
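Because iterating over a string yields its characters, one possible get_letter_frequencies is just a thin wrapper, which also satisfies the requirement to call get_frequencies:

def get_letter_frequencies(word):
    # A string is a sequence of its letters, so get_frequencies
    # can count them directly.
    return get_frequencies(word)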

4) Similarity
Now it’s time to calculate similarity! Complete the function calculate_similarity_score based on the definition of similarity found in the next
paragraph. Your function should be able to be used with the outputs of get_frequencies or get_letter_frequencies .
Consider two lists L1 and L2. Let U be a list made up of all the elements in L1 or L2, but with no repeats (e.g. if L1 = ['a', 'b'] and L2 = ['b', 'c'], then U = ['a', 'b', 'c']). For an element e in L1 or L2, let

count(e, Li) = the number of times e appears in Li, or 0 if e is not in Li.

We can then define:

δ(e) = |count(e, L1) − count(e, L2)| (where the vertical bars denote absolute value), and
σ(e) = count(e, L1) + count(e, L2).

Similarity is defined as:

1 − (δ(u1) + δ(u2) + δ(u3) + ...) / (σ(u1) + σ(u2) + σ(u3) + ...)

where the sums are taken over all the elements u1, u2, u3, ... of U, and the result is rounded to two decimal places.

Example (where elements are words):

Suppose
L1 = ['hello', 'world', 'hello'], and
L2 = ['hello', 'friends'].

The list of unique elements is U = ['hello', 'world', 'friends'].

The frequency differences δ(u) are
δ('hello') = |2 − 1| = 1
δ('world') = |1 − 0| = 1
δ('friends') = |0 − 1| = 1

The frequency totals σ(u) are
σ('hello') = 2 + 1 = 3
σ('world') = 1 + 0 = 1
σ('friends') = 0 + 1 = 1

Thus, similarity is
1 − (1 + 1 + 1) / (3 + 1 + 1) = 1 − 3/5 = 0.4 (0.4 rounded to two decimal places is still 0.4).



The same calculation, with an alternate (but equivalent) explanation, can be found in calculate_similarity_score 's docstring.
IMPORTANT: Be sure to round your final similarity calculation to 2 decimal places.
Implement the function calculate_similarity_score in document_distance.py as per the given instructions and docstring. In addition to running the tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 3: Similarity
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'hello_friends.txt')
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    word1_freq = get_letter_frequencies('toes')
    word2_freq = get_letter_frequencies('that')
    word3_freq = get_frequencies('nah')
    word_similarity1 = calculate_similarity_score(word1_freq, word1_freq)
    word_similarity2 = calculate_similarity_score(word1_freq, word2_freq)
    word_similarity3 = calculate_similarity_score(word1_freq, word3_freq)
    word_similarity4 = calculate_similarity_score(world_word_freq, friend_word_freq)
    print(word_similarity1)  # should print 1.0
    print(word_similarity2)  # should print 0.25
    print(word_similarity3)  # should print 0.0
    print(word_similarity4)  # should print 0.4
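A minimal sketch of calculate_similarity_score following the definition above. It assumes at least one of the two dictionaries is non-empty, so the denominator is nonzero; the parameter names are illustrative:

def calculate_similarity_score(freq_dict1, freq_dict2):
    # U: every element that appears in either dictionary, with no repeats.
    all_elements = set(freq_dict1) | set(freq_dict2)
    delta_sum = 0  # sum of |count(e, L1) - count(e, L2)| over U
    sigma_sum = 0  # sum of count(e, L1) + count(e, L2) over U
    for e in all_elements:
        count1 = freq_dict1.get(e, 0)
        count2 = freq_dict2.get(e, 0)
        delta_sum += abs(count1 - count2)
        sigma_sum += count1 + count2
    # Round the final score to 2 decimal places, as required.
    return round(1 - delta_sum / sigma_sum, 2)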

5) Most Frequent Word(s)


Next, you will find out which word(s) occur most frequently across two dictionaries. You'll count how many times every word occurs combined across both texts and return a list of the most frequent word(s). The most frequent word does not need to appear in both dictionaries. It is based on the combined word frequencies across both dictionaries: if a word occurs in both dictionaries, its combined frequency is the sum of its frequencies in each. If multiple words are tied (i.e. have the same highest frequency), return an alphabetically ordered list of all these words.
For example, consider the following usage:
>>> freq1 = {"hello": 5, "world": 1}
>>> freq2 = {"hello": 1, "world": 5}
>>> get_most_frequent_words(freq1, freq2)
["hello", "world"]

Implement the function get_most_frequent_words in document_distance.py as per the given instructions and docstring. In addition to running the
tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the
bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 4: Most Frequent Word(s)
    freq_dict1, freq_dict2 = {"hello": 5, "world": 1}, {"hello": 1, "world": 5}
    most_frequent = get_most_frequent_words(freq_dict1, freq_dict2)
    print(most_frequent)  # should print ["hello", "world"]
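A minimal sketch of one possible get_most_frequent_words: merge the two dictionaries by summing per-word counts, then collect every word tied for the maximum:

def get_most_frequent_words(freq_dict1, freq_dict2):
    # Combine the two dictionaries, summing frequencies for shared words.
    combined = dict(freq_dict1)
    for word, count in freq_dict2.items():
        combined[word] = combined.get(word, 0) + count
    highest = max(combined.values())
    # Return every word tied for the highest combined frequency,
    # in alphabetical order.
    return sorted(word for word, count in combined.items() if count == highest)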

6) Term Frequency–Inverse Document Frequency (TF-IDF)


In this part, you will calculate the Term Frequency–Inverse Document Frequency, which is a numerical measure that signifies the importance of word(s) in a document. You will do so by first calculating the term frequency and inverse document frequency, then combining the two to get the TF-IDF.
The term frequency (TF) is calculated as:

TF(w) = (number of times word w appears in the document) / (total number of words in the document).

The inverse document frequency (IDF) is calculated as:

IDF(w) = log10( (total number of documents) / (number of documents with word w in it) )

where log10 is log base 10 and can be called with math.log10 .


We can then combine TF and IDF to form TF-IDF(w) = TF(w) × IDF(w), where the higher the value, the rarer the term and vice versa. For this
pset, we'll only be working with individual words, but TF-IDF works for larger groupings of words as well (e.g. bigrams, trigrams, etc.).
For the get_tf function that you'll implement, you'll be given a file name stored in a variable named text_file . You will need to load the file, prep the
data, and determine the TF value of each word that appears in text_file . The output should be a dictionary mapping each word to its TF. Think
about how you could re-use previous functions.
For the get_idf function that you'll implement, you'll be given a list of text files stored in a variable named text_files . You will need to load each of
the files, prep the data, and determine the IDF values of all words that appear in any of the documents in text_files . The output should be a
dictionary mapping each word to its IDF.
For the get_tfidf function that you'll implement, you'll be given a file name text_file and a list of file names text_files . You will need to load
the file, prep the data, and determine the TF-IDF of all words in text_file . The output should be a sorted list of tuples (in increasing TF-IDF score),
where each tuple is of the form (word, TF-IDF) . In case of words with the same TF-IDF, the words should be sorted in increasing alphabetical order.
For example,
>>> text_file = "tests/student_tests/hello_world.txt"
>>> get_tf(text_file)
{"hello": 0.6666666666666666, "world": 0.3333333333333333}
# Explanation: There are 3 total words in "hello_world.txt". 2 of the 3 words are "hello", giving "hello" a TF of 2/3, and 1 is "world", giving "world" a TF of 1/3.

>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_idf(text_files)
{"hello": 0.0, "world": 0.3010299956639812, "friends": 0.3010299956639812}
# Explanation: There are a total of 2 documents in this example. "hello" is in both documents, giving "hello" an IDF of log10(2/2) = 0.0. "world" and "friends" each appear in only 1 document, giving each an IDF of log10(2/1) ≈ 0.301.

>>> text_file = "tests/student_tests/hello_world.txt"
>>> text_files = ["tests/student_tests/hello_world.txt", "tests/student_tests/hello_friends.txt"]
>>> get_tfidf(text_file, text_files)
[('hello', 0.0), ('world', 0.10034333188799373)]
# Explanation: We multiply the corresponding TF and IDF values for each word in "hello_world.txt" and get the list above, sorted in increasing order of TF-IDF score.

Implement the functions get_tf , get_idf , and get_tfidf in document_distance.py as per the given instructions. In addition to running the tester
file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom
of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 5: Find TF-IDF
    tf_text_file = 'tests/student_tests/hello_world.txt'
    idf_text_files = ['tests/student_tests/hello_world.txt', 'tests/student_tests/hello_friends.txt']
    tf = get_tf(tf_text_file)
    idf = get_idf(idf_text_files)
    tf_idf = get_tfidf(tf_text_file, idf_text_files)
    print(tf)      # should print {'hello': 0.6666666666666666, 'world': 0.3333333333333333}
    print(idf)     # should print {'hello': 0.0, 'world': 0.3010299956639812, 'friends': 0.3010299956639812}
    print(tf_idf)  # should print [('hello', 0.0), ('world', 0.10034333188799373)]
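Minimal sketches of the three functions, reusing load_file, text_to_list, and get_frequencies from the earlier problems. This is one possible approach under the formulas above, not the only correct implementation:

import math

def get_tf(text_file):
    # TF(w) = (occurrences of w in the document) / (total words in the document).
    words = text_to_list(load_file(text_file))
    total = len(words)
    return {word: count / total for word, count in get_frequencies(words).items()}

def get_idf(text_files):
    # IDF(w) = log10(total documents / documents containing w).
    num_docs = len(text_files)
    doc_word_sets = [set(text_to_list(load_file(f))) for f in text_files]
    all_words = set().union(*doc_word_sets)
    return {word: math.log10(num_docs / sum(word in doc for doc in doc_word_sets))
            for word in all_words}

def get_tfidf(text_file, text_files):
    # Multiply TF and IDF per word; sorting by (score, word) sorts by
    # increasing TF-IDF and breaks ties alphabetically, as required.
    tf = get_tf(text_file)
    idf = get_idf(text_files)
    return sorted(((word, tf[word] * idf[word]) for word in tf),
                  key=lambda pair: (pair[1], pair[0]))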

When you are done, make sure you run the tester file test_ps3_student.py to check your code against our test cases.
7) Hand-in Procedure
7.1) Naming Files
Save your solutions with the original file name: document_distance.py. Do not ignore this step or save your file with a different name!
7.2) Time and Collaboration Info
At the start of each file, in a comment, write down the number of hours (roughly) you spent on the problems in that part, and the names of your
collaborators. For example:
# Problem Set 3
# Name: Jane Lee
# Collaborators: John Doe

Please estimate the number of hours you spent on the Problem Set in the question box below.

7.3) Half-way Submission


All students should submit their progress by the half-way due date (1 week before the final due date).
This submission will be worth 1 point out of the problem set grade and will not be graded for correctness. The intention is to make sure that you are
making steady progress on the problem set as opposed to working on it in the final days before the due date.
You may upload new versions of each file until Oct 12 at 09:00PM. You cannot use extensions or late days on this submission.
Please refresh the page before submitting a new file. If you do not, your latest submission won't be updated.

7.4) Final Submission


Be sure to run the student tester and make sure all the tests pass. However, the student tester contains only a subset of the tests that will be run to
determine the problem set grade. Passing all of the provided test cases does not guarantee full credit on the pset.
You may upload new versions of each file until Nov 02 at 09:00PM, but anything uploaded after that time will be counted towards your late days, if you
have any remaining. If you have no remaining late days, you will receive no credit for a late submission.
When you upload a new file with the same name, your old one will be overwritten.
Please refresh the page before submitting a new file. If you do not, your latest submission won't be updated.

Supplemental Reading about Document Similarity


This pset is a greatly simplified version of a very pertinent problem in Information Retrieval. Applications of document similarity range from retrieving
search engine results to comparing genes and proteins to improving machine translation.
More advanced techniques for calculating document distance include transforming the text into a vector space and computing the cosine similarity, Jaccard Index, or some other metric of the vectors.
MIT OpenCourseWare
https://ocw.mit.edu

6.100L Introduction to CS and Programming Using Python


Fall 2022

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms
