Problem Set 3: Document Distance
Introduction
Objectives
Introducing the concept of dictionaries in Python
Writing and calling helper functions in Python
Collaboration
Students may work together, but each student should write up and hand in their assignment separately. Students may not submit the exact same
code.
Students are not permitted to look at or copy each other’s code or code structure.
Include the names of your collaborators in a comment at the start of each file.
Please refer to the collaboration policy in the Course Information for more details.
Although this handout is long, the information is here to provide you with context, useful examples, and hints, so be sure to read carefully.
A) File Setup
Download the file 1_ps3.zip and extract all files to the same directory. The files included are: document_distance.py, test_ps3_student.py, and
various documents of texts and lyrics within the tests/student_tests directory. When you are done, make sure you run the tester file
test_ps3_student.py to check your code against some of our test cases.
You will edit ONLY document_distance.py.
B) Document Distance Overview
Given two words or documents, you will calculate a score between 0 and 1 that will tell you how similar they are. If the words or documents are the
same, they will get a score of 1. If the documents are completely different, they will get a score of 0. You will calculate the score in two different ways
and observe whether one works better than the other. The first way will use single word frequencies in the two texts. The second will use the TF-IDF
(Term Frequency-Inverse Document Frequency) of words in a file.
Note that you do NOT need to worry about case sensitivity throughout this pset. All inputs will be lower case.
1) Text to List
The first step in any data analysis problem is prepping your data. We have provided a function called load_file to read a text file and output all the
text in the file into a string. This function takes in a variable called filename, which is a string of the filename you want to load, including the extension.
It removes all punctuation and saves the text as a string. Do not modify this function.
Here’s an example usage:
# hello_world.txt looks like this: 'hello world, hello'
>>> text = load_file("tests/student_tests/hello_world.txt")
>>> text
'hello world hello'
You will further prepare the text by taking the string and transforming it into a list representation of the text. Given the example from above, here is what
we expect:
>>> text_to_list('hello world hello')
['hello', 'world', 'hello']
Implement text_to_list in document_distance.py as per the given instructions and docstring. In addition to running the tester file, you can quickly
check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:
if __name__ == "__main__":
    # Tests Problem 0: Prep Data
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'h
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    print(world)   # should print ['hello', 'world', 'hello']
    print(friend)  # should print ['hello', 'friends']
Note: You can assume that the only kinds of white space in the text documents we provide will be new lines or space(s) between words (i.e.
there are no tabs).
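The whitespace note above matters because Python's built-in str.split, called with no arguments, splits on any run of whitespace, including newlines and repeated spaces. The following is a minimal sketch of that behavior only; the sample string is made up for illustration, and this is not necessarily how text_to_list is meant to be written.

```python
# Hypothetical sample text containing repeated spaces and a newline.
sample = "hello   world\nhello"

# With no arguments, str.split() splits on runs of whitespace
# and never produces empty strings.
words = sample.split()
print(words)  # ['hello', 'world', 'hello']

# Splitting on a single space keeps empty strings and leaves the newline in place.
naive = sample.split(" ")
print(naive)  # ['hello', '', '', 'world\nhello']
```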
2) Get Frequencies
Let’s start by calculating the frequency of each element in a given list. The goal is to return a dictionary with each unique element as the key, and the
number of times the element occurs in the list as the value.
Consider the following examples:
Example 1:
>>> get_frequencies(['h', 'e', 'l', 'l', 'o'])
{'h': 1, 'e': 1, 'l': 2, 'o': 1}
Example 2:
>>> get_frequencies(['hello', 'world', 'hello'])
{'hello': 2, 'world': 1}
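One common way to build such a dictionary is to walk the list once and increment a counter per element. The sketch below uses a hypothetical helper name, count_items, to show the pattern; it is one possible approach under that assumption, not the required solution.

```python
def count_items(items):
    """Hypothetical helper: map each unique element to how often it occurs."""
    counts = {}
    for item in items:
        # dict.get returns 0 the first time an element is seen.
        counts[item] = counts.get(item, 0) + 1
    return counts


print(count_items(['hello', 'world', 'hello']))  # {'hello': 2, 'world': 1}
```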
Implement get_frequencies in document_distance.py using the above instructions and the docstring provided. In addition to running the tester file,
you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom of
document_distance.py:
if __name__ == "__main__":
    # Tests Problem 1: Get Frequencies
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'h
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    print(world_word_freq)   # should print {'hello': 2, 'world': 1}
    print(friend_word_freq)  # should print {'hello': 1, 'friends': 1}
3) Letter Frequencies
Now, given a word in the form of a string, let's create a dictionary with each letter as the key and how many times each letter occurs in the word as the
value. That sounds very similar to get_frequencies...
You must call get_frequencies in your get_letter_frequencies to get full credit.
Example 1:
>>> get_letter_frequencies('hello')
{'h': 1, 'e': 1, 'l': 2, 'o': 1}
Example 2:
>>> get_letter_frequencies('that')
{'t': 2, 'h': 1, 'a': 1}
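The connection to get_frequencies is that a string can be turned into a list of its characters, which is the kind of input a list-based frequency counter expects. A small illustration of that conversion, using a made-up variable name:

```python
word = 'that'

# list() on a string yields one element per character, in order.
letters = list(word)
print(letters)  # ['t', 'h', 'a', 't']
```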
Implement get_letter_frequencies in document_distance.py using the above instructions and the docstring provided. In addition to running the
tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the
bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 2: Get Letter Frequencies
    freq1 = get_letter_frequencies('hello')
    freq2 = get_letter_frequencies('that')
    print(freq1)  # should print {'h': 1, 'e': 1, 'l': 2, 'o': 1}
    print(freq2)  # should print {'t': 2, 'h': 1, 'a': 1}
4) Similarity
Now it’s time to calculate similarity! Complete the function calculate_similarity_score based on the definition of similarity found in the next
paragraph. Your function should work with the outputs of either get_frequencies or get_letter_frequencies.
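Consider two lists L1 and L2. Let U be a list made up of all the elements in L1 or L2, but with no repeats (e.g. if L1 = ['a', 'b'], L2 =
['b', 'c'], then U = ['a', 'b', 'c']). For an element e in L1 or L2, let

$$\mathrm{count}(e, L_i) = \begin{cases} \text{number of times } e \text{ appears in } L_i & \text{if } e \in L_i, \\ 0 & \text{if } e \notin L_i. \end{cases}$$

Define the difference and the total for e as

$$\delta(e) = \left|\mathrm{count}(e, L_1) - \mathrm{count}(e, L_2)\right|, \qquad \sigma(e) = \mathrm{count}(e, L_1) + \mathrm{count}(e, L_2).$$

The similarity of L1 and L2 is then

$$\mathrm{similarity}(L_1, L_2) = 1 - \frac{\delta(u_1) + \delta(u_2) + \delta(u_3) + \cdots}{\sigma(u_1) + \sigma(u_2) + \sigma(u_3) + \cdots},$$

where the sums are taken over all the elements u1, u2, u3, ... of U, and the result is rounded to two decimal places.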
The same calculation with an alternate (but equivalent) explanation can be found in the calculate_similarity_score 's docstring.
IMPORTANT: Be sure to round your final similarity calculation to 2 decimal places.
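As a worked check of the arithmetic, the sketch below assumes the similarity is 1 minus (the sum of absolute count differences) divided by (the sum of all counts), taken over every element in either dictionary. That assumption reproduces the expected values listed for the provided examples. The function name similarity_sketch is hypothetical, and the sketch does not handle two empty dictionaries.

```python
def similarity_sketch(freq1, freq2):
    """Hypothetical sketch: 1 - (sum of |count differences|) / (sum of counts)."""
    # Every element that appears in either frequency dictionary.
    union = set(freq1) | set(freq2)
    # A missing element counts as 0 in that dictionary.
    diff = sum(abs(freq1.get(e, 0) - freq2.get(e, 0)) for e in union)
    total = sum(freq1.get(e, 0) + freq2.get(e, 0) for e in union)
    # Assumes at least one dictionary is non-empty, so total > 0.
    return round(1 - diff / total, 2)


toes = {'t': 1, 'o': 1, 'e': 1, 's': 1}
that = {'t': 2, 'h': 1, 'a': 1}
print(similarity_sketch(toes, toes))  # 1.0
print(similarity_sketch(toes, that))  # 0.25
```

For 'toes' and 'that', the differences sum to 6 and the totals sum to 8, so the score is 1 - 6/8 = 0.25.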
Implement the function calculate_similarity_score in document_distance.py with the given instruction and docstrings. In addition to running
the tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at
the bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 3: Similarity
    test_directory = "tests/student_tests/"
    hello_world, hello_friend = load_file(test_directory + 'hello_world.txt'), load_file(test_directory + 'h
    world, friend = text_to_list(hello_world), text_to_list(hello_friend)
    world_word_freq = get_frequencies(world)
    friend_word_freq = get_frequencies(friend)
    word1_freq = get_letter_frequencies('toes')
    word2_freq = get_letter_frequencies('that')
    word3_freq = get_frequencies('nah')
    word_similarity1 = calculate_similarity_score(word1_freq, word1_freq)
    word_similarity2 = calculate_similarity_score(word1_freq, word2_freq)
    word_similarity3 = calculate_similarity_score(word1_freq, word3_freq)
    word_similarity4 = calculate_similarity_score(world_word_freq, friend_word_freq)
    print(word_similarity1)  # should print 1.0
    print(word_similarity2)  # should print 0.25
    print(word_similarity3)  # should print 0.0
    print(word_similarity4)  # should print 0.4
5) Most Frequent Words
Implement the function get_most_frequent_words in document_distance.py as per the given instructions and docstring. In addition to running the
tester file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the
bottom of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 4: Most Frequent Word(s)
    freq_dict1, freq_dict2 = {"hello": 5, "world": 1}, {"hello": 1, "world": 5}
    most_frequent = get_most_frequent_words(freq_dict1, freq_dict2)
    print(most_frequent)  # should print ["hello", "world"]
6) TF-IDF
Implement the functions get_tf, get_idf, and get_tfidf in document_distance.py as per the given instructions. In addition to running the tester
file, you can quickly check your implementation on the provided examples for each problem by uncommenting the relevant lines of code at the bottom
of document_distance.py:
if __name__ == "__main__":
    # Tests Problem 5: Find TF-IDF
    tf_text_file = 'tests/student_tests/hello_world.txt'
    idf_text_files = ['tests/student_tests/hello_world.txt', 'tests/student_tests/hello_friends.txt']
    tf = get_tf(tf_text_file)
    idf = get_idf(idf_text_files)
    tf_idf = get_tfidf(tf_text_file, idf_text_files)
    print(tf)      # should print {'hello': 0.6666666666666666, 'world': 0.3333333333333333}
    print(idf)     # should print {'hello': 0.0, 'world': 0.3010299956639812, 'friends': 0.3010299956639812}
    print(tf_idf)  # should print [('hello', 0.0), ('world', 0.10034333188799373)]
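The expected values above are consistent with a term frequency equal to a word's count divided by the total number of words, and an inverse document frequency equal to log base 10 of (number of documents / number of documents containing the word). The sketch below works under that assumption, which is inferred from the printed numbers rather than taken from the docstrings. It operates on word lists instead of files, and the names tf_sketch and idf_sketch, as well as the tie-breaking rule in the sort, are hypothetical.

```python
import math


def tf_sketch(words):
    """Assumed TF: occurrences of a word divided by the total number of words."""
    total = len(words)
    return {w: words.count(w) / total for w in set(words)}


def idf_sketch(docs):
    """Assumed IDF: log10(number of docs / number of docs containing the word)."""
    vocab = {w for doc in docs for w in doc}
    return {
        w: math.log10(len(docs) / sum(1 for doc in docs if w in doc))
        for w in vocab
    }


doc_a = ['hello', 'world', 'hello']
doc_b = ['hello', 'friends']

tf = tf_sketch(doc_a)
idf = idf_sketch([doc_a, doc_b])

# Sort by score, then alphabetically (the tie-breaking rule is an assumption).
tfidf = sorted(((w, tf[w] * idf[w]) for w in tf), key=lambda pair: (pair[1], pair[0]))
print(tfidf)  # 'hello' scores 0.0 because it appears in both documents
```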
When you are done, make sure you run the tester file test_ps3_student.py to check your code against our test cases.
7) Hand-in Procedure
7.1) Naming Files
Save your solutions with the original file name: document_distance.py. Do not ignore this step or save your file with a different name!
7.2) Time and Collaboration Info
At the start of each file, in a comment, write down the number of hours (roughly) you spent on the problems in that part, and the names of your
collaborators. For example:
# Problem Set 3
# Name: Jane Lee
# Collaborators: John Doe
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms