Book 2
1. We could use the Jaccard distance between the sets of words (recall
Section 3.5.3).
2. We could use the cosine distance (recall Section 3.5.4) between the sets,
treated as vectors.
To compute the cosine distance in option (2), think of each set of high-TF.IDF
words as a vector, with one component for each possible word. A component is
1 if the word is in the set and 0 if not. Since the two documents have only a
finite number of words between their two sets, the infinite dimensionality of
the vectors is unimportant. Almost all components are 0 in
both, and 0’s do not impact the value of the dot product. To be precise, the dot
product is the size of the intersection of the two sets of words, and the lengths
of the vectors are the square roots of the numbers of words in each set. That
calculation lets us compute the cosine of the angle between the vectors as the
dot product divided by the product of the vector lengths.
process only works if users are willing to take the trouble to create the tags, and
there are enough tags that occasional erroneous ones will not bias the system
too much.