Book 2

This section discusses the extraction of features from documents for content-based recommendation systems, highlighting the use of TF.IDF scores to identify key words that characterize a document. It also addresses the challenges of tagging images and the importance of user-generated tags for feature discovery. Finally, it outlines the representation of item and user profiles using vectors to facilitate recommendations based on both discrete and numerical features.

Uploaded by

brainx Magic

9.2. CONTENT-BASED RECOMMENDATIONS

9.2.2 Discovering Features of Documents


There are other classes of items where it is not immediately apparent what the
values of features should be. We shall consider two of them: document collec-
tions and images. Documents present special problems, and we shall discuss
the technology for extracting features from documents in this section. Images
will be discussed in Section 9.2.3 as an important example where user-supplied
features have some hope of success.
There are many kinds of documents for which a recommendation system can
be useful. For example, there are many news articles published each day, and
we cannot read all of them. A recommendation system can suggest articles on
topics a user is interested in, but how can we distinguish among topics? Web
pages are also a collection of documents. Can we suggest pages a user might
want to see? Likewise, blogs could be recommended to interested users, if we
could classify blogs by topics.
Unfortunately, these classes of documents do not tend to have readily
available information giving features. A substitute that has been useful in
practice is the identification of words that characterize the topic of a
document. The procedure was outlined in Section 1.3.1. First, eliminate stop
words: the several hundred most common words, which tend to say little about
the topic of a document. For the remaining words, compute the TF.IDF score for
each word in the document. The ones with the highest scores are the words
that characterize the document.
We may then take as the features of a document the n words with the highest
TF.IDF scores. It is possible to pick n to be the same for all documents, or to
let n be a fixed percentage of the words in the document. We could also choose
to include in the feature set all words whose TF.IDF scores are above a given
threshold.
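
The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical stand-in for a real one, the tokenizer is deliberately crude, and TF.IDF is taken to be the term count normalized by the largest count in the document, multiplied by log2 of the number of documents divided by the number of documents containing the word. All function names are made up for the example.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; a real one has several hundred entries.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_scores(docs):
    """Return one {word: TF.IDF} dict per document.

    Assumes TF = count / (largest count in the document) and
    IDF = log2(N / number of documents containing the word).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(w for words in tokenized for w in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        max_count = max(counts.values(), default=1)
        scores.append({
            w: (c / max_count) * math.log2(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores

def top_n_features(score, n):
    """The n words with the highest scores (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w for w, _ in ranked[:n]}

def features_above(score, threshold):
    """All words whose score exceeds the threshold."""
    return {w for w, s in score.items() if s > threshold}

docs = [
    "The senator spoke in Boston on the budget",
    "The budget vote is delayed",
    "A storm hit Boston and the coast",
]
scores = tfidf_scores(docs)
print(top_n_features(scores[0], 2))
print(features_above(scores[0], 1.0))
```

In this toy collection "boston" and "budget" each occur in two of the three documents, so they score lower than "senator" and "spoke", which occur in only one.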
Now, documents are represented by sets of words. Intuitively, we expect
these words to express the subjects or main ideas of the document. For example,
in a news article, we would expect the words with the highest TF.IDF score to
include the names of people discussed in the article, unusual properties of the
event described, and the location of the event. To measure the similarity of two
documents, there are several natural distance measures we can use:

1. We could use the Jaccard distance between the sets of words (recall Sec-
tion 3.5.3).

2. We could use the cosine distance (recall Section 3.5.4) between the sets,
treated as vectors.
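
As a small illustration of option (1), the Jaccard distance between two sets of characteristic words can be computed directly with Python sets. The function name and the sample word sets are hypothetical, and returning 0 for two empty sets is a convention chosen here, not something the text prescribes.

```python
def jaccard_distance(a, b):
    """1 - |a intersect b| / |a union b|; taken to be 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

doc1 = {"senator", "budget", "boston", "vote"}
doc2 = {"budget", "vote", "delayed"}

# Two shared words out of five distinct words, so the distance is 1 - 2/5.
print(jaccard_distance(doc1, doc2))
```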

To compute the cosine distance in option (2), think of the sets of high-TF.IDF
words as a vector, with one component for each possible word. The vector has 1
if the word is in the set and 0 if not. Since between two documents there are
only a finite number of words among their two sets, the infinite dimensionality
of the vectors is unimportant. Almost all components are 0 in both, and 0’s do
not impact the value of the dot product. To be precise, the dot product is the
size of the intersection of the two sets of words, and the lengths of the
vectors are the square roots of the numbers of words in each set. That
calculation lets us compute the cosine of the angle between the vectors as the
dot product divided by the product of the vector lengths.

Two Kinds of Document Similarity


Recall that in Section 3.4 we gave a method for finding documents that
were “similar,” using shingling, minhashing, and LSH. There, the notion
of similarity was lexical – documents are similar if they contain large,
identical sequences of characters. For recommendation systems, the notion
of similarity is different. We are interested only in the occurrences of many
important words in both documents, even if there is little lexical similarity
between the documents. However, the methodology for finding similar
documents remains almost the same. Once we have a distance measure,
either Jaccard or cosine, we can use minhashing (for Jaccard) or random
hyperplanes (for cosine distance; see Section 3.7.2) feeding data to an LSH
algorithm to find the pairs of documents that are similar in the sense of
sharing many common keywords.
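
The calculation just described needs no explicit vectors at all, as the following sketch shows. It assumes that the cosine distance is reported as the angle itself, in degrees; the function names are hypothetical, and treating an empty word set as having cosine 0 with everything is a convention chosen for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between the 0/1 vectors of word sets a and b."""
    if not a or not b:
        return 0.0  # assumption: an empty set is treated as orthogonal to everything
    dot = len(a & b)                                 # size of the intersection
    lengths = math.sqrt(len(a)) * math.sqrt(len(b))  # product of vector lengths
    return dot / lengths

def cosine_distance(a, b):
    """Angle between the vectors in degrees (0 = identical sets, 90 = disjoint)."""
    cos = min(1.0, cosine_similarity(a, b))  # guard against rounding slightly above 1
    return math.degrees(math.acos(cos))

a = {"senator", "budget", "boston", "vote"}
b = {"budget", "vote", "delayed", "storm"}

# Dot product 2, lengths sqrt(4) each: cosine 2/4 = 0.5, an angle of about 60 degrees.
print(cosine_similarity(a, b), cosine_distance(a, b))
```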

9.2.3 Obtaining Item Features From Tags


Let us consider a database of images as an example of a way that features have
been obtained for items. The problem with images is that their data, typically
an array of pixels, does not tell us anything useful about their features. We can
calculate simple properties of pixels, such as the average amount of red in the
picture, but few users are looking for red pictures or especially like red pictures.
There have been a number of attempts to obtain information about features
of items by inviting users to tag the items by entering words or phrases that
describe the item. Thus, one picture with a lot of red might be tagged “Tianan-
men Square,” while another is tagged “sunset at Malibu.” The distinction is
not something that could be discovered by existing image-analysis programs.
Almost any kind of data can have its features described by tags. One of
the earliest attempts to tag massive amounts of data was the site del.icio.us,
later bought by Yahoo!, which invited users to tag Web pages. The goal of this
tagging was to make a new method of search available, where users entered a
set of tags as their search query, and the system retrieved the Web pages that
had been tagged that way. However, it is also possible to use the tags as a
recommendation system. If it is observed that a user retrieves or bookmarks
many pages with a certain set of tags, then we can recommend other pages with
the same tags.
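
One simple way to turn this observation into a recommender is sketched below. It is only an illustration: the page identifiers and tags are invented, and scoring an unseen page by how often its tags occur among the user's bookmarked pages is an assumption made for the example, not a description of how del.icio.us worked.

```python
from collections import Counter

# Hypothetical data: page -> set of tags assigned by users.
page_tags = {
    "p1": {"python", "tutorial"},
    "p2": {"python", "data"},
    "p3": {"cooking", "pasta"},
    "p4": {"python", "tutorial", "data"},
    "p5": {"travel"},
}

def recommend_by_tags(bookmarked, page_tags, k=3):
    """Rank unseen pages by how often their tags occur among the bookmarks."""
    tag_counts = Counter(t for p in bookmarked for t in page_tags[p])
    scored = []
    for page, tags in page_tags.items():
        if page in bookmarked:
            continue  # do not recommend what the user already has
        score = sum(tag_counts[t] for t in tags)
        if score > 0:
            scored.append((score, page))
    scored.sort(key=lambda sp: (-sp[0], sp[1]))  # best score first, then by name
    return [page for _, page in scored[:k]]

# The user bookmarked p1 and p2; only p4 shares any tags with them.
print(recommend_by_tags({"p1", "p2"}, page_tags))
```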
The problem with tagging as an approach to feature discovery is that the
process only works if users are willing to take the trouble to create the tags, and
there are enough tags that occasional erroneous ones will not bias the system
too much.

Tags from Computer Games


An interesting direction for encouraging tagging is the “games” approach
pioneered by Luis von Ahn. He enabled two players to collaborate on the
tag for an image. In rounds, they would suggest a tag, and the tags would
be exchanged. If they agreed, then they “won,” and if not, they would
play another round with the same image, trying to agree simultaneously
on a tag. While an innovative direction to try, it is questionable whether
sufficient public interest can be generated to produce enough free work to
satisfy the needs for tagged data.

9.2.4 Representing Item Profiles


Our ultimate goal for content-based recommendation is to create both an item
profile consisting of feature-value pairs and a user profile summarizing the
preferences of the user, based on their row of the utility matrix. In Section 9.2.2
we suggested how an item profile could be constructed. We imagined a vector
of 0’s and 1’s, where a 1 represented the occurrence of a high-TF.IDF word
in the document. Since features for documents were all words, it was easy to
represent profiles this way.
We shall try to generalize this vector approach to all sorts of features. It is
easy to do so for features that are sets of discrete values. For example, if one
feature of movies is the set of actors, then imagine that there is a component
for each actor, with 1 if the actor is in the movie, and 0 if not. Likewise, we
can have a component for each possible director, and each possible genre. All
these features can be represented using only 0’s and 1’s.
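
A minimal sketch of such a Boolean item profile follows. The movie data, the feature names, and both helper functions are hypothetical; the only idea taken from the text is that every possible actor, director, and genre gets its own component, holding 1 if it applies to the movie and 0 if not.

```python
def build_vocabulary(items):
    """Assign one vector component to every (feature, value) pair that occurs."""
    pairs = sorted({(feature, value)
                    for item in items.values()
                    for feature, values in item.items()
                    for value in values})
    return {pair: index for index, pair in enumerate(pairs)}

def boolean_profile(item, vocab):
    """Vector of 0's and 1's with a 1 for each (feature, value) the item has."""
    vec = [0] * len(vocab)
    for feature, values in item.items():
        for value in values:
            vec[vocab[(feature, value)]] = 1
    return vec

# Hypothetical movies described by sets of discrete values.
movies = {
    "Movie A": {"actor": {"Actor X", "Actor Y"},
                "director": {"Director P"},
                "genre": {"drama"}},
    "Movie B": {"actor": {"Actor Y"},
                "director": {"Director Q"},
                "genre": {"drama", "comedy"}},
}

vocab = build_vocabulary(movies)
profiles = {name: boolean_profile(m, vocab) for name, m in movies.items()}
for name, vec in profiles.items():
    print(name, vec)
```

Components are ordered by sorting the (feature, value) pairs, so the two profiles are directly comparable position by position.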
There is another class of features that is not readily represented by Boolean
vectors: those features that are numerical. For instance, we might take the
average rating for movies to be a feature [2], and this average is a real number.
It does not make sense to have one component for each of the possible average
ratings, and doing so would cause us to lose the structure implicit in numbers.
That is, two ratings that are close but not identical should be considered more
similar than widely differing ratings. Likewise, numerical features of products,
such as screen size or disk capacity for PC’s, should be considered similar if
their values do not differ greatly.
Numerical features should be represented by single components of vectors
representing items. These components hold the exact value of that feature.

[2] The rating is not a very reliable feature, but it will serve as an example.
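
To illustrate why a numerical feature is kept as a single component, the sketch below appends an average rating to a short Boolean genre vector. The genre list, the ratings, and the use of Euclidean distance to compare items are all assumptions made for the example; the point is only that items with close ratings end up closer together than items with widely differing ratings.

```python
import math

ALL_GENRES = ["comedy", "drama"]  # hypothetical list of possible genres

def item_vector(genres, avg_rating, all_genres=ALL_GENRES):
    """Boolean genre components followed by one component holding the rating itself."""
    return [1.0 if g in genres else 0.0 for g in all_genres] + [float(avg_rating)]

def euclidean(u, v):
    """Euclidean distance, used here only as one convenient way to compare vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

a = item_vector({"drama"}, 4.1)
b = item_vector({"drama"}, 4.3)  # rating close to a's
c = item_vector({"drama"}, 1.5)  # rating far from a's

# The same genres, so only the rating component separates the items.
print(euclidean(a, b) < euclidean(a, c))
```

Had each possible rating been given its own 0/1 component instead, the ratings 4.1 and 4.3 would look exactly as different as 4.1 and 1.5.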
