Book 2

This section discusses the extraction of features from documents for content-based recommendation systems, highlighting the use of TF.IDF scores to identify key words that characterize a document. It also addresses the challenges of tagging images and the importance of user-generated tags for feature discovery. Finally, it outlines the representation of item and user profiles using vectors to facilitate recommendations based on both discrete and numerical features.

Uploaded by

brainx Magic

9.2. CONTENT-BASED RECOMMENDATIONS 325

9.2.2 Discovering Features of Documents

There are other classes of items where it is not immediately apparent what the
values of features should be. We shall consider two of them: document
collections and images. Documents present special problems, and we shall discuss
the technology for extracting features from documents in this section. Images
will be discussed in Section 9.2.3 as an important example where user-supplied
features have some hope of success.

There are many kinds of documents for which a recommendation system can
be useful. For example, there are many news articles published each day, and
we cannot read all of them. A recommendation system can suggest articles on
topics a user is interested in, but how can we distinguish among topics? Web
pages are also a collection of documents. Can we suggest pages a user might
want to see? Likewise, blogs could be recommended to interested users, if we
could classify blogs by topics.

Unfortunately, these classes of documents do not tend to have readily
available information giving features. A substitute that has been useful in practice is
the identification of words that characterize the topic of a document. How we do
the identification was outlined in Section 1.3.1. First, eliminate stop words –
the several hundred most common words, which tend to say little about the
topic of a document. For the remaining words, compute the TF.IDF score for
each word in the document. The ones with the highest scores are the words
that characterize the document.
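
As a rough illustration of the two steps just described, the following Python sketch removes stop words and scores the remaining words. It assumes the common formulation in which term frequency is a word's count divided by the count of the most frequent word in the same document, and IDF is log2(N/n) for a word appearing in n of N documents. The tiny `STOP_WORDS` set, the `tokenize` helper, and the sample sentences are hypothetical and only serve the example.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; real lists hold several hundred words.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf(docs):
    """Return one {word: TF.IDF score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        max_count = max(counts.values(), default=1)
        scores.append({
            # Assumed definition: (count / max count) * log2(N / n).
            w: (c / max_count) * math.log2(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores


docs = [
    "The senator spoke in Ohio on the budget.",
    "The budget vote is delayed.",
    "A storm hit the coast of Ohio.",
]
scores = tf_idf(docs)
# Words unique to one document score higher than words shared by two.
print(sorted(scores[0].items(), key=lambda kv: (-kv[1], kv[0])))
```

In this toy collection, "senator" and "spoke" appear in only one document and so outrank "ohio" and "budget", which each appear in two.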
We may then take as the features of a document the n words with the highest
TF.IDF scores. It is possible to pick n to be the same for all documents, or to
let n be a fixed percentage of the words in the document. We could also choose
to include in the feature set every word whose TF.IDF score is above a given
threshold.
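
The three ways of choosing the feature set can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names are hypothetical, the score dictionary is made-up sample data, and "a fixed percentage of the words" is interpreted here as a fraction of the document's distinct scored words.

```python
import math


def top_n(scores, n):
    """The n words with the highest scores (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:n])


def top_fraction(scores, fraction):
    """A fixed fraction of the document's distinct words, at least one."""
    n = max(1, math.ceil(fraction * len(scores)))
    return top_n(scores, n)


def above_threshold(scores, threshold):
    """Every word whose score is strictly above the threshold."""
    return {w for w, s in scores.items() if s > threshold}


# Made-up TF.IDF scores for one document.
scores = {"senator": 1.58, "spoke": 1.58, "ohio": 0.58, "budget": 0.58}

print(top_n(scores, 2))
print(top_fraction(scores, 0.25))
print(above_threshold(scores, 1.0))
```

A fixed n gives every document a feature set of the same size, while the threshold variant lets the size vary with how many distinctive words a document has.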
Now, documents are represented by sets of words. Intuitively, we expect
these words to express the subjects or main ideas of the document. For example,
in a news article, we would expect the words with the highest TF.IDF score to
include the names of people discussed in the article, unusual properties of the
event described, and the location of the event. To measure the similarity of two
documents, there are several natural distance measures we can use:

1. We could use the Jaccard distance between the sets of words (recall Sec-
tion 3.5.3).

2. We could use the cosine distance (recall Section 3.5.4) between the sets,
treated as vectors.
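
For option (1), a minimal sketch of the Jaccard distance between two keyword sets is shown here. The function name and the sample word sets are hypothetical; returning 0 for two empty sets is a convention chosen for this example.

```python
def jaccard_distance(a, b):
    """1 minus (size of intersection / size of union) of two word sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


doc1 = {"senator", "ohio", "budget"}
doc2 = {"storm", "ohio", "coast"}
# One shared word out of five distinct words overall.
print(jaccard_distance(doc1, doc2))
```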

To compute the cosine distance in option (2), think of the sets of high-
TF.IDF words as a vector, with one component for each possible word. The
vector has 1 if the word is in the set and 0 if not. Since between two docu-
ments there are only a finite number of words among their two sets, the infinite
dimensionality of the vectors is unimportant. Almost all components are 0 in
both, and 0’s do not impact the value of the dot product. To be precise, the dot
product is the size of the intersection of the two sets of words, and the lengths
of the vectors are the square roots of the numbers of words in each set. That
calculation lets us compute the cosine of the angle between the vectors as the
dot product divided by the product of the vector lengths.

326 CHAPTER 9. RECOMMENDATION SYSTEMS

Two Kinds of Document Similarity

Recall that in Section 3.4 we gave a method for finding documents that
were “similar,” using shingling, minhashing, and LSH. There, the notion
of similarity was lexical – documents are similar if they contain large,
identical sequences of characters. For recommendation systems, the notion
of similarity is different. We are interested only in the occurrences of many
important words in both documents, even if there is little lexical similarity
between the documents. However, the methodology for finding similar
documents remains almost the same. Once we have a distance measure,
either Jaccard or cosine, we can use minhashing (for Jaccard) or random
hyperplanes (for cosine distance; see Section 3.7.2) feeding data to an LSH
algorithm to find the pairs of documents that are similar in the sense of
sharing many common keywords.
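
For option (2), the cosine computation on word sets described above reduces to set arithmetic: the dot product is the intersection size, and each vector length is the square root of the set size. The sketch below illustrates that. The function names and sample sets are hypothetical, and reporting the angle in degrees as the distance is an assumption following a common convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between the 0/1 vectors of two word sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumed convention when a set is empty
    dot = len(a & b)  # 1*1 only where both sets contain the word
    return dot / (math.sqrt(len(a)) * math.sqrt(len(b)))


def cosine_distance_degrees(a, b):
    """Angle between the two vectors, in degrees."""
    cos = min(1.0, cosine_similarity(a, b))  # guard against rounding above 1
    return math.degrees(math.acos(cos))


doc1 = {"senator", "ohio", "budget", "vote"}
doc2 = {"ohio", "budget", "storm", "coast"}
# Two shared words, four words in each set: 2 / (2 * 2) = 0.5.
print(cosine_similarity(doc1, doc2))
print(cosine_distance_degrees(doc1, doc2))
```

Only the words actually present in either set are ever touched, which is why the unbounded dimensionality of the vectors does not matter.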

9.2.3 Obtaining Item Features From Tags

Let us consider a database of images as an example of a way that features have
been obtained for items. The problem with images is that their data, typically
an array of pixels, does not tell us anything useful about their features. We can
calculate simple properties of pixels, such as the average amount of red in the
picture, but few users are looking for red pictures or especially like red pictures.

There have been a number of attempts to obtain information about features
of items by inviting users to tag the items by entering words or phrases that
describe the item. Thus, one picture with a lot of red might be tagged
“Tiananmen Square,” while another is tagged “sunset at Malibu.” The distinction is
not something that could be discovered by existing image-analysis programs.

Almost any kind of data can have its features described by tags. One of
the earliest attempts to tag massive amounts of data was the site del.icio.us,
later bought by Yahoo!, which invited users to tag Web pages. The goal of this
tagging was to make a new method of search available, where users entered a
set of tags as their search query, and the system retrieved the Web pages that
had been tagged that way. However, it is also possible to use the tags as a
recommendation system. If it is observed that a user retrieves or bookmarks
many pages with a certain set of tags, then we can recommend other pages with
the same tags.
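
One simple way to turn that observation into a recommender is sketched below. It is only an illustration under assumptions: the scoring rule (sum, over a page's tags, of how often the user's bookmarks carry each tag) is one plausible choice among many, and the function name, page identifiers, and tags are hypothetical.

```python
from collections import Counter


def recommend_by_tags(page_tags, bookmarked, k=3):
    """Rank unseen pages by how strongly their tags match the user's bookmarks."""
    # How often each tag occurs among the pages the user bookmarked.
    tag_counts = Counter(t for p in bookmarked for t in page_tags.get(p, ()))
    candidates = []
    for page, tags in page_tags.items():
        if page in bookmarked:
            continue
        score = sum(tag_counts[t] for t in tags)  # unseen tags count as 0
        if score > 0:
            candidates.append((score, page))
    # Highest score first; page id breaks ties deterministically.
    candidates.sort(key=lambda sp: (-sp[0], sp[1]))
    return [page for _, page in candidates[:k]]


page_tags = {
    "p1": {"python", "tutorial"},
    "p2": {"python", "data"},
    "p3": {"cooking"},
    "p4": {"python", "tutorial", "data"},
    "p5": {"data"},
}
print(recommend_by_tags(page_tags, bookmarked={"p1", "p2"}))
```

Pages sharing no tag with the user's bookmarks, such as "p3" here, are never recommended.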
The problem with tagging as an approach to feature discovery is that the
process only works if users are willing to take the trouble to create the tags, and
there are enough tags that occasional erroneous ones will not bias the system
too much.

Tags from Computer Games

An interesting direction for encouraging tagging is the “games” approach
pioneered by Luis von Ahn. He enabled two players to collaborate on the
tag for an image. In rounds, they would suggest a tag, and the tags would
be exchanged. If they agreed, then they “won,” and if not, they would
play another round with the same image, trying to agree simultaneously
on a tag. While an innovative direction to try, it is questionable whether
sufficient public interest can be generated to produce enough free work to
satisfy the needs for tagged data.

9.2.4 Representing Item Profiles

Our ultimate goal for content-based recommendation is to create both an item
profile consisting of feature-value pairs and a user profile summarizing the
preferences of the user, based on their row of the utility matrix. In Section 9.2.2
we suggested how an item profile could be constructed. We imagined a vector
of 0’s and 1’s, where a 1 represented the occurrence of a high-TF.IDF word
in the document. Since features for documents were all words, it was easy to
represent profiles this way.

We shall try to generalize this vector approach to all sorts of features. It is
easy to do so for features that are sets of discrete values. For example, if one
feature of movies is the set of actors, then imagine that there is a component
for each actor, with 1 if the actor is in the movie, and 0 if not. Likewise, we
can have a component for each possible director, and each possible genre. All
these features can be represented using only 0’s and 1’s.
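
The Boolean representation of discrete features can be sketched as follows. The movie records, feature names, and helper functions are hypothetical; the only idea taken from the text is one vector component per possible actor, director, and genre.

```python
def build_vocabulary(movies):
    """One component per distinct (feature, value) pair, in sorted order."""
    pairs = set()
    for record in movies.values():
        for feature, values in record.items():
            pairs.update((feature, v) for v in values)
    return sorted(pairs)


def item_vector(record, vocabulary):
    """1 where the movie has that actor/director/genre, 0 elsewhere."""
    return [1 if value in record.get(feature, ()) else 0
            for feature, value in vocabulary]


# Made-up item data for illustration only.
movies = {
    "Movie A": {"actor": {"Actor 1", "Actor 2"},
                "director": {"Director X"},
                "genre": {"drama"}},
    "Movie B": {"actor": {"Actor 2"},
                "director": {"Director Y"},
                "genre": {"drama", "comedy"}},
}
vocab = build_vocabulary(movies)
for title, record in movies.items():
    print(title, item_vector(record, vocab))
```

Because the vocabulary is built once and sorted, every item vector has the same length and the same component order, so two items can be compared component by component.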
There is another class of features that is not readily represented by Boolean
vectors: those features that are numerical. For instance, we might take the
average rating for movies to be a feature,[2] and this average is a real number.
It does not make sense to have one component for each of the possible average
ratings, and doing so would cause us to lose the structure implicit in numbers.
That is, two ratings that are close but not identical should be considered more
similar than widely differing ratings. Likewise, numerical features of products,
such as screen size or disk capacity for PC’s, should be considered similar if
their values do not differ greatly.

Numerical features should be represented by single components of vectors
representing items. These components hold the exact value of that feature.

[2] The rating is not a very reliable feature, but it will serve as an example.
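
The loss of structure mentioned above can be seen in a small experiment. This sketch contrasts a one-component-per-value encoding of average ratings with a single numerical component. The helper names and the three sample ratings are hypothetical, and counting differing components is just one convenient way to compare Boolean vectors.

```python
def one_hot(value, possible_values):
    """One Boolean component per possible value (the encoding to avoid)."""
    return [1 if value == p else 0 for p in possible_values]


def differing_components(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))


ratings = [1.0, 3.9, 4.0]  # made-up average ratings

# With one component per rating, every pair of distinct ratings looks
# equally far apart, so 3.9 vs 4.0 is no closer than 1.0 vs 4.0.
close_pair = differing_components(one_hot(3.9, ratings), one_hot(4.0, ratings))
far_pair = differing_components(one_hot(1.0, ratings), one_hot(4.0, ratings))
print(close_pair, far_pair)

# With a single numerical component, closeness is preserved.
print(abs(3.9 - 4.0) < abs(1.0 - 4.0))
```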
