ContentBasedFiltering (2)
Recommendation
Content-based?
• No rating matrix required
• Works from an item-feature matrix instead of user ratings
Item Representation
• Structured
• Unstructured
• Semi-structured
Structured
• Attribute - value
Unstructured
• Full-text
• No attributes formally defined
• Other complications, such as synonymy and polysemy
Semi-structured
• Structured + unstructured
• Well defined attributes/values + free text
Conversion of
Unstructured Data
• Need to convert to structured form
• IR techniques
• VSM, TF-IDF, stemming, etc.
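The conversion step above can be sketched as a minimal TF-IDF weighting in the vector space model. The corpus and tokenizer below are toy assumptions for illustration, not part of the slides:

```python
import math

# Toy corpus standing in for unstructured documents (assumed example data).
docs = [
    "fire in austrian train tunnel",
    "train schedule changes in austria",
    "alps hiking season opens",
]

def tokenize(text):
    # A real pipeline would also apply stemming and stop-word removal.
    return text.lower().split()

# Document frequency for each term.
df = {}
for doc in docs:
    for term in set(tokenize(doc)):
        df[term] = df.get(term, 0) + 1

def tfidf(doc, n_docs=len(docs)):
    """Map free text to a sparse term -> weight vector (VSM)."""
    tokens = tokenize(doc)
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    return {t: (cnt / len(tokens)) * math.log(n_docs / df[t])
            for t, cnt in tf.items()}

vec = tfidf(docs[0])
# "train" occurs in 2 of 3 documents, so it is weighted below "fire".
```

Terms common to many documents get small IDF weights, which is exactly why TF-IDF vectors make a better structured representation than raw term counts.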
User Profile
• A model of the user's preferences
• A history of the user's interactions
• Recently viewed items
• Already viewed items
• Training data — machine learning
User Profile
[Figure: decision-tree user profile on the contact-lens data, branching on Tear Rate (reduced/normal), Age (young/pre-presbyopic/presbyopic), Prescription (myope/hypermetrope), and Astigmatic (yes/no), with yes/no recommendations at the leaves]
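A decision-tree profile like this can be encoded directly as nested rules. The branch outcomes below are illustrative assumptions, not read off the slide's tree:

```python
# Hypothetical rule encoding of a decision-tree user profile.
# Branch outcomes are assumptions for illustration only.
def recommend(tear_rate, astigmatic, prescription, age):
    """Return 'yes'/'no' recommendation for a contact-lens-style profile."""
    if tear_rate == "reduced":
        return "no"                     # reduced tear rate: never recommend
    if not astigmatic:
        return "yes"                    # normal tears, no astigmatism
    if prescription == "myope":
        return "yes"
    # hypermetrope branch: outcome depends on age in this sketch
    return "yes" if age == "young" else "no"
```

Each path from the root to a leaf becomes one chain of conditions, which is why decision trees are popular as interpretable user profiles.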
User Profiling
• Collects information about an individual user
• User modeling side: train/learn a model, then classify/recommend
UM Learning
Feedback
• Implicit feedback
• Indirect interaction
• Opened documents, reading time, etc.
• Large amounts of data, but high uncertainty
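One simple way to turn such indirect interactions into usable preference scores is sketched below; the event fields and the reading-time cap are assumptions for illustration:

```python
# Sketch: deriving implicit-feedback scores from interaction logs.
# Event structure and weighting are illustrative assumptions.
events = [
    {"item": "doc1", "opened": True, "reading_secs": 120},
    {"item": "doc2", "opened": True, "reading_secs": 5},
    {"item": "doc3", "opened": False, "reading_secs": 0},
]

def implicit_score(event, max_secs=300):
    """Combine 'opened' and reading time into a noisy preference score in [0, 1]."""
    if not event["opened"]:
        return 0.0
    # Normalize reading time; long reads saturate at 1.0.
    return min(event["reading_secs"] / max_secs, 1.0)

scores = {e["item"]: implicit_score(e) for e in events}
```

The uncertainty mentioned on the slide shows up here: a short reading time may mean disinterest, or just a short document, so these scores are plentiful but noisy.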
UM Learning
Feedback
• Explicit feedback
• Little noise, but hard to obtain
User Model Learning
Feature Selection
• Problem of high-dimensional input vectors
• Overfitting (especially when the dataset is small)
• Common methods: document frequency thresholding, information gain, mutual information, chi-square statistic, term strength
Overfitting
[Figure: overfitted vs. underfitted decision boundaries]
User Model Learning
Feature Selection
• Mutual Information
• A = number of times t and c co-occur
B = number of times t occurs without c
C = number of times c occurs without t
N = total number of documents
• I(t, c) = log( (A × N) / ((A + C) × (A + B)) )
User Model Learning
Feature Selection
• Austrian train fire accident
• After learning 5 documents

         Fire   Train   Alps   Austria   People
  A         5       5      2         5        5
  B      5873    8092     93       974    34501
  C         0       0      3         0        0
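Since the slide does not give the corpus size N, the sketch below ranks the five terms by I(t, c) − log N = log(A / ((A + C)(A + B))); because log N is the same additive constant for every term, this produces the same ordering as the full mutual-information score for any fixed N:

```python
import math

# Counts from the table above (class c = "Austrian train fire accident").
counts = {
    #          A,     B, C
    "Fire":    (5,  5873, 0),
    "Train":   (5,  8092, 0),
    "Alps":    (2,    93, 3),
    "Austria": (5,   974, 0),
    "People":  (5, 34501, 0),
}

def mi_minus_log_n(A, B, C):
    """I(t, c) - log N = log(A / ((A + C) * (A + B))).
    The unknown corpus size N does not affect the ranking."""
    return math.log(A / ((A + C) * (A + B)))

ranked = sorted(counts, key=lambda t: mi_minus_log_n(*counts[t]), reverse=True)
# ranked -> ["Alps", "Austria", "Fire", "Train", "People"]
```

Note how the rare term "Alps" outranks the frequent "Fire" and "Train" even though it co-occurs with the class only twice, illustrating mutual information's known bias toward rare terms.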
Decision Tree
[Figure: decision tree branching on Age (young/pre-presbyopic/presbyopic), Prescription (myope/hypermetrope), and Astigmatic (yes/no)]
Decision Tree
Example - evaluation
k Nearest Neighbor
[Figure: kNN classification of a query point with k=3 vs. k=5]
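A minimal k-nearest-neighbor classifier can be sketched as below; the 2-D feature vectors and like/dislike labels are toy assumptions, not data from the slides:

```python
import math
from collections import Counter

# Toy 2-D feature vectors with preference labels (assumed example data).
train = [
    ((1.0, 1.0), "like"), ((1.2, 0.8), "like"), ((0.9, 1.1), "like"),
    ((4.0, 4.0), "dislike"), ((4.2, 3.9), "dislike"), ((3.8, 4.1), "dislike"),
]

def knn_predict(query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.5, 1.5), 3))   # prints "like"
```

Near the boundary between the two clusters, the prediction can flip between k=3 and k=5, which is exactly the sensitivity to k that the figure illustrates.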
Linear Classifier
[Figure: linear decision boundary separating training data]
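One concrete way to learn such a linear boundary is the perceptron; the sketch below uses the same assumed toy data as the kNN example and a fixed number of epochs:

```python
# Perceptron sketch: learns a linear boundary w.x + b = 0 from labeled data.
# Feature vectors and labels are toy assumptions for illustration.
train = [
    ((1.0, 1.0), 1), ((1.2, 0.8), 1), ((0.9, 1.1), 1),
    ((4.0, 4.0), -1), ((4.2, 3.9), -1), ((3.8, 4.1), -1),
]

w = [0.0, 0.0]
b = 0.0
for _ in range(20):                                  # fixed training epochs
    for (x1, x2), y in train:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:     # misclassified point
            w[0] += y * x1                           # nudge boundary toward it
            w[1] += y * x2
            b += y

def predict(x1, x2):
    """Classify by which side of the learned hyperplane the point falls on."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
```

On linearly separable data like this, the perceptron update rule is guaranteed to converge to a boundary that classifies every training point correctly.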