Recommendation Chapter2

The document provides an introduction to building content-based recommendation engines using Python, focusing on item attributes, vectorization, and Jaccard similarity for calculating item distances. It also discusses text-based similarities using TF-IDF and cosine similarity to recommend books based on user profiles. Practical examples and code snippets are included to illustrate the concepts.

Intro to content-based recommendations
BUILDING RECOMMENDATION ENGINES IN PYTHON

Rob O'Callaghan
Director of Data
What are content-based recommendations?

Items' attributes or characteristics

Vectorizing your attributes
ITEM      Attribute 1  Attribute 2  Attribute 3  Attribute 4
Item_001  0            1            1            0
Item_002  1            0            1            0
Item_003  0            1            0            1

One to many relationships
Book              Genre
The Hobbit        Adventure
The Hobbit        Fantasy
The Great Gatsby  Tragedy
...               ...

Book              Adventure  Fantasy  Tragedy  ...
The Hobbit        1          1        0        ...
The Great Gatsby  0          0        1        ...
...               ...        ...      ...      ...

Crosstabulation
pd.crosstab(index, columns)

Crosstabulation
pd.crosstab(book_genre_df['Book'], book_genre_df['Genre'])

Book               Adventure  Fantasy  Tragedy  Social commentary
The Hobbit         1          1        0        0
The Great Gatsby   0          0        1        1
A Game of Thrones  0          1        0        0
Macbeth            0          0        1        0
...                ...        ...      ...      ...
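
The crosstabulation step can be sketched end to end as follows. This is a minimal illustration, not the course's own dataset: the rows of `book_genre_df` are made up to mirror the table above, and the result is stored under the name `genres_array_df` as an assumption about how the later slides obtain that table.

```python
import pandas as pd

# Hypothetical long-format data: one row per (book, genre) pair.
book_genre_df = pd.DataFrame({
    "Book": ["The Hobbit", "The Hobbit", "The Great Gatsby",
             "The Great Gatsby", "A Game of Thrones", "Macbeth"],
    "Genre": ["Adventure", "Fantasy", "Tragedy",
              "Social commentary", "Fantasy", "Tragedy"],
})

# One row per book (the index) and one count column per genre.
genres_array_df = pd.crosstab(book_genre_df["Book"], book_genre_df["Genre"])
print(genres_array_df)
```

pd.crosstab counts occurrences, so the cells are 0/1 only when each (book, genre) pair appears once in the input. Rows and columns come back sorted alphabetically, and the book titles become the index rather than a regular column.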

Making content-based recommendations

Introducing the Jaccard similarity
Jaccard similarity:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
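
As a rough illustration of the definition, the ratio can be computed directly on Python sets. The helper name `jaccard_similarity` and the genre sets are assumptions made for this sketch, and returning 0.0 for two empty sets is a convention chosen here, not something the slide specifies.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(a & b) / len(union)


hobbit_genres = {"Adventure", "Fantasy"}
got_genres = {"Fantasy"}
gatsby_genres = {"Tragedy", "Social commentary"}

print(jaccard_similarity(hobbit_genres, got_genres))     # 0.5
print(jaccard_similarity(hobbit_genres, gatsby_genres))  # 0.0
```

One shared genre out of two distinct genres overall gives 0.5; no shared genre gives 0.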

Calculating Jaccard similarity between books
genres_array_df:

Book               Adventure  Fantasy  Tragedy  Social commentary  ...
The Hobbit         1          1        0        0                  ...
The Great Gatsby   0          0        1        1                  ...
A Game of Thrones  0          1        0        0                  ...
Macbeth            0          0        1        0                  ...
...                ...        ...      ...      ...                ...

Calculating Jaccard similarity between books
from sklearn.metrics import jaccard_score

# genres_array_df is the book-by-genre table, indexed by book title
hobbit_row = genres_array_df.loc['The Hobbit']
GOT_row = genres_array_df.loc['A Game of Thrones']

print(jaccard_score(hobbit_row, GOT_row))

0.5
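
For a version that runs on its own, the sketch below rebuilds the four-book table by hand (an assumption standing in for the real data) and compares two further pairs of rows. jaccard_score treats each row as a binary label vector, so the inputs should contain only 0s and 1s.

```python
import pandas as pd
from sklearn.metrics import jaccard_score

# Hand-built stand-in for the book-by-genre table, indexed by title.
genres_array_df = pd.DataFrame(
    [[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0]],
    index=["The Hobbit", "The Great Gatsby", "A Game of Thrones", "Macbeth"],
    columns=["Adventure", "Fantasy", "Tragedy", "Social commentary"],
)

gatsby_row = genres_array_df.loc["The Great Gatsby"]
macbeth_row = genres_array_df.loc["Macbeth"]
hobbit_row = genres_array_df.loc["The Hobbit"]

gatsby_vs_macbeth = jaccard_score(gatsby_row, macbeth_row)
hobbit_vs_macbeth = jaccard_score(hobbit_row, macbeth_row)
print(gatsby_vs_macbeth)  # 0.5: one shared genre out of two in total
print(hobbit_vs_macbeth)  # 0.0: no shared genre
```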

Finding the distance between all items
from scipy.spatial.distance import pdist, squareform

# Pairwise Jaccard distances between all rows, in condensed (1-D) form
jaccard_distances = pdist(genres_array_df.values, metric='jaccard')
print(jaccard_distances)

[1.  0.5 1.  1.  0.5 1. ]

# Expand the condensed distances into a square matrix
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)

[[0.  1.  0.5 1. ]
 [1.  0.  1.  0.5]
 [0.5 1.  0.  1. ]
 [1.  0.5 1.  0. ]]

Finding the distance between all items
print(square_jaccard_distances)

[[0.  1.  0.5 1. ]
 [1.  0.  1.  0.5]
 [0.5 1.  0.  1. ]
 [1.  0.5 1.  0. ]]

jaccard_similarity_array = 1 - square_jaccard_distances
print(jaccard_similarity_array)

[[1.  0.  0.5 0. ]
 [0.  1.  0.  0.5]
 [0.5 0.  1.  0. ]
 [0.  0.5 0.  1. ]]
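
The distance-to-similarity steps can be reproduced with a small self-contained sketch. The boolean array below is a hand-typed copy of the four example rows (The Hobbit, The Great Gatsby, A Game of Thrones, Macbeth), which is an assumption standing in for the real table.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows: The Hobbit, The Great Gatsby, A Game of Thrones, Macbeth.
# Columns: Adventure, Fantasy, Tragedy, Social commentary.
genres = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]], dtype=bool)

condensed = pdist(genres, metric="jaccard")  # one value per unordered pair
square = squareform(condensed)               # symmetric matrix, zeros on the diagonal
similarity = 1 - square                      # ones on the diagonal

print(condensed.shape)  # (6,): 4 * 3 / 2 pairs
print(similarity)
```

The condensed form stores each pair once, which is why four items yield six values; squareform only rearranges them so that rows and columns can be looked up by item position.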

Creating a usable distance table
# distance_df holds similarities (1 - Jaccard distance), labelled by book title
distance_df = pd.DataFrame(jaccard_similarity_array,
                           index=genres_array_df.index,
                           columns=genres_array_df.index)
distance_df.head()

                  The Hobbit  The Great Gatsby  A Game of Thrones  Macbeth  ...
The Hobbit        1.00        0.15              0.75               0.01     ...
The Great Gatsby  0.15        1.00              0.01               0.43     ...
...

Comparing books
print(distance_df['The Hobbit']['A Game of Thrones'])

0.75

print(distance_df['The Hobbit']['The Great Gatsby'])

0.15

Finding the most similar books
print(distance_df['The Hobbit'].sort_values(ascending=False))

title
The Hobbit 1.00
The Two Towers 0.91
A Game of Thrones 0.50
...
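
A book is always most similar to itself, so a practical lookup usually drops the queried title before ranking. The sketch below shows one way to do that; the four-book table and the name `similarity_df` are assumptions made for the example.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

books = ["The Hobbit", "The Great Gatsby", "A Game of Thrones", "Macbeth"]
genres_array_df = pd.DataFrame(
    [[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0]],
    index=books,
    columns=["Adventure", "Fantasy", "Tragedy", "Social commentary"],
)

# Jaccard similarity between every pair of books, labelled by title.
distances = squareform(pdist(genres_array_df.values.astype(bool), metric="jaccard"))
similarity_df = pd.DataFrame(1 - distances, index=books, columns=books)

# Rank the other books against The Hobbit, excluding The Hobbit itself.
hobbit_matches = (similarity_df["The Hobbit"]
                  .drop("The Hobbit")
                  .sort_values(ascending=False))
print(hobbit_matches.head(1))
```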

Text-based similarities

Working without clear attributes

Term frequency inverse document frequency
$$\text{TF-IDF} = \frac{\text{count of word occurrences}}{\text{total words in document}} \times \log\left(\frac{\text{total number of docs}}{\text{number of docs word is in}}\right)$$
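
To make the arithmetic concrete, here is a minimal sketch of the common textbook form: term frequency multiplied by the log of (total documents / documents containing the word). The three tiny documents and the helper name `tf_idf` are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this hand calculation.

```python
import math

# Hypothetical, already-tokenized documents.
docs = [
    "bilbo the hobbit leaves the shire".split(),
    "the general hears the prophecy".split(),
    "the novel tells a tragic story".split(),
]


def tf_idf(word, doc_tokens, all_docs):
    """Term frequency in one document times log inverse document frequency."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    docs_with_word = sum(1 for tokens in all_docs if word in tokens)
    if docs_with_word == 0:
        return 0.0  # the word appears nowhere, so it carries no weight
    idf = math.log(len(all_docs) / docs_with_word)
    return tf * idf


print(tf_idf("hobbit", docs[0], docs))  # about 0.183: rare word, high weight
print(tf_idf("the", docs[0], docs))     # 0.0: appears in every document
```

A word that occurs in every document gets log(1) = 0, which is how TF-IDF suppresses terms that do not help tell documents apart.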

Our data
book_summary_df:

Book               Description
The Hobbit         "Bilbo Baggins lives a simple life with his fellow hobbits in the shire..."
The Great Gatsby   "Set in Jazz Age New York, the novel tells the tragic story of Jay ..."
A Game of Thrones  "15 years have passed since Robert's rebellion, with a nine-year-long ..."
Macbeth            "A brave Scottish general receives a prophecy from a trio of witches ..."
...                ...

Instantiate the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings; filtering arguments are added on the next slides
tfidfvec = TfidfVectorizer()

Filtering the data
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=2: keep only terms that appear in at least 2 documents
tfidfvec = TfidfVectorizer(min_df=2)

Filtering the data
from sklearn.feature_extraction.text import TfidfVectorizer

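# min_df=2: keep only terms that appear in at least 2 documents
# max_df=0.7: drop terms that appear in more than 70% of documents
tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)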

Vectorizing the data
vectorized_data = tfidfvec.fit_transform(book_summary_df['Description'])
print(tfidfvec.get_feature_names_out())

['age', 'ancient', 'angry', 'brave', 'battle', 'fellow', 'game', 'general', ...]

print(vectorized_data.toarray())

[[0.21, 0.53, 0.41, 0.64, 0.01, 0.02, ...
 [0.31, 0.00, 0.42, 0.03, 0.00, 0.73, ...
 [..., ..., ..., ..., ..., ..., ...

Formatting the data
tfidf_df = pd.DataFrame(vectorized_data.toarray(),
                        columns=tfidfvec.get_feature_names_out())
tfidf_df.index = book_summary_df['Book']
print(tfidf_df)

|                  | 'age'| 'ancient'| 'angry'| 'brave'| 'battle'| 'fellow'|...
|------------------|------|----------|--------|--------|---------|---------|...
| The Hobbit       |  0.21|      0.53|    0.41|    0.64|     0.01|     0.02|...
| The Great Gatsby |  0.31|      0.00|    0.42|    0.03|     0.00|     0.73|...
| A Game of Thrones|  0.61|      0.42|    0.77|    0.31|     0.83|     0.03|...
|               ...|   ...|       ...|     ...|     ...|      ...|      ...|...

Cosine similarity
Cosine similarity:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Find similarity between all items
cosine_similarity_array = cosine_similarity(tfidf_df)

# Find similarity between two items
cosine_similarity(tfidf_df.loc['The Hobbit'].values.reshape(1, -1),
                  tfidf_df.loc['Macbeth'].values.reshape(1, -1))
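
As a quick check of what cosine_similarity computes, the sketch below compares it with the dot product divided by the product of the vector norms. The three short vectors are arbitrary values chosen for the example, not TF-IDF scores from the slides.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])

# Manual calculation: dot product over the product of the lengths.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D inputs, hence reshape(1, -1) for single vectors.
pairwise = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

# Passing one matrix returns the similarity between every pair of rows.
all_pairs = cosine_similarity(np.vstack([a, b, c]))

print(manual, pairwise)  # both about 0.7071
print(all_pairs.shape)   # (3, 3)
```

Because the measure depends only on the angle between vectors, scaling a vector (as with `c`) does not change its similarity to the others.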

User profile recommendations

Item to item recommendations

User profiles
tfidf_summary_df:

Book        Adventure  Fantasy  Tragedy  Social commentary
The Hobbit  1          1        0        0
Macbeth     0          0        1        0
...         ...        ...      ...      ...

User Profile:

User Profile  Adventure  Fantasy  Tragedy  Social commentary
User_001      ???        ???      ???      ???

Extract the user data
list_of_books_read = ['The Hobbit', 'Foundation', 'Nudge']
user_books = tfidf_summary_df.reindex(list_of_books_read)
print(user_books)

            age   ancient  angry  brave  battle  fellow  ...
The Hobbit  0.21  0.53     0.41   0.64   0.01    0.02    ...
Foundation  0.31  0.90     0.42   0.33   0.64    0.04    ...
Nudge       0.61  0.01     0.45   0.31   0.12    0.74    ...

Build the user profile
# Average the TF-IDF vectors of the books the user has read
user_prof = user_books.mean()
print(user_prof)

age        0.376667
ancient    0.480000
angry      0.426667
brave      0.426667
...

print(user_prof.values.reshape(1,-1))

[[0.376667, 0.480000, 0.426667, 0.426667, ...]]
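
One reason to build the user's subset with reindex is that it tolerates titles missing from the table, whereas a .loc lookup with a list containing an unknown label raises a KeyError. The sketch below uses a two-column toy table and a made-up missing title to show the effect on the averaged profile.

```python
import pandas as pd

# Toy TF-IDF table; only two of the user's titles are present.
tfidf_summary_df = pd.DataFrame(
    {"age": [0.21, 0.31, 0.61], "ancient": [0.53, 0.90, 0.01]},
    index=["The Hobbit", "Foundation", "Nudge"],
)

# "Not In Catalogue" is a hypothetical title with no row in the table.
list_of_books_read = ["The Hobbit", "Foundation", "Not In Catalogue"]

# reindex returns a row of NaN for the unknown title instead of failing.
user_books = tfidf_summary_df.reindex(list_of_books_read)

# mean() skips NaN by default, so the profile averages the two known books.
user_prof = user_books.mean()
print(user_prof)
```

The profile here is the average of The Hobbit and Foundation only (0.26 for "age", 0.715 for "ancient"); if every title were unknown, the profile would be all NaN and should be handled before computing similarities.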

Finding recommendations for a user
# Create a subset of only the unread books
non_user_books = tfidf_summary_df.drop(list_of_books_read, axis=0)

# Calculate the cosine similarity between the user profile and each unread book
user_prof_similarities = cosine_similarity(user_prof.values.reshape(1, -1),
                                           non_user_books)

# Wrap in a DataFrame for ease of use
user_prof_similarities_df = pd.DataFrame(user_prof_similarities.T,
                                         index=non_user_books.index,
                                         columns=["similarity_score"])

Getting the top recommendations
sorted_similarity_df = user_prof_similarities_df.sort_values(by="similarity_score",
                                                             ascending=False)
print(sorted_similarity_df)

                      similarity_score
Title
The Two Towers        0.422488
Dune                  0.363540
The Magicians Nephew  0.316075
...                   ...
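
Putting the user-profile steps together, one possible shape for the whole flow is a single helper. The function name `recommend`, the five placeholder books and their scores are all assumptions for this sketch; it also assumes every title in the read list exists in the table, since drop raises a KeyError otherwise.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF table: rows are books, columns are terms.
tfidf_summary_df = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.1],
     [0.0, 0.1, 0.9],
     [0.7, 0.3, 0.0],
     [0.1, 0.0, 0.8]],
    index=["Book A", "Book B", "Book C", "Book D", "Book E"],
    columns=["dragon", "quest", "economy"],
)


def recommend(books_read, tfidf_df, top_n=2):
    """Rank unread books by cosine similarity to the user's average vector."""
    user_prof = tfidf_df.reindex(books_read).mean()
    unread = tfidf_df.drop(books_read, axis=0)
    scores = cosine_similarity(user_prof.values.reshape(1, -1), unread)
    result = pd.DataFrame(scores.T, index=unread.index,
                          columns=["similarity_score"])
    return result.sort_values(by="similarity_score", ascending=False).head(top_n)


top = recommend(["Book A", "Book B"], tfidf_summary_df)
print(top)
```

With a reader of the two dragon-heavy books, Book D ranks first because its vector points in nearly the same direction as the profile. The index of the result comes from the unread subset, so it always matches the number of similarity scores.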
