M4: Recommender System
Introduction to Recommender Systems
Definition: Recommender systems are algorithms that suggest relevant
products or content to users based on their preferences or past behavior.
Importance: They help increase user satisfaction and boost sales by
providing personalized experiences.
Examples:
• Amazon’s “Customers who bought this item also bought”
• Netflix’s “Recommended for you”
Types of Recommender Systems
• Association Rule Mining
• It identifies patterns in large datasets by discovering rules that show the
relationship between items. For example, in a market basket analysis,
if many customers buy milk and bread together, the rule might be: If a
customer buys milk, they are likely to buy bread.
• Collaborative Filtering
• It predicts a user's preferences based on the preferences of similar
users. It can be user-based (similar users like similar items) or
item-based (similar items are liked by similar users). Widely used in
recommendation systems such as Netflix and Amazon.
• Matrix Factorization
• It breaks down a large matrix (e.g., user-item ratings matrix) into
lower-dimensional matrices to reveal latent factors. Commonly
used in collaborative filtering to find hidden relationships between
users and items, improving recommendations.
Dataset Overview
• Grocery Dataset: Contains transactions from a grocery store, where each record lists
items purchased together. It is commonly used for association rule mining to find
frequent itemsets (e.g., bread and milk) and generate rules (e.g., if bread is
purchased, milk is likely bought too). It helps businesses understand customer
buying patterns.
• MovieLens Dataset: Contains user ratings for movies, with the largest version
having over 20 million ratings. It is used for building and evaluating
recommendation systems, applying techniques like collaborative filtering and
matrix factorization to predict user preferences based on past ratings and user
similarities.
Association Rule (Association Rule Mining)
Association rule mining finds combinations of items that frequently
occur together in orders or baskets (in a retail context).
The items that frequently occur together are called itemsets.
Itemsets help to discover relationships between items that
people buy together, and they can serve as a basis for
strategies such as combining products into a combo offer or
placing products next to each other on retail shelves to attract
customer attention.
An application of association rule mining is in Market Basket
Analysis (MBA).
MBA is a technique used mostly by retailers to find
associations between items purchased by customers.
• The primary objective of a recommender system is to predict items that a
customer may purchase in the future based on his/her purchases so far. If a
customer buys beer, can we predict what he/she is most likely to
buy along with it? To predict this, we need to find out which items have
shown a strong association with beer in previously purchased baskets. We
can use the association rule mining technique to find this out.
• Association rule mining considers all possible combinations of items in the previous
baskets and computes various measures such as support, confidence, and lift
to identify rules with stronger associations. One of the challenges in association
rule mining is the number of combinations of items that need to be considered;
as the number of unique items sold by the seller increases, the number
of associations can grow exponentially. And in today's world,
retailers sell millions of items. Thus, association rule mining may require huge
computational power to go through all possible combinations. (Refer to the figure
in the previous slide.)
• One solution to this problem is to eliminate items that
cannot possibly be part of any frequent itemset. One such
algorithm is the Apriori algorithm, proposed by Agrawal
and Srikant (1994). The rules generated are represented as
• {diapers} → {beer}
• which means that customers who purchased diapers also
purchased beer in the same basket. {diapers, beer} together
is called an itemset. {diapers} is called the antecedent and
{beer} is called the consequent. Both antecedents and
consequents can have multiple items, e.g., {diapers,
milk} → {beer, bread} is also a valid rule. Each rule is
measured with a set of metrics.
Metrics Used in Association Rule Mining
Concepts such as support, confidence, and lift are used to
generate association rules.
Support indicates the frequency of items appearing together in baskets, relative
to all possible baskets being considered.
Confidence indicates how likely the consequent is to be purchased when the antecedent is
purchased, i.e., the conditional probability of buying Y given X.
Lift can be interpreted as the degree of association between two items. A lift value of 1 indicates that
the items are independent (no association); a lift value of less than 1 implies that the products are
substitutes (purchasing one product decreases the probability of purchasing the other); and a lift
value greater than 1 indicates that purchasing Product X increases the probability of purchasing
Product Y. A lift value greater than 1 is a necessary condition for generating association rules.
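These three metrics can be computed by simple counting. The following sketch uses hypothetical baskets (not taken from the grocery dataset) to work through one rule, {milk} → {bread}:

```python
# Toy baskets (hypothetical data) to illustrate support, confidence, and lift.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"beer"},
]
n = len(baskets)

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / n

# Rule: {milk} -> {bread}
sup_xy = support({"milk", "bread"})       # P(X and Y) = 3/5
confidence = sup_xy / support({"milk"})   # P(Y | X)   = 0.6 / 0.8 = 0.75
lift = confidence / support({"bread"})    # P(Y | X) / P(Y) = 0.75 / 0.6 = 1.25
```

Since lift is greater than 1 here, buying milk increases the probability of buying bread, so the rule is a candidate for generation.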
Generating Association Rules
1. Tools and Library
Python’s mlxtend library: This library provides efficient implementations of data mining
algorithms like Apriori and functions for generating association rules. It is widely used for
market basket analysis.
Step 1: Data Preprocessing
• One-hot encoding: Each transaction is represented in a binary format, where items
purchased are marked as ‘1’ and those not purchased as ‘0’.
• This format helps in efficient itemset generation using the Apriori algorithm.
Step 2: Applying Apriori Algorithm
• Setting a minimum support threshold: The minimum support is a user-defined value
(e.g., 0.01) that filters out infrequent itemsets.
• Only itemsets meeting this threshold are considered, reducing the computation time.
Generating Association Rules
Step 3: Extracting Frequent Itemsets
• Frequent Itemsets: Apriori generates itemsets that appear frequently together based on
the support threshold.
• These itemsets provide insights into common purchasing patterns.
Step 4: Generating Association Rules
• Rules Generation: Using metrics like confidence (likelihood of buying Y given X) and
lift (strength of the rule), association rules are derived from the frequent itemsets.
• Example: A rule like “If a customer buys bread, they are 70% likely to buy milk”.
6. Summary and Application
• Business Insights: These rules help businesses in product placement, targeted
promotions, and inventory management by revealing relationships between items.
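In practice, the steps above map to mlxtend's apriori and association_rules functions. As a library-free sketch of the same pipeline (transactions, support, and confidence thresholds here are all hypothetical):

```python
from itertools import combinations

# Hypothetical transactions; in practice these come from the grocery dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Steps 2-3: frequent itemsets of size 1 and 2 meeting the support threshold.
items = sorted({i for t in transactions for i in t})
frequent_singles = {i for i in items if support({i}) >= min_support}
frequent_pairs = {
    frozenset(p) for p in combinations(items, 2) if support(set(p)) >= min_support
}

# Step 4: derive rules X -> Y from frequent pairs, keeping confidence >= 0.6.
rules = []
for pair in frequent_pairs:
    x, y = tuple(pair)
    for ante, cons in ((x, y), (y, x)):
        conf = support({ante, cons}) / support({ante})
        if conf >= 0.6:
            rules.append((ante, cons, round(conf, 2)))
```

A real implementation would also prune candidate itemsets using the Apriori property (every subset of a frequent itemset must itself be frequent), which is what keeps the combinatorial explosion in check.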
Pros and Cons of Association Rule Mining
The following are advantages of using association rules:
• 1. Transaction data, which is used for generating rules, is
always available and mostly clean.
• 2. The rules generated are simple and can be interpreted.
The following are disadvantages of using association rules:
• Association rules do not take into account the preferences or
ratings given by customers, which is important information
for generating rules. If customers have bought two items but
disliked one of them, then the association should not be
considered.
Collaborative Filtering
Definition: Collaborative filtering is a recommendation technique that suggests items to users based
on patterns of behavior and preferences, leveraging similarities either between users or items.
Variations of Collaborative Filtering
User-Based Collaborative Filtering:
• Identifies users with similar tastes or preferences based on their past ratings or interactions.
Recommends items liked by similar users that the target user has not yet rated or interacted with.
• Example: If User A and User B both liked similar movies, User B might be recommended a
movie that User A enjoyed but User B has not yet watched.
Item-Based Collaborative Filtering:
• Focuses on finding items that are similar based on user interactions. Recommends items that are
frequently rated similarly by many users.
• Example: If many users who liked Movie X also liked Movie Y, then Movie Y is recommended to
users who liked Movie X.
How to Find Similarity between Users?
The picture in Figure 9.2 depicts three users, Rahul, Purvi, and Gaurav, and the books they have
bought and rated. The users are represented using their ratings in the Euclidean space in Figure 9.3.
Here the dimensions are represented by the two books Into Thin Air and Missoula, which are the two
books commonly bought by Rahul, Purvi, and Gaurav.
Rahul's preferences are similar to Purvi's rather than to Gaurav's. So the other book, Into the Wild, which
Rahul has bought and rated highly, can now be recommended to Purvi.
• Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users based on
common items they have bought.
2. Item-Based Similarity: Finds K similar items based on
common users who have bought those items.
Calculating Cosine Similarity
Definition
• Cosine similarity measures the cosine of the angle between two vectors (e.g., rating vectors of users
or items). It calculates how similar two users or items are based on their ratings or interactions.
Range
• Cosine similarity values range from -1 to 1:
• 1 indicates perfect similarity (the vectors are identical).
• 0 indicates no similarity (the vectors are orthogonal or unrelated).
• Negative values may indicate opposite preferences, though they are less common in practical recommendation
systems.
Application
• Used in Collaborative Filtering to identify similar users or items, helping to make accurate
recommendations based on shared preferences or behaviors. It is especially useful when comparing
user-item interaction patterns.
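A quick worked sketch of this measure, using two hypothetical rating vectors over the same five items (0 marks an unrated item):

```python
import math

# Hypothetical rating vectors for two users over the same five items.
user_a = [5, 3, 0, 4, 4]
user_b = [4, 3, 0, 5, 3]

# cosine(a, b) = (a . b) / (||a|| * ||b||)
dot = sum(a * b for a, b in zip(user_a, user_b))
norm_a = math.sqrt(sum(a * a for a in user_a))
norm_b = math.sqrt(sum(b * b for b in user_b))
cos_sim = dot / (norm_a * norm_b)  # close to 1: very similar rating patterns
```

Because both users rate the common items similarly, the angle between the vectors is small and the cosine similarity is close to 1.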
Calculating Cosine Similarity between Users
(Refer to Table 9.5, User-Based Similarity.)
Challenges with User-Based Similarity
• Finding user similarity does not work for new users.
• We need to wait until the new user has bought and rated a few items.
Only then can users with similar preferences be found and
recommendations be made based on that. This is called the cold start
problem in recommender systems.
• This can be overcome by using item-based similarity. Item-based
similarity is based on the notion that if two items have been bought by
many users and rated similarly, then there must be some inherent
relationship between these two items. In other terms, in future, if a user
buys one of those two items, he or she will most likely buy the other one.
Item-Based Similarity
• If two movies, movie A and movie B, have been watched by several users and rated very similarly, then
movie A and movie B can be similar in taste. In other words, if a user watches movie A, then he or she is
very likely to watch B and vice versa.
USING SURPRISE LIBRARY
• For real-world implementations, we need a more extensive library
which hides all the implementation details and provides abstract
Application Programming Interfaces (APIs) to build recommender
systems. Surprise is a Python library for accomplishing this.
• 1. Various ready-to-use prediction algorithms, such as neighborhood
methods (user similarity and item similarity) and matrix
factorization-based algorithms. It also has built-in similarity measures
such as cosine, mean squared difference (MSD), and Pearson
correlation coefficient.
• 2. Tools to evaluate, analyze, and compare the performance of the
algorithms. It also provides methods to make recommendations.
User-Based Similarity Algorithm
• The surprise.prediction_algorithms.knns.KNNBasic provides the
collaborative filtering algorithm and takes the following parameters:
1. k: The (max) number of neighbors to take into account for
aggregation.
2. min_k: The minimum number of neighbors to take into account for
aggregation, if there are not enough neighbors.
3. sim_options (dict): A dictionary of options for the similarity measure.
(a) name: Name of the similarity measure to use, e.g., cosine, msd, or
pearson. (b) user_based: True for user-based similarity and False for
item-based similarity.
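Conceptually, such a predictor scores an unseen item as the similarity-weighted average of the k most similar users' ratings of that item. The following is a library-free sketch of that idea on hypothetical users and ratings; it is not Surprise's actual implementation:

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m3": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
    "dave":  {"m1": 5, "m2": 4},  # has not rated m3 yet
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item, k=2):
    """Predict user's rating of item as the similarity-weighted
    average over the k most similar users who rated the item."""
    neighbors = sorted(
        ((cosine(ratings[user], ratings[v]), ratings[v][item])
         for v in ratings if v != user and item in ratings[v]),
        reverse=True)[:k]
    num = sum(sim * r for sim, r in neighbors)
    den = sum(sim for sim, _ in neighbors)
    return num / den if den else None
```

Here dave's predicted rating for m3 lands between the ratings of his two most similar neighbors, alice and bob.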
Finding the Best Model
Sparse Matrix
• In real-world datasets, most users have rated only a few items, making the matrix sparse (i.e., containing
many NaN values).
• To simplify computation, these NaN entries are often replaced with 0s, assuming the user has not interacted
with the item.
• This helps when applying techniques like matrix factorization, which require numerical data.
MATRIX FACTORIZATION
• Matrix factorization is a matrix decomposition technique. Matrix
decomposition is an approach for reducing a matrix into its constituent
parts. Matrix factorization algorithms decompose the user-item matrix
into the product of two lower dimensional rectangular matrices.
• In Figure 9.4, (next slide) the original matrix contains users as rows,
movies as columns, and rating as values. The matrix can be decomposed
into two lower dimensional rectangular matrices.
• The Users–Movies matrix contains the ratings of 3 users (U1, U2, U3)
for 5 movies (M1 through M5). This Users–Movies matrix is factorized
into a (3, 3) Users–Factors matrix and a (3, 5) Factors–Movies matrix.
Multiplying the Users–Factors and Factors–Movies matrices will result in
the original Users–Movies matrix.
The idea behind matrix factorization is that there are latent factors that determine why a
user rates a movie, and the way he/she rates. The factors could be the story or actors or
any other specific attributes of the movies. But we may never know what these factors
actually represent. That is why they are called latent factors. A matrix with size (n, m),
where n is the number of users and m is the number of movies, can be factorized into (n,
k) and (k, m) matrices, where k is the number of factors.
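One standard way to obtain such a factorization is singular value decomposition (SVD). The rating values below are hypothetical, and k is set equal to the number of users so the reconstruction is exact, matching the (3, 3) × (3, 5) shapes described for Figure 9.4; in practice k is chosen smaller than both n and m to compress the matrix and surface latent factors:

```python
import numpy as np

# A small Users x Movies rating matrix (hypothetical values):
# 3 users (rows) rating 5 movies (columns).
R = np.array([
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [4.0, 3.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 5.0, 4.0],
])

k = 3  # number of latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Users-Factors matrix (3, k) and Factors-Movies matrix (k, 5).
user_factors = U[:, :k] * s[:k]
movie_factors = Vt[:k, :]

# Multiplying them back reproduces the original ratings matrix.
R_hat = user_factors @ movie_factors
```

With k smaller than the matrix rank, R_hat becomes an approximation, and the unfilled entries of a sparse ratings matrix can be read off from it as predicted ratings.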
Real-World Applications
Retail (Product recommendations based on purchase history):
• Amazon uses a collaborative filtering recommendation system, where it suggests products like "Customers who bought this
also bought" based on users' browsing and purchase patterns
• Sephora, a beauty retailer, leverages product recommendations on product pages based on customer ratings and purchase
history. They suggest complementary products like makeup tools and skincare along with the primary product
Streaming Services (Movie or music recommendations):
• Netflix uses collaborative filtering to recommend movies and TV shows based on the user’s viewing history and similarities
to other users
• Spotify recommends playlists and songs using machine learning algorithms that analyze a user's listening habits, favorite
genres, and even the time of day
E-commerce (Personalized suggestions for cross-selling and upselling):
• eBay utilizes cross-selling recommendations such as "Frequently Bought Together" and upselling like "This item is part of
a more expensive version" to drive higher value purchases
• Best Buy offers upselling suggestions, like recommending a more advanced version of a tech product (e.g., higher-end
laptops) or additional accessories based on the user's interests
Part 2: Text Analytics
TEXT ANALYTICS OVERVIEW
Definition: Text analytics is the process of transforming
unstructured text into structured data for analysis.
Applications:
Sentiment Analysis: Used by businesses to understand customer
sentiment in reviews.
Spam Detection: Identifies spam emails by recognizing patterns
in words.
Topic Extraction: Clusters news articles into topics like politics,
sports, and technology.
Language Identification: Recognizes the language of a text (e.g.,
English, Spanish) for translation services.
Tools: Natural Language Processing (NLP), machine learning,
statistical analysis.
TEXT CLASSIFICATION AND SENTIMENT ANALYSIS
Text Classification: Assigns predefined categories to text data based on its
content.
● Examples: Classifying customer reviews as “positive” or “negative,” or
categorizing emails as “spam” or “not spam.”
Sentiment Analysis:
● Definition: Sentiment analysis determines the emotion (positive,
negative, neutral) expressed in text.
● Example: "The movie was fantastic!" (positive sentiment); "I wasted my
time on this movie" (negative sentiment).
● Importance: Companies use sentiment analysis to gauge public opinion
on products, events, and services
Exploring the Dataset
• Loading the data.
• Getting the positive sentiments (output shown on slide).
• Exploring the sentiment data using matplotlib (output shown on slide).
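The slide's code and output are not reproduced here. The following sketch illustrates the same exploration steps on a small hypothetical set of labeled comments; a real workflow would load the dataset with pandas and visualize the class counts with a matplotlib bar chart:

```python
from collections import Counter

# Hypothetical labeled comments (text, sentiment): 1 = positive, 0 = negative.
# In the slides these records are loaded from a sentiment dataset file.
data = [
    ("The movie was fantastic!", 1),
    ("I wasted my time on this movie", 0),
    ("Great acting and story", 1),
    ("Terrible plot", 0),
    ("Loved it", 1),
]

# "Getting positive sentiments": filter the records labeled 1.
positive = [text for text, label in data if label == 1]

# Class distribution: the quantity the matplotlib bar chart visualizes.
counts = Counter(label for _, label in data)
```

Checking the class distribution early matters because a heavily imbalanced dataset can make a sentiment classifier look more accurate than it really is.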
Text Pre-processing
One way is to consider each word as a feature and find a measure to capture whether
a word exists or does not exist in a sentence. This is called the bag-of-words (BoW)
model. That is, each sentence (comment on a movie or a product) is treated as a bag
of words. Each sentence (record) is called a document and collection of all documents
is called corpus.
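A minimal bag-of-words sketch over a two-document hypothetical corpus, where every unique word in the corpus becomes a feature and each document becomes a vector of word counts:

```python
from collections import Counter

# Two hypothetical documents (comments); together they form the corpus.
corpus = ["the movie was great", "the movie was boring"]

# Vocabulary: every unique word across the corpus, in sorted order.
vocab = sorted({w for doc in corpus for w in doc.split()})

# Each document becomes a vector of word counts over the vocabulary.
bow = [[Counter(doc.split())[w] for w in vocab] for doc in corpus]
```

The two vectors differ only in the positions for "great" and "boring", which is exactly the signal a sentiment classifier trained on these features would pick up.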
2. Informal Language
- Language on social media is often informal, includes a mix of languages, and
uses emoticons.
- Training data should include similar examples to help the model learn from these
variations.