Unit 3
YEAR/SEM: IV / VIII
Regulation: 2017
Advanced Analytical Theory and Methods: Association Rules - Overview - Apriori Algorithm -
Evaluation of Candidate Rules - Applications of Association Rules - Finding Association & Finding Similarity - Recommendation System: Collaborative Recommendation - Content-Based Recommendation - Knowledge-Based Recommendation - Hybrid Recommendation Approaches.
The main purpose of Association Rule Mining is to discover frequent itemsets in a large dataset and, from them, a set of if-then rules called association rules.
An association rule has the form I → j, where I is a set of items (products) and j is a particular item.
Algorithms for frequent itemset mining include:
Apriori Algorithm
FP-Growth Algorithm
SON algorithm
PCY algorithm
For example, the first pass of the PCY algorithm loops over each basket ("FOR (each basket):"), counting individual items and hashing every pair of items in the basket to a bucket whose count is incremented; see the sketch below.
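A minimal Python sketch of that first pass (not from the source text; it assumes baskets are iterables of hashable items and an arbitrary bucket count):

from collections import defaultdict
from itertools import combinations

def pcy_first_pass(baskets, n_buckets=100003):
    # Count individual items and hash every pair of items in a basket to a bucket.
    item_counts = defaultdict(int)
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    return item_counts, bucket_counts

On the second pass, a pair is counted as a candidate only if both of its items are frequent and its bucket count from the first pass meets the support threshold.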
Toivonen's algorithm begins by selecting a small sample of the input dataset and finding from it the candidate frequent itemsets.
It then makes only one full pass over the database and thus produces exact association rules in a single full pass.
The algorithm gives neither false negatives nor false positives, but there is a small yet non-zero probability that it will fail to produce any answer at all.
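A rough, hedged Python sketch of this idea (the helper that mines the sample is a deliberately naive stand-in, and all names and thresholds here are illustrative, not from the source):

import random
from itertools import combinations

def count_itemsets(baskets, max_size=3):
    # Naive counter used only on the small in-memory sample.
    counts = {}
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                s = frozenset(combo)
                counts[s] = counts.get(s, 0) + 1
    return counts

def toivonen(baskets, min_support, sample_frac=0.1, seed=0):
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < sample_frac]

    # Frequent-in-sample candidates, mined with a slightly lowered threshold.
    sample_threshold = max(1, int(0.8 * min_support * sample_frac))
    candidates = {s for s, c in count_itemsets(sample).items() if c >= sample_threshold}

    # Negative border: minimal itemsets that are NOT frequent in the sample
    # although all of their immediate subsets are.
    all_items = {i for b in baskets for i in b}
    frequent_items = {next(iter(s)) for s in candidates if len(s) == 1}
    border = {frozenset([i]) for i in all_items if frozenset([i]) not in candidates}
    for s in candidates:
        for i in frequent_items - s:
            ext = s | {i}
            if ext not in candidates and all(ext - {x} in candidates for x in ext):
                border.add(ext)

    # The single full pass: count every candidate and every border set.
    full_counts = {s: 0 for s in candidates | border}
    for basket in baskets:
        bset = set(basket)
        for s in full_counts:
            if s <= bset:
                full_counts[s] += 1

    if any(full_counts[s] >= min_support for s in border):
        return None  # a border set turned out frequent: no answer, rerun with a new sample
    return {s for s in candidates if full_counts[s] >= min_support}

If the function returns a set, it is exactly the collection of frequent itemsets (up to the size limit of the naive sample miner); returning None corresponds to the small probability of producing no answer.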
Apriori is an algorithm for frequent item set mining and association rule learning
over transactional databases.
It proceeds by identifying the frequent individual items in the database and
extending them to larger and larger item sets as long as those item sets appear
sufficiently often in the database.
Collaborative filtering
Customer segmentation
Data summarization
Dynamic trend detection
Multimedia data analysis
Biological data analysis
Social network analysis
8. Define support and confidence.
Support is the fraction of transactions that contain a given itemset: Support(X) = (number of transactions containing X) / (total number of transactions).
Confidence measures how often a rule X ⇒ Y holds: Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X), i.e., the fraction of transactions containing X that also contain Y.
9. What are the steps followed in the Apriori algorithm of data mining?
1. Join Step: This step generates (K+1) itemset from K-itemsets by joining each item with
itself.
2. Prune Step: This step scans the count of each item in the database. If the candidate item
does not meet minimum support, then it is regarded as infrequent and thus it is removed.
This step is performed to reduce the size of the candidate itemsets.
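As an illustration only (the function names and data layout are assumptions, not from the text), the two steps could be sketched in Python with transactions stored as sets and itemsets as frozensets:

def join_step(frequent_k, k):
    # Join step: build candidate (k+1)-itemsets from pairs of frequent k-itemsets.
    return {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}

def prune_step(candidates, transactions, min_sup):
    # Prune step: count each candidate in the database and keep only those
    # whose count meets the minimum support.
    counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup}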
10. What are some applications of the Apriori algorithm?
In Forestry: analysis of the probability and intensity of forest fires from forest-fire data.
Apriori is used by many companies, for example by Amazon in its recommender system and by Google for the auto-complete feature.
11. What are the applications of Association Rules?
Stock analysis
Web log mining
Medical diagnosis
Customer market analysis
Bioinformatics
To find the frequent itemsets of size k, monotonicity lets us restrict our attention
to only those itemsets such that all their subsets of size k − 1 have already been
found frequent.
15. Why do we need recommender systems?
Companies are able to gain and retain customers by sending out emails with links
to new offers that meet the recipients’ interests, or suggestions of films and TV
shows that suit their profiles.
The user starts to feel known and understood and is more likely to buy additional
products or consume more content. By knowing what a user wants, the company
gains competitive advantage and the threat of losing a customer to a competitor
decreases.
1. Briefly explain apriori algorithm to find a frequent item set in a data base.
"Let I = {i1, i2, …, in} be a set of n binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items X and Y are called the antecedent and consequent of the rule respectively."
In other words, the rule states that the presence of itemset X determines the presence of itemset Y, provided that the minimum support and confidence conditions are met.
For example, if the rule {bread} ⇒ {butter} has 2% support and 60% confidence, this means that bread and butter are bought together in 2% of all transactions, and that 60% of the customers who bought bread also bought butter.
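The two measures can be computed directly from a transaction list. The tiny dataset below is made up purely for illustration (it does not reproduce the 2% / 60% figures above):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Fraction of transactions containing the antecedent that also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.4  (2 of 5 transactions)
print(confidence({"bread"}, {"butter"}))  # 0.5  (2 of the 4 bread transactions)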
Frequent item set (pattern) mining is broadly used because of its wide applications in mining association rules, correlations, graph pattern constraints based on frequent patterns, sequential patterns, and many other data mining tasks.
The Apriori algorithm, proposed by R. Agrawal and R. Srikant, was one of the first algorithms for frequent item set mining.
This algorithm uses two steps, "join" and "prune", to reduce the search space.
Apriori says (the anti-monotone property):
If P(I) < minimum support threshold, then I is not frequent.
If P(I ∪ A) < minimum support threshold, then I ∪ A is not frequent, where A is any other item.
In other words, if an itemset has support below the minimum support threshold, all of its supersets also have support below the threshold and can be ignored.
1. Join Step: This step generates (K+1) item set from K-item sets by joining each item with
itself.
2. Prune Step: This step scans the count of each item in the database. If the candidate item
does not meet minimum support, then it is regarded as infrequent and thus it is removed.
This step is performed to reduce the size of the candidate item sets.
Steps In Apriori
The Apriori algorithm is a sequence of steps to be followed to find the frequent item sets
in the given database.
This data mining technique follows the join and prune steps iteratively until no new
frequent item sets can be found.
A minimum support threshold is given in the problem or it is assumed by the user.
1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm counts the occurrences of each item.
2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence counts satisfy min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
3) Next, the frequent 2-itemsets with min_sup are discovered. In the join step, the 2-itemsets are generated by forming groups of two from the frequent 1-itemsets.
4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will contain only the 2-itemsets that meet min_sup.
5) The next iteration forms 3-itemsets using the join and prune steps. This iteration follows the anti-monotone property: the 2-itemset subsets of each candidate 3-itemset must themselves be frequent. If all 2-itemset subsets are frequent, the superset is kept as a candidate; otherwise it is pruned.
6) The next step forms 4-itemsets by joining 3-itemsets with each other, pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm stops when no new frequent itemsets can be generated; a compact sketch of the full loop is given below.
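The whole iteration can be summarised in a short, self-contained Python sketch (an illustrative implementation under the assumptions above, not code from the source):

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns a dict mapping each frequent itemset (frozenset) to its support count.
    transactions = [set(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    frequent = {c: n for c, n in count({frozenset([i]) for i in items}).items() if n >= min_sup}
    result, k = dict(frequent), 1
    while frequent:
        # Join: candidate (k+1)-itemsets from pairs of frequent k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune by the anti-monotone property: every k-subset must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result

Running this on the TABLE-1 transactions of the example below with min_sup = 3 yields {I1, I2, I3} as the largest frequent itemset, matching the worked solution.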
Example of Apriori: Support threshold=50%, Confidence= 60%
TABLE-1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count of Each Item
TABLE-2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, so it is deleted; only I1, I2, I3, I4 meet the min_sup count.
TABLE-3
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemsets. From TABLE-1, find the occurrences of each 2-itemset.
TABLE-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that the item sets {I1, I4} and {I3, I4} do not meet min_sup, so they are deleted.
TABLE-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find the occurrences of the 3-itemsets, and from TABLE-5 check which of their 2-itemset subsets meet min_sup.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3} and {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4} and {I2, I4}; {I1, I4} is not frequent, as it does not occur in TABLE-5, thus {I1, I2, I4} is not frequent and is deleted. Similarly, {I1, I3, I4} and {I2, I3, I4} contain infrequent 2-itemset subsets and are also pruned.
TABLE-6
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
Only {I1, I2, I3} is frequent.
6. Generate Association Rules: From the frequent itemset {I1, I2, I3} discovered above, the candidate association rules and their confidences are:
{I1, I2} => {I3}: Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) * 100 = 75%
{I1, I3} => {I2}: Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) * 100 = 100%
{I2, I3} => {I1}: Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) * 100 = 75%
{I1} => {I2, I3}: Confidence = support{I1, I2, I3} / support{I1} = (3/4) * 100 = 75%
{I2} => {I1, I3}: Confidence = support{I1, I2, I3} / support{I2} = (3/5) * 100 = 60%
{I3} => {I1, I2}: Confidence = support{I1, I2, I3} / support{I3} = (3/4) * 100 = 75%
This shows that all the above association rules are strong if the minimum confidence threshold is 60%.
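A small Python check of this rule-generation step, using the support counts from the tables above (the function name and layout are just for illustration):

from itertools import combinations

counts = {
    frozenset({"I1"}): 4, frozenset({"I2"}): 5, frozenset({"I3"}): 4,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 3, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 3,
}

def rules_from(itemset, min_conf=0.6):
    # Enumerate every rule X => (itemset - X) and keep those meeting min_conf.
    out = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            x = frozenset(antecedent)
            conf = counts[frozenset(itemset)] / counts[x]
            if conf >= min_conf:
                out.append((set(x), set(itemset) - x, round(conf, 2)))
    return out

for lhs, rhs, conf in rules_from({"I1", "I2", "I3"}):
    print(lhs, "=>", rhs, conf)   # reproduces the six confidence values above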
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large item sets in large databases
Disadvantages
1. It requires high computation if the item sets are very large and the minimum support is
kept very low.
Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in later iterations; a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets and can therefore be marked or removed.
3. Partitioning: This method requires only two database scans to mine the frequent item sets. It says that for any item set to be potentially frequent in the database, it should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from Database D and then searches for
frequent item set in S. It may be possible to lose a global frequent item set. This can be
reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate item sets at any
marked start point of the database during the scanning of the database.
Applications of the Apriori Algorithm
In Forestry: analysis of the probability and intensity of forest fires from forest-fire data.
Apriori is used by many companies, for example by Amazon in its recommender system and by Google for the auto-complete feature.
2. Explain the basic concept of Recommender Systems.
What Are Recommender Systems?
Recommender systems are one of the most common and easily understandable
applications of big data.
Recommender systems are an important class of machine learning algorithms that
offer "relevant" suggestions to users.
Types of Data Used by Recommender Systems
Since big data fuels recommendations, the input needed for model training plays
a key role.
Depending on your business goals, a system can work based on such types of
data as content, historical data, or user data involving views, clicks, and likes.
The data used for training a model to make recommendations can be split into
several categories.
2. Item data (attributes)
Title
Category
Price
Description
Style
3. Contextual information
Device used
Current location
Referral URL
To get a full picture of your customer, it is not enough to know what he or she is
viewing on your website and on your competitors' sites.
You should also take into account the frequency of visits, the user's location, and the
types of devices used.
All the data sources are equally important for the smooth and consistent
operation of different types of algorithms.
There are three major types of recommender systems:
Content-based filtering
Collaborative filtering
Hybrid recommender systems
These methods can rely on user behavior data, including activities, preferences, and likes, or can
take into account the description of the items that users prefer, or both.
i) Content-based filtering
This method works based on the properties of the items that each user likes,
discovering what else the user may like.
It takes into account multiple keywords. Also, a user profile is designed to
provide comprehensive information on the items that a user prefers.
The system then recommends some similar items that users may also want to
purchase.
ii) Collaborative filtering
Recommendation engines can rely on the likes and desires of other users to compute
a similarity index between users and recommend items to them accordingly.
This type of filtering relies on user opinion instead of machine analysis to
accurately recommend complex items, such as movies or music tracks.
The collaborative filtering algorithm has some specifics. The system can search
for look-alike users, which will be user-user collaborative filtering.
So, recommendations will depend on a user profile. But such an approach
requires a lot of computational resources and will be hard to implement for
large-scale databases.
Another option is item-item collaborative filtering. The system will find similar
items and recommend these items to a user on a case-by-case basis.
Users get movie recommendations based on their habits and the characteristics
of content they prefer.
3. Briefly explain the concept of collaborative filtering methods in recommender system.
Collaborative methods for recommender systems are methods that are based solely on the
past interactions recorded between users and items in order to produce new
recommendations.
These interactions are stored in the so-called “user-item interactions matrix”.
Moreover, the more users interact with items, the more accurate new recommendations
become: for a fixed set of users and items, new interactions recorded over time bring
new information and make the system more and more effective.
i) Memory-based collaborative filtering (user-user and item-item).
The collaborative filtering algorithm has some specifics.
The system can search for look-alike users, which will be user-user collaborative
filtering.
So, recommendations will depend on a user profile. But such an approach requires a lot
of computational resources and will be hard to implement for large-scale databases.
Another option is item-item collaborative filtering.
The system will find similar items and recommend these items to a user on a case-by-
case basis.
It is a resource-saving approach, and Amazon utilizes it to engage customers and improve
sales volumes.
The main characteristic of the user-user and item-item approaches is that they use only
information from the user-item interaction matrix and assume no model to produce
new recommendations.
User-user
In order to make a new recommendation to a user, user-user method roughly tries to
identify users with the most similar “interactions profile” (nearest neighbours) in order to
suggest items that are the most popular among these neighbours (and that are “new” to
our user).
This method is said to be "user-centred" as it represents users based on their interactions
with items and evaluates distances between users.
Assume that we want to make a recommendation for a given user.
First, every user can be represented by its vector of interactions with the different items
(“its line” in the interaction matrix).
Then, we can compute some kind of "similarity" between our user of interest and every
other user.
That similarity measure is such that two users with similar interactions on the same items
should be considered as being close.
Once similarities to every other user have been computed, we can keep the k nearest
neighbours of our user and then suggest the most popular items among them (looking
only at items that our reference user has not interacted with yet).
Notice that, when computing similarity between users, the number of "common
interactions" (how many items have already been considered by both users) should be
considered carefully.
Indeed, most of the time we want to avoid a situation where someone who has only one
interaction in common with our reference user gets a 100% match and is considered
"closer" than someone who has 100 interactions in common and agrees on "only" 98% of
them.
So, we consider that two users are similar if they have interacted with a lot of common
items in the same way (similar rating, similar time hovering…).
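A compact numpy sketch of the user-user idea (illustrative only: it assumes a dense rating matrix M with 0 meaning "no interaction" and uses cosine similarity):

import numpy as np

def recommend_user_user(M, user, k=5, n_items=3):
    # Cosine similarity between the user's interaction row and every other row.
    norms = np.linalg.norm(M, axis=1, keepdims=True) + 1e-9
    sims = (M / norms) @ (M[user] / norms[user])
    sims[user] = -np.inf                      # ignore the user themself
    neighbours = np.argsort(sims)[-k:]        # k nearest neighbours
    # Score items by their similarity-weighted popularity among the neighbours.
    scores = sims[neighbours] @ M[neighbours]
    scores[M[user] > 0] = -np.inf             # keep only items new to the user
    return np.argsort(scores)[-n_items:][::-1]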
Item-item
To make a new recommendation to a user, the idea of item-item method is to find items
similar to the ones the user already “positively” interacted with.
Two items are considered to be similar if most of the users that have interacted with both
of them did it in a similar way.
This method is said to be "item-centred" as it represents items based on the interactions
users had with them and evaluates distances between those items.
Assume that we want to make a recommendation for a given user.
First, we consider the item this user liked the most and represent it (as all the other items)
by its vector of interactions with every user ("its column" in the interaction matrix).
Then, we can compute similarities between the “best item” and all the other items.
Once the similarities have been computed, we can then keep the k-nearest-neighbours to
the selected “best item” that are new to our user of interest and recommend these items.
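The item-item variant can be sketched in the same spirit (again an illustrative numpy snippet, with the same dense-matrix assumption):

import numpy as np

def recommend_item_item(M, user, n_items=3):
    best = np.argmax(M[user])                 # the item this user liked the most
    # Cosine similarity between the best item's column and every other column.
    cols = M / (np.linalg.norm(M, axis=0, keepdims=True) + 1e-9)
    sims = cols.T @ cols[:, best]
    sims[M[user] > 0] = -np.inf               # keep only items new to the user
    return np.argsort(sims)[-n_items:][::-1]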
Illustration of the item-item method.
The difference between item-item and user-user methods.
Matrix factorisation
The main assumption behind matrix factorisation is that there exists a pretty low
dimensional latent space of features in which we can represent both users and items and
such that the interaction between a user and an item can be obtained by computing the dot
product of corresponding dense vectors in that space.
For example, consider that we have a user-movie rating matrix.
In order to model the interactions between users and movies, we can assume that:
there exist some features that describe (and tell apart) the movies fairly well;
these features can also be used to describe user preferences (high values for features
the user likes, low values otherwise).
However, we don't want to explicitly give these features to our model (as could be done
for the content based approaches that we will describe later).
Instead, we prefer to let the system discover these useful features by itself and make its
own representations of both users and items.
As they are learned and not given, the extracted features taken individually have a
mathematical meaning but no intuitive interpretation (and so are difficult, if not
impossible, for a human to understand).
However, it is not unusual for the structures emerging from this type of algorithm to be
extremely close to an intuitive decomposition that a human could think of.
Indeed, the consequence of such a factorisation is that users who are close in terms of
preferences, as well as items that are close in terms of characteristics, end up having
close representations in the latent space.
Illustration of the matrix factorization method.
The interaction matrix is approximated as M ≈ X·Yᵀ, where X is the "user matrix" (n×l)
whose rows represent the n users and Y is the "item matrix" (m×l) whose rows represent
the m items.
Here l is the dimension of the latent space in which users and items will be represented.
So, we search for matrices X and Y whose dot product best approximates the existing
interactions.
Denoting by E the set of pairs (i, j) such that M_ij is set (not None), we want to find X
and Y that minimise the "rating reconstruction error"
(X, Y) = argmin_{X,Y} Σ_{(i,j)∈E} (X_i · Y_j − M_ij)²
Adding a regularisation factor and dividing by 2, we get
(X, Y) = argmin_{X,Y} (1/2) [ Σ_{(i,j)∈E} (X_i · Y_j − M_ij)² + λ ( Σ_i ||X_i||² + Σ_j ||Y_j||² ) ]
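A minimal stochastic gradient descent sketch of this regularised objective (illustrative assumptions: M is a dense numpy array with np.nan for missing ratings, and the hyper-parameters are arbitrary):

import numpy as np

def matrix_factorisation(M, l=2, lr=0.01, reg=0.1, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    n, m = M.shape
    X = rng.normal(scale=0.1, size=(n, l))    # user factors
    Y = rng.normal(scale=0.1, size=(m, l))    # item factors
    users, items = np.where(~np.isnan(M))     # the set E of observed pairs
    for _ in range(epochs):
        for i, j in zip(users, items):
            err = X[i] @ Y[j] - M[i, j]       # reconstruction error on one rating
            X[i] -= lr * (err * Y[j] + reg * X[i])
            Y[j] -= lr * (err * X[i] + reg * Y[j])
    return X, Y                               # X @ Y.T approximates (and completes) M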
Deeper neural network models are often used to achieve near state of the art
performances in complex recommender systems.
Matrix factorization can be generalized with the use of a model on top of users and items
embeddings.
4. Briefly elaborate the content based recommender systems in business industries.
Content based approaches use additional information about users and/or items.
If we consider the example of a movies recommender system, this additional information
can be, for example, the age, the sex, the job or any other personal information for users
as well as the category, the main actors, the duration or other characteristics for the
movies (items).
Then, the idea of content based methods is to try to build a model, based on the available
"features", that explains the observed user-item interactions.
Still considering users and movies, we will try, for example, to model the fact that young
women tend to rate better some movies, that young men tend to rate better some other
movies and so on.
If we manage to get such model, then, making new predictions for a user is pretty easy:
we just need to look at the profile (age, sex, …) of this user and, based on this
information, to determine relevant movies to suggest.
1. Content based methods suffer far less from the cold start problem than collaborative
approaches: new users or items can be described by their characteristics (content), and so
relevant suggestions can be made for these new entities.
2. Only new users or items with previously unseen features will logically suffer from this
drawback, but once the system is old enough, this has little to no chance of happening.
In content based methods, the recommendation problem is cast as either a
classification problem (predict whether a user "likes" an item or not) or a regression
problem (predict the rating given by a user to an item).
In both cases, we are going to set a model that will be based on the user and/or item
features at our disposal (the “content” of our “content-based” method).
If our classification (or regression) is based on user features, we say the approach is
item-centred: modelling, optimisations and computations can be done "by item".
In this case, we build and learn one model per item, based on user features, trying to
answer the question "what is the probability that each user likes this item?" (or "what
rating does each user give to this item?", for regression).
The model associated to each item is naturally trained on data related to this item and it
leads, in general, to pretty robust models as a lot of users have interacted with the item.
However, the interactions considered to learn the model come from every user, and even
if these users have similar characteristics (features), their preferences can differ.
This means that even though this method is more robust, it can be considered less
personalised (more biased) than the user-centred method described hereafter.
If we are working with item features, the method is then user-centred: modelling,
optimisations and computations can be done "by user".
We then train one model per user, based on item features, that tries to answer the question
"what is the probability that this user likes each item?" (or "what rating does this user
give to each item?", for regression).
We can then attach a model to each user that is trained on that user's data: the model
obtained is thus more personalised than its item-centred counterpart, as it only takes into
account interactions from the considered user.
However, most of the time a user has interacted with relatively few items, so the
model we obtain is far less robust than an item-centred one.
Illustration of the difference between item-centred and user-centred content based methods.
From a practical point of view, we should underline that, most of the time, it is much
more difficult to ask a new user for information (users do not want to answer too many
questions) than it is to ask for a lot of information about a new item (the people adding
items have an interest in filling in this information so that their items are recommended to
the right users).
We can also notice that, depending on the complexity of the relation to express, the
model we build can be more or less complex, ranging from basic models (logistic/linear
regression for classification/regression) to deep neural networks.
Finally, let's mention that content based methods can also be neither user- nor item-
centred: information about both the user and the item can be used by our models, for
example by stacking the two feature vectors and feeding them through a neural network
architecture.
Item-centred Bayesian classifier
To achieve the classification task, we want to compute the ratio between the probability
that a user with given features likes the considered item and the probability that he or she
dislikes it.
This ratio of conditional probabilities, which defines our classification rule (with a simple
threshold), can be expressed following the Bayes formula
P(like | features) / P(dislike | features) = [ P(features | like) · P(like) ] / [ P(features | dislike) · P(dislike) ]
where P(like) and P(dislike) are priors computed from the data, whereas P(features | like)
and P(features | dislike) are likelihoods assumed to follow Gaussian distributions with
parameters also determined from the data.
Various hypotheses can be made about the covariance matrices of these two likelihood
distributions (no assumption, equality of the matrices, equality of the matrices plus
feature independence), leading to various well known models (quadratic discriminant
analysis, linear discriminant analysis, naive Bayes classifier).
We can underline once more that, here, likelihood parameters have to be estimated only
based on data (interactions) related to the considered item.
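A hedged numpy sketch of this item-centred rule, with class-specific Gaussian likelihoods (QDA-style); the feature matrices and the small ridge term added to the covariances are assumptions made for the sake of a runnable example:

import numpy as np

def gaussian_loglik(x, mean, cov):
    d = x - mean
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + np.log(np.linalg.det(cov))
                   + len(x) * np.log(2 * np.pi))

def item_centred_bayes(user_features, liked_users, disliked_users):
    # liked_users / disliked_users: (n_users x n_features) arrays of the users
    # who liked / disliked this particular item.
    p = liked_users.shape[1]
    prior_like = len(liked_users) / (len(liked_users) + len(disliked_users))
    mu_l, mu_d = liked_users.mean(axis=0), disliked_users.mean(axis=0)
    cov_l = np.cov(liked_users, rowvar=False) + 1e-6 * np.eye(p)
    cov_d = np.cov(disliked_users, rowvar=False) + 1e-6 * np.eye(p)
    log_ratio = (gaussian_loglik(user_features, mu_l, cov_l) + np.log(prior_like)
                 - gaussian_loglik(user_features, mu_d, cov_d) - np.log(1 - prior_like))
    return log_ratio > 0    # True: predict that this user likes the item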
User-centred linear regression
Let's now consider the case of a user-centred regression: for each user, we want to train a
simple linear regression that takes item features as inputs and outputs the rating for that
item.
We still denote M the user-item interaction matrix, we stack into a matrix X row vectors
representing users coefficients to be learned and we stack into a matrix Y row vectors
representing items features that are given.
Then, for a given user i, we learn the coefficients in X_i by solving the optimisation problem
X_i = argmin_{X_i} (1/2) Σ_{j:(i,j)∈E} (X_i · Y_j − M_ij)²
where one should keep in mind that i is fixed and, so, the summation is only over the
(user, item) pairs that concern user i.
We can observe that if we solve this problem for all the users at the same time, the
optimisation problem is exactly the same as the one we solve in “alternated matrix
factorisation” when we keep items fixed.
This observation underlines the link we mentioned in the first section: model based
collaborative filtering approaches (such as matrix factorisation) and content based
methods both assume a latent model for user-item interactions but model based
collaborative approaches have to learn latent representations for both users and items
whereas content-based methods build a model upon human defined features for users
and/or items.
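A small numpy sketch of that per-user regression (ridge-regularised least squares; the regularisation term and data layout are illustrative assumptions):

import numpy as np

def user_centred_regression(M, Y, user, reg=0.1):
    # M: (n x m) ratings with np.nan for missing entries; Y: (m x l) item features.
    seen = ~np.isnan(M[user])
    A = Y[seen]                    # features of the items this user has rated
    b = M[user, seen]              # the corresponding ratings
    coeffs = np.linalg.solve(A.T @ A + reg * np.eye(Y.shape[1]), A.T @ b)
    return Y @ coeffs              # predicted ratings for every item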
5. How to evaluate a recommender system?
As for any machine learning algorithm, we need to be able to evaluate the performance
of our recommender systems in order to decide which algorithm fits our situation best.
Evaluation methods for recommender systems can mainly be divided into two categories:
evaluation based on well defined metrics, and evaluation mainly based on human
judgment and satisfaction estimation.
As mentioned in the collaborative section, we absolutely want to avoid having a user
being stuck in what we called earlier an information confinement area.
The notion of “serendipity” is often used to express the tendency a model has or not to
create such a confinement area (diversity of recommendations).
Serendipity, which can be estimated by computing the distance between recommended
items, should not be too low, as that would create confinement areas, but should also not
be too high, as it would mean that we do not take our users' interests into account enough
when making recommendations (exploration vs exploitation).
Thus, in order to bring diversity into the suggested choices, we want to recommend items
that both suit our user very well and are not too similar to each other.
For example, instead of recommending a user "Star Wars" 1, 2 and 3, it seems better to
recommend "Star Wars 1", "Star Trek Into Darkness" and "Indiana Jones and the Raiders
of the Lost Ark": the two latter may be seen by our system as having less chance of
interesting our user, but recommending three items that look too similar is not a good option.
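As a rough illustration of how such a diversity term could be measured (assuming each recommended item comes with an embedding vector; nothing here is prescribed by the text), one can average the pairwise cosine distances inside the recommended list:

import numpy as np

def intra_list_diversity(item_vectors):
    # Average pairwise cosine distance between the recommended items.
    V = np.asarray(item_vectors, dtype=float)
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-9)
    sims = V @ V.T
    off_diag = sims[~np.eye(len(V), dtype=bool)]
    return float(np.mean(1 - off_diag))

A value near 0 means the list is highly redundant (a possible confinement area), while larger values indicate more diverse, potentially more serendipitous recommendations.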
Explainability is another key point of the success of recommendation algorithms. Indeed,
it has been proven that if users do not understand why a specific item has been
recommended to them, they tend to lose confidence in the recommender system.
So, if we design a model that is clearly explainable, we can add, when making
recommendations, a little sentence stating why an item has been recommended ("people
who liked this item also liked this one", "you liked this item, you may be interested in this
one", …).
Finally, on top of the fact that diversity and explainability can be intrinsically difficult to
evaluate, we can notice that it is also pretty difficult to assess the quality of a
recommendation that does not belong to the testing dataset:
how do we know whether a new recommendation is relevant before actually recommending
it to our user? For all these reasons, it can sometimes be tempting to test the model in "real
conditions". As the goal of the recommender system is to generate an action (watch a
movie, buy a product, read an article, etc.), we can indeed evaluate its ability to generate
the expected action.
For example, the system can be put in production, following an A/B testing approach, or
can be tested only on a sample of users.
Such processes require, however, having a certain level of confidence in the model.