Unit - IV
Recommendation Systems
OUTLINE
• Recommendation Engine
• Dimensionality Reduction
• Exercise
Introduction
• Recommendation engines, also called recommendation systems, are the typical
data product and a good starting point when you’re explaining to non–data
scientists what you do or what data science really is.
• Example—What movie would you like, knowing other movies you liked?
What book would you like, keeping in mind past purchases? What kind
of vacation are you likely to embark on, considering past trips?
• You can represent the above scenario as a bipartite graph (shown in the figure below): each
user and each item has a node to represent it, and there is a line from a user to an item if
that user has expressed an opinion about that item.
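As an illustration, here is a minimal sketch of such a bipartite opinion graph, assuming Python with the networkx library; the users, items, and ratings are made up:

```python
# Minimal sketch of the user-item bipartite graph (networkx).
# Users, items, and ratings below are made up for illustration.
import networkx as nx

G = nx.Graph()
users = ["alice", "bob"]
items = ["Mad Men", "Bob Dylan"]
G.add_nodes_from(users, bipartite=0)   # user-side nodes
G.add_nodes_from(items, bipartite=1)   # item-side nodes

# An edge means the user expressed an opinion about the item;
# the weight is the rating, which can be positive or negative.
G.add_edge("alice", "Mad Men", weight=5)
G.add_edge("alice", "Bob Dylan", weight=-1)
G.add_edge("bob", "Mad Men", weight=3)

print(G["alice"]["Mad Men"]["weight"])   # -> 5
```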
Contd…
• Note they might not always love that item, so the edges could have
weights: they could be positive, negative, or on a continuous scale (or
discontinuous, but many-valued like a star system).
• Example :
Contd…
• Next, you have training data in the form of some preferences, i.e., you
know some of the opinions of some of the users on some of the items.
• From that training data, you want to predict other preferences for
your users. That’s essentially the output of a recommendation engine.
• You may also have metadata on users (i.e., they are male or female, etc.) or
on items (the color of the product).
• For example, users come to your website and set up accounts, so you may know each
user’s age, gender, and other attributes.
• The curse of dimensionality
There are too many dimensions, so the closest neighbors are too far
away from each other to realistically be considered “close.”
• Overfitting
One user is closest, but that could be pure noise. How do you adjust for
that? One idea is to use kNN with, say, k = 5 rather than k = 1, which
averages over more neighbors and reduces the effect of noise. (For an optimal
solution, choose a proper value for k; see the sketch below.)
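A minimal sketch of that tuning step, assuming Python with scikit-learn and purely synthetic data: k = 1 reacts to a single (possibly noisy) nearest neighbor, larger k averages over several neighbors, and a “proper” k is picked by checking held-out accuracy.

```python
# Minimal sketch: choosing k for k-NN on held-out data (scikit-learn, synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                  # made-up user attributes
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)     # noisy "likes the item" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for k in (1, 5, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"k={k}: held-out accuracy {acc:.2f}")
```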
Contd…
• Correlated features
Moreover, there are many features that are highly correlated (inter-linked) with
each other.
For example, you might imagine that as you get older you become more conservative. But
then counting both age and politics would mean you’re double counting a single feature in
some sense.
This leads to bad performance, because you’re using redundant information and
essentially placing double the weight on some variables. It’s preferable to build in an
understanding of the correlation and project onto a smaller-dimensional space.
Contd…
• Relative importance of features
Some features are more informative than others. Weighting features may therefore
be helpful: maybe your age has nothing to do with your preference for item 1. You’d
probably use something like covariances to choose your weights.
• Sparseness
If your vector (or matrix, if you put together the vectors) is too sparse (e.g., many
entries in the vector or matrix are 0s), or you have lots of missing data, then most
things are unknown, and the Jaccard distance means nothing because there’s no
overlap (see the sketch below).
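A small sketch in plain Python (made-up binary preference vectors) of why sparsity hurts the Jaccard distance: when two users’ rated items barely overlap, every pair looks maximally distant.

```python
# Jaccard distance on sparse binary preference vectors (made-up data).
def jaccard_distance(a, b):
    a_ones = {i for i, v in enumerate(a) if v}
    b_ones = {i for i, v in enumerate(b) if v}
    union = a_ones | b_ones
    if not union:                      # neither user rated anything
        return 1.0
    return 1.0 - len(a_ones & b_ones) / len(union)

user1 = [1, 0, 0, 0, 0, 0, 0, 0]       # rated only item 0
user2 = [0, 0, 0, 0, 0, 1, 0, 0]       # rated only item 5
print(jaccard_distance(user1, user2))  # -> 1.0: no overlap, so the distance says nothing useful
```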
Contd…
• Measurement errors
There’s measurement error (also called reporting error): people may lie (e.g., when providing the data).
• Computational complexity
There’s a calculation cost—computational complexity
• Cost to update
It’s also expensive to update the model as you add more data.
• The biggest issues are the first two on the list, namely overfitting and the
curse of dimensionality problem.
Beyond Nearest Neighbor: Machine Learning
Classification
• To deal with overfitting and the curse of dimensionality problem, we’ll build a separate
linear regression model for each item.
• With each model, we could then predict for a given user, knowing their attributes,
whether they would like the item corresponding to that model.
• So one model might be for predicting whether you like Mad Men and another model
might be for predicting whether you would like Bob Dylan.
• Denote by f_{i,j} user i’s stated preference for item j if you have it (or user i’s attribute, if item
j is a metadata item like age or is_logged_in).
Contd…
The good news: You know how to estimate the coefficients by linear algebra, optimization, and
statistical inference: specifically, linear regression.
The bad news: This model only works for one item, and to be complete, you’d need to build as many
models as you have items. Moreover, you’re not using other items’ information at all to create the
model for a given item, so you’re not leveraging (using) other pieces of information.
Contd…
• But wait, there’s more good news: This solves the “weighting of the features”
problem we discussed earlier, because linear regression coefficients are
weights. (So that you can know which are more important and which are less
important)
• Crap, more bad news: overfitting is still a problem, and it comes in the form of
having huge coefficients when you don’t have enough data (i.e., not enough
opinions on given items).
Contd…
• To solve the overfitting problem, you impose a Bayesian prior that
these weights shouldn’t be too far out of whack; this is done by adding a
penalty term, with strength λ, for large coefficients.
• But that begs the question: how do you choose λ? You could do it
experimentally: use some data as your training set, evaluate how well
you did using particular values of λ, and adjust (see the sketch below).
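A minimal sketch of that experiment, assuming Python with scikit-learn (where ridge regression’s alpha plays the role of λ) and synthetic data with few opinions and many features:

```python
# Minimal sketch: penalizing large coefficients and choosing lambda on held-out data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))                  # few opinions, many features
y = X[:, 0] + 0.1 * rng.normal(size=60)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
for lam in (0.01, 1.0, 100.0, 1e6):
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    print(f"lambda={lam:>8}: validation R^2 = {model.score(X_val, y_val):.2f}, "
          f"largest |coefficient| = {np.abs(model.coef_).max():.3f}")
# A huge lambda shrinks every coefficient toward zero -- effectively "no model at all".
```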
Contd…
• A final problem with this prior stuff: although the problem will have a
unique solution (i.e., the penalized objective will have a unique minimum) if you
make λ large enough, by that time you may not be solving the problem
you care about.
• i.e., if you make λ absolutely huge, then the coefficients will all go to
zero and you’ll have no model at all.
The Dimensionality Problem
• We’ve tackled the overfitting problem (previous slides), so now let’s
think about overdimensionality, i.e., the idea that you might have tens
of thousands of items.
• For example, people invent concepts like “coolness,” but we can’t directly measure how cool
someone is. Other people exhibit different patterns of behavior, which we internally map or reduce
to our one dimension of “coolness.”
• So coolness is an example of a latent feature in that it’s unobserved and not measurable directly,
and we could think of it as reducing dimensions because perhaps it’s a combination of many
“features” we’ve observed about the person and implicitly weighted in our mind.
• Two things are happening here: the dimensionality is reduced to a single feature, and that
feature is latent.
Contd…
• But in this algorithm, we don’t decide which latent factors to care about. Instead we let
the machines do the work of figuring out what the important latent features are.
• “Important” in this context means they explain the variance in the answers to the
various questions—in other words, they model the answers efficiently
• Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations.
Given an m×n matrix X of rank k, it is a theorem from linear algebra that we can always
decompose it into the product of three matrices as follows:
X = U S V
where U is m×k, S is k×k, and V is k×n, the columns of U and V are pairwise orthogonal, and
S is diagonal. Note that the standard statement of SVD is slightly more involved: it has U and V
both square unitary matrices and the middle “diagonal” matrix rectangular. We’ll be
using this form, because we’re going to be taking approximations to X of increasingly smaller
rank.
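A minimal numerical sketch with numpy (a small random ratings matrix, made up for illustration); numpy’s Vt plays the role of the k×n matrix V above:

```python
# Minimal sketch of the factorization X = U S V with numpy.
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 6, size=(6, 4)).astype(float)    # 6 users x 4 items, made-up ratings

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # s holds the singular values, largest first
S = np.diag(s)
print(U.shape, S.shape, Vt.shape)                    # (6, 4) (4, 4) (4, 4)
print(np.allclose(X, U @ S @ Vt))                    # True: the product reconstructs X exactly
```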
Contd…
• Let’s apply the preceding matrix decomposition to our situation. X is our original dataset,
which has users’ ratings of items. We have m users, n items, and k would be the rank of X,
and consequently would also be an upper bound on the number d of latent variables we
decide to care about—note we choose d whereas m, n, and k are defined through our training
dataset. So just like in k-NN, where k is a tuning parameter (different k entirely—not trying to
confuse you!), in this case, d is the tuning parameter.
• Each row of U corresponds to a user, whereas each column of V corresponds to an item. The values along the
diagonal of the square matrix S are called the “singular values.” They measure the importance
of each latent variable: the most important latent variable has the biggest singular value.
YouTube URLs for SVD
• https://youtu.be/EokL7E6o1AE
• https://youtu.be/P5mlg91as1c
Important Properties of SVD
• Because the columns of U and V are orthogonal to each other, you can order the columns by singular
values via a base change operation. That way, if you put the columns in decreasing order of their
corresponding singular values (which you do), then the dimensions are ordered by importance from
highest to lowest. You can take lower rank approximation of X by throwing away part of S. In other
words, replace S by a submatrix taken from the upper-left corner of S.
• Of course, if you cut off part of S you’d have to simultaneously cut off part of U and part of V, but this
is OK because you’re cutting off the least important vectors. This is essentially how you choose the
number of latent variables d—you no longer have the original matrix X anymore, only an
approximation of it, because d is typically much smaller than k, but it’s still pretty close to X .
• https://www.youtube.com/watch?v=yLdOS6xyM_Q
• https://youtu.be/FgakZw6K1QQ
• https://www.youtube.com/watch?v=0Jp4gsfOLMs
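Continuing the numpy sketch from before: throwing away all but the d largest singular values (the upper-left corner of S), together with the matching columns of U and rows of Vt, gives the rank-d approximation; d = 2 here is an illustrative choice.

```python
# Minimal sketch: rank-d approximation by truncating the SVD (numpy, made-up data).
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 6, size=(6, 4)).astype(float)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 2                                            # tuning parameter: number of latent variables kept
X_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]      # keep only the d most important latent variables
print(np.round(np.linalg.norm(X - X_d), 3))      # small residual: X_d is still close to X
```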
Theorem: The resulting latent features will be uncorrelated
• A nice aspect of these latent features is that they’re uncorrelated.
Here’s a sketch of the proof:
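A minimal version of the argument, written in the notation of the previous slides (X = USV with U of size m×k, S a k×k diagonal matrix, V of size k×n) and assuming the data have been mean-centered:

```latex
% Sketch, assuming mean-centered data and X = U S V with
% U^T U = I_k (orthonormal columns of U), V V^T = I_k, and S diagonal.
\[
X = U S V, \qquad U^{\top} U = I_k, \qquad V V^{\top} = I_k, \qquad S \ \text{diagonal}.
\]
% The user-side latent features are the columns of US (one per latent variable):
\[
(US)^{\top}(US) = S\, U^{\top} U\, S = S^{2},
\]
% which is diagonal, so distinct latent features have zero inner product; for
% mean-centered data this is zero covariance, i.e. the latent features are uncorrelated.
```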
Alternating Least Squares
Exercise: Build Your Own Recommendation System
Social Network
A collection of entities
– Typically people, but could be something else too
At least one relationship between entities of the network
– For example: friends
– Sometimes boolean: two people are either friends or they are not
– May have a degree
– Discrete degree: friends, family, acquaintances, or none
– Degree – real number: the fraction of the average day that two people spend talking to each
other
An assumption of nonrandomness or locality
– Hard to formalize
– Intuition: that relationships tend to cluster
– If entity A is related to both B and C, then the probability that B and C are related is higher than
average (random)
Social Network as a Graph
[Figure: a graph with nodes A–G and a boolean (friends) relationship as edges.]
Edges could be weighted by the number of times phone calls were made, or total time of
conversation
Types of Social (or Professional) Networks
AB is an edge if A and B sent mails to each other within the last week, or month, or ever
– One-directional edges would allow spammers to have edges; requiring mail in both directions avoids this
Edges could be weighted
Other networks: collaboration network – authors of papers, with an edge if they have jointly written a paper
These networks also exhibit the locality property
Clustering of Social Network Graphs
Locality property ⇒ there are clusters
Clusters are communities
– People of the same institute, or company
– People in a photography club
– Set of people with “Something in common” between them
Need to define a distance between points (nodes)
In graphs with weighted edges, different distances exist
For graphs with “friends” or “not friends” relationship
– Distance is 0 (friends) or 1 (not friends)
– Or 1 (friends) and infinity (not friends)
– Both of these violate the triangle inequality
– Fix triangle inequality: distance = 1 (friends) and 1.5 or 2 (not friends) or length of
shortest path
Traditional Clustering
[Figure: a traditional clustering of the example graph with nodes A–G.]
Betweenness
Betweenness of an edge AB: the number of pairs of nodes (X, Y) such that AB lies on the shortest
path between X and Y
– There can be more than one shortest path between X and Y
– In that case, credit AB with the fraction of those paths that include the edge AB
High score of betweenness means?
– The edge runs “between” two communities
Betweenness gives a better measure
– Edges such as BD get a higher score than edges such as AB
Not a distance measure, may not satisfy triangle inequality. Doesn’t matter!
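A minimal sketch of the betweenness computation, assuming Python with networkx and a made-up two-community friendship graph; networkx’s routine implements the same “how many shortest paths use this edge” idea:

```python
# Minimal sketch: edge betweenness on a made-up graph with two communities (networkx).
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"),          # one community
              ("B", "D"),                                  # bridge between communities
              ("D", "E"), ("D", "F"), ("D", "G"),
              ("E", "F"), ("E", "G"), ("F", "G")])         # another community

eb = nx.edge_betweenness_centrality(G, normalized=False)
for edge, score in sorted(eb.items(), key=lambda kv: -kv[1]):
    print(edge, round(score, 2))
# The bridge (B, D) gets the highest score: it runs "between" the two communities.
```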
The Girvan – Newman Algorithm
Calculate betweenness of edges:
Step 1 – BFS: Start at a node X and perform a BFS with X as root
Observe: level of node Y = length of shortest path from X to Y
Edges between levels are called “DAG” edges
– Each DAG edge is part of at least one shortest path from X
Step 2 – Labeling: Label each node Y by the number of shortest paths from X to Y
[Figure: BFS from root E; D and F at Level 1, B and G at Level 2, A and C at Level 3; each node is labeled with its number of shortest paths from E (G is labeled 2, the other nodes 1).]
The Girvan – Newman Algorithm
Step 3 – Credit sharing:
Each leaf node gets credit 1
Each non-leaf node gets credit 1 + sum(credits of the DAG edges to the level below)
Credit of DAG edges: let Yi (i = 1, …, k) be the parents of Z and pi = label(Yi); then
credit(Yi, Z) = credit(Z) · pi / (p1 + … + pk)
Intuition: a DAG edge (Yi, Z) gets a share of the credit of Z proportional to the number of shortest paths from X to Z going through Yi
Finally: Repeat Steps 1, 2 and 3 with each node as root. For each edge, betweenness = (sum of credits obtained in all iterations) / 2
[Figure: credit values on the example BFS tree rooted at E, with leaf credits of 1 and edge credits such as 4.5, 3, 1.5, 1, and 0.5.]
Computation in practice
Complexity: n nodes, e edges
– BFS starting at each node: O(e)
– Do it for n nodes: O(ne) in total
– Very expensive
Method in practice
– Choose a random subset W of
the nodes
– Compute credit of each edge
starting at each node in W
– Sum and compute betweenness
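A sketch of the same sampling idea with networkx, which can estimate betweenness from a random subset of source nodes via its k argument (shown here on a small built-in example graph, with the exact call for comparison):

```python
# Minimal sketch: exact vs. sampled edge betweenness (networkx).
import networkx as nx

G = nx.karate_club_graph()                                   # small built-in example graph

exact = nx.edge_betweenness_centrality(G)                    # BFS from every node: O(n*e)
approx = nx.edge_betweenness_centrality(G, k=10, seed=42)    # only 10 randomly sampled roots

edge = max(exact, key=exact.get)
print(edge, round(exact[edge], 3), round(approx[edge], 3))   # exact vs. sampled score for the top edge
```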
Finding Communities using Betweenness
Method 1:
Keep adding edges (among existing ones) starting from lowest
betweenness
Gradually join small components to build large connected components
Finding Communities using Betweenness
Method 2:
Start from all existing edges. The graph may look like one big component.
Keep removing edges starting from highest betweenness
Gradually split large components to arrive at communities
At some point, removing the edge with highest betweenness would split
the graph into separate components
Finding Communities using Betweenness
For a fixed threshold of betweenness, both methods would
ultimately produce the same clustering
However, a suitable threshold is not known beforehand
Method 1 vs Method 2
– Method 2 is likely to take fewer operations. Why?
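A minimal sketch of Method 2 using networkx’s built-in Girvan–Newman routine (on the small built-in karate-club graph); it repeatedly removes the highest-betweenness edge and yields the resulting splits from coarsest to finest:

```python
# Minimal sketch: community detection by removing high-betweenness edges (networkx).
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()                    # standard small example graph
splits = girvan_newman(G)                     # generator of successive splits
first_split = next(splits)                    # communities after the first split
print([sorted(c) for c in first_split])
```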
• In this section, we shall see a technique for discovering communities directly by looking
for subsets of the nodes that have a relatively large number of edges among them.
Finding Cliques
• Our first thought about how we could find sets of nodes with many
edges between them is to start by finding a large clique (a set of nodes
with edges between any two of them).
Partitioning of Graphs
• Given a graph, we would like to divide the nodes into two sets so that
the cut, or set of edges that connect nodes in different sets, is
minimized.
• Suppose we partition the nodes of a graph into two disjoint sets S and T.
Let Cut(S, T) be the number of edges that connect a node in S to a node in T,
and let Vol(S) be the number of edges with at least one end in S.
• The normalized cut for the partition (S, T) is Cut(S, T)/Vol(S) + Cut(S, T)/Vol(T).
• Now, consider the preferred cut for the example graph on nodes A–H, consisting of the edges (B,D) and (C,G).
• Then S = {A,B,C,H} and T = {D,E, F,G}. Cut(S, T) = 2, Vol(S) = 6, and Vol(T) = 7.
• The normalized cut for this partition is thus only 2/6 + 2/7 ≈ 0.62 (a small computational sketch follows).
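A small sketch of these quantities in Python with networkx; the graph below is an illustrative stand-in (not necessarily the figure from the slides) chosen so that the partition reproduces the numbers above:

```python
# Minimal sketch: cut, volume, and normalized cut (networkx, illustrative graph).
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("A", "H"),               # edges inside S
              ("D", "E"), ("D", "F"), ("E", "F"), ("F", "G"), ("E", "G"),   # edges inside T
              ("B", "D"), ("C", "G")])                                      # the two cut edges

S, T = {"A", "B", "C", "H"}, {"D", "E", "F", "G"}

def cut(G, S, T):
    """Number of edges with one end in S and the other in T."""
    return sum(1 for u, v in G.edges() if (u in S and v in T) or (u in T and v in S))

def vol(G, S):
    """Number of edges with at least one end in S."""
    return sum(1 for u, v in G.edges() if u in S or v in S)

c = cut(G, S, T)
print(c, vol(G, S), vol(G, T))                    # 2 6 7
print(round(c / vol(G, S) + c / vol(G, T), 2))    # 0.62
```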
Some Matrices that describe Graphs
• To develop the theory of how matrix algebra can help us find good
graph partitions, we first need to learn about three different matrices
that describe aspects of a graph: the adjacency matrix A, the degree matrix D,
and the Laplacian matrix L.
• L = D − A (the Laplacian matrix is the degree matrix minus the adjacency matrix)
Eigenvalues of the Laplacian Matrix
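A minimal sketch, assuming Python with numpy and networkx and a small made-up graph, of forming L = D − A and looking at its eigenvalues (the smallest eigenvalue of a graph Laplacian is always 0, and the eigenvector of the second-smallest eigenvalue is the one commonly used to split the nodes into two sets):

```python
# Minimal sketch: the Laplacian L = D - A and its eigenvalues (numpy + networkx).
import numpy as np
import networkx as nx

# Two triangles joined by a single bridge edge (made-up example).
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"),
              ("C", "D"),
              ("D", "E"), ("E", "F"), ("D", "F")])

A = nx.to_numpy_array(G)            # adjacency matrix
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # Laplacian matrix

eigvals, eigvecs = np.linalg.eigh(L)   # L is symmetric, so eigh is appropriate
print(np.round(eigvals, 3))            # the smallest eigenvalue is 0
```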