
Unit - IV

Recommendation Systems
OUTLINE
• Recommendation Engine

• Dimensionality Reduction

• Singular Value Decomposition

• Principal Component Analysis

• Exercise
Introduction
• Recommendation engines, also called recommendation systems, are the typical
data product and a good starting point when you're explaining to non-data
scientists what you do, or what data science really is.

• Examples for recommendation systems:

1. Getting recommended movies on Netflix or YouTube

2. Getting recommended books on Flipkart or Amazon

• Building a solid recommendation system end-to-end requires an understanding of
linear algebra and an ability to code.
A Real-World Recommendation
Engine
• Recommendation engines are used all the time.

• Example—What movie would you like, knowing other movies you liked?
What book would you like, keeping in mind past purchases? What kind
of vacation are you likely to embark on, considering past trips?

• There are plenty of different ways to go about building such a model,
but they have very similar feels, if not implementations.
Example to set up a
recommendation engine
• Scenario - suppose you have users, which form a set U; and you have items to
recommend, which form a set V.

• You can represent the above scenario as a bipartite graph (shown in below figure) if each
user and each item has a node to represent it—there are lines from a user to an item if
that user has expressed an opinion about that item.
Contd…
• Note they might not always love that item, so the edges could have
weights: they could be positive, negative, or on a continuous scale (or
discontinuous, but many-valued like a star system).

• Example :
Contd…
• Next step is, you have training data in the form of some preferences i.e., you
know some of the opinions of some of the users on some of the items.

• From those training data, you want to predict other preferences for
your users. That’s essentially the output for a recommendation engine.

• You may also have metadata on users (e.g., whether they are male or female) or
on items (e.g., the color of the product).
• For example, users come to your website and set up accounts, so you may know each

user’s gender, age, and preferences for up to three items.


Contd…
• Next, You may represent a given user as a vector of features, sometimes
including only metadata—sometimes including only preferences (which
would lead to a sparse vector because you don’t know all the user’s
opinions) —and sometimes including both, depending on what you’re
doing with the vector. Also, you can sometimes bundle all the user vectors
together to get a big user matrix, which we call U.
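A minimal sketch of this representation (the feature names and values below are made up for illustration): each user vector mixes metadata with sparse preferences, and stacking the vectors gives the user matrix U.

```python
import numpy as np

# Hypothetical feature layout: metadata first, then preferences for three items.
# np.nan marks opinions we don't know, which is what makes the vector sparse.
feature_names = ["age", "is_female", "likes_item_1", "likes_item_2", "likes_item_3"]

user_1 = np.array([25, 1, 1.0, np.nan, 0.0])    # we know two of the three preferences
user_2 = np.array([34, 0, np.nan, 1.0, np.nan])

# Bundling all the user vectors together gives the big user matrix U (one row per user).
U = np.vstack([user_1, user_2])
print(U.shape)   # (2, 5)
```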
Nearest Neighbor Algorithm Review
• Let’s review the nearest neighbor algorithm (discussed already)
• Idea of Nearest Neighbor Algorithm is - if you want to predict whether user A likes something,
you look at a user B closest to user A who has an opinion, then you assume A’s opinion is the
same as B’s.
• To implement this you need a metric so you can measure distance.
• One example when the opinions are binary: Jaccard distance, i.e., 1 − (the number of things
they both like) / (the number of things either of them likes).
• Other examples include Cosine similarity or Euclidean distance.
• To answer, Which Metric Is Best?
- Do experiment by using different distance measure for each experiment.
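A minimal sketch of the three measures mentioned above, on made-up binary "like" vectors for two users (purely illustrative data):

```python
import numpy as np

a = np.array([1, 1, 0, 1, 0])   # user A's likes
b = np.array([1, 0, 0, 1, 1])   # user B's likes

# Jaccard distance: 1 - (items both like) / (items either likes)
both = np.sum((a == 1) & (b == 1))
either = np.sum((a == 1) | (b == 1))
jaccard_distance = 1 - both / either                     # 1 - 2/4 = 0.5

# Cosine similarity and Euclidean distance on the same vectors
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)

print(jaccard_distance, round(cosine_similarity, 2), round(euclidean_distance, 2))
```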
Some Problems with Nearest
Neighbors
• Curse of dimensionality

There are too many dimensions, so the closest neighbors are too far
away from each other to realistically be considered “close.”

• Overfitting

One guy is closest, but that could be pure noise. How do you adjust for
that? One idea is to use kNN with, say, k = 5 rather than k = 1, which
smooths out the noise from any single neighbor. (For an optimal solution, choose a proper value of k.)
Contd…
• Correlated features
There are many features, moreover, that are highly correlated (inter-linked) with
each other.

For example, you might imagine that as you get older you become more conservative. But
then counting both age and politics would mean you’re double counting a single feature in
some sense.

This would lead to bad performance, because you’re using redundant information and
essentially placing double the weight on some variables. It’s preferable to build in an
understanding of the correlation and project onto smaller dimensional space.
Contd…
• Relative importance of features

Some features are more informative than others. Weighting features may therefore
be helpful: maybe your age has nothing to do with your preference for item 1. You'd
probably use something like covariances to choose your weights.
• Sparseness
If your vector (or matrix, if you put together the vectors) is too sparse (i.e., many
entries in the vector or matrix are 0s), or you have lots of missing data, then most
things are unknown, and the Jaccard distance means nothing because there's no
overlap.
Contd…
• Measurement errors
There's measurement error (also called reporting error): people may lie (e.g., when providing the data).

• Computational complexity
There’s a calculation cost—computational complexity

• Sensitivity of distance metrics


Euclidean distance also has a scaling problem: distances in age outweigh distances for other features if
those are reported as 0 (for don't like) or 1 (for like). Essentially this means that raw Euclidean distance doesn't make
much sense. Also, old and young people might think one thing but middle-aged people something else.
We seem to be assuming a linear relationship, but it may not exist. Should you be binning by age group (creating
buckets based on age) instead, for example? (i.e., an alternative representation may be needed).
Contd…
• Preferences change over time
User preferences may also change over time, which falls outside the model.
For example, at eBay, they might be buying a printer, which makes them only
want ink for a short time.

• Cost to update
It’s also expensive to update the model as you add more data.

• The biggest issues are the first two on the list, namely overfitting and the
curse of dimensionality problem.
Beyond Nearest Neighbor: Machine Learning
Classification
• To deal with overfitting and the curse of dimensionality problem, we’ll build a separate
linear regression model for each item.

• With each model, we could then predict for a given user, knowing their attributes,
whether they would like the item corresponding to that model.

• So one model might be for predicting whether you like Mad Men and another model
might be for predicting whether you would like Bob Dylan.

• Denote by f_{i,j} user i's stated preference for item j if you have it (or user i's attribute, if item
j is a metadata item like age or is_logged_in).
Contd…

The good news: You know how to estimate the coefficients by linear algebra, optimization, and
statistical inference: specifically, linear regression.
The bad news: This model only works for one item, and to be complete, you’d need to build as many
models as you have items. Moreover, you’re not using other items’ information at all to create the
model for a given item, so you’re not leveraging (Using) other pieces of information.
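As a sketch of the one-model-per-item idea (toy data; the scikit-learn usage and array names are assumptions for illustration, not the textbook's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows of X are user attribute/preference vectors f_i; y_item_j holds the known
# opinions on item j for the users who expressed one.
X = np.array([[25, 1, 1.0],
              [34, 0, 0.0],
              [19, 1, 1.0],
              [41, 0, 0.0]], dtype=float)
y_item_j = np.array([1.0, 0.0, 1.0, 0.0])

model_j = LinearRegression().fit(X, y_item_j)   # a model for item j, and item j only
new_user = np.array([[29, 1, 1.0]])
print(model_j.predict(new_user))                # predicted preference for item j

# To be complete you would repeat this for every item, which is exactly the drawback noted above.
```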
Contd…
• But wait, there’s more good news: This solves the “weighting of the features”
problem we discussed earlier, because linear regression coefficients are
weights. (So that you can know which are more important and which are less
important)

• Crap, more bad news: overfitting is still a problem, and it comes in the form of
having huge coefficients when you don’t have enough data (i.e., not enough
opinions on given items).
Contd…
• To solve the overfitting problem, you impose a Bayesian prior that
these weights shouldn't be too far out of whack; this is done by adding a
penalty term for large coefficients.

• That solution depends on a single parameter, which is traditionally called λ.

• But that begs the question: how do you choose λ? You could do it
experimentally: use some data as your training set, evaluate how well
you did using particular values of λ, and adjust.
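A minimal sketch of that experimental procedure, assuming scikit-learn's Ridge (where the penalty parameter λ is called alpha) and synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 user features, preferences for one item.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y_item_j = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out a validation set, try several values of lambda, keep the best one.
X_train, X_val, y_train, y_val = train_test_split(X, y_item_j, random_state=0)
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    print(lam, round(model.score(X_val, y_val), 3))   # pick the lambda that validates best
```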
Contd…
• A final problem with this prior stuff: although the problem will have a
unique solution (as in the penalty will have a unique minimum) if you
make λ large enough, by that time you may not be solving the problem
you care about.

• i.e., if you make λ absolutely huge, then the coefficients will all go to
zero and you’ll have no model at all.
The Dimensionality Problem
• We've tackled the overfitting problem (previous slides), so now let's
think about overdimensionality, i.e., the idea that you might have tens
of thousands of items.

• We typically use both Singular Value Decomposition (SVD) and Principal


Component Analysis (PCA) to tackle this.
Contd…
• To understand how this works before we dive into the math, let’s think about how we reduce
dimensions and create “latent features” internally every day.

• For example, people invent concepts like “coolness,” but we can’t directly measure how cool
someone is. Other people exhibit different patterns of behavior, which we internally map or reduce
to our one dimension of “coolness.”

• So coolness is an example of a latent feature in that it’s unobserved and not measurable directly,
and we could think of it as reducing dimensions because perhaps it’s a combination of many
“features” we’ve observed about the person and implicitly weighted in our mind.

• Two things are happening here: the dimensionality is reduced to a single feature, and that
feature is latent.
Contd…
• But in this algorithm, we don’t decide which latent factors to care about. Instead we let
the machines do the work of figuring out what the important latent features are.

• “Important” in this context means they explain the variance in the answers to the
various questions—in other words, they model the answers efficiently

• Our goal is to build a model that has a representation in a low dimensional subspace
that gathers “taste information” to generate recommendations.

• To brush up on linear algebra, see the link:

https://www.khanacademy.org/math/linear-algebra
Singular Value Decomposition (SVD)
• Maths background:

Given an m×n matrix X of rank k, it is a theorem from linear algebra that we can always
decompose it into the product of three matrices as follows:

X = U S V^T

where U is m×k, S is k×k, and V is n×k; the columns of U and V are pairwise orthogonal, and
S is diagonal. Note the standard statement of SVD is slightly more involved: it has U and V
both square unitary matrices, and the middle "diagonal" matrix rectangular. We'll be
using this form, because we're going to be taking approximations to X of increasingly smaller
rank.
Contd…
• Let’s apply the preceding matrix decomposition to our situation. X is our original dataset,
which has users’ ratings of items. We have m users, n items, and k would be the rank of X,
and consequently would also be an upper bound on the number d of latent variables we
decide to care about—note we choose d whereas m, n, and k are defined through our training
dataset. So just like in k-NN, where k is a tuning parameter (different k entirely—not trying to
confuse you!), in this case, d is the tuning parameter.

• Each row of U corresponds to a user, whereas V has a row for each item. The values along the
diagonal of the square matrix S are called the “singular values.” They measure the importance
of each latent variable—the most important latent variable has the biggest singular value.
YouTube URLs for SVD
• https://youtu.be/EokL7E6o1AE

• https://youtu.be/P5mlg91as1c
Important Properties of SVD
• Because the columns of U and V are orthogonal to each other, you can order the columns by singular
values via a base change operation. That way, if you put the columns in decreasing order of their
corresponding singular values (which you do), then the dimensions are ordered by importance from
highest to lowest. You can take a lower-rank approximation of X by throwing away part of S. In other
words, replace S by a submatrix taken from the upper-left corner of S.

• Of course, if you cut off part of S you'd have to simultaneously cut off part of U and part of V, but this
is OK because you're cutting off the least important vectors. This is essentially how you choose the
number of latent variables d. You no longer have the original matrix X, only an
approximation of it, because d is typically much smaller than k, but it's still pretty close to X.

• SVD can’t handle missing values.

• SVD is extremely computationally expensive.
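A minimal NumPy sketch of the truncation just described, on a made-up ratings matrix (zeros stand in for missing ratings here, which is itself a simplification, since SVD can't really handle missing values):

```python
import numpy as np

# Toy user-item ratings matrix X (4 users, 5 items); 0 means "no rating".
X = np.array([[5, 4, 0, 1, 0],
              [4, 5, 1, 0, 0],
              [0, 1, 5, 4, 4],
              [1, 0, 4, 5, 3]], dtype=float)

# Compact SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the d most important latent dimensions (largest singular values).
d = 2
X_hat = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]

# Entries of the rank-d approximation can be read as predicted scores.
print(np.round(X_hat, 2))
```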


How would you actually use this for
recommendation?
Principal Component Analysis (PCA)
• Let's look at another approach for predicting preferences. With this approach,
you're still looking for U and V as before, but you don't need S anymore, so you're
just searching for U and V such that:

X ≈ U · V

i.e., each entry x_{i,j} of X is approximated by the dot product of user i's latent-feature
vector (a row of U) and item j's latent-feature vector (a column of V). You find U and V by
minimizing the squared error between the observed entries of X and the corresponding entries of U · V.
Contd…
• How do you choose d? It's typically about 100, because that's more than 20
(as we told you, through the course of developing the product, we found that we
had a pretty good grasp on someone if we asked them 20 questions) and it's as much
as you care to add before it becomes computationally too much work.
YouTube URL for PCA
• https://www.youtube.com/watch?v=ZqXnPcyIAL8

• https://www.youtube.com/watch?v=yLdOS6xyM_Q

• https://youtu.be/FgakZw6K1QQ

• https://www.youtube.com/watch?v=0Jp4gsfOLMs
Theorem: The resulting latent features
will be uncorrelated
• A nice aspect of these latent features is that they're uncorrelated.
Here's a sketch of the proof: the latent features come from a decomposition whose factors
have pairwise orthogonal columns, and for mean-centered data, orthogonal feature columns
give a diagonal covariance matrix, which is exactly what it means for the latent features
to be uncorrelated.
Alternating Least Squares
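The idea of Alternating Least Squares (ALS) is to fix V and solve for U (each user's latent vector is then a small ridge-regression problem over that user's observed ratings), then fix U and solve for V, and repeat. A rough sketch with made-up data and parameter choices:

```python
import numpy as np

def als(X, mask, d=2, lam=0.1, n_iters=20, seed=0):
    """Alternating Least Squares sketch: factor X ~ U @ V.T using only the
    observed entries (mask == 1), with an L2 penalty lam on the factors."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, d))
    V = rng.normal(scale=0.1, size=(n, d))
    I = np.eye(d)
    for _ in range(n_iters):
        # Fix V: each user row of U solves a small ridge regression.
        for i in range(m):
            obs = mask[i] == 1
            Vo = V[obs]
            U[i] = np.linalg.solve(Vo.T @ Vo + lam * I, Vo.T @ X[i, obs])
        # Fix U: each item row of V solves a small ridge regression.
        for j in range(n):
            obs = mask[:, j] == 1
            Uo = U[obs]
            V[j] = np.linalg.solve(Uo.T @ Uo + lam * I, Uo.T @ X[obs, j])
    return U, V

# Toy ratings matrix; 0 marks a missing rating.
X = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 5],
              [0, 1, 4, 4]], dtype=float)
mask = (X > 0).astype(int)

U, V = als(X, mask)
print(np.round(U @ V.T, 2))   # predicted ratings, including the previously missing cells
```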
Exercise: Build Your Own Recommendation System

• Refer to the example given in the textbook (page 214).


Mining Social-Network
Graphs
Social Network

No introduction required

Really?

We still need to understand a few properties

(disclaimer: the brand logos are used here entirely for educational purposes)
Social
Network
 A collection of entities
– Typically people, but could be something else too
 At least one relationship between entities of the network
– For example: friends
– Sometimes boolean: two people are either friends or they are not
– May have a degree
– Discrete degree: friends, family, acquaintances, or none
– Degree – real number: the fraction of the average day that two people spend talking to each
other
 An assumption of nonrandomness or locality
– Hard to formalize
– Intuition: that relationships tend to cluster
– If entity A is related to both B and C, then the probability that B and C are related is higher than
average (random)
Social Network as a
Graph
[Figure: an example graph on nodes A, B, C, D, E, F, G with a boolean (friends) relationship]

 Check for the non-randomness criterion


 In a random graph (V,E) of 7 nodes and 9 edges, if XY is an edge, YZ
is an edge, what is the probability that XZ is an edge?
– For a large random graph, it would be close to |E|/(|V|C2) = 9/21 ~ 0.43
– Small graph: XY and YZ are already edges, so compute within the rest
– So the probability is (|E|−2)/(|V|C2−2) = 7/19 = 0.37
 Now let’s compute what is the probability for this graph in particular
Example courtesy: Leskovec, Rajaraman and Ullman
Social Network as a
Graph
[Figure: the same graph with the boolean (friends) relationship, annotated with the question: does it have the locality property?]
 For each X, check possible YZ and check if YZ is an edge or not
 Example: if X = A, YZ = {BC}, it is an edge
– X = A: YZ = {BC}, edges: 1/1
– X = B: YZ = {AC, AD, CD}, edges: 1/3
– X = C: YZ = {AB}, edges: 1/1
– X = D: YZ = {BE, BG, BF, EF, EG, FG}, edges: 2/6
– X = E: YZ = {DF}, edges: 1/1
– X = F: YZ = {DE, DG, EG}, edges: 2/3
– X = G: YZ = {DF}, edges: 1/1
– Total: 9/16 ≈ 0.56, which is higher than the random estimate of 0.37, so this graph does have the locality property
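The same count can be scripted. A small sketch (the edge list is read off the example figure):

```python
from itertools import combinations

# Edges of the example graph (undirected), as used in the table above.
edges = [("A","B"), ("A","C"), ("B","C"), ("B","D"),
         ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")]
edge_set = {frozenset(e) for e in edges}

# Adjacency lists.
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

yes = total = 0
for x in sorted(neighbors):
    pairs = list(combinations(sorted(neighbors[x]), 2))    # candidate YZ pairs for this X
    hits = sum(frozenset(p) in edge_set for p in pairs)    # how many of them are edges
    yes, total = yes + hits, total + len(pairs)
    print(x, f"{hits}/{len(pairs)}")

print("overall:", f"{yes}/{total}", "=", round(yes / total, 2))   # 9/16 = 0.56
```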
Types of Social (or
Professional) Networks

 Of course, the “social network”. But also several other types


 Telephone network
 Nodes are phone numbers
 AB is an edge if A and B talked over phone within the last one week, or month, or ever

 Edges could be weighted by the number of times phone calls were made, or total time of
conversation
Types of Social (or
Professional) Networks

 Email network: nodes are email addresses

 AB is an edge if A and B sent mail to each other within the last week, or month, or ever
– Requiring mail in both directions keeps spammers from acquiring edges
 Edges could be weighted

 Other networks: collaboration network – nodes are authors, with an edge if they have jointly written a paper
 These networks also exhibit the locality property
Clustering of Social
Network Graphs
 Locality property ⇒ there are clusters
 Clusters are communities
– People of the same institute, or company
– People in a photography club
– Set of people with “Something in common” between them
 Need to define a distance between points (nodes)
 In graphs with weighted edges, different distances exist
 For graphs with “friends” or “not friends” relationship
– Distance is 0 (friends) or 1 (not friends)
– Or 1 (friends) and infinity (not friends)
– Both of these violate the triangle inequality
– Fix triangle inequality: distance = 1 (friends) and 1.5 or 2 (not friends) or length of
shortest path
Traditional
Clustering

 Intuitively, two communities


 Traditional clustering depends on the distance
– Likely to put two nodes with small distance in the same cluster
– Social network graphs would have cross-community edges
– Severe merging of communities likely
 May join B and D (and hence merge the two communities), since the distance between B and D is not high
Betweenness of an
Edge

 Betweenness of an edge AB: #of pairs of nodes (X,Y) such that AB lies on the shortest
path between X and Y
– There can be more than one shortest paths between X and Y
– Credit AB the fraction of those paths which include the edge AB
 High score of betweenness means?
– The edge runs “between” two communities
 Betweenness gives a better measure
– Edges such as BD get a higher score than edges such as AB
 Not a distance measure, may not satisfy triangle inequality. Doesn’t matter!
The Girvan – Newman Algorithm
 Step 1 – BFS: Start at a node X, perform a BFS with X as root
 Observe: level of node Y = length of shortest path from X to Y
 Edges between levels are called “DAG” edges
– Each DAG edge is part of at least one shortest path from X
 Step 2 – Labeling: Label each node Y by the number of shortest paths from X to Y

[Figure: calculating betweenness of edges. BFS DAG rooted at E for the example graph; level 1: D, F; level 2: B, G; level 3: A, C. Node labels (number of shortest paths from E): E = 1, D = 1, F = 1, B = 1, G = 2, A = 1, C = 1]
The Girvan – Newman
Algorithm
Step 3 – Credit sharing:
 Each leaf node gets credit 1
 Each non-leaf node gets credit 1 + sum(credits of the DAG edges to the level below)
 Credit of DAG edges: let Yi (i = 1, …, k) be the parents of Z and pi = label(Yi); then

credit(Yi, Z) = credit(Z) × pi / (p1 + … + pk)

 Intuition: a DAG edge (Yi, Z) gets a share of the credit of Z proportional to the number of
shortest paths from X to Z going through Yi

Finally: Repeat Steps 1, 2 and 3 with each node as root. For each edge,
betweenness = (sum of credits obtained in all iterations) / 2

[Figure: credit sharing in the BFS DAG rooted at E. Node credits: A = 1, C = 1, B = 3, G = 1, D = 4.5, F = 1.5. Edge credits: B-A = 1, B-C = 1, D-B = 3, D-G = 0.5, F-G = 0.5, E-D = 4.5, E-F = 1.5]
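As a sketch using the networkx library (an external package assumed to be available; this is not the course's own implementation), edge betweenness and one round of Girvan-Newman on the example graph:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# The example graph from the earlier slides.
G = nx.Graph([("A","B"), ("A","C"), ("B","C"), ("B","D"),
              ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")])

# Unnormalized edge betweenness: B-D should score highest, since it is the
# only edge running between the two communities.
eb = nx.edge_betweenness_centrality(G, normalized=False)
for edge, score in sorted(eb.items(), key=lambda kv: -kv[1]):
    print(edge, round(score, 2))

# One step of Girvan-Newman: keep removing the highest-betweenness edge until
# the graph splits, yielding the communities {A, B, C} and {D, E, F, G}.
communities = next(girvan_newman(G))
print([sorted(c) for c in communities])
```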
Computation in
practice
 Complexity: n nodes, e edges
– BFS starting at each node: O(e)

– Do it for n nodes

– Total: O(ne) time

– Very expensive

 Method in practice
– Choose a random subset W of
the nodes
– Compute credit of each edge
starting at each node in W
– Sum and compute betweenness
Finding Communities using Betweenness
Method 1:
 Keep adding edges (among existing ones) starting from lowest betweenness
 Gradually join small components to build large connected components
Finding Communities using Betweenness
Method 2:
 Start from all existing edges. The graph may look like one big component.
 Keep removing edges starting from highest betweenness
 Gradually split large components to arrive at communities

At some point, removing the edge with highest betweenness would split
the graph into separate components
Finding Communities using
Betweenness
 For a fixed threshold of betweenness, both methods would
ultimately produce the same clustering
 However, a suitable threshold is not known beforehand

 Method 1 vs Method 2
– Method 2 is likely to take fewer operations. Why?

– Inter-community edges are fewer than intra-community edges

Direct Discovery of Communities
• So far, we searched for communities by partitioning all the individuals in a social
network.

• While this approach is relatively efficient, it does have several limitations.

• It is not possible to place an individual in two different communities, and everyone is
assigned to a community.

• In this section, we shall see a technique for discovering communities directly by looking
for subsets of the nodes that have a relatively large number of edges among them.
Finding Cliques
• Our first thought about how we could find sets of nodes with many
edges between them is to start by finding a large clique (a set of nodes
with edges between any two of them).

• However, that task is not easy: the problem of finding a largest clique is NP-complete.


Complete Bipartite Graphs
• A complete bipartite graph consists of s nodes on one side and t
nodes on the other side, with all st possible edges between the nodes
of one side and the other present.

Fig: The bipartite graph


Partitioning of Graphs

• This is another approach to organizing social-network graphs: we use some
important tools from matrix theory (“spectral methods”) to formulate
the problem of partitioning a graph to minimize the number of edges
that connect different components.
What Makes a Good Partition?

• Given a graph, we would like to divide the nodes into two sets so that
the cut, or set of edges that connect nodes in different sets, is
minimized.

• However, we also want to constrain the selection of the cut so that


the two sets are approximately equal in size.
Contd…
Normalized Cuts
• First, define the volume of a set S of nodes, denoted Vol (S), to be the
number of edges with at least one end in S.

• Suppose we partition the nodes of a graph into two disjoint sets S and T .
Let Cut (S, T ) be the number of edges that connect a node in S to a node
in T .

• Then the normalized cut value for S and T is

NCut(S, T) = Cut(S, T) / Vol(S) + Cut(S, T) / Vol(T)
Example
• Again consider the graph of Fig. 10.11. If we choose S = {H} and T = {A,B,C,D,E, F,G}, then
Cut (S, T ) = 1. Vol(S) = 1, because there is only one edge connected to H.
• On the other hand, Vol(T ) = 11, because all the edges have at least one end at a node of T.
Thus, the normalized cut for this partition is 1/1 + 1/11 = 1.09.

• Now, consider the preferred cut for this graph consisting of the edges (B,D) and (C,G).
• Then S = {A,B,C,H} and T = {D,E, F,G}. Cut (S, T ) = 2, Vol (S) = 6, and Vol(T ) = 7.
• The normalized cut for this partition is thus only 2/6 + 2/7 = 0.62.
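A minimal sketch of the computation. The full edge list of Fig. 10.11 is not reproduced in these slides, so the demo reuses the earlier 7-node example graph with the partition S = {A, B, C}, T = {D, E, F, G} (an illustrative choice):

```python
def normalized_cut(edges, S, T):
    cut = sum(1 for u, v in edges if (u in S and v in T) or (u in T and v in S))
    vol_S = sum(1 for u, v in edges if u in S or v in S)   # edges with at least one end in S
    vol_T = sum(1 for u, v in edges if u in T or v in T)   # edges with at least one end in T
    return cut / vol_S + cut / vol_T

# Earlier 7-node example graph (not Fig. 10.11).
edges = [("A","B"), ("A","C"), ("B","C"), ("B","D"),
         ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")]
S, T = {"A", "B", "C"}, {"D", "E", "F", "G"}

# Cut = 1, Vol(S) = 4, Vol(T) = 6, so 1/4 + 1/6 ~ 0.42
print(round(normalized_cut(edges, S, T), 2))
```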
Some Matrices that describe
Graphs
• To develop the theory of how matrix algebra can help us find good
graph partitions, we first need to learn about three different matrices
that describe aspects of a graph.

i) Adjacency matrix (A): a_{ij} = 1 if there is an edge between nodes i and j, and 0 otherwise

ii) Degree matrix (D): a diagonal matrix with d_{ii} = the degree of node i

iii) Laplacian matrix: L = D − A, i.e., the degree matrix minus the adjacency matrix
Eigenvalues of the Laplacian
Matrix
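Following the reference below (Mining of Massive Datasets, Chapter 10): the smallest eigenvalue of the Laplacian is always 0 (its eigenvector is the all-ones vector), and the eigenvector of the second-smallest eigenvalue suggests a good partition when you split the nodes by the sign of its entries. A sketch on the earlier example graph (the expected split is an illustrative claim, not taken from the slides):

```python
import numpy as np

# Adjacency (A), degree (D), and Laplacian (L) matrices for the earlier example graph.
nodes = ["A", "B", "C", "D", "E", "F", "G"]
edges = [("A","B"), ("A","C"), ("B","C"), ("B","D"),
         ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")]
idx = {v: i for i, v in enumerate(nodes)}

A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1

D = np.diag(A.sum(axis=1))   # node degrees on the diagonal
L = D - A                    # Laplacian matrix

# Eigenvalues in increasing order; the first is (numerically) zero.
w, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]         # eigenvector of the second-smallest eigenvalue

left = [n for n, x in zip(nodes, fiedler) if x < 0]
right = [n for n, x in zip(nodes, fiedler) if x >= 0]
print(round(w[0], 6), left, right)   # expect a split close to {A, B, C} vs {D, E, F, G}
```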
Reference

 Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman,


Chapter 10
