Facebook Friend Recommendation
Capstone Project
Perumalla Siva Krishna Reddy
February 18th, 2019
1) Definition
Project overview:
In this project, we are going to work on Facebook friend suggestion, which is also called a link prediction problem.
Friend suggestions appear based on your friends, location, interests, studies, college, year of education and your profile details. The details below will make this clearer.
5. Interest: Sometimes Facebook also shows suggestions based on your interests; if there are multiple candidate suggestions, the ones that best match your interests are shown to you.
However, you never know the exact reason why someone else's profile appears in your suggestions feed, but it is certain that one of the above reasons is working behind it.
Reference link:
https://www.quora.com/How-do-new-friend-suggestions-appear-on-the-notifications-in-Facebook
But here we have only a directed graph of the users, and we are not given any other information about them. So we are going to use supervised machine learning techniques by converting the given data into a binary classification dataset.
Facebook is the biggest social network in the world, and billions of users use it daily, linking with their friends. Due to privacy and community guidelines, we are given only the directed graph between users, so we shift the problem from a recommendation problem to a link prediction problem.
Link prediction is an important task in network science that offers unique ways
whereby the study of networks can benefit researchers and organizations in a variety
of fields. In medicine and biology, link prediction can be used to find relationships and
associations that exist, but which might otherwise surface only after arduous and
expensive research and study on a huge selection of agents. Finally, researchers can
easily adapt link prediction methods to identify links that are surprising given their
surrounding network, or which may not belong at all. Put simply, any environment that
naturally maps to a network probably has an equally coherent mapping from link
prediction in that network back to an important question in the environment.
Problem statement:
This project mainly focuses on the problem of extracting features from the given directed graph. We then apply suitable machine learning models and improve their performance using either RandomizedSearchCV or GridSearchCV.
The data has only two columns:
-- source_node : user id of the user who follows the second user
-- destination_node : user id of the user who is followed by the first user
The given dataset contains only class1 points (pairs of users with a link between them); we will add an equal number of class0 points (pairs of users with no link between them), so that the data is balanced and converted into a binary classification problem.
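A minimal sketch of how the class0 (missing-link) pairs could be generated; the file name train.csv and the column names source_node and destination_node follow the Kaggle data description, everything else is illustrative:

import random
import pandas as pd

# load the given edge list (class1 pairs)
train_df = pd.read_csv("train.csv")  # columns: source_node, destination_node
edges = set(zip(train_df.source_node, train_df.destination_node))
nodes = list(set(train_df.source_node) | set(train_df.destination_node))

# sample random node pairs that are not edges until class0 is as large as class1
random.seed(25)
missing_edges = set()
while len(missing_edges) < len(edges):
    u, v = random.choice(nodes), random.choice(nodes)
    if u != v and (u, v) not in edges:
        missing_edges.add((u, v))

class0_df = pd.DataFrame(list(missing_edges), columns=["source_node", "destination_node"])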
We will apply the machine learning models Random Forest classifier and XGBoost classifier, as they work very well on this type of problem with a small number of features. We will take a benchmark model to verify whether our models are working well or not. We could use a naive model, but a naive model would have low performance, so in this problem we will take Logistic Regression, which works very well on large datasets (and we have a large dataset).
The outputs we are predicting are either 0 or 1 with the given input as (user1_ID,
user2_ID). Output zero means there will be no link between the given two users and
output 1 means user1 may follow user2, so we suggest user2 to user1.
Metrics:
First, the given problem needs good values for both recall and precision, so we will use f1_score as a performance metric.
We are given only class1 data, and we convert the problem into a two-class classification problem by generating class0 points equal in number to the class1 points, so our dataset will be balanced. Since accuracy is a good performance metric for balanced datasets, we will also take accuracy as a performance metric.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1_score = 2 * Precision * Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
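A small sketch of computing these metrics with scikit-learn; the label arrays here are dummy values just to make the snippet self-contained:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [0, 1, 1, 0, 1, 1]   # true labels (illustrative)
y_pred = [0, 1, 0, 0, 1, 1]   # model predictions (illustrative)

print("f1_score :", f1_score(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))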
https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-
2-86d5649a5428
2) Analysis
Data Exploration:
The dataset I am working with is downloaded from Kaggle, from the Facebook recruiting competition. Link for the data: https://www.kaggle.com/c/FacebookRecruiting/data
The dataset was released by Facebook in compliance with user privacy and community guidelines. The total number of datapoints in the dataset is 9,437,519, which means there are 9.4 million edges in the directed graph.
Here we have taken 20 sample rows of data and plotted the directed graph. In these 20 data points there are 26 nodes (users) and 20 edges, i.e. 20 links between those 26 users.
A directed edge from 3 to 176995 means the user with id 3 follows the user with id 176995.
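A minimal sketch of how such a sample can be drawn with networkx and matplotlib; the file name and output name are assumptions for illustration:

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# read a small sample of the edge list and draw it as a directed graph
sample_df = pd.read_csv("train.csv", nrows=20)  # columns: source_node, destination_node
g_sample = nx.from_pandas_edgelist(sample_df, "source_node", "destination_node",
                                   create_using=nx.DiGraph())
nx.draw_networkx(g_sample, arrows=True, node_size=50, font_size=6)
plt.savefig("sample_graph.png")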
→ Number of users who are not following anyone: 274512 (14.74%)
→ Number of users who have zero followers: 188043 (10.10%)
→ Number of users who are neither following anyone nor have any followers: 0
This means that every user has at least one edge: each user either follows someone or is followed by someone.
Exploratory visualizations:
As per the given data, we have a directed graph with user ids. We do not have any other features with numeric values, so the only possible visualizations are directed graphs, and we have already taken 20 samples and plotted the graph, because we cannot observe the complete graph with 9.4 million edges.
We have already discussed the graph. The main purpose of this project is
extracting the various features that are useful to convert the problem into a binary
classification problem. We will discuss the feature extraction techniques thoroughly in
the data preprocessing section.
Algorithms and techniques:
Since we have formulated the problem as a binary classification problem, we will use classification algorithms which work well for our dataset. Random Forest and XGBoost work very well with few features, but we also take Logistic Regression as a benchmark model because it can work with our large dataset very efficiently, and it lets us compare whether our models are working better or not.
1) Logistic Regression:
Logistic Regression is one of the most used Machine Learning algorithms for
binary classification. It is a simple Algorithm that you can use as a performance
baseline, it is easy to implement and it will do well enough in many tasks. Therefore
every Machine Learning engineer should be familiar with its concepts. The building
block concepts of Logistic Regression can also be helpful in deep learning while building
neural networks.
The model predicts the probability of class 1 as sigmoid(z) = 1 / (1 + e^(-z)), where z = w*x + b.
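A tiny numpy sketch of this scoring function, for illustration only (not the sklearn implementation):

import numpy as np

def predict_proba(x, w, b):
    # logistic regression score: sigmoid of the linear combination z = w*x + b
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

print(predict_proba(np.array([0.2, 1.5]), np.array([0.4, -0.3]), 0.1))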
Advantages
It is a widely used technique because it is very efficient, does not require too many computational resources, and is highly interpretable.
Disadvantages
Parameters
2) Random Forest Classifier:
Reference:
http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/
Advantages
An advantage of random forest is that it can be used for both regression and
classification tasks and that it’s easy to view the relative importance it assigns to the
input features. Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result.
The number of hyperparameters is also not that high and they are straightforward to
understand.
Disadvantages
The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions.
Parameters
sklearn.ensemble.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=9, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=130, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
3) XGBoost Classifier (Gradient Boosted Decision Trees):
Gradient boosting builds an additive model in stages:
1) Initialize the model with a constant value: F_0(x) = argmin_γ Σ_i L(y_i, γ)
2) For m = 1 to M:
a) Compute the so-called pseudo-residuals: r_im = -[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F = F_(m-1), for i = 1, ..., n
b) Fit a base learner (e.g. a tree) h_m(x) to the pseudo-residuals, i.e. train it using the training set {(x_i, r_im)}
The following comparison shows the difference between gradient boosted decision trees and the Random Forest classifier.
Comparing the two methods, Random Forests are faster to train, but they often require
deeper trees than GBTs to achieve the same error. GBTs can further reduce the error
with each iteration, but they can begin to overfit (increase test error) after too many
iterations. Random Forests do not overfit as easily, but their test error plateaus.
Reference link:
https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html
Advantages
GBTs build trees one at a time, where each new tree helps to correct errors made by
the previously trained tree. With each tree added, the model becomes even more
expressive. There are typically three parameters - the number of trees, the depth of trees and the learning rate - and each tree built is generally shallow.
Disadvantages
GBDT training generally takes longer because the trees are built sequentially. However, benchmark results have shown that GBDTs are often better learners than Random Forests.
Parameters
sklearn.ensemble.GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=5, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=130, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
Benchmark model
In our project, we are going to use the logistic regression model as the benchmark
model. We will calculate the performance metrics accuracy and f1_score by using the
best parameters for logistic regression.
We will take accuracy and f1_score values as performance metrics.
Train f1_score = 0.9072
Test f1_score = 0.9082
Train accuracy = 91.33%
Test accuracy = 91.43%
3) Methodology
1)Data preprocessing
In our project, data preprocessing is the most important part. We are given only two columns with user ids, so we have to extract useful features using graph mining techniques.
In this process, we are going to use the python library networkx, which has all the
predefined functions and methods regarding the graph theory. This process involves
several steps in extracting different types of features.
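A minimal sketch of loading the edge list into a networkx directed graph; the file name is an assumption, and the file is assumed to contain one source,destination pair per line with no header row:

import networkx as nx

# build the directed follower graph from the edge list
g = nx.read_edgelist("train_edges.csv", delimiter=",",
                     create_using=nx.DiGraph(), nodetype=int)
print(g.number_of_nodes(), g.number_of_edges())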
Similarity measures:
i) Jaccard Distance:
Reference link: http://www.statisticshowto.com/jaccard-index/
This value indicates how similar two sets are. The maximum value is 100%, which means the two sets share all their members. In our problem, each set is the set of users followed by (or following) a user.
jaccard_distance(user1, user2) = (number of users followed by both user1 and user2) * 100 / (total number of unique users followed by user1 or user2)
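A sketch of this feature for followees (an analogous function can be written for followers), assuming the networkx DiGraph g built above:

def jaccard_for_followees(g, a, b):
    # |followees(a) ∩ followees(b)| / |followees(a) ∪ followees(b)|
    try:
        fa, fb = set(g.successors(a)), set(g.successors(b))
        if not fa or not fb:
            return 0
        return len(fa & fb) / len(fa | fb)
    except Exception:
        return 0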
ii) Cosine distance:
Reference: https://faculty.nps.edu/rgera/MA4404/Winter2018/15-nodeSimilarity.pdf
Cosine_distance(user1, user2) = (number of users common to user1's and user2's sets) / sqrt(number of users in user1's set * number of users in user2's set)
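A sketch of this measure (the Otsuka-Ochiai form of cosine similarity) for followees, again assuming the graph g:

import math

def cosine_for_followees(g, a, b):
    # |A ∩ B| / sqrt(|A| * |B|)
    try:
        fa, fb = set(g.successors(a)), set(g.successors(b))
        if not fa or not fb:
            return 0
        return len(fa & fb) / math.sqrt(len(fa) * len(fb))
    except Exception:
        return 0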
iii) Preferential attachment:
Preferential attachment assumes that users who already have many connections are more likely to gain new ones; the score for a pair of users is the product of the sizes of their neighbourhoods.
Ranking measures:
i) Page Ranking:
Reference: https://en.wikipedia.org/wiki/PageRank
https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
PageRank computes a ranking of the nodes in the graph G based on the structure of the incoming links. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. Here we treat each user as a webpage.
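A sketch of these two measures on the graph g; the damping factor and the fallback to the mean score for unseen nodes are assumptions:

import networkx as nx

def preferential_attachment(g, a, b):
    # product of the followee-set sizes of the two users (followers could be used analogously)
    try:
        return len(set(g.successors(a))) * len(set(g.successors(b)))
    except Exception:
        return 0

# PageRank of every user in the directed follower graph
pr = nx.pagerank(g, alpha=0.85)
mean_pr = sum(pr.values()) / len(pr)

def pagerank_features(u, v):
    # fall back to the mean PageRank for nodes that are missing from the train graph
    return pr.get(u, mean_pr), pr.get(v, mean_pr)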
Other features:
i) Shortest path:
We compute the shortest path between the two nodes; if the nodes are directly connected, we remove that edge first and then calculate the path. If there is no path, we take the value as -1.
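A sketch of this feature with networkx, assuming the directed graph g:

import networkx as nx

def compute_shortest_path_length(g, a, b):
    # if a follows b directly, temporarily remove that edge so the feature
    # measures the indirect distance; return -1 when no path exists
    removed = False
    try:
        if g.has_edge(a, b):
            g.remove_edge(a, b)
            removed = True
        return nx.shortest_path_length(g, source=a, target=b)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return -1
    finally:
        if removed:
            g.add_edge(a, b)  # restore the original graph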
ii) Same weakly connected component (community):
A weakly connected component is one in which all nodes are connected by some path, ignoring direction. If two users belong to the same weakly connected component, they can be considered to be in the same community.
Reference: https://www.quora.com/What-are-strongly-and-weakly-connected-components
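A sketch of checking whether two users fall in the same weakly connected component of g:

import networkx as nx

# precompute the weakly connected components of the directed graph once
wcc = list(nx.weakly_connected_components(g))

def belongs_to_same_wcc(a, b):
    # 1 if both users are in the same weakly connected component, else 0
    for component in wcc:
        if a in component and b in component:
            return 1
    return 0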
iv) follows_back:
This feature captures whether the second user is already following the first one back. If in the test data we are given (user1, user2) as input, we extract follows_back as 1 if user2 is already following user1, otherwise 0.
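A one-line sketch of this feature on the graph g:

def follows_back(g, a, b):
    # 1 if the destination user already follows the source user, else 0
    return 1 if g.has_edge(b, a) else 0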
Katz Centrality:
References: https://en.wikipedia.org/wiki/Katz_centrality
https://www.geeksforgeeks.org/katz-centrality-centrality-measure/
It is a measure of centrality in the network. It is used to measure the relative degree of
influence of a node within a social network.
x_i = α * Σ_j A_ij * x_j + β
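A sketch of computing Katz centrality with networkx; the alpha and beta values are assumptions, and the iteration may converge slowly on very large graphs:

import networkx as nx

# Katz centrality for every node; alpha should be smaller than the reciprocal
# of the largest eigenvalue of the adjacency matrix for convergence
katz = nx.katz_centrality(g, alpha=0.005, beta=1)
mean_katz = sum(katz.values()) / len(katz)

def katz_features(u, v):
    # fall back to the mean score for nodes missing from the train graph
    return katz.get(u, mean_katz), katz.get(v, mean_katz)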
HITS score:
Reference: https://en.wikipedia.org/wiki/HITS_algorithm
The HITS algorithm computes two numbers for a node. Authorities estimate the node value based on the incoming links; hubs estimate the node value based on the outgoing links.
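A sketch of the HITS scores with networkx on the graph g:

import networkx as nx

# hub and authority scores for every node in the directed graph
hubs, authorities = nx.hits(g, max_iter=100)

def hits_features(u, v):
    # hub score of the source user and authority score of the destination user
    return hubs.get(u, 0), authorities.get(v, 0)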
In addition to the graph mining techniques to extract the features we also add some of
the statistical features:
1) Num_followers: here we take the number of followers of the source node and of the destination node as two separate features, plus one more feature: the number of users in the intersection of the two users' follower sets.
2) Num_followees: here we take the number of followees (users followed by that user) of the source node and of the destination node as two separate features. The third feature, as with followers, is the number of users in the intersection of the two users' followee sets. Hence we add 6 new features using the numbers of followers and followees.
3) Weight features: in order to determine the similarity of nodes, an edge weight value is calculated between nodes. The edge weight decreases as the neighbour count goes up. Intuitively, if one million people follow a celebrity on a social network, then chances are most of them never met each other or the celebrity. On the other hand, if a user has 30 contacts in his/her social network, the chances are higher that many of them know each other (see the sketch after the list below). Credit: Graph-based Features for Supervised Link Prediction, William Cukierski, Benjamin Hamner, Bo Yang.
W = 1 / √(1 + |X|)
➔ the weight of incoming edges
➔ the weight of outgoing edges
➔ the weight of incoming edges + weight of outgoing edges
➔ the weight of incoming edges * weight of outgoing edges
➔ 2*weight of incoming edges + weight of outgoing edges
➔ the weight of incoming edges + 2*weight of outgoing edges
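A sketch of these weight features on the graph g; weight_in uses followers (in-edges) and weight_out uses followees (out-edges):

import math

def weight_features(g, node):
    # W = 1 / sqrt(1 + |X|), where |X| is the neighbour count in each direction
    w_in = 1.0 / math.sqrt(1 + len(set(g.predecessors(node))))
    w_out = 1.0 / math.sqrt(1 + len(set(g.successors(node))))
    return {
        "weight_in": w_in,
        "weight_out": w_out,
        "weight_sum": w_in + w_out,
        "weight_product": w_in * w_out,
        "weight_2in_plus_out": 2 * w_in + w_out,
        "weight_in_plus_2out": w_in + 2 * w_out,
    }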
Implementation:
Before implementing the algorithms, we split the data into train and test sets in an 80:20 ratio and extract the features from each split separately.
After extracting all the useful features, we apply the machine learning algorithms from sklearn. We have no ready-made features in the provided dataset, so feature extraction is the most difficult task, and it requires a lot of research into graph theory and graph mining techniques.
We have taken Logistic Regression as the benchmark model because using a naive model is not useful in our case: all the machine learning algorithms would beat a naive model, so we would not be able to tell which algorithm is best for our data. We chose Logistic Regression because our dataset is very large and it works very well on large datasets.
We have taken two algorithms, Logistic Regression and the Random Forest classifier, with their default values from the sklearn library; the XGBoost classifier is taken from the xgboost package.
I have applied the three algorithms, Logistic Regression (benchmark model), Random Forest classifier and XGBoost classifier, and computed the f1_score and accuracy for all three models.
Model name Train f1_score Test f1_score Train accuracy Test accuracy
The Random Forest classifier works very well, with 99% train accuracy and a train f1_score of 0.99, but it shows some overfitting on the data, as the test accuracy is only 91%. So we may prefer the XGBoost classifier, as it shows less overfitting compared to the Random Forest classifier. Both of our models work well compared to the benchmark model.
Refinement:
In this part, I use the random search method for each of the three classifiers to refine their hyperparameters. In my research on hyperparameter tuning, I found the following parameters of the three models to be worth tuning. The parameters we have taken for tuning each model are as follows….
After running the random search, the refined hyperparameters obtained are….
Model name Train f1_score Test f1_score Train accuracy Test accuracy
In this case, we get the best result with the XGBoost classifier, but it shows some overfitting. The Random Forest also gives a good result, with less overfitting, so we prefer Random Forest over XGBoost. These results may vary with more data, but for now we take them into consideration.
4) Result
The chosen model has little variation between train and test performance. We do not have to test the model again using K-fold cross-validation, because we have already done 10-fold cross-validation as part of the random search.
In the code for the random search, we provided the cv parameter as 10, which means the model is checked using 10-fold cross-validation.
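A sketch of such a random search for the Random Forest classifier; the parameter ranges and scoring choice are assumptions for illustration:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(50, 200),     # assumed search ranges
    "max_depth": randint(3, 15),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=25),
    param_distributions=param_dist,
    n_iter=10,
    cv=10,          # 10-fold cross-validation, as described above
    scoring="f1",
    random_state=25,
    n_jobs=-1,
)
# search.fit(X_train, y_train)   # X_train, y_train: extracted feature matrix and labels
# print(search.best_params_, search.best_score_)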
Finally, we can conclude that we got robust results, and there is no need to check the model again by doing perturbation with k-fold cross-validation.
Reference link:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.Randomize
dSearchCV.html
Justification
The benchmark model has an accuracy score of 91.43%, while the optimized model obtained an accuracy score of 93.39%. So this model performs well, and the confusion matrix report shows good recall and precision.
Our model shows only a small improvement in test accuracy while the training accuracy is very high; we could likely get better test accuracy by using more data.
5) Conclusion
Freeform visualizations:
Train confusion matrix:
There were very few data points that were predicted wrongly in both class0 and class1. Among the wrongly predicted data points, class1 points are more numerous than class0 points.
Feature importance:
We have plotted the top 25 important features based on their feature importance. follows_back has the highest importance, because if the other user is following back there is a higher chance that our user will follow that user.
Reflection
As an active social media user, I was interested in how a social media network, especially Facebook, suggests friends to me and what process is involved in the background. So I did some research and found the competition by Facebook on Kaggle. In this process, I have learned a lot of things.
1) The first thing I learned is that we will not always be provided with a good dataset with all the features given upfront.
2) We have to do research in the field of the problem; in this project I learned interesting topics of graph theory, and there are many excellent libraries for graphs in Python.
3) How to choose the machine learning algorithm and how to choose the performance metric to measure the performance of the model.
4) Initially, I got quite low performance from the models, and then I added more statistical features as well, using the numeric values from the graph.
5) Finally, I learned what real world problems look like, how we have to approach them to get a solution, and how we can apply machine learning models to solve them.
Improvement:
In real world scenarios, suggesting a friend depends on a lot of factors such as location, education, family, geography, interests of the users, etc. But we have only user ids in a directed graph and no other information. So if we had all the other features and details, we could definitely improve the quality of the machine learning models.
Even within the given dataset, if we used all the given data for training we might achieve better results; and if we knew what Facebook knows about a user, we could engineer many more features from that data and achieve better results.