
Machine Learning Nanodegree

Capstone project
Facebook Friend Recommendation
Perumalla Siva Krishna Reddy
February 18th 2019

Facebook Friend Recommendation

1) Definition
Project overview:

In this project, we are going to build a Facebook friend suggestion system, which is also known as a link prediction problem.

The general friend suggestion problem is driven by the following signals: friend suggestions appear on the basis of your friends, location, interests, studies, college, year of education and your profile details. The points below make this clearer.

1. Friends of friends: if your friend has a common connection with someone, that person may appear in your suggestion feed.
2. College & year of study: Facebook identifies people who studied in the same college or school in the same year, so if someone entered the same details as you there is a chance you will see their profile in your suggestions.
3. Location: Facebook also shows suggestions based on location. If two profiles show that both users have been in the same city for a long time, there is a chance they know each other, and Facebook makes a suggestion on that basis.
4. Family & relatives: as with friends, if you have added a family member to your friend list, there is a chance that relatives of that person will also see your profile in their suggestion feed.
5. Interests: sometimes Facebook also makes suggestions based on your interests; if there are multiple candidate suggestions, the ones that best match your interests are shown to you.

You never know the exact reason why someone else's profile appears in your suggestion feed, but it is certain that one of the above reasons is working behind it.

Reference link:
https://www.quora.com/How-do-new-friend-suggestions-appear-on-the-notifications-in-Facebook
Here, however, we only have the directed graph of the users and no other information about them, so we are going to use supervised machine learning techniques by converting the given data into a binary classification dataset.

Facebook is the biggest social network in the world, and billions of users use it daily by linking with their friends. Because of privacy and community guidelines, we are given only the directed graph between users, so we shift the problem from a recommendation problem to a link prediction problem.

Link prediction is an important task in network science that offers unique ways
whereby the study of networks can benefit researchers and organizations in a variety
of fields. In medicine and biology, link prediction can be used to find relationships and
associations that exist, but which might otherwise surface only after arduous and
expensive research and study on a huge selection of agents. Finally, researchers can
easily adapt link prediction methods to identify links that are surprising given their
surrounding network, or which may not belong at all. Put simply, any environment that
naturally maps to a network probably has an equally coherent mapping from link
prediction in that network back to an important question in the environment.

Reference link: https://www.cs.cornell.edu/home/kleinber/link-pred.pdf

Problem statement:

This project mainly focuses on extracting features from the given directed graph, applying suitable machine learning models, and improving the performance of the models using either RandomizedSearchCV or GridSearchCV.
The data has only two columns

--source_node : userid of the first user

--destination_node : userid of the user who is followed by the first user

The given dataset contains only class-1 points (pairs of users with a link between them). We will add an equal number of class-0 points (pairs of users with no link between them), so that the data becomes balanced and the problem is converted into a binary classification problem.
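As a sketch of how those class-0 pairs can be generated (assuming the edge list is loaded from the competition's train.csv with pandas; the file name and column names follow the Kaggle data description):

import random
import pandas as pd

edges = pd.read_csv("train.csv")  # columns: source_node, destination_node
edge_set = set(zip(edges.source_node, edges.destination_node))
nodes = list(set(edges.source_node) | set(edges.destination_node))

missing_edges = set()
while len(missing_edges) < len(edge_set):              # as many class-0 pairs as class-1
    u, v = random.choice(nodes), random.choice(nodes)
    if u != v and (u, v) not in edge_set:               # keep only pairs with no link
        missing_edges.add((u, v))

class0 = pd.DataFrame(sorted(missing_edges), columns=["source_node", "destination_node"])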

We will apply a Random Forest classifier and an XGBoost classifier, as they work very well for this type of problem with few features. We will use a benchmark model to verify whether our models are performing well. A naive model could be used as the benchmark, but it would have low performance, so in this problem we take logistic regression, which works very well on large datasets (and we have a large dataset here).

The output we predict is either 0 or 1 for a given input pair (user1_ID, user2_ID). An output of 0 means there is no link between the two users, and an output of 1 means user1 may follow user2, so we suggest user2 to user1.

Metrics:

The performance of any classification model can be evaluated using statistical measures built from true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

First the given problem needs good values for both recall and precision. So we will use
f1_score as a performance metric.

We are given only class-1 data and convert the problem into a two-class classification problem by adding class-0 points equal in number to the class-1 points, so our dataset is balanced. Since accuracy is a good performance metric for balanced datasets, we will also use accuracy as a performance metric.

Accuracy = (TP+TN) / (TP+TN+FP+FN)

Precision = (TP) / (TP+FP)

Recall = TP / (TP+FN)

F1_score = 2 * (Precision * Recall) / (Precision + Recall).

https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-
2-86d5649a5428
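A minimal sketch of computing these metrics with scikit-learn (the y_true / y_pred arrays below are purely illustrative):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]   # actual classes
y_pred = [0, 1, 0, 0, 1, 1]   # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1_score :", f1_score(y_true, y_pred))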
2) Analysis

Data Exploration:

The dataset I am working with was downloaded from Kaggle, from the Facebook recruiting competition. Link for the data: https://www.kaggle.com/c/FacebookRecruiting/data

The dataset was released by Facebook in compliance with user privacy and community guidelines. The total number of data points in the dataset is 9,437,519, which means there are 9.4 million edges in the directed graph.

Number of nodes: 1862220 (number of users)
Number of edges: 9437519 (number of links between users)
Node: a node is a user.
Edge: an edge is a link between two users (nodes).

The directed graph is represented as follows. We took a sample of 20 edges and plotted the directed graph; these 20 edges connect 26 nodes (users), so there are 20 links between those 26 users. A directed edge from 3 to 176995 means the user with id 3 follows the user with id 176995.
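A hedged sketch of how such a sample plot can be produced with networkx (the file name and column names are assumptions based on the Kaggle data description):

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

sample = pd.read_csv("train.csv", nrows=20)                     # first 20 edges only
g = nx.from_pandas_edgelist(sample, "source_node", "destination_node",
                            create_using=nx.DiGraph())
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
nx.draw(g, with_labels=True, node_size=300, font_size=6, arrows=True)
plt.savefig("sample_directed_graph.png")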

Number of users following each user:


99% of users follow fewer than 40 other users, while the remaining 1% follow up to nearly 500 users.

Number of users followed by each user:


99% of users are followed by fewer than 40 other users, while the remaining 1% are followed by up to nearly 1500 users.

→ Number of users who do not follow anyone: 274512 (14.74%)
→ Number of users who have zero followers: 188043 (10.10%)
→ Number of users who neither follow anyone nor have any followers: 0
This means every user has at least one edge.

Exploratory visualizations:
As per the given data, we have a directed graph of user ids. We do not have any other numeric features, so the only possible visualizations are directed graphs; we have already plotted a 20-edge sample, because the complete graph with 9.4 million edges cannot be drawn or inspected as a whole. The main purpose of this project is extracting the various features that are useful for converting the problem into a binary classification problem. We discuss the feature extraction techniques thoroughly in the data preprocessing section.
Algorithms and techniques:
Since we have formulated the problem as a binary classification problem, we will use classification algorithms that work well for our dataset. Random Forest and XGBoost work very well with few features, and we also take logistic regression as a benchmark model because it handles our large dataset efficiently and lets us check whether our models perform better.

1)Logistic Regression:
Logistic Regression is one of the most used Machine Learning algorithms for
binary classification. It is a simple Algorithm that you can use as a performance
baseline, it is easy to implement and it will do well enough in many tasks. Therefore
every Machine Learning engineer should be familiar with its concepts. The building
block concepts of Logistic Regression can also be helpful in deep learning while building
neural networks.

Logistic regression works as follows.


If X is the given input data consisting of n input vectors, then the decision boundary used to classify the data is the sigmoid function, defined by

y = 1 / (1 + e^(-z)), where z = w·x + b

The sigmoid function is as shown in the figure. The possible values of y lie between 0 and 1, so y represents the probability value for input x. If y <= 0.5 the output class is 0, and if y > 0.5 the output class is 1. The logistic function separates the given data in this manner.

Reference link: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

Advantages

It is a widely used technique because it is very efficient, does not require many computational resources, is highly interpretable, does not require input features to be scaled, requires little tuning, is easy to regularize, and outputs well-calibrated predicted probabilities.

Disadvantages

Logistic regression is not able to handle a large number of categorical features/variables. It is vulnerable to overfitting. It also cannot solve non-linear problems, which is why non-linear features need to be transformed. Logistic regression does not perform well when the independent variables are uncorrelated with the target variable or are very similar/correlated to each other. Here we have few features, while logistic regression works most efficiently with a large number of features, so it may overfit in our case.

Parameters

class sklearn.linear_model.LogisticRegression(C=30, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
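A minimal, self-contained sketch of fitting this benchmark model; the dummy data generated here only stands in for the real extracted graph features:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

# dummy stand-in for the extracted graph features (Jaccard, PageRank, follows_back, ...)
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LogisticRegression(C=30, penalty="l2", solver="liblinear")
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print("Test f1_score :", f1_score(y_test, pred))
print("Test accuracy :", accuracy_score(y_test, pred))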

2) Random Forest Classifier:


Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks. Our data has few features, so the random forest classifier works very well.

A decision tree classifies the input data by making decisions on the features of the input vector. An example decision tree looks like the image at the side: it decides whether I should go to a restaurant or buy a hamburger using a few features such as "Am I hungry?" and "How much money do I have?". If I am hungry and I have $25, I will go to a restaurant; otherwise, I will buy a hamburger.

Random forest uses many decision trees as estimators and finally takes one output from all the decision tree outputs based on majority voting.

Reference:
http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/

Advantages
An advantage of random forest is that it can be used for both regression and classification tasks and that it is easy to view the relative importance it assigns to the input features. Random Forest is also considered a very handy and easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. The number of hyperparameters is not that high and they are straightforward to understand.

Disadvantages
The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions.

Parameters
class sklearn.ensemble.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=9, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=130, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
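A short sketch of fitting the random forest with roughly the parameters listed above, reusing the X_train / y_train placeholders from the logistic-regression sketch (they are stand-ins, not the real feature matrices):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=130, max_depth=9, criterion="gini",
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)                 # placeholders from the earlier sketch
print("Test accuracy:", rf.score(X_test, y_test))
print("Largest feature importances:", sorted(rf.feature_importances_, reverse=True)[:5])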

3) Gradient Boosting classifier(XGBoost):


In gradient boosting, many models are trained sequentially. Each new model gradually minimizes the loss function (y = ax + b + e, where e is the error term and needs special attention) of the whole system using a gradient descent method. The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to construct new base learners that are maximally correlated with the negative gradient of the loss function associated with the whole ensemble.

Gradient boosting algorithm works as follows:


Input: the training set {(x_i, y_i)}, i = 1..n, a differentiable loss function L(y, F(x)), and the number of iterations M.

1) Initialize the model with a constant value:
   F_0(x) = argmin_γ Σ_{i=1..n} L(y_i, γ)

2) For m = 1 to M:

   a) Compute the so-called pseudo-residuals:
      r_im = -[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F = F_{m-1}, for i = 1..n

   b) Fit a base learner (e.g. a tree) h_m(x) to the pseudo-residuals, i.e. train it using the set {(x_i, r_im)}, i = 1..n.

   c) Compute the multiplier γ_m by solving the following one-dimensional optimization problem:
      γ_m = argmin_γ Σ_{i=1..n} L(y_i, F_{m-1}(x_i) + γ h_m(x_i))

   d) Update the model:
      F_m(x) = F_{m-1}(x) + γ_m h_m(x)

Reference link: https://en.wikipedia.org/wiki/Gradient_boosting

The following two images will show the difference between Gradient boosted decision trees and
Random Forest classifier.
Comparing the two methods, Random Forests are faster to train, but they often require
deeper trees than GBTs to achieve the same error. GBTs can further reduce the error
with each iteration, but they can begin to overfit (increase test error) after too many
iterations. Random Forests do not overfit as easily, but their test error plateaus.
Reference link:
https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html

Advantages
GBTs build trees one at a time, where each new tree helps to correct the errors made by the previously trained trees. With each tree added, the model becomes more expressive. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow.

Disadvantages

GBDT training generally takes longer because the trees are built sequentially. However, benchmark results have shown that GBDTs are better learners.
Parameters
class sklearn.ensemble.GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=5, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=130, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
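Since the implementation later uses the xgboost package rather than sklearn's GradientBoostingClassifier, here is a hedged sketch using xgboost's scikit-learn wrapper with comparable parameters (again reusing the placeholder X_train / y_train from the earlier sketch):

from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=130, max_depth=5, learning_rate=0.1, subsample=1.0)
xgb.fit(X_train, y_train)                 # placeholders from the earlier sketch
print("Test accuracy:", xgb.score(X_test, y_test))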

Benchmark model
In our project, we are going to use the logistic regression model as the benchmark
model. We will calculate the performance metrics accuracy and f1_score by using the
best parameters for logistic regression.
We will take accuracy and f1_score values as performance metrics.
Train f1_score = 0.9072
Test f1_score = 0.9082
Train accuracy = 91.33%
Test accuracy = 91.43%

3) Methodology

1) Data preprocessing

In our project, data preprocessing is the most important part. We are given only two columns with user ids, and we have to extract useful features using graph mining techniques. In this process we use the Python library networkx, which provides predefined functions and methods for graph theory. The process involves several steps for extracting different types of features.

Similarity measures:

i) Jaccard Distance:
Reference link: http://www.statisticshowto.com/jaccard-index/

This value indicates how similar two sets are. The maximum value is 100%, which means the two sets share all their members. In our problem, each set is the set of users a given user follows (or is followed by).
jaccard_distance(user1, user2) = (number of users followed by both user1 and user2) * 100 / (total number of unique users followed by user1 and user2)
ii) Cosine distance:

Reference: https://faculty.nps.edu/rgera/MA4404/Winter2018/15-nodeSimilarity.pdf

Cosine_distance = (number of users followed by both user1 and user2) / sqrt(number of users followed by user1 * number of users followed by user2)
iii) Preferential attachment:

Reference link: https://en.wikipedia.org/wiki/Preferential_attachment


Pref_attach = (number of neighbours of user1) * (number of neighbours of user2)
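A possible implementation of these three similarity measures, assuming g is the directed networkx graph built from the edge list (the helper names and the toy graph are only illustrative):

import networkx as nx

def jaccard_followees(g, u, v):
    a, b = set(g.successors(u)), set(g.successors(v))        # users that u and v follow
    return len(a & b) / len(a | b) if (a | b) else 0

def cosine_followees(g, u, v):
    a, b = set(g.successors(u)), set(g.successors(v))
    return len(a & b) / ((len(a) * len(b)) ** 0.5) if (a and b) else 0

def preferential_attachment(g, u, v):
    return g.out_degree(u) * g.out_degree(v)

g = nx.DiGraph([(1, 2), (1, 3), (4, 2), (4, 3), (4, 5)])
print(jaccard_followees(g, 1, 4), cosine_followees(g, 1, 4), preferential_attachment(g, 1, 4))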
Ranking Measures

https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algori
thms.link_analysis.pagerank_alg.pagerank.html

PageRank computes a ranking of the nodes in the graph G based on the structure of
the incoming links.

Example (Wikipedia's PageRank figure): mathematical PageRanks for a simple network, expressed as percentages (Google uses a logarithmic scale). Page C has a higher PageRank than Page E, even though there are fewer links to C; the one link to C comes from an important page and hence is of high value. If web surfers who start on a random page have an 85% likelihood of choosing a random link from the page they are currently visiting, and a 15% likelihood of jumping to a page chosen at random from the entire web, they will reach Page E 8.1% of the time.

i) Page Ranking:

Reference: https://en.wikipedia.org/wiki/PageRank
PageRank works by counting the number and quality of links to a page to determine a
rough estimate of how important the website is. The underlying assumption is that
more important websites are likely to receive more links from other websites.
In our setting, each user plays the role of a webpage.
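A sketch of computing PageRank on the user graph with networkx; using the mean score as a fallback for nodes that only appear in the test split is an assumption about how unseen nodes are handled:

import networkx as nx

g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3)])   # toy graph; the real one has 9.4M edges
pr = nx.pagerank(g, alpha=0.85)                     # dict: node -> PageRank score
mean_pr = sum(pr.values()) / len(pr)
print(pr[3], pr.get(999, mean_pr))                  # known node vs. unseen node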
Other features:

i) Shortest path:

We compute the shortest path between the two nodes; if the nodes are directly connected, we remove that edge before calculating the path. If there is no path at all, we take the value as -1.

ii) Same category of weakly connected components:

A weakly connected component is one in which all nodes are connected by some path, ignoring edge direction. By that definition, this entire graph would be one weakly connected component.
Reference:https://www.quora.com/What-are-strongly-and-weakly-connected-compone
nts

iii) Adamic/Adar Index:


reference: https://en.wikipedia.org/wiki/Adamic/Adar_index

The Adamic/Adar measure is defined as the inverted sum of the degrees of the common neighbours of the two given vertices.

iv) follows_back :
This feature captures whether the second user follows back. If the test data gives (user1, user2) as input, we set follows_back to 1 if user2 already follows user1, and 0 otherwise.
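Hedged sketches of these features on a directed networkx graph g (the function names are illustrative, not the report's actual helpers):

import math
import networkx as nx

def shortest_path_feature(g, u, v):
    had_edge = g.has_edge(u, v)
    if had_edge:
        g.remove_edge(u, v)                 # ignore the direct link while measuring distance
    try:
        d = nx.shortest_path_length(g, source=u, target=v)
    except nx.NetworkXNoPath:
        d = -1                              # no path at all
    if had_edge:
        g.add_edge(u, v)
    return d

def same_weakly_connected_component(g, u, v):
    return int(any(u in c and v in c for c in nx.weakly_connected_components(g)))

def adamic_adar(g, u, v):
    und = g.to_undirected()
    common = set(und.neighbors(u)) & set(und.neighbors(v))
    return sum(1 / math.log(und.degree(n)) for n in common if und.degree(n) > 1)

def follows_back(g, u, v):
    return int(g.has_edge(v, u))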

Katz Centrality:

References: https://en.wikipedia.org/wiki/Katz_centrality
https://www.geeksforgeeks.org/katz-centrality-centrality-measure/
It is a measure of centrality in the network. It is used to measure the relative degree of
influence of a node within a social network.
x_i = α Σ_j A_ij x_j + β

where A is the adjacency matrix of the graph G with largest eigenvalue λ_max. The parameter β controls the initial centrality, and α < 1/λ_max.

Hits(Hyperlink-Induced Topic Search) Score:

The HITS algorithm computes two numbers for a node. Authorities estimate the node
value based on the incoming links. Hubs estimate the node value based on outgoing
links.
Reference: https://en.wikipedia.org/wiki/HITS_algorithm
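Both measures can be computed directly with networkx; a small sketch follows (the alpha value is an assumption and must stay below 1/λ_max):

import networkx as nx

g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3)])
katz = nx.katz_centrality(g, alpha=0.005, beta=1.0)   # alpha < 1 / lambda_max
hubs, authorities = nx.hits(g)                        # hub and authority score per node
print(katz[3], hubs[3], authorities[3])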

In addition to the graph mining techniques to extract the features we also add some of
the statistical features:
1) Num_followers: we take the number of followers of the source node and of the destination node as two separate features, and a further feature is the number of users in the intersection of the two users' follower sets.
2) Num_followees: we take the number of followees (users followed by that user) of the source node and of the destination node as two separate features. The third feature, as with followers, is the number of users in the intersection of the two users' followee sets. Hence we add 6 new features based on the numbers of followers and followees.
3) Weight features: to capture the similarity of nodes, an edge weight value is calculated between nodes. The edge weight decreases as the neighbour count goes up. Intuitively, if one million people follow a celebrity on a social network, chances are most of them have never met each other or the celebrity; on the other hand, if a user has 30 contacts in his/her social network, the chances are higher that many of them know each other. (Credit: Graph-based Features for Supervised Link Prediction, William Cukierski, Benjamin Hamner, Bo Yang.)
W = 1 / √(1 + |X|)
➔ the weight of incoming edges
➔ the weight of outgoing edges
➔ the weight of incoming edges + weight of outgoing edges
➔ the weight of incoming edges * weight of outgoing edges
➔ 2*weight of incoming edges + weight of outgoing edges
➔ the weight of incoming edges + 2*weight of outgoing edges

4) SVD features for both source and destination:


We create an adjacency matrix from the graph and then apply SVD (singular value decomposition) to it. SVD yields three matrices, from which we take 6 features per node. The adjacency matrix is a square matrix used to represent a finite graph; its elements indicate whether pairs of vertices are adjacent in the graph or not.
Reference: https://en.wikipedia.org/wiki/Singular_value_decomposition
https://en.wikipedia.org/wiki/Adjacency_matrix
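A hedged sketch of extracting such SVD features with scipy; the rank k and the toy graph are only illustrative (the real project keeps 6 components per node):

import networkx as nx
from scipy.sparse.linalg import svds

g = nx.DiGraph([(1, 2), (2, 3), (3, 1), (4, 3), (4, 1), (2, 4)])
adj = nx.adjacency_matrix(g).astype(float)      # sparse n_nodes x n_nodes adjacency matrix

u, s, vt = svds(adj, k=2)                        # k=6 in the real project; k=2 for 4 nodes
print(u.shape, s.shape, vt.shape)                # (n_nodes, k), (k,), (k, n_nodes)
# row i of u and column i of vt give the left/right SVD features of node i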

Implementation:
Before implementing the algorithms, we split the data into train and test sets in an 80:20 ratio and extracted the features for each split separately.

After extracting all the useful features, we apply machine learning algorithms from sklearn. The provided dataset has no ready-made features, so feature extraction is the most difficult task; it requires a good deal of research into graph theory and graph mining techniques.

We have taken logistic regression as the benchmark model because a naive model is not useful in our case: all the machine learning algorithms would outperform it, so it would not tell us which algorithm is best for our data. We chose logistic regression because our dataset is very large and logistic regression works very well on large datasets.

Logistic Regression and Random Forest Classifier are used with their default values from the sklearn library, while the XGBoost classifier is taken from the xgboost package.

Reference link Sklearn :


https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegres
sion.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClas
sifier.html
Reference link xgboost: https://xgboost.readthedocs.io/en/latest/parameter.html

I applied the three algorithms, Logistic Regression (benchmark model), Random Forest Classifier and XGBoost Classifier, and recorded the f1_score and accuracy for all three models.

Model name                 | Train f1_score | Test f1_score | Train accuracy | Test accuracy
Logistic Regression        | 0.9074         | 0.9073        | 91.353%        | 91.352%
Random Forest Classifier   | 0.9965         | 0.9129        | 99.66%         | 91.84%
XGBoost Classifier         | 0.9736         | 0.9281        | 97.39%         | 93.22%

The random forest classifier fits the training data very well with 99% train accuracy and an f1_score of 0.99, but it shows some overfitting, since the test accuracy is only 91%. So we may prefer the XGBoost classifier, which shows less overfitting than the random forest classifier. Both of our models clearly outperform the benchmark model.
Refinement:
In this part, I use the random search method for each of the three classifiers to refine their hyperparameters. Based on my research on hyperparameter tuning, the following parameters were selected for tuning in each model:

For Logistic Regression:

Penalty → ["L1", "L2"]
C → [10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10, 10^2, 10^3]

For Random Forest Classifier:


Estimators → [10,50,100,250,450]
Depths → [3,9,11,15,20,35,50,70,130]
Min_samples_split → [2,3,4,5,6],
Min_samples_leaf → [1,2,3,4,5,6]
For XGBoost Classifier:
Max_depth → [1,2,3,4,5,6,7,8,9,10,11,12]
N_estimators → [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170,
180, 190]
Min_child_weight → [1,2,3,4,5,6]
Gamma → [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Colsample_bytree → [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Subsample → [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
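A minimal sketch of the random-search refinement for the random forest, reusing the placeholder X_train / y_train from the earlier model sketches; cv=10 matches the 10-fold cross-validation mentioned in the results section:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"n_estimators": [10, 50, 100, 250, 450],
              "max_depth": [3, 9, 11, 15, 20, 35, 50, 70, 130],
              "min_samples_split": [2, 3, 4, 5, 6],
              "min_samples_leaf": [1, 2, 3, 4, 5, 6]}

search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=10,
                            cv=10, scoring="f1", n_jobs=-1, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)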

After running the random search, the refined hyperparameters obtained are….

1) Logistic regression → {'C': 0.1, 'penalty': 'l2'}


2) Random Forest classifier → {'n_estimators': 50, 'min_samples_split': 3,
'min_samples_leaf': 5, 'max_depth': 70}
3) XGBoost classifier → {'colsample_bytree': 0.5, 'gamma': 3, 'max_depth': 5,
'min_child_weight': 1, 'n_estimators': 180, 'subsample': 0.5}
Now I have used the best hyperparameters for the three algorithms, and the results are:

Model name                 | Train f1_score | Test f1_score | Train accuracy | Test accuracy
Logistic Regression        | 0.9072         | 0.9083        | 91.33%         | 91.43%
Random Forest Classifier   | 0.9750         | 0.9303        | 97.53%         | 93.39%
XGBoost Classifier         | 0.9826         | 0.9264        | 98.27%         | 93.07%

In this case, the XGBoost classifier achieves the best training result but shows some overfitting, while the random forest also gives a strong result with less overfitting, so we prefer the random forest over XGBoost. These results may change with more data, but for now we take them into consideration.

4) Result

Model evaluation and validation:


After the initial implementation and further refinement of the three classifiers, our two models clearly outperform the benchmark model. From the results we conclude that the random forest classifier is the best model. It gives the following values for the two performance metrics:
Train f1 score : 0.9736268857840504
Test f1 score : 0.9281594571670908
Train accuracy score: 0.9739705205895882
Test accuracy score: 0.9322427102915883

It shows little variation between train and test performance. We do not have to test the model separately with K-fold cross-validation, because the random search already performed 10-fold cross-validation: the cv parameter in the random search code was set to 10, which means the model was evaluated with 10-fold cross-validation. Finally, we conclude that the results are robust and there is no need to further check the model by perturbation with k-fold cross-validation.

Reference link:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.Randomize
dSearchCV.html

Justification
The benchmark model has an accuracy score of 91.43%, while the optimized model obtained an accuracy score of 93.39%, so the model performs well, and the confusion matrix report shows good recall and precision. The improvement in test accuracy is small while the training accuracy is very high; we could likely get better test accuracy with more data.

5) Conclusion

Freeform visualizations:
Train confusion matrix: very few data points were predicted wrong in both class 0 and class 1; among the wrongly predicted points, class-1 points are more numerous than class-0 points.

Test confusion matrix: among the wrongly predicted points, class-0 points are fewer in number than class-1 points.

Feature importance: we plotted the top 25 features ranked by their feature importance. Follows_back has the highest importance, because if the other user already follows back, there is a higher chance that our user will follow that user.
Reflection

As an active social media user, I was interested in how social media networks, especially Facebook, suggest friends to me and what processes are involved in the background. While researching this, I found the competition hosted by Facebook on Kaggle. In this process, I learned a lot of things:
1) The first thing I learned is that we will not always be given a good dataset with all the features provided up front.
2) We have to do research in the field of the problem; in this project I learned interesting topics in graph theory and found that there are many excellent graph libraries in Python.
3) How to choose a machine learning algorithm and how to choose a performance metric to measure the performance of the model.
4) Initially I got quite poor performance from the models, and then I added more statistical features derived from the numeric values of the graph.
5) Finally, I learned what real-world problems look like, how to approach them to reach a solution, and how to apply machine learning models to solve them.

Improvement:
In real-world scenarios, suggesting a friend depends on many factors such as location, education, family, geography, and the interests of the users. But we have only user ids in a directed graph and no other information, so if we obtained all the other features and details we could definitely improve the quality of the machine learning models.
Even within the given dataset, using all of the data for training might achieve better results; and if we knew what Facebook knows about a user, we could build many more features from that data and achieve better results.
