Unit 6
We pose this as a binary classification problem, as discussed before, such that existing edges are labelled as “1” and non-existing edges as “0”. For example, with four users the candidate pairs are U1 and U2, U1 and U3, U1 and U4, U2 and U3, U2 and U4, U3 and U4.
● These edges should be labelled as “0” in the data. But this is not
provided in the train.csv file; in other words, we have no zero-labelled
data in the dataset.
● So, we can add those node pairs with no edge between them and label
them as “0” in the data.
● If we have 1.86 million nodes, then the number of possible directed edges
= 1.86 × 10⁶ × (1.86 × 10⁶ − 1) ≈ 3.46 × 10¹², while train.csv contains
only about 9.43 million actual edges.
● To make our data a balanced dataset, we will now randomly sample 9.43
million edges out of all the possible edges which are not present in
train.csv and construct our new dataset.
● Remember, we will consider and generate only those negative (bad) links
for the graph which are not in the graph and whose shortest path between
the two nodes is greater than 2.
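The sampling step above can be sketched in plain Python. This is a minimal, self-contained version; the actual notebook presumably works on the full edge list with networkx/pandas, and `sample_negative_edges` is an illustrative helper name, not the case study's.

```python
import random
from collections import deque

def shortest_path_len(adj, src, dst):
    """BFS shortest path length in a directed graph; -1 if unreachable."""
    if src == dst:
        return 0
    seen, q = {src}, deque([(src, 0)])
    while q:
        node, d = q.popleft()
        for nb in adj.get(node, ()):
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return -1

def sample_negative_edges(edges, n_samples, seed=42):
    """Sample pairs that are not edges and whose shortest path is > 2 (or absent)."""
    rng = random.Random(seed)
    adj, nodes = {}, set()
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        nodes.update((u, v))
    nodes, edge_set = sorted(nodes), set(edges)
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u == v or (u, v) in edge_set or (u, v) in negatives:
            continue
        d = shortest_path_len(adj, u, v)
        if d == -1 or d > 2:  # keep only "hard" negatives, as described above
            negatives.add((u, v))
    return list(negatives)
```

On the real 1.86M-node graph, rejection sampling like this is cheap because almost every random pair is a valid negative.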
In the real world, if you are an employee of Facebook or Instagram, you have to deal
with the temporal nature of the data, so you would need to carry out time-based
splitting. But we are provided only with the graph at a single timestamp, so we
perform a random train–test split instead, and we use the positive training data
only for creating the graph and for feature generation.
● Number of edges in the train data graph = 7,550,015.
● Number of edges in the test data graph = 1,887,504.
● We found the unique nodes in both train positive and test positive
graphs.
● The percentage of people present in Test but not in Train, out of the
total Test data, is 7.1200735962845405 %.
● As 7.12 % of the test data is not already present in the train data, this
leads to a cold-start problem for those nodes.
● We concatenated (y_train_pos, y_train_neg) to get the final train labels.
● We concatenated (y_test_pos, y_test_neg) to get the final test labels.
Now, we will try to convert these pairs of vertices into some numerical, categorical or
binary features to carry out machine learning, as we cannot run algorithms on raw
vertex pairs. For any graph-based machine learning problem, featurization is of course
the most important step.
For the code, refer to the “fb_featurization” ipynb notebook in my GitHub repository.
Featurization should be done after the train–test split to avoid data leakage.
Here we will just focus on how to featurize graph-based data.
Jaccard Index
Given any two sets X and Y, the Jaccard index (Jaccard similarity coefficient) is:
J = |X ∩ Y| / |X ∪ Y|
It is a statistic used for gauging the similarity of sample sets: the size of the
intersection divided by the size of the union. For example, if two users share 2
followers out of 4 distinct followers in total, J = 2/4 = 0.5.
In a nutshell, if there are a lot of common followers between U1 and U2, then U1 may
want to follow U2.
● The higher the Jaccard index, i.e. the more the overlap between two sets of
followers (or followees), the more likely an edge is.
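A minimal sketch of the follower-based Jaccard feature (the notebook presumably derives the follower sets from the graph; `followers` here is an assumed dict mapping each node to its set of in-neighbours):

```python
def jaccard(x, y):
    """Jaccard index |X ∩ Y| / |X ∪ Y|; defined as 0 when both sets are empty."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)

def jaccard_followers(followers, u, v):
    """Jaccard feature for a candidate edge (u, v) over follower sets."""
    return jaccard(followers.get(u, set()), followers.get(v, set()))
```

The same function applied to out-neighbour sets gives the followee version of the feature.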
The cosine similarity between two sets can be written as:
Cosine Distance = |X ∩ Y| / √(|X| ⋅ |Y|)
This set formulation is a simple extension of cosine distance known as the
Otsuka–Ochiai coefficient, which is what we will use.
● Here also, we will compute it for the followers as well as for the followees.
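A sketch of the Otsuka–Ochiai coefficient on sets, matching the formula above:

```python
import math

def otsuka_ochiai(x, y):
    """Otsuka–Ochiai coefficient: |X ∩ Y| / sqrt(|X| * |Y|); 0 if either set is empty."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))
```

As with Jaccard, it is applied once to follower sets and once to followee sets.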
Page Rank
● PageRank was named after Larry Page, one of the founders of Google.
● It is an algorithm used by Google Search to rank web pages in their
search results: the more important the pages linking to a page, the more
important that page is considered.
● If my page “B” is linked from many important pages like C, D, E, etc.,
then “B” itself gets a high PageRank. In our graph, a user with a high
PageRank is an important user, who is more likely to be
followed by “you”.
● Now, for that 7.12 % of test data which is not present in the train data, we
will impute the mean PageRank value.
● The way we will use this for a pair is to compute the PageRank of both
vertices, the source and the destination.
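A self-contained power-iteration sketch of PageRank with the mean-value imputation described above (the notebook presumably uses `networkx.pagerank` on the train graph; this dict-based version just shows the mechanics):

```python
def pagerank(adj, nodes, d=0.85, iters=50):
    """Power-iteration PageRank for a directed graph given as {node: out-neighbour set}."""
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            out = adj.get(u, ())
            if out:
                share = d * pr[u] / len(out)
                for v in out:
                    new[v] += share
            else:  # dangling node: spread its rank evenly
                share = d * pr[u] / n
                for v in nodes:
                    new[v] += share
        pr = new
    return pr

# Toy directed 3-cycle: every node ends up with the same rank.
pr = pagerank({1: {2}, 2: {3}, 3: {1}}, [1, 2, 3])
# For the ~7.12 % of test nodes unseen at train time, impute the mean rank:
mean_pr = sum(pr.values()) / len(pr)
```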
Shortest Path
● Shortest path measures how far apart two vertices are, at the least.
● If no path exists between the two nodes, we set the value as “-1”.
● If a direct edge exists between the pair, the shortest path would trivially be
‘1’, which makes no sense as a feature. In that case, we will remove that edge and
compute the shortest path again.
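The edge-removal trick above can be sketched as a BFS that simply skips the direct a→b edge (`shortest_path_feature` is an illustrative name):

```python
from collections import deque

def shortest_path_feature(adj, a, b):
    """BFS shortest path from a to b, ignoring any direct a->b edge; -1 if unreachable."""
    if a == b:
        return 0
    seen, q = {a}, deque([(a, 0)])
    while q:
        node, d = q.popleft()
        for nb in adj.get(node, ()):
            if node == a and nb == b:
                continue  # skip the direct edge, as described above
            if nb == b:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return -1
```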
Connected Components
My college friends and my work-related friends can be completely different, so weakly
connected components can capture such communities. A group of similar users, people
with similar interests, people from the same college, or people with the same interest
in Machine Learning can probably form one community. So, if Ui and Uj belong to the
same weakly connected component, then it is quite probable that they both have the
same interest in something or belong to the same community. If there were even one
edge between the two components S1 and S2, we would have had one single weakly
connected component instead of two.
● For a pair (a, b), we first check whether a direct edge exists between them.
● If yes, then we removed the direct edge from b to a and calculated the
shortest path from b to a. If the shortest path does not exist, then we
declare that they are not in the same wcc/community, even though both of
them lie in the same weakly connected component of the original graph.
● If no direct edge was present in the first place, then we simply check
whether they belong to the same wcc.
● Why remove the edge? In our dataset, all of the data points have paths
from a to b or from b to a, so a direct edge would trivially put the pair in
the same wcc. That is why we used the strategy of removing the direct path
and checking whether a is still reachable from b.
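A minimal sketch of this same-community check, assuming the graph fits in a dict of edges; dropping the direct edge in both directions before testing undirected reachability is my reading of the procedure above:

```python
def same_wcc_feature(edges, a, b):
    """1 if a and b are still weakly connected after dropping any direct a-b edge."""
    und = {}
    for u, v in edges:
        if {u, v} == {a, b}:
            continue  # drop the direct edge(s) between the pair
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)
    # Undirected BFS/DFS from a
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        for w in und.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return 1 if b in seen else 0
```

On the full graph one would precompute components once rather than re-traversing per pair; this version keeps the logic visible.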
Adar Index
● The Adamic/Adar index predicts links in a social network. It is defined as
A(x, y) = Σ over common neighbours u of 1 / log(|N(u)|), i.e. the inverted
log-degree sum over the common neighbours of x and y.
● If a common neighbour u has a small group of neighbours, it can belong to a
college group or a work group, so x and y are quite likely to know each
other through u; a celebrity-like neighbour with a huge group contributes
very little.
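The Adamic/Adar sum maps directly to code (`neighbours` is an assumed dict of neighbour sets; the degree-1 guard avoids dividing by log(1) = 0):

```python
import math

def adar_index(neighbours, x, y):
    """Adamic/Adar index: sum of 1/log(|N(u)|) over common neighbours u of x and y."""
    common = neighbours.get(x, set()) & neighbours.get(y, set())
    score = 0.0
    for u in common:
        deg = len(neighbours.get(u, ()))
        if deg > 1:  # log(1) = 0 would blow up; skip degree-1 neighbours
            score += 1.0 / math.log(deg)
    return score
```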
Katz Centrality
● Katz centrality computes the relative influence of a node through the
iteration xᵢ = α Σⱼ Aⱼᵢ xⱼ + β, where A is the adjacency matrix of the
graph G with eigenvalues λ. For convergence, α must be smaller than 1/λ_max.
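A fixed-point-iteration sketch of Katz centrality matching the formula above (the notebook presumably uses `networkx.katz_centrality`; α and β values here are arbitrary illustrations):

```python
def katz_centrality(adj, nodes, alpha=0.1, beta=1.0, iters=100):
    """Iterative Katz centrality: x_i = alpha * sum over in-edges of x_j + beta."""
    x = {u: 0.0 for u in nodes}
    for _ in range(iters):
        new = {u: beta for u in nodes}
        for u in nodes:
            for v in adj.get(u, ()):  # edge u -> v contributes to v's centrality
                new[v] += alpha * x[u]
        x = new
    # L2-normalise, as networkx does
    norm = sum(v * v for v in x.values()) ** 0.5
    return {u: v / norm for u, v in x.items()}
```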
Hits Score
● Step 1: This is an iterative algorithm. At the very initial stage, for every
page p we set the authority score and the hub score to 1.
● Step 2: Authority update rule — the authority score of p becomes the sum of
the hub scores of all pages that point to p.
● Step 3: Hub update rule — the hub score of p becomes the sum of the
authority scores of all pages that p points to.
● Step 4: If we keep running these update rules as-is, the scores will grow
without bound, so we normalise them after every iteration. After running a certain
number of times, the authority score of p and the hub score of p will converge, which is
when we stop.
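The four steps above can be sketched as follows (the notebook presumably uses `networkx.hits`; this version shows the update rules explicitly):

```python
def hits(adj, nodes, iters=50):
    """HITS hub/authority scores via the standard update rules, L2-normalised each step."""
    hub = {u: 1.0 for u in nodes}   # Step 1: initialise all scores to 1
    auth = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # Step 2: authority(p) = sum of hub scores of pages pointing to p
        auth = {u: 0.0 for u in nodes}
        for u in nodes:
            for v in adj.get(u, ()):
                auth[v] += hub[u]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {u: v / norm for u, v in auth.items()}  # Step 4: normalise
        # Step 3: hub(p) = sum of authority scores of pages p points to
        hub = {u: sum(auth[v] for v in adj.get(u, ())) for u in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {u: v / norm for u, v in hub.items()}
    return hub, auth
```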
Weight Features
Since Ui is a celebrity, the chances of any two of his random followers knowing each
other are very small. But since Uj is not a celebrity, but a common man like us, the
chances of his followers knowing each other are much higher. How do we encode the
information of whether a point is a celebrity or not using just its number of links?
● In order to determine the similarity of nodes, an edge weight value was
calculated between nodes, such that the edge weight decreases as the
neighbour count goes up, e.g. W(u) = 1 / √(1 + |N(u)|).
● We can compute the weight features for the in-links and the out-links too, in the
same way, and as we have a directed graph we can also combine the two.
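A sketch of these weight features; the W = 1/√(1 + |N|) form and the particular in/out combinations below are my reading of the scheme, not necessarily the notebook's exact set:

```python
import math

def weight_features(in_nb, out_nb, u):
    """Edge-weight features: W = 1 / sqrt(1 + |neighbours|), for in- and out-links."""
    w_in = 1.0 / math.sqrt(1 + len(in_nb.get(u, ())))
    w_out = 1.0 / math.sqrt(1 + len(out_nb.get(u, ())))
    return {
        "w_in": w_in,                      # small for celebrities (many followers)
        "w_out": w_out,
        "w_in + w_out": w_in + w_out,      # simple combinations of the two
        "w_in * w_out": w_in * w_out,
    }
```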
The most important part of this case study was featurization. Now it's quite easy to
work through the rest of the problem.
Modelling
● Based on our case study, a fairly non-linear model might work well.
● Here, I have used the df_final_test data for cross-validation.
● A Random Forest tends to work well when there is non-linearity and a
reasonable number of features.
● I recommend you to try other models too, but I got pretty good accuracy
here. Of course, SVM, GBDT and XGBoost might work well on this problem.
We tuned the following hyperparameters:
- Number of estimators
- Depth
using the F1 score as the metric.
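Since F1 is the tuning metric, it is worth recalling how it is derived from confusion-matrix counts (the notebook presumably calls `sklearn.metrics.f1_score`; this is the underlying arithmetic):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```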
● Here, I have plotted the change in score between my train data and my
test data as the depth changes. We can observe that beyond a certain depth, the model
starts to overfit.
● Here, we can observe the F1 scores of train and test, which are 96.2%
and 92.5% respectively, which are fairly close. Since these two values
are quite close, we can conclude that the model is not
overfitting. From the confusion matrices we can make the observation that
precision for class “0” is dropping and recall for class “1”
is dropping.
Train confusion_matrix
Test confusion_matrix
● This gives us the intuition that we could add more features to more precisely
classify whether a pair belongs to class “0” or not. The model is doing pretty well on
the train data while it is a little dull on the test data. This can happen for many
reasons; one is that some of the vertices in the test set were not present in the
original train set.
● The very encouraging thing is that there is a fairly small difference
between the train score and the test score, which shows there is ultimately no overfitting.
● The follow-back feature turned out to be the most important one: if Ui is
following Uj, then there is a very high chance that Uj will follow back Ui. If
you observe the feature importance graph, the SVD, PageRank and Katz features
are comparatively less important for our models.