
UNIT 6

BINARY CLASSIFICATION TASK


● As discussed before, we want to convert this into a binary classification problem such that existing edges are labelled "1" and non-existing (absent) edges are labelled "0".

● In the original graph in the above figure, there is no edge between U1 and U2, U1 and U3, U1 and U4, U2 and U3, U2 and U4, U3 and U4.

● These pairs should be labelled "0" in the data. But they are not provided in the train.csv file, i.e., we don't have zero-labelled data in the dataset.

● So, we can add those node pairs with no edges and label them as "0" in the data.

Now, there's a catch.

● We need to generate data for label "0".

● For each vertex, there can be (n-1) possible directed edges.

● With n vertices, there are n(n-1) possible edges in total.

● With 1.86 million nodes, the number of possible edges = 1.86 × 10⁶ × (1.86 × 10⁶ − 1), which is extremely large.

● But we only need a small fraction of these possible edges — about 9.43 million, equal to the number of existing edges — to make our data a balanced dataset for binary classification.


Let's call the possible edges which are not present "bad" edges.

● We will now randomly sample 9.43 million edges out of all possible edges which are not present in train.csv and construct our new dataset.

● Remember, we generate only those bad links which are not in the graph and whose shortest path length is greater than 2, as sketched below.
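A minimal sketch of this negative sampling, assuming train.csv has the columns source_node and destination_node (the actual column names in the data may differ):

import random
import pandas as pd
import networkx as nx

df = pd.read_csv('train.csv')
g = nx.from_pandas_edgelist(df, 'source_node', 'destination_node',
                            create_using=nx.DiGraph())
present = set(zip(df['source_node'], df['destination_node']))
nodes = list(g.nodes())

missing = set()
target = 9437519                       # one negative pair per positive edge
while len(missing) < target:
    a, b = random.choice(nodes), random.choice(nodes)
    if a == b or (a, b) in present or (a, b) in missing:
        continue
    # keep the pair only if it is more than 2 hops apart (or unreachable)
    try:
        if nx.shortest_path_length(g, a, b) > 2:
            missing.add((a, b))
    except nx.NetworkXNoPath:
        missing.add((a, b))

neg_df = pd.DataFrame(list(missing), columns=['source_node', 'destination_node'])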

TRAIN & TEST SPLIT

In the real world, if you are an employee of Facebook or Instagram, you have to deal with the temporal nature of the data, so you would carry out time-based splitting. But we are only provided with the graph at a single timestamp. Since we don't have timestamps, we will carry out random splitting.

● We split the data in an 80:20 ratio.

● We split positive links and negative links separately, because only the positive training data is needed for creating the graph and generating features.

● For positive links: X_train_pos, X_test_pos, y_train_pos, y_test_pos

● For negative links: X_train_neg, X_test_neg, y_train_neg, y_test_neg

● Number of node pairs with edges (positive samples) = 9437519
Number of node pairs without edges (negative samples) = 9437519

● Number of node pairs with edges in the train data = 7550015
Number of node pairs without edges in the train data = 7550015

● Number of node pairs with edges in the test data = 1887504
Number of node pairs without edges in the test data = 1887504

● We removed the header and saved the files separately.

● We created a directed graph (DiGraph) separately for train positive and test positive.

● We found the unique nodes in both the train positive and test positive graphs.

● Number of people common to train positive and test positive = 1063125

● Number of people present in train but not present in test = 717597

● Number of people present in test but not present in train = 81498

● Percentage of people not in train but present in test, out of the total test data = 7.1200735962845405 %

● As 7.12 % of the test data is not already present in the train data, this leads to a cold-start problem.




● Now, we appended X_train_neg to X_train_pos.

● We concatenated (y_train_pos, y_train_neg).

● We appended X_test_neg to X_test_pos.

● We concatenated (y_test_pos, y_test_neg).

Data points in train data: (15100030, 2)

Data points in test data: (3775008, 2)

Shape of target variable in train: (15100030,)

Shape of target variable in test: (3775008,)
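A rough sketch of this split and concatenation, assuming the positive and negative pairs are held in two pandas DataFrames df_pos and df_neg (hypothetical names):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(
    df_pos, np.ones(len(df_pos)), test_size=0.2, random_state=9)
X_train_neg, X_test_neg, y_train_neg, y_test_neg = train_test_split(
    df_neg, np.zeros(len(df_neg)), test_size=0.2, random_state=9)

# append the negatives to the positives, then concatenate the labels
X_train = pd.concat([X_train_pos, X_train_neg], ignore_index=True)
X_test = pd.concat([X_test_pos, X_test_neg], ignore_index=True)
y_train = np.concatenate((y_train_pos, y_train_neg))
y_test = np.concatenate((y_test_pos, y_test_neg))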

Feature Engineering on Graphs: Jaccard & Cosine Similarities

Now, we will try to convert these pairs of vertices into numerical, categorical or binary features to carry out machine learning, as we cannot run algorithms directly on the raw vertex pairs.

For any graph-based machine learning problem, of course featurization is the most

important part of the project.

For the code: refer to the “fb_featurization ipynb notebook” in my GitHub repository.

Remember, feature engineering on existing features and creation of new features should be done after the train-test split to avoid data leakage.

Here we will just focus on how to featurize graph-based data.

First of all, we will operate on the sets of followers and followees.

1) Similarity Measures: Jaccard Distance

X is the set of followers of U1, and Y is the set of followers of U2.

J = |X ∩ Y| / |X ∪ Y|

(In the example in the figure above, J = 2/4.)

Given any two sets, the Jaccard distance (or Jaccard similarity coefficient) is a statistic used for gauging the similarity of sample sets: the size of the intersection of X and Y divided by the size of their union.

In a nutshell, if there are a lot of common followers between U1 and U2, then U1 may want to follow U2 and U2 may want to follow U1.

● The higher the Jaccard index, i.e., the more the overlap between the two sets of followers or followees, the higher the probability that there exists an edge or link between U1 and U2.


● We can compute the Jaccard index for the follower sets as well as for the followee sets, as in the sketch below.

● We will use this metric in our actual model too.
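A minimal sketch of the follower-side Jaccard index, assuming train_graph is a networkx DiGraph built from the positive training edges:

def jaccard_for_followers(a, b):
    # followers = in-neighbours in the directed graph
    X = set(train_graph.predecessors(a))
    Y = set(train_graph.predecessors(b))
    if len(X | Y) == 0:
        return 0.0
    return len(X & Y) / len(X | Y)

# the followee version is identical, with successors() in place of predecessors()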

2) Similarity Measure : Cosine distance (Otsuka-Ochiai coefficient)

X is the set of followers of U1, and Y is the set of followers of U2.

For vectors X and Y, cosine similarity is computed as (X · Y) / (‖X‖ ‖Y‖). But here X and Y are sets, so we use a simple set-based variant of cosine distance, the Otsuka-Ochiai coefficient:

Cosine distance (Otsuka-Ochiai coefficient) = |X ∩ Y| / √(|X| ⋅ |Y|)

The Otsuka-Ochiai coefficient is used when X and Y are sets.

● So, the cosine distance (Otsuka-Ochiai coefficient) will be high when there is more overlap between the sets X and Y.

● Here also, we will compute it for the follower sets as well as the followee sets, as in the sketch below.
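A corresponding sketch for the Otsuka-Ochiai coefficient on the follower sets (same train_graph assumption as above):

import math

def cosine_for_followers(a, b):
    X = set(train_graph.predecessors(a))
    Y = set(train_graph.predecessors(b))
    if len(X) == 0 or len(Y) == 0:
        return 0.0
    return len(X & Y) / math.sqrt(len(X) * len(Y))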

Page Rank

● PageRank is a very popular featurization technique for directed graphs, famously used by Google.

● PageRank was named after Larry Page, one of the founders of Google.

● It is the algorithm used by Google Search to rank web pages in its search engine results.

● It is a way of measuring the importance of website pages.

● PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the page is.

● If a lot of pages have "B" as their destination, then "B" must be important; and if page "B" is linked from many important pages like C, D, E, etc., then B's value increases further.


● If a user Ui has a high PageRank score, it implies that many users, including highly important ones, are linking to Ui.

● PageRank tells us about relative importance.

● Bill Gates is a celebrity. He is followed by some important people like Mark and Warren Buffett, and also by common people like me, so it is quite certain that he is an important person. There is therefore a significantly higher probability that Bill Gates will be followed by "you".

● For the 7.12 % of test nodes that are not present in the train data, we can impute the mean PageRank value as their PageRank value.
● The way we will use this for both vertices (source and destination) is sketched below:
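A minimal sketch, assuming train_graph is the positive training DiGraph; the mean PageRank is imputed for nodes unseen during training:

import networkx as nx

pr = nx.pagerank(train_graph, alpha=0.85)
mean_pr = sum(pr.values()) / len(pr)

def pagerank_feature(a, b):
    # (source, destination) PageRank, mean-imputed for cold-start nodes
    return pr.get(a, mean_pr), pr.get(b, mean_pr)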
Shortest Path

● If there is no path between two vertices, we use the default value "-1".

● If the two vertices have a direct edge, the shortest path has length "1", which tells us nothing useful. In that case, we temporarily remove that edge and recompute the shortest path between the two vertices.

● The shortest path measures how far apart two vertices are at minimum; a sketch of this helper (used later in the weakly-connected-component check) follows.
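A sketch of the compute_shortest_path_length helper described above, again assuming the global train_graph:

import networkx as nx

def compute_shortest_path_length(a, b):
    # shortest path from a to b, ignoring a direct a->b edge; -1 if no path
    removed = False
    if train_graph.has_edge(a, b):
        train_graph.remove_edge(a, b)        # temporarily drop the direct edge
        removed = True
    try:
        p = nx.shortest_path_length(train_graph, source=a, target=b)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        p = -1
    finally:
        if removed:
            train_graph.add_edge(a, b)       # restore the graph
    return p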

Connected Components

● Strongly connected component

A directed graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected.

● Weakly connected component

A weakly connected component is one in which all nodes are connected to each other by some path when edge directions are ignored.


● Checking for the same weakly connected component (community)

When we ignore directions, there is a path from each vertex to every other vertex within S1 and within S2, so the subgraphs S1 and S2 are weakly connected components of the graph.

My college friends and my work friends can be completely different, so weakly connected components effectively form something called a "community".

A group of similar users — people with similar interests, people from the same college, or people who share an interest in machine learning — will probably form one community. So, if Ui and Uj belong to the same weakly connected component, it is quite probable that they share an interest, belong to the same city, or have something else in common.

If there were even one edge between S1 and S2, we would have a single weakly connected component instead of two.

# getting weakly connected components from the graph
wcc = list(nx.weakly_connected_components(train_graph))

def belongs_to_same_wcc(a, b):
    index = []
    if train_graph.has_edge(b, a):
        # find the component containing b, then check whether a is also in it
        for i in wcc:
            if b in i:
                index = i
                break
        if a in index:
            train_graph.remove_edge(b, a)
            if compute_shortest_path_length(b, a) == -1:
                train_graph.add_edge(b, a)
                return 0
            else:
                train_graph.add_edge(b, a)
                return 1
        else:
            return 0
    if train_graph.has_edge(a, b):
        # find the component containing a, then check whether b is also in it
        for i in wcc:
            if a in i:
                index = i
                break
        if b in index:
            train_graph.remove_edge(a, b)
            if compute_shortest_path_length(a, b) == -1:
                train_graph.add_edge(a, b)
                return 0
            else:
                train_graph.add_edge(a, b)
                return 1
        else:
            return 0
    else:
        # no direct edge in either direction: just check component membership
        for i in wcc:
            if a in i:
                index = i
                break
        if b in index:
            return 1
        else:
            return 0

In the above code:

● First, we check if there is a direct edge from b to a. If yes, we check whether both b and a belong to the same WCC (community).

● If they do, we temporarily remove the direct edge from b to a and compute the shortest path from b to a. If no such path exists, we declare that they are not in the same WCC/community, even though they nominally belong to the same component.

● However, if a shortest path from b to a still exists, it means they could be friends, and hence we declare them to be in the same WCC/community.

● Secondly, we check whether there is a direct edge from a to b, follow the same approach as above, and return 1 or 0.

● If there is no direct edge from a to b or from b to a in the first place, then we simply check whether they belong to the same WCC: if yes, return 1, else 0.

● Why have we checked for the existence of a direct edge from b to a and from a to b?

In our dataset, most of the positive data points have a direct edge from a to b or from b to a. If we did not check for this, the feature would return 1 for almost all of these data points. Hence, if there is a direct edge, we remove it and check whether a is still reachable from b (or b from a) via an indirect path of length 2 or more; only then do we declare them to be in the same WCC/community.

Adar Index

● The Adamic/Adar index is a measure designed to predict links in a social network.

● The neighborhood N(x) is the subset of vertices/nodes which are connected, in either direction, to x.

● The formula for the Adamic/Adar index is:

A(x, y) = Σ over u ∈ N(x) ∩ N(y) of 1 / log|N(u)|

● Here, we sum over every node u that belongs to the intersection of the neighborhood of x and the neighborhood of y.

● Let's say U1 and U2 are two vertices belonging to the intersection of N(x) and N(y), i.e., both are connected to both x and y.

U1 has a very large neighborhood, so it's probably a celebrity, and there is only a small chance that x and y are related through it.

U2, on the other hand, has a small neighborhood, so it's a common person like us. Since its group is small, it could be a college group or a work group, so x and y are more likely to be related in this case.

● As the size of N(u) increases, log|N(u)| increases and 1/log|N(u)| decreases. So the contribution of nodes like U1, which have a large number of neighbors, to the Adamic/Adar index is small, and vice versa, as in the sketch below.
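A minimal sketch of this index on the directed train graph, treating the neighborhood N(u) as the union of in- and out-neighbours as described above:

import math

def adamic_adar(a, b):
    n_a = set(train_graph.predecessors(a)) | set(train_graph.successors(a))
    n_b = set(train_graph.predecessors(b)) | set(train_graph.successors(b))
    score = 0.0
    for u in n_a & n_b:
        deg = len(set(train_graph.predecessors(u)) | set(train_graph.successors(u)))
        if deg > 1:                      # skip log(1) = 0 to avoid division by zero
            score += 1.0 / math.log(deg)
    return score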


Katz Centrality

● It is similar to what we have seen in Google's PageRank, but it is quite an old technique, dating back to 1953.

● Katz centrality computes the centrality of a node based on the centrality of its neighbors. It is a generalization of eigenvector centrality.

● It is used to measure the relative degree of influence of a node within a graph or social network.

● The Katz centrality for node i is:

x_i = α Σ_j A_ij x_j + β

where A is the adjacency matrix of the graph G with eigenvalues λ.

● The parameter β controls the initial centrality, and α < 1/λ_max (see the sketch below).
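In networkx this can be sketched directly; the parameter values below are illustrative assumptions, not the ones used in the case study:

import networkx as nx

katz = nx.katz_centrality(train_graph, alpha=0.005, beta=1.0)
mean_katz = sum(katz.values()) / len(katz)

def katz_feature(a, b):
    # (source, destination) Katz centrality, mean-imputed for unseen nodes
    return katz.get(a, mean_katz), katz.get(b, mean_katz)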

Hits Score

● HITS stands for Hyperlink-Induced Topic Search (often referred to as "hubs and authorities"); it is a link analysis algorithm that rates web pages.

● The HITS algorithm computes two numbers for a node. The authority score estimates the node's value based on its incoming links. The hub score estimates the node's value based on its outgoing links.

● Hubs → many outlinks, e.g., yahoo.com

● Authorities → many inlinks, e.g., cnn.com, bbc.com, mit.edu

● So, HITS gives every web page two scores: <hub, authority>.

● Step 1: This is an iterative algorithm. Initially, for every page p, we set auth(p) = 1 and hub(p) = 1.

● Step 2: Authority update rule: for each p, we update auth(p) to the sum of hub(q) over all pages q in Pto, where Pto is the set of all pages which link to page p.

● Step 3: Hub update rule: for each p, we update hub(p) to the sum of auth(q) over all pages q in Pfrom, where Pfrom is the set of all pages to which p links.

● Step 4: If we keep running these update rules unchecked, the scores grow without bound. So after each update of the authority or hub scores we normalize them, so that the values do not become infinitely large.

● We run these steps iteratively; if we do this enough times, the authority and hub scores of every page converge, which is proven mathematically. So we run the steps iteratively until the authority and hub scores no longer change, as in the sketch below.
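In practice we can use networkx's implementation rather than coding the iterations by hand (a sketch, same train_graph assumption):

import networkx as nx

hits_hubs, hits_auth = nx.hits(train_graph, max_iter=100, normalized=True)

def hits_feature(a, b):
    # hub and authority scores for the source and destination nodes
    return (hits_hubs.get(a, 0), hits_auth.get(a, 0),
            hits_hubs.get(b, 0), hits_auth.get(b, 0))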

Weight Features

Since Ui is a celebrity, the chances that any two random points connected to it know each other are very small. But Uj is not a celebrity — a common person like us — so the chances that any two random points connected to it know each other are quite high.

How do we encode whether a point is a celebrity or not, using just the number of links into (or out of) it?

There is a weight feature "W" for any point Ui, defined so that it shrinks as the number of links grows.

Here X = the set of vertices linking into Ui or linking out from Ui.

● We can compute the weight features for in-links and out-links separately in the same way, as in the sketch below.

For every vertex Ui, we have these 6 features.

● weight of incoming edges

● weight of outgoing edges

● weight of incoming edges + weight of outgoing edges

● weight of incoming edges * weight of outgoing edges


● 2*weight of incoming edges + weight of outgoing edges

● weight of incoming edges + 2*weight of outgoing edges

In order to determine the similarity of nodes, an edge weight value was calculated for each node. The weight decreases as the neighbor count goes up. As we have directed edges, we calculated the in-weight and out-weight separately.
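A minimal sketch of these weight features; the exact form W = 1/√(1 + |X|) is an assumption, chosen to be consistent with the description above (the weight shrinks as the neighbour count grows):

import math

weight_in, weight_out = {}, {}
for node in train_graph.nodes():
    # assumed form of W: 1 / sqrt(1 + number of in/out neighbours)
    weight_in[node] = 1.0 / math.sqrt(1 + len(set(train_graph.predecessors(node))))
    weight_out[node] = 1.0 / math.sqrt(1 + len(set(train_graph.successors(node))))

def weight_features(u):
    # the six weight features for a vertex u
    w_in, w_out = weight_in.get(u, 0), weight_out.get(u, 0)
    return [w_in, w_out, w_in + w_out, w_in * w_out,
            2 * w_in + w_out, w_in + 2 * w_out]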

The most important part of this case study was featurization. Now it's quite easy to work through the rest of the problem.

These are the extracted features.

Let's train our model.

Modelling

● Based on our case study, a fairly non-linear model should work well.

● Here, I have used the df_final_test data for cross-validation. To be clear, "test" here basically means cross-validation.

● I will be using a Random Forest classifier because random forests tend to work well when there is non-linearity and a reasonable number of features, and no non-linear feature transformation is needed.

● I recommend trying other models too, but I got pretty good accuracy here. Of course, SVM, GBDT and XGBoost might also work well on this problem.

● There are two main hyperparameters for the Random Forest:

- number of estimators

- depth

● I carried out standard hyperparameter tuning using the F1 score as the metric.

● I tried training my model with 10, 50, 100, 250 and 450 base models.

● I plotted how the score changes on the train data and on the test/cross-validation data as the number of estimators changes.

● I also tuned the depth of the base learners over the values 3, 9, 11, 15, 20, 35, 50, 70 and 130, recorded the scores as the depth changed, and plotted them. We can observe that beyond a certain depth, the model does not improve significantly.

● I found reasonably good values for the number of estimators and the depth using RandomizedSearchCV, roughly as sketched below.
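A rough sketch of such a search; the parameter distributions are illustrative assumptions, and X_train / y_train stand for the final feature matrix and labels:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': randint(10, 500),       # number of base learners
    'max_depth': randint(3, 130),           # depth of each tree
    'min_samples_split': randint(2, 150),
    'min_samples_leaf': randint(1, 50),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=25, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=10, scoring='f1', cv=3, random_state=25)
search.fit(X_train, y_train)
print(search.best_estimator_)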


RandomForestClassifier(max_depth=14, min_samples_leaf=28,
min_samples_split=111, n_estimators=121, n_jobs=-1, random_state=25)

● Here, we can observe the F1 scores on train and test, which are 96.2% and 92.5% respectively. Since these two values are fairly close, we can conclude that the model is not overfitting. Again, "test" here means cross-validation.

● I plotted the confusion matrices to get some intuition. From the observation, the precision for class "0" drops and the recall for class "1" drops.
Train confusion matrix (plot)

Test confusion matrix (plot)

● This suggests that we could add more features to classify class "0" more precisely. The model does quite well on the train data but is a little weaker on the test data. This can happen for several reasons; recall the cold-start problem mentioned earlier, where some of the vertices in the test set were not present in the original train set at all.

● The encouraging thing is that the difference between the train score and the test score is very small, which shows there is essentially no overfitting. So even on future or completely new data, the metrics should hold up pretty well.

● Receiver operating characteristic (ROC) curve for the cross-validation data:


● The most important part of the analysis after training the model is feature importance. follows_back turns out to be the most important feature: if Ui is following Uj, there is a very high chance that Uj will follow Ui back. If you observe the feature importance graph, the SVD features, PageRank and Katz are the lowest. Sometimes very simple features stand out, although of course the more complex features still add value to the performance.

● Random Forest Classifier turns out to be one of the best performing

models.
