Topic 3
MS4252
Social Network Analysis III
Chapter 7
UNDERSTANDING STRUCTURE
THROUGH ATTRIBUTE AND
BEHAVIOR
Describe the Network
To really understand what’s happening in a
network, look beyond the structure at
users’ attributes, behaviour, the content they are
sharing, and their interactions
Attribute, Behavior, and Content
Attributes are characteristics of a node
Age, gender, or location
Person’s religion or political preferences
Attribute, Behavior, and Content
Content is a broader term that refers to the
nonstructural information about a node
Includes any information about the nodes or edges
beyond the structural features.
Combination of attributes and behavior, like the types of
comments they post online or the topics they discuss.
If a node represents a video or picture instead of a
person, the content would include the things depicted in
the video.
Structural Analysis
Identify clusters
Provide statistics about the network properties (like
density and connectivity)
Describe the importance of individual nodes with measures
(like centrality)
How to interpret the three clusters?
Built from the photo sharing website Flickr
On Flickr, photos are labeled with descriptive
keywords called tags
Nodes represent tags, and an edge between tags
indicates that they were used to describe the
same image.
e.g. if an image is tagged with the word “desk”
and “keyboard” the network would show a line
connecting those two words.
The network is a 1.5 egocentric network of a single
tag
Now with content
Break down into Egocentric Network
The network of three months of discussion on the CSS-Discuss
mailing list. Node size reflects the node’s out-degree in this directed
network. An edge means that one person has replied to another.
Chapter 8
BUILDING NETWORKS
First Question?
What do the nodes represent?
Example
Facebook Network
What are the nodes and edges?
Network with Multiple node types
Bipartite graphs have two node types that do not
have connections within the type
E.g. no people connected to one another
To build a network
Step 1: Define Nodes
What are they?
What are the criteria for being included?
To Build a Network
Step 2: Define Edges
What does an edge represent?
What are the criteria for adding one?
Sampling Methods
Some networks may be too big to analyze in
their entirety.
Millions of nodes and edges are difficult to understand,
impossible to visualize.
Example: Enron email network
Node: people (email accounts) who exchange messages
Edge: any pair of nodes that have exchanged at least 10
emails.
Random Sampling
Select a certain percentage of nodes and keep all edges
between them, or
Select a certain percentage of edges and keep all the
nodes that are mentioned.
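The two sampling options above can be sketched as follows. This is a minimal illustration, assuming the network is stored as a plain list of edge pairs; the function names, the toy network, and the 50% fraction are hypothetical choices, not part of the slides.

```python
import random

def sample_nodes(edges, fraction, seed=None):
    """Keep a random fraction of nodes and all edges between the kept nodes."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    k = max(1, round(fraction * len(nodes)))
    kept = set(rng.sample(nodes, k))
    return kept, [(a, b) for a, b in edges if a in kept and b in kept]

def sample_edges(edges, fraction, seed=None):
    """Keep a random fraction of edges and every node those edges mention."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(edges)))
    kept_edges = rng.sample(edges, k)
    kept_nodes = {n for edge in kept_edges for n in edge}
    return kept_nodes, kept_edges

# Hypothetical toy network with 6 nodes and 6 edges
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
nodes, sub_edges = sample_nodes(edges, 0.5, seed=1)
edge_nodes, kept_edges = sample_edges(edges, 0.5, seed=1)
```

Node sampling can leave kept nodes with no surviving edges, while edge sampling never produces isolated nodes because every kept node comes from a kept edge.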
Random Edge Sampling
Benefits
Reduce the network to a smaller size in an even way
A picture of the overall patterns of relationships and
clusters can be seen
Node sampling keeps some network statistical features
Snowball Sampling
A technique commonly used in sociology where
participants are interviewed
Snowball Sampling
Problems
Biased toward the part of the network sampled, may
miss other features
Benefits:
Easy to do, common
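A rough sketch of snowball sampling on a stored network, assuming the usual procedure of starting from a few seed nodes and repeatedly adding their contacts for a fixed number of waves. The adjacency dictionary, seed choice, and wave counts are hypothetical.

```python
def snowball_sample(adj, seeds, waves):
    """Start from the seeds and add unseen neighbors, one wave at a time."""
    sampled = set(seeds)
    frontier = list(seeds)
    for _ in range(waves):
        next_frontier = []
        for node in frontier:
            for neighbor in sorted(adj.get(node, ())):
                if neighbor not in sampled:
                    sampled.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return sampled

# Hypothetical network: a chain a-b-c-d plus a separate pair x-y
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},
}
two_waves = snowball_sample(adj, ["a"], waves=2)
many_waves = snowball_sample(adj, ["a"], waves=10)
```

However many waves are run from seed a, the pair x-y is never reached, which illustrates the bias toward the part of the network where sampling started.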
Egocentric Network Analysis
Instead of looking at the whole network, look at
the egocentric networks of some nodes.
Randomly selected egocentric networks, or
The networks of individuals selected based on certain
characteristics
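A minimal sketch of extracting a 1.5 egocentric network, assuming the common definition: the ego, its direct neighbors, and all edges among those nodes (including edges between neighbors). The adjacency dictionary and node names are hypothetical.

```python
def ego_network_1_5(adj, ego):
    """Return the ego plus its neighbors, and every edge among those nodes."""
    members = {ego} | adj[ego]
    edges = {
        frozenset((u, v))
        for u in members
        for v in adj[u]
        if v in members
    }
    return members, edges

# Hypothetical undirected network stored as node -> set of neighbors
adj = {
    "ego": {"a", "b"},
    "a": {"ego", "b", "c"},
    "b": {"ego", "a"},
    "c": {"a"},
}
members, edges = ego_network_1_5(adj, "ego")
```

Node c is left out because it is two steps from the ego, while the a-b edge is kept because both endpoints are neighbors of the ego.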
Egocentric Network Analysis
A 1.5 egocentric network of a person who posts
in both English and Spanish
People who post only in Spanish are shown in black,
those posting only in English are in white, and people
using multiple languages or a third language are in
gray.
Chapter 9
Overview
Link Prediction is a method of analysis that detects
where missing links should be present in the network.
Link Prediction
Data often has errors in it, including missing
links,
and link prediction could identify places where an
analyst might want to check to confirm that there
is no edge between a pair of nodes
It can also be used to identify people.
Systematic methods for predicting links
If we have two nodes, A and B, then the
score(A,B) indicates how closely connected A and
B are in the graph
After computing the score for every pair of
nodes, the algorithm returns a ranked list
The pairs with the highest score are predicted to
be the most likely new edges.
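The score-then-rank procedure can be sketched as below. It assumes that only pairs without an existing edge are candidates, and it uses the number of common neighbors (one of the score methods in this topic) as a stand-in score. The graph and function names are hypothetical.

```python
from itertools import combinations

def common_neighbors(adj, a, b):
    """Stand-in score: how many neighbors the two nodes share."""
    return len(adj[a] & adj[b])

def rank_candidate_links(adj, score):
    """Score every unconnected pair and return the pairs, best first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]
    return sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)

# Hypothetical undirected network stored as node -> set of neighbors
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
ranked = rank_candidate_links(adj, common_neighbors)
```

Any other score function with the same `(adj, a, b)` signature can be passed in place of `common_neighbors`.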
Score Method
Shortest Path Length
Common Neighbors
Jaccard Index
Adamic/Adar
Preferential Attachment
Score – Shortest Path
One of the simplest ways to score the similarity
or closeness of two nodes is to use the shortest
path length between them
Score(A,B) = - shortestPath(A,B)
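A small sketch of this score for an unweighted network, using breadth-first search to find the shortest path. Returning negative infinity for unreachable pairs is an assumption made here so that such pairs rank last; the slides do not say how to handle them.

```python
from collections import deque

def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None

def shortest_path_score(adj, a, b):
    """Negated path length, so that closer pairs get higher scores."""
    length = shortest_path_length(adj, a, b)
    return float("-inf") if length is None else -length

# Hypothetical chain A-B-C-D plus an isolated node Z
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "Z": set()}
score_ac = shortest_path_score(adj, "A", "C")
score_ad = shortest_path_score(adj, "A", "D")
```

The negation is what makes the ranking work: A and C (two hops) score -2, which is higher than the -3 for A and D (three hops).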
Score – Common Neighbors
Another way of computing the score,
one that uses more information from the network structure, is to
count the number of common neighbors between the
two nodes in a pair
Score(A, G) = |Neighbor(A) ∩ Neighbor(G)|
Score – Common Neighbors
The number of common neighbors makes social
sense, too.
The more friends two people have in common, the more
likely they are to be introduced to one another.
Score – Jaccard Index
The Jaccard Index counts the total number of
friends in common and divides that by the total
number of people who are friends of either node
Example
Four nodes: Alice, Bob, Chuck, and Dave.
Let Alice and Bob be celebrities, each with 1 million
friends; Chuck and Dave are average users with 100
friends each.
Now say Alice and Bob have 2,000 friends in common
while Chuck and Dave have only 20 friends in common.
Although Alice and Bob may seem to be more strongly
connected than Chuck and Dave, since they have 100
times more common friends, the Jaccard Index
indicates this is not the case
What is Jaccard Index for Alice and Bob? How
about Chuck and Dave?
What if the 20 people Chuck and Dave know in
common are also celebrities?
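The first question can be worked through directly from the counts on the slide. The sketch below uses the fact that the number of people who are friends of either node equals the two friend counts added together minus the friends in common; the function name is hypothetical.

```python
def jaccard_from_counts(friends_a, friends_b, common):
    """Jaccard Index = friends in common / friends of either node."""
    union = friends_a + friends_b - common
    return common / union if union else 0.0

alice_bob = jaccard_from_counts(1_000_000, 1_000_000, 2_000)
chuck_dave = jaccard_from_counts(100, 100, 20)

print(f"{alice_bob:.4f}")   # 0.0010
print(f"{chuck_dave:.4f}")  # 0.1111
```

Chuck and Dave score roughly a hundred times higher than Alice and Bob, even though they share a hundred times fewer friends.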
Score – Adamic/Adar
Sum the inverse log of the degree of each common
neighbor
This gives more weight to common friends with fewer edges.
The formula is
Score(A, B) = Σ 1 / log |Neighbor(z)|, summed over every z in Neighbor(A) ∩ Neighbor(B)
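A minimal sketch of the Adamic/Adar score in its standard form: for every common neighbor z of the two nodes, add 1 / log(degree of z). The natural logarithm and the toy network are assumptions made for illustration.

```python
import math

def adamic_adar(adj, a, b):
    """Sum 1 / log(degree) over the common neighbors of a and b."""
    # A common neighbor is linked to both a and b, so its degree is at
    # least 2 and the logarithm is never zero.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[a] & adj[b])

# Hypothetical network: A and B share Z1 (degree 2) and Z2 (degree 4)
adj = {
    "A": {"Z1", "Z2"},
    "B": {"Z1", "Z2"},
    "Z1": {"A", "B"},
    "Z2": {"A", "B", "C", "D"},
    "C": {"Z2"},
    "D": {"Z2"},
}
score = adamic_adar(adj, "A", "B")
```

Z1, who knows only A and B, contributes 1 / log 2, twice as much as the better-connected Z2, who contributes 1 / log 4.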
Score – Preferential Attachment
The network principle states that nodes with a high degree
are more likely to gain new links.
Popular nodes are more likely to gain new friends than less
popular nodes.
Score – Preferential Attachment
The formula for this score method is relatively simple:
Score(A, B) = |Neighbor(A)| × |Neighbor(B)|
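A short sketch of the preferential attachment score in its usual form, the product of the two nodes' degrees. The toy network is hypothetical.

```python
def preferential_attachment(adj, a, b):
    """Product of the two degrees: popular pairs get high scores."""
    return len(adj[a]) * len(adj[b])

# Hypothetical network: H1 and H2 are well connected, P and z are not
adj = {
    "H1": {"x", "y", "z"},
    "H2": {"x", "y"},
    "P": {"x"},
    "x": {"H1", "H2", "P"},
    "y": {"H1", "H2"},
    "z": {"H1"},
}
hub_score = preferential_attachment(adj, "H1", "H2")
low_score = preferential_attachment(adj, "P", "z")
```

Unlike the other methods, this score ignores whether the two nodes share any neighbors; it depends only on how popular each node is.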
Entity Resolution
Identify nodes that represent the same entity
and then merge them together
Entity resolution is a technique for merging
nodes that represent the same person.
John Smith and J. Smith are actually the same person
Scoring Techniques
For each pair of nodes, we can compute a score that
represents the likelihood that they are the same node.
Then we can set a threshold value.
Any pair of nodes with score above the threshold will be
merged.
Scoring Techniques
To create a score for each pair of nodes, we will
consider similarity on a set of attributes.
For each pair of nodes, record 1 if their values match and 0
otherwise for a given attribute.
Scoring Techniques
Example: two nodes (J Smith and John Smith) and five
attributes
The vector of match/nonmatch is (0, 1, 0, 0, 1)
Suppose the weights are ()
The weights differ because a match on SSN is much more
definitive.
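The scoring and threshold steps can be sketched as follows. The attribute names and their order, the record values, the weights, and the threshold are all hypothetical (the slide's own weights did not survive extraction); they are chosen only so that the match vector comes out as (0, 1, 0, 0, 1).

```python
# Hypothetical attribute names, in an assumed order
ATTRIBUTES = ("first_name", "last_name", "birth_month", "city", "ssn")

def match_vector(record_a, record_b):
    """Record 1 where the two records agree on an attribute, 0 otherwise."""
    return tuple(int(record_a[attr] == record_b[attr]) for attr in ATTRIBUTES)

def weighted_score(vector, weights):
    """Add up the weights of the attributes that matched."""
    return sum(v * w for v, w in zip(vector, weights))

# Made-up records for the two nodes
j_smith = {"first_name": "J", "last_name": "Smith", "birth_month": 3,
           "city": "Austin", "ssn": "000-00-0000"}
john_smith = {"first_name": "John", "last_name": "Smith", "birth_month": 7,
              "city": "Boston", "ssn": "000-00-0000"}

weights = (2, 1, 1, 1, 10)   # hypothetical: SSN carries the most weight
threshold = 8                # hypothetical merge threshold

vector = match_vector(j_smith, john_smith)
score = weighted_score(vector, weights)
should_merge = score > threshold
```

With these made-up numbers, the last-name and SSN matches give a score of 11, which is above the threshold, so the two nodes would be merged.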
Scoring Techniques
We are left to find a method for computing the weight
for each attribute.
We need two probabilities.
One is the probability that two nodes will match on an
attribute by chance.
For example, the probability of sharing a birth month is 1/12.
Scoring Techniques
Once we have the two probabilities, we need to turn
them into weights.
There are two weights for each attribute.
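The weight formulas are not preserved on the slide. One common choice, taken from Fellegi-Sunter record linkage and used here purely as an assumption, calls the two probabilities m (the attribute matches when the two nodes really are the same entity) and u (the attribute matches by chance), and defines an agreement weight and a disagreement weight from them. The value m = 0.95 is made up; u = 1/12 is the birth-month example.

```python
import math

def attribute_weights(m, u):
    """Fellegi-Sunter style weights (an assumed formulation).

    agreement weight    = log2(m / u)
    disagreement weight = log2((1 - m) / (1 - u))
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Birth month: chance match u = 1/12; m = 0.95 is a hypothetical value
agree, disagree = attribute_weights(m=0.95, u=1 / 12)
```

Under this formulation a match on the attribute adds a positive weight to the pair's score and a mismatch adds a negative one, and attributes that rarely match by chance (small u) earn larger agreement weights.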
Score for partial match
Return to the `John Smith’ and `J Smith’ example: while
their first names are not an exact match, they are
close.
We may label this as a partial match.
For example, if we say `J’ is a 0.3 match for `John’, we would
add 0.3 times the weight for a name match.
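A brief sketch of how partial matches change the score: each attribute contributes its similarity (a value between 0 and 1) times its weight, instead of a strict 0 or 1. The weights and the five-attribute layout are hypothetical; only the 0.3 similarity for the first name comes from the example above.

```python
def similarity_score(similarities, weights):
    """Each attribute adds (similarity between 0 and 1) x (its weight)."""
    return sum(s * w for s, w in zip(similarities, weights))

weights = (2, 1, 1, 1, 10)              # hypothetical attribute weights
exact_only = (0, 1, 0, 0, 1)            # first name counted as a non-match
with_partial = (0.3, 1, 0, 0, 1)        # first name counted as a 0.3 match

score_exact = similarity_score(exact_only, weights)
score_partial = similarity_score(with_partial, weights)
```

The partial match raises the score by 0.3 times the first-name weight, which can be enough to push a borderline pair over the merge threshold.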
Comparison?
Application
Link Prediction
Friend Recommendation
Entity Resolution
Finding duplicate accounts
https://sensorstechforum.com/new-facebook-scam-fake-duplicate-accounts-fraud/
Incorporating Network Data
Relational data is useful for enhancing the attribute-
based similarity!
Decision to merge
The results from these similarity measures can be used
in addition to attribute data.
Jointly use two similarity scores!
Chapter 13
Social Information Filtering
Some of the information on social media will be
more useful than the rest, and the sheer volume means
a filter would be useful to sort through it.
Social sharing and social filtering
Social-sharing websites, like Digg, Slashdot, and Reddit, are
designed to share interesting content.
Automated recommender systems
Amazon, Netflix, and Pandora use recommender systems
(RS) to suggest items a user might like.
Traditional recommender systems
Recommender systems work in one of two ways:
Traditional RS: example
How to suggest new items to Alice? How much might she
like the movie Vertigo?
Compute the correlation between Alice and Bob (0.26),
and Alice and Chuck (0.83)!
[Table: the ratings listed for Vertigo are ?, 3, 4]
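A sketch of how the correlations could be used. Two things are assumed: the Vertigo ratings of 3 and 4 belong to Bob and Chuck respectively (Alice's is the unknown), and the prediction is a correlation-weighted average of their ratings. The slide states only the two correlation values (0.26 and 0.83); the sample rating lists passed to `pearson` are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two users' ratings of the same items."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def predict_rating(neighbor_ratings, correlations):
    """Average the neighbors' ratings, weighted by correlation with the user."""
    total = sum(correlations[user] * r for user, r in neighbor_ratings.items())
    weight = sum(abs(correlations[user]) for user in neighbor_ratings)
    return total / weight if weight else None

# Correlations from the slide; rating assignment is an assumption
correlations = {"Bob": 0.26, "Chuck": 0.83}
vertigo_ratings = {"Bob": 3, "Chuck": 4}
predicted = predict_rating(vertigo_ratings, correlations)

# Made-up rating lists, only to show pearson() in use
same_taste = pearson([1, 2, 3], [2, 4, 6])
opposite_taste = pearson([1, 2, 3], [3, 2, 1])
```

Because Alice's tastes correlate much more strongly with Chuck's, the prediction lands closer to Chuck's rating than to Bob's.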
Social Recommender Systems
Collaborative Filtering: an example of how algorithms
can leverage data from the crowd!
Case Study: Reddit Voting
Social News Website
Users vote stories up or down
Reddit
Case Study: Reddit Voting
Solution:
Estimate how many votes a post would get over time
Case Study: Trust-based RS
Trust-based Recommender Systems
Trust-based recommenders, like FilmTrust, use social
network information.
Trust
Alice does not know Dave.
Weighted average
Alice has given Bob a higher trust rating
Weighted average
Example with People
Bob, Chuck, and Dave all saw and rated the movie
`Night of the Living Dead’.
Their ratings are as follows on a five-star scale:
Bob: 5 stars
Chuck: 2 stars
Dave: 4 stars
Recommended rating:
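The recommended rating can be sketched as a trust-weighted average of the three ratings. The star ratings are from the example; the trust values Alice assigns (on an assumed 1 to 10 scale, whether given directly or inferred through the network) are hypothetical, since they are not preserved here.

```python
def trust_weighted_rating(ratings, trust):
    """Weighted average of ratings, using trust values as the weights."""
    total_trust = sum(trust[person] for person in ratings)
    if total_trust == 0:
        return None
    return sum(trust[p] * r for p, r in ratings.items()) / total_trust

ratings = {"Bob": 5, "Chuck": 2, "Dave": 4}   # stars, from the example
trust = {"Bob": 9, "Chuck": 3, "Dave": 6}     # hypothetical trust values

recommended = trust_weighted_rating(ratings, trust)
plain_average = sum(ratings.values()) / len(ratings)
```

With these made-up trust values the recommendation is pulled above the plain average, toward the rating of Bob, the most trusted of the three.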
How good is an RS?
Generally: RMSE (root mean squared error)
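A minimal sketch of RMSE for a recommender: compare predicted ratings with the ratings users actually gave, square the errors, average them, and take the square root. The sample ratings are made up.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predicted and actual star ratings for three items
error = rmse([4, 3, 5], [5, 3, 3])
```

Lower is better, and because the errors are squared before averaging, one prediction that is off by two stars hurts more than two predictions that are each off by one.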