Topic 3

Big Data Analytics
MS4252
Social Network Analysis III
Sources: Jennifer Golbeck 2013. Analyzing the Social Web.

Elsevier (chapter 7, 8, 9, 13)
1
Chapter 7
UNDERSTANDING STRUCTURE
THROUGH ATTRIBUTE AND
BEHAVIOR
2
Describe the Network
 To really understand what’s happening in a
network, look beyond the structure at
 users’ attributes, behaviour, the content they are
sharing, and their interactions
 Analyzing this content can lead to many insights

into the meaning of network structure
3
Attribute, Behavior, and Content
 Attributes are characteristics of a node
 Age, gender, or location
 Person’s religion or political preferences
 Behavior refers to actions of nodes

 How frequently she posts,
 Who and how many people she follows, or
 How often she clicks on shared links.
4
Attribute, Behavior, and Content
 Content is a broader term that refers to the
nonstructural information about a node
 Includes any information about the nodes or edges
beyond the structural features.
 Combination of attributes and behavior, like the types of
comments they post online or the topics they discuss.
 If a node represents a video or picture instead of a
person, the content would include the things depicted in
the video.
5
Structural Analysis
 Identify those clusters
 provide statistics about the network properties (like
density and connectivity)
 about the importance of individual nodes with measures
(like centrality)
6
How to interpret the three clusters?
 Built from the photo sharing website Flickr
 On Flickr, photos are labeled with descriptive
keywords called tags
 Nodes represent tags and an edge between tags
indicate that they were used to describe the
same image.
 e.g. if an image is tagged with the word “desk”
and “keyboard” the network would show a line
connecting those two words.
 Network is a 1.5 egocentric network of a single
tag
7
Now with content
8
Break down into Egocentric Network
 The network of three months of discussion on the CSS-Discuss
mailing list. Node size reflects the node’s out-degree in this directed
network. Edge means the the person has replied to another.
9
Chapter 8
BUILDING NETWORKS
10
First Question?
 What do the nodes represent?
 What do the edges represent?
 Know this before doing anything with data!
11
Example
 Facebook Network
 What are the nodes and edges?
 What can you see if you were able to visualize this

network?
 What would network statistics mean in this network?
 What patterns might emerge?
12
Network with Multiple node types
 Bipartite graphs have two node types that do not
have connections within the type
 E.g. no people connected to one another
 Graphs can have multiple node types and not be

bipartite
13
To build a network
 Step 1: Define Nodes
 What are they?
 What are the criteria for being included?
 Example, build a social network of people in

Company X.
 Full time employees
 Part-time employees
 Contractor who are hired to work temporarily
 …
14
To Build a Network
 Step 2: Define Edges
 What does a edge represent?
 What is the criteria for adding one?
 Same Example: Employee network at Company

X, two people are connected if
 Two people work in the same dept and work closely on
many projects.
 One person works for another.
 Two people participate in department-wide discussions
on a mailing list.
15
Sampling Methods
 Some networks may be too big to analyze in
their entirety.
 Millions of nodes and edges are difficult to understand,
impossible to visualize.
 Sampling-selecting a subset of nodes and edges

—is an effective and common way of obtaining a
reasonably sized dataset from a large one.
 Random sampling
 Snowball sampling
 Egocentric analysis
16
Example: Enron email network
 Node: messages
 Edge: any pair of nodes that have exchanged at least 10
emails.
17
Random Sampling
 Select a certain percentage of nodes and keep all edges
between them, or
 Select a certain percentage of edges and keep all the
nodes that are mentioned.
Random Edge Sampling
50% 25% 10%

18
Random Sampling
 Problems
 Edge sampling biased toward high degree nodes
 Node sampling loses some structural characteristics.
 Benefits
 Reduce the network to a smaller size in an even way
 A picture of the overall patterns of relationships and
clusters can be seen
 Node sampling keeps some network statistical features
19
Snowball Sampling
 A technique commonly used in sociology where
participants are interviewed
 When working with a large network, choose a

starting node (initial seed node)
 Get that node, its connections, their connections, and so

on until the network is the right size for analysis
20
Snowball Sampling
 Problems
 Biased toward the part of the network sampled, may
miss other features
 Benefits:
 Easy to do, common
21
Egocentric Network Analysis
 Instead of looking at the whole network, look at
the egocentric networks of some nodes.
 Randomly selected egocentric networks, or
 The networks of individuals selected based on certain
characteristics
 A different type of analysis than overall network

analysis, but it shows the role of an individual in
context
22
Egocentric Network Analysis
 A 1.5 egocentric network of a person who posts
in both English and Spanish
 People who post only in Spanish are shown in black,
those posting only in English are in white, and people
using multiple languages or a third language are in
gray.
23
Chapter 9
ENTITY RESOLUTION AND LINK

PREDICTION
24
Overview
 Link Prediction is a method of analysis that detects
where missing links should be present in the network.
 Entity resolution is a technique for merging nodes that

represent the same person.
25
Link Prediction
 Data often has errors in it, including missing
links,
 and link prediction could identify places where an
analyst might want to check to confirm that there
is no edge between a pair of nodes
 It can also be used to identify people.
The frequency to attend the meeting
26
Systematic methods for predicting link
 If we have two nodes, A and B, then the
score(A,B) indicates how closely connected A and
B are in the graph
 After computing the score for every pair of
nodes, the algorithm returns a ranked list
 The pairs with the highest score are predicted to
be the most likely new edges.
27
Score Method
 Shortest Path Length
 Common Neighbors
 Jaccard Index
 Adamic/Adar
 Preferential Attachment
28
Score – Shortest Path
 One of the simplest ways to score the similarity
or closeness of two nodes is to use the shortest
path length between them
 Score(A,B) = - shortestPath(A,B)
29
Score – Common Neighbors
 Another way of computing score
 uses more information from the network structure is to
count the number of common neighbors between the
two nodes in a pair
 Score(A, G) = Neighbor(A) ∩ Neighbor(G)
30
Score – Common Neighbors
 The number of common neighbors makes social
sense, too.
 The more friends two people have in common, the more
likely they are to be introduced to one another.
 However, some people have an abnormally high

number of connections in social networks, like
singers
 many friends in common may not mean much since
they are connected
31
Score – Jaccard Index
 The Jaccard Index counts the total number of
friends in common and divides that by the total
number of people who are friends of either node
32
Example
 Four nodes: Alice, Bob, Chuck, and Dave.
 Let Alice and Bob be celebrities, each with 1 million
friends; Chuck and Dave are average users with 100
friends each.
 Now say Alice and Bob have 2,000 friends in common
while Chuck and Dave have only 20 friends in common.
 Although Alice and Bob may seem to be more strongly
connected than Chuck and Dave, since they have 100
times more common friends, the Jaccard Index
indicates this is not the case
 What is Jaccard Index for Alice and Bob? How
about Chuck and Dave?
 What if the 20 people Chuck and Dave know in
common are also celebrities
33
Score – Adamic/Adar
 Sum up the inverse log of each neighbor’s degree
(common)
 Giving more weights to friends with fewer edges.
 The formula is
34
Score – Preferential Attachments
 The network principle states nodes with a high degree
are more likely to gain new links.
 Popular nodes are more likely to gain new friends than less
popular nodes.
 When predicting edges, preferential attachments

suggest that nodes with high degree are more likely to
gain new edges.
35
Score – Preferential Attachments
 The formula for this score method is relatively is
 With this measure, we would predict that the

next edge to appear will be between nodes A and
G.
36
Entity Resolution
 Identify nodes that represent the same entity
and then to merge them together
 Entity resolution is a technique for merging
nodes that represent the same person.
 John and J. Smith are actually the same person
37
Scoring Techniques
 For each pair of nodes, we can compute a score that
represents the likelihood that they are the same node.
 Then we can set a threshold value.
 Any pair of nodes with score above the threshold will be
merged.
38
Scoring Techniques
 To create a score for each pair of nodes, we will
consider similarity on a set of attributes.
 For each pair of nodes, record 1 if their values match and 0
otherwise for a given attribute.
 However, matching on some attributes is more important than

on others (weights).
 Then, we add the positive weights for each attribute where

nodes match, and add the negative weights when they do not
match.
39
Scoring Techniques
 Example: two nodes (J Smith and John Smith) and five
attributes
 The vector of match/nonmatch is (0, 1, 0, 0, 1)
 Suppose the weights are ()
 We know that because a match on SSN is much
definitive.
40
Scoring Techniques
 We are left to find a method for computing the weight
for each attribute.
 We need two probabilities.
 Probability : the probability that the two nodes will match on a
attribute by chance.
 Example, same birth month is 1/12
 Probability : the probability that two nodes that represent the

same person will have the same value (close to 1).
 Example, the m probability for the birth month is 0.95
41
Scoring Techniques
 Once we have the and probabilities, we need to turn
them into weights.
 There are two weights for each attribute.
 The common formulas are:

 For a match,
 For a nonmatch,
 For the month attribute, for a match and for a

nonmatch.
42
Score for partial match
 Return to the `John Smith’ and `J Smith’ example, while
their first names are not an exact match, they are
closed.
 We may label this as a partial match.
 For example, if we say `J’ is a 0.3 match for `John’, we would
add 0.3 times the weight for a name match.
 In this approach, the formula that a partial match with

value we would add the following to the score:
 Comparison? 43
Application
 Link Prediction
 Friend Recommendation
 Entity Resolution
 Finding duplicated account
https://sensorstechforum.com/new-facebook-scam-fake-duplicate-accounts-fraud/
44
Incorporating Network Data
 Relational data is useful for enhancing the attribute-
based similarity!
45
Decision to merge
 The results from these similarity measures can be used
in addition to attribute data.
 Jointly use two similarity scores!
46
Chapter 13
SOCIAL INFORMATION FILTERING
47
Social Information Filtering
 Some of the information on the social media will be
more useful than the rest, and the sheer volume means
a filter would be useful to sort through it.
 This chapter introduces methods used to sort, filter, and

aggregate information from the social media.
 Two major approaches:

 Social voting
 Recommender systems
48
Social sharing and social filtering
 Social-sharing websites, like Digg, Slashdot, Reddit, are
designed to share interesting content.
 The community then votes items up or down, and the

most interesting links are highlighted.
 The reliance on large number of people to help

complete a task is a type pf crowdsourcing.
49
Automated recommender systems
 Amazon, Netflex, Pandora, use recommender systems
(RS) to suggest items a user might like.
 The RS leverages a user’s previous activity, either on

explicit data such as user ratings, or data implicitly
captured from user’s behavior such making a purchase
or viewing an item.
 The core idea behind RS is to take data created by other

people and personalize it for the individual user.
50
Traditional recommender systems
 Recommender systems work in one the two ways:
 Suggesting items similar to the ones a person likes (item-based)

 Suggesting item liked by people who are similar to the user
(user-based; collaborative filtering)
 Item based recommendation is not very social, but the

user-based RS relies on other people’s actions.
51
Traditional RS: example
 How to suggest new items to Alice? How much she
might like movie Vertigo?
 Compute the correlation between Alice and Bob (0.26),
and Alice and Chuck (0.83)!
Vertigo
?
3
4 52
Social Recommender Systems
 Collaborative Filtering: an example of how algorithms
can leverage data from crowd!
 With the rise of social media, RS have started to

consider social connections in addition to similarity.
 Example: In twitter, when a user searches for a term,

the search results can be shown in three ways:
 All tweets that match the search
 ‘Top’ tweets
 Tweets only posted by people the user knows
53
Case Study: Reddit Voting
 Social News Website
 Users vote stories up or down
 How to count votes?

 Using the total is biased to older posts
 Using percentage can skew high or low for new posts
54
Reddit
55
Case Study: Reddit Voting
 Solution:
 Estimate how many votes a post would get over time
 Use a 95% confidence interval to forecast.
 When Reddit is relatively sure a new post will get more

up votes than an older one, it ranks it higher.
56
Case Study: Trust-based RS
57
Trust-based Recommender Systems
 Trust-based recommenders, like FilmTrust, use social
network information.
 Instead of computing similarity between two users as

the weight, we can use trust value from social network
data.
 The system estimates how much the user might trust

other people’s opinion. How?
58
Trust
 Alice does not know Dave.
 Dave is trustworthy at a level 3 out of 10, and Chuck gives Dave

a 10 out of 10
 Alice could average these values, and estimate Dave’s
trustworthiness at (10+3)/2=6.5.
 However, Alice trusts Bob’s opinion far more than Chuck’s.
59
Weighted average
 Alice has given Bob a higher trust rating
 Weighted average
60
Example with People
 Bob, Chuck, and Dave all saw and rated the move
`Night of the Living Dead’.
 Their ratings are as follows on a five-star scale:
 Bob: 5 stars
 Chuck: 2 stars
 Dave: 4 stars
 Recommended rating:
61
How good is a RS?
 Generally: RMSE
 Error = My rating – Recommended Rating

 Compute the RMSE
 Back to Example, when user’s rating of a movie moved away

from the average rating, the trusted based rating greatly
outperformed collaborative filtering algorithm.
 Need alternative ways of evaluating systems

 Serendipity over accuracy
 Diversity
62

Topic 3

Uploaded by

Copyright:

Available Formats

Topic 3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Topic 3

Uploaded by

Copyright:

Available Formats

Big Data Analytics

Sources: Jennifer Golbeck 2013. Analyzing the Social Web.

 Analyzing this content can lead to many insights

 Behavior refers to actions of nodes

 What do the edges represent?

 Know this before doing anything with data!

 What can you see if you were able to visualize this

 Graphs can have multiple node types and not be

 Example, build a social network of people in

 Same Example: Employee network at Company

 Sampling-selecting a subset of nodes and edges

50% 25% 10%

 When working with a large network, choose a

 Get that node, its connections, their connections, and so

 A different type of analysis than overall network

ENTITY RESOLUTION AND LINK

 Entity resolution is a technique for merging nodes that

The frequency to attend the meeting

 However, some people have an abnormally high

 When predicting edges, preferential attachments

 With this measure, we would predict that the

 However, matching on some attributes is more important than

 Then, we add the positive weights for each attribute where

 Probability : the probability that two nodes that represent the

 The common formulas are:

 For the month attribute, for a match and for a

 In this approach, the formula that a partial match with

SOCIAL INFORMATION FILTERING

 This chapter introduces methods used to sort, filter, and

 Two major approaches:

 The community then votes items up or down, and the

 The reliance on large number of people to help

 The RS leverages a user’s previous activity, either on

 The core idea behind RS is to take data created by other

 Suggesting items similar to the ones a person likes (item-based)

 Item based recommendation is not very social, but the

 With the rise of social media, RS have started to

 Example: In twitter, when a user searches for a term,

 How to count votes?

 Use a 95% confidence interval to forecast.

 When Reddit is relatively sure a new post will get more

 Instead of computing similarity between two users as

 The system estimates how much the user might trust

 Dave is trustworthy at a level 3 out of 10, and Chuck gives Dave

 Error = My rating – Recommended Rating

 Back to Example, when user’s rating of a movie moved away

 Need alternative ways of evaluating systems

You might also like