Topic 3

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 62

Big Data Analytics

MS4252
Social Network Analysis III

Sources: Jennifer Golbeck 2013. Analyzing the Social Web.


Elsevier (chapter 7, 8, 9, 13)

1
Chapter 7

UNDERSTANDING STRUCTURE
THROUGH ATTRIBUTE AND
BEHAVIOR
2
Describe the Network
 To really understand what’s happening in a
network, look beyond the structure at
 users’ attributes, behaviour, the content they are
sharing, and their interactions

 Analyzing this content can lead to many insights


into the meaning of network structure

3
Attribute, Behavior, and Content
 Attributes are characteristics of a node
 Age, gender, or location
 Person’s religion or political preferences

 Behavior refers to actions of nodes


 How frequently she posts,
 Who and how many people she follows, or
 How often she clicks on shared links.

4
Attribute, Behavior, and Content
 Content is a broader term that refers to the
nonstructural information about a node
 Includes any information about the nodes or edges
beyond the structural features.
 Combination of attributes and behavior, like the types of
comments they post online or the topics they discuss.
 If a node represents a video or picture instead of a
person, the content would include the things depicted in
the video.

5
Structural Analysis
 Identify those clusters
 provide statistics about the network properties (like
density and connectivity)
 about the importance of individual nodes with measures
(like centrality)

6
How to interpret the three clusters?
 Built from the photo sharing website Flickr
 On Flickr, photos are labeled with descriptive
keywords called tags
 Nodes represent tags and an edge between tags
indicate that they were used to describe the
same image.
 e.g. if an image is tagged with the word “desk”
and “keyboard” the network would show a line
connecting those two words.
 Network is a 1.5 egocentric network of a single
tag

7
Now with content

8
Break down into Egocentric Network
 The network of three months of discussion on the CSS-Discuss
mailing list. Node size reflects the node’s out-degree in this directed
network. Edge means the the person has replied to another.

9
Chapter 8

BUILDING NETWORKS

10
First Question?
 What do the nodes represent?

 What do the edges represent?

 Know this before doing anything with data!

11
Example
 Facebook Network
 What are the nodes and edges?

 What can you see if you were able to visualize this


network?
 What would network statistics mean in this network?
 What patterns might emerge?

12
Network with Multiple node types
 Bipartite graphs have two node types that do not
have connections within the type
 E.g. no people connected to one another

 Graphs can have multiple node types and not be


bipartite

13
To build a network
 Step 1: Define Nodes
 What are they?
 What are the criteria for being included?

 Example, build a social network of people in


Company X.
 Full time employees
 Part-time employees
 Contractor who are hired to work temporarily
 …

14
To Build a Network
 Step 2: Define Edges
 What does a edge represent?
 What is the criteria for adding one?

 Same Example: Employee network at Company


X, two people are connected if
 Two people work in the same dept and work closely on
many projects.
 One person works for another.
 Two people participate in department-wide discussions
on a mailing list.

15
Sampling Methods
 Some networks may be too big to analyze in
their entirety.
 Millions of nodes and edges are difficult to understand,
impossible to visualize.

 Sampling-selecting a subset of nodes and edges


—is an effective and common way of obtaining a
reasonably sized dataset from a large one.
 Random sampling
 Snowball sampling
 Egocentric analysis

16
Example: Enron email network
 Node: messages
 Edge: any pair of nodes that have exchanged at least 10
emails.

17
Random Sampling
 Select a certain percentage of nodes and keep all edges
between them, or
 Select a certain percentage of edges and keep all the
nodes that are mentioned.
Random Edge Sampling

50% 25% 10%


18
Random Sampling
 Problems
 Edge sampling biased toward high degree nodes
 Node sampling loses some structural characteristics.

 Benefits
 Reduce the network to a smaller size in an even way
 A picture of the overall patterns of relationships and
clusters can be seen
 Node sampling keeps some network statistical features

19
Snowball Sampling
 A technique commonly used in sociology where
participants are interviewed

 When working with a large network, choose a


starting node (initial seed node)

 Get that node, its connections, their connections, and so


on until the network is the right size for analysis

20
Snowball Sampling
 Problems
 Biased toward the part of the network sampled, may
miss other features
 Benefits:
 Easy to do, common

21
Egocentric Network Analysis
 Instead of looking at the whole network, look at
the egocentric networks of some nodes.
 Randomly selected egocentric networks, or
 The networks of individuals selected based on certain
characteristics

 A different type of analysis than overall network


analysis, but it shows the role of an individual in
context

22
Egocentric Network Analysis
 A 1.5 egocentric network of a person who posts
in both English and Spanish
 People who post only in Spanish are shown in black,
those posting only in English are in white, and people
using multiple languages or a third language are in
gray.

23
Chapter 9

ENTITY RESOLUTION AND LINK


PREDICTION

24
Overview
 Link Prediction is a method of analysis that detects
where missing links should be present in the network.

 Entity resolution is a technique for merging nodes that


represent the same person.

25
Link Prediction
 Data often has errors in it, including missing
links,
 and link prediction could identify places where an
analyst might want to check to confirm that there
is no edge between a pair of nodes
 It can also be used to identify people.

The frequency to attend the meeting

26
Systematic methods for predicting link
 If we have two nodes, A and B, then the
score(A,B) indicates how closely connected A and
B are in the graph
 After computing the score for every pair of
nodes, the algorithm returns a ranked list
 The pairs with the highest score are predicted to
be the most likely new edges.

27
Score Method
 Shortest Path Length

 Common Neighbors

 Jaccard Index

 Adamic/Adar

 Preferential Attachment

28
Score – Shortest Path
 One of the simplest ways to score the similarity
or closeness of two nodes is to use the shortest
path length between them
 Score(A,B) = - shortestPath(A,B)

29
Score – Common Neighbors
 Another way of computing score
 uses more information from the network structure is to
count the number of common neighbors between the
two nodes in a pair
 Score(A, G) = Neighbor(A) ∩ Neighbor(G)

30
Score – Common Neighbors
 The number of common neighbors makes social
sense, too.
 The more friends two people have in common, the more
likely they are to be introduced to one another.

 However, some people have an abnormally high


number of connections in social networks, like
singers
 many friends in common may not mean much since
they are connected

31
Score – Jaccard Index
 The Jaccard Index counts the total number of
friends in common and divides that by the total
number of people who are friends of either node

32
Example
 Four nodes: Alice, Bob, Chuck, and Dave.
 Let Alice and Bob be celebrities, each with 1 million
friends; Chuck and Dave are average users with 100
friends each.
 Now say Alice and Bob have 2,000 friends in common
while Chuck and Dave have only 20 friends in common.
 Although Alice and Bob may seem to be more strongly
connected than Chuck and Dave, since they have 100
times more common friends, the Jaccard Index
indicates this is not the case
 What is Jaccard Index for Alice and Bob? How
about Chuck and Dave?
 What if the 20 people Chuck and Dave know in
common are also celebrities
33
Score – Adamic/Adar
 Sum up the inverse log of each neighbor’s degree
(common)
 Giving more weights to friends with fewer edges.
 The formula is

34
Score – Preferential Attachments
 The network principle states nodes with a high degree
are more likely to gain new links.
 Popular nodes are more likely to gain new friends than less
popular nodes.

 When predicting edges, preferential attachments


suggest that nodes with high degree are more likely to
gain new edges.

35
Score – Preferential Attachments
 The formula for this score method is relatively is

 With this measure, we would predict that the


next edge to appear will be between nodes A and
G.

36
Entity Resolution
 Identify nodes that represent the same entity
and then to merge them together
 Entity resolution is a technique for merging
nodes that represent the same person.
 John and J. Smith are actually the same person

37
Scoring Techniques
 For each pair of nodes, we can compute a score that
represents the likelihood that they are the same node.
 Then we can set a threshold value.
 Any pair of nodes with score above the threshold will be
merged.

38
Scoring Techniques
 To create a score for each pair of nodes, we will
consider similarity on a set of attributes.
 For each pair of nodes, record 1 if their values match and 0
otherwise for a given attribute.

 However, matching on some attributes is more important than


on others (weights).

 Then, we add the positive weights for each attribute where


nodes match, and add the negative weights when they do not
match.

39
Scoring Techniques
 Example: two nodes (J Smith and John Smith) and five
attributes
 The vector of match/nonmatch is (0, 1, 0, 0, 1)
 Suppose the weights are ()
 We know that because a match on SSN is much
definitive.

40
Scoring Techniques
 We are left to find a method for computing the weight
for each attribute.
 We need two probabilities.
 Probability : the probability that the two nodes will match on a
attribute by chance.
 Example, same birth month is 1/12

 Probability : the probability that two nodes that represent the


same person will have the same value (close to 1).
 Example, the m probability for the birth month is 0.95

41
Scoring Techniques
 Once we have the and probabilities, we need to turn
them into weights.
 There are two weights for each attribute.

 The common formulas are:


 For a match,
 For a nonmatch,

 For the month attribute, for a match and for a


nonmatch.

42
Score for partial match
 Return to the `John Smith’ and `J Smith’ example, while
their first names are not an exact match, they are
closed.
 We may label this as a partial match.
 For example, if we say `J’ is a 0.3 match for `John’, we would
add 0.3 times the weight for a name match.

 In this approach, the formula that a partial match with


value we would add the following to the score:

 Comparison? 43
Application
 Link Prediction
 Friend Recommendation

 Entity Resolution
 Finding duplicated account

https://sensorstechforum.com/new-facebook-scam-fake-duplicate-accounts-fraud/
44
Incorporating Network Data
 Relational data is useful for enhancing the attribute-
based similarity!

45
Decision to merge
 The results from these similarity measures can be used
in addition to attribute data.
 Jointly use two similarity scores!

46
Chapter 13

SOCIAL INFORMATION FILTERING

47
Social Information Filtering
 Some of the information on the social media will be
more useful than the rest, and the sheer volume means
a filter would be useful to sort through it.

 This chapter introduces methods used to sort, filter, and


aggregate information from the social media.

 Two major approaches:


 Social voting
 Recommender systems

48
Social sharing and social filtering
 Social-sharing websites, like Digg, Slashdot, Reddit, are
designed to share interesting content.

 The community then votes items up or down, and the


most interesting links are highlighted.

 The reliance on large number of people to help


complete a task is a type pf crowdsourcing.

49
Automated recommender systems
 Amazon, Netflex, Pandora, use recommender systems
(RS) to suggest items a user might like.

 The RS leverages a user’s previous activity, either on


explicit data such as user ratings, or data implicitly
captured from user’s behavior such making a purchase
or viewing an item.

 The core idea behind RS is to take data created by other


people and personalize it for the individual user.

50
Traditional recommender systems
 Recommender systems work in one the two ways:

 Suggesting items similar to the ones a person likes (item-based)


 Suggesting item liked by people who are similar to the user
(user-based; collaborative filtering)

 Item based recommendation is not very social, but the


user-based RS relies on other people’s actions.

51
Traditional RS: example
 How to suggest new items to Alice? How much she
might like movie Vertigo?
 Compute the correlation between Alice and Bob (0.26),
and Alice and Chuck (0.83)!

Vertigo

?
3
4 52
Social Recommender Systems
 Collaborative Filtering: an example of how algorithms
can leverage data from crowd!

 With the rise of social media, RS have started to


consider social connections in addition to similarity.

 Example: In twitter, when a user searches for a term,


the search results can be shown in three ways:
 All tweets that match the search
 ‘Top’ tweets
 Tweets only posted by people the user knows

53
Case Study: Reddit Voting
 Social News Website
 Users vote stories up or down

 How to count votes?


 Using the total is biased to older posts
 Using percentage can skew high or low for new posts

54
Reddit

55
Case Study: Reddit Voting
 Solution:
 Estimate how many votes a post would get over time

 Use a 95% confidence interval to forecast.

 When Reddit is relatively sure a new post will get more


up votes than an older one, it ranks it higher.

56
Case Study: Trust-based RS

57
Trust-based Recommender Systems
 Trust-based recommenders, like FilmTrust, use social
network information.

 Instead of computing similarity between two users as


the weight, we can use trust value from social network
data.

 The system estimates how much the user might trust


other people’s opinion. How?

58
Trust
 Alice does not know Dave.

 Dave is trustworthy at a level 3 out of 10, and Chuck gives Dave


a 10 out of 10
 Alice could average these values, and estimate Dave’s
trustworthiness at (10+3)/2=6.5.
 However, Alice trusts Bob’s opinion far more than Chuck’s.

59
Weighted average
 Alice has given Bob a higher trust rating

 Weighted average

60
Example with People
 Bob, Chuck, and Dave all saw and rated the move
`Night of the Living Dead’.
 Their ratings are as follows on a five-star scale:
 Bob: 5 stars
 Chuck: 2 stars
 Dave: 4 stars

 Recommended rating:

61
How good is a RS?
 Generally: RMSE

 Error = My rating – Recommended Rating


 Compute the RMSE

 Back to Example, when user’s rating of a movie moved away


from the average rating, the trusted based rating greatly
outperformed collaborative filtering algorithm.

 Need alternative ways of evaluating systems


 Serendipity over accuracy
 Diversity

62

You might also like