
Unit6-1Social Network Analysis

The document discusses social network analysis. It covers social network introductions, statistics and probability theory used in analysis, models for generating social networks including random graphs, Watts-Strogatz models, and scale-free networks. It also discusses analyzing networks in biological systems and mining social networks.

Uploaded by

Sai Gopal Bonthu

Social Network Analysis

 Social Network Introduction


 Statistics and Probability Theory
 Models of Social Network Generation
 Networks in Biological System
 Mining on Social Network
 Summary
May 5, 2023 Data Mining: Concepts and Techniques 1
Society

Nodes: individuals
Links: social relationship
(family/work/friendship/etc.)

S. Milgram (1967)
Six Degrees of Separation
John Guare

Social networks: Many individuals with diverse social interactions between them.


Communication networks
The Earth is developing an electronic nervous system, a network with diverse nodes and links.

Nodes: computers, routers, satellites
Links: phone lines, TV cables, EM waves

Communication networks: Many non-identical components with diverse connections between them.


“Natural” Networks and Universality
 Consider many kinds of networks:
 social, technological, business, economic, content,…

 These networks tend to share certain informal properties:


 large scale; continual growth

 distributed, organic growth: vertices “decide” who to link to

 interaction restricted to links

 mixture of local and long-distance connections

 abstract notions of distance: geographical, content, social,…

 Do natural networks share more quantitative universals?


 What would these “universals” be?
 How can we make them precise and measure them?
 How can we explain their universality?
 This is the domain of social network theory
 Sometimes also referred to as link analysis



Some Interesting Quantities
 Connected components:
 how many, and how large?

 Network diameter:
 maximum (worst-case) or average?

 exclude infinite distances? (disconnected components)

 the small-world phenomenon

 Clustering:
 to what extent do links tend to cluster “locally”?

 what is the balance between local and long-distance connections?

 what roles do the two types of links play?

 Degree distribution:
 what is the typical degree in the network?

 what is the overall distribution?



A “Canonical” Natural Network has…
 Few connected components:
 often only 1 or a small number, indep. of network size

 Small diameter:
 often a constant independent of network size (like 6)

 or perhaps growing only logarithmically with network size, or even shrinking?
 typically exclude infinite distances

 A high degree of clustering:


 considerably more so than for a random network

 in tension with small diameter

 A heavy-tailed degree distribution:


 a small but reliable number of high-degree vertices

 often of power law form

The Poisson Distribution

[Figure: single photoelectron distribution, shown as an example of a Poisson distribution]


Zipf’s Law

The same data plotted on linear and logarithmic scales. Both plots show a Zipf distribution with 300 data points.

[Figure, two panels: linear scales on both axes; logarithmic scales on both axes]
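
The point of the two panels can be reproduced numerically. The following is a minimal sketch, assuming the plain Zipf form f(r) proportional to 1/r^s with s = 1 (the slide does not state the exponent); the function names are illustrative. On log-log axes such data falls on a straight line with slope -s.

```python
import math

def zipf_frequencies(n, s=1.0):
    """Normalized Zipf frequencies f(r) proportional to 1 / r**s for ranks 1..n."""
    weights = [1.0 / r ** s for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank)."""
    xs = [math.log(r) for r in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# 300 data points, as on the slide; the exponent s = 1 is an assumption.
freqs = zipf_frequencies(300)
slope = loglog_slope(freqs)
print(f"log-log slope: {slope:.3f}")
```

On linear axes the same values hug both axes, which is why the logarithmic view is the usual way to inspect heavy-tailed data.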

Some Models of Network Generation
 Random graphs (Erdös-Rényi models):
 gives few components and small diameter

 does not give high clustering and heavy-tailed degree distributions

 is the mathematically most well-studied and understood model

 Watts-Strogatz models:
 give few components, small diameter and high clustering

 does not give heavy-tailed degree distributions

 Scale-free Networks:
 gives few components, small diameter and heavy-tailed distribution

 does not give high clustering

 Hierarchical networks:
 few components, small diameter, high clustering, heavy-tailed

 Affiliation networks:
 models group-actor formation



Models of Social Network Generation

 Random Graphs (Erdös-Rényi models)


 Watts-Strogatz models
 Scale-free Networks



The Erdös-Rényi (ER) Model
(Random Graphs)
 All edges are equally probable and appear independently
 Network (NW) size N > 1 and edge probability p define the distribution G(N,p)
 each edge (u,v) chosen to appear with probability p

 N(N-1)/2 trials of a biased coin flip

 The usual regime of interest is when p ~ 1/N, N is large


 e.g. p = 1/(2N), p = 1/N, p = 2/N, p = 10/N, p = log(N)/N, etc.

 in expectation, each vertex will have a “small” number of neighbors

 will then examine what happens when N → infinity

 can thus study properties of large networks with bounded degree

 Degree distribution of a typical G drawn from G(N,p):


 draw G according to G(N,p); look at a random vertex u in G

 what is Pr[deg(u) = k] for any fixed k?

 approximately a Poisson distribution with mean λ = p(N-1) ~ pN (the exact distribution is Binomial(N-1, p))

 Sharply concentrated; not heavy-tailed

 Especially easy to generate NWs from G(N,p)
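
The generation procedure above amounts to one biased coin flip per vertex pair. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets (the function name `sample_gnp`, the seed, and the choice N = 1000, p = 2/N are illustrative, not from the slides):

```python
import random
from collections import Counter

def sample_gnp(n, p, seed=0):
    """Draw one graph from G(n, p): each of the n(n-1)/2 edges appears independently with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # one biased coin flip per vertex pair
                adj[u].add(v)
                adj[v].add(u)
    return adj

n = 1000
p = 2 / n  # the "p ~ 1/N" regime
graph = sample_gnp(n, p)
degrees = [len(nbrs) for nbrs in graph.values()]
mean_degree = sum(degrees) / n
degree_counts = Counter(degrees)  # empirical degree distribution

print(f"mean degree {mean_degree:.3f}, expected p(N-1) = {p * (n - 1):.3f}")
print("largest degree:", max(degrees))
```

The empirical mean degree should sit close to p(N-1), and the largest degree stays small, which illustrates the "sharply concentrated; not heavy-tailed" remark.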


Erdös-Rényi Model (1960)

Pál Erdös (1913-1996)

Connect each pair of nodes with probability p

[Figure: example graph with N = 10, p = 1/6, <k> ~ 1.5; Poisson degree distribution]

- Democratic
- Random


The Clustering Coefficient of a Network
 Let nbr(u) denote the set of neighbors of u in a graph
 all vertices v such that the edge (u,v) is in the graph

 The clustering coefficient of u:


 let k = |nbr(u)| (i.e., number of neighbors of u)

 choose(k,2): max possible # of edges between vertices in nbr(u)

 c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2)

 0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood

 Clustering coefficient of a graph:


 average of c(u) over all vertices u

k=4
choose(k,2) = 6
c(u) = 4/6 = 0.666…
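
The definition translates almost line for line into code. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; returning 0 for vertices with fewer than two neighbors is a convention chosen here, not something the slide specifies. The example graph is hypothetical and only chosen to reproduce the slide's numbers (k = 4, four edges among the neighbors).

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """c(u) = (edges among neighbors of u) / choose(k, 2), with k = |nbr(u)|."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: undefined ratio treated as 0
    links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

def graph_clustering(adj):
    """Clustering coefficient of the graph: average of c(u) over all vertices."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

# Hypothetical example: u has 4 neighbors with 4 edges among them.
edges = [("u", "a"), ("u", "b"), ("u", "c"), ("u", "d"),
         ("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

c_u = clustering_coefficient(adj, "u")
c_graph = graph_clustering(adj)
print(f"c(u) = {c_u:.3f}, graph average = {c_graph:.3f}")
```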



The Clustering Coefficient of a Network
Clustering: My friends will likely know each other!

Probability to be connected: C >> p

C = (# of links between the n neighbors 1, 2, …, n) / (n(n-1)/2)

Networks are clustered [large C(p)] but have a small characteristic path length [small L(p)].

Network        C          Crand     L          N
WWW            0.1078     0.00023   3.1        153127
Internet       0.18-0.3   0.001     3.7-3.76   3015-6209
Actor          0.79       0.00027   3.65       225226
Coauthorship   0.43       0.00018   5.9        52909
Metabolic      0.32       0.026     2.9        282
Foodweb        0.22       0.06      2.43       134
C. elegans     0.28       0.05      2.65       282

Small Worlds and Occam’s Razor
 For small , should generate large clustering coefficients
 we “programmed” the model to do so

 Watts claims that proving precise statements is hard…

 But we do not want a new model for every little property


 Erdös-Rényi → small diameter

 α-model → high clustering coefficient

 In the interests of Occam’s Razor, we would like to find


 a single, simple model of network generation…

 … that simultaneously captures many properties

 Watts’ small world: small diameter and high clustering


Case 1: Kevin Bacon Graph
 Vertices: actors and actresses
 Edge between u and v if they appeared in a film together

Kevin Bacon
No. of movies: 46
No. of actors: 1811
Average separation: 2.79

Is Kevin Bacon the most connected actor? NO!

Rank  Name                Average distance  # of movies  # of links
1     Rod Steiger         2.537527          112          2562
2     Donald Pleasence    2.542376          180          2874
3     Martin Sheen        2.551210          136          3501
4     Christopher Lee     2.552497          201          2993
5     Robert Mitchum      2.557181          136          2905
6     Charlton Heston     2.566284          104          2552
7     Eddie Albert        2.567036          112          3333
8     Robert Vaughn       2.570193          126          2761
9     Donald Sutherland   2.577880          107          2865
10    John Gielgud        2.578980          122          2942
11    Anthony Quinn       2.579750          146          2978
12    James Earl Jones    2.584440          112          3787
…
876   Kevin Bacon         2.786981          46           1811


[Figure: actor network highlighting #1 Rod Steiger, #2 Donald Pleasence, #3 Martin Sheen and #876 Kevin Bacon]


World Wide Web

Nodes: WWW documents
Links: URL links
800 million documents (S. Lawrence, 1999)

ROBOT: collects all URLs found in a document and follows them recursively

R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999)

World Wide Web

Expected Result
$\langle k \rangle \sim 6$
$P(k=500) \sim 10^{-99}$
$N_{WWW} \sim 10^{9}$
→ $N(k=500) \sim 10^{-90}$

Real Result
$P_{out}(k) \sim k^{-\gamma_{out}}$, $\gamma_{out} = 2.45$
$P_{in}(k) \sim k^{-\gamma_{in}}$, $\gamma_{in} = 2.1$
$P(k=500) \sim 10^{-6}$
$N_{WWW} \sim 10^{9}$
→ $N(k=500) \sim 10^{3}$

J. Kleinberg, et al., Proceedings of the ICCC (1999)
World Wide Web
[Figure: example directed graph with nodes 1 to 7]
$l_{15} = 2$ [1→2→5]
$l_{17} = 4$ [1→3→4→6→7]
… $\langle l \rangle = ??$
 Finite size scaling: create a network with N nodes with $P_{in}(k)$ and $P_{out}(k)$

$\langle l \rangle = 0.35 + 2.06 \log(N)$

19 degrees of separation
R. Albert et al Nature (99)
based on 800 million webpages [S. Lawrence et al Nature (99)]
A. Broder et al WWW9 (00)

[Figure: $\langle l \rangle$ plotted against N; plot labels: nd.edu, IBM]
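
The "19 degrees of separation" figure follows from plugging the 800 million pages into the fitted scaling law. A minimal sketch, assuming the logarithm is base 10 (the slide writes only log(N); base 10 is the reading that reproduces the quoted value):

```python
import math

def expected_path_length(n_pages):
    """Fitted scaling law from the slide: <l> = 0.35 + 2.06 log(N), with log taken as base 10 (assumption)."""
    return 0.35 + 2.06 * math.log10(n_pages)

avg_separation = expected_path_length(8e8)  # 800 million web pages
print(f"<l> = {avg_separation:.2f}")  # rounds to 19
```

Because the dependence is logarithmic, even a tenfold increase in N adds only about two clicks to the average separation.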



Scale-free Networks

 The number of nodes (N) is not fixed


 Networks continuously expand by the addition of new nodes
 WWW: addition of new nodes
 Citation: publication of new papers
 The attachment is not uniform
 A node is linked with higher probability to a node that
already has a large number of links
 WWW: new documents link to well known sites
(CNN, Yahoo, Google)
 Citation: Well cited papers are more likely to be
cited again
Scale-Free Networks
 Start with (say) two vertices connected by an edge
 For i = 3 to N:
 for each 1 <= j < i, d(j) = degree of vertex j so far

 let Z = Σ_j d(j) (sum of all degrees so far)

 add new vertex i with k edges back to {1, …, i-1}:

 i is connected back to j with probability d(j)/Z

 Vertices j with high degree are likely to get more links!


 “Rich get richer”
 Natural model for many processes:
 hyperlinks on the web

 new business and social contacts

 transportation networks

 Generates a power law distribution of degrees


 exponent depends on value of k
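
The growth procedure above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the function name, the seed and the default k = 1 are choices made here, and when k > 1 the k targets are forced to be distinct, which deviates slightly from independent draws with probability d(j)/Z.

```python
import random
from collections import Counter

def preferential_attachment(n, k=1, seed=0):
    """Grow a graph on vertices 1..n; each new vertex i adds k edges to earlier vertices,
    choosing vertex j with probability d(j)/Z."""
    rng = random.Random(seed)
    edges = [(1, 2)]   # start with two vertices connected by an edge
    endpoints = [1, 2]  # vertex j appears d(j) times, so a uniform draw has probability d(j)/Z
    for i in range(3, n + 1):
        targets = set()
        wanted = min(k, i - 1)  # cannot link to more vertices than exist
        while len(targets) < wanted:
            targets.add(rng.choice(endpoints))
        for j in sorted(targets):
            edges.append((i, j))
            endpoints.extend((i, j))
    return edges

edges = preferential_attachment(2000, k=1)
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print("edges:", len(edges))
print("five largest degrees:", sorted(degree.values(), reverse=True)[:5])
```

Keeping a flat list of edge endpoints avoids recomputing Z at every step: the list has length Z, and high-degree vertices simply occupy more slots. The printed top degrees show the "rich get richer" effect, with a few early vertices far above the typical degree.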



Scale-Free Networks

 Preferential attachment explains


 heavy-tailed degree distributions
 small diameter (~log(N), via “hubs”)

 Will not generate high clustering coefficient


 no bias towards local connectivity, but towards hubs



Case1: Internet Backbone

Nodes: computers, routers


Links: physical lines

(Faloutsos, Faloutsos and Faloutsos, 1999)


Robustness of
Random vs. Scale-Free Networks
 The accidental failure
of a number of nodes
in a random network
can fracture the
system into non-
communicating islands.

 Scale-free networks
are more robust in the
face of such failures.

 Scale-free networks
are highly vulnerable
to a coordinated attack
against their hubs.

Information on the Social Network
 Heterogeneous, multi-relational data represented as a
graph or network
 Nodes are objects

 May have different kinds of objects

 Objects have attributes

 Objects may have labels or classes

 Edges are links

 May have different kinds of links

 Links may have attributes

 Links may be directed and are not required to be binary
 Links represent relationships and interactions between objects - rich content for mining
What is New for Link Mining Here

 Traditional machine learning and data mining approaches assume:
 A random sample of homogeneous objects from a single relation
 Real world data sets:
 Multi-relational, heterogeneous and semi-structured
 Link Mining
 Newly emerging research area at the intersection of
research in social network and link analysis, hypertext
and web mining, graph mining, relational learning and
inductive logic programming



A Taxonomy of Common Link Mining Tasks

 Object-Related Tasks
 Link-based object ranking

 Link-based object classification

 Object clustering (group detection)

 Object identification (entity resolution)

 Link-Related Tasks
 Link prediction

 Graph-Related Tasks
 Subgraph discovery

 Graph classification

 Generative model for graphs



What Is a Link in Link Mining?

 Link: relationship among data


 Two kinds of linked networks
 homogeneous vs. heterogeneous

 Homogeneous networks
 Single object type and single link type

 Single-mode social networks (e.g., friends)

 WWW: a collection of linked Web pages

 Heterogeneous networks
 Multiple object and link types

 Medical network: patients, doctors, disease, contacts, treatments
 Bibliographic network: publications, authors, venues


Link-Based Object Ranking (LBR)
 LBR: Exploit the link structure of a graph to order or
prioritize the set of objects within the graph
 Focused on graphs with single object type and single link type
 This is a primary focus of the link analysis community
 Web information analysis
 PageRank and HITS are typical LBR approaches
 In social network analysis (SNA), LBR is a core analysis task
 Objective: rank individuals in terms of “centrality”
 Degree centrality vs. eigenvector/power centrality
 Rank objects relative to one or more relevant objects in the graph vs. rank objects over time in dynamic graphs
PageRank: Capturing Page Popularity (Brin & Page’98)

 Intuitions
 Links are like citations in literature

 A page that is cited often can be expected to be more useful in general
 PageRank is essentially “citation counting”, but improves over simple counting
 Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
 Smoothing of citations (every page is assumed to have a non-zero citation count)
 PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm (Brin & Page’98)

Random surfing model: at any page,
With prob. $\alpha$, randomly jump to a page
With prob. $(1 - \alpha)$, randomly pick a link to follow

[Figure: example graph with four pages d1, d2, d3, d4]

“Transition matrix”

$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$

$$p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p_t(d_k)$$

The second term is the same as $\alpha/N$ (why?)

Stationary (“stable”) distribution, so we ignore time:

$$p(d_i) = \sum_{k=1}^{N} \left[ \frac{\alpha}{N} + (1 - \alpha)\, m_{ki} \right] p(d_k)$$

$$\vec{p} = (\alpha I + (1 - \alpha) M)^T \vec{p}, \qquad I_{ij} = 1/N$$

Initial value p(d) = 1/N; iterate until convergence

Essentially an eigenvector problem…
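
The update rule can be iterated directly on the four-page example. A minimal power-iteration sketch, assuming a jump probability of 0.15 (the slide gives no numeric value) and writing the jump probability as `alpha`; the function name, tolerance and iteration cap are likewise illustrative. It uses the fact that the scores sum to 1, so the random-jump term reduces to alpha/N.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for p(d_i) = alpha/N + (1 - alpha) * sum_j m_ji * p(d_j)."""
    n = len(M)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = [
            alpha / n + (1 - alpha) * sum(M[j][i] * p[j] for j in range(n))
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Transition matrix of the example graph: row i holds the link-following
# probabilities out of page d(i+1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
scores = pagerank(M)
for name, score in zip(["d1", "d2", "d3", "d4"], scores):
    print(name, round(score, 4))
```

In this example d1 ends up with the highest score, and d3 and d4 tie because each receives exactly half of d1's outgoing weight and nothing else.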



HITS: Capturing Authorities & Hubs (Kleinberg’98)

 Intuitions
 Pages that are widely cited are good authorities
 Pages that cite many other pages are good hubs
 The key idea of HITS
 Good authorities are cited by good hubs
 Good hubs point to good authorities
 Iterative reinforcement …
The HITS Algorithm (Kleinberg 98)

0 0 1 1
1 0 0 0  “Adjacency matrix”
d1
A
0 1 0 0
d3  
d2 1 1 0 0 Initial values: a=h=1
h( d i )   a(d j )
d4 d j OUT ( di )
Iterat
a (di )   h( d j ) e
d j IN ( di ) Normalize:
   T

h  Aa ; a A h
 a ( d i )   h( d i )  1
2 2
 T
  
h  AA h ; a  AT Aa i i

Again eigenvector problems…
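
The iterative reinforcement can be run on the same four-page example. A minimal sketch, assuming a fixed number of iterations instead of a convergence test (the slide does not specify a stopping rule); the function name is illustrative.

```python
import math

def hits(A, iterations=100):
    """Alternate a = A^T h and h = A a, normalizing both to unit Euclidean length."""
    n = len(A)
    a = [1.0] * n  # initial values a = h = 1
    h = [1.0] * n
    for _ in range(iterations):
        # authority of d_i: sum of hub scores of the pages linking to d_i
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        # hub score of d_i: sum of authority scores of the pages d_i links to
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_h = math.sqrt(sum(x * x for x in h))
        a = [x / norm_a for x in a]
        h = [x / norm_h for x in h]
    return a, h

# Adjacency matrix of the example graph: A[i][j] = 1 if page d(i+1) links to d(j+1).
A = [
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
authority, hub = hits(A)
print("authority:", [round(x, 4) for x in authority])
print("hub:      ", [round(x, 4) for x in hub])
```

On this graph d1 and d2 emerge as the authorities and d4, which links to both, as the strongest hub.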



Block-level Link Analysis (Cai et al. 04)

 Most of the existing link analysis algorithms, e.g. PageRank and HITS, treat a web page as a single node in the web graph
 However, in most cases, a web page contains multiple semantics and hence should not be treated as an atomic and homogeneous node
 A web page is partitioned into blocks using the vision-based page segmentation algorithm
 extract page-to-block, block-to-page relationships
 Block-level PageRank and Block-level HITS


Link-Based Object Classification (LBC)

 Predicting the category of an object based on its attributes, its links and the attributes of linked objects
 Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
 Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations
 Epidemics: Predict disease type based on characteristics of the patients infected by the disease
 Communication: Predict whether a communication contact is by email, phone call or mail


Challenges in Link-Based Classification

 Labels of related objects tend to be correlated
 Collective classification: Explore such correlations and jointly infer the categorical values associated with the objects in the graph
 Ex: Classify related news items in Reuters data sets (Chak’98)
 Simply incorporating words from neighboring documents: not helpful
 Multi-relational classification is another solution for link-based classification


Group Detection
 Cluster the nodes in the graph into groups that
share common characteristics
 Web: identifying communities

 Citation: identifying research communities

 Methods
 Hierarchical clustering

 Blockmodeling of SNA

 Spectral graph partitioning

 Stochastic blockmodeling

 Multi-relational clustering



Entity Resolution
 Predicting when two objects are the same, based on their attributes and their links
 Also known as: deduplication, reference reconciliation, co-reference resolution, object consolidation
 Applications
 Web: predict when two sites are mirrors of each other
 Citation: predicting when two citations are referring to the same paper
 Epidemics: predicting when two disease strains are the same
 Biology: learning when two names refer to the same protein
Entity Resolution Methods
 Earlier viewed as a pair-wise resolution problem: resolved based on the similarity of their attributes
 Importance of considering links
 Coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents
 Use of links in resolution
 Collective entity resolution: one resolution decision affects another if they are linked
 Propagating evidence over links in a dependency graph
 Probabilistic models interact with different entity recognition decisions
Link Prediction
 Predict whether a link exists between two entities, based
on attributes and other observed links
 Applications
 Web: predict if there will be a link between two pages

 Citation: predicting if a paper will cite another paper

 Epidemics: predicting who a patient’s contacts are

 Methods
 Often viewed as a binary classification problem

 Local conditional probability model, based on structural and attribute features


 Difficulty: sparseness of existing links

 Collective prediction, e.g., Markov random field model
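
As an illustration of the structural features mentioned above, the sketch below scores every non-adjacent pair by its number of common neighbors and by Jaccard similarity. These two scores are a choice made here for illustration (they are among the neighborhood-based predictors studied in the link prediction literature), not a method the slide prescribes; the toy friendship graph and all names are hypothetical.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def rank_candidate_links(adj, score=common_neighbors):
    """Score every currently unlinked pair and sort from most to least likely."""
    nodes = sorted(adj)
    candidates = [
        (score(adj, u, v), u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adj[u]
    ]
    return sorted(candidates, key=lambda item: (-item[0], item[1], item[2]))

# Hypothetical toy friendship graph.
adj = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
ranking = rank_candidate_links(adj)
for score, u, v in ranking:
    print(u, v, score)
```

Such scores can be used directly as a ranking or fed as features into a binary classifier; the sparseness problem noted above shows up as the large number of candidate pairs with a score of zero in realistic graphs.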



Link Cardinality Estimation
 Predicting the number of links to an object
 Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links
 Citation: predicting the impact of a paper based on the number of citations
 Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease
 Predicting the number of objects reached along a path from an object
 Web: predicting number of pages retrieved by crawling a site
 Citation: predicting the number of citations of a particular author in a specific journal
Subgraph Discovery
 Find characteristic subgraphs
 Focus of graph-based data mining

 Applications
 Biology: protein structure discovery

 Communications: legitimate vs. illegitimate groups

 Chemistry: chemical substructure discovery

 Methods
 Subgraph pattern mining

 Graph classification
 Classification based on subgraph pattern analysis



Metadata Mining

 Schema mapping, schema discovery, schema reformulation
 cite - matching between two bibliographic sources
 web - discovering schema from unstructured or semi-structured data
 bio - mapping between two medical ontologies


Link Mining Challenges
 Logical vs. statistical dependencies
 Feature construction
 Instances vs. classes
 Collective classification
 Collective consolidation
 Effective use of labeled & unlabeled data
 Link prediction
 Closed vs. open world
Challenges common to any link-based statistical model (Bayesian
Logic Programs, Conditional Random Fields, Probabilistic Relational
Models, Relational Markov Networks, Relational Probability Trees,
Stochastic Logic Programming to name a few)

Ref: Mining on Social Networks
 D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03.
 P. Domingos and M. Richardson. Mining the Network Value of Customers. KDD’01.
 M. Richardson and P. Domingos. Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02.
 D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the Spread of Influence through a Social Network. KDD’03.
 P. Domingos. Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.
 S. Brin and L. Page. The anatomy of a large scale hypertextual Web search engine. WWW7.
 S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the World Wide Web. IEEE Computer’99.
 D. Cai, X. He, J. Wen, and W. Ma. Block-level Link Analysis. SIGIR’2004.


Other References
 Lecture notes from Professor Lise Getoor’s website. http://www.cs.umd.edu/~getoor/
 Lecture notes from Professor ChengXiang Zhai’s website. http://www-faculty.cs.uiuc.edu/~czhai/
