Advanced Topics in Data Mining Special Focus: Social Networks
Social Networks
Links denote a social interaction
Networks of acquaintances
Collaboration networks: actor networks, co-authorship networks, director networks
Phone-call networks, e-mail networks, IM networks, Bluetooth networks
Sexual networks
Home page/blog networks
Design models that capture the generation process of network data (Generative Models)
Generate graphs with the same properties as real social network graphs
Introductory Lectures
Measurements in networks
Generative models
Algorithmic topics
Introduction to information propagation
Expertise location
Privacy
Measuring Networks
Degree distributions
Small world phenomena
Clustering coefficient
Mixing patterns
Degree correlations
Communities and clusters
Degree distributions
$f_k$ = fraction of nodes with degree $k$ = probability that a randomly selected node has degree $k$
[Figure: frequency $f_k$ plotted against degree $k$]
Problem: find the probability distribution that best fits the observed data
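A minimal sketch of how the empirical degree distribution defined above could be computed. The adjacency-dictionary representation and the toy graph are assumptions made for illustration, not part of the lecture.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: f_k}, the fraction of nodes that have degree k.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    n = len(adj)
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: a triangle (1, 2, 3) with nodes 4 and 5 attached to node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
f = degree_distribution(toy)
```

Here `f[k]` is both the observed frequency of degree `k` and the probability that a uniformly chosen node has that degree.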
Power-law distributions
The degree distributions of most real-life networks follow a power law
$p(k) = C k^{-\alpha}$
Right-skewed/Heavy-tail distribution
There is a non-negligible fraction of nodes that have very high degree (hubs)
Scale-free: no characteristic scale, so the average is not informative
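Compare with the Poisson degree distribution of random graphs:
$p(k) = P(k; z) = \frac{z^k}{k!} e^{-z}$
The Poisson distribution is highly concentrated around the mean
The probability of very high degree nodes is exponentially small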
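To make the contrast concrete, the sketch below evaluates a Poisson distribution and a power law at a large degree. The parameter values (z = 4, exponent 2.5, cut-off at degree 1000) are arbitrary choices for illustration, and the power law is normalised over a finite range for simplicity.

```python
import math

def poisson_pmf(k, z):
    """P(k; z) = z^k e^{-z} / k!, computed in log space to avoid overflow."""
    return math.exp(k * math.log(z) - z - math.lgamma(k + 1))

def power_law_pmf_table(alpha, k_max):
    """Return a list p with p[k] = C * k^-alpha for k = 1..k_max (p[0] = 0).

    C is chosen so that the probabilities over 1..k_max sum to 1.
    """
    weights = [k ** -alpha for k in range(1, k_max + 1)]
    C = 1.0 / sum(weights)
    return [0.0] + [C * w for w in weights]

# Illustrative parameters only (assumed, not from the lecture).
power = power_law_pmf_table(alpha=2.5, k_max=1000)
p_poisson_50 = poisson_pmf(50, z=4.0)   # probability of degree 50 under Poisson
p_power_50 = power[50]                  # probability of degree 50 under the power law
```

With these settings the Poisson probability of a degree-50 node is many orders of magnitude below the power-law probability, which is the sense in which hubs are non-negligible under a heavy tail.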
Power-law signature
Power-law distribution gives a line in the log-log plot
$\log p(k) = -\alpha \log k + \log C$
[Figure: frequency against degree on linear axes, and log frequency against log degree]
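The straight-line signature suggests a quick diagnostic: fit a least-squares line to the points (log k, log f_k) and read the exponent from the slope. The sketch below assumes the power-law form p(k) = C k^(-α) and uses noise-free synthetic data; on real data this simple regression is known to be biased, and maximum-likelihood estimation is generally preferred.

```python
import math

def fit_power_law_loglog(freq):
    """Least-squares line through (log k, log f_k); returns (alpha, C).

    `freq` maps degree k to its frequency f_k. Zero entries are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log(k), math.log(f)) for k, f in freq.items() if k > 0 and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)   # slope = -alpha, intercept = log C

# Synthetic, exactly power-law data (assumed values: C = 0.5, alpha = 2.5).
synthetic = {k: 0.5 * k ** -2.5 for k in range(1, 101)}
alpha_hat, C_hat = fit_power_law_loglog(synthetic)
```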
Examples
Exponential distribution
Observed in some technological or collaboration networks
$p(k) = \lambda e^{-\lambda k}$
Average/Expected degree
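For random graphs: $z = np$
For power-law distributed degrees:
if $\alpha > 2$, it is a constant
if $\alpha < 2$, it diverges (in the boundary case $\alpha = 2$ it grows only logarithmically with the maximum degree)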
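A short derivation of why the exponent 2 separates the two regimes, under a continuous approximation of the power law $p(k) = C k^{-\alpha}$. The lower and upper cut-offs $k_{\min}$ and $k_{\max}$ are symbols introduced here for illustration.

```latex
\langle k \rangle \approx \int_{k_{\min}}^{k_{\max}} k \, C k^{-\alpha} \, dk
  = \frac{C}{2-\alpha}\left( k_{\max}^{2-\alpha} - k_{\min}^{2-\alpha} \right),
  \qquad \alpha \neq 2 .
% alpha > 2: as k_max -> infinity the first term vanishes, so
%   <k> -> C k_min^{2-alpha} / (alpha - 2), a constant.
% alpha < 2: <k> grows like k_max^{2-alpha}, so it diverges with k_max.
```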
Maximum degree
For random graphs, the maximum degree is highly concentrated around the average degree $z$
For power-law graphs:
$k_{\max} \approx n^{1/(\alpha-1)}$
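One common heuristic for the power-law scaling of the maximum degree, sketched under a continuous approximation of $p(k) = C k^{-\alpha}$: choose $k_{\max}$ so that about one of the $n$ nodes is expected to have degree at least $k_{\max}$. This argument is an assumption about how the result is obtained, not a derivation taken from the slides.

```latex
n \int_{k_{\max}}^{\infty} C k^{-\alpha} \, dk \approx 1
\;\Longrightarrow\;
\frac{n C}{\alpha - 1} \, k_{\max}^{1-\alpha} \approx 1
\;\Longrightarrow\;
k_{\max} \approx \left( \frac{n C}{\alpha - 1} \right)^{1/(\alpha-1)}
\propto n^{1/(\alpha-1)}, \qquad \alpha > 1 .
```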
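Example
[Figure: graph on nodes 1 to 5, consisting of a triangle with two further nodes attached to one of its vertices]
$C^{(1)} = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}} = \frac{3 \times 1}{1 + 1 + 6} = \frac{3}{8}$
$C^{(2)} = \frac{1}{n} \sum_i C_i$, where $C_i$ is the clustering coefficient of node $i$
Example
$C^{(2)} = \frac{1}{5}\left(1 + 1 + \frac{1}{6}\right) = \frac{13}{30}$
$C^{(1)} = \frac{3}{8}$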
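The two coefficients can be checked numerically. The sketch below assumes the standard definitions: C(1) is three times the number of triangles divided by the number of connected triples, and C(2) is the average over nodes of the fraction of each node's neighbour pairs that are linked. It uses a five-node graph consisting of a triangle with two extra nodes attached to one vertex. The node labelling (node 3 as the high-degree vertex) and the convention that nodes of degree below 2 contribute 0 are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def local_clustering(adj, v):
    """C_i: fraction of pairs of neighbours of v that are linked to each other."""
    nbrs = adj[v]
    pairs = len(nbrs) * (len(nbrs) - 1) // 2
    if pairs == 0:
        return Fraction(0)  # assumed convention for nodes of degree 0 or 1
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return Fraction(linked, pairs)

def clustering_coefficients(adj):
    """Return (C1, C2) as exact fractions."""
    # Connected triples centred at v: one per pair of neighbours of v.
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Closed triples: each triangle is counted once per vertex, i.e. 3 times.
    closed = sum(1 for v in adj
                 for a, b in combinations(adj[v], 2) if b in adj[a])
    c1 = Fraction(closed, triples)
    c2 = sum(local_clustering(adj, v) for v in adj) / len(adj)
    return c1, c2

# Triangle 1-2-3, with nodes 4 and 5 attached to node 3 (labelling assumed).
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
c1, c2 = clustering_coefficients(example)   # c1 = 3/8, c2 = 13/30
```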
The two clustering coefficients give different measures
$C^{(2)}$ weights every node equally, so it increases with nodes of low degree
[Figure: $C(k)$ plotted against degree $k$]
As of Dec 2007, the highest (finite) Bacon number reported is 8
Only approx. 12% of all actors cannot be linked to Bacon
What is the Bacon number of Elvis Presley?
Erdos numbers?
Further observations
People who owned stock had shorter paths to the stockbroker than random people
People from the Boston area had even shorter paths
Harmonic mean
$\frac{1}{\ell} = \frac{1}{n(n-1)/2} \sum_{i > j} d_{ij}^{-1}$
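A sketch of the harmonic-mean path length for an unweighted, undirected graph, assuming the formula averages the inverse distance 1/d_ij over all n(n-1)/2 node pairs. Unreachable pairs are treated as having infinite distance and contribute 0, which is the usual reason for preferring the harmonic mean on disconnected graphs. The adjacency-dictionary representation and the toy graph are illustrative assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def harmonic_mean_distance(adj):
    """Harmonic mean of d_ij over all unordered pairs i != j.

    Raises ZeroDivisionError if the graph has no edges at all.
    """
    nodes = list(adj)
    n = len(nodes)
    inverse_sum = 0.0
    for idx, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        for w in nodes[idx + 1:]:        # each unordered pair once
            if w in dist:                # unreachable pairs add 1/inf = 0
                inverse_sum += 1.0 / dist[w]
    n_pairs = n * (n - 1) / 2
    return n_pairs / inverse_sum

# Hypothetical graph: path a-b-c plus an isolated node d.
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
ell = harmonic_mean_distance(g)
```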
Degree correlations
Do high-degree nodes tend to link to high-degree nodes?
Pastor-Satorras et al.
plot the mean degree of the neighbors as a function of the degree
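A minimal sketch of the quantity behind that plot: for each degree k, average the mean neighbour degree over all nodes of degree k. The function name and the star-graph example are assumptions for illustration; plotting is left out.

```python
from collections import defaultdict

def mean_neighbour_degree_by_degree(adj):
    """Return {k: average over degree-k nodes of their neighbours' mean degree}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k == 0:
            continue  # isolated nodes have no neighbours to average over
        totals[k] += sum(len(adj[w]) for w in nbrs) / k
        counts[k] += 1
    return {k: totals[k] / counts[k] for k in sorted(totals)}

# Hypothetical star graph: hub 0 connected to leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
knn = mean_neighbour_degree_by_degree(star)
```

In the star graph the curve decreases with k: low-degree nodes attach to the hub, and the hub attaches only to low-degree nodes.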
Degree correlations
Newman
compute the correlation coefficient of the degrees of the two endpoints of an edge
assortative if the correlation is positive, disassortative if it is negative
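One way to compute such a coefficient, sketched as a plain Pearson correlation over edge endpoints. Each undirected edge is counted in both directions so that the measure is symmetric, and full degrees are used rather than remaining degrees (the Pearson value is the same either way). These choices and the star-graph example are assumptions for illustration.

```python
import math

def degree_assortativity(adj):
    """Pearson correlation of the degrees at the two ends of an edge.

    Undefined (ZeroDivisionError) when all endpoint degrees are equal,
    e.g. in a regular graph.
    """
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:                 # visits every edge in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / m
    var_y = sum((y - mean_y) ** 2 for y in ys) / m
    return cov / math.sqrt(var_x * var_y)

# Hypothetical star graph: every edge joins the hub to a leaf.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
r = degree_assortativity(star)
```

A star is the extreme disassortative case, so the coefficient sits at the negative end of the range.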
Connected components
For undirected graphs, the size and distribution of the connected components
is there a giant component?
For directed graphs, the size and distribution of strongly and weakly connected components
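For the undirected case, component sizes can be found with a breadth-first search from each unvisited node. This is a sketch under the assumption of an adjacency-dictionary representation. It does not cover strongly connected components of directed graphs, which need a different algorithm.

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical graph with one component of size 3 and one isolated node.
g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
components = connected_components(g)
sizes = [len(c) for c in components]
giant_fraction = sizes[0] / len(g)   # share of nodes in the largest component
```

Whether the largest component counts as "giant" is a statement about how `giant_fraction` behaves as the graph grows, so a single graph only gives an indication.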
Graph eigenvalues
For random graphs
semi-circle law
Next class
What is a good model that generates graphs in which a power-law degree distribution appears?
What is a good model that generates graphs in which small-world phenomena appear?