
Lecture 2: Graph Theory and Social Networks

Alexander Wolitzky

MIT

6.207/14.15: Networks, Spring 2022

Plan

First part of the course focuses on the physical structure of networks, with no or very simple models of behavior.

Basic tool: graph theory, the mathematical study of graphs/networks.
I We use the terms “graph” and “network” interchangeably.

This lecture: Basic graph theory language and concepts for describing and measuring networks.
I Next week: more advanced concepts and applications. E.g.,
Google’s PageRank algorithm, which ranks webpages by
“importance” based on their position in the Web network.
Types of Networks in the Real World

A network is a set of units (nodes or vertices) connected by relationships (links or edges).

Types of networks:
I Social and economic networks: nodes are people or groups of
people.
I Friendship networks, business relationships between firms,
intermarriages between families, employment relations in the
labor market
I Information networks: nodes are “information objects”
I Web links, citation network between academic articles,
semantic/classification networks (e.g., taxonomies)
I ...
Types of Networks in the Real World (cntd.)

I Technological networks
I Infrastructure networks like internet, power grid, transportation
networks
I Temporary networks like sensor networks, autonomous vehicles
I Biological networks
I Food web, protein interaction network, neural network,
network of metabolic pathways

History of Study of Graphs/Networks
Historical study of networks:
I Mathematical graph theory: central part of discrete math
I Started with Euler’s 1735 solution to the Königsberg bridge
problem.
I Social network analysis in sociology.
I Typical studies involved circulation of questionnaires, leading
to relatively small networks; also little focus on individual
behavior.

Recent years witnessed a substantial change in network research.


I From analysis of single small graphs (<100 nodes) to
statistical properties of large-scale networks (millions/billions
of nodes).
I Motivated by availability of computers and computer data.
I On a different front, integration of game theory and
graph/social network theory.
I Later in the course.
Graphs
A graph consists of a set of nodes N = {1, . . . , n} and an n × n
matrix g = [gij ]i ,j ∈N called the adjacency matrix, where
gij ∈ {0, 1} denotes the absence/presence of an edge from node i
to node j.
I In a weighted graph, the edge weight gij > 0 can take on
non-binary values, representing the intensity of the interaction.

In an undirected graph, gij = gji for all i, j ∈ N (g is symmetric).


I E.g. Facebook friends
In a directed graph (digraph), gij and gji may differ.
I E.g. web links

Examples: draw the graphs corresponding to adjacency matrices:


I Example 1:
      0 1 0
      0 0 1
      1 0 0
I Example 2:
      0 1 1
      1 0 1
      1 1 0
Graphs

Equivalently, can represent a graph by (N, E ), where E ⊆ N × N


is the set of edges.
I For directed graphs, E is the set of “directed” edges, write
(i, j ) ∈ E .
I For undirected graphs, E is the set of “undirected” edges,
write {i, j } ∈ E .

Example 1: Ed = {(1, 2) , (2, 3) , (3, 1)}


Example 2: could write as either Eu = {{1, 2} , {1, 3} , {2, 3}}
or Ed = {(1, 2) , (2, 1) , (2, 3) , (3, 2) , (3, 1) , (1, 3)}

We sometimes denote gij = 1 with the notation (i, j) ∈ g, or {i, j} ∈ g, or even ij ∈ g.
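To make the two representations concrete, here is a minimal Python sketch (illustrative code, not from the slides; nodes are 0-indexed as is usual in code) that stores Example 1 and Example 2 as adjacency matrices and recovers the corresponding edge sets.

```python
import numpy as np

# Example 1: the directed 3-cycle 1 -> 2 -> 3 -> 1 from the slide
g_directed = np.array([[0, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]])

# Example 2: the undirected triangle (symmetric adjacency matrix)
g_undirected = np.array([[0, 1, 1],
                         [1, 0, 1],
                         [1, 1, 0]])

def directed_edges(g):
    """All directed edges (i, j) with g_ij = 1."""
    n = g.shape[0]
    return {(i, j) for i in range(n) for j in range(n) if g[i, j] == 1}

def undirected_edges(g):
    """All undirected edges {i, j}, stored as sorted tuples."""
    return {tuple(sorted(e)) for e in directed_edges(g)}

print(directed_edges(g_directed))                    # the three edges of the directed cycle
print(undirected_edges(g_undirected))                # the three edges of the triangle
print(np.array_equal(g_undirected, g_undirected.T))  # True: undirected <=> symmetric g
```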
Walks, Paths, and Cycles
For an undirected graph (N, E ):
I A walk is a sequence of edges {i1 , i2 } , {i2 , i3 } , . . . , {iK −1 , iK }.
I A path between nodes i and j is a sequence of edges
{i1 , i2 } , {i2 , i3 } , . . . , {iK −1 , iK } such that i1 = i and iK = j,
and each node in the sequence i1 , . . . , iK is distinct.
(i.e. a walk with no repeated nodes)
I A cycle is a walk that starts and ends at the same node and repeats no other node.
I A geodesic between nodes i and j is a “shortest path” (i.e.
with minimum number of edges) between these nodes.

The length of a walk (or path) is the number of edges in the walk
(or path).
I The distance between nodes i and j is the length of a
geodesic between them (or ∞ if no such path exists).
For directed graphs, the same definitions hold with directed edges
(in which case we say “a path from node i to node j”).
Powers of the Adjacency Matrix
The powers of the adjacency matrix contain useful information
about walks and paths.

Under the convention gii = 0, the matrix g^2 tells us the number of walks of length 2 between any two nodes:

(g × g)ij = ∑_{k∈N} gik gkj = #{ k : {i, k}, {k, j} is a walk between i and j }

(since gik gkj = 1 if {i, k}, {k, j} is such a walk, and = 0 otherwise).

Similarly, the matrix g^3 tells us the number of walks of length 3 between any two nodes:

(g^2 × g)ij = ∑_{k2∈N} (g^2)ik2 gk2j = #{ (k1, k2) : {i, k1}, {k1, k2}, {k2, j} is a walk between i and j }.
Powers of the Adjacency Matrix (cntd.)

By induction, g^k tells us the number of walks of length k between any two nodes.

This also gives a useful way to express the distance between nodes i and j: it is the smallest integer k such that (g^k)ij ≠ 0.

A similar interpretation works for weighted graphs: given a weighted adjacency matrix g, (g^k)ij is the sum of the “values” of all length-k walks from i to j, where the value of a walk is the product of the weights on each link.

You’ll see more ways of using the adjacency matrix on the pset.
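As a quick sanity check on these claims, here is a hedged numpy sketch (not from the lecture) that counts walks with matrix powers and computes the distance between two nodes as the smallest k with (g^k)ij ≠ 0.

```python
import numpy as np

# Undirected triangle (Example 2)
g = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

g2 = g @ g    # (g2)[i, j] = number of walks of length 2 from i to j
g3 = g2 @ g   # number of walks of length 3
print(g2)     # diagonal entries equal each node's degree (walk out and straight back)
print(g3)

def distance(g, i, j, max_len=None):
    """Smallest k with (g^k)[i, j] != 0, or infinity if no walk exists."""
    n = g.shape[0]
    if max_len is None:
        max_len = n - 1          # a shortest path uses at most n - 1 edges
    power = np.eye(n, dtype=int)
    for k in range(1, max_len + 1):
        power = power @ g
        if power[i, j] != 0:
            return k
    return float("inf")

print(distance(g, 0, 2))  # 1 in the triangle
```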

Connectivity and Components
An undirected graph is connected if for every two nodes there
exists a path between them.

A graph (N′, E′) is a subgraph of (N, E) if N′ ⊂ N, E′ ⊂ E, and {i, j} ∈ E′ implies i, j ∈ N′. (Each link must have both of its endpoints in the subgraph.)

A component of a graph is a maximal connected subgraph.


I That is, a connected subgraph that is not contained in any
larger connected subgraph.

An edge {i, j} is a bridge if deleting it increases the number of components.

Note: the adjacency matrix of a graph with more than one component can be written in block-diagonal form: that is, the 1’s are confined to square blocks along the diagonal, with all other elements equal to 0. (Convince yourself.)
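As a minimal sketch (illustrative, not from the slides), the components of an undirected graph can be read off the adjacency matrix with a breadth-first search; relabeling nodes component by component is exactly what puts the matrix into block-diagonal form.

```python
from collections import deque

def components(g):
    """Components of an undirected graph, as sorted lists of nodes.

    g is an n x n adjacency matrix (nested lists or a numpy array) with entries in {0, 1}.
    """
    n = len(g)
    unvisited = set(range(n))
    comps = []
    while unvisited:
        root = unvisited.pop()
        comp, queue = [root], deque([root])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if g[i][j] and j in unvisited:
                    unvisited.discard(j)
                    comp.append(j)
                    queue.append(j)
        comps.append(sorted(comp))
    return comps

# Two components: a triangle {0, 1, 2} and a single edge {3, 4}
g = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
print(components(g))  # [[0, 1, 2], [3, 4]]
```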
Connectivity and Components in Directed Graphs

A directed graph is
I connected if the underlying undirected graph is connected
(i.e. ignoring the directions of the edges).
I strongly connected if each node can reach every other node
by a “directed path”.

A strongly connected component is a maximal strongly connected subgraph. That is,
1. Each node in the subgraph can reach every other node in the
subgraph by a directed path contained in the subgraph.
2. The subgraph is not contained in any larger subgraph with
this property.
Directed Graphs (cntd.)
I The out-component of a set of nodes S ⊂ N is the set of
nodes T ⊂ N that can be reached by a directed path starting
from some node in S.
I The in-component of a set of nodes S ⊂ N is the set of
nodes T ⊂ N that can reach some node in S by a directed
path.

Note: the strongly connected component of a node i consists of the intersection of its out-component and its in-component.
Proof:
I Fix two nodes j and k in the intersection of i’s out-component
and in-component.
I j can reach i, because j is in i’s in-component.
I i can reach k, because k is in i’s out-component.
I So j can reach k (by a path through i).
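The proof above translates directly into code. Here is a hedged sketch (function names are mine) that computes a node's strongly connected component as the intersection of its out-component and its in-component, using breadth-first search on the graph and on its reverse.

```python
from collections import deque

def reachable_from(g, start):
    """Nodes reachable from `start` by a directed path (including `start` itself)."""
    n = len(g)
    seen, queue = {start}, deque([start])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if g[i][j] and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def strongly_connected_component(g, i):
    """SCC of node i = (out-component of i) intersected with (in-component of i)."""
    n = len(g)
    out_component = reachable_from(g, i)
    g_reversed = [[g[b][a] for b in range(n)] for a in range(n)]  # flip every edge
    in_component = reachable_from(g_reversed, i)                  # nodes that can reach i
    return out_component & in_component

# Directed example: 0 -> 1 -> 2 -> 0 is a cycle; 2 -> 3, but node 3 has no way back
g = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]
print(strongly_connected_component(g, 0))  # {0, 1, 2}
```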
Some Special Networks

I A clique (or complete network) is a graph where all nodes are linked to each other.
I A tree is a connected (undirected) graph with no cycles.
I A connected graph is a tree if and only if it has n − 1 edges.
I In a tree, there is a unique path between any two nodes.
I A forest is a graph in which each component is a tree.
I A star is a tree where one node (the center) is linked to all
other nodes.
I A ring (or circle, or cycle) is a connected graph where each
node is linked to two others.
I A bipartite graph is one whose nodes can be partitioned into two sets
such that all links connect nodes in “opposite” sets.
I Buyers&sellers, firms&workers, students&schools, men&women
Network Statistics
Small networks can be visualized directly, but larger networks are
harder to visualize and describe.

It’s therefore useful to define several summary statistics to describe and compare networks (here focusing primarily on undirected graphs):
I Degree distribution (how dense?)
I Diameter and average path length (how tightly connected?)
I Clustering (are friends-of-friends friends?)
I Centrality (which nodes are central or important?)
I Homophily (are nodes of the same “type” more likely to be
linked?)

The rest of today’s class introduces these summary statistics and discusses some applications.
Neighborhood and Degree
The neighborhood, Ni , of node i is the set of nodes to which it is
linked: Ni = {j : gij = 1}.

For undirected graphs, the degree, di, of node i is its number of neighbors, or equivalently the cardinality of its neighborhood: di = ∑j gij = ∑j gji = #Ni.

For directed graphs,


I The out-degree of node i is ∑j gij .
I The in-degree of node i is ∑j gji .
One also sometimes sees the terms “out-neighbor” and “in-neighbor”.

In applications, if a link from i to j means that i “influences” j, nodes with high out-degree are “influential.”
If a link means that i “listens to” or “endorses” j (e.g., hyperlink
to j), nodes with high in-degree are influential.
Mean Degree, Density, Sparseness
The average (mean) degree, d̄, of an undirected network is

d̄ = (1/n) ∑_i di.
Note that if the network has a total of m edges, then we have ∑_i di = 2m.

Therefore, d̄ = 2m/n. (Useful equation.)


The density, ρ, of an undirected network is the fraction of all possible links that actually exist, given by

ρ = m / (n(n − 1)/2) = d̄ / (n − 1).

For large networks, this is often approximated as d̄/n.
A network is sparse if ρ is small.
I When discussing large networks, this is often taken to mean
that ρ → 0 as n → ∞.
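A tiny numpy sketch (illustrative) of these formulas, confirming that the mean degree is 2m/n and the density is the mean degree divided by n − 1:

```python
import numpy as np

# Undirected graph with 4 nodes and m = 4 edges
g = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

n = g.shape[0]
degrees = g.sum(axis=1)           # d_i = sum_j g_ij
m = g.sum() / 2                   # each undirected edge is counted twice in g
mean_degree = degrees.mean()      # equals 2m / n
density = m / (n * (n - 1) / 2)   # equals mean_degree / (n - 1)
print(degrees, m, mean_degree, density)  # [2 2 3 1] 4.0 2.0 0.666...
```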
Degree Distributions

The degree distribution, P(d), of a network describes the proportion of nodes that have different degrees d.
I For a given graph, P (·) is a histogram:
that is, P (d ) is the fraction of nodes with degree d.
I For a random graph model, P (·) is a probability distribution:
that is, P (d ) is the probability that a node has degree d.

A graph is d-regular if all nodes have the same degree d (so P(d) is a degenerate distribution).
I If a graph is d-regular with d odd, it must have an even
number of nodes. (Why?)
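A short sketch (not from the lecture) of the empirical degree distribution, read directly as a histogram of node degrees:

```python
import numpy as np

def degree_distribution(g):
    """P[d] = fraction of nodes with degree d, for d = 0, 1, ..., max degree."""
    degrees = np.asarray(g).sum(axis=1)
    counts = np.bincount(degrees)
    return counts / counts.sum()

# Star on 4 nodes: one node of degree 3, three nodes of degree 1
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(degree_distribution(star))  # [0.  0.75 0.  0.25], i.e. P(1) = 0.75 and P(3) = 0.25
```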

Degree Distributions (cntd.)
Two types of degree distributions for random graph models:
I P(d) ≤ c e^(−αd) for some constants α > 0 and c > 0:
the tails of the distribution fall off exponentially (or faster), so
large degrees are very unlikely.
I P(d) = c d^(−γ) for some constants γ > 0 and c > 0:
called a power-law distribution; the tails of the distribution
are “fat”, so large degrees are much less unlikely than under an exponential tail.
I (Approximate) power laws appear in many settings, including
distributions of income, city populations, and internet traffic.
I Also known as a scale-free distribution: a distribution that is
unchanged (within a multiplicative factor) under a rescaling of
the variable.
I Appear linear on a log-log plot.

These concepts will play an important role in coming lectures on random graph models.
Diameter and Average Path Length
Let ℓ(i, j) denote the distance (shortest path length) between i and j.

The diameter of a connected network is the greatest distance between any two nodes:

diameter = max_{i,j} ℓ(i, j)

The average path length is the average distance between any two nodes:

average path length = ∑_{i≠j} ℓ(i, j) / (n(n − 1))

Average path length is bounded from above by diameter.


In some cases it is much shorter than diameter.
If the network is not connected, one often checks the diameter and
the average path length in the largest component.
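A hedged sketch (assuming a connected undirected graph, since otherwise some distances are infinite) that computes all pairwise distances by breadth-first search and then the diameter and the average path length:

```python
from collections import deque

def bfs_distances(g, source):
    """Shortest-path distances from `source` to every node (inf if unreachable)."""
    n = len(g)
    dist = [float("inf")] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if g[i][j] and dist[j] == float("inf"):
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def diameter_and_average_path_length(g):
    n = len(g)
    dist = [bfs_distances(g, i) for i in range(n)]
    pair_dists = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    return max(pair_dists), sum(pair_dists) / (n * (n - 1))

# Line graph 0 - 1 - 2 - 3: diameter 3, average path length 20/12
g = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(diameter_and_average_path_length(g))  # (3, 1.666...)
```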
Clustering
Measures the extent to which my friends are friends with each
other.

The simplest such measure is the overall clustering coefficient Cl(g), given by

Cl(g) = (3 × number of triangles in the network) / (number of “potential triangles”),

where a “potential triangle” is a triple of distinct nodes (i, j, k) such that gij = gik = 1.
I Formally,

Cl(g) = ∑_{i; j≠i; k≠i,j} gij gik gjk / ∑_{i; j≠i; k≠i,j} gij gik.
I Note that 0 ≤ Cl (g ) ≤ 1.
I Also referred to as network transitivity: measures the extent to which a friend of my friend is also my friend.
Clustering (cntd.)
A different measure of clustering is based on first measuring the
“individual clustering” for each node i, then averaging over nodes.
The individual clustering for node i is

Cli(g) = (number of triangles involving i) / (number of potential triangles centered at i)
       = ∑_{j≠i; k≠i,j} gij gik gjk / ∑_{j≠i; k≠i,j} gij gik

The average clustering coefficient is ClAvg(g) = (1/n) ∑_i Cli(g).
Consider the undirected “windmill” network, where everyone is
linked to the center and one other node.
I Average clustering is close to 1, because Cli (g ) = 1 for
everyone except the center.
I Overall clustering is close to 0, because the vast majority of potential triangles consist of the center and two individuals who are not linked (see the sketch below).
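The two measures are easy to compute by brute force. Here is a sketch (function and example names are mine, not the lecture's) that reproduces the windmill comparison: average clustering near 1, overall clustering near 0.

```python
import numpy as np
from itertools import combinations

def overall_clustering(g):
    """3 x (# triangles) / (# potential triangles), as defined above."""
    g = np.asarray(g)
    n = g.shape[0]
    closed = potential = 0
    for i in range(n):
        for j, k in combinations([x for x in range(n) if x != i and g[i][x]], 2):
            potential += 1               # j, k are both neighbors of i
            closed += int(g[j][k])       # ... and the triangle is closed iff j, k are linked
    return closed / potential if potential else 0.0

def average_clustering(g):
    """Mean of the individual clustering coefficients Cl_i(g)."""
    g = np.asarray(g)
    n = g.shape[0]
    cls = []
    for i in range(n):
        neighbors = [j for j in range(n) if g[i][j]]
        pairs = list(combinations(neighbors, 2))
        cls.append(sum(g[j][k] for j, k in pairs) / len(pairs) if pairs else 0.0)
    return sum(cls) / n

def windmill(k):
    """Windmill on 2k + 1 nodes: node 0 is the center; nodes 2i-1 and 2i are also linked."""
    n = 2 * k + 1
    g = np.zeros((n, n), dtype=int)
    g[0, 1:] = g[1:, 0] = 1
    for i in range(1, n, 2):
        g[i, i + 1] = g[i + 1, i] = 1
    return g

g = windmill(10)
print(average_clustering(g))   # ~0.95: everyone except the center has Cl_i = 1
print(overall_clustering(g))   # ~0.14 here, and -> 0 as the windmill grows
```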
Centrality Measures
There are several measures that capture some notion of the
“centrality” or “importance” of a node in a network.
I Different measures capture different notions of centrality,
which matter for answering different questions.

I Degree centrality: Simply degree divided by (n − 1).


I Closeness/decay centrality: “On average,” how close is the
node to other nodes?
I A simple measure: inverse average distance, or (n − 1) / ∑_{j≠i} ℓ(i, j) (a short sketch appears below).
I A richer measure: decay centrality, given by ∑_{j≠i} δ^{ℓ(i,j)} for some “decay parameter” δ ∈ (0, 1). (Depends on the parameter.)
I Betweenness centrality: How important is the node for
connecting other nodes?
I Recall the definition from last class:

Bk = ∑_{(i,j): i≠j, k≠i,j} [Pk(i, j)/P(i, j)] / [(n − 1)(n − 2)]
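Degree, closeness, and decay centrality are direct to compute from the distance matrix; here is a self-contained sketch (betweenness is omitted, and δ = 0.5 is just an illustrative default for the decay parameter).

```python
from collections import deque

def bfs_distances(g, source):
    """Shortest-path distances from `source` (same BFS helper as in the earlier sketch)."""
    n = len(g)
    dist = [float("inf")] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if g[i][j] and dist[j] == float("inf"):
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def degree_centrality(g, i):
    return sum(g[i]) / (len(g) - 1)

def closeness_centrality(g, i):
    """Inverse average distance: (n - 1) / sum over j != i of dist(i, j)."""
    dist = bfs_distances(g, i)
    return (len(g) - 1) / sum(d for j, d in enumerate(dist) if j != i)

def decay_centrality(g, i, delta=0.5):
    """Sum over j != i of delta ** dist(i, j), for a decay parameter delta in (0, 1)."""
    dist = bfs_distances(g, i)
    return sum(delta ** d for j, d in enumerate(dist) if j != i)

# Star on 4 nodes: the center (node 0) is the most central on every measure
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
print(degree_centrality(star, 0), degree_centrality(star, 1))        # 1.0 vs 0.33
print(closeness_centrality(star, 0), closeness_centrality(star, 1))  # 1.0 vs 0.6
print(decay_centrality(star, 0), decay_centrality(star, 1))          # 1.5 vs 1.0
```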
Eigenvector-Based Centrality Measures
A more subtle and very important class of centrality measures are
based on the self-referential idea that a node is important if it is
connected to other important nodes.
I These measures cannot be computed separately for each node;
instead, we compute the measure for all nodes simultaneously
via a system of equations.
I These measures are collectively called eigenvector-based
centrality measures (because the calculation involves
eigenvectors). They have many applications in this course,
including understanding:
I How Google ranks webpages (PageRank).
I Which agents in a social network are influential in forming the
group’s long-run consensus opinion (DeGroot learning).
I Which firms in a production network are most systemically
important (Leontief input-output analysis).
I . . . and more.
I We will study this class of measures and its applications next
week (starting today, time permitting).
Homophily and Segregation

Finally, another kind of network statistic is useful when nodes are of different types, or belong to different groups.
I Individuals of different gender, race, age, political affiliation, religion, education, etc.
I Liberal vs. conservative blogs (or other media)

In these settings, a key question is the degree of homophily: the extent to which nodes of the same type are more likely to be connected.
I “Similarity begets friendship” – Plato
I “People love those who are like themselves” – Aristotle
I “Birds of a feather flock together” – Proverb
Links Between Political Blogs in the US

© ACM. All rights reserved. This content is excluded from our Creative Commons
license. For more information, see https://ocw.mit.edu/help/faq-fair-use/
The Friendship Network at a US High School

© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/
Homophily and Segregation (cntd.)

Issues relating to homophily, assortativeness, and segregation will arise repeatedly in this class.

Later lectures will ask:


I How strong an individual preference of “like for like” (or
discrimination of “like against unlike”) is needed to result in
extreme levels of segregation at the societal level?
I How does homophily affect the speed of diffusion or
contagion?
I How does homophily affect whether crowds are wise or
foolish? (i.e., whether people successfully aggregate their
information, or fall prey to “groupthink” or “echo chambers”)

Measuring Homophily

There are different ways of measuring homophily, but the simplest is just to look at the fraction of links that actually exist between individuals of different types, relative to what would be expected if links were formed uniformly at random.

Suppose fraction p1 of the population is from group 1 and fraction p2 of the population is from group 2. (There may also be other types.)

If links were randomly distributed, a fraction p1^2 of links would connect two group-1 nodes, and a fraction 2p1p2 would connect a group-1 node and a group-2 node.
I If we fix a link and randomly assign the node at each end to type 1 or type 2, we get two type-1’s w/ prob p1^2 and one of each type w/ prob 2p1p2.
Measuring Homophily (cntd.)

Hence, if the fraction of links within group 1 is significantly above p1^2, this is evidence for homophily (or “assortative matching”) within group 1.

If the fraction of links between group 1 and group 2 is significantly below 2p1p2, this is evidence for homophily/assortativity within the groups, or segregation/disassortativity between them.
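A hedged sketch (the function and the toy example are mine) of this comparison: tabulate the observed within- and across-group link shares and set them against the uniformly-random benchmark p1^2, p2^2, and 2p1p2.

```python
import numpy as np

def homophily_check(g, types):
    """Observed vs. benchmark link shares for a two-type undirected network.

    `types[i]` is 1 or 2. Returns two dicts: the observed shares of links that are
    within group 1, within group 2, and across groups, and the random benchmark.
    """
    g = np.asarray(g)
    n = len(types)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if g[i][j]]
    m = len(edges)
    within1 = sum(types[i] == types[j] == 1 for i, j in edges) / m
    within2 = sum(types[i] == types[j] == 2 for i, j in edges) / m
    p1, p2 = types.count(1) / n, types.count(2) / n
    observed = {"within 1": within1, "within 2": within2, "across": 1 - within1 - within2}
    benchmark = {"within 1": p1 ** 2, "within 2": p2 ** 2, "across": 2 * p1 * p2}
    return observed, benchmark

# Toy example: two groups of 3, dense within each group, a single link across
g = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
types = [1, 1, 1, 2, 2, 2]
print(homophily_check(g, types))
# observed "across" share is 1/7, well below the benchmark 2 * p1 * p2 = 0.5: homophily
```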

Introducing Eigenvector Centrality (time permitting)
The simplest measure is eigenvector centrality: a non-zero vector
C = (Ci )i ∈N such that, for some scalar λ > 0, we have

λ Ci = ∑_{j≠i} gji Cj   for all i ∈ N.

That is, the centrality of each node i is proportional to the weighted sum of the centrality of its neighbors.
I Note that in this definition we have gji rather than gij .
I This doesn’t matter for undirected graphs, but for directed
graphs it says that a node’s centrality derives from the
centrality of nodes that point to it.
I Interpretation: when “important” or “prestigious” nodes point to you, this makes you important/prestigious.
I The equations still hold if we multiply C by a scalar.
We typically normalize so that ∑i ∈N Ci = 1.
Eigenvector Centrality (cntd.)

Eigenvector centrality (Ci )i ∈N is defined by:

λ Ci = ∑_{j≠i} gji Cj   for all i ∈ N.

It’s not immediately obvious whether we can find such a vector C: that is, whether such a measure exists or is unique.
I n linear equations with n unknowns, so looks promising. . .

When is Eigenvector Centrality Well-Defined?

For strongly connected networks, it turns out that eigenvector centrality is always well-defined.
I Recall that a directed network is strongly connected if there
exists a directed path between any two nodes.
I In particular, every connected undirected network is strongly
connected.
I In general, the network is strongly connected iff for every pair of nodes i, j, there exists a number ℓ such that (g^ℓ)ij > 0.
I Matrices g with this property are called irreducible.
I That is, a network is strongly connected if and only if its
adjacency matrix is irreducible.

When is Eigenvector Centrality Well-Defined? (cntd.)
In matrix form, the equation for the Ci’s is

λC = g^T C,

where λ is a scalar, C is an n × 1 vector, and g^T is the transpose of the n × n adjacency matrix (transposed because, for directed graphs, we care about the nodes that link to you, not the nodes you link to).
I That is, C is an eigenvector of g^T, with λ the corresponding eigenvalue.
I The Perron-Frobenius theorem of linear algebra says that, for
every irreducible non-negative matrix, its largest eigenvalue is
positive, and all the components of the corresponding
eigenvector are also positive.
I So, if we let λ be the largest eigenvalue of g^T, the corresponding eigenvector C is non-negative.
I Thus, for any strongly connected network, the eigenvector
centrality vector C is well-defined.
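A minimal numpy sketch (illustrative, not the lecture's code) of eigenvector centrality as the leading eigenvector of g^T, normalized to sum to 1; it presumes a strongly connected (or connected undirected) network, so that the Perron-Frobenius theorem applies.

```python
import numpy as np

def eigenvector_centrality(g):
    """Eigenvector centrality: the leading eigenvector of g transposed, summing to 1."""
    g = np.asarray(g, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eig(g.T)
    lead = np.argmax(eigenvalues.real)         # index of the largest eigenvalue
    c = np.abs(eigenvectors[:, lead].real)     # Perron vector, taken with positive sign
    return c / c.sum()

# Star on 4 nodes: the center gets the highest centrality
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
print(eigenvector_centrality(star))  # center ~ 0.366, each leaf ~ 0.211
```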
Interpretation as Long-Run Population Shares
A useful interpretation of eigenvector centrality as the long-run
outcome of a reproduction process (which also explains why it’s
always well-defined for strongly connected networks):
I Suppose a “virus” starts at a random node in the graph.
I In each period, every virus sends one copy of itself along each
link from the node where it is located. Then it dies.
I (So there’s 1 virus in period 1, #Ni viruses in period 2, ∑_{j∈Ni} #Nj viruses in period 3, etc.)
I Letting this process run forever, the virus never dies out
(because the network is strongly connected), and we can
calculate the long-run fraction of viruses located at each node.
I The long-run fraction of viruses located at node i equals Ci .

(Why? Because the long-run fraction of viruses located at node i is proportional to the long-run fraction of viruses located at nodes that link to node i. This is the relationship that defines eigenvector centrality.)
Perron-Frobenius Theorem

Theorem
For every irreducible non-negative matrix A, its largest eigenvalue
r1 is a positive real number, and the components of the
corresponding eigenvector v1 are also all positive.

The theorem also says more, but this is what we need.

The proof is outside our scope, but we can give an informative informal argument.

Intuition for the Perron-Frobenius Theorem
I Fix any non-negative vector x(0) ∈ R^n. Suppose that we can write it as a linear combination of the eigenvectors vi of A:

x(0) = ∑_i ci vi.

I Consider repeatedly multiplying x(0) by A. (Matrix multiplication = copying viruses.) After t steps, we get the vector

x(t) = A^t x(0) = A^t ∑_i ci vi = ∑_i ci ri^t vi = r1^t ∑_i ci (ri/r1)^t vi.
I Since r1 is the largest eigenvalue, (ri/r1)^t → 0 as t → ∞ for all i ≠ 1. Therefore, x(t)/r1^t → c1 v1. That is, the limiting vector x(∞) is proportional to the largest eigenvector.
I Since x(0) was non-negative and A is non-negative, each x(t) is also non-negative. Therefore, r1 must be positive (else oscillates), and every component of v1 must also be positive.
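The virus story is just power iteration. Here is a sketch (illustrative; it assumes a connected, non-bipartite graph, so the largest eigenvalue is strictly dominant and the shares settle down):

```python
import numpy as np

def power_iteration(A, num_steps=200):
    """Repeatedly apply A to a non-negative start vector, tracking population shares."""
    n = A.shape[0]
    x = np.ones(n) / n            # the initial "virus population", spread evenly
    for _ in range(num_steps):
        x = A @ x                 # every virus sends one copy along each link
        x = x / x.sum()           # keep only the shares (rescaling changes nothing)
    return x

# Connected, non-bipartite example: triangle 0-1-2 plus node 3 attached to node 2
g = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

shares = power_iteration(g.T)     # long-run fraction of viruses at each node
print(shares)

# Cross-check: the same vector comes out of the leading eigenvector of g.T
vals, vecs = np.linalg.eig(g.T)
v = np.abs(vecs[:, np.argmax(vals.real)].real)
print(v / v.sum())                # matches `shares` up to numerical error
```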
Other Insights from this Argument

Just like with the viruses, the limiting vector x (∞) is proportional
to the largest eigenvector. This vector defines eigenvector
centrality.

We might also ask how fast this convergence takes place.


I This is determined by how fast (r2/r1)^t goes to 0 as t → ∞ (since (ri/r1)^t goes to 0 faster than this for each i ≥ 3).
I Bigger gap between first and second eigenvalue =⇒ faster
convergence.

MIT OpenCourseWare
https://ocw.mit.edu

14.15 / 6.207 Networks


Spring 2022

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

