Web Data Analysis
Cheng-Jun Wang
Outline
I. Key features of web data
II. Major approaches to web data analysis
i. Network analysis
ii. Temporal analysis
iii. Spatial analysis
iv. Sentiment analysis
[Figure: data tables with columns ID, Time, V1, ... and row counts of 1,000 and 10,000]
Multiple regression, log-linear model, multilevel analysis, structural equation modeling, etc.
Analysis of web (longitudinal, tall) data: time series analysis, network analysis, spatial analysis, text mining, etc.
APPROACHES TO WEB DATA ANALYSIS
Features
Temporal features
Spatial features
Structural/behavioral features (e.g., RT, @)
Content features (term/topic/sentiment)
Approaches
Programming-based
SPSS
Excel
Stata
SAS
NETWORK ANALYSIS (ELEMENTARY-LEVEL)
What Is a Network?
A network consists of
Nodes (actors, agents, etc.)
Edges (relations, ties, etc.)
a graph
a matrix
a web
a map
etc.
Key Concepts
Network
Node
Edge
Ego-network
Component
Triadic closure
Individual-level analysis:
Centrality metrics
Group-level analysis:
Transitivity
Global-level analysis:
Density
Modularity
Edges:
Kinship links (family ties)
Friendship ties (factual or perceived)
Business transactions
Travel routes (highways, subways, air flights)
Similarities (word co-occurrences in articles)
etc.
Examples of Innovative Network Analysis
Food Flavor Network
http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html
http://www.eie.polyu.edu.hk/~xfliu/publications/LiuXF.2010.physa.Music.pdf
More on Edges
Directed (one-way) vs. undirected (two-way)
Observed (directly measured, e.g., hyperlinks) vs. hidden (inferred, e.g., co-occurrences)
Formal (institutionally arranged, top-down) vs. informal (self-organized, bottom-up)
Static (unchanged over time) vs. dynamic (evolving)
Positive (e.g., friending) vs. negative (e.g., defriending)
The key challenge for innovative network analysis is to identify hidden, informal, and evolving edges.
Direction of Ties
Directly observed, undirected: friendship networks (e.g., Facebook, Google+)
Directly observed, directed: microblog networks (e.g., Twitter, Sina Weibo)
Indirectly inferred, undirected: semantic networks (e.g., recommendation systems, social tagging systems)
Indirectly inferred, directed: newsgroups, blogs, WWW hyperlink networks
Components
A component is a subset of a network:
i. whose nodes are connected to each other
ii. but not to the rest of the network
Ego-Centric Network
Ego-network: a subset of a network that includes a designated node (ego) and its neighbors (alters)
For example, the followers of a VIP account on Twitter or Sina Weibo form an ego-network.
All snowball samples of online social networks are ego-networks.
An important property of ego-networks is their depth (see next slide).
A family tree is a special case of an ego-network (see the second next slide).
[Figure: node A and its friends at times t0 and t1]
Why are friends (B and C) of a common friend (A) more likely to become friends themselves? 1. Chances to meet each other; 2. similarity between them.
Triads of Undirected Networks
Closed Triad
Connected Pair
Open Triad
Unconnected
[Figure: persons linked through a shared focus, e.g., recommended books on Amazon or recommended groups on Facebook; panel (c): membership closure]
Individual-level Analysis
Find popular/important/influential nodes, usually based on centrality metrics
Degree centrality: How many nodes are you connected to?
Closeness centrality: How close are you to other nodes?
Betweenness centrality: How many shortest paths pass through you?
Eigenvector centrality: How many important nodes are around you?
Interpretation of Centrality Scores
High centrality scores: individuals with high centrality scores are often more likely to be:
leaders
key conduits of information
early adopters of anything that spreads in a network
who may be protected from negative contagion and influence
who may be associated with less work overload in an organization
A network of 10 nodes and 18 edges (figure with lettered nodes, including D, E, G, and H):
Who has the highest degree centrality?
Who has the highest betweenness centrality?
Who has the highest closeness centrality?
Degree Centrality
Number of neighbors to which a node is directly connected
Indicates how well the node is connected within the graph
Degree of G = 6
Betweenness Centrality
The number of shortest paths between pairs of other nodes that pass through a node (as compared with the total number of shortest paths in the graph)
Indicates how critical the node is to the flow of information or resources in the graph
Betweenness of H = 14
Closeness Centrality
Total number of steps along the shortest paths from the focal node to all other nodes (fewer steps = closer)
Indicates how quickly information travels between the node and anyone else in the graph
Closeness of D and E = 14 each
14 = 1*5 + 2*3 + 3*1 (5 nodes at 1 step, 3 nodes at 2 steps, 1 node at 3 steps)
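The degree, betweenness, and closeness figures on these slides can be reproduced in R. The sketch below assumes the igraph package and uses the Krackhardt kite, a standard example graph with 10 nodes and 18 edges. That this graph matches the slide's figure is an assumption, and the node names are the kite's conventional ones, not the slide's letters.

```r
library(igraph)

# Krackhardt kite: 10 nodes, 18 edges (assumed stand-in for the slide's figure)
el <- matrix(c(
  "Andre", "Beverly",  "Andre", "Carol",     "Andre", "Diane",
  "Andre", "Fernando", "Beverly", "Diane",   "Beverly", "Ed",
  "Beverly", "Garth",  "Carol", "Diane",     "Carol", "Fernando",
  "Diane", "Ed",       "Diane", "Fernando",  "Diane", "Garth",
  "Ed", "Garth",       "Fernando", "Garth",  "Fernando", "Heather",
  "Garth", "Heather",  "Heather", "Ike",     "Ike", "Jane"
), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = FALSE)

deg <- degree(g)                  # number of direct neighbors
btw <- betweenness(g)             # shortest paths passing through each node
farness <- rowSums(distances(g))  # total steps to all other nodes (lower = closer)

names(which.max(deg))                    # "Diane", with degree 6
names(which.max(btw))                    # "Heather", with betweenness 14
names(farness)[farness == min(farness)]  # "Fernando" "Garth", 14 steps each
```

The three measures single out different nodes, which is the point of the slide's three questions: the best-connected node is not the one that bridges the most paths.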
Eigenvector Centrality
The extent to which a node is a big fish connected with other big fish in a big pond.
Calculated by assessing how well connected a node is to the parts of the network with the greatest connectivity.
Nodes with high eigenvector scores have many connections who themselves have many connections, etc., similar to the logic of Google PageRank.
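One common way to compute this score is to take the leading eigenvector of the adjacency matrix. The sketch below uses base R only, on a hypothetical four-node star network (the node names are made up for illustration).

```r
# Hypothetical star network: node "hub" is tied to three leaves
nodes <- c("hub", "a", "b", "c")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
A["hub", c("a", "b", "c")] <- 1
A <- A + t(A)                    # make the ties undirected (symmetric)

# Eigenvector centrality: leading eigenvector of the adjacency matrix
ev <- eigen(A, symmetric = TRUE)
score <- abs(ev$vectors[, 1])    # the sign of an eigenvector is arbitrary
score <- score / max(score)      # rescale so the top node scores 1
names(score) <- nodes
round(score, 3)                  # hub = 1, each leaf = 0.577
```

Each leaf has only one tie, yet it still receives a nonzero score because that tie leads to the best-connected node.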
Group-level Analysis
Central question: How are nodes clustered (grouped) together?
Based on clustering analysis, a method that merges n nodes into g groups such that:
the nodes within the same group are maximally similar (homogeneous)
the nodes in different groups are maximally different (heterogeneous)
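As a rough illustration of merging n nodes into g groups, the sketch below applies base R's hierarchical clustering to a hypothetical six-node network. Measuring similarity by comparing rows of the adjacency matrix, and counting each node as its own neighbor, are assumptions made for this example, not the method prescribed by the slides.

```r
# Hypothetical network: two triangles (1-2-3 and 4-5-6) joined by the tie 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A <- matrix(0, 6, 6)
A[edges] <- 1
A <- A + t(A)        # undirected ties
diag(A) <- 1         # assumption: count each node as its own neighbor

# Nodes with similar tie profiles are close; merge them step by step
d <- dist(A)                           # Euclidean distance between rows
tree <- hclust(d, method = "average")  # agglomerative clustering
groups <- cutree(tree, k = 2)          # cut the tree into g = 2 groups
groups                                 # 1 1 1 2 2 2
```

Calling plot(tree) draws the dendrogram, which shows the order in which nodes are merged.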
Process of Clustering Analysis
[Figure: clustering of nodes 1 to 10]
Group-level Metrics in NodeXL
Vertex counts
Edge counts
Geodesic distances
Group density
Number of edges between each pair of groups
Global-level Analysis
Key question: How densely or closely connected is the network as a whole?
Fig. a (top): connected based on 67% agreement
Fig. b (bottom): connected based on 75% agreement
Graph Type: Directed or undirected.
Vertices: The number of vertices in the graph.
Unique Edges: The number of edges that do not have duplicates.
Edges With Duplicates: The number of edges that have duplicates.
Total Edges: The number of edges in the graph. This is the sum of Unique Edges and Edges With Duplicates.
Self-Loops: The number of edges that connect a vertex to itself.
Reciprocated Vertex Pair Ratio: In a directed graph, this is the number of vertex pairs that have edges in both directions divided by the number of vertex pairs that are connected by any edge. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined.
Reciprocated Edge Ratio: In a directed graph, this is the number of edges that are reciprocated divided by the total number of edges. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined and is not calculated.
Connected Components: The number of connected components in the graph. A connected component is a set of vertices that are connected to each other but not to the rest of the graph.
Single-Vertex Connected Components
Maximum Vertices in a Connected Component
Maximum Edges in a Connected Component
Maximum Geodesic Distance (Diameter)
Average Geodesic Distance
Graph Density
Modularity
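The two reciprocation ratios defined above are easy to confuse, so a small worked example may help. The sketch below uses base R and a hypothetical directed edge list. It follows the definitions as stated in the text and is not NodeXL's own code.

```r
# Hypothetical directed edge list, including one duplicate and one self-loop
edges <- data.frame(
  from = c("A", "B", "A", "C", "A", "D"),
  to   = c("B", "A", "C", "D", "B", "D"),
  stringsAsFactors = FALSE
)

# Ignore self-loops and duplicate edges, as the definitions require
edges <- unique(edges[edges$from != edges$to, ])

# An edge is reciprocated if the reverse edge also exists
is_recip <- paste(edges$to, edges$from) %in% paste(edges$from, edges$to)
reciprocated_edge_ratio <- mean(is_recip)   # 2 of 4 edges = 0.5

# Collapse each edge to its unordered vertex pair
pair <- paste(pmin(edges$from, edges$to), pmax(edges$from, edges$to))
reciprocated_pair_ratio <- mean(tapply(is_recip, pair, any))  # 1 of 3 pairs
```

The edge ratio (0.5) is larger than the pair ratio (1/3) because a reciprocated pair contributes two edges but only one pair.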
HANDS-ON TUTORIALS
[Image: Super R logo. Source: www.redbubble.com/]
R packages
Network analysis: igraph, statnet, RSiena
Spatial analysis: sp, spatial, OpenStreetMap, RgoogleMaps (http://cran.r-project.org/web/views/Spatial.html)
Temporal analysis
Text mining: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Machine
http://cran.r-
I came, I saw, and I walked away?
Picture: Gareth Jenkins/Solent
http://www.telegraph.co.uk/news/picturegalleries/picturesoftheday/8561204/Pictures-of-the-day-7-June-2011.html?image=6
Plunge
HANDS-ON!
Demo 1. Software Installation
Download and install R, RStudio, and NodeXL
http://cran.r-project.org/
https://www.rstudio.com/ide/
http://nodexl.codeplex.com/
More information
https://www.rstudio.com/training/online.html
NETWORK ANALYSIS (ADVANCED-LEVEL)
Network Topology
Regular or random?
Regular network
Nodes are connected in a regular neighborhood with a fixed number k of edges per node
They do not exhibit small-world characteristics
They may exhibit clustering
Random network
Edges are connected at random
Each node has an average number of edges
They exhibit small-world characteristics
They do not exhibit clustering
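The contrast between regular and random networks can be checked numerically. The sketch below assumes the igraph package and uses the Watts-Strogatz rewiring model, with an intermediate rewiring probability added for comparison. The network size, neighborhood, probabilities, and seed are arbitrary choices for illustration.

```r
library(igraph)
set.seed(42)  # arbitrary seed, only to make the random graphs repeatable

n <- 100
# Watts-Strogatz model: ring lattice with 2 neighbors on each side,
# each edge rewired with probability p
regular     <- sample_smallworld(dim = 1, size = n, nei = 2, p = 0)
small_world <- sample_smallworld(dim = 1, size = n, nei = 2, p = 0.05)
random      <- sample_smallworld(dim = 1, size = n, nei = 2, p = 1)

summarize <- function(g) {
  c(clustering  = transitivity(g),   # share of connected triples that are closed
    path_length = mean_distance(g))  # average shortest path length
}
round(rbind(regular     = summarize(regular),
            small_world = summarize(small_world),
            random      = summarize(random)), 2)
```

The regular lattice should show high clustering with long paths, and the fully rewired graph low clustering with short paths. A few rewired edges are usually enough to shorten paths sharply while most of the clustering remains.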
Small-World Networks
Between order and chaos
Network generation
Such that
Scale-free network
Power law
Long-tail distribution
$P(k) \sim k^{-a}$, $0 < a < 2$
$\log P(k) \sim -a \log k$
Zipf distribution
Pareto distribution
Properties
Scale-invariance
$P(ck) \sim (ck)^{-a}$
Thus, $P(ck) \sim c^{-a} k^{-a}$
$P(ck) \propto k^{-a}$
No average
Universality
Barabási, Albert, and Jeong, Scale-free characteristics of random networks: The topology of the world-wide web, Physica A, 281, 2000, pp. 69-77.
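The two power-law properties used on this slide, the straight line on log-log axes and scale invariance, can be verified with a few lines of base R. The exponent a = 1.5 and the degree range are assumed values chosen for illustration.

```r
a <- 1.5                    # assumed exponent, within the slide's 0 < a < 2
k <- 1:1000                 # degrees
P <- k^(-a) / sum(k^(-a))   # P(k) proportional to k^(-a), normalized

# On log-log axes a power law is a straight line with slope -a
fit <- lm(log(P) ~ log(k))
slope <- unname(coef(fit)[2])
slope                       # -1.5

# Scale invariance: P(c*k) / P(k) is the same constant c^(-a) for every k
c_factor <- 2
ratio <- P[c_factor * (1:500)] / P[1:500]
range(ratio)                # both ends equal c_factor^(-a)
```

With real degree data the points scatter around the line, and a regression on log-log axes gives only a rough estimate of the exponent.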
R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
# install.packages("igraph")  # run once if igraph is not installed
library(igraph)
size = 50
# regular structures
g = make_tree(size, children = 2); plot(g)
g = make_star(size); plot(g)
g = make_full_graph(size); plot(g)
g = make_ring(size); plot(g)
# ring in which each node is also tied to nodes 2 steps away
g = connect(make_ring(size), 2); plot(g)
# random network
g = sample_gnp(size, 0.1); plot(g)
# small-world network: rewire a few edges of a regular ring lattice
g = sample_smallworld(dim = 1, size = size, nei = 2, p = 0.05); plot(g)
# scale-free network
g = sample_pa(size); plot(g)
Friendship, Interaction networks and Vote agreement of congressmen in the United States. 7th APNC, Montreal, Canada
How to Represent a Network?
[Figure: a network of nodes A to F with edges e1 to e6]
Edge list:
A, B
A, D
A, C
C, D
C, E
C, F
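An edge list and an adjacency matrix hold the same information, and converting between them takes a few lines of base R. The sketch below reads the slide's two columns as the pairs A-B, A-D, A-C, C-D, C-E, C-F, which is an interpretation of the extracted text, and treats the ties as undirected, which is also an assumption.

```r
# Edge list: one row per edge
from <- c("A", "A", "A", "C", "C", "C")
to   <- c("B", "D", "C", "D", "E", "F")

nodes <- sort(unique(c(from, to)))
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
A[cbind(from, to)] <- 1   # mark each edge ...
A[cbind(to, from)] <- 1   # ... in both directions (assumed undirected)
A
rowSums(A)                # degree of each node: A = 3, C = 4, D = 2, others 1
```

For a directed network, only the first assignment would be kept, so that row i and column j record an edge from i to j.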
Network
NodeXL: Calculating graph metrics
Network
NodeXL: Set vertex color and vertex size
R script
http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/
Graph Statistics
Centrality Measures
Graph algorithms
Shortest path
Connected component algorithms
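The items on this slide map onto a handful of igraph calls. The sketch below is not the linked demo script. It assumes the igraph package and a hypothetical network of six connected nodes plus a separate pair G-H, added so that there is more than one component.

```r
library(igraph)

# Hypothetical network: six connected nodes plus a separate pair G-H
el <- matrix(c("A", "B",  "A", "D",  "A", "C",  "C", "D",
               "C", "E",  "C", "F",  "G", "H"), ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = FALSE)

# Graph statistics
c(vertices = vcount(g), edges = ecount(g), density = edge_density(g))

# Centrality measures
degree(g)

# Shortest path between two nodes
distances(g, v = "B", to = "F")   # path length in steps: 3
as_ids(shortest_paths(g, from = "B", to = "F")$vpath[[1]])  # "B" "A" "C" "F"

# Connected components
comp <- components(g)
comp$no                           # number of components: 2
```

The membership element of the components() result gives the component of each node, which is useful for extracting the largest component before further analysis.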
Procedures of ERGM
Network Configurations: Undirected Networks
Edge
2-star
3-star
4-star
...
K-star
Triangle
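In an ERGM, counts of such configurations enter the model as network statistics. The sketch below only counts them, in base R, for a hypothetical six-node undirected network. Fitting the model itself is not shown.

```r
# Hypothetical undirected network as an adjacency matrix
nodes <- c("A", "B", "C", "D", "E", "F")
A <- matrix(0, 6, 6, dimnames = list(nodes, nodes))
ties <- rbind(c("A", "B"), c("A", "D"), c("A", "C"),
              c("C", "D"), c("C", "E"), c("C", "F"))
A[ties] <- 1
A <- A + t(A)

deg <- rowSums(A)
n_edges     <- sum(A) / 2
n_2stars    <- sum(choose(deg, 2))           # pairs of ties sharing a node
n_3stars    <- sum(choose(deg, 3))           # triples of ties sharing a node
n_triangles <- sum(diag(A %*% A %*% A)) / 6  # each triangle is counted 6 times

c(edges = n_edges, two_stars = n_2stars,
  three_stars = n_3stars, triangles = n_triangles)  # 6, 10, 5, 1
```

The only triangle here is A-C-D. Nodes B, E, and F each have a single tie and so cannot take part in one.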
Network Configurations: Directed Networks
Arc
Reciprocity
Isolate
2-in star
2-out star
2-mixed star
...
K-in star
K-out star
Transitive triad
Cyclic triad
One example
Tie-network configuration matrix
Rows (ties): Y1,2; Y1,3; Y2,3; ...; Yn,n-1
Columns (configurations): edges, 2-star, K-star, triangle
TEMPORAL ANALYSIS
Time domain:
ARIMA/VAR analysis
Survival analysis
Multilevel analysis
Frequency domain:
Fourier transform
Spectrum analysis (comparing the coefficients $a_k$ and $b_k$ of different time series)
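To make the frequency-domain idea concrete, the sketch below recovers the Fourier coefficients a_k and b_k of a hypothetical daily series with a weekly cycle, using base R's fft(). The series and its length are made up for illustration.

```r
# Hypothetical daily series with a weekly cycle, 10 full weeks
n <- 70
day <- 0:(n - 1)
x <- sin(2 * pi * day / 7)

X <- fft(x)
k <- 1:(n / 2 - 1)              # harmonics below the Nyquist frequency
a_k <-  2 * Re(X[k + 1]) / n    # cosine coefficients
b_k <- -2 * Im(X[k + 1]) / n    # sine coefficients

power <- a_k^2 + b_k^2          # strength of each harmonic
k_peak <- k[which.max(power)]   # dominant harmonic: 10 cycles in 70 days
n / k_peak                      # dominant period in days: 7
```

Comparing the a_k and b_k of two series at the same k shows whether they share a cycle and whether the cycles are in phase.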
Survival Analysis of Blogging Behavior
SPATIAL ANALYSIS
Spatial Analysis
Spatial Data:
Location names
IP addresses
Map visits
GPS usage
etc.
Well developed for offline data but underdeveloped and underutilized for web data beyond visual inspection.
Spatial Analysis:
Spatial clusters/patterns (by visual inspection)
Spatial autocorrelation
Spatial regression
Spatial dependence (correlation between nearby locations)
Spatial interaction (correlation between geo-coded variables)
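Spatial autocorrelation is commonly summarized with Moran's I, which the slides do not name. It is used here as one standard choice. The sketch below implements it in base R for a hypothetical row of four locations, where adjacent locations count as neighbors.

```r
# Moran's I: (n / sum of weights) * weighted cross-products / total variation
morans_i <- function(x, W) {
  z <- x - mean(x)                       # deviations from the mean
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Hypothetical example: 4 locations in a row; adjacent locations are neighbors
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

morans_i(c(1, 2, 3, 4), W)   # similar neighbors: 1/3 (positive)
morans_i(c(1, 4, 1, 4), W)   # alternating neighbors: -1 (negative)
```

Values near zero indicate no spatial pattern. The weight matrix W encodes what "nearby" means and is the main modeling choice.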
SENTIMENT ANALYSIS
Sentiment Analysis
Decompose sentiment
Emotion
Joy
Surprise
Anger
Sadness
Fear
Disgust
Polarity
Positivity
Negativity
Neutral
Lexicon method
Carlo Strapparava and Alessandro Valitutti's emotions lexicon
Janyce Wiebe's subjectivity lexicon
Liu Bing's polarity lexicon
Supervised machine learning
Combine lexicon and machine learning
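The lexicon method amounts to counting matches against word lists. The sketch below shows the idea in base R with a toy lexicon invented for this example. The published lexicons named above are far larger and are not reproduced here.

```r
# Toy polarity lexicon (hypothetical; real lexicons hold thousands of words)
positive <- c("good", "great", "love", "happy")
negative <- c("bad", "awful", "hate", "terrible")

polarity <- function(text) {
  # lower-case, replace anything that is not a letter with a space, split
  words <- strsplit(gsub("[^a-z]+", " ", tolower(text)), " ")[[1]]
  score <- sum(words %in% positive) - sum(words %in% negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

polarity("I love this great phone!")      # "positive"
polarity("Terrible, awful service.")      # "negative"
polarity("The parcel arrived on Monday")  # "neutral"
```

A simple count like this ignores negation ("not good") and sarcasm, which is one reason to combine the lexicon with supervised machine learning.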
REFLECTION ON WEB DATA ANALYSIS
http://www.google.com/trends/correlate/comic?p=2
Nature reported that Google Flu Trends (GFT) was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2).
Lazer et al. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science.
Facebook Insight
http://www.cnn.com/election/2012/facebook-insights/
http://www.zerogeography.net/2012/11/obama-wins-election-ontwitter.html
http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html
To Move on
R Style Guide: http://adv-r.had.co.nz/Style.html
R-bloggers
Stack Overflow
GitHub