Taskar+al NIPS03b
Stanford University
Abstract
Many real-world domains are relational in nature, consisting of a set of objects
related to each other in complex ways. This paper focuses on predicting the
existence and the type of links between entities in such domains. We apply the
relational Markov network framework of Taskar et al. to define a joint probabilis-
tic model over the entire link graph — entity attributes and links. The application
of the RMN algorithm to this task requires the definition of probabilistic patterns
over subgraph structures. We apply this method to two new relational datasets,
one involving university webpages, and the other a social network. We show that
the collective classification approach of RMNs, and the introduction of subgraph
patterns over link labels, provide significant improvements in accuracy over flat
classification, which attempts to predict each link in isolation.
1 Introduction
Many real-world domains are richly structured, involving entities of multiple types that
are related to each other through a network of different types of links. Such data poses
new challenges to machine learning. One challenge arises from the task of predicting
which entities are related to which others and what the types of these relationships are. For
example, in a data set consisting of a set of hyperlinked university webpages, we might
want to predict not just which page belongs to a professor and which to a student, but also
which professor is which student’s advisor. In some cases, the existence of a relationship
will be predicted by the presence of a hyperlink between the pages, and we will have only
to decide whether the link reflects an advisor-advisee relationship. In other cases, we might
have to infer the very existence of a link from indirect evidence, such as a large number
of co-authored papers. In a very different application, we might want to predict links
representing participation of individuals in certain terrorist activities.
One possible approach to this task is to predict the presence and/or type of the link
using only attributes of the potentially linked entities and of the link itself. For example,
in our university example, we might try to predict and classify the link using the words on
the two webpages, and the anchor words on the link (if present). This approach has the
advantage that it reduces to a simple classification task and we can apply standard machine
learning techniques. However, it completely ignores a rich source of information that is
unique to this task — the graph structure of the link graph. For example, a strong predictor
of an advisor-advisee link between a professor and a student is the fact that they jointly
participate in several projects. In general, the link graph typically reflects common patterns
of interactions between the entities in the domain. Taking these patterns into consideration
should allow us to provide a much better prediction for links.
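The flat approach described above can be sketched as a feature-construction step. This is a minimal illustration, not the paper's implementation: the function name `flat_link_features` and the `src:`/`dst:`/`anchor:` feature prefixes are hypothetical choices made here.

```python
from typing import Dict, Iterable


def flat_link_features(
    source_words: Iterable[str],
    target_words: Iterable[str],
    anchor_words: Iterable[str] = (),
) -> Dict[str, int]:
    """Build binary features for one candidate link from local attributes only.

    Hypothetical helper: uses the words on the two pages and the anchor
    words on the link (if present), and ignores the rest of the link graph.
    """
    features: Dict[str, int] = {}
    groups = (("src", source_words), ("dst", target_words), ("anchor", anchor_words))
    for prefix, words in groups:
        for word in words:
            features[f"{prefix}:{word}"] = 1
    return features


features = flat_link_features({"student", "advisor"}, {"professor"}, {"advisor"})
```

A dictionary like this could be handed to any standard classifier. Note that nothing in it depends on other links, which is exactly the information the flat approach discards.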
In this paper, we tackle this problem using the relational Markov network (RMN) frame-
work of Taskar et al. [14]. We use this framework to define a single probabilistic model
over the entire link graph, including both object labels (when relevant) and links between
objects. The model parameters are trained discriminatively, to maximize the probability
of the (object and) link labels given the known attributes (e.g., the words on the page, hy-
perlinks). The learned model is then applied, using probabilistic inference, to predict and
classify links using any observed attributes and links.
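The discriminative training criterion described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from this text: $\mathbf{x}$ denotes the observed attributes, $\mathbf{y}$ the (object and) link labels, $\phi_c$ the clique potentials parameterized by weights $\mathbf{w}$, and $Z(\mathbf{x})$ the normalizing constant.

```latex
P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{c} \phi_c(\mathbf{x}_c, \mathbf{y}_c),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c} \phi_c(\mathbf{x}_c, \mathbf{y}'_c),
\qquad
\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \, \log P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x}).
```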
2 Link Prediction
A relational domain is described by a relational schema, which specifies a set of object
types and attributes for them. In our web example, we have a Webpage type, where each
page has a binary-valued attribute for each word in the dictionary, denoting whether the
page contains the word. It also has an attribute representing the “class” of the webpage,
e.g., a professor’s homepage, a student’s homepage, etc.
To address the link prediction problem, we need to make links first-class citizens in our
model. Following [5], we introduce into our schema object types that correspond to links
between entities. Each link object is associated with a tuple of entity objects
that participate in the link. For example, a Hyperlink link object would be associated with
a pair of entities — the linking page, and the linked-to page, which are part of the link
definition. We note that link objects may also have other attributes; e.g., a hyperlink object
might have attributes for the anchor words on the link.
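As a rough illustration of the schema just described, entity and link object types might be represented as follows. The class and field names (`Webpage`, `Hyperlink`, `words`, `label`, `anchor_words`) and the example URLs are hypothetical; only the overall structure follows the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Webpage:
    """Entity object: one binary attribute per dictionary word, plus a class label."""
    url: str
    words: Set[str] = field(default_factory=set)  # words that appear on the page
    label: Optional[str] = None  # e.g. "professor" or "student"; None if unobserved

    def has_word(self, word: str) -> bool:
        """Binary-valued word attribute: does the page contain this word?"""
        return word in self.words


@dataclass
class Hyperlink:
    """Link object: associated with the (linking page, linked-to page) pair."""
    source: Webpage
    target: Webpage
    anchor_words: Set[str] = field(default_factory=set)  # attribute of the link itself


prof = Webpage("http://example.edu/~prof", {"professor", "research"}, "professor")
stud = Webpage("http://example.edu/~stud", {"student", "advisor"}, "student")
link = Hyperlink(source=stud, target=prof, anchor_words={"advisor"})
```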
As our goal is to predict link existence, we must consider links that exist and links that
do not. We therefore consider a set of potential links between entities. Each potential link
is associated with a tuple of entity objects, but it may or may not actually exist. We denote
this event using a binary existence attribute Exists, which is true if the link between the
associated entities exists and false otherwise. In our example, our model may contain a
potential link for each pair of webpages, and the value of the variable Exists determines
whether the link actually exists or not. The link prediction task now reduces to the problem
of predicting the existence attributes of these link objects.
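The reduction to existence attributes can be sketched as follows. This sketch assumes that potential links are ordered pairs of distinct pages (hyperlinks being directed); the function name `potential_links` and the toy page identifiers are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def potential_links(pages: Iterable[str], existing: Set[Pair]) -> Dict[Pair, bool]:
    """Map each potential link (ordered pair of distinct pages) to its Exists value."""
    return {(src, dst): (src, dst) in existing for src, dst in permutations(pages, 2)}


pages = ["a", "b", "c"]
existing = {("a", "b"), ("c", "a")}
exists = potential_links(pages, existing)
```

With three pages there are six potential links, of which two exist. The number of potential links grows quadratically with the number of pages, which is one reason a practical model may restrict the candidate set.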
An instantiation specifies the set of entities of each entity type and the values of all
attributes for all of the entities. For example, an instantiation of the hypertext schema is
a collection of webpages, specifying their labels, the words they contain, and which links
between them exist. A partial instantiation specifies the set of objects, and values for some
of the attributes. In the link prediction task, we might observe all of the attributes for all
of the objects, except for the existence attributes for the links. Our goal is to predict these
latter attributes given the rest.
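A partial instantiation can be illustrated by hiding some of the existence attributes. The sketch below is an assumption-laden illustration rather than the experimental protocol: the helper `hide_existence`, its `observed_fraction` and `seed` parameters, and the use of `None` for unobserved values are all hypothetical. Setting `observed_fraction=0.0` corresponds to the case in the text where no existence attribute is observed.

```python
import random
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]


def hide_existence(
    exists: Dict[Pair, bool], observed_fraction: float, seed: int = 0
) -> Dict[Pair, Optional[bool]]:
    """Keep a random fraction of Exists values; replace the rest with None."""
    rng = random.Random(seed)
    keys = sorted(exists)  # fixed order so the sample is reproducible
    n_observed = round(observed_fraction * len(keys))
    observed = set(rng.sample(keys, n_observed))
    return {k: (exists[k] if k in observed else None) for k in keys}


full = {("a", "b"): True, ("a", "c"): False, ("b", "c"): True, ("b", "a"): False}
partial = hide_existence(full, observed_fraction=0.5)
targets = [k for k, v in partial.items() if v is None]  # attributes to predict
```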
[Figure plots not recoverable from the extracted text. Surviving labels: Accuracy (y-axis); Section; Section & Triad; ber, mit, sta, ave; 10%, 25%, 50% observed; DD, JL, TX, 67, FG, LM, BC, SS.]
Figure 2: (a) Average precision/recall breakeven point for 10%, 25%, 50% observed links. (b)
Average precision/recall breakeven point for each fold of school residences at 25% observed links.
with similar URLs often belong to the same category or tightly linked categories (research
group/project, professor/course). For each page, the two pages whose URLs are closest in edit
distance are selected as “neighbors”, and we introduce pairwise cliques between “neighboring”
pages. Fig. 1(b) shows that the Neighbors model clearly outperforms the Flat model
across all schools, by an average of accuracy gain.
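The neighbor selection described above can be sketched with a standard Levenshtein distance. This is a minimal illustration under assumptions: the text does not say which edit-distance variant or tie-breaking rule was used, and the names `edit_distance` and `url_neighbors` and the toy URLs are hypothetical.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the usual two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def url_neighbors(urls: List[str], k: int = 2) -> Dict[str, List[str]]:
    """For each URL, return the k other URLs closest in edit distance.

    Ties are broken alphabetically (an assumption made here).
    """
    neighbors: Dict[str, List[str]] = {}
    for u in urls:
        others = [v for v in urls if v != u]
        others.sort(key=lambda v: (edit_distance(u, v), v))
        neighbors[u] = others[:k]
    return neighbors


urls = ["/a/x", "/a/y", "/a/z1", "/b/long/path"]
neighbors = url_neighbors(urls)
```

Each (page, neighbor) pair returned here would then receive a pairwise clique over the two page labels.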
Given the page categories, we can now apply the different models for link classifica-
tion. Thus, the Phased (Flat/Flat) model uses the Entity-Flat model to classify the page
labels, and then the Link-Flat model to classify the candidate links using the resulting en-
tity labels. The Phased (Neighbors/Flat) model uses the Neighbors model to classify
the entity labels, and then the Link-Flat model to classify the links. The Phased
(Neighbors/Section) model uses the Neighbors model to classify the entity labels and then the
Section model to classify the links.
We also tried two models that predict page and relation labels simultaneously. The
Joint + Neighbors model is simply the union of the Neighbors model for page categories
and the Flat model for relation labels given the page categories. The Joint + Neighbors
+ Section model additionally introduces the cliques that appeared in the Section model
between links that appear consecutively in a section on a page. We train the joint models
to predict both page and relation labels simultaneously.
Because the proportion of the “none” relation is so large, we use the probability of “none” to
define a precision-recall curve. If this probability is less than some threshold, we predict
the most likely label (other than none); otherwise, we predict the most likely label (including
none). As usual, we report results at the precision-recall breakeven point on the test
data. Fig. 1(c) shows the breakeven points achieved by the different models on the three
schools. Relational models, both phased and joint, did better than flat models on
average. However, performance varies from school to school, and for both joint and phased
models, performance on one of the schools is worse than that of the flat model.
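The thresholded decision rule and the breakeven point can be sketched as follows. The decision rule follows the description above; the definitions of precision and recall over non-"none" predictions, the grid search over thresholds, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def predict_relation(probs: Dict[str, float], threshold: float, none_label: str = "none") -> str:
    """Below the threshold on P(none), exclude 'none'; otherwise allow every label."""
    if probs[none_label] < threshold:
        candidates = {l: p for l, p in probs.items() if l != none_label}
    else:
        candidates = probs
    return max(candidates, key=candidates.get)


def precision_recall(predicted: Sequence[str], gold: Sequence[str],
                     none_label: str = "none") -> Tuple[float, float]:
    """Precision and recall over non-'none' relations (assumed definition)."""
    predicted_pos = sum(p != none_label for p in predicted)
    gold_pos = sum(g != none_label for g in gold)
    correct = sum(p == g and g != none_label for p, g in zip(predicted, gold))
    precision = correct / predicted_pos if predicted_pos else 1.0
    recall = correct / gold_pos if gold_pos else 1.0
    return precision, recall


def approximate_breakeven(all_probs: List[Dict[str, float]], gold: Sequence[str],
                          thresholds: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) where |precision - recall| is smallest."""
    results = []
    for t in thresholds:
        predicted = [predict_relation(p, t) for p in all_probs]
        precision, recall = precision_recall(predicted, gold)
        results.append((t, precision, recall))
    return min(results, key=lambda r: abs(r[1] - r[2]))


all_probs = [
    {"none": 0.9, "advisor": 0.1},
    {"none": 0.4, "advisor": 0.6},
    {"none": 0.6, "advisor": 0.4},
    {"none": 0.7, "advisor": 0.3},
]
gold = ["none", "advisor", "advisor", "none"]
breakeven = approximate_breakeven(all_probs, gold, [0.5, 0.65, 0.8, 0.95])
```

Raising the threshold converts more "none" predictions into relation predictions, trading precision for recall; the breakeven point is where the two meet.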
References
[1] L. Adamic, O. Buyukkokten, and E. Adar. A social network caught in the web.
http://www.hpl.hp.com/shl/papers/social/, 2002.
[2] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery.
Learning to extract symbolic knowledge from the world wide web. In Proc. AAAI, 1998.
[3] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[4] L. Egghe and R. Rousseau. Introduction to Informetrics. Elsevier, 1990.
[5] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Probabilistic models of relational structure.
In Proc. ICML, 2001.
[6] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text and link structure for
hypertext classification. In IJCAI Workshop on Text Learning: Beyond Supervision, 2001.
[7] R. Ghani, S. Slattery, and Y. Yang. Hypertext categorization using hyperlink patterns and meta
data. In Proc. ICML, 2001.
[8] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. JACM, 46(5):604–632,
1999.
[9] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI, pages 580–587,
1998.
[10] N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications.
Ellis Horwood, 1994.
[11] J. Neville and D. Jensen. Iterative classification in relational data. In AAAI Workshop on Learn-
ing Statistical Models from Relational Data, 2000.
[12] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order
to the web. Technical report, Stanford University, 1998.
[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[14] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In
Proc. UAI, 2002.
[15] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data.
In Proc. IJCAI, pages 870–876, 2001.
[16] S. Wasserman and P. Pattison. Logit models and logistic regression for social networks. Psy-
chometrika, 61(3):401–425, 1996.
[17] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In Proc. NIPS, 2000.