Abstract
Scientific literature records the research process with a standardized structure and provides clues for tracking progress in a scientific field. Understanding its internal structure and content is of paramount importance for natural language processing (NLP) technologies. To meet this requirement, we have developed a multi-layered annotated corpus of scientific papers in the domain of Computer Graphics. Sentences are annotated with respect to their role in the argumentative structure of the discourse. The purpose of each citation is specified. Special features of the scientific discourse, such as advantages and disadvantages, are identified. In addition, a grade is allocated to each sentence according to its relevance for being included in a summary. To the best of our knowledge, such a complex, multi-layered collection of annotations and metadata characterizing a set of research papers has never been grouped together before in one corpus, and it therefore constitutes a new, richer resource with respect to those currently available in the field.
Keywords: multi-layered annotated corpus, scientific discourse, citations, summarization gold standard
1. Introduction
The development of natural language processing tools for information extraction or document summarization tailored to scientific literature will provide quick tracking of scientific creativity and innovation. Easy access to the challenges faced by researchers, their results and contributions, and how these relate to the works of other researchers, highlighting the novelties or advantages of the explored scientific project, may inspire new approaches in a line of investigation.

With the aim of supporting automated analysis, and thus easier access to this information, we have generated a multi-layered annotated corpus of scientific discourse. In this article, after introducing our multi-layered scientific annotation schema, we describe the way we collaboratively annotated our corpus so as to create its gold standard version.

Corpus annotations have been provided in two stages. In the first stage (Fisas et al., 2015), annotators were asked to characterize the argumentative structure of papers by associating each sentence with one of 5 categories (Challenge, Background, Approach, Outcome and Future Work), specifying, where applicable, a subcategory (Goal or Hypothesis) for Challenge sentences, and distinguishing among the Outcome sentences those that describe an author's Contribution. Based on the work of Liakata et al. (2010) and Teufel (2010), we developed an annotation schema and produced an annotated corpus. Its quality was evaluated in terms of inter-annotator agreement (K=0.66), comparable to the values attained by the aforementioned researchers.

The results were analysed by category, and 5 main areas were identified in the articles, in which the middle 50% of the sentences of each category were located. The output was a gold standard annotated corpus (10,777 sentences; 40 documents) in the domain of Computer Graphics. This corpus constitutes a valid dataset to experiment with automatic sentence classification algorithms.

In the second phase of the annotation, a step further into some other aspects of the scientific discourse has been taken. In the first place, we detected the interplay between the author's work and other researchers' contributions in the field by annotating the purpose of citations. In the second place, we tried to identify some frequent features of the scientific discourse (Advantages, Disadvantages, Novelties, Common Practices and Limitations), which will allow a better comparison of articles in the domain. Simultaneously, we graded sentences in terms of their relevance for a summary, in order to provide a manually annotated resource for reference and for training an automatic summarization tool.

2. State of the Art
In this Section we provide a brief overview of the most relevant annotation schemas proposed to support the characterization of citations, the identification of relevant scientific discourse statements and the spotting of text excerpts useful to summarize a paper.

Dealing with citation characterization, Moravcsik and Murugesan (1975), precursors in this research field, presented an in-depth study of the nature of citations and developed a typology to estimate their quality and context. The work of Jörg (2008), following Moravcsik and Murugesan (1975), inspired the Citation Typing Ontology (CiTO), which was later evaluated in terms of its use for annotation by Ciancarini et al. (2014).

The other most influential taxonomy is proposed by Spiegel-Rösing (1977), with 13 categories. However, 80% of the citation purposes could be classified in one category: Cited source substantiates a statement or assumption, or points to further information.

Nanba and Okumura's contribution (1999) is a very simplified scheme with 3 categories (Basis, Comparison and Other).
Teufel et al. (2006), who consider citations as signals of knowledge claims in the discourse structure, introduced a citation annotation scheme with 12 categories to study the interplay of the discourse structure of scientific arguments with formal citations; their scheme is adapted from Spiegel-Rösing (1977) and inspired by Swales' (1990) finding that scientific argument follows a general rhetorical structure.

Following Spiegel-Rösing (1977) and Teufel et al. (2006), Abu-Jbara and Radev (2012) stay with 6 categories (Criticism, Comparison, Use, Substantiation, Basis and Neutral) to determine the purpose and polarity of citations.

Research papers include the description of concepts, such as advantages, disadvantages or novelties, that do not belong exclusively to any of the structural sections of the discourse. They are useful for comparison between scientific articles. Both Liakata et al. (2010) and Teufel et al. (2009) have incorporated some crosswise features in their annotation schemes. Liakata's 3-layered annotation scheme devotes the 2nd layer to the annotation of properties of some of the concepts previously identified in the first layer. AZ-II, Teufel's annotation scheme, defines a category to characterize a novelty or an advantage of the approach mentioned in the paper.

In reference to grading sentences for summarization, Saggion et al. (2002) compiled human-generated "ideal" summaries at different compression rates and obtained a gold standard of sentence-based agreement, both between the annotators and between the summarizer and the human annotators. Sentences were assigned a score from 0 (irrelevant) to 10 (essential), expressing the annotators' subjective opinion about how relevant each sentence is for a summary.

3. Multi-Layered Annotation Schema
3.1. Citations: Purposes and Subpurposes
Our annotation scheme for citation purposes is an extension of the proposal of Abu-Jbara and Radev (2012), a well-balanced selection of 6 top categories, to which a second level of sub-purposes has been added (Table 1).

PURPOSE: SUB-PURPOSES
CRITICISM: Weakness, Strength, Evaluation, Other
COMPARISON: Similarity, Difference
USE: Method, Data, Tool, Other
SUBSTANTIATION: (none)
BASIS: Previous own work, Others' work, Future work
NEUTRAL: Description, Ref. for more information, Common practices, Other

Table 1: Citation Purpose Annotation Scheme

The sub-purposes' motivations are diverse: Weakness and Strength include a polarity judgement; Evaluation intends to collect those sentences where a balance of a positive and a negative comment on a cited paper is expressed; Similarity and Difference are opposite reasons for comparison. The sub-purposes suggested for the purpose Use are the different elements of a cited work that can be used by the author of the citing work (see Table 1). Citations categorised as Basis include references to the works of researchers upon which the citing work builds, but also to the author's own Previous work; some cited works may also be suggested for Future work. Finally, the Neutral category includes all the other citations, which can be a mere Description of a researcher's work, a Reference for more information or even a comment on Common practices in the field.

For example, the citation in the sentence:

Our approach is similar to margin-based linear structures classification [Taskar et al., 2003]

is classified as purpose: Comparison and sub-purpose: Similarity.
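For illustration, the two-level scheme of Table 1 fits naturally into a small lookup structure. The following Python sketch is our own illustration (the names are not part of the corpus distribution); it mirrors the purpose/sub-purpose pairs of Table 1 and validates a label against the scheme.

```python
from typing import Optional

# The purpose/sub-purpose pairs of Table 1 (illustrative constant,
# not part of the corpus distribution).
CITATION_PURPOSES = {
    "Criticism": ["Weakness", "Strength", "Evaluation", "Other"],
    "Comparison": ["Similarity", "Difference"],
    "Use": ["Method", "Data", "Tool", "Other"],
    "Substantiation": [],  # no sub-purposes in Table 1
    "Basis": ["Previous own work", "Others' work", "Future work"],
    "Neutral": ["Description", "Ref. for more information",
                "Common practices", "Other"],
}

def is_valid_label(purpose: str, subpurpose: Optional[str] = None) -> bool:
    """Check a (purpose, optional sub-purpose) pair against the scheme;
    the sub-purpose may be omitted."""
    if purpose not in CITATION_PURPOSES:
        return False
    return subpurpose is None or subpurpose in CITATION_PURPOSES[purpose]

# The Taskar et al. example above:
assert is_valid_label("Comparison", "Similarity")
```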
3.2. Crosswise Features
Based on the previous work of Teufel and Liakata, our annotation aim is to detect characteristic features of the scientific discourse that may appear at any point in a research paper. The annotation scheme therefore includes the following 7 categories: Advantage and Disadvantage, not limited to the author's approach but covering any reference to an advantage or disadvantage in the documents of our corpus; and, since advantages and disadvantages frequently appear in the same sentence, two double categories, Advantage-disadvantage and Disadvantage-advantage. Scientific literature also pays special attention to Novelties (not exclusive to the author's approach) and to comments on Common practices in the field, so these concepts were included in the annotation scheme. Finally, Limitations (referring only to the author's work) are also tagged, as they are of paramount importance in the comparison of different investigations.

For example, the sentence:

Skeleton Subspace Deformation (SSD) is the predominant approach to character skinning at present.

is classified as containing a Common Practice.
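As a small illustration (ours, not part of the corpus files), the label inventory of this layer, together with the default value used for unmarked sentences, can be written down as:

```python
# The 7 crosswise feature labels, plus the default label for sentences
# with no selected feature (illustrative constants).
CROSSWISE_FEATURES = frozenset({
    "Advantage", "Disadvantage",
    "Advantage-disadvantage", "Disadvantage-advantage",
    "Novelty", "Common Practice", "Limitation",
})
NO_FEATURE = "Unselected"  # default value, see Section 5
```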
3.3. Grading for summarization
The third annotation task is related to the summarization of scientific documents, following the works of Saggion et al. (2002) and Radev et al. (2003).
The annotation includes a double task: grading the sentences in each document according to their relevance for being included in a summary, and providing a handwritten summary no longer than 250 words.

We adopted a shorter sentence relevance grading scale than Radev et al. (2003) and asked the annotators to mark the sentences with a value from 1 to 5, 1 being the lowest relevance value and 5 the highest (Table 2).

GRADE: DEFINITION
1: Totally irrelevant for a summary
2: Should not appear in a summary
3: May appear in a summary
4: Relevant for a summary
5: Very relevant for a summary

Table 2: Grading Scale
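Taken together with the first-stage rhetorical categories, the three layers described in this Section mean that every corpus sentence carries several independent labels. A minimal sketch of such a record follows; the field names are ours, not the corpus' actual serialization format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedSentence:
    """One corpus sentence with its multi-layered labels."""
    text: str
    rhetorical_category: str                    # first stage: Challenge, Background, ...
    crosswise_feature: str = "Unselected"       # default: no feature selected
    citation_purpose: Optional[str] = None      # only for citation sentences
    citation_subpurpose: Optional[str] = None   # optional finer-grained label
    summary_grade: int = 1                      # 1 (totally irrelevant) to 5 (very relevant)

# Hypothetical labelling of the SSD example from Section 3.2.:
s = AnnotatedSentence(
    text="Skeleton Subspace Deformation (SSD) is the predominant "
         "approach to character skinning at present.",
    rhetorical_category="Background",   # assumed for illustration
    crosswise_feature="Common Practice",
    summary_grade=3,                    # assumed for illustration
)
```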
4. Corpus Dataset and Annotation Process
As described in Fisas et al. (2015), the corpus is a set of 40 articles randomly selected from a representative sample of research papers previously chosen by experts in the domain of Computer Graphics. The articles were classified into four important subjects in this area: Skinning, Motion Capture, Fluid Simulation and Cloth Simulation.

The annotation is sentence-based, as we have considered sentences to be the most meaningful minimal unit for the analysis of scientific discourse.

The annotation process is characterized by its collaborative approach between the developers of the methodology, experts in annotation and text mining, and the 12 annotators, who are experts in the domain of Computer Graphics. Thus, a web-based collaborative annotation tool (Annote) was developed, enabling users to easily annotate textual contents through a web browser (see Section 6.).

The documents were divided into 4 groups of 10 documents each, one for each of the 4 subjects included in the Computer Graphics Corpus. Each group of documents had to be annotated simultaneously by 3 annotators. Some documents were allocated for inter-annotator checking purposes.

The annotation process went through several steps. In the first place, a training session was held with the leader annotators of each one of the 4 groups. In this training session, the designers explained the annotation goals, motivations, tasks, categories, criteria and examples, as well as the details of the annotation tool and the steps to follow. The annotators were encouraged to test the tool and the guidelines in a hands-on annotation workshop.

The leader annotators were then assigned a demo environment for training new annotators, together with guidelines and recommendations. Similarly, once selected, the new annotators also had a set of documents just for testing and practicing with the Annote Web annotation platform.

In order to monitor the progress of the annotation and detect possible deviations or difficulties, an early check was scheduled once all annotators had tagged 4 documents each.

The citation purpose annotation schema was then simplified to a coarser-grained approach, as the analysis of the first results revealed that some annotators found it hard to distinguish between some sub-purposes. The schema was therefore reduced to the purposes, leaving the specification of the sub-purpose as optional. At the same time, new recommendations and modifications to the guidelines were forwarded to the annotators, making priorities between some categories clearer (for example, Advantage and Disadvantage are preferred to Common Practice in the Crosswise Features task).

The last step of the process was the reconciliation of the annotated versions of the documents, in order to obtain a gold standard corpus, and the collection of the human summaries.

5. Annotation Guidelines and Recommendations
The annotators were provided not only with Guidelines for the annotation of the three tasks, but also with a recommended procedure for annotation.

The Guidelines provide support in the identification of the purpose of a citation, the detection of crosswise features, and the criteria for grading sentences according to their relevance for a summary. This is a tedious and hard task requiring a careful reading of the original article, and annotators are advised to highlight the main ideas as they read the article's hardcopy or digital copy, in order to ease the grading task. Tables, figures, formulas and the division of the article into sections are dropped in the annotation tool's view of the paper.

The annotation procedure should ideally start by grading each sentence in an article, while simultaneously looking for the description of an advantage, disadvantage, novelty, common practice or limitation. All sentences have a default value, which tags them as Totally irrelevant for a summary and as having no crosswise feature (label: Unselected); this default value will remain unless the annotator chooses to change it.

After the grading task, the annotators were encouraged to write their personal summary, whose length should ideally be between 200 and 250 words for an average article (8-10 pages). The resulting text should be a short summary of the paper.
Figure 1: Annote annotation Web interface.
The citation context and the in-line citation are preselected by the tool in a separate collection. For each in-line citation, we considered the sentence where the citation occurs and the two sentences preceding and following it as candidate sentences for containing the purpose. The citation purpose annotation consists in reading through the whole context, looking for the purpose of the citation and selecting the reason from a pop-up list.
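This context selection amounts to a simple window over the document's sentence list. A sketch of the idea (our illustration, not the tool's actual code):

```python
def citation_context(sentences: list[str], cite_idx: int, window: int = 2) -> list[str]:
    """Return the citing sentence plus `window` sentences on each side,
    clipped at the document boundaries."""
    start = max(0, cite_idx - window)
    end = min(len(sentences), cite_idx + window + 1)
    return sentences[start:end]
```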
After the early check, the annotators were urged to keep their level of attention high and not to miss information. Some modifications to the guidelines were forwarded to the whole team, making priorities between categories clearer for the Crosswise Features and Citations annotation tasks.

For example, in the Crosswise Features task, annotators were reminded that what the author states takes priority over the annotator's opinion, and that some categories are preferred to others.

Regarding the Citations task, the priorities were set as follows: if a sentence can be tagged as Criticism (if it states an evaluation, a strength or a weakness according to the author), the annotator should prefer this category to any other. If Use is possible, then it is preferred to Basis. Lastly, Neutral has no preference: a citation will only be tagged as Neutral if it cannot be tagged as any other category.
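These priorities amount to choosing the first applicable label in a fixed order, with Neutral as the fallback. A possible reading in code (ours, for illustration):

```python
# Preference order from the guidelines: Criticism first, Use before
# Basis, Neutral only as a fallback. The placement of Comparison and
# Substantiation is not specified in the guidelines and is arbitrary here.
PRIORITY = ["Criticism", "Use", "Basis", "Comparison", "Substantiation"]

def resolve_purpose(applicable: set[str]) -> str:
    """Pick one citation purpose from the set of labels the annotator
    considers applicable."""
    for label in PRIORITY:
        if label in applicable:
            return label
    return "Neutral"  # nothing else fits

assert resolve_purpose({"Use", "Basis"}) == "Use"
```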
6. Web-based Annotation Tool: Annote
In order to enable the annotation of our corpus by several annotators distributed across distinct places, we developed Annote (http://penggalian.org/annote/; username: user, password: userpswd), a web-based collaborative annotation tool. Although easily adaptable to carry out distinct types of textual annotation tasks, Annote has been customized to support the annotation of our corpus papers with respect to the facets described in detail in this paper.

Textual annotations constitute the core item that annotators can create and characterize in Annote. Each annotation identifies a consecutive excerpt of a textual document and is characterized by its name and a set of features, like the rhetorical class or the summary relevance of the sentence identified by the excerpt. The names of the annotations, as well as the named features that can characterize each annotation, are specific to the annotation task. Annotations can be logically organized by grouping them into annotation sets. Usually all the annotations of an annotation set share some features (e.g. all the sentences of a section of the document).
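As a sketch, this data model (text excerpt, name, features, annotation sets) could be expressed as follows; the field names are our assumption, not Annote's published internals.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A consecutive excerpt of a document, with a name and features."""
    start: int                 # character offset where the excerpt begins
    end: int                   # character offset where it ends
    name: str                  # e.g. "Sentence", "Ctx sentence"
    features: dict[str, str] = field(default_factory=dict)  # e.g. {"grade": "3"}

@dataclass
class AnnotationSet:
    """A logical group of annotations, e.g. all sentences of a section."""
    label: str
    annotations: list[Annotation] = field(default_factory=list)
```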
By providing his credentials, each annotator can access Annote and browse a customized view of the documents he has to annotate. Annotators can annotate documents from one or more collections by choosing a specific annotation role that defines which editing actions the annotator can perform. Once the document to edit and the annotation role have been selected among the available document/role pairs, the Annote document annotation view is shown to the annotator (see Figure 1).

In the center-left part of the document annotation view there is the Document Viewer, which shows the textual contents of the document being annotated.
On the right side there is the Annotation Browser, which provides a two-level tree view of all the document
annotations. The root nodes of the tree view represent the
annotation sets, while each leaf identifies a name of the
annotations that are contained inside the corresponding
annotation set. Each annotation name is characterized by
a color; when the checkbox next to an annotation name
is selected, the excerpts of the text that are identified by
annotations with that name are highlighted by the same
color in the Document Viewer (see the annotations named
Sentence in Figure 1).
On the bottom of the document annotation view there is the Annotation Editor, where annotators can edit the features of the annotations and monitor the annotation status of the whole document.

Besides the general layout of the document annotation view, Figure 1 shows how Annote has been used in a specific annotation task of the Dr. Inventor Corpus: the characterization of the purpose of an in-line citation of a paper.

On the right side, in the Annotation Browser, we can see that the annotator is dealing with the in-line citation [Magnenat-Thalmann et al., 1988...], which is highlighted in the Document Viewer together with the sentences belonging to its context (Ctx sentence annotations). In the lower part of the document annotation view, the Annotation Editor shows the highlighted annotations (the sentences that belong to the citation context) by means of a scroll list. Only the first item of this list is visualized: this item is related to the citation context sentence 'Example based approaches...'. By clicking the 'Edit' button it is possible to access the Annotation editing tab of the annotation (citation context sentence) so as to specify its citation purpose.

7. Annotation Results
The Corpus includes 10,780 sentences, with an average of 269.7 sentences per document.

The Gold standard version has been built with the totally and partially agreed sentences. The strategy adopted with the totally disagreed sentences is different in the two stages of the annotation process. In the first stage, where the sentences were classified according to their argumentative content and only 3 annotators were involved, disagreed sentences (only 3%) were included in the Gold standard with the category chosen by the designer of the annotation scheme (who was also an annotator).

In the second stage, described in this paper, sentences with total inter-annotator disagreement have not been included in the Gold Standard, because there was no reliable reference on which to base a selection among the 3 proposed categories.
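In operational terms, total, partial and no agreement correspond to 3/3, 2/3 and all-distinct votes among the three annotators of a document. A sketch of the second-stage rule (our illustration):

```python
from collections import Counter
from typing import Optional

def reconcile(labels: list[str]) -> Optional[str]:
    """Second-stage gold standard rule for one sentence labelled by 3
    annotators: keep the majority label (total or partial agreement),
    drop the sentence (None) on total disagreement."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

assert reconcile(["Use", "Use", "Use"]) == "Use"       # total agreement
assert reconcile(["Use", "Use", "Basis"]) == "Use"     # partial agreement
assert reconcile(["Use", "Basis", "Neutral"]) is None  # total disagreement
```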
The dataset also contains a triple collection of human summaries for each of the 40 documents.

7.1. Citations Gold Standard
The Gold Standard version includes the Totally and Partially agreed sentences (84%) (Fig. 2), and the distribution of the non-default categories is the following: Criticism 23%, Comparison 9%, Use 11%, Substantiation 1%, Basis 5% and Neutral 53% (Fig. 3).

Figure 3: Distribution of purposes for citations

This distribution is comparable to the data of Abu-Jbara and Radev (2012) in a similar task of citation annotation with 30 papers.
7.2. Crosswise Features Gold Standard
The Crosswise Features Gold Standard contains 83% of the total sentences (Fig. 2) and excludes the totally disagreed sentences. The distribution of non-default categories in the Gold Standard is: Advantage 33%, Disadvantage 16%, Advantage-disadvantage 3%, Disadvantage-advantage 1%, Novelty 13%, Common Practice 32% and Limitation 2% (Fig. 4).

7.3. Grading Gold Standard
In this task, the percentage of disagreed sentences is higher (25%) (Fig. 2); however, in this case, the grade of each sentence is not a categorical feature like in the previous annotations, but an ordinal one.

The distribution of the selected categories in the gold standard corpus of graded sentences is the following: 1-Totally irrelevant for a summary 66%, 2-Should not appear in a summary 6%, 3-May appear in a summary 14%, 4-Relevant for a summary 6% and 5-Very relevant for a summary 8% (Fig. 5).

As expected, most sentences are considered totally irrelevant for a summary (grade 1). However, a closer analysis of the graded sentences in one of the groups of 3 annotators (An 1, An 2, An 3) revealed 3 different styles of selecting the relevant information to be included in a summary, confirming that there is not one single way of summarizing a text (Fig. 6).

An 2 and An 3 left the default value (grade 1) in nearly 75% of the sentences. On the contrary, An 1 split this 75% into sentences with grades 1, 2 and 3, leaving the remaining 25% equally distributed into sentences of grades 4 and 5. An 2 considered that more than half of the graded sentences should not appear in the summary (grade 2), a quarter could be included in the summary (grade 3), and only the remaining quarter were considered relevant (grade 4) or very relevant (grade 5) for a summary, in similar proportions. An 3 distributed the graded sentences in a more homogeneous way: he considered that about half of them should appear in the summary (grades 4 and 5), while the other half corresponds to sentences of grades 2 and 3. Finally, An 1 split the graded sentences into three thirds: according to him, one third of all the sentences should not appear in the summary (grade 2), another third could optionally be included (grade 3) and the last third contains the relevant sentences (grade 4, one sixth) and the very relevant sentences (grade 5).

The Gold standard version of our scientific discourse corpus has been built according to the criteria of total and partial agreement among the annotators' versions. Nevertheless, the values of the intra-group inter-annotator agreement (considering only the 3 annotators of the same team) were very low for some annotators, especially in the Skinning annotation team. Further analysis must be done in order to detect those annotators that do not meet the standard quality in the Citations and Crosswise Features annotation tasks. In this respect, two different strategies are possible: benchmarking the quality across the 4 teams against a reference annotation, or, alternatively, determining the most reliable annotators with MACE (Multi-Annotator Competence Estimation; Hovy et al., 2013).
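For such an intra-group analysis, a first diagnostic could be pairwise Cohen's kappa between the 3 annotators of a team, for instance with scikit-learn; this sketch (ours) assumes the label sequences are aligned sentence by sentence:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def intra_group_kappas(annotations: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Pairwise Cohen's kappa for one team; `annotations` maps an
    annotator id to its sentence-aligned list of labels."""
    return {
        (a, b): cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }
```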
References
Radev, D., Teufel, S., Saggion, H., Lam, W., Blitzer, J., Qi, H., Çelebi, A., Liu, D., and Drabek, E. (2003). Evaluation challenges in large-scale document summarization. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 375–382, Stroudsburg, PA, USA. Association for Computational Linguistics.
Saggion, H., Radev, D., Teufel, S., Lam, W., and Strassel, S. M. (2002). Developing infrastructure for the evaluation of single and multi-document summarization systems in a cross-lingual environment. In LREC 2002, pages 747–754, Las Palmas, Gran Canaria.
Spiegel-Rösing, I. (1977). Science studies: Bibliometric
and content analysis. Social Studies of Science, 7(1):97–
113, February.
Swales, J. (1990). Genre Analysis: English in Academic
and Research Settings. Cambridge Applied Linguistics.
Cambridge University Press.
Teufel, S., Siddharthan, A., and Tidhar, D. (2006). Auto-
matic classification of citation function. In Proceedings
of the 2006 Conference on Empirical Methods in Natu-
ral Language Processing, pages 103–110, Sydney, Aus-
tralia, July. Association for Computational Linguistics.
Teufel, S., Siddharthan, A., and Batchelor, C. (2009). To-
wards discipline-independent argumentative zoning: Ev-
idence from chemistry and computational linguistics. In
Proceedings of the 2009 Conference on Empirical Meth-
ods in Natural Language Processing: Volume 3 - Vol-
ume 3, EMNLP ’09, pages 1493–1502, Stroudsburg, PA,
USA. Association for Computational Linguistics.
Teufel, S. (2010). The Structure of Scientific Articles -
Applications to Citation Indexing and Summarization.
CSLI Studies in Computational Linguistics. Univ. of
Chicago Press.