Text Mining War and Peace
Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017. © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: [email protected]
doi:10.1093/llc/fqw052 Advance Access published on 29 December 2016
A. Bonch-Osmolovskaya and D. Skorinkin
structure that stores meanings rather than words (Manicheva et al., 2012; Petrova, 2013). The resulting trees contain nodes with all sorts of linguistic information attached to them: semantic classes from the said hierarchy (e.g. ‘PERSON_BY_FIRSTNAME’ or ‘VERBS_OF_ADDRESSING’), purely syntactic ‘surface slots’ (e.g. $Subject or $Modifier_Adverbial), syntactic-semantic ‘deep slots’ (e.g. Agent or Experiencer; the ‘deep slots’ in the COMPRENO model are quite similar to Fillmore’s ‘deep cases’; Fillmore, 1968) and sets of grammemes. Figure 1 presents a sample tree that the COMPRENO parser yields for the phrase in example (1) from Tolstoy’s War and Peace:
1) 28-го октября Кутузов с армией перешел на левый берег Дуная и в первый раз остановился, положив Дунай между собой и главными силами французов (On the 28th of October Kutuzov with his army crossed to the left bank of the Danube and took up a position for the first time with the river between himself and the main body of the French)
Note that, just as in English, the Russian word ‘силы’ (forces) has several meanings, but the parser performed the disambiguation correctly, choosing the ‘FORCES_AS_PEOPLE’ semantic class. The parser is also capable of anaphora resolution; for more information, see Bogdanov et al. (2014).
The information extraction system built upon COMPRENO allows writing sets of production rules to extract facts and entities from unstructured texts. The main advantage is that the deep semantic representation of text provided by COMPRENO enables us to describe a whole range of different variants of a phrase in a very concise manner. For instance, we do not need to care about the word order (which is flexible in Russian), since the syntactic roles of different words remain the same. And
In this case we require the system to find any subtree which has a node with the semantic class ‘VERBS_OF_COMMUNICATION’ or any of its descendant classes (since ‘VERBS_OF_COMMUNICATION’ is a very high-level class within our hierarchy and there are many lower classes that inherit from it) and at least two child nodes: one (or more) with the ‘Agent’ deep syntax slot and another
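The kind of rule described here can be illustrated with a minimal Python sketch. It is not the COMPRENO rule language, whose syntax is not shown in this text. The `Node` class, the `PARENT` hierarchy fragment, the class names `TO_SHOUT`, `MOTION` and `BEING`, and the function `match_rule` are all hypothetical. The rule description breaks off before naming the second required slot, so the sketch takes the required slots as a parameter; `Addressee` is used in the usage example only as an assumption suggested by the examples of addressing in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical fragment of a semantic class hierarchy (child -> parent).
# The real COMPRENO hierarchy is far larger and is not reproduced here.
PARENT = {
    "VERBS_OF_ADDRESSING": "VERBS_OF_COMMUNICATION",
    "TO_SHOUT": "VERBS_OF_COMMUNICATION",
}


@dataclass
class Node:
    """One node of a simplified parse tree."""
    semantic_class: str
    deep_slot: str = ""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def inherits_from(cls, ancestor, parent=PARENT):
    """True if `cls` is `ancestor` or one of its descendant classes."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = parent.get(cls)
    return False


def iter_nodes(node):
    """Depth-first traversal of the tree rooted at `node`."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def match_rule(root, head_class, required_slots):
    """Find nodes whose class descends from `head_class` and whose direct
    children fill every deep slot listed in `required_slots`."""
    matches: List[Tuple[Node, Dict[str, List[Node]]]] = []
    for node in iter_nodes(root):
        if not inherits_from(node.semantic_class, head_class):
            continue
        by_slot: Dict[str, List[Node]] = {}
        for child in node.children:
            by_slot.setdefault(child.deep_slot, []).append(child)
        if all(slot in by_slot for slot in required_slots):
            matches.append((node, {s: by_slot[s] for s in required_slots}))
    return matches


# Hand-built tree for "he shouted to the driver" (example 3 in the paper).
tree = Node("TO_SHOUT", text="shouted", children=[
    Node("BEING", deep_slot="Agent", text="he"),
    Node("BEING", deep_slot="Addressee", text="driver"),
])

# "Addressee" as the second required slot is an assumption, not the paper's rule.
matches = match_rule(tree, "VERBS_OF_COMMUNICATION", ["Agent", "Addressee"])
for head, slots in matches:
    print(head.text, {s: [n.text for n in nodes] for s, nodes in slots.items()})
```

Because the match is defined over deep slots rather than word positions, the same rule would cover any word order that the parser maps to the same tree, which is the advantage the text attributes to the semantic representation.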
Fig. 3 Visualization of PCA of the semantic role distribution in the first volume of War and Peace (books 1–3 of the
English translation)
‘Dmitri’, addressed Rostov to his valet on the box, ‘those lights are in our house, aren’t they?’
3) Ну же пошел, — кричал он ямщику. ‘Now then, get on’, he shouted to the driver.
4) Никаких извинений, ничего решительно, — говорил Долохов Денисову
sleep), and ditransitive verbs (give, address, show). The verbal arguments are differentiated on the basis of their syntactic position within the verbal semantic frame: Experiencer, Agent, Patient, Addressee, and Possessor. Therefore the idea of the experiment was to capture the relations between characters within each volume (note 4) of War and Peace with the help of a very abstract model which measures aptness of
Table 1 The distribution of semantic roles in predicate argument structure associated with the fourteen prominent
characters of the first volume of War and Peace (books 1–3 of the English translation)
Character            Agent   Object   Experiencer   Addressee   Possessor
Natasha              0.61    0.14     0.11          0.04        0.09
Anatole              0.59    0.17     0.13          0.04        0.07
Anna Scherer         0.59    0.10     0.16          0.04        0.11
Nikolai Bolkonsky    0.63    0.13     0.14          0.03        0.07
Lise                 0.50    0.19     0.16          0.05        0.09
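To make the step from such a table to the PCA diagrams more concrete, the following is a minimal sketch of a two-component PCA over role profiles, written in plain Python with power iteration. The paper does not say which PCA implementation or preprocessing the authors used, so the function names and the choice of algorithm are assumptions. Only the five rows visible in this text are used, so the resulting coordinates will not reproduce the published figures, which are based on all fourteen characters.

```python
import math

ROLES = ["Agent", "Object", "Experiencer", "Addressee", "Possessor"]

# The five rows of Table 1 that are visible in this text.
PROFILES = {
    "Natasha": [0.61, 0.14, 0.11, 0.04, 0.09],
    "Anatole": [0.59, 0.17, 0.13, 0.04, 0.07],
    "Anna Scherer": [0.59, 0.10, 0.16, 0.04, 0.11],
    "Nikolai Bolkonsky": [0.63, 0.13, 0.14, 0.03, 0.07],
    "Lise": [0.50, 0.19, 0.16, 0.05, 0.09],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the column means so that every role has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of the centred role proportions."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def principal_axis(cov, previous=(), iterations=500):
    """Power iteration for the leading eigenvector orthogonal to `previous`."""
    vec = [float(i + 1) for i in range(len(cov))]  # arbitrary fixed start
    for _ in range(iterations):
        vec = [dot(row, vec) for row in cov]
        for axis in previous:  # remove directions that were already found
            overlap = dot(vec, axis)
            vec = [x - overlap * a for x, a in zip(vec, axis)]
        norm = math.sqrt(dot(vec, vec))
        vec = [x / norm for x in vec]
    return vec


def pca_scores(profiles, n_components=2):
    """Return the principal axes and each character's coordinates on them."""
    names = list(profiles)
    centered = center([profiles[name] for name in names])
    cov = covariance(centered)
    axes = []
    for _ in range(n_components):
        axes.append(principal_axis(cov, previous=axes))
    scores = {name: [dot(row, axis) for axis in axes]
              for name, row in zip(names, centered)}
    return axes, scores


axes, scores = pca_scores(PROFILES)
for name, (x, y) in scores.items():
    print(f"{name:18s} PC1={x:+.3f} PC2={y:+.3f}")
```

Each axis is a weighted combination of the five roles, so inspecting the entries of `axes[0]` and `axes[1]` against `ROLES` indicates which roles pull a character towards a given region of the plot. The sign of each axis is arbitrary, as in any PCA.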
Fig. 4 Visualization of PCA of the semantic role distribution in the second volume of War and Peace (books 4–8 of the
English translation)
Clay, 1998), becomes much closer to Pierre now, which could also be viewed as a reflection of the plot. But so does Helene, his newlywed wife, a rather cynical woman of high society and (supposedly) low moral standards, whose existence at this point forbids any possibility of romance between Pierre and Natasha. Note also that both women, Natasha and Helene, are located near the Addressee zone insofar as they are the goals of the agentive actions of the three male characters. Here we might point out that this whole section of the novel is to a large extent about the two women, Natasha and Helene, and the men they attract (Anatole, Dolokhov, Boris, Denisov). All these complex interactions can be formalized with the semantic role analysis in a new and potentially insightful way.
In contrast, Natasha’s fiancé Andrey Bolkonsky in this period is rethinking his life after his wife’s death and his wounding at Austerlitz. Andrey is not agentive but is now located very close to his sister Marya in the Experiential zone (Fig. 4).
The third volume is rather distorted by the affairs of war, and the two opposing military commanders, Napoleon and Kutuzov, suddenly come to the forefront and become the two most agentive characters (Fig. 5).
Helene leaves her husband to follow the imperial court to Vilna, where she apparently has yet another affair, now with Boris, who is also the closest to her on the diagram (Fig. 5). Pierre and Natasha remain close to each other, and Andrey keeps leaning towards Experiential positions.
The data for the last books of the novel (volume 4 in the Russian canonical edition) is interesting mainly due to the apparent culmination of Prince Andrey’s story. Severely wounded in the battle of Borodino, he ends up with the Rostov family and eventually dies in Natasha’s care. As he awaits his death
and rethinks his life, he is unable to act anymore, but experiences a lot of feelings and is being nursed by others, which is reflected in his Objective/Experiential position on the diagram (Fig. 6). Andrey’s sister Marya makes it in time to say her goodbyes, and both women witness Andrey’s last minutes together. This common sorrow changes their formerly hostile relationship, and the two women become quite close to each other, which, apparently, is also reflected through semantic roles.
4 Conclusion
Our study shows that automatic semantic role labelling could be applied to literary research. This technique appears to have some potential to reveal the techniques authors use to construct the complexity of relations along a linear narrative. Semantic roles proved to be quite informative of the characters’ personal traits, providing us with ‘objective’ quantitative confirmation of something that was obvious to a reader or a critic but was not expressed explicitly in the text.
Some of our findings relate directly to certain published critical interpretations of the novel. For instance, Princess Marya’s strong disposition towards the Experiencer role obviously correlates with the claims by some prominent Tolstoy scholars (Eichenbaum, 2009) that this shy and sensitive character was borrowed by Tolstoy directly from the eighteenth-century sentimentalists.
At the next stage of our research we intend to develop a dedicated information extraction model for literary research based on the system we are working with. This model, already in the making, is to be designed and adjusted specifically to meet the needs of such research and is expected to help us extract much more information about characters, their description by the author and their relations
with one another. The first obvious improvement that could be made is to split the existing semantic roles into a somewhat more fine-grained and less abstract set that could, for example, distinguish between different types of ‘agentivity’ or ‘experientiality’ (sensitivity) of a character.
We also plan to pay special attention to the instances of direct speech within the novel since
Elson, D., Dames, N., and McKeown, K. (2010). Extracting Social Networks from Literary Fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Uppsala University, pp. 138–47.
Fillmore, C. J. (1968). The Case for Case. In Bach, E. and Harms, R. T. (eds), Universals in Linguistic Theory. New York: Holt, Rinehart and Winston, pp. 1–88.