
Text mining War and Peace: Automatic extraction of character traits from literary pieces
Anastasia Bonch-Osmolovskaya and Daniil Skorinkin

Downloaded from https://academic.oup.com/dsh/article/32/suppl_1/i17/2755123 by guest on 27 October 2024


National Research University ‘Higher School of Economics’, Russia
Abstract
This article presents a study of Leo Tolstoy’s War and Peace by means of automatic syntactic and semantic analysis. Using a parser that extracts syntactic dependencies and semantic roles we were able to compare different characters of the novel in terms of the semantic roles they tend to occupy. Our data show that there are certain dependencies between the apparent personal traits of a character and his or her positions within the predicate structures. We hope that further research will help us gain more insights into the ‘literary technique’ of Tolstoy and enable us to create a semantic mark-up of his works.

Correspondence: Skorinkin Daniil, National Research University ‘Higher School of Economics’, Myasnitskaya, 20, Moscow, 101000 Russia. E-mail: skorinkin.danil@gmail.com

1 Introduction

The idea that natural language processing tools and techniques might be used for literary research can hardly be called novel. There have been a number of studies dedicated to applying such tools to works of fiction of different genres and languages (Elson et al., 2010; Kokkinakis and Malm, 2011). Most of these works focus on discovering relations between characters and use either simple co-occurrence metrics or slightly more sophisticated lexico-syntactic patterns.

This article describes an attempt to apply state-of-the-art text mining technologies to the works of Leo Tolstoy. This study is the preparatory part of a project called ‘Tolstoy Digital’. The ultimate goal of this project is to convert the ninety-volume collected works of Leo Tolstoy into a digital humanities resource (Bonch-Osmolovskaya, 2016). We intend to create a kind of ‘semantic edition’ of Tolstoy’s works by providing it with a markup consistent with the Text Encoding Initiative (TEI) schema. The markup is expected to include a wide spectrum of tags, from persons, relations, and events to editorial notes and critical apparatus entries.

With more than 46,000 pages of text that contain about 14.5 million words, Tolstoy is famed as one of the most productive writers ever. The sheer size of the material suggests that some automation of the markup is desirable. In this article we demonstrate how the use of an advanced language analyser might help us extract information objects (entities and facts) which can be used for semantic mark-up of the text later on. We also show that automatic extraction of lexical patterns associated with different characters of the novel may improve our understanding of the ‘literary technique’ (see Shklovsky, 1925) used by the author for verbal distinction of personal properties of the main characters.

2 Tools and Method

The technology we apply to this task is called COMPRENO (see Anisimovich et al., 2012; Selegey, 2012). The COMPRENO parser automatically converts text into a forest of syntactic-semantic trees which comprise dependency links and constituency structure. The analysis is based on the universal semantic hierarchy, a complex WordNet-like ontological

Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017. © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: [email protected]
doi:10.1093/llc/fqw052 Advance Access published on 29 December 2016


Fig. 1 Sample COMPRENO tree (automatically generated, no manual corrections)

structure that stores meanings rather than words (Manicheva et al., 2012; Petrova, 2013). The resulting trees contain nodes with all sorts of linguistic information attached to them: semantic classes from the said hierarchy (e.g. ‘PERSON_BY_FIRSTNAME’ or ‘VERBS_OF_ADDRESSING’), purely syntactic ‘surface slots’ (e.g. $Subject or $Modifier_Adverbial), syntactic-semantic ‘deep slots’ (e.g. Agent or Experiencer; the ‘deep slots’ in the COMPRENO model are quite similar to Fillmore’s ‘deep cases’; Fillmore, 1968) and sets of grammemes. Figure 1 presents a sample tree that the COMPRENO parser yields for the phrase in example (1) from Tolstoy’s War and Peace:

1) 28-го октября Кутузов с армией перешел на левый берег Дуная и в первый раз остановился, положив Дунай между собой и главными силами французов (On the 28th of October Kutuzov with his army crossed to the left bank of the Danube and took up a position for the first time with the river between himself and the main body of the French)

Note that just like in English, in Russian there are several meanings for the word ‘силы’ (forces), but the parser performed disambiguation correctly, choosing the ‘FORCES_AS_PEOPLE’ semantic class. The parser is also capable of anaphora resolution; for more information on that see Bogdanov et al. (2014). The information extraction system built upon COMPRENO allows writing sets of production


rules to extract facts and entities from unstructured texts. The main advantage is that the deep semantic representation of text provided by COMPRENO enables us to describe a whole range of different variants of a phrase in a very concise manner. For instance, we do not need to care about the word order (which is flexible in Russian), since the syntactic roles of different words remain the same. And even in case of voice transformation (‘He loved her’ → ‘she was loved by him’) only surface syntax slots change, while deep slots remain unchanged. Figure 2 shows an example of a simple production:

Fig. 2 A rule for the extraction of Speech activity instances

In this case we require the system to find any subtree which has a node with the semantic class ‘VERBS_OF_COMMUNICATION’ or any of its descendant classes (since ‘VERBS_OF_COMMUNICATION’ is a very high-level class within our hierarchy and there are many lower classes that inherit from it) and at least two child nodes: one (or more) with the ‘Agent’ deep slot and another with the ‘Addressee’ slot. Both children must also belong to, or inherit from, the semantic class ‘HUMAN’ (which contains all sorts of subclasses that define people: names of occupations, social roles, relation terms, known proper names, and so on). Despite its simplicity, this rule will extract many examples of communication between people (or, in our case, characters) like the ones below in examples (2–5):

2) Дмитрий, — обратился Ростов к лакею на облучке. — Ведь это у нас огонь?

Fig. 3 Visualization of PCA of the semantic role distribution in the first volume of War and Peace (books 1–3 of the
English translation)


‘Dmitri’, Rostov addressed his valet on the box, ‘those lights are in our house, aren’t they?’

3) Ну же пошел, — кричал он ямщику.
‘Now then, get on’, he shouted to the driver.

4) Никаких извинений, ничего решительно, — говорил Долохов Денисову
‘No apologies, none whatever’, said Dolokhov to Denisov.

5) Ростов сделался не в духе <…> Он встал и подошел к Борису.
— Однако я тебя стесняю, — сказал он ему тихо, — пойдем, поговорим о деле, и я уйду.
(Rostov became sullen <…> He got up and approached Boris. ‘I’ve come at a bad time I think’, he said to him in a low voice. ‘Let us talk business, and then I’ll leave’)¹

Entities and facts can be represented formally as parts of an ontological model. We develop ontologies using the OWL language² developed by the W3C. In the executable right-hand side of a production we can either create a new information object of a certain class of an ontology or modify the existing ones (add a surname attribute to a Person-class object, for instance). After the rules have been applied, we receive the result of the information extraction process in the form of an XML document consistent with the Resource Description Framework (RDF) schema.³

3 Experiment

In our experiment we intended to disclose the mechanisms of verbal contrast used by Tolstoy to distinguish the characters of his novel and to follow the evolution of their behaviour through the novel. The semantic role analysis presupposes a very high level of abstraction of both the verbal classes and the syntactic position of the argument. The verbal classes include experiential verbs of the mental, perceptional, and emotional sphere (think, feel, fear, see, etc.), agentive transitive verbs (kill, break, create, etc.), agentive (unergative) intransitives (walk, laugh), patientive (unaccusative) intransitives (die, fall, sleep), and ditransitive verbs (give, address, show). The verbal arguments are differentiated on the basis of their syntactic position within the verbal semantic frame: Experiencer, Agent, Patient, Addressee, and Possessor. Therefore the idea of the experiment was to capture the relations between characters within each volume⁴ of War and Peace with the help of a very abstract model which measures the aptness of the characters to occupy specific syntactic positions in the context of verbs of different semantics.

We used the COMPRENO parser to extract semantic roles of the predicate structure associated with the most prominent characters. The final list of roles included Agent, Object (equivalent to Patient), Experiencer, Addressee, and Possessor. Table 1 demonstrates the standardized results of the semantic role distribution for fourteen characters of the first volume of the novel.

For instance, Anna Drubetskaya is the most agentive character, in contrast to her son Boris, who is the most object-like character. This is clearly a reflection of the plot, where a determined business-like mother takes care of the career of her as yet shy son (who would become just as pragmatic later on). Sensitive Mariya (Marya) Bolkonskaya is positioned as a prototypical experiencer, but one can also notice some inclination towards the experiential vector in her brother Andrey (Prince Andrew) and Nikolay Rostov, due to the emotional stress of their first battle experiences. The most striking property of the visualization is the similarity of the semantic roles between Anatole and Natasha. One may speculate that the reader thus gets a vague expectation of the major love intrigue of the novel, which has been set up long before it actually happens in the plot.

Principal Component Analysis (PCA) diagrams for other volumes (excluding the epilogue, which is much less statistically significant) also seem to provide some meaningful insights into Tolstoy’s text. In the second volume (Fig. 4) Anatole is detached from Natasha (following their unsuccessful elopement and Anatole’s subsequent flight from Russia), and all the male characters are generally much more ‘agentive’ than the females, possibly due to the duels, gambling, and other supposedly ‘manly’ affairs. Natasha, who is undergoing her first serious moral crisis and is no longer the lively and joyful kid she used to be (see

i20 Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017


Text mining War and Peace

Table 1 The distribution of semantic roles in predicate argument structure associated with the fourteen prominent characters of the first volume of War and Peace (books 1–3 of the English translation)

Character            Agent   Object   Experiencer   Addressee   Possessor
Natasha              0.61    0.14     0.11          0.04        0.09
Anatole              0.59    0.17     0.13          0.04        0.07
Anna Scherer         0.59    0.10     0.16          0.04        0.11
Nikolai Bolkonsky    0.63    0.13     0.14          0.03        0.07
Lise                 0.50    0.19     0.16          0.05        0.09
Mariya (Marya)       0.44    0.18     0.26          0.05        0.08
Anna Drubetskaya     0.72    0.13     0.09          0.01        0.05
Boris Drubetskoy     0.48    0.24     0.17          0.05        0.07
Kutuzov              0.58    0.13     0.13          0.04        0.11
Vasili Kuragin       0.58    0.15     0.14          0.03        0.09
Nikolai              0.52    0.16     0.20          0.03        0.09
Pierre               0.58    0.15     0.15          0.04        0.07
Andrey               0.56    0.15     0.18          0.05        0.07

We then used PCA to visualize and analyse the data. As may be seen from Fig. 3, the PCA diagram of the very first volume already can be very well interpreted in terms of the main actors and ‘undergoers’.
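
The projection step can be illustrated with a minimal sketch. It assumes plain PCA computed by singular value decomposition of the mean-centred role proportions, using only the rows that appear in Table 1 as reproduced here; the article does not state which software, scaling, or preprocessing the authors used, so the function name `pca` and every choice below are illustrative assumptions rather than the original procedure.

```python
import numpy as np

ROLES = ["Agent", "Object", "Experiencer", "Addressee", "Possessor"]

# Role proportions copied from Table 1 (first volume of the novel).
TABLE_1 = {
    "Natasha":           [0.61, 0.14, 0.11, 0.04, 0.09],
    "Anatole":           [0.59, 0.17, 0.13, 0.04, 0.07],
    "Anna Scherer":      [0.59, 0.10, 0.16, 0.04, 0.11],
    "Nikolai Bolkonsky": [0.63, 0.13, 0.14, 0.03, 0.07],
    "Lise":              [0.50, 0.19, 0.16, 0.05, 0.09],
    "Mariya (Marya)":    [0.44, 0.18, 0.26, 0.05, 0.08],
    "Anna Drubetskaya":  [0.72, 0.13, 0.09, 0.01, 0.05],
    "Boris Drubetskoy":  [0.48, 0.24, 0.17, 0.05, 0.07],
    "Kutuzov":           [0.58, 0.13, 0.13, 0.04, 0.11],
    "Vasili Kuragin":    [0.58, 0.15, 0.14, 0.03, 0.09],
    "Nikolai":           [0.52, 0.16, 0.20, 0.03, 0.09],
    "Pierre":            [0.58, 0.15, 0.15, 0.04, 0.07],
    "Andrey":            [0.56, 0.15, 0.18, 0.05, 0.07],
}


def pca(data, n_components=2):
    """Project the rows of `data` onto its top principal components.

    Assumption: no per-column rescaling, only mean-centring.
    """
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes; s holds the singular values,
    # sorted from largest to smallest.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = centered @ vt[:n_components].T
    return scores, vt[:n_components], explained[:n_components]


names = list(TABLE_1)
X = np.array([TABLE_1[name] for name in names])
scores, axes, explained = pca(X)

# Coordinates of each character on the first two components.
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name:18s} {pc1:+.3f} {pc2:+.3f}")

# Loadings show how each semantic role contributes to each axis.
for role, loading in zip(ROLES, axes.T):
    print(f"{role:12s} {loading[0]:+.3f} {loading[1]:+.3f}")
```

The sign of each component is arbitrary, so a plot produced from these scores may be mirrored relative to the published figures. The loadings give a rough indication of which roles pull a character towards which region of the diagram.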

Fig. 4 Visualization of PCA of the semantic role distribution in the second volume of War and Peace (books 4-8 of the
English translation)


Fig. 5 Visualization of PCA of the semantic role distribution in the third volume of War and Peace (books 9–11 of the
English translation)

Clay, 1998), becomes much closer to Pierre now, which could also be viewed as a reflection of the plot. But so does Helene, his newlywed wife, a rather cynical woman of high society and (supposedly) low moral standards, whose existence at this point forbids any possibility of romance between Pierre and Natasha. Note also that both women, Natasha and Helene, are located near the Addressee zone, since they are the goals of the agentive actions of the three male characters. Here we might point out that this whole section of the novel is to a large extent about the two women, Natasha and Helene, and the men they attract (Anatole, Dolokhov, Boris, Denisov). All these complex interactions can be formalized with the semantic role analysis in a new and potentially insightful way.

In contrast, Natasha’s fiancé Andrey Bolkonsky in this period is rethinking his life after his wife’s death and his wounding at Austerlitz. Andrey is not agentive but is now located very close to his sister Marya in the Experiential zone (Fig. 4).

The third volume is rather distorted by the affairs of war, and the two opposing military commanders, Napoleon and Kutuzov, suddenly come to the forefront and become the two most agentive characters (Fig. 5).

Helene leaves her husband to follow the imperial court to Vilna, where she apparently has yet another affair, now with Boris, who is also the closest to her on the diagram (Fig. 5). Pierre and Natasha remain close to each other, and Andrey keeps leaning towards Experiential positions.

The data for the last books of the novel (volume 4 in the Russian canonical edition) are interesting mainly due to the apparent culmination of Prince Andrey’s story. Severely wounded in the battle of Borodino, he ends up with the Rostov family and eventually dies in Natasha’s care. As he awaits his death


Fig. 6 Visualization of PCA of the semantic role distribution in the fourth volume of War and Peace (books 12–15 of
the English translation)

and rethinks his life, he is unable to act anymore, but experiences a lot of feelings and is being nursed by others, which is reflected in his Objective/Experiential position on the diagram (Fig. 6). Andrey’s sister Marya makes it in time to say her goodbyes, and both women witness Andrey’s last minutes together. This common sorrow changes their formerly hostile relationship, and the two women become quite close to each other, which, apparently, is also reflected through semantic roles.

4 Conclusion

Our study shows that automatic semantic role labelling could be applied to literary research. This technique appears to have some potential to reveal the techniques authors use to construct the complexity of relations along a linear narrative. Semantic roles proved to be quite informative of the characters’ personal traits, providing us with ‘objective’ quantitative confirmation of something that was obvious to a reader or a critic but was not expressed explicitly in the text.

Some of our findings relate directly to certain published critical interpretations of the novel. For instance, Princess Marya’s strong disposition towards the Experiencer role correlates with the claims by some prominent Tolstoy scholars (Eichenbaum, 2009) that this shy and sensitive character was borrowed by Tolstoy directly from the XVIII century sentimentalists.

At the next stage of our research we intend to develop a dedicated information extraction model for literary research based on the system we are working with. This model, already in the making, is to be designed and adjusted specifically to meet the needs of such research and is expected to help us extract much more information about characters, their description by the author and their relations


between each other. The first obvious improvement that could be made is to split the existing semantic roles into a bit more fine-grained and less abstract set that could, for example, distinguish between different types of ‘agentivity’ or ‘experientiality’ (sensitivity) of a character.

We also plan to pay special attention to the instances of direct speech within the novel since Tolstoy himself placed high emphasis on developing ‘character-specific language’. Our goal is to find out whether each character actually possesses his personal style of speech. Then we hope to broaden the scope of the study and include other authors to allow comparison between their writing styles and techniques.

Acknowledgement

This work was supported by grant 15-06-99523 from the Russian Foundation for Basic Research.

References

Anisimovich K., Druzhkin K., Minlos F., Petrova M., Selegey V., and Zuev K. (2012). Syntactic and Semantic Parser Based on ABBYY Compreno Linguistic Technologies. Computational Linguistics and Intellectual Technologies. In Proceedings of the International Conference Dialogue, Bekasovo: Russian State University for the Humanities Publishing House, pp. 90–103.

Bogdanov A., Dzhumaev S., Skorinkin D., and Starostin A. (2014). Anaphora Analysis Based on ABBYY Compreno Linguistic Technologies. Computational Linguistics and Intellectual Technologies. In Proceedings of the International Conference Dialogue, Bekasovo: Russian State University for the Humanities Publishing House, pp. 89–102.

Bonch-Osmolovskaya A. (2016). Digital edition of Leo Tolstoy works: contributing to advances in Russian literary scholarship. Journal of Siberian Federal University, Humanities and Social Sciences, 9(7): 1605–14.

Clay G. R. (1998). Tolstoy’s Phoenix. From Method to Meaning in War and Peace. Evanston, Illinois: Northwestern University Press.

Eichenbaum B. (2009). Works on Leo Tolstoy. Saint-Petersburg: SPBSU Faculty of Philology and Arts.

Elson, D., Dames, N., and McKeown, K. (2010). Extracting Social Networks from Literary Fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Uppsala University, pp. 138–47.

Fillmore C. J. (1968). The Case for Case. Universals in Linguistic Theory, edited by Emmon Bach and Robert T. Harms, Holt, Rinehart and Winston, New York: Academic Press, pp. 1–88.

Kokkinakis D. and Malm, M. (2011). Character Profiling in 19th Century Fiction. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. Hissar, Bulgaria, pp. 70–77.

Manicheva, E., Petrova M., Kozlova E., and Popova, T. (2012). The Compreno Semantic Model as Integral Framework for Multilingual Lexical Database. In Zock, M. and Rapp, R. (eds), Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon (CogALex-III), COLING 2012, Mumbai: Curran Associates, Inc., pp. 215–29.

Petrova, M. (2013). The compreno semantic model: the universality problem. International Journal of Lexicography, 27: 105–29. 10.1093/ijl/ect038

Selegey V. (2012). On Automated Semantic and Syntactic Annotation of Texts for Lexicographic Purposes. In: Ruth Vatvedt Fjeld and Julie Matilde Torjusen. Proceedings of the 15th EURALEX International Congress. 7–11 August 2012, Oslo, Department of Linguistics and Scandinavian Studies, University of Oslo.

Shklovsky, V. (1925). Art as a technique. In Shklovsky, V. (ed.), Theory of Prose. Moscow: Krug, pp. 7–20.

Notes

1 Note also that example (5) clearly demonstrates the importance of correct anaphora resolution for the tasks of in-depth text research.
2 OWL Web Ontology Language Overview (http://www.w3.org/TR/2004/REC-owl-features-20040210), accessed 30 October 2015.
3 Resource Description Framework, http://www.w3.org/RDF, accessed 30 October 2015.
4 The Russian canonical edition of War and Peace and its English translation differ in their structure. While the English translation follows the first edition and consists of fifteen books, the Russian canonical text is divided into four volumes, cf. the Wikipedia reference (https://en.wikipedia.org/wiki/War_and_Peace).
