
Topic Modeling in the Voynich Manuscript∗

arXiv:2107.02858v1 [cs.CL] 6 Jul 2021

Rachel Sterneck (Department of Computer Science), Annie Polish, Claire Bowern (Department of Linguistics)
Yale University, New Haven, CT 06520
[email protected]  [email protected]  [email protected]

Abstract

This article presents the results of investigations using topic modeling of the Voynich Manuscript (Beinecke MS 408). Topic modeling is a set of computational methods which are used to identify clusters of subjects within text. We use latent dirichlet allocation, latent semantic analysis, and nonnegative matrix factorization to cluster Voynich pages into 'topics'. We then compare the topics derived from the computational models to clusters derived from the Voynich illustrations and from paleographic analysis. We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations), providing further evidence that the Voynich Manuscript contains meaningful text.

∗ Thanks to Luke Lindemann, Lisa F. Davis, and members of the Yale class "Linguistics of the Voynich Manuscript" in 2019. This paper began as a project with 5 members of the class, including RS and AP. This article was written by RS and CB, using analyses mostly by RS, with contributions by AP.

1 Introduction

The Voynich Manuscript (Beinecke Library MS 408)[1] has puzzled linguists, historians, and conspiracy theorists alike for its unrecognizable text and varied illustrations. For a medieval manuscript, it is quite surprising that no one has been able to identify its language of origin or break its cipher; this has prompted many to believe that the manuscript is a hoax [e.g. Rugg, 2004, Barlow, 1986, Timm and Schinner, 2020].

However, statistical approaches offer a novel way of analyzing the Voynich black box. Statistical methods offer tools that capture relevant features of the text without understanding its meaning and, more importantly, allow a certain degree of flexibility with the accuracy of the transcription itself.

[1] High resolution images of all pages of the manuscript are available from https://brbl-dl.library.yale.edu/vufind/Record/3519597. The text is available in machine-readable format from voynich.nu. Our scripts and data are available from https://github.com/rachelsterneck/voynich-topic-modeling.

Amancio et al. [2013] conducted an extensive investigation into the statistical properties of unknown texts, using the Voynich Manuscript as a case study. They analyzed the word frequencies within the text, concluding that Voynichese is compatible with natural languages. Reddy and Knight [2011] ran a variety of statistical and linguistic tests, and found page-level topics, as well as word and length frequency distributions, that "conform to natural language-like text." Though Rugg [2004], Rugg and Taylor [2016], and Daruka [2020] suggest ways in which hoax text could be generated, it is difficult to reconcile the character-level methods they suggest with the document-level structure that Reddy and Knight [2011] and Amancio et al. [2013] recover. That is, while Voynichese appears very un-languagelike at the word level, the Voynich manuscript — at the level of paragraph and page — has much in common with natural language texts.

The current paper extends research on word frequencies, focusing on methods for topic modeling, and investigates how computer-identified topics relate to clusters of pages and paragraphs within the manuscript which have been identified on other bases, including illustrations, scribal attribution [e.g. Davis, 2020], and the "languages" identified by Prescott Currier [e.g. Currier, 1976]. Here we find evidence for matches between topics and visually identified sections. We also find some support for different topics and different scribes. Finally, we show that scribal hands and subject matter appear to jointly determine topic membership, strongly implying that several linguistic features contribute to the identification of topics.
2 Background

2.1 Computational models and topic analysis

This section provides background on the methods used. Note that in the discussion that follows, a 'document' is simply a unit of text. We test several units of text of different sizes, including folios, pages, paragraphs, and 40-word chunks (discussed further below).

2.1.1 Document Vectorization

Prior to analyzing the statistical properties of a given corpus, it is necessary to vectorize the documents, i.e. represent the text numerically. In this work, we consider various topic modeling algorithms, which involve two methods for document vectorization: bag of words (BoW) and term frequency-inverse document frequency (tf-idf). Both BoW and tf-idf are count-based document representations that ignore word order; however, tf-idf weights the importance of each word in a document relative to other words in the same document and the entire document collection, whereas BoW only considers raw word counts. In both approaches, a given corpus is represented as an N × V matrix, where N is the total number of documents and V is the number of words in the vocabulary.

The tf-idf value for a given word is computed as the product of the term frequency and the inverse document frequency:

    w_{t,d} = count(t, d) × log10(N / df_t)    (1)

Here, count(t, d) is the term frequency, or the number of occurrences of term t in document d, which measures the importance of t to the document [Luhn, 1957]. The inverse document frequency, N/df_t, is the quotient of the total number of documents in a corpus, N, and the number of documents in which t appears, df_t. The intuition of the inverse document frequency is that rare terms, i.e. words that appear infrequently in the entire corpus, are important to the document(s) that contain those words [Jones, 1972]. Since there are often a large number of documents in a corpus, a logarithm is applied to reduce the scale of tf-idf values; in other variations of tf-idf, a logarithm is also used to moderate the term frequency.
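To make the weighting concrete, here is a minimal Python sketch of Equation 1. This is illustrative only, not the authors' code (their scripts are linked in footnote 1), and the EVA-like word tokens are invented for the example:

    import math
    from collections import Counter

    def tfidf(documents):
        # w_{t,d} = count(t, d) * log10(N / df_t), per Equation 1
        N = len(documents)
        df = Counter()                      # df_t: documents containing term t
        for doc in documents:
            df.update(set(doc))
        weights = []
        for doc in documents:
            counts = Counter(doc)           # count(t, d): raw term frequency
            weights.append({t: c * math.log10(N / df[t])
                            for t, c in counts.items()})
        return weights

    # toy usage with invented EVA-like word tokens
    docs = [["daiin", "chedy", "daiin"], ["chedy", "qokeey"], ["daiin", "okaiin"]]
    print(tfidf(docs)[0])   # e.g. {'daiin': 0.352..., 'chedy': 0.176...}

Note that under this formula a term appearing in every document receives weight 0, which is one reason other variants of tf-idf smooth or rescale the idf term.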
2.1.2 Topic Modeling

Topic modeling is a technique for identifying relatedness between documents that share sets of words belonging to a particular semantic domain. Topic modeling techniques apply unsupervised learning algorithms to discover 'topics', or themes, within documents, and consequently cluster documents based on their content [Jurafsky and Martin, 2009]. Although we cannot read the Voynich text, topic modeling is still applicable if we assume that Voynich words have a consistent form–meaning correspondence across the manuscript. That is, we need to assume that 8ain on f1r is the SAME word as 8ain on f7v. We do not need to know what any of the words mean, but we do need to assume that there is some consistency of representation. Note, however, that if there is no consistency of representation, we are unlikely to find clear topic structure in the manuscript.

In this work, we divide the Voynich Manuscript into chunks of text — documents — and apply the following topic modeling algorithms: latent dirichlet allocation (LDA), latent semantic analysis (LSA), and nonnegative matrix factorization (NMF). For LDA, we use the BoW approach to vectorize the documents, and for LSA and NMF, we represent the documents using tf-idf.

LDA is a generative probabilistic model that explains sets of observations with latent, or unobservable, groups [Blei et al., 2003]. It assumes that documents are produced with a fixed number of words and a mixture of topics that have a Dirichlet distribution over a fixed set of k topics to discover. Each word in the document is generated by picking a topic according to the distribution sampled, and then using that topic to generate a word, according to the topic's distribution. LDA learns the topic representation of each document, as well as the words associated with each topic. In order to do this, LDA distributes k topics across each word w in document m, and for each w in m assumes the topic assignment is correct for every word except the current w. Finally, it probabilistically reassigns w a new topic based on the proportion of topics within a document and the proportion of words within a topic. This process is repeated many times, and the model eventually converges, achieving feasible topic mixtures within documents. It is not necessary to weight words using tf-idf for LDA because it is a generative model that estimates probability distributions; thus we use BoW instead.
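As a hedged illustration of this pipeline (the paper does not name its implementation; scikit-learn is our assumption, and the `pages` variable is a hypothetical stand-in for the tokenizer output described in Section 3):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # one whitespace-joined string of EVA words per page (invented examples)
    pages = ["daiin chedy daiin okaiin", "chedy qokeey chedy", "daiin okaiin shedy"]

    bow = CountVectorizer(token_pattern=r"\S+")   # BoW counts; keep EVA words intact
    X = bow.fit_transform(pages)                  # N x V count matrix

    lda = LatentDirichletAllocation(n_components=6, random_state=0)  # k = 6 topics
    doc_topics = lda.fit_transform(X)             # per-document topic proportions
    labels = doc_topics.argmax(axis=1)            # hard topic assignment per page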
LSA is another method that takes advantage of implicit higher-order structure in the association of terms with documents to discover underlying concepts within documents [Deerwester et al., 1990]. LSA utilizes Singular Value Decomposition (SVD) to reconstruct the matrix such that the strongest relationships are preserved and noise is minimized. SVD holds that any matrix A can be factored as

    A = U S V^T,    (2)

where U and V are orthogonal matrices with eigenvectors from AA^T and A^T A, respectively. This process decomposes the matrix into a document-topic matrix and a topic-term matrix, which contain the relative importance of each factor and allow us to measure document similarity.

NMF is used to analyze high-dimensional data and automatically extract sparse and meaningful features from nonnegative vectors [Lee and Seung, 1999]. NMF decomposes the feature matrix A into two lower-dimensional matrices, W and H, then iteratively modifies their initial values under an objective function (optimized, e.g., with the EM algorithm) such that their product approaches the original matrix. NMF is similar to LSA and SVD; however, NMF imposes the restriction that the factored matrices, W and H, are nonnegative. As a result, NMF better represents the original feature matrix.
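Under the same assumptions (scikit-learn as the implementation, tf-idf vectors as input), LSA and NMF can be sketched side by side. Note that scikit-learn's default tf-idf uses a smoothed natural-log idf rather than the log10 form of Equation 1:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF, TruncatedSVD

    pages = ["daiin chedy daiin okaiin", "chedy qokeey chedy", "daiin okaiin shedy"]
    X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(pages)

    # LSA: truncated SVD of the tf-idf matrix (A = U S V^T, Equation 2)
    lsa = TruncatedSVD(n_components=2, random_state=0)
    lsa_docs = lsa.fit_transform(X)        # document-topic coordinates

    # NMF: A ~ W H, with W and H constrained to be nonnegative
    nmf = NMF(n_components=2, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)               # document-topic weights
    H = nmf.components_                    # topic-term weights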
2.1.3 Visualization

After applying NMF or LSA to the tf-idf document-term matrix, we get a denser and smaller matrix, but still one that is too large to be plotted usefully in two dimensions. We would like to produce a plot that shows folios of the Voynich Manuscript on a 2-dimensional plane, with the distance between points corresponding to the 'semantic distance' between the folios that were the output of the algorithms discussed above. This is a general problem in machine learning, as many datasets are multidimensional. Dimension reduction is a difficult problem, and there are an infinite number of ways to project a high-dimensional object into 2D space. Moreover, the choice of dimension-reduction algorithm is important, as the algorithms are biased towards different projections, and so the choice of algorithm can strongly influence the result, especially in cases (such as this) where the dataset is very small.

We consider three different methods for projecting the data:

PCA: Principal Component Analysis is one of the oldest and best understood dimension reduction algorithms. The first principal component is the axis along which there is the highest variance. The second axis also captures the highest remaining variance, with the constraint that it must be perpendicular to the first axis. This continues until the required (user-specified) number of axes is reached.

t-SNE: T-Distributed Stochastic Neighbor Embedding is a relatively recent (2008) technique; it examines the local neighborhood around every point, in addition to the properties of the distribution as a whole [Van der Maaten and Hinton, 2008]. This facilitates the detection of small clusters in a large distribution, which might not be picked up by PCA. However, it is more opaque than PCA. t-SNE has been used on vectorizations of the Voynich Manuscript with some success in the past [Bunn, 2017, Perone, 2016]. We found that t-SNE was sensitive to the choice of hyperparameters.

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction is a newer algorithm [McInnes et al., 2018], which preserves more global structure than t-SNE.
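A sketch of all three projections applied to a document-topic matrix (UMAP via the third-party umap-learn package; the random matrix and the perplexity value are placeholders, the latter echoing the hyperparameter sensitivity noted above):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    import umap   # pip install umap-learn

    rng = np.random.default_rng(0)
    doc_topics = rng.random((100, 6))      # stand-in for an N x k topic matrix

    xy_pca = PCA(n_components=2).fit_transform(doc_topics)
    xy_tsne = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(doc_topics)
    xy_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(doc_topics)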

2.2 Voynich Manuscript background

The basic facts of the Voynich manuscript are by now clearly summarized in other publications. See, for example, Reddy and Knight [2011], Amancio et al. [2013], Bax [2014], Harkness [2016] and Bowern and Lindemann [2020] for a recent overview (book-length treatments can be found in Kennedy and Churchill 2004 and D'Imperio 1978). Here we focus on three aspects of the manuscript most relevant for document clustering: the "Languages" of the text, the number of scribal hands, and subject matter as inferred from the manuscript's illustrations.

2.2.1 Currier Languages

Currier [1976] observed differences in character and substring frequencies across different sections of the text. Consequently, he split the text into "languages," or dialects, A and B, and analyzed the two languages separately.[2] From there, he discovered that symbol groups appearing very frequently in one language may be almost non-existent in the other. From a statistical perspective, this discovery has complicated our understanding of the Voynich Manuscript, because topic modeling relies on word frequencies and expects consistency across texts; for this reason, we consider the topic distributions in conjunction with Currier languages. Other Voynich investigations have tended to focus on a single part of the manuscript. For example, Reddy and Knight [2011] use only data from the B language for their investigations. Note that some discussion of Currier languages and further topics can be found in Lindemann and Bowern [2020].

[2] The original Currier designation included several pages that could not be assigned to either of the main hands. Note also that while Currier discussed multiple languages, discussion has tended to focus on just two (called "A" and "B" for convenience); we follow that tradition here.

2.2.2 Manuscript illustrations

On the basis of the illustrations which accompany the text, it is customary to divide the manuscript into five sections:

1. botanical/herbal
2. astrological/astronomical
3. balneological
4. pharmaceutical
5. starred paragraphs/"recipes"

Here, we use a larger set of divisions, also including the large foldout 9-sectioned rosette diagram as its own 'subject', and an 'unknown' section for the few pages which comprise only text and cannot be aligned with other subjects on the basis of illustrations.

2.2.3 Lisa Fagin Davis's Five Scribe Theory

Davis [2020] proposes that the manuscript was written by five scribes. Her approach involves identifying paleographic features (that is, glyph shapes) that distinguish scribes from each other. These hand attributions likewise complicate our understanding of Voynichese as a language, prompting us to consider whether the text contains different linguistic dialects, or whether written variations of the same word may be transcribed as different words. Alternatively, perhaps different scribes were responsible for different sections (or subjects), which would probably be detectable with topic modeling based on word frequency. Thus, we apply statistical analysis assuming Davis's identification of hands in the manuscript.

Figure 1, reproduced here from Lindemann and Bowern [2020], illustrates how the scribes, subjects, and Currier languages align throughout the manuscript. As can be seen, there is some overlap in assignment to languages and sections: for example, while the botanical section has folios in both A and B languages, the balneological and stars sections are in language B alone. Language and hand also align well, though there are more 'hands' than 'languages'.

Figure 1: Alignment of languages, illustrative subjects (here labeled 'section'), and hands in the Voynich Manuscript, reproduced from Lindemann and Bowern [2020]

3 Data and Methods

The Voynich transcript used here is Takahashi's version of the text (as corrected by Zandbergen and Stolfi) in the EVA transcription system. The Takahashi transcription is complete and well-regarded, but does require some processing before it can be used. We used a 'tokenizer' script to remove unnecessary annotations and characters from the transcription. The Takahashi transcription marks the location of breaks in the text for illustrations, as well as some other information about the location of the text on the page. The tokenizer removes all of this, separates each word out as a single string, and groups the words by page. The vectorization and clustering algorithms receive this and only this as input. No information is conveyed to our analysis tools about the formatting of the manuscript or the illustrations on a page.

The text was vectorized into word count matrices representing the text. Additionally, Davis's 5 scribal attributions were used to construct a dataset of possible internal sections, along with language classification (language A, B), visual subject label (botanical, astrology, balneological, rosette, recipes, starred paragraphs, and unknown), and quire number (1-18). For each topic modeling analysis, the topic classifications generated by the algorithm were appended and compared.[3]

[3] Folio f57v was excluded from the analyses because its Currier language is unknown.
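The actual tokenizer is in the repository linked in footnote 1. Purely as an illustration of the kind of cleaning involved, a stand-in might strip locus tags and inline annotations and group words by page; the tag and separator conventions assumed here are a simplification of the EVA transcription format:

    import re

    def tokenize_eva(lines):
        # Illustrative only: assumes '<f..>' locus tags, '{...}' inline
        # annotations, and '.' as the word separator.
        pages = {}
        for line in lines:
            m = re.match(r"<(f\d+[rv]\d*)", line)           # page id, e.g. f1r
            if not m:
                continue
            text = re.sub(r"<[^>]*>|\{[^}]*\}", " ", line)  # drop tags/annotations
            words = [w for w in re.split(r"[.\s]+", text) if w]
            pages.setdefault(m.group(1), []).extend(words)
        return pages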

3.1 Vectorization

Given that there is no translation of Voynich text, it is impossible to determine which words are most meaningful, or which share similar meaning, by reading the manuscript. However, we can work out which words are most characteristic of which parts of the text. The first step of our process is to vectorize the document. This converts each word and each document into a vector in a very high-dimensional space. Vectorization allows us to compute information about words, documents, and relationships between them, even with no understanding of the words or documents themselves. Once the folios and words have been vectorized, it begins to make sense to talk about how 'close' two folios are to each other. If the Voynich Manuscript were gibberish, we would expect the 'distances' between folios to be essentially random. However, if the VMS contains natural language with genuine meaning, we would expect to see that two pages that seem to cover the same topic (e.g. two herbal pages) would be closer to each other than to pages that seem to be about other topics. Note that there are several types of similarity which might be detected by this method: meaning, type of encipherment, or perhaps a combination of both.

3.2 Clustering

With these vectors in hand, the next step is to systematically locate clusters of vectors that are close to each other. In general, clustering aims to find structure in a set of vectors by labeling nearby vectors as a single object. We tell our algorithm how many clusters to make (that is, the value of k), but not what those clusters are. The clustering algorithm then determines where to place the clusters, and from those locations we can determine the nearest cluster to each folio, categorizing the folios into 'topic' clusters. Table 1 summarizes the inputs and outputs; one concrete clustering algorithm is sketched after the table.

    Clustering Inputs          | Clustering Outputs
    ---------------------------|----------------------------------------
    Vectorized document corpus | Cluster 'locations'
    Number of clusters         | 'Distances' from each vector to each cluster

Table 1: Clustering inputs and outputs
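The paper does not tie this section to one named algorithm (topic assignments themselves come from the models in Section 2.1.2). As one concrete instance of the choose-k scheme in Table 1, k-means over the vectorized folios would look like this (the matrix shape is a placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((225, 500))  # stand-in for N x V folio vectors

    kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
    labels = kmeans.labels_           # nearest cluster per folio (Table 1 outputs)
    distances = kmeans.transform(X)   # distance from each folio to each cluster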

3.3 Dimension reduction and visualization

At this point in our process, we have vector forms of Voynichese words and VMS folios, as well as a 'cluster' label for each folio. However, these vectors are in a high-dimensional space, and are thus impossible to directly plot or visualize. We would like to produce a 2D plot depicting each folio as a point in our vector space. The choice of which two dimensions to use is not trivial: even simple objects can look different if they are rotated, and our distribution has far more potential rotation axes than a real 3D object. In addition, simply selecting two dimensions of the current coordinate system is unlikely to be sufficient, as it would lose all of the information in the many other dimensions. Instead, a more complicated projection is needed, one that can collapse the data to two dimensions while maintaining characteristic overall structures.

Fortunately, this is a common problem in machine learning, because high-dimensional vector spaces are a generally useful way to work with a dataset, and one often wants to plot multidimensional data in two dimensions eventually. For this project, we made use of three established algorithms for dimension reduction: PCA, t-SNE, and UMAP.

As discussed above, three topic modeling algorithms were tested: latent dirichlet allocation (LDA), latent semantic analysis (LSA), and nonnegative matrix factorization (NMF). LDA uses raw word frequencies, whereas LSA and NMF use tf-idf weighted counts. Finally, we used multiple correspondence analysis (MCA) to compare the topic modeling results to qualitative features of the text, including hand attribution, Currier language, and illustrative topic classification.

3.3.1 MCA

For each of these algorithms, we used MCA (Multiple Correspondence Analysis), an extension of correspondence analysis used to analyze the relationship of categorical dependent variables [Abdi and Valentin, 2007]. By representing data points in a 2-dimensional space, MCA visualizes the relationship between the topics generated by the topic modeling algorithms and qualitative measures of hand, Currier language, and illustrative topic. MCA is a dimension reduction technique that uses an indicator matrix, which contains rows of individual data points, columns representing variable categories, and entries of 0 or 1. Relationships between variables are discovered by computing the chi-square distance between categories of variables and individual data points. MCA discovers the underlying dimensions that best explain differences in the data, which allows the data to be represented in a reduced space.
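As a sketch of the mechanics (a from-scratch construction, not the library the authors used, which the paper does not name): build the 0/1 indicator matrix, then take the SVD of its standardized residuals and read off the row coordinates.

    import numpy as np
    import pandas as pd

    def mca_row_coordinates(df, n_components=2):
        # correspondence analysis of the indicator matrix of categorical columns
        Z = pd.get_dummies(df).to_numpy(float)    # 0/1 indicator matrix
        P = Z / Z.sum()                           # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)       # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
        U, s, _ = np.linalg.svd(S, full_matrices=False)
        return U[:, :n_components] * s[:n_components] / np.sqrt(r)[:, None]

    # hypothetical usage: one row per folio, one categorical column per variable
    folios = pd.DataFrame({"hand": ["1", "1", "2", "3"],
                           "language": ["A", "A", "B", "B"],
                           "topic": ["t1", "t1", "t4", "t2"]})
    print(mca_row_coordinates(folios))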
4 Results

We report on all results, informative and uninformative. Informative results are summarized in the conclusions, but we felt it was important to also show the approaches that failed, and to discuss why.

4.1 Analysis 1: Latent Dirichlet Allocation

The results of LDA topic clustering are shown in Figure 2.

Figure 2: LDA 6 topic distribution

The LDA model created one dominant topic, with the majority of pages belonging to topic 2. The distribution of scribes, languages, visual topics, and LDA topics (visualized with MCA) is shown in Figure 3. The MCA results to visualize the LDA clusters are not informative, as the LDA topics don't seem to cluster in any significant way. However, it is interesting to note that there is a clear distinction between Language A and B in the results. The top words per topic for LDA are shown in Appendix A.

While it may be possible that the manuscript is mostly about one topic, there are certain aspects of LDA that may explain these results and render LDA unfeasible for these analyses. LDA expects a good understanding of the text data, as it groups co-occurring words together. When performed with optimal parameter settings, the topics generated by LDA are close to human understanding. If topics share keywords, smaller topics can be absorbed into a major one, which usually indicates a suboptimal parameter setting [Ma, 2016]. This suggests several possible reasons why the LDA analysis is not accurate: perhaps there are insufficiently clear word clusters (that is, the topics are not sufficiently lexically differentiated to show up by this method), or, conversely, spelling variation obscures lexical identity. Note, however, that Reddy and Knight [2011, 83] were able to find clusters (but they provided no information about their approach beyond the use of tf-idf). It is also possible that the Voynich manuscript isn't human language, or that the Takahashi tokenizer isn't sufficiently accurate. It is typically encouraged to understand the structure of data using other methods before using LDA, which is why we use NMF in the following analyses (Analyses 4–6).
Figure 3: MCA for LDA topics, illustrative topics, languages, and hands

4.2 Analysis 2: LSA

The results of LSA topic clustering are shown in Figure 5, and the distribution of scribes, languages, visual topics, and LSA topics from MCA is shown in Figure 4.

The clusters for LSA are more distinct than those of LDA. For both LSA and LDA, hand 4 and the astrology visual topic are quite close to each other. This is expected, since Hand 4 is responsible for this section of the MS (this is a check that sensible results are returned). Again, we see a distinct contrast between Language A and B, though it is perhaps not as strong. In this LSA analysis, Language A and hand 1 are overlapping in the figure, with LSA topic 1 very close by. Additionally, LSA topic 0 and the starred paragraphs illustrative topic are almost identical, with the balneological illustrative topic very close by.[4]

[4] It's worth noting that LSA assumes a Gaussian distribution; however, words in documents may follow a Poisson distribution.

Additionally, LSA does not perform well on texts with large amounts of polysemy or homophony, since it assumes words only have one concept. We do not know if this is relevant to the Voynich Manuscript, but if the cipher results in the conflation of phonemic distinctions (implied by H2 entropy, per Lindemann and Bowern 2020), it is quite possible that there are more Voynich homophones than in the underlying language. While future research into LSA may prove useful, we used NMF as our primary topic modeling algorithm for the remaining analyses, given the issues identified above. The top words per topic for LSA are shown in Appendix B.

4.3 Analysis 3: NMF

The results of NMF topic clustering are shown in Figure 6.
Figure 4: MCA for LSA topics, illustrative topics, languages, and hands

There is a clear divide between the clusters representing Language A and B along the horizontal axis. This distinction is much stronger than those found in the analyses using either LDA or LSA.

The distribution of scribes, languages, visual topics, and NMF topics is shown in Figure 7. The MCA for NMF shows closer relationships between NMF topics and the illustrative sections. As with both LSA and LDA, astrology is clustered with hand 4. However, NMF differs from the two previous analyses because astrology and hand 4 are also shown to be related together with NMF topic 5. That is, we get an overlap between the hand, the manuscript section (as indicated by the illustrations), and the topic identified through NMF. Note that the computational analysis has information about neither scribe nor illustration, so aligning all three is unlikely to occur by chance.

The illustrative topics of recipes and botanicals are distinct from the other topics, and both are quite close to language A, hand 1, NMF topic 1, and NMF topic 3. This may suggest that recipes and botanicals are distinct topics with overlapping content. The same could be said for the illustrative sections of the rosette diagram, the balneological material, and the starred paragraphs. The balneological section is very closely related to NMF topic 0, and the rosette and starred paragraphs are both quite close to NMF topics 2 and 4. Taking a closer look at the hands, we see that hand 2 and NMF topic 4 are extremely close, and hand 3 and the illustrative rosette topic are almost identical. The top words per topic for NMF are shown in Appendix C.

It is interesting to observe that the NMF topics have close relationships with both illustrative sections and hands; this supports the notion that NMF topic does not directly correspond to either hand or illustrative topic — rather, features of both may be reflected in NMF topics. What these features are, precisely, is not detectable through these methods, but they must relate to aspects of word use.
That could be spelling, encipherment patterns, or word choice, for example. We do not speculate further at this point on why hand shows up as a factor that is distinct from illustrative section.

Figure 5: LSA 6 topic distribution

Figure 6: NMF 6 topic distribution

The results of NMF clustering showed closer relationships with hands, illustrative topics, and Currier languages than those of LSA and LDA. Unlike LDA, NMF performs better in initial data explorations because it is better at handling noise [Ma, 2016]. In the case of the Voynich Manuscript, this noise may be introduced by the Takahashi transcription; there is no way, of course, to verify that the transcription accurately divides the text into linguistic units, such as words. Indeed, the Voynich dataset is "noisy" on several dimensions: in the digitization of the transcription and in the small amount of data, to name just two factors. LDA performs better when the data reflects "semantic units", whereas NMF performs better with unstructured data than the other methods do. For these reasons, we consider NMF to be the most reliable algorithm for topic modeling the Voynich Manuscript, and we continue to use NMF for Analyses 4–7, as further detailed below.

4.4 Analysis 4: Fixed document lengths with random word selection

In the previous analyses, the unit of analysis was the 'page'. However, pages have very different amounts of text. Consider the amount of text on a star chart, compared to one of the pages of starred paragraphs, for example. It is possible that the length of the page may be distorting similarity. In order to eliminate bias introduced by varying document lengths, analyses 4a and 4b use subsets of the Voynich data, randomly selecting 40 and 20 words, respectively, from each page (a sketch of this sampling step is given below). Accordingly, these analyses exclude the astrological pages (f67-f73), as their text is mostly labels, and pages with fewer than 50 words,[5] each of which belongs to the botanical visual topic. An alternative analysis would be to take the first 40 words of each page, or to split the pages into sub-documents (this, however, would over-represent pages with more words).

[5] These are f5v, f11v, f25r, f38r, f65r, f65v, f90r2.

4.4.1 Analysis 4a: Document length of 40 words

As shown in Figure 8, documents of 40 random words still maintain strong clusters. Furthermore, Figure 9 shows clear distinctions between language A and B.

The MCA reflects a close relationship between NMF topic 1 and the balneological illustrative topic. The starred paragraphs are close to NMF topic 1 and the visual balneological topic, but also close to NMF topic 6.[6] Again, this suggests that the starred paragraphs and the balneological section may have overlapping content as distinct topics.

[6] Remember that topic numbers are arbitrary.
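The sampling step referenced above might look like the following (sampling without replacement is our reading of "randomly selecting"; the page dictionary matches the tokenizer output sketched in Section 3):

    import random

    def sample_fixed_length(pages, n_words=40, min_words=50, seed=0):
        # keep only pages with at least min_words words, then draw n_words
        # of them at random (without replacement) as the new 'document'
        rng = random.Random(seed)
        return {page: rng.sample(words, n_words)
                for page, words in pages.items()
                if len(words) >= min_words}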

Figure 7: MCA for NMF topics, illustrative topics, languages, and hands

One striking difference between the page-level NMF analysis and this analysis with a 40-word cap is that the illustrative botanical and recipe sections are distinct both from each other and from every other section (as defined by illustrations). It is worth noting again that the folios removed for having fewer than 50 words were all botanical; this perhaps suggests those seven folios provide some link, in the form of overlapping words, between recipes and botanicals. Three of the pages (f65r, f65v, and f90r2) are botanical pages which occur in the manuscript adjacent to or among the recipes. f65r and f65v are immediately preceded by missing pages, and f90r2 is one of a set of herbal illustrations within the recipes section.

Additionally, the recipes section is not very closely related to NMF topic 3; however, the two points are much closer to each other than to any other point in the figure. NMF topics 0 and 4 are almost identical, with the botanical section close by. It is interesting to see NMF topics 0 and 4 nearly collapse into one topic; this may suggest that they vary on some dimensions or features not represented by the figure, or that they are so similar that they may be considered one topic.

4.4.2 Analysis 4b: Document length of 20 words

As shown in Figure 10, reducing the document window to 20 words still allows for topic clustering. However, the clusters are not as distinct from each other as they are with 40 words or with the full text. Likewise, Figure 11 shows that the language A and B distinction is not revealed as strongly. One stark difference from 40 words to 20 words is that the "unknown" visual topic becomes its own cluster, distinct from any other feature, though closer to the recipes section than to the balneological, starred, and rosette sections. It is also interesting to see that the starred paragraphs fall somewhere between the rosette and balneological visual topics,
whereas they were much closer to the balneological visual topic in the analyses with the 40-word window, and much closer to the rosette in the full-text NMF analysis.

Figure 8: NMF topic distribution with 40 word documents

It is also intriguing to see that the structure of the overall cluster of a 20-word NMF analysis is quite similar to that of the full-text (page-level) NMF analysis. This suggests that the words in each of the topics are distinct enough from each other that only a small portion of randomly selected words from the text is enough to distinguish documents from each other. Such a result strongly implies that the topics discussed here are not artefacts of analysis on a manuscript of "gibberish", and that there is, in fact, meaningful content underlying an enciphered text. However, the fact that there is a difference between 20 words and 40 suggests that words in the Voynich manuscript are a combination of distinctive content words and less distinct function words (or content words that do not uniquely define the topic under discussion). This again suggests that the underlying text is not gibberish, or the result of locally random fluctuations in "word" production: if that were the case, and a different type of non-word production was used for each page, we would expect any set of methods that work at the page level to perform roughly equally well.

4.5 Analysis 5: NMF topics vs. scribe hands

While topic modeling provides a "semantic" understanding of the text, it is still difficult to infer what linguistic features are reflected by topics. Comparing the NMF topics to Lisa Fagin Davis's hand attributions may offer more insight into what a topic is. For these analyses, we apply NMF with 5 topics to explore whether there is a direct mapping between 5 NMF topics and the 5 scribes identified in Davis [2020].

As shown in Figure 12, most of the folios fall into topic 1. Likewise, Figure 13 shows that NMF topic 1 and hand 1 are almost identical. Additionally, NMF topic 2 and hand 3 show close similarity. Hand 2 is somewhat close to NMF topics 0 and 4.

It is also interesting to note the spread of the clusters: there seems to be one larger cluster on the left side of the graph, one larger cluster in the top right corner, and one larger cluster in the bottom right corner. This division into three larger clusters may have implications for semantic content or scribal similarity.

Overall, the results suggest that "topic" as defined by NMF is not quite synonymous with hands. We cannot create a complete one-to-one mapping between NMF topics and Voynich scribes; in other words, perhaps not every hand wrote about unique topics.

4.6 Analysis 6: NMF topics vs. Currier languages

Scholars often classify sections of the text as belonging to Currier A or B. Thus, this analysis seeks to understand whether the Currier languages are reflected by a two-topic NMF analysis. As shown by Figure 14, the text has distinctive clusters between the two topics. Interestingly, Figure 15 shows that while the Currier languages may share features with the topics, they are not the same; that is, the two topics identified by tf-idf do not overlap with the languages identified according to Currier. It is also worth noting that languages A and B were identified mostly on the basis of character and substring frequencies, whereas NMF topics are based on word frequencies. Thus, this analysis may be comparing two different units (Currier symbol groups vs. Takahashi words).

4.7 Exploring the relationship between hands, topics, and illustrations

In the previous analyses, we showed that TF-IDF topics do not clearly match either hands or illustrative sections, though there is some overlap in association with both (not surprisingly, since there is an association between scribe and illustrated section). In this section, we further explore the "network" links between the TF-IDF topics
and other divisions in the manuscript. To do so, we use a network as a visualization device, where the nodes in the network are the categories under consideration (hand, illustrated sections, TF-IDF topics, etc.), and the edges are the pages that link the hands to sections or topics; a sketch of this construction is given below.

Figure 9: MCA for NMF topics, illustrative topics, languages, and hands, using a 40 word window

Consider Figure 16. In this network, each node is a hand (yellow) or a subject (green), and the links between them are pages. The assignments are from Davis [2020]. From this, we can see that Hand 4 is associated with the astrology section; Hand 3 mostly wrote the starred paragraphs and botanical pages; Hand 1 wrote the recipes section and also collaborated (with Hands 2 and 3) on the botanical pages, and so forth.

Figure 10: NMF topic distribution with 20 word documents

This same visualization technique can be used to combine illustrated subjects and tf-idf topics. This is shown in Figure 17, with topics in red (as identified by tf-idf NMF 6 topics using full pages) and subjects in green. Just as with the scribes/hands, we see that tf-idf topics do not uniquely correspond to the sections as defined by illustrations.
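A hedged sketch of the graph construction using networkx (the package choice and the sample edge list are ours; a multigraph keeps one parallel edge per page, matching the visualization described above):

    import networkx as nx

    # hypothetical per-page (hand, subject) assignments after Davis [2020]
    page_edges = [("hand-4", "astrology"), ("hand-3", "starred.paragraphs"),
                  ("hand-1", "recipes"), ("hand-1", "botanical"),
                  ("hand-2", "botanical"), ("hand-3", "botanical")]

    G = nx.MultiGraph()           # parallel edges: one per page
    G.add_edges_from(page_edges)
    nx.draw_networkx(G)           # rendering requires matplotlib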
Figure 11: MCA for NMF topics, illustrative topics, languages, and hands, using a 20 word window

The balneological section, for example, is split between topics 4 and 6; the starred paragraphs fall mostly between topics 2 and 6 (but also with one page assigned to topic 4 and three to topic 5).

We can also look at hands and topics, as shown in Figure 18. Here, the patterns are quite clear: Hand 1 wrote topic 1, and collaborated with Hand 4 (and Hand 5) on topic 3; Hand 4 collaborated on topics 3 and 5, with Hands 1 and 2 each contributing one page to topic 5 as well. Hand 2 contributed to topics 2, 4, and 6; they were the majority author of topic 4 (though with input from Hand 3) and topic 6 (again with input from Hand 3), while topic 2 is equally split between Hands 2 and 3.

Figure 12: NMF 5 topic distribution

Finally, we can combine the hands, topics, and subjects, to see if there is further structure and whether we can see anything further about the topic collaboration. This is illustrated in Figure 19. In this figure, the yellow nodes are the hands, and the white nodes are a combination of illustrated subject and tf-idf topic. The figure is somewhat difficult to read. Consider the astrology sections, which are divided between topics 2, 3, and 4 in the tf-idf. Here, astrology-2 and -3 are from scribe 4, while astrology-4 is by hand 3.
Figure 13: MCA for NMF 5 topics and hands

Figure 14: NMF 2 topic distribution

Figure 15: MCA for NMF 2 topics and languages

That is, the parts of the astrology section attributed to tf-idf topic 4 are written in hand 3. Other parts of topic 4 are also written by hand 3: the starred paragraphs in topic 4 and the unknown/unassigned illustrated section.
Figure 16: Manuscript hands and illustrated subjects

Figure 17: tf-idf generated topics matching illustrated subjects

When combining tf-idf topics and illustrated sections, one can see further structure associated with the tf-idf topics and hands. In fact, it is striking how well this combination of illustrations and computationally-derived topics divides the manuscript. Botanical-3 is mostly by Hand 1, except for two pages assigned to Hand 5. Hand 2 is responsible for the balneological topics 4 and 6 and botanical 2, 4, and 5 (that is, the parts of the botanical section in topics 2, 4, and 5). Hands 2 and 3 collaborate on botanical topic 2 (and Hand 3 does two pages of botanical 4). Apart from the pages associated with Hand 5, there are clear associations between hands and tf-idf topics, as well as between tf-idf topics and illustrated sections. For example, Hand 1 writes the recipes (associated with tf-idf topic 3) and is the sole contributor to botanical topic 1; they are also associated with botanical topic 3. Hand 3 is associated with botanical-2 and starred paragraphs-2 (that is, the parts of those illustrated sections which are assigned to the same tf-idf topic). The correspondences are not perfect, but they are considerably better than chance.

5 Further Discussion

Analyses 1–4 consider the relationships between categorical dependent variables, whereas Analysis 5 isolates NMF topic and scribe, and Analysis 6 isolates NMF topic and Currier language. These analyses provide strong evidence that scribe, NMF topic, and Currier language are not the same. This, perhaps, provides a more convincing argument that NMF reflects distinct semantic and linguistic features, potentially also including differences in scribe and/or dialect.

These topic modeling analyses may also offer evidence that some of the texts are written about the illustrations that accompany them. Although the astrological section has fewer words than the other folios and appears to be mostly labels, the NMF full-text analysis (Analysis 3) considers the illustrative astrological section its own topic, and its distance is quite far from the other NMF and illustrative topics. Furthermore, in Analyses 1–3, the astrological section is always clustered next to hand 4, although it is worth noting that both the hand and illustrative topic classifications come from Lisa Fagin Davis's work. That is, the topic is associated with that hand, even though no information about the hand was used in discerning the topic.

It is also interesting to see the close clustering of the illustrative rosette, balneological, and starred paragraphs topics across the analyses. The analyses seem to suggest that these sections belong to a larger common topic, though there are individual differences. This is particularly interesting because the starred paragraphs and parts of the rosette sections offer few visual clues as to what the text may be about, whereas a topic modeling approach — even with a fixed count and random word selection — suggests they may be related to the balneological illustrations and text.
Figure 18: tf-idf generated topics matching scribal hands

Likewise, the increase in distance between the recipes and botanical visual topics after several botanical pages were removed from the analysis seems to suggest that those texts provide some link between the two sections. We may infer that the botanical and recipes sections have distinct contents within a shared larger category.

Based on this combined qualitative and quantitative approach, we consider this strong evidence that the Voynich Manuscript is some form of enciphered human language, rather than "meaningless" generated text. Most convincing is that there are distinct relationships between illustrative topics and NMF topics, while some overlapping vocabulary between sections is also maintained (see Appendix).

Figure 20 provides a summary of the tf-idf topics (both the 6-topic analysis and the 40-words-per-page analysis), along with language, scribe, and illustrated section (per Figure 1 above).
Figure 19: tf-idf generated topics matching manuscript hands and illustrated subjects

References

H. Abdi and D. Valentin. Multiple correspondence analysis. The University of Texas at Dallas, 2007.

Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira, and Luciano F. da Costa. Probing the statistical properties of unknown texts: Application to the Voynich Manuscript. PLoS ONE, 8(7):e67310, July 2013. URL http://dx.plos.org/10.1371/journal.pone.0067310.

Michael Barlow. The Voynich Manuscript - by Voynich? Cryptologia, 10(4):210–216, 1986. URL http://dblp.uni-trier.de/db/journals/cryptologia/cryptologia10.html#Barlow86a.

Stephen Bax. A proposed partial decoding of the Voynich script, 2014. URL https://stephenbax.net/?pageid=11.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. JMLR, 3(5), 2003.

Claire Bowern and Luke Lindemann. The linguistics of the Voynich Manuscript. Annual Review of Linguistics, 2020.

Julian Bunn. Using t-distributed stochastic neighbor embedding (t-SNE) to cluster folios, 2017. URL https://voynichattacks.wordpress.com.

P. Currier. Papers on the Voynich Manuscript. In M.E. D'Imperio, editor, New Research on the Voynich Manuscript, Washington, DC, 1976.

István Daruka. On the Voynich manuscript. Cryptologia, Feb 2020. ISSN 0161-1194.

Lisa Fagin Davis. How many glyphs and how many scribes: Digital paleography and the Voynich manuscript. Manuscript Studies, 5(1):162–178, 2020.

S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by latent semantic analysis, 1990.

Mary E. D'Imperio. The Voynich Manuscript: An Elegant Enigma. Technical report, 1978. URL http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA070618.

Deborah E. Harkness. The Voynich Manuscript. Beinecke Rare Book & Manuscript Library in association with Yale University Press, 2016. ISBN 978-0-300-21723-0.

Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972.

D. Jurafsky and J.H. Martin. Speech and Language Processing. Prentice-Hall, 3rd edition, 2009. URL https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.

Gerry Kennedy and Rob Churchill. The Voynich Manuscript: The Mysterious Code That Has Defied Interpretation for Centuries. Orion, 2004. ISBN 0-7528-5996-X.

D.D. Lee and H.S. Seung. Learning the parts of objects by nonnegative matrix factorization, 1999.

Luke Lindemann and Claire Bowern. Voynich manuscript linguistic analysis. arXiv.org, 2020.

Hans Peter Luhn. A statistical approach to the mechanized encoding and searching of literary information. IBM Journal of Research and Development, pages 309–317, 1957. URL https://doi.org/10.1147/rd.14.0309.

L. Ma. Topic-finding for short texts, 2016. URL https://github.com/dolaameng/tutorials/tree/master/topic-finding-for-short-texts.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Christian S. Perone. Voynich manuscript: word vectors and t-SNE visualizations of some patterns, 2016. URL http://blog.christianperone.com/2016/01/voynich-manuscript-word-vectors-and-t-sne-visualization-of-some-patterns/.

Sravana Reddy and Kevin Knight. What we know about the Voynich manuscript. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 78–86, 2011. URL http://dl.acm.org/citation.cfm?id=2107647.

Gordon Rugg. An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46, January 2004. URL http://www.tandfonline.com/doi/abs/10.1080/0161-110491892755.

Gordon Rugg and Gavin Taylor. Hoaxing statistical features of the Voynich Manuscript. Cryptologia, pages 1–22, September 2016. URL https://www.tandfonline.com/doi/full/10.1080/01611194.2016.1206753.

Torsten Timm and Andreas Schinner. A possible generating algorithm of the Voynich manuscript. Cryptologia, 44(1):1–19, January 2020. ISSN 0161-1194.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
Figure 20: Summary of topics, hands, languages, and scribes. The original figure plots every folio against its illustrated section (herbal, cosmological, balneological, recipes, stars), Currier language (A, B, or undetermined), hand (1, 2, 3, 2/3, 4, or 5), and its NMF 6-topic and NMF40 topic assignments.

A LDA

Figure 21: Top 20 words per topic for LDA, 6 topics

B LSA

Figure 22: Top 20 words per topic for LSA, 6 topics

C NMF

Figure 23: Top 20 words per topic for NMF, 6 topics

Figure 24: Top 20 words per topic for NMF, 5 topics

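The appendix figures list the top 20 words per topic. For a fitted scikit-learn LDA or NMF model (assuming that implementation, as in the sketches above), such lists can be read off the components matrix:

    import numpy as np

    def top_words(model, vectorizer, n=20):
        # each row of components_ weights the whole vocabulary for one topic
        vocab = np.array(vectorizer.get_feature_names_out())
        return [vocab[np.argsort(row)[::-1][:n]].tolist()
                for row in model.components_]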