The Voynich Ninja

Full Version: Topic Modeling in the Voynich Manuscript
There is a new paper published about the VMS.

The authors of the paper are Rachel Sterneck, Annie Polish and Claire Bowern.

The abstract says:

Quote:This article presents the results of investigations using topic modeling of the Voynich Manuscript (Beinecke MS408). Topic modeling is a set of computational methods which are used to identify clusters of subjects within text. We use latent dirichlet allocation, latent semantic analysis, and nonnegative matrix factorization to cluster Voynich pages into 'topics'. We then compare the topics derived from the computational models to clusters derived from the Voynich illustrations and from paleographic analysis. We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations), providing further evidence that the Voynich Manuscript contains meaningful text.

I especially like Figures 6, 8, 10, 12, and 14. They clearly visualize the gradual evolution of a single system from "state A" to "state B" as suggested in Timm & Schinner (see Timm & Schinner 2020, p. 6).

The LDA topic clustering method "created one dominant topic, with the majority of pages belonging to topic 2" (according to Figure 3, topic 2 belongs to Currier B). This way "there is a clear distinction between Language A and B" (Sterneck et al. 2021, p. 7). The reason for this result is that "LDA learns the topic representation of each document, as well as the words associated with each topic" (Sterneck et al. 2021, p. 2). Words typical for Currier A also exist in Currier B, but not the other way round. Therefore it is possible to distinguish Currier B from Currier A based on frequency counts of tokens containing the sequence "ed". It is therefore reasonable to assume that the LDA algorithm has learned words typical for Currier B like "chedy", "shedy", "qokeedy" etc.
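
The "ed" criterion can be sketched in a few lines of Python. This is only an illustration, not the method used in the paper: the `tokenize` helper is a hypothetical, simplified reader for EVA transliteration lines (it treats both "." and "," as word breaks and keeps the first of two alternative readings), and the two sample lines are f5v.3 and f78r.15 as quoted elsewhere in this thread. Treating the first as an A-type sample and the second as a B-type sample is an assumption made here.

```python
import re

def tokenize(line):
    """Simplified, hypothetical reader for one EVA transliteration line."""
    line = re.sub(r"<[^>]*>", " ", line)                  # drop locus tags and inline markup
    line = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", line)   # [a:o] -> keep first reading
    line = re.sub(r"[{}'?]", "", line)                    # drop ligature braces and marks
    return [t for t in re.split(r"[.,\s]+", line) if t]

def ed_share(lines):
    """Fraction of word tokens that contain the sequence 'ed'."""
    toks = [t for line in lines for t in tokenize(line)]
    return sum("ed" in t for t in toks) / len(toks)

A_SAMPLE = "<f5v.3,+P0>      qotcho.ytor.daiin.daiin.otchor.daiin.q'o.darchor.do"
B_SAMPLE = "<f78r.15,+P0>    dchckhedy.qokchdy.qokedy.okedy.dal,or.okeed.olkain"

print(ed_share([A_SAMPLE]))  # 0.0
print(ed_share([B_SAMPLE]))  # 0.5
```

Two single lines prove nothing, of course; the point is only that a per-page count of this kind is trivial to compute, so a topic model can easily pick it up.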

A result of the NMF topic clustering method is: "The star paragraphs are close to NMF topic 1 and the visual balneological topic, but it also is close to NMF topic 6." This confirms that "the starred paragraphs and balneological may have overlapping content as distinct topics" (Sterneck et al. 2021, p. 11).

Another result is: "NMF topics 0 and 4 are almost identical with the botanical section close by. It’s interesting to see NMF topics 0 and 4 nearly collapse into one topic" (Sterneck et al. 2021, p. 11). Again it becomes apparent that on a smaller scale the "topics" seem to overlap, whereas on a larger scale the massive differences between Currier A and B are immediately apparent. Such results demonstrate that it is not that easy to find a clear topic structure in the manuscript.

However, Sterneck et al. interpret the results differently than I do. For instance, they dismiss the results of the LDA analyses since "the LDA topics don’t seem to cluster in any significant way" (Sterneck et al. 2021, p. 7).

The authors themselves sound a warning about the existence of two different 'languages' or 'dialects' A and B: "From a statistical perspective, this discovery has complicated our understanding of the Voynich Manuscript because topic modeling relies on word frequencies and expects consistency across texts" (Sterneck et al. 2021, p. 4). But if this is the case, their conclusion contradicts itself: "We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations)" (Sterneck et al. 2021, p. 1). Either the topics are a result of scribal differences or a result of subject changes. Therefore it is at least problematic to use two contradicting explanations together.
I am glad that linguists at Yale are proceeding in their research! I understood very little of the maths behind this work, but the overall concepts are clear enough.

Here are a few random comments:

Quote:On the basis of illustrations which accompany the text, it is customary to divide the manuscript into five sections. 1. botanical/herbal 2. astrological/astronomical 3. balneological 4. pharmaceutical 5. starred paragraphs/“recipes”

This does not appear to be consistent with Figure 1, where the herbal pages intermixed with the "pharmaceutical" pages are assigned to the "pharmaceutical" section. In other words, "sections" in Figure 1 do not appear to be based on the illustrations. Were the "herbal" pages f90-96 treated as belonging to the "pharmaceutical" section (as Figure 1 suggests)?

Stolfi is misspelled as "Stolti". Also, I am not sure that the transliteration used can be "Takahashi’s version of the text (as corrected by Zandbergen and Stolfi)": IIRC, Takahashi's file does not include the Rosettes. So maybe the Zandbergen-Landini EVA file was used?

In paragraph 2.2.2 "pharmaceutical" and "starred/recipes" are listed as sections, while in Figure 3 topic_starred and topic_recipes appear. I guess that in Figure 3 (and in most of the paper) "recipes" stands for "pharmaceutical"? It would help if sections were named consistently.

It would be interesting if the page-level plots (like Figure 8) included both the colour-encoded topic classification and illustration-type classification (e.g. as point shape). In particular, I would be curious to see the distribution of Herbal A / Herbal B pages: I doubt they can be assigned to a single topic. Also, how do the "pharmaceutical" (i.e. "small-plants") pages behave with respect to Herbal A and Herbal B? Do they turn out to be somehow "intermediate" between the two sets?
Also for Figure 12, one could find a way to show topics and hands at the same time for each sample. The node-edge graphs at the end (16,17,18) make some of this information visible, but I find them difficult to read (possibly because I am not familiar with this format).

I found paragraph "4.6 Analysis 6: NMF topics vs. Currier languages" and Figure 14 particularly interesting. I think it is clear that Voynichese cannot be split into two languages A and B; I am still uncertain whether it is a single continuum, as Torsten thinks, or a set of discrete dialects. Anyway, what is interesting is that word structure varies together with the illustrations; a well known example is the frequency of [link] in the "pharmaceutical" pages.
I would definitely love to see a more extensive discussion of the relations between "topics" and word structure.
I also find the cloud in Figure 14 somewhat comparable with what I got by plotting the first 2 principal components (PCA) of [link]. I wonder if the tip of the V consists of Astro and Pharma pages also for the topics plot.
[attachment=5641]


Typo here: "nodes in the network are the categories under consideration (hand, illustrated sections, TF-IDF topics, etc), and the edges are the pages the [that] link the hands to sections or topics"

The fundamental distinction between function words and content words is mentioned in the paper. I believe this is a field that could be explored in greater depth. The methods used in the paper are apparently based on the idea that words in the different sections are comparable:

Quote:Although we cannot read the Voynich text, topic modeling is still applicable if we assume that Voynich words have a consistent form-meaning correspondence across the manuscript. That is, we need to assume that 8ain on [link] is the SAME word as 8ain on f7v.

While there are a few words that appear throughout the manuscript, the identification of function words is problematic (this has also been discussed by Torsten).
One can assume that the top 4 most frequent words in the Balneological section (Quire 13) are function words: Shedy chedy qokedy qokain
These cumulatively appear 711 times and make up 11% of the text.
But each of these words only appears once in the Pharma section. What is happening here? The word frequencies in each section are compatible with some of the words being function words, but the fact that they are not shared among sections suggests that they appear in different forms. This could contradict the consistency assumption quoted above.
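
As a rough check on the numbers above: 711 occurrences making up 11% of the text implies a section of about 711 / 0.11 ≈ 6,460 word tokens (somewhere between roughly 6,200 and 6,800, since the 11% is rounded). The kind of count behind such figures can be sketched as follows. The function names are made up for illustration, and the two token lists are toy data, not real section counts.

```python
from collections import Counter

def top_k_share(tokens, k):
    """Return the k most frequent word types and their combined share of all tokens."""
    top = Counter(tokens).most_common(k)
    words = [w for w, _ in top]
    share = sum(c for _, c in top) / len(tokens)
    return words, share

def occurrences_elsewhere(words, other_tokens):
    """Count how often each of the given words appears in another section."""
    counts = Counter(other_tokens)
    return {w: counts[w] for w in words}

# Implied section size from the figures quoted in the post (11% is a rounded value).
implied_total = 711 / 0.11

# Toy token lists standing in for two sections (NOT real counts).
toy_bio = "shedy chedy qokedy qokain shedy chedy ol shedy".split()
toy_pharma = "daiin okeol qokeol chedy cheol".split()

words, share = top_k_share(toy_bio, 2)
print(words, share)                              # ['shedy', 'chedy'] 0.625
print(occurrences_elsewhere(words, toy_pharma))  # {'shedy': 0, 'chedy': 1}
```

With a full transliteration split by section, the same two functions would show how much of one section is covered by its top words and how rare those words are elsewhere.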

I am looking forward to seeing how this research develops and whether some light can be shed on the peculiar word structure of Voynichese and the distribution of word types across sections. BTW, also the fact that word structure is fairly consistent across sections while word frequencies vary so much is very puzzling.
(09-07-2021, 04:50 PM)MarcoP Wrote: BTW, also the fact that word structure is fairly consistent across sections while word frequencies vary so much is very puzzling.

Assuming that the ms was created with the use of external aids like tables, wheels, lists, external text, or something else, this could be explained by some portion of these aids or their content having been changed or replaced at points during the creation for various reasons. Has there been much discussion that different sections could have been enciphered differently? 
(12-07-2021, 06:08 PM)byatan Wrote:
(09-07-2021, 04:50 PM)MarcoP Wrote: BTW, also the fact that word structure is fairly consistent across sections while word frequencies vary so much is very puzzling.

Assuming that the ms was created with the use of external aids like tables, wheels, lists, external text, or something else, this could be explained by some portion of these aids or their content having been changed or replaced at points during the creation for various reasons. Has there been much discussion that different sections could have been enciphered differently? 

If they used modified cipher keys similar to the kind used in Italian diplomacy at that time, then those kinds of changes would be quite normal. I don't know how much this has been discussed, but I have certainly mentioned it before, either on Ninja or on Nick Pelling's blog.
(12-07-2021, 06:08 PM)byatan Wrote:
(09-07-2021, 04:50 PM)MarcoP Wrote: BTW, also the fact that word structure is fairly consistent across sections while word frequencies vary so much is very puzzling.

Assuming that the ms was created with the use of external aids like tables, wheels, lists, external text, or something else, this could be explained by some portion of these aids or their content having been changed or replaced at points during the creation for various reasons. Has there been much discussion that different sections could have been enciphered differently? 

I think most people who are aware of the phenomenon think of it as an effect of different encodings or writing systems. This idea, however, has problems of its own. See for instance what Rene wrote in his recent paper:

Rene Wrote:In the case of the original table-and-grille approach, a relatively easy explanation presents itself, namely that different tables were used for the pages in the two languages. However, this would not easily explain that almost all A-language words also tend to appear in the B language.

Another problem is deciding how many different systems (or "dialects") there are. The number of systems appears to be greater than the number of sections (e.g. the Herbal section appears to include two distinct parts A and B). Also Lisa's scribes do not seem to have a one-to-one correspondence to encoding systems or dialects, since scribe 1 wrote both Herbal-A and the different A variant in the Pharma aka Small-Plants section.

It is also possible that Voynichese is a single continuum with a slow "drift" in bigram frequencies. This would be even harder to explain. I think that Torsten is right in pointing out that also the plots in Sterneck et al. do not show well-separated clusters.
To illustrate the connections between Hands and TF-IDF topics Figure 18 is used. (The nodes in this graph are the categories in consideration (hands, TF-IDF topics) and the edges are the pages which link the hands to topics (p. 12).)

[attachment=5652]

Figure 18 demonstrates that TF-IDF topics distinguish between Currier A and B. There are two topics for Currier A (topic 1 and 3) and three topics for Currier B (topics 2, 4, and 6). It is also observable that Hand 5 is not related to any topic, that Hand 1 is equivalent to Currier A and that Hand 2 and 3 do cover Currier B. Hands 2 and 3 are connected with each other since "topic 2 is equally split between Hands 2 and 3" (p. 11). Only for Hand 4 a one-to-one mapping to a topic exists: "the astrological section is always clustered next to hand 4" (p. 15).
In essence the paper demonstrates that it is possible to distinguish between Currier A and B, that labels behave differently from plain text and that there is a connection between the stars and the balneological section. None of this is a surprise. René Zandbergen already presented similar results of a cluster analysis on his website.

What is surprising is the conclusion "We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations)" (p. 1). Actually the paper demonstrated quite the opposite: "In the previous analyses, we showed how TF-IDF topics do not clearly match to either Hands or Illustrative sections, though there is some overlap in association with both" (p. 11). This means that neither the assumption of scribal differences nor subject changes would explain the changes in word frequencies as observed.

But even if it were possible to build a one-to-one mapping to illustrations and also to scribal hands, this would be problematic. Topic changes are either the result of subject changes or they occur because the text representation changes. Sterneck et al. try to explain topic changes both ways. But this only means that the claim of meaningless 'dialect' differences weakens the claim of meaningful subject changes and vice versa.
Hi Torsten and Marco, thank you for illuminating some of the paper to people like me who are statistically challenged.  And of course to the researchers: this paper took time and effort and is appreciated!

Can I ask what might seem an obviously answered question, because I frankly don't understand 90% of the study, and it's a question I had even prior to it. But I am not a linguist at all.

Is it possible that there actually are two dialects being encoded here?  Latin, for instance, seems to have been modified by every language group in Europe.  Doesn't the difference between A and B, for instance, seem to argue for encoding two different original works (or written by different dialect-speaking scribes) in closely related 'dialects'?  And could linguists perhaps derive some clues from the very frequent 89, or eva dy, in B as opposed to A, that seems to be a common ending in one but not the other?  

As an aside but continuing this subject, I do think I remember reading as well that the 40 construction in the VMS is much more frequent in some sections than others.  Some observers have thought this might translate to "qu".  If so, might the difference between the two "dialects" be the difference between classical Latin and slightly more vulgar Latin that was using quod, quia, etc. for clauses very frequently?  Both Latins were used at the same time.

I don't want to isolate Latin, btw, just using it as an example.

I guess I find it somewhat dismaying that a different code or cipher might have been used throughout the manuscript; I'd rather believe in a slight shift of dialect in the same language!  But is what I've said here wishful thinking or a possibility?
(14-07-2021, 03:14 AM)Barbrey Wrote: Hi Torsten and Marco, thank you for illuminating some of the paper to people like me who are statistically challenged.  And of course to the researchers: this paper took time and effort and is appreciated!

Can I ask what might seem an obviously answered question, because I frankly don't understand 90% of the study, and it's a question I had even prior to it. But I am not a linguist at all.

Is it possible that there actually are two dialects being encoded here?  Latin, for instance, seems to have been modified by every language group in Europe.  Doesn't the difference between A and B, for instance, seem to argue for encoding two different original works (or written by different dialect-speaking scribes) in closely related 'dialects'?  And could linguists perhaps derive some clues from the very frequent 89, or eva dy, in B as opposed to A, that seems to be a common ending in one but not the other?  

As an aside but continuing this subject, I do think I remember reading as well that the 40 construction in the VMS is much more frequent in some sections than others.  Some observers have thought this might translate to "qu".  If so, might the difference between the two "dialects" be the difference between classical Latin and slightly more vulgar Latin that was using quod, quia, etc. for clauses very frequently?  Both Latins were used at the same time.
I don't want to isolate Latin, btw, just using it as an example.

Hi Barbrey,
I am not a linguist either, but here is my opinion for what it's worth.
Looking at 40 (EVA:qo) and 89 (EVA:dy) as corresponding to bigrams in the source language implies that one is looking for a simple substitution cipher (where 4=q, 0=u, maybe -9=-us etc). But character entropy shows that Voynichese cannot be a simple substitution cipher of an ordinary European language: the way in which characters follow each other in Voynichese is too rigid to match any of those languages.
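
For readers who have not met the entropy argument: the measure usually meant here is conditional character entropy, i.e. how uncertain the next character is once the current one is known. A minimal sketch under that assumption follows; the function name is made up, and the toy strings only show how the number behaves. A meaningful comparison needs long texts in both Voynichese and the candidate language.

```python
import math
from collections import Counter

def conditional_entropy(text):
    """H(next char | current char) in bits, estimated from bigram counts."""
    pairs = Counter(zip(text, text[1:]))   # counts of (current, next) character pairs
    firsts = Counter(text[:-1])            # how often each char occurs as 'current'
    n = sum(pairs.values())
    h = 0.0
    for (a, _b), c in pairs.items():
        # p(a, b) * log2 p(b | a)
        h -= (c / n) * math.log2(c / firsts[a])
    return h

# A fully rigid sequence: the next character is always predictable.
print(conditional_entropy("abababab") == 0)   # True

# A slightly freer sequence: 'a' can be followed by 'a' or 'b'.
print(round(conditional_entropy("aabb"), 4))  # 0.6667
```

The lower the value, the more rigidly characters follow each other, which is the property described above for Voynichese.
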
40 also has the feature that removing 4- from a 4-word typically results in a legal 0-word: if you remove the initial q- from a Latin q-word you almost never get a legal word. Finally, the qo- and o- variants often appear consecutively as qoX.oX (Zandbergen-Landini EVA transliteration):

<f40r.2,+P0>      qokar.okar.okedy.dar.<->ykchey.kaiin.ok[a:o]s,chedy.okar.a,ralos
<f78r.15,+P0>    dchckhedy.qokchdy.qokedy.okedy.dal,or.okeed.olkain
<f103v.4,+P0>    y,cheey.qokeey.okeey.lkees,ol.qoteedy.ykeedy<$>
<f112r.13,+P0>    sor,aiin.chdy.ches.qokeey.okeey.otaiin.chcthy.oteey,dy

or as oX.qoX

<f31r.10,+P0>    <%>tol,shso.okedy.okedy.qokedy.qokeedy.dar.shedshey
<f79v.13,+P0>    dain.ar.olshey.dytain.qokain.checthy.okeedy.qokeedy.ror
<f102r2.11,+P0>  kockhas.okor.ykeey.okeey.qokeey.dol.ol.sheody.okey.da,l,{cthhh}y
<f107v.38,+P0>    dain.okchey.qokchey.qokaiin.olkeey.qokol.oteey.oteey.lkain
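
Pairs like the ones above can be found mechanically. The following is a minimal sketch, assuming a simplified reading of the transliteration format: the hypothetical `tokenize` helper treats both "." and "," as word breaks and keeps the first of two alternative readings such as [a:o].

```python
import re

def tokenize(line):
    """Simplified, hypothetical reader for one EVA transliteration line."""
    line = re.sub(r"<[^>]*>", " ", line)                  # drop locus tags and inline markup
    line = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", line)   # [a:o] -> keep first reading
    line = re.sub(r"[{}'?]", "", line)                    # drop ligature braces and marks
    return [t for t in re.split(r"[.,\s]+", line) if t]

def qo_o_pairs(line):
    """Return consecutive word pairs of the form qoX.oX or oX.qoX."""
    toks = tokenize(line)
    hits = []
    for a, b in zip(toks, toks[1:]):
        if a.startswith("qo") and b == a[1:]:
            hits.append((a, b))        # qoX followed by oX
        elif b.startswith("qo") and a == b[1:]:
            hits.append((a, b))        # oX followed by qoX
    return hits

print(qo_o_pairs("<f40r.2,+P0>      qokar.okar.okedy.dar.<->ykchey.kaiin.ok[a:o]s,chedy.okar.a,ralos"))
# [('qokar', 'okar')]
print(qo_o_pairs("<f31r.10,+P0>    <%>tol,shso.okedy.okedy.qokedy.qokeedy.dar.shedshey"))
# [('okedy', 'qokedy')]
```

Running this over a whole transliteration file and comparing the hit count against a shuffled word order would show whether such pairs are more common than chance, but that comparison is not done here.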

This is a special case of a phenomenon that is typical of Voynichese: similar words tend to appear consecutively. This has been addressed by Timm and Schinner and (from a different angle) by Rene with his [link].
The relationship between qo- and o- has also been discussed by Emma May Smith ([link] and following posts).

Instead of thinking of a simple substitution, it is better to focus on the weaker assumption that each Voynichese word corresponds to exactly one word in the underlying language: this can be accomplished with a nomenclator, but the point really is to consider words as the atomic element to be analyzed. This is what has been done in the paper discussed in this thread.
Here one is faced with the problem of function words: the most frequent words tend to be more or less the same for all texts in a given language. Even different but related languages can have a considerable overlap in function words. But in the Voynich manuscript each section has a distinctive set of top-10 words.
See for instance this table of word types sorted by decreasing frequency (originally posted [link]).

This shows that in Voynich sections the top 30 words vary a lot, with 'daiin' and 'chol' decreasing from left to right (A to B) while 'chedy', 'shedy', 'qokedy' increase.
The two extremes HerbalA and Bio share 9 out of 30 words (smaller blue circles): this looks significant, but the top 4 words in Bio are excluded from the intersection with HerbalA.
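
The overlap count is easy to reproduce once the text is split into sections. A minimal sketch, with made-up function names and toy token lists that only mimic the pattern described (they are not real section data):

```python
from collections import Counter

def top_n(tokens, n):
    """The n most frequent word types in a list of tokens."""
    return [w for w, _ in Counter(tokens).most_common(n)]

def shared_top_words(tokens_a, tokens_b, n):
    """Word types that are in the top n of both sections."""
    return set(top_n(tokens_a, n)) & set(top_n(tokens_b, n))

# Toy stand-ins for two sections (NOT real counts).
toy_a = "daiin chol daiin chor chol daiin chor chol daiin shol ol".split()
toy_b = "chedy shedy chedy ol shedy chedy qokedy shedy chedy ol daiin".split()

print(top_n(toy_a, 3))                            # ['daiin', 'chol', 'chor']
print(top_n(toy_b, 3))                            # ['chedy', 'shedy', 'ol']
print(shared_top_words(toy_a, toy_b, 3))          # set()
print(sorted(shared_top_words(toy_a, toy_b, 5)))  # ['daiin', 'ol']
```

In the toy data, as in the table, the overlap only appears further down the list: the very top words of one section are missing from the top of the other.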

Also notice the frequent 'eol' words which are typical of Pharma (marked with orange circles). Though Pharma and HerbalA are both classified as Currier A, they are noticeably different.
On the other hand, in Latin texts about different subjects and from different times, the top five words are fairly consistent.

In my opinion, we are left with three options:
1. The one-word-to-one-word assumption is wrong and Voynichese words do not correspond to words in an underlying language (Currier and Torsten favour this possibility).
2. Words are written in different ways in the different sections (e.g. Bio 'qokain' corresponds to HerbalA 'sho' - using two random words as an example). Nick Pelling proposed the task of mapping Currier A to Currier B or vice-versa: I think this is a great, though terribly hard, research area.
3. The underlying language has no function words (I am far from sure that such a language exists, but this is my preferred option, though I am also interested in option 2).

Quote:I guess I find it somewhat dismaying that a different code or cipher might have been used throughout the manuscript; I'd rather believe in a slight shift of dialect in the same language!  But is what I've said here wishful thinking or a possibility?

A slight shift of dialect would not by itself result in totally different top ranking words. At the level of bigrams (couples of consecutive characters) the [link] is as large as that between Latin and Italian (two distinct languages).
In my opinion, if one wants to only consider "obvious" European languages with a one-to-one word correspondence, points 3 and 1 above are excluded and one is left with option 2: radically different encoding/spelling for the different sections.

Another observation against a one-to-one correspondence with words in normal European languages is that, in Voynichese, many of the most frequent words can be reduplicated, e.g.

<f5v.3,+P0>      qotcho.ytor.daiin.daiin.otchor.daiin.q'o.darchor.do
<f32v.8,+P0>      otchol.daiin.daiin.ctho,daiin.qotaiin.<->otchy.d.<->shan
<f78r.35,+P0>    y.sain.checkhy.qokain.cheeky.daiin.daiin.y,tees.ol,y
<f115r.17,+P0>    qol.cheey.qotchy.daiin.daiin.cheocthy.dolkeedy.qotaiin.chol.oteeedchey.okeedain

<f76v.23,+P0>    dchedy.qokeedy.qotchy.qokol.shedy.shedy.chedy.olched[?:r].shetey.saiin
<f82r.18,+P0>    polched.otain.shedy.shedy.dal.chedar.qokeey.ykeey.l,s,araiin,ory
<f82v.6,+P0>      qokedy.lshedy.qotol.dol,shedy.shedy,dy.darotedy.chetedy.lokam
<f103v.7,+P0>    daiin.shey.chol.chey.oteey.lkeeor.okaiin.shedy.shedy.qokaiin.ol.chedydy

This is not the case for frequent words in European languages (e.g. 'and and' 'the the' 'of of'...). One can sometimes build sentences with those patterns (not what I am thinking of, of course) but such examples are extremely rare in actual texts.
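
Reduplication is also simple to count. A minimal sketch, again assuming a simplified reading of the transliteration format (the hypothetical `tokenize` helper treats both "." and "," as word breaks); the English sentence is just an arbitrary comparison string.

```python
import re

def tokenize(line):
    """Simplified, hypothetical reader for one EVA transliteration line."""
    line = re.sub(r"<[^>]*>", " ", line)                  # drop locus tags and inline markup
    line = re.sub(r"\[([^:\]]*):[^\]]*\]", r"\1", line)   # [a:o] -> keep first reading
    line = re.sub(r"[{}'?]", "", line)                    # drop ligature braces and marks
    return [t for t in re.split(r"[.,\s]+", line) if t]

def reduplications(line):
    """Words that are immediately followed by an identical word."""
    toks = tokenize(line)
    return [a for a, b in zip(toks, toks[1:]) if a == b]

print(reduplications("<f5v.3,+P0>      qotcho.ytor.daiin.daiin.otchor.daiin.q'o.darchor.do"))
# ['daiin']
print(reduplications("<f82v.6,+P0>      qokedy.lshedy.qotol.dol,shedy.shedy,dy.darotedy.chetedy.lokam"))
# ['shedy']
print(reduplications("the cat sat on the mat and the dog"))
# []
```

Dividing the number of hits by the number of word pairs in a section would give a reduplication rate that can be compared across sections or against ordinary texts.
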
Thank you very much MarcoP.  You have certainly cleared up my questions, and given much food for thought. We seem to be able to derive so much statistical data from the text, but so few conclusions.

I did join the recent webinar and by Twitter afterwards asked Claire if a numerical cipher might change entropy.  She said not normal substitution, but I asked if something a bit more complex might work.  So say o was 1, but ox was 14, and or, 15. She seemed to think this might be better but the convo ended.

That maybe would help with the repeats. Daiin, for instance, might be 22, equal to some letter, say T, so Daiin Daiin would just be TT in the middle of a word.

But I'm spinning wheels here because I really can't keep up with the specifics of ciphers or the word structure patterns, though I like to try to achieve a broad understanding of the new studies, so that in my imagery analysis I might recognize patterns related to the text if they happen to show up.  You've helped so thanks again!