The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
(20-10-2025, 10:24 AM)quimqu Wrote:
(20-10-2025, 06:29 AM)MarcoP Wrote: EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are noisier and less reliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.
Even though I agree in general terms, I don’t fully agree regarding topic modelling. Topic modelling is intended to be applied at the sentence level. The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long. I suppose (though I can’t confirm it) that the longer paragraphs should contain internal sentences, but since there is no punctuation, we can’t recognize them yet.

Hi quimqu, I am afraid that what can be said of topic modelling in general cannot always be said of its application to Voynichese.

For instance, Blei et al. [link] mention a data set that is orders of magnitude larger than the Voynich corpus:

Quote: Our data are 16,000 documents from a subset of the TREC AP corpus (Harman, 1992). After removing a standard list of stop words, we used the EM algorithm described in Section 5.3 to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model.

I am sure that, with a large enough corpus written in a consistent language, single words like “ball” and “knife” can distinguish between “sports” and “crimes”, but Voynich sections are so tiny and the language is so volatile that we aren’t even able to guess whether Voynich words encode language words, let alone which words (if any) are stop words, i.e. function words. So we are unable to remove stop words, and instead of “ball” and “knife” we might be classifying based on “these” and “which” (in the very optimistic scenario that single Voynich words do correspond to single plain-text words).
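To make the stop-word problem concrete: lacking any list of Voynichese function words, the closest available substitute is a blind frequency cut-off, which may discard content words just as easily. A minimal sketch, assuming a hypothetical `paragraphs` variable holding tokenized EVA paragraphs:

Code:
# Crude pseudo-stop-word filter: drop the top_n most frequent word types.
# This only imitates stop-word removal; nothing guarantees these are function words.
from collections import Counter

def drop_most_frequent(paragraphs, top_n=10):
    # paragraphs: hypothetical list of token lists, e.g. [["daiin", "chol"], ...]
    freq = Counter(tok for par in paragraphs for tok in par)
    cutoff = {w for w, _ in freq.most_common(top_n)}   # e.g. daiin, ol, chedy, ...
    return [[tok for tok in par if tok not in cutoff] for par in paragraphs]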

But, more importantly, we don’t know if the differences we see are due to different topics, different scribes, different languages or dialects, different cipher mechanisms such as different usage of nulls, different spelling, different preferences in spontaneous gibberish etc. We might even be classifying based on "potato" and "potahto".

Personally, I believe that the behavior of the ‘ed’ bigram shown in [link] cannot be explained by different topics, so I feel sure that different topics can at most be a minor component of the Voynich language shift. Scribes alone also cannot be “the” reason, since scribe 1 uses two different languages in his Herbal and Pharma pages (see the "eo" bigram); as Jorge says, it’s likely a mixture of factors. If the text is meaningful, we can be sure that topics will play some role in its composition (that’s basically what “meaningful” means), but such differences appear to be obscured by other phenomena that we cannot understand.
I wonder how/if Colin Layfield got around the scarcity of data problem. Restoring folio order based on a method similar to this sounds like it should be impossible, but from what we've seen in Lisa's talk, it looks like they may have gotten some useful results.
(21-10-2025, 08:15 AM)Koen G Wrote: I wonder how/if Colin Layfield got around the scarcity of data problem. Restoring folio order based on a method similar to this sounds like it should be impossible, but from what we've seen in Lisa's talk, it looks like they may have gotten some useful results.

Hi Koen,
I think that measuring page distance based on word statistics could work: the fact that Voynichese shifts so much could even make the task easier, since differences in lexicon will tend to be considerable, and for each page I expect you only have very few other pages which are quite close. Also, they work on whole pages rather than paragraphs, and they might even consider whole folios (recto+verso) or bifolios. Of course, I expect there could be ambiguities and cases that are harder than average. Lisa's idea of also examining the stain provides a totally independent approach, and I am looking forward to seeing how the two methods interact.
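As a rough illustration of the kind of word-statistics distance meant here (a sketch of the general idea, not Layfield's actual method), pages could be compared by cosine distance between their word-count vectors; the page texts below are hypothetical placeholders:

Code:
# Lexical page-distance sketch: each page becomes a word-count vector,
# and pages are compared pairwise by cosine distance.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

pages = {"f1r": "fachys ykal ar ataiin",            # hypothetical EVA text per page
         "f1v": "kydainy chol shol",
         "f2r": "kodaiin chol shol cthory"}
X = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(pages.values())
D = cosine_distances(X)                             # D[i, j]: distance between pages i and j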
Hello, Marco,

thanks again for your comments.

(21-10-2025, 07:44 AM)MarcoP Wrote: Hi quimqu, I am afraid that what can be said of topic modelling in general cannot always be said of its application to Voynichese.
For instance, Blei et al. [link] mention a data set that is orders of magnitude larger than the Voynich corpus

Well, here I have two comments to make:

Firstly, of course, if we have a lot of data, we can build a better model. But the Voynich is limited and we have no more text than what we have. This is why I previously ran two tests with known and understandable languages:
  • The first test [link], about detecting Portuguese, phonetic Portuguese and Spanish, gave excellent results. The corpus was just 17,045 words and 300 paragraphs (that I created artificially).
  • The second test [link], where I got acceptable topic classification (about 85% accuracy for each topic), used just 6,054 words and 600 paragraphs (headlines).

Both tests have much less data than the Voynich, and the results are very acceptable. So are we really sure that the Voynich transliterations don't give us enough text for automatic topic modelling?

Secondly, I said "The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long" for one reason, which will also answer this point:

(21-10-2025, 08:30 AM)MarcoP Wrote: they work on whole pages rather than paragraphs, and they might even consider whole folios (recto+verso) or bifolios.

Well, LDA outputs the percentage of each topic found in an input (and all topic weights for each word, if desired). I think that working at page level with the Voynich is maybe not optimal: we will get a topic mixture for each page, but it will be hard to tell where in the page the detected topics are located. I prefer to work with paragraphs for several reasons:
- I think we can get a more granular topic labelling.
- I think they are a clear unit of the Voynich (paragraphs can easily be delimited, unlike spaces, for instance).
- I can work with a set of about 900 units (paragraphs), while by page we only have around 200.
Then, once you have the results, you can clearly see the topic distribution in a page, just by grouping the paragraphs by page as I did in my previous post.
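For clarity, here is a minimal sketch of that workflow (a reconstruction, not the actual code behind the posted results; the paragraph texts and page names are placeholders):

Code:
# Fit LDA with K=2 on paragraph-level documents, then average the per-paragraph
# topic mixtures by page to get each page's topic distribution.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.DataFrame({"page": ["f103r", "f103r", "f75r"],
                   "paragraph": ["tshedy qokeedy qokedy shedy",
                                 "pchedy okain shedy qokain",
                                 "qokedy shedy qokeey chedy"]})

X = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(df["paragraph"])
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

theta = lda.transform(X)                            # % of each topic per paragraph
df["topic0"], df["topic1"] = theta[:, 0], theta[:, 1]
print(df.groupby("page")[["topic0", "topic1"]].mean())   # topic distribution per page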

In addition, when I say "The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long", I mean that longer paragraphs, as in Herbal, may mix topics (as at the page level) because they may contain more than one sentence. In comparison, smaller units like the paragraphs in Stars will most likely contain just one topic. Of course I would prefer to have 100,000 paragraphs rather than the 900 paragraphs we have in the MS, but my comment was about the topic mixture in bigger units such as herbal paragraphs or pages.

At page level the best model is LDA with K=3 topics (just a bit better than K=2), with a score of around 0.815. At paragraph level, the best model is also LDA, this time with K=2 topics (score 0.891). That's why I chose to work with LDA, K=2, at paragraph level. The score is calculated from different KPIs (silhouette, coherence, reconstruction error (NMF), perplexity (LDA), degenerate topics (redundant or very small) and stability). These results were obtained without suppressing any stop words (most common words like daiin).
[attachment=11769][attachment=11766]
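As a rough sketch of that kind of model selection (an illustration only; the scores above combine several KPIs, while this shows just perplexity and silhouette):

Code:
# Compare LDA models with different K, scoring each by perplexity and by the
# silhouette of the hard topic assignments over the document-topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import silhouette_score

def compare_lda(docs, ks=(2, 3, 4)):
    X = CountVectorizer(token_pattern=r"[^\s]+").fit_transform(docs)
    for k in ks:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        theta = lda.transform(X)                    # document-topic mixtures
        labels = theta.argmax(axis=1)               # hard topic per document
        sil = (silhouette_score(theta, labels)
               if len(set(labels)) > 1 else float("nan"))
        print(f"K={k}  perplexity={lda.perplexity(X):.1f}  silhouette={sil:.3f}")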

(21-10-2025, 07:44 AM)MarcoP Wrote: But, more importantly, we don’t know if the differences we see are due to different topics, different scribes, different languages or dialects, different cipher mechanisms such as different usage of nulls, different spelling, different preferences in spontaneous gibberish etc. We might even be classifying based on "potato" and "potahto".

That's absolutely right. I just say "topic" because the models are intended for topic modelling. But I really do not know if they are real topics, dialects, languages... Note, though, that what is detected is not a pure bag of words for topic 0 and another for topic 1: lots of words appear in both topics, so they are quite close to each other.