The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
I am wondering whether this study is worth taking deeper. If the drawings were understandable, we could get a glimpse of what some words may mean. But it is hard to try to understand one topic, as there will be no clues for understanding anything.

Do you have any suggestions for how to take this further? Do you think it is worth a try?
One idea might be some sub-section studies.

The Herbal folios can be divided into Currier A and B; do topic models agree with this?

The Balneo section: some have suggested it could be divided into two sub-sections.

ReneZ has postulated up to 10 dialects within the VMS; could topic models throw some light on this?
   voynich.nu >> The Currier languages revisited
Hello all,

Over the past few days, I’ve been experimenting with Non-negative Matrix Factorization (NMF) to detect topics across the Voynich manuscript, and to explore how these topics correlate with Currier languages A and B as well as with the scribal hands.
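
For anyone who wants to experiment with something similar, here is a minimal sketch of this kind of NMF pipeline with scikit-learn. The input file name and the tokenization are placeholders, not my exact preprocessing:

Code:
# Minimal sketch of an NMF topic pipeline (scikit-learn).
# "paragraphs.txt" (one transliterated paragraph per line) is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

with open("paragraphs.txt", encoding="utf-8") as f:
    # skip paragraphs of only one or two words, as in the analysis above
    paragraphs = [line.strip() for line in f if len(line.split()) > 2]

# tf-idf weighting; the token pattern simply splits on whitespace and dots
vectorizer = TfidfVectorizer(token_pattern=r"[^\s.]+")
X = vectorizer.fit_transform(paragraphs)

nmf = NMF(n_components=11, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # paragraph-by-topic weights
H = nmf.components_        # topic-by-word weights

# top words per topic
vocab = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = [vocab[i] for i in row.argsort()[::-1][:10]]
    print(f"Topic {k}: {' '.join(top)}")

# dominant topic per paragraph, used later for the contingency tables
dominant_topic = W.argmax(axis=1)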

Surprisingly (or perhaps not), the topics detected are clearly aligned with both the Currier classification and the identified scribal hands—even more strongly with the hands than with the language groups.

To determine the optimal number of topics, I evaluated a range of values using several metrics (a rough sketch of this scan follows the list):
  • Pseudo-perplexity: An approximation of how well the model predicts unseen data. Lower values generally indicate better topic quality.
  • Topic coherence: Measures how semantically related the top words in each topic are. Higher coherence typically means more interpretable topics.
  • Topic overlap (Jaccard similarity): Measures how much top words are shared across topics. Lower overlap is better—indicating more distinct topics.
  • Number of unique high-weight words: Tracks how many distinct informative words are used across topics.
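
A rough sketch of such a scan, reusing X and vocab from the previous code block; NMF's reconstruction error is used here only as a simple stand-in for pseudo-perplexity, and a proper coherence score is omitted for brevity:

Code:
# Sketch: score candidate topic counts by topic overlap (Jaccard) and by the
# number of unique high-weight words; reconstruction error is a rough proxy
# for model fit, not a true pseudo-perplexity.
from itertools import combinations
import numpy as np
from sklearn.decomposition import NMF

def top_word_sets(H, vocab, n_top=10):
    return [set(vocab[i] for i in row.argsort()[::-1][:n_top]) for row in H]

def mean_jaccard(word_sets):
    pairs = list(combinations(word_sets, 2))
    return float(np.mean([len(a & b) / len(a | b) for a, b in pairs]))

for k in range(5, 21):
    nmf = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
    nmf.fit(X)  # X and vocab come from the previous sketch
    sets = top_word_sets(nmf.components_, vocab)
    print(f"k={k:2d}  recon_err={nmf.reconstruction_err_:.2f}  "
          f"overlap={mean_jaccard(sets):.3f}  "
          f"unique_top_words={len(set().union(*sets))}")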

Based on this evaluation, 11 topics emerged as the most meaningful. Note that paragraphs containing only one or two words were excluded from the analysis to avoid noise.

Here's a heatmap showing how topic presence evolves across folios. Each row is a topic, each column is a folio, and the color intensity shows how strongly the topic is represented:

[Image: UeibxYM.png]

I then examined how well the detected topics aligned with:
  • Currier languages A and B
  • Identified scribal hands

The results are shown in the following scatter plots:

[Image: xhWakQR.png]

[Image: RCRZyy5.png]

To quantify these correlations, I calculated the chi-squared (χ²) p-value between the topic assignments and each categorical variable. A lower p-value (approaching 0) indicates stronger evidence of an association between the topic distribution and the given variable.
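
A minimal sketch of this test with SciPy is below; dominant_topic comes from the NMF sketch above, while language_labels and hand_labels are placeholder arrays with one entry per paragraph:

Code:
# Sketch: chi-squared test of independence between topic assignments and a
# categorical label. "language_labels" and "hand_labels" are placeholders
# (one label per paragraph, aligned with "dominant_topic").
import pandas as pd
from scipy.stats import chi2_contingency

def topic_dependence(topics, labels):
    table = pd.crosstab(pd.Series(topics, name="topic"),
                        pd.Series(labels, name="label"))  # observed frequencies
    chi2, p, dof, expected = chi2_contingency(table)
    return p

print("p-value (topic vs language):", topic_dependence(dominant_topic, language_labels))
print("p-value (topic vs hand):", topic_dependence(dominant_topic, hand_labels))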

Here are the results:

p-value (topic vs language): 8.693479383813543e-33
p-value (topic vs hand): 2.8408917882824174e-119


As you can see, the p-value is much lower for the scribe hand, indicating that the detected topics are even more tightly linked to who wrote the paragraph than to the Currier language classification.

These findings suggest that topic modeling not only helps cluster content by lexical features, but also reflects deeper structural patterns of authorship and writing practices in the manuscript. It supports the idea that different scribes may have introduced or emphasized different "topics", even when writing in the same Currier language.

I'd be very interested to hear your interpretations or see comparisons with other modeling approaches.
Claire Bowern and two of her PhD students also addressed the question of topic modeling in the VMS.
(21-08-2025, 08:19 PM)quimqu Wrote:

These findings suggest that topic modeling not only helps cluster content by lexical features, but also reflects deeper structural patterns of authorship and writing practices in the manuscript. It supports the idea that different scribes may have introduced or emphasized different "topics", even when writing in the same Currier language.

I'd be very interested to hear your interpretations or see comparisons with other modeling approaches.

Just thinking of the naibbe cipher, could it be that each scribe had its own encryption table? What is the scribal frequency? Would it match the 5-3-1-1 naibbe distribution? Regardless, I think these are extremely interesting results.
Just a question: aren't those p-values extremely low? 8×10^-33 is practically zero already, but 2.8×10^-119 strains my imagination as a meaningful number. But I'm not a statistician; are those p-values normal?

About the name 'topics': the objection to its use (as already noted by davidma, and surely many others before me) is that one cannot be sure whether the big differences in vocabulary and word frequencies among the different sections of the VMS (even within the same Currier language) are due to a change of topic in a meaningful text or other kinds of data (possibly/probably coded, certainly written in a non-standard script), to a change in the coding tables of a meaningful text or other kinds of data (see the Naibbe example by davidma above), or to a change in the 'generating rules' of a meaningless text.

That said, the observation that the VMS pages can be clustered by scribe on the basis of their text is surely very interesting.
(21-08-2025, 11:32 PM)Mauro Wrote: Just a question: aren't those p-values extremely low? 8×10^-33 is practically zero already, but 2.8×10^-119 strains my imagination as a meaningful number. But I'm not a statistician; are those p-values normal?

Interesting question.
In hidden Markov modelling, one is working with extremely low probabilities, namely the probability that the piece of text being studied was the outcome of a free-running Markov chain. To avoid numerical issues, the software works with the logarithm of the probabilities throughout.
Typical values can be around -150.
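
A tiny illustration of why the logarithm is needed (the per-symbol probability below is arbitrary, just to show the underflow):

Code:
# Multiplying many small probabilities underflows to 0.0 in double precision,
# while summing their logarithms stays perfectly usable.
import math

p_per_symbol = 0.02   # arbitrary
n_symbols = 500

product, log_sum = 1.0, 0.0
for _ in range(n_symbols):
    product *= p_per_symbol
    log_sum += math.log10(p_per_symbol)

print(product)   # 0.0 (underflow; the true value is about 10**-849.5)
print(log_sum)   # about -849.5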

The second probability quoted above seems to go in this direction, and while for the Markov case this is understandable, I don't immediately see how this can be the case here.
To evaluate whether the distribution of topics depends on another categorical variable (e.g., language A vs. B, or the 5 scribal hands), I used a chi-square test of independence.

For this, I used the contingency tables that are plotted in my post (the heatmaps of topic vs. language / topic vs. hand).

- Rows represent the different topics.
- Columns represent the categories of the variable of interest (e.g., Language A, B, or Hand 1–5).
- Each cell contains the observed frequency (the count of how often a given topic occurs with that category).

The null hypothesis is:

H₀: topics are independent of language/hand.

The p-value is the probability of obtaining a chi-square statistic at least as extreme as the observed one, assuming the null hypothesis is true.

A low p-value (e.g., < 0.05) means we reject the null hypothesis, suggesting that the distribution of topics depends on the language or scribal hand.

A high p-value means we cannot reject the null, suggesting no evidence of dependence.

In both cases, we have a very low p-value (lowest when we analyze topics vs. hands), rejecting the null hypothesis that topics are independent of language/hand.
I am not sure I understand the results for “pharmacese”, i.e. the language corresponding to the “pharma” / small-plants layout. Lisa assigns the pages to Hand 1, but the words are morphologically different from HerbalA, with a high frequency of “eo” (as Emma pointed out, this feature is shared with the HerbalA pages that were bound close to Pharma, also by Hand 1; see the linked post and the following comment). Pharmacese appears to be classified as Topic 8? But Topic 8 doesn’t correlate at all with Hand 1? I am probably misunderstanding something...
(22-08-2025, 07:53 AM)quimqu Wrote: The null hypothesis is:

H₀: topics are independent of language/hand.

I'm not sure this null hypothesis is valid, at least not when it's used for the languages. Since the languages were initially defined using properties of the text (the relative abundance of various words and symbol combinations), and topic modeling uses the same properties, the null hypothesis states an a priori impossible situation, so its p-value might not be meaningful. I'm not a scientist, though, and my experience with p-values is nearly zero.

For hands this is more interesting, but as far as I know, some correlation between hands and languages does exist? This is not really my area.