The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
(22-08-2025, 08:42 AM)oshfdk Wrote:
(22-08-2025, 07:53 AM)quimqu Wrote: The null hypothesis is:

H₀: topics are independent of language/hand.

I'm not sure this null hypothesis is valid, at least not when it's used for the languages. Since the languages were initially defined using properties of the text (relative abundance of various words and symbol combinations), and topic modeling uses the same properties, the null hypothesis describes an a priori impossible situation, so its p-value might not be meaningful. I'm not a scientist, though, and my experience with p-values is nearly zero.

For hands this is more interesting, but as far as I know, some correlation between hands and languages does exist? This is not really my area.

You’re right that the null hypothesis is problematic for languages, because both languages and topics are defined from the same textual features (word frequencies, symbol patterns, etc.). That means independence is impossible by construction, so I agree that the p-value wouldn’t really have a valid interpretation.

For the scribal hands, it’s different: hands are identified from handwriting features, not from lexical distributions (if I am not wrong). So here the null hypothesis (that topics are independent of hands) makes sense, even if some correlation between hands and languages is already known.
(22-08-2025, 07:53 AM)quimqu Wrote: The p-value is the probability of obtaining a chi-square statistic at least as extreme as the observed one, assuming the null hypothesis is true.

A low p-value (e.g., < 0.05) means we reject the null hypothesis, suggesting that the distribution of topics depends on the language or scribal hand.

A high p-value means we cannot reject the null, suggesting no evidence of dependence.

In both cases, we have a very low p-value (lowest when we analyze topics vs. hands), rejecting the null hypothesis that topics are independent of language/hand.

This is also my understanding of p-values, but I'm bugged by the vanishingly small values you report, which look very unusual to me as p-values (amounting to rock-solid certainties if ever there was one).


I also agree with oshfdk (and with your following answer): 'languages' and 'topics' are defined in the same way, so it's not surprising to find them correlated, while the correlation with the scribes is more interesting (*). But scribes too correlate with the sections of the manuscript, i.e. scribe 3 is mostly the Stars section, scribe 4 is Zodiac/Astronomy, scribe 1 mostly Botanical and Recipes, scribe 2 Botanical and Balneological. And there are big 'linguistic' differences between all of these sections (and thus between scribes). There are big differences even within the same scribe, e.g. 'qokain' is the 5th most frequent word in Balneological, scribe 2, but it appears only 6 times in the Botanical section of scribe 2. I actually find it plausible that each section+scribe piece into which the VMS can be divided can be seen as having been written in a different 'language'.


(*) but this does not remove my doubts about the p-values, they're really too small.
(22-08-2025, 10:38 AM)Mauro Wrote: (*) but this does not remove my doubts about the p-values, they're really too small.

Well, I used the SciPy function:

scipy.stats.chi2_contingency 

It performs a Chi-square test of independence on a contingency table.

Input: a 2D table of observed frequencies (e.g., topics × languages, or topics × scribal hands).

Null hypothesis (H₀): the two categorical variables are independent.

Output:
  • chi2 → the Chi-square statistic
  • p → the p-value
  • dof → degrees of freedom = (rows − 1) × (columns − 1)
  • expected → the expected counts under H₀

In my post I reported only p.
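For reference, a minimal sketch of the call, with a made-up 3×2 table of topic counts per language (the real tables are 11×2 for languages and 11×5 for hands):

import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed counts: rows = topics, columns = categories (e.g., languages)
observed = np.array([
    [30, 10],
    [12, 25],
    [20, 18],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)  # statistic, p-value, (rows - 1) * (columns - 1)
print(expected)      # expected counts under H₀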

This is the full output, with all four return values per table:

Topic vs language:
 Chi2ContingencyResult(statistic=177.25623014793612, pvalue=8.693479383813543e-33, dof=10, expected_freq=array(
[[23.93714927, 19.06285073],
[50.1010101 , 39.8989899 ],
[65.68799102, 52.31200898],
[51.21436588, 40.78563412],
[36.18406285, 28.81593715],
[39.52413019, 31.47586981],
[30.06060606, 23.93939394],
[48.98765432, 39.01234568],
[41.75084175, 33.24915825],
[46.20426487, 36.79573513],
[62.34792368, 49.65207632]]))

Topic vs hand: Chi2ContingencyResult(statistic=689.3808658898682, pvalue=2.8408917882824174e-119, dof=40, expected_freq=array(
[[ 7.335578  , 16.02244669,  2.41301908, 16.60157127,  0.62738496],
[15.35353535, 33.53535354,  5.05050505, 34.74747475,  1.31313131],
[20.1301908 , 43.96857464,  6.62177329, 45.55780022,  1.72166105],
[15.69472503, 34.28058361,  5.1627385 , 35.51964085,  1.34231201],
[11.08866442, 24.21997755,  3.64758698, 25.09539843,  0.94837262],
[12.11223345, 26.45566779,  3.98428732, 27.41189675,  1.0359147 ],
[ 9.21212121, 20.12121212,  3.03030303, 20.84848485,  0.78787879],
[15.01234568, 32.79012346,  4.9382716 , 33.97530864,  1.28395062],
[12.79461279, 27.94612795,  4.20875421, 28.95622896,  1.09427609],
[14.15937149, 30.92704826,  4.65768799, 32.04489338,  1.21099888],
[19.10662177, 41.7328844 ,  6.28507295, 43.24130191,  1.63411897]]

))
(21-08-2025, 08:43 PM)davidma Wrote:
(21-08-2025, 08:19 PM)quimqu Wrote:

These findings suggest that topic modeling not only helps cluster content by lexical features, but also reflects deeper structural patterns of authorship and writing practices in the manuscript. It supports the idea that different scribes may have introduced or emphasized different "topics", even when writing in the same Currier language.

I'd be very interested to hear your interpretations or see comparisons with other modeling approaches.

Just thinking of the Naibbe cipher, could it be that each scribe had its own encryption table? What is the scribal frequency? Would it match the 5-3-1-1 Naibbe distribution? Regardless, I think these are extremely interesting results.

Just to add here, Naibbe ciphertexts exhibit language-dependent and topic-dependent variation in word frequencies, for the simple reason that alphabet letter and plaintext bigram frequencies can vary by language and topic. For example, because it uses the word “herba” very often, a herbal plaintext might have more Bs than an astrology text, holding the language constant across both texts. As an exercise, I recommend that people do NMF on my reference Naibbe ciphertexts, using 1000-5000 tokens as the “document” subdivision. Each of those reference texts is equal parts of Dante’s Divine Comedy, Book 16 of Pliny’s Natural History, Grosseteste’s De sphaera, and the Latin alchemical herbal. The Divine Comedy portion will cleanly separate out, as will the alchemical herbal section. De sphaera and Natural History will tend to resemble each other a bit more.
(22-08-2025, 04:11 PM)magnesium Wrote: Just to add here, Naibbe ciphertexts exhibit language-dependent and topic-dependent variation in word frequencies, for the simple reason that alphabet letter and plaintext bigram frequencies can vary by language and topic. For example, because it uses the word “herba” very often, a herbal plaintext might have more Bs than an astrology text, holding the language constant across both texts. As an exercise, I recommend that people do NMF on my reference Naibbe ciphertexts, using 1000-5000 tokens as the “document” subdivision. Each of those reference texts is equal parts of Dante’s Divine Comedy, Book 16 of Pliny’s Natural History, Grosseteste’s De sphaera, and the Latin alchemical herbal. The Divine Comedy portion will cleanly separate out, as will the alchemical herbal section. De sphaera and Natural History will tend to resemble each other a bit more.

Hello Magnesium,

I have some doubts about how to process your reference Naibbe ciphertexts.

- Shall I process all of them one after the other? If yes, in which order: the alphabetical order of the files?
- How should I divide the text into paragraphs or pages? Or should I take each file as one "page"? For topic processing I need to separate the texts so the program can find the topics of the texts (one topic per text). How should I separate the texts?

Once I have the pages (or the chunks of text that should each be assigned a topic), I can show you the results and we can see whether they are coherent. I don't know which real text is in each file.

Thank you
(21-08-2025, 08:38 PM)LisaFaginDavis Wrote: Claire Bowern and two of her PhD students also addressed the question of topic modeling in the VMS here:

[link]

Hello Lisa,

Thank you for sharing the paper — I found it very interesting.

I noticed that Claire Bowern and the two PhD students appear to have fixed the number of topics to 5, corresponding to the known sections of the manuscript. In contrast, in my own research, I allowed the models to determine the optimal number of topics, as I wanted to avoid imposing any predefined structure based on the manuscript's divisions.

In my most recent study using NMF, I evaluated the optimal number of topics using four different KPIs:
  • Pseudo-perplexity
  • Topic coherence
  • Topic overlap (Jaccard similarity)
  • Number of unique high-weight words

These metrics consistently pointed to 11 as the optimal number of topics, and all subsequent results are based on this configuration (for example, the low p-value for hands, indicating that the scribal hands are strongly linked to the topics found). A sketch of this kind of scan is below.
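To illustrate, here is a minimal sketch (not my exact pipeline) of scanning the number of NMF topics and scoring each k by one of the KPIs, the topic-overlap metric, computed as mean pairwise Jaccard similarity of the top-weighted words per topic; the toy corpus and parameter values are placeholders:

from itertools import combinations
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in the real study each document is one VMS paragraph.
docs = ["daiin chol chor cthy", "qokeedy qokain shedy", "okaiin daiin dain",
        "shedy qokeedy chedy", "chol chor shol", "qokain okain qokaiin"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
names = vec.get_feature_names_out()

def mean_jaccard_overlap(model, top_n=3):
    # Jaccard similarity between the top-n word sets of every topic pair;
    # lower overlap suggests more distinct topics.
    tops = [set(names[i] for i in comp.argsort()[-top_n:])
            for comp in model.components_]
    pairs = list(combinations(tops, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

for k in (2, 3, 4):
    nmf = NMF(n_components=k, init="nndsvd", random_state=0).fit(X)
    print(k, round(mean_jaccard_overlap(nmf), 3))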

Regards.
(22-08-2025, 10:10 PM)quimqu Wrote: You are not allowed to view links. Register or Login to view.Hello Magnesium,

I have doubts how to process your reference Naibbe ciphertexts.

- shall i process all of them one after the other? In case it is yes, which is the order, the alphabetical order of the files?
- how should I divide the paragraphs or pages?  Or should I take one file as a "page" every time. For topic processing I need to separate the texts so the progam can find the topics of he texts (one topic per text). How should I separate the texts?

Once I have the pages (or the chunks of texts that should be given a topic), I can show you the results and see if it is coherent. I don't know which real text is in each file.

Thank you

All 20 of the reference ciphertexts encrypt an identical 32,000-letter plaintext. The plaintext is equal parts (i.e., 8,000 letters each) of Pliny's Natural History (Book 16), Grosseteste's De sphaera, the Latin alchemical herbal, and Dante's Divine Comedy (though not necessarily in that order... it might be fun to try and figure out the ordering!). The ciphertexts are roughly 20,000 to 21,000 tokens long, with the variation stemming from random fluctuations in the application of the cipher.

If you want to analyze a given Naibbe ciphertext as if it were a synthetic Voynich B, divide each ciphertext into four equal portions, i.e., each ~5,000-5,500 tokens long, and then subdivide from there. Each fourth will roughly correspond to one of the original plaintext sections. There are no exact equivalents to folios in these ciphertexts, but you could explore the statistical effect of smaller subdivisions by treating each ciphertext as if it were a corpus of N different documents, each roughly (total/N) tokens long, just as you have been doing with the various folios of the VMS.
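In sketch form (assuming whitespace-separated tokens; the file name is hypothetical):

def split_into_documents(tokens, n_docs):
    # Split a token list into n_docs chunks of roughly equal length;
    # the last chunk absorbs any remainder.
    size = len(tokens) // n_docs
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n_docs - 1)]
    chunks.append(tokens[(n_docs - 1) * size:])
    return chunks

tokens = open("naibbe_ciphertext_01.txt").read().split()  # hypothetical file name
quarters = split_into_documents(tokens, 4)        # ~one per plaintext section
pseudo_folios = split_into_documents(tokens, 20)  # finer subdivision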
I don't think p-values mean anything here. I think we agreed that it made little sense for the languages, but as it stands it makes no sense for the hands either. The null hypothesis is total independence, that is, topics are uniformly randomly assigned to hands. Since we know that hands have certain preferences for languages, this makes the null hypothesis equally invalid, because it posits that there is no dependence of any kind.

Maybe in order to get some meaningful results, we need a less restrictive or more specific null hypothesis. For example, H₀ is "within a single language, topic assignment doesn't depend on hand" (that is, separate tests for language A and language B).
(22-08-2025, 11:19 PM)oshfdk Wrote: Maybe in order to get some meaningful results, we need a less restrictive or more specific null hypothesis. For example, H₀ is "within a single language, topic assignment doesn't depend on hand" (that is, separate tests for language A and language B).

This is a very interesting question.

I performed a Chi² test to examine the correlation between NMF-derived topics and scribal hands, separately for Currier languages A and B.
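In sketch form, the stratified test looks like this (the toy data and column names are illustrative, not the real assignments):

import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: one row per paragraph
df = pd.DataFrame({
    "language": ["A", "A", "A", "A", "B", "B", "B", "B"] * 5,
    "hand":     ["1", "4", "1", "4", "2", "3", "5", "2"] * 5,
    "topic":    ["t1", "t2", "t1", "t3", "t2", "t3", "t1", "t2"] * 5,
})

# Separate Chi² test of topic vs. hand within each Currier language
for lang, sub in df.groupby("language"):
    table = pd.crosstab(sub["topic"], sub["hand"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"Language {lang}: chi2={chi2:.2f}, p={p:.3g}, dof={dof}")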

Language A

[Image: hJVMqPZ.png]
  • Topics show very strong correlation with hands (hands 1 and 4).
  • Some topics are almost exclusive to one hand.
    Example: Topic 10 appears 85 times with hand 4, only once with hand 1.
  • Chi² = 348.76, p-value ≈ 7.3e-69 → extremely significant.

Language B


[Image: VEIeKNF.png]
  • Correlation is weaker, but still statistically significant.
  • Topic distributions are more spread across hands 2, 3, and 5.
  • Chi² = 64.13, p-value ≈ 1.6e-6 → weaker evidence than for Language A, but still statistically significant.

Conclusion
  • Topics are strongly linked to scribal hands, especially in Language A.
  • Perhaps the most important suggestion is that, according to these results, the scribes may not have been mere copyists but thematic contributors, at least for some sections of the manuscript. Remember that the topics are computed independently of hand and language, using only the paragraph inputs; the relation between the topics and the languages/hands is studied only after the topics are determined by the NMF model.
These results are intriguing, but I remain pseudo-perplexed about how the 11 topics "emerged," and whether their correlations with more familiar categories require any explanation.

Comment:
(21-08-2025, 08:19 PM)quimqu Wrote: A lower p-value (approaching 0) indicates stronger statistical association between the topic distribution and the given variable.
It is good practice to report the probability that a statistical inference is due to chance.  According to your calculations, this risk is astronomically low for correlations between the 11 "optimal" topics and either Currier language or scribal hand.  But the p value only quantifies confidence that a correlation exists, not the strength of correlation.  The interesting part of a statistically significant result is its effect size, which is measured by other means.

The association that you find between "optimal topic" and scribal-hand categories excels over the Currier categories not because of its lower p value (both nulls are rejected at any sane cutoff), but because of the stronger clustering of counts in the contingency table, which is sufficiently convincing by inspection: in [link], the table for scribal hand is clearly more contrasty than the one for Currier language. For [link], a quantitative measure of effect size might help; conventional for these data would be "Cramér's V."
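For concreteness, a minimal sketch of Cramér's V computed directly from a contingency table (without the small-sample bias correction):

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    # V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0
    # (no association) to 1 (complete association)
    table = np.asarray(table)
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

Applied to the counts already posted in this thread, it would put the topic × hand and topic × language associations on a common 0-1 scale, independent of sample size.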

Question:
How was the text sample partitioned for classification? The number of text blocks tagged as paragraphs in the RF transliteration is less than 300. Summing counts in your contingency tables, 891 "paragraphs" were sampled in [link], and 506 + 396 = 902 in [link]. Am I right to conclude that Automated Topic Analysis is able to distinguish between 11 different topics using a sample size of 34 000 words / 900 paragraphs = ~38 words/paragraph?

(Caution: The p value is an intrinsically confounded measure because, for any fixed nonzero effect, it decreases with increasing sample size, while effect size is independent of sample size. Thus it is notoriously easy to mine "rock solid" correlations simply by gathering large data sets. While formally valid, such effects are often so small as to be meaningless in practice.)