Automated Topic Analysis of the Voynich Manuscript - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Automated Topic Analysis of the Voynich Manuscript (/thread-4834.html)
RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 22-08-2025

(22-08-2025, 08:42 AM)oshfdk Wrote:
(22-08-2025, 07:53 AM)quimqu Wrote: The null hypothesis is:

You're right that the null hypothesis is problematic for languages, because both languages and topics are defined from the same textual features (word frequencies, symbol patterns, ...). That means independence is impossible by construction, so I agree that the p-value wouldn't really have a valid interpretation.

For the scribal hands it's different: hands are identified from handwriting features, not from lexical distributions (if I am not mistaken). So here the null hypothesis (that topics are independent of hands) makes sense, even if some correlation between hands and languages is already known.


RE: Automated Topic Analysis of the Voynich Manuscript - Mauro - 22-08-2025

(22-08-2025, 07:53 AM)quimqu Wrote: The p-value is the probability of obtaining a chi-square statistic at least as extreme as the observed one, assuming the null hypothesis is true.

This is also my understanding of p-values, but I'm bugged by the vanishingly small values you report, which look very unusual to me as p-values (amounting to rock-solid certainties if ever there were any). I also agree with oshfdk (and with your following answer): 'languages' and 'topics' are defined in the same way, so it's not surprising to find them correlated, while the correlation with the scribes is more interesting (*).

But the scribes too correlate with the sections of the manuscript: scribe 3 is mostly the Stars section, scribe 4 is Zodiac/Astronomy, scribe 1 mostly Botanical and Recipes, scribe 2 Botanical and Balneological. And there are big 'linguistic' differences between all of these sections (and thus between scribes). There are big differences even within the same scribe: 'qokain' is the 5th most frequent word in the Balneological section of scribe 2, but it appears only 6 times in the Botanical section of scribe 2. I actually find it plausible that each section+scribe piece into which the VMS can be divided can be seen as having been written in a different 'language'.

(*) But this does not remove my doubts about the p-values; they're really too small.


RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 22-08-2025

(22-08-2025, 10:38 AM)Mauro Wrote: (*) But this does not remove my doubts about the p-values; they're really too small.

Well, I used the SciPy function scipy.stats.chi2_contingency. It performs a Chi-square test of independence on a contingency table.

Input: a 2D table of observed frequencies (e.g., topics × languages, or topics × scribal hands).
Null hypothesis (H₀): the two categorical variables are independent.
Output:
- chi2 → the Chi-square statistic
- p → the p-value
- dof → degrees of freedom = (rows − 1) × (columns − 1)
- expected → the expected counts under H₀

The p I quoted in my post comes from this output.
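[Editor's sketch] For readers who want to reproduce this kind of test, a minimal example of the call. The counts here are made up for illustration, not the VMS data; attribute access on the result assumes SciPy ≥ 1.9, which matches the Chi2ContingencyResult output quoted below.

Code:
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows = topics, columns = scribal hands.
observed = np.array([
    [30,  5,  2],
    [ 4, 25,  6],
    [ 3,  7, 28],
])

res = chi2_contingency(observed)
print("chi2 =", res.statistic)
print("p    =", res.pvalue)
print("dof  =", res.dof)             # (3 - 1) * (3 - 1) = 4
print("expected:\n", res.expected_freq)

As a consistency check on the full output below: with 11 topics, dof = (11 − 1)(2 − 1) = 10 for topics × languages and dof = (11 − 1)(5 − 1) = 40 for topics × hands, exactly as reported.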
This is the full output, with all four fields per table.

Topic vs language:

Chi2ContingencyResult(
    statistic=177.25623014793612,
    pvalue=8.693479383813543e-33,
    dof=10,
    expected_freq=array([
        [23.93714927, 19.06285073],
        [50.1010101 , 39.8989899 ],
        [65.68799102, 52.31200898],
        [51.21436588, 40.78563412],
        [36.18406285, 28.81593715],
        [39.52413019, 31.47586981],
        [30.06060606, 23.93939394],
        [48.98765432, 39.01234568],
        [41.75084175, 33.24915825],
        [46.20426487, 36.79573513],
        [62.34792368, 49.65207632]]))

Topic vs hand:

Chi2ContingencyResult(
    statistic=689.3808658898682,
    pvalue=2.8408917882824174e-119,
    dof=40,
    expected_freq=array([
        [ 7.335578  , 16.02244669,  2.41301908, 16.60157127,  0.62738496],
        [15.35353535, 33.53535354,  5.05050505, 34.74747475,  1.31313131],
        [20.1301908 , 43.96857464,  6.62177329, 45.55780022,  1.72166105],
        [15.69472503, 34.28058361,  5.1627385 , 35.51964085,  1.34231201],
        [11.08866442, 24.21997755,  3.64758698, 25.09539843,  0.94837262],
        [12.11223345, 26.45566779,  3.98428732, 27.41189675,  1.0359147 ],
        [ 9.21212121, 20.12121212,  3.03030303, 20.84848485,  0.78787879],
        [15.01234568, 32.79012346,  4.9382716 , 33.97530864,  1.28395062],
        [12.79461279, 27.94612795,  4.20875421, 28.95622896,  1.09427609],
        [14.15937149, 30.92704826,  4.65768799, 32.04489338,  1.21099888],
        [19.10662177, 41.7328844 ,  6.28507295, 43.24130191,  1.63411897]]))


RE: Automated Topic Analysis of the Voynich Manuscript - magnesium - 22-08-2025

(21-08-2025, 08:43 PM)davidma Wrote:
(21-08-2025, 08:19 PM)quimqu Wrote:

Just to add here, Naibbe ciphertexts exhibit language-dependent and topic-dependent variation in word frequencies, for the simple reason that alphabet letter and plaintext bigram frequencies can vary by language and topic. For example, because it uses the word "herba" very often, a herbal plaintext might have more Bs than an astrology text, holding the language constant across both texts.

As an exercise, I recommend that people do NMF on my reference Naibbe ciphertexts, using 1000-5000 tokens as the "document" subdivision. Each of those reference texts is equal parts of Dante's Divine Comedy, Book 16 of Pliny's Natural History, Grosseteste's De sphaera, and the Latin alchemical herbal. The Divine Comedy portion will cleanly separate out, as will the alchemical herbal section. De sphaera and Natural History will tend to resemble each other a bit more.


RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 22-08-2025

(22-08-2025, 04:11 PM)magnesium Wrote: Just to add here, Naibbe ciphertexts exhibit language-dependent and topic-dependent variation in word frequencies [...]
Hello Magnesium,

I have some doubts about how to process your reference Naibbe ciphertexts:

- Shall I process all of them one after the other? If so, in which order: the alphabetical order of the files?
- How should I divide them into paragraphs or pages? Or should I treat each file as one "page"? For topic modelling I need to split the text into documents so the program can find the topic of each one (one topic per document). How should I split the texts?

Once I have the pages (or the chunks of text that should be assigned a topic), I can show you the results and we can see whether they are coherent. I don't know which real text is in each file.

Thank you


RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 22-08-2025

(21-08-2025, 08:38 PM)LisaFaginDavis Wrote: Claire Bowern and two of her PhD students also addressed the question of topic modeling in the VMS here:

Hello Lisa,

Thank you for sharing the paper; I found it very interesting. I noticed that Claire Bowern and her two PhD students appear to have fixed the number of topics at 5, corresponding to the known sections of the manuscript. In contrast, in my own research I allowed the models to determine the optimal number of topics, as I wanted to avoid imposing any predefined structure based on the manuscript's divisions. In my most recent study using NMF, I evaluated the optimal number of topics using four different KPIs; a sketch of this kind of sweep is shown below.
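[Editor's sketch] For illustration only, a minimal sweep over candidate topic counts using scikit-learn's NMF on synthetic data, with a single KPI (the reconstruction error) standing in for the four used in the study; this is not quimqu's actual pipeline.

Code:
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Synthetic stand-in corpus: in practice, one string of EVA words
# per page (or paragraph) of the transliteration.
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(200)]
docs = [" ".join(rng.choice(vocab, size=40)) for _ in range(100)]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)

# Sweep the number of topics and record one model-selection metric.
for k in range(2, 16):
    model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=400)
    model.fit(X)
    print(k, round(model.reconstruction_err_, 2))

In practice one would look for the k where the metrics jointly plateau or peak, rather than the raw minimum of any single one.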
These metrics consistently pointed to 11 as the optimal number of topics, and all subsequent results were based on this configuration (for example the low p-value for the hands, which tells us that the scribal hands are strongly linked to the topics found).

Regards


RE: Automated Topic Analysis of the Voynich Manuscript - magnesium - 22-08-2025

(22-08-2025, 10:10 PM)quimqu Wrote: Hello Magnesium, [...]

All 20 of the reference ciphertexts encrypt an identical 32,000-letter plaintext. The plaintext is equal parts (8,000 letters each) of Pliny's Natural History (Book 16), Grosseteste's De sphaera, the Latin alchemical herbal, and Dante's Divine Comedy (though not necessarily in that order... it might be fun to try and figure out the ordering!). The ciphertexts vary between roughly 20,000 and 21,000 tokens, with the variation stemming from random fluctuations in the application of the cipher.

If you want to analyze a given Naibbe ciphertext as if it were a synthetic Voynich B, divide each ciphertext into four equal portions, each ~5,000-5,500 tokens long, and then subdivide from there. Each fourth will roughly correspond to one of the original plaintext sections. There are no exact equivalents to folios in these ciphertexts, but you could explore the statistical effect of smaller subdivisions by treating each ciphertext as a corpus of N documents, each roughly (total/N) tokens long, just as you have been doing with the various folios of the VMS.


RE: Automated Topic Analysis of the Voynich Manuscript - oshfdk - 22-08-2025

I don't think p-values mean anything here. I think we agreed that they made little sense for the languages, but as it stands the test makes no sense for the hands either. The null hypothesis is total independence, i.e. topics are uniformly randomly assigned to hands. Since we know that hands have certain preferences for languages, this makes the null hypothesis equally invalid, because it posits that there is no dependence of any kind.

Maybe in order to get some meaningful results, we need a less restrictive or more specific null hypothesis. For example, H0 is "within a single language, topic assignment doesn't depend on hands" (that is, separate tests for language A and language B).


RE: Automated Topic Analysis of the Voynich Manuscript - quimqu - 23-08-2025

(22-08-2025, 11:19 PM)oshfdk Wrote: Maybe in order to get some meaningful results, we need a less restrictive or more specific null hypothesis. For example, H0 is "within a single language, topic assignment doesn't depend on hands" (that is, separate tests for language A and language B).

This is a very interesting question. I performed a Chi² test to examine the correlation between NMF-derived topics and scribal hands, separately for Currier languages A and B (a sketch of the stratified test follows the tables).

Language A

[image: contingency table and Chi² results for Language A]
Language B

[image: contingency table and Chi² results for Language B]
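[Editor's sketch] A minimal version of this stratified test, assuming a hypothetical per-paragraph table with topic, hand, and language columns; the column names and values are illustrative, not the actual pipeline.

Code:
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical labels, one row per paragraph: topic from NMF,
# hand from the palaeographic attribution, language from Currier.
df = pd.DataFrame({
    "topic":    [0, 1, 2, 0, 1, 2, 0, 1],
    "hand":     [1, 2, 3, 1, 2, 3, 2, 1],
    "language": ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# One independence test per Currier language, per oshfdk's H0.
for lang, sub in df.groupby("language"):
    table = pd.crosstab(sub["topic"], sub["hand"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"Language {lang}: chi2={chi2:.2f}, p={p:.3g}, dof={dof}")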
Conclusion
RE: Automated Topic Analysis of the Voynich Manuscript - obelus - 24-08-2025

These results are intriguing, but I remain pseudo-perplexed about how the 11 topics "emerged," and whether their correlations with more familiar categories require any explanation.

Comment:

(21-08-2025, 08:19 PM)quimqu Wrote: A lower p-value (approaching 0) indicates stronger statistical association between the topic distribution and the given variable.

It is good practice to report the probability that a statistical inference is due to chance. According to your calculations, this risk is astronomically low for correlations between the 11 "optimal" topics and either Currier language or scribal hand. But the p-value only quantifies confidence that a correlation exists, not the strength of the correlation. The interesting part of a statistically significant result is its effect size, which is measured by other means. The association you find between "optimal topic" and scribal-hand categories excels over the Currier categories not because of its lower p-value (both nulls are rejected at any sane cutoff), but because of the stronger clustering of counts in the contingency table, which is convincing enough by inspection: the table for scribal hand is clearly more contrasty than the one for Currier language. For the Language A and B tables, a quantitative measure of effect size might help; conventional for these data would be Cramér's V.

Question:

How was the text sample partitioned for classification? The number of text blocks tagged as paragraphs in the RF transliteration is fewer than 300. Summing counts in your contingency tables, 891 "paragraphs" were sampled in the topic vs language and topic vs hand tables, and 506 + 396 = 902 in the Language A and B tables. Am I right to conclude that Automated Topic Analysis is able to distinguish between 11 different topics using a sample size of 34,000 words / 900 paragraphs ≈ 38 words per paragraph?

(Caution: the p-value is an intrinsically confounded measure because it always decreases with increasing sample size, while effect size is independent of sample size. It is thus notoriously easy to mine "rock-solid" correlations simply by gathering large data sets. While formally valid, such effects are often so small as to be meaningless in practice.)
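[Editor's sketch] Following up on the effect-size suggestion: Cramér's V rescales the Chi² statistic by the sample size and table shape, giving a 0-to-1 measure of association. A minimal version with placeholder counts (not the VMS data); recent SciPy versions also provide scipy.stats.contingency.association(table, method="cramer") for the same quantity.

Code:
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2 = chi2_contingency(table).statistic   # SciPy >= 1.9
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Placeholder counts; substitute the topic x hand or topic x language table.
table = np.array([
    [25,  5],
    [ 6, 30],
    [10, 12],
])
print(round(cramers_v(table), 3))  # 0 = no association, 1 = perfect

Unlike the p-value, V does not shrink toward zero as the sample grows, which addresses the confounding obelus describes in the closing caution.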