
Automated Topic Analysis of the Voynich Manuscript

RE: Automated Topic Analysis of the Voynich Manuscript

quimqu > 24-08-2025, 09:41 AM
Hello, Obelus,

thank you for reading my posts and raising your comments.

(24-08-2025, 07:24 AM)obelus Wrote: a quantitative measure of effect size might help; conventional for these data would be "Cramér's V."

I should say that one of my first metrics for choosing the optimal number of topics was Cramér's V. At that time, I didn't distinguish between language A's hands and language B's hands; I just looked at languages and hands. The results are shown here. You can see that for hands the best number of topics was 3, but I wanted more granularity, and 11 can also be OK (it is just after the bump). For languages, k=11 can also be OK (it is where the curve flattens):

Nevertheless, I calculated Cramér's V for both contingency tables (language A's hands vs. topics and language B's hands vs. topics). The result is surprisingly high for language A's hands vs. topics, 0.830, and much lower for language B's hands vs. topics, 0.264 (the scale runs from 0 to 1). Language A's hands and topics are very strongly associated.
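For readers who want to reproduce this kind of figure, here is a minimal sketch of Cramér's V, computed as sqrt(chi² / (n · min(rows − 1, columns − 1))). The function name and the toy tables are illustrative only; they are not the hands-vs-topics tables from the analysis, and the sketch assumes no row or column of the table is entirely empty.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of equal-length rows.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by sample size and the smaller table dimension minus one.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Toy tables (rows = hands, columns = topics), not real Voynich counts.
perfect = [[10, 0], [0, 10]]      # each hand uses exactly one topic
independent = [[5, 5], [5, 5]]    # hands and topics unrelated
print(cramers_v(perfect), cramers_v(independent))
```

A value near 1 means that knowing the hand almost determines the topic; a value near 0 means the two are close to independent.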

(24-08-2025, 07:24 AM)obelus Wrote: How was the text sample partitioned for classification? The number of text blocks tagged as paragraphs in the RF transliteration is less than 300.

Here is a summary of the steps I followed to create the paragraphs:
- In the file ZL3b-n.txt (EVA transliteration), the start of a paragraph is marked with <%> and the end with <$>.
- I joined the marked paragraphs into single lines.
- The remaining lines (those outside of marked paragraphs) were left as they were. So in the end, I had a mix of paragraph lines and other types of content (e.g., labels or short sentences from astrological folios).
- I removed lines that contained only 1 or 2 "words", and repeated the process as many times as necessary to ensure that, for example, a line originally containing 3 words wouldn't leave behind a line with just 1 word after cleaning.
- At first, I considered removing the most common "words" for topic detection. However, I decided not to, since even if we suspect that words like daiin are not meaningful, we can't be completely sure—so I didn't remove any words at all.
- This resulted in a total of 891 "paragraphs".
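The steps above can be sketched roughly as follows. This is not the original script: the function names are made up, the input is assumed to be a list of text lines already stripped of locus tags and comments, and words are assumed to be separated by ".", "," or whitespace. The short-line filter is applied in a single pass here, which is enough for this simplified version.

```python
import re


def join_paragraphs(lines, start="<%>", end="<$>"):
    """Merge lines between a start and an end marker into one unit each.

    Lines outside any marked paragraph are kept as separate units.
    """
    units = []
    buffer = None  # None means "not inside a marked paragraph"
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if buffer is None and start in text:
            buffer = []
        if buffer is not None:
            buffer.append(text.replace(start, "").replace(end, ""))
            if end in text:
                units.append(".".join(part for part in buffer if part))
                buffer = None
        else:
            units.append(text)
    if buffer:  # paragraph that was opened but never closed
        units.append(".".join(part for part in buffer if part))
    return units


def split_words(unit):
    """Split a unit into words (assumed separators: '.', ',' and whitespace)."""
    return [w for w in re.split(r"[.,\s]+", unit) if w]


def drop_short(units, min_words=3):
    """Discard units with fewer than min_words words (i.e. 1 or 2 words)."""
    return [u for u in units if len(split_words(u)) >= min_words]


sample = ["<%>daiin.chol", "shol.chor<$>", "otol", "qokeedy.qokain.chedy"]
print(drop_short(join_paragraphs(sample)))
```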
If you have any further questions, please feel free to ask!
RE: Automated Topic Analysis of the Voynich Manuscript

oshfdk > 24-08-2025, 01:23 PM

(23-08-2025, 11:28 PM)quimqu Wrote: This is a very interesting question.

I performed a Chi² test to examine the correlation between NMF-derived topics and scribal hands, separately for Currier languages A and B.

I'm not sure about the details of the implementation, but I think for clarity this should be run from scratch, first identifying the topics using NMF separately for language A and language B. Just to avoid any possibility that the language information affects the outcome.
RE: Automated Topic Analysis of the Voynich Manuscript

quimqu > 24-08-2025, 01:41 PM

(24-08-2025, 01:23 PM)oshfdk Wrote: this should be run from scratch, first identifying the topics using NMF separately for language A and language B. Just to avoid any possibility that the language information affects the outcome.

Hello oshfdk,

I’ve already run the NMF topic modeling separately for languages A and B. The results for language A are quite consistent, while language B appears a bit messier in terms of topic structure. However, splitting by language goes beyond my current goal.

What I’m aiming for is to detect which paragraphs “talk” about the same topic, locate them throughout the manuscript, view the illustrations on those pages, and try to infer what the content might be about. Of course, this is a very challenging task, since we have no clear understanding of the visual design or semantic meaning, but I wanted to explore this relationship across the entire manuscript.

If I follow your suggestion, I end up with topics 0 to N for language A and topics 0 to M for language B, but then I lose any potential connection between the two sets of topics. So, I feel this approach would fragment the analysis too much.

Topic modeling algorithms like NMF are often used in contexts such as news categorization—grouping articles into topics like politics, sports, etc. These models don’t understand the meaning of the text per se, but they can learn statistical associations between words. In known languages, we can assign semantic labels to topics once the model clusters them. But in the case of the Voynich Manuscript, we can’t yet interpret the clusters in semantic terms—so we just observe their structure and distribution.

Best regards
RE: Automated Topic Analysis of the Voynich Manuscript

oshfdk > 24-08-2025, 03:05 PM

(24-08-2025, 01:41 PM)quimqu Wrote: If I follow your suggestion, I end up with topics 0 to N for language A and topics 0 to M for language B, but then I lose any potential connection between the two sets of topics. So, I feel this approach would fragment the analysis too much.

I'm not sure I agree. If the topics correspond to some actual underlying feature, I think it won't be hard to perform cross detection and identify the mappings between N topics for A and M topics for B. After all, you will have a model that detects topics, as far as I understand. If there is a clean match between A topics and B topics, this would boost the claim that the topics are not a random statistical split of data, but a grouping that is naturally present in the manuscript. On the other hand, if NMF topic modeling for A and NMF topic modeling for B produce incompatible topics with no connection, then what is the use of combining them into a single model? Obviously that would mean that either this separation into topics is spurious or topics exist separately in language A and language B.
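One possible way to do such cross detection, sketched under assumptions: represent each topic as a mapping from word to weight (for example a row of the NMF topic-word matrix) and pair each A topic with the B topic whose weight vector has the highest cosine similarity. The function names and toy weights below are hypothetical, and cosine similarity is only one of several reasonable choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two word -> weight dictionaries."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_topics(topics_a, topics_b):
    """For each topic in A, return (index_a, best_index_b, similarity)."""
    matches = []
    for i, topic_a in enumerate(topics_a):
        sims = [cosine(topic_a, topic_b) for topic_b in topics_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((i, best, sims[best]))
    return matches


# Toy weights, not output from a real model.
topics_a = [{"daiin": 1.0, "chol": 0.5}, {"qokeedy": 1.0}]
topics_b = [{"qokeedy": 2.0, "shedy": 0.1}, {"daiin": 0.9, "chol": 0.6}]
for index_a, index_b, sim in match_topics(topics_a, topics_b):
    print(f"A topic {index_a} best matches B topic {index_b} (cosine {sim:.2f})")
```

Uniformly low best-match similarities would point toward the "incompatible topics" outcome; a near one-to-one pairing with high similarities would point toward the "clean match" outcome.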

Also, because you mentioned deciphering in another thread, I need to clarify that personally I don't find topic modeling of the Voynich Manuscript useful for deciphering attempts, so if you don't agree with my view above, it certainly doesn't make sense to adjust your approach. Unfortunately, I do not see any plausible path from modeling topics (or any other macro scale analysis) to actually deciphering the MS.
RE: Automated Topic Analysis of the Voynich Manuscript

quimqu > 24-08-2025, 07:24 PM
(22-08-2025, 11:11 PM)magnesium Wrote: If you want to analyze a given Naibbe ciphertext as if it were a synthetic Voynich B, divide each ciphertext into four equal portions, aka each ~5000-5500 tokens long, and then subdivide from there. Each fourth will roughly correspond with one of the original plaintext sections. There are no exact equivalents to folios in these ciphertexts, but you could explore the statistical effect of smaller subdivisions by treating each ciphertext as if it were a corpus of N different documents each one roughly (total/N) tokens long, just as you have been doing with the various folios of the VMS.

Hello Magnesium,

I ran a topic detection experiment using NMF on the file /kaggle/input/voynich/naibbe_Cleaned_52_01_10_word_lines.txt and wanted to share some observations.

Paragraph Structure: I grouped the text into paragraphs of 10 lines each. That is, every 10 lines were concatenated into a single paragraph for analysis.
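A minimal sketch of that grouping, assuming the file has one text line per row and that a trailing group of fewer than 10 lines is kept as its own shorter paragraph (how the remainder was handled is not stated, so that part is a guess). The function name is illustrative.

```python
def group_lines(lines, size=10):
    """Concatenate every `size` consecutive non-empty lines into one paragraph."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return [
        " ".join(cleaned[i:i + size])
        for i in range(0, len(cleaned), size)
    ]


# 25 dummy lines give two full paragraphs and one shorter trailing paragraph.
dummy = [f"word{i} word{i + 1}" for i in range(25)]
paragraphs = group_lines(dummy)
print(len(paragraphs))
```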

Optimal Number of Topics: The optimal number of topics identified by the model was 6. Here's the topic distribution across paragraphs:

Since I’m not entirely familiar with the content structure of the Naibbe cipher, I can’t fully assess whether the topics align with actual semantic groupings. But here’s something I found interesting:

I generated topic–paragraph heatmaps to understand how distinctive each topic is. Here's how to read them:
- Y-axis: Paragraph index (each paragraph corresponds to 10 lines of the original text).
- X-axis: Topic number.
- Cell value: For each topic, I selected the high-weight words (those above the mean + 3×std). Then, for each paragraph, I counted how many of these words appear. This helps visualize whether a paragraph is dominated by its assigned topic or shares vocabulary with others.
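The cell values described above could be computed along these lines. This is a sketch under assumptions: each topic is a word -> weight dictionary, the standard deviation is the population one, and every occurrence of a high-weight word in a paragraph is counted (counting distinct words instead would be an equally plausible reading). All names are illustrative.

```python
from statistics import mean, pstdev


def high_weight_words(topic_weights, n_std=3.0):
    """Words whose weight exceeds mean + n_std * std within one topic."""
    values = list(topic_weights.values())
    threshold = mean(values) + n_std * pstdev(values)
    return {word for word, weight in topic_weights.items() if weight > threshold}


def heatmap_counts(paragraphs, topics, n_std=3.0):
    """Rows = paragraphs, columns = topics, cell = count of high-weight tokens."""
    keyword_sets = [high_weight_words(topic, n_std) for topic in topics]
    rows = []
    for paragraph in paragraphs:
        tokens = paragraph.split()
        rows.append([sum(tok in kws for tok in tokens) for kws in keyword_sets])
    return rows


# Toy topics: twenty zero-weight filler words plus one dominant word each.
filler = {f"w{i}": 0.0 for i in range(20)}
topic0 = {**filler, "daiin": 1.0}
topic1 = {**filler, "qokeedy": 1.0}
print(heatmap_counts(["daiin chol daiin", "qokeedy daiin"], [topic0, topic1]))
```

A paragraph dominated by one topic shows a single large cell in its row; a row with several comparable cells indicates vocabulary shared across topics.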
That said, for the MS topics, the heatmaps look as follows:

As you can see, the topics are quite well-separated. Most paragraphs are dominated by the vocabulary of a single topic, with limited overlap. This suggests that the MS text, even if undeciphered, has internal structure that the model can detect.

But for the Naibbe cipher, the heatmaps look as follows:

In contrast, the Naibbe cipher shows significant vocabulary overlap between topics. Many paragraphs contain high-weight words from multiple topics, making them harder to distinguish. This suggests that the output from the Naibbe cipher is relatively homogeneous across the text and topics.

Interpretation
- The MS text appears to be more topically diverse and structured, which supports the idea that it contains meaningful internal variation (even if we don’t understand the language).
- The Naibbe cipher output seems too uniform across paragraphs, which might indicate it's closer to random output or lacks internal semantic variation — at least from the point of view of a topic model.
What do you think of these results?
RE: Automated Topic Analysis of the Voynich Manuscript

quimqu > 25-08-2025, 09:27 PM

This may be my last post about the topics detected with NMF. I don't know whether it can help with deciphering or understanding the MS, but I'm posting this figure, which I found interesting and funny at the same time.

The figure shows a timeline of NMF topics across the manuscript, binned by folio (and side).

Each vertical slice corresponds to one folio and side; the stacked areas in that slice sum to 100% and show the mixture of topics detected on that folio. Colors are unique per topic (T0, T1, …), and the letters at the top mark the section of the manuscript for the corresponding range of folios (H=Herbal, P=Pharmaceutical, A=Astronomical, B=Biological, T=Text-only, R=Recipes).

Rare topics are suppressed. I used a 15% threshold (topics contributing <15% in a bin are set to 0). The 15% threshold emphasizes dominant topics in each bin; lowering it will reintroduce rare topics (more bands), raising it will focus on only the strongest themes.
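As a rough illustration of the suppression step: convert each bin's topic weights to fractions, zero out anything under the threshold, and optionally rescale so the slice again sums to 100%. Whether the figure rescales after suppression is not stated, so the `renormalize` flag is an assumption, as is the function name.

```python
def suppress_rare(weights, threshold=0.15, renormalize=True):
    """Zero out topics contributing less than `threshold` of a bin's total."""
    total = sum(weights)
    if total == 0:
        return [0.0] * len(weights)
    fractions = [w / total for w in weights]
    kept = [f if f >= threshold else 0.0 for f in fractions]
    kept_total = sum(kept)
    if renormalize and kept_total > 0:
        # Rescale the surviving topics so the slice sums to 1 again.
        kept = [f / kept_total for f in kept]
    return kept


# One folio side with three topics at 50%, 40% and 10%.
print(suppress_rare([50, 40, 10], renormalize=False))
print(suppress_rare([50, 40, 10], renormalize=True))
```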

What I think is interesting is that, overall, this timeline compresses the manuscript’s topic structure into a single, readable view, showing where topics dominate, where they mix, and how they evolve across folios and sections.

If anyone finds this useful, I will be happy.

Either way, I’ll keep pulling on this topic-detection thread and see what I get.