The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
Hello, Obelus,

Thank you for reading my posts and raising your comments.

(24-08-2025, 07:24 AM)obelus Wrote: a quantitative measure of effect size might help; conventional for these data would be "Cramér's V."

I should say that one of my first metrics for choosing the optimal number of topics was Cramér's V. At that time, I didn't distinguish between language A's hands and language B's hands; I just focused on languages and hands. The results were as follows. You can see that for hands the best topic number was 3, but I wanted more granularity, and k=11 can also work (it is just after the bump); for languages, k=11 is also reasonable (it is where the curve flattens):

[Image: UfpWxy5.png]

Nevertheless, I calculated Cramér's V for both contingency tables (language A's hands vs. topics and language B's hands vs. topics). The result is surprisingly high for language A's hands vs. topics (0.830) and not so good for language B's hands vs. topics (0.264), on a scale from 0 to 1. Language A's hands vs. topics show an almost perfect association.
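For anyone who wants to reproduce the effect-size numbers, here is a minimal sketch of Cramér's V computed from a contingency table with SciPy. The table below is purely illustrative, not the actual hands-vs-topics counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for an r x c contingency table, bounded in [0, 1]."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    # V = sqrt(chi2 / (n * (min(r, c) - 1)))
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Illustrative hands-vs-topics counts (not the real data)
table = np.array([
    [30,  2,  1],
    [ 3, 25,  2],
    [ 1,  4, 28],
])
print(round(cramers_v(table), 3))
```

A value near 1 (as with language A's hands) indicates that knowing the hand almost determines the topic, and vice versa.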

(24-08-2025, 07:24 AM)obelus Wrote: How was the text sample partitioned for classification?  The number of text blocks tagged as paragraphs in the RF transliteration is less than 300.

Here is a summary of the steps I followed to create the paragraphs:
  • In the file ZL3b-n.txt (EVA transliteration), the start of a paragraph is marked with <%> and the end with <$>.
  • I joined the marked paragraphs into single lines.
  • The remaining lines (those outside of marked paragraphs) were left as they were. So in the end, I had a mix of paragraph lines and other types of content (e.g., labels or short sentences from astrological folios).
  • I removed lines that contained only 1 or 2 "words", and repeated the process as many times as necessary to ensure that, for example, a line originally containing 3 words wouldn't leave behind a line with just 1 word after cleaning.
  • At first, I considered removing the most common "words" for topic detection. However, I decided not to, since even if we suspect that words like daiin are not meaningful, we can't be completely sure—so I didn't remove any words at all.
  • This resulted in a total of 891 "paragraphs".
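The steps above can be sketched as follows. This assumes a simplified line format where <%> marks a paragraph start and <$> a paragraph end; the real ZL3b-n.txt carries richer markup that would need stripping first:

```python
def build_paragraphs(lines, min_words=3):
    """Join lines between <%> (paragraph start) and <$> (paragraph end)
    into single strings; pass other lines (labels, short sentences)
    through unchanged; drop units shorter than min_words."""
    paragraphs, buffer, in_par = [], [], False
    for line in lines:
        if "<%>" in line:
            in_par, buffer = True, []
        if in_par:
            buffer.append(line.replace("<%>", "").replace("<$>", "").strip())
        else:
            paragraphs.append(line.strip())
        if "<$>" in line:
            paragraphs.append(" ".join(buffer))
            in_par, buffer = False, []
    # Remove units too short to carry topical signal
    return [p for p in paragraphs if len(p.split()) >= min_words]

lines = [
    "<%> daiin chedy qokeedy",
    "shedy qokain <$>",
    "otol",            # 1-word label: dropped
    "okal dar shey",   # kept as its own unit
]
print(build_paragraphs(lines))
# → ['daiin chedy qokeedy shedy qokain', 'okal dar shey']
```

The iterative re-cleaning described above would correspond to rerunning the final filter after any further token removal.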

If you have any further questions, please feel free to ask!
(23-08-2025, 11:28 PM)quimqu Wrote: This is a very interesting question.

I performed a Chi² test to examine the correlation between NMF-derived topics and scribal hands, separately for Currier languages A and B.

I'm not sure about the details of the implementation, but I think for clarity this should be run from scratch, first identifying the topics using NMF separately for language A and language B. Just to avoid any possibility that the language information affects the outcome.
(24-08-2025, 01:23 PM)oshfdk Wrote: this should be run from scratch, first identifying the topics using NMF separately for language A and language B. Just to avoid any possibility that the language information affects the outcome.

Hello oshfdk,

I’ve already run the NMF topic modeling separately for languages A and B. The results for language A are quite consistent, while language B appears a bit messier in terms of topic structure. However, splitting by language goes beyond my current goal.

What I’m aiming for is to detect which paragraphs “talk” about the same topic, locate them throughout the manuscript, view the illustrations on those pages, and try to infer what the content might be about. Of course, this is a very challenging task, since we have no clear understanding of the visual design or semantic meaning, but I wanted to explore this relationship across the entire manuscript.

If I follow your suggestion, I end up with topics 0 to N for language A and topics 0 to M for language B, but then I lose any potential connection between the two sets of topics. So, I feel this approach would fragment the analysis too much.

Topic modeling algorithms like NMF are often used in contexts such as news categorization—grouping articles into topics like politics, sports, etc. These models don’t understand the meaning of the text per se, but they can learn statistical associations between words. In known languages, we can assign semantic labels to topics once the model clusters them. But in the case of the Voynich Manuscript, we can’t yet interpret the clusters in semantic terms—so we just observe their structure and distribution.

Best regards
(24-08-2025, 01:41 PM)quimqu Wrote: If I follow your suggestion, I end up with topics 0 to N for language A and topics 0 to M for language B, but then I lose any potential connection between the two sets of topics. So, I feel this approach would fragment the analysis too much.

I'm not sure I agree. If the topics correspond to some actual underlying feature, I think it won't be hard to perform cross detection and identify the mappings between N topics for A and M topics for B. After all, you will have a model that detects topics, as far as I understand. If there is a clean match between A topics and B topics, this would boost the claim that the topics are not a random statistical split of data, but a grouping that is naturally present in the manuscript. On the other hand, if NMF topic modeling for A and NMF topic modeling for B produce incompatible topics with no connection, then what is the use of combining them into a single model? Obviously that would mean that either this separation into topics is spurious or topics exist separately in language A and language B.
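The cross-detection suggested here is cheap to try: since each NMF gives a topic-word matrix, the A topics and B topics can be matched over a shared vocabulary by cosine similarity plus an optimal assignment. This is one possible implementation of the idea, not a method anyone in the thread has already run; the toy matrices are invented:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_topics(H_a, H_b):
    """Optimal 1:1 mapping between two topic-word matrices
    (rows = topics, columns = a shared vocabulary) by cosine similarity."""
    A = H_a / np.linalg.norm(H_a, axis=1, keepdims=True)
    B = H_b / np.linalg.norm(H_b, axis=1, keepdims=True)
    sim = A @ B.T                              # topic-A x topic-B cosines
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
    return pairs, sim[rows, cols]

# Toy example: B's topics are A's topics in a different order
H_a = np.array([[5., 1., 0., 0.],
                [0., 0., 4., 2.]])
H_b = np.array([[0., 0., 5., 1.],
                [4., 2., 0., 0.]])
pairs, sims = match_topics(H_a, H_b)
print(pairs)   # A topic 0 pairs with B topic 1, A topic 1 with B topic 0
```

High similarities along the matched pairs would support the "naturally present grouping" interpretation; uniformly low ones would support the "separate or spurious topics" alternative.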

Also, because you mentioned deciphering in another thread, I need to clarify that personally I don't find topic modeling of the Voynich Manuscript useful for deciphering attempts, so if you don't agree with my view above, it certainly doesn't make sense to adjust your approach. Unfortunately, I do not see any plausible path from modeling topics (or any other macro scale analysis) to actually deciphering the MS.
(22-08-2025, 11:11 PM)magnesium Wrote: If you want to analyze a given Naibbe ciphertext as if it were a synthetic Voynich B, divide each ciphertext into four equal portions, aka each ~5000-5500 tokens long, and then subdivide from there. Each fourth will roughly correspond with one of the original plaintext sections. There are no exact equivalents to folios in these ciphertexts, but you could explore the statistical effect of smaller subdivisions by treating each ciphertext as if it were a corpus of N different documents each one roughly (total/N) tokens long, just as you have been doing with the various folios of the VMS.

Hello Magnesium,

I ran a topic detection experiment using NMF on the file /kaggle/input/voynich/naibbe_Cleaned_52_01_10_word_lines.txt and wanted to share some observations.

Paragraph Structure: I grouped the text into paragraphs of 10 lines each. That is, every 10 lines were concatenated into a single paragraph for analysis.
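The grouping step is just fixed-size chunking; a minimal sketch (the 10-line window matches the description above, the sample data is invented):

```python
def chunk_lines(lines, size=10):
    """Concatenate every `size` consecutive lines into one pseudo-paragraph."""
    return [" ".join(lines[i:i + size]) for i in range(0, len(lines), size)]

lines = [f"word{i}" for i in range(25)]
chunks = chunk_lines(lines, size=10)
print(len(chunks))             # → 3 (two full chunks plus one partial)
print(len(chunks[0].split()))  # → 10
```

Note the last chunk is shorter than the rest; whether to keep or drop it is a judgment call.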

Optimal Number of Topics: The optimal number of topics identified by the model was 6. Here's the topic distribution across paragraphs:


[Image: 46TvfZz.png]

Since I’m not entirely familiar with the content structure of the Naibbe cipher, I can’t fully assess whether the topics align with actual semantic groupings. But here’s something I found interesting:

I generated topic–paragraph heatmaps to understand how distinctive each topic is. Here's how to read them:

  • Y-axis: Paragraph index (each paragraph corresponds to 10 lines of the original text).
  • X-axis: Topic number.
  • Cell value: For each topic, I selected the high-weight words (those above the mean + 3×std). Then, for each paragraph, I counted how many of these words appear. This helps visualize whether a paragraph is dominated by its assigned topic or shares vocabulary with others.
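The cell values described above can be sketched as follows. The toy topic-word matrix and vocabulary are invented; the only fixed ingredient is the mean + 3×std cutoff per topic row:

```python
import numpy as np

def topic_keyword_counts(H, doc_tokens, vocab):
    """For each topic, take words whose weight in that topic's row of H
    exceeds mean + 3*std, then count how many of those high-weight words
    occur in each paragraph (rows = paragraphs, cols = topics)."""
    counts = np.zeros((len(doc_tokens), H.shape[0]), dtype=int)
    for t, row in enumerate(H):
        thresh = row.mean() + 3 * row.std()
        keywords = {vocab[i] for i in np.where(row > thresh)[0]}
        for d, tokens in enumerate(doc_tokens):
            counts[d, t] = sum(tok in keywords for tok in tokens)
    return counts

# Toy data: topic 0 spikes on "daiin", topic 1 on "otol"
vocab = ["daiin", "chedy", "otol", "okal"] + [f"filler{i}" for i in range(10)]
H = np.zeros((2, len(vocab)))
H[0, 0] = 10.0
H[1, 2] = 10.0
docs = [["daiin", "daiin", "chedy"], ["otol", "okal", "otol"]]
print(topic_keyword_counts(H, docs, vocab))
# → [[2 0]
#    [0 2]]
```

One caveat of the 3-sigma cutoff: with a small vocabulary no word can exceed it (the maximum z-score over n values is (n-1)/√n), so it only behaves as intended on realistically large vocabularies.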

That said, for the MS topics, the heatmaps look as follows:

[Image: vJL9GSE.png]

As you can see, the topics are quite well-separated. Most paragraphs are dominated by the vocabulary of a single topic, with limited overlap. This suggests that the MS text, even if undeciphered, has internal structure that the model can detect.

But for the Naibbe cipher, the heatmaps look as follows:

[Image: RL3cBZc.png]

In contrast, the Naibbe cipher shows significant vocabulary overlap between topics. Many paragraphs contain high-weight words from multiple topics, making them harder to distinguish. This suggests that the output from the Naibbe cipher is relatively homogeneous across the text and topics.

Interpretation
  • The MS text appears to be more topically diverse and structured, which supports the idea that it contains meaningful internal variation (even if we don’t understand the language).
  • The Naibbe cipher output seems too uniform across paragraphs, which might indicate it's closer to random output or lacks internal semantic variation — at least from the point of view of a topic model.
What do you think of these results?
This may be my last post about the topics detected with NMF. I don't know whether it can help with deciphering or understanding the MS, but I'm posting this figure, which I found interesting and funny at the same time.

[Image: 2wpDwmC.jpeg]

The figure shows a timeline of NMF topics across the manuscript, binned by folio (and side).

Each vertical slice corresponds to one folio and side; the stacked areas in that slice sum to 100% and show the mixture of topics detected on that folio. Colors are unique per topic (T0, T1, …), and the letters at the top mark the section of the manuscript for the corresponding range of folios (H=Herbal, P=Pharmaceutical, A=Astronomical, B=Biological, T=Text-only, R=Recipes). 

Rare topics are suppressed. I used a 15% threshold (topics contributing <15% in a bin are set to 0). The 15% threshold emphasizes dominant topics in each bin; lowering it will reintroduce rare topics (more bands), raising it will focus on only the strongest themes.
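The suppression step can be sketched in a few lines. Whether the remaining topics are rescaled to sum to 100% afterwards (as the stacked areas suggest) is my assumption, so it is exposed as a flag:

```python
import numpy as np

def suppress_rare_topics(mix, threshold=0.15, renormalize=True):
    """Zero out topics contributing less than `threshold` in each bin (row),
    optionally rescaling the survivors so each row sums to 1 again."""
    out = np.where(mix < threshold, 0.0, mix)
    if renormalize:
        sums = out.sum(axis=1, keepdims=True)
        sums[sums == 0] = 1.0   # avoid division by zero for empty bins
        out = out / sums
    return out

# One folio bin: topic shares before and after suppression
mix = np.array([[0.50, 0.30, 0.10, 0.10]])
print(suppress_rare_topics(mix))   # → [[0.625 0.375 0.    0.   ]]
```

Lowering `threshold` reintroduces the rare bands; raising it keeps only the strongest themes, exactly as described above.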

What I think is interesting is that, overall, this timeline compresses the manuscript’s topic structure into a single, readable view, showing where topics dominate, where they mix, and how they evolve across folios and sections.

If anyone finds this useful, I will be happy.

Either way, I’ll keep pulling on this topic-detection thread and see what I get.
I have been working further on the topic analysis of the Voynich manuscript. As you might know, I am trying to detect topics throughout the text using an NMF model. Sorry to insist, but I think this analysis is leading me to confirm—at least from my current point of view—that the text has meaning and that it is related to the different sections.

I have created a GIF where you can see how the topics are distributed per section and per folio within each section (the plot titles may be confusing), starting from the detection of 2 topics and going up to 15 topics:



[Image: Vh8vy4k.gif]

  • With 2 topics you can already see a considerable difference between the Herbal and Biological sections.
  • With 3 topics, part of the orange topic begins to transform into green, while the blue topic remains consistent in the Biological and Marginal Stars (Recipes) sections.
  • With 4 topics, the first part of the Herbal folios clearly stands out, where the orange topic predominates.
  • As the number of topics increases, the detection gains granularity and topics get mixed, but each section still shows different mixtures. For example, note the division of the orange topic when moving from 7 to 8 topics.
  • At higher topic numbers, the granularity becomes very fine. Even so, we can still see clear differences between sections.

This leads me to the opinion that the Voynich has meaning, and that this meaning is also linked to the sections (and their drawings).

In the coming days, I plan to:
  • Detect topics within each section (instead of across the whole Voynich).
  • Examine how language and handwriting influence topic detection.

If you have other suggestions, please let me know.

I hope you find this interesting.
What happens if you run the pipeline on some gibberish text, like the output of Torsten Timm's autogenerator?


Edit: Removed erroneous attachment, thanks nablator for spotting those errors.
(02-09-2025, 06:45 PM)RobGea Wrote: What happens if you run the pipeline on some gibberish text, like the output of Torsten Timm's autogenerator?

A while ago I used TT's code to generate some text, then split it into lines that match the word counts of VMS folios.
It may be of some use.

Word count?

Your [link] line has 155 words, but [link] in the VM has 209 (according to RF1a)
Your [link] line has 287 words, but [link] in the VM has 511 (according to RF1a)
etc.
(02-09-2025, 06:45 PM)RobGea Wrote: What happens if you run the pipeline on some gibberish text, like the output of Torsten Timm's autogenerator?


Hello Rob,


This was one of the tests I wanted to carry out. Here’s what I did: I generated two different texts using Timm's generator. I replaced the real Voynich words with Torsten's text, in the same order. So, if the first MS paragraph has, let’s say, 40 words, I took the first 40 words of Timm's text, and so on. I also did the following: I took all the real Voynich words and randomly shuffled them, so that every Voynich word appears in a random position, but the overall corpus remains the same. Here are the results:
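Both control texts described above (same paragraph lengths, words drawn in order from a generated stream, or the real words shuffled across the whole corpus) can be sketched as follows; the sample paragraphs and generated words are placeholders:

```python
import random

def replace_words(paragraphs, generated_words):
    """Rebuild each paragraph with the same word count, drawing words
    in order from a generated (e.g. Timm-style) word stream."""
    it = iter(generated_words)
    return [" ".join(next(it) for _ in p.split()) for p in paragraphs]

def shuffle_corpus(paragraphs, seed=0):
    """Shuffle all words across the corpus, keeping paragraph lengths,
    so the overall word inventory stays identical."""
    words = [w for p in paragraphs for w in p.split()]
    random.Random(seed).shuffle(words)
    return replace_words(paragraphs, words)

paras = ["daiin chedy qokeedy", "otol okal"]
fake = ["g1", "g2", "g3", "g4", "g5"]
print(replace_words(paras, fake))   # → ['g1 g2 g3', 'g4 g5']
```

Since both controls preserve paragraph lengths (and the shuffle also preserves the word inventory), any topic structure the model finds in the real text but not in the controls cannot come from those properties alone.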



Torsten Timm 1:

[Image: n4JONxz.gif]



Torsten Timm 2:
[Image: o6TfzZI.gif]



Randomized Voynich:

[Image: B3T1P5U.gif]



Voynich:

[Image: Vh8vy4k.gif]


As you can see, the topics found in Torsten Timm's texts, even if they are not as flat as a purely random text, don’t provide any clue to a real topic. There are small parts that seem to hint at different topic usage, but I think that’s just coincidence. No real topic emerges; no section clearly differs from the rest. For the randomized Voynich, it is even more evident that the text is completely random, without meaning.

But look again at the Voynich sections! This makes my opinion even stronger (personal, of course): that the Voynich does have meaning—or at least some underlying structure connected to its sections or drawings. Depending on the part we “read,” we find a different topic compared to other parts of the manuscript.