The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
(21-09-2025, 09:45 PM)Jorge_Stolfi Wrote: So you are trying several K-clustering algorithms on the same set of data points, where each point is a bag-of-words -- essentially an N-vector where N is the number of distinct word types.  And the algorithms do not assign each parag to a single cluster (topic), but give it a "belonging" or "mixing" score for each cluster.  Is that correct?

Can we interpret those scores as Bayesian probabilities of the parag belonging to each topic?

Almost, but not quite. Yes: each paragraph is turned into a bag-of-words vector, and I try several values of K.

But it isn’t classic clustering where each paragraph goes in one bucket. It’s a mixed-membership topic model: the model learns K word-themes, and each paragraph is a mixture of those themes (different amounts for each). With DMR, the expected mix can also depend on metadata (language A/B, writing hand, Currier), while the themes themselves (their word lists) stay shared.
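To make the "metadata shifts the mix, themes stay shared" point concrete, here is the usual textbook form of DMR, assuming that is the variant used here. The symbols are my own notation, not taken from the posts: x_d is the metadata vector of paragraph d, lambda_k the learned metadata weights for topic k, theta_d the topic mix of paragraph d, and phi_k the word distribution of topic k.

```latex
\alpha_{d,k} = \exp\!\left(\mathbf{x}_d^{\top}\boldsymbol{\lambda}_k\right),
\qquad
\theta_d \sim \mathrm{Dirichlet}\!\left(\alpha_{d,1},\dots,\alpha_{d,K}\right),
\qquad
\mathbb{E}[\theta_{d,k}] = \frac{\alpha_{d,k}}{\sum_{j=1}^{K}\alpha_{d,j}}

z_{d,n} \sim \mathrm{Categorical}(\theta_d),
\qquad
w_{d,n} \mid z_{d,n}=k \;\sim\; \mathrm{Categorical}(\phi_k)
```

The metadata enters only through alpha, so it can change how much of each theme a paragraph is expected to use, while each phi_k has no dependence on x_d.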

For LDA/DMR, you can read the scores as “what fraction of this paragraph’s tokens the model thinks come from each theme.” They sum to 1 and are a reasonable probabilistic mix over themes—more like “probability a random token in this paragraph was generated by topic k,” not “probability the whole paragraph belongs to topic k” (documents are assumed to mix topics). They’re approximate posteriors from the model.

For NMF, the numbers are non-negative weights, not probabilities. You can normalize them to add up to 1 for plotting, but that makes them proportions by construction, not Bayesian probabilities.
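A minimal sketch of that normalization step, in plain Python with made-up weights (the function name and the numbers are only for illustration, not from my pipeline):

```python
def normalize_rows(weights):
    """Scale each row of non-negative weights so that it sums to 1."""
    out = []
    for row in weights:
        total = sum(row)
        if total == 0:
            # A paragraph with no weight on any theme stays all-zero.
            out.append([0.0] * len(row))
        else:
            out.append([w / total for w in row])
    return out


# Hypothetical NMF paragraph-by-theme weights (two paragraphs, three themes).
nmf_weights = [[0.8, 0.2, 0.0],
               [3.0, 1.0, 4.0]]
proportions = normalize_rows(nmf_weights)
```

After this step both rows sum to 1, but the overall magnitude of each row is discarded and nothing in the procedure is a posterior; it is only a rescaling that makes the rows comparable in a plot.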
I know the conversation up to now was quite technical, but a simpler briefing is this.

I fed the Voynich transcription into an automatic topic model that doesn’t know the manuscript’s sections, illustrations, or my expectations. After a light clean-up (removing single-character tokens and a tiny set of forms that occur everywhere) the model was asked to find recurring “themes” in the text and to say how strongly each paragraph uses each theme. I also let it account for obvious style factors (language A/B and scribal hands) in how much of each theme appears, without letting those factors change what the themes are.
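For readers who want to picture the clean-up, here is a rough sketch of that kind of filter. The 0.9 document-frequency threshold, the function names, and the toy tokens are assumptions made for the example; they are not the exact rule or data behind the results.

```python
from collections import Counter


def clean_paragraphs(paragraphs, max_doc_fraction=0.9):
    """Drop single-character tokens and forms present in nearly every paragraph."""
    # Split on whitespace and discard single-character tokens.
    tokenized = [[t for t in p.split() if len(t) > 1] for p in paragraphs]
    # Count in how many paragraphs each form occurs (document frequency).
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(tokenized)
    too_common = {w for w, df in doc_freq.items() if df / n > max_doc_fraction}
    return [[t for t in tokens if t not in too_common] for tokens in tokenized]


def bag_of_words(tokens):
    """Word-type counts for one paragraph; word order is discarded."""
    return Counter(tokens)


# Toy paragraphs, invented for the example.
toy = ["daiin chedy s qokeey", "daiin shol y chol", "daiin qokeey chedy"]
cleaned = clean_paragraphs(toy)
bags = [bag_of_words(tokens) for tokens in cleaned]
```

In the toy data the form that appears in every paragraph is dropped along with the single-character tokens, and each remaining paragraph becomes a count vector over word types.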

With that setup, the model settled on six themes that line up with the well-known sections: Herbal, Pharmaceutical, Biological (balneological), Astronomical/Zodiac, Marginal Stars, and a Text/Cosmological connector. When the manuscript switches section, the dominant theme usually flips at the same boundary. Language and hands still matter, but mainly in how much of each theme is used; they don’t redefine the themes themselves. This pattern shows up clearly in both the heatmaps and the “timeline by folio” plots.

What does that mean? In plain terms, the clustering the model finds is not random. The themes it discovers match the book’s content divisions, even though the model wasn’t told about those divisions. That is hard to reconcile with pure gibberish. It doesn’t prove a reading, but it does suggest there’s systematic structure in the text beyond scribal style.


It is also hard to reconcile with independent gibberish per theme (as if each section invented its own nonsense). The themes reuse stable families of forms across many paragraphs, the dominant theme flips exactly at section boundaries, and related sections share overlaps (e.g., Herbal with Pharmaceutical). Even after letting language and scribal hands nudge the mixtures, the same theme vocabularies persist. That pattern looks like shared structure running through the book, not several unrelated streams of invented text.
(22-09-2025, 08:44 AM)quimqu Wrote: That pattern looks like shared structure running through the book, not several unrelated streams of invented text.

In one of the few [link not shown], back in the late 2nd millennium CE (at the IMPA Brazilian Mathematics Colloquium), I mentioned the use of word distribution similarity to infer the original nesting and folding of the Biology bifolios.  The relevant part of that file starts at slide/page 25.  The basic idea is to solve a Traveling Salesman Problem in N-space, by exhaustive enumeration of all possible page orders.  It is feasible because the physical pairing of pages and folios puts strong constraints on these orders.

I did not give the result in that talk because (IIRC) it was rather ambiguous and I did not have time to discuss it.   Anyway the transcription quality must have improved enormously since then, and the page distance I used probably was not ideal.   I think it would be better to redo it from scratch. You seem to have all the required boring parts programmed already...

All the best, --jorge
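The enumeration idea in the post above could be sketched roughly as follows. This is a simplified model, not the procedure from the talk: it only tries the nesting order of the bifolios within one quire, ignores the possibility of a bifolio being folded the other way, and takes the page distance as a caller-supplied function, since the distance originally used is not stated. All names and the toy data are hypothetical.

```python
from itertools import permutations


def page_sequence(nesting):
    """Reading order of pages for bifolios listed outermost first.

    Each bifolio is (front_leaf_pages, back_leaf_pages); the front leaves are
    read going inward, then the back leaves coming back out.
    """
    front = [p for bifolio in nesting for p in bifolio[0]]
    back = [p for bifolio in reversed(nesting) for p in bifolio[1]]
    return front + back


def path_cost(pages, dist):
    """Sum of distances between consecutive pages in the reading order."""
    return sum(dist(a, b) for a, b in zip(pages, pages[1:]))


def best_nesting(bifolios, dist):
    """Try every nesting order and keep the one with the shortest path."""
    best = min(permutations(bifolios),
               key=lambda order: path_cost(page_sequence(order), dist))
    return list(best), path_cost(page_sequence(best), dist)


# Toy quire: each page gets a made-up 1-D "content coordinate".
pos = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5,
       "C3": 6, "C4": 7, "B3": 8, "B4": 9, "A3": 10, "A4": 11}
A = (("A1", "A2"), ("A3", "A4"))
B = (("B1", "B2"), ("B3", "B4"))
C = (("C1", "C2"), ("C3", "C4"))
order, cost = best_nesting([B, C, A], lambda a, b: abs(pos[a] - pos[b]))
```

With real data, `dist` would compare the word distributions of two pages. Because pages move only as whole bifolios, a quire of five bifolios has just 120 nesting orders to try, which is what makes exhaustive enumeration practical.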
This is not the first time that similar research has shown some patterns in the text.

And I wonder why we cannot progress from it.
If there are patterns then we should be able to see something in the text - beginnings and ends of sentences, repeated phrases, nouns and verbs, the word "and", numbers and so on. But we can't.
(22-09-2025, 07:28 PM)Rafal Wrote: If there are patterns then we should be able to see something in the text - beginnings and ends of sentences, repeated phrases, nouns and verbs, the word "and", numbers and so on. But we can't.

The distinction between "nouns" and "verbs" exists in most "European" languages, but it is not a linguistic universal.  Can you spot the verbs and nouns in "مكتوب أن الكاتب كتب كتابًا مكتوبًا بأداة الكتابة"? Or in "maktub 'ana alkatib katab ktaban mktwban bi'adat alkitabati"?

Even in English that distinction has been largely lost with the loss of noun cases and verb inflections.  See the famous "buffalo buffalo" example.

Same for the distinction between "numbers" and "words".   Numbers stand out when they are written with Roman or Arabic numerals.  But in some languages numbers were traditionally spelled out as spoken.  Can you spot the numbers in "孩子从桌子上拿走了四十六块饼干"?  Or in "Hái zǐ cóng zhuō zǐ shàng ná zǒu le sì shí liù kuài bǐng gān"?

I believe that the reason why no progress in the decipherment has been made in 25+ years is that everybody who knows even a pinch of medieval paleography correctly observes that the parchment, instrument, hand, writing direction, text layout, figure style,  illustration elements, castles, dresses, hairdos, and marginalia are all "European" -- and concludes that "therefore" the language and contents must be "European".

All the best, --jorge

PS. Rafal, have you checked the story of Gaspar da Gama?
I’m not using this topic automation to find parts of speech (verbs, nouns, adjectives). The goal is to group words into “topics,” even if we don’t know what the words mean. So far it’s interesting that the model finds stable topic structure, and that these topics line up with section, language, and scribal hands. In my last post I tried to reduce the influence of style and language, and the topics still show up, and they’re still bounded by section.

Now I’ve run into a different issue. I’m drilling down by section, starting with Herbal, by filtering to those paragraphs only. Inside Herbal I expected at least a couple of thematic clusters (think Culpeper-style: when to harvest, flowering stages, preparations, etc.). Instead, the splits I’m getting are almost entirely language A vs B, with little evidence of cross-language themes. That’s surprising. Herbal is big enough that, ideally, we should see some topic diversity across languages, but we don't; the fact that I’m mostly seeing language separation makes me wonder why...
(22-09-2025, 08:44 AM)quimqu Wrote: recurring “themes” in the text and to say how strongly each paragraph uses each theme. With that setup, the model settled on six themes that line up with the well-known sections: Herbal, Pharmaceutical, Biological (balneological), Astronomical/Zodiac, Marginal Stars, and a Text/Cosmological connector. When the manuscript switches section, the dominant theme usually flips at the same boundary.

Consider the hypothesis 

  "in each page (with maybe a couple of exceptions), all parags have the same 'topic'".

Are your results compatible with this hypothesis?  In other words, how many pages did you find which have two or more parags with clearly distinct topics? 

If there are such pages, is the break between topics consistent with a transition between two "sections" that falls in the middle of a page?  That is,  do you see only transitions like [AA][AAA][AAB][BBB][BB][BAA][AA], or are there transitions like [AA][BA][BBB][BAB][AA] etc.?

I would not join labels to nearby parags.  Label distributions are quite different from parag word distributions.  If a page has two parags and you join a bunch of labels to the second parag, they will probably come out as having different topics, just for that reason.

All the best, --jorge
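One way to make the page-level question above checkable is sketched below. It assumes each paragraph has already been reduced to a single dominant-topic label, and it calls a mixed page a "clean boundary" when it has exactly one switch and agrees with its neighbouring pages; both the reduction and that definition are assumptions made for the example, as are the function names.

```python
def switches(labels):
    """Number of topic changes between consecutive paragraphs on one page."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def classify_pages(pages):
    """pages: non-empty label sequences, one per page, in manuscript order.

    Returns (mixed, clean): indices of pages with more than one topic, and
    the subset that look like a single section boundary falling mid-page.
    """
    mixed, clean = [], []
    for i, page in enumerate(pages):
        if len(set(page)) < 2:
            continue  # single-topic page
        mixed.append(i)
        prev_ok = i == 0 or pages[i - 1][-1] == page[0]
        next_ok = i == len(pages) - 1 or pages[i + 1][0] == page[-1]
        if switches(page) == 1 and prev_ok and next_ok:
            clean.append(i)
    return mixed, clean


# The two example patterns from the post above.
tidy = ["AA", "AAA", "AAB", "BBB", "BB", "BAA", "AA"]
messy = ["AA", "BA", "BBB", "BAB", "AA"]
tidy_result = classify_pages(tidy)
messy_result = classify_pages(messy)
```

In the first pattern every mixed page is also a clean boundary; in the second, none of the mixed pages is, which is the distinction the question is asking about.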
(22-09-2025, 09:40 PM)quimqu Wrote: Now I’ve run into a different issue. I’m drilling down by section, starting with Herbal, by filtering to those paragraphs only. Inside Herbal I expected at least a couple of thematic clusters (think Culpeper-style: when to harvest, flowering stages, preparations, etc.). Instead, the splits I’m getting are almost entirely language A vs B, with little evidence of cross-language themes. That’s surprising. Herbal is big enough that, ideally, we should see some topic diversity across languages, but we don't; the fact that I’m mostly seeing language separation makes me wonder why...

Well, VMS Herbal parags are fairly short, so some of those "items" (like "where it grows" and "when to harvest") may be just a word or two.  Possibly most of the text is repeated items like 
  • "For condition X1, take N1 ounces in wine for D1 days. For condition X2, apply poultice R2 times for D2 hours.  If symptoms persist, call your doctor. For condition X3, make a tea from leaves and drink a gallon every day for a week."

The differences between "languages" could be due to the text being the join of two sources with different wordings of the formulas, like
  • "This herb is good for X1 if taken in wine for D1 days. A poultice applied R2 times will cure condition X2, but some cases may need something else. The tea from leaves, twenty glasses per day, will cure condition X3."

All the best, --jorge
(22-09-2025, 03:52 PM)Jorge_Stolfi Wrote:
(22-09-2025, 08:44 AM)quimqu Wrote: That pattern looks like shared structure running through the book, not several unrelated streams of invented text.

In one of the few [link not shown], back in the late 2nd millennium CE (at the IMPA Brazilian Mathematics Colloquium), I mentioned the use of word distribution similarity to infer the original nesting and folding of the Biology bifolios.  The relevant part of that file starts at slide/page 25.  The basic idea is to solve a Traveling Salesman Problem in N-space, by exhaustive enumeration of all possible page orders.  It is feasible because the physical pairing of pages and folios puts strong constraints on these orders.

I did not give the result in that talk because (IIRC) it was rather ambiguous and I did not have time to discuss it.   Anyway the transcription quality must have improved enormously since then, and the page distance I used probably was not ideal.   I think it would be better to redo it from scratch. You seem to have all the required boring parts programmed already...

All the best, --jorge

I wish we had the audio for this part of the talk:

[Image: qQc0Pe3.png]

On the folio ordering: in this plot, for example

[Image: VpV1cTM.png]

the folia are sorted first by section and then by folio number. With this ordering, you can see that within Herbal some "general" topics appear, but they’re not neatly sequential; they pop in and out. My next step is to collect what’s known about the constraints on folio order (the quire structure constrains the order of its folia) and then try a few re-orderings to see whether the topics settle into a more coherent sequence. If anyone has this constraint information, it would help.

A note on my previous post: this plot shows the general topics as they appear in the Herbal section (all 6 topics are represented). When I said I wasn't finding topics within Herbal, I meant that when I fit the model using only the paragraphs of the Herbal section, I don't get the clear sub-topics I expected.
(22-09-2025, 09:50 PM)Jorge_Stolfi Wrote: Consider the hypothesis 

  "in each page (with maybe a couple of exceptions), all parags have the same 'topic'".

Are your results compatible with this hypothesis?  In other words, how many pages did you find which have two or more parags with clearly distinct topics? 

If there are such pages, is the break between topics consistent with a transition between two "sections" that falls in the middle of a page?  That is,  do you see only transitions like [AA][AAA][AAB][BBB][BB][BAA][AA], or are there transitions like [AA][BA][BBB][BAB][AA] etc.?

I would not join labels to nearby parags.  Label distributions are quite different from parag word distributions.  If a page has two parags and you join a bunch of labels to the second parag, they will probably come out as having different topics, just for that reason.

All the best, --jorge

If I’m understanding you, that’s exactly what my plots show: I just didn’t explain them well (maybe they were only clear to me, sorry). The four plots contain the same data but are ordered differently. Each color is a topic. For each folio (think of a single vertical column), the vertical stack shows the proportions of topics on that folio. If a folio is a single color, all its paragraphs fall into the same topic; if it’s multicolored, different paragraphs are assigned to different topics.
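In case it helps to see how one column of such a plot can be computed, here is a small sketch. It assumes each paragraph is first assigned to its single dominant topic and then counted per folio; the function name, folio labels, and topic numbers are invented for the example, and averaging the full paragraph mixtures instead would be an equally reasonable choice.

```python
from collections import Counter, defaultdict


def folio_topic_proportions(assignments):
    """assignments: (folio, dominant_topic) pairs, one per paragraph.

    Returns {folio: {topic: share of that folio's paragraphs}}.
    """
    counts = defaultdict(Counter)
    for folio, topic in assignments:
        counts[folio][topic] += 1
    return {
        folio: {topic: n / sum(c.values()) for topic, n in c.items()}
        for folio, c in counts.items()
    }


# Toy input: two paragraphs on one folio, four on the next.
toy_assignments = [("f1r", 0), ("f1r", 0),
                   ("f1v", 0), ("f1v", 3), ("f1v", 3), ("f1v", 3)]
stacks = folio_topic_proportions(toy_assignments)
```

A folio whose dictionary has a single entry would be drawn as a single-color column; one with several entries would be drawn as a multicolored stack with those heights.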

In the first plot, folios are grouped by language (A/B) and, within each language, kept in manuscript order.
[Image: DlvkVvn.png]
This lets you see how topics distribute within each grouping. For example, within language A the early folios are mostly the purple topic (with two small peaks of green topic); later there’s a small mix of orange, red, and lilac; towards the end of language A the green topic dominates. In language B there’s almost no purple or green, which matches the language split the model is picking up.

In the second, folios are grouped by section and, within each section, ordered as in the manuscript.

[Image: dvpzOuc.png]
In the third, folios are grouped by Currier hand, again preserving manuscript order within each group.
[Image: xioF0VV.png]

In the fourth, folios are grouped by writing hand, also in manuscript order.
[Image: BmSd3C4.png]