The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
(22-09-2025, 10:27 PM)quimqu Wrote: The four plots contain the same data but are ordered differently. Each color is a topic. For each folio (think of a single vertical column), the vertical stack shows the proportions of topics on that folio. If a folio is a single color, all its paragraphs fall into the same topic; if it’s multicolored, different paragraphs are assigned to different topics.

What got me confused is that the plots show gradual transitions (slanted lines) between topics. But those slanted lines are an artifact of the plotting routine. For instance, from [link] to f17v the plot suggests a gradual transition, but the transition is actually abrupt: [link] is 100% "brown", [link] is 100% "green", then [link] is again 100% "brown". Right?
(23-09-2025, 05:04 AM)Jorge_Stolfi Wrote: What got me confused is that the plots show gradual transition (slanted lines) between topics. But those slanted lines are an artifact of the plotting routine. For instance, from [link] to f17v the plot suggests a gradual transition, but the transition is actually abrupt: [link] is 100% "brown", [link] is 100% "green", then [link] is again 100% "brown". Right?

That's it.
[link] has topic 2. It skews to the -ol/-or family and -eol/-eey clusters: ol, chol, or, cheol, cheor, sheol, qokol, qokeol, okeol, cheey, sheey. It also shows longer "vowel"-heavy strings like saiin, qokeey, and compounds like cthey. Overall: lots of -ol/-or endings and ee sequences.
[link] and [link] have topic 5. It leans hard into the -chy/-ty/-y morphology with "sh/ch/cth" frames: chol, chor, shol, sho, dy, chy, cthy, dain, shor, shy, chey, qotchy, qokchy, otchy, cthol, cthor, qoty, oky. Overall: shorter tokens, many -chy/-ty/-y endings and cth/ch/sh stems.

By the way, I attach a plot of the topics in the current order of the folios (that is, the folios as they are bound):

[Image: NUgVuQh.png]

Regards
Very interesting!
While I am not even remotely qualified to judge what you are doing, I am also confused by the x-axis of your graphs. Since we are dealing with discrete data chunks (folios), wouldn't it be better to have a bar for each folio? It is impossible to see which folio has which topic right now. So I think a bar graph would be neater.

I have mostly worked with the herbal section so it's interesting that Currier hand 1 almost completely agrees with your brown topic while Lisa's hand 1 also includes the green one.

It would be very helpful to have a list of each folio with the topic(s) your model assigned to it. Or at least, can you tell me which herbal pages are purely brown topic? It would be intriguing if we can find any hints in the imagery that distinguish the ones that have been assigned to different topics like f17.

Also, have you checked if there is a correlation between your topics and bifolios? Almost all bifolios have been written by the same hand and it is thought that they have been created in one go. So it would be interesting to see if bifolios also have topic similarity, even if they are not adjacent.
Hello, Bernd,

To check the correlation between topics and bifolios, I just prepared the timeline plot per bifolio, both as a stacked area chart and as stacked bars, as you suggested.

[Image: jJh7rmD.png]
[Image: EiHFPQX.png]

But you can find all raw data in this [link]. Note that the topic is assigned by paragraph, so one folio might have multiple topics.
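Since topics are assigned per paragraph, the height of each color in a folio's bar is simply the fraction of that folio's paragraphs carrying that topic. A minimal sketch of that aggregation follows; the folio names and topic assignments are made up for illustration, and the function name is hypothetical, not taken from the raw data or from the code used for the plots.

```python
from collections import Counter, defaultdict

def folio_topic_shares(paragraph_topics):
    """Turn (folio, topic) pairs, one per paragraph, into per-folio topic fractions."""
    counts = defaultdict(Counter)
    for folio, topic in paragraph_topics:
        counts[folio][topic] += 1
    return {
        folio: {topic: n / sum(c.values()) for topic, n in c.items()}
        for folio, c in counts.items()
    }

# Hypothetical paragraph-level assignments (not real results).
assignments = [
    ("folio_a", 2), ("folio_a", 2),   # single-topic folio: one solid color
    ("folio_b", 5),
    ("folio_c", 2), ("folio_c", 5),   # mixed folio: two colors, half each
]
shares = folio_topic_shares(assignments)
for folio, s in shares.items():
    print(folio, s)
```

Drawing one bar per folio from these fractions keeps each folio discrete, so an abrupt change between neighbouring folios is not rendered as a slanted transition.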
I ran two types of topic models, NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation), on the Voynich text. I did this twice: first using paragraphs as the unit of analysis, and then using the folios (front and back) as units.

Both models clearly show a peak at K = 2, meaning that the text naturally splits into two distinct writing styles or “languages”.
[attachment=11709]
These models don’t actually understand the language; they simply look at which words tend to appear together and group them into underlying patterns or topics that best explain the data.
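To make the "words that appear together" idea concrete, here is a toy NMF using the standard multiplicative-update rule on a tiny count matrix. This is only a sketch under stated assumptions: the counts, the four-word vocabulary and the fixed starting matrices are invented for illustration, and it is not the code or library used for the results in this post.

```python
# Toy document-term counts (hypothetical, not the real transliteration).
# Rows are text units (paragraphs or folios); columns are word types.
vocab = ["chol", "chor", "chedy", "qokeedy"]
V = [
    [2, 2, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, w, h, iterations=200, eps=1e-9):
    """Approximate v by w @ h with multiplicative updates (squared-error loss)."""
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(len(h[0]))]
             for k in range(len(h))]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(len(w[0]))]
             for i in range(len(w))]
    return w, h

# Fixed, slightly asymmetric starting point so the run is reproducible (K = 2).
W0 = [[1.0, 0.5], [1.0, 0.5], [0.5, 1.0], [0.5, 1.0]]
H0 = [[1.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]]

W, H = nmf(V, W0, H0)
dominant_topic = [row.index(max(row)) for row in W]
print(dominant_topic)
for k, row in enumerate(H):
    print(k, dict(zip(vocab, (round(x, 2) for x in row))))
```

Each row of W says how much of each topic a text unit contains, and each row of H says how much each word contributes to a topic. The grouping comes purely from co-occurrence counts.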

When we compare the model’s results with the known Currier A and Currier B classifications, the match is very strong, but not perfect. Below is a confusion matrix showing how well the model’s inferred “languages” align with Currier’s.

[attachment=11710]

I’ve also included a list of the few folios where the classification doesn’t match.

[attachment=11711]

At the paragraph level, the number of mismatches increases slightly (which makes sense, since a single folio can mix both styles), but for now let’s focus on the folio-level agreement.

I also built a visualization tool that highlights which words contribute most to each topic or language. Each topic is shown in a different color, and the intensity of the color reflects how strongly that word belongs to that topic. Words that are very characteristic of one style appear in bright, saturated colors, while neutral words appear in lighter tones. This makes it possible to see the stylistic contrast within a single paragraph or folio at a glance.
Here are a few examples. I think this way of visualizing the text makes the two “languages” of the Voynich much easier to grasp intuitively.

For example: [link] [attachment=11716]
You can see that a lot of words are coloured in blue, meaning they carry weight in language A. The few words in pink (language B) have almost no weight, so the folio is classified as language A.

Let's check one of the folios where the detected language does not match the language Currier indicated, [link] [attachment=11713]
As you can see, although Currier marked this folio as B, the model assigns it to A. The model is not very confident, though: language B also carries some weight, even if language A is the winner.

Let's check a balneological folio (full of qokeedy), like [link] [attachment=11714]
You can see it is full of pink words, indicating it is in language B, as Currier defined it.

I hope you find this interesting. If you have any doubts or comments, feel free to ask.
(16-10-2025, 10:23 PM)quimqu Wrote: Let's check one of the folio where the detected language is not the Currier language indicated, [link]
[link] has 10 "edy" in its second paragraph, none in the first. What is the detected Currier language of the 2nd paragraph in isolation?

What about [link]? It has 5 "edy" in its second paragraph, none in the first, so there seems to be a transition from A to B there too.
(16-10-2025, 10:33 PM)nablator Wrote:
(16-10-2025, 10:23 PM)quimqu Wrote: Let's check one of the folio where the detected language is not the Currier language indicated, [link]
[link] has 10 "edy" in its second paragraph, none in the first. What is the detected Currier language of the 2nd paragraph in isolation?

What about [link]? It has 5 "edy" in its second paragraph, none in the first, so there seems to be a transition from A to B there too.

Hello nablator,

Yes, this is something that the model (trained at paragraph level) detects. These are the two paragraphs of f57r:
[attachment=11720]
[attachment=11719]
You can see that the whole page was classified as language B, but the first paragraph is clearly language A (according to the model) and the second paragraph is clearly language B.

For [link] the transition that you mention is not detected by the model, as both paragraphs are detected as language A:
[attachment=11722]
[attachment=11721]
For the model it is not simply a matter of endings. Look at paragraph 2: chedy and dshedy are labelled as language B, but other "edy" words such as sheody, okeody, chokeody and schkhdy are labelled as language A. The punctuation for language A is 3 times the punctuation for language B, so the paragraph is labelled as language A, just like the first one.

Note that the models for paragraphs and for folios are different (one is trained on paragraphs and the other on entire folios), so some words may be assigned to different topics by the two models.
(17-10-2025, 07:32 AM)quimqu Wrote: the punctuation for language A is 3 times the punctuation for language B

Punctuation? Do you mean cumulative weight or number of words or some other metric?

Since the detected language at the page and paragraph level doesn't always match Currier's assessment, would it be useful to re-train the models using the detected languages as input until (at the first iteration hopefully) the result is stable, i.e. the detection matches the training? And maybe drop the page model, if the paragraph model is more detailed and accurate.

It would be interesting to know how reliable is the classification by a "stable" model as either A or B, and how many "in-between" (undecidable) paragraphs there are. If each paragraph is strongly A or B,  we have a discontinuous process (like a change in plaintext language), if not there could be a continuous drift between A and B and possibly more "languages" or "dialects". Which hypothesis is more likely?
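One simple way to count the "in-between" paragraphs would be to look at how far each paragraph's language A share is from 50%. The sketch below assumes such a share is available per paragraph; the paragraph ids, the share values and the 0.1 margin are all hypothetical choices for illustration.

```python
def split_by_confidence(a_shares, margin=0.1):
    """a_shares maps a paragraph id to its language A share in [0, 1].

    Paragraphs whose share lies within `margin` of 0.5 are treated as undecided.
    """
    decided, undecided = {}, []
    for pid, share in a_shares.items():
        if abs(share - 0.5) < margin:
            undecided.append(pid)
        else:
            decided[pid] = "A" if share > 0.5 else "B"
    return decided, undecided

# Hypothetical shares, not model output.
example = {"p1": 0.95, "p2": 0.55, "p3": 0.10}
decided, undecided = split_by_confidence(example)
print(decided, undecided)
```

If almost every paragraph ends up in `decided`, that would point to a discontinuous A/B split; a large `undecided` list would be more compatible with a continuous drift.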
(17-10-2025, 08:52 AM)nablator Wrote:
(17-10-2025, 07:32 AM)quimqu Wrote: the punctuation for language A is 3 times the punctuation for language B

Punctuation? Do you mean cumulative weight or number of words or some other metric?

Yes, the model gives punctuation (a score) to the words within each topic. Some words have more weight than others. That is how I get the percentage split between language A and language B (the bar plots). Note that some words are shared by both topics, but they weigh more in one topic than in the other. Look here for example:

[attachment=11723]

You can see that "dy" appears in both languages, but it weighs much more in language A (the score of "dy" is roughly 90 for language B but almost 160 for language A). I also mark the shared words with both colors (take a look at "dy" or "otar", where the color is mixed). "chedy" is purely language B, so it has a single color.
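The scoring described above can be sketched as summing, for each language, the weights of the words in a paragraph and then comparing the two totals. In this sketch only the approximate "dy" scores (about 160 for A, about 90 for B) come from the plot; the other weights, the token list and the function name are hypothetical.

```python
def language_shares(tokens, weights):
    """Sum per-language word weights over a token list and pick the larger total."""
    totals = {"A": 0.0, "B": 0.0}
    for tok in tokens:
        for lang, w in weights.get(tok, {}).items():
            totals[lang] += w
    grand = sum(totals.values())
    shares = {lang: (t / grand if grand else 0.0) for lang, t in totals.items()}
    label = max(totals, key=totals.get)  # a tie falls back to "A" (first key)
    return label, shares

weights = {
    "dy": {"A": 160.0, "B": 90.0},   # shared word, approximate scores from the plot
    "chedy": {"B": 120.0},           # hypothetical weight, B only
    "chol": {"A": 100.0},            # hypothetical weight, A only
}
label, shares = language_shares(["dy", "chedy", "chol"], weights)
print(label, shares)
```

A shared word such as "dy" adds to both totals, which is why a paragraph with several B-looking words can still come out as A overall.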
After building the topic models (LDA and NMF) described in my previous post, I projected the results over the full manuscript to see how the two main clusters  (which correspond closely to Currier A and B) are distributed across folios and handwriting styles. The analysis was done at two levels: folios and paragraphs.

At folio level, the two clusters are almost perfectly separated (about 92% agreement with the known A/B labels). At paragraph level, the separation remains strong (around 83%), but mixed pages appear more clearly (some folios contain both “languages” in different paragraphs).

Main results:
Currier      Pred A   Pred B
Language A   109      1
Language B   16       77
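As a quick check, the folio-level agreement can be recomputed from the confusion matrix. The sketch below reads the flattened counts as 109 and 1 for Currier A, and 16 and 77 for Currier B; that reading is an assumption, chosen because it reproduces the stated agreement of about 92%.

```python
# (Currier label, predicted label) -> number of folios, as read from the table.
confusion = {
    ("A", "A"): 109, ("A", "B"): 1,
    ("B", "A"): 16,  ("B", "B"): 77,
}

total = sum(confusion.values())
agreement = (confusion[("A", "A")] + confusion[("B", "B")]) / total

# Share of each Currier class that the model labels the same way.
recall_a = confusion[("A", "A")] / (confusion[("A", "A")] + confusion[("A", "B")])
recall_b = confusion[("B", "B")] / (confusion[("B", "A")] + confusion[("B", "B")])

print(total, round(agreement, 3), round(recall_a, 3), round(recall_b, 3))
```

Under this reading, nearly all mismatches are folios that Currier labelled B and the model labels A.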

When folios are grouped by section:
Section                                                   Dominant
Astronomical, Pharmaceutical, Zodiac                      A
Biological (balneological), Marginal stars, Text-only     B
Herbal, Cosmological                                      Mixed

When compared with Currier’s hands, the correspondence is very close:
Currier hand   %A      %B
1, 4           ~100    0
2              43      57
3, 5, X, Y     0–18    82–100

And when compared with the modern classification of hands (writting_hand), the same pattern emerges even more sharply. At paragraph level, small internal mixtures appear within some pages, but the overall pattern remains consistent:

writting_hand   %A (folios)   %B (folios)   %A (paragraphs)   %B (paragraphs)   Observation
1               1.00          0.00          0.984             0.016             Clearly A
2               0.00          1.00          0.078             0.922             Clearly B
3               0.097         0.903         0.081             0.919             Consistently B
4               0.933         0.067         0.805             0.195             Mostly A, with some B contamination
5               0.143         0.857         0.263             0.737             Clearly B but with more mixing than 2–3
@               0.00          1.00          0.231             0.769             Clearly B, with slight local variation