The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
Very interesting plots!

When you colour-code a dot based on A vs B language, on what exactly do you base the choice A vs B?
Is it according to Currier's list?
I would be curious to see PCA versions of the two plots in #83. I wonder if individual samples can be labeled with something like folio.parNum. Making individual sections visible (Herbal a/b, Pharma, Q13, Q20 etc.) could help to get a feeling for what is happening.

Also, this clearly overlaps with the research by Lisa Fagin Davis and Colin Layfield that will soon be published

EDIT: [link]

u/Miseryy Wrote: Your UMAP is massively overfit. In general tight strings that curve around are just indicative of a very small # neighbors used and too small distance threshold. You can replicate this effect with ~any dataset.

Also, I'm of the opinion you should never do clustering on UMAP, ever. Furthermore "UMAP clustering" isn't a noun that exists. UMAP can be used as an initial preprocessing step, and then a standard clustering algorithm can be used. But again, I think it's terrible methodology, since you can ~always tune UMAP to achieve the clusters you want in the first place. People do it though, no denying that.

I'd suggest going with the default parameters unless you really know what you're doing and have a good justification (read: mathematical reason) to adjust them. The parameters affect the math.

Don't focus too much on meaning. If you want meaning, use PCA, and look at the vectors. The only interpretable meaning of UMAP is relative positioning. And even that is sketchy. You really should be taking away: there are groups that can be visually separated and appear to be distinct. UMAP is not proof of anything.

I would recommend that you just do clustering on your data. Not "UMAP clustering". How about starting with a simple hierarchical clustering and then looking at what you get? You can cluster either across genes or samples, and observe what falls into what group.
OK, so after all the morning investigating, finally I got it clear.

First: for the headlines, the NMF model finally scored better, and it gives discrete values for each topic. The LDA topic distribution (the model that scored better on the Voynich) gives a proportion for each topic, and these add up to 1. That's why both plots were so different (clouds for headlines NMF, snake-like figures for Voynich LDA).

Remember that the plots indicate the topic distribution per paragraph, and that the topic distribution is given by the model. So for K=2, UMAP is exploding the differences along a single line (distributions running from (0,1), (0.01,0.99)... to (1,0)), accentuating the differences in topic distribution.
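To illustrate why the K=2 case collapses to a line: a two-topic distribution (p, 1-p) has only one free parameter, so every paragraph lies on the segment between (0,1) and (1,0). The sketch below uses made-up values of p (not the manuscript's) and checks that the points have no spread in the direction perpendicular to that segment.

```python
import math

# Hypothetical topic-0 weights for a handful of paragraphs (illustrative only).
p_topic0 = [0.0, 0.01, 0.25, 0.5, 0.8, 0.99, 1.0]

# An LDA-style distribution over K=2 topics always sums to 1.
docs = [(p, 1.0 - p) for p in p_topic0]

# Unit direction along the segment from (0, 1) to (1, 0), and the one perpendicular to it.
along = (1 / math.sqrt(2), -1 / math.sqrt(2))
across = (1 / math.sqrt(2), 1 / math.sqrt(2))


def project(points, direction):
    """Coordinate of each 2D point along a unit direction."""
    return [x * direction[0] + y * direction[1] for x, y in points]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


var_along = variance(project(docs, along))
var_across = variance(project(docs, across))

print(f"variance along the segment:  {var_along:.4f}")
print(f"variance across the segment: {var_across:.4f}")
```

Since all the variation sits along one direction, the curved shapes in a 2D embedding of K=2 weights are likely produced by the embedding itself rather than by extra structure in the topic weights.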

To understand it better, I recalculated everything for K=3 topics with LDA (the second best score). For K=3 the vectors would look like (0,0.99,0.01) and UMAP reduces them to a two-dimensional vector. I keep UMAP because, in my opinion, it helps visualize the distribution of the assigned topics more clearly than PCA. For this post I changed the parameters a bit so the points are better separated. I also post PCA plots, so you can see the differences between plotting with UMAP and with PCA.

UMAP for K=2 (model finds 2 topics) and by paragraph:

[attachment=11733]

PCA for K=2 by paragraph:

[attachment=11732]

UMAP for K=3 (model finds 3 topics) and by paragraph:

[attachment=11730]

PCA for K=3 by paragraph:

[attachment=11731]
I am not sure I understand correctly, but the PCA K=2 plots show a single outlier at the top that makes the plots basically useless? Is it so? Or maybe it's a set of samples that generate identical outlier dots?

Anyway, if that actually is the situation, investigating and removing, or fixing, that sample (or samples) could improve the quality of the overall analysis.
(19-10-2025, 09:04 AM)MarcoP Wrote: I am not sure I understand correctly, but the PCA K=2 plots show a single outlier at the top that makes the plots basically useless? Is it so? Or maybe it's a set of samples that generate identical outlier dots?

Anyway, if that actually is the situation, investigating and removing, or fixing, that sample (or samples) could improve the quality of the overall analysis.

Yes, it is an outlier: it is the [link] paragraph starting at position 7. I am not sure exactly why that outlier exists and should investigate. It is neither pure topic 0, nor pure topic 1, nor a pure 50%/50% mix... I just left it as it is. I have noted it to investigate later.
I have been working further on the two topics detected by the model. As you might remember, at the folio level we observe a close match with the Currier A/B language distinction. At the paragraph level, the fit is not as strong, but still reasonably good. This is to be expected, since Currier assigned a single topic or language label to an entire folio, whereas at the paragraph level the model operates with finer granularity and can identify paragraphs with different dominant topics within the same folio.

These are the results vs. Currier's languages:

[attachment=11735]

The mixture found suggests that the detected topics might actually represent linguistic styles or sublanguages. If they were truly different languages, they would have to be very close to each other (like Spanish and Portuguese) since they share quite a lot of words, or perhaps they are dialects. But let’s assume they are topics instead. In that case, topic 0 "talks" about A and topic 1 "talks" about B. Remember that for each paragraph, we get a mixture of topics, and from this we can infer or imagine how it "sounds" or what it is "about".

So... how does the manuscript "sound" using topic modelling? This is a graphical experiment. Each paragraph is weighted and the black trend line (a bit smoothed) shows if it tends to topic A or topic B. The result is curious to see:

[attachment=11737]

Following the black line, you can see how the detected topics vary by paragraph through the manuscript. Note that, since the plot is by paragraph, the herbal section looks much shorter than the stars section (herbal has 2 or 3 paragraphs per folio and stars many more).
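A minimal sketch of how such a trend line could be computed. The exact weighting and smoothing behind the plot are not stated, so both choices here are assumptions: the score is taken as (weight of topic A) minus (weight of topic B), so values above 0 lean towards A, and the smoothing is a simple centred moving average. The paragraph weights are invented for illustration.

```python
def topic_score(p_a, p_b):
    """Signed score: +1 for pure topic A, -1 for pure topic B (assumed convention)."""
    return p_a - p_b


def moving_average(values, window):
    """Centred moving average; the window shrinks at both ends of the sequence."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical (topic A, topic B) weights, listed in manuscript order.
paragraphs = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.3, 0.7), (0.1, 0.9), (0.0, 1.0)]

scores = [topic_score(a, b) for a, b in paragraphs]
trend = moving_average(scores, window=3)

print([round(t, 2) for t in trend])
```

A wider window gives a smoother line but blurs abrupt changes between neighbouring paragraphs, which matters when judging whether a transition is gradual or sudden.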
Sorry to insist about my analysis, but I would like to share some thoughts about the previous plot with you, and if you wish to write your ideas, they would be very welcome.

I find this plot fascinating. Looking at it from left to right, it seems that the MS starts with a very strong topic 0 (let's call it A, as Currier did). Most of the first 1/3 of the herbal, it is almost pure A. But then gradually, paragraphs that mix B topic words appear, and appear quite strongly. They are not as continuous as strong topic A paragraphs, but they appear suddenly after 1/3 of herbal paragraphs. Are the herbs starting to be described partially with topic B?

Then we get to the cosmological and astronomical part of the MS. There, we can see that there is a mixture of topic A and B. Note that in those folia, I "created" the paragraphs by grouping the small sentences and labels, making a sort of bag of words. These words, even if they are mixed, seem to tend to topic A (note that the black line is above 0).

Then we get into the zodiac part of the MS. It starts above 0 and finishes below 0. It seems that the mean topic changes from A to B gradually in that part of the MS. Well, it is maybe not so strange... maybe the stars apply to the plants and the zodiac applies to the humans. So there is a trend from topic A (plants) to topic B (humans) all through the astronomical and zodiac part.

Because when we get into the biological part, topic B is clearly the boss there. There is no place for topic A (the plants topic), only for topic B (the human? topic).

But then, when we go further to the following cosmological part... look how different it is from the first cosmological part! It is clearly a topic B cosmological part, while the first one was more about topic A.

And what happens when we get into the Pharmaceutical part? Again topic A is the main topic. Strange? Well, not at all. If that part talks about herbs and recipes, maybe the herbal topic is the natural one.

But look again at the change when we get into the marginal stars part of the MS. The main topic changes again to B. But look how mixed it is. There are plenty of paragraphs that have almost 50/50 A/B topic words. Maybe pages about medical recipes that mix topic A (herbal) with topic B (human)? Maybe medical treatments?

Well, I find it fascinating. As said at the beginning of this post, your thoughts are very welcome!

P.S.: I know that this kind of machine learning algorithm might seem complicated or even strange to apply to the Voynich. I would suggest looking at the results of a test Jorge Stolfi proposed to me: to see whether this kind of model was able to distinguish Portuguese, phonetic Portuguese and Spanish. And it did so clearly. See [link] to get an idea of how the models work and how to interpret the results on the Voynich (if we can interpret them...).
Hi quimqu, as I pointed out [link], this appears to overlap with the ongoing research by Lisa Fagin Davis and Colin Layfield. We know that quires were put together in an at least partly arbitrary way; trying to understand more of the order in which the ms was created is extremely interesting.

Quote: Most of the first 1/3 of the herbal, it is almost pure A. But then gradually, paragraphs that mix B topic words appear, and appear quite strongly. They are not as continuous as strong topic A paragraphs, but they appear suddenly after 1/3 of herbal paragraphs. Are the herbs starting to be described partially with topic B?
We have known what is happening here since Currier’s analysis half a century ago: herbal bifolios by different scribes were mixed and bound together. This is detailed in [link] (Table 1). All pages from f1 to f25 (“the first 1/3 of the herbal”) were created by scribe 1. After that point, scribe 2 pages begin to be inter-mixed with scribe 1 pages. If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.

The CUVA bigram plots [link] (or before) are a good match for your topic line. See bottom of the page.
[attachment=11754]
E.g. the plot for ‘ed’ shows how the zodiac section (gray) appears to gradually shift from Currier A (bottom) towards Currier B (top). It also shows that Quire13 Bio is more strongly “B” than Quire20 Star-Paragraphs. It also shows that Pharma (yellow) is comparable with Herbal A pages (both HA and Pharma are attributed to Scribe 1).
In general, the fact that there are such huge differences at bigram level does not sound like different topics, but rather like different languages (like Portuguese vs Spanish, as you mention). But of course the gradual shift between the two A-B extremes does not fit with the idea of different languages (and we know that Voynich bigrams cannot correspond to bigrams in European languages).

[attachment=11755]
As Rene’s plots show, this information can be used to reorder Voynich sections, so that the text begins with Strong-A (HA) and ends with Strong-B (Bio). If I understand correctly, Lisa and Colin are going even more in-depth, also considering the stains that affect many of the pages; they are working on bifolio-level reordering, rather than section-level. I expect that their paper will be a major step forward in our understanding of the structure of the text.

EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are more noisy and unreliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.
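The point about short samples can be illustrated with a small simulation. This is a simplified sketch, not the topic model used in this thread: it assumes each word of a paragraph is independently "B-typical" with a fixed probability, and the paragraph lengths (20 and 200 words), the probability 0.7, and the seed are all invented for illustration.

```python
import random
import statistics


def estimated_share(n_words, true_share, rng):
    """Fraction of 'B-typical' words observed in one simulated paragraph."""
    hits = sum(1 for _ in range(n_words) if rng.random() < true_share)
    return hits / n_words


rng = random.Random(0)   # fixed seed so the run is repeatable
true_share = 0.7         # hypothetical underlying share of B-typical words
n_trials = 2000

# Simulate many short and many long paragraphs with the same underlying share.
short = [estimated_share(20, true_share, rng) for _ in range(n_trials)]
long_ = [estimated_share(200, true_share, rng) for _ in range(n_trials)]

spread_short = statistics.pstdev(short)
spread_long = statistics.pstdev(long_)

print(f"spread of estimates, 20-word paragraphs:  {spread_short:.3f}")
print(f"spread of estimates, 200-word paragraphs: {spread_long:.3f}")
```

Under this simplified model the spread shrinks roughly with the square root of the paragraph length, so two sections with the same underlying mixture can look different in consistency purely because of paragraph size.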
Hello Marco,

Thanks for your reply. You have much more experience with the Voynich, and I am a novice at it.

(20-10-2025, 06:29 AM)MarcoP Wrote: If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.

Please note that I don’t claim to know whether what the model detects corresponds to a “topic”, a “dialect”, or a “writing style”. What I can confirm is that the model identifies a mixture of these components within many paragraphs. That’s why in the following plots the colors appear softer rather than purely red or blue for those mixed paragraphs.
This indicates that many words are shared between the two distributions, suggesting that the underlying components are not completely distinct. In my opinion, this could point to closely related languages or styles, or to genuine topics, in the sense that certain words are more typical of specific semantic contexts, while still co-occurring with shared vocabulary across the manuscript.

(20-10-2025, 06:29 AM)MarcoP Wrote: Hi quimqu, as I pointed out [link], this appears to overlap with the ongoing research by Lisa Fagin Davis and Colin Layfield. We know that quires were put together in an at least partly arbitrary way; trying to understand more of the order in which the ms was created is extremely interesting.
(20-10-2025, 06:29 AM)MarcoP Wrote: If I understand correctly, Lisa and Colin are going even more in-depth, also considering the stains that affect many of the pages; they are working on bifolio-level reordering, rather than section-level. I expect that their paper will be a major step forward in our understanding of the structure of the text.

Yes, I attended Lisa's presentation. I was surprised and happy to see that, apart from the study of stains on the pages, another topic-related study was ongoing. I’m looking forward to seeing the results and checking how wrong I am. By the way, if Lisa reads this, I’m open to collaborating if they need any kind of work done.

(20-10-2025, 06:29 AM)MarcoP Wrote: We know what is happening here since Currier’s analysis half a century ago: herbal bifolios by different scribes were mixed and bound together. This is detailed in [link] (Table 1). All pages from f1 to f25 (“the first 1/3 of the herbal”) were created by scribe 1. After that point, scribe 2 pages begin to be inter-mixed with scribe 1 pages. If we take the illustrations as indicative of a topic, the change in statistics does not appear to be due to a different topic, but to a different scribe.

Right. My model shows this quite clearly. Even if there are some mixed paragraphs, it has separated the bifolia by topic quite well (sorry for the large image, but if I make it smaller, it becomes unreadable):
[attachment=11759]
Here we can see that all herbal bifolia are "language A" except for bifolia D2 (f26r, f26v, f31r, f31v), E1 (f26r, f26v, f31r, f31v), E2 (f34r, f34v, f39r, f39v), F1 (f41r, f41v, f48r, f48v), F3 (f43r, f43v, f46r* this one is marked as lang. A, f46v), G2 (f50r, f50v, f55r, f55v), and Q2 (f94r, f94v, f95r, f95v). And our beloved bifolio H1 (f57r, f66v), which is really mixed.

I have grouped all the bifolia in the MS, and here is how my model determines the "languages":
[attachment=11756]
Most bifolia seem to have a unique "language", but you can see that some are quite mixed.
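A rough sketch of how paragraph-level topic weights could be rolled up to bifolio level and flagged as mixed. The records, the averaging, and the 0.25/0.75 thresholds are assumptions made for illustration; they are not the values or the procedure behind the attached plot.

```python
from collections import defaultdict

# Hypothetical records: (bifolio label, weight of topic B in one paragraph).
paragraph_weights = [
    ("A1", 0.02), ("A1", 0.05), ("A1", 0.01),
    ("E2", 0.95), ("E2", 0.90),
    ("H1", 0.10), ("H1", 0.85), ("H1", 0.55),
]


def summarise_bifolia(records, mixed_band=(0.25, 0.75)):
    """Average the topic-B weight per bifolio and assign a coarse label."""
    grouped = defaultdict(list)
    for bifolio, weight_b in records:
        grouped[bifolio].append(weight_b)

    summary = {}
    for bifolio, weights in grouped.items():
        mean_b = sum(weights) / len(weights)
        if mean_b < mixed_band[0]:
            label = "A"
        elif mean_b > mixed_band[1]:
            label = "B"
        else:
            label = "mixed"
        summary[bifolio] = (round(mean_b, 3), label)
    return summary


for bifolio, (mean_b, label) in summarise_bifolia(paragraph_weights).items():
    print(bifolio, mean_b, label)
```

Averaging hides whether a "mixed" bifolio contains uniformly mixed paragraphs or a blend of pure A and pure B paragraphs, so the per-paragraph values are still worth inspecting for those cases.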

(20-10-2025, 06:29 AM)MarcoP Wrote: The CUVA bigram plots [link] (or before) are a good match for your topic line. See bottom of the page.
E.g. the plot for ‘ed’ shows how the zodiac section (gray) appears to gradually shift from Currier A (bottom) towards Currier B (top). It also shows that Quire13 Bio is more strongly “B” than Quire20 Star-Paragraphs. It also shows that Pharma (yellow) is comparable with Herbal A pages (both HA and Pharma are attributed to Scribe 1).

Right. This is already shown by the model (note that I use the quire notation from the EVA file):
[attachment=11758]
You can see some quires where the language is mixed, suggesting that maybe the bifolia don’t correspond exactly to the quire. At first sight D, E, F, G, H (?), I (?), Q

(20-10-2025, 06:29 AM)MarcoP Wrote: EDIT: another point that I think Rene mentioned in the past. Results based on only a few samples are more noisy and unreliable than results based on larger sets. This could play a role in the fact that Bio/Q13 paragraphs get more consistent results than the much shorter Stars/Q20 paragraphs.

Even though I agree in general terms, I don’t fully agree regarding topic modelling. Topic modelling is intended to be applied at the sentence level. The shorter Stars paragraphs are actually the perfect size, whereas the longer herbal or biological paragraphs might be too long. I suppose (though I can’t confirm it) that the longer paragraphs should contain internal sentences, but since there is no punctuation, we can’t recognize them yet.
(20-10-2025, 06:29 AM)MarcoP Wrote: this information can be used to reorder Voynich sections, so that the text begins with Strong-A (HA) and ends with Strong-B (Bio). If I understand correctly, Lisa and Colin are going even more in-depth, also considering the stains that affect many of the pages; they are working on bifolio-level reordering, rather than section-level.

But if the pages/folios/bifolios were to be reordered based on any such "linguistic" similarity criterion, the transitions on the above plots would be gradual --- rather than abrupt steps, as they should be if they were due to switching Scribes or Authors.  No?

It still seems possible that the differences in "languages" are due partly to different topics, partly to the Author changing the spelling/encryption between one section and the next.  Any Scribe changes then could be just incidental, a secondary effect of the writing having stretched over a long time (years or decades) -- which would not per se affect the "language".  No?

All the best, --stolfi