The Voynich Ninja

Full Version: Automated Topic Analysis of the Voynich Manuscript
The interesting part is what happens inside each topic. The model’s boundaries are not absolute: some folios and even single paragraphs show mixed patterns, and the transition between A and B is gradual rather than abrupt.
This suggests we’re not seeing two distinct “languages”, but rather two registers or writing modes of the same underlying system (perhaps evolving over time, or reflecting different scribal habits or styles).
The UMAP projection below (by paragraph) shows this: each dot is a paragraph, positioned by overall word co-occurrence patterns.

[attachment=11725]

Even though the LDA model was forced to find only two main topics (which, without being given any labels, correspond roughly to Currier A and B), the UMAP projection clearly shows three large distinct arms or regions (and even 8-9 smaller ones).
This suggests internal variation inside both A and B: perhaps different scribal habits or styles, subject or topic sections, or chronological phases.
The Voynich script doesn’t behave as a clean binary system, but as a continuum with multiple internal clusters.
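For anyone who wants to try this at home, here is a minimal sketch of the pipeline (not my exact script; the paragraph texts and Currier labels are placeholders, and it assumes the scikit-learn and umap-learn packages; as clarified further down the thread, the UMAP runs on the per-paragraph topic mixtures, not on the word counts):

Code:
# Sketch of the pipeline: paragraphs -> word counts -> LDA(2 topics) -> UMAP.
import umap
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = ["daiin chedy qokeedy ...", "..."]  # placeholder: one EVA string per paragraph
currier = ["A", "B"]                             # placeholder: known Currier label per paragraph

X = CountVectorizer().fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                # per-paragraph topic mixture, shape (n, 2)
dominant = doc_topics.argmax(axis=1)             # compare with `currier` (up to a 0/1 swap)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(doc_topics)  # the 2D map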
(17-10-2025, 05:32 PM)quimqu Wrote: The UMAP projection below

Could you please explain what the two coordinates are?

All the best, --stolfi

(Since there is now another "jorge" posting to this forum, I should sign "stolfi" from now on...)
(17-10-2025, 07:49 PM)Jorge_Stolfi Wrote:
(17-10-2025, 05:32 PM)quimqu Wrote: The UMAP projection below

Could you please explain what the two coordinates are?

All the best, --stolfi

(Since there is now another "jorge" posting to this forum, I should sign "stolfi" from now on...)

Hi (Jorge) Stolfi, I'll try to explain UMAP coordinates.

UMAP is a way to turn complex data into a 2D picture so we can see patterns. In this case, each paragraph is first turned into a set of numbers (a vector) that describe how often different words appear together, a sort of fingerprint for the paragraph. Those fingerprints live in a huge mathematical space (maybe hundreds of dimensions).

UMAP takes all that information and squeezes it down to two coordinates, X and Y, in a way that tries to keep similar things close together and different things far apart. So, points that are close on the map have similar word patterns and points that are far apart use different kinds of words or structures.

The X and Y values themselves don’t have meaning like time or frequency. They’re just the result of UMAP arranging all the texts so that their distances reflect how similar their word patterns are.
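A quick way to convince yourself of this (a toy sketch, assuming the umap-learn and scikit-learn packages): run UMAP twice on the same data with different seeds. The coordinates come out completely different, but the neighborhoods stay the same.

Code:
import umap
from sklearn.datasets import make_blobs

# Toy data with some real cluster structure: 300 points in 50 dimensions.
data, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)
emb1 = umap.UMAP(random_state=1).fit_transform(data)
emb2 = umap.UMAP(random_state=2).fit_transform(data)
# emb1 and emb2 have completely different coordinates (rotated, flipped,
# rescaled), yet in both the same points end up close to each other:
# only the neighborhood structure is meaningful, not X and Y themselves.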

And now, the strange thing about this UMAP. I have tried the same models on headlines about crime and sports: with 300 headlines per topic I got a classification accuracy of about 85%, which is quite good. But when I plotted their UMAP...

[attachment=11726]

Well, this is a typical UMAP, the kind data scientists expect to find when working with clusters and so on. Look how different the UMAP of the Voynich (at paragraph level) is!!!

[attachment=11727]

In the Voynich UMAP plot, the paragraphs don’t form clouds or clusters as in normal texts; they align along narrow, continuous lines. This means that each paragraph is very similar to its neighbors (the previous and next ones along the UMAP "line"), but much less similar to the rest. In other words, the text seems to follow a very regular pattern, where nearby paragraphs (in terms of UMAP) share nearly the same word combinations. This is amazing.
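One way to check this impression numerically (a sketch with scikit-learn, reusing the `embedding` array from the pipeline sketch above): in a chain, each point should have about two close neighbors and then a clear jump in distance to the third.

Code:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# `embedding`: the (n_paragraphs, 2) UMAP coordinates behind the plot above.
dist, _ = NearestNeighbors(n_neighbors=4).fit(embedding).kneighbors(embedding)
# dist[:, 0] is the point itself (distance 0); columns 1-3 are the 3 nearest.
ratio = dist[:, 3] / dist[:, 2]      # 3rd-nearest vs 2nd-nearest distance
print(np.median(ratio))              # a median well above 1 supports the "chain" picture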

Logic tells me that if I take any paragraph, there should be exactly two very similar paragraphs, but no more! I must admit I had to search around and use some GPT help to understand what is happening here, because I have never seen this. After explaining and showing the plots, the output was:

That linear structure can have several logical explanations:
  1. Strong local similarity – each paragraph is extremely similar to the next one, as if the text evolved by small gradual changes rather than introducing new combinations of words.
    → This produces a “chain” effect in UMAP: each paragraph links to its immediate neighbors.
  2. Limited combinatorial diversity – if the text is generated by a small set of rules or templates (for example, repeating prefixes, suffixes, or word frames), the possible variations are restricted.
    → The UMAP projection then collapses those patterns into narrow linear tracks instead of broad clusters.
  3. Sequential dependency – the text might have been produced in sequence, reusing nearby words or structures (like a Markov process), so neighboring lines are statistically dependent.
    → This continuity is reflected as “lines” in topic space (a toy demonstration follows after this list).
  4. Lack of thematic jumps – natural language paragraphs shift topics often; the Voynich does not.
    → Without those thematic jumps, UMAP finds no reason to break the flow into separate groups.
(17-10-2025, 09:40 PM)quimqu Wrote: UMAP is a way to turn complex data into a 2D picture so we can see patterns. In this case, each paragraph is first turned into a set of numbers (a vector) that describe how often different words appear together, a sort of fingerprint for the paragraph. Those fingerprints live in a huge mathematical space (maybe hundreds of dimensions).

I've only heard a bit about UMAP and never had a chance to use it or understand what it does exactly, but some cursory reading on the internet suggests that UMAP tries to preserve the topology of the underlying set and that its low-dimensional axes are usually meaningless. Is this correct? I'm not sure how to interpret these graphs.
(17-10-2025, 09:40 PM)quimqu Wrote: In other words, the text seems to follow a very regular pattern, where nearby paragraphs (in terms of UMAP) share nearly the same word combinations. This is amazing.

Indeed!

However, the sequence cannot be the current page sequence as defined by the folio numbers. The chain must instead reflect the "true" page sequence.

Many years ago I created a similar plot [link], but connected the dots in the "official" page order, and the result was a rat's nest for each section.

All the best, --stolfi
(17-10-2025, 09:53 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.
(17-10-2025, 09:40 PM)quimqu Wrote: UMAP is a way to turn complex data into a 2D picture so we can see patterns. In this case, each paragraph is first turned into a set of numbers (a vector) that describe how often different words appear together, a sort of fingerprint for the paragraph. Those fingerprints live in a huge mathematical space (maybe hundreds of dimensions).

I've only heard a bit about UMAP and never had a chance to use it or understand what it does exactly, but some cursory reading on the internet suggests that UMAP tries to preserve the topology of the underlying set and that its low-dimensional axes are usually meaningless. Is this correct? I'm not sure how to interpret these graphs.

It’s a dimensionality reduction in the spirit of Principal Component Analysis (PCA), but with a very different philosophy. PCA looks for straight-line directions that explain most of the overall variation in the data: it’s linear and global. UMAP, on the other hand, tries to preserve the local relationships between points. It builds a kind of graph of “who is close to whom” in the high-dimensional space and then tries to project that structure into two dimensions while keeping those neighborhood relations as intact as possible.

So yes, you’re right, the actual numeric values of the UMAP axes don’t mean anything by themselves. The shape and distances do mean something: points that are close together in the plot are similar in the original space, and those far apart are dissimilar. It’s more about the geometry and continuity than the axes.

That’s why, in natural language data, you normally get cloud-like clusters (similar topics grouped together), but in the Voynich, the result is very different: the points form continuous lines. That suggests that each paragraph is most similar to the next and previous one, like a chain or trajectory.
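For the record, a minimal side-by-side sketch of the two (using scikit-learn's digits dataset purely as toy data):

Code:
import umap
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                                 # 1797 points in 64 dimensions
pca_2d = PCA(n_components=2).fit_transform(X)          # linear, global variance
umap_2d = umap.UMAP(n_neighbors=15).fit_transform(X)   # nonlinear, local neighborhoods
# In pca_2d the axes are actual directions in the 64-D space (the two with
# most variance); in umap_2d the axes are arbitrary and only distances matter.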

(17-10-2025, 10:01 PM)Jorge_Stolfi Wrote:
(17-10-2025, 09:40 PM)quimqu Wrote: In other words, the text seems to follow a very regular pattern, where nearby paragraphs (in terms of UMAP) share nearly the same word combinations. This is amazing.

Indeed!

However, the sequence cannot be the current page sequence as defined by the folio numbers. The chain must instead reflect the "true" page sequence.

Many years ago I created a similar plot [link], but connected the dots in the "official" page order, and the result was a rat's nest for each section.

All the best, --stolfi

Absolutely not. It is not the folio or paragraph sequence. I will try to find that sequence.
(17-10-2025, 10:01 PM)Jorge_Stolfi Wrote:
(17-10-2025, 09:40 PM)quimqu Wrote: In other words, the text seems to follow a very regular pattern, where nearby paragraphs (in terms of UMAP) share nearly the same word combinations. This is amazing.

Indeed!

However, the sequence cannot be the current page sequence as defined by the folio numbers. The chain must instead reflect the "true" page sequence.

Many years ago I created a similar plot [link], but connected the dots in the "official" page order, and the result was a rat's nest for each section.

All the best, --stolfi

Please beware that these are UMAPs over the topic distribution of each paragraph, not UMAPs over the words directly.
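Concretely, the difference is only which matrix goes into UMAP (a sketch reusing the names `doc_topics` and `X` from the pipeline sketch earlier in the thread):

Code:
import umap

# What the plots show: UMAP over the per-paragraph topic mixtures,
emb_topics = umap.UMAP().fit_transform(doc_topics)  # input shape (n_paragraphs, 2)
# ...NOT UMAP over the raw word-count vectors:
# emb_words = umap.UMAP().fit_transform(X)          # input shape (n_paragraphs, n_words)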
(17-10-2025, 10:08 PM)quimqu Wrote: That’s why, in natural language data, you normally get cloud-like clusters (similar topics grouped together), but in the Voynich, the result is very different: the points form continuous lines. That suggests that each paragraph is most similar to the next and previous one, like a chain or trajectory.

So, if this result is correct, could it imply that each paragraph is somehow built from the previous one? Something like self-citation or similar methods?
(17-10-2025, 10:32 PM)oshfdk Wrote:
(17-10-2025, 10:08 PM)quimqu Wrote: That’s why, in natural language data, you normally get cloud-like clusters (similar topics grouped together), but in the Voynich, the result is very different: the points form continuous lines. That suggests that each paragraph is most similar to the next and previous one, like a chain or trajectory.

So, if this result is correct, could it imply that each paragraph is somehow built from the previous one? Something like self-citation or similar methods?

Not exactly. I need to understand and find the sequence. Note that the UMAPs shown are not calculated directly from the original words, but from the topic distribution of each paragraph. We have two topics, but it seems that the distributions shift from one paragraph to the next.
(17-10-2025, 10:34 PM)quimqu Wrote: Not exactly. I need to understand and find the sequence. Note that the UMAPs shown are not calculated directly from the original words, but from the topic distribution of each paragraph. We have two topics, but it seems that the distributions shift from one paragraph to the next.

If these just track short topic vectors, then it's possible that the paragraphs are simply being ordered by the model according to the percentage of the dominant topic in each paragraph (or some other simple function), in which case for 4 topics you would get 4 snake-like clusters, each representing the paragraphs rich in topic 1, 2, 3 or 4, sorted by the weight of the main topic.

If you are modeling low-dimensional data, you naturally get a very low-dimensional, simple picture; isn't this expected? There is no feature expansion or anything like that.
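To illustrate (a synthetic sketch, not the manuscript data): with two topics the mixture weights sum to 1, so the input to UMAP is really one-dimensional, and even completely unordered random mixtures come out as a thin curve.

Code:
import numpy as np
import umap

rng = np.random.default_rng(0)
p = rng.uniform(size=500)
mix = np.column_stack([p, 1 - p])    # random, completely unordered 2-topic mixtures
emb = umap.UMAP(n_neighbors=10, random_state=0).fit_transform(mix)
# Even with no sequential structure at all, the embedding traces out a thin
# curve: two topic weights that sum to 1 are one-dimensional data in disguise,
# so the low-dimensional input alone can explain the snake-like shape.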