The Voynich Ninja

Full Version: Mapping Voynich connections through rare tokens
Pages: 1 2 3 4
(02-06-2026, 01:18 PM)nablator Wrote:
(01-06-2026, 12:55 PM)quimqu Wrote: So I started wondering if the opposite approach might actually be more informative.

Or you could use a balanced approach where rare terms are upweighted to reflect their relative importance.

Quote:In addition to creating the A matrix as described, which uses straight TF values, two weighting schemes are also employed to modify the values contained in A. The two schemes applied are Term Frequency-Inverse Document Frequency (TF-IDF) and Log-Entropy (LE).

I don’t think that Lisa's work is really the same thing as what I’m doing (it is similar to my automatic topic analysis). What they describe there is basically a weighting scheme inside an LSA/topic-modelling framework. TF-IDF increases the influence of rarer words and reduces the effect of very common ones. That is standard NLP.
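
To make the weighting concrete, here is a minimal sketch of the inverse-document-frequency part of TF-IDF. It is only an illustration, not the code used in the paper or in this thread; the function name `idf_weights`, the input format and the toy pages are assumptions.

```python
import math
from collections import Counter

def idf_weights(pages):
    """pages: dict page_id -> list of tokens (hypothetical input format).
    Returns {token: log(N / df)}, where df is the number of pages containing the token."""
    n_pages = len(pages)
    df = Counter()
    for tokens in pages.values():
        df.update(set(tokens))  # count each token once per page
    return {tok: math.log(n_pages / count) for tok, count in df.items()}

# Toy pages; the token distribution is invented for illustration.
pages = {
    "page1": ["daiin", "daiin", "qokedy"],
    "page2": ["daiin", "chol"],
    "page3": ["daiin", "qokedy", "cthor"],
}
idf = idf_weights(pages)
# A token present on every page gets weight 0; rarer tokens get larger weights.
print(sorted(idf.items(), key=lambda kv: kv[1]))
```

Multiplying these weights by the per-page term counts gives the TF-IDF values; other variants (smoothed IDF, log-scaled TF) exist and would change the numbers but not the idea.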

But their goal is still global semantic structure and document similarity through latent-space methods. They are trying to see whether nearby text behaves coherently, whether sections cluster, whether discourse segmentation exists, etc.

My approach is much more explicit and local. I am not embedding the text into a latent semantic space or reducing dimensions. I am literally tracing the actual low-frequency tokens themselves across folios and sections, then building direct overlap networks from those tokens. In practice this changes the interpretation a lot.
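
A rough sketch of what such a direct overlap network could look like in code. This is not the actual script behind the graphs; the function name, the `min_freq`/`max_freq` cut-offs and the toy folios are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def rare_token_graph(folios, min_freq=2, max_freq=5):
    """folios: dict folio_id -> list of tokens (hypothetical input format).
    Returns {(folio_a, folio_b): set of rare tokens shared by both folios}."""
    total = Counter(tok for toks in folios.values() for tok in toks)
    # Keep only tokens whose manuscript-wide count falls in the chosen band.
    rare = {t for t, c in total.items() if min_freq <= c <= max_freq}
    where = defaultdict(set)  # rare token -> folios it occurs on
    for fid, toks in folios.items():
        for t in toks:
            if t in rare:
                where[t].add(fid)
    edges = defaultdict(set)
    for tok, fids in where.items():
        for a, b in combinations(sorted(fids), 2):
            edges[(a, b)].add(tok)
    return dict(edges)

# Toy folios; labels and token counts are invented.
folios = {
    "f1r": ["qokain", "daiin", "daiin"],
    "f1v": ["qokain", "daiin", "daiin"],
    "f2r": ["cthor", "daiin", "daiin"],
}
edges = rare_token_graph(folios)
print(edges)
```

Because each edge stores the tokens that created it, any link can be inspected token by token rather than read off a latent-space distance.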

A TF-IDF weighted LSA model may tell you that two pages are “similar” in some abstract statistical sense (so do other topic models). What I’m looking at is closer to: “this exact semi-rare token appears in these particular folios and creates this concrete bridge between these sections”. So when [link] or some Marginal Stars folios emerge as hubs, I can inspect the exact tokens creating those links page by page. That is a different scale of analysis.

Also, interestingly, their own paper actually supports part of what I’m seeing. They found that Biological, Pharmaceutical and Stars sections show unusually high internal cohesion compared to other sections. That is not very far from the kind of structure emerging in my rare-token graphs.

The important difference is that I’m not measuring coherence of adjacent text windows. I’m measuring selective reuse of uncommon vocabulary across distant folios.
(02-06-2026, 02:02 PM)Dunsel Wrote: If I strip the gallows from those hapax tokens, that decline flattens out.

That is interesting, although I would probably be careful interpreting the gallows result too deeply for now.

If removing gallows causes many forms to collapse into already existing tokens, then part of the flattening may simply come from reducing the vocabulary space artificially. The Voynich token space already has extremely small Levenshtein distances between many words, so removing one high-information element like gallows can easily create merges with forms that already exist elsewhere in the manuscript.
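
One way to check how much of the flattening is a merge artefact is to count how many hapax collapse onto another word type once gallows are removed. The sketch below assumes the plain EVA gallows are `t`, `k`, `p` and `f`, and it strips those letters everywhere (including inside benched forms such as `cth`); both choices, the function names and the toy tokens are assumptions, not the procedure actually used in the thread.

```python
from collections import Counter, defaultdict

GALLOWS = set("tkpf")  # assumption: EVA letters treated as gallows

def strip_gallows(token):
    """Remove every gallows letter from a token."""
    return "".join(ch for ch in token if ch not in GALLOWS)

def hapax_collapse(tokens):
    """Return (number of hapax, number of hapax whose stripped form
    coincides with the stripped form of a different word type)."""
    counts = Counter(tokens)
    hapax = [t for t, c in counts.items() if c == 1]
    originals = defaultdict(set)  # stripped form -> original word types
    for t in counts:
        originals[strip_gallows(t)].add(t)
    merged = [h for h in hapax if len(originals[strip_gallows(h)]) > 1]
    return len(hapax), len(merged)

# Toy token stream, invented for illustration.
tokens = ["okaiin", "oaiin", "oaiin", "chol", "qotchy"]
print(hapax_collapse(tokens))
```

If a large share of hapax merge in this way, the flatter curve says more about the shrinking vocabulary than about gallows themselves.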

My own objective was a bit different anyway. I was trying to see whether low-frequency tokens create bridges between folios and sections. That is why I did not use hapax legomena initially: a single occurrence cannot really create network structure or repeated connections between pages.
(02-06-2026, 02:37 PM)quimqu Wrote: My own objective was a bit different anyway. I was trying to see whether low-frequency tokens create bridges between folios and sections. That is why I did not use hapax legomena initially: a single occurrence cannot really create network structure or repeated connections between pages.

But if one scribe creates a hapax in their work and that word then gets used or copied by another scribe, that creates a network between scribes. And keep in mind that more than just a few folios in Herbal are not Currier A (Scribe 1). If you're testing all of Herbal, then you're mixing two regimes and will get a false network connecting to other Scribe 2 pages in other sections. The same goes for Scribes 1 and 3 in Pharma.
(02-06-2026, 02:49 PM)Dunsel Wrote:
(02-06-2026, 02:37 PM)quimqu Wrote: My own objective was a bit different anyway. I was trying to see whether low-frequency tokens create bridges between folios and sections. That is why I did not use hapax legomena initially: a single occurrence cannot really create network structure or repeated connections between pages.

But if one scribe creates a hapax in their work and that word then gets used or copied by another scribe, that creates a network between scribes. And keep in mind that more than just a few folios in Herbal are not Currier A (Scribe 1). If you're testing all of Herbal, then you're mixing two regimes and will get a false network connecting to other Scribe 2 pages in other sections. The same goes for Scribes 1 and 3 in Pharma.

I did a quick graph of how the rare tokens are distributed by writing hand (according to the EVA transliteration). There are plenty of connections, meaning that the use of the rare tokens is transversal across scribes.

[attachment=15896]
(02-06-2026, 03:13 PM)quimqu Wrote: I did a quick graph of how the rare tokens are distributed by writing hand (according to the EVA transliteration). There are plenty of connections, meaning that the use of the rare tokens is transversal across scribes.

And that's showing pretty much the same thing I was seeing with the hapax. Scribe 1's connection weakens sequentially from Scribe 2 to Scribe 5. You're showing the 1-3 connection slightly stronger than in the tests I ran, which is not unexpected, and 2 and 3 are showing a strong connection.
I made another small experiment using the rare-token networks.

This time I started from a speculative assumption of mine: maybe the Text-only, Pharmaceutical and Marginal stars folios are the parts of the manuscript where recipes, procedures, remedies or references are explained. Not necessarily "meaning" in a direct sense, but perhaps more functional or referential text.

So I checked how the Herbal and Biological sections connect to those three "hub-like" sections using only low-frequency tokens.

For Herbal and Biological, I asked: what is the lowest maximum token frequency at which every page in the section shares at least one uncommon token with Text-only, Pharmaceutical or Marginal stars folios?

Results:
Source section | Full coverage threshold | Distribution of links
Biological (balneological) | max token freq = 3 | 75% Stars, 15% Text-only, 10% Pharmaceutical
Herbal | max token freq = 6 | 58% Stars, 20% Text-only, 22% Pharmaceutical
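
The threshold search can be sketched roughly as follows. This is an illustration under assumptions, not the script that produced the numbers above: token frequency is taken to be the manuscript-wide count, the search starts at 2 (a token shared by two pages occurs at least twice), and the function name, `max_k` limit and toy pages are all hypothetical.

```python
from collections import Counter

def coverage_threshold(pages, section_of, source, targets, max_k=50):
    """Smallest k such that every page of `source` shares at least one token
    with manuscript-wide frequency <= k with some page in `targets`.
    Returns None if no k up to max_k gives full coverage."""
    freq = Counter(t for toks in pages.values() for t in toks)
    target_vocab = set()
    for pid, toks in pages.items():
        if section_of[pid] in targets:
            target_vocab.update(toks)
    source_pages = [pid for pid in pages if section_of[pid] == source]
    for k in range(2, max_k + 1):
        if all(
            any(freq[t] <= k and t in target_vocab for t in pages[pid])
            for pid in source_pages
        ):
            return k
    return None

# Toy data, invented for illustration.
pages = {
    "h1": ["qokain", "chol"],
    "h2": ["chedy", "chedy", "shol"],
    "s1": ["qokain", "chedy", "chedy"],
}
section_of = {"h1": "Herbal", "h2": "Herbal", "s1": "Marginal stars"}
print(coverage_threshold(pages, section_of, "Herbal", {"Marginal stars"}))
```

In the toy data the first Herbal page is already covered at k = 2, but the second one only connects through a token that occurs four times, so the section as a whole needs k = 4.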

I also checked direct shared semi-rare vocabulary between sections:

Section pair | Shared semi-rare tokens
Herbal ↔ Marginal stars | 560
Biological ↔ Marginal stars | 319
Herbal ↔ Biological | 249
Herbal ↔ Text-only | 208
Biological ↔ Text-only | 106

What I find interesting is not just that both sections eventually connect, but how they connect.

Biological becomes fully covered extremely quickly using only very rare tokens, and most of those links point toward Marginal stars folios. Herbal also connects, but it requires slightly less rare vocabulary and the connections are more distributed.

At the same time, Herbal and Biological do share a considerable amount of uncommon vocabulary directly between them. So this is not simply a case of isolated sections. But even then, Marginal stars still remains the strongest lexical attractor for both.

That is probably the most interesting part of the analysis so far.

Of course this does not prove semantic meaning. But if those hub-like sections really contain some kind of explanatory, procedural or referential content, then this starts looking less like random local repetition and more like selective reuse of uncommon vocabulary between specific textual communities.
(02-06-2026, 03:25 PM)Dunsel Wrote: And that's showing pretty much the same thing I was seeing with the hapax. Scribe 1's connection weakens sequentially from Scribe 2 to Scribe 5. You're showing the 1-3 connection slightly stronger than in the tests I ran, which is not unexpected, and 2 and 3 are showing a strong connection.

Interesting! I have no support for this, but if Scribe 1's influence weakens as we go along the scribe vector from 2 to 5 and there are adjacency effects, it suggests to me that the project was done serially, and perhaps not as I and others have envisioned where the author/leader worked in a room with the scribes as a group. I don't recall any evidence for the sequence of creation, and I don't know how the scribes were numbered. There doesn't seem to be much objection to [link] as the intended first page and the first page written. I went back and read Lisa's "How Many Glyphs, and How Many Scribes" article, which gives some details about which scribe wrote where, and it's complex. That article doesn't support some kind of sequential work.
(03-06-2026, 01:55 AM)Eiríkur Wrote: Interesting! I have no support for this, but if Scribe 1's influence weakens as we go along the scribe vector from 2 to 5 and there are adjacency effects, it suggests to me that the project was done serially, and perhaps not as I and others have envisioned where the author/leader worked in a room with the scribes as a group.

Not to steal [link]'s thunder, but see my post here and tell me what you think: [link]

I believe Davis has publicly stated that, because she's a paleographer and codicologist, her work isn't about interpreting the glyphs or their meaning. I seem to recall that she mentioned that they were working on page ordering by studying the water stains in the herbal section. But I believe she's more into scribal hands and the physical arrangement of the manuscript than the numbers.

(03-06-2026, 01:55 AM)Eiríkur Wrote: if Scribe 1's influence weakens as we go along the scribe vector from 2 to 5

I believe [link]'s brief look at the scribe connections is saying roughly the same thing. When you look at the huge difference even with the broad strokes of Currier A and B, that suggests serial, not parallel, production. That, or each scribe had their own set of rules they followed.

Oh, and I'd argue with your "first page written", but that's another discussion.
The differences that categorise languages A and B (character pairs or, eo, cho high in A and low in B; character pairs ee, ed high in B and low in A; word suffix edy high in B and low in A) are also present in the rare words in these language sections. (See attached.)

So, this is biasing your numbers. It should not surprise that hands 2 and 3 and quires 13 and 20 have better than expected probabilities of sharing these words.
The text may not be readable yet, but the manuscript is starting to behave like it was built to preserve structured relationships between different textual areas.

After my previous posts about semi-rare token networks, I wanted to test the problem more globally. Until now I was mostly looking at specific rare-token bridges between folios and sections. This time I analysed ALL tokens in the manuscript simultaneously and classified them by distribution behaviour.

Very roughly, the categories were:
Category | Behaviour
Hapax | appear only once
Local-specific | appear in very few pages/sections
Domain bridges | partially specific but transversal
Global-distributed | common almost everywhere
Functional-transversal | very common structural vocabulary

The interesting part is not the hapax themselves, but the middle layer: tokens that are neither ultra-local nor globally common. Those are the ones most capable of creating selective links between different textual communities.
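
A rule-based version of this classification could look like the sketch below. The post gives no numeric cut-offs, so every threshold here (`local_max`, `global_share`, `functional_count`) is a hypothetical placeholder, as are the function name and the toy pages; the real analysis may use different criteria.

```python
from collections import Counter, defaultdict

def classify_tokens(pages, section_of, local_max=1, global_share=0.8,
                    functional_count=100):
    """Label every token by how it is distributed over sections.
    All thresholds are illustrative assumptions."""
    counts = Counter()
    sections = defaultdict(set)  # token -> sections it occurs in
    for pid, toks in pages.items():
        for t in toks:
            counts[t] += 1
            sections[t].add(section_of[pid])
    n_sections = len({section_of[pid] for pid in pages})
    labels = {}
    for t, c in counts.items():
        spread = len(sections[t])
        if c == 1:
            labels[t] = "hapax"
        elif spread <= local_max:
            labels[t] = "local-specific"
        elif spread / n_sections >= global_share:
            # Present almost everywhere: split by raw frequency.
            labels[t] = ("functional-transversal" if c >= functional_count
                         else "global-distributed")
        else:
            labels[t] = "domain bridge"
    return labels

# Toy data, invented for illustration: one page per section.
pages = {
    "p1": ["daiin", "daiin", "ol", "qokain", "cthor", "chol", "chol"],
    "p2": ["daiin", "daiin", "ol", "qokain"],
    "p3": ["daiin", "daiin", "ol"],
    "p4": ["daiin", "daiin", "ol", "otedy"],
}
section_of = {"p1": "Herbal", "p2": "Biological",
              "p3": "Marginal stars", "p4": "Text-only"}
labels = classify_tokens(pages, section_of, functional_count=8)
print(labels)
```

The "domain bridge" label is simply what remains after the other rules: tokens that recur, span more than one section, but are not found nearly everywhere.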

Using TF-IDF weighted similarities, I then compared the real section-to-section structure against several null models. The nulls preserved token frequencies, Currier/language distribution and writing hand distribution, but randomized where section labels appear.

The result surprised me a bit: even after controlling for language and writing hand, the real manuscript remains much more lexically segregated than the randomized versions.

Average z-scores:
Null model | Mean z-score
Global section shuffle | -36
Shuffle within language | -19
Shuffle within writing hand | -6.6

Negative z-scores here mean the real sections are LESS similar to each other than expected under the null models. In other words: the manuscript is internally more organized than randomized versions preserving Currier and scribal structure.
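
For readers who want to see the shape of such a test, here is a minimal sketch of a stratified label shuffle with a z-score. It is not the code behind the numbers above: Jaccard overlap of section vocabularies stands in for the TF-IDF similarity to keep the example short, and the function names, `n_iter`, `seed` and the toy pages are assumptions. It expects at least two sections.

```python
import random
import statistics
from collections import defaultdict
from itertools import combinations

def mean_section_similarity(pages, section_of):
    """Mean Jaccard overlap between the vocabularies of all section pairs."""
    vocab = defaultdict(set)
    for pid, toks in pages.items():
        vocab[section_of[pid]].update(toks)
    sims = [len(vocab[a] & vocab[b]) / len(vocab[a] | vocab[b])
            for a, b in combinations(sorted(vocab), 2)]
    return statistics.mean(sims)

def shuffle_within(section_of, stratum_of, rng):
    """Permute section labels among pages that share a stratum
    (e.g. same Currier language or same hand)."""
    groups = defaultdict(list)
    for pid in section_of:
        groups[stratum_of[pid]].append(pid)
    shuffled = {}
    for pids in groups.values():
        labels = [section_of[p] for p in pids]
        rng.shuffle(labels)
        shuffled.update(zip(pids, labels))
    return shuffled

def z_score(pages, section_of, stratum_of, n_iter=200, seed=0):
    """(real - mean of null) / std of null; NaN if the null has no variance."""
    rng = random.Random(seed)
    real = mean_section_similarity(pages, section_of)
    null = [mean_section_similarity(pages, shuffle_within(section_of, stratum_of, rng))
            for _ in range(n_iter)]
    sd = statistics.pstdev(null)
    return (real - statistics.mean(null)) / sd if sd > 0 else float("nan")

# Toy data, invented: two sections with disjoint vocabularies, one stratum.
pages = {"p1": ["a", "b"], "p2": ["a", "b"], "p3": ["c", "d"], "p4": ["c", "d"]}
section_of = {"p1": "Herbal", "p2": "Herbal", "p3": "Stars", "p4": "Stars"}
one_stratum = {p: "all" for p in pages}
print(z_score(pages, section_of, one_stratum))
```

Putting all pages in one stratum gives the global shuffle; using language or hand as the stratum gives the conditioned versions. A negative value means the real sections overlap less than the shuffled ones.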

This is important because one obvious criticism (and a fair one) is that Currier A/B morphology itself already biases many rare words, as dashstofsk pointed out:

(03-06-2026, 09:55 AM)dashstofsk Wrote: The differences that categorise languages A and B (character pairs or, eo, cho high in A and low in B; character pairs ee, ed high in B and low in A; word suffix edy high in B and low in A) are also present in the rare words in these language sections. (See attached.)

So, this is biasing your numbers. It should not surprise that hands 2 and 3 and quires 13 and 20 have better than expected probabilities of sharing these words.

And that is true. Forms like -edy, qo-, ol-, etc. are not neutral. But the structure does not disappear after conditioning by language or writing hand. The null models already preserve those distributions. So the links between sections do not seem reducible only to Currier or scribal hand effects.

Some of the strongest surviving links are still:
Section pair | TF-IDF similarity
Herbal ↔ Pharmaceutical | 0.71
Marginal stars ↔ Text-only | 0.76
Biological ↔ Marginal stars | 0.72
Biological ↔ Text-only | 0.67
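
As an illustration of how a section-to-section TF-IDF similarity can be computed, the sketch below pools each section into one document, weights tokens with a smoothed IDF, and takes the cosine between the resulting vectors. The exact weighting behind the figures above is not stated in the post, so the smoothing formula, the function names and the toy pages are assumptions.

```python
import math
from collections import Counter

def section_vectors(pages, section_of):
    """One TF-IDF vector per section, with smoothed IDF:
    log((1 + N) / (1 + df)) + 1, N = number of sections."""
    tf = {}
    for pid, toks in pages.items():
        tf.setdefault(section_of[pid], Counter()).update(toks)
    n = len(tf)
    df = Counter(t for counts in tf.values() for t in counts)
    return {
        sec: {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
        for sec, counts in tf.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data, invented for illustration.
pages = {
    "h1": ["daiin", "qokain"],
    "s1": ["daiin", "qokain"],
    "b1": ["chol"],
}
section_of = {"h1": "Herbal", "s1": "Marginal stars", "b1": "Biological"}
vecs = section_vectors(pages, section_of)
print(cosine(vecs["Herbal"], vecs["Marginal stars"]))
print(cosine(vecs["Herbal"], vecs["Biological"]))
```

Values near 1 mean two sections weight the same tokens in the same proportions; 0 means they share no vocabulary at all.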

What I personally find difficult to explain under a fully meaningless-text hypothesis is not simply local repetition, but the existence of stable cross-sectional structure after normalization.

A pseudo-language can easily generate repeated patterns. It is harder to accidentally generate sections that repeatedly behave as selective lexical communities after controlling for Currier and scribal effects.

Of course this still does NOT prove semantic meaning. A strongly constrained generative system could also produce part of this behaviour.
But at minimum, the manuscript increasingly behaves like a text organized to maintain internal relationships between different sections.

One speculative possibility (only speculation) is that sections like Marginal stars or some Text-only folios may contain more procedural, referential or organizational text. If so, this could partially explain why they repeatedly behave like lexical hubs connecting otherwise distant sections such as Herbal and Biological.

I also think the most interesting objects now are not the sections themselves, but the surviving bridge-token families:

qokain / qokedy / qolkedy / olkain / qotchy / cthor ...

Those are not isolated random forms anymore. They start looking to me more like structured lexical neighborhoods.