The Voynich Ninja - Mapping Voynich connections through rare tokens

(03-06-2026, 09:55 AM)dashstofsk Wrote: So, this is biasing your numbers. It should not surprise that hands 2 and 3 and quires 13 and 20 have better than expected probabilities of sharing these words.

I checked that possibility directly, because if the effect were mainly caused by scribal proximity or quire proximity, most of the rare-token links should stay inside the same hand or the same quire.

I measured page-to-page links generated by semi-rare tokens (freq 2-10), and then checked whether the linked folios belonged to the same hand or quire. Global distribution:

Link type	Percentage
Cross-hand + cross-quire	59.4%
Same-hand + cross-quire	21.9%
Same-hand + same-quire	16.9%
Cross-hand + same-quire	1.7%
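
A minimal Python sketch of how link shares like these could be computed. It is not the script behind the table above: the function name `classify_links`, the input dictionaries, and the toy pages are assumptions made for illustration, and a "link" is taken to be any pair of pages that share a token with corpus frequency 2-10.

```python
from collections import Counter, defaultdict
from itertools import combinations

def classify_links(page_tokens, page_hand, page_quire, min_freq=2, max_freq=10):
    """Return the share of page-to-page links per (hand, quire) category."""
    # Corpus-wide token frequencies.
    freq = Counter(tok for toks in page_tokens.values() for tok in toks)
    # Pages on which each semi-rare token occurs.
    pages_of = defaultdict(set)
    for page, toks in page_tokens.items():
        for tok in toks:
            if min_freq <= freq[tok] <= max_freq:
                pages_of[tok].add(page)
    # Every pair of pages sharing a semi-rare token counts as one link.
    counts = Counter()
    for pages in pages_of.values():
        for a, b in combinations(sorted(pages), 2):
            hand = "same-hand" if page_hand[a] == page_hand[b] else "cross-hand"
            quire = "same-quire" if page_quire[a] == page_quire[b] else "cross-quire"
            counts[(hand, quire)] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}

# Toy input (hypothetical pages, hands and quires, not manuscript data).
page_tokens = {"p1": ["aa", "bb"], "p2": ["aa", "cc"], "p3": ["bb", "cc"]}
page_hand = {"p1": 1, "p2": 1, "p3": 2}
page_quire = {"p1": "A", "p2": "B", "p3": "B"}
shares = classify_links(page_tokens, page_hand, page_quire)
```

In the toy input each of the three tokens creates exactly one link, so three categories each receive one third of the links.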

Then I checked the strongest section relationships specifically:

Section pair	Cross-hand	Cross-quire
Herbal ↔ Marginal stars	91.9%	99.8%
Biological ↔ Marginal stars	100%	100%
Herbal ↔ Biological	65.5%	100%

To clarify what the "100%" means here: it does not mean the sections are unrelated to scribes or codicology. In fact, part of the reason may be precisely that Biological and Marginal stars are mostly copied by different hands and appear in different quires. What it means is something different: the semi-rare tokens linking those sections are not staying inside local scribal clusters (hand or quire). The links themselves consistently jump across hand and quire boundaries instead of remaining local.

For example, some of the semi-rare tokens linking Biological and Marginal stars are:

Token	Biological folio	Marginal stars folio
alchl	f76v	f113r
cholchey	f79r	f113r
fshedy	f80r	f115r
cheety	f80r	f112r
ckhedy	f84r	f112v

If I am not wrong, those links are all cross-hand and cross-quire. So I agree there is definitely a codicological component here. But the strongest lexical bridges do not seem to be reducible to simple "same scribe / same quire" proximity either.

(03-06-2026, 04:02 PM)quimqu Wrote: For example, some of the semi-rare tokens linking Biological and Marginal stars are:

Token	Biological folio	Marginal stars folio
alchl	f76v	f113r
cholchey	f79r	f113r
fshedy	f80r	f115r
cheety	f80r	f112r
ckhedy	f84r	f112v

A note on one of your examples. In the Takahashi transcription there is only one instance of "alchl", on f113r. On f76v, Takahashi reads "ralchl" as one word, while the transcription you used probably reads it as "r" + "alchl". So for rare tokens, whether the link between these two folios exists depends on which transcription you use and how you interpret a word boundary. With tokens appearing only twice, a single transcription disagreement can create or destroy a cross-section connection.

But the broader picture is actually more interesting than the exact-match network captures. The sequence "lchl" is rare — only seven instances in the entire manuscript (Takahashi):

Folio	Token	Section
f76v	ralchl	Biological
f77r	dolchl	Biological
f81r	lchl	Biological
f105r	lchl	Marginal Stars
f106r	rarolchl	Marginal Stars
f107r	polchls	Marginal Stars
f113r	alchl	Marginal Stars

These are not seven unrelated words that happen to share a character sequence. They are variants of the same form — "ralchl," "dolchl," "alchl," "polchls," "rarolchl" — each differing from the others by prefix addition or substitution. They form a word family connected through the same modification operations visible throughout the VMS vocabulary.

And their distribution — three in Biological, four in Marginal Stars — confirms the connection you found between these two sections. But it also illustrates the general point: the similarity between Voynich words is the text's most characteristic feature. Looking for exact token matches misses the family structure. "alchl" and "ralchl" are different tokens but members of the same "lchl"-family. A network built on word families rather than exact matches would capture more of the manuscript's real structure.
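
A small Python sketch of the family idea, assuming a family is simply "every token containing a given core sequence". The helper name `family_by_core` is hypothetical; the records are the seven "lchl" rows from the table above plus one non-member ("cheety" on f80r) taken from the earlier example list.

```python
from collections import defaultdict

def family_by_core(occurrences, core):
    """Group (folio, token) pairs by section for tokens that contain `core`."""
    by_section = defaultdict(list)
    for folio, token, section in occurrences:
        if core in token:
            by_section[section].append((folio, token))
    return dict(by_section)

occurrences = [
    ("f76v", "ralchl", "Biological"),
    ("f77r", "dolchl", "Biological"),
    ("f81r", "lchl", "Biological"),
    ("f80r", "cheety", "Biological"),       # not a member of the family
    ("f105r", "lchl", "Marginal Stars"),
    ("f106r", "rarolchl", "Marginal Stars"),
    ("f107r", "polchls", "Marginal Stars"),
    ("f113r", "alchl", "Marginal Stars"),
]

family = family_by_core(occurrences, "lchl")
for section, members in family.items():
    print(section, len(members), [tok for _, tok in members])
```

A network built this way would link f76v and f113r regardless of whether "ralchl" is transcribed as one word or two, which is the robustness argument made above. A real family definition would need more than substring matching; this is only the simplest possible stand-in.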

(03-06-2026, 04:02 PM)quimqu Wrote: If I am not wrong, those links are all cross-hand and cross-quire. So I agree there is definitely a codicological component here. But the strongest lexical bridges do not seem to be reducible to simple "same scribe / same quire" proximity either.

I think this is an important observation. The rare-token links don't cluster by hand and they don't cluster by illustration type. If the text described the illustrations, we would expect rare vocabulary to cluster by illustration type — plant-related terms on herbal pages, star-related terms on star pages, biological terms on biological pages. That's not what we see.

If different scribes wrote different sections, we would expect rare vocabulary to cluster by hand — each scribe's idiosyncratic word choices staying within that scribe's sections. That's not what we see either.

What we do see is that the strongest connections follow the "ed" frequency gradient: Herbal A (0.2%) → Pharma A (0.7%) → Astro (9.5%) → Herbal B (17%) → Marginal Stars / Quire 20 (19.4%) → Biological / Quire 13 (27.8%). Sections that are adjacent in this gradient share the most rare vocabulary, regardless of hand, illustration type, or quire boundaries. This is what I would expect if the vocabulary evolved continuously through the production of the manuscript.
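
A hedged Python sketch of how such a gradient could be computed. The post does not define the measure, so this assumes "ed" frequency means the fraction of word tokens containing the sequence "ed"; the function names and the three toy sections are made up for illustration.

```python
def ed_share(tokens, seq="ed"):
    """Fraction of word tokens that contain `seq` (assumed definition)."""
    if not tokens:
        return 0.0
    return sum(seq in tok for tok in tokens) / len(tokens)

def gradient(section_tokens, seq="ed"):
    """Sections sorted from the lowest to the highest share of `seq`."""
    shares = {name: ed_share(toks, seq) for name, toks in section_tokens.items()}
    return sorted(shares.items(), key=lambda item: item[1])

# Toy sections (hypothetical, not manuscript counts).
toy = {
    "Z": ["chedy", "shedy"],
    "X": ["chol", "shol", "cthol"],
    "Y": ["chedy", "shol"],
}
ordered = gradient(toy)
print(ordered)
```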

There doesn't seem to be anything particularly odd about rare gallows words.

If you look at the matrices of gallows words for language B that I presented some time ago, you will see many rare gallows words. These particular words are rare simply because they are formed with either an infrequent prefix or an infrequent suffix. For example, lkchdy occurs 5 times in the language B pages, which is about the expected number given the frequencies of the l prefix and the chdy suffix.

Likewise ytey, 5 occurrences.
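
One plausible reading of "expected number" is an independence estimate: if prefixes and suffixes combined independently, a word would occur about (prefix count × suffix count) / total words times. A Python sketch of that arithmetic follows; the function name is hypothetical and the counts are invented for illustration, since the actual frequencies are not given in the post.

```python
def expected_count(prefix_count, suffix_count, total_words):
    """Expected occurrences of prefix+suffix if the two parts combine independently."""
    return prefix_count * suffix_count / total_words

# Hypothetical counts, NOT taken from the matrices:
# 1000 gallows words in total, 50 with the prefix, 100 with the suffix.
print(expected_count(50, 100, 1000))  # 5.0
```

Under this reading, a word with an observed count close to its expected count is rare only because its parts are rare, not because the combination is avoided.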

And if you look at the prefixes and suffixes of rare gallows words (see attached, for language B and quire 13), you will see that the top ones are very much the same as for all gallows words (see attached also).

So gallows words, both rare and frequent, seem to have a regular construction. The majority of the rare ones look normal and have either a regular prefix or a regular suffix. It is just a matter of luck if one turns out to be rare.

Also, rare words (2 to 10 occurrences) make up only ~20-25% of the total number of words. I can't quite see how you are going to obtain better conclusions about the manuscript from this minority of words alone while ignoring the majority of the text.

Your rare-token hub result made me look more carefully at something I'd been tracking separately: boundary concentration at the folio level rather than corpus-wide.
The Stars folios (f111r, f111v, f108v) come out consistently low, around 0.37, against a corpus average closer to 0.59. The herbal section runs higher, some pages above 0.80. So the pages with the widest rare-token reach also tend to have the most interior-distributed boundary patterns. Whether that's the same structural phenomenon from a different angle, or two independent signals that happen to co-occur in those sections, I'm still not sure. What do you think?

(04-06-2026, 05:25 PM)petronio Wrote: Your rare-token hub result made me look more carefully at something I'd been tracking separately: boundary concentration at the folio level rather than corpus-wide.
The Stars folios (f111r, f111v, f108v) come out consistently low, around 0.37, against a corpus average closer to 0.59. The herbal section runs higher, some pages above 0.80. So the pages with the widest rare-token reach also tend to have the most interior-distributed boundary patterns. Whether that's the same structural phenomenon from a different angle, or two independent signals that happen to co-occur in those sections, I'm still not sure. What do you think?

Can you explain how boundary concentration is calculated? Is it based on word boundaries, line positions, or something else?

(04-06-2026, 11:37 AM)Torsten Wrote: If different scribes wrote different sections, we would expect rare vocabulary to cluster by hand, with each scribe's idiosyncratic word choices staying within that scribe's sections. That's not what we see either.

I think we may be looking at the problem from slightly different angles.

My point is not that rare-token links disprove vocabulary evolution. They probably don't. What interests me is that these semi-rare tokens often connect sections that, under a semantic interpretation, could plausibly be related.

For example, in a normal herbal manuscript, semi-rare words might be plant names. You would expect them to appear on the herb page itself, but also in recipes or remedies referring to that plant.

What I find interesting in the Voynich is that some of these semi-rare tokens connect Herbal folios with Marginal Stars folios. If the Stars section contains something recipe-like (which is of course only a hypothesis), then that is exactly the kind of pattern one might expect from a meaningful text.

The hand argument is also interesting here. Many of these links cross scribal boundaries. So the connection is not simply that one scribe tends to reuse his own rare vocabulary. The same semi-rare forms appear in sections copied by different hands.

Of course this does not prove meaning. But if rare vocabulary repeatedly links Herbal pages with possible recipe-like pages (or even Balneological pages with recipe-like pages), I think that is at least worth investigating further.

(04-06-2026, 10:28 PM)quimqu Wrote: My point is not that rare-token links disprove vocabulary evolution. They probably don't. What interests me is that these semi-rare tokens often connect sections that, under a semantic interpretation, could plausibly be related.

For example, in a normal herbal manuscript, semi-rare words might be plant names. You would expect them to appear on the herb page itself, but also in recipes or remedies referring to that plant.

What I find interesting in the Voynich is that some of these semi-rare tokens connect Herbal folios with Marginal Stars folios. If the Stars section contains something recipe-like (which is of course only a hypothesis), then that is exactly the kind of pattern one might expect from a meaningful text.

The interpretation that rare tokens connecting Herbal to Stars could be plant names appearing in recipes is plausible in principle. But several observations are difficult to reconcile with a semantic interpretation:

The specific tokens you listed as cross-section bridges look like modifications of common word families: 'fshedy' is 'shedy' with an 'f' prefix. 'ckhedy' is 'chedy' with 'ckh' in place of 'ch'. 'cholchey' looks like a combination of 'chol' and 'chey'. 'cheety' is 'cheey' with an additional 't' gallows. To me, they happen to be rare only because the specific modification was produced just a few times.

Some identical labels appear in both the herbal and the astronomical section: "okary," "oky," "otalam," "okeoly," "otaly," "otoky," "otaldy," "otal," "ykeody," "okeody," "okeos," "otory," "okody," and "oran" (linked source, p. 9). Labels are the most semantically constrained words in any manuscript: they should identify what they're attached to. One would not expect several stars or constellations to be named after plants or parts of plants. If even the labels don't respect illustration boundaries, paragraph text connecting across sections doesn't need a semantic explanation either.

There is a continuous "ed" gradient across sections — from 0.23% in Herbal A to 27.84% in Quire 13. If words represent concepts, why would the vocabulary for describing plants, stars, and biological subjects shift systematically along a single axis? Different topics should produce different vocabularies — not the same vocabulary at different evolutionary stages.

The text lacks function words. In any meaningful text — regardless of language, cipher, or encoding — some words should distribute uniformly across all sections because they serve grammatical functions (conjunctions, articles, prepositions). No such words exist in the VMS. "Frequently used tokens differ from page to page" (Timm & Schinner 2020, p. 5).

The text contains almost no repeated phrases. Tiltman (1967), D'Imperio (1978), and Reddy & Knight (2011) all noted that sequences of three or more words virtually never repeat. Any meaningful text — descriptions of plants, recipes, instructions — produces repeated phrases ("take the root of," "boil in water," "apply to the"). The VMS doesn't.

The cross-section connections you found are real. But in my view they reflect the production process rather than semantic references — the same word families appearing wherever the scribe was working, regardless of what the illustrations depict.

(04-06-2026, 11:45 PM)Torsten Wrote: The text contains almost no repeated phrases. Tiltman (1967), D'Imperio (1978), and Reddy & Knight (2011) all noted that sequences of three or more words virtually never repeat. Any meaningful text (descriptions of plants, recipes, instructions) produces repeated phrases ("take the root of," "boil in water," "apply to the"). The VMS doesn't.

Hello Torsten,

this surprised me a lot. I ran a short script and found:

BIGRAMS
-------
Repeated bigrams: 2651
Repeated >2: 1032
Repeated >5: 254

Cross-section >=2: 1781
Cross-section >=3: 419
Cross-section >=4: 130

TRIGRAMS
--------
Repeated trigrams: 197
Repeated >2: 28
Repeated >5: 0

Cross-section >=2: 83
Cross-section >=3: 3
Cross-section >=4: 2

QUATRIGRAMS
-----------
Repeated quatrigrams: 22
Repeated >2: 11
Repeated >5: 0

Cross-section >=2: 0
Cross-section >=3: 0
Cross-section >=4: 0

(I grouped each paragraph into a single line and checked for n-grams within paragraphs, labels and single lines.)

And here is their distribution when they cross sections:

[attachment=15938]
[attachment=15937]
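
For anyone who wants to reproduce counts of this kind, here is a minimal Python sketch. It is not the script used above: the function name `ngram_stats`, the input layout, and the reading of "cross-section" as "the same n-gram occurs in at least two different sections" are all assumptions, and the toy units are invented.

```python
from collections import Counter, defaultdict

def ngram_stats(units, n):
    """Count repeated word n-grams; `units` is a list of (section, tokens) pairs.

    Each unit (paragraph, label or single line) is scanned on its own,
    so n-grams never span two units.
    """
    counts = Counter()
    sections = defaultdict(set)
    for section, tokens in units:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            sections[gram].add(section)
    repeated = [g for g, c in counts.items() if c >= 2]
    return {
        "repeated": len(repeated),
        "repeated_gt2": sum(1 for g in repeated if counts[g] > 2),
        "cross_section_ge2": sum(1 for g in repeated if len(sections[g]) >= 2),
    }

# Toy units (hypothetical text, not a transcription).
units = [
    ("Herbal", ["chol", "shol", "chedy", "chol", "shol"]),
    ("Stars", ["chol", "shol", "shedy"]),
]
stats = ngram_stats(units, 2)
print(stats)
```

In the toy data only the bigram "chol shol" repeats (three times, in two sections), so all three counters equal 1.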

(05-06-2026, 07:46 AM)quimqu Wrote: ...

Hello quimqu!

The raw numbers need context. 197 repeated trigrams in 37,000 words is fewer than expected for a natural language. Moreover, two questions matter more than the count:

Do they repeat in fixed order? In Timm 2014 (p. 3) I found only 35 word sequences of three or more words appearing at least three times. Of those 35, only 5 have fixed word order. The other 30 have variable word order — the same words appearing together but in different arrangements. For instance "chol," "shol," and "cthol" occur together three times, each time in different order:

f1v: "chol cthol shol"
f4r: "chol shol cthol"
f42r: "cthol chol shol"

In a natural language, phrases have fixed word order because grammar constrains the sequence. In the VMS, word order varies — which is expected if similar words co-occur because they belong to the same word family being copied in the same production session.
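
The fixed-order versus variable-order distinction can be checked mechanically. The Python sketch below is an illustration, not the procedure from Timm 2014: it keys every three-word window by its sorted words and records the distinct orders seen, using the three lines quoted above as input. The function name `order_variants` is hypothetical.

```python
from collections import defaultdict

def order_variants(lines, n=3):
    """Map each unordered group of n words to the set of orders it occurs in."""
    variants = defaultdict(set)
    for tokens in lines:
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            # Sorting the window gives a key that ignores word order.
            variants[tuple(sorted(window))].add(window)
    return variants

lines = [
    "chol cthol shol".split(),   # f1v
    "chol shol cthol".split(),   # f4r
    "cthol chol shol".split(),   # f42r
]
variants = order_variants(lines)
for key, orders in variants.items():
    kind = "fixed order" if len(orders) == 1 else "variable order"
    print(key, len(orders), kind)
```

A group that repeats with a single order would count as a fixed-order sequence; here the one group appears in three different orders.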

Do they repeat because of meaning or because of similarity? In 24 out of those 35 cases, the repeated sequences use at least two words that are either spelled the same or very similarly (Timm 2014, p. 3). The repetition reflects word-family clustering, not grammatical phrases.

Reddy & Knight (2011, p. 82) quantified this directly by measuring how much the previous word predicts the next word:

Text	Unigram	Bigram	Improvement
VMS B	2.30%	2.50%	8.85%
English	4.72%	11.9%	151%
Arabic	3.81%	14.2%	252%
Chinese	16.5%	19.8%	19.7%
Hungarian	5.84%	13.0%	123%

In any meaningful text, knowing the previous word dramatically constrains what comes next — 151% improvement for English, 252% for Arabic, 123% for Hungarian. Even Chinese with its famously weak word order gets 19.7%. The VMS gets 8.85%. Knowing the previous word tells you almost nothing about the next word.
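
As a worked check of what such an improvement figure expresses, the Python sketch below assumes it is the relative gain of bigram over unigram prediction accuracy, (bigram - unigram) / unigram. That reading is an assumption, and because the inputs are the rounded accuracies from the table, the recomputed values can differ from the published Improvement column.

```python
def relative_improvement(unigram_acc, bigram_acc):
    """Relative gain in prediction accuracy from conditioning on the previous word."""
    return (bigram_acc - unigram_acc) / unigram_acc

# Rounded accuracies (in percent) copied from two rows of the table above.
rows = {"VMS B": (2.30, 2.50), "English": (4.72, 11.9)}
for name, (uni, bi) in rows.items():
    print(f"{name}: {relative_improvement(uni, bi):.1%}")
```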

Repeated bigrams and trigrams exist in the VMS — but they behave differently from phrases in meaningful text. They use variable word order, they cluster by word similarity rather than by grammatical function, and they provide almost no predictive power for the following word.

This is the reason Tiltman and D'Imperio wrote about the VMS:
"Languages simply do not behave in this way." ... "And yet I am not aware of any long repetitions of more than 2 or 3 words in succession, as might be expected for instance in the text under the botanical drawings". (Tiltman, p. 9)

"The short words, the many sequential repetitions, the rarity of one- or two letter words, the rarity of doublets (doubled letters), all militate against simple substitution. So also does the strange lack for parallel context surrounding different occurrences of the same word as shown by words indexes. In the words of several researchers 'the text just doesn't act like natural language'." (D'Imperio, p. 30)

(05-06-2026, 09:01 AM)Torsten Wrote: ...

Hello Torsten,

I agree that the repeated trigrams do not behave like normal linguistic phrases. Variable word order, low bigram predictability, and clustering of similar word forms are all observations that any explanation of the Voynich text must account for.

However, I think there are also observations that are difficult to reconcile with a purely local text-generation mechanism based on copying and modifying nearby words.

Recently I looked at tokens that appear in exactly one Herbal folio, but also occur elsewhere in the manuscript. The idea was simple: if Herbal folios represent individual plants, then any token restricted to a single Herbal folio but reused in other sections might be functioning as a plant identifier or some other folio-specific concept.

The results were surprisingly widespread (in this table I show the number of tokens per section that appear in just one page of the section and then appear in other sections):

Section	Candidate tokens	Section vocabulary	% vocabulary
Astronomical	297	614	48.4%
Text-only	449	995	45.1%
Cosmological	439	1126	39.0%
Zodiac	287	784	36.6%
Pharmaceutical	409	1162	35.2%
Biological (balneological)	427	1480	28.9%
Marginal stars only	654	3339	19.6%
Herbal	620	3490	17.8%

In total I found 620 such tokens for herbal vocabulary.

Even more interesting, these 620 tokens are distributed across 117 of the 125 Herbal folios (almost all of them). On average, each Herbal folio contributes about 5 such tokens, and some folios contribute as many as 19.
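
The selection rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind the table: `single_folio_candidates` is a hypothetical name, a candidate is a token found on exactly one folio of the chosen section and on at least one folio of any other section, and the folio ids H1, H2, S1 are placeholders.

```python
from collections import defaultdict

def single_folio_candidates(folio_tokens, folio_section, section):
    """Tokens on exactly one folio of `section` that also occur outside it.

    Returns a dict mapping each candidate token to its single home folio.
    """
    inside = defaultdict(set)   # token -> folios of `section` containing it
    outside = set()             # tokens seen in any other section
    for folio, tokens in folio_tokens.items():
        if folio_section[folio] == section:
            for tok in tokens:
                inside[tok].add(folio)
        else:
            outside.update(tokens)
    return {
        tok: next(iter(folios))
        for tok, folios in inside.items()
        if len(folios) == 1 and tok in outside
    }

# Toy input (placeholder folios, not manuscript data).
folio_tokens = {
    "H1": ["qotey", "chol"],
    "H2": ["chetey", "chol"],
    "S1": ["qotey", "shedy"],
}
folio_section = {"H1": "Herbal", "H2": "Herbal", "S1": "Stars"}
candidates = single_folio_candidates(folio_tokens, folio_section, "Herbal")
print(candidates)
```

In the toy input only "qotey" qualifies: "chol" sits on two Herbal folios and "chetey" never leaves the section.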

A few examples:

Code:
qotey -> Herbal f4r, also found in Stars and Balneological

chetey -> Herbal f34r, also found in multiple non-Herbal folios

olchedy -> Herbal f26r, reused outside Herbal

(The exact examples are not important. The general pattern is.)

What I find difficult to explain is not the existence of these words, but their distribution. These tokens are restricted to a single Herbal folio, yet they reappear in multiple folios belonging to completely different sections.

If the text is generated primarily through local copying and word-family variation, why do so many folio-specific words propagate across section boundaries while remaining restricted to a single Herbal folio?

This observation does not prove that the words have meaning, nor that they are plant names. But it seems to suggest that at least some lexical items are linked to individual Herbal folios and then reused elsewhere in the manuscript in a non-local way.

I would be interested to hear how your generation model accounts for this pattern.

(Just a quick note: my previous studies led me to a generation model, but these findings are hard to explain with that kind of model.)
