The Voynich Ninja

Full Version: Mapping Voynich connections through rare tokens
Pages: 1 2 3 4
(05-06-2026, 09:37 PM)quimqu Wrote: shedy -> Herbal f2v, also found in Stars, Biological and Pharmaceutical
chey -> Herbal f9r, also found in multiple non-Herbal folios
olchedy -> Herbal f25r, reused outside Herbal

shedy is not in [link]
chey is in many Herbal folios
olchedy: do you mean f26r?
(05-06-2026, 10:41 PM)nablator Wrote:
(05-06-2026, 09:37 PM)quimqu Wrote: shedy -> Herbal f2v, also found in Stars, Biological and Pharmaceutical
chey -> Herbal f9r, also found in multiple non-Herbal folios
olchedy -> Herbal f25r, reused outside Herbal

shedy is not in [link]
chey is in many Herbal folios
olchedy: do you mean f26r?

Oh, that's right. Let me double-check the code...

Alright, I edited the examples. Thank you.
(05-06-2026, 09:37 PM)quimqu Wrote: ...

I think this finding actually supports a production-based explanation rather than arguing against it.

First, the corrected examples — "qotey" (24 instances), "olchedy" (38 instances), "chetey" (5 instances) — are low-frequency members of common word families. 

"olchedy" contains the most characteristic Currier B form, "chedy." It appears on [link], a Currier B Herbal folio, and also in other Currier B sections. A Currier B word appearing on Currier B pages across sections isn't a semantic reference; it's the same evolutionary stage producing the same word families wherever B-stage text was written. If you look at the instances of "olchedy", you find it in the context of words like "chedy", "shedy", "lchedy", "lshedy", and "olshedy" (see [link]).

On [link] there is, for instance, even an occurrence of the two separate words "ol" and "shedy":
<f111r.P.5;H>      ol.shedy
<f111r.P.6;H>      olchedy
<f111r.P.9;H>      olchedy.lchedy
[attachment=15956]
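
To check this kind of context yourself, here is a minimal sketch that parses locus lines like the three quoted above and lists every word from the "chedy"/"shedy" family. The regular expressions, the function names, and the assumption that "." separates words are illustrative guesses based only on those three lines, not a description of any particular transliteration file.

```python
import re

# The three sample lines quoted in the post; a real check would read a full
# transliteration file instead (file layout assumed, not verified here).
LINES = [
    "<f111r.P.5;H>      ol.shedy",
    "<f111r.P.6;H>      olchedy",
    "<f111r.P.9;H>      olchedy.lchedy",
]

# Assumed locus layout: <folio.unit.line;transcriber> followed by the text.
LOCUS = re.compile(
    r"^<(?P<folio>f\d+[rv]\d*)\.(?P<unit>[^.;>]+)\.(?P<line>\d+);(?P<tr>\w)>\s+(?P<text>\S+)$"
)

# Hypothetical family pattern: chedy, shedy, lchedy, lshedy, olchedy, olshedy.
FAMILY = re.compile(r"^(o?l)?[cs]hedy$")


def parse(line):
    """Return (folio, line number, list of words) for one locus line."""
    m = LOCUS.match(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    # Assumption: '.' is the word separator inside the text field.
    return m["folio"], int(m["line"]), m["text"].split(".")


def family_hits(lines):
    """List (folio, line number, word) for every word in the assumed family."""
    hits = []
    for line in lines:
        folio, number, words = parse(line)
        hits.extend((folio, number, w) for w in words if FAMILY.match(w))
    return hits


for hit in family_hits(LINES):
    print(hit)
```

On the three sample lines this lists "shedy", both instances of "olchedy", and "lchedy", while the separate word "ol" is skipped.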

Second, the existence of low-frequency words alongside high-frequency ones is itself predicted by the self-reinforcing feedback loop. In that loop, high-frequency words have the highest resonance: they are visible most often, copied most often, and generate the most variants. But this necessarily produces a spectrum: some words accumulate high frequency through repeated copying, while others remain rare because they were produced only a few times and never gained momentum in the loop. In other words, high-frequency words can only exist alongside low-frequency words.

The low-frequency words in your analysis are simply the tail of this distribution. They are rare not because they carry specific semantic content but because they had low resonance in the feedback loop — produced few times, rarely visible as sources, rarely copied further. Their cross-section distribution follows the evolutionary gradient: they appear in sections that are adjacent in the production sequence because the same word families were active during those production stages.
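
The spectrum argument can be illustrated with a toy simulation. This is a sketch only: it uses a plain Simon-style copying process (each new token is either a brand-new word or a copy of a randomly chosen earlier token), which is a simplification and not the copy-with-modification mechanism itself. The parameter names and values (`p_new`, `n_tokens`, `seed`) are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_copying(n_tokens=5000, p_new=0.1, seed=0):
    """Generate tokens where most are copies of earlier tokens.

    Copying picks a uniformly random earlier *token*, so a word that already
    occurs often is proportionally more likely to be copied again.
    """
    rng = random.Random(seed)
    tokens = ["w0"]
    next_id = 1
    while len(tokens) < n_tokens:
        if rng.random() < p_new:
            tokens.append(f"w{next_id}")  # a word never seen before
            next_id += 1
        else:
            tokens.append(rng.choice(tokens))  # copy something already written
    return tokens


def frequency_spectrum(tokens):
    """Map each frequency to the number of word types with that frequency."""
    return Counter(Counter(tokens).values())


tokens = simulate_copying()
spectrum = frequency_spectrum(tokens)
print("types:", len(set(tokens)))
print("most frequent word occurs", max(spectrum), "times")
print("types occurring once:", spectrum.get(1, 0))
```

Running it with different `p_new` values shows how the same copying rule yields a few very frequent words together with a long tail of rare ones.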
Quote: What I find difficult to explain is not the existence of these words, but their distribution. These tokens are restricted to a single Herbal folio, yet they reappear in multiple folios belonging to completely different sections.

I have a blurry idea of how it could work. It won't be very statistical, maybe more psychological, but let me try.

Let's suppose a man is going to create 37,000 fake words that will make up the text of the Voynich manuscript.

First he decides on some structure for these words. This will be the grammar of Voynich words: the rules saying that, for example, "n" often goes at the end but never at the beginning of a word.

Then he starts writing words. When making a new word in a sentence he often (but not always) uses existing words, already written on the page and visible to him. He alters them; this would be the "autocitation method".

But sometimes he steps away from autocitation. Maybe he realizes that it would be too repetitive, maybe he just gets bored and wants something else. So he writes a word that isn't on the page but that he remembers, or invents a new word according to the grammar, or even invents a new word that doesn't fully fit the grammar. Breaking the grammar will be the rarest case.

When he invents a new word he can actually reinvent something that he already used 50 pages earlier.
Some words come easily to his mind; he will use them quite often when he steps away from autocitation.
Some will come much more rarely, but often enough to repeat in the text.
And some will be so unobvious that they will appear only once and will be hapax legomena.

For comparison, imagine that we ask a student to write a long list not of fake words but of real, existing nouns.
He will probably repeat common words a lot: man, dog, house, work, book, car.
He will also possibly prefer shorter words over longer ones, to save himself the work of writing them down.
But he may also come up with rarer and longer words: melancholy, zealot, propaganda, junta, leprosy and so on. He will probably use them much more seldom.

Yet he may repeat the same rare word after many pages, just "out of the blue".
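
The procedure described above can be sketched as a small generator. Everything here is a toy assumption: the `MEMORY` list reuses tokens mentioned in this thread, the `EASE` weights (how easily a word comes to mind), the alteration rules in `alter`, the `window` of visible words, and the probability `p_cite` are all invented for illustration and are not measured from the manuscript.

```python
import random

# Words the writer "remembers"; tokens taken from this thread.
MEMORY = ["chedy", "shedy", "chey", "qotey", "olchedy", "chetey"]
# Hypothetical weights: a higher value means the word comes to mind more easily.
EASE = [8, 8, 5, 2, 2, 1]


def alter(word, rng):
    """Toy alteration: swap ch/sh once, or add/remove a leading 'o'."""
    options = []
    if "ch" in word:
        options.append(word.replace("ch", "sh", 1))
    if "sh" in word:
        options.append(word.replace("sh", "ch", 1))
    if word.startswith("o") and len(word) > 1:
        options.append(word[1:])
    else:
        options.append("o" + word)
    return rng.choice(options)


def generate(n_words=300, p_cite=0.7, window=20, seed=0):
    """Write n_words, mostly by altering a word still visible on the page."""
    rng = random.Random(seed)
    text = []
    for _ in range(n_words):
        visible = text[-window:]  # what the writer can still see
        if visible and rng.random() < p_cite:
            text.append(alter(rng.choice(visible), rng))  # autocitation
        else:
            # Stepping away: pull a word from memory, easy ones more often.
            text.append(rng.choices(MEMORY, weights=EASE)[0])
    return text


text = generate()
print(" ".join(text[:15]))
```

With a sketch like this, one can look at where a low-weight memory word such as "chetey" reappears and how far apart its occurrences are, which is the "out of the blue" repetition described above.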
@quimqu — the threshold is position ≥ ⌊2L/3⌋, final third of the word by character count. For each BPE segmentation, I count how many of the k−1 boundaries fall at or past that threshold, divided by k−1. Folio mean excludes unsegmented tokens.

So a 9-char token with cuts at {3, 7}: threshold = 6, one qualifies → BC = 0.5. Stars averaging 0.37 means most cuts land in the first two-thirds — right edge is inconsistent across that section. Herbal 0.80+ is the opposite.
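
The definition above can be written as a short sketch. The function names and the input shape (a token length plus a list of cut positions as character offsets) are assumptions made for illustration; the BPE segmentation itself is not shown and is assumed to be computed elsewhere.

```python
def bc_score(length, cuts):
    """Share of segment boundaries at or past floor(2L/3).

    length: token length L in characters.
    cuts:   the k-1 boundary positions of a k-segment split (assumed offsets).
    Returns None for an unsegmented token (no boundaries).
    """
    if not cuts:
        return None
    threshold = (2 * length) // 3  # start of the final third
    late = sum(1 for c in cuts if c >= threshold)
    return late / len(cuts)


def folio_mean(tokens):
    """Mean BC over (length, cuts) pairs, skipping unsegmented tokens."""
    scores = [bc_score(length, cuts) for length, cuts in tokens]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


# Worked example from the post: 9 characters, cuts at 3 and 7, threshold 6.
print(bc_score(9, [3, 7]))  # 0.5
```

Returning `None` for unsegmented tokens keeps them out of the folio mean, matching the exclusion stated above.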

Curious how you're approaching boundary detection on your end — different segmentation method, or something positional?