The Voynich Ninja
Measuring Long-Range Structure in the Voynich Manuscript - Printable Version




RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 27-02-2026

To check whether the gap is simply caused by different pages having different letter frequencies, I built a very simple artificial model. For each page (or paragraph), I kept the same letter proportions but generated the text randomly inside the block. No word structure, no internal rules. Just random letters with the right frequencies. This is an i.i.d. block model.
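A minimal Python sketch of such a block surrogate, for anyone who wants to try the idea. It is an illustration only, not the script behind the numbers in this post: the function name `iid_block_surrogate` is made up, and it assumes each block is a plain string in which every character (spaces included, if present) counts as a symbol.

```python
import random

def iid_block_surrogate(blocks, seed=0):
    """Return one random surrogate string per input block.

    Each surrogate has the same length as its block, and its symbols are
    drawn independently from that block's empirical symbol frequencies.
    """
    rng = random.Random(seed)
    surrogates = []
    for block in blocks:
        symbols = list(block)
        if not symbols:
            # Nothing to sample from in an empty block.
            surrogates.append("")
            continue
        # Sampling with replacement from the block's own symbols is the same
        # as drawing i.i.d. from its empirical frequency distribution.
        surrogates.append("".join(rng.choices(symbols, k=len(symbols))))
    return surrogates

# Hypothetical toy input: two "pages" with different letter frequencies.
pages = ["aaabaaabaaab", "cdcdcdcdcdcd"]
fake_pages = iid_block_surrogate(pages)
```

Because the symbols are sampled rather than permuted, each surrogate block matches the original letter proportions only on average; shuffling the characters of each block instead would keep them exact.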

Here is the comparison:

Text               Real tail gap   Block i.i.d. gap (mean)
Voynich (Currier)  0.00246         0.00407 (page) / 0.00498 (para)
Torsten Timm       0.00270         0.00221–0.00234
Natural texts      ≈ 0             clearly larger than real
Naibbe cipher      ≈ 0.00004       ≈ 0.00013–0.00016

Now the picture is clearer.

For natural language, the real gap is basically zero. When we force artificial block differences, the gap increases. That means natural texts are fairly stable across pages.

For Torsten Timm, the real gap is very close to the artificial block model. That suggests his generator behaves largely like a simple mixture of statistically different blocks.

The Naibbe cipher behaves like natural language. Its real gap is essentially zero. The artificial block-mixture model produces a slightly larger value, but still a very small one. This shows that simply encrypting natural language does not create the long-range gap seen in Voynich. That comparison is important: it means the Voynich effect is not just a by-product of symbol substitution or reduced alphabet size. A cipher of natural text does not reproduce it.

For Voynich Currier, the artificial block model produces a much larger gap than the real manuscript. This matters. It means the Voynich is not just a loose mixture of pages with different letter frequencies. There is internal structure inside the blocks that moderates the effect.

So the long-range gap in Voynich cannot be explained by block-level frequency drift alone. It reflects a system that combines large-scale segmentation with internal constraints that are not captured by a purely random block model.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 27-02-2026

One obvious question is whether the Voynich gap could simply be the result of mixing different linguistic systems inside the same manuscript.

To test this, I ran the same long-range tail gap calculation on mixtures of real natural texts. Each text alone has essentially zero gap. Then I concatenated them and measured the effect.

Here is the comparison (tail gap = MI(raw) − MI(word shuffle), d = 60–100):

Text combination                       Tail gap
Latin + Latin (Plato + Alchemical)     ≈ 0.00014
Catalan + Latin (Tirant + Ambrosius)   ≈ 0.00123
English + Latin                        ≈ 0.00585
Voynich (Currier)                      ≈ 0.00246
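
The tail gap in this table, MI(raw) − MI(word shuffle) averaged over d = 60–100, could be computed roughly as in the sketch below. Several details are assumptions, since they are not spelled out in the post: a plug-in mutual information estimate between single characters d positions apart, spaces dropped before measuring, a global shuffle of word order as the baseline, and an average over a few shuffles. The function names are illustrative.

```python
import math
import random
from collections import Counter

def mutual_information(seq, d):
    """Plug-in MI (in bits) between the symbols at positions i and i + d."""
    n = len(seq) - d
    if d <= 0 or n <= 0:
        return 0.0
    left, right = seq[:-d], seq[d:]
    pair_counts = Counter(zip(left, right))
    left_counts, right_counts = Counter(left), Counter(right)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (left_counts[a] * right_counts[b]))
    return mi

def tail_mi(seq, d_min=60, d_max=100):
    """Mean MI over the distance range d_min..d_max (inclusive)."""
    distances = range(d_min, d_max + 1)
    return sum(mutual_information(seq, d) for d in distances) / len(distances)

def tail_gap(words, d_min=60, d_max=100, n_shuffles=5, seed=0):
    """MI(raw) minus the mean MI of word-shuffled copies of the same text."""
    rng = random.Random(seed)
    raw = tail_mi("".join(words), d_min, d_max)  # assumption: spaces dropped
    baseline = 0.0
    for _ in range(n_shuffles):
        shuffled = list(words)
        rng.shuffle(shuffled)
        baseline += tail_mi("".join(shuffled), d_min, d_max)
    return raw - baseline / n_shuffles
```

Subtracting the shuffled baseline also removes most of the positive bias that a plug-in MI estimate has on finite samples, which matters when the gaps of interest are of order 0.001.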

A few things stand out.

Mixing two Latin texts produces almost no gap. Mixing two related but distinct languages (Catalan and Latin) produces a moderate gap. Mixing very different languages (English and Latin) produces a large one.

The Voynich sits between these cases. Its internal heterogeneity is clearly stronger than what we see between closely related natural languages, but weaker than what we see between very distant ones.

So the long-range gap does not require anything exotic. Strong non-stationarity alone can create it. But the magnitude in Voynich suggests that its internal blocks are statistically more differentiated than simple stylistic or dialectal variation within a single language.

That gives us a useful scale. We are no longer asking whether there is a gap, but how large it is compared to known linguistic differences.


RE: Measuring Long-Range Structure in the Voynich Manuscript - magnesium - 27-02-2026

Importantly, the Naibbe cipher proceeds with encryption randomly on a letter-by-letter basis. The natural question: Is it possible at all to construct a homophonic substitution cipher where long-range correlations in the sequence of homophone selections induce long-range correlations in the ciphertext?


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 27-02-2026

(27-02-2026, 02:59 PM)magnesium Wrote: Importantly, the Naibbe cipher proceeds with encryption randomly on a letter-by-letter basis. The natural question: Is it possible at all to construct a homophonic substitution cipher where long-range correlations in the sequence of homophone selections induce long-range correlations in the ciphertext?

I tried a gradual, cumulative change of the text: one additional letter of the alphabet is rotated every 100 lines.

At the end of the Latin text (Marbodius carmina varia, nearly 1000 lines) 10 letters are rotated:
Quote:
pmne manufactum cpntumit lpnga vetuttat
ruie cautut caveat aliena exempla epcebunt
jaue ullum temput vanitat timulata manebit
cpneit fescla famet plenit intuavia cuncta
epctsina ett fsuctut eulcit saeicit amasa

   

A few days ago I tried incrementing the rotation of all letters every 5, 10, 50 or 100 lines, and it didn't work at all (no gap), so I concluded that something more gradual is needed. Now only one letter's rotation is incremented every 100 lines. 100 lines is about right; the gap gets bigger when the number of lines between rotation increments is decreased.
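
A rough Python sketch of the gradual scheme, under stated assumptions: a "rotated" letter is replaced by the next letter of the alphabet, one more letter joins the rotated set every `step` lines (starting with one letter in the first block), and the order in which letters join is given by a hypothetical `order` parameter. The sample above clearly uses a different order than the alphabetical default here, and only lowercase ASCII letters are handled.

```python
import string

def shift_one(ch):
    """Rotate a lowercase ASCII letter one place forward (z wraps to a)."""
    return chr((ord(ch) - ord("a") + 1) % 26 + ord("a"))

def progressive_rotation(lines, step=100, order=string.ascii_lowercase):
    """Rotate one more letter of `order` for every further `step` lines."""
    out = []
    for i, line in enumerate(lines):
        # Letters rotated so far: 1 in the first block of `step` lines,
        # 2 in the next block, and so on (an assumption of this sketch).
        active = set(order[: 1 + i // step])
        out.append("".join(shift_one(c) if c in active else c for c in line))
    return out

# Toy usage with a 3-letter "order" so the effect is easy to see.
demo = progressive_rotation(["abc"] * 300, step=100, order="abc")
```

With these settings the toy line `abc` becomes `bbc` in the first 100 lines, `bcc` in the next 100 and `bcd` after that, so the symbol statistics drift slowly over the length of the text.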

I suppose that rotating one table (so that a different choice is made among several possibilities with different frequencies), or switching two or more elements of a mapping table (maybe the entire list of homophones for two plaintext chunks), once every 50 or 100 lines and in a deterministic way so that the cipher stays reversible, would have a similar effect on the Naibbe ciphertext.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 27-02-2026

The next step was obvious.

If the gap can be produced by mixing different systems, then what happens if we isolate Voynich A and Voynich B?

Using Currier transliteration, I measured the tail gap separately.

Subset               Tail gap
Voynich A only       ≈ 0.00084
Voynich B only       ≈ 0.00080
A + B concatenated   ≈ 0.00255
A + B interleaved    ≈ 0.00046

This clarifies several things.

First, A and B are statistically distinct. When concatenated in blocks, the gap rises sharply. When interleaved word by word, it drops. This is exactly what we expect from mixing two differentiated systems.
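
The two ways of combining the subsets can be sketched as below. This is a minimal illustration with made-up function names; in particular, the handling of unequal lengths (the leftover words of the longer text are simply appended) is an assumption.

```python
from itertools import zip_longest

def concatenate(a_words, b_words):
    """All of A followed by all of B: two large homogeneous blocks."""
    return list(a_words) + list(b_words)

def interleave(a_words, b_words):
    """Alternate one word from A and one from B: A, B, A, B, ..."""
    mixed = []
    for a, b in zip_longest(a_words, b_words):
        if a is not None:
            mixed.append(a)
        if b is not None:
            mixed.append(b)
    return mixed
```

Both outputs contain exactly the same words; only the scale at which A and B alternate differs, which is why comparing the two isolates the block-level effect.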

But here is the important part: A alone and B alone still show a non-zero gap.

So the Voynich effect is not simply “two languages glued together”. Each variant already contains internal non-stationarity.

To probe deeper, I clustered pages automatically using bigram statistics. The clustering cleanly recovered A vs B without being told anything about Currier classification. That confirms the distinction is real and structural.
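
The clustering method is not detailed here, so the following is only one way such a test could be set up: plain k-means with k = 2 on per-page bigram frequencies, using the standard library only. The names `bigram_profile` and `two_means`, the use of character (rather than word) bigrams, the squared Euclidean distance and the deterministic initialisation are all assumptions of the sketch.

```python
from collections import Counter

def bigram_profile(text):
    """Relative frequencies of adjacent character pairs in one page."""
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values()) or 1
    return {bg: c / total for bg, c in counts.items()}

def sq_dist(p, q):
    """Squared Euclidean distance between two sparse profiles."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys)

def mean_profile(profiles):
    """Component-wise mean of a non-empty list of sparse profiles."""
    total = {}
    for p in profiles:
        for k, v in p.items():
            total[k] = total.get(k, 0.0) + v
    return {k: v / len(profiles) for k, v in total.items()}

def two_means(pages, n_iter=20):
    """Split pages into two clusters (labels 0 and 1) by bigram profile."""
    profiles = [bigram_profile(p) for p in pages]
    # Deterministic start: the first page, and the page farthest from it.
    c0 = profiles[0]
    c1 = max(profiles, key=lambda p: sq_dist(p, c0))
    labels = []
    for _ in range(n_iter):
        new_labels = [0 if sq_dist(p, c0) <= sq_dist(p, c1) else 1
                      for p in profiles]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        group0 = [p for p, lab in zip(profiles, labels) if lab == 0]
        group1 = [p for p, lab in zip(profiles, labels) if lab == 1]
        if group0:
            c0 = mean_profile(group0)
        if group1:
            c1 = mean_profile(group1)
    return labels
```

The resulting labels can then be compared page by page with the Currier A/B assignment to see how well the unsupervised split agrees with it.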

Then I measured the gap inside each cluster:
Cluster     Tail gap
A cluster   ≈ 0.00083
B cluster   ≈ 0.00082

Again, the same pattern. The difference between A and B explains part of the global gap, but not all of it.

Finally, I tested whether the gap inside A might just be an ordering issue. By permuting page order freely, the gap dropped from about 0.00084 to about 0.00034. When I restricted reordering to physically plausible moves (only swapping entire quires), the reduction was much smaller, down to about 0.00065.
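
The two kinds of reordering can be sketched as follows. This only illustrates the difference between the two constraints: the post does not say whether page orders were sampled at random or searched for a minimum gap, so random sampling is an assumption here, as are the function names and the representation of a quire as a list of pages.

```python
import random

def permute_pages(pages, seed=0):
    """Free reordering: any page may move anywhere."""
    rng = random.Random(seed)
    reordered = list(pages)
    rng.shuffle(reordered)
    return reordered

def permute_quires(quires, seed=0):
    """Constrained reordering: whole quires move, page order inside stays."""
    rng = random.Random(seed)
    order = list(quires)
    rng.shuffle(order)
    # Flatten back to a single page sequence.
    return [page for quire in order for page in quire]

# Hypothetical toy layout: three quires of two pages each.
toy_quires = [["p1", "p2"], ["p3", "p4"], ["p5", "p6"]]
by_quire = permute_quires(toy_quires, seed=1)
```

The gap would then be recomputed on each reordered text, and the two distributions of values compared with the gap in the original order.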

That tells us something important.

Part of the signal behaves like block-level drift. But part of it is intrinsic to the internal structure of the text. Even within a single Currier variant, the manuscript is not statistically stationary.

Putting everything together:
– The Voynich gap is not caused by punctuation.
– It is not generated inside lines or paragraphs.
– It is not solely explained by the A/B split.
– It behaves like structured heterogeneity across coherent blocks.

Compared to natural language mixtures, the magnitude of differentiation inside Voynich is stronger than dialectal variation, weaker than completely unrelated languages, and present even within single variants.

At this point, the manuscript looks less like a homogeneous language and more like a system with layered statistical regimes.

The (old) question is what kind of generative process produces this kind of multi-level structure.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 27-02-2026

Over the last tests, the picture has become much clearer.

The Voynich long-range gap is not generated inside lines, paragraphs, or pages. When words are shuffled within those units, the gap disappears. The signal only appears when coherent large blocks are treated as a single homogeneous string.

Mixing natural languages reproduces the effect. The stronger the statistical difference between blocks, the larger the gap. Voynich falls between closely related languages and very distant ones.

Separating Currier A and B shows that they are clearly distinct. Concatenating A and B increases the gap strongly. Interleaving them reduces it. But even A alone and B alone still show a non-zero gap.

Automatic clustering of pages recovers the A/B division without prior information. Yet within each cluster, the gap remains.

Reordering pages freely can reduce the gap substantially, but physically plausible reordering (by quires) reduces it only partially.

Taken together, this suggests that the Voynich manuscript is not statistically homogeneous. It behaves like a system composed of differentiated but internally coherent blocks, with layered non-stationarity across scales.

The gap is real. It is measurable. And it reflects structured heterogeneity rather than a simple local sequential effect.


RE: Measuring Long-Range Structure in the Voynich Manuscript - pjburkshire - 27-02-2026

(27-02-2026, 09:16 PM)quimqu Wrote:
Taken together, this suggests that the Voynich manuscript is not statistically homogeneous. It behaves like a system composed of differentiated but internally coherent blocks, with layered non-stationarity across scales.

The gap is real. It is measurable. And it reflects structured heterogeneity rather than a simple local sequential effect.


So, if I understand you correctly, you are saying that the Voynich Manuscript is like a collection of short stories by different authors and not like a novel by one author.  Is that close?


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 27-02-2026

(27-02-2026, 09:42 PM)pjburkshire Wrote: So, if I understand you correctly, you are saying that the Voynich Manuscript is like a collection of short stories by different authors and not like a novel by one author. Is that close?

Not exactly, but close. Without claiming that this is necessarily the correct explanation, the observed gaps could be interpreted as follows:

At the level of paragraphs and folios, the gap is essentially zero. This would be consistent with the same author, topic, and language within those units.

At the level of Currier A and B, the gap is small but clearly present. This is comparable to what we observe when combining different works written in similar but not identical languages (for example, Latin and Catalan), where the gap is around 0.001.

At the level of the entire Voynich manuscript, the gap is much larger. Its magnitude is similar to what we see when combining books written in different languages. So there is a sort of difference hierarchy.

So perhaps, and I say this very cautiously, we could hypothesize that there are two languages (A and B), and within each of them different dialects or similar languages. I am only saying this to make sense of the structure of the findings, as a simple explanation of something that could create the multi-level gap structure found in the MS.


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 27-02-2026

(27-02-2026, 11:10 PM)quimqu Wrote: So perhaps, and I say this very cautiously, we could hypothesize that there are two languages (A and B), and within each of them different dialects or similar languages.

The two Currier "languages" are the most prominent statistical inconsistencies but the "dialects" are not different in nature: they modify the unigram/bigram/trigram/word frequencies perceptibly at the level of a section, a page. Sometimes, but it's less frequent, a single paragraph* or two strongly deviate from the rest of the page: there is no rule for the length of the apparently homogeneous blocks. Maybe nothing is homogeneous and we see only the most flagrant deviations, when some pattern or patterns suddenly become conspicuously frequent or absent. There are countless examples.

The inhomogeneities haven't been studied enough. I started a thread looking for a statistical tool that would help detect and categorize them.

* An example of "dialect" shift between paragraphs (after the first word of the 2nd paragraph actually): the sudden change of frequent eod to frequent ed on f57r:
   


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 28-02-2026

(27-02-2026, 11:56 PM)nablator Wrote: The two Currier "languages" are the most prominent statistical inconsistencies but the "dialects" are not different in nature: they modify the unigram/bigram/trigram/word frequencies perceptibly at the level of a section, a page. Sometimes, but it's less frequent, a single paragraph or two strongly deviate from the rest of the page: there is no rule for the length of the apparently homogeneous blocks. Maybe nothing is homogeneous and we see only the most flagrant deviations, when some pattern or patterns suddenly become conspicuously frequent or absent. There are countless examples.

The inhomogeneities haven't been studied enough. I started a thread looking for a statistical tool that would help detect and categorize them.

Yes, I think you’re right that variation exists at many scales, and that A and B may just be the most visible statistical contrasts within a broader continuum. I don’t see the manuscript as neatly divided into perfectly homogeneous blocks either.

What I find interesting, though, is that the long-range gap behaves very consistently across all the tests we’ve run. It’s not just reacting to small local frequency shifts. When we concatenate large coherent blocks, the gap increases sharply. When we interleave them, it drops. That pattern keeps repeating.

And the clustering result is what makes me pause. When we clustered pages automatically using the same statistical features, without telling the algorithm anything about Currier A or B, it essentially recovered the A/B split on its own. That suggests the boundary is not just one deviation among many, but a real structural discontinuity.

I’m not saying this proves there are two distinct languages. But the magnitude of the effect between A and B is comparable to mixing different languages in natural corpora, while the variation inside A or inside B looks more like mixing slightly different languages (e.g. Latin and Catalan).

So I agree that nothing may be fully homogeneous. But the gap gives us a quantitative way to measure how strong those block differences are, and at least in this sense, A and B don’t look like just minor fluctuations in a smooth continuum.

I will take a look at your thread. I think there might also be some connections to my thread about automatic topic detection (where the distinction between A/B was also detected, and other subtopics emerged).