The Voynich Ninja - Measuring Long-Range Structure in the Voynich Manuscript


(24-02-2026, 06:28 AM)Jorge_Stolfi Wrote: I am confused about what you mean by "word shuffled" and "token shuffled". Is that a random permutation of the tokens on each line, on each paragraph, or on the whole text?

Random permutation of word tokens on the whole text from paragraphs of Currier A and (separately) B pages only for me.

(24-02-2026, 11:26 AM)nablator Wrote: Random permutation of word tokens on the whole text from paragraphs of Currier A and (separately) B pages only for me.

For me, in post #2 I compare:

- Original text
- Tokens shuffled (words preserved internally, but reordered)
- Lines shuffled
- Full character shuffle

(It is interesting to see how any shuffling (words, lines or characters) makes the MI tend to the same value, while the raw Voynich text tends to a slightly higher value.)
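A minimal Python sketch of the three shuffles in the list above, assuming the transcription is held as a list of lines with space-separated words. The function names and the sample lines are illustrative and not taken from the thread. Whether word spaces take part in the character shuffle is also an assumption.

```python
import random

def shuffle_tokens(lines, rng):
    """Pool every word token from every line and permute them globally."""
    tokens = [tok for line in lines for tok in line.split()]
    rng.shuffle(tokens)
    return tokens

def shuffle_lines(lines, rng):
    """Keep each line intact but permute the order of the lines."""
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

def shuffle_chars(lines, rng):
    """Permute every character of the text (word spaces included, by assumption)."""
    chars = list(" ".join(lines))
    rng.shuffle(chars)
    return "".join(chars)

# Illustrative input only: three short lines of EVA-like words.
lines = ["daiin shedy qokeedy", "chol chor daiin", "okaiin shedy"]
rng = random.Random(0)  # fixed seed so a run can be repeated
print(shuffle_tokens(lines, rng))
print(shuffle_lines(lines, rng))
print(shuffle_chars(lines, rng))
```

Each function returns a new object and leaves `lines` untouched, so the same input can be fed to all three conditions.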

From there on, the shuffle is always a word-token shuffle (words kept intact, but randomly reordered).

The Breslau Pharmacopoeia (first part, less than half, 187 KB of transcript).

The curves are close to Voynich (EVA) Currier B, but with no gap at long range. Surprise? No.

[attachment=14366]

We observed a consistent pattern when comparing the Voynich text with natural languages and with Torsten Timm’s generated text. If we compute character-level mutual information MI(d) and then randomly shuffle whole words (tokens), the long-range MI of natural languages changes very little. The overall shape of the tail remains almost the same.
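For readers who want to reproduce the curves, here is a minimal sketch of a character-level MI(d) estimate. It assumes the plain plug-in estimator over all character pairs that are d positions apart, with the result in bits. The thread does not state which estimator, logarithm base, or treatment of word spaces was actually used, so all of those are assumptions, as is the function name.

```python
from collections import Counter
from math import log2

def mutual_information(text, d):
    """Plug-in estimate (in bits) of the MI between characters d positions apart."""
    # Count every pair (text[i], text[i + d]).
    pairs = Counter(zip(text, text[d:]))
    n = sum(pairs.values())
    if n == 0:
        return 0.0
    # Marginal counts of the first and second member of each pair.
    left, right = Counter(), Counter()
    for (x, y), c in pairs.items():
        left[x] += c
        right[y] += c
    # Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (left[x] * right[y]))
        for (x, y), c in pairs.items()
    )

# A strictly alternating string is fully predictable at distance 2.
print(mutual_information("abababab", 2))
```

Comparing `mutual_information(raw, d)` with the same call on a word-shuffled copy of the text, for large d, gives the raw versus shuffled contrast discussed here.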

In contrast, when the same word shuffle is applied to the Voynich text or to Torsten Timm’s generated text, the long-range MI decreases clearly (I have carried out bootstrap tests, and the difference is statistically significant). This behavior has also been noted by Nablator, who reported a similar stability in natural languages and a reduction in Voynich after word shuffle.

To understand where the effect comes from, I repeated the analysis inside the Voynich manuscript:

- by section (Herbal, Biological, Zodiac, etc.)
- by page
- by lines grouped by their initial character
- using leave-one-out tests to see which parts contribute most to the global gap
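The leave-one-out step can be sketched as follows. The gap statistic is passed in as a function, because the thread does not give its exact implementation. `leave_one_out`, `toy_stat` and the two toy sections are hypothetical names and data used only to show the mechanics. The sign convention (gap without the section minus the full gap) is chosen to match the "Change in global gap if removed" column in the table below.

```python
def leave_one_out(sections, gap_fn):
    """Return {section: gap_without_section - gap_of_full_text}.

    sections: dict mapping a section name to its list of tokens, in text order.
    gap_fn:   any function that maps a token list to a number.
    """
    full = [tok for toks in sections.values() for tok in toks]
    full_gap = gap_fn(full)
    deltas = {}
    for name in sections:
        rest = [
            tok
            for other, toks in sections.items()
            if other != name
            for tok in toks
        ]
        deltas[name] = gap_fn(rest) - full_gap
    return deltas

# Placeholder statistic (NOT the MI gap): share of tokens starting with "q".
def toy_stat(tokens):
    return sum(t.startswith("q") for t in tokens) / len(tokens)

toy_sections = {"A": ["qo", "qo", "da"], "B": ["da", "da", "da"]}
print(leave_one_out(toy_sections, toy_stat))
```

A negative value means the section was pushing the statistic up (removing it lowers the global value), and a positive value means the section was diluting it.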

The results show that the effect is not uniform across the manuscript. The Herbal section accounts for a large share of the global long-range gap. Some other sections, such as Marginal Stars, actually reduce the overall effect when included.

The pattern related to line-initial characters is much weaker and does not explain the global behavior.

Overall, the data suggest that the Voynich manuscript contains long-range sequential structure that is more sensitive to word order than typical natural language texts. At the same time, this structure is concentrated in specific sections rather than evenly distributed.

Corpus / Subset	Tokens	Internal tail gap	Change in global gap if removed	Meaning
Natural languages (control)	varies	~0 to 1e-5	—	Word shuffle has almost no effect on long-range MI
Voynich (global)	38262	0.001222	—	Clear reduction after word shuffle
Voynich – Herbal	10928	0.001198	-0.000204	Strong driver of the global gap
Voynich – Biological (balneological)	6327	0.000402	-0.000049	Small positive contribution to global gap
Voynich – Cosmological	2246	~0.00117 (global without it)	-0.000052	Minor contributor
Voynich – Marginal stars only	11646	0.000810	+0.000209	Reduces the global gap (diluting effect)
Voynich – Lines starting with "o"	6700	0.001188	Very small effect	Strong internal structure, but not a main driver
Voynich – Lines starting with "d"	5819	0.000816	Near zero effect	Moderate internal structure, minimal global impact
Torsten Timm (generated text)	varies	~0.002–0.003	—	Strong reduction after word shuffle

Good analysis!

I've been trying a few things from the other direction, trying to create a long-range gap with simple "common sense" manipulations of texts, and totally failing so far.

It should be easy but it isn't!

I was wondering... what happens with different transliterations? So far I have been using EVA. But EVA splits many glyphs into very fine units and ends up with a large alphabet. Currier, on the other hand, uses a much smaller set of symbols and groups elements that EVA keeps separate. Torsten Timm style text also works with a relatively compact alphabet based on EVA (there are no special glyphs, making the number of characters used quite similar to that of a Latin alphabet).

If the long-range MI gap were mainly an artifact of rare glyphs or over-fragmentation, reducing the alphabet should weaken the effect (this is at least the logical expectation: it should behave more like natural languages and reduce the gap, shouldn't it?).

Instead, I observe the opposite...

With the same raw vs shuffled comparison:

EVA shows a modest long-range gap.
Currier shows a much larger one.
Torsten Timm style generated text falls in a similar range to Currier.

Summary:

System	Tokens	Alphabet size	Tail MI gap
EVA	~38,000	116	0.00122
Currier	~15,000	36	0.00269
Torsten Timm	~11,000	20	~0.002–0.003

The Currier gap is more than twice the EVA gap.

This is not what I would expect if the signal were coming from rare or decorative glyphs. When the alphabet is reduced and tokens are grouped more coarsely, the long-range structure does not disappear. It becomes stronger.

At the section level, the same pattern seen in EVA still holds in Currier. Some sections contribute strongly to the global gap, while others dilute it. The effect is not uniformly distributed across the manuscript.

The leave-one-out tests confirm that removing certain sections changes the global gap in a measurable way. This again indicates that the structure is not evenly spread across the text.

Below is a simplified comparison of the strongest section-level effects under EVA and Currier.

Grouping	EVA tail gap (approx)	Currier tail gap (approx)	Effect
Global baseline	0.00122	0.00269	Currier > EVA
Herbal section	High contributor	High contributor	Strong in both
Marginal / stars only	Dilutes gap	Dilutes gap	Weak structure
Biological	Moderate	Moderate	Section-dependent

Line-initial groupings show a weaker effect overall, but still reflect uneven distribution.

Grouping	EVA pattern	Currier pattern
Lines starting with common glyphs	Small variation	Small variation
Rare initials	Minimal impact	Minimal impact

In both transliterations, section-level grouping matters more than line-initial grouping. The structure appears to be organized at a broader textual scale rather than driven by local line mechanics.

What could this mean?

One explanation is statistical. A smaller alphabet reduces sparsity and improves estimation of mutual information. That alone could amplify the measured gap.

Another explanation is structural. EVA may over-segment the text. If meaningful units are split into smaller pieces, long-range dependencies get diluted. Currier, by grouping more tightly, may better reflect the functional units that carry sequential constraints.

The important point is that the long-range MI effect is robust under transliteration changes. It does not depend on the large EVA glyph inventory. And in a compact representation, it becomes comparable in magnitude to that of a known sequential generative model such as Torsten Timm's.

(26-02-2026, 10:55 PM)quimqu Wrote: If the long-range MI gap were mainly an artifact of rare glyphs or over-fragmentation, reducing the alphabet should weaken the effect (this is at least the logical thing: it should behave more like natural languages and reduce the gap, isn't it?).

No idea, really. This is not at all what I have in mind. I should wait until I have something positive to report before I comment; there are more tests to do before that. I had a migraine today until an hour ago, so I tested nothing.

Here is my possibly wrong and very badly explained understanding of the most probable cause of the long-range gap, with zero evidence, that I will regret posting tomorrow.

What is the difference between the distribution of letter pairs (x, y) taken at (anywhere, anywhere + long distance) and at (anywhere, anywhere else)? The long distance is, of course, the distance between the letters x and y.

In "normal" texts the language is the same everywhere, so the distribution of single letters is the same everywhere: the two distributions at <anywhere> and <anywhere else> are exactly the same. For pairs at (anywhere, anywhere + long distance), on the other hand, the two distributions at <anywhere> and <anywhere + long distance> remain consistently different over the whole text (or most of it) if there is a (mostly) consistent increasing or decreasing trend in the frequencies of some or all of the letters. Many letters with large frequency changes give a big gap; few letters with small frequency changes give a small gap.

In an evolving text with a direction of evolution, as opposed to a completely random walk, distance matters. A (voluntary or involuntary) direction of evolution is the simplest way I can think of to create the long-range gap. The alternative would be a mathematical link between all letters, which I suppose is unlikely and epistemologically costly, since it would need a theory of spooky statistical action at a (long) distance.

(26-02-2026, 11:49 PM)nablator Wrote: No idea, really. This is not at all what I have in mind. I should wait to have something positive to report before I comment, more tests to do before that. Migraine today until an hour ago, I tested nothing.

I hope your migraine gets better. I have found some interesting things that might shed some light on this. I will post them in the next post.

I think I found something very interesting...

What happens if we stop looking at the whole manuscript as one long string, and instead ask where the long-range signal actually lives?

Up to now, we knew this: if we shuffle words globally in Currier, the long-range MI gap is large and clearly significant. That already sets Voynich apart from normal language corpora, where word shuffle barely changes the tail.

But that still leaves an open question. Is the gap coming from inside lines? Inside paragraphs? Or from the larger structure?

So I ran three different shuffles:

- Shuffle words globally (baseline).
- Shuffle words only inside each line.
- Shuffle words only inside each paragraph.
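The difference between the global and the restricted shuffles can be sketched as below, assuming the text is stored as a list of blocks (lines or paragraphs), each block being a list of word tokens. The function names are illustrative. Cutting the globally shuffled pool back into blocks of the original sizes is an assumption made so that the two outputs have the same shape.

```python
import random

def shuffle_within(blocks, rng):
    """Shuffle tokens inside each block; block boundaries and block order stay fixed."""
    out = []
    for block in blocks:
        tokens = list(block)
        rng.shuffle(tokens)
        out.append(tokens)
    return out

def shuffle_global(blocks, rng):
    """Pool all tokens, permute them, then cut the pool back into the original block sizes."""
    pool = [tok for block in blocks for tok in block]
    rng.shuffle(pool)
    out, start = [], 0
    for block in blocks:
        out.append(pool[start:start + len(block)])
        start += len(block)
    return out

# Illustrative blocks: call them lines or paragraphs, the code is the same.
blocks = [["daiin", "shedy", "qokeedy"], ["chol", "chor"], ["okaiin"]]
rng = random.Random(0)
print(shuffle_within(blocks, rng))
print(shuffle_global(blocks, rng))
```

With `shuffle_within`, every token stays in its own line or paragraph, so anything that distinguishes one block from another is preserved. With `shuffle_global`, tokens can migrate anywhere in the text.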

Before showing the numbers, a quick clarification of what I calculated:

The tail gap means the average difference between the mutual information of the original text and the mutual information after word shuffle, measured at long distances (here d = 60–100). In simple terms, it tells us how much long-range structure disappears when we randomize word order.

The normalized gap is the same quantity divided by H1, the basic entropy of the character distribution. This just rescales the gap so that differences in alphabet size or overall entropy do not distort the comparison.
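Written out, and assuming that MI(d) is the usual mutual information between characters d positions apart, the two definitions read as follows. The symbols p_d (joint distribution of character pairs at distance d), p (single-character distribution) and D (the set of tail distances) are notation introduced here. The thread does not state the logarithm base or whether every integer d from 60 to 100 is used.

```latex
\begin{aligned}
\mathrm{MI}(d) &= \sum_{x,y} p_d(x,y)\,\log\frac{p_d(x,y)}{p(x)\,p(y)} \\[4pt]
\text{tail gap} &= \frac{1}{|D|}\sum_{d \in D}
  \bigl[\mathrm{MI}_{\text{raw}}(d) - \mathrm{MI}_{\text{shuffled}}(d)\bigr],
  \qquad D = \{60, \dots, 100\} \\[4pt]
H_1 &= -\sum_{x} p(x)\,\log p(x) \\[4pt]
\text{normalized gap} &= \frac{\text{tail gap}}{H_1}
\end{aligned}
```

As a rough consistency check, the Currier figures in the table below (0.00250 raw against 0.00094 normalized) would correspond to an H1 of roughly 2.7 in whatever units were used.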

Here is the summary for Currier (punctuation removed, MI normalized by H1).

Shuffle scheme	Tail gap (MI raw − shuffle)	Normalized gap (÷ H1)
Global token shuffle	0.00250	0.00094
Shuffle within lines	≈ 0	≈ 0
Shuffle within paragraphs	≈ 0	≈ 0

The result is very clear.

1. When words are shuffled only inside lines, the long-range gap disappears.
2. When words are shuffled only inside paragraphs, the gap also disappears.

In other words, the signal is not generated inside lines. It is not generated inside paragraphs either. The long-range gap only appears when the global order of the text is disturbed.

To check this further, I shuffled entire lines as intact blocks, and then entire paragraphs as intact blocks.

Shuffle scheme	Tail gap	Normalized gap
Shuffle order of lines (lines kept intact)	0.00251	0.00094
Shuffle order of paragraphs (paragraphs kept intact)	0.00062	0.00023

Shuffling the order of lines gives essentially the same gap as the global token shuffle (0.00251 against 0.00250).
Shuffling the order of paragraphs gives a much smaller gap (0.00062), about a quarter of the global value.

This tells us something important: The long-range MI gap in Currier is not a micro-level effect. It does not come from word order inside lines. It does not come from local syntactic structure. It is mainly a macro-structural effect tied to how paragraphs are arranged across the manuscript, and likely also how larger sections are organized.

If we compare this to natural language corpora, we see a difference. In normal texts, global word shuffle does not produce a large long-range gap in the first place. Here, the gap appears only when we break the global block structure of the manuscript.

So a cautious way to phrase it would be: Inside lines and inside paragraphs, Voynich does not produce the long-range anomaly. The anomaly emerges at the level of paragraph sequencing and higher-level organization.

This is more consistent with block-level non-stationarity than with some kind of long-distance “interaction” between glyphs. It also suggests that the anomaly is not driven by local word-order constraints inside paragraphs, which makes it less consistent with a purely local sequential generator such as Torsten Timm’s model. Instead, the effect appears to be tied to the block-level organization of the manuscript.

One small side note. When punctuation and separators are removed from Currier, the gap actually increases, even after normalizing by entropy. That means the effect is not caused by dots or special symbols. Those elements were diluting the signal, not creating it.

Taken together, this suggests that the long-range behavior of the Voynich manuscript is primarily a property of its large-scale structure, not of its internal line-level syntax.

There is one more step that completes the picture.

If the gap is not generated inside lines, and not inside paragraphs, what about pages?

So I ran the same type of test again, but this time at the page level. First, I shuffled words only inside each page. Then I kept each page intact and shuffled the order of entire pages. Here is the summary (Currier, punctuation removed, normalized by H1):

Shuffle scheme	Tail gap	Normalized gap
Global token shuffle	0.00250	0.00094
Shuffle within lines	≈ 0	≈ 0
Shuffle within paragraphs	≈ 0	≈ 0
Shuffle within pages	≈ 0	≈ 0
Shuffle order of lines	0.00251	0.00094
Shuffle order of paragraphs	0.00062	0.00023
Shuffle order of pages	≈ 0.00205	≈ 0.00077

Now the structure becomes much clearer.

If we shuffle words inside lines, the gap disappears.
If we shuffle words inside paragraphs, the gap disappears.
If we shuffle words inside pages, the gap disappears.

So the anomaly is not produced inside lines.
It is not produced inside paragraphs.
It is not even produced inside pages.

But when we shuffle the order of entire lines, the gap stays the same as under the global shuffle (0.00251).
When we shuffle the order of paragraphs, the gap drops sharply (0.00062).
When we shuffle the order of pages, the gap drops slightly (≈ 0.00205), but remains large.

This combination is important. It means the long-range gap depends on two things at the same time:

- Internal coherence inside large blocks (pages and paragraphs).
- Systematic differences between those blocks across the manuscript.

If we destroy the internal coherence of a page by shuffling its words, the effect vanishes.
If we keep pages intact but rearrange them, the effect survives.

That tells us the main driver is not simply the linear order of pages. It is the fact that different parts of the manuscript have different statistical profiles, and that those profiles are preserved inside pages.

In other words, the manuscript behaves like a set of statistically distinct blocks. When we globally shuffle all words, we destroy that block structure and the contrast becomes visible as a long-range MI gap.
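A toy calculation shows why statistically distinct blocks are enough to produce a long-range MI that a global shuffle removes. The sketch assumes an idealized text made of blocks in which characters are independent draws from a block-specific distribution, and a distance d much shorter than a block, so that both characters of a pair almost always fall in the same block. The function name and the two toy distributions are hypothetical and not fitted to the manuscript.

```python
from math import log2

def block_mixture_mi(block_dists, weights):
    """MI (bits) between two characters that fall in the same, randomly chosen block.

    block_dists: list of {char: probability} dicts, one per block.
    weights:     share of the text occupied by each block (should sum to 1).
    Inside a block, characters are assumed to be independent draws.
    """
    chars = sorted({c for dist in block_dists for c in dist})
    # Overall character distribution of the whole text.
    marginal = {
        c: sum(w * dist.get(c, 0.0) for dist, w in zip(block_dists, weights))
        for c in chars
    }
    mi = 0.0
    for x in chars:
        for y in chars:
            # Pair probability: pick a block, then draw x and y independently in it.
            joint = sum(
                w * dist.get(x, 0.0) * dist.get(y, 0.0)
                for dist, w in zip(block_dists, weights)
            )
            if joint > 0:
                mi += joint * log2(joint / (marginal[x] * marginal[y]))
    return mi

block_a = {"a": 0.7, "b": 0.3}  # toy profile of one block
block_b = {"a": 0.3, "b": 0.7}  # toy profile of a different block
print(block_mixture_mi([block_a, block_b], [0.5, 0.5]))  # distinct blocks: positive
print(block_mixture_mi([block_a, block_a], [0.5, 0.5]))  # identical blocks: about zero
```

In this idealized setting the value does not depend on d, which is what a flat long-range tail looks like. A global shuffle makes every position an independent draw from the overall distribution, so the MI drops to zero and the whole value shows up as the gap. A shuffle confined to each block leaves the value unchanged, so it produces no gap.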

Inside a page, however, the behavior is much closer to stationary. The anomaly only emerges when we treat the whole manuscript as one homogeneous string and then break that structure.

So if we put everything together:

- The gap is not a line-level effect.
- It is not a paragraph-level word-order effect.
- It is not caused by punctuation.
- It is not removable by simply reordering pages.
- It reflects large-scale structural segmentation of the manuscript.

This strongly points toward macro-level organization rather than local sequential constraints. The long-range anomaly appears to be a property of how the manuscript is partitioned and organized globally, not of how words are arranged inside sentences or paragraphs.

That is a very different picture from a purely local generative mechanism.
