The Voynich Ninja
Measuring Long-Range Structure in the Voynich Manuscript - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Measuring Long-Range Structure in the Voynich Manuscript (/thread-5380.html)



Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 18-02-2026

Hello again!

I've taken a break from studying the Voynich for a few months. It's such a complex subject that I think you need to mentally disconnect from it every now and then.  Shy

Lately, I've been doing a simple statistical test on the manuscript, which in principle doesn't depend on any linguistic interpretation. The idea is to measure to what extent the identity of a character at position t gives us information about the character at position t+d. In more technical terms, I'm measuring mutual information, which can be calculated for different distances d. If the text has only very local structure, this dependence, however small, should quickly disappear. If there is a deeper structure, some of the dependence should persist even at larger distances.
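A minimal sketch of the kind of estimator I mean (the function name and details are illustrative, not my exact script):

```python
from collections import Counter
from math import log2

def mi_at_distance(text, d):
    """Plug-in estimate of the mutual information (in bits) between
    characters exactly d positions apart in the text."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (c_t, c_{t+d})
    left = Counter(a for a, _ in pairs)         # marginal of c_t
    right = Counter(b for _, b in pairs)        # marginal of c_{t+d}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c * n / (left[a] * right[b])
        mi += p_ab * log2(c * n / (left[a] * right[b]))
    return mi
```

Running this for a range of distances d gives the MI(d) curves discussed below.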

In the case of the Voynich, mutual information is maintained at a certain level even at distances of 50 to 100 characters (very similar to natural languages). When the same text is globally shuffled, the signal collapses. This seems to confirm that the effect depends on the actual order of the characters and not just their frequencies.

I have also tried a control that preserves local patterns but destroys the global order by shuffling entire lines. In this case, the short-range dependence is maintained, but the behavior at longer distances is lost. This suggests that the signal is not limited to regularities within each line.
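The two shuffling controls can be sketched like this (illustrative helper names; fixed seeds only for reproducibility):

```python
import random

def char_shuffle(text, seed=0):
    """Global character shuffle: preserves character frequencies,
    destroys all sequential order."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def line_shuffle(lines, seed=0):
    """Reorder whole lines: preserves within-line structure,
    destroys cross-line order."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    return lines
```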

To make sure that the result is not just due to the fact that the manuscript has different parts with different letter styles or frequencies, I did a very simple test. I created artificial texts divided into blocks. In each block, the letters appear in the same proportions as in the original text of that part, but they are placed randomly, with no real order.

So the artificial text preserves the slow changes in frequencies between sections, but removes any real structure in the sequence. When I apply the same measurement to these artificial texts, the signal disappears almost completely. This means that the pattern we see in the Voynich cannot be explained simply by the fact that different parts of the manuscript have different letter frequencies. There is more than just variation between sections.
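A sketch of how such block-wise surrogates can be built (illustrative, not the exact script I used):

```python
import random

def blockwise_surrogate(text, n_blocks=10, seed=0):
    """Surrogate text: within each block, characters keep their local
    frequencies but are placed in random order, so slow frequency drift
    between sections survives while sequential order is destroyed."""
    rng = random.Random(seed)
    size = max(1, len(text) // n_blocks)
    out = []
    for start in range(0, len(text), size):
        block = list(text[start:start + size])
        rng.shuffle(block)                  # randomize order inside the block
        out.append("".join(block))
    return "".join(out)
```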

[attachment]

I also trained simple generative models on the Voynich text itself. A 1st-order Markov model captures local transitions but fails to reproduce the structure over longer distances. Moderate-order character n-gram models reproduce short-range effects, but they do not match the persistence observed in the original text.
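For reference, a first-order Markov baseline of this kind can be fitted and sampled roughly like this (a sketch, not my actual code):

```python
import random
from collections import Counter, defaultdict

def train_markov1(text):
    """Estimate first-order transition counts P(next char | current char)."""
    trans = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        trans[a][b] += 1
    return trans

def sample_markov1(trans, length, seed=0):
    """Generate a character sequence from the fitted chain."""
    rng = random.Random(seed)
    state = rng.choice(list(trans))
    out = [state]
    for _ in range(length - 1):
        counts = trans.get(state)
        if not counts:                      # dead end: restart from a random state
            state = rng.choice(list(trans))
        else:
            chars, weights = zip(*counts.items())
            state = rng.choices(chars, weights=weights)[0]
        out.append(state)
    return "".join(out)
```

Applying the same MI(d) measurement to text sampled this way shows the short-range transitions but not the long-range persistence.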

Importantly, the pattern is robust whether spaces are removed or one switches from EVA to an alternative transliteration (CUVA). The overall behavior remains qualitatively the same.

For comparison, I have applied the same analysis to several natural language corpora. The Voynich curves fall within the same general range as those of these texts: they do not behave like mixed noise or like sequences generated by simple local models. On a purely statistical level, the Voynich character sequence shows a structured long-range dependence comparable to that of natural texts.

[attachment]

This does not prove that the manuscript encodes a natural language. But I do think it shows that its character sequence behaves like a structured system with persistent long-range dependencies, and not like a mixed or purely local construct.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 18-02-2026

I ran a very simple experiment to see where the character-level structure of the Voynich text actually lives.
The figure shows raw mutual information between characters at distance d under four conditions:

- Original text
- Tokens shuffled (words preserved internally, but reordered)
- Lines shuffled
- Full character shuffle

[attachment]

The character shuffle behaves as expected: the curve collapses almost completely. This confirms that the signal depends on ordering, not just on character frequencies.

The key comparison is between the original text and the token shuffle. When tokens are randomly reordered, the curve drops sharply and stabilizes close to the character-shuffle baseline after short distances. Internal word structure is still present in this condition, so what disappears is the structure between tokens.
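The token-shuffle condition is easy to state precisely (a sketch assuming whitespace tokenization; the helper name is illustrative):

```python
import random

def token_shuffle(text, seed=0):
    """Reorder whole tokens: word-internal character structure is
    preserved, cross-token order is destroyed."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)
```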

Line shuffling sits in between: it preserves structure within lines but weakens longer-range continuity across lines.
The ordering is consistent:

Original > Line shuffle > Token shuffle > Character shuffle.

From this figure alone, it seems clear that the Voynich manuscript contains statistical dependencies that extend beyond individual tokens. The structure is not purely internal to word-like units.

This does not imply syntax or natural language. It simply shows that cross-token ordering contributes significantly to the long-range signal.


RE: Measuring Long-Range Structure in the Voynich Manuscript - Fontanellean - 19-02-2026

Do I read the graph correctly that a Latin Alchemical-Herbal text is uniquely situated among the Voynich samples when it comes to mutual information? Is that significant? It sounds like you expect natural texts to surround the Voynich curves if enough are sampled.


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 12:04 AM)Fontanellean Wrote: Do I read the graph correctly that a Latin Alchemical-Herbal text is uniquely situated among the Voynich samples when it comes to mutual information? Is that significant? It sounds like you expect natural texts to surround the Voynich curves if enough are sampled.

Yes, it is Marco's alchemical herbal.

By the way: I got some messages warning that I could be banned because my text looks GPT-generated. I usually write in Catalan and translate it with the help of AI, as I am not able to explain the results and technical details well enough in English. This does not mean it is a ChatGPT hallucination and, if I am not wrong, this is not against the rules of the forum (at least it wasn't a couple of months ago).


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

I did some comparison tests today that show something interesting (I will try to explain it in my own English this time). 

I wanted to compare the behaviour of the character-level mutual information. Remember that it measures, roughly, how knowing character c affects the predictability of a character d positions away. I limited d to 50 and removed spaces, as the effect I will try to explain shows up better that way (with spaces the effect is very similar). I compared Voynich A, B, and the full text with Torsten Timm's generated text, a Naibe cipher text, and several natural languages (including Vietnamese). I compared them as full text and with words shuffled.

Here are the plots:

[attachment]
[attachment]

I think it is interesting to compare the differences between the raw text and the shuffled text. We can see that all natural languages (including the Naibe cipher) tend to the same MI at d=50. You can see that shuffling the words collapses the text onto that MI very fast. To me (and this is entirely my own theory), the area between the full-text curve and the shuffled-text curve is the effect that word ordering has on the MI (i.e. how the natural language links words and characters over long distances). For example, if we look at the Alchemical Herbal or Culpeper's lines (full and shuffled), the area between the two curves is the contribution of the natural language that we lose by shuffling the words:

[attachment]

Let's call the area between the two curves, once the MI has converged, the text meaning area.

What is interesting is what happens with Torsten Timm's text. We know it is a generated text that depends heavily on the previous word. But look how the MI at long distances does not converge: the MI falls to an evident gap. This may mean that by shuffling the words we have broken the link they acquire during generation. There is one more point to observe: both curves (full and shuffled) seem to have the same shape, just shifted along the vertical axis. If we force the MI at long d to converge, we see that the lines have almost the same shape:

[attachment]

So... no text meaning area at all. By shuffling the words we lose "only" the links between them (the MI converges at a lower level), but the text may have no meaning at all. Let's call the vertical difference between the full-text and shuffled-text curves the generative area.
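The two quantities I am informally calling text meaning area and generative area could be operationalized roughly like this (an illustrative sketch, assuming both MI(d) curves are sampled at the same distances; function names are mine):

```python
def area_between(mi_full, mi_shuffled):
    """Area between two MI(d) curves sampled at the same distances
    (simple Riemann sum of the pointwise differences)."""
    return sum(a - b for a, b in zip(mi_full, mi_shuffled))

def tail_gap(mi_full, mi_shuffled, tail=10):
    """Vertical offset between the converged tails of the two curves:
    difference of the mean MI over the last `tail` distances."""
    return (sum(mi_full[-tail:]) / tail) - (sum(mi_shuffled[-tail:]) / tail)
```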

So what happens with the Voynich? Let's divide it into Voynich A and Voynich B. This is how they look (full text versus shuffled text):

[attachment]

We see that both of them have some generative area, not as much as in Torsten Timm's plots, but more than the natural languages. Do they have a text meaning area? Well, take a look:

[attachment]

Only Voynich B seems to have a small amount of text meaning area. Of the natural languages analyzed in this work, only Ambrosius Mediolanensis' In Psalmum David CXVIII Expositio shows similar behaviour in terms of text meaning area, but it has (almost) no generative area:

[attachment]

Well, these results raise questions rather than answer them. Whether the observed patterns reflect linguistic organization, generative constraints, or something else entirely remains an open problem.


RE: Measuring Long-Range Structure in the Voynich Manuscript - Rafal - 19-02-2026

Quote:Of the natural languages analyzed in this work, only Ambrosius Mediolanensis' In Psalmum David CXVIII Expositio shows similar behaviour

I guess that's the real pain  Wink

If no real text had this behaviour, we could think it marks some distinction between natural and constructed text.
But since some 100% natural text has this behaviour, is it really indicative of anything ???


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 11:29 AM)Rafal Wrote:
Quote:Of the natural languages analyzed in this work, only Ambrosius Mediolanensis' In Psalmum David CXVIII Expositio shows similar behaviour

I guess that's the real pain  Wink

If no real text had this behaviour, we could think it marks some distinction between natural and constructed text.
But since some 100% natural text has this behaviour, is it really indicative of anything ???

Well, Ambrosius Mediolanensis' In Psalmum David CXVIII Expositio is a theological commentary, written in a highly structured and repetitive style. It is a commentary on a psalm that is itself very repetitive. The text is built from parallel constructions, recurring formulas, and a very stable, repeated vocabulary. When a text is already internally redundant in this way, a large part of its character-level structure may come from repeated lexical patterns rather than from the global order of words. So maybe this is why shuffling the words removes relatively little additional information at the character level.


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 19-02-2026

(19-02-2026, 10:35 AM)quimqu Wrote: To me (and this is entirely my own theory), we could say that the area between the full text and the shuffled text is the effect that word ordering has on the MI (i.e. how the natural language links words and characters over long distances).

I am trying to understand the asymptotic behavior: your MI(d) curves seem to converge to a different value in word-shuffled texts than in "raw" (original) texts, right? (But the Ambrosius text does not, the curves are very close.) If so, it seems very counterintuitive: there should be no predictability at all at large distances. For the "raw" texts because obviously words are preferentially followed by other words, but this fact tells you nothing about the words on the next page or 10 pages later. For the word-shuffled texts it is often characters of the same word that are picked together at low distances so there can be some predictability, it diminishes with distance and goes down to zero as soon as the distance exceeds the length of the longest word.

The explanation must be in the way MI(d) is calculated or displayed. Pardon the very basic question: how is it defined? Is it this formula?

Quote:The mutual information of two jointly discrete random variables X and Y is calculated as a double sum:

I(X;Y) = Σ_y Σ_x p(x,y) · log[ p(x,y) / (p(x) p(y)) ]


RE: Measuring Long-Range Structure in the Voynich Manuscript - quimqu - 19-02-2026

(19-02-2026, 11:46 AM)nablator Wrote:
(19-02-2026, 10:35 AM)quimqu Wrote: To me (and this is entirely my own theory), we could say that the area between the full text and the shuffled text is the effect that word ordering has on the MI (i.e. how the natural language links words and characters over long distances).

I am trying to understand the asymptotic behavior: your MI(d) curves seem to converge to a different value in word-shuffled texts than in "raw" (original) texts, right? If so, it seems very counterintuitive: there should be no predictability at all at large distances. For the "raw" texts because obviously words are preferentially followed by other words, but this fact tells you nothing about the words on the next page or 10 pages later. For the word-shuffled texts it is often characters of the same word that are picked together at low distances so there can be some predictability, it diminishes with distance and goes down to zero as soon as the distance exceeds the length of the longest word.

The explanation must be in the way MI(d) is calculated or displayed. Pardon the very basic question: how is it defined? Is it this formula?

Quote:The mutual information of two jointly discrete random variables X and Y is calculated as a double sum:

I(X;Y) = Σ_y Σ_x p(x,y) · log[ p(x,y) / (p(x) p(y)) ]

That's right, Nablator. It is this formula. It is essentially the mutual information between characters separated by exactly d positions, averaged over all such pairs in the corpus.

Mutual information at distance d does not measure whether we can predict words on the next page. It simply measures whether knowing one character slightly changes the probability distribution of another character located d positions away.

It is true that in a perfectly random and fully independent sequence, MI(d) would go to zero as d increases. But real texts are not random or stationary. They contain repeated words, recurring morphemes, topic clusters, stylistic patterns... All of these create weak but measurable statistical dependencies, even if they do not correspond to meaningful prediction across pages.

In the case of word-shuffled texts, word structure is preserved, but the order of words is destroyed. This is why most medium-range structure collapses. However, because the same words are still present, and because some words are repeated or clustered, a small residual dependence can remain. I have observed that in natural languages this residual dependence is "the same" as for the raw texts (at high d), but that for generated text like Torsten's (without a meaning) this residual dependence is different. Why does this happen? Well, I guess (I repeat, I am just guessing) that the remaining MI(d), which comes from the overall vocabulary of the corpus, keeps a sort of natural meaning that does not depend on the order of the words. In Torsten's text, this meaning disappears.

I know it is hard to explain or understand... I am just guessing and making some suggestions about these findings, as I haven't seen any literature that discusses these differences.

Please note, this is not semantic prediction at long distances, but simply a reflection of the overall statistical structure of the corpus.


RE: Measuring Long-Range Structure in the Voynich Manuscript - nablator - 19-02-2026

(19-02-2026, 12:17 PM)quimqu Wrote: It is this formula. It is essentially the mutual information between characters separated by exactly d positions, averaged over all such pairs in the corpus.

That's what worries me (potentially, I don't know, but I suspect a possible issue there): how do you average the MI of all the pairs at distance d? A simple mean, or a weighted average of all the log(...) terms as in the formula?
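To make the question concrete, the two conventions would differ like this (a sketch; function names are illustrative, and the second variant is shown only for contrast, it is not the textbook definition):

```python
from collections import Counter
from math import log2

def mi_weighted(pairs):
    """Standard plug-in MI: sum of p(x,y) * log2(p(x,y)/(p(x)p(y))),
    i.e. each log term weighted by the joint probability."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def mean_of_logs(pairs):
    """Unweighted mean of the log terms over distinct pairs --
    NOT the textbook definition; shown only for contrast."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum(log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items()) / len(joint)
```

The two give different values in general, so which one is used matters for comparing curves.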