The Voynich Ninja

Pages: 1 2 3 4 5

(19-02-2026, 05:48 PM)nablator Wrote: I changed h to H because it is actually a Greek eta and the purists (like ReneZ) are shocked by the improper romanization.

I'll try it and see what happens. Sometimes intuition is totally wrong...

Please let me know

(19-02-2026, 07:12 PM)quimqu Wrote: Please let me know

I have similar curves for H1-H2(d) but the values are different, probably because the data and preprocessing are different. I removed all punctuation, parentheses, brackets, apostrophes, numbers (including some internal and external references), and spaces, and converted everything to lowercase.

Voynich data from RF1b-er (basic_EVA) including "?". Paragraph text only.

[attachment=14310]
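
A minimal sketch of the cleaning steps described above, assuming that "everything else" means alphabetic characters only. The function name `preprocess` and the `keep` parameter are illustrative, not taken from the code behind the attached plot; `keep` is there so that a symbol such as "?" can be retained for the Voynich transliteration.

```python
def preprocess(text: str, keep: str = "") -> str:
    """Lowercase the text and drop everything that is not a letter.

    Assumption: punctuation, brackets, apostrophes, digits and spaces are
    all removed by keeping only alphabetic characters, plus any extra
    characters listed in `keep` (for example "?" in basic EVA).
    """
    lowered = text.lower()
    return "".join(ch for ch in lowered if ch.isalpha() or ch in keep)


if __name__ == "__main__":
    print(preprocess("Beatus vir, qui (1) non abiit!"))
    print(preprocess("qo?edy daiin", keep="?"))
```

Small differences in this step (for instance whether references or foreign words are dropped) change the character counts, so they are a plausible reason for two people getting different absolute values from the same source text.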

(21-02-2026, 09:24 PM)nablator Wrote:
(19-02-2026, 07:12 PM)quimqu Wrote: Please let me know

I have similar curves for H1-H2(d) but the values are different, probably because the data and preprocessing are different. I removed all punctuation, parentheses, brackets, apostrophes, numbers (including some internal and external references), and spaces, and converted everything to lowercase.

Voynich data from RF1b-er (basic_EVA) including "?". Paragraph text only.

Great

It seems that your Voynich curves have the same pattern as mine. Raw and shuffled look very similar, but shuffled is shifted a bit lower in terms of final MI, while the natural languages seem to lose the soft decay of H1-H2(d) at short to medium d once the words are shuffled, keeping the same final MI.

I guess the natural languages lose the mid-range connections once the words are shuffled, but they still tend to the same final MI. The Voynich, however, really seems not to have this kind of connection, and it even loses MI when shuffled.

Your Ambrosius plots look more like a natural language than mine. Did you take the same psalm?

About the results, I think the difference may be about where the structure of the text actually comes from.
In natural languages, a lot of the structure is spread everywhere. Words have internal patterns, the same endings and letter combinations appear again and again, and the overall “feel” of the language stays quite stable throughout the text. When we shuffle the words, we break the logical flow of sentences, but we do not destroy this underlying texture. That is why the final level of mutual information stays about the same, even if the curve reaches it more quickly.

In Torsten’s text, and possibly in the Voynich, more of the structure seems to depend directly on the order of the words themselves. If each word is strongly influenced by the previous one, then changing the order really removes part of the structure. That could explain why the shuffled version ends up with a lower final mutual information.

So I think natural languages keep much of their structure even when word order is broken, while Torsten’s text and the Voynich seem to rely more on the sequence of words itself. This could mean that the Voynich might have been written in a way where the order of the words plays a much bigger role than in normal languages, and that changing this order removes part of its structure (that is why the final MI is lower in the shuffled text).

I think we can agree that, in the natural languages tested here, the final level of mutual information does not seem to depend strongly on word ordering. When words are shuffled, the curve changes shape, but it tends to converge to a similar final value.

In contrast, in Torsten Timm’s generated text and in the Voynich, the final mutual information appears to depend more on the original word order. When the words are shuffled, the long-distance MI converges to a lower level.

This could suggest that, in these cases, a significant part of the structure is carried by the sequential chaining of words, meaning that each word may depend more directly on the previous one than in typical natural language texts.
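
A minimal sketch of the comparison discussed above, assuming that H1-H2(d) means the unigram entropy minus the conditional entropy of a character given the character d positions earlier (that is, the mutual information at distance d), and that "shuffled" means a single random permutation of all the words in the text. The function names and the toy word list are illustrative only; this is not the code used for the plots in this thread.

```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def h1_minus_h2(text, d):
    """H(Y) - H(Y|X), where Y is the character d positions after X."""
    pairs = Counter(zip(text, text[d:]))  # joint counts of (X, Y)
    firsts = Counter(text[:-d])           # X: every position that has a partner
    seconds = Counter(text[d:])           # Y: the partner positions
    # H(Y|X) = H(X, Y) - H(X)
    return entropy(seconds) - (entropy(pairs) - entropy(firsts))


def shuffle_words(words, seed=0):
    """Return a randomly permuted copy of the word list (whole-text shuffle)."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    words = "daiin chedy qokeedy shedy daiin qokain chedy daiin".split()
    raw = "".join(words)
    mixed = "".join(shuffle_words(words))
    for d in (1, 2, 5):
        print(d, h1_minus_h2(raw, d), h1_minus_h2(mixed, d))
```

On a sample this small the numbers mean nothing. Note also that this plug-in estimate is biased upward when the text is short relative to the number of possible character pairs, so the "final" level of a curve is best compared between texts of similar length and alphabet size.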

We know for sure that Torsten's text is from a generative algorithm... If Voynich has the same pattern... can we say it is also artificially generated?

(21-02-2026, 11:43 PM)quimqu Wrote: This could suggest that, in these cases, a significant part of the structure is carried by the sequential chaining of words, meaning that each word may depend more directly on the previous one than in typical natural language texts.

We already know that there are very significant non-random patterns across word breaks in the VMS and also in Torsten's generator, and I see now he has a setting "method.canFollow=statistic" for selecting the follow-up word. It could be interesting to play with conf.properties to activate/deactivate settings that have an effect on word ordering.

So the loss of predictability in a word-shuffled version of the Voynich text understandably accounts for the gap between the two curves at medium range (at a distance consistent with the 1st letter of the bigram falling in one word and the 2nd in the next word or the next few words, because the selection of the next word also affects the words after it), but I don't understand why the gap remains at very long range.

Distance 1-300:
[attachment=14329]

Distance 1-3000 (dotted black curves are the moving average of the 50 previous values):
[attachment=14332]
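
For anyone who wants to reproduce the dotted curves, here is a minimal sketch of a trailing moving average. It assumes that "the 50 previous values" means the current point together with the points before it, up to 50 in total, and that the first points are averaged over however many values exist so far. The name `trailing_average` is illustrative.

```python
def trailing_average(values, window=50):
    """Average of each value and the ones before it, at most `window` in total."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that has just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    print(trailing_average([1, 2, 3, 4], window=2))
```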

Note: I am using the "circularized" version of the conditional entropy calculation so the number of bigrams in the calculation is the same for any distance.
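
One possible reading of "circularized", as a minimal sketch: the text is treated as a ring, so the character d positions after the last one wraps around to the start. That makes the number of pairs equal to the text length for every d. The function name is illustrative, and the details of the actual calculation behind the plots may differ.

```python
import math
from collections import Counter


def circular_conditional_entropy(text, d):
    """H(Y|X) in bits, where Y is the character d positions after X,
    wrapping around the end of the text (assumed meaning of "circularized")."""
    n = len(text)
    pairs = Counter((text[i], text[(i + d) % n]) for i in range(n))
    singles = Counter(text)  # every character is the first of exactly one pair
    h = 0.0
    for (x, _y), count in pairs.items():
        # p(x, y) * log2 p(y | x), with p(y | x) = count(x, y) / count(x)
        h -= count / n * math.log2(count / singles[x])
    return h


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, circular_conditional_entropy("daiinchedyqokeedy", d))
```

A side effect of wrapping is that the first and second characters of the pairs have exactly the same unigram distribution, so a single H1 can be used for every distance.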

(21-02-2026, 11:43 PM)quimqu Wrote: We know for sure that Torsten's text is from a generative algorithm... If Voynich has the same pattern... can we say it is also artificially generated?

If only it were that simple to test artificiality. :)

(21-02-2026, 11:31 PM)quimqu Wrote: Your Ambrosius plots look more like a natural language than mine. Did you take the same psalm?

Yes. Maybe I cleaned it too much: removed apparatus, sermon titles, chapter numbers, references and punctuation. I left the few (61) words in Greek.

Here are my pre-processed texts if you want to check them: all lowercase, one line per word. I removed the line feeds for the calculation of course.

[attachment=14316]

I'm glad you all are looking at this further. After my work on the Naibbe cipher, I am convinced that the long-range correlations and line-position effects are the two hardest VMS properties to reconcile with meaningful text. So kudos to you all.

A few more graphs for comparison:

[attachment=14345]
[attachment=14347]
Voynich transliteration: gap at long range

Torsten Timm's generated text: gap at long range

Natural languages: no gap at long range

Naibbe ciphertext: no gap at long range

So Torsten's generator does something right and it is not hard to figure out what: it's in the title of a current thread.

I'll try and find out if the gap can be replicated with a simple modification of the texts that don't show a gap at long range.

(23-02-2026, 10:48 PM)nablator Wrote: A few more graphs for comparison:

Voynich transliteration: gap at long range

Torsten Timm's generated text: gap at long range

Natural languages: no gap at long range

Naibbe ciphertext: no gap at long range

So Torsten's generator does something right and it is not hard to figure out what... I'll try and find out if the gap can be replicated with a simple modification of the texts that don't show a gap at long range.

Thank you! Yes, my plots showed something similar. Note that the Naibbe ciphertext starts out as a natural-language text that is then enciphered.

(21-02-2026, 11:43 PM)quimqu Wrote: ...

I am confused about what you mean by "word shuffled" and "token shuffled". Is that a random permutation of the tokens on each line, on each paragraph, or on the whole text?

Either way, if you shuffle the words and then measure the mutual information between characters at distance d, you are basically computing the mutual information of characters at distance d within the word, and then mixing that function with shifted copies of itself.

More generally, again, statistics about characters are more confusing than illuminating. For one thing, they depend primarily on the spelling/encoding system, secondly on the topic, genre and style, and lastly on the language. But also the distance d between characters is not a very meaningful variable, because words and syllables have different lengths.

For instance, suppose that you were trying to see whether the language has gender/number agreement of adjectives and nouns, as in Indo-European languages. To properly test that theory you would need statistics on suffixes of consecutive words. If that theory were true, you would notice a signal in that fixed-distance correlation plot that would disappear when words were scrambled. But that signal would be spread out over all distances d between ~3 and ~15, and thus might be hard to see.

A similar comment would apply if you wanted to test for vowel harmony, a feature of Turkish and Hungarian. That feature would manifest itself as a correlation between vowels inside a word. In the fixed-distance correlation plots, it would show up as a fuzzy peak at distance ~2, that does not disappear when words are scrambled, becomes broader when characters are scrambled within each word, and disappears when characters are scrambled across words. A better way to test for such a feature would be to look for correlations between characters inside each word independently of distance.
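
The two word-level tests suggested above could be sketched as follows. This is only an illustration under stated assumptions: `suffix_pairs` pairs the last k letters of each word with those of the next word, and `in_word_pairs` pairs successive letters from a chosen class inside each word regardless of how far apart they are. The `letters` argument is a placeholder; which Voynich glyphs (if any) behave like vowels is not known, and the example words are ordinary Spanish and Turkish words used only to show the mechanics.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information in bits of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return sum(
        c / n * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


def suffix_pairs(words, k=1):
    """(suffix of word i, suffix of word i+1) for consecutive words."""
    return [(a[-k:], b[-k:]) for a, b in zip(words, words[1:])]


def in_word_pairs(words, letters="aeiou"):
    """Successive members of `letters` inside each word, ignoring distance."""
    pairs = []
    for word in words:
        picked = [ch for ch in word if ch in letters]
        pairs.extend(zip(picked, picked[1:]))
    return pairs


if __name__ == "__main__":
    spanish = "casa blanca perro negro casa blanca".split()
    turkish = "kitaplar evler okullar".split()
    print(mutual_information(suffix_pairs(spanish)))
    print(mutual_information(in_word_pairs(turkish)))
```

Because the empirical estimate is positive even for unrelated sequences when the sample is small, either number is only meaningful next to the same statistic computed on a scrambled version of the text (words scrambled for the suffix test, letters scrambled across words for the in-word test).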

And another suggestion: use log-log scales in your plots. That would expand both the left side, where all plots now get squeezed into a narrow vertical band, and the longer range, where all plots get squeezed together near the bottom. A justification for using log on the horizontal axis is that the difference between d = 2 and d = 3 is expected to be much more dramatic than the difference between d = 21 and d = 22. Indeed, maybe you should somehow combine the correlations for ranges of large distances, say 20-29 and 30-39, into single statistics.
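
A minimal sketch of the pooling idea, assuming that "combine" means a plain average of the per-distance values inside each range. The function name, the dictionary input and the choice of bin edges are illustrative.

```python
def bin_by_distance(value_by_d, edges):
    """Average a {distance: value} mapping over the ranges [lo, hi)
    defined by consecutive entries of `edges`.

    Returns a list of ((first_d, last_d), mean_value) tuples; ranges with
    no data are skipped.
    """
    binned = []
    for lo, hi in zip(edges, edges[1:]):
        values = [value_by_d[d] for d in range(lo, hi) if d in value_by_d]
        if values:
            binned.append(((lo, hi - 1), sum(values) / len(values)))
    return binned


if __name__ == "__main__":
    # Placeholder curve, not real data: a value that decays as 1/d.
    curve = {d: 1.0 / d for d in range(1, 40)}
    for (first, last), mean in bin_by_distance(curve, [20, 30, 40]):
        print(f"d = {first}-{last}: {mean:.4f}")
```

For the plot itself, matplotlib's `pyplot.loglog` draws both axes on a logarithmic scale; values that are zero or negative cannot be shown on a log axis, which is one more reason to pool the noisy long-range values before plotting.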

All the best, --stolfi


quimqu

nablator

quimqu

quimqu

nablator

nablator

magnesium

nablator

quimqu

Jorge_Stolfi