The Voynich Ninja

Full Version: Comparing the Voynich by Word Position Profiles
Hello everyone,

while working on the idea of my [linked post], I tried a small exercise. Imagine that the Voynich is a positional substitution cipher, where each position in a word is encoded independently.

What I did was to take the Voynich tokens in EVA transliteration, look at them position by position, and record the distribution of characters. But I deliberately ignored which character it was. In other words: at position 1 we might have 40% of one character, 30% of another, 20% of a third, and so on. Just the shape of the distribution, not the labels.
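The procedure described above can be sketched in a few lines of Python (function and variable names are mine, not from the original analysis; the EVA tokens are a toy sample):

```python
from collections import Counter

def positional_profiles(tokens, max_pos=10):
    """For each character position, return the sorted distribution of
    character frequencies -- only the 'shape', with the labels discarded."""
    profiles = []
    for pos in range(max_pos):
        counts = Counter(tok[pos] for tok in tokens if len(tok) > pos)
        total = sum(counts.values())
        if total == 0:
            break  # no token is this long
        # Sorting descending keeps the shape but forgets which character is which.
        profiles.append(sorted((c / total for c in counts.values()), reverse=True))
    return profiles

# Toy EVA-like sample, for illustration only:
tokens = ["daiin", "chedy", "qokeedy", "shedy", "daiin", "ol"]
profiles = positional_profiles(tokens)
# profiles[0] is the sorted character-frequency shape at position 1
```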

Then I repeated the same procedure for several different languages. My working assumption is that if the corpora are large enough, the positional distributions should be similar within texts of the same language. Here is the result I got from the texts I currently have available:

Corpus                                      | Distance | Tokens
Alchemical herbal (Latin)                   | 0.327    | 6,536
De Docta Ignorantia (Latin)                 | 0.374    | 37,121
Tirant lo Blanc (Catalan)                   | 0.395    | 419,309
La Reine Margot (French)                    | 0.396    | 112,803
Ambrosius Medionalensis (Latin)             | 0.402    | 117,734
El Lazarillo de Tormes (Spanish)            | 0.403    | 20,060
Simplicius Simplicissimus (German)          | 0.415    | 189,804
Romeo and Juliet (English)                  | 0.451    | 24,822
The English Physician (Culpepper) (English) | 0.460    | 135,362

So what does this mean? In this experiment the texts that came out closest to the Voynich were in Latin (especially the “Alchemical herbal” and “De Docta Ignorantia”), followed by Catalan, French, and Spanish. German and English were clearly further away.
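The post does not say which distance metric produced the numbers in the table; as one plausible sketch, the profiles of two corpora could be compared by averaging the L1 distance between their sorted per-position shapes, padding the shorter shape with zeros:

```python
def profile_distance(profiles_a, profiles_b):
    """Average L1 distance between sorted per-position distributions.
    Illustrative only: the metric behind the table above is not stated."""
    n = min(len(profiles_a), len(profiles_b))  # positions both corpora have
    total = 0.0
    for i in range(n):
        a, b = profiles_a[i], profiles_b[i]
        m = max(len(a), len(b))
        a = a + [0.0] * (m - len(a))  # pad the shorter shape with zeros
        b = b + [0.0] * (m - len(b))
        total += sum(abs(x - y) for x, y in zip(a, b))
    return total / n

# Two hypothetical single-position shapes:
d = profile_distance([[0.5, 0.3, 0.2]], [[0.4, 0.4, 0.2]])
```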

Of course this does not prove the language of the Voynich, but it is interesting that the nearest matches are all Romance or Latin texts, and the Germanic ones sit lower down the ranking. It suggests that, at least under this positional-distribution approach, the Voynich behaves more like Romance/Latin than like Germanic languages.

Note: I used the "Alchemical herbal" transliteration from Marco Ponzi and the German Simplicius Simplicissimus version from Jorge Stolfi.
(18-09-2025, 11:31 AM)quimqu Wrote: In other words: at position 1 we might have 40% of one character, 30% of another, 20% of a third, and so on. Just the shape of the distribution, not the labels.

That is enough information to compute the entropy of the first character. For the entropy of the subsequent characters one would need the joint distribution of the first n characters, not just the distribution of the nth character alone.
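The first point is easy to verify: Shannon entropy depends only on the probability values, not on which character carries each one, so the sorted "shape" is sufficient (a small sketch):

```python
import math

def shannon_entropy(dist):
    """Entropy in bits of a probability distribution. Labels are irrelevant,
    so the sorted shape from the experiment above is enough to compute it."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Example: a first-position shape of 40% / 30% / 20% / 10%
h = shannon_entropy([0.4, 0.3, 0.2, 0.1])  # about 1.85 bits
```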

I don't put much faith in this kind of analysis because the token and lexeme length distributions of Voynichese are quite different from those of "European" languages (Indo-European, Finno-Ugric, Semitic, Turkic, Basque).   
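The length-distribution objection is straightforward to check on any pair of corpora; a minimal sketch, using invented toy samples rather than real corpus data, and total variation distance as one possible comparison:

```python
from collections import Counter

def length_distribution(tokens):
    """Relative frequency of each token length."""
    counts = Counter(len(t) for t in tokens)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

# Toy samples, for illustration only:
voynich_sample = ["daiin", "chedy", "qokeedy", "ol", "shedy"]
latin_sample = ["et", "in", "herba", "aqua", "contra"]

da = length_distribution(voynich_sample)
db = length_distribution(latin_sample)
# Total variation distance between the two length distributions:
tv = 0.5 * sum(abs(da.get(k, 0.0) - db.get(k, 0.0)) for k in set(da) | set(db))
```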

Thus the 8th letter of a "European" word would at best correspond to maybe half of the 4th letter of a Voynichese word. But in fact we know that the character entropy of Voynichese words is spread out more evenly along the word than that of "European" words, so the correspondence is not even linear.

And moreover these statistics are highly dependent on the spelling system, much more than on the language, and even on the nature of the text.  For instance, the Voynichese statistics would be quite different if the qo and y prefixes of Voynichese words were always split off, or always joined to the next word. Or if one spells Arabic "alkitab" as "al-kitab" or "al kitab" or "alktab"...

All the best, --jorge
(18-09-2025, 12:15 PM)Jorge_Stolfi Wrote: I don't put much faith in this kind of analysis...

Yes, I completely agree that comparing natural languages with a transliteration of Voynichese is risky, for all the reasons you mention: segmentation, transcription conventions, word-length distribution, and orthography, amongst others. My goal is far from claiming a linguistic match; I simply wanted to run the comparison and show the outcome. I found the result curious and thought it might be interesting to share, even if it can only be taken as an exploratory observation.
(18-09-2025, 12:15 PM)Jorge_Stolfi Wrote: I don't put much faith in this kind of analysis...

Do you think we cannot squeeze anything more out of the transliterations?
(18-09-2025, 08:41 PM)quimqu Wrote: Do you think we cannot squeeze anything more out of the transliterations?

I believe that we won't get much from just computing statistics and looking at them. To make progress we must make specific hypotheses and then compute the statistics, simulations, etc. that are best suited to confirm or reject those hypotheses.

The test you just did could confirm or reject the hypothesis that Voynichese is one of the languages you used AND it was written with its official spelling system AND spaces were faithfully preserved by the Scribe and transcriber AND it is encrypted with a simple (1 letter)=(1 EVA glyph) substitution cipher.  The test rejected that hypothesis.  But it would have been unnecessary, since we already know that the token and lexeme length distributions of Voynichese do not match those of those languages.  

As you may know, I have my own hypothesis about the nature of Voynichese and the contents of the book.  That hypothesis passed several tests, but they were not tight enough to convince the unbelievers.  I am still looking for better tests.

If you believe that it could be a European language in cipher, then you should state that hypothesis in a way that allows for all the perturbing possibilities that were excluded above. As a minimum you must assume that word spaces in the transcription file are mostly wrong, because otherwise the hypothesis immediately fails the length-distribution test.

And it would be prudent to relax the (1 letter)=(1 EVA glyph) assumption too.  Since the author invented a new alphabet, one cannot assume that he followed the official spelling system.  If I had to transcribe a book in Arabic or Hebrew, for instance (knowing the language or not), I would write down all vowels, not just the long ones.

So the challenge is, what sort of statistics would be best for confirming or refuting this kind of hypothesis?  Seems a tough problem...

All the best, --jorge