The Voynich Ninja
Analysis of Voynich Average Cross-Entropy Loss - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Analysis of Voynich Average Cross-Entropy Loss (/thread-5020.html)



Analysis of Voynich Average Cross-Entropy Loss - Trithemius - 04-11-2025

Hey everyone,

I'm new to the forum, so I'm posting some of my findings here in a bit of a data-dump, sorry about that.

Here's the experiment:

1. I looked at the word-length and character-frequency distributions of the Voynich and created a random corpus of equal length that has the same word-length and character-frequency histograms but is otherwise random. So, an excerpt of my random Voynich simulant looks like this:

   [attached image: random Voynich simulant excerpt]

whereas, as we all know, real Voynich looks like this:

   [attached image: real Voynich excerpt]

My random Voynich simulant has the same word length frequency and same character frequency as the real Voynich, but it's gibberish.

As a control, I used the Latin Bible for comparison. Here's an excerpt of my random biblical Latin simulant:

…snfeluvu tddr eeee nuqnc tt osiem ili clac udieouenl eaula tnmeu…

and here's real Latin:

...in principio creavit Deus caelum et terram terra autem erat…

2. Then, I trained an LSTM on the random corpora (in both the control and Voynich cases). This LSTM therefore learned to predict the character and word-length frequencies that we see in the actual Voynich (or Bible, in the control case).
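I haven't posted my training code, but a character-level LSTM of the kind described could look roughly like this in PyTorch. The architecture, hyperparameters, and toy corpus below are all illustrative, not my actual setup:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Minimal next-character predictor: embed -> LSTM -> logits."""
    def __init__(self, vocab_size, embed=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state  # logits over the next character

# toy usage: fit the model to a (random) corpus with cross-entropy loss
corpus = "qokeedy qokain shedy "
vocab = sorted(set(corpus))
idx = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([[idx[c] for c in corpus]])
model = CharLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(5):  # a few steps only; real training needs far more
    logits, _ = model(data[:, :-1])
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), data[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
```

In the actual experiment the model is trained to convergence on the full random corpus, then evaluated (without further training) on the real text.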

3. Then, I exposed the model to the actual Voynich and plotted, character by character, the degree to which the model was surprised by the next character it saw after being trained on randomness. So, the higher the orange line, the higher the apparent order in the text.

Here's the biblical control:

   [attached image: cross-entropy plot, biblical control]


And here's the Voynich

   [attached image: cross-entropy plot, Voynich]


Interestingly, both show comparable levels of order relative to their random controls. In the Voynich case, there are pretty clear spikes in orderliness around f57, which is not surprising, as that concentric-circle diagram has the same sequence repeated four times on the third ring from the inside.
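For anyone wanting to reproduce the curves: the orange line in step 3 is essentially a smoothed per-character surprisal (negative log-probability) under the trained model. Here's a self-contained sketch with a simple order-0 frequency model standing in for the LSTM; the window size and add-one smoothing are illustrative choices, not what I actually used:

```python
import math
from collections import Counter

def surprise_curve(train_text, eval_text, window=50):
    """Per-character surprisal -log2 p(c) under a model fit on train_text,
    smoothed with a sliding mean. (Order-0 stand-in for the trained LSTM.)"""
    counts = Counter(train_text)
    total = sum(counts.values())
    vocab = set(train_text) | set(eval_text)
    # add-one smoothing so characters unseen in training get finite surprisal
    def p(c):
        return (counts.get(c, 0) + 1) / (total + len(vocab))
    s = [-math.log2(p(c)) for c in eval_text]
    return [sum(s[i:i + window]) / window
            for i in range(0, max(1, len(s) - window + 1))]

curve = surprise_curve("abab" * 100, "abababba" * 20, window=10)
```

With a real model, p(c) would be the model's conditional next-character probability given the preceding context, which is where the LSTM earns its keep.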

I'd be curious to hear anyone's thoughts on this or if there are any other ideas for experiments like this.


RE: Analysis of Voynich Average Cross-Entropy Loss - quimqu - 04-11-2025

Interesting. It is logical, as you have "only" matched the word-length and character frequencies, and this creates words without any structure. The Voynich language has a very low entropy, which means the next character is strongly constrained by the current one. In your output, the next character is completely random (you can calculate the entropy and it will be really high). So, the Voynich has "words", or things that look like words, and your output has ... well, random characters...
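This is easy to check directly: the conditional entropy of the next character given the current one stays low for structured text, but jumps to roughly the unigram entropy once the characters are shuffled. A quick sketch (the toy strings are illustrative):

```python
import math, random
from collections import Counter

def h0(text):
    """Unigram (order-0) entropy in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def h1(text):
    """Conditional entropy of the next character given the current one."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    n = len(text) - 1
    return -sum(c / n * math.log2(c / firsts[a])
                for (a, b), c in pairs.items())

structured = "qokeedy shedy qokeedy shedy " * 50
chars = list(structured)
random.Random(0).shuffle(chars)   # same characters, no structure
shuffled = "".join(chars)
```

The shuffled text keeps the exact same h0 but loses the gap between h0 and h1, which is precisely the order the LSTM fails to find in the simulant.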


RE: Analysis of Voynich Average Cross-Entropy Loss - dashstofsk - 05-11-2025

The peak around f105 is curious. That page is the only one within quire 20 that has a top margin with a strange starting doodle. It might be that it was originally the first page of the section. But also, because it appears that the manuscript was written sheet by sheet and not in book-page order, the pages on the other half of the sheet might exhibit the same divergence in your graph. But your graph stops at f113v. Why does it not show the last pages of quire 20? Also, there seems to be something odd about f115v. It seems not to be representative of the section. I am trying to find out why, and I am curious to know what the divergence is for that page.


RE: Analysis of Voynich Average Cross-Entropy Loss - Jorge_Stolfi - 05-11-2025

(04-11-2025, 10:38 PM)Trithemius Wrote: I looked at the word-length frequency and character-frequency of the Voynich and created a random corpus of equal length which has the same word length and character frequency histograms, but are effectively random.

It has long been known that Voynichese words have a rather rigid structure.  Some early "word models" were published by Tiltman, Mike Roe, and Robert Firth.  Here is one of my own:
   [attached images: word-structure model diagrams]
This rigid structure is why the per-character entropy of Voynichese is lower than that of English and other common languages, and much lower than that of your pseudo-Voynichese (a Markov generator of order zero).

But character statistics are generally not very useful (unless one makes very specific assumptions about the language and encoding). They are very hard to interpret and can easily lead people astray.  For instance, we still don't know what exactly the Voynichese alphabet is.  Even relatively phonetic and logical spelling systems, like Italian and Spanish, have complicated rules that would make it impossible to answer that question correctly from statistics alone.  In Italian the letters "g" and "n" are common and occur independently as separate letters.  But the pair "gn" is essentially a separate letter of the alphabet, with its own sound (like English "ny" or Spanish "ñ") unrelated to those of the other two.  On the other hand, the pair "he" is not a separate letter of the English alphabet, even though it is abnormally common (because of "the", "they", "them", "there").
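For what it's worth, one standard statistic for spotting pairs like "gn" or "he" is the pointwise mutual information of adjacent characters: a strongly positive PMI flags a pair that sticks together far more often than its letter frequencies predict. As noted above, though, statistics alone cannot tell a true compound glyph ("gn") from a merely frequent pair ("he"). A rough sketch, with an illustrative toy sentence:

```python
import math
from collections import Counter

def bigram_pmi(text):
    """Pointwise mutual information of each adjacent character pair:
    log2( p(ab) / (p(a) * p(b)) ).  Large positive values suggest a
    pair that behaves like a single unit."""
    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    n_uni, n_bi = len(text), len(text) - 1
    return {
        a + b: math.log2((c / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), c in bi.items()
    }

pmi = bigram_pmi("ogni bagno ha un segno e un agnello")
```

In this toy Italian sample, "gn" scores high because every "g" is followed by "n"; on a real corpus one would rank all pairs and inspect the outliers.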

For that reason, word statistics are generally more meaningful -- as long as one can assume that the spelling and encryption always take the same plain-text lexeme to the same encoded lexeme.  But even then one must allow for errors and inconsistencies in the data, at all levels -- from the Author himself to the transcriber.

All the best, --stolfi


RE: Analysis of Voynich Average Cross-Entropy Loss - Trithemius - 05-11-2025

(05-11-2025, 08:56 AM)dashstofsk Wrote: The peak around f105 is curious. That page is the only one within quire 20 that has a top margin with a strange starting doodle. It might be that it was originally the first page of the section. But also, because it appears that the manuscript was written sheet by sheet and not in book-page order, the pages on the other half of the sheet might exhibit the same divergence in your graph. But your graph stops at f113v. Why does it not show the last pages of quire 20? Also, there seems to be something odd about f115v. It seems not to be representative of the section. I am trying to find out why, and I am curious to know what the divergence is for that page.

Good catch, I think I might have accidentally truncated the results because of a nuance of my code. I'll try to see why it's not completing to the end.

(05-11-2025, 08:57 AM)Jorge_Stolfi Wrote: Trithemius Wrote: I looked at the word-length frequency and character-frequency of the Voynich and created a random corpus of equal length which has the same word length and character frequency histograms, but are effectively random.

Thank you for sharing this, Jorge. I hadn't seen these models before, but I had thought something like this would make sense. I'll try to examine the text through this lens and see what comes of it!