The Voynich Ninja

Full Version: How multi-character substitution might explain the Voynich's strange entropy
(27-06-2025, 11:34 AM)Koen G Wrote: So that's more like a polyalphabetic cipher with frequent shifting between alphabets. Sounds like a headache!

Indeed. With, say, 20 characters and 3 encoding possibilities for each one, there are 3^20 ≈ 3.5 billion complete substitution alphabets to choose from. That looks like enough freedom to me to get many different 'successful decodings', in many different languages, from any given snippet of text.

But I agree that encoding this way would add randomness to the text (*), so both n-gram entropies and word entropies will decrease with respect to a natural language. But I also agree with @Jorge_Stolfi here: n-gram entropies do not necessarily mean much, because they are heavily dependent on the transcription. Word entropies are more reliable (they only depend on a 'space' actually being a space, and on differently written words actually being different words).


(*) provided the choice among the alphabets is random. But if you need to add a deterministic choosing rule to enable the decoding of the text (which looks indispensable), then the effect on entropy becomes much more difficult to determine. I suspect (but cannot prove) it might well be zero.
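For concreteness, a minimal Python sketch of the kind of scheme being described (nobody's actual code; the three-letter homophone table is the toy example quoted in the next post), showing both a random and a deterministic, position-based choice among the alternatives:

Code:
import random

# Toy homophone table: each plaintext letter can be enciphered by any of
# several cipher letters.
HOMOPHONES = {
    'a': ['T', 'O', 'P'],
    'b': ['U', 'P', 'W'],
    'c': ['T', 'M', 'Z'],
}

def encipher_random(plaintext):
    """Pick one of the allowed cipher letters at random for each character."""
    return ''.join(random.choice(HOMOPHONES.get(ch, [ch])) for ch in plaintext)

def encipher_positional(plaintext):
    """Deterministic variant: the choice cycles with the character's position.
    With this toy table different plaintext letters can still collide on the
    same cipher letter, so decoding would remain ambiguous."""
    out = []
    for i, ch in enumerate(plaintext):
        options = HOMOPHONES.get(ch, [ch])
        out.append(options[i % len(options)])
    return ''.join(out)

# With 20 plaintext characters and 3 choices each, there are 3**20 complete
# substitution alphabets.
print(3 ** 20)                        # 3486784401
print(encipher_random('abcabc'))      # e.g. 'OPZTWM'
print(encipher_positional('abcabc'))  # 'TPZTPZ'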
(27-06-2025, 11:31 AM)quimqu Wrote: a can be ciphered by (T,O,P)
b can be ciphered by (U,P,W)
c can be ciphered by (T,M,Z)

This is something completely different than the null generator code that only inserted 0, 1, 2. There may be a huge misunderstanding here. Can you post the Python code for the actual mapping, please?
(27-06-2025, 12:18 PM)Mauro Wrote: But I also agree with @Jorge_Stolfi here: n-gram entropies do not necessarily mean much, because they are heavily dependent on the transcription. Word entropies are more reliable (they only depend on a 'space' actually being a space, and on differently written words actually being different words).

I don't agree with Stolfi's dismissal of letter-based statistics. We don't know if Voynichese is a phonetic representation at all. Whatever it is, is very much dependent on its writing system. So to get a clue about what's going on, we must compare different ways of transliterating Voynichese to different writing systems. Including codes and ciphers, which are essentially also writing systems.

What I do vehemently agree with is that people must be more aware that Voynichese =/= EVA. When running letter-based tests, there must always be awareness of the choices made and their effects.
(27-06-2025, 12:57 PM)nablator Wrote:
(27-06-2025, 11:31 AM)quimqu Wrote: a can be ciphered by (T,O,P)
b can be ciphered by (U,P,W)
c can be ciphered by (T,M,Z)

This is something completely different than the null generator code that only inserted 0, 1, 2. There may be a huge misunderstanding here. Can you post the Python code for the actual mapping, please?

Yes, it is true, you are right, and I am surprised by the results.

Initially I had this:

a can be ciphered by (T,O,P)
b can be ciphered by (U,P,W)
c can be ciphered by (T,M,Z)

At some point I must have mixed up some functions and didn't double-check. The interesting thing is that the insertion of 0, 1 and 2 creates the entropy bump that I find characteristic of the Voynich. You can see here that it is really different compared to natural languages. So, I might have found a thread of investigation in an involuntary way... I am now thinking about how I can translate this into the Voynich cipher...
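Since it is the accidental version that produced the bump, here is a guess at what such a null-inserting generator might look like. This is a sketch only: the thread confirms the symbols 0, 1, 2 but not the insertion rule, so the per-character probability is an assumption.

Code:
import random

def insert_nulls(text, prob=0.5):
    """Keep each plaintext character and, with probability `prob`, append one
    of the null symbols 0, 1, 2 after it. Only the symbols 0/1/2 come from the
    thread; the insertion rule and probability are assumptions."""
    out = []
    for ch in text:
        out.append(ch)
        if random.random() < prob:
            out.append(random.choice('012'))
    return ''.join(out)

print(insert_nulls('the quick brown fox'))
# e.g. 't1he 0quick bro2wn f0ox'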
(27-06-2025, 01:15 PM)Koen G Wrote:
(27-06-2025, 12:18 PM)Mauro Wrote: But I also agree with @Jorge_Stolfi here: n-gram entropies do not necessarily mean much, because they are heavily dependent on the transcription. Word entropies are more reliable (they only depend on a 'space' actually being a space, and on differently written words actually being different words).

I don't agree with Stolfi's dismissal of letter-based statistics. We don't know if Voynichese is a phonetic representation at all. Whatever it is, is very much dependent on its writing system. So to get a clue about what's going on, we must compare different ways of transliterating Voynichese to different writing systems. Including codes and ciphers, which are essentially also writing systems.

What I do vehemently agree with is that people must be more aware that Voynichese =/= EVA. When running letter-based tests, there must always be awareness of the choices made and their effects.

It was in this post where I noticed the bump of the Voynich (EVA and CUVA) compared to natural languages. That's why I kept pushing this thread and verified different ciphers.
By the way, I corrected this post for clarity.
(27-06-2025, 01:15 PM)Koen G Wrote:
(27-06-2025, 12:18 PM)Mauro Wrote: But I also agree with @Jorge_Stolfi here: n-gram entropies do not necessarily mean much, because they are heavily dependent on the transcription. Word entropies are more reliable (they only depend on a 'space' actually being a space, and on differently written words actually being different words).

I don't agree with Stolfi's dismissal of letter-based statistics. We don't know if Voynichese is a phonetic representation at all. Whatever it is, is very much dependent on its writing system. So to get a clue about what's going on, we must compare different ways of transliterating Voynichese to different writing systems. Including codes and ciphers, which are essentially also writing systems.

What I do vehemently agree with is that people must be more aware that Voynichese =/= EVA. When running letter-based tests, there must always be awareness of the choices made and their effects.

I second Koen's vehement agreement that people must be more aware that Voynichese =/= EVA, and I also disagree (probably more vehemently than he does) with Stolfi's dismissal of letter-based statistics. While there is room for disagreement on questions like how many characters ligatured gallows represent and in what order (compare Bennett & Currier's transcription schemes), it is fundamentally unproductive to treat the text as nothing more than bags of strokes separated by spaces.

It also (IMNSHO) constitutes an unreasonable level of agnosticism regarding how ink strokes on the vellum group together into morphologically equivalent glyph forms. I think it's fair to say that few people question (for instance) whether EVA 'a', 'o', 'l', 'd', and 'y' are, in fact, discrete, identifiable glyph forms in the text. To be fair, we have seen people suggest that it matters whether an 'a' is closed at the top or not, and there are at least three distinct variants of 'l' (often on the same page) involving how the loop is closed at the top. Anyone who believes those variations matter is welcome and encouraged to transcribe a few pages and see how making those assumptions changes various letter-level stats...

The solution isn't to throw one's hands up in despair and pretend that glyph-level statistics are useless; it's to calculate the statistics over different options (how ligatured gallows are handled, how word-final i*<x> combinations are handled, etc.) and see how sensitive the values of the statistics are to those choices.
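A sketch of the kind of sensitivity check suggested here, in Python. The merge rules and the file name are illustrative assumptions, not a recommended parsing of EVA; the point is only that the same statistic is recomputed under each transliteration option and the spread inspected.

Code:
from collections import Counter
from math import log2

def h2(text):
    """Conditional bigram entropy H(X2 | X1) = H(bigrams) - H(unigrams), in bits."""
    def H(counts):
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return H(bigrams) - H(unigrams)

# Illustrative parsing options: each maps certain EVA digraphs to single symbols.
OPTIONS = {
    "raw EVA":             [],
    "merge benched ch/sh": [("ch", "C"), ("sh", "S")],
    "also merge qo":       [("ch", "C"), ("sh", "S"), ("qo", "Q")],
}

def apply_option(text, rules):
    for old, new in rules:
        text = text.replace(old, new)
    return text

eva_text = open("voynich_eva.txt").read()   # placeholder transliteration file
for name, rules in OPTIONS.items():
    print(f"{name:22s} h2 = {h2(apply_option(eva_text, rules)):.3f} bits/char")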
When vernacular languages adopted the Latin alphabet, they adapted it to their own sounds by:
- dropping a letter
- using one letter for two or more sounds
- using two or more letters for one sound
- slightly re-designing the Latin letterform
- inventing new letters
- using diacritical marks

This means that even if an exact transliteration of the VM were possible (which remains ambiguous due to minims and other glyphs), the transcription alphabet could only be used to transform the text into Latin letters. Some words in different languages could be found in such a transcription, but to convert the text into a particular language, assuming it can be recognized, the above steps must be considered.
There is no such thing as a simple substitution, VM glyph to Latin letter, except when the Latin letter stands for only one sound, and that letter is also pronounced as only one sound in the proposed language.
On top of that, contemporary VM research is based on the ZL transliteration and compared to the way contemporary Wikipedia articles are written, which use a different alphabet than was in use in the 15th century. For instance: Latin C = k, c, č, h, z; Latin tsch = č = ch = zh. Besides this medieval confusion, there are still differences between phonetic and book writing, sound changes, regional mix-ups and mistakes.
No ChatGPT can handle that. When I asked it whether it recognized a few lines from the ZL transliteration of the VM text, it told me they were from the VM. When I asked the same question and submitted a one-to-one substitution using the Slovenian VM alphabet, the answer was that it had some Slavic flavour.
For that same text, with ch and sh changed to č and š, the language was recognized as Slavic and quite a few of the words were correctly recognized, but many of them were still false guesses.
Asking ChatGPT to translate is useless.
(27-06-2025, 10:34 AM)quimqu Wrote: I measured n-gram entropy per word

I am following the discussion with interest... and concern about the numerical values calculated for high-order conditional entropies (if I understand your methods correctly).

Consider an infinitely long Cuva text sample with its characters randomly scrambled.  The first-order entropy remains 3.85 bits, as calculated from the character frequencies.  Because all correlations between characters have been removed, all conditional entropies of higher order are also 3.85 bits.  The true "Entropy vs n-gram order" curve in this case is a flat line at 3.85 bits, out to n=14 and beyond.

Already at n=4, however, the value calculated from a character-scrambled version of the complete but finite-length Cuva text is only 3.39 bits, and by n=14 it will be approaching zero.  So in this case at least, tailing off of an "Entropy vs n-gram order" plot would be a numerical artifact. As mentioned elsewhere in these pages, the geometric increase in possible n-mers requires geometrically longer text samples in order to obtain quantitatively correct values.  Shorter text samples will tail off sooner.  And a correct monotonic decrease, acting in concert with a steeper numerical decrease, can potentially create a qualitative bump in the curve.
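A sketch of this check in Python (the file name is a placeholder for any transliteration): estimate the order-n conditional entropy from the text and from a character-scrambled copy of it. In theory the scrambled curve is flat at the unigram entropy, so any decline it shows at high n is purely a finite-sample artifact.

Code:
import random
from collections import Counter
from math import log2

def H(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def cond_entropy(text, n):
    """Estimate H(X_n | X_1..X_{n-1}) as H(n-grams) - H((n-1)-grams), in bits."""
    if n == 1:
        return H(Counter(text))
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return H(ngrams) - H(prefixes)

text = open("cuva.txt").read()                       # placeholder transliteration
scrambled = ''.join(random.sample(text, len(text)))  # destroys all correlations

for n in range(1, 15):
    print(n, round(cond_entropy(text, n), 2), round(cond_entropy(scrambled, n), 2))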

It is tempting to assign "entropy" to a single text sample (which I might call a "micro-message").  But the sequence of characters is known, so all of the probabilities are unity;  insofar as it can be said to have an entropy, the value is exactly trivially zero.  The meaningful entropy that we discuss here should be attributed to the text generator ("macro-messenger"), as a logarithmic measure of how many different micro-messages it is capable of producing.  Approximate values at various orders can be calculated from one micro-message, but the accuracy of approximation depends on its length.  I expect that there are formulas in the literature that relate n-mer order, sample size, and systematic error.  The only way to evaluate high-order conditional entropies of the VMS author is to discover more text that it produced.

I look forward to new experiments!
(27-06-2025, 09:43 PM)obelus Wrote: It is tempting to assign "entropy" to a single text sample (which I might call a "micro-message").  But the sequence of characters is known, so all of the probabilities are unity;  insofar as it can be said to have an entropy, the value is exactly trivially zero.  The meaningful entropy that we discuss here should be attributed to the text generator ("macro-messenger"), as a logarithmic measure of how many different micro-messages it is capable of producing.  An approximate value can be calculated from one micro-message, but the accuracy of approximation depends on its length.  I expect that there are formulas in the literature that relate n-mer order, sample size, and systematic error.  The only way to evaluate high-order conditional entropies of the VMS author is to discover more text that it produced.

Thanks for your interesting reply, and yes, you're absolutely right about the challenges in interpreting high-order n-gram entropies with limited text. I just wanted to clarify that in my analysis I was actually working with perplexity rather than entropy directly. That's the metric I'm more used to, especially in the context of language modeling, and it's more intuitive: it roughly reflects how "surprised" a model is when trying to predict the next character. As you know, perplexity is just 2 raised to the entropy (when entropy is measured in bits, i.e. log base 2), so the trends are equivalent, just represented differently.
I completely agree that the drop in entropy (or perplexity) at high n values is partly a numerical artifact, especially because long repeated sequences are rare. This naturally leads to lower perplexity, simply because the model starts memorizing the few long words it has seen before — essentially overfitting.
That said, what caught my attention — and what motivated this line of inquiry — was something peculiar I observed when comparing Voynich text to natural language: in a perplexity-vs-n plot, the Voynich text showed a distinct “bump” or bulge at mid-range n, whereas real language tends to show a much smoother, almost linear decline. Look at the graphic:
[Image: CYM3Nr6.png]
This anomaly made me wonder: could certain types of ciphers produce similar bumps when applied to real language? That became the basis of the experiments I'm currently running. Surprisingly, most ciphers (like simple substitutions or transpositions) still result in a smooth drop in perplexity. Only one type, what I initially called a homophonic cipher, though it's more accurately a multi-character substitution cipher, generated a bump similar to the one I see in the Voynich text.
So, I’m not claiming this is conclusive by any means, but the similarity in the perplexity curve between the Voynich and this kind of ciphered text definitely raised some interesting questions.
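For readers more used to entropy: the conversion is perplexity = 2^H when the conditional entropy H is in bits, so a perplexity-vs-n curve can be read straight off the entropy estimates. A self-contained sketch (the file name is a placeholder, not the thread's actual data):

Code:
from collections import Counter
from math import log2

def cond_entropy(text, n):
    """H(X_n | X_1..X_{n-1}) estimated as H(n-grams) - H((n-1)-grams), in bits."""
    def H(counts):
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())
    if n == 1:
        return H(Counter(text))
    return (H(Counter(text[i:i + n] for i in range(len(text) - n + 1)))
            - H(Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))))

def perplexity_curve(text, max_n=10):
    """Perplexity at order n is 2 ** H, with H the conditional entropy in bits."""
    return [2 ** cond_entropy(text, n) for n in range(1, max_n + 1)]

text = open("voynich_eva.txt").read()   # placeholder transliteration file
for n, ppl in enumerate(perplexity_curve(text), start=1):
    print(f"n={n:2d}  perplexity={ppl:7.2f}")
# A mid-range "bump" shows up as a local flattening or rise before the
# (partly artifactual) decline at high n.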
(27-06-2025, 01:15 PM)Koen G Wrote:
(27-06-2025, 12:18 PM)Mauro Wrote: But I also agree with @Jorge_Stolfi here: n-gram entropies do not necessarily mean much, because they are heavily dependent on the transcription. Word entropies are more reliable (they only depend on a 'space' actually being a space, and on differently written words actually being different words).
I don't agree with Stolfi's dismissal of letter-based statistics. We don't know if Voynichese is a phonetic representation at all. Whatever it is, is very much dependent on its writing system. So to get a clue about what's going on, we must compare different ways of transliterating Voynichese to different writing systems. Including codes and ciphers, which are essentially also writing systems.

I don't think letter-based statistics should be dismissed altogether, only that character-based entropies are less reliable than word-based entropies (I'm not sure @Jorge_Stolfi thinks exactly this either; that is how I interpreted him).

I.e.: the bigram entropy (of the Voynichese.com transcription in this case; I don't have the data at hand for RF1a-n) is 6.02 for pure EVA; replacing 'ch' = 'C', 'sh' = 'S', 'qo' = 'Q', and the gallows 'cth', 'ckh', etc. with T, K, P, F raises the entropy to 6.19 [it's a perfectly reversible transformation]; replacing also 'iii' = 'J', 'ii' = 'I', 'eee' = 'U' and 'ee' = 'E' raises the entropy to 6.38; and so on (*). Meanwhile, in all these cases the 1-word entropy remains exactly constant at 10.454 (**) (and will remain constant under any possible substitution of any EVA string, provided the transformation is reversible).
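A sketch of the invariance claim at the end of that paragraph (file name and space-separated tokenization are assumptions): the 1-word entropy depends only on the word-frequency distribution, and any reversible character-level respelling maps word types one-to-one, so the value cannot move even though the bigram entropy does.

Code:
from collections import Counter
from math import log2

def entropy(items):
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

eva = open("voynich_eva.txt").read()    # placeholder; words assumed space-separated

# A reversible character-level respelling (here: ch -> C, sh -> S, qo -> Q).
respelled = eva.replace("ch", "C").replace("sh", "S").replace("qo", "Q")

print(entropy(bigrams(eva)), entropy(bigrams(respelled)))   # bigram entropy changes

# The 1-word entropy does not change: each word type maps to exactly one
# respelled word type, so the word-frequency distribution is identical.
print(entropy(eva.split()), entropy(respelled.split()))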


(27-06-2025, 01:15 PM)Koen G Wrote: What I do vehemently agree with is that people must be more aware that Voynichese =/= EVA. When running letter-based tests, there must always be awareness of the choices made and their effects.
I vehemently agree too!!!!!


(*) In the most extreme case, replacing each word type with a different symbol (a rather weird, but perfectly legit 'transcription') will bring the 1-gram entropy to the same value as the 1-word entropy (10.454).

(**) which, by the way, I just noticed is rather high. The Chanson de Roland gets 9.686, the complete works of Shakespeare get 9.974, Boccaccio's Decameron gets 10.006, and Caesar's De Bello Gallico has the highest 1-word entropy (as expected for a language like Latin) at 11.211. Classical Greek (Aeschylus, Prometheus Bound) gets 10.457, but with the caveat that I removed all the diacritics (the entropy should be higher with the diacritics, that is to say the accents, in place).