The Voynich Ninja - How multi-character substitution might explain the Voynich’s strange entropy


Correction
Originally, I described the transformation used as a homophonic cipher, but that label is misleading. What I actually applied was a form of multi-character substitution, where each letter in the original word is replaced by a randomly chosen variant (e.g., a0, a1, a2), simulating a kind of randomized expansion at the character level. This isn't a true homophonic cipher in the historical sense — which typically replaces plaintext characters with multiple possible cipher symbols without increasing the total character count. My version expanded the text significantly and altered its structure.
Despite the naming inaccuracy, the method did reproduce an entropy curve similar to the Voynich CUVA profile, especially in the characteristic “bump” around n=3–6. The results still support the hypothesis that some kind of structured substitution — possibly at the syllable or morph level — could account for the entropy behavior in the Voynich manuscript. However, any conclusions should be interpreted with this clarification in mind.

You can also check another post of mine, which shows the entropy bump by comparing the MS in EVA and in CUVA with natural-language texts.

Maybe by accident, I’ve pulled on a thread worth following — I’ll keep exploring what really generates the bump.

------------------------------------------

In this experiment, I tried to simulate how different historical ciphers affect the entropy profile of a text, and compared the results to the Voynich CUVA (as explained by René Zandbergen). The idea was to test whether the statistical behavior of the Voynich text, especially its distinctive “entropy bump”, could emerge from known cipher types.

Method

I took the Latin text De Docta Ignorantia and applied 10 classical cipher transformations likely known or possible in the 15th century:

Syllabic substitution
Homophonic cipher
Caesar cipher
Grammatical expansion
Transposition cipher
Contextual substitution
Polyalphabetic cipher
Cardano grille
Relative-position encoding

For each version, I measured n-gram entropy per word (resetting after every word) from n=1 to n=14.
I then plotted these values against the Voynich CUVA section.
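
The post does not spell out how the per-word n-gram entropy was computed. One plausible reading, sketched here as an assumption and not as the author's actual code, is the Shannon entropy (in bits) of the distribution of character n-grams collected inside each word, so that no n-gram crosses a word boundary. The function names are hypothetical.

```python
import math
from collections import Counter

def word_ngrams(words, n):
    """Yield every character n-gram that lies entirely inside one word."""
    for word in words:
        for i in range(len(word) - n + 1):
            yield word[i:i + n]

def ngram_entropy(words, n):
    """Shannon entropy (bits) of the within-word n-gram distribution.

    Assumption: this is only one reading of "n-gram entropy per word".
    Returns 0.0 when no word is long enough to contain an n-gram.
    """
    counts = Counter(word_ngrams(words, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(words, max_n=14):
    """Entropy for n = 1..max_n, matching the range used in the post."""
    return [ngram_entropy(words, n) for n in range(1, max_n + 1)]
```

Under this reading, the value for large n depends heavily on how many words are long enough to contain an n-gram at all, so any transformation that lengthens words will shift the whole curve to the right.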

[Image: uLOSZCq.png]

This graph shows that most cipher types produce entropy curves that drop steeply after n=3–5, while the Voynich text declines gradually and smoothly. This is already unusual.
But there's one exception...

Homophonic cipher anomaly

Only the homophonic cipher (3+ variants tested) produces an entropy “bump” that matches the Voynich profile. Specifically, when using a homophonic cipher with 3 or 4 characters per symbol, the entropy curve is smoother and shows a slow decay, similar to the CUVA data.

This raises two hypotheses:

A system with homophonic encoding of syllables or morphs could recreate a Voynich-like structure.
The smoothness of the curve may suggest internal rules or language constraints, not just random substitution.

Notice how the 3- and 4-character homophonic ciphers almost replicate the Voynich curve — both in shape and range. The 2-character version decays a bit faster but still mimics the bump.

Natural text vs. Voynich

To test if this was just a quirk of De Docta Ignorantia, I took four different natural texts (Latin, French, English):

Ambrosius Mediolanensis In Psalmum David CXVIII Expositio (Latin)
La reine Margot (French)
Romeo and Juliet (English)
De Docta Ignorantia again

Each was encrypted with a 3-character homophonic cipher and compared to Voynich CUVA.

[Image: kSTbMuI.png]

Interestingly, when using a 3-character homophonic cipher on natural texts (Latin, French, English), the entropy curves become much smoother and more sustained. For several of them, the n-gram entropy remains high up to n=6–7, and only drops significantly past n=8 or n=9.
The curve shapes are now visibly closer to Voynich CUVA, with the most similar being De Docta Ignorantia and Romeo and Juliet. However, the Voynich text still has:

A slightly smoother and more consistent decay, without sudden drops
A more gradual “tail” beyond n=9, where the others have still not flattened or zeroed out (except Romeo and Juliet)

This supports the idea that some homophonic structure — perhaps morph- or syllable-based — could explain the entropy shape. But it also reinforces the notion that Voynich words follow a more regulated internal logic, possibly due to morphological templates or position-based constraints.

Interpretation

There are two key features that stand out:

The “Voynich bump” (sustained entropy around n=3–6) is only replicated by homophonic substitution.
The smoothness of the curve in CUVA suggests an underlying linguistic system — natural or artificially constructed — rather than arbitrary encoding.

This doesn’t prove the Voynich uses a homophonic cipher, but it does suggest that such systems can generate statistically similar profiles, especially when applied at the syllable or morph level.
It may also support theories that posit an artificial language, a constructed morphology, or template-driven word generation, all of which maintain internal consistency over longer n-grams.

(27-06-2025, 10:34 AM)quimqu Wrote: The curve shapes are now visibly closer to Voynich CUVA, with the most similar being De Docta Ignorantia and Romeo and Juliet. However, the Voynich text still has:
...

A more gradual “tail” beyond n=9, where others flatten or zero out

The graph seems to show exactly the opposite: the Voynich curve is the first to flatten and zero out, closely followed by Romeo and Juliet, with the rest having a much more gradual tail.

I have to ask this: is any part of your analysis AI-generated?

I'm surprised that the curves in the last diagram are so close especially up to 3-grams, because your homophonic ciphertext must have a lot more symbols than CUVA and the (random?) choice of homophone must add a lot of entropy... how is it possible?

(27-06-2025, 10:56 AM)oshfdk Wrote:
(27-06-2025, 10:34 AM)quimqu Wrote: The curve shapes are now visibly closer to Voynich CUVA, with the most similar being De Docta Ignorantia and Romeo and Juliet. However, the Voynich text still has:
...

A more gradual “tail” beyond n=9, where others flatten or zero out

The graph seems to show exactly the opposite: the Voynich curve is the first to flatten and zero out, closely followed by Romeo and Juliet, with the rest having a much more gradual tail.

I have to ask this: is any part of your analysis AI-generated?

No, I've been struggling with Kaggle and the data for two nights. It's a matter of my English and of trying to make my thoughts understandable. It should say "where the others have still not flattened". I have corrected it.

(27-06-2025, 10:57 AM)nablator Wrote: I'm surprised that the curves in the last diagram are so close especially up to 3-grams, because your homophonic ciphertext must have a lot more symbols than CUVA and the (random?) choice of homophone must add a lot of entropy... how is it possible?

I attach the function to cipher:

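import random

def homophonic_cipher_3c(words):
    # Each letter gets three variants: the letter followed by 0, 1 or 2
    letter_map = {c: [c + str(i) for i in range(3)] for c in 'abcdefghijklmnopqrstuvwxyz'}
    out = []
    for word in words:
        # Pick one variant at random per letter; other characters pass through unchanged
        new_word = ''.join(random.choice(letter_map.get(c, [c])) for c in word)
        out.append(new_word)
    return out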

It's a simple homophonic cipher. It first creates a letter map for each alphabetical character, so each letter has three different variants to choose from at random during encryption.

In the cipher, I use alphabetical characters. Imagine that I used Voynichese characters instead: I would use only the CUVA symbols. But beware, deciphering is tricky, as one CUVA symbol might also decipher into 3 natural-language characters. So there must be another hidden rule for deciphering.

Interesting work!

A homophonic substitution cipher with positional restraints is basically the same thing as positional allography, right? So we would be looking for one glyph that's the equivalent of another glyph, and their positions tend to be mutually exclusive.

I think this is an underexplored avenue that might solve a lot of our problems. But it creates new problems also. Let's say in practice, a medieval text may use something like 22 letters frequently (we can do without w, u/v, j...). To enable homophonic substitution, the number of frequently used characters needs to increase. So you would end up with something like Arabic with 40+ letterforms.

However, if the 20 something frequent glyphs of CUVA are our expanded set, then that means that solving the homophonic part of the cipher would mean a drastic decrease of the number of letters. To accommodate for that, you would need a cipher that's polyphonic and homophonic at the same time. (Which might not even be as impractical as it sounds).
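
To make the alphabet-size argument concrete, here is a minimal sketch of a homophonic cipher in the historical sense: every plaintext letter gets k single-symbol homophones, so the ciphertext keeps the plaintext's length while the symbol inventory grows k-fold. The helper names, the toy alphabet and the symbol pool are all assumptions chosen for illustration, not anything proposed in the thread.

```python
import random

def make_homophone_table(alphabet, k, symbols):
    """Assign k distinct single-symbol homophones to each plaintext letter.

    Hypothetical helper: `symbols` must supply at least len(alphabet) * k
    unique symbols, which is exactly the inventory growth discussed above.
    """
    needed = len(alphabet) * k
    pool = list(dict.fromkeys(symbols))  # de-duplicate, keep order
    if len(pool) < needed:
        raise ValueError(f"need {needed} unique symbols, got {len(pool)}")
    return {ch: pool[i * k:(i + 1) * k] for i, ch in enumerate(alphabet)}

def encrypt(text, table, rng=random):
    """Replace each letter by one of its homophones; other characters pass through."""
    return ''.join(rng.choice(table[ch]) if ch in table else ch for ch in text)

def decrypt(ciphertext, table):
    """Decryption is unambiguous because no symbol is shared between letters."""
    inverse = {s: ch for ch, homophones in table.items() for s in homophones}
    return ''.join(inverse.get(s, s) for s in ciphertext)

# Toy example: 3 letters with 2 homophones each need 6 cipher symbols.
table = make_homophone_table("abc", 2, "ABCDEF")
ciphertext = encrypt("abcabc", table)
```

With 22 plaintext letters and two homophones each, the same construction would need 44 cipher symbols, which is the "40+ letterforms" situation described above.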

(27-06-2025, 11:08 AM)quimqu Wrote: I attach the function to cipher:

def homophonic_cipher_3c(words):
    letter_map = {c: [c + str(i) for i in range(3)] for c in 'abcdefghijklmnopqrstuvwxyz'}
    out = []
    for word in words:
        new_word = ''.join(random.choice(letter_map.get(c, [c])) for c in word)
        out.append(new_word)
    return out

It's a simple homophonic cipher. It first creates a letter map for each alphabetical character, so each letter has three different variants to choose from at random during encryption.

In the cipher, I use alphabetical characters. Imagine that I used Voynichese characters instead: I would use only the CUVA symbols. But beware, deciphering is tricky, as one CUVA symbol might also decipher into 3 natural-language characters. So there must be another hidden rule for deciphering.

I'm not sure this works as a homophonic cipher; it looks like it just appends a random digit from (0, 1, 2) after each character. Since you perform a string join, you will convert:

"manuscript" to something like "m0a2n0u1s0c1r0i1p2t0"

If you do simple character ngrams later, this can create all kinds of weird results.
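
A small sketch of why character n-grams behave oddly on this output: the expanded word is twice as long, and a character-level n-gram window can start on a digit and straddle two letter+digit symbols. `expand_word` below re-implements the thread's transformation for a single word; `char_ngrams` and `pair_tokens` are hypothetical helpers, and `pair_tokens` assumes the input contained only lowercase a-z.

```python
import random

def expand_word(word, k=3, rng=random):
    """Append a random digit in range(k) to each lowercase a-z letter."""
    return ''.join(
        c + str(rng.randrange(k)) if 'a' <= c <= 'z' else c
        for c in word
    )

def char_ngrams(text, n):
    """All character-level n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def pair_tokens(expanded):
    """Split an expanded word into letter+digit symbols.

    Assumes every plaintext character was a lowercase a-z letter,
    so each symbol is exactly two characters long.
    """
    return [expanded[i:i + 2] for i in range(0, len(expanded), 2)]

expanded = expand_word("manuscript")
print(len(expanded))             # twice the plaintext length
print(char_ngrams("m0a2", 2))    # includes "0a", which straddles two symbols
print(pair_tokens("m0a2n0"))     # symbol-level view of the same text
```

Counting n-grams over `pair_tokens` output instead of raw characters would measure the symbol sequence; whether that changes the bump is not something the thread reports.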

(27-06-2025, 11:08 AM)quimqu Wrote: I attach the function to cipher:

def homophonic_cipher_3c(words):
    letter_map = {c: [c + str(i) for i in range(3)] for c in 'abcdefghijklmnopqrstuvwxyz'}
    out = []
    for word in words:
        new_word = ''.join(random.choice(letter_map.get(c, [c])) for c in word)
        out.append(new_word)
    return out

It's a simple homophonic cipher. It first creates a letter map for each alphabetical character, so each letter has three different variants to choose from at random during encryption.

Thanks for the explanation. So the letter "l" is ciphered either as "l0", "l1" or "l2", only the digit is chosen randomly. This is not a homophonic cipher: the digits are all nulls.

Code:
import random

def homophonic_cipher_3c(words):
    letter_map = {c: [c + str(i) for i in range(3)] for c in 'abcdefghijklmnopqrstuvwxyz'}
    out = []
    for word in words:
        new_word = ''.join(random.choice(letter_map.get(c, [c])) for c in word)
        out.append(new_word)
    return out

pt = input('Enter plaintext: ')
print(f'Ciphertext of plaintext "{pt}" is {homophonic_cipher_3c(pt)}')

Ciphertext of plaintext "tell me more" is ['t0', 'e2', 'l2', 'l0', ' ', 'm0', 'e2', ' ', 'm1', 'o1', 'r2', 'e1']
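
The "digits are nulls" point can be checked directly: deleting every digit from the output above gives back the plaintext, with no key or table needed. This is a minimal sketch; `strip_digit_nulls` is a hypothetical helper and assumes the plaintext itself contains no digits.

```python
import re

def strip_digit_nulls(ciphertext):
    """Remove every ASCII digit; assumes the plaintext had no digits of its own."""
    return re.sub(r"[0-9]", "", ciphertext)

# Output quoted in the thread for the plaintext "tell me more"
thread_output = ['t0', 'e2', 'l2', 'l0', ' ', 'm0', 'e2', ' ', 'm1', 'o1', 'r2', 'e1']
recovered = strip_digit_nulls(''.join(thread_output))
print(recovered)  # tell me more
```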

(27-06-2025, 11:22 AM)Koen G Wrote: Interesting work!

A homophonic substitution cipher with positional restraints is basically the same thing as positional allography, right? So we would be looking for one glyph that's the equivalent of another glyph, and their positions tend to be mutually exclusive.

I think this is an underexplored avenue that might solve a lot of our problems. But it creates new problems also. Let's say in practice, a medieval text may use something like 22 letters frequently (we can do without w, u/v, j...). To enable homophonic substitution, the number of frequently used characters needs to increase. So you would end up with something like Arabic with 40+ letterforms.

However, if the 20 something frequent glyphs of CUVA are our expanded set, then that means that solving the homophonic part of the cipher would mean a drastic decrease of the number of letters. To accommodate for that, you would need a cipher that's polyphonic and homophonic at the same time. (Which might not even be as impractical as it sounds).

Not really... I tried to explain that in my last answer. We can use the same number of characters. For example:

a can be ciphered by (T,O,P)
b can be ciphered by (U,P,W)
c can be ciphered by (T,M,Z)
...

As you can see, each Voynich character (represented by the CAPS) can stand for up to 3 natural-language characters. But this is tricky to decipher. There must be another hidden rule that determines which of the three characters to use, and this rule may not be very obvious. I have already done some experiments based on the order in which characters appear. For example, in the word "appear", I could replace the first "a" with char #1, the second "a" with char #2, and so on. But this makes the cipher easier to understand, and then the entropy bump disappears.

So yes, in my opinion there is a hidden rule. And we don't even know whether there are 2, 3 or 4 Voynich characters per plaintext letter.
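
The decoding difficulty described above can be illustrated with the three table rows given in the post. Because cipher symbols are shared between letters (T stands for both a and c, P for both a and b), a ciphertext generally has several candidate plaintexts. This is a sketch only: the rest of the table is elided in the post and is not filled in here, and the function names are hypothetical.

```python
from itertools import product

# Only the three rows given in the post; the remaining rows are elided there.
CIPHER_TABLE = {
    'a': ('T', 'O', 'P'),
    'b': ('U', 'P', 'W'),
    'c': ('T', 'M', 'Z'),
}

def invert(table):
    """Map each cipher symbol to the set of plaintext letters it may stand for."""
    inverse = {}
    for plain, symbols in table.items():
        for s in symbols:
            inverse.setdefault(s, set()).add(plain)
    return inverse

def candidate_plaintexts(ciphertext, table):
    """Enumerate every plaintext consistent with the ciphertext.

    Raises KeyError for a symbol that is not in the (partial) table.
    """
    inverse = invert(table)
    options = [sorted(inverse[s]) for s in ciphertext]
    return [''.join(p) for p in product(*options)]

print(candidate_plaintexts("TP", CIPHER_TABLE))   # ['aa', 'ab', 'ca', 'cb']
print(candidate_plaintexts("OUM", CIPHER_TABLE))  # ['abc']
```

The number of candidates multiplies with each shared symbol, which is why some additional rule, whether hidden or contextual, would be needed for a reader to recover a unique plaintext.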

So that's more like a polyalphabetic cipher with frequent shifting between alphabets. Sounds like a headache!
