The Voynich Ninja
Why the Voynich Manuscript Might Not Be a Real Language - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Why the Voynich Manuscript Might Not Be a Real Language (/thread-4763.html)



RE: Why the Voynich Manuscript Might Not Be a Real Language - quimqu - 25-06-2025

(25-06-2025, 12:24 AM)Jorge_Stolfi Wrote: But please, please, please people: the character entropy is a property of the encoding. Not of the language, not of the text. IIff yyoouu  jjuusstt  wwrriittee  eevveerryy  lleetttteerr  ttwwiiccee, the average per-character entropy will be halved. iF yOU raNdOmLY cAPitAliZe eACh lEtTer, you will add 1 to the average per-character entropy. If you insert two random capital letters after each letter, lMTiGUkSHeWP tNNhLOiZQsDB, you will add at least 3 to the per-character entropy.

You're absolutely right — and thank you for bringing this up.

Yes, perplexity is a function of next-character entropy, and that entropy is deeply tied to how a language is encoded as a symbol stream. The examples you give (doubling characters, random capitalization) show how drastically entropy can shift depending on the structure of the encoding — not necessarily the language itself.

To show this clearly, I ran a character-level perplexity analysis on a Latin philosophical text (De Docta Ignorantia) and then created two variants:
  1. Doubled every character (ddee  ddooccttaa...)
  2. Randomly capitalized letters (dE DocTA iGNorANTiA)

Here are the results (perplexity across n-grams from 1 to 9):

[Image: wtassLg.png]

So while random capitalization increased the perplexity dramatically, doubling the characters actually reduced it — especially for higher-order n-grams. This is because repetition makes the sequence far more predictable: once you see a d, you know the next character is another d, and so on.
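For anyone who wants to reproduce the idea, here is a minimal sketch of such an experiment (assuming maximum-likelihood n-gram estimates computed on the text itself, with no smoothing; the filename is a placeholder, and the exact pipeline behind the graph may differ):

Code:
import math
import random
from collections import Counter

def char_ngram_perplexity(text, n):
    """Per-character perplexity of an order-n model, P(c_i | preceding n-1 chars),
    using maximum-likelihood counts estimated on the text itself (no smoothing)."""
    ngram_counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    context_counts = Counter(text[i:i + n - 1] for i in range(len(text) - n + 1))
    log_prob, total = 0.0, 0
    for gram, count in ngram_counts.items():
        p = count / context_counts[gram[:-1]]  # P(last char | context)
        log_prob += count * math.log2(p)
        total += count
    return 2.0 ** (-log_prob / total)

text = open("de_docta_ignorantia.txt", encoding="utf-8").read().lower()  # placeholder file
doubled = "".join(ch * 2 for ch in text)                                 # ddee ddooccttaa ...
random.seed(0)
rand_caps = "".join(ch.upper() if random.random() < 0.5 else ch for ch in text)

for n in range(1, 10):
    print(n, char_ngram_perplexity(text, n),
          char_ngram_perplexity(doubled, n),
          char_ngram_perplexity(rand_caps, n))

On a setup like this, every second character of the doubled stream is fully determined by the one before it, which is exactly why its measured perplexity collapses for higher-order n-grams.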

This highlights a subtle but important distinction:
  • Entropy and perplexity depend on how information is distributed in the symbol stream, not just the richness or structure of the language itself.
  • A more "redundant" encoding lowers perplexity, because it’s easier to guess what’s coming next — even if the underlying content hasn't changed at all.

So when we see low perplexity in something like Voynichese, it may reflect a repetitive or inefficient encoding system, rather than an inherently simple or transparent language. Given that Voynichese was created centuries before modern linguistic or cryptographic theory, this is entirely plausible.

Thanks again for the insightful comment — it's an important reminder not to overinterpret statistical measures without considering encoding effects.


RE: Why the Voynich Manuscript Might Not Be a Real Language - quimqu - 25-06-2025

(25-06-2025, 02:37 AM)ReneZ Wrote:
(24-06-2025, 09:57 PM)quimqu Wrote: The Voynich Manuscript, transcribed using the EVA alphabet

Others have already pointed this out, but it feels worth repeating: this explains a significant part of the behaviour of the graph. 

Use modified transliteration alphabets and see what happens. 
At least convert Ch, Sh, in, and iin to single characters.

Also a question: are space characters treated just like normal text characters?

Your result is quite different from my results here, where Voynichese becomes less predictable after 3 characters, but that is probably due to the way in which I treated space characters, namely as separators.

Thanks — that’s a great point, and I actually followed your suggestion to test with a modified transliteration system.

I compared the EVA and CUVA transliterations (following the 'rules file' conversion provided on your site). This gives a more compact alphabet. (By the way, note that for both EVA and CUVA I assigned each extended-character glyph a distinct Unicode codepoint.)
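To make the conversion concrete, here is a small sketch of the kind of glyph merging involved (the merge set and the private-use Unicode targets below are illustrative placeholders; the actual rules file defines the full mapping):

Code:
import re

# Illustrative subset of multi-character EVA sequences merged into single
# codepoints. Longer sequences must win over their substrings (aiin before in).
MERGES = {"aiin": "\ue001", "iin": "\ue002", "ch": "\ue003",
          "sh": "\ue004", "qo": "\ue005", "in": "\ue006"}
pattern = re.compile("|".join(sorted(MERGES, key=len, reverse=True)))

def merge_glyphs(eva_text: str) -> str:
    return pattern.sub(lambda m: MERGES[m.group(0)], eva_text)

line = "qokeey daiin shedy"
print(len(line), "->", len(merge_glyphs(line)))  # 18 -> 13 units

Each merged unit then carries more information per character, which is one way to see why the shape of the perplexity curve changes.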

[Image: JbuJiHw.png]

Interestingly, CUVA produces higher perplexity beyond n=3, which might seem counterintuitive. But here's what's happening:
  • The CUVA glyphs increase entropy per character, because each unit contains more information.
  • As a result, sequences become less predictable in higher-order n-grams — especially when you're conditioning on long contexts.
  • So even though CUVA seems more compact, it compresses repeated structure into fewer units, reducing the surface-level redundancy that makes Voynichese so predictable under EVA.
This supports the idea that Voynichese's low perplexity (in EVA) reflects encoding regularity rather than linguistic simplicity.

Even in CUVA form, Voynichese is still more predictable than Latin or English — unless you use high-order n-grams, where CUVA becomes far less compressible.

Regarding your question about spaces: yes, in my case spaces are treated as normal characters. I model character-level perplexity over the full stream, including word boundaries. This is crucial, since many of Voynichese’s patterns involve position within words (q-initial, -dy final, etc.), so the model learns transition probabilities across word endings too.

This may explain part of the divergence from your entropy graph, which treats space as a delimiter and resets statistics per word. That method is great for isolating internal word structure, while mine reflects stream-level predictability.
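To make that difference concrete, a minimal sketch of the two preprocessing choices (function names are mine):

Code:
def ngrams_stream(text, n):
    """Space as a normal token: n-grams over the full stream, crossing word breaks."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngrams_per_word(text, n):
    """Space as a delimiter: n-grams collected within each word only."""
    grams = []
    for word in text.split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

sample = "qokeey daiin shedy"
print(ngrams_stream(sample, 3))    # includes boundary grams such as 'y d'
print(ngrams_per_word(sample, 3))  # word-internal grams only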

So in short:
  • Glyph-grouping (like CUVA) affects predictability a lot — especially in high-order n-grams.
  • EVA’s low perplexity seems to stem from highly redundant symbol sequences, not linguistic simplicity.
  • Treating space as a token (vs delimiter) changes the entropy dynamics significantly.
Thanks again for raising this!


RE: Why the Voynich Manuscript Might Not Be a Real Language - quimqu - 25-06-2025

(24-06-2025, 10:17 PM)oshfdk Wrote: If I'm reading this graph correctly, then making a simple change in the transcription by treating ch and Sh as two distinct whole characters and merging a few more sequences into separate single glyphs, like qo, aiin, ain, would shrink the Voynich graph horizontally and bring it well within the normal range, closer to Romeo and Juliet. Is this so?

Hello, yes, I think that my test with the CUVA transliteration has done exactly what you suggested. Grouping some characters into single glyphs makes the graph "shrink" horizontally, but it also gives a higher perplexity. Please check my answer to René.


RE: Why the Voynich Manuscript Might Not Be a Real Language - Koen G - 25-06-2025

I (as someone who doesn't speak statistese) am confused. If perplexity is basically another way to measure entropy, then why is CUVA higher than Romeo and Juliet? Even with EVA's entropy-reducing properties minimized, we should still end up well below English texts. Or French. Or the vast majority of Latin texts...


RE: Why the Voynich Manuscript Might Not Be a Real Language - oshfdk - 25-06-2025

(25-06-2025, 09:02 AM)Koen G Wrote: I (as someone who doesn't speak statistese) am confused. If perplexity is basically another way to measure entropy, then why is CUVA higher than Romeo and Juliet? Even with EVA's entropy-reducing properties minimized, we should still end up well below English texts. Or French. Or the vast majority of Latin texts...

Why so? As far as I know, Voynichese entropy is relatively low only for short n-gram values, like bigrams and trigrams, and gets much closer to normal as the length increases. This is exactly what happens in the original graph at the top of this thread.


RE: Why the Voynich Manuscript Might Not Be a Real Language - ReneZ - 25-06-2025

Thanks for taking up all the suggestions and trying them out!

(25-06-2025, 08:34 AM)quimqu Wrote: Interestingly, CUVA produces higher perplexity beyond n=3, which might seem counterintuitive.

This is also what I saw here, even though there I did a restart at the start of every word.
It basically means that the information in the Voynich MS text is spread more evenly over the characters.
(This is something that one might expect more with numbers than with text).


RE: Why the Voynich Manuscript Might Not Be a Real Language - Bernd - 25-06-2025

(25-06-2025, 09:07 AM)ReneZ Wrote: This is something that one might expect more with numbers than with text
This is what has always baffled me. On the one hand, the structure of 'Voynichese' appears to be more consistent with numbers, Roman numerals, or code-like mathematical operations (bookkeeping?), yet it is expressed in the form of language-like, seemingly readable and pronounceable text. How can such a transformation be accomplished?

Have any experiments to transform (Roman) numerals into language been carried out?
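As a toy illustration of the kind of transformation I have in mind (the syllable table is entirely arbitrary, not a proposal about the MS itself):

Code:
# Map each Roman-numeral symbol to a consonant-vowel syllable, so that
# numeral strings come out as pronounceable, word-like text.
SYLLABLES = {"I": "in", "V": "va", "X": "xe", "L": "lo",
             "C": "cu", "D": "da", "M": "mi"}

def numeral_to_word(numeral: str) -> str:
    return "".join(SYLLABLES[ch] for ch in numeral.upper())

for numeral in ["XVII", "XVIII", "XIX"]:
    print(numeral, "->", numeral_to_word(numeral))
# XVII -> xevainin, XVIII -> xevaininin: nearby numbers give near-identical
# "words", echoing the repetitive structure discussed in this thread.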


RE: Why the Voynich Manuscript Might Not Be a Real Language - ReneZ - 25-06-2025

Even if we don't strictly consider numbers, but rather some more generic 'enumeration' system, we will run into a problem that I see clearly in my mind, but may not be able to explain clearly.

The words okeey and qokeey can appear near each other, but differ only on the extreme left side of the word.

The words qokal and qokar are also similar and differ only on the extreme right. 

Using computer terminology, if this were an enumeration system, it would appear to be neither big-endian nor little-endian, but rather both-endian.


RE: Why the Voynich Manuscript Might Not Be a Real Language - Koen G - 25-06-2025

(25-06-2025, 09:04 AM)oshfdk Wrote: Why so? As far as I know, Voynichese entropy is relatively low only for short n-gram values, like bigrams and trigrams, and gets much closer to normal as the length increases. This is exactly what happens in the original graph at the top of this thread.

Ah, I see. So you have a sliding window of n, and then you see how well you can predict the next n characters? I just can't intuitively grasp what this tells us. It's too abstract: a measure of the information density of the writing system.

When it comes to conditional character entropy (next character), it's easier to explain to a general audience what's going on:

* Given a letter from a random English word, you can't easily predict the next letter. For example, with "s" as the first letter of the pair, the distribution of the second letter is quite even, and you have many options to choose from.
* In Voynichese, the opposite is true: given any glyph, you have fewer options to choose from for the second glyph (because of positional restrictions etc.), and the distribution is skewed towards one or a few most likely options (low conditional character entropy). EVA makes this worse by chopping some glyphs in half, artificially creating very frequent pairs.

When comparing larger n-grams, this intuitive aspect of it is gone: you are just looking at information density as an abstract number.
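For concreteness, a minimal sketch of that "next character" measure, conditional character entropy from maximum-likelihood bigram counts (a simplified illustration, not anyone's exact pipeline):

Code:
import math
from collections import Counter

def conditional_entropy(text):
    """H(next char | current char) in bits, from maximum-likelihood bigram counts."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), count in pair_counts.items():
        p_pair = count / total             # P(a, b)
        p_next = count / first_counts[a]   # P(b | a)
        h -= p_pair * math.log2(p_next)
    return h

# Lower values mean the next character is easier to guess from the current one.
print(conditional_entropy("the quick brown fox jumps over the lazy dog"))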


RE: Why the Voynich Manuscript Might Not Be a Real Language - Mauro - 25-06-2025

(25-06-2025, 10:17 AM)ReneZ Wrote: Even if we don't strictly consider numbers, but rather some more generic 'enumeration' system, we will run into a problem that I see clearly in my mind, but may not be able to explain clearly.

The words okeey and qokeey can appear near each other, but differ only on the extreme left side of the word.

The words qokal and qokar are also similar and differ only on the extreme right. 

Using computer terminology, if this were an enumeration system, it would appear to be neither big-endian nor little-endian, but rather both-endian.

It might be that 'a' words behave differently from 'e' words: two different numbering systems mixed together. But I fear that, while this works well with your example, it will run into the usual problems (too many exceptions and running around in circles) when the text is analyzed in depth.