The Voynich Ninja
Why the Voynich Manuscript Might Not Be a Real Language - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Why the Voynich Manuscript Might Not Be a Real Language (/thread-4763.html)

Pages: 1 2 3


Why the Voynich Manuscript Might Not Be a Real Language - quimqu - 24-06-2025

In this post, I’ll walk you through a machine learning approach I used to analyze the Voynich Manuscript using character-level n-gram language models, a simple but powerful way to measure how predictable a text is. My goal was not to decode the Voynich, but to compare its statistical structure to that of other known texts — including literary works, religious treatises, and artificially encrypted versions — to see if it behaves like a natural language, a cipher, or something entirely different.

What Are Character-Level N-grams and Perplexity?

Before diving into the results, let’s quickly explain two key concepts:
  • Character-level n-grams: These are sequences of n consecutive characters. For example, in the word "language", the 3-grams (trigrams) are
    lan
    ang
    ngu
    gua
    uag
    age
  • An n-gram model learns the likelihood of seeing a particular character given the previous n-1 characters.
  • Perplexity: This is a measure of how well a model predicts a sequence. Low perplexity means the model can easily predict the next character — the text is “regular” or “learnable.” High perplexity means the text is less predictable, like a noisy or complex system. It’s often used to evaluate how well a language model fits a dataset. (A small code sketch illustrating both concepts follows this list.)
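
To make both ideas concrete, here is a minimal sketch in Python (not the exact code behind the results below; the add-one smoothing and the toy strings are only for illustration):

Code:
import math
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_ngram_model(text, n):
    """Count n-grams and their (n-1)-character histories on a training text."""
    ngram_counts = Counter(char_ngrams(text, n))
    history_counts = Counter(char_ngrams(text, n - 1)) if n > 1 else None
    vocab = set(text)
    return ngram_counts, history_counts, vocab

def perplexity(model, text, n):
    """Per-character perplexity with add-one (Laplace) smoothing."""
    ngram_counts, history_counts, vocab = model
    log_prob, count = 0.0, 0
    for gram in char_ngrams(text, n):
        numerator = ngram_counts.get(gram, 0) + 1
        if n > 1:
            denominator = history_counts.get(gram[:-1], 0) + len(vocab)
        else:
            denominator = sum(ngram_counts.values()) + len(vocab)
        log_prob += math.log2(numerator / denominator)
        count += 1
    return 2 ** (-log_prob / count)

# The trigrams of "language", as in the example above
print(char_ngrams("language", 3))   # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']

# Toy train/validation pair: low perplexity because the text is very repetitive
model = train_ngram_model("some training text " * 50, 3)
print(perplexity(model, "some held-out text", 3))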

The Experiment

I trained simple n-gram models (from 1-gram to 9-gram) on the following types of texts:
  • Classical literature (e.g., Romeo and Juliet, La Reine Margot)
  • Religious and philosophical texts (e.g., Ambrosius Mediolanensis, De Docta Ignorantia), with dates of creation similar to that of the manuscript
  • Ciphered texts using a Trithemius-style letter substitution
  • The Voynich Manuscript, transcribed using the EVA alphabet

For each text, I split it into a training and validation set, trained n-gram models by character, and computed the perplexity at each n-gram size. I plotted these to visualize the predictability curves.
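
For reference, a hedged sketch of that loop (the file names, the 80/20 split and the plotting details are placeholders rather than my exact setup, and it reuses the char_ngrams / train_ngram_model / perplexity helpers from the sketch above):

Code:
import matplotlib.pyplot as plt

# Placeholder corpora: one plain-text file per text; the names are illustrative only.
corpora = {
    "Voynich (EVA)": "voynich_eva.txt",
    "De Docta Ignorantia": "de_docta_ignorantia.txt",
    "Romeo and Juliet": "romeo_and_juliet.txt",
}

for label, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        text = f.read()
    split = int(len(text) * 0.8)               # 80/20 train/validation split
    train, valid = text[:split], text[split:]
    sizes = range(1, 10)                       # 1-gram to 9-gram
    curve = [perplexity(train_ngram_model(train, n), valid, n) for n in sizes]
    plt.plot(list(sizes), curve, marker="o", label=label)

plt.xlabel("n-gram size")
plt.ylabel("per-character perplexity")
plt.legend()
plt.show()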

[Image: Y93Ys1l.png]
What Did I Find?

The results were surprising:
  1. The Voynich Manuscript exhibits surprisingly low perplexity for high n-grams (n=7 to n=9) — much lower than expected for a truly random or strongly encrypted text.
  2. Its perplexity curve closely resembles that of religious or philosophical medieval texts, such as De Docta Ignorantia and Ambrosius Mediolanensis. These texts also show low perplexity at high n-grams, reflecting strong internal regularity and repetitive patterns.
  3. In contrast, literary texts like Shakespeare or Dumas show a sharp increase in perplexity for high n-grams, indicating a richer and more unpredictable sequence of characters.
  4. Artificially encrypted texts using simple substitution ciphers (like Trithemius-style transformations) show consistently high perplexity, since character distributions are scrambled.

Interpretation

This suggests something important: The Voynich Manuscript does not behave like a substitution cipher or a natural literary language. Instead, it statistically resembles structured, repetitive writing such as liturgical or philosophical works.
This does not mean it’s meaningful — but it does imply that the text might have been designed to look structured and formal, mimicking the style of medieval sacred or scholarly texts.
Its internal predictability could arise from:
  • Repeated formulas or ritualistic phrases
  • A constrained or templated grammar
  • Artificial generation using consistent rules (even if meaningless)

Conclusion

While many have tried to translate the Voynich Manuscript into known languages or decode it with cipher-breaking techniques, this analysis suggests that a direct translation approach may be futile. The manuscript’s character-level structure mirrors that of repetitive, highly formalized texts rather than expressive natural language or encrypted writing.
Any attempt to decipher it without first understanding its generative rules — or lack thereof — is likely to miss the mark.
That said, its statistical behavior is not unique. Other texts from the same era show similar n-gram patterns. So perhaps the Voynich isn’t a hoax — it might just be mimicking the structure of sacred or scholarly texts we no longer fully understand.


RE: Why the Voynich Manuscript Might Not Be a Real Language - oshfdk - 24-06-2025

If I'm reading this graph correctly, then making a simple change in the transcription by treating ch and Sh as two distinct whole characters and merging a few more sequences into separate single glyphs, like qo, aiin, ain, would shrink the Voynich graph horizontally and bring it well within the normal range, closer to Romeo and Juliet. Is this so?
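
One quick way to test that would be to collapse the chosen sequences into single placeholder symbols before training the models. A sketch, where the placeholder letters and the sample string are made up, and longer sequences are replaced first to avoid partial matches:

Code:
# Collapse selected EVA sequences into single placeholder symbols before
# running the character n-gram models. Longest sequences go first.
MERGES = [("aiin", "M"), ("ain", "N"), ("qo", "Q"), ("ch", "C"), ("sh", "S")]

def merge_eva(text):
    for sequence, symbol in MERGES:
        text = text.replace(sequence, symbol)
    return text

print(merge_eva("qokeedy shedy qokaiin chedy daiin"))
# -> Qkeedy Sedy QkM Cedy dM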


RE: Why the Voynich Manuscript Might Not Be a Real Language - RobGea - 24-06-2025

Not bad, and thank you for introducing me to perplexity; that's something I've never come across before.


RE: Why the Voynich Manuscript Might Not Be a Real Language - Rafal - 24-06-2025

And if I get things correctly, your result is that the Voynich text is very predictable: you can predict the next letter based on the previous letters with a good ratio. It's hardly surprising; people on these forums often talk about it. They just more often call it low entropy instead of low perplexity, but I'd say it's more or less the same.

It would be interesting to me to see how prediction of not letters but complete words works in the VM compared to other, real texts. It has again often been said on these forums that there aren't repeatable word sequences in the VM, but maybe we are wrong about it?

By the way, what do you mean by Ambrosius Mediolanensis? He was a man, not a book  Wink


RE: Why the Voynich Manuscript Might Not Be a Real Language - Jorge_Stolfi - 25-06-2025

(24-06-2025, 09:57 PM)quimqu Wrote: Perplexity: This is a measure of how well a model predicts a sequence. Low perplexity means the model can easily predict the next character — the text is “regular” or “learnable.” High perplexity means the text is less predictable, like a noisy or complex system. It’s often used to evaluate how well a language model fits a dataset.

That should be a simple function of the next-character entropy, no?

But please, please, please people: the character entropy is a property of the encoding.  Not of the language, not of the text.  IIff yyoouu  jjuusstt  wwrriittee  eevveerryy  lleetttteerr  ttwwiiccee, the average per-character entropy will be halved.  iF yOU raNdOmLY cAPitAliZe eACh lEtTer, you will add 1 to the average per-character entropy. If you insert two random capital letters after each letter, lMTiGUkSHeWP tNNhLOiZQsDB, you will add at least 3 to the per-character entropy.
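
A rough way to see this numerically is a sketch like the one below, which uses compressed size as a crude proxy for the per-character entropy rate (the input file is a placeholder, and the figures are only indicative):

Code:
import random
import zlib

def bits_per_char(text):
    """Compressed size as a crude proxy for the per-character entropy rate."""
    data = text.encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / len(text)

# Placeholder corpus: any reasonably long plain-text file will do.
with open("some_latin_text.txt", encoding="utf-8") as f:
    sample = f.read().lower()

doubled = "".join(c * 2 for c in sample)               # write every letter twice
random.seed(0)
recapped = "".join(c.upper() if random.random() < 0.5 else c for c in sample)

print("original :", bits_per_char(sample))
print("doubled  :", bits_per_char(doubled))    # roughly half the original figure
print("recapped :", bits_per_char(recapped))   # roughly one extra bit per character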

The low per-character entropy of Voynichese could mean simply that the "spelling system" that the Author invented for its language is inefficient.  Which is not surprising, since the theory needed to design good encodings was 600 years in the future.  Whereas the usual spelling systems for natural languages have become somewhat more efficient through centuries of use, which explains why they have higher per-character entropy than Voynichese. 

By the way, one possible explanation for the differences between the "languages" (actually just word distributions) A and B is that the Author made some simplifications to the spelling between the writing of the two parts.


RE: Why the Voynich Manuscript Might Not Be a Real Language - RadioFM - 25-06-2025

Perplexity is a function of the language model too, not just the string of text. With a simple n-gram language model I guess the perplexity could be worked out as a function of the n-order conditional entropy.
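
For reference, under the usual definitions (logs base 2, cross-entropy H in bits per character) the relation is

\mathrm{PP}(c_{1:N}) = 2^{H}, \qquad H = -\frac{1}{N}\sum_{i=1}^{N}\log_2 P\!\left(c_i \mid c_{i-n+1},\ldots,c_{i-1}\right)

so for a well-estimated n-gram model H approaches the order-(n-1) conditional entropy, and the log2 of the perplexity is the bits-per-character figure.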

Whatever changes they made to the spelling, it certainly didn't make it less repetitive. Would you argue that A would be an improvement upon B, given the decrease of the q prefix, or the other way around? x is exclusive to B, IIRC?


RE: Why the Voynich Manuscript Might Not Be a Real Language - ReneZ - 25-06-2025

(24-06-2025, 09:57 PM)quimqu Wrote: The Voynich Manuscript, transcribed using the EVA alphabet

Others have already pointed this out, but it feels worth repeating: this explains a significant part of the behaviour of the graph. 

Use modified transliteration alphabets and see what happens. 
At least convert Ch, Sh, in, and iin to single characters.

Also a question: are space characters treated just like normal text characters?

Your result is quite different from my results here, where Voynichese becomes less predictable after 3 characters, but that is probably due to the way in which I treated space characters, namely as separators.
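
For what it's worth, the two treatments of spaces look like this in code (a sketch; the sample line is made up):

Code:
line = "qokeedy qokedy chedy"

# (a) spaces as ordinary characters: n-grams may cross word boundaries
running_text_trigrams = [line[i:i + 3] for i in range(len(line) - 2)]

# (b) spaces as separators: n-grams are collected per word, never across a space
words = line.split()
per_word_trigrams = [w[i:i + 3] for w in words for i in range(len(w) - 2)]

print(running_text_trigrams)
print(per_word_trigrams)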


RE: Why the Voynich Manuscript Might Not Be a Real Language - quimqu - 25-06-2025

(24-06-2025, 10:38 PM)Rafal Wrote: And if I get things correctly, your result is that the Voynich text is very predictable: you can predict the next letter based on the previous letters with a good ratio. It's hardly surprising; people on these forums often talk about it. They just more often call it low entropy instead of low perplexity, but I'd say it's more or less the same.

It would be interesting to me to see how prediction of not letters but complete words works in the VM compared to other, real texts. It has again often been said on these forums that there aren't repeatable word sequences in the VM, but maybe we are wrong about it?

By the way, what do you mean by Ambrosius Mediolanensis? He was a man, not a book  Wink

Hello, I am running more code to answer everybody, but first a quick answer for you: I used Ambrosius Mediolanensis, In Psalmum David CXVIII Expositio:

https://monumenta.ch/latein/text.php?tabelle=Ambrosius&rumpfid=Ambrosius,%20In%20Psalmum%20David%20CXVIII%20Expositio&level=&domain=&lang=1&links=&inframe=1&PHPSESSID=e4330e1fab8aa0c3d974cae894dc00cb


RE: Why the Voynich Manuscript Might Not Be a Real Language - oshfdk - 25-06-2025

Entropy is the logarithm of perplexity, as far as I understand.

It is possible that the transformer can find some rules that work beyond a simple probability map from an n-gram to the next character. It would be interesting to compare the log2 of the transformer perplexity with the statistical n-gram entropy.

Edit: maybe as a simple fix you can show the perplexity as powers of 2 on the vertical axis; this should give us the bit entropy figures as the exponents.


RE: Why the Voynich Manuscript Might Not Be a Real Language - oshfdk - 25-06-2025

(25-06-2025, 01:51 AM)RadioFM Wrote: Whatever changes they made to the spelling, it certainly didn't make it less repetitive. Would you argue that A would be an improvement upon B, given the decrease of the q prefix, or the other way around? x is exclusive to B, IIRC?

This could be an improvement, for example, if:
1) it made it easier to write with a quill
2) it made it less mentally challenging to read/write (e.g., if q is too similar to an unrelated letter of another script that the scribe or the reader is familiar with, changing the script to reduce the number of q's could reduce the number of mistakes)
3) it made the script look cleaner overall or allowed for tighter line spacing; a lot of descenders and ascenders in the original version may have been clashing, producing a hard-to-read mess

If the script was a work in progress, there could be many reasons to adjust it mid-project. If several people were involved with the project, the adjustments could start with one person, and then some or all of them could be accepted by the other scribes. Who knows, maybe different people were even in charge of different sections of the manuscript and could set their own rules, so there was no requirement for the whole thing to be in one single script version at all. Maybe the scribes didn't even consider standardization to be so important and were comfortable with several versions of the script being used in a single codex.