02-07-2025, 04:10 PM
A GPT is a neural network trained to predict the next token (e.g., a word or character) in a sequence from the context of the preceding ones. During training, the model gradually learns patterns and structures that help it guess what might come next; applied to natural language, this usually means picking up the language's grammar and syntax. The goal of this experiment is to see whether a GPT trained on Voynich text can reproduce even short valid word sequences, a basic sign of underlying grammatical structure.
Using a minimal GPT architecture trained on 11,000-token corpora from natural languages and from the Voynich manuscript (EVA and CUVA transcriptions, restricted to paragraphs of apparently running text and excluding, for example, the cosmological sections), I evaluated how well each model could reproduce sequences of two or three consecutive words (bigrams and trigrams) from its training corpus. The results reveal stark differences between Voynichese and natural languages.
I trained several nanoGPT models (roughly 1.1M parameters each) on corpora limited to 11,000 words each; a sketch of a comparable model configuration follows the list. The corpora included:
- Latin (e.g. De Docta Ignorantia)
- Latin religious text (In Psalmum David CXVIII)
- Early Modern English (Romeo and Juliet)
- Esperanto (Alice in Wonderland, in Esperanto translation)
- Voynich EVA transcription
- Voynich CUVA transcription
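For reference, a configuration in nanoGPT's config-file style that lands near 1.1M parameters might look like the sketch below. The post does not give the actual hyperparameters, so every value here (layer count, embedding width, context length, dataset and output names) is an assumption, chosen only so that four 128-dimensional layers plus a word-level vocabulary of a few thousand types add up to roughly 1.1M parameters.

```python
# Hypothetical nanoGPT config for a ~1.1M-parameter word-level model.
# All values are illustrative assumptions; the post does not state them.
out_dir = 'out-voynich-eva'   # assumed output directory
dataset = 'voynich_eva'       # assumed dataset folder under data/

# model size: 12 * n_layer * n_embd^2 ~= 0.79M transformer parameters,
# plus a few-thousand-word embedding table brings the total near 1.1M
n_layer = 4
n_head = 4
n_embd = 128
dropout = 0.2
block_size = 64               # context length in words

# training
batch_size = 32
learning_rate = 1e-3
max_iters = 5000
eval_interval = 250
eval_iters = 100
```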
Each model was trained on tokenized text split by the dot (".") separator, treating each token as a "word". Then, I prompted each model to generate 1000 words, starting from a random token from the original corpus.
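A minimal sketch of that preparation step, in the style of nanoGPT's prepare.py scripts, is shown below. The file names, the treatment of line breaks as additional word separators, and the 90/10 train/val split are assumptions, not details taken from the post.

```python
# Word-level data preparation in the style of nanoGPT's prepare.py scripts.
# Assumes a transcription file where words are separated by "."; file names
# and the train/val split are assumptions.
import pickle
import numpy as np

with open('voynich_eva.txt', 'r', encoding='utf-8') as f:
    raw = f.read()

# Split on the dot separator (treating line breaks as separators too)
# and drop empty tokens.
words = [w for w in raw.replace('\n', '.').split('.') if w]

# Build the word-level vocabulary and its lookup tables.
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}
itos = {i: w for w, i in stoi.items()}

ids = np.array([stoi[w] for w in words], dtype=np.uint16)

# 90/10 train/val split, as in the nanoGPT examples.
n = len(ids)
ids[:int(n * 0.9)].tofile('train.bin')
ids[int(n * 0.9):].tofile('val.bin')

with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(vocab), 'stoi': stoi, 'itos': itos}, f)
```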
For each generated sequence, I extracted all bigrams and trigrams and checked how many were present in the original corpus text (used as training data).
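The check itself can be done with plain set lookups over n-gram tuples. The sketch below assumes `train_words` and `generated_words` are lists of word tokens obtained by the same dot-splitting as above; it is one straightforward way to compute the match rates reported here, not necessarily the author's exact script.

```python
# Extract all n-grams from the generated sequence and count how many
# also occur anywhere in the training corpus.

def ngrams(words, n):
    """Return the list of n-word tuples in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def match_rate(generated_words, train_words, n):
    """Fraction of generated n-grams that also appear in the training text."""
    train_set = set(ngrams(train_words, n))
    gen = ngrams(generated_words, n)
    hits = sum(1 for g in gen if g in train_set)
    return hits / len(gen) if gen else 0.0

# Example usage (hypothetical variable names):
# bigram_pct  = 100 * match_rate(generated_words, train_words, 2)
# trigram_pct = 100 * match_rate(generated_words, train_words, 3)
```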
Results (Bigrams and Trigrams Found in Training Text):
![Results: bigram and trigram match rates in training text](https://i.imgur.com/s5N9a0a.png)
The Latin religious text In Psalmum David CXVIII had fairly low bigram and trigram scores, not far above the Voynich numbers. This could be due to its complex sentence structure or to how rarely some word combinations repeat. Even so, it still produced some consistent word sequences, which the GPT picked up.
That never happened with the Voynich: not a single three-word sequence from the original text was ever regenerated. This makes Voynichese stand out as fundamentally different.
In addition, the entropy of word distributions was comparable across corpora (~8.5 to 9.6 bits), meaning the GPT learned the relative frequencies of words quite well. However, only in natural language corpora did it also learn statistically consistent co-occurrence patterns.
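One plausible reading of those entropy figures is the Shannon entropy of the unigram (word-frequency) distribution, in bits. Whether the author computed it over the training corpus, the generated text, or both is not stated, so the helper below is only a sketch of that calculation.

```python
# Shannon entropy of the word-frequency (unigram) distribution, in bits.
# A sketch of one way the quoted ~8.5-9.6 bit figures could be obtained;
# which text it was computed over is an assumption.
from collections import Counter
import math

def word_entropy(words):
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# e.g. compare word_entropy(train_words) with word_entropy(generated_words)
```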
Conclusion:
If the Voynich manuscript encoded a natural language, we would expect a GPT trained on it to be able to reproduce at least a small proportion of common bigrams and trigrams from the training corpus. This is exactly what we observe in natural language corpora (e.g. Esperanto 25.9% bigram match). In contrast, the bigram match rate for Voynichese is nearly zero, and trigrams are entirely absent.
This strongly supports the hypothesis that the Voynich manuscript is not a natural language encoding. While it has an internally consistent lexicon (i.e., words), it lacks the sequential dependencies and word-to-word transitions that characterize even simple or constructed languages.
Implication:
If a small GPT can learn bigrams and trigrams from natural languages in just 11,000 words, but completely fails to do so with Voynichese, this suggests that the manuscript does not reflect natural language structure.
This casts serious doubt on claims of direct decryption or translation into real languages: such efforts are likely searching for linguistic structure that the text simply does not exhibit.
Instead, the Voynich may reflect a pseudo-linguistic system: a generative algorithm, constructed gibberish, or even a cipher whose output was never meant to carry true semantic depth. The surface form may resemble language, but its internal statistical behavior tells a different story.
In short: be skeptical of anyone claiming to have “translated” the Voynich into English, Latin, or any other language — unless they can show that their version has the statistical fingerprints of a true linguistic system.