I've been exploring character-level n-gram models on the Voynich manuscript and several known-language texts. This time, I tried a different approach, as you did in your study: resetting the context at every word, so that each word is modeled independently of its neighbors.
For each n-gram order, the model is trained on the entire corpus to collect counts of n-grams and of their contexts. To calculate perplexity with reset per word, I then iterate over each word individually: for every valid n-gram inside a word, the probability of the next character given the preceding n-1 characters is estimated as a relative frequency from the training counts. I accumulate the negative log probabilities over all n-grams in all words, divide by the total number of predicted characters, and take the exponential of that average. The result is the perplexity, which can be read as the effective average branching factor, i.e. the uncertainty in predicting the next character from its context.
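In code, the procedure looks roughly like this (a minimal sketch, without smoothing; the model is trained and scored on the same corpus, so every probability is a plain relative frequency, and the function names are only illustrative):

```python
import math
from collections import Counter

def train_counts(words, n):
    """Count every n-gram and every (n-1)-character context occurring inside a word."""
    ngram_counts, context_counts = Counter(), Counter()
    for w in words:
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def perplexity_reset_per_word(words, n, ngram_counts, context_counts):
    """Perplexity with the context reset at every word boundary."""
    total_neg_logp, total_predicted = 0.0, 0
    for w in words:
        for i in range(len(w) - n + 1):            # only n-grams fully inside the word
            gram = w[i:i + n]
            p = ngram_counts[gram] / context_counts[gram[:-1]]  # relative frequency (MLE)
            total_neg_logp -= math.log(p)
            total_predicted += 1
    # exponential of the average negative log probability = effective branching factor
    return math.exp(total_neg_logp / total_predicted) if total_predicted else 1.0

# usage (file name assumed): words = open("voynich_EVA.txt").read().split()
# counts = train_counts(words, 3); print(perplexity_reset_per_word(words, 3, *counts))
```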
Here’s what happens to perplexity as n increases, when we reset per word. I include results for voynich_EVA, voynich_CUVA, and a range of comparison texts in Latin, English, and French.
Across all texts, perplexity drops consistently as n increases, including for the Voynich corpus. In Voynichese, the drop is especially smooth and monotonic, with no apparent plateau even at 9-grams. To investigate further, I extended the analysis up to 14-grams.
Interestingly, perplexities approach 1 for the highest n-gram orders, but this should be interpreted cautiously. A word of length L contributes at most L - n + 1 n-grams of order n, so at very high orders only the handful of longest words contribute any predictions at all, and those few long n-grams are essentially unique in the training data, so the model "predicts" them with near certainty. In other words, data sparsity for very long contexts limits the calculation and makes perplexity values at these orders largely uninformative.
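A quick way to see how thin the evaluation set becomes is to count how many n-grams are actually scored at each order under per-word reset. A small sketch (the function name is mine, the example words are common EVA tokens):

```python
def evaluable_ngrams(words, max_n=14):
    """How many n-grams get scored at each order when the context resets per word."""
    return {n: sum(max(0, len(w) - n + 1) for w in words)
            for n in range(1, max_n + 1)}

# e.g. evaluable_ngrams("daiin chedy qokeedy shol".split(), max_n=8)
# -> {1: 21, 2: 17, 3: 13, 4: 9, 5: 5, 6: 2, 7: 1, 8: 0}
```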
In natural languages like Latin or English, perplexity drops steeply at first but tends to flatten around 6- to 8-grams, reflecting strong spelling regularities within words. For some of the comparison texts (e.g., La Reine Margot), perplexity decreases from around 26 at 1-gram to nearly 1 at 6-grams and above, consistent with near-deterministic character sequences.
Based on these patterns, here are some hypotheses:
- Voynich words are internally structured.
The smooth perplexity decrease suggests each character depends strongly on its predecessors within words, implying internal templates or morphological patterns. Longer n-grams continue improving prediction quality, indicating exploitable redundancy or structure inside words.
- Inter-word context matters less.
Resetting at word boundaries removes any benefit from cross-word syntactic or semantic dependencies, which in natural languages typically degrades model performance. Yet Voynichese perplexity still falls smoothly, implying weak or absent inter-word dependencies.
- Compression-like behavior.
The steady decline in perplexity might reflect a "quasi-compressed" structure, where most information is front-loaded within words and the remaining characters become highly predictable. Unlike ordinary natural-language texts, where perplexity plateaus quickly, Voynichese shows continued improvement with longer contexts.
In conclusion, even with context reset per word, the Voynich script exhibits clear, consistent internal structure. This supports the view that its "words" are not random strings but follow constrained generative processes — morphological, templatic, or algorithmic — and that cross-word syntax is minimal or absent, which is unusual for natural languages. Why do I say this?
When modelling the text as one long sequence (no reset between words):
- The model uses context across words—how words follow each other—to predict the next character.
- In natural languages, this helps lower perplexity as you increase n-gram size, up to a point.
- When n-grams get very long, perplexity can rise because the model rarely sees exact long sequences, so predictions get harder.
When resetting the model at each word (treating words independently):
- The model ignores cross-word context, only using inside-word character patterns.
- In natural languages, perplexity still drops with n but stabilizes quickly at a low level, because inside-word spelling patterns are very predictable.
- In the Voynich manuscript, perplexity keeps dropping smoothly even at very long n-grams, showing very strong and unusual internal structure inside words, and little to no dependency between words.
Key difference (a short sketch contrasting the two setups follows the list):
- Natural languages have strong cross-word dependencies, which help prediction when modeling the whole text, but inside words the patterns stabilize quickly.
- Voynich text shows minimal cross-word dependencies and highly structured, longer-range patterns inside words, unlike typical languages.
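To make that difference concrete, here is a rough sketch of the only place the two setups diverge, namely which (context, next-character) pairs get scored; the scoring itself is the same relative-frequency estimate in both cases. In the whole-text version I keep spaces in the character stream, which is one possible convention.

```python
def prediction_sites_no_reset(text, n):
    """One long sequence: contexts may cross word boundaries (spaces kept in the stream)."""
    return [(text[i:i + n - 1], text[i + n - 1]) for i in range(len(text) - n + 1)]

def prediction_sites_reset(words, n):
    """Reset per word: contexts never cross a word boundary."""
    return [(w[i:i + n - 1], w[i + n - 1])
            for w in words
            for i in range(len(w) - n + 1)]

# With n = 4 and the text "daiin chedy", the no-reset version also scores pairs
# like ("in ", "c") that straddle the space, while the reset version only scores
# pairs inside single words, such as ("dai", "i") and ("che", "d").
```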