I've been exploring character-level n-gram models on the Voynich manuscript and several known-language texts. This time, I tried a different approach, as you did in your study: resetting the context at every word, so that each word is modeled independently of its neighbors.
For each n-gram order, the model is trained on the entire corpus to collect counts of n-grams and of their contexts. To calculate perplexity with reset per word, I then iterate over each word individually: for every valid n-gram inside a word, the probability of the next character given the preceding n-1 characters is estimated as a relative frequency from the training counts. I accumulate the negative log probabilities over all n-grams in all words, divide by the total number of predicted characters, and take the exponential of that average. The result is the perplexity, which can be read as the effective average branching factor, i.e. the uncertainty in predicting the next character from its context.
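In code, the procedure looks roughly like this (a minimal sketch, without smoothing; the model is trained and scored on the same corpus, so every probability is a plain relative frequency, and the function names are only illustrative):

```python
import math
from collections import Counter

def train_counts(words, n):
    """Count every n-gram and every (n-1)-character context occurring inside a word."""
    ngram_counts, context_counts = Counter(), Counter()
    for w in words:
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def perplexity_reset_per_word(words, n, ngram_counts, context_counts):
    """Perplexity with the context reset at every word boundary."""
    total_neg_logp, total_predicted = 0.0, 0
    for w in words:
        for i in range(len(w) - n + 1):            # only n-grams fully inside the word
            gram = w[i:i + n]
            p = ngram_counts[gram] / context_counts[gram[:-1]]  # relative frequency (MLE)
            total_neg_logp -= math.log(p)
            total_predicted += 1
    # exponential of the average negative log probability = effective branching factor
    return math.exp(total_neg_logp / total_predicted) if total_predicted else 1.0

# usage (file name assumed): words = open("voynich_EVA.txt").read().split()
# counts = train_counts(words, 3); print(perplexity_reset_per_word(words, 3, *counts))
```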
Here’s what happens to perplexity as n increases, when we reset per word. I include results for voynich_EVA, voynich_CUVA, and a range of comparison texts in Latin, English, and French.
Across all texts, perplexity drops consistently as n increases, including for the Voynich corpus. In Voynichese, the drop is especially smooth and monotonic, with no apparent plateau even at 9-grams. To investigate further, I extended the analysis up to 14-grams.
Interestingly, perplexities approach 1 for the highest n-gram orders, but this should be interpreted cautiously. A word of length L contributes at most L - n + 1 n-grams of order n, so at very high orders only the handful of longest words contribute any predictions at all, and those few long n-grams are essentially unique in the training data, so the model "predicts" them with near certainty. In other words, data sparsity for very long contexts limits the calculation and makes perplexity values at these orders largely uninformative.
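A quick way to see how thin the evaluation set becomes is to count how many n-grams are actually scored at each order under per-word reset. A small sketch (the function name is mine, the example words are common EVA tokens):

```python
def evaluable_ngrams(words, max_n=14):
    """How many n-grams get scored at each order when the context resets per word."""
    return {n: sum(max(0, len(w) - n + 1) for w in words)
            for n in range(1, max_n + 1)}

# e.g. evaluable_ngrams("daiin chedy qokeedy shol".split(), max_n=8)
# -> {1: 21, 2: 17, 3: 13, 4: 9, 5: 5, 6: 2, 7: 1, 8: 0}
```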
In natural languages like Latin or English, perplexity drops steeply at first but tends to flatten around 6- to 8-grams, reflecting strong spelling regularities within words. For some of the comparison texts (e.g., La Reine Margot), perplexity decreases from around 26 at 1-gram to nearly 1 at 6-grams and above, consistent with near-deterministic character sequences.
Based on these patterns, here are some hypotheses:
- Voynich words are internally structured.
The smooth perplexity decrease suggests each character depends strongly on its predecessors within words, implying internal templates or morphological patterns. Longer n-grams continue improving prediction quality, indicating exploitable redundancy or structure inside words.
- Inter-word context matters less.
Resetting at word boundaries removes any benefit from cross-word syntactic or semantic dependencies, which in natural languages typically degrades model performance. Yet Voynichese perplexity still falls smoothly, implying weak or absent inter-word dependencies.
- Compression-like behavior.
The steady decline in perplexity might reflect a "quasi-compressed" structure, where most information is front-loaded within words and the remaining characters become highly predictable. Unlike ordinary natural-language texts, where perplexity plateaus quickly, Voynichese shows continued improvement with longer contexts.
In conclusion, even with context reset per word, the Voynich script exhibits clear, consistent internal structure. This supports the view that its "words" are not random strings but follow constrained generative processes — morphological, templatic, or algorithmic — and that cross-word syntax is minimal or absent, which is unusual for natural languages. Why do I say this?
When modelling the text as one long sequence (no reset between words):
- The model uses context across words—how words follow each other—to predict the next character.
- In natural languages, this helps lower perplexity as you increase n-gram size, up to a point.
- When n-grams get very long, perplexity can rise because the model rarely sees exact long sequences, so predictions get harder.
When resetting the model at each word (treating words independently):
- The model ignores cross-word context, only using inside-word character patterns.
- In natural languages, perplexity still drops with n but stabilizes quickly at a low level, because inside-word spelling patterns are very predictable.
- In the Voynich manuscript, perplexity keeps dropping smoothly even at very long n-grams, showing very strong and unusual internal structure inside words, and little to no dependency between words.
Key difference (a short sketch contrasting the two setups follows the list):
- Natural languages have strong cross-word dependencies, which help prediction when modeling the whole text, but inside words the patterns stabilize quickly.
- Voynich text shows minimal cross-word dependencies and highly structured, longer-range patterns inside words, unlike typical languages.
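To make that difference concrete, here is a rough sketch of the only place the two setups diverge, namely which (context, next-character) pairs get scored; the scoring itself is the same relative-frequency estimate in both cases. In the whole-text version I keep spaces in the character stream, which is one possible convention.

```python
def prediction_sites_no_reset(text, n):
    """One long sequence: contexts may cross word boundaries (spaces kept in the stream)."""
    return [(text[i:i + n - 1], text[i + n - 1]) for i in range(len(text) - n + 1)]

def prediction_sites_reset(words, n):
    """Reset per word: contexts never cross a word boundary."""
    return [(w[i:i + n - 1], w[i + n - 1])
            for w in words
            for i in range(len(w) - n + 1)]

# With n = 4 and the text "daiin chedy", the no-reset version also scores pairs
# like ("in ", "c") that straddle the space, while the reset version only scores
# pairs inside single words, such as ("dai", "i") and ("che", "d").
```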