The Voynich Ninja

What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers
This work approaches the Voynich manuscript from a fresh angle—by applying small-scale character-level GPT models to its transliterated text. Instead of attempting direct decipherment or semantic interpretation, the focus is on investigating the internal structure and statistical patterns embedded in the glyph sequences.
By training models on different sections of the manuscript using a simplified and consistent transliteration system, this study probes how well a transformer-based language model can learn and predict the character sequences. The results provide compelling evidence that the text is far from random, showing meaningful structural regularities that a machine can capture with relatively low uncertainty.
This computational perspective offers a complementary lens to traditional Voynich research, suggesting that the manuscript’s mysterious text may follow underlying syntactic or generative rules—even if their semantic content remains unknown. It is an invitation to consider the manuscript as a linguistic system in its own right, accessible to modern machine learning tools, and to explore new paths for understanding its secrets.

Objective
The aim of this project is to explore the internal structure of the Voynich manuscript by training a small GPT model on its transliterated text. Using the Currier transliteration, which offers a simplified and consistent representation of Voynichese glyphs, the goal is to test how well a transformer model can learn and predict character sequences within this mysterious and undeciphered corpus.

Methodology
I trained four different character-level GPT models (≈0.8M parameters), each on a different subset of the manuscript:

Notebook                                         | Text Scope              | Validation Loss | Perplexity
Voynich_char_tokenizer                           | Full manuscript         | 1.2166          | 3.38
Biological_Voynich_char_tokenizer                | Only biological section | 1.2845          | 3.61
Herbal_Voynich_char_tokenizer                    | Only herbal section     | 1.5337          | 4.64
Herbal_and_pharmaceutical_Voynich_char_tokenizer | Herbal + pharmaceutical | 1.5337          | 4.64

Each dataset was carefully filtered to remove uncertain tokens (?), header lines, and other non-linguistic symbols. Paragraphs were reconstructed using markers from the transcription file.
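For illustration, the filtering step can be sketched as follows (a minimal sketch, not the exact notebook code; the '#' header convention and the '=' paragraph marker are assumptions about the transcription file format):

import re

def clean_transcription(lines):
    # Sketch of the filtering step. Assumed (hypothetical) conventions:
    # '#' starts a header/comment line, '?' marks an uncertain glyph,
    # '=' at end of line marks a paragraph break.
    paragraphs, buffer = [], []
    for line in lines:
        if line.startswith("#"):
            continue                             # drop header lines
        text = re.sub(r"\?", "", line.strip())   # drop uncertain glyphs
        if not text:
            continue
        if text.endswith("="):                   # paragraph marker: flush buffer
            buffer.append(text[:-1])
            paragraphs.append(" ".join(buffer))
            buffer = []
        else:
            buffer.append(text)
    if buffer:
        paragraphs.append(" ".join(buffer))
    return paragraphs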

Why character-level tokenization?
Early attempts at word-level tokenization (based on dot-separated EVA words) yielded poor results, primarily due to:
  • A large vocabulary size (~15,000+ unique tokens).
  • Very sparse and repetitive training data per token.
  • Increased perplexity and unstable loss curves.
In contrast, character-level models:
  • Have a much smaller and denser vocabulary.
  • Perform well with limited data.
  • Naturally capture the morphological regularities of Voynichese.
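Concretely, the character-level tokenizer amounts to only a few lines (an illustrative sketch in the usual character-LM style, not the exact notebook code):

class CharTokenizer:
    # Vocabulary = the set of distinct characters in the corpus.
    def __init__(self, text):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

With the Currier transliteration this yields a vocabulary of a few dozen symbols, versus the 15,000+ types produced by dot-separated word tokenization.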

Perplexity Results & Interpretation
Perplexity measures how well a model predicts the next token — lower values mean better predictability.
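For reference, perplexity is simply the exponential of the average cross-entropy loss in nats, so the figures below follow directly from the validation losses above:

import math

for loss in (1.2166, 1.2845, 1.5337):
    print(f"loss={loss:.4f} -> perplexity={math.exp(loss):.2f}")
# loss=1.2166 -> perplexity=3.38
# loss=1.2845 -> perplexity=3.61
# loss=1.5337 -> perplexity=4.64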

Dataset                 | Perplexity
Full Voynich            | 3.38
Biological              | 3.61
Herbal                  | 4.64
Herbal + Pharmaceutical | 4.64

The relatively low perplexity values (≈3.4–4.6) show that the model can learn strong internal structure from the Voynich text, particularly in the full corpus and biological section. These numbers are comparable to what we observe in natural languages at the character level, and far from what we would expect from purely random or meaningless sequences: a uniformly random character stream would have a perplexity equal to its alphabet size, i.e. a few dozen symbols for this transliteration.

Why this matters
These results support the long-standing hypothesis — prominently discussed by René Zandbergen and others — that the Voynich manuscript, while undeciphered, exhibits non-random, rule-governed linguistic patterns.
Even though the GPT model has no access to semantics, its ability to predict Voynichese characters with such low uncertainty suggests that the manuscript likely follows an underlying syntax or generation process — artificial or natural.
In essence, the model behaves like a human listener hearing a foreign language repeatedly: it can’t understand the meaning, but learns to anticipate the next syllables based on structure.

Future Work
This approach opens up further directions:
  • Train section-specific models (e.g., cosmological, recipes).
  • Cluster generated tokens morphologically.
  • Compare synthetic Voynichese to natural languages.
  • Test statistical properties against controlled glossolalia or cipher texts.

To facilitate further exploration and replication, the Jupyter notebooks used in this study are available in my GitHub repository.

Feel free to download, review, and experiment with the code and data. Your feedback and insights are very welcome!

(08-06-2025, 01:44 PM)oshfdk Wrote: Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit simple rules that are easy to analyze and compare, as opposed to black-box GPT models.

Without some metric showing that a GPT-based approach identifies structures beyond those already identified with previous methods, it's hard for me to see if the GPT-based approach is of any use at all.

I ran a validation test on a recent output of 891 generated words, using a simplified slot grammar system defined below in EVA conventions and then adapted to the Currier transcription.

Basic slot grammar rules:

valid_prefixes = ("qo", "ch", "sh")
valid_suffixes = ("y", "dy", "aiin", "ain", "daiin")
invalid_double_letters = ("tt", "pp", "ff")
invalid_final_glifs = ("q", "e", "ch")

The same rules adapted to the Currier transcription (Currier-style glyphs):

valid_prefixes = ("4O", "S", "Z")
valid_suffixes = ("9", "89", "AM", "AN", "8AM")
invalid_final_glifs = ("4", "C", "S")
invalid_double_letters = ("PP", "BB", "FF")
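
For reference, the conformity check itself is only a few lines; a minimal sketch of how each generated word can be tested against these rules (the actual script may differ in details):

def check_word(word):
    # One boolean per slot-grammar condition, matching the table columns below.
    return {
        "Prefix":   word.startswith(valid_prefixes),
        "Suffix":   word.endswith(valid_suffixes),
        "EndOK":    not word.endswith(invalid_final_glifs),
        "NoBadDbl": all(d not in word for d in invalid_double_letters),
    }

def all_ok(word):
    return all(check_word(word).values())

# Example: all_ok("4ORAR9") -> True (prefix "4O", suffix "9", valid ending, no doubled letters)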

Summary of results:
  • ✅ 96.74% of the generated words match the slot grammar rules
  • ✅ 95.5% of the total set correspond to real words from the corpus
  • ⚠️ 100% of the words have valid final glyphs and no invalid double letters (these two constraints have been learned perfectly)


Below is a list of the invented words that fully satisfy the slot grammar restrictions:

Correct invented words: ['SCCX9', 'ZCCC9', '4OFAECC89', '4OFAEZC89', '4OFAEFCCC9', '4OFAEO9', 'SCCC89', '4OFAEZC89', 'SEFAN', '4ORAR9', '4OFCC889']

Below is a table of the invented words that do not fully satisfy the slot grammar restrictions, and why:

Word     | Prefix | Suffix | EndOK | NoBadDbl | AllOK
ROPAJ    | False  | False  | True  | True     | False
OFAEZE   | False  | False  | True  | True     | False
AEOR     | False  | False  | True  | True     | False
EZCC89R  | False  | False  | True  | True     | False
4OESCC9R | True   | False  | True  | True     | False
ESCO8    | False  | False  | True  | True     | False
4CFAR    | False  | False  | True  | True     | False
8AROE    | False  | False  | True  | True     | False
OEFCCC89 | False  | True   | True  | True     | False
FAEOE9   | False  | True   | True  | True     | False
POEZC89  | False  | True   | True  | True     | False
EFS9     | False  | True   | True  | True     | False
OZCC9    | False  | True   | True  | True     | False
AEFM     | False  | False  | True  | True     | False
2OEZCC9  | False  | True   | True  | True     | False
OEFAROR  | False  | False  | True  | True     | False
2OEZCC89 | False  | True   | True  | True     | False
E8AN     | False  | True   | True  | True     | False
Z2AE     | True   | False  | True  | True     | False
AEAR     | False  | False  | True  | True     | False
8EAM     | False  | True   | True  | True     | False
RSCC89   | False  | True   | True  | True     | False
8AEZC9   | False  | True   | True  | True     | False
2AROE    | False  | False  | True  | True     | False
EOEZC9   | False  | True   | True  | True     | False
BOEFAN   | False  | True   | True  | True     | False
EOEFCC89 | False  | True   | True  | True     | False
4OFOEOE  | True   | False  | True  | True     | False
4OFCCOE  | True   | False  | True  | True     | False

The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system—whether artificial or natural.

Update 06/09/2025: I have added a heatmap of the per-folio loss according to my trained GPT. This gives an insight into which folios the model finds strangest.

(08-06-2025, 05:44 PM)davidma Wrote:
(08-06-2025, 03:39 PM)quimqu Wrote: The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system—whether artificial or natural.

Could you in theory test for non-conforming words already in the VM? Or I guess measure how much they break the internal rules? I wonder if it could be interesting to see where these pop up in the VM, if they are predominant in "labelese" or maybe in certain sections? Or certain scribes? To my knowledge this hasn't been done yet, but I am probably wrong.

Hi!

Thanks for the thoughtful suggestion — I think you're absolutely right that identifying non-conforming words and tracking their locations in the manuscript could reveal meaningful patterns.

What I’ve done so far is train a character-level GPT model on Voynichese text (using Currier transliteration). Then, I used this model to estimate the average word-level loss for each token in the manuscript — essentially measuring how well the model “understands” each word given its context.
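
In outline, the per-word loss is the mean cross-entropy of the word's characters given their preceding context. A sketch of the idea (assuming a model whose forward pass returns logits for every position, shape [1, T-1, vocab]; the helper names are mine, not necessarily the notebook's):

import torch
import torch.nn.functional as F

@torch.no_grad()
def word_loss(model, tok, context, word):
    # Encode context + word, then predict each character from its prefix.
    ids = torch.tensor([tok.encode(context + word)])
    logits = model(ids[:, :-1])                   # assumed shape [1, T-1, vocab]
    per_char = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten the time dimension
        ids[:, 1:].reshape(-1),
        reduction="none")
    return per_char[-len(word):].mean().item()    # average over the word's characters only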

I attach here a PNG file of the resulting loss heatmap, so you can easily see which folios are the strangest according to the GPT. Surprisingly, the last folios have the lowest loss.

[Image: jKk1ZAA.png]

From this, I have been able to:
  • Compute the average loss per folio, showing how predictable the text is in different sections.
  • Visualize this data as a heatmap, coloring folios by their average word loss.
This framework opens up possibilities exactly along the lines you suggested:
  • Highlighting words or regions where the model struggles most.
  • Investigating whether these “high-loss” zones correlate with labelese, specific sections, or particular scribes (if metadata is available).
  • Zooming in on the individual words the model finds most anomalous, and seeing their frequency and distribution.
I haven’t done the full detailed analysis yet, but the infrastructure is ready and the heatmap helps guide further exploration.
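
The heatmap itself is standard matplotlib; roughly like this (a sketch with a hypothetical data layout, one mean loss value per folio):

import numpy as np
import matplotlib.pyplot as plt

def plot_folio_heatmap(folio_losses):
    # folio_losses: dict like {"f1r": 1.21, "f1v": 1.35, ...}
    folios = list(folio_losses)
    values = np.array([folio_losses[f] for f in folios]).reshape(1, -1)
    fig, ax = plt.subplots(figsize=(14, 2))
    im = ax.imshow(values, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(folios)))
    ax.set_xticklabels(folios, rotation=90, fontsize=5)
    ax.set_yticks([])
    fig.colorbar(im, ax=ax, label="mean word loss")
    fig.tight_layout()
    plt.show()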

Below is a table showing the top 30 most anomalous words in the Currier transcription, ranked by loss (i.e. the words the model found hardest to predict):


Top 30 most anomalous words (by loss):

Word | Loss    | Freq | Len
RH   | 10.1558 | 1    | 2
J    | 9.8573  | 3    | 1
E    | 9.8573  | 15   | 1
6    | 9.8573  | 15   | 1
B    | 9.8573  | 2    | 1
Q    | 9.8573  | 2    | 1
F    | 9.8573  | 4    | 1
9    | 9.8573  | 53   | 1
3    | 9.8573  | 1    | 1
R    | 9.8573  | 35   | 1
4    | 9.8573  | 4    | 1
8    | 9.8573  | 32   | 1
Z    | 9.8573  | 9    | 1
A    | 9.8573  | 1    | 1
2    | 9.8573  | 121  | 1
C    | 9.8573  | 1    | 1
D    | 9.8573  | 3    | 1
O    | 9.8573  | 18   | 1
9J   | 9.4344  | 1    | 2
FT   | 8.8774  | 1    | 2
FU   | 8.5493  | 1    | 2
OT   | 8.3373  | 1    | 2
8DE  | 8.2584  | 1    | 3
OU   | 8.0167  | 1    | 2
O3   | 7.5268  | 3    | 2
9EJE | 7.5207  | 1    | 4
P3   | 7.4110  | 1    | 2
8R   | 7.3208  | 1    | 2
ON   | 7.2464  | 2    | 2
QE   | 7.0748  | 1    | 2


These words often consist of very short tokens, sometimes single characters or rare combinations, which the model struggles to predict confidently. Investigating where these words cluster — whether in labels, particular manuscript sections, or scribes — could provide insights into the structure or anomalies within the text.
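
Locating these clusters should be mechanical once each token carries its position metadata; a sketch of what I have in mind (the (word, folio, section) layout is an assumption about the parsed corpus, not existing code):

from collections import defaultdict

def anomaly_map(tokens, word_losses, threshold=7.0):
    # tokens: iterable of (word, folio, section); word_losses: word -> mean loss
    zones = defaultdict(list)
    for word, folio, section in tokens:
        if word_losses.get(word, 0.0) >= threshold:
            zones[(section, folio)].append(word)
    return zones  # e.g. {("biological", "f78r"): ["RH", "9J", ...], ...}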

To my knowledge, a comprehensive analysis of “non-conforming” words in the Voynich Manuscript has not yet been performed at this level of detail, so this approach offers a promising direction for further research.

If you or anyone else is interested, I’d be happy to collaborate or share the tools I’ve developed so far.
Is it possible to tell if these GPT models exploit any patterns beyond what is already known from slot grammars?
The models I trained don’t explicitly incorporate or reference slot grammars, so any patterns they exploit emerge purely from statistical regularities in the character sequences. Interestingly, the relatively low perplexity scores (especially for the full and biological sections) suggest that the models are indeed capturing consistent structures — possibly beyond what slot grammars alone describe.
Slot grammars tend to focus on local, template-like patterns (e.g. word frames like “qokeedy”), but transformer models can, in theory, learn longer-range dependencies and more abstract relationships between glyphs. Whether what they learn is beyond slot grammars is hard to prove directly.
So while I can't guarantee that the models go beyond slot grammar patterns, the approach opens the door to testing exactly that — and it’s something I’m definitely interested in pursuing further.
Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit simple rules that are easy to analyze and compare, as opposed to black-box GPT models.

Without some metric showing that a GPT-based approach identifies structures beyond those already identified with previous methods, it's hard for me to see if the GPT-based approach is of any use at all.
Yeah, I totally get your point — and I actually agree.
One of the big advantages of slot grammars and curve-line systems is that they're interpretable. You can look at the rules, tweak them, compare them. With GPT-style models, everything’s buried in a tangle of weights and activations, so it’s much harder to pin down what the model is actually learning.
That said, my goal here wasn’t to claim that GPT models are better or smarter — just to see if, given only raw Voynichese sequences, a tiny transformer could pick up on any structure at all. And it seems like it does. The low perplexity suggests it’s finding something regular, even if we don’t know exactly what yet.
But you're absolutely right: without comparing it to existing frameworks, it’s hard to know if it’s finding anything new. That’s definitely the next step — like checking whether the model’s predictions line up with known slot structures, or whether it generalizes in weird ways they don’t.
So I see this more as a starting point — not a replacement for existing theories, but maybe a tool to help test or even challenge them.
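
One concrete first metric along these lines: take the words the model finds most surprising and check how many of them the slot grammar also flags. A rough sketch (all_ok() being any rule checker like the one earlier in this thread):

def overlap_with_slot_grammar(word_losses, all_ok, top_k=100):
    # word_losses: dict mapping word -> mean model loss.
    # Share of the model's top-k highest-loss words that the slot grammar
    # also rejects; a low overlap would hint that the model tracks structure
    # the slot rules don't capture (and vice versa).
    worst = sorted(word_losses, key=word_losses.get, reverse=True)[:top_k]
    flagged = [w for w in worst if not all_ok(w)]
    return len(flagged) / top_k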
Interesting work. I agree with you (and many others) that the structure of Voynich word types is rather regular, with some underlying 'grammar', and I think the method you used is surely interesting.

I had my share of fun working on grammars (actual slot grammars, in my case). I don't have the time at hand right now to check your GitHub repository and see what I can understand; I confess I did not understand much of how your method works and what it actually does, but I'll try tomorrow.

One question: can you apply your method in 'reverse', I mean: use your GPT model to decompose the original Voynich word types into parts ('tokens'), such as "qo", "y", "dy", "aiin" and so on, for each word (some of which may turn out not to conform to the grammar)? In this case it could be possible to compare the results with different approaches (I defined a metric which ranks different grammars according to how many bits of information are needed to represent the text by using optimal encoding on the words divided into 'tokens').
(08-06-2025, 05:19 PM)Mauro Wrote: Interesting work. I agree with you (and many others) that the structure of Voynich word types is rather regular, with some underlying 'grammar', and I think the method you used is surely interesting.

I had my share of fun working on grammars (actual slot grammars, in my case). I don't have the time at hand right now to check your GitHub repository and see what I can understand; I confess I did not understand much of how your method works and what it actually does, but I'll try tomorrow.

One question: can you apply your method in 'reverse', I mean: use your GPT model to decompose the original Voynich word types into parts ('tokens'), such as "qo", "y", "dy", "aiin" and so on, for each word (some of which may turn out not to conform to the grammar)? In this case it could be possible to compare the results with different approaches (I defined a metric which ranks different grammars according to how many bits of information are needed to represent the text by using optimal encoding on the words divided into 'tokens').

Hi,
Thanks a lot for your message. I find your suggestion about grammar-based decomposition really interesting — that’s exactly the kind of direction I’d like to explore next.
To clarify what I’ve done so far:
I trained several small GPT models on Voynichese using a character-level tokenizer, based on the Currier transliteration. This worked quite well: the model was able to predict character sequences with low perplexity (~3.3), which suggests a high degree of internal structure.
In contrast, using word-level tokenization (based on dot-separated EVA words) gave very poor results — mainly because of the large vocabulary size and lack of training data per token.

At this point, I’m considering two directions:
  1. Trying more modern transliterations (like Takahashi or updated EVA versions). But I’m a bit concerned that these are too detailed — they distinguish rare glyphs very precisely, which might make it harder for the model to generalize.
  2. Switching to syllable-like units instead of characters — which is exactly what you suggested.
    I’d love to hear your opinion on this:
  • What kind of syllables (e.g. "qo", "dy", "aiin", etc.) do you think would make sense?
  • Which transliteration would be the best basis for such tokenization?
If you’ve already explored this type of decomposition, I’d be really interested in hearing more or comparing approaches.
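For the decomposition itself, I would start with greedy longest-match segmentation over a candidate unit inventory; a rough sketch (the EVA unit list here is only an example, not a settled inventory):

UNITS = ("daiin", "aiin", "ain", "qo", "ch", "sh", "dy", "y")

def segment(word, units=UNITS):
    # Split into the longest matching units, left to right; None means some
    # stretch could not be matched (i.e. a non-conforming word).
    ordered = sorted(units, key=len, reverse=True)
    parts, i = [], 0
    while i < len(word):
        for u in ordered:
            if word.startswith(u, i):
                parts.append(u)
                i += len(u)
                break
        else:
            return None
    return parts

# segment("qodyaiin") -> ["qo", "dy", "aiin"] (illustrative input)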
Thanks again for your input!