What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers
The Voynich Ninja (https://www.voynich.ninja), forum "Analysis of the text", thread /thread-4744.html
What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers - quimqu - 08-06-2025

This work approaches the Voynich manuscript from a fresh angle: by applying small-scale character-level GPT models to its transliterated text. Instead of attempting direct decipherment or semantic interpretation, the focus is on investigating the internal structure and statistical patterns embedded in the glyph sequences. By training models on different sections of the manuscript using a simplified and consistent transliteration system, this study probes how well a transformer-based language model can learn and predict the character sequences. The results provide compelling evidence that the text is far from random, showing meaningful structural regularities that a machine can capture with relatively low uncertainty.

This computational perspective offers a complementary lens to traditional Voynich research, suggesting that the manuscript's mysterious text may follow underlying syntactic or generative rules, even if their semantic content remains unknown. It is an invitation to consider the manuscript as a linguistic system in its own right, accessible to modern machine learning tools, and to explore new paths for understanding its secrets.

Objective

The aim of this project is to explore the internal structure of the Voynich manuscript by training a small GPT model on its transliterated text. Using the Currier transliteration, which offers a simplified and consistent representation of Voynichese glyphs, the goal is to test how well a transformer model can learn and predict character sequences within this mysterious and undeciphered corpus.

Methodology

I trained four different character-level GPT models (≈0.8M parameters), each on a different subset of the manuscript:
Each dataset was carefully filtered to remove uncertain tokens (?), header lines, and other non-linguistic symbols. Paragraphs were reconstructed using the paragraph markers in the transcription file.

Why character-level tokenization?

Early attempts at word-level tokenization (based on dot-separated EVA words) yielded poor results, primarily due to:
- the large vocabulary of word types, and
- the lack of training data per token.
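To make the preprocessing concrete, here is a minimal sketch of the filtering and character-level encoding step. The file name, the exact filtering rules and the helper names are illustrative assumptions, not the notebook's actual code:

# Sketch of dataset preparation for a character-level model.
# Assumes a plain-text transliteration file with one line per locus;
# the path and filtering rules are illustrative.

def load_corpus(path="voynich_currier.txt"):
    lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # skip empty lines, header/comment lines, and loci with uncertain glyphs
            if not line or line.startswith("#") or "?" in line:
                continue
            lines.append(line)
    return "\n".join(lines)

def build_char_vocab(text):
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

text = load_corpus()
stoi, itos = build_char_vocab(text)
encoded = [stoi[ch] for ch in text]  # integer sequence fed to the GPT
print(f"corpus: {len(text)} chars, vocab size: {len(stoi)}")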
Perplexity Results & Interpretation

Perplexity measures how well a model predicts the next token: lower values mean better predictability.
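Concretely, character-level perplexity is the exponential of the average per-character cross-entropy, so a perplexity of about 3 means the model is, on average, about as uncertain as a fair choice among three characters. A minimal sketch, assuming the model supplies per-character log-probabilities:

import math

def perplexity(char_log_probs):
    # char_log_probs: list of log p(c_t | c_<t) over a held-out text,
    # in natural log, as produced by a trained character-level model.
    nll = -sum(char_log_probs) / len(char_log_probs)  # mean cross-entropy (nats)
    return math.exp(nll)

# toy example: a model that assigns probability 1/3 to every character
print(perplexity([math.log(1/3)] * 100))  # prints approximately 3.0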
The relatively low perplexity values (3–4.6) show that the model can learn strong internal structure from the Voynich text, particularly in the full corpus and the biological section. These numbers are comparable to what we observe in natural languages at the character level, and far from what we would expect from purely random or meaningless sequences.

Why this matters

These results support the long-standing hypothesis, prominently discussed by René Zandbergen and others, that the Voynich manuscript, while undeciphered, exhibits non-random, rule-governed linguistic patterns. Even though the GPT model has no access to semantics, its ability to predict Voynichese characters with such low uncertainty suggests that the manuscript likely follows an underlying syntax or generation process, whether artificial or natural. In essence, the model behaves like a human listener hearing a foreign language repeatedly: it cannot understand the meaning, but it learns to anticipate the next syllables based on structure.

Future Work

This approach opens up further directions:
To facilitate further exploration and replication, I am sharing my GitHub repository, where you can find the Jupyter notebooks used in this study. Feel free to download, review, and experiment with the code and data. Your feedback and insights are very welcome!
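As one illustration of the kind of experiment the notebooks enable, generated word samples (such as the 891-word sample analysed further down the thread) can be drawn from a trained character-level model by sampling one character at a time. A minimal sketch, where next_char_probs is an assumed stand-in for the trained model's predictive distribution, not the notebooks' actual interface:

import random

def sample_word(next_char_probs, max_len=12, end_char="."):
    # next_char_probs: assumed callable mapping the current context string
    # to a dict {char: probability} for the next character.
    # "." stands in for the word separator used by the transliteration.
    context = ""
    while len(context) < max_len:
        probs = next_char_probs(context)
        chars, weights = zip(*probs.items())
        ch = random.choices(chars, weights=weights, k=1)[0]
        if ch == end_char:
            break
        context += ch
    return context

# e.g. generated = [sample_word(model_probs) for _ in range(891)]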
Update 06/09/2025: I have updated the post with a heatmap of the loss per folio according to my trained GPT (shown in my reply below). This gives an insight into the strangest folios according to the model.
RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - oshfdk - 08-06-2025

Is it possible to tell if these GPT models exploit any patterns beyond what is already known from slot grammars?

RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - quimqu - 08-06-2025

The models I trained don't explicitly incorporate or reference slot grammars, so any patterns they exploit emerge purely from statistical regularities in the character sequences. Interestingly, the relatively low perplexity scores (especially for the full and biological sections) suggest that the models are indeed capturing consistent structures, possibly beyond what slot grammars alone describe. Slot grammars tend to focus on local, template-like patterns (e.g. word frames like "qokeedy"), but transformer models can, in theory, learn longer-range dependencies and more abstract relationships between glyphs. Whether what they learn goes beyond slot grammars is hard to prove directly. So while I can't guarantee that the models go beyond slot-grammar patterns, the approach opens the door to testing exactly that, and it's something I'm definitely interested in pursuing further.

RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - oshfdk - 08-06-2025

Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit, simple rules that are easy to analyze and compare, as opposed to black-box GPT models. Without some metric showing that a GPT-based approach identifies structure beyond what has already been identified with previous methods, it's hard for me to see whether the GPT-based approach is of any use at all.

RE: What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers - quimqu - 08-06-2025

Yeah, I totally get your point, and I actually agree. One of the big advantages of slot grammars and curve-line systems is that they're interpretable: you can look at the rules, tweak them, compare them. With GPT-style models, everything is buried in a tangle of weights and activations, so it's much harder to pin down what the model is actually learning.

That said, my goal here wasn't to claim that GPT models are better or smarter, just to see if, given only raw Voynichese sequences, a tiny transformer could pick up on any structure at all. And it seems that it does. The low perplexity suggests it's finding something regular, even if we don't know exactly what yet.

But you're absolutely right: without comparing it to existing frameworks, it's hard to know if it's finding anything new. That's definitely the next step, like checking whether the model's predictions line up with known slot structures, or whether it generalizes in weird ways they don't.
So I see this more as a starting point: not a replacement for existing theories, but maybe a tool to help test or even challenge them.

RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - quimqu - 08-06-2025

(08-06-2025, 01:44 PM)oshfdk Wrote: Well, that's the main question for me, because we already know about a lot of regularity from curve-line systems, slot grammars, etc. These approaches have explicit simple rules that are easy to analyze and compare, as opposed to black-box GPT models.

I ran a validation test on a recent output of 891 generated words using a simplified slot grammar system adapted to EVA transliteration conventions.

Basic slot grammar rules (EVA):
valid_prefixes = ("qo", "ch", "sh")
valid_suffixes = ("y", "dy", "aiin", "ain", "daiin")
invalid_double_letters = ("tt", "pp", "ff")
invalid_final_glyphs = ("q", "e", "ch")

Adapted to the Currier-style glyphs of the transcription:
valid_prefixes = ("4O", "S", "Z")
valid_suffixes = ("9", "89", "AM", "AN", "8AM")
invalid_final_glyphs = ("4", "C", "S")
invalid_double_letters = ("PP", "BB", "FF")

Summary of results:
Below is a list of the invented words that fully satisfy the slot grammar restrictions:

Correct invented words: ['SCCX9', 'ZCCC9', '4OFAECC89', '4OFAEZC89', '4OFAEFCCC9', '4OFAEO9', 'SCCC89', '4OFAEZC89', 'SEFAN', '4ORAR9', '4OFCC889']

Below is a table of the invented words that do not fully satisfy the slot grammar restrictions, and why:
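For reference, a conformity check along the lines of these rules can be sketched as follows. This is a reconstruction from the rule sets listed above, not necessarily the exact script used, and the example words are toy EVA strings:

# Sketch of a slot-grammar conformity check using the EVA rule set above.
# The Currier-glyph variant would simply swap in the second rule set.

valid_prefixes = ("qo", "ch", "sh")
valid_suffixes = ("y", "dy", "aiin", "ain", "daiin")
invalid_double_letters = ("tt", "pp", "ff")
invalid_final_glyphs = ("q", "e", "ch")

def check_word(word):
    # return a list of rule violations for one generated word (empty = conforming)
    problems = []
    if not word.startswith(valid_prefixes):
        problems.append("no valid prefix")
    if not word.endswith(valid_suffixes):
        problems.append("no valid suffix")
    if any(d in word for d in invalid_double_letters):
        problems.append("forbidden double letter")
    if word.endswith(invalid_final_glyphs):
        problems.append("forbidden final glyph")
    return problems

generated = ["qokeedy", "qokaiin", "chtty", "shedq"]  # toy examples
for w in generated:
    issues = check_word(w)
    print(w, "OK" if not issues else ", ".join(issues))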
The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system, whether artificial or natural.

RE: What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers - Mauro - 08-06-2025

Interesting work. I agree with you (and many others) that the structure of Voynich word types is rather regular, with some underlying 'grammar', and I think the method you used is surely interesting. I had my share of fun working on grammars (actual slot grammars, in my case). I don't have the time at hand right now to check your GitHub repository and see what I can understand; I confess I did not understand much of how your method works and what it actually does, but I'll try tomorrow.

One question: can you apply your method in 'reverse'? I mean: use your GPT model to decompose the original Voynich word types into parts ('tokens'), such as "qo", "y", "dy", "aiin" and so on, for each word (some of which may turn out not to conform to the grammar). In that case it would be possible to compare the results with different approaches (I defined a metric which ranks different grammars according to how many bits of information are needed to represent the text, using an optimal encoding of the words divided into 'tokens').

RE: What Lies Beneath: Statistical Structure in Voynichese Revealed by Transformers - quimqu - 08-06-2025

(08-06-2025, 05:19 PM)Mauro Wrote: Interesting work. I agree with you (and many others) that the structure of Voynich word types is rather regular, with some underlying 'grammar', and I think the method you used is surely interesting.

Hi, thanks a lot for your message. I find your suggestion about grammar-based decomposition really interesting; that's exactly the kind of direction I'd like to explore next.

To clarify what I've done so far: I trained several small GPT models on Voynichese using a character-level tokenizer, based on the Currier transliteration. This worked quite well: the model was able to predict character sequences with low perplexity (~3.3), which suggests a high degree of internal structure. In contrast, word-level tokenization (based on dot-separated EVA words) gave very poor results, mainly because of the large vocabulary size and the lack of training data per token.

At this point, I'm considering two directions:
Thanks again for your input!

RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - davidma - 08-06-2025

(08-06-2025, 03:39 PM)quimqu Wrote: The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system, whether artificial or natural.

Could you in theory test for non-conforming words already in the VM? Or, I guess, measure how much they break the internal rules? I wonder if it could be interesting to see where these pop up in the VM: are they predominant in "labelese", or in certain sections, or with certain scribes? To my knowledge this hasn't been done yet, but I am probably wrong.

RE: Unveiling the Hidden Syntax: Transformer Models Decode the Voynich Manuscript - quimqu - 09-06-2025

(08-06-2025, 05:44 PM)davidma Wrote:
(08-06-2025, 03:39 PM)quimqu Wrote: The high percentage of conformity suggests that the generation process is strongly guided by structural constraints similar to those observed in actual Voynichese. While not all words match real entries from the manuscript, most invented forms remain within plausible morpho-phonological boundaries defined by the slot grammar. This supports the idea that the model is not producing random noise, but instead approximates a coherent internal system, whether artificial or natural.

Hi! Thanks for the thoughtful suggestion. I think you're absolutely right that identifying non-conforming words and tracking their locations in the manuscript could reveal meaningful patterns.

What I've done so far is train a character-level GPT model on Voynichese text (using the Currier transliteration). Then I used this model to estimate the average word-level loss for each token in the manuscript, essentially measuring how well the model "understands" each word given its context. I attach a PNG file of the resulting heatmap of loss, so you can easily see which folios are the strangest ones according to the GPT. Surprisingly, the last folios have the lowest loss.

[Figure: heatmap of average model loss per folio]

From this, I have been able to:
Below is a table showing the top 30 most anomalous words in the Currier transcription, ranked by loss (i.e. the words the model found hardest to predict):

Top 30 most anomalous words (by loss):
These words often consist of very short tokens, sometimes single characters or rare combinations, which the model struggles to predict confidently. Investigating where these words cluster, whether in labels, particular manuscript sections, or particular scribes, could provide insights into the structure of, or anomalies within, the text. To my knowledge, a comprehensive analysis of "non-conforming" words in the Voynich Manuscript has not yet been performed at this level of detail, so this approach offers a promising direction for further research. If you or anyone else is interested, I'd be happy to collaborate or share the tools I've developed so far.
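For anyone who wants to experiment with the ranking idea, here is a minimal sketch of per-word average loss computed with a trained character-level model and used to rank anomalous words and build the per-folio heatmap. The score_chars callable is an assumption standing in for whatever interface the notebooks expose; it is not the actual API:

def word_losses(words, score_chars):
    # words: list of (folio, word) pairs from the transliteration.
    # score_chars: assumed callable mapping a word (in context) to a list of
    # per-character negative log-likelihoods from the trained model.
    scored = []
    for folio, word in words:
        nll = score_chars(word)                    # one loss value per character
        scored.append((sum(nll) / len(nll), folio, word))
    return scored

def top_anomalous(scored, n=30):
    # the n words the model found hardest to predict (highest mean loss)
    return sorted(scored, reverse=True)[:n]

def folio_heatmap_values(scored):
    # mean loss per folio, the quantity plotted in the heatmap above
    totals, counts = {}, {}
    for loss, folio, _ in scored:
        totals[folio] = totals.get(folio, 0.0) + loss
        counts[folio] = counts.get(folio, 0) + 1
    return {f: totals[f] / counts[f] for f in totals}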