The Voynich Ninja

Full Version: How LLMs try to understand Voynichese
Dear Voynich Ninja community,

As you might know, I’ve been working on training Large Language Models (GPT-like models) on Voynich EVA transliterations. This is not about using ChatGPT, but about training language models from scratch, using only the Voynich EVA text.

I’m aware that GPT models are something of a black box, and it’s often hard to understand the mechanisms they use to “learn” patterns. In this project, I’ve tried to explore how the GPT model makes its predictions, to gain some intuition into its decision-making process.

Let me first introduce the key concepts I’ve been working with:
  • Loss: Loss is a measure of how wrong the model's predictions are compared to the actual next word. In language models, it's typically cross-entropy loss, which penalizes the model more when it assigns low probability to the correct word. A lower loss means the model is better at predicting the next token given its context.
  • Prediction: The prediction is the model’s guess for the next word in a sequence. For example, given a context of 4 tokens (block_size = 4), the model looks at those 4 tokens and outputs a probability distribution over the vocabulary, selecting the most likely next token.
  • Saliency: Saliency refers to how much each input token contributes to the model’s prediction. If we use a block_size of 4, saliency tells us which of the 4 previous tokens had the most influence on predicting the next word. For example, in the sequence ["the", "brown", "cat", "sat"] → ?, the model might predict "on". Saliency would then indicate how important each of the previous tokens was in making that prediction. Tokens with higher saliency are considered more influential.
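To make “saliency” concrete, here is a minimal sketch of one standard way to compute it: the gradient of the predicted token’s score with respect to the input embeddings. It assumes a small PyTorch GPT; the names model.tok_emb and forward_from_embeddings are hypothetical stand-ins for the embedding layer and for running the rest of the network, not part of any standard API.

[code]
import torch

def token_saliency(model, context_ids):
    # context_ids: tensor of shape (1, block_size) holding the context tokens.
    # model.tok_emb and model.forward_from_embeddings are hypothetical names.
    emb = model.tok_emb(context_ids)             # (1, block_size, d_model)
    emb.retain_grad()                            # keep gradients on a non-leaf tensor
    logits = model.forward_from_embeddings(emb)  # (1, block_size, vocab_size)
    logits[0, -1].max().backward()               # score of the predicted next token
    return emb.grad[0].norm(dim=-1)              # one saliency value per context token
[/code]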

What I did:

First, I tuned the model parameters to maximize the number of real bigrams and trigrams (n-grams) generated by the model. The results are similar to those obtained when training a GPT on real natural-language text. Results after training on the Voynich EVA text:

% of 2-grams found in Voynich EVA with block_size 4: 22.40% (224/1000) 
% of 3-grams found in Voynich EVA with block_size 4: 0.80% (8/999)
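The check itself is simple. As a minimal sketch (with generated and corpus as plain Python lists of EVA words; my actual evaluation code may differ in details):

[code]
def ngram_hit_rate(generated, corpus, n):
    # Fraction of n-grams in the generated token list that occur
    # anywhere in the real Voynich EVA corpus.
    corpus_ngrams = {tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)}
    gen_ngrams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    hits = sum(g in corpus_ngrams for g in gen_ngrams)
    return hits, len(gen_ngrams)  # e.g. 224 hits out of 1000 bigrams above
[/code]

(Note that the denominators above, 1000 bigrams and 999 trigrams, correspond to a generated sample of 1001 tokens.)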

Then, I trained the model on all paragraph-style lines in the Voynich manuscript (i.e., excluding labels or isolated words from cosmological sections). I used a 5-fold cross-validation approach:
  • I split the text into 5 segments. For each fold, I used 80% of the data for training and 20% for validation, rotating through all segments.
  • This way, I could generate predictions for the entire corpus.
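In code, the rotation looks roughly like this (a sketch, assuming the corpus is held as a list of paragraph lines):

[code]
def five_fold_splits(lines, k=5):
    # Rotate a held-out 20% validation segment through the corpus so that
    # every line lands in the validation set of exactly one fold.
    size = len(lines) // k
    for fold in range(k):
        lo = fold * size
        hi = (fold + 1) * size if fold < k - 1 else len(lines)
        yield lines[:lo] + lines[hi:], lines[lo:hi]  # (train, validation)
[/code]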

I then visualized the predictions using HTML files (saliency_valset_voynich_1.html to saliency_valset_voynich_5.html).


[Image: YYIIL2c.png]

Each word is annotated with three values:
  • Loss: represented by the border thickness — thicker means higher loss.
  • Saliency: represented by the background color intensity — darker means higher saliency. Since each word is part of 4 prediction contexts (due to block_size = 4), saliency here is averaged over those 4 instances.
  • Prediction probability: represented by border color — green for high confidence, red for low. The predicted probabilities are generally low, but this is also the case when training GPT on small corpora like a single book, even in natural languages.

This visualization makes it easy to see at a glance which words the model finds easier or harder to predict. The HTML is interactive — hovering over any word shows the 3 metrics mentioned above.
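For anyone curious how such a page can be produced, here is a minimal sketch that renders one annotated word as an HTML span. The scaling constants are illustrative, not the ones used in my actual files.

[code]
import html

def annotate_word(word, loss, saliency, prob):
    # Border thickness encodes loss, background darkness encodes saliency,
    # and border colour runs from red (low probability) to green (high).
    border = 1 + round(3 * min(loss / 8.0, 1.0))    # illustrative loss scale
    grey = round(255 * (1.0 - min(saliency, 1.0)))  # darker = more salient
    r, g = round(255 * (1.0 - prob)), round(255 * prob)
    tip = f"loss={loss:.2f} saliency={saliency:.2f} p={prob:.3f}"  # hover text
    return (f'<span title="{tip}" style="border:{border}px solid rgb({r},{g},0); '
            f'background:rgb({grey},{grey},{grey})">{html.escape(word)}</span>')
[/code]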

Deeper inspection:

I also created a second HTML file, context_saliency_colored_and_target.html, which looks like this:


[Image: BNnJ5O6.png]

This version shows for each word in the Voynich EVA paragraph:
  • context_0 to context_3: the 4 previous tokens used as input (the model's context).
  • target: the real next word in the sequence.
  • pred_word: the word predicted by the model.

The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe which previous words influenced the prediction the most, token by token.

I highlighted:
  • green: when pred_word == target
  • yellow: similar words according to Levenshtein similarity (>0.5)
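Concretely, the similarity can be computed along these lines (a minimal sketch; normalizing the edit distance by the longer word's length is one common convention):

[code]
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two words.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalized to [0, 1]; pairs scoring above 0.5 are highlighted yellow.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# similarity("qokeedy", "qokedy") == 1 - 1/7 ≈ 0.86, so this pair would be yellow
[/code]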

I don't have any conclusions yet, but I think this could be useful for others interested in understanding how contextual information influences predictions in GPT-like models trained on Voynich EVA.

Let me know what you think — I’d love to hear your thoughts!
(16-07-2025, 10:50 AM)quimqu Wrote: The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe which previous words influenced the prediction the most, token by token.

I highlighted:
• green: when pred_word == target
• yellow: when more than 50% of the characters in pred_word match the target (partial match)

Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and tend to be the most common words, like "daiin" or "chol", so this looks like pure chance to me.

If so, I'm not sure how much use the saliency is when the predictions themselves are wrong. For a model that successfully predicts a large percentage of words (without overfitting), it would be very interesting to know which information it uses, but not for a model that cannot predict much.
(16-07-2025, 12:03 PM)oshfdk Wrote: Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and tend to be the most common words, like "daiin" or "chol", so this looks like pure chance to me.

Yes, that's absolutely right — this also happens when I train a GPT model from scratch using just a single English book. It's expected: the corpus is extremely small, and unfortunately, that's all we have for the Voynich Manuscript.
This project was never intended to produce a model that "understands" Voynichese — training a language model on a single book is not enough to learn a true grammar or semantics. However, I believe that even the limited learning the model does achieve may offer insights into the internal structure or patterns of the language.
That was the sole aim of this work: not to prove comprehension, but to explore which elements the model finds predictable, and how those predictions are distributed across the text.

Still, it's interesting to see that at the beginning of the Voynich the model (although mostly wrong) tends to predict common tokens like daiin, chol, etc. Later in the manuscript, it shifts toward predicting qokeedy, qokain, and similar forms. So even though the predictions are mostly incorrect, the model’s behavior seems to depend on the region of the Voynich it is processing.
(16-07-2025, 12:03 PM)oshfdk Wrote: Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and tend to be the most common words, like "daiin" or "chol", so this looks like pure chance to me.

I wrote about this a few days ago.  To train any word predictor that uses the last K words, you need a corpus in which all possible sequences of K words have a good chance of appearing.  If the lexicon has 1000 equally likely words, the corpus needed for that training must have on the order of 1000 to the power K words.  With K = 4, that is on the order of 10^12 words, vastly more than the few tens of thousands of words in the whole VMS.  The number will be smaller if the lexicon words have a Zipf-like frequency distribution; but not much smaller.

If you insist that the predictor must use the last 4 words, and you train it with half of the VMS, you end up with a predictor that just copies long stretches of that training text, with only occasional switches when it jumps from one piece of the training set to another.  Namely, those switches will occur only when the generated output has K consecutive words that occur more than once in the training set -- which will be very few of them.

If you use a black-box predictor, you never know how many past words it actually uses.  Or what the hell it is doing.

All the best, --jorge
(16-07-2025, 11:18 PM)Jorge_Stolfi Wrote: To train any word predictor that uses the last K words, you need a corpus in which all possible sequences of K words have a good chance of appearing.  If the lexicon has 1000 equally likely words, the corpus needed for that training must have on the order of 1000 to the power K words.

You may get more interesting results if you first try to partition the lexicon into a small number -- say 20 -- of sets of words that are similar in their immediate context (the 1-2 words before and/or after, at most). Then you map each word in the corpus to the index of its subset, so the corpus becomes a "text" with the same length but whose vocabulary is the numbers 1 to 20.  Then you train a predictor to use the last 4 words of that text. For that, a corpus with on the order of 20^4 = 160000 words should be adequate. Then you take the output of that predictor and replace each index by a random word in the corresponding set, with the appropriate frequencies.

You could complicate this last step by making the choice of a word from each set depend on the previous 2-3 generated indices, or the previous generated word.  (But not on the last 4 indices, because then you would run into the same problem of insufficient examples to build the necessary probability tables.)
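To make the partition step concrete, here is a minimal sketch of one possible way to build it (k-means over immediate left/right neighbour counts; the clustering method here is just one arbitrary choice among many):

[code]
from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

def context_classes(tokens, n_classes=20, n_features=200):
    # Describe each word by how often each of the n_features most common
    # words appears immediately before or after it, then cluster those
    # context profiles into n_classes word sets.
    common = [w for w, _ in Counter(tokens).most_common(n_features)]
    col = {w: i for i, w in enumerate(common)}
    profile = defaultdict(lambda: np.zeros(2 * len(common)))
    for left, right in zip(tokens, tokens[1:]):
        if left in col:
            profile[right][col[left]] += 1                # left-neighbour counts
        if right in col:
            profile[left][len(common) + col[right]] += 1  # right-neighbour counts
    words = list(profile)
    X = np.array([profile[w] for w in words])
    X /= np.maximum(X.sum(axis=1, keepdims=True), 1.0)    # normalise each profile
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    return dict(zip(words, labels))                       # word -> class index
[/code]

The corpus then becomes the index sequence [classes[w] for w in tokens], which is what the 4-word predictor would be trained on.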

And anyway you should definitely map similar glyphs to the same glyph: a->o, r->s, ch->ee, etc. Anyone who has tried to transcribe a significant amount of Voynichese surely must have concluded that 10% or more of those letters are just wrong -- the transcription says "o" when the Author meant "a", etc.
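Concretely, something along these lines (the merge table holds just the examples above and is purely illustrative):

[code]
# Collapse easily-confused EVA glyphs before training; multi-glyph rules
# such as ch->ee must be applied before the single-glyph ones.
MERGES = [("ch", "ee"), ("a", "o"), ("r", "s")]

def normalize_eva(text):
    for src, dst in MERGES:
        text = text.replace(src, dst)
    return text

print(normalize_eva("chol daiin"))  # -> eeol doiin
[/code]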

All the best, --jorge
Hi Jorge,

Thank you for your comment. Just to clarify — the goal of this work was never to focus on prediction accuracy itself. In fact, it was inspired by some of your earlier remarks on my posts, especially about the "black box" nature of GPT models.

What I’ve been trying to do is open up that box a little and see what’s going on inside. That’s why I’ve focused on saliency — to visualize how much weight the model gives to each of the previous words when predicting the next one. I'm not trying to optimize performance, but rather to observe whether the model has learned any meaningful structure, or is just regurgitating memorized sequences.

What’s interesting (and a bit puzzling, even though it is also found in natural languages) is how the saliency varies with the word’s position inside the context window (block_size). The same word has a different influence on the prediction of the next word at different positions. For example, here:

[Image: FFsXyNP.png]

The word choiin has a high saliency at position 4 for predicting chol, but at position 3 (the next position) it has the lowest saliency for predicting cphey, and its saliency increases again at positions 2 and 1. I hope that visualizing the manuscript this way can give some new perspectives.

This experiment was also inspired by something René once said — that instead of trying to decipher the Voynich, we should aim to understand the mechanism used to generate the language. So in a way, this work is a blend of your idea about model transparency and René’s hypothesis about language generation (and trying to find its mechanism).

Thanks again!