Dear Voynich Ninja community,
As you might know, I’ve been working on training Large Language Models (GPT-like models) on Voynich EVA transliterations. This is not about using ChatGPT, but about training language models from scratch using only Voynich EVA text.
I’m aware that GPT models are a sort of black box, and it’s often hard to understand the mechanisms they use to “learn” patterns. In this project, I’ve tried to explore how the GPT model makes predictions — to gain some intuition into the decision-making process.
Let me first introduce the key concepts I’ve been working with:
- Loss: Loss is a measure of how wrong the model's predictions are compared to the actual next word. In language models, it's typically cross-entropy loss, which penalizes the model more when it assigns low probability to the correct word. A lower loss means the model is better at predicting the next token given its context.
- Prediction: The prediction is the model’s guess for the next word in a sequence. For example, given a context of 4 tokens (block_size = 4), the model looks at those 4 tokens and outputs a probability distribution over the vocabulary, selecting the most likely next token.
- Saliency: Saliency refers to how much each input token contributes to the model’s prediction. If we use a block_size of 4, saliency tells us which of the 4 previous tokens had the most influence on predicting the next word. For example, in the sequence ["the", "brown", "cat", "sat"] → ?, the model might predict "on". Saliency would then indicate how important each of the previous tokens was in making that prediction. Tokens with higher saliency are considered more influential.
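To make the saliency concept concrete, here is a minimal sketch of one common way to compute it: the gradient of the loss with respect to each input token's embedding, reduced to a per-token norm. The tiny embedding-plus-linear "model" below is purely illustrative (my own stand-in, not the actual GPT used in the project), but the recipe is the same for a real transformer.

```python
# Sketch: token saliency via input-gradient norms.
# The tiny model here is a hypothetical stand-in for the real GPT.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, block_size = 10, 8, 4

embed = nn.Embedding(vocab_size, embed_dim)
head = nn.Linear(embed_dim * block_size, vocab_size)

context = torch.tensor([[1, 4, 2, 7]])   # the 4 previous tokens
target = torch.tensor([3])               # the actual next token

emb = embed(context)                     # (1, 4, embed_dim)
emb.retain_grad()                        # keep gradients on the embeddings
logits = head(emb.flatten(1))            # next-token distribution
loss = nn.functional.cross_entropy(logits, target)
loss.backward()

# Saliency of each context position = L2 norm of its embedding gradient.
saliency = emb.grad.norm(dim=-1).squeeze(0)
print(saliency)  # 4 values, one per context token; larger = more influential
```

Other saliency definitions exist (attention weights, integrated gradients); gradient norms are just the simplest to sketch.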
What I did:
First, I optimized model parameters to maximize the number of real bigrams and trigrams (n-grams) generated by the model. The results are similar to those obtained when training GPT on real natural-language text. Results after training on Voynich EVA text:
% of 2-grams found in Voynich EVA with block_size 4: 22.40% (224/1000)
% of 3-grams found in Voynich EVA with block_size 4: 0.80% (8/999)
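For clarity, this is my reading of the metric above: generate a stream of tokens, then count what fraction of the generated n-grams also occur somewhere in the real corpus. A minimal sketch (the function name and toy data are mine, not from the project code):

```python
# Sketch of the n-gram overlap metric: fraction of generated n-grams
# that also appear in the reference corpus.
def ngram_hit_rate(generated, corpus, n):
    corpus_ngrams = {tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)}
    gen_ngrams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    hits = sum(g in corpus_ngrams for g in gen_ngrams)
    return hits / len(gen_ngrams), hits, len(gen_ngrams)

# Toy example with EVA-like words (illustrative only).
corpus = "daiin chedy qokeedy daiin shedy qokeedy chedy daiin".split()
generated = "daiin chedy qokain shedy qokeedy".split()
rate, hits, total = ngram_hit_rate(generated, corpus, 2)
print(f"{rate:.2%} ({hits}/{total})")  # → 50.00% (2/4)
```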
Then, I trained the model on all paragraph-style lines in the Voynich manuscript (i.e., excluding labels or isolated words from cosmological sections). I used a 5-fold cross-validation approach:
- I split the text into 5 segments. For each fold, I used 80% of the data for training and 20% for validation, rotating through all segments.
- This way, I could generate predictions for the entire corpus.
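The rotation described above can be sketched as follows; this is a minimal hand-rolled version (libraries like scikit-learn's `KFold` do the same thing), not the project's actual split code:

```python
# Minimal sketch of the 5-fold rotation: each fold holds out a different
# 20% segment for validation and trains on the remaining 80%.
def five_fold_splits(lines, k=5):
    size = len(lines) // k
    for fold in range(k):
        lo = fold * size
        hi = (fold + 1) * size if fold < k - 1 else len(lines)
        val = lines[lo:hi]                 # held-out 20% segment
        train = lines[:lo] + lines[hi:]    # remaining 80%
        yield train, val

lines = [f"line_{i}" for i in range(10)]
splits = list(five_fold_splits(lines))
for train, val in splits:
    assert len(train) + len(val) == len(lines)
```

Because every line lands in exactly one validation segment, concatenating the five validation predictions covers the whole corpus.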
I then visualized the predictions using HTML files (saliency_valset_voynich_1.html to saliency_valset_voynich_5.html).
Each word is annotated with three values:
- Loss: represented by the border thickness — thicker means higher loss.
- Saliency: represented by the background color intensity — darker means higher saliency. Since each word is part of 4 prediction contexts (due to block_size = 4), saliency here is averaged over those 4 instances.
- Prediction probability: represented by border color — green for high confidence, red for low. The predicted probabilities are generally low, but this is also the case when training GPT on small corpora like a single book, even in natural languages.
This visualization makes it easy to see at a glance which words the model finds easier or harder to predict. The HTML is interactive — hovering over any word shows the 3 metrics mentioned above.
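For anyone curious how such an annotation can be produced, here is a hypothetical sketch of mapping the three per-word metrics to inline CSS (border width for loss, background opacity for saliency, border color from red to green for prediction probability). The scaling constants are my own guesses; the actual HTML files may use different ones.

```python
# Hypothetical sketch: map (loss, saliency, probability) to an inline-styled
# HTML span, roughly as described in the post.
def word_span(word, loss, saliency, prob):
    border_px = 1 + round(3 * min(loss / 8.0, 1.0))    # thicker = higher loss
    alpha = round(min(saliency, 1.0), 2)               # darker = more salient
    r = round(255 * (1 - prob))                        # red for low confidence
    g = round(255 * prob)                              # green for high confidence
    style = (f"border:{border_px}px solid rgb({r},{g},0);"
             f"background:rgba(0,0,128,{alpha});")
    # The title attribute gives the hover tooltip with the 3 metrics.
    title = f"loss={loss:.2f} saliency={saliency:.2f} p={prob:.3f}"
    return f'<span style="{style}" title="{title}">{word}</span>'

print(word_span("daiin", loss=2.3, saliency=0.4, prob=0.12))
```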
Deeper inspection:
I also created a second HTML file, context_saliency_colored_and_target.html.
This version shows for each word in the Voynich EVA paragraph:
- context_0 to context_3: the 4 previous tokens used as input (the model's context).
- target: the real next word in the sequence.
- pred_word: the word predicted by the model.
The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe, token by token, which previous words influenced the prediction the most.
I highlighted:
- green: when pred_word == target
- yellow: similar words according to Levenshtein similarity (> 0.5)
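The yellow highlight uses normalized Levenshtein similarity, i.e. edit distance scaled to [0, 1], thresholded at 0.5. A self-contained sketch (pure Python, no external library; the function names are mine):

```python
# Levenshtein distance via the standard two-row dynamic programming table,
# then normalized to a similarity in [0, 1].
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("chedy", "shedy"))  # one substitution out of 5 chars → 0.8
```

With the 0.5 threshold, "chedy" vs "shedy" (similarity 0.8) would be highlighted yellow.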
I don't have any conclusions yet, but I think this could be useful for others interested in understanding how contextual information influences predictions in GPT-like models trained on Voynich EVA.
Let me know what you think — I’d love to hear your thoughts!