How LLM models try to understand Voynichese - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: How LLM models try to understand Voynichese (/thread-4803.html)
How LLM models try to understand Voynichese - quimqu - 16-07-2025

Dear Voynich Ninja community,

As you might know, I’ve been working on training Large Language Models (GPT-like models) on Voynich EVA transliterations. This is not about using ChatGPT, but about training language models from scratch using only Voynich EVA text.

I’m aware that GPT models are a sort of black box, and it’s often hard to understand the mechanisms they use to “learn” patterns. In this project, I’ve tried to explore how the GPT model makes its predictions, to gain some intuition into the decision-making process.

Let me first introduce the key concepts I’ve been working with:
What I did:

First, I optimized the model parameters to maximize the number of real bigrams and trigrams (n-grams) generated by the model. The results are similar to those obtained when training GPT on a real natural-language text.

Results after training on the Voynich EVA text:

% of 2-grams found in Voynich EVA with block_size 4: 22.40% (224/1000)
% of 3-grams found in Voynich EVA with block_size 4: 0.80% (8/999)

Then, I trained the model on all paragraph-style lines in the Voynich manuscript (i.e., excluding labels and isolated words from the cosmological sections). I used a 5-fold cross-validation approach (a rough sketch of the coverage check and of the fold split is given below):
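For anyone who wants to reproduce the idea, the sketch below illustrates the kind of n-gram coverage check described above. It is a simplified illustration, not the project's actual code: the toy word lists and the commented cross-validation lines are placeholders.

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) in a list of word tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_coverage(generated, corpus, n):
    """How many n-grams of the generated text also occur in the real corpus."""
    corpus_ngrams = set(ngrams(corpus, n))
    gen_ngrams = ngrams(generated, n)
    hits = sum(1 for g in gen_ngrams if g in corpus_ngrams)
    return hits, len(gen_ngrams)


if __name__ == "__main__":
    # Toy stand-ins: in the real experiment `corpus` is the EVA paragraph text
    # split into words, and `generated` is a ~1000-word sample from the trained model.
    corpus = "daiin chol shedy qokeedy chol daiin qokain shedy daiin".split()
    generated = "chol daiin qokain shedy daiin chol otedy".split()

    for n in (2, 3):
        hits, total = ngram_coverage(generated, corpus, n)
        print(f"% of {n}-grams found in corpus: {100 * hits / total:.2f}% ({hits}/{total})")

    # 5-fold cross-validation over the paragraph-style lines (sketch only):
    # from sklearn.model_selection import KFold
    # for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(lines):
    #     ...train a fresh model on the train lines, evaluate on the validation lines...
```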
I then visualized the predictions using HTML files (saliency_valset_voynich_1.html to saliency_valset_voynich_5.html).

Each word is annotated with three values:
This visualization makes it easy to see at a glance which words the model finds easier or harder to predict. The HTML is interactive: hovering over any word shows the three metrics mentioned above.

Deeper inspection:

I also created a second HTML file, context_saliency_colored_and_target.html, which looks like this:

[screenshot of the context saliency view]

This version shows for each word in the Voynich EVA paragraph:
The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe which previous words influenced the prediction the most, token by token (a rough sketch of how such per-word saliency values can be computed follows below). I highlighted:
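For anyone wanting to compute something similar: the per-word values can be obtained with standard gradient saliency, i.e. the gradient of the predicted token's logit with respect to the embeddings of the context words. The sketch below is a minimal PyTorch illustration of that technique; the model interface (token ids in, logits out) and the embedding-layer argument are assumptions made for the example, not necessarily the exact code behind the HTML files above.

```python
import torch


def context_saliency(model, embedding_layer, context_ids):
    """
    Vanilla gradient saliency: one non-negative score per context word,
    measuring how strongly that word influenced the model's next-word guess.

    Assumes `model` maps a (1, T) tensor of token ids to (1, T, vocab) logits
    and that `embedding_layer` is its token-embedding module; both are
    assumptions made for this sketch.
    """
    captured = {}

    def keep_embeddings(module, inputs, output):
        output.retain_grad()          # so .grad is populated on this non-leaf tensor
        captured["emb"] = output

    handle = embedding_layer.register_forward_hook(keep_embeddings)

    model.eval()
    ids = torch.tensor(context_ids).unsqueeze(0)   # (1, T), e.g. the last 4 words
    logits = model(ids)                            # (1, T, vocab_size)
    pred_id = logits[0, -1].argmax()               # the model's guess for the next word
    logits[0, -1, pred_id].backward()              # gradient of that guess

    handle.remove()

    # L2 norm of the embedding gradient at each position = saliency of that context word.
    return captured["emb"].grad.norm(dim=-1).squeeze(0).tolist()
```

Repeating this for every target word in a validation paragraph and coloring each context word by its score gives a view like the one in context_saliency_colored_and_target.html.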
I don't have any conclusions yet, but I think this could be useful for others interested in understanding how contextual information influences predictions in GPT-like models trained on Voynich EVA.

Let me know what you think; I'd love to hear your thoughts!


RE: How GPT-like models try to read Voynichese - oshfdk - 16-07-2025

(16-07-2025, 10:50 AM)quimqu Wrote: The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe which previous words influenced the prediction the most, token by token.

Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and are the most common words like "daiin" or "chol", so this looks like pure chance to me. If so, I'm not sure how much use the saliency is when the predictions themselves are wrong. For a model that successfully predicts a large percentage of words (without overfitting) it would be very interesting to know which information it uses, but not for a model that cannot predict much.


RE: How GPT-like models try to read Voynichese - quimqu - 16-07-2025

(16-07-2025, 12:03 PM)oshfdk Wrote: Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and are the most common words like "daiin" or "chol", so this looks like pure chance to me.

Yes, that's absolutely right; this also happens when I train a GPT model from scratch using just a single English book. It's expected: the corpus is extremely small, and unfortunately, that's all we have for the Voynich Manuscript.

This project was never intended to produce a model that "understands" Voynichese: training a language model on a single book is not enough to learn a true grammar or semantics. However, I believe that even the limited learning the model does achieve may offer insights into the internal structure or patterns of the language. That was the sole aim of this work: not to prove comprehension, but to explore which elements the model finds predictable, and how those predictions are distributed across the text.

For example, it's interesting to see that at the beginning of the Voynich, the model (although mostly wrong) tends to predict common tokens like daiin, chol, etc. Later in the manuscript, it shifts toward predicting more qokeedy, qokain, and similar forms. So even though the predictions are mostly incorrect, the model's behavior seems to depend on the region of the Voynich it is processing.


RE: How GPT-like models try to read Voynichese - Jorge_Stolfi - 16-07-2025

(16-07-2025, 12:03 PM)oshfdk Wrote: Am I reading this correctly? It looks like GPT cannot successfully predict the fifth word knowing the previous 4 words. Correct predictions seem very rare and are the most common words like "daiin" or "chol", so this looks like pure chance to me.

I [wrote about this] a few days ago. To train any word predictor that uses the last K words, you need a corpus in which all possible sequences of K words have a good chance of appearing. If the lexicon has 1000 equally likely words, the corpus needed for that training must have on the order of 1000 to the power K words.
The number will be smaller if the lexicon words have a Zipf-like frequency distribution, but not much smaller.

If you insist that the predictor must use the last 4 words, and you train it with half of the VMS, you end up with a predictor that just copies long stretches of that training text, with only occasional switches when it jumps from one piece of the training set to another. Namely, those switches will occur only when the generated output has K consecutive words that occur more than once in the training set -- which will be very few of them.

If you use a black-box predictor, you never know how many past words it actually uses. Or what the hell it is doing.

All the best, --jorge


RE: How GPT-like models try to read Voynichese - Jorge_Stolfi - 16-07-2025

(16-07-2025, 11:18 PM)Jorge_Stolfi Wrote: To train any word predictor that uses the last K words, you need a corpus in which all possible sequences of K words have a good chance of appearing. If the lexicon has 1000 equally likely words, the corpus needed for that training must have on the order of 1000 to the power K words.

You may get more interesting results if you first try to partition the lexicon into a small number -- say 20 -- of sets of words that are similar in their immediate context (1-2 words before and/or after, at most). Then you map each word in the corpus to the index of its subset, so the corpus becomes a "text" with the same length but whose vocabulary is the numbers 1 to 20. Then you train a predictor to use the last 4 words of that text. For that, a corpus with on the order of 20^4 = 160000 words should be adequate.

Then you take the output of that predictor and replace each index by a random word from the corresponding set, with the appropriate frequencies. You could complicate this last step by making the choice of a word from each set depend on the previous 2-3 generated indices, or on the previous generated word. (But not on the last 4 indices, because then you would run into the same problem of insufficient examples to build the necessary probability tables.)

And anyway you should definitely map similar glyphs to the same glyph: a->o, r->s, ch->ee, etc. Anyone who has tried to transcribe a significant amount of Voynichese surely must have concluded that 10% or more of those letters are just wrong -- the transcription says "o" when the Author meant "a", etc.

All the best, --jorge


RE: How GPT-like models try to read Voynichese - quimqu - 17-07-2025

Hi Jorge,

Thank you for your comment. Just to clarify: the goal of this work was never to focus on prediction accuracy itself. In fact, it was inspired by some of your earlier remarks on my posts, especially about the "black box" nature of GPT models. What I've been trying to do is open up that box a little and see what's going on inside.

That's why I've focused on saliency: to visualize how much weight the model gives to each of the previous words when predicting the next one. I'm not trying to optimize performance, but rather to observe whether the model has learned any meaningful structure, or is just regurgitating memorized sequences.

What's interesting (and a bit puzzling, even if it is something that is also found in natural languages) is how the saliency varies with the word's position within the block_size window. The same word has a different influence on the prediction of the next word at different positions.
For example, in one of the saliency views, the word choiin has a high saliency at position 4 for predicting chol, but at position 3 (the next position) it has the lowest saliency, for predicting cphey, and then its saliency increases again at positions 2 and 1.

I hope that visualizing the MS in this way can give some new perspectives.

This experiment was also inspired by something René once said: that instead of trying to decipher the Voynich, we should aim to understand the mechanism used to generate the language. So in a way, this work is a blend of your idea about model transparency and René's hypothesis about language generation (and trying to find its mechanism).

Thanks again!