| How LLM models try to understand Voynichese |
Posted by: quimqu - 16-07-2025, 10:50 AM - Forum: Analysis of the text - Replies (6)
Dear Voynich Ninja community,
As you might know, I’ve been working on training Large Language Models (GPT-like models) on Voynich EVA transliterations. This is not about using ChatGPT, but about training language models from scratch using only Voynich EVA text.
I’m aware that GPT models are a sort of black box, and it’s often hard to understand the mechanisms they use to “learn” patterns. In this project, I’ve tried to explore how the GPT model makes predictions — to gain some intuition into the decision-making process.
Let me first introduce the key concepts I’ve been working with:
- Loss: Loss is a measure of how wrong the model's predictions are compared to the actual next word. In language models, it's typically cross-entropy loss, which penalizes the model more when it assigns low probability to the correct word. A lower loss means the model is better at predicting the next token given its context.
- Prediction: The prediction is the model’s guess for the next word in a sequence. For example, given a context of 4 tokens (block_size = 4), the model looks at those 4 tokens and outputs a probability distribution over the vocabulary, selecting the most likely next token.
- Saliency: Saliency refers to how much each input token contributes to the model’s prediction. If we use a block_size of 4, saliency tells us which of the 4 previous tokens had the most influence on predicting the next word. For example, in the sequence ["the", "brown", "cat", "sat"] → ?, the model might predict "on". Saliency would then indicate how important each of the previous tokens was in making that prediction. Tokens with higher saliency are considered more influential.
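To make the saliency idea concrete, here is a minimal sketch of gradient-based saliency, using a toy stand-in model rather than the actual nanoGPT from the experiments: the loss for one prediction is backpropagated to the embeddings of the 4 context tokens, and the gradient magnitude per token is read off as its influence. All sizes and token ids below are made up for illustration.

```python
# Toy sketch of gradient-based saliency (stand-in model, not the real GPT).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, BLOCK = 100, 32, 4  # hypothetical sizes

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM * BLOCK, VOCAB)

    def forward(self, idx):              # idx: (BLOCK,) token ids
        e = self.emb(idx)                # (BLOCK, DIM)
        e.retain_grad()                  # keep gradients on the context embeddings
        logits = self.head(e.flatten())  # (VOCAB,) next-token scores
        return logits, e

model = ToyModel()
context = torch.tensor([3, 17, 42, 8])   # 4 previous tokens (made-up ids)
target = torch.tensor(5)                 # the real next token (made-up id)

logits, e = model(context)
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()

probs = F.softmax(logits, dim=-1)
prediction = probs.argmax().item()       # the model's guess for the next token
saliency = e.grad.norm(dim=-1)           # one influence value per context token
print(loss.item(), prediction, saliency.tolist())
```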
What I did:
First, I optimized the model parameters to maximize the number of real bigrams and trigrams (n-grams) generated by the model. The results are similar to those obtained when training GPT on real natural-language text. Results after training on the Voynich EVA text (a sketch of the n-gram check follows the numbers):
% of 2-grams found in Voynich EVA with block_size 4: 22.40% (224/1000)
% of 3-grams found in Voynich EVA with block_size 4: 0.80% (8/999)
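For reference, the n-gram check itself can be sketched as follows; this is a simplified illustration with placeholder EVA words, whereas the real evaluation used the full corpus and 1000 generated tokens.

```python
# Compute what fraction of generated n-grams also occur in the training corpus.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pct_found(generated, corpus, n):
    corpus_set = set(ngrams(corpus, n))
    gen = ngrams(generated, n)
    hits = sum(1 for g in gen if g in corpus_set)
    return 100.0 * hits / len(gen), hits, len(gen)

corpus_tokens = "daiin chedy qokeedy shedy daiin chedy".split()     # placeholder EVA words
generated_tokens = "chedy qokeedy daiin otedy chedy daiin".split()  # placeholder model output

for n in (2, 3):
    pct, hits, total = pct_found(generated_tokens, corpus_tokens, n)
    print(f"{n}-grams found: {pct:.2f}% ({hits}/{total})")
```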
Then, I trained the model on all paragraph-style lines in the Voynich manuscript (i.e., excluding labels or isolated words from cosmological sections). I used a 5-fold cross-validation approach:
- I split the text into 5 segments. For each fold, I used 80% of the data for training and 20% for validation, rotating through all segments (see the sketch after this list).
- This way, I could generate predictions for the entire corpus.
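A minimal sketch of that split, assuming the paragraph lines are kept in order and the folds are contiguous, unshuffled segments; scikit-learn's KFold is used here purely for illustration.

```python
# 5-fold split over the paragraph lines: each fold trains on ~80% and predicts the other ~20%.
from sklearn.model_selection import KFold

lines = [f"line_{i}" for i in range(100)]   # placeholder for the EVA paragraph lines

kf = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, val_idx) in enumerate(kf.split(lines), start=1):
    train = [lines[i] for i in train_idx]   # used for training
    val = [lines[i] for i in val_idx]       # used for validation / prediction
    print(f"fold {fold}: {len(train)} train lines, {len(val)} validation lines")
```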
I then visualized the predictions using HTML files (saliency_valset_voynich_1.html to saliency_valset_voynich_5.html).
![[Image: YYIIL2c.png]](https://i.imgur.com/YYIIL2c.png)
Each word is annotated with three values:
- Loss: represented by the border thickness — thicker means higher loss.
- Saliency: represented by the background color intensity — darker means higher saliency. Since each word is part of 4 prediction contexts (due to block_size = 4), saliency here is averaged over those 4 instances.
- Prediction probability: represented by border color — green for high confidence, red for low. The predicted probabilities are generally low, but this is also the case when training GPT on small corpora like a single book, even in natural languages.
This visualization makes it easy to see at a glance which words the model finds easier or harder to predict. The HTML is interactive: hovering over any word shows the three metrics mentioned above (a rough sketch of this mapping follows).
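For those curious about how the three metrics could be turned into styled spans, here is a rough sketch; the color scale, scaling constants and function names are my own guesses for illustration, not the script behind the linked files.

```python
# Map loss, saliency and prediction probability onto span styling (illustrative only).
def word_span(word, loss, saliency, prob, max_loss=10.0):
    border_px = 1 + round(3 * min(loss / max_loss, 1.0))   # thicker border = higher loss
    bg_alpha = round(min(saliency, 1.0), 2)                # darker background = higher saliency
    r, g = round(255 * (1 - prob)), round(255 * prob)      # red = low confidence, green = high
    title = f"loss={loss:.2f}, saliency={saliency:.2f}, p={prob:.3f}"
    return (f'<span title="{title}" style="border:{border_px}px solid rgb({r},{g},0);'
            f'background:rgba(70,70,200,{bg_alpha});padding:2px;margin:2px;">{word}</span>')

html = " ".join([
    word_span("daiin", loss=2.1, saliency=0.8, prob=0.31),   # placeholder values
    word_span("chedy", loss=6.7, saliency=0.2, prob=0.04),
])
print(html)
```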
Deeper inspection:
I also created a second HTML file, context_saliency_colored_and_target.html.
This version shows, for each word in the Voynich EVA paragraphs:
- context_0 to context_3: the 4 previous tokens used as input (the model's context).
- target: the real next word in the sequence.
- pred_word: the word predicted by the model.
The model tends to predict the most frequent words in the Voynich corpus, as expected. However, the saliency values let us observe which previous words influenced the prediction the most, token by token.
I highlighted:
- green: when pred_word == target
- yellow: when pred_word and target are similar according to Levenshtein similarity (> 0.5)
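The similarity test can be sketched like this; the 0.5 threshold is from the description above, while the exact normalization (1 minus distance divided by the longer word's length) is an assumption made for illustration.

```python
# Normalized Levenshtein similarity and the green/yellow highlighting rule.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

pred_word, target = "chedy", "shedy"        # placeholder EVA words
if pred_word == target:
    colour = "green"
elif similarity(pred_word, target) > 0.5:
    colour = "yellow"
else:
    colour = None
print(similarity(pred_word, target), colour)
```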
I don't have any conclusions yet, but I think this could be useful for others interested in understanding how contextual information influences predictions in GPT-like models trained on Voynich EVA.
Let me know what you think — I’d love to hear your thoughts!
| A good match, perhaps from the Zürich area... |
Posted by: ReneZ - 15-07-2025, 10:50 AM - Forum: Marginalia - Replies (47)
Just to highlight this interesting post and give it its own thread:
(13-07-2025, 08:34 PM) magnesium Wrote: I know this is slightly off-topic, but as we're poking around digitized Swiss archives: Koen, if you haven't seen it already, this manuscript is an extremely good reference for the handwriting of the [link] and [link] marginalia: [link]
This is not only an interesting match for the marginalia, but there are several cases of 'm' characters, especially left of the illustration, that look very similar to the Voynich 'iin', in the version of Scribes 1 and/or 5.
This may seem a frivolous comparison, and I am not really suggesting that the scribe of this MS is the same as one of the scribes of the Voynich MS, but I have not seen this type of curl 'upward and backward' too often.
| f69r circle |
Posted by: magnesium - 14-07-2025, 03:19 PM - Forum: Imagery - Replies (9)
I'm not making any definitive claims here, but I wanted to point out a superficial resemblance between the 16-wedge circle on f69r and this circular diagram of the 16 geomantic figures in the following 15th-century divination/astrology manual:
St. Gallen, Stiftsbibliothek, Cod. Sang. 756: Composite manuscript on geomancy, chiromancy, iatromathematics, astronomy, alchemy and medicine.
![[Image: nJEjXYw.png]](https://i.imgur.com/nJEjXYw.png)
The f69r circle, for reference:
![[Image: NrTZ6CH.jpeg]](https://i.imgur.com/NrTZ6CH.jpeg)
The 16-wedge subdivision of the circle and the central floral motif stood out to me. However, the most glaring difference between the two is obviously that the figures themselves are missing from f69r. It's piling speculation on speculation, but I have half a mind that one of the VMS authors saw something like the Cod. Sang. 756 figure, had no knowledge of geomancy, and then either tried to draw something similar from memory or verbally described the figure to the illustrator.
| A Beekeeping-Based Hypothesis for the Voynich Manuscript |
Posted by: えすじーけけ - 14-07-2025, 02:18 PM - Forum: Theories & Solutions - Replies (6)
Hello everyone,
I’m a newcomer to Voynich studies, but I’d like to share a hypothesis I developed after closely examining publicly available images of the manuscript. Please keep in mind I’m not a specialist — this is just an idea from an outsider's perspective — but I’d sincerely appreciate any feedback.
My hypothesis is that the manuscript may be centered around beekeeping, with symbolic and possibly sacred aspects attached to bees.
Here’s a brief summary of my reasoning:
- The botanical pages could represent either plants grown near beehives or flowers favored by bees for nectar. In some drawings, the roots seem exaggerated, possibly indicating how “attractive” the plant is to bees (e.g., stronger roots = more nectar?).
- Many of the female figures might represent bees — particularly worker bees or queens — often depicted immersed in fluid, holding objects (perhaps pollen or tools), or emerging from pipes (perhaps hive entrances).
- The spiral or rosette diagrams might be stylized cross-sections of hives, showing their inner structure or seasonal changes.
- The astronomical sections might represent the cycle of queen production, honey storage patterns, or symbolic relationships between bees and celestial patterns — such as star positions during swarming.
- The lack of realistic depiction in many plants could reflect a bee’s visual world (colors from above vs. below, petal symmetry, etc.), not a human herbalist’s.
This hypothesis may also offer explanations for:
- Why some women are crowned or veiled (perhaps symbolizing queens),
- Why the text is written so carefully (a sacred or secret manuscript about bees?),
- Why there is only one copy (a ritual or private use?).
Of course, I can’t interpret the script, and I realize this is speculative. But I found it interesting that so many otherwise disconnected elements can align under a beekeeping framework.
Thanks so much for your time — and I’d be grateful for your thoughts, corrections, or even counterexamples!
By the way, I'm not a native English speaker, so I apologize for any awkward phrasing or mistakes.
| Structural and Reverse-Cipher Hypothesis |
Posted by: Capric9ne - 14-07-2025, 02:09 PM - Forum: The Slop Bucket - Replies (5)
Dear Ninja Team,
My name is Hakan Adıbelli, and I’ve been conducting an independent deep-structure analysis of the Voynich Manuscript based on alchemical logic, Fibonacci sequencing, and symbolic fusion cycles across the folios.
I believe I may have uncovered a key principle underlying the manuscript’s structure:
namely, that each plant illustration represents not a singular species, but a fusion blueprint of multiple botanical and symbolic elements — constructed in harmonic cycles and encoded across pages via Fibonacci-derived cipher blocks.
Further, I’ve come to realize that the final diagrams of the manuscript likely represent an “end state” — a symbolic interface or mechanical construct — which must be reverse-engineered to decipher earlier folios. In this view, the manuscript is not meant to be “read” linearly, but “reconstructed” structurally.
The full exploration of this hypothesis is captured in my ongoing dialogue with an AI-based assistant (GPT), which has helped me formulate and refine these ideas. The original discussion is in Dutch, but I believe its insights may be of interest to your research team or broader Voynich scholars.
➤ You can access the full discussion (with live hypotheses, cipher parsing, and layered logic) here: [link]
I understand the document is currently in Dutch, and I am in the process of translating it into English for broader academic use. If your team would be interested in early access to this material, or would be willing to assist in its validation or dissemination, I would greatly appreciate the opportunity to collaborate or contribute to ongoing Voynich research.
Thank you for preserving and opening access to this enigmatic treasure. I hope this contribution may help shed further light on its mystery.
Warm regards,
Hakan Adıbelli
belli187@hotmail.com
Venlo, The Netherlands
| Pastebin claiming to decode Voynich Manuscript |
Posted by: Strategeryist - 06-07-2025, 01:19 AM - Forum: Theories & Solutions - Replies (8)
Hello everyone,
I know nothing about the Voynich or its translation. I randomly found this pastebin on the internet while going down an interesting, unrelated rabbit hole. I'm posting it here because I'm curious whether it's legitimate or not, as I am not an expert in this topic. It was uploaded anonymously on Nov 29, 2022 and has only received around 158 views. It was removed on March 11, 2025 and is now only available on the Wayback Machine. In addition to the link, I've also attached the txt file.
The Voynich Manuscript DECODED.txt (Size: 35.48 KB / Downloads: 22)
Below are some interesting paragraphs from the pastebin.
> One thing the Voynich Group talks about continually is that they believe this manuscript is encoded. Great lengths have been taken to find the code and decipher the words. After having looked over page one of this manuscript, I felt that the group was right in one aspect. It was encoded. But not by the logical method, the mathematical method of encoding they were looking for. It was visually encoded. At least as far as I have gone with it.
> For instance, that word AIN. I thought about the fact that if AIN was Hebrew for the eye, then to visually encode it the author could also write AIIN and AIIIN. Even though on the surface it would logically appear that these variations represented different words, they were all actually the same word, the Hebrew AIN, the eye. When the letter D was added, as DAIN, it represented the construction D'A, the Hebrew word "knowledge", spelled daleth ain. You really have a two letter word, not a four letter word. When he changed 'A (the letter ain) for the Latin letter O, OIIN, it would show the writer was possibly accessing another definition of AIN, such as "to look at." The Latin letter O is derived from the Hebrew letter 'Ain.
...
> Line by line updated paraphrase translation of page 1 of the Voynich Manuscript:
> Seeing, my friend, as there is power that will prevail [the power would have to refer to the power of the devastation of the eye], be wakeful. It is the appointed time of the Eye. Know also that opponents, deniers, have been allowed by men. They are now the head evil faction. All that are precious to them will be killed.
| GPT Models Fail to Find Language Structure in the Voynich Manuscript |
Posted by: quimqu - 02-07-2025, 04:10 PM - Forum: Analysis of the text - Replies (27)
The GPT model is a type of neural network trained to predict the next token (e.g., word or character) in a sequence, based on the context of the previous ones. During training, the model gradually learns patterns and structures that help it guess what might come next. When applied to natural language, this often results in learning the grammar and syntax of the language. The goal of this experiment is to see whether a GPT, when trained on Voynich text, can reproduce even short valid word sequences — a basic sign of underlying grammatical structure.
Using a minimal GPT architecture trained on 11,000-token corpora from natural languages and from the Voynich manuscript (EVA and CUVA transcriptions, restricted to paragraphs of what looks like running language, excluding for example the cosmological parts), I evaluated how well the model could reproduce sequences of two or three consecutive words (bigrams and trigrams) from the original corpus. The results reveal stark differences between Voynichese and natural languages.
I trained several nanoGPT models (roughly 1.1M parameters each) on corpora limited to 11,000 words each. The corpora included:
- Latin (e.g. De Docta Ignorantia)
- Classical religious text (In Psalmum David CXVIII)
- Early Modern English (Romeo and Juliet)
- Esperanto (Alice in Wonderland, Esperanto translation)
- Voynich EVA transcription
- Voynich CUVA transcription
Each model was trained on tokenized text split by the dot (".") separator, treating each token as a "word". Then, I prompted each model to generate 1000 words, starting from a random token from the original corpus.
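A minimal sketch of that tokenization and sampling step: the one-line EVA string, the vocabulary sizes and the stand-in model are placeholders (the real runs used the trained nanoGPT, a block of previous tokens as context, and 1000 generated words).

```python
# Tokenize by the "." word separator, then generate words by sampling from next-token probabilities.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

eva_text = "fachys.ykal.ar.ataiin.shol.shory"           # placeholder EVA line
words = [w for w in eva_text.split(".") if w]            # "." is the word separator
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}
itos = {i: w for w, i in stoi.items()}

# Randomly initialised stand-in for the trained GPT; it conditions only on the
# previous token here, for brevity, and returns logits over the vocabulary.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Flatten(0), nn.Linear(16, len(vocab)))

generated = [random.choice(words)]                       # start from a random corpus token
for _ in range(20):                                      # the experiment generated 1000 words
    idx = torch.tensor([stoi[generated[-1]]])
    logits = model(idx)                                  # next-token scores from the stand-in model
    next_id = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()
    generated.append(itos[next_id])
print(".".join(generated))
```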
For each generated sequence, I extracted all bigrams and trigrams and checked how many were present in the original corpus text (used as training data).
Results (Bigrams and Trigrams Found in Training Text):
![[Image: s5N9a0a.png]](https://i.imgur.com/s5N9a0a.png)
The Latin religious text In Psalmum David CXVIII had pretty low bigram and trigram scores — not too far from the Voynich numbers. This could be because of its complex sentence structure or how rarely some word combinations repeat. But even then, it still produced some consistent word sequences, which the GPT picked up.
That didn’t happen with the Voynich at all — no three-word sequences from the original text were ever regenerated. This makes Voynichese stand out as fundamentally different.
In addition, the entropy of word distributions was comparable across corpora (~8.5 to 9.6 bits), meaning the GPT learned the relative frequencies of words quite well. However, only in natural language corpora did it also learn statistically consistent co-occurrence patterns.
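For completeness, the word-distribution entropy mentioned here is the standard unigram Shannon entropy in bits, which can be sketched as:

```python
# Unigram entropy in bits: H = -sum_i p_i * log2(p_i) over word frequencies.
import math
from collections import Counter

tokens = "daiin chedy daiin shedy qokeedy chedy daiin".split()   # placeholder corpus
counts = Counter(tokens)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"unigram entropy: {entropy:.2f} bits")
```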
Conclusion:
If the Voynich manuscript encoded a natural language, we would expect a GPT trained on it to be able to reproduce at least a small proportion of common bigrams and trigrams from the training corpus. This is exactly what we observe in natural language corpora (e.g. Esperanto 25.9% bigram match). In contrast, the bigram match rate for Voynichese is nearly zero, and trigrams are entirely absent.
This strongly supports the hypothesis that the Voynich manuscript is not a natural language encoding. While it has an internally consistent lexicon (i.e., words), it lacks the sequential dependencies and word-to-word transitions that characterize even simple or constructed languages.
Implication:
If a small GPT can learn bigrams and trigrams from natural languages in just 11,000 words — but completely fails to do so with Voynichese — this suggests that the manuscript does not reflect natural language structure.
This casts serious doubt on claims of direct decryption or translation into real languages. It’s likely that such efforts are misapplied.
Instead, the Voynich may reflect a pseudo-linguistic system — a generative algorithm, a constructed gibberish, or even a cipher whose output was never meant to carry true semantic depth. The surface form may resemble language, but its internal statistical behavior tells a different story.
In short: be skeptical of anyone claiming to have “translated” the Voynich into English, Latin, or any other language — unless they can show that their version has the statistical fingerprints of a true linguistic system.