09-06-2025, 10:52 AM
(08-06-2025, 05:31 PM)quimqu Wrote: Hi,
Thanks a lot for your message. I find your suggestion about grammar-based decomposition really interesting — that’s exactly the kind of direction I’d like to explore next.
To clarify what I’ve done so far:
I trained several small GPT models on Voynichese using a character-level tokenizer, based on the Currier transliteration. This worked quite well: the model was able to predict character sequences with low perplexity (~3.3), which suggests a high degree of internal structure.
If I understood correctly, the GPT model learns an internal representation of the Voynich word 'grammar', which is then used to generate new words; these are compared to the original text, and from this comparison a 'perplexity' score is calculated, giving a measure of how well the newly generated words match actual Voynichese (and I guess the same score is used while training the model?). Did I get that about right?
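For concreteness, here is roughly how I picture that score being computed: the exponential of the mean cross-entropy on held-out text. A minimal sketch, assuming a PyTorch-style model that maps token ids to next-character logits (all names here are illustrative):

Code:
import math
import torch
import torch.nn.functional as F

# Sketch only: `model` stands for any character-level language model that
# maps token ids (batch, seq_len) to logits (batch, seq_len, vocab_size).
@torch.no_grad()
def char_perplexity(model, ids):
    logits = model(ids[:, :-1])              # predict each next character
    loss = F.cross_entropy(                  # mean negative log-likelihood
        logits.reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),              # targets shifted by one
    )
    return math.exp(loss.item())             # perplexity = exp(mean NLL)

If that is right, a perplexity of ~3.3 would mean the model is, on average, about as uncertain as a uniform choice among roughly 3.3 characters at each step.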
(08-06-2025, 05:31 PM)quimqu Wrote: In contrast, using word-level tokenization (based on dot-separated EVA words) gave very poor results — mainly because of the large vocabulary size and lack of training data per token.
At this point, I’m considering two directions:
- Trying more modern transliterations (like Takahashi or updated EVA versions). But I’m a bit concerned that these are too detailed — they distinguish rare glyphs very precisely, which might make it harder for the model to generalize.
This looks a bit problematic to me, because Currier made many interpretive choices in his transliteration (while EVA is much more 'agnostic'), and those choices are bound to influence your results. I'm not saying that grouping EVA characters together is necessarily a bad thing (I myself always grouped "ch", "sh", "cph", "cfh", "cth" and "ckh"), and I'm not saying Currier made bad choices, but personally I'd rather start from a more 'raw' transliteration in EVA.
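For what it's worth, that kind of grouping is easy to apply as a pre-tokenization pass over a raw EVA transliteration. A minimal sketch, handling only the groups listed above (the function name is mine):

Code:
import re

# Collapse the EVA ligature groups mentioned above into single tokens
# before character-level tokenization; longest patterns go first in the
# alternation so multi-glyph groups win over single characters.
EVA_GROUPS = ["cph", "cfh", "cth", "ckh", "ch", "sh"]
pattern = re.compile("|".join(EVA_GROUPS) + "|.")

def tokenize_eva(word):
    """Split an EVA word into tokens, keeping grouped glyphs intact."""
    return pattern.findall(word)

print(tokenize_eva("chedy"))    # ['ch', 'e', 'd', 'y']
print(tokenize_eva("qokchey"))  # ['q', 'o', 'k', 'ch', 'e', 'y']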
(08-06-2025, 05:31 PM)quimqu Wrote:
- Switching to syllable-like units instead of characters — which is exactly what you suggested.
I’d love to hear your opinion on this:
- What kind of syllables (e.g. "qo", "dy", "aiin", etc.) do you think would make sense?
Eh... that is the question that a grammar is meant to solve.

(08-06-2025, 05:31 PM)quimqu Wrote:
- Which transliteration would be the best basis for such tokenization?
I don't think it matters much, but should you decide to try EVA, I suggest the Reference transliteration by René Zandbergen (sorry but I can never find the link...).
(08-06-2025, 05:31 PM)quimqu Wrote: If you’ve already explored this type of decomposition, I’d be really interested in hearing more or comparing approaches.
Thanks again for your input!
I experimented with a decomposition based on slot grammars. If you're interested, you can check here: [link]. I don't want to hijack your thread; if you have questions, send me a PM.
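To give a flavour of what I mean by a slot grammar: each word is read as an ordered sequence of optional slots, each filled from a small inventory. A toy sketch with made-up inventories, much cruder than what I actually used:

Code:
# Toy slot grammar, illustrative inventories only. Each word is modelled
# as an ordered sequence of optional slots; a word parses if the slots
# consume it completely. A serious version would need backtracking, since
# this greedy filler can reject words a better split would accept.
SLOTS = [
    ("prefix",  ["qo", "o", "d", ""]),
    ("gallows", ["k", "t", "p", "f", ""]),
    ("bench",   ["ch", "sh", ""]),
    ("nucleus", ["ee", "e", ""]),
    ("suffix",  ["aiin", "ain", "dy", "in", "y", ""]),
]

def parse(word):
    result, rest = {}, word
    for name, options in SLOTS:
        for opt in sorted(options, key=len, reverse=True):  # longest first
            if rest.startswith(opt):
                result[name] = opt
                rest = rest[len(opt):]
                break
    return result if rest == "" else None

print(parse("qokeedy"))  # {'prefix': 'qo', 'gallows': 'k', 'bench': '', 'nucleus': 'ee', 'suffix': 'dy'}
print(parse("otaiin"))   # {'prefix': 'o', 'gallows': 't', 'bench': '', 'nucleus': '', 'suffix': 'aiin'}

Words that parse this way get a syllable-like segmentation for free, which could in principle feed a tokenizer directly.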