Mauro > 09-06-2025, 10:52 AM
(08-06-2025, 05:31 PM)quimqu Wrote: Hi,
Thanks a lot for your message. I find your suggestion about grammar-based decomposition really interesting — that’s exactly the kind of direction I’d like to explore next.
To clarify what I’ve done so far:
I trained several small GPT models on Voynichese using a character-level tokenizer, based on the Currier transliteration. This worked quite well: the models reached a low character-level perplexity (around 3.3), which suggests a high degree of internal structure.
In contrast, using word-level tokenization (based on dot-separated EVA words) gave very poor results — mainly because of the large vocabulary size and the lack of training data per word type.
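For a rough sense of that vocabulary problem, here is a minimal sketch comparing character-level and word-level vocabulary sizes on a dot-separated EVA transliteration. The file name is hypothetical, and the dot-as-word-separator assumption is mine, not necessarily the exact setup used above.

Code:
from collections import Counter

# Hypothetical plain-text EVA transliteration; words separated by dots.
with open("voynich_eva.txt", encoding="utf-8") as f:
    text = f.read()

# Word-level view: split on dots and line breaks.
words = [w for w in text.replace("\n", ".").split(".") if w]
# Character-level view: every remaining transliteration character is a token.
chars = [c for c in text if c not in ".\n "]

for name, tokens in (("words", words), ("chars", chars)):
    counts = Counter(tokens)
    print(f"{name}: {len(counts)} types, {len(tokens)} tokens, "
          f"{len(tokens) / len(counts):.1f} tokens per type")

The fewer tokens per type, the less evidence the model sees for each vocabulary item, which is where the word-level runs starve.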
At this point, I’m considering two directions:
- Trying more modern transliterations (like Takahashi or updated EVA versions). But I’m a bit concerned that these are too detailed — they distinguish rare glyphs very precisely, which might make it harder for the model to generalize.
- Switching to syllable-like units instead of characters — which is exactly what you suggested.
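For illustration only, here is a minimal sketch of what such a syllable-level pass could look like: a greedy longest-match over a hand-picked unit inventory. The units below simply reuse the ones that come up in this thread (qo, dy, aiin, plus groupings like ch, sh, cth); the right inventory is of course exactly the open question.

Code:
# Greedy longest-match tokenizer over a hand-picked (and debatable) unit list.
UNITS = sorted(["aiin", "ain", "cth", "ckh", "cph", "cfh", "qo", "ch", "sh", "dy", "in"],
               key=len, reverse=True)  # try longer units first

def tokenize_word(word):
    """Split one EVA word into syllable-like units, falling back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for unit in UNITS:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            tokens.append(word[i])  # no multi-character unit matched here
            i += 1
    return tokens

print(tokenize_word("qokeedy"))   # ['qo', 'k', 'e', 'e', 'dy']
print(tokenize_word("chedaiin"))  # ['ch', 'e', 'd', 'aiin']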
I’d love to hear your opinion on this:
- What kind of syllables (e.g. "qo", "dy", "aiin", etc.) do you think would make sense?
- Which transliteration would be the best basis for such tokenization?
If you’ve already explored this type of decomposition, I’d be really interested in hearing more or comparing approaches.
Thanks again for your input!
Koen G > 09-06-2025, 11:11 AM
(09-06-2025, 10:52 AM)Mauro Wrote: This looks a bit problematic to me, because Currier made many choices in his transliteration (while EVA is much more 'agnostic') which are bound to influence your results. I'm not saying that grouping EVA characters together is necessarily a bad thing (I myself always grouped together "ch", "sh", "cph", "cfh", "cth", "ckh") and I'm not saying Currier made bad choices, but personally I'd rather use a more 'raw' transliteration in EVA.
Mauro > 09-06-2025, 12:14 PM
(09-06-2025, 11:11 AM)Koen G Wrote: This feels to me like one of the most pervasive misconceptions in modern Voynich research, up to the academic level. EVA makes just as many choices as Currier did, and one is not necessarily closer to the truth than the other.
quimqu > 09-06-2025, 01:02 PM
(09-06-2025, 12:14 PM)Mauro Wrote: (09-06-2025, 11:11 AM)Koen G Wrote: This feels to me like one of the most pervasive misconceptions in modern Voynich research, up to the academic level. EVA makes just as many choices as Currier did, and one is not necessarily closer to the truth than the other.
I surely agree with you! Would it then be okay to say that EVA is more 'analytic' in its choices while Currier is more 'synthetic'?
oshfdk > 09-06-2025, 01:54 PM
(09-06-2025, 01:02 PM)quimqu Wrote: Yes, you got it mostly right! GPT models learn patterns and rules (like a kind of “grammar”) from the Voynich text, and they use this knowledge to predict the next character or to generate new words. The “perplexity” score then measures how much probability the model assigns to the actual Voynich text: lower perplexity means better prediction. The same quantity, in the form of the cross-entropy loss, is what the model minimises during training, so it improves step by step.
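To make the perplexity part concrete: the score is the exponential of the average negative log-probability the model assigns to the actual next characters. A minimal sketch with invented probabilities:

Code:
import math

# Made-up probabilities; in practice these are the model's softmax outputs
# for each actual next character in the Voynich text, given its context.
probs = [0.42, 0.18, 0.55, 0.09, 0.31]

avg_nll = -sum(math.log(p) for p in probs) / len(probs)  # cross-entropy in nats
perplexity = math.exp(avg_nll)

print(f"cross-entropy: {avg_nll:.3f} nats, perplexity: {perplexity:.2f}")
# A perplexity of ~3.3 means the model is, on average, about as uncertain as
# choosing uniformly among ~3.3 equally likely next characters.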
quimqu > 09-06-2025, 02:05 PM