Detecting Vowels in the Voynich Text
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Detecting Vowels in the Voynich Text (/thread-4901.html)
Detecting Vowels in the Voynich Text - quimqu - 01-09-2025

One of the big challenges when analyzing the Voynich manuscript is that we don’t even know which symbols are “vowels” and which are “consonants”. To explore this, I built an unsupervised pipeline that tries to infer vowel-like characters directly from the distribution of glyphs in the text, without assuming a known language.

To do this, I start by treating each glyph as a separate character. But in the MS, common pairs like “ch”, “sh”, “ai”, “in” might actually behave as single units. To detect these automatically, I compute PMI (Pointwise Mutual Information): if two symbols co-occur much more often than expected by chance, I fuse them into a digraph token. This avoids false positives like treating “c” as a vowel just because it almost always appears in “ch.”

Then I train a simple HMM with two hidden states: one meant to represent “vowel-like” positions, the other “consonant-like.” The model is biased to prefer alternation (V↔C), because in most languages vowels and consonants interleave rather than forming long runs.

But the HMM alone isn’t enough, so I add metrics that capture linguistic tendencies:
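As a rough sketch of the PMI fusion step described above (this is not the actual script; the function names and the `min_count` and `threshold` parameters are assumptions made for illustration), adjacent glyph pairs can be scored and then fused greedily from left to right:

```python
import math
from collections import Counter

def pmi_digraphs(words, min_count=5, threshold=3.0):
    """Score adjacent glyph pairs by PMI (in bits).

    `min_count` and `threshold` are illustrative defaults, not values
    taken from the original pipeline.
    """
    unigrams, bigrams = Counter(), Counter()
    for w in words:
        unigrams.update(w)              # count single glyphs
        bigrams.update(zip(w, w[1:]))   # count adjacent pairs inside a word
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue                    # rare pairs give unstable PMI
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= threshold:
            scores[a + b] = pmi
    return scores

def fuse(word, digraphs):
    """Tokenize a word left to right, fusing any pair found in `digraphs`."""
    tokens, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in digraphs:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

With a transliteration loaded as a list of words, `fuse(word, pmi_digraphs(words))` would turn each word into the token sequence that the HMM then sees. Greedy left-to-right matching is only one possible choice; overlapping candidates could also be resolved by PMI score.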
Languages typically have only a handful of vowels (say 4–8). So instead of accepting every symbol above 0.5, I impose a parsimonious prior:
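One simple way such a cap could be applied, as a minimal sketch and not necessarily what the script does: rank symbols by their vowel probability and keep only the top few. The function name, `k_max`, and `floor` are hypothetical, and the symbols and numbers used to try it out should be placeholders, not Voynich results.

```python
def select_vowels(posteriors, k_max=6, floor=0.5):
    """Keep at most `k_max` symbols, most vowel-like first.

    `posteriors` maps each symbol to an assumed P(vowel-like) score.
    A symbol must still clear `floor`, but clearing it is no longer
    sufficient on its own: the list is capped at `k_max` entries.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    return [sym for sym, p in ranked if p > floor][:k_max]
```

Lowering `k_max` to around 3 would correspond to a stricter prior, and raising it to 6 or more to a looser one.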
I can configure the script for Latin-like languages, Indo-European languages, Semitic languages, syllabic scripts, and so on, by setting the vowel patterns for each. The result is a short list of the most “vowel-like” glyphs or digraphs. These are the configurations I have tested:

Indo-European languages often have 5–8 vowels, noticeable V↔C alternation, and frequent diphthongs; this config favors ~6 vowels, rewards alternation and neighbor diversity, and lets strong data-driven digraphs emerge without hardwired seeds. This is the result: Indo-European
Semitic scripts often omit short vowels and allow heavy consonant clustering; this setup targets ~3 vowel-like symbols, relaxes alternation and repetition penalties, and keeps digraph selection fully data-driven. This is the result: Semitic
Syllabic scripts pack consonant+vowel into single signs, so this config weakens alternation, fuses more bigrams to approximate CV units, uses gentler repetition/edge penalties, and selects a flexible set of 3–7 vowel-like nuclei. This is the result: Syllabic
Arabic is an abjad where short vowels are often omitted and consonant clusters are common, so this config targets ~3 vowel-like symbols, weakens alternation and repetition penalties, and keeps digraph discovery fully data-driven. This is the result: Arabic
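The four configurations above could be expressed as a small table of profiles. The sketch below is hypothetical: the field names are invented, and the numeric weights are placeholders chosen only to respect the relative ordering described in the post (stronger alternation for Indo-European, a lower fusion threshold for syllabic scripts). Only the vowel-count targets come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LanguageProfile:
    name: str
    min_vowels: int            # lower bound on selected vowel-like symbols
    max_vowels: int            # upper bound on selected vowel-like symbols
    alternation_weight: float  # placeholder: bias toward V<->C alternation
    repetition_penalty: float  # placeholder: penalty on repeated symbols
    pmi_threshold: float       # placeholder: lower value fuses more bigrams

# Vowel-count targets follow the post; all weights are illustrative only.
PROFILES = {
    "indo_european": LanguageProfile("Indo-European", 6, 6, 1.0, 1.0, 3.0),
    "semitic":       LanguageProfile("Semitic",       3, 3, 0.5, 0.5, 3.0),
    "syllabic":      LanguageProfile("Syllabic",      3, 7, 0.3, 0.5, 2.0),
    "arabic":        LanguageProfile("Arabic",        3, 3, 0.3, 0.3, 3.0),
}
```

Keeping every knob in one frozen record per language type makes it easy to see which assumptions differ between two runs when their outputs are compared.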
In every configuration tested, the glyph o comes out as the clearest vowel, with y and a also very strong. What surprises me most is the appearance of k and t: the common gallows seem to act as “vowel-like”. They might be semi-vowels or special cases, but their patterns fit a “vowel-like” profile well. As you can see, some configurations put 2-grams in the top “vowel-like” list, which shows that the configuration for different language types gives different outputs.

RE: Detecting Vowels in the Voynich Text - Jorge_Stolfi - 01-09-2025

Arabic is a Semitic ("Afroasiatic") language too. The most important one by number of speakers. The traditional Arabic script writes only the long vowels. Natv Arbc spkrs cn uslly gess th shrt vowls. But, if the language is Arabic (not saying it is!) perhaps one reason why the Author chose to invent a new script was to record the short vowels, which he had difficulty guessing.

Ll th bst, --jrg