The Voynich Ninja
Detecting Vowels in the Voynich Text - Printable Version


Detecting Vowels in the Voynich Text - quimqu - 01-09-2025

One of the big challenges when analyzing the Voynich manuscript is that we don’t even know which symbols are “vowels” and which are “consonants”. To explore this, I built an unsupervised pipeline that tries to infer vowel-like characters directly from the distribution of glyphs in the text, without assuming a known language.

By default, I treat each glyph as a separate character. But in the MS, common pairs like “ch”, “sh”, “ai”, “in” might actually behave as single units. To detect these automatically, I compute PMI (Pointwise Mutual Information): if two symbols co-occur much more often than expected by chance, I fuse them into a digraph token. This avoids false positives like treating “c” as a vowel just because it almost always appears in “ch.”
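A minimal sketch of the PMI fusion step, assuming words are plain EVA strings. The function names, the 2-bit threshold and the minimum count are illustrative assumptions, not the settings of the actual pipeline:

```python
import math
from collections import Counter

def find_digraphs(words, min_pmi=2.0, min_count=5):
    """Return adjacent glyph pairs whose PMI (in bits) is at least min_pmi."""
    unigrams = Counter(ch for w in words for ch in w)
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    fused = {}
    for pair, count in bigrams.items():
        if count < min_count:  # skip rare pairs, whose PMI is unreliable
            continue
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            fused[pair] = pmi
    return fused

def tokenize(word, digraphs):
    """Greedy left-to-right tokenization that prefers fused digraphs."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

With “ch” fused, tokenize("chol", {"ch"}) yields the tokens ch, o, l, so “c” is never scored on its own.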

Then I train a simple HMM with two hidden states: one meant to represent “vowel-like” positions, the other “consonant-like.” The model is biased to prefer alternation (V↔C), because in most languages vowels and consonants interleave rather than forming long runs.
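As a rough illustration of such a model, here is a small pure-Python Baum-Welch sketch for a two-state HMM. The alternation bias is implemented here as pseudo-counts on the state-switching transitions; that choice, the name switch_bias and all numeric values are assumptions made for the example, and the real script may bias the model differently:

```python
import random

def forward_backward(seq, start, trans, emit):
    """Scaled forward-backward pass; returns state and transition posteriors."""
    n = len(seq)
    alpha, scale = [], []
    for t, sym in enumerate(seq):
        if t == 0:
            a = [start[k] * emit[k][sym] for k in range(2)]
        else:
            a = [sum(alpha[-1][i] * trans[i][k] for i in range(2)) * emit[k][sym]
                 for k in range(2)]
        c = sum(a)
        alpha.append([v / c for v in a])
        scale.append(c)
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(trans[k][j] * emit[j][seq[t + 1]] * beta[t + 1][j]
                       for j in range(2)) / scale[t + 1] for k in range(2)]
    gamma = [[alpha[t][k] * beta[t][k] for k in range(2)] for t in range(n)]
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][seq[t + 1]] * beta[t + 1][j]
            / scale[t + 1] for j in range(2)] for i in range(2)]
          for t in range(n - 1)]
    return gamma, xi

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def train_hmm(sequences, n_iter=20, switch_bias=5.0, seed=0):
    """Fit a 2-state HMM with EM; pseudo-counts favor switching states."""
    rng = random.Random(seed)
    sequences = [s for s in sequences if s]
    symbols = sorted({s for seq in sequences for s in seq})
    start = [0.5, 0.5]
    trans = [[0.3, 0.7], [0.7, 0.3]]  # initial guess already prefers V<->C
    emit = [dict(zip(symbols, normalize([1.0 + rng.random() for _ in symbols])))
            for _ in range(2)]
    for _ in range(n_iter):
        # Prior pseudo-counts: heavy on off-diagonal (switching) transitions.
        t_cnt = [[1e-3, switch_bias], [switch_bias, 1e-3]]
        e_cnt = [{s: 1e-3 for s in symbols} for _ in range(2)]
        s_cnt = [1e-3, 1e-3]
        for seq in sequences:
            gamma, xi = forward_backward(seq, start, trans, emit)
            for k in range(2):
                s_cnt[k] += gamma[0][k]
                for t, sym in enumerate(seq):
                    e_cnt[k][sym] += gamma[t][k]
                for j in range(2):
                    t_cnt[k][j] += sum(x[k][j] for x in xi)
        start = normalize(s_cnt)
        trans = [normalize(row) for row in t_cnt]
        emit = [dict(zip(symbols, normalize([row[s] for s in symbols])))
                for row in e_cnt]
    return start, trans, emit
```

Each sequence is one word as a list of tokens (single glyphs or fused digraphs). Which of the two states counts as “vowel-like” is not fixed by the model, so it has to be decided afterwards, for example from the emission tables.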

But the HMM alone isn’t enough. So I add metrics that capture linguistic tendencies:
  • Coverage: vowels tend to appear in many different words.
  • Neighbor entropy: vowels have diverse neighbors on both sides (they can be surrounded by many different consonants).
  • Repetition rate: I apply a threshold on how often a glyph appears doubled (consonants are repeated more often).
  • Position: vowels often appear inside words, not only at the edges.
Each of these gets a weight, and I combine them with the HMM score into a calibrated probability of being a vowel.
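The four metrics and their combination could look roughly like the sketch below. The feature definitions, the dictionary keys and the logistic combination are assumptions for illustration; the actual weights and calibration method are not given here:

```python
import math
from collections import Counter, defaultdict

def glyph_features(words):
    """Per-token coverage, neighbor entropy, doubling rate and interior rate.

    `words` is a list of token lists (single glyphs or fused digraphs).
    """
    total, in_words = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    doubled, interior = Counter(), Counter()
    for w in words:
        for g in set(w):
            in_words[g] += 1
        for i, g in enumerate(w):
            total[g] += 1
            if i > 0:
                left[g][w[i - 1]] += 1
                if w[i - 1] == g:
                    doubled[g] += 1
            if i < len(w) - 1:
                right[g][w[i + 1]] += 1
            if 0 < i < len(w) - 1:
                interior[g] += 1

    def entropy(counter):
        n = sum(counter.values())
        if n == 0:
            return 0.0
        return -sum(c / n * math.log2(c / n) for c in counter.values())

    return {
        g: {
            "coverage": in_words[g] / len(words),
            "neighbor_entropy": (entropy(left[g]) + entropy(right[g])) / 2,
            "doubling_rate": doubled[g] / total[g],
            "interior_rate": interior[g] / total[g],
        }
        for g in total
    }

def vowel_score(features, hmm_score, weights, bias=0.0):
    """Weighted sum of HMM score and features, squashed to (0, 1)."""
    z = bias + weights["hmm"] * hmm_score
    z += sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

In this sketch a negative weight on doubling_rate would penalize glyphs that are often doubled, while positive weights on the other three features reward vowel-like behavior.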

Languages typically have only a handful of vowels (say 4–8). So instead of accepting every symbol above 0.5, I impose a parsimonious prior:
  • I target ~6 vowels in total.
  • Candidates must pass a minimum probability threshold.
  • I choose the set that minimizes structural loss, i.e., reduces long VV or CC runs and increases alternation.
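One way to express this selection is sketched below. The loss definition (share of adjacent same-class pairs), the size penalty and the exhaustive search over subsets are assumptions chosen to keep the example short; they stand in for whatever the script actually does:

```python
from itertools import combinations

def structural_loss(words, vowels):
    """Fraction of adjacent token pairs that share a class (VV or CC)."""
    same = pairs = 0
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs += 1
            same += (a in vowels) == (b in vowels)
    return same / pairs if pairs else 0.0

def select_vowels(words, probs, target=6, min_prob=0.5, size_penalty=0.02):
    """Pick the candidate subset with the lowest loss plus size penalty."""
    # Only tokens above the probability threshold are considered at all.
    candidates = [g for g, p in probs.items() if p >= min_prob]
    best, best_cost = frozenset(), float("inf")
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            cost = structural_loss(words, set(subset))
            cost += size_penalty * abs(k - target)  # prior: about `target` vowels
            if cost < best_cost:
                best, best_cost = frozenset(subset), cost
    return best
```

Exhaustive search is only practical for a short candidate list (a dozen candidates means 4095 subsets); a longer list would need a greedy or beam search instead.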

I can configure the script to work on Latin-like languages, Indo-European languages, Semitic languages, syllabic languages... by setting the vowel patterns for each.

The result is a short list of the most “vowel-like” glyphs or digraphs.

I have tested the following configurations:

Indo-European languages often have 5–8 vowels, noticeable V↔C alternation, and frequent diphthongs; this config favors ~6 vowels, rewards alternation and neighbor diversity, and lets strong data-driven digraphs emerge without hardwired seeds. This is the result:

Indo-European
Voynichese  EVA  prob_vowel
o           o    0.983284
y           y    0.970378
a           a    0.886179
k           k    0.865411
t           t    0.824919
l           l    0.709329
ch          ch   0.684790
c           c    0.660998
f           f    0.658652
p           p    0.640008
s           s    0.624870
ai          ai   0.602096


Semitic scripts often omit short vowels and allow heavy consonant clustering; this setup targets ~3 vowel-like symbols, relaxes alternation and repetition penalties, and keeps digraph selection fully data-driven. This is the result:

Semitic
Voynichese  EVA  prob_vowel
o           o    0.978236
y           y    0.904206
k           k    0.888190
t           t    0.809498
d           d    0.792952
a           a    0.763163
l           l    0.750061
h           h    0.661877
s           s    0.654497
r           r    0.632587
p           p    0.625348
f           f    0.619525

Syllabic scripts pack consonant+vowel into single signs, so we weaken alternation, fuse more bigrams to approximate CV units, use gentler repetition/edge penalties, and select a flexible 3–7 set of vowel-like nuclei. This is the result:

Syllabic
Voynichese  EVA  prob_vowel
o           o    0.975626
y           y    0.945105
t           t    0.796126
a           a    0.756872
k           k    0.748639
r           r    0.702806
ch          ch   0.685732
l           l    0.643413
ol          ol   0.634971
s           s    0.608752
sh          sh   0.604313
d           d    0.581746

Arabic is an abjad where short vowels are often omitted and consonant clusters are common, so this config targets ~3 vowel-like symbols, weakens alternation and repetition penalties, and keeps digraph discovery fully data-driven. This is the result:

Arabic
Voynichese  EVA  prob_vowel
o           o    0.960930
y           y    0.866163
k           k    0.852621
t           t    0.762965
d           d    0.744871
s           s    0.711418
l           l    0.705106
a           a    0.693979
h           h    0.689840
e           e    0.660995
r           r    0.618145
ch          ch   0.598030

In every configuration tested, the glyph o comes out as the clearest vowel, with y and a also very strong. What surprises me most is the appearance of k and t: the common gallows seem to behave as “vowel-like” glyphs. They might be semi-vowels or special cases, but their patterns fit a “vowel-like” profile well. As you can see, some configurations put 2-grams in the top “vowel-like” list, which shows that the configurations for different language types give different outputs.


RE: Detecting Vowels in the Voynich Text - Jorge_Stolfi - 01-09-2025

Arabic is a Semitic ("Afroasiatic") language too. The most important one by number of speakers.

The traditional Arabic script writes only the long vowels. Natv Arbc spkrs cn uslly gess th shrt vowls.  But, if the language is Arabic (not saying it is!) perhaps one reason why the Author chose to invent a new script was to record the short vowels, which he had difficulty guessing.

Ll th bst, --jrg