Detecting Vowels in the Voynich Text


Detecting Vowels in the Voynich Text - quimqu - 01-09-2025

One of the big challenges when analyzing the Voynich manuscript is that we don’t even know which symbols are “vowels” and which are “consonants”. To explore this, I built an unsupervised pipeline that tries to infer vowel-like characters directly from the distribution of glyphs in the text, without assuming a known language.

To do this, I treat each glyph as a separate character. But in the MS, common pairs like “ch”, “sh”, “ai”, “in” might actually behave as single units. To detect these automatically, I compute PMI (Pointwise Mutual Information): if two symbols co-occur much more often than expected by chance, I fuse them into a digraph token. This avoids false positives like treating “c” as a vowel just because it almost always appears in “ch.”
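As a rough illustration of the PMI step, here is a minimal sketch. The function names, the `min_count` and `min_pmi` thresholds, and the greedy left-to-right fusing are assumptions made for the example, not necessarily what the actual script does.

```python
import math
from collections import Counter

def pmi_digraphs(words, min_count=5, min_pmi=1.0):
    """Return {digraph: PMI in bits} for adjacent glyph pairs seen above chance.

    words: iterable of strings, one glyph per character.
    min_count and min_pmi are illustrative thresholds (assumptions).
    """
    unigrams = Counter(ch for w in words for ch in w)
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            scores[pair] = pmi
    return scores

def fuse(word, digraphs):
    """Greedy left-to-right tokenization that prefers fused digraphs."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

For example, `fuse("chol", {"ch"})` gives the tokens `ch`, `o`, `l`, so “c” is no longer counted on its own.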

Then I train a simple HMM with two hidden states: one meant to represent “vowel-like” positions, the other “consonant-like.” The model is biased to prefer alternation (V↔C), because in most languages vowels and consonants interleave rather than forming long runs.
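A two-state HMM with an alternation bias could look roughly like the sketch below. It is a simplification under stated assumptions: the transition matrix is fixed (the `switch` value is a placeholder) and only the emission probabilities are re-estimated with EM, whereas the real script may train everything. Words are assumed short enough that no numerical scaling is needed.

```python
import random
from collections import defaultdict

def train_alternating_hmm(words, switch=0.8, iters=30, seed=0):
    """EM for the emissions of a 2-state HMM whose transitions favor switching.

    words: list of token lists. Returns emit, where emit[s][token] is
    P(token | state s). switch is P(change state) at each step (assumption).
    """
    rng = random.Random(seed)
    words = [w for w in words if w]
    vocab = sorted({t for w in words for t in w})
    trans = [[1 - switch, switch], [switch, 1 - switch]]
    start = [0.5, 0.5]
    emit = []
    for _ in range(2):  # random, normalized starting emissions
        raw = {t: 0.5 + rng.random() for t in vocab}
        z = sum(raw.values())
        emit.append({t: v / z for t, v in raw.items()})
    for _ in range(iters):
        counts = [defaultdict(float), defaultdict(float)]
        for w in words:
            n = len(w)
            # Forward pass
            alpha = [[start[s] * emit[s][w[0]] for s in (0, 1)]]
            for i in range(1, n):
                prev = alpha[-1]
                alpha.append([
                    emit[s][w[i]] * (prev[0] * trans[0][s] + prev[1] * trans[1][s])
                    for s in (0, 1)
                ])
            # Backward pass
            beta = [[1.0, 1.0] for _ in range(n)]
            for i in range(n - 2, -1, -1):
                nxt = beta[i + 1]
                beta[i] = [
                    sum(trans[s][r] * emit[r][w[i + 1]] * nxt[r] for r in (0, 1))
                    for s in (0, 1)
                ]
            total = alpha[-1][0] + alpha[-1][1]
            # Accumulate expected state occupancy per token
            for i in range(n):
                for s in (0, 1):
                    counts[s][w[i]] += alpha[i][s] * beta[i][s] / total
        for s in (0, 1):  # M-step with a tiny smoothing constant
            z = sum(counts[s].values())
            emit[s] = {t: (counts[s][t] + 1e-6) / (z + 1e-6 * len(vocab))
                       for t in vocab}
    return emit

def state_posterior(emit, token):
    """P(state 0 | token), assuming equal state priors."""
    a, b = emit[0][token], emit[1][token]
    return a / (a + b)
```

Note that the two states are unlabeled: the HMM only separates the glyphs into two classes, and deciding which class is the “vowel-like” one needs outside evidence.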

But the HMM alone isn’t enough. So I add metrics that capture linguistic tendencies:

Coverage: vowels tend to appear in many different words.
Neighbor entropy: vowels have diverse neighbors on both sides (they can be surrounded by many different consonants).
Repetition rate: how often a glyph appears doubled. I apply a threshold on this, since consonants are more often repeated.
Position: vowels often appear inside words, not only at the edges.
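The four metrics above could be computed along these lines. This is only a sketch: the exact definitions are assumptions (coverage is taken over distinct word types, neighbor entropy is the smaller of the left and right entropies so that a glyph needs diversity on both sides, and repetition is the share of occurrences directly preceded by the same token).

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy in bits of a Counter of neighbor counts."""
    n = sum(counter.values())
    if not n:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in counter.values())

def glyph_features(words):
    """words: list of token lists. Returns {token: {feature_name: value}}."""
    total, doubled, inner = Counter(), Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for w in words:
        for i, t in enumerate(w):
            total[t] += 1
            if i > 0:
                left[t][w[i - 1]] += 1
                doubled[t] += (w[i - 1] == t)  # same token twice in a row
            if i < len(w) - 1:
                right[t][w[i + 1]] += 1
            if 0 < i < len(w) - 1:
                inner[t] += 1  # neither first nor last in the word
    types = {tuple(w) for w in words}
    feats = {}
    for t in total:
        feats[t] = {
            "coverage": sum(t in ty for ty in types) / len(types),
            "neighbor_entropy": min(entropy(left[t]), entropy(right[t])),
            "repetition_rate": doubled[t] / total[t],
            "inner_rate": inner[t] / total[t],
        }
    return feats
```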

Each of these gets a weight, and I combine them with the HMM score into a calibrated probability of being a vowel.
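One simple way to do such a combination is to z-score each feature across glyphs and pass the weighted sum through a logistic function, as sketched below. The feature names, the weights, and the use of a plain logistic in place of a real calibration step are all assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def standardize(table):
    """table: {glyph: {feature: value}} -> same shape, z-scored per feature."""
    names = list(next(iter(table.values())))
    out = {g: {} for g in table}
    for name in names:
        vals = [table[g][name] for g in table]
        mu = mean(vals)
        sd = pstdev(vals) or 1.0  # avoid dividing by zero for constant features
        for g in table:
            out[g][name] = (table[g][name] - mu) / sd
    return out

def vowel_probabilities(table, weights, bias=0.0):
    """Weighted sum of standardized features, squashed into (0, 1)."""
    z = standardize(table)
    probs = {}
    for g, feats in z.items():
        score = bias + sum(weights[k] * feats[k] for k in weights)
        probs[g] = 1.0 / (1.0 + math.exp(-score))
    return probs
```

A negative weight expresses a feature that counts against vowel status, which is how a repetition penalty would enter here.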

Languages typically have only a handful of vowels (say 4–8). So instead of accepting every symbol above 0.5, I impose a parsimonious prior:

I target ~6 vowels in total.
Candidates must pass a minimum probability threshold.
I choose the set that minimizes structural loss, i.e., reduces long VV or CC runs and increases alternation.
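The selection step can be sketched as a small search over candidate sets. Here the structural loss is assumed to be the fraction of adjacent token pairs that are VV or CC, with a hypothetical `size_penalty` pulling the set size toward the target. The search is exhaustive, which is only practical for a dozen or so candidates.

```python
from itertools import combinations

def structural_loss(words, vowels):
    """Fraction of adjacent token pairs that are VV or CC (non-alternating)."""
    same = total = 0
    for w in words:
        for a, b in zip(w, w[1:]):
            total += 1
            same += (a in vowels) == (b in vowels)
    return same / total if total else 0.0

def select_vowels(words, probs, target=6, min_prob=0.6, size_penalty=0.05):
    """Pick the candidate subset with the lowest penalized structural loss.

    probs: {glyph: probability of being a vowel}.
    min_prob and size_penalty are illustrative values (assumptions).
    """
    candidates = [g for g, p in probs.items() if p >= min_prob]
    best, best_loss = frozenset(), float("inf")
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            loss = (structural_loss(words, set(combo))
                    + size_penalty * abs(k - target))
            if loss < best_loss:
                best, best_loss = frozenset(combo), loss
    return best
```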

I can configure the script to work on Latin-like languages, Indo-European languages, Semitic languages, syllabic languages, and so on, by setting the vowel patterns expected for each.
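To make the idea of per-family settings concrete, the configurations could be held in something like the structure below. The class and field names are hypothetical, the target counts and qualitative settings are the ones described for each configuration in this post, and "default" for the Indo-European repetition penalty is an assumption since the post does not state it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    """Hypothetical container for one language-family configuration."""
    target_min: int           # smallest acceptable vowel-set size
    target_max: int           # largest acceptable vowel-set size
    alternation: str          # how strongly V/C alternation is enforced
    repetition_penalty: str   # how strongly doubled glyphs are penalized
    extra_bigram_fusion: bool = False  # fuse more bigrams into single tokens

PROFILES = {
    "indo_european": Profile(6, 6, "rewarded", "default"),
    "semitic": Profile(3, 3, "relaxed", "relaxed"),
    "syllabic": Profile(3, 7, "weakened", "gentle", extra_bigram_fusion=True),
    "arabic": Profile(3, 3, "weakened", "weakened"),
}
```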

The result is a short list of the most “vowel-like” glyphs or digraphs.

I have tested the following configurations:

Indo-European languages often have 5–8 vowels, noticeable V↔C alternation, and frequent diphthongs; this config favors ~6 vowels, rewards alternation and neighbor diversity, and lets strong data-driven digraphs emerge without hardwired seeds. This is the result:

Indo-European

Voynichese	EVA	prob_vowel
o	o	0.983284
y	y	0.970378
a	a	0.886179
k	k	0.865411
t	t	0.824919
l	l	0.709329
ch	ch	0.684790
c	c	0.660998
f	f	0.658652
p	p	0.640008
s	s	0.624870
ai	ai	0.602096

Semitic scripts often omit short vowels and allow heavy consonant clustering; this setup targets ~3 vowel-like symbols, relaxes alternation and repetition penalties, and keeps digraph selection fully data-driven. This is the result:

Semitic

Voynichese	EVA	prob_vowel
o	o	0.978236
y	y	0.904206
k	k	0.888190
t	t	0.809498
d	d	0.792952
a	a	0.763163
l	l	0.750061
h	h	0.661877
s	s	0.654497
r	r	0.632587
p	p	0.625348
f	f	0.619525

Syllabic scripts pack consonant+vowel into single signs, so we weaken alternation, fuse more bigrams to approximate CV units, use gentler repetition/edge penalties, and select a flexible 3–7 set of vowel-like nuclei. This is the result:

Syllabic

Voynichese	EVA	prob_vowel
o	o	0.975626
y	y	0.945105
t	t	0.796126
a	a	0.756872
k	k	0.748639
r	r	0.702806
ch	ch	0.685732
l	l	0.643413
ol	ol	0.634971
s	s	0.608752
sh	sh	0.604313
d	d	0.581746

Arabic is an abjad where short vowels are often omitted and consonant clusters are common, so this config targets ~3 vowel-like symbols, weakens alternation and repetition penalties, and keeps digraph discovery fully data-driven. This is the result:

Arabic

Voynichese	EVA	prob_vowel
o	o	0.960930
y	y	0.866163
k	k	0.852621
t	t	0.762965
d	d	0.744871
s	s	0.711418
l	l	0.705106
a	a	0.693979
h	h	0.689840
e	e	0.660995
r	r	0.618145
ch	ch	0.598030

In every configuration tested, the glyph o comes out as the clearest vowel, with y and a also very strong. What surprises me most is the appearance of k and t: the common gallows seem to behave as “vowel-like.” They might be semi-vowels or special cases, but their distribution fits a “vowel-like” pattern well. As you can see, some configurations put digraphs (2-grams) in the top “vowel-like” list, which shows that the configuration chosen for each language type does change the output.

RE: Detecting Vowels in the Voynich Text - Jorge_Stolfi - 01-09-2025

Arabic is a Semitic ("Afroasiatic") language too. The most important one by number of speakers.

The traditional Arabic script writes only the long vowels. Natv Arbc spkrs cn uslly gess th shrt vowls. But, if the language is Arabic (not saying it is!) perhaps one reason why the Author chose to invent a new script was to record the short vowels, which he had difficulty guessing.

Ll th bst, --jrg