One of the big challenges when analyzing the Voynich manuscript is that we don’t even know which symbols are “vowels” and which are “consonants”. To explore this, I built an unsupervised pipeline that tries to infer vowel-like characters directly from the distribution of glyphs in the text, without assuming a known language.
By default, I treat each glyph as a separate character. But in the manuscript, common pairs like “ch”, “sh”, “ai”, “in” might actually behave as single units. To detect these automatically, I compute PMI (Pointwise Mutual Information) for adjacent glyphs: if two symbols co-occur much more often than chance would predict, I fuse them into a digraph token. This avoids false positives like treating “c” as a vowel just because it almost always appears in “ch.”
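To make the PMI step concrete, here is a minimal sketch of how digraph candidates could be scored and fused. It is not the actual pipeline code: the function names, the `min_count` and `min_pmi` thresholds, and the greedy left-to-right fusion are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_digraphs(words, min_count=2, min_pmi=1.0):
    """Return {bigram: PMI in bits} for adjacent pairs that clear both thresholds.

    min_count and min_pmi are illustrative values, not tuned ones.
    """
    unigrams, bigrams = Counter(), Counter()
    for w in words:
        unigrams.update(w)
        bigrams.update(zip(w, w[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            scores[a + b] = pmi
    return scores

def fuse(word, digraphs):
    """Greedy left-to-right tokenization that merges any listed digraph."""
    tokens, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in digraphs:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

On real text the thresholds matter a lot: PMI is known to overrate rare pairs, which is why the sketch also requires a minimum count before a pair can be fused.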
Then I train a simple HMM with two hidden states: one meant to represent “vowel-like” positions, the other “consonant-like.” The model is biased to prefer alternation (V↔C), because in most languages vowels and consonants interleave rather than forming long runs.
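As a rough illustration of this step, the sketch below trains a two-state HMM with EM in plain Python. It makes several assumptions that may differ from the real script: the alternation bias is modeled as a fixed transition matrix with switch probability `p_switch`, only the emission probabilities are learned, and each word is treated as an independent sequence. All names are hypothetical.

```python
import random

def train_two_state_hmm(words, alphabet, p_switch=0.8, iters=30, seed=0):
    """EM for a 2-state HMM whose transitions are fixed to favor switching.

    Returns emit, where emit[k][symbol] is P(symbol | state k).
    Pass alphabet as a string or sorted list so the random init is repeatable.
    """
    rng = random.Random(seed)
    trans = [[1 - p_switch, p_switch], [p_switch, 1 - p_switch]]
    start = [0.5, 0.5]
    emit = []
    for _ in range(2):  # random init breaks the symmetry between states
        raw = {s: rng.random() + 0.5 for s in alphabet}
        z = sum(raw.values())
        emit.append({s: v / z for s, v in raw.items()})
    for _ in range(iters):
        counts = [{s: 1e-6 for s in alphabet} for _ in range(2)]
        for w in words:
            if not w:
                continue
            alpha, scale = [], []  # scaled forward pass
            for t, s in enumerate(w):
                if t == 0:
                    a = [start[k] * emit[k][s] for k in range(2)]
                else:
                    a = [sum(alpha[-1][j] * trans[j][k] for j in range(2))
                         * emit[k][s] for k in range(2)]
                c = sum(a)
                alpha.append([x / c for x in a])
                scale.append(c)
            beta = [[1.0, 1.0] for _ in w]  # scaled backward pass
            for t in range(len(w) - 2, -1, -1):
                for j in range(2):
                    beta[t][j] = sum(trans[j][k] * emit[k][w[t + 1]]
                                     * beta[t + 1][k]
                                     for k in range(2)) / scale[t + 1]
            for t, s in enumerate(w):  # accumulate state posteriors
                g = [alpha[t][k] * beta[t][k] for k in range(2)]
                z = sum(g)
                for k in range(2):
                    counts[k][s] += g[k] / z
        emit = []
        for k in range(2):
            z = sum(counts[k].values())
            emit.append({s: v / z for s, v in counts[k].items()})
    return emit

def state0_share(emit, symbol):
    """P(state 0 | symbol), assuming both states are equally likely a priori."""
    e0, e1 = emit[0][symbol], emit[1][symbol]
    return e0 / (e0 + e1)
```

One caveat with any sketch like this: the two states are interchangeable, so the model only separates symbols into two classes. Deciding which class is the “vowel-like” one needs outside evidence, and EM can also settle in a local optimum depending on the initialization.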
But the HMM alone isn’t enough. So I add metrics that capture linguistic tendencies:
- Coverage: vowels tend to appear in many different words.
- Neighbor entropy: vowels have diverse neighbors on both sides (they can be surrounded by many different consonants).
- Repetition rate: I apply a threshold on how often a symbol appears doubled (consonants are doubled more often than vowels).
- Position: vowels often appear inside words, not only at the edges.
Each of these gets a weight, and I combine them with the HMM score into a calibrated probability of being a vowel.
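The four metrics and their combination can be sketched as follows. This is only an illustration under stated assumptions: the exact definitions (for example, taking the smaller of the left and right neighbor entropies to capture “both sides”), the logistic combination, and the `weights` dictionary are hypothetical choices, and the sketch does not include a real calibration step.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy in bits of a Counter of neighbor counts."""
    total = sum(counter.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def glyph_features(words):
    """words: list of token lists. Returns {token: {feature: value}}."""
    in_words, occ = Counter(), Counter()
    doubled, interior = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for w in words:
        in_words.update(set(w))  # count each token once per word
        for i, tok in enumerate(w):
            occ[tok] += 1
            if 0 < i < len(w) - 1:
                interior[tok] += 1  # neither first nor last in the word
            if i > 0:
                left[tok][w[i - 1]] += 1
                if w[i - 1] == tok:
                    doubled[tok] += 1  # second half of a doubled pair
            if i < len(w) - 1:
                right[tok][w[i + 1]] += 1
    return {
        tok: {
            "coverage": in_words[tok] / len(words),
            "neighbor_entropy": min(entropy(left[tok]), entropy(right[tok])),
            "repetition": doubled[tok] / occ[tok],
            "interior": interior[tok] / occ[tok],
        }
        for tok in occ
    }

def vowel_probability(feats, hmm_score, weights, bias=0.0):
    """Logistic combination of the HMM score and the features (assumed form)."""
    z = bias + weights["hmm"] * hmm_score
    z += sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Since consonants are expected to be doubled more often, the `repetition` weight would normally be negative, while the other weights would be positive.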
Languages typically have only a handful of vowels (say 4–8). So instead of accepting every symbol above 0.5, I impose a parsimonious prior:
- I target ~6 vowels in total.
- Candidates must pass a minimum probability threshold.
- I choose the set that minimizes structural loss, i.e., reduces long VV or CC runs and increases alternation.
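The selection rules above could be sketched like this. The sketch simplifies “structural loss” to the fraction of adjacent tokens that fall in the same class (VV or CC), which penalizes runs only indirectly, and it adds a hypothetical `size_penalty` to pull the set size toward the target. The exhaustive search and the `max_candidates` cap are also assumptions, not necessarily how the script does it.

```python
from itertools import combinations

def structural_loss(words, vowels):
    """Fraction of adjacent token pairs that do not alternate (VV or CC)."""
    same = total = 0
    for w in words:
        labels = [tok in vowels for tok in w]
        for a, b in zip(labels, labels[1:]):
            total += 1
            same += (a == b)
    return same / total if total else 0.0

def select_vowels(words, probs, target=6, min_prob=0.6,
                  size_penalty=0.05, max_candidates=12):
    """Pick the candidate subset with the lowest loss plus size penalty.

    words: list of token lists; probs: {token: probability of being a vowel}.
    """
    # Keep only candidates above the threshold, strongest first.
    cands = sorted((t for t, p in probs.items() if p >= min_prob),
                   key=probs.get, reverse=True)[:max_candidates]
    best, best_cost = frozenset(), float("inf")
    for k in range(1, len(cands) + 1):
        for combo in combinations(cands, k):
            cost = (structural_loss(words, set(combo))
                    + size_penalty * abs(k - target))
            if cost < best_cost:
                best, best_cost = frozenset(combo), cost
    return best
```

With at most 12 candidates the search covers 4,095 subsets, so brute force is cheap. A larger candidate pool would need a greedy or beam search instead.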
The script can be configured for different language types (Latin-like, Indo-European, Semitic, syllabic, and so on) by setting the vowel patterns expected for each.
The result is a short list of the most “vowel-like” glyphs or digraphs.
I have tested four configurations:
Indo-European languages often have 5–8 vowels, noticeable V↔C alternation, and frequent diphthongs; this config favors ~6 vowels, rewards alternation and neighbor diversity, and lets strong data-driven digraphs emerge without hardwired seeds. This is the result:
Indo-European
Voynichese | EVA | prob_vowel |
o | o | 0.983284 |
y | y | 0.970378 |
a | a | 0.886179 |
k | k | 0.865411 |
t | t | 0.824919 |
l | l | 0.709329 |
ch | ch | 0.684790 |
c | c | 0.660998 |
f | f | 0.658652 |
p | p | 0.640008 |
s | s | 0.624870 |
ai | ai | 0.602096 |
Semitic scripts often omit short vowels and allow heavy consonant clustering; this setup targets ~3 vowel-like symbols, relaxes alternation and repetition penalties, and keeps digraph selection fully data-driven. This is the result:
Semitic
Voynichese | EVA | prob_vowel |
o | o | 0.978236 |
y | y | 0.904206 |
k | k | 0.888190 |
t | t | 0.809498 |
d | d | 0.792952 |
a | a | 0.763163 |
l | l | 0.750061 |
h | h | 0.661877 |
s | s | 0.654497 |
r | r | 0.632587 |
p | p | 0.625348 |
f | f | 0.619525 |
Syllabic scripts pack consonant+vowel into single signs, so we weaken alternation, fuse more bigrams to approximate CV units, use gentler repetition/edge penalties, and select a flexible 3–7 set of vowel-like nuclei. This is the result:
Syllabic
Voynichese | EVA | prob_vowel |
o | o | 0.975626 |
y | y | 0.945105 |
t | t | 0.796126 |
a | a | 0.756872 |
k | k | 0.748639 |
r | r | 0.702806 |
ch | ch | 0.685732 |
l | l | 0.643413 |
ol | ol | 0.634971 |
s | s | 0.608752 |
sh | sh | 0.604313 |
d | d | 0.581746 |
Arabic is an abjad where short vowels are often omitted and consonant clusters are common, so this config targets ~3 vowel-like symbols, weakens alternation and repetition penalties, and keeps digraph discovery fully data-driven. This is the result:
Arabic
Voynichese | EVA | prob_vowel |
o | o | 0.960930 |
y | y | 0.866163 |
k | k | 0.852621 |
t | t | 0.762965 |
d | d | 0.744871 |
s | s | 0.711418 |
l | l | 0.705106 |
a | a | 0.693979 |
h | h | 0.689840 |
e | e | 0.660995 |
r | r | 0.618145 |
ch | ch | 0.598030 |
In every configuration tested, the glyph o comes out as the clearest vowel, with y and a also very strong. What surprises me most is the appearance of k and t: the common gallows seem to act as “vowel-like.” They might be semi-vowels or special cases, but their distribution fits a “vowel-like” pattern well. As you can see, some configurations put digraphs in the top “vowel-like” list, which shows that the configuration chosen for each language type changes the output.