27-08-2025, 11:09 PM
Dear all,
These holidays I’ve been exploring how to analyze the Voynich manuscript’s word structure with models from the text-analysis branch of data science. While reading about Markov models I came across the Hidden Markov Model (HMM) and experimented with code that runs on the EVA transliteration. The goal is not decipherment, but to quantify recurring patterns in how word pieces combine and to score how “typical” each word looks under those rules.
A Hidden Markov Model is a simple probabilistic model with hidden states (unseen “roles”), transition probabilities between states, and emission probabilities for the observable symbols. From data, an HMM learns those transitions and emissions; for a new sequence it can decode the most likely path of roles and compute a log-likelihood that tells how well the model explains the sequence. Voynich "words" show strong positional regularities (common openings and endings), and an HMM gives a compact way to (i) discover recurring roles behind word pieces, (ii) quantify which pieces go where, and (iii) measure how typical a word is under the learned rules.
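To make this concrete, here is a minimal, self-contained Python sketch of the two computations an HMM supports once its parameters are known: Viterbi decoding (most likely role path) and the forward log-likelihood. All numbers and token IDs below are toy values for illustration, not anything learned from the manuscript.
[code]
import numpy as np

# Toy HMM with 3 hidden "roles" and 4 observable morph tokens.
# All probabilities here are made up for illustration.
log_start = np.log([0.6, 0.3, 0.1])                 # P(first state)
log_trans = np.log([[0.1, 0.7, 0.2],                # P(next state | state)
                    [0.1, 0.3, 0.6],
                    [0.5, 0.3, 0.2]])
log_emit  = np.log([[0.7, 0.1, 0.1, 0.1],           # P(token | state)
                    [0.1, 0.6, 0.2, 0.1],
                    [0.1, 0.1, 0.2, 0.6]])

def viterbi(obs):
    """Most likely state path and its log-probability for a token-ID sequence."""
    v = log_start + log_emit[:, obs[0]]             # best log-prob ending in each state
    back = []
    for o in obs[1:]:
        scores = v[:, None] + log_trans             # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))          # best previous state for each j
        v = scores.max(axis=0) + log_emit[:, o]
    path = [int(v.argmax())]
    for ptr in reversed(back):                      # walk the backpointers
        path.append(int(ptr[path[-1]]))
    return path[::-1], float(v.max())

def loglik(obs):
    """Total log-likelihood via the forward algorithm (sums over all paths)."""
    a = log_start + log_emit[:, obs[0]]
    for o in obs[1:]:
        a = np.logaddexp.reduce(a[:, None] + log_trans, axis=0) + log_emit[:, o]
    return float(np.logaddexp.reduce(a))

word = [0, 1, 3]                                    # e.g. pre:qo, st:dai, suf:n as IDs
print(viterbi(word), loglik(word))
[/code]
In practice the transition and emission tables are not written by hand like this but learned from the corpus with Baum-Welch (EM); decoding and scoring then work exactly as above.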
What I did:
- Tokenized and cleaned paragraphs.
- Discovered affix candidates (short, frequent strings with high branching entropy).
- Segmented each word into prefix | stem | suffix with a simple scorer that prefers productive affixes and reasonable stem lengths (a simplified sketch of these two steps follows this list).
- Built sequences of morphological tokens (such as pre:qo, st:dai, suf:n).
- Trained an HMM on those sequences (learning states, transitions, and emissions).
- Decoded each word with Viterbi to get its state path and log-likelihood.
- Exported a state-transition graph (below) and an interactive viewer to explore words (click the link; I recommend checking it out!)
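Here is a heavily simplified sketch of the affix-discovery and segmentation steps. The word list, thresholds, and the hard-coded suffix set are made-up illustrations; the real run works on the full EVA word list, and suffix candidates come out of the same entropy screening as the prefixes.
[code]
from collections import Counter
from math import log2

words = ["qokedy", "qokeedy", "qokain", "daiin", "okaiin", "chedy", "shedy"]  # toy sample

def branching_entropy(counts):
    """Shannon entropy of the continuations observed after a candidate affix."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# Count which character follows each short word-initial string (prefix candidates).
follow = {}
for w in words:
    for k in range(1, 4):                       # candidate prefixes of length 1..3
        if len(w) > k:
            follow.setdefault(w[:k], Counter())[w[k]] += 1

# Keep frequent candidates whose continuation entropy is high: many different
# stems can follow them, which is what a productive prefix looks like.
prefixes = {p for p, c in follow.items()
            if sum(c.values()) >= 2 and branching_entropy(c) > 0.8}

def segment(w, prefixes, suffixes={"dy", "in", "n"}):   # suffix set: illustration only
    """Greedy prefix | stem | suffix split that keeps the stem at least 2 chars."""
    pre = next((p for p in sorted(prefixes, key=len, reverse=True)
                if w.startswith(p) and len(w) - len(p) >= 2), "")
    rest = w[len(pre):]
    suf = next((s for s in sorted(suffixes, key=len, reverse=True)
                if rest.endswith(s) and len(rest) - len(s) >= 2), "")
    stem = rest[:len(rest) - len(suf)] if suf else rest
    return pre, stem, suf

for w in words:
    pre, stem, suf = segment(w, prefixes)
    print(w, "->", [t for t in (f"pre:{pre}" if pre else "",
                                f"st:{stem}",
                                f"suf:{suf}" if suf else "") if t])
[/code]
The actual scorer I described also weighs affix productivity against stem length instead of taking the first greedy match, but the split logic is the same idea.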
How to read the state graph:
- Boxes are states (latent “roles”). Each label lists the state’s top pieces: P = top prefix fragments, T = top stem fragments, F = top final/suffix fragments.
- Edges are transitions and their width is proportional to probability. Styles mean different things:
- solid black = within a word (typical flows such as START → MID and MID → END)
- dashed gray = from the end of one word to the beginning of the next (e.g., END → START or END → MID; also MID → START when a word has no explicit suffix)
- dotted light = unusual directions
- After training, some states come to behave like START, MID, or END, according to how they are used.
- Read the solid paths left to right to see typical inside-word sequences of roles. Dashed paths show what tends to follow in the next word. For example, a dashed S5 (END) → S4 (MID) means words that end in S5 are often followed by a word that begins in an S4-like role; it does not mean “go back to MID within the same word.”
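For anyone curious how such a graph can be produced, here is a small sketch that writes Graphviz DOT text from a transition matrix, with edge width proportional to probability. The matrix, role labels, and the solid/dashed rule are placeholders; properly separating within-word from cross-word edges requires tracking word boundaries during decoding, which I omit here.
[code]
import numpy as np

roles = ["S0 (START)", "S1 (MID)", "S2 (END)"]
trans = np.array([[0.05, 0.80, 0.15],
                  [0.05, 0.35, 0.60],
                  [0.55, 0.30, 0.15]])    # row i: P(next state j | state i)

lines = ["digraph hmm {", "  rankdir=LR;"]
for i, row in enumerate(trans):
    for j, p in enumerate(row):
        if p < 0.05:
            continue                      # drop negligible edges to keep the graph readable
        style = "dashed" if i > j else "solid"   # crude stand-in for word-boundary styling
        lines.append(f'  "{roles[i]}" -> "{roles[j]}" '
                     f'[penwidth={1 + 6 * p:.2f}, style={style}, label="{p:.2f}"];')
lines.append("}")
print("\n".join(lines))                   # paste the output into any Graphviz renderer
[/code]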
What the interactive viewer shows when you click a word:
- Basics: the (prefix, stem, suffix) segmentation; the morph tokens used (e.g., pre:qo, st:dai, suf:n); the token IDs and whether any mapped to UNK; the word’s log-likelihood under the HMM.
- Decoded path: the state sequence (e.g., S3 → S2 → S5) and mapped role names; for each morph token, its emission probability in that state and its rank among that state’s emissions (how typical it is).
- State context: for each state visited by the word, the state’s top prefix/stem/suffix pieces, plus the pieces observed in the current word with their probabilities.
- Across the page, each word is colored by log-likelihood bin from red to green (least to most typical), using the 1st–99th percentile range to avoid outliers.
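The coloring step itself is simple; a sketch of the percentile clipping and binning (with made-up scores and an arbitrary 5-bin red-to-green palette) could look like this:
[code]
import numpy as np

# Per-word log-likelihoods from the HMM (random toy values here).
logliks = np.random.default_rng(0).normal(-12, 3, size=200)

lo, hi = np.percentile(logliks, [1, 99])      # 1st-99th percentile, ignoring outliers
clipped = np.clip(logliks, lo, hi)
bins = np.floor((clipped - lo) / (hi - lo + 1e-9) * 5).astype(int)   # 5 bins: 0..4
palette = np.array(["#d73027", "#fc8d59", "#fee08b", "#91cf60", "#1a9850"])
print(palette[bins][:10])                     # one color per word, red -> green
[/code]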
How this could help with the manuscript:
- A compact, testable “grammar of pieces” for Voynich words.
- A way to compare sections or folios: train on one part, score another.
- A tool to spot anomalies or outliers (very low-likelihood words) and dominant regularities (roles and paths).
- Exported emissions and transitions for further statistics and plots.
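As a usage note on the train-on-one-part, score-another idea: reusing the loglik function from the first sketch above (the section names and token-ID sequences here are hypothetical), per-token normalization keeps words of different lengths comparable:
[code]
# Train the HMM on words from one section, then score words from another
# with the same frozen model; low scores flag atypical words.
sectionA = [[0, 1, 3], [0, 2, 3], [0, 1, 2]]   # token-ID sequences used for training
sectionB = [[1, 2, 3], [2, 2, 0]]              # held-out sequences to score

scores = [loglik(w) / len(w) for w in sectionB]    # per-token log-likelihood
print(sorted(scores))
[/code]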