27-08-2025, 11:09 PM
Dear all,
These holidays I’ve been exploring how to analyze the Voynich manuscript’s word structure with models from the text-analysis branch of data science. While reading about Markov models I came across the Hidden Markov Model (HMM) and experimented with code that runs on the EVA transliteration. The goal is not decipherment, but to quantify recurring patterns in how word pieces combine and to score how “typical” each word looks under those rules.
A Hidden Markov Model is a simple probabilistic model with hidden states (unseen “roles”), transition probabilities between states, and emission probabilities for the observable symbols. From data, an HMM learns those transitions and emissions; for a new sequence it can decode the most likely path of roles and compute a log-likelihood that tells how well the model explains the sequence. Voynich "words" show strong positional regularities (common openings and endings), and an HMM gives a compact way to (i) discover recurring roles behind word pieces, (ii) quantify which pieces go where, and (iii) measure how typical a word is under the learned rules.
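To make this concrete, here is a minimal, self-contained Python sketch of the two computations an HMM supports once its parameters are known: Viterbi decoding (most likely role path) and the forward log-likelihood. All numbers and token IDs below are toy values for illustration, not anything learned from the manuscript.
[code]
import numpy as np

# Toy HMM with 3 hidden "roles" and 4 observable morph tokens.
# All probabilities here are made up for illustration.
log_start = np.log([0.6, 0.3, 0.1])                 # P(first state)
log_trans = np.log([[0.1, 0.7, 0.2],                # P(next state | state)
                    [0.1, 0.3, 0.6],
                    [0.5, 0.3, 0.2]])
log_emit  = np.log([[0.7, 0.1, 0.1, 0.1],           # P(token | state)
                    [0.1, 0.6, 0.2, 0.1],
                    [0.1, 0.1, 0.2, 0.6]])

def viterbi(obs):
    """Most likely state path and its log-probability for a token-ID sequence."""
    v = log_start + log_emit[:, obs[0]]             # best log-prob ending in each state
    back = []
    for o in obs[1:]:
        scores = v[:, None] + log_trans             # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))          # best previous state for each j
        v = scores.max(axis=0) + log_emit[:, o]
    path = [int(v.argmax())]
    for ptr in reversed(back):                      # walk the backpointers
        path.append(int(ptr[path[-1]]))
    return path[::-1], float(v.max())

def loglik(obs):
    """Total log-likelihood via the forward algorithm (sums over all paths)."""
    a = log_start + log_emit[:, obs[0]]
    for o in obs[1:]:
        a = np.logaddexp.reduce(a[:, None] + log_trans, axis=0) + log_emit[:, o]
    return float(np.logaddexp.reduce(a))

word = [0, 1, 3]                                    # e.g. pre:qo, st:dai, suf:n as IDs
print(viterbi(word), loglik(word))
[/code]
In practice the transition and emission tables are not written by hand like this but learned from the corpus with Baum-Welch (EM); decoding and scoring then work exactly as above.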
What I did:
- Tokenized and cleaned paragraphs.
- Discovered affix candidates (short, frequent strings with high branching entropy).
- Segmented each word into prefix | stem | suffix with a simple scorer that prefers productive affixes and reasonable stem lengths (a simplified sketch of these two steps follows this list).
- Built sequences of morphological tokens (such as pre:qo, st:dai, suf:n).
- Trained an HMM on those sequences (learning states, transitions, and emissions).
- Decoded each word with Viterbi to get its state path and log-likelihood.
- Exported a state-transition graph (below) and an interactive viewer to explore words (click the link; I recommend checking it out!)
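Here is a heavily simplified sketch of the affix-discovery and segmentation steps. The word list, thresholds, and the hard-coded suffix set are made-up illustrations; the real run works on the full EVA word list, and suffix candidates come out of the same entropy screening as the prefixes.
[code]
from collections import Counter
from math import log2

words = ["qokedy", "qokeedy", "qokain", "daiin", "okaiin", "chedy", "shedy"]  # toy sample

def branching_entropy(counts):
    """Shannon entropy of the continuations observed after a candidate affix."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# Count which character follows each short word-initial string (prefix candidates).
follow = {}
for w in words:
    for k in range(1, 4):                       # candidate prefixes of length 1..3
        if len(w) > k:
            follow.setdefault(w[:k], Counter())[w[k]] += 1

# Keep frequent candidates whose continuation entropy is high: many different
# stems can follow them, which is what a productive prefix looks like.
prefixes = {p for p, c in follow.items()
            if sum(c.values()) >= 2 and branching_entropy(c) > 0.8}

def segment(w, prefixes, suffixes={"dy", "in", "n"}):   # suffix set: illustration only
    """Greedy prefix | stem | suffix split that keeps the stem at least 2 chars."""
    pre = next((p for p in sorted(prefixes, key=len, reverse=True)
                if w.startswith(p) and len(w) - len(p) >= 2), "")
    rest = w[len(pre):]
    suf = next((s for s in sorted(suffixes, key=len, reverse=True)
                if rest.endswith(s) and len(rest) - len(s) >= 2), "")
    stem = rest[:len(rest) - len(suf)] if suf else rest
    return pre, stem, suf

for w in words:
    pre, stem, suf = segment(w, prefixes)
    print(w, "->", [t for t in (f"pre:{pre}" if pre else "",
                                f"st:{stem}",
                                f"suf:{suf}" if suf else "") if t])
[/code]
The actual scorer I described also weighs affix productivity against stem length instead of taking the first greedy match, but the split logic is the same idea.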
How to read the state graph:
- Boxes are states (latent “roles”). Each label lists the state’s top pieces: P = top prefix fragments, T = top stem fragments, F = top final/suffix fragments.
- Edges are transitions and their width is proportional to probability. Styles mean different things:
- solid black = within a word (typical flows such as START → MID and MID → END)
- dashed gray = from the end of one word to the beginning of the next (e.g., END → START or END → MID; also MID → START when a word has no explicit suffix)
- dotted light = unusual directions
- After training, some states come to behave like START, MID, or END, according to how they are used.
- Read the solid paths left to right to see typical inside-word sequences of roles. Dashed paths show what tends to follow in the next word. For example, a dashed S5 (END) → S4 (MID) means words that end in S5 are often followed by a word that begins in an S4-like role; it does not mean “go back to MID within the same word.”
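For anyone curious how such a graph can be produced, here is a small sketch that writes Graphviz DOT text from a transition matrix, with edge width proportional to probability. The matrix, role labels, and the solid/dashed rule are placeholders; properly separating within-word from cross-word edges requires tracking word boundaries during decoding, which I omit here.
[code]
import numpy as np

roles = ["S0 (START)", "S1 (MID)", "S2 (END)"]
trans = np.array([[0.05, 0.80, 0.15],
                  [0.05, 0.35, 0.60],
                  [0.55, 0.30, 0.15]])    # row i: P(next state j | state i)

lines = ["digraph hmm {", "  rankdir=LR;"]
for i, row in enumerate(trans):
    for j, p in enumerate(row):
        if p < 0.05:
            continue                      # drop negligible edges to keep the graph readable
        style = "dashed" if i > j else "solid"   # crude stand-in for word-boundary styling
        lines.append(f'  "{roles[i]}" -> "{roles[j]}" '
                     f'[penwidth={1 + 6 * p:.2f}, style={style}, label="{p:.2f}"];')
lines.append("}")
print("\n".join(lines))                   # paste the output into any Graphviz renderer
[/code]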
What the interactive viewer shows when you click a word:
- Basics: the (prefix, stem, suffix) segmentation; the morph tokens used (e.g., pre:qo, st:dai, suf:n); the token IDs and whether any mapped to UNK; the word’s log-likelihood under the HMM.
- Decoded path: the state sequence (e.g., S3 → S2 → S5) and mapped role names; for each morph token, its emission probability in that state and its rank among that state’s emissions (how typical it is).
- State context: for each state visited by the word, the state’s top prefix/stem/suffix pieces, plus the pieces observed in the current word with their probabilities.
- Across the page, each word is colored by log-likelihood bin from red to green (least to most typical), using the 1st–99th percentile range to avoid outliers.
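The coloring step itself is simple; a sketch of the percentile clipping and binning (with made-up scores and an arbitrary 5-bin red-to-green palette) could look like this:
[code]
import numpy as np

# Per-word log-likelihoods from the HMM (random toy values here).
logliks = np.random.default_rng(0).normal(-12, 3, size=200)

lo, hi = np.percentile(logliks, [1, 99])      # 1st-99th percentile, ignoring outliers
clipped = np.clip(logliks, lo, hi)
bins = np.floor((clipped - lo) / (hi - lo + 1e-9) * 5).astype(int)   # 5 bins: 0..4
palette = np.array(["#d73027", "#fc8d59", "#fee08b", "#91cf60", "#1a9850"])
print(palette[bins][:10])                     # one color per word, red -> green
[/code]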
How this could help with the manuscript:
- A compact, testable “grammar of pieces” for Voynich words.
- A way to compare sections or folios: train on one part, score another.
- A tool to spot anomalies or outliers (very low-likelihood words) and dominant regularities (roles and paths).
- Exported emissions and transitions for further statistics and plots.
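As a usage note on the train-on-one-part, score-another idea: reusing the loglik function from the first sketch above (the section names and token-ID sequences here are hypothetical), per-token normalization keeps words of different lengths comparable:
[code]
# Train the HMM on words from one section, then score words from another
# with the same frozen model; low scores flag atypical words.
sectionA = [[0, 1, 3], [0, 2, 3], [0, 1, 2]]   # token-ID sequences used for training
sectionB = [[1, 2, 3], [2, 2, 0]]              # held-out sequences to score

scores = [loglik(w) / len(w) for w in sectionB]    # per-token log-likelihood
print(sorted(scores))
[/code]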