OK, here is an overview:
1. Grapheme normalization
Each Voynich word is first normalized:
- Replace ch → C
- Replace gallows letters (t, k, p, f) → G
Then map characters into a reduced alphabet:
- C, G stay unchanged
- y → Y
- q → Q
- vowels → V
- all other letters → X
This produces a skeleton string encoding coarse word shape.
Example for the word "chekcheor":
Step-by-step:
- ch → C: CekCeor
- k → G: CeGCeor
- e, o → V and r → X: CVGCVVX
Resulting skeleton: chekcheor → CVGCVVX
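The two normalization passes can be folded into a single left-to-right scan. This is only a minimal sketch, not the generator's actual code: the function name skeleton is made up for illustration, and the vowel set (a, e, i, o) is an assumption, since the overview does not list the vowels. A lone c or h that is not part of "ch" falls through to X here, which is also an assumption.

```python
# Assumption: the vowels are a, e, i, o (the overview does not list them).
VOWELS = set("aeio")
GALLOWS = set("tkpf")

def skeleton(word: str) -> str:
    """Map a transliterated word to its coarse skeleton string."""
    out = []
    i = 0
    while i < len(word):
        # The digraph "ch" is collapsed to a single symbol C.
        if word.startswith("ch", i):
            out.append("C")
            i += 2
            continue
        ch = word[i]
        if ch in GALLOWS:
            out.append("G")
        elif ch == "y":
            out.append("Y")
        elif ch == "q":
            out.append("Q")
        elif ch in VOWELS:
            out.append("V")
        else:
            # All other letters, including a stray c or h (assumption).
            out.append("X")
        i += 1
    return "".join(out)

print(skeleton("chekcheor"))  # CVGCVVX
```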
2. Frame extraction from skeletons
Each skeleton is split into:
- ONSET: substring before the first vowel (V)
- CODA: substring after the last vowel
- The central vowel region is discarded
This yields a frame: frame = (ONSET, CODA)
Example using skeleton CVGCVVX:
Structure:
- first V at position 1 (counting from 0)
- last V at position 5
So:
ONSET = C
CODA = X
Frame = (C, X)
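The ONSET/CODA split is a pair of string searches on the skeleton. A minimal sketch follows; the name frame_of is hypothetical, and the handling of skeletons that contain no V at all (whole skeleton as ONSET, empty CODA) is an assumption, because the overview does not cover that case.

```python
from typing import Tuple

def frame_of(skel: str) -> Tuple[str, str]:
    """Split a skeleton into (ONSET, CODA), discarding the vowel region."""
    first = skel.find("V")
    if first == -1:
        # Assumption: with no vowel, treat the whole skeleton as the onset.
        return (skel, "")
    last = skel.rfind("V")
    # Everything from the first V through the last V is dropped.
    return (skel[:first], skel[last + 1:])

print(frame_of("CVGCVVX"))  # ('C', 'X')
```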
3. Frame induction from corpus
All observed (ONSET, CODA) pairs are extracted from the corpus:
- frames are defined as distinct observed pairs
- frames are ranked by frequency
- the most frequent frames form the model’s state space
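Frame induction amounts to counting pairs and keeping the top of the ranking. The sketch below assumes a cutoff parameter top_k, which is hypothetical: the overview only says "the most frequent frames" without giving a number. The function name induce_frames is also made up for illustration.

```python
from collections import Counter

def induce_frames(frames, top_k):
    """Return the top_k most frequent (ONSET, CODA) pairs, most frequent first.

    frames: iterable of (onset, coda) tuples, one per corpus word.
    top_k:  hypothetical cutoff for the size of the state space.
    """
    counts = Counter(frames)
    return [f for f, _ in counts.most_common(top_k)]

# Toy input, not real corpus counts.
toy = [("C", "X")] * 3 + [("", "X")] * 2 + [("Q", "Y")]
print(induce_frames(toy, 2))  # [('C', 'X'), ('', 'X')]
```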
4. Template association with frames
Each corpus word has:
- a skeleton (full structural pattern)
- a derived frame (via ONSET/CODA split)
Thus:
- each frame is associated with a set of observed skeleton templates
- this association is induced statistically from the corpus
Example for frame (C, X):
Observed templates:
CVX
CVGCVVX (chekcheor)
...
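One way to picture this association is a table from frame to a frequency count of skeletons. This is a sketch under assumptions: the names templates_by_frame and frame_of are hypothetical, and the no-vowel fallback in frame_of is not specified in the overview.

```python
from collections import Counter, defaultdict

def frame_of(skel):
    """(ONSET, CODA) split; no-vowel fallback is an assumption."""
    first = skel.find("V")
    if first == -1:
        return (skel, "")
    return (skel[:first], skel[skel.rfind("V") + 1:])

def templates_by_frame(skeletons):
    """Group observed skeletons under their derived frame, with counts."""
    table = defaultdict(Counter)
    for skel in skeletons:
        table[frame_of(skel)][skel] += 1
    return table

# Toy input, not real corpus counts.
table = templates_by_frame(["CVX", "CVGCVVX", "CVX", "QVGY"])
print(sorted(table[("C", "X")].items()))  # [('CVGCVVX', 1), ('CVX', 2)]
```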
5. Markov model over frames
A Hidden Markov Model is trained in which:
- hidden states = frames
- transitions = empirical frame-to-frame transitions observed in lines
- additional biases exist for:
  - line start states
  - line end states
  - line length distribution
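The statistics listed above can all be gathered in one pass over the lines. The sketch below only collects raw counts; how the generator smooths or normalizes them is not described in the overview, so that part is left out. The function name train_counts is hypothetical.

```python
from collections import Counter, defaultdict

def train_counts(lines):
    """Collect start, transition, end and line-length counts.

    lines: list of lines, each a list of (onset, coda) frames.
    """
    start, end, length = Counter(), Counter(), Counter()
    trans = defaultdict(Counter)
    for line in lines:
        if not line:
            continue  # skip empty lines
        start[line[0]] += 1
        end[line[-1]] += 1
        length[len(line)] += 1
        # Adjacent pairs within a line give the frame-to-frame transitions.
        for a, b in zip(line, line[1:]):
            trans[a][b] += 1
    return start, trans, end, length

# Toy frames for illustration only.
A, B = ("C", "X"), ("", "X")
start, trans, end, length = train_counts([[A, B, A], [A, B]])
print(trans[A][B], trans[B][A])  # 2 1
```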
6. Frame sequence generation
At generation time:
- the first frame of a line is sampled from the line-start distribution
- each subsequent frame is sampled from the transition probabilities of the current frame
Example generated path: (C, X) → (V, X) → (V, X) → (C, X)
Each step is drawn from the transition probabilities of the preceding frame.
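A frame path of this kind can be sampled with a plain weighted random walk. The sketch below is an assumption-laden illustration: the names sample and generate_path are hypothetical, the line length n is passed in rather than drawn from the learned length distribution, the line-end bias is omitted, and restarting from the start distribution at a dead end is an assumption.

```python
import random

def sample(counts, rng):
    """Draw one key from a {key: count} mapping, weighted by count."""
    keys = list(counts)
    weights = [counts[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]

def generate_path(start, trans, n, rng):
    """Sample a path of n frames from start and transition counts."""
    path = [sample(start, rng)]
    while len(path) < n:
        nxt = trans.get(path[-1])
        if nxt:
            path.append(sample(nxt, rng))
        else:
            # Assumption: on a dead end, restart from the start distribution.
            path.append(sample(start, rng))
    return path

# Toy deterministic chain: A always goes to B, B always goes to A.
A, B = ("C", "X"), ("Q", "Y")
print(generate_path({A: 1}, {A: {B: 1}, B: {A: 1}}, 4, random.Random(0)))
```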
7. Template emission conditioned on position
Each frame emits a template according to position-dependent empirical distributions:
- initial position distribution
- middle position distribution
- final position distribution
These distributions are learned from observed corpus frequencies of templates within each frame context.
Emission selects only among templates attested in the corpus; no new templates are constructed.
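Position-dependent emission can be sketched as one count table per (frame, position) pair. Several details here are assumptions not stated in the overview: a one-word line is counted as "initial", an unseen (frame, position) pair backs off to the frame's templates pooled over all positions, and the names position_of, train_emissions and emit_template are hypothetical.

```python
import random
from collections import Counter, defaultdict

def position_of(i, n):
    """Classify word index i in a line of n words (assumed convention)."""
    if i == 0:
        return "initial"
    if i == n - 1:
        return "final"
    return "middle"

def train_emissions(lines):
    """lines: list of lines, each a list of (frame, skeleton) pairs."""
    emit = defaultdict(Counter)
    for line in lines:
        for i, (fr, skel) in enumerate(line):
            emit[(fr, position_of(i, len(line)))][skel] += 1
    return emit

def emit_template(emit, fr, pos, rng):
    """Sample an attested template for frame fr at position pos."""
    dist = emit.get((fr, pos))
    if not dist:
        # Assumption: back off to this frame's templates at any position.
        dist = Counter()
        for (f, _), counts in emit.items():
            if f == fr:
                dist.update(counts)
    if not dist:
        raise KeyError("no templates observed for frame %r" % (fr,))
    templates = list(dist)
    weights = [dist[t] for t in templates]
    return rng.choices(templates, weights=weights, k=1)[0]
```

Because sampling only ever draws keys from the count tables, it cannot return a template that was not observed in training.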
8. Surface word realization (lexical sampling)
Each emitted template indexes a bucket of attested corpus words:
- the final output word is sampled from this set
- no new words are constructed at this stage
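The bucket lookup can be sketched as a table from template to attested words. Whether the generator samples uniformly or by corpus frequency is not stated in the overview; the sketch below assumes frequency weighting. The names build_buckets and realize are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_buckets(pairs):
    """pairs: iterable of (word, template) tuples from the corpus."""
    buckets = defaultdict(Counter)
    for word, template in pairs:
        buckets[template][word] += 1
    return buckets

def realize(template, buckets, rng):
    """Pick an attested word for the template (frequency-weighted, assumed)."""
    bucket = buckets[template]
    words = list(bucket)
    weights = [bucket[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy corpus slice for illustration only.
pairs = [("chol", "CVX"), ("chor", "CVX"), ("chol", "CVX"),
         ("chekcheor", "CVGCVVX")]
buckets = build_buckets(pairs)
print(realize("CVGCVVX", buckets, random.Random(0)))  # chekcheor
```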
The generator is trained separately on Currier A and Currier B, producing different texts for these two languages.
Example (Currier A):
lchdy qokol olo chekcheor chol
tchor chol qoaiin qoteol cho dy
tchory
chopchal chody tos kcheey ainy chaiin
ydar cho cholkol cheeykeem
kcho dair dar daiin chok chy daiin
alam
toleechal daiin chol saiin
oldal cheeky chol chotchy keol chan okar cholfy
daiin chory daiiin ches aiin ykeol
choly dor choo chody aiin olchy qoty
ykeey al otoldy saiin choek chos chor
ol
dar daiin damo chol alaiinom okoldy daiin
daiin okaraiin ols ldy ykoaiin dain cheokeey
okeeor cheos qokchey qokod chetchy oeeeb
chkor tchor chod otchol chaiin chain cho
otcheey pykchy okoldg ytchom otoldy ol
lchal choky dar cheky chor ykaiin dal
qotchy qokeol qotomody chear chey
kochor olkor chol chor chor chaiin kchy
schey qopchy kchol olchor olaiin chokchol oty cheodar
dchokchy chotchy daiin dal