The Voynich Ninja

Full Version: update to Zattera's slot machine
(11-05-2026, 06:12 PM)Mauro Wrote: I'm not much into Python, so I cannot check the code, I'm sorry.

I haven't yet understood which are the final 3 slot grammars being compared, Models A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?

All 3 models share the same structure: 
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']

Here are the main differences:
  • A: all slot combinations treated as chunks
  • B (Zattera, one loop): slot sequence as a single-pass automaton (one pass through the 7-slot sequence, then training removes edges to improve F1 as Zattera did; output words are paths through the slot chain)
  • C (Zattera, 2 loops): slot sequence as a two-pass automaton 
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure: 
Slot1 = ...

My old "crust-mantle-core" model, discussed starting at [link], seems to be more demanding than that one, except perhaps for the placement of [aoy].

Here is my current preferred version.   It is described as a sequence of four parsing steps/levels for convenience, but can be encoded as a single loop-free finite automaton, although some of the "total count" constraints in the last level are more compactly described by an algorithm.

CLEANUP pass: The character m is considered an abbreviation for in.  The combination ir is assumed to be a scribal error for iin.  The characters b g u are malformed versions of other characters, possibly n m an. The glyphs Ih ITh IKh etc are malformed versions of Ch CTh CKh etc. A doubled hh is a malformed version of he. The glyphs Cs and sh are just variant forms of Sh. The abbreviations should be expanded, and the erroneous characters and combinations should be mapped to the most likely correct ones before applying the following levels of parsing.
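
A minimal Python sketch of part of this pass. It covers only the three substitutions that are stated outright above (m to in, ir to iin, hh to he), applied to lowercase EVA. The name `clean_word` and the rule order are assumptions for illustration; the b/g/u and Ih/ITh/IKh repairs are left out because their targets are only given as possibilities.

```python
# Only the substitutions stated without hedging, on lowercase EVA.
# The order is an assumption; these three rules do not interfere with each other.
CLEAN_RULES = [
    ("ir", "iin"),  # "ir" assumed to be a scribal error for "iin"
    ("m", "in"),    # "m" treated as an abbreviation of "in"
    ("hh", "he"),   # doubled "hh" treated as a malformed "he"
]


def clean_word(word: str) -> str:
    """Apply the CLEANUP substitutions to one EVA word (hypothetical helper)."""
    for old, new in CLEAN_RULES:
        word = word.replace(old, new)
    return word
```

With these rules both "dair" and "daim" come out as "daiin", which matches the remark that ir and im are replaced before the parsing levels are applied.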

ELEM level: A cleaned word of the VMS (in the EVA encoding) passes this level if it can be parsed into a string of elements drawn from the following sets:
  • "Q" just {q}.
  • "D" the /dealers/ {d} {l} {r} {s}.
  • "X" the /benches/ {ch} {sh} {ee} with an optional 'e' suffix.
  • "G" the /simple gallows/ {k} {t} {p} {f} with optional 'e' suffix.
  • "H" the /platform gallows/ {cth} {ckh} {cph} {cfh} with optional 'e' or 'h' suffix.
  • "N" the /codas/ {n}, {in}, {iin}, {iiin}, 
Note that e and i are not valid elements per se. An e can occur only after a bench or a gallows, or as part of an element {ee} or {eee}.  An i can only appear as part of a coda element {in}, {iin} or {iiin}. Recall that ir and im are replaced in the CLEAN pass.

Thus, for instance
ockhechdy (oCKheChdy) is parsed as {o}{ckhe}{ch}{d}{y} 
qokaiin is parsed as {q}{o}{k}{a}{iin}
chedy is parsed as {che}{d}{y}
cheedy is parsed as {ch}{ee}{d}{y}
cheeedy is parsed as {che}{ee}{d}{y}
cheeeedy is parsed as {ch}{ee}{ee}{d}{y}

Note that the parsing is ambiguous if a word has three or more e in a row. So cheeedy could also be {ch}{eee}{d}{y}, and cheeeedy could be parsed also as {che}{eee}{d}{y}. The choices above are arbitrary, and have limited implications.
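
The ELEM level can be sketched as a regex tokenizer in Python. This is only an illustration under two assumptions: the O class is taken to be {a} {o} {y}, which is inferred from the examples and not listed in the element sets above, and the ambiguity with runs of e is resolved by taking the optional 'e' suffix only when an even number of e's follows it, which reproduces the arbitrary choices in the examples. The names `ELEMENT` and `parse_elements` are hypothetical.

```python
import re

# Optional 'e' suffix, taken only when an even number of e's follows it,
# so that the remaining e's pair up into {ee} elements.
E_SUFFIX = r"(?:e(?=(?:ee)*(?!e)))?"

ELEMENT = re.compile(
    "|".join([
        rf"(?P<H>c[tkpf]h(?:h|{E_SUFFIX}))",  # platform gallows, optional 'e' or 'h'
        rf"(?P<X>(?:ch|sh|ee){E_SUFFIX})",     # benches, optional 'e'
        rf"(?P<G>[ktpf]{E_SUFFIX})",           # simple gallows, optional 'e'
        r"(?P<N>i{0,3}n)",                     # codas n, in, iin, iiin
        r"(?P<Q>q)",
        r"(?P<D>[dlrs])",                      # dealers
        r"(?P<O>[aoy])",                       # assumption: O = {a} {o} {y}
    ])
)


def parse_elements(word):
    """Return [(class, element), ...], or None if the word fails the ELEM level."""
    pos, parsed = 0, []
    while pos < len(word):
        match = ELEMENT.match(word, pos)
        if match is None:
            return None  # e.g. a stray 'e' or an 'i' not inside a coda
        parsed.append((match.lastgroup, match.group()))
        pos = match.end()
    return parsed
```

A word such as "dair" is rejected by this sketch unless it has been cleaned first, because the i is not part of a coda.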

OKOKO level:  Let K be the set of all elements that are not in the set O, namely K = Q ∪ D ∪ X ∪ G ∪ H ∪ N.  A cleaned word that passed the ELEM level also passes this level if it consists of zero or more K elements with at most two O elements inserted before the first K, between every two consecutive Ks, and after the last K.

Thus, for example, {o}{y} passes this level, {o}{a}{ch}{sh}{o}{r}{o}{a}{d}{o}{y} passes (with pattern OOKKOKOOKOO), whereas {ch}{o}{a}{y}{d}{y} does not (three Os in a row).
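
As a rough Python sketch, the OKOKO test amounts to checking that the O/K pattern of the word never contains three O's in a row. The set `O_ELEMENTS` is an assumption read off the examples (a, o, y), and both function names are hypothetical.

```python
# Assumption: the O class is {a} {o} {y}, as the examples suggest.
O_ELEMENTS = {"a", "o", "y"}


def okoko_pattern(elements):
    """Map a list of parsed elements to a string of 'O' and 'K' letters."""
    return "".join("O" if e in O_ELEMENTS else "K" for e in elements)


def passes_okoko(elements):
    """True if no gap before, between or after K elements holds more than two O's."""
    return "OOO" not in okoko_pattern(elements)
```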

CMC level: A cleaned word that passes the ELEM and OKOKO levels will pass the crust-mantle-core (CMC) level if, after deleting all the O elements, it has the form

    Q^q D^d X^x G^g H^h X^y D^e N^n
   
where 
  • q,g,h,n may be 0 or 1;
  • g+h at most 1 (there can be at most one gallows per word);
  • q+d+e+n <= 3 (there can be at most three of Q, D, and N);
  • x+h+y <= 2 (there can be at most two benches, counting a platform gallows as one bench).
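
The CMC test can be sketched in Python on a string with one class letter per element, taken after the O elements have been deleted (so qokaiin becomes "QGN"). This is an illustration only: `passes_cmc` is a hypothetical name, and the shape regex is one possible encoding of the formula above, with the two sum constraints checked by counting letters.

```python
import re

# Q^q D^d X^x G^g H^h X^y D^e N^n with q, g, h, n in {0, 1} and g + h <= 1.
CMC_SHAPE = re.compile(r"Q?D*X*[GH]?X*D*N?")


def passes_cmc(classes: str) -> bool:
    """Check a class string such as 'QGN' against the CMC shape and sum constraints."""
    if CMC_SHAPE.fullmatch(classes) is None:
        return False
    outer = classes.count("Q") + classes.count("D") + classes.count("N")  # q+d+e+n
    benches = classes.count("X") + classes.count("H")                     # x+h+y
    return outer <= 3 and benches <= 2
```

Because the sum constraints only involve d+e and x+y, it does not matter how the regex splits the D's and X's between the two sides of the gallows.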

Without these sum constraints, the three parsing levels can be realized as a compact finite automaton that can be drawn on a single page.  With the sum constraints, the automaton is about 3x bigger because each state must be unfolded into three states in order to record the three counts in the part already parsed.

The numbers vary depending on the section and transcription version used, but it seems that, after the CLEAN step, about 95% of the tokens pass the other three levels.

There probably are further rules relating the insertion of the "O"s in the CMC pattern.  For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O".   Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc elements.  This and other refinements remain to be explored.

But the main question is how this word model compares to those you are using.

All the best, --stolfi
(13-05-2026, 09:58 PM)Jorge_Stolfi Wrote: My old "crust-mantle-core" model ...

That's VERY interesting indeed, and well worth exploring when I have time. Thanks for sharing!
Oops. I wrote:
(13-05-2026, 09:58 PM)Jorge_Stolfi Wrote: Note that e and i are valid elements per se.

I meant "e and i are not valid elements per se". Sorry.

All the best, --stolfi
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure: 
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']

This would not be able to generate very common words like "qokeey" or "qokedy".
Or am I misunderstanding something?
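
One way to check this is a small reachability sketch in Python. It assumes that a "pass" means walking the seven slots once in order, taking at most one entry per slot, and it ignores the edge pruning done in training. `passes_needed` is a hypothetical helper, not code from the thread.

```python
# Slot lists as posted; the empty string stands for "skip this slot".
SLOTS = [
    ["q", "ch", "sh", "cth", "ckh", "cph", "cfh", ""],
    ["e", ""],
    ["e", ""],
    ["o", "a", ""],
    ["i", "ii", "iii", "iiii", ""],
    ["l", "r", "d", "n", "m", "s", "t", "k", "p", "f", ""],
    ["y", ""],
]


def passes_needed(word, max_passes=5):
    """Smallest number of passes over SLOTS that spells `word`, or None."""
    reachable = {0}  # positions in `word` that some path can have consumed up to
    for n_passes in range(1, max_passes + 1):
        for fillers in SLOTS:
            # From every reachable position try each entry of this slot;
            # the empty entry keeps the position, i.e. skips the slot.
            reachable = reachable | {
                pos + len(f)
                for pos in reachable
                for f in fillers
                if word.startswith(f, pos)
            }
        if len(word) in reachable:
            return n_passes
    return None
```

Under those assumptions neither "qokeey" nor "qokedy" can be spelled in one pass, because e and d would have to come after k, but both can be spelled in two passes.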
(13-05-2026, 09:58 PM)Jorge_Stolfi Wrote: Without these sum constraints, the three parsing levels can be realized as a compact finite automaton that can be drawn on a single page.  With the sum constraints, the automaton is about 3x bigger because each state must be unfolded into three states in order to record the three counts in the part already parsed.

Oops. Actually it is more than 3x, because each state must keep track of the x+y count (0,1,2) and the q+d+e count (0,1,2); so each state near the end of the word may become 9 states.  (The g+h count is checked immediately, so it does not require duplicating states.)

All the best, --stolfi
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote:
(11-05-2026, 06:12 PM)Mauro Wrote: I'm not much into Python, so I cannot check the code, I'm sorry.

I haven't yet understood which are the final 3 slot grammars being compared, Models A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?

All 3 models share the same structure: 
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']

Here are the main differences:
  • A: all slot combinations treated as chunks
  • B (Zattera, one loop): slot sequence as a single-pass automaton (one pass through the 7-slot sequence, then training removes edges to improve F1 as Zattera did; output words are paths through the slot chain)
  • C (Zattera, 2 loops): slot sequence as a two-pass automaton 

I'm a little confused.

Model A: I'm sorry but I don't understand what you mean by "all slot combinations treated as chunks"

Model B: why do you call it 'Zattera'? It's completely different from Zattera's. My tests:

1 repetition along the slot grammar
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 1
Grammar notes: [link]

Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y

Word types coverage (excluding word types with rare characters): 0,06015976

Total number of bits required for the Huffmann codes dictionary: 14175 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 80300
Total number of bits required (all tokens): 94475

Nbits_tokens/Coverage metric: 1570401,83852722

It has an extremely low coverage, ~6% (e.g. it cannot find 'daiin').


With 2 repetitions (Model C)
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 2
Grammar notes: [link]

Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y

Word types coverage (excluding word types with rare characters): 0,4767848

Total number of bits required for the Huffmann codes dictionary: 20678 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 329808
Total number of bits required (all tokens): 350486

Nbits_tokens/Coverage metric: 735103,093436505

Coverage stays low, ~48%.

So I tested with 5 repetitions:
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 5
Grammar notes: [link]

Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y

Word types coverage (excluding word types with rare characters): 0,9978782

Total number of bits required for the Huffmann codes dictionary: 21870 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 467105
Total number of bits required (all tokens): 488975

Nbits_tokens/Coverage metric: 490014,71623591

Coverage gets quite good: ~100%. Nbits_tokens is 488975, and Nbits_tokens/Coverage is 490014, quite good values. Loop-Lay was slightly better, at Nbits = 486946 and Nbits/Coverage = 487859. The best grammar I ever found (unpublished in detail, just posted in a thread) improves a little more:

Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: LOOP_chshy_481947_Nbitstokens_vs_Coverage, max repeats = 5
Grammar notes: Best Nbits_token/Coverage found up to 02/02/2025

Slot 1: ch sh y
Slot 2: eee ee e q a
Slot 3: o
Slot 4: iii ii i d
Slot 5: l k r s t p f cth ckh cph cfh n m y

Word types coverage (excluding word types with rare characters): 0,998003

Total number of bits required for the Huffmann codes dictionary: 15314 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 465671
Total number of bits required (all tokens): 480985

Nbits_tokens/Coverage metric: 481947,446167254
(14-05-2026, 09:19 PM)Mauro Wrote: Coverage gets quite good: ~100%. Nbits_tokens is 488975, and Nbits_tokens/Coverage is 490014, quite good values. Loop-Lay was slightly better, at Nbits = 486946 and Nbits/Coverage = 487859. The best grammar I ever found (unpublished in detail, just posted in a thread) improves a little more:

Does the number of allowed loops affect Nbits in any way? Assuming the same slot assignments achieve 95% coverage for 4 loops and 100% coverage for 5 loops, is it possible for the Nbits score to be worse for the second grammar?
(14-05-2026, 11:32 PM)oshfdk Wrote:
(14-05-2026, 09:19 PM)Mauro Wrote: Coverage gets quite good: ~100%. Nbits_tokens is 488975, and Nbits_tokens/Coverage is 490014, quite good values. Loop-Lay was slightly better, at Nbits = 486946 and Nbits/Coverage = 487859. The best grammar I ever found (unpublished in detail, just posted in a thread) improves a little more:

Does the number of allowed loops affect Nbits in any way? Assuming the same slot assignments achieve 95% coverage for 4 loops and 100% coverage for 5 loops, is it possible for the Nbits score to be worse for the second grammar?

The lower the coverage, the lower Nbits will be, because fewer tokens are encoded. That's why I no longer use the raw Nbits as a metric, but rather Nbits/Coverage.  E.g., if you look at one of the tables of my post above (2 repetitions, Model C):
Quote:Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 2

....

Word types coverage (excluding word types with rare characters): 0,4767848


Total number of bits required (all [encodable] tokens): 350486

Nbits_tokens/Coverage metric: 735103,093436505

This grammar has a low coverage, ~48%, so it encodes only a part of the text and, indeed, it has a very low Nbits (350486). But Nbits/Coverage is 735103, much higher than, e.g., the 481947 of my 'best' grammar.

Or, in other words: the original Nbits metric is okay when comparing grammars with a similar coverage (as I did in my article, where Nbits/Coverage was just in a footnote), but one needs Nbits/Coverage (or some analogous formula) in the more general case, where the coverage changes. In practice, it is better to always use Nbits/Coverage. Sorry for the confusion!
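
The comparison can be reproduced with a few lines of Python. The figures are the totals and coverages from the tables posted in this thread; the formula (total bits divided by word-type coverage) is inferred from those tables and not taken from the tool itself, and the dictionary keys and function name are hypothetical labels.

```python
# (total bits for all encodable tokens, word-type coverage) as posted in the thread
RESULTS = {
    "max repeats = 1": (94475, 0.06015976),
    "max repeats = 2": (350486, 0.4767848),
    "max repeats = 5": (488975, 0.9978782),
    "LOOP_chshy": (480985, 0.998003),
}


def nbits_per_coverage(total_bits: int, coverage: float) -> float:
    """Penalize grammars that look cheap only because they encode few tokens."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return total_bits / coverage


scores = {name: nbits_per_coverage(*row) for name, row in RESULTS.items()}
best_by_metric = min(scores, key=scores.get)
best_by_raw_bits = min(RESULTS, key=lambda name: RESULTS[name][0])
```

Ranking by raw bits would pick the one-repetition grammar, which covers only about 6% of the word types, while ranking by Nbits/Coverage picks the LOOP_chshy grammar.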