Labyrinthinesecurity > 13-05-2026, 03:36 PM
(11-05-2026, 06:12 PM)Mauro Wrote: I'm not much into Python, so I cannot check the code, I'm sorry.
I haven't yet understood which are the final 3 slot grammars being compared, Model A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?
Jorge_Stolfi > 13-05-2026, 09:58 PM
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure:
Slot1 = ...
Labyrinthinesecurity > 13-05-2026, 10:09 PM
(13-05-2026, 09:58 PM)Jorge_Stolfi Wrote: (13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure:
Slot1 = ...
My old "crust-mantle-core" model, discussed starting at (link unavailable), seems to be more demanding than that one, except perhaps for the placement of [aoy].
Here is my current preferred version. It is described as a sequence of four parsing steps/levels for convenience, but can be encoded as a single loop-free finite automaton, although some of the "total count" constraints in the last level are more compactly described by an algorithm.
CLEAN pass: The character m is considered an abbreviation for in. The combination ir is assumed to be a scribal error for iin. The characters b g u are malformed versions of other characters, possibly n m an. The glyphs Ih ITh IKh etc. are malformed versions of Ch CTh CKh etc. A doubled hh is a malformed version of he. The glyphs Cs and sh are just variant forms of Sh. The abbreviations should be expanded, and the erroneous characters and combinations should be mapped to the most likely correct ones, before applying the following levels of parsing.
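As a rough illustration of the clean-up rules above, here is a minimal Python sketch that applies only the three substitutions stated without a "possibly" (m to in, ir to iin, hh to he) to a lower-case EVA word. The name `clean_word` and the order in which the rules are applied are assumptions made for this sketch; the uncertain mappings for b, g, u and the malformed bench and gallows glyphs are deliberately left out.

```python
# Hypothetical sketch: only the three unambiguous clean-up rules are covered.
CLEANUP_RULES = [
    ("m", "in"),    # m is read as an abbreviation of "in"
    ("ir", "iin"),  # "ir" is assumed to be a scribal error for "iin"
    ("hh", "he"),   # a doubled "hh" is read as a malformed "he"
]

def clean_word(word: str) -> str:
    """Apply each substitution, in order, to a lower-case EVA word."""
    for old, new in CLEANUP_RULES:
        word = word.replace(old, new)
    return word

print(clean_word("daim"))  # daiin
print(clean_word("dair"))  # daiin
```

With these rules both im and ir end up as iin, so neither a bare m nor a bare r after i reaches the later parsing levels.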
ELEM level: A cleaned word of the VMS (in the EVA encoding) passes this level if it can be parsed into a string of elements drawn from the following sets:
- "O" the elements {a} {o} {y}.
- "Q" just {q}.
- "D" the /dealers/ {d} {l} {r} {s}.
- "X" the /benches/ {ch} {sh} {ee} with an optional 'e' suffix.
- "G" the /simple gallows/ {k} {t} {p} {f} with an optional 'e' suffix.
- "H" the /platform gallows/ {cth} {ckh} {cph} {cfh} with an optional 'e' or 'h' suffix.
- "N" the /codas/ {n} {in} {iin} {iiin}.
Note that e and i are not valid elements per se. An e can occur only after a bench or a gallows, or as part of an element {ee} or {eee}. An i can only appear as part of a coda element {in}, {iin} or {iiin}. Recall that ir and im are replaced in the CLEAN pass.
Thus, for instance
ockhechdy (oCKheChdy) is parsed as {o}{ckhe}{ch}{d}{y}
qokaiin is parsed as {q}{o}{k}{a}{iin}
chedy is parsed as {che}{d}{y}
cheedy is parsed as {ch}{ee}{d}{y}
cheeedy is parsed as {che}{ee}{d}{y}
cheeeedy is parsed as {ch}{ee}{ee}{d}{y}
Note that the parsing is ambiguous if a word has three or more e in a row. So cheeedy could also be parsed as {ch}{eee}{d}{y}, and cheeeedy as {che}{eee}{d}{y}. The choices above are arbitrary, and have limited implications.
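The ELEM level can be sketched as a small backtracking parser in Python. This is only an illustration under stated assumptions: the O set {a, o, y} is inferred from the examples, the names `CLASSES`, `ELEMENTS`, `parse_elements` and `show` are made up for the sketch, and when a word is ambiguous the parser simply returns the first parse it finds (longest element first), which need not be the one chosen in the examples above.

```python
# Element classes; each entry is (base elements, allowed one-letter suffixes).
# The O set {a, o, y} is an assumption inferred from the examples.
CLASSES = {
    "O": (["a", "o", "y"], ""),
    "Q": (["q"], ""),
    "D": (["d", "l", "r", "s"], ""),
    "X": (["ch", "sh", "ee"], "e"),             # benches, optional 'e'
    "G": (["k", "t", "p", "f"], "e"),           # simple gallows, optional 'e'
    "H": (["cth", "ckh", "cph", "cfh"], "eh"),  # platform gallows, optional 'e' or 'h'
    "N": (["n", "in", "iin", "iiin"], ""),
}

# Expand the optional suffixes and try longer elements first.
ELEMENTS = sorted(
    ((base + suffix, cls)
     for cls, (bases, suffixes) in CLASSES.items()
     for base in bases
     for suffix in [""] + list(suffixes)),
    key=lambda pair: -len(pair[0]),
)

def parse_elements(word):
    """Return one parse as a list of (element, class) pairs, or None."""
    if not word:
        return []
    for elem, cls in ELEMENTS:
        if word.startswith(elem):
            rest = parse_elements(word[len(elem):])
            if rest is not None:
                return [(elem, cls)] + rest
    return None  # no element fits here: backtrack

def show(word):
    """Format a parse in the {..}{..} notation, or None if the word fails."""
    parsed = parse_elements(word)
    return None if parsed is None else "".join("{%s}" % e for e, _ in parsed)

print(show("ockhechdy"))  # {o}{ckhe}{ch}{d}{y}
print(show("qokaiin"))    # {q}{o}{k}{a}{iin}
print(show("cheedy"))     # {ch}{ee}{d}{y}
```

A word containing an isolated e or i, such as a bare "e", has no parse and yields None.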
OKOKO level: Let K be the set of all elements that are not in the set O, namely K = Q ∪ D ∪ X ∪ G ∪ H ∪ N. A cleaned word that passed the ELEM level also passes this level if it consists of zero or more K elements with at most two O elements inserted before the first K, between every two consecutive Ks, and after the last K.
Thus, for example, {o}{y} passes this level, {o}{a}{ch}{sh}{o}{r}{o}{a}{d}{o}{y} passes (with pattern OOKKOKOOKOO), whereas {ch}{o}{a}{y}{d}{y} does not (three Os in a row).
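A minimal Python sketch of the OKOKO test, taking a word that has already been split into elements. It assumes that O is the set {a, o, y} (as the examples suggest) and reads the rule as "no run of more than two O elements"; the function names are hypothetical.

```python
O_ELEMENTS = {"a", "o", "y"}  # assumed O set, taken from the examples

def okoko_pattern(elements):
    """Map a list of elements to a string of 'O' and 'K' letters."""
    return "".join("O" if e in O_ELEMENTS else "K" for e in elements)

def passes_okoko(elements):
    """True if no run of O elements is longer than two."""
    return "OOO" not in okoko_pattern(elements)

word = ["o", "a", "ch", "sh", "o", "r", "o", "a", "d", "o", "y"]
print(okoko_pattern(word))                            # OOKKOKOOKOO
print(passes_okoko(word))                             # True
print(passes_okoko(["ch", "o", "a", "y", "d", "y"]))  # False
```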
CMC level: A cleaned word that passes the ELEM and OKOKO levels will pass the crust-mantle-core (CMC) level if, after deleting all the O elements, it has the form
Q^q D^d X^x G^g H^h X^y D^e N^n
where
- q,g,h,n may be 0 or 1;
- g+h at most 1 (there can be at most one gallows per word);
- q+d+e+n <= 3 (there can be at most three of Q, D, and N);
- x+h+y <= 2 (there can be at most two benches, counting a platform gallows as one bench).
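The CMC test can be sketched in Python as a regular expression for the shape plus two counts for the sum constraints. The input is assumed to be the string of class letters (Q, D, X, G, H, N) that remains after the O elements are deleted; `CMC_SHAPE` and `passes_cmc` are names invented for this sketch. Since both sums are totals over the whole word, they do not depend on how the X and D elements are split between the two sides of the gallows.

```python
import re

# Shape Q^q D^d X^x G^g H^h X^y D^e N^n with q, g, h, n in {0, 1};
# "[GH]?" also enforces g + h <= 1 (at most one gallows).
CMC_SHAPE = re.compile(r"Q?D*X*[GH]?X*D*N?")

def passes_cmc(classes):
    """`classes` holds the class letters left after deleting the O elements."""
    if CMC_SHAPE.fullmatch(classes) is None:
        return False
    outer = sum(classes.count(c) for c in "QDN")   # q + d + e + n
    benches = sum(classes.count(c) for c in "XH")  # x + h + y
    return outer <= 3 and benches <= 2

print(passes_cmc("XD"))   # chedy     -> True
print(passes_cmc("HXD"))  # ockhechdy -> True
print(passes_cmc("XXX"))  # three benches -> False
```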
Without these sum constraints, the three parsing levels can be realized as a compact finite automaton that can be drawn on a single page. With the sum constraints, the automaton is about 3x bigger because each state must be unfolded into three states in order to record the three counts in the part already parsed.
The numbers vary depending on the section and transcription version used, but it seems that, after the CLEAN step, about 95% of the tokens pass the other three levels.
There probably are further rules relating the insertion of the "O"s in the CMC pattern. For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O". Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc elements. This and other refinements remain to be explored.
But the main question is how this word model compares to those you are using.
All the best, --stolfi
Jorge_Stolfi > 13-05-2026, 10:34 PM
ReneZ > 14-05-2026, 12:54 AM
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure:
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']
Jorge_Stolfi > 14-05-2026, 05:21 PM
(13-05-2026, 09:58 PM)Jorge_Stolfi Wrote: Without these sum constraints, the three parsing levels can be realized as a compact finite automaton that can be drawn on a single page. With the sum constraints, the automaton is about 3x bigger because each state must be unfolded into three states in order to record the three counts in the part already parsed.
Mauro > 14-05-2026, 09:19 PM
(13-05-2026, 03:36 PM)Labyrinthinesecurity Wrote: (11-05-2026, 06:12 PM)Mauro Wrote: I'm not much into Python, so I cannot check the code, I'm sorry.
I haven't yet understood which are the final 3 slot grammars being compared, Model A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?
All 3 models share the same structure:
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']
Here are the main differences:
- A: all slot combinations treated as chunks
- B (Zattera, one loop): slot sequence as a single-pass automaton (one pass through the 7-slot sequence; training then removes edges to improve F1, as Zattera did; output words are paths through the slot chain)
- C (Zattera, 2 loops): slot sequence as a two-pass automaton
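To make the one-pass versus two-pass distinction concrete, here is a hedged Python sketch of a bare acceptor for the 7-slot structure. It assumes that a word is accepted when it is a concatenation of at most one choice per slot, in slot order, repeated for up to `max_passes` passes; it does not model the chunk treatment of Model A or the edge-removal training of Models B and C. The names `one_pass` and `accepts` are invented for the sketch.

```python
SLOTS = [
    ["q", "ch", "sh", "cth", "ckh", "cph", "cfh"],
    ["e"],
    ["e"],
    ["o", "a"],
    ["i", "ii", "iii", "iiii"],
    ["l", "r", "d", "n", "m", "s", "t", "k", "p", "f"],
    ["y"],
]  # every slot may also be left empty

def one_pass(word, starts):
    """Positions reachable from `starts` after one pass through all slots."""
    positions = set(starts)
    for slot in SLOTS:
        reached = set(positions)  # the slot is left empty
        for pos in positions:
            for glyphs in slot:
                if word.startswith(glyphs, pos):
                    reached.add(pos + len(glyphs))
        positions = reached
    return positions

def accepts(word, max_passes):
    """True if `word` is covered by at most `max_passes` passes."""
    positions = {0}
    for _ in range(max_passes):
        positions = one_pass(word, positions)
        if len(word) in positions:
            return True
    return False

print(accepts("chedy", 1))    # True
print(accepts("qokaiin", 1))  # False
print(accepts("qokaiin", 2))  # True
```

Under this reading qokaiin needs two passes (q, o, k in the first; a, ii, n in the second), which fits the jump in coverage between one and two repeats reported in the runs quoted in this thread.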
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 1
Grammar notes: (link unavailable)
Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y
Word types coverage (excluding word types with rare characters): 0,06015976
Total number of bits required for the Huffman codes dictionary: 14175 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 80300
Total number of bits required (all tokens): 94475
Nbits_tokens/Coverage metric: 1570401,83852722
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 2
Grammar notes: (link unavailable)
Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y
Word types coverage (excluding word types with rare characters): 0,4767848
Total number of bits required for the Huffman codes dictionary: 20678 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 329808
Total number of bits required (all tokens): 350486
Nbits_tokens/Coverage metric: 735103,093436505
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 5
Grammar notes: (link unavailable)
Slot 1: q ch sh cth ckh cph cfh
Slot 2: e
Slot 3: e
Slot 4: o a
Slot 5: iiii iii ii i
Slot 6: l d k r s t p f n m
Slot 7: y
Word types coverage (excluding word types with rare characters): 0,9978782
Total number of bits required for the Huffman codes dictionary: 21870 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 467105
Total number of bits required (all tokens): 488975
Nbits_tokens/Coverage metric: 490014,71623591
Quote:Transliteration file used: RF1a-n-x7 Full cleaned
Grammar name: LOOP_chshy_481947_Nbitstokens_vs_Coverage, max repeats = 5
Grammar notes: Best Nbits_token/Coverage found up to 02/02/2025
Slot 1: ch sh y
Slot 2: eee ee e q a
Slot 3: o
Slot 4: iii ii i d
Slot 5: l k r s t p f cth ckh cph cfh n m y
Word types coverage (excluding word types with rare characters): 0,998003
Total number of bits required for the Huffman codes dictionary: 15314 (character set = 16 characters)
Total number of bits required for the compressed text (all tokens): 465671
Total number of bits required (all tokens): 480985
Nbits_tokens/Coverage metric: 481947,446167254
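The figures in the four reports above are consistent with a simple reading: the total is the dictionary bits plus the compressed-text bits, and the Nbits_tokens/Coverage metric is that total divided by the coverage fraction. This is inferred from the numbers alone, not from the tool's source; the short Python sketch below just redoes the arithmetic (decimal commas are written as points, and `summarize` is a hypothetical helper name).

```python
def summarize(dict_bits, text_bits, coverage):
    """Return (total bits, total bits / coverage) for one run."""
    total_bits = dict_bits + text_bits
    return total_bits, total_bits / coverage

# (dictionary bits, compressed-text bits, word-type coverage) from the reports
runs = {
    "PROVA, max repeats = 1": (14175, 80300, 0.06015976),
    "PROVA, max repeats = 2": (20678, 329808, 0.4767848),
    "PROVA, max repeats = 5": (21870, 467105, 0.9978782),
    "LOOP_chshy, max repeats = 5": (15314, 465671, 0.998003),
}

for name, (dict_bits, text_bits, coverage) in runs.items():
    total, metric = summarize(dict_bits, text_bits, coverage)
    print(f"{name}: total = {total}, total/coverage = {metric:.1f}")
```

Because the total only counts tokens the grammar can encode, a run with low coverage can show a small total and still a poor metric, as the max repeats = 1 run does.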
oshfdk > 14-05-2026, 11:32 PM
(14-05-2026, 09:19 PM)Mauro Wrote: Coverage gets quite good: ~100%. Nbits_tokens is 488975, and Nbits_tokens/Coverage is 490014, quite good values. Loop-Lay was slightly better, at Nbits = 486946 and Nbits/Coverage = 487859. The best grammar I ever found (unpublished in detail, just posted in a thread) improves a little more:
Mauro > 15-05-2026, 08:10 PM
(14-05-2026, 11:32 PM)oshfdk Wrote: (14-05-2026, 09:19 PM)Mauro Wrote: Coverage gets quite good: ~100%. Nbits_tokens is 488975, and Nbits_tokens/Coverage is 490014, quite good values. Loop-Lay was slightly better, at Nbits = 486946 and Nbits/Coverage = 487859. The best grammar I ever found (unpublished in detail, just posted in a thread) improves a little more:
Does the number of allowed loops affect Nbits in any way? Assuming the same slot assignments achieve 95% coverage for 4 loops and 100% coverage for 5 loops, is it possible for the Nbits score to be worse for the second grammar?
Quote:Grammar name: PROVA Voynich Ninja Labyrinthinesecurity, max repeats = 2
....
Word types coverage (excluding word types with rare characters): 0,4767848
Total number of bits required (all [encodable] tokens): 350486
Nbits_tokens/Coverage metric: 735103,093436505