The Voynich Ninja - update to Zattera's slot machine


Using switchable templates, it looks like we can improve Zattera's slot machine performance:

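┌───────────┬─────────┬─────────┬─────────┐
│ Metric    │  Before │   After │  Change │
├───────────┼─────────┼─────────┼─────────┤
│ Generated │   3,110 │   5,524 │  +2,414 │
│ TP types  │   1,226 │   1,680 │    +454 │
│ TP tokens │  20,454 │  24,662 │  +4,208 │
│ Precision │  0.3942 │  0.3041 │ -0.0901 │
│ Recall    │  0.1468 │  0.2011 │ +0.0543 │
│ F1        │  0.2139 │  0.2421 │ +0.0282 │
│ TokCov    │  0.5491 │  0.6620 │ +0.1129 │
└───────────┴─────────┴─────────┴─────────┘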
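
For readers who want to re-derive the table, here is a minimal Python sketch. It assumes the usual definitions (precision = TP types / generated types, F1 = harmonic mean of precision and recall). The total number of corpus types is not stated in the post, so the sketch only back-solves it from the reported recall; the function names are illustrative and not taken from the attached source.

```python
def precision(tp_types, generated):
    """Fraction of generated word types that actually occur in the corpus."""
    return tp_types / generated

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def implied_corpus_types(tp_types, recall):
    """Back-solve the corpus size from recall = TP / corpus_types (an inference, not a reported figure)."""
    return tp_types / recall

# Figures copied from the table above; recall is taken as reported.
rows = {
    "before": {"generated": 3110, "tp_types": 1226, "recall": 0.1468},
    "after": {"generated": 5524, "tp_types": 1680, "recall": 0.2011},
}

results = {}
for name, row in rows.items():
    p = precision(row["tp_types"], row["generated"])
    results[name] = {
        "precision": round(p, 4),
        "f1": round(f1_score(p, row["recall"]), 4),
        "implied_types": implied_corpus_types(row["tp_types"], row["recall"]),
    }
    print(name, results[name])
```

Both rows imply a corpus of roughly the same size, which is a quick sanity check that the before and after figures were scored against the same word list.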

Note that we worked on the IVTFF corpus, not Zattera's homebrew Slot Alphabet.

The improved grammar (F1=0.242) outperforms Zattera's grammar (F1=0.214) when both are scored on our corpus. However, Zattera reports F1=0.270 on his own filtered corpus (~5,105 types), and our grammar was F1-trained on the test corpus while his was not.

(07-05-2026, 10:36 AM)Labyrinthinesecurity Wrote: The improved grammar (F1=0.242) outperforms Zattera's grammar (F1=0.214) when both are scored on our corpus. However, Zattera reports F1=0.270 on his own filtered corpus (~5,105 types), and our grammar was F1-trained on the test corpus while his was not.

What are the [link]? I don't think F1 is informative when comparing two different grammar mechanics. Whatever "switchable templates" are, they sound like they can capture some information too, leading to visibly better numbers with no actual improvement in the grammar.

(07-05-2026, 10:58 AM)oshfdk Wrote: What are the [link]? I don't think F1 is informative when comparing two different grammar mechanics. Whatever "switchable templates" are they sound like they can capture some information too, leading to visibly better numbers for no actual improvement in the grammar.

I wasn't aware of Lanzini's metric. As a big fan of Kolmogorov complexity, I can only approve of Nbits; it sounds like a perfect addition to coverage! Thanks for sharing this gem.

(07-05-2026, 10:58 AM)oshfdk Wrote: What are the [link]? I don't think F1 is informative when comparing two different grammar mechanics. Whatever "switchable templates" are they sound like they can capture some information too, leading to visibly better numbers for no actual improvement in the grammar.

Currently my best Zattera model, Zat+, beats Loop-Lay in terms of Nbits, but at a coverage of about 80% versus 100% for Loop-Lay. However, given the fat-tailed distribution of Voynich words and the potential scribal errors, 100% coverage does not make much sense. I think we should settle for a much lower coverage and optimize Nbits from there.

(07-05-2026, 10:55 PM)Labyrinthinesecurity Wrote: Currently my best Zattera model, Zat+, beats Loop-Lay in terms of Nbits, but at a coverage of about 80% versus 100% for Loop-Lay. However, given the fat-tailed distribution of Voynich words and the potential scribal errors, 100% coverage does not make much sense. I think we should settle for a much lower coverage and optimize Nbits from there.

I really have no opinion about the best coverage. Personally, I think the grammar should reproduce all words (maybe with the exception of those with really rare characters like v), otherwise it's not a grammar of Voynichese, but a grammar of some subset of Voynichese, and all conclusions made based on this grammar should carry this footnote with them. I think the main idea of loop grammars was to handle all or almost all words while keeping the grammar size manageable.

Would be good if Mauro could comment on this one. While I use other people's grammars occasionally to check if some of my ideas are obviously bonkers or not, I'm not in the grammar business myself, so to speak. To me a grammar is not a statement of what happens in Voynichese, but rather a statement of what doesn't happen in Voynichese.

(07-05-2026, 11:05 PM)oshfdk Wrote: I really have no opinion about the best coverage. Personally, I think the grammar should reproduce all words (maybe with the exception of those with really rare characters like v), otherwise it's not a grammar of Voynichese, but a grammar of some subset of Voynichese, and all conclusions made based on this grammar should carry this footnote with them.

But what is the purpose of the grammar?

Any "ordinary" text in any language will use only a small "random" subset T of that language's lexicon L. For polysyllabic languages, the size of L is typically 100,000 words or more.

Thus we cannot know the lexicon L of Voynichese from the VMS text alone. We can build a grammar -- or, equivalently, a deterministic finite automaton -- that generates precisely the set T: every word that occurs in the VMS, and no other.

Such a grammar or automaton can be useful as an efficient way to check whether a word is in T or not (Appel and Jacobson). But, because T is a small and random sample of words, it does not give much information about the language, and its complexity c(T) -- the number of states of the automaton, or the number of rules in the grammar -- will be several times the number #T of words. That is, in the tens of thousands.
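
As a rough illustration of such an automaton, the sketch below builds a plain trie that accepts exactly the words of a sample T and counts its states. The word list is a toy sample chosen for illustration, not the VMS lexicon, and a trie is only the unminimized form: a minimized automaton would also merge shared suffixes and so have fewer states.

```python
END = "$"  # marker key for accepting states; assumed not to occur in any word

def build_trie(words):
    """Build a trie (an unminimized DFA) that accepts exactly the given words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def accepts(trie, word):
    """Return True only if the word is in the sample the trie was built from."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def count_states(trie):
    """Count states, including the root, as a crude stand-in for c(T)."""
    return 1 + sum(count_states(child) for key, child in trie.items() if key != END)

# Toy sample standing in for T (illustrative only).
T_sample = ["daiin", "dain", "chedy", "shedy"]
trie = build_trie(T_sample)
n_states = count_states(trie)
print(n_states, accepts(trie, "dain"), accepts(trie, "dai"))
```

Even in this tiny case the state count is several times the number of words, because the automaton records the sample itself rather than any rule behind it.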

A grammar or automaton for L -- the lexicon of the language -- would be more useful. But there is no way to get that either. The set L is usually a random subset of W, the set of all words allowed by the language's phonetic and morphological constraints. Its complexity c(L) will be even higher than c(T).

On the other hand, those constraints are generally fairly simple, so the set W can usually be precisely described by a fairly small grammar or automaton. The latter may give useful insights about the language.

However, we cannot categorically recover W from the sample T. First, a finite sample will generally lack examples of some combinations of phonemes or letters that are valid in W. Wells's novel The War of the Worlds contains no word with the letter combination "unstr". If that novel were the only sample of English we had, should we conclude that this combination is not valid by the English phonetic and morphological rules? (It is valid, in fact.)

Second, the set T extracted from a specific text will usually have many "noise" words that are not in W -- spelling errors, onomatopoeias, foreign words, words that have been accidentally split or joined, etc. The VMS, in particular, must have several hundred of them. For example, Wells's novel uses the word "rahnd" instead of "round" to imitate some strong dialectal accent, and the incorrect or old spelling "vallyble". But neither "-hnd" nor "-yble" is a valid English suffix. There is no reliable way to detect such "noise" words; and any grammar that includes all of T will define a set of words larger than the true W.

Therefore, it is pointless to require 100% coverage ("the grammar must generate all the words that occur in the VMS") or 100% precision ("the grammar must not generate any word that does not occur in the VMS"). Or even to try to maximize some combination of the two metrics, like 3*coverage + 5*precision. A more useful approach must combine those metrics with some measure of the complexity of the grammar. But the weights of those metrics will be arbitrary...

Quote: I think the main idea of loop grammars was to handle all or almost all words while keeping the grammar size manageable.

If the language is polysyllabic, the set W will usually be infinite, and describing it will require a recursive grammar or an automaton with loops.

But everything indicates that the proper words of Voynichese have a very limited size. The longer words are almost certainly cases where two or more words were accidentally stuck together. In that case W is more naturally described by a non-recursive grammar, or an automaton without loops.
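
To make the "no loops" case concrete, here is a minimal sketch of a non-recursive slot grammar: each word is built by picking one filler from each slot in order, so the generated set is finite and word length is bounded. The slot contents are hypothetical, chosen only to yield Voynichese-looking strings; they are not the slots of Zattera's, Mauro's, or any other grammar discussed in this thread.

```python
from itertools import product

# Hypothetical slots; "" means the slot may be left empty.
SLOTS = [
    ["", "q"],
    ["", "o"],
    ["k", "t", "d"],
    ["ee", "ai"],
    ["dy", "n", "in"],
]

def generate(slots):
    """Return every word obtained by taking exactly one filler per slot, in order."""
    return {"".join(choice) for choice in product(*slots)}

words = generate(SLOTS)
longest = max(len(w) for w in words)
print(len(words), longest, "qokeedy" in words)
```

Because no slot can repeat, the size of the language is at most the product of the slot sizes; a looping automaton would instead allow a chunk sequence to recur and the set would become unbounded.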

All the best, --stolfi

(08-05-2026, 09:22 AM)Jorge_Stolfi Wrote: But what is the purpose of the grammar?

If we are talking about Mauro's grammars, I used them a few times to try to understand the mechanics of the cipher, without much success. I assume that the text is a cipher that uses a small set of glyphs with strong ordering rules to represent arbitrary plaintexts, and the grammar mostly reflects the mechanics of the cipher and not properties of the plaintext. Under this assumption Mauro's grammars are very useful, because they are optimized to minimize the total information content of the grammar and the generated text combined, and I believe they may uncover some essential properties of the cipher by identifying the grammar with the best compression properties.

Under different assumptions, like an exotic language or any other scenario where the grammar is mostly the result of the features of the plaintext and not some mechanical encoding process, I agree that covering only a part of the corpus may be just as useful. Or just as useless.

(07-05-2026, 10:55 PM)Labyrinthinesecurity Wrote: Currently my best Zattera model, Zat+, beats Loop-Lay in terms of Nbits, but at a coverage of about 80% versus 100% for Loop-Lay. However, given the fat-tailed distribution of Voynich words and the potential scribal errors, 100% coverage does not make much sense. I think we should settle for a much lower coverage and optimize Nbits from there.

Which target should be set for coverage is not clear. I experimented with different coverage targets, but with few results. I think 80% is a bit low, but this is just a personal opinion. On the other hand, going above 95% probably captures more noise than data.

Can you publish your improved grammar here? I'd be very interested to see it.

(09-05-2026, 09:11 PM)Mauro Wrote: Which target should be set for coverage is not clear. I experimented with different coverage targets, but with few results. I think 80% is a bit low, but this is just a personal opinion. On the other hand, going above 95% probably captures more noise than data.

Can you publish your improved grammar here? I'd be very interested to see it.

Here is the latest version (source code in attachment): the difference between my interpretation of your model (called model A below) and Zattera's model with 2 loops (model C below) is very small in terms of coverage and Nbits size.

Some important notes: we are not strictly comparing Zattera's slot_machine versus Loop-Lay, because:

I use neither your transliteration nor Zattera's, but a slightly modified version of RF1b-e (see the CORPUS section in the source code).
Model C doesn't use Zattera's original 12-slot structure, but your 7-slot structure. It does, however, use Zattera's machine training.

A side note: removing the 'q' gallow from all chunks except the first one improves Nbits slightly. Further similar optimizations along this line may help.

========================================================================================
SIDE-BY-SIDE COMPARISON
────────────────────────────────────────────────────────────────────────────────────────

All models built from the same Manzini 7-column grid of Loop-Lay.

A = Manzini raw grid chunks (7,919), greedy DP (max_rep=5)
B = Zattera-trained 1-loop → 1,369 words as chunks → greedy DP
C = Zattera-trained 2-loop → 4,365 words as chunks → greedy DP

┌──────────────────┬────────────┬────────────┬────────────┐
│ Metric           │ A: Raw DP  │ B: Train×1 │ C: Train×2 │
├──────────────────┼────────────┼────────────┼────────────┤
│ Chunk vocab size │      7,919 │      1,369 │      4,365 │
│ Types covered    │      8,105 │      7,709 │      8,062 │
│ Type coverage    │      99.8% │      94.9% │      99.2% │
│ Token coverage   │     100.0% │      98.1% │      99.8% │
│ Chunks used      │        734 │        596 │      4,004 │
│ Avg ch/word      │      1.892 │      1.894 │      1.232 │
│ Nb_dict          │     22,852 │     18,193 │    151,519 │
│ Nb_text          │    456,287 │    443,031 │    422,655 │
│ Nb_total         │    479,139 │    461,224 │    574,174 │
│ b/tok            │      12.97 │      12.72 │      15.56 │
│ vs Flat          │     0.661× │     0.636× │     0.792× │
└──────────────────┴────────────┴────────────┴────────────┘

Zattera's machine training stats:
┌──────────────────┬────────────┬────────────┐
│ Metric           │ B: Train×1 │ C: Train×2 │
├──────────────────┼────────────┼────────────┤
│ Precision        │      0.331 │      0.835 │
│ Recall           │      0.056 │      0.449 │
│ F1               │      0.095 │      0.584 │
│ True positives   │        453 │      3,646 │
│ False positives  │        916 │        719 │
└──────────────────┴────────────┴────────────┘

Please let me know if you see any caveat or bug.

I'm not much into Python, so I cannot check the code, I'm sorry.

I haven't yet understood which final three slot grammars are being compared as Models A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?
