The Voynich Ninja - A family of grammars for Voynichese

Pages: 1 2 3 4 5 6 7 8 9 10

Happy New Year!

(03-01-2025, 11:28 AM)Mauro Wrote: You are not allowed to view links. Register or Login to view.I review the metrics used to evaluate slot grammars, showing the normally used efficiency and F1 score to be fundamentally flawed. After introducing the concept of ‘looping’ slot grammars, a generalization of standard grammars, I show how any grammar can be used in a distinctive lossless data compression algorithm to generate a compressed Voynich Manuscript text. This allows the definition of a new metric free from fundamental flaws: Nbits, the total number of bits needed to store the compressed text. I then compare published state-of-the-art grammars and the newly introduced LOOP-L grammar class using the Nbits metric.

Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.

As far as I understand, in effect, you are evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token. In this case, maybe it makes sense to use Shannon information metric, instead of just log2(CSize), optimizing frequent chunks over rare chunks and frequent words tokens over rare word tokens?

Edit: I think I need to clarify my point. The metric you use, Nbits, essentially compares grammars by the way of how much information is needed to reproduce the whole text using the grammar. But the way you compute the information content, even though it looks straightforward, is somewhat arbitrary. Using a different encoding you can get a different number for the same grammar. But using the Shannon information as the theoretical limit of compression, it can be possible to have a more universal metric for a grammar, not dependent on the particular encoding used.

(03-01-2025, 12:47 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.Happy New Year!

Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.

Thank you!

(03-01-2025, 12:47 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.As far as I understand, in effect, you are evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token.

Yes, the metric is calculated using the word tokens (the whole text). However you gave me the idea to make a test also with word types (the vocabulary) and see what happerns.

(03-01-2025, 12:47 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view. In this case, maybe it makes sense to use Shannon information metric, instead of just log2(CSize), optimizing frequent chunks over rare chunks and frequent words tokens over rare word tokens?

Edit: I think I need to clarify my point. The metric you use, Nbits, essentially compares grammars by the way of how much information is needed to reproduce the whole text using the grammar. But the way you compute the information content, even though it looks straightforward, is somewhat arbitrary. Using a different encoding you can get a different number for the same grammar. But using the Shannon information as the theoretical limit of compression, it can be possible to have a more universal metric for a grammar, not dependent on the particular encoding used.

The metric is already optimized: the calculation of the Huffmann codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffmann himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).

Note: log2(Csize) is used only in the chunks dictionary to calculate how many bits are needed for the chunks themselves, ie. 'ch' requires log2(Csize) bits, 'cph' requires 3*log2(Csize) bits, etc.

(03-01-2025, 05:19 PM)Mauro Wrote: You are not allowed to view links. Register or Login to view.The metric is already optimized: the calculation of the Huffmann codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffmann himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).

Huffmann codes! Heart

Optimal yes, for a practical binary compression that is not the best choice for a metric: it doesn't matter how many actual bits you need to write. Fractional bits aren't writable but they are a better measure of the theoretical amount of information.

Happy New Year.

(03-01-2025, 05:19 PM)Mauro Wrote: You are not allowed to view links. Register or Login to view.The metric is already optimized: the calculation of the Huffmann codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffmann himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).

Note: log2(Csize) is used only in the chunks dictionary to calculate how many bits are needed for the chunks themselves, ie. 'ch' requires log2(Csize) bits, 'cph' requires 3*log2(Csize) bits, etc.

This is great, I misunderstood the formula then. The numbers shown for the naive grammars do roughly correspond to what I get by just gzipping the transliteration (minus the metadata, line numbers, etc), I get something like 6.5e5 bits.

I think your metric looks very good for practical assessment of these grammars.

Would be interesting to see which process, beyond the grammar compression, would results in the smallest lossless output.

(03-01-2025, 06:10 PM)nablator Wrote: You are not allowed to view links. Register or Login to view.Optimal yes, for a practical binary compression that is not the best choice for a metric: it doesn't matter how many actual bits you need to write. Fractional bits aren't writable but they are a better measure of the theoretical amount of information.

I like that it's possible to actually compress and decompress the data with the practical binary compression as shown in the paper, I generally prefer demonstrable results to formulas Smile

(03-01-2025, 06:13 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.This is great, I misunderstood the formula then. The numbers shown for the naive grammars do roughly correspond to what I get by just gzipping the transliteration (minus the metadata, line numbers, etc), I get something like 6.5e5 bits.

Yes, with pkzip it's about 68K. My algorithm compresses the text to ~59K with LOOP-Lay-oa, but this excludes the spaces, it only compresses the word tokens.

(03-01-2025, 06:13 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.I think your metric looks very good for practical assessment of these grammars.

Thank you!

(03-01-2025, 06:13 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.Would be interesting to see which process, beyond the grammar compression, would results in the smallest lossless output.

What I want to do next is to try to find a better grammar than LOOP-Lay-oa. The number of possibilities is enormous but maybe I can use a Monte Carlo algorithm to sample the grammars space. Will see what happens.

(03-01-2025, 09:41 PM)Mauro Wrote: You are not allowed to view links. Register or Login to view.What I want to do next is to try to find a better grammar than LOOP-Lay-oa. The number of possibilities is enormous but maybe I can use a Monte Carlo algorithm to sample the grammars space. Will see what happens.

Maybe comparing the best grammars for various parts of the manuscript (sections, "hands", pages) could yield some interesting results? Personally, I don't think Voynichese is grammar-based, but your method seems to be built on solid information theoretical basis, so it could apply across the boundaries of various hypotheses and scenarios of text generation.

Well well, I've been rather lazy with slot grammars in the last few months. I did something but, TL;DR; I did not find anything particularly useful or interesting, and I feel I need some fresh idea or approach before pursuing this again.

So (brief resume) I developed a 'figure of merit' (Nbits) for grammars and used it to find slot grammars which scored quite high on this metric (which I called looping slot grammars). What I wanted next to do was to explore the space of looping slot grammars to see if I could find better ones. I used a MonteCarlo method to generate grammars, examined several tens of millions of them, tweaking the grammars by hand to try to get the best results. I actually found thousands of grammars which score better than my previous record, but this is more a bane than a boon because they all score very similarly and there is no obvious way to select one. But just for reference, this is the best one I found (whole VMS, Rf1a-n transcription, words with rare characters removed):

SLOT 1: ch sh y
SLOT 2: eee ee e q a
SLOT 3: o
SLOT 4: iii ii i d
SLOT 5: y p f k l r s t cth ckh cph cfh n m

Looping it 5 times it can be used to encode the whole VMS (with 99.8% coverage) using 480985 bits (the previous record holder, LOOP-Lay-oa, needed 484688). Just for reference again, the second-best grammar I found uses only 30 more bits .

I also followed oshfdk's suggestion above and worked with the different sections of the manuscript, finding the same disappointing behaviour. Moreover, all the 'best' grammars for one section perform poorly on the other sections and on the manuscript as a whole, and they have structures different from the 'best overall' shown above (ie. [n m] may not cluster with the other characters, [y] may not be duplicated, etc.).

However, I would like to notice that the 'looping slot grammar' approach tries to divide the Voynich word types in 'chunks' (ie. the 'best' grammar above finds aiin ar ol al qok etc. as chunks) which are then used to 'compose' words, and as such it has many analogies with, say, the Naibbe cipher or core-mantle-crust or the 'syllables' recently discussed elsewhere here by dashofsk (if I'm not mistaken!).

I would like I could write a software which tries to divide the Voynich words in chunks, trying to maximize the Nbits metric, but in an unsupervised way (without the need for an underlying slot grammar). I know there are unsupervised clustering algorithms around which could maybe be adapted to the task, but I don't know them well (read: at all) and in any case it looks like a major project which I'm surely not in the mood to tackle at the moment Rolleyes

And ofc there's no guarantee of success, but this is a given anyway with the VMS.

(21-08-2025, 05:05 PM)Mauro Wrote: You are not allowed to view links. Register or Login to view.I also followed oshfdk's suggestion above and worked with the different sections of the manuscript, finding the same disappointing behaviour. Moreover, all the 'best' grammars for one section perform poorly on the other sections and on the manuscript as a whole, and they have structures different from the 'best overall' shown above (ie. [n m] may not cluster with the other characters, [y] may not be duplicated, etc.).

I think this is very interesting. Could you publish samples of best grammars for different sections?

(21-08-2025, 06:30 PM)oshfdk Wrote: You are not allowed to view links. Register or Login to view.
(21-08-2025, 05:05 PM)Mauro Wrote: You are not allowed to view links. Register or Login to view.I also followed oshfdk's suggestion above and worked with the different sections of the manuscript, finding the same disappointing behaviour. Moreover, all the 'best' grammars for one section perform poorly on the other sections and on the manuscript as a whole, and they have structures different from the 'best overall' shown above (ie. [n m] may not cluster with the other characters, [y] may not be duplicated, etc.).

I think this is very interesting. Could you publish samples of best grammars for different sections?

Sure, give me one day though, I did not plan for this and it needs some work to put it in a readable form.

Pages: 1 2 3 4 5 6 7 8 9 10