Happy New Year!
(03-01-2025, 11:28 AM)Mauro Wrote: I review the metrics used to evaluate slot grammars, showing the normally used efficiency and F1 score to be fundamentally flawed. After introducing the concept of ‘looping’ slot grammars, a generalization of standard grammars, I show how any grammar can be used in a distinctive lossless data compression algorithm to generate a compressed Voynich Manuscript text. This allows the definition of a new metric free from fundamental flaws: Nbits, the total number of bits needed to store the compressed text. I then compare published state-of-the-art grammars and the newly introduced LOOP-L grammar class using the Nbits metric.
Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.
As far as I understand, in effect, you are evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token. In this case, maybe it makes sense to use a Shannon information metric, instead of just log2(CSize), optimizing frequent chunks over rare chunks and frequent word tokens over rare word tokens?
Edit: I think I need to clarify my point. The metric you use, Nbits, essentially compares grammars by how much information is needed to reproduce the whole text using the grammar. But the way you compute the information content, even though it looks straightforward, is somewhat arbitrary: using a different encoding you can get a different number for the same grammar. By using the Shannon information as the theoretical limit of compression, it should be possible to have a more universal metric for a grammar, one that does not depend on the particular encoding used.
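To make the suggestion concrete, the kind of accounting I have in mind is something like this minimal sketch (toy tokens, not the actual transliteration): charge each occurrence -log2 of its empirical probability instead of a flat log2 of the set size.

```python
from collections import Counter
from math import log2

def shannon_bits(tokens):
    """Total Shannon information of a token sequence, in fractional bits:
    each occurrence costs -log2(p(token)) under the empirical distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(-c * log2(c / n) for c in counts.values())

# Toy example: frequent tokens are charged fewer bits each than rare ones.
tokens = ["daiin"] * 50 + ["chedy"] * 30 + ["qokeedy"] * 5
print(shannon_bits(tokens))                    # entropy-based cost
print(len(tokens) * log2(len(set(tokens))))    # flat log2(set size) cost, for comparison
```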
(03-01-2025, 12:47 PM)oshfdk Wrote: Happy New Year!
Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.
Thank you!
(03-01-2025, 12:47 PM)oshfdk Wrote: As far as I understand, in effect, you are evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token.
Yes, the metric is calculated using the word tokens (the whole text). However, you gave me the idea to also run a test with word types (the vocabulary) and see what happens.
(03-01-2025, 12:47 PM)oshfdk Wrote: In this case, maybe it makes sense to use a Shannon information metric, instead of just log2(CSize), optimizing frequent chunks over rare chunks and frequent word tokens over rare word tokens?
Edit: I think I need to clarify my point. The metric you use, Nbits, essentially compares grammars by how much information is needed to reproduce the whole text using the grammar. But the way you compute the information content, even though it looks straightforward, is somewhat arbitrary: using a different encoding you can get a different number for the same grammar. By using the Shannon information as the theoretical limit of compression, it should be possible to have a more universal metric for a grammar, one that does not depend on the particular encoding used.
The metric is already optimized: the calculation of the Huffman codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffman himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).
Note: log2(CSize) is used only in the chunks dictionary, to calculate how many bits are needed for the chunks themselves, i.e. 'ch' requires log2(CSize) bits, 'cph' requires 3*log2(CSize) bits, etc.
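In code, the two ingredients look roughly like the sketch below. This is a simplified illustration, not the program used for the paper: the toy chunk stream and the character-set size are placeholders, and the symbol_count() convention (how many log2(CSize)-sized symbols a chunk occupies) is the part that has to match the paper.

```python
import heapq
from collections import Counter
from math import log2

def huffman_lengths(freqs):
    """Whole-bit code lengths assigned to each symbol by a Huffman code."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]

def symbol_count(chunk):
    """Placeholder: naively one symbol per character; the convention in the
    paper may differ (e.g. counting 'ch' as a single symbol)."""
    return len(chunk)

def nbits_sketch(chunk_stream, charset_size):
    """Huffman-coded chunk stream plus a flat-coded chunk dictionary, in bits."""
    freqs = Counter(chunk_stream)
    lengths = huffman_lengths(freqs)
    text_bits = sum(freqs[c] * lengths[c] for c in freqs)
    # Dictionary: each distinct chunk stored at log2(CSize) bits per symbol.
    dict_bits = sum(symbol_count(c) * log2(charset_size) for c in freqs)
    return text_bits + dict_bits

# Toy chunk stream and an assumed character-set size of 23.
print(nbits_sketch(["qo", "k", "eedy", "ch", "edy", "ch", "edy", "qo", "k"], 23))
```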
(03-01-2025, 05:19 PM)Mauro Wrote: The metric is already optimized: the calculation of the Huffman codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffman himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).
Huffman codes!
Optimal, yes, but what is optimal for a practical binary compression is not the best choice for a metric: it doesn't matter how many actual bits you need to write. Fractional bits aren't writable, but they are a better measure of the theoretical amount of information.
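A toy case of the gap (toy numbers, not Voynich data): with only two symbols, a Huffman code has to spend a whole bit per token however skewed the distribution is, while the entropy bound charges only a fraction of a bit for the frequent one.

```python
from math import log2

# Two symbols with a very skewed distribution.
p = {"daiin": 0.99, "qokeedy": 0.01}

# Any Huffman code over two symbols assigns a 1-bit codeword to each,
# so the average cost is exactly 1 bit per token.
huffman_avg = 1.0

# Shannon entropy: the fractional-bit lower bound per token.
entropy = -sum(q * log2(q) for q in p.values())    # ~0.081 bits/token

print(f"Huffman: {huffman_avg:.3f} bits/token, entropy bound: {entropy:.3f} bits/token")
```

An arithmetic coder can get arbitrarily close to that bound over a long text, which is why the entropy figure is the better theoretical yardstick.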
Happy New Year.
(03-01-2025, 05:19 PM)Mauro Wrote: The metric is already optimized: the calculation of the Huffman codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffman himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).
Note: log2(CSize) is used only in the chunks dictionary, to calculate how many bits are needed for the chunks themselves, i.e. 'ch' requires log2(CSize) bits, 'cph' requires 3*log2(CSize) bits, etc.
This is great, I misunderstood the formula then. The numbers shown for the naive grammars do roughly correspond to what I get by just gzipping the transliteration (minus the metadata, line numbers, etc.): I get something like 6.5e5 bits (a rough recipe for this check is sketched at the end of this post).
I think your metric looks very good for practical assessment of these grammars.
It would be interesting to see which process, beyond the grammar compression, would result in the smallest lossless output.
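For reference, the gzip figure above comes from something quick and dirty along these lines; the file name and the metadata stripping assume an IVTFF-style transliteration (comment lines starting with '#', a locus tag in angle brackets at the start of each text line), so adjust it to whatever file you use.

```python
import gzip
import re

# Assumed IVTFF-style file: '#' comment lines and a leading locus tag per text line.
with open("voynich_eva.ivtff", encoding="utf-8") as f:
    lines = [line for line in f if not line.startswith("#")]

# Drop the leading "<f1r.1,...>"-style tags, keep only the transliterated text.
text = "\n".join(re.sub(r"^<[^>]*>\s*", "", line.strip()) for line in lines)

compressed = gzip.compress(text.encode("utf-8"), compresslevel=9)
print(len(compressed) * 8, "bits")   # rough figure, includes gzip header overhead
```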
(03-01-2025, 06:10 PM)nablator Wrote: Optimal, yes, but what is optimal for a practical binary compression is not the best choice for a metric: it doesn't matter how many actual bits you need to write. Fractional bits aren't writable, but they are a better measure of the theoretical amount of information.
I like that it's possible to actually compress and decompress the data with the practical binary compression, as shown in the paper; I generally prefer demonstrable results to formulas.

(03-01-2025, 06:13 PM)oshfdk Wrote: This is great, I misunderstood the formula then. The numbers shown for the naive grammars do roughly correspond to what I get by just gzipping the transliteration (minus the metadata, line numbers, etc.): I get something like 6.5e5 bits.
Yes, with pkzip it's about 68K. My algorithm compresses the text to ~59K with LOOP-Lay-oa, but this excludes the spaces; it only compresses the word tokens.
(03-01-2025, 06:13 PM)oshfdk Wrote: I think your metric looks very good for practical assessment of these grammars.
Thank you!
(03-01-2025, 06:13 PM)oshfdk Wrote: It would be interesting to see which process, beyond the grammar compression, would result in the smallest lossless output.
What I want to do next is to try to find a better grammar than LOOP-Lay-oa. The number of possibilities is enormous, but maybe I can use a Monte Carlo algorithm to sample the grammar space. We'll see what happens.
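The details will depend on how the grammars end up being represented, but the kind of loop I have in mind is roughly the simulated-annealing skeleton below; nbits() and random_neighbour() are placeholders standing in for the real compression-based scoring and for local edits to a grammar (moving a chunk between slots, splitting or merging chunks, and so on).

```python
import math
import random

def nbits(grammar):
    """Placeholder: in the real search this would compress the transliteration
    with the grammar and return the Nbits total. Toy stand-in: sum of a number list."""
    return sum(grammar)

def random_neighbour(grammar):
    """Placeholder: apply one random local edit to the grammar. Toy stand-in below."""
    g = list(grammar)
    g[random.randrange(len(g))] += random.choice([-1, 1])
    return g

def anneal(start, steps=10_000, t_hot=50.0, t_cold=0.1):
    """Simulated annealing: always accept improvements, occasionally accept
    worse grammars early on (high temperature) to escape local minima."""
    best = cur = start
    best_cost = cur_cost = nbits(cur)
    for i in range(steps):
        t = t_hot * (t_cold / t_hot) ** (i / steps)          # geometric cooling
        cand = random_neighbour(cur)
        cost = nbits(cand)
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / t):
            cur, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost

print(anneal([10] * 20))
```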
(03-01-2025, 09:41 PM)Mauro Wrote: What I want to do next is to try to find a better grammar than LOOP-Lay-oa. The number of possibilities is enormous, but maybe I can use a Monte Carlo algorithm to sample the grammar space. We'll see what happens.
Maybe comparing the best grammars for various parts of the manuscript (sections, "hands", pages) could yield some interesting results? Personally, I don't think Voynichese is grammar-based, but your method seems to be built on a solid information-theoretic basis, so it could apply across the boundaries of various hypotheses and scenarios of text generation.