oshfdk > 03-01-2025, 12:47 PM
(03-01-2025, 11:28 AM)Mauro Wrote: I review the metrics used to evaluate slot grammars, showing the normally used efficiency and F1 score to be fundamentally flawed. After introducing the concept of ‘looping’ slot grammars, a generalization of standard grammars, I show how any grammar can be used in a distinctive lossless data compression algorithm to generate a compressed Voynich Manuscript text. This allows the definition of a new metric free from fundamental flaws: Nbits, the total number of bits needed to store the compressed text. I then compare published state-of-the-art grammars and the newly introduced LOOP-L grammar class using the Nbits metric.
Mauro > 03-01-2025, 05:19 PM
(03-01-2025, 12:47 PM)oshfdk Wrote: Happy New Year!
Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.
(03-01-2025, 12:47 PM)oshfdk Wrote: As far as I understand, in effect, you are evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token.
(03-01-2025, 12:47 PM)oshfdk Wrote: In this case, maybe it makes sense to use the Shannon information metric, instead of just log2(CSize), optimizing frequent chunks over rare chunks and frequent word tokens over rare word tokens?
Edit: I think I need to clarify my point. The metric you use, Nbits, essentially compares grammars by how much information is needed to reproduce the whole text using the grammar. But the way you compute the information content, even though it looks straightforward, is somewhat arbitrary: using a different encoding you can get a different number for the same grammar. Using the Shannon information as the theoretical limit of compression, it would be possible to have a more universal metric for a grammar, one not dependent on the particular encoding used.
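To make the comparison concrete, here is a minimal sketch of the difference between a fixed log2(CSize)-bits-per-chunk cost and the Shannon bound, using made-up chunk counts (nothing here comes from the paper):

```python
import math
from collections import Counter

# Hypothetical chunk counts, purely for illustration.
counts = Counter({'ch': 120, 'sh': 60, 'qo': 40, 'dy': 20})
total = sum(counts.values())

# Fixed-length encoding: every chunk occurrence costs log2(CSize) bits.
fixed_bits = total * math.log2(len(counts))

# Shannon bound: each occurrence of chunk c costs -log2(p(c)) bits.
shannon_bits = sum(n * -math.log2(n / total) for n in counts.values())

print(f"fixed-length: {fixed_bits:.1f} bits, Shannon bound: {shannon_bits:.1f} bits")
```

The Shannon bound charges frequent chunks less than rare ones, which is the weighting suggested above.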
nablator > 03-01-2025, 06:10 PM
(03-01-2025, 05:19 PM)Mauro Wrote: The metric is already optimized: the calculation of the Huffman codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffman himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).
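As a toy illustration of what that Huffman step does (made-up word counts and a standard heapq-based construction, not the paper's code):

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return the length in whole bits of an optimal prefix code for each symbol."""
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code below them.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

# Made-up word counts, purely for illustration.
freqs = Counter({'daiin': 80, 'ol': 50, 'chedy': 30, 'qokain': 10})
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths)                      # frequent words get shorter codes
print(total_bits, "bits for the word stream")
```

Frequent words end up with shorter codes, which is exactly the weighting described above.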
oshfdk > 03-01-2025, 06:13 PM
(03-01-2025, 05:19 PM)Mauro Wrote: The metric is already optimized: the calculation of the Huffman codes already considers frequent chunks over rare ones and frequent words over rare ones, and the codes are optimal (as demonstrated by Huffman himself), so the metric is guaranteed to find the minimum amount of information needed (given a certain source grammar).
Note: log2(Csize) is used only in the chunks dictionary to calculate how many bits are needed for the chunks themselves, i.e. 'ch' requires log2(Csize) bits, 'cph' requires 3*log2(Csize) bits, etc.
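As a quick worked example with a purely hypothetical Csize of 25 (not the paper's actual value): 'ch' would then cost log2(25) ≈ 4.6 bits in the chunks dictionary, and 'cph' would cost 3*log2(25) ≈ 13.9 bits.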
oshfdk > 03-01-2025, 06:22 PM
(03-01-2025, 06:10 PM)nablator Wrote: Optimal, yes, for a practical binary compression, but that is not the best choice for a metric: it doesn't matter how many actual bits you need to write. Fractional bits aren't writable, but they are a better measure of the theoretical amount of information.
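For example, a symbol that occurs 90% of the time carries only -log2(0.9) ≈ 0.15 bits of information per occurrence, yet any actual prefix code (Huffman included) must still spend at least one whole bit on each of its occurrences.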
Mauro > 03-01-2025, 09:41 PM
(03-01-2025, 06:13 PM)oshfdk Wrote: This is great, I misunderstood the formula then. The numbers shown for the naive grammars roughly correspond to what I get by just gzipping the transliteration (minus the metadata, line numbers, etc.); I get something like 6.5e5 bits.
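A minimal sketch of that gzip baseline (the file name is hypothetical, and it assumes the metadata and line numbers have already been stripped):

```python
import gzip

# Hypothetical input: the bare transliteration, metadata and line numbers removed.
with open("voynich_transliteration.txt", "rb") as f:
    raw = f.read()

compressed = gzip.compress(raw, compresslevel=9)
print(f"raw: {len(raw) * 8} bits, gzipped: {len(compressed) * 8} bits")
```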
(03-01-2025, 06:13 PM)oshfdk Wrote: I think your metric looks very good for practical assessment of these grammars.
(03-01-2025, 06:13 PM)oshfdk Wrote: It would be interesting to see which process, beyond the grammar compression, would result in the smallest lossless output.
oshfdk > 04-01-2025, 05:52 AM
(03-01-2025, 09:41 PM)Mauro Wrote: What I want to do next is to try to find a better grammar than LOOP-Lay-oa. The number of possibilities is enormous, but maybe I can use a Monte Carlo algorithm to sample the grammar space. We'll see what happens.
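As a rough sketch of what a Monte Carlo search over the grammar space could look like (the mutation step and the scoring function below are placeholders, nothing here is from the paper):

```python
import random

def nbits(grammar):
    # Placeholder score: stands in for the real Nbits computation, i.e. the
    # total number of bits of the compressed text produced by this grammar.
    return sum(len(slot) ** 2 for slot in grammar)

def mutate(grammar):
    # Toy mutation: move one randomly chosen word to another slot.
    g = [list(slot) for slot in grammar]
    src = random.choice([slot for slot in g if slot])
    word = src.pop(random.randrange(len(src)))
    random.choice(g).append(word)
    return g

def monte_carlo_search(grammar, steps=10_000):
    best, best_score = grammar, nbits(grammar)
    current, current_score = best, best_score
    for _ in range(steps):
        candidate = mutate(current)
        score = nbits(candidate)
        if score <= current_score:      # greedy acceptance; could be annealed instead
            current, current_score = candidate, score
            if score < best_score:
                best, best_score = candidate, score
    return best, best_score

toy_grammar = [["qo", "ch", "sh"], ["ked", "ted"], ["y", "dy", "aiin"]]
print(monte_carlo_search(toy_grammar, steps=1000))
```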