03-01-2025, 12:47 PM
Happy New Year!
(03-01-2025, 11:28 AM)Mauro Wrote: I review the metrics used to evaluate slot grammars, showing the normally used efficiency and F1 score to be fundamentally flawed. After introducing the concept of ‘looping’ slot grammars, a generalization of standard grammars, I show how any grammar can be used in a distinctive lossless data compression algorithm to generate a compressed Voynich Manuscript text. This allows the definition of a new metric free from fundamental flaws: Nbits, the total number of bits needed to store the compressed text. I then compare published state-of-the-art grammars and the newly introduced LOOP-L grammar class using the Nbits metric.
Thank you! For some reason it was hard for me to follow the discussion in the thread, but the paper looks very clear and well-structured.
As far as I understand, you are in effect evaluating word-token grammars and not word-type grammars? At least I see that you multiply the word-chunk bits by the number of occurrences of the word token. In that case, maybe it makes sense to use the Shannon information metric instead of just log2(CSize), so that frequent chunks cost fewer bits than rare chunks, and frequent word tokens fewer than rare word tokens?
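Just to make the suggestion concrete, here is a quick Python sketch. The slot size and the chunk labels/counts are invented for illustration, not taken from your grammar:

Code:
from collections import Counter
from math import log2

# Invented example: how often each chunk of one slot (CSize = 4) is chosen
# when encoding the word tokens of the text. Labels and counts are made up.
CSize = 4
chunk_counts = Counter({"qo": 500, "ch": 300, "sh": 150, "cth": 50})
total = sum(chunk_counts.values())

# Flat cost: every chunk choice costs log2(CSize) bits, however common it is.
flat_bits = total * log2(CSize)

# Shannon cost: a chunk chosen with probability p costs -log2(p) bits,
# so frequent chunks (and hence frequent word tokens) become cheap.
shannon_bits = sum(n * -log2(n / total) for n in chunk_counts.values())

print(f"flat    : {flat_bits:.1f} bits")
print(f"Shannon : {shannon_bits:.1f} bits")

With these made-up numbers the flat encoding needs 2000 bits while the Shannon cost is about 1648, and the gap grows the more skewed the chunk frequencies are.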
Edit: I think I need to clarify my point. The Nbits metric essentially compares grammars by how much information is needed to reproduce the whole text using the grammar. But the way you compute that information content, even though it looks straightforward, is somewhat arbitrary: a different encoding gives a different number for the same grammar. Using Shannon information as the theoretical limit of compression, it should be possible to define a more universal metric for a grammar, one that does not depend on the particular encoding used.
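To show what I mean by encoding-dependence, here is another toy sketch (again with invented labels and counts): two valid prefix codes for the same stream of slot choices give different bit totals, while the entropy bound is the same for both and cannot be beaten by any lossless encoding.

Code:
from collections import Counter
from math import ceil, log2

# Invented frequencies for the choices made at one slot of the same grammar.
counts = Counter({"a": 600, "b": 250, "c": 100, "d": 50})
N = sum(counts.values())
probs = {c: n / N for c, n in counts.items()}

# Encoding 1: flat code, every choice costs ceil(log2(slot size)) bits.
flat_total = N * ceil(log2(len(counts)))

# Encoding 2: Shannon code, a choice with probability p costs ceil(-log2(p)) bits;
# these lengths satisfy the Kraft inequality, so such a prefix code exists.
shannon_total = sum(n * ceil(-log2(probs[c])) for c, n in counts.items())

# Theoretical limit: N * H(p), independent of whichever encoding is chosen.
entropy_bound = N * sum(-p * log2(p) for p in probs.values())

print(f"flat code     : {flat_total} bits")
print(f"Shannon code  : {shannon_total} bits")
print(f"entropy bound : {entropy_bound:.1f} bits")

Both encodings losslessly reproduce the same choice stream from the same grammar, yet they give 2000 and 1750 bits respectively; the entropy bound (about 1490 bits here) is the encoding-independent number I would compare grammars by.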

And of course there's no guarantee of success, but that's a given anyway with the VMS.