03-12-2024, 04:38 PM
(03-12-2024, 10:33 AM)Mauro Wrote: I have no clue on how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula which can be justified rationally, on basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for F1 score and grammars binned somehow by coverage? But in that case, how should the binning be done without risking the introduction of a personal bias? Thank you for any comments/answers!
I still have a very poor grasp of what question these grammars would address, so I definitely have no answer. But maybe if we try a simple example, where it's possible to enumerate all slot grammars and identify which of them looks more useful and why, then we can at least qualitatively identify what we are looking for and start talking about numeric metrics. This short exercise will also let me check whether my understanding of your approach is correct, if you don't mind spending a minute on it.
First let's consider two grammars that have the same values for coverage, efficiency and the number of elements, and try to see whether we can say that one is better than the other (again, I have no answer to this).
Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot) and [AB|ABC|BC|AC|BA|BAB] (all words in a single slot, 6 elements in 1 slot). These are not interesting: they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.
Let's compare two grammars that, if I get it right, have the same statistics when it comes to coverage, efficiency and the number of elements:
[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A, covers the whole text and has 7 elements in 3 slots.
[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC. It also covers the whole text and has the same stats as above, but with 7 elements in 2 slots instead of 3.
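For anyone who wants to check the counts above, here is a minimal Python sketch that enumerates the strings each grammar produces. The definitions are my assumption, not necessarily the ones you use: coverage is the fraction of text words the grammar generates, efficiency is the fraction of generated strings that actually occur in the text, and ∅ counts as an element. The names generate, stats and GRAMMARS are made up for this illustration.

```python
from itertools import product

# The toy "text" from the example above.
WORDS = {"AB", "ABC", "BC", "AC", "BA", "BAB"}


def generate(grammar):
    """Return the set of distinct strings a slot grammar can produce.

    A grammar is a list of slots; each slot is a list of alternatives,
    with "" standing for the empty element (∅).
    """
    return {"".join(choice) for choice in product(*grammar)}


def stats(grammar, words=WORDS):
    """Compute the assumed metrics for one grammar against the word set."""
    generated = generate(grammar)
    hits = generated & words
    return {
        "slots": len(grammar),
        # Assumption: the empty element counts as an element of its slot.
        "elements": sum(len(slot) for slot in grammar),
        "generated": len(generated),
        # Assumed definition: share of text words the grammar can produce.
        "coverage": len(hits) / len(words),
        # Assumed definition: share of generated strings found in the text.
        "efficiency": len(hits) / len(generated),
    }


GRAMMARS = {
    "any letter, 3 slots": [["A", "B", "C", ""]] * 3,
    "all words, 1 slot": [sorted(WORDS)],
    "3-slot": [["B", ""], ["AB", "B", "A"], ["C", ""]],
    "2-slot": [["BA", "A", "B"], ["A", "B", "BC", "C"]],
}

for name, grammar in GRAMMARS.items():
    print(name, stats(grammar))
```

Under these assumed definitions the 3-slot and 2-slot grammars come out identical on everything except the slot count, which is exactly why I can't tell which one should be preferred.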
Is either of these grammars better than the other in any way? Are they better than the trivial grammar with 6 elements in 1 slot, and if so, what exactly is their advantage when it comes to the analysis of the text? I have trouble understanding this. Yes, intuitively these grammars seem “interesting”: they have many different elements, and one could try to guess some interactions. But I still have no idea how they can be used in practice to analyze text properties quantitatively, and therefore how to evaluate their usefulness.
[Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that would solve this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is exactly ¯\_(ツ)_/¯