The difficult problem of evaluating slot grammars.
This post is a bit off-topic from the rest of the thread. It deals only with the difficult (but interesting) problem of evaluating slot grammars, and collects some thoughts I had after the exchange of opinions we had in here. I’m less interested in this problem now than when I started the thread, but it’s still an interesting one, and I thought that sharing some considerations and asking for help on a few questions might be worthwhile.
It’s actually a fairly obscure mathematical problem, related to the Voynich because ‘slot grammars’ are used as a tool to study it.
I used separators to better show the chain of reasoning, hoping it will make the post easier to read. It's actually shorter than it looks at first sight.
-----------------------------------------
From what I know of the literature (but I may have missed many references!), in addition to the number of unique word types in the text (a number I’ll call ‘NVoynichwords’ for short, ~8400 depending on the transcription), two variables are used for the evaluation of slot grammars:
• The number of correct Voynich word types the grammar generates; I’ll use ‘Nhits’ as a shorthand
• The total number of word types the grammar can possibly generate: ‘Ngrammarspace’
These two variables are usually used as (exact names may vary):
• Coverage (‘C’ ) = Nhits / NVoynichwords
• Efficiency (‘E’) = Nhits / Ngrammarspace
And as a combination of the two, the ‘F1 score’:
• F1 = 2 * C * E / (C + E)
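Just to make the definitions concrete, here is a minimal Python sketch of the three metrics. The counts fed in at the end are invented, purely to show the calculation, not taken from any real transcription or grammar:

```python
# Minimal sketch of coverage, efficiency and F1 from the raw counts.
# All variable names mirror the post; the numbers below are made up.

def coverage(n_hits: int, n_voynich_words: int) -> float:
    return n_hits / n_voynich_words

def efficiency(n_hits: int, n_grammar_space: int) -> float:
    return n_hits / n_grammar_space

def f1_score(c: float, e: float) -> float:
    # Harmonic mean of coverage and efficiency; zero if either is zero.
    return 0.0 if c + e == 0 else 2 * c * e / (c + e)

# Hypothetical grammar: hits 6000 of ~8400 word types,
# out of a generative space of 500000 possible words.
c = coverage(6000, 8400)
e = efficiency(6000, 500_000)
print(round(c, 3), round(e, 4), round(f1_score(c, e), 4))  # 0.714 0.012 0.0236
```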
----------------------------------------------
The rationale for using these metrics is as follows:
The primary requirement for a grammar is, obviously, its coverage. But coverage can’t be the only metric, because it can be gamed with a trivial grammar: just use as many slots as the length of the longest word type, with each slot containing all the characters used to write the text, and you get 100% coverage (I’ll call this grammar “Trivial LL*CS”, shorthand for “Longest word type length * Character set size”).
To safeguard against this degenerate case we use the efficiency, which gets terribly low when calculated on the “Trivial LL*CS” grammar.
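To give a rough feel of just how low: with some assumed numbers (say 14 slots, each holding ~25 characters plus a null, which gives 100% coverage by construction), the efficiency of Trivial LL*CS comes out around 10^-16:

```python
# Back-of-the-envelope efficiency of the "Trivial LL*CS" grammar.
# Assumed numbers: longest word type ~14 characters, ~25 characters in the
# alphabet, one null per slot, ~8400 word types all covered.
n_voynich_words = 8400
slots = 14
choices_per_slot = 25 + 1                  # 25 characters + the null
n_grammar_space = choices_per_slot ** slots
efficiency = n_voynich_words / n_grammar_space
print(f"{efficiency:.1e}")                 # ~1.3e-16
```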
The F1 score does not add any new information, but combines coverage and efficiency into a single ‘figure of merit’. However, I think it suffers from a numerical problem (sorry for not being rigorous in this part, it’s not easy). For simplicity, instead of coverage and efficiency let’s consider the variables they are derived from: Nhits and Ngrammarspace. Ngrammarspace tends to grow ‘exponentially’ with the grammar, without bound, because it is a product over the slots (possible choices of slot 1 * possible choices of slot 2 * ….., excuse me for not writing it in proper mathematical form). Nhits, on the contrary, is fundamentally capped at a small number (the number of word types in the text, ~8400). The result is that efficiency tends to swamp coverage in the F1 score (see also Note 2), so that grammars with high coverage (the primary requirement), which necessarily have low efficiency, are treated worse than grammars with lower coverage but higher efficiency.
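A quick illustration of this swamping, with two invented grammars: A has high coverage but a huge generative space, B has much lower coverage but a tiny space, and B still beats A on F1 by a wide margin:

```python
# Hypothetical comparison showing how efficiency swamps coverage in F1.
n_voynich_words = 8400

def f1(n_hits: int, n_grammar_space: int) -> float:
    c = n_hits / n_voynich_words
    e = n_hits / n_grammar_space
    return 2 * c * e / (c + e)

# Grammar A: 95% coverage, but its slot product generates 2 million words.
# Grammar B: 40% coverage, generating only 10000 words.
print("A:", round(f1(int(0.95 * n_voynich_words), 2_000_000), 4))  # ~0.0079
print("B:", round(f1(int(0.40 * n_voynich_words), 10_000), 4))     # ~0.37
```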
-------------------------------------
But most importantly, these metrics are obviously not enough, because it’s also possible to trivially cheat on efficiency and F1 score by using a grammar with just one slot containing all the word types of the text (I’ll call this “Trivial 1*WS”, shorthand for “One slot * Number of word types”): 100% coverage, 100% efficiency, and the best possible F1 score in the world (one).
-------------------------------------
How can this situation be improved?
I thought a little about it, and it’s a difficult problem. The best idea I had is to add a third variable, which I’ll just call X, and the simplest one I could think of is X = total size of the character set used by the grammar (‘Ncharset’). By character set I don’t mean just ‘c’, ‘d’, ‘e’ etc., but all the unique combinations of characters used by the grammar, i.e. “ch”, “sh”, “eee”, “ol”, “lk” or even “teodyteytar” (this last one in the Trivial 1*WS grammar), plus the ‘null’ character. And of course, the same character used in different slots (‘nulls’ included) must be counted multiple times (see also picky Note 3).
Using this variable it’s possible, for instance, to multiply by (NVoynichwords - Ncharset), which is zero in the case of Trivial 1*WS, and so get the desired result of giving a zero score to this degenerate case as well.
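For what it's worth, here is one possible way to compute these counts from a concrete slot-grammar representation. The three-slot toy grammar is invented just to fix ideas; a real grammar would of course be much larger:

```python
from itertools import product

# A slot grammar as a list of slots; each slot is a list of 'characters'
# (possibly multi-glyph strings), with "" standing for the null choice.
# Purely illustrative toy grammar, not a real Voynich grammar.
toy_grammar = [
    ["", "q"],
    ["o", "a"],
    ["kee", "tee", "l", ""],
]

def n_charset(grammar) -> int:
    # Every entry of every slot counts, nulls included; the same 'character'
    # appearing in two different slots is counted twice.
    return sum(len(slot) for slot in grammar)

def n_grammar_space(grammar) -> int:
    # Size of the generative space: the product of the slot sizes.
    space = 1
    for slot in grammar:
        space *= len(slot)
    return space

def n_hits(grammar, voynich_word_types: set) -> int:
    # Word types the grammar generates that actually occur in the text:
    # one choice per slot, concatenated.
    generated = {"".join(choice) for choice in product(*grammar)}
    return len(generated & voynich_word_types)

print(n_charset(toy_grammar))        # 2 + 2 + 4 = 8
print(n_grammar_space(toy_grammar))  # 2 * 2 * 4 = 16
print(n_hits(toy_grammar, {"qokeedy", "otee", "qol", "daiin"}))  # 2
```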
------------------------------------------
Now we need the ‘figure of merit’ function. It’s easy to say which basic constraints it must satisfy: it must be zero if Nhits is zero, then increase monotonically with Nhits. It must decrease monotonically with Ngrammarspace and go to zero at infinity. It must decrease monotonically with Ncharset and be zero at Ncharset = NVoynichwords.
Another important constraint it must satisfy, imho, is that a high-coverage grammar should score better than another grammar with a much higher efficiency but much lower coverage, because it’s all too easy to cheat on efficiency just by defining more and more complex ‘characters’ (see also Note 4), and/or by judiciously reducing the coverage. For lack of a better name, I call this property ‘consistency’. By the way, the F1 score does not respect ‘consistency’, because it suffers from the ‘efficiency swamps coverage’ problem noted above (unless, of course, the grammars are first binned by coverage, with each bin containing grammars of comparable coverage: then the F1 score inside each bin becomes ‘consistent’).
-------------------------------------------
The basic constraints for a figure of merit are easy to satisfy with a function, e.g. F = (NVoynichwords - Ncharset) * Nhits / Ngrammarspace would already do (as you can easily verify), but it suffers from the same problem as the F1 score: at a certain point Ngrammarspace dominates everything else, so this formula lacks ‘consistency’ as (non-rigorously) defined above.
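Plugging in rough numbers for the two degenerate grammars (again assumed values, not exact counts) shows that this F at least handles both of them, sending each one to zero or nearly zero:

```python
n_voynich_words = 8400

def figure_of_merit(n_hits: int, n_charset: int, n_grammar_space: int) -> float:
    # F = (NVoynichwords - Ncharset) * Nhits / Ngrammarspace
    return (n_voynich_words - n_charset) * n_hits / n_grammar_space

# Trivial LL*CS (assumed: 14 slots of 26 choices each, 100% coverage):
# Ncharset = 14 * 26 = 364, Ngrammarspace = 26**14.
print(figure_of_merit(n_voynich_words, 14 * 26, 26**14))   # ~1e-12, crushed by the space

# Trivial 1*WS (one slot holding all 8400 word types):
# Ncharset = NVoynichwords, so the first factor is exactly zero.
print(figure_of_merit(n_voynich_words, n_voynich_words, n_voynich_words))  # 0.0
```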
I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula that can be justified rationally, from basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score, with grammars somehow binned by coverage? But in that case, how should the binning be done without risking the introduction of a personal bias? Thank you for any comments/answers!
Note 1: (just to make sure I got it right and haven’t switched the consensus meanings…) A ‘word type’ is any unique word appearing in the text, i.e. ‘daiin’ and ‘teodyteytar’ are two ‘word types’. This contrasts with ‘word token’ (not used in this post): there are 865 ‘daiin’ word tokens and one ‘teodyteytar’ word token in the text. In any case, I use ‘word type’ in the sense defined here.
Note 2: if E << C then F1 = 2*C*E/(C+E) ≈ 2*C*E/C = 2*E, and coverage drops out of the calculation entirely.
Note 3 (picky note): it might seem that adding Ncharset cannot add any new information, because the very same numbers used to calculate it (the number of ‘characters’ in each slot) have already been used in calculating Ngrammarspace (just in a product instead of a sum). But it’s also true that Ncharset cannot be recovered from Ngrammarspace alone, because in general there are multiple ways of factoring Ngrammarspace (is it 2*3*5, or 6*5, or 2*15? they all give different Ncharset values). So Ncharset does add some information about the slot grammar that Ngrammarspace by itself does not capture.
Note 4: a posteriori, I realised that’s actually what I did myself in the presentation in the first post of this thread, where I was completely focused on efficiency (following in Zattera’s footsteps on this). To chase efficiency and get ever better F1 scores I transformed a simple and regular grammar into more complex ones, just by adding more symbols. That was a totally unnecessary, asinine amount of work: I should have started directly with the LOOP grammar (which was in effect the key insight), worried much less about efficiency, and ditched BASIC, COMPACT and EXTENDED altogether. It’s been just a few days, but I’d rewrite that presentation upside-down now, and I owe that to talking with the Voynich Ninja community. You rock!