oshfdk > 03-12-2024, 04:38 PM
(03-12-2024, 10:33 AM)Mauro Wrote: I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula which can be justified rationally, on basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score and grammars binned somehow by coverage? But in that case, how should the binning be done without risking the introduction of personal bias? Thank you for any comments/answers!
Mauro > 03-12-2024, 06:01 PM
(03-12-2024, 04:38 PM)oshfdk Wrote: (03-12-2024, 10:33 AM)Mauro Wrote: I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula which can be justified rationally, on basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score and grammars binned somehow by coverage? But in that case, how should the binning be done without risking the introduction of personal bias? Thank you for any comments/answers!
I still have a very poor grasp of what question these grammars would address, so I definitely have no answer. But maybe if we try a simple example, where it's possible to enumerate all slot grammars and identify which of them looks more useful and why, then we can at least qualitatively identify what we are looking for and start talking about numeric metrics. Also, this short exercise will let me check whether my understanding of your approach is correct, if you don't mind spending a minute on it.
First, let's consider two grammars that have the same values for all the metrics (coverage, efficiency and the number of elements) and try to see whether we can say that one is better than the other (again, I have no answer to this).
Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say, I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot), or [AB|ABC|BC|AC|BA|BAB] (all words in a single slot, 6 elements in 1 slot). These are not interesting, they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.
(03-12-2024, 04:38 PM)oshfdk Wrote: Let's compare two grammars that, if I get it right, have the same statistics when it comes to coverage, efficiency and the number of elements:
[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A, covers the whole text and has 7 elements in 3 slots.
[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC; it covers the whole text and has the same stats as above, but with 7 elements in 2 slots.
Is any one of these grammars better than the other in any way?
(03-12-2024, 04:38 PM)oshfdk Wrote: Are they better than the trivial 6 elements in 1 slot grammar?
(03-12-2024, 04:38 PM)oshfdk Wrote: and if so, what exactly is their advantage when it comes to the analysis of the text? I have trouble understanding this. Yes, intuitively these grammars seem “interesting”: they have many different elements, and one could try to guess some interactions, but I still have no idea how they can be practically used to analyze text properties quantitatively, and so how to evaluate their usefulness.
[Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that would solve this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is exactly ¯\_(ツ)_/¯
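Since the example is fully enumerable, here is a minimal sketch (my own, not anything posted in the thread) that re-checks the numbers above: it expands each slot grammar, then computes coverage and efficiency against the six toy words, with the two degenerate grammars included for comparison.

```python
# Enumerate every string each slot grammar can produce, then compute coverage
# (text words covered / text words) and efficiency (text words covered / strings produced).
# "" marks a slot left empty.
from itertools import product

words = {"AB", "ABC", "BC", "AC", "BA", "BAB"}

def expand(grammar):
    """All distinct strings a slot grammar can generate."""
    return {"".join(combo) for combo in product(*grammar)} - {""}

grammars = {
    "[B|∅][AB|B|A][C|∅]":      [["B", ""], ["AB", "B", "A"], ["C", ""]],
    "[BA|A|B][A|B|BC|C]":      [["BA", "A", "B"], ["A", "B", "BC", "C"]],
    "any letter in each slot": [["A", "B", "C", ""]] * 3,
    "all words in one slot":   [sorted(words)],
}

for name, g in grammars.items():
    out = expand(g)
    hits = words & out
    print(f"{name}: generates {len(out)} strings, "
          f"coverage {len(hits)/len(words):.2f}, efficiency {len(hits)/len(out):.2f}")

# The two non-trivial grammars both give 12 strings, coverage 1.00, efficiency 0.50;
# the letters-only grammar gives 39 strings (efficiency ~0.15), while the single-slot
# grammar scores a "perfect" 1.00/1.00 -- which is exactly the degenerate case under discussion.
```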
Juan_Sali > 03-12-2024, 07:08 PM
(03-12-2024, 10:33 AM)Mauro Wrote: From what I know from the literature (but I may have missed many references!), in addition to the number of unique word types in the text (a number I’ll call for short ‘NVoynichwords’, ~8400 depending on the transcription), two variables are used for the evaluation of slot grammars:
• The number of correct Voynich word types the grammar generates; I shall use ‘Nhits’ as shorthand
• The total number of word types the grammar can possibly generate: ‘Ngrammarspace’
These two variables are usually used as (exact names may vary):
• Coverage (‘C’) = Nhits / NVoynichwords
• Efficiency (‘E’) = Nhits / Ngrammarspace
I would add another variable, the total coverage C1: the sum of all instances of every word type the grammar generates divided by the total number of running words (tokens) in the text.
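For concreteness, a minimal sketch of the three metrics as defined above; the helper name and the figures in the example call are placeholders of mine (only the ~8400 word types comes from the thread), not measurements for any actual grammar.

```python
# Coverage, efficiency and token ("total") coverage as I read the definitions above.
def grammar_metrics(n_hits_types, n_voynich_types, n_grammar_space,
                    n_hits_tokens, n_voynich_tokens):
    C  = n_hits_types / n_voynich_types    # coverage over word types
    E  = n_hits_types / n_grammar_space    # efficiency of the grammar space
    C1 = n_hits_tokens / n_voynich_tokens  # Juan_Sali's total coverage over word tokens
    return C, E, C1

# Example with made-up numbers: a grammar hitting 5000 of ~8400 word types out of
# 500000 generable strings, with those types accounting for 150000 of 180000 tokens.
print(grammar_metrics(5000, 8400, 500_000, 150_000, 180_000))
# (0.595..., 0.01, 0.833...)
```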
Mauro > 03-12-2024, 08:41 PM
(03-12-2024, 07:08 PM)Juan_Sali Wrote: I would add another variable, the total coverage C1: the sum of all instances of every word type the grammar generates divided by the total number of running words (tokens) in the text.
(03-12-2024, 07:08 PM)Juan_Sali Wrote: Some words have few instances, and reducing the size of the slot grammar may reduce E significantly with a small decrease in C1.
(03-12-2024, 07:08 PM)Juan_Sali Wrote: It is important to look for the maximum number of total words (tokens) generated with the minimum possible grammar size.
ReneZ > 04-12-2024, 12:33 AM
oshfdk > 04-12-2024, 05:04 AM
(03-12-2024, 06:01 PM)Mauro Wrote: (03-12-2024, 04:38 PM)oshfdk Wrote: Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say, I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot), or [AB|ABC|BC|AC|BA|BAB] (all words in a single slot, 6 elements in 1 slot). These are not interesting, they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.
[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A, covers the whole text and has 7 elements in 3 slots.
[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC; it covers the whole text and has the same stats as above, but with 7 elements in 2 slots.
Is any one of these grammars better than the other in any way?
The example is right. In this case I would say those two grammars are equivalent in all respects (I think the number of slots is irrelevant).
(03-12-2024, 04:38 PM)oshfdk Wrote: Are they better than the trivial 6 elements in 1 slot grammar?
Sure they are: they are not trivial, and they actually say something about the structure of the words AB, ABC, etc., using less information.
(03-12-2024, 06:01 PM)Mauro Wrote: (03-12-2024, 04:38 PM)oshfdk Wrote: [Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that would solve this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is exactly ¯\_(ツ)_/¯
Ah, that's the right question. I'll try to be as schematic as possible:
- The VMS has a word structure which is remarkable and unheard of.
- So it's very probable the word structure has something to do with the method used to write it (whether the text is meaningful or meaningless does not matter).
- So getting a better grip on this structure should be of help in studying the VMS (and many people have tried to do this since, I don't know exactly, the 1960s?). Will this be enough? I very much doubt it, but who knows.
- It's very difficult to say if a grammar is 'good'. But having a reliable method to rank them according to some factor would help.
Once one has settled on a grammar, the task could be "imagine what the word structure could encode and how, then try to demonstrate it does". Very, very difficult. By the way, I tried again to ‘decode’ Voynich as a Latin syllabary (by trying to fit the more common Voynich words to the most common Latin ones). Nothing at all: it does not work. But that was just one possibility; the problem is there are too many of them.
Mauro > 04-12-2024, 10:02 AM
(04-12-2024, 12:33 AM)ReneZ Wrote: One can of course do several things, and as usual it is impractical to try everything.
Giving more weight to frequent words in the score can be achieved by counting not word types but word tokens. So if a word occurs 200 times, count it as 200.
Alternatively, one can leave words that appear only once out of consideration altogether.
One could make refinements using conditional slots. Example: a certain slot can only be used if a certain other slot is also used. Or the opposite: a certain slot cannot be used if a certain other slot is also used. This immediately makes it considerably more complicated. But I have a feeling that this would improve the success.
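Both refinements are easy to prototype. Below is a rough sketch under my own assumptions (the grammar format, the rule encoding and the token counts are invented for illustration; this is not ReneZ's actual setup): token-weighted coverage, plus conditional slots expressed as simple "requires"/"excludes" rules between slot indices.

```python
from itertools import product

def generate(grammar, rules):
    """Yield every word the grammar produces that satisfies all slot rules.

    grammar: list of slots, each a list of elements ("" = slot left empty).
    rules:   list of (kind, a, b) with kind in {"requires", "excludes"}:
             "requires": slot a may be filled only if slot b is also filled;
             "excludes": slot a may not be filled if slot b is filled.
    """
    for combo in product(*grammar):
        used = [element != "" for element in combo]
        ok = all(
            (not used[a] or used[b]) if kind == "requires" else not (used[a] and used[b])
            for kind, a, b in rules
        )
        if ok:
            yield "".join(combo)

def token_coverage(grammar, rules, token_counts):
    """Fraction of word tokens (not types) that the grammar can generate."""
    producible = set(generate(grammar, rules))
    hit = sum(n for word, n in token_counts.items() if word in producible)
    return hit / sum(token_counts.values())

# Toy example reusing the earlier word list, with invented token counts.
counts = {"AB": 4, "ABC": 3, "BC": 2, "AC": 2, "BA": 2, "BAB": 1}
no_hapax = {w: n for w, n in counts.items() if n > 1}  # the "leave out words that appear once" variant
grammar = [["B", ""], ["AB", "B", "A"], ["C", ""]]
rules = []  # e.g. [("requires", 0, 2)] would mean: slot 0 may only be used together with slot 2
print(token_coverage(grammar, rules, counts), token_coverage(grammar, rules, no_hapax))  # 1.0 1.0
```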
(04-12-2024, 12:33 AM)ReneZ Wrote: The real mystery is IMHO that all this is not supposed to exist.
Which language exhibits a slot structure?
(04-12-2024, 12:33 AM)ReneZ Wrote: There are examples in other areas, e.g. the trivial case of numbers. You just need four (identical) slots to cover the numbers 1-9999. Trivial of course, but it becomes more interesting with the Greek or Arabic numbering systems that use letters of the alphabet.
Even more interesting are Roman numerals, which would benefit from the conditional slots mentioned above.
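To make the numbers example concrete, here is a small sketch (my own illustration, not anything from the thread) of Roman numerals as a four-slot grammar, one slot per decimal position. In this particular decomposition the only condition needed is that not all four slots may be empty at once; finer-grained slot splits would indeed call for the conditional rules mentioned above.

```python
from itertools import product

# One slot per decimal position (thousands, hundreds, tens, units); "" = slot left empty.
roman_slots = [
    ["", "M", "MM", "MMM"],
    ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"],
    ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"],
    ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"],
]

generated = {"".join(combo) for combo in product(*roman_slots)} - {""}
print(len(generated))  # 3999: every numeral from I to MMMCMXCIX, each produced exactly once
# Coverage of the numerals 1-3999 is 100%, and efficiency is 3999/4000:
# the only "wasted" combination is the all-empty one.
```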
(04-12-2024, 12:33 AM)ReneZ Wrote: The one thing I am reasonably certain about is that the MS text was not created by someone setting up a slot system and actually using it to generate the words.
It should have been something else that ended up being equivalent.
This works both for meaningful and meaningless text.
Mauro > 04-12-2024, 10:42 AM
(04-12-2024, 05:04 AM)oshfdk Wrote: This is interesting. If the number of slots is irrelevant, then the trivial grammar comes out on top on each of the three metrics: coverage (they all have 100%), efficiency and the number of elements. My intuition is that the number of slots matters, and that grammar [B|∅][AB|B|A][C|∅] is subjectively better than [BA|A|B][A|B|BC|C], because it includes two optional elements, effectively showing a prefix-infix-suffix form (which may or may not correspond to some actual feature of the underlying language, or could just be a statistical fluke), while grammar [BA|A|B][A|B|BC|C] appears to assign some chunks to prefixes and suffixes with no clear path to a language model. E.g., why is BA a prefix and BC a suffix? Note that either grammar could be the true grammar of the original language; it's just that one of them attempts to extract functional elements from the text, and the other doesn't.
If we assume the number of slots is irrelevant, maybe the total combined length of all element strings is relevant? But in my opinion, all this is just guesswork, until the more fundamental problem below is addressed.
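One crude way to make the "total combined length of all element strings" idea measurable, sketched under my own definition of grammar size: count the characters needed to write the grammar down. Under this proxy the prefix-infix-suffix grammar does come out smaller than the other candidate and much smaller than the all-words-in-one-slot grammar.

```python
# Total combined length of all element strings, as a rough "grammar size" proxy.
def size(grammar):
    return sum(len(element) for slot in grammar for element in slot)

g1      = [["B", ""], ["AB", "B", "A"], ["C", ""]]   # [B|∅][AB|B|A][C|∅]
g2      = [["BA", "A", "B"], ["A", "B", "BC", "C"]]  # [BA|A|B][A|B|BC|C]
trivial = [["AB", "ABC", "BC", "AC", "BA", "BAB"]]   # all six words in one slot

print(size(g1), size(g2), size(trivial))  # 6 9 14
```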
(04-12-2024, 05:04 AM)oshfdk Wrote: The problem is that if we are trying to optimize the quantitative (numeric) metrics of the grammars, we need a quantitative metric of the end task. We can try comparing the statistical or structural properties of the predicted and the actual sets. Something like (I'm just throwing in examples randomly): "median length of a grammar word divided by the median length of a text word", "median Levenshtein distance between a grammar word and the nearest text word", "total number of character bigrams predicted by the grammar but absent in the text", "same as the last one, but reformulated as probabilities: the probability of the actual count of a character bigram in the word list given the expected count of a character bigram in the grammar, assuming all the outputs of the grammar are equally likely", etc. There are many, many potential metrics of how well the predicted set matches the actual text, but without choosing a specific metric it's really hard to evaluate the grammars.
[Edit]: Maybe we can even reframe the metric as a linear algebra problem: given a slot grammar, assign probabilities to each element (so that the probabilities sum to 100% for each slot) in such a way as to minimize the difference between the probabilities of the words of the predicted set (computed as the product of the probabilities of the respective elements taken from each slot, I guess) and the relative counts of the same words in the observed set (computed as the actual word count divided by the total number of tokens, 0 for missing words). This way we will try to mimic the actual token distribution of the text. Then compare the grammars, each using its respective optimal element probabilities, with this metric (the difference between the probabilities of the words of the predicted set and the relative counts of the same words in the observed set).
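A rough sketch of how this fitting could look in practice, a toy implementation of my own under several assumptions (squared error as the "difference", made-up token counts, a general-purpose optimizer instead of a closed-form solution):

```python
# Assign a probability to every element of every slot (softmax over free parameters,
# so each slot sums to 1), compute the probability the grammar gives to each string
# it can generate, and minimise the squared difference against the observed frequencies.
from itertools import product

import numpy as np
from scipy.optimize import minimize

# Toy corpus from the earlier example; the token counts are invented for illustration.
counts = {"AB": 4, "ABC": 3, "BC": 2, "AC": 2, "BA": 2, "BAB": 1}
total = sum(counts.values())

grammar = [["B", ""], ["AB", "B", "A"], ["C", ""]]  # "" = slot left empty

def word_probs(logits):
    """Generated word -> probability, given one logit per element of each slot."""
    slot_p = [np.exp(l - l.max()) / np.exp(l - l.max()).sum() for l in logits]  # softmax per slot
    probs = {}
    for combo in product(*[range(len(slot)) for slot in grammar]):
        word = "".join(grammar[i][j] for i, j in enumerate(combo))
        p = float(np.prod([slot_p[i][j] for i, j in enumerate(combo)]))
        probs[word] = probs.get(word, 0.0) + p  # sum over multiple derivations, if any
    return probs

def loss(flat):
    logits, k = [], 0
    for slot in grammar:                        # unflatten the parameter vector per slot
        logits.append(np.asarray(flat[k:k + len(slot)]))
        k += len(slot)
    pred = word_probs(logits)
    words = set(pred) | set(counts)
    return sum((pred.get(w, 0.0) - counts.get(w, 0) / total) ** 2 for w in words)

n_params = sum(len(slot) for slot in grammar)
result = minimize(loss, np.zeros(n_params), method="Nelder-Mead")
print("best achievable mismatch for this grammar:", result.fun)  # lower = mimics the text better
```

Comparing the best achievable mismatch across candidate grammars (each with its own optimal element probabilities) would then give exactly the ranking described in the [Edit], at the cost of one small optimization run per grammar.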
Ruby Novacna > 04-12-2024, 12:53 PM
oshfdk > 04-12-2024, 01:54 PM
(04-12-2024, 10:42 AM)Mauro Wrote: I also take the chance to compliment you on your work on the Marginalia (I just saw it yesterday, I have a lot of threads to read xD). I wish I had 1/100th of your skills and knowledge in paleography and image processing!! And of course, clues external to the bare transcribed text could be (and are) very, very useful.