The Voynich Ninja

Full Version: A family of grammars for Voynichese
Pages: 1 2 3 4 5 6 7 8
(03-12-2024, 10:33 AM)Mauro Wrote: I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula which can be justified rationally, on basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score, with grammars binned somehow by coverage? But in that case, how should the binning be done without risking introducing a personal bias? Thank you for any comments/answers!

I still have a very poor grasp of what question these grammars would address, so I definitely have no answer. But maybe if we try a simple example, where it's possible to enumerate all slot grammars and identify which of them looks more useful and why, then we can at least qualitatively identify what we are looking for and start talking about numeric metrics. Also, this short exercise will let me check whether my understanding of your approach is correct, if you don't mind spending a minute on it.

First, let's consider two grammars that have the same set of metrics (coverage, efficiency and the number of elements) and try to see whether we can say that one is better than the other (again, I have no answer to this).

Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot), or [AB|ABC|BC|AC|BA|BAB] (all words in a single slot: 6 elements in 1 slot). These are not interesting; they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.

Let's compare two grammars that, if I get it right, have the same statistics when it comes to coverage, efficiency and the number of elements:

[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A; it covers the whole text and has 7 elements in 3 slots.

[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC; it covers the whole text and has the same stats as above, but with 7 elements in 2 slots.

Is either of these grammars better than the other in any way? Are they better than the trivial 6-elements-in-1-slot grammar, and if so, what exactly is their advantage when it comes to the analysis of the text? I have trouble understanding this. Yes, intuitively these grammars seem “interesting”: they have many different elements, and one could try to guess some interactions, but I still have no idea how they can be practically used to analyze text properties quantitatively, and so how to evaluate their usefulness.
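For concreteness, the enumeration and the coverage/efficiency figures above can be checked mechanically. A minimal Python sketch (the encoding of a slot grammar as a list of lists with `None` for ∅ is my own convention for this example, not anything from actual Voynich tooling):

```python
from itertools import product

# Slot grammar encoding: one list per slot, None = the empty element (∅)
grammar = [["B", None], ["AB", "B", "A"], ["C", None]]   # [B|∅][AB|B|A][C|∅]
words = {"AB", "ABC", "BC", "AC", "BA", "BAB"}

# Every producible string is one choice per slot, joined together
generated = {"".join(e for e in combo if e) for combo in product(*grammar)}
hits = words & generated

coverage = len(hits) / len(words)          # C = Nhits / Nwords
efficiency = len(hits) / len(generated)    # E = Nhits / Ngrammarspace

print(sorted(generated))     # the 12 producible strings
print(coverage, efficiency)  # 1.0 0.5
```

Running the same snippet with the second grammar `[["BA", "A", "B"], ["A", "B", "BC", "C"]]` gives the same figures, which is exactly the tie the post is asking about.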

[Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that solves this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is, exactly ¯\_(ツ)_/¯
(03-12-2024, 04:38 PM)oshfdk Wrote:
(03-12-2024, 10:33 AM)Mauro Wrote: I have no clue how to solve the problem of ‘consistency’. Yeah, I could put logarithms or exponents or anything else into the formula until it ‘behaves’, but those would all be ad-hoc fixes: that is to say, a probably useless/futile exercise. What would be needed is a formula which can be justified rationally, on basic principles (it might also use different ‘X’ variables, of course), but I fear that’s well beyond my math skills.
Should anyone be interested in this obscure problem: any ideas? Remarks? Maybe just settle for the F1 score, with grammars binned somehow by coverage? But in that case, how should the binning be done without risking introducing a personal bias? Thank you for any comments/answers!

I still have a very poor grasp of what question these grammars would address, so I definitely have no answer. But maybe if we try a simple example, where it's possible to enumerate all slot grammars and identify which of them looks more useful and why, then we can at least qualitatively identify what we are looking for and start talking about numeric metrics. Also, this short exercise will let me check whether my understanding of your approach is correct, if you don't mind spending a minute on it.

First, let's consider two grammars that have the same set of metrics (coverage, efficiency and the number of elements) and try to see whether we can say that one is better than the other (again, I have no answer to this).

Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot), or [AB|ABC|BC|AC|BA|BAB] (all words in a single slot: 6 elements in 1 slot). These are not interesting; they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.

Perfect, you got it right.

(03-12-2024, 04:38 PM)oshfdk Wrote: Let's compare two grammars that, if I get it right, have the same statistics when it comes to coverage, efficiency and the number of elements:

[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A; it covers the whole text and has 7 elements in 3 slots.

[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC; it covers the whole text and has the same stats as above, but with 7 elements in 2 slots.

Is any one of these grammars better than the other in any way?

The example is right. In this case I would say those two grammars are equivalent in all respects (the number of slots, I think, is irrelevant).

(03-12-2024, 04:38 PM)oshfdk Wrote: Are they better than the trivial 6 elements in 1 slot grammar?

Sure they are: they are not trivial, they actually say something about the structure of the words AB, ABC, etc., using less information.

(03-12-2024, 04:38 PM)oshfdk Wrote: ...and if so, what exactly is their advantage when it comes to the analysis of the text? I have trouble understanding this. Yes, intuitively these grammars seem “interesting”: they have many different elements, and one could try to guess some interactions, but I still have no idea how they can be practically used to analyze text properties quantitatively, and so how to evaluate their usefulness.

[Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that solves this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is, exactly ¯\_(ツ)_/¯

Ah, that's the good question. I'll try to be as schematic as possible:

- The VMS has a word structure which is remarkable and unheard of.
- So it's very probable the word structure has something to do with the method used to write it (be it a meaningful or a meaningless text, it does not matter).
- So getting a better grasp of this structure should help in studying the VMS (and many people have tried to do this since, I don't know exactly, the '60s?). Will this be enough? I very much doubt it, but who knows :)
- It's very difficult to say if a grammar is 'good'. But having a reliable method to rank them according to some factor would help.

Once one has settled on a grammar, the task could be "imagine what the word structure could encode, and how, then try to demonstrate it does". Very, very difficult. Btw, I tried again to 'decode' Voynich as a Latin syllabary (by trying to fit the more common Voynich words to the most common Latin ones). Nada de nada, it does not work at all. But that was just one possibility; the problem is there are too many of them :(
(03-12-2024, 10:33 AM)Mauro Wrote: From what I know of the literature (but I may have missed many references!), in addition to the number of unique word types in the text (a number I'll call ‘NVoynichwords’ for short, ~8400 depending on the transcription), two variables are used for the evaluation of slot grammars:

• The number of correct Voynich word types the grammar generates; I shall use ‘Nhits’ as shorthand
• The total number of word types the grammar can possibly generate: ‘Ngrammarspace’

These two variables are usually used as (exact names may vary):

• Coverage (‘C’ ) = Nhits / NVoynichwords
• Efficiency (‘E’) = Nhits / Ngrammarspace
I would add another variable, the total coverage C1: the sum of all instances of every word type the grammar generates, divided by the total number of word tokens in the text.
Some words have few instances, and reducing the size of the slot grammar may reduce E significantly with only a small decrease in C1.
It is important to search for the maximum number of total words generated with the minimum possible grammar size.
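To illustrate how the three quantities relate, here is a small Python sketch computing C, E and the token-level C1 from a list of word tokens. The function name `grammar_metrics` and the toy data are my own for illustration, not anyone's actual tooling in this thread:

```python
from collections import Counter
from itertools import product

def grammar_metrics(grammar, tokens):
    """Return (C, E, C1) for a slot grammar.
    grammar: list of slots, each a list of elements (None = empty element).
    tokens:  the text as a list of word tokens, repetitions included."""
    generated = {"".join(e for e in combo if e) for combo in product(*grammar)}
    counts = Counter(tokens)                          # word type -> frequency
    hits = set(counts) & generated                    # word types the grammar produces
    C = len(hits) / len(counts)                       # Nhits / NVoynichwords
    E = len(hits) / len(generated)                    # Nhits / Ngrammarspace
    C1 = sum(counts[w] for w in hits) / len(tokens)   # token (total) coverage
    return C, E, C1

# Toy example: every token is generated, so C = C1 = 1.0
tokens = ["AB", "AB", "AB", "ABC", "BC", "AC", "BA", "BAB"]
print(grammar_metrics([["B", None], ["AB", "B", "A"], ["C", None]], tokens))
```

Adding a rare, non-generated word type to `tokens` drops C noticeably but C1 only slightly, which is exactly the asymmetry described above.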
(03-12-2024, 07:08 PM)Juan_Sali Wrote: I would add another variable, the total coverage C1: the sum of all instances of every word type the grammar generates, divided by the total number of word tokens in the text.

Yes, I think I understand what you mean by 'C1'. I agree that a grammar is better the more it generates the word types that are most frequent in the text. I used the rank of the 'first missed word' (in effect, a proxy for C1) as an additional datum in my comparisons in the first posts. So C1 (or the rank of the first missed word) is indeed very interesting. But it doesn't help in solving the problem of the "One slot * All the word types" degenerate grammar (C1 = 100% in that case), which is fundamental. But for further refinements, sure.

(03-12-2024, 07:08 PM)Juan_Sali Wrote: Some words have few instances, and reducing the size of the slot grammar may reduce E significantly with only a small decrease in C1.

Yes, but this is akin to what I referred to as "increasing efficiency... by judiciously decreasing the coverage". I.e., by dropping 'cfh' from a grammar the coverage C will decrease just a little, while efficiency will (probably) increase proportionally more, and the same holds for the 'C1' coverage (which will decrease even less). But I find this approach (and the other approach that can easily be used to pump up efficiency: defining more symbols) to be of little use, just a pointless exercise in optimization.

(03-12-2024, 07:08 PM)Juan_Sali Wrote: It is important to search for the maximum number of total words generated with the minimum possible grammar size.

I agree on the importance of efficiency, but the whole point of my post was to show that using coverage and efficiency is as flawed as using coverage alone (and consequently the 'F1 score' is fundamentally flawed too, for just this reason. It also has another big problem, but that's another matter, less fundamental).
One can of course do several things, and as usual it is impractical to try everything.
Giving more weight to frequent words in the score can be achieved by not counting the number of word types, but word tokens. So if a word occurs 200 times, count it as 200.

Alternatively, one can leave words that appear only once out of consideration altogether.

One could make refinements using conditional slots. Example: a certain slot can only be used if a certain other slot is also used. Or the opposite: a certain slot cannot be used if a certain other slot is also used. This immediately makes it considerably more complicated. But I have a feeling that this would improve the success.
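One cheap way to prototype such conditional slots is to generate from the plain slot grammar and then filter the combinations through predicates, one per condition. A small sketch (the glyph-like elements and the specific condition are invented for illustration):

```python
from itertools import product

# Plain slots; None = slot left empty
slots = [["q", None], ["o", "a"], ["iin", "y", None]]

# Conditions over a chosen combination c (one entry per slot):
# here, slot 3 may only be filled if slot 1 is also filled.
conditions = [lambda c: c[2] is None or c[0] is not None]

words = sorted({"".join(e for e in combo if e)
                for combo in product(*slots)
                if all(cond(combo) for cond in conditions)})
print(words)   # 8 of the 12 unconditioned strings survive
```

The grammar space shrinks from 12 to 8 strings, so the same conditional mechanism that complicates the description can also raise efficiency, which matches the feeling expressed above.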

The real mystery is, IMHO, that all this is not supposed to exist.
Which language exhibits a slot structure?

There are examples in other areas, e.g. the trivial case of numbers. You just need four (identical) slots to cover the numbers 1 - 9999. Trivial of course, but it becomes more interesting with the Greek or Arabic numbering systems that use letters of the alphabet.
Even more interesting are Roman numerals, which would benefit from the conditional slots mentioned above.
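As a side note, at digit granularity the Roman system actually fits a plain slot grammar with perfect efficiency: if each decimal place contributes one slot whose elements are whole digit chunks (including the subtractive forms IV, IX, XL, ...), no conditions are needed; the conditional machinery only becomes necessary when single letters are taken as the elements. A quick sketch of this observation:

```python
from itertools import product

# Roman numerals 1..3999 as a pure four-slot grammar ("" = slot unused)
thousands = ["", "M", "MM", "MMM"]
hundreds  = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]
tens      = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
units     = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]

numerals = {"".join(c) for c in product(thousands, hundreds, tens, units)}
numerals.discard("")      # drop the all-empty combination

print(len(numerals))      # 3999: every combination is a distinct valid numeral
```

Coverage and efficiency are both 100% here, which makes number systems a useful sanity check for any grammar-scoring formula.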

The one thing I am reasonably certain about is that the MS text was not created by someone setting up a slot system and actually using it to generate the words.
It should have been something else that ended up being equivalent.

This works both for meaningful and meaningless text.
(03-12-2024, 06:01 PM)Mauro Wrote:
(03-12-2024, 04:38 PM)oshfdk Wrote: Suppose we have the following set of words: AB, ABC, BC, AC, BA, BAB. Let's say I'm trying to analyze it using a slot grammar. The degenerate grammars that you mentioned are [A|B|C|∅] [A|B|C|∅] [A|B|C|∅] (any letter in each slot), or [AB|ABC|BC|AC|BA|BAB] (all words in a single slot: 6 elements in 1 slot). These are not interesting; they won't tell us anything. There are not that many non-trivial slot grammars that we can have for this text.

[B|∅][AB|B|A][C|∅], this one can produce 12 strings: BABC, BAB, BBC, BB, BAC, BA, ABC, AB, BC, B, AC, A; it covers the whole text and has 7 elements in 3 slots.

[BA|A|B][A|B|BC|C], this one can produce 12 strings as well: BAA, BAB, BABC, BAC, AA, AB, ABC, AC, BA, BB, BBC, BC; it covers the whole text and has the same stats as above, but with 7 elements in 2 slots.

Is any one of these grammars better than the other in any way?

The example is right. In this case I would say those two grammars are equivalent in all respects (the number of slots, I think, is irrelevant).

(03-12-2024, 04:38 PM)oshfdk Wrote: Are they better than the trivial 6 elements in 1 slot grammar?

Sure they are: they are not trivial, they actually say something about the structure of the words AB, ABC, etc., using less information.

This is interesting. If the number of slots is irrelevant, then the trivial grammar comes out on top on each of the three metrics: coverage (they all have 100%), efficiency and the number of elements. My intuition is that the number of slots matters, and that grammar [B|∅][AB|B|A][C|∅] is subjectively better than [BA|A|B][A|B|BC|C], because it includes two optional elements, effectively showing a prefix-infix-suffix form (which may or may not correspond to some actual feature of the underlying language, or could just be a statistical fluke), while grammar [BA|A|B][A|B|BC|C] appears to assign some chunks to prefixes and suffixes with no clear path for a language model. E.g., why is BA a prefix and BC a suffix? Note that either grammar could be the true grammar of the original language; it's just that one of them attempts to extract functional elements from the text, and the other doesn't.

If we assume the number of slots is irrelevant, maybe the total combined length of all element strings is relevant? But in my opinion, all this is just guesswork until the more fundamental problem below is addressed.

(03-12-2024, 06:01 PM)Mauro Wrote:
(03-12-2024, 04:38 PM)oshfdk Wrote: [Edit]: I just realized that my main point got diluted in the text, so I'll try to reiterate: as far as I understand, we can only have a satisfactory quantitative metric for the quality of a grammar if it's tied to some quantitative task with respect to the text. If we identify this task, then we can try to work backwards from it to the properties of grammars and find a grammar that solves this task better. However, I'm still in the woods when it comes to understanding the simple question of what this task is, exactly ¯\_(ツ)_/¯

Ah, that's the good question. I'll try to be as schematic as possible:

- The VMS has a word structure which is remarkable and unheard of.
- So it's very probable the word structure has something to do with the method used to write it (be it a meaningful or a meaningless text, it does not matter).
- So getting a better grasp of this structure should help in studying the VMS (and many people have tried to do this since, I don't know exactly, the '60s?). Will this be enough? I very much doubt it, but who knows :)
- It's very difficult to say if a grammar is 'good'. But having a reliable method to rank them according to some factor would help.

Once one has settled on a grammar, the task could be "imagine what the word structure could encode, and how, then try to demonstrate it does". Very, very difficult. Btw, I tried again to 'decode' Voynich as a Latin syllabary (by trying to fit the more common Voynich words to the most common Latin ones). Nada de nada, it does not work at all. But that was just one possibility; the problem is there are too many of them :(

I think this is a very good explanation of what is of interest qualitatively. The problem is, if we are trying to optimize the quantitative (numeric) metrics of the grammars, we need a quantitative metric for the end task. We can try comparing the statistical or structural properties of the predicted and the actual sets. Something like (I'm just throwing in examples randomly): "median length of a grammar word divided by the median length of a text word", "median Levenshtein distance between a grammar word and the nearest text word", "total number of character bigrams predicted by the grammar but absent in the text", "same as the last one, but reformulated as probabilities: the probability of the actual count of a character bigram in the word list given the expected count of that bigram in the grammar, assuming all the outputs of the grammar are equally likely", etc. There are many, many potential metrics of how well the predicted set matches the actual text, but without choosing a specific metric it's really hard to evaluate the grammars.
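To make one of these candidates concrete, here is a sketch of the "bigrams predicted by the grammar but absent in the text" metric, using the toy grammar from earlier in the thread (the function name is mine):

```python
from itertools import product

def phantom_bigram_count(grammar, text_words):
    """Count character bigrams the grammar can produce that never occur
    in the text: one of the candidate mismatch metrics discussed above."""
    generated = {"".join(e for e in c if e) for c in product(*grammar)}
    bigrams = lambda ws: {w[i:i + 2] for w in ws for i in range(len(w) - 1)}
    return len(bigrams(generated) - bigrams(text_words))

words = ["AB", "ABC", "BC", "AC", "BA", "BAB"]
print(phantom_bigram_count([["B", None], ["AB", "B", "A"], ["C", None]], words))
# 1: only the bigram "BB" is producible but unattested
```

Note that the degenerate one-slot grammar trivially scores 0 here too, so a metric like this would have to be combined with efficiency rather than replace it.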

[Edit]: Maybe we can even reframe the metric as a linear algebra problem: given a slot grammar, assign probabilities to each element (so that the probabilities sum to 100% for each slot) in such a way as to minimize the difference between the probabilities of the words of the predicted set (computed as the product of the probabilities of the respective elements taken from each slot, I guess) and the relative counts of the same words in the observed set (computed as the actual word count divided by the total number of tokens, 0 for missing words). This way we try to mimic the actual token distribution of the text. Then compare the grammars, each using its respective optimal element probabilities, with this metric (the difference between the probabilities of the words of the predicted set and the relative counts of the same words in the observed set).
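A rough numerical sketch of this idea (softmax-parametrised slot probabilities fitted by finite-difference gradient descent; all names are mine, and it assumes each slot combination yields a distinct string; duplicates would need their probabilities summed):

```python
import numpy as np
from itertools import product

def fit_probs(grammar, freqs, steps=800, lr=0.2, eps=1e-5, seed=0):
    """Fit per-slot element probabilities so the grammar's word distribution
    matches the observed relative frequencies `freqs` (dict: word -> relative
    token count).  Returns (probabilities per slot, final squared error)."""
    rng = np.random.default_rng(seed)
    logits = [rng.normal(size=len(s)) for s in grammar]
    combos = list(product(*[range(len(s)) for s in grammar]))
    strings = ["".join(e for e in (s[i] for s, i in zip(grammar, c)) if e)
               for c in combos]
    targets = np.array([freqs.get(s, 0.0) for s in strings])

    def evaluate(logits):
        # Softmax per slot guarantees the probabilities sum to 1 in each slot
        probs = [np.exp(l - l.max()) / np.exp(l - l.max()).sum() for l in logits]
        p = np.array([np.prod([probs[k][i] for k, i in enumerate(c)])
                      for c in combos])
        return ((p - targets) ** 2).sum(), probs

    for _ in range(steps):                 # finite-difference gradient descent
        base, _ = evaluate(logits)
        grads = []
        for l in logits:
            g = np.zeros_like(l)
            for j in range(len(l)):
                l[j] += eps
                g[j] = (evaluate(logits)[0] - base) / eps
                l[j] -= eps
            grads.append(g)
        for l, g in zip(logits, grads):
            l -= lr * g

    err, probs = evaluate(logits)
    return probs, err

grammar = [["B", None], ["AB", "B", "A"], ["C", None]]
freqs = {"AB": 3/8, "ABC": 1/8, "BC": 1/8, "AC": 1/8, "BA": 1/8, "BAB": 1/8}
probs, err = fit_probs(grammar, freqs)
print(err)
```

The residual `err` after fitting is then the per-grammar score: a grammar whose product distribution can mimic the observed token distribution more closely gets a lower value.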
(04-12-2024, 12:33 AM)ReneZ Wrote: One can of course do several things, and as usual it is impractical to try everything.

Giving more weight to frequent words in the score can be achieved by not counting the number of word types, but word tokens. So if a word occurs 200 times, count it as 200.



Alternatively, one can leave words that appear only once out of consideration altogether.



One could make refinements using conditional slots. Example: a certain slot can only be used if a certain other slot is also used. Or the opposite: a certain slot cannot be used if a certain other slot is also used. This immediately makes it considerably more complicated. But I have a feeling that this would improve the success.

That's indeed a possible way, but I think it's equivalent to defining more and more complex symbols (i.e.: if an 'iii' forces the subsequent use of the ['n', 'm'] slot, it looks the same as defining 'iiin' and 'iiim' as symbols and using them instead). Another thing one could do is use multiple grammars, i.e. one for 'aiin-like' word chunks and another for 'ee?y-like' chunks (or whatever), then choose one of the available 'templates' before starting to build the word (a slot without symbols, but with 'pointers' to the sub-grammars). I would even have given it a try, if not for the fact that it's painful software work.

As you said: several things, and impractical to try everything.

(04-12-2024, 12:33 AM)ReneZ Wrote: The real mystery is, IMHO, that all this is not supposed to exist.

Which language exhibits a slot structure?

Risking a Dunning-Kruger, I'd say "none that I know of". Only Semitic languages (not that I know much about them!) have something vaguely similar, in the form of tri-consonantal word roots + different infixes/affixes, but the Voynich system looks much more developed and much more rigid. Agglutinating (or better, 'synthetic') languages (though 'synthetic' can create confusion in this context) have a kind of word structure too, but (again from my limited knowledge) it does not look very Voynichese-like to me, and an agglutinating language in late medieval Europe looks far-fetched anyway (with the possible exceptions of Hungarian and Finnish, languages I don't know at all, but which seem to have words much more varied in form than Voynichese). A constructed language (especially a 'philosophical' language) could fit the bill; e.g. Volapük builds words in quite a rigid way. Possible (well, maybe not something as complex as Volapük...), but in all effects equivalent to using a nomenclator cipher: extremely hard to crack without clues external to the transcribed text.


(04-12-2024, 12:33 AM)ReneZ Wrote: There are examples in other areas, e.g. the trivial case of numbers. You just need four (identical) slots to cover the numbers 1 - 9999. Trivial of course, but it becomes more interesting with the Greek or Arabic numbering systems that use letters of the alphabet.

Even more interesting are Roman numerals, which would benefit from the conditional slots mentioned above.

Numbers (or, mainly, numbers + other features) could fit very well. Over the years I actually tried two or three times to 'decode' Voynichese words as Roman numerals, with the usual endless running around in circles to nowhere, but numbers are decidedly a possibility. I'm pretty sure you already know P. Feaster's "Ruminations on the Voynich" article (I just found it yesterday while browsing the VNinja forums hehe), with a lot of information about how numbers (+ features) were written in medieval times.


(04-12-2024, 12:33 AM)ReneZ Wrote: The one thing I am reasonably certain about is that the MS text was not created by someone setting up a slot system and actually using it to generate the words.

It should have been something else that ended up being equivalent.
This works both for meaningful and meaningless text.

Oh yes!
(I removed most of the quotes from previous posts, they were getting unwieldy)

(04-12-2024, 05:04 AM)oshfdk Wrote: This is interesting. If the number of slots is irrelevant, then the trivial grammar comes out on top on each of the three metrics: coverage (they all have 100%), efficiency and the number of elements. My intuition is that the number of slots matters, and that grammar [B|∅][AB|B|A][C|∅] is subjectively better than [BA|A|B][A|B|BC|C], because it includes two optional elements, effectively showing a prefix-infix-suffix form (which may or may not correspond to some actual feature of the underlying language, or could just be a statistical fluke), while grammar [BA|A|B][A|B|BC|C] appears to assign some chunks to prefixes and suffixes with no clear path for a language model. E.g., why is BA a prefix and BC a suffix? Note that either grammar could be the true grammar of the original language; it's just that one of them attempts to extract functional elements from the text, and the other doesn't.

If we assume the number of slots is irrelevant, maybe the total combined length of all element strings is relevant? But in my opinion, all this is just guesswork until the more fundamental problem below is addressed.

I doubt the number of slots by itself is a useful metric, also because this number already enters into the calculation of Ngrammarspace (a series of multiplications over the slots) and eventually of Ncharset (a series of sums over the slots). And the degenerate "1*WS" grammar has just one slot, so it beats any other on this count. But I'm not 100% certain; I didn't go deep enough into this.


(04-12-2024, 05:04 AM)oshfdk Wrote: The problem is, if we are trying to optimize the quantitative (numeric) metrics of the grammars, we need a quantitative metric for the end task. We can try comparing the statistical or structural properties of the predicted and the actual sets. Something like (I'm just throwing in examples randomly): "median length of a grammar word divided by the median length of a text word", "median Levenshtein distance between a grammar word and the nearest text word", "total number of character bigrams predicted by the grammar but absent in the text", "same as the last one, but reformulated as probabilities: the probability of the actual count of a character bigram in the word list given the expected count of that bigram in the grammar, assuming all the outputs of the grammar are equally likely", etc. There are many, many potential metrics of how well the predicted set matches the actual text, but without choosing a specific metric it's really hard to evaluate the grammars.



[Edit]: Maybe we can even reframe the metric as a linear algebra problem: given a slot grammar, assign probabilities to each element (so that the probabilities sum to 100% for each slot) in such a way as to minimize the difference between the probabilities of the words of the predicted set (computed as the product of the probabilities of the respective elements taken from each slot, I guess) and the relative counts of the same words in the observed set (computed as the actual word count divided by the total number of tokens, 0 for missing words). This way we try to mimic the actual token distribution of the text. Then compare the grammars, each using its respective optimal element probabilities, with this metric (the difference between the probabilities of the words of the predicted set and the relative counts of the same words in the observed set).

Those are all good suggestions, thank you! But I need time to think about each, will see what comes out of it.

And of course I agree with you that, even if the 'evaluation problem' I posed is solved, it's still not enough to say whether a grammar is actually 'good' (and hopefully useful). But it's also true that it's difficult to make progress if the basic questions are not answered first, and this was my aim: try to put the evaluation process on firmer basic mathematical grounds, then see if further progress can be made. This may not even be the best way to proceed, I don't know; I'm just trying what comes to my mind (and stands up to basic sanity checks :) ).

I also take the chance to compliment you on your work on the marginalia (I just saw it yesterday; I have a lot of threads to read xD). I wish I had 1/100th of your skills and knowledge in paleography and image processing!! And of course, clues external to the bare transcribed text could be (and are) very, very useful.
(04-12-2024, 10:42 AM)Mauro Wrote: (I removed most of the quotes from previous posts, they were getting unwieldy)

To reply, you don't have to repeat the entire message, just select the key phrase.
(04-12-2024, 10:42 AM)Mauro Wrote: I also take the chance to compliment you on your work on the marginalia (I just saw it yesterday; I have a lot of threads to read xD). I wish I had 1/100th of your skills and knowledge in paleography and image processing!! And of course, clues external to the bare transcribed text could be (and are) very, very useful.

Thank you so much, but I think there is some misunderstanding: while I do have some skills in automated image processing, paleography is definitely not on my CV. I'm not even sure what paleography is, exactly; I assume it has something to do with history and images. With the marginalia I only tried to enhance the multispectral images and accidentally stumbled upon a strange feature at the bottom right of f116v; that's my only contribution, I think. My main area of interest is the text, which is why I'm so curious about the results you have shared and the general methodology.