The Voynich Ninja

Full Version: A family of grammars for Voynichese
Hello Mauro,

If I understand correctly your "chunkification" process, the pattern matching is greedy: this is the first of the three possibilities in the example that I posted earlier on the first page of this thread, and I don't see why it is a better choice than the other two:

(28-11-2024, 10:10 AM)nablator Wrote: If some words are indeed "separable" (as suggested by M. Zattera's article), maximizing the F1 score or any metric is futile: the model is not meant to match "separable" words. Fitting the model to data containing "separable" words is probably what creates the wrap-around effect in the MZ slot sequence: slots 7-11 have 6 glyphs in common with slots 0-2, resulting in multiple possibilities of separation: for example sheolkeedy (in f115v) could be, according to Massimiliano Zattera's 12-slot sequence:
sheol keedy
sheo lkeedy
she olkeedy
These 6 words exist in the VM: it is unclear where the space should be inserted. There could be a rule, for example, to choose the first of the three because it maximizes the length of the first word, but we don't know.

Maybe a non-ambiguous slot sequence, allowing a unique re-parsing from a space-less transliteration, would better capture the partial ordering principle at work in the VM.
(09-12-2024, 11:27 AM)oshfdk Wrote: Thank you for your effort! Probably I'm missing some context or some important background idea; I still don't understand how and why this metric works. To me it seems that any grammar with 100% coverage will reduce to the 1*WS grammar after chunkification, because it should be possible to match any word to the grammar, producing a single chunk.

Yes, by repeated chunkification every grammar (with or without 100% coverage) is ultimately reduced to the trivial 1*WS grammar. But this does not mean the method does not work. Example:

We want to evaluate grammar X.

We chunkify grammar X: X' = Chunkify(X)
We calculate Nchunktypes(X'): the lower this number is, the better the original grammar X is.


Now we can go on and evaluate also grammar X'

We chunkify grammar X': X'' = Chunkify(X')
We calculate Nchunktypes(X''): if it's lower than Nchunktypes(X'), then we have found an even better grammar than X, namely X'. If it's higher, we can stop here; but if we go on...

... we ultimately reach the trivial 1*WS grammar, at the bottom of the scoring list.


If X = the trivial LL*WS grammar, the process is very fast:

We chunkify grammar LL*WS: LL*WS' = Chunkify(LL*WS) = 1*WS (!)

And we have already hit the bottom.
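In pseudo-Python, the iterative scoring procedure above could be sketched like this (chunkify and nchunktypes are hypothetical stand-ins passed in as functions, not the actual implementation, so this is only a sketch of the control flow):

Code:
def score_chain(grammar, corpus, chunkify, nchunktypes, max_rounds=10):
    """Repeatedly chunkify a grammar, keeping the one with the lowest score.

    The score of a grammar X is Nchunktypes(Chunkify(X)): the number of
    distinct chunk types its chunkified form X' contains.
    """
    best_grammar, best_score = None, float("inf")
    current = grammar
    for _ in range(max_rounds):
        chunked = chunkify(current, corpus)    # X' = Chunkify(X)
        score = nchunktypes(chunked)           # scores the grammar 'current'
        if score >= best_score:
            break                              # no improvement: we can stop here
        best_grammar, best_score = current, score
        current = chunked                      # go on and evaluate X' next
    return best_grammar, best_score

Starting from LL*WS the loop would stop almost immediately, since Chunkify(LL*WS) is already the trivial 1*WS.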
(09-12-2024, 11:50 AM)nablator Wrote: Hello Mauro,

If I understand correctly your "chunkification" process, the pattern matching is greedy: this is the first of the three possibilities in the example that I posted earlier on the first page of this thread, and I don't see why it is a better choice than the other two:

(28-11-2024, 10:10 AM)nablator Wrote: If some words are indeed "separable" (as suggested by M. Zattera's article), maximizing the F1 score or any metric is futile: the model is not meant to match "separable" words. Fitting the model to data containing "separable" words is probably what creates the wrap-around effect in the MZ slot sequence: slots 7-11 have 6 glyphs in common with slots 0-2, resulting in multiple possibilities of separation: for example sheolkeedy (in f115v) could be, according to Massimiliano Zattera's 12-slot sequence:
sheol keedy
sheo lkeedy
she olkeedy
These 6 words exist in the VM: it is unclear where the space should be inserted. There could be a rule, for example, to choose the first of the three because it maximizes the length of the first word, but we don't know.

Maybe a non-ambiguous slot sequence, allowing a unique re-parsing from a space-less transliteration, would better capture the partial ordering principle at work in the VM.

You are perfectly right: the pattern-matching algorithm is 'greedy'; it will stop at the first match it finds. Actually the algorithm is more sophisticated and I could go on and find all the possible alternative matches too, but I think this would not be very useful. It's one of those things which would be very easy to explain face-to-face just by scribbling on a piece of paper, but it's not easy to explain in written form. I'll try my best.

It does not matter because, after all, the grammar structure uniquely determines the chunk structure (even if there may be many possible alternative 'chunkings'). We could easily re-arrange the grammar to get the alternative chunkings: for instance, in LOOP-4 all the 'consonants' [d, k, l, r...] etc. are at the end, so a word such as 'daiin' is chunked as [d] + [aiin]. But I could just put that slot at the beginning, and 'daiin' would become [daii] + [n].

So, what discriminates between the two alternative forms of LOOP-4, that is to say, between the different possible chunkings? It's the total number of chunks, Nchunktypes, found by chunkifying each grammar: I actually made a test with the 'consonants' at the beginning (just past the [q, ch, sh, y] slot) and it finds 609 Nchunktypes, so it's slightly worse than the standard LOOP-4 at 606 (but a very viable grammar nonetheless, I guess). I also tested with 'consonants' both at the beginning and the end and it resulted in ~675 chunks (I don't remember exactly), decidedly worse. That's why in the end I settled for the standard LOOP-4 of post #53.
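To make the 'slot order determines the chunking' point concrete, here is a toy example (the slots below are invented and much simpler than LOOP-4; the matcher is greedy, taking in each pass the longest entry of each slot that fits):

Code:
def chunk_word(word, slots):
    """Split a word into chunks, one chunk per greedy pass through the slots."""
    chunks, i = [], 0
    while i < len(word):
        start = i
        for slot in slots:                                # one pass = one chunk
            for glyph in sorted(slot, key=len, reverse=True):
                if word.startswith(glyph, i):
                    i += len(glyph)
                    break
        if i == start:                                    # nothing matched: word not covered
            return None
        chunks.append(word[start:i])
    return chunks

consonants = ["d", "k", "l", "r", "t"]
core = [["q"], ["o", "a"], ["i", "ii", "iii"], ["n", "in", "iin"], ["y"]]

print(chunk_word("daiin", core + [consonants]))   # ['d', 'aiin']: 'consonants' at the end
print(chunk_word("daiin", [consonants] + core))   # ['daiin']: 'consonants' at the beginning

Both orderings cover the word; what decides between them, over the whole corpus, is which one produces the smaller chunk inventory (Nchunktypes), e.g. the 606 vs 609 figures above.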


Addendum. About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view, it's not that defining a word as 'separable' excuses the original grammar for not having found it :)
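A tiny illustration of the 'just duplicate the grammar' point, with invented slots (none of the real grammars): one pass fails on sheolkeedy, while two passes, i.e. the duplicated grammar, cover it as sheol + keedy:

Code:
def covers(word, slots, passes):
    """Greedy check: can `word` be fully consumed in at most `passes` passes?"""
    i = 0
    for _ in range(passes):
        for slot in slots:
            for glyph in slot:
                if word.startswith(glyph, i):
                    i += len(glyph)
                    break
    return i == len(word)

toy = [["sh", "ch"], ["e"], ["o"], ["l", "k"], ["ee"], ["d"], ["y"]]
print(covers("sheolkeedy", toy, passes=1))   # False: one pass only reaches 'sheol'
print(covers("sheolkeedy", toy, passes=2))   # True:  'sheol' + 'keedy'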
(09-12-2024, 01:58 PM)Mauro Wrote: You are perfectly right: the pattern-matching algorithm is 'greedy'; it will stop at the first match it finds. Actually the algorithm is more sophisticated and I could go on and find all the possible alternative matches too, but I think this would not be very useful. It's one of those things which would be very easy to explain face-to-face just by scribbling on a piece of paper, but it's not easy to explain in written form. I'll try my best.

It does not matter because, after all, the grammar structure uniquely determines the chunk structure (even if there may be many possible alternative 'chunkings'). We could easily re-arrange the grammar to get the alternative chunkings: for instance, in LOOP-4 all the 'consonants' [d, k, l, r...] etc. are at the end, so a word such as 'daiin' is chunked as [d] + [aiin]. But I could just put that slot at the beginning, and 'daiin' would become [daii] + [n].

So, what discriminates between the two alternative forms of LOOP-4? It's the total number of chunks, Nchunktypes, found by chunkifying each grammar: I actually made a test with the 'consonants' at the beginning (just past the [q, ch, sh, y] slot) and it finds 609 Nchunktypes, so it's slightly worse than the standard LOOP-4 at 606 (but a very viable grammar nonetheless, I guess). I also tested with 'consonants' both at the beginning and the end and it resulted in ~675 chunks (I don't remember exactly).

About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view, it's not that defining a word as 'separable' excuses the original grammar for not having found it :)

After reading this reply I realized that the (loop) grammar matching mechanics are more complicated than I thought, which probably explains my persistent confusion. I cannot really understand your results so far, but I'm looking forward to new updates, especially those a bit less technical in nature :)
(09-12-2024, 02:47 PM)oshfdk Wrote:

After reading this reply I realized that the (loop) grammar matching mechanics are more complicated than I thought, which probably explains my persistent confusion. I cannot really understand your results so far, but I'm looking forward to new updates, especially those a bit less technical in nature :)

Heheh yeah, I tried to simplify the mechanics as much as possible for my presentation, but indeed the repeating loop is one of the problems. Another problem is that the character strings inside the slots have variable lengths, and yet another is that the strings can overlap, e.g. "lk" overlaps both "l" and "k", which gives rise to many possible different chunkification paths (e.g.: is 'lkaiin' lk+aiin or l+k+aiin?).
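A rough sketch of why the overlaps multiply the chunkification paths (again with invented slot contents, chosen only to reproduce the lk / l+k ambiguity):

Code:
def all_chunkings(word, slots):
    """Enumerate every way to consume `word` by repeated passes through `slots`."""
    def one_pass(i, slot_idx, taken):
        if taken:                                   # a pass may stop here, closing a chunk...
            yield i, taken
        for s in range(slot_idx, len(slots)):       # ...or keep consuming glyphs
            for glyph in slots[s]:
                if word.startswith(glyph, i):
                    yield from one_pass(i + len(glyph), s + 1, taken + glyph)

    def walk(i, chunks):
        if i == len(word):
            yield chunks
            return
        for j, chunk in one_pass(i, 0, ""):
            yield from walk(j, chunks + [chunk])

    return list(walk(0, []))

slots = [["l", "lk", "k"], ["a"], ["iin"]]
for parse in all_chunkings("lkaiin", slots):
    print(parse)    # includes ['lk', 'aiin'] and ['l', 'k', 'aiin'] among others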

But don't worry, it all works as intended, and I think the best proof of it (beyond the Excel tables I posted) is that it can then generate a pseudo-Voynich text which closely resembles the original (indeed the main purpose of writing pseudo-Voynich was to verify that everything worked properly, and it's also cool xD).

So I'm sorry if you are still confused; I very much understand that my reasoning is easy to follow for me, but not so much for people who have never heard about it (especially if they're not used to writing code, and with my convoluted English on top). So I thank you (and all the others) a lot for taking the time to go through my writings and trying to understand them. I actually made a lot of progress by reasoning on what you (and others) have said; e.g. the key idea of the 1*WS grammar came to my mind after reading an answer here.

I really don't know what else I could report, unfortunately. My grammar project looks to be 'finished' at the moment:
  1. I think I found a sound way to score grammars, which was previously lacking (efficiency, the previous standard, being unsound, as I demonstrated with 1*WS). This is actually my main result.
  2. And I think the LOOP grammar, or a variation of it, can possibly be the best one and actually be the 'true' structure underlying Voynich words, given it scores so well against all others on the Nchunktypes metric. At the very least, it's the best grammar available at this time.

But any questions you may have, just ask!

Now of course I still have to get my two results accepted, and this is another hard part :D , and maybe someone will come out and point to a fatal flaw in my reasoning, bursting everything. But I'm rather happy anyway! Whether something more about the Voynich can be learned by building on these results remains to be seen, but I think that the problem of elucidating the structure of Voynich words is essentially solved at this point, or on the brink of it.
I am also still trying to fully understand...

Taking this line as a starting point:

(06-12-2024, 09:53 AM)Mauro Wrote: Only 58 words (out of 8031) are missed and they all have more than 4 chunks, e.g. ‘tchodypodar’ would be [t][chod][yp][od][ar]. Just for colour, ‘fachys’ is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.

One risk I see is overmodelling. It is probably sufficient, even better (subjectively), not to aim for more than, say, 98% coverage. There will be errors in the text. We don't know how many and where, but the high number of hapax legomena is also a hint in that direction.
The major issue of word spaces is largely covered by a looped slot system, which is probably one of the most attractive aspects of it.

The word tchodypodar is a tricky one. It is very long, compared to average word length, but all its transitions are common. Now it 'feels' wrong that this should have as many as 5 chunks, and it also 'feels' wrong that the common bigram 'dy' is split into two different chunks. The word 'fachys' is nice and short but highly irregular. It is the type of word that I would discard when setting up a model.

On a side note, the first character of each paragraph could be left out of consideration for the known reasons, but that would change 'fachys' into 'achys' which is only a bit better.

However, my main question is: how does your asemic text relate to the slot model? Is it generated based entirely on it?
(09-12-2024, 01:58 PM)Mauro Wrote: Addendum. About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view, it's not that defining a word as 'separable' excuses the original grammar for not having found it :)

Yes, but when it's too easy to match anything, the slot sequence can't be optimized. There must be some rule or limit to prevent overmodelling (as ReneZ said), otherwise it's more a certainty than a risk. I think the MZ 12-slot sequence is already too long: it includes a limited wrap-around effect that can be extended by allowing full loops. Allowing 3 or even 4 loops improves the coverage, but I don't see the point.
(10-12-2024, 01:31 AM)ReneZ Wrote: I am also still trying to fully understand...

Thank you, René. One thing which does not help is that the whole exposition, at the moment, is spread over multiple posts, with changing and evolving ideas on top of that. I could (indeed I should) re-write everything from scratch in a single, coherent document (which might possibly go into a new thread): my prose won't get any better, I fear, but at least the reading will be simpler.

But the last thing I want is to appear as an insistent and annoying guy constantly peddling his pet theory... what's your opinion?

(10-12-2024, 01:31 AM)ReneZ Wrote: Taking this line as a starting point:

(06-12-2024, 09:53 AM)Mauro Wrote: Only 58 words (out of 8031) are missed and they all have more than 4 chunks, e.g. ‘tchodypodar’ would be [t][chod][yp][od][ar]. Just for colour, ‘fachys’ is chunkified as [f][a][ch][ys] and is among the valid word types found by the grammar.

One risk I see is overmodelling. It is probably sufficient, even better (subjectively), not to aim for more than, say, 98% coverage. There will be errors in the text. We don't know how many and where, but the high number of hapax legomena is also a hint in that direction.

The major issue of word spaces is largely covered by a looped slot system, which is probably one of the most attractive aspects of it.

The word tchodypodar is a tricky one. It is very long, compared to average word length, but all its transitions are common. Now it 'feels' wrong that this should have as many as 5 chunks, and it also 'feels' wrong that the common bigram 'dy' is split into two different chunks. The word 'fachys' is nice and short but highly irregular. It is the type of word that I would discard when setting up a model.

On a side note, the first character of each paragraph could be left out of consideration for the known reasons, but that would change 'fachys' into 'achys' which is only a bit better.

Yes, chasing coverage for its own sake can be problematic, overmodelling is always a risk, and yes, errors (transcription errors, interpretation of spaces, etc.) are very much expected to be present in the transcription and they will skew/alter the results.

Coverage: it's not that I chased coverage. At the beginning yes, of course, but once I hit upon the LOOP grammar, coverage became, effortlessly, very high. And indeed SLOT and ThomasCoon's grammars reach a very high coverage too when the 'loop mechanism' is applied to them by repeating them twice, as I showed (and the whole problem of 'separable' words melts away). So I think that at least the basic idea of 'looping grammars' is probably sound.

Errors/transcription errors/interpretation of spaces: everything you said is true, and of course it influences the results. In the past I've always been very willing to dismiss exceptional words as flukes or errors (and to dismiss 'separable' words too as just being two words joined together). Then in my last bout of Voynich mania I decided to go the opposite way: to take the whole text at face value, without invoking any exceptions, and see what would happen. Exceptions could always be added later if needed. When I came to the first rough versions of LOOP (not yet thought of as a loop at that time!), I was very much surprised to find that almost every word stopped being 'weird' or 'separable' or 'exceptional' (this happens with SLOT and Coon's grammars too, just by looping them twice), so I stopped worrying about textual errors. I don't need them for LOOP (or SLOT x2 or Coon x2) to work, and in any case they can always be factored in later (and will always simplify the results). Details may change, but I doubt the basics will.

Overmodelling and, I would add, overfitting: I am indeed worried that what I did, after all, is something that works on every possible text of any kind (provided one has first defined a reasonable grammar to use), so the results obtained on the VMS could just be some generic property of my software rather than giving true insights into the VMS. I made a sanity check (albeit preliminary and with many caveats) comparing the results with the syllabification of natural languages (briefly reported in post #53 iirc), with encouraging results. But yeah, overmodelling/overfitting is a big, possibly fatal, risk with my model. But I also think the metric based on Nchunktypes (or Nchunktokens) is very interesting: after all, it gives a way to reduce a text to its basic 'atoms' (the chunks), and then to compare the results of different ways of 'atomizing' the text to see which one is more 'compact' (needs the least information to be described). And getting a good score on this metric is encouraging for LOOP.

As a side note: I said before I was pretty sure that the 'chunkification' and its associated metric had already been 'discovered' before, in a different form, in some branch of mathematics. Now I realize that what I did is actually a form of data compression, similar to 'zipping' a file in order to reduce its size (with lower sizes being better). I'm not sure what I can get from this consideration, if anything, nor whether it's good or bad for me, but it could be useful.
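Just to make the 'zipping' analogy concrete, a back-of-the-envelope illustration (not the actual metric): once the text is reduced to a stream of chunks, a Shannon-style estimate gives the bits needed to encode that stream. A fair comparison between different 'atomizations' would also have to charge for the chunk inventory itself, which is roughly what Nchunktypes keeps in check:

Code:
import math
from collections import Counter

def estimated_bits(chunk_stream):
    """Entropy-based estimate (in bits) of the cost of encoding a chunk stream."""
    counts = Counter(chunk_stream)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

# Two invented 'atomizations' of the same tiny text:
fine   = ["d", "aiin", "d", "ar", "ch", "ol", "d", "aiin"]   # many small chunks
coarse = ["daiin", "dar", "chol", "daiin"]                   # fewer, larger chunks
print(estimated_bits(fine), estimated_bits(coarse))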

(10-12-2024, 01:31 AM)ReneZ Wrote: However, my main question is: how does your asemic text relate to the slot model? Is it generated based entirely on it?

It's entirely based on the slot model plus 'bi-chunk' frequencies (analogous to bigrams). It works like this:

1) The VMS word types are divided into chunks following the grammar. E.g., with LOOP-4: 'daiin' = [d] + [aiin]
2) I calculate the 'chunkified' grammar, just by putting every chunk found into the corresponding slot. E.g.: [d] goes in slot 1 of the chunkified grammar, [aiin] goes in slot 2.
3) I calculate all the 'bi-chunk' probabilities, e.g. P(STARTOFWORD followed by [d]) = xx; P([d in 1st slot] followed by [aiin]) = yy; etc.
4) These probabilities give the transitions of the Markov chain, while the chunks are the nodes. E.g.: node START; the random generator chooses a chunk in the first slot of the chunkified grammar as the next node, based on its relative frequency; let's say it was [d]. Next slot: the random generator chooses [aiin], based on the probability of finding an [aiin] after a [d] in the first slot. Next slot: the random generator chooses [END], based on the probability of finding [END] after an [aiin] in the second slot (see the sketch below).
All these steps can be followed 'easily' in the Excel files I posted in post #54: first the 'Chunks' sheet, then the 'Chunkified' sheet. (Don't mind the chunk 'categories' columns in the first sheet, the X__X__X: that was an idea which I think is cool but which I did not develop further. They are not actually used anywhere, even if the total number of chunks, Nchunktypes, is written at the bottom of this table.)
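A hypothetical sketch of steps 3) and 4): the transition table below is filled with invented toy frequencies, not the ones actually extracted from the VMS, and it only illustrates the mechanism.

Code:
import random

# (chunk, slot position) -> frequencies of the next chunk; toy numbers only
transitions = {
    ("START", 0):  {"d": 0.5, "ch": 0.3, "q": 0.2},
    ("d", 1):      {"aiin": 0.6, "ar": 0.3, "END": 0.1},
    ("ch", 1):     {"ol": 0.7, "END": 0.3},
    ("q", 1):      {"okeedy": 1.0},
    ("aiin", 2):   {"END": 1.0},
    ("ar", 2):     {"END": 1.0},
    ("ol", 2):     {"END": 1.0},
    ("okeedy", 2): {"END": 1.0},
}

def generate_word(transitions):
    """Walk the chain from START, choosing each next chunk by its frequency."""
    word, prev, pos = "", "START", 0
    while True:
        options = transitions[(prev, pos)]
        nxt = random.choices(list(options), weights=list(options.values()))[0]
        if nxt == "END":
            return word
        word += nxt
        prev, pos = nxt, pos + 1

print([generate_word(transitions) for _ in range(5)])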

*** But I want to add a very important thing: Asemic Voynich is rather cool, but its primary purpose was to verify that all the software steps work properly. It adds something to the discussion, because it's nice to be able to see where the VMS differs from an (optimized, and possibly overfitted) random process, and maybe learn something from that, but it's not the main point.

And yet another side note... In a sense, Asemic Voynich is a 'trick' conceptually similar to what Zattera did with his SLOT MACHINE grammar: it superimposes a structure (a deterministic state machine in the case of Zattera, a probabilistic frequency table in the case of Asemic Voynich) on an underlying slot grammar, thus constraining (deterministically or probabilistically) the paths it can take. What do I make of this? I don't know at the moment :)
(10-12-2024, 10:10 AM)nablator Wrote:
(09-12-2024, 01:58 PM)Mauro Wrote: Addendum. About 'separable' words: I don't think the concept is even meaningful in the context of slot grammars: just duplicate any grammar and it will find all the separable words it had been missing before. Seen from another point of view, it's not that defining a word as 'separable' excuses the original grammar for not having found it :)

Yes, but when it's too easy to match anything, the slot sequence can't be optimized. There must be some rule or limit to prevent overmodelling (as ReneZ said), otherwise it's more a certainty than a risk. I think the MZ 12-slot sequence is already too long: it includes a limited wrap-around effect that can be extended by allowing full loops. Allowing 3 or even 4 loops improves the coverage, but I don't see the point.

I 100% agree with you (even if I'd say 2 loops are definitely enough to greatly improve MZ's SLOT grammar to ~100% coverage).

That's the whole point of the Nchunktypes (or Nchunktokens) metric: to check whether the grammar is trivial-ish, because it matches anything too easily, or whether it is actually non-trivial and useful. Is this the 'right' metric to use? Well, it avoids the pitfalls where efficiency and the F1 score fail, but I don't know. I feel it might be the right one, but of course I do... I 'invented' it... that's why I'm presenting it to the community for criticism.
Happy new year everybody!

I have published my work on slot grammars on Academia.edu

[link]


It's mirrored here:

[link]


It has changed a good deal from what I presented here; this is the abstract:

I review the metrics used to evaluate slot grammars, showing the normally used efficiency and F1 score to be fundamentally flawed. After introducing the concept of ‘looping’ slot grammars, a generalization of standard grammars, I show how any grammar can be used in a distinctive lossless data compression algorithm to generate a compressed Voynich Manuscript text. This allows the definition of a new metric free from fundamental flaws: Nbits, the total number of bits needed to store the compressed text. I then compare published state-of-the-art grammars and the newly introduced LOOP-L grammar class using the Nbits metric.

Thanks to all who will read it; as usual, remarks and criticism are welcome.