The Voynich Ninja

Full Version: A family of grammars for Voynichese
Hello everybody. I’m a new member, an Italian with a background in electronics engineering (industrial automation sector, now happily retired). I have dabbled from time to time in the Voynich manuscript, especially from the point of view of statistics and word structure, without ever finding anything useful to report.
 
However, I have recently developed a word grammar or, better, a family of grammars, which I would like to share, together with a comparison with the grammars proposed by ThomasCoon, Zattera and Stolfi. Unfortunately, I have not found a way to format the presentation so that it displays well on the forum, so I have uploaded the .pdf file to Google Drive at this link (sorry for the inconvenience):
 
[Google Drive link]
 
The contents of the (short) presentation are:
 
1) INTRODUCTION: mostly concerned with defining how the grammars are evaluated.
2) “BASIC” GRAMMAR: the simplest one, which makes clear how the underlying structure works. It finds 48.7-50.58% of the Voynichese words (depending on the variant of the grammar), with better efficiency than ThomasCoon's and Zattera's (efficiency = number of Voynich words found / total words the grammar can possibly generate; a small sketch of these metrics follows this list).
3) “COMPACT” GRAMMAR: tweaked for efficiency, it is 3-6 times more efficient than the BASIC grammar, with a small decrease in the words found.
4) “EXTENDED” GRAMMAR: tweaked for coverage, it finds 88.45% of the Voynichese words while remaining simple and reasonably efficient (the coverage rises to 92.79% if words with ‘rare’ characters such as ‘g’ or ‘v’, which are not used by the grammar, are excluded).
5) APPENDIX: a final consideration and a summary of the data.
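
For concreteness, here is a minimal Python sketch of these metrics (the two word sets are tiny made-up toys; the F1 score, which comes up later in the thread, is the harmonic mean of coverage and efficiency, consistent with the figures quoted below):

-----------
# Minimal sketch of the evaluation metrics (hypothetical toy inputs).
# voynich_words: distinct word types in the transliteration.
# generated:     every word the grammar can generate.
voynich_words = {"daiin", "chedy", "qokeedy", "shol"}
generated = {"daiin", "chedy", "qokeedy", "qokeey", "xxxxx"}

found = voynich_words & generated
coverage = len(found) / len(voynich_words)   # share of Voynich words found
efficiency = len(found) / len(generated)     # share of generated words that are real
f1 = 2 * coverage * efficiency / (coverage + efficiency)
print(coverage, efficiency, f1)
-----------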
 
Apologizing in advance for any mistakes I may have made, I hope you’ll find it interesting, and I will gladly read any answers and comments from the community.
This sounds interesting! Checking "the first missed word" is a nice addition too.

About the two "loop" grammars, I would be curious to know how they perform when limited to 1, 2 or 3 iterations. That makes the possible output finite, so they can be compared with the other grammars.
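
For example, with the loop capped at k iterations the number of possible outputs becomes a finite sum; a toy sketch (S, the number of strings one loop body can emit, is a made-up placeholder):

-----------
# Toy sketch: capping a "loop" grammar at k iterations makes its output
# finite, so it can be scored like the fixed grammars. S is a hypothetical
# per-iteration vocabulary size; duplicate strings are ignored.
S = 2000
for k in (1, 2, 3):
    print(k, sum(S**i for i in range(1, k + 1)))
-----------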
Welcome to the forum!

I am not sure what you are comparing and how. Massimiliano Zattera claims an F1 score of 0.27 for his "Simply the Best" grammar "SLOT MACHINE", described by a state machine [link].

The state machine considerably restricts the possibilities that "SLOT" (the plain slot model, F1 = 0.001) offers [link].

I guess your "grammars" are defined by a slot sequence only, without a state machine, like MZ's "SLOT".
(27-11-2024, 04:51 PM)MarcoP Wrote: This sounds interesting! Checking "the first missed word" is a nice addition too.

About the two "loop" grammars, I would be curious to know how they perform when limited to 1, 2 or 3 iterations. That makes the possible output finite, so they can be compared with the other grammars.

Yes, I had thought about this. Another possibility could be to classify each word according to the number of iterations needed.
(27-11-2024, 05:05 PM)nablator Wrote: Welcome to the forum!

I am not sure what you are comparing and how. Massimiliano Zattera claims an F1 score of 0.27 for his "Simply the Best" grammar "SLOT MACHINE", described by a state machine [link].

The state machine considerably restricts the possibilities that "SLOT" (the plain slot model, F1 = 0.001) offers [link].

I guess your "grammars" are defined by a slot sequence only, without a state machine, like MZ's "SLOT".

Thank you for your remarks. I used the grammar which Zattera published in reference [2] of my presentation: M. Zattera, "A new transliteration alphabet brings new evidence of word structure and multiple 'languages' in the Voynich manuscript", 2022 [link]. The values Zattera gives (in Table 2 of his paper) are different from the ones in my presentation, but this is because he used a different transcription, so I re-calculated the results of his grammar with the same transcription I used, to get comparable results (I tried to explain this in my presentation, probably not very clearly). I guess this is the grammar you call MZ's "SLOT" (without MACHINE attached).

I was not aware of Zattera's "Simply the Best" "SLOT MACHINE" grammar that you brought to my attention. At the moment I don't have the time to check your links, so I can't comment yet; tomorrow I will look at them and see what I can say.

And yes, my grammar is stateless.
(27-11-2024, 05:47 PM)Mauro Wrote:
(27-11-2024, 04:51 PM)MarcoP Wrote: This sounds interesting! Checking "the first missed word" is a nice addition too.

About the two "loop" grammars, I would be curious to know how they perform when limited to 1, 2 or 3 iterations. That makes the possible output finite, so they can be compared with the other grammars.

Yes, I had thought about this. Another possibility could be to classify each word according to the number of iterations needed.

I forgot to mention that the idea of having at most three syllables is based on Emma May Smith's word structure [link].
If some words are indeed "separable" (as suggested by M. Zattera's article), maximizing the F1 score or any other metric is futile: the model is not meant to match "separable" words. Fitting the model to data containing "separable" words is probably what creates the wrap-around effect in the MZ slot sequence: slots 7-11 have 6 glyphs in common with slots 0-2, resulting in multiple possible separations. For example, sheolkeedy (in f115v) could be, according to Massimiliano Zattera's 12-slot sequence:
sheol keedy
sheo lkeedy
she olkeedy
These 6 words exist in the VM: it is unclear where the space should be inserted. There could be a rule, for example, to choose the first of the three because it maximizes the length of the first word, but we don't know.
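
A minimal sketch of this ambiguity check (the 12-slot sequence below is my reading of MZ's SLOT, reconstructed from the 23-slot listing quoted later in this thread; every slot is optional and contributes at most one glyph group, in slot order):

-----------
# Enumerate the ambiguous segmentations of "sheolkeedy" under a 12-slot
# sequence (an assumed reading of MZ's SLOT; slots optional, used in order).
SLOTS = [
    ["q", "s", "d"], ["o", "y"], ["l", "r"], ["t", "k", "p", "f"],
    ["ch", "sh"], ["cth", "ckh", "cph", "cfh"], ["e", "ee", "eee"],
    ["s", "d"], ["o", "a"], ["i", "ii", "iii"], ["d", "l", "r", "m", "n"],
    ["y"],
]

def matches(word, slot=0):
    if word == "":
        return True
    if slot == len(SLOTS):
        return False
    if matches(word, slot + 1):              # skip this slot entirely
        return True
    return any(word.startswith(g) and matches(word[len(g):], slot + 1)
               for g in SLOTS[slot])         # or consume one glyph group

word = "sheolkeedy"
for i in range(1, len(word)):
    if matches(word[:i]) and matches(word[i:]):
        print(word[:i], word[i:])   # she olkeedy, sheo lkeedy, sheol keedy
-----------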

Maybe a non-ambiguous slot sequence, allowing a unique re-parsing from a space-less transliteration, would better capture the partial ordering principle at work in the VM.
(27-11-2024, 05:05 PM)nablator Wrote: Welcome to the forum!

I am not sure what you are comparing and how. Massimiliano Zattera claims an F1 score of 0.27 for his "Simply the Best" grammar "SLOT MACHINE", described by a state machine [link].

The state machine considerably restricts the possibilities that "SLOT" (the plain slot model, F1 = 0.001) offers [link].

I guess your "grammars" are defined by a slot sequence only, without a state machine, like MZ's "SLOT".

OK, I think I can now answer your remarks.

First thing: contrary to what I said in my previous answer, I actually did already know about "SLOT MACHINE": it appears in the comparison table in Zattera's article. I paid little attention to it because of its low coverage (only 21.6% according to Zattera's data) and it completely slipped my mind (but it would have been better if I had remembered it!).



What Zattera did was to add, on top of "SLOT", a set of rules (implemented by the state machine) which take advantage of the many regularities of the manuscript (e.g. an 'i' is (almost) always followed by 'n', 'm', 'l', 'r' or 's') to restrict the possibilities of "SLOT". This is logically equivalent to what I did with my grammars, i.e. I added to BASIC-11 the rule "a final 'y' is (almost) never preceded by 'n' or 'm'" to get COMPACT-7, increasing its efficiency. I could have added more rules (like the "i rule" mentioned above) to restrict the possibilities further: I actually experimented with a COMPACT-6 version where that rule was implemented, but I felt that way of proceeding was of little use because, obviously, by adding more and more rules one can reach an arbitrarily high efficiency (a toy sketch of this effect follows below). Put another way: after coverage, efficiency is an important measure, because it is trivial to create a grammar with 100% coverage but ~0% efficiency (just use as many slots as the maximum word length and put every letter in each slot). But efficiency is not an absolute criterion for comparing grammars with similar coverages, because it is also trivially easy to create a grammar with 100% coverage and 100% efficiency (just use a different symbol for each Voynich word and put them all in the first slot).
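
A toy illustration of that effect (the 3-slot grammar below is entirely made up, not BASIC-11; the restriction is the final-'y' rule mentioned above):

-----------
# Toy sketch: a restriction rule shrinks the generable set, so efficiency
# (found / generable) can only go up. Hypothetical 3-slot grammar, not
# BASIC-11; "" marks a skipped slot.
from itertools import product

slots = [["q", "o", ""], ["k", "t", "n", "m", ""], ["y", "dy", ""]]
all_words = {"".join(p) for p in product(*slots)} - {""}
restricted = {w for w in all_words
              if not (w.endswith("y") and len(w) > 1 and w[-2] in "nm")}
print(len(all_words), len(restricted))   # the rule prunes generable words
-----------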

So my opinion is that "SLOT MACHINE" is clever, but it reduces too much the coverage (to a meager 21.6%) and adds too much algorithmic complexity in order to chase a high efficiency target. And reducing coverage and adding complexity to reach an arbitrarily high efficiency can be done, in fact, with any grammar.


I'd also like to add a TL;DR comparison of my grammars vs. SLOT/SLOT MACHINE; see my presentation for all the hard data:

Comparing grammars with similar coverage: BASIC-13 improves on SLOT in both efficiency and coverage. Moreover, BASIC-13 seems to do a better job than SLOT at capturing the structure of Voynich words, going much deeper down the word rank before failing to generate a word (and I think this is important).

A comparison with SLOT MACHINE is rather moot, because its coverage (the primary metric of any grammar) is too low compared with BASIC-13 (or even SLOT). Moreover, every grammar (including mine) can be made arbitrarily efficient by adding more complications (and reducing coverage), but beyond some threshold of complexity this becomes a rather pointless exercise.

I have also demonstrated a grammar, EXTENDED-12, with outstanding coverage (88-92%, depending on how words with 'rare' characters are counted), which far surpasses SLOT. It is surely possible to expand SLOT to increase its coverage, but at the moment no "SLOT-HIGHCOVERAGE" exists to make such a comparison, and it remains to be seen what the efficiency of this hypothetical grammar would be and how it would compare to EXTENDED-12 (though against the only comparable grammars we have, Stolfi's, EXTENDED-12 seems to have even better coverage and orders-of-magnitude better efficiency).
(27-11-2024, 06:57 PM)MarcoP Wrote:
(27-11-2024, 05:47 PM)Mauro Wrote:
(27-11-2024, 04:51 PM)MarcoP Wrote: This sounds interesting! Checking "the first missed word" is a nice addition too.

About the two "loop" grammars, I would be curious to know how they perform when limited to 1, 2 or 3 iterations. That makes the possible output finite, so they can be compared with the other grammars.

Yes, I had thought about this. Another possibility could be to classify each word according to the number of iterations needed.

I forgot to mention that the idea of having at most three syllables is based on Emma May Smith's word structure [link].

Thanks for the link (which I didn't know; I need to add it to my References section). Yes, it seems Emma May Smith was on the same track as me (and a few years earlier!). Unfortunately she did not post a full grammar (there is a table for the 'syllable bodies', but it's not clear what she used for 'onsets' and 'codas'), otherwise I could easily have tested her grammar (also for efficiency, which she did not address) by replicating it in my software. But the basic idea (discarding variations in the actual grammar) is the same.

By the way, my EXTENDED-12 grammar breaks down words similarly to Emma May Smith's: many have one "syllable", a good number have two, a few have three. I have not counted the words according to the number of syllables (there is also a thorny problem: many words have multiple decompositions), nor have I tested with more than 4 syllables, but I think I can wholly confirm her results. Also by the way, the word requiring the most syllables is 'ooooooooolarsr', with a whopping 12 syllables (counted using my EXTENDED-12), which is a lot: the required grammar would be on the order of 10^44 total strings, while an EXTENDED-12 with one more syllable (4 in total) would be around 33*10^12.

Just for reference, I also tested a 'two-syllable' version of Zattera's SLOT (which, referring to a previous post of @nablator, would also be able to find the "separable" words). I actually optimized it a bit for efficiency (removing the final 'y' at the end of the first syllable, which is almost useless as far as coverage goes). These are the results (pasted directly from my software):

-----------
Grammar name: Zattera's SLOT duplicated, 23 slots   (may I call it Zattera-Lanzini SLOT*2? In exchange I could call my grammar the EmmaSmith-Lanzini :) )

Slot 1: q s d
Slot 2: o y
Slot 3: l r
Slot 4: t k p f
Slot 5: ch sh
Slot 6: cth ckh cph cfh
Slot 7: e ee eee
Slot 8: s d
Slot 9: o a
Slot 10: i ii iii
Slot 11: d l r m n
Slot 12: q s d
Slot 13: o y
Slot 14: l r
Slot 15: t k p f
Slot 16: ch sh
Slot 17: cth ckh cph cfh
Slot 18: e ee eee
Slot 19: s d
Slot 20: o a
Slot 21: i ii iii
Slot 22: d l r m n
Slot 23: y

Found 7228 valid words. Grammar can generate 1.088391E+13 total words
Total number of words in the Voynich text: 8078 (7700 excluding words with rare characters)
Coverage = 0.8947759, Efficiency = 6.640995E-10, F1 score = 1.328199E-09
Coverage is 0.9387013 excluding words with rare characters
---------
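
For what it's worth, the "can generate" figure can be reproduced by hand, assuming the count is the raw product of (options + 1) over the 23 slots (the +1 being the empty slot), without de-duplicating equal strings:

-----------
# Sketch: reproducing the 1.088391E+13 figure above, assuming raw slot
# combinations are counted (each slot either empty or one glyph group).
from math import prod

slot_sizes = [3, 2, 2, 4, 2, 4, 3, 2, 2, 3, 5] * 2 + [1]  # sizes from the listing
print(prod(n + 1 for n in slot_sizes))   # 10883911680000 = 1.088391E+13
-----------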

Coverage is excellent, even slightly better than my EXTENDED-12 (which scores 88.45%-92.79%), but the size of the grammar is greatly increased (SLOT-almost-duplicated: 1.08*10^13; EXTENDED-12: 6.12*10^9), and the F1 score greatly decreased (SLOT-almost-duplicated: 1.33*10^-9; EXTENDED-12: 2.33*10^-6). These are the words appearing at least twice in the text which cannot be generated (an excellent result, I'd say at about the same level as EXTENDED-12, which misses 4 words instead of 6, though two of them have three occurrences):

chotols 3 NOTFOUND
choekeey 2 NOTFOUND
otaraldy 2 NOTFOUND
polarar 2 NOTFOUND
chkoldy 2 NOTFOUND
sheoeky 2 NOTFOUND
Interesting, thank you!

At first I was a bit confused because the 'first missed word' did not match mine, but I based my checks on the reference transliteration (RF-1a).

My present thinking (as reflected in the music paper) goes more in the direction of a looped grammar.
I called the result of each loop a 'word chunk'.

However, I do not yet have a good result.

I wondered, after first seeing M. Zattera's work, whether the efficiency figure isn't penalising the results too much.
After all, a perfect word-generation rule should not be expected to exist.

But I also don't have a better suggestion.