The Voynich Ninja

Full Version: A family of grammars for Voynichese
(29-11-2024, 08:45 AM)ReneZ Wrote: Interesting, thank you!

At first I was a bit confused as the 'first missing word' did not match mine, but I based the checks on the reference transliteration (RF-1a).

My present thinking (as reflected in the music paper) goes more into the direction of a looped grammar.
I called the result of each loop a 'word chunk'.

However, I do not yet have a good result.

After first seeing M. Zattera's work, I wondered if the efficiency figure is not penalising the results too much.
After all, a perfect word generation rule should not be expected to exist.

But I also don't have a better suggestion.

First of all, I assume you are René Zandbergen: thank you for being interested in my grammars, and most of all thank you for your wonderful website, which has been a key reference for me for many years. I cannot praise it enough.


I did notice you reported 'chtor' as the first word missed by SLOT; I checked 5 or 6 times to be sure I did not make a mistake in saying it was 'cheky' instead! By the way, 'cheky' can be found by SLOT by adding a "k" in the 11th slot. I have just made a test, and this brings the number of words found to 3405 (while the efficiency is reduced by 6/7). The first missed word then becomes 'sheckhy', which would need a "ckh" added to the 11th slot to be found (and then, to go deeper, I guess also "cth", "ckh", etc.).
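To make the slot mechanism concrete, here is a minimal Python sketch of how membership in a slot grammar can be checked; the slot contents below are made-up placeholders, not the actual SLOT definition, so only the mechanism matters:

Code:
def matches_slot_grammar(word, slots):
    """Can `word` be built by taking at most one element from each slot, in order?"""
    def rec(rest, i):
        if not rest:
            return True              # whole word consumed
        if i == len(slots):
            return False             # slots exhausted, characters left over
        if rec(rest, i + 1):         # option 1: leave this slot empty
            return True
        return any(rest.startswith(e) and rec(rest[len(e):], i + 1)
                   for e in slots[i])  # option 2: consume one element
    return rec(word, 0)

# Toy slots, NOT the real SLOT grammar definition.
slots = [["q"], ["o", "y"], ["ch", "sh"], ["k", "t"], ["e", "ee"], ["d"], ["y"]]
print(matches_slot_grammar("qokeey", slots))   # True with these toy slots

Adding an element to a slot (like the "k" in the 11th slot above) is then a one-line change to the slots list, after which the whole wordlist can be re-run to recount the words found.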

I like 'word chunk': more intuitive than my 'module', much less charged than Emma May Smith's 'syllable'.

Of course I agree with you about the efficiency figure (see my post #8). Efficiency is (mathematically) the main drawback of the LOOP grammar, but if this is compensated by gaining a truly interesting insight into the word structure, it would look much less important.


I also have some preliminary data for the distribution of words vs. chunks, but I have to work a little more on it, because I'm not yet settled on what exactly to put in the 'HEAD', 'LOOP' and 'TAIL' parts of the grammar (and I first need to make my software a little more manageable: at the moment I have to recompile it every time I change a grammar, and now I have so many of them that it's becoming unwieldy). Anyway:

Words with zero chunks: about 0% (they are 'y', 151 occurrences; 'qy', 3 occurrences; 'q' and 'yy', one occurrence each)
Words with 1 chunk: < 10%
Words with 2 chunks: about 50%
Words with 3 chunks: about 40%
Words with 4 or more chunks: < 10%

(voynichese.com transcription, words with 'rare' characters ('g', 'x', 'v', 'z', and 'c', 'h' appearing alone) excluded, 7700 total words remaining)
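For anyone who wants to reproduce this kind of count, chunk counting can be prototyped with a few regular expressions. A sketch with made-up head/chunk/tail patterns (placeholders, not the actual grammar); with these toy patterns 'y' and 'qy' parse as 0 chunks and 'daiin' as 2, consistent with the figures in this thread:

Code:
import re

HEAD_RE  = re.compile(r"^(q)?")                      # hypothetical HEAD
CHUNK_RE = re.compile(                               # hypothetical chunk slots
    r"^(ch|sh)?(k|t|p|f)?(e{1,3})?(o|a)?(i{1,3})?(d|l|r|m|n)?")
TAIL_RE  = re.compile(r"^y?$")                       # hypothetical TAIL

def count_chunks(word):
    """Number of chunks needed to parse `word`, or None if it doesn't parse."""
    rest = word[HEAD_RE.match(word).end():]
    chunks = 0
    while rest and not TAIL_RE.match(rest):
        m = CHUNK_RE.match(rest)
        if m.end() == 0:        # no progress: word not covered by the grammar
            return None
        rest = rest[m.end():]
        chunks += 1
    return chunks

for w in ["y", "qy", "daiin", "qokeey", "chedy"]:
    print(w, count_chunks(w))   # y 0, qy 0, daiin 2, qokeey 2, chedy 1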
(27-11-2024, 04:19 PM)Mauro Wrote:
However, I have recently developed a word grammar or, better, a family of grammars, which I would like to share, together with a comparison with the grammars proposed by ThomasCoon, Zattera and Stolfi.

Looks great! I'd like to better understand the implications of your results wrt statistical properties of Voynichese. Could you publish the wordlist that was used as the basis of the grammar/efficiency computations? Sorry for my ignorance if there is already some "standard" list of words for this task. I assume I could just take the EVA file and split by periods, but then there are many variables to consider: like what to do with ambiguous readings, ligatures, half spaces, weirdos, etc.
(29-11-2024, 10:08 AM)Mauro Wrote: (voynichese.com transcription, words with 'rare' characters ('g', 'x', 'v', 'z', and 'c', 'h' appearing alone) excluded, 7700 total words remaining)

Voynichese.com uses the old Takeshi Takahashi transliteration (1995) without unclear characters (all '?' removed)...

Try the RF1a-n transliteration [link] instead, it's better.
(29-11-2024, 11:31 AM)oshfdk Wrote: Could you publish the wordlist that was used as the basis of the grammar/efficiency computations?

Sure, I can publish the word lists (and the raw outputs of the various grammars). Right now I have them in Excel files; would it be okay if I upload them to Google Drive? (I ask because Excel files are often viewed with suspicion, since they can contain dangerous macros, but there are no macros in these files.)

And surely I can try the RF1a-n transliteration. Just, can you point me to a link? Ideally it should be a single .txt file without any metadata or added remarks (that would save a lot of tedious work).


Well, in the meantime I wrote down some other considerations (a kind of qualitative Bayesian analysis of where "word chunks" could lead). It's just in draft form (and formats badly, sigh), but it may be interesting.

------
IF the LOOP grammar is valid, THEN:
 
Voynichese words can be generated by an algorithm which cycles through a short 'word chunk' slot alphabet, plus, possibly, a header and a final tail. The exact definition of the chunk slot alphabet is ambiguous, because different choices and variations are possible (more on this problem later), but in any case each slot can hold a certain amount of information. For instance:
HEAD = "q" holds one bit of information ("q" or null)
CHUNK SLOT1 = "ch, sh" holds about 1.58 bits ("ch", "sh" or null: log2(3) ≈ 1.58)
And so on.
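In code this is simply log2(n+1), counting the null as one more choice (which also shows the "ch, sh" figure is log2(3) ≈ 1.58 rather than a flat 1.5). A trivial sketch:

Code:
from math import log2

def slot_bits(options):
    # capacity of a slot = its listed options plus the implicit "null" choice
    return log2(len(options) + 1)

print(slot_bits(["q"]))         # 1.0  bit:  "q" or null
print(slot_bits(["ch", "sh"]))  # ~1.58 bits: "ch", "sh" or null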
 
Thus, each Voynich word can be thought of as a sequence of 'fields', each one holding some bits of information.
Note: this reminds me of the control registers in a microprocessor, where a sequence of bit fields is used to specify different functions. For instance (a completely made-up example): bit 0 = serial communication interface enabled/disabled, bits 1-4 = speed of the serial line, bit 5 = parity even/odd, etc.
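To make the analogy concrete, here is a tiny sketch that packs and unpacks exactly that made-up layout (the layout is as invented as the example it mirrors):

Code:
# Made-up register layout from the note above:
#   bit 0    : serial interface enabled/disabled
#   bits 1-4 : speed of the serial line (0-15)
#   bit 5    : parity even/odd
def pack(enabled, speed, parity_odd):
    return (enabled & 1) | ((speed & 0xF) << 1) | ((parity_odd & 1) << 5)

def unpack(reg):
    return reg & 1, (reg >> 1) & 0xF, (reg >> 5) & 1

reg = pack(1, 9, 0)
print(bin(reg), unpack(reg))   # 0b10011 (1, 9, 0): each field reads back independently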
Now, as an aside, I have always been an "agnostic, leaning towards meaningless" about the Voynich text, but the LOOP grammar tilts my 'evidence balance' towards "agnostic, but meaningful more probable than before". So I asked myself:
IF the text is meaningful:
1. What could the fields encode?
2. What are the chances of ever discovering what they actually encode?
One big problem is the ambiguity in the definition of the chunks already noted above. Are there two slots ["e ee eee", "o"], or one slot ["e eo ee eeo eee eeeo"]? This is a terrible complication and, unfortunately, it greatly decreases the chances of finding a solution.
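The ambiguity is easy to see by enumeration: assuming each slot contributes one element or nothing, the two candidate decompositions generate almost the same set of strings, differing only in the bare 'o':

Code:
from itertools import product

def generated(slots):
    """All strings producible by taking one element (or nothing) from each slot."""
    return {"".join(c) for c in product(*[opts + [""] for opts in slots])}

two_slots = [["e", "ee", "eee"], ["o"]]
one_slot  = [["e", "eo", "ee", "eeo", "eee", "eeeo"]]
print(generated(two_slots) - generated(one_slot))   # {'o'}
print(generated(one_slot) - generated(two_slots))   # set()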
The other big problem is, of course, that the encoded information could be any number of different things, and each one could be encoded in a number of possible ways, which decreases the chances of a solution by another large factor.
So my answer to the second question is, unfortunately, that the chances for a solution are very, very (you may add more ‘very’ at your pleasure) low.
Now for the first question: I had a little fun trying to imagine some possibilities. The list, of course, is not exhaustive by any means (imagination is the limit!). And notice: the ordering of the list has no meaning, it's just the order in which things came to my mind.
• It's a syllabary. I actually tried to investigate this a little further, and I even found a reasonable way to convert fields to syllables, but then I realized the most frequent word in the Voynich ('daiin') has 2 chunks, which would be two syllables, while in all the languages for which I have statistics I can trust (English, Italian, Spanish, Latin, Classic Greek, Koine Greek, (rather old) German, (rather old) French) all the most frequent words, by far, have only one syllable. On the upside, if it's a syllabary, the chances of finding a solution do not decrease much beyond the baseline (just divide by the number of all possible languages). The 'decoding' worked roughly like this: chunks such as 'aain' encode CV/VC/V syllables (the slots can be arranged to get two fields with about 3 bits of information, enough for vowels, and one field for a consonant, with ~14 possible consonants, which would be more or less enough for Latin, much less so for English). Chunks such as 'Cedy' would encode CVC syllables: it's possible to get a field for 5 vowels (but only in the first syllable, choices are limited to three after the first) and two fields for the consonants (but one of them is limited to about 9 choices).
• It's a nomenclator cipher (which includes "it's a constructed language"). Each word could be an index into the nomenclator table (or into the dictionary of the constructed language). I.e., with a nomenclator, 'qokeey' could mean "Bible-Revelation-Chapter 1-2nd column-5th word-from top", or anything like that. The worst problem I see with this are the points in the Voynich where a word is repeated four times in a row. Yes, one could conceive of formulaic phrases ("Sanctus! Sanctus! Sanctus! Sanctus!"), but four words in a row are really a lot. If it's a nomenclator, the chances of a solution drop to nil (there is nothing recognizable with certainty in the manuscript, and every attempt to find cribs has failed miserably), so it's an unfalsifiable hypothesis anyway.
• It's mathematics, which would require words to encode mostly numbers, plus some other symbols/features/maybe some actual real-language words. In the same vein: it could be an accounting book, or a list of astronomical observations, or a plotting language (someone proposed this already, I don't remember who or where), or music (but it looks way too complicated to encode music, unless it was for an orchestra xD), or a myriad of other mostly-numerical things. I think all these possibilities are highly improbable, even if they all get rid of the pesky "four words in a row" problem. Chances to find a solution in this case? Essentially zero: essentially an unfalsifiable hypothesis.
• It's meaningless, after all. But one must then explain how the consistency of the word grammar was maintained all along the text, even surviving a radical change of 'language' (from Currier A to Currier B): I think this is a problem particularly for the Timm & Schinner 'self citation' idea (which I really loved, btw). One possibility is that the words (or at least most of them) were generated before writing the text, with the sequence of words then chosen somehow (and some 'more complicated' words sprinkled in), but this looks rather weird to me. Another possibility is the writer being affected by some kind of mental condition (akin to some forms of autism), so that the word structure came naturally to him. This would be consistent with all the data (including the weird illustrations and diagrams), but, unfortunately, it's another unfalsifiable hypothesis.

PS: as a corollary, if the 'word chunks' hypothesis is valid, then statistics based on bigram frequencies and the like are just a consequence of the underlying grammar, and have no meaning by themselves.
(29-11-2024, 01:41 PM)Mauro Wrote: Sure, I can publish the word lists (and the raw outputs of the various grammars). Right now I have them in Excel files; would it be okay if I upload them to Google Drive? (I ask because Excel files are often viewed with suspicion, since they can contain dangerous macros, but there are no macros in these files.)

Excel will do, thanks!

I have a question: is my understanding correct that when you quote/compute the coverage, you are referring to word type coverage and not word token coverage? I wonder what word token coverage would look like. E.g., from my point of view, a grammar that successfully covers 95% of the text tokens by fully incorporating 60% of the most frequent words has more merit than a model covering 95% of word types but missing some of the frequent ones. After all, rare words are much more likely to be scribal errors, ambiguous writing, etc.
(29-11-2024, 02:38 PM)oshfdk Wrote: I have a question: is my understanding correct that when you quote/compute the coverage, you are referring to word type coverage and not word token coverage?

Very good, I'll post the Excel files, allow me some time.

About your question: yeah, I should have used more precise terminology, though I don't know if any standard exists. My 'word' is, I think, what you call 'word type': i.e. the transcription I used has 8078 different unique 'words', and of course a 'word' can appear multiple times in the text (e.g. daiin appears 864 times). The coverage I reported is calculated on word types, i.e.: 7145 unique 'words' found / 8078 unique 'words' in total.

I can also give you the percentage vs. word _tokens_ (it's rather easy from the Excel file):

Grammar used: EXTENDED-12
Total word _tokens_ in the text (excluding 484 word _tokens_ with 'rare' characters as previously defined): 37402
Total word _types_ found by the grammar: 7145, which sums up to 36886 word _tokens_. Coverage vs. word _tokens_ = 36886/37402 = 98.62%
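In code, both figures come straight from a word-frequency list. A minimal sketch (the counts below are illustrative; only daiin's 864 is from this thread):

Code:
from collections import Counter

def coverage(freq, found):
    """Return (word-type coverage, word-token coverage) of a grammar."""
    types  = len(found) / len(freq)
    tokens = sum(freq[w] for w in found) / sum(freq.values())
    return types, tokens

# Illustrative counts; only daiin's 864 comes from the posts above.
freq  = Counter({"daiin": 864, "chedy": 501, "qokeey": 308, "xyzzy": 1})
found = {"daiin", "chedy", "qokeey"}
print(coverage(freq, found))   # (0.75, ~0.9994): a few types carry most tokens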
Ok, here come the files, it was less painful than I thought :).

1) Transcription

I downloaded the transcription of the Voynich pages from the github repository of [link], then I used a function of my software to get rid of the metadata and to collate everything into a single .txt file:

[link]


2) Raw results

This Excel file is the dump of the grammar EXTENDED-12 (a few column headers are in Italian, sorry).
  • In the first columns the 'report' shows the calculated data and the definition of the grammar.
  • Columns L-M-N-O contain the complete dictionary of the word _types_ (column L), the number of times each appears in the text (column M), and two flags (columns N and O) which say if a word has not been found (NOTFOUND) or if a word has not been found but contains rare characters (RARECHARS). You have to scroll down a good bit before you see anything here; NOTFOUND should be marked by a red background, RARECHARS by a blue one.
  • Columns R-S-T contain only the NOTFOUND words
  • Columns U-V-W contain only the RARECHARS words

[link]


Let me know if you need/want anything else. By the way, I have no qualms about posting the software too, sources included (parsing Voynich words is only a recent add-on: the program does a lot more, and on any possible language, not just Voynichese), but 1) I'd need some time for some indispensable fixes & documentation, and 2) I'd need someone to point me to the best way to post it, because I've never done anything like this in my life. If anyone is interested, let me know. It's a C# program developed on Microsoft Visual Studio 2022 (or you can just use the .exe as it is, if you're not interested in the source code).

Thank you! I just needed the wordlist to understand what's covered by this grammar, your Excel file should be perfect for that.

I'm not sure I understand the implications of the qualitative analysis you added in one of the previous posts. I'm not very familiar with grammars: I have some basic understanding of how they work in, say, describing programming languages, but I have never had to deal with a grammar as a data analysis tool. What kinds of conclusions can be made from the fact that your grammars achieve good scores? Is it possible to somehow identify functional elements or gain some understanding of actual character boundaries (e.g., whether ch, iin, qo should be treated as single entities)?
(29-11-2024, 06:45 PM)oshfdk Wrote: What kinds of conclusions can be made from the fact that your grammars achieve good scores? Is it possible to somehow identify functional elements or gain some understanding of actual character boundaries (e.g., whether ch, iin, qo should be treated as single entities)?

I think you can find all the answers in my post #14. For an example of how functional elements could be identified, see the note in small characters at the end of the point "It's a syllabary".

Btw, I have just completed a run using a WordChunk grammar (the new official name I propose for the LOOP grammar) with up to 4 word chunks. I just need a few minutes to upload the Excel file and write the post.
(29-11-2024, 01:41 PM)Mauro Wrote: And surely I can try the RF1a-n transliteration. Just, can you point me to a link? Ideally it should be a single .txt file without any metadata or added remarks (that would save a lot of tedious work).

The link is in my post #13.

All separators converted to spaces, no metadata: ivtt -x7 RF1a-n.txt RF1a-n-x7.txt
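Once ivtt has produced the plain space-separated file, building the wordlist is a one-liner. A minimal sketch, assuming the RF1a-n-x7.txt output of the command above:

Code:
from collections import Counter

# Assumes the output of: ivtt -x7 RF1a-n.txt RF1a-n-x7.txt
with open("RF1a-n-x7.txt") as f:
    freq = Counter(f.read().split())

print(len(freq), "word types,", sum(freq.values()), "word tokens")
print(freq.most_common(5))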