The Voynich Ninja

Full Version: A family of grammars for Voynichese
You're currently viewing a stripped down version of our content.
Pages: 1 2 3 4 5 6 7 8
Preliminary results using a "WordChunks" grammar (up to 4 chunks).

Before proceeding, I need to stress that the grammar can take many different forms, so the grammar which follows is not written in stone: it's just an example of how it could look. Since efficiency is not a concern for this test, I have kept it as clean as possible. After the results, I'll add a few more considerations about this.

The transcription used is the voynichese.com one I already posted.

The grammar is as follows (it's not a true loop, it's just 4 repeated blocks, but this was much easier and quicker to do):

-----------

Grammar name: WordChunks - 4 chunks, 26 slots

Slot 1: q    // HEADER

Slot 2: ch sh y    // 1st WORD CHUNK
Slot 3: eee ee e
Slot 4: o
Slot 5: a
Slot 6: iii ii i
Slot 7: lk ld l d k r s t p f cth ckh cph cfh n m

Slot 8: ch sh y  // 2nd WORD CHUNK (same as the 1st)
Slot 9: eee ee e
Slot 10: o
Slot 11: a
Slot 12: iii ii i
Slot 13: lk ld l d k r s t p f cth ckh cph cfh n m

.. (two more identical chunks follow)...

Slot 26: y  // TAIL


--------------------
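For what it's worth, the grammar above can be sketched as a single regular expression. This is only one possible encoding (the slot contents are copied from the listing above, but the greedy regex and the token ordering are my own assumptions about how the chunks combine):

```python
import re

# One possible encoding of the 4-chunk WordChunks grammar as a regex.
# Multi-glyph tokens (lk, ld, cth, ...) must come before their prefixes.
CONSONANTS = "lk|ld|cth|ckh|cph|cfh|l|d|k|r|s|t|p|f|n|m"
CHUNK = rf"(?:ch|sh|y)?(?:e{{1,3}})?o?a?(?:i{{1,3}})?(?:{CONSONANTS})?"
# Header 'q', up to 4 chunks, tail 'y' -- all optional.
WORD = re.compile(rf"q?(?:{CHUNK}){{0,4}}y?")

def generated(word: str) -> bool:
    """True if the 4-chunk grammar can generate this word type."""
    return WORD.fullmatch(word) is not None
```

With this sketch, `generated('daiin')` and `generated('chedy')` hold, while `generated('oqokain')` fails (non-initial 'q') and `generated('ofyskydal')` fails (needs more than 4 chunks), matching the two groups of misses described above.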
The grammar generates 7630 of the 7700 word types (99.09% coverage, which does not surprise me anymore, though :) ); only 70 words cannot be generated. The efficiency is abysmal, but who cares. Unfortunately, I cannot yet give a count of the number of word chunks required for each word type (that would require implementing a true loop in the software; not difficult, but quite annoying, so I may do it tomorrow, or more probably a day later).

The full list of the words which cannot be generated is in the Excel file (link below), but at first sight they divide into two groups:
  • Word types which cannot be generated because they have a 'q' not at the beginning of a word (oqokain, oqol, etc.)
  • Word types which need more than 4 word chunks ('ofyskydal' and 'cthdaoto', 6 chunks each if I'm not mistaken, etc.)
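The "true loop" mentioned above (counting the chunks each word needs) can be sketched along these lines; this is my own hypothetical simplification using greedy left-to-right matching, so exact chunk counts may differ from the author's depending on the slot arrangement:

```python
import re
from typing import Optional

# Sketch of a loop-based chunk counter: strip the 'q' header and the final
# 'y' tail, then repeatedly consume one chunk from the left.
CONSONANTS = "lk|ld|cth|ckh|cph|cfh|l|d|k|r|s|t|p|f|n|m"
CHUNK = re.compile(rf"(?:ch|sh|y)?(?:e{{1,3}})?o?a?(?:i{{1,3}})?(?:{CONSONANTS})?")

def chunk_count(word: str) -> Optional[int]:
    """Number of chunks needed, or None if the word cannot be generated."""
    if word.startswith("q"):
        word = word[1:]
    if word.endswith("y") and len(word) > 1:
        word = word[:-1]
    count = 0
    while word:
        m = CHUNK.match(word)
        if m.end() == 0:               # stuck, e.g. on a non-initial 'q'
            return None
        word = word[m.end():]
        count += 1
    return count
```

Under these assumptions, `chunk_count('daiin')` gives 2, `chunk_count('chedy')` gives 1, `chunk_count('oqol')` gives None, and 'ofyskydal' comes out above the 4-chunk limit.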

The Excel file is here: [link removed in this stripped version]


Some considerations about the WordChunks grammar I used:
  • As I said, it could take many different forms. I had to put a 'y' at the beginning of the chunk, together with 'ch' and 'sh', for words such as 'ykedy' and the like, which are relatively frequent. It's probably not needed in every chunk, just one or two may suffice, but I wanted to keep things uniform. Maybe it would have been better to use a separate slot (above 'ch, sh') for the 'y', or maybe something different; in any case, the placement of the 'y' is a subtle problem because (except in final position) it's used rarely, yet too frequently to simply ignore.
  • I put the only 'q' in the header of the loop, so if the 'q' is not at the beginning, the word is not generated. One solution would be to put the 'q' inside the looping chunk, removing the header altogether, but words with a non-initial 'q' are so rare that this seems overkill.
  • I find the groups 'lk' and 'ld' to be an interesting feature, because in a great deal of words, such as 'lkedy', they shorten the word by a whole chunk. They also increase the number of bits of information which can be stored in the "l, k, r, s, etc." slots. It's tempting to add more of them ('lt', 'ls', ...); this would change the statistics I have just presented very little, but it would alter the subdivision into chunks of certain (rare) words, e.g. 'ols' would drop from two chunks, [ol][s], to one chunk, [ols].

(29-11-2024, 01:24 PM)nablator Wrote:
(29-11-2024, 10:08 AM)Mauro Wrote: (voynichese.com transcription, words with 'rare' characters ('g', 'x', 'v', 'z' and 'c', 'h' appearing alone) excluded, 7700 total words remaining)

Voynichese.com uses the old Takeshi Takahashi transliteration (1995) without unclear characters (all '?' removed)...

Try [link] instead, it's better.

Thanks @nablator for the RF1-a link! I had missed it, apologies.

I'll be working on it tomorrow, I guess; it's too late here now, and I first need to remove the metadata... sigh.
(29-11-2024, 07:50 PM)nablator Wrote:
(29-11-2024, 01:41 PM)Mauro Wrote: And surely I can try with the RF1a-n transcription. Just, can you point me to a link? Ideally it should be a single .txt file without any metadata or added remarks (that would save a lot of asinine work).

The link is in my post #13.

All separators converted to spaces, no metadata: ivtt -x7 RF1a-n.txt RF1a-n-x7.txt

... and now I've also found the link without the metadata... re-thanks and re-apologies!

Added: from just one quick test, the results are equivalent to those with the voynichese.com transliteration (some details vary, but insignificantly). Time to stop now though, tomorrow I'll do a proper test and post the Excel files with the results. Thanks again.
(29-11-2024, 07:36 PM)Mauro Wrote: I think you can find all the answers in my post #14. For an example of how functional elements could be identified, see the note in small characters at the end of the point "It's a syllabary".

It's exactly the analysis in post #14 that I have trouble understanding. I still have a very poor comprehension of how a grammar relates to potential features of the text. If I understand this correctly, any list of words can be approximated (?) via any of multiple grammars incompatible between themselves (in the sense that they can generate quite different sets of strings). As far as I can see, almost all proposed grammars generate a vastly larger number of word types than those that are actually contained in the manuscript. In this case, I suppose, if it's possible to compute the overlap between all the strings predicted by grammar A and all the strings predicted by grammar B, this overlap would probably represent a very tiny fraction of both spaces. Does this mean that either grammar is likely a very poor model of the underlying text?

[Edit] My understanding of using grammars for Voynichese originally was that they are just there to show that there is some structure to the words. For this, grammars serve as a good tool. However, unless there is a very tight grammar that, say, covers 95% of the text and only produces less than 10x the number of known word types, I have little understanding of what we can learn by further refining these loose grammars.

Quote: It's a syllabary. I actually tried a little to investigate more, and I even found a reasonable way to convert fields to syllables, but then I realized the most frequent word in Voynich (daiin) has 2 chunks, which would be two syllables, while in all the languages for which I have statistics I can trust (English, Italian, Spanish, Latin, Classic Greek, Koine Greek, (rather old) German, (rather old) French) all the most frequent words, by far, have only one syllable. On the upside, if it's a syllabary, the chances of finding a solution do not decrease much beyond the baseline (just divide by the number of all possible languages). The 'decoding' worked roughly like this: chunks such as 'aain' encode CV/VC/V syllables (the slots can be arranged to get two fields with about 3 bits of information, enough for vowels, and one field for a consonant, with ~14 possible consonants, which would be +- enough for Latin, much less for English). Chunks such as 'Cedy' would encode CVC syllables: it's possible to get a field for 5 vowels (but only in the first syllable; choices are limited to three after the first) and two fields for the consonants (but one of them is limited to about 9 choices).

Thank you for singling out this part of your post as the example, I'll copy it here so that it's easier to find.
(30-11-2024, 03:42 AM)oshfdk Wrote:
(29-11-2024, 07:36 PM)Mauro Wrote: I think you can find all the answers in my post #14. For an example of how functional elements could be identified, see the note in small characters at the end of the point "It's a syllabary".



It's exactly the analysis in post #14 that I have trouble understanding. I still have a very poor comprehension of how a grammar relates to potential features of the text. If I understand this correctly, any list of words can be approximated (?) via any of multiple grammars incompatible between themselves (in the sense that they can generate quite different sets of strings). As far as I can see, almost all proposed grammars generate a vastly larger number of word types than those that are actually contained in the manuscript. In this case, I suppose, if it's possible to compute the overlap between all the strings predicted by grammar A and all the strings predicted by grammar B, this overlap would probably represent a very tiny fraction of both spaces. Does this mean that either grammar is likely a very poor model of the underlying text?

[Edit] My understanding of using grammars for Voynichese originally was that they are just there to show that there is some structure to the words. For this, grammars serve as a good tool. However, unless there is a very tight grammar that, say, covers 95% of the text and only produces less than 10x the number of known word types....

Your remarks are all true and the question you pose is very interesting: I don't have a definitive answer, as post #14 made clear (I wish I did!), but I can try to explain how I see it (I'll use an example, because a general explanation would be terribly convoluted). Let me re-work the 'syllabary hypothesis' for my example; I hope it will be enough (and the example does not at all imply I want to propose it as a possible actual solution!).

Say I want to encode the sentence "Voynich manuscript is a hard problem" as a syllabary, so I get "Voy-nich ma-nu-script is a hard pro-blem". Then I devise a weird method for it: for instance, I decide to encode VC/CV syllables as I briefly sketched in post #14 (the part you copied). What I did, in effect, is to define a 'grammar':

SLOT1: the consonant, encoded, say, with the cipher glyphs "ld", "lk", "l", "r", "s" etc. (let's say 16 symbols)
SLOT2: the first vowel, encoded, say, with "a", "ai", "aii", "oi" etc. (let's say 8 symbols)
SLOT3: the last vowel, encoded (because I'm weird) in a different way than before, say "m", "n", "s", "l" etc. (let's say 16 more symbols, they can overlap those of SLOT1)

This grammar will generate 16*8*16 = 2048 different word types.
But in "Voynich manuscript is a hard problem" there are just two CV syllables, so of my grammar of 2048 words I will just use two in my encrypted text.

Does the fact that 2048 is much greater than 2 invalidate the grammar? Surely not, because that's the grammar I actually used for the encoding. [By the way, this opens up the cipher to a decryption attack based on frequencies (slot frequencies in this case), which is the reason why, of all the possibilities I listed in post #14, the syllabary is the only one which could be reasonably attacked. Just, and see again post #14, it would be very, very, ..., very difficult to do it]. So, to attempt an attack on a syllabary by using a slot alphabet grammar, the problem is not 'define a grammar with excellent coverage and efficiency' (that would be desirable, of course!) but rather 'define a grammar with excellent coverage which makes sense (especially in the context of late Middle Ages technology and culture) and is not trivial', and by 'trivial' I mean something like:

SLOT1: every possible symbol
SLOT2: every possible symbol
SLOT3: every possible symbol

Now this kind of trivial grammar has a very low efficiency, so efficiency does indeed help in discarding from consideration the grammars which are little informative, but it cannot be taken as an absolute requirement at all.



(30-11-2024, 03:42 AM)oshfdk Wrote: ... I have little understanding of what we can learn by further refining these loose grammars.

As I said in post #14, unfortunately, the probability we can find a 'solution' is about zero. It is about zero both with grammars (any kind of them) and without. But I wouldn't say there is nothing to gain in exploring the grammars path: after all, Voynich words obviously have a (weird) structure, which probably has some function in the manuscript (or, in the case it's a meaningless text, it is some consequence of the pseudo-random method used to generate the gibberish) and even just finding a class of grammars where we cannot say exactly which is the right one, but we can be confident that one of them is probably the correct one, would be a step forward.

It may also open up new lines of inquiry. For instance, the hypothesis of a syllabary is usually discarded because there are not enough characters to represent syllables, but the LOOP grammar rescues the hypothesis because it gives a way to encode syllables using few characters (now, that syllables can be encoded using just a few characters is by itself a rather trivial conclusion, but the grammar also gives a framework which fits the manuscript naturally, because it also 'explains' its remarkable word structure).
(30-11-2024, 10:03 AM)Mauro Wrote: Let me re-work the 'syllabary hypothesis' for my example, I hope it will be enough (and the example does not at all imply I want to propose it as a possible actual solution!).

Thank you for explaining this further. If I understand it right, even if we stumble upon the correct grammar, it looks like we don't have any way to tell it apart from incorrect grammars with similar loose predictive power (that is, a huge number of possible strings).

I don't know how you identify these grammars. If it's by some sort of machine learning process, maybe it would be possible to split the words into training and validation subsets, to check whether the grammar learned on a subset of words is as efficient for the rest of the text.
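The split idea above could be sketched roughly like this (hypothetical: `is_generated` stands in for any grammar-membership test, e.g. a regex built from the slot grammar):

```python
import random

def split_coverage(word_types, is_generated, seed=0):
    """Shuffle the word types, split them in half, and measure how much of
    each half a fixed grammar covers."""
    words = list(word_types)
    random.Random(seed).shuffle(words)
    half = len(words) // 2
    train, valid = words[:half], words[half:]
    coverage = lambda ws: sum(map(is_generated, ws)) / len(ws)
    return coverage(train), coverage(valid)
```

If a grammar were overfitted to the word list it was tuned on, one would expect the coverage on the held-out half to drop noticeably below the coverage on the other half.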

(30-11-2024, 10:03 AM)Mauro Wrote: But I wouldn't say there is nothing to gain in exploring the grammars path: after all, Voynich words obviously have a (weird) structure, which probably has some function in the manuscript (or, in the case it's a meaningless text, it is some consequence of the pseudo-random method used to generate the gibberish) and even just finding a class of grammars where we cannot say exactly which is the right one, but we can be confident that one of them is probably the correct one, would be a step forward.

Let's say, hypothetically, we learn that the correct intended grammar for the whole text is A + B + C + D (we can't know this before we know the method used to create the text, so this is a pure thought experiment). Let's say that we know that A can be either 'q' or 'y' or null, etc. for B, C and D. I still don't understand how knowing all these facts would give us anything in terms of identifying the method used to produce the text. The only thing I see is that we will be able to generate new conforming Voynichese words, with no clue about their meaning or function. Maybe we could try to guess something from the number of entries in each slot, but overall I just see no clear path from identifying the correct grammar (which could be impossible in the first place) to identifying the meaning or function.

BTW, do you have any thought about the binomial distribution of word lengths, as identified by Stolfi (if I'm not mistaken)? Does it put some constraints on possible grammars?
Results of the "LOOP" grammar applied to reference transliteration RF1a-n

----------------------------------------------

The grammar I used is exactly the same as the one in post #21; see there for some additional remarks about its construction. I just changed the name back from WordChunks to LOOP, because I realized I had mis-appropriated ReneZ's 'word chunk' and used it as a proper name for my grammar, which surely is not appropriate. Apologies to @ReneZ!


*** At the risk of being pedantic, I want to stress again that the grammar is just one possibility inside a broad class of "LOOP" grammars, which can have many variations but always the same basic structure, and are all +- equivalent as far as coverage goes (efficiency is not a concern here), but which will parse each word type in possibly different ways. I'd also note that a grammar which loops a chunk can be tricky in this regard, i.e.: we could rotate the basic loop chunk by one position to the left, so the "eee ee e" slot is now at the beginning, while the "ch sh" slot goes to the end. This form is very counter-intuitive at first sight, but given the looping mechanism it actually makes little difference to the overall coverage. It will however parse the word types in a different way, e.g. 'chedy' will be parsed as [ch][ed]y, with two chunks, while before it was just [ched]y, in one chunk ***
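The rotation effect can be demonstrated concretely. This sketch uses my own greedy left-to-right chunker (an assumption, not the author's implementation) on the word body with header and tail stripped:

```python
import re

# The same chunk with the original slot order vs. rotated one position left
# (the 'ch sh y' slot moved from the front of the chunk to the end).
CONS = "lk|ld|cth|ckh|cph|cfh|l|d|k|r|s|t|p|f|n|m"
ORIG = re.compile(rf"(?:ch|sh|y)?(?:e{{1,3}})?o?a?(?:i{{1,3}})?(?:{CONS})?")
ROT  = re.compile(rf"(?:e{{1,3}})?o?a?(?:i{{1,3}})?(?:{CONS})?(?:ch|sh|y)?")

def chunks(body, pattern):
    """Greedy left-to-right chunking of a word body (header/tail stripped)."""
    out = []
    while body:
        m = pattern.match(body)
        if m.end() == 0:               # no slot can consume the next glyph
            return None
        out.append(m.group())
        body = body[m.end():]
    return out

# 'chedy' minus the tail 'y' parses differently under the two orders:
print(chunks("ched", ORIG))  # ['ched']      -> chedy = [ched]y, one chunk
print(chunks("ched", ROT))   # ['ch', 'ed']  -> chedy = [ch][ed]y, two chunks
```

Both orders cover the word; only the subdivision into chunks changes, which is exactly the trickiness described above.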

Methodological note: I added 'b', 'j', 'u' to the list of 'rare' characters. They were not present in the transcription I used previously.

Coverage is 98.85% (excluding word types with 'rare' characters): 7939 word types found out of a total of 8031; 92 word types are missed. Efficiency is abysmal, but not of concern in this context. The first missed word is 'oqokain', 2 occurrences. All the other word types with at least 2 occurrences are found, as are >97% of the words appearing only once (I don't have an exact count for this). As before, the missed word types have a 'q' not in first position, or would require more than 4 word chunks.

This is the Excel file with all the data: [link removed in this stripped version]
(30-11-2024, 10:57 AM)oshfdk Wrote:
(30-11-2024, 10:03 AM)Mauro Wrote: Let me re-work the 'syllabary hypothesis' for my example, I hope it will be enough (and the example does not at all imply I want to propose it as a possible actual solution!).



Thank you for explaining this further. If I understand it right, even if we stumble upon the correct grammar, it looks like we don't have any way to tell it apart from incorrect grammars with similar loose predictive power (that is, a huge number of possible strings).

I would put it in a different way. Finding a possibly correct grammar class would be a major achievement, because it gives a possible explanation of a remarkable feature of the manuscript. This is what Stolfi, ThomasCoon, Zattera etc. etc., and lately me too, have been trying to do for years and years. Choosing which of all those classes is the best is of course a hard problem, but a grammar class could be found which stands out in 'explanatory power', and this would already be a big step forward.


(30-11-2024, 10:57 AM)oshfdk Wrote: I don't know how you identify these grammars. If by some sort of machine learning process, maybe it could be possible to split the words into training and validation subsets? To check whether the grammar learned on a subset of words is as efficient for the rest of the text.

I don't know either: all the innumerable discussions and comparisons of different grammars (including those in this thread) serve this purpose. Using AI is a very nice idea. I had actually fancied asking ChatGPT-4 to generate a random sample of Voynichese for me and seeing what happens; I never did it because I never subscribed to ChatGPT (I have a strong allergy to subscriptions) and because it works in a way that could easily have generated text by simply assembling together actual Voynich snippets, which is rather trivial.


(30-11-2024, 10:03 AM)Mauro Wrote: But I wouldn't say there is nothing to gain in exploring the grammars path: after all, Voynich words obviously have a (weird) structure, which probably has some function in the manuscript (or, in the case it's a meaningless text, it is some consequence of the pseudo-random method used to generate the gibberish) and even just finding a class of grammars where we cannot say exactly which is the right one, but we can be confident that one of them is probably the correct one, would be a step forward.


(30-11-2024, 10:57 AM)oshfdk Wrote: Let's say, hypothetically, we learn that the correct intended grammar for the whole text is A + B + C + D (we can't know this before we know the method used to create the text, so this is a pure thought experiment). Let's say that we know that A can be either 'q' or 'y' or null, etc. for B, C and D. I still don't understand how knowing all these facts would give us anything in terms of identifying the method used to produce the text. The only thing I see is that we will be able to generate new conforming Voynichese words, with no clue about their meaning or function. Maybe we could try to guess something from the number of entries in each slot, but overall I just see no clear path from identifying the correct grammar (which could be impossible in the first place) to identifying the meaning or function.

Nor do I see any clear, practicable path, as I tried to make clear in the, by now notorious, post #14. But even those meager considerations are a step forward nonetheless, well, at least for me, i.e.: I did not think it possible the Voynich could be a syllabary, now I think it could be. I thought the Timm & Schinner "copy and modify" hypothesis was consistent with most of the available data; now I think that hypothesis is difficult to reconcile with the LOOP grammar (supposing it has some validity).

(30-11-2024, 10:57 AM)oshfdk Wrote: BTW, do you have any thought about the binomial distribution of word lengths, as identified by Stolfi (if I'm not mistaken)? Does it put some constraints on possible grammars?

Yes, I have some opinions, and no, I don't think they can constrain the possible grammars. But (as usual with me, I'm sorry) it's a rather long explanation, and since it's mostly off-topic here, I'll send you a PM later instead.
(29-11-2024, 08:45 AM)ReneZ Wrote: At first I was a bit confused as the 'first missing word' did not match mine, but I based the checks on the reference transliteration (RF-1a).

It's taken me a few days to double-check this.
I had written that the first word not modeled by M.Zattera was 150th ranked 'choty'. This has 40 occurrences in RF-1a. However, I had indeed overlooked 'cheky', 89th ranked with 65 occurrences.

Of course you are welcome to use the term 'word chunks' if you think it fits your model.

I can recommend you to look at the tool 'ivtt' which is available as a single C source file, with documentation.
It will eliminate all sorts of boring and time consuming editorial work.
@nablator also used it to create the clean file for you.
(01-12-2024, 04:23 AM)ReneZ Wrote:
(29-11-2024, 08:45 AM)ReneZ Wrote: At first I was a bit confused as the 'first missing word' did not match mine, but I based the checks on the reference transliteration (RF-1a).

It's taken me a few days to double-check this.
I had written that the first word not modeled by M.Zattera was 150th ranked 'choty'. This has 40 occurrences in RF-1a. However, I had indeed overlooked 'cheky', 89th ranked with 65 occurrences.

Yes



(01-12-2024, 04:23 AM)ReneZ Wrote: Of course you are welcome to use the term 'word chunks' if you think it fits your model.

Thanks, but I feel it's not appropriate, almost a kind of plagiarism; I'll keep 'LOOP' as the proper name. But of course I'll use 'word chunks' as a technical term, it's perfect!


(01-12-2024, 04:23 AM)ReneZ Wrote: I can recommend you to look at the tool 'ivtt' which is available as a single C source file, with documentation.
It will eliminate all sorts of boring and time consuming editorial work.
@nablator also used it to create the clean file for you.

Ahhhh... so the cryptic string @nablator posted...

Quote:ivtt -x7 RF1a-n.txt RF1a-n-x7.txt

.. was a command line for ivtt.exe, lol, I could not make sense of it :) . It looks quite interesting for a lot of useful things. I guess it runs under the Windows command prompt; is it possible to get the .exe? I did not find a link to it on your website (I may have missed it), and building it from the C file would not be easy for me. It's been years and years since I worked with C, and while I guess Visual Studio can build C programs, I have no idea how to do it (and little will to learn, tbh; a lot of head-scratching for a one-time use is not appealing, I'd rather port the whole thing to C#! xD). Thanks!
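For what it's worth, building a single-file C tool like ivtt should not require Visual Studio at all. A hedged sketch, assuming the source file is named ivtt.c and compiles as standard C (I have not verified the actual file name or its portability):

```shell
# With gcc or clang (MinGW-w64/MSYS2 on Windows, or any Linux/macOS shell):
gcc -O2 -o ivtt ivtt.c

# Or with the Microsoft compiler from a "Developer Command Prompt":
#   cl /O2 ivtt.c

# Then the command @nablator posted runs as-is:
./ivtt -x7 RF1a-n.txt RF1a-n-x7.txt
```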