29-11-2024, 08:12 PM
Preliminary results using a "WordChunks" grammar (up to 4 chunks).
Before proceeding, I need to stress that the grammar can take many different forms, so the one that follows is not set in stone; it's just an example of how it could look. Since efficiency is not a concern for this test, I have kept it as clean as possible. After the results, I'll add a few more considerations about this.
The transcription used is the voynichese.com one I already posted.
The grammar is as follows (it's not a true loop, it's just 4 repeated blocks, but this was much easier and quicker to do):
-----------
Grammar name: WordChunks - 4 chunks, 26 slots
Slot 1: q // HEADER
Slot 2: ch sh y // 1st WORD CHUNK
Slot 3: eee ee e
Slot 4: o
Slot 5: a
Slot 6: iii ii i
Slot 7: lk ld l d k r s t p f cth ckh cph cfh n m
Slot 8: ch sh y // 2nd WORD CHUNK (same as the 1st)
Slot 9: eee ee e
Slot 10: o
Slot 11: a
Slot 12: iii ii i
Slot 13: lk ld l d k r s t p f cth ckh cph cfh n m
.. (two more identical chunks follow)...
Slot 26: y // TAIL
--------------------
The grammar finds 7630 word types out of 7700 (99.09% coverage, which doesn't surprise me anymore, though); only 70 words cannot be generated. The efficiency is abysmal, but who cares. Unfortunately, I cannot give a count of the number of word chunks required for each word type: that would require me to implement a true loop in the software, which is not difficult but quite annoying. I may do it tomorrow, or more probably a day later.
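For anyone who wants to reproduce this kind of test, here is a minimal Python sketch of the 26-slot grammar as a regular expression. It is not the software I used; the names (`build_grammar`, `can_generate`, `coverage`) are made up for illustration, and it assumes that every slot is optional and contributes at most one glyph group. The word list at the bottom is only a placeholder, not the real transcription.

```python
import re

# Slot contents copied from the listing above.
# ASSUMPTION: every slot is optional and holds at most one glyph group.
HEADER = "q"
CHUNK_SLOTS = [
    "ch sh y",
    "eee ee e",
    "o",
    "a",
    "iii ii i",
    "lk ld l d k r s t p f cth ckh cph cfh n m",
]
TAIL = "y"


def slot(options):
    """Regex for one optional slot holding at most one of the listed groups."""
    return "(?:" + "|".join(options.split()) + ")?"


def build_grammar(n_chunks=4):
    """Header + n_chunks copies of the chunk block + tail (no true loop)."""
    chunk = "".join(slot(s) for s in CHUNK_SLOTS)
    return re.compile(slot(HEADER) + chunk * n_chunks + slot(TAIL))


GRAMMAR = build_grammar(4)  # 1 + 4 * 6 + 1 = 26 slots


def can_generate(word):
    """True if the whole word can be spelled by walking the slots in order."""
    return GRAMMAR.fullmatch(word) is not None


def coverage(word_types):
    """Return (generated, total, percent) over a collection of word types."""
    types = set(word_types)
    hits = sum(can_generate(w) for w in types)
    return hits, len(types), 100.0 * hits / len(types)


# Placeholder sample, NOT the real word-type list.
sample = ["ykedy", "lkedy", "ols", "oqokain"]
print(can_generate("ykedy"))    # True
print(can_generate("oqokain"))  # False: 'q' is only allowed in the header
print(coverage(sample))         # (3, 4, 75.0)
```

Repeating the chunk block a fixed number of times mirrors the "4 repeated blocks" shortcut: the regex simply cannot spell anything that needs a fifth chunk.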
The full list of the words which cannot be generated is in the Excel file (link below), but at first sight they fall into two groups:
- Word types which cannot be generated because they have a 'q' that is not at the beginning of the word (oqokain, oqol, etc.)
- Word types which need more than 4 word chunks (ofyskydal, 6 chunks if I'm not mistaken; cthdaoto, 6 chunks; etc.)
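The two groups above could be told apart automatically with a "true loop" over a single chunk. The following is only a sketch under my own assumptions: `min_chunks` and `classify` are hypothetical names, every slot is taken to be optional, and an initial 'q' or a final 'y' may be absorbed by the header or the tail. The chunk counts it reports depend on those assumptions and need not agree with counts made by hand.

```python
import math
import re

# One chunk = six optional slots, copied from the grammar listing.
SLOTS = [
    "ch sh y", "eee ee e", "o", "a", "iii ii i",
    "lk ld l d k r s t p f cth ckh cph cfh n m",
]
CHUNK = re.compile("".join("(?:" + "|".join(s.split()) + ")?" for s in SLOTS))


def min_chunks(word):
    """Smallest number of chunks that spells `word`, or None if impossible."""
    n = len(word)
    best = [math.inf] * (n + 1)   # best[i] = fewest chunks spelling word[:i]
    best[0] = 0
    if word.startswith("q"):
        best[1] = 0               # the header slot absorbs an initial 'q'
    for i in range(n):
        if best[i] == math.inf:
            continue
        for j in range(i + 1, n + 1):
            if CHUNK.fullmatch(word[i:j]):
                best[j] = min(best[j], best[i] + 1)
    total = best[n]
    if word.endswith("y"):
        total = min(total, best[n - 1])   # the tail slot absorbs a final 'y'
    return None if total == math.inf else total


def classify(word, max_chunks=4):
    """Sort a word type into the groups discussed above."""
    if "q" in word[1:]:
        return "non-initial q"
    k = min_chunks(word)
    if k is None:
        return "other"
    return "generated" if k <= max_chunks else "too many chunks"


for w in ["lkedy", "ols", "oqokain"]:
    print(w, min_chunks(w), classify(w))
```

Because the loop tries every possible chunk boundary, it needs no fixed number of repeated blocks and also yields the per-word chunk count that the 26-slot version cannot give.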
The Excel file is here: (link not available in this copy)
Some considerations about the WordChunks grammar I used:
- As I said, it could take many different forms. I had to put a 'y' at the beginning of the chunk, together with 'ch' and 'sh', for words such as 'ykedy' and the like, which are relatively frequent. It's probably not needed in every chunk (one or two may suffice), but I wanted to keep things uniform. Maybe it would have been better to use a separate slot (above 'ch, sh') for the 'y', or maybe something different. In any case, the placement of the 'y' is a subtle problem because, except in final position, it's used very rarely, yet too frequently to simply ignore.
- I put the only 'q' in the header of the loop, so if the 'q' is not at the beginning, the word is not generated. One solution would be to put the 'q' inside the looping chunk and remove the header altogether, but words with a non-initial 'q' are so rare that this seems like overkill.
- I find the groups 'lk' and 'ld' to be an interesting feature because, in a great many words such as 'lkedy', they shorten the word by a whole chunk. They also increase the number of bits of information which can be stored in the "l, k, r, s, etc." slots. It's tempting to add more of them ("lt", "ls", ...). This would barely change the statistics I have just presented, but it would alter the subdivision into chunks of certain (rare) words, e.g. 'ols' would drop from two chunks, [ol][s], to one chunk, [o ls].
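The last point can be illustrated with a small sketch. It uses a greedy left-to-right split, which is a simplification of mine and is not guaranteed to give the fewest chunks for every word. The "bits" are estimated as log2 of the number of states of the slot (each group plus "empty"), which assumes all states are equally likely. `segment` and `slot_bits` are hypothetical names.

```python
import math
import re

BASE_SLOT7 = "lk ld l d k r s t p f cth ckh cph cfh n m".split()
LEADING_SLOTS = ["ch sh y", "eee ee e", "o", "a", "iii ii i"]


def chunk_regex(slot7):
    """One chunk; longer groups come first so the greedy match prefers them."""
    parts = [s.split() for s in LEADING_SLOTS]
    parts.append(sorted(slot7, key=len, reverse=True))
    return re.compile("".join("(?:" + "|".join(p) + ")?" for p in parts))


def segment(word, slot7):
    """Greedy split into chunks; header 'q' and tail 'y' are set aside."""
    chunk = chunk_regex(slot7)
    start = 1 if word.startswith("q") else 0
    end = len(word) - 1 if word.endswith("y") else len(word)
    chunks, pos = [], start
    while pos < end:
        m = chunk.match(word, pos, end)
        if m.end() == pos:  # nothing consumed: glyph not covered by the chunk
            raise ValueError(f"cannot segment {word!r} at position {pos}")
        chunks.append(m.group())
        pos = m.end()
    return chunks


def slot_bits(slot7):
    """ASSUMPTION: capacity = log2(number of groups + 1 for the empty slot)."""
    return math.log2(len(slot7) + 1)


extended = BASE_SLOT7 + ["lt", "ls"]
print(segment("ols", BASE_SLOT7))   # ['ol', 's']
print(segment("ols", extended))     # ['ols']
print(round(slot_bits(BASE_SLOT7), 2), round(slot_bits(extended), 2))  # 4.09 4.25
```

Under the same estimate, dropping 'lk' and 'ld' from the slot would leave 14 groups, i.e. log2(15), or about 3.91 bits.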