02-05-2021, 02:46 PM
I ran some simple experiments with the 3-table cipher idea. As always, I may have misunderstood something or made errors in the process.
I started from the paragraph text in Q13 (Zandbergen-Landini EVA transliteration).
I split each word into three segments, basically:
- the first characters of the word make up the prefix
- the stem starts with a gallows, a benched gallows or 'd'
- the suffix is a final sequence made of [oeyainrlsmg]
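As a rough illustration of the three rules above, here is a minimal Python sketch. It is not the code used for the experiment: the regular expressions, the greedy suffix matching and the function name split_word are my assumptions, and I assume the EVA gallows are k, t, p, f with benched forms ckh, cth, cph, cfh.

```python
import re

# Assumption: EVA gallows are k, t, p, f; benched gallows are ckh, cth, cph, cfh.
STEM_START = re.compile(r"c[ktpf]h|[ktpfd]")
# Longest trailing run of the suffix characters listed in the post.
SUFFIX_RUN = re.compile(r"[oeyainrlsmg]*$")

def split_word(word):
    """Return (prefix, stem, suffix); an empty segment is an empty string."""
    match = STEM_START.search(word)
    start = match.start() if match else len(word)
    prefix, body = word[:start], word[start:]
    if body:
        # A stem exists: the suffix is the trailing run after the stem start.
        suffix = SUFFIX_RUN.search(body).group()
        stem = body[:len(body) - len(suffix)]
    else:
        # No stem: the suffix is cut directly from the end of the word.
        suffix = SUFFIX_RUN.search(prefix).group()
        prefix = prefix[:len(prefix) - len(suffix)]
        stem = ""
    return prefix, stem, suffix

print(split_word("sheckhar"))
print(split_word("chetedary"))
```

With this greedy rule 'qool' comes out as q,_,ool, while the worked example in this post uses qo,_,ol, so the actual parsing clearly differs in some details from this sketch.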
For each of the three segments, I sorted the observed values by frequency and assigned them to the English characters having the same frequency rank. The English frequencies are based on Genesis in the King James Version.
The result is the following cipher table (sorted both by rank and alphabetically):
This is very rough and includes some ambiguity. For instance, the word 'ol' can be parsed both as 'o,_,l' (AEL) and as '_,_,ol' (EED).
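The rank-matching step used to build the table can be sketched as follows. This is only an illustration under assumptions: the function name rank_map is hypothetical, ties are broken by first occurrence, and the sketch does not decide whether letter frequencies should be counted over the whole text or per position within the 3-letter groups.

```python
from collections import Counter

def rank_map(letters, segments):
    """Pair the n-th most frequent letter with the n-th most frequent segment.

    letters:  iterable of plain-text characters (e.g. from the English text)
    segments: iterable of one kind of segment (e.g. all prefixes in Q13)
    Ties keep first-occurrence order (an assumption of this sketch).
    """
    letters_by_rank = [item for item, _ in Counter(letters).most_common()]
    segments_by_rank = [item for item, _ in Counter(segments).most_common()]
    # zip stops at the shorter list, so surplus letters or segments are dropped.
    return dict(zip(letters_by_rank, segments_by_rank))

# Toy data, not taken from the real corpora.
toy_table = rank_map("eeetta", ["qo", "qo", "qo", "ch", "ch", "o"])
print(toy_table)
```

One such table would be built for each of the three segment positions.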
I ciphered the English text by first splitting it into 3-letter groups and mapping each group to a Voynichese word made of prefix+stem+suffix.
in the beginning god created the
INT HEB EGI NNI NGG ODC REA TED THE
she,ckh,ar ch,_,es _,kched,or l,ckh,or l,kched,r che,ted,ary sh,_,ey qo,_,ol qo,t,y
Sheckhar ches kchedor lckhor lkchedr chetedary Shey qool qoty
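The worked example above can be reproduced with a few lines of Python. This is a sketch, not the original code: the three dictionaries contain only the entries that can be read off the worked example and the 'ol' ambiguity example, so any other letter raises a KeyError. The names regroup and encode are hypothetical, and the handling of a final group shorter than 3 letters (only the first tables are used) is my assumption.

```python
# Partial tables recovered from the worked example only; "" is the empty segment.
PREFIX = {"I": "she", "H": "ch", "E": "", "N": "l", "O": "che",
          "R": "sh", "T": "qo", "A": "o"}
STEM = {"N": "ckh", "E": "", "G": "kched", "D": "ted", "H": "t"}
SUFFIX = {"T": "ar", "B": "es", "I": "or", "G": "r", "C": "ary",
          "A": "ey", "D": "ol", "E": "y", "L": "l"}
TABLES = (PREFIX, STEM, SUFFIX)

def regroup(text, size=3):
    """Drop spaces and punctuation, then cut the letters into fixed-size groups."""
    letters = [c for c in text.upper() if c.isalpha()]
    return ["".join(letters[i:i + size]) for i in range(0, len(letters), size)]

def encode(text):
    """Map each 3-letter group to prefix+stem+suffix using the partial tables."""
    words = []
    for group in regroup(text):
        # zip pairs the 1st letter with PREFIX, the 2nd with STEM, the 3rd with SUFFIX.
        words.append("".join(table[c] for table, c in zip(TABLES, group)))
    return " ".join(words)

print(regroup("in the beginning god created the"))
print(encode("in the beginning god created the"))
```

Note that regroup discards the original word boundaries entirely, which is what drives the effects on MATTR and reduplication.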
I applied the same process to a version of the English text where word order was randomly scrambled.
These plots show the % of perfect reduplication vs MATTR 200. For comparison, I also included the text files from Brian Cham's corpus. In the plot on the right, I removed the outlier PML file.
[attachment=5490]
The encoding process has these effects:
- MATTR is greatly increased. This is because MATTR is lowered by the regularity of word sequences in grammatical text: Genesis is particularly repetitive and has a particularly low MATTR. The encoding process destroys word patterns, since spaces are reassigned when the 3-letter groups are created.
- Reduplication is reduced: this is not visible for the original Genesis file (where reduplication is ~0%), but it is illustrated by the scrambled file. The reason is analogous to the increased MATTR: reduplication also depends on words, so it too is affected when words are destroyed.
Note that the two measures discussed above do not depend on the mapping table. They are purely an effect of splitting words into 3-letter groups.
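For reference, the two measures can be computed roughly like this. It is a minimal sketch under assumptions: MATTR is taken as the mean type-token ratio over all sliding windows of 200 tokens, and "perfect reduplication" as the percentage of adjacent token pairs that are identical. The exact definitions behind the plots may differ, and the function names are hypothetical.

```python
def mattr(tokens, window=200):
    """Moving-average type-token ratio: mean of distinct/window over all windows."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def reduplication_pct(tokens):
    """Percentage of adjacent token pairs in which both tokens are identical."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return 100.0 * repeats / (len(tokens) - 1)

# Toy example with a window of 2 instead of 200.
sample = "daiin daiin ol chedy chedy qokeedy".split()
print(mattr(sample, window=2), reduplication_pct(sample))
```

Both functions only look at the token sequence, which is consistent with the point that these two measures do not depend on the mapping table.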
BTW, this also connects to the problem of labels mentioned by Anton above: with this system, a Voynichese word is typically not enough to encode a plain-text word. But labels appear to be words and have no deep structural difference from words in paragraphs.
The next plot shows measures that depend on the specifics of the table. The X axis shows that the cipher greatly reduces character conditional entropy, making it comparable with the notoriously anomalous values found in the VMS. The Y axis shows that average word length is increased with respect to the original English and the value for the cipher is slightly greater than that for the VMS (~6 vs ~5.2).
[attachment=5491]
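The two quantities in this plot can be sketched as follows. I assume here that character conditional entropy means the entropy of a character given the preceding one, computed as H(pair) - H(first character), and that average word length is the mean over tokens. How spaces are treated is left to the caller, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def conditional_entropy(text):
    """H(next char | current char) = H(bigrams) - H(first chars of bigrams)."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    return entropy(pairs) - entropy(firsts)

def mean_word_length(tokens):
    """Average number of characters per word token."""
    return sum(len(t) for t in tokens) / len(tokens)

# "abac": after 'a' comes 'b' or 'c' (1 bit), after 'b' always 'a' (0 bits).
print(conditional_entropy("abac"))
print(mean_word_length("sheckhar ches kchedor".split()))
```

A strictly alternating string such as "abababab" gives a conditional entropy of 0, since each character fully determines the next.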
Finally, I checked the distribution of word lengths in the dictionary. As Stolfi observed, Voynichese words are distributed along a binomial curve. For European languages the distribution has a longer right-side tail (due to longer words): this is shown by the green squares that have higher values than the fitting curve for words longer than 8 characters. Both Quire 13 and the ciphered text are quite close to their bell curves; here too, one can see that the cipher text results in longer words.
[attachment=5489]
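A comparison of this kind can be sketched as below. The fitting procedure here is my assumption, not necessarily the one used for the plot: lengths are counted over distinct words (the dictionary), the binomial parameter n is chosen by hand, and p is set by matching the mean length.

```python
from collections import Counter
from math import comb

def length_distribution(words):
    """Relative frequency of each word length over distinct words (types)."""
    types = set(words)
    counts = Counter(len(w) for w in types)
    return {k: counts[k] / len(types) for k in sorted(counts)}

def binomial_pmf(n, p):
    """Probabilities of 0..n successes for a Binomial(n, p) variable."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def fit_by_mean(dist, n):
    """Binomial(n, p) curve whose mean n*p equals the mean observed length."""
    mean = sum(k * v for k, v in dist.items())
    return binomial_pmf(n, mean / n)

# Toy lexicon, not real data: lengths 1, 2, 2, 3 over four distinct words.
dist = length_distribution(["a", "bb", "bb", "cc", "ddd"])
curve = fit_by_mean(dist, n=4)
for length, observed in dist.items():
    print(length, observed, round(curve[length], 3))
```

A longer right-side tail would show up as observed values above the curve at the larger lengths.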
Emma pointed out to me that this cipher system could be unable to generate a sufficient number of short words. This simple experiment confirms the existence of this problem. The formulation of the table should leave some room to adjust things, but if the empty prefix and the empty stem must each correspond to a single Latin-alphabet letter, then their frequency cannot be very high, so I am not sure that a solution exists. Similarly, I have not looked at how closely the lexicon of the cipher text matches that of Quire 13: here too, the table could be built in a more sophisticated way that would likely result in a better match.
In conclusion, I confirm that I am impressed by Rene's proposal. The method is simple and it does a good job of imitating Voynichese word structure. As Rene pointed out both in the paper and in this thread, there are other features that do not appear to be explainable by a similar approach.
It is great to read of a cipher-oriented idea for the VMS that can be tested and analysed! This could be the first time I have seen such a well thought-out and well-presented cipher hypothesis.