The Voynich Ninja

Full Version: Word entropy in reverse direction
On this page of Rene's site, the conclusion is that Voynich words have entropies similar to those of comparator languages, but that the entropy is distributed differently. Specifically, there is less information at the start of a word and more toward the end.

My question is whether this conclusion holds only for left-to-right reading, or whether it would hold in either reading direction. What I mean is this: do Voynich words carry more information specifically on the right-hand side (so toward the end in normal reading direction) and less on the left, or would any 4-gram within a word show a comparably high level of information?

(I guess it is the latter, as many words are longer than 4 characters.)
What I took away from Rene's page is his conclusion that "the information contained in Voynich MS words is similar to that in Latin or Italian words, but the information is more equally distributed over the characters while the words are shorter."

A first thing to note is that he used the CUVA alphabet, which I agree is a better standard than EVA for statistical analysis. CUVA reduces the predictability introduced by EVA, which has the side effect that words in CUVA are shorter. With EVA I would predict a less uniform entropy distribution, because things like EVA-n would affect the entropy statistics in one very specific spot.
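As a minimal sketch of why this happens (the merge table below is made up for illustration and is NOT Rene's actual CUVA mapping), collapsing EVA character groups into single symbols immediately shortens the words and removes some highly predictable transitions:

```python
# Hypothetical merges of EVA multi-character groups into single symbols.
# These placeholders are NOT Rene's actual CUVA table; they only
# illustrate the effect. Longer patterns must be applied first.
MERGES = [("iin", "N"), ("ch", "C"), ("sh", "S"), ("ee", "E")]

def cuva_like(word: str) -> str:
    """Collapse multi-character EVA groups into single symbols."""
    for src, dst in MERGES:
        word = word.replace(src, dst)
    return word

for w in ["chedy", "daiin", "shedy"]:
    print(w, "->", cuva_like(w), f"({len(w)} -> {len(cuva_like(w))} chars)")
# chedy -> Cedy (5 -> 4), daiin -> daN (5 -> 3), shedy -> Sedy (5 -> 4)
```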

But when these things are remedied (as with CUVA), my understanding of what Rene wrote is that the entropy distribution within a VM word is more even than in Latin. I am not sure I can understand this intuitively, though. Is it because the number of "options" in Voynichese is lower at each given spot, so that the next spot still has a relatively large impact on the information content?

Why does the entropy per character in Latin decrease from the first position to the second and so on? Do I understand correctly that the first character could be almost any character (low predictability), but the second character must be able to combine with the first, so its predictability is higher?

Anyway, to answer your question: if I understood everything correctly, I would predict that reversing the direction leads to similar results. There is no single hotspot of information content in a VM word, but the combination of parts still makes overall word entropy fairly normal.
After a few failed attempts yesterday, today I managed to replicate Rene's experiments. I could then also run the same code on reversed files. These are the results (the top three lines of each table are comparable with Rene's tables):

[attachment=6585]

The results for the reversed files are considerably different in all cases. For all three files, the last character is less uncertain than the first (basically, suffixes are more constrained than prefixes). In particular, the reversed results for Italian show that the last character has a much lower entropy than the first: I guess this is because so many Italian words end in one of 'a', 'e', 'i', 'o'.
The very low value for the second-to-last CUVA character (1.7) is likely due to 'D', which almost invariably appears when the last character is 'Y' (this is Currier B, with its well-known abundance of words ending in EVA:dy). The entropy at the third position from the end rises again because -DY can be preceded by several options: EDY, UDY (EVA:eedy) and ODY are all among the ten most frequent suffixes.
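For anyone who wants to try this at home, here is a simplified sketch of the computation (not the exact code I used; 'words.txt' is a placeholder for a whitespace-separated transliteration, and I take each figure to be the conditional entropy of a character given all the preceding characters in the word):

```python
from collections import Counter
from math import log2

def positional_entropies(words, max_pos=4):
    """For k = 1..max_pos, estimate H(X_k | X_1..X_{k-1}): the entropy of
    the k-th character of a word given the characters before it.
    Words shorter than k characters are skipped at position k."""
    entropies = []
    for k in range(max_pos):
        prefix_counts = Counter()  # counts of the k preceding characters
        joint_counts = Counter()   # counts of prefix + k-th character
        for w in words:
            if len(w) > k:
                prefix_counts[w[:k]] += 1
                joint_counts[w[:k + 1]] += 1
        total = sum(joint_counts.values())
        h = -sum((n / total) * log2(n / prefix_counts[seq[:-1]])
                 for seq, n in joint_counts.items())
        entropies.append(h)
    return entropies

# 'words.txt' is a placeholder: a transliteration split into words.
words = open("words.txt", encoding="utf-8").read().split()
print("forward :", [round(h, 2) for h in positional_entropies(words)])
# Reversing every word re-runs the same measurement from the word's end,
# so "position 1" of the reversed run is the last character of each word.
print("reversed:", [round(h, 2) for h in positional_entropies([w[::-1] for w in words])])
```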

(03-06-2022, 09:44 AM)Koen G Wrote: Why does the entropy per character in Latin decrease from the first position to the second and so on? Do I understand correctly that the first character could be almost any character (low predictability), but the second character must be able to combine with the first, so its predictability is higher?

This is my understanding of these figures as well.
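Formally (assuming the tables report conditional entropies; Rene's exact definition may differ in details), the value at position k would be

```latex
H_k = H(X_k \mid X_1,\dots,X_{k-1})
    = -\sum_{x_1,\dots,x_k} p(x_1,\dots,x_k)\,\log_2 p(x_k \mid x_1,\dots,x_{k-1})
```

Since conditioning can only lower entropy on average, H_k <= H(X_k): the first character is measured without any context, while each later character is partly determined by what precedes it, which matches the intuition above.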