The Voynich Ninja

Full Version: entropy splits in parts of lines?
It is well known that certain structures and patterns are typical of the beginnings of lines, and very different structures and patterns are typical of the ends of lines, in the Voynich ms text. But has anyone studied the relative entropy levels of just the beginnings of lines, and just the ends of lines, compared with the whole text? For example, what is the entropy of just the first halves of all lines of the ms text? And of just the second halves of all lines? What about just the first three vords of all lines, and just the last three vords of all lines? 
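The comparison asked about here is straightforward to run once a transliteration is tokenized into vords. A minimal sketch in Python, assuming lines are already lists of vords (loading an actual EVA transliteration is left out, and the sample lines below are toy data, not real Voynich text):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (bits per token) of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def line_part_entropies(lines, k=3):
    """Compare vord-level entropy of the first k vords vs. the last k vords
    of each line. `lines` is a list of lines, each a list of vords."""
    firsts = [v for line in lines for v in line[:k]]
    lasts = [v for line in lines for v in line[-k:]]
    return shannon_entropy(firsts), shannon_entropy(lasts)

# Toy illustration only (not real Voynich data):
lines = [
    "daiin chedy qokeedy shedy qokain ol".split(),
    "daiin shedy qokain chedy qokeedy am".split(),
]
h_first, h_last = line_part_entropies(lines, k=3)
```

The same pair of functions extends directly to first and second halves of lines by slicing with `line[:len(line)//2]` and `line[len(line)//2:]` instead of a fixed k.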

I suppose the expectation would be that all such entropy levels would be even lower than that of the whole ms text, since the beginnings of lines by themselves and the ends of lines by themselves would be expected to be even more similar to each other than the whole text is. But I wonder if that has ever been tested. There is also plenty of repetition within each line, from beginning to end, so I wonder if breaking apart line beginnings and line endings will really lower entropy so much. I also wonder if there is any significant difference in the entropy of just the beginnings of lines and just the ends of lines. 

The reason I ask is that I have come across certain groups of lines in my research where I find it easier to interpret certain parts if I only read the first few vords of each line and continue with the first few vords of the next line, ignoring the rest of each line. But this may well be an illusion on my part, which is why I am curious about the entropy statistics of just the beginnings of lines and just the ends of lines. 

It would be possible to encrypt a message by only making the first three words of each line meaningful, and padding out the rest of each line with nonsense nursery rhyme repetition of the sounds of the first three words:

meet me at fleet be mat sleet we vat
the back door he lack moor we sack poor
of jons house off cons mouse scoff nons louse
monday at noon sunday cat moon runway sat loon
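Stripping the padding is mechanical once the convention is known; keeping only the first three words of each line recovers the message:

```python
# The padded example from above, verbatim.
ciphertext = """\
meet me at fleet be mat sleet we vat
the back door he lack moor we sack poor
of jons house off cons mouse scoff nons louse
monday at noon sunday cat moon runway sat loon"""

# Keep only the first three words of each line and join them.
plaintext = " ".join(
    " ".join(line.split()[:3]) for line in ciphertext.splitlines()
)
print(plaintext)
# -> meet me at the back door of jons house monday at noon
```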

This doesn't seem like a very secure level of encryption, but now combine it with a simple substitution cipher, or even better a mysterious invented script that no one else knows. Then it would become rather difficult to decipher. By the standards of the early 15th century, it would have probably been quite secure. And both the concept of simple substitution (possibly incorporating elements of a verbose cipher, as we have recently been discussing elsewhere on this forum) and the steganographic concept of hiding the words of a meaningful message within a larger nonsensical message were simple enough to have been known and possibly employed in the time period of the Voynich ms. 

Also, in filling out the nonsensical parts of each line, the author could very well have followed some of the principles of the "auto-copying" or gibberish theory that has also been discussed recently on this forum. That could have been deliberate, or it even could have happened unconsciously as the author thought of nonsense rhyming words and phrases to fill out each line. I found myself doing it as I wrote the lines above, at first accidentally and then deliberately after I noticed I was doing it. It's only natural to take "inspiration" from the other nonsense words and phrases that are already in the immediate vicinity of the line that one is filling out. And for the author and the intended recipient, it doesn't matter what those words and phrases are anyway. 

If the principle of filling out the nonsense parts was based on choosing rhyming words and phrases, as in my example above, then one would also expect the middles and ends of words to have much lower entropy (more predictability) than the beginnings of words, a statistical feature that we also observe to be present in the Voynich ms text. 
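This prediction can be checked by computing character entropy position by position within words. A sketch with a deliberately rhyming toy vocabulary (an illustration of the padding principle, not real data): the shared endings collapse later positions to near-zero entropy while the varied beginnings stay unpredictable.

```python
import math
from collections import Counter

def positional_entropy(words, pos):
    """Entropy (bits) of the character at index `pos`, over words long enough
    to have that position."""
    chars = [w[pos] for w in words if len(w) > pos]
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy rhyming "padding" vocabulary: shared endings, varied beginnings.
words = ["fleet", "sleet", "tweet", "greet", "sheet"]
h_by_pos = [positional_entropy(words, i) for i in range(5)]
# Positions fixed by the rhyme (-eet) have zero entropy.
```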

I am aware that the beginnings of lines also tend to be repetitive in the Voynich ms text, so I do not at all expect the idea I raise here to solve all the problems inherent in the difficult structures and patterns of the ms text. But I'm curious if the entropy breakdowns by parts of lines may give us some clues and leads to follow for further and more sophisticated examination of these ideas.
Dear Geoff


I cannot back my assumption up yet with hard facts, but my overall impression is that the structure of the VMS, although it obeys Zipf's law, has high entropy. I think it's unorganized, unlike a natural language. Ironically, I just posted something about "VSO".

So I believe that after an exhaustive look at every set of first three vords, trying to find meaning through some substitution cipher, you will find yourself in another hole: meaningful-looking input, gibberish output. If the VMS behaves a certain way, I assume the whole script follows suit. To me the VMS is a gibberish trap!

Remember, if you do decode a large text in this fashion, other identical vords that are not among the first three of a line would decode too. I think the average line contains anywhere from eight to twelve tokens, so what you're suggesting is that only roughly 25% to 37% of the VMS is encoded.
(01-10-2020, 01:22 AM)geoffreycaveney Wrote: If the principle of filling out the nonsense parts was based on choosing rhyming words and phrases, as in my example above, then one would also expect the middles and ends of words to have much lower entropy (more predictability) than the beginnings of words, a statistical feature that we also observe to be present in the Voynich ms text.

I don't know if the parts of the lines have been tested, but Rene Z. definitely tested the entropy of each individual character position of words in the "stars" section (Quire 20), and unfortunately he did not find what you are describing. Instead, the beginnings of words were much more predictable than in Latin and Italian, and the characters became increasingly unpredictable as the word went on -- resulting in words that were collectively of the same per-word character entropy as natural language. But the distribution of the entropy values across the character positions differed from the natural languages tested (Latin and Italian).

Here is how Rene stated his conclusions.

  • First-character and second-character entropy of Voynich MS text are significantly low, and thus especially also the initial-bigram entropy. However, in Voynichese, the third character of each word is as unpredictable as the second, and the fourth contains much more information than is the case in Latin or Italian
See the section called Entropy per Character on Rene's page for other data and discussion related to this.

Of course, this could turn out differently if the analysis were redone using single-character substitutions for the bigrams that Koen identified as impacting overall second-order entropy -- but as Koen's work has shown, it is hard to predict how the entropy results change under such alterations. I would imagine this plays out at the character-level entropy as well. At least Rene's work does say that the initial bigram is collectively of lower entropy than in Latin and Italian, so substituting one character in those initial two places likely will not increase the unpredictability. But again, this is just a guess.
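One way to probe this empirically is to compute second-order (conditional) entropy before and after collapsing a chosen bigram into a single new symbol. A rough sketch with a toy string and an arbitrary substitution ('ch' -> '1' here is a placeholder, not a claim about Koen's actual bigram set):

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy H(x_n | x_{n-1}) in bits, a standard h2 measure.
    `text` is a sequence of symbols (e.g. a list of characters)."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    n = sum(bigrams.values())
    h = 0.0
    for (a, b), c in bigrams.items():
        p_ab = c / n                 # joint probability of the bigram
        p_b_given_a = c / contexts[a]  # conditional probability of b after a
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Toy text, not real Voynich data.
text = "chedychedyshedychedy"
before = h2(list(text))
after = h2(list(text.replace("ch", "1")))
```

Run on a full transliteration rather than a toy string, the before/after difference would show whether a given bigram substitution moves h2 toward natural-language values.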
That's an interesting approach, with respect to the possible multipass nature of the text (see the thread about baseline jumps & multipass).

I think that's a good idea.