The Voynich Ninja
Character entropy of Voynichese - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Character entropy of Voynichese (/thread-148.html)



RE: Character entropy of Voynichese - Torsten - 20-03-2016

(20-03-2016, 12:26 PM)Davidsch Wrote: Torsten: I was focussing on the entropy here. Too many short words? I am sure the text is too short to show or not show a binomial distribution.

Words similar to each other are not common enough in your text. This is the same as saying that the entropy values are not low enough. The values for your sample text are H0 = 4.52, H1 = 4.07, H2 = 2.67 (see the linked page for the entropy values of the VMS).


RE: Character entropy of Voynichese - Davidsch - 20-03-2016

Ok, i am willing to see if i can build a text that has the best entropy possible based on a language.

For that I need a little more than that explanation and the linked page.

H = log2(number of alphabet letters) is too general.

I need to know what values you are aiming for and how to apply the exact formulas.


RE: Character entropy of Voynichese - Torsten - 21-03-2016

In the VMS a character is largely predictable from its context (see http://www.cs.dartmouth.edu/~sravana/papers/voynich_latech.pdf, p. 81, "4.2 How predictable are letters within a word?"). Therefore the second-order entropy of the VMS is low.
For your text such an effect exists for the characters in front of 'n': the character before 'n' is most probably an 'e' (10 times) or an 'i' (9 times).


RE: Character entropy of Voynichese - ReneZ - 21-03-2016

(20-03-2016, 02:42 PM)Torsten Wrote: Words similar to each other are not common enough in your text. This is the same as saying that the entropy values are not low enough. The values for your sample text are H0 = 4.52, H1 = 4.07, H2 = 2.67 (see the linked page for the entropy values of the VMS).

The text given by Davidsch is a bit short for computing a reliable h2, but I can't say what a good minimum length would be.
The problem is that many low-frequency character pairs appear either 0 or 1 times.

Still, both h1 and h2 are way higher than for the Voynich MS (cf. Bennett's values: 3.66 and 2.22), for a similar character-set size.
However, I consider this type of exercise very useful.


RE: Character entropy of Voynichese - Anton - 02-11-2016

I have purchased Bennett's book myself. I have already provided a general review of the book in an earlier post. Now I would like to share what's actually written there about the VMS.

The VMS is dealt with in Chapter 4, "Languages". The chapter is large and accounts for more than 20% of the eight-chapter book. However, not all of the chapter is dedicated to the Voynich MS. It deals with various issues: occurrences of valid language patterns in randomly generated texts, identification of the authors and languages of unknown texts, compression in message transmission, and encryption and decryption of text messages.

Only one section is dedicated to the VMS: the last section of the chapter, 4.22. It is only nine pages long, with one page dedicated to problems for students and two pages to scans of two folios of the VMS. That is not what I expected from what I had read about this book on the Internet; I expected much more of the volume to be dedicated to the VMS. However, the actual state of things is reasonable: the whole book is about solving tasks with a computer, so the VMS is just one interesting illustration or application. It is not given any dedicated focus, neither in the book as a whole nor even in Chapter 4.

That said, four pages are dedicated to a brief history of the VMS (with focus on the names of Dee and R. Bacon) and to the attempts to analyse it (Newbold, Brumbaugh). The names of Yardley, Friedman and Tiltman are mentioned, as well as articles by Oneil (sic!), Friedman and Tiltman.

So only three pages are left for the discussion of the statistical properties of the VMS, which is much less than I expected.

The alphabet that Bennett uses is as follows (p. 192). He considers a, i, l, o, e, h, p, f, t, k, r, n, q, d, y, v and x standalone characters. He treats composite (benched) gallows the same way as EVA transcriptions do: as sequences of the elementary characters outlined above, with h as the final character and t counted twice in the case of "gallows coverage". He does, however, treat iin as one single character and in as another. Lastly, he distinguishes two variants of s: one as the s in sh or in benched gallows, the other as s occurring on its own. The former variant he interprets as e with an apostrophe. However, he does not count the apostrophe as a separate character (on the grounds that "the apostrophe appears only to follow c <that is, e> throughout the entire manuscript", which, of course, is wrong). Hence he treats these two variants of s as two different characters.

It is not clear whether Bennett adopted any characters other than those listed above. It is also not clear whether he included the space as a character in the calculation, but this seems likely, since the space is recognized as such throughout all the preceding material of Ch. 4.

The text analyzed was the first ten pages of the VMS. No intermediate counts are provided, only the final result (p. 193), in bits per character here and hereinafter:

h1 = 3.66
h2 = 2.22
h3 = 1.86

where the subscript index stands for the order of the entropy.

Earlier in the book (p. 140), Bennett provides some results for natural languages, such as:
  • English contemporary (cited from Shannon 1951)
  • English Chaucer
  • English Shakespeare
  • English Poe
  • English Hemingway
  • English Joyce
  • German Wiese
  • French Baudelaire
  • Italian Landolfi
  • Spanish Cervantes
  • Portuguese Coutinho
  • Latin Caesar
  • Greek Rosetta Stone
  • Japanese Kawabata
All tests, except the first and the last, used the 28-character Latin alphabet (letters, space and apostrophe). Shannon seems to have omitted the apostrophe, and the Japanese test was based on a 77-character set (76 kana and the space). For German, characters with umlauts were replaced with the respective letter followed by the letter "e".

The results can be summarized as follows (my summary differs slightly from that on Rene's website, and Rene also seems to omit the results for Japanese):

h1 = 3.91 ... 4.81
h2 = 3.01 ... 3.63
h3 = 2.12 ... 3.1

The lowest value of the 1st order entropy is observed for Portuguese, and the highest - for Japanese.

The lowest value of the 2nd order entropy is observed for Spanish, and the highest - for Japanese.

The lowest value of the 3rd order entropy is observed for English Chaucer, and the highest - for English contemporary.

However, Bennett notes that Shannon's calculation of the 3rd order entropy was approximate and based on inaccurate data. With that excluded, the highest value would be for English Poe (2.62). Also, for almost half of the cases (Japanese included) the 3rd order entropy was not calculated.

Note that all calculations were made for texts of "narrative" style; no heavily condensed or heavily abbreviated texts were put under test.

Note also how 77-character Japanese yields higher entropies than the languages with 27- or 28-character alphabets. Actually, it does not look very practical to me to directly compare entropies of languages with different alphabet sizes. Rather, redundancies should be compared. Maximum character entropy is obtained when all characters are equally probable (a mathematically proven fact). Thus the maximum 1st order entropy is log2(N), where N is the number of characters in the alphabet; this value is also sometimes called the "0th order entropy". So, comparing Japanese with English Joyce (the case with the highest 1st order entropy after Japanese, namely 4.144), we get log2(77) - 4.809 = 6.267 - 4.809 = 1.46 for Japanese and log2(28) - 4.144 = 4.807 - 4.144 = 0.663 for Joyce. This means that, although the character entropy of Japanese is notably higher, the Japanese alphabet in question is notably more redundant than the one used by James Joyce for his English works, because for the latter the character entropy is closer to the maximum possible value.

For Voynichese, assuming Bennett's 22-character alphabet (space included), we get log2(22) - 3.66 = 4.46 - 3.66 = 0.80, which is way better than that of Japanese and close to that of Joyce.

It is strange that Bennett does not say anything about this matter (although he does discuss ratios such as h1/h2), while Stallings at least recognizes it, making use of "differential" entropies. The latter also shows that h1 - h2 of Voynichese is considerably higher than in natural language samples. One will find that this is also true with respect to Hawaiian.

Returning to Bennett, he provides an interesting observation: h(n) of Voynichese is approximately the same as h(n+1) of Western European languages.

It is often asserted that ciphers tend to increase the character entropy of a text. While stating essentially the same on p. 194, Bennett nevertheless readily provides two examples from Poe where this is not the case: one an "extreme" type of cipher where the whole message is effectively contained in the key, the other a multiple-substitution cipher. Neither of the two would directly generate text comparable to Voynichese, but Bennett discusses only entropies, leaving Voynich "morphology" and "grammar" out of scope.

Finally, Bennett states that there are natural languages with low entropies, providing the example of Hawaiian with entropy values as follows:

h1 = 3.20
h2 = 2.45
h3 = 1.98

This calculation was performed on the source text from a XIX c. book, in the 13-character alphabet (12 letters plus space) introduced by missionaries in the mid-1800s. "It has been estimated", says Bennett, "that only about 100 people still" used this language in daily communication at the time of his writing.

Calculating h0 - h1 for this variety of Hawaiian, one gets log2(13) - 3.20 = 3.70 - 3.20 = 0.50. Good work by the missionaries.  Smile


RE: Character entropy of Voynichese - Anton - 03-11-2016

Quote: One will find that this is also true with respect to Hawaiian.

Just to clarify: what I meant is not that Hawaiian resembles Voynichese in this respect (it does not), but that Voynichese has h1 - h2 considerably higher than natural languages, Hawaiian included.

Sorry for the awkward phrasing.


RE: Character entropy of Voynichese - Davidsch - 03-11-2016

You forgot to mention Maori.

I did not calculate it, but it must have an entropy as low as or lower than Hawaiian.
More info at the linked page.


RE: Character entropy of Voynichese - Anton - 03-11-2016

(03-11-2016, 03:26 PM)Davidsch Wrote: You forgot to mention Maori.

I did not calculate it, but it must have an entropy as low as or lower than Hawaiian.
More info at the linked page.


Hi David,

It's probably not me who forgot to mention Maori, but Bennett Smile  To be fair to Bennett, though, he does not speak of Hawaiian exclusively, but mentions that "there actually are languages in some parts of the world" (that exhibit low character entropy values), and in the next sentence he uses the phrase "Polynesian languages", suggesting that they were not known to Bacon or the other characters in the plot (at the time this book was published the whole plot revolved around Dee and Bacon, and even D'Imperio's book had not yet been published). From the footnote on p. 194 it seems that Bennett simply had a handy opportunity to explore Hawaiian in the person of one Thaddeus P. Dryja, who tracked down the Hawaiian references and prepared the ASCII-coded tape of the source text for computation.

As I suggest above (and as some clever men did a while before me), it is not so much the absolute values of the entropies that need to be compared, but rather the "differential" ones. And one finds that while h0 - h1 of Voynichese is close to that of natural languages, its h1 - h2 is way higher than that of natural languages (I don't know about Maori, but Hawaiian included).

My further thought, a trivial one, is that (as already expressed above) even within the same language the character entropy depends on how the alphabet is constructed. So what one needs is to check how the Voynichese entropies vary with different approaches to constructing the Voynichese alphabet. Actually, if one managed to "construct" a Voynichese alphabet such that the differential entropies were close to those of natural languages, that would be a huge step towards decrypting Voynichese.

So one could start by checking Bennett's calculations. Bennett Jr (he passed away in 2008) was a renowned scientist, so he clearly knew what he was doing, and there is little probability that he made any outright mistakes, but even such things happen. Next, one could explore the following paths:
  • "modified" EVA - the same as EVA but with benched gallows considered as standalone composite characters;
  • curve-line system - this needs to be modified to adopt all critique and developments made after Cham's article - considering apostrophes, horizontal crossbars etc.
Maybe I'll find time to do that myself, but I'd be more than happy if anyone else did it, since the task is tedious Dodgy


RE: Character entropy of Voynichese - Anton - 03-11-2016

By the way, none of Bennett's calculations for natural languages incorporates digits, which would add ten more characters to the alphabet. Of course, for texts such as "Hamlet" this is not crucial, since digits, if they occur at all, are very rare there and do not influence the end result. But what about, e.g., an apothecary's notebook?


RE: Character entropy of Voynichese - Koen G - 03-11-2016

Guess I should have paid more attention in maths class; all this talk of levels of entropy goes right over my head Confused