The Voynich Ninja

An Essay on Entropy: what is it, and why is it so important?
(27-04-2022, 11:30 AM)Ruby Novacna Wrote:
(27-04-2022, 12:44 AM)kckluge Wrote: if you take some set of 12th cent. Persian texts and map them to Voynich glyphs using his theory
Why should you map Persian texts into Voynich glyphs? Can't you calculate their entropy directly?

In his theory, the mapping from Voynich glyphs to characters in the Persian script is 1-to-many in some cases -- going from the Persian script to Voynich script is unambiguous, but going in the other direction is not. If you want to compare the statistics of Voynich text (after removing what he claims are nulls) with those of some corpus of Persian texts, putting the Persian text into the Voynich script rather than the other way around would seem to be the appropriate way of doing it.
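
To make the direction of that comparison concrete in code: the idea is to push the transliterated Persian text through the claimed letter-to-glyph table first, and only then compute statistics on the result. The table below is a made-up stand-in (it is not his actual table); it only illustrates the many-to-one shape of the mapping.

```python
# Invented stand-in for a Persian-letter -> Voynich-glyph table; note that two
# different Persian letters map to the same glyph, so the reverse direction is ambiguous.
persian_to_glyph = {"b": "o", "p": "o", "t": "k", "s": "d"}

def to_voynich(translit_text):
    """Map transliterated Persian letters to their claimed Voynich glyphs, one by one."""
    return "".join(persian_to_glyph.get(ch, ch) for ch in translit_text)

# Entropy comparisons would then be run on to_voynich(persian_corpus),
# not on the Persian corpus in its own script.
print(to_voynich("bts"))
```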

(27-04-2022, 11:30 AM)Ruby Novacna Wrote:
(27-04-2022, 12:44 AM)kckluge Wrote: it's an issue of methodology
Speaking of methodology, the translation you mention, which I have just looked at, is not mature enough to be testable: I have not even found a match between a transcription (in EVA, for example) and the proposed translation, unless such a page exists and I have not seen it.

If you go to the main page, the first three posts listed ("Folio 58r of Voynich Manuscript", "The Lizard of Folio 73r", and "Voynich Manuscript: Fungi and Ants") all show claimed translations of snippets of text where he shows his work in sufficient detail to follow along once you understand how he thinks null removal works. The best of the three (in terms of following along with what he's doing, especially for those of us who don't read the Persian script) is the one that starts with a drawing of the chunk of Voynich text he's working with, provides his transcription of that text, shows the iterations of null removal, converts the de-nulled text to the Persian script (making some set of choices where he has multiple options), and then translates it.

All of which is utterly beside the point, because the purpose of referencing it as an example wasn't to advocate for the theory. Even if you had been right about a lack of proposed translations, it would still be the case that purely on the basis of the claimed mapping from Persian script characters to Voynich glyphs and the supposed null removal method it would be possible to perform the entropy comparisons suggested.
(27-04-2022, 09:13 PM)kckluge Wrote: it would still be the case that purely on the basis of the claimed mapping from Persian script characters to Voynich glyphs and the supposed null removal method it would be possible to perform the entropy comparisons suggested.
In this case there is no need to use heavy tools to see that the work is in its early stages and the data is not (yet) reliable.
(27-04-2022, 02:26 AM)kckluge Wrote: * With regard to the whole are "words" _words_ issue -- I've seen lots of papers talk about Voynich "words" and Zipf's (word frequency vs word frequency rank) Law. There is a different Zipf's Law relating word length to word frequency rank -- more frequent words are also shorter (there is a reference for anyone wanting to pull on that thread) -- has anyone ever fit that vs. the Voynich vocabulary?

Well, this is mildly embarrassing -- I looked at the paper I cited and realized I was confused. The "other" Zipf's Law it's talking about is total relative frequency of all words (tokens, not types) of length L vs L -- so the binomial word length studies that compare "Voynichese" with natural languages have (implicitly or explicitly) looked at this.

Mea culpa.
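
For anyone who wants to check that statistic against a transcription, it is just the share of word tokens (not types) at each word length. A minimal sketch; the token list below is only a placeholder, and in practice it would come from a transcription file.

```python
from collections import Counter

def length_distribution(tokens):
    """Relative frequency of word tokens (not types) at each word length."""
    lengths = Counter(len(t) for t in tokens)
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}

# Placeholder word list standing in for a real transcription.
tokens = "daiin ol chedy aiin shedy qokeedy ol daiin or aiin".split()
print(length_distribution(tokens))
```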
That is certainly a good idea: not to feed the individual glyphs into the calculation, but the different VM words as wholes.
I can imagine that this would give a completely different result for the language comparison.
Hi, Koen,

Not going to lie, the entropy stuff has always made my brain hurt, so this is really helpful! Especially as, somehow, I had previously ended up with the completely erroneous impression that either h0 or h1 was about word-pair predictability and h2 about letter-pair predictability.

Would you be able to answer two questions through the same simple format?

  1. To what extent and why is it important when assessing h2 to divide it by h1? I think this is what I saw in other works on entropy. Why does the score that reflects how much your tea drinking follows certain patterns need to be divided by the score that reflects how even/varied your choice of tea tends to be?
     
  2. Why does the entropy problem rule out abbreviation in the manuscript? Or at least rule it out as a key mechanism? I know I read somewhere that abbreviation can't be part of the solution due to the entropy problem. It might have been Rene's site but it was a couple of years ago, and I can't exactly recall. I don't think it was saying that there's absolutely no abbreviation, rather that abbreviation can't be common in the text because otherwise the entropy score would become even worse. Would "unfolding" abbreviations (were we ever able to do it) really make character pairs more predictable? Surely it might in theory reveal a greater diversity of character pairs?

I suspect there is a really intuitive/common sense answer to both questions that I'm not seeing; grateful if you can help!
Those are both very pertinent questions, tavie. And both are somewhat tricky.

The way I think of h1 and h2 is that together they form an entropy fingerprint of a language. Dividing or subtracting the two values is an attempt to integrate both into a single number. In my most recent writing about entropy, I preferred to use a scatter plot instead, because this just shows both values as they are.

The thing to keep in mind is that texts with a similar h2 but different h1 (or the other way around) can be very different entropy-wise. For example, I have medieval German texts with an h1 similar to Voynichese (even one with a lower h1 than all Voynichese samples). But their h2 is much higher than that of Voynichese, so they are still not good candidates, and trying to map Voynichese to this type of German would fail spectacularly.

Let me try with the tea example. So let's say I find out that you are observing my tea consumption and I want to mess with your results. Instead of 10 types of tea, I buy 100 different types (this is actually possible). My h0 goes through the roof. If I consume all teas at a similar frequency, h1 will shoot up as well. This in turn may impact my h2 as well.

Now, h0 is an unreliable stat because it doesn't even take frequency into account. Like I said before, adding a single "&"-sign at the end of a 200-page book will impact its h0 the same way the frequent letter "e" does. But h1 tells us a bit more and is a bit more sturdy. If I add a letter and make it frequent, h1 will change more than if I just add a single token of the new type. So we can use h1 and h2 to fingerprint a text's entropy behavior.
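
For anyone who wants to play along at home, here is a minimal Python sketch of the three measures as used in this thread: h0 as log2 of the number of distinct symbols, h1 as the single-symbol (unigram) entropy, and h2 as the conditional entropy of a symbol given the previous one. The implementation details (treating the text as one flat string, spaces included) are my own simplifications, not a canonical tool.

```python
import math
from collections import Counter

def h0(text):
    """Zero-order entropy: log2 of the number of distinct symbols, frequencies ignored."""
    return math.log2(len(set(text)))

def h1(text):
    """First-order entropy: -sum p(x) * log2 p(x) over single-symbol frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def h2(text):
    """Conditional entropy of a symbol given the previous one: H(bigrams) - H(first symbols)."""
    bigrams = Counter(zip(text, text[1:]))
    total = sum(bigrams.values())
    h_bigrams = -sum((n / total) * math.log2(n / total) for n in bigrams.values())
    return h_bigrams - h1(text[:-1])

sample = "the quick brown fox jumps over the lazy dog"
print(round(h0(sample), 2), round(h1(sample), 2), round(h2(sample), 2))
```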

Some extreme examples:

Low h1 + low h2: one or a few types of tea dominate the selection, and teas are consumed in an ordered manner. If I drink mint and Earl Grey alternately, you have a good (50%) chance of predicting which tea I will drink even without knowing which tea I drank yesterday (low h1). But if you know which tea I drank yesterday, you have a 100% chance of predicting which one I will drink today.
Low h1 + high h2: a few types of tea dominate the selection, but there is no pattern. In this case, your guess between mint and Earl Grey will always have a 50% chance of being correct. For h1 these are decent odds, but h2 does not help us.
High h1 + low h2: many types of tea are consumed in similar frequencies, in an ordered manner. If I consume all 100 teas in a fixed order, you only have a 1% chance of guessing correctly without prior knowledge. But if you know what I consumed yesterday, you can predict confidently what I will drink today, despite the high h1. In this example and the first example, my odds of predicting today based on yesterday are the same, but the tea consumption landscape is completely different. Same h2 with a different h1 can mean we are looking at a very different text. 
High h1 + high h2: many types of tea are consumed in similar frequencies, but there is no pattern. It is hard to predict anything at all. 
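
To see these four regimes in numbers, one can build small synthetic "tea logs" and run the h1/h2 helpers from the sketch above over them. The sequences below are invented purely to reproduce the four cases; each letter stands for one day's tea.

```python
import random
random.seed(0)
# Assumes the h1/h2 helper functions from the earlier sketch are already defined.

# Low h1 + low h2: two teas, strictly alternating (Mint, Earl Grey, Mint, ...)
alternating = "ME" * 200
# Low h1 + high h2: two teas, chosen at random each day
two_random = "".join(random.choice("ME") for _ in range(400))
# High h1 + low h2: ten teas, always consumed in the same fixed order
many_ordered = "ABCDEFGHIJ" * 40
# High h1 + high h2: ten teas, chosen at random each day
many_random = "".join(random.choice("ABCDEFGHIJ") for _ in range(400))

for label, seq in [("low h1 / low h2", alternating), ("low h1 / high h2", two_random),
                   ("high h1 / low h2", many_ordered), ("high h1 / high h2", many_random)]:
    print(label, round(h1(seq), 2), round(h2(seq), 2))
```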

In summary, when comparing Voynichese to another text, we want to be able to compare the glyph inventory and its use. H1 and h2 together give a more complete picture. And when modifying a Voynich transcription, we want to avoid introducing an absurd amount of different glyphs. But since rare glyphs are not interesting for the "broad" stat of entropy, h1 is a better gauge than h0. 

2. Abbreviations: we tested this once, but I'm writing on my phone so I'm going by memory. One might hypothesise that abbreviating reduces entropy, because the same symbol is often found at the end of words. So the fact that the abbreviation symbol is followed by a space is predictable.

Now remember that Voynichese has a lower h1 than most texts. You should now know what will happen if we introduce frequent abbreviation symbols into a regular text: h1 will increase. So even if you do manage to decrease h2 (which is not guaranteed), you will push h1 the wrong way. I hadn't realized this before, but question 2 is actually a good example for question 1 :)

Edit: I forgot to add that some types of abbreviation will increase h2. For example, if you drop common endings, the final letter of words may become more unpredictable. Therefore, abbreviating will always increase at least one of h1 or h2, possibly both. And what we are looking for is a decrease.
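
A rough way to check this at home (not the original test Koen mentions, which is not reproduced here): take an ordinary sentence, replace a frequent ending with a new abbreviation symbol, and compare the values before and after. The sample sentence and the "ing" -> "~" rule below are purely illustrative.

```python
# Assumes the h0/h1/h2 helper functions from the earlier sketch are already defined.

# Toy abbreviation: write the common English ending "ing" as the new single symbol "~".
plain = ("the morning was passing slowly and the singing of the birds "
         "was filling the clearing while nothing was stirring")
abbreviated = plain.replace("ing", "~")

for label, txt in [("plain", plain), ("abbreviated", abbreviated)]:
    print(label, round(h0(txt), 2), round(h1(txt), 2), round(h2(txt), 2))
```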
Hi Tavie,
I'll add a few words to Koen's reply about abbreviations and conditional entropy.
Many medieval abbreviations work by replacing a frequent sequence (e.g. bigram / trigram) with a single symbol. This results in a shorter text because the sequence is frequent (if it were infrequent, the effect would be minimal) and because of course the original sequence is longer than the single symbol that replaces it. One way to see conditional entropy is that it measures the relative quantity of frequent bigrams: very frequent bigrams result in lower conditional entropy. Consequently, removing frequent bigrams tends to increase conditional entropy.

[attachment=6469]

For instance, this text reads:

et gemitibus incessanter clamemus

but this was rendered as something like:

2 gemitib9 incessant^ clamem9


The abbreviated text is 4 characters shorter than the actual text (about 10% shorter).
"Et" (and) is the most frequent Latin word. When you see an 'e', you can expect that the next letter will be 't', and you will often be right. The abbreviation that I have rendered as "2" represents the whole word "et", so this removes a good chance of guessing the next letter given the previous one. If _ represents space, the sequence _et_ becomes _2_: instead of three "easy" guesses (_e, et, t_), you get only two (_2, 2_).

Similarly, "us" is a frequent bigram at the end of Latin words. Again, guessing that 'u' is followed by 's' is a good bet. But this bet is made impossible if 'us' is rendered as the single symbol '9'.

Also, abbreviations were not applied consistently: in the same text you can find 'et' and '2', as well as 'gemitib9' and 'gemitibus'. Since the use of abbreviations was something that the scribe could apply or not at will for each word occurrence, guessing the next character is made even harder: you don't know if any particular word token will be abbreviated or not. One of the meanings of 'entropy' is exactly 'lack of order or predictability' and scribes could be highly unpredictable in their use of abbreviations.
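
In code terms, the mechanism Marco describes is just a many-letters-to-one-symbol substitution. The little dictionary below uses "2", "9" and "^" as stand-ins for the scribal signs, following his transliteration above; the function simply undoes the abbreviation.

```python
# Stand-ins for the scribal signs, following the transliteration in the post above.
expansions = {"2": "et", "9": "us", "^": "er"}

def expand(abbreviated_text):
    """Expand each abbreviation symbol back into the letter sequence it replaces."""
    for sign, letters in expansions.items():
        abbreviated_text = abbreviated_text.replace(sign, letters)
    return abbreviated_text

print(expand("2 gemitib9 incessant^ clamem9"))
# -> et gemitibus incessanter clamemus
```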

Finally, this subject was discussed by Lindemann and Bowern, who compared actual abbreviated historical texts in a few languages with normalized, unabbreviated versions:

Lindemann and Bowern Wrote: The usage of abbreviations and special characters has the effect of raising the conditional character entropy of the English, Icelandic, and Latin texts and taking them further from the values we find for Voynichese.
Another way to think about abbreviation is that it will target areas of a text that are low-entropy to begin with. A common glyph cluster, a common ending. These things can be abbreviated because they are low entropy. Their predictability makes it so that I can still read the text if they are gone or replaced. 

The little bit I wrote about information density also comes into play here. When abbreviating, you are looking for those bits that are easy to predict and hence carry little crucial information. The fact that they can be shortened or omitted implies that they were too "verbose" to begin with. This is why abbreviation will increase the entropy of a text.

Note: the abbreviations I have in mind here are ones that could be used by medieval scribes. I am sure we could invent some novel form of abbreviation that does decrease entropy, but then I wonder if the text would still be legible, because you'd have to shorten a text and reduce its information density at the same time!

Conversely, if I "abbreviate" an EVA transcription of Voynichese by rewriting common glyph clusters as single glyphs, I will increase its entropy. So one could say that Voynichese needs to be abbreviated to bring it closer to a normal text, rather than the other way around. This, in turn, would mean that "words aren't words", because Voynichese words would become much too short, and we would have shifted the problem from entropy to spaces.
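
As a sketch of that direction of the experiment: collapse a few common EVA clusters into single placeholder symbols and recompute the entropies. The cluster list and the one-line sample below are placeholders; a real test would run over a full transcription with a more careful cluster inventory.

```python
# Assumes the h1/h2 helper functions from the earlier sketch are already defined.

# Collapse some common EVA clusters into single placeholder symbols.
# The choice of clusters and placeholder letters is illustrative only.
clusters = {"aiin": "A", "ain": "B", "qo": "Q", "ch": "C", "sh": "S"}

def collapse(eva_text):
    """Rewrite multi-glyph clusters as single symbols, longest clusters first."""
    for cluster in sorted(clusters, key=len, reverse=True):
        eva_text = eva_text.replace(cluster, clusters[cluster])
    return eva_text

sample = "qokeedy qokeey chedy shedy daiin okaiin"   # a few common Voynichese word types
print(collapse(sample))
print(round(h1(sample), 2), "->", round(h1(collapse(sample)), 2))
print(round(h2(sample), 2), "->", round(h2(collapse(sample)), 2))
```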

Edit: oops, I was typing this reply while Marco made his post - seems like there is some overlap.
(27-04-2022, 02:26 AM)kckluge Wrote: The second type is where some plaintext letters map to a single glyph, while others map to 2- (or 3- or more) glyph combos -- so plaintext 'C' might map to Currier "OP", while plaintext 'K' might map to Currier "P". A straddling checkerboard is a (from the point of view of the C-14 date somewhat anachronistic) cipher of this type. The problem this raises is that if you have "CLOCK" as a plaintext word, then the "CK" at the end would result in "OPP" in the Voynich text given the hypothetical mappings -- and except for Currier 'C', repeated glyphs are vanishingly rare -- out of 35483 digrams in (D'Imperio's) BioB, there are 15 that are doubled glyphs other than "CC". So either there just miraculously happens to be a natural language whose letter contact stats allow a mapping that avoids that problem, or there is some mechanism in the cipher that hides such doubled glyphs -- perhaps the difference between Currier "OF" and "OP" isn't that they map to different plaintext letters; perhaps "OF" is what you write where "OPP" would otherwise occur. Or maybe the verbose cipher theory is just fundamentally wrong...
This cipher is very interesting. Consider someone who doesn't know what Arabic numerals are, thinks that the glyphs of this cipher are 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, and runs an entropy analysis based on them. What will the result be? A low entropy compared with a natural language.
Consider other options as well, like adding nulls, repeating letters in the plaintext (say, writing hello as heeello or hehello), and homophonic substitutions.
There is a wide variety of options to deal with.
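
For anyone who wants to see the doubled-glyph problem from kckluge's quote in action, here is a toy version of that kind of mapping. The 'C' -> "OP" and 'K' -> "P" entries follow his example; the other entries are filler, and none of it is a real straddling checkerboard key.

```python
# Hypothetical verbose-cipher table: some plaintext letters become one glyph,
# others become a two-glyph combination (glyphs written as Currier-style capitals).
letter_to_glyphs = {"C": "OP", "K": "P", "L": "A", "O": "E"}

def encipher(plaintext):
    """Encipher letter by letter; letters missing from the toy table pass through unchanged."""
    return "".join(letter_to_glyphs.get(ch, ch) for ch in plaintext)

print(encipher("CLOCK"))   # -> OPAEOPP: the plaintext "CK" ending produces the doubled glyph "PP"
```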
Excellent crash course in Shannon entropy, Koen. If there is an official FAQ, or official prerequisite reading for participation in some of the more advanced discussions here, this should be part of it. I joined this forum just as you, Marco, and nablator were starting to play around with these statistics, and I taught myself all I could to try to keep up with these discussions. I wish I’d had this post back in 2018.

In your reply to tavie, you mention h1 and h2 forming a profile which is distinct for any given language. Having read Patrick Feaster’s work on the VMs, I’m inclined to take this a step further and say that any kind of information has a unique profile of h0, h1, h2, and h1/h2 values. If validated over enough specimens of sufficient length, such a profile could be used to identify the type of information an unknown specimen most likely contains. A weather station’s hourly report data and an electrocardiogram machine’s logs will produce output specimens that are distinct from each other on all these measurements in consistent ways, no matter how they’re encoded. Patrick wrote on his blog that the arrangement of glyphs in the VMs reminds him more than a bit of medieval French business ledgers, where each line represents a transaction and a sum of money expressed in Roman numerals. If any such ledgers have been transcribed in a machine-readable form, I wonder what their entropy values would look like. Closer to medieval prose, or closer to the VMs?