The concept of entropy is often mentioned in relation to the Voynich text, and rightfully so. Voynichese differs from regular texts in a number of ways, but its entropy values are among the clearest indicators of this difference. The purpose of this post is twofold:
1. To explain entropy intuitively. I am one of those people who dreads numbers and had trouble with mathematics in high school. When people were talking about entropy, I was never sure if I understood correctly what they meant. It was only after Anton's threads about the subject, and asking a bunch of questions, that I became able to play around with entropy myself. Having crossed from one side to the other, I hope I will be able to explain these concepts in a way almost everybody can understand. Still, I am likely to make mistakes, and will welcome corrections and improvements in the comments. Keep in mind that we are talking about one very specific application of entropy here (information theory), and different fields may use different definitions.
2. To demonstrate why entropy is so important in Voynich studies. Statistics people will often shoot down simple substitution solutions on sight, and they have good reasons to do so. A simple substitution solution is one where, simply put, each Voynich glyph corresponds to one letter or sound in your translation. Low entropy is one of the reasons why they don't work. If you want to understand the challenges of the VM text, understanding entropy is crucial. Of course, entropy is not the only problem, but it demonstrates well just how different Voynichese is from regular texts. Most importantly, understanding the low entropy problem may also guide us when looking for a solution.
If you want to continue researching the VM's text without learning about entropy, that is of course fine, and then I will see you in another thread. For me, however, learning to understand entropy was a real eye-opener, and I want to share this with those who may find other explanations too technical.
A Metaphor about Entropy and Drinking Tea
Imagine you are a statistician and I am your roommate, and I like to have a cup of tea every morning at 8 AM. You want to analyze my behaviour statistically, and start to try and predict which type of tea I will select each morning.
Your first bit of information is that there are ten different types of tea in my cupboard. These are the options I can theoretically choose from. When Voynich researchers talk about h0, this is what they mean: the diversity of options (glyphs or words) without any information about their actual use. Maybe I am saving some teas for special occasions, and maybe there are others I drink all the time and keep buying again. Well, h0 does not care about this: it only cares about the number of options, and it increases when there are more options.
In the graph above, we can see how h0 grows when we know about more options (1 to 5 in this case). However, knowing my h0 won't help you much in predicting which tea I will drink: it does not tell you anything about my preferences or other habits that might influence my selection.
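For those who like to tinker, here is a minimal Python sketch of what h0 boils down to: count the distinct options that occur and take the base-2 logarithm. The tea data below is made up purely for illustration.

```python
import math

def h0(observations):
    """Maximum entropy: log2 of the number of distinct options observed."""
    return math.log2(len(set(observations)))

# Ten days of (invented) tea choices, but only three distinct types so far
days = ["mint", "lemon", "mint", "earl grey", "mint",
        "lemon", "mint", "mint", "earl grey", "mint"]
print(h0(days))        # log2(3)  ≈ 1.58
print(h0(range(10)))   # log2(10) ≈ 3.32 once ten distinct options are known
```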
This gives you a new idea: you will observe my selection for 100 days, and add a tally mark for each type of tea I select. This is what we call h1: it takes into account frequency, but no order or patterns. If I consume all ten types of tea equally, you will have ten tally marks next to each type, and my h1 entropy will be high: my tea consumption is unpredictable (again, if the only information you have is the relative frequency of each type). If, however, I drank the same kind of tea 100 times, my behavior is predictable and there is low entropy in my tea-drinking system.
In the above graph, all ten types of tea are selected at least once throughout the 100 data points in each case. But at the left of the graph, I have a strong preference for only one type, so h1 is low: nine out of ten times I will select my favorite type, so I am easy to predict. As we move towards the right, my number of preferred types increases, and so does the entropy (h1) in the system. Note that in the second graph, h0 is equal for all data points, since each type of tea has been drunk at least once, so we know there are ten options. However, h1 is variable since I varied how often each option was selected.
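To make this concrete, here is a small sketch (again with invented tea data) of how h1 could be computed from nothing but the frequency tallies:

```python
import math
from collections import Counter

def h1(observations):
    """Shannon entropy of the frequency distribution, in bits per selection."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Strong preference: nine cups of mint for every cup of lemon -> low h1
print(h1(["mint"] * 90 + ["lemon"] * 10))        # ≈ 0.47 bits

# All ten types in equal amounts -> h1 at its maximum, log2(10) ≈ 3.32 bits
print(h1([f"tea {i}" for i in range(10)] * 10))
```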
Now let us go with the worst case scenario: I drink equal amounts of each of the ten types. Going by frequency alone (h1), you now have no additional information to help you predict which type I will select. As far as you know, I might drink ten different types of tea on ten different days, or I might drink the same type ten days in a row. It's complete chaos.
Luckily, you have a final secret weapon: h2, or conditional entropy. This is Voynich researchers' favorite type of entropy, because it is where the VM really sets itself apart from everything else. What if I have a preference for a certain type of tea today, depending on the type I drank yesterday? Maybe I'm a total weirdo who always drinks tea in the same order. You start noticing a pattern: if I had mint yesterday, I will drink lemon today, and if I had lemon today, I will drink Earl Grey tomorrow. And after Earl Grey, it's always either black tea or green tea. My h1 is still through the roof, because I drink all teas in equal amounts. But luckily for you, my conditional entropy (h2) is very low. You can use yesterday's tea selection to predict what I will pick next.
In the graph above, the left shows a situation where all 100 entries follow the same order (1, 2, 3...). The system is very predictable and there is almost zero entropy. As long as you know what I drank yesterday, you can predict what I will drink today. On the right, I used the same numbers, but shuffled them randomly. This means that h0 (number of options) and h1 (frequency) remain exactly the same, but h2 is pushed close to its maximum. In this case, yesterday's choice has no influence on what I will pick today.
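And finally, a rough sketch of h2: instead of counting single selections, we count pairs of consecutive selections, and measure how predictable the second one is given the first. The tea data is again made up for illustration.

```python
import math
import random
from collections import Counter

def h2(sequence):
    """Conditional entropy H(today | yesterday), in bits, estimated from pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    h = 0.0
    for (prev, nxt), count in pair_counts.items():
        p_pair = count / total              # p(yesterday, today)
        p_cond = count / prev_counts[prev]  # p(today | yesterday)
        h -= p_pair * math.log2(p_cond)
    return h

teas = list(range(10))

# Always the same fixed rotation: yesterday fully determines today
ordered = teas * 10
print(h2(ordered))    # 0.0: perfectly predictable

# Same teas, same frequencies, but shuffled: yesterday tells you very little
shuffled = ordered[:]
random.seed(0)
random.shuffle(shuffled)
print(h2(shuffled))   # much higher, approaching log2(10)
```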
Character Entropy vs Word Entropy
Voynich researchers will often talk about either word entropy or character entropy. These are two separate things: one will study how predictable words are, the other will look at individual glyphs. The numbers will be different as well. For example, h0 in word entropy is much higher than h0 in character entropy, since any given text may have thousands of different word types, but only a few dozen different characters/glyphs. Both word entropy and character entropy are strange in Voynichese. However, I will only write about character entropy because this is what I know the most about.
Entropy and Information
There is a correlation between a writing system's entropy values and how efficiently it conveys information. We can grasp the basics of VM entropy without understanding the details of this matter, but I mention it anyway because you might see someone write about it. Some examples:
* The word "Voynich" in binary is "01010110 01101111 01111001 01101110 01101001 01100011 01101000". Alphabetic text has a much higher h0 than binary code. In everyday writing, using the alphabet is more efficient: I can get the same information across with fewer characters.
* In English, "q" is usually followed by "u" (bar a few exceptions in loan words like qanat, Iraq....). We can say that the "u" in words like "quest" does not add any information, because it is expected with near-complete certainty. I cannot change my message by toggling this "u". Because "u after q" is predictable, its presence lowers the h2 of English, which lowers how efficiently written English transmits information.
This talk about information density feels a bit too abstract and theoretical to me: after all, most historical writing systems are not designed for optimal efficiency, but have evolved over time.
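Still, if you want to reproduce the binary example from the first bullet above, it only takes a couple of lines (using standard 8-bit ASCII codes):

```python
# "Voynich" written out as 8-bit binary: the same message needs far more symbols
# when the alphabet only offers two options (0 and 1).
word = "Voynich"
print(" ".join(f"{ord(c):08b}" for c in word))
print(len(word), "letters versus", 8 * len(word), "binary digits")
```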
Medieval Manuscripts, the VM and h0
Calculating the true h0 of medieval manuscripts is more difficult than it sounds. What to do with ligatures, abbreviation symbols, positional variation, capitals... What about rare symbols that are used only a few times? We might use a transcription of the text, but this is a cleaned-up, abstract version that does not exist on parchment. Maybe we should assume Voynichese behaves like a cleaned-up version, since it is a novel "code" that might disregard things like capitalization, abbreviation and other scribal conventions?
Apart from that, how many different characters does a section like Herbal A use? It depends on how we count. In EVA, there are 19 characters used in Herbal A. But we can easily increase this number: counting benched gallows as separate glyphs will add four. We might also guess that "in" and "iin" are separate glyphs, and so on.
With some tweaking, it is perfectly possible to get an acceptable h0 value for Voynichese. But I think h0 is also the value that suffers the most from the way we transcribe our text: each manuscript, including the VM, can be described with various degrees of standardization and differentiation between glyph forms, which makes comparing h0 difficult. Moreover, I simply find h0 unreliable overall. If a scribe slaps a novel symbol at the end of a 200-page manuscript, the h0 value of the whole manuscript changes because of this.
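As a small illustration of that last point, here is what happens in a toy Python example (not real manuscript data) when one stray symbol shows up at the very end of an otherwise regular text:

```python
import math

def h0(text):
    return math.log2(len(set(text)))

# A long toy "manuscript" that uses 19 different characters throughout
manuscript = "abcdefghijklmnopqrs" * 1000
print(round(h0(manuscript), 2))        # log2(19) ≈ 4.25

# One novel symbol on the last "page" changes h0 for the entire text
print(round(h0(manuscript + "*"), 2))  # log2(20) ≈ 4.32
```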
For reference, in my corpus of medieval texts, h0 reaches all the way from 4.25 (EVA transliteration of Q13B) to 6.95 (a Greek historical text).
Medieval Manuscripts, the VM and h1
We generally use h1 to get a more accurate "entropy fingerprint" of a text. Manipulating the text to change h2 may also change h1, and vice versa, so usually both are tracked at once. The lowest h1 I have in my corpus is a German text, followed by EVA VM sections and other German texts. However, matching h1 without also matching h2 is not worth much (I would welcome corrections if this statement is inaccurate!).
Why Voynichese has a Character Entropy Problem: h2
The most obvious reason why Voynichese has a huge character entropy problem is conditional entropy, h2. Remember the thing with "qu" in English? Well, Voynichese is kind of like that all the time.
Let's start in EVA. I give you a glyph, you tell me what's next (spaces also count). If you are a bit familiar with VM transliterations, you can do this off the top of your head:
- q: "o" in the vast majority of cases
- a: overwhelmingly "i", then "l", "r", then "m"
- i: another "i" or "n" in most cases
- n: a space in the vast majority of cases
- y: a space in the vast majority of cases
- d: "y" in about half of the cases
- c: "h" in the vast majority of cases
Other glyphs show some more options, but they still tend to be quite restricted.
This is not normal!
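If you want to try this exercise computationally, the sketch below tabulates, for each character, its single most common successor. The sample string here is invented and far too short to mean anything; a real test would run over a full transliteration file.

```python
from collections import Counter, defaultdict

# Invented EVA-like sample, for illustration only (spaces count as symbols too)
sample = "qokeedy qokedy chedy qokaiin shedy daiin chol otedy qokedy daiin"

followers = defaultdict(Counter)
for prev, nxt in zip(sample, sample[1:]):
    followers[prev][nxt] += 1

for glyph in sorted(followers):
    counts = followers[glyph]
    best, n = counts.most_common(1)[0]
    print(f"{glyph!r} is followed by {best!r} in {n} of {sum(counts.values())} cases")
```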
As far as the numbers go, they are easy to remember. VM sections are a bit below or above h2=2. Quire 13 gets really low with a value of 1.8. "Normal" medieval texts, on the other hand, have h2 values above 2.8, usually above 3. Again, this difference is huge, and it blows all other problems out of the water.
Is EVA a problem?
When I wrote my earlier posts on this subject, one thing I wondered was: to what extent does EVA influence these statistics? If the glyph "bench" is always written as "ch", maybe this is enough to mess things up: it turns one glyph into a predictable pair. So what I did was to fix benches and clusters involving "in", doing my utmost to squeeze as much h2 out of it as possible. The result is in the graph below. You can see that the "fixed" Voynichese versions outperform EVA, but they are still waaaay below any normal text.
So is EVA a problem? Well, yes and no. Yes, because maybe there are choices in EVA that should be corrected for before performing certain analyses, since EVA probably does lower h2. But also no, because the biggest problem is certainly not EVA: the biggest problem is Voynichese itself. The reason for this is simple: if I fix the "ch" situation by representing the bench with a single glyph, this new glyph is now also predictable, because all glyphs in the VM are too predictable.
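Here is a rough sketch of what that kind of experiment looks like: estimate h2 on a sample, then merge "ch" into a single placeholder glyph and estimate it again. The sample below is invented and far too short to be meaningful; the real test uses full transliterations, but the principle is the same.

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy H(next character | current character), in bits."""
    pairs = list(zip(text, text[1:]))
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    return -sum((c / len(pairs)) * math.log2(c / prev_counts[prev])
                for (prev, _), c in pair_counts.items())

# Invented EVA-like sample, for illustration only
eva = "qokeedy chedy qokedy chol chedy qokaiin chedy qokedy daiin chey"

print("EVA as transcribed:       ", round(h2(eva), 2))
print("bench merged to one glyph:", round(h2(eva.replace("ch", "C")), 2))
```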
People often ask me if I included this or that dialect in my corpus, and my answer is always the same: it does not matter! Differences between entire language families are much smaller than the difference between Voynichese and normal text. Give me any text in any language, and I can almost guarantee you that its h2 will be above 2.8, which absolutely crushes Voynichese.
What's with those Verbose Ciphers?
In the same post, I tried to push my approach further, which took me unwittingly into verbose cipher territory. A verbose cipher is basically a cipher that obfuscates by adding unnecessary stuff. In a very simple example, I could verbosely obfuscate the word "Voynich" by adding a "v" after every letter: "Vvovyvnvivcvhv". If Voynichese is the result of a verbose cipher, I could try to reverse this by rewriting common glyph clusters (bigrams, trigrams) as single glyphs. For example, I could replace "dy" by "&" and run the entropy test again to see what changed. After lots of trial and error, I got almost-but-not-quite-normal entropy values this way. Apparently Rene did better with some method, which I am really looking forward to learning more about. As Rene also noticed, however, the "rewriting n-grams" method has a significant drawback: it makes words really short. As you can see in the "Voynich" to "Vvovyvnvivcvhv" example above, verbose encoding has the effect of lengthening words, and Voynichese words aren't excessively long to begin with.
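To make the verbose cipher idea concrete, here is a toy sketch: the encoding step from the "Voynich" example, and the kind of bigram rewriting one could use to try to reverse it. The "dy" to "&" replacement is the one from the text; the short ciphertext string is made up, and on real data you would pick the most frequent bigrams and iterate.

```python
def verbose_encode(word, filler="v"):
    """Insert a filler letter after every plaintext letter."""
    return "".join(c + filler for c in word)

print(verbose_encode("Voynich"))            # Vvovyvnvivcvhv

# Trying to reverse a suspected verbose cipher: rewrite a common bigram
# as a single new symbol and see how the statistics change.
ciphertext = "qokedy chedy daiin qokedy shedy chedy"   # invented sample
print(ciphertext.replace("dy", "&"))

# The drawback mentioned above: merging bigrams makes words shorter,
# while verbose encoding makes them longer.
print(len("Voynich"), "->", len(verbose_encode("Voynich")), "characters")
```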
What's the takeaway?
If you want to have a chance of solving Voynichese, you must take into account the entropy problem - there is no way around it. Knowing about the entropy issue is also interesting when assessing proposed solutions. They will either:
* Focus on single words. This locks in glyph correspondences, and makes it impossible to expand the system to a paragraph of text. Because Voynichese has abnormally low entropy, it will not convert freely to any reasonable text in any writing system that has been considered so far.
* Add their own step to introduce entropy. This is the infamous "interpretative step". The translator realizes that Voynichese does not provide enough options, so they find ways to increase the entropy. For example, they will say each VM glyph can stand for various plaintext glyphs. This leads to massive problems, which I will not go into right now (basically, it turns the system into a one-way cipher).
But most importantly, knowing about the nature of Voynichese's entropy problems will hopefully help us work towards the right type of solution: one that changes the entropy density of the text without actually inventing information. But that may still be a ways off.
Comments, questions and additions are welcome.