The Voynich Ninja
Word Entropy - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: Word Entropy (/thread-2928.html)



Word Entropy - Koen G - 13-09-2019

I think it's time we fish this out of the depths of the off topic section and give it its own proper thread.

Anton suggested that investigating word entropy would be an interesting exercise. Thanks to Nablator's code I gathered some initial data, which can be viewed in my [link] under the word entropy tab.

For now I made some quick graphs to see whether there is any signal in the noise. My favorite way of visualizing lots of data is scatter plots, so that's what I used. For the second axis I used MATTR 500, because I know it forms "language clouds", and I also wanted to find out whether there is any correlation between it and word entropy (both are about vocabulary, after all).
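
In rough Python terms (not Nablator's actual code, just a minimal sketch assuming a plain list of word tokens), the two plotted quantities are:

Code:
from collections import Counter
import math

def word_h1(tokens):
    """First-order word entropy in bits: -sum p(w) * log2 p(w)."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mattr(tokens, window=500):
    """Moving-average type-token ratio over a sliding window (MATTR 500)."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

Each text then contributes one (MATTR 500, h1) point to a scatter plot.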

Also, I wanted to get an idea of which values might be most useful to focus on.

Note: in most graphs I left in two VM outliers: the labels and the GC transcription. It is best to focus on the main VM cloud, which sits somewhere between Latin and German.
Note 2: Greek usually sits somewhere in the middle, but it has so many dots that it impairs visibility, so I turned it off for these graphs.

h1


First order entropy is clearly affected by language and shows some correlation with m500.

h0-h1


Here I see only very slight effects of language on the entropy value. 

h2


A similar result to h0-h1. You can see that Latin leans more to the left, but there is a significant overlap. 

h1-h2


This one surprised me: it correlates really well with m500 (and hence with language type).

h1/h2


An effect is visible, but less pronounced than in h1-h2.


Conclusion: Voynichese does not behave abnormally as far as word entropy goes. It sits somewhere between Latin and German. Some other languages like Italian and Slavic are also close, but I didn't include those in these graphs.


RE: Word Entropy - Koen G - 13-09-2019

And the following graph shows average h1-h2 values for all languages with a sufficient amount of text in the current corpus, together with several different VM sections.



RE: Word Entropy - Anton - 14-09-2019

Thanks Koen, this is something which needs to be thought over - provided, of course, that text sizes were sufficient.

Quote:Conclusion: Voynichese does not behave abnormally as far as word entropy goes.

If this conclusion is true, then it speaks against vord shuffling, does it not?


RE: Word Entropy - ReneZ - 14-09-2019

There's a fundamental problem with word-h2.

If a text is made up of 26 different characters, then there are 676 possible character pairs. The entropy is computed as a function of the 676 pair probabilities, which in practice are estimated from the observed frequencies.
The longer the text, the more representative these frequencies become.

For word entropy, the number of different words keeps increasing. The number of possible word pairs actually grows faster than the text length, so in a way the sampling gets worse as the text length increases.

Edit:

One can make an estimate. Assume that a text consists of N+1 words; then it contains N consecutive word pairs. Let's assume that all of these pairs are different. In that case each of the N pairs has a frequency of 1/N, and the absolute word H2 would be log2(N) (base-2 logarithm).

For the computation of h2 one would also need to know H1. For a text following Zipf's law, this has been estimated in a table provided here: [link]

The h2 can then be estimated (for this particular case) as the difference between the two columns, H(max) - H(zipf), and for text lengths of 5000 to 50,000 words it would run from about 3.3 to 4.7 (all barring simple mistakes).
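
A minimal Python sketch of this estimate, assuming that H(zipf) in the linked table is the entropy of an ideal 1/rank Zipf distribution over N ranks (an assumption on my part; the table may use a more refined model), with H(max) = log2(N) for N all-distinct word pairs:

Code:
import math

def zipf_entropy_bits(n):
    """Entropy (bits) of an ideal 1/rank Zipf distribution over n ranks."""
    h_n = sum(1.0 / r for r in range(1, n + 1))  # harmonic number
    return sum((1.0 / (r * h_n)) * math.log2(r * h_n) for r in range(1, n + 1))

for n in (5000, 50000):
    h_max = math.log2(n)           # all n word pairs distinct: word H2 = log2(n)
    h_zipf = zipf_entropy_bits(n)  # stand-in for the word H1 of a Zipfian text
    print(n, round(h_max - h_zipf, 2))  # comes out near 3.3 and 4.7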


RE: Word Entropy - Koen G - 14-09-2019

(14-09-2019, 07:23 AM)ReneZ Wrote: There's a fundamental problem with word-h2.
I see, that's probably why the graph looks so bad. It gets a lot better in h1-h2 though, so maybe this operation somehow normalizes the data?


RE: Word Entropy - MarcoP - 14-09-2019

Hi Koen,
these graphs are interesting, thank you for sharing them! I think that using a different color, or a different shape, for the VMS samples might make the graphs more readable.

Again, I believe that h2 here is conditional entropy (not second order, i.e. Bi-Word, entropy).

ConditionalEntropy = BiWordEntropy - WordEntropy
I think your h1-h2 is WordEntropy - ConditionalEntropy = WE - (BWE - WE) = 2*WE - BWE

This could explain why the graph for h1-h2 is so similar to the  graph for h1 (WE). 
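
In Python terms, a minimal sketch of my reading of these quantities (assuming a plain list of word tokens):

Code:
from collections import Counter
import math

def entropy_bits(counts):
    """Entropy (bits) of a frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def word_entropies(tokens):
    H1 = entropy_bits(Counter(tokens))                   # WordEntropy
    H2 = entropy_bits(Counter(zip(tokens, tokens[1:])))  # BiWordEntropy
    h2 = H2 - H1                                         # ConditionalEntropy
    return H1, H2, h2, H1 - h2                           # last value = 2*WE - BWE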

I hope that Rene or Anton will provide a better explanation, but I guess that the striking correlation of h1-h2 with m500 is due to the fact that it removes from word entropy the component due to unpredictable word combinations (BWE): what is left is almost identical to the variability of the lexicon (TTR). The left and right outliers in the h1-h2 graphs (mostly English texts, it seems) could then be samples with exceptionally strong or weak repetition of two-word sequences.
With the exception of labels, Bi-Word repetition appears to be normal in the VMS. What is not normal is that the repeating Bi-Words are so often reduplicating words ("daiin daiin" etc.), but I think this is transparent to word entropy.

I also wonder if you could compare your corpus with Timm's generated text [link]: we know their code is reasonably accurate in modelling reduplication, but I would be curious to see how it behaves in terms of recurring Bi-Words in general.


RE: Word Entropy - nablator - 14-09-2019

(14-09-2019, 12:47 AM)Anton Wrote:
Quote:Conclusion: Voynichese does not behave abnormally as far as word entropy goes.

If this conclusion is true, then it speaks against vord shuffling, does it not?

(Values for entire TT with unclear "?" vords removed):

h0       h1       h2
12.97    10.45    4.36

When vords are randomly shuffled, h2 increases:
12.97    10.45    4.52

There are enough repeated 2-vord patterns, and more generally enough correlation between a vord and the next one, for this to show up. But some reordering could still have taken place. For example, if every even-indexed vord were moved elsewhere (to a different line or page), then for each repeated 3- or 4-vord pattern a repeated 2-vord pattern would remain.
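
A minimal sketch of the shuffling check in Python (my own sketch, not the code actually used; loading the transliteration and removing the unclear "?" vords is assumed to happen beforehand):

Code:
import math, random
from collections import Counter

def entropy_bits(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_h2(vords):
    """Conditional word entropy: h2 = H2 (Bi-Word) - H1 (word)."""
    return (entropy_bits(Counter(zip(vords, vords[1:])))
            - entropy_bits(Counter(vords)))

def shuffle_test(vords, seed=0):
    """Return h2 for the original order and for a random shuffle."""
    shuffled = vords[:]
    random.Random(seed).shuffle(shuffled)
    return conditional_h2(vords), conditional_h2(shuffled)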


RE: Word Entropy - MarcoP - 14-09-2019

(14-09-2019, 07:31 AM)Koen G Wrote:
(14-09-2019, 07:23 AM)ReneZ Wrote: There's a fundamental problem with word-h2.
I see, that's probably why the graph looks so bad. It gets a lot better in h1-h2 though, so maybe this operation somehow normalizes the data?

That is my favourite graph among those you posted here: it has a lower correlation than the others, so I find it more informative. Also, it is the only one in which German is (partially) separated from the other languages, which I think is a good result.

I think an h1 (Word Entropy) vs h2 (Conditional Word Entropy) graph should also look somewhat similar.


RE: Word Entropy - Koen G - 14-09-2019

Marco, does this mean that h2+h1 is also a relevant value (Bi-word entropy)?

I will have a look at the other issues right away. Including Timm's file should be interesting.


RE: Word Entropy - ReneZ - 14-09-2019

There will always be confusion between conditional and unconditional entropy. I prefer to write upper case H for unconditional entropy and lower case h (which is a bit of a misnomer) for conditional entropy. Of course, there is no conditional first-order entropy, so one can write h1 just as well as H1.

What I intended to show in my earlier post is that, all other things being equal, both the word-H2 and the word-h2 depend directly on the text length. To be safe, one has to use the same length for all texts. Clearly, the experiment by nablator above is a valid one since both texts have the same length.
Also, arbitrarily mixing up the words should bring one close to the case I was describing, namely that all word pairs are different ones.

Using word-H1 minus word-h2 instead of word-h2 brings an advantage, because this quantity should be less dependent on text length. Using my earlier model, the value would vary from 5.6 to 6.2 for a text length varying from 5000 to 50,000 words.
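
Under the same toy Zipf model as in the sketch posted earlier in the thread (an assumed ideal 1/rank distribution over N ranks, so word-H1 = H(zipf) and word-h2 = log2(N) - H(zipf)), the quantity H1 - h2 can be checked directly:

Code:
import math

def zipf_entropy_bits(n):
    """Entropy (bits) of an ideal 1/rank Zipf distribution over n ranks."""
    h_n = sum(1.0 / r for r in range(1, n + 1))
    return sum((1.0 / (r * h_n)) * math.log2(r * h_n) for r in range(1, n + 1))

for n in (5000, 50000):
    h1 = zipf_entropy_bits(n)
    h2 = math.log2(n) - h1          # all n word pairs assumed distinct
    print(n, round(h1 - h2, 2))     # comes out near 5.6 and 6.2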