Okay, I redid the m500/h1-h2 graph including Timm. I also chose some more obvious colors (I hope).
[attachment=3302]
There are two VM groups. The bottom one is Q13 and its two subsections, so you can really see this as one VM outlier. The other one is various VM transcriptions and subsections other than Q13. The VM as a whole still sits somewhere between German and Latin, while Q13 is right in the middle of the English cloud.
Timm (dark dot) is not too far off Q13, but for the VM in general his values are really low.
As far as text length goes, all included texts are over 1000 words, since this was a requirement for m1000. Most are several thousand words long. I had a few texts that were much longer, but those were excluded because of Java limitations.
Marco suggested h1 vs. h2; this actually shows some really nice language clustering:
[attachment=3303]
(14-09-2019, 07:23 AM)ReneZ Wrote: For word entropy, the number of different words keeps increasing. The number of possible word pairs actually grows faster than the text length, so in a way the sampling gets worse as the text length increases.
My idea is that one's vocabulary is limited. If one knows 200 words (that's how many English words I knew after one year of learning English), then the number of unique words in a text won't significantly increase, whether the text is 10,000 words or 100,000 words long. But the size should be sufficient, nonetheless. That's why I said in the other thread that I'm interested to know the text and vocabulary sizes of the samples, and also how rapidly the value changes as the size increases.
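A quick way to see how fast this flattens out for any given sample is to track the number of distinct word types as the text grows. A rough sketch (the file name and whitespace tokenization are just placeholders, not anyone's actual tool):
[code]
def vocab_growth(path, step=1000):
    """Print how many distinct word types have appeared after every `step` tokens."""
    with open(path, encoding="utf-8") as f:
        tokens = f.read().lower().split()
    types = set()
    for i, tok in enumerate(tokens, 1):
        types.add(tok)
        if i % step == 0:
            print(f"{i:>7} tokens -> {len(types):>6} types (TTR {len(types) / i:.3f})")

vocab_growth("sample.txt")   # placeholder file name
[/code]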
(14-09-2019, 08:50 AM)MarcoP Wrote: Again, I believe that h2 here is conditional entropy (not second order, i.e. Bi-Word, entropy).
There is much confusing jargon around entropies, because the notion of "order" can be applied to different things. The best way, in my opinion, is to use the terms "unconditional" (which can be omitted for brevity) for our h1, "conditional" for h2, h3 etc., and "joint" for n-grams.
One can then apply the notion of order to conditional and joint entropies. Our h2 would be called "first-order conditional entropy", since it is conditioned on one character, h3 "second-order conditional entropy", since it is conditioned on two characters, and so on. Likewise for joint entropies: a bigram would be second order, a trigram third order.
This is the scheme that many authors adopt.
In contrast, speaking of "order" in general introduces confusion about what is actually being discussed. To avoid that, "n-th order entropy" should be read as unconditional symbol entropy when the order is 1 and as conditional symbol entropy when the order is >= 2, while joint entropies are simply out of scope; when you need to invoke them, you just say "joint".
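In symbols, the chain rule ties the three quantities together, which is also why the two naming schemes are so easy to mix up:
[code]
h1        = H(X1)                                 (unconditional)
h2        = H(X2 | X1) = H(X1, X2) - H(X1)        (first-order conditional)
H(X1, X2) = h1 + h2                               (joint, i.e. bigram)
[/code]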
Quote:I hope that Rene or Anton will provide a better explanation, but I guess that the striking correlation of h1-h2 with m500 is due to the fact that it removes from word entropy the component due to unpredictable word combinations (BWE): what is left is almost identical to the variability of the lexicon (TTR). The left and right outliers in the h1-h2 graphs (mostly English texts, it seems) could then be samples with exceptionally strong or weak two-word sequences repetition.
Unfortunately, I did not follow all that TTR discussion, and hence can say nothing.
(14-09-2019, 08:51 AM)nablator Wrote: (Values for entire TT with unclear "?" vords removed):
h0 h1 h2
12.97 10.45 4.36
When vords are randomly shuffled h2 increases:
12.97 10.45 4.52
There are enough repetitions of 2-vord patterns, and more generally enough correlation between a vord and the next, to account for this. But some reordering could still have taken place. For example, if every even-indexed vord were moved elsewhere (a different line, a different page), then for each repeated 3- or 4-vord pattern a repeated 2-vord pattern would remain.
OK, let's say there are no positive hints of shuffling. What's interesting is that several years ago, when I did some screening for regularities in the narrative structure of the botanical folios based on keyword occurrences (the Voynich stars), I failed to detect any, which suggested that vord shuffling might be in place. So it's of interest to approach the question of shuffling from different angles.
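For instance, nablator's test above can be reproduced on any transcription along these lines (a rough sketch, not his actual code; the file name and whitespace tokenization are placeholders):
[code]
import math, random
from collections import Counter

def word_entropies(tokens):
    """Return h0, h1 and first-order conditional h2 (all in bits) for a token sequence."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    h0 = math.log2(len(uni))
    h1 = -sum(c / n * math.log2(c / n) for c in uni.values())
    nb = n - 1
    h_joint = -sum(c / nb * math.log2(c / nb) for c in bi.values())
    return h0, h1, h_joint - h1          # h2 = joint bigram entropy minus h1

with open("transcription.txt", encoding="utf-8") as f:   # placeholder file
    vords = f.read().split()

print("original:", word_entropies(vords))
shuffled = vords[:]
random.shuffle(shuffled)
print("shuffled:", word_entropies(shuffled))
[/code]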
Another point of interest with h2 is whether it could provide hints about inflection. My idea is that different degrees of inflection would influence the word h2 (and of course also the h0 and h1) values. For example, in Russian you have six cases for nouns: nominative, genitive, dative, accusative, instrumental and prepositional, and the word endings generally differ between them. In Latin, if I'm not mistaken, you have only four. In English you have essentially none; even where cases are distinguished, the word endings do not change, and that's the point. The same goes for verbs. So it's of interest to see whether the different inflectional behaviour of languages is (or is not) reflected in word entropy.
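As a toy illustration of why (my own simplification, not a real model of any language):
[code]
Assume V lemmas, each appearing in k equally likely inflected forms,
with the form independent of the lemma. Then, for word entropies:

  h0 = log2(V * k) = log2(V) + log2(k)
  h1 = H(lemma) + log2(k)

So a language with k = 6 endings gains up to log2(6) ~ 2.6 bits of word
entropy over an uninflected language with the same lemma distribution.
[/code]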
Anton, what you're describing now is exactly what MATTR does. You should really check my blog posts on the subject.
Here I introduce the concept and the literature I used (note that I still had a lot to learn about graphs): [link]
And here is the best overview I've made so far: [link]
Since some of the topics you bring up overlap with this, you might want to check it out.
Something seems to be going wrong with the plots in [link], but the mistake could be on my side and I need to think about it for a bit.
Word-h1 is dominated by the text length, and values from 7 to 12 cover a very wide range.
In the first plot I am very surprised to see values of h1-h2 going above 7, and even by a lot, but I have to think about it.
(14-09-2019, 01:01 PM)Anton Wrote: Another point of interest with h2 is whether it could provide hints about inflection. My idea is that different degrees of inflection would influence the word h2 (and of course also the h0 and h1) values. For example, in Russian you have six cases for nouns: nominative, genitive, dative, accusative, instrumental and prepositional, and the word endings generally differ between them. In Latin, if I'm not mistaken, you have only four. In English you have essentially none; even where cases are distinguished, the word endings do not change, and that's the point. The same goes for verbs. So it's of interest to see whether the different inflectional behaviour of languages is (or is not) reflected in word entropy.
Latin has five commonly used cases, six if you count the vocative case, which is rarely needed in written documents. Sorry to be pedantic. Out of curiosity, if I'm addressing someone directly in Russian, what case do I use for their name?
The difference between analytical and agglutinative languages comes to mind here. As I understand it, agglutinative languages modify words by adding to or changing their form, whilst analytical languages prefer to leave words unchanged and express grammar through their placement among other words in the sentence. It's really a gamut, with modern English being one of the most analytical Indo-European languages, and Latin and Russian much more agglutinative. (Someone with a background in linguistics please feel free to correct me; I'm only an amateur linguist.)
Pertinent to the VMS, word order tends to be much less flexible in analytical languages. It seems intuitive to me that Shannon entropy and agglutinative-ness of a language would correlate significantly, and that Koen's test data in this thread supports this. If so, then it seems to me that we might be able to predict how agglutinative the Voynichese language is, and compare it to known languages with a similar pattern of inflection.
@Koen G, This thread is making me wish I'd paid more attention in statistics class. Nice work.
Renegade: you should also check the TTR blogposts I linked for Anton. Since I'm a total statistics noob myself, I tried to explain it in an accessible way. Basically what it does is sort languages by how inflected they are. Voynichese is somewhere between Latin and German, around the level of Greek and Slavic languages. But there are differences between VM sections and window sizes.
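For what it's worth, stripped of all the Java plumbing, the moving-average TTR behind these values boils down to something like this (a simplified Python sketch, not my actual code; window = 500 gives m500, window = 1000 gives m1000):
[code]
from collections import Counter

def mattr(tokens, window=500):
    """Mean type-token ratio over all overlapping windows of `window` tokens."""
    if window > len(tokens):
        return None                      # sample too short for this window size
    counts = Counter(tokens[:window])
    ratios = [len(counts) / window]
    for i in range(window, len(tokens)):
        counts[tokens[i]] += 1           # token entering the window
        out = tokens[i - window]         # token leaving the window
        counts[out] -= 1
        if counts[out] == 0:
            del counts[out]
        ratios.append(len(counts) / window)
    return sum(ratios) / len(ratios)
[/code]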
Rene: when I get home I can attach some of the offending texts if that would help.
Just to see if I remember correctly.
If you have an m500 of 0.8, that means that in a typical group of 500 word tokens you expect to find 400 word types.
If it is 0.4, you would expect to find only 200 word types on average in 500 words (word tokens).
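In other words (if I remember the definition right):
[code]
m500 = average over all 500-token windows of (word types in window) / 500

m500 = 0.8  ->  on average ~400 types per 500 tokens
m500 = 0.4  ->  on average ~200 types per 500 tokens
[/code]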
Correct?