13-09-2019, 09:52 PM
I think it's time we fish this out of the depths of the off topic section and give it its own proper thread.
Anton suggested that investigating word entropy would be an interesting exercise. Thanks to Nablator's code I gathered some initial data, which can be viewed in my [link] under the word entropy tab.
For now, I made some quick graphs to see whether there is any signal in the noise. My favorite way of visualizing lots of data is in scatter plots, so that's what I used. For the second axis I used MATTR 500 (moving-average type-token ratio over a 500-word window, "m500" below), because I know this forms "language clouds", and additionally I wanted to find out whether there would be any correlation between this and word entropy (both are about vocabulary, after all).
Also, I wanted to get an idea of which values might be most useful to focus on.
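For anyone who wants to reproduce the m500 axis, here is a minimal sketch of MATTR as I understand it: slide a fixed window over the word list, take the type-token ratio in each window, and average. This is not the code used for the graphs; the function name `mattr` and the fallback for texts shorter than the window are my own assumptions.

```python
from collections import Counter


def mattr(tokens, window=500):
    """Moving-average type-token ratio (illustrative sketch, not the original tool).

    tokens: list of word strings, in text order.
    window: window length in tokens (500 gives "MATTR 500").
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    if n <= window:
        # Assumption: for short texts, fall back to the plain type-token ratio.
        return len(set(tokens)) / n

    counts = Counter(tokens[:window])   # word counts inside the current window
    total_types = len(counts)           # running sum of distinct-word counts
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]         # word no longer present in the window
        counts[tokens[i]] += 1
        total_types += len(counts)

    n_windows = n - window + 1
    return total_types / (n_windows * window)
```

Because the window size is fixed, the value does not drift with text length the way a plain type-token ratio does, which is why texts of different sizes can share one axis.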
Note: in most graphs I left in two VM outliers: the labels and the GC transcription. It is best to focus on the main VM cloud, which sits somewhere between Latin and German.
Note 2: Greek usually sits somewhere in the middle, but it has so many dots that it obscures everything else, so I turned it off for these graphs.
h1
[attachment=3293]
First order entropy is clearly affected by language and shows some correlation with m500.
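To make the labels concrete, here is a rough sketch of how h0, h1 and h2 could be computed over words. It rests on assumptions: I take h0 as log2 of the number of distinct words, h1 as the Shannon entropy of the word frequencies, and h2 as the conditional entropy of a word given the previous word. Nablator's code may define or estimate these differently, so treat the names `shannon` and `word_entropies` as hypothetical.

```python
import math
from collections import Counter


def shannon(counts):
    """Shannon entropy in bits of a Counter of observed frequencies."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def word_entropies(tokens):
    """Return (h0, h1, h2) for a list of words (illustrative definitions only)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
    first_words = Counter(tokens[:-1])           # first word of each pair

    h0 = math.log2(len(unigrams))                # maximum possible h1
    h1 = shannon(unigrams)                       # first-order word entropy
    # Assumption: h2 is conditional entropy, H(pair) - H(first word of pair).
    h2 = shannon(bigrams) - shannon(first_words)
    return h0, h1, h2
```

With these definitions, h0-h1 measures how far the word distribution is from uniform, and h1-h2 measures how much knowing the previous word narrows down the next one.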
h0-h1
[attachment=3294]
Here I see only very slight effects of language on the entropy value.
h2
[attachment=3295]
A similar result to h0-h1. You can see that Latin leans more to the left, but there is significant overlap with the other languages.
h1-h2
[attachment=3296]
This one surprised me: it correlates really well with m500 (and hence, language type).
h1/h2
[attachment=3297]
An effect is visible, but less pronounced than in h1-h2.
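The comparisons above are judged by eye from the scatter plots. One way to put a number on "correlates really well" versus "less pronounced" would be a Pearson correlation between each entropy column and m500. The sketch below is only a suggestion, not something done for these graphs; `pearson` is a hypothetical helper and the inputs would be two equal-length lists of per-text values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (spread_x * spread_y)
```

Values near +1 or -1 would indicate a tight linear relation with m500, and values near 0 would indicate none. Pearson only captures linear trends, so a rank correlation might be a better fit if the clouds turn out to be curved.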
Conclusion: Voynichese does not behave abnormally as far as word entropy goes. It sits somewhere between Latin and German. Some other languages like Italian and Slavic are also close, but I didn't include those in these graphs.