The Voynich Ninja

Full Version: Word Entropy
Koen, did you try to apply these calculations to Torsten's auto-generated texts? I have not followed the respective discussion in detail, but I had the impression that some kind of software generator was made available.
Do you mean entropy over increasing text size? I've only got the text of some 10800 words someone shared.

h0           h1           h2           words   types
11.12153352  9.196643491  3.850987991  10832   2228
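
For anyone who wants to reproduce these figures, here is a minimal Python sketch of how the three word-level entropies could be computed. I am assuming h0 = log2 of the number of word types, h1 = the unigram word entropy, and h2 = the conditional entropy of a word given the preceding word; nablator's actual code may tokenise or define things slightly differently.

Code:
import math
from collections import Counter

def word_entropies(words):
    """Return (h0, h1, h2) in bits for a list of word tokens."""
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))

    # h0: log2 of the number of distinct word types
    h0 = math.log2(len(unigrams))

    # h1: unigram (single-word) entropy
    h1 = -sum(c / n * math.log2(c / n) for c in unigrams.values())

    # h2: conditional entropy of a word given the previous word,
    # i.e. H(word pair) minus H(first word of the pair)
    nb = n - 1
    h_pair = -sum(c / nb * math.log2(c / nb) for c in bigrams.values())
    firsts = Counter(words[:-1])
    h_first = -sum(c / nb * math.log2(c / nb) for c in firsts.values())
    h2 = h_pair - h_first

    return h0, h1, h2

# Example:
# words = open("text.txt", encoding="utf-8").read().split()
# print(word_entropies(words))

(As a sanity check, log2(2228) is about 11.1215, which matches the h0 reported above.)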
No, I mean just an entropy calculation over a large piece of text.
(15-09-2019, 10:54 AM)Koen G Wrote: Do you mean entropy over increasing text size? I've only got the text of some 10800 words someone shared.

h0           h1           h2           words   types
11.12153352  9.196643491  3.850987991  10832   2228

For this text, the theoretical maximum h2 is 4.206, meaning that the actual h2 is at 91.55% of that.
Quite like Voynich TT above.
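
The 4.206 figure looks like the h2 one would get if every consecutive word pair in the 10832-word text were distinct, i.e. max h2 = log2(N-1) - h1. A quick check, assuming that is indeed how the maximum was computed:

Code:
import math

n_words = 10832
h1 = 9.196643491
h2 = 3.850987991

max_h2 = math.log2(n_words - 1) - h1   # upper bound: every word pair unique
print(max_h2)                          # ~4.206
print(100 * h2 / max_h2)               # ~91.55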
Similar to the plot linked above, it is possible to make an h2 vs. h1 plot for the several texts that were analysed at different lengths. These were:
Pliny (in blue)
Text "M" (in orange) [I suspect this could Mattioli]
Text "B" (in grey) [I suspect this could be the German text 'Barlaam' ?]
A green dot for Voynich TT and a black one for Timm's text have been added.

[attachment=3310]

The influence of text length clearly dominates.
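
One way to see the length effect directly is to compute the entropies on progressively longer prefixes of a single text and plot the resulting (h1, h2) points. A rough, self-contained Python sketch; the file name and step size are placeholders, and the entropy definitions are the same assumptions as in the sketch earlier in the thread:

Code:
import math
from collections import Counter
import matplotlib.pyplot as plt

def h1_h2(words):
    """Unigram entropy h1 and conditional word entropy h2, in bits."""
    n = len(words)
    uni = Counter(words)
    big = Counter(zip(words, words[1:]))
    h1 = -sum(c / n * math.log2(c / n) for c in uni.values())
    nb = n - 1
    h_pair = -sum(c / nb * math.log2(c / nb) for c in big.values())
    first = Counter(words[:-1])
    h_first = -sum(c / nb * math.log2(c / nb) for c in first.values())
    return h1, h_pair - h_first

words = open("pliny.txt", encoding="utf-8").read().split()  # placeholder file
sizes = range(2000, len(words) + 1, 2000)                    # prefix lengths
points = [h1_h2(words[:n]) for n in sizes]

plt.plot([p[0] for p in points], [p[1] for p in points], "o-")
plt.xlabel("h1 (bits)")
plt.ylabel("h2 (bits)")
plt.title("Word entropies for increasing prefix length")
plt.show()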
I see what you mean, Rene: those are huge differences. No wonder I couldn't make much sense of the h2 graph.

What about h1-h2 though?
Marco tweaked nablator's code so that I can now limit it to the first n words; I used 5000 for this graph comparing h1 and h2. Looks better, right?

[attachment=3315]

Full data for this set is in the sheet WordEntropyN

It looks like Voynichese h1 or h2 is a little bit out of proportion?
Edit: also, is it expected that they are inversely proportional?
Next I ran it on 20,000 words. Marco's code automatically selected the files that were large enough, which was very convenient. I then isolated those same files and ran them on 2,000 words. The effect appears to be that the spread increases with word count: at 2,000 words, for example, three out of four Voynichese files overlap completely. Still, even at 20,000 words the languages cluster well, and the top-right drift of Voynichese becomes more apparent at that length.

[attachment=3316]
Koen, I read your blog posts about TTR and MATTR, and I agree that these are potentially very useful tools for telling signal from noise in the VMS and for narrowing our language search to languages with a similar statistical profile.
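
For readers who have not come across the measures: TTR is simply word types divided by word tokens, and MATTR averages the TTR over a sliding window of fixed length, so that texts of very different sizes remain comparable. A minimal Python sketch; the window size of 500 is only an illustrative choice, not necessarily what the blog posts used.

Code:
from collections import Counter

def ttr(words):
    """Type-token ratio: distinct word types / word tokens."""
    return len(set(words)) / len(words)

def mattr(words, window=500):
    """Moving-average TTR: mean TTR over every window of `window` tokens."""
    if len(words) <= window:
        return ttr(words)
    counts = Counter(words[:window])
    types_sum = len(counts)
    for i in range(window, len(words)):
        # slide the window one token to the right
        out_word, in_word = words[i - window], words[i]
        counts[out_word] -= 1
        if counts[out_word] == 0:
            del counts[out_word]
        counts[in_word] += 1
        types_sum += len(counts)
    n_windows = len(words) - window + 1
    return types_sum / (window * n_windows)

# Example:
# words = open("text.txt", encoding="utf-8").read().split()
# print(ttr(words), mattr(words, 500))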