The Voynich Ninja

Full Version: Word Entropy
Pages: 1 2 3 4 5 6 7 8 9
(16-09-2019, 10:51 AM)MarcoP Wrote: You could also try plotting this...

Percentage of theoretical max h2 compared to number of types (=TTR because tokens are constant)

[attachment=3322]
I find this a bit easier to think about using 1-(h2/(12.3-h1)) on the y-axis and types on the x-axis. So if I understand correctly (and my understanding is still rudimentary so do correct me if I'm wrong), this graph shows the extent to which a text uses structures/patterns/predictable word order. Texts with a lower TTR tend to score higher on structure (as expected).
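As a rough sketch of how the y-axis value above could be computed: the snippet assumes h1 is the entropy of single words and h2 is the conditional entropy of a word given the one before it (the posts here don't spell out the exact definition). The 12.3 is taken straight from the post; it looks like log2(5000) ≈ 12.29 for 5000-token samples, but that reading is an assumption. The function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def word_entropies(words):
    """Return (h1, h2) for a list of word tokens.

    h1: entropy of single words.
    h2: entropy of a word given the previous word, computed as
        H(word pairs) - H(first word of each pair). Assumed definition.
    """
    h1 = entropy(Counter(words))
    pairs = Counter(zip(words, words[1:]))
    firsts = Counter(words[:-1])
    h2 = entropy(pairs) - entropy(firsts)
    return h1, h2

def structure_score(h1, h2, h_max=12.3):
    """1 - h2 / (h_max - h1): near 1 = predictable word order, near 0 = none."""
    return 1 - h2 / (h_max - h1)
```

A text where every word fully determines the next one has h2 = 0 and scores 1; a text with no word-order structure at all should score close to 0.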

The VM is an outlier, with low word predictability for its number of types.

[attachment=3323]
Entropy plots appear not to separate different languages. In particular, [link] separate Italian from Spanish, while here the two are mixed together. Actually, most languages overlap in the central area.
But these last graphs show the possibility of separating VMS samples: an excellent result. In TTR graphs, Voynichese overlaps (at least) with Italian. 
The two methods appear to be complementary. I don't understand enough to interpret the overall results, but I think they could add to our knowledge of how Voynichese is both similar to and different from other languages.

A detail I don't remember: what is the VM sample near Timm's generated text?
The separate sample is always Q13, it's much different than all the rest of the VM!
(16-09-2019, 07:04 PM)Koen G Wrote: The separate sample is always Q13, it's much different than all the rest of the VM!

Thank you! Once again, it is confirmed that Timm and Schinner's model is accurate in many respects.
I just thought of an experiment, I don't know how well it's gonna work but I'd like to give it a try.
It's easy to shuffle words, for example here: [link]

What I'd like to do is focus on 4 files of 5000 words each. The first 5000 words of TT transcription, of Timm, of a high TTR text and a low TTR text. Reshuffle each a few times and save each shuffle. Then compare to the current plot and see if the shuffles distance themselves from the original.
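The shuffle test described above could be sketched like this. It is only an illustration, not the tool actually used in the thread: h2 is assumed to be the conditional entropy of a word given the previous word, and the function names, the seed, and the file name in the usage comment are all hypothetical.

```python
import math
import random
from collections import Counter

def h2(words):
    """Entropy (bits) of a word given the preceding word. Assumed definition."""
    def H(counts):
        n = sum(counts.values())
        return -sum(c / n * math.log2(c / n) for c in counts.values())
    return H(Counter(zip(words, words[1:]))) - H(Counter(words[:-1]))

def shuffle_trials(words, n_shuffles=4, seed=0):
    """Return (h2 of the original order, [h2 of each shuffled copy])."""
    rng = random.Random(seed)      # fixed seed so a run can be repeated
    original = h2(words)
    shuffled = []
    for _ in range(n_shuffles):
        copy = list(words)         # leave the original order untouched
        rng.shuffle(copy)
        shuffled.append(h2(copy))
    return original, shuffled

# Usage (file name is hypothetical):
# words = open("sample.txt", encoding="utf-8").read().split()[:5000]
# original, shuffled = shuffle_trials(words)
```

Because shuffling keeps the same words with the same counts, only the order-dependent measure should move.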
Let's remember that Torsten Timm's text was computer-generated, involving (presumably) a random number generator, while all other texts were human-generated. From all I have read I would say that it has been tuned to closely resemble the Voynich MS text in one way or another.

I consider that looking at the impact of randomly shuffling the words of texts is of great interest, and Koen is right that doing it just once is not enough.
It appears that only the h2 changes (which upon consideration is a confirmation that everything went well: shuffling leaves the word frequencies, and therefore h1, untouched).

I selected the Spanish text that bears David's seal of approval for low TTR, and Metamorphoses for high TTR.
In all four cases, the original text has the lowest h2.

xLat_Metamorphoses_5k.txt 1.458446853
xLat_Metamorphoses_Shuffle3.txt 1.477670897
xLat_Metamorphoses_Shuffle4.txt 1.481520466
xLat_Metamorphoses_Shuffle2.txt 1.481822421
xLat_Metamorphoses_Shuffle1.txt 1.48373768
xVM_TT_5k.txt 2.822321673
xVM_TT_Shuffle1.txt 2.864917871
xVM_TT_Shuffle4.txt 2.879660606
xVM_TT_Shuffle3.txt 2.881712268
xVM_TT_Shuffle2.txt 2.890436015
xTimm_5k.txt 3.381786872
xTimm_Shuffle1.txt 3.449490828
xTimm_Shuffle3.txt 3.467096717
xTimm_Shuffle2.txt 3.468241637
xTimm_Shuffle4.txt 3.468843325
xSP_Valles_5k.txt 3.310481035
xSP_Valles_Shuffle4.txt 3.655653799
xSP_Valles_Shuffle3.txt 3.666619409
xSP_Valles_Shuffle2.txt 3.67061522
xSP_Valles_Shuffle1.txt 3.672615272


I'm not sure how to best put this in a graph and how to interpret the results. Here's how much each shuffle adds to the original h2:
[attachment=3325]
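For anyone who wants to redo the subtraction, here is a minimal sketch using the h2 values listed above (shuffles in the order Shuffle1 to Shuffle4). The short labels and the helper name are made up for illustration.

```python
# h2 values copied from the list above: label -> (original, [Shuffle1..Shuffle4])
h2 = {
    "Metamorphoses": (1.458446853, [1.48373768, 1.481822421, 1.477670897, 1.481520466]),
    "VM_TT": (2.822321673, [2.864917871, 2.890436015, 2.881712268, 2.879660606]),
    "Timm": (3.381786872, [3.449490828, 3.468241637, 3.467096717, 3.468843325]),
    "SP_Valles": (3.310481035, [3.672615272, 3.67061522, 3.666619409, 3.655653799]),
}

def increases(original, shuffles):
    """How much each shuffle adds to the original h2, in bits."""
    return [s - original for s in shuffles]

for name, (original, shuffles) in h2.items():
    d = increases(original, shuffles)
    print(f"{name:14s} min +{min(d):.3f}  max +{max(d):.3f}")
```

On these numbers the Spanish text gains roughly 0.35 bits per shuffle, while the other three gain less than 0.1.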

Now I'm wondering though, the first 5000 words of the VM, doesn't that end up mixing Currier A and B?

Edit: I forgot to add that the difference between Latin and Spanish is expected, of course.
Edit2: I guess to be sure I have to add some separate VM sections...
Here's with the first 5000 words from Q13 and Q20 tested as well.

[attachment=3326]
Two things:

- the 'room for change' for h2 depends on how far away it is from the hypothetical limit in the original text, so the absolute changes are perhaps not the right way to measure it.

- the rather large spread in the changes for the Voynich MS TT text is a bit worrying. It shows that the shuffling does not necessarily generate a representative random example.
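One possible way to act on both points, offered only as a sketch: express each increase as a fraction of the room left below the limit, and summarise many shuffles by mean and standard deviation instead of looking at a handful. The limit is assumed here to be 12.3 - h1, borrowing the 12.3 used earlier in the thread. The h1 values are not given in these posts, so h1 has to be supplied, and the function names are hypothetical.

```python
import statistics

def relative_increase(h2_original, h2_shuffled, h1, h_max=12.3):
    """Increase in h2 as a fraction of the room left below the assumed limit.

    The limit for h2 is taken to be h_max - h1 (an assumption), so the
    room for change is (h_max - h1) - h2_original.
    """
    room = (h_max - h1) - h2_original
    return (h2_shuffled - h2_original) / room

def spread(values):
    """Mean and sample standard deviation over a set of shuffles."""
    return statistics.mean(values), statistics.stdev(values)
```

A value of 0.5 from relative_increase would mean the shuffle used up half of the available room. A large standard deviation relative to the mean would suggest running more shuffles before drawing conclusions.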