Ah, yes, I will try to address your first point.
I'm not entirely sure if the second point is worrying. Since the graphs only show the difference, any spread appears larger. In all cases, the difference between unshuffled and lowest shuffled is greater than the difference between lowest and highest shuffled.
These are the percentages h2/(12.3 - h1), i.e. h2 as a fraction of its theoretical maximum, taking 12.3 ≈ log2(5000). (A small sketch of the calculation follows the list.)
xLat_Metamorphoses_5k.txt 0.967690217
xLat_Metamorphoses_Shuffle1.txt 0.9844708671
xLat_Metamorphoses_Shuffle2.txt 0.9832000784
xLat_Metamorphoses_Shuffle3.txt 0.9804455116
xLat_Metamorphoses_Shuffle4.txt 0.9829997291
xSP_Valles_5k.txt 0.7930672279
xSP_Valles_Shuffle1.txt 0.8798210235
xSP_Valles_Shuffle2.txt 0.8793418862
xSP_Valles_Shuffle3.txt 0.8783846395
xSP_Valles_Shuffle4.txt 0.8757576902
xTimm_5k.txt 0.9133277188
xTimm_Shuffle1.txt 0.9316126973
xTimm_Shuffle2.txt 0.9366767755
xTimm_Shuffle3.txt 0.936367564
xTimm_Shuffle4.txt 0.9368392749
xVM_Q13_5k.txt 0.8997510321
xVM_Q13_s1.txt 0.9343812633
xVM_Q13_s2.txt 0.938978789
xVM_Q13_s3.txt 0.9361011263
xVM_Q13_s4.txt 0.933977632
xVM_Q20_5k.txt 0.953076171
xVM_Q20_s1.txt 0.9729300202
xVM_Q20_s2.txt 0.9718343942
xVM_Q20_s3.txt 0.9744684644
xVM_Q20_s4.txt 0.9774969691
xVM_TT_5k.txt 0.9385874706
xVM_TT_Shuffle1.txt 0.9527532045
xVM_TT_Shuffle2.txt 0.9612394838
xVM_TT_Shuffle3.txt 0.9583383262
xVM_TT_Shuffle4.txt 0.9576560283
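As a minimal sketch of that calculation in Python, assuming that 12.3 stands for log2(5000) ≈ 12.29 (as the (15 - h1) approximation used later in the thread suggests); the entropy values in the example are hypothetical:

[code]
from math import log2

def h2_fraction(h1, h2, n_tokens=5000):
    """h2 as a fraction of its theoretical maximum, log2(n_tokens) - h1."""
    return h2 / (log2(n_tokens) - h1)

print(h2_fraction(h1=10.0, h2=2.2))  # hypothetical entropies -> ~0.96
[/code]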
(16-09-2019, 09:41 PM)Koen G Wrote: Here's with the first 5000 words from Q13 and Q20 tested as well.
There's one point I could use some clarification on, Koen. Were the reshufflings cumulative, or was each reshuffling a different random output of the same (plaintext) input? I'm assuming it's the former, since this simulates an analog randomization process where every shake and shuffle builds upon the results of the last, but the entropy of the assortment approaches an upper limit.
If so, and I'm interpreting the data correctly, Timm's word salad really stands out, and the real VMS clusters with known real language samples. In all the histograms except Timm's, there's a jump up from the plaintext to the first reshuffling, then a noticeably smaller jump up for the second reshuffling, and an even smaller one for the third, such that I don't see the point of doing any more reshufflings. Timm's lorem ipsum soup, meanwhile, plateaus right away. What I see here is an assortment of values that was already almost as random as it could get, and hit essentially maximum entropy in one shake.
I'd say if the VMS was generated by some randomizing algorithm, it must have differed from the one Torsten Timm designed in some significant ways.
That's the problem: I'm not sure how the reshuffling on the site I used worked. I just pressed reshuffle each time. But if it shuffles well, shouldn't each successive shuffle be equally random? Maybe I should force cumulative shuffling by pasting the results into the input and see whether that makes a difference.
Note that the graph shows the difference between shuffles and plaintext. So the original is actually the baseline. Also, the graph sorted the data. In the list I posted above, you'll see that the first shuffle doesn't necessarily correspond to the lowest value. Whether these fluctuations are within acceptable bounds or not is something I might test in the way described above.
What does set Timm's text apart somewhat is that its number of word types is really low, like English's, while its h2 is more like Latin's. Normally we would expect a language to compensate for a low number of word types with a more fixed word order, i.e. a lower h2. The VM texts are an outlier in this respect as well, but Timm's is more so.
I don't know to what extent the relatively low h2 (compared to TTR) is problematic. But looking at all tested texts (see scatter plot above), this does put the VM and Timm's text apart from the rest.
It's due to the problem of sampling with word-h2. The word pairs are highly undersampled, so counts of pairs are typically small numbers and small variations due to chance will be visible in h2.
Note that differences are in the second decimal (except for Spanish), while h1 is of the order of magnitude 10.
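To make the undersampling concrete, here is a minimal sketch (not the code actually used in these tests) of word-level h1 and conditional h2; it also counts how many word pairs occur only once. Reading the sample from a local file is an assumption:

[code]
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def word_h1_h2(words):
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    h1 = entropy(unigrams)
    h2 = entropy(bigrams) - h1  # H(next | current) ~= H(pair) - H(word)
    return h1, h2

words = open("xVM_Q13_5k.txt").read().split()  # whitespace-separated words
h1, h2 = word_h1_h2(words)
pairs = Counter(zip(words, words[1:]))
once = sum(1 for c in pairs.values() if c == 1)
print(f"h1={h1:.3f}  h2={h2:.3f}  {once}/{len(pairs)} pairs occur only once")
[/code]

With only ~5000 tokens, singleton pairs dominate the bigram distribution, which is exactly why small chance variations become visible in h2.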
(17-09-2019, 06:12 AM)Koen G Wrote: That's the problem: I'm not sure how the reshuffling on the site I used worked. I just pressed reshuffle each time. But if it shuffles well, shouldn't each successive shuffle be equally random? Maybe I should force cumulative shuffling by pasting the results into the input and see whether that makes a difference.
Correct me if I'm wrong, because this is not my area at all, but it's my understanding that computer-generated randomness is fundamentally pseudo-random, yet close enough for most statistical tests. I've heard it said that a computer is ultimately incapable of true randomness, and humans are not nearly as good at it as we think we are. I remember being prompted for a "random number seed" when writing programs in BASIC as a kid in the '80s, and being intrigued that the computer needed more than just "gimme any old number," like you could say to a person. Since then, computer simulation has achieved a high level of verisimilitude for a lot of processes, and I'd guess that randomization is one of them.
Still, I think inquiring into the code the webpage uses to shuffle the letters, and making sure it's a good enough model for pulling letters out of a well-shaken bag, is a great idea for improving this test.
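For what it's worth, here's a minimal sketch of that point, assuming the site uses a proper uniform shuffle (Python's random.shuffle implements Fisher-Yates); the word list is just a toy stand-in:

[code]
import random

words = "daiin ol chedy shedy qokeedy aiin".split()  # toy stand-in for a text

random.seed(1234)       # seeding only makes the run reproducible
once = words[:]
random.shuffle(once)    # one pass already yields a uniform random permutation

twice = once[:]
random.shuffle(twice)   # a second pass is no "more random" than the first

print(once)
print(twice)
[/code]

If the page really does shuffle uniformly, cumulative reshuffling should make no measurable difference; if it does make one, that would itself be informative.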
Looks like my interpretation of your data missed the mark a bit. Pity, that. I was hoping to see the VMS cluster closer to the natural languages, and it looks like I... saw what I wanted to see.
I foresee a trend of scientists and mathematicians, all under the given assumption that the VMS is a medieval hoax, making it their challenge to reverse engineer the VMS's text. In fact, I could see this becoming the predominant trend in Voynich research circles, with theorists who doubt the hoax hypothesis becoming more peripheral. The treasure is an algorithm that is not only plausible and practical for 15th-century Europe, but produces an output that is nearly indistinguishable from the real VMS text, for any amount of output, over many trials. In addition to 15 minutes of fame, an accomplishment like this could be a real ticket into a career in IT or informatics. (And into the well-heeled and well-connected sceptic scene, if that's their cup o' joe.)
Anyhow, I think your test will be very helpful for vetting the many Voynich-bots about to be built.
There can be only one!
(17-09-2019, 07:24 AM)ReneZ Wrote: highly undersampled
What's the number to use to calculate the theoretical max for 35,000 words?
I tried the experiment again with 35,000 words per text. Unfortunately I had to drop Timm since the sample I had was c. 10,000. If someone can obtain a 35,000 word sample I'll include it.
Just to make sure: I always used the shuffled text as input for the next shuffle, to see if there was a cumulative effect. I don't think there was one.
Then I wanted to find out what effect the increased text size had on consistency. I did this by dividing the lowest shuffled h2 for each text by the highest (a small sketch of this check follows the lists below).
For the previous test with 5000 words, this was:
99.6%
99.5%
99.4%
99.5%
99.4%
99.1%
In the current test with 35,000 words, the shuffled h2 values were more consistent, with the lowest value at ca. 99.8% of the highest:
99.9%
99.8%
99.8%
99.8%
It's probably not quite enough yet, but I've reached the limits of the VM text size.
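A minimal sketch of this check, reusing the word_h1_h2 helper sketched earlier in the thread and feeding each shuffle's output into the next, as described above:

[code]
import random

def shuffle_consistency(words, n_shuffles=4, seed=0):
    """Lowest shuffled h2 divided by the highest, over cumulative shuffles."""
    random.seed(seed)
    current = list(words)
    h2_values = []
    for _ in range(n_shuffles):
        random.shuffle(current)  # reshuffles the previous pass's output
        h2_values.append(word_h1_h2(current)[1])
    return min(h2_values) / max(h2_values)

# e.g. print(f"{shuffle_consistency(words):.1%}")  # ca. 99.8% at 35,000 words
[/code]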
Averaging out the four shuffles for each text, the results are as follows (this should become a table, hope it works):
text | h0 | h1 | h2 | h2 (avg. of 4 shuffles) | TTR |
Barlaam | 12.02583167 | 9.329523064 | 4.737718001 | 5.212352078 | 0.119170096 |
Matthioli | 13.18920683 | 11.21191871 | 3.388789001 | 3.732572534 | 0.267055527 |
Pliny | 13.62798978 | 11.820678 | 3.029928693 | 3.174719428 | 0.3617142857 |
VM_TT | 12.89500718 | 10.40921542 | 4.29429829 | 4.461228079 | 0.2176285714 |
Rene, how can theoretical max h2 be best taken into account?
This graph is a summary of what I've found about the 35,000-word texts. It shows the percentage of theoretical maximum h2 plotted against TTR. This sounds complicated, but actually it's not.
(Note: for max h2 I used (15 - h1); since log2(35,000) ≈ 15.1, this value is probably a bit low, but not by much, and it makes no difference for the relative distance between values.)
[attachment=3331]
The top row of dots, connected by the pink line, shows the scrambled values; the bottom row shows the original values. Scrambled values are on top because their entropy is obviously higher. Entropies are not expressed as absolute numbers, however, but as a percentage of the theoretical maximum entropy for each text.
TTR for scrambled and unscrambled texts is identical, because the words stay the same; they are just in a different order.
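Since TTR only counts distinct words, a tiny sketch shows why shuffling cannot change it (the input file is an assumed local word list):

[code]
import random

def ttr(words):
    return len(set(words)) / len(words)  # word types / word tokens

words = open("xVM_TT_5k.txt").read().split()
shuffled = list(words)
random.shuffle(shuffled)
assert ttr(words) == ttr(shuffled)  # word order never enters the calculation
[/code]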
Observations:
* Focusing on the pink curve, we see that it is pretty consistent: as the number of word types increases, so does the entropy of the scrambled text.
* The bottom line, h2 of the original texts, varies much more.
* It is tempting to imagine a line from the bottom left dot (Barlaam unscrambled) through Matthioli and Pliny. The VM and Timm would be above this line.
Another way to visualize this is to plot the difference between scrambled and unscrambled forms of each text. For Pliny, this value is really small already, and the VM barely clings on. Timm's is smaller still.
[attachment=3332]
Hi Koen,
I believe that your approach of repeatedly shuffling and taking the average values should allow you to get reliable results with shorter texts as well.
I guess that, if you check 10,000-word samples, Timm's generated text will again be close to Q13. The higher TTR and entropy of the whole VMS are likely due to the fact that it is made of different sections, with apparently different subjects and "languages" (in Currier's sense), while Timm's text is uniform (it was created by a single run of his software).