Let's be very careful.
We need to be as critical about these statistics and their interpretation as about those of Torsten.
What is not a problem is the statistical significance: all results (Torsten's and these) are based on large text samples.
Now to start with Koen's last remark.
It was to be expected that all three modified versions of Machiavelli and Pliny would be closer to the Voynich text than the originals. However, it was not expected - at least not by me - that the modified Machiavelli would be this close to the Voynich text.
Obviously, the two originals behave completely differently - see the dark purple and light blue bars.
The modified versions of Pliny only impose a word structure, and both are even more oriented towards small edit distances than Voynichese. This suggests that these word structures are perhaps more rigid than Voynichese. pliny_mod1 uses Roman numerals, which indeed 'feels' more rigid than Voynichese.
pliny_mod2 is interesting, because its structure allows one to predict the frequencies of the various edit distances. For all words that are not identical (edit distance 0), the ratio should be 20 : 5 : 1 for edit distances 1, 2 and 3, and smaller again for 4.
Since edit distance 0 accounts for 19%, that leaves 81% for the others, and the predicted frequencies would be 62%, 15.5% and 3%, which fits almost exactly.
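To make the arithmetic explicit, here is a small sketch that derives these predicted percentages from the 19% share at edit distance 0 and the 20 : 5 : 1 ratio quoted above:

```python
# Predicted edit-distance frequencies for pliny_mod2, derived from
# the 20 : 5 : 1 ratio for edit distances 1, 2, 3 quoted above.
ratio = {1: 20, 2: 5, 3: 1}
identical = 0.19                 # share at edit distance 0
remainder = 1.0 - identical      # the remaining 81%

total = sum(ratio.values())      # 26 ratio units in all
predicted = {d: 100 * remainder * r / total for d, r in ratio.items()}

for d, pct in sorted(predicted.items()):
    print(f"edit distance {d}: {pct:.1f}%")
# → edit distance 1: 62.3%
#   edit distance 2: 15.6%
#   edit distance 3: 3.1%
```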
So now, what does this very close match between the modified Machiavelli and Voynichese mean?
I would say that it answers exactly this question of Marco's:
Quote:For instance, in the VMS, 87% of words have a distance from a previous word in the same page that is smaller than 3. Is this value enough to confirm Timm's theory?
The answer is 'No', because a meaningful text that was not generated through auto-copying shows exactly the same statistics.
Now does this show that the Voynich text was not necessarily generated by auto-copying?
Here, too, I have to say: no, or at least not yet. This has not yet been demonstrated.
This is mainly because Marco's test doesn't exactly test Torsten's method. That method says that first lines of paragraphs are not copied from the same page but from previous first lines of paragraphs. This means that these lines should not themselves be searched for matching previous words; instead, they should serve as sources for matching words in later text on the same page.
On average, probably more than 13% of all words are on first lines of paragraphs, keeping in mind that the average paragraph length is strongly affected by the short paragraphs in quire 20.
This would be a more complicated test. For the modified Machiavelli text this point is meaningless.
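For what it's worth, such an adjusted test could look roughly as follows. This is my own sketch, not Torsten's actual implementation; the example page and its paragraph flags are made-up placeholders:

```python
# Sketch of the adjusted test: words on first lines of paragraphs are
# used only as SOURCES for later matches, never tested themselves.
# Illustration only - not Torsten Timm's actual code.

def levenshtein(a: str, b: str) -> int:
    # standard dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def adjusted_min_distances(lines, is_first_line):
    # lines: one word list per line of the page;
    # is_first_line[i]: True when lines[i] opens a paragraph
    sources, results = [], []
    for words, first in zip(lines, is_first_line):
        if not first:
            for w in words:
                if sources:
                    results.append(min(levenshtein(w, s) for s in sources))
        sources.extend(words)   # first-line words still feed the pool
    return results

# Made-up example page: two paragraphs, first lines flagged True.
lines = [["daiin", "chedy"], ["shedy", "daiin"],
         ["qokedy"], ["chedy"]]
print(adjusted_min_distances(lines, [True, False, True, False]))
# → [1, 0, 0]
```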
However, we can still learn more. In the modified Machiavelli the text is not page-oriented: the page breaks were introduced at arbitrary points.
So, one can predict that the minimum edit distance will vary over each page as follows:
Since the appearance of similar words is purely a matter of chance, the average will be higher near the top of each page, where there are simply fewer words to compare with. As one goes down the page, the average will decrease, as there are more and more words to compare with.
This would be the typical behaviour of a text that was not created using auto-copying.
For a text that was created using auto-copying, such a trend should not exist, and the minimum edit distance should be much more uniform over each page.
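This prediction is easy to probe: compute, for each word on a page, the minimum edit distance to all preceding words on that page, and check whether the values drift downwards. A minimal sketch in Python (the sample page is a made-up placeholder, not real transliteration data):

```python
# Minimum-edit-distance profile down a page: for each word, the
# smallest edit distance to any preceding word on the same page.

def levenshtein(a: str, b: str) -> int:
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def min_distance_profile(words):
    # distance of each word to its closest predecessor on the page;
    # the very first word has nothing to compare with and is skipped
    return [min(levenshtein(w, p) for p in words[:i])
            for i, w in enumerate(words) if i > 0]

# Made-up sample page (not real corpus data).
page = "daiin chedy qokeedy daiin shedy qokedy chedy okaiin".split()
print(min_distance_profile(page))
```

On a chance-driven text the early entries of this profile should tend to be larger than the later ones, simply because the pool of comparison words grows down the page.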
In fact, if this could be visualised (e.g. using colour-coded words), one should see more or less constant behaviour throughout the text, except for the first lines of paragraphs, which should show higher values, since their words were also taken from earlier pages.