With the recent questions about how MATTR could be used to evaluate nonsense text and the effects of shuffling words, I decided to dust off nablator's Java code.
Here are some examples of randomly selected texts from my corpus, in Latin, Italian and German. Values have been normalized.
[attachment=7176]
Using MATTR, we can see both language-based tendencies and unusual behavior of specific texts. In this random selection, Latin texts consistently have higher MATTR values at every window size (that is, a more diverse use of vocabulary). The Italian and German texts, on the other hand, are much more difficult to tell apart.
We can also see unusual behavior in individual texts. For example, look at the lowest yellow line in this selection. The text Ger_Mechtild does something in the 10-50 word window range that causes its value to plummet there. If you were to take 5-word chunks from this text, TTR would not be abnormally low. But take 20-word windows, and you would notice a relatively high number of repeated words (compared to the average). Then take a larger window again, like 1000 words, and the value would be closer to average again.
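For anyone who wants to reproduce this kind of curve, here is a minimal sketch of the usual MATTR definition: slide a fixed-size window over the text one word at a time, compute the TTR (distinct words divided by window size) in each position, and average the results. This is not nablator's code, and it does not include the normalization used for the graphs. The class name `MattrSketch` and the toy input are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not taken from nablator's program.
class MattrSketch {

    /** Mean type-token ratio over every window of `window` consecutive tokens. */
    static double mattr(String[] tokens, int window) {
        if (window <= 0 || window > tokens.length) {
            throw new IllegalArgumentException("window must be in 1..tokens.length");
        }
        // Word counts for the tokens currently inside the window.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < window; i++) {
            counts.merge(tokens[i], 1, Integer::sum);
        }
        double sum = (double) counts.size() / window;
        int windows = 1;

        // Slide by one token: drop the oldest word, add the newest.
        for (int i = window; i < tokens.length; i++) {
            String leaving = tokens[i - window];
            if (counts.merge(leaving, -1, Integer::sum) == 0) {
                counts.remove(leaving);
            }
            counts.merge(tokens[i], 1, Integer::sum);
            sum += (double) counts.size() / window;
            windows++;
        }
        return sum / windows;
    }

    public static void main(String[] args) {
        // Toy input only; a real run would read a tokenized transliteration.
        String[] tokens = "a a b a a b c d e f".split(" ");
        for (int w : new int[] {2, 5, 10}) {
            System.out.printf("window %d: %.3f%n", w, mattr(tokens, w));
        }
    }
}
```

Calling `mattr` for a range of window sizes and plotting the results against window size gives one line per text, like the ones in the graphs.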
Now let's add Voynich texts in green:
[attachment=7177]
We see the now well-known property that their small-window TTR is low. But already at window sizes of 50-100 words, the B-language samples brush against the Latin texts. This indicates relatively high vocabulary variation, but also frequent reduplication over short spans (a long-known and often discussed property of Voynichese).
Finally, I shuffled the words in Q20, running it through two different web tools in succession. This is how it compares, now only including the Voynich samples. I added Timm as well.
[attachment=7178]
A lot can be seen in this graph. Unmodified EVA samples (red, orange, two greens) all follow the same shape, which is remarkable in itself. Even though the TTR of A-samples is poorer (and Q13 even more so), the TTR distribution evolves very similarly. Basically a steep rise until 50, then very slightly down towards 1000.
Timm's sample follows largely the same shape, as we have seen before. It is very Voynichese-like in overall MATTR, so this is certainly not a criticism. Still, we can use it to see how similar the Voynichese lines actually are. The purple line stands out because it starts closest to Herbal A, then dips below Q13, only to catch up with it by 1000-word windows.
Finally, let's compare the shuffled Q20 (blue line) with the unshuffled one (red line). As expected, the difference is large on the left (3-word windows). By 1000 words, the effect of the shuffling is gone.
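The shuffle comparison can be sketched in the same way. Shuffling keeps exactly the same words and only changes their order, so with a window as long as the whole text the two versions must give the same value. Any difference at small windows comes from word order alone. The snippet below is only an illustration under my own assumptions: the class `ShuffleCompare`, the fixed seed and the toy word list are made up, and I used web tools rather than this code for the actual shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; names and toy data are assumptions.
class ShuffleCompare {

    /** Plain (unoptimized) MATTR: average TTR over all windows of the given size. */
    static double mattr(List<String> tokens, int window) {
        int n = tokens.size();
        if (window <= 0 || window > n) {
            throw new IllegalArgumentException("window must be in 1..tokens.size()");
        }
        int windows = n - window + 1;
        double sum = 0.0;
        for (int start = 0; start < windows; start++) {
            // Distinct words in this window divided by the window size.
            int types = new HashSet<>(tokens.subList(start, start + window)).size();
            sum += (double) types / window;
        }
        return sum / windows;
    }

    /** Returns a shuffled copy; the seed only makes the run repeatable. */
    static List<String> shuffled(List<String> tokens, long seed) {
        List<String> copy = new ArrayList<>(tokens);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        // Toy text with many adjacent repeats, standing in for a real sample.
        List<String> original = Arrays.asList("a a b b c c d d e e f f".split(" "));
        List<String> mixed = shuffled(original, 1L);
        for (int w : new int[] {2, 4, original.size()}) {
            System.out.printf("window %d: original %.3f, shuffled %.3f%n",
                    w, mattr(original, w), mattr(mixed, w));
        }
    }
}
```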