Here is the next step in my continuing struggle to do something more meaningful with shuffling and TTR statistics (and visualize the whole thing).
First, I shuffled the words for my entire corpus (this was easily done with a python script written by ChatGPT).
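For anyone who wants to reproduce the shuffling step, here is a minimal sketch. It is not the actual script; it assumes words are separated by whitespace, and the function name `shuffle_words` and the `seed` parameter are made up for illustration.

```python
import random

def shuffle_words(text, seed=None):
    """Return the whitespace-separated words of `text` in random order."""
    tokens = text.split()
    rng = random.Random(seed)  # seeding is an assumption, for repeatable runs
    rng.shuffle(tokens)        # in-place shuffle of the token list
    return tokens

original = "the quick brown fox jumps over the lazy dog and the cat"
shuffled = shuffle_words(original, seed=42)
```

The shuffled list contains exactly the same words with the same frequencies, so any change in TTR can only come from word order.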
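Then, I divided shuffled MATTR values by nonshuffled values. This ratio tells us to what extent shuffling increases vocabulary entropy over a given window: above 1, the shuffled text is more varied than the original in that window; below 1, it is less varied.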
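To make the ratio concrete, here is a rough sketch of how it could be computed. MATTR is the moving-average type-token ratio: the number of distinct words in each window of a fixed size, divided by the window size, averaged over all window positions. The names `mattr` and `shuffle_ratio` are illustrative, and I'm assuming a single shuffle per text.

```python
import random
from collections import Counter

def mattr(tokens, window):
    """Mean of (distinct words / window) over every run of `window` tokens."""
    n = len(tokens)
    if window <= 0 or window > n:
        raise ValueError("window must be between 1 and len(tokens)")
    counts = Counter(tokens[:window])   # word counts in the current window
    total_types = len(counts)
    for i in range(window, n):
        leaving = tokens[i - window]    # word that slides out of the window
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1          # word that slides into the window
        total_types += len(counts)
    return total_types / (window * (n - window + 1))

def shuffle_ratio(tokens, window, seed=0):
    """Shuffled MATTR divided by unshuffled MATTR for one window size."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return mattr(shuffled, window) / mattr(tokens, window)

# A heavily clustered toy text: shuffling mixes the two halves together.
clustered = ["a"] * 50 + ["b"] * 50
ratio = shuffle_ratio(clustered, window=10)
```

For the clustered toy text the ratio comes out well above 1, because almost every shuffled window contains both words while most original windows contain only one.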
As noted before, we will observe two opposed phenomena in the statistics. On the one hand, shaking up the text will undo clusterings of vocabulary, which is something we will notice over larger windows. On the other hand, and perhaps less intuitively, shuffling will cause common words to end up next to each other more often. Written language will avoid "this this" or "and and", but these words are so common that they will sometimes be grouped in a shuffled text. Therefore, we expect entropy to decrease for small windows, but to start increasing for larger windows.
To put it differently, for very small windows, shuffling will undo the natural tendency of languages to avoid immediate repetition of the same word. But for larger windows, shuffling will undo any (thematic?) clustering in the text.
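The "accidental clumping" effect is easy to see on a toy sentence. The sketch below counts immediate repetitions (the same word twice in a row) in an invented sentence and in many shuffles of it; the sentence, the helper name `adjacent_repeats` and the number of trials are all made up for illustration.

```python
import random

def adjacent_repeats(tokens):
    """Number of positions where a word is immediately repeated."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)

# Invented sentence: "the" and "and" are frequent but never adjacent.
tokens = "the cat and the dog and the bird saw the fox and the owl".split()

rng = random.Random(0)
trials = 1000
total = 0
for _ in range(trials):
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    total += adjacent_repeats(shuffled)

mean_shuffled = total / trials   # average repeats per shuffled version
original_repeats = adjacent_repeats(tokens)
```

The original sentence has no immediate repeats at all, while the shuffled versions average more than one, simply because "the" and "and" are so frequent.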
It might be more interesting to flip it around and ask: what is the effect of text structure compared to the same words in a random order? Here are a few examples:
The 1.00 line is important here. Below the line, linguistic properties make the vocabulary of the real text less repetitive than chance. Above the line, shuffling has undone some (thematic?) clustering that was observable over these windows. We see that the first effect is moderately present for windows up to ca. 6 to 15 words. After that, however, the graphs shoot up quickly.
Then I wanted to know at which point texts will typically cross the 1.00 line. Where does the "stirring the soup" effect overtake the "accidental clumping" effect? The following graph counts the number of texts in the corpus that reach 1.00 at a given window:
Strangely, there are two spikes, at 6 and at 12. I don't know why this is the case, but for many texts, shuffling starts creating more entropy from these windows onward. Maybe we could say that, compared to chance, it is relatively unlikely for many texts to reuse the same word within a 6-word stretch, while for many other texts this "do not reuse" window is larger, at 12 words. (I realize this is probably too much of an abstraction, but I'm trying to explain it in an intuitive way.) For almost half of the texts in the corpus, the point where shuffling starts increasing entropy is at either the 6-word or the 12-word window. I didn't measure 11, though, so texts that cross at 11 are counted towards 12's total; the spike at 6 is therefore the more important one.
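Here is a sketch of how the crossing point could be tallied. I'm assuming each text is stored as a mapping from window size to shuffled/unshuffled ratio, and that "crossing" means the first measured window at or above 1.00 that follows a window below 1.00. The text names and ratio values are invented, not taken from the corpus.

```python
from collections import Counter

def first_upward_crossing(ratios_by_window, threshold=1.0):
    """First measured window at/above threshold that follows one below it."""
    previous_below = False
    for window in sorted(ratios_by_window):
        value = ratios_by_window[window]
        if value >= threshold and previous_below:
            return window
        previous_below = value < threshold
    return None  # never crosses upward within the measured windows

# Invented ratios; only windows 2, 4, 6 and 12 are "measured" here.
corpus = {
    "text_a": {2: 0.97, 4: 0.98, 6: 1.01, 12: 1.05},
    "text_b": {2: 0.96, 4: 0.97, 6: 0.99, 12: 1.02},
    "text_c": {2: 1.03, 4: 1.01, 6: 0.99, 12: 1.04},
}
crossings = Counter(first_upward_crossing(r) for r in corpus.values())
```

Because only measured windows can be returned, a text whose true crossing lies at an unmeasured window is attributed to the next measured one, which is exactly how unmeasured 11 ends up in 12's total.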
For many texts, the difference between shuffled and non-shuffled keeps increasing as we move towards 1000-word windows. This suggests that these texts have "themes" in their vocabulary that run over large windows, and the shuffling undoes this. Other texts, however, peak at lower windows (most often 250), after which the effect of shuffling decreases again.
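The two behaviours can be told apart mechanically by looking at where the ratio peaks. A small sketch, again assuming a window-to-ratio mapping per text; the function names and the numbers are invented for illustration.

```python
def peak_window(ratios_by_window):
    """Window size at which the shuffled/unshuffled ratio is largest."""
    return max(ratios_by_window, key=ratios_by_window.get)

def keeps_rising(ratios_by_window):
    """True if the ratio peaks at the largest measured window."""
    return peak_window(ratios_by_window) == max(ratios_by_window)

# Invented values for the two kinds of text described above.
rising = {50: 1.02, 250: 1.05, 500: 1.08, 1000: 1.11}
peaked = {50: 1.02, 250: 1.06, 500: 1.04, 1000: 1.03}
```

Here `keeps_rising(rising)` is true, while `peaked` has its maximum at the 250-word window.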
Now the important question is whether this provides a "signature" for Voynichese. Here are four Voynich sections graphed out:
I'd say that generally, there is a signature: Voynichese texts tend to start above 1. This means that Voynichese repeats words often in small windows (a-a-b, a-b-a...), more so than "accidental clumping" would cause. Interestingly, this effect disappears between 5 and 6, especially for Herbal B. Then, "stirring the soup" takes over, and shuffling increases entropy again.
Focus on the green line of Herbal B now, since it is quite interesting. We start above 1, which means that in 2-word windows, Voynichese creates more pairs (qokedy qokedy) than chance. But between 6 and 10, it is under 1, which means that the effect of Voynichese repeating words in small windows is now more than cancelled out.
However, above 15, the graph shoots up rapidly. This means that Voynichese has certain vocabulary consistencies that span larger portions of text, but are not caused by low-window phenomena.
I will now isolate Q20 as a typical Voynich graph and compare it to the other texts that start above 1.
Above are all texts where shuffling does not cause small-window TTR to decrease by "accidental clumping". There are very few: one Latin, one German, one Italian and Timm's sample. Only one of these texts, the Italian "Fabula" in green, has a similar signature to Voynichese, though it must be said that its TTR at larger windows (not on the graph) is much higher.
Finally, here's a scatter plot comparing the values for 3 (small window) with those for 1000 (large window). We see how the Voynichese dots (green) are in the bottom right, which is their "signature" so to speak. The sample generated by Timm's method is also an outlier, but in a different corner.
All data is in the public share file, tab "shuffle".