
VM TTR values

RE: VM TTR values

Aga Tentakulus > 20-02-2023, 10:54 PM

I do not understand your statement.
"CT,CK,CT" ?
I think you should take a closer look.
RE: VM TTR values

nablator > 20-02-2023, 11:41 PM

(20-02-2023, 09:19 PM)Aga Tentakulus Wrote: Have you ever seen EVA "ch" without "c"?
Yes, there are two on [link hidden].
RE: VM TTR values

Koen G > 17-03-2023, 04:44 PM

Here is the next step in my continuing struggle to do something more meaningful with shuffling and TTR statistics (and visualize the whole thing).

First, I shuffled the words of each text in my entire corpus (this was easily done with a Python script written by ChatGPT).
Then, I divided shuffled MATTR values by nonshuffled values. What this tells us is to what extent shuffling increases vocabulary entropy over a given window.
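
The procedure described above could be sketched roughly as follows. This is not the script used for the post: the function names `mattr` and `shuffle_ratio`, the fixed random seed, and the choice to count every overlapping window are all assumptions made for illustration.

```python
import random


def mattr(words, window):
    """Moving-average type-token ratio.

    Averages (distinct words / window size) over every overlapping
    window of `window` consecutive words.
    """
    if window <= 0 or window > len(words):
        raise ValueError("window must be between 1 and len(words)")

    # Count the words in the first window.
    counts = {}
    for w in words[:window]:
        counts[w] = counts.get(w, 0) + 1
    total = len(counts) / window
    n_windows = 1

    # Slide the window one word at a time, updating the counts.
    for i in range(window, len(words)):
        leaving = words[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[words[i]] = counts.get(words[i], 0) + 1
        total += len(counts) / window
        n_windows += 1

    return total / n_windows


def shuffle_ratio(words, window, seed=0):
    """MATTR of a shuffled copy divided by MATTR of the original order."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    shuffled = list(words)
    rng.shuffle(shuffled)
    return mattr(shuffled, window) / mattr(words, window)
```

A ratio above 1.00 at a given window means the shuffled text has the more varied vocabulary at that window size; a ratio below 1.00 means the original word order does. A single shuffle is noisy on short texts, so averaging over several seeds may be worth considering.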

As noted before, we will observe two opposed phenomena in the statistics. On the one hand, shaking up the text will undo clusterings of vocabulary, which is something we will notice over larger windows. On the other hand though, and perhaps less intuitively, shuffling will cause more common words to appear in each other's proximity. Written language will avoid "this this" or "and and", but these words are so common that they will sometimes be grouped in a shuffled text. Therefore, we expect entropy to decrease for small windows, but to start increasing for larger windows.

To put it differently, for very small windows, shuffling will undo the natural tendency of languages to avoid immediate repetition of the same word. But for larger windows, shuffling will undo any (thematic?) clustering in the text.
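
The "accidental clumping" effect at the smallest window can be worked out without shuffling at all. The sketch below (the function name is hypothetical, not from the post) computes the expected 2-word-window TTR of a uniformly shuffled text: if p is the chance that two neighbouring words in the shuffle are identical, each window holds one type with probability p and two types otherwise, so the expected TTR is 1 - p/2.

```python
from collections import Counter


def expected_shuffled_ttr2(words):
    """Expected 2-word-window TTR after a uniform random shuffle."""
    n = len(words)
    if n < 2:
        raise ValueError("need at least two words")
    counts = Counter(words)
    # Probability that two distinct positions hold the same word.
    p_same = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    # One type with probability p_same, two types otherwise, divided by 2.
    return 1 - p_same / 2
```

For a text that never repeats a word immediately, the unshuffled 2-word TTR is exactly 1, so any repeated vocabulary at all pushes the shuffled/unshuffled ratio below 1.00 at this window.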

It might be more interesting to flip it around and ask: what is the effect of text structure compared to the same words in a random order? Here are a few examples:



The 1.00 line is important here. Below the line, linguistic properties cause the vocabulary to be less repetitive than chance. Above the line, shuffling has undone certain (thematic?) clustering that was observable over these windows. We see that the first effect is moderately present for windows up to ca. 6 to 15 words. After that, however, the graphs shoot up quickly.

Then I wanted to know at which point texts will typically cross the 1.00 line. Where does the "stirring the soup" effect overtake the "accidental clumping" effect? The following graph counts the number of texts in the corpus that reach 1.00 at a given window:



Strangely, there are two spikes at 6 and at 12. I don't know why this is the case, but for many texts, shuffling starts creating more entropy after these windows. Maybe we could say that, compared to chance, it is relatively unlikely for many texts to reuse the same word within a 6-word stretch, while for many other texts this "do not reuse"-window is larger, at 12 words. (I realize this is probably too much of an abstraction, but I'm trying to explain it in an intuitive way). For almost half of the texts in the corpus, the point where shuffling starts increasing entropy is either at 6 or at 12-word windows. I didn't measure 11 though, which will be counted towards 12's total, so especially 6 is important.
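
One way to count these crossing points is sketched below. It assumes each text's ratios are stored per measured window and that "reaching 1.00" means the smallest measured window with a ratio of at least 1.00; both the data layout and the function names are hypothetical. Because only measured windows are checked, a text that really crosses at an unmeasured window (such as 11) is attributed to the next measured one (12), as noted above.

```python
from collections import Counter


def first_crossing(ratios_by_window):
    """Smallest measured window with ratio >= 1.0, or None if it never gets there."""
    for window in sorted(ratios_by_window):
        if ratios_by_window[window] >= 1.0:
            return window
    return None


def crossing_histogram(corpus_ratios):
    """Count how many texts first reach 1.0 at each measured window."""
    return Counter(first_crossing(r) for r in corpus_ratios.values())


# Made-up ratios purely to show the data shape; not values from the share file.
example = {
    "text_a": {2: 0.98, 6: 1.01, 12: 1.05},
    "text_b": {2: 0.97, 6: 0.99, 12: 1.02},
    "text_c": {2: 0.90, 6: 0.95, 12: 0.99},
}
histogram = crossing_histogram(example)
```

Texts that already start above 1.00, like the Voynich sections discussed further down, would be counted at the smallest window under this definition, so they may need separate handling.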

For many texts, the difference between shuffled and non-shuffled keeps increasing as we move towards 1000-word windows. This suggests that these texts have "themes" in their vocabulary that run over large windows, and the shuffling undoes this. Other texts, however, peak at lower windows (most often 250), after which the effect of shuffling decreases again.

Now the important question is, whether this provides a "signature" for Voynichese. Here are four Voynich sections graphed out:



I'd say that generally, there is a signature: Voynichese texts tend to start above 1. This means that Voynichese repeats words often in small windows (a-a-b, a-b-a...), more so than "accidental clumping" would cause. Interestingly, this effect disappears between 5 and 6, especially for Herbal B. Then, "stirring the soup" takes over, and shuffling increases entropy again.

Focus on the green line of Herbal B now, since it is quite interesting. We start above 1, which means that in 2-word windows, Voynichese creates more repeated pairs (qokedy qokedy) than chance. But between 6 and 10, it is under 1, which means that the effect of Voynichese low-window repeating of words is now more than cancelled out. However, above 15, the graph shoots up rapidly. This means that Voynichese has certain vocabulary consistencies that span larger portions of text, but are not caused by low-window phenomena.

I will now isolate Q20 as a typical Voynich graph and compare it to the other texts that start above 1.



Above are all texts where shuffling does not cause small-window TTR to decrease by "accidental clumping". There are very few: one Latin, one German, one Italian and Timm's sample. Only one of these texts, the Italian "Fabula" in green, has a similar signature to Voynichese, though it must be said that its TTR at larger windows (which is not on the graph) is much higher.

Finally, here's a scatter plot comparing the values for 3 (small window) with those for 1000 (large window). We see how the Voynichese dots (green) are in the bottom right, which is their "signature" so to speak. The sample generated by Timm's method is also an outlier, but in a different corner.



All data is in the public share file, tab "shuffle": [link hidden]
RE: VM TTR values

R. Sale > 17-03-2023, 06:18 PM

While it may be somewhat tenuous to let a single text represent an entire language group, I see in the first graph that the last two Latin texts are quite similar. The other Latin text would have been written by an English Norman.

In the fourth graph, there is a strong similarity between the VMs and the Italian 'Fabula'. Where does the 'Fabula' show up in the dots?
RE: VM TTR values

davidjackson > 17-03-2023, 10:47 PM

Koen, have you linked your shuffling with the vocabulary size in each text? Purely off the top of my head, a text with double the vocab should have half the repetition. Maybe this is what is causing the spikes at 6 and 12?
RE: VM TTR values

Juan_Sali > 18-03-2023, 05:09 PM

(17-03-2023, 04:44 PM)Koen G Wrote: I'd say that generally, there is a signature: Voynichese texts tend to start above 1. This means that Voynichese repeats words often in small windows (a-a-b, a-b-a...), more so than "accidental clumping" would cause. Interestingly, this effect disappears between 5 and 6, especially for Herbal B. Then, "stirring the soup" takes over, and shuffling increases entropy again.
In my opinion, this clustering effect on words starts at a lower level, in the n-gram structure of Voynichese (mainly bigrams and trigrams). The clustered distribution of many n-grams is reflected in the words.