The Voynich Ninja
VM TTR values - Printable Version



RE: VM TTR values - nablator - 04-10-2022

I propose that the low MATTR in Q13 is caused by the restriction on (EVA regex) "[eolr]o[aysd]".

The only exceptions are one "lod" on f77r, one "los" on f78v, and maybe "par,ody" and "oteeo,dy" (folio links not preserved), if the small spaces are not significant.

(Maybe some other patterns are missing too, these are just a few "antipatterns" that I noticed.)
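A minimal sketch of how a word list could be checked against this pattern. The helper name find_matches is hypothetical, and the word list is only an illustration: it mixes the exceptions named above (with the small spaces ignored) and a few other EVA-style words, and is not read from a transliteration file.

```python
import re

# EVA pattern from the post: one of e/o/l/r, then "o", then one of a/y/s/d.
PATTERN = re.compile(r"[eolr]o[aysd]")

def find_matches(words):
    """Return the words that contain the pattern anywhere inside them."""
    return [w for w in words if PATTERN.search(w)]

# Illustrative word list only (assumption, not a real corpus sample).
sample = ["qokeedy", "lod", "los", "shedy", "oteeody", "chol"]
print(find_matches(sample))
```

Running the same check over all words of one section would list every violation of the restriction in that section.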


RE: VM TTR values - Koen G - 02-02-2023

With the recent questions about how MATTR could be used to evaluate nonsense text and about the effects of shuffling words, I decided to dust off nablator's Java code.

Here are some examples of randomly selected texts from my corpus, in Latin, Italian and German. Values have been normalized.

   

Using MATTR, we can see both language-based tendencies and unusual behavior of specific texts. In this random selection, Latin texts consistently have higher MATTR values for all windows (so a more diverse use of vocabulary). But it is much more difficult to distinguish between the Italian and German texts. 

We can also see unusual behavior in individual texts. For example, look at the lowest yellow line in this selection. The text Ger_Mechtild does something in the 10-50 word windows that causes its value to plummet there. If you were to take 5-word chunks from this text, TTR would not be abnormally low. But take 20-word windows, and you would notice a relatively high number of duplicate words compared to the average. Then take a larger window again, like 1000 words, and the value would be closer to average again.
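For readers who want to reproduce this kind of measurement, here is a minimal sketch of MATTR as it is usually defined: the type-token ratio is computed for every window of a fixed number of consecutive words, and the results are averaged. This is not nablator's Java code; the function name and the toy input are assumptions for illustration.

```python
def mattr(tokens, window):
    """Mean type-token ratio over every run of `window` consecutive tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")

    # Count the tokens in the first window.
    counts = {}
    for tok in tokens[:window]:
        counts[tok] = counts.get(tok, 0) + 1
    total = len(counts) / window

    # Slide the window one token at a time: drop the oldest, add the newest.
    for i in range(window, len(tokens)):
        old = tokens[i - window]
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
        counts[tokens[i]] = counts.get(tokens[i], 0) + 1
        total += len(counts) / window

    n_windows = len(tokens) - window + 1
    return total / n_windows

# Toy input (assumption): any list of word tokens works.
tokens = "a b a b c".split()
print(mattr(tokens, 2), mattr(tokens, 3))
```

Calling the function once per window size (3, 5, 20, 1000, ...) gives one point per window, which is what each line in the graphs traces.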

Now let's add Voynich texts in green:

   

We see the by now well-known property that their small-window TTR is low. But already by 50-100-word windows, the B-language samples brush against the Latin texts. This indicates relatively high vocabulary variation, but also frequent reduplication over short spans (a long-known and often discussed property of Voynichese).

Finally, I shuffled the words in Q20 by feeding it to two different web tools in succession. This is how it compares, now only including the Voynich samples. I added Timm as well.

   

A lot can be seen in this graph. Unmodified EVA samples (red, orange, two greens) all follow the same shape, which is remarkable in itself. Even though the TTR of A-samples is poorer (and Q13 even more so), the TTR distribution evolves very similarly. Basically a steep rise until 50, then very slightly down towards 1000. 

Timm's sample follows largely the same shape, as we have seen before. It is very Voynichese-like in overall MATTR, so this is certainly not a criticism. Still, we can use it to see how similar the Voynichese lines actually are. The purple line stands out because it starts closest to Herbal A, then dips below Q13, only to catch up with it by 1000-word windows.

Finally, let's compare the shuffled Q20 (blue line) with the unshuffled one (red line). As expected, the difference is large on the left (3-word windows). By 1000 words, the effect of the shuffling is gone.


RE: VM TTR values - Koen G - 02-02-2023

I forgot to add what prompted this in the first place: if we want to compare Voynichese to randomly generated text, it would be nice to have a sample of a couple of thousand words, so several windows of 500 and 1000 can be analyzed.

Also, I want to stress how very remarkable it is that the Voynichese samples all follow the same shape, even though we are looking at several scribes using two shifting "dialects". Look again at the first graph from my previous post and see how different texts in the same language "intertwine". There must be something underlying the whole Voynich text that causes this.


RE: VM TTR values - MarcoP - 02-02-2023

Hi Koen,
I don't remember which kind of normalization you are applying here. What is the intuitive meaning of a value of -1, 0 or +1?

About the shuffled VMS text, I guess that the plot being above the others means that the shuffled text has more different types in all windows smaller than 1000. As we already discussed in the past, this is particularly interesting for very small windows, since I expect that natural languages behave in the opposite way: e.g. "the" makes up several percent of the words in an English text, so shuffling will create repeats like "the the" that hardly ever occur in the original, and the shuffled text will therefore have fewer types in very small windows.

Also, it seems interesting that the shape of the shuffled VMS text is not very different from that of actual Voynichese. Again, I expect this not to be the case for the natural language samples.
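The small-window expectation can be checked with a short sketch. It measures how often two consecutive words are identical, and what that rate would be on average after a uniformly random shuffle. For a word with relative frequency p, roughly p squared of the consecutive pairs in a shuffled text are that word repeated. The function names and the toy input are assumptions for illustration.

```python
from collections import Counter

def adjacent_repeat_rate(tokens):
    """Fraction of consecutive token pairs whose two tokens are identical."""
    pairs = len(tokens) - 1
    if pairs < 1:
        raise ValueError("need at least two tokens")
    return sum(a == b for a, b in zip(tokens, tokens[1:])) / pairs

def expected_repeat_rate_after_shuffle(tokens):
    """Average adjacent_repeat_rate over all uniformly random shuffles.

    Two neighbouring positions hold the same type with probability
    sum(c * (c - 1)) / (n * (n - 1)), where c runs over the type counts.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    counts = Counter(tokens)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

# Toy input (assumption): "a" occurs twice but never next to itself.
tokens = "a b a c".split()
print(adjacent_repeat_rate(tokens), expected_repeat_rate_after_shuffle(tokens))
```

A text where the first number is well above the second has more immediate repetition than chance, so shuffling raises its small-window TTR; a text where it is below has less, so shuffling lowers it.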


RE: VM TTR values - Koen G - 03-02-2023

Hi Marco, I used the function =STANDARDIZE, which uses mean and stdev. It is very likely that someone on the forum once recommended this to me, though I don't remember the exact context. 

If I understand correctly, 0 is the mean of the corpus. So a text that is average for every window I measured will be a flat line at 0. Above 0 is above average (like Latin), below 0 is below average.

Since the corpus might be imbalanced, I don't think we should take the position of lines compared to 0 into account. But I find this the best way to visualize and compare different texts with each other. If we don't normalize, the differences at the critical small windows will be hard to see.

Edit: also, these are all values I measured: 3 5 8 12 15 20 30 50 75 100 150 250 500 1000
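A minimal sketch of this normalization outside a spreadsheet, assuming it is applied per window size across all texts in the corpus. STANDARDIZE takes the mean and standard deviation as arguments, so whether the sample or the population standard deviation was used here is not known; the sketch assumes the sample version. The function name and all MATTR values below are made up.

```python
from statistics import mean, stdev

# Window sizes listed in the post.
WINDOWS = [3, 5, 8, 12, 15, 20, 30, 50, 75, 100, 150, 250, 500, 1000]

def standardize_columns(table):
    """Turn {text: [MATTR per window]} into z-scores, one window at a time."""
    names = list(table)
    n_cols = len(table[names[0]])
    result = {name: [] for name in names}
    for col in range(n_cols):
        column = [table[name][col] for name in names]
        mu = mean(column)
        sigma = stdev(column)  # sample stdev: an assumption, see above
        for name in names:
            # If every text has the same value, report 0 instead of dividing by 0.
            z = 0.0 if sigma == 0 else (table[name][col] - mu) / sigma
            result[name].append(z)
    return result

# Made-up values for three texts and the first two windows only.
raw = {
    "text_1": [0.99, 0.97],
    "text_2": [0.98, 0.95],
    "text_3": [0.97, 0.90],
}
print(standardize_columns(raw))
```

Because each window is rescaled on its own, a tiny raw difference at the 3-word window becomes as visible as a large one at 1000 words.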


RE: VM TTR values - MarcoP - 03-02-2023

Thank you Koen. If I understand correctly, each MATTR window is normalized to a mean of 0, and values of +1/-1 mean 1 stdev above/below that mean. As you say, without such normalization the tiny differences in small windows would be totally lost (all lines would appear to start at 1).

About Mechthild of Magdeburg, I guess what is special in her work is the frequency of consecutive phrases based on the same structure. This example shows the passage from a first (wie,min) to a second (in) structure. The window for the minimum MATTR in your plots (~20 words) could correspond to the average length of these structures.

Vide mea sponsa: Sich wie schöne min ögen sint, wie relit min munt si, wie ftrig min herze ist, wie geringe min hende sint, wie snel min fasse sint und volge mir. Du solt gemartert werden mit mir, verraten in der abegunst, gesftchet  in der vare, gevangen in dem hasse, gebunden in höresagen...

ChatGPT translation:

See, my spouse: how beautiful my eyes are, how red my mouth is, how anxious my heart is, how small my hands are, how swift my feet are, and follow me. You should be tormented with me, betrayed in jealousy, seized in war, captured in hate, bound in rumors...


RE: VM TTR values - Koen G - 03-02-2023

(02-02-2023, 10:52 PM)MarcoP Wrote: Also, it seems interesting that the shape of the shuffled VMS text is not very different from that of actual Voynichese. Again, I expect this not to be the case for the natural language samples.

That is interesting, though I don't know what to make of it. Maybe it is because Q20 combines a relatively high TTR with an unusual amount of small-window reduplication? So stirring it up will break up these "unnatural clumps" of vocabulary, while having little effect on the overall shape?

I think you're totally right about Mechthild.


RE: VM TTR values - Koen G - 03-02-2023

I did some more experiments with shuffling. Here is the first one, with some more Voynich samples. Each color shows a section and its shuffled version (the higher line in each case).

   

For Q20, the blue line, the shape of the line remains similar, but TTR is increased a lot in small windows. This effect is all but gone by the 1000-word window. Herbal B is not included since it largely coincides with Q20, so both B-sections behave very similarly.

Herbal A (yellow) is similar, though I would have expected a much larger increase in TTR on the left side. 

Q13 (black) has the most dramatic increase in TTR by shuffling. 

An intuitive way to think of this graph is "how much vocabulary clustering over blocks of x words does shuffling undo?" I don't actually know which behaviors can be considered normal for a text, so let's try some real texts next.
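One way to put numbers on that question, as a sketch: compute MATTR for the original and for a shuffled copy at each window size and take the difference. The thread does not say which shuffling tools were used, so Python's random module stands in for them here; the function names, the seed, and the toy text are assumptions.

```python
import random

def mattr(tokens, window):
    """Mean type-token ratio over every run of `window` consecutive tokens."""
    n_windows = len(tokens) - window + 1
    if window < 1 or n_windows < 1:
        raise ValueError("window must be between 1 and the text length")
    types = sum(len(set(tokens[i:i + window])) for i in range(n_windows))
    return types / (n_windows * window)

def shuffle_gain(tokens, windows, seed=0):
    """MATTR of a shuffled copy minus MATTR of the original, per window size."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the run is repeatable
    return {w: mattr(shuffled, w) - mattr(tokens, w) for w in windows}

# Toy text (assumption): 50 words, each written twice in a row.
clumped = [w for i in range(50) for w in (f"w{i}", f"w{i}")]
for window, gain in shuffle_gain(clumped, [2, 5, 20, 100]).items():
    print(window, round(gain, 3))
```

At a window as long as the whole text the gain is exactly zero, because shuffling does not change which words occur. The more clustering a section has over x words, the larger the gain at window x.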


RE: VM TTR values - Aga Tentakulus - 03-02-2023

I'm not surprised by the differences.
Could it be that the herbal pages are more like descriptions, the recipes more like descriptions with a listing, and other sections more like a narrative or story?
I think that creates this pattern.


RE: VM TTR values - Koen G - 03-02-2023

(03-02-2023, 03:57 PM)Aga Tentakulus Wrote: I'm not surprised by the differences.
Could it be that the herbal pages are more like descriptions, the recipes more like descriptions with a listing, and other sections more like a narrative or story?
I think that creates this pattern.

I think the similarities are more striking than the differences: all Voynichese lines have the same uncommon shape and they maintain this shape after shuffling. This is remarkable because we are looking at the work of different people, in different "subsystems", and surrounding different subject matters.