The Voynich Ninja
VM TTR values - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: VM TTR values (/thread-2818.html)



RE: VM TTR values - Bernd - 08-02-2023

Thanks Marco, I'll try out the script as soon as I find the time.

Thanks Koen, I finally understand now. You used the Excel STANDARDIZE function to assign a z-score to each frame value of a manuscript using the average and stdev of the entire corpus.
How many manuscripts does your corpus contain, and how many of each language?
If you find the time, could you make a graph with the average values of the corpus with error bars (the stdev or standard error) for comparison? And would you mind posting graphs of the raw non-normalized values of the VM texts?
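
If I understand it correctly, that per-frame normalization amounts to something like this (a minimal Python sketch; treating `frame_values` as all values of one frame size across the corpus is my assumption):

    import statistics

    def standardize(value, frame_values):
        """Z-score one manuscript's frame value against the whole corpus,
        mirroring Excel's STANDARDIZE(x, mean, standard_dev)."""
        mean = statistics.mean(frame_values)
        stdev = statistics.stdev(frame_values)
        return (value - mean) / stdev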

So if your new normalized VM graphs are indeed correct, what does this tell us?
.)all VM sections you measured perform in an extremely similar way to one another, and fundamentally differently from other texts.

.)all VM sections more or less follow a highly predictable hyperbolic (1/x-like) function, with very low values at small frames and a flat, nearly linear distribution from medium to large frames.

.)randomly shuffling the text of other manuscripts drastically modifies and unifies their curves, making them quite similar to those of the VM but less flat at medium frame sizes.

.)randomly shuffling VM text only superficially modifies its curves, likewise making them less flat at medium frame sizes.

.)Torsten's artificial 'Voynichese' performs extremely similarly to actual Voynichese here.

I'm no expert on this topic, but these are my 2cts:
I find it very, very hard to believe that it's possible to arbitrarily write a text with properties that generate such perfect curves, whether deliberately in a manuscript with meaning or as pseudorandom nonsense. I might be wrong, but I don't see how this would be possible. To me the VM texts appear to have been generated by some sort of algorithm, just as in Torsten's experiment.

Clearly the VM doesn't behave like any ordinary text, including repetitive poetry. It would therefore be interesting to compare it to something with a more formalized structure, like accounting or bookkeeping texts containing Roman numerals. But I still doubt this could explain the MATTR properties. Maybe we need to look into non-contemporary sources like computer code as well? I wonder if it's possible to find anything human-made that generates such MATTR curves.

I was never fond of the 'meaningless text hypothesis', but seeing those graphs I can't help but wonder. Maybe there is still meaning encoded, but there appears to be some sort of low-entropy, highly predictable carrier function. Clearly the text isn't randomly shuffled, see the inhomogeneities thread. In fact it is quite the opposite and shows a peculiar degree of order and predictability on vord, line, paragraph, page and quire level. I have no idea, though, how to reconcile this with the MATTR data suggesting a close relation with randomness.


RE: VM TTR values - MarcoP - 08-02-2023

(08-02-2023, 01:59 AM)Bernd Wrote: I find it very, very hard to believe that it's possible to arbitrarily write a text with properties that generate such perfect curves, whether deliberately in a manuscript with meaning or as pseudorandom nonsense.

Hi Bernd, in my opinion the perfect curves are largely the result of normalization. This plot shows non-normalized data for Q20 and two of Koen's texts. I also include an edited version of the Latin text (in green), where I replaced each occurrence of "ad" with "ad ad": the frequency of "ad" is close to 1%, so this produces a 1% word-reduplication rate, similar to what we see in the VMS.
[attachment: plot of non-normalized MATTR values]
It is clear that, without normalization, results from small windows are unreadable.
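
The "ad ad" edit amounts to something like this (a minimal Python sketch over a whitespace-tokenized word list; the function name is just illustrative):

    def inject_reduplication(words, target="ad"):
        """Write `target` twice wherever it occurs: if `target` has ~1%
        frequency, this yields a ~1% consecutive-word reduplication rate."""
        out = []
        for w in words:
            out.append(w)
            if w == target:
                out.append(w)
        return out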

This plot shows data normalized with this formula, where M is a MATTR measure (the Y values in the first plot) and W is the window size for which M was computed:
(M-1)/(W^0.8)
Values for small windows are very close to 1: this function amplifies their small deviations from 1. Vice versa, values for large windows are made smaller. The result is that values for all windows fall in a roughly comparable range. By playing around with the 0.8 exponent, one can adjust the effect of normalization in different directions.
[attachment: plot of normalized MATTR values]
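
For anyone who wants to reproduce the curves, here is a minimal Python sketch of MATTR plus the normalization above (the sliding-window bookkeeping is my own; only the (M-1)/(W^0.8) formula comes from this post):

    from collections import Counter

    def mattr(words, window):
        """Moving-average type-token ratio: mean TTR over every window
        of `window` consecutive words."""
        counts = Counter(words[:window])
        total, n = len(counts), 1
        for i in range(window, len(words)):
            counts[words[i]] += 1          # word entering the window
            old = words[i - window]
            counts[old] -= 1               # word leaving the window
            if counts[old] == 0:
                del counts[old]
            total += len(counts)           # distinct types in this window
            n += 1
        return total / (n * window)

    def normalize(m, window, exponent=0.8):
        """(M-1)/(W^0.8): amplifies the deviation from 1 at small
        windows, shrinks it at large ones."""
        return (m - 1) / window ** exponent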

Both plots show that Q20 is close to the Latin text for larger windows. For small windows, Q20 has lower MATTR. As the edited Latin file shows, this is likely due to the repetition of consecutive words, whose impact is considerable at W=3 and gradually disappears as windows get larger and larger.

In conclusion, unless I messed up something (which is totally possible), Q20 appears to behave like a relatively high-MATTR text (a highly inflected language like Latin) combined with a high rate of reduplication (which is not observed in any European language).

As Koen has shown, another way to produce a similar distribution is to shuffle words in a text: in my opinion, this could support the idea of a pseudorandom text.


RE: VM TTR values - Koen G - 08-02-2023

Thank you Bernd and Marco for keeping an eye on this. Everything I know about statistics I specifically learned for this research, so all of this is very experimental for me, and likely to be suboptimal.

Anyway, I don't have much time today, but I will simply share my data: [link]

You will find three sheets. The first one is simply raw entropy data. This is the cleaned, updated corpus, so it is to be preferred over the one I shared years ago.
The second sheet is the data, normalized to the corpus.
The third sheet is something I was working on for fun. It is the data normalized to shuffled texts.

For this third tab, what I thought was the following: imagine a scenario where we put a regular text into a bag and shake it up, then pull out one word at a time and form a text like that. What would happen if we call this scenario "normal" and see how actual meaningful texts compare? What is the impact of the rules of natural language and text composition on TTR?

To test this I took a number of texts and shuffled the words. I mostly took texts with an average overall TTR, to aim for a situation as average as possible. I then normalized to these shuffled texts. Now, given what Marco said about shuffling actually lowering small-window TTR ("the the", "this this"...), one hypothesis might be that normalizing to shuffled texts will bring Voynichese more in line with the norm. In that case, we would expect Voynichese to form a flat line around 0. This is a quick graph I made with Voynich samples and some random other texts:

[attachment: graph of TTR normalized to shuffled texts]

Focus first on the blue and teal lines, Herbal B and Q20. I expected them to behave somewhat more like the norm, but not this much!
The yellow line, Herbal A, is still relatively parallel but with lower overall TTR. Q13 is lower still.

Now, what I did here is weird: using the average and stdev of one set to plot another. The stdev especially will affect the size of the undulations in the lines. So this could certainly be optimized. Still, I do think this is a meaningful result: Voynichese distributes its vocabulary as if it were drawn from a bag, or at least it does so more than regular texts do. In the current experiment this is especially prominent for the B-language. A-language samples are less clear due to their inherently lower TTR.
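
In code, the baseline construction is roughly this (a minimal Python sketch; it plugs into any MATTR function, like the one sketched in Marco's post above, and the seeding is just for reproducibility):

    import random
    import statistics

    def shuffle_words(words, seed=0):
        """Same vocabulary and frequencies, but all word-order
        structure destroyed: the 'drawn from a bag' scenario."""
        bag = list(words)
        random.Random(seed).shuffle(bag)
        return bag

    def z_to_shuffled(value, baseline_values):
        """Normalize one MATTR value to the mean/stdev of the
        shuffled baseline texts (the third sheet)."""
        return (value - statistics.mean(baseline_values)) / statistics.stdev(baseline_values)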


RE: VM TTR values - MarcoP - 08-02-2023

(08-02-2023, 10:53 AM)Koen G Wrote: Voynichese distributes its vocabulary as if it were drawn from a bag, or at least it does so more than regular texts do.

Hi Koen,
I don't know much about statistics either, but normalization is tricky, though necessary here to make things visually comparable.
The fact that Q20 is very close to the average of your corpus of shuffled texts does not mean that it behaves as a shuffled text. The average line is not a shuffled document, but an abstraction. If one compares the actual Q20 with a shuffled Q20, one finds that reduplication is more frequent in the actual file than in the shuffled one, hence MATTR is lower in the actual file. As was already mentioned in this thread, Voynichese favours reduplication in comparison to shuffling, while actual texts have less reduplication than their shuffled counterparts.
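
This comparison is easy to run; a minimal sketch (tokenization assumed):

    def reduplication_rate(words):
        """Fraction of consecutive word pairs that are identical."""
        same = sum(1 for a, b in zip(words, words[1:]) if a == b)
        return same / (len(words) - 1)

    # e.g. compare reduplication_rate(q20_words) with
    # reduplication_rate(shuffled_q20_words)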

The high reduplication rate shows that Voynichese words are not distributed as if they were drawn from a bag. Line phenomena are another aspect of Voynichese that is highly predictable in some cases ([link]) and certainly not random. The last-first character combinations that Emma and I discussed in our paper are a third way in which Voynichese word order shows marked deviations from randomness. I am not sure of the effects of the second and third points on MATTR, but they are likely to have an impact in some way.

In conclusion, I am not sure that Voynichese is closer to being drawn from a bag than natural languages. What we can tell is that Voynichese words appear to be arranged by rules and preferences that are different from the rules and preferences of the grammar of European languages.


RE: VM TTR values - Koen G - 08-02-2023

Good points, Marco. In fact, I would say these line patterns are an argument against any "chaotic" explanation, as they imply some planning or at least consistency. 

Also, I cannot think of any realistic scenario where words would actually be drawn from a bag, since this also presumes that some words are very frequent while many are unique or rare. I don't see any practical way this could work.

Now, on the other hand, I do think it remains interesting that shuffling is a decent way of approaching Q20's TTR values, even if it may not be directly relevant. Maybe it is just that shuffling lowers small-window TTR, which is enough to bring it closer to Voynichese even though the underlying principles are unrelated.


RE: VM TTR values - nablator - 09-02-2023

(08-02-2023, 07:40 PM)MarcoP Wrote: Voynichese favours reduplication in comparison to shuffling, while actual texts have less reduplication than their shuffled counterparts.

On a related subject: the amazingly strong non-randomness in k/t sequences. There are far too many (long) sequences of the same gallows (especially t), despite the known opposite bias in human-generated pseudo-random sequences. New thread here: [link]

Both statistics argue for a systematic cause, and against both true randomness and pseudo-randomness (human-generated gibberish): in gibberish, word ordering would rather mimic natural languages (with extremely rare reduplication), and k/t sequences should show a tendency to over-alternate, the opposite of what is observed.
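
Counting the run lengths is a one-liner with itertools (a sketch; extracting the gallows sequence from a transliteration is left out):

    from itertools import groupby

    def run_lengths(gallows):
        """Run lengths in a gallows sequence such as 'kttttkk':
        human-generated pseudo-randomness over-alternates, so long
        runs of the same gallows should be rare."""
        return [len(list(group)) for _, group in groupby(gallows)]

    # run_lengths("kttttkk") -> [1, 4, 2]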


RE: VM TTR values - MarcoP - 10-02-2023

Hi Nablator, I don't think I ever looked into the gallows-sequences you mention, but I am aware that perfect reduplication is a special case of a more general phenomenon. Timm and Schinner described it quite clearly in my opinion:

Timm & Schinner Wrote: The closer two words are (with respect to their edit distance), the more likely these words also can be found written in close vicinity (i.e. on the same page).

Another example of the more general phenomenon is quasi-reduplication (consecutive words with edit distance 1), which in the VMS occurs in about 2% of consecutive word pairs. Anyway, perfect reduplication has the most direct impact on MATTR, while the other cases involve different (though similar) word types. But of course what Timm and Schinner describe cannot occur in a randomly shuffled text.
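
For anyone who wants to measure this, a sketch (plain Levenshtein distance; transliteration and tokenization assumed):

    def edit_distance(a, b):
        """Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def quasi_reduplication_rate(words):
        """Fraction of consecutive word pairs at edit distance exactly 1
        (about 2% in the VMS, per the figure above)."""
        pairs = list(zip(words, words[1:]))
        return sum(1 for a, b in pairs if edit_distance(a, b) == 1) / len(pairs)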


RE: VM TTR values - Addsamuels - 14-02-2023

(03-02-2023, 08:33 PM)Koen G Wrote: ...in natural language we won't write "the the", but there are so many of them in the text that random shuffling will put some of them next to each other.
In certain languages like French and German, duplication of some words is common.
1. En négociant le divorce, nous nous sommes rendus nous-mêmes orphelins. (In negotiating the divorce, we made ourselves orphans.)
2. Die Firma, die die technische Pionierarbeit dafür geleistet hat. (The company that did the technical pioneering work for it.)


RE: VM TTR values - Koen G - 14-02-2023

Yeah, that's certainly true; in my native Dutch you could grammatically do three, like: "Ik dacht dat dat dat kind was." (I thought that it was that child). However, two remarks about this: one, it feels kind of ugly, and writers will avoid overusing these structures. Two, even in cases where this is not an issue, like the French "nous nous" example, texts will still contain fewer such reduplications than Voynichese and shuffled texts do.


RE: VM TTR values - Searcher - 15-02-2023

(14-02-2023, 10:59 PM)Koen G Wrote: Yeah, that's certainly true; in my native Dutch you could grammatically do three, like: "Ik dacht dat dat dat kind was." (I thought that it was that child). However, two remarks about this: one, it feels kind of ugly, and writers will avoid overusing these structures. Two, even in cases where this is not an issue, like the French "nous nous" example, texts will still contain fewer such reduplications than Voynichese and shuffled texts do.
Recently, I wrote a blogpost about word repetitions in the VMs text. I suppose they can be fake repetitions, just garbage.
For example:
When Tämerlin returned home 9home from from Babiloni, he he he sent word to all in his land that they were were to be ready in four months, as he wanted to go into Lesser India, distant from from from9 9from his capital a four months’ journey. When the time came, he went into Lesser India with four hundred thousand men9 men, and crossed a desert...
I think those odd repetitions could be easily detected once the text is deciphered. The encipherer would only need to choose the right words for fake repetitions during enciphering, so as not to produce questionable phrases.
What do you think about this?