The Voynich Ninja
VM TTR values - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: VM TTR values (/thread-2818.html)



RE: VM TTR values - R. Sale - 03-02-2023

The two aspects of 'work' need to be better clarified. There is the physical aspect of the object itself and there is the intellectual aspect of the ideological content of the text. Multiple 'hands' created the VMs quires - apparently. How many 'minds' determined what was written down? Is every scribe on their own or does all information come from a single source?

The unity shown across the different 'hands' indicates that the 'hands' are less important than the patterns of information that they contain. All tend to follow a shared source. An uncommon source, one that is represented throughout the VMs. One that can be seen in an uncommon cosmos.


RE: VM TTR values - Koen G - 03-02-2023

I noticed that when shuffling, I often obtain a similar shape:

[graph attachment]

The shuffled versions are relatively low on the left. This is because of the phenomenon Marco described before: in natural language we won't write "the the", but frequent words occur so often in a text that random shuffling will put some of them next to each other. So in this case, natural language "avoids" some of the reduplication that would otherwise occur by chance.

Then there is a sharp rise somewhere between window sizes 5 and 30. This is the area where natural languages apparently do things that decrease TTR, so shuffling undoes this.

Finally, the line descends again towards the larger windows, because here shuffling makes little difference: if you take a window of 1000 words, its vocabulary composition will be similar whether the text has been shuffled first or not.

Now the thing is, regardless of the initial shape, graphs for shuffled texts always seem to do the same thing: start low, climb quickly, then slope down. This is a result of the mean and stdev I use for normalization, which are based on unshuffled texts. Maybe if I used mean and stdev values based on shuffled texts, the graphs would be more informative. I need to think about this some more.
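The normalization described above can be sketched as a per-window z-score. The function and all numbers below are illustrative assumptions, not Koen's actual corpus statistics:

```python
def normalize_curve(raw, ref_mean, ref_std):
    # Per-window z-score: how many reference-corpus standard deviations
    # the text's MATTR lies above or below the corpus mean at that window.
    return [(x - m) / s for x, m, s in zip(raw, ref_mean, ref_std)]

# Hypothetical values for illustration only.
raw      = [0.994, 0.987, 0.973]   # a text's MATTR at windows 2, 3, 5
ref_mean = [0.999, 0.996, 0.991]   # reference-corpus means, same windows
ref_std  = [0.001, 0.002, 0.005]   # reference-corpus standard deviations
print(normalize_curve(raw, ref_mean, ref_std))
```

Normalizing against shuffled rather than unshuffled reference texts would only change `ref_mean` and `ref_std`; the raw MATTR values of the text under study stay the same.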


RE: VM TTR values - Aga Tentakulus - 03-02-2023

@Koen
I know what you mean. But I still think that if all lines have about the same number of writers, the waves should even out.
But I do not know how sensitive the measurement here is.


RE: VM TTR values - Bernd - 04-02-2023

Koen, why do your recent VM graphs look so different from the earlier ones regarding small window sizes? Are they smoothed and extrapolated towards 0? It would be helpful to display the actual data points in the graphs as dots.

On page 1, your VM graphs are much more 'wavy' for window sizes 2-5 and even move downward until 5 for Q20 before sharply rising again. The new graphs almost perfectly resemble logarithmic or similar equations, without the slightest waviness over the 14 data points you measured, and look highly artificial, in stark contrast to all other manuscripts, where graphs tend to undulate as one would expect. I also see you didn't calculate window sizes 2 and 4 for the new graphs. Why not? I think more data points would be helpful to investigate differences between the VM texts, e.g. window sizes 4, 6, 7, 10.

Please check whether the VM texts really generate such perfect and perfectly similar MATTR curves. I also find it extremely peculiar that shuffling Herbal A results in almost exactly the same graph as Q20.


RE: VM TTR values - Koen G - 04-02-2023

(04-02-2023, 05:54 AM)Bernd Wrote: Koen, why do your recent VM graphs look so different from the earlier ones regarding small window sizes? Are they smoothed and extrapolated towards 0? It would be helpful to display the actual data points in the graphs as dots.

I must admit that I was struck with a mild panic when you pointed this out, but then I realized the answer is simple: the reference corpus is different, so the mean and stdev are different, so the normalization creates different shapes. In my initial corpus, there was a huge number of Greek texts because I had found a large number of them readily available in txt format. I have since reduced their number drastically to balance things out a bit. 

When I have some more time, I am going to try normalizing to shuffled text, though, and see if that makes a difference. My main goal here is to find a way to present TTR information that allows us to compare and understand what's going on. As usual, it will be an iterative process :) I will also include measurements for windows 2 and 4.


RE: VM TTR values - Bernd - 04-02-2023

Thanks for the reply!
I still don't understand how exactly you generated these curves. Sorry for the trouble, but could you please give us a step-by-step explanation of what you did? I couldn't find nablator's script either, so I don't know what it does. As far as I understood, you calculated the average MATTR value for each frame, then normalized each frame of a manuscript against the corresponding frame of your entire corpus. How exactly?

I still suspect there must be some error in the VM curves; otherwise it's impossible to come up with such perfect and nearly identical curves, even for Voynichese. No normalization should turn your old curves, which were vastly different at small frames, into nearly identical, almost perfectly logarithmic curves just because you change the reference corpus. I think there are two mistakes, in frames 3 and 5, where values shifted by 1:

Q13 frame 3
old: -2
new: -3

Herb.A frame 5
old: -1.3
new: -0.3

This cannot be explained by normalization with another corpus.
All other values in Q13 and Herb.A didn't change. Also, the old and new Herb.B and Q20 curves are identical when only taking frames 3 and 5 into account.


But please also calculate frame 4 and maybe 6. I know little about MATTR, so I don't know how useful extremely small frame sizes are. Also, please show data points in the graphs!


RE: VM TTR values - RenegadeHealer - 04-02-2023

Hi Koen and Marco. I've been following this thread. I'm still trying to wrap my head around the implications of this experiment. From a layman's perspective, I'm quite tempted to jump to the conclusion that it provides strong evidence for a meaningless VMs text. But that's premature and not at all parsimonious.

I do feel comfortable, however, concluding that your experiment provides good support for Currier's famous observation that "Vords are not words". As such, if vords encode any meaningful information at all, that information relies very little on what could be called syntax for its encoding. In other words, your experiment is robust support for the old observation that vords do not seem to have much pattern to their order, in stark contrast to the rigid patterns of glyphs within vords. There clearly are some patterns to the order in which they occur, at the level of the line, paragraph, and page; Patrick Feaster has amply demonstrated this in his paper "Rightward and Downward". But these patterns are subtler and statistically very unlike those of the words of a plaintext passage in a known language.

If the VMs text does indeed contain meaningful information, I could see each vord encoding a single letter of the Roman alphabet. This could be tested, I surmise, by running your experiment on a readable plaintext with every letter treated as its own token. The problem, of course, is that this specimen would have a maximum of 20-30 types. This might be correctable by enciphering the text in a many-to-one fashion, using enough possible encodings for each Roman letter to reach a number of types as high as the VMs, but I'm really not sure whether this would still make a valid comparison.
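The letters-as-tokens experiment proposed above can be sketched in a few lines. This is a hypothetical illustration (the helper names and the sample sentence are invented), showing why the type count is capped by the alphabet rather than by the vocabulary:

```python
def letter_tokens(text):
    # Treat every alphabetic character as its own token,
    # ignoring case, spaces, digits and punctuation.
    return [ch for ch in text.lower() if ch.isalpha()]

def ttr(tokens):
    # Plain type-token ratio over the whole sample.
    return len(set(tokens)) / len(tokens)

sample = "In the beginning God created the heaven and the earth."
tokens = letter_tokens(sample)
print(len(set(tokens)), round(ttr(tokens), 3))
```

However long the plaintext, the number of types can never exceed the alphabet size, which is exactly the ceiling the post points out.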

Your results are also compatible with the VMs text encoding real information that's inherently non-linguistic, for example, map coordinates or musical notes. That would be a lot harder to prove, though.


RE: VM TTR values - Aga Tentakulus - 04-02-2023

This would be possible with an older Latin text.
Such a text uses single letters, as well as abbreviations and acronyms.
Here one could see whether a text that makes sense behaves comparably to the VM under this system.

Of course, one must not use a Latin text that was written in modern times.


RE: VM TTR values - MarcoP - 06-02-2023

(04-02-2023, 05:21 PM)Bernd Wrote: I couldn't find nablator's script either so I don't know what it does.

Hi Bernd,
Searching posts in this forum gets harder every day. Since I have a local copy of Nablator's java class, I could finally dig it out.

Save the script as MATTR2.java
Compile it:
javac MATTR2.java

Run it, passing a file (or a folder, to process multiple files) and a window size:

java -cp . MATTR2 king_james_revelation.txt 50

Output:
king_james_revelation.txt 0.6845309230897983

The java class expects UTF-8 encoded data, removes punctuation and converts everything to lowercase.
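For readers without a Java toolchain, the computation can be sketched in Python. This follows the description above (lowercase, punctuation removed, a fixed window averaged over all positions) but is a reconstruction of the general MATTR technique, not nablator's actual code:

```python
import re

def tokenize(text):
    # Lowercase and keep only letter sequences, dropping punctuation,
    # as the Java class is described to do.
    return re.findall(r"[^\W\d_]+", text.lower())

def mattr(tokens, window):
    # Moving-Average Type-Token Ratio: slide a fixed-size window over
    # the token stream, take types/window at each position, and average.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

print(mattr(tokenize("The the cat."), 2))  # windows: (the,the)=0.5, (the,cat)=1.0 -> 0.75
```

Reduplicated words only depress the score when they fall inside the same window, which is why very small windows are the sensitive region for this statistic.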


RE: VM TTR values - Koen G - 07-02-2023

I started looking into what caused the differences between the old and new graphs. First of all, the raw TTR values are exactly the same, so this is fine.

The reference corpus did change in two ways. First, many Greek texts were removed because they were simply overrepresented. Second, at some point I shared the corpus with Marco and he helped me clean it up and sort out some issues. The bottom line is that the new reference corpus is better, more representative and has fewer formatting problems.

Let's first look at Q13. These are the values for windows 2, 3, 4, 5:

0.9943804035 0.9871259067 0.9803978092 0.9734755658

Normalized to the old (problematic) corpus, this gave the values:

-3.544204936 -3.06939543 -2.630757497 -2.240265703

And in the new corpus:

-8.378444309 -5.083947024 -3.252307989 -2.454021631

Here I'll plot old normalized Q13 and new normalized Q13 on a graph:

[attachment: old vs. new normalized Q13 curves]

Now what we see here is clearly an effect of the different normalization. The overall shape is the same, but the deviation from the norm in the small windows is clearly greater under the new corpus statistics. So it looks like after having cleaned the corpus of unwanted layout and punctuation artifacts, the strange behavior of Voynichese in small windows is even more obvious.

There must have been a mistake in my earlier graphs though, since the values used there sometimes differ from the dataset I was using. I have no idea what went wrong; it was several years ago. Either way, thank you for pointing this out.

So with the new (and hopefully correct) data, this is what I get for various Voynich sections:

[attachment: normalized curves for the various Voynich sections]

I realize this graph is impossible to read, but the point is that the logarithmic shape is universal for Voynichese samples. And this makes sense: it behaves fairly normally overall, but does weird things (reduplication patterns) in small windows. Hence the low start that gets corrected over longer windows. 

It also makes sense that the shuffled versions look similar, though less extreme. As Marco said, shuffling a normal text will create more instances of reduplication of frequent words (the the, this this...). So we also expect a lower TTR in small windows.
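The effect Marco describes can be checked directly: after a uniform shuffle, the expected number of adjacent identical pairs is the sum over word types of c(c-1)/n, so frequent words like "the" dominate it. A small sketch with an invented toy sentence:

```python
import random
from collections import Counter

def expected_adjacent_repeats(tokens):
    # Expected count of positions i with tokens[i] == tokens[i+1] after a
    # uniform shuffle: (n-1) * sum c*(c-1) / (n*(n-1)) = sum c*(c-1) / n.
    counts = Counter(tokens)
    return sum(c * (c - 1) for c in counts.values()) / len(tokens)

def average_adjacent_repeats(tokens, trials=2000, seed=0):
    # Monte Carlo estimate of the same quantity.
    rng = random.Random(seed)
    work = list(tokens)
    total = 0
    for _ in range(trials):
        rng.shuffle(work)
        total += sum(1 for a, b in zip(work, work[1:]) if a == b)
    return total / trials

text = "the cat sat on the mat and the dog saw the cat by the door".split()
print(expected_adjacent_repeats(text), average_adjacent_repeats(text))
```

In a real text, where "the" alone can account for several percent of all tokens, this expectation is well above zero, which is why shuffled natural language shows a depressed TTR in very small windows.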