The Voynich Ninja
[split] Word length autocorrelation - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: [split] Word length autocorrelation (/thread-4910.html)



[split] Word length autocorrelation - ReneZ - 05-09-2025

(04-09-2025, 09:59 PM)quimqu Wrote: In real languages, the correlation is usually slightly positive: long words tend to follow long words, short after short. But in the Voynich, it’s negative (about –0.07). That means the text tends to alternate — long words are followed by short words, and short by long.
It gives the text a zig-zag rhythm.

If that is a 'normalised' correlation, i.e. not a covariance, then the value -0.07 means 'no correlation'.
So, it depends on how it was calculated.

(04-09-2025, 09:59 PM)quimqu Wrote: When I scrambled the text within lines as a test, the alternation got even stronger, which proves the manuscript isn’t random, but it’s still unlike natural language.

This is not expected, and suggests that both cases effectively show 'no correlation'.


RE: Can LAAFU effects be modeled? - quimqu - 05-09-2025

Thanks René, that’s very helpful. You’re right, the lag-1 autocorrelation I used is just the standard Pearson correlation between word length and the next word’s length. With that definition, a value of –0.07 is indeed very close to zero, so it’s more accurate to say the Voynich shows no strong correlation in word length sequence, rather than claiming a clear "zig-zag rhythm".
What I found interesting is that when I scrambled the words within each line, the mean autocorrelation went further negative (about –0.14). That suggests the manuscript isn’t behaving like purely random text either, though it’s still unlike natural languages, where the value is usually slightly positive.
For reference, the values vary by section:

- Herbal: around –0.12
- Balneo: around –0.08
- Cosmological: –0.13 (but small sample)
- Astronomical: –0.20 (also small sample)
- Marginal stars only: ~0.00
- Text-only: ~0.01


I should say it more cautiously: the Voynich shows no strong correlation, but its profile is not the same as either scrambled text or typical natural languages.
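
For anyone who wants to reproduce this, here is roughly what I computed, as a minimal sketch in Python (the tokenised lines and the number of shuffles are placeholders, not my actual pipeline):

Code:
import random
from statistics import mean

def lag1_pearson(lengths):
    # Pearson correlation between each word's length and the next one's.
    if len(lengths) < 3:
        return None  # too few words to correlate
    x, y = lengths[:-1], lengths[1:]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # constant lengths: correlation undefined
    return cov / (vx * vy) ** 0.5

def mean_line_autocorr(lines):
    # One correlation per line, then the mean over all usable lines.
    vals = [lag1_pearson([len(w) for w in line]) for line in lines]
    vals = [v for v in vals if v is not None]
    return mean(vals)

def scrambled_within_lines(lines, shuffles=100):
    # Null model: permute the words inside each line, keep line boundaries.
    results = []
    for _ in range(shuffles):
        shuffled = [random.sample(line, len(line)) for line in lines]
        results.append(mean_line_autocorr(shuffled))
    return mean(results)

# lines = [["daiin", "ol", "chedy"], ...]  # tokenised transliteration lines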


RE: Can LAAFU effects be modeled? - MarcoP - 05-09-2025

It's interesting that Gaskell and Bowern found the opposite for natural languages (negative autocorrelation: short-long-short-long). But this is only marginally relevant to LAAFU, so it should probably be discussed in a separate thread.


RE: Can LAAFU effects be modeled? - nablator - 05-09-2025

(05-09-2025, 07:30 AM)quimqu Wrote: You’re right, the lag-1 autocorrelation I used is just the standard Pearson correlation between word length and the next word’s length.

MarcoP Wrote:It's interesting that Gaskell and Bowern found the opposite for natural languages

The way they calculated it is (I think) simply a count of positive and negative autocorrelations, to see which one is dominant.

This correlation is not a Pearson correlation:
A positive autocorrelation is short-short or long-long
A negative autocorrelation is short-long or long-short

Every word is either short or long:
long if word length > mean word length
short if word length <= mean word length

I did something similar with: (number of positive autocorrelations for all word bigrams - number of negative autocorrelations for all word bigrams) / number of all word bigrams
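
In code, the measure would look something like this, as a minimal sketch (the choice of mean vs. median threshold is a parameter):

Code:
from statistics import mean, median

def sign_autocorr(words, threshold="mean"):
    # Classify every word as long (> threshold) or short (<= threshold).
    lengths = [len(w) for w in words]
    t = mean(lengths) if threshold == "mean" else median(lengths)
    is_long = [n > t for n in lengths]
    pos = neg = 0
    for a, b in zip(is_long, is_long[1:]):
        if a == b:
            pos += 1  # short-short or long-long: positive
        else:
            neg += 1  # short-long or long-short: negative
    return (pos - neg) / (pos + neg) if pos + neg else 0.0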

Testing this on a few texts, I found an exception to the rule of dominant negative autocorrelation in Latin poetry: Marbodus Redonensis, Carmina varia (available on Corpus Corporum) is slightly positive (0.009), but Voynichese (EVA) is more positive (0.16). It stays positive (0.06) when all words are scrambled across the whole text, not only within each line. The results are the same with median word length instead of mean.


RE: Can LAAFU effects be modeled? - quimqu - 05-09-2025

I had used the standard Pearson correlation on word length vs. next word’s length, which gave me a slightly negative value (alternating tendency). I now see that Nablator uses a binary measure (short vs. long compared to the mean). These are two different definitions of autocorrelation, so it makes sense that results diverge. Both are interesting because they highlight different aspects of Voynichese rhythm.

I think it is true that with lag-1 we are only looking at short-term autocorrelation. Several studies indicate that natural languages also tend to show slightly negative autocorrelation at this short distance — meaning that long words are often followed by short ones, and vice versa. At longer ranges, however, the correlation tends to become slightly positive in natural languages. From this perspective, it is actually good news that the Voynich also shows negative short-term autocorrelation, because that is in line with what we would expect from natural language behavior.
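
To look beyond lag 1, the same Pearson definition generalises to any lag k. A minimal sketch (the lag range is arbitrary):

Code:
def lagk_pearson(lengths, k):
    # Pearson correlation between word lengths k positions apart.
    x, y = lengths[:-k], lengths[k:]
    if len(x) < 2:
        return 0.0
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

# lengths = [len(w) for w in words]  # word list for a whole text
# profile = [lagk_pearson(lengths, k) for k in range(1, 11)]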

Anyway, I agree with Marco's opinion that this should go in another thread. Sorry.


RE: Can LAAFU effects be modeled? - Jorge_Stolfi - 05-09-2025

(05-09-2025, 07:57 AM)MarcoP Wrote: It's interesting that Gaskell and Bowern found the opposite for natural languages (negative autocorrelation: short-long-short-long). But this is only marginally relevant to LAAFU, so it should probably be discussed in a separate thread.

It should strongly depend on the language, and somewhat on the text. 

English has detached articles, prepositions instead of cases, mandatory pronouns, analytic verb tenses, etc., all of which should imply a negative correlation (a tendency to alternate long-short-long).

Latin has none of those things, so I expect it to have near-zero or positive correlation.

All the best, --jorge


RE: [split] Word length autocorrelation - MarcoP - 05-09-2025

Quote and figure from Gaskell and Bowern:

Quote:A notable feature of the VMS that has to our knowledge only been discussed by one other publication [20] is positive autocorrelation of word lengths. Word lengths in most meaningful texts are negatively autocorrelated: that is, long words tend to be interspersed with short words (long-short-long-short). By contrast, the VMS exhibits positive autocorrelation (long-long-short-short). Positive autocorrelation is only observed in a limited number of natural languages, but is common in gibberish (Figure 3).

[20] V. Matlach, B. A. Janečková, and D. Dostál, “The Voynich manuscript: Symbol roles revisited,” PLOS ONE, vol. 17, no. 1, p. e0260948, Jan. 2022, doi: 10.1371/journal.pone.0260948.

[Attachment: Figure 3 from Gaskell and Bowern]


RE: [split] Word length autocorrelation - Jorge_Stolfi - 05-09-2025

(05-09-2025, 03:43 PM)MarcoP Wrote: Quote and figure from Gaskell and Bowern:
Quote:A notable feature of the VMS that has to our knowledge only been discussed by one other publication [20] is positive autocorrelation of word lengths. Word lengths in most meaningful texts are negatively autocorrelated: that is, long words tend to be interspersed with short words (long-short-long-short). By contrast, the VMS exhibits positive autocorrelation (long-long-short-short).

Thanks for the neat graphic! 

However, this statistic is very sensitive to the spelling system.  Arabic presumably has near-zero correlation because the article "al-", the conjunction "wa-" ("and"), and other prefixes and suffixes like "ma-" and "sa-" are traditionally written attached to the modified word.  If they were written as separate words, I bet the correlation would become negative, as for English.

Ditto for Hebrew, Turkish, ...
 
All the best, --jorge


RE: [split] Word length autocorrelation - quimqu - 05-09-2025

(05-09-2025, 03:43 PM)MarcoP Wrote: Quote and figure from Gaskell and Bowern:

Quote:A notable feature of the VMS that has to our knowledge only been discussed by one other publication [20] is positive autocorrelation of word lengths. Word lengths in most meaningful texts are negatively autocorrelated: that is, long words tend to be interspersed with short words (long-short-long-short). By contrast, the VMS exhibits positive autocorrelation (long-long-short-short). Positive autocorrelation is only observed in a limited number of natural languages, but is common in gibberish (Figure 3).

[20] V. Matlach, B. A. Janečková, and D. Dostál, “The Voynich manuscript: Symbol roles revisited,” PLOS ONE, vol. 17, no. 1, p. e0260948, Jan. 2022, doi: 10.1371/journal.pone.0260948.

Marco, thank you. I see what is happening.

My whole summary (even phase 5, about autocorrelation) was about line-by-line results, since we were discussing how line starts differ depending on whether a line is the first of a paragraph, a middle line, the last line, outside any paragraph, etc. The result of Gaskell and Bowern is computed over the whole transliteration of the Voynich. Doing the same, I get an autocorrelation of 0.16, like they have in the paper.
But computing it line by line, as I did, the mean is negative (about –0.07). I attach here the results of grouping the autocorrelations (untouched and permuted within each line) by section:

[Image: zq0GqwF.png]
This is by Currier hand:
[Image: AIMFJ6j.png]

This is by writing hand:
[Image: qPEbrgW.png]

And here we have a histogram:

[Image: cYbaQgq.png]
This is interesting because it says that, at line level, the MS actually shows natural-language-like autocorrelation, at least in most sections, Currier hands and writing hands.
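
To make the difference concrete, here are the two variants side by side, as a minimal sketch (reusing lag1_pearson from my earlier post):

Code:
def whole_text_autocorr(lines):
    # Gaskell and Bowern style: one length sequence over the whole text,
    # so word bigrams that straddle line breaks are included.
    lengths = [len(w) for line in lines for w in line]
    return lag1_pearson(lengths)

def per_line_mean_autocorr(lines):
    # My variant: one correlation per line, averaged over all usable
    # lines, so line boundaries are never crossed.
    vals = [lag1_pearson([len(w) for w in line]) for line in lines]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals)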


RE: [split] Word length autocorrelation - nablator - 05-09-2025

(05-09-2025, 05:19 PM)quimqu Wrote: But computing it line by line, as I did, the mean is negative (about –0.07).

I get 0.169 for word bigrams within the same line, using (positive − negative) / (positive + negative). I used all the lines (including labels) of the old Takahashi transliteration (ivtff_v0a).