I tried comparing different ways of computing the lag-1 autocorrelation of word lengths in the Voynich (a minimal sketch of the three variants follows this list):
- Global sequence (all words concatenated): +0.16
→ same result as Gaskell & Bowern, since long lines dominate and cross-line pairs are included.
- Line-by-line average (each line treated independently, then averaged equally): –0.07
→ negative, as is typically seen in natural languages.
- Line-by-line weighted average (weighted by the number of word pairs per line): –0.03
→ almost neutral, but still slightly negative.
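
To make the three variants concrete, here is a minimal Python sketch, assuming the text is already tokenized into `lines` (a list of lines, each a list of word strings). The function names and the NaN handling are my own illustration, not necessarily the code behind the numbers above.

```python
def lag1_autocorr(x):
    """Lag-1 autocorrelation of a numeric sequence."""
    n = len(x)
    if n < 2:
        return float("nan")
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den if den else float("nan")

def word_lengths(words):
    return [len(w) for w in words]

def global_autocorr(lines):
    # Variant 1: concatenate all words, so cross-line pairs are included.
    return lag1_autocorr(word_lengths([w for line in lines for w in line]))

def per_line_mean(lines):
    # Variant 2: one autocorrelation per line, every line weighted equally.
    vals = [lag1_autocorr(word_lengths(line)) for line in lines]
    vals = [v for v in vals if v == v]  # drop NaNs (too-short or constant lines)
    return sum(vals) / len(vals) if vals else float("nan")

def per_line_weighted_mean(lines):
    # Variant 3: weight each line's autocorrelation by its number of word pairs.
    num = den = 0.0
    for line in lines:
        r = lag1_autocorr(word_lengths(line))
        pairs = len(line) - 1
        if pairs > 0 and r == r:
            num += r * pairs
            den += pairs
    return num / den if den else float("nan")
```
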
So the discrepancy comes from how the calculation is done:
- Looking at the whole corpus, the Voynich shows positive autocorrelation.
- Looking line-by-line, most sections, Currier hands, and scribal hands show negative autocorrelation.
- Weighting by line length pushes it closer to zero.
This suggests that cross-line word pairs and the dominance of longer lines are what make the global measure flip to positive, while within lines the tendency is closer to the short-term negative correlation typical of natural languages.
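
For the within-line and cross-line columns in the table below, one natural reading of the split is to classify each adjacent pair in the concatenated sequence by whether it straddles a line break, then correlate the (first, second) lengths within each subset. A sketch, reusing the `lines` structure from above (again my own illustration, not necessarily the exact formula behind the table):

```python
from statistics import correlation  # Pearson's r, Python 3.10+

def split_pairs(lines):
    """Split adjacent word-length pairs into within-line and cross-line sets."""
    within, cross = [], []
    prev_last = None  # length of the last word on the previous line
    for line in lines:
        lens = [len(w) for w in line]
        if prev_last is not None and lens:
            cross.append((prev_last, lens[0]))  # pair straddling the line break
        within.extend(zip(lens, lens[1:]))      # pairs inside the line
        if lens:
            prev_last = lens[-1]
    return within, cross

def pair_corr(pairs):
    """Correlation between first and second elements of the pairs."""
    xs, ys = zip(*pairs)
    return correlation(xs, ys)

# within_r, cross_r = map(pair_corr, split_pairs(lines))
```
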
I ran the same calculations for texts in different languages and for a Torsten Timm-generated text:
| Text | Global (within + cross) | Within-line | Cross-line | Weighted mean (per line) |
| --- | --- | --- | --- | --- |
| Voynich (EVA) | +0.16 | –0.07 | +0.10 | –0.03 |
| Timm (generated) | +0.02 | +0.03 | +0.02 | –0.07 |
| Platonis Apologia (Latin) | –0.09 | –0.09 | –0.08 | –0.17 |
| Unfortunate Traveller (English) | –0.11 | –0.11 | –0.12 | –0.21 |
| Lazarillo de Tormes (Spanish) | –0.19 | –0.19 | –0.19 | –0.27 |
All three natural language texts (Latin, English, Spanish) show negative autocorrelation across the board, both globally and line by line.
Timm’s generated text shows near-zero / slightly positive autocorrelation globally (+0.02) and within lines (+0.03), but a negative per-line weighted mean (–0.07).
The Voynich manuscript is the outlier: positive globally (+0.16), but negative line by line (–0.07 or –0.03 weighted).
The Voynich combines two opposing tendencies:
- At the global corpus level, it looks like Timm’s generated text (positive autocorrelation, unlike the natural languages, and at +0.16 even stronger than Timm’s +0.02).
- At the line level, it behaves more like natural languages (negative autocorrelation).
- The difference between within-line and cross-line autocorrelation in the Voynich seems key. In the natural-language texts, line breaks barely change the measure, but in the Voynich they do: within-line is –0.07 while cross-line is +0.10, a gap of 0.17. A gap that large is unlikely to be accidental and suggests that line breaks play a structural role in how the text was generated.
This split behavior is a further clue that the line is a central structural unit in whatever process generated the text.