[split] Word length autocorrelation - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: [split] Word length autocorrelation (/thread-4910.html)
RE: [split] Word length autocorrelation - quimqu - 05-09-2025

(05-09-2025, 05:51 PM)nablator Wrote: (05-09-2025, 05:19 PM)quimqu Wrote: But doing it line by line as I did, its mean is negative (about -0.07).

No, what I did is entire words. The goal is to see whether one word after the other has the form [short-long-short] or [short-short] / [long-long].

RE: [split] Word length autocorrelation - nablator - 05-09-2025

(05-09-2025, 05:59 PM)quimqu Wrote: No, what I did is entire words.

I don't understand, sorry. I counted all "positive" and "negative" bigrams where both words are on the same line of the transliteration file: no (last word of a line, first word of the next line) bigram. I thought this is what you meant by "line by line".

RE: [split] Word length autocorrelation - quimqu - 05-09-2025

(05-09-2025, 06:43 PM)nablator Wrote: I don't understand, sorry.

Hi nablator,

To clarify, I did not calculate bigrams across words. My goal was to compute autocorrelation, but only within each line. That is, every line is treated independently, and the autocorrelation measures patterns of word lengths inside that line, never spanning to the next line. This way, the result reflects line-level structure, not cross-line sequences. But always with entire words, not n-grams.

RE: [split] Word length autocorrelation - nablator - 05-09-2025

(05-09-2025, 06:53 PM)quimqu Wrote: (05-09-2025, 06:43 PM)nablator Wrote: I don't understand, sorry.

Hi quimqu,

I think we mean the same but use different words. By bigram I mean word bigram, not character bigram. The autocorrelation (positive or negative) is a property of a word bigram.
So I am counting how many of these bigrams (sequences of two words on the same line, no word bigram across lines) are positively autocorrelated = p and how many are negatively correlated = n. The result is (p-n)/(p+n) = 0.169.

RE: [split] Word length autocorrelation - Jorge_Stolfi - 05-09-2025

One complication for this sort of analysis is that in the available transcriptions there are many uncertain spaces (EVA comma) that may or may not be word breaks. And there are many spaces in the text that are wider than normal inter-glyph spaces but narrower than normal inter-word spaces, which transcribers may have either entered as word breaks (EVA period) or just ignored. Many of these dubious word breaks are after a short prefix like y or ol, or before a short suffix like dy or ar. If those dubious spaces were treated as word breaks, the length correlation would probably drop, possibly even becoming negative.

All the best, --jorge

RE: [split] Word length autocorrelation - quimqu - 05-09-2025

(05-09-2025, 07:03 PM)nablator Wrote: I think we mean the same but use different words.

Hi nablator,

Thanks: using your method (p-n)/(p+n) I also get about 0.19, close to your result. The difference is that I usually work with Pearson autocorrelation line by line, which gives mostly negative values (like natural languages). But when computed on the whole corpus, Pearson turns positive, in line with Gaskell & Bowern (around 0.16). So the surprising part is really this: the sign flips depending on whether the measure is computed within lines or over the whole corpus.
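[Editor's sketch] The sign-count measure (p-n)/(p+n) discussed above can be illustrated in a few lines of Python. This is not the posters' actual code, and the thread never defines "positive bigram" precisely; the sketch assumes the natural reading that a bigram is positive when both words lie on the same side of the mean word length (short-short or long-long) and negative otherwise, counting bigrams within lines only:

```python
# Sketch of a sign-count bigram measure (editor's interpretation, not the
# posters' code). A bigram is "positive" if both word lengths fall on the
# same side of the corpus mean length, "negative" otherwise. Bigrams never
# span a line break.

def sign_count_measure(lines):
    words = [w for line in lines for w in line.split()]
    mean_len = sum(len(w) for w in words) / len(words)
    p = n = 0
    for line in lines:
        ws = line.split()
        for a, b in zip(ws, ws[1:]):  # adjacent word pairs within one line
            if (len(a) - mean_len) * (len(b) - mean_len) > 0:
                p += 1   # short-short or long-long
            else:
                n += 1   # short-long or long-short (or exactly at the mean)
    return (p - n) / (p + n)

# Toy example: strictly alternating short/long words give -1.0.
print(sign_count_measure(["o qokeedy o qokeedy", "ol chedy ol chedy"]))  # -1.0
```

A value near +1 means word lengths cluster (long follows long, short follows short); near -1 means they alternate, as in the toy example.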
(05-09-2025, 07:06 PM)Jorge_Stolfi Wrote: One complication for this sort of analysis is that in the available transcriptions there are many uncertain spaces (EVA comma) that may or may not be word breaks. And there are many spaces in the text that are wider than normal inter-glyph spaces but narrower than normal inter-word spaces, which transcribers may have either entered as word breaks (EVA period) or just ignored.

You're right, there are definitely uncertainties in the transcriptions (commas, periods, ambiguous spaces), and they could affect the exact values of the word-length correlation. But since I'm calculating autocorrelation line by line, those local errors should average out rather than flip the overall sign. In other words, they might shift the correlation slightly, but they would not normally turn a negative profile into a positive one. That's why the contrast between line-level (negative) and whole-corpus (positive) looks "robust", even allowing for those transcription issues.

RE: [split] Word length autocorrelation - RobGea - 05-09-2025

Not that I am trying to make more work for anyone. We have line-level (negative) and whole-corpus (positive) autocorrelation, so what about the paragraph level? MarcoP mentions Paragraph As A Functional Unit (PAAFU) here: [link]. And I saw ReneZ mention the paragraph as a unit recently: [link].

RE: [split] Word length autocorrelation - quimqu - 05-09-2025

I tried comparing different ways of computing the lag-1 autocorrelation of word lengths in the Voynich:
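[Editor's sketch] The two variants being compared, whole-corpus versus per-line Pearson lag-1 autocorrelation, can be sketched as follows. This is an illustration of the general technique, not the poster's actual code; the input is assumed to be a list of transliteration lines with words separated by spaces:

```python
# Sketch (editor's illustration): lag-1 Pearson autocorrelation of word
# lengths, computed (a) over the whole corpus as one sequence, where word
# pairs straddle line breaks, and (b) per line with a hard boundary at
# every line break, then averaged.

def pearson_lag1(xs):
    """Pearson correlation between a sequence and itself shifted by one."""
    a, b = xs[:-1], xs[1:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return None  # undefined for constant sequences
    return cov / (va * vb) ** 0.5

def global_and_per_line(lines):
    lengths = [[len(w) for w in line.split()] for line in lines]
    whole = [x for line in lengths for x in line]
    per_line = [r for r in (pearson_lag1(l) for l in lengths if len(l) >= 3)
                if r is not None]
    return pearson_lag1(whole), sum(per_line) / len(per_line)
```

The only difference between the two numbers is whether line breaks act as boundaries, which is exactly the contrast discussed in this thread; the thread's reported values (+0.16 global vs. about -0.07 per line for the Voynich) come from this kind of comparison.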
I have done the same calculations for different languages and a Torsten Timm-generated text:
All three natural-language texts (Latin, English, Spanish) show negative autocorrelation across the board, both globally and line by line. Timm's generated text shows near-zero / slightly positive global autocorrelation, but negative when measured per line. The Voynich manuscript is the outlier: positive globally (+0.16), but negative line by line (-0.07, or -0.03 weighted).

The Voynich combines two opposing tendencies:
- At the global corpus level, it looks like Timm's generated text (positive autocorrelation, unlike the natural languages).
- At the line level, it behaves more like natural languages (negative autocorrelation).
- The difference between within-line and cross-line autocorrelation in the Voynich seems key. In natural languages, line breaks do not affect the measure, but in the Voynich they do: within-line is -0.07 while cross-line is +0.10, a gap of 0.17.

This difference is unlikely to be accidental and suggests that line breaks play a structural role in how the text was generated.

RE: [split] Word length autocorrelation - quimqu - 05-09-2025

(05-09-2025, 08:06 PM)RobGea Wrote: We have line-level (negative) and whole-corpus (positive) autocorrelation so what about Paragraph level ?

Hello Rob, here are the results at paragraph level:

[chart: distribution of paragraph-level autocorrelation values]

Most paragraphs cluster around slightly positive values, with a fairly wide spread: some negative, some strongly positive, but the overall center of mass is just above zero. At the paragraph level, the Voynich behaves differently than at the line level. Line-by-line analysis gave a slightly negative mean autocorrelation (around -0.07), similar to natural languages. But when whole paragraphs are taken as the unit, the average flips to positive (~+0.08).
This reinforces the idea that the Voynich text is sensitive to how we segment it:
- Within lines, it shows the short-term alternation typical of natural languages.
- Across larger units (paragraphs, or globally), the accumulated effect of line breaks and long sequences pushes it into the positive range.

That sensitivity to segmentation is itself interesting: in natural languages, cutting text into lines or paragraphs does not usually change the sign of the autocorrelation so dramatically.

RE: [split] Word length autocorrelation - Jorge_Stolfi - 05-09-2025

(05-09-2025, 10:19 PM)quimqu Wrote: in natural languages

Ahem, please be more precise: "in English and some Romance languages, with their standard writing systems"... In Chinese, with its standard writing system, the long-short correlation is exactly zero, because every "word" has exactly the same length: one character, each with a perfectly square outline, with exactly the same space between "words". (When transliterated with Latin letters, e.g. in pinyin, it is customary to join the syllables of compounds that would have their own entries in dictionaries, like 中国 = zhōng guó = center country -> zhōngguó = China. But these compounds have never been marked in any way in the traditional writing system.)

All the best, --jorge