The Voynich Ninja

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13

(14-11-2025, 11:26 AM)Jorge_Stolfi Wrote: Going back to the topic of this thread: I am still confused about what exactly LAAFU means. I see that there has been a huge amount of discussion about it, and what I have read only left me more confused.

Is there a short summary of what is known somewhere?

LAAFU ("Line As A Functional Unit") refers broadly to the observation that statistical properties of Voynichese text vary based on line position -- implying in turn that the line functions somehow as a "unit."

Historically it has mostly been used to refer to distinctive patterns found at the beginnings and ends of lines, probably because the differences in those positions are most readily apparent.

However, Emma May Smith and Marco Ponzi have also identified statistical anomalies among second words of lines, and my own studies of "rightwardness" metrics have (I think) shown that subtler forms of line patterning permeate the whole text, with many word features consistently "preferring" earlier or later positions within a line. For example: choose any pair of Voynichese words that differ only in that one contains [a] where the other contains [o]. Considering only mid-line tokens of those two words -- excluding first and last words -- I believe you'll find that in nearly every case the word containing [a] appears further rightward on average than the word containing [o].
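For anyone who wants to try the pair test described above, here is a minimal sketch. It assumes the transliteration has already been loaded as a list of lines, each a list of word strings; the function names and the 0-to-1 position scale are choices made for this illustration, not the exact "rightwardness" metric used in my own studies.

```python
from collections import defaultdict

def mean_positions(lines, min_count=1):
    """Mean relative position per word among mid-line tokens.

    0.0 is the leftmost mid-line slot and 1.0 the rightmost. The first and
    last word of every line are excluded, as in the test described above.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for words in lines:
        mid = words[1:-1]
        if len(mid) < 2:
            continue  # relative position is undefined with fewer than two slots
        for i, w in enumerate(mid):
            sums[w] += i / (len(mid) - 1)
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in counts if counts[w] >= min_count}

def swap_variants(word, x, y):
    """Yield every word obtained by replacing one occurrence of x with y."""
    i = word.find(x)
    while i != -1:
        yield word[:i] + y + word[i + len(x):]
        i = word.find(x, i + 1)

def compare_pairs(lines, x="a", y="o", min_count=1):
    """List (x_word, y_word, mean_pos_x - mean_pos_y) for attested pairs.

    A positive difference means the x-word sits further right on average.
    """
    pos = mean_positions(lines, min_count)
    out = []
    for w in pos:
        for other in swap_variants(w, x, y):
            if other in pos:
                out.append((w, other, pos[w] - pos[other]))
    return out
```

Calling compare_pairs(lines, "a", "o") and counting how many differences are positive gives a rough version of the claim; raising min_count filters out pairs that are too rare to say anything about.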

If the line weren't somehow fundamental to the process by which Voynichese text was composed, I can't imagine how such consistent and pervasive patterns would have arisen.

(14-11-2025, 11:26 AM)Jorge_Stolfi Wrote: Going back to the topic of this thread

I wouldn't say that the question of possible syllables is off topic. If you want to determine whether lines in the VMS are structured according to any meter (hexameter or not), then the definition of individual word segments is essential. Analyses such as those presented by @quimqu in post #71 are then a first step toward recognizing possible rhythms. Prefixes, for example, seem to be significantly longer than stems and suffixes.

The goal may be to break down a few random lines (Quire 20?) in such a way that it becomes apparent whether or not there is a repeating pattern ( long / short ) after the (supposedly) recognized word segments.

(14-11-2025, 01:20 PM)Bluetoes101 Wrote: [link]

Thanks, but that is Currier's thinking from 1976. I imagine that more has been learned since then.

I saw @tavie's presentation at the last Voynich day, but that was a lot of detail, and included speculation about the head lines...

All the best, --stolfi

(14-11-2025, 01:13 PM)Kaybo Wrote: But for me it means, that a paragraph is not a continuous text and that every line starts new.

I think this would be an extreme form of LAAFU, no? And anyway it is a conjecture about the cause, not a summary description of the anomalies without trying to explain them -- which is what I was looking for.

All the best, --stolfi

(14-11-2025, 01:08 PM)oshfdk Wrote: Is it important how the split is made? For example, randomly assigning lines to A or B vs assigning the first half of the lines to A and the second to B.

Either way will have the same chance of detecting spurious "anomalies" due to sampling error alone. But if the lines are split as ( first half, second half ), any anomaly that is detected in only one half could still be a real anomaly that occurs only there. Which would be an interesting discovery...
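To make the two kinds of split concrete, here is a small sketch. It assumes the text is a list of lines, each a list of words; the function names and the choice of "fraction of lines starting with a given word" as the statistic are only illustrative.

```python
import random

def split_lines(lines, mode="random", seed=0):
    """Split the lines into two halves A and B.

    mode="random": shuffle a copy of the lines, then cut in the middle.
    mode="sequential": A is the first half of the text, B the second half.
    """
    items = list(lines)
    if mode == "random":
        random.Random(seed).shuffle(items)
    elif mode != "sequential":
        raise ValueError("mode must be 'random' or 'sequential'")
    cut = len(items) // 2
    return items[:cut], items[cut:]

def first_word_rate(lines, word):
    """Fraction of non-empty lines whose first word equals `word`."""
    nonempty = [ws for ws in lines if ws]
    if not nonempty:
        return 0.0
    return sum(ws[0] == word for ws in nonempty) / len(nonempty)
```

Comparing first_word_rate on A and B for both modes shows the difference: with the random split a gap between the halves can only be sampling noise, while with the sequential split it may also reflect a real change along the text.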

All the best, --stolfi

(14-11-2025, 02:10 PM)pfeaster Wrote: For example: choose any pair of Voynichese words that differ only in that one contains [a] where the other contains [o]. Considering only mid-line tokens of those two words -- excluding first and last words -- I believe you'll find that in nearly every case the word containing [a] appears further rightward on average than the word containing [o].

Revisiting my notes, I'd suggest that a more persuasive test case is [Sh] (earlier in line) versus [ch] (later in line). The [a] / [o] case *does* generally work when considering all line positions but is a bit weaker for mid-line only.

[attachment=12353]

Let me also remind people of the recent finding that the line-breaking algorithm used by scribes (not just on the VMS, but in any language and any epoch, even today) has the side effect of making the first word of each line longer than average, and the last few words shorter than average.

This phenomenon alone can have a significant effect on word frequencies at the start of the line, because the most frequent words in a text tend to be short. For instance, if a running English text is broken into lines in the simplest possible way, we can expect that the words "the", "and", "is", "it" will occur more rarely at the start of lines than in the text as a whole. Conversely they will occur more frequently at the end of lines.

That is why it is important to run all statistical analyses on a control (non-VMS) sample whenever possible. The anomalies at line-start may be real, but may be due to this and other causes -- not to any semantic role of line breaks.

For example, here is a quick test using the Portuguese novel that I sent to @Quimqu a while ago:

Code:
# first               nonfirst
#
  0.04249 que         0.04354 que
  0.04155 e           0.04039 a
  0.03732 a           0.03443 e
  0.03404 não         0.03131 de
  0.02582 de          0.02779 o
  0.02207 o           0.02185 não
  0.01338 mas         0.01686 me
  0.01150 para        0.01343 se
  0.01150 se          0.01231 um
  0.01103 era         0.01093 os
  0.01103 é           0.01053 é
  0.01033 capitu      0.01010 da
  0.00986 como        0.00981 do
  0.00962 um          0.00892 era
  0.00915 os          0.00877 mas
  0.00845 da          0.00850 as
  0.00822 com         0.00848 para
  0.00751 em          0.00826 com
  0.00681 do          0.00826 eu
  0.00681 por         0.00782 lhe
  0.00634 as          0.00737 em
  0.00610 ao          0.00654 uma
  0.00610 eu          0.00565 ao
  0.00563 me          0.00555 por
  0.00563 uma         0.00552 minha
  0.00516 na          0.00538 na
  0.00493 quando      0.00534 no
  0.00469 no          0.00526 mais
  0.00446 mais        0.00519 como
  0.00446 ou          0.00445 ou
  0.00423 nem         0.00441 à
  0.00399 dos         0.00420 capitu
  0.00399 foi         0.00379 mãe
  0.00399 à           0.00377 nem
  0.00376 já          0.00356 ele
  0.00376 minha       0.00352 foi
  0.00376 também      0.00298 tudo
  0.00352 agora       0.00288 só
  0.00352 lhe         0.00281 dias
  0.00352 só          0.00281 quando
  0.00352 tudo        0.00279 casa
  0.00329 pois        0.00279 ela
  0.00329 sem         0.00277 dos
  0.00329 tinha       0.00275 disse
  0.00282 ainda       0.00275 meu
  0.00282 antes       0.00275 olhos
  0.00282 assim       0.00271 *
  0.00282 ele         0.00261 já
  0.00282 há          0.00261 ser
  0.00282 josé        0.00259 mim

These frequencies were obtained by feeding the running text, where each paragraph was formatted as a single line, through a trivial line-breaking program (Linux "fmt --split-only") that broke each parag into lines of 72 chars max. Then I took each line with 10 words or more (there were ~4800 of them), separated the first word from the other words, and computed the frequencies of both.

Note that the frequencies of short words like "a", "o" are systematically higher in the second column than in the first. And I did not bother to exclude the parag head lines; this may explain why "que" (= "what", "which", "whose", "who", "that"...) is still about equally common in both sets, since it occurs often at the start of interrogative sentences, and hence at the start of parags.
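For those who would rather repeat this kind of control in Python than with shell tools, here is a rough sketch of the same procedure. It is not the script I used: textwrap stands in for fmt, and the tokenization (lowercase, runs of word characters) is an assumption, so the numbers will not match the table exactly.

```python
import re
import textwrap
from collections import Counter

def first_vs_rest(paragraphs, width=72, min_words=10):
    """Count words in first position vs. all other positions after re-wrapping.

    `paragraphs` is a list of strings, one paragraph per string. Lines with
    fewer than `min_words` words are ignored.
    """
    first, rest = Counter(), Counter()
    for para in paragraphs:
        wrapped = textwrap.wrap(para, width=width,
                                break_long_words=False,
                                break_on_hyphens=False)
        for line in wrapped:
            words = re.findall(r"\w+", line.lower())
            if len(words) < min_words:
                continue
            first[words[0]] += 1
            rest.update(words[1:])
    return first, rest

def relative(counter):
    """Convert raw counts to relative frequencies."""
    total = sum(counter.values())
    return {w: c / total for w, c in counter.items()} if total else {}
```

Sorting relative(first) and relative(rest) by decreasing frequency gives two columns comparable to the ones in the table.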

This bias towards longer words can also affect the statistics of line-initial characters, since character frequencies are determined largely by their occurrence in high-frequency words.

All the best, --stolfi

(14-11-2025, 02:45 PM)pfeaster Wrote: For example: choose any pair of Voynichese words that differ only in that one contains [a] where the other contains [o]. Considering only mid-line tokens of those two words -- excluding first and last words -- I believe you'll find that in nearly every case the word containing [a] appears further rightward on average than the word containing [o].

Revisiting my notes, I'd suggest that a more persuasive test case is [Sh] (earlier in line) versus [ch] (later in line).

That is interesting, because those scores should not be affected by the length bias due to line-breaking. (Unless the Scribe unconsciously felt that a word with Sh was longer than the version with Ch...)

But does that table include the parag head lines? Does it include labels and titles? Is the effect the same in all sections?

All the best, --stolfi

More about the size bias of line-breaking:

The simplest line-breaking algorithm for running text is: keep writing words on the current line, until you get to a word that, if written, would run past the right rail. In that case, break line before that word, and continue as before on the next line.

It should be easy to see why that algorithm makes a line break much more likely before a long word than a short one.
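A minimal sketch of that algorithm, for anyone who wants to check the bias on their own sample. It assumes width is measured in characters with one space between words; the function names are made up for this illustration.

```python
def greedy_break(words, width):
    """Break a sequence of words into lines with the simplest greedy rule.

    A word goes on the current line unless it would run past `width`; in
    that case the line is broken before it. A word longer than `width`
    still gets a line of its own and overruns the margin.
    """
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines

def _mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

def mean_length_first_vs_rest(lines):
    """Mean length of line-initial words vs. all other words.

    The first line is skipped for the 'first word' average, because its
    first word was not placed there by a line break.
    """
    firsts = [len(line[0]) for line in lines[1:]]
    others = [len(w) for line in lines for w in line[1:]]
    return _mean(firsts), _mean(others)
```

On any ordinary text with a reasonable width, the first number should come out larger than the second, which is the length bias in question.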

To estimate the effect precisely one could use a word Markov chain of order 2 or 3, pipe the output through that line-breaking algorithm, and run the desired statistics. Or take the VMS text, join the lines of each parag into a single line, and pipe it through that line-breaking algorithm with a very different line width.
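The Markov generator itself is only a few lines. Here is a sketch of an order-2 version, assuming the training text is a flat list of words; restarting from a random context at a dead end is a choice made for this illustration, and other choices are possible.

```python
import random
from collections import defaultdict

def build_model(words, order=2):
    """Map each context of `order` consecutive words to its observed followers."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, n_words, seed=0):
    """Generate n_words words by walking the chain from a random context.

    Raises IndexError if the model is empty (training text too short).
    """
    rng = random.Random(seed)
    keys = sorted(model)  # sorted so that a given seed is reproducible
    state = rng.choice(keys)
    out = list(state)
    while len(out) < n_words:
        followers = model.get(state)
        if not followers:
            state = rng.choice(keys)  # dead end: restart from a random context
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return out[:n_words]
```

The generated word list has no line structure at all, so any line-position anomaly found after breaking it into lines must come from the line-breaking step.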

But that basic algorithm could be elaborated in a number of ways. First, the scribe could choose to split words in order to save vellum. In that case, he may leave some indication of the split (m?) at the end of the first line, or at the beginning of the second one (y? q?) or both, or neither. Even if he leaves no mark, the statistics of the line-initial words will be different because they will include word suffixes in addition to whole words. And if he splits the word only between syllables, as we do today, both parts will be morphologically similar to whole words.

In another independent elaboration, when the scribe gets near the end of the line, he will look ahead 2-3 words, choose the line break, then stretch or shrink the writing so that the line will end on the right rail. (Modern word processors will stretch or shrink the spaces, but a Medieval scribe could also stretch or shrink the characters themselves.) As a result, words near the end of the line are more likely to be improperly split or joined in the transcription files.

And a scribe could use abbreviations when necessary to squeeze one more word before the line break. The VMS Scribe probably could not read the text, but the Author may have told him that he could abbreviate aiin or aiiin as am if he needed to.

Also, as in many manuscripts of the time, the scribe may have placed some special mark (y? q?) at the start of a line whenever there was a sentence or sub-paragraph break anywhere within that line.

And here is another rather far-fetched idea. Suppose the language was tonal (not necessarily Asian; I gather that Swedish is tonal too, for instance). That is, the pitch pattern along a word would change its meaning. One way to record tones in such a language is to insert special symbols -- like digits, or a/o/y -- in the words to indicate the pitch level. Thus, for example, the Mandarin word "lǎo", with the "dipping" tone, could be written as "l2a1o3". (This system is still used by linguists to discuss tone systems in a language-independent way.) Then, within the same line, one could save some ink by inserting those pitch codes only when there was a change of pitch. That is, instead of "b2a1o3 b3a4o4 b4a1o1" one could write just "b2a1o3 b3a4o ba1o". But after a line break the scribe may have felt it necessary to insert a pitch code, even if there was no change from the previous line, for the benefit of the reader...

There may be many more processes like those above that result in anomalous statistics at the start and end of lines. And several of them may be at work in the VMS. Untangling them will require more than just staring at tables of statistics. One should try instead the "scientific" method: formulate a hypothesis about a possible cause of the anomalies, then devise the simplest statistical test that could prove or disprove that hypothesis...

All the best, --stolfi

(14-11-2025, 02:10 PM)pfeaster Wrote: If the line weren't somehow fundamental to the process by which Voynichese text was composed, I can't imagine how such consistent and pervasive patterns would have arisen.

I would agree, but if it were, I would still have a hard time seeing how they could have arisen.

Until a few years ago, the most unusual part of the Voynich MS text used to be the low bigram entropy and the word patterns, which are closely related to each other. I can think of ways how these could have arisen.

However, by now, both the general rightward/downward issue of Patrick and the line-initial character alternations of tavie have me flustered.

Currier had no idea of any of this. I would consider his Line As A Functional Unit an outdated concept.

Posters on this page, in order: pfeaster, bi3mw, Jorge_Stolfi, Jorge_Stolfi, Jorge_Stolfi, pfeaster, Jorge_Stolfi, Jorge_Stolfi, Jorge_Stolfi, ReneZ