Thanks Jorge, the English test revealed a minor logical error: very common, short whole words (such as “the”) were counted as chunks. I have corrected the script accordingly. Here is the output:
[attachment=12904]
[attachment=12905]
[attachment=12906]
Here is the table for the VMS, with the corrections noted by @nablator. The combination qo → y is now far ahead.
[attachment=12907]
(11-12-2025, 08:50 PM)nablator Wrote: In manuscripts and printed books, line breaks don't cut syllables, usually.
Yes. But, AFAIK, there is no evidence that the VMS scribe would break lines in the middle of words, so that rule probably does not apply here.
Quote:There may be other rules in some cases.
There are rules for breaking before or after "and" and articles, I forgot which. There is also a rule against breaking between a number and a unit of measure.
All the best, --stolfi
(11-12-2025, 09:32 PM)bi3mw Wrote: Here is the output:
Thanks! So, what is the conclusion?
Are there "LAAFU" effects even in the English prose text?
If so, how do they compare to those seen in the VMS?
All the best, --stolfi
(11-12-2025, 11:11 PM)Jorge_Stolfi Wrote: Thanks! So, what is the conclusion?
Unfortunately, I'm not there yet.

What one could try is to write all the chunks line by line into a file and then calculate the expected start and end chunks. Then one could compare these with the actual output. I'll think about it.
I was thinking about this pattern while reading a book lately. It struck me that this sort of "green eggs and ham" talking is a trope of "learned (sometimes self proposed) men" towards others.
"All you that faine philosophers would be,
And night and day in Geber's kitchen broyle,
Wasting the chipps of ancient Hermes' Tree,
Weening to turn them to a precious oyle,
The more you worke the more you loose and spoile;
To you, I say, how learned soever you be,
Go burne your Bookes and come and learne of me."
It's not a song or poem, just Edward Kelly trying to blow his own trumpet. There are pages more; if you dislike your brain cells, I can share.
(11-12-2025, 07:56 PM)tavie Wrote: (11-12-2025, 07:41 PM)Jorge_Stolfi Wrote: The question that remains is how much of the "LAAFU" phenomenon on the VMS can be explained by this effect.
If indeed any (for the reasons I've already set out earlier in this post or another thread)
The mere line breaking process definitely created differences in the line-initial, line-medial, and line-final word distributions. It does that with any language, as long as the words have different lengths.
If the three word distributions are different, the statistics of glyphs and digraphs at the three locations are almost certainly different too.
I can believe that this Line Breaking Bias (LBB) was not the only cause of the differences we see in the VMS. I expect that there will be other causes too, like the Scribe using am as an abbreviation of aiin, or y as a calligraphic variant of o, or squeezing word spaces and stretching glyph spaces so that transcribers incorrectly join or split words. And maybe the line is indeed a functional unit.
One could test the effect of each of these possible causes, alone and then in combination. Then, if the LAAFU theory is true, one would hopefully get a better understanding of it after subtracting all those other factors.
To test and measure the LBB effect, for instance, one could
- Take all parags in some hopefully homogeneous section, such as Herbal-A or Bio;
- Remove the parag head lines, and, for good measure, maybe also the second lines;
- Join all remaining lines of all parags into a single string of words;
- Insert line breaks with the trivial line breaking algorithm, for an arbitrary line width W;
- Compute the "LAAFU statistics" on the resulting text;
- Compare that with the "LAAFU statistics" of the same lines with original breaks.
The LBB effect should not depend much on the width parameter W, as long as there are half a dozen words per line. It should also be somewhat independent of whether one counts the width of a line in EVA characters or any other "alphabet", e.g. counting combos like Ch or iiin as single "letters".
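A minimal Python sketch of the re-breaking test outlined above, under stated assumptions: the names `rebreak` and `positional_stats` are hypothetical, the "trivial line breaking algorithm" is taken to be greedy filling with one unit of width per word space, and the "LAAFU statistics" are reduced here to counts of the first glyph of line-initial, line-medial, and line-final words. The `measure` parameter stands in for the choice of "alphabet" (plain `len` counts EVA characters). The sample lines are toy data, not a real transcription.

```python
from collections import Counter

def rebreak(words, width, measure=len):
    """Greedy line breaking: start a new line when the next word would not fit.

    `measure` returns the width of one word; a word space is assumed to cost 1.
    """
    lines, current, used = [], [], 0
    for w in words:
        need = measure(w) if not current else used + 1 + measure(w)
        if current and need > width:
            lines.append(current)          # close the full line
            current, used = [w], measure(w)
        else:
            current.append(w)
            used = need
    if current:
        lines.append(current)
    return lines

def positional_stats(lines):
    """Count first glyphs of line-initial, line-medial and line-final words.

    A line with a single word is counted as line-initial only.
    """
    stats = {"initial": Counter(), "medial": Counter(), "final": Counter()}
    for line in lines:
        for i, w in enumerate(line):
            if i == 0:
                pos = "initial"
            elif i == len(line) - 1:
                pos = "final"
            else:
                pos = "medial"
            stats[pos][w[0]] += 1
    return stats

# Toy input (assumption): parag lines with head lines already removed.
original = [["daiin", "chol", "qokeedy"], ["shedy", "qokain", "dam"]]
words = [w for line in original for w in line]   # join into one word string
rebroken = rebreak(words, width=14)              # arbitrary width W
print(positional_stats(original))
print(positional_stats(rebroken))
```

Running this for several values of `width`, and with a different `measure`, would show how sensitive the re-broken statistics are to those choices.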
All the best, --stolfi
I thought about how to predict the start and end chunks. The result is as follows (if there is a flaw in my reasoning, please let me know):
A script reads a file in which each word is divided into start, middle, and end chunks. These parts are separated from each other. The middle chunks of each line are then converted into binary vectors to calculate the similarity between these lines. For each line, only those lines that achieve a certain minimum similarity (threshold value 0.6) are taken into account. This forms a group of similar lines. From this group, the most probable start and end chunks are predicted using weighted voting. Result:
Median group size: 25.0
Average similarity in groups: 0.636
Min similarity in groups: 0.600 (Threshold)
Max similarity in groups: 0.741
Exact matches (Actual == Predicted): 81 / 3885
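For readers who want to experiment, here is a rough Python sketch of the kind of procedure described above. It is not the actual script. Assumptions: each line is reduced to one start chunk, a set of middle chunks, and one end chunk; the binary vectors are compared with cosine similarity (the post does not say which measure was used); a line is not allowed to vote for itself; and each vote is weighted by the similarity value. The names `cosine` and `predict` and the sample lines are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two binary vectors, stored as sets of present chunks."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def predict(lines, threshold=0.6):
    """lines: list of (start_chunk, middle_chunk_set, end_chunk).

    Returns one (predicted_start, predicted_end, group_size) tuple per line.
    """
    results = []
    for i, (_, mid_i, _) in enumerate(lines):
        start_votes = defaultdict(float)
        end_votes = defaultdict(float)
        group_size = 0
        for j, (start_j, mid_j, end_j) in enumerate(lines):
            if i == j:
                continue                     # assumption: no self-votes
            sim = cosine(mid_i, mid_j)
            if sim >= threshold:             # keep only sufficiently similar lines
                start_votes[start_j] += sim  # vote weighted by similarity
                end_votes[end_j] += sim
                group_size += 1
        pred_start = max(start_votes, key=start_votes.get) if start_votes else None
        pred_end = max(end_votes, key=end_votes.get) if end_votes else None
        results.append((pred_start, pred_end, group_size))
    return results

# Toy data (assumption), not taken from the corpus file.
lines = [
    ("qo", {"ke", "ed", "ai"}, "y"),
    ("qo", {"ke", "ed", "ai"}, "y"),
    ("qo", {"ke", "ed", "ol"}, "n"),
    ("sh", {"or", "ar"}, "m"),
]
predictions = predict(lines)
exact = sum(
    (s, e) == (ps, pe)
    for (s, _, e), (ps, pe, _) in zip(lines, predictions)
)
print(predictions, exact)
```

A line with no neighbour above the threshold gets no prediction at all in this sketch; how the real script handles that case is not stated.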
Here is the output file: [link not shown]
The corpus used: [link not shown]
[attachment=12916]
Here is the cross-check with a shuffled corpus (words and lines):
Median group size: 15.0
Average similarity in groups: 0.633
Min similarity in groups: 0.600 (Threshold)
Max similarity in groups: 0.816
Exact matches (Actual == Predicted): 93 / 3885
It can be observed that the candidates are distributed much more evenly across the corpus, but the number of candidates in a group decreases dramatically. The number of exact matches is virtually unchanged (even slightly increased).
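One possible reading of "shuffled corpus (words and lines)" can be sketched in Python as follows. This is an assumption, since the exact shuffling used is not described: all words are pooled and shuffled, dealt back into lines with the original word counts, and then the line order is shuffled too. The function name `shuffle_corpus` is hypothetical.

```python
import random

def shuffle_corpus(lines, seed=0):
    """Shuffle words across the whole corpus, keep each line's word count,
    then shuffle the order of the lines. Seeded for repeatability."""
    rng = random.Random(seed)
    pool = [w for line in lines for w in line]
    rng.shuffle(pool)
    it = iter(pool)
    shuffled = [[next(it) for _ in line] for line in lines]
    rng.shuffle(shuffled)
    return shuffled

# Toy corpus (assumption), three short lines.
corpus = [["daiin", "chol"], ["qokeedy", "shedy", "dam"], ["okaiin"]]
control = shuffle_corpus(corpus)
print(control)
```

Because word frequencies and line lengths are preserved, any drop in group size on the control would come from the loss of word order and line membership rather than from a change in vocabulary.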
[attachment=12928]
Conclusion: It appears that lines in certain sections of the VMS resemble other lines more closely than lines elsewhere do. This is not a coincidence, but indicates a structure within the text.
For what it's worth, I have divided the similarities in the VMS corpus according to the so-called “sections.” The balneological section clearly distinguishes itself from its neighbors.
1.) “Herbal” section (folio 1r–66v)
2.) “Astronomical” section (folio 67r–73v)
3.) “Balneological” section (folio 75r–84v)
4.) “Cosmological” section (folio 85r–86v)
5.) “Pharmaceutical” section (folio 87r–102v)
6.) “Recipes” section (folio 103r–116v)
[attachment=12965]
Explanation of the procedure (template from AI, but proofread and corrected):
[attachment=12979]
As mentioned, when examining the plot distribution in the “balneological section,” it becomes apparent that many lines have high group numbers and are often green (exact match) or at least yellow (half match). This is a clear sign of a strong pattern. It means that many lines of text there are structured very similarly: they often begin and end the same way and differ only slightly in the middle. This does not happen by chance, but indicates fixed text modules or recurring phrases. The entire section then probably belongs together thematically or functionally. In short: the script has recognized a clearly structured text block here. So if you are looking for lines that are structured as a functional unit, you are most likely to find them in Quire 13.
Here is a brief note on the “auto-copy theory”:
According to this theory, one would expect similarity patterns to be distributed relatively evenly throughout the entire manuscript, since the text is supposed to have been generated by a self-copying mechanism. A strong, clearly defined accumulation of such patterns in only one specific section of text does not fit well with this theory. It suggests locally effective rules rather than a global generation mechanism. If “auto-copy” were the dominant process, a comparable degree of similarity would also have to be evident in other parts of the manuscript.