Jorge_Stolfi > 12-04-2026, 09:56 PM
(12-04-2026, 06:38 PM)DG97EEB Wrote: I use six thematic sections (Botanical, Astrological, Balneological, Rosettes, Pharmaceutical, Stars)
Quote:and, separately, five scribes identified by Lisa Fagin Davis.
Quote:The Currier A/B dialect distinction is tagged at the folio level by scribe, not by section.
## Test 1: Re-breaking the lines
[...] preserved folio boundaries (so no token crosses a folio edge) and measured four things at each break width:
- **Gallows-initial**: first character of the first word is k, t, p, or f (note: the character-level analysis below shows this is driven by p and t; k is slightly depleted at line start)
- **Hapax-initial**: first word appears only once in the corpus
Every effect weakens or vanishes when line breaks are moved:
The "gallows-initial" effect is driven by p (17.8×) and t (4.8×).
Quote: Width 4
Quote: -m final drops from 8.6× (14.6% vs 1.7%) to roughly 1.2× [...] Line-end enrichment is concentrated in -m and -g, while -l, -r, and -o are actively depleted.
Quote:The AC gap (0.173 within vs 0.051 across) closes to roughly 0.15 both ways
Quote:These effects are tied to the real line breaks. Lines are functional units, not word-wrap.
quimqu > 12-04-2026, 09:58 PM
(12-04-2026, 02:30 PM)Jorge_Stolfi Wrote: But you don't have to believe the theory. In fact, the scientific method works best when you set out to check a theory that you don't. Because then you try harder to find a good test that will disprove it.
DG97EEB > 12-04-2026, 10:20 PM
(12-04-2026, 09:56 PM)Jorge_Stolfi Wrote: Dear DG97EEB, thanks a lot for doing the tests. I don't mind if the programs were generated by AI. Their code may be buggy, but human-written code may be buggy too, so we must check the results anyway.
(12-04-2026, 06:38 PM)DG97EEB Wrote: I use six thematic sections (Botanical, Astrological, Balneological, Rosettes, Pharmaceutical, Stars)
As a general remark, I think that any statistical analysis would be easier to understand if it were confined to just one section, like Herbal. While in theory one could focus on the "Herbal" lines of the tables only, presenting results for all sections at the same time, as a big "wall of numbers", is rather distracting and makes it hard to draw conclusions. Once we understand the results for Herbal, we can look at the other sections, if desired.
Quote:and, separately, five scribes identified by Lisa Fagin Davis.
Ditto. That is a separate theory that is best checked separately.
Quote:The Currier A/B dialect distinction is tagged at the folio level by scribe, not by section.
## Test 1: Re-breaking the lines
[...] preserved folio boundaries (so no token crosses a folio edge) and measured four things at each break width:
- **Gallows-initial**: first character of the first word is k, t, p, or f (note: the character-level analysis below shows this is driven by p and t; k is slightly depleted at line start)
- **Hapax-initial**: first word appears only once in the corpus
Every effect weakens or vanishes when line breaks are moved:
The "gallows-initial" effect is driven by p (17.8×) and t (4.8×).
From the above I understand that the program joined the whole text of each folio into a single string of words, and then ran that through the trivial line-breaking algorithm. Is that so?
That is not a good way to do the test, because we know that the first line of each paragraph is special, and the first word of those lines is very special -- for being a sentence-initial word, for being part of a special sentence, for being subject to the "ornate letter" treatment, etc. Most likely that is the name of the plant that is described in that parag. Thus it is usually a hapax, with puffs (p/f gallows), unusual morphology, etc.
If the entire text of a folio is treated like a single string of words and re-justified, most of those special parag-initial words will end up inside lines. Thus, if that is what Claude's code does, it is not surprising that the number of line-initial puffs and hapaxes went down. Basically, what the exercise confirmed is PAAFU, not LAAFU: that the first word of each paragraph is special -- and that anomaly obviously is not due to line-breaking bias.
What LAAFU says is that the stats of the first and last words of parag body lines (excluding parag head and tail lines) are different from those of other positions in the line. And these anomalies are what, in my view, may be partly or wholly caused by the line-breaking bias.
So the proper way to test the LAAFU theory, I think, is to do that exercise for each paragraph, discarding the first line of the original parag and the first line of the re-justified parag. My conjecture is that a good part of the "LAAFU anomalies" that are seen in the body lines of the original parags will still be present in the body lines of the re-justified parags, even though the original line breaks got buried inside the middle of the lines.
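To make the proposed procedure concrete, here is a minimal Python sketch of a per-paragraph re-justification. It is not the code used in the tests above. The function names, the sample words, and the choice to measure width in characters with one space between words are all assumptions made for illustration.

```python
def greedy_wrap(words, width):
    """Trivial line breaker: keep adding words until the next one would not fit.
    Width is in characters, assuming one space between words (an assumption)."""
    lines, current, used = [], [], 0
    for w in words:
        extra = len(w) if not current else len(w) + 1
        if current and used + extra > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used += extra
    if current:
        lines.append(current)
    return lines


def rejustify_paragraph(par_lines, width):
    """Join one paragraph into a word string and re-break it at a new width."""
    words = [w for line in par_lines for w in line]
    return greedy_wrap(words, width)


def body_lines(lines, drop_tail=True):
    """Discard the paragraph head line (and, optionally, the tail line)."""
    return lines[1:-1] if drop_tail else lines[1:]


def edge_rate(lines, predicate, edge="first"):
    """Fraction of lines whose first (or last) word satisfies the predicate."""
    idx = 0 if edge == "first" else -1
    hits = [predicate(line[idx]) for line in lines if line]
    return sum(hits) / len(hits) if hits else float("nan")


# Hypothetical paragraph, one list of words per original line.
paragraph = [
    ["pchedy", "qokeedy", "shedy", "daiin"],
    ["tchey", "okaiin", "chol", "dam"],
    ["sheol", "qokain", "chedy", "daiin"],
    ["okeey", "dar", "ol"],
]
new_lines = rejustify_paragraph(paragraph, width=30)
starts_with_p_or_t = lambda w: w[0] in "pt"
orig_rate = edge_rate(body_lines(paragraph), starts_with_p_or_t)
new_rate = edge_rate(body_lines(new_lines), starts_with_p_or_t)
```

Comparing `orig_rate` with `new_rate` over all paragraphs of a section, rather than over whole folios, keeps the parag-initial words out of both counts.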
Quote: Width 4
Those widths are EVA character counts? If so, even width 20 is too small, as the line-initial words will run into the line-final ones. Try using 1.618 or 0.618 times the average line width of the original parag. Note that if the new width is a simple multiple or sub-multiple of the original one (like 50% or 200%), many of the new line breaks may fall on or near the original ones.
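A small Python sketch of how one might pick the new width and check how many new breaks land on or near the old ones. The helper names are hypothetical, widths are assumed to be in characters with one space between words, and the re-justified lines in the example are written by hand rather than produced by any particular line breaker.

```python
def mean_line_width(lines):
    """Average line width in characters, counting one space between words."""
    widths = [sum(len(w) for w in line) + len(line) - 1 for line in lines if line]
    return sum(widths) / len(widths)


def target_width(lines, factor=1.618):
    """New width as an 'irrational-looking' multiple of the original average."""
    return round(factor * mean_line_width(lines))


def break_positions(lines):
    """Word indices (counted over the whole paragraph) where a new line starts."""
    positions, count = set(), 0
    for line in lines[:-1]:
        count += len(line)
        positions.add(count)
    return positions


def shared_break_fraction(original, rejustified, tolerance=0):
    """Fraction of new breaks lying within `tolerance` words of an old break."""
    old = break_positions(original)
    new = break_positions(rejustified)
    if not new:
        return 0.0
    near = sum(1 for b in new if any(abs(b - o) <= tolerance for o in old))
    return near / len(new)


# Toy paragraph: three lines of two words each.
original = [["aa", "bb"], ["cc", "dd"], ["ee", "ff"]]
doubled = [["aa", "bb", "cc", "dd"], ["ee", "ff"]]   # as if width were 200%
shifted = [["aa", "bb", "cc"], ["dd", "ee", "ff"]]   # as if width were ~150%

overlap_doubled = shared_break_fraction(original, doubled)
overlap_shifted = shared_break_fraction(original, shifted)
```

A high value of `shared_break_fraction` for a given width would mean the re-justified text has not really moved the line breaks, so that width is a poor choice for the test.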
Quote: -m final drops from 8.6× (14.6% vs 1.7%) to roughly 1.2× [...] Line-end enrichment is concentrated in -m and -g, while -l, -r, and -o are actively depleted.
The prevalence of m at the end of lines is a known anomaly. It is not caused by the trivial line-breaking algorithm per se, but my conjecture is that m is an abbreviation (possibly of iin) that the scribe could use where space is tight. Thus the m-anomaly is not evidence of LAAFU, as this theory is generally understood.
One could perhaps include this conjecture in the test by expanding every m into iin, and modifying the trivial line breaking algorithm with the clause "if the next word does not fit in the current line, but it ends in iin, and the same word with m would fit, write down the latter instead." And occasionally abbreviate iin in the middle of lines too, with the right probability.
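The modified algorithm could look roughly like the Python sketch below. It assumes, as in the conjecture, that a final m stands for iin. The function names, the `p_mid` parameter for mid-line abbreviation, and the character-based width are illustrative choices, not anything taken from the tests discussed above.

```python
import random


def expand_m(word):
    """Undo the conjectured abbreviation: a final 'm' becomes 'iin'."""
    return word[:-1] + "iin" if word.endswith("m") else word


def abbreviate(word):
    """Apply the conjectured abbreviation: a final 'iin' becomes 'm'."""
    return word[:-3] + "m" if word.endswith("iin") else word


def wrap_with_m(words, width, p_mid=0.0, rng=None):
    """Greedy line breaking with the extra clause: if the next word does not
    fit but ends in 'iin', and its 'm' form would fit, write the 'm' form.
    With probability p_mid an 'iin' word is also abbreviated regardless of
    position. Width is in characters, one space between words (assumption)."""
    rng = rng or random.Random(0)
    lines, current, used = [], [], 0
    for w in map(expand_m, words):
        if p_mid > 0 and w.endswith("iin") and rng.random() < p_mid:
            w = abbreviate(w)
        sep = 1 if current else 0
        if current and used + sep + len(w) > width:
            short = abbreviate(w)
            if short != w and used + sep + len(short) <= width:
                w = short                      # squeeze it in at the line end
            else:
                lines.append(current)          # ordinary line break
                current, used, sep = [], 0, 0
        current.append(w)
        used += sep + len(w)
    if current:
        lines.append(current)
    return lines


words = ["chol", "daiin", "shol", "dam"]
tight = wrap_with_m(words, width=9)    # 'daiin' does not fit, 'dam' does
loose = wrap_with_m(words, width=10)   # 'daiin' fits, so nothing is abbreviated
```

Since the abbreviation saves only two characters, a word squeezed in this way always ends up last on its line, which is what would produce the line-final m excess without any linguistic role for line ends.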
Quote:The AC gap (0.173 within vs 0.051 across) closes to roughly 0.15 both ways
I do not quite understand what this means, but it looks like this too could be an effect of parag boundaries being destroyed and/or the m quirk above.
Quote:These effects are tied to the real line breaks. Lines are functional units, not word-wrap.
I don't think we can make this conclusion if the test indeed destroyed parag boundaries and did not exclude parag-initial lines.
All the best, --stolfi
Juan_Sali > 12-04-2026, 11:05 PM
tavie > 12-04-2026, 11:13 PM
Jorge_Stolfi > 12-04-2026, 11:35 PM
(12-04-2026, 10:20 PM)DG97EEB Wrote: Widths are in tokens (words), not EVA characters.
quimqu > 13-04-2026, 01:00 PM
| Corpus | Mean abs asymmetry (fuzzy) | Median abs asymmetry |
|---|---|---|
| voynich | 0.055 | 0.045 |
| timm_generated | 0.103 | 0.080 |
| ambrosius_latin | 0.100 | 0.083 |
| chinese | 0.127 | 0.086 |
| docta_ignorantia | 0.142 | 0.118 |
| tirant_cat | 0.162 | 0.133 |
| culpepper_en | 0.187 | 0.157 |
| materia_medica_en | 0.220 | 0.199 |
| simplicissimus_de | 0.223 | 0.236 |
| alchemical_latin | 0.257 | 0.214 |
quimqu > 13-04-2026, 04:30 PM
| Position | Natural texts (segmented) | Voynich |
|---|---|---|
| First word | Moderate asymmetry (edge effect) | Strong asymmetry |
| Start | Near neutral | Clearly asymmetric |
| Middle | Almost flat | Near flat |
| End | Mild opposite effect | Strong opposite effect |
| Last word | Clear edge effect | Weaker than expected |
Jorge_Stolfi > Yesterday, 03:19 AM
(13-04-2026, 04:30 PM)quimqu Wrote: I have seen one interesting thing: when you take a normal text and cut it into artificial lines, something simple happens. Words at the edges of the line behave differently. The first word has no left context. The last word has no right context. This alone creates a small asymmetry.
You can see it clearly if you measure how predictable a word is from the previous one versus the next one. In natural texts, once you force them into lines, a very regular pattern appears. It’s always the same shape, and it’s not very strong. It comes from the cut, not from the language itself.
Quote:The Voynich behaves differently. It also shows positional effects, but they are stronger and the shape is not the same. The beginning of the line is much more constrained than expected, and the end behaves differently from what we see in normal texts.
That makes it unlikely that the line structure is just a passive formatting layer. It looks more like something that actively constrains how the text is generated.
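For readers who want to try this kind of measurement, here is one possible way to compute a forward-versus-backward predictability asymmetry by line position, sketched in Python. It is only an assumption about what such a measure could look like, not the "fuzzy" asymmetry reported in the tables above. In particular, using within-line bigram counts and falling back to the unigram frequency when a neighbour is missing are choices made for this illustration.

```python
from collections import Counter, defaultdict
from math import log2


def position_asymmetry(lines):
    """Mean of log2 P(word | previous) - log2 P(word | next), grouped by
    position in the line ('first', 'middle', 'last'). Bigrams are counted
    within lines only; a missing neighbour falls back to the unigram
    probability (an assumption of this sketch)."""
    unigrams = Counter(w for line in lines for w in line)
    total = sum(unigrams.values())
    pair, as_left, as_right = Counter(), Counter(), Counter()
    for line in lines:
        for a, b in zip(line, line[1:]):
            pair[(a, b)] += 1
            as_left[a] += 1
            as_right[b] += 1

    sums, counts = defaultdict(float), Counter()
    for line in lines:
        n = len(line)
        for i, w in enumerate(line):
            p_uni = unigrams[w] / total
            p_fwd = pair[(line[i - 1], w)] / as_left[line[i - 1]] if i > 0 else p_uni
            p_bwd = pair[(w, line[i + 1])] / as_right[line[i + 1]] if i < n - 1 else p_uni
            label = "first" if i == 0 else "last" if i == n - 1 else "middle"
            sums[label] += log2(p_fwd) - log2(p_bwd)
            counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


# Tiny artificial example: two three-word lines.
result = position_asymmetry([["a", "b", "c"], ["a", "b", "d"]])
```

In this toy input the first and last positions come out with opposite signs and the middle is flat, purely because the edge words lack one neighbour. That is the edge effect described in the quote, so a comparison with the Voynich would need the same cut applied to both texts.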