(16-04-2026, 09:09 AM)quimqu Wrote: There is a clear progression. In several sections, especially Herbal and Biological, tokens become increasingly end-like as we approach the line ending. So the effect is not just at the last word, it builds up gradually.
Yes, and that is expected even from the trivial line-breaking algorithm (TLA). It creates anomalies on the first token of the line, and on several tokens at the end of the line.
As an extreme example, suppose that, in the original text, 90% of the tokens have two letters, and 10% have 20 letters, alternating at random. The TLA will enhance the probability of the first token of the line being long, and enhance the probability of the last 7 tokens of the line being short (because up to 7 short words would fit in the space where a long word would not).
The TLA may also create anomalies on the second token of the line, if there is correlation between the lengths of successive tokens. In the example above, if 10% of the tokens are long, but after a long token there is a 50% chance that the next token will be long too, then the TLA will also enhance the probability of the second token of the line being long.
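The two toy scenarios above can be simulated directly. The sketch below is only an illustration under stated assumptions: the TLA is taken to be the greedy rule "start a new line when the next token does not fit", tokens are separated by one space, and the line width of 60 characters is an arbitrary choice. All function names are hypothetical. It reports the fraction of long tokens at a few positions, counted both from the line start and from the line end, so that the size of each effect can be measured rather than guessed; the end-of-line numbers may depend on the line width and on which way the positions are counted.

```python
import random

SHORT, LONG = 2, 20  # token lengths from the example above

def make_tokens(n, p_long=0.10, p_long_after_long=None, seed=0):
    """Random token lengths, optionally with correlated successive lengths."""
    rng = random.Random(seed)
    q = p_long if p_long_after_long is None else p_long_after_long
    # Chance of a long token after a short one, chosen so that the
    # overall fraction of long tokens stays at p_long.
    r = p_long * (1 - q) / (1 - p_long)
    tokens, prev_long = [], False
    for _ in range(n):
        prev_long = rng.random() < (q if prev_long else r)
        tokens.append(LONG if prev_long else SHORT)
    return tokens

def break_lines(tokens, width=60):
    """Assumed TLA: break the line when the next token does not fit."""
    lines, cur, used = [], [], 0
    for w in tokens:
        need = w if not cur else used + 1 + w  # one space between tokens
        if cur and need > width:
            lines.append(cur)
            cur, need = [], w
        cur.append(w)
        used = need
    if cur:
        lines.append(cur)
    return lines

def long_fraction(lines, pos):
    """Fraction of long tokens at index pos (negative = from line end)."""
    depth = pos + 1 if pos >= 0 else -pos
    picked = [ln[pos] for ln in lines if len(ln) >= depth]
    return sum(t == LONG for t in picked) / len(picked)

if __name__ == "__main__":
    for label, q in [("independent", None), ("correlated", 0.5)]:
        lines = break_lines(make_tokens(100_000, p_long_after_long=q))
        fracs = {p: round(long_fraction(lines, p), 3) for p in (0, 1, -3, -2, -1)}
        print(label, fracs)
```

Comparing the "independent" and "correlated" rows at index 1 isolates the second-token effect, since the two runs differ only in the correlation between successive lengths.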
Quote:These end-like tokens tend to belong to specific families and shapes that are well known at line ends in EVA, for example patterns like -dy, -y, -l, -r, or short forms such as dy, dal, lo, which frequently appear in final position and have high end-like scores.
Words like al, dy, dal are shorter than average, so they should be enhanced at or near line end by the TLA alone. But presumably iin -> m is not the only abbreviation that the Scribe was allowed to use. (But I don't see any such in the Starred section.)
Quote:So I thought: maybe the scriba is kind of compressing words when approaching to the end of line.
That is expected, and I think we can see that just by looking at the page scans. Word spaces seem to become narrower near the end of the line. But it is more complicated than that.
A good professional scribe would strive to (1) end every line except the tail precisely on the right rail, and (2) avoid bad line breaks. At a minimum, a bad line break is one that creates a very short tail line, with only one or two words.
To achieve these goals, the scribe would (consciously or unconsciously), while getting close to the end of the line, look ahead and try to predict where the next line break could be; and then compress or expand the spacing of the words and glyphs, and possibly the glyphs themselves, to try to achieve goals (1) and (2). (Good typesetting software like TeX does this too, but uses dynamic programming to find all the optimum line breaks at once for the whole paragraph.)
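To make the dynamic programming remark concrete, here is a minimal sketch of whole-paragraph line breaking. It is the textbook "minimum raggedness" formulation (minimize the sum of squared leftover space on every line but the last), used here as a simplified stand-in; it is not TeX's actual algorithm, which also models stretchable spaces, penalties, and hyphenation. The function name and the cost function are assumptions of this sketch, and it does not try to model goal (2).

```python
def optimal_breaks(widths, line_width):
    """Split tokens into lines minimizing total squared slack.

    widths: token widths in characters; each is assumed to fit on a line.
    Returns a list of lines, each a list of token indices.
    The last line is not charged for its leftover space.
    """
    n = len(widths)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimum cost of setting tokens i..n-1
    nxt = [n] * (n + 1)     # nxt[i]: index of the first token of the next line
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        used = -1  # cancels the space counted before the first token
        for j in range(i, n):
            used += widths[j] + 1  # tokens i..j with single spaces
            if used > line_width:
                break
            slack = line_width - used
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i], nxt[i] = cost + best[j + 1], j + 1
    lines, i = [], 0
    while i < n:
        lines.append(list(range(i, nxt[i])))
        i = nxt[i]
    return lines

# Widths 3, 2, 2, 5 on a 6-character line: a greedy breaker would set
# tokens 0 and 1 together, leaving token 2 alone on a ragged line.
print(optimal_breaks([3, 2, 2, 5], 6))
```

The point of the contrast is that a greedy breaker decides each break with no look-ahead, while this formulation trades a little slack on an early line for a better fit later on.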
Quote:So I tested the compression idea directly. If the scribe is compressing, we should see that these end-like tokens correspond more often to longer interior forms, for example via prefix matches or small edit distance. I compared real line-final tokens with matched interior controls and checked several criteria (longer prefix matches, Levenshtein neighbors, subsequences).
The result is negative. Final tokens do not show an excess of longer expandable forms. If anything, they show slightly fewer such relations than the controls.
I don't know what to make of this result. I looked in the Starred section at the word distribution for the last token of each line, and of the fourth-last one. Some of the most significant anomalies I can see in the last token:
- Much higher frequency of am (and other words ending in am, like otam, qokam, dam), aiinal, ary, dal, aral, ldy, oly, dy, y, etc.
- Much lower frequency of otedy and other words of 4 or more letters, l, r, s, ar (and other words ending in ar like dar, lkar), or (ditto), os (ditto), air, sho, qol, etc.
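A comparison of this kind can be sketched in a few lines. The code below assumes the text is already available as a list of lines, each a list of token strings; loading a transcription is not shown, and the function names are hypothetical. It ranks words by the difference between their relative frequency at one line position and at another (by default the last token against the fourth-last). The raw difference is only a first look, not a significance test.

```python
from collections import Counter

def position_freqs(lines, pos):
    """Relative frequency of each word at index pos (negative = from end)."""
    depth = pos + 1 if pos >= 0 else -pos
    counts = Counter(ln[pos] for ln in lines if len(ln) >= depth)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def compare_positions(lines, pos_a=-1, pos_b=-4):
    """Words sorted by frequency at pos_a minus frequency at pos_b."""
    fa = position_freqs(lines, pos_a)
    fb = position_freqs(lines, pos_b)
    diffs = {w: fa.get(w, 0.0) - fb.get(w, 0.0) for w in set(fa) | set(fb)}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder tokens only, not real transcription data.
toy = [["p", "q", "r", "s"], ["p", "q", "r", "s"], ["q", "p", "r", "t"]]
for word, diff in compare_positions(toy):
    print(word, round(diff, 3))
```

Words at the top of the list are over-represented at the first position relative to the second; words at the bottom are under-represented.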
Some of these differences, like the words ending in am, are probably due to use of abbreviations. Maybe l sometimes works as an abbreviation like m.
Other anomalies, like the suppression of isolated r and s, may be the result of compression. If you have ever tried transcribing from the page images, you must have noticed that there is often extra space after an r or s glyph, which is hard to decide whether it is a word space or not. It seems plausible that those spaces are indeed not word spaces, and thus get compressed to nothing near the end of the line, when space is running short.
Conversely, the excess of dy and y at the end of the line may be the result of the Scribe expanding the text in that region, and writing those common suffixes as separate words (like a Spanish scribe, back when orthography was just a suggestion, might stretch "darme" into "dar me" to meet the rail). Or maybe they are suffixes that follow r or s, and those ambiguous spaces after those glyphs get stretched to the point that they get transcribed as word gaps.
Anyway, I still think that the line-position anomalies that are alleged as evidence for LAAFU could be explained as mere artifacts of the Scribe's line breaking algorithm. Thus I still don't see those anomalies as evidence that line breaks are semantically significant, or that they trigger a reset of the hypothetical "encryption algorithm".
All the best, --stolfi