Juan_Sali > 16-04-2026, 01:34 PM
(16-04-2026, 04:56 AM)Jorge_Stolfi Wrote: I don't think that the remaining space had, in general, any influence on the pre-creation of the last word.
Quote:So it seems that two things are happening at the same time. Lines fill the available space, but they also tend to end and start with specific families of patterns.
Thanks for doing these tests! But indeed we should expect the VMS to have stronger line-end anomalies than those created by the simple line-breaking algorithm, because the real algorithm has two more complications.
First, the m character is probably an abbreviation for some other ending, that the Scribe could use when needed but also just when he felt like it. I guess that the "some other ending" is iin. That is, where the trivial algorithm said
1. If the next word W fits in the remaining space, write W,
Else break the line and write W.
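A minimal Python sketch of the trivial rule quoted above, assuming that line width is counted in characters and that each word space costs one character. The names `break_lines` and `max_width` are hypothetical, chosen only for this illustration.

```python
def break_lines(tokens, max_width):
    """Greedy line breaking: write each word on the current line if it
    fits in the remaining space, otherwise break the line and write it
    at the start of the next one. Width is counted in characters."""
    lines = []
    current = []   # words on the line being filled
    used = 0       # characters already used on that line
    for word in tokens:
        # A word costs its own length plus one space if it is not first.
        extra = len(word) + (1 if current else 0)
        if current and used + extra > max_width:
            lines.append(current)
            current, used = [], 0
            extra = len(word)
        current.append(word)
        used += extra
    if current:
        lines.append(current)
    return lines


tokens = "daiin chedy qokeedy ol shedy dam".split()
for line in break_lines(tokens, max_width=12):
    print(" ".join(line))
```

A word longer than `max_width` simply gets a line of its own here; that choice is an assumption, not something stated in the thread.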
Jorge_Stolfi > 16-04-2026, 01:49 PM
(16-04-2026, 09:09 AM)quimqu Wrote: There is a clear progression. In several sections, especially Herbal and Biological, tokens become increasingly end-like as we approach the line ending. So the effect is not just at the last word, it builds up gradually.
Quote:These end-like tokens tend to belong to specific families and shapes that are well known at line ends in EVA, for example patterns like -dy, -y, -l, -r, or short forms such as dy, dal, lo, which frequently appear in final position and have high end-like scores.
Quote:So I thought: maybe the scribe is kind of compressing words when approaching the end of the line.
Quote:So I tested the compression idea directly. If the scribe is compressing, we should see that these end-like tokens correspond more often to longer interior forms, for example via prefix matches or small edit distance. I compared real line-final tokens with matched interior controls and checked several criteria (longer prefix matches, Levenshtein neighbors, subsequences).
The result is negative. Final tokens do not show an excess of longer expandable forms. If anything, they show slightly fewer such relations than the controls.
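The kind of check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `is_expandable`, `expandable_rate`, the `max_dist=1` threshold and the toy vocabulary are all hypothetical, and the thread does not say how the interior controls were matched, so the controls are just passed in as a list.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_expandable(token, vocabulary, max_dist=1):
    """True if some longer word in the vocabulary either starts with
    `token` or lies within `max_dist` edits of it."""
    for w in vocabulary:
        if len(w) > len(token) and (
            w.startswith(token) or levenshtein(token, w) <= max_dist
        ):
            return True
    return False


def expandable_rate(tokens, vocabulary):
    """Fraction of tokens that have a longer 'expanded' candidate."""
    if not tokens:
        return 0.0
    return sum(is_expandable(t, vocabulary) for t in tokens) / len(tokens)


# Toy data, not taken from any transliteration.
vocab = {"dal", "daly", "om", "chedy"}
finals = ["dal", "om"]        # stand-in for line-final tokens
controls = ["chedy", "dal"]   # stand-in for matched interior tokens
print(expandable_rate(finals, vocab), expandable_rate(controls, vocab))
```

Under the compression hypothesis the rate for line-final tokens would be expected to exceed the rate for the controls; the result reported above is that it does not.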
quimqu > 16-04-2026, 05:43 PM
(16-04-2026, 01:49 PM)Jorge_Stolfi Wrote: The TLA algorithm may create anomalies also on the second token of the line, if there is correlation between the lengths of successive tokens.
quimqu > 16-04-2026, 05:49 PM
(16-04-2026, 01:19 PM)nablator Wrote: (15-04-2026, 07:07 PM)quimqu Wrote: So it seems that not all line breaks behave the same. There is a subset that fits very well the “good ending + good beginning” pattern, and another subset that barely fits it.
It could be that, unsurprisingly, some words fit in the available space without shortening. Is the average length of the "good" words smaller than the average length of the "barely fits" words?
(16-04-2026, 12:42 PM)nablator Wrote: Have you used the words before/after "<->" in your before/after linefeed data or not? If not, what happens if you add them?
(15-04-2026, 05:34 PM)Fontanellean Wrote: Perhaps it suggests that the meaningful content is in the central part of each line, with filler words added at the beginning and end.
quimqu > 16-04-2026, 05:55 PM
(16-04-2026, 12:42 PM)nablator Wrote: Do only the last two EVA characters matter?
oeesordy > 16-04-2026, 06:52 PM
(15-04-2026, 12:33 PM)Jorge_Stolfi Wrote: (15-04-2026, 08:25 AM)quimqu Wrote: The null control shows that this does not survive random relabelling. So the effect looks real. I think the safest formulation is this: in Herbal and Biological, line breaks are statistically structured rather than arbitrary.
Good, but, again, that is an expected (and even already verified) consequence of the trivial line-breaking algorithm (provided that the margins are defined by mm or character count, not by word count). It does not insert line breaks at random, but in a way that strongly depends on the lengths of the words before and after the break. As a result, the first word after a line break tends to be longer than average, and the last 1-3 words before a line break tend to be shorter than average. And if the word length distributions are different, the same will almost certainly be true of any other statistic, character- or word-based.
So, before we can take your results above as evidence of LAAFU, you need to repeat the analysis on text that has been re-justified as discussed before. Namely, discard the first line of each parag, join the remaining lines in a single token stream, and feed that to the trivial line-breaking algorithm, with maximum line length set to about 62% or 162% of the original average line length -- always counting characters, not words.
The set of these re-justified parags should be the "null control". I am quite sure that you will see on this control text the same kind of anomalies that you see on the VMS. The question is only whether they will be just as strong, or significantly weaker.
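A rough Python sketch of the re-justification described above, offered only as one possible reading. It assumes each parag is a list of lines, each line a list of tokens, that width is counted in characters with one character per word space, and that the "original average line length" is taken over the kept lines of the same parag. The names `line_width`, `rejustify`, `mean_len` and the `scale` parameter are hypothetical.

```python
def line_width(line):
    """Width of a line in characters, one character per word space."""
    return sum(len(w) for w in line) + max(len(line) - 1, 0)


def rejustify(paragraph, scale=0.62):
    """Discard the first line, join the rest into one token stream and
    re-break it greedily at `scale` times the average original width."""
    body = paragraph[1:]
    if not body:
        return []
    max_width = scale * sum(line_width(l) for l in body) / len(body)
    stream = [w for line in body for w in line]
    lines, current = [], []
    for word in stream:
        # Break before the word if adding it would overflow the line.
        if current and line_width(current + [word]) > max_width:
            lines.append(current)
            current = []
        current.append(word)
    if current:
        lines.append(current)
    return lines


def mean_len(words):
    """Average token length in characters (0.0 for an empty list)."""
    return sum(map(len, words)) / len(words) if words else 0.0


# Toy parag, not real transliteration data.
parag = [["pchedy", "qokaiin"],
         ["daiin", "chedy", "ol"],
         ["qokeedy", "dam"]]
control = rejustify(parag, scale=1.0)
print(control)
print(mean_len([l[0] for l in control]), mean_len([l[-1] for l in control]))
```

Running the same line-start and line-end statistics on `control` as on the original lines, with `scale` set to about 0.62 or 1.62, would give the comparison asked for here: whether the anomalies in the control are as strong as in the VMS or weaker.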
And even if the anomalies in this control text are weaker than on the original parags, that is still not yet evidence of LAAFU. Because the algorithm used by the scribe is a bit more complicated than the trivial one, and the additional complications add to the line-break anomalies. So we will have to try to simulate these complications too.
All the best, --stolfi
quimqu > 16-04-2026, 10:43 PM
quimqu > 16-04-2026, 10:49 PM
(16-04-2026, 01:34 PM)Juan_Sali Wrote: I don't think that the remaining space had, in general, any influence on the pre-creation of the last word.
The right margin is quite irregular on many pages; there is space to write longer words that would make the margin more regular, but this doesn't happen.
In the pages with more regular right margins, like 81v, it is possible that the last 2 or 3 words in some lines were originally separate but were agglutinated (spaces deleted) by the scribe because of lack of space.
Jorge_Stolfi > 17-04-2026, 02:55 AM
(16-04-2026, 10:43 PM)quimqu Wrote: The internal split points marked with <-> do show part of the same effect seen at true line endings, but more weakly. ...That is exactly what happens on the left side of the split. The right side of the split shows a weaker version of the same pattern.
am | . | 0.0375 | +12.835
otam | . | 0.0100 | +11.513
ary | . | 0.0088 | +11.379
dal | . | 0.0088 | +11.379
dam | . | 0.0088 | +11.379
ram | . | 0.0088 | +11.379
okam | . | 0.0063 | +11.043
qotam | . | 0.0063 | +11.043
aral | . | 0.0050 | +10.820
chedam | . | 0.0050 | +10.820
kam | . | 0.0050 | +10.820
ldy | . | 0.0050 | +10.820
oly | . | 0.0050 | +10.820
om | . | 0.0050 | +10.820
aiinal | . | 0.0037 | +10.532
lkam | . | 0.0037 | +10.532
olam | . | 0.0037 | +10.532
opam | . | 0.0037 | +10.532
raram | . | 0.0037 | +10.532
...
skaiin | 0.0026 | . | -10.166
tchey | 0.0026 | . | -10.166
aiiin | 0.0039 | . | -10.574
aiir | 0.0039 | . | -10.574
char | 0.0039 | . | -10.574
chody | 0.0039 | . | -10.574
kaiin | 0.0039 | . | -10.574
keey | 0.0039 | . | -10.574
pchedy | 0.0039 | . | -10.574
qokchedy | 0.0039 | . | -10.574
taiin | 0.0039 | . | -10.574
kchedy | 0.0052 | . | -10.861
l | 0.0052 | . | -10.861
otar | 0.0052 | . | -10.861
chedar | 0.0065 | . | -11.084
cheedy | 0.0065 | . | -11.084
chor | 0.0065 | . | -11.084
cheo | 0.0091 | . | -11.420
otedy | 0.0117 | . | -11.672
nablator > 17-04-2026, 08:12 AM
(16-04-2026, 05:49 PM)quimqu Wrote: That makes it hard to explain this just as a packing or fitting effect. Short words alone don’t explain why certain tokens are so strongly preferred at line ends. It looks more like something acting at the level of specific variants or families, not just length.