(17-04-2026, 09:44 AM)quimqu Wrote: So even after removing all spaces, the signal is still there, and it is strong.
I am not sure I understood your point... But this experiment does
not imply that "the line-end anomalies are not caused by or related to spaces".
Line-breaking algorithms -- which split a stream of tokens into lines -- directly change the statistics of token lengths at both ends of the lines. But different token-length distributions imply different word distributions, and those in turn imply different character and digraph distributions. These differences will persist even if you remove the spaces from the resulting lines.
For instance, the frequency of the digraph "th" in English is largely determined by the frequency of the common words that use it: "the", "this", "that", "they", "them", "thus", "there", "then", etc. But those words are shorter than average, so a trivial line-breaking algorithm (TLA) will make them more common at the end of lines and less common at the start. Thus, even if you remove the spaces from English text after running the TLA, you should still see the frequency of "th" enhanced near the end of lines and depressed near the beginning of lines.
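This bias is easy to simulate. The sketch below is my own toy illustration (the vocabulary, line width, and greedy fill rule are assumptions, not a model of the actual scribe): it packs tokens into fixed-width lines, strips the spaces, and compares the rate of "th" near line starts versus line ends.

```python
import random

def greedy_break(tokens, width):
    """Trivial line-breaking: pack whole tokens into lines of at most
    `width` characters, counting the separating spaces."""
    lines, cur = [], []
    for t in tokens:
        if cur and len(" ".join(cur + [t])) > width:
            lines.append(cur)
            cur = []
        cur.append(t)
    if cur:
        lines.append(cur)
    return lines

random.seed(1)
# Toy vocabulary: the common short words carry "th", the long ones do not.
vocab = ["the", "this", "that", "they"] * 3 + ["measurement", "statistical", "frequency"]
tokens = [random.choice(vocab) for _ in range(20000)]
lines = greedy_break(tokens, 40)

def th_rate(strings):
    """Occurrences of "th" per character over a list of strings."""
    return sum(s.count("th") for s in strings) / sum(len(s) for s in strings)

# Remove the spaces, then compare the first and last 5 characters of each line.
joined = ["".join(ws) for ws in lines]
start = th_rate([s[:5] for s in joined])
end = th_rate([s[-5:] for s in joined])
print(f"'th' per char near line start: {start:.3f}, near line end: {end:.3f}")
```

On this toy input the line-end rate comes out clearly higher than the line-start rate, even though every space has been deleted -- which is exactly the persistence effect described above.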
And that is even more true of the more sophisticated line-breaking algorithm (SLA) that the Scribe actually used, which included abbreviations and the stretching or compression of the writing. The effects of the SLA, too, will persist even if you remove the spaces from the resulting lines. In the case of the VMS, the occurrence of an m will be a good hint that there may have been a line break just after it.
And if you join those lines into a single character stream without any spaces, a good pattern detector should be able to locate the line breaks, based on those frequency anomalies. What is happening is that the pattern recognition algorithm learns to see several short words before one long word, and learns that a line break is more likely to occur there than elsewhere.
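Such a detector can be sketched very simply. The bigram model and the toy lines below are my own illustration (not a tested result on the VMS): train on a set of lines, learn which character bigrams tend to straddle line breaks, then look for the most break-like positions in the spaceless concatenation.

```python
from collections import Counter

def break_scorer(lines):
    """Learn, from a list of lines, how much more often each character
    bigram straddles a line break than occurs inside a line."""
    straddle, inside = Counter(), Counter()
    for a, b in zip(lines, lines[1:]):
        straddle[(a[-1], b[0])] += 1
    for s in lines:
        for x, y in zip(s, s[1:]):
            inside[(x, y)] += 1
    # Add-one smoothing so unseen bigrams get a neutral score of 1.
    return lambda bg: (straddle[bg] + 1) / (inside[bg] + 1)

def guess_breaks(stream, score, k):
    """Return the k positions whose surrounding bigram looks most break-like."""
    cand = sorted(range(1, len(stream)),
                  key=lambda i: score((stream[i - 1], stream[i])),
                  reverse=True)
    return sorted(cand[:k])

# Toy lines in which breaks happen to fall right after an "m",
# mimicking the iin -> m cue discussed below.
toy = ["qokedyaram", "shedyqotam", "daiincheom"]
score = break_scorer(toy)
stream = "".join(toy)
print(guess_breaks(stream, score, 2))  # recovers positions 10 and 20
```

A real detector would be trained on held-out lines and would use richer features than single bigrams, but even this crude likelihood ratio finds the breaks when the anomaly is strong.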
But this is true only if the TLA or SLA is given a stream of tokens. If you give it just a stream of characters, without the spaces, obviously there will not be any statistical anomalies around the resulting line breaks, which will be blindly inserted every N characters, exactly.
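For contrast, that blind fixed-width breaking is a one-liner (the function name is mine), and by construction the break positions carry no information about the text:

```python
def blind_break(chars, n):
    """Fixed-width breaking of a spaceless character stream: every line
    gets exactly n characters (except possibly the last), so the break
    positions are independent of the content."""
    return [chars[i:i + n] for i in range(0, len(chars), n)]

print(blind_break("abcdefghij", 4))  # -> ['abcd', 'efgh', 'ij']
```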
Quote:Maybe what we call "words" are not real words. [... but] Removing spaces does not kill the effect, and keeping them improves the model, which suggests that the spacing, even if imperfect, still reflects some real structure in the text.
Well, you know my (very strong now) beliefs about the nature of the text.
But independently of my theories, I propose that the following claims are true, and should be confirmed by experiments:
1. Whatever their function, word spaces were important to the Author. Thus the draft that the Scribe received was a sequence of tokens separated by spaces (and not just a sequence of non-blank characters that the Scribe would split into words at will).
2. The Scribe was supposed to ignore the line breaks of the draft, within each paragraph, and insert new line breaks so as to properly fill the space between the text rails.
3. The Scribe did not have permission to join tokens by discarding word spaces.
4. The Scribe did not have permission to break tokens by inserting new spaces.
5. The Scribe did not have permission to split a token across a line break.
6. The Scribe had a handful of abbreviations that he could use to help avoid bad breaks or rail overflow.
7. Changing iin to m was one such abbreviation. Any others?
The arguments for (1) are many. The argument for (2) is that the lines (other than paragraph tails) generally fill the space between the rails more or less neatly -- even when the rails are slanted, bent, or broken because of vellum defects that not even the Scribe could have predicted. Note that (2), in particular, denies the LAAFU theory.
I don't have arguments at hand for items (3)-(6), but I believe that they have been verified long ago.
The argument for (7), of course, is that the frequency of all words ending in iin is depressed at line end, while that of the corresponding words with iin replaced by m is enhanced.
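This prediction is mechanical to check on any tokenized transcription. A minimal sketch (the function and the toy lines are mine; real EVA data would replace `toy`, so the printed numbers here only exercise the code):

```python
def final_vs_medial_rate(lines, suffix):
    """Fraction of line-final vs. non-final tokens ending in `suffix`,
    where each line is given as a list of tokens."""
    final = [ws[-1] for ws in lines if ws]
    medial = [w for ws in lines for w in ws[:-1]]
    f = sum(w.endswith(suffix) for w in final) / max(len(final), 1)
    m = sum(w.endswith(suffix) for w in medial) / max(len(medial), 1)
    return f, m

# Toy EVA-like lines, only to exercise the function; not real counts.
toy = [["daiin", "okaiin", "cham"],
       ["qokaiin", "otam"],
       ["shedy", "daiin", "qokam"]]
for suf in ("iin", "m"):
    f, m = final_vs_medial_rate(toy, suf)
    print(f"{suf!r}: line-final rate {f:.2f}, elsewhere {m:.2f}")
```

Claim (7) predicts that, on the real transcription, the pair of rates for "iin" should tilt away from line-final position and the pair for "m" should tilt toward it.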
Is there significant evidence against claims (1)-(7)? Again, the mere observation of statistical anomalies around line breaks is not a valid argument, unless it can be shown that such anomalies cannot be simply side effects of the SLA. I don't think this has been shown yet.
All the best, --stolfi