21-04-2026, 11:52 PM
First, I must apologize to @quimqu and others here. I was wrong.
Or at least half-wrong. I have been claiming that TLA makes the first word of the line longer than average, and the last 1-3 words of the line shorter than average. The first part is true, but the second part is false. I have run some simulations, and to my surprise TLA (or SLA) has practically no effect on the length of words near the end of the line:
[attachment=15225]
The top plot is the average word length of the Nth word, counting from the line start. The bottom plot is the same, but counting from the line end.
In this simulation the words are 50% "long" (7 letters) and 50% "short" (2 letters). The max line length is set at 60 characters (including blanks between words). The SLA algorithm will abbreviate iin to m if that delays the line break. Here is a sample of SLA-justified text:
ytchedy dy ar dy qokaiin ol qokaiin qokaiin qokaiin qokaiin
dy shochdy qokaiin shochdy ytchedy dy shochdy ytchedy ol ol
qokaiin ytchedy shochdy qokaiin shochdy qokaiin ytchedy
shochdy shochdy qokaiin ol ytchedy ol shochdy ytchedy dy
shochdy ar ar dy dy ol ol ytchedy dy ar dy ol shochdy qokam
dy shochdy ar ytchedy ol dy qokaiin ol ol dy qokaiin shochdy
ytchedy ol ol ar ytchedy ytchedy qokaiin dy dy ar qokaiin
qokaiin dy shochdy dy shochdy ol ol shochdy shochdy shochdy
qokaiin ar shochdy ol ytchedy ytchedy ytchedy dy ytchedy
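A minimal Python sketch of a simulation along these lines follows. It is not the script used for the plots. The function names, the seed, and the word lists (copied from the sample above) are illustrative. It also assumes that the justification is a plain greedy fill, where a word that does not fit starts a new line, and that the SLA rule only applies to a word ending in iin that would otherwise overflow.

```python
import random

LONG_WORDS = ["qokaiin", "shochdy", "ytchedy"]   # 7 letters each (from the sample)
SHORT_WORDS = ["dy", "ar", "ol"]                 # 2 letters each (from the sample)
MAX_WIDTH = 60                                   # includes blanks between words


def random_words(n, rng):
    """Draw n words, each long or short with probability 1/2."""
    return [rng.choice(LONG_WORDS if rng.random() < 0.5 else SHORT_WORDS)
            for _ in range(n)]


def fill_lines(words, max_width=MAX_WIDTH, abbreviate=False):
    """Greedy fill: a word that does not fit starts a new line.

    With abbreviate=True, a word ending in 'iin' that would overflow is
    retried with 'iin' replaced by 'm' (an assumed reading of the SLA rule).
    """
    lines, current, used = [], [], 0
    for word in words:
        extra = len(word) + (1 if current else 0)   # blank before all but the first word
        if used + extra > max_width and abbreviate and word.endswith("iin"):
            short = word[:-3] + "m"
            short_extra = len(short) + (1 if current else 0)
            if used + short_extra <= max_width:
                word, extra = short, short_extra    # abbreviation delays the break
        if current and used + extra > max_width:
            lines.append(current)                   # close the line, start a new one
            current, used, extra = [], 0, len(word)
        current.append(word)
        used += extra
    if current:
        lines.append(current)
    return lines


rng = random.Random(0)   # arbitrary seed
for line in fill_lines(random_words(100, rng), abbreviate=True):
    print(" ".join(line))
```

Calling `fill_lines` with `abbreviate=False` gives the plain greedy variant, so the two can be compared on the same word stream.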
The top plot shows that, as I claimed, TLA (and SLA) cause the first word on the line to be longer on average than the overall word length average of 0.5*7 + 0.5*2 = 4.5 (green line).
But the bottom plot shows that the average length of the last word of each line (rightmost point) is practically the same as the overall average! Oops!
That was quite a blow to my intuition. I observed that shorter words could still be added to the end of the line in those situations where longer words would drop to the next line. I intuitively concluded that, for that reason, the last word would be more likely to be short. But in fact the breaking of the line depends on the next word, whereas the last word that remains on the line depends on the previous word -- which is independent of the word that caused the break.
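The two averages can be checked with a short sketch that works on word lengths only. It assumes a plain greedy fill with no abbreviation, a 60-character limit, and 7- and 2-letter words with equal probability. The names `fill_lengths`, `first_mean` and `last_mean` and the seed are made up for the illustration. Under these assumptions the first-word mean should land well above 4.5 while the last-word mean stays close to it.

```python
import random


def fill_lengths(lengths, max_width=60):
    """Greedy fill on word lengths only; returns lines as lists of lengths."""
    lines, current, used = [], [], 0
    for n in lengths:
        extra = n + (1 if current else 0)   # one blank before every word but the first
        if current and used + extra > max_width:
            lines.append(current)           # this word caused the break ...
            current, used, extra = [], 0, n # ... and becomes the first word of the next line
        current.append(n)
        used += extra
    if current:
        lines.append(current)
    return lines


rng = random.Random(2026)                   # arbitrary seed
lengths = [rng.choice((7, 2)) for _ in range(200_000)]
lines = fill_lengths(lengths)

overall_mean = sum(lengths) / len(lengths)
first_mean = sum(line[0] for line in lines) / len(lines)
last_mean = sum(line[-1] for line in lines) / len(lines)
print(f"overall {overall_mean:.2f}  first {first_mean:.2f}  last {last_mean:.2f}")
```

The first word of a line is the word that caused the previous break, and long words cause breaks more often than short ones. The last word of a line is whatever happened to fit before that, which is why its average is not pulled in either direction.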
So TLA can explain line-start anomalies, but not line-end anomalies.
One bizarre feature of the top plot is that it shows that the average length of the Nth word keeps decreasing below the global average as N increases. That seems to contradict the bottom plot. The average number of words per line is about 10, but the 15th word from the start (top plot) is shorter than average, while the 5th word from the end (bottom plot) is just average.
That turns out to be an illusion, the result of what could be called "selection" or "survivor bias". It turns out that the lines that have a 15th word must have lots of short words; and since the average length of the 15th word is computed only over those lines, it naturally comes out less than average. At the extreme, the only lines that have a 19th or 20th word are lines that consist entirely of short words (3 of them in my sample text), so the average length of the 19th word and of the 20th word is just 2 letters.
And this bias starts to show even before the 10th word. Lines that have only long words cannot have more than 7 words. Thus the average length of the 8th word is less than the global average because it considers only lines that have at least one short word, and so on.
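The constraint behind this bias is plain arithmetic and can be tabulated without any simulation. The sketch below is illustrative only: the helper `max_long_words` is a made-up name, and it ignores the iin-to-m abbreviation. It reports, for a line that has at least N words, the largest number of long words that its first N words can contain within 60 characters.

```python
def max_long_words(n, max_width=60, long_len=7, short_len=2):
    """Largest count of long words among the first n words of one line.

    Returns None if not even n short words fit in max_width characters.
    """
    best = None
    for k in range(n + 1):
        # k long words, n - k short words, n - 1 blanks between them
        width = k * long_len + (n - k) * short_len + (n - 1)
        if width <= max_width:
            best = k
    return best


for n in (7, 8, 10, 15, 19, 20, 21):
    print(n, max_long_words(n))
```

For N = 7 all seven words can be long, but for N = 8 at most 7 can, so every line with an 8th word has at least one short word among its first eight. For N = 19 and N = 20 the limit is 0 long words, and for N = 21 nothing fits at all.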
The same bias explains the left-hand part of the bottom plot, which shows that the 15th word counting from the end is much shorter than average -- even though the top plot shows that the 5th word counting from the start is precisely average.
This is an artifact that one must be wary of when plotting average word lengths as a function of position, counted in words from the line start or from the line end. Could it be that this selection bias explains some of the claimed line-end anomalies?
I will answer @pfeaster in the next post.
All the best, and again apologies --stolfi