(17-09-2025, 08:14 AM)MarcoP Wrote: We know from Patrick Feaster’s research (e.g. his 2022 Malta paper) that ch- words are relatively rare both at line start and at line end. Their frequency peaks immediately after the first word in a line (positions 2 and 3) and then decreases as one moves rightward along the line. It’s also probably worth remembering that the frequency of words containing ch (as a prefix or not) is not lower at line start than in other positions. But words starting with ych- or dch- are 10 times more frequent at line start than elsewhere; they account for almost half of the line-start words that contain ‘ch’.
I believe that those discrepancies could be caused by, among other things,
- The line-breaking algorithm (LBA) effect, together with different initial-glyph distributions for long and short words.
- Formula effects that make the word distribution at the start of line 2 different from the overall distribution.
- Misreading of ambiguous spaces leading to splitting or joining of words away from line start.
To illustrate point 1, suppose that 80% of the word occurrences in a language are only 2-3 letters long and start with "u", while 20% are 20 letters long and start with "a". Then the LBA effect would result in, say, only 50% of the lines starting with "u" (because, say, 50% of the line-initial words would be long and hence start with "a"), while, say, 85% of non-line-initial words would start with "u".
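The toy language of point 1 can be simulated directly. This is a minimal sketch, assuming a simple greedy line-breaker and a line width of 40 characters; both are illustrative choices, not a claim about how the scribe actually broke lines, and the function names are made up for the example.

```python
import random

def make_words(n, rng):
    """Toy language: 80% short 'u' words (2-3 letters), 20% long 'a' words (20 letters)."""
    words = []
    for _ in range(n):
        if rng.random() < 0.8:
            words.append("u" + "x" * rng.choice([1, 2]))
        else:
            words.append("a" + "x" * 19)
    return words

def break_lines(words, width):
    """Greedy line breaking (assumed): a word moves to the next line when it does not fit."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines

def share_of_u(words):
    """Fraction of the given words that start with 'u'."""
    return sum(w.startswith("u") for w in words) / len(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
lines = break_lines(make_words(20000, rng), width=40)
initial = [line[0] for line in lines]
others = [w for line in lines for w in line[1:]]
print(f"line-initial 'u' share:   {share_of_u(initial):.2f}")
print(f"other-position 'u' share: {share_of_u(others):.2f}")
```

The exact percentages depend on the assumed line width and word lengths; the point is only that long words are pushed to line start more often than short ones, so the line-initial "u" share comes out lower.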
To illustrate point 2, imagine an English herbal where each paragraph starts with a dry list of diseases, without any of the common "th" words ("the", "this", "that", "then", "they", "them", "there", ...), and the list often extends into line 2. Then "th" would be under-represented in line-start position just because it would never occur at the start of line 2.
To illustrate point 3, imagine that, in an English text, a blank is inserted before every "t" letter, thus splitting any word with an embedded "t". Then the frequency of "t" as word-initial would be much increased, except among line-initial words, since those blanks would have no effect there.
Or, conversely, suppose that every word that starts with "e" is joined to the preceding word on the same line. That would reduce the frequency of "e" in word-initial position to zero, except among the line-initial words.
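The first variant of point 3 (a blank before every embedded "t") is easy to try on a few made-up English lines. A minimal sketch; the sample lines and helper names are invented for the illustration.

```python
import re

sample = [
    "the cat sat on the mat",
    "another little kitten stayed",
    "but nothing mattered to it",
]

def split_before_t(line):
    """Insert a blank before every 't' that is not already word-initial."""
    return re.sub(r"(?<=\S)t", " t", line)

def initial_t_share(lines):
    """Return the share of words starting with 't' among
    (line-initial words, all other words)."""
    first = [ln.split()[0] for ln in lines]
    rest = [w for ln in lines for w in ln.split()[1:]]

    def share(words):
        return sum(w.startswith("t") for w in words) / len(words)

    return share(first), share(rest)

before = initial_t_share(sample)
after = initial_t_share([split_before_t(ln) for ln in sample])
print("before splitting (line-initial, other):", before)
print("after splitting  (line-initial, other):", after)
```

The line-initial share is unchanged by the inserted blanks, while the share among the other words rises, which is the asymmetry described above.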
One could test whether explanation 1 is viable by tabulating the length distribution of line-initial and non-line-initial words. If the two columns are significantly different, then one could tabulate the initial letter of n-letter words as a function of n, and see whether the two tables together explain the initial-letter discrepancies.
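A rough sketch of that two-table check follows. It assumes each line is already a list of words, that one character stands for one glyph (not true of EVA "ch", which would need pre-processing), and that the initial-glyph-by-length table is estimated from non-line-initial words only. All names are hypothetical.

```python
from collections import Counter, defaultdict

def length_tables(lines):
    """Word-length counts for line-initial words and for all other words."""
    first = Counter(len(line[0]) for line in lines if line)
    rest = Counter(len(w) for line in lines for w in line[1:])
    return first, rest

def initial_given_length(lines):
    """Initial-glyph counts per word length, from non-line-initial words only."""
    table = defaultdict(Counter)
    for line in lines:
        for w in line[1:]:
            table[len(w)][w[0]] += 1
    return table

def predicted_initial_share(lines, glyph):
    """Share of line-initial words expected to start with `glyph` if word
    length alone drove the difference (explanation 1)."""
    first, _ = length_tables(lines)
    table = initial_given_length(lines)
    total = sum(first.values())
    share = 0.0
    for n, count in first.items():
        seen = sum(table[n].values())
        if seen:  # lengths never seen away from line start contribute nothing
            share += (count / total) * (table[n][glyph] / seen)
    return share

def observed_initial_share(lines, glyph):
    """Share of line-initial words that actually start with `glyph`."""
    firsts = [line[0] for line in lines if line]
    return sum(w.startswith(glyph) for w in firsts) / len(firsts)

toy = [["aaaa", "uu", "uu"], ["uu", "aaaa", "uu"], ["aaaa", "uu", "aaaa"]]
print(predicted_initial_share(toy, "a"), observed_initial_share(toy, "a"))
```

If the predicted and observed shares roughly agree for the glyphs of interest, word length alone would account for the discrepancy; a large gap would point to the other explanations.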
One could test explanation 2 by comparing the initial-word distribution of line 2 in paragraphs with a full-length head line with that of line 2 in paragraphs where the head line is much shorter because of intruding plants. Any formula effects should be more likely to spill into line 2 in the second case than in the first case.
One could perhaps test explanation 3 by taking lines with one or more commas, and doing the statistics once with commas deleted and once with commas treated as word spaces. Or maybe by comparing the frequencies of "X.Y" and "XY" for the relevant glyphs X and Y.
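The comma variant could be sketched as below. This assumes a transliteration in which "." marks a definite word space and "," a doubtful one, with one character per glyph; the sample line and function names are made up for illustration.

```python
from collections import Counter

def tokenize(line, comma_as_space):
    """Split a transliterated line into words, either treating ',' as a
    word space or deleting it (joining the neighbouring pieces)."""
    line = line.replace(",", "." if comma_as_space else "")
    return [w for w in line.split(".") if w]

def initial_counts(lines, comma_as_space):
    """Initial-glyph counts for line-initial words and for all other words,
    using only lines that contain at least one comma."""
    first, rest = Counter(), Counter()
    for line in lines:
        if "," not in line:
            continue
        words = tokenize(line, comma_as_space)
        if not words:
            continue
        first[words[0][0]] += 1
        rest.update(w[0] for w in words[1:])
    return first, rest

sample_lines = ["daiin.chol,dy.qokeedy", "shedy.qokain"]
for mode in (False, True):
    first, rest = initial_counts(sample_lines, comma_as_space=mode)
    print("commas as spaces" if mode else "commas deleted  ", dict(first), dict(rest))
```

Comparing the two "rest" tables against the line-initial table would show how much of the discrepancy could be produced by doubtful spaces alone.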
Any better ideas?
All the best, --jorge