25-09-2019, 10:10 AM
For a while, I have been wondering how mid-line breaks due to drawings compare with "true" line breaks. In other words, I wanted to check whether the same LAAFU (line as a functional unit) effects that happen at line boundaries also appear when lines are interrupted by a drawing.
This has likely been studied before, but I am not aware of any specific research in this area.
As always, I cannot exclude that I have made errors somewhere in the process.
This analysis is based on the corpus of text from pages that include at least one drawing interruption. I used the Zandbergen-Landini transcription (ignoring uncertain spaces), where mid-line breaks are marked by the three-character sequence '<->'. The text from pages that do not include mid-line breaks was ignored. The corpus includes:
13966 words
11050 regular word breaks (word couples separated by a space)
751 image word breaks (word couples separated by an illustration)
2168 lines
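A minimal sketch of how counts like these could be obtained. It is not the code used for this post. It assumes that each transcribed line has already been stripped of locus tags and comments, that '.' separates words, and that ',' marks an uncertain space (removed here, so the two parts are joined). The function names and the toy lines are hypothetical, and the sketch does not filter out pages without '<->'.

```python
def parse_line(line):
    """Split one transcribed line into segments of words.

    Assumed conventions: '.' separates words, ',' is an uncertain space
    (ignored, i.e. removed), '<->' marks a drawing interruption.
    """
    line = line.replace(",", "")
    segments = []
    for chunk in line.split("<->"):
        words = [w for w in chunk.split(".") if w]
        if words:  # empty chunks (e.g. a line starting with '<->') are dropped
            segments.append(words)
    return segments


def corpus_counts(lines):
    """Count words, regular (space) breaks, image breaks and lines."""
    words = regular = image = 0
    for line in lines:
        segments = parse_line(line)
        words += sum(len(s) for s in segments)
        regular += sum(len(s) - 1 for s in segments)
        image += max(len(segments) - 1, 0)
    return {
        "words": words,
        "regular_breaks": regular,
        "image_breaks": image,
        "lines": len(lines),
    }


# Toy input, invented for illustration only.
toy_lines = ["daiin.chol<->s.cheor", "qokeedy.or,aiin<->dy"]
print(corpus_counts(toy_lines))
# {'words': 7, 'regular_breaks': 3, 'image_breaks': 2, 'lines': 2}
```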
The following histogram illustrates word length, considering:
* all words in the corpus
* first words in lines
* words that appear immediately before an image break
* words that appear immediately after an image break
* last words in lines
[attachment=3369]
The histogram shows that the first word of each line is slightly longer than average. This effect has been discussed before.
In contrast, words that appear immediately before a mid-line break are shorter than average. Line-final words and words following an image break have a normal length.
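The comparison of average word length by position can be sketched as follows. This is only an illustration under assumptions: each line is given as a list of segments (the parts between image breaks), each segment is a list of words, and word length is counted in transcription characters, which may differ from how glyphs were counted for the histogram. The function names and the toy data are hypothetical.

```python
from statistics import mean


def words_by_position(lines):
    """Group words by the positions compared in the histogram.

    lines: list of lines; each line is a list of segments separated by
    image breaks; each segment is a list of words.
    """
    groups = {
        "all": [],
        "line_first": [],
        "before_image": [],
        "after_image": [],
        "line_last": [],
    }
    for segments in lines:
        segments = [s for s in segments if s]
        if not segments:
            continue
        groups["all"].extend(w for s in segments for w in s)
        groups["line_first"].append(segments[0][0])
        groups["line_last"].append(segments[-1][-1])
        # Each pair of adjacent segments corresponds to one image break.
        for left, right in zip(segments, segments[1:]):
            groups["before_image"].append(left[-1])
            groups["after_image"].append(right[0])
    return groups


def mean_lengths(groups):
    """Average word length (in transcription characters) per position."""
    return {name: mean(len(w) for w in ws) for name, ws in groups.items() if ws}


# Toy data, invented for illustration only.
toy = [
    [["daiin", "chol"], ["s", "cheor"]],
    [["qokeedy", "oraiin"], ["dy"]],
]
print(mean_lengths(words_by_position(toy)))
```

Note that a line with a single word contributes the same word to both the line-first and the line-last group; whether that matches the original counts is not known.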
The following histogram shows the distribution of individual word lengths. Words of length 1 and 2 are more frequent among line-final words, maybe because they can be more easily squeezed in at the end of a line. This tendency is much stronger for words before an image break. It is possible that words are sometimes split around the image, but words after the image break do not show any particular word-length pattern.
[attachment=3370]
This graph shows frequencies for the most common word-initial characters in the different positions. s[^h] stands for s-not(h), i.e. it excludes the "bench" Sh which is considered separately.
The fact that p-, t-, y-, s- are more frequent at line start is another known fact, discussed for instance by Emma.
[attachment=3371]
The first word after an image break shares the preference for y- and s- with line-initial words, but initial gallows are almost totally absent. Besides the gallows, l- and q- are also rare after an image break, while o- is more frequent than usual. The drop in the frequency of q- after an image break and the (symmetrical?) increase in y-/o- are particularly noticeable and puzzling.
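The word-initial classes, including the split between the bench Sh and s-not(h), could be computed along these lines. This is a sketch under assumptions, not the original procedure: words are lowercased before classification, and the stand-alone word 's' is counted in the s[^h] class (a literal regular expression `s[^h]` would not match a one-letter word, and the post does not say how that case was handled). The function names and the toy words are hypothetical.

```python
from collections import Counter


def initial_class(word):
    """Map a word to its word-initial class.

    'Sh' is the bench; 's[^h]' is any other word starting with s,
    including the one-letter word 's' (an assumption of this sketch).
    """
    w = word.lower()
    if w.startswith("sh"):
        return "Sh"
    if w.startswith("s"):
        return "s[^h]"
    return w[0]


def initial_frequencies(words):
    """Relative frequency of each word-initial class in a list of words."""
    counts = Counter(initial_class(w) for w in words if w)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy words, invented for illustration only.
print(initial_frequencies(["shedy", "s", "sar", "qokeedy"]))
# {'Sh': 0.25, 's[^h]': 0.5, 'q': 0.25}
```

Running the same function separately on the words of each position (line-initial, after an image break, and so on) gives the per-position frequencies that the graph compares.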
The graph for the word-final character shows what could be the best known LAAFU effect: the high frequency of -g and -m at line end. The two characters are not particularly frequent before a mid-line break. On the other hand, -d and -s are twice as frequent before a mid-line break as in the other positions; -y is also more frequent than expected.
[attachment=3372]
I looked into the specific case of -s before an image break. It turns out that almost half of the occurrences are due to the word 's' itself: the word occurs before 0.6% of regular (i.e. space) word breaks and before 3.2% of image breaks, more than 5 times as frequently. Nothing similar happens for 'd' and 'y', the other two characters that are more common immediately before an image break: they only rarely occur in that position as stand-alone words.
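As a quick check of the arithmetic, the share of breaks preceded by a given word can be sketched as below. The helper name and the toy pairs are hypothetical. The two percentages are the rounded figures quoted above, so the implied counts are only approximate.

```python
def share_preceded_by(pairs, target="s"):
    """Fraction of breaks whose preceding word equals `target`.

    pairs: list of (word_before, word_after) tuples for one break type.
    """
    if not pairs:
        return 0.0
    return sum(1 for before, _ in pairs if before == target) / len(pairs)


# Toy pairs, invented for illustration only: 1 of 4 breaks follows 's'.
toy_pairs = [("s", "cheor"), ("chol", "daiin"), ("dy", "qokeedy"), ("or", "aiin")]
print(share_preceded_by(toy_pairs))  # 0.25

# Figures reported in the post (rounded percentages and break totals).
reported_regular = 0.006   # 's' before 0.6% of regular breaks
reported_image = 0.032     # 's' before 3.2% of image breaks
ratio = reported_image / reported_regular
print(round(ratio, 1))     # 5.3, i.e. "more than 5 times"

# Approximate counts implied by the rounded percentages.
print(round(reported_image * 751))      # about 24 image breaks
print(round(reported_regular * 11050))  # about 66 regular breaks
```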
In a few cases, there are multiple occurrences of 's' immediately before an image on a single page.
's' is the only character to appear twice isolated by two image breaks.
It seems possible that this could be described as a preference for detaching an initial s- from the rest of the word.