(6 hours ago)quimqu Wrote: That initial step is much more dependent on the line type (paragraph-initial, paragraph-internal, paragraph-final, or lines outside paragraphs) than anything that comes after. Once you move further into the word, the distributions of character transitions look much more alike across types.
Forget the words in the first lines of paragraphs. Or even the first N words of a paragraph, even if they extend into the second line (e.g. in pages where the lines are short because of drawings.) Those lines are special for a few known and probably many unknown reasons.
As for other lines: it has been observed recently that the algorithm that a scribe uses to break paragraphs into lines, in any language, will naturally result in the first word of each line being longer than average, and the last 2-3 words being shorter than average.
This fact should have an effect on letter and letter pair statistics, since they are normally dominated by the letters and pairs that appear in the most frequent words. For instance, in English the most common words that begin with "th" are relatively short: "the", "this", "that", "then", "they", "them", "there". If these words are suppressed at the beginning of lines, the frequency of "th" in line-initial position will be reduced.
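To make the line-initial part of that bias concrete, here is a minimal simulation sketch. It assumes a purely greedy scribe (a word goes on the current line if it fits, otherwise it starts the next line), random word lengths from 1 to 9, and a line width of 40 characters. All of these are my illustrative assumptions, not measurements from any manuscript, and the sketch only looks at the first word of each line.

```python
import random

def break_into_lines(words, width):
    """Greedy breaking: a word that does not fit starts the next line."""
    lines, current, used = [], [], 0
    for w in words:
        needed = len(w) + (1 if current else 0)  # +1 for the word space
        if current and used + needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used += needed
    if current:
        lines.append(current)
    return lines

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical "text": 20000 words with uniformly random lengths 1..9.
words = ["x" * rng.randint(1, 9) for _ in range(20000)]
lines = break_into_lines(words, width=40)

overall = mean([len(w) for w in words])
# Skip the very first line: its first word was not placed there by a break.
line_initial = mean([len(ln[0]) for ln in lines[1:]])

print(f"mean word length overall:      {overall:.2f}")
print(f"mean length of line-initial:   {line_initial:.2f}")
```

The word that fails to fit at the end of a line is more likely to be a long one, so the line-initial mean comes out above the overall mean even though the word lengths themselves are independent of position.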
But that line-breaking bias is not the only phenomenon that could explain those anomalies. For instance, the scribe may have been instructed to avoid or prefer inserting a line break between certain pairs of words. (However, while such rules are common today in professional typesetting, e.g. avoiding a break between a number and a unit of measure, they probably did not apply to manuscripts from the 1400s.)
Another possible factor is that there are many spaces of intermediate width between the usual inter-glyph and inter-word spaces. In the available transcriptions, some of these ambiguous spaces are marked ",", some are marked ".", and some are just omitted. These ambiguous spaces tend to occur before or after specific letters. Incorrect reading of those spaces may therefore change the statistics of word-initial letters and digraphs.
Just to explain this theory, suppose that every word that contains "ol" was incorrectly split before the "ol". Then "l" would become much more common as the next-char after word-initial "o" in words that are not line-initial. Conversely, suppose that every non-line-initial word that started with "oa" was mistakenly joined to the previous word. That would decrease the frequency of "a" after word-initial "o" in words that are not line-initial.
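That thought experiment can be sketched in a few lines. The word list below is made up for illustration (it is not taken from any transcription), and the helper names split_before, join_before and next_after_initial are hypothetical.

```python
from collections import Counter

def split_before(word, digraph):
    """Split a word before each non-initial occurrence of the digraph."""
    i = word.find(digraph, 1)  # start at 1: ignore a digraph at the word start
    if i == -1:
        return [word]
    return [word[:i]] + split_before(word[i:], digraph)

def join_before(words, digraph):
    """Join every non-first word starting with the digraph to the previous word."""
    out = []
    for w in words:
        if out and w.startswith(digraph):
            out[-1] += w
        else:
            out.append(w)
    return out

def next_after_initial(words, initial):
    """Count the second character of words that start with `initial`."""
    return Counter(w[1] for w in words if len(w) > 1 and w.startswith(initial))

# Made-up words, treated as one line of text.
words = ["qokol", "chol", "okedy", "oteol", "daiin", "oar"]

before = next_after_initial(words, "o")
split_words = [part for w in words for part in split_before(w, "ol")]
after_split = next_after_initial(split_words, "o")
after_join = next_after_initial(join_before(words, "oa"), "o")

print(before["l"], after_split["l"])  # 0 3
print(before["a"], after_join["a"])   # 1 0
```

In this toy case the bad splits create three new words "ol", so "l" after word-initial "o" goes from 0 to 3, while the bad join hides the only "oa" word, so "a" after word-initial "o" goes from 1 to 0.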
People have been puzzling over those statistical anomalies for decades, with little progress. I don't expect the explanations to jump out by themselves from just looking at the numbers. Progress is more likely to come from thinking of a possible explanation, then collecting the statistics that would most clearly confirm or refute that theory. The old and boring "scientific method". That is how two important facts were discovered recently: that (in any language) the scribe's natural line-breaking algorithm distorts the distribution of line-initial and line-final words, and that the somewhat fixed formula of herbal captions greatly distorts the distribution of words, letters, and digraphs depending on the position within the paragraph.
All the best, --jorge