(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Wow, Patrick's paper is quite a big meal to digest.
I'd recommend the [link] over the earlier/longer blog post.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote:
1. Remove the head lines of parags. Among other things, that line is likely to have special contents (like plant names and aliases), which could well imply different word frequencies and positional patterns, and hence the same for characters and digraphs.
When I published that initial blog post, I hadn't yet worked out another kind of display I used in the conference paper, using differences of brightness and/or color to represent the frequency of particular features in different areas of lines and paragraphs. Here are a few examples from the conference paper (which also goes into more detail about how these displays are generated).
[attachment=12463]
The first and last lines of paragraphs and the first and last words of lines are separated out into their own rows or columns, while everything in the "middle" is displayed relatively. As you can see, in spite of some notable differences, the key patterns visible in mid-paragraph also extend into first lines. The [Sh] / [ch] case seems due mostly to [Sh] words being especially common in second position specifically. The [k] / [t] and [qo] / [o] cases are more spread out -- loosely separated into a "first half of line" and "second half of line."
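For anyone who wants to experiment with this kind of layout, here is a minimal sketch of the binning step in Python. It is not the code behind the charts above, just an illustration under stated assumptions: the names `position_grid`, `has_feature` and `mid_bins` are hypothetical, paragraphs are assumed to be already split into lines and words, and all middle lines are pooled into a single row rather than spread out relatively.

```python
from collections import defaultdict

def position_grid(paragraphs, has_feature, mid_bins=4):
    """Return {(row, col): fraction of words showing the feature}.

    paragraphs: list of paragraphs; each paragraph is a list of lines;
                each line is a list of word strings.
    Rows: "first", "middle" or "last" line of the paragraph
          (simplification: all middle lines share one row).
    Cols: "first" word, "last" word, or an integer bin 0..mid_bins-1
          giving the relative position among the interior words.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for para in paragraphs:
        for li, line in enumerate(para):
            if li == 0:
                row = "first"
            elif li == len(para) - 1:
                row = "last"
            else:
                row = "middle"
            n = len(line)
            for wi, word in enumerate(line):
                if wi == 0:
                    col = "first"
                elif wi == n - 1:
                    col = "last"
                else:
                    # Interior words (indices 1..n-2) are scaled into mid_bins bins.
                    col = (wi - 1) * mid_bins // (n - 2)
                totals[(row, col)] += 1
                hits[(row, col)] += bool(has_feature(word))
    return {cell: hits[cell] / totals[cell] for cell in totals}
```

Each cell's fraction could then be mapped to a brightness or color value; calling it with something like `lambda w: w.startswith("Sh")` as the feature test would give one grid per feature to compare.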
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: One problem to watch for here is that parag breaks are sometimes not obvious. In the Stars section, in particular, I suspect that there is a run of 5-6 parags that were joined by the Scribe (a newbie?) into a single parag, before he returned to the normal format.
I doubt there are enough truly ambiguous cases of paragraph division to make much overall impact on these statistics, but I agree that it's worth being aware of some uncertainty on this front.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 2. Limit the analysis to just one of the sections with substantial running text -- Herbal-A, Herbal-B, Bio, and Stars. If the anomalies are real, they should be noticeable, and probably even stronger, in one of those sections. If they turn out to be absent or different in other sections, that by itself would be important information.
They're still noticeable within individual sections, and of course there's some variation across sections with this as with everything else. Rather than pursuing this exhaustively on my own, I've made the [link] for generating grayscale or multicolor charts like the ones shown above available to anyone who would like to run whatever specific comparisons/contrasts they like.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: 3. Try to identify the words that are responsible for the anomalies. Maybe I have misread the tables in the paper, but among the Sh/Ch word pairs, some seem to have greater positional bias than others.
I don't believe specific "words" are in fact responsible for the anomalies, in the sense I think you mean. From what I can tell, "words" generally show positional biases based on the positional patterns of their constituent parts. If you want to predict the positional bias of any given word, your best bet seems to be to combine the positional biases of the glyphs or glyph groups that make it up -- odd as that might seem.
(15-11-2025, 12:21 AM)Jorge_Stolfi Wrote: Could it be, for example, that the most leftward member of the pair often occurs after a long word, while the other member more often occurs after a short one? That could perhaps explain the positional anomaly as a consequence of the line-breaking word-length bias.
Or maybe the two members of the pair can get fused or split at different rates in the transcription. So that some of the Sheols are actually Sheoldy while most Cheols are indeed Cheols. I can't think how this possible confounding factor could be addressed. Although this may be one case
I'd encourage you to test those hypotheses.
Still, if we generate similar displays based purely on raw glyph sequences, ignoring word breaks altogether, the skewed distribution of glyphs and glyph groups tends to recapitulate the skewed distribution of words containing those same glyphs and glyph groups -- so my sense is that what we're seeing here isn't likely to be an artifact of spacing ambiguities or the lengths of adjacent words.
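A rough sketch of what measuring position on raw glyph sequences might look like, in Python. The names `glyph_positions` and `mean_position` are hypothetical, and the sketch assumes a transcription in which each character is one glyph and word breaks are marked by spaces or dots; it is a simplified illustration, not the procedure used for the displays.

```python
def glyph_positions(lines, group, separators=" ."):
    """Relative line positions (0 = line start, 1 = line end) of every
    occurrence of `group`, measured on each line with word breaks removed."""
    positions = []
    for line in lines:
        # Drop word-break characters so only the raw glyph sequence remains.
        glyphs = "".join(ch for ch in line if ch not in separators)
        span = len(glyphs) - len(group)
        start = glyphs.find(group)
        while start != -1:
            positions.append(start / span if span > 0 else 0.0)
            start = glyphs.find(group, start + 1)
    return positions

def mean_position(lines, group):
    """Average relative position of `group`, or None if it never occurs."""
    pos = glyph_positions(lines, group)
    return sum(pos) / len(pos) if pos else None
```

Comparing `mean_position(lines, "Sh")` with `mean_position(lines, "ch")` would then give a crude leftward/rightward contrast that does not depend on where spaces were transcribed. One caveat: a plain substring search for [o] also counts the [o] inside [qo], so overlapping groups would need separate handling.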