(02-12-2021, 04:06 PM)Davidsch Wrote: could you find a way to merge these with the following: horizontal line size in amount of words and/or vertical line size in amount of words?
I built an option into my script that generates separate images for different line lengths: say, all seven-vord lines, then all eight-vord lines, then all nine-vord lines, and so on. For each pass, any line containing a different number of vords is ignored and also factored out of the total possible value at each point, which I use to help scale brightness at the end, so that (for example) lines with fewer than four vords or paragraphs with fewer than three lines don't "count against" missing positions.
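To give a sense of the kind of filtering and tallying I mean, here's a stripped-down sketch in Python. It isn't my actual script; the names, the list-of-vords input format, and the 500-pixel stretch are just placeholders for the idea.

[code]
WIDTH = 500  # pixel width of each output chart

def hits_and_possible(lines, target, line_length):
    """Tally hits for one target vord, restricted to lines of one length.

    `lines` is assumed to be a list of lines, each a list of vord strings.
    Lines of any other length are skipped entirely, so they neither add
    hits nor add to the 'possible' totals used later for brightness scaling.
    """
    hits = [0.0] * WIDTH
    possible = [0.0] * WIDTH
    for line in lines:
        if len(line) != line_length:
            continue
        for i, vord in enumerate(line):
            # stretch vord position i of this line onto the 500-pixel scale
            start = i * WIDTH // line_length
            end = (i + 1) * WIDTH // line_length
            for px in range(start, end):
                possible[px] += 1
                if vord == target:
                    hits[px] += 1
    return hits, possible
[/code]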
I haven't tried this approach yet with paragraph length, partly because I'm less sure what different sizes or groups to use. But it should be equally possible in principle.
The main problem, I think, is that limiting results by line or paragraph length exacerbates the dataset-size problem that Marco brought up. If the quantity of [daiin] tokens starts to get uncomfortably small when we limit our scope to Scribe 2 verso pages, it's easy to imagine a similar outcome if we were instead to limit our scope to, say, lines containing seven vords or paragraphs containing five lines, or both.
And then on top of that, there's the added complication that a single uncertain vord space can mean the difference between (say) a seven-vord line and an eight-vord line.
One alternative I've tried is to analyze lines as continuous glyph strings, either ignoring vord divisions or treating [.] as its own glyph. Of course, that still means making decisions about what counts as one glyph. In my case, I've been arbitrarily treating [ch], [Sh], benched gallows, and any run of [e] or [i] as single glyphs. The idea is then to go through each line comparing a target string such as [r] or [ot] or [edy] against every successive string of that size with a moving window. For example, matching [edy] against [chedychey] would yield [ched] 0, [edy] 1, [dych] 0, [yche] 0, [chey] 0, i.e. 01000. Then the result for each line can be stretched to 500 pixels, as before.
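For what it's worth, the moving-window step can be sketched in a few lines of Python. The tokenizing regex below is only an approximation of the glyph groupings I just described; it treats benched gallows, [ch], [Sh]/[sh], and runs of [e] or [i] as single glyphs, and everything else as one character.

[code]
import re

# Rough glyph tokenizer: benched gallows (e.g. [cth]), [ch], [Sh]/[sh], and
# runs of [e] or [i] count as single glyphs; anything else is one character.
GLYPH = re.compile(r"c[ktpf]h|ch|[Ss]h|e+|i+|.")

def to_glyphs(line):
    return GLYPH.findall(line)

def window_matches(line, target):
    """Slide a window of the target's size across the line, glyph by glyph."""
    glyphs = to_glyphs(line)
    t = to_glyphs(target)
    n = len(t)
    return [1 if glyphs[i:i + n] == t else 0
            for i in range(len(glyphs) - n + 1)]

# window_matches("chedychey", "edy") -> [0, 1, 0, 0, 0], i.e. 01000
[/code]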
But if the results of vord-by-vord analysis quickly go "out of focus" in mid-line when we mix different line lengths, the same is true of glyph-by-glyph analysis, or even more so, since there are, of course, many more distinct line lengths counted in glyphs than counted in vords.
Since I think we care less about *exactly* where a glyph falls in mid-line than we do about *approximately* where it falls, I've tried blurring the result by compressing it by a particular factor before expanding it to 500 pixels. That factor should be our best guess about average cycle length -- that is, the number of "steps" that would typically separate one [ot] from the nearest other [ot]. Fortunately, this probably doesn't need to be exact. I've generally been going with a factor of four or five. What we're then measuring is the count of occurrences of the target string within four or five positions of wherever we're at in a line.
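Schematically, the compress-then-stretch blur might look like this (a simplified sketch, not the exact routine I use; it sums matches within each window of `factor` positions and then does a crude nearest-neighbour stretch):

[code]
def blur_and_stretch(bits, factor=4, width=500):
    """Compress a 0/1 match profile by `factor`, then stretch it to `width`.

    Each compressed cell sums the matches that fall within one window of
    `factor` positions, so the final chart shows roughly how many occurrences
    of the target sit within `factor` positions of each point in the line.
    """
    if not bits:
        return [0.0] * width
    compressed = [sum(bits[i:i + factor]) for i in range(0, len(bits), factor)]
    n = len(compressed)
    # crude nearest-neighbour stretch of the compressed profile onto `width` pixels
    return [compressed[min(px * n // width, n - 1)] for px in range(width)]
[/code]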
The results tend not to differ radically from those that come from vord-by-vord analysis, but they should be free of any distortions due to uncertain vord breaks -- albeit maybe at the expense of something else.
One advantage of this other approach is that it lets us examine positions of phenomena that cross vord boundaries, such as Smith-Ponzi word-break combinations (which could also be tracked more directly, but this way seems to work).
Here's a pair of charts for Davis Scribe 1 / Currier A plotting [nd] in blue, [no] in green, and [nch] in red. They actually show all occurrences of those glyph strings with or without spacing, but in the overwhelming majority of cases these strings contain a space, i.e. [n.d], [n.o], [n.ch]. The chart on the right separates out first and last lines of paragraphs.
[attachment=6088]
The brightness scale runs from roughly 11.5 tokens at maximum to 1.5 tokens at minimum, which seems like enough of a range to be statistically meaningful, though I'd welcome a more expert opinion on that point. The blurring factor is four positions.
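For anyone who wants to check the scaling, here is one plausible way of mapping the counts onto brightness; it's only an illustration of the idea, not necessarily step for step what my script does.

[code]
def to_brightness(hits, possible):
    """Map hit counts onto 0-255 pixel brightness values.

    Dividing by the 'possible' counts keeps short lines and short paragraphs
    from counting against positions they never had; the result is then scaled
    so the busiest pixel comes out as 255 (white).
    """
    rates = [h / p if p else 0.0 for h, p in zip(hits, possible)]
    peak = max(rates, default=0.0) or 1.0
    return [round(255 * r / peak) for r in rates]
[/code]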
It looks like a vord ending [n] is most likely to be followed by a vord beginning [o] in the first line of a paragraph; by a vord beginning [ch] in the first third of a line or the last line of a paragraph; and by a vord beginning [d] in the final third of a line.
But is that distinctive? For comparison, here's a chart of [.d] in blue, [.o] in green, and [.ch] in red to show the overall distribution of these glyphs in vord-initial position (but not line-initial position) regardless of preceding glyph. This time the brightness scale runs from around 40 tokens maximum to around 1.5 tokens minimum; same blurring factor as before.
[attachment=6089]
I'm not sure whether the pattern after [n.] is significantly different from the general pattern or not -- the contrasts look stronger, particularly in the red [n.ch] region, but that may just be due to the smaller amount of data.
Still, adapting the ingenious Smith-Ponzi method of analyzing word-break combinations, I suppose we could in principle take the chart for [n.] in vord-final but not line-final position --
[attachment=6090]
-- and multiply it by the chart for [.d], [.o], and [.ch] to generate a chart representing the "expected" distribution of [n.d], [n.o], and [n.ch], and then contrast the actual and expected distributions -- and we could try this with other combinations as well. It would be interesting to see if the intriguing disparities Emma and Marco wrote about have a positional dimension to them.
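The multiply-and-contrast step itself would be trivial in code; something along these lines, glossing over normalization, which would need care:

[code]
def expected_profile(n_final, vord_initial):
    """Pixel-by-pixel product of the [n.] profile and a vord-initial profile
    (e.g. [.d]), giving an 'expected' positional distribution for [n.d]."""
    return [a * b for a, b in zip(n_final, vord_initial)]

def contrast_profile(actual, expected):
    """Ratio of actual to expected at each pixel (zero where there's no data)."""
    return [a / e if e else 0.0 for a, e in zip(actual, expected)]
[/code]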