(11-12-2025, 01:35 AM)ReneZ Wrote: (11-12-2025, 01:14 AM)bi3mw Wrote: The top 20 chunk pairs
Looking at the list, I guess that this applies to start-end chunks that also have something in between, correct?
Yes, the output is line by line, with the first chunk of the first word at the beginning of the line and the last chunk in the last word at the end of the line.
(11-12-2025, 01:14 AM)bi3mw Wrote: The top 20 chunk pairs line by line
What is the result of the same analysis applied to English or Latin text (formatted as filled parags)?
All the best, --stolfi
(11-12-2025, 01:14 AM)bi3mw Wrote: The top 20 chunk pairs line by line
If you want to do statistics on lines you should have only lines in your data, not labels, circular texts, etc.: the "P" lines of the IVTFF transliterations.
Actually "Pt" lines are on the same physical line as the previous paragraph line.
In some cases free-floating words look more like labels than lines to me:
<f68r1.5,@Pb> yky
<f68r1.6,+Pb> dary
<f68r1.7,+Pb> chkchykoly
All single free-floating words should probably be removed:
<f84v.42,@Pb> okar
<f84v.43,+Pb> ydairol
<f84v.44,+Pb> ychckhy
<f84v.45,+Pb> dshedy
<f84v.46,@Pb> okchdy
<f84v.47,+Pb> solchey
<f84v.48,+Pb> dairoldy
<f84v.49,+Pb> darchy
<f84v.50,+Pb> yskhy
<f84v.51,+Pb> ochedy
(11-12-2025, 01:14 AM)bi3mw Wrote: The top 20 chunk pairs line by line
This isn't the way to do it. You need to display the ratio of observed occurrences to the number that would be expected if the start and end 'chunks' were distributed randomly. If the ratios then turn out to be close to parity, that would indicate that there is nothing significant.
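A minimal sketch of that ratio, assuming the data is already reduced to one (start chunk, end chunk) pair per line. The function name and the independence model (expected = start count × end count / number of lines) are my assumptions for illustration, not anything taken from bi3mw's script.

```python
from collections import Counter

def observed_vs_expected(pairs):
    """pairs: iterable of (start_chunk, end_chunk), one pair per line.

    Returns {(start, end): (observed, expected, ratio)}, where 'expected'
    is the count predicted if start and end chunks were independent.
    """
    pairs = list(pairs)
    n = len(pairs)
    start_counts = Counter(s for s, _ in pairs)
    end_counts = Counter(e for _, e in pairs)
    pair_counts = Counter(pairs)

    result = {}
    for (s, e), observed in pair_counts.items():
        # Independence assumption: P(start, end) = P(start) * P(end)
        expected = start_counts[s] * end_counts[e] / n
        result[(s, e)] = (observed, expected, observed / expected)
    return result
```

A ratio close to 1 means the pair occurs about as often as chance would predict. Ratios for pairs with very small expected counts are noisy and should not be over-read.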
(11-12-2025, 05:05 AM)Jorge_Stolfi Wrote: What is the result of the same analysis applied to English or Latin text (formatted as filled parags)?
All the best, --stolfi
It would certainly be interesting to see what kind of purely statistically generated chunks would result. I'll see where I can get a sufficiently large corpus.
(11-12-2025, 11:33 AM)nablator Wrote: If you want to do statistics on lines you should have only lines in your data, not labels, circular texts, etc.: the "P" lines of the IVTFF transliterations.
Actually "Pt" lines are on the same physical line as the previous paragraph line.
In some cases free-floating words look more like labels than lines to me:
<f68r1.5,@Pb> yky
<f68r1.6,+Pb> dary
<f68r1.7,+Pb> chkchykoly
All single free-floating words should probably be removed:
<f84v.42,@Pb> okar
<f84v.43,+Pb> ydairol
<f84v.44,+Pb> ychckhy
<f84v.45,+Pb> dshedy
<f84v.46,@Pb> okchdy
<f84v.47,+Pb> solchey
<f84v.48,+Pb> dairoldy
<f84v.49,+Pb> darchy
<f84v.50,+Pb> yskhy
<f84v.51,+Pb> ochedy
You're right, the result would probably be more accurate then. Maybe it would be best to only process +P0> lines?
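The filtering discussed above (keep only paragraph-type loci, drop single free-floating words) could be sketched roughly like this. The locus pattern is inferred only from the examples quoted in this thread, the '.' word separator is an assumption, and the function and parameter names are made up for illustration. "Pt" lines are not merged with the previous line here.

```python
import re

# Assumed locus shape, based on the examples in this thread: <page.num,XYY>
# X = one-character position marker (@, +, ...), YY = locus type (P0, Pb, Pt, ...).
LOCUS_RE = re.compile(
    r"^<(?P<page>[^.>]+)\.(?P<num>\d+),(?P<pos>.)(?P<type>\w\w)>\s*(?P<text>.*)$"
)

def paragraph_lines(ivtff_lines, type_prefix="P", min_words=2):
    """Yield (locus, words) for loci whose type starts with type_prefix
    and that contain at least min_words words."""
    for raw in ivtff_lines:
        m = LOCUS_RE.match(raw.strip())
        if m is None:
            continue  # comments, page headers, blank lines
        if not m.group("type").startswith(type_prefix):
            continue  # labels, circular text, etc.
        # Assumption: words are separated by '.' (or whitespace).
        words = [w for w in re.split(r"[.\s]+", m.group("text")) if w]
        if len(words) < min_words:
            continue  # single free-floating words
        yield f"{m.group('page')}.{m.group('num')}", words
```

Passing type_prefix="P0" would restrict the output to P0 loci only. Other transliteration markup (comments inside the text, uncertain readings) is not handled in this sketch.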
(11-12-2025, 02:52 PM)dashstofsk Wrote: This isn't the way to do it. You need to display the ratio of observed occurrences to the number that would be expected if the start and end 'chunks' were distributed randomly. If then it appears that there is general parity then it would indicate that there is nothing significant.
Sorry, I don't quite understand what you mean.
(11-12-2025, 06:01 PM)bi3mw Wrote: I'll see where I can get a sufficiently large corpus [of English]
Would this file do?
[attachment=12900]
H. G. Wells
War of the Worlds
60'007 words
All lowercase, no punctuation, no chapter/book titles
One blank line between parags.
Line breaks created by Linux's "fmt" with default line length 75.
Seems to me that we already know that 'normal' text does not treat lines as functional units. Perhaps examples of poetry could be included, where we can presume that the individual lines were "crafted" to some extent. How about Homer and Ovid?
(11-12-2025, 07:17 PM)R. Sale Wrote: Seems to me that we already know that 'normal' text does not treat lines as functional units.
Actually we already know that formatting prose text with the trivial line-breaking algorithm does result in different word distributions at the beginning, middle, and end of the lines. The question that remains is how much of the "LAAFU" phenomenon in the VMS can be explained by this effect.
(The trivial line-breaking algorithm is: keep writing until the next word will not fit, then break the line and continue. Good scribes and good word processors may use more sophisticated algorithms that, e.g., try to avoid breaks at "bad" places. But they too will have the effect above.)
All the best, --stolfi
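For reference, the trivial line-breaking algorithm described above can be sketched as follows. This is only an illustration of the greedy rule. It is not claimed to reproduce what "fmt" does internally, and the default width of 75 simply mirrors the figure mentioned for the sample file.

```python
def greedy_fill(words, width=75):
    """Break a list of words into lines: keep adding words until the next
    one would not fit within 'width' characters, then start a new line."""
    lines, current, length = [], [], 0
    for w in words:
        # Length of the line if w were appended (one space between words).
        needed = len(w) if not current else length + 1 + len(w)
        if current and needed > width:
            lines.append(current)          # next word does not fit: break
            current, length = [w], len(w)
        else:
            current.append(w)
            length = needed
    if current:
        lines.append(current)
    return lines
```

Each returned line is a list of words, so line-start, mid-line and line-end words can be tallied directly. A word longer than the width ends up alone on its own line.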
I don't know if this is what dashstofsk meant, but for identifying unusual line start/line end statistics, I find it helpful to 1) use each cluster's distribution in the middle of the line (excluding line start/line end) to predict its expected distribution at line start/line end, and 2) then compare it to the cluster's actual performance at line start and line end.
I also find it helpful to cut out top rows, because the special processes at play there may distort the stats, and to separate scribal sections from each other. For instance, the Herbal A folios treat line-start initial o differently compared to the Balneological and Stars sections, and this might be concealed in overall stats.
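A rough sketch of the mid-line baseline comparison described above, assuming each line is already a list of words. The "feature" argument stands in for whatever cluster is being tracked (here, by default, the word's first glyph), and all names are hypothetical.

```python
from collections import Counter

def position_ratios(lines, feature=lambda w: w[0]):
    """lines: list of word lists.

    For each feature value seen mid-line, return (start_ratio, end_ratio):
    the observed count at line start / line end divided by the count
    expected from that feature's share among mid-line words.
    """
    start, mid, end = Counter(), Counter(), Counter()
    for words in lines:
        if len(words) < 3:
            continue  # need at least one mid-line word
        start[feature(words[0])] += 1
        end[feature(words[-1])] += 1
        mid.update(feature(w) for w in words[1:-1])

    n_mid = sum(mid.values())
    n_lines = sum(start.values())
    ratios = {}
    for f, count in mid.items():
        expected = n_lines * count / n_mid  # expected count at a line edge
        ratios[f] = (start[f] / expected, end[f] / expected)
    return ratios
```

Top rows can be excluded simply by not passing them in, and each scribal section can be run separately. Features that occur only at line edges and never mid-line get no ratio in this sketch and would need separate handling.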
(11-12-2025, 07:41 PM)Jorge_Stolfi Wrote: The question that remains is how much of the "LAAFU" phenomenon on the VMS can be explained by this effect.
If indeed any (for the reasons I've already set out earlier in this post or another thread).
(11-12-2025, 07:16 PM)Jorge_Stolfi Wrote: Good scribes and good word processors may use more sophisticated algorithms
In manuscripts and printed books, line breaks don't cut syllables, usually. There may be other rules in some cases.