26-01-2020, 11:05 AM
(25-01-2020, 05:06 PM)Anton Wrote: The length of the paragraphs is a good metric. Besides the average length, I would immediately suggest to compare the RMS deviation. Are all recipes roughly of the same length? Or there are some extremely short, others extremely long?...
Hi Anton,
I agree that deviation looks very significant for this task. The problem is that it is difficult to estimate without a complete, well-structured transcription that also makes paragraph boundaries clear. For instance, the Bologna transcription provides the total number of words and the total number of recipes, but (without significant preprocessing) it does not tell us how long each recipe is. For the two parts of the BNF ms, we don't have transcriptions and things are even harder. But yes, when a well-structured transcription is available, deviation from the mean will certainly be informative.
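As a rough sketch of what I mean, assuming we already had a transcription split into paragraphs (the three paragraphs below are invented toy text, and `length_stats` is just a name I made up), the mean length and the RMS deviation could be computed like this:

```python
import math

def length_stats(paragraphs):
    """Return (mean, RMS deviation) of paragraph lengths in word tokens."""
    lengths = [len(p.split()) for p in paragraphs]
    mean = sum(lengths) / len(lengths)
    # RMS deviation from the mean, i.e. the population standard deviation.
    rms = math.sqrt(sum((n - mean) ** 2 for n in lengths) / len(lengths))
    return mean, rms

# Toy paragraphs, invented for illustration only (not from any manuscript).
paragraphs = [
    "take the root and boil it in water",
    "grind the seeds",
    "mix the powder with wine and honey and drink it warm",
]

mean, rms = length_stats(paragraphs)
print(f"mean length: {mean:.2f} words, RMS deviation: {rms:.2f} words")
```

Dividing the deviation by the mean would make the figure comparable between texts whose recipes differ a lot in typical length.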
(25-01-2020, 05:06 PM)Anton Wrote: About the structure. This is the domain which needs careful consideration. The text might have been written backwards, or each line backwards.
I believe that TTR is not significantly affected by the direction in which the text was written. MATTR 100, with a window which covers several lines, will not be much affected even by a boustrophedon.
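To make this concrete, here is a minimal sketch under my own assumptions: the tokens are invented, the window is shrunk to 4 so that the toy text is long enough, and `ttr`, `mattr` and `boustrophedon` are hypothetical helper names. Plain TTR only depends on which tokens occur, so it is identical for any reordering; MATTR is identical for a fully reversed text (each window is simply read backwards) and can only change slightly for a boustrophedon.

```python
def ttr(tokens):
    """Type-token ratio: distinct word types divided by word tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=100):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if len(tokens) < window:
        return ttr(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def boustrophedon(lines):
    """Reverse the word order of every second line (lines are lists of tokens)."""
    return [line[::-1] if i % 2 else line for i, line in enumerate(lines)]

# Toy text, invented for illustration only.
lines = [
    "aa bb cc aa".split(),
    "bb dd aa cc".split(),
    "cc aa bb ee".split(),
    "aa bb dd aa".split(),
]
tokens = [w for line in lines for w in line]
reversed_tokens = tokens[::-1]
bous_tokens = [w for line in boustrophedon(lines) for w in line]

for name, toks in [("original", tokens), ("reversed", reversed_tokens),
                   ("boustrophedon", bous_tokens)]:
    print(name, "TTR:", ttr(toks), "MATTR(4):", mattr(toks, window=4))
```

With a realistic window of 100 tokens spanning several lines, reversing alternate lines only reshuffles tokens that mostly stay inside the same window, which is why I expect the effect to be small.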
Also, if the same words appear in the same order in several paragraphs, this will still be the case if the text is reversed. If the text is a boustrophedon, about half of the matches will be preserved and a rigid structure like that of the Alfonsine Astromagia will still be apparent. A similar technique is discussed in Béchet et al., 2012 (though they work with POS-tagged data). Even if searching efficiently for such patterns appears to require complex algorithms, in our small-data domain a simple brute-force search could still be practical.
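A brute-force search of this kind could be sketched as follows. This is plain word n-gram matching, not the algorithm of Béchet et al.; `shared_ngrams` is a name I made up, and the three paragraphs are invented toy text rather than a quotation from the Astromagia.

```python
from collections import defaultdict

def shared_ngrams(paragraphs, n=3, min_paragraphs=2):
    """Map each word n-gram to the set of paragraph indices that contain it,
    keeping only n-grams found in at least `min_paragraphs` paragraphs."""
    seen = defaultdict(set)
    for idx, tokens in enumerate(paragraphs):
        for i in range(len(tokens) - n + 1):
            seen[tuple(tokens[i:i + n])].add(idx)
    return {g: ids for g, ids in seen.items() if len(ids) >= min_paragraphs}

# Toy paragraphs with a repeated formula, invented for illustration only.
paragraphs = [
    "the first image is of a man holding a sword".split(),
    "the second image is of a woman holding a mirror".split(),
    "the third image is of a lion".split(),
]

shared = shared_ngrams(paragraphs, n=3)
for gram, ids in shared.items():
    print(" ".join(gram), "-> paragraphs", sorted(ids))

# The same search on the reversed text finds the mirrored n-grams.
reversed_paragraphs = [p[::-1] for p in paragraphs]
shared_rev = shared_ngrams(reversed_paragraphs, n=3)
```

The cost grows only linearly with the number of tokens for a fixed `n`, so running it for several values of `n` on a text of a few tens of thousands of words is not a problem.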
As you recently pointed out, we have the problem of inconsistency in spelling and abbreviation: in a manuscript, the same word can take different forms. The impact of this varies a lot between manuscripts. I posted an analysis of a Latin script. In that case, about 20% of word tokens appear to be affected: a rigid structure like that of the Astromagia would still be detected, but weaker patterns would be significantly harder to recover.
Maybe I am being optimistic, but my impression is that the Latin ms I analyzed is less consistent than average. For instance, Alfonso's Astromagia is much less abbreviated and the script looks more accurate, as in most of the VMS.
While the polaiin patterns I pointed out in my previous comment might result from auto-copying (as by Timm and Schinner's algorithm or some variant of it), deeper structure could support the idea that the text is meaningful. Which other methods can be used to search for patterns and to compare Q20 and the Pharma section to other texts?