Formulaic text, "micro-context", and language statistics

Jorge_Stolfi > 12-09-2025, 02:18 PM

There are significant and obvious statistical differences between the "head" lines of parags and other "body" lines, in terms of word, character, and digraph distributions.

The most obvious is the preponderance of "puffs" (the p and f gallows) on head lines as opposed to "tikes" (the t and k gallows). There are reasons to assume that this difference, in particular, is the result of some "puffing transformation" applied by the Scribe to the head lines in order to mark them as such, which apparently was a fairly common scribal habit at the time. The occurrence of puffs in scattered words and phrases within parag bodies is presumably the result of the Scribe applying a similar transformation to them, possibly to indicate emphasis, respect, proper nouns, etc. These transformations would be loosely analogous (in spirit, not in detail or use) to our habits of capitalizing most words in titles and capitalizing proper names.

These "puffing" transformations obviously imply the replacement of some tikes by puffs, but apparently not in a trivial 1-for-1 way. It has been conjectured that each puff may stand for some combination of a tike and some adjacent glyph, like e or Ch. The actual replacement rules may be non-deterministic, or depend on local context (like our rule of not capitalizing articles and prepositions), and may not be invertible (just like when we map both "rocky mountains" and "Rocky Mountains" to "Rocky Mountains").
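To make the capitalization analogy concrete, here is a minimal Python sketch of a context-dependent, non-invertible transformation. It models only the English title-case habit mentioned above, not any actual puffing rule; the function name `title_case` and the `SMALL_WORDS` list are made up for illustration.

```python
# Hypothetical list of words left in lower case inside a title (illustrative only).
SMALL_WORDS = {"a", "an", "the", "of", "in", "on", "and"}

def title_case(phrase):
    """Capitalize each word, except small words that are not in first position."""
    out = []
    for i, word in enumerate(phrase.split()):
        if i > 0 and word.lower() in SMALL_WORDS:
            # Context-dependent rule: small words stay lower case unless first.
            out.append(word.lower())
        else:
            # Upper-case the first letter only; the rest of the word is untouched.
            out.append(word[:1].upper() + word[1:])
    return " ".join(out)

# Two different inputs collapse to the same output, so the map has no inverse.
same = title_case("rocky mountains") == title_case("Rocky Mountains")
```

Given only the output "Rocky Mountains", there is no way to tell which of the two inputs produced it. If the puffing transformation is like this, the original tikes cannot in general be recovered from the puffed text alone.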

However, it seems that significant statistical differences between head and body lines persist even after one accounts for such a puffing transformation.

It has been conjectured that these residual differences are due to the fact that the paragraphs in many (if not all) sections tend to follow a fixed formula. For instance, a herbal paragraph is expected to start with the name(s) of the plant, then include a list of conditions that can be treated with it, where the plant grows, instructions on how to prepare it, dosages, etc., generally in a certain order. As a result, the word distribution should change significantly depending on the position within the parag. And since the frequency of a glyph or glyph group is dominated by its occurrence in the most common words, these frequencies too should depend on position.

I just tested this conjecture using the transcription and translation of a Latin herbal published by Marco Ponzi (@MarcoP). It has 95 herbs, all but one with a single paragraph of text, and ~6700 Latin words total (which is ~1/5 of the VMS). AFAIK that file does not mark the original line breaks, so it was not possible to identify the head lines. As a substitute, I extracted three subsets of 12 words from each parag: one from the beginning (including the plant's name), one from around the middle, and one from the end. Here are the results, with three columns (count, frequency, item) for each subset, in that order:

Latin text:

== words ==
116 0.10329 herba    | 106 0.09439 et     | 115 0.10240 et
 71 0.06322 ad       |  38 0.03384 de     |  69 0.06144 in
 54 0.04809 accipe   |  30 0.02671 item   |  41 0.03651 nascitur
 45 0.04007 sanandum |  26 0.02315 herbe  |  21 0.01870 subito
 39 0.03473 de       |  25 0.02226 accipe |  20 0.01781 de
 37 0.03295 et       |  25 0.02226 si     |  17 0.01514 per
 33 0.02939 si       |  23 0.02048 istius |  16 0.01425 herba
 22 0.01959 ista     |  22 0.01959 ad     |  15 0.01336 est
 22 0.01959 istius   |  22 0.01959 in     |  15 0.01336 montibus
 20 0.01781 quis     |  22 0.01959 ista   |  14 0.01247 dierum

== chars ==
856 0.14142 a | 753 0.13639 e | 709 0.12101 i
676 0.11168 e | 580 0.10505 i | 620 0.10582 e
559 0.09235 i | 513 0.09292 t | 544 0.09285 t
436 0.07203 s | 501 0.09074 a | 492 0.08397 s
413 0.06823 r | 397 0.07191 u | 467 0.07971 a
362 0.05981 u | 379 0.06865 s | 452 0.07715 u
341 0.05634 m | 346 0.06267 r | 445 0.07595 r
334 0.05518 t | 313 0.05669 m | 350 0.05974 n
318 0.05254 n | 250 0.04528 n | 292 0.04984 o
261 0.04312 c | 238 0.04311 o | 259 0.04421 m

== char pairs ==
264 0.03679 a. | 212 0.03191 m. | 199 0.02850 t.
228 0.03177 m. | 199 0.02995 t. | 185 0.02650 s.
210 0.02926 er | 190 0.02860 e. | 156 0.02234 et
185 0.02578 .a | 165 0.02483 er | 151 0.02163 .e
158 0.02202 s. | 155 0.02333 .e | 147 0.02105 .s
157 0.02188 .h | 149 0.02243 et | 145 0.02077 r.
156 0.02174 e. | 134 0.02017 .s | 142 0.02034 tu
139 0.01937 rb | 128 0.01927 .i | 139 0.01991 er
139 0.01937 um | 128 0.01927 a. | 129 0.01848 ur
138 0.01923 an | 123 0.01851 s. | 125 0.01790 is

English:

== words ==
67 0.05852 for     | 74 0.06463 and  | 77 0.06725 and
66 0.05764 the     | 66 0.05764 the  | 71 0.06201 the
62 0.05415 take    | 43 0.03755 it   | 63 0.05502 it
45 0.03930 healing | 42 0.03668 this | 61 0.05328 in
44 0.03843 of      | 41 0.03581 herb | 43 0.03755 will
44 0.03843 this    | 33 0.02882 of   | 41 0.03581 be
38 0.03319 and     | 26 0.02271 will | 39 0.03406 grows
37 0.03231 herb    | 21 0.01834 also | 28 0.02445 healed
33 0.02882 a       | 21 0.01834 be   | 23 0.02009 they
29 0.02533 or      | 20 0.01747 in   | 22 0.01921 is

== chars ==
644 0.12485 e | 611 0.13309 e | 608 0.12528 e
513 0.09946 a | 418 0.09105 t | 437 0.09005 i
445 0.08627 o | 370 0.08059 a | 415 0.08551 t
419 0.08123 t | 369 0.08037 i | 388 0.07995 n
401 0.07774 i | 348 0.07580 o | 361 0.07439 a
371 0.07193 n | 324 0.07057 n | 332 0.06841 o
341 0.06611 r | 303 0.06600 h | 284 0.05852 r
317 0.06146 s | 242 0.05271 r | 272 0.05605 s
311 0.06029 h | 233 0.05075 l | 266 0.05481 d
248 0.04808 l | 231 0.05032 s | 252 0.05193 h

== char pairs ==
232 0.03681 .t | 224 0.03905 e. | 183 0.03051 e.
230 0.03649 e. | 200 0.03487 .t | 183 0.03051 s.
193 0.03062 he | 185 0.03225 th | 174 0.02901 .i
173 0.02745 s. | 183 0.03190 he | 174 0.02901 .t
157 0.02491 th | 161 0.02807 .a | 173 0.02884 d.
131 0.02078 r. | 149 0.02598 d. | 161 0.02684 he
125 0.01983 .a | 128 0.02232 s. | 160 0.02668 n.
121 0.01920 .h | 116 0.02022 .i | 150 0.02501 th
118 0.01872 or | 107 0.01865 an | 149 0.02484 in
113 0.01793 in | 99 0.01726 t. | 108 0.01801 .a

So it seems that, indeed, in a formulaic text like a herbal all three statistics can vary a lot between the start, middle, and end of parags.
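For anyone who wants to repeat this kind of test on another text, the procedure can be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not the script actually used for the tables: the tokenization (lower-case ASCII letters only), the exact placement of the "middle" window, and the padding of each word with "." on both sides before counting pairs are all guesses, although the last one agrees with the pair totals in the tables. All function names are hypothetical.

```python
import re
from collections import Counter

def position_subsets(words, k=12):
    """Return the first, middle, and last k words of one paragraph."""
    mid = max(0, (len(words) - k) // 2)  # assumed: window centered on the parag
    return {"start": words[:k], "middle": words[mid:mid + k], "end": words[-k:]}

def frequency_tables(words):
    """Count words, characters, and character pairs in a list of words."""
    pairs = Counter()
    for w in words:
        padded = "." + w + "."  # assumed: "." marks a word boundary
        pairs.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return {
        "words": Counter(words),
        "chars": Counter(ch for w in words for ch in w),
        "pairs": pairs,
    }

def top_items(counts, n=10):
    """Return (count, frequency, item) rows for the n most common items."""
    total = sum(counts.values())
    return [(c, c / total, item) for item, c in counts.most_common(n)]

def analyze(paragraphs, k=12):
    """Pool the three subsets over all paragraphs and tabulate each pool."""
    pooled = {"start": [], "middle": [], "end": []}
    for text in paragraphs:
        words = re.findall(r"[a-z]+", text.lower())  # assumed tokenization
        for name, subset in position_subsets(words, k).items():
            pooled[name].extend(subset)
    return {name: frequency_tables(ws) for name, ws in pooled.items()}

# Toy usage with k=2 on a single made-up paragraph.
tables = analyze(["herba ista nascitur in montibus"], k=2)
rows = top_items(tables["start"]["words"], n=1)
```

Note that in short parags (fewer than 3k words) the three windows overlap, so the same word can be counted in more than one subset.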

All the best, --jorge