I prepared a bunch of files with the fractional word counts per section and text type. These files list all words that would appear under any interpretation of the dubious space markers (commas, ","). See more below. The files are in the attached file st_files.zip.
The files are named "{SEC}.{TYP}.evt" and "{SEC}.{TYP}.wff".
{SEC} is a major VMS section: "hea" (Herbal A), "heb" (Herbal B), "bio", "cos", "zod", "pha", "str" (Starred Parags), and also "unk" for pages of unknown nature, such as [...] and f86v6.
{TYP} is a type of text: "parags", "labels", "trings" (text in rings), "titles" (short phrases next to parags), "radios" (radial lines in circular diagrams), and "glyphs" (isolated characters). Note that this classification is somewhat different from the one used by Rene and others; for instance, the short paragraphs in the sectors of f67r2 are here classified as "parags" too.
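As a rough sketch of the naming scheme above, the following Python lists every possible "{SEC}.{TYP}" file name. The function names and the idea of checking a folder are my own assumptions for illustration; not every section/type combination necessarily has a file in the archive.

```python
from itertools import product
from pathlib import Path

# Section and text-type codes exactly as listed in the post.
SECTIONS = ["hea", "heb", "bio", "cos", "zod", "pha", "str", "unk"]
TYPES = ["parags", "labels", "trings", "titles", "radios", "glyphs"]

def candidate_names(ext):
    """Return every possible '{SEC}.{TYP}.<ext>' name, section by section."""
    return [f"{sec}.{typ}.{ext}" for sec, typ in product(SECTIONS, TYPES)]

def existing_files(folder, ext):
    """Return the candidate files actually present in `folder` (hypothetical helper)."""
    paths = (Path(folder) / name for name in candidate_names(ext))
    return [p for p in paths if p.is_file()]
```

For example, existing_files("st_files", "wff") would list the word-count files found in a folder named "st_files", assuming the archive was unpacked there.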
The file {SEC}.{TYP}.evt contains all the lines of section {SEC} and type {TYP}, in a simplified IVTFF/EVMT format, like "<f75r.47;U> sal.okeedy". The transcription used is based on a recent one of my own, from the Beinecke 2014 scans (4162 lines, code ";U"), completed with a version derived from release "RF1b-e.txt" of Rene's IVT (1226 lines, code ";Z"). I removed all inline comments, page headers, and parag markers, and mapped figure breaks to ".". All letters were mapped to lowercase. A few common weirdos were turned into their best approximations, like Rene's "&152;" turned into d and "&222;" into y. All other weirdos were mapped to "?". All ligature braces were removed, so some information may have been lost in rare ligatures.
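A minimal sketch of reading one line of an .evt file and of the clean-up steps described above. This is not the script that produced the files: the locator pattern is inferred from the single example "<f75r.47;U> sal.okeedy", the "&NNN;" weirdo syntax is taken from the two codes quoted above, and the function names are hypothetical. Comment, header, and figure-break handling is left out because their syntax is not given here.

```python
import re

# Assumed locator layout: <page.line;code> followed by the text.
LINE_RE = re.compile(
    r"^<(?P<page>[^.;>]+)\.(?P<line>[^;>]+);(?P<code>[^>]+)>\s*(?P<text>.*)$"
)
# Assumed weirdo syntax: "&" + decimal code + ";".
WEIRDO_RE = re.compile(r"&(\d+);")
# The only two mappings stated in the post; all other codes become "?".
KNOWN_WEIRDOS = {"152": "d", "222": "y"}

def parse_evt_line(raw):
    """Split a line into (page, line, code, text); return None if it does not match."""
    m = LINE_RE.match(raw.strip())
    if m is None:
        return None
    return m.group("page"), m.group("line"), m.group("code"), m.group("text")

def normalize_text(text):
    """Lowercase, map weirdos to their approximation or '?', and drop ligature braces."""
    text = text.lower()
    text = WEIRDO_RE.sub(lambda m: KNOWN_WEIRDOS.get(m.group(1), "?"), text)
    return text.replace("{", "").replace("}", "")
```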
The file "{SEC}.{TYP}.wff" has one line "{COUNT} {WORD}" for each word type (lexeme) {WORD} that occurs in "{SEC}.{TYP}.evt". The {COUNT} is a fractional number, obtained by assuming that each comma (",") in a line of the transcription may be independently either a "word space" or "no space", with equal probabilities, in all possible combinations. For each combination, each word is counted not as 1 but as the probability of that combination.
For instance, in the line "chedy.cho,ke,or,ol.daiin.dal,dy", the words chedy and daiin are counted as 1 each, while dal, dy, and daldy have a count of 0.5 each (corresponding to the two interpretations for the comma between them). Also cho and ol have a count of 0.5 each; choke, ke, or, and orol have a count of 0.25 each; and chokeor, chokeorol, keor, and keorol have a count of 0.125 each. Note that the total count for each glyph of the input is still 1.
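The counting rule can be sketched in Python as follows. This is only an illustration, not the script used to build the files, and the function name is hypothetical. Instead of enumerating all comma combinations, it uses the fact that the commas are independent: a candidate word spanning some consecutive pieces gets weight 0.5 for every comma whose reading it fixes (the commas inside it must be "no space", the commas at its ends must be "word space"), which gives the same totals.

```python
from collections import defaultdict

def fractional_counts(text):
    """Fractional word counts for one transcription line.

    '.' is a definite word space; each ',' is a space with probability 0.5,
    independently of the others.
    """
    counts = defaultdict(float)
    for segment in text.split("."):
        # Pieces between dubious spaces; empty pieces are ignored.
        pieces = [p for p in segment.split(",") if p]
        n = len(pieces)
        for i in range(n):
            for j in range(i, n):
                # Commas whose reading is fixed by the word pieces[i..j]:
                # the j - i commas inside it, plus one on each side when
                # that side is a comma rather than a definite space.
                fixed = (j - i) + (1 if i > 0 else 0) + (1 if j < n - 1 else 0)
                counts["".join(pieces[i:j + 1])] += 0.5 ** fixed
    return dict(counts)
```

Applied to the example line, this gives 1 for chedy and daiin and 0.5 for daldy. Summing count times word length over all words gives back the number of glyphs in the line, which is a handy sanity check.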
Using these fractional counts for word-related statistics may reduce biases that may result from either treating all commas as word spaces or ignoring all commas. For instance, dubious spaces often occur after r and s, or after a word-initial y. But this is still a far from perfect solution to that problem. The Scribe himself may have improperly joined or split words, and the transcribers may have omitted many dubious spaces, or entered them as ".".
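As a starting point for such statistics, here is a small sketch that reads the "{COUNT} {WORD}" lines of a .wff file and turns the fractional counts into relative frequencies. It assumes the two fields are separated by whitespace with the count first, as described above; the function names are hypothetical.

```python
def parse_wff(lines):
    """Read '{COUNT} {WORD}' lines into a dict mapping word -> fractional count."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, word = float(fields[0]), fields[1]
        counts[word] = counts.get(word, 0.0) + count
    return counts

def relative_frequencies(counts):
    """Divide each count by the total; return an empty dict if the total is zero."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Usage, assuming the archive was unpacked into the current folder:
#     with open("hea.parags.wff") as f:
#         freqs = relative_frequencies(parse_wff(f))
```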
Please let me know if you find any errors in those files, or if you would like the (somewhat messy) scripts that I used to create them.
All the best, --stolfi