REVISITING MY OLD WORD PARADIGM
SUMMARY
This is a brushed-up version of the word structure model (aka word paradigm, word grammar) that I proposed to the old mailing list around 2000. Here I also report the results of parsing the main text of the VMS with this paradigm. See the end of this note for details on the transcription file I used.
I see that there have been several other proposed paradigms since that time. I did not have the time to analyze them or compare them with this one, sorry. In particular, I have measured only the coverage (how many VMS tokens fit this paradigm) and not the specificity (how many of the words allowed by this paradigm actually occur in the VMS).
In the following, the VMS sections are named "hea" and "heb" for Herbal-A and Herbal-B, "bio" and "zod" with the obvious meanings, "cos" for Cosmo, "pha" for Pharma, "str" for Starred Parags, and "unk" for pages of unknown nature. The latter include the bottom half of f116r, as well as f1r, f86v6, and f86v5.
For this note I am considering only running text in paragraphs -- excluding labels, titles, radial labels, and text rings. Tokens adjacent to dubious spaces ',' are counted as fractions, as I explained elsewhere.
My paradigm can be described as a stack of filters, where each filter takes the words that passed the previous filters, rejects some, and parses the others in some way.
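For concreteness, here is a minimal sketch of that filter stack in Python (my actual scripts are in GAWK and differ in detail; the names below are just illustrative). Each token carries a fractional weight, so the "gud"/"bad" figures in the tables below are sums of weights rather than plain counts:

    # Illustrative sketch only, not the actual GAWK scripts.
    # tokens: list of (word, weight) pairs; the weight is the fractional
    # count used for tokens adjacent to dubious spaces.
    # filters: list of predicates, one per level.
    def run_filters(tokens, filters):
        surviving = tokens
        for level, accept in enumerate(filters):
            gud = [(w, wt) for (w, wt) in surviving if accept(w)]
            gud_wt = sum(wt for (_, wt) in gud)
            bad_wt = sum(wt for (w, wt) in surviving if not accept(w))
            print("level %d: %12.3f gud %12.3f bad" % (level, gud_wt, bad_wt))
            surviving = gud
        return surviving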
FILTER LEVEL 0 - VALID EVA CHARACTERS
This filter says that a word is valid if and only if its EVA encoding, after mapping to lowercase and removing all fluff (ligature braces, inline comments, parag markers like "<%>" and "<$>"), uses only the letters
a c d e f h i k l m n o p q r s t y
This filter rejects as invalid any word that contains weirdo codes ("&NNN;" in Rene's notation), the unreadable glyph symbol "?", any non-alpha character, or any of the EVA letters
b g j u v x w z
(My scripts currently also accept w and z because I have been using them to denote the hooked-arm versions of p and f, respectively. I now strongly suspect that the hook is a meaningless flourish and thus that effort was wasted. Anyway, for this note, consider w and z forbidden too.)
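The level-0 test amounts to a single character-class check on the cleaned-up EVA string. A sketch (assuming the word has already been lowercased and stripped of fluff):

    import re

    # The 18 letters allowed at level 0; w and z are treated as forbidden,
    # as discussed above.
    LEVEL0 = re.compile(r'[acdefhiklmnopqrsty]+')

    def level0_ok(word):
        """True iff the cleaned-up EVA word uses only valid letters."""
        return LEVEL0.fullmatch(word) is not None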
Here are the basic statistics for this level, per section and total. "all" is the total (fractional) count of tokens in the input, "gud" are the tokens that pass this level, "bad" are those which are rejected.
all gud bad % gud sec-type
------------ ------------ ------------ ----- ----------
6210.000000 6028.000000 182.000000 97.07 bio-parags
1008.500000 974.500000 34.000000 96.63 cos-parags
7722.500000 7442.500000 280.000000 96.37 hea-parags
3360.000000 3230.000000 130.000000 96.13 heb-parags
2291.500000 2181.500000 110.000000 95.20 pha-parags
10613.500000 10318.250000 295.250000 97.22 str-parags
3001.500000 2892.500000 109.000000 96.37 unk-parags
34207.500000 33067.250000 1140.250000 96.67 tot-parags
So at least 96% of the words in the main sections ("str", "bio", "hea", "heb") use only the "valid" EVA letters above.
The vast majority of the "bad" words have "?" or weirdo codes. The largest per-section counts of rejected words that have neither of those, and were rejected only because of rare glyphs, are 51 in "hea" and 42 in "str".
FILTER LEVEL 1 - ELEMENTS
The next filter parses the valid EVA strings into the /elements/ of the word paradigm.
{q} {o} {a} {y} {d} {r} {l} {s}
{ch} {che} {sh} {she} {ee} {eee}
{k} {ke} {t} {te} {p} {pe} {f} {fe}
{ckh} {ckhh} {ckhe} {ckhhe}
{cth} {cthh} {cthe} {cthhe}
{cph} {cphh} {cphe} {cphhe}
{cfh} {cfhh} {cfhe} {cfhhe}
{n} {in} {iin} {iiin}
{m} {im} {iim} {iiim}
{ir} {iir} {iiir}
This definition of element is a bit ambiguous since 'cheeee', for instance, could be parsed as {che}{eee} or {ch}{ee}{ee}. I chose to break these ambiguous cases by excluding the e from the first element, so 'cheeee' is parsed as {ch}{ee}{ee}.
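Here is a sketch of the element parser (again illustrative Python, not the actual scripts). Each alternative corresponds to one row of the element list above, with the e-less form tried before the form that absorbs a following e, so that ambiguities are resolved as just described:

    import re

    # One regex per alternative; the e-less variant is listed before the
    # variant that takes the e suffix.
    ALTS = [r'c[ktpf]hh?', r'c[ktpf]hh?e',   # platform gallows
            r'[cs]h',      r'[cs]he',        # benches ch/sh
            r'ee',         r'eee',           # e groups
            r'[ktpf]',     r'[ktpf]e',       # plain gallows
            r'i{0,3}[nm]', r'i{1,3}r',       # final i-groups
            r'[qoaydrls]']                   # single-letter elements
    ALT_RES = [re.compile(a) for a in ALTS]

    def parse_elements(word, pos=0):
        """Return the list of elements tiling word[pos:], or None if the
        word cannot be parsed into valid elements."""
        if pos == len(word):
            return []
        for rx in ALT_RES:
            m = rx.match(word, pos)
            if m:
                rest = parse_elements(word, m.end())
                if rest is not None:
                    return [m.group()] + rest
        return None

    # e.g. parse_elements('qokeedy') == ['q', 'o', 'k', 'ee', 'd', 'y']
    #      parse_elements('cheeee') == ['ch', 'ee', 'ee']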
Here are the results of this filter, where "all" are all the tokens that passed level 0:
all gud bad % gud sec-type
------------ ------------ ------------ ----- ----------
6028.000000 5980.500000 47.500000 99.21 bio-parags
974.500000 942.500000 32.000000 96.72 cos-parags
7442.500000 7321.000000 121.500000 98.37 hea-parags
3230.000000 3176.000000 54.000000 98.33 heb-parags
2181.500000 2120.500000 61.000000 97.20 pha-parags
10318.250000 10142.250000 176.000000 98.29 str-parags
2892.500000 2856.500000 36.000000 98.76 unk-parags
33067.250000 32539.250000 528.000000 98.40 tot-parags
So at least 98% of all words in the main sections that have only the "valid" glyphs can be parsed into valid elements of the model.
There are only a dozen words rejected because of clusters like ith, ikh, etc. I did not think it was worth including those combinations among the valid elements. Maybe they should be turned into cth etc. for the statistics.
FILTER LEVEL 2 - THE OKOKO MODEL
This filter applies to the words that consist only of valid elements, parsed and marked with braces as above. We tag the elements {o} {a} {y} as "O" and all the others as "K", and then try to parse the resulting string as a sequence of zero or more "K" with at most one "O" after each "K" and an optional "O" prefix.
O? K O? K O? ... K O?
where the "?" means that the "O" may be present or not.
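A sketch of this test, building on the parse_elements() sketch above (the max_o parameter is used again further below):

    def okoko_ok(elements, max_o=1):
        """Tag the elements as O or K and test the OKOKO pattern, allowing
        up to max_o consecutive O elements in each slot."""
        tags = ''.join('O' if e in ('o', 'a', 'y') else 'K' for e in elements)
        pat = 'O{0,%d}(?:KO{0,%d})*' % (max_o, max_o)
        return re.fullmatch(pat, tags) is not None

    # e.g. okoko_ok(parse_elements('daiin')) is True   (tags "KOK")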
all gud bad % gud sec-type
------------ ------------ ------------ ----- ----------
5980.500000 5968.500000 12.000000 99.80 bio-parags
942.500000 922.500000 20.000000 97.88 cos-parags
7321.000000 7174.250000 146.750000 98.00 hea-parags
3176.000000 3143.500000 32.500000 98.98 heb-parags
2120.500000 2067.500000 53.000000 97.50 pha-parags
10142.250000 9994.750000 147.500000 98.55 str-parags
2856.500000 2820.000000 36.500000 98.72 unk-parags
32539.250000 32091.000000 448.250000 98.62 tot-parags
Thus, in the main sections, at least 98% of all words that consist of valid elements also fit this "OKOKO" model.
Many of the rejected words are rejected because of two or more "O" elements in a row. If we allow up to two "O" in each slot, the acceptance becomes almost total:
all gud bad % gud sec-type
------------ ------------ ------------ ----- ----------
5980.500000 5980.500000 0.000000 100.00 bio-parags
942.500000 941.500000 1.000000 99.89 cos-parags
7321.000000 7319.000000 2.000000 99.97 hea-parags
3176.000000 3176.000000 0.000000 100.00 heb-parags
2120.500000 2120.500000 0.000000 100.00 pha-parags
10142.250000 10141.250000 1.000000 99.99 str-parags
2856.500000 2856.500000 0.000000 100.00 unk-parags
32539.250000 32535.250000 4.000000 99.99 tot-parags
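In the okoko_ok() sketch above, this relaxed variant corresponds to calling it with max_o=2 instead of the default 1.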
FILTER LEVEL 3 - LAYER MODEL
This level considers the strings that passed the OKOKO criterion, parsed into elements and marked with braces as per level 1. All "O" elements are ignored, and the "K" elements are instead tagged with the specific classes

  "Q" = { @q }
  "D" = { @d, @l, @r, @s } (the "dealers")
  "X" = { @ch, @sh, @ee } (the "benches"), with optional @e suffix
  "H" = all gallows, with optional platform and @e suffix
  "N" = { @n, @m } after zero or more @i, or @r after one or more @i
The resulting string of classes is then parsed to fit the pattern
Q^q D^d X^x H^h X^y D^e N^n
where q,h,n may be 0 or 1, and d+e and x+y may be 0 to 3. Note that many potential sequences are invalid, e.g. words with two gallows, with four dealers or four benches, or with a "D" between two "X" or between an "X" and a gallows, etc. And of course "Q" can only occur at the beginning, and "N" at the end.
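A sketch of this level, on top of parse_elements() above (illustrative only). The class string is matched against a regular expression that encodes the layer order, and the d+e and x+y limits are checked separately:

    def classify(elem):
        """Map one non-O element to its layer class."""
        if elem == 'q':                                    return 'Q'
        if elem in ('d', 'l', 'r', 's'):                   return 'D'
        if re.fullmatch(r'(?:[cs]h|ee)e?', elem):          return 'X'
        if re.fullmatch(r'(?:c[ktpf]hh?|[ktpf])e?', elem): return 'H'
        return 'N'   # {n} {m} and the i-groups

    def layer_ok(elements):
        """Test the Q^q D^d X^x H^h X^y D^e N^n pattern."""
        classes = ''.join(classify(e) for e in elements
                          if e not in ('o', 'a', 'y'))
        if not re.fullmatch(r'Q?D*X*H?X*D*N?', classes):
            return False
        # d+e and x+y may each be at most 3
        return classes.count('D') <= 3 and classes.count('X') <= 3

    # e.g. layer_ok(parse_elements('daiin'))  is True   (classes "DN")
    #      layer_ok(parse_elements('polshy')) is False  (the HD*X case below)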
all gud bad % gud sec-type
------------ ------------ ------------ ----- ----------
5968.500000 5830.500000 138.000000 97.69 bio-parags
922.500000 894.750000 27.750000 96.99 cos-parags
7174.250000 6945.250000 229.000000 96.81 hea-parags
3143.500000 3026.500000 117.000000 96.28 heb-parags
2067.500000 2004.750000 62.750000 96.96 pha-parags
9994.750000 9565.250000 429.500000 95.70 str-parags
2820.000000 2707.875000 112.125000 96.02 unk-parags
32091.000000 30974.875000 1116.125000 96.52 tot-parags
Thus at least 96% of the words that fit the OKOKO model also fit this layer model.
As noted before, many rejected words seem to be pairs of more or less common words run together. Here is a sample of the rejected words (the "*" marks the point(s) where parsing failed):
pattern ! word
--------+----------------------------
HD*X | {p}{o}{l}*{sh}{y}
HD*XD | {p}{o}{l}*{che}{d}{y}
XD*HN | {che}{o}{l}*{k}{a}{in}
XD*X | {che}{d}*{che}{y}
QHD*XD | {q}{o}{k}{o}{l}*{che}{d}{y}
HD*XD | {t}{o}{l}*{che}{d}{y}
XD*HD | {che}{d}{y}*{k}{a}{r}
HDD*XD | {p}{d}{a}{l}*{sh}{o}{r}
H*HN | {p}{o}*{k}{a}{in}
XD*H*X | {ch}{o}{l}*{t}*{eee}{y}
XD*XD | {che}{o}{l}*{ch}{d}{y}
HD*X | {o}{te}{d}*{ee}{y}
HD*XDD | {p}{o}{l}*{sh}{d}{a}{l}
HD*XDD | {f}{s}*{che}{d}{a}{l}
HX*HXD | {t}{she}{o}*{k}{ee}{d}{y}
The words chedy, kain, shor etc are fairly common on their own.
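One way to check the "two words run together" impression is to test whether a rejected word can be cut into two pieces that each pass all the levels. A sketch, reusing the functions from the previous sketches (illustrative only):

    def passes_all(word):
        """True iff the word passes levels 1-3 (level 0 is assumed)."""
        elems = parse_elements(word)
        return elems is not None and okoko_ok(elems) and layer_ok(elems)

    def two_word_splits(word):
        """All ways of cutting the word into two fully valid words."""
        return [(word[:i], word[i:]) for i in range(1, len(word))
                if passes_all(word[:i]) and passes_all(word[i:])]

    # e.g. two_word_splits('cheolkain') includes ('cheol', 'kain')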
CONCLUSION
In summary, of the 34207.5 "parags" tokens in the transcription file, 30974.875 passed through all the filters. The overall coverage is thus 30974.875/34207.5 = about 90.5%. If we consider only the tokens that passed level 0 (valid EVA characters), the coverage is 30974.875/33067.25 = about 93.7%.
TRANSCRIPTION AND SECTIONS
The statistics above used a VMS transcription which is 80% new readings of mine from the BL 2014 scans (code ";U"), with the remaining 20% taken from Rene's IVT file (code ";Z"). A large part of my readings was compared with Rene's, and the discrepancies were double-checked against the images.
I intend to release this transcription to be included in Rene's files, but it currently uses a somewhat incompatible format that I found more convenient while building it. Anyway, most of the differences are a/o or r/s readings of ambiguous glyphs. And I also make more liberal use of ',' for uncertain word spaces.
If anyone is interested, I can provide this transcription and/or the GAWK scripts that implement this word model. But you may want to use your own favorite transcription and implement the filter on your own, to better suit your needs.
All the best, --stolfi