Labyrinthinesecurity > Yesterday, 03:36 PM
(11-05-2026, 06:12 PM)Mauro Wrote: I'm not much into Python, so I cannot check the code, I'm sorry.
I haven't yet understood which are the final 3 slot grammars being compared: Model A, B and C. Can you post their structure? I.e. Slot1 = ['q', 'ch', 'sh'], Slot2 = [..], something like that?
Jorge_Stolfi > Yesterday, 09:58 PM
(Yesterday, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure:
Slot1 = ...
Labyrinthinesecurity > Yesterday, 10:09 PM
Jorge_Stolfi Wrote:
Labyrinthinesecurity Wrote:
All 3 models share the same structure:
Slot1 = ...
My old "crust-mantle-core" model, discussed starting at [link not shown], seems to be more demanding than that one, except perhaps for the placement of [aoy].
Here is my current preferred version. It is described as a sequence of four parsing steps/levels for convenience, but can be encoded as a single loop-free finite automaton, although some of the "total count" constraints in the last level are more compactly described by an algorithm.
CLEANUP pass: The character m is considered an abbreviation for in. The combination ir is assumed to be a scribal error for iin. The characters b g u are malformed versions of other characters, possibly n m an. The glyphs Ih ITh IKh etc. are malformed versions of Ch CTh CKh etc. A doubled hh is a malformed version of he. The glyphs Cs and sh are just variant forms of Sh. The abbreviations should be expanded, and the erroneous characters and combinations should be mapped to the most likely correct ones, before applying the following levels of parsing.
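For anyone who wants to try the CLEANUP pass on a transcription, here is a minimal Python sketch. It covers only the substitutions stated outright above (ir, m, hh) and assumes lowercase EVA input; the uncertain mappings for b g u and the Ih/ITh/IKh forms are deliberately left out. The names CLEANUP_RULES and cleanup are made up for this illustration.

```python
# Only the substitutions stated without hedging are included here.
# Assumption: words are lowercase EVA strings, so the Cs/Sh variants
# are already written "sh" and need no rule.
CLEANUP_RULES = [
    ("ir", "iin"),  # assumed scribal error for iin
    ("m", "in"),    # m treated as an abbreviation for in
    ("hh", "he"),   # doubled h treated as a malformed he
]

def cleanup(word):
    """Apply the substitutions, in order, to one lowercase EVA word."""
    for old, new in CLEANUP_RULES:
        word = word.replace(old, new)
    return word

if __name__ == "__main__":
    for w in ["daim", "dair", "chhy", "chedy"]:
        print(w, "->", cleanup(w))
```

The rules run in a fixed order, so "im" ends up as "iin" through the m rule alone.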
ELEM level: A cleaned word of the VMS (in the EVA encoding) passes this level if it can be parsed into a string of elements drawn from the following sets:
- "Q" just {q}.
- "O" {a} {o} {y}.
- "D" the /dealers/ {d} {l} {r} {s}.
- "X" the /benches/ {ch} {sh} {ee} with an optional 'e' suffix.
- "G" the /simple gallows/ {k} {t} {p} {f} with optional 'e' suffix.
- "H" the /platform gallows/ {cth} {ckh} {cph} {cfh} with optional 'e' or 'h' suffix.
- "N" the /codas/ {n}, {in}, {iin}, {iiin}.
Note that e and i are not valid elements per se. An e can occur only after a bench or a gallows, or as part of an element {ee} or {eee}. An i can only appear as part of a coda element {in}, {iin} or {iiin}. Recall that ir and im are replaced in the CLEANUP pass.
Thus, for instance
ockhechdy (oCKheChdy) is parsed as {o}{ckhe}{ch}{d}{y}
qokaiin is parsed as {q}{o}{k}{a}{iin}
chedy is parsed as {che}{d}{y}
cheedy is parsed as {ch}{ee}{d}{y}
cheeedy is parsed as {che}{ee}{d}{y}
cheeeedy is parsed as {ch}{ee}{ee}{d}{y}
Note that the parsing is ambiguous if a word has three or more e in a row. So cheeedy could also be {ch}{eee}{d}{y}, and cheeeedy could also be parsed as {che}{eee}{d}{y}. The choices above are arbitrary, and have limited implications.
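The ELEM level and its ambiguity can be checked with a small backtracking parser. This is only a sketch: it takes the "O" set to be {a} {o} {y}, as the worked examples suggest, and it returns every possible parse rather than picking one. The names BASES, SUFFIXES, ELEMENTS and all_parses are invented here.

```python
from functools import lru_cache

# Element classes as listed in the post. The "O" set is an assumption
# inferred from the worked examples.
BASES = {
    "Q": ["q"],
    "O": ["a", "o", "y"],
    "D": ["d", "l", "r", "s"],
    "X": ["ch", "sh", "ee"],
    "G": ["k", "t", "p", "f"],
    "H": ["cth", "ckh", "cph", "cfh"],
    "N": ["n", "in", "iin", "iiin"],
}
# Optional suffixes per class; classes not listed take no suffix.
SUFFIXES = {"X": ["", "e"], "G": ["", "e"], "H": ["", "e", "h"]}

# Map every concrete element string to its class letter.
ELEMENTS = {}
for cls, bases in BASES.items():
    for base in bases:
        for suffix in SUFFIXES.get(cls, [""]):
            ELEMENTS[base + suffix] = cls

def all_parses(word):
    """Return every split of `word` into elements, as a list of tuples."""
    @lru_cache(maxsize=None)
    def rec(i):
        if i == len(word):
            return [()]
        found = []
        for elem in ELEMENTS:
            if word.startswith(elem, i):
                found.extend((elem,) + rest for rest in rec(i + len(elem)))
        return found
    return rec(0)

if __name__ == "__main__":
    for w in ["ockhechdy", "qokaiin", "chedy", "cheedy", "cheeedy", "cheeeedy"]:
        print(w, all_parses(w))
```

A word passes the ELEM level when all_parses returns a non-empty list; words with three or more e in a row come back with more than one parse.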
OKOKO level: Let K be the set of all elements that are not in the set O, namely K = Q ∪ D ∪ X ∪ G ∪ H ∪ N. A cleaned word that passed the ELEM level also passes this level if it consists of zero or more K elements with at most two O elements inserted before the first K, between every two consecutive Ks, and after the last K.
Thus, for example, {o}{y} passes this level, {o}{a}{ch}{sh}{o}{r}{o}{a}{d}{o}{y} passes (with pattern OOKKOKOOKOO), whereas {ch}{o}{a}{y}{d}{y} does not (three Os in a row).
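The OKOKO condition amounts to "no run of three or more O elements", which is easy to test once a word is already split into elements. A minimal sketch, assuming O = {a, o, y} as in the examples; O_ELEMENTS, okoko_pattern and passes_okoko are names made up for this illustration.

```python
# Assumption: the O set is {a, o, y}, as in the examples in the post.
O_ELEMENTS = {"a", "o", "y"}

def okoko_pattern(elements):
    """Map a list of elements to a string of O and K letters."""
    return "".join("O" if e in O_ELEMENTS else "K" for e in elements)

def passes_okoko(elements):
    """True if every slot around the K elements holds at most two Os."""
    return "OOO" not in okoko_pattern(elements)

if __name__ == "__main__":
    good = ["o", "a", "ch", "sh", "o", "r", "o", "a", "d", "o", "y"]
    bad = ["ch", "o", "a", "y", "d", "y"]
    print(okoko_pattern(good), passes_okoko(good))
    print(okoko_pattern(bad), passes_okoko(bad))
```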
CMC level: A cleaned word that passes the ELEM and OKOKO levels will pass the crust-mantle-core (CMC) level if, after deleting all the O elements, it has the form
Q^q D^d X^x G^g H^h X^y D^e N^n
where
- q,g,h,n may be 0 or 1;
- g+h at most 1 (there can be at most one gallows per word);
- q+d+e+n <= 3 (there can be at most three of Q, D, and N);
- x+h+y <= 2 (there can be at most two benches, counting a platform gallows as one bench).
Without these sum constraints, the three parsing levels can be realized as a compact finite automaton that can be drawn on a single page. With the sum constraints, the automaton is about 3x bigger because each state must be unfolded into three states in order to record the three counts in the part already parsed.
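Instead of drawing the automaton, the CMC shape and the sum constraints can be checked with a regular expression over the class letters plus two counts. This is a sketch under assumptions, not the automaton itself: the O set is taken to be {a, o, y}, the input is a word already split into elements, and element_class, CMC_SHAPE and passes_cmc are hypothetical names.

```python
import re

def element_class(elem):
    """Return the class letter (Q, O, D, X, G, H or N) of one element."""
    if elem == "q":
        return "Q"
    if elem in {"a", "o", "y"}:  # assumed O set
        return "O"
    if elem in {"d", "l", "r", "s"}:
        return "D"
    if elem in {"n", "in", "iin", "iiin"}:
        return "N"
    if elem.startswith(("cth", "ckh", "cph", "cfh")):
        return "H"
    if elem.startswith(("ch", "sh", "ee")):
        return "X"
    if elem[:1] in ("k", "t", "p", "f"):
        return "G"
    raise ValueError(f"not a known element: {elem!r}")

# Q^q D^d X^x G^g H^h X^y D^e N^n with q, g, h, n in {0, 1} and g+h <= 1.
CMC_SHAPE = re.compile(r"Q?D*X*[GH]?X*D*N?")

def passes_cmc(elements):
    """Check the CMC shape and the two sum constraints on the non-O elements."""
    core = "".join(c for c in map(element_class, elements) if c != "O")
    if not CMC_SHAPE.fullmatch(core):
        return False
    edge = sum(core.count(c) for c in "QDN")   # q + d + e + n
    bench = sum(core.count(c) for c in "XH")   # x + h + y
    return edge <= 3 and bench <= 2

if __name__ == "__main__":
    print(passes_cmc(["q", "o", "k", "a", "iin"]))
    print(passes_cmc(["o", "ckhe", "ch", "d", "y"]))
```

The sums do not depend on how the regular expression splits the two X runs or the two D runs, so counting letters in the whole string is enough.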
The numbers vary depending on the section and transcription version used, but it seems that, after the CLEAN step, about 95% of the tokens pass the other three levels.
There probably are further rules relating the insertion of the "O"s in the CMC pattern. For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O". Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc elements. This and other refinements remain to be explored.
But the main question is how this word model compares to those you are using.
All the best, --stolfi
Jorge_Stolfi > Yesterday, 10:34 PM
ReneZ > 11 hours ago
(Yesterday, 03:36 PM)Labyrinthinesecurity Wrote: All 3 models share the same structure:
Slot1 = ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', '']
Slot2 = ['e', '']
Slot3 = ['e', '']
Slot4 = ['o', 'a', '']
Slot5 = ['i', 'ii', 'iii', 'iiii', '']
Slot6 = ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', '']
Slot7 = ['y', '']
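To see which words such a slot structure accepts, one option is to turn the seven lists into a single regular expression, with each slot as an optional group. This is only an illustrative sketch and not the code used for Models A, B and C; slot_regex, SLOT_RE and fits_slots are hypothetical names.

```python
import re

# The seven slots exactly as quoted above; '' means the slot may be empty.
SLOTS = [
    ['q', 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh', ''],
    ['e', ''],
    ['e', ''],
    ['o', 'a', ''],
    ['i', 'ii', 'iii', 'iiii', ''],
    ['l', 'r', 'd', 'n', 'm', 's', 't', 'k', 'p', 'f', ''],
    ['y', ''],
]

def slot_regex(slots):
    """Build one regex: a group per slot, optional when '' is allowed."""
    parts = []
    for options in slots:
        group = "(?:" + "|".join(re.escape(o) for o in options if o) + ")"
        parts.append(group + ("?" if "" in options else ""))
    return re.compile("".join(parts))

SLOT_RE = slot_regex(SLOTS)

def fits_slots(word):
    """True if `word` is one choice from each slot, taken in order."""
    return SLOT_RE.fullmatch(word) is not None

if __name__ == "__main__":
    for w in ["chedy", "aiin", "qokaiin"]:
        print(w, fits_slots(w))
```

With these lists, chedy and aiin fit, while qokaiin does not, because k sits in Slot6 and nothing but y can follow it.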