A family of grammars for Voynichese - Printable Version

+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: A family of grammars for Voynichese (/thread-4418.html)

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

Finally, here are the counts for crust-mantle-core patterns with a core (gallows). Here I make a distinction between simple gallows and those with platforms. Again, the CMC pattern of a word is obtained by deleting all "O" elements {a} {o} {y}, and mapping the other elements to the classes

In this modified CMC model, a valid word with core ("G" or "H") must have the form

  Q^q D^d X^x (G|H) X^y D^e N^n

where q,n are 0 or 1, and d+e and x+y are in 0..3. Here are the counts of the patterns with G core (left) and with H core (right):

COUNTING TOKENS WITH G AND H CORE BY CMC PATTERN

14060.375000 1.00000 TOTAL    1487.250000 1.00000 TOTAL
 2008.125000 0.14282 GD        394.750000 0.26542 XH
 1538.750000 0.10944 GXD       375.250000 0.25231 HD
 1314.500000 0.09349 QGD       315.000000 0.21180 H
 1175.375000 0.08359 GN         76.750000 0.05161 XHD
 1025.000000 0.07290 GX         58.500000 0.03933 QH
  858.500000 0.06106 QGXD       55.000000 0.03698 HN
  813.875000 0.05788 QGN        41.000000 0.02757 QHD
  687.000000 0.04886 QGX        33.000000 0.02219 HX
  483.000000 0.03435 G          31.750000 0.02135 HDD
  417.625000 0.02970 XG         21.000000 0.01412 DXH
  410.750000 0.02921 QG         11.250000 0.00756 HDN
  274.625000 0.01953 XGD        10.000000 0.00672 DH
  255.500000 0.01817 GDD         8.000000 0.00538 HXD
  244.750000 0.01741 DGD         6.500000 0.00437 XXH
  240.875000 0.01713 DGXD        6.000000 0.00403 DHD
  234.625000 0.01669 DGN         5.500000 0.00370 HDDD
  201.000000 0.01430 XGX         5.000000 0.00336 XHN
  191.500000 0.01362 XGN         4.500000 0.00303 XHX
  189.000000 0.01344 GDN         4.000000 0.00269 QHX
  174.125000 0.01238 DGX         3.250000 0.00219 HDDN
  170.125000 0.01210 GXDD        3.000000 0.00202 QXH
  111.750000 0.00795 GXDN        2.000000 0.00134 DDH
  102.687500 0.00730 DG          2.000000 0.00134 QDH
  101.625000 0.00723 XGXD        2.000000 0.00134 QHDD
   75.000000 0.00533 XXG         2.000000 0.00134 QHN
   69.000000 0.00491 GXX         2.000000 0.00134 QHXD
   67.250000 0.00478 QGDD        1.500000 0.00101 DDXH
   62.000000 0.00441 GXN         1.000000 0.00067 DXHD
   39.500000 0.00281 GXXD        1.000000 0.00067 DXXH
   34.500000 0.00245 QGDN        1.000000 0.00067 DXXHD
   33.000000 0.00235 QGXDD       1.000000 0.00067 HXDD
   27.875000 0.00198 GDDN        1.000000 0.00067 HXN
   18.500000 0.00132 XGDD        1.000000 0.00067 XXHD
   18.375000 0.00131 DDGXD       0.250000 0.00017 XHDD
   16.875000 0.00120 DDGD        0.250000 0.00017 XHDDD
   16.875000 0.00120 DXG         0.250000 0.00017 XHDN
   16.750000 0.00119 QGXDN
   16.500000 0.00117 QGXN
   16.500000 0.00117 QGXX
   16.500000 0.00117 XXGX
   16.125000 0.00115 GDDD
   15.500000 0.00110 DDGN
   14.125000 0.00100 DGDD
   13.500000 0.00096 GXDDD
   13.000000 0.00092 QDGXD
   12.250000 0.00087 DDGX
   12.000000 0.00085 XGDN
   11.750000 0.00084 DGXX
   10.750000 0.00076 XXGD
    9.750000 0.00069 XXGN
    9.500000 0.00068 QGXXD
    9.250000 0.00066 DGDN
    9.125000 0.00065 DGXDD
    8.562500 0.00061 DDG
    8.500000 0.00060 QDGX
    8.000000 0.00057 QDGN
    7.500000 0.00053 DGXDN
    7.500000 0.00053 XGXX
    7.250000 0.00052 QDG
    7.000000 0.00050 XGXN
    6.750000 0.00048 DXGX
    6.500000 0.00046 DGXXD
    4.750000 0.00034 GXDDN
    4.625000 0.00033 DXGD
    4.500000 0.00032 DXGXD
    4.125000 0.00029 QDGD
    4.000000 0.00028 QGDDD
    3.750000 0.00027 DXGN
    3.000000 0.00021 DGXN
    3.000000 0.00021 DXXG
    3.000000 0.00021 XGXDD
    2.500000 0.00018 DXXGD
    2.000000 0.00014 GXXN
    2.000000 0.00014 QGDDN
    2.000000 0.00014 QGXXN
    1.625000 0.00012 DGDDN
    1.500000 0.00011 DDXXG
    1.500000 0.00011 QDDGXD
    1.500000 0.00011 QXG
    1.500000 0.00011 XXGXD
    1.250000 0.00009 XGXDN
    1.000000 0.00007 DXXGX
    1.000000 0.00007 GXXDN
    1.000000 0.00007 QGXDDD
    1.000000 0.00007 QXGN
    1.000000 0.00007 QXGXD
    1.000000 0.00007 QXXG
    1.000000 0.00007 XGDDN
    1.000000 0.00007 XGXDDN
    1.000000 0.00007 XGXXD
    0.500000 0.00004 DDDGN
    0.500000 0.00004 DDXGX
    0.500000 0.00004 GXXDD
    0.500000 0.00004 GXXDDD
    0.500000 0.00004 GXXX
    0.500000 0.00004 GXXXD
    0.500000 0.00004 QGXXDD
    0.500000 0.00004 XGDDD
    0.500000 0.00004 XXGXN
    0.250000 0.00002 DGXXDD
    0.250000 0.00002 GDDDN
    0.250000 0.00002 XXGDN
    0.125000 0.00001 DDGXDN

I don't know yet what to conclude from these numbers. For either class of core, the formula above allows 2 x 10 x 10 x 2 = 400 possible patterns, but only 103 "G" patterns occur in the parags text, and only 36 "H" patterns, even with fractional counting. Obviously there are some combinations of q,d,x,y,e,n that are so rare that they could be excluded from the CMC model; but I don't see a simple rule yet.
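
The 2 x 10 x 10 x 2 = 400 figure can be checked by brute-force enumeration. The following is only a minimal sketch: the function name `cmc_patterns` is made up for illustration, and the constants 103 and 36 are simply the observed pattern counts quoted above, not something the code derives from the text.

```python
from itertools import product

def cmc_patterns(core):
    """Enumerate every string Q^q D^d X^x core X^y D^e N^n
    with q,n in {0,1}, d+e <= 3 and x+y <= 3 (hypothetical helper)."""
    patterns = []
    for q, n in product((0, 1), repeat=2):
        for d, e in product(range(4), repeat=2):
            if d + e > 3:
                continue
            for x, y in product(range(4), repeat=2):
                if x + y > 3:
                    continue
                patterns.append(
                    "Q" * q + "D" * d + "X" * x + core
                    + "X" * y + "D" * e + "N" * n
                )
    return patterns

g_patterns = cmc_patterns("G")
h_patterns = cmc_patterns("H")

# Counts of distinct patterns allowed by the formula for each core class.
print(len(set(g_patterns)), len(set(h_patterns)))

# Observed pattern counts quoted in the post (not computed here).
OBSERVED = {"G": 103, "H": 36}
for core, seen in OBSERVED.items():
    allowed = len(set(cmc_patterns(core)))
    print(core, seen, "of", allowed, "allowed patterns occur")
```

Since the core letter sits in the middle of every pattern, each choice of (q,d,x,y,e,n) gives a distinct string, so the number of distinct patterns equals the number of parameter combinations.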

One notable thing is that words with H core are not only ~1/10 as common as those with G core, but the distribution of H patterns decays significantly faster. In particular, there is a large drop from the first three patterns to the fourth one in the above list.

The most common "G" pattern with three benches "X" is XXGX, which occurs 16.5 times (~0.1% of all "G" patterns). There are no "H" patterns with three benches. That may be just a consequence of "H" patterns being less common, but it is also consistent with the theory that an "H" element should be counted as one bench for the rule x+y <= 3.

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

I think I can narrow the core-mantle-crust model a bit more. Recall that the model is

  Q^q D^d X^x G^g H^h X^y D^e N^n

where q,g,h,n are 0 or 1, d+e is at most 3, x+y is at most 3, and g+h is at most 1. (The last condition says simply that there may be at most one gallows, simple or platform.)

The range of d+e seems to be indeed 0..3:

COUNTING CMC PATTERNS BY NUMBER OF D

31300.250000 1.00000 TOTAL
 9705.125000 0.31007 0
16944.000000 0.54134 1
 4153.562500 0.13270 2
  497.562500 0.01590 3

However, the range of q+d+e+n also seems to be 0..3. That is, the rule should be q+d+e+n <= 3, not just d+e <= 3:

COUNTING CMC PATTERNS BY NUMBER OF QDN

31300.250000 1.00000 TOTAL
 4937.125000 0.15773 0
15160.625000 0.48436 1
10097.625000 0.32261 2
 1071.812500 0.03424 3
   33.062500 0.00106 4

And, moreover, the condition on the number of benches should be x+h+y <= 2, not just x+y <= 3. That is, there could be at most two benches, counting a platform gallows as one bench:

COUNTING CMC PATTERNS BY NUMBER OF XH

31300.250000 1.00000 TOTAL
15443.250000 0.49339 0
13730.125000 0.43866 1
 2080.875000 0.06648 2
   46.000000 0.00147 3

I am adding these revised rules to the model.
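
For readers who want to try the revised conditions on the patterns listed earlier, here is a minimal sketch of a checker. It assumes only the three conditions stated in this post (g+h <= 1, q+d+e+n <= 3, x+h+y <= 2); the function name `fits_revised_cmc` and the regular expression are illustrative choices, not the code used to produce the counts above.

```python
import re

# One optional Q, then D's, X's, at most one G and one H, X's, D's, optional N.
_CMC_SHAPE = re.compile(r"(Q?)(D*)(X*)(G?)(H?)(X*)(D*)(N?)")

def fits_revised_cmc(pattern):
    """Return True if a CMC pattern string (the "O" elements already deleted)
    satisfies the revised rules as stated in the post (hypothetical helper)."""
    m = _CMC_SHAPE.fullmatch(pattern)
    if m is None:
        return False  # wrong order of classes, or more than one Q/G/H/N
    q, d, x, g, h, y, e, n = (len(part) for part in m.groups())
    at_most_one_gallows = g + h <= 1
    crust_limit = q + d + e + n <= 3   # replaces d+e <= 3
    bench_limit = x + h + y <= 2       # replaces x+y <= 3; H counts as a bench
    return at_most_one_gallows and crust_limit and bench_limit

for pat in ["GD", "XH", "XXGX", "XHX", "QDDGDN", "DGXD"]:
    print(pat, fits_revised_cmc(pat))
```

Under these assumptions, patterns such as XXGX and XHX from the earlier table are rejected by the bench limit, while the most common patterns such as GD and XH are accepted.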

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

In conclusion, the revised crust-mantle-core (CMC) model says that a word that fits the OKOKO model is valid if, after deleting the "O"s, it has the form

  Q^q D^d X^x G^g H^h X^y D^e N^n

where

With these tighter rules, the statistics for the CMC level are

         all          gud          bad % gud   sec-type
------------ ------------ ------------ ----- ----------
 5966.750000  5824.687500   142.062500 97.62 bio-parags
  935.500000   904.750000    30.750000 96.71 cos-parags
 7277.500000  7035.625000   241.875000 96.68 hea-parags
 3157.375000  3034.875000   122.500000 96.12 heb-parags
 2100.750000  2034.625000    66.125000 96.85 pha-parags
10095.250000  9665.500000   429.750000 95.74 str-parags
 2832.375000  2721.125000   111.250000 96.07 unk-parags
32365.500000 31221.187500  1144.312500 96.46 tot-parags

That is, more than 96% of the words that fit the OKOKO model fit the CMC pattern above.

We also have that of the 33067.25 words that contain only valid EVA characters (no weirdos, "?", or the rare characters b g j u v w x z), 94.4% satisfy the rest of the model (parsing into elements, OKOKO structure, and CMC structure).

Maybe the revised model is too tight, and some fraction of the 5.6% rejected words are in fact valid. But the number seems compatible with the theory that those 5.6% are indeed errors -- by the Author, by the Scribe, by the Retracers, and by the transcribers -- especially words run together.

There probably are further rules relating the insertion of the "O"s in the CMC pattern. For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O". Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc. elements.

All the best, --stolfi

RE: A family of grammars for Voynichese - oshfdk - 29-12-2025

5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.

Sorry, all the numbers above are token counts, not word type counts. I presume that the rejection rate for word types would be much higher, since many rejected words occur only once or a few times. Unless we define the lexicon as being the word types that occur at least N times, where N is, say, 10. I will try to do this statistic. Please stay tuned...

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements. However, there are 70 words (tokens) with those combinations in the whole parags text, out of more than 30'000 tokens. Is that significant? Contrast that with 1622 occurrences of {cth} {ckh} {cph} {cfh}. Maybe I should accept those 'i' variants too (and treat them too as platform gallows, CMC class "H").

On the other hand, there are only 4 occurrences of {ih}; contrast with 9362 of {ch}. This may be a clue that {ith} {ikh} {iph} {ifh} are instances of {cth} {ckh} {cph} {cfh} that were mangled by the Scribe/Retracer/Transcriber. So maybe we should just "error-correct" them as such.

All the best, --stolfi

RE: A family of grammars for Voynichese - Grove - 29-12-2025

Don’t e’s only precede d’s, o’s, or a’s, and a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from e to i series, although there are cases of o’s followed by i series without the a transition.

I feel like the e ee eee series preceding a d acts in a similar way to the i ii iii that precedes the rln variants.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens?

Here are the numbers for word types (as opposed to tokens).

For these statistics I defined the lexicon of a text as being the set of words that occur at least 3 times. The column "vlex" is the number of word types in each subset that occur at least 3 times and use only "valid" EVA chars (excluding '?', weirdos, and [bgjuvwxz]). The column "vgud" is the number of those words that fit the crust-mantle-core (CMC) model as defined before. "vbad" is the number that violate that model, and %gud is the percentage of "vgud" over "vlex".

LEXICON SIZES WITH VALID CMC STRUCTURE

sec-type     vlex  vgud  vbad   %gud
----------- ----- ----- ----- ------
bio-parags    295   295     0 100.00
cos-parags     56    56     0 100.00
hea-parags    407   404     3  99.26
heb-parags    210   209     1  99.52
pha-parags    150   149     1  99.33
str-parags    546   541     5  99.08
unk-parags    205   205     0 100.00
tot-parags   1361  1328    33  97.58

That is, 97.58% of the word types that use only valid EVA characters fit the CMC model.

The "vbad" number for "tot-parags" (33) is bigger than the sum of the section numbers (10) because a word that occurs (say) once in one section and 2 times in another will not be in the lexicons of those sections, but will be in the lexicon of the whole parags text.

These are the 33 word types (with at least 3 occurrences) that do not fit the CMC model, and their occurrence counts in the total parags text:

7.0 daiidy     Should be valid?
7.0 polchedy   Two words?
6.0 aithy      Uses the {ith} non-element.
5.5 cholky     Two words?
4.5 ail        Should be valid?
4.5 cholkaiin  Two words?
4.0 cheeteey   Has 3 benches.
4.0 chodchy    Two words?
4.0 cholkar    Two words?
4.0 cholkeedy  Two words?
4.0 dairal     Two words, {d}{a}{ir}.{a}{l}?
4.0 dairin     Ditto?
4.0 ety        Should be valid?
4.0 qoedy      Should be valid?
4.0 qoeol      Should be valid?
4.0 shoikhy    Uses the {ikh} non-element.
3.5 pchocthy   Two gallows; two words?
3.0 aiinal     Two words, {a}{in}.{a}{l}?
3.0 aikhy      Uses the {ikh} non-element.
3.0 airody     Two words, {a}{ir}.{o}{d}{y}?
3.0 airol      Two words, {a}{ir}.{o}{l}?
3.0 cheolkain  Two words?
3.0 cholchedy  Two words?
3.0 chsey      The {se} is {sh} perhaps?
3.0 chsky      Two words?
3.0 dairody    Two words, {d}{a}{ir}.{o}{d}{y}?
3.0 daldaiin   Two words, {d}{a}{l}.{d}{a}{iin}?
3.0 oikhy      Uses the {ikh} non-element.
3.0 polshy     Two words?
3.0 qekchdy    Should be valid?
3.0 qoedaiin   Should be valid?
3.0 sheekchy   Has 3 benches.
3.0 tockhy     Has 2 gallows; two words?

All the best, --stolfi

RE: A family of grammars for Voynichese - ReneZ - 29-12-2025

(29-12-2025, 01:05 PM)Jorge_Stolfi Wrote: So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements.

Your counts correspond reasonably well with those in this document: [link not shown] for the three most complete transliterations.

At this point, I don't see how we can conclude that things like ith / iTh are meaningfully different from cth / cTh or not. If we found that they strongly depend on the scribal hand, then that would be an indication, but I do not think that is the case.

Failing that, we can just record the differences, and then decide when doing stats. The problem with that, of course, is that one quickly runs into many different scenarios based on different combinations of 'choices' to be made, to the point where it becomes totally impractical.

RE: A family of grammars for Voynichese - dashstofsk - 30-12-2025

(29-12-2025, 03:33 PM)Grove Wrote: Don’t e’s only precede d’s, o’s, or a’s, and a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from e to i series, although there are cases of o’s followed by i series without the a transition.

You are right in highlighting the fact that many word strings can be classed as e-series and i-series. If you look at the most frequent words and the way they are written then you will see that this is so. The characters in words such as chedy, Shedy are all e-based.
Character d is written first as e and a loop is then added upwards and to the right. Character y is written first as e and the downward swing is then added. The initial e curves of these characters look the same.

Likewise for characters n, l, r, m. They are usually written first with the same downward stroke i. These characters often like to come after other i-stroke characters.

Character a does indeed seem to be written with the e stroke followed by the i stroke, and does seem to be the transition character between e-strings and i-strings.

It seems that the writer likes the easy-to-repeat strokes i and e, and seems to prefer to write in a style that is not too taxing. He is doodling with letters to create artificial text. I mentioned something about this theory of mine in a previous post [link not shown].