A family of grammars for Voynichese - Printable Version
+- The Voynich Ninja (https://www.voynich.ninja)
+-- Forum: Voynich Research (https://www.voynich.ninja/forum-27.html)
+--- Forum: Analysis of the text (https://www.voynich.ninja/forum-41.html)
+--- Thread: A family of grammars for Voynichese (/thread-4418.html)

RE: A family of grammars for Voynichese - ReneZ - 26-12-2025

(26-12-2025, 11:06 AM)Jorge_Stolfi Wrote: I intend to release this transcription to be included in Rene's files, but it currently uses a somewhat incompatible format that I found more convenient to use while building it. Anyway most differences are like a/o, r/s, on ambiguous glyphs. And I also make more liberal use of ',' for uncertain word spaces.

I would be happy to support that exercise.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 26-12-2025

(26-12-2025, 11:06 AM)Jorge_Stolfi Wrote: REVISITING MY OLD WORD PARADIGM

And here are the frequencies of elements in tokens that pass level 1 (parsing into the elements above):

    12443.625000  0.09984  {a}
    21543.125000  0.17285  {o}
    15120.250000  0.12132  {y}
     5192.000000  0.04166  {q}
    11388.500000  0.09137  {d}
     9220.250000  0.07398  {l}
     5792.375000  0.04647  {r}
     2081.750000  0.01670  {s}
     5664.500000  0.04545  {ch}
     3906.250000  0.03134  {che}
     3842.250000  0.03083  {ee}
      324.000000  0.00260  {eee}
     2085.500000  0.01673  {sh}
     1869.250000  0.01500  {she}
     7295.000000  0.05853  {k}
     1519.000000  0.01219  {ke}
     4173.750000  0.03349  {t}
      786.000000  0.00631  {te}
      316.000000  0.00254  {f}
             0.0      0.0  {fe}
     1197.375000  0.00961  {p}
             0.0      0.0  {pe}
      452.500000  0.00363  {ckh}
      169.500000  0.00136  {ckhe}
      591.000000  0.00474  {cth}
      148.000000  0.00119  {cthe}
       36.000000  0.00029  {cfh}
       14.000000  0.00011  {cfhe}
      104.500000  0.00084  {cph}
       49.000000  0.00039  {cphe}
       16.000000  0.00013  {ckhh}
             0.0      0.0  {ckhhe}
       25.000000  0.00020  {cthh}
             1.0  0.00001  {cthhe}
             1.0  0.00001  {cfhh}
             0.0      0.0  {cfhhe}
             4.0  0.00003  {cphh}
             2.0  0.00002  {cphhe}
      113.500000  0.00091  {n}
      868.250000  0.00697  {m}
     1665.500000  0.01336  {in}
       40.000000  0.00032  {im}
     3779.000000  0.03032  {iin}
       15.000000  0.00012  {iim}
      159.000000  0.00128  {iiin}
             1.0  0.00001  {iiim}
      487.750000  0.00391  {ir}
      130.500000  0.00105  {iir}
             1.0  0.00001  {iiir}

Obviously some of the originally proposed elements hardly occur at all, and can be excluded from the "alphabet" with little loss.

In particular, note that element {ckhhe} does not occur, while {ckh}, {ckhe} and {ckhh} do. I take that as evidence that {ckhe} and {ckhh} are just calligraphic variants of each other, and should perhaps be mapped and counted as such. Ditto for the other three gallows, including p and f.

Also note that while {cfh}, {cfhe}, {cph} and {cphe} occur in reasonable numbers and proportions, {pe} and {fe} are completely absent. This well-known anomaly may indicate that p and f are substitutes for te and ke. Maybe the hook at the end of the arm is meant to be the e. (But beware that the frequencies of these characters and digraphs cannot be relied upon to reveal what p and f mean. Word frequencies on paragraph head lines, where most p and f are found, are very likely different from the word frequencies in the other lines; and character frequencies depend entirely on word frequencies.)

And apparently we can exclude {iiir} and {iiim} from the element set.

All the best, --stolfi

RE: A family of grammars for Voynichese - Grove - 26-12-2025

With p, f, cph, and cfh- how valid are they if they aren’t in the headline of a paragraph? What percentage of them fall outside of those identifiable headlines? And thanks for rewinding back to the earlier 2000’s 8-)

RE: A family of grammars for Voynichese - Jorge_Stolfi - 26-12-2025

(26-12-2025, 02:18 PM)Grove Wrote: With p, f, cph, and cfh- how valid are they if they aren’t in the headline of a paragraph? What percentage of them fall outside of those identifiable headlines?

According to my scripts (if they are buggy, blame them, not me!), in running paragraph lines:
All the best, --stolfi

RE: A family of grammars for Voynichese - ReneZ - 26-12-2025

(26-12-2025, 06:29 PM)Jorge_Stolfi Wrote: However these numbers do not include the "fancy" puffs, like the one shaped like a woman's body, that have been transcribed as weirdos.

If you use STA, they are all together in a single family. The pedestalled ones are in three additional (small) families. STA is only available for all 'published' transliterations. If you want to add yours, we will first have to do the exercise of finding any 'new' characters, which is not trivial, but also not overly complicated.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

Recall that my "OKOKO" partial paradigm says that a Voynichese word is a sequence of "K" elements with "O" elements inserted before, between, and after them, with at most two "O" per slot. Where the "O" elements are {a} {o} {y}, and the "K" elements are all the others.

Here are the counts of tokens that satisfy this paradigm, condensed to the number of {O} elements:

COUNTING TOKENS BY NUMBER OF 'O'S

    15656.687500  0.48866  O
    13798.312500  0.43066  OO
     1437.625000  0.04487  OOO
     1052.250000  0.03284  -
       89.500000  0.00279  OOOO
        5.437500  0.00017  OOOOO
        0.187500  0.00001  OOOOOO

These counts consider the running paragraph text only, of all sections, from the same transcription file as before (80% U, 20% Z). The counts are fractional in order to account for uncertain spaces ',' as I explained before.

Thus it seems that, in addition to the limit of two "O" elements per slot, there is also a limit of three "O" elements in total. I will add this rule to the model. The exceptions are only 0.3% of all tokens, and may well be cases of two words run together in the transcription.
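The two "O" limits described above can be sketched in a few lines of Python. This is only an illustration, not the scripts used for the post: the function names are hypothetical, and tokens are assumed to arrive already parsed into elements (the hand-parsed examples in the usage note are assumptions too). The last lines redo the arithmetic behind the 0.3% figure from the table.

```python
# O elements as named in the post; every other element counts as K.
O_ELEMENTS = frozenset({"a", "o", "y"})


def o_runs(elements):
    """Lengths of the maximal runs of consecutive O elements (one per slot)."""
    runs, current = [], 0
    for el in elements:
        if el in O_ELEMENTS:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def satisfies_o_limits(elements, max_per_slot=2, max_total=3):
    """True if no slot holds more than max_per_slot O's and the token
    holds at most max_total O's overall."""
    runs = o_runs(elements)
    return all(r <= max_per_slot for r in runs) and sum(runs) <= max_total


# Token counts by number of O elements, copied from the table above.
O_COUNTS = {
    0: 1052.25, 1: 15656.6875, 2: 13798.3125, 3: 1437.625,
    4: 89.5, 5: 5.4375, 6: 0.1875,
}
total_tokens = sum(O_COUNTS.values())
excess_tokens = sum(c for n, c in O_COUNTS.items() if n > 3)
excess_share = excess_tokens / total_tokens  # share of tokens with 4+ O's
```

For instance, `satisfies_o_limits(["q", "o", "k", "ee", "d", "y"])` is true (two slots with one "O" each), while a token with four "O" elements fails the total limit even if every slot holds at most two. With the table's numbers, `excess_share` comes out just under 0.003.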
If we delete the "O" elements and count only "K" elements, we get this:

COUNTING TOKENS BY NUMBER OF 'K'S

    14144.875000  0.44148  KK
     9226.875000  0.28798  KKK
     5183.687500  0.16179  K
     2717.125000  0.08480  KKKK
      401.875000  0.01254  KKKKK
      281.750000  0.00879  -
       68.062500  0.00212  KKKKKK
       13.187500  0.00041  KKKKKKK
        2.437500  0.00008  KKKKKKKK
        0.125000  0.00000  KKKKKKKKK

Here the cutoff is not as clear as for the "O"s, but tokens with more than six "K"s can be considered noise. Maybe those with six "K"s are noise too. I have not yet decided whether I should put that limit on total "K"s as another constraint of the "OKOKO" model. Anyway the number of "K"s will be limited by the next stage of the model, the "layer" (formerly "crust-mantle-core") model.

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

(28-12-2025, 02:01 PM)Jorge_Stolfi Wrote: Recall that my "OKOKO" partial paradigm says that a Voynichese word is a sequence of "K" elements with "O" elements inserted before, between, and after them, with at most two "O" per slot. Where the "O" elements are {a} {o} {y}, and the "K" elements are all the others.

If we map each element to "O" or "K", we get the "OKOKO pattern" of a word.
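As a minimal sketch of that mapping (again assuming tokens are already parsed into elements; the function names, the example parses, and the weights are hypothetical, not taken from the transcription file):

```python
from collections import Counter

# O elements as named in the post; every other element is treated as K.
O_ELEMENTS = frozenset({"a", "o", "y"})


def okoko_pattern(elements):
    """Map each element of a parsed token to 'O' or 'K'."""
    return "".join("O" if el in O_ELEMENTS else "K" for el in elements)


def pattern_counts(weighted_tokens):
    """Sum token weights per OKOKO pattern.

    weighted_tokens yields (elements, weight) pairs; a fractional weight
    stands in for a token whose word spaces are uncertain.
    """
    counts = Counter()
    for elements, weight in weighted_tokens:
        counts[okoko_pattern(elements)] += weight
    return counts


# Hypothetical hand-parsed tokens with made-up weights, for illustration only.
sample = [
    (["d", "a", "iin"], 1.0),   # KOK
    (["che", "d", "y"], 1.0),   # KKO
    (["o", "l"], 0.5),          # OK
    (["s", "a", "r"], 0.25),    # KOK
]
sample_counts = pattern_counts(sample)
```

Here `sample_counts["KOK"]` is 1.25, because the two "KOK" tokens contribute their weights rather than a count of one each.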
Here are the 50 most common OKOKO patterns in paragraph text:

COUNTING OKOKO PATTERNS

     5560.000000  0.17179  KOK
     2687.250000  0.08303  OKOK
     2613.062500  0.08074  KOKOK
     2560.125000  0.07910  KKO
     2192.250000  0.06773  OK
     1751.125000  0.05410  KO
     1479.875000  0.04572  KOKKO
     1400.312500  0.04327  KOKO
     1296.250000  0.04005  OKKO
     1276.125000  0.03943  KKOK
     1108.750000  0.03426  KKKO
      839.000000  0.02592  KOKKKO
      833.000000  0.02574  OKKKO
      646.562500  0.01998  OKO
      627.625000  0.01939  OKKOK
      593.750000  0.01835  K
      425.875000  0.01316  KOKKOK
      287.437500  0.00888  OKOKO
      281.750000  0.00871  O
      254.875000  0.00787  KK
      232.437500  0.00718  KOKOKO
      226.000000  0.00698  KKKOK
      221.375000  0.00684  OKOKOK
      201.250000  0.00622  KKOKO
      192.562500  0.00595  KKKKO
      160.625000  0.00496  KKK
      147.312500  0.00455  KKOKOK
      138.750000  0.00429  KOOK
      128.625000  0.00397  KOKK
      112.000000  0.00346  OKKOKO
      110.437500  0.00341  OKKKKO
       98.625000  0.00305  OKK
       97.375000  0.00301  KOKKK
       94.375000  0.00292  OKKKOK
       91.500000  0.00283  OKOKKO
       86.000000  0.00266  OKKK
       79.250000  0.00245  KOKOKOK
       75.750000  0.00234  KOKKKKO
       72.750000  0.00225  OOK
       66.875000  0.00207  KKOKKO
       59.625000  0.00184  KOKKOKO
       55.000000  0.00170  KOKKKOK
       52.250000  0.00161  KOKOKKO
       45.500000  0.00141  OKKOKOK
       39.250000  0.00121  KOO
       38.500000  0.00119  KKKK
       34.875000  0.00108  OKOKKOK
       33.000000  0.00102  OKOKK
       31.000000  0.00096  KKKOKO
       30.250000  0.00093  OKOKKKO

These counts are already using the new constraint of at most three "O"s, which, as shown in the previous post, excludes only 94 tokens out of ~31'000. Note that the distribution is long-tailed, with no obvious cut-off.

On this list, the first OKOKO pattern with two consecutive "O"s is #28, "KOOK", with 138.75 occurrences. The first one with three total "O"s is (surprise!) #18, "OKOKO" itself, with 287.4375 occurrences.

All the best, --stolfi

RE: A family of grammars for Voynichese - Grove - 28-12-2025

I’m curious what happens if you don’t include gallows as part of K (except the bench ones because they still have the underlying ch).
I’m wondering if the gallows were some sort of instruction that could take place not only in a page or word initial position, but mid-word as well.

John

RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

(28-12-2025, 02:43 PM)Grove Wrote: I’m curious what happens if you don’t include gallows as part of K (except the bench ones because they still have the underlying ch).

Good point. Lumping the platform gallows (CTh etc) with simple gallows (t etc) is a potential defect of the "crust-mantle-core" (CMC) model as I formulated it, because the former should contribute to the count of benches "X". Maybe as half a bench in the prefix, half in the suffix.

While we think about that issue, here are some counts that don't depend on it. If we delete the "O"s and map each element to its class "Q", "D", "X" etc, we get the "CMC pattern" of the word. If the word has no gallows ("H") and no benches ("X"), a valid CMC pattern must be

    Q^q D^(d+e) N^n

where q and n are 0 or 1, and d+e is between 0 and 3. Recall that
Here are the counts:

COUNTING CRUST-ONLY TOKENS BY CMC PATTERN

     2433.187500  0.30517  D
     2034.375000  0.25515  DN
     1552.062500  0.19466  DD
      849.250000  0.10651  N
      286.500000  0.03593  -
      225.750000  0.02831  QD
      217.375000  0.02726  DDD
      148.562500  0.01863  DDN
       83.500000  0.01047  QDN
       47.000000  0.00589  Q
       44.250000  0.00555  QN
       34.000000  0.00426  QDD
       15.437500  0.00194  DDDN
        2.000000  0.00025  QDDD

There are 2 x 4 x 2 = 16 CMC patterns that fit the formula above. Here they are sorted "alphabetically" instead of by frequency:

      217.375000  0.02726  DDD
     1552.062500  0.19466  DD
     2433.187500  0.30517  D
      286.500000  0.03593  -
       15.437500  0.00194  DDDN
      148.562500  0.01863  DDN
     2034.375000  0.25515  DN
      849.250000  0.10651  N
        2.000000  0.00025  QDDD
       34.000000  0.00426  QDD
      225.750000  0.02831  QD
       47.000000  0.00589  Q
             0.0      0.0  QDDDN
             0.0      0.0  QDDN
       83.500000  0.01047  QDN
       44.250000  0.00555  QN

The patterns "QDDDN" and "QDDN" are absent, and "QDDD" is down at noise level. Maybe the limit should be q+d+e+n <= 3, rather than just d+e <= 3. But let me first see what happens when there are "X" and "H"...

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 28-12-2025

And here are the counts for crust-mantle-core (CMC) patterns with mantle ("X") but no core ("H"). Recall that a valid word in this class must have the CMC pattern

    Q^q D^d X^(x+y) D^e N^n

where (currently) q and n are 0 or 1, d+e is in 0..3, and x+y is in 1..3.
The counts are:

COUNTING TOKENS WITH MANTLE BUT NO CORE BY CMC PATTERN

     3343.125000  0.42974  XD
     1185.250000  0.15236  X
      654.375000  0.08412  DXD
      415.000000  0.05335  XX
      357.375000  0.04594  XDD
      315.750000  0.04059  DX
      309.750000  0.03982  XXD
      307.500000  0.03953  XDN
      245.750000  0.03159  XN
       85.375000  0.01097  DXX
       77.625000  0.00998  DDXD
       56.500000  0.00726  QXD
       55.250000  0.00710  DXXD
       44.250000  0.00569  QX
       41.500000  0.00533  DXDD
       38.625000  0.00497  DDX
       31.500000  0.00405  DXN
       27.000000  0.00347  QDXD
       26.375000  0.00339  XDDD
       25.500000  0.00328  QDX
       24.500000  0.00315  DXDN
       19.750000  0.00254  XDDN
       19.000000  0.00244  XXDN
       13.875000  0.00178  XXDD
        6.250000  0.00080  XXN
        5.750000  0.00074  QXDN
        5.250000  0.00067  DDXX
        4.500000  0.00058  QDXX
        4.000000  0.00051  DDXXD
        4.000000  0.00051  DXXDD
        3.500000  0.00045  QDXXD
        3.250000  0.00042  QXX
        3.000000  0.00039  QXDD
        3.000000  0.00039  QXXD
        2.000000  0.00026  DXDDN
        2.000000  0.00026  XXDDN
        1.500000  0.00019  DXXDN
        1.500000  0.00019  XXX
        1.000000  0.00013  DDXN
        1.000000  0.00013  QDXDD
        1.000000  0.00013  QDXDN
        1.000000  0.00013  QDXN
        1.000000  0.00013  QXN
        1.000000  0.00013  XXDDD
        0.750000  0.00010  XXXD
        0.500000  0.00006  DDDX
        0.500000  0.00006  DXXN
        0.500000  0.00006  QXXXD
        0.500000  0.00006  XDDDN
        0.250000  0.00003  DXXXD
        0.250000  0.00003  QXXN
        0.125000  0.00002  DDXDN

The exponents d,e may be

    0,0  1,0  0,1  2,0  1,1  0,2  3,0  2,1  1,2  0,3

and x+y takes the three values 1..3, so there are 2 x 10 x 3 x 2 = 120 possible CMC patterns with mantle but no core. Yet only the 52 patterns above occur at all, and several occur only at noise level.

I have to think more about these numbers. Please stay tuned...

All the best, --stolfi
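
The two CMC formulas in this thread can be enumerated directly. The sketch below is an illustration rather than the scripts behind the tables; the function and variable names are hypothetical. It uses the ranges as stated in the posts (q, n in 0..1, d+e in 0..3, x+y in 1..3) and treats a mantle-without-core pattern as depending only on the total bench count x+y. Under those assumptions it yields 16 crust-only patterns and 120 mantle-without-core patterns (2 x 10 x 3 x 2, since x+y takes three values); allowing four values for the bench count would give 160 instead.

```python
from itertools import product


def crust_only_patterns():
    """All strings Q^q D^k N^n with q, n in 0..1 and k = d+e in 0..3."""
    return {
        "Q" * q + "D" * k + "N" * n
        for q, k, n in product(range(2), range(4), range(2))
    }


def mantle_patterns():
    """All strings Q^q D^d X^m D^e N^n with q, n in 0..1, d+e <= 3 and
    m = x+y in 1..3 (mantle present, no core)."""
    patterns = set()
    for q, d, m, e, n in product(range(2), range(4), range(1, 4),
                                 range(4), range(2)):
        if d + e <= 3:
            patterns.add("Q" * q + "D" * d + "X" * m + "D" * e + "N" * n)
    return patterns


# The 52 patterns listed in the table above, in table order.
OBSERVED_MANTLE = """
XD X DXD XX XDD DX XXD XDN XN DXX DDXD QXD DXXD QX DXDD DDX DXN QDXD XDDD
QDX DXDN XDDN XXDN XXDD XXN QXDN DDXX QDXX DDXXD DXXDD QDXXD QXX QXDD QXXD
DXDDN XXDDN DXXDN XXX DDXN QDXDD QDXDN QDXN QXN XXDDD XXXD DDDX DXXN QXXXD
XDDDN DXXXD QXXN DDXDN
""".split()

# Patterns allowed by the formula that never show up in the table.
unobserved = sorted(mantle_patterns() - set(OBSERVED_MANTLE))
```

Every pattern in the table fits the formula, so `unobserved` holds the remaining 68 allowed patterns, which may help when deciding where to tighten the constraints.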