A family of grammars for Voynichese


RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

Finally, here are the counts for crust-mantle-core patterns with a core (gallows). Here I make a distinction between simple gallows and those with platforms.

Again, the CMC pattern of a word is obtained by deleting all "O" elements {a} {o} {y},
and mapping the other elements to the classes

"Q" just {q}.
"D" the /dealers/ {d} {l} {r} {s}.
"X" the /benches/ {ch} {sh} {ee} with an optional 'e' suffix.
"G" the /simple gallows/ {k} {t} {p} {f} with optional 'e' suffix.
"H" the /platform gallows/ {cth} {ckh} {cph} {cfh} with optional 'e' or 'h' suffix.
"N" the /codas/ {n} {in} {iin} {iiin} {m} {im} {iim} {iiim} {ir} {iir} {iiir}.

In this modified CMC model, a valid word with core ("G" or "H") must have the form

Q^q D^d X^x (G|H) X^y D^e N^n

where q,n are 0 or 1, and d+e and x+y are in 0..3.
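
As a rough illustration of the mapping and the formula above, here is a minimal Python sketch. It assumes the word has already been split into elements (the splitting itself is not shown), and it assumes the {ir} codas may have up to three 'i's like the {n} and {m} codas. The names `cmc_pattern` and `is_valid_cored` are made up for this example.

```python
import re

# "O" elements are deleted before forming the CMC pattern.
O_ELEMENTS = {"a", "o", "y"}

# One regex per class, with the optional suffixes listed in the post.
# The coda regex assumes {ir} codas carry one to three 'i's (an assumption).
CLASS_RES = [
    ("Q", r"q"),
    ("D", r"[dlrs]"),
    ("X", r"(ch|sh|ee)e?"),
    ("G", r"[ktpf]e?"),
    ("H", r"c[tkpf]h[eh]?"),
    ("N", r"i{0,3}[nm]|i{1,3}r"),
]

def cmc_pattern(elements):
    """Map a list of already-parsed elements to a string of class letters."""
    out = []
    for el in elements:
        if el in O_ELEMENTS:
            continue
        for cls, rx in CLASS_RES:
            if re.fullmatch(rx, el):
                out.append(cls)
                break
        else:
            raise ValueError(f"unknown element: {el!r}")
    return "".join(out)

# Q^q D^d X^x (G|H) X^y D^e N^n, with the sums checked separately.
CORED = re.compile(r"(Q?)(D*)(X*)[GH](X*)(D*)(N?)")

def is_valid_cored(pattern):
    """True if a pattern with a core fits the formula with d+e and x+y in 0..3."""
    m = CORED.fullmatch(pattern)
    if m is None:
        return False
    _, d, x, y, e, _ = m.groups()
    return len(d) + len(e) <= 3 and len(x) + len(y) <= 3

print(cmc_pattern(["q", "o", "k", "ee", "d", "y"]))  # QGXD
print(is_valid_cored("QGXD"))                        # True
print(is_valid_cored("XDG"))                         # False
```

The sums are checked outside the regex because a plain regular expression cannot easily bound d+e when the two runs of "D" sit on opposite sides of the core.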

Here are the counts of the patterns with G core (left) and with H core (right):

COUNTING TOKENS WITH G AND H CORE BY CMC PATTERN

Columns: count, fraction, and pattern for G core, then the same three for H core.

14060.375000 1.00000 TOTAL 1487.250000 1.00000 TOTAL

2008.125000 0.14282 GD 394.750000 0.26542 XH
1538.750000 0.10944 GXD 375.250000 0.25231 HD
1314.500000 0.09349 QGD 315.000000 0.21180 H
1175.375000 0.08359 GN 76.750000 0.05161 XHD
1025.000000 0.07290 GX 58.500000 0.03933 QH
858.500000 0.06106 QGXD 55.000000 0.03698 HN
813.875000 0.05788 QGN 41.000000 0.02757 QHD
687.000000 0.04886 QGX 33.000000 0.02219 HX
483.000000 0.03435 G 31.750000 0.02135 HDD
417.625000 0.02970 XG 21.000000 0.01412 DXH
410.750000 0.02921 QG 11.250000 0.00756 HDN
274.625000 0.01953 XGD 10.000000 0.00672 DH
255.500000 0.01817 GDD 8.000000 0.00538 HXD
244.750000 0.01741 DGD 6.500000 0.00437 XXH
240.875000 0.01713 DGXD 6.000000 0.00403 DHD
234.625000 0.01669 DGN 5.500000 0.00370 HDDD
201.000000 0.01430 XGX 5.000000 0.00336 XHN
191.500000 0.01362 XGN 4.500000 0.00303 XHX
189.000000 0.01344 GDN 4.000000 0.00269 QHX
174.125000 0.01238 DGX 3.250000 0.00219 HDDN
170.125000 0.01210 GXDD 3.000000 0.00202 QXH
111.750000 0.00795 GXDN 2.000000 0.00134 DDH
102.687500 0.00730 DG 2.000000 0.00134 QDH
101.625000 0.00723 XGXD 2.000000 0.00134 QHDD
75.000000 0.00533 XXG 2.000000 0.00134 QHN
69.000000 0.00491 GXX 2.000000 0.00134 QHXD
67.250000 0.00478 QGDD 1.500000 0.00101 DDXH
62.000000 0.00441 GXN 1.000000 0.00067 DXHD
39.500000 0.00281 GXXD 1.000000 0.00067 DXXH
34.500000 0.00245 QGDN 1.000000 0.00067 DXXHD
33.000000 0.00235 QGXDD 1.000000 0.00067 HXDD
27.875000 0.00198 GDDN 1.000000 0.00067 HXN
18.500000 0.00132 XGDD 1.000000 0.00067 XXHD
18.375000 0.00131 DDGXD 0.250000 0.00017 XHDD
16.875000 0.00120 DDGD 0.250000 0.00017 XHDDD
16.875000 0.00120 DXG 0.250000 0.00017 XHDN
16.750000 0.00119 QGXDN
16.500000 0.00117 QGXN
16.500000 0.00117 QGXX
16.500000 0.00117 XXGX
16.125000 0.00115 GDDD
15.500000 0.00110 DDGN
14.125000 0.00100 DGDD
13.500000 0.00096 GXDDD
13.000000 0.00092 QDGXD
12.250000 0.00087 DDGX
12.000000 0.00085 XGDN
11.750000 0.00084 DGXX
10.750000 0.00076 XXGD
9.750000 0.00069 XXGN
9.500000 0.00068 QGXXD
9.250000 0.00066 DGDN
9.125000 0.00065 DGXDD
8.562500 0.00061 DDG
8.500000 0.00060 QDGX
8.000000 0.00057 QDGN
7.500000 0.00053 DGXDN
7.500000 0.00053 XGXX
7.250000 0.00052 QDG
7.000000 0.00050 XGXN
6.750000 0.00048 DXGX
6.500000 0.00046 DGXXD
4.750000 0.00034 GXDDN
4.625000 0.00033 DXGD
4.500000 0.00032 DXGXD
4.125000 0.00029 QDGD
4.000000 0.00028 QGDDD
3.750000 0.00027 DXGN
3.000000 0.00021 DGXN
3.000000 0.00021 DXXG
3.000000 0.00021 XGXDD
2.500000 0.00018 DXXGD
2.000000 0.00014 GXXN
2.000000 0.00014 QGDDN
2.000000 0.00014 QGXXN
1.625000 0.00012 DGDDN
1.500000 0.00011 DDXXG
1.500000 0.00011 QDDGXD
1.500000 0.00011 QXG
1.500000 0.00011 XXGXD
1.250000 0.00009 XGXDN
1.000000 0.00007 DXXGX
1.000000 0.00007 GXXDN
1.000000 0.00007 QGXDDD
1.000000 0.00007 QXGN
1.000000 0.00007 QXGXD
1.000000 0.00007 QXXG
1.000000 0.00007 XGDDN
1.000000 0.00007 XGXDDN
1.000000 0.00007 XGXXD
0.500000 0.00004 DDDGN
0.500000 0.00004 DDXGX
0.500000 0.00004 GXXDD
0.500000 0.00004 GXXDDD
0.500000 0.00004 GXXX
0.500000 0.00004 GXXXD
0.500000 0.00004 QGXXDD
0.500000 0.00004 XGDDD
0.500000 0.00004 XXGXN
0.250000 0.00002 DGXXDD
0.250000 0.00002 GDDDN
0.250000 0.00002 XXGDN
0.125000 0.00001 DDGXDN

I don't know yet what to conclude from these numbers.

For either class of core, the formula above allows 2 x 10 x 10 x 2 = 400 possible patterns (2 choices for q, 10 pairs d,e with d+e <= 3, 10 pairs x,y with x+y <= 3, and 2 choices for n), but only 103 "G" patterns and only 36 "H" patterns occur in the parags text, even with fractional counting. Obviously there are some combinations of q,d,x,y,e,n that are so rare that they could be excluded from the CMC model; but I don't see a simple rule yet.
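
The count of 400 can be checked by brute-force enumeration. The following Python sketch generates every pattern string allowed by the formula for one class of core; the function name `cored_patterns` is only for this example.

```python
from itertools import product

def cored_patterns(core="G", max_d=3, max_x=3):
    """All strings Q^q D^d X^x core X^y D^e N^n with d+e <= max_d, x+y <= max_x."""
    patterns = set()
    for q, n in product(range(2), repeat=2):
        for d, e in product(range(max_d + 1), repeat=2):
            if d + e > max_d:
                continue
            for x, y in product(range(max_x + 1), repeat=2):
                if x + y > max_x:
                    continue
                patterns.add("Q" * q + "D" * d + "X" * x + core
                             + "X" * y + "D" * e + "N" * n)
    return patterns

allowed = cored_patterns("G")
print(len(allowed))        # 400
print(103 / len(allowed))  # fraction of allowed "G" patterns actually observed
print(36 / len(allowed))   # same for "H" patterns
```

So only about a quarter of the allowed "G" patterns, and fewer than a tenth of the allowed "H" patterns, actually occur.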

One notable thing is that words with H core are not only ~1/10 as common as those with G core, but the distribution of H patterns decays significantly faster. In particular, there is a large drop from the first three patterns to the fourth one in the above list.

The most common "G" pattern with three benches "X" is XXGX, which occurs 16.5 times (~0.1% of all tokens with "G" core). There are no "H" patterns with three benches.

That may be just a consequence of "H" patterns being less common, but it is also consistent with the theory that an "H" element should be counted as one bench for the rule x+y <= 3.

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

I think I can narrow the crust-mantle-core model a bit more. Recall that the model is

Q^q D^d X^x G^g H^h X^y D^e N^n

where q,g,h,n are 0 or 1, d+e is at most 3, x+y is at most 3, and g+h is at most 1. (The last condition says simply that there may be at most one gallows, simple or platform.) The range of d+e seems to be indeed 0..3:

COUNTING CMC PATTERNS BY NUMBER OF D

31300.250000 1.00000 TOTAL

9705.125000 0.31007 0
16944.000000 0.54134 1
4153.562500 0.13270 2
497.562500 0.01590 3

However, the range of q+d+e+n also seems to be 0..3, since only about 0.1% of the total has q+d+e+n = 4. That is, the rule should be q+d+e+n <= 3, not just d+e <= 3:

COUNTING CMC PATTERNS BY NUMBER OF QDN

31300.250000 1.00000 TOTAL

4937.125000 0.15773 0
15160.625000 0.48436 1
10097.625000 0.32261 2
1071.812500 0.03424 3
33.062500 0.00106 4

And, moreover, the condition on the number of benches should be x+h+y <= 2, not just x+y <= 3. That is, there can be at most two benches, counting a platform gallows as one bench (only about 0.15% of the total has three):

COUNTING CMC PATTERNS BY NUMBER OF XH

31300.250000 1.00000 TOTAL

15443.250000 0.49339 0
13730.125000 0.43866 1
2080.875000 0.06648 2
46.000000 0.00147 3

I am adding these revised rules to the model.
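
For readers who want to reproduce tallies like the three above, here is a minimal Python sketch. It assumes a table of pattern counts is already available as a dictionary; the sample below holds only six rows copied from the G/H table earlier in the thread, so its totals are not those of the full tables. The name `tally` is hypothetical.

```python
from collections import defaultdict

def tally(pattern_counts, letters):
    """Sum the (possibly fractional) counts by how many of `letters` each pattern has."""
    hist = defaultdict(float)
    for pattern, count in pattern_counts.items():
        k = sum(pattern.count(c) for c in letters)
        hist[k] += count
    return dict(sorted(hist.items()))

# Six rows from the G/H table, for illustration only.
sample = {
    "GD": 2008.125, "GXD": 1538.75, "QGD": 1314.5,
    "XH": 394.75, "HD": 375.25, "H": 315.0,
}

print(tally(sample, "D"))    # by number of D
print(tally(sample, "QDN"))  # by number of Q, D, N
print(tally(sample, "XH"))   # by number of benches, counting H as one
```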

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

In conclusion, the revised crust-mantle-core (CMC) model says that a word that fits the OKOKO model is valid if, after deleting the "O"s, it has the form

Q^q D^d X^x G^g H^h X^y D^e N^n

where

q,g,h,n may be 0 or 1;
g+h <= 1 (there can be at most one gallows per word);
q+d+e+n <= 3 (there can be at most three of Q, D, and N);
x+h+y <= 2 (there can be at most two benches, counting a platform gallows as one bench).
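
The revised rules can be written as a short check on the pattern string. This is only a sketch in Python, with a made-up function name `fits_revised_cmc`; it takes the CMC pattern (after deleting the "O"s) as input and does not parse EVA words.

```python
import re

# Q^q D^d X^x G^g H^h X^y D^e N^n; the sums are checked after matching.
CMC = re.compile(r"(Q?)(D*)(X*)(G?)(H?)(X*)(D*)(N?)")

def fits_revised_cmc(pattern):
    """True if the pattern satisfies the revised CMC rules."""
    m = CMC.fullmatch(pattern)
    if m is None:
        return False
    q, d, x, g, h, y, e, n = (len(s) for s in m.groups())
    return (g + h <= 1              # at most one gallows
            and q + d + e + n <= 3  # at most three of Q, D, N
            and x + h + y <= 2)     # at most two benches, H counts as one

print(fits_revised_cmc("QGXD"))   # True
print(fits_revised_cmc("XXGX"))   # False: three benches
print(fits_revised_cmc("XXH"))    # False: H counts as a bench
print(fits_revised_cmc("QDGDN"))  # False: four of Q, D, N
```

When there is no gallows, the regex may split the "D"s and "X"s between the two sides in more than one way, but that does not matter because only the sums d+e and x+y are tested.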

With these tighter rules, the statistics for the CMC level are

         all          gud         bad  % gud  sec-type
------------ ------------ ----------- ------ ----------
 5966.750000  5824.687500  142.062500  97.62  bio-parags
  935.500000   904.750000   30.750000  96.71  cos-parags
 7277.500000  7035.625000  241.875000  96.68  hea-parags
 3157.375000  3034.875000  122.500000  96.12  heb-parags
 2100.750000  2034.625000   66.125000  96.85  pha-parags
10095.250000  9665.500000  429.750000  95.74  str-parags
 2832.375000  2721.125000  111.250000  96.07  unk-parags

32365.500000 31221.187500 1144.312500  96.46  tot-parags

That is, more than 96% of the words that fit the OKOKO model fit the CMC pattern above.

We also have that of the 33067.25 words that contain only valid EVA characters (no weirdos, "?", or the rare characters b g j u v w x z), 94.4% satisfy the rest of the model (parsing into elements, OKOKO structure, and CMC structure).
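
The two percentages follow directly from the totals quoted above. A quick check in Python, with variable names chosen only for this example:

```python
cmc_ok = 31221.1875      # "gud" for tot-parags: tokens that fit OKOKO and CMC
okoko_ok = 32365.5       # "all" for tot-parags: tokens that fit OKOKO
valid_chars = 33067.25   # tokens with only valid EVA characters

pct_of_okoko = 100 * cmc_ok / okoko_ok
pct_of_valid = 100 * cmc_ok / valid_chars

print(f"{pct_of_okoko:.2f}")        # 96.46
print(f"{pct_of_valid:.1f}")        # 94.4
print(f"{100 - pct_of_valid:.1f}")  # 5.6, the share rejected by the whole model
```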

Maybe the revised model is too tight, and some fraction of the 5.6% rejected words are in fact valid. But the number seems compatible with the theory that those 5.6% are indeed errors by the Author, by the Scribe, by the Retracers, and by the transcribers, especially words run together.

There probably are further rules relating the insertion of the "O"s in the CMC pattern. For instance, maybe we can require that a "Q" is always followed by at least one "O", and an "N" is almost always preceded by at least one "O". Said another way, maybe we can combine the OKOKO and CMC models in a single formula with rules that tie the number of "O"s in each slot to the presence or number of the other "Q", "D" etc elements.

All the best, --stolfi

RE: A family of grammars for Voynichese - oshfdk - 29-12-2025

5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens? Sorry, I haven't been following the discussion closely.

Sorry, all the numbers above are token counts, not word type counts.

I presume that the rejection rate for word types would be much higher, since many rejected words occur only once or a few times. Unless we define the lexicon as being the word types that occur at least N times, where N is, say, 10.

I will try to do this statistic. Please stay tuned...

All the best, --stolfi

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements.

However, there are 70 words (tokens) with those combinations in the whole parags text, out of more than 30'000 tokens. Is that significant? Contrast that with 1622 occurrences of {cth} {ckh} {cph} {cfh}.

Maybe I should accept those 'i' variants too (and treat them too as platform gallows, CMC class "H").

On the other hand, there are only 4 occurrences of {ih}; contrast with 9362 of {ch}.

This may be a clue that {ith} {ikh} {iph} {ifh} are instances of {cth} {ckh} {cph} {cfh} that were mangled by the Scribe/Retracer/Transcriber. So maybe we should just "error-correct" them as such.
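
If one did decide to "error-correct" those combinations, the substitution itself is simple. Here is a sketch in Python that works on raw EVA strings (an assumption; it could equally be done at the element level). The function name `correct_i_platform` is invented for this example, and whether the correction is justified is exactly the open question.

```python
import re

# An 'i' directly before a gallows letter and 'h' is taken to be a mangled 'c'.
I_PLATFORM = re.compile(r"i([ktpf])h")

def correct_i_platform(word):
    """Rewrite {ith} {ikh} {iph} {ifh} as {cth} {ckh} {cph} {cfh}."""
    return I_PLATFORM.sub(r"c\1h", word)

print(correct_i_platform("aithy"))    # acthy
print(correct_i_platform("shoikhy"))  # shockhy
print(correct_i_platform("daiin"))    # daiin (unchanged)
```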

All the best, --stolfi

RE: A family of grammars for Voynichese - Grove - 29-12-2025

Don’t e’s only precede d’s, o’s, or a’s, and a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from e to i series, although there are cases of o’s followed by i series without the a transition.

I feel like the e ee eee series preceding a d acts in a similar way to the i ii iii series that precedes the r, l, n variants.

RE: A family of grammars for Voynichese - Jorge_Stolfi - 29-12-2025

(29-12-2025, 10:05 AM)oshfdk Wrote: 5.6% of word types or word tokens?

Here are the numbers for word types (as opposed to tokens). For these statistics I defined the lexicon of a text as being the set of words that occur at least 3 times. The column "vlex" is the number of word types in each subset that occur at least 3 times and use only "valid" EVA chars (excluding '?', weirdos, and [bgjuvwxz]). The column "vgud" is the number of those words that fit the crust-mantle-core (CMC) model as defined before, "vbad" is the number that violate that model, and "%gud" is the percentage of "vgud" over "vlex".

LEXICON SIZES WITH VALID CMC STRUCTURE

sec-type     vlex  vgud  vbad    %gud
-----------  ----  ----  ----  ------
bio-parags    295   295     0  100.00
cos-parags     56    56     0  100.00
hea-parags    407   404     3   99.26
heb-parags    210   209     1   99.52
pha-parags    150   149     1   99.33
str-parags    546   541     5   99.08
unk-parags    205   205     0  100.00

tot-parags   1361  1328    33   97.58

That is, 97.58% of the word types that occur at least 3 times and use only valid EVA characters fit the CMC model.

The "vbad" number for "tot-parags" (33) is bigger than the sum of the section numbers (10) because a word that occurs (say) once in one section and 2 times in another will not be in the lexicons of those sections, but will be in the lexicon of the whole parags text.
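
A toy example in Python may make this effect clearer. The section contents below are invented purely for illustration, and the sketch uses whole-number counts, whereas the statistics above use fractional counts. The name `lexicon` is hypothetical.

```python
from collections import Counter

def lexicon(tokens, min_count=3):
    """Word types that occur at least min_count times in the token list."""
    counts = Counter(tokens)
    return {word for word, c in counts.items() if c >= min_count}

# Invented toy sections: "ail" occurs once in A and twice in B.
section_a = ["daiin"] * 3 + ["ail"]
section_b = ["chedy"] * 3 + ["ail"] * 2

print(lexicon(section_a))              # {'daiin'}
print(lexicon(section_b))              # {'chedy'}
print(lexicon(section_a + section_b))  # also contains 'ail' (3 occurrences in total)
```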

These are the 33 word types (with at least 3 occurrences) that do not fit the CMC model, and their occurrence counts in the total parags text:

7.0 daiidy Should be valid?
7.0 polchedy Two words?
6.0 aithy Uses the {ith} non-element.
5.5 cholky Two words?
4.5 ail Should be valid?
4.5 cholkaiin Two words?
4.0 cheeteey Has 3 benches.
4.0 chodchy Two words?
4.0 cholkar Two words?
4.0 cholkeedy Two words?
4.0 dairal Two words, {d}{a}{ir}.{a}{l}?
4.0 dairin Ditto?
4.0 ety Should be valid?
4.0 qoedy Should be valid?
4.0 qoeol Should be valid?
4.0 shoikhy Uses the {ikh} non-element.
3.5 pchocthy Two gallows; two words?
3.0 aiinal Two words, {a}{in}.{a}{l}?
3.0 aikhy Uses the {ikh} non-element.
3.0 airody Two words, {a}{ir}.{o}{d}{y}?
3.0 airol Two words, {a}{ir}.{o}{l}?
3.0 cheolkain Two words?
3.0 cholchedy Two words?
3.0 chsey The {se} is {sh} perhaps?
3.0 chsky Two words?
3.0 dairody Two words, {d}{a}{ir}.{o}{d}{y}?
3.0 daldaiin Two words, {d}{a}{l}.{d}{a}{iin}?
3.0 oikhy Uses the {ikh} non-element.
3.0 polshy Two words?
3.0 qekchdy Should be valid?
3.0 qoedaiin Should be valid?
3.0 sheekchy Has 3 benches.
3.0 tockhy Has 2 gallows; two words?

All the best, --stolfi

RE: A family of grammars for Voynichese - ReneZ - 29-12-2025

(29-12-2025, 01:05 PM)Jorge_Stolfi Wrote: So far I have excluded {ith} {ikh} {iph} {ifh} from the set of valid elements.

However, there are 70 words (tokens) with those combinations in the whole parags text, out of more than 30'000 tokens. Is that significant? Contrast that with 1622 occurrences of {cth} {ckh} {cph} {cfh}.

Maybe I should accept those 'i' variants too (and treat them too as platform gallows, CMC class "H").

On the other hand, there are only 4 occurrences of {ih}; contrast with 9362 of {ch}.

This may be a clue that {ith} {ikh} {iph} {ifh} are instances of {cth} {ckh} {cph} {cfh} that were mangled by the Scribe/Retracer/Transcriber. So maybe we should just "error-correct" them as such.

Your counts correspond reasonably well with those in this document [link unavailable] for the three most complete transliterations.

At this point, I don't see how we can conclude whether things like ith / iTh are meaningfully different from cth / cTh or not. If we found that they strongly depend on the scribal hand, then that would be an indication, but I do not think that that is the case.

Failing that, we can just record the differences, and then decide when doing stats. The problem with that, of course, is that one quickly runs into many different scenarios based on different combinations of 'choices' to be made, to the point where it becomes totally impractical.

RE: A family of grammars for Voynichese - dashstofsk - 30-12-2025

(29-12-2025, 03:33 PM)Grove Wrote: Don’t e’s only precede d’s, o’s, or a’s, and a’s only precede r’s, l’s, and n’s (or i benches)? I see the a as a transition from e to i series, although there are cases of o’s followed by i series without the a transition.

I feel like the e ee eee series preceding a d acts in a similar way to the i ii iii series that precedes the r, l, n variants.

You are right in highlighting the fact that many word strings can be classed as e-series and i-series. If you look at the most frequent words and the way they are written, you will see that this is so. The characters in words such as chedy, Shedy are all e-based. Character d is written first as e, and a loop is then added upwards and to the right. Character y is written first as e, and the downward swing is then added. The initial e curve of these characters looks the same.

[Attached images: f104v.png, f155r_Line11.png, f155r_Line13.png]

Likewise for characters n, l, r, m. They are usually written first with the same downward stroke i. These characters often come after other i-stroke characters.

[Attached images: i_Series_06.png, i_Series_05.png, i_Series_04.png, i_Series_03.png, i_Series_02.png, i_Series_01.png]

Character a does indeed seem to be written with the e stroke followed by the i stroke, and does seem to be the transition character between e-strings and i-strings.

It seems that the writer likes the easy-to-repeat strokes i and e, and prefers to write in a style that is not too taxing. He is doodling with letters to create artificial text. I mentioned something about this theory of mine in a previous post.